Name       : perl-Regexp-Ignore
Version    : 0.03
Vendor     : obs://build_opensuse_org/devel:languages:perl
Release    : lp155.6.1
Date       : 2023-07-20 16:58:23
Group      : Development/Libraries/Perl
Source RPM : perl-Regexp-Ignore-0.03-lp155.6.1.src.rpm
Size       : 0.06 MB
Packager   : https://www_suse_com/
Summary    : Let us ignore unwanted parts, while parsing text.
Description :
Markup languages, like HTML, are difficult to parse. The reason is that you can have a line like:
<font size=+1>H</font>ello <font size=+1>W</font>orld
How can we find the string "Hello World" in the above line and replace it with "Hello Universe" (which is a lot deeper)? Or how can we run a speller on the text and replace the mistakes with suggestions for the correct spelling?
This module comes to help you do exactly that.
The module lets you first split the text into the parts you are interested in and the unwanted parts. For example, all the HTML tags can be treated as unwanted parts.
Then it lets you parse the parts you are interested in (while totally ignoring the unwanted parts).
In the end it lets you merge the unwanted parts back with the possibly changed parts you were interested in.
There is just one catch. The module assumes that when you replace the above "Hello World" with "Hello Universe", all the unwanted parts between the start of the match and the end of the match are pushed after the text that replaces the match. Not so easy to follow, right? Look at the example:
The text:
<font size=+1>H</font>ello <font size=+1>W</font>orld
will first be split, and we will get the "cleaned" text:
Hello World
Then we can parse it using something like:
s/Hello World/Hello Universe/;
This will give us the changed "cleaned" text:
Hello Universe
When we merge it with the unwanted parts we will get:
<font size=+1>Hello Universe</font><font size=+1></font>
So the unwanted parts inside the match were pushed after the replacer.
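The split, replace and merge steps of the example can be sketched in plain Perl. This is only an illustration of the idea, not the module's code: it does not use Regexp::Ignore at all, the helper names `split_text`, `substitute_once` and `merge_text` are made up for this sketch, and it assumes that anything shaped like `<...>` is an unwanted part.

```perl
use strict;
use warnings;

# Hypothetical helper: separate the text into the "cleaned" string and a list
# of [offset, string] pairs. The offset is the position in the cleaned text
# at which the unwanted part has to be put back.
sub split_text {
    my ($text) = @_;
    my $clean = '';
    my @unwanted;
    while ($text =~ /\G(?:(<[^>]*>)|([^<]+))/gc) {
        if (defined $1) { push @unwanted, [length($clean), $1]; }
        else            { $clean .= $2; }
    }
    return ($clean, \@unwanted);
}

# Hypothetical helper: replace the first match of $pattern in the cleaned
# text. Unwanted parts that sat inside the match are moved to just after
# the replacer; parts after the match are shifted by the length difference.
sub substitute_once {
    my ($clean, $unwanted, $pattern, $replacement) = @_;
    return ($clean, $unwanted) unless $clean =~ $pattern;
    my ($start, $end) = ($-[0], $+[0]);
    my $shift = length($replacement) - ($end - $start);
    my @moved;
    for my $part (@$unwanted) {
        my ($offset, $str) = @$part;
        if    ($offset <= $start) { push @moved, [$offset, $str]; }
        elsif ($offset <= $end)   { push @moved, [$start + length($replacement), $str]; }
        else                      { push @moved, [$offset + $shift, $str]; }
    }
    substr($clean, $start, $end - $start, $replacement);
    return ($clean, \@moved);
}

# Hypothetical helper: put the unwanted parts back at their offsets.
# The offsets are in non-decreasing order, so one pass is enough.
sub merge_text {
    my ($clean, $unwanted) = @_;
    my ($out, $pos) = ('', 0);
    for my $part (@$unwanted) {
        my ($offset, $str) = @$part;
        $out .= substr($clean, $pos, $offset - $pos) . $str;
        $pos = $offset;
    }
    return $out . substr($clean, $pos);
}

my $html = '<font size=+1>H</font>ello <font size=+1>W</font>orld';
my ($clean, $unwanted) = split_text($html);
($clean, $unwanted) = substitute_once($clean, $unwanted, qr/Hello World/, 'Hello Universe');
print merge_text($clean, $unwanted), "\n";
# prints: <font size=+1>Hello Universe</font><font size=+1></font>
```

The first tag sits before the start of the match, so it stays where it is; the other three tags fall inside the match and end up after "Hello Universe".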
Why this assumption?
Because I could not find any better assumption. I cannot guess what the unwanted parts in a match will be, and the replacer of the match might be longer or shorter than the match itself. So, in fact, we have three reasonable possibilities:
1. Push the unwanted parts before the replacer.
2. Push the unwanted parts after the replacer.
3. Spread the unwanted parts in the replacer in the same proportions that they are spread in the match.
I chose the second option. It is very similar to the first, and by far simpler (to implement and to use) than the third.
As you can see in the example above, this usually does not break the markup language. It might, though, give some surprises - in the example above, the whole of "Hello Universe" ends up marked with the bigger font.
All in all, I believe that it is a big help when parsing formatted text.
So now that we know what the module can give us, let's check how we use the module.
The class Regexp::Ignore is an abstract class: there is a method, *get_tokens*, in the class that is not implemented. So the user of this class must inherit from it and implement the *get_tokens* method. The *get_tokens* method actually splits the text into tokens and marks them "wanted" or "unwanted".
Don't panic - it might sound very difficult, but it is not. Moreover, the module comes with some classes that already inherit from Regexp::Ignore, and you can use them. For more details about implementing the *get_tokens* method and an implementation example, see the module's own documentation.
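To make the "abstract class" idea concrete, here is a toy sketch of the pattern. It is not Regexp::Ignore and does not claim to match its real interface: the package names `Toy::Splitter` and `Toy::Splitter::Tags`, the `cleaned_text` method, and the choice to return two array references (tokens and flags) are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Toy base class standing in for an abstract splitter (hypothetical names).
package Toy::Splitter;

sub new {
    my ($class, $text) = @_;
    return bless { text => $text }, $class;
}

# "Abstract" method: the base class refuses to run it, so a subclass
# has to provide its own version.
sub get_tokens {
    my ($self) = @_;
    die ref($self) . " must implement get_tokens\n";
}

# Join only the tokens that the subclass flagged as wanted.
sub cleaned_text {
    my ($self) = @_;
    my ($tokens, $flags) = $self->get_tokens;
    my $clean = '';
    for my $i (0 .. $#$tokens) {
        $clean .= $tokens->[$i] unless $flags->[$i];
    }
    return $clean;
}

# Toy subclass: anything shaped like <...> is unwanted (flag 1),
# every other run of text is wanted (flag 0).
package Toy::Splitter::Tags;
our @ISA = ('Toy::Splitter');

sub get_tokens {
    my ($self) = @_;
    my $text = $self->{text};    # work on a copy so every call starts at position 0
    my (@tokens, @flags);
    while ($text =~ /\G(?:(<[^>]*>)|([^<]+))/gc) {
        push @tokens, defined $1 ? $1 : $2;
        push @flags,  defined $1 ? 1  : 0;
    }
    return (\@tokens, \@flags);
}

package main;

my $splitter = Toy::Splitter::Tags->new('<font size=+1>H</font>ello <font size=+1>W</font>orld');
print $splitter->cleaned_text, "\n";    # prints: Hello World
```

Only the subclass knows what counts as unwanted; the base class works with whatever tokens and flags it is handed.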
After we have the inherited class that implements the *get_tokens* method, and we call *split* to split the text, we can go on with our parsing as in the SYNOPSIS of the module's documentation. We can use the method *s*, which parallels the Perl s/// operator, and if we need more complex text manipulation, we can replace text directly using the *replace* method.
When we finish changing the text, we can call the *merge* method, which builds the resulting text from the changed "cleaned" text and the unwanted parts.
RPM found in directory: /packages/linux-pbone/ftp5.gwdg.de/pub/opensuse/repositories/devel:/languages:/perl:/CPAN-R/15.5/noarch