Name : perl-Text-Ngram
Version : 0.13
Vendor : obs://build_opensuse_org/devel:languages:perl
Release : 4.6
Date : 2014-07-17 03:23:03
Group : Development/Libraries/Perl
Source RPM : perl-Text-Ngram-0.13-4.6.src.rpm
Size : 0.03 MB
Packager : (none)
Summary : Ngram analysis of text
Description :
n-Gram analysis is a field in textual analysis which uses sliding window character sequences in order to aid topic analysis, language determination and so on. The n-gram spectrum of a document can be used to compare and filter documents in multiple languages, prepare word prediction networks, and perform spelling correction.
The neat thing about n-grams, though, is that they're really easy to determine. For n=3, for instance, we compute the n-gram counts like so:
    the cat sat on the mat
    ---                       $counts{"the"}++;
     ---                      $counts{"he "}++;
      ---                     $counts{"e c"}++;
    ...
This module provides an efficient XS-based implementation of n-gram spectrum analysis.
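For comparison, the sliding-window counting described above can be sketched in a few lines of pure Perl. This is only an illustration: simple_ngram_counts is a hypothetical helper, not part of Text::Ngram, and it does none of the lowercasing, punctuation or break handling that the module's options control.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Ngram): count every
# $n-character window in $text, sliding one character at a time.
sub simple_ngram_counts {
    my ($text, $n) = @_;
    my %counts;
    # The last full window starts at offset length($text) - $n;
    # if the text is shorter than $n, the range is empty.
    for my $i (0 .. length($text) - $n) {
        $counts{ substr($text, $i, $n) }++;
    }
    return \%counts;
}

my $counts = simple_ngram_counts("the cat sat on the mat", 3);
print "'$_' => $counts->{$_}\n" for sort keys %$counts;
```

Each call to substr extracts one window, so the loop makes a single pass over the text.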
There are two functions which can be imported:
ngram_counts
This first function returns a hash reference with the n-gram histogram of the text for the given window size. The default window size is 5.
    $href = ngram_counts(\%config, $text, $window_size);
The only necessary parameter is $text.
The possible values for \%config are:
flankbreaks
If set to 1 (default), breaks are flanked by spaces; if set to 0, they're not. Breaks are punctuation and other non-alphabetic characters, which, unless you use 'punctuation => 1' in your configuration, do not make it into the returned hash.
Here's an example, supposing you're using the default value for punctuation (0):
    my $text = "Hello, world";
    my $hash = ngram_counts($text, 5);
That produces the following ngrams:
    {
      'Hello' => 1,
      'ello ' => 1,
      ' worl' => 1,
      'world' => 1,
    }
On the other hand, this:
    my $text = "Hello, world";
    my $hash = ngram_counts({flankbreaks => 0}, $text, 5);
produces the following ngrams:
    {
      'Hello' => 1,
      ' worl' => 1,
      'world' => 1,
    }
lowercase
If set to 0, casing is preserved. If set to 1, all letters are lowercased before counting ngrams. Default is 1.
    $href_p = ngram_counts( {lowercase => 0}, $text, 4 );
punctuation
If set to 0 (default), punctuation is removed before calculating the ngrams. Set to 1 to preserve it.
    $href_p = ngram_counts( {punctuation => 1}, $text, 2 );
spaces
If set to 0 (default is 1), no ngrams containing spaces will be returned.
    $href = ngram_counts( {spaces => 0}, $text, 3 );
If you're going to request both types of ngrams, then the best way to avoid calculating the same thing twice is probably this:
    $href_with_spaces = ngram_counts($text, $window);   # $window is optional
    # Copy the hash itself; copying only the reference would alias the original.
    $href_no_spaces = { %$href_with_spaces };
    for (keys %$href_no_spaces) { delete $href_no_spaces->{$_} if / / }
add_to_counts
This incrementally adds to the supplied hash; if '$window' is zero or undefined, then the window size is computed from the hash keys.
    add_to_counts($more_text, $window, $href);
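The incremental behaviour, including the window-size fallback, can be illustrated with a pure-Perl stand-in. simple_add_to_counts is a hypothetical name, and both the way it infers the window (the length of an arbitrary existing key) and the error on an empty hash are assumptions made for this sketch; the module's XS implementation may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for add_to_counts, for illustration only.
sub simple_add_to_counts {
    my ($text, $window, $counts) = @_;
    # Assumption: with a zero or undefined window, reuse the length
    # of any key already present in the hash.
    if (!$window) {
        my ($any_key) = keys %$counts;
        die "cannot infer window size from an empty hash\n"
            unless defined $any_key;
        $window = length $any_key;
    }
    # Add one count per window, updating the caller's hash in place.
    for my $i (0 .. length($text) - $window) {
        $counts->{ substr($text, $i, $window) }++;
    }
    return $counts;
}

my %counts;
simple_add_to_counts("abcab", 2, \%counts);      # window given explicitly
simple_add_to_counts("bca", undef, \%counts);   # window inferred from the keys
print "$_ => $counts{$_}\n" for sort keys %counts;
```

Because the hash is passed by reference, counts from successive texts accumulate in the same structure without rebuilding it.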
RPM found in directory: /packages/linux-pbone/ftp5.gwdg.de/pub/opensuse/repositories/devel:/languages:/perl/SLE_11_SP3/i586
Provides :
Ngram.so
perl(Text::Ngram)
perl-Text-Ngram
Requires :