Name : libtextcat
Version : 2.2
Vendor : (none)
Release : 1.guru.suse100
Date : 2006-01-02 16:08:36
Group : System/Libraries
Source RPM : libtextcat-2.2-1.guru.suse100.src.rpm
Size : 0.03 MB
Packager : Pascal Bleser < guru_unixtech_be>
Summary : N-Gram-Based Text Categorization Library
Description :
Libtextcat is a library with functions that implement the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy.
The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category and compare it with the fingerprints of a number of documents whose categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency. Fingerprints are compared with a simple out-of-place metric.
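The fingerprint and out-of-place comparison described above can be illustrated with a minimal Python sketch. This is not libtextcat's C API: the function names (fingerprint, out_of_place, classify), the defaults max_n=5 and top=300, the underscore padding at word boundaries, and the fixed penalty for n-grams missing from a category are all assumptions chosen for illustration.

```python
from collections import Counter

def fingerprint(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams (length 1..max_n),
    most frequent first. Names and defaults are hypothetical."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # assumption: mark word boundaries with "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending count; ties broken alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each one's rank is from
    its rank in the category fingerprint. N-grams absent from the category
    get a fixed maximum penalty (assumption: the category fingerprint length)."""
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )

def classify(text, category_fps):
    """Return the category name whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))
```

As a usage sketch, category fingerprints would be built once from sample text per category, for example category_fps = {"a": fingerprint(sample_a), "b": fingerprint(sample_b)}, and each unknown document is then passed to classify(document, category_fps). A lower out-of-place score means a closer match, and identical fingerprints score zero.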
Considerable effort went into making this implementation fast and efficient. The language guesser processes over 100 documents/second on a simple PC, which makes it practical for many uses. It was developed for use in our webcrawler and search engine software, in which it handles millions of documents a day.
RPM found in directory: /packages/linux-pbone/ftp.gwdg.de/pub/linux/misc/suser-guru/rpm/10.0/RPMS/x86_64