Name : perl-Text-CSV-Separator
Version : 0.20
Vendor : obs://build_opensuse_org/devel:languages:perl
Release : lp155.7.1
Date : 2023-07-20 17:54:41
Group : Development/Libraries/Perl
Source RPM : perl-Text-CSV-Separator-0.20-lp155.7.1.src.rpm
Size : 0.03 MB
Packager : https://www_suse_com/
Summary : Determine the field separator of a CSV file
Description :
This module quickly detects the field separator character (also called the field delimiter) of a CSV file or, more generally, of any character-separated text file (also called a delimited text file), and returns it ready to use in a CSV parser (e.g., Text::CSV_XS, Tie::CSV_File, or Text::CSV::Simple). This may be useful to the vulnerable (and often ignored) population of programmers who need to automatically process CSV files from different sources.
The default set of candidates contains the following characters: ',' ';' ':' '|' '\t'
The only required parameter is the CSV file path. Optionally, the user can specify characters to be excluded or included in the list of candidates.
The routine returns an array containing the list of candidates that passed the tests. If it succeeds, this array will contain only one value: the field separator we are looking for. On the other hand, if no candidate survives the tests, it will return an empty list.
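To illustrate how the candidate list could be assembled from the defaults and the optional include/exclude parameters, here is a minimal Python sketch. The function name `build_candidates` and its argument names are hypothetical; this is not the module's Perl code, only an illustration of the behaviour described above.

```python
# Default candidates as listed in the description above.
DEFAULT_CANDIDATES = [",", ";", ":", "|", "\t"]


def build_candidates(include=(), exclude=()):
    """Return the defaults plus `include`, minus `exclude`, preserving order.

    Hypothetical helper: names and argument order are assumptions made
    for illustration, not the module's interface.
    """
    candidates = list(DEFAULT_CANDIDATES)
    for ch in include:
        if ch not in candidates:  # avoid duplicate candidates
            candidates.append(ch)
    return [ch for ch in candidates if ch not in exclude]
```

For instance, adding '/' and dropping ':' and '|' would leave ',', ';', '\t' and '/' as the characters to test.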
The technique used is based on the following principle:
* For every line in the file, the number of instances of the separator character acting as separators must be an integer constant > 0, although a line may also contain some instances of that character as literal characters.
* Most of the other candidates won't appear in a typical CSV line.
As soon as a candidate misses a line, it will be removed from the candidates list.
This is the first test applied to the CSV file. In most cases, it will detect the separator after processing the first few lines. In particular, if the file contains a header line, one line will probably be enough to get the job done. Processing stops and control returns to the caller as soon as only one candidate remains (or none).
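The first pass described above can be sketched in a few lines of Python. This is a simplified illustration, not the module's implementation: the name `first_pass` is hypothetical, blank lines are skipped by assumption, and the sketch only checks whether a candidate is present on each line, without distinguishing separators from literal characters.

```python
def first_pass(lines, candidates=(",", ";", ":", "|", "\t")):
    """Drop every candidate that is missing from some line.

    Stops as soon as at most one candidate is left, mirroring the
    early exit described in the text. Hypothetical sketch only.
    """
    survivors = list(candidates)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no evidence
        # A candidate that "misses" this line is removed.
        survivors = [ch for ch in survivors if ch in line]
        if len(survivors) <= 1:
            break  # one winner, or nobody left
    return survivors
```

With a header line such as `name;age;joined`, the loop ends after one line with ';' as the only survivor. A headerless file whose rows look like `10:30,12:45` keeps both ',' and ':' alive, which is the kind of case the second pass is meant to resolve.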
If the routine cannot determine the separator in the first pass, it does a second pass based on several heuristic techniques. It checks whether the file has columns consisting of time values, comma-separated decimal numbers, or numbers containing a comma as the group separator, any of which can lead to false positives in files that don't have a header row. It also measures the variability of the remaining candidates. Of course, you can always create a CSV file capable of resisting the siege, but this approach will work correctly in many cases. Excluding some of the default candidates may help to resolve cases with several possible winners. The resulting array contains the list of possible separators sorted by likelihood, with the first item being the most probable separator.
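One way to picture the variability measurement is to count each remaining candidate per line and prefer the candidate whose count changes least. The Python sketch below uses the population variance of the per-line counts; that choice of statistic and the name `rank_by_variability` are assumptions made for illustration, since the description does not say how the module measures variability.

```python
from statistics import pvariance


def rank_by_variability(lines, candidates):
    """Sort candidates from most to least likely separator.

    Assumption: a lower variance of the per-line count means a more
    regular, and therefore more plausible, separator.
    """
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    scores = {}
    for ch in candidates:
        counts = [row.count(ch) for row in rows]
        scores[ch] = pvariance(counts) if counts else 0.0
    # sorted() is stable, so ties keep the original candidate order.
    return sorted(candidates, key=lambda ch: scores[ch])
```

For rows such as `1,5;2,25;x`, `3;4;y` and `7,5;8;z`, the ';' count is constant while the ',' count varies with the decimal commas, so ';' is ranked first.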
The module also provides an alternative interface with a simpler syntax, which can be handy if you expect the files your program has to deal with not to be too exotic. To use it, add the *lucky => 1* key-value pair to the parameters hash; the routine will then return a single value, so you can assign it directly to a scalar variable. If no candidate survives the first pass, it returns 'undef'. The code skips the second pass, which is usually unnecessary, so the program won't store counts and won't check for any regularities. Hence, it runs faster and requires less memory. This approach should be enough in most cases.
RPM found in directory: /packages/linux-pbone/ftp5.gwdg.de/pub/opensuse/repositories/devel:/languages:/perl:/CPAN-T/15.5/noarch