NAME
Text::CSV::Separator - Determine the field separator of a CSV file
VERSION
Version 0.15 April 6, 2007
SYNOPSIS
use Text::CSV::Separator qw(get_separator);
my @char_list = get_separator(
    path    => $csv_path,
    exclude => $array1_ref,    # optional
    include => $array2_ref,    # optional
    echo    => 'on',           # optional
);
my $separator;
if (@char_list) {
    if (@char_list == 1) {    # successful detection
        $separator = $char_list[0];
    } else {                  # several candidates passed the tests
        warn "Several candidate separators found: @char_list\n";
    }
} else {                      # no candidate passed the tests
    warn "No separator detected\n";
}
# "I'm Feeling Lucky" alternative interface
# Don't forget to include the 'lucky' parameter
my $separator = get_separator(
    path    => $csv_path,
    lucky   => 1,
    exclude => $array1_ref,    # optional
    include => $array2_ref,    # optional
    echo    => 'on',           # optional
);
DESCRIPTION
This module provides fast detection of the field separator character (also called field delimiter) of a CSV file or, more generally, of a character-separated text file (also called delimited text file), and returns it ready to use in a CSV parser (e.g., Text::CSV_XS, Tie::CSV_File, or Text::CSV::Simple). This may be useful to the vulnerable (and often ignored) population of programmers who need to automatically process CSV files from different sources.
The default set of candidates contains the following characters: ',' ';' ':' '|' '\t'
The only required parameter is the CSV file path. Optionally, the user can specify characters to be excluded or included in the list of candidates.
The routine returns an array containing the list of candidates that passed the tests. If it succeeds, this array will contain only one value: the field separator we are looking for. On the other hand, if no candidate survives the tests, it will return an empty list.
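When the returned character is used directly in a regular expression rather than handed to a CSV parser, keep in mind that some of the default candidates, such as '|', are regex metacharacters. The following is a minimal sketch, assuming simple lines without quoted fields; split_fields is a hypothetical helper and is not part of this module:

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Text::CSV::Separator).
# Suitable only for simple lines without quoted fields.
sub split_fields {
    my ($line, $separator) = @_;
    # \Q...\E quotes metacharacters such as '|';
    # the -1 limit keeps trailing empty fields.
    return split /\Q$separator\E/, $line, -1;
}

my @fields = split_fields("a|b||", '|');
print scalar(@fields), " fields\n";
```

For real CSV data with quoting, a dedicated parser remains the safer choice.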
The technique used is based on the following principle:
For every line in the file, the number of instances of the separator character acting as separators must be a constant integer greater than 0, although a line may also contain some instances of that character as literal characters.
Most of the other candidates won't appear in a typical CSV line.
As soon as a candidate misses a line, it will be removed from the candidates list.
This is the first test applied to the CSV file. In most cases, it will detect the separator after processing the first few lines. In particular, if the file contains a header line, one line will probably be enough to get the job done. Processing will stop and return control to the caller as soon as only one candidate remains (or none is left).
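The elimination idea of the first pass can be sketched as follows. This is a simplified illustration and not the module's actual code: first_pass is a hypothetical helper, it works on lines held in memory instead of a file, and it only applies the "a candidate that misses a line is removed" rule.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV::Separator.
# Removes every candidate that is absent from some line and
# stops as soon as at most one candidate is left.
sub first_pass {
    my ($lines_ref, @candidates) = @_;
    my %alive = map { $_ => 1 } @candidates;

    for my $line (@$lines_ref) {
        for my $char (keys %alive) {
            # Count the occurrences of the candidate in this line.
            my $count = () = $line =~ /\Q$char\E/g;
            # A candidate that misses a line cannot be the separator.
            delete $alive{$char} if $count == 0;
        }
        last if keys(%alive) <= 1;
    }
    return sort keys %alive;
}

my @lines = (
    "name;joined;score",
    "Ann;2007-04-06 10:15:00;3,5",
    "Bob;2007-04-07 11:30:00;4,0",
);
my @left = first_pass(\@lines, ',', ';', ':', '|', "\t");
print "Remaining candidates: @left\n";
```

In this sample the header line alone removes every candidate except the semicolon, so the loop ends after the first line.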
If the routine cannot determine the separator in the first pass, it will do a second pass based on several heuristic techniques. It checks whether the file has columns consisting of time values, comma-separated decimal numbers, or numbers containing a comma as the group separator, which can lead to false positives in files that don't have a header row. It also measures the variability of the remaining candidates. Of course, you can always create a CSV file capable of resisting the siege, but this approach will work correctly in many cases. The possibility of excluding some of the default candidates may help to resolve cases with several possible winners. The resulting array contains the list of possible separators sorted by likelihood, with the first item being the most probable separator.
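To illustrate what measuring the variability of the remaining candidates could look like, here is a minimal sketch. It assumes that variability means the standard deviation of the per-line counts, which is only one possible measure and is not necessarily the one this module uses; count_spread is a hypothetical helper.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, not part of Text::CSV::Separator.
# Returns the standard deviation of the per-line counts of $char.
sub count_spread {
    my ($lines_ref, $char) = @_;
    my @counts = map { my $n = () = $_ =~ /\Q$char\E/g; $n } @$lines_ref;
    my $mean   = sum(@counts) / @counts;
    return sqrt( sum( map { ($_ - $mean) ** 2 } @counts ) / @counts );
}

# No header row; decimal commas appear in every line.
my @lines = (
    "Ann;1,5;2,25",
    "Bob;3;4,5",
    "Cid;6,75;8",
);

my %spread;
$spread{$_} = count_spread(\@lines, $_) for (',', ';');

# The candidate with the most regular count comes first.
my @ranked = sort { $spread{$a} <=> $spread{$b} } keys %spread;
print "Most likely separator: $ranked[0]\n";
```

Both characters occur in every line here, so the first pass alone cannot decide; the semicolon count is the same in all lines while the comma count varies, which places the semicolon first.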
The module also provides an alternative interface with a simpler syntax, which can be handy if you think that the files your program will have to deal with aren't too exotic. To use it, you only have to add the lucky => 1 key-value pair to the parameter hash, and the routine will return a single value, so you can assign it directly to a scalar variable. If no candidate survives the first pass, it will return undef. The code skips the second pass, which is usually unnecessary, so the program won't store counts and won't check for any existing regularities. Hence, it will run faster and require less memory. This approach should be enough in most cases.
FUNCTIONS
- get_separator(%options)
-
Returns an array containing the field separator character (or characters, if more than one candidate passed the tests) of a CSV file. In case no candidate passes the tests, it returns an empty list.
The available parameters are:
- path
-
Required. The path to the CSV file.
- exclude
-
Optional. Array containing characters to be excluded from the candidates list.
- include
-
Optional. Array containing characters to be included in the candidates list.
- lucky
-
Optional. If selected, get_separator will return one single character, or undef in case no separator is detected. Off by default.
- echo
-
Optional. Writes to the standard output messages describing the actions performed. Off by default. This is useful to keep track of what's going on, especially for debugging purposes.
EXPORT
None by default.
EXAMPLE
Consider the following scenario: your program must process a batch of CSV files, and you know that the separator could be a comma, a semicolon or a tab. You also know that one of the fields contains time values. This field will provide a fixed number of colons that could mislead the detection code. In this case, you should exclude the colon (and you can also exclude the remaining default candidate, the pipe character):
my @char_list = get_separator(
    path    => $csv_path,
    exclude => [':', '|'],
);
if (@char_list) {
    my $separator;
    if (@char_list == 1) {
        $separator = $char_list[0];
    }
    ...
}
# Using the "I'm Feeling Lucky" interface:
my $separator = get_separator(
    path    => $csv_path,
    lucky   => 1,
    exclude => [':', '|'],
);
MOTIVATION
Despite the popularity of XML, the CSV file format is still widely used for data exchange between applications because of its much lower overhead: it requires much less bandwidth and storage space than XML, and it also performs better under compression (see the References below).
Unfortunately, there is no formal specification of the CSV format. The Microsoft Excel implementation is the most widely used and it has become a de facto standard, but the variations are almost endless.
One of the biggest annoyances of this format is that in most cases you don't know a priori which field separator character is used in a file. CSV stands for "comma-separated values", but most spreadsheet applications let the user select the field delimiter from a list of several different characters when saving or exporting data to a CSV file. Furthermore, on a Windows system, when you save a spreadsheet in Excel as a CSV file, Excel will use the default list separator of your system's locale as the field delimiter, which happens to be a semicolon for several European languages. You can even customize this setting and use any list separator you like. For these and other reasons, automating the processing of CSV files is a risky task.
This module can be used to determine the separator character of a delimited text file of any kind, but since the aforementioned ambiguity problems occur mainly in CSV files, I decided to use the Text::CSV:: namespace.
REFERENCES
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
http://www.xml.com/pub/a/2004/12/15/deviant.html
SEE ALSO
There's another module on CPAN for this task, Text::CSV::DetectSeparator, which follows a different approach.
ACKNOWLEDGEMENTS
Many thanks to Xavier Noria for wise suggestions. The author is also grateful to Thomas Zahreddin, Benjamin Erhart and Ferdinand Gassauer for valuable comments.
AUTHOR
Enrique Nell, <perl_nell@telefonica.net>
COPYRIGHT AND LICENSE
Copyright (C) 2006 by Enrique Nell.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.