NAME

Acme::CSV - a module providing manipulation routines for comma separated value (CSV) records

SYNOPSIS

use Acme::CSV qw(CSVinit CSVvalidate CSVjoin CSVsplit);

#
# Validation
#
%fieldLayout = CSVinit($firstRecord);
if (! CSVvalidate(\%fieldLayout, I<field list>)) {
    die "Fields missing.";
}

#
# Parsing
#
while (<INPUT>) { 
    @record = CSVsplit($_);
    I<process the records>;
}

LICENSE

Copyright (c)2001 Christopher Rath <Christopher@Rath.ca> and Mark Mielke <Mark@Mielke.cc>.

Distributed under the GNU Lesser General Public License v2.1. See the accompanying lgpl.txt file for the license text; if the file is missing, a copy can be obtained from http://www.fsf.org/.

WARRANTY

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Lesser General Public License for more details.

DESCRIPTION

This module defines some functions for reading comma separated value (CSV) files (e.g., Remedy-ARS generated files). The module also allows a delimiter other than a comma to be used, or a selection of delimiters to be used (e.g., comma and semicolon).

The CSVinit($firstRecord) function is used to parse the initial record of a CSV file, which by definition contains the field names for each field in the CSV records that follow in the file. This initial record is itself a CSV record.

The CSVvalidate(\%output_from_CSVinit, <field_list>) function is used to verify that the incoming file contains the full list of fields specified in the field_list provided to the function.

The CSVsplit($rawRecord) function splits a line, passed as $rawRecord (a CSV record), and returns it split into fields; sort-of like split().

The $CSV::Delimiters variable can be set to specify an alternate field delimiter, or a set of alternate delimiters. It should generally be set with local() so that the change is temporary.

The CSVjoin(@fields_to_join) function takes an array of fields and turns them into a comma separated value record. If $CSV::Delimiters has been used to specify multiple delimiters then CSVjoin() will use the first delimiter of the set for creation of CSV records.

USAGE

CSV::Delimiters

local $CSV::Delimiters = ",;";
%fieldLayout = CSVinit($firstRecord);
@Fields = CSVsplit($aRawRecord);

This will cause either of ',' or ';' to be valid delimiters, even if mixed. local() is used to temporarily override the value, instead of permanently overriding the value.

CSVjoin() will prefer the first character in the Delimiters string, or the old behaviour of a ',' if the Delimiters string is not defined.

If you peek at the code, you will see that it implements the split core twice: once using $CSV::Delimiters, and once with ',' hard-coded. This maintains efficiency for code that does not make use of a dynamic set of delimiter characters.
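
The effect of local() on a delimiter set can be illustrated without the module. The following is only a sketch: $Delimiters is a stand-in for the module's variable, sketch_delim_split is a hypothetical helper, and quoting is deliberately ignored so that only the delimiter handling is shown.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's delimiter variable.
our $Delimiters = ',';

# Hypothetical helper, not part of Acme::CSV: split on any character
# found in $Delimiters. Quoted fields are NOT handled here.
sub sketch_delim_split {
    my ($rawRecord) = @_;
    my $class = quotemeta $Delimiters;       # make every delimiter literal
    return split /[$class]/, $rawRecord, -1; # -1 keeps trailing empty fields
}

{
    local $Delimiters = ',;';                # override for this block only
    my @fields = sketch_delim_split('a,b;c');
    print scalar(@fields), " fields\n";      # prints: 3 fields
}

# Outside the block the original value is back in effect.
my @fields = sketch_delim_split('a,b;c');
print scalar(@fields), " fields\n";          # prints: 2 fields
```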

CSVinit()

CSVinit() is used to initialize the data structures for further CSV parsing. The function scans the input record passed to it. If it contains data, then that data is used to define the field titles. If no data is passed then some default column data is used instead. The function returns an associative array containing field names and the corresponding field number.

Parameters

[1] $rawRecord - Record defining field layouts.

Returns

Returns an associative array mapping each field name to its field number.
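
The shape of the returned structure can be pictured with a minimal sketch. It is not the module's code: sketch_init is a hypothetical name, the header is assumed to be unquoted and comma-delimited, and the zero-based field numbering is an assumption. The default column data used for an empty record is not modelled.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Acme::CSV: map each field name in a
# header record to its position. Assumes an unquoted, comma-delimited
# header and zero-based field numbers.
sub sketch_init {
    my ($rawRecord) = @_;
    chomp $rawRecord;
    my @names = split /,/, $rawRecord, -1;   # -1 keeps trailing empty fields
    my %layout;
    @layout{@names} = (0 .. $#names);        # hash slice: name => position
    return %layout;
}

my %fieldLayout = sketch_init("Name,Email,Phone\n");
print "$_ => $fieldLayout{$_}\n" for sort keys %fieldLayout;
# prints: Email => 1
#         Name => 0
#         Phone => 2
```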

CSVvalidate()

CSVvalidate() is used to validate a set of field names. A list of fields is validated against the associative array previously built by a call to CSVinit(). Returns true or false.

Parameters

[1] \%fields - the fields as built by CSVinit(). (reference)

[?] $... - a list of field names to check.

Returns

Returns 0 if any single field is not found in %fields; otherwise 1.
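
The check described above amounts to a key lookup per requested name. A minimal sketch, with sketch_validate as a hypothetical name rather than the module's own implementation:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Acme::CSV: return 1 only if every
# requested name is a key of the layout hash, otherwise 0.
sub sketch_validate {
    my ($fields, @wanted) = @_;
    for my $name (@wanted) {
        return 0 unless exists $fields->{$name};
    }
    return 1;
}

my %fieldLayout = (Name => 0, Email => 1, Phone => 2);
print sketch_validate(\%fieldLayout, qw(Name Phone)), "\n";   # prints: 1
print sketch_validate(\%fieldLayout, qw(Name Fax)),   "\n";   # prints: 0
```

Note that the hash is passed by reference (\%fieldLayout), matching parameter [1] above.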

CSVjoin()

CSVjoin() is used to join CSV data.

Parameters

Context 1:

[1] \@fields  - An array of fields to join into a CSV.
[2] "minimum"|"quoteall" - Defines whether fields are quoted 
        even when quoting is not required. Defaults to "minimum".

Context 2:

[?] $... - An array of fields to join into 
        a CSV.

Returns

Returns a string which may be "un"join'ed using CSVsplit() (except in the case of a newline contained within a field).
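
The difference between the two quoting modes can be sketched as follows. This covers Context 1 only and is not the module's code: sketch_join is a hypothetical name, the delimiter is fixed at ',', and the rule used for "minimum" (quote a field only if it contains a comma or a quotation mark) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Acme::CSV. Joins fields with ',' and
# quotes a field only when it contains a comma or a quotation mark
# ("minimum"), or always ("quoteall"). Embedded quotes are doubled.
sub sketch_join {
    my ($fields, $mode) = @_;
    $mode = 'minimum' unless defined $mode;
    my @out;
    for my $field (@$fields) {
        my $value = $field;                  # copy; leave caller's data alone
        if ($mode eq 'quoteall' || $value =~ /[,"]/) {
            $value =~ s/"/""/g;              # " becomes ""
            $value = qq{"$value"};
        }
        push @out, $value;
    }
    return join ',', @out;
}

print sketch_join(['abc', 'd,e', 'say "hi"']), "\n";
# prints: abc,"d,e","say ""hi"""
print sketch_join(['abc', 'def'], 'quoteall'), "\n";
# prints: "abc","def"
```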

CSVsplit()

CSVsplit() is used to split CSV data, much as Perl's own split() does.

Parameters

[1] $rawRecord - the record to split.

Returns

Returns an array of values split out of $rawRecord.

CSV SPECIFICATION

This section attempts to define CSV records and files in a fairly rigorous fashion. The point behind this is to make this module usable without having to read and understand the source code.

CSV Records

The basic idea behind a CSV record is this: literal field values are delimited by commas. The immediate complication that arises is, of course, ``What should be done when a comma must appear within a field?'' Within the bounds of current practice, there are two immediate solutions to this complication:

  1. Use a predefined escape character to tag commas which appear within fields.

  2. Allow quotation marks to enclose a field and _protect_ a comma appearing within a field.

As with all work-arounds, these "immediate solutions" have complications of their own (these secondary complications are numbered the same as their primary counterparts):

  1. Escape characters must themselves be escaped in order to appear as a value within a field (e.g., if a literal comma is expressed as ``\,'', then a literal backslash must appear as ``\\'').

  2. Quotation marks must somehow be protected if they are to appear as a literal character within a field.

Given that the essence of CSV files is simplicity, I have decided to reject all escape and escaped characters with the exception of quotation marks appearing within quotation marks. That is, the case of the escaped comma has been rejected from this specification.

Within the context of Perl, the string(3) library and the UNIX shells, an additional level of complexity is added to this equation when we begin to ask, ``What is the meaning or significance of whitespace within a CSV record?'' The meaning of whitespace is a key technical detail which must be accounted for in both the specification and its implementation; otherwise, every implementation will produce semi-random results based upon its implementor's opinion regarding whitespace.

Semi-Formal CSV Record Specification

This specification uses the syntax described in Appendix A of the first edition of O'Reilly's Programming Perl book (i.e., the Perl 4 camel book).

CSV_RECORD ::= (* FIELD DELIM *) FIELD REC_SEP

FIELD ::= QUOTED_TEXT | TEXT

DELIM ::= `,'

REC_SEP ::= `\n'

TEXT ::= LIT_STR | ["] LIT_STR [^"] | [^"] LIT_STR ["]

LIT_STR ::= (* LITERAL_CHAR *)

LITERAL_CHAR ::= NOT_COMMA_NL

NOT_COMMA_NL ::= [^,\n]

QUOTED_TEXT ::= ["] (* NOT_A_QUOTE *) ["]

NOT_A_QUOTE ::= [^"] | ESCAPED_QUOTE

ESCAPED_QUOTE ::= `""'

Notes

This specification does not grant any special status to whitespace characters. This means that all whitespace is part of some field value.

The TEXT non-terminal is attempting to express the cases where quotation marks exist but do not completely encapsulate the field value; in cases like this, the quotation marks should be treated as literal characters making up part of the field value.

One ambiguity exists in this specification that I have been unable to properly express: the case of a field with the value ,abc""de,. Should the double quotation marks be treated as an escaped quotation mark or as two quotation marks? I believe that occurrences of "" should be treated as escaped quotation marks only within a quoted string.

The LITERAL_CHAR non-terminal exists partially as a place-holder. Escaped characters may be easily accommodated by this specification at a later date by OR-ing them to the right side of LITERAL_CHAR.

Some of the non-terminals exist solely as documentation/reading aids. NOT_COMMA_NL is one example of this case; its name helps express the meaning of the regex (which should assist other readers of this document to detect errors in the specification).

The ESCAPED_QUOTE non-terminal includes the PASCAL-like case of ``""'' and excludes the more traditional UNIX ``\"''. This is not my preference; I have included it here because I know there exists at least one commercial tool that produces CSV records containing the PASCAL-like construct and not the UNIX-like one.
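
As an illustration only, the grammar and notes above might be rendered in Perl along the following lines. This is a sketch under stated assumptions, not the code used by the module: sketch_split is a hypothetical name, only the ',' delimiter is handled, and a field counts as QUOTED_TEXT only when its closing quotation mark is followed by a delimiter or the end of the record. Anything else is taken as TEXT, so partially quoted fields and the ,abc""de, case keep their quotation marks literally.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Acme::CSV: split one record into
# fields following the semi-formal grammar (comma delimiter only).
sub sketch_split {
    my ($rawRecord) = @_;
    my $rec = $rawRecord;
    $rec =~ s/\n\z//;                        # drop REC_SEP
    my @fields;
    pos($rec) = 0;
    while (1) {
        # QUOTED_TEXT: fully enclosed in quotes, "" is an escaped quote.
        if ($rec =~ /\G"((?:[^"]|"")*)"(?=,|\z)/gc) {
            (my $value = $1) =~ s/""/"/g;
            push @fields, $value;
        }
        else {
            # TEXT: everything up to the next comma, quotes kept literally.
            $rec =~ /\G([^,\n]*)/gc;
            push @fields, $1;
        }
        last unless $rec =~ /\G,/gc;         # DELIM, or stop at end of record
    }
    return @fields;
}

my @fields = sketch_split(qq{abc,"d,e","say ""hi""",abc""de\n});
print "[$_]\n" for @fields;
# prints: [abc]
#         [d,e]
#         [say "hi"]
#         [abc""de]
```

Whitespace receives no special treatment in this sketch, in line with the first note above: a leading or trailing space stays part of the field value.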

AUTHORS

Christopher Rath (christopher@rath.ca) wrote the CSV specification and everything in the module except the essential snippet of code that actually does the work :).

Mark Mielke (mark@mielke.cc) took the specification and wrote the essential piece of code that actually breaks CSV records into their constituent fields. He also took the initial .pl version and .pm'ed it (this only makes sense, since this module is only usable in Perl 5).

Alex Ayars (pause@nodekit.org) found this module somewhere and put it on the CPAN.

BUGS

Not Thread-Safe

This module, and hence CSVinit()/CSVsplit(), is not thread-safe or re-entrant.

Fields Spanning Lines

This module currently fails one of its test cases. Note that the test output listed above has been constructed to show the actual output of this module, as opposed to the correct output. The failing input is:

"Multi-
line",test

This is due to the fact that perl is reading one line at a time with:

while (<DATA>) { ... }

So the first line is read ("Multi-) and evaluated. The _second_ time around the loop the second line (line",test) is read and evaluated. There is no workaround available; this is simply a limitation of the module.
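
The line-at-a-time behaviour described above can be reproduced without the module. The following sketch reads the two-line example from an in-memory filehandle (the variable names and the in-memory file are illustrative assumptions) and shows that each pass of the loop receives only half of the quoted field.

```perl
use strict;
use warnings;

# Read the two-line example from an in-memory filehandle, one line at a
# time, to show what each pass of the loop actually receives.
my $data = qq{"Multi-\nline",test\n};
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @chunks;
while (my $line = <$fh>) {
    chomp $line;
    push @chunks, $line;
}
close $fh;

print scalar(@chunks), " chunks\n";          # prints: 2 chunks
print "[$_]\n" for @chunks;
# prints: ["Multi-]
#         [line",test]
```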