NAME

Text::LooseCSV - Highly forgiving variable length record text parser; compare to MS Excel

SYNOPSIS

use Text::LooseCSV;
use IO::File;

$fh = IO::File->new($fname) or die "Cannot open $fname: $!";
$f = Text::LooseCSV->new($fh);

# Some optional settings
$f->word_delimiter("\t");
$f->line_delimiter("\n");
$f->no_quotes(1);

# Parse/split a line
while ($rec = $f->next_record())
{
    if ($rec == -1)
    {
        warn("corrupt rec: ", $f->cur_line);
        next;
    }

    # process $rec as arrayref
    ...
}


# Or, (vice-versa) create a variable-length record file
$line = $f->form_record( [ 'Debbie Does Dallas','30.00','VHS','Classic' ] );

DESCRIPTION

Why another variable-length text record parser? I've had the privilege to parse some of the gnarliest data ever seen and everything else I tried on CPAN choked (at the time I wrote this module). This module has been munching on millions of records of the filthiest data imaginable at several production sites so I thought I'd contribute.

This module follows somewhat loose rules (compare to MS Excel) and will handle embedded newlines, etc. It is capable of handling large files and processes data in line-chunks. If MAX_LINEBUF is reached, however, it will mark the current record as corrupt, return -1 and start over again at the very next line. It will (of course) also process tab-delimited data, or data separated by whatever single character you set for word_delimiter.

Methods are called in perl OO fashion.

WARNING! This module modifies $/. line_delimiter() sets $/ and is always called during construction. Don't change $/ during program execution!
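If other parts of a program need a different $/ (for example, to slurp an unrelated file), one way to avoid disturbing the parser is to scope the change with local. The sketch below is plain Perl and assumes nothing about the module's internals; slurp_file is a hypothetical helper, not part of Text::LooseCSV, and no Text::LooseCSV method should be called while the localized value is in effect.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::LooseCSV): slurp a side file
# while leaving the global $/ untouched for the parser.
sub slurp_file {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    local $/;              # undef = slurp mode; restored when the sub returns
    my $text = <$in>;
    close($in);
    return $text;
}
```

Because local restores the previous value when the enclosing block exits, $/ is back to whatever line_delimiter() set by the time control returns to the parsing loop.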

METHOD DETAILS

new (constructor)
$f = Text::LooseCSV->new($fh);

Create a new Text::LooseCSV object for all your variable-length record needs with an optional file handle, $fh (e.g. IO::File). Set properties using the accessor methods as needed.

If $fh is not given, you can use input_file() or input_text().

Returns a blessed Text::LooseCSV object.

line_delimiter
$current_value = $f->line_delimiter("\n");

Get/set LINE_DELIMITER. LINE_DELIMITER defines the line boundary chunks that are read into the buffer and loosely defines the record delimiter.

For parsing, this does not strictly affect the record/field structures as fields may have embedded newlines, etc. However, this DOES need to be set correctly.

Default = "\r\n" NOTE! The default is Windows format.

Always returns the current set value.

WARNING! line_delimiter() also sets $/ and is always called during construction. Due to buffering, don't change $/ or LINE_DELIMITER during program execution!

word_delimiter
$current_value = $f->word_delimiter("\t");

Get/set WORD_DELIMITER. WORD_DELIMITER defines the field boundaries within the record. WORD_DELIMITER may only be set to a single character, otherwise a warning is generated and the new value is ignored.

Default = "," NOTE! Single character only.

Always returns the current set value.

WARNING! Due to buffering, don't change WORD_DELIMITER during program execution!
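As a minimal sketch of the two delimiter accessors, the following configures a parser for a Unix-style, tab-delimited file before any data is read. It relies only on the behavior described above (both accessors return the current value, and a multi-character WORD_DELIMITER is rejected with a warning); how the warning is emitted is not specified here.

```perl
use strict;
use warnings;
use Text::LooseCSV;

# No filehandle yet; input_file() or input_text() can supply data later.
my $f = Text::LooseCSV->new();

# Set both delimiters up front, before any record is read.
$f->line_delimiter("\n");    # Unix line endings instead of the "\r\n" default
$f->word_delimiter("\t");    # tab-delimited fields

# Per the text above, a multi-character delimiter should be ignored
# (with a warning), so the returned value should still be the tab.
my $current = $f->word_delimiter("||");
print "word delimiter unchanged\n" if $current eq "\t";
```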

quote_escape
$current_value = $f->quote_escape("\\");

Get/set QUOTE_ESCAPE. For data that have fields enclosed in quotes, QUOTE_ESCAPE defines the escape character for '"'. For example, with the default QUOTE_ESCAPE = '"', a quote character is embedded in a field by doubling it (MS Excel style):

"field1 ""junk"" and more, etc"

Default = '"'

Always returns the current set value.

WARNING! Due to buffering, don't change QUOTE_ESCAPE during program execution!
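To make the quoting convention concrete, here is a small plain-Perl sketch that produces a quoted field of the kind shown above. excel_quote is a hypothetical helper written for illustration only; it is not the module's implementation and handles just the default QUOTE_ESCAPE of '"'.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a value in quotes, doubling any embedded
# quote characters (the default, MS Excel style escape).
sub excel_quote {
    my ($value) = @_;
    (my $escaped = $value) =~ s/"/""/g;
    return qq{"$escaped"};
}

print excel_quote('field1 "junk" and more, etc'), "\n";
# prints: "field1 ""junk"" and more, etc"
```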

word_line_delimiter_escape
$current_value = $f->word_line_delimiter_escape("\\");

Get/set WORD_LINE_DELIMITER_ESCAPE. Sometimes you'll encounter (or want to create) files where WORD_DELIMITERs and/or LINE_DELIMITERs are embedded in the data and the creator had the notion (courtesy?) to escape those characters when they appeared within a field with, say, '\'. If so, you'll want to set WORD_LINE_DELIMITER_ESCAPE to that character.

If WORD_LINE_DELIMITER_ESCAPE is specified, this character must be escaped by the same character to be included in a field. For example, for a tab-delimited file where WORD_LINE_DELIMITER_ESCAPE => '\', the following is a sample record with an embedded newline:

me>TAB<you>TAB<this is a single field that contains an escaped line terminator\
an escaped tab\>TAB< and an actual \\>TAB<this is the next field...

Do not use WORD_LINE_DELIMITER_ESCAPE for data with fields that are enclosed in quotes.

WORD_LINE_DELIMITER_ESCAPE cannot be '_'; an attempt to set it to '_' is silently ignored.

Default = undef()

Always returns the current set value.

WARNING! Due to buffering, don't change WORD_LINE_DELIMITER_ESCAPE during program execution!
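The escaping convention described above can be sketched in plain Perl as follows. escape_field is a hypothetical helper for preparing such data by hand; it is not a Text::LooseCSV method and makes no claim about how the module itself parses or writes escaped fields.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix every occurrence of the escape character,
# the word delimiter and the line delimiter with the escape character.
sub escape_field {
    my ($value, $esc, $word_delim, $line_delim) = @_;
    my $pattern = join '|', map { quotemeta } $esc, $word_delim, $line_delim;
    (my $out = $value) =~ s/($pattern)/$esc$1/g;
    return $out;
}

# A field holding a tab, a literal backslash and a newline.
my $field = "col\tvalue with \\ and\nsecond line";
print escape_field($field, "\\", "\t", "\n"), "\n";
```

The substitution runs in a single pass, so an escape character inserted by the helper is never escaped a second time.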

no_quotes
$current_value = $f->no_quotes($bool);

Get/set NO_QUOTES. Instruct form_record to strip WORD_DELIMITER and LINE_DELIMITER from fields within the record and never to enclose fields in quotes.

By default, if a WORD_DELIMITER or LINE_DELIMITER is encountered in a field value during record formation, that field will be enclosed in quotes. However, if NO_QUOTES = 1, any occurrence of WORD_DELIMITER or LINE_DELIMITER will be stripped from the value and no enclosing quotes will be used.

If ALWAYS_QUOTE = 1 this attribute is ignored and quotes will always be used.

Only affects form_record.

Default = 0 (by default records created with form_record may have fields enclosed in quotes)

Always returns the current set value.

always_quote
$current_value = $f->always_quote($bool);

Get/set ALWAYS_QUOTE. Always enclose fields in quotes when using form_record. Only affects form_record. Takes precedence over no_quotes.

Default = 0

Always returns the current set value.
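A short sketch of how the two quoting flags might be compared side by side. The sample fields are made up, and the exact strings returned by form_record are not shown because they depend on the module; only the behavior described above is assumed.

```perl
use strict;
use warnings;
use Text::LooseCSV;

my $f = Text::LooseCSV->new();
my @fields = ('Smith, John', '30.00', 'VHS');   # first field holds a delimiter

# Default: only the field containing the delimiter should be quoted.
my $default = $f->form_record(\@fields);

# NO_QUOTES: the embedded delimiter should be stripped, no quotes added.
$f->no_quotes(1);
my $stripped = $f->form_record(\@fields);

# ALWAYS_QUOTE takes precedence over NO_QUOTES: every field quoted.
$f->always_quote(1);
my $quoted = $f->form_record(\@fields);

print "$_\n" for $default, $stripped, $quoted;
```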

max_linebuf
$current_value = $f->max_linebuf($integer);

Get/set MAX_LINEBUF. A file is read in line chunks and, because newlines are allowed to be embedded in the field values, many lines may be read and buffered before the whole record is determined. MAX_LINEBUF sets the maximum number of lines that are used to parse a record before the first line of that block is deemed junk and -1 is returned from next_record. Processing then continues at the very next line in the file.

Default = 1000

Always returns the current set value.

recadd
$current_value = $f->recadd($bool);

Get/set RECADD. If set to true, LINE_DELIMITER (actually $/) will be added to the end of the value returned from form_record. Only affects form_record.

Default = 0

Always returns the current set value.

input_file
$current_value = $f->input_file($fh);

Get/set the filehandle of the file to be parsed (e.g. IO::File object). May also be set in the constructor.

Default = undef

Always returns the current set value.

input_text
$textbuf = $f->input_text($text_blob);

Alternative to input_file: feed the entire text of a file or scalar to $f at once. Accepts a scalar or a scalar reference.

Returns the internal textbuf attribute.

next_record
$rec = $f->next_record();

Parses and returns an arrayref of the fields of the next record.

Returns '' (the empty string) if EOF is encountered.

Returns -1 if the next record is corrupted (incomplete, etc.) or if MAX_LINEBUF is reached.

WARNING! Due to buffering, don't change $/ or LINE_DELIMITER during program execution!
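The three possible return values can be told apart as in the following sketch, which feeds a small in-memory sample through input_text. The sample text and the counters are illustrative assumptions; the text uses "\r\n" to match the default LINE_DELIMITER.

```perl
use strict;
use warnings;
use Text::LooseCSV;

# Made-up sample: a doubled quote and an embedded line break in one field.
my $text = qq{id,title,note\r\n}
         . qq{1,"Widget ""A""","first line\r\nsecond line"\r\n};

my $f = Text::LooseCSV->new();
$f->max_linebuf(50);          # give up on a record after 50 buffered lines
$f->input_text($text);

my ($good, $corrupt) = (0, 0);
while (my $rec = $f->next_record()) {   # '' at EOF ends the loop
    if (!ref $rec) {                    # -1 signals a corrupt record
        $corrupt++;
        warn "corrupt rec: ", $f->cur_line, "\n";
        next;
    }
    $good++;                            # otherwise $rec is an arrayref
    print join(' | ', @$rec), "\n";
}
print "parsed $good record(s), skipped $corrupt\n";
```

Testing with ref rather than comparing against -1 avoids a numeric comparison on an array reference, which is otherwise harmless but easy to misread.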

cur_line
$raw = $f->cur_line();

Returns the raw text line currently being processed (including a line terminator if originally present).

form_record
$line = $f->form_record($array_of_fields);

Returns a text scalar containing the fields of $array_of_fields joined by WORD_DELIMITER to form a variable-length record. Also see recadd.

$array_of_fields may be an array or arrayref.
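Putting form_record and recadd together, a file could be written along these lines. The rows and the temporary output file are assumptions made for the sketch; with RECADD set, each record should already end with the line delimiter, so nothing extra is printed after it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::LooseCSV;

# Made-up rows; the second one has a field containing the delimiter.
my @rows = (
    [ 'id', 'title',         'price' ],
    [ 1,    'Widget, large', '30.00' ],
    [ 2,    'Gadget',        '12.50' ],
);

my $f = Text::LooseCSV->new();
$f->recadd(1);                       # append the line delimiter to each record

my ($out, $path) = tempfile();       # scratch output file for the demo
print {$out} $f->form_record($_) for @rows;
close($out) or die "Cannot close $path: $!";

print "wrote ", scalar(@rows), " records to $path\n";
```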

BUGS

None as yet. This code was used at several production sites before its public release.

AUTHORS

Reed Sandberg (reed_sandberg 'AT' yahoo dot com)

COPYRIGHT

Copyright (C) 2001-2005 Reed Sandberg. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
