NAME
Text::LooseCSV - Highly forgiving variable length record text parser; compare to MS Excel
SYNOPSIS
use Text::LooseCSV;
use IO::File;
$fh = new IO::File $fname;
$f = new Text::LooseCSV($fh);
# Some optional settings
$f->word_delimiter("\t");
$f->line_delimiter("\n");
$f->no_quotes(1);
# Parse/split a line
while ($rec = $f->next_record())
{
    if ($rec == -1)
    {
        warn("corrupt rec: ", $f->cur_line);
        next;
    }
    # process $rec as arrayref
    ...
}
# Or, (vice-versa) create a variable-length record file
$line = $f->form_record( [ 'Debbie Does Dallas','30.00','VHS','Classic' ] );
DESCRIPTION
Why another variable-length text record parser? I've had the privilege to parse some of the gnarliest data ever seen and everything else I tried on CPAN choked (at the time I wrote this module). This module has been munching on millions of records of the filthiest data imaginable at several production sites so I thought I'd contribute.
This module follows somewhat loose rules (compare to MS Excel) and will handle embedded newlines, etc. It is capable of handling large files and processes data in line-chunks. If MAX_LINEBUF is reached, however, it will mark the current record as corrupt, return -1 and start over again at the very next line. This will (of course) process tab-delimited data or whatever value you set for word_delimiter.
Methods are called in perl OO fashion.
WARNING! This module messes with $/. line_delimiter() sets $/ and is always called during construction. Don't change $/ during program execution!
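If other code in the same program needs a different input record separator (say, to slurp a small file), one common Perl idiom is to localize $/ inside a sub or block so the previous value is restored on exit. The sketch below shows only that idiom. It assumes no Text::LooseCSV method is called while the localized value is in effect, and the helper name slurp_text is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::LooseCSV): slurp an in-memory
# "file" without permanently changing the global $/.
sub slurp_text {
    my ($text) = @_;
    open(my $in, '<', \$text) or die "cannot open in-memory handle: $!";
    local $/;                  # undef => slurp mode, only until this sub returns
    my $all = <$in>;
    close($in);
    return $all;
}

$/ = "\r\n";                   # stand-in for the value line_delimiter() would set
my $before  = $/;
my $content = slurp_text("a\r\nb\r\n");
my $after   = $/;              # restored automatically when slurp_text returned

print "separator preserved\n" if $before eq $after;
```

Because local restores the old value on scope exit, the separator seen by the rest of the program is unchanged.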
METHOD DETAILS
new (constructor)
-
$f = new Text::LooseCSV($fh);
Create a new Text::LooseCSV object for all your variable-length record needs with an optional file handle, $fh (e.g. IO::File). Set properties using the accessor methods as needed.
If $fh is not given, you can use input_file() or input_text().
Returns a blessed Text::LooseCSV object.
line_delimiter
-
$current_value = $f->line_delimiter("\n");
Get/set LINE_DELIMITER. LINE_DELIMITER defines the line boundary chunks that are read into the buffer and loosely defines the record delimiter.
For parsing, this does not strictly affect the record/field structures as fields may have embedded newlines, etc. However, this DOES need to be set correctly.
Default = "\r\n" NOTE! The default is Windows format.
Always returns the current set value.
WARNING! line_delimiter() also sets $/ and is always called during construction. Due to buffering, don't change $/ or LINE_DELIMITER during program execution!
word_delimiter
-
$current_value = $f->word_delimiter("\t");
Get/set WORD_DELIMITER. WORD_DELIMITER defines the field boundaries within the record. WORD_DELIMITER may only be set to a single character, otherwise a warning is generated and the new value is ignored.
Default = "," NOTE! Single character only.
Always returns the current set value.
WARNING! Due to buffering, don't change WORD_DELIMITER during program execution!
quote_escape
-
$current_value = $f->quote_escape("\\");
Get/set QUOTE_ESCAPE. For data that have fields enclosed in quotes, QUOTE_ESCAPE defines the escape character for '"' e.g. for the default QUOTE_ESCAPE = '"', to embed a quote character in a field (MS Excel style):
"field1 ""junk"" and more, etc"
Default = '"'
Always returns the current set value.
WARNING! Due to buffering, don't change QUOTE_ESCAPE during program execution!
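To make the quoting convention concrete, here is a minimal sketch of escaping and unescaping a single quoted field. It is independent of Text::LooseCSV's internals: quote_field and unquote_field are hypothetical names, and the escape character is passed in explicitly with the same default as QUOTE_ESCAPE.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Text::LooseCSV.
# Enclose a field in double quotes, prefixing each embedded '"' with $esc.
sub quote_field {
    my ($value, $esc) = @_;
    $esc = '"' unless defined $esc;          # same default as QUOTE_ESCAPE
    (my $out = $value) =~ s/"/$esc"/g;
    return qq{"$out"};
}

# Reverse of quote_field: drop the enclosing quotes and un-escape.
sub unquote_field {
    my ($text, $esc) = @_;
    $esc = '"' unless defined $esc;
    (my $out = $text) =~ s/\A"(.*)"\z/$1/s;
    $out =~ s/\Q$esc\E"/"/g;
    return $out;
}

my $raw    = 'field1 "junk" and more, etc';
my $quoted = quote_field($raw);              # Excel style: "" stands for "
my $back   = unquote_field($quoted);

print "$quoted\n";
print "round trip ok\n" if $back eq $raw;
```

The sketch does not escape the escape character itself, so it is only a faithful round trip when the field contains no stray escape characters.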
word_line_delimiter_escape
-
$current_value = $f->word_line_delimiter_escape("\\");
Get/set WORD_LINE_DELIMITER_ESCAPE. Sometimes you'll encounter (or want to create) files where WORD_DELIMITERs and/or LINE_DELIMITERs are embedded in the data and the creator had the notion (courtesy?) to escape those characters when they appeared within a field with, say, '\'. If so, you'll want to set WORD_LINE_DELIMITER_ESCAPE to that character.
If WORD_LINE_DELIMITER_ESCAPE is specified, this character must itself be escaped by the same character to be included in a field. For example, for a tab-delimited file where WORD_LINE_DELIMITER_ESCAPE => '\', the following is a sample record with an embedded newline:
me<TAB>you<TAB>this is a single field that contains an escaped line terminator\
an escaped tab\<TAB> and an actual \\<TAB>this is the next field...
Do not use WORD_LINE_DELIMITER_ESCAPE for data with fields that are enclosed in quotes.
WORD_LINE_DELIMITER_ESCAPE cannot be '_'; an attempt to set it to '_' is silently ignored.
Default = undef()
Always returns the current set value.
WARNING! Due to buffering, don't change WORD_LINE_DELIMITER_ESCAPE during program execution!
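The escaping rule can be illustrated with a small character scanner. This is a sketch under stated assumptions, not the module's parser: split_escaped is a hypothetical helper, it works on one already-assembled record, and it assumes a single-character line terminator ("\n").

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::LooseCSV.
# Split one record on $delim, treating "$esc<char>" as a literal <char>.
sub split_escaped {
    my ($record, $delim, $esc) = @_;
    my @fields = ('');
    my @chars  = split //, $record;
    while (@chars) {
        my $c = shift @chars;
        if ($c eq $esc && @chars) {
            $fields[-1] .= shift @chars;     # keep the escaped character as data
        }
        elsif ($c eq $delim) {
            push @fields, '';                # unescaped delimiter: next field
        }
        else {
            $fields[-1] .= $c;
        }
    }
    return @fields;
}

# Third field holds an escaped newline, an escaped tab and an escaped backslash.
my $record = "me\tyou\tline one\\\nline two with a tab\\\t and a backslash \\\\\tnext";
my @fields = split_escaped($record, "\t", "\\");

print scalar(@fields), " fields\n";
```

Only the unescaped tabs separate fields, so the escaped newline, tab and backslash all end up as data inside the third field.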
no_quotes
-
$current_value = $f->no_quotes($bool);
Get/set NO_QUOTES. Instruct form_record to strip WORD_DELIMITER and LINE_DELIMITER from fields within the record and never to enclose fields in quotes.
By default, if a WORD_DELIMITER or LINE_DELIMITER is encountered in a field value during record formation, that field will be enclosed in quotes. However, if NO_QUOTES = 1, any occurrence of WORD_DELIMITER or LINE_DELIMITER will be stripped from the value and no enclosing quotes will be used.
If ALWAYS_QUOTE = 1 this attribute is ignored and quotes will always be used.
Only affects form_record.
Default = 0 (by default records created with form_record may have fields enclosed in quotes)
Always returns the current set value.
always_quote
-
$current_value = $f->always_quote($bool);
Get/set ALWAYS_QUOTE. Always enclose fields in quotes when using form_record. Only affects form_record. Takes precedence over no_quotes.
Default = 0
Always returns the current set value.
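The interaction between NO_QUOTES and ALWAYS_QUOTE can be restated as a short sketch. This is not the module's code: sketch_form_record and its option names are hypothetical, and the doubling of embedded quotes is an assumption based on the default QUOTE_ESCAPE rather than documented form_record behavior.

```perl
use strict;
use warnings;

# Hypothetical re-statement of the documented quoting rules; the real
# form_record may differ in details (e.g. how embedded quotes are handled).
sub sketch_form_record {
    my ($fields, %opt) = @_;
    my $wd = defined $opt{word_delimiter} ? $opt{word_delimiter} : ',';
    my $ld = defined $opt{line_delimiter} ? $opt{line_delimiter} : "\r\n";
    my @out;
    for my $value (@$fields) {
        my $v = $value;
        my $has_delim = index($v, $wd) >= 0 || index($v, $ld) >= 0;
        if ($opt{always_quote}) {            # takes precedence over no_quotes
            $v =~ s/"/""/g;                  # assumption: Excel-style doubling
            $v = qq{"$v"};
        }
        elsif ($opt{no_quotes}) {
            $v =~ s/\Q$wd\E|\Q$ld\E//g;      # strip delimiters, never quote
        }
        elsif ($has_delim) {                 # default: quote only when needed
            $v =~ s/"/""/g;
            $v = qq{"$v"};
        }
        push @out, $v;
    }
    return join($wd, @out);
}

my @rec      = ('plain', 'has,comma');
my $default  = sketch_form_record(\@rec);
my $stripped = sketch_form_record(\@rec, no_quotes => 1);
my $quoted   = sketch_form_record(\@rec, always_quote => 1, no_quotes => 1);

print "$default\n$stripped\n$quoted\n";
```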
max_linebuf
-
$current_value = $f->max_linebuf($integer);
Get/set MAX_LINEBUF. A file is read in line chunks and because newlines are allowed to be embedded in the field values, many lines may be read and buffered before the whole record is determined. MAX_LINEBUF sets the maximum number of lines that are used to parse a record before the first line of that block is determined junk and -1 is returned from next_record. Processing then continues at the very next line in the file.
Default = 1000
Always returns the current set value.
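The buffering and recovery behavior can be pictured with a simplified sketch. It is not the module's algorithm: sketch_read_records is a hypothetical helper, and "record is complete" is approximated here by an even count of '"' characters in the buffered text, which is an assumption made only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's code. A record is treated as complete
# when the buffered text holds an even number of '"' characters.
sub sketch_read_records {
    my ($lines, $max_linebuf) = @_;
    my @out;
    my $start = 0;
    while ($start < @$lines) {
        my ($buf, $used, $complete) = ('', 0, 0);
        while ($start + $used < @$lines && $used < $max_linebuf) {
            $buf .= $lines->[$start + $used];
            $used++;
            my $quotes = () = $buf =~ /"/g;  # count quote characters
            if ($quotes % 2 == 0) { $complete = 1; last; }
        }
        if ($complete) {
            push @out, $buf;
            $start += $used;                 # consume the whole record
        }
        else {
            push @out, -1;                   # first line of the block is junk
            $start += 1;                     # resume at the very next line
        }
    }
    return @out;
}

# Line 3 opens a quote that never closes; with a limit of 2 it is given up on.
my @lines   = ("a,\"b\n", "c\",d\n", "x,\"y\n", "p,q\n", "r,s\n");
my @records = sketch_read_records(\@lines, 2);

print scalar(@records), " results\n";
```

In this sample the first two lines form one record, the third line is reported as -1, and the lines after it are still parsed normally, which is the recovery the limit is meant to provide.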
recadd
-
$current_value = $f->recadd($bool);
Get/set RECADD. If set to true, LINE_DELIMITER (actually $/) will be added to the end of the value returned from form_record. Only affects form_record.
Default = 0
Always returns the current set value.
input_file
-
$current_value = $f->input_file($fh);
Get/set the filehandle of the file to be parsed (e.g. IO::File object). May also be set in the constructor.
Default = undef
Always returns the current set value.
input_text
-
$textbuf = $f->input_text($text_blob);
Alternative to input_file: feed the entire text of a file or scalar to $f at once. Accepts scalar or scalar reference.
Returns the internal textbuf attr.
next_record
-
$rec = $f->next_record();
Parses and returns an arrayref of the fields of the next record.
Returns '' if EOF is encountered.
Returns -1 if the next record is corrupted (incomplete, etc) or if MAX_LINEBUF is reached.
WARNING! Due to buffering, don't change $/ or LINE_DELIMITER during program execution!
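Since next_record has three kinds of return value, it can help to classify them in one place. The helper below is a hypothetical sketch (classify_record is not part of the module); it uses ref() and a string comparison so that an arrayref is never compared numerically.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a next_record() return value according to
# the documented contract (arrayref, -1, or '').
sub classify_record {
    my ($rec) = @_;
    return 'record'  if ref($rec) eq 'ARRAY';
    return 'corrupt' if defined $rec && $rec eq '-1';
    return 'eof';
}

# Stand-ins for the values next_record() is documented to return.
my @kinds = map { classify_record($_) } ([ 'a', 'b' ], -1, '');

print join(',', @kinds), "\n";

# With a real object the loop might look like:
#   while (1) {
#       my $rec  = $f->next_record();
#       my $kind = classify_record($rec);
#       last if $kind eq 'eof';
#       if ($kind eq 'corrupt') { warn "corrupt rec: ", $f->cur_line; next; }
#       # ... use @$rec ...
#   }
```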
cur_line
-
$raw = $f->cur_line();
Returns the raw text line currently being processed (including a line terminator if originally present).
form_record
-
$line = $f->form_record($array_of_fields);
Returns the fields of $array_of_fields joined by WORD_DELIMITER into a single variable-length text record (a scalar). Also see recadd.
$array_of_fields may be an array or arrayref.
BUGS
None as yet. This code has been used at several production sites before publishing to the public.
AUTHORS
Reed Sandberg, <reed_sandberg Ӓ yahoo>
COPYRIGHT
Copyright (C) 2001-2007 Reed Sandberg All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.