NAME
File::Tabular - searching and editing flat tabular files
SYNOPSIS
use File::Tabular;
my $f = File::Tabular->new($filename);
my $row = $f->fetchrow;
print $row->{field1}, $row->{field2};
$row = $f->fetchrow(where => 'someWord');
$row = $f->fetchrow(where => 'field1 > 4 AND field2 >= "01.01.2001"');
$row = $f->fetchrow(where => qr/some\s+(complex\s*)?(regex|regular expression)/i);
$f->rewind;
my $rows = $f->fetchall(where => 'someField =~ ^[abc]+');
print $_->{someField} foreach @$rows;
$f->rewind;
$rows = $f->fetchall(where => '+field1:someWord -field2:otherWord',
orderBy => 'field3, field6:num, field5:-alpha');
$f->rewind;
my $hashRows = $f->fetchall(where => 'foo AND NOT bar',
key => 'someField');
print $hashRows->{someKey}{someOtherField};
# open for updates, and remember the updates in a journal file
$f = File::Tabular->new("+<$filename", {journal => $journalFile});
# updates at specific positions (line numbers)
$f->splices(4 => 2, undef, # delete 2 lines from position 4
7 => 1, {f1 => $v1, f2 => $v2, ...}, # replace line 7
9 => 0, { ...}, # insert 1 new line at position 9
22 => 0, [{...}, {...}, ...] # insert several lines at pos. 22
...
-1 => 0, [{...}, {...}, ...] # append at the end
);
# shorthand to add new data at the end
$f->append({f1 => $v1, f2 => $v2, ...});
# same thing, but use the "Hash::Type" associated to the file
$f->append($f->ht->new($v1, $v2, ...));
$f->clear; # removes all data (but keeps the header line)
# updates at specific keys, corresponding to @keyFields
$f->writeKeys({key1 => {f1 => $v1, f2 => $v2, ...}, # add or update
key2 => undef, # remove
...
}, @keyFields);
# replay the updates on a backup file
my $bck = File::Tabular->new("+<$backupFile");
$bck->playJournal($journalFile);
# get info from associated filehandle
printf "%d size, %d blocks", $f->stat->{size}, $f->stat->{blocks};
my $mtime = $f->mtime;
printf "time last modified : %02d:%02d:%02d", @{$mtime}{qw(hour min sec)};
DESCRIPTION
A tabular file is a flat text file containing data organised in rows (records) and columns (fields).
This module provides database-like functionality for managing tabular files: retrieving, searching, writing, autonumbering, journaling. However, unlike modules such as DBD::CSV, it doesn't try to make a tabular file look like a database: rather, the API was designed specifically for working with tabular files. Instead of SQL, search queries are specified in a web-like fashion, with support for regular expressions and cross-field searches. Queries are compiled internally into Perl closures before being applied to each data record, which makes searching quite fast.
Write operations take a list of modifications as argument; the whole list is then applied atomically, in a single rewrite of the data file.
Here are some of the reasons why you might choose to work with a tabular file rather than a regular database:
no need to install a database system (let alone buy one)!
easy portability and data exchange with external tools (text editor, spreadsheet, etc.)
search queries immediately ready for a web application
good search performance, even with several thousand records
On the other hand, tabular files will probably be inappropriate if you need very large volumes of data, complex multi-table data models or frequent write operations.
METHODS
new (open1, open2, ..., {opt1 => v1, opt2 => v2, ...})
Creates a new tabular file object. The list of arguments open1, open2, ... is fed directly to "open" in perlfunc for opening the associated file. A reference to an already opened filehandle can also be given instead. The final hash ref is a collection of optional parameters, taken from the following list (a combined example is shown after the list):
- fieldSep

field separator: any character except '%' ('|' by default). Escape sequences like \t are admitted.

- recordSep

record separator ('\n' by default).

- crlf

if true, filehandles use the PerlIO layer :crlf. This is the default, for unfortunate historical reasons.

- fieldSepRepl

string substituted for fieldSep if it occurs within the data (by default, the URL encoding of fieldSep, i.e. '%7C').

- recordSepRepl

string substituted for recordSep if it occurs within the data (by default, the URL encoding of recordSep, i.e. '%0A').

- autoNumField

name of a field for which autonumbering is turned on (none by default). This is useful to generate keys: when you write a record, the character '#' in that field will be replaced by a fresh number, incremented automatically. This number will be 1 + the largest number read so far (it is your responsibility to read all records before the first write operation).

- autoNum

initial value of the counter for autonumbering (1 by default).

- autoNumChar

character that will be substituted by an autonumber when writing records ('#' by default).

- flockMode

mode for locking the file; see "flock" in perlfunc. By default, this will be LOCK_EX if open1 contains '>' or '+<', LOCK_SH otherwise.

- flockAttempts

number of attempts to lock the file, at 1-second intervals, before returning an error (zero by default). If nonzero, LOCK_NB is added to flockMode; if zero, a single locking attempt will be made, blocking until the lock is available.

- headers

reference to an array of field names. If not present, headers will be read from the first line of the file.

- printHeaders

if true, the headers will be printed to the file. If not specified, treated as true if open1 contains '>'.

- journal

name of a journaling file, or reference to a list of arguments for "open" in perlfunc. The journaling file will log all write operations. If specified as a simple file name, it will be opened in '>>' mode.

A journal file can then be replayed through the method "playJournal" (this is useful to recover after a crash, by playing the journal on a backup copy of your data).

- rxDate

regular expression for matching a date. The default value is qr/^\d\d?\.\d\d?\.\d\d\d?\d?$/. This will be used by "compileFilter" to perform appropriate comparisons.

- date2str

reference to a function for transforming dates into strings suitable for sorting (i.e. year-month-day). The default is:

  sub {
    my ($d, $m, $y) = ($_[0] =~ /(\d\d?)\.(\d\d?)\.(\d\d\d?\d?)$/);
    $y += ($y > 50) ? 1900 : 2000 if defined($y) && $y < 100;
    return sprintf "%04d%02d%02d", $y, $m, $d;
  }

- rxNum

regular expression for matching a number. The default value is qr/^[-+]?\d+(?:\.\d*)?$/. This will be used by "compileFilter" to perform appropriate comparisons.

- preMatch/postMatch

strings to insert before or after a match when filtering rows. These only apply to the search operator ':' on the whole line: the query "foo OR bar" will highlight both "foo" and "bar", but the query "~ 'foo' OR someField:bar" will not highlight anything; furthermore, a match-all request containing just '*' will not highlight anything either.

- avoidMatchKey

if true, searches will avoid matching on the first field. So a request like $ft->fetchall(where => '123 OR 456') will not find the record with key 123, unless the word '123' appears somewhere in the other fields. This is useful when queries come from a Web application and you don't want users to match a purely technical field. This behaviour does not apply to regex searches, so requests like $ft->fetchall(where => qr/\b(123|456)\b/) or $ft->fetchall(where => ' ~ 123 OR ~ 456') will actually find the record with key 123.
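As a combined illustration of these options, here is a minimal sketch of a constructor call; the file name 'people.txt', the journal name 'people.journal' and the field 'Id' are assumptions invented for this example:

  use File::Tabular;

  # sketch only : file, journal and field names are hypothetical
  my $f = File::Tabular->new("+<people.txt", {
    fieldSep      => '\t',             # tab-separated instead of '|'
    autoNumField  => 'Id',             # '#' in this field gets a fresh number
    flockAttempts => 3,                # retry the exclusive lock 3 times
    journal       => 'people.journal', # log all write operations
  });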
fetchrow(where => filter)
Returns the next record matching the (optional) filter. If there is no filter, just returns the next record.
The filter is either a code reference generated by "compileFilter", or a string which will be automatically fed as argument to "compileFilter"; this string can contain just a word, a regular expression, a complex boolean query involving field names and operators, etc., as explained below.
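A typical pattern is to fetch matching records one by one until none remain; a minimal sketch, assuming a field named 'Name':

  $f->rewind;
  while (my $row = $f->fetchrow(where => 'smith')) {
    print $row->{Name}, "\n";    # 'Name' is an assumed field name
  }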
fetchall(where => filter, orderBy => cmp)
fetchall(where => filter, key => keySpecif)
Finds all remaining records matching the (optional) filter. If there is no filter, returns all remaining records.
The filter is either a code reference generated by "compileFilter", or a string which will be automatically fed as argument to "compileFilter".
The return value depends on context and on arguments:
If no key parameter is given and we are in scalar context, fetchall returns a reference to an array of records. The optional orderBy parameter can be a field name, a ref to a list of field names, a string like "field1: -alpha, field2:-num, ...", or, more generally, a user-provided comparison function; see "cmp" in Hash::Type for a fully detailed explanation. Without orderBy, the resulting array is in data source order.
If no key parameter is given and we are in list context, fetchall returns a pair: the first item is a reference to an array of records as explained above; the second item is a reference to an array of the line numbers corresponding to those records (the first data line has number 0). These line numbers might be useful later if you update the records through the "splices" method. No orderBy is allowed if fetchall is called in list context.
If a key parameter is given, fetchall returns a reference to a hash whose values are the retrieved records, and whose keys are built according to the keySpecif argument. This must be either a single field name (scalar), or a list of field names (ref to an array of scalars). The values corresponding to those field names form the key for each entry of the hash; if necessary, multiple values are joined together through $;. No orderBy argument is allowed, because hashes have no ordering.
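The three calling styles can be contrasted in a short sketch; the field names 'Name' and 'Id', and the key value 42, are assumptions:

  # scalar context : array ref of records, sorted by Name
  my $rows = $f->fetchall(where => 'smith', orderBy => 'Name');

  # list context : records plus their line numbers (no orderBy allowed)
  $f->rewind;
  my ($records, $lineNumbers) = $f->fetchall(where => 'smith');

  # with a key parameter : hash ref indexed by the Id field
  $f->rewind;
  my $byId = $f->fetchall(key => 'Id');
  print $byId->{42}{Name};   # assumes a record with Id 42 exists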
rewind
Rewinds the file to the first data line (after the headers).
ht
Returns the Hash::Type instance associated with the file.
headers
Returns the list of field names.
stat
Returns a hash ref corresponding to a call of stat on the associated filehandle. Keys of the hash have names as documented in stat. Example:

  printf "%d size, %d blocks", $f->stat->{size}, $f->stat->{blocks};
atime, mtime, ctime
Each of these methods returns a hash ref corresponding to a call of localtime on, respectively, the last access time, the last modification time, or the last inode change time of the associated filehandle (see stat for explanations). Keys of the hash have names as documented in localtime. Example:

  my $mtime = $f->mtime;
  printf "time last modified : %02d:%02d:%02d", @{$mtime}{qw(hour min sec)};
splices
  splices(pos1 => 2, undef,              # delete 2 lines
          pos2 => 1, row,                # replace 1 line
          pos3 => 0, [row1, row2, ...],  # insert lines
          ...
          -1   => 0, [row1, ...]         # append lines
         );                              # special case : autonum if pos == -1
Updates the data, in a spirit similar to "splice" in perlfunc (hence the name of the method). The whole file is rewritten in an atomic operation, deleting, replacing or appending data lines as specified by the "splice instructions". Returns the number of "splice instructions" performed.
A splice instruction is a triple composed of:
a position (line number) that specifies the place where modifications will occur. Line numbers start at 0. Position -1 means end of data.
a number of lines to delete (which may be zero).
a ref to a hash, or to a list of hashes, containing the new data to insert (or undef if there is no new data).
If there are several splice instructions, their positions must be sorted in increasing order (except of course position -1, meaning "end of data", which must appear last).
Positions always refer to line numbers in the original file, before any modifications. Therefore, it makes no sense to write
splices(10 => 5, undef, 12 => 0, $myRow)
because after deleting 5 rows at line 10, we cannot insert a new row at line 12.
The whole collection of splice instructions may also be passed as an array ref instead of a list.
If you intend to fetch rows again after a splice, you must rewind the file first.
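As a sketch of that ordering rule, the following call deletes five rows and inserts one record, with both positions counted in the original file; the field names are invented for the example:

  # delete original lines 10..14, then insert one record at what
  # was line 20 of the ORIGINAL file (positions are pre-modification)
  $f->splices(
    10 => 5, undef,
    20 => 0, {Name => 'new entry', Status => 'active'},  # assumed fields
  );
  $f->rewind;   # required before fetching rows again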
append(row1, row2, ...)
This appends new records at the end of the data, i.e. it is a shorthand for

  splices(-1 => 0, [row1, row2, ...])
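Combined with the autoNumField option of "new", append can generate fresh keys; a sketch, assuming the hypothetical constructor example above with the autonumbered field 'Id':

  # read all records first, so the autonumbering counter is up to date
  $f->fetchall;

  # '#' is replaced by 1 + the largest Id read so far
  $f->append({Id => '#', Name => 'Alice'});   # 'Name' is an assumed field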
clear
Removes all data (but keeps the header line).
writeKeys({key1 => row1, key2 => ...}, @keyFields)
Rewrites the whole file, applying modifications as specified in the hash ref passed as first argument. Keys in this hash are compared to keys built from the original data, according to @keyFields. Therefore row1 may replace an existing row, if the key corresponding to key1 was found; otherwise, a new row is added. If row1 is undef, the corresponding row is deleted from the file.
@keyFields must contain the names of one or several fields that build up the primary key. For each data record, the values corresponding to those fields are taken and joined together through $;, and then compared to key1, key2, etc.
If you intend to fetch rows again after a writeKeys, you must rewind the file first.
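A short sketch of both forms, with invented field names; note how a composite key is joined through $; :

  # single-field key : add/update key 42, delete key 57
  $f->writeKeys({
    42 => {Id => 42, Name => 'Alice'},   # add or update
    57 => undef,                         # remove
  }, 'Id');

  # composite key : LastName and FirstName joined through $;
  $f->rewind;
  $f->writeKeys({
    join($;, 'Smith', 'John') => {LastName  => 'Smith',
                                  FirstName => 'John'},
  }, qw(LastName FirstName));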
playJournal(open1, open2, ...)
Reads a sequence of update instructions from a journal file and applies them to the current tabular file. Arguments open1, open2, ... will be passed to Perl's open for opening the journal file; in most cases, just give the filename. The journal file must contain a sequence of instructions as encoded by the automatic journaling function of this module; to activate journaling, see the journal parameter of the "new" method.
compileFilter(query [, implicitPlus])
Compiles a query into a filter (code reference) that can be passed to "fetchrow" or "fetchall".
The query can be:
a regular expression compiled through qr/.../. The regex will be applied to whole data lines, and therefore covers all fields at once. This is the fastest way to filter lines, because it avoids systematically splitting lines into data records.
a data structure resulting from a previous call to Search::QueryParser::parse.
a string of the shape K_E_Y:value (without any spaces before or after ':'). This will be compiled into a regex matching value in the first column. The special spelling is meant to avoid collision with a real field hypothetically named 'KEY'.
a string that will be analyzed through Search::QueryParser, and then compiled into a filter function. The query string can contain boolean combinators, parentheses, comparison operators, etc., as documented in Search::QueryParser. The optional second argument implicitPlus is passed to Search::QueryParser::parse; if true, an implicit '+' is added in front of every query item (therefore the whole query is a big AND).
Notice that in addition to the usual comparison operators, you can also use regular expressions in queries, like

  +field1=~'^[abc]+' +field2!~'foobar$'

The query compiler needs to distinguish between word and non-word characters; therefore it is important to use locale in your scripts (see perllocale). The compiler tries to be clever about a number of details:

- looking for complete words

Words in queries become regular expressions enclosed by \b (word boundaries); so a query for foo OR bar will not match foobar.

- supports * for word completion

A '*' in a word is compiled into the regular expression \w*; so the queries foo* and *bar will both match foobar.

- case-insensitive, accent-insensitive

Iso-latin-1 accented characters are translated into character classes, so for example hétaïre becomes qr/h[ée]ta[ïi]re/i. Furthermore, as shown in this example, the i flag is turned on (case-insensitive). Therefore this query will also match HETAIRE.

- numbers and dates in operators

When compiling a subquery like fieldname >= 'value', the compiler checks the value against rxNum and rxDate (as specified in the "new" method). Depending on these tests, the subquery is translated into a string comparison, a numerical comparison, or a date comparison (more precisely, {date2str($a) cmp date2str($b)}).

- set of integers

Operator # means comparison with a set of integers; internally this is implemented with a bit vector. So the query Id#2,3,5,7,11,13,17 will return records where field Id contains one of the listed integers. The field name may be omitted if it is the first field (usually the key field).

- pre/postMatch

Words matched by a query can be highlighted; see the preMatch and postMatch parameters of the "new" method.
AUTHOR
Laurent Dami, <laurent.dami AT etat ge ch>
COPYRIGHT AND LICENSE
Copyright (C) 2005 by Laurent Dami.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.