NAME
Data::CTable - Read, write, manipulate tabular data
SYNOPSIS
## Read some data files in various tabular formats
use Data::CTable;
my $People = Data::CTable->new("people.merge.mac.txt");
my $Stats = Data::CTable->new("stats.tabs.unix.txt");
## Clean stray whitespace in fields
$People->clean_ws();
$Stats ->clean_ws();
## Retrieve columns
my $First = $People->col('FirstName');
my $Last = $People->col('LastName' );
## Calculate a new column based on two others
my $Full = [map {"$First->[$_] $Last->[$_]"} @{$People->all()}];
## Add new column to the table
$People->col(FullName => $Full);
## Another way to calculate a new column
$People->col('Key');
$People->calc(sub {no strict 'vars'; $Key = "$Last,$First";});
## "Left join" records matching Stats:PersonID to People:Key
$Stats->join($People, PersonID => 'Key');
## Find certain records
$Stats->select_all();
$Stats->select(Department => sub {/Sale/i }); ## Sales depts
$Stats->omit (Department => sub {/Resale/i}); ## not Resales
$Stats->select(UsageIndex => sub {$_ > 20.0}); ## high usage
## Sort the found records
$Stats->sortspec('DeptNum' , {SortType => 'Integer'});
$Stats->sortspec('UsageIndex', {SortType => 'Number' });
$Stats->sort([qw(DeptNum UsageIndex Last First)]);
## Make copy of table with only found/sorted data, in order
my $Report = $Stats->snapshot();
## Write an output file
$Report->write(_FileName => "Rept.txt", _LineEnding => "mac");
## Print a final progress message.
$Stats->progress("Done!");
## Dozens more methods and parameters available...
OVERVIEW
Data::CTable is a comprehensive utility for reading, writing, manipulating, cleaning and otherwise transforming tabular data. The distribution includes several illustrative subclasses and utility scripts.
A Columnar Table represents a table as a hash of data columns, making it easy to do data cleanup, formatting, searching, calculations, joins, or other complex operations.
The object's hash keys are the field names and the hash values hold the data columns (as array references).
Tables also store a "selection" -- a list of selected / sorted record numbers, and a "field list" -- an ordered list of all or some fields to be operated on. Select() and sort() methods manipulate the selection list. Later, you can optionally rewrite the table in memory or on disk to reflect changes in the selection list or field list.
Data::CTable reads and writes any tabular text file format including Merge, CSV, Tab-delimited, and variants. It transparently detects, reads, and preserves Unix, Mac, and/or DOS line endings and tab or comma field delimiters -- regardless of the runtime platform.
In addition to reading data files, CTable is a good way to gather, store, and operate on tabular data in memory, and to export data to delimited text files to be read by other programs or interactive productivity applications.
To achieve extremely fast data loading, CTable caches data file contents using the Storable module. This can be helpful in CGI environments or when operating on very large data files. CTable can read an entire cached table of about 120 megabytes into memory in about 10 seconds on an average mid-range computer.
For simple data-driven applications needing to store and quickly retrieve simple tabular data sets, CTable provides a credible alternative to DBM files or SQL.
For data hygiene applications, CTable forms the foundation for writing utility scripts or compilers to transfer data from external sources, such as FileMaker, Excel, Access, personal organizers, etc. into compiled or validated formats -- or even as a gateway to loading data into SQL databases or other destinations. You can easily write short, repeatable scripts in Perl to do reporting, error checking, analysis, or validation that would be hard to duplicate in less-flexible application environments.
The data representation is simple and open so you can directly access the data in the object if you feel like it -- or you can use accessors to request "clean" structures containing only the data or copies of it. Or you can build your own columns in memory and then when you're ready, turn them into a table object using the very flexible new() method.
The highly factored interface and implementation allow fine-grained subclassing so you can easily create useful lightweight subclasses. Several subclasses are included with the distribution.
Most defaults and parameters can be customized by subclassing, overridden at the instance level (avoiding the need to subclass too often), and further overridden via optional named-parameter arguments to most major method calls.
Similar / related modules on CPAN
The Data::Table module by Yingyao Zhou & Guangzhou Zou offers similar functionality, but uses a different underlying data representation (2-dimensional array), and has a somewhat different feature set. Check it out. Maybe you will prefer it for your application.
http://search.cpan.org/search?mode=module&query=Data::Table
The Data::ShowTable module renders tables in various viewable formats. CTable relies on ShowTable's ShowBoxTable method to implement its own format() and out() methods.
http://search.cpan.org/search?mode=module&query=Data::ShowTable
Prerequisites
The CTable documentation, source code, and examples assume familiarity with large nested data structures, object-oriented syntax and terminology, and comfort working with array and hash references and array and hash slice syntax.
See the perlref man page for more on these topics.
How to learn more
Dozens more methods, parameters, and examples are described below.
See the full source code in CTable.pm.
Or, after installing, read the man page using:
man Data::CTable
perldoc Data::CTable
See the eg/ (examples) folder in the Data::CTable distribution and the test.pl script for scripts demonstrating every CTable method.
For latest version and other news, check the Data::CTable home page:
http://christhorman.com/projects/perl/Data-CTable/
Or search CPAN:
http://search.cpan.org/search?mode=module&query=Data::CTable
INSTALLATION
Using CPAN module:
perl -MCPAN -e 'install Data::CTable'
Or manually:
tar xzvf Data-CTable*gz
cd Data-CTable-?.??
perl Makefile.PL
make
make test
make install
INCLUDED SUBCLASSES AND UTILITIES
In addition to the module itself, there are a number of subclasses and simple utilities included with the Data::CTable distribution.
Subclases
The Data::CTable distribution includes these example subclasses. Each is installed in your Perl environment along with the main module, and so may be used by your scripts. Each has its own man/perldoc page containing more detail.
Data::CTable::ProgressLogger is a subclass that logs all progress messages to a list within the object itself rather than (merely) echoing them to STDERR. Later, you may retrieve and examine the list.
Data::CTable::Script is a virtual subclass that includes class and object methods that make it easy to write a simple interactive command-line program that parses options and outputs a table.
Data::CTable::Listing is a very useful subclass of Data::CTable::Script that implements a souped-up Unix-like "ls" (file listing) command -- it first gets an optionally-recursive listing of any number of files and/or directories, builds a list of their full absolute or relative paths, then build a Data::CTable::Listing object that contains all the paths, plus about 25+ other pieces of useful information about each file or directory.
The "tls" utility, below, is simply a command-line cover for this class, but you could use this class in your own scripts in order to get detailed file listings.
Utilities
Each of these utilities is provided mainly so you can see mature examples of how to use Data::CTable in real-world scripts.
But each is also genuinely useful, too, and you may enjoy adding them to your regular bag of tricks or using them as an easily-modifiable basis for scripts of your own.
On most systems, these will be installed in an appropriate directory in your path when you install CTable, and hence will be executable just by typing their file name. (On Windows, they'll be wrapped by a .bat file and installed in C:\Perl\bin or equivalent. On *nix, they'll be in /usr/bin/ or equivalent.)
tls is a command-line utility that wraps Data::CTable::Listing to implement a variant on the classic Unix "ls" command in which an internal CTable object is used to hold and calculate a very large amount of meta-data about each file and directory and then output that data in formatted tables, or as any kind of delimited text file, and with much more flexibility and control over included data, and sort and sub-sort order than with ls.
tshow is a command-line utility that reads each of its arguments as a Data::CTable file and then calls the out() method to display its entire contents. (Warning: out() is slow with very large data sets.)
getweather is a command-line utility that takes a US zip code, grabs the local weather report from a popular weather web site and uses a CTable object to store, process, and clean, and present the table of weather data that results in a simple text format.
CREATING TABLE OBJECTS
## Create an object / read file(s) / override params
use Data::CTable;
$t = Data::CTable->new()
$t = Data::CTable->new($File)
$t = Data::CTable->new($File1, $File2....)
$t = Data::CTable->new($Params)
$t = Data::CTable->new($Params, $File1)
$t = Data::CTable->new($Params, $File1, $File2....)
## Internal initializer (subclassable): called for you by new()
$t->initialize()
If the first argument to new() is a hash ref, it is the $Params hash of initial parameters and/or data columns which, if supplied will form the starting point for the object being created. Any non-hash and/or further arguments to new() are treated as file names to be opened.
If supplied, data in the $Params hash will be shallowly copied -- the original hash object passed will not be used, but any sub-structures within it will now "belong" to the resulting new object which will feel free to manipulate them or discard them.
Then, any parameters not supplied (usually most of them) will be defaulted for you because new() will call the internal method initialize() before the object is finished and returned.
See the PARAMETER REFERENCE section below for the parameters you can choose to supply or have defaulted for you.
Any file name arguments will be read and appended into a single object. If any of the files fails to be read, then new() will fail (return a false value) and no object will be created.
initialize() makes sure there is a legal and consistent value for every internal parameter in the object. Generally, initialize() leaves alone any parameters you supplied to new(), and simply sets default values for any that were not yet supplied.
You should never need to call initialize() directly.
After calling initialize(), new() then calls the append_files_new() method to process the filename arguments. This method then calls read() on the first filename, and then append_file on all subsequent file names, appending them to the first, in sequence.
See append() for an explanation of how the data from multiple files is combined into a single table.
Advanced: Using a template object
## Calling new() with a template object
$v = $t->new()
$v = $t->new($File)
$v = $t->new($File1, $File2....)
$v = $t->new($Params)
$v = $t->new($Params, $File1)
$v = $t->new($Params, $File1, $File2....)
You can also call $t->new() to use an existing object, $t, as a template for the new object. $t->new() will create a new object of the same class or subclass as the template object. Furthermore, the template object, if provided, will be used as a starting point for the resulting object -- in fact, it will initially share shallow copies of all data columns, if any, and all internal parameters and data structures, if any.
This advanced shared-data technique could be used to create two separate table objects that share and operate on the same underlying data columns in memory but have different custom field lists, custom selections, sort behavior, etc. But don't do this unless you're sure you understand what you're doing, because changing data in one table would change it in the other.
PARAMETER REFERENCE
The parameters listed here are recognized by new(), initialize(), and by many functions that use the named-parameter calling convention, such as read(), write(), sort(), and others.
Any parameter listed may be specified when using new().
Most parameters should not be directly accessed or manipulated once the object has been created, except those that have appropriate accessor methods described throughout in this documentation.
Each parameter in the lists below is listed along with its defaulting logic as performed by new() via initialize().
Custom field list and custom selection (record list)
- _FieldList ||= undef;
-
This is the ordered list (array reference) of columns / fields present in the object. It is set by read() to reflect the names and order of the fields encountered or actually read from in the incoming data file, if any. Initially this list is undefined, and if removed or left undefined, then the de-facto field list will be a list of all columns present in the table object, in alphabetical order (see fieldlist() and fieldlist_all()).
Normally, all fields present in the table object would be listed in the field list.
However, this parameter may be set in the object (or overridden in named-parameter function calls like read(), write(), etc.) to cause a subset of fields to be read, written or otherwise used. If a subset field list is specified before reading a data file, then ONLY fields listed will be read -- this is a way to read just certain fields from a very large file, but may not always be what you want.
Specifying the field list before calling read() is required if the data file has no header row giving names to its fields -- the names you specify in the field list, in order, will be applied to the data file being read. See the _HeaderRow parameter, below.
If a subset field list is used after columns are already loaded in memory, the columns not listed in the field list will still be present in the object (and can be listed by calling fieldlist_all()), but they will be omitted from most operations that iterate over fields.
- _Selection ||= undef;
-
This is a list of the record numbers of "selected" records in the table, possibly indicating sorted order.
If absent, then all records are considered to be selected, in "natural" order -- i.e. the order they occur in the file.
You can create and set your own selection list or get and modify an existing one. Deleting it resets the selection (for example, by calling select_all()).
Calling sort() will create a _Selection if none existed. Otherwise it operates by modifying the existing _Selection, which may be a subset of all record numbers.
Cache behavior controls
See sections related to Cacheing, below.
- _CacheOnRead = 1 unless exists
-
Boolean: whether data files read by the read() method should be cached after reading. Once cached, the data will be read from the cache instead of the original file the NEXT TIME READ() IS CALLED, but only if: 1) the cache file is found, and 2) its date is later than the original. Otherwise, the cache file is ignored or re-written. Cacheing can be up to 10x faster than parsing the file, so it's almost always worth doing in any situation where you'll be reading a data file more often than writing it.
This parameter defaults to true.
- _CacheOnWrite = 0 unless exists
-
Boolean: whether tables written by the write() method should be cached after writing. This defaults to false on the assumption that the program won't need to re-read a file it just wrote. However, this behavior would be useful if a later step in your program or another program will be reading the file that was written and would benefit from having the cacheing already done. Cacheing a file after writing is quite fast since the data is already in memory and of course it speeds up subsequent read() operations by up to 10x.
- _CacheExtension = ".cache" unless exists
-
This is the file name extension that is added to a file's name to determine the name of its corresponding cache file. (First, any existing extension, if any, is removed.) If this extension is empty, then the cache file will be named the same as the original assuming the cache file is being stored in a different directory. (See next setting.)
- _CacheSubDir = "cache" unless exists
-
This is the absolute or relative path to the subdirectory that should be used to store the cache files. The default value is the relative path to a directory called "cache". Relative paths will be appended to the directory path containing the original file being read.
Absolute cache paths (such as /tmp or c:\temp\) can also be used.
Override _CacheExtension and _CacheSubdir in a subclass, in each object, or in each call to read() or write() in order to have the cache files stored elsewhere. But remember: unless you use the same cache settings next time you read the same file, the cache files will be orphaned.
Progress routine / setting
- _Progress = undef unless exists
-
The _Progress setting controls the routing of diagnostic messages. Four possible settings are recognized:
undef (default) The class's progress settings are used. subroutine reference Your own custom progress routine. true Built-in progress_default() method used. 0/false No progress messages for this object.
See the PROGRESS section for a description of the interface of custom progress routines and for details on how the builtin one works.
File format settings
- _LineEnding ||= undef;
-
_LineEnding indicates the line ending string or setting to be used to read a file, the setting that actually was used to read a file, and/or the line ending that will be used to write a file.
Set this parameter to force a particular encoding to be used.
Otherwise, leave it undef. The program will Do What You Mean.
If _LineEnding is undef when read() is called, read() will try to guess the line ending type by inspecting the first file it reads. Then it will set this setting for you. It can detect DOS, Unix, and Mac line endings.
If _LineEnding is undef when write() is called, write() will use
"\n"
, which yields different strings depending on the current runtime platform: \x0A on Unix; \x0D in MacPerl, \x0D\x0A on DOS.Otherwise, write() uses the value defined in _LineEnding, which would match the value filled in by read() if this object's data originally had been read from a file. So if you read a file and then later write it out, the line endings in the written file will match the format of original unless you override _LineEnding specifically.
Since Data::CTable supports reading and writing all common endings, base your decision on line ending format during write() on the needs of other programs you might be using.
For example: FileMaker Pro and Excel crash / hang if Unix line endings are used, so be sure to use the ending format that matches the needs of the other programs you plan to use.
As a convenience, you may specify and retrieve the _LineEnding setting using the mnemonic symbols "mac", "dos" and "unix." These special values are converted to the string values shown in this chart:
symbol string value chars decimal octal control ------------------------------------------------------------- dos "\x0D\x0A" CR/LF 13,10 "\015\012" ^M^J mac "\x0D" CR 13 "\015" ^M unix "\x0A" LF 10 "\012" ^J
See the section LINE ENDINGS, below, for accessor methods and conversion utilities that help you get/set this parameter in either symbolic format or string format as you prefer.
- _FDelimiter ||= undef;
-
_FDelimiter is the field delimiter between field names in the header row (if any) and also between fields in the body of the file. If undef, read() will try to guess whether it is tab
"\t"
or comma <",">, and set this parameter accordingly. If there is only one field in the file, then comma is assumed by read() and will be used by write().To guess the delimiter, the program looks for the first comma or tab character in the header row (if present) or in the first record. Whichever character is found first is assumed to be the delimiter.
If you don't want the program to guess, or you have a data file format that uses a custom delimiter, specify the delimiter explicitly in the object or when calling read() or make a subclass that initializes this value differently. On write(), this will default to comma if it is empty or undef.
- _QuoteFields = undef unless exists
-
_QuoteFields controls how field values are quoted by write() when writing the table to a delimited text file.
An undef value (the default) means "auto" -- each field is checked individually and if it contains either the _FDelimiter character or a double-quote character, the field value will be surrounded by double-quotes as it is written to the file. This method is slower to write but faster to read, and may make the output easier for humans to read.
A true value means always put double-quotes around every field value. This mode is faster to write but slower to read.
A zero value means never to use double-quotes around field values and not to check for the need to use them. This method is the fastest to read and write. You may use it when you are certain that your data can't contain any special characters. However, if you're wrong, this mode will produce a corrupted file in the event that one of the fields does contain the active delimiter (such as comma or tab) or a quote.
- _HeaderRow = 1 unless exists
-
_HeaderRow is a boolean that says whether to expect a header row in data files. The default is true: a header row is required. If false, _FieldList MUST be present before calling read() or an error will be generated. In this latter case, _FieldList will be assumed to give the correct names of the fields in the file, in order, before the file is read. In other words, the object expects that either a) it can get the field names from the file's header row or b) you will supply them before read() opens the file.
Encoding of return characters within fields
- _ReturnMap = 1 unless exists
-
_ReturnMap says that returns embedded in fields should be decoded on read() and encoded again on write(). The industry-standard encoding for embedded returns is ^K (ascii 11 -- but see next setting to change it). This defaults to true but can be turned off if you want data untouched by read(). This setting has no effect on data files where no fields contain embedded returns. However, it is vital to leave this option ON when writing any data file whose fields could contain embedded returns -- if you have such data and call write() with _ReturnMap turned off, the resulting file will be an invalid Merge/CSV file and might not be re-readable.
When these fields are decoded on read(), encoded returns are converted to
"\n"
in memory, whatever its interpretation may be on the current platform (\x0A on Unix or DOS; \x0D on MacPerl).IMPORTANT NOTE: When these fields are encoded by write(), any occurrence of the current _LineEnding being used to write the file is searched and encoded FIRST, and THEN, any occurrence of "\n" is also searched and encoded. For example, if using mac line endings (^M) to write a file on a Unix machine, any ^M characters in fields will be encoded, and then any "\n" (^J) characters will ALSO be encoded. This may not be what you want, so be sure you know how your data is encoded in cases where your field values might contain any ^J and/or ^M characters.
IMPORTANT NOTE: If you turn _ReturnMap off, fields with returns in them will still be double-quoted correctly. Some parsers of tab- or comma-delimited files are able to support reading such files. HOWEVER, the parser in this module's read() method DOES NOT currently support reading files in which a single field value appears to span multiple lines in the file. If you have a need to read such a file, you may need to write your own parser as a subclass of this module.
- _ReturnEncoding ||= "\x0B";
-
This is the default encoding to assume when embedding return characters within fields. The industry standard is "\x0B" (ascii 11 / octal \013 / ^K) so you should probably not ever change this setting.
When fields are encoded on write(),
"\n"
is converted to this value. Note that different platforms use different ascii values for"\n"
, which is another good reason to leave the ReturnEncoding feature enabled when calling write().To summarize: this module likes to assume, and you should too, that returns in data files on disk are encoded as "\x0B", but once loaded into memory, they are encoded as the current platform's value of
"\n"
. - _MacRomanMap = undef unless exists
-
Data::CTable assumes by default that you want field data in memory to be in the ISO 8859-1 character set (the standard for Latin 1 Roman characters on Unix and Windows in the English and Western European languages -- and also the default encoding for HTML Web pages).
_MacRomanMap controls the module's optional mapping of Roman characters from Mac format on disk to ISO format in memory when reading and writing data files. These settings are recognized:
undef ## Auto: Read/write Mac chars if using Mac line endings 1 ## On: Assume Mac char set in all fields 0 ## Off: Don't do any character mapping at all
The default setting is undef, which enables "Auto" mode: files found to contain Mac line endings will be assumed to contain Mac upper-ASCII characters and will be mapped to ISO on read(); and files to be written with Mac line endings will mapped back from ISO to Mac format on write().
If your data uses any non-Latin-1 character sets, or binary data, or you really want Mac upper-ASCII characters in memory, or you just don't want this module messing with your encodings, set this option to 0 (Off) or make a subclass that always sets it to 0.
See also the clean() methods that can help you translate just the columns you want after reading a file or before writing it, which may be faster for you if only a few fields might contain high-ASCII characters.
- _FileName ||= undef;
-
This is the name of the file that should be read from or WAS read from. (read() will set _FileName to the value it used to read the file, even if _FileName was only supplied as a named parameter.)
This name will also be used, unless overridden, to re-write the file again, but with an optional extension added. (See next setting.)
- _WriteExtension = ".out" unless exists
-
The _WriteExtension is provided so that CTable won't overwrite your input data file unless you tell it to.
_WriteExtension will be added to the object's _FileName setting to create a new, related file name, before writing.... UNLESS _FileName is supplied as an direct or named parameter when calling write().
In the latter case, write() uses the file name you supply and adds no extension, even if this would mean overwriting the original data file.
To add _WriteExtension, write() places it prior to any existing final extension in the _FileName:
_FileName default file name used by write() -------------------------------------------------------------- People.merge.txt People.merge.out.txt People People.out
If you want to always overwrite the original file without having to supply _FileName each time, simply set _WriteExtension to undef in a subclass or in each instance.
If _CacheOnWrite is true, then the _WriteExtension logic is applied first to arrive at the actual name of the file to be written, and then the _CacheExtension logic is applied to that name to arrive at the name of the cache file to be written.
Sorting-related parameters
- _SortOrder ||= undef;
-
_SortOrder is the list of fields which should be used as primary, secondary, etc. sort keys when sort() is called. Like other parameters, it may be initialized by a subclass, stored in the object, or provided as a named parameter on each call to sort().
If _SortOrder is empty or undefined, then sort() sorts the records by record number (i.e. they are returned to their "natural" order).
- _SortSpecs ||= {};
-
_SortSpecs is a hash of specifications for the SortType and SortDirection of fields on which sorting may be done. For any field missing a sort spec or the SortType or SortDirection components of its sort spec, the _DefaultSortType and _DefaultSortDirection settings will be used. So, for example, if all fields are of type String and you want them to sort Ascending, then you don't need to worry about _SortSpecs. You only need to provide specs for fields that don't take the default settings.
_SortSpecs might look like this:
{Age => {SortType => 'Integer'}, NameKey => {SortType => 'Text', SortDirection => -1}}
- _SRoutines ||= {};
-
_SRoutines is a hash mapping any new SortTypes invented by you to your custom subroutines for sorting that type of data. (See the section on sort routines, below, for a full discussion.)
Sorting defaults
- _DefaultSortType ||= 'String';
-
If you sort using a field with no sort spec supplied, or whose sort spec omits the SortType, it will get its SortType from this parameter.
See the sections below on SORT TYPES and SORT ROUTINES.
- _DefaultSortDirection ||= 1;
-
If you sort using a field with no sort spec supplied, or whose sort spec omits the SortDirection, it will get its SortDirection from this parameter.
Legal sort directions are: 1 (Ascending) or -1 (Descending).
See the section below on DEFAULT SORT DIRECTION.
Miscellaneous parameters
- _ErrorMsg ||= "";
-
This parameter is set by read() or write() methods that encounter an error (usually a parameter error or file-system error) that prevents them from completing. If those methods or any methods that call them return a false value indicating failure, then _ErrorMsg will contain a string explaining the problem. The message will also have been passed to the progress() method for possible console feedback.
- _Subset
-
This parameter is set to 1 (true) by read() if the last call to read() brought in a subset of the fields available in the file; 0 otherwise.
The object uses this field internally so it knows to abandon any cache files that might not contain all requested fields upon read().
SUBCLASSING
Most subclasses will override initialize() to set default values for the parameters of the parent class and then they may provide default values for other subclass-specific parameters. Then, the subclass's initialize() should call SUPER::initialize() to let the parent class(es) take care of the remaining ones.
Every initialize() method should always allow for parameters to have already been provided by the $Params hash or template object. It should not overwrite any valid values that already exist.
The following sample subclass changes the default setting of the _Progress parameter from undef to 1 and then overrides the progress_default() method to log all progress messages into a new "_ProgrLog" (progress log) parameter stored in the object.
BEGIN
{ ## Data::CTable::ProgressLogger: store messages in the object
package Data::CTable::ProgressLogger;
use vars qw(@ISA); @ISA=qw(Data::CTable);
sub initialize ## Add a new param; change one default
{
my $this = shift;
$this->{_Progress} = 1 unless exists($this->{_Progress});
$this->{_ProgrLog} ||= [];
$this->SUPER::initialize();
}
sub progress_default ## Log message to object's ProgMsgs list
{
my $this = shift;
my ($msg) = @_;
chomp $msg;
push @{$this->{_ProgrLog}}, localtime() . " $msg";
return(1);
}
sub show_log ## Use Dumper to spit out the log list
{
my $this = shift;
$this->dump($this->{_ProgrLog});
}
}
## Later...
my $Table = Data::CTable::ProgressLogger->new("mydata.txt");
# ... do stuff...
$Table->write();
$Table->show_log();
FIELD LIST
## Getting / setting the object's _FieldList
$t->fieldlist() ## Get _FieldList or fieldlist_all()
$t->fieldlist_get()
$t->fieldlist_hash() ## Get fieldlist() as keys in a hash
$t->fieldlist_all() ## Get all fields (ignore _FieldList)
$t->fieldlist($MyList) ## Set field list (_FieldList param)
$t->fieldlist_set($MyList)
$t->fieldlist(0) ## Remove field list (use default)
$t->fieldlist_set()
$t->fieldlist_force($MyList)## Set list; remove non-matching cols
$t->fieldlist_truncate() ## Just remove nonmatching cols
$t->fieldlist_default() ## Default field list (alpha-sorted)
$t->fieldlist_add($MyName) ## Append new name to custom list.
$t->fieldlist_delete($MyName) ## Delete name from custom list.
A CTable object can optionally have a custom field list. The custom field list can store both the ORDER of the fields (which otherwise would be unordered since they are stored as keys in a hash), and also can be a subset of the fields actually in the object, allowing you to temporarily ignore certain effectively-hidden fields for the benefit of certain operations. The custom field list can be changed or removed at any time.
The custom field list is stored in the private _FieldList parameter.
fieldlist() always returns a list (reference). The list is either the same list as _FieldList, if present, or it is the result of calling fieldlist_default(). In CTable, fieldlist_default() in turn calls fieldlist_all() -- hence fieldlist() would yield an auto-generated list of all fields in alphabetical order.
fieldlist_all() can be called directly to get a list of all fields present regardless of the presence of a _FieldList parameter. The list is an alphabetical case-insensitively sorted list of all hash keys whose names do not begin with an underscore.
You could override this method if you want a different behavior. Or, you could create your own custom field list by calling fieldlist_all() and removing fields or ordering them differently.
To set a custom field list (in _FieldList), call fieldlist() or fieldlist_set() with a list (reference). The list must be a list of strings (field names) that do not begin with underscore. The object owns the list you supply.
To remove a custom field list (and let the default be used), call fieldlist(0) or fieldlist_set() with no arguments (these will return the fieldlist that was deleted, if any).
fieldlist_freeze() "freezes" the fieldlist in its current state. This is equivalent to the following:
$t->fieldlist_set($t->fieldlist());
... which would force the fieldlist to $t->fieldlist_all() if and only if there is not already a custom _FieldList present.
IMPORTANT NOTE ABOUT PARTIAL FIELD LISTS: When setting a field list, the object ensures that all fields (columns) mentioned in the list are present in the object -- it creates empty columns of the correct length as necessary. However, it does NOT delete any fields not mentioned in the field list. This allows you to manipulate the field list in order to have certain fields be temporarily ignored by all other methods, then alter, restore, or remove it (allow it to revert to default) and they will be effectively unhidden again. Some methods (such as cols(), write(), etc.) also allow you to specify a custom field list that will override any other list just during the execution of that method call but will not modify the object itself.
Call fieldlist_force() to set the list AND have any non-listed fields also deleted at the same time (by calling fieldlist_truncate() internally). You can also just delete individual columns one-by-one, of course, using the column-manipulation methods and the custom fieldlist, if any, will be appropriately updated for you.
fieldlist_truncate() deletes any fields found in the table but not currently present in _FieldList. A hash of the deleted columns is returned to the caller. If there is no _FieldList, then this method does nothing.
fieldlist_default() just calls fieldlist_all() in this implementation, but could be changed in subclasses.
fieldlist_add() is the internal method that adds a new field name to the custom field list (if present) and if the field name was not already on the list. It is called by other methods any time a new column is added to the table. Don't call it directly unless you know what you're doing because the corresponding column won't be created. (Instead, use col().) The field name is appended to the end of the existing custom field list. If there is no custom field list, nothing is done.
fieldlist_delete() is the internal method that deletes a field name from the custom field list (if present). It is called by other methods when columns are deleted, but it does not actually delete the columns themselves, so use with caution: deleting a field from the custom field list effectively hides the field. This method has no effect, however, if there is no custom field list present. So don't call this method directly unless you know what you're doing.
DATA COLUMNS (FIELD DATA)
## Getting or setting data in entire columns
$t->{$ColName} ## Get a column you know exists
$t->col($ColName) ## Get a column or make empty one.
$t->col_get($ColName)
$t->col($ColName, $ListRef) ## Set all of a column all at once.
$t->col_set($ColName, $ListRef)
$t->col_force($ColName, $ListRef) ## Add but don't check size or
## add to custom field list
$t->col_set($ColName, undef) ## Delete a column completely
$t->col_delete($ColName)
$t->col_empty() ## An empty col presized for table
$t->col_empty(22) ## An empty col of another length
$t->col_empty($Col) ## An empty col sized to match another
$t->col_default() ## Default if req. column not found.
$t->col_exists($Field) ## Check existence of column
$t->col_active($Field) ## Restrict check to fieldlist()
$t->cols($ColList) ## Get list of multiple named columns
$t->cols_hash($ColList) ## Get hash " " "
$t->col_rename($Old => $New) ## Change name of columns
$t->col_rename($Old1 => $New1, $Old2 => $New2) ## Change several
A "column" is a field in the table and all its data. The column's field name is a key in the object itself, and may also optionally be listed in a custom field list if present. The column's data is the key's value in the hash and is an array ref of values presumed to be of the same data type (e. g. string, integer, etc.)
Sometimes the terms "column" and "field" are used interchangeably in this documentation.
If you already know that a column exists (because you got it from the fieldlist() method and you've not previously manipulated _FieldList directly but instead carefully used the method calls available for that), then you can safely get the column by just looking it up in the object itself.
The col() method does the same thing, but forces the column to spring into existence if it did not already (which can also have the potentially unwanted side-effect of hiding coding errors in which you retreive mis-named columns: so beware). Columns brought into existence this way will automatically be pre-sized (i.e. they will will be created and set to whatever col_default() returns).
The col() or col_set() methods can also be used to set a column. When the column is set, the list you pass is automatically sized (lengthened or truncated) to match the current length of the table. If this is not what you want, then call col_force() which will not check whether the new column matches the size of the others.
No matter how you set it, the object now "owns" the list you gave it.
As a convenience, col(), col_set() and col_force() return the column that was set. They silently discard any previous column.
All three methods of column setting will append the column to the custom field list if one is present and the column name is not already listed there (by calling fieldlist_add()). They will also call the extend() method to ensure all columns have the same length (either others will be extended to match the length of the new one, or the new one will be extended to match the length of the others).
col_delete() deletes a column.
col_empty() returns an anonymous list reference that is pre-sized to the length of the table (by default). You could use it to get an empty column that you intend to fill up and then later insert into the table or use to hold the results of an operation on other columns. If you want a different length, specify it as a number or as an array ref whose length should be matched.
col_default() is the internal method that implements the "springing into existence" of missing columns. Currently it just calls col_empty(). Other subclasses might want to have it return undef or a string like "NO_SUCH_COLUMN" in order to help track programming errors where nonexistent columns are requested.
cols($FieldList) returns an ordered list of the requested column names. If no list is given, then fieldlist() is used.
cols_hash($FieldList) does the same as cols(), but the result is a hash whose keys are the field names and whose values are the columns -- much like the original object itself, but not blessed into the class. The resulting hash, however, could be used as the prototype for a new Data::CTable object (by calling the new() method). However, be warned that both objects will think they "own" the resulting shared so be careful what you do..... which brings us to this:
IMPORTANT NOTE ABOUT GETTING COLUMNS: The columns you retrieve from a table are still "owned" by the table object as long as it lives. If you modify them, you are modifying the table's data. If you change their length, then you may be invalidating the table's own expectations that all its columns have the same length. So beware.
Just make yourself a copy of the data if that isn't what you want. For example, instead of this:
my $Foo = $Table->col('Foo'); ## Reference to actual column
Do this:
my $Foo = [@{$Table->col('Foo')}]; ## Shallow copy of the column
CLEANUP AND VALIDATION
## Performing your own cleanups or validations
$t->clean($Sub) ## Clean with custom subroutine
$t->clean($Sub, $Fields) ## Clean specified columns only
## Cleaning whitespace
$t->clean_ws() ## Clean whitespace in fieldlist() cols
$t->clean_ws($Fields) ## Clean whitespace in specified cols
## Cleaning methods that map character sets
$t->clean_mac_to_iso8859()
$t->clean_mac_to_iso8859($Fields)
$t->clean_iso8859_to_mac()
$t->clean_iso8859_to_mac($Fields)
## Character mapping utilities (not methods)
use Data::CTable qw(
ISORoman8859_1ToMacRoman
MacRomanToISORoman8859_1
ISORoman8859_1ToMacRoman_clean
MacRomanToISORoman8859_1_clean
);
&ISORoman8859_1ToMacRoman(\ $Str) ## Pass pointer to buffer
&MacRomanToISORoman8859_1(\ $Str) ## Pass pointer to buffer
&ISORoman8859_1ToMacRoman_clean() ## Operates on $_
&MacRomanToISORoman8859_1_clean() ## Operates on $_
One of the most important things you can do with your data once it's been placed in a table in Perl is to use the power of Perl to scrub it like crazy.
The built-in clean_ws() method applies a standard white-space cleanup to selected records in every field in the fieldlist() or other list of fields you optionally supply (such as fieldlist_all()).
It does the following cleanups that are deemed correct for the majority of data out there:
- Remove all leading whitespace, including returns (\n)
- Remove all trailing whitespace, including returns (\n)
- Convert runs of spaces to a single space
- Convert empty string values back to undef to save space
Of course, depending on your data, clean_ws() might just be the first thing you do in your cleanup pass. There might be many more cleanups you'd like to apply.
clean() is like clean_ws() except you supply as the first argument your own cleaning subroutine (code reference). It should do its work by modifying $_.
Both clean_ws() and clean() apply cleaning ONLY to selected records. If this isn't what you want, then select_all() before cleaning.
Since a cleanup subroutine can do ANY modifications to a field that it likes, you can imagine some cleanup routines that also supply default values and do other validations.
For example, a cleanup routine could convert every value in each field to an integer, or apply minimum or maximum numerical limits:
sub {$_ = int($_) }
sub {$_ = max(int($_), 0) }
sub {$_ = min(int($_), 200)}
Or your cleanup routine could use regular expressions to do capitalizations or other regularizations of data:
sub Capitalize {/\b([a-z])([a-z]+)\b)/\U$1\E$2/g}
$t->clean(\ &Capitalize , ['FirstName', 'LastName']);
$t->clean(\ &PhoneFormat, ['Phone', 'Fax' ]);
$t->clean(\ &LegalUSZip, ['HomeZip', 'WorkZip' ]);
... and so on. Cleanups are easy to write and quick and easy to apply with Data::CTable. Do them early! Do them often!
Hints for writing cleanup routines
If your cleanup routine may be used to clean up fields that could be empty/undef and empty/undef is a legal value, it should not touch any undef values (unintentionally converting them to strings).
Finally, instead of setting any values to the empty string, it should set them to undef instead. This includes any values it might have left empty during cleanup. (Using undef instead of empty string to represent empty values is one way that Data::CTable likes to save memory in tables that may have lots of those.)
For an example of a well-behaved cleanup routine, consider the following implementation of the builtin CleanWhitespace behavior:
sub CleanWhitespace
{
return unless defined; ## Empty/undef values stay that way
s/ \s+$//sx; ## Remove trailing whitespace
s/^\s+ //sx; ## Remove leading whitespace
s/ +/ /g; ## Runs of spaces to single space
$_ = undef unless length; ## (Newly?) empty strings to undef
}
Roman character set mapping
The character set mapping cleanup routines can be used to convert upper-ASCII characters bidirectionally between two popular Roman Character sets -- Mac Roman 1 and ISO 8859-1 (also sometimes called ISO Latin 1) -- i.e. the Western European Roman character sets.
By default, read() converts all incoming data fields in data files with Mac line endings to ISO format when reading in. Conversely, write() does the reverse mapping (ISO to Mac) when writing a file with Mac line endings.
However, you may wish to turn off these default behaviors and instead apply the mappings manually, possibly just to certain fields.
For example, if a table contains fields with non-Roman character sets, you would definitely not want to apply these mappings, and instead might want to apply some different ones that you create yourself.
Utility routines for character mapping
This module can optionally export four utility routines for mapping character Latin 1 character sets. Always be sure to map the correct direction -- otherwise you'll end up with garbage! Be careful to only pass Western Roman strings -- not double-byte strings or strings encoded in any single-byte Eastern European Roman or non-Roman character set.
&ISORoman8859_1ToMacRoman(\ $Str) ## Pass pointer to buffer
&MacRomanToISORoman8859_1(\ $Str) ## Pass pointer to buffer
These routines translate characters whose values are 128-255 from one Western Roman encoding to another. The argument is a string buffer of any size passed by reference.
The functions return a count of the number of characters that were mapped (zero or undef if none were).
&ISORoman8859_1ToMacRoman_clean() ## Operates on $_
&MacRomanToISORoman8859_1_clean() ## Operates on $_
These routines are variants of the above, but they're versions that are compatible with clean() -- they operate on $_ and will take care to leave undefined values undefined. They do not have return values.
More advanced cleaning and validation
Unfortunately, clean() only lets you operate on a single field value at a time -- and there's no way to know the record number or other useful information inside the cleaning routine.
For really powerful cleaning and validation involving access to all fields of a record as well as record numbers, see the discussion of the calc() method and other methods for doing complex field calculations in the next section.
CALCULATIONS USING calc()
## Calculate a new field's values based on two others
$t->calc($Sub) ## Run $Sub for each row, with
## fields bound to local vars
$t->calc($Sub, $Sel) ## Use these row nums
$t->calc($Sub, undef, $Fields) ## Use only these fields
$t->calc($Sub, $Sel, $Fields) ## Use custom rows, fields
my $Col = $t->calc($Sub) ## Gather return vals in vector
## Example 1: Overwrite values in an existing column.
$t->calc(sub{no strict 'vars'; $Size = (stat($Path))[7]});
## Example 2: Create empty column; fill fields 1 by 1
$t->col('PersonID');
$t->calc(sub{no strict 'vars'; $PersonID = "$Last$First"});
## Example 3: Calculate values; put into to table if desired
$PersonID = $t->calc(sub{no strict 'vars'; "$Last$First"});
$t->sel('PersonID', $PersonID);
## Example 4: Using fully-qualified variable names
$t->calc(sub{$main::PersonID = "$main::Last$main::First"});
calc() runs your custom calculation subroutine $Sub once for every row in the current selection() or other list of row numbers that you specify in the optional $Sel argument.
This lets you apply a complex calculation to every record in a table in a single statement, storing the results in one or more columns, or retrieving them as a list.
Your custom subroutine may refer to the value in any field in the current row by using a global variable with the field's name:
For example, if the table has fields First, Last, and Age, then $Sub may use, modify, and set the variables $First, $Last, $Age. (Also known as $main::First, $main::Last, $main::Age).
Modifying any of these specially-bound variables actually modifies the data in the correct record and field within the table.
By default, the fields available to $Sub are all fields in the table. calc() must bind all the field names for you for each row, which can be time-consuming for tables with very large numbers of fields.
You can speed up the operation of calc() by listing only the fields your $Sub needs in the optional parameter $Fields. Any field names you don't mention won't be available to $Sub. Conversely, calc() will run faster because it can bind only the fields you actually need.
If you include non-existent fields in your custom $Fields list, calc() creates them for you before $Sub runs the first time. Then your $Sub can store field values into the new column, referring to it by name.
Variables in $Sub are in the "main" package. So you should set $Sub to use pacakge "main" in case the rest of your code is not in "main".
Similarly, if you "use strict", Perl will complain about global variables in $Sub. So you may need to assert "no strict 'vars'".
{ package Foo; use strict;
$t = ...;
$t->calc(sub {package main; no strict 'vars';
$Age = int($Age)});
}
## Or this:
{ package Foo; use strict;
$t = ...;
{ package main; no strict 'vars';
my $Sub = sub {$Age = int($Age)};
}
$t->calc($Sub);
}
You may be able to get around both problems more easily by prefixing each variable reference in $Sub with "main::". This takes care of the package name issue and bypasses "use strict" at the same time, at the slight cost of making the calculation itself a bit harder to read.
$t->calc(sub {$main::Age = int($main::Age)}); ## OK in any package
In addition to the field names, the following three values are defined during each invocation of $Sub:
$_r ($main::_r) -- the row number in the entire table
$_s ($main::_s) -- the item number in selection or $Recs
$_t ($main::_t) -- the table object itself
You could use these values to print diagnostic information or to access any of the data, parameters, or methods of the table itself from within $Sub. Or you could even calculate field values using $_r or $_s.
For example, after searching & sorting, you could make a field which preserves the resulting sort order for future reference:
$t->col('Ranking'); ## Create the empty column first.
$t->calc(sub{$main::Ranking = $main::_s});
This last example is equivalent to:
$t->sel(Ranking => [0 .. $#{$t->selection()}]); ## See sel() below
"MANUAL" CALCULATIONS
calc() (see previous secion) is the briefer, more elegant way to do batch calculations on entire columns in a table, but it can be slightly slower than doing the calculations yourself.
If you have extremely large tables, and you notice the processing time for your calculations taking more than a second, you might want to rewrite your calculations to use the more efficient techniques shown here.
You will often need to create new calculated columns based on one or more existing ones, and then either insert the columns back in the tables or use them for further calculations or indexing.
Examples 1a and 1b create a new field 'NameOK' containing either the string "OK" or undef (empty) depending on whether the field 'Name' is empty. Just use map() to iterate over the existing values in the other column, binding $_ to each value in turn.
### Example 1a: Calculation based on one other column
## Retrieve column
my $Name = $t->col('Name');
## Make new column
my $NameOK = [map {!!length && 'OK'} @$Name];
## Insert column back into table:
$t->col(NameOK => $NameOK);
### Example 1b: Same calculation, in a single statement:
$t->col(NameOK => [map {!!length && 'OK'} @{$t->col('Name')}]);
In order to iterate over MULTIPLE columns at once, you need a list of the row numbers generated by $t->all() so you can index the two columns in tandem. Then, you use map to bind $_ to each row number, and then use the expression $t->col($ColName)->[$_] to retreive each value.
Examples 2a and 2b demonstrate this method. They create a new field 'FullName' which is a string joining the values in the 'First' and 'Last' columns with a space between.
### Example 2a: Calculation based on multiple columns
## Retrieve columns
my $First = $t->col('First');
my $Last = $t->col('Last' );
## Retreive row nums
my $Nums = $t->all();
## Calculate a new column based on two others
my $Full = [map {"$First->[$_] $Last->[$_]"} @$Nums];
## Add new column to the table
$t->col(FullName => $Full);
### Example 2b: Same calculation, in a single statement:
$t->col(FullName =>
[map {"$t->col('First')->[$_] t->col('Last')->[$_]"}
@{$t->all()}]);
In examples 1 and 2, you create entirely new columns and then add or replace them in the table.
Using the approach in Examples 3a and 3b, you can assign calculated results directly into each value of an existing column as you go.
## Example 3a: Calculate by assigning directly into fields...
my $A = $t->col->('A'); ## This column will be modified
my $B = $t->col->('B');
my $C = $t->col->('C');
foreach @($t->all()) {$A->[$_] = $B->[$_] + $C->[$_];}
## Example 3b: Same calculation, in a single statement:
foreach @($t->all()) {($t->col('A')->[$_] =
$t->col('B')->[$_] +
$t->col('C')->[$_])};
Before writing your code, think about which calculation paradigms best suit your needs and your data set.
Just as Perl Hackers know: There's More Than One Way To Do It!
INDEXES
## Make indexes of columns or just selected data in columns
my $Index1 = $t->index_all($Key); ## entire column
my $Index2 = $t->index_sel($Key); ## selected data only
## Make hashes of 2 columns or just selected data in columns
my $Index1 = $t->hash_all($KeyFld, $ValFld); ## entire column
my $Index2 = $t->hash_sel($KeyFld, $ValFld); ## selected data
index_all() creates and returns a hash (reference) that maps keys found in the column called $Key to corresponding record numbers.
Ideally, values in $Key would be unique (that's up to you).
If any values in $Key are NOT unique, then later values (higher record numbers) will be ignored.
index_sel() creates and returns a hash (ref) that maps keys found in the SELECTED RECORDS of column $Key to corresponding record numbers.
Any keys in unselected records are ignored. Otherwise, the behavior is equivalent to index_all().
hash_all() and hash_sel() are similar, except they create and return hashes whose keys are taken from column $KeyFld, but whose values are from $ValFld in the corresponding records.
So, for example, imagine you have a tab-delimited file on disk with just a single tab per line (2 fields) and no header row. The entries on the left side of the tab on each line are keys and the right side are values. You could convert that file into a hash in memory like this:
my $t = Data::CTable->new({_HeaderRow=>0, _FieldList=>[qw(F1 F2)]},
"DeptLookup.txt");
my $DeptLookup = $t->hash_all(qw(F1 F2));
Reverse indexes
If you'd like an index mapping record number to key, just get $t->col($Key). That's what the data columns in Data::CTable are.
DATA ROWS (RECORDS)
## Getting or setting rows / records
$t->row($Num) ## Get a row or make empty one.
$t->row_get($Num)
$t->row($Num, $HashRef) ## Set all of a row all at once.
$t->row_set($Num, $HashRef)
$t->row_set($Num, undef) ## Delete a row completely
$t->row_delete($Num)
$t->row_delete($Beg, $End)## Delete a range of rows
$t->row_move($Old, $New) ## Move a row to before $New
$t->row_empty() ## An empty hash
$t->row_exists($Num) ## True if $Num < $t->length()
$t->rows($RowList) ## Get list of multiple row nums
$t->row_list($Num) ## Get row vals as a list
$t->row_list($Num, $Fields) ## Get row vals: specified fields
$t->row_list_set($Num, undef, $Vals) ## Set row vals as a list
$t->row_list_set($Num, $Fields, $Vals) ## Set row vals as a list
$t->row_list_set($Num, $Fields) ## Set vals to empty/undef
Usually, when working with Data::CTable objects, you are operating on entire columns or tables at a time (after all: any transformation you do on one record you almost always want to do on all records or all selected ones).
You should very rarely need to access data by retrieving rows or setting rows, moving them around individually, and so on. (It's much cleaner, and much more efficient to manipulate the selection() (list of selected row numbers) instead -- just delete a row number from the selection, for example, and then for most operations it's almost as if the row is gone from the table, except the data is really still there.)
However, if on rare occasions you really do need direct row operations, you're reading the right section.
A row is generally accessed as a hash. The hash you provide or get back is not saved by the object in any way. Data values are always copied in or out of it, so you always "own" the hash.
Rows are specified by $Num -- the row number with in the unsorted columns (the raw data in the table). These numbers are just array indices into the data columns, and so their legal range is:
[0 .. ($t->length() - 1)] ## (Zero-based row numbering.)
The row hash (or "record") has keys that are field names and values that are copies of the scalar values stored in the data columns within the table.
row() always copies only the fields in fieldlist(), except for row_list() which allows you to specify an optional $Fields parameter which can override the current fieldlist().
If the fieldlist happens to be a subset of all fields, but you really want to get all fields in your record, then call fieldlist_set(0) first to permanently or temporarily delete it.
row() and row_get() always return a hash.
row($Num, $Hash), row_set() take a hash and set just the fields you specify in the hash (in the given row of course). Any non-existent field names in the hash are created, so be careful.
In fact, in general with either getting or setting rows, any non-existent fields mentioned will be created for you (by internally calling col()). So you could build a whole table of 100 rows by starting with an empty, new, table and setting row 99 from a hash that gives the field names.
Setting a row number higher than any existing row number with row(), row_set() or row_force() will automatically set the new length of the entire table to match (extending all the columns with empty rows as necessary).
IMPORTANT: IF YOU SIMPLY MUST ADD ROWS SEQUENTIALLY, do not let the table auto-extend by one with each row you set. This is slow and gets rapidly slower if there's lots of data because the arrays holding the data columns will keep getting reallocated on every insert. Instead, first pre-extend the table to your highest row number by calling length($Len), and then set your rows. Or easier: if convenient just set your rows starting with the highest-numbered one first. If you don't know how many you'll have, guess or estimate and pre-extend to the estimated number and then cut back later. This will be faster than extending all columns by one each time.
row_delete() removes a row or range of rows completely from the table. Any rows above the deleted ones will move down and the table's length() will decrease. If the data columns are very large, this could be a bit slow because a lot of data could be moved around. The low and high row numbers will be limited for you to 0 and length() - 1, respectively. Null ranges are OK and are silently ignored. The range is inclusive, so to delete just row 99, call row_delete(99) or row_delete(99,99).
EFFICIENCY NOTE: Don't call row_delete() to remove lots of individual rows. Instead, select those row numbers by setting the selection (if not already selected), and then invert the selection using selection_invert(), so the undesired rows are deselected, and then use the cull() method to rewrite the entire table at once. The deselected rows will be omitted very efficiently this way.
row_move() moves a row from its $Old row number to the position before the row currently in row $New (specify $New = length() to move the row to the end). Again, in shuffling data in columns, lots of data could get moved around by this operation, so expect it to be slow. If as with row_delete(), if you will be doing several moves, consider building an appropriate selection() first, and then using cull() instead.
Using row_delete() and row_move() to shift records around changes the record numbers of the affected records and many others in the table. The record numbers in the custom selection, if any, are updated to reflect these changes, so the records that were selected before will still be selected after the move (except those that were deleted of course). If you had a private copy of the selection, your copy will likely become outdated after these operations. You should get it again by calling selection().
row_empty() returns a hash whose keys are the entries in fieldlist() and whose values are undef. (You could use it to fill in values before calling row_set()). Note: in this class, row_empty() does exactly the same thing as fieldlist_hash() when the latter is called with no arguments.
row_exists() returns true if (($Num >= 0) && ($Num < $t->length()))
.
rows() calls row() for each row num in a list and returns a list of the resulting hashes.
row_list() gets row values as a list instead of a hash. They appear in the order specified in fieldlist() unless you supply an optional $Fields parameter listing the fields you want to get.
row_list_set() sets row values as a list instead of a hash. Pass your own $Fields list or undef and fieldlist() will be used. $Values should be a list with the same number of values as fields expected; any shortage will result in undef/empty values being set.
ROW / RECORD COUNT (TABLE LENGTH)
## Getting or setting table length
$t->length() ## Get length
$t->length_get()
$t->length(22) ## Set length (truncate or pre-extend)
$t->length_set(22)
$t->extend() ## Set length of all columns to match longest
The length* methods assume the table already has columns of equal length. So the length of the table is the length of any field taken at random. We choose the first one in the field list.
Setting the length will truncate or pre-extend every column in the table to a given length as required.
(Pre-extending means setting each column's length via $# so that it has the correct number of entries already allocated (and filled with undef) so that operations that fill up the table can be done much more quickly than with push().
However, if a new column has been added directly, or a table has been constructed out of columns whose length may not initially match, the extend() method may be (should be) called to inspect all columns and extend them all to match the longest one. Note that extend() operates on all fields in the object, ignoring the custom _FieldList if any.
The length of a table with no columns is zero.
SELECTIONS
## Getting or setting the custom selection list itself (_Selection)
$t->selection() ## Get sel if any; else all()
$t->selection_get()
$t->selection($List) ## Set sel (list of rec nums)
$t->selection_set($List)
$t->selection(0) ## Remove sel (select all)
$t->selection_set(undef)
$t->selection_delete()
$t->select_all()
$t->selection_inverse() ## Get inverse copy of selection
$t->select_inverse() ## Invert the selection
$t->selection_validate() ## Remove invalid #s from sel
## List of all rec nums present (regardless of selection)
$t->all()
## Getting or setting just selected fields in columns
## (as contrasted with col() and friends).
$t->sel($ColName) ## Get col but only records in sel
$t->sel_get($ColName)
$t->sel($ColName, $ListRef) ## Set selected fields in col...
$t->sel_set($ColName, $ListRef) ##... in selection order
$t->sel_set($ColName) ## Set selected fields to undef
$t->sel_clear($ColName)
$t->sels($ColList) ## Like cols, but selected fields
$t->sels_hash($ColList) ## " " cols_hash()... " " " "
## Finding out size of selection (number of rows)
$t->sel_len() ## Get number of selected rows.
A selection is an ordered list of record numbers. The record numbers in the selection may be a subset of available records. Furthermore, they may be in non-record-number order, indicating that the records have been sorted.
Record numbers are numeric array indices into the columns in the table. It is an error for any selection list to contain an index less than zero or greater than (length() - 1), so if you set a selection explicitly, be careful.
Any selection list you get or set belongs to the object. Be careful of modifying its contents.
The custom selection, if any, is stored internally in the _Selection parameter. If this parameter is absent, the selection defaults to all() -- i.e. a list of all record numbers, in order: [0..($this->length() - 1)] (which becomes [] if length() is 0).
REMEMBER: length() is one-based, but record numbers are zero-based.
Removing the selection (that is, removing the LIST itself of which records are selected), is the same as selecting all records. consequently, selection(0), selection_delete(), and select_all() are all synonymous.
selection_validate() removes any entries from the current _Selection list (if any) that are not valid record numbers -- i.e. it removes any record whose integer value is < 0 or greater than length() - 1. This routine is mainly used by other methods that might delete records, such as length_set().
Getting or setting just selected data from columns
Sometimes, you don't want to get/set entire columns, you instead want to get or set data in just the selected fields in a column.
The sel(), sel_get(), sel_set(), sels() and sels_hash() methods are analagous to the corresponding col(), ... cols_hash() methods except in these two ways:
- the 'sels' variants get or set just selected data, as determined by the current selection(), which gives an ordered list of the selected / sorted records.
- the 'sels' variants all make COPIES of the data you request or supply -- the data is copied out of or into the correspnding column. So, you "own" any vector you pass or receive in reply.
So, for example, imagine you have just set selection() to only list record numbers where the LastName field is not empty. Then you have called sort() to sort those record numbers by the LastName field. You could then call $t->sel('LastName') to get a sorted list of all non-empty last names.
It might be helpful to think of "sel" as short for "selected". So $t->sel('LastName') would mean "get the selected field values from the LastName field".
SEARCHING / SELECTING RECORDS
## Modifying the table's custom selection (_Selection)
$t->select_all() ## Set _Selection = $t->all() or undef
$t->select_none() ## Set _Selection = []
$t->select_inverse() ## Invert the curr. sel. (and get it)
## Specific searches: "the select() methods"
$t->select($Field1=>$Sub1, ## Del nonmatching recs from sel.
$Field2=>$Sub2, ## i.e. narrow sel. to match
...);
$t->omit ($Field1=>$Sub1, ## Del matching recs from sel.
$Field2=>$Sub2,
...);
$t->add ($Field1=>$Sub1, ## Add matching recs to sel.
$Field2=>$Sub2,
...);
$t->but ($Field1=>$Sub1, ## Add nonmatching recs to sel.
$Field2=>$Sub2,
...);
## Getting useful lists of record numbers...
$t->all() ## Get "full" sel. (all record #s)
$t->selection() ## Get current selection
$t->selection_inverse() ## Get inverse copy of curr. sel.
## Example 1: Refine a selection by narrowing down...
$t->select_all()
$t->select(Field1 => sub {$_});
$t->select(Field2 => sub {$_});
$t->select(Field3 => sub {$_});
## Example 2: Manually refine and set the selection...
$Sel = [grep {$t->col($Field1)->[$_]} @{$t->all ()}];
$Sel = [grep {$t->col($Field2)->[$_]} @$Sel];
$Sel = [grep {$t->col($Field3)->[$_]} @$Sel];
$t->selection($Sel); ## Set the selection when done.
## Example 3: Complex manual search using calculated value
my $A = $t->col('A');
my $B = $t->col('B');
my $S = [grep
{my $X = $A->[$_] + $B->[$_]; ($X > 100 && $X < 200);}
@{$t->all()}]; ## Or could start with $t->selection().
$t->selection($S); ## Set the selection when done.
## Example 4: Refine a selection by building up...
$t->select_none()
$t->add(Field1 => sub {$_});
$t->add(Field2 => sub {$_});
$t->add(Field3 => sub {$_});
## Example 5: Combining the select() methods to build a query...
$t->select_all()
$t->select(Status => sub {/prime/i });
$t->omit (DueDate => sub {$_ > $Today});
$t->add (Force => sub {$_ });
select() and its friends omit(), add(), and but(), known collectively as "the select() methods," all work similarly: they take a series of one or more pairs indicating matches to be done, where each match is specified as (FieldName => Subroutine).
In addition to the field names already present in the table, the FieldName in any Spec may also be one of these two special pseudo-fields:
- _RecNum
-
the record number of the record being compared
- _SelNum
-
the numerical position of the record being compared within the previous selection (only usable with select() and omit() since add() and but() by definition operate on non-selected records).
For example:
## Match 2nd 100 rec numbers
$t->select(_RecNum => sub {$_ >= 100 && $_ <= 199});
## Match 2nd 100 currently selected/sorted items
$t->select(_SelNum => sub {$_ >= 100 && $_ <= 199});
Be careful when using _SelNum in a search. In the above _SelNum search example, since the selection itself will be modified by select(), the items that were formerly selection items 100 - 199 will now be _SelNum 0 - 99 in the new selection.
The Subroutine is an anonymous grep-style predicate that operates on $_ and should return true/false to indicate a match with an element of the field FieldName.
The order of multiple matches in a single method call is significant only in that the searches can be faster if the field that will match the fewest records is listed first.
A given FieldName may be listed in the specs more than once if it has multiple search criteria that you prefer to execute as multiple subroutines (though it would be more efficient on very large tables to combine their logic into one subroutine joined with "&&").
Each field match will be applied (with an implied AND joining them) to determine whether the record itself matches. Then, based on whether the record itself matches, it will either be added or deleted from the selection based on which method is being called:
method... operates on... action....
------------------------------------------------------------------
select() selected records Keep only recs that DO match
omit() selected records Keep only recs that DO NOT match
add() non-selected recs Add recs that DO match
but() non-selected recs Add recs that DO NOT match
Here's how to think about what's going on:
methods... think...
------------------------------------------------------------------
select() "SELECT things matching this"...
omit() "... then OMIT those matching this."
select() "SELECT things matching this"...
add() "... and ADD any others matching this."
select() "SELECT things matching this"...
but() "... and add any others BUT those matching this."
select() and omit() both NARROW the selection.
add() and but() both INCREASE the selection.
IMPORTANT: You DO NOT need to use these select() routines to work with selections. It may be much easier for you to clarify your logic, or more efficient to express your search, using a single grep or series of grep operations as in Examples 2 or 3 above.
Building the selection manually is required if you want to filter based on any COMPLEX RELATIONSHIPS BETWEEN FIELDS. For example, if you want to add two fields and match or reject the record based on the sum of the fields.
In Example 3 above, we add the values in fields "A" and "B" and then match the record only if the SUM is between 100 and 199. By grepping to produce a subset of @{$t->all()}, you end up with a Selection -- a list of record numbers you want "selected". Then you call $t->selection() to put the selection you built into the object.
If you had instead wanted to narrow an existing selection in the above example, you would start with $t->selection() (which defaults to $t->all()) instead of starting with $t->all().
Each of the select() methods returns $this->selection() as a convenience.
The effects of modifying a sorted selection
Generally, you should sort AFTER finding, and you should not generally rely on sort order after doing a find. But in case you want to know, the following chart explains what happens to the sort order after the various select() commands are called (at least in the current implementation, which may change without notice):
method... effect on an existing sort order...
------------------------------------------------------------------
select() relative sort order is preserved (stay sorted)
omit() all selected recs restored to "natural" order (unsorted)
add() orig. recs preserved; new recs appended: "natural" order
but() orig. recs preserved; new recs appended: "natural" order
In other words, you could sort() first and then call select() to narrow down the selection repeatedly without disrupting the sort order. However, any of the other methods will disrupt the sort order and you would need to re-sort. The preservation of order when using select(), and other sort order effects, are likely but not guaranteed to be preserved in future implementations.
Hints about Boolean logic
Consider the following example and the alternative below it. You might initially think these are equivalent, but they're not:
## Case 1:
my $Sel = $t->add(Force => sub {$_ == 1 });
my $Sel = $t->add(Status => sub {$_ eq 'Prime'});
## Case 2:
my $Sel = $t->add(Force => sub {$_ == 1 },
Status => sub {$_ eq 'Prime'});
Case 1 extends the selection by adding all records where Force == 1, and then extends it again by adding all additional records where Status eq 'Prime'.
Case 2 adds only those records where: Force == 1 AND ALSO, IN THE SAME RECORD, Status eq 'Prime'.
One final note about logic. This is not SQL and these select() etc. routines are not meant to replace the full power of a programming language.
If you want full Boolean expressions, use the power of Perl to form your own arbitrarily complex query using grep as in Example 3 above.
Writing your own grep is also almost always faster than chaining the builtin select() methods or using multiple Field / Sub specifications, so keep that in mind when working with extremely large data sets.
With tables of only a few thousand records or so, you probably won't notice the difference in efficiency.
SORTING
## Sort the current table's _Selection
$t->sort() ## Use existing/default params
$t->sort([qw(Last First Phone)]) ## Specify _SortOrder (fields)
$t->sort( ## Named-parameter call:
_SortOrder => [...], ## override sort-related params.
_Selection => [...], ## (See param lists above).
_SortSpecs => {...},
_SRoutines => {...},
_DefaultSortType=>'Integer',
_DefaultSortDirection=>-1,
);
The sort() method modifies the _Selection (creating one with all records if it was missing, undef, or not supplied by caller) so that the record numbers listed there are sorted according to the criteria implied by _SortOrder, _SortSpecs, _SRoutines, etc.
For example, before sorting, a table's "natural" order might be:
Rec# First Last Age State
0 Chris Zack 43 CA
1 Marco Bart 22 NV
2 Pearl Muth 15 HI
... and its selection() method would yield: [0, 1, 2] -- which is a list of all the records, in order.
After calling $t->sort([Last]), selection() would yield [1, 2, 0]. So displaying the table in "selection" order would yield:
Rec# First Last Age State
1 Marco Bart 22 NV
2 Pearl Muth 15 HI
0 Chris Zack 43 CA
IMPORTANT: sorting does not alter any data in the table. It merely alters the _Selection parameter (which you can then get and set using the selection() methods described above).
If you want to permanently alter the table's data in memory so that the new sorted order becomes the "natural" order, you can use the cull() method to modify the original object, the snapshot() method to make a new object, or use the write() method to write the data to disk in selected/sorted order and then read() it back again.
Using the Named-parameter calling convention with sort()
You may specify any combination of the parameters listed above when calling sort(). Any you specify will be used IN PLACE OF the corresponding parameters already found in the object.
If you specify _Selection using the named-parameter calling, the sort() method reserves the right to "own" the list you provide, and use it as the object's new _Selection, possibly discarding the previous _Selection, if any and modifying the one you provided. So don't make any assumptions about ownership of that list object after calling sort(). Luckily, you will rarely need to provide _Selection explicitly since generally you'll want to be sorting the selection() already inherent in the object.
sort() returns the _Selection list owned by the object (the same list that would be returned if you called the selection() method immediately after calling sort()).
See the next sections for complete descriptions of _SortOrder and other sorting parameters.
SORT ORDER
## Getting / Setting table's default _SortOrder
$t->sortorder() ## Get sortorder (default is [])
my $Order = [qw(Last First State Zip)];
$t->sortorder($Order) ## Set sortorder (use [] for none)
$t->sortorder_default() ## Get the object's default sort order ([])
The sort order is an optional list of field names on which to sort and sub-sort the data when sorting is requested. The field names must be the names of actual columns in the table. The names in the sort order do not necessarily need to coincide with the custom fieldlist if any.
There is one special value that can be included: _RecNum. This sorts on the imaginary "record number" field. So for example, you could specify a sort order this way:
[qw(Last First _RecNum)]
(There is no point in putting _RecNum anywhere except at the end of the sort order because no two records will ever have the same record number so there will be no further need to disambiguate by referring to additional fields.)
Sorting by _RecNum adds a bit of computational overhead because sort() first builds a record number vector for use in sorting, so for very large tables, don't do it unless you really need it.
A sort order can be specified each time the object is sorted (see the sort() method for details).
Or, the object's sort order can be set once, and then sort() will use that sort order when no other sort order is specified.
If sorting is done when there is no sort order present in the object or specifed for the sort() method, the selection is sorted by record number (i.e. it is "unsorted" or returned to its "natural" order).
In order words, a sortorder that is undef or [] is considered the same as: [qw(_RecNum)]. This is sometimes called "unsorting".
In order to decide how values in each field should be compared, sort() is informed by SortSpecs (specifying SortType and SortDirection for each field) and by SortRoutines, each of which may similarly either be pre-set for the object or specified when calling sort() -- see below for further details.
SORT SPECIFICATIONS
## Getting / Setting table's default _SortSpecs
$t->sortspecs() ## Get sortspecs (default is {} -- none)
my $Specs = {Last => {SortType => 'String' ,
SortDirection => -1 },
Zip => {SortType => 'Integer' }};
$t->sortspecs($Specs) ## Set sortspecs
$t->sortspecs_default() ## Get the object's default sort specs ({})
The sortspecs are an optional hash mapping field names to "sort specifications".
Each field's sort specification may specify zero or more of these values:
- SortType
-
the sort type to use (For example: String, Integer)
- SortDirection
-
the sort direction (1: ascending, -1: descending)
Sortspecs can be specified when calling the sort() routine, or, a set of specs can be placed beforehand into the object itself and those will be used by sort() if no other specs are given.
For any field listed in the sort order at the time of sorting, but lacking a sort spec or any component of the sort spec, the object's default sort type (see sorttype_default()) and default sort direction (see sortdirection_default()) will be used.
In addition to getting/setting sort specs as a whole, they may be gotten/set on a per-field basis, too:
sortspec($Field) ## Get sortspec for $Field or default spec
my $Spec = {SortType => 'Integer', SortDirection => -1};
sortspec('Zip', $Spec) ## Set sortspec
sortspec_default() ## Get a sortspec with all defaults filled in
For any $Field not found in the object's sortspecs, sortspec($Field) returns the same thing returned by sortspec_default(), which is a sortspec filled in with the default sort type and sort direction (see below).
For a list of available built-in SortTypes, and instructions for how to define your own, see SORT ROUTINES, below.
DEFAULT SORT DIRECTION
## Getting / Setting table's default _DefaultSortDirection
$t->sortdirection_default() ## Get default sort direction
$t->sortdirection_default(-1) ## Set default sort direction
Each element in a sort specification can optionally specify a sort direction.
1 = ascending, -1 = descending
For any sort specs that don't specify a direction, the object's default sort direction will be used. Use these routines to get/set the default sort direction.
DEFAULT SORT TYPE
## Getting / Setting table's default _DefaultSortType
$t->sorttype_default() ## Get default sort type
$t->sorttype_default('Integer') ## Set default sort type
Each element in a sort specification can optionally specify a sort type. The sort type is a string (like 'String' or 'Integer' or 'Date') that selects from one or more sort routines. (See Sort Routines, below).
There are several sort routines built into the CTable object, and you can also add as many of your own routines (and hence Sort Types) as you like or need. This allows for very flexible sorting.
For any sort specs that don't specify a type, the object's default sort type will be used. Use these routines to get/set the default sort type, which initially is 'String'.
SORT ROUTINES: BUILTIN AND CUSTOM
## Getting / Setting table's custom sort routines (_SRoutines)
$t->sortroutine($Type) ## Get a sort routine for $Type
$t->sortroutine($Type, $Sub) ## Set a sort routine for $Type
$t->sortroutine($Type, 0 ) ## Remove sort routine for $Type
$t->sortroutine_set($Type)
$t->sortroutines() ## Get hash of any sort routines
$t->sortroutines_builtin() ## Get hash of builtin routines
Each SortType in the sortspecs should have a corresponding sort routine (any unrecognized type will be sorted using the 'String' sort routine).
The sort() command looks up the appropriate sort routine for each field it is asked to sort, based on the SortType for that field, as given in the sortspecs, as described above.
Builtin sort types, recognized and implemented natively by this module, are:
String ## Fastest case-sensitive compare (data is string)
Text ## Insensitive compare (lowercases, then compares)
Number ## Number works for floats or integers
Integer ## Faster than floats. Uses "use integer"
DateSecs ## Same as integer; assumes date in seconds
Boolean ## Treats item as a Perlish boolean (empty/undef = false)
The above sort types are always recognized. Additional sort types may be added by subclasses (and could shadow the builtin implementations of the above types if desired) and/or may be added to instances (and again could shadow the above implementations), and/or may be specified when the sort() method is called, once again optionally shadowing any deeper definitions.
CUSTOM SORT ROUTINE INTERFACE
A custom sort routine is called with two arguments, each of which is a pointer to a scalar. The sort routine should dereference each pointer and compare the resulting scalars, returning -1 if the first scalar is smaller than the second, 1 if it is larger, and 0 if they are considered equal.
For example, here is the built-in comparison routine for 'String':
sub { $ {$_[0]} cmp $ {$_[1]} }
NOTE: Your custom sort routines should NOT compare $a and $b as with Perl's builtin sort() command.
Examine the variable $BuiltinSortRoutines in this module's implementation to see some additional examples of sort routines.
Internally, sort() calls the sortroutines() method to get a hash that should consist of all builtin sort routines with the per-object sort routines, if any, overlaid. sortroutines() in turn calls the sortroutines_builtin() method to get a copy of the hash of all builtin sort routines for the object. (So a subclass could easily add additional SortTypes or reimplement them by just overriding sortroutines_builtin() and adding its own additional routines to the resulting hash.)
sortroutine() may be called to get or set a custom sort routine for a given type in the given object.
There is no way to directly manipulate the builtin sort routines for the entire class. To accomplish that, you should define and use a subclass that extends sortroutines_builtin() to add its own routines.
For example:
BEGIN
{ ## A subclass of Data::CTable with an INetAddr SortType.
package IATable; use vars qw(@ISA); @ISA = qw(Data::CTable);
sub sortroutines_builtin
{
my $this = shift;
my $CustomRoutines =
{INetAddr =>
sub {use integer; ip2int($ {$_[0]}) <=> ip2int($ {$_[1]})}};
my $AllRoutines =
{%{$this->SUPER::sortroutines_builtin()} %$CustomRoutines};
return($AllRoutines);
};
sub ip2int {.....} $# Could memoize & inline for efficiency
}
my $Table = IATable::new(......);
The IATable class would then have all the same features of Data::CTable but would then also support the INetAddr SortType.
FREEZING SELECTION & FIELD LIST
## Freeze data layout: re-order columns; omit unused fields
$t->cull(...params...) ## Rebuild table in order
my $s = $t->snapshot(...params...) ## Make copy as if rebuilt
The cull() method re-writes all data in the table to be in the order indicated in _Selection (if present). This will cause any records not listed in _Selection to be omitted (unless selection is null in which case all records are retained in original order).
In addition, if there is a custom field list present, it removes any fields NOT mentioned in _FieldList.
The snapshot() method is similar, except instead of modifying the object itself, it makes a copy of the object that's equivalent to what cull() would have created, and returns that new object, leaving the original untouched. (All data structures are deep-copied from the old object to the new one, leaving the objects totally independent.)
cull() and snapshot() both take two optional named parameters: _FieldList and/or _Selection to be used in place of the corresponding parameters found in the object.
If only a single argument is supplied, it is assumed to be _Selection.
LINE ENDINGS
## Get current value
$t->lineending() ## Get actual setting: string or symbol
$t->lineending_symbol() ## Get setting's symbolic value if possible
$t->lineending_string() ## Get setting's string value if possible
## Set value
$t->lineending($Ending) ## Will be converted internally to symbol
## Convert a value to symbol or string form
$t->lineending_symbol($L) ## Convert string form to symbolic form
$t->lineending_string($L) ## Convert symbol form to string form
## Get internal conversion hash tables
$t->lineending_symbols() ## Hash ref mapping known strings to symbols
$t->lineending_strings() ## Hash ref mapping known symbols to strings
Use these accessor functions to get/set the _LineEnding parameter.
You can set the parameter in either string or symbol form as you wish. You can get it in its raw, as-stored, form, or, you can get it in string form or symbol form as desired.
Finally, some utility conversion calls allows you to convert a string you have on hand to a symbolic form. For example:
$L = "\x0D";
print ("This file uses " . $t->lineending_symbol($L) . " endings.");
This would print:
This file uses mac endings.
AUTOMATIC CACHEING
By default, Data::CTable makes cached versions of files it reads so it can read them much more quickly the next time. Optionally, it can also cache any file it writes for quicker re-reading later.
On Unix systems, cache files are always created with 0666 (world-write) permissions for easy cleanup.
When reading files, Data::CTable checks the _CacheOnRead parameter. If that parameter is true, which it is by default, the module tries to find an up-to-date cache file to read instead of the original. Reading a cache file can be 10x faster than reading and parsing the original text file.
In order to look for the cache file, it must first calculate the path where the cache file should be located, based on the _FileName of the file to be read.
The path of the cache file is calculated as follows:
If the _CacheSubDir parameter is a RELATIVE PATH, then it is appended to the directory component of _FileName to arrive at the directory to use to store the cache file. If it is an ABSOLUTE PATH, then _CacheSubDir is used by itself. (The trailing path separator is optional and an appropriate one will be added by Data::CTable if it is missing.)
The file name of the cache file is calculated as follows:
If the _CacheExtension parameter is specified, it is appended to the base file name component from the _FileName parameter. If you want the cached file name to be the same as the name of the original file, you can set _CacheExtension to "", which is not recommended.
Then, the cache path and cache file name are joined to arrive at the name of the cache file. If both _CacheSubDir and _CacheExtension were empty, then the cache file path will be the same as the _FileName, and Data::CTable will refuse to either read or write a cache file, so setting these fields both to empty is equivalent to setting _CacheOnRead to false.
The cache file contains a highly-efficient representation of all the following data that would otherwise have to be determined by reading and parsing the entire text file:
- All the data columns (field values)
- _FieldList: The list of fields, in order
- _HeaderRow: Whether a header row is / should be present
- _LineEnding: The line ending setting
- _FDelimiter: The field delimiter setting
If found prior to a read(), AND, the date of the cache file is LATER than the date of the original file, the cache file is used instead. (If the date is EARLIER, then the cache file is ignored because it can be presumed that the data inside the text file is newer.)
If cacheing is ON, then after successfully reading the text file (either because there was no cache file yet or the cache file was out of date or corrupted or otherwise unusable), read() will then try to create a cache file. This, of course, takes some time, but the time taken will be more than made up in the speedup of the next read() operation on the same file.
If creating the cache file fails (for example, because file permissions didn't allow the cache directory to be created or the cache file to be written), read() generates a warning explaining why cacheing failed, but the read() operation itself still succeeds.
No parameters in the object itself are set or modified to indicate the success or failure of writing the cache file.
Similarly, there is no way to tell whether a successful read() operation read from the cache or from the original data file. If you want to be SURE the reading was from the data file, either turn off _CacheOnRead, or call the read_file() method instead of read().
NOTE: because the name of the cache file to be used is calculated just before the read() is actually done, the cache file can only be found if the _CacheSubDir and _CacheExtension are the same as they were when the cache was last created. If you change these parameters after having previously cached a file, the older caches could be "orphaned" and just sit around wasting disk space.
Cacheing on write()
You may optionally set _CacheOnWrite (default = false) to true. If done, then a cache file will be saved for files written using the write() command. Read about write() below for more about why you might want to do this.
AUTOMATIC DIRECTORY CREATION
When Data::CTable needs to write a file, (a cache file or a data file), it automatically tries to create any directories or subdirectories you specify in the _FileName or _CacheSubDir parameters.
If it fails while writing a data file, write() will fail (and you will be warned). If it fails to create a directory while writing a cache file, a warning will be issued, but the overall read() or write() operation will still return a result indicating success.
Any directories created will have the permissions 0777 (world-write) for easy cleanup.
Generally, the only directory the module will have to create is a subdirectory to hold cache files.
However, since other directories could be created, be sure to exercise caution when allowing the module to create any directories for you on any system where security might be an issue.
Also, if the 0666 permissions on the cache files themselves are too liberal, you can either 1) turn off cacheing, or 2) call the prep_cache_file() method to get the name of the cache file that would have been written, if any, and then restrict its permissions:
chmod (0600, $this->prep_cache_file());
READING DATA FILES
## Replacing data in table with data read from a file
$t->read($Path) ## Simple calling convention
$t->read( ## Named-parameter convention
## Params that override params in the object if supplied...
_FileName => $Path, ## Full or partial path of file to read
_FieldList => [...], ## Fields to read; others to be discarded
_HeaderRow => 0, ## No header row (_FieldList required!)
_IgnoreQuotes => 0, ## Load files with unbalanced quotes
_LineEnding => undef, ## Text line ending (undef means guess)
_FDelimiter => undef, ## Field delimiter (undef means guess)
_ReturnMap => 1, ## Whether to decode internal returns
_ReturnEncoding=>"\x0B", ## How to decode returns.
_MacRomanMap => undef, ## Whether/when to read Mac char set
_CacheOnRead => 0, ## Enable/disable cacheing behavior
_CacheExtension=> ".x", ## Extension to add to cache file name
_CacheSubDir => "", ## (Sub-)dir, if any, for cache files
## Params specific to the read()/write() methods...
_MaxRecords => 200, ## Limit on how many records to read
)
$t->read_file() ## Internal: same as read(); ignores cacheing
read() opens a Merge, CSV, or Tab-delimited file and reads in all or some fields, and all or some records, REPLACING ANY EXISTING DATA in the CTable object.
Using the simple calling convention, just pass it a file name. All other parameters will come from the object (or will be defaulted if absent). To specify additional parameters or override any parameters in the object while reading, use the named-parameter calling convention.
See the full PARAMETER LIST, above, or read on for some extra details:
_ReturnMap controls whether return characters encoded as ASCII 11 should be mapped back to real newlines ("\n"
) when read into memory. If false, they are left as ASCII 11 characters. (default is "true")
_ReturnEncoding controls the character that returns are encoded as, if different from ASCII 11.
_FieldList is an array (reference) listing the names of fields to import, in order (and will become the object's _FieldList upon successful completion of the read() operation). If not provided and not found in the object, or empty, then all fields found in the file are imported and the object's field list will be set from those found in the file, in the order found there. If _HeaderRow is false, then this parameter is required (either in the object or as a formal parameter) and is assumed to give the correct names for the fields as they actually occur in the file. If _HeaderRow is true and _FieldList is provided, then _FieldList specifies the (sub-)set of fields to be read from the file and others will be ignored.
_HeaderRow, which defaults to true, if set to false, tells read() to not expect a header row showing the field names in the file. Instead, it assumes that the _FieldList gives those (and _FieldList must therefore be specified either as a parameter or an existing parameter in the object).
_IgnoreQuotes is false by default. If true then we'll ignore quotes in the file upon import, which makes the loading of files with unbalanced quotes possible.
_MaxRecords (optional) is an upper limit on the number of fields to import. If not specified, or zero, or undef, then there is no limit; all records will be imported or memory will be exhausted.
read() returns a Boolean "success" code.
If read() returns false, then it will also have set the _ErrorMsg parameter in the object. It may or may not have partially altered data in the object if an error is encountered.
After a successful read:
fieldlist() (the object's _FieldList parameter) tells which fields were actually read, in what order. It may omit any fields requested in _FieldList that were not actually found in the file for whatever reason.
length() tells how many fields were read.
The selection() is reset to no selection (all selected / unsorted)
The object's _FileName parameter contains the path to the file that was read. If the _FileName you specified did not have a path, then _FileName will be prepended with a path component indicating "current directory" (e.g. "./" on Unix).
_FDelimiter will contain the actual delimiter character that was used to read the file (either tab or comma if the delimiter was guessed, or whatever delimiter you specified).
_LineEnding will contain the actual line-ending setting used to read the file. This will be either "mac" ("\x0D"), "unix" ("\x0D"), or "dos" ("\x0D\x0A") if the line endings were guessed by read(). Otherwise it will be whatever _LineEnding you specified.
FILE FORMAT NOTES
As mentioned, read() allows the following flexibilities in reading text-based tabular data files:
You may specify the line endings (record delimiters), or it can guess them (mac, unix, dos are supported).
You may specify the field delimiters, or it can guess them (tab and comma are supported).
It can get field names from a header row, or, if there is no header row, you can tell it the field names, in order.
You can tell it whether or not to decode embedded returns in data fields, and if so, which character they were encoded as.
Beyond supporting the above flexible options, read() makes the following non-flexible assumptions:
Fields must NOT contain unencoded returns -- that is: whatever character sequence is specified for _LineEnding will NEVER occur inside a field in the text file; in addition, the current platform's definition of "\n"
will NEVER occur; these characters if present in field data, MUST have been encoded to some safe character string before the file was created.
Each field may OPTIONALLY be surrounded with double-quote marks. However, if the field data itself contains either a double-quote character ("
) or the current file's field delimiter (such as tab or comma), then the field MUST be surrounded with double-quotes. (Currently, all data written by Data::CTable have all field values surrounded by double-quotes, but a more selective policy may be used in the future.)
If a field contains a double-quote character, then each double-quote character in the field must be encoded as ""
-- i.e. each "
in the original data becomes ""
in the text file.
Data files may not mix line-ending types or field delimiter types. Once determined, the same endings and delimiters will be used to read the entire file.
The fields recognized on each line will either be determined by the header row or the _FieldList provided by the caller. Any extra fields on any given line will be ignored. Any missing fields will be treated as undef/empty.
If you are having trouble reading a delimited text file, check that all data in the file obeys these assumptions.
WRITING DATA FILES
## Writing some or all data from table into a data file
$t->write($Path) ## Simple calling convention
$t->write( ## Named-parameter convention
## Params that override params in the object if supplied...
_FileName => $Path, ## "Base path"; see _WriteExtension
_WriteExtension=> ".out",## Insert/append extension to _FileName
_FieldList => [...], ## Fields to write; others ignored
_Selection => [...], ## Record (#s) to write; others ignored
_HeaderRow => 0, ## Include header row in file
_LineEnding => undef, ## Record delimiter (default is "\n")
_FDelimiter => undef, ## Field delimiter (default is comma)
_ReturnMap => 1, ## Whether to encode internal returns
_ReturnEncoding=>"\x0B", ## How to encode returns
_MacRomanMap => undef, ## Whether/when to write Mac char set
_CacheOnWrite => 1, ## Enable saving cache after write()
_CacheExtension=> ".x", ## Extension to add to cache file name
_CacheSubDir => "", ## (Sub-)dir, if any, for cache files
## Params specific to the read()/write() methods...
_MaxRecords => 200, ## Limit on how many records to write
)
$t->write_file() ## Internal: same as write(); ignores cacheing
write() writes a Merge, CSV, or Tab-delimited file.
It uses parameters as described above. Any parameters not supplied will be gotten from the object.
Using the simple calling convention, just pass it a path which will override the _FileName parameter in the object, if any.
All other parameters will come from the object (or will be defaulted if absent).
If no _FileName or path is specified, or it is the special string "-" (dash), then the file handle \ * STDOUT will be used by default (and you could redirect it to a file). You can supply any open file handle or IO::File object of your own for the _FileName parameter.
If write() is writing to a file handle by default or because you specified one, then no write-cacheing will occur.
To specify additional parameters or override any parameters in the object while reading, use the named-parameter calling convention.
If the object's data was previously filled in using new() or read(), then the file format parameters from the previous read() method will still be in the object, so the format of the written file will correspond as much as possible to the file that was read().
write() returns the path name of the file actually written, or the empty string if a supplied file handle or STDOUT was written to, or undef if there was a failure.
If write() returns undef, then it will also have set the _ErrorMsg parameter in the object.
write() never modifies any data in the object itself.
Consequently, if you specify a _FieldList or a _Selection, only those fields or records will be written, but the corresponding parameters in the object itself will be left untouched.
How write() calculates the Path
The _FileName parameter is shared with the read() method. This parameter is set by read() and may be overridden when calling write().
In the base implementation of Data::CTable, write() will try not to overwrite the same file that was read, which could possibly cause data loss.
To avoid this, it does not use the _FileName parameter directly. Instead, it starts with _FileName and inserts or appends the value of the _WriteExtension parameter (which defaults to ".out") into the file name before writing.
If the _FileName already has an extension at the end, write() will place the _WriteExtension BEFORE the final extension; otherwise the _WriteExtension will be placed at the end of the _FileName.
For example:
Foobar.txt ==> Foobar.out.txt
Foobar.merge.txt ==> Foobar.merge.out.txt
My_Merge_Data ==> My_Merge_Data.out
If you DON'T want write() to add a _WriteExtension to _FileName before it writes the file, then you must set _WriteExtension to empty/undef either in the object or when calling write(). Or, you could make a subclass that initializes _WriteExtension to be empty. If _WriteExtension is empty, then _FileName will be used exactly, which may result in overwriting the original data file.
Remember: write() returns the path name it actually used to successfully write the file. Just as with read(), if the _FileName you specified did not have a path, then write() will prepend a path component indicating "current directory" (e.g. "./" on Unix) and this will be part of the return value.
Cacheing with write()
By default, Data::CTable only creates a cached version of a file when it reads that file for the first time (on the assumption that it will need to read the file again more often than the file's data will change.)
But by default, it does not create a cached version of a file when writing it, on the assumption that the current program probably will not be re-reading the written file and any other program that wants to read it can cache it at that time.
However, if you want write() to create a cache for its output file, it is much faster to create it on write() than waiting for the next read() because the next read() will be able to use the cache the very first time.
To enable write-cacheing, set _CacheOnWrite to true. Then, after the write() successfully completes (and only if it does), the cached version will be written.
FORMATTED TABLES (Using Data::ShowTable)
## Get formatted data in memory
my $StringRef = $t->format(); ## Format same data as write()
my $StringRef = $t->format(10); ## Limit records to 10
my $StringRef = $t->format(...); ## Specify arbitrary params
print $$StringRef;
## Write formatted table to file or terminal
$t->out($Dest, ....);## $Dest as follows; other params to format()
$t->out($Dest, 10, ....) ## Limit recs to 10; params to format()
$t->out() ## print formatted data to STDOUT
$t->out(\*STDERR) ## print to STDERR (or any named handle)
$t->out("Foo.txt") ## print to any path (file to be overwritten)
$t->out($FileObj) ## print to any object with a print() method
out() takes a first argument specifying a destination for the output, then passes all other arguments to format() to create a nice-looking table designed to be human-readable; it takes the resulting buffer and print()s it to the destination you specified.
Sample output:
+-------+------+-----+-------+
| First | Last | Age | State |
+-------+------+-----+-------+
| Chris | Zack | 43 | CA |
| Marco | Bart | 22 | NV |
| Pearl | Muth | 15 | HI |
+-------+------+-----+-------+
(Note extra space character before each line.)
The destination may be a file handle (default if undef is \*STDOUT), a string (treated as a path to be overwritten), or any object that has a print() method, especially an object of type IO::File.
The main purpose of out() is to give you a quick way to dump a table when debugging. out() calls format() to create the output, so read on...
format() produces a human-readable version of a table, in the form of a reference to a string buffer (which could be very large), and returns the buffer to you. Dereference the resulting string reference before using.
If format() is given one argument, that argument is the _MaxRecords parameter, which limits the length of the output.
Otherwise, format() takes the following named-parameter arguments, which can optionally override the corresponding parameters, if any, in the object:
_FieldList ## Fields to include in table
_Selection ## Records to be included, in order
_SortSpecs ## SortType controls number formatting
_DefaultSortType
_MaxRecords ## Limit number of records output
_MaxWidth ## Limit width of per-col. data in printout
format() will obey _MaxRecords, if you'd like to limit the number of rows to be output. _MaxRecords can also be a single argument to format(), or a second argument to out() if no other parameters are passed.
format() also recognizes the _SortSpecs->{SortType} and _DefaultSortType parameters to help it determine the data types of the fields being formatted. Fields of type "Number" are output as right-justified floats; "Integer" or "Boolean" are output as right-justified integers, and all others (including the default: String) are output as left-justified strings.
In addition, there is one parameter uniquely supported by format() and out():
_MaxWidth specifies the maximum width of columns. If unspecifed, this will be 15; the minimum legal value is 2. Each column may actually take up 3 more characters than _MaxWidth due to divider characters.
The data to be output will be examined, and only the necessary width will be used for each column. _MaxWidth just limits the upper bound, not the lower.
Data values that are too wide to fit in _MaxWidth spaces will be truncated and the tilde character "~" will appear as the last character to indicate the truncation.
Data values with internal returns will have the return characters mapped to slashes for display.
format() and out() will NOT wrap entries onto a second line, like you may have seen Data::ShowTable::ShowBoxTable do in some cases. Each record will get exactly one line.
format() and out() ignore the _HeaderRow parameter. A header row showing the field names is always printed.
format() and out() make no attempt to map upper-ascii characters from or to any particular dataset. The encoding used in memory (generally ISO 8859-1 by default) is the encoding used in the output. If you want to manipulate the encoding, first call format(), then change the encoding, then format the resulting table.
APPENDING / MERGING / JOINING TABLES
## Append all records from a second table
$t->append($Other) ## Append records from $Other
$t->append_file($File, $Params) ## Append from new($Params, $File)
$t->append_files($Files, $Params) ## Call append_file for all files
$t->append_files_new($Files, $Params) ## Internal helper routine
## Combine all fields from a second table
$t->combine($Other) ## Combine fields from $Other
$t->combine_file($File, $Params) ## Combine new($Params, $File)
$t->combine_files($Files, $Params) ## combine_file on each file
## Left-join records from a second table (lookup field vals)
$t->join ($Other, $KeyField1, [$KeyField2, $Fields])
$t->join_file ($File, $Params, $KeyField1, [$KeyField2, $Fields])
$t->join_files($Files, $Params, $KeyField1, [$KeyField2, $Fields])
The append() method concatenates all the records from two CTable objects together -- even if the two tables didn't start out with exactly the same fields (or even any of the same fields).
It takes all the data records from another CTable object and appends them into the present table. Any columns present in the $Other table but not in the first table, are created (and the corresponding field values in the first table will all be empty/undef). Similarly, any columns present in $t but not present in $Other will be extended to the correct new length as necessary and the field values in the original columns will be empty/undef. Columns present in both will, of course, have all the data from both the original sets of data.
All data from the second table is brought into the first one. No attempt whatsoever is made to eliminate any duplicate records that might result.
The number of records (length()) after this call is the sum of the length() of each of the tables before the operation.
IMPORTANT NOTE: The data from the $Other object is COPIED in memory into the new object. This could be hard on memory if $Other is big. Might want to be sure to discard $Other when you're done with it.
$Other is left untouched by the operation.
All columns from both tables are combined whether or not they are mentioned in the custom field list of either.
The custom field lists, if present in either table object, are concatenated into this object's custom field list, but with duplications eliminated, and order retained.
Any existing custom selections, custom sort order, sort specs, and/or sort routines are also combined appropriately, with settings from this object taking precedence over those from $Other anywhere the two have conflicting settings.
append_file() takes a file name and optional $Params hash. It uses those to create a new() object with data read from the file. Then, the new table is appended to $t using append() and then the new table is discarded.
append_files() is a convenience function that calls append_file() on each file in a list, using the same optional $Params for each.
append_files_new() is the internal routine that implements the processing done by new() on the optional list of files to be read. It does the following: It calls read() on the first file in the list. Then, it calls append_files() to read the remaining into their own new() objects of the same class as $t and using the same $Params to new() (if any were supplied). Then each of these is append()-ed in turn to $t and discarded. The final result will be that $t will hold a concatenation of all the data in all the files mentioned. However, consistent with the behavior of append(), the _FileName parameter and other read()-controlled settings will correspond to the first file read. The intermediate objects are discarded.
NOTE: As with new() and read(), if a non-empty _FieldList Param is specified, the read() methods called internally by the append_file*() methods will read only the fields mentioned and will ignore any other fields in the files.
Combining tables
combine() adds columns from a second table into the current one.
CAUTION: You should only use combine() when you have two tables where all the (possibly selected) records in the second table line up perfectly with all the (unselected) records in the first table -- in other words, each table before combine() should contain a few of the columns of the new table -- for example, maybe one table contains a column of file names, and the other contains columns of corresponding file sizes and modification times. If you don't understand the consequences of combine, don't use it or you could end up with some records whose field values don't refer to the same object. (Maybe you meant to use append() or join() instead.)
If the second table has a custom field list, only those columns are brought in.
If any column in the second table has the same name as one in the current table, the incoming column replaces the one by the same name.
All columns are COPIED from the second table, so the first table owns the new data exclusively.
If the second table has a selection, only those records are copied, in selection order. (select_all() first if that's not what you want.)
The selection in the first table, if any, is ignored during the combine. If this isn't what you want, then consider using cull() before combine().
Field list and sort order are concatenated (but retaining uniqueness: second mentions of a field in the combined lists are omitted).
Custom sort routines and sort specs are combined, with those in the first table taking precedence over any copied in with the same name.
The custom _Selection from the first table, if any, is retained. (It will initially omit any records added by extend()).
All other parameters from the first table are retained, and from the second table are ignored.
combine() calls extend() after combining to ensure that all columns have the same length: if either the older or newer columns were shorter, they will all be set to the length of the longest columns in the table -- creating some empty field values at the end of the lengthened columns.
combine_file() does the same as combine() except starting with a file name, first creating the $Other object by creating it using new($Params, $File), then discarding it after combining.
Joining tables (Looking up data from another table)
join() looks up field values from a second table, based on common values in key fields which may have different or the same names in each table. It adds columns to the current table if necessary to hold any new field values that must be brought in.
join() never adds any new or non-matching records to the table: records where the lookup fails will simply have empty/undef values in the corresponding columns.
## Example:
$t->join ($People, 'FullName', 'FirstAndLast'); ## or
$t->join_file("People.txt", {}, 'FullName', 'FirstAndLast');
Here's how join() calculates the list of fields to bring in:
- Legal field names from the optional $Fields list, if supplied
- Otherwise, the fieldlist() from second table
- ... minus any fields with same name as $KeyField1 or $KeyField2
Join starts by adding new empty columns in the first table for any field to be brought in from the second but not yet present in the first.
Here's how join() calculates the records eligible for lookup:
- Join only modifies the selected records in the first table
- Join only looks up values from selected records in second table
(If you want all records to be used in both or either table, call the table's select_all() method before calling join().)
Then, for every selected record in $t (using the example above), join examines the FullName field ($KeyField1), and looks up a corresponding entry (must be 'eq') in the FirstAndLast field ($KeyField2) in the second table.
IMPORTANT NOTE ABOUT KEY LENGTH: To speed lookup, hash-based indices are made. The strings in $Key1 and $Key2 fields should not be so long that the hash lookups bog down or things could get ugly fast. There is no fixed limit to hash key length in Perl, but fewer than 128 characters in length is longer than customary for such things. (Many systems require text-based keys to be no longer than 31 characters.) So be judicious about the values in $Key1 and $Key2 fields.
The first record found in the second table's selection with a matching value in the key field is then copied over (but only the appropriate fields are copied, as explained above). Any field values being brought over will REPLACE corresponding field values in the first table, possibly overwriting any previous values if the field being looked up was already present in the first table and contained data.
The first table's _FieldList is updated to reflect new fields added.
Its _Selection is untouched.
Its _SortOrder is untouched.
Its _SortSpecs are augmented to include any entries from the second table that should be brought over due to the field additions.
Its _SRoutines are augmented to add new ones from the second table.
All other parameters of table 1 are untouched.
The second table is not modified. No data structures will be shared between the tables. Data is only copied.
join_file() calls join() after creating a seond table from your $File.
join_files() calls join_file() repeatedly for each file in a list, but it is important to note that each file in the list of files to be joined must have a $Key2 field -- AND, that any values looked up from the second file will overwrite any values of the same key found in the first file, and so on. You probably will not ever need join_files(). It is mainly here for completeness.
INVERTING A TABLE'S ROWS/COLUMNS
## Re-orient table's data using vals from $ColName as field names...
$t-invert($ColName)
Sometimes a situation gives you a table that's initially organized with column data in rows, and field names in one of the columns, so you need to flip the table in order to be able to work meaningfully with it.
"Inverting" a table means to rewrite each row as a column. One row is designated to be used as the field names.
For example, consider this table:
F01 F02 F03 F04
------------------------
First Chris Agnes James
Last Bart Marco Nixon
Age 22 33 44
Calling invert() using field names from "F01"...
$t->invert('F01');
... would change the table to look like this:
First Last Age
----------------
Chris Bart 22
Agnes Marco 33
James Nixon 44
The field F01 which formerly contained the field names, is now gone, and the remaining data columns have been converted from their old row orientation into a column orientation.
PROGRESS MESSAGES
## Printing a progress message....
$t->progress($Msg) ## Print a message per current settings
## Progress settings applying to this object only...
$t->progress_get() ## Get current progress setting
$t->progress_set(1) ## Use progress_default() method
$t->progress_set($Sub) ## Set a custom progress routine
$t->progress_set(0) ## Disable progress
$t->progress_set(undef) ## Use class's settings (default)...
## Class's settings (for instances with _Progress == undef)
$t->progress_class() ## Get current setting.
$t->progress_class(1) ## Use progress_default() method
$t->progress_class($Sub) ## Set shared custom prog routine
$t->progress_class(0) ## Disable class-default progress
Data::CTable->progress_class(..) ## Call without an object
## Call builtin default progress method regardless of settings
$t->progress_default($Msg) ## Default prog. routine for class
## Generate a warning (used internally by other methods)
$t->warn($Msg) ## In this class, calls progress_default()
## Timed progress: print msg to start, then at most once/2 sec
$t->progress_timed($Op, $Item) ## Re-print msg every 2 sec
$t->progress_timed($Op, $Item, $Pos, $Tot) ##... with % readout
$t->progress_timed($Op, $Item, $Pos, $Tot, $Wait) ## Not 1st x
$t->progress_timed_default($Msg) ## Called by progress_timed
Data::CTable is especially useful in creating batch-oriented applications for processing data. As such, routines that may perform time-consuming tasks will, by default, generate helpful progress messages. The progress mechanism is highly customizable, however, to suit the needs of applications that don't require this output, or that require the output to go somewhere other than STDERR or the console.
The default progress routine is one that prints a message with a date/time stamp to STDERR if and only if STDERR is an interactive terminal, and otherwise is silent.
You could write a custom progress routine that does something else or something in addition (e.g. logs to a file or syslog). The custom routine could either be implemented by overriding the progress_default() method in a subclass, or by calling progress_set() in any instance.
The custom progress routine, if any, is stored in the _Progress parameter of the object. But use progress_set() and progress_get() to access it.
The interface for your custom progress routine should be:
sub MyProgress {my ($Obj, $Message) = @_; chomp $Message; .....}
In other words, the routine takes a single message which may or may not have a trailing newline. It should always chomp the newline if present, and then do its business... which generally will include printing or logging a message (usually with a newline added).
The default, built-in progress routine for Data::CTable is:
sub progress_default
{
my ($this, $msg) = @_;
chomp $msg;
print STDERR (localtime() . " $msg\n") if -t STDERR;
return(1); ## Indicate progress actually completed
}
Of course, you are free to call this method directly at any time, and it will do its thing regardless of other progress-enabling settings. But the preferred way is to first set the settings and then call progress().
The warn() method always calls progress_default() -- i.e. warnings will display even if progress is otherwise disabled or overridden at the object or class level. However, you could create a subclass that changes warn()'s behavior if desired. (For example, it could just call perl's builtin warn function, or be even more forceful, generating warnings even if STDERR is not a terminal, for example.)
The progress_set() method may be used to override the progress routine for an individual object (set to 1/true for default behavior, or 0/undef/false to disable progress for that object entirely).
Call progress_class() to set similar values to control the global default behavior (e.g. turning on/off default progress behavior for all instances), but be cautious about using this approach in any environment where other programs might be accessing the same loaded class data, since the setting is stored in a class-owned global ($Data::CTable::DefaultProgress).
Manipulating the class-default settings is only recommended in batch or shell-script environments, not in mod_perl Web applications where the module stays loaded into the Perl environment across multiple invocations, for example.
If you want a particular method (e.g. read() but not write()) to be silent, you could make a subclass and could override that method with an implementation that first disables progress, calls the SUPER:: method, and then restores the progress setting to its original setting.
Timed progress
Timed progress is a way of printing periodically-recurring progress messages about potentially time-consuming processes to the terminal.
For example, consider the following messages which might appear every 2 seconds during a lengthy read() operation:
Reading... 0 (0%)
Reading... 2000 (4%)
...
Reading... 38000 (96%)
Reading... 40000 (100%)
The progress_timed() method is called internally by potentially time-consuming processes (read(), write(), and sort()), and you may want to call it yourself from your own scripts, to produce weary-programmer-soothing visual output during otherwise panic-producing long delays.
Generally, progress_timed() is called with the $Wait parameter set to true, which delays the display of any messages until 2 seconds have passed, so no messages will be displayed unless the process actually does end up being slower than 2 seconds.
Parameters are:
$Op The string that identifies the "operation" taking place
$Item A milestone such as a number or datum to indicate progress
$Pos A numeric position against the (maybe estimated) baseline
$Tot The baseline. If estimated, don't re-estimate too often
$Wait If true, skip printing the first message for this $Op
All parameters except $Op are optional.
progress_timed() has a throttle that keeps it from re-triggering more often than every 2 seconds for any given sequence of the same $Op. The clock is restarted each time you call it with a different $Op or $Tot from the previous call (on the assumption that if the operation or the baseline changes then that fact should be noted).
The messages printed will start with "$Op... ".
If you supply $Item, which could be a number or a string, the messages will then show the $Item after the $Op.
If you supply BOTH $Pos and $Tot, then a percentage will be calculated and added to the readout; otherwise omitted.
If you supply $Wait, the first message (only) that uses this $Op will be skipped, and the next one won't appear for at least 2 seconds.
If using $Pos and $Tot to display percentages for your user, be sure to call progress_timed() one final time when $Pos == $Tot so your user sees the satisfying 100% milestone. This "completion" call will not be skipped even if 2 seconds have not passed since the previous timed progress message was printed.
Althought progress_timed() is designed to cut down on too much visual output when called often in a tight loop, remember that it still takes some processing time to call it and so if you call it too frequently, you're slowing down the very loop you wish were running faster.
So, you might want to call it every tenth or 100th or even 1000th time through a tight loop, instead of every time through, using the mod (%) operator:
$t->progress_timed(....) if ($LoopCount % 100) == 0;
progress_timed_default() is the method called internally by progress_timed() to actually print the messages it has prepared. In this implementation, progress_timed_default() just calls progress_default(). That is, it ignores all other progress-inhibiting or -enhancing settings so delay-soothing messages will print on the terminal even if other messages are turned off.
This is because the author assumes that even if you don't want all those other progress messages, you might still want these ones that explain long delays. If you REALLY don't, then just make yourself a lightweight subclass where progress_timed_default() is a no-op, or maybe calls regular progress(). For example:
BEGIN {package Data::CTable::Silent; use vars qw(@ISA);
@ISA=qw(Data::CTable); sub progress_timed_default{}}
## Later...
my $t = Data::CTable::Silent->new(...);
Rejecting or reporting on groups of records and continuing
Use utility methods omit_warn() and omit_note() to conditionally omit some records from a table and warn (or "note") if any were affected.
Use select_extract() to do the same thing but without actually removing the extracted records from the table, and restoring the original selection before select_extract was called.
If you supply a file name as the 4th argument, the omitted records will be extracted to a file for later reference.
If you supply a message prefix as the 5th argument, a string other than "WARNING" or "Note" may be specified.
# Reject with a progress message prefixed by "WARNING:"
$t->omit_warn(FirstName => sub{!length($_)}, "First name is empty");
Mon Aug 23 08:24:15 2004 WARNING: Omitting 2 of 78243 records (now 78241): First name is empty.
# Reject with a progress message prefixed by "Note:", with output to a file
$t->omit_note(FirstName => sub{!length($_)}, "First name is empty", "empty.names.txt");
Mon Aug 23 08:24:15 2004 Note: Omitting 2 of 78243 records (now 78241): First name is empty.
Mon Aug 23 08:24:15 2004 Writing bad.firstname.txt...
Mon Aug 23 08:24:15 2004 Wrote bad.firstname.txt.
# Extract some items, leaving original selection intact
$t->select_extract(FirstName => sub{!length($_)}, "First name is empty", "empty.names.txt");
Mon Aug 23 08:24:15 2004 Note: Extracting 2 of 78243 records: First name is empty.
DEBUGGING / DUMPING
## Print some debugging output...
$t->out() ## Pretty-print $t using Data::ShowTable
$t->dump() ## Dump $t using Data::Dumper
$t->dump($x, $y) ## Dump anything else using Data::Dumper
## Print some debugging output and then die.
die $t->out() ## Same but die afterwards.
die $t->dump() ## Same but die afterwards.
die $t->dump($x, $y) ## Same but die afterwards.
These two methods can be very helpful in debugging your scripts.
The out() method, which has many options, is described in complete detail in the section below titled "FORMATTED TABLES". In short, it prints a nicely-formatted diagram of $t, obeying the custom field list if any and custom selection if any.
The dump() method uses the Data::Dumper module to call &Dumper() on the table itself (by default) and prints the result to STDERR. If you specify any number of other values, those will be dumped instead using a single call to &Dumper (rather than individually).
Optional module dependencies
These methods require the otherwise-optional modules shown here:
out() Data::ShowTable
dump() Data::Dumper
You'll get a warning at runtime if you try to call either method without the appropriate module installed on your system.
MISCELLANEOUS UTILITY METHODS
The following utilities are methods of the Data::CTable object. They may be called directly by clients, subclassed, or used by subclass implementations as needed.
## Get cache file path (all args optional: defaulted from $t)
$f = $t->prep_cache_file($FileName, $CacheExtension, $CacheSubDir)
## Verify all directories in a path, creating any needed ones.
$ok = $t->verify_or_create_path($DirPath, $Sep)
## Testing readability / writeability of a proposed file
$ok = $t->try_file_read ($Path); ## Opens for read; then closes
$ok = $t->try_file_write($Path); ## Opens for write; then deletes
## Getting parameters from object with optional overrides
$param = $t->getparam($Params, $Param)
prep_cache_file() is the internal method used by both read() and write() to calculate the name of a cache file to be used for a given $FileName.
It calculates the path to the cache file that corresponds to the given $FileName (which may be a bare file name, a relative path, or a partial path, as long as it obeys the current platform's path format rules). All arguments are optional and if absent (undef), will be defaulted from the corresponding parameters in $t.
In addition to calculating the path and file name, it also prepends the "current directory" path if there was no path. Then it checks that all directories mentioned in the path actually exist. If not, it fails. Then, it checks that EITHER the file exists and is readable, OR it does not exist but would be writeable in that directory. If any of these directory creations or file checks fails, then undef is returned (and there would be no cache file).
You may call it with no arguments on a file that has been read() to find the path to the cache file that may have been used and/or created, if any.
You may call it with a file name that was written to, to see what the corresponding written cache file would be.
For example:
## Get name of cache file used or created by read and delete it.
$RCache = $t->prep_cache_file() and unlink($RCache);
## Cache on write() and get name of file and delete it.
$Written = $t->write(_CacheOnWrite=>1, _FileName=>"Foo.txt");
$WCache = $t->prep_cache_file($Written) and unlink($WCache);
verify_or_create_path() is the internal routine used by read(), write(), and the cache-writing logic, that makes sure a requested file path exists (by creating it if necessary and possible) before any file is written by this module.
(If you don't like this module's tendency to try to create directories, make yourself a subclass in which this routine simply checks -d on its $Path argument and returns the result.)
It must be called with a full or partial path TO A DIRECTORY, NOT A FILE. You may supply $Sep, a platform-appropriate separator character (which defaults correctly for the runtime platform if you don't).
Returns true if the path verification and/or creation ultimately succeeded, false otherwise (meaning that, after this call, there is no such directory on the system and so you should not try to write a file there).
try_file_read() and try_file_write() are the internal methods called by prep_cache_file() as well as by read() and write() to preflight proposed file reading and writing locations.
try_file_read() opens a file for read and closes it again; returns true if the open was possible.
try_file_write() opens a file for write and closes it again, deleting it if successful. Returns true if the open for write and the delete were successful. (Be aware that this call will actually delete any existing file by this name.)
The reason that failure to delete causes try_file_write() to fail is that successful cacheing depends on the ability to delete cache files as well as create them or write to them. A file in a location that couldn't be deleted will not be used for cacheing.
getparam() looks up a named parameter in a params hash if it exists there, otherwise looks it up in the object, thereby allowing $Params to shadow any parameters in $this.
This internal routine is used by any methods that allow overriding of parameters in the object when using a named-parameter calling interface. It should be used by any subclasses that also wish to use a named-parameter calling convention. For example:
my $this = shift;
my $Params = (@_ == 1 ? {_FieldList => $_[0]} : {@_});
my($FieldList, $Selection) = map {$this->getparam($Params, $_)}
qw(_FieldList _Selection);
GENERAL-PURPOSE UTILITY FUNCTIONS
These general-purpose utility routines are defined in the Data::CTable module but are not method calls. You may optionally import them or call them by their fully-qualified name.
use Data::CTable qw(
guess_endings
guess_delimiter
path_info
path_is_absolute
min
max
);
## File-format guessing
my $E = &guess_endings($IOHandle) ## Guess txt file line endings
my $D = &guess_delimiter($String) ## Tab if found, else comma
## Cross-platform file path analysis
my $Info = path_info(); ## Hash: 3 of platform's path values:
my ($Sep, ## ... path separator ( / on Unix)
$Up, ## ... "up" component (../ on Unix)
$Cur) = ## ... curr. dir path ( ./ on Unix)
@$Info{qw(sep up cur)};
my $Abs = path_is_absolute($Path) ## Check path type
## Our old favorites min and max
$x = max($x, 0); ## Should have been part of Perl...
$x = min($x, 100);
guess_endings() tries to figure out whether an open IO::File handle has DOS, Mac, or Unix file endings. It reads successively large blocks of the file until it either finds evidence of at least two separate line endings (of any type, but presumably they are the same), or until it reaches the end of the file. Then, it takes the resulting block and searches for the first qualifying line ending sequence it finds, if any. This sequence is then returned to the caller. If it returns undef, it was not able to find any evidence of line endings in the file.
guess_delimiter() takes a string buffer and returns a "," unless it finds a tab character before the first comma in the $String, if any, in which case a tab is returned.
path_info() returns a hash of three helpful strings for building and parsing paths on the current platform. Knows about Mac, Dos/Win, and otherwise defaults to Unix.
path_is_absolute($Path) returns true if it thinks the given path string is an absolute path on the current platform.
IMPLEMENTATION LIMITATIONS
- Column (field) names must not start with underscore
-
This object is implemented as a blessed hash reference. By convention, keys that do not start with underscore are data columns and the key is the field name. Keys that do start with underscore refer to parameters or other data structures stored in the object.
Consequently, no field names may start with underscore. When a file is read from disk, any field names that DO start with underscores will have the leading underscores stripped off. Strange things could then occur if the field names are then no longer unique. For example, field "A" and "_A" in the data file would be treated as the single field "A" after the file was read.
- Field values are always read as strings
-
Field values when written to a file are necessarily converted to strings. When read back in, they are read as strings, regardless of original format. The sole exception is the empty string which is read back in as undef for efficiency.
An exception is when the _CacheOnWrite feature is used: field values stored internally as integers or other scalar types may be saved and later restored as such. However, you should not rely on this behavior.
- Undef vs. empty
-
Empty field values are stored as "undef" for efficiency. This means that programs should generally not rely on any differences between "" and undef in field values. However, when working with large but sparse tables, programs should take care not to convert undef values to empty strings unnecessarily since the separate string objects consume considerably more memory than undef.
CONTRIBUTIONS
Corrections, bug reports, bug fixes, or feature additions are encouraged. Please send additions or patches with a clear explanation of their purpose. Consider making additions in the form of a subclass if possible.
I'm committed to bundling useful subclasses contributed by myself or others with this main distribution.
So, if you've got a subclass of Data::CTable (which should have a name like Data::CTable::YourClassName) and you would like it included in the main distribution, please send it along with a test script and I'll review the code and add it (at my discretion).
If you've got a module that uses, augments, or complements this one, let me know that, too, and I'll make appropriate mention of it.
AUTHORS
Chris Thorman <chthorman@cpan.org>
Jay Hannah <jay.hannah@iinteractive.com>
SEE ALSO
The Data::CTable home page: http://christhorman.com/projects/perl/Data-CTable/
Version control: https://github.com/jhannah/data-ctable/issues
Report bugs: https://rt.cpan.org/Public/Dist/Display.html?Name=Data-CTable
The implementation in CTable.pm.
The test.pl script, other subclasses, and examples.
The Data::ShowTable module.
The Data::Table module by Yingyao Zhou & Guangzhou Zou.
The perlref manual page.
LICENSE
Copyright (c) 1995-2012 Chris Thorman. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.