NAME
Microarray::DataMatrix::CdtFile - abstraction to cdtfile
Abstract
cdtFile.pm provides an abstraction layer to a cdt file, in that it provides methods for manipulating or querying the contents of a cdt file.
Overall Logic
This is for programmers only - do not rely on any of these details when programming clients of the object, as the underlying implementation is subject to change at any time without notice.
The cdtFile class does little more than provide the ability to parse cdt files. All filtering/transformations are taken care of by anySizeDataMatrix, from which it inherits.
Public Methods
new
Constructor. It takes the fully qualified name of a cdt file as a named argument, file (for the format, see http://genome-www5.stanford.edu/MicroArray/help/formats.shtml). In addition, you can set a variable, autodump, to indicate whether data should be dumped automatically after any method call that transforms or filters the data. If it is not set, you must call the dumpData method manually. Autodumping is off by default. This is useful because you may, for instance, run several filters/transformations over the data and only want to dump the data at the end, which is more efficient than dumping at every step. Note that if autodumping is on, the data are only dumped if a method completes successfully. Currently, if a method fails, the matrix may be left in an uncertain state and should not be used further.
Construction also requires a directory, tmpDir, in which temp files can be written; these may be generated during matrix filtering and transformation.
Usage:
my $cdtFile = cdtFile->new(file=>$file,
autodump=>1,
tmpDir=>$tmpDir);
returns : a cdtFile object.
dumpData
This method dumps the current contents of the cdtFile object to a file: either the file whose name is passed in as the single argument or, if no argument is given, the file whose name was used to construct the object. If the data have been filtered based on column percentiles, and these were elected to be shown, then these will be dumped out too (see below).
Usage:
$cdtFile->dumpData($file);
or:
$cdtFile->dumpData;
Transformation and Filtering Methods
General note on methods that transform the data : If autodumping is on, then by default, they will overwrite the file that was used to create the cdtFile object, unless a new filename is passed in. If a new filename is passed in (as an argument named 'file'), and autodumping is on, then further operations on the cdtFile of filtered data will require instantiation of a cdtFile object with that file. Note, this also means that the program MUST have permissions to overwrite the file, and also to write to the same directory as the file (for temp file purposes).
All of the transformation and filtering methods return 1 upon success. If an error was encountered, then the method will return 0, and the error message associated with the problem can be retrieved using the errstr() method, eg:
$cdtFile->methodX(%args) || die "An error occurred: ".$cdtFile->errstr."\n";
All of the transformation and filtering methods allow a verbose argument to be passed in, with valid values for the verbose argument being either 'text' or 'html'. For text, \n will be used as an end of line character after every line of reporting is printed. For html, \n<br> will be used, eg:
$cdtFile->center(rows=>'mean',
verbose=>'html') || die $cdtFile->errstr;
center
This method allows either rows or columns of the cdtFile to be centered using either means or medians (centering is when the average - mean or median - is set to zero, by subtracting the average from every value for that row/column). If centering both rows and columns, centering will be done iteratively, until no datapoint changes by more than 0.01. Alternatively, the maxNumIterations can be specified, or the maxAllowableChange can be specified. If used in combination, the first one that is met will terminate centering. The defaults are:
maxAllowableChange 0.01
maxNumIterations 10
Usage:
$cdtFile->center(rows=>'mean',
columns=>'median') || die $cdtFile->errstr;
returns : 1 upon success, or 0 otherwise
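The iterative centering described above can be sketched in plain Perl as follows. This is only a minimal illustration and not the module's implementation: centerMatrix and average are hypothetical helpers, the matrix is assumed to be an array of array references with no missing values, and the stopping rule simply mirrors the maxAllowableChange and maxNumIterations defaults.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helper: the mean or median of a list of numbers.
sub average {
    my ($type, @values) = @_;
    return sum(@values) / scalar(@values) if $type eq 'mean';
    my @sorted = sort { $a <=> $b } @values;
    my $mid    = int(scalar(@sorted) / 2);
    return scalar(@sorted) % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Hypothetical helper: center rows, then columns, in place. Passes repeat
# until no datapoint changed by more than $maxChange in a pass, or until
# $maxIter passes have been made. Returns the number of passes made.
sub centerMatrix {
    my ($matrix, $rowType, $colType, $maxChange, $maxIter) = @_;
    my $lastRow = $#{$matrix};
    my $lastCol = $#{ $matrix->[0] };
    for my $pass (1 .. $maxIter) {
        my @before = map { [@$_] } @$matrix;    # copy, to measure changes
        for my $row (@$matrix) {
            my $avg = average($rowType, @$row);
            $_ -= $avg for @$row;
        }
        for my $c (0 .. $lastCol) {
            my $avg = average($colType, map { $_->[$c] } @$matrix);
            $_->[$c] -= $avg for @$matrix;
        }
        my $largest = 0;
        for my $r (0 .. $lastRow) {
            for my $c (0 .. $lastCol) {
                $largest = max($largest, abs($matrix->[$r][$c] - $before[$r][$c]));
            }
        }
        return $pass if $largest <= $maxChange;
    }
    return $maxIter;
}
```

In this sketch, whichever limit is reached first ends the loop, which matches the "first one that is met" behaviour described for the method.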
filterByPercentPresentData
This method filters out rows or columns that do not have more than the specified percentage of data available. Note that if filtering by both rows and columns, filtering is done sequentially, rows first. To override this, make two separate calls to the method, in the opposite order. There is no fancy algorithm to maximize the amount of retained data. For example, when filtering by rows and then by columns, removal of a column may mean that some rows thrown out in the first step would have greater than 80% good data for the remaining columns; this method does not consider this.
Usage:
$cdtFile->filterByPercentPresentData(rows=>80,
columns=>80);
$cdtFile->filterByPercentPresentData(rows=>90,
file=>$filename);
returns: 1 upon success, or 0 otherwise
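The rows-first, then-columns order described above can be sketched in plain Perl. This is a hedged illustration rather than the module's code: filterByPercentPresent and percentPresent are hypothetical names, and missing data are assumed to be represented by undef.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of defined (non-missing) values in a list.
sub percentPresent {
    my @values = @_;
    return 0 unless @values;
    return 100 * scalar(grep { defined $_ } @values) / scalar(@values);
}

# Hypothetical helper: keep rows with more than $rowPercent present data,
# then, using only the surviving rows, keep columns with more than
# $colPercent present data. Returns a new matrix (array of array refs).
sub filterByPercentPresent {
    my ($matrix, $rowPercent, $colPercent) = @_;
    my @rows = grep { percentPresent(@$_) > $rowPercent } @$matrix;
    return [] unless @rows;
    my @keepCols = grep {
        my $c = $_;
        percentPresent(map { $_->[$c] } @rows) > $colPercent;
    } 0 .. $#{ $rows[0] };
    return [ map { [ @{$_}[@keepCols] ] } @rows ];
}
```

Because the column step only sees rows that survived the row step, rows discarded earlier are never reconsidered, which is the limitation noted in the description.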
filterRowsOnColumnPercentile
This method will filter out rows whose values do not have a percentile rank for their particular column above a specified percentile rank, in at least numColumns columns. In addition, this method will accept a 'showPercentiles' argument which, if set to a non-zero value, will result in the percentiles of the datapoints being dumped out with the data when the data are subsequently dumped to a file. Columns of percentiles are interleaved with the data columns, so the resulting file cannot be clustered.
Usage:
$cdtFile->filterRowsOnColumnPercentile(percentile=>95,
numColumns=>1,
showPercentiles=>1);
returns : 1 upon success, or 0 otherwise
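As a rough sketch of the filtering rule, the plain-Perl code below keeps a row when enough of its values rank above the given percentile within their columns. The definition of percentile rank used here (the percentage of values in the column that are less than or equal to the value) is an assumption, as are the helper names percentileRank and filterRowsOnPercentile; the module may compute ranks differently.

```perl
use strict;
use warnings;

# Assumed definition: percentage of column values that are <= $value.
sub percentileRank {
    my ($value, @column) = @_;
    return 100 * scalar(grep { $_ <= $value } @column) / scalar(@column);
}

# Hypothetical helper: keep rows whose values have a percentile rank above
# $percentile in at least $numColumns columns. Returns a new matrix.
sub filterRowsOnPercentile {
    my ($matrix, $percentile, $numColumns) = @_;
    my $lastCol = $#{ $matrix->[0] };
    # Gather each column's values once, so ranks can be looked up per column.
    my @columns = map { my $c = $_; [ map { $_->[$c] } @$matrix ] } 0 .. $lastCol;
    return [ grep {
        my $row     = $_;
        my $passing = grep {
            percentileRank($row->[$_], @{ $columns[$_] }) > $percentile
        } 0 .. $lastCol;
        $passing >= $numColumns;
    } @$matrix ];
}
```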
filterRowsOnColumnDeviation
This method will filter out rows whose values do not deviate from the column mean by a specified number of standard deviations, in at least numColumns columns.
Usage:
$cdtFile->filterRowsOnColumnDeviation(deviations=>2,
numColumns=>1);
returns : 1 upon success, or 0 otherwise
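The deviation test can be illustrated with the plain-Perl sketch below. It is not the module's implementation: meanAndStdDev and filterRowsOnDeviation are hypothetical helpers, the sample standard deviation (dividing by n - 1) is an assumption, and a value is treated as passing when it lies at least the given number of standard deviations from its column mean, in either direction.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: mean and sample standard deviation of a list
# (needs at least two values).
sub meanAndStdDev {
    my @v    = @_;
    my $mean = sum(@v) / scalar(@v);
    my $var  = sum(map { ($_ - $mean) ** 2 } @v) / (scalar(@v) - 1);
    return ($mean, sqrt($var));
}

# Hypothetical helper: keep rows that deviate from the column mean by at
# least $deviations standard deviations in at least $numColumns columns.
sub filterRowsOnDeviation {
    my ($matrix, $deviations, $numColumns) = @_;
    my $lastCol = $#{ $matrix->[0] };
    my @stats = map {
        my $c = $_;
        [ meanAndStdDev(map { $_->[$c] } @$matrix) ];
    } 0 .. $lastCol;
    return [ grep {
        my $row     = $_;
        my $passing = grep {
            my ($mean, $sd) = @{ $stats[$_] };
            abs($row->[$_] - $mean) >= $deviations * $sd;
        } 0 .. $lastCol;
        $passing >= $numColumns;
    } @$matrix ];
}
```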
filterRowsOnValues
This method filters out rows whose values do not pass a specified criterion, in at least numColumns columns. To specify the criterion, a value, and an operator must be specified. The valid operators are:
"absolute value >" also aliased by "absgt" and "|>|"
"absolute value >=" also aliased by "absgteq" and "|>=|"
"absolute value =" also aliased by "abseq" and "|=|"
"absolute value <" also aliased by "abslt" and "|<|"
"absolute value <=" also aliased by "abslteq" and "|<=|"
">" also aliased by "gt"
">=" also aliased by "gteq"
"=" also aliased by "eq" and "=="
"<=" also aliased by "lteq"
"<" also aliased by "lt"
"not equal" also aliased by "ne" and "!="
Usage:
$cdtFile->filterRowsOnValues(operator=>"absolute value >",
value=>2,
numColumns=>1);
returns : 1 upon success, or 0 otherwise
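One way to picture how the operators and their aliases could be resolved is a dispatch table, sketched below in plain Perl. The table layout and the helper name filterRowsOnValues (as a standalone function) are assumptions for illustration; only the operator names and aliases are taken from the list above.

```perl
use strict;
use warnings;

# Canonical operators, each mapped to a test taking (datum, value).
my %compare = (
    '>'         => sub { $_[0] >  $_[1] },
    '>='        => sub { $_[0] >= $_[1] },
    '='         => sub { $_[0] == $_[1] },
    '<='        => sub { $_[0] <= $_[1] },
    '<'         => sub { $_[0] <  $_[1] },
    'not equal' => sub { $_[0] != $_[1] },
);
my %alias = (
    gt   => '>',  gteq => '>=', eq => '=',         '==' => '=',
    lteq => '<=', lt   => '<',  ne => 'not equal', '!=' => 'not equal',
);
# Derive the "absolute value" forms and their aliases from the plain ones.
for my $op ('>', '>=', '=', '<', '<=') {
    my $plain = $compare{$op};
    $compare{"absolute value $op"} = sub { $plain->(abs($_[0]), $_[1]) };
    $alias{"|$op|"} = "absolute value $op";
}
$alias{"abs$_"} = "absolute value $alias{$_}" for qw(gt gteq eq lt lteq);

# Hypothetical helper: keep rows in which at least $numColumns values
# satisfy the criterion. Dies on an unrecognized operator.
sub filterRowsOnValues {
    my ($matrix, $operator, $value, $numColumns) = @_;
    my $name = exists $compare{$operator} ? $operator : $alias{$operator};
    die "unknown operator '$operator'\n" unless defined $name;
    my $passes = $compare{$name};
    return [ grep {
        my $row   = $_;
        my $count = grep { $passes->($_, $value) } @$row;
        $count >= $numColumns;
    } @$matrix ];
}
```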
filterRowsOnVectorLength
This method filters out rows based on whether the vector that their values define has a length greater than the specified length.
Usage:
$cdtFile->filterRowsOnVectorLength(length=>2);
returns : 1 upon success, or 0 otherwise
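For illustration, the vector length of a row can be read as its Euclidean length, the square root of the sum of the squared values. The plain-Perl sketch below makes that assumption, and also assumes that rows are kept when their length exceeds the threshold; vectorLength and filterRowsOnLength are hypothetical names, not part of the module.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Assumed definition: Euclidean length of the vector of values.
sub vectorLength {
    return sqrt(sum(0, map { $_ ** 2 } @_));
}

# Hypothetical helper: keep rows whose vector length is greater than $length.
sub filterRowsOnLength {
    my ($matrix, $length) = @_;
    return [ grep { vectorLength(@$_) > $length } @$matrix ];
}
```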
logTransformData
This method log transforms the contents of the data matrix, using the specified base. If any values less than or equal to zero are encountered, then the transformation will fail. The matrix will be returned to its state prior to log transformation if the operation fails.
Usage :
$cdtFile->logTransformData(base=>2);
returns : 1 upon success, or 0 otherwise
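The all-or-nothing behaviour described above can be sketched in plain Perl by building the transformed values separately and only replacing the matrix once every value has been converted. This is an illustrative sketch, not the module's code; logTransform is a hypothetical helper, and the logarithm in an arbitrary base is computed as log(x) / log(base).

```perl
use strict;
use warnings;

# Hypothetical helper: log transform a matrix in place, using $base.
# Returns 1 on success. Returns 0, leaving the matrix untouched, if any
# value is less than or equal to zero.
sub logTransform {
    my ($matrix, $base) = @_;
    my @transformed;
    for my $row (@$matrix) {
        my @new;
        for my $value (@$row) {
            return 0 if $value <= 0;    # nothing has been modified yet
            push @new, log($value) / log($base);
        }
        push @transformed, \@new;
    }
    @$matrix = @transformed;
    return 1;
}
```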
scaleColumnData
This method scales the data for particular columns as specified by the client, by dividing the values by specified factors. It could, for instance, be used to renormalize the data. Note it is only appropriate to normalize ratio data, not log transformed data.
The client passes in a hash, by reference, of the column numbers (starting from zero) as the keys, and the scaling factors as the values.
If a column number which is invalid is specified, then a warning to STDERR will be printed. Also, if a scaling factor of zero (or undef) is supplied for a column, a warning will also be printed to STDERR, and the column data for that column will not be scaled.
Usage:
$cdtFile->scaleColumnData(columns=>{0=>1.2,
2=>0.8});
returns : 1 upon success, or 0 otherwise
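The scaling and warning behaviour can be sketched in plain Perl as follows. scaleColumns is a hypothetical standalone helper, not the module's method, and the wording of the warnings is invented for illustration; the hash of zero-based column numbers to scaling factors follows the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: divide the values in each listed column by its
# scaling factor, in place. Invalid column numbers and zero or undefined
# factors produce a warning on STDERR and are skipped.
sub scaleColumns {
    my ($matrix, $factors) = @_;
    my $numCols = scalar @{ $matrix->[0] };
    for my $col (sort keys %$factors) {
        my $factor = $factors->{$col};
        if ($col !~ /^\d+$/ || $col >= $numCols) {
            warn "Column $col is not valid, so was not scaled.\n";
            next;
        }
        if (!$factor) {
            warn "No usable scaling factor for column $col, so it was not scaled.\n";
            next;
        }
        $_->[$col] /= $factor for @$matrix;
    }
    return 1;
}
```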
Public Accessor Methods
file
This method returns the name of the file that was used to construct the object.
Usage:
my $file = $cdtFile->file;
returns: a scalar
numRows
This method returns the number of rows that are currently valid in the data matrix.
Usage:
my $numRows = $cdtFile->numRows;
returns: a scalar
numColumns
This method returns the number of columns that are currently valid in the data matrix.
Usage:
my $numColumns = $cdtFile->numColumns;
returns: a scalar
errstr
This method returns an error string that is associated with the last failed call to a data transformation/filtering method. Calling this method will clear the contents of the error string.
Public Setter Methods
setAutoDump
This method can be used to turn autodumping on or off.
Usage:
$cdtFile->setAutoDump($n); # where $n can be 0 or 1
Public setter/getter methods
rowName
This polymorphic setter/getter method returns the row name for a given row in the cdt file. If a new name is provided, it will update that name to the new value. Note the row number that is passed in is based on the row number in the file used for object construction.
Usage:
$self->rowName($rowNum, $newName);
my $rowName = $self->rowName($rowNum);
rowDesc
This polymorphic setter/getter method returns the row description for a given row in the cdt file. If a new description is provided, it will update that description to the new value. Note the row number that is passed in is based on the row number in the file used for object construction.
Usage:
$self->rowDesc($rowNum, $newDesc);
my $rowDesc = $self->rowDesc($rowNum);
gWeight
This polymorphic setter/getter method returns the gWeight for a given row in the cdt file. If a new gWeight is provided, it will update that gWeight to the new value. Note the row number that is passed in is based on the row number in the file used for object construction.
Usage:
$self->gWeight($rowNum, $newGWeight);
my $gWeight = $self->gWeight($rowNum);
eweightsArrayRef
This polymorphic setter/getter method returns a reference to an array of the eweights that existed in the original file, in the order that they appeared. If a new array reference is provided, it will give the experiments new eweights.
Usage:
my $eweightsArrayRef = $self->eweightsArrayRef;
$self->eweightsArrayRef(\@eweights);
idName
This polymorphic setter/getter method returns the name of the id column that was used in the original file. If a new value is passed in, it will instead use that value as the id column name.
Usage:
print $self->idName;
$self->idName($idName);
descName
This polymorphic setter/getter method either returns the name of the description column that was used in the original file, or allows it to be set to a new value.
Usage:
print $self->descName;
$self->descName($descName);
Protected methods
_printLeadingMeta
This method prints out the first two lines of a cdt file, to the file handle that is passed in. It will only print meta information for the valid columns. If the $hasExtraInfo variable is true (ie a non-zero value), it will print a column called $extraInfoName in between each column of meta data. This method is implemented as required by its superclass, anySizeDataMatrix.
Usage:
$self->_printLeadingMeta($fh, $validColumnsArrayRef, $hasExtraInfo, $extraInfoName);
_printTrailingMeta
This method is implemented as required by its superclass, anySizeDataMatrix. For a cdtFile, it doesn't actually need to do anything.
Usage:
$self->_printTrailingMeta($fh, $self->_validColumnsArrayRef);
Private Methods
__usage
This private method prints out a usage message for the constructor, and then dies with a specific error message.
Usage :
$self->__usage("$file does not exist");
_parseFileHeaders
This private method removes the file headers from a cdt file. It accepts a file handle as input, and assumes that the file handle is pointing to the very beginning of the file.
Usage:
$self->_parseFileHeaders($fh);
__parseFirstLine
This method deals with recording data from the first line of a cdt file. It accepts as input a reference to an array that contains the fields on the first line split on tabs.
Usage:
$self->__parseFirstLine($lineRef);
__parseSecondLine
This method deals with recording data from the second line of a cdt file. It accepts as input a reference to an array that contains the fields on the second line, split on tabs.
Usage:
$self->__parseSecondLine($lineRef);
_removeMetaDataFromLine
This method removes and stores meta data from a data line in a cdt file. Because the array of data is passed in by reference, the array to which this reference refers is directly manipulated, such that the caller's array will have meta data removed from it.
Usage:
$self->_removeMetaDataFromLine($currFileDataRow, $lineRef);
AUTHOR
Gavin Sherlock
sherlock@genome.stanford.edu