NAME
Microarray::CdtDataset - an abstraction to the files produced from clustering
Abstract
This package implements an object that serves as an abstraction to a
cdtDataset. It differs from the Microarray::DataMatrix::CdtFile
abstraction, because it deals with the cdtFile in the context of gtr
and/or atr files. It also provides methods by which the geneXplorer
program can interact with a cdtDataset.
The essential purpose of CdtDataset's initialization functions is to
deconstruct the .cdt file into the constituent parts of the
dataset:
1) the data matrix (.data_matrix)
2) the bioassay names or slidenames (.expt_info)
3) the annotations of the spotted features/reporters/sequences
(.feature_info)
4) any additional meta information about the set (.meta)
5) additionally, it computes or creates the following:
a) a binary file containing a list of feature-feature
correlations (.binCor)
b) a 2-color image representation of the data matrix
(.data_matrix.png)
c) an image representation of the expt_info file
(.expt_info.png)
Known Issues
There are good reasons to add additional meta data to a dataset,
including possibly the organism of the set or the location of the
default display configuration file to display the .feature_info.
These would probably have to be passed to the constructor.
Future Plans
Currently, only the .cdt file of a clustered dataset
is utilized. In the future, the other data files detailing the
clustering [gene tree(.gtr) and array tree(.atr)] should be
utilized, and DatasetImageMaker should export suitable image
representations for these files. Furthermore, it would be great to
pull general dataset methods from this class into a future class,
Microarray::Dataset. That way, you could make a MageMLDataset class
as well, and still keep many of the general class attributes/methods
in the same locations. Microarray::Dataset would inherit constructor
methods (i.e. knowledge of the file structure) from either
CdtDataset or MageMLDataset at initialization (perhaps a run-time ISA
declaration within the constructor). Otherwise, I don't see a huge
advantage to having these specialized (and somewhat misnamed)
classes, in the sense that Dataset only needs to know how to parse
the initialization file while converting a new dataset.
Instance Constructor
new
This is the constructor. There are two modes in which the
constructor can be used. In one mode, it will create various files
which support the dataset, using the cdt file (and hopefully, in the
future, the gtr and atr files). In the second mode, it will assume that
these files already exist and just return the constructed object.
Thus when a dataset is first created, there will be the overhead of
creating the additional files, but subsequent creation of a
cdtDataset object will not have that overhead. The constructor
takes the following arguments:
name : The fully qualified name of the dataset (slash/delimited),
which encodes the location and stem of the files,
without any extensions, and with no path
information. If the 'initialize' argument is set
(see below), a directory structure of the same name
will also be created to contain the exported data
files.
datapath : This required path prefix is where any newly created data
files should be placed (or read from).
imagepath : An optional path prefix where any newly created image files
should be placed (or read from). Will default to
datapath if none is specified.
contrast : If a dataset is being instantiated for the first
time, then a contrast is needed for image
generation. If no contrast is provided, then a
default value of 4 will be used. As the data are
expected to be in log base 2, this corresponds to a
16-fold change as the maximum color in any image.
colorscheme : Can either be 'red/green' (the default if none is
specified) or 'yellow/blue'
initialize : The filepath of the originating .cdt file, which
indicates that all the required supporting files
that a cdtDataset needs should be created. This
defaults to 0 (which assumes that the necessary
supporting files already exist). If it is a
filepath, then the dataset is initialized using it.
Note that if you supply a contrast, you must also set initialize (to 1
or to a .cdt filepath), as a contrast is useless in the absence of
initialization. Both the 'name' and 'datapath' arguments are
absolutely required.
Usage, e.g. if you have a .cdt file:
my $ds = Microarray::CdtDataset->new(name       => 'dataset/name',       # name of the dataset
                                     datapath   => $dir,                 # prefix path where dataset files will be written
                                     contrast   => 2,                    # image contrast
                                     initialize => '/path/to/file.cdt'); # originating .cdt file
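To make the relationship between the contrast and the color scale concrete, the following is a minimal sketch. It assumes that log2 values are clamped at +/- contrast and scaled linearly to an 8-bit color channel, which may not be how the module draws its images. The function names max_fold_change and log_ratio_to_rgb are hypothetical and are not part of Microarray::CdtDataset.

```perl
use strict;
use warnings;

# Hypothetical helper: the fold change shown at the brightest color,
# given a contrast expressed in log base 2 units.
sub max_fold_change {
    my ($contrast) = @_;
    return 2 ** $contrast;
}

# Hypothetical helper: map a log2 ratio to an (R, G, B) triple in a
# red/green scheme. Linear scaling and clamping are assumptions made
# for illustration only.
sub log_ratio_to_rgb {
    my ($value, $contrast) = @_;
    my $scaled = abs($value) / $contrast;
    $scaled = 1 if $scaled > 1;              # saturate beyond the contrast
    my $level = int(255 * $scaled + 0.5);    # round to the nearest channel value
    return $value >= 0 ? ($level, 0, 0) : (0, $level, 0);
}

printf "contrast 4 saturates at a %d-fold change\n", max_fold_change(4);
printf "log2 ratio -2 at contrast 4 -> (%d, %d, %d)\n", log_ratio_to_rgb(-2, 4);
```

With the default contrast of 4, any value at or beyond a 16-fold change in either direction gets the brightest color under these assumptions.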
Instance Methods
name
This method returns the fully qualified name of the dataset
contrast
This method returns the contrast
colorScheme
This method returns the colorScheme
fileBaseName
This method returns the base name string of the files comprising
the dataset, sans suffixes
height
This method returns the number of data rows in the cdtFile
width
This method returns the number of data columns in the cdtFile
image
Returns the data matrix as a GD::Image, drawn with 1x1 pixel per
value at the contrast last used/initialized with $ds->new()
Usage: $ds->image();
experiment
getFeatureKeys
Returns the keys (attributes) for the features (gene expression row
vectors)
Usage: $ds->getFeatureKeys()
feature
required by the search function of Explorer
getFeature
search
Returns an array of data matrix row numbers where <query> matched in
column <column_name>. When using 'ALL' as <column_name>, all
columns will be searched.
correlations
Returns the precalculated correlation values for row <index>. Up to
50 correlation values > 0.5 are stored. As an example of client
usage, see how Explorer (gx) retrieves the profiles correlated to
the query (the user-clicked profile within the zoom view).
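The selection rule described above (at most 50 values, each greater than 0.5) can be sketched as follows. This is only an illustration: it assumes that when more than 50 values qualify, the highest ones are kept, which the documentation does not state. The function name top_correlations and its options are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given a hash ref of row index => correlation for one
# query row, keep the correlations above a threshold, best first, up to a cap.
sub top_correlations {
    my ($corr, %opt) = @_;
    my $min = defined $opt{min} ? $opt{min} : 0.5;   # threshold from the text
    my $max = defined $opt{max} ? $opt{max} : 50;    # cap from the text
    my @kept = sort { $corr->{$b} <=> $corr->{$a} || $a <=> $b }
               grep { $corr->{$_} > $min } keys %$corr;
    splice(@kept, $max) if @kept > $max;             # drop everything past the cap
    return map { [ $_, $corr->{$_} ] } @kept;
}

my %row_corr = (12 => 0.91, 40 => 0.35, 7 => 0.66, 3 => 0.50);
for my $pair (top_correlations(\%row_corr)) {
    printf "row %d: %.2f\n", @$pair;
}
```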
Protected Methods
_cdtFileName
This method returns the name of the cdtFile
_cdtBase
This method returns the base name string of the files comprising
the dataset, sans suffixes
_cdtPath
This method returns the path to the cdt file being converted
into a dataset
datapath
This method returns the path to which data files are either written
or read from
imagepath
This method returns the path to which image files are either written
or read from
_load_meta
This method loads in previously cached meta data
_load_image
This protected method just opens up the previously stored matrix
image (from dataset initialization), creates a GD::Image object
with it, and returns it. Possible bug: it relies on the GD::Image
version (>1.19) to pick $kImgType, when perhaps it should rely on
the filename suffix (.gif, .png) instead. This may prevent the
portability of intact datasets from one filesystem to another, but
in the end, you're always going to be limited by the version of GD...
_search_feature
Usage: $hit = $self->_search_feature( 100, "kinase", ['ACC','NAME','SYMBOL'])
This function returns true if the queried feature contains the passed
string value(s). The parameters to this function are:
- required: the index number of the feature
- required: a search term
- optional: an array reference containing the names of the fields to
search; if not passed, all fields will be searched.
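The matching logic described by these parameters might look like the sketch below. It assumes that a feature is available as a hash reference of field name to value and that matching is a case-insensitive substring test; neither detail is confirmed by this documentation. The function feature_matches is a hypothetical stand-in, not the module's _search_feature.

```perl
use strict;
use warnings;

# Hypothetical stand-in: returns 1 if any searched field of the feature
# contains the term, 0 otherwise.
sub feature_matches {
    my ($feature, $term, $fields) = @_;
    # Fall back to every field when no field list is passed.
    my @to_search = $fields ? @$fields : sort keys %$feature;
    for my $field (@to_search) {
        my $value = $feature->{$field};
        next unless defined $value;
        return 1 if index(lc($value), lc($term)) >= 0;
    }
    return 0;
}

my $feature = { ACC => 'YBR160W', NAME => 'cyclin-dependent protein kinase', SYMBOL => 'CDC28' };
print feature_matches($feature, 'kinase', ['ACC', 'NAME', 'SYMBOL']) ? "hit\n" : "miss\n";
print feature_matches($feature, 'kinase', ['SYMBOL'])                ? "hit\n" : "miss\n";
```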
_get_correlations
required for Explorer to retrieve those profiles highly correlated
to the query (user-clicked profile within zoom view)
Private Methods
__init
This method takes care of all of the initialization of the
attributes of the cdtDataset
__checkAndSetConstructorArguments
This private method checks that the constructor arguments pass all
sanity checks, and that files that should exist do exist.
__checkAndSetInitializationState
This method checks and sets whether the object needs full
initialization. There are meant to be 2 initialization requests.
The first (initialize=><path>) would request that the dataset be
created de novo from an initial file, and the second
(initialize=>1) would just remake the images with a different
contrast and different colors. The second initialization has not
been adequately tested.
__checkAndSetDataPath
This private method checks that a path is supplied and that it
corresponds to an existent directory, then stores it in the object.
__checkAndSetImagePath
This private method checks that a path is supplied and that it
corresponds to an existent directory, then stores it in the object.
__checkAndSetDatasetName
This method checks that a dataset name was given to the constructor.
In addition, because CdtDataset creates and stores all its images and
data in a directory hierarchy, the initially specified data and image
paths are augmented with the dataset name directories (which are
created upon initialization).
__checkAndSetContrast
This method determines if the contrast is valid, and then stores the
value in the object
__checkAndSetColorScheme
This method determines if the colorscheme is valid, and then stores
the value in the object
__checkRequiredFilesExist
This method checks that all the required files for the dataset exist.
If they do not, it will cause a fatal error.
__setCdtInfo
This subroutine takes the initialize argument and stores the path and
the stem of the .cdt filename
__setFileBaseName
This method allows the filename stem (no suffix) of the data files
used to initialize the dataset to be set
__setDataPath
This method allows the path to where the data files for the dataset
exist to be set
__setImagePath
This method allows the path to where the image files for the dataset
exist to be set
__setDatasetName
This method allows the name of the dataset to be set.
__setCdtFileName
This method sets the name of the cdtFile
__setContrast
This method allows the contrast to be set.
__setColorScheme
This method allows the colorscheme to be set.
__setShouldInitialize
This method allows a flag to be set as to whether full
initialization needs to take place
__setHeight
This private method allows the 'height' of the dataset to be set.
This in fact corresponds to the number of rows in the cdt file.
__setWidth
This private method allows the 'width' of the dataset to be set.
This in fact corresponds to the number of data columns in the cdt file.
__ensureDirectoriesExist
This subroutine checks to see that the full outpath is created if
necessary, by extending a previously validated filepath. It is
intended for use only when initializing a dataset, where the dataset
directories might need to be created and appended to the data and
image out paths
__cdtFileObject
This private method returns a cdtFile object. If one does not exist
within the object, one will be created. If one does exist, that
will simply be returned. This will likely fail for sets that are
already converted, because the .cdt file is not copied into the
dataset location. This is a design issue that needs to be
discussed, in addition to the fact that it is a private method, when
it seems like other software might actually *want* to retrieve the
DataMatrix object
__shouldInitialize
This private method returns whether the object needs initialization
__initializeDataset
This method creates a new dataset from a CDT (clustered data) file.
The CDT file format was defined by Michael Eisen for his Windows
applications TreeView and Cluster. It has certain drawbacks; for
example, no more than two columns per gene can be used to store
additional information. This can be partly resolved by putting more
data into one record field. A kludgy fix.
__lock
This method locks the dataset
__unlock
This method unlocks the dataset
__dissectCDT
This method determines the contents of the cdtfile, and stores some
of the cdtMeta data for quick retrieval. Note that the previous
version did its own parsing of the cdtFile. This is now delegated
to the cdtFile object.
__saveCdtExptNames
This method (we may eliminate it later) saves the names of the data
columns from the cdtFile (these are usually the experiment names) to
a file. This is later used by GeneXplorer, but also provides a
quick way of looking up the data, without having to read the cdtFile
in.
__prepareCorrelations
This method prepares a correlations file
__createIndexedPclFile
This method creates a pcl file from the cdt file that was used to
instantiate the object. This is coded here, rather than using the
cdtFile method to convert to a pcl, because the pcl file must have
an index for its names, rather than the names themselves.
__compressCorrelations
This method takes a correlations file as output by Gavin Sherlock's
correlations program. These represent the correlation values of a
certain gene (array element) intensity vector vs. all other vectors
in a data matrix.
The output generated is a binary representation of the list of
correlation values for each row in the data matrix (= expression
vectors).
The file is built like this:
name            content              bytes
header
  index_size    length of index      2
  index         offset for rows      index_size * 2
body
  data 1..n     correlation data     4 * look up in index
    -> index    correlated vector    2 \
    -> corr     correlation          2 / 2 words (16-bit ints)
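A layout like this can be written and read with pack and unpack. The sketch below is only one plausible reading of the table, under several assumptions that this documentation does not confirm: unsigned 16-bit big-endian integers (the "n" template), index offsets counted in 4-byte records from the start of the body, and correlations stored as round(r * 10000). The function names pack_correlations and unpack_row are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical writer. $rows is an array ref with one entry per data row,
# each entry an array ref of [correlated_row_index, correlation] pairs.
sub pack_correlations {
    my ($rows) = @_;
    my @offsets;
    my $body = '';
    for my $pairs (@$rows) {
        push @offsets, length($body) / 4;    # assumed: offset in 4-byte records
        # assumed: correlation scaled by 10000 to fit a 16-bit integer
        $body .= pack('n n', $_->[0], int($_->[1] * 10000 + 0.5)) for @$pairs;
    }
    return pack('n', scalar @offsets) . pack('n*', @offsets) . $body;
}

# Hypothetical reader: returns an array ref of [index, correlation] pairs
# for one row, using the next row's offset (or the end of the body) as the end.
sub unpack_row {
    my ($blob, $row) = @_;
    my $index_size = unpack('n', $blob);
    my @offsets    = unpack('n*', substr($blob, 2, $index_size * 2));
    my $body       = substr($blob, 2 + $index_size * 2);
    my $start      = $offsets[$row];
    my $end        = $row + 1 < $index_size ? $offsets[$row + 1] : length($body) / 4;
    my @fields     = unpack('n*', substr($body, $start * 4, ($end - $start) * 4));
    my @pairs;
    while (my ($idx, $corr) = splice(@fields, 0, 2)) {
        push @pairs, [$idx, $corr / 10000];
    }
    return \@pairs;
}

my $blob = pack_correlations([ [ [3, 0.9123], [7, 0.75] ], [], [ [0, 0.6] ] ]);
printf "row 0 -> row %d (%.4f)\n", @$_ for @{ unpack_row($blob, 0) };
```

Storing offsets in the header lets a reader jump straight to one row's correlations without scanning the rest of the file, which suits the lookup performed when a single profile is queried.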
__prepareMetaFile
This method writes out a file of meta information that pertains to
the dataset, in the form of name=value pairs.
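As an illustration of the name=value form, the following sketch writes and re-reads such pairs, one per line. The key names (height, width, contrast) and the function names write_meta and read_meta are hypothetical; the actual keys stored in the .meta file are not listed in this documentation.

```perl
use strict;
use warnings;

# Hypothetical writer: one name=value pair per line, in sorted key order.
sub write_meta {
    my ($fh, $meta) = @_;
    print {$fh} "$_=$meta->{$_}\n" for sort keys %$meta;
    return;
}

# Hypothetical reader: splits each line at the first '=' and skips
# lines that do not look like a pair.
sub read_meta {
    my ($fh) = @_;
    my %meta;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /^([^=]+)=(.*)$/;
        $meta{$1} = $2;
    }
    return \%meta;
}

# Round trip through an in-memory filehandle instead of a real .meta file.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open buffer for writing: $!";
write_meta($out, { height => 6000, width => 80, contrast => 4 });
close $out;

open my $in, '<', \$buffer or die "cannot open buffer for reading: $!";
my $meta = read_meta($in);
close $in;
print "$_ is $meta->{$_}\n" for sort keys %$meta;
```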
__loadExptInfo
This method loads the expt_info data
__load_table
Loads an ASCII table. It is expected that the first row contains the
column headers. It is also expected that the first column contains
numeric IDs starting at '0'. Returns a reference to the table
structure
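A loader with these expectations could be sketched as follows. It assumes tab-delimited text and returns a hash reference holding the headers and an array of row records; the actual delimiter and the shape of the structure returned by __load_table are not specified here. The function name load_table and the sample columns are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a table loader. The first line holds the column
# headers, and the first column of every other line holds a numeric ID.
sub load_table {
    my ($fh) = @_;
    my $header_line = <$fh>;
    die "empty table\n" unless defined $header_line;
    chomp $header_line;
    my @headers = split /\t/, $header_line, -1;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        my @fields = split /\t/, $line, -1;
        my $id = $fields[0];
        die "non-numeric id '$id'\n" unless $id =~ /^\d+$/;
        my %record;
        @record{@headers} = @fields;    # header name => field value
        $rows[$id] = \%record;          # IDs start at 0, so they index the array
    }
    return { headers => \@headers, rows => \@rows };
}

# Usage with an in-memory filehandle rather than a real file.
my $text = "ID\tNAME\tSYMBOL\n0\tprotein kinase\tCDC28\n1\tcyclin\tCLN2\n";
open my $fh, '<', \$text or die "cannot open in-memory table: $!";
my $table = load_table($fh);
close $fh;
print $table->{rows}[1]{SYMBOL}, "\n";
```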
Authors
John C. Matese jcmatese@genome.stanford.edu