NAME

GO::TermFinder - identify GO nodes that annotate a group of genes with a significant p-value

DESCRIPTION

This package is intended to provide a method whereby the P-values of a set of GO annotations can be determined for a set of genes, based on the number of genes that exist in the particular genome, and their annotation, and the frequency with which the GO nodes are annotated across the provided set of genes. The P-value is simply calculated using either the hypergeometric, or the binomial distribution, as the probability of x or more out of n genes having a given annotation, given that G of N have that annotation in the genome in general. The hypergeometric distribution (sampling without replacement) is more accurate, though slower to calculate the the binomial distibution (sampling with replacement).

In addition, a corrected p-value is also calculated, to correct for multiple hypothesis testing. From the sub-graph of GO nodes that are considered as hypotheses (any nodes with 2 or more annotations from the list of genes to be considered), the minimal set of nodes is determined from which all other hypotheses (nodes and annotations) can be inferred. Their number is then used to multiply the calculated p-values, to generate corrected p-values. The client has access to both the corrected and uncorrected values. This is done instead of simple Bonferroni correction, because the hypotheses are not independent of one another.

The general idea is that a list of genes may have been identified for some reason, e.g. they are coregulated, and TermFinder can be used to find out if any nodes annotate the set of genes to a level which is extremely improbable if the genes had simply been picked at random.

TODO

On the day that I finished the code, BioPerl 1.2 was released, which appears to have some Ontology parsing code that is better than mine, which uses the Graph module on CPAN to implement the DAG structure of GO. The Graph module provides a richer set of methods for querying the structure, though the BioPerl implementation does not appear to have all the things I want. Anyway, converting my code to use the BioPerl Ontology stuff is probably a good goal, as the modules in their appear to have been somewhat better conceived in terms of their relationships to each other. I did not notice anything to do with providing annotations, so my GO::AnnotationProvider::AnnotationParser may still be of use. We will see....

May want the client to decide the behaviour for ambiguous names, rather than having it hard coded (eg always ignore; use if standard name (current implementation); use all databaseIds for the ambiguous name; decide on a case by case basis (potentially useful if running on command line)).

Would probably be a good idea to use the Math::BigInt module to deal with the factorials and nChooser calculations, so that they are more accurate, rather than the log versions that I currently have. The latest version of Math::BigInt has a C implementation underneath, that may even speed it up a little.

The multiple hypothesis correction (based on some simulations I've run) appears not to be conservative enough, and thus needs more work....

Instance Constructor

new

This is the constructor. It expects to be passed named arguments for an annotationProvider, an ontologyProvider. In addition, it must be told the aspect of the ontology provider, so that it knows how to query the annotationProvider.

There are also some additional, optional arguments:

population: this argument allows a client to indicate the population that should used to calculate a background distribution of GO terms. In the absence of population argument, then the background distribution will be drawn from all genes in the annotationProvider. This should be provided as an array reference, and no ambiguous names should be provided (see AnnotationProvider for details of name ambiguity). This option is particularly pertinent in the case that e.g. you assayed 2000 genes in a two hybrid, and found 20 interesting ones. To find significant terms, you need to do it in the context of the genes that you assayed, not in the context of all genes with annotation.

totalNumGenes: this argument allows a client to indicate that the size of the background distribution is in fact larger that the number of genes that exist in the annotation provider, and the extra genes are merely assumed to be entirely unannotated.

NB: This is an API change, as totalNumGenes was previously required.

Thus - if using 'population', the total number of genes considered as the background will be the number of genes in the provided population. If not using 'population', then the number of genes that will be considered as the total population will be the number of genes in the annotationProvider. However, if the totalNumGenes argument is provided, then that number will be used as the size of the population. If it is not larger than the total number of genes in the annotationParser, then the number of genes in the annotationParser will be used. The totalNumGenes and the population arguments are mutually exclusive, and both should not be provided at the same time.

Usage (population is larger than those with annotations):

my $termFinder = GO::TermFinder->new(annotationProvider=> $annotationProvider,
                                     ontologyProvider  => $ontologyProvider,
                                     totalNumGenes     => $num,
                                     aspect            => <P|C|F>);

Usage (use all annotated genes as population):

my $termFinder = GO::TermFinder->new(annotationProvider=> $annotationProvider,
                                     ontologyProvider  => $ontologyProvider,
                                     aspect            => <P|C|F>);

Usage (use a subset of genes as the background population):

my $termFinder = GO::TermFinder->new(annotationProvider=> $annotationProvider,
                                     ontologyProvider  => $ontologyProvider,
                                     population        => \@genes,
                                     aspect            => <P|C|F>);

Instance Methods

findTerms

This method returns an array of hash references that indicates what terms can annotate the list of genes with what P-value. The contents of the hashes in the returned array are:

key                   value
-------------------------------------------------------------------------
NODE                  A GO::Node

PVALUE		  The P-value for having the observed number of
                      annotations that the provided list of genes
                      has to that node

CORRECTED_PVALUE      The CORRECTED_PVALUE is the PVALUE multiplied
                      by the number of nodes in the minimal set
                      of hypotheses from which all other
                      hypotheses can be generated.  A hypothesis
                      is any node to which 2 or more genes in the
                      supplied list are annotated, either
                      directly or indirectly.  The minimal subset
                      of hypotheses from which all others can be
                      constructed consists of the union of the
                      following classes of node: 

                       1).  Leaf hypotheses (i.e. hypotheses which 
                            have no children that were tested as hypotheses).

                       2).  Hypotheses that have at least one non-hypothesis 
                            child with an annotation.
  
                       3).  Hypotheses with direct annotation.

NUM_ANNOTATIONS       The number of genes within the provided list that
                      are annotated to the node.

TOTAL_NUM_ANNOTATIONS The number of genes across the genome
                      annotated to the node

ANNOTATED_GENES       A hash reference, whose keys are the
                      databaseIds that are annotated to the node,
                      and whose values are the original name
                      supplied to the findTerms() method.

The entries are sorted by increasing p-value (ie least likely is first). If there is a tie in the p-value, then the sort order is determined by GOID, using a cmp comparison.

This method expects to be passed, by reference, a list of gene names for which terms will be found. If a passed in name is ambiguous (see AnnotationProvider), then the following will occur:

1) If the name can be used as a standard name, it will assume that
   it is that.

2) Otherwise it will not use it.

Currently a warning will be printed to STDOUT in the case of an ambiguous name being used.

The passed in gene names are converted into a list of databaseIds. If a gene does not map to a databaseId, then an undef is put in the list - however, if the same gene name, which does not map to a databaseId, is used twice then it will produce only one undef in the list. If more than one gene name maps to the same databaseId (either because you used the same name twice, or you used an alias as well), then that databaseId is only put into the list once, and a warning is printed.

If a gene name does not have any information returned from the AnnotationProvider, then it is assumed that the gene is entirely unannotated. For these purposes, TermFinder annotates such genes to the root node (Gene_Ontology), its immediate child (which indicates the aspect of the ontology (such as biological_process), and a dummy go node, corresponding to unannotated. This node will have a goid of 'GO:XXXXXXX', and a term name of 'unannotated'. No other information will be set up for this GO::Node, so you should not count on being able to retrieve it. What it does mean is that you can determine if the predominant feature of a set of genes is that they have no annotation.

If more genes are provided that have been indicated exist in the genome (as provided during object construction), then an error message will be printed out, and an empty list will be returned.

Usage:

    my @pvalueStructures = $termFinder->findTerms(genes=>\@genes);

    my $hypothesis = 1;						    

    foreach my $pvalue (@pvalueStructures){

    print "-- $hypothesis of ", scalar @pvalueStructures, "--\n",

	"GOID\t", $pvalue->{NODE}->goid, "\n",

	"TERM\t", $pvalue->{NODE}->term, "\n",

	"P-VALUE\t", $pvalue->{PVALUE}, "\n",

	"CORRECTED P-VALUE\t", $pvalue->{CORRECTED_PVALUE}, "\n",
	
        "NUM_ANNOTATIONS\t", $pvalue->{NUM_ANNOTATIONS}, " (of ", $pvalue->{TOTAL_NUM_ANNOTATIONS}, ")\n",

        "ANNOTATED_GENES\t", join(", ", values (%{$pvalue->{ANNOTATED_GENES}})), "\n\n";

    $hypothesis++;

    }

aspect

Returns the aspect with the the GO::TermFinder object was constructed.

Usage:

my $aspect = $termFinder->aspect;

Authors

Gavin Sherlock; sherlock@genome.stanford.edu
Elizabeth Boyle; ell@mit.edu