NAME
similarity_match.pl
SYNOPSIS
Compares a list of annotations to an ontology and suggests the best match for each annotation based on a string similarity metric (n-grams). It is also possible to align one ontology to another. Accepts ontologies in both OBO and OWL formats as well as MeSH ASCII.
The script runs non-interactively and the results must be inspected manually, although a match with a similarity score above roughly 80-90% can generally be expected to be valid.
USAGE
similarity_match.pl (-w owlfile || -o obofile || -m meshfile) -t targetfile -r resultfile [--obotarget || --owltarget]
The optional '--obotarget' flag specifies that the target file is an OBO ontology; the optional '--owltarget' flag specifies that it is an OWL ontology.
INPUT FILES
- ontologies to map the targetfile against
-
owlfile, obofile, meshfile are ontologies in OWL, OBO and MeSH ASCII formats. Only a single file needs to be specified.
- targetfile
-
The script expects a single-column text file with no headers.
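The expected shape of a plain targetfile can be illustrated with a short Python sketch (an illustration only; the script itself is written in Perl and its actual checks may differ). It reads one annotation per line, drops empty lines, and reports duplicated values, mirroring the kind of checks described for check_data(). The function name `read_targets` and the sample values are hypothetical.

```python
import io
from collections import Counter

def read_targets(handle):
    """Read a single-column file: one annotation per line, no header row."""
    # Strip surrounding whitespace and discard empty lines.
    terms = [line.strip() for line in handle]
    terms = [term for term in terms if term]
    # Collect values that occur more than once so they can be reported.
    counts = Counter(terms)
    duplicates = sorted(term for term, n in counts.items() if n > 1)
    return terms, duplicates

# Hypothetical input with one blank line and one repeated annotation.
sample = io.StringIO("liver\n\nheart\nliver\n")
terms, duplicates = read_targets(sample)
```

Whether duplicates should be dropped or merely reported is a design choice; this sketch only reports them and leaves the list unchanged.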
OUTPUT
The script will produce a single tab-delimited file as set with the -r flag. The file will have five column headers:
- ID
-
Accession of the term from the targetfile if the file was an ontology, otherwise OE_VALUE repeated.
- OE_VALUE
-
Annotation from the supplied targetfile or a term label if the file was an ontology.
- ONTOLOGY_TERM
-
Term label that was matched based on the highest similarity from the supplied ontology file.
- ACCESSION
-
Accession of the ontology term that provided the best match.
- SIMILARITY%
-
Similarity score of ONTOLOGY_TERM compared to OE_VALUE. It is derived from the Levenshtein distance normalised by the length of OE_VALUE and is expressed as a percentage. Higher is better.
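As an illustration of how the result file might be post-processed, the following Python sketch (not part of the script) reads the tab-delimited output and keeps only rows at or above a chosen similarity threshold. It assumes that the first line of the file holds the column names listed above and that SIMILARITY% is written as a bare number; the function name `filter_matches`, the accessions and the sample rows are hypothetical.

```python
import csv
import io

def filter_matches(handle, threshold=80.0):
    """Return result rows whose SIMILARITY% is at or above the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        # Assumption: the score is a plain number such as "92" or "92.5".
        if float(row["SIMILARITY%"]) >= threshold:
            kept.append(row)
    return kept

# Hypothetical two-row result file held in memory.
sample = (
    "ID\tOE_VALUE\tONTOLOGY_TERM\tACCESSION\tSIMILARITY%\n"
    "liver\tliver\tliver\tEX:0001\t100\n"
    "hart\thart\theart\tEX:0002\t75\n"
)
matches = filter_matches(io.StringIO(sample), threshold=80.0)
```

Rows below the threshold are not necessarily wrong; they are the ones most in need of manual inspection.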
DESCRIPTION
Function list
- normalise_hash()
-
Normalises labels and synonyms in the target hash. These are stored in extra annotations on the hash, so that the original value is preserved for display.
- check_data()
-
Checks the input data, e.g. removing empty lines or warning of duplicates.
- normalise()
-
Normalises a string by converting it to lowercase and splitting it into 2-grams.
- align()
-
Aligns the two data structures, targetfile and ontology, and writes the results to a file.
- parseMeSH()
-
Custom MeSH parser for the MeSH ASCII format.
- parseFlat()
-
Custom flat file parser.
- parseFlatColumns()
-
Splits and joins the columns of a flat file. The first column is assigned to the first element. Concatenates the ragged end (leftover columns) into the second element or returns undef for a one-column file.
- parseOBO()
-
Custom OBO parser.
- parseOWL()
-
Custom OWL parser.
- find_match()
-
A wrapper around the calculate_distance function. Specifies the similarity metric to be used, in this case Text::LevenshteinXS::distance.
Writes a single line to the output file.
- calculate_distance()
-
Finds the best match for the supplied term in the ontology using the supplied anonymous distance function defined in find_match().
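The matching idea behind normalise(), find_match() and calculate_distance() can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's Perl implementation: the 2-grams are assumed to overlap, the percentage formula (1 - distance / length of the query) * 100 is one plausible reading of "normalised by OE_VALUE length", and the sketch compares lowercased strings directly because this documentation does not say how the 2-grams feed into the distance. All function names and sample accessions are hypothetical.

```python
def bigrams(text):
    """Lowercase a string and split it into 2-grams (assumed overlapping)."""
    lowered = text.lower()
    return [lowered[i:i + 2] for i in range(len(lowered) - 1)]

def levenshtein(a, b):
    """Dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity_percent(query, candidate):
    """Assumed score: edit distance normalised by the query length, in %."""
    if not query:
        return 0.0
    distance = levenshtein(query.lower(), candidate.lower())
    return max(0.0, (1 - distance / len(query)) * 100)

def best_match(query, labels, score_fn=similarity_percent):
    """labels maps accession -> label; returns (accession, label, score)."""
    best = None
    for accession, label in labels.items():
        score = score_fn(query, label)
        if best is None or score > best[2]:
            best = (accession, label, score)
    return best

# Hypothetical ontology with two terms.
ontology = {"EX:0001": "liver", "EX:0002": "heart"}
result = best_match("hart", ontology)
```

Passing the scoring function as an argument mirrors the description of calculate_distance() receiving its metric from find_match(), so a different metric could be swapped in without changing the search loop.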
AUTHORS
Tomasz Adamusiak <tomasz@cpan.org>