NAME

similarity_match.pl

SYNOPSIS

Compares a list of annotations against an ontology and suggests the best match for each annotation based on a string similarity metric (Levenshtein distance computed on normalised, n-gram-split strings). It can also align one ontology to another. Accepts ontologies in OBO and OWL formats, as well as MeSH ASCII and OMIM txt.

The script runs non-interactively, and the results must be inspected manually. As a rule of thumb, a match with a similarity score above ~80-90% can be expected to be valid.

USAGE

similarity_match.pl (-w owlfile || -o obofile || -m meshfile || -i omimfile) -t targetfile -r resultfile [--obotarget || --owltarget]

Optional '--obotarget' setting specifies that the target file is an OBO ontology. Optional '--owltarget' setting specifies that the target file is an OWL ontology.

INPUT FILES

ontologies to map the targetfile against

owlfile, obofile, meshfile and omimfile are ontologies in OWL, OBO, MeSH ASCII and OMIM txt formats, respectively. Only a single file needs to be specified.

targetfile

By default the script expects a single-column text file with no headers. With --obotarget or --owltarget the target file is read as an ontology instead.

OUTPUT

The script will produce a single tab-delimited file, named with the -r flag. The file will have five columns with the following headers:

ID

Accession of the term from the targetfile if the file was an ontology, otherwise OE_VALUE repeated.

OE_VALUE

Annotation from the supplied targetfile or a term label if the file was an ontology.

ONTOLOGY_TERM

Term label from the supplied ontology file that was matched with the highest similarity.

ACCESSION

Accession of the ontology term that provided the best match.

SIMILARITY%

Similarity score of ONTOLOGY_TERM compared to OE_VALUE, expressed in %. It is derived from the Levenshtein distance normalised by the length of OE_VALUE. Higher is better.
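
The exact formula is not spelled out here, so the following Python sketch is only one plausible reading, not the script's own Perl code. It assumes the score is 100 * (1 - distance / length of OE_VALUE), floored at 0. The function names levenshtein and similarity_percent are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(oe_value, ontology_term):
    """ASSUMED formula: 100 * (1 - distance / len(oe_value)), floored at 0."""
    if not oe_value:
        return 0.0
    distance = levenshtein(oe_value, ontology_term)
    return max(0.0, 100.0 * (1 - distance / len(oe_value)))
```

Under this assumption, a single-character difference in a four-character annotation gives 75%, and an identical label gives 100%.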

DESCRIPTION

Function list

normalise_hash()

Normalises labels and synonyms in the target hash. These are stored in extra annotations on the hash, so that the original value is preserved for display.

check_data()

Checks the input data, for example by removing empty lines and warning of duplicates.
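
As an illustration of this kind of check, here is a minimal Python sketch (the script itself is written in Perl). It assumes that duplicates are reported but kept, which may not match the script's behaviour. The name check_data is reused for readability only, and the return shape is hypothetical.

```python
def check_data(lines):
    """Drop empty lines and collect duplicate values for a warning.

    ASSUMPTION: duplicates are reported but not removed.
    """
    seen = set()
    cleaned = []
    duplicates = []
    for raw in lines:
        value = raw.strip()
        if not value:
            continue  # skip empty or whitespace-only lines
        if value in seen:
            duplicates.append(value)  # the caller can print a warning for these
        seen.add(value)
        cleaned.append(value)
    return cleaned, duplicates
```

A caller would typically print one warning per entry in the returned duplicates list and continue with the cleaned values.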

normalise()

Normalises a string by converting it to lowercase and splitting it into 2-grams.
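
The two steps can be sketched in Python as follows (the script itself is Perl, and this is not its code). The sketch assumes overlapping character 2-grams and no further clean-up such as punctuation stripping; both are assumptions, and the parameter n is hypothetical.

```python
def normalise(text, n=2):
    """Lowercase a string and split it into overlapping character n-grams.

    ASSUMPTION: n-grams overlap and no other clean-up is applied.
    """
    lowered = text.lower()
    if len(lowered) < n:
        # Too short to form a full n-gram: keep the string as a single token.
        return [lowered] if lowered else []
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]
```

With these assumptions, normalise("Asthma") yields the 2-grams as, st, th, hm and ma.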

align()

Aligns the two data structures targetfile and ontology. Outputs the results into a file.

parseMeSH()

Custom MeSH parser for the MeSH ASCII format.

parseOMIM()

Custom OMIM parser.

parseFlat()

Custom flat file parser.

parseFlatColumns()

Splits the columns of a flat file and joins them into two elements. The first column is assigned to the first element. The ragged end (any leftover columns) is concatenated into the second element, which is undef for a one-column file.
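
A minimal Python sketch of this split-and-join behaviour, for illustration only (the script is Perl). The tab separator and the space used to join the leftover columns are assumptions, None stands in for Perl's undef, and the name parse_flat_columns is hypothetical.

```python
def parse_flat_columns(line, sep="\t", joiner=" "):
    """Return (first_column, joined_leftover_columns_or_None).

    ASSUMPTIONS: columns are tab-separated and leftovers are joined by a space.
    """
    columns = line.rstrip("\n").split(sep)
    first = columns[0]
    rest = columns[1:]
    # A one-column line has no leftover columns, so the second element is None.
    second = joiner.join(rest) if rest else None
    return first, second
```

For a line with three columns a, b and c this returns ("a", "b c"); for a line containing only a it returns ("a", None).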

parseOBO()

Custom OBO parser.

parseOWL()

Custom OWL parser.

find_match()

A wrapper around the calculate_distance function. Specifies the similarity metric to be used, in this case Text::LevenshteinXS::distance.

Outputs a single line in the output file.

calculate_distance()

Finds the best match for the supplied term in the ontology using the supplied anonymous distance function defined in find_match().
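
The pattern of passing the metric in as a function can be sketched in Python as below. This is not the script's Perl code: the ontology is assumed to be a simple accession-to-label mapping, the tuple return shape is hypothetical, and toy_distance is a stand-in for Text::LevenshteinXS::distance so that the example is self-contained.

```python
def toy_distance(a, b):
    """Stand-in metric: mismatched positions plus length difference (not Levenshtein)."""
    return sum(1 for x, y in zip(a, b) if x != y) + abs(len(a) - len(b))


def calculate_distance(term, ontology, distance):
    """Return (accession, label, score) with the smallest distance, or None.

    ASSUMPTION: ontology maps accession -> label; lower scores are better.
    """
    best = None
    for accession, label in ontology.items():
        score = distance(term, label)
        if best is None or score < best[2]:
            best = (accession, label, score)
    return best


def find_match(term, ontology):
    """Wrapper that fixes the metric, mirroring the role described for find_match()."""
    return calculate_distance(term, ontology, lambda a, b: toy_distance(a, b))
```

Keeping the metric as a parameter means a different distance function can be swapped in at the wrapper without touching the search loop.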

AUTHORS

Tomasz Adamusiak <tomasz@cpan.org>