NAME
FortyTwo::Manual - User Guide for Forty-Two
VERSION
version 0.213470
Background
Aim and features
42
is a phylogenomic tool designed to add (and optionally align) sequences to a preexisting multiple sequence alignment (MSA) while controlling for orthology relationships and potentially contaminating sequences. Sequences to add are either nucleotide transcripts resulting from transcriptome assembly or already translated protein sequences. In theory, one can also use genomic nucleotide sequences (because 42
can splice introns), but this possibility has not been extensively tested so far.
The working hypothesis of 42
is that its orthology-controlling heuristics can enrich not only MSAs of single-copy genes but also more complicated MSAs including terminally duplicated genes (in-paralogues) and/or corresponding to multigenic families featuring different out-paralogues of different ages. Preliminary tests on a broadly sampled eukaryotic data set suggest that the orthology relationships enforced by 42
are in good agreement with those inferred with OrthoFinder
software [Emms and Kelly (2015) Genome Biol 16:157]. To this end, it relies on complete proteomes of reference organisms.
42
is also able to enrich MSAs resulting from the split of complex multigenic families after phylogenetic analysis. For this, it requires decoy files composed of representative sequences of unwanted out-paralogues. Such PARA
files have to be provided by the user.
Regarding contamination, 42
implements a dual system of taxonomic filters (based on NCBI Taxonomy) allowing it to flag any new sequence for which the taxonomic affiliation is doubtful. Two main approaches are available: 42
either checks that a new sequence is most similar to (an)other sequence(s) of the expected taxon already present in the MSA (= positive filter) or that a new sequence is more similar to a sequence in the MSA than to any sequence from a set of complete proteomes that do not include the expected taxon (= TOL check decoy). While the power of the first mechanism is dependent on the taxonomic breadth of each MSA, the second approach is more widely applicable.
As species-rich ribosomal protein MSAs are available for both prokaryotic and eukaryotic genomes, 42
can also be used to probe the contamination level of any genome or transcriptome of interest using the first approach. A special mode is provided for this application, termed the metagenomic
mode.
Design principles
In a single run, 42
can process an arbitrary large number of MSAs (FASTA
files specified using shell jokers on the command line). Moreover, one can search for orthologous sequences in as many organisms as wanted.
42
is exclusively setup through a structured text file (e.g., YAML
format). Archiving of this file allows a user to document all the configuration details for a given run. This config
file has two main parts: one with the options that apply globally to the run and one that lists the organisms (orgs
) to search and their specific options, including the path (bank_dir
) to the corresponding sequence databases (banks
in 42
's parlance). The config
file further includes a mechanism of default values (defaults
) that apply to all organisms except when otherwise specified in individual org
subsections (e.g., code
).
When 42
enriches a MSA, it processes each organism in turn following the order of org
subsections in the config
file. Several out-of-order optimisations ensure that similar computations (e.g., BLAST
searches) are not repeated uselessly. Parallelization is possible either by running multiple instances of 42
on a grid computer, each working on a different set of MSAs processed serially, or by enabling an internal batch queue allowing the processing of several MSAs at once by each instance of 42
.
42
's verbosity is configured directly on the command line. 42
can be very introspective if asked to be so. At the highest verbosity level, the numerous BLAST
reports are not deleted after the run and are thus available for manual inspection (e.g., for debugging purposes).
Functional overview
Orthology-controlling heuristics
Each run of 42
must specify a set of candidate organisms orgs
that are going to be mined for orthologues, a set of reference organisms (ref_orgs
), for which the complete proteomes have to be available (ref_bank_dir
, ref_org_mapper
), and a set of query organisms (query_orgs
), which should be represented in most MSAs to be enriched. These two latter sets of organisms do not need to be identical but certainly can. They will apply to all organisms (orgs
) to be added to yield the new MSAs (out_suffix
).
Collection of queries
For each org
, 42
extracts all sequences belonging to the query_orgs
in order to assemble a list of query_seqs
. Those are used to mine orgs
for homologs (candidate orthologues) and to generate a list of 'validating' orthologues out of ref_orgs
. If a MSA does not contain any sequence fulfilling the selection criteria, 42
warns the user and falls back to selecting the longest sequence instead, which leads to a singleton query_seqs
.
Preflight check of orthology relationships
To ensure that it can accurately enrich MSAs in orthologous sequences, 42
verifies that query_seqs
and ref_orgs
themselves satisfy its orthology criteria. This two-step process is carried out separately for each MSA.
First, an average BLASTP
bit score is computed for each ref_org
based on the individual best hits of each query_seq
against the corresponding complete proteomes. query_seqs
without any hit in a given ref_org
are taken into account by contributing a value of zero to the average bit score for the ref_org
. How exactly first hits are considered best hits is explained in "Identification of best hits for queries".
ref_orgs
without any hit to query_seqs
are automatically discarded, whereas the remaining ones are ranked in descending order on the average bit score. Low-scoring ref_orgs
can be optionally discarded by specifying a value < 1.0 for the ref_org_mul
parameter of the config
file. For example, assuming the user lists 10 different ref_orgs
and set ref_org_mul
to 0.7, at most 7 ref_orgs
will be retained for assessing orthology relationships. This could be the result of the automatic removal of two ref_orgs
without any hit and of an additional low-scoring one to honor the ref_org_mul
setting.
Second, the best hits for each ref_org
are BLAST
ed (BLASTP
) against the complete proteomes of other ref_orgs
to check that they indeed recover the same best hits as the query_seqs
. If any ref_org
fails with any of the other ref_orgs
, a message is issued to warn the user, but 42
proceeds normally. More details about the logic behind this are available in "Identification of orthologues...". Otherwise, the preflight check is considered successful.
Search for homologues using queries
Each one of the query_seqs
is BLAST
ed in turn against each one of the banks
for the current org
. The exact BLAST
flavour is either TBLASTN
or BLASTP
, depending on the sequence type of org
's banks
. Moreover, default options of this first BLAST
can be overridden by specifying key/value pairs in the subsection homologues
under the section blast_args
of the config
file (e.g., low-complexity filters, E-value threshold, maximum number of hits).
The whole set of hits corresponding to all query_seqs
is consolidated into a single list of homologous sequences. These sequences can be optionally trimmed to the segment really covered by the matching query_seqs
. This behaviour is especially useful when using (complete) genome assemblies for enrichment, but also to avoid non-core regions to perturb orthology assessment and improves the reliability of the intron splicing step. It is controlled by the trim_homologues
parameter of the config
file. The details of this trimming step can be fine-tuned by editing the other trim_*
parameters of the config
file. Briefly, trim_max_shift
corresponds to the maximum length allowed for an intron before breaking a homologous sequence into multiple (exon-bounded) subsequences, whereas trim_extra_margin
controls how many additional nucleotides are extracted at each homologue extremities.
Identification of best hits for queries
Each query_seq
is furthermore BLAST
ed (BLASTP
) against the complete proteome of each ref_org
. Again, BLAST
options can be overridden if needed (subsection references
under section blast_args
). For each query_seq
, the best hit in the ref_org
is recorded. However, when bit scores of subsequent hits are nearly equal to the bit score of the best hit, the corresponding sequences are interpreted as closely related in-paralogues and also added to the list of best hits. This behaviour can be tweaked using the bitscore_mul
parameter of the config
file.
As a consequence, several best hits can be recorded for a single query_seq
/ref_org
pair, either because several sequences are available for the query_org
(in-paralogues or out-paralogues in the case of a multigenic family) or because several sequences match a single query_seq
in the org
's banks
(which should be co-orthologues then), or for both reasons. In contrast, if a ref_org
has no homologue for the current MSA, 42
warns the user and drops it from the list of ref_orgs
considered by the orthology-controlling engine.
Identification of orthologues among homologues
To sort out orthologous sequences from paralogous sequences, each homologue in the current org
is BLAST
ed (BLASTX
or BLASTP
) against the complete proteome of each ref_org
(BLAST
options in subsection orthologues
under section blast_args
). And now, here's the heart of 42
's heuristics... To be considered as an orthologue, a homologue must satisfy the following criterion for every one of the (active) ref_orgs
without exception: its best hit in the corresponding complete proteome must be found in the original list of best hits assembled using the query_seqs
.
It is important to note that 42
does not care about which particular query_seq
(or query_seqs
) recovered the homologue in the org
nor about those that recovered the best hits in the complete proteomes of the ref_orgs
. The only thing that matters is that the loop is closed. The set of homologues for which this condition holds then become the orthologues. If the parameter ref_brh
of the config
file is set to off
, all homologues are automatically considered as orthologues (but see "PARA files" just below).
Optimized enrichment of multigenic families
For multigenic families split over multiple MSAs, one can also optionally assemble PARA
files. Such a file should contain sequences representative of the other sub-families of a multigenic family, so as to help 42
to even better discriminate between orthologous and paralogous sequences. Sequences in PARA
files are in FASTA
format but do not need to be aligned. To be considered as an orthologue, a homologue must obtain a best hit BLAST
bit score that is higher when compared to the sequences of the MSA than those of the PARA
file.
For example, let us say we have a family composed of 4 subfamilies (A-D). The initial orthologous group would include the 4 types of paralogues in a single MSA. Based on a phylogenetic analysis of this MSA, we could split this orthologous group into 4 distinct MSAs (A-D). If we consider the enrichment of subfamily A (famA.fasta
), then the sequences of the other subfamilies (B, C and D) should be used to build the PARA
file (famA.para
). Hence, any homologous sequence that would be more similar to a sequence in the PARA
file (say of type B) than to any sequence (of type A) in the MSA would then be rejected as paralogous.
In some cases, PARA
files might be a replacement for the main heuristics of 42
. Yet, both approaches can be used jointly for maximal accuracy.
Consolidation of redundant orthologues
When using (highly) redundant transcriptome assemblies for MSA enrichment, some genes can be represented by a series of very similar transcripts, either partially overlapping or containing minor sequencing errors. To deal with these situations, 42
provides the parameter merge_orthologues
in the config
file, which is set to off
by default. When enabled (with on
), orthologues are first fed to CAP3
in an attempt to merge some of them into contigs. Successfully merged orthologues are identified by a trailing +N
tag where N
is the number of orthologous sequences removed in the merging process. The contig itself is named after the longest orthologous sequence composing it. The details of this merging step can be fine-tuned by editing the other merge_*
parameters in the config
file.
Orthologue post-processing
Once orthologues are identified, each one is BLAST
ed (BLASTX
or BLASTP
) against the MSA itself to recover its closest relatives (BLAST
options in subsection templates
under section blast_args
).
Family affiliation and orthologue naming
If the most closely related sequence in the MSA belongs to a given family (e.g., mt-
), the orthologue is affiliated to the same family, as did the original forty
. This allows enriching MSAs corresponding to multigenic families. Note that only the most closely related sequence can be used to infer the orthologue's family.
The orthologue identifier is built using the org
name and the accession of the corresponding sequence in the org
's banks
, which helps tracking down all the sequences added to a MSA by 42
(e.g., for debugging purposes). This is thus different from the original forty
, in which most sequences were contigs having lost all connection with the nucleotide sequences in the org
's banks
.
Contamination detection and handling
42
then seeks to determine whether the orthologue is a genuine orthologue or a xenologue contaminating the org
's banks
. To this end, two main avenues are available: positive taxonomic filters and decoy proteomes sampled across the diversity of the Tree of Life (TOL). While both approaches are in principle combinable, 42
currently implements them as exclusive options. Positive filters are enabled by adding tax_filter
parameters in the config
file, whereas decoy proteomes further require enabling the tol_check
option.
To use positive filters, 42
must infer the taxonomy of the orthologue by analysing the identifiers of the closest sequences in the MSA. How this is precisely carried out depends on several parameters in the config
file. For simplicity, the user can choose between two predefined modes to set these parameters in bulk: best-hit
and megan-like
.
In best-hit
mode, only the most closely related preexisting sequence is used to infer the orthologue's taxonomy, whereas in <megan-like> mode, several sequences are considered and a Last-Common-Ancestor (LCA) inference is performed on them. The latter mode often yields more reliable taxonomic affiliations, but at the obvious expense of accuracy (i.e., only at the genus or family level instead of species level).
If the inferred taxonomy for the orthologue satisfies the taxonomic filter, the orthologue is simply added to the MSA. Otherwise, it is tagged as a contaminant (c#
). When an orthologue is tagged as a contaminant, the binomial of the organism at the origin of the taxonomic affiliation (or a higher-ranking taxon in case of LCA inference) is further appended to its identifier (i.e., ...Genus_species
). This mechanism can also be used to estimate the level of contamination of any genome or transcriptome (see the "Metagenomic mode" below).
Taxonomic filters are optional and require a local copy of the NCBI Taxonomy database (tax_dir
parameter in the config
file). It can be installed using setup-taxdir.pl
(see "Installation..." below).
The logic behind decoy TOL proteomes is similar to the way PARA
files work but with a twist. To be considered as uncontaminated, an orthologue must obtain a best hit BLAST
bit score that is higher when compared to the sequences of the MSA than those of the decoy TOL proteomes. However, decoy proteomes from organisms taxonomically related to the one to which the orthologue belongs are expected to yield very good bit scores, maybe higher than any bit score from sequences already present in the MSA. To avoid rejecting a genuinely uncontaminated orthologue because of this possibility, 42
skips all decoy hits that satisfy the taxonomic filter specified for the organism being added.
Let us take an example. Imagine that our MSAs only contain sequences from hymenopterans (e.g., wasps and ants). If we enable decoy TOL proteomes when adding some bees, contaminated orthologues corresponding to say, parasitic mites such as Varroa destructor, will be correctly rejected because they match some tick protein included in Ixodes decoy proteome. However, genuine bee sequences would also be wrongly rejected because they match with the Apis decoy proteome. To avoid this, we will use a taxonomic filter to skip Hymenoptera (or even Hexapoda) hits in decoy proteomes.
As one can see, both avenues rely on taxonomic filters. Choosing the right level of taxonomic filtration is no easy task and often requires a bit of testing. In short, the more distant potential contaminants are from organisms being added, the easier it is to find an adequate taxonomic filter (see "tax_filter" below for details).
Alignment and MSA integration
To integrate the orthologue into the MSA, 42
chooses the most appropriate template(s) for alignment among the closest relatives. As for taxonomic inference, it considers each of them in turn and stops once the coverage of the orthologue cannot be significantly improved. This allows 42
to select a slightly less related sequence as a template provided it aligns with a longer part of the orthologue. By how much exactly coverage has to be improved for a close sequence to be retained as a template can be fine-tuned with the coverage_mul
parameter of the config
file.
Then comes the alignment itself. With nucleotide banks
, both BLAST
and exonerate
aligners are available, whereas only BLAST
can be used with protein banks
. The preferred aligner can be specified using the aligner
parameter of the config
file.
The BLAST
aligner has been much improved with respect to the aligner of the original forty
. It extracts all the HSPs for the selected template(s) from the XML BLAST
report and uses them as guides for integrating the orthologue fragments into the MSA. Then, once all fragments have been integrated for all candidate organisms, it merges them into a single contiguous sequence per orthologue. When fragments overlap, the merger gives precedence to the fragments corresponding to the highest-scoring templates and HSPs.
When the new exonerate
aligner is preferred, only the longest selected template is used. In most cases, the orthologue can be aligned as a single large fragment. If not, 42
emits different types of warnings depending on the exact issue. In worst cases (e.g., exonerate
crashing), the orthologue cannot be integrated, often due to structural rearrangements between the orthologue and the template. To avoid discarding the orthologue in such cases, one can enable BLAST
as a fall-back for exonerate failures by setting the aligner
parameter to exoblast
.
Aligned orthologues are integrated into the MSA all together at the end of the file but in the following arrangement: first by family, then by candidate organism and then by accession. Contaminants are interspersed with genuine orthologues but can be easily identified thanks to their tag (c#
).
Redundancy detection and handling
Independently of the aligner, 42
never integrates twice the same sequence for a given organism, even if obtained from multiple orthologues. Further, it filters out subsequences included in sequences from the same organism that are either already present in the MSA or that are listed in the NON
counterpart of the MSA. NON
files are a bit like PARA
files (non-aligned sequences in FASTA
format) except that matches must be exact. Finally, when a newly added orthologue includes a sequence already present in the MSA for the same organism, the latter can be either kept or removed, depending on the value of the parameter ali_keep_lengthened_seqs
in the config
file.
#NEW#
tags
All newly added orthologues are tagged by a specific #NEW#
suffix. This tag helps 42
to organize the post-processing of new sequences (e.g., fragment merging and redundancy detection) but is also useful for the end-user to identify which sequences have been added by 42
. Therefore any preexisting #NEW#
tag is cleared when 42
starts processing a MSA.
While automatic untagging can be disabled via the parameter ali_keep_old_new_tags
in the config
file, one should note that such preexisting new sequences are basically invisible to 42
. This means that they will not be chosen as queries for mining transcriptomes nor as templates for aligning additional new orthologues. Moreover they will not be considered for taxonomic analyses. That is why the recommended approach is to let this parameter set to its default value.
Usage
Installation and dependencies
42
is written in Modern Perl but relies on 1 to 3 external dependencies: NCBI-BLAST+
, Exonerate
and CAP3
. However only BLAST is really required. You should download and install the corresponding binaries the way you feel the most appropriate for your system. (Alas this can be tricky.)
- ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate (use classical v2.2.0 not newer v2.4.0)
- http://seq.cs.iastate.edu/cap3.html
Most other dependencies can be handled automatically by cpanm
. If you cannot (or do not want to) modify your system Perl install, you will need to setup a Perlbrew
environment (https://perlbrew.pl/). Below are two distinct sets of commands that should work on Ubuntu
20.04.
System Perl install
Obviously, this requires admin rights on your system:
$ sudo su
$ apt install zlib1g-dev # maybe unnecessary
$ apt install libssl-dev # maybe unnecessary
$ apt install libmodule-build-tiny-perl # maybe unnecessary
$ apt install ncbi-blast+
$ apt install cpanminus
$ cpanm Bio::FastParsers
$ cpanm Bio::MUST::Core
$ cpanm Bio::MUST::Drivers
$ cpanm Bio::MUST::Apps::FortyTwo
$ exit
If a cpanm
command fails, retype it with the --force
option:
$ cpanm --force Bio::MUST::Drivers
Finally install a local mirror of the NCBI Taxonomy:
$ setup-taxdir.pl --taxdir=taxdump/
Perlbrew install
Depending on how pristine your system is, some of the commands below might
be unnecessary. However they should do no harm.
# install development tools
$ sudo apt update
$ sudo apt install build-essential
$ sudo apt install ncbi-blast+
# download the perlbrew installer...
$ wget -O - http://install.perlbrew.pl | bash
# initialize perlbrew
$ source ~/perl5/perlbrew/etc/bashrc
$ perlbrew init
# search for a recent stable version of the perl interpreter
$ perlbrew available
# install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
# (this will take a while)
$ perlbrew install perl-5.26.2
# install cpanm (for Perl dependencies)
$ perlbrew install-cpanm
# enable the just-installed version
$ perlbrew list
$ perlbrew switch perl-5.26.2
# make perlbrew always available
# if using bash (be sure to use double >> to append)
$ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
# if using zsh (only the destination file changes)
$ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc
Major 42
dependencies are the Bio::MUST
series of modules. Install them as follows.
$ cpanm Bio::FastParsers
$ cpanm Bio::MUST::Core
$ cpanm Bio::MUST::Drivers
Since Bio::MUST
modules rely on external bioinformatics programs and come with complex test suites, they sometimes raise errors during installation. If you encounter any such error, consider enabling --force
and/or --notest
options of cpanm
.
$ cpanm --force Bio::MUST::Drivers
Install 42
itself. All remaining dependencies can also be taken care of by cpanm
.
$ cpanm Bio::MUST::Apps::FortyTwo
Finally install a local mirror of the NCBI Taxonomy. It will be used by 42
to taxonomically affiliate inferred orthologous sequences.
$ setup-taxdir.pl --taxdir=taxdump/
Input and configuration files
To help with the configuration of the numerous parameters of the software, we designed a config
file generator: yaml-generator-42.pl
. When run with the --wizard
option, it will guide you through the configuration by prompting for all required parameters (pressing ENTER
selects the default value). At the end of process, it will produce a YAML
config
file named config-$out_suffix.yaml
and a file (build-$out_suffix.sh
) providing the command to reproduce the exact same configuration without using the wizard.
MSAs (*.fasta
)
42
native file format for MSAs is known as the ALI
format. It is very similar to the well-known FASTA
format except for a few differences: (1) sequences must appear on a single (long) line; (2) gaps are encoded as asterisk characters (*
) instead of dashes (-
) and any whitespace is interpreted as missing character states; (3) sequence identifiers accept a single whitespace between genus and species (more on this just below); and (4) comment lines (starting with the hashtag character #
) are allowed. Although 42
can read and write FASTA
files transparently, its ALI
roots sometimes play tricks to the user.
This is especially true for sequence identifiers. Basically, each identifier has to hold the organism name (org
) followed by a separator (@
) and by a protein/gene accession number. The organism name is usually the binomial name. Genus and species must be separated by a whitespace (
if in ALI
format) or underscore character (_
if in FASTA
format). In addition, strain name and/or NCBI taxon id are also allowed after the species name but each preceded by an underscore character (_
). If both are used in the sequence identifier, the taxon id has to come last. Finally, all sequence identifiers must be unique within each MSA. See examples below:
# Genus species@protacc
>Arabidopsis thaliana@AAL15244
# Genus species_taxonid@protacc
>Arabidopsis thaliana_3702@AAO44026
# Genus species_subspecies_taxonid@protacc
>Arabidopsis lyrata_lyrata_81972@EFH60692
# Genus species_taxonid@protacc
>Archaeoglobus fulgidus_2234@WP_048095550
# Genus species_strain_taxonid@protacc
>Archaeoglobus fulgidus_DSM4304_224325@AAB90113
# Genus species_strain_taxonid@protacc
>archaeon 13_1_20CM_2_54_9_1805008@OLE74253
Reference organisms (ref_orgs
, ref_banks
)
The reference proteome set must be described in the config
file. Firstly, each of the reference proteomes must be in FASTA
format in order to be formatted as a BLAST
database with the makeblastdb
command. For robustness, it is advised to use simple (one-word) sequence identifiers here.
$ for REFORG in *.faa; do makeblastdb -in $REFORG -dbtype prot \
-out `basename $REFORG .faa` -parse_seqids; done
Then, yaml-generator-42.pl
will read a file describing the reference proteome set (ref_org_mapper.idm
). This file is composed of two columns separated by a tabulation character (\t
) with the first column being the organism name (ref_org
) and the second being the database basename (ref_bank
).
If your banks are like this:
$ ls Arabidopsis_thaliana_3702_bank.*
Arabidopsis_thaliana_3702_bank.faa
Arabidopsis_thaliana_3702_bank.phr
Arabidopsis_thaliana_3702_bank.pin
Arabidopsis_thaliana_3702_bank.pog
Arabidopsis_thaliana_3702_bank.psd
Arabidopsis_thaliana_3702_bank.psi
Arabidopsis_thaliana_3702_bank.psq
Then the ref_org_mapper
file should look like this:
Arabidopsis thaliana_3702 Arabidopsis_thaliana_3702_bank
If you mainly work with microbes, you may want to name your banks after the NCBI GCA/GCF accessions of the corresponding genome assemblies. In this case, you can use fetch-tax.pl
to generate a suitable file from a list of such numbers:
$ head -n5 banks.idl
GCA_000008085.1
GCA_000011505.1
GCA_000012285.1
GCA_000014585.1
GCA_000019605.1
$ fetch-tax.pl --taxdir=taxdump/ --org-mapper --item-type=taxid banks.idl
$ head -n5 banks.org-idm
Nanoarchaeum equitans_GCA_000008085.1 GCA_000008085.1
Staphylococcus aureus_GCA_000011505.1 GCA_000011505.1
Sulfolobus acidocaldarius_GCA_000012285.1 GCA_000012285.1
Synechococcus sp._GCA_000014585.1 GCA_000014585.1
Korarchaeum cryptofilum_GCA_000019605.1 GCA_000019605.1
Query organisms (query_orgs
)
query_orgs
should be listed in a file (queries.txt
) and spelled exactly as in your MSAs (excluding the FASTA
-specific `>` character preceding each sequence identifier). This file will be processed by yaml-generator-42.pl
to populate the config
file. To easily draft a list of query_orgs
, you can for example use the 10 to 20 most represented organisms across all your MSAs (prior to enrichment).
$ grep -h \> *.fasta | cut -f1 -d'@' | cut -c2- | sort | uniq -c | sort -rn | head -n10
22498 Danio_rerio
21071 Homo_sapiens
20722 Mus_musculus
18933 Monodelphis_domestica
18616 Loxodonta_africana
17762 Latimeria_chalumnae
17678 Canis_familiaris
17114 Xenopus_tropicalis
16665 Anolis_carolinensis
16611 Sarcophilus_harrisii
Note: Organism names must follow the same rules as above. This means that no underscore should appear between genus and species. 42
emits a warning when suspecting you got it wrong. However, it cannot fix this for you. When working with native ALI
files, this issue does not crop up:
$ grep -h \> *.ali | cut -f1 -d'@' | cut -c2- | sort | uniq -c | sort -rn | head -n10
22498 Danio rerio
21071 Homo sapiens
...
Candidate organisms (orgs
, banks
)
The candidate organisms set must be described in the config
file. Firstly, each of the candidate organism files must be in FASTA
format in order to produce a BLAST
database with the makeblastdb
command:
$ for ORG in *.fna; do makeblastdb -in $ORG -dbtype nucl \
-out `basename $ORG .fna` -parse_seqids; done
Within each BLAST
database, sequence identifiers must be unique. 42
will use the first run of non-whitespace characters as the accession. If this first chunk is composed of multiple parts separated by pipe characters (|
), only the last part is taken as the sequence accession (see "Orthologue naming..." above).
sequence identifier accession
>seq37 seq37
>comp12_c0_seq1 comp12_c0_seq1
>EH093040.1 Sl_SlB_01N04_T7 SLB ... EH093040.1
>MMETSP0151_2-20130828|7_1 len=174 7_1
>gi|301500844|ref|YP_003795256.1| ... YP_003795256.1
Then, as for ref_orgs
above, you need to produce a bank_mapper.idm
file composed of two columns separated by a tabulation character (\t
) with the first column being the organism name (org
) and the second being the database basename (bank
).
Euglena gracilis Euglena_bank
Note: Again, organism names must follow the same rules as above!
Taxonomic filters (tax_filter
)
Note: this section deals with an advanced use of 42
. It can be skipped if you do not plan to check added sequences for potential contaminations (see "Contamination detection..." for theoretical background).
Bio::MUST
modules provide quite sophisticated taxonomic filters. Hence, in tax_filter
syntax, wanted taxa are to be prefixed by a +
symbol, whereas unwanted taxa are to be prefixed by a -
symbol. Wanted taxa are linked by logical ORs while unwanted taxa are linked by logical ANDs. Unwanted taxa should not include wanted taxa because the former ones take precedence over the latter ones. In contrast the opposite helps fine-tuning the tax_filter. Here are a few examples and their meaning:
[ +Poaceae ] # any grass
[ -eudicotyledons ] # anything but a dicotyledon flower
[ +Amphibia, +Amniota ] # any amphibian or amniote
[ +Bacteria, -Cyanobacteria ] # any non-cyanobacterial bacterium
In principle, 42
can also use such filters, but the generator currently supports only the plain tax_filter
syntax shown in the first example (+Poaceae
). Yet, it can assist you in finding the adequate taxa to specify based on the organisms you add.
For this to work, you must define at which taxonomic level(s) you want to set the filter. When specifying several levels (--levels
option), the script will try to check for the next level in case one is missing. You can put as many levels as you want separated by a comma (,
, no whitespace character) when using the --wizard
option and by a whitespace character (
) as a command line argument. Another possibility is to choose manually from the NCBI lineage for those that fail (in this case use the --choose_tax_filter=1
argument). If you want to select manually for each candidate organisme set the argument --choose_tax_filter=2
.
Alternatively, you can define a custom taxonomic filter for each organism by adding a third column to the bank_mapper.idm
file.
Running 42
Assisted configuration using the wizard
Now that you are done preparing files, let's run the wizard!
$ yaml-generator-42.pl --wizard
Using the --wizard
option enables an interactive mode where you will be asked to enter each parameter in the terminal.
Note: Pressing the ENTER
key selects the default value encoded in 42
.
Two run_mode
are available metagenomic
or phylogenomic
. The phylogenomic
mode is designed to enrich MSAs with orthologues for subsequent phylogenomic analysis. In contrast, the metagenomic
mode is designed to estimate the contamination level of transcriptomic data using reference ribosomal protein MSAs. The latter mode does not modify the MSAs but instead produces one taxonomic report per MSA listing the lineage of each identified orthologous sequence. When not specified, run_mode
internally defaults to phylogenomic
.
Note: the phylogenomic
mode also produces taxonomic reports but deprived of taxonomic affiliations for the purpose of one-on-one.pl
(not currently distributed on CPAN
).
The wizard does its best to assist you in building your config
file. In particular, it scans the directories you specify for relevant files. Hence, to identify the banks
and ref_banks
files, it looks for files ending with a specific suffix. These suffices can be provided using the bank_suffix
and ref_bank_suffix
options, respectively. If your banks are built from protein sequences, use .psq
; otherwise, for nucleotide sequences, use .nsq
.
Because of this scanning behavior, it is better to prepare your files directly on the computer on which you plan to run 42
. If you try to prepare your config
file locally (for subsequent upload on a remote computer), it is very likely that the wizard complains about some directories not being found.
Command-line options
Since the configuration (config
) file specifies all the details, running 42
boils down to a simple command:
$ forty-two.pl --config=config.yaml *.fasta
By default, 42
is very terse. Yet it can be made quite verbose using the corresponding --verbosity
option. If you need all the debugging information, select level 6
. In any case, it is useful to redirect the STDERR
stream to a log file for post-run analysis.
$ forty-two.pl --config=config.yaml --verbosity=3 *.fasta 2> 42.log
42
supports multithreading by allowing parallel enrichment of multiple MSAs. This is controlled by the --threads
command line option. MSAs will be arranged in an internal queue and processed in parallel using the specified number of threads. As long as there remain more MSAs to enrich than that number, 42
will makes efficient use of the CPU cores. Obviously, there is no speed gain in specifying more threads than MSAs to process.
$ forty-two.pl --config=config.yaml --threads=20 *.fasta
Unfortunately, the current parallel implementation scheme leads to completely scrambled log files. There is thus no point to ask for a high verbosity level.
Estimation of the contamination level in metagenomic mode
42
can be used as a stand-alone contamination detection tool to spot foreign sequences and estimate the contamination level in transcriptomic or genomic data as well as the taxonomic sources of contamination. To this end, it comes with two sets of ribosomal protein MSAs: one set of 78 eukaryotic MSAs, manually curated and continuously enriched with new species in H. Philippe's lab, and one set of 90 prokaryotic MSAs, fetched from RiboDB. Both sets are available at https://bitbucket.org/phylogeno/42-ribo-msas/.
For each transcriptome/genome, 42
recovers the ribosomal protein orthologs and then labels each one by computing the last common ancestor (LCA) of their closest relatives (best BLAST hits) in the corresponding MSA (excluding self-matches). The algorithm relies on the megan-like
mode described in "Contamination detection and handling". In this regard, since ribosomal proteins are highly conserved, we suggest a more stringent parameterization of the megan-like
algorithm, so as to avoid false positives during LCA computation, with a --tax_score_mul
of 0.99
instead of 0.95
and a --tax_min_ident
of 50
instead of 0
.
The follow up consists in running debrief-42.pl
, which parses the taxonomic reports produced by 42
in order to compare the taxonomic label (LCA) of each ortholog computed by 42
with the source organism lineage (according to NCBI Taxonomy) and classifies the sequences as contaminants if they differ at a predefined taxonomic rank, based on a first user-defined list of taxa (--seq_labeling
). After each ortholog has been classified, an estimated contamination percentage is computed.
Additionally, contaminations are further classified to determine the main sources of contaminants, based on a second user-defined list of taxa (--contam_labeling
), which allows the user to fine control the output report. In this regard, we distinguish two types of sequences, classified contaminations and unclassified contaminations. The latter are those that bear an uninformative taxonomic label, i.e., too broad to point to a specific lineage with accuracy (e.g., Sar
). Finally, the sequences that can only be affiliated at the highest taxonomic levels, such as cellular organisms
, Eukaryota
, Bacteria
or Archaea
, are classified as unknown sequences.
A typical command for running the metagenomic debriefer is shown below:
$ debrief-42.pl --indir=./MSAs/ --in=-42 --taxdir=taxdump/ \
--seq_labeling=seq-labels.idl --contam_labeling=contam-labels.idl
AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
CONTRIBUTOR
Mick VAN VLIERBERGHE <mvanvlierberghe@doct.uliege.be>
COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.