NAME
Bio::WGS2NCBI - module to assist in submitting whole genome sequencing projects to NCBI
DESCRIPTION
This module documents the actions (prepare, process, convert, trim, prune, compress and help) that are available to users of the wgs2ncbi script. Each of these steps is configured by one or more configuration files. In the documentation below, the relevant fields from these configuration files are listed. To understand how the configuration system itself works, consult the documentation at Bio::WGS2NCBI::Config.
prepare
The prepare action takes the annotation file (in GFF3 format) and extracts the relevant information from it, writing it to a potentially large set of files. This is done because GFF3 annotation files can become quite large, so finding the annotations for any particular scaffold by scanning through the whole file might take a long time. Instead, the set of annotations is reduced by taking the following steps:
- Remove all sequence data from the file - Embedded FASTA data is permissible according to the GFF3 standard, but this makes files needlessly bulky if the FASTA data are available separately as well.
- Remove all annotations from unrecognized sources - GFF3 files can contain annotations from sources you don't particularly trust and want to ignore in your submission. This is configurable.
- Remove all irrelevant features - any sequence features in GFF3 that are not recognized in NCBI feature tables are discarded. This is configurable.
Subsequently, the annotations are split such that there is a separate file for each scaffold (or chromosome, if that is how your annotations are organized). This way, the relevant information for any scaffold can be found much more quickly through the file system rather than by scanning through a large file.
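The reduce-and-split steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the module's own implementation: the function name split_gff3 and the example source and feature names are hypothetical, and the sketch assumes standard nine-column, tab-separated GFF3 records.

```python
from collections import defaultdict


def split_gff3(lines, sources, features):
    """Group the retained GFF3 records by sequence ID (column 1)."""
    by_scaffold = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is embedded sequence data
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 record
        seqid, source, feature_type = columns[0], columns[1], columns[2]
        # keep only trusted sources and wanted feature types
        if source in sources and feature_type in features:
            by_scaffold[seqid].append(line)
    return dict(by_scaffold)


# Hypothetical input: two scaffolds, one untrusted source, one unwanted feature
example = [
    "##gff-version 3",
    "scaffold1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaffold1\trepeatmasker\tmatch\t10\t50\t.\t+\t.\tID=r1",
    "scaffold2\tmaker\tCDS\t5\t300\t.\t-\t0\tID=c1",
    "scaffold2\tmaker\tcontig\t1\t5000\t.\t.\t.\tID=s2",
    "##FASTA",
    ">scaffold1",
    "ACGT",
]
result = split_gff3(example, sources={"maker"}, features={"gene", "CDS"})
for seqid, records in sorted(result.items()):
    print(seqid, len(records))
```

In a real run, each group would then be written to its own file in the output directory, so that the annotations for a scaffold can be located by file name alone.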
In order for this action to succeed, the following configuration values need to be provided:
gff3file
-
The location of the input annotation file in GFF3 format.
gff3dir
-
The location of the output directory for the split annotation files.
source
-
Which annotation source to trust.
feature
-
Which features to retain.
process
The process action takes the sequence file (in FASTA format, with a record for each scaffold or chromosome) and the pre-processed annotations and converts these into feature table files and (masked) sequence files.
In order for this action to succeed, the following configuration values need to be provided:
datafile
-
The location of the input FASTA file, with a record for each scaffold or chromosome.
info
-
An INI-style configuration file that contains key/value pairs that will be embedded in the FASTA sequence headers of the produced output files. This is typically used for metadata about the sampled organism, such as its sex, collection locality, collected cell type, etc.
masks
-
An INI-style configuration file that contains the coordinates of sequence segments to mask. This may be needed because NCBI performs a strict screen for unclipped adaptor sequences and contaminants. The report that NCBI returns states the sequence coordinates of the segments that it will not accept in a submission. By putting these coordinates in this file, the offending segments will be replaced with NNNs.
products
-
An INI-style configuration file that contains the corrected names for protein products. The rationale is that your genome annotation process may introduce protein names that NCBI would like to deprecate, such as names that include molecular weights, database identifiers, references to 'homology', and so on. The discrepancy report that is produced in the "convert" step (see "convert" in Bio::WGS2NCBI) will be a first guide in composing corrected names, but the validation that NCBI will perform will likely point out additional errors.
gff3dir
-
The location of the pre-processed GFF3 files as produced by "prepare" in Bio::WGS2NCBI.
datadir
-
The location of the output dir where the (potentially 'chunked', see below) sequence files and feature tables will be written.
prefix
-
A short character sequence that is prefixed to every sequence record identifier that is generated. NCBI will provide submitters with this prefix when the submission is initialized.
-
This is a naming authority that will be applied to all sequence record identifiers. A reasonable value for this could be the name of the lab or institution that leads the project resulting in the submission. NCBI intends this authority, in combination with the prefix, as a way to ensure that sequences are globally uniquely identifiable.
minlength
-
The minimum length of a scaffold to be retained in a submission. This should be 200 or above.
minintron
-
The minimum length of an intron to be retained in a submission. Introns shorter than this are interpreted (by NCBI) to be spurious and should therefore be discarded. As a consequence, the gene that contains such an intron will be annotated as a pseudogene. This value must be 10 or above.
chunksize
-
The output that is produced can be combined into chunks of more than one scaffold per file. To keep the number of files manageable it is convenient to set this to a large value, but less than or equal to 10,000.
limit
-
This parameter allows you to run the process on only a limited set of scaffolds. This is provided for testing, "dry run" purposes. For real usage this value must be set to 0.
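To make the masks, minlength and chunksize settings more concrete, here is a minimal Python sketch of masking, length filtering and chunking. It is not the module's code: the function names are hypothetical, the mask coordinates are assumed to be 1-based and inclusive, and the toy values are far smaller than anything used in a real submission.

```python
def mask_segments(sequence, segments):
    """Replace each (start, end) segment with Ns.

    Assumption: coordinates are 1-based and inclusive.
    """
    bases = list(sequence)
    for start, end in segments:
        for i in range(start - 1, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


def chunk(records, chunksize):
    """Yield successive lists of at most `chunksize` records."""
    for offset in range(0, len(records), chunksize):
        yield records[offset:offset + chunksize]


# Hypothetical toy data
scaffolds = {"s1": "ACGTACGTAC", "s2": "ACG", "s3": "TTTTGGGGCC"}
masks = {"s1": [(3, 5)]}
minlength = 5  # toy value; the text above asks for 200 or more

# Drop scaffolds that are too short, mask the rest
kept = [
    (name, mask_segments(seq, masks.get(name, [])))
    for name, seq in scaffolds.items()
    if len(seq) >= minlength
]

# Group the retained scaffolds into output files of `chunksize` records each
chunks = list(chunk(kept, chunksize=1))
print(kept)
print(len(chunks))
```

With a larger chunksize, several scaffolds end up in the same output file, which keeps the number of files manageable for large assemblies.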
convert
The convert action runs the tbl2asn program provided by NCBI with the right settings. This requires the following configuration settings:
datadir
-
The location of the dir where the (potentially 'chunked', see chunksize above) sequence files and feature tables were written by "process" in Bio::WGS2NCBI.
template
-
The location of the template file produced with the form at: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi
outdir
-
The location where to write the resulting ASN.1 files.
discrep
-
The location where to write the discrepancy report.
tbl2asn
-
The location where the tbl2asn executable is located.
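As a rough illustration of how these settings could map onto a tbl2asn invocation, the Python sketch below assembles an argument list. The mapping is an assumption, not a description of what the convert action actually runs: it uses only the tbl2asn options -p (path to input files), -t (template file), -r (path for results) and -Z (discrepancy report file), and the real action may pass additional options. All paths are hypothetical.

```python
import shlex


def tbl2asn_command(config):
    """Build an argument list for tbl2asn from the settings listed above."""
    return [
        config["tbl2asn"],
        "-p", config["datadir"],   # directory holding sequence and table files
        "-t", config["template"],  # submission template
        "-r", config["outdir"],    # directory for the resulting ASN.1 files
        "-Z", config["discrep"],   # discrepancy report
    ]


# Hypothetical configuration values
config = {
    "tbl2asn": "/usr/local/bin/tbl2asn",
    "datadir": "submission/data",
    "template": "submission/template.sbt",
    "outdir": "submission/asn1",
    "discrep": "submission/discrep.txt",
}
command = tbl2asn_command(config)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.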
trim
The trim action trims stretches of leading or trailing NNNs from sequence records, and updates the coordinates in the associated feature tables accordingly. In cases where a feature falls within a trimmed region, the feature is removed entirely. This requires the following configuration setting:
datadir
-
The location of the dir where the (potentially 'chunked', see chunksize above) sequence files and feature tables were written by "process" in Bio::WGS2NCBI.
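The coordinate bookkeeping behind trimming can be sketched as follows. This is a simplified Python illustration, not the module's implementation: features are reduced to hypothetical 1-based, inclusive (start, end) pairs, and the sketch assumes that any feature overlapping a trimmed region is dropped.

```python
def trim_record(sequence, features):
    """Trim leading/trailing Ns and shift feature coordinates to match.

    Assumption: features are 1-based, inclusive (start, end) pairs.
    """
    leading = len(sequence) - len(sequence.lstrip("Nn"))
    trimmed = sequence.strip("Nn")
    adjusted = []
    for start, end in features:
        new_start, new_end = start - leading, end - leading
        if new_start < 1 or new_end > len(trimmed):
            continue  # feature touches a trimmed region: remove it
        adjusted.append((new_start, new_end))
    return trimmed, adjusted


# Hypothetical record: 3 leading Ns, 2 trailing Ns, three features
sequence = "NNNACGTACGTNN"
features = [(1, 2), (4, 8), (10, 13)]
trimmed, adjusted = trim_record(sequence, features)
print(trimmed, adjusted)
```

Only the middle feature survives here; it shifts left by the number of leading Ns that were removed.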
prune
The prune action reads a discrepancy file as supplied by NCBI and parses out the errors that have locations in them. The entries at those locations are then pruned from the table files in $config->datadir.
This requires the following configuration settings:
datadir
-
The location of the dir where the (potentially 'chunked', see chunksize above) sequence files and feature tables were written by "process" in Bio::WGS2NCBI.
validation
-
The location where to read the validation report from NCBI.
prefix
-
The ID prefix that was assigned to you by NCBI when you created your submission, something like 'CR513_'.
-
The naming authority prefix that you chose for your identifiers, something like 'gnl|aceprd|'.
compress
The compress action bundles the ASN.1 files produced by "convert" in Bio::WGS2NCBI into a .tar.gz archive that can be uploaded to NCBI. This requires the following configuration settings:
outdir
-
The location where the ASN.1 files were written.
archive
-
The name and location of the archive to produce.
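The bundling step amounts to writing a gzipped tar archive. The Python sketch below shows the idea using the standard tarfile module; it is not the module's own code. The function name, the file names and the assumption that the ASN.1 files carry a .sqn extension are all hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def compress(outdir, archive, pattern="*.sqn"):
    """Bundle files matching `pattern` in `outdir` into a .tar.gz archive."""
    members = sorted(Path(outdir).glob(pattern))
    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            tar.add(path, arcname=path.name)  # store without directory prefix
    return [path.name for path in members]


# Demonstration in a throwaway directory with dummy files
with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp) / "asn1"
    outdir.mkdir()
    (outdir / "chunk1.sqn").write_text("dummy ASN.1 content")
    (outdir / "notes.txt").write_text("not part of the submission")
    archive = Path(tmp) / "submission.tar.gz"
    added = compress(outdir, archive)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()

print(added, names)
```

Storing members under their bare file names keeps the archive flat, so it unpacks into a single directory of ASN.1 files.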
help
Displays module documentation (which you are reading now).