NAME
bp_genbank2gff3.pl -- GenBank->GBrowse-friendly GFF3
SYNOPSIS
bp_genbank2gff3.pl [options] filename(s)
# process a directory containing GenBank flatfiles
perl bp_genbank2gff3.pl --dir path_to_files --zip
# process a single file, ignore explicit exons and introns
perl bp_genbank2gff3.pl --filter exon --filter intron file.gbk.gz
# process a list of files
perl bp_genbank2gff3.pl *gbk.gz
Options:
--dir     -d  path to a directory of GenBank flatfiles
--outdir  -o  location to write GFF files
--zip     -z  compress GFF3 output files with gzip
--summary -s  print a summary of the features in each contig
--filter  -x  GenBank feature type(s) to ignore
--split   -y  split output into separate GFF and FASTA files for
              each GenBank record
--nolump  -n  write a separate file for each reference sequence
              (default is to lump all records together into one
              output file for each input file)
--ethresh -e  error threshold for the unflattener;
              set this high (>2) to ignore all unflattener errors
--help    -h  display this message
DESCRIPTION
This script uses Bio::SeqFeature::Tools::Unflattener and Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene containment hierarchies mapped for optimal display in gbrowse.
The input files are assumed to be gzipped GenBank flatfiles for RefSeq contigs. The files may contain multiple GenBank records. Either a single file or an entire directory can be processed. By default, the DNA sequence is embedded in the GFF, but it can be saved to a separate FASTA file with the --split (-y) option.
If an input file contains multiple records, the default behaviour is to dump all GFF and sequence to a file of the same name (with .gff appended). Using the --nolump option creates a separate file for each GenBank record. Using the --split option creates separate GFF and FASTA files for each GenBank record.
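The three output modes described above can be compared side by side with a dry run. The following is only a sketch: the function name print_mode_commands, the input file contigs.gbk.gz and the output directory gff_out are placeholders for illustration, and the commands are printed rather than executed.

```shell
# Hypothetical dry-run helper: print one command line per output mode.
# Only options documented above are used; file and directory names are placeholders.
print_mode_commands() {
    input=${1:-contigs.gbk.gz}   # placeholder input file
    outdir=${2:-gff_out}         # placeholder output directory
    for mode in "" "--nolump" "--split"; do
        # $mode is intentionally unquoted so the empty default adds no argument
        echo perl bp_genbank2gff3.pl $mode --outdir "$outdir" "$input"
    done
}

print_mode_commands
```

Removing the echo would run each command in turn; the first line corresponds to the default lumped output, the second to one file per record, and the third to separate GFF and FASTA files per record.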
Notes
Note1:
In cases where the input files contain many GenBank records (for example, the chromosome files for the mouse genome build), a very large number of output files will be produced if the --split or --nolump option is selected. If you end up with more than 6000 files, use the --long_list option in bp_bulk_load_gff.pl or bp_fast_load_gff.pl to load the GFF and/or FASTA files.
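To judge in advance how many output files to expect, the records in each input file can be counted before conversion. This is a minimal sketch, not part of the script: the helper name count_genbank_records and the path path_to_files are placeholders, and it relies only on the fact that every GenBank record begins with a LOCUS line.

```shell
# Hypothetical helper (not part of bp_genbank2gff3.pl): count the GenBank
# records in a gzipped flatfile. Every record starts with a LOCUS line.
count_genbank_records() {
    gzip -dc "$1" | grep -c '^LOCUS'
}

# Report the record count for each input file (the path is a placeholder).
for f in path_to_files/*.gbk.gz; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s\n' "$f" "$(count_genbank_records "$f")"
done
```

Going by the option descriptions, --nolump should yield roughly one output file per record and --split roughly two (one GFF, one FASTA), so the sum of these counts gives a rough idea of whether the 6000-file mark will be exceeded.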
Note2:
This script is designed for RefSeq genomic sequence entries. It may work for third-party annotations, but this has not been tested.
AUTHOR
Sheldon McKay (mckays@cshl.edu)
Copyright (c) 2004 Cold Spring Harbor Laboratory.