NAME

cx-genbank2chaos.pl.pl

SYNOPSIS

cx-genbank2chaos.pl.pl sample-data/AE003734.gbk > AE003734.chaos.xml

cx-genbank2chaos.pl.pl -islands sample-data/AE003734.gbk

DESCRIPTION

Converts a GenBank file to a chaos XML file (or a collection of chaos XML files).

The GenBank file is 'unflattened' in order to infer the relationships between features.

With the -islands option set, this loops through a list of GenBank-formatted files and builds a chaos file for every gene.

By default it stores each gene in a directory named after the sequence accession, and names each file by the unique feature_id; for example:

AE003644.2/
  gene:EMBLGenBankSwissProt:AE003644:128108:128179.xml
  gene:EMBLGenBankSwissProt:AE003644:128645:128716.xml
  gene:EMBLGenBankSwissProt:AE003644:128923:128994.xml

You can change the field used to name the file with -nameby; for example, if you use the chado/chaos name field like this:

cx-genbank2chaos.pl.pl -islands -nameby name AE003734.gbk

you will get:

AE003644.3/
 noc.xml
 osp.xml
 BG:DS07721.3.xml

The default is the feature_id field, which is usually more Unix-friendly (fly gene names can contain all kinds of unusual characters). Naming files by the 'name' field can also run into uniqueness issues.
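The uniqueness concern can be illustrated with a small core-Perl sketch. It is not taken from the script: find_name_collisions() is a hypothetical helper, the features are plain hashes rather than chaos objects, and the IDs and names below are made-up sample data.

```perl
use strict;
use warnings;

# Group features by the value that would become the file name and report
# every value shared by more than one feature. Features are plain hashes
# here (an assumption for illustration); the keys feature_id and name
# mirror the two fields discussed above.
sub find_name_collisions {
    my ($field, @features) = @_;
    my %seen;
    push @{ $seen{ $_->{$field} } }, $_ for @features;
    my @dups = grep { @{ $seen{$_} } > 1 } keys %seen;
    my @sorted = sort @dups;
    return @sorted;
}

# Made-up sample data: two features share the name 'noc'.
my @features = (
    { feature_id => 'gene:A:1:100',   name => 'noc' },
    { feature_id => 'gene:A:200:300', name => 'noc' },
    { feature_id => 'gene:A:400:500', name => 'osp' },
);

print join(',', find_name_collisions('name', @features)), "\n";        # prints "noc"
print join(',', find_name_collisions('feature_id', @features)), "\n";  # prints an empty line
```

With -nameby name, two such features would compete for the same noc.xml; naming by feature_id avoids that.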

HOW IT WORKS

1 - parse genbank to bioperl

uses Bio::SeqIO::genbank

2 - unflatten the flat list of bioperl SeqFeatures

uses Bio::SeqFeature::Tools::Unflattener

3 - turn bioperl objects into a chaos data structure

uses Bio::SeqIO::chaos

4 - remap every gene to an 'island' (virtual contig)

uses Bio::Chaos::ChaosGraph

5 - spit out each virtual contig chaos graph to a file

uses Bio::Chaos::ChaosGraph
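Steps 1 and 2 can be sketched with the public BioPerl API as follows. This is a minimal illustration, not the script's actual code: unflatten_genbank() is a hypothetical helper name, and steps 3 to 5 (the chaos conversion and island remapping) are not shown.

```perl
use strict;
use warnings;

# Sketch of steps 1 and 2: parse GenBank records with BioPerl, then
# unflatten each record's flat feature list into a nested hierarchy.
# unflatten_genbank() is a hypothetical name, not part of the script.
sub unflatten_genbank {
    my ($file, $ethresh) = @_;

    # Loaded at call time so this sketch still compiles without BioPerl.
    require Bio::SeqIO;
    require Bio::SeqFeature::Tools::Unflattener;

    # Step 1: GenBank flat file -> Bio::Seq objects
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');

    my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
    $unflattener->error_threshold($ethresh) if defined $ethresh;

    my @seqs;
    while (my $seq = $in->next_seq) {
        # Step 2: nest the features in place; -use_magic asks the
        # Unflattener to guess the structure of the record.
        $unflattener->unflatten_seq(-seq => $seq, -use_magic => 1);
        push @seqs, $seq;
    }
    return @seqs;
}
```

A call such as unflatten_genbank('sample-data/AE003734.gbk', 3) would return the records with their top-level features (typically genes) available through get_SeqFeatures.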

ARGUMENTS

-islands

exports one file per gene

-ethresh ERRORTHRESH

Sets the error threshold. See Bio::SeqFeature::Tools::Unflattener.

You will usually want to keep this at its default setting of 3 (insensitive).

-remove_type GENBANKFEATURETYPE

This removes all features of the given type prior to unflattening.

This is useful if you wish to exclude a certain kind of feature (e.g. variation) from your analysis.

It is also required for the genbank release of S_Pombe, which has a few scattered features, purportedly of type mRNA, that confuse the unflattening process.
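The effect of -remove_type can be approximated with standard Bio::Seq methods. This is a sketch under assumptions: remove_features_of_type() is a hypothetical helper name, and the script itself may implement the filtering differently.

```perl
use strict;
use warnings;

# Drop every top-level feature whose primary tag equals $type, before the
# record is unflattened. $seq is expected to behave like a Bio::Seq
# object; remove_features_of_type() is a hypothetical helper name.
sub remove_features_of_type {
    my ($seq, $type) = @_;

    # remove_SeqFeatures detaches all features and returns them
    my @all  = $seq->remove_SeqFeatures;
    my @kept = grep { $_->primary_tag ne $type } @all;

    # re-attach only the features that are not of the unwanted type
    $seq->add_SeqFeature(@kept) if @kept;

    return scalar(@all) - scalar(@kept);   # number of features removed
}
```

For instance, remove_features_of_type($seq, 'variation') would strip variation features from a record read with Bio::SeqIO.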

-include_haplotypes

By default, only reference sequences are exported. If the genbank definition line contains the string "haplotype", the record is probably an alternative haplotype that will skew analyses. Such records are removed by default, unless this switch is set.

For an example, see contigs NG_002432 and NT_007592 (the former is an alternative haplotype of the latter).
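The default haplotype check can be sketched as a test on the record description, which is where BioPerl stores the GenBank DEFINITION line. The helper name looks_like_haplotype() is hypothetical, and the case-sensitive literal match is an assumption; the script's exact test may differ.

```perl
use strict;
use warnings;

# Return 1 if the record's description (the GenBank DEFINITION line in
# BioPerl) contains the string "haplotype", 0 otherwise.
# looks_like_haplotype() is a hypothetical name; the case-sensitive
# match is an assumption.
sub looks_like_haplotype {
    my ($seq) = @_;
    my $desc = $seq->desc;
    my $is_hap = (defined $desc && $desc =~ /haplotype/) ? 1 : 0;
    return $is_hap;
}
```

Given a list of parsed records in @seqs, grep { !looks_like_haplotype($_) } @seqs would keep only the records that pass this check.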

REQUIREMENTS

You will need a very up-to-date bioperl, probably from CVS, with the Unflattener modules added.