NAME
import_ncbi_mv_hs.pl -- make gff files from NCBI Map Viewer data files.
SYNOPSIS
perl import_ncbi_mv_hs.pl --type type [options]
A QUICK RUN
Download from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/ (or the most current directory) the files
seq_gene.md.gz
gene.q.gz
to the same directory as import_ncbi_mv_hs.pl and execute the command
perl import_ncbi_mv_hs.pl --type gene
This creates the file seq_gene.gff, which can be loaded into a gbrowse database using bp_load_gff.pl.
DESCRIPTION
This script reads two kinds of input files from the NCBI Map Viewer FTP site. The source for human input files is
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview
which contains subdirectories for the various builds. For example, mapview/BUILD.34.3/seq_gene.md.gz would be an input file for use with the subroutine hs_mk_seq_gene.
At the moment this script imports the files seq_gene.md (essentially records from the Entrez Gene database) and seq_sts.md (the UniSTS database). However, many other kinds of data are available from the Map Viewer FTP site.
This script does not load the gff files into the database. This can be achieved by running the script bp_load_gff.pl with the output files (gff files) from import_ncbi_mv_hs.pl.
The argument 'type' to the option '--type' indicates what kind of Map Viewer file to import.
    type   Map Viewer file
    ----   ---------------
    gene   seq_gene.md. The path of this file can be indicated
           with the --seq_gene option. The script can read
           directly from the compressed version seq_gene.md.gz.
    sts    seq_sts.md. Similarly, use the --seq_sts option to specify
           the path.
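As an illustration of reading either the plain-text or the compressed form of a Map Viewer file, here is a minimal sketch. The helper name open_mapview_file is hypothetical, and the use of IO::Uncompress::Gunzip is an assumption made for this example; the script itself may open its input differently.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not part of import_ncbi_mv_hs.pl): return a read
# handle for a Map Viewer file, whether it is plain text or gzip-compressed.
sub open_mapview_file {
    my ($path) = @_;

    # Assumption: compressed files are recognised by their .gz suffix.
    if ($path =~ /\.gz$/) {
        my $gz = IO::Uncompress::Gunzip->new($path)
            or die "Cannot gunzip $path: $GunzipError\n";
        return $gz;
    }

    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    return $fh;
}
```

Either handle can then be read line by line, for example with while (defined(my $line = <$fh>)) { ... }.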
Options (default)
    --type        Type of file: gene, sts. Explained above.
    --seq_gene    Path for file seq_gene.md, text or *.gz (seq_gene.md.gz)
    --gene_q      Path for file gene.q, text or *.gz (gene.q.gz). See hs_mk_seq_gene.
    --seq_sts     Path for file seq_sts.md, text or *.gz (seq_sts.md.gz)
    --chromosome  Only import records for this chromosome
    --gff         Path of gff file to create (default=seq_gene.gff for type=gene, etc)
    --min_pos     Minimum chromosomal position to import
    --max_pos     Maximum chromosomal position to import
Example:
perl import_ncbi_mv_hs.pl --type gene --chromosome 2 --gff seq_gene_chr2.gff
This imports the chromosome 2 records from the file seq_gene.md.gz and writes them to seq_gene_chr2.gff.
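The option table above can be sketched with Getopt::Long as follows. This is only an illustration: the helper name parse_import_options is hypothetical, it is an assumption that the script uses Getopt::Long at all, and the seq_<type>.gff pattern for the default output name is inferred from the "seq_gene.gff for type=gene, etc" wording.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse the documented options from an argument list
# and fill in the documented defaults.
sub parse_import_options {
    my @args = @_;
    my %opt  = (
        seq_gene => 'seq_gene.md.gz',
        gene_q   => 'gene.q.gz',
        seq_sts  => 'seq_sts.md.gz',
    );

    GetOptionsFromArray(
        \@args, \%opt,
        'type=s',       'seq_gene=s', 'gene_q=s',  'seq_sts=s',
        'chromosome=s', 'gff=s',      'min_pos=i', 'max_pos=i',
    ) or die "Bad options\n";

    die "--type must be 'gene' or 'sts'\n"
        unless defined $opt{type} && $opt{type} =~ /^(?:gene|sts)$/;

    # Assumption: the default output name follows the seq_<type>.gff pattern.
    $opt{gff} = "seq_$opt{type}.gff" unless defined $opt{gff};

    return \%opt;
}
```

For example, parse_import_options(@ARGV) would return a hash reference whose gff entry is 'seq_gene.gff' when only --type gene is given.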
AUTHOR
Scott Saccone (ssaccone@han.wustl.edu)
hs_mk_seq_gene
Example:
hs_mk_seq_gene(-seq_gene=>'seq_gene.md.gz',
-gene_q=>'gene.q',
-gff=>'seq_gene_chr1.gff',
-assembly=>'reference',
-chromosome=>1,
-min_pos=>undef,
-max_pos=>undef
);
This converts the human Map Viewer file seq_gene.md to gff format. The gff source field is named "ncbi:mapview:$assembly", where $assembly is specified as an option whose default is 'reference'. Optionally, gene descriptions can be obtained from the Map Viewer file 'gene.q', in which case the group field of the gff gets a 'Note' attribute; for example 'Note "similar to beta-tubulin 4Q"'.
Format of seq_gene.md:
    tab delimited
    header at line 1
    fields:
         0  taxid
         1  chr
         2  chrStart
         3  chrEnd
         4  orientation
         5  contig
         6  cnt_start
         7  cnt_end
         8  cnt_orient
         9  featureName
        10  featureId
        11  featureType
        12  groupLabel
        13  transcript
        14  weight
Notes on the fields:
featureId: has the form GeneID:n where n is the Entrez Gene ID. This
is sometimes the same as the LocusLink ID but I believe LocusLink
is being phased out and these IDs may not always agree. Features that are
grouped together by a common featureId will have a common group id
in the gff file. The transcript aggregator can then be applied.
featureType: is used to define the method field in the gff
record. The values I've seen are GENE, UTR, CDS and PSEUDO. I think
the current transcript aggregator only recognizes CDS (the
GENE records use the 'transcript' method). Perhaps UTR must
be converted to 5'UTR and 3'UTR somehow.
groupLabel: the 'assembly' I believe: 'reference', 'HSC_TCAG' or 'DR51'.
Options (default):
    -seq_gene    mapview file with gene locations, text or *.gz file (seq_gene.md.gz)
    -gene_q      mapview file with gene descriptions, text or *.gz file (gene.q.gz)
    -chromosome  only make records for this chromosome
    -min_pos     minimum chromosomal position
    -max_pos     maximum chromosomal position
    -assembly    which assembly to use (reference)
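To make the field mapping concrete, here is a minimal sketch of turning one seq_gene.md record into a gff line. The helper name seq_gene_line_to_gff, the 'Gene' group class, the -note argument, and the rule that a record must lie entirely inside the min_pos/max_pos window are all assumptions made for illustration; the script's own conversion may differ (for instance, it may remap featureType values to other method names).

```perl
use strict;
use warnings;

# Hypothetical helper: convert one tab-delimited seq_gene.md record into a
# nine-column gff line, or return nothing if the record is filtered out.
sub seq_gene_line_to_gff {
    my ($line, %arg) = @_;
    my $assembly = defined $arg{-assembly} ? $arg{-assembly} : 'reference';

    chomp $line;
    my @f = split /\t/, $line;
    my ($chr, $start, $end, $strand)              = @f[1 .. 4];
    my ($feature_id, $feature_type, $group_label) = @f[10 .. 12];

    # Keep only records from the requested assembly (groupLabel field).
    return unless $group_label eq $assembly;
    return if defined $arg{-chromosome} && $chr ne $arg{-chromosome};

    # Assumption: a record is kept only if it lies entirely in the window.
    return if defined $arg{-min_pos} && $start < $arg{-min_pos};
    return if defined $arg{-max_pos} && $end > $arg{-max_pos};

    # Assumption: 'Gene' is a placeholder group class; records sharing a
    # featureId share a group. The featureType is copied through unchanged.
    my $group = qq(Gene "$feature_id");
    $group .= qq(; Note "$arg{-note}") if defined $arg{-note};

    return join("\t", $chr, "ncbi:mapview:$assembly", $feature_type,
                $start, $end, '.', $strand, '.', $group);
}
```

A caller would skip the header line, call this once per record, and print each defined result to the output gff file.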
read_seq_q
Read Map Viewer file seq_q and store the full gene descriptions. Used by hs_mk_seq_gene.
Format of seq_q:
    tab delimited
    header at line 1
    field 0: GeneID
    field 7: full description
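The lookup described above can be sketched as a hash from field 0 to field 7. This is an illustrative stand-in rather than the body of read_seq_q itself; the helper name read_descriptions is hypothetical, and skipping records with an empty description is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: read a tab-delimited description file from an open
# handle and map GeneID (field 0) to the full description (field 7).
sub read_descriptions {
    my ($fh) = @_;
    my %desc;

    my $header = <$fh>;    # the header is at line 1 and is discarded

    while (defined(my $line = <$fh>)) {
        chomp $line;
        my @f = split /\t/, $line, -1;    # -1 keeps trailing empty fields

        # Assumption: records without a description are skipped.
        next unless @f > 7 && length $f[7];
        $desc{ $f[0] } = $f[7];
    }
    return \%desc;
}
```

The returned hash reference can then supply the text for the 'Note' attribute when a matching GeneID is found.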
hs_mk_seq_sts
Example:
hs_mk_seq_sts(-seq_sts=>'seq_sts.md.gz',
-gff=>'seq_sts_chr1.gff',
-assembly=>'reference',
-chromosome=>1,
-min_pos=>undef,
-max_pos=>undef
);
Convert human Map Viewer file seq_sts.md to gff format. The gff source is 'sts' and the gff method is "ncbi:mapview:$assembly" where $assembly is specified as an option whose default is 'reference'. The group field is of the form 'STS "name"; Name "name"' where name is the featureName field from the Map Viewer file. The group fields will also contain 'UniSTS_ID n' if the UniSTS ID is available in the Map Viewer record.
Format of seq_sts.md:
    tab delimited
    header at line 1
    fields:
         0  taxid
         1  chr
         2  chrStart
         3  chrEnd
         4  orientation
         5  contig
         6  cnt_start
         7  cnt_end
         8  cnt_orient
         9  featureName
        10  featureId
        11  featureType
        12  groupLabel
        13  weight
Notes on the fields:
featureId: has the form UniSTS:n where n is the UniSTS ID.
groupLabel: see hs_mk_seq_gene.
Options (default):
    -seq_sts     Map Viewer file with sts locations. Can read directly from *.gz file (seq_sts.md.gz)
    -chromosome  only make records for this chromosome
    -min_pos     minimum chromosomal position
    -max_pos     maximum chromosomal position
    -assembly    assembly to use (reference)
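The group field described for STS records can be sketched as below. The helper name sts_group_field is hypothetical, and joining the 'UniSTS_ID n' attribute with the same '; ' separator as the other attributes is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: build the gff group field for one STS record from
# its featureName and featureId fields.
sub sts_group_field {
    my ($feature_name, $feature_id) = @_;

    my $group = qq(STS "$feature_name"; Name "$feature_name");

    # featureId has the form UniSTS:n; add the ID only when it is present.
    if (defined $feature_id && $feature_id =~ /^UniSTS:(\d+)$/) {
        $group .= "; UniSTS_ID $1";
    }
    return $group;
}
```

For a record named D1S243 with featureId UniSTS:58721, this yields STS "D1S243"; Name "D1S243"; UniSTS_ID 58721.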