NAME
data2wig.pl
A program to convert a generic data file into a wig file.
SYNOPSIS
data2wig.pl [--options...] <filename>
File options:
-i --in <filename> input file: txt, gff, bed, vcf, etc
-o --out <filename> output file name
-H --noheader input file has no header row
-0 --zero file is in 0-based coordinate system
Column indices:
-a --ask interactive selection of columns
-s --score <index> score column, may be comma list
-c --chr <index> chromosome column
-b --begin --start <index> start coordinate column
-e --end --stop <index> stop coordinate column
--attrib <name> GFF or VCF attribute name of score
Wig options:
-p --step [fixed|variable|bed] type of wig file
--bed --bdg alternative shortcut for bedGraph
--size <integer> step size for fixedStep
--span <integer> span size for fixed and variable
Conversion options:
-f --fast fast mode, no error checking
--name <text> optional track name
--(no)track generate a track line
--mid use the midpoint of feature intervals
--format <integer> format decimal points of scores
-m --method [mean | median | sum | max] combine multiple score columns
BigWig options:
-B --bw --bigwig generate a bigWig file
-d --db <database> database to collect chromosome lengths
--chromof <filename> specify a chromosome file
--bwapp </path/to/wigToBigWig> specify path to wigToBigWig
General options:
-z --gz compress output text files
-v --version print version and exit
-h --help show extended documentation
OPTIONS
The command line flags and descriptions:
File options
- --in <filename>
-
Specify an input file containing either a list of database features or genomic coordinates for which to collect data. The file should be a tab-delimited text file, one row per feature, with columns representing feature identifiers, attributes, coordinates, and/or data values. Genome coordinates are required. The first row should be column headers. Text files generated by other BioToolBox scripts are acceptable. Files may be gzip compressed.
- --out <filename>
-
Optionally specify the name of the output file. The track name is used as default. The '.wig' extension is automatically added if required.
- --noheader
-
The input file does not have column headers, often found with UCSC derived annotation data tables.
- --zero
-
Source data is in interbase coordinate (0-base) system. Shift the start position to base coordinate (1-base) system. Wig files are by definition 1-based. This is automatically handled for most input files. Default is false.
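The coordinate shift described for --zero can be illustrated with a minimal Python sketch. This is not the program's own code; the function name to_one_based is hypothetical, and the sketch assumes the usual convention that a 0-based interval is half-open, so only the start needs to move.

```python
from typing import Tuple


def to_one_based(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to a 1-based, inclusive one.

    Assumption: the input follows the interbase convention, where the
    end coordinate already equals the 1-based inclusive end.
    """
    # Only the start is shifted; the end stays the same.
    return start0 + 1, end0


# The first 100 bases of a chromosome: 0..100 interbase, 1..100 base.
print(to_one_based(0, 100))
```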
Column indices
- --ask
-
Indicate that the program should interactively ask for column indices or text strings for the GFF attributes, including coordinates, source, type, etc. It will present a list of the column names to choose from. Enter nothing for non-relevant columns or to accept default values.
- --score <column_index or list of column indices>
-
Indicate the column index of the dataset in the data table to be used for the score. If a GFF file is used as input, the score column is automatically selected. If not defined as an option, then the program will interactively ask the user for the column index from a list of available columns. More than one column may be specified, in which case the scores are combined using the method specified.
- --chr <column_index>
-
Optionally specify the column index of the chromosome or sequence identifier. This is required to generate the wig file. It may be identified automatically from the column header names.
- --start <column_index>
- --begin <column_index>
-
Optionally specify the column index of the start or chromosome position. This is required to generate the wig file. It may be identified automatically from the column header names.
- --stop <column_index>
- --end <column_index>
-
Optionally specify the column index of the stop or end position. It may be identified automatically from the column header names.
- --attrib <attribute_name>
-
Optionally provide the name of the attribute key which represents the score value to put into the wig file. Both GFF and VCF attributes are supported. GFF attributes are automatically taken from the attribute column (index 9). For VCF columns, provide the index number of the sample column from which to take the value (usually 10 or higher) using the --score option. INFO field attributes can also be taken, if desired (use --score 8).
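As an illustration of what taking a score from a GFF attribute involves, here is a small Python sketch. It is not taken from the program. It assumes GFF3-style key=value pairs separated by semicolons in column 9, and both the function name gff3_attribute and the attribute key coverage are hypothetical.

```python
from typing import Optional
from urllib.parse import unquote


def gff3_attribute(line: str, key: str) -> Optional[str]:
    """Return the value of one attribute key from a GFF3 line, or None."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return None  # not a complete nine-column GFF line
    # Column 9 (list index 8) holds the attributes.
    for pair in columns[8].split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name == key:
            # GFF3 percent-encodes reserved characters in values.
            return unquote(value)
    return None


example = "chr1\tsrc\tregion\t100\t200\t.\t+\t.\tID=r1;coverage=12.5"
print(gff3_attribute(example, "coverage"))
```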
Wig options
- --step [fixed | variable | bed]
-
The type of step progression for the wig file. Three wig formats are available:
  - fixedStep: where data points are positioned at equal distances along the chromosome.
  - variableStep: where data points are variably positioned along the chromosome.
  - bed (bedGraph): where scores are associated with intervals defined by start and stop coordinates.
The fixedStep wig file has one column of data (score), the variableStep wig file has two columns (position and score), and the bedGraph has four columns of data (chromosome, start, stop, score). If the option is not defined, then the format is automatically determined from the metadata of the file.
- --bed
- --bdg
-
Convenience option to specify a bedGraph file should be written. Same as specifying --step=bed.
- --size <integer>
-
Optionally define the step size in bp for 'fixedStep' wig file. This value is automatically determined from the table's metadata, if available. If the --step option is explicitly defined as 'fixed', then the step size may also be explicitly defined. If this value is not explicitly defined or automatically determined, the variableStep format is used by default.
- --span <integer>
-
Optionally indicate the size of the region in bp to which the data value should be assigned. The same size is assigned to all data values in the wig file. This is useful, for example, with microarray data where all of the oligo probes are the same length and you wish to assign the value across the oligo rather than the midpoint. The default is inherently 1 bp.
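To make the difference between the three output formats concrete, the following Python sketch renders a few data points in each layout, following the UCSC wiggle and bedGraph descriptions. It is an illustration under stated assumptions and not the program's implementation; the function names are hypothetical, and the bedGraph helper assumes its input intervals are 1-based and inclusive.

```python
from typing import List, Sequence, Tuple


def fixed_step(chrom: str, start: int, step: int, span: int,
               scores: Sequence[float]) -> List[str]:
    """One declaration line, then one score per line."""
    lines = [f"fixedStep chrom={chrom} start={start} step={step} span={span}"]
    lines += [str(score) for score in scores]
    return lines


def variable_step(chrom: str, span: int,
                  points: Sequence[Tuple[int, float]]) -> List[str]:
    """One declaration line, then 'position score' per line."""
    lines = [f"variableStep chrom={chrom} span={span}"]
    lines += [f"{pos} {score}" for pos, score in points]
    return lines


def bedgraph(intervals: Sequence[Tuple[str, int, int, float]]) -> List[str]:
    """Four tab-separated columns: chromosome, start, stop, score.

    bedGraph starts are 0-based, so a 1-based start is shifted down by one.
    """
    return [f"{c}\t{s - 1}\t{e}\t{v}" for c, s, e, v in intervals]


print("\n".join(fixed_step("chr1", 1, 10, 10, [0.5, 1.5])))
print("\n".join(variable_step("chr1", 1, [(5, 2.0), (42, 3.5)])))
print("\n".join(bedgraph([("chr1", 1, 10, 3.0)])))
```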
Conversion options
- --fast
-
Disable checks for overlapping or duplicated intervals, unsorted data, valid score values, and calculated midpoint positions. Requires setting the chromosome, start, end (for bedGraph files only), and score column indices. WARNING: Use only if you trust your input file format and content.
- --name <text>
-
The name of the track defined in the wig file. The default is to use the name of the chosen score column, or, if the input file is a GFF file, the base name of the input file.
- --(no)track
-
Do (not) include the track line at the beginning of the wig file. Wig files normally require a track line, but if you will be converting to the binary bigwig format, the converter requires no track line. Why it can't simply ignore the line is beyond me. This option is automatically set to false when the --bigwig option is enabled.
- --mid
-
A boolean value to indicate whether the midpoint between the actual 'start' and 'stop' values should be used. The default is to use only the 'start' position.
- --format <integer>
-
Indicate the number of decimal places to which the score value should be formatted. The default is to not format the score value.
- --method [mean | median | sum | max]
-
Define the method used to combine multiple data values at a single position. Wig files do not tolerate multiple identical positions. Default is mean.
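The idea behind --method can be sketched in Python as follows: scores that share a chromosome and position are grouped, then reduced to a single value. This is only an illustration of the concept, not the program's code; the name combine_positions is hypothetical, and the sketch sorts chromosomes lexically, which may differ from the order the program writes.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List, Sequence, Tuple

# The four method names listed for --method.
METHODS = {"mean": mean, "median": median, "sum": sum, "max": max}


def combine_positions(points: Sequence[Tuple[str, int, float]],
                      method: str = "mean") -> List[Tuple[str, int, float]]:
    """Collapse scores at identical (chromosome, position) pairs."""
    combine = METHODS[method]
    grouped: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for chrom, pos, score in points:
        grouped[(chrom, pos)].append(score)
    # One output row per unique position, sorted by chromosome and position.
    return [(chrom, pos, combine(scores))
            for (chrom, pos), scores in sorted(grouped.items())]


probes = [("chr1", 100, 2.0), ("chr1", 100, 4.0), ("chr1", 200, 1.0)]
print(combine_positions(probes, "mean"))
```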
BigWig options
- --bigwig
- --bw
-
Indicate that a binary BigWig file should be generated instead of a text wiggle file.
- --db <database>
-
Specify the name of a Bio::DB::SeqFeature::Store annotation database or other indexed data file, e.g. Bam or bigWig file, from which chromosome length information may be obtained. It may be supplied from the input file metadata.
- --chromof <filename>
-
When converting to a BigWig file, provide a two-column tab-delimited text file containing the chromosome names and their lengths in bp. Alternatively, provide the name of a database with the --db option.
- --bwapp </path/to/wigToBigWig>
-
Specify the path to the UCSC wigToBigWig conversion utility. The default is to first check the BioToolBox configuration file biotoolbox.cfg for the application path. Failing that, it will search the default environment path for the utility. If found, it will automatically execute the utility to convert the wig file.
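The two-column chromosome file expected by --chromof is simple enough to produce by hand or with a few lines of code. The Python sketch below formats and parses such a file. The function names are hypothetical, and the chromosome names and lengths are toy values chosen for illustration, not real genome data.

```python
from typing import Dict


def format_chrom_sizes(lengths: Dict[str, int]) -> str:
    """Render chromosome names and lengths as tab-delimited lines."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


def parse_chrom_sizes(text: str) -> Dict[str, int]:
    """Read the same two-column layout back into a dictionary."""
    sizes: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes


toy = {"chrA": 1000, "chrB": 2500}  # illustrative values only
print(format_chrom_sizes(toy), end="")
```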
General options
- --gz
-
A boolean value to indicate whether the output wiggle file should be compressed with gzip.
- --version
-
Print the version number.
- --help
-
Display the POD documentation.
DESCRIPTION
This program will convert any tab-delimited data text file into a wiggle formatted text file. This requires that the file contains not only the scores but also chromosomal coordinates, i.e. chromosome, start, and (optionally) stop. The program should automatically detect these columns (if appropriately labeled) or they can be specified. An option exists to use the midpoint of a region, e.g. microarray probe.
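A midpoint calculation of the kind mentioned above might look like the following Python sketch. How the program rounds when the interval has an even number of bases is not stated here, so rounding down is an assumption of this sketch, and the function name midpoint is hypothetical.

```python
def midpoint(start: int, stop: int) -> int:
    """Return the integer midpoint of a 1-based, inclusive interval.

    Assumption: fractional midpoints are rounded down.
    """
    return (start + stop) // 2


# A 60 bp probe covering positions 101 through 160.
print(midpoint(101, 160))
```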
The wig file format is specified by documentation supporting the UCSC Genome Browser and detailed here: http://genome.ucsc.edu/goldenPath/help/wiggle.html. Three formats are supported, 'fixedStep', 'variableStep', and 'bedGraph'. The format may be requested or determined empirically from the input file metadata. Genomic bin files generated with BioToolBox scripts record the window and step values in the metadata, which are used to determine the span and step wig values, respectively. The variableStep format is otherwise generated by default. The span is, by default, 1 bp.
Wiggle files cannot tolerate multiple data points at the same position, e.g. multiple microarray probes matching a repetitive sequence. An option exists to mathematically combine these positions into one value.
Strand is not inherently supported in wig files. If you have stranded data, they should be split into separate files. The BioToolBox script split_data_file.pl can be used for this purpose.
A binary BigWig file may also be generated from the text wiggle file. The binary format is preferable to the text version for a variety of reasons, including fast, random access and no loss in data value precision. More information can be found at this location: http://genome.ucsc.edu/goldenPath/help/bigWig.html. Conversion requires BigWig file support, supplied by the external wigToBigWig or bedGraphToBigWig utility available from UCSC.
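For readers who want to run the conversion step themselves, the Python sketch below assembles the argument list for the UCSC utilities, which take an input file, a chromosome sizes file, and an output file in that order. It is not how the program itself invokes the converter; the function name bigwig_command and the file names are hypothetical.

```python
import shutil
from typing import List, Optional


def bigwig_command(wig_path: str, chrom_sizes_path: str, out_path: str,
                   app: Optional[str] = None,
                   bedgraph: bool = False) -> List[str]:
    """Build the argument list for wigToBigWig or bedGraphToBigWig."""
    default = "bedGraphToBigWig" if bedgraph else "wigToBigWig"
    # Prefer an explicit path, then the environment path, then the bare name.
    executable = app or shutil.which(default) or default
    return [executable, wig_path, chrom_sizes_path, out_path]


cmd = bigwig_command("scores.wig", "chrom.sizes", "scores.bw",
                     app="/path/to/wigToBigWig")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the utility is installed.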
AUTHOR
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.