NAME
App::Sandy::Command::Transcriptome - simulate command class. Simulate transcriptome sequencing
VERSION
version 0.25
SYNOPSIS
sandy transcriptome [options] <fasta-file>
Arguments:
a fasta file
Input/Output options:
-h, --help brief help message
-H, --man full documentation
-v, --verbose print log messages
-p, --prefix prefix output [default:"out"]
-o, --output-dir output directory [default:"."]
-O, --output-format bam, sam, fastq.gz, fastq [default:"fastq.gz"]
-1, --join-paired-ends merge R1 and R2 outputs in one file
-x, --compression-level speed compression: "1" - compress faster,
"9" - compress better [default:"6"; Integer]
Runtime options:
-j, --jobs number of jobs [default:"1"; Integer]
-s, --seed set the seed of the base generator
[default:"time()"; Integer]
Sequence identifier options:
-i, --append-id append to the defined template id [Format]
-I, --id overlap the default template id [Format]
Sequencing option:
-q, --quality-profile sequencing system profiles from quality
database [default:"poisson"]
-e, --sequencing-error sequencing error rate for poisson
[default:"0.001"; Number]
-m, --read-mean read mean size for poisson
[default:"100"; Integer]
-d, --read-stdd read standard deviation size for poisson
[default:"0"; Integer]
-t, --sequencing-type single-end or paired-end reads
[default:"paired-end"]
-M, --fragment-mean the fragment mean size for paired-end reads
[default:"300"; Integer]
-D, --fragment-stdd the fragment standard deviation size for
paired-end reads [default:"50"; Integer]
Transcriptome-specific options:
-n, --number-of-reads set the number of reads
[default:"1000000", Integer]
-f, --expression-matrix an expression-matrix entry from database
DESCRIPTION
This subcommand simulates transcriptome sequencing reads taking into account the quality-profile and the expression-matrix weights, along with: raffle seed; number of reads; fragment mean and standard deviation; single-end (long and short fragments) and paired-end sequencing type; bam, sam, fastq.gz and fastq output formats and more.
INPUT
sandy transcriptome expects as argument a fasta file with transcript sequences. For example, the GENCODE human genome transcript sequences and protein-coding transcript sequences.
OUTPUT
The output file generated will depend on the output-format (fastq, bam), on the join-paired-ends option (mate read pairs into a single file) and on the sequencing-type (single-end, paired-end). One file with the simulated abundance (${prefix}_abundance_transcripts.tsv) per transcript and one file with the simulated abundance (${prefix}_abundance_genes.tsv) per gene (if the fasta file used has the relationship between gene and its transcripts at the header) will accompany the output file.
OPTIONS
- --help
-
Print a brief help message and exits.
- --man
-
Prints the manual page and exits.
- --verbose
-
Prints log information to standard error
- --prefix
-
Concatenates the prefix to the output-file name.
- --output-dir
-
Creates output-file inside output-dir. If output-dir does not exist, it is created recursively
- --output-format
-
Choose the output format. Available options are: bam, sam, fastq.gz, fastq. For bam option, --append-id is ignored, considering that the sequence identifier is splitted by blank character, so just the first field is included into the query name column (first column).
- --join-paired-ends
-
By default, paired-end reads are put into two different files, prefix_R[12]_001.fastq(\.gz)?. If the user wants both outputs together, she can pass this option. If the --id does not have the escape character %R, it is automatically included right after the first field (blank separated values) as in id/%R - which resolves to id/1 or id/2. It is necessary to distinguish which read is R1/R2
- --compression-level
-
Regulates the speed of compression using the specified digit (between 1 and 9), where "1" indicates the fastest compression method (less compression) and "9" indicates the slowest compression method (best compression). The default compression level is "6"
- --append-id
-
Append string template to the defined template id. See Format
- --id
-
Overlap the default defined template id: single-end %i.%U %U and paired-end %i.%U %U e.g. SR123.1 1 See Format
- Format
-
A string Format is a combination of literal and escape characters similar to the way printf works. That way, the user has the freedom to customize the fastq sequence identifier to fit her needs. Valid escape characteres are:
Common escape characters
---------------------------------------------------------------------------- Escape Meaning ---------------------------------------------------------------------------- %i instrument id composed by SR + PID %I job slot number %q quality profile %e sequencing error %x sequencing error position %R read 1, or 2 if it is the paired-end mate %U read number %r read size %m read mean %d read standard deviation %c sequence id as chromossome, gene/transcript id %C sequence id type (reference or alternate non reference allele) *** %s read strand %t read start position %n read end position %a read start position regarding reference genome *** %b read end position regarding reference genome *** %v genomic variation position *** ---------------------------------------------------------------------------- *** specific for genomic variation (genome simulation only)
Paired-end specific escape characters
---------------------------------------------------------------------------- Escape Meaning ---------------------------------------------------------------------------- %T mate read start position %N mate read end position %A mate read start position regarding reference genome *** %B mate read end position regarding reference genome *** %D distance between the paired-reads %M fragment mean %D fragment standard deviation %f fragment size %F fragment strand %S fragment start position %E fragment end position %X fragment start position regarding reference genome *** %Z fragment end position regarding reference genome *** ---------------------------------------------------------------------------- *** specific for genomic variation (genome simulation only)
- --jobs
-
Sets the number of child jobs to be created
- --seed
-
Sets the seed of the base generator. The ability to set the seed is useful for those who want reproducible simulations. Pay attention to the number of jobs (--jobs) set, because each job receives a different seed calculated from the main seed. So, for reproducibility, the same seed set before needs the same number of jobs set before as well.
- --read-mean
-
Sets the read mean if quality-profile is equal to 'poisson'. The quality-profile from database overrides the read-size
- --read-stdd
-
Sets the read standard deviation if quality-profile is equal to 'poisson'. The quality-profile from database overrides the read-stdd
- --number-of-reads
-
Sets the number of reads desired for each fragment end. That means, it will be the number of reads for each pair - 1 x N reads for single-end and 2 x N reads for paired-end. This is the default option for transcriptome sequencing simulation
- --sequencing-type
-
Sets the sequencing type to single-end or paired-end
- --fragment-mean
-
If the sequencing-type is set to paired-end, it sets the fragment mean
- --fragment-stdd
-
If the sequencing-type is set to paired-end, it sets the fragment standard deviation
- --sequencing-error
-
Sets the sequencing error rate if quality-profile is equal to 'poisson'. Valid values are between zero and one
- --quality-profile
-
Sets the sequencing system profile for quality. The default value is a poisson distribution, but the user can choose among several profiles stored into the database or import his own data. See quality command for more details
- --expression-matrix
-
By default, the gene/transcript is raffled using its length as weight. If you choose an expression-matrix, then the raffle will be made based on the gene/transcript expression. The expression-matrix entries are found into the database. See expression command for more details
EXAMPLES
The command:
$ sandy transcriptome \
--verbose \
--jobs=5 \
--number-of-reads=5000000 \
--output-dir=my_results/ \
gencode.v43.transcripts.fa.gz 2> sim.log
or, equivalently:
$ sandy transcriptome \
-v -j 5 -n 5000000 -o my_results/ \
gencode.v43.transcripts.fa.gz 2> sim.log
will both generate two paired-end fastq files (R1 and R2) into my_results/ directory with 5000000 reads and two abundance files, one per transcripts and the other per genes.
By default the raffled bias is the transcript length, but the user can change this behavior by choosing an expression-matrix from the database:
$ sandy transcriptome -f brain_cortex gencode.v43.transcripts.fa.gz
To see the current list of available expression matrices:
$ sandy expression
And in order to learn how to add your custom expression-matrix, see:
$ sandy expression add --help
For reproducibility, the user can set the seed option and guarantee the reliability of all the raffles in a later simulation:
$ sandy expression --seed=1717 my_transcripts.fa
To simulate reads with a specific quality-profile other than the default poisson:
$ sandy expression --quality-profile=hiseq_150 my_transcripts.fa
To see the current list of available quality-profiles:
$ sandy quality
And in order to learn how to add your custom quality-profile, see:
$ sandy quality add --help
Sequence identifiers (first lines of fastq entries) may be customized in output using a format string passed by the user. This format is a combination of literal and escaped characters, in a similar fashion to that used in C programming language’s printf function. For example, let’s simulate a paired-end sequencing and add the read length, read position and mate position into all sequence identifiers:
$ sandy expression --id="%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r" my_genes.fa.gz
In this case, results would be:
==> Into R1
@SR.1 read=BRAF:979-880 mate=BRAF:736-835 length=100
...
==> Into R2
@SR.1 read=BRAF:736-835 mate=BRAF:979-880 length=100
...
See Format section for details.
Putting all together:
$ sandy transcriptome \
--verbose \
--jobs=5 \
--number-of-reads=5000000 \
--output-dir=my_results/ \
--expression-matrix=brain_cortex \
--seed=1717 \
--quality-profile=hiseq_150 \
--id="%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r" \
gencode.v43.transcripts.fa.gz 2> sim.log
AUTHORS
Thiago L. A. Miller <tmiller@mochsl.org.br>
J. Leonel Buzzo <lbuzzo@mochsl.org.br>
Felipe R. C. dos Santos <fsantos@mochsl.org.br>
Helena B. Conceição <hconceicao@mochsl.org.br>
Rodrigo Barreiro <rbarreiro@mochsl.org.br>
Gabriela Guardia <gguardia@mochsl.org.br>
Fernanda Orpinelli <forpinelli@mochsl.org.br>
Rafael Mercuri <rmercuri@mochsl.org.br>
Rodrigo Barreiro <rbarreiro@mochsl.org.br>
Pedro A. F. Galante <pgalante@mochsl.org.br>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2023 by Teaching and Research Institute from Sírio-Libanês Hospital.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007