NAME
polyester_polyA.pl - A Perl application for enhancing the polyester RNA sequencing simulation tool.
VERSION
version 0.02
SYNOPSIS
polyester_polyA.pl [options]
DESCRIPTION
This application enhances the polyester RNA sequencing simulation tool by adding polyA tails to the reference RNA used to generate the simulated sequencing data. It is a wrapper around the R package polyester, which on its own accounts only for fragmentation, reverse complementation, and sequencing when generating data. Note that the Perl application does not currently support passing logspline R objects as parameters to the R script and the polyester "simulate_experiment" function.

The command line options are the same as those in the polyester R package, with the following exceptions:

* The --taildist option (mandatory) specifies the tail distribution to be used.
* The --distparams option (mandatory) specifies the parameters of the distribution.
* The --maxseqs option (optional) splits the single FASTA file generated by the application into multiple files, each holding at most the specified number of sequences.
* The --modformat option (optional) specifies the format for storing modifications (one of JSON, YAML, or MessagePack).
* BONUS: an R script (polyester.R) is provided that can be used to control the polyester simulation process from the command line.

All other parameters have the same interpretation and semantics as in the polyester R package.
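The tail-addition step described above can be pictured with a short Python sketch. This is only an illustration of the idea, not the code used by polyester_polyA.pl: the function name, the `shape`, `scale`, and `max_len` parameters, and the rounding and clamping rules are all assumptions made for this example, and they are not meant to mirror the meaning of the values passed to --distparams.

```python
import random


def add_polya_tails(seqs, shape, scale, max_len, seed=None):
    """Append a polyA tail of random length to every sequence.

    seqs    : dict mapping sequence name -> nucleotide string
    shape   : gamma shape parameter (hypothetical, for illustration)
    scale   : gamma scale parameter (hypothetical, for illustration)
    max_len : upper bound on the tail length (an assumption of this sketch)
    seed    : optional seed so that the draws are reproducible
    """
    rng = random.Random(seed)
    tailed = {}
    for name, seq in seqs.items():
        # Draw a tail length from a gamma distribution, round it to an
        # integer, and clamp it to max_len.
        length = min(max_len, int(round(rng.gammavariate(shape, scale))))
        tailed[name] = seq + "A" * length
    return tailed


if __name__ == "__main__":
    reference = {"tx1": "ACGTTGCA", "tx2": "GGGCCC"}
    for name, seq in add_polya_tails(reference, 2.0, 50.0, 250, seed=1).items():
        print(name, len(seq))
```

Passing a seed makes the simulated tails reproducible, in the same spirit as the --seed option of the application.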
OPTIONS
- --bias, -b [STRING]
Fragment selection bias (optional).
- --distparams, -P [FLOAT1 FLOAT2 ...]
Distribution parameters (mandatory, list of numeric values).
- --errormodel, -e [STRING]
Error model (optional).
- --errorrate, -E [FLOAT]
Error probability (optional).
- --fastafile, -f [PATH]
Fasta file path (mandatory).
- --fcfile, -c [PATH]
Fold change file path (optional).
- --fraglen, -F [INTEGER]
Fragment length (average) (optional).
- --fragsd, -S [INTEGER]
Fragment length (standard deviation) (optional).
- --gcbias, -g [INTEGER]
GC bias (optional).
- --modformat, -m [STRING]
Case-insensitive format for storing modifications (one of JSON, YAML, or MessagePack) (optional).
- --maxseqs, -m [INTEGER]
Maximum sequences per file (optional).
- --numreps, -n [INTEGER1 INTEGER2 ...]
Number of replicates in each group (optional, list).
- --outdir, -o [PATH]
Path to output directory (optional).
- --paired, -p [TRUE|FALSE]
Paired reads (optional).
- --readlen, -R [INTEGER]
Read length (optional).
- --readsfile, -r [PATH]
Reads per transcript file path (optional).
- --seed, -d [INTEGER]
Random seed (optional).
- --strandspec, -s [TRUE|FALSE]
Strand specificity (optional).
- --taildist, -t [STRING]
Tail distribution (mandatory).
- --writeinfo, -w [INTEGER]
Save simulation info (optional).
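To make the effect of --maxseqs concrete, the following Python sketch splits a list of FASTA records into chunks of at most a given size. It is a minimal illustration under stated assumptions and not the application's own implementation: the function names `chunk_records` and `format_fasta`, and the representation of records as (header, sequence) pairs, are hypothetical.

```python
from itertools import islice


def chunk_records(records, max_seqs):
    """Yield lists holding at most max_seqs (header, sequence) records."""
    if max_seqs < 1:
        raise ValueError("max_seqs must be a positive integer")
    it = iter(records)
    while True:
        # Take the next max_seqs records; stop once the input is exhausted.
        chunk = list(islice(it, max_seqs))
        if not chunk:
            return
        yield chunk


def format_fasta(chunk):
    """Render a chunk of records as FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in chunk)


if __name__ == "__main__":
    records = [(f"tx{i}", "ACGT" * 3) for i in range(1, 6)]
    for n, chunk in enumerate(chunk_records(records, 2), start=1):
        print(f"chunk {n}: {len(chunk)} sequences")
```

Each chunk could then be written to its own file; only the last chunk may hold fewer than the maximum number of sequences.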
EXAMPLES
polyester_polyA.pl --fastafile myseq.fasta --taildist gamma \
--distparams 125.0 1.0 0.0 250.0 --fraglen 100 --fragsd 10 \
--numreps 1 --strandspec TRUE --readlen 75 --paired FALSE \
--maxseqs 1000 --modformat YAML --outdir /path/to/output
TODO
Add the possibility of passing logspline R objects as parameters to the R script and the polyester "simulate_experiment" function.
Add the possibility of adding UMI tags to sequences.
Add the possibility of adding sequencing adapters to sequences.
SEE ALSO
Polyester is an R package designed to simulate RNA sequencing experiments with differential transcript expression. Given a set of annotated transcripts, Polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads. Simulated reads can be analyzed using your choice of downstream analysis tools. Polyester has a built-in wrapper function to simulate a case/control experiment with differential transcript expression and biological replicates. Users are able to set the levels of differential expression at transcripts of their choosing. This means they know which transcripts are differentially expressed in the simulated dataset, so the accuracy of statistical methods for differential expression detection can be analyzed.
Polyester offers several unique features:
* Built-in functionality to simulate differential expression at the transcript level
* Ability to explicitly set differential expression signal strength
* Simulation of small datasets, since large RNA-seq datasets can require lots of time and computing resources to analyze
* Generation of raw RNA-seq reads, as opposed to alignments or transcript-level abundance estimates
* Transparency/open-source code
AUTHOR
Christos Argyropoulos <chrisarg@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2024 by Christos Argyropoulos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.