NAME
n50 - A script to calculate N50 from one or multiple FASTA/FASTQ files.
VERSION
version 1.5.8
SYNOPSIS
n50.pl [options] [FILE1 FILE2 FILE3...]
DESCRIPTION
This program parses a list of FASTA/FASTQ files and calculates, for each one, the number of sequences, the sum of the sequence lengths, and the N50, N75, N90 and auN. The results can be printed in different formats.
If a single file is provided, the default output is the N50 value only. If multiple files are provided, the default output is a TSV table with one line per file. More output formats are available (see below).
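As an illustration of the metrics listed above, here is a minimal Python sketch of how N50-style values are commonly computed from a list of sequence lengths. It is not the code of n50.pl: the function names `nx` and `aun` and the sample lengths are hypothetical, and `aun` uses the usual textbook definition (sum of squared lengths divided by total length), which may not match the tool's output digit for digit.

```python
def nx(lengths, x=50):
    """Return the N{x} of a collection of sequence lengths (0 < x < 100)."""
    if not 0 < x < 100:
        raise ValueError("x must be between 0 and 100 (exclusive)")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return 0  # empty input


def aun(lengths):
    """Textbook auN: sum of squared lengths over total length (assumed definition)."""
    total = sum(lengths)
    return sum(n * n for n in lengths) / total if total else 0.0


lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical contig lengths
print(nx(lengths, 50), nx(lengths, 75), nx(lengths, 90))  # 8 5 4
```

The same `nx` function covers a custom N{e} metric by passing a different `x`.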
PARAMETERS
- -o, --sortby
-
Sort by field: 'N50' (default), 'min', 'max', 'seqs', 'size', 'path'. Numeric fields are sorted in descending order by default, 'path' in ascending order. See -r, --reverse.
- -r, --reverse
-
Reverse the sort order (see -o).
- -f, --format
-
Output format: default, tsv, json, custom, screen. See below for format specific switches. Specify "list" to list available formats.
- -e
-
Also calculate a custom N{e} metric. Expecting an integer 0 < e < 100.
- -s, --separator
-
Separator to be used in 'tsv' output. Default: tab. The 'tsv' format will print a header line, followed by a line for each file given as input with: file path, as received, total number of sequences, total size in bp, and finally N50.
- -b, --basename
-
Instead of printing the path of each file, print only the filename, stripping any relative or absolute path. See -a. Warning: if you read multiple files with the same basename, only one will be printed. This is the intended behaviour, and you will only receive a warning.
- -a, --abspath
-
Instead of printing the path of each file as supplied by the user (which can be relative), print the absolute path. Overrides -b (basename). See -b.
- -u, --noheader
-
When used with 'tsv' output format, will suppress header line.
- -n, --nonewline
-
If used with the 'default' (or 'csv') output format, will NOT print the newline character after the N50 for a single file. Useful in bash scripting:
n50=$(n50.pl filename);
- -t, --template
-
String to be used with 'custom' format. Will be used as template string for each sample, replacing {new} with newlines, {tab} with tab and {N50}, {seqs}, {size}, {path} with sample's N50, number of sequences, total size in bp and file path respectively (the latter will respect --basename if used).
- -q, --thousands-sep
-
Add the thousands separator in all the printed numbers. Enabled by default with --format screen (-x).
- -p, --pretty
-
If used with 'json' output format, will format the JSON in pretty print mode. Example:
{
   "file1.fa" : {
      "size" : 290,
      "N50" : 290,
      "seqs" : 2
   },
   "file2.fa" : {
      "N50" : 456,
      "size" : 456,
      "seqs" : 2
   }
}
- -h, --help
-
Will display this full help message and quit, even if other arguments are supplied.
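The ordering rules described for --sortby and --reverse can be pictured with a short Python sketch. This is only an illustration of the documented behaviour, not the tool's implementation: `sort_rows` is a hypothetical helper, and the rows are taken from the 'tsv' example in this document.

```python
def sort_rows(rows, sortby="N50", reverse=False):
    """Numeric fields sort descending, 'path' ascending; reverse flips the order."""
    descending = sortby != "path"
    if reverse:
        descending = not descending
    return sorted(rows, key=lambda row: row[sortby], reverse=descending)


rows = [
    {"path": "reads.fa", "seqs": 5, "size": 247, "N50": 100, "min": 6, "max": 102},
    {"path": "test2.fa", "seqs": 8, "size": 825, "N50": 189, "min": 4, "max": 256},
    {"path": "small.fa", "seqs": 6, "size": 130, "N50": 65, "min": 4, "max": 65},
]

# Default: N50, largest first.
print([r["path"] for r in sort_rows(rows)])
# Equivalent of "-o max -r": longest sequence, smallest first.
print([r["path"] for r in sort_rows(rows, "max", reverse=True)])
```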
Output formats
These are the values for --format.
- tsv (tab separated values)
-
#path     seqs  size  N50  min  max
test2.fa  8     825   189  4    256
reads.fa  5     247   100  6    102
small.fa  6     130   65   4    65
- csv (comma separated values)
-
Same as --format tsv and --separator ,:
#path,seqs,size,N50,min,max
test.fa,8,825,189,4,256
reads.fa,5,247,100,6,102
small_test.fa,6,130,65,4,65
- screen (screen friendly)
-
Use -x as a shortcut for --format screen. Enables --thousands-sep (-q) by default.
.-----------------------------------------------------------------------------------------.
| File          | Seqs | Total bp | N50    | min   | max    | N75   | N90   | auN         |
+---------------+------+----------+--------+-------+--------+-------+-------+-------------+
| big.fa        | 4    | 18,359   | 11,840 | 2,167 | 11,840 | 2,176 | 2,167 | 8923.21,984 |
| sim1.fa       | 39   | 18,864   | 679    | 20    | 971    | 408   | 313   | 733.51,389  |
| sim2.fa       | 21   | 7,530    | 493    | 68    | 989    | 330   | 174   | 575.47,012  |
| test.fa       | 8    | 825      | 189    | 4     | 256    | 168   | 168   | 260.99,515  |
'---------------+------+----------+--------+-------+--------+-------+-------+-------------'
- json (JSON)
-
Use -j as a shortcut for --format json.
{
   "data/sim1.fa" : {
      "seqs" : 39,
      "N50" : 679,
      "max" : 971,
      "N90" : 313,
      "min" : 20,
      "size" : 18864,
      "auN" : 733.51389,
      "N75" : 408
   },
   "data/sim2.fa" : {
      "max" : 989,
      "seqs" : 21,
      "N50" : 493,
      "N90" : 174,
      "min" : 68,
      "auN" : 575.47012,
      "N75" : 330,
      "size" : 7530
   }
}
- custom
-
Will print the output using the template string provided with -t TEMPLATE. Fields are in the {field_name} format. {new}/{n}/\n is the newline, {tab}/{t}/\t is a tab. All the keys of the JSON object are valid fields: {seqs}, {N50}, {min}, {max}, {size}.
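To make the placeholder rules of the 'custom' format concrete, the following Python sketch expands a template the way the text describes. It is an assumption-laden illustration rather than the tool's own code: `render` is a hypothetical function, and leaving unknown placeholders untouched is a choice made here, not documented behaviour.

```python
import re


def render(template, stats):
    """Expand {field} placeholders, {new}/{n}/\\n and {tab}/{t}/\\t in a template."""
    # Literal backslash escapes typed on the command line become real characters.
    text = template.replace("\\n", "\n").replace("\\t", "\t")
    specials = {"new": "\n", "n": "\n", "tab": "\t", "t": "\t"}

    def substitute(match):
        name = match.group(1)
        if name in specials:
            return specials[name]
        if name in stats:
            return str(stats[name])
        return match.group(0)  # unknown placeholder: left as-is (assumption)

    return re.sub(r"\{(\w+)\}", substitute, text)


# Values taken from the JSON example for data/sim1.fa.
stats = {"path": "data/sim1.fa", "seqs": 39, "size": 18864, "N50": 679, "min": 20, "max": 971}
print(render("{path}{tab}N50={N50};Sum={size}{new}", stats), end="")
```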
EXAMPLE USAGES
Screen friendly table (-x is a shortcut for --format screen), sorted by N50 descending (default):
n50.pl -x files/*.fa
Screen friendly table, sorted by maximum sequence length (--sortby max), ascending (--reverse):
n50.pl -x -o max -r files/*.fa
Tabular (tsv) output is default:
n50.pl -o max -r files/*.fa
A custom output format:
n50.pl data/*.fa -f custom -t '{path}{tab}N50={N50};Sum={size}{new}'
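As a follow-up to the examples above, JSON output (-j) can be consumed by any JSON parser. The Python sketch below loads the sample report shown in the 'json' format section and picks the file with the highest N50. The text is embedded as a string here; in practice it would come from the program's standard output, and the variable names are hypothetical.

```python
import json

# Sample report copied from the 'json' output format example.
report_text = """
{
   "data/sim1.fa" : { "seqs" : 39, "N50" : 679, "max" : 971, "N90" : 313,
                      "min" : 20, "size" : 18864, "auN" : 733.51389, "N75" : 408 },
   "data/sim2.fa" : { "max" : 989, "seqs" : 21, "N50" : 493, "N90" : 174,
                      "min" : 68, "auN" : 575.47012, "N75" : 330, "size" : 7530 }
}
"""

report = json.loads(report_text)

# The top-level keys are file paths; each value holds that file's metrics.
best = max(report, key=lambda path: report[path]["N50"])
total_bp = sum(stats["size"] for stats in report.values())

print(best, report[best]["N50"], total_bp)
```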
CITING
Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering 2021, 8, 59. https://doi.org/10.3390/bioengineering8050059
CONTRIBUTING, BUGS
The repository of this project is available at https://github.com/telatin/proch-n50/.
AUTHOR
Andrea Telatin <andrea@telatin.com>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2018-2023 by Andrea Telatin.
This is free software, licensed under:
The MIT (X11) License