NAME
rem-boilerplate-text
VERSION
Version 0.2
SYNOPSIS
> rem-boilerplate-text [options] <list of files>
E.g.
> rem-boilerplate-text --min_dupl=6 intranet/txt/*.txt
DESCRIPTION
Removes repeated text from a set of files.
Note that the system only works when more than one file is specified, since boilerplate text is detected based on repetition across files.
New files are written, with a suffix appended to the original filenames.
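The detection idea described above can be illustrated with a minimal Python sketch. This is not the program's own code; the function name `find_repeated_lines` and the decision to count a line once per file (rather than once per occurrence) are assumptions made for illustration only.

```python
from collections import Counter


def find_repeated_lines(documents, min_dupl=3):
    """Return the lines that occur in at least `min_dupl` documents.

    `documents` is a list of documents, each given as a list of lines.
    Assumption: a line is counted once per document, however often it
    repeats inside that document.
    """
    counts = Counter()
    for lines in documents:
        # A set ensures a line repeated within one document counts once.
        counts.update({line.strip() for line in lines if line.strip()})
    return {line for line, n in counts.items() if n >= min_dupl}


docs = [
    ["Home | Contact", "Report on budgets", "Copyright ACME"],
    ["Home | Contact", "Minutes of the meeting", "Copyright ACME"],
    ["Home | Contact", "Holiday schedule", "Copyright ACME"],
]
print(sorted(find_repeated_lines(docs)))
# prints ['Copyright ACME', 'Home | Contact']
```

With a single input document no line can reach the threshold, which is why more than one file has to be given.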
OPTIONS
- -m, --min_dupl
-
The minimum number of times a line has to occur to be considered boilerplate (default: 3). Can be either an integer or a percentage ('50 %') of the number of files processed. Minimum value: 2.
- -i, --ignore_digits
-
Lines that differ only in their digits will be considered duplicates (default: yes).
- -s, --suffix
-
Suffix appended to the names of the new files (default: 'content').
-
Only sets of consecutive duplicate lines at the start and end of documents are considered boilerplate (default: yes).
- -d, --digest
-
Lines will be replaced by an MD5 digest during duplicate compilation, saving memory (default: no).
- -l, --log
-
Name of the log file, where deleted lines are recorded; if set to false, no log will be created (default: './text-identify-boilerplate.log').
- -h, --help
-
Display usage information.
- -v, --verbose
-
Be verbose.
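How several of the options above might interact can be sketched in Python. This is an illustration under stated assumptions, not the program's implementation: the helper names `resolve_min_dupl`, `line_key` and `strip_edges` are hypothetical, percentages are assumed to be rounded up, and every run of digits is assumed to collapse to a single placeholder.

```python
import hashlib
import math
import re


def resolve_min_dupl(value, n_files):
    """Turn a min_dupl value (integer or percentage such as '50 %') into a count."""
    text = str(value).strip()
    if text.endswith("%"):
        # Assumption: percentages are rounded up to a whole number of files.
        count = math.ceil(float(text[:-1]) * n_files / 100)
    else:
        count = int(text)
    return max(count, 2)  # documented minimum value is 2


def line_key(line, ignore_digits=True, digest=False):
    """Build the key under which a line is compared with other lines."""
    key = line.strip()
    if ignore_digits:
        # Assumption: each run of digits collapses to one placeholder,
        # so lines differing only in digits share a key.
        key = re.sub(r"\d+", "0", key)
    if digest:
        # Keep a fixed-size MD5 digest instead of the full line text.
        key = hashlib.md5(key.encode("utf-8")).hexdigest()
    return key


def strip_edges(lines, boilerplate_keys, **key_options):
    """Drop consecutive boilerplate lines at the start and end of a document."""
    start, end = 0, len(lines)
    while start < end and line_key(lines[start], **key_options) in boilerplate_keys:
        start += 1
    while end > start and line_key(lines[end - 1], **key_options) in boilerplate_keys:
        end -= 1
    return lines[start:end]


print(resolve_min_dupl("50 %", 9))                           # prints 5
print(line_key("Page 3 of 12") == line_key("Page 4 of 15"))  # prints True
```

Note that `strip_edges` leaves a boilerplate line in the middle of a document untouched, which matches the option restricting removal to the start and end of documents.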
AUTHOR
Lars Nygaard, <lars.nygaard@inl.uio.no>
COPYRIGHT & LICENSE
Copyright 2005 Lars Nygaard, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.