USAGE

Classification:

blacklist_classifier [OPTIONS] lang1 lang2 ... < file

Training:

blacklist_classifier -n [OPTIONS] text1 text2 > blacklist.txt
blacklist_classifier [OPTIONS] -t "t1.txt t2.txt ..." lang1 lang2 ...

Run experiments:

blacklist_classifier -t "t1.txt t2.txt ..." \
                        -e "e1.txt e2.txt ..." \
                        lang1 lang2 ...

Command line arguments:

lang1 lang2 ... are language IDs
blacklists are expected in <BlackListDir>/<lang1-lang2>.txt
t1.txt t2.txt ... are training data files (in UTF-8)
e1.txt e2.txt ... are evaluation data files (in UTF-8)
the training and evaluation files must be listed in the same order as
  the languages given on the command line (lang1 lang2 ...)
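
Because the file lists and the language list are matched by position, it can help to check them before starting an experiment. The following is a minimal sketch, not part of the tool itself. The language IDs and file names are placeholders, the helper count_words is hypothetical, and the check assumes one training file and one evaluation file per language:

```shell
# Placeholder language IDs and file names; replace them with your own.
langs="bs hr sr"
train="train.bs.txt train.hr.txt train.sr.txt"
eval_files="eval.bs.txt eval.hr.txt eval.sr.txt"

# Hypothetical helper: count the whitespace-separated words in a string.
count_words() { set -- $1; echo $#; }

# Assumption: one training file and one evaluation file per language,
# listed in the same order as the language IDs.
if [ "$(count_words "$train")" -ne "$(count_words "$langs")" ] ||
   [ "$(count_words "$eval_files")" -ne "$(count_words "$langs")" ]; then
    echo "file lists do not match the language list" >&2
elif command -v blacklist_classifier >/dev/null 2>&1; then
    # Train on the -t files, then evaluate on the -e files.
    blacklist_classifier -t "$train" -e "$eval_files" $langs
else
    echo "blacklist_classifier not found in PATH" >&2
fi
```

The check only compares counts. It cannot tell whether the files are really listed in the same order as the languages, so that still has to be verified by hand.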


-a <freq> ...... min freq for common words
-b <freq> ...... max freq for uncommon words
-c <score> ..... min difference score to be relevant
-d <dir> ....... directory of black lists
-i ............. classify each line separately
-m <number> .... use approximately <number> tokens to train/classify
-n ............. train a new black list
-v ............. verbose mode

-U ............. don't lowercase
-S ............. don't tokenize (use the string as it is)
-A ............. don't discard tokens with non-alphabetic characters
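
As an illustration of combining these options, the sketch below wraps a per-line classification call in a small shell function. The function name classify_lines and its argument order are hypothetical. It uses only the -d and -i options listed above and the classification form that reads the text from standard input:

```shell
# Hypothetical wrapper: classify every line of a file separately.
#   $1  directory that holds the blacklists (passed to -d)
#   $2  input file, expected to be UTF-8 text
#   remaining arguments: language IDs
classify_lines() {
    dir=$1
    input=$2
    shift 2
    blacklist_classifier -d "$dir" -i "$@" < "$input"
}
```

A call such as `classify_lines ./blacklists input.txt bs hr sr` would then classify each line of input.txt against the three placeholder language IDs.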

AUTHOR

Jörg Tiedemann, https://bitbucket.org/tiedemann

BUGS

Please report any bugs or feature requests to https://bitbucket.org/tiedemann/blacklist-classifier. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Lingua::Identify::Blacklists

LICENSE AND COPYRIGHT

Copyright 2012 Jörg Tiedemann.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
