NAME
File::FormatIdentification::RandomSampling - methods to identify the content of devices or media files using random sampling
VERSION
version 0.006
SYNOPSIS
This module gives a quick estimate of the content of media (or files). It randomly samples sectors and applies heuristics to them to estimate the content types present.
To check the base type of a given binary string:
    use File::FormatIdentification::RandomSampling;
    my $ff = File::FormatIdentification::RandomSampling->new(); # basic instantiation
    my $type = $ff->calc_type($buffer); # $buffer holds the binary string to classify
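The random sampling of sectors mentioned above can be sketched roughly as follows. This is a minimal illustration, not the module's own code: the helper name sample_sectors, the default sector size of 512 bytes and the use of -s to find the size are all assumptions (for a block device the size would likely have to be determined another way).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): read $count randomly chosen
# sectors of $sector_size bytes from $path and return them as a list of buffers.
sub sample_sectors {
    my ($path, $count, $sector_size) = @_;
    $sector_size //= 512;    # assumed default sector size
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    my $size    = -s $fh;    # works for regular files and image files
    my $sectors = int($size / $sector_size);
    my @buffers;
    if ($sectors > 0) {
        for (1 .. $count) {
            # pick a random sector index and jump to its byte offset
            my $offset = int(rand($sectors)) * $sector_size;
            seek($fh, $offset, 0) or die "seek failed: $!";
            my $read = read($fh, my $buffer, $sector_size);
            die "read failed: $!" unless defined $read;
            push @buffers, $buffer if $read == $sector_size;
        }
    }
    close($fh);
    return @buffers;
}
```

Each returned buffer could then be passed to calc_type, as in the synopsis, and the resulting type strings tallied to get an overall picture of the medium.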
TOOLS
The following tools are supplied with this module and are presented below:
crazy_fast_image_scan.pl
This script scans devices or images very quickly using random sampling and reports what kind of content was found.
For detailed documentation, see the POD included in the script.
cfi_create_training_data.pl
This script scans a set of files, calculates the most frequent unigrams and bigrams, and stores them in a CSV file.
cfi_learn_model.pl
This script reads the CSV file and, using AI::DecisionTree, prints a new model module in the style of File::FormatIdentification::RandomSampling::Model.
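To illustrate the kind of statistics cfi_create_training_data.pl collects, here is a small sketch that counts unigrams and bigrams in a buffer and prints the most frequent ones as a CSV row. The helper name most_frequent_bytegrams, the hex encoding and the column layout are assumptions made for this example; the CSV format the script really writes may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: count 1-byte and 2-byte sequences in a buffer and
# return the $top most frequent of each as array references of hex strings.
sub most_frequent_bytegrams {
    my ($buffer, $top) = @_;
    $top //= 8;
    my (%onegram, %bigram);
    my $len = length $buffer;
    for my $i (0 .. $len - 1) {
        $onegram{ substr($buffer, $i, 1) }++;
        $bigram{ substr($buffer, $i, 2) }++ if $i < $len - 1;
    }
    my $pick = sub {
        my ($counts) = @_;
        # sort by descending count; ties are broken by byte value
        my @keys = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                   keys %$counts;
        splice(@keys, $top) if @keys > $top;
        return [ map { unpack('H*', $_) } @keys ];
    };
    return ($pick->(\%onegram), $pick->(\%bigram));
}

# Print one CSV row: a label followed by the top unigrams and bigrams.
my ($one, $two) = most_frequent_bytegrams("abababac", 2);
print join(',', 'example', @$one, @$two), "\n";
```

Rows like this, one per training file and labelled with the known file type, are the sort of input a decision-tree learner can work with.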
SOURCE
The current development version is available at https://art1pirat.spdns.org/art1/crazy-fast-image-scan
METHODS
init_bytegrams
resets the internal bytegram state. Also called when the object is instantiated
update_bytegram
calc_histogram
uses the first 8 most significant bytegram entries to form a histogram, returned as a hash reference
is_uniform
returns true if the 1-byte bytegrams are uniformly distributed
is_empty
returns true if the 1-byte bytegrams indicate an empty buffer
is_text
returns true if the 1-byte bytegrams are typical of text
is_video
returns true if the 1-byte bytegrams are typical of MPEG/QuickTime videos
calc_type
returns a string indicating the type of the given buffer
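The idea behind the is_* checks and calc_type can be illustrated with a small stand-alone sketch based on 1-byte statistics. It is not the module's implementation: the function names byte_histogram and guess_type, the thresholds, the entropy test and the returned strings are all assumptions chosen for this example, and the strings calc_type really returns may differ.

```perl
use strict;
use warnings;

# Count how often each byte value (0..255) occurs in the buffer.
sub byte_histogram {
    my ($buffer) = @_;
    my @hist = (0) x 256;
    $hist[$_]++ for unpack('C*', $buffer);
    return \@hist;
}

# Hypothetical classifier with invented thresholds; it only mirrors the
# idea of is_empty, is_text and is_uniform, not their actual logic.
sub guess_type {
    my ($buffer) = @_;
    my $len = length $buffer;
    return 'empty' if $len == 0;
    my $hist = byte_histogram($buffer);

    # "empty": a single byte value fills the whole buffer (e.g. all 0x00)
    return 'empty' if grep { $_ == $len } @$hist;

    # "text": at least 95% printable ASCII, tab, LF or CR (assumed threshold)
    my $printable = 0;
    $printable += $hist->[$_] for (9, 10, 13, 32 .. 126);
    return 'text' if $printable / $len >= 0.95;

    # "uniform": Shannon entropy above 7 bits per byte (assumed threshold),
    # as is typical of compressed or encrypted data
    my $entropy = 0;
    for my $count (grep { $_ > 0 } @$hist) {
        my $p = $count / $len;
        $entropy -= $p * log($p) / log(2);
    }
    return 'uniform' if $entropy > 7;

    return 'unknown';
}
```

The checks run from the most specific to the least specific, so a buffer of zeros is reported as empty before any statistical test is attempted.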
AUTHOR
Andreas Romeyke <pause@andreas-romeyke.de>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2020 by Andreas Romeyke.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007