NAME
Image::PHash - Fast perceptual image hashing (DCT-based pHash)
SYNOPSIS
use Image::PHash;
# Load an image and resize to the default 32x32 that will be the base for calculating the DCT matrix
my $iph = Image::PHash->new($image_file, 'Imlib2');
# Calculate the perceptual hash (top 8x8 of the DCT) - 64 bits, 16 hex chars
my $p = $iph->pHash(); # implies settings geometry => '8x8', method => 'average'
# Alternative, better performing, 64 bit hash (64 upper-left most DCT values)
my $p = $iph->pHash(geometry => 64); # this is actually recommended over the default hash
# Different method for the bitmask for lowest false-negative rate
my $p = $iph->pHash(geometry => 64, method => 'log');
# Calculate a pHash with the upper left half of the upper-left-most 7x7 of the DCT - 27 bits, 7 hex chars
# Use for indexing or reducing full phash false negatives - significant false negative rate by itself
my $p7 = $iph->pHash(geometry => '7x7', reduce => 1);
# or shortcut function
$p7 = $iph->pHash7();
# Calculate a pHash with the upper left half of the upper-left-most 6x6 of the DCT - 20 bits, 5 hex chars
# Use for indexing or reducing full phash false negatives - very significant false negative rate by itself
my $p6 = $iph->pHash(geometry => '6x6', reduce => 1, method => 'median');
# or shortcut function
$p6 = $iph->pHash6();
# Calculate the difference (Hamming distance) between two hex hashes
my $diff = Image::PHash::diff($p1, $p2);
DESCRIPTION
Image::PHash allows you to calculate the (DCT-based) perceptual hash (pHash) of an image.
The constructor and general structure are based on Image::Hash - keeping usage quite similar, but the pHash algorithm is rewritten from scratch, as the Image::Hash implementation was seriously flawed (and slow). Apart from fixes for GD, Imager and ImageMagick resizing, Image::Imlib2 support is added, along with some unique features such as reduced hashes for indexing, options to deal with image mirroring, and alternative bitmask methods.
A fast DCT module made for the specific purpose is used (Math::DCT - over 10x faster than the pHash.org C++ implementation).
CONSTRUCTOR METHODS
my $iph = Image::PHash->new($image_file , $library?, \%settings?);
The first argument is an image filename when the Imlib2 library is used, but it can also be a variable with the image data for the other libraries.
The second (optional) argument is the image library. Valid options are Imlib2, GD, ImageMagick, Imager and if not specified the module will try to load them in that order. Using a different library (or even library version) will most likely result in different hashes being returned, so make sure you hash your entire image set using the same image library. Also, the module will probably not work with very old versions of the image libraries. See Notes for comparison of libraries.
The third (optional) argument can take a reference to a settings hash. Currently supported settings:
resize: Expects an integer to specify the size of the image resize before applying the DCT transformation. By default it is 32, which resizes to a 32x32 image. You may want to explore different sizes which have different hashing behaviour, e.g. increasing resize to 64 seems to offer some benefit for the reduced/index hashes (at a performance penalty of course).
magick_filter: The filter parameter for ImageMagick which controls the scaling filter. The default is Cubic, which is about as good as ImageMagick's own default Lanczos, but significantly faster. You can look up the ImageMagick documentation for alternatives (e.g. Lanczos, Mitchell, Triangle, Gaussian...) if you want a different balance of speed/quality.
imager_qtype: The qtype parameter for Imager which defines the quality of scaling performed. The type mixing is used by default as it seems to behave well in most cases while being about twice as fast as Imager's internal default qtype of normal, but you can still manually specify that if you prefer.
The constructor will croak if there is an error loading an image library or an incorrect/missing argument. It will only carp and return undef if an image library returns after failing to load an image.
METHODS
pHash
my $p = $iph->pHash(
geometry => '8x8', # Size of the matrix to keep from the DCT. 8x8 is the standard 64 bit pHash.
reduce => 0, # If enabled removes lower right half of matrix and 0,0 position.
method => 'average', # Method with which to convert the dct to bitmask.
mirror => 0, # If enabled will return the hash of the mirror image (horizontal flip).
mirrorproof => 0, # If enabled will return a hash type that resists image mirroring.
);
my @bits = $iph->pHash();
Generates a pHash, returns it as a hex string (or array of 1s and 0s in array context).
The pHash process consists of resizing the image to 32x32 (unless a different resize was specified with the constructor), converting color to luminance, DCT on the resized image's luminance, conversion to bit values based on the selected method and dropping the high frequency part of the matrix, following the geometry and reduce settings.
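The "conversion to bit values" step can be illustrated with a minimal sketch in plain Perl. It is not the module's code: the helper name threshold_bits and the sample coefficients are made up for illustration, and only the mean and median thresholds are shown, each value becoming 1 when it exceeds the threshold.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not part of Image::PHash): turn a list of DCT
# coefficients into bits by comparing each value with a threshold that is
# either the arithmetic mean or the median of the list.
sub threshold_bits {
    my ($method, @dct) = @_;
    my $threshold;
    if ($method eq 'average') {
        $threshold = sum(@dct) / scalar(@dct);
    } elsif ($method eq 'median') {
        my @sorted = sort { $a <=> $b } @dct;
        my $mid    = int(scalar(@sorted) / 2);
        $threshold = scalar(@sorted) % 2
            ? $sorted[$mid]
            : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
    } else {
        die "unsupported method '$method'";
    }
    return map { $_ > $threshold ? 1 : 0 } @dct;
}

# Made-up coefficients: a large first (0,0) term followed by smaller values.
my @dct = (120, -8, 5, -3, 2, 14, -20, 1);
print join('', threshold_bits('average', @dct)), "\n";    # prints 10000100
print join('', threshold_bits('median',  @dct)), "\n";    # prints 10101100
```

In this toy input the large first term pulls the mean up, so the mean threshold sets fewer bits than the median threshold, which by construction splits the values roughly in half.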
The parameters to pass:
geometry: A string with the desired square dimensions NxN of the bit matrix taken from the upper left of the full (by default 32x32) processed DCT to be used in the hash. By default it is '8x8', which produces the typical 64bit pHash.
Alternatively, specify a simple integer for the number of low frequency bits to take from the matrix, going through the matrix in increasing diagonals. E.g. a value of 64 will return 64 bits just like the default pHash, but they will be taken from the upper left half of the 11x11 top-left part of the DCT matrix. The effect is that the resulting hash will have improved characteristics, especially in the average distance of non-similar images.
reduce: The option will return only the bits that are on, or to the upper left of the 0,N->N,0 diagonal of the selected NxN matrix, except the first bit (which is always 1). This way you get the (N-1)*(2+N)/2 most significant (recording the largest changes) bits.
For example, an 8x8 reduced hash has 35 bits, which is less than a 6x6 matrix, and yet may outperform a full 7x7 49 bit matrix.
The option only applies to square geometry. If you have specified bits instead, you already get an effect similar to reduce.
method: Specifies which method to use for converting the DCT result to a bitmask. By default it is average. Supported methods:
  average: The default method which compares each DCT value with the arithmetic mean. It is usually a bit better at recognising similarity than median (the average difference of similar images will be lower), albeit with an increase in false positives.
  median: The median of the DCT values is used as the threshold. Some implementations prefer it over average due to lower false positives/collision rate, which can be good for reduced size hashes, but it increases false negatives so it is not the recommended method for many scenarios.
  average_x: Applies only to reduced hashes - it will calculate the average using the entire NxN matrix for geometry='NxN', or use the next X lowest frequency DCT coefficients (total 2*X) to calculate the average for geometry=X. Almost as low collision rate as log, perhaps a bit better (lower) false negatives.
  log: A special logarithmic average is calculated as the threshold, giving the lowest collision rate of all the methods. It will usually have a slightly higher false negative chance compared to average or perhaps even diff - but should still be lower than median. Quite close to average_x, but also applies to non-reduced hashes.
  diff: The difference between each DCT value is taken as the bitmask. It seems to fall between average and median in most tests, both in false negative/collision rate and at similarity recognition.
mirror: Returns the pHash of the mirror image. Note that this is a function applied after the DCT, so if you call pHash once with mirror and once without you can get both hashes without a processing overhead. Not compatible with mirrorproof.
mirrorproof: Will return a pHash that is impervious to mirroring (flipping images horizontally). This means two mirrored images will have the same/similar pHash, so it is good for declaring such images as "similar" with a single pHash comparison, but if you want to know they were mirrored the mirror option is more appropriate.
Caveat: The option sacrifices about 2 bits or so of entropy, so the resulting pHash is less effective. Thus, it should not be preferred if there is no specific reason, especially when the mirror option is available.
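The bit counts quoted for reduced hashes follow from the (N-1)*(2+N)/2 formula given for reduce. The sketch below checks it by enumeration in plain Perl; reduced_positions is a hypothetical helper written for this illustration, assuming 0-based row/column indices, and it makes no claim about the order in which the module emits the bits.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Image::PHash): list the cells kept by a
# reduced NxN hash, i.e. those on or above the anti-diagonal
# (row + col <= N-1 with 0-based indices), skipping the 0,0 cell.
sub reduced_positions {
    my ($n) = @_;
    my @pos;
    for my $row (0 .. $n - 1) {
        for my $col (0 .. $n - 1 - $row) {
            next if $row == 0 && $col == 0;
            push @pos, [$row, $col];
        }
    }
    return @pos;
}

for my $n (6, 7, 8) {
    my @kept    = reduced_positions($n);
    my $formula = ($n - 1) * ($n + 2) / 2;
    printf "%dx%d reduced: %d bits (formula gives %d)\n",
        $n, $n, scalar(@kept), $formula;
}
```

The loop reports 20, 27 and 35 bits for 6x6, 7x7 and 8x8, matching the figures used for pHash6, pHash7 and the 8x8 reduced example.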
INDEXING SHORTCUT METHODS
The special reduced hashes below, named pHash6 and pHash7, are specially chosen to be useful in two scenarios:
* Extra verification to reduce false positives that the full hash produces. Depending on the scenario, tests have shown each of the reduced hashes being able to reduce false positives by 60-90% with the right threshold (for example 2 and 3 for pHash6 and pHash7 respectively), and if both are used, over 95% reduction of false positives is possible.
* For scenarios where we want to use a simple index of a database (e.g. MySQL) and only very simple manipulations are required to be matched (e.g. resize) or there is a higher tolerance for false negatives, pHash6 & pHash7 at diff 0 can be used to retrieve most matches.
For example, storing pHash6, pHash7 (as indices) along with pHash in a MySQL db, we can retrieve most matches with something like:
SELECT *
FROM hash_table
WHERE (phash7 = @phash7
OR phash6 = @phash6)
AND BIT_COUNT(CAST(CONV(phash, 16, 10) AS UNSIGNED) ^ CAST(CONV(@phash, 16, 10) AS UNSIGNED)) < 8;
pHash7
my $p7 = $iph->pHash7();
Equivalent to $iph->pHash(geometry => '7x7', reduce => 1). It is useful for relational databases that don't support indexing with differences, so using the reduced pHash will return many of the close full pHash matches. For simple resize/compression manipulations expect matches in the region of 98% or so to be returned. It can also be used to verify the match in some cases where a specific pattern might make different photos match in the full phash but not the limited. It takes the same parameters as pHash.
pHash6
my $p6 = $iph->pHash6();
Equivalent to $iph->pHash(geometry => '6x6', reduce => 1, method => 'median'). Similar to pHash7, but will produce more matches when used as an index, along with many more false positives.
Note that neither reduced version is appropriate to use as an image comparison hash by itself (too many false positives), and they are chosen to be complementary, so when used in conjunction for either indexing or verification, their performance increases considerably.
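The index-then-verify idea of this section can also be sketched in memory, without a database. Everything here is illustrative: the records, hash values and the helper names hex_distance and find_similar are made up, exact matches on the two reduced hashes stand in for the indexed columns, and the full hash is then checked against a distance threshold as in the SQL example.

```perl
use strict;
use warnings;

# Illustrative helper: Hamming distance of two equal-length hex strings,
# using a string XOR of the packed bytes and a bit-count checksum.
sub hex_distance {
    my ($h1, $h2) = @_;
    return unpack('%32b*', pack('H*', $h1) ^ pack('H*', $h2));
}

# Made-up records: id plus full, 7x7-reduced and 6x6-reduced hashes.
my @records = (
    { id => 1, phash => 'f0e1d2c3b4a59687', phash7 => '1a2b3c4', phash6 => 'abcde' },
    { id => 2, phash => '0123456789abcdef', phash7 => '7654321', phash6 => '12345' },
    { id => 3, phash => 'f0e1d2c3b4a59786', phash7 => '1a2b3c5', phash6 => 'abcde' },
);

# Exact-match lookup tables playing the role of the database indices.
my (%by_p7, %by_p6);
for my $r (@records) {
    push @{ $by_p7{ $r->{phash7} } }, $r;
    push @{ $by_p6{ $r->{phash6} } }, $r;
}

# Candidates share phash7 OR phash6 with the query; the full hash verifies.
sub find_similar {
    my ($query, $max_diff) = @_;
    my %seen;
    my @candidates = grep { !$seen{ $_->{id} }++ } (
        @{ $by_p7{ $query->{phash7} } || [] },
        @{ $by_p6{ $query->{phash6} } || [] },
    );
    return grep { hex_distance($_->{phash}, $query->{phash}) < $max_diff } @candidates;
}

my $query = { phash => 'f0e1d2c3b4a59687', phash7 => '1a2b3c4', phash6 => 'abcde' };
my @ids = sort { $a <=> $b } map { $_->{id} } find_similar($query, 8);
print "matches: @ids\n";    # prints "matches: 1 3"
```

Record 3 is found through the phash6 table only, which shows why querying both reduced hashes retrieves more matches than either alone.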
HELPER METHODS
reducedimage
my $img = $iph->reducedimage();
Returns the reduced (rescaled) image that will be used for the DCT.
dctdump
my $dct = $iph->dctdump();
Will return the full 32x32 DCT as an arrayref of floats.
printbitmatrix
$iph->printbitmatrix(
%phash_opt, # Any pHash method option applies
separator => '', # Separator for horizontal values
filler => ' ' # For reduced results, filler for the missing positions
);
Will return a print-friendly reduced size bitmask matrix as a string. Basically a string with rows/columns of the 1s and 0s you would get from calling $iph->pHash() with the same parameters.
HELPER FUNCTIONS
b2h
my $hash = Image::PHash::b2h(join('', @bits));
Will convert a bit value string to a hex string.
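As an illustration of the kind of conversion involved, the sketch below turns a string of 1s and 0s into hex in plain Perl. It is not the module's b2h: the name bits_to_hex is hypothetical, and padding on the left to a multiple of 4 bits is an assumption made here (the padding rule used by b2h is not stated in this document).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bits-to-hex conversion (not Image::PHash::b2h).
# Assumption: the bit string is left-padded with zeros to a multiple of 4.
sub bits_to_hex {
    my ($bits) = @_;
    die "expected a string of 0s and 1s" unless $bits =~ /\A[01]+\z/;
    my $pad = (4 - length($bits) % 4) % 4;
    $bits = ('0' x $pad) . $bits;
    # Convert each group of 4 bits to one hex digit.
    return join '', map { sprintf '%x', oct("0b$_") } $bits =~ /(.{4})/g;
}

print bits_to_hex('10100101'), "\n";              # prints a5
print length(bits_to_hex('1' x 27)), "\n";        # prints 7
```

Under this padding assumption a 27-bit reduced hash yields 7 hex characters and a 20-bit one yields 5, which agrees with the sizes quoted in the SYNOPSIS.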
diff
my $diff = Image::PHash::diff($phash1, $phash2);
Will calculate the bit difference of two hex string hashes (the Hamming distance of their bit stream forms). On 64 bit systems (checking $Config{ivsize}) it will actually call diff64, which can calculate the difference of up to 64bit hashes in a single operation (using %064b).
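The single-operation approach described above can be sketched as follows. This is an illustration under assumptions rather than the module's diff64: the function name hex_diff64 is made up, and it simply XORs the two values and counts the 1s in their %064b representation, which requires a perl built with 64-bit integers.

```perl
use strict;
use warnings;
use Config;

# Hypothetical sketch (not the module's diff64): Hamming distance of two
# hex hashes of up to 64 bits, computed with one XOR and one sprintf.
sub hex_diff64 {
    my ($h1, $h2) = @_;
    die "needs a perl with 64-bit integers" unless $Config{ivsize} >= 8;
    no warnings 'portable';    # hex() warns for values above 0xffffffff
    my $bits = sprintf '%064b', hex($h1) ^ hex($h2);
    return $bits =~ tr/1//;    # tr with an empty replacement only counts
}

print hex_diff64('f0e1d2c3b4a59687', 'f0e1d2c3b4a59786'), "\n";    # prints 2
print hex_diff64('ffffffffffffffff', '0000000000000000'), "\n";    # prints 64
```

Hashes longer than 64 bits, or perls with 32-bit integers, would need the XOR to be done in chunks instead.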
NOTES
Performance
The hashing performance of the module is enough to make the actual pHash generation from the final 32x32 mono image a trivial part of the process. For a general idea, on a single core of a 2015 Macbook Pro, over 18000 hashes/sec can be processed thanks in part to the fast Math::DCT XS module (developed specifically for Image::PHash).
So, most of the processing time is spent on loading the image, resizing, extracting pixel values and removing color, all of which depend on the specific image module. On an Apple M1, hashing 800x600 jpg images was measured at 131 h/s with Image::Magick, 208 h/s with Imager, 241 h/s with GD, 547 h/s with Image::Imlib2. Higher resolutions make the process slower, as you would expect. Since all images will be resized to 32x32 in the end, the fastest hashing performance would be if you loaded 32x32 thumbnails. In that case, the performance of the libraries in the same order for the resized image set was: 659 h/s, 664 h/s, 1883 h/s, 2296 h/s. It is clear that Image::Imlib2 should be preferred when hashing performance is desired, as it offers dramatically better performance (unless you are hashing 32x32 images, in which case GD is also fast). It should be noted that the resulting hashes don't have exactly the same behaviour/metrics, due to the different resizing algorithms used, but the differences seem to be very small. You are encouraged to test on your own data set.
Remember, never mix image libraries (or settings), the hashes will most likely not be compatible.
Finally, if you are curious about the performance of this module compared to the C++ pHash.org implementation, pHash.org could achieve 33 h/s with the test setup as above, making Image::PHash over 16x faster with Imlib2. With pre-sized 32x32 images, pHash.org ran at 101 h/s.
Compatibility of hashes
As already mentioned, if you produce hashes with different settings, different image libraries etc, the hashes might not be compatible. It is advisable to even freeze the version of this module and the image library in a production environment to avoid any degraded performance.
Calculation caching
Calculating pHashes with different geometry/reduce/method/mirror arguments for the same image is very fast (when the same object is used), as the resize and DCT transform will only happen on the very first pHash calculation and are cached for any subsequent call. You can essentially get the extra pHash6/pHash7/mirror etc. "for free" after the initial pHash calculation.
Image::PHash vs Image::Hash
While Image::Hash may still be useful for the aHash and dHash functionality, its pHash implementation is seriously flawed. It does not actually do a full DCT, using instead a shortcut that seems to result in hashes with lots of zeros and thus a high rate of collisions (~2% chance for identical hash on dissimilar images, making it useless for my large data set), which is the reason the hashing was implemented from scratch. Despite not doing a full DCT it was really slow (over 80x slower than the XS Math::DCT), so switching to Image::PHash will give you "correct" hashes at a significant speed increase, along with several extra features.
Image::PHash vs phash.org
Apart from the significant speed advantage of Image::PHash noted above, there are a couple of important differences, in that phash.org will apply a 7x7 mean filter to the image before the resize and the conversion to bits is always done with the median method. This seems to keep false positives quite low, but its false negatives are higher. Since with Image::PHash you can get even better hashes with, for example, geometry=64 and you can combine them with method='average_x' or method='log', you will get an even lower false positive rate than phash.org, but with fewer false negatives as well. Feel free to share your own comparisons with the author if in doubt.
Note that the differences you are to use as a threshold for Image::PHash and phash.org are quite different - phash.org will give about 50% greater diffs on average (e.g. where I would use 7 for the former, 11 would be the equivalent for the latter).
Selecting a diff threshold
The appropriate diff threshold for declaring images as "similar" is not a precise art and will depend on the application (type of images, tolerance for false positives etc.). The exact use case is very important too: if you have 2 images and want to check whether they are similar, a false positive rate of even over 1% is fine, in which case the diff threshold can probably be over 10, whereas checking whether a duplicate exists in a big collection of photos requires a very low false positive rate. Example diff ranges for a full pHash are 3-7 if you want to keep false positives close to 0%. For the small pHash7 and pHash6 probably not more than 3 and 2 respectively are useful for lookups (and still with lots of false positives as noted above).
ACKNOWLEDGEMENTS
Initially based on Image::Hash, so code to do with loading images, pixels etc has been kept.
AUTHOR
Dimitrios Kechagias, <dkechag at cpan.org>
BUGS
Please report any bugs or feature requests either on GitHub, or on RT (via the email bug-image-phash at rt.cpan.org or web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Image-PHash).
I will be notified, and then you'll be notified of progress on your bug as I make changes.
GIT
https://github.com/dkechag/Image-PHash
COPYRIGHT & LICENSE
Copyright (C) 2022, SpareRoom & Dimitrios Kechagias.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.