NAME¶
simhash - file similarity hash tool
SYNOPSIS¶
simhash [ -s nshingles ] [ -f nfeatures ] [ file ]

simhash [ -s nshingles ] [ -f nfeatures ] -w file ...

simhash [ -s nshingles ] [ -f nfeatures ] -m file ...

simhash -c hashfile hashfile
DESCRIPTION¶
This program is used to compute and compare similarity hashes of files. A
  similarity hash is a chunk of data with the property that the distance,
  under some metric, between two files is approximately proportional to the
  distance between their hashes. Typically the similarity hash will be much
  smaller than the file itself.
The algorithm used by simhash is Manasse's "shingleprinting" algorithm (see
  BIBLIOGRAPHY below): take a hash of every m-byte subsequence of the file,
  and retain the n of these hashes that are numerically smallest. The size of
  the intersection of the hash sets of two files gives a statistically good
  estimate of the similarity of the files as a whole.
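For concreteness, here is a minimal Python sketch of that idea, using CRC32 as
  the per-shingle hash (see BUGS); the function name, the set-based handling of
  duplicate shingles, and other details are illustrative assumptions rather
  than simhash's actual implementation.

      import heapq
      import zlib

      def simhash_features(data, shingle_size=8, nfeatures=128):
          # Hash every shingle_size-byte window of the input and keep the
          # nfeatures numerically smallest distinct hash values.
          hashes = {zlib.crc32(data[i:i + shingle_size])
                    for i in range(len(data) - shingle_size + 1)}
          return set(heapq.nsmallest(nfeatures, hashes))

The default parameter values above mirror the -s and -f defaults described
  under OPTIONS.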
In its default mode, simhash will compute the similarity hash of its file
  argument (or stdin) and write this hash to its standard output. When invoked
  with the -w argument (see below), simhash will compute similarity hashes of
  all of its file arguments in "batch mode". When invoked with the -m argument
  (see below), simhash will compare all the given files using similarity
  hashes in "match mode". Finally, when invoked with the -c argument (see
  below), simhash will report the degree of similarity between two hashes.
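For example, the four modes might be invoked like this (the file names are
  purely illustrative):

      simhash report.txt > report.txt.sim
      simhash -w chapter1.txt chapter2.txt
      simhash -m chapter1.txt chapter2.txt chapter3.txt
      simhash -c chapter1.txt.sim chapter2.txt.sim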
OPTIONS¶
  - -f feature-count
      When computing a similarity hash, retain at most feature-count
      significant hashes from the target file. The default is 128 features.
      Larger feature counts will give higher resolution in differences between
      files, will increase the size of the similarity hash proportionally to
      the feature count, and will increase similarity hash computation time
      slightly.

  - -s shingle-size
      When computing a similarity hash, use hashes of samples consisting of
      shingle-size consecutive bytes drawn from the target file. The default
      is 8 bytes; the minimum is 4 bytes. Larger shingle sizes will emphasize
      the differences between files more and will slow the similarity hash
      computation proportionally to the shingle size.

  - -c hashfile1 hashfile2
      Display the distance (normalized to the range 0..1) between the
      similarity hash stored in hashfile1 and the similarity hash stored in
      hashfile2 (a sketch of this comparison follows this list).

  - -w file ...
      Write the similarity hash of each of the file arguments to file.sim.

  - -m file ...
      Compute the similarity hash of each of the file arguments, and output a
      similarity matrix for those files.
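
The comparison performed for -c can be pictured with the following Python
  sketch, a companion to the one under DESCRIPTION; the Jaccard-style
  normalization shown here is an assumption, and the normalization simhash
  actually uses may differ.

      def feature_distance(features_a, features_b):
          # Distance in the range 0..1: one minus the fraction of features
          # shared by the two feature sets (assumed normalization).
          union = features_a | features_b
          if not union:
              return 0.0   # two empty feature sets: treat as identical
          return 1.0 - len(features_a & features_b) / len(union)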
 
AUTHOR¶
Bart Massey <bart@cs.pdx.edu>
BUGS¶
This currently uses CRC32 for the hashing. A Rabin Fingerprint should be offered
  as a slightly slower but more reliable alternative.
The shingleprinting algorithm works well for text files and fairly well for
  other sequential file types, but does not work well for image files, which
  are both two-dimensional and often subject to odd transformations.
BIBLIOGRAPHY¶
Mark Manasse, Microsoft Research Silicon Valley. Finding similar things
  quickly in large collections.
  http://research.microsoft.com/research/sv/PageTurner/similarity.htm

Andrei Z. Broder. On the resemblance and containment of documents. In
  Compression and Complexity of Sequences (SEQUENCES'97), pages 21-29. IEEE
  Computer Society, 1998.
  ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-wpnums.pdf

Andrei Z. Broder. Some applications of Rabin's fingerprinting method. In R.
  Capocelli, A. De Santis, U. Vaccaro, eds., Sequences II: Methods in
  Communications, Security, and Computer Science, Springer-Verlag, 1993.
  http://athos.rutgers.edu/~muthu/broder.ps