.\" Automatically generated by Pod::Man 4.10 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "DUPEMAP 1"
.TH DUPEMAP 1 "2018-10-16" "1.1.10" "Magic Rescue"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
dupemap \- Creates a database of file checksums and uses it to eliminate
duplicates
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
\&\fBdupemap\fR [ \fIoptions\fR ] [ \fB\-d\fR \fIdatabase\fR ] \fIoperation\fR \fIpath...\fR
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\fBdupemap\fR recursively scans each \fIpath\fR to find checksums of file contents.
Directories are searched through in no particular order.  Its actions depend on
whether the \fB\-d\fR option is given, and on the \fIoperation\fR parameter, which
must be a comma-seperated list of \fBscan\fR, \fBreport\fR, \fBdelete\fR:
.SS "Without \fB\-d\fP"
.IX Subsection "Without -d"
\&\fBdupemap\fR will take action when it sees the same checksum repeated more than
once, i.e. it simply finds duplicates recursively.  The action depends on
\&\fIoperation\fR:
.IP "\fBreport\fR" 7
.IX Item "report"
Report what files are encountered more than once, printing their names to
standard output.
.IP "\fBdelete\fR[\fB,report\fR]" 7
.IX Item "delete[,report]"
Delete files that are encountered more than once.  Print their names if
\&\fBreport\fR is also given.
.Sp
\&\fI\s-1WARNING:\s0\fR use the \fBreport\fR operation first to see what will be deleted.
.Sp
\&\fI\s-1WARNING:\s0\fR You are advised to make a backup of the target first, e.g. with
\&\f(CW\*(C`cp \-al\*(C'\fR (for \s-1GNU\s0 cp) to create hard links recursively.
.SS "With \fB\-d\fP"
.IX Subsection "With -d"
The \fIdatabase\fR argument to \fB\-d\fR will denote a database file (see the
\&\*(L"\s-1DATABASE\*(R"\s0 section in this manual for details) to read from or write to.  In
this mode, the \fBscan\fR operation should be run on one \fIpath\fR, followed by the
\&\fBreport\fR or \fBdelete\fR operation on another (\fInot the same!\fR) \fIpath\fR.
.IP "\fBscan\fR" 7
.IX Item "scan"
Add the checksum of each file to \fIdatabase\fR.  This operation must be run
initially to create the database.  To start over, you must manually delete the
database file(s) (see the \*(L"\s-1DATABASE\*(R"\s0 section).
.IP "\fBreport\fR" 7
.IX Item "report"
Print each file name if its checksum is found in \fIdatabase\fR.
.IP "\fBdelete\fR[\fB,report\fR]" 7
.IX Item "delete[,report]"
Delete each file if its checksum is found in \fIdatabase\fR.  If \fBreport\fR is also
present, print the name of each deleted file.
.Sp
\&\fI\s-1WARNING:\s0\fR if you run \fBdupemap delete\fR on the same \fIpath\fR you just ran
\&\fBdupemap scan\fR on, it will \fIdelete every file!\fR The idea of these options is
to scan one \fIpath\fR and delete files in a second \fIpath\fR.
.Sp
\&\fI\s-1WARNING:\s0\fR use the \fBreport\fR operation first to see what will be deleted.
.Sp
\&\fI\s-1WARNING:\s0\fR You are advised to make a backup of the target first, e.g. with
\&\f(CW\*(C`cp \-al\*(C'\fR (for \s-1GNU\s0 cp) to create hard links recursively.
.SH "OPTIONS"
.IX Header "OPTIONS"
.IP "\fB\-d\fR \fIdatabase\fR" 7
.IX Item "-d database"
Use \fIdatabase\fR as an on-disk database to read from or write to.  See the
\&\*(L"\s-1DESCRIPTION\*(R"\s0 section above about how this influences the operation of
\&\fBdupemap\fR.
.IP "\fB\-I\fR \fIfile\fR" 7
.IX Item "-I file"
Reads input files from \fIfile\fR in addition to those listed on the command line.
If \fIfile\fR is \f(CW\*(C`\-\*(C'\fR, read from standard input.  Each line will be interpreted as
a file name.
.Sp
The paths given here will \s-1NOT\s0 be scanned recursively.  Directories will be
ignored and symlinks will be followed.
.IP "\fB\-m\fR \fIminsize\fR" 7
.IX Item "-m minsize"
Ignore files below this size.
.IP "\fB\-M\fR \fImaxsize\fR" 7
.IX Item "-M maxsize"
Ignore files above this size.
.SH "USAGE"
.IX Header "USAGE"
.SS "General usage"
.IX Subsection "General usage"
The easiest operations to understand is when the \fB\-d\fR option is not given.  To
delete all duplicate files in \fI/tmp/recovered\-files\fR, do:
.PP
.Vb 1
\&    $ dupemap delete /tmp/recovered\-files
.Ve
.PP
Often, \fBdupemap scan\fR is run to produce a checksum database of all files in a
directory tree.  Then \fBdupemap delete\fR is run on another directory, possibly 
following \fBdupemap report\fR.  For example, to delete all files in
\&\fI/tmp/recovered\-files\fR that already exist in \fI\f(CI$HOME\fI\fR, do this:
.PP
.Vb 2
\&    $ dupemap \-d homedir.map scan $HOME
\&    $ dupemap \-d homedir.map delete,report /tmp/recovered\-files
.Ve
.SS "Usage with magicrescue"
.IX Subsection "Usage with magicrescue"
The main application for \fBdupemap\fR is to take some pain out of performing
undelete operations with \fBmagicrescue\fR(1).  The reason is that \fBmagicrescue\fR
will extract every single file of the specified type on the block device, so
undeleting files requires you to find a few files out of hundreds, which can
take a long time if done manually.  What we want to do is to only extract the
documents that don't exist on the file system already.
.PP
In the following scenario, you have accidentally deleted some important Word
documents in Windows.  If this were a real-world scenario, then by all means use
The Sleuth Kit.  However, \fBmagicrescue\fR will work even when the directory
entries were overwritten, i.e. more files were stored in the same folder later.
.PP
You boot into Linux and change to a directory with lots of space.  Mount the
Windows partition, preferably read-only (especially with \s-1NTFS\s0), and create the
directories we will use.
.PP
.Vb 2
\&    $ mount \-o ro /dev/hda1 /mnt/windows
\&    $ mkdir healthy_docs rescued_docs
.Ve
.PP
Extract all the healthy Word documents with \fBmagicrescue\fR and build a database
of their checksums.  It may seem a little redundant to send all the documents
through \fBmagicrescue\fR first, but the reason is that this process may modify
them (e.g. stripping trailing garbage), and therefore their checksum will not
be the same as the original documents.  Also, it will find documents embedded
inside other files, such as uncompressed zip archives or files with the wrong
extension.
.PP
.Vb 4
\&    $ find /mnt/windows \-type f \e
\&      |magicrescue \-I\- \-r msoffice \-d healthy_docs
\&    $ dupemap \-d healthy_docs.map scan healthy_docs
\&    $ rm \-rf healthy_docs
.Ve
.PP
Now rescue all \f(CW\*(C`msoffice\*(C'\fR documents from the block device and get rid of
everything that's not a *.doc.
.PP
.Vb 2
\&    $ magicrescue \-Mo \-r msoffice \-d rescued_docs /dev/hda1 \e
\&      |grep \-v \*(Aq\e.doc$\*(Aq|xargs rm \-f
.Ve
.PP
Remove all the rescued documents that also appear on the file system, and
remove duplicates.
.PP
.Vb 2
\&    $ dupemap \-d healthy_docs.map delete,report rescued_docs
\&    $ dupemap delete,report rescued_docs
.Ve
.PP
The \fIrescued_docs\fR folder should now contain only a few files.  This will be
the undeleted files and some documents that were not stored in contiguous
blocks (use that defragger ;\-)).
.SS "Usage with fsck"
.IX Subsection "Usage with fsck"
In this scenario (based on a true story), you have a hard disk that's gone bad.
You have managed to \fIdd\fR about 80% of the contents into the file \fIdiskimage\fR,
and you have an old backup from a few months ago.  The disk is using reiserfs
on Linux.
.PP
First, use fsck to make the file system usable again.  It will find many
nameless files and put them in \fIlost+found\fR.  You need to make sure there is
some free space on the disk image, so fsck has something to work with.
.PP
.Vb 6
\&    $ cp diskimage diskimage.bak
\&    $ dd if=/dev/zero bs=1M count=2048 >> diskimage
\&    $ reiserfsck \-\-rebuild\-tree diskimage
\&    $ mount \-o loop diskimage /mnt
\&    $ ls /mnt/lost+found
\&    (tons of files)
.Ve
.PP
Our strategy will be to restore the system with the old backup as a base and
merge the two other sets of files (\fI/mnt/lost+found\fR and \fI/mnt\fR) into the
backup after eliminating duplicates.  Therefore we create a checksum database
of the directory we have unpacked the backup in.
.PP
.Vb 1
\&    $ dupemap \-d backup.map scan ~/backup
.Ve
.PP
Next, we eliminate all the files from the rescued image that are also present
in the backup.
.PP
.Vb 1
\&    $ dupemap \-d backup.map delete,report /mnt
.Ve
.PP
We also want to remove duplicates from \fIlost+found\fR, and we want to get rid of
any files that are also present in the other directories in \fI/mnt\fR.
.PP
.Vb 3
\&    $ dupemap delete,report /mnt/lost+found
\&    $ ls /mnt|grep \-v lost+found|xargs dupemap \-d mnt.map scan
\&    $ dupemap \-d mnt.map delete,report /mnt/lost+found
.Ve
.PP
This should leave only the files in \fI/mnt\fR that have changed since the last
backup or got corrupted.  Particularly, the contents of \fI/mnt/lost+found\fR
should now be reduced enough to manually sort through them (or perhaps use
\&\fBmagicsort\fR(1)).
.SS "Primitive intrusion detection"
.IX Subsection "Primitive intrusion detection"
You can use \fBdupemap\fR to see what files change on your system.  This is one of
the more exotic uses, and it's only included for inspiration.
.PP
First, you map the whole file system.
.PP
.Vb 1
\&    $ dupemap \-d old.map scan /
.Ve
.PP
Then you come back a few days/weeks later and run \fBdupemap report\fR.  This will
give you a view of what \fIhas not\fR changed.  To see what \fIhas\fR changed, you
need a list of the whole file system.  You can get this list along with
preparing a new map easily.  Both lists need to be sorted to be compared.
.PP
.Vb 2
\&    $ dupemap \-d old.map report /|sort > unchanged_files
\&    $ dupemap \-d current.map scan /|sort > current_files
.Ve
.PP
All that's left to do is comparing these files and preparing for next week.
This assumes that the dbm appends the \f(CW\*(C`.db\*(C'\fR extension to database files.
.PP
.Vb 2
\&    $ diff unchanged_files current_files > changed_files
\&    $ mv current.map.db old.map.db
.Ve
.SH "DATABASE"
.IX Header "DATABASE"
The actual database file(s) written by \fBdupecheck\fR will have some relation to
the \fIdatabase\fR argument, but most implementations append an extension.  For
example, Berkeley \s-1DB\s0 names the files \fIdatabase\fR\fB.db\fR, while Solaris and \s-1GDBM\s0
creates both a \fIdatabase\fR\fB.dir\fR and \fIdatabase\fR\fB.pag\fR file.
.PP
\&\fBdupecheck\fR depends on a database library for storing the checksums.  It
currently requires the POSIX-standardized \fBndbm\fR library, which must be
present on XSI-compliant UNIXes.  Implementations are not required to handle
hash key collisions, and a failure to do that could make \fBdupecheck\fR delete
too many files.  I haven't heard of such an implementation, though.
.PP
The current checksum algorithm is the file's \s-1CRC32\s0 combined with its size.
Both values are stored in native byte order, and because of varying type sizes
the database is \fInot\fR portable across architectures, compilers and operating
systems.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
\&\fBmagicrescue\fR(1), \fBweeder\fR(1)
.PP
This tool does the same thing \fBweeder\fR does, except that \fBweeder\fR cannot seem
to handle many files without crashing, and it has no largefile support.
.SH "BUGS"
.IX Header "BUGS"
There is a tiny chance that two different files can have the same checksum and
size.  The probability of this happening is around 1 to 10^14, and since
\&\fBdupemap\fR is part of the Magic Rescue package, which deals with disaster
recovery, that chance becomes an insignificant part of the game.  You should
consider this if you apply \fBdupemap\fR to other applications, especially if they
are security-related (see next paragraph).
.PP
It is possible to craft a file to have a known \s-1CRC32.\s0  You need to keep this in
mind if you use \fBdupemap\fR on untrusted data.  A solution to this could be to
implement an option for using \s-1MD5\s0 checksums instead.
.SH "AUTHOR"
.IX Header "AUTHOR"
Jonas Jensen <jbj@knef.dk>
.SH "LATEST VERSION"
.IX Header "LATEST VERSION"
This tool is part of Magic Rescue.  You can find the latest version at
<https://github.com/jbj/magicrescue>