.\" Automatically generated by Pod::Man 4.10 (Pod::Simple 3.35) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{\ . if \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . 
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "DUPEMAP 1"
.TH DUPEMAP 1 "2018-10-16" "1.1.10" "Magic Rescue"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
dupemap \- Creates a database of file checksums and uses it to eliminate duplicates
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
\&\fBdupemap\fR [ \fIoptions\fR ] [ \fB\-d\fR \fIdatabase\fR ] \fIoperation\fR \fIpath...\fR
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\fBdupemap\fR recursively scans each \fIpath\fR, computing a checksum of each
file's contents. Directories are searched in no particular order. Its actions
depend on whether the \fB\-d\fR option is given and on the \fIoperation\fR
parameter, which must be a comma-separated list of \fBscan\fR, \fBreport\fR,
\&\fBdelete\fR:
.SS "Without \fB\-d\fP"
.IX Subsection "Without -d"
\&\fBdupemap\fR will take action when it sees the same checksum repeated more
than once, i.e. it simply finds duplicates recursively. The action depends on
\&\fIoperation\fR:
.IP "\fBreport\fR" 7
.IX Item "report"
Report which files are encountered more than once, printing their names to
standard output.
.IP "\fBdelete\fR[\fB,report\fR]" 7
.IX Item "delete[,report]"
Delete files that are encountered more than once. Print their names if
\&\fBreport\fR is also given.
.Sp
\&\fI\s-1WARNING:\s0\fR use the \fBreport\fR operation first to see what will be
deleted.
.Sp
\&\fI\s-1WARNING:\s0\fR you are advised to make a backup of the target first,
e.g. with \f(CW\*(C`cp \-al\*(C'\fR (for \s-1GNU\s0 cp) to create hard links
recursively.
.SS "With \fB\-d\fP"
.IX Subsection "With -d"
The \fIdatabase\fR argument to \fB\-d\fR denotes a database file (see the
\&\*(L"\s-1DATABASE\*(R"\s0 section in this manual for details) to read from or
write to. In this mode, the \fBscan\fR operation should be run on one
\&\fIpath\fR, followed by the \fBreport\fR or \fBdelete\fR operation on another
(\fInot the same!\fR) \fIpath\fR.
.IP "\fBscan\fR" 7
.IX Item "scan"
Add the checksum of each file to \fIdatabase\fR. This operation must be run
initially to create the database. To start over, you must manually delete the
database file(s) (see the \*(L"\s-1DATABASE\*(R"\s0 section).
.IP "\fBreport\fR" 7
.IX Item "report"
Print each file's name if its checksum is found in \fIdatabase\fR.
.IP "\fBdelete\fR[\fB,report\fR]" 7
.IX Item "delete[,report]"
Delete each file if its checksum is found in \fIdatabase\fR. If \fBreport\fR is
also present, print the name of each deleted file.
.Sp
\&\fI\s-1WARNING:\s0\fR if you run \fBdupemap delete\fR on the same \fIpath\fR
you just ran \fBdupemap scan\fR on, it will \fIdelete every file!\fR The idea of
these operations is to scan one \fIpath\fR and delete files in a second
\&\fIpath\fR.
.Sp
\&\fI\s-1WARNING:\s0\fR use the \fBreport\fR operation first to see what will be
deleted.
.Sp
\&\fI\s-1WARNING:\s0\fR you are advised to make a backup of the target first,
e.g. with \f(CW\*(C`cp \-al\*(C'\fR (for \s-1GNU\s0 cp) to create hard links
recursively.
.SH "OPTIONS"
.IX Header "OPTIONS"
.IP "\fB\-d\fR \fIdatabase\fR" 7
.IX Item "-d database"
Use \fIdatabase\fR as an on-disk database to read from or write to. See the
\&\*(L"\s-1DESCRIPTION\*(R"\s0 section above for how this influences the
operation of \fBdupemap\fR.
.IP "\fB\-I\fR \fIfile\fR" 7
.IX Item "-I file"
Read input files from \fIfile\fR in addition to those listed on the command
line. If \fIfile\fR is \f(CW\*(C`\-\*(C'\fR, read from standard input. Each line
is interpreted as a file name.
.Sp
The paths given here will \s-1NOT\s0 be scanned recursively. Directories will
be ignored and symlinks will be followed.
.IP "\fB\-m\fR \fIminsize\fR" 7
.IX Item "-m minsize"
Ignore files below this size.
.IP "\fB\-M\fR \fImaxsize\fR" 7
.IX Item "-M maxsize"
Ignore files above this size.
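.PP
For example, these options can be combined to check a specific list of files
against an existing database while skipping small files. This is a sketch only;
\&\fI/data\fR and \fIhomedir.map\fR are hypothetical names, and \fIminsize\fR
is assumed here to be a byte count:
.PP
.Vb 1
\& $ find /data \-type f | dupemap \-m 1024 \-d homedir.map \-I \- report
.Ve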
.IP "\fB\-I\fR \fIfile\fR" 7 .IX Item "-I file" Reads input files from \fIfile\fR in addition to those listed on the command line. If \fIfile\fR is \f(CW\*(C`\-\*(C'\fR, read from standard input. Each line will be interpreted as a file name. .Sp The paths given here will \s-1NOT\s0 be scanned recursively. Directories will be ignored and symlinks will be followed. .IP "\fB\-m\fR \fIminsize\fR" 7 .IX Item "-m minsize" Ignore files below this size. .IP "\fB\-M\fR \fImaxsize\fR" 7 .IX Item "-M maxsize" Ignore files above this size. .SH "USAGE" .IX Header "USAGE" .SS "General usage" .IX Subsection "General usage" The easiest operations to understand is when the \fB\-d\fR option is not given. To delete all duplicate files in \fI/tmp/recovered\-files\fR, do: .PP .Vb 1 \& $ dupemap delete /tmp/recovered\-files .Ve .PP Often, \fBdupemap scan\fR is run to produce a checksum database of all files in a directory tree. Then \fBdupemap delete\fR is run on another directory, possibly following \fBdupemap report\fR. For example, to delete all files in \&\fI/tmp/recovered\-files\fR that already exist in \fI\f(CI$HOME\fI\fR, do this: .PP .Vb 2 \& $ dupemap \-d homedir.map scan $HOME \& $ dupemap \-d homedir.map delete,report /tmp/recovered\-files .Ve .SS "Usage with magicrescue" .IX Subsection "Usage with magicrescue" The main application for \fBdupemap\fR is to take some pain out of performing undelete operations with \fBmagicrescue\fR(1). The reason is that \fBmagicrescue\fR will extract every single file of the specified type on the block device, so undeleting files requires you to find a few files out of hundreds, which can take a long time if done manually. What we want to do is to only extract the documents that don't exist on the file system already. .PP In the following scenario, you have accidentally deleted some important Word documents in Windows. If this were a real-world scenario, then by all means use The Sleuth Kit. However, \fBmagicrescue\fR will work even when the directory entries were overwritten, i.e. more files were stored in the same folder later. .PP You boot into Linux and change to a directory with lots of space. Mount the Windows partition, preferably read-only (especially with \s-1NTFS\s0), and create the directories we will use. .PP .Vb 2 \& $ mount \-o ro /dev/hda1 /mnt/windows \& $ mkdir healthy_docs rescued_docs .Ve .PP Extract all the healthy Word documents with \fBmagicrescue\fR and build a database of their checksums. It may seem a little redundant to send all the documents through \fBmagicrescue\fR first, but the reason is that this process may modify them (e.g. stripping trailing garbage), and therefore their checksum will not be the same as the original documents. Also, it will find documents embedded inside other files, such as uncompressed zip archives or files with the wrong extension. .PP .Vb 4 \& $ find /mnt/windows \-type f \e \& |magicrescue \-I\- \-r msoffice \-d healthy_docs \& $ dupemap \-d healthy_docs.map scan healthy_docs \& $ rm \-rf healthy_docs .Ve .PP Now rescue all \f(CW\*(C`msoffice\*(C'\fR documents from the block device and get rid of everything that's not a *.doc. .PP .Vb 2 \& $ magicrescue \-Mo \-r msoffice \-d rescued_docs /dev/hda1 \e \& |grep \-v \*(Aq\e.doc$\*(Aq|xargs rm \-f .Ve .PP Remove all the rescued documents that also appear on the file system, and remove duplicates. 
.SS "Usage with magicrescue"
.IX Subsection "Usage with magicrescue"
The main application for \fBdupemap\fR is to take some of the pain out of
performing undelete operations with \fBmagicrescue\fR(1). The reason is that
\&\fBmagicrescue\fR will extract every single file of the specified type on the
block device, so undeleting files requires you to find a few files out of
hundreds, which can take a long time if done manually. What we want is to
extract only the documents that don't already exist on the file system.
.PP
In the following scenario, you have accidentally deleted some important Word
documents in Windows. If this were a real-world scenario, then by all means use
The Sleuth Kit. However, \fBmagicrescue\fR will work even when the directory
entries have been overwritten, i.e. when more files were stored in the same
folder later.
.PP
You boot into Linux and change to a directory with lots of space. Mount the
Windows partition, preferably read-only (especially with \s-1NTFS\s0), and
create the directories we will use.
.PP
.Vb 2
\& $ mount \-o ro /dev/hda1 /mnt/windows
\& $ mkdir healthy_docs rescued_docs
.Ve
.PP
Extract all the healthy Word documents with \fBmagicrescue\fR and build a
database of their checksums. It may seem a little redundant to send all the
documents through \fBmagicrescue\fR first, but the reason is that this process
may modify them (e.g. by stripping trailing garbage), so their checksums would
not match those of the original documents. It will also find documents embedded
inside other files, such as uncompressed zip archives or files with the wrong
extension.
.PP
.Vb 4
\& $ find /mnt/windows \-type f \e
\&   |magicrescue \-I\- \-r msoffice \-d healthy_docs
\& $ dupemap \-d healthy_docs.map scan healthy_docs
\& $ rm \-rf healthy_docs
.Ve
.PP
Now rescue all \f(CW\*(C`msoffice\*(C'\fR documents from the block device and
get rid of everything that's not a *.doc.
.PP
.Vb 2
\& $ magicrescue \-Mo \-r msoffice \-d rescued_docs /dev/hda1 \e
\&   |grep \-v \*(Aq\e.doc$\*(Aq|xargs rm \-f
.Ve
.PP
Remove all the rescued documents that also appear on the file system, and then
remove duplicates among the rest.
.PP
.Vb 2
\& $ dupemap \-d healthy_docs.map delete,report rescued_docs
\& $ dupemap delete,report rescued_docs
.Ve
.PP
The \fIrescued_docs\fR folder should now contain only a few files: the
undeleted files plus some documents that were not stored in contiguous blocks
(use that defragger ;\-)).
.SS "Usage with fsck"
.IX Subsection "Usage with fsck"
In this scenario (based on a true story), you have a hard disk that has gone
bad. You have managed to \fIdd\fR about 80% of its contents into the file
\&\fIdiskimage\fR, and you have an old backup from a few months ago. The disk
was using reiserfs on Linux.
.PP
First, use fsck to make the file system usable again. It will find many
nameless files and put them in \fIlost+found\fR. You need to make sure there is
some free space in the disk image, so fsck has something to work with.
.PP
.Vb 6
\& $ cp diskimage diskimage.bak
\& $ dd if=/dev/zero bs=1M count=2048 >> diskimage
\& $ reiserfsck \-\-rebuild\-tree diskimage
\& $ mount \-o loop diskimage /mnt
\& $ ls /mnt/lost+found
\& (tons of files)
.Ve
.PP
Our strategy is to restore the system with the old backup as a base and merge
the two other sets of files (\fI/mnt/lost+found\fR and \fI/mnt\fR) into the
backup after eliminating duplicates. Therefore we create a checksum database of
the directory in which the backup has been unpacked.
.PP
.Vb 1
\& $ dupemap \-d backup.map scan ~/backup
.Ve
.PP
Next, we eliminate all the files from the rescued image that are also present
in the backup.
.PP
.Vb 1
\& $ dupemap \-d backup.map delete,report /mnt
.Ve
.PP
We also want to remove duplicates from \fIlost+found\fR, and we want to get rid
of any files that are also present in the other directories in \fI/mnt\fR.
.PP
.Vb 3
\& $ dupemap delete,report /mnt/lost+found
\& $ ls \-d /mnt/*|grep \-v lost+found|xargs dupemap \-d mnt.map scan
\& $ dupemap \-d mnt.map delete,report /mnt/lost+found
.Ve
.PP
This should leave only the files in \fI/mnt\fR that have changed since the last
backup or have been corrupted. In particular, the contents of
\&\fI/mnt/lost+found\fR should now be reduced enough to sort through manually
(or perhaps with \fBmagicsort\fR(1)).
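.PP
The surviving files can then be merged back into the backup tree to complete
the restore. A minimal sketch, assuming \fBrsync\fR(1) is available and that
the nameless files in \fIlost+found\fR have already been sorted out by hand:
.PP
.Vb 1
\& $ rsync \-a \-\-exclude=lost+found /mnt/ ~/backup/
.Ve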
.SS "Primitive intrusion detection"
.IX Subsection "Primitive intrusion detection"
You can use \fBdupemap\fR to see what files change on your system. This is one
of the more exotic uses, and it is only included for inspiration.
.PP
First, you map the whole file system.
.PP
.Vb 1
\& $ dupemap \-d old.map scan /
.Ve
.PP
Then you come back a few days or weeks later and run \fBdupemap report\fR. This
gives you a view of what \fIhas not\fR changed. To see what \fIhas\fR changed,
you need a list of the whole file system; you can easily get this list while
preparing a new map. Both lists need to be sorted before they can be compared.
.PP
.Vb 2
\& $ dupemap \-d old.map report /|sort > unchanged_files
\& $ dupemap \-d current.map scan /|sort > current_files
.Ve
.PP
All that's left is to compare these files and prepare for next week. The last
command below assumes that the dbm library appends the \f(CW\*(C`.db\*(C'\fR
extension to database files.
.PP
.Vb 2
\& $ diff unchanged_files current_files > changed_files
\& $ mv current.map.db old.map.db
.Ve
.SH "DATABASE"
.IX Header "DATABASE"
The actual database file(s) written by \fBdupemap\fR will have some relation to
the \fIdatabase\fR argument, but most implementations append an extension. For
example, Berkeley \s-1DB\s0 names the file \fIdatabase\fR\fB.db\fR, while
Solaris and \s-1GDBM\s0 create both a \fIdatabase\fR\fB.dir\fR and a
\&\fIdatabase\fR\fB.pag\fR file.
.PP
\&\fBdupemap\fR depends on a database library for storing the checksums. It
currently requires the \fBndbm\fR library, which \s-1POSIX\s0 standardizes as
part of the \s-1XSI\s0 extension, so it must be present on XSI-compliant
UNIXes. Implementations are not required to handle hash key collisions, and a
failure to do so could make \fBdupemap\fR delete too many files. I haven't
heard of such an implementation, though.
.PP
The current checksum algorithm is the file's \s-1CRC32\s0 combined with its
size. Both values are stored in native byte order, and because of varying type
sizes the database is \fInot\fR portable across architectures, compilers and
operating systems.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
\&\fBmagicrescue\fR(1), \fBweeder\fR(1)
.PP
This tool does the same thing \fBweeder\fR does, except that \fBweeder\fR does
not seem to handle large numbers of files without crashing, and it has no
largefile support.
.SH "BUGS"
.IX Header "BUGS"
There is a tiny chance that two different files can have the same checksum and
size. The probability of this happening is around 1 in 10^14, and since
\&\fBdupemap\fR is part of the Magic Rescue package, which deals with disaster
recovery, that chance becomes an insignificant part of the game. You should
consider this, however, before applying \fBdupemap\fR to other tasks,
especially security-related ones (see the next paragraph).
.PP
It is possible to craft a file to have a known \s-1CRC32.\s0 You need to keep
this in mind if you use \fBdupemap\fR on untrusted data. A solution could be to
implement an option for using \s-1MD5\s0 checksums instead.
.SH "AUTHOR"
.IX Header "AUTHOR"
Jonas Jensen
.SH "LATEST VERSION"
.IX Header "LATEST VERSION"
This tool is part of Magic Rescue. You can find the latest version at