.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.43) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{\ . if \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . 
ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "pg_comparator 1" .TH pg_comparator 1 "2023-09-16" "perl v5.36.0" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. 
.if n .ad l .nh .SH "NAME" pg_comparator \- efficient table content comparison and synchronization .SH "SYNOPSIS" .IX Header "SYNOPSIS" \&\fBpg_comparator\fR [options as \fB\-\-help\fR \fB\-\-option\fR \fB\-\-man\fR] conn1 conn2 .SH "DESCRIPTION" .IX Header "DESCRIPTION" This script performs a network- and time-efficient comparison or synchronization of two possibly large tables in \fBPostgreSQL\fR, \fBMySQL\fR or \fBSQLite\fR databases, so as to detect inserted, updated or deleted tuples between these tables. The algorithm is efficient especially if the expected differences are relatively small. .PP The implementation is quite generic: multi-column keys (but there must be a key!), no assumption of data types other than that they can be cast to text, subset of columns can be used for the comparison, handling of \s-1NULL\s0 values... .PP This script focuses on the comparison algorithm, hence the many options. The fact that it may do anything useful, such as checking that a replication tool does indeed replicate your data, or such as synchronizing tables, is a mere side effect. .SH "OPTIONS" .IX Header "OPTIONS" Options allow requesting help or adjusting some internal parameters. Short one-letter options are also available, usually with the first letter of the option name. .ie n .IP """\-\-aggregate=(sum|xor)"" or ""\-a (sum|xor)""" 4 .el .IP "\f(CW\-\-aggregate=(sum|xor)\fR or \f(CW\-a (sum|xor)\fR" 4 .IX Item "--aggregate=(sum|xor) or -a (sum|xor)" Aggregation function to be used for summaries, either \fBxor\fR or \fBsum\fR. It must operate on the result of the checksum function. For PostgreSQL and SQLite, the \fBxor\fR aggregate needs to be loaded. There is a signed/unsigned issue on the key hash when using \fBxor\fR for comparing tables on MySQL or SQLite vs PostgreSQL. We provide a new \f(CW\*(C`ISUM\*(C'\fR aggregate for SQLite because both \f(CW\*(C`SUM\*(C'\fR and \f(CW\*(C`TOTAL\*(C'\fR do some incompatible handling of integer overflows. 
.Sp Default is \fBsum\fR because it is available by default and works in mixed mode. .ie n .IP """\-\-ask\-pass""" 4 .el .IP "\f(CW\-\-ask\-pass\fR" 4 .IX Item "--ask-pass" Ask for passwords interactively. See also \f(CW\*(C`\-\-env\-pass\*(C'\fR option below. .Sp Default is not to ask for passwords. .ie n .IP """\-\-asynchronous"" or ""\-A"", ""\-\-no\-asynchronous"" or ""\-X""" 4 .el .IP "\f(CW\-\-asynchronous\fR or \f(CW\-A\fR, \f(CW\-\-no\-asynchronous\fR or \f(CW\-X\fR" 4 .IX Item "--asynchronous or -A, --no-asynchronous or -X" Whether to run asynchronous queries. This provides some parallelism, however the two connections are more or less synchronized per query. .Sp Default is to use asynchronous queries to enable some parallelism. .ie n .IP """\-\-checksum\-computation=(create|insert)"" or ""\-\-cc=...""" 4 .el .IP "\f(CW\-\-checksum\-computation=(create|insert)\fR or \f(CW\-\-cc=...\fR" 4 .IX Item "--checksum-computation=(create|insert) or --cc=..." How to create the checksum table. Use \fBcreate\fR to use a \f(CW\*(C`CREATE ... AS SELECT ...\*(C'\fR query, or \fBinsert\fR to use a \f(CW\*(C`CREATE ...; INSERT ... SELECT ...\*(C'\fR query. The former will require an additional counting to get the table size, so in the end there are two queries anyway. There is a type size issue with the \fBinsert\fR strategy on MySQL, the cumulated key string length must be under 64 bytes. .Sp Default is \fBcreate\fR because it always works for both databases. .ie n .IP """\-\-checksum\-function=fun"" or ""\-\-cf=fun"" or ""\-c fun""" 4 .el .IP "\f(CW\-\-checksum\-function=fun\fR or \f(CW\-\-cf=fun\fR or \f(CW\-c fun\fR" 4 .IX Item "--checksum-function=fun or --cf=fun or -c fun" Checksum function to use, either \fBck\fR, \fBfnv\fR or \fBmd5\fR. For PostgreSQL, MySQL and SQLite the provided \fBck\fR and \fBfnv\fR checksum functions must be loaded into the target databases. 
Choosing \fBmd5\fR does not come free either: the provided cast functions must be loaded into the target databases and the computation is more expensive. .Sp Default is \fBck\fR, which is fast, especially if the operation is CPU-bound and the bandwidth is reasonably high. .ie n .IP """\-\-checksum\-size=n"" or ""\-\-check\-size=n"" or ""\-\-cs=n"" or ""\-z n""" 4 .el .IP "\f(CW\-\-checksum\-size=n\fR or \f(CW\-\-check\-size=n\fR or \f(CW\-\-cs=n\fR or \f(CW\-z n\fR" 4 .IX Item "--checksum-size=n or --check-size=n or --cs=n or -z n" Tuple checksum size, must be \fB2\fR, \fB4\fR or \fB8\fR bytes. The key checksum is always 4 bytes long. .Sp Default is \fB8\fR, so that the false negative probability is very low. There should be no reason to change that. .ie n .IP """\-\-cleanup""" 4 .el .IP "\f(CW\-\-cleanup\fR" 4 .IX Item "--cleanup" Drop checksum and summary tables beforehand. Useful after a run with \f(CW\*(C`\-\-no\-temp\*(C'\fR and \f(CW\*(C`\-\-no\-clear\*(C'\fR, typically used for debugging. .Sp Default is not to drop because it is not needed. .ie n .IP """\-\-clear""" 4 .el .IP "\f(CW\-\-clear\fR" 4 .IX Item "--clear" Drop checksum and summary tables explicitly after the computation. Note that they are dropped implicitly by default when the connection is closed as they are temporary, see the \f(CW\*(C`\-\-(no\-)temporary\*(C'\fR option. This option is useful for debugging. .Sp Default is \fBnot\fR to explicitly clear the checksum and summary tables, as it is not needed. .ie n .IP """\-\-debug"" or ""\-d""" 4 .el .IP "\f(CW\-\-debug\fR or \f(CW\-d\fR" 4 .IX Item "--debug or -d" Set debug mode. Repeat for higher debug levels. See also \f(CW\*(C`\-\-verbose\*(C'\fR. Beware that some safeguards about option settings are skipped under debug so as to allow testing under different conditions. .Sp Default is not to run in debug mode. 
.ie n .IP """\-\-env\-pass=\*(Aqvar\*(Aq""" 4 .el .IP "\f(CW\-\-env\-pass=\*(Aqvar\*(Aq\fR" 4 .IX Item "--env-pass=var" Take password from environment variables \f(CW\*(C`var1\*(C'\fR, \f(CW\*(C`var2\*(C'\fR or \f(CW\*(C`var\*(C'\fR for connection one, two, or both. This is tried before asking interactively if \f(CW\*(C`\-\-ask\-pass\*(C'\fR is also set. .Sp Default is not to look for passwords from environment variables. .ie n .IP """\-\-expect n"" or ""\-e n""" 4 .el .IP "\f(CW\-\-expect n\fR or \f(CW\-e n\fR" 4 .IX Item "--expect n or -e n" Total number of differences to expect (updates, deletes and inserts). This option is only used for non regression tests. See the \s-1TESTS\s0 section. .ie n .IP """\-\-folding\-factor=7"" or ""\-f 7""" 4 .el .IP "\f(CW\-\-folding\-factor=7\fR or \f(CW\-f 7\fR" 4 .IX Item "--folding-factor=7 or -f 7" Folding factor: log2 of the number of rows grouped together at each stage, starting from the leaves so that the first round always groups as many records as possible. The power of two allows one to use masked computations. The minimum value of 1 builds a binary tree. .Sp Default folding factor log2 is \fB7\fR, i.e. size 128 folds. This default value was chosen after some basic tests on medium-size cases with medium or low bandwidth. Values from 4 to 8 should be a reasonable choice for most settings. .ie n .IP """\-\-help"" or ""\-h""" 4 .el .IP "\f(CW\-\-help\fR or \f(CW\-h\fR" 4 .IX Item "--help or -h" Show short help. .ie n .IP """\-\-key\-checksum=\*(Aqkcs\*(Aq"" or ""\-\-kcs=...""" 4 .el .IP "\f(CW\-\-key\-checksum=\*(Aqkcs\*(Aq\fR or \f(CW\-\-kcs=...\fR" 4 .IX Item "--key-checksum=kcs or --kcs=..." Use key checksum attribute of this name, which must be already available in the tables to compare. This option also requires option \f(CW\*(C`\-\-tuple\-checksum\*(C'\fR. See also the \s-1EXAMPLES\s0 section below for how to set a checksum trigger. 
Consider \f(CW\*(C`\-\-use\-key\*(C'\fR instead if you already have a reasonably distributed integer primary key. .Sp Default is to build both key and tuple checksums on the fly. .ie n .IP """\-\-lock"", ""\-\-no\-lock""" 4 .el .IP "\f(CW\-\-lock\fR, \f(CW\-\-no\-lock\fR" 4 .IX Item "--lock, --no-lock" Whether to lock tables. Setting the option explicitly overrides the default one way or another. For PostgreSQL, this option requires \f(CW\*(C`\-\-transaction\*(C'\fR, which is enabled by default. .Sp Default depends on the current operation: the table is \fInot locked\fR for a comparison, but it is \fIlocked\fR for a synchronization. .ie n .IP """\-\-long\-read\-len=0"" or ""\-L 0""" 4 .el .IP "\f(CW\-\-long\-read\-len=0\fR or \f(CW\-L 0\fR" 4 .IX Item "--long-read-len=0 or -L 0" Set max size for fetched binary large objects. Well, it seems to be ignored at least by the PostgreSQL driver. .Sp Default is to keep the default value set by the driver. .ie n .IP """\-\-man"" or ""\-m""" 4 .el .IP "\f(CW\-\-man\fR or \f(CW\-m\fR" 4 .IX Item "--man or -m" Show manual page interactively in the terminal. .ie n .IP """\-\-max\-ratio=0.1""" 4 .el .IP "\f(CW\-\-max\-ratio=0.1\fR" 4 .IX Item "--max-ratio=0.1" Maximum relative search effort. The search is stopped if the number of results is above this threshold expressed relatively to the table size. Use 2.0 for no limit (all tuples were deleted and new ones are inserted). .Sp Default is \fB0.1\fR, i.e. an overall 10% difference is allowed before giving up. .ie n .IP """\-\-max\-report=n""" 4 .el .IP "\f(CW\-\-max\-report=n\fR" 4 .IX Item "--max-report=n" Maximum absolute search effort. The search is stopped if the number of differences goes beyond this threshold. If set, the previous \f(CW\*(C`\-\-max\-ratio\*(C'\fR option is ignored, otherwise the effort is computed with the ratio once the table size is known. 
.Sp Default is to compute the maximum number of reported differences based on the \f(CW\*(C`\-\-max\-ratio\*(C'\fR option, with a minimum of 100 differences allowed. .ie n .IP """\-\-max\-levels=0""" 4 .el .IP "\f(CW\-\-max\-levels=0\fR" 4 .IX Item "--max-levels=0" Maximum number of levels used. Allows one to cut off folding. 0 means no cut-off. Setting a value of 1 would only use the checksum table, without summaries. A value of 3 or 4 would be reasonable, as the last levels of the tree are nice for the theoretical complexity formula, but do not improve performance in practice. .Sp Default is \fB0\fR. .ie n .IP """\-\-null=\*(Aqtext\*(Aq""" 4 .el .IP "\f(CW\-\-null=\*(Aqtext\*(Aq\fR" 4 .IX Item "--null=text" How to handle \s-1NULL\s0 values. Either \fBhash\fR to hash all values, where \s-1NULL\s0 has one special hash value, or \fBtext\fR where \s-1NULL\s0 values are substituted by the \f(CW\*(C`NULL\*(C'\fR string. .Sp Default is \fBtext\fR because it is faster. .ie n .IP """\-\-option"" or ""\-o""" 4 .el .IP "\f(CW\-\-option\fR or \f(CW\-o\fR" 4 .IX Item "--option or -o" Show option summary. .ie n .IP """\-\-pg\-text\-cast""" 4 .el .IP "\f(CW\-\-pg\-text\-cast\fR" 4 .IX Item "--pg-text-cast" With PostgreSQL add explicit \s-1TEXT\s0 casts to work around some typing issues. .ie n .IP """\-\-pg\-copy=128""" 4 .el .IP "\f(CW\-\-pg\-copy=128\fR" 4 .IX Item "--pg-copy=128" Experimental option to use PostgreSQL's \s-1COPY\s0 instead of \s-1INSERT/UPDATE\s0 when synchronizing, by chunks of the specified size. .ie n .IP """\-\-prefix=\*(Aqpgc_cmp\*(Aq""" 4 .el .IP "\f(CW\-\-prefix=\*(Aqpgc_cmp\*(Aq\fR" 4 .IX Item "--prefix=pgc_cmp" Name prefix, possibly schema qualified, used for generated comparison tables by appending numbers to it. Consider changing the prefix if you expect several comparisons to run concurrently against the same database. .Sp Default is \f(CW\*(C`pgc_cmp\*(C'\fR. 
Checksum tables are named \f(CW\*(C`pgc_cmp_1_0\*(C'\fR and \&\f(CW\*(C`pgc_cmp_2_0\*(C'\fR, and summary tables are named by increasing the last number. .ie n .IP """\-\-report"", ""\-\-no\-report""" 4 .el .IP "\f(CW\-\-report\fR, \f(CW\-\-no\-report\fR" 4 .IX Item "--report, --no-report" Report differing keys to stdout as they are found. .Sp Default is to report. .ie n .IP """\-\-separator=\*(Aq|\*(Aq"" or ""\-s \*(Aq|\*(Aq""" 4 .el .IP "\f(CW\-\-separator=\*(Aq|\*(Aq\fR or \f(CW\-s \*(Aq|\*(Aq\fR" 4 .IX Item "--separator=| or -s |" Separator string or character used when concatenating key columns for computing checksums. .Sp Defaults to the pipe '|' character. .ie n .IP """\-\-size=n""" 4 .el .IP "\f(CW\-\-size=n\fR" 4 .IX Item "--size=n" Assume this value as the table size. It is sufficient for the algorithm to perform well that this size is in the order of magnitude of the actual table size. .Sp Default is to query the table sizes, which is skipped if this option is set. .ie n .IP """\-\-source\-1=\*(AqDBI:...\*(Aq"", ""\-\-source\-2=\*(Aq...\*(Aq"" or ""\-1 \*(Aq...\*(Aq"", ""\-2 \*(Aq...\*(Aq""" 4 .el .IP "\f(CW\-\-source\-1=\*(AqDBI:...\*(Aq\fR, \f(CW\-\-source\-2=\*(Aq...\*(Aq\fR or \f(CW\-1 \*(Aq...\*(Aq\fR, \f(CW\-2 \*(Aq...\*(Aq\fR" 4 .IX Item "--source-1=DBI:..., --source-2=... or -1 ..., -2 ..." Take full control of \s-1DBI\s0 data source specification and mostly ignore the comparison authentication part of the source or target URLs. One can connect with \*(L"DBI:Pg:service=backup\*(R", use an alternate driver, set any option allowed by the driver... See \f(CW\*(C`DBD::Pg\*(C'\fR and \f(CW\*(C`DBD::mysql\*(C'\fR manuals for the various options that can be set through the \s-1DBI\s0 data source specification. However, the database server specified in the \s-1URL\s0 must be consistent with this source specification so that the queries' syntax is the right one. .Sp Default is to rely on the two \s-1URL\s0 arguments. 
.ie n .IP """\-\-skip\-inserts"", ""\-\-skip\-updates"", ""\-\-skip\-deletes""" 4 .el .IP "\f(CW\-\-skip\-inserts\fR, \f(CW\-\-skip\-updates\fR, \f(CW\-\-skip\-deletes\fR" 4 .IX Item "--skip-inserts, --skip-updates, --skip-deletes" When synchronizing, do not perform these operations. .Sp Default under \f(CW\*(C`\-\-synchronize\*(C'\fR is to do all operations. .ie n .IP """\-\-stats=(txt|csv)""" 4 .el .IP "\f(CW\-\-stats=(txt|csv)\fR" 4 .IX Item "--stats=(txt|csv)" Show various statistics about the comparison performed in this format. Also, option \f(CW\*(C`\-\-stats\-name\*(C'\fR gives the test a name, useful to generate csv files that will be processed automatically. .Sp Default is \fBnot\fR to show statistics, because it requires additional synchronizations and is not necessarily interesting to the user. .ie n .IP """\-\-synchronize"" or ""\-S""" 4 .el .IP "\f(CW\-\-synchronize\fR or \f(CW\-S\fR" 4 .IX Item "--synchronize or -S" Actually perform operations to synchronize the second table wrt the first. Well, not really, it is only a dry run. It is actually done if you add \&\f(CW\*(C`\-\-do\-it\*(C'\fR or \f(CW\*(C`\-D\*(C'\fR. Save your data before attempting anything like that! .Sp Default is not to synchronize. .ie n .IP """\-\-temporary"", ""\-\-no\-temporary""" 4 .el .IP "\f(CW\-\-temporary\fR, \f(CW\-\-no\-temporary\fR" 4 .IX Item "--temporary, --no-temporary" Whether to use temporary tables. If you don't, the tables are kept by default at the end, so they will have to be deleted by hand. See \f(CW\*(C`\-\-clear\*(C'\fR option to request a cleanup. This option is useful for debugging. .Sp Default is to use temporary tables that are automatically wiped out when the connection is closed. .ie n .IP """\-\-unlogged"", ""\-\-no\-unlogged""" 4 .el .IP "\f(CW\-\-unlogged\fR, \f(CW\-\-no\-unlogged\fR" 4 .IX Item "--unlogged, --no-unlogged" Use unlogged tables for storing checksums. These tables are not transactional, so it may speed up things a little. 
However, they are not automatically cleaned up at the end. See \f(CW\*(C`\-\-clear\*(C'\fR option to request a cleanup. .Sp Default is not to use unlogged tables. .ie n .IP """\-\-threads"" or ""\-T"", ""\-\-no\-threads"" or ""\-N""" 4 .el .IP "\f(CW\-\-threads\fR or \f(CW\-T\fR, \f(CW\-\-no\-threads\fR or \f(CW\-N\fR" 4 .IX Item "--threads or -T, --no-threads or -N" Highly \s-1EXPERIMENTAL\s0 feature. .Sp Try to use threads to perform computations in parallel, with some hocus-pocus because the Perl thread model does not really work well with \s-1DBI.\s0 Perl threads are rather heavy and slow, more like communicating processes than lightweight threads, really. .Sp This does \s-1NOT\s0 work at all with PostgreSQL. It works partially with MySQL, at the price of turning off \f(CW\*(C`\-\-transaction\*(C'\fR. .Sp Default is \fBnot\fR to use threads, as it does not work for all databases. .ie n .IP """\-\-timeout n""" 4 .el .IP "\f(CW\-\-timeout n\fR" 4 .IX Item "--timeout n" Timeout comparison after \f(CW\*(C`n\*(C'\fR seconds. .Sp Default is no timeout. Be patient. .ie n .IP """\-\-transaction"", ""\-\-no\-transaction""" 4 .el .IP "\f(CW\-\-transaction\fR, \f(CW\-\-no\-transaction\fR" 4 .IX Item "--transaction, --no-transaction" Whether to wrap the whole algorithm in a single transaction. .Sp Default is to use a wrapping transaction, as it seems to be both faster and safer to do so. .ie n .IP """\-\-tuple\-checksum=\*(Aqtcs\*(Aq"" or ""\-\-tcs=...""" 4 .el .IP "\f(CW\-\-tuple\-checksum=\*(Aqtcs\*(Aq\fR or \f(CW\-\-tcs=...\fR" 4 .IX Item "--tuple-checksum=tcs or --tcs=..." Use tuple checksum attribute of this name, which must be already available in the tables to compare. This option also requires setting either \f(CW\*(C`\-\-use\-key\*(C'\fR or \f(CW\*(C`\-\-key\-checksum=...\*(C'\fR above. The provided checksum attributes must not appear in the lists of key and value columns. See also the \s-1EXAMPLES\s0 section below for how to set a checksum trigger. 
.Sp Default is to build both key and tuple checksums on the fly. .ie n .IP """\-\-use\-key"" or ""\-u""" 4 .el .IP "\f(CW\-\-use\-key\fR or \f(CW\-u\fR" 4 .IX Item "--use-key or -u" Whether to directly use the value of the key to distribute tuples among branches. The key must be simple, integer, not \s-1NULL,\s0 and evenly distributed. If you have a reasonably spread integer primary key, consider using this option to avoid half of the checksum table hash computations. .Sp Default is to hash the key, so as to handle any type, composition and distribution. .ie n .IP """\-\-use\-null"", ""\-\-no\-use\-null""" 4 .el .IP "\f(CW\-\-use\-null\fR, \f(CW\-\-no\-use\-null\fR" 4 .IX Item "--use-null, --no-use-null" Whether to use the information that a column is declared \s-1NOT NULL\s0 to simplify computations by avoiding calls to \s-1COALESCE\s0 to handle \s-1NULL\s0 values. .Sp Default is to use this information, at the price of querying table metadata. .ie n .IP """\-\-verbose"" or ""\-v""" 4 .el .IP "\f(CW\-\-verbose\fR or \f(CW\-v\fR" 4 .IX Item "--verbose or -v" Be verbose about what is happening. The more you ask, the more verbose. .Sp Default is to be quiet, so that possible warnings or errors stand out. .ie n .IP """\-\-version"" or ""\-V""" 4 .el .IP "\f(CW\-\-version\fR or \f(CW\-V\fR" 4 .IX Item "--version or -V" Show version information and exit. .ie n .IP """\-\-where=...""" 4 .el .IP "\f(CW\-\-where=...\fR" 4 .IX Item "--where=..." \&\s-1SQL\s0 boolean condition on table tuples for partial comparison. Useful to reduce the load if you know that expected differences are in some parts of your data, say those time-stamped today... The same condition is passed on both sides, so both tables must be pretty similar so that it works. This is usually the case. .Sp Default is to compare whole tables. .SH "ARGUMENTS" .IX Header "ARGUMENTS" The two arguments describe database connections with the following URL-like syntax, where square brackets denote optional parts. 
Many parts are optional with a default. The minimum syntactically correct specification is \f(CW\*(C`/\*(C'\fR, but that does not necessarily mean anything useful. .PP .Vb 1 \& [driver://][login[:pass]@][host][:port]/[base/[[schema.]table[?key[:cols]]]] .Ve .PP See the \s-1EXAMPLES\s0 section below, and also the \f(CW\*(C`\-\-source\-*\*(C'\fR options above. .PP Note that some default values used by \s-1DBI\s0 drivers may be changed with driver-specific environment variables, and that \s-1DBI\s0 also provides its own defaults and overrides, so what actually happens may not always be clear. Default values for the second \s-1URL\s0 are mostly taken from the first \s-1URL.\s0 .IP "\fBdriver\fR" 4 .IX Item "driver" Database driver to use. Use \fBpgsql\fR for PostgreSQL, \fBmysql\fR for MySQL, \fBsqlite\fR for SQLite. Heterogeneous databases may be compared and synchronized, however beware that subtle typing, encoding and casting issues may prevent heterogeneous comparisons or synchronizations from succeeding. Default is \fBpgsql\fR for the first connection, and same as first for second. .Sp For SQLite, the authentication part of the \s-1URL\s0 (login, pass, host, port) is expected to be empty, thus the full \s-1URL\s0 should look like: .Sp .Vb 1 \& sqlite:///base.db/table?key,col:other,columns .Ve .Sp Moreover, setting the \s-1PGC_SQLITE_LOAD_EXTENSION\s0 environment variable with \&\f(CW\*(C`:\*(C'\fR\-separated shared object files loads these into SQLite. .IP "\fBlogin\fR" 4 .IX Item "login" Login to use when connecting to database. Default is username for first connection, and same as first connection for second. .IP "\fBpass\fR" 4 .IX Item "pass" Password to use when connecting to database. Note that it is a bad idea to put a password as a command argument. Default is none for the first connection, and the same password as the first connection for the second \fIif\fR the connection targets the same host, port and uses the same login. 
See also \f(CW\*(C`\-\-ask\-pass\*(C'\fR and \f(CW\*(C`\-\-env\-pass\*(C'\fR options. .IP "\fBhost\fR" 4 .IX Item "host" Hostname or \s-1IP\s0 to connect to. Default is the empty string, which means connecting to the database on localhost with a \s-1UNIX\s0 socket. .IP "\fBport\fR" 4 .IX Item "port" TCP-IP port to connect to. Default is 5432 for PostgreSQL and 3306 for MySQL. .IP "\fBbase\fR" 4 .IX Item "base" Database catalog to connect to. Default is username for first connection. Default is same as first connection for second connection. For SQLite, provide the database file name. The path is relative by default, but can be made absolute by prepending an additional '/': .Sp .Vb 1 \& sqlite:////var/cache/sqlite/base.db/table?... .Ve .IP "\fBschema.table\fR" 4 .IX Item "schema.table" The possibly schema-qualified table to use for comparison. No default for first connection. Default is same as first connection for second connection. .Sp Note that MySQL does not have \fIschemas\fR, so the schema part must be empty. However, strangely enough, their \fIdatabase\fR concept is just like a \&\fIschema\fR, so one could say that MySQL really does not have \fIdatabases\fR, although there is something of that name. Am I clear? .IP "\fBkeys\fR" 4 .IX Item "keys" Comma-separated list of key columns. Default is table primary key for first connection. Default is same as first connection for second connection. The key \fBcannot\fR be empty. If you do not have a way of identifying your tuples, then there is no point in looking for differences. .IP "\fBcols\fR" 4 .IX Item "cols" Comma-separated list of columns to compare. May be empty. Default is all columns but \fBkeys\fR for first connection. Default is same as first connection for second connection. Beware that \f(CW\*(C`...?key:\*(C'\fR means an empty cols, while \f(CW\*(C`...?key\*(C'\fR sets the default by querying table metadata. 
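The connection syntax described above can be illustrated with a small parser. This is only a sketch in Python, not the code pg_comparator itself uses: the names CONN_RE and parse_conn are hypothetical, the regular expression is a direct transcription of the bracketed grammar, and it deliberately does not handle absolute SQLite paths (the form with four slashes). Defaults such as "same as first connection" are not applied; missing parts are simply returned as None.

```python
import re

# Hypothetical transcription of:
# [driver://][login[:pass]@][host][:port]/[base/[[schema.]table[?key[:cols]]]]
CONN_RE = re.compile(
    r"""^
    (?:(?P<driver>\w+)://)?                              # optional driver
    (?:(?P<login>[^:@/]+)(?::(?P<password>[^@/]*))?@)?   # optional login[:pass]@
    (?P<host>[^:/]*)                                     # host, may be empty
    (?::(?P<port>\d+))?                                  # optional :port
    /
    (?:(?P<base>[^/?]*)/
       (?:(?:(?P<schema>[^./?]+)\.)?(?P<table>[^./?]+)
          (?:\?(?P<keys>[^:]*)(?::(?P<cols>.*))?)?
       )?
    )?
    $""",
    re.VERBOSE,
)


def parse_conn(spec):
    """Split a connection argument into its parts (None means 'use default')."""
    match = CONN_RE.match(spec)
    if match is None:
        raise ValueError(f"not a valid connection specification: {spec!r}")
    parts = match.groupdict()
    if parts["port"] is not None:
        parts["port"] = int(parts["port"])
    # '?key:' yields an explicit empty column list, '?key' leaves cols as None.
    for name in ("keys", "cols"):
        if parts[name] is not None:
            parts[name] = [col for col in parts[name].split(",") if col]
    return parts
```

For instance, parse_conn("/family/calvin?id:c1,c2") yields base "family", table "calvin", keys ["id"] and cols ["c1", "c2"], with every other part left empty or None so that a caller could fill in the defaults documented above.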
.SH "EXAMPLES" .IX Header "EXAMPLES" Compare tables calvin and hobbes in database family on localhost, with key \fIid\fR and columns \fIc1\fR and \fIc2\fR: .PP .Vb 1 \& ./pg_comparator /family/calvin?id:c1,c2 /family/hobbes .Ve .PP Compare tables calvin in default database on localhost and the same table in default database on sablons, with key \fIid\fR and column \fIdata\fR: .PP .Vb 1 \& ./pg_comparator localhost/family/calvin?id:data sablons/ .Ve .PP Synchronize \f(CW\*(C`user\*(C'\fR table in database \f(CW\*(C`wikipedia\*(C'\fR from MySQL on \&\f(CW\*(C`server1\*(C'\fR to PostgreSQL on \f(CW\*(C`server2\*(C'\fR. .PP .Vb 2 \& ./pg_comparator \-S \-D \-\-ask\-pass \e \& mysql://calvin@server1/wikipedia/user pgsql://hobbes@server2/ .Ve .PP For PostgreSQL, you may add trigger-maintained key and tuple checksums as: .PP .Vb 10 \& \-\- TABLE Foo(id SERIAL PRIMARY KEY, data ... NOT NULL); \& \-\- add a key and tuple checksum attributes \& \-\- the key checksum can be skipped if you use \-\-use\-key, \& \-\- for which the key must be a simple NOT NULL integer. 
\& ALTER TABLE Foo \& ADD COLUMN key_cs INT4 NOT NULL DEFAULT 0, \& ADD COLUMN tup_cs INT8 NOT NULL DEFAULT 0; \& \-\- function to update the tuple checksum \& \-\- if some attributes may be NULL, they must be coalesced \& CREATE FUNCTION foo_cs() RETURNS TRIGGER AS $$ \& BEGIN \& \-\- compute key checksum \& NEW.key_cs = cksum4(NEW.id); \& \-\- compute tuple checksum \& NEW.tup_cs = cksum8(NEW.id || \*(Aq|\*(Aq || NEW.data); \& RETURN NEW; \& END; $$ LANGUAGE plpgsql; \& \-\- set trigger to call the checksum update function \& CREATE TRIGGER foo_cs_trigger \& BEFORE UPDATE OR INSERT ON Foo \& FOR EACH ROW EXECUTE PROCEDURE foo_cs(); \& \-\- if table Foo is not initially empty, \& \-\- update its contents to trigger checksum computations \& UPDATE Foo SET id=id; .Ve .PP Then a fast comparison, which does not need to compute the initial checksum table, can be requested with: .PP .Vb 2 \& ./pg_comparator \-\-tcs=tup_cs \-\-kcs=key_cs \e \& admin@server1/app/Foo?id:data hobbes@server2/ .Ve .PP As the primary key is a simple integer, the \fIkey_cs\fR could be left out and the comparison could be launched with: .PP .Vb 2 \& ./pg_comparator \-\-tcs=tup_cs \-\-use\-key \e \& admin@server1/app/Foo?id:data hobbes@server2/ .Ve .SH "OUTPUT" .IX Header "OUTPUT" The output of the command consists of lines describing the differences found between the two tables. They are expressed in terms of insertions, updates or deletes and of tuple keys. .IP "\fB\s-1UPDATE\s0 k\fR" 4 .IX Item "UPDATE k" Key \fIk\fR tuple is updated from table 1 to table 2. It exists in both tables with different values. .IP "\fB\s-1INSERT\s0 k\fR" 4 .IX Item "INSERT k" Key \fIk\fR tuple does not appear in table 2, but only in table 1. It must be inserted in table 2 to synchronize it wrt table 1. .IP "\fB\s-1DELETE\s0 k\fR" 4 .IX Item "DELETE k" Key \fIk\fR tuple appears in table 2, but not in table 1. It must be deleted from 2 to synchronize it wrt table 1. 
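A script consuming this report might group the lines by operation. The following Python sketch assumes that each report line consists of the operation word, a single space, and the key text, as suggested by the three forms listed above; the function name group_differences is hypothetical, and the exact rendering of composite keys is not specified here, so the key is kept as an opaque string.

```python
def group_differences(lines):
    """Group report lines such as 'UPDATE 12' by operation keyword."""
    ops = {"UPDATE": [], "INSERT": [], "DELETE": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first space only: the rest is the (opaque) key text.
        op, _, key = line.partition(" ")
        if op in ops:
            ops[op].append(key)
    return ops
```

Lines that do not start with one of the three keywords, for example verbose or statistics output, are ignored by this sketch rather than treated as errors.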
.PP In case of tuple checksum collisions, false negative results may occur. Changing the checksum function would help in such cases. See the \s-1ANALYSIS\s0 sub-section. .SH "INSTALL" .IX Header "INSTALL" This section describes how to install extensions (functions, casts, aggregates) needed by pg_comparator for the different target databases. .PP First, get the pg_comparator sources. .SS "PostgreSQL" .IX Subsection "PostgreSQL" For installing on PostgreSQL, you must ensure that the \f(CW\*(C`pg_config\*(C'\fR command found in your path is the one of the target PostgreSQL server, and that development packages are installed. .PP Then compile and install the extensions' shared objects: .PP .Vb 1 \& sh> make pgsql_install .Ve .PP To load the extension files into the target \f(CW\*(C`DB\*(C'\fR database, where \f(CW\*(C`...\*(C'\fR are the connection options: .PP .Vb 1 \& sh> psql ... \-c \*(AqCREATE EXTENSION pgcmp\*(Aq DB .Ve .PP To uninstall: .PP .Vb 2 \& sh> psql ... \-c \*(AqDROP EXTENSION pgcmp\*(Aq DB \& sh> make pgsql_uninstall .Ve .SS "MySQL" .IX Subsection "MySQL" For installing on MySQL, you must ensure that the \f(CW\*(C`mysql_config\*(C'\fR command found in your path is the one of the target MySQL server, and that development packages are installed. .PP Then compile and install the extensions' shared objects: .PP .Vb 1 \& sh> make mysql_install .Ve .PP And load the extension files into the database: .PP .Vb 2 \& sh> mysql ... < PATH\-TO\-EXTENSION/mysql_casts.sql \& sh> mysql ... < PATH\-TO\-EXTENSION/mysql_checksum.sql .Ve .PP See \f(CW\*(C`mysql_config \-\-plugindir\*(C'\fR for the extension directory path. On some systems \f(CW\*(C`PATH\-TO\-EXTENSION\*(C'\fR might be \f(CW\*(C`/usr/lib/mysql/contrib\*(C'\fR. .PP To uninstall: .PP .Vb 1 \& sh> make mysql_uninstall .Ve .SS "SQLite" .IX Subsection "SQLite" For installing with SQLite, the corresponding development package is needed. 
.PP First compile and install the extensions' shared objects (you may adjust \f(CW\*(C`SQLITE.libdir\*(C'\fR make variable to change the target directory, which is by default \f(CW\*(C`/usr/local/lib\*(C'\fR): .PP .Vb 1 \& sh> make sqlite_install .Ve .PP Then load the extension by executing (to do it always, you may append the line to your \f(CW\*(C`.sqliterc\*(C'\fR file): .PP .Vb 1 \& SELECT load_extension(\*(Aq/usr/local/lib/sqlite_checksum.so\*(Aq); .Ve .PP To uninstall: .PP .Vb 1 \& sh> make sqlite_uninstall .Ve .SH "DEPENDENCES" .IX Header "DEPENDENCES" Three support functions are needed on the database: .IP "1." 2 The \f(CW\*(C`COALESCE\*(C'\fR function takes care of \s-1NULL\s0 values in columns. .IP "2." 2 A checksum function must be used to reduce and distribute key and column values. It may be changed with the \f(CW\*(C`\-\-checksum\-function\*(C'\fR option. Its size can be selected with the \f(CW\*(C`\-\-checksum\-size\*(C'\fR option (currently 2, 4 or 8 bytes). The checksums also require casts to be converted to integers of various sizes. .Sp Suitable implementations are available for PostgreSQL and can be loaded into the server by processing \f(CW\*(C`share/contrib/pgc_checksum.sql\*(C'\fR and \&\f(CW\*(C`share/contrib/pgc_casts.sql\*(C'\fR. New checksums and casts are also available for MySQL, see \f(CW\*(C`mysql_*.sql\*(C'\fR. A loadable implementation of suitable checksum functions is also available for SQLite, see \f(CW\*(C`sqlite_checksum.*\*(C'\fR. .Sp The \f(CW\*(C`ck\*(C'\fR checksum is based on Jenkins hash, which relies on simple add, shift and xor integer operations. The \f(CW\*(C`fnv\*(C'\fR checksum is inspired by \&\s-1FNV\s0 hash (64 bits 1a version) which uses xor and mult integer operations, although I also added some shift and add to help tweak high bits. .IP "3." 2 An aggregate function is used to summarize checksums for a range of rows. It must operate on the result of the checksum function. 
It may be changed with the \f(CW\*(C`\-\-aggregate\*(C'\fR option. .Sp Suitable implementations of an exclusive-or \f(CW\*(C`xor\*(C'\fR aggregate are available for PostgreSQL and can be loaded into the server by processing \&\f(CW\*(C`share/contrib/xor_aggregate.sql\*(C'\fR. .Sp The \f(CW\*(C`sqlite_checksum.*\*(C'\fR file also provides \f(CW\*(C`xor\*(C'\fR and \f(CW\*(C`sum\*(C'\fR aggregates for SQLite that are compatible with other databases. .PP Moreover, several Perl modules are useful to run this script: .IP "\(bu" 4 \&\f(CW\*(C`Getopt::Long\*(C'\fR for option management. .IP "\(bu" 4 \&\f(CW\*(C`DBI\*(C'\fR, \&\f(CW\*(C`DBD::Pg\*(C'\fR to connect to PostgreSQL, \&\f(CW\*(C`DBD::mysql\*(C'\fR to connect to MySQL, and \f(CW\*(C`DBD::SQLite\*(C'\fR to connect to SQLite. .IP "\(bu" 4 \&\f(CW\*(C`Term::ReadPassword\*(C'\fR for the \f(CW\*(C`\-\-ask\-pass\*(C'\fR option. .IP "\(bu" 4 \&\f(CW\*(C`Pod::Usage\*(C'\fR for doc self-extraction (\f(CW\*(C`\-\-man\*(C'\fR \f(CW\*(C`\-\-opt\*(C'\fR \f(CW\*(C`\-\-help\*(C'\fR). .IP "\(bu" 4 \&\f(CW\*(C`threads\*(C'\fR for the experimental threaded version with option \f(CW\*(C`\-\-threads\*(C'\fR. .IP "\(bu" 4 \&\f(CW\*(C`Digest::MD5\*(C'\fR for the md5 checksum with SQLite. .PP Modules are only loaded by the script if they are actually required. .SH "ALGORITHM" .IX Header "ALGORITHM" The aim of the algorithm is to compare the content of two tables, possibly on different remote servers, with minimum network traffic. It is performed in three phases. .IP "1." 2 A checksum table is computed on each side for the target table. .IP "2." 2 A first-level summary table is computed on each side by aggregating chunks of the checksum table. Further levels of summary aggregation are then computed until there is only one row in the last table, which then stores a global checksum for the whole initial target table. .IP "3."
2 Starting from the upper summary tables, aggregated checksums are compared from both sides to look for differences, down to the initial checksum table. Keys of differing tuples are displayed. .SS "\s-1CHECKSUM TABLE\s0" .IX Subsection "CHECKSUM TABLE" The first phase computes the initial checksum table \fIT(0)\fR on each side. Assuming that \fIkey\fR denotes the key columns of the table and \fIcols\fR the data columns that are to be checked for differences, it is built by querying the target table \fIT\fR as follows: .PP .Vb 5 \& CREATE TABLE T(0) AS \& SELECT key AS pk, \-\- primary key \& checksum(key) AS kcs, \-\- key checksum \& checksum(key || cols) AS tcs \-\- tuple checksum \& FROM T; .Ve .PP The initial key is kept, as it will be used to show differing keys at the end. The rationale for the \fIkcs\fR column is to randomize the key-value distribution so as to balance aggregates in the next phase. The key must also appear in the tuple checksum, otherwise content exchanged between two keys would not be detected in some cases. .SS "\s-1SUMMARY TABLES\s0" .IX Subsection "SUMMARY TABLES" Now we compute a set of cascading summary tables by grouping \fIf\fR (folding factor) checksums together at each stage. The grouping is based on a mask on the \fIkcs\fR column to take advantage of the checksum randomization. Starting from \fIp=0\fR we build: .PP .Vb 5 \& CREATE TABLE T(p+1) AS \& SELECT kcs & mask(p+1) AS kcs, \-\- key checksum subset \& XOR(tcs) AS tcs \-\- tuple checksum summary \& FROM T(p) \& GROUP BY kcs & mask(p+1); .Ve .PP The mask(p) is defined so that it groups together \fIf\fR checksums on average: \fBmask\fR\|(0) = ceil2(size); mask(p) = mask(p\-1)/f. This leads to a hierarchy of tables, each one being a smaller summary of the previous one: .IP "level \fB0\fR" 4 .IX Item "level 0" checksum table, \fIsize\fR rows, i.e. as many rows as the target table. .IP "level \fB1\fR" 4 .IX Item "level 1" first summary table, (size/f) rows.
.IP "level \fBp\fR" 4 .IX Item "level p" intermediate summary table, (size/f**p) rows. .IP "level \fBn\-1\fR" 4 .IX Item "level n-1" next-to-last summary table, fewer than f rows. .IP "level \fBn\fR" 4 .IX Item "level n" last summary table, mask is 0, 1 row. .PP It is important that the very same masks are used on both sides so that aggregations are the same, which allows matching contents to be compared on both sides. .SS "\s-1SEARCH FOR DIFFERENCES\s0" .IX Subsection "SEARCH FOR DIFFERENCES" After all these support tables are built on both sides comes the search for differences. Comparing the checksum summaries of the last tables (level \fIn\fR), which hold only one row, basically compares the checksums of the whole table contents. If they match, then both tables are considered equal (barring checksum collisions), and we are done. Otherwise, if these checksums differ, some investigation is needed to detect offending keys. .PP The investigation is performed by going down the table hierarchy and looking for all \fIkcs\fR for which there was a difference in the checksum on the previous level. The same query is performed on both sides at each stage: .PP .Vb 4 \& SELECT kcs, tcs \& FROM T(p) \& WHERE kcs & mask(p+1) IN (kcs\-with\-diff\-checksums\-from\-level\-p+1) \& ORDER BY kcs [and on level 0: , id]; .Ve .PP The results from both sides are then merged together. During the merge procedure, four cases can arise: .IP "1." 2 Both \fIkcs\fR and \fItcs\fR match. Then there is no difference. .IP "2." 2 Although \fIkcs\fR does match, \fItcs\fR does not. Then this \fIkcs\fR is to be investigated at the next level, as the checksum summary differs. If we are already at the bottom level (the checksum table \fIT(0)\fR), then the offending key can be shown. .IP "3." 2 No \fIkcs\fR match, one supplemental \fIkcs\fR in the first side. Then this \fIkcs\fR corresponds to key(s) that must be inserted for syncing the second table with respect to the first. .IP "4." 2 No \fIkcs\fR match, one supplemental \fIkcs\fR in the second side.
Then this \fIkcs\fR corresponds to key(s) that must be deleted for syncing the second table with respect to the first. .PP Cases 3 and 4 are simply symmetrical, and it is only an interpretation to decide whether it is an insert or a delete, taking the first side as the reference. .SS "\s-1ANALYSIS\s0" .IX Subsection "ANALYSIS" Let \fIn\fR be the number of rows, \fIr\fR the row size, \fIf\fR the folding factor, \&\fIk\fR the number of differences to be detected, \fIc\fR the checksum size in bits, then the costs to identify differences and the error rate are: .IP "\fBnetwork volume\fR" 2 .IX Item "network volume" is better than \fIk*f*ceil(log(n)/log(f))*(c+log(n))\fR. The contents of \fIk\fR blocks of size \fIf\fR are transferred along the depth of the tree, and each block identifier is of size \fIlog(n)\fR and contains a checksum \fIc\fR. It is independent of \fIr\fR, and you want \fIk< T2.data \-\- UPDATE .Ve .SS "\s-1REFERENCES\s0" .IX Subsection "REFERENCES" A paper was presented at a conference about this tool and its algorithm: \&\fBRemote Comparison of Database Tables\fR by \fIFabien Coelho\fR, In Third International Conference on Advances in Databases, Knowledge, and Data Applications (\s-1DBKDA\s0), pp 23\-28, St Marteen, The Netherlands Antilles, January 2011. \&\s-1ISBN: 978\-1\-61208\-002\-4.\s0 Copyright \s-1IARIA 2011.\s0 Online at Think Mind. .PP The algorithm and script were inspired by \&\fBTaming the Distributed Database Problem: A Case Study Using MySQL\fR by \fIGiuseppe Maxia\fR in \fBSys Admin\fR vol 13 num 8, Aug 2004, pp 29\-40. See Perl Monks for details. In this paper, three algorithms are presented. The first one compares two tables with a checksum technique. The second one finds \s-1UPDATE\s0 or \s-1INSERT\s0 differences based on a 2\-level (checksum and summary) table hierarchy. The algorithm is asymmetrical, as different queries are performed on the two tables being compared.
It seems that the network traffic volume is in \fIk*(f+(n/f)+r)\fR, that it has a probabilistically-buggy merge procedure, and that it makes assumptions about the distribution of key values. The third algorithm looks for \s-1DELETE\s0 differences based on counting, with the implicit assumption that there are only such differences. .PP In contrast to this approach, our fully symmetrical algorithm implements all three tasks at once, to find \s-1UPDATE, DELETE\s0 and \s-1INSERT\s0 differences between the two tables. The idea of a hierarchy of checksum and summary levels is reused and generalized so as to reduce the algorithmic complexity. .PP From the implementation standpoint, the script is as parametric as possible with many options, and makes few assumptions about table structures, types and values. .SH "SEE ALSO" .IX Header "SEE ALSO" \&\fIMichael Nacos\fR made a robust implementation, pg51g, based on triggers. He also noted that although database contents are compared by the algorithm, the database schema differences can \fIalso\fR be detected by comparing system tables which describe them. .PP \&\fIBenjamin Mead Vandiver\fR's PhD Thesis \&\fBDetecting and Tolerating Byzantine Faults in Database Systems\fR, Massachusetts Institute of Technology, May 2008 (report number \s-1MIT\-CSAIL\-TR\-2008\-040\s0). There is an interesting discussion in Chapter 7, where experiments are presented with a Java/JDBC/MySQL implementation of two algorithms, including this one. .PP \&\fIBaron Schwartz\fR discusses comparison algorithms in an online post.
.PP Some more links: .IP "\(bu" 2 Adept \s-1SQL\s0 .IP "\(bu" 2 Altova Database Spy .IP "\(bu" 2 \&\s-1AUI\s0 Soft SQLMerger .IP "\(bu" 2 Clever Components dbcomparer .IP "\(bu" 2 Comparezilla .IP "\(bu" 2 Datanamic Datadiff .IP "\(bu" 2 \&\s-1DB\s0 Balance .IP "\(bu" 2 DBConvert .IP "\(bu" 2 DBSolo datacomp .IP "\(bu" 2 dbForge Data Compare .IP "\(bu" 2 DiffKit .IP "\(bu" 2 Percona Toolkit .IP "\(bu" 2 MySQL DBCompare .IP "\(bu" 2 \&\s-1SQL\s0 Server tablediff Utility .IP "\(bu" 2 Red Gate \s-1SQL\s0 Data Compare .IP "\(bu" 2 Spectral Core OmegaSync .IP "\(bu" 2 \&\s-1SQL\s0 Delta .IP "\(bu" 2 SQLite sqldiff .IP "\(bu" 2 AlfaAlfa \s-1SQL\s0 Server Comparison Tool .IP "\(bu" 2 SQLyog MySQL \s-1GUI\s0 .IP "\(bu" 2 xSQL Software Data Compare .SH "TESTS" .IX Header "TESTS" The paper reports numerous performance tests with PostgreSQL under various bandwidth constraints. .PP Moreover, non-regression tests are run over randomly generated tables when the software is upgraded: .IP "\fIsanity\fR \- about 30 seconds & 30 runs" 4 .IX Item "sanity - about 30 seconds & 30 runs" Run a comparison, synchronization & check for all database combinations and all working asynchronous queries and threading options. .IP "\fIfast\fR \- about 5 minutes & 360 runs" 4 .IX Item "fast - about 5 minutes & 360 runs" Run 12 tests similar to the previous one with varying options (number of key columns, number of value columns, aggregate function, checksum function, null handling, folding factor, table locking or not...).
.IP "\fIfeature\fR \- about 5 minutes & 171 or 477 runs" 4 .IX Item "feature - about 5 minutes & 171 or 477 runs" Test various features: \&\fIcc\fR for checksum computation strategies, \&\fIauto\fR for trigger-maintained checksums on PostgreSQL, \&\fIpgcopy\fR for PostgreSQL copy test, \&\fIempty\fR for corner cases with empty tables, \&\fIquote\fR for table quoting, \&\fIengine\fR for InnoDB vs MyISAM MySQL backends, \&\fIwidth\fR for large columns, \&\fInullkey\fR for possible \s-1NULL\s0 values in keys, \&\fIsqlite\fR for SQLite test, \&\fImylite\fR for SQLite/MySQL mixed mode with some restrictions, \&\fIpglite\fR for SQLite/PostgreSQL mixed mode with some restrictions. .IP "\fIrelease\fR \- about 20 minutes & 944 runs" 4 .IX Item "release - about 20 minutes & 944 runs" This is \fIfeature\fR with two table sizes, \fIfast\fR, and \fIcollisions\fR to test possible hash collisions. .IP "\fIhour\fR \- about 1 hour & 2880 runs" 4 .IX Item "hour - about 1 hour & 2880 runs" A combination of 8 \fIfast\fR validations with varying table sizes and difference ratios ranging from 0.1% to 99.9%. .IP "\fIfull\fR \- about 6 hours & 16128 runs... seldom run" 4 .IX Item "full - about 6 hours & 16128 runs... seldom run" A combinatorial test involving numerous options: aggregation, checksums, null handling, foldings, number of key and value attributes... .SH "BUGS" .IX Header "BUGS" All software has bugs. This is software, hence it has bugs. .PP Reporting bugs is good practice, so tell me if you find one. If you have a fix, this is even better! .PP The implementation does not do many sanity checks. .PP Although the algorithm can work with some normalized columns (say strings are trimmed, lowercased, Unicode normalized...), the implementation may not work at all. .PP The script is only really tested with integer and text types; issues may arise with other types. .PP The script handles one table at a time.
In order to synchronize several linked tables, you must disable referential integrity checks, then synchronize each table, then re-enable the checks. .PP There is no real attempt at doing some sensible identifier quoting, although quotes provided in the connection \s-1URL\s0 are kept, so it may work after all for simple enough cases. .PP There is no neat user interface; this is an earthly command line tool. This is not a bug, but a feature. .PP There are too many options. .PP Using another language such as Python for this application seems attractive, but there is no cleanly integrated manual-page style support such as \s-1POD,\s0 and the documentation is 50% of this script. .PP Mixed SQLite vs PostgreSQL or MySQL table comparison may not work properly in all cases, because of SQLite dynamic type handling and reduced capabilities. .PP The script creates (temporary) tables on both sides for comparing the target tables: this implies that you must be allowed to do that for the comparison... However, read-only replicas do not allow creating objects, which means that you cannot use pg_comparator to compare table contents on a synchronized replica. .SH "TODO" .IX Header "TODO" Allow larger checksum sizes. .PP Add an option to avoid \s-1IN\s0 (x,y,...) syntax, maybe with a temporary table to hold values and use a \s-1JOIN\s0 on that. I'm not sure about the performance implications, though. .PP Allow generating the \s-1SQL\s0 update script without applying it. .PP Option to generate more compact updates, i.e. only update attributes with different values. .SH "VERSIONS" .IX Header "VERSIONS" See the web site for the latest version. Although versions are really managed with \s-1SVN,\s0 there is also a GitHub repository. .IP "\fBversion 2.3.2\fR (r1594 on 2020\-11\-03)" 4 .IX Item "version 2.3.2 (r1594 on 2020-11-03)" Accept dash character (\*(L"\-\*(R") in login and database names, submitted by \fIPiotr Boniecki\fR.
Fix quoting of values when synchronizing with copy, reported by \fILuis Gonzales Sotelino\fR. Add \f(CW\*(C`module_pathname\*(C'\fR to Postgres extension control file. .IP "\fBversion 2.3.1\fR (r1582 on 2017\-07\-07)" 4 .IX Item "version 2.3.1 (r1582 on 2017-07-07)" Fix spelling errors in the documentation, reported by \fIBas Couwenberg\fR. Fix distribution \f(CW\*(C`Makefile\*(C'\fR. .IP "\fBversion 2.3.0\fR (r1569 on 2017\-07\-07)" 4 .IX Item "version 2.3.0 (r1569 on 2017-07-07)" Add new \*(L"\s-1INSTALL\*(R"\s0 Section. Turn cast, functions and aggregates into a PostgreSQL extension. Fix \f(CW\*(C`\-\-where\*(C'\fR handling when \f(CW\*(C`\-\-tcs\*(C'\fR is used, reported by \fIKenneth Hammink\fR. Add \f(CW\*(C`\-\-pg\-text\-cast\*(C'\fR option to work around missing implicit casts, issue reported by \fISaulius Grigaitis\fR. Documentation updates. The \fIrelease\fR validation was run successfully on PostgreSQL 9.6.3 and MySQL 5.7.18. .IP "\fBversion 2.2.6\fR (r1540 on 2015\-04\-18)" 4 .IX Item "version 2.2.6 (r1540 on 2015-04-18)" Fix some typos found by Lintian and pointed out by \fIIvan Mincik\fR. Add support for \s-1FNV\s0 (Fowler Noll Vo) version 1a inspired hash functions. Add option to skip inserts, updates or deletes when synchronizing, which may be useful to deal with foreign keys, issue pointed out by \fIGraeme Bell\fR. The \fIrelease\fR validation was run successfully on PostgreSQL 9.4.1 and MySQL 5.5.41. .IP "\fBversion 2.2.5\fR (r1512 on 2014\-07\-24)" 4 .IX Item "version 2.2.5 (r1512 on 2014-07-24)" Fix broken \s-1URL\s0 defaults to use \s-1UNIX\s0 sockets with an empty host name, per report by \fIIvan Mincik\fR. Fix \f(CW\*(C`\-\-where\*(C'\fR condition handling with \f(CW\*(C`\-\-pg\-copy\*(C'\fR in corner cases. Do not take execution timestamps when not required. Allow a larger number of differences by default for small table comparisons. Add more sanity checks. Improve some error messages. 
The \fIrelease\fR validation was run successfully on PostgreSQL 9.4b1 and MySQL 5.5.38. .IP "\fBversion 2.2.4\fR (r1506 on 2014\-07\-13)" 4 .IX Item "version 2.2.4 (r1506 on 2014-07-13)" Add experimental support for using \s-1COPY\s0 instead of \s-1INSERT/UPDATE\s0 for PostgreSQL, in chunks of size specified with option \f(CW\*(C`\-\-pg\-copy\*(C'\fR, as suggested by \fIGraeme Bell\fR. Minor fix when computing the maximum number of differences to report. The \fIrelease\fR validation was run successfully on PostgreSQL 9.4b1 and MySQL 5.5.37. .IP "\fBversion 2.2.3\fR (r1494 on 2014\-04\-19)" 4 .IX Item "version 2.2.3 (r1494 on 2014-04-19)" Improved documentation. Add \f(CW\*(C`\-\-unlogged\*(C'\fR option to use unlogged tables. The \fIrelease\fR validation was run successfully on PostgreSQL 9.3.4 and MySQL 5.5.35. .IP "\fBversion 2.2.2\fR (r1485 on 2014\-01\-08)" 4 .IX Item "version 2.2.2 (r1485 on 2014-01-08)" Fix some warnings reported by \fIIvan Mincik\fR. Minor doc changes. The \fIrelease\fR validation was run successfully on PostgreSQL 9.3.2 and MySQL 5.5.34. .IP "\fBversion 2.2.1\fR (r1480 on 2013\-05\-09)" 4 .IX Item "version 2.2.1 (r1480 on 2013-05-09)" Do not die on missing driver in \s-1URL,\s0 regression reported by \fIIvan Mincik\fR. The \fIrelease\fR validation was run successfully on PostgreSQL 9.2.4 and MySQL 5.5.31. .IP "\fBversion 2.2.0\fR (r1473 on 2013\-03\-07)" 4 .IX Item "version 2.2.0 (r1473 on 2013-03-07)" Bug fix by \fIRobert Coup\fR, which was triggered on hash collisions (again). This bug was introduced in 2.1.0 when getting rid of the key separator, and not caught by the validation. Factor out database dependencies in a separate data structure, so that adding new targets should be simpler in the future. Add SQLite support. Add experimental Firebird support. Fix some warnings. Update \f(CW\*(C`cksum8\*(C'\fR function to propagate the first checksum half into the computation of the second half. Improved documentation. 
Improved validation, in particular with a \fIcollisions\fR test. The \fIrelease\fR and \fIhour\fR validations were run successfully on PostgreSQL 9.2.3 and MySQL 5.5.29. .IP "\fBversion 2.1.2\fR (r1402 on 2012\-10\-28)" 4 .IX Item "version 2.1.2 (r1402 on 2012-10-28)" Fix an issue when table names were quoted, raised by \fIRobert Coup\fR. Improved documentation, especially Section \*(L"\s-1SEE ALSO\*(R"\s0. More precise warning. Improved validation. The \fIrelease\fR and \fIhour\fR validations were run successfully on PostgreSQL 9.2.1 and MySQL 5.5.27. .IP "\fBversion 2.1.1\fR (r1375 on 2012\-08\-20)" 4 .IX Item "version 2.1.1 (r1375 on 2012-08-20)" Synchronization now handles possible NULLs in keys. Warn if key is nullable or not an integer under \f(CW\*(C`\-\-use\-key\*(C'\fR. Improved documentation, in particular non regression tests are described. The \fIrelease\fR and \fIhour\fR validations were run successfully on PostgreSQL 9.1.4 and MySQL 5.5.24. .IP "\fBversion 2.1.0\fR (r1333 on 2012\-08\-18)" 4 .IX Item "version 2.1.0 (r1333 on 2012-08-18)" Add \f(CW\*(C`\-\-tuple\-checksum\*(C'\fR and \f(CW\*(C`\-\-key\-checksum\*(C'\fR options so as to use existing possibly trigger-maintained checksums in the target tables instead of computing them on the fly. Add \f(CW\*(C`\-\-checksum\-computation\*(C'\fR option to control how the checksum table is built, either \f(CW\*(C`CREATE ... AS ...\*(C'\fR or \f(CW\*(C`CREATE ...; INSERT ...\*(C'\fR. For MySQL, rely directly on the count returned by \f(CW\*(C`CREATE ... AS\*(C'\fR if available. Add \f(CW\*(C`\-\-lock\*(C'\fR option for locking tables, which is enabled when synchronizing. Improve asynchronous query handling, especially when creating checksum tables and getting initial table counts, and in some other cases. Remove redundant data transfers from checksum table under option \f(CW\*(C`\-\-use\-key\*(C'\fR. Get rid of the separator when retrieving keys of differing tuples. 
Note that it is still used when computing checksums. Fix bug in bulk insert and delete key recovery under option \f(CW\*(C`\-\-use\-key\*(C'\fR. Fix potential bug in handling complex conditions with \f(CW\*(C`\-\-where\*(C'\fR. Change default prefix to \fBpgc_cmp\fR so that it is clearer that it belongs to \fBpg_comparator\fR. Fix initial count query which was not started asynchronously under \f(CW\*(C`\-\-tcs\*(C'\fR. Ensure that if not null detection is in doubt, a column is assumed nullable and thus is coalesced. Fix query counters so that they are shared under \f(CW\*(C`\-\-threads\*(C'\fR. Fix threading for explicit cleanup phase. Warn if nullable key attributes are encountered. Make default driver for second connection be the same as first. Rename option \f(CW\*(C`\-\-assume\-size\*(C'\fR as \f(CW\*(C`\-\-size\*(C'\fR. Add short documentation about \f(CW\*(C`\-\-debug\*(C'\fR. Multiple \f(CW\*(C`\-\-debug\*(C'\fR set \s-1DBI\s0 tracing levels as well. Improve the difference computation code so that the algorithm is more readable. Improve documentation. Add and improve comments in the code. The \fIrelease\fR and \fIhour\fR validations were run successfully on PostgreSQL 9.1.4 and MySQL 5.5.24. .IP "\fBversion 2.0.1\fR (r1159 on 2012\-08\-10)" 4 .IX Item "version 2.0.1 (r1159 on 2012-08-10)" Add \f(CW\*(C`\-\-source\-*\*(C'\fR options to allow taking over \s-1DBI\s0 data source specification. Change default aggregate to \f(CW\*(C`sum\*(C'\fR so that it works as expected by default when mixing PostgreSQL and MySQL databases. The results are okay with \f(CW\*(C`xor\*(C'\fR, but more paths than necessary were investigated, which can unduly trigger the max report limit. Improved documentation. In particular default option settings are provided systematically. The \fIfast\fR validation was run successfully on PostgreSQL 9.1.4 and MySQL 5.5.24. 
.IP "\fBversion 2.0.0\fR (r1148 on 2012\-08\-09)" 4 .IX Item "version 2.0.0 (r1148 on 2012-08-09)" Use asynchronous queries so as to provide some parallelism to the comparison without the issues raised by threads. It is enabled by default and can be switched off with option \f(CW\*(C`\-\-no\-asynchronous\*(C'\fR. Allow empty hostname specification in connection \s-1URL\s0 to use a \s-1UNIX\s0 socket. Improve the documentation, in particular the analysis section. Fix minor typos in the documentation. Add and fix various comments in the code. The \fIfast\fR validation was run successfully on PostgreSQL 9.1.4 and MySQL 5.5.24. .IP "\fBversion 1.8.2\fR (r1117 on 2012\-08\-07)" 4 .IX Item "version 1.8.2 (r1117 on 2012-08-07)" Bug fix in the merge procedure by \fIRobert Coup\fR that could result in some strange difference reports in corner cases, when there were collisions on the \fIkcs\fR in the initial checksum table. Fix broken synchronization with '|' separator, raised by \fIAldemir Akpinar\fR. Warn about possible issues with large objects. Add \f(CW\*(C`\-\-long\-read\-len\*(C'\fR option as a possible way to circumvent such issues. Try to detect these issues. Add a counter for metadata queries. Minor documentation improvements and fixes. .IP "\fBversion 1.8.1\fR (r1109 on 2012\-03\-24)" 4 .IX Item "version 1.8.1 (r1109 on 2012-03-24)" Change default separator again, to '|'. Fix \f(CW\*(C`\-\-where\*(C'\fR option mishandling when counting, pointed out by \&\fIEnrique Corona\fR. .Sp Post release note: the synchronisation is broken with the default separator in 1.8.1, do not use it, or use \-\-separator='%'. .IP "\fBversion 1.8.0\fR (r1102 on 2012\-01\-08)" 4 .IX Item "version 1.8.0 (r1102 on 2012-01-08)" Change default separator to '%', which seems less likely, after issues run into by \fIEmanuel Calvo\fR. Add more pointers and documentation. .IP "\fBversion 1.7.0\fR (r1063 on 2010\-11\-12)" 4 .IX Item "version 1.7.0 (r1063 on 2010-11-12)" Improved documentation. 
Enhancement and fix by \fIMaxim Beloivanenko\fR: handle quoted table and attribute names; work around bulk inserts and deletes which may be undefined. More stats, more precise, possibly in \s-1CSV\s0 format. Add timeout and use-null options. Fix subtle bug which occurred sometimes on kcs collisions in table \fIT(0)\fR. .IP "\fBversion 1.6.1\fR (r754 on 2010\-04\-16)" 4 .IX Item "version 1.6.1 (r754 on 2010-04-16)" Improved documentation. Key and columns now default to primary key and all other columns of table in first connection. Password can be supplied from the environment. Default password for second connection always set depending on the first. Add max ratio option to express the relative maximum number of differences. Compute grouping masks by shifting left instead of right by default (that is doing a divide instead of a modulo). Threads now work a little, although it is still quite experimental. Fix a bug that made perl see differing checksums although they were equal, in some unclear conditions. .IP "\fBversion 1.6.0\fR (r701 on 2010\-04\-03)" 4 .IX Item "version 1.6.0 (r701 on 2010-04-03)" Add more functions (\s-1MD5, SUM\s0) and sizes (2, 4, 8). Remove template parameterization which is much too fragile to expose. Add a wrapping transaction which may speed up things a little. Implementation for MySQL, including synchronizing heterogeneous databases. Improved documentation. Extensive validation/non regression tests. .IP "\fBversion 1.5.2\fR (r564 on 2010\-03\-22)" 4 .IX Item "version 1.5.2 (r564 on 2010-03-22)" More documentation. Improved connection parsing with more sensible defaults. Make the mask computation match its above documentation with a bottom-up derivation, instead of a simpler top-down formula which results in bad performance when a power of the factor is close to the size (as pointed out in \fIBenjamin Mead Vandiver\fR's PhD). This bad mask computation was introduced somehow between 1.3 and 1.4 as an attempt at simplifying the code.
.IP "\fBversion 1.5.1\fR (r525 on 2010\-03\-21)" 4 .IX Item "version 1.5.1 (r525 on 2010-03-21)" More documentation. Add \f(CW\*(C`\-\-expect\*(C'\fR option for non regression tests. .IP "\fBversion 1.5.0\fR (r511 on 2010\-03\-20)" 4 .IX Item "version 1.5.0 (r511 on 2010-03-20)" Add more links. Fix so that it works with a key only (i.e. without additional columns), although it could be optimized further in this case. Integrate patch by \fIErik Aronesty\fR: More friendly \*(L"connection parser\*(R". Add synchronization option to actually synchronize the data. .IP "\fBversion 1.4.4\fR (r438 on 2008\-06\-03)" 4 .IX Item "version 1.4.4 (r438 on 2008-06-03)" Manual connection string parsing. .IP "\fBversion 1.4.3\fR (r424 on 2008\-02\-17)" 4 .IX Item "version 1.4.3 (r424 on 2008-02-17)" Grumble! wrong tar pushed out. .IP "\fBversion 1.4.2\fR (r421 on 2008\-02\-17)" 4 .IX Item "version 1.4.2 (r421 on 2008-02-17)" Minor makefile fix asked for by \fIRoberto C. Sanchez\fR. .IP "\fBversion 1.4.1\fR (r417 on 2008\-02\-14)" 4 .IX Item "version 1.4.1 (r417 on 2008-02-14)" Minor fix for PostgreSQL 8.3 by \fIRoberto C. Sanchez\fR. .IP "\fBversion 1.4\fR (r411 on 2007\-12\-24)" 4 .IX Item "version 1.4 (r411 on 2007-12-24)" Port to PostgreSQL 8.2. Better documentation. Fix mask bug: although the returned answer was correct, the table folding was not. \&\s-1DELETE/INSERT\s0 messages exchanged so as to match a 'sync' or 'copy' semantics, as suggested by \fIErik Aronesty\fR. .IP "\fBversion 1.3\fR (r239 on 2004\-08\-31)" 4 .IX Item "version 1.3 (r239 on 2004-08-31)" Project moved to \s-1PG\s0 Foundry. Use cksum8 checksum function by default. Minor doc updates. .IP "\fBversion 1.2\fR (r220 on 2004\-08\-27)" 4 .IX Item "version 1.2 (r220 on 2004-08-27)" Added \f(CW\*(C`\-\-show\-all\-keys\*(C'\fR option for handling big chunks of deletes or inserts.
.IP "\fBversion 1.1\fR (r210 on 2004\-08\-26)" 4 .IX Item "version 1.1 (r210 on 2004-08-26)" Fix algorithmic bug: checksums \fBmust\fR also include the key, otherwise exchanged data could go undetected if the keys were to be grouped together. Algorithmic section added to manual page. Thanks to \fIGiuseppe Maxia\fR who asked for it. Various code cleanups. .IP "\fBversion 1.0\fR (r190 on 2004\-08\-25)" 4 .IX Item "version 1.0 (r190 on 2004-08-25)" Initial revision. .SH "COPYRIGHT" .IX Header "COPYRIGHT" Copyright (c) 2004\-2020, \fIFabien Coelho\fR .PP This software is distributed under the terms of the \s-1BSD\s0 License. Basically, you can do whatever you want, but you have to keep the license and I'm not responsible for any consequences. Beware, you may lose your data, your friends or your hair because of this software! See the \s-1LICENSE\s0 file enclosed with the distribution for details. .PP If you are very happy with this software, I would appreciate a postcard saying so. See my webpage for my current address.