| QSF(1) | User Manuals | QSF(1) |
NAME¶
qsf - quick spam filter
SYNOPSIS¶
Filtering: qsf [-snrAtav] [-d DB] [-g DB] [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM] [-X NUM]
DESCRIPTION¶
qsf reads a single email on standard input, and by default outputs it on standard output. If the email is determined to be spam, an additional header ("X-Spam: YES") will be added, and optionally the subject line can have "[SPAM]" prepended to it.
:0 wf
| qsf -ra
:0 H:
* X-Spam: YES
$HOME/mail/spam
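Programs other than procmail can act on the same header. The following is a minimal Python sketch, not part of qsf, that checks whether a message already passed through qsf carries the default "X-Spam: YES" marker. The function name is_marked_spam is hypothetical, and the marker argument stands in for whatever value you gave to -H, if any.

```python
from email import message_from_string


def is_marked_spam(raw_message, marker="YES"):
    """Return True if the message's X-Spam header equals the marker."""
    msg = message_from_string(raw_message)
    value = msg.get("X-Spam")  # None when the header is absent
    return value is not None and value.strip() == marker
```

Matching one named header, rather than searching all headers for "YES", avoids the false matches that the -H option is meant to work around.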
TRAINING¶
Before qsf can be used properly, it needs to be trained. A good way to train qsf is to collect a copy of all your email into two folders - one for spam, and one for non-spam. Once you have done this, you can use the training function, like this:
qsf -aT spam-folder non-spam-folder
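The -T option (see OPTIONS) trains by testing every message in both folders, updating the database whenever a message is miscategorised, and repeating for a limited number of rounds. The toy Python sketch below illustrates that train-on-error loop only. The ToyDB class, its word-count scoring, and the "stop when a round has no errors" rule are assumptions made for illustration and are not how qsf scores messages or decides that results are good enough.

```python
from collections import Counter


class ToyDB:
    """Toy token database: per-token spam and non-spam counts."""

    def __init__(self):
        self.spam = Counter()
        self.nonspam = Counter()

    def is_spam(self, text):
        # Assumed scoring rule: spam if spam counts outweigh non-spam counts.
        tokens = text.lower().split()
        return sum(self.spam[t] - self.nonspam[t] for t in tokens) > 0

    def mark(self, text, spam, weight=1):
        # Comparable in spirit to -m / -M, with -w as the weight.
        target = self.spam if spam else self.nonspam
        for token in text.lower().split():
            target[token] += weight


def train(db, spam_msgs, nonspam_msgs, max_rounds=200):
    """Retrain on every miscategorised message; return rounds used."""
    for round_no in range(1, max_rounds + 1):
        errors = 0
        for text in spam_msgs:
            if not db.is_spam(text):
                db.mark(text, spam=True)
                errors += 1
        for text in nonspam_msgs:
            if db.is_spam(text):
                db.mark(text, spam=False)
                errors += 1
        if errors == 0:  # assumed stopping rule
            return round_no
    return max_rounds
```

The max_rounds default of 200 mirrors the documented default for MAXROUNDS.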
OPTIONS¶
The qsf options are listed below.
- -d, --database [TYPE:]FILE
- Use FILE as the spam/non-spam database. The default
is to use /var/lib/qsfdb and, if that is not available or is
read-only, $HOME/.qsfdb. This option can also be useful if there is
a system-wide database but you do not want to use it - specifying your own
here will override the default.
If you prefix the filename with a TYPE, of the form btree:$HOME/.qsfdb, then this will specify what kind of database FILE is, such as list, btree, gdbm, sqlite and so on. Check the output of qsf -V to see which database backends are available. The default is to auto-detect the type, or, if the file does not already exist, use list. Note that TYPE is not case-sensitive.
- -g, --global [TYPE:]FILE
- Use FILE as the default global database, instead of /var/lib/qsfdb. If you also specify a database with -d, then this "global" database will be used in read-only mode in conjunction with the read-write database specified with -d. The -g option can be used a second time to specify a third database, which will also be used in read-only mode. Again, the filename can optionally be prefixed with a TYPE which specifies the database type.
- -P, --plain-map FILE
- Maintain a mapping of all database tokens to their non-hashed counterparts in FILE, one token per line. This can be useful if you want to be able to list the contents of your database at a later date, for instance to get a list of email addresses in your allow-list. Note that using this option may slow qsf down, and only entries written to the database while this option is active will be stored in FILE.
- -s, --subject
- Rewrite the Subject line of any email that turns out to be spam, adding "[SPAM]" to the start of the line.
- -S, --subject-marker SUBJECT
- Instead of adding "[SPAM]", add SUBJECT to the Subject line of any email that turns out to be spam. Implies -s.
- -H, --header-marker MARK
- Instead of setting the X-Spam header to "YES", set it to MARK if email turns out to be spam. This can be useful if your email client can only search all headers for a string, rather than one particular header (so searching for "YES" might match more than just the output of qsf).
- -n, --no-header
- Do not add an X-Spam header to messages.
- -r, --add-rating
- Insert an additional header X-Spam-Rating which is a rating of the "spamminess" of a message from 0 to 100; 90 and above are counted as spam, anything under 90 is not considered spam. If combined with -t, then the rating (0-100) will be output, on its own, on standard output.
- -A, --asterisk
- Insert an additional header X-Spam-Level which will contain between 0 and 20 asterisks (*), depending on the spam rating.
- -t, --test
- Instead of passing the message out on standard output, output nothing, and exit 0 if the message is not spam, or exit 1 if the message is spam. If combined with -r, then the spam rating will be output on standard output.
- -a, --allowlist
- Enable the allow-list. This causes the email addresses given in the message's "From:" and "Return-Path:" headers to be checked against a list; if either one matches, then the message is always treated as non-spam, regardless of what the token database says. When specified with a retraining flag, -a -m (mark as spam) will remove that address from the allow-list as well as marking the message as spam, and -a -M (mark as non-spam) will add that address to the allow-list as well as marking the message as non-spam. The idea is that you add all of your friends to the allow-list, and then none of their messages ever get marked as spam.
- -y, --denylist
- Enable the deny-list. This causes the email addresses given
in the message's "From:" and "Return-Path:" headers to
be checked against a second list; if either one matches, then the message
is always treated as spam. Training works in the same way as with
-a, except that you must specify -m or -M twice to
modify the deny-list instead of the allow-list, and with the reverse
syntax: -y -m -m (mark as spam) will add that address to the
deny-list, whereas -y -M -M (mark as non-spam) will remove that
address from the deny-list. This double specification is so that the usual
retraining process never touches the deny-list; the deny-list should be
carefully maintained rather than automatically generated.
Normally you would not need to use the deny-list.
- -L, --level, --threshold LEVEL
- Change the spam scoring threshold level which must be reached before an email is classified as spam. The default is 90.
- -Q, --min-tokens NUM
- Only give a score if more than NUM tokens are found in the message - otherwise the message is assumed to be non-spam, and it is not modified in any way. The default is 0. This option might be useful if you find that very short messages are being frequently miscategorised.
- -e, --email, --email-only EMAIL
- Query or update the allow-list entry for the email address
EMAIL. With no other options, this will simply output
"YES" if EMAIL is in the allow-list, or "NO" if
it is not. With -t, it will not output anything, but will exit 0
(success) if EMAIL is in the allow-list, or 1 (failure) if it is
not. With the -m (mark-spam) option, any previous allow-list entry
for EMAIL will be removed. Finally, with the -M
(mark-nonspam) option, EMAIL will be added to the allow-list if it
is not already on it.
If EMAIL is just the word MSG on its own, then an email will be read from standard input, and the email addresses given in the "From:" and "Return-Path:" headers will be used.
Using -e automatically switches on -a.
If you also specify -y, then the deny-list will be operated on. Remember that -m and -M are reversed with the deny-list.
If you specify an email address of the form @domain (nothing before the @), then the whole domain will be allow or deny listed.
- -v, --verbose
- Add extra X-QSF-Info headers to any filtered email, containing error messages and so on if applicable. Specify -v more than once to increase verbosity.
- -T, --train SPAM NONSPAM [MAXROUNDS]
- Train the database using the two mbox folders SPAM and NONSPAM, by testing each message in each folder and updating the database each time a message is miscategorised. This is done several times, and may take a while to run. Specify the -a (allow-list) flag to add every sender in the NONSPAM folder to your allow-list as a side-effect of the training process. If MAXROUNDS is specified, training will end after this number of rounds if the results are still not good enough. The default is a maximum of 200 rounds.
- -m, --mark-spam
- Instead of passing the message out on standard output, mark its contents as spam and update the database accordingly. If the allow-list (-a) is enabled, the message's "From:" and "Return-Path:" addresses are removed from the allow-list. If the deny-list (-y) is enabled and you specify -m twice, the message's addresses are added to the deny-list instead.
- -M, --mark-nonspam
- Instead of passing the message out on standard output, mark its contents as non-spam and update the database accordingly. If the allow-list (-a) is enabled, the message's "From:" and "Return-Path:" addresses are added to the allow-list (see the -a option above). If the deny-list (-y) is enabled and you specify -M twice, the message's addresses are removed from the deny-list instead.
- -w, --weight WEIGHT
- When marking as spam or non-spam, update the database with a weighting of WEIGHT per token instead of the default of 1. Useful when correcting mistakes, eg a message that has been mistakenly detected as spam should be marked as non-spam using a weighting of 2, i.e. double the usual weighting, to counteract the error.
- -D, --dump [FILE]
- Dump the contents of the database as a platform-independent text file, suitable for archival, transfer to another machine, and so on. The data is output on stdout or into the given FILE.
- -R, --restore [FILE]
- Rebuild the database from scratch from the text file on stdin. If a FILE is given, data is read from there instead of from stdin.
- -O, --tokens
- Instead of filtering, output a list of the tokens found in the message read from standard input, along with the number of times each token was found. This is only useful if you want to use qsf as a general tokeniser for use with another filtering package.
- -E, --merge OTHERDB
- Merge the OTHERDB database into the current database. This can be useful if you want to take one user's mailbox and merge it into the system-wide one, for instance (this would be done by, as root, doing qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and then removing /home/user/.qsfdb).
- -B, --benchmark SPAM NONSPAM [MAXROUNDS]
- Benchmark the training process using the two mbox folders
SPAM and NONSPAM. A temporary database is created and
trained using the first 75% of the messages in each folder, and then the
entire contents of each folder is tested to see how many false positives
and false negatives occur. Some timing information is also displayed.
This can be used to decide which backend is best on your system. Use -d to select a backend, eg qsf -B spam nonspam -d GDBM - this will create a temporary database which is removed afterwards.
The exception to this is the MySQL backend, where a full database specification must be given (-d MySQL:database=db;host=localhost;...) and the database table given will not be wiped beforehand or dropped afterwards.
As with -T, if MAXROUNDS is specified, training will never be done for more than this number of rounds; the default is 200.
- -h, --help
- Print a usage message on standard output and exit successfully.
- -V, --version
- Print version information, including a list of available
database backends, on standard output and exit successfully.
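The way -L, -r, -H, -s and -S interact can be summarised in a short Python sketch. This is an illustration of the documented behaviour only, not qsf's code: the function name mark_message, the exact header formatting, and the single space placed after the subject marker are assumptions.

```python
from email import message_from_string


def mark_message(msg, rating, threshold=90, header_marker="YES",
                 subject_marker=None, add_rating=False):
    """Annotate msg in place; return True if it counts as spam."""
    is_spam = rating >= threshold          # -L: default threshold is 90
    if add_rating:                         # -r: X-Spam-Rating, 0 to 100
        msg["X-Spam-Rating"] = str(rating)
    if is_spam:
        msg["X-Spam"] = header_marker      # -H replaces the default "YES"
        if subject_marker is not None:     # -s uses "[SPAM]", -S a custom marker
            subject = msg.get("Subject", "")
            del msg["Subject"]             # no error if the header is absent
            msg["Subject"] = f"{subject_marker} {subject}".strip()
    # Non-spam messages are left without an X-Spam header here; what qsf
    # writes for them is not modelled in this sketch.
    return is_spam
```

A rating of exactly 90 is treated as spam, matching the description of -r ("90 and above are counted as spam").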
DEPRECATED OPTIONS¶
The following options are only for use with the old binary tree database backend or old databases that haven't been upgraded to the new format that came in with version 1.1.0.
- -N, --no-autoprune
- When marking as spam or nonspam, never automatically prune the database. Usually the database is pruned after every 500 marks; if you would rather --prune manually, use -N to disable automatic pruning.
- -p, --prune
- Remove redundant entries from the database and clean it up a little. This is automatically done after several calls to --mark-spam or --mark-nonspam, and during training with --train if the training takes a large number of rounds, so it should rarely be necessary to use --prune manually unless you are using -N / --no-autoprune.
- -X, --prune-max NUM
- When the database is being pruned, no more than NUM
entries will be considered for removal. This is to prevent CPU and memory
resources being taken over. The default is 100,000 but in some
circumstances (if you find that pruning takes too long) this option may be
used to reduce it to a more manageable number.
FILES¶
- /var/lib/qsfdb
- The default (system-wide) spam database. If you wish to install qsf system-wide, this should be read-only to everyone; there should be one user with write access who can update the spam database with qsf --mark-spam and qsf --mark-nonspam when necessary.
- /var/lib/qsfdb2
- A second, read-only, system-wide database. This can be useful when installing qsf system-wide and using third-party spam databases; the first global database can be updated with system-specific changes, and this second database can be periodically updated when the third-party spam database is updated.
- $HOME/.qsfdb
- The default spam database for per-user data. Users without
write access to the system-wide database will have their data written
here, and the two databases will be read together. The per-user database
will be given a weighting equivalent to 10 times the weighting of the
global database.
NOTES¶
Currently, you cannot use qsf to check for spam while the database is being updated. This means that while an update is in progress, all email is passed through as non-spam.
EXAMPLES¶
To filter all of your mail through qsf, with the allow-list enabled and the "spam rating" header being added, add this to your .procmailrc file:
:0 wf
| qsf -ra
:0 wf
| qsf -sra
:0 H
* ^To:.*spambox@yourdomain.com
| qsf -am
# If sent to spambox@yourdomain.com:
:0
* ^To:.*spambox@yourdomain.com
{
:0 wf
| qsf -a
# The above two lines can be skipped if you've
# already piped the message through qsf.
# If the qsf database says it's not spam,
# mark it as spam!
:0 H
* ^X-Spam: NO
| qsf -am
}
:0 wf
* ! ^Subject: Your .* is on fire
* ! ^From: .*@foobar.com
| qsf -ra
# Press F5 to mark a message as spam and delete it
macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"
# Press F9 to mark a message as non-spam
macro index <f9> "<pipe-message>qsf -aM\n"
macro pager <f9> "<pipe-message>qsf -aM\n"
macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"
| preline procmail
THE ALLOW-LIST¶
A feature called the "allow-list" can be switched on by specifying the --allowlist or -a option. This causes messages' "From:" and "Return-Path:" addresses to be checked against a list of people you have said to allow all messages from, and if a message's "From:" or "Return-Path:" address is in the list, it is never marked as spam. This means you can add all your friends to an "allow-list" and qsf will then never mis-file their messages - a quick way to do this is to use -a with -T (train); everyone in your non-spam folder who has sent you an email will be added to the allow-list automatically during training.
qsf -e foo@bar.com -M
qsf -e bad@nasty.com -m
qsf -e someone@somewhere.com
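The matching rule, including the @domain form accepted by -e, can be pictured with the Python sketch below. It is an assumption-laden illustration: qsf keeps its lists inside the token database rather than as plain strings, and the case-insensitive comparison and the helper names is_listed and message_is_allowed are hypothetical.

```python
from email.utils import parseaddr


def is_listed(address, entries):
    """True if the address, or its whole @domain, appears in entries."""
    addr = address.strip().lower()
    domain = "@" + addr.rpartition("@")[2]
    listed = {entry.strip().lower() for entry in entries}
    return addr in listed or domain in listed


def message_is_allowed(headers, allowlist):
    """Check the From: and Return-Path: addresses; either match is enough."""
    for name in ("From", "Return-Path"):
        _, address = parseaddr(headers.get(name, ""))
        if address and is_listed(address, allowlist):
            return True
    return False
```

The same check, applied to a second list and with the opposite outcome, would describe the deny-list enabled by -y.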
BACKUP AND RESTORE¶
Because the database format is platform-specific, it is a good idea to periodically dump the database to a text file using qsf -D so that, if necessary, it can be transferred to another machine and restored with qsf -R later on.
qsf -D > your-database-dump.txt
qsf -R < your-database-dump.txt
TECHNICAL DETAILS¶
When a message is passed to qsf, any attachments are decoded, all HTML elements are removed, and the message text is then broken up into "tokens", where a "token" is a single word or URL. Each token is hashed using the MD5 algorithm (see below for why), and that hash is then used to look up each token in the qsf database.
TOKENISATION¶
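As a rough picture of the tokenise-and-hash step described under TECHNICAL DETAILS, the Python sketch below splits text into words and URLs and derives an MD5 key for each token. The regular expression, the UTF-8 encoding, and the function names are assumptions; qsf's real tokeniser treats different parts of a message differently and is not reproduced here.

```python
import hashlib
import re
from collections import Counter

# Assumed token pattern: a URL, or otherwise a run of word characters.
TOKEN_RE = re.compile(r"https?://\S+|\w+")


def tokenise(text):
    """Split text into word and URL tokens."""
    return TOKEN_RE.findall(text)


def token_key(token):
    """MD5 digest of a token, as a hex string, for use as a lookup key."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()


def token_counts(text):
    """Count occurrences of each token (compare the output of -O)."""
    return Counter(tokenise(text))
```

Because only the hash is stored, the database cannot be turned back into readable tokens, which is why the -P option exists to keep a separate plain-text map.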
When a message is broken up into tokens, various parts of the message are treated in different ways.
SPECIAL FILTERS¶
As well as using the textual content of email to detect spam, qsf also uses special filters which create "pseudo-tokens" based on various rules. This means that specific patterns, not just individual words, can be used to determine whether a message is spam or not.
- GTUBE
- Flags any message containing the string XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X as spam - useful for testing that your qsf installation is working.
- ATTACH-SCR
- ATTACH-PIF
- ATTACH-EXE
- ATTACH-VBS
- ATTACH-VBA
- ATTACH-LNK
- ATTACH-COM
- ATTACH-BAT
- Adds a token for every attachment whose filename ends in
".scr", ".pif", ".exe", ".vbs",
".vba", ".lnk", ".com", and ".bat"
respectively (these are often viruses).
- ATTACH-GIF
- ATTACH-JPG
- ATTACH-PNG
- Adds a token for every attachment whose filename ends in
".gif", ".jpg" or ".jpeg", and
".png" respectively.
- ATTACH-DOC
- ATTACH-XLS
- ATTACH-PDF
- Adds a token for every attachment whose filename ends in
".doc", ".xls", or ".pdf" respectively
(these tend to indicate a non-spam email).
- SINGLE-IMAGE
- Adds a token if the message contains exactly one attached
image.
- MULTIPLE-IMAGES
- Adds a token if the message contains more than one attached
image.
- GIBBERISH-CONSONANTS
- Adds a token for every word found that has multiple consonants in a row, as described above. Spam often contains strings of gibberish.
- GIBBERISH-VOWELS
- Adds a token for every word found that has multiple vowels in a row, eg "aeaiaiaeeio".
- GIBBERISH-FROMCONS
- Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-Path:" addresses on their own.
- GIBBERISH-FROMVOWL
- Like GIBBERISH-VOWELS, but only for the "From:" and "Return-Path:" addresses on their own.
- GIBBERISH-BADSTART
- Adds a token for every word that starts with a bad character such as %.
- GIBBERISH-HYPHENS
- Adds a token for every word with more than three hyphens or underscores in it.
- GIBBERISH-LONGWORDS
- Adds a token for every word with over 30 characters in it (but less than 60).
- HTML-COMMENTS-IN-WORDS
- Adds a token for every HTML comment found in the middle of a word. Spam often contains HTML inside words, like this: w<!--dsgfhsdgjgh-->ord
- HTML-EXTERNAL-IMG
- Adds a token for every HTML <img> (image) tag found that contains :// (i.e. it refers to an external image).
- HTML-FONT
- Adds a token for every HTML <font> tag found.
- HTML-IP-IN-URLS
- Adds a token for every URL found containing an IP address.
- HTML-INT-IN-URL
- Adds a token for every URL found containing an integer in its hostname.
- HTML-URLENCODED-URL
- Adds a token for every URL found containing a % sign in its
hostname.
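A few of the filters above have rules precise enough to sketch. The Python below covers only the ATTACH-*, SINGLE-IMAGE, MULTIPLE-IMAGES, GIBBERISH-HYPHENS and GIBBERISH-LONGWORDS rules as worded in this manual. The function name pseudo_tokens, the case-insensitive extension match, and counting only .gif, .jpg, .jpeg and .png attachments as images are assumptions, not taken from the qsf source.

```python
ATTACH_TOKENS = {
    ".scr": "ATTACH-SCR", ".pif": "ATTACH-PIF", ".exe": "ATTACH-EXE",
    ".vbs": "ATTACH-VBS", ".vba": "ATTACH-VBA", ".lnk": "ATTACH-LNK",
    ".com": "ATTACH-COM", ".bat": "ATTACH-BAT",
    ".gif": "ATTACH-GIF", ".jpg": "ATTACH-JPG", ".jpeg": "ATTACH-JPG",
    ".png": "ATTACH-PNG",
    ".doc": "ATTACH-DOC", ".xls": "ATTACH-XLS", ".pdf": "ATTACH-PDF",
}
IMAGE_TOKENS = {"ATTACH-GIF", "ATTACH-JPG", "ATTACH-PNG"}


def pseudo_tokens(words, attachment_names=()):
    """Return pseudo-tokens for the given words and attachment filenames."""
    found = []
    for word in words:
        # More than three hyphens or underscores in one word.
        if word.count("-") + word.count("_") > 3:
            found.append("GIBBERISH-HYPHENS")
        # Over 30 characters but less than 60.
        if 30 < len(word) < 60:
            found.append("GIBBERISH-LONGWORDS")
    images = 0
    for name in attachment_names:
        lowered = name.lower()
        for suffix, token in ATTACH_TOKENS.items():
            if lowered.endswith(suffix):
                found.append(token)
                if token in IMAGE_TOKENS:
                    images += 1
    if images == 1:
        found.append("SINGLE-IMAGE")
    elif images > 1:
        found.append("MULTIPLE-IMAGES")
    return found
```

Each pseudo-token is then treated like any other token, so training decides how strongly a given pattern counts towards spam or non-spam.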
DATABASE BACKENDS¶
The inbuilt "list" database backend will not necessarily provide the best performance, but is provided because using it requires no external libraries.
USE mydatabase;
CREATE TABLE qsfdb (
key1 BIGINT UNSIGNED NOT NULL,
key2 BIGINT UNSIGNED NOT NULL,
token VARCHAR(64) DEFAULT '' NOT NULL,
value1 INT UNSIGNED NOT NULL,
value2 INT UNSIGNED NOT NULL,
value3 INT UNSIGNED NOT NULL,
PRIMARY KEY (key1,key2,token),
KEY (key1),
KEY (key2),
KEY (token)
);
database=DATABASE;host=HOST;port=PORT;
user=USER;pass=PASS;table=TABLE;
key1=KEY1;key2=KEY2
- DATABASE
- is the name of the MySQL database.
- HOST
- is the hostname of the database server (eg "localhost").
- PORT
- is the TCP port to connect on (eg 3306).
- USER
- is the username to connect with.
- PASS
- is the password to connect with.
- TABLE
- is the database table to use. If a table with this name does not exist when qsf is called in update or training mode, then it will be created if permissions allow this to be done.
- KEY1
- is the value to use for the key1 field.
- KEY2
- is the value to use for the key2 field.
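Assembling the specification string by hand is error-prone, so the Python sketch below builds a -d argument from the fields listed above. It assumes that the specification is a single string of name=value pairs joined by semicolons, in the order shown, and that any field may be left out; the function name mysql_database_argument is hypothetical. The "MySQL:" prefix follows the -B example under OPTIONS.

```python
import shlex

# Field names as given in the specification format above.
SPEC_FIELDS = ("database", "host", "port", "user", "pass",
               "table", "key1", "key2")


def mysql_database_argument(params):
    """Build the value for -d from a dict of specification fields."""
    unknown = set(params) - set(SPEC_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    spec = ";".join(f"{name}={params[name]}"
                    for name in SPEC_FIELDS if name in params)
    return "MySQL:" + spec


def shell_quoted(argument):
    """Quote the argument for a shell, where ';' would otherwise end the command."""
    return shlex.quote(argument)
```

The quoting matters when the argument is typed at a shell prompt or placed in a procmail recipe, since an unquoted semicolon separates commands.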
TROUBLESHOOTING¶
If you have problems with qsf, please check the list below; if this does not help, go to the qsf home page and investigate the mailing lists, or email the author.
- Nothing is being marked as spam.
-
First, use the -r option to switch on the X-Spam-Rating header, and check that this header appears in email passed through qsf. If it does not, then it is likely that qsf is not being run at all - check your configuration of procmail(1) or its equivalent.
-
If you are seeing X-Spam-Rating headers, and different emails have different scores, then you may simply need to retrain your database a little more. Take more spam email and pass it to qsf -m.
-
If you are seeing X-Spam-Rating headers but they all give the same spam rating, then the most likely reason is that qsf is not reading any database. Make sure that whatever is processing the email has read permissions on /var/lib/qsfdb and/or ~/.qsfdb - and make sure that, if you are using ~/.qsfdb, what your database creator thought was ~ ($HOME) is the same as it is for whatever is processing the email.
- Retraining sometimes takes a very long time.
- With the obtree backend or 2-column MySQL or SQLite
tables, every 500th retrain (-m or -M), the database
is pruned. On some systems this may take some time, and during this time
the database is locked (except when using the MySQL or SQLite backends).
If you constantly do a lot of retraining and want to avoid this, then use
the -N option to suppress auto-pruning, and then have a
cron(8) job or something run a manual prune (qsf -p)
every now and again.
- Running qsf from procmail fails with an error.
- If you can run qsf from the command line, but in
your procmail log file you get errors about "qsf: cannot
execute binary file", then contact your system administrator for
help. It may be that incoming email is handled by a different server to
the one you normally shell into, and either they are of a different
architecture or operating system, or the mail server is not permitted to
execute user-owned binaries.
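When every message gets the same rating, a quick check is whether the account that processes mail can read the default databases at all. The Python sketch below is one way to test that; run it as the same user, and with the same $HOME, as the mail delivery process. The function name readable_databases is hypothetical, and the paths are the defaults named under FILES.

```python
import os

DEFAULT_PATHS = ("/var/lib/qsfdb", "/var/lib/qsfdb2", "~/.qsfdb")


def readable_databases(paths=DEFAULT_PATHS):
    """Map each path to True if it exists and is readable by this user."""
    return {path: os.access(os.path.expanduser(path), os.R_OK)
            for path in paths}


if __name__ == "__main__":
    for path, ok in readable_databases().items():
        print(f"{path}: {'readable' if ok else 'missing or unreadable'}")
```

If ~/.qsfdb shows as missing here but exists in your login session, the two environments disagree about $HOME.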
ACKNOWLEDGEMENTS¶
The following people have contributed suggestions, comments, patches, and testing:
Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
Dr Kelly A. Parker
Vesselin Mladenov <http://www.antipodes.bg/>
Glyn Faulkner
Mark Reynolds
Sam Roberts
Scott Allen
Karsten Kankowski
M. Kolbl
Micha Holzmann
Jef Poskanzer <http://www.acme.com/jef/>
Clemens Fischer <http://ino-waiting.gmxhome.de/>
Nelson A. de Oliveira
Michal Vitecek
Tommy Pettersson <http://www.lysator.liu.se/~ptp/>
AUTHOR¶
The author:
Andrew Wood <andrew.wood@ivarch.com>
http://www.ivarch.com/
BUGS¶
If you find any bugs, please contact the author, either by email or by using the contact form on the web site.
SEE ALSO¶
procmail(1), procmailrc(5), procmailex(5)
LICENSE¶
This is free software, distributed under the ARTISTIC 2.0 license.
| August 2007 | Linux |