NAME
Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser
SYNOPSIS
loadplugin Mail::SpamAssassin::Plugin::TextCat
DESCRIPTION
This plugin will try to guess the language used in the message body text.
You can use the "ok_languages" directive to set which languages are
considered okay for incoming mail and if the guessed language is not okay,
"UNWANTED_LANGUAGE_BODY" is triggered.
The plugin always adds the results to an "X-Language" name-value pair in
the message metadata. This may be useful as Bayes tokens and
can also be used in rules for scoring. The results can also be added to
marked-up messages using "add_header" with the _LANGUAGES_ tag. See
Mail::SpamAssassin::Conf for details.
Note: the language cannot always be recognized with sufficient confidence. In
that case, no action is taken.
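As a minimal sketch of how the pieces described above might fit together, the fragment below loads the plugin, restricts the okay languages, scores the rule, and reports the guess in a header. The choice of languages and the score value are assumptions made for illustration, not recommendations. Where the lines are placed (a .pre file for "loadplugin", a .cf file for the rest) is likewise assumed.

```
# Load the plugin (assumed to go in a .pre file)
loadplugin Mail::SpamAssassin::Plugin::TextCat

# Treat only English and German as okay; if none of the guessed
# languages is in this list, UNWANTED_LANGUAGE_BODY is triggered
ok_languages en de

# Illustrative score only; choose a value that suits your site
score UNWANTED_LANGUAGE_BODY 2.0

# Add the guessed languages to marked-up messages; with the header
# name "Languages" this appears as X-Spam-Languages
add_header all Languages _LANGUAGES_
```

When the language cannot be guessed with sufficient confidence, the rule is not triggered, so a configuration like this only affects messages with a confident guess.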
USER OPTIONS
- ok_languages xx [ yy zz ... ] (default: all)
- This option is used to specify which languages are
considered okay for incoming mail. SpamAssassin will try to detect the
language used in the message body text.
Note that the language cannot always be recognized with sufficient
confidence. In that case, no action is taken.
The rule "UNWANTED_LANGUAGE_BODY" is triggered if none of the
languages detected are in the "ok" list. Note that this is the
only effect of the "ok" list. It does not act as a whitelist
against any other form of spam scanning.
In your configuration, you must use the two- or three-letter language
specifier in lowercase, not the English name of the language. You may
also specify "all" if a desired language is not listed, or if
you want to allow any language. The default setting is "all".
Examples:
ok_languages all (allow all languages)
ok_languages en (only allow English)
ok_languages en ja zh (allow English, Japanese, and Chinese)
Note: if there are multiple ok_languages lines, only the last one is used.
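To illustrate the note above, the following sketch shows two ok_languages lines in the same configuration. The language choices are arbitrary examples.

```
# First line: English and French
ok_languages en fr

# A later line replaces the earlier one; it does not add to it
ok_languages ja

# Effective setting: only "ja" is okay.
# To allow all three, list them on a single line instead:
# ok_languages en fr ja
```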
Select the languages to allow from the list below:
- af - Afrikaans
- am - Amharic
- ar - Arabic
- be - Byelorussian
- bg - Bulgarian
- bs - Bosnian
- ca - Catalan
- cs - Czech
- cy - Welsh
- da - Danish
- de - German
- el - Greek
- en - English
- eo - Esperanto
- es - Spanish
- et - Estonian
- eu - Basque
- fa - Persian
- fi - Finnish
- fr - French
- fy - Frisian
- ga - Irish Gaelic
- gd - Scottish Gaelic
- he - Hebrew
- hi - Hindi
- hr - Croatian
- hu - Hungarian
- hy - Armenian
- id - Indonesian
- is - Icelandic
- it - Italian
- ja - Japanese
- ka - Georgian
- ko - Korean
- la - Latin
- lt - Lithuanian
- lv - Latvian
- mr - Marathi
- ms - Malay
- ne - Nepali
- nl - Dutch
- no - Norwegian
- pl - Polish
- pt - Portuguese
- qu - Quechua
- rm - Rhaeto-Romance
- ro - Romanian
- ru - Russian
- sa - Sanskrit
- sco - Scots
- sk - Slovak
- sl - Slovenian
- sq - Albanian
- sr - Serbian
- sv - Swedish
- sw - Swahili
- ta - Tamil
- th - Thai
- tl - Tagalog
- tr - Turkish
- uk - Ukrainian
- vi - Vietnamese
- yi - Yiddish
- zh - Chinese (both Traditional and Simplified)
- zh.big5 - Chinese (Traditional only)
- zh.gb2312 - Chinese (Simplified only)
- inactive_languages xx [ yy zz ... ] (default: see below)
- This option is used to specify which languages will not be
considered when trying to guess the language. For performance reasons,
supported languages that have fewer than about 5 million speakers are
disabled by default. Note that listing a language in
"ok_languages" automatically enables it for that user.
The default setting is:
- bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi
That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian, Irish Gaelic,
Scottish Gaelic, Icelandic, Latin, Lithuanian, Latvian, Rhaeto-Romance,
Sanskrit, Scots, Slovenian, and Yiddish.
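As a hedged example, a site that regularly receives Estonian, Lithuanian, and Latvian mail could take those three codes out of the default inactive list so that they are considered during guessing. The scenario is hypothetical. The list below is simply the default with "et", "lt", and "lv" removed.

```
# Default inactive list minus et, lt, lv
inactive_languages bs cy eo eu fy ga gd is la rm sa sco sl yi
```

Alternatively, listing those languages in "ok_languages" enables them for that user without changing "inactive_languages".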
- textcat_max_languages N (default: 3)
- The maximum number of languages any one message can
simultaneously match before its classification is considered unknown.
- textcat_optimal_ngrams N (default: 0)
- Ngrams that occur fewer than this number of times in the message
are removed before comparison. This can be used to speed up the program
for longer inputs. For shorter inputs, this should be set to 0.
- textcat_max_ngrams N (default: 400)
- The maximum number of ngrams that should be compared with
each of the language models (note that each of those models is used
completely).
- textcat_acceptable_score N (default: 1.02)
- Include any language that scores at least
"textcat_acceptable_score" in the returned list of
languages.
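For reference, the four tuning options can be written out with their documented defaults as below. The commented-out line is a purely illustrative change, not a tested recommendation.

```
# Defaults as documented above, written out explicitly
textcat_max_languages    3
textcat_optimal_ngrams   0
textcat_max_ngrams       400
textcat_acceptable_score 1.02

# Illustrative only: allow a message to match up to 5 languages
# before its classification is considered unknown
# textcat_max_languages  5
```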