NAME¶
slmbuild - generate language model from idngram file
SYNOPSIS¶
slmbuild [
option]...
idngram_file...
DESCRIPTION¶
slmbuild generates a back-off smoothing language model from a given
idngram file. Generally, the
idngram_file is created by
ids2ngram.
OPTIONS All the following options are mandatory.¶
- -n,--NMax N
- 1 for unigram, 2 for bigram, 3 for trigram. Any number not
in the range of 1..3 is not valid.
- -o, --out output-file
- Specify the output xfilei name.
- -l, --log
- using -log(pr), use pr directly by
default.
- -w, --wordcount N
- Lexican size, number of different words.
- -b, --brk id...
- Set the ids which should be treated as breaker.
- -e, --e id...
- Set the ids which should not be put into LM.
- -c, --cut c...
- k-grams whose freq <= c[k] are dropped.
- -d, --discount method,
param...
- The k-th -d parm specifies the discount method
For k-gram, possibble values for method/param are:
B<GT>,I<R>,I<dis> : B<GT> discount for r E<lt>= I<R>, r is the freq of a ngram.
Linear discount for those r E<gt> I<R>, i.e. r'=r*dis
0 E<lt>E<lt> dis E<lt> 1.0, for example 0.999
B<ABS>,[I<dis>] : Absolute discount r'=r-I<dis>. And I<dis> is optional
0 E<lt>E<lt> I<dis> E<lt> cut[k]+1.0, normally I<dis> E<lt> 1.0.
LIN,[I<dis>] : Linear discount r'=r*dis. And dis is optional
0 E<lt> dis E<lt> 1.0
NOTE¶
-n must be given before
-c -b. And
-c must give
right number of cut-off, also
-ds must appear exactly N times
specifying the discounts for 1-gram, 2-gram..., respectively.
BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually, these ids
have no meaning when they appeared in the middle of n-gram.
EXCLUDE-IDs could be ambiguious-ids. Conceptually, n-grams which contain those
ids are meaningless.
We can not erase ngrams according to BREAKER-IDS and EXCLUDE-IDs directly from
IDNGRAM file, because some low-level information is still useful in it.
EXAMPLE¶
Following example read 'all.id3gram' and write trigram model 'all.slm'.
At 1-gram level, use Good-Turing discount with cut-off 0, i<R>=8,
dis=0.9995. At 2-gram level, use Absolute discount with cut-off 3, dis
auto-calc. At 3-gram level, use Absolute discount with cut-off 2, dis
auto-calc. Word id 10,11,12 are breakers (sentence/para/paper breaker, etc).
Exclude-ID is 9. Lexicon contains 200000 words. The result languagme model
uses -log(pr).
slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS
-b 10,11,12 -e 9 all.id3gram
AUTHOR¶
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently
maintained by Kov.Chai <tchaikov@gmail.com>.
SEE ALSO¶
ids2ngram(1),
slmprune(1).