NAME¶
- cmcalibrate - fit exponential tails for E-values and
determine HMM
- filter thresholds for a CM
SYNOPSIS¶
cmcalibrate [options] cmfile
DESCRIPTION¶
cmcalibrate calibrates E-value statistics and HMM filter thresholds for
the covariance models (CMs) in
cmfile. The E-values and HMM filter
threshold statistics are added to the
cmfile and are used by
cmsearch for increased sensitivity and acceleration in RNA homology
search.
CMs are profiles of RNA consensus sequence and secondary structure. A CM file is
produced by the
cmbuild program, from a given RNA sequence alignment of
known consensus structure.
cmcalibrate is very slow. It takes several hours to calibrate a single
average sized CM.
cmcalibrate can be run in parallel with MPI. To do
this, use the
--mpi option and run cmsearch inside a MPI wrapper
program such as
mpirun. For example:
mpirun C cmcalibrate
--mpi [other options] cmfile. Note that
cmcalibrate must have been compiled in MPI mode to use the
--mpi
option. See the Installation seciton of the user's guide for instructions on
how to compile in MPI mode.
The
--forecast <n> option can be used to estimate how long
the program will take to run on
<n> processors. Unless you plan
on running
cmcalibrate in MPI mode,
<n> should be set as
1.
cmcalibrate performs two main tasks. The first is to calibrate E-value
statistics. This is done by generating random sequences and searching them
with the CM and collecting hits. The histogram of the bit scores of the hits
is fit to an exponential tail, and the parameters of the fitted tail are saved
to the CM file. The exponential tail is used to predict the expected number of
hits (E-values) at a given bit score in
cmsearch. The random sequences
are generated by an HMM that was trained on real genomic sequences with
various GC contents. The goal is to have the GC distributions in the random
sequences to be similar to actual genomic sequences.
The second task is to determine appropriate HMM filter thresholds for the CM
over the possible range of final CM bit score thresholds. This is done by
sampling 10,000 sequences from the CM itself and searching them with the CM
and HMM. The appropriate HMM bit score threshold for a given CM threshold is
set as the HMM threshold that will recognize 99.5% of the hits that score
above the CM threshold. This HMM threshold is calculated over the range of
reasonable CM thresholds.
Both tasks must be performed for each configuration and algorithm that
cmsearch might use. These include HMM Viterbi, HMM Forward, CM CYK and
CM Inside algorithms for E-value calibration, and CM CYK and CM Inside
algorithms for HMM filter thresholds. Additionally, for each algorithm, each
task must be performed twice, once for a locally configured model and once for
a globally configured model.
The E-values and HMM filter thresholds determined by
cmcalibrate are only
used by the
cmsearch program. If you are not going to use
cmsearch, do not waste time calibrating your models.
The majority of the options to
cmcalibrate fall into one of two
categories, depending on which of the two main tasks they're associated with.
Options that affect the exponential tail E-value fitting are prefixed with
--exp. Options that affect the HMM filter threshold determination are
prefixed with
--fil.
The calibration of E-value statistics takes the majority of the running time of
cmcalibrate. This is because CM search algorithms are slow, and the
random sequences that must be searched have to be long enough to include
enough random hits that can be binned into a histogram to which an exponential
tail can be reliably fit. By default the random sequence length for CM
searches is 1.5 megabases (Mb), for all search modes, but 1.5 can be changed
to
<x> with
--exp-cmL-glc <x> or
--exp-cmL-loc <x> options for glocal and local CM search
calibrations respectively. Because
cmsearch uses HMM search algorithms
to filter,
cmcalibrate must also fit exponential tails for HMM search
algorithms. HMMs are much faster than CMs so it is possible to search much
longer random sequence than 1.5 MB and not significantly increase the running
time of
cmcalibrate. The length of sequence searched with the HMM is
controlled by the
--exp-fract <x>,
--exp-hmmLn-glc <x>, --exp-hmmLn-loc
<x>, and the
--exp-hmmLx <x> options. By
default, the sequence length for HMM calibration is set as the length that
will require 0.10 times the number of dynamic programming calculations as a CM
E-value calibration step. (The value 0.10 can be changed to
<x>
with the
--exp-fract <x> option). If this sequence length
is less than a minimum value, which by default is 15.0 MB, then the minimum
value is used. The minimum value can be changed to
<x> with
--exp-hmmLn-glc <x> and
--exp-hmmLn-loc
<x> for glocal and local HMM search calibrations separately.
Similarily if this value is more than a maximum value, which by default is
1000.0 MB, then the maximum value is used. The maximum value can be changed to
<x> with the
--exp-hmmLx <x> option.
OPTIONS¶
- -h
- Print brief help; includes version number and summary of
all options, including expert options.
- -s <n>
- Set the random number generator seed to <n>,
where <n> is a positive integer. The default is to use time()
to generate a different seed for each run, which means that two different
runs of cmcalibrate on the same CM will give slightly different
E-value and HMM filter threshold parameters. You can use this option to
generate reproducible results.
- --forecast <n>
- Predict the running time of the calibration for
cmfile and provided options and exit, DO NOT perform the
calibration. The predictions should be used as rough estimates. The value
<n> is the number of processors the calibration will be run
on, so <n> equal to 1 is appropriate unless you will run
cmcalibrate in parallel with MPI.
- --devhelp
- Print help, as with -h , but also include
undocumented developer options. These options are not listed below, are
under development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only resources for
understanding what they actually do are the brief one-line description
printed when --devhelp is enabled, and the source code.
- --mpi
- Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see User's Guide for details).
EXPERT OPTIONS¶
- --exp-cmL-glc <x>
- Set the length of random sequence to search for the CM
glocal exponential tail fits to <x> megabases (Mb). By
default, <x> is 1.5 Mb. Searching more sequences will make
the exponential tail fits more precise, but will take longer: using
<x> of 3.0 instead of the default of 1.5 will cause the
running time of cmcalibrate to increase by roughly 50%.
- --exp-cmL-loc <x>
- Set the length of random sequence to search for the CM
local exponential tail fits to <x> megabases (Mb). By
default, <x> is 1.5 Mb. Searching more sequences will make
the exponential tail fits more precise, but will take longer: using
<x> of 3.0 instead of the default of 1.5 will cause the
running time of cmcalibrate to increase by roughly 50%.
- --exp-hmmLn-glc <x>
- Set the minimum random sequence length to search for the
HMM glocal exponential tail fits to <x> megabases
(Mb). By default, <x> is 15.0. For more information, see the
explanation regarding sequence lengths for E-value calibration above
before the Options section.
- --exp-hmmLn-loc <x>
- Set the minimum random sequence length to search for the
HMM local exponential tail fits to <x> megabases (Mb).
By default, <x> is 15.0. For more information, see the
explanation regarding sequence lengths for E-value calibration above
before the Options section.
- --exp-hmmLx <x>
- Set the maximum random sequence length to search when
determining HMM E-values to <x> megabases (Mb). By default,
<x> is 1000.0. For more information, see the explanation
regarding sequence lengths for E-value calibration above before the
Options section.
- --exp-fract <x>
- Set the HMM/CM fraction of dynamic programming calculations
to <x>. By default, <x> is 0.10. For more
information, see the explanation regarding sequence lengths for E-value
calibration above before the Options section.
- --exp-tailn-cglc <x>
- During E-value calibration of glocal CM search modes
fit the exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this
<x> is 25. The value 25 was chosen because it works well
empirically for glocal CM modes relative to other values.
- --exp-tailn-cloc <x>
- During E-value calibration of local CM search modes
fit the exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this
<x> is 75. The value 75 was chosen because it works well
empirically for local CM modes relative to other values.
- --exp-tailn-hglc <x>
- During E-value calibration of glocal HMM search
modes fit the exponential tail to the high scores in the histogram tail
that includes <x> hits per Mb searched. By default this
<x> is 250. The value 250 was chosen because it works well
empirically for glocal HMM modes relative to other values.
- --exp-tailn-hloc <x>
- During E-value calibration of local HMM search modes
fit the exponential tail to the high scores in the histogram tail that
includes <x> hits per Mb searched. By default this
<x> is 750. The value 750 was chosen because it works well
empirically for glocal HMM modes relative to other values.
- --exp-tailp <x>
- Ignore the --exp-tailn prefixed options and fit the
<x> fraction right tail of the histogram to exponential
tails, for all search modes.
- --exp-tailxn <n>
- With --exp-tailp enforce that the maximum number of
hits in the tail that is fit is <n>.
- --exp-beta <x>
- During E-value calibration, by default query-dependent
banding (QDB) is used to accelerate the CM search algorithms with a beta
tail loss probability of 1E-15. This beta value can be changed to
<x> using the --exp-beta <x> option. The
beta parameter is the amount of probability mass excluded during band
calculation, higher values of beta give greater speedups but sacrifice
more accuracy than lower values. A recommended value is 1E-7 (0.00001).
QDB is explained in more detail in the manual page for cmsearch and
in (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56).
- --exp-no-qdb
- Turn of QDB during E-value calibration. This will slow down
calibration, and is not recommended unless you plan on using
--no-qdb in cmsearch.
- --exp-hfile <f>
- Save the histograms fit for the E-value calibration to file
<f>. The format of this file is two tab delimited columns.
The first column is the x-axis values of bit scores of each bin. The
second column is the y-axis values of number of hits per bin. Each series
is delimited by a line with a single character "&". The file
will contain one series for each exponential tail fit, i.e. one series of
empirical data for each line of output from cmcalibrate that begins
with "exp tail".
- --exp-sfile <f>
- Save a survival plot for the E-value calibration to file
<f>. The format of this file is two tab delimited columns.
The first column is the x-axis values of bit scores of each bin. The
second column is the y-axis values of fraction of hits that meet or exceed
the score for each bin. Each series is delimited by a line with a single
character "&". The file will contain three series' of data
for each exponential tail fit, i.e. three series for each line of output
from cmcalibrate that begins with "exp tail". The first
series is the empirical survival plot from the histogram of hits to the
random sequence. The second series is the exponential tail fit to the
empirical distribution. The third series is the exponential tail fit if
lambda were fixed and set as the natural log of 2 (0.691314718).
- --exp-qqfile <f>
- Save a quantile-quantile plot for the E-value calibration
to file <f>. The format of this file is two tab delimited
columns. The first column is the x-axis values, and the second column is
the y-axis values. The distance of the points from the identity line (y=x)
is a measure of how good the exponential tail fit is, the closer the
points are to the identity line, the better the fit is. Each series is
delimited by a line with a single character "&". The file
will contain one series of empirical data for each exponential tail fit,
i.e. one series for each line of output from cmcalibrate that
begins with "exp tail".
- --exp-ffile <f>
- Save statistics on the exponential tail statistics to file
<f>. The file will contain the lambda and mu values for
exponential tails fit to tails of different sizes. For example, by default
cmcalibrate fits exponential tails to the rightmost 0.01 (1) of the
score histogram and stores the parameters of that exponential tail to the
CM file. (The value of 0.01 can be changed to <x> with the
--exp-tailp <x> option). When --exp-ffile
<f> is used the file <f> will include the
exponential tail parameters for fits to various fractions of the histogram
tail, instead of just to 0.01.
- --fil-N <n>
- Set the number of sequences sampled and searched for the
HMM filter threshold calibration to <n>. By default,
<n> is 10,000.
- --fil-F <x>
- Set the fraction of sample sequences the HMM filter must be
able to recognize, and allow to survive, to <x>, where
<x> is a positive real number less than or equal to 1.0. By
default, <x> is 0.993.
- --fil-tau <x>
- Set the tail loss probability during HMM band calculation
for HMM filter threshold calibration to <x>. This is the
amount of probability mass within the HMM posterior probabilities that is
considered negligible. The default value is 1E-7. In general, higher
values will result in greater acceleration, but increase the chance of
missing the optimal alignment due to the HMM bands.
- --fil-gemit
- During HMM filter calibration, always sample sequences from
a globally configured CM, even when calibrating local modes. By default,
sequences are sampled from a globally configured CM when calibrating the
global search modes, and sampled from a locally configured CM when
calibrating the local search modes.
- --fil-dfile <f>
- Save statistics on filter threshold calibration, including
HMM and CM scores for all sampled sequences, to file <f>.
- --mxsize <x>
- Set the maximum allowable DP matrix size to
<x> megabytes. By default this size is 2,048 Mb. This should
be large enough for the vast majority of calibrations, however if it is
not cmcalibrate will exit prematurely and report an error message
that the matrix exceeded it's maximum allowable size. In this case, the
--mxsize can be used to raise the limit.
SEE ALSO¶
For complete documentation, see the User's Guide (Userguide.pdf) that came with
the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT¶
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution
conditions.
AUTHOR¶
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/