Infernal Manual

NAME¶

cmcalibrate - fit exponential tails for E-values and determine HMM: filter thresholds for a CM

SYNOPSIS¶

cmcalibrate [options] cmfile

DESCRIPTION¶

cmcalibrate calibrates E-value statistics and HMM filter thresholds for the covariance models (CMs) in cmfile. The E-values and HMM filter threshold statistics are added to the cmfile and are used by cmsearch for increased sensitivity and acceleration in RNA homology search.

CMs are profiles of RNA consensus sequence and secondary structure. A CM file is produced by the cmbuild program, from a given RNA sequence alignment of known consensus structure.

cmcalibrate is very slow. It takes several hours to calibrate a single average sized CM. cmcalibrate can be run in parallel with MPI. To do this, use the --mpi option and run cmsearch inside a MPI wrapper program such as mpirun. For example: mpirun C cmcalibrate --mpi [other options] cmfile. Note that cmcalibrate must have been compiled in MPI mode to use the --mpi option. See the Installation seciton of the user's guide for instructions on how to compile in MPI mode.

The --forecast <n> option can be used to estimate how long the program will take to run on <n> processors. Unless you plan on running cmcalibrate in MPI mode, <n> should be set as 1.

cmcalibrate performs two main tasks. The first is to calibrate E-value statistics. This is done by generating random sequences and searching them with the CM and collecting hits. The histogram of the bit scores of the hits is fit to an exponential tail, and the parameters of the fitted tail are saved to the CM file. The exponential tail is used to predict the expected number of hits (E-values) at a given bit score in cmsearch. The random sequences are generated by an HMM that was trained on real genomic sequences with various GC contents. The goal is to have the GC distributions in the random sequences to be similar to actual genomic sequences.

The second task is to determine appropriate HMM filter thresholds for the CM over the possible range of final CM bit score thresholds. This is done by sampling 10,000 sequences from the CM itself and searching them with the CM and HMM. The appropriate HMM bit score threshold for a given CM threshold is set as the HMM threshold that will recognize 99.5% of the hits that score above the CM threshold. This HMM threshold is calculated over the range of reasonable CM thresholds.

Both tasks must be performed for each configuration and algorithm that cmsearch might use. These include HMM Viterbi, HMM Forward, CM CYK and CM Inside algorithms for E-value calibration, and CM CYK and CM Inside algorithms for HMM filter thresholds. Additionally, for each algorithm, each task must be performed twice, once for a locally configured model and once for a globally configured model.

The E-values and HMM filter thresholds determined by cmcalibrate are only used by the cmsearch program. If you are not going to use cmsearch, do not waste time calibrating your models.

The majority of the options to cmcalibrate fall into one of two categories, depending on which of the two main tasks they're associated with. Options that affect the exponential tail E-value fitting are prefixed with --exp. Options that affect the HMM filter threshold determination are prefixed with --fil.

The calibration of E-value statistics takes the majority of the running time of cmcalibrate. This is because CM search algorithms are slow, and the random sequences that must be searched have to be long enough to include enough random hits that can be binned into a histogram to which an exponential tail can be reliably fit. By default the random sequence length for CM searches is 1.5 megabases (Mb), for all search modes, but 1.5 can be changed to <x> with --exp-cmL-glc <x> or --exp-cmL-loc <x> options for glocal and local CM search calibrations respectively. Because cmsearch uses HMM search algorithms to filter, cmcalibrate must also fit exponential tails for HMM search algorithms. HMMs are much faster than CMs so it is possible to search much longer random sequence than 1.5 MB and not significantly increase the running time of cmcalibrate. The length of sequence searched with the HMM is controlled by the --exp-fract <x>, --exp-hmmLn-glc <x>, --exp-hmmLn-loc <x>, and the --exp-hmmLx <x> options. By default, the sequence length for HMM calibration is set as the length that will require 0.10 times the number of dynamic programming calculations as a CM E-value calibration step. (The value 0.10 can be changed to <x> with the --exp-fract <x> option). If this sequence length is less than a minimum value, which by default is 15.0 MB, then the minimum value is used. The minimum value can be changed to <x> with --exp-hmmLn-glc <x> and --exp-hmmLn-loc <x> for glocal and local HMM search calibrations separately. Similarily if this value is more than a maximum value, which by default is 1000.0 MB, then the maximum value is used. The maximum value can be changed to <x> with the --exp-hmmLx <x> option.

OPTIONS¶

-h: Print brief help; includes version number and summary of all options, including expert options.

-s <n>: Set the random number generator seed to <n>, where <n> is a positive integer. The default is to use time() to generate a different seed for each run, which means that two different runs of cmcalibrate on the same CM will give slightly different E-value and HMM filter threshold parameters. You can use this option to generate reproducible results.

--forecast <n>: Predict the running time of the calibration for cmfile and provided options and exit, DO NOT perform the calibration. The predictions should be used as rough estimates. The value <n> is the number of processors the calibration will be run on, so <n> equal to 1 is appropriate unless you will run cmcalibrate in parallel with MPI.

--devhelp: Print help, as with -h , but also include undocumented developer options. These options are not listed below, are under development or experimental, and are not guaranteed to even work correctly. Use developer options at your own risk. The only resources for understanding what they actually do are the brief one-line description printed when --devhelp is enabled, and the source code.

--mpi: Run as an MPI parallel program. This option will only be available if Infernal has been configured and built with the "--enable-mpi" flag (see User's Guide for details).

EXPERT OPTIONS¶

--exp-cmL-glc <x>: Set the length of random sequence to search for the CM glocal exponential tail fits to <x> megabases (Mb). By default, <x> is 1.5 Mb. Searching more sequences will make the exponential tail fits more precise, but will take longer: using <x> of 3.0 instead of the default of 1.5 will cause the running time of cmcalibrate to increase by roughly 50%.

--exp-cmL-loc <x>: Set the length of random sequence to search for the CM local exponential tail fits to <x> megabases (Mb). By default, <x> is 1.5 Mb. Searching more sequences will make the exponential tail fits more precise, but will take longer: using <x> of 3.0 instead of the default of 1.5 will cause the running time of cmcalibrate to increase by roughly 50%.

--exp-hmmLn-glc <x>: Set the minimum random sequence length to search for the HMM glocal exponential tail fits to <x> megabases (Mb). By default, <x> is 15.0. For more information, see the explanation regarding sequence lengths for E-value calibration above before the Options section.

--exp-hmmLn-loc <x>: Set the minimum random sequence length to search for the HMM local exponential tail fits to <x> megabases (Mb). By default, <x> is 15.0. For more information, see the explanation regarding sequence lengths for E-value calibration above before the Options section.

--exp-hmmLx <x>: Set the maximum random sequence length to search when determining HMM E-values to <x> megabases (Mb). By default, <x> is 1000.0. For more information, see the explanation regarding sequence lengths for E-value calibration above before the Options section.

--exp-fract <x>: Set the HMM/CM fraction of dynamic programming calculations to <x>. By default, <x> is 0.10. For more information, see the explanation regarding sequence lengths for E-value calibration above before the Options section.

--exp-tailn-cglc <x>: During E-value calibration of glocal CM search modes fit the exponential tail to the high scores in the histogram tail that includes <x> hits per Mb searched. By default this <x> is 25. The value 25 was chosen because it works well empirically for glocal CM modes relative to other values.

--exp-tailn-cloc <x>: During E-value calibration of local CM search modes fit the exponential tail to the high scores in the histogram tail that includes <x> hits per Mb searched. By default this <x> is 75. The value 75 was chosen because it works well empirically for local CM modes relative to other values.

--exp-tailn-hglc <x>: During E-value calibration of glocal HMM search modes fit the exponential tail to the high scores in the histogram tail that includes <x> hits per Mb searched. By default this <x> is 250. The value 250 was chosen because it works well empirically for glocal HMM modes relative to other values.

--exp-tailn-hloc <x>: During E-value calibration of local HMM search modes fit the exponential tail to the high scores in the histogram tail that includes <x> hits per Mb searched. By default this <x> is 750. The value 750 was chosen because it works well empirically for glocal HMM modes relative to other values.

--exp-tailp <x>: Ignore the --exp-tailn prefixed options and fit the <x> fraction right tail of the histogram to exponential tails, for all search modes.

--exp-tailxn <n>: With --exp-tailp enforce that the maximum number of hits in the tail that is fit is <n>.

--exp-beta <x>: During E-value calibration, by default query-dependent banding (QDB) is used to accelerate the CM search algorithms with a beta tail loss probability of 1E-15. This beta value can be changed to <x> using the --exp-beta <x> option. The beta parameter is the amount of probability mass excluded during band calculation, higher values of beta give greater speedups but sacrifice more accuracy than lower values. A recommended value is 1E-7 (0.00001). QDB is explained in more detail in the manual page for cmsearch and in (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56).

--exp-no-qdb: Turn of QDB during E-value calibration. This will slow down calibration, and is not recommended unless you plan on using --no-qdb in cmsearch.

--exp-hfile <f>: Save the histograms fit for the E-value calibration to file <f>. The format of this file is two tab delimited columns. The first column is the x-axis values of bit scores of each bin. The second column is the y-axis values of number of hits per bin. Each series is delimited by a line with a single character "&". The file will contain one series for each exponential tail fit, i.e. one series of empirical data for each line of output from cmcalibrate that begins with "exp tail".

--exp-sfile <f>: Save a survival plot for the E-value calibration to file <f>. The format of this file is two tab delimited columns. The first column is the x-axis values of bit scores of each bin. The second column is the y-axis values of fraction of hits that meet or exceed the score for each bin. Each series is delimited by a line with a single character "&". The file will contain three series' of data for each exponential tail fit, i.e. three series for each line of output from cmcalibrate that begins with "exp tail". The first series is the empirical survival plot from the histogram of hits to the random sequence. The second series is the exponential tail fit to the empirical distribution. The third series is the exponential tail fit if lambda were fixed and set as the natural log of 2 (0.691314718).

--exp-qqfile <f>: Save a quantile-quantile plot for the E-value calibration to file <f>. The format of this file is two tab delimited columns. The first column is the x-axis values, and the second column is the y-axis values. The distance of the points from the identity line (y=x) is a measure of how good the exponential tail fit is, the closer the points are to the identity line, the better the fit is. Each series is delimited by a line with a single character "&". The file will contain one series of empirical data for each exponential tail fit, i.e. one series for each line of output from cmcalibrate that begins with "exp tail".

--exp-ffile <f>: Save statistics on the exponential tail statistics to file <f>. The file will contain the lambda and mu values for exponential tails fit to tails of different sizes. For example, by default cmcalibrate fits exponential tails to the rightmost 0.01 (1) of the score histogram and stores the parameters of that exponential tail to the CM file. (The value of 0.01 can be changed to <x> with the --exp-tailp <x> option). When --exp-ffile <f> is used the file <f> will include the exponential tail parameters for fits to various fractions of the histogram tail, instead of just to 0.01.

--fil-N <n>: Set the number of sequences sampled and searched for the HMM filter threshold calibration to <n>. By default, <n> is 10,000.

--fil-F <x>: Set the fraction of sample sequences the HMM filter must be able to recognize, and allow to survive, to <x>, where <x> is a positive real number less than or equal to 1.0. By default, <x> is 0.993.

--fil-tau <x>: Set the tail loss probability during HMM band calculation for HMM filter threshold calibration to <x>. This is the amount of probability mass within the HMM posterior probabilities that is considered negligible. The default value is 1E-7. In general, higher values will result in greater acceleration, but increase the chance of missing the optimal alignment due to the HMM bands.

--fil-gemit: During HMM filter calibration, always sample sequences from a globally configured CM, even when calibrating local modes. By default, sequences are sampled from a globally configured CM when calibrating the global search modes, and sampled from a locally configured CM when calibrating the local search modes.

--fil-dfile <f>: Save statistics on filter threshold calibration, including HMM and CM scores for all sampled sequences, to file <f>.

--mxsize <x>: Set the maximum allowable DP matrix size to <x> megabytes. By default this size is 2,048 Mb. This should be large enough for the vast majority of calibrations, however if it is not cmcalibrate will exit prematurely and report an error message that the matrix exceeded it's maximum allowable size. In this case, the --mxsize can be used to raise the limit.

COPYRIGHT¶

Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).

See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR¶

Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/

October 2009

Infernal 1.0.2

Source file:	cmcalibrate.1.en.gz (from infernal 1.0.2-2)
Source last updated:	2011-09-27T15:41:55Z
Converted to HTML:	2017-06-07T16:48:39Z