NAME¶
- cmsearch - search a sequence database for RNAs homologous
to a CM
-
SYNOPSIS¶
cmsearch [options] cmfile seqfile
DESCRIPTION¶
cmsearch uses the covariance model (CM) in
cmfile to search for
homologous RNAs in
seqfile, and outputs high-scoring alignments.
Currently, the sequence file must be in FASTA format.
CMs are profiles of RNA consensus sequence and secondary structure. A CM file is
produced by the
cmbuild program, from a given RNA sequence alignment of
known consensus structure. CM files can be calibrated prior to running
cmsearch with the
cmcalibrate program. Searches with calibrated
CM files will include E-values and will use appropriate filter thresholds for
acceleration. It is strongly recommended to calibrate your CM files before
using
cmsearch. CM calibration is described in more detail below and in
chapters 5 and 6 of the User's Guide.
cmsearch output consists of alignments of all hits in the database sorted
by decreasing score per sequence and per strand. That is, all hits for the
same sequence and the same (Watson or Crick) strand are sorted, but hits
across sequences or strands are not sorted.
The threshold for reporting scores is different depending on whether the CM file
has been calibrated or not. If the CM file has been calibrated, the default
reporting threshold is an E-value of 1.0. This is the threshold at which 1 hit
is expected by chance. It is possible to manually set the threshold to bit
score
<x> using the
-T <x> option as
described below, or to E-value
<x> using the
-E
<x> option. The
-E option will only work if the CM file has
been calibrated.
RNA homology search with CMs is slow. To speed it up,
cmsearch by default
uses two rounds of filters with faster algorithms to prune the database prior
to searching with the slow CM algorithm. The first round of filtering is
faster but less strict than the second round. First, the full database is
searched with the first round filter, then any hits that survive the first
round are searched with the second round filter. Finally any hits that survive
the first and second round of filtering are searched with the final round
search strategy. During the filter rounds, hits are padded with a short
stretch of residues on either side prior to searching with the subsequent
round. The exact number of residues is dependent on the size of the model
being searched with.
The first round of filtering is performed with an HMM. If the CM file is
calibrated, the threshold for the HMM filter will be automatically chosen as
an appropriate one as determined in
cmcalibrate. The minimum threshold
that will automatically be chosen is the threshold that will allow a predicted
fraction (0.02 by default, changeable to
<x> with
--fil-Smin-hmm <x> ) of the database to survive the
filter. The maximum threshold that will automatically be chosen is the
threshold that will allow a predicted fraction (0.5 by default, changeable to
<x> with
--fil-Smax-hmm <x> ) of the database
to survive the filter. If the threshold from
cmcalibrate is greater
than this maximum fraction, the HMM filter will be turned off and not used. To
ensure that the HMM filter is never turned off and always uses a threshold
that gives this maximum fraction you must use the
--fil-A-hmm option.
If the model is not calibrated, the default HMM filter threshold is 3.0 bits.
The HMM filter threshold can be manually set to bit score
<x>
using the
--fil-T-hmm <x> option as described below, or to
E-value
<x> using the
--fil-E-hmm <x> option,
or to the bit score that will allow a predicted database fraction of
<x> to survive the filter using the
--fil-S-hmm
<x> option. The
--fil-E-hmm and
--fil-S-hmm options
will only work if the CM file has been calibrated. The HMM filter can be
turned off with the
--fil-no-hmm option.
The second round of filtering is performed with the CM CYK algorithm (not an
HMM) using query-dependent banding (QDB) for acceleration. Briefly, QDB
precalculates regions of the dynamic programming matrix that have negligible
probability based on the query CM's transition probabilities. During search,
these regions of the matrix are ignored to make searches faster. For more
information on QDB see (Nawrocki and Eddy, PLoS Computational Biology 3(3):
e56). The beta paramater is the amount of probability mass considered
negligible during band calculation, lower values of beta yield greater
speedups but also a greater chance of missing the optimal alignment. The
default beta is 1E-10: determined empirically as a good tradeoff between
sensitivity and speed, though this value can be changed with the
--fil-beta <x> option. If the CM file has been calibrated,
the QDB filter threshold will be automatic set to an appropriate value using
an ad-hoc procedure (see the User's Guide). If the CM file has not been
calibrated, the default QDB filter threshold is 0.0 bits. The QDB filter
threshold can be manually set to bit score
<x> using the
--fil-T-qdb <x> option as described below, or to E-value
<x> using the
--fil-E-qdb <x> option. The
--fil-E-qdb option will only work if the CM file has been calibrated.
The QDB filter can be turned off with the
--fil-no-qdb option.
Another way to accelerate
cmsearch is to run it in parallel with MPI on
multiple computers. To do this, use the
--mpi option and run
cmsearch inside a MPI wrapper program such as
mpirun. For
example:
mpirun C cmsearch --mpi [other options]
cmfile seqfile. The
cmsearch program must have been
compiled in MPI mode for this to work. See the Installation section of the
User's Guide for more information.
The
--forecast <n> option will estimate how long a search
will take for your
cmfile and
seqfile on
<n>
processors. Unless you plan on running
cmsearch in MPI mode,
<n> should be set as 1.
Another technique for accelerated CM homology search with HMM filters is the
construction and use of a "rigorous filter" HMM which was developed
by Zasha Weinberg and Larry Ruzzo. All hits above a certain CM bit score
threshold are guaranteed to survive the HMM filtering step. Their
implementation of rigorous filters has been included in previous versions of
Infernal, but not in the current version. For more information see the User's
Guide.
OUTPUT¶
By default,
cmsearch outputs the alignments of search hits that score
above the final search round threshold. The format of this output is described
in the "Tutorial" section of the User's Guide. This format has
purposefully not been changed from the 0.x versions of Infernal so as not to
break existing parsers. However, it can be augmented with a line of output
that marks non-compensatory (negative scoring) basepairs with an 'x' by using
the
-x option. Alternatively, only negative scoring non-canonical
basepairs (those other than A:U, U:A, C:G, G:C, U:G, and G:U) are marked if
the
-v option is enabled. These two options were added to facilitate
quick analysis of the secondary structure of hits by eye. Additionally, the
-p option can be used to annotate the posterior probability of each
aligned residue in the hit alignments as described below.
The
--tabfile <f> outputs a tabular representation of the
hits found by
cmsearch to the file
<f>. Each non- such
line has 9 fields: "model name" the name of the CM used for the
search, "target name" the name of the target sequence the hit was
found in, "target coord - start": the start position of the hit in
the target sequence, "target coord - stop": the end position of hit
in the target sequence, "query coord - start": the start position of
the hit in the query model, "query coord - stop": the end position
of hit in the query sequence, "bit sc": the bit score of the hit,
"E-value": the E-value of the hit (if available, "-" if
not), and "GC" the percentage of G and C residues in the hit within
the target sequence.
cmsearch tab files can be used as input to the
Easel miniapp
esl-sfetch (included in the easel/miniapp/ subdirectory
of infernal) with the
-C -f --tabfile options to extract all the hits
from the target database file to a new FASTA file. This file can then be
aligned to a CM with
cmalign.
OPTIONS¶
- -h
- Print brief help; includes version number and summary of
all options, including expert options.
- -o <f>
- Save the high-scoring alignments of hits to a file
<f>. The default is to write them to standard output.
- -g <f>
- Turn on the 'glocal' alignment algorithm, local with
respect to the target database, and global with respect to the model. By
default, the local alignment algorithm is used which is local with respect
to both the target sequence and the model. In local mode, the alignment to
span two or more subsequences if necessary (e.g. if the structures of the
query model and target sequence are only partially shared), allowing
certain large insertions and deletions in the structure to be penalized
differently than normal indels. Local mode performs better on empirical
benchmarks and is significantly more sensitive for remote homology
detection. Empirically, glocal searches return many fewer hits than local
searches, so glocal may be desired for some applications.
- -p
- Append posterior probabilities to alignments of hits. For
more information on posterior probabilities see the description of the
-p option in the manual page for cmalign.
- -x
- Annotate negative scoring basepairs and basepairs that
include a gap in the left or right half of the pair (but not both) with
x's in the alignments of hits. The x's appear above the structural
annotation in the alignment output. Basepairs without x's above them are
compensatory with respect to the model. Compensatory mutations are good
evidence for structural homology.
- -v
- Very similar to -x, but only mark negative scoring
basepairs that are non-canonical basepairs (not an A:U, U:A, C:G, G:C, G:U
or U:G), and mark them with a 'v' instead of an 'x' in the output.
- -Z <x>
- Calculate E-values as if the target database size was
<x> megabases (Mb). Ignore the actual size of the database.
This option is only valid if the CM file has been calibrated. Warning: the
predictions for timings and survival fractions will be calculated as if
the database was of size <x> Mb, which means they will be
inaccurate.
- --toponly
- Only search the top (Watson) strand of the sequences in
seqfile. By default, both strands are searched.
- --bottomonly
- Only search the bottom (Crick) strand of the sequences in
seqfile. By default, both strands are searched.
- --forecast <n>
- Predict the running time of the search with provided files
and options and exit, DO NOT perform the search. This option is
only available with calibrated CM files. The predictions should be used as
rough estimates and can be fairly inaccurate, especially for highly biased
target databases (for example 80% AT genomes). The value for
<n> is the number of processors the search will be run on, so
<n> equal to 1 is appropriate unless you will run
cmsearch in parallel with MPI.
- --informat <s>
- Assert that the input seqfile is in format
<s>. Do not run Babelfish format autodection. This increases
the reliability of the program somewhat, because the Babelfish can make
mistakes; particularly recommended for unattended, high-throughput runs of
Infernal. <s> is case-insensitive. Acceptable formats are:
FASTA, EMBL, UNIPROT, GENBANK, and DDBJ. <s> is
case-insensitive.
- --mxsize <x>
- Set the maximum allowable DP matrix size to
<x> megabytes. By default this size is 2,048 Mb. This should
be large enough for the vast majority of alignments, however if it is not
cmsearch will exit prematurely and report an error message that the
matrix exceeded it's maximum allowable size. In this case, the
--mxsize can be used to raise the limit.
- --devhelp
- Print help, as with -h , but also include
undocumented developer options. These options are not listed below, are
under development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only resources for
understanding what they actually do are the brief one-line description
printed when --devhelp is enabled, and the source code.
- --mpi
- Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see User's Guide for details).
EXPERT OPTIONS¶
- --inside
- Use the Inside algorithm for the final round of searching.
This is true by default.
- --cyk
- Use the CYK algorithm for the final round of searching.
- --forward
- Search only with an HMM. This is much faster but less
sensitive than a CM search. Use the Forward algorithm for the HMM search.
- --viterbi
- Search only with an HMM. This is much faster but less
sensitive than a CM search. Use the Viterbi algorithm for the HMM search.
- -E <x>
- Set the E-value cutoff for the per-sequence/strand ranked
hit list to <x>, where <x> is a positive real
number. Hits with E-values better than (less than) or equal to this
threshold will be shown. This option is only available if the CM file has
been calibrated. This threshold is relevant only to the final round of
searching performed after all filters have been used, not to the filter
rounds themselves.
- -T <x>
- Set the bit score cutoff for the per-sequence ranked hit
list to <x>, where <x> is a positive real
number. Hits with bit scores better than (greater than) this threshold
will be shown. This threshold is relevant only to the final round of
searching performed after all filters have been used, not to the filter
rounds themselves.
- --nc
- Set the bit score cutoff as the NC cutoff value used by
Rfam curators as the noise cutoff score. This is the highest scoring hit
found by this model during Rfam curation that the Rfam curators defined as
a noise (false positive) sequence. The NC cutoff is defined as
<x> bits in the original Stockholm alignment the model was
built from with a line: #=GF NC <x> positioned before the
sequence alignment. If such a line existed in the alignment provided to
cmbuild then the --nc option will be available in
cmsearch. If no such line existed when cmbuild was run, then
using the --nc option to cmsearch will cause the program to
print an error message and exit.
- --ga
- Set the bit score cutoff as the GA cutoff value used by
Rfam curators as the gathering threshold. The GA cutoff is defined in a
stockholm file used to build the model in the same way as the NC cutoff
(see above), but with a line: #=GF GA <x>
- --tc
- Set the bit score cutoff as the TC cutoff value used by
Rfam curators as the trusted cutoff. The TC cutoff is defined in the
stockholm file used to build the model in the same way as the NC cutoff
(see above), but with a line: #=GF TC <x>
- --no-qdb
- Do not use query-dependent banding (QDB) for the final
round of search. By default, QDB is used in the final round of search with
beta = 1E-15, after all filtering is finished.
- --beta <x>
- For query-dependent banding (QDB) during the final round of
search, set the beta parameter to <x> where <x>
is any positive real number less than 1.0. Beta is the probability mass
considered negligible during band calculation. The default beta for the
final round of search is 1E-15.
- --hbanded
- Use HMM bands to accelerate the final round of search.
Constraints for the CM search are derived from posterior probabilities
from an HMM. This is an experimental option and it is not recommended for
use unless you know exactly what you're doing.
- --tau <x>
- Set the tail loss probability during HMM band calculation
to <x>. This is the amount of probability mass within the HMM
posterior probabilities that is considered negligible. The default value
is 1E-7. In general, higher values will result in greater acceleration,
but increase the chance of missing the optimal alignment due to the HMM
bands. This option only makes sense in combination with --hbanded
- --fil-no-hmm
- Turn the HMM filter off.
- --fil-no-qdb
- Turn the QDB filter off.
- --fil-beta
- For the QDB filter, set the beta parameter to
<x> where <x> is any positive real number less
than 1.0. Beta is the probability mass considered negligible during band
calculation. The default beta for the QDB filter round of search is 1E-10.
- --fil-T-qdb <x>
- Set the bit score cutoff for the QDB filter round to
<x>, where <x> is a positive real number. Hits
with bit scores better than (greater than) this threshold will survive the
QDB filter and be passed to the final round.
- --fil-T-hmm <x>
- Set the bit score cutoff for the HMM filter round to
<x>, where <x> is a positive real number. Hits
with bit scores better than (greater than) this threshold will survive the
HMM filter and be passed to the next round, either a QDB filter round, or
if the QDB filter is disabled, to the final round of search.
- --fil-E-qdb <x>
- Set the E-value cutoff for the QDB filter round.
<x>, where <x> is a positive real number. Hits
with E-values better than (less than) or equal to this threshold will
survive and be passed to the final round. This option is only available if
the CM file has been calibrated.
- --fil-E-hmm <x>
- Set the E-value cutoff for the HMM filter round.
<x>, where <x> is a positive real number. Hits
with E-values better than (less than) or equal to this threshold will
survive and be passed to the next round, either a QDB filter round, or if
the QDB filter is disable, to the final round of search. This option is
only available if the CM file has been calibrated.
- --fil-S-hmm <x>
- Set the bit score cutoff for the HMM filter round as the
score that will allow a predicted <x> fraction of the
database to survive the HMM filter round, where <x> is a
positive real number between 0 and 1.
- --fil-Smax-hmm <x>
- When using automatically calibrated HMM thresholds for a CM
file calibrated with cmcalibrate, set the maximum HMM filter
threshold as the score that will allow a predicted <x>
fraction of the database to survive the filter. If the automatic threshold
from cmcalibrate exceeds this value, turn the HMM filter off and do
not use it for the search. By default, this option is ON with the default
value of 0.5 used for <x>. To modify the behavior of this
option so it does not turn off the HMM filter if exceeded use the
--fil-A-hmm option described below.
- --fil-Smin-hmm <x>
- When using automatically calibrated HMM thresholds for a CM
file calibrated with cmcalibrate, set the minimum HMM filter
threshold as the score that will allow a predicted <x>
fraction of the database to survive the filter. By default, this option is
ON with the default value of 0.02 used for <x>. Setting
<x> lower will only accelerate the majority of searches by a
small amount.
- --fil-A-hmm
- Always enforce the maximum HMM filter threshold of
<x> from --fil-Smax-hmm <x>. That
is, never turn off the HMM filter, or set its threshold above the score
that will allow a predicted <x> fraction of the database to
survive. This option is OFF by default.
- --hmm-W <n>
- Set the HMM window size W (maximum size of a hit) to
<n>. This option only works in combination with
--forward or --viterbi. By default, W is calculated
automatically, but this automatic calculation is time consuming for large
models.
- --hmm-cW <x>
- Set the HMM window size W (maximum size of a hit) as
<x> times the consensus length of the CM. The consensus
length (clen) of the CM can be determined using the cmstat program.
This option only works in combination with --forward or
--viterbi. By default, W is calculated automatically, but this
automatic calculation is time consuming for large models. To find
potential full length hits to the model <x> should be greater
than 1.0, but values above 2.0 are probably wasteful.
- --noalign
- Do not calculate and print alignments of each hit, only
print locations and scores.
- --aln-hbanded
- Use HMM bands to accelerate alignment during the hit
alignment stage.
- --aln-optacc
- Calculate alignments of hits from final round of search
using the optimal accuracy algorithm which computes the alignment that
maximizes the summed posterior probability of all aligned residues given
the model, which can be different from the highest scoring one.
- --tabfile <f>
- Create a new output file <f> and print tabular
results to it. The format of the tabular results is listed in the
OUTPUT section. The tabular results can be more easily parsed by
scripts than the default cmsearch output. The esl-sfetch
miniapp included in the easel/miniapps/ subdirectory of infernal has a
--tabfile option that allows it to read cmsearch tab files
and fetch the hits reported within them from the target database into a
new sequence file.
- --gcfile <f>
- Create a new output file <f> and print
statistics of the GC content of the sequences in seqfile to it. The
sequences are partitioned into 100 nt non-overlapping windows, and the GC
percentage of each window is calculated. A normalized histogram of those
GC percentages is then printed to <f> This file can be
generated even if cmsearch is run with --forecast and no
search is performed.
- --rna
- Output the hit alignments as RNA sequences alignments. This
is true by default.
- --dna
- Output the hit alignments as DNA sequence alignments.
SEE ALSO¶
For complete documentation, see the User's Guide (Userguide.pdf) that came with
the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT¶
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution
conditions.
AUTHOR¶
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/