NAME¶
  - cmsearch - search a sequence database for RNAs homologous
    to a CM
 
  - 
    
 
   
SYNOPSIS¶
cmsearch [options] cmfile seqfile
 
DESCRIPTION¶
cmsearch uses the covariance model (CM) in 
cmfile to search for
  homologous RNAs in 
seqfile, and outputs high-scoring alignments.
 
Currently, the sequence file must be in FASTA format.
 
CMs are profiles of RNA consensus sequence and secondary structure. A CM file is
  produced by the 
cmbuild program, from a given RNA sequence alignment of
  known consensus structure. CM files can be calibrated prior to running
  
cmsearch with the 
cmcalibrate program. Searches with calibrated
  CM files will include E-values and will use appropriate filter thresholds for
  acceleration. It is strongly recommended to calibrate your CM files before
  using 
cmsearch. CM calibration is described in more detail below and in
  chapters 5 and 6 of the User's Guide.
 
cmsearch output consists of alignments of all hits in the database sorted
  by decreasing score per sequence and per strand. That is, all hits for the
  same sequence and the same (Watson or Crick) strand are sorted, but hits
  across sequences or strands are not sorted.
 
The threshold for reporting scores is different depending on whether the CM file
  has been calibrated or not. If the CM file has been calibrated, the default
  reporting threshold is an E-value of 1.0. This is the threshold at which 1 hit
  is expected by chance. It is possible to manually set the threshold to bit
  score 
<x> using the 
-T <x> option as
  described below, or to E-value 
<x> using the 
-E
  <x> option. The 
-E option will only work if the CM file has
  been calibrated.
 
RNA homology search with CMs is slow. To speed it up, 
cmsearch by default
  uses two rounds of filters with faster algorithms to prune the database prior
  to searching with the slow CM algorithm. The first round of filtering is
  faster but less strict than the second round. First, the full database is
  searched with the first round filter, then any hits that survive the first
  round are searched with the second round filter. Finally any hits that survive
  the first and second round of filtering are searched with the final round
  search strategy. During the filter rounds, hits are padded with a short
  stretch of residues on either side prior to searching with the subsequent
  round. The exact number of residues is dependent on the size of the model
  being searched with.
 
The first round of filtering is performed with an HMM. If the CM file is
  calibrated, the threshold for the HMM filter will be automatically chosen as
  an appropriate one as determined in 
cmcalibrate. The minimum threshold
  that will automatically be chosen is the threshold that will allow a predicted
  fraction (0.02 by default, changeable to 
<x> with
  
--fil-Smin-hmm <x> ) of the database to survive the
  filter. The maximum threshold that will automatically be chosen is the
  threshold that will allow a predicted fraction (0.5 by default, changeable to
  
<x> with 
--fil-Smax-hmm <x> ) of the database
  to survive the filter. If the threshold from 
cmcalibrate is greater
  than this maximum fraction, the HMM filter will be turned off and not used. To
  ensure that the HMM filter is never turned off and always uses a threshold
  that gives this maximum fraction you must use the 
--fil-A-hmm option.
  If the model is not calibrated, the default HMM filter threshold is 3.0 bits.
  The HMM filter threshold can be manually set to bit score 
<x>
  using the 
--fil-T-hmm <x> option as described below, or to
  E-value 
<x> using the 
--fil-E-hmm <x> option,
  or to the bit score that will allow a predicted database fraction of
  
<x> to survive the filter using the 
--fil-S-hmm
  <x> option. The 
--fil-E-hmm and 
--fil-S-hmm options
  will only work if the CM file has been calibrated. The HMM filter can be
  turned off with the 
--fil-no-hmm option.
 
The second round of filtering is performed with the CM CYK algorithm (not an
  HMM) using query-dependent banding (QDB) for acceleration. Briefly, QDB
  precalculates regions of the dynamic programming matrix that have negligible
  probability based on the query CM's transition probabilities. During search,
  these regions of the matrix are ignored to make searches faster. For more
  information on QDB see (Nawrocki and Eddy, PLoS Computational Biology 3(3):
  e56). The beta paramater is the amount of probability mass considered
  negligible during band calculation, lower values of beta yield greater
  speedups but also a greater chance of missing the optimal alignment. The
  default beta is 1E-10: determined empirically as a good tradeoff between
  sensitivity and speed, though this value can be changed with the
  
--fil-beta <x> option. If the CM file has been calibrated,
  the QDB filter threshold will be automatic set to an appropriate value using
  an ad-hoc procedure (see the User's Guide). If the CM file has not been
  calibrated, the default QDB filter threshold is 0.0 bits. The QDB filter
  threshold can be manually set to bit score 
<x> using the
  
--fil-T-qdb <x> option as described below, or to E-value
  
<x> using the 
--fil-E-qdb <x> option. The
  
--fil-E-qdb option will only work if the CM file has been calibrated.
  The QDB filter can be turned off with the 
--fil-no-qdb option.
 
Another way to accelerate 
cmsearch is to run it in parallel with MPI on
  multiple computers. To do this, use the 
--mpi option and run
  
cmsearch inside a MPI wrapper program such as 
mpirun. For
  example: 
mpirun C cmsearch --mpi [other options]
  cmfile seqfile. The 
cmsearch program must have been
  compiled in MPI mode for this to work. See the Installation section of the
  User's Guide for more information.
 
The 
--forecast <n> option will estimate how long a search
  will take for your 
cmfile and 
seqfile on 
<n>
  processors. Unless you plan on running 
cmsearch in MPI mode,
  
<n> should be set as 1.
 
Another technique for accelerated CM homology search with HMM filters is the
  construction and use of a "rigorous filter" HMM which was developed
  by Zasha Weinberg and Larry Ruzzo. All hits above a certain CM bit score
  threshold are guaranteed to survive the HMM filtering step. Their
  implementation of rigorous filters has been included in previous versions of
  Infernal, but not in the current version. For more information see the User's
  Guide.
 
OUTPUT¶
By default, 
cmsearch outputs the alignments of search hits that score
  above the final search round threshold. The format of this output is described
  in the "Tutorial" section of the User's Guide. This format has
  purposefully not been changed from the 0.x versions of Infernal so as not to
  break existing parsers. However, it can be augmented with a line of output
  that marks non-compensatory (negative scoring) basepairs with an 'x' by using
  the 
-x option. Alternatively, only negative scoring non-canonical
  basepairs (those other than A:U, U:A, C:G, G:C, U:G, and G:U) are marked if
  the 
-v option is enabled. These two options were added to facilitate
  quick analysis of the secondary structure of hits by eye. Additionally, the
  
-p option can be used to annotate the posterior probability of each
  aligned residue in the hit alignments as described below.
 
The 
--tabfile <f> outputs a tabular representation of the
  hits found by 
cmsearch to the file 
<f>. Each non- such
  line has 9 fields: "model name" the name of the CM used for the
  search, "target name" the name of the target sequence the hit was
  found in, "target coord - start": the start position of the hit in
  the target sequence, "target coord - stop": the end position of hit
  in the target sequence, "query coord - start": the start position of
  the hit in the query model, "query coord - stop": the end position
  of hit in the query sequence, "bit sc": the bit score of the hit,
  "E-value": the E-value of the hit (if available, "-" if
  not), and "GC" the percentage of G and C residues in the hit within
  the target sequence. 
cmsearch tab files can be used as input to the
  Easel miniapp 
esl-sfetch (included in the easel/miniapp/ subdirectory
  of infernal) with the 
-C -f --tabfile options to extract all the hits
  from the target database file to a new FASTA file. This file can then be
  aligned to a CM with 
cmalign.
 
OPTIONS¶
  - -h
 
  - Print brief help; includes version number and summary of
      all options, including expert options.
    
 
   
  - -o <f>
 
  - Save the high-scoring alignments of hits to a file
      <f>. The default is to write them to standard output.
    
 
   
  - -g <f>
 
  - Turn on the 'glocal' alignment algorithm, local with
      respect to the target database, and global with respect to the model. By
      default, the local alignment algorithm is used which is local with respect
      to both the target sequence and the model. In local mode, the alignment to
      span two or more subsequences if necessary (e.g. if the structures of the
      query model and target sequence are only partially shared), allowing
      certain large insertions and deletions in the structure to be penalized
      differently than normal indels. Local mode performs better on empirical
      benchmarks and is significantly more sensitive for remote homology
      detection. Empirically, glocal searches return many fewer hits than local
      searches, so glocal may be desired for some applications.
    
 
   
  - -p
 
  - Append posterior probabilities to alignments of hits. For
      more information on posterior probabilities see the description of the
      -p option in the manual page for cmalign.
    
 
   
  - -x
 
  - Annotate negative scoring basepairs and basepairs that
      include a gap in the left or right half of the pair (but not both) with
      x's in the alignments of hits. The x's appear above the structural
      annotation in the alignment output. Basepairs without x's above them are
      compensatory with respect to the model. Compensatory mutations are good
      evidence for structural homology.
    
 
   
  - -v
 
  - Very similar to -x, but only mark negative scoring
      basepairs that are non-canonical basepairs (not an A:U, U:A, C:G, G:C, G:U
      or U:G), and mark them with a 'v' instead of an 'x' in the output.
    
 
   
  - -Z <x>
 
  - Calculate E-values as if the target database size was
      <x> megabases (Mb). Ignore the actual size of the database.
      This option is only valid if the CM file has been calibrated. Warning: the
      predictions for timings and survival fractions will be calculated as if
      the database was of size <x> Mb, which means they will be
      inaccurate.
    
 
   
  - --toponly
 
  - Only search the top (Watson) strand of the sequences in
      seqfile. By default, both strands are searched.
    
 
   
  - --bottomonly
 
  - Only search the bottom (Crick) strand of the sequences in
      seqfile. By default, both strands are searched.
    
 
   
  - --forecast <n>
 
  - Predict the running time of the search with provided files
      and options and exit, DO NOT perform the search. This option is
      only available with calibrated CM files. The predictions should be used as
      rough estimates and can be fairly inaccurate, especially for highly biased
      target databases (for example 80% AT genomes). The value for
      <n> is the number of processors the search will be run on, so
      <n> equal to 1 is appropriate unless you will run
      cmsearch in parallel with MPI.
    
 
     
   
  - --informat <s>
 
  - Assert that the input seqfile is in format
      <s>. Do not run Babelfish format autodection. This increases
      the reliability of the program somewhat, because the Babelfish can make
      mistakes; particularly recommended for unattended, high-throughput runs of
      Infernal. <s> is case-insensitive. Acceptable formats are:
      FASTA, EMBL, UNIPROT, GENBANK, and DDBJ. <s> is
      case-insensitive.
    
 
   
  - --mxsize <x>
 
  - Set the maximum allowable DP matrix size to
      <x> megabytes. By default this size is 2,048 Mb. This should
      be large enough for the vast majority of alignments, however if it is not
      cmsearch will exit prematurely and report an error message that the
      matrix exceeded it's maximum allowable size. In this case, the
      --mxsize can be used to raise the limit.
    
 
   
  - --devhelp
 
  - Print help, as with -h , but also include
      undocumented developer options. These options are not listed below, are
      under development or experimental, and are not guaranteed to even work
      correctly. Use developer options at your own risk. The only resources for
      understanding what they actually do are the brief one-line description
      printed when --devhelp is enabled, and the source code.
    
 
   
  - --mpi
 
  - Run as an MPI parallel program. This option will only be
      available if Infernal has been configured and built with the
      "--enable-mpi" flag (see User's Guide for details).
    
 
   
EXPERT OPTIONS¶
  - --inside
 
  - Use the Inside algorithm for the final round of searching.
      This is true by default.
    
 
   
  - --cyk
 
  - Use the CYK algorithm for the final round of searching.
    
 
   
  - --forward
 
  - Search only with an HMM. This is much faster but less
      sensitive than a CM search. Use the Forward algorithm for the HMM search.
    
 
   
  - --viterbi
 
  - Search only with an HMM. This is much faster but less
      sensitive than a CM search. Use the Viterbi algorithm for the HMM search.
    
 
   
  - -E <x>
 
  - Set the E-value cutoff for the per-sequence/strand ranked
      hit list to <x>, where <x> is a positive real
      number. Hits with E-values better than (less than) or equal to this
      threshold will be shown. This option is only available if the CM file has
      been calibrated. This threshold is relevant only to the final round of
      searching performed after all filters have been used, not to the filter
      rounds themselves.
    
 
   
  - -T <x>
 
  - Set the bit score cutoff for the per-sequence ranked hit
      list to <x>, where <x> is a positive real
      number. Hits with bit scores better than (greater than) this threshold
      will be shown. This threshold is relevant only to the final round of
      searching performed after all filters have been used, not to the filter
      rounds themselves.
    
 
   
  - --nc
 
  - Set the bit score cutoff as the NC cutoff value used by
      Rfam curators as the noise cutoff score. This is the highest scoring hit
      found by this model during Rfam curation that the Rfam curators defined as
      a noise (false positive) sequence. The NC cutoff is defined as
      <x> bits in the original Stockholm alignment the model was
      built from with a line: #=GF NC <x> positioned before the
      sequence alignment. If such a line existed in the alignment provided to
      cmbuild then the --nc option will be available in
      cmsearch. If no such line existed when cmbuild was run, then
      using the --nc option to cmsearch will cause the program to
      print an error message and exit.
    
 
   
  - --ga
 
  - Set the bit score cutoff as the GA cutoff value used by
      Rfam curators as the gathering threshold. The GA cutoff is defined in a
      stockholm file used to build the model in the same way as the NC cutoff
      (see above), but with a line: #=GF GA <x>
    
 
   
  - --tc
 
  - Set the bit score cutoff as the TC cutoff value used by
      Rfam curators as the trusted cutoff. The TC cutoff is defined in the
      stockholm file used to build the model in the same way as the NC cutoff
      (see above), but with a line: #=GF TC <x>
    
 
   
  - --no-qdb
 
  - Do not use query-dependent banding (QDB) for the final
      round of search. By default, QDB is used in the final round of search with
      beta = 1E-15, after all filtering is finished.
    
 
   
  - --beta <x>
 
  - For query-dependent banding (QDB) during the final round of
      search, set the beta parameter to <x> where <x>
      is any positive real number less than 1.0. Beta is the probability mass
      considered negligible during band calculation. The default beta for the
      final round of search is 1E-15.
    
 
   
  - --hbanded
 
  - Use HMM bands to accelerate the final round of search.
      Constraints for the CM search are derived from posterior probabilities
      from an HMM. This is an experimental option and it is not recommended for
      use unless you know exactly what you're doing.
    
 
   
  - --tau <x>
 
  - Set the tail loss probability during HMM band calculation
      to <x>. This is the amount of probability mass within the HMM
      posterior probabilities that is considered negligible. The default value
      is 1E-7. In general, higher values will result in greater acceleration,
      but increase the chance of missing the optimal alignment due to the HMM
      bands. This option only makes sense in combination with --hbanded
    
 
   
  - --fil-no-hmm
 
  - Turn the HMM filter off.
    
 
   
  - --fil-no-qdb
 
  - Turn the QDB filter off.
    
 
   
  - --fil-beta
 
  - For the QDB filter, set the beta parameter to
      <x> where <x> is any positive real number less
      than 1.0. Beta is the probability mass considered negligible during band
      calculation. The default beta for the QDB filter round of search is 1E-10.
    
 
   
  - --fil-T-qdb <x>
 
  - Set the bit score cutoff for the QDB filter round to
      <x>, where <x> is a positive real number. Hits
      with bit scores better than (greater than) this threshold will survive the
      QDB filter and be passed to the final round.
    
 
   
  - --fil-T-hmm <x>
 
  - Set the bit score cutoff for the HMM filter round to
      <x>, where <x> is a positive real number. Hits
      with bit scores better than (greater than) this threshold will survive the
      HMM filter and be passed to the next round, either a QDB filter round, or
      if the QDB filter is disabled, to the final round of search.
    
 
   
  - --fil-E-qdb <x>
 
  - Set the E-value cutoff for the QDB filter round.
      <x>, where <x> is a positive real number. Hits
      with E-values better than (less than) or equal to this threshold will
      survive and be passed to the final round. This option is only available if
      the CM file has been calibrated.
    
 
   
  - --fil-E-hmm <x>
 
  - Set the E-value cutoff for the HMM filter round.
      <x>, where <x> is a positive real number. Hits
      with E-values better than (less than) or equal to this threshold will
      survive and be passed to the next round, either a QDB filter round, or if
      the QDB filter is disable, to the final round of search. This option is
      only available if the CM file has been calibrated.
    
 
     
   
  - --fil-S-hmm <x>
 
  - Set the bit score cutoff for the HMM filter round as the
      score that will allow a predicted <x> fraction of the
      database to survive the HMM filter round, where <x> is a
      positive real number between 0 and 1.
    
 
   
  - --fil-Smax-hmm <x>
 
  - When using automatically calibrated HMM thresholds for a CM
      file calibrated with cmcalibrate, set the maximum HMM filter
      threshold as the score that will allow a predicted <x>
      fraction of the database to survive the filter. If the automatic threshold
      from cmcalibrate exceeds this value, turn the HMM filter off and do
      not use it for the search. By default, this option is ON with the default
      value of 0.5 used for <x>. To modify the behavior of this
      option so it does not turn off the HMM filter if exceeded use the
      --fil-A-hmm option described below.
    
 
   
  - --fil-Smin-hmm <x>
 
  - When using automatically calibrated HMM thresholds for a CM
      file calibrated with cmcalibrate, set the minimum HMM filter
      threshold as the score that will allow a predicted <x>
      fraction of the database to survive the filter. By default, this option is
      ON with the default value of 0.02 used for <x>. Setting
      <x> lower will only accelerate the majority of searches by a
      small amount.
    
 
   
  - --fil-A-hmm
 
  - Always enforce the maximum HMM filter threshold of
      <x> from --fil-Smax-hmm <x>. That
      is, never turn off the HMM filter, or set its threshold above the score
      that will allow a predicted <x> fraction of the database to
      survive. This option is OFF by default.
    
 
   
  - --hmm-W <n>
 
  - Set the HMM window size W (maximum size of a hit) to
      <n>. This option only works in combination with
      --forward or --viterbi. By default, W is calculated
      automatically, but this automatic calculation is time consuming for large
      models.
    
 
   
  - --hmm-cW <x>
 
  - Set the HMM window size W (maximum size of a hit) as
      <x> times the consensus length of the CM. The consensus
      length (clen) of the CM can be determined using the cmstat program.
      This option only works in combination with --forward or
      --viterbi. By default, W is calculated automatically, but this
      automatic calculation is time consuming for large models. To find
      potential full length hits to the model <x> should be greater
      than 1.0, but values above 2.0 are probably wasteful.
    
 
   
  - --noalign
 
  - Do not calculate and print alignments of each hit, only
      print locations and scores.
    
 
   
  - --aln-hbanded
 
  - Use HMM bands to accelerate alignment during the hit
      alignment stage.
    
 
   
  - --aln-optacc
 
  - Calculate alignments of hits from final round of search
      using the optimal accuracy algorithm which computes the alignment that
      maximizes the summed posterior probability of all aligned residues given
      the model, which can be different from the highest scoring one.
    
 
   
  - --tabfile <f>
 
  - Create a new output file <f> and print tabular
      results to it. The format of the tabular results is listed in the
      OUTPUT section. The tabular results can be more easily parsed by
      scripts than the default cmsearch output. The esl-sfetch
      miniapp included in the easel/miniapps/ subdirectory of infernal has a
      --tabfile option that allows it to read cmsearch tab files
      and fetch the hits reported within them from the target database into a
      new sequence file.
    
 
   
  - --gcfile <f>
 
  - Create a new output file <f> and print
      statistics of the GC content of the sequences in seqfile to it. The
      sequences are partitioned into 100 nt non-overlapping windows, and the GC
      percentage of each window is calculated. A normalized histogram of those
      GC percentages is then printed to <f> This file can be
      generated even if cmsearch is run with --forecast and no
      search is performed.
    
 
   
  - --rna
 
  - Output the hit alignments as RNA sequences alignments. This
      is true by default.
    
 
   
  - --dna
 
  - Output the hit alignments as DNA sequence alignments.
    
 
     
   
SEE ALSO¶
For complete documentation, see the User's Guide (Userguide.pdf) that came with
  the distribution; or see the Infernal web page, 
http://infernal.janelia.org/.
 
COPYRIGHT¶
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution
  conditions.
 
AUTHOR¶
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/