NAME¶
- cmscore - align and score one or more sequences to a
CM
-
SYNOPSIS¶
cmscore [options] cmfile seqfile
DESCRIPTION¶
cmscore uses the covariance model (CM) in
cmfile to align and
score the sequences in
seqfile, and output summary statistics on
timings and scores.
cmscore is a testbed for new CM alignment
algorithms, and it is also used by the testsuite. It is not intended to be
particularly useful in the real world. Documentation is provided for
completeness, and to aid our own memories.
CMs are profiles of RNA consensus sequence and secondary structure. A CM file is
produced by the
cmbuild program, from a given RNA sequence alignment of
known consensus structure.
cmscore aligns the sequence(s) in
seqfile using two alignment
algorithms and compares the scores and timings of each algorithm. By default
the two algorithms compared are the divide and conquer (D&C) CYK algorithm
(SR Eddy, BMC Bioinformatics 3:18, 2002), and the HMM banded standard CYK
algorithm. When the
--nonbanded option is enabled D&C CYK is
compared with the standard CYK alignment algorithm. In this case, because both
algorithms should find the optimal alignment the parsetree scores should be
nearly identical (within 0.01 bits), if this is not the case for any sequence
cmscore exits and prints an error message. When the
--viterbi
option is enabled D&C CYK is compared with Viterbi alignment to a CM Plan
9 HMM constructed to be maximally similar to the CM.
While non-banded CYK variants are guaranteed to find the optimal alignment and
score of each sequence, HMM banded CYK sacrifices this guarantee for
acceleration. The level of acceleration can be controlled by the tau
parameter, which is set with the
--tau <x> option. This is
described in more detail in the man page for
cmalign, but in short,
<x> is a rough estimate at the probability that the optimal
alignment will be missed. The greater
<x> is, the greater the
acceleration, but the greater the chance of missing the optimal alignment. By
default tau is set as 1E-7.
cmscore is useful for testing for values of
tau that give the best trade-off between acceleration versus accuracy. To make
this testing easier, multiple tau values can be tested within a single cmscore
call. The
--taus <x> and
--taue <x>
option combination allow the user to specify a beginning tau value and an
ending tau value. For example,
--taus 3 and and
--taue
5 would first align the sequences in
seqfile with non-banded
D&C CYK, and then perform 3 additional HMM banded alignments, first with
tau=1E-3, next with tau=1E-4 and finally with tau=1E-5. Currently, only values
of 1E-<x> can be used. Summary statistics on timings and how often the
optimal alignment is missed for each value or tau are printed to stdout.
When comparing the non-banded standard CYK and D&C CYK algorithms with the
--nonbanded option, the two parse trees should usually be identical for
any sequence, because the optimal alignment score is guaranteed. However,
there can be cases of ties, where two or more different parse trees have
identical scores. In such cases, it is possible for the two parse trees to
differ. The parse tree selected as "optimal" from amongst the ties
is arbitrary, dependent on order of evaluation in the DP traceback, and the
order of evaluation for D&C vs. standard CYK is different. Thus, in its
testsuite role,
cmscore checks that the scores are within 0.01 bits of
each other, but does not check that the parse trees are absolutely identical.
The alignment algorithms can be run in "search" mode within
cmscore by using the
--search option. When
--search is
enabled,
--inside specifies that the Inside algorithm be used instead
of CYK and
--forward specifies that the HMM Forward algorithm be used
instead of CYK.
The sequences are treated as single stranded RNAs; that is, only the given
strand of each sequence is aligned and scored, and no reverse complementing is
done.
OPTIONS¶
- -h
- Print brief help; includes version number and summary of
all options, including expert options.
- -n <n>
- Set the number of sequences to generate and align to
<n>. This option is incompatible with the --infile
option.
- -l
- Turn on the local alignment algorithm, which allows the
alignment to span two or more subsequences if necessary (e.g. if the
structures of the query model and target sequence are only partially
shared), allowing certain large insertions and deletions in the structure
to be penalized differently than normal indels. The default is to globally
align the query model to the target sequences.
- -s <n>
- Set the random seed to <n>, where <n> is
a positive integer. The default is to use time() to generate a different
seed for each run, which means that two different runs of cmscore
on the same CM will give different results. You can use this option to
generate reproducible results. The random number generator is used to
generate sequences to score, so -s is incompatible with the
--infile option which supplies the sequences to score in an input
file.
- -a
- Print individual timings and score comparisons for each
sequence in seqfile. By default only summary statistics are
printed.
- --sub
- Turn on the sub model construction and alignment procedure.
For each sequence, an HMM is first used to predict the model start and end
consensus columns, and a new sub CM is constructed that only models
consensus columns from start to end. The sequence is then aligned to this
sub CM. This option is useful for aligning sequences that are known to
truncated, non-full length sequences. This "sub CM" procedure is
not the same as the "sub CMs" described by Weinberg and Ruzzo.
When used in combination with --tfile the parsetree printed is not
the sub CM parsetree, but rather the sub CM parstree mapped onto the
topology of the original CM. This mapped parsetree will likely have a
different score (sometimes much worse) than the sub CM parsetree, both of
those scores are printed to the parsetree file for each sequence.
- --mxsize <x>
- Set the maximum allowable DP matrix size to
<x> megabytes. By default this size is 2048 Mb. This should
be large enough for the most alignments, however if it is not
cmscore will exit prematurely and report an error message that the
matrix exceeded it's maximum allowable size. In this case, the
--mxsize can be used to raise the limit, or if --nonbanded
is enabled, the --scoreonly option will solve the memory issue.
This memory error is most likely to occur when the --nonbanded
option is used without the --scoreonly option, but can still occur
when --nonbanded is not used.
- --devhelp
- Print help, as with -h , but also include
undocumented developer options. These options are not listed below, are
under development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only resources for
understanding what they actually do are the brief one-line description
printed when --devhelp is enabled, and the source code.
- --mpi
- Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
--enable-mpi flag (see User's Guide for details).
EXPERT OPTIONS¶
- --emit
- Generate sequences to score by sampling from the CM. This
option is on by default. The number of sequences generated is 10 by
default but can be changed with the -n option. The sequences
generated from the CM can be saved to an output file in FASTA format with
the --outfile option.
- --random
- Generate sequences to score by sampling from the CMs null
distribution. This option turns the --emit option off. By default
the CM distribution will be 0.25 for each of the four RNA nucleotides, but
it may be different if the --null option was used when
cmbuild created the cmfile. By default, the length of the
sequences generated is sampled from the length distribution of the CM. The
average length of the random sequences will be the consensus length of the
RNA family modelled by the CM (or very close to it). Alternatively, the
--Lmin <n> and --Lmax <n> options
can be used to specify a length distribution. The number of sequences
generated is 10 by default but can be changed with the -n option.
The random sequences generated can be saved to an output file in FASTA
format with the --outfile option.
- --infile <f>
- Sequences to score are read from the file <f>.
All the sequences from <f> are read and scored, the -n
and -s options are incompatible with --infile.
- --outfile <f>
- Save generated sequences that are scored to the file
<f> in FASTA format. This option is incompatible with the
--infile option.
- --Lmin <n1>
- Must be used in combination with --random and
--Lmax <n2>. The lengths of the random sequences
generated and scored will be uniform between the range of
<n1>..<n2>.
- --Lmax <n2>
- Must be used in combination with --random and
--Lmin <n1>. The lengths of the random
sequences generated and scored will be uniform between the range of
<n1>..<n2>.
- --pad
- Must be used in combination with --emit and
--search. Add <n> cm->W (max hit length) minus L
(sequence <x> length) residues to the 5' and 3' end of each emitted
sequence <x>.
- --hbanded
- Specify that the second stage alignment algorithm be HMM
banded CYK. This option is on by default. For more information on this
option, see the description of the --hbanded option in the man page
for cmalign.
- --tau <x>
- For stage 2 alignment, set the tail loss probability used
during HMM band calculation to <x>. This is the amount of
probability mass within the HMM posterior probabilities that is considered
negligible. The default value is 1E-7. In general, higher values will
result in greater acceleration, but increase the chance of missing the
optimal alignment due to the HMM bands.
- --aln2bands
- With --search, when calculating HMM bands, use an
HMM alignment algorithm instead of an HMM search algorithm. In general,
using this option will result in greater acceleration, but will increase
the chance of missing the optimal alignment.
- --hsafe
- For stage 2 HMM banded alignment, realign any sequences
with a negative alignment score using non-banded CYK to guarantee finding
the optimal alignment.
- --nonbanded
- Specify that the second stage alignment algorithm be
standard, non-banded, non-D&C CYK. When --nonbanded is enabled,
the program fails with a non-zero exit code and prints an error message if
the parsetree score for any sequence from stage 1 D&C alignment and
stage 2 alignment differs by more than 0.01 bits. In theory, this should
never happen as both algorithms are guaranteed to determine the optimal
parsetree. For larger RNAs (more than 300 residues) if memory is limiting,
--nonbanded should be used in combination with --scoreonly.
- --scoreonly
- With --nonbanded during the second stage standard
non-banded CYK alignment, use the "score only" variant of the
algorithm to save memory, and don't recover a parse tree.
- --viterbi
- Specify that the second stage alignment algorithm be
Viterbi to a CM Plan 9 HMM.
- --search
- Run all algorithms in scanning mode, not alignment mode.
This means the highest scoring subsequence within each sequence is
returned as the score, not necessarily the score of an alignment of the
full sequence.
- --inside
- With --search Compare the non-banded scanning Inside
algorithm to the HMM banded scanning Inside algorith, instead of using CYK
versions.
- --forward
- With --search Compare the scanning Forward scoring
algorithm against CYK.
- --taus <n>
- Specify the first alignment algorithm as non-banded D&C
CYK, and multiple stages of HMM banded CYK alignment. The first HMM banded
alignment will use tau=1E-<x>, which will be the highest value of
tau used. Must be used in combination with --taue.
- --taue <n>
- Specify the first alignment algorithm as non-banded D&C
CYK, and multiple stages of HMM banded CYK alignment. The final HMM banded
alignment will use tau=1E-<x>, which will be the lowest value of tau
used. Must be used in combination with --taus.
- --tfile <f>
- Print the parsetrees for each alignment of each sequence to
file <f>.
SEE ALSO¶
For complete documentation, see the User's Guide (Userguide.pdf) that came with
the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT¶
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution
conditions.
AUTHOR¶
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/