NAME¶
cmbuild - construct a CM from an RNA multiple sequence
alignment
SYNOPSIS¶
cmbuild [options] cmfile alifile
DESCRIPTION¶
cmbuild reads an RNA multiple sequence alignment from
alifile,
constructs a covariance model (CM), and saves the CM to
cmfile.
The alignment file must be in Stockholm format, and must contain consensus
secondary structure annotation.
cmbuild uses the consensus structure to
determine the architecture of the CM.
The alignment file may be a database containing more than one alignment. If so,
the resulting
cmfile will be a database of CMs, one per alignment.
The expert options
--ctarget, --cmaxid, and
--call result in
multiple CMs being built from each alignment in
alifile as described
below.
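The input requirements above can be illustrated with a short Python sketch. It is not part of Infernal: the function name and the simple line-based parsing are assumptions made for illustration. It relies only on two Stockholm conventions, namely that a "//" line ends an alignment and that a "#=GC SS_cons" line carries the consensus secondary structure.

```python
def check_stockholm(text):
    """Return one boolean per alignment: True if it has a #=GC SS_cons line.

    Hypothetical helper for illustration; cmbuild does its own parsing.
    """
    results = []
    has_ss_cons = False
    in_alignment = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # "//" terminates the current alignment.
            if in_alignment:
                results.append(has_ss_cons)
            has_ss_cons = False
            in_alignment = False
            continue
        in_alignment = True
        fields = line.split()
        if fields[0] == "#=GC" and fields[1:2] == ["SS_cons"]:
            has_ss_cons = True
    if in_alignment:
        # Tolerate a missing final "//" in this sketch.
        results.append(has_ss_cons)
    return results


EXAMPLE = """# STOCKHOLM 1.0
seq1 GGGAAACCC
seq2 GGCAAAGCC
#=GC SS_cons <<<...>>>
//
# STOCKHOLM 1.0
seq3 ACGU
//
"""

print(check_stockholm(EXAMPLE))  # [True, False]
```

In this example the file holds two alignments; only the first has structure annotation, so only the first would be usable as cmbuild input.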
OUTPUT¶
The default output from
cmbuild is tabular, with a single line printed
for each model. Each line has the following fields:
aln: the index of the alignment used to build the CM;
cm idx: the index of the CM in the cmfile;
name: the name of the CM;
nseq: the number of sequences in the alignment used to build the CM;
eff_nseq: the effective number of sequences used to build the model (see the User's Guide);
alen: the length of the alignment used to build the CM;
clen: the number of columns from the alignment defined as consensus columns;
rel entropy, CM: the total relative entropy of the model divided by the number
of consensus columns;
rel entropy, HMM: the total relative entropy of the model
ignoring secondary structure, divided by the number of
consensus columns.
OPTIONS¶
- -h
- Print brief help; includes version number and summary of
all options, including expert options.
- -n <s>
- Name the covariance model <s>. (Does not work
if alifile contains more than one alignment). The default is to use
the name of the alignment (given by the #=GF ID tag, in Stockholm format),
or if that is not present, to use the name of the alignment file minus any
file type extension plus a "-" and a positive integer indicating
the position of that alignment in the file (that is, the first alignment
in a file "myrnas.sto" would give a CM named
"myrnas-1", the second alignment would give a CM named
"myrnas-2").
- -A
- Append the CM to cmfile, if cmfile already
exists.
- -F
- Allow cmfile to be overwritten. Normally, if
cmfile already exists, cmbuild exits with an error unless
the -A or -F option is set.
- -v
- Run in verbose output mode instead of using the default
single line tabular format. This output format is similar to that used by
older versions of Infernal.
- --iins
- Allow informative insert emissions for the CM. By default,
all CM insert emission scores are set to 0.0 bits. The motivation for zero
bit scores is to avoid high-scoring hits to low complexity sequence
favored by high insert state emission scores.
- --Wbeta <x>
- Set the beta tail loss probability for query-dependent
banding (QDB) to <x>. The QDB algorithm is used to determine
the maximum length of a hit to the model. For more information on QDB see
Nawrocki and Eddy, PLoS Computational Biology 3(3): e56. The beta
parameter is the amount of probability mass considered negligible during
band calculation; lower values of beta will result in shorter maximum hit
lengths, which will yield faster searches. The default beta is 1E-7,
determined empirically as a good tradeoff between sensitivity, specificity
and speed.
- --devhelp
- Print help, as with -h, but also include
undocumented developer options. These options are not listed below. They
are under development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only resources for
understanding what they actually do are the brief one-line description
printed when --devhelp is enabled, and the source code.
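The default naming rule described for the -n option above can be sketched in Python as follows. This is an illustrative reading of the rule, not Infernal's code: the function name and arguments are hypothetical, and stripping any leading directory from the alignment file path is an assumption.

```python
import os


def default_cm_name(alifile, index, gf_id=None):
    """Sketch of the default CM name for one alignment.

    alifile: path of the alignment file (hypothetical argument)
    index:   1-based position of the alignment within the file
    gf_id:   value of the #=GF ID tag, or None if absent
    """
    if gf_id:
        # A #=GF ID tag, when present, names the model.
        return gf_id
    # Otherwise: file name minus extension, "-", position in the file.
    # Dropping the directory part is an assumption of this sketch.
    stem, _ext = os.path.splitext(os.path.basename(alifile))
    return f"{stem}-{index}"


print(default_cm_name("myrnas.sto", 1))          # myrnas-1
print(default_cm_name("myrnas.sto", 2))          # myrnas-2
print(default_cm_name("myrnas.sto", 1, "tRNA"))  # tRNA
```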
EXPERT OPTIONS¶
- --rsearch <f>
- Parameterize emission scores a la RSEARCH, using the
RIBOSUM matrix in file <f>. (Actually, the emission scores
will not be identical to RIBOSUM scores due to differences in the
modelling strategy between Infernal and RSEARCH, but they will be as
similar as possible.) RIBOSUM matrix files are included with Infernal in
the "matrices/" subdirectory of the top-level Infernal
directory. RIBOSUM matrices are substitution score matrices trained
specifically for structural RNAs with separate single stranded residue and
base pair substitution scores. For more information see the RSEARCH
publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).
With --rsearch enabled, all alignments in alifile must contain
exactly one sequence or the --call option must also be enabled.
- --binary
- Save the model in a compact binary format. The default is a
more readable ASCII text format.
- --rf
- Use reference coordinate annotation (#=GC RF line, in
Stockholm) to determine which columns are consensus, and which are
inserts. Any non-gap character indicates a consensus column. (For example,
mark consensus columns with "x", and insert columns with
".".) The default is to determine this automatically; if the
frequency of gap characters in a column is greater than a threshold,
gapthresh (default 0.5), the column is called an insertion.
- --gapthresh <x>
- Set the gap threshold (used for determining which columns
are insertions versus consensus; see --rf above) to
<x>. The default is 0.5.
- --ignorant
- Strip all base pair secondary structure information from
all input alignments in alifile before building the CM(s). All
resulting CM(s) will have zero MATP (base pair) nodes, with zero
bifurcations.
- --wgsc
- Use the Gerstein/Sonnhammer/Chothia (GSC) weighting
algorithm. This is the default unless the number of sequences in the
alignment exceeds a cutoff (see --pbswitch), in which case the
default becomes the faster Henikoff position-based weighting scheme.
- --wblosum
- Use the BLOSUM filtering algorithm to weight the sequences,
instead of the default GSC weighting. Cluster the sequences at a given
percentage identity (see --wid); assign each cluster a total weight
of 1.0, distributed equally amongst the members of that cluster.
- --wpb
- Use the Henikoff position-based weighting scheme. This
weighting scheme is automatically used (overriding --wgsc and
--wblosum) if the number of sequences in the alignment exceeds a
cutoff (see --pbswitch).
- --wnone
- Turn sequence weighting off; i.e., explicitly set all
sequence weights to 1.0.
- --wgiven
- Use sequence weights as given in annotation in the input
alignment file. If no weights were given, assume they are all 1.0. The
default is to determine new sequence weights by the
Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated weights.
- --pbswitch <n>
- Set the cutoff for automatically switching the weighting
method to the Henikoff position-based weighting scheme to
<n>. If the number of sequences in the alignment exceeds
<n>, Henikoff weighting is used. By default <n>
is 5000.
- --wid <x>
- Controls the behavior of the --wblosum weighting
option by setting the percent identity for clustering the alignment to
<x>.
- --eent
- Use the entropy weighting strategy to determine the
effective sequence number that gives a target mean match state relative
entropy. This option is the default, and can be turned off with
--enone. The default target mean match state relative entropy is
0.59 bits but can be changed with --ere. The default of 0.59 bits
is automatically changed if the total relative entropy of the model
(summed match state relative entropy) is less than a cutoff, which is
6.0 bits by default, but can be changed with the expert, undocumented
--eX option. If you really want to play with that option, consult
the source code.
- --enone
- Turn off the entropy weighting strategy. The effective
sequence number is just the number of sequences in the alignment.
- --ere <x>
- Set the target mean match state relative entropy as
<x>. By default the target relative entropy per match
position is 0.59 bits.
- --null <f>
- Read a null model from <f>. The null model
defines the probability of each RNA nucleotide in background sequence; the
default is to use 0.25 for each nucleotide. The format of null files is
documented in the User's Guide.
- --prior <f>
- Read a Dirichlet prior from <f>, replacing the
default mixture Dirichlet. The format of prior files is documented in the
User's Guide.
- --ctarget <n>
- Cluster each alignment in alifile by percent
identity. Find a cutoff percent id threshold that gives exactly
<n> clusters and build a separate CM from each cluster. If
<n> is greater than the number of sequences in the alignment
the program will not complain, and each sequence in the alignment will be
its own cluster. Each CM will have a positive integer appended to its name
indicating the order in which it was built. For example, if cmbuild
--ctarget 3 is called with alifile "myrnas.sto", and
"myrnas.sto" has exactly one Stockholm alignment in it with no
#=GF ID tag annotation, three CMs will be built, the first will be named
"myrnas-1.1", the second, "myrnas-1.2", and the third
"myrnas-1.3". (As explained above for the -n option, the
first number "1" after "myrnas" indicates the CM was
built from the first alignment in "myrnas.sto".)
- --cmaxid <x>
- Cluster each sequence alignment in alifile by
percent identity. Define clusters at the cutoff fractional id similarity
of <x> and build a separate CM from each cluster. No two
sequences will be more than <x> fractionally identical
(<x> * 100 percent identical) if those two sequences are in
different clusters. The CMs are named as described above for
--ctarget.
- --call
- Build a separate CM from each sequence in each alignment in
alifile. Naming of CMs takes place as described above for
--ctarget. Using this option in combination with --rsearch
causes a separate CM to be built and parameterized using a RIBOSUM matrix
for each sequence in alifile.
- --corig
- After building multiple CMs using --ctarget,
--cmaxid or --call as described above, build a final CM using
the complete original alignment from alifile. The CMs are named as
described above for --ctarget with the exception of the final CM
built from the original alignment which is named in the default manner,
without an appended integer.
- --cdump <f>
- Dump the multiple alignments of each cluster to
<f> in Stockholm format. This option only works in
combination with --ctarget, --cmaxid or --call.
- --refine <f>
- Attempt to refine the alignment before building the CM
using expectation-maximization (EM). A CM is first built from the initial
alignment as usual. Then, the sequences in the alignment are realigned
optimally (with the HMM banded CYK algorithm; here optimal means optimal given
the bands) to the CM, and a new CM is built from the resulting alignment.
The sequences are then realigned to the new CM, and a new CM is built from
that alignment. This is continued until convergence, specifically when the
alignments for two successive iterations are not significantly different
(the summed bit scores of all the sequences in the alignment changes less
than 1% between two successive iterations). The final alignment (the
alignment used to build the CM that gets written to cmfile) is
written to <f>.
- --gibbs
- Modifies the behavior of --refine so Gibbs sampling
is used instead of EM. The difference is that during the alignment stage
the alignment is not necessarily optimal, instead an alignment (parsetree)
for each sequence is sampled from the posterior distribution of
alignments as determined by the Inside algorithm. Due to this sampling
step --gibbs is non-deterministic, so different runs with the same
alignment may yield different results. This is not true when
--refine is used without the --gibbs option, in which case
the final alignment and CM will always be the same. When --gibbs is
enabled, the -s <n> option can be used to seed the random
number generator predictably, making the results reproducible. The goal of
the --gibbs option is to help expert RNA alignment curators refine
structural alignments by allowing them to observe alternative high scoring
alignments.
- -s <n>
- Set the random seed to <n>, where
<n> is a positive integer. This option can only be used in
combination with --gibbs. The default is to use time() to generate
a different seed for each run, which means that two different runs of
cmbuild --refine <f> --gibbs on the same alignment
will give slightly different results. You can use this option to generate
reproducible results.
- -l
- With --refine, turn on the local alignment
algorithm, which allows the alignment to span two or more subsequences if
necessary (e.g. if the structures of the query model and target sequence
are only partially shared), allowing certain large insertions and
deletions in the structure to be penalized differently than normal indels.
The default is to globally align the query model to the target sequences.
- -a
- With --refine, print the scores of each individual
sequence alignment.
- --cyk
- With --refine, align with the CYK algorithm. By
default the optimal accuracy algorithm is used. There is more information
on this in the cmalign manual page.
- --sub
- With --refine, turn on the sub model construction
and alignment procedure. For each sequence to be realigned an HMM is first
used to predict the model start and end consensus columns, and a new sub
CM is constructed that only models consensus columns from start to end.
The sequence is then aligned to this sub CM. This option is useful for
building CMs for alignments with sequences that are known to be truncated,
non-full length sequences. This option is experimental and not rigorously
tested; use at your own risk. This "sub CM" procedure is not the
same as the "sub CMs" described by Weinberg and Ruzzo.
- --nonbanded
- With --refine, do not use HMM bands to accelerate
alignment. Use the full CYK algorithm which is guaranteed to give the
optimal alignment. This will slow down the run significantly, especially
for large models.
- --tau <x>
- With --refine, set the tail loss probability used
during HMM band calculation to <x>. This is the amount of
probability mass within the HMM posterior probabilities that is considered
negligible. The default value is 1E-7. In general, higher values will
result in greater acceleration, but increase the chance of missing the
optimal alignment due to the HMM bands.
- --fins
- With --refine, change the behavior of how insert
emissions are placed in the alignment. By default, all contiguous blocks
of inserts are split in half, and half the residues are flushed left
against the nearest consensus column to the left, and half are flushed
right against the nearest consensus column on the right. With
--fins, inserts are not split in half; instead, all inserted residues
from IL states are flushed left, and all inserted residues from IR
states are flushed right. This was the default behavior of previous
versions of Infernal.
- --mxsize <x>
- With --refine, set the maximum allowable matrix size
for alignment to <x> megabytes. By default this size is 2 Gb.
This should be large enough for the vast majority of alignments; however,
it is possible that when run with --refine, cmbuild will exit
prematurely, reporting an error message that the matrix exceeded its
maximum allowable size. In this case, the --mxsize option can be used to
raise the limit.
- --rdump <f>
- With --refine, output the intermediate alignments at
each iteration of the refinement procedure (as described above for
--refine) to file <f>.
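The consensus column rule described above for --rf and --gapthresh can be sketched in Python as follows. This is a simplified illustration rather than Infernal's implementation: the function name is hypothetical, gap frequencies are computed from unweighted sequence counts, and only "-" and "." are treated as gap characters; all three are assumptions.

```python
GAP_CHARS = set("-.")  # assumption: only these two count as gaps here


def consensus_columns(seqs, gapthresh=0.5, rf=None):
    """Return one boolean per alignment column: True means consensus.

    seqs:      aligned sequences, all of the same length
    gapthresh: a column whose gap frequency exceeds this is an insert
    rf:        optional #=GC RF annotation string; if given, any
               non-gap character marks a consensus column
    """
    if rf is not None:
        return [ch not in GAP_CHARS for ch in rf]
    nseq = len(seqs)
    calls = []
    for column in zip(*seqs):
        # Unweighted gap frequency (an assumption of this sketch).
        gap_freq = sum(ch in GAP_CHARS for ch in column) / nseq
        calls.append(gap_freq <= gapthresh)
    return calls


aln = ["GG-AC", "GGUAC", "G--AC", "G--AC"]
print(consensus_columns(aln))              # [True, True, False, True, True]
print(consensus_columns(aln, rf="x..xx"))  # [True, False, False, True, True]
```

In the example, the second column has a gap frequency of exactly 0.5, which is not greater than the threshold, so it is still called consensus; supplying an RF line overrides the automatic call.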
SEE ALSO¶
For complete documentation, see the User's Guide (Userguide.pdf) that came with
the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT¶
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution
conditions.
AUTHOR¶
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/