NAME¶
- cmalign - use a CM to make a structured RNA multiple
alignment
-
SYNOPSIS¶
- Align sequences to a CM:
- cmalign [options] cmfile
seqfile
- Merge two alignments:
- cmalign --merge [options] cmfile
msafile1 msafile2
DESCRIPTION¶
cmalign aligns the RNA sequences in
seqfile to the covariance
model (CM) in
cmfile, and outputs a multiple sequence alignment.
Alternatively, with the
--merge option,
cmalign merges the two
alignments
msafile1 and
msafile2 created by previous runs of
cmalign with
cmfile into a single alignment.
The sequence file
seqfile must be in FASTA, EMBL, or Genbank format.
CMs are profiles of RNA consensus sequence and secondary structure. A CM file is
produced by the
cmbuild program, from a given RNA sequence alignment of
known consensus structure.
The alignment that
cmalign makes is written in Stockholm format. It can
be redirected to a file using the
-o option.
cmalign uses an HMM banding technique to accelerate alignment by default
as described below for the
--hbanded option. HMM banding can be turned
off with the
--nonbanded option.
By default,
cmalign computes the alignment with maximum expected accuracy
that is consistent with constraints (bands) derived from an HMM, using a
banded version of the Durbin/Holmes optimal accuracy algorithm. This behavior
can be changed, as described below and in the User's Guide, with the
--cyk, --sample, or
--viterbi options.
It is possible to include the fixed training alignment used to build the CM
within the output alignment of
cmalign. This is done using the
--withali option, as described below and in the User's Guide.
OUTPUT¶
cmalign first outputs tabular information on the scores of each sequence
being aligned, then the alignment itself is printed. The alignment can be
redirected to an output file
<f> with the
-o
<f> option. The tabular output section includes one line per
sequences and seven fields per line: "seq idx": the index of the
sequence in the input file, "seq name": the sequence name,
"len": the length of the sequence, "total": the total bit
score of the sequence, "struct": an approximation of the
contribution of the secondary structure to the bit score, "avg
prob": the average posterior probability (confidence estimate) of each
aligned residue, and "elapsed": the wall time spent aligning the
sequence.
The fields can change if different options are selected. For example if the
--cyk option is enabled, the "avg prob" field disappears
because posterior probabilities are not calculated by the CYK algorithm.
OPTIONS¶
- -h
- Print brief help; includes version number and summary of
all options, including expert options.
- -o <f>
- Save the alignment in Stockholm format to a file
<f>. The default is to write it to standard output.
- -l
- Turn on the local alignment algorithm, which allows the
alignment to span two or more subsequences if necessary (e.g. if the
structures of the query model and target sequence are only partially
shared), allowing certain large insertions and deletions in the structure
to be penalized differently than normal indels. The default is to globally
align the query model to the target sequences.
- -p
- Annotate the alignment with posterior probabilities
calculated using the Inside and Outside algorithms. The -p option
causes additional annotation to appear in the output alignment, but does
not modify the alignment itself (that is, the relative positions of the
residues are unchanged). Two characters for each residue are used to
annotate the posterior probability that the corresponding residue aligns
at the corresponding position in the Stockholm alignment. These characters
have the Stockholm markup tags "#=GR <seq name> POSTX."
and "#=GR <seq name> POST.X", and can only have the
values: "0-9", "*" or ".". They indicate the
tens and ones place for the posterior probability: an "8" for
"POSTX." and a "3" for "POST.X" indicates
that the posterior probability is between 0.83 and 0.84. A "*"
for both "POSTX." and "POST.X" indicates that the
confidence estimate is "very nearly" 1.0 (it's hard to be exact
here due to numerical precision issues) A "." in both
"POSTX." and "POST.X" indicates that that column
aligns to a gap. When used in combination with --nonbanded, the
calculation of the posterior probabilities considers all possible
alignments of the target sequence to the CM. Without --nonbanded
(in HMM banded mode), the calculation considers only possible alignments
within the HMM bands.
- -q
- Quiet; suppress the verbose banner, and only print the
resulting alignment to stdout. This allows piping the alignment to the
input of other programs, for example.
- -1
- Output the alignment in pfam format, a non-interleaved
Stockholm format in which each sequence is on a single line.
- --informat <s>
- Assert that the input seqfile is in format
<s>. Do not run Babelfish format autodection. This increases
the reliability of the program somewhat, because the Babelfish can make
mistakes; particularly recommended for unattended, high-throughput runs of
Infernal. Acceptable formats are: FASTA, EMBL, UNIPROT, GENBANK, and DDBJ.
<s> is case-insensitive.
- --devhelp
- Print help, as with -h , but also include
undocumented developer options. These options are not listed below, are
under development or experimental, and are not guaranteed to even work
correctly. Use developer options at your own risk. The only resources for
understanding what they actually do are the brief one-line description
printed when --devhelp is enabled, and the source code.
- --mpi
- Run as an MPI parallel program. This option will only be
available if Infernal has been configured and built with the
"--enable-mpi" flag (see User's Guide for details).
EXPERT OPTIONS¶
- --optacc
- Align sequences using the Durbin/Holmes optimal accuracy
algorithm. This is default behavior, so this option is probably useless.
The optimal accuracy alignment will be constrained by HMM bands for
acceleration unless the --nonbanded option is enabled. The optimal
accuracy algorithm determines the alignment that maximizes the posterior
probabilities of the aligned residues within it. The posterior
probabilites are determined using (possibly HMM banded) variants of the
Inside and Outside algorithms.
- --cyk
- Do not use the Durbin/Holmes optimal accuracy alignment to
align the sequences, instead use the CYK algorithm which determines the
optimally scoring alignment of the sequence to the model.
- --sample
- Sample an alignment from the posterior distribution of
alignments. The posterior distribution is determined using an HMM banded
(unless --nonbanded) variant of the Inside algorithm.
- -s <n>
- Set the random number generator seed to <n>,
where <n> is a positive integer. This option can only be used
in combination with --sample. The default is to use time() to
generate a different seed for each run, which means that two different
runs of cmalign --sample on the same alignment will give slightly
different results. You can use this option to generate reproducible
results.
- --viterbi
- Do not use the CM to align the sequences, instead use the
HMM Viterbi algorithm to align with a CM Plan 9 HMM. The HMM is
automatically constructed to be maximally similar to the CM. This HMM
alignment is faster than CM alignment, but can be less accurate because
the structure of the RNA family is ignored.
- --sub
- Turn on the sub model construction and alignment procedure.
For each sequence, an HMM is first used to predict the model start and end
consensus columns, and a new sub CM is constructed that only models
consensus columns from start to end. The sequence is then aligned to this
sub CM. This option is useful for aligning sequences that are known to
truncated, non-full length sequences. This "sub CM" procedure is
not the same as the "sub CMs" described by Weinberg and Ruzzo.
- --small
- Use the divide and conquer CYK alignment algorithm
described in SR Eddy, BMC Bioinformatics 3:18, 2002. The
--nonbanded option must be used in combination with this options.
Also, it is recommended whenever --nonbanded is used that
--small is also used because standard CM alignment without HMM
banding requires a lot of memory, especially for large RNAs.
--small allows CM alignment within practical memory limits,
reducing the memory required for alignment LSU rRNA, the largest known
RNAs, from 150 Gb to less than 300 Mb. This option can only be used in
combination with --nonbanded and --cyk.
- --hbanded
- This option is turned on by default. Accelerate alignment
by pruning away regions of the CM DP matrix that are deemed negligible by
an HMM. First, each sequence is scored with a CM plan 9 HMM derived from
the CM using the Forward and Backward HMM algorithms and calculate
posterior probabilities that each residue aligns to each state of the HMM.
These posterior probabilities are used to derive constraints (bands) on
the CM DP matrix. Finally, the target sequence is aligned to the CM using
the banded DP matrix, during which cells outside the bands are ignored.
Usually most of the full DP matrix lies outside the bands (often more than
95%), making this technique faster because fewer DP calculations are
required, and more memory efficient because only cells within the bands
need be allocated.
Importantly, HMM banding sacrifices the guarantee of determining the
optimally accurarte or optimal alignment, which will be missed if it lies
outside the bands. The tau paramater (analagous to the beta parameter for
QDB calculation in cmsearch ) is the amount of probability mass
considered negligible during HMM band calculation; lower values of tau
yield greater speedups but also a greater chance of missing the optimal
alignment. The default tau is 1E-7, determined empirically as a good
tradeoff between sensitivity and speed, though this value can be changed
with the --tau <x> option. The level of acceleration
increases with both the length and primary sequence conservation level of
the family. For example, with the default tau of 1E-7, tRNA models (low
primary sequence conservation with length of about 75 residues) show about
10X acceleration, and SSU bacterial rRNA models (high primary sequence
conservation with length of about 1500 residues) show about 700X. HMM
banding can be turned off with the --nonbanded option.
- --nonbanded
- Turns off HMM banding. The returned alignment is guaranteed
to be the globally optimally accurate one (by default) or the globally
optimally scoring one (if --cyk is enabled). The --small
option is recommended in combination with this option, because standard
alignment without HMM banding requires a lot of memory (see --small
).
- --tau <x>
- Set the tail loss probability used during HMM band
calculation to <x>. This is the amount of probability mass
within the HMM posterior probabilities that is considered negligible. The
default value is 1E-7. In general, higher values will result in greater
acceleration, but increase the chance of missing the optimal alignment due
to the HMM bands.
- --mxsize <x>
- Set the maximum allowable DP matrix size to
<x> megabytes. By default this size is 2,048 Mb. This should
be large enough for the vast majority of alignments, however if it is not
cmalign will exit prematurely and report an error message that the
matrix exceeded it's maximum allowable size. In this case, the
--mxsize can be used to raise the limit. This is most likely to
occur when the --nonbanded option is used without the
--small option, but can still occur when --nonbanded is not
used.
- --rna
- Output the alignments as RNA sequence alignments. This is
true by default.
- --dna
- Output the alignments as DNA sequence alignments.
- --matchonly
- Only include match columns in the output alignment, do not
include any insertions relative to the consensus model.
- --resonly
- Only include match columns in the output alignment that
have at least 1 residue (non-gap character) in them. By default all match
columns are printed to the alignment, even those that are 100% gaps.
--resonly replicates the default behavior of previous versions of
cmalign.
- --fins
- Change the behavior of how insert emissions are placed in
the alignment. By default, all contiguous blocks of inserts are split in
half, and half the residues are flushed left against the nearest consensus
column to the left, and half are flushed right against the nearest
consensus column on the right. With --fins inserts are not split in
half, instead all inserted residues from IL states are flushed left, and
all inserted residues from IR states are flushed right. --fins
replicates the default behavior of previous versions of cmalign.
- --onepost
- Modifies behavior of the -p option. Use only one
character instead of two to annotate the posterior probability of each
aligned residue. Specifically, only the "#=GR <seq name>
POSTX." tag is printed to the alignment. An "8" for
"POSTX." indicates a posterior probability between 0.8 and 0.9
for the corresponding residue.
- --merge
- With --merge the usage of cmalign changes to
cmalign --merge [options] cmfile msafile1
msafile2. Merge the two alignments in msafile1 and
msafile2 created by previous runs of cmalign with
cmfile together into a single alignment and exit. msafile1
and msafile2 must only have one alignment per file. This option
allows the user to split up large sequence files into many smaller files,
align them independently to cmfile on different computers to get
many small alignments, and then merge them into a single large alignment.
- --withali <f>
- Reads an alignment from file <f> and aligns it
as a single object to the CM; e.g. the alignment in <f> is
held fixed. This allows you to align sequences to a model with
cmalign and view them in the context of an existing trusted
multiple alignment. The alignment in the file <f> must be
exactly the alignment that the CM was built from, or a subset of it with
the following special property: the definition of consensus columns and
consensus secondary structure must be identical between <f>
and the alignment the CM was built from. One easy way to achieve this is
to use the --rf option to cmbuild (see man page for
cmbuild ) and to maintain the "#=GC RF" annotation in the
alignment when removing sequences to create the subset alignment
<f>. To specify that the --rf option to cmbuild
was used, enable the --rf option to cmalign (see --rf
below).
- --withpknots
- Must be used in combination with --withali
<f>. Propogate structural information for any pseudoknots
that exist in <f> to the output alignment.
- --rf
- Must be used in combination with --withali
<f>. Specify that the alignment in <f> has
the same "#=GC RF" annotation as the alignment file the CM was
built from using cmbuild and further that the --rf option
was supplied to cmbuild when the CM was constructed.
- --gapthresh <x>
- Must be used in combination with --withali
<f>. Specify that the --gapthresh <x>
option was supplied to cmbuild when the CM was constructed from the
alignment file <f>.
- --tfile <f>
- Dump tabular sequence tracebacks for each individual
sequence to a file <f>. Primarily useful for debugging.
SEE ALSO¶
For complete documentation, see the User's Guide (Userguide.pdf) that came with
the distribution; or see the Infernal web page,
http://infernal.janelia.org/.
COPYRIGHT¶
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution
conditions.
AUTHOR¶
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/