Infernal Manual

NAME

cmalign - use a CM to make a structured RNA multiple alignment

SYNOPSIS

Align sequences to a CM: cmalign [options] cmfile seqfile

Merge two alignments: cmalign --merge [options] cmfile msafile1 msafile2

DESCRIPTION

cmalign aligns the RNA sequences in seqfile to the covariance model (CM) in cmfile, and outputs a multiple sequence alignment. Alternatively, with the --merge option, cmalign merges the two alignments msafile1 and msafile2 created by previous runs of cmalign with cmfile into a single alignment.

The sequence file seqfile must be in FASTA, EMBL, or GenBank format.

CMs are profiles of RNA consensus sequence and secondary structure. A CM file is produced by the cmbuild program, from a given RNA sequence alignment of known consensus structure.

The alignment that cmalign makes is written in Stockholm format. It can be redirected to a file using the -o option.

By default, cmalign uses an HMM banding technique to accelerate alignment, as described below for the --hbanded option. HMM banding can be turned off with the --nonbanded option.

By default, cmalign computes the alignment with maximum expected accuracy that is consistent with constraints (bands) derived from an HMM, using a banded version of the Durbin/Holmes optimal accuracy algorithm. This behavior can be changed, as described below and in the User's Guide, with the --cyk, --sample, or --viterbi options.

It is possible to include the fixed training alignment used to build the CM within the output alignment of cmalign. This is done using the --withali option, as described below and in the User's Guide.

OUTPUT

cmalign first outputs tabular information on the scores of each sequence being aligned, then prints the alignment itself. The alignment can be redirected to an output file <f> with the -o <f> option. The tabular output section includes one line per sequence and seven fields per line: "seq idx": the index of the sequence in the input file, "seq name": the sequence name, "len": the length of the sequence, "total": the total bit score of the sequence, "struct": an approximation of the contribution of the secondary structure to the bit score, "avg prob": the average posterior probability (confidence estimate) of each aligned residue, and "elapsed": the wall time spent aligning the sequence.
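The seven fields above can be read back into a small record type. The following is a minimal Python sketch, not part of Infernal. It assumes the fields are whitespace-separated, that header lines start with "#", and that sequence names contain no spaces; none of these layout details are stated on this page. The name ScoreRow and the sample values in the usage note are hypothetical.

```python
from typing import List, NamedTuple


class ScoreRow(NamedTuple):
    """One line of the tabular score section (field names follow the man page)."""
    seq_idx: int
    seq_name: str
    length: int
    total: float
    struct: float
    avg_prob: float
    elapsed: str  # kept as text; the exact time format is not assumed


def parse_score_lines(text: str) -> List[ScoreRow]:
    """Collect every line that looks like a seven-field score row."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: header and Stockholm markup lines start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 7:
            continue  # alignment lines have fewer fields
        idx, name, length, total, struct, avg_prob, elapsed = fields
        try:
            rows.append(ScoreRow(int(idx), name, int(length), float(total),
                                 float(struct), float(avg_prob), elapsed))
        except ValueError:
            continue  # not a score row after all
    return rows
```

For example, parse_score_lines("1 tRNA-1 72 57.49 24.29 0.981 00:00:00") yields one ScoreRow whose seq_name is "tRNA-1". With --cyk the "avg prob" field is absent, so a row would have six fields and this sketch would skip it.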

The fields can change if different options are selected. For example, if the --cyk option is enabled, the "avg prob" field disappears because posterior probabilities are not calculated by the CYK algorithm.

OPTIONS

-h: Print brief help; includes version number and summary of all options, including expert options.

-o <f>: Save the alignment in Stockholm format to a file <f>. The default is to write it to standard output.

-l: Turn on the local alignment algorithm, which allows the alignment to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. The default is to globally align the query model to the target sequences.

-p: Annotate the alignment with posterior probabilities calculated using the Inside and Outside algorithms. The -p option causes additional annotation to appear in the output alignment, but does not modify the alignment itself (that is, the relative positions of the residues are unchanged). Two characters per residue annotate the posterior probability that the residue aligns at its position in the Stockholm alignment. These characters appear under the Stockholm markup tags "#=GR <seq name> POSTX." and "#=GR <seq name> POST.X", and can only have the values "0-9", "*" or ".". They indicate the tens and ones place of the posterior probability expressed as a percentage: an "8" for "POSTX." and a "3" for "POST.X" indicate that the posterior probability is between 0.83 and 0.84. A "*" for both "POSTX." and "POST.X" indicates that the confidence estimate is "very nearly" 1.0 (it's hard to be exact here due to numerical precision issues). A "." in both "POSTX." and "POST.X" indicates that the column aligns to a gap. When used in combination with --nonbanded, the calculation of the posterior probabilities considers all possible alignments of the target sequence to the CM. Without --nonbanded (in HMM banded mode), the calculation considers only possible alignments within the HMM bands.
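The two-character encoding described above can be decoded as follows. This is a minimal Python sketch, not Infernal code; the function name decode_posterior is hypothetical. Mixed pairs such as a digit with "*" are not described on this page, so the sketch rejects them rather than guessing.

```python
from typing import Optional, Tuple

DIGITS = set("0123456789")


def decode_posterior(postx: str, post_x: str) -> Optional[Tuple[float, float]]:
    """Decode one residue's "POSTX." and "POST.X" characters.

    Returns (low, high) bounds on the posterior probability,
    or None when the column is a gap for this sequence.
    """
    if postx == "." and post_x == ".":
        return None  # gap
    if postx == "*" and post_x == "*":
        return (1.0, 1.0)  # "very nearly" 1.0
    if postx in DIGITS and post_x in DIGITS and len(postx) == len(post_x) == 1:
        hundredths = 10 * int(postx) + int(post_x)
        return (hundredths / 100, (hundredths + 1) / 100)
    # Assumption: any other combination is treated as malformed.
    raise ValueError(f"unexpected annotation pair: {postx!r}, {post_x!r}")


# decode_posterior("8", "3") returns (0.83, 0.84)
```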

-q: Quiet; suppress the verbose banner, and only print the resulting alignment to stdout. This allows piping the alignment to the input of other programs, for example.

-1: Output the alignment in Pfam format, a non-interleaved Stockholm format in which each sequence is on a single line.
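Because each sequence occupies a single line in this format, it is simple to read with a few lines of code. The following Python sketch is illustrative only: it assumes each sequence line holds a name and an aligned string separated by whitespace, skips markup lines starting with "#", and ignores the "//" terminator. The function name read_pfam_stockholm is hypothetical.

```python
from typing import Dict


def read_pfam_stockholm(text: str) -> Dict[str, str]:
    """Map sequence name -> aligned sequence for a one-line-per-sequence alignment."""
    seqs: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, "# STOCKHOLM 1.0", "#=GC"/"#=GR" markup, and the terminator.
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "name  aligned-sequence" line
        name, aligned = parts
        seqs[name] = aligned
    return seqs
```

An interleaved Stockholm file repeats each name once per block, so this sketch would keep only the last block; it is meant for -1 output only.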

--informat <s>: Assert that the input seqfile is in format <s>. Do not run Babelfish format autodetection. This increases the reliability of the program somewhat, because the Babelfish can make mistakes; particularly recommended for unattended, high-throughput runs of Infernal. Acceptable formats are: FASTA, EMBL, UNIPROT, GENBANK, and DDBJ. <s> is case-insensitive.

--devhelp: Print help, as with -h, but also include undocumented developer options. These options are not listed below, are under development or experimental, and are not guaranteed to even work correctly. Use developer options at your own risk. The only resources for understanding what they actually do are the brief one-line descriptions printed when --devhelp is enabled, and the source code.

--mpi: Run as an MPI parallel program. This option will only be available if Infernal has been configured and built with the "--enable-mpi" flag (see User's Guide for details).

EXPERT OPTIONS

--optacc: Align sequences using the Durbin/Holmes optimal accuracy algorithm. This is the default behavior, so this option is rarely needed. The optimal accuracy alignment will be constrained by HMM bands for acceleration unless the --nonbanded option is enabled. The optimal accuracy algorithm determines the alignment that maximizes the posterior probabilities of the aligned residues within it. The posterior probabilities are determined using (possibly HMM banded) variants of the Inside and Outside algorithms.

--cyk: Do not use the Durbin/Holmes optimal accuracy algorithm to align the sequences; instead, use the CYK algorithm, which determines the optimally scoring alignment of the sequence to the model.

--sample: Sample an alignment from the posterior distribution of alignments. The posterior distribution is determined using an HMM banded (unless --nonbanded) variant of the Inside algorithm.

-s <n>: Set the random number generator seed to <n>, where <n> is a positive integer. This option can only be used in combination with --sample. The default is to use time() to generate a different seed for each run, which means that two different runs of cmalign --sample on the same alignment will give slightly different results. You can use this option to generate reproducible results.
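The effect of fixing the seed can be illustrated with a toy sampler. This Python sketch is not how cmalign samples (cmalign works from the Inside matrix, not from an explicit list of alignments); the candidate labels and weights are hypothetical. It only shows that a fixed seed makes draws from a probability distribution repeatable, while an unseeded generator generally does not.

```python
import random
from typing import List, Optional, Sequence


def sample_alignments(candidates: Sequence[str],
                      posteriors: Sequence[float],
                      n: int,
                      seed: Optional[int] = None) -> List[str]:
    """Draw n candidates in proportion to their posterior probabilities."""
    # A private generator keeps the seed local to this call.
    rng = random.Random(seed)
    return rng.choices(candidates, weights=posteriors, k=n)


# Same seed, same draws; seed=None falls back to system entropy.
first = sample_alignments(["aln_A", "aln_B", "aln_C"], [0.7, 0.2, 0.1], 5, seed=42)
second = sample_alignments(["aln_A", "aln_B", "aln_C"], [0.7, 0.2, 0.1], 5, seed=42)
```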

--viterbi: Do not use the CM to align the sequences; instead, use the HMM Viterbi algorithm to align with a CM Plan 9 HMM. The HMM is automatically constructed to be maximally similar to the CM. This HMM alignment is faster than CM alignment, but can be less accurate because the structure of the RNA family is ignored.

--sub: Turn on the sub model construction and alignment procedure. For each sequence, an HMM is first used to predict the model start and end consensus columns, and a new sub CM is constructed that only models consensus columns from start to end. The sequence is then aligned to this sub CM. This option is useful for aligning sequences that are known to be truncated, non-full-length sequences. This "sub CM" procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo.

--small: Use the divide and conquer CYK alignment algorithm described in SR Eddy, BMC Bioinformatics 3:18, 2002. This option can only be used in combination with both --nonbanded and --cyk. Conversely, whenever --nonbanded is used, --small is recommended as well, because standard CM alignment without HMM banding requires a lot of memory, especially for large RNAs. --small allows CM alignment within practical memory limits, reducing the memory required for aligning LSU rRNA, the largest known RNAs, from 150 Gb to less than 300 Mb.

--hbanded: This option is turned on by default. Accelerate alignment by pruning away regions of the CM DP matrix that are deemed negligible by an HMM. First, each sequence is scored with a CM Plan 9 HMM derived from the CM, using the Forward and Backward HMM algorithms to calculate the posterior probability that each residue aligns to each state of the HMM. These posterior probabilities are used to derive constraints (bands) on the CM DP matrix. Finally, the target sequence is aligned to the CM using the banded DP matrix, during which cells outside the bands are ignored. Usually most of the full DP matrix lies outside the bands (often more than 95%), making this technique faster because fewer DP calculations are required, and more memory efficient because only cells within the bands need be allocated.

Importantly, HMM banding sacrifices the guarantee of determining the optimally accurate or optimally scoring alignment, which will be missed if it lies outside the bands. The tau parameter (analogous to the beta parameter for QDB calculation in cmsearch) is the amount of probability mass considered negligible during HMM band calculation; higher values of tau yield greater speedups but also a greater chance of missing the optimal alignment. The default tau is 1E-7, determined empirically as a good tradeoff between sensitivity and speed, though this value can be changed with the --tau <x> option. The level of acceleration increases with both the length and primary sequence conservation level of the family. For example, with the default tau of 1E-7, tRNA models (low primary sequence conservation with length of about 75 residues) show about 10X acceleration, and SSU bacterial rRNA models (high primary sequence conservation with length of about 1500 residues) show about 700X. HMM banding can be turned off with the --nonbanded option.
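The role of tau in band derivation can be sketched for a single HMM state. The Python below is an illustration under stated assumptions, not Infernal's implementation: it takes one state's posterior distribution over sequence positions and trims at most tau/2 of probability mass from each tail. Splitting tau evenly between the two tails is an assumption; this page only says that tau is the mass considered negligible. The function name and the toy distribution are hypothetical.

```python
from typing import Sequence, Tuple


def band_from_posterior(probs: Sequence[float], tau: float = 1e-7) -> Tuple[int, int]:
    """Return inclusive (lo, hi) position bounds after trimming the tails.

    probs[i] is the posterior probability that position i aligns to the state.
    Assumption: at most tau/2 of mass is discarded from each tail.
    """
    if not probs:
        raise ValueError("empty posterior distribution")
    half = tau / 2.0

    lo, mass = 0, 0.0
    while lo < len(probs) - 1 and mass + probs[lo] <= half:
        mass += probs[lo]
        lo += 1

    hi, mass = len(probs) - 1, 0.0
    while hi > lo and mass + probs[hi] <= half:
        mass += probs[hi]
        hi -= 1

    return lo, hi


toy = [0.001, 0.009, 0.48, 0.5, 0.009, 0.001]
tight = band_from_posterior(toy, tau=0.05)   # (2, 3): only two cells kept
loose = band_from_posterior(toy, tau=1e-7)   # (0, 5): nothing is negligible
```

A larger tau discards more of the tails, so the band covers fewer cells and alignment is faster, at the price of a higher chance that the optimal alignment passes through a discarded cell.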

--nonbanded: Turn off HMM banding. The returned alignment is guaranteed to be the globally optimally accurate one (by default) or the globally optimally scoring one (if --cyk is enabled). The --small option is recommended in combination with this option, because standard alignment without HMM banding requires a lot of memory (see --small).

--tau <x>: Set the tail loss probability used during HMM band calculation to <x>. This is the amount of probability mass within the HMM posterior probabilities that is considered negligible. The default value is 1E-7. In general, higher values will result in greater acceleration, but increase the chance of missing the optimal alignment due to the HMM bands.

--mxsize <x>: Set the maximum allowable DP matrix size to <x> megabytes. By default this size is 2,048 Mb. This should be large enough for the vast majority of alignments; however, if it is not, cmalign will exit prematurely and report an error message that the matrix exceeded its maximum allowable size. In this case, the --mxsize option can be used to raise the limit. This is most likely to occur when the --nonbanded option is used without the --small option, but can still occur when --nonbanded is not used.
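A back-of-envelope estimate shows why a non-banded matrix can exceed the limit. The Python sketch below is an approximation under loud assumptions, not the size calculation cmalign performs: it assumes one 4-byte float per (state, i, j) cell with i <= j over a sequence of length L, and the state counts in the usage note are hypothetical round numbers chosen for illustration.

```python
DEFAULT_MXSIZE_MB = 2048.0  # default limit stated on this page


def estimate_full_matrix_mb(n_states: int, seq_len: int, bytes_per_cell: int = 4) -> float:
    """Rough size in Mb of a full (non-banded) CM DP matrix.

    Assumption: one cell per state and per subsequence (i, j) with i <= j,
    including empty subsequences, each stored as a 4-byte float.
    """
    subsequences = (seq_len + 1) * (seq_len + 2) // 2
    return n_states * subsequences * bytes_per_cell / (1024 * 1024)


def exceeds_limit(n_states: int, seq_len: int, limit_mb: float = DEFAULT_MXSIZE_MB) -> bool:
    """True when the rough estimate is larger than the --mxsize limit."""
    return estimate_full_matrix_mb(n_states, seq_len) > limit_mb
```

Under these assumptions, a hypothetical model with 9000 states and a 2900-residue sequence comes out far above the default limit, while a hypothetical 230-state model with a 75-residue sequence needs only a few megabytes.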

--rna: Output the alignments as RNA sequence alignments. This is true by default.

--dna: Output the alignments as DNA sequence alignments.

--matchonly: Only include match columns in the output alignment; do not include any insertions relative to the consensus model.

--resonly: Only include match columns in the output alignment that have at least 1 residue (non-gap character) in them. By default all match columns are printed to the alignment, even those that are 100% gaps. --resonly replicates the default behavior of previous versions of cmalign.
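The column filter that --resonly describes can be illustrated on a small alignment. This Python sketch is not Infernal code: it treats every column alike (it does not distinguish match from insert columns), and it assumes "." and "-" are the gap characters. The function name is hypothetical.

```python
from typing import List, Sequence

GAP_CHARS = {".", "-"}  # assumption: these two characters denote gaps


def drop_all_gap_columns(rows: Sequence[str]) -> List[str]:
    """Remove every column in which no sequence has a residue."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned rows must all have the same length")
    keep = [col for col in range(width)
            if any(row[col] not in GAP_CHARS for row in rows)]
    return ["".join(row[col] for col in keep) for row in rows]


# drop_all_gap_columns(["GG-A", "G--A"]) returns ["GGA", "G-A"]
```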

--fins: Change how insert emissions are placed in the alignment. By default, all contiguous blocks of inserts are split in half, and half the residues are flushed left against the nearest consensus column to the left, and half are flushed right against the nearest consensus column on the right. With --fins, inserts are not split in half; instead, all inserted residues from IL states are flushed left, and all inserted residues from IR states are flushed right. --fins replicates the default behavior of previous versions of cmalign.
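The two placement rules can be compared on a single insert region between two consensus columns. This Python sketch is illustrative only; the function names are hypothetical, and sending the extra residue of an odd-length block to the left half is an assumption, since this page does not say which half receives it.

```python
def place_split(insert: str, width: int, gap: str = ".") -> str:
    """Default rule: split the block in half, flush halves left and right."""
    if len(insert) > width:
        raise ValueError("insert block is longer than the region")
    # Assumption: for an odd length, the left half gets the extra residue.
    left_len = (len(insert) + 1) // 2
    left, right = insert[:left_len], insert[left_len:]
    return left + gap * (width - len(insert)) + right


def place_fins(il_residues: str, ir_residues: str, width: int, gap: str = ".") -> str:
    """--fins rule: IL residues flush left, IR residues flush right."""
    used = len(il_residues) + len(ir_residues)
    if used > width:
        raise ValueError("insert residues are longer than the region")
    return il_residues + gap * (width - used) + ir_residues


# place_split("acgu", 8) returns "ac....gu"
# place_fins("acg", "u", 8) returns "acg....u"
```

The region width is set by the longest insert at that position across all sequences, so shorter inserts are padded with gaps in the middle under both rules.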

--onepost: Modifies behavior of the -p option. Use only one character instead of two to annotate the posterior probability of each aligned residue. Specifically, only the "#=GR <seq name> POSTX." tag is printed to the alignment. An "8" for "POSTX." indicates a posterior probability between 0.8 and 0.9 for the corresponding residue.

--merge: With --merge the usage of cmalign changes to cmalign --merge [options] cmfile msafile1 msafile2. Merge the two alignments in msafile1 and msafile2 created by previous runs of cmalign with cmfile together into a single alignment and exit. msafile1 and msafile2 must only have one alignment per file. This option allows the user to split up large sequence files into many smaller files, align them independently to cmfile on different computers to get many small alignments, and then merge them into a single large alignment.
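The split, align, and merge workflow described above can be sketched in Python. The helper names and all file names are hypothetical. The command lists use only the invocation forms and the -o option documented on this page, and are built but not executed here. The sketch assumes FASTA records start with ">".

```python
from typing import List


def split_fasta(text: str, n_chunks: int) -> List[str]:
    """Split FASTA text into up to n_chunks contiguous pieces, keeping input order."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current))
            current = [line]
        elif current and line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    if not records:
        return []
    size = -(-len(records) // n_chunks)  # ceiling division
    return ["\n".join(records[i:i + size]) + "\n"
            for i in range(0, len(records), size)]


def align_command(cmfile: str, seqfile: str, out: str) -> List[str]:
    """Argument list for: cmalign -o <out> cmfile seqfile"""
    return ["cmalign", "-o", out, cmfile, seqfile]


def merge_command(cmfile: str, msa1: str, msa2: str, out: str) -> List[str]:
    """Argument list for: cmalign --merge -o <out> cmfile msafile1 msafile2"""
    return ["cmalign", "--merge", "-o", out, cmfile, msa1, msa2]
```

Each list could be passed to subprocess.run on a machine where cmalign is installed. Since --merge takes exactly two alignments, combining more than two chunks presumably means merging pairwise and feeding each result into the next merge.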

--withali <f>: Read an alignment from file <f> and align it as a single object to the CM; i.e. the alignment in <f> is held fixed. This allows you to align sequences to a model with cmalign and view them in the context of an existing trusted multiple alignment. The alignment in the file <f> must be exactly the alignment that the CM was built from, or a subset of it with the following special property: the definition of consensus columns and consensus secondary structure must be identical between <f> and the alignment the CM was built from. One easy way to achieve this is to use the --rf option to cmbuild (see man page for cmbuild) and to maintain the "#=GC RF" annotation in the alignment when removing sequences to create the subset alignment <f>. To specify that the --rf option to cmbuild was used, enable the --rf option to cmalign (see --rf below).

--withpknots: Must be used in combination with --withali <f>. Propagate structural information for any pseudoknots that exist in <f> to the output alignment.

--rf: Must be used in combination with --withali <f>. Specify that the alignment in <f> has the same "#=GC RF" annotation as the alignment file the CM was built from using cmbuild and further that the --rf option was supplied to cmbuild when the CM was constructed.

--gapthresh <x>: Must be used in combination with --withali <f>. Specify that the --gapthresh <x> option was supplied to cmbuild when the CM was constructed from the alignment file <f>.

--tfile <f>: Dump tabular sequence tracebacks for each individual sequence to a file <f>. Primarily useful for debugging.

COPYRIGHT

Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).

See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR

Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/

October 2009

Infernal 1.0.2

Source file:	cmalign.1.en.gz (from infernal 1.0.2-2)
Source last updated:	2011-09-27T15:41:55Z
Converted to HTML:	2017-06-07T16:48:33Z