Infernal Manual

NAME¶

cmbuild - construct a CM from an RNA multiple sequence alignment

SYNOPSIS¶

cmbuild [options] cmfile alifile

DESCRIPTION¶

cmbuild reads an RNA multiple sequence alignment from alifile, constructs a covariance model (CM), and saves the CM to cmfile.

The alignment file must be in Stockholm format, and must contain consensus secondary structure annotation. cmbuild uses the consensus structure to determine the architecture of the CM.

The alignment file may be a database containing more than one alignment. If so, the resulting cmfile will be a database of CMs, one per alignment.

The expert options --ctarget, --cmaxid, and --call result in multiple CMs being built from each alignment in alifile, as described below.

OUTPUT¶

The default output from cmbuild is tabular, with a single line printed for each model. Each line has the following fields: aln: the index of the alignment used to build the CM; cm idx: the index of the CM in the cmfile; name: the name of the CM; nseq: the number of sequences in the alignment used to build the CM; eff_nseq: the effective number of sequences used to build the model (see the User Guide); alen: the length of the alignment used to build the CM; clen: the number of columns from the alignment defined as consensus columns; rel entropy, CM: the total relative entropy of the model divided by the number of consensus columns; rel entropy, HMM: the total relative entropy of the model ignoring secondary structure divided by the number of consensus columns.

OPTIONS¶

-h: Print brief help; includes version number and summary of all options, including expert options.

-n <s>: Name the covariance model <s>. (Does not work if alifile contains more than one alignment). The default is to use the name of the alignment (given by the #=GF ID tag, in Stockholm format), or if that is not present, to use the name of the alignment file minus any file type extension plus a "-" and a positive integer indicating the position of that alignment in the file (that is, the first alignment in a file "myrnas.sto" would give a CM named "myrnas-1", the second alignment would give a CM named "myrnas-2").
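The default naming rule described for -n can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not code from Infernal; the function name default_cm_name and its parameters are hypothetical, and stripping the directory part of the path is an assumption.

```python
import os

def default_cm_name(alifile, aln_index, gf_id=None):
    """Return the default CM name for the aln_index-th (1-based) alignment.

    Hypothetical helper: mirrors the rule in the text, not Infernal's source.
    """
    if gf_id:
        # A "#=GF ID" tag, when present, is used as the name directly.
        return gf_id
    # Otherwise: file name minus its extension, plus "-" and the position.
    # Assumption: any leading directory is dropped as well.
    stem, _ext = os.path.splitext(os.path.basename(alifile))
    return f"{stem}-{aln_index}"

print(default_cm_name("myrnas.sto", 1))
print(default_cm_name("myrnas.sto", 2))
print(default_cm_name("myrnas.sto", 1, gf_id="tRNA"))
```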

-A: Append the CM to cmfile, if cmfile already exists.

-F: Allow cmfile to be overwritten. Normally, if cmfile already exists, cmbuild exits with an error unless the -A or -F option is set.

-v: Run in verbose output mode instead of using the default single line tabular format. This output format is similar to that used by older versions of Infernal.

--iins: Allow informative insert emissions for the CM. By default, all CM insert emission scores are set to 0.0 bits. The motivation for zero bit scores is to avoid high-scoring hits to low complexity sequence favored by high insert state emission scores.

--Wbeta <x>: Set the beta tail loss probability for query-dependent banding (QDB) to <x>. The QDB algorithm is used to determine the maximum length of a hit to the model. For more information on QDB see (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56). The beta parameter is the amount of probability mass considered negligible during band calculation. Higher values of beta will result in shorter maximum hit lengths, which will yield faster searches. The default beta is 1E-7, determined empirically as a good tradeoff between sensitivity, specificity and speed.
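The role of beta as a negligible tail probability can be illustrated with a toy calculation. The sketch below is not the QDB algorithm: it assumes a ready-made probability distribution over hit lengths (length_probs, invented for the example) and simply finds the longest length that must be kept so that the discarded upper tail has mass at most beta. The function name max_hit_length is hypothetical.

```python
def max_hit_length(length_probs, beta):
    """Smallest W such that the probability of a hit longer than W is <= beta.

    length_probs[d] is a toy probability that a hit has length d.
    """
    tail = 0.0
    w = len(length_probs) - 1
    # Walk in from the long end, discarding lengths while the
    # discarded probability mass stays at or below beta.
    while w > 0 and tail + length_probs[w] <= beta:
        tail += length_probs[w]
        w -= 1
    return w

# Made-up length distribution for lengths 0..5.
probs = [0.0, 0.5, 0.3, 0.15, 0.04, 0.01]
for beta in (1e-7, 0.02, 0.06):
    print(beta, max_hit_length(probs, beta))
```

Calling it with several beta values on the same distribution shows how the cutoff moves.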

--devhelp: Print help, as with -h, but also include undocumented developer options. These options are not listed below. They are under development or experimental, and are not guaranteed to even work correctly. Use developer options at your own risk. The only resources for understanding what they actually do are the brief one-line description printed when --devhelp is enabled, and the source code.

EXPERT OPTIONS¶

--rsearch <f>: Parameterize emission scores a la RSEARCH, using the RIBOSUM matrix in file <f>. (Actually, the emission scores will not be identical to RIBOSUM scores due to differences in the modelling strategy between Infernal and RSEARCH, but they will be as similar as possible.) RIBOSUM matrix files are included with Infernal in the "matrices/" subdirectory of the top-level Infernal directory. RIBOSUM matrices are substitution score matrices trained specifically for structural RNAs with separate single stranded residue and base pair substitution scores. For more information see the RSEARCH publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).

With --rsearch enabled, all alignments in alifile must contain exactly one sequence or the --call option must also be enabled.

--binary: Save the model in a compact binary format. The default is a more readable ASCII text format.

--rf: Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which columns are consensus, and which are inserts. Any non-gap character indicates a consensus column. (For example, mark consensus columns with "x", and insert columns with ".".) The default is to determine this automatically; if the frequency of gap characters in a column is greater than a threshold, gapthresh (default 0.5), the column is called an insertion.

--gapthresh <x>: Set the gap threshold (used for determining which columns are insertions versus consensus; see --rf above) to <x>. The default is 0.5.
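The consensus/insert decision described for --rf and --gapthresh can be sketched as follows. This is a simplified illustration, not Infernal's implementation: the function name consensus_columns is hypothetical, the set of gap characters is an assumption, and gap frequencies are computed from unweighted sequence counts (the real program may apply sequence weights).

```python
GAP_CHARS = set("-._~")  # assumption: characters treated as gaps

def consensus_columns(seqs, gapthresh=0.5, rf=None):
    """Return a list of booleans, True where a column is a consensus column."""
    alen = len(seqs[0])
    if rf is not None:
        # --rf: any non-gap character in the RF line marks a consensus column.
        return [rf[i] not in GAP_CHARS for i in range(alen)]
    flags = []
    for i in range(alen):
        gaps = sum(1 for s in seqs if s[i] in GAP_CHARS)
        # A column is an insertion only if its gap frequency exceeds gapthresh.
        flags.append(gaps / len(seqs) <= gapthresh)
    return flags

aln = ["AC-U", "A--U", "AG-U", "A-GU"]
print(consensus_columns(aln))             # automatic, gapthresh 0.5
print(consensus_columns(aln, rf="x..x"))  # driven by an RF line
```

Note that a column with a gap frequency of exactly gapthresh is still treated as consensus here, following the "greater than" wording above.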

--ignorant: Strip all base pair secondary structure information from all input alignments in alifile before building the CM(s). All resulting CM(s) will have zero MATP (base pair) nodes, with zero bifurcations.

--wgsc: Use the Gerstein/Sonnhammer/Chothia (GSC) weighting algorithm. This is the default unless the number of sequences in the alignment exceeds a cutoff (see --pbswitch), in which case the default becomes the faster Henikoff position-based weighting scheme.

--wblosum: Use the BLOSUM filtering algorithm to weight the sequences, instead of the default GSC weighting. Cluster the sequences at a given percentage identity (see --wid); assign each cluster a total weight of 1.0, distributed equally amongst the members of that cluster.
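The weight assignment step of --wblosum is simple enough to show directly. The sketch below assumes the clustering has already been done and that each sequence carries a cluster label; the function name blosum_weights and the label representation are hypothetical, not taken from Infernal.

```python
from collections import Counter

def blosum_weights(cluster_ids):
    """Give each cluster a total weight of 1.0, shared equally by its members.

    cluster_ids[i] is the (precomputed) cluster label of sequence i.
    """
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]

# Six sequences in three clusters of sizes 2, 1 and 3.
weights = blosum_weights([0, 0, 1, 2, 2, 2])
print(weights)
print(sum(weights))  # approximately the number of clusters
```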

--wpb: Use the Henikoff position-based weighting scheme. This weighting scheme is automatically used (overriding --wgsc and --wblosum) if the number of sequences in the alignment exceeds a cutoff (see --pbswitch).
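For readers unfamiliar with position-based weighting, here is a minimal sketch of the general Henikoff scheme: in each column, a sequence receives 1/(k*n), where k is the number of distinct residues in the column and n is the number of sequences sharing its residue. The details are assumptions rather than Infernal's exact procedure: gaps are skipped, and weights are rescaled to sum to the number of sequences. The function name henikoff_pb_weights is hypothetical.

```python
from collections import Counter

GAPS = set("-.")  # assumption: characters treated as gaps

def henikoff_pb_weights(seqs):
    """Position-based weights for aligned sequences of equal length."""
    n = len(seqs)
    weights = [0.0] * n
    for col in zip(*seqs):
        counts = Counter(c for c in col if c not in GAPS)
        k = len(counts)  # number of distinct residues in this column
        for i, c in enumerate(col):
            if c not in GAPS:
                weights[i] += 1.0 / (k * counts[c])
    total = sum(weights)
    if total == 0.0:
        return [1.0] * n  # all-gap input: fall back to uniform weights
    # Assumption: rescale so the weights sum to the number of sequences.
    return [w * n / total for w in weights]

print(henikoff_pb_weights(["AAAA", "AAAA", "CCCC"]))
```

The two identical sequences share their contribution, so the divergent third sequence ends up with twice their weight.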

--wnone: Turn sequence weighting off; i.e. explicitly set all sequence weights to 1.0.

--wgiven: Use sequence weights as given in annotation in the input alignment file. If no weights were given, assume they are all 1.0. The default is to determine new sequence weights by the Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated weights.

--pbswitch <n>: Set the cutoff for automatically switching the weighting method to the Henikoff position-based weighting scheme to <n>. If the number of sequences in the alignment exceeds <n>, Henikoff weighting is used. By default <n> is 5000.

--wid <x>: Controls the behavior of the --wblosum weighting option by setting the percent identity for clustering the alignment to <x>.

--eent: Use the entropy weighting strategy to determine the effective sequence number that gives a target mean match state relative entropy. This option is the default, and can be turned off with --enone. The default target mean match state relative entropy is 0.59 bits but can be changed with --ere. The default of 0.59 bits is automatically changed if the total relative entropy of the model (summed match state relative entropy) is less than a cutoff, which is 6.0 bits by default, but can be changed with the expert, undocumented --eX option. If you really want to play with that option, consult the source code.

--enone: Turn off the entropy weighting strategy. The effective sequence number is just the number of sequences in the alignment.

--ere <x>: Set the target mean match state relative entropy as <x>. By default the target relative entropy per match position is 0.59 bits.
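The idea behind entropy weighting (scale the observed counts down until the mean relative entropy per match position reaches the target) can be sketched with a toy model. Everything here is a simplification and an assumption: only single-residue columns are modelled, a flat additive pseudocount stands in for the mixture Dirichlet prior, the null model is uniform, and the 6.0 bit total-entropy cutoff is ignored. The function names are hypothetical.

```python
import math

ALPHABET = "ACGU"

def mean_relative_entropy(columns, scale, pseudo=1.0, null=0.25):
    """Mean relative entropy (bits) per column with counts scaled by `scale`.

    columns: one string of observed residues per consensus column.
    pseudo:  flat pseudocount, a stand-in for the real prior (assumption).
    """
    total = 0.0
    for col in columns:
        counts = [scale * col.count(a) + pseudo for a in ALPHABET]
        denom = sum(counts)
        total += sum((c / denom) * math.log2((c / denom) / null) for c in counts)
    return total / len(columns)

def effective_nseq(columns, nseq, target=0.59, tol=1e-6):
    """Bisect for the effective sequence number that hits the target entropy."""
    if mean_relative_entropy(columns, 1.0) <= target:
        return float(nseq)  # already at or below target: no down-weighting
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_relative_entropy(columns, mid) > target:
            hi = mid
        else:
            lo = mid
    return nseq * (lo + hi) / 2

cols = ["A" * 10, "C" * 10, "AAAAAGGGGG"]
eff = effective_nseq(cols, nseq=10)
print(eff, mean_relative_entropy(cols, eff / 10))
```

Bisection is valid in this toy setting because, with a uniform null and a flat pseudocount, the relative entropy grows monotonically as the scale factor increases.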

--null <f>: Read a null model from <f>. The null model defines the probability of each RNA nucleotide in background sequence; the default is to use 0.25 for each nucleotide. The format of null files is documented in the User's Guide.

--prior <f>: Read a Dirichlet prior from <f>, replacing the default mixture Dirichlet. The format of prior files is documented in the User's Guide.

--ctarget <n>: Cluster each alignment in alifile by percent identity. Find a cutoff percent id threshold that gives exactly <n> clusters and build a separate CM from each cluster. If <n> is greater than the number of sequences in the alignment the program will not complain, and each sequence in the alignment will be its own cluster. Each CM will have a positive integer appended to its name indicating the order in which it was built. For example, if cmbuild --ctarget 3 is called with alifile "myrnas.sto", and "myrnas.sto" has exactly one Stockholm alignment in it with no #=GF ID tag annotation, three CMs will be built, the first will be named "myrnas-1.1", the second, "myrnas-1.2", and the third "myrnas-1.3". (As explained above for the -n option, the first number "1" after "myrnas" indicates the CM was built from the first alignment in "myrnas.sto".)

--cmaxid <x>: Cluster each sequence alignment in alifile by percent identity. Define clusters at the cutoff fractional id similarity of <x> and build a separate CM from each cluster. No two sequences will be more than <x> fractionally identical (<x> * 100 percent identical) if those two sequences are in different clusters. The CMs are named as described above for --ctarget.
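The guarantee stated for --cmaxid (sequences in different clusters are never more than <x> fractionally identical) is what single-linkage clustering provides, so a sketch using that method may help. Both the use of single linkage and the identity definition (identical aligned residues divided by the shorter ungapped length) are assumptions here, and the function names are hypothetical.

```python
GAPS = set("-.")  # assumption: characters treated as gaps

def pairwise_identity(a, b):
    """Fraction of identical aligned residues over the shorter ungapped length."""
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in GAPS)
    len_a = sum(1 for x in a if x not in GAPS)
    len_b = sum(1 for y in b if y not in GAPS)
    shorter = min(len_a, len_b)
    return idents / shorter if shorter else 0.0

def cluster_by_identity(seqs, maxid):
    """Single-linkage clusters: link any pair whose identity exceeds maxid."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if pairwise_identity(seqs[i], seqs[j]) > maxid:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

aln = ["AAAA", "AAAC", "GGGG", "GGGU"]
print(cluster_by_identity(aln, 0.7))
print(cluster_by_identity(aln, 0.8))
```

A --ctarget style search could reuse cluster_by_identity by trying candidate thresholds until the requested number of clusters appears.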

--call: Build a separate CM from each sequence in each alignment in alifile. Naming of CMs takes place as described above for --ctarget. Using this option in combination with --rsearch causes a separate CM to be built and parameterized using a RIBOSUM matrix for each sequence in alifile.

--corig: After building multiple CMs using --ctarget, --cmaxid or --call as described above, build a final CM using the complete original alignment from alifile. The CMs are named as described above for --ctarget, with the exception of the final CM built from the original alignment, which is named in the default manner, without an appended integer.

--cdump <f>: Dump the multiple alignments of each cluster to <f> in Stockholm format. This option only works in combination with --ctarget, --cmaxid or --call.

--refine <f>: Attempt to refine the alignment before building the CM using expectation-maximization (EM). A CM is first built from the initial alignment as usual. Then, the sequences in the alignment are realigned optimally (with the HMM banded CYK algorithm; optimal means optimal given the bands) to the CM, and a new CM is built from the resulting alignment. The sequences are then realigned to the new CM, and a new CM is built from that alignment. This is continued until convergence, specifically when the alignments for two successive iterations are not significantly different (the summed bit score of all the sequences in the alignment changes by less than 1% between two successive iterations). The final alignment (the alignment used to build the CM that gets written to cmfile) is written to <f>.
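The iteration described for --refine can be summarized as a short loop. This is a structural sketch only: build_cm and align_to_cm are hypothetical callables supplied by the caller (they stand in for model construction and realignment), the iteration cap max_iter is an addition for safety, and measuring the 1% change relative to the previous iteration's score is an assumption.

```python
def refine(alignment, build_cm, align_to_cm, max_iter=100, threshold=0.01):
    """Iterate build/realign until the summed score changes by at most 1%.

    build_cm(alignment) -> cm
    align_to_cm(cm, alignment) -> (new_alignment, summed_bit_score)
    Both callables are placeholders supplied by the caller.
    """
    cm = build_cm(alignment)
    prev_score = None
    for _ in range(max_iter):
        alignment, score = align_to_cm(cm, alignment)
        cm = build_cm(alignment)  # the CM always matches the latest alignment
        if prev_score is not None and abs(score - prev_score) <= threshold * abs(prev_score):
            break  # converged: scores of successive iterations are within 1%
        prev_score = score
    return cm, alignment

# Toy stand-ins: the "alignment" is a counter, and the score saturates at 100.
toy_build = lambda aln: ("cm", aln)
toy_align = lambda cm, aln: (aln + 1, 100.0 - 50.0 / (aln + 1))
print(refine(0, toy_build, toy_align))
```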

--gibbs: Modifies the behavior of --refine so Gibbs sampling is used instead of EM. The difference is that during the alignment stage the alignment is not necessarily optimal; instead an alignment (parsetree) for each sequence is sampled from the posterior distribution of alignments as determined by the Inside algorithm. Due to this sampling step --gibbs is non-deterministic, so different runs with the same alignment may yield different results. This is not true when --refine is used without the --gibbs option, in which case the final alignment and CM will always be the same. When --gibbs is enabled, the -s <n> option can be used to seed the random number generator predictably, making the results reproducible. The goal of the --gibbs option is to help expert RNA alignment curators refine structural alignments by allowing them to observe alternative high scoring alignments.

-s <n>: Set the random seed to <n>, where <n> is a positive integer. This option can only be used in combination with --gibbs. The default is to use time() to generate a different seed for each run, which means that two different runs of cmbuild --refine <f> --gibbs on the same alignment will give slightly different results. You can use this option to generate reproducible results.
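The contrast between picking the optimal alignment and sampling one, and the effect of a fixed seed, can be shown with a toy example. The candidate names and posterior probabilities below are invented for illustration, and the functions are hypothetical; nothing here reflects how Infernal represents parsetrees.

```python
import random

CANDIDATES = ["parsetree_A", "parsetree_B", "parsetree_C"]  # made-up alignments
POSTERIORS = [0.6, 0.3, 0.1]                                # made-up probabilities

def optimal_alignment():
    """EM-style choice: always take the highest-probability candidate."""
    best = max(range(len(CANDIDATES)), key=lambda i: POSTERIORS[i])
    return CANDIDATES[best]

def sampled_alignments(seed, n=5):
    """Gibbs-style choice: draw candidates in proportion to their posterior."""
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    return [rng.choices(CANDIDATES, weights=POSTERIORS, k=1)[0] for _ in range(n)]

print(optimal_alignment())
print(sampled_alignments(seed=42))
print(sampled_alignments(seed=42))  # same seed, same sequence of draws
```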

-l: With --refine, turn on the local alignment algorithm, which allows the alignment to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. The default is to globally align the query model to the target sequences.

-a: With --refine, print the scores of each individual sequence alignment.

--cyk: With --refine, align with the CYK algorithm. By default the optimal accuracy algorithm is used. There is more information on this in the cmalign manual page.

--sub: With --refine, turn on the sub model construction and alignment procedure. For each sequence to be realigned, an HMM is first used to predict the model start and end consensus columns, and a new sub CM is constructed that only models consensus columns from start to end. The sequence is then aligned to this sub CM. This option is useful for building CMs for alignments with sequences that are known to be truncated, non-full length sequences. This option is experimental and not rigorously tested; use at your own risk. This "sub CM" procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo.

--nonbanded: With --refine, do not use HMM bands to accelerate alignment. Use the full CYK algorithm which is guaranteed to give the optimal alignment. This will slow down the run significantly, especially for large models.

--tau <x>: With --refine, set the tail loss probability used during HMM band calculation to <x>. This is the amount of probability mass within the HMM posterior probabilities that is considered negligible. The default value is 1E-7. In general, higher values will result in greater acceleration, but increase the chance of missing the optimal alignment due to the HMM bands.

--fins: With --refine, change the behavior of how insert emissions are placed in the alignment. By default, all contiguous blocks of inserts are split in half, and half the residues are flushed left against the nearest consensus column to the left, and half are flushed right against the nearest consensus column on the right. With --fins, inserts are not split in half; instead all inserted residues from IL states are flushed left, and all inserted residues from IR states are flushed right. This was the default behavior of previous versions of Infernal.
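The two insert placement styles can be illustrated on a single insert region between two consensus columns. This is a sketch of the layout rule as described above, not Infernal's code: the function names are hypothetical, "." is assumed as the gap character, and sending the extra residue of an odd-length block to the left is an arbitrary assumption.

```python
def place_inserts_default(residues, width, gap="."):
    """Default: split the insert block in half, flush halves left and right."""
    n_left = (len(residues) + 1) // 2  # assumption: odd extra residue goes left
    left, right = residues[:n_left], residues[n_left:]
    return left + gap * (width - len(residues)) + right

def place_inserts_fins(il_residues, ir_residues, width, gap="."):
    """--fins: IL-state residues flush left, IR-state residues flush right."""
    n_gaps = width - len(il_residues) - len(ir_residues)
    return il_residues + gap * n_gaps + ir_residues

print(place_inserts_default("acgu", 6))
print(place_inserts_default("acg", 5))
print(place_inserts_fins("acg", "u", 6))
```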

--mxsize <x>: With --refine, set the maximum allowable matrix size for alignment to <x> megabytes. By default this size is 2 Gb. This should be large enough for the vast majority of alignments; however, it is possible that when run with --refine, cmbuild will exit prematurely, reporting an error message that the matrix exceeded its maximum allowable size. In this case, the --mxsize option can be used to raise the limit.

--rdump <f>: With --refine, output the intermediate alignments at each iteration of the refinement procedure (as described above for --refine) to file <f>.

COPYRIGHT¶

Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).

See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR¶

Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/

October 2009

Infernal 1.0.2
