NAME
Zerg - a lexical scanner for BLAST reports.
SYNOPSIS
use Zerg;
DESCRIPTION
This manpage describes the Zerg library and its interface for use with Perl.
The Zerg library contains a C/flex lexical scanner for BLAST reports and a set
of supporting functions. It is centered on a "get_token" function
that scans the input for specified lexical elements and, when one is found,
returns its code and value to the user.
It is intended to be fast: it is built with flex, which provides simple regular
expression matching and input buffering in the generated C scanner. It is also
intended to be simple: it provides just a lexical scanner, with no features
whose support could slow down its main function.
FUNCTIONS
zerg_get_token() is the core function of this module. Each time it is
called, it scans the input BLAST report for the next "interesting"
lexical element and returns its code and value. Codes are listed in the
section "EXPORTED CONSTANTS (TOKEN CODES)". Code zero (not listed)
means end of file.
($code, $value) = Zerg::zerg_get_token();
zerg_open_file($filename) opens $filename in read-only mode and sets it as the
input to the scanner. If this function is not called, the scanner reads from
standard input.
Zerg::zerg_open_file($filename);
zerg_close_file() closes the file opened with zerg_open_file().
zerg_get_token_offset() returns the byte offset (relative to the
beginning of file) of the last token read. (See section BUGS).
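One possible use of the offset is to build an index of where each record of interest starts. The following is a minimal sketch, not part of the library: index_tokens is a hypothetical helper, and it takes the token and offset functions as code references so that it can be tried against a stub as well as against Zerg. It assumes token codes compare numerically and that the offset is queried right after the token is returned.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [offset, value] pairs for every
# token whose code equals $wanted.
#   $next_token  code ref returning ($code, $value), like zerg_get_token
#   $get_offset  code ref returning the offset of the last token read
sub index_tokens {
    my ($next_token, $get_offset, $wanted) = @_;
    my @index;
    while (1) {
        my ($code, $value) = $next_token->();
        last unless $code;               # code zero means end of file
        next unless $code == $wanted;    # skip tokens we do not index
        push @index, [ $get_offset->(), $value ];
    }
    return @index;
}

# Assumed usage with Zerg (file name is illustrative):
#   use Zerg;
#   Zerg::zerg_open_file('report.blast');
#   Zerg::zerg_ignore_all();
#   Zerg::zerg_unignore(QUERY_NAME);
#   my @index = index_tokens(\&Zerg::zerg_get_token,
#                            \&Zerg::zerg_get_token_offset, QUERY_NAME);
#   Zerg::zerg_close_file();
```

QUERY_NAME is used in the usage comment because it is not among the tokens that section BUGS lists as having unreliable offsets.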
zerg_ignore($code) instructs zerg_get_token not to return when it finds a token
with code $code.
zerg_ignore_all() does zerg_ignore on all token codes.
zerg_unignore($code) instructs zerg_get_token to return when it finds a token
with code $code.
zerg_unignore_all() does zerg_unignore on all token codes.
Example:
Zerg::zerg_ignore_all();
Zerg::zerg_unignore(QUERY_NAME);
Zerg::zerg_unignore(SUBJECT_NAME);
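With only these two codes enabled, the token stream reduces to query names and subject names. A minimal sketch of how such a stream might be grouped follows; hits_by_query is a hypothetical helper, not a library function, and it assumes that each SUBJECT_NAME belongs to the most recently seen QUERY_NAME. The token function is passed as a code reference so the sketch can run against a stub.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each query name to the list of subject names
# that follow it in the token stream.
#   $next_token  code ref returning ($code, $value), like zerg_get_token
sub hits_by_query {
    my ($next_token, $query_code, $subject_code) = @_;
    my (%hits, $current);
    while (1) {
        my ($code, $value) = $next_token->();
        last unless $code;                    # code zero means end of file
        if ($code == $query_code) {
            $current = $value;
            $hits{$current} ||= [];           # keep queries without hits
        }
        elsif ($code == $subject_code && defined $current) {
            push @{ $hits{$current} }, $value;
        }
    }
    return \%hits;
}

# Assumed usage with Zerg, after the ignore/unignore calls shown above:
#   my $hits = hits_by_query(\&Zerg::zerg_get_token,
#                            QUERY_NAME, SUBJECT_NAME);
#   print "$_\t@{ $hits->{$_} }\n" for sort keys %$hits;
```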
EXPORTED CONSTANTS (TOKEN CODES)
ALIGNMENT_LENGTH
BLAST_VERSION
CONVERGED
DATABASE
DESCRIPTION_ANNOTATION
DESCRIPTION_EVALUE
DESCRIPTION_HITNAME
DESCRIPTION_SCORE
END_OF_REPORT
EVALUE
GAPS
HSP_METHOD
IDENTITIES
NOHITS
PERCENT_IDENTITIES
PERCENT_POSITIVES
POSITIVES
QUERY_ALI
QUERY_ANNOTATION
QUERY_END
QUERY_FRAME
QUERY_LENGTH
QUERY_NAME
QUERY_ORIENTATION
QUERY_START
REFERENCE
ROUND_NUMBER
ROUND_SEQ_FOUND
ROUND_SEQ_NEW
SCORE
SCORE_BITS
SEARCHING
SUBJECT_ALI
SUBJECT_ANNOTATION
SUBJECT_END
SUBJECT_FRAME
SUBJECT_LENGTH
SUBJECT_NAME
SUBJECT_ORIENTATION
SUBJECT_START
TAIL_OF_REPORT
UNMATCHED
NOTES ON THE SCANNER
Some BLAST parsers rely on simple regular expression matches to decide
token types and values. For example, an input line matching
/^Query=\s(\S+)/ leads such a "loose" parser to infer that a token was
found, that it is a query name, and that its value is $1. Although improbable,
it is perfectly possible for an annotation field to match /^Query=\s(\S+)/.
Worse, those parsers are often unable to detect corrupt or truncated BLAST
reports, and may silently produce inaccurate information.
The scanner provided by this library is much more stringent: for a token to
match, it must be in its proper place in the context of a BLAST report. For
example, in a single BLAST report, a QUERY_NAME cannot follow another
QUERY_NAME. The scanner can be thought of as, and in fact is, one big regular
expression that matches an entire BLAST report.
A special token code (UNMATCHED) is provided for cases in which the input text
does not match any other lexical rule of the scanner. When an unmatched
character is found, either the report is corrupt or the scanner has a bug.
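A script that validates many reports may want a count of unmatched input rather than a line of output per character. The following is a minimal sketch under that assumption; count_unmatched is a hypothetical helper, not a library function, and the token function is passed as a code reference so the sketch can run against a stub.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of UNMATCHED tokens and the
# concatenation of their values.
#   $next_token      code ref returning ($code, $value), like zerg_get_token
#   $unmatched_code  the token code to count (UNMATCHED when using Zerg)
sub count_unmatched {
    my ($next_token, $unmatched_code) = @_;
    my ($count, $text) = (0, '');
    while (1) {
        my ($code, $value) = $next_token->();
        last unless $code;                     # code zero means end of file
        next unless $code == $unmatched_code;  # other tokens are not counted
        $count++;
        $text .= $value if defined $value;
    }
    return ($count, $text);
}

# Assumed usage with Zerg:
#   Zerg::zerg_ignore_all();
#   Zerg::zerg_unignore(UNMATCHED);
#   my ($count, $text) = count_unmatched(\&Zerg::zerg_get_token, UNMATCHED);
#   exit($count ? 1 : 0);    # non-zero exit status flags a suspect report
```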
If you are interested in only a few token codes, zerg_ignore() as many of the
other codes as you can. This avoids unnecessary function calls, which consume a
lot of CPU time.
EXAMPLES
This program prints the code and the value of each token it finds.
#!/usr/bin/perl -w
use strict;
use Zerg;
my ($code, $value);
# Read tokens from standard input until code zero (end of file).
while((($code, $value)= Zerg::zerg_get_token()) && $code)
{
    print "$code\t$value\n";
}
The program below is a "syntax checker". The presence of UNMATCHED tokens is
a strong indicator of problems in the BLAST report. (See section NOTES ON THE
SCANNER.)
#!/usr/bin/perl -w
use strict;
use Zerg;
my ($code, $value);
# Report only UNMATCHED tokens; every other code is ignored.
Zerg::zerg_ignore_all();
Zerg::zerg_unignore(UNMATCHED);
while((($code, $value)= Zerg::zerg_get_token()) && $code)
{
    print "UNMATCHED CHAR:\t$value\n";
}
BUGS
The tokens DESCRIPTION_ANNOTATION, DESCRIPTION_SCORE and DESCRIPTION_EVALUE are
scanned all at once and released one by one on user request. So, to receive any
of these fields, unignore them BEFORE the scanner reaches
DESCRIPTION_ANNOTATION.
zerg_get_token_offset() may return incorrect values for these tokens and
those that are modified by the parser, namely: QUERY_LENGTH, SUBJECT_LENGTH,
EVALUE, GAPS.
TODO
Add more tokens to the scanner as the need arises.
AUTHOR
Apuã Paquola, IQ-USP Bioinformatics Lab, apua@iq.usp.br
Laszlo Kajan <lkajan@rostlab.org>, Technical University of Munich, Germany
SEE ALSO
perl(1),
flex(1),
http://www.bioperl.org,
http://www.ncbi.nlm.nih.gov/BLAST