table of contents
Search::QueryParser(3pm) | User Contributed Perl Documentation | Search::QueryParser(3pm) |
NAME¶
Search::QueryParser - parses a query string into a data structure suitable for external search engines
SYNOPSIS¶
my $qp = new Search::QueryParser; my $s = '+mandatoryWord -excludedWord +field:word "exact phrase"'; my $query = $qp->parse($s) or die "Error in query : " . $qp->err; $someIndexer->search($query); # query with comparison operators and implicit plus (second arg is true) $query = $qp->parse("txt~'^foo.*' date>='01.01.2001' date<='02.02.2002'", 1); # boolean operators (example below is equivalent to "+a +(b c) -d") $query = $qp->parse("a AND (b OR c) AND NOT d"); # subset of rows $query = $qp->parse("Id#123,444,555,666 AND (b OR c)");
DESCRIPTION¶
This module parses a query string into a data structure to be handled by external search engines. For examples of such engines, see File::Tabular and Search::Indexer.
The query string can contain simple terms, "exact phrases", field names and comparison operators, '+/-' prefixes, parentheses, and boolean connectors.
The parser can be parameterized by regular expressions for specific notions of "term", "field name" or "operator" ; see the new method. The parser has no support for lemmatization or other term transformations : these should be done externally, before passing the query data structure to the search engine.
The data structure resulting from a parsed query is a tree of terms and operators, as described below in the parse method. The interpretation of the structure is up to the external search engine that will receive the parsed query ; the present module does not make any assumption about what it means to be "equal" or to "contain" a term.
QUERY STRING¶
The query string is decomposed into "items", where each item has an optional sign prefix, an optional field name and comparison operator, and a mandatory value.
Sign prefix¶
Prefix '+' means that the item is mandatory. Prefix '-' means that the item must be excluded. No prefix means that the item will be searched for, but is not mandatory.
As far as the result set is concerned, "+a +b c" is strictly equivalent to "+a +b" : the search engine will return documents containing both terms 'a' and 'b', and possibly also term 'c'. However, if the search engine also returns relevance scores, query "+a +b c" might give a better score to documents containing also term 'c'.
See also section "Boolean connectors" below, which is another way to combine items into a query.
Field name and comparison operator¶
Internally, each query item has a field name and comparison operator; if not written explicitly in the query, these take default values '' (empty field name) and ':' (colon operator).
Operators have a left operand (the field name) and a right operand (the value to be compared with); for example, "foo:bar" means "search documents containing term 'bar' in field 'foo'", whereas "foo=bar" means "search documents where field 'foo' has exact value 'bar'".
Here is the list of admitted operators with their intended meaning :
- ":"
- treat value as a term to be searched within field. This is the default operator.
- "~" or "=~"
- treat value as a regex; match field against the regex.
- "!~"
- negation of above
- "==" or "=", "<=", ">=", "!=", "<", ">"
- classical relational operators
- "#"
- Inclusion in the set of comma-separated integers supplied on the right-hand side.
Operators ":", "~", "=~", "!~" and "#" admit an empty left operand (so the field name will be ''). Search engines will usually interpret this as "any field" or "the whole data record".
Value¶
A value (right operand to a comparison operator) can be
- just a term (as recognized by regex "rxTerm", see new method below)
- A quoted phrase, i.e. a collection of terms within single or double
quotes.
Quotes can be used not only for "exact phrases", but also to prevent misinterpretation of some values : for example "-2" would mean "value '2' with prefix '-'", in other words "exclude term '2'", so if you want to search for value -2, you should write "-2" instead. In the last example of the synopsis, quotes were used to prevent splitting of dates into several search terms.
- a subquery within parentheses. Field names and operators distribute over parentheses, so for example "foo:(bar bie)" is equivalent to "foo:bar foo:bie". Nested field names such as "foo:(bar:bie)" are not allowed. Sign prefixes do not distribute : "+(foo bar) +bie" is not equivalent to "+foo +bar +bie".
Boolean connectors¶
Queries can contain boolean connectors 'AND', 'OR', 'NOT' (or their equivalent in some other languages). This is mere syntactic sugar for the '+' and '-' prefixes : "a AND b" is translated into "+a +b"; "a OR b" is translated into "(a b)"; "NOT a" is translated into "-a". "+a OR b" does not make sense, but it is translated into "(a b)", under the assumption that the user understands "OR" better than a '+' prefix. "-a OR b" does not make sense either, but has no meaningful approximation, so it is rejected.
Combinations of AND/OR clauses must be surrounded by parentheses, i.e. "(a AND b) OR c" or "a AND (b OR c)" are allowed, but "a AND b OR c" is not.
METHODS¶
- new
-
new(rxTerm => qr/.../, rxOp => qr/.../, ...)
Creates a new query parser, initialized with (optional) regular expressions :
- rxTerm
- Regular expression for matching a term. Of course it should not match the empty string. Default value is "qr/[^\s()]+/". A term should not be allowed to include parenthesis, otherwise the parser might get into trouble.
- rxField
- Regular expression for matching a field name. Default value is "qr/\w+/" (meaning of "\w" according to "use locale").
- rxOp
- Regular expression for matching an operator. Default value is "qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/". Note that the longest operators come first in the regex, because "alternatives are tried from left to right" (see "Version 8 Regular Expressions" in perlre) : this is to avoid "a<=3" being parsed as "a < '=3'".
- rxOpNoField
- Regular expression for a subset of the operators which admit an empty left operand (no field name). Default value is "qr/=~|!~|~|:/". Such operators can be meaningful for comparisons with "any field" or with "the whole record" ; the precise interpretation depends on the search engine.
- rxAnd
- Regular expression for boolean connector AND. Default value is "qr/AND|ET|UND|E/".
- rxOr
- Regular expression for boolean connector OR. Default value is "qr/OR|OU|ODER|O/".
- rxNot
- Regular expression for boolean connector NOT. Default value is "qr/NOT|PAS|NICHT|NON/".
- defField
- If no field is specified in the query, use defField. The default is the empty string "".
- parse
-
$q = $queryParser->parse($queryString, $implicitPlus);
Returns a data structure corresponding to the parsed string. The second argument is optional; if true, it adds an implicit '+' in front of each term without prefix, so "parse("+a b c -d", 1)" is equivalent to "parse("+a +b +c -d")". This is often seen in common WWW search engines as an option "match all words".
The return value has following structure :
{ '+' => [{field=>'f1', op=>':', value=>'v1', quote=>'q1'}, {field=>'f2', op=>':', value=>'v2', quote=>'q2'}, ...], '' => [...], '-' => [...] }
In other words, it is a hash ref with 3 keys '+', '' and '-', corresponding to the 3 sign prefixes (mandatory, ordinary or excluded items). Each key holds either a ref to an array of items, or "undef" (no items with this prefix in the query).
An item is a hash ref containing
- "field"
- scalar, field name (may be the empty string)
- "op"
- scalar, operator
- "quote"
- scalar, character that was used for quoting the value ('"', "'" or undef)
- "value"
- Either
- a scalar (simple term), or
- a recursive ref to another query structure. In that case, "op" is necessarily '()' ; this corresponds to a subquery in parentheses.
In case of a parsing error, "parse" returns "undef"; method err can be called to get an explanatory message.
AUTHOR¶
Laurent Dami, <laurent.dami AT etat ge ch>
COPYRIGHT AND LICENSE¶
Copyright (C) 2005, 2007 by Laurent Dami.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
2022-11-27 | perl v5.36.0 |