NAME
Text::Shellwords::Cursor - Parse a string into tokens
SYNOPSIS
use Text::Shellwords::Cursor;
my $parser = Text::Shellwords::Cursor->new();
my $str = 'ab cdef "ghi" j"k\"l "';
my ($tok1) = $parser->parse_line($str);
# $tok1 is now ['ab', 'cdef', 'ghi', 'j', 'k"l ']
my ($tok2, $tokno, $tokoff) = $parser->parse_line($str, cursorpos => 6);
# as above, but $tokno=1, $tokoff=3 (under the 'f')
DESCRIPTION
This module is very similar to Text::Shellwords and Text::ParseWords. However,
it has one very significant difference: it keeps track of a character position
in the line it's parsing. For instance, if you pass it ("zq fmgb",
cursorpos=>6), it would return (['zq', 'fmgb'], 1, 3). The cursorpos
parameter tells where in the input string the cursor resides (just before the
'b'), and the result tells you that the cursor was on token 1 ('fmgb'),
character 3 ('b'). This is very useful when computing command-line completions
involving quoting, escaping, and tokenizing characters (like '(' or '=').
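
The cursor bookkeeping described above can be sketched in a few lines of plain Perl. This is only an illustration: split_with_cursor is a hypothetical helper, not part of this module, and it splits on whitespace alone, ignoring quoting, escaping, and token characters. It also does not create a zero-length token when the cursor sits in whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Shellwords::Cursor: splits on
# whitespace only and reports which token holds the character at
# $cursorpos, plus the offset of that character within the token.
sub split_with_cursor {
    my ($line, $cursorpos) = @_;
    my (@tokens, $tokno, $tokoff);
    while ($line =~ /(\S+)/g) {
        my $start = $-[1];    # offset of the token's first character
        my $end   = $+[1];    # offset just past its last character
        push @tokens, $1;
        # The cursor belongs to this token if it sits on one of its
        # characters or immediately after the last one.
        if (!defined $tokno && $cursorpos >= $start && $cursorpos <= $end) {
            $tokno  = $#tokens;
            $tokoff = $cursorpos - $start;
        }
    }
    return (\@tokens, $tokno, $tokoff);
}

my ($tokens, $tokno, $tokoff) = split_with_cursor("zq fmgb", 6);
print "token $tokno ('$tokens->[$tokno]'), offset $tokoff\n";
```

For the input above this reports token 1 ('fmgb') at offset 3, matching the example in the text.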
A few helper utilities are included as well. You can escape a string to ensure
that parsing it will produce the original string (parse_escape). You can also
reassemble the tokens with a visually pleasing amount of whitespace between
them (join_line).
This module started out as an integral part of Term::GDBUI using code loosely
based on Text::ParseWords. However, it is now basically a ground-up
reimplementation. It was split out of Term::GDBUI for version 0.8.
METHODS
- new
- Creates a new parser. Takes the following named arguments:
- keep_quotes
- Normally all unescaped, unnecessary quote marks are
stripped. If you specify "keep_quotes=>1", however, they are
preserved. This is useful if you need to know whether the string was
quoted or not (string constants) or what type of quotes surrounded it
(affecting variable interpolation, for instance).
- token_chars
- This argument specifies the characters that should be
considered tokens all by themselves. For instance, if I pass
token_chars=>'=', then 'ab=123' would be parsed to ('ab', '=', '123').
Without token_chars, 'ab=123' remains a single string.
NOTE: you cannot change token_chars after the constructor has been called!
The regexps that use it are compiled once (m//o). Also, until the GNU
Readline library can accept "=[]," without diving into an
endless loop, we will not tell history expansion to use token_chars (it
uses " \t\n()<>;&|" by default).
- debug
- Turns on rather copious debugging to try to show what the
parser is thinking at every step.
- space_none
- space_before
- space_after
- These variables affect how whitespace in the line is
normalized when it is reassembled into a string. See the join_line
routine.
- error
- This is a reference to a routine that should be called to
display a parse error. The routine takes two arguments: a reference to the
parser, and the error message to display as a string.
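
A constructor call combining some of these arguments might look like the sketch below. It assumes the module is installed (the require is wrapped in eval so the script exits cleanly otherwise), and the @errors array is purely illustrative. The callback's two arguments follow the description of error above.

```perl
use strict;
use warnings;

# Load at run time so the sketch exits cleanly if the module is absent.
eval { require Text::Shellwords::Cursor; 1 }
    or do { print "Text::Shellwords::Cursor is not installed\n"; exit 0 };

my @errors;    # messages collected by the error callback (illustrative)

my $parser = Text::Shellwords::Cursor->new(
    token_chars => '=',          # '=' becomes a token of its own
    error       => sub {
        my ($self, $msg) = @_;   # the parser, then the message text
        push @errors, $msg;
    },
);

# Per the token_chars description, 'ab=123' should split into three tokens.
my ($tokens) = $parser->parse_line('ab=123');
print join('|', @$tokens), "\n";
```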
- parsebail(msg)
- If the parsel routine or any of its subroutines runs into a
fatal error, it calls parsebail to present a very descriptive
diagnostic.
- parsel
- This is the heinous routine that actually does the parsing.
You should never need to call it directly. Call parse_line instead.
- parse_line(line, named args)
- This is the entrypoint to this module's parsing
functionality. It converts a line into tokens, respecting quoted text,
escaped characters, etc. It also keeps track of a cursor position on the
input text, returning the token number and offset within the token where
that position can be found in the output.
This routine originally bore some resemblance to Text::ParseWords. It has
changed almost completely, however, to support keeping track of the cursor
position. It now also has nicer failure modes, modular quoting, token
characters (see token_chars in "new"), etc.
Arguments:
- line
- This is a string containing the command-line to parse.
This routine also accepts the following named parameters:
- cursorpos
- This is the character position in the line to keep track
of. Pass undef (by not specifying it) or the empty string to have the line
processed with cursorpos ignored.
Note that passing undef is not the same as passing some random number
and ignoring the result! For instance, if you pass 0 and the line begins
with whitespace, you'll get a 0-length token at the beginning of the line
to represent the cursor in the middle of the whitespace. This allows
command completion to work even when the cursor is not near any tokens. If
you pass undef, all whitespace at the beginning and end of the line will
be trimmed as you would expect.
If it is ambiguous whether the cursor should belong to the previous token or
to the following one (i.e. if it's between two quoted strings, say
"a""b" or a token_char), it always gravitates to the
previous token. This makes more sense when completing.
- fixclosequote
- Sometimes you want to try to recover from a missing close
quote (for instance, when calculating completions), but usually you want a
missing close quote to be a fatal error. fixclosequote=>1 will
implicitly insert the correct quote if it's missing. fixclosequote=>0
is the default.
- messages
- parse_line is capable of printing very informative error
messages. However, sometimes you don't care enough to print a message
(like when calculating completions). Messages are printed by default, so
pass messages=>0 to turn them off.
This function returns a list of three items:
- tokens
- The tokens that the line was separated into (ref to an
array of strings).
- tokno
- The number of the token (index into the previous array)
that contains cursorpos.
- tokoff
- The character offset into tokno of cursorpos.
If the cursor is at the end of the token, tokoff will point to 1 character past
the last character in tokno, a nonexistent character. If the cursor is
between tokens (surrounded by whitespace), a zero-length token will be created
for it.
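
Putting the named parameters together, a completion-style caller might use parse_line roughly as follows. This sketch assumes the module is installed and exits cleanly if it is not. The second input line is made up for illustration, and no particular result is claimed for it.

```perl
use strict;
use warnings;

# Load at run time so the sketch exits cleanly if the module is absent.
eval { require Text::Shellwords::Cursor; 1 }
    or do { print "Text::Shellwords::Cursor is not installed\n"; exit 0 };

my $parser = Text::Shellwords::Cursor->new();

# Track the cursor at character 6 (the 'b' in "zq fmgb").
my ($tokens, $tokno, $tokoff) = $parser->parse_line('zq fmgb', cursorpos => 6);
print "cursor is in token $tokno at offset $tokoff\n";

# Completion-style call: tolerate a missing close quote and stay quiet.
my ($partial) = $parser->parse_line(
    'echo "unfinished',
    cursorpos     => 16,    # just past the last character
    fixclosequote => 1,
    messages      => 0,
);
print scalar(@$partial), " token(s) recovered\n" if $partial;
```

The first call uses the example from the DESCRIPTION, so it should report token 1 at offset 3.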
- parse_escape(lines)
- Escapes characters that would be otherwise interpreted by
the parser. Will accept either a single string or an arrayref of strings
(which will be modified in-place).
- join_line(tokens)
- This routine does a somewhat intelligent job of joining
tokens back into a command line. If token_chars (see "new") is
empty (the default), then it just escapes backslashes and quotes, and
joins the tokens with spaces.
However, if token_chars is nonempty, it tries to insert a visually pleasing
amount of space between the tokens. For instance, rather than 'a ( b , c
)', it tries to produce 'a (b, c)'. It won't reformat any tokens that
aren't found in $self->{token_chars}, of course.
To change the formatting, you can redefine the variables
$self->{space_none}, $self->{space_before}, and
$self->{space_after}. Each variable is a string containing all
characters that should not be surrounded by whitespace, should have
whitespace before, and should have whitespace after, respectively. Any
character found in token_chars, but not in any of these space_ variables,
will have space placed both before and after.
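
The spacing rules can be illustrated with a small standalone sketch. It is not the module's code: join_with_spacing is a hypothetical re-implementation of the rules as described above, the settings in %cfg are assumed values rather than the module's defaults, and escaping is left out entirely. A space is emitted only when the token on the left allows one after it and the token on the right allows one before it.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the parser's settings; the module's real
# defaults may differ.
my %cfg = (
    token_chars  => '(),',
    space_none   => '',
    space_before => '(',
    space_after  => '),',
);

# Illustrative re-implementation of the spacing rules only (no escaping).
sub join_with_spacing {
    my ($cfg, @tokens) = @_;
    my $out      = '';
    my $suppress = 1;    # true when no space may precede the next token
    for my $tok (@tokens) {
        my $is_tc  = length($tok) == 1 && index($cfg->{token_chars}, $tok) >= 0;
        my $none   = $is_tc && index($cfg->{space_none},   $tok) >= 0;
        my $before = $is_tc && index($cfg->{space_before}, $tok) >= 0;
        my $after  = $is_tc && index($cfg->{space_after},  $tok) >= 0;
        # Ordinary tokens and unlisted token_chars get space on both sides.
        my $want_before = !$is_tc || $before || !($none || $after);
        my $want_after  = !$is_tc || $after  || !($none || $before);
        $out .= ' ' if $want_before && !$suppress;
        $out .= $tok;
        $suppress = !$want_after;
    }
    return $out;
}

print join_with_spacing(\%cfg, 'a', '(', 'b', ',', 'c', ')'), "\n";
```

With these assumed settings the tokens come out as 'a (b, c)' rather than 'a ( b , c )'.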
BUGS
None known.
LICENSE
Copyright (c) 2003-2011 Scott Bronson, all rights reserved. This program is
covered by the MIT license.
AUTHOR
Scott Bronson <bronson@rinspin.com>