April 22nd, 1994
NAME
lookup - interactive file search and display
SYNOPSIS
lookup [ args ] [ file ... ]
DESCRIPTION
Lookup allows quick interactive searching of text files. It supports
ASCII, JIS-ROMAN, and Japanese EUC packed-format text, and has an integrated
romaji→kana converter.
THIS MANUAL
Lookup is flexible for a variety of applications. This manual will,
however, focus on the application of searching Jim Breen's
edict
(Japanese-English dictionary) and
kanjidic (kanji database). Being
familiar with the content and format of these files would be helpful. See the
INFO section near the end of this manual for information on how to obtain
these files and their documentation.
OVERVIEW OF MAJOR FEATURES
The following just mentions some major features to whet your appetite to
actually read the whole manual (-:
- Romaji-to-Kana Converter
- Lookup can convert romaji to kana for you, even “on the fly” as you
type.
- Fuzzy Searching
- Searches can be a bit “vague” or “fuzzy”, so that you'll be able to
find “東京” even if you try to search for “ときょ” (the proper yomikata
being “とうきょう”).
- Regular Expressions
- Uses powerful and expressive regular expressions for searching.
One can easily specify complex searches that express “I want
lines that look like such-and-such, but not like this-and-that, but that
also have this particular characteristic....”
- Wildcard ``Glob'' Patterns
- Optionally, can use well-known filename wildcard patterns instead of
full-fledged regular expressions.
- Filters
- You can have lookup not list certain lines that would otherwise
match your search, yet can optionally save them for quick review. For
example, you could have all name-only entries from edict filtered
from normal output.
- Automatic Modifications
- Similarly, you can do a standard search-and-replace on lines just before
they print, perhaps to remove information you don't care to see on most
searches. For example, if you're generally not interested in
kanjidic's info on Chinese readings, you can have them removed from
lines before printing.
- Smart Word-Preference Mode
- You can have lookup list only entries with whole words that
match your search (as opposed to an embedded match, such as
finding “the” inside “them”), but if no whole-word matches exist, it
will go ahead and list any entry that matches the search.
- Handy Features
- Other handy features include a dynamically settable and parameterized
prompt, automatic highlighting of that part of the line that matches your
search, an output pager, readline-like input with horizontal scrolling for
long input lines, a “.lookup” startup file, automated programmability,
and much more. Read on!
REGULAR EXPRESSIONS
Lookup makes liberal use of
regular expressions (or
regex
for short) in controlling various aspects of the searches. If you are not
familiar with the important concepts of regexes, read the tutorial appendix of
this manual before continuing.
JAPANESE CHARACTER ENCODING METHODS
Internally,
lookup works with Japanese packed-format EUC, and all files
loaded must be encoded similarly. If you have files encoded in JIS or
Shift-JIS, you must first convert them to EUC before loading (see the INFO
section for programs that can do this).
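As an illustration of the conversion step, here is a minimal Python sketch that re-encodes a Shift-JIS or JIS file as EUC before it is handed to lookup. It is not one of the programs mentioned in the INFO section; the function names and the file paths passed to them are hypothetical, and the codec names ("shift_jis", "iso2022_jp", "euc_jp") are the ones shipped with Python's standard library.

```python
def to_euc_jp(raw: bytes, source_encoding: str = "shift_jis") -> bytes:
    """Re-encode raw bytes from source_encoding into packed EUC-JP."""
    text = raw.decode(source_encoding)
    return text.encode("euc_jp")


def convert_file(src_path: str, dst_path: str,
                 source_encoding: str = "shift_jis") -> None:
    """Read src_path in source_encoding and write an EUC-JP copy to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_euc_jp(data, source_encoding))
```

For a JIS (ISO-2022-JP) file, pass "iso2022_jp" as the source encoding. The converted copy is the one to load into lookup.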
Interactive input and output encoding, however, may be selected via the -jis,
-sjis, and -euc invocation flags (default is -euc), or by various commands to
the program (described later).
Make sure to use the encoding appropriate for your system. If you're using kterm
under the X Window System, you can use lookup's -jis flag to match
kterm's default JIS encoding. Or, you might use kterm's “-km euc” startup
option (or menu selection) to put kterm into EUC mode. Also, I have found
kterm's scrollbar (“-sb -sl 500”) to be quite useful.
With many “English” fonts in Japan, the character that normally prints as a
backslash (halfwidth version of ＼) in The States appears as a yen symbol
(the half-width version of ￥). How it will appear on your system is a
function of what font you use and what output encoding method you choose,
which may be different from the font and method that was used to print this
manual (both of which may be different from what's printed on your keyboard's
appropriate key). Make sure to keep this in mind while reading.
STARTUP
Let's assume that your copy of
edict is in ~/lib/edict. You can start the
program simply with
lookup ~/lib/edict
You'll note that lookup spends some time building an index before the
default “lookup> ” prompt appears.
Lookup gains much of its search speed by constructing an index of the
file(s) to be searched. Since building the index can be time consuming itself,
you can have lookup write the built index to a file that can be quickly
loaded the next time you run the program. Index files will be given
a “.jin” (Jeffrey's Index) ending.
Let's build the indices for
edict and
kanjidic now:
lookup -write ~/lib/edict ~/lib/kanjidic
This will create the index files
~/lib/edict.jin
~/lib/kanjidic.jin
and exit.
You can now re-start lookup, automatically using the pre-computed index
files, as:
lookup ~/lib/edict ~/lib/kanjidic
You should then be presented with the prompt without having to wait for the
index to be constructed (but see the section on Operating System concerns for
possible reasons of delay).
There are basically two types of input: searches and commands. Commands do such
things as tell
lookup to load more files or set flags. Searches report
lines of a file that match some search specifier (where lines to search for
are specified by one or more regular expressions).
The input syntax may perhaps at first seem odd, but has been designed to be
powerful and concise. A bit of time invested to learn it well will pay off
greatly when you need it.
BRIEF EXAMPLE
Assuming you've started lookup with edict and kanjidic as
noted above, let's try a few searches. In these examples, the
“search [edict]> ” is the prompt. Note that the space after
the ‘>’ is part of the prompt.
Given the input:
search [edict]> tranquil
lookup will report all lines with the string “tranquil” in them. There are
currently about a dozen such lines, two of which look like:
安らか [やすらか] /peaceful (an)/tranquil/calm/restful/
安らぎ [やすらぎ] /peace/tranquility/
Notice that lines with “tranquil” and “tranquility” matched?
This is because “tranquil” was embedded in the word “tranquility”. You could
restrict the search to only the word “tranquil” by prepending
the special “start of word” symbol ‘<’ and appending
the special “end of word” symbol ‘>’ to the regex, as in:
search [edict]> <tranquil>
This is the regular expression that says “the beginning of a word,
followed by a ‘t’, ‘r’, ..., ‘l’, which is at the end of a
word.” The current version of edict has just three
matching entries.
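The difference between an embedded match and a whole-word match can be sketched in Python. This is only an analogy: Python's `\b` stands in for lookup's ‘<’ and ‘>’ word symbols, whose exact rules may differ, and the `search` helper is a hypothetical name.

```python
import re

# The two edict lines quoted above.
lines = [
    "安らか [やすらか] /peaceful (an)/tranquil/calm/restful/",
    "安らぎ [やすらぎ] /peace/tranquility/",
]


def search(pattern, lines, fold=True):
    """Return the lines matching pattern; fold=True ignores letter case."""
    flags = re.IGNORECASE if fold else 0
    rx = re.compile(pattern, flags)
    return [line for line in lines if rx.search(line)]


embedded = search(r"tranquil", lines)        # also finds "tranquility"
whole_word = search(r"\btranquil\b", lines)  # whole word only
```

With these two lines, the plain pattern selects both, while the word-bounded pattern selects only the first.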
Let's try another:
search [edict]> fukushima
This is a search for the “English” fukushima -- ways
to search for kana or kanji will be explored later. Note that among the
several lines selected and printed are:
副島 [ふくしま] /Fukushima (pn,pl)/
木曽福島 [きそふくしま] /Kisofukushima (pl)/
By default, searches are done in a case-insensitive manner
-- ‘F’ and ‘f’ are treated the same by lookup, at least so far as the
matching goes. This is called case folding.
Let's give a command to turn this option off, so
that ‘f’ and ‘F’ won't be considered the same. Here's an odd point about
lookup's input syntax: the default setting is that all command lines must
begin with a space. The space is the (default) command-introduction character
and tells the input parser to expect a command rather than a search regular
expression.
It is a common mistake at first to forget the leading space when issuing a
command. Be careful.
Try the command “ fold” to report the current
status of case-folding. Notice that as soon as you type the space, the prompt
changes to “lookup command> ” as a reminder that now you're typing a command
rather than a search specification.
lookup command> fold
The reply should be “file #0's case folding is on”.
You can actually turn it off with “ fold off”. Now
try the search for “fukushima” again. Notice that
this time the entries with “Fukushima” aren't
listed? Now try the search string “Fukushima” and
see that the entries with “fukushima” aren't
listed.
Case folding is usually very convenient (it also makes corresponding katakana
and hiragana match the same), so don't forget to turn it back on:
lookup command> fold on
Lookup has an automatic romaji→kana converter. A
leading ‘/’ indicates that romaji is to follow.
Try typing “/tokyo” and you'll see it convert to “/ときょ” as
you type. When you hit return, lookup will list all lines that have
a “ときょ” somewhere in them. Well, sort of. Look carefully at the lines
which match. Among them (if you had case folding back on) you'll see:
キリスト教 [キリストきょう] /Christianity/
東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
凸鏡 [とっきょう] /convex lens/
The first one has “ときょ” in it (as “トきょ”,
where the katakana “ト” matches in a case-insensitive manner to the
hiragana “と”), but you might consider the others unexpected, since they
don't have “ときょ” in them. They're close (“とうきょ” and “とっきょ”),
but not exact. This is the result of lookup's “fuzzification”. Try the
command “ fuzz” (again, don't forget the
command-introduction space). You'll see that fuzzification is turned on. Turn
it off with “ fuzz off” and try “/tokyo” (which will convert as you type)
again. This time you only get the lines which have “ときょ” exactly
(well, case folding is still on, so it might match katakana as well).
In a fuzzy search, length of vowels is ignored
-- “と” is considered the same as “とう”, for
example. Also, the presence or absence of any “っ” character is ignored, and
the pairs じ ぢ, ず づ, え ゑ, and お を are considered identical in a
fuzzy search.
It might be convenient to consider a fuzzy search to be
a “pronunciation search”. Special note:
fuzzification will not be performed if a regular
expression “*”, “+”, or “?” modifies
a non-ASCII character. This is not an issue when input patterns are
filename-like wildcard patterns (discussed below).
In addition to kana fuzziness, there's one special case for kanji when fuzziness
is on. The kanji repeater mark “々” will be recognized such
that “時々” and “時時” will match each other.
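The idea of a “pronunciation search” can be sketched in Python by reducing both the query and the line to a common key before comparing them. This is a simplified illustration and not lookup's implementation, which builds a fuzzified regular expression instead; the names `fuzzy_key` and `fuzzy_match` and the rule used for long vowels (dropping a doubled vowel, or a う that follows an o-row or u-row kana) are assumptions made for the example.

```python
# Pairs the manual lists as identical in a fuzzy search.
EQUIVALENT = str.maketrans({"ぢ": "じ", "づ": "ず", "ゑ": "え", "を": "お"})
# Kana ending in an "o" or "u" sound; a following "う" only lengthens them.
O_OR_U_ROW = set("おこそとのほもよろごぞどぼぽょうくすつぬふむゆるぐずぶぷゅ")
VOWELS = set("あいうえお")


def fold_katakana(text):
    """Map katakana onto hiragana, as case folding does for kana."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch for ch in text
    )


def fuzzy_key(text):
    """Reduce text to a key that ignores vowel length, っ and the listed pairs."""
    out = []
    for ch in fold_katakana(text).translate(EQUIVALENT):
        if ch in ("っ", "ー"):
            continue                      # small tsu and long-vowel mark
        if ch == "々" and out:
            out.append(out[-1])           # kanji repeater: 時々 -> 時時
            continue
        if ch == "う" and out and out[-1] in O_OR_U_ROW:
            continue                      # とう -> と
        if ch in VOWELS and out and out[-1] == ch:
            continue                      # doubled vowel
        out.append(ch)
    return "".join(out)


def fuzzy_match(query, line):
    """True if the fuzzy key of query occurs in the fuzzy key of line."""
    return fuzzy_key(query) in fuzzy_key(line)
```

Under these rules the three example lines above (キリスト教, 東京, 凸鏡) all match the query ときょ, because each reading reduces to a key containing ときょ.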
Turn fuzzification back on (“fuzz on”), and search
for all whole words which sound like “tokyo”. That search would be
specified as:
search [edict]> /<tokyo>
(again, the “tokyo” will be converted to “ときょ” as
you type). My copy of edict has the three lines
東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
特許 [とっきょ] /special permission/patent/
凸鏡 [とっきょう] /convex lens/
This kind of whole-word romaji-to-kana search is so common, there's a special
short cut. Instead of typing “/<tokyo>”,
you can type “[tokyo]”. The
leading ‘[’ means “start romaji” and “start of word”.
Were you to type “<tokyo>” instead (without
a leading ‘/’ or ‘[’ to
indicate romaji-to-kana conversion), you would get all lines with the
English whole-word “tokyo” in them. That
would be a reasonable request as well, but not what we want at the moment.
Besides the kana conversion, you can use any cut-and-paste that your windowing
system might provide to get Japanese text onto the search line.
Cut “ときょ” from
somewhere and paste onto the search line. When hitting enter to run the
search, you'll notice that it is done without fuzzification (even if the
fuzzification flag was “on”). That's because
there's no leading ‘/’. Not only does a
leading ‘/’ indicate that you want the
romaji-to-kana conversion, but that you want it done fuzzily.
So, if you'd like fuzzy cut-and-paste, just type a
leading ‘/’ before pasting (or go back and prepend
one after pasting).
These examples have all been pretty simple, but you can use all the power that
regexes have to offer. As a slightly more complex example, the
search “<gr[ea]y>” would look for all lines
with the words “grey” or “gray” in
them. Since the ‘[’ isn't the first character of
the line, it doesn't mean what was mentioned above (start-of-word romaji). In
this case, it's just the
regular-expression “class” indicator.
If you feel more comfortable using
filename-like “*.txt” wildcard patterns, you can
use the “wildcard on” command to have patterns be
considered this way.
This has been a quick introduction to the basics of
lookup.
It can be very powerful and much more complex. Below is a detailed description
of its various parts and features.
The actual keystrokes are read by a readline-ish package that is pretty
standard. In addition to just typing away, the following keystrokes are
available:
^B / ^F move left/right one character on the line
^A / ^E move to the start/end of the line
^H / ^G delete one character to the left/right of the cursor
^U / ^K delete all characters to the left/right of the cursor
^P / ^N previous/next lines on the history list
^L or ^R redraw the line
^D delete char under the cursor, or EOF if line is empty
^space force romaji conversion (^@ on some systems)
If automatic romaji-to-kana conversion is turned on (as it is by default), there
are certain situations where the conversion will be done, as we saw above.
Lower-case romaji will be converted to hiragana, while upper-case romaji to
katakana. This usually won't matter, though, as case folding will treat
hiragana and katakana the same in the searches.
In exactly what situations the automatic conversion will be done is intended to
be rather intuitive once the basic idea is learned. However, at
any
time, one can use control-space to convert the ASCII to the left of the
cursor to kana. This can be particularly useful when needing to enter kana on
a command line (where auto conversion is never done; see below).
ROMAJI FLAVOR
Most flavors of romaji are recognized. Special or non-obvious items are
mentioned below. Lowercase are converted to hiragana, uppercase to katakana.
Long vowels can be entered by repeating the vowel, or
with ‘-’ or ‘^’.
In situations where an “n” could be vague, as
in “na” being な or んあ, use a single quote to force ん.
Therefore, 「kenichi」→けにち while 「ken'ichi」→けんいち.
The romaji has been richly extended with many non-standard combinations such as
ふぁ or ちぇ, which are represented in intuitive
ways: 「fa」→ふぁ, 「che」→ちぇ, etc.
Various other mappings of interest:
wo →を    we →ゑ    wi →ゐ
VA →ヴァ  VI →ヴィ  VU →ヴ    VE →ヴェ  VO →ヴォ
di →ぢ    dzi→ぢ    dya→ぢゃ  dyu→ぢゅ  dyo→ぢょ
du →づ    tzu→づ    dzu→づ
(the following kana are all smaller versions of the regular kana)
xa →ぁ    xi →ぃ    xu →ぅ    xe →ぇ    xo →ぉ
xu →ぅ    xtu→っ    xwa→ゎ    xka→ヵ    xke→ヶ
xya→ゃ    xyu→ゅ    xyo→ょ
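The way a single quote disambiguates “n” can be shown with a small longest-match converter in Python. This is a sketch under stated assumptions, not lookup's converter: the table holds only the handful of syllables needed for the examples in this manual, and the names `ROMAJI`, `to_katakana`, and `romaji_to_kana` are hypothetical.

```python
# A deliberately tiny table; a real converter would cover every syllable.
ROMAJI = {
    "ke": "け", "ni": "に", "chi": "ち", "i": "い", "u": "う",
    "n": "ん", "n'": "ん", "to": "と", "kyo": "きょ",
    "fa": "ふぁ", "che": "ちぇ", "wo": "を", "di": "ぢ", "du": "づ",
    "xtu": "っ",
}


def to_katakana(hiragana):
    """Shift hiragana code points into the katakana block."""
    return "".join(
        chr(ord(ch) + 0x60) if "ぁ" <= ch <= "ゖ" else ch for ch in hiragana
    )


def romaji_to_kana(text):
    """Convert romaji by always taking the longest table entry that fits."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            kana = ROMAJI.get(chunk.lower())
            if kana is not None and len(chunk) == size:
                # Upper-case romaji becomes katakana, lower-case hiragana.
                out.append(to_katakana(kana) if chunk.isupper() else kana)
                i += size
                break
        else:
            out.append(text[i])  # leave anything unrecognized untouched
            i += 1
    return "".join(out)
```

Because the longest entry wins, “ni” is preferred over “n” followed by “i”, so kenichi yields けにち, while the quote in ken'ichi forces ん and yields けんいち.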
INPUT SYNTAX
Any input line beginning with a space (or whichever character is set as the
command-introduction character) is processed as a command to lookup
rather than a search spec.
Automatic kana conversion is never done on these lines (but
forced conversion with control-space may be done at any time).
Other lines are taken as search regular expressions, with the following special
cases:
- ?
- A line consisting of a single question mark will report the current
command-introduction character (the default is a space, but can be changed
with the “cmdchar” command).
- =
- If a line begins with ‘=’, the line (without
the ‘=’) is taken as a search regular
expression, and no automatic (or internal -- see below) kana conversion is
done anywhere on the line (although again, conversion can always be forced
with control-space). This can be used to initiate a search where the
beginning of the regex is the command-introduction character, or in
certain situations where automatic kana conversion is temporarily not
desired.
- /
- A line beginning with ‘/’ indicates romaji
input for the whole line. If automatic kana conversion is turned on, the
conversion will be done in real-time, as the romaji is typed. Otherwise it
will be done internally once the line is entered. Regardless, the
presence of the leading ‘/’ indicates that any
kana (either converted or cut-and-pasted in) should
be “fuzzified” if fuzzification is turned on.
As an addition to the above, if the line doesn't begin
with ‘=’ or the command-introduction character
(and automatic conversion is turned on), ‘/’
anywhere on the line initiates automatic conversion for the
following word.
- [
- A line beginning with ‘[’ is taken to be
romaji (just as a line beginning with ‘/’),
and the converted romaji is subject to fuzzification (if turned on).
However, if ‘[’ is used rather than ‘/’, an
implied ‘<’ “beginning of word” is prepended to the resulting kana regex.
Also, any ending ‘]’ on such a line is converted to
the “ending of word” specifier ‘>’ in the
resulting regex.
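The first-character rules above can be summarized as a small dispatch function. The Python sketch below is an illustration only: `classify` and the kind labels it returns are hypothetical names, and it ignores the mid-line ‘/’ rule and the prefixes described next.

```python
def classify(line, cmdchar=" "):
    """Classify an input line by its first character.

    Returns a (kind, text) pair, where kind is one of "command",
    "show-cmdchar", "regex-no-conversion", "romaji-fuzzy" or "regex".
    """
    if line.startswith(cmdchar):
        return ("command", line[len(cmdchar):])
    if line == "?":
        return ("show-cmdchar", "")
    if line.startswith("="):
        return ("regex-no-conversion", line[1:])
    if line.startswith("/"):
        return ("romaji-fuzzy", line[1:])
    if line.startswith("["):
        body = line[1:]
        if body.endswith("]"):
            body = body[:-1] + ">"   # closing ']' becomes end-of-word
        return ("romaji-fuzzy", "<" + body)  # implied start-of-word
    return ("regex", line)
```

Under this sketch, “[tokyo]” and “/<tokyo>” classify identically, which mirrors the short cut described in the brief example.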
In addition to the above, lines may have certain prefixes and suffixes to
control aspects of the search or command:
- !
- Various flags can be toggled for the duration of a particular search by
prepending a “!!” sequence to the input line.
Sequences are shown below, along with commands related to each:
!F! … Filtration is toggled for this line (filter)
!M! … Modification is toggled for this line (modify)
!w! … Word-preference mode is toggled for this line (word)
!c! … Case folding is toggled for this line (fold)
!f! … Fuzzification is toggled for this line (fuzz)
!W! … Wildcard-pattern mode is toggled for this line (wildcard)
!r! … Raw. Force fuzzification off for this line
!h! … Highlighting is toggled for this line (highlight)
!t! … Tagging is toggled for this line (tag)
!d! … Displaying is on for this line (display)
The letters can be combined, as in “!cf!”.
The final ‘!’ can be omitted if the first
character after the sequence is not an ASCII letter.
If no letters are given (“!!”), “!f!” is the default.
These last two points can be conveniently combined in the common case
of “!/romaji” which would be the same as “!f!/romaji”.
The special sequence “!?” lists the above, as
well as indicates which are currently turned on.
Note that the letters accepted in
a “!!” sequence are many of the indicators
shown by the “files” command.
- +
- A ‘+’ prepended to anything above will cause
the final search regex to be printed. This can be useful to see when and
what kind of fuzzification and/or internal kana conversion is happening.
Consider:
search [edict]> +/わかる
a match is “わ[ぁあー]*っ?か[ぁあー]*る[ぅうおぉー]*”
Due to the leading ‘/’, the kana is fuzzified,
which explains the somewhat complex resulting regex. For comparison, note:
search [edict]> +わかる
a match is “わかる”
search [edict]> +!/わかる
a match is “わかる”
As the ‘+’ shows, these are not fuzzified. The
first one has no leading ‘/’ or ‘[’ to
induce fuzzification, while the second has
the ‘!’ line prefix (which is the default
version of “!f!”), which toggles
fuzzification mode to “off” for that line.
- ,
- The default of all searches and most commands is to work with the first
file loaded (edict in these examples). One can change this default
(see the “select” command) or, by appending a
comma+digit sequence at the end of an input line, force that line to work
with another previously-loaded file. An
appended “,1” works with the first extra file
loaded (in these examples, kanjidic). An
appended “,2” works with the 2nd extra file
loaded, etc.
An appended “,0” works with the original first
file (and can be useful if the default file has been changed via
the “select” command).
The following sequence shows a common usage:
search [edict]> [ときょと]
東京都 [とうきょうと] /Tokyo Metropolitan area/
cutting and pasting the 都 from above, and adding
a “,1” to search kanjidic:
search [edict]> 都,1
都 4554 N4769 S11 ..... ト ツ みやこ {metropolis} {capital}
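The prefixes and the slot suffix described in this list can be peeled off an input line in a fixed order. The Python sketch below is one possible reading of those rules, not lookup's parser: `split_modifiers` is a hypothetical name, the “!?” listing sequence is not handled, and a “!” followed by letters outside the documented set is simply left as part of the search text.

```python
import re

FLAG_LETTERS = set("FMwcfWrhtd")  # letters listed for the "!!" sequence


def split_modifiers(line):
    """Split an input line into (show_regex, flags, slot, rest).

    show_regex: True if a leading '+' asked for the final regex.
    flags:      letters of a leading "!...!" sequence ("f" if none given).
    slot:       number from a trailing ",N" suffix, or None.
    rest:       what remains to be treated as the search itself.
    """
    show_regex = line.startswith("+")
    if show_regex:
        line = line[1:]

    flags = ""
    m = re.match(r"!([A-Za-z]*)(!?)", line)
    if m and set(m.group(1)) <= FLAG_LETTERS:
        flags = m.group(1) or "f"     # bare "!" or "!!" defaults to "f"
        line = line[m.end():]

    slot = None
    m = re.search(r",(\d+)$", line)
    if m:
        slot = int(m.group(1))
        line = line[:m.start()]
    return show_regex, flags, slot, line
```

Because the letter run is matched greedily, the character after an unterminated sequence is never an ASCII letter, which is the condition the manual gives for omitting the final ‘!’.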
FILENAME-LIKE WILDCARD MATCHING
When wildcard-pattern mode is selected, patterns are considered as extended
“*.txt”-like patterns. This is often more convenient
for users not familiar with regular expressions. To have this mode selected by
default, put
default wildcard on
into your “.lookup” file (see “STARTUP FILE” below).
When wildcard mode is on, only “*”, “?”, “+”, and “.” are
affected. See the entry for the “wildcard” command below for details.
Other features, such as the multiple-pattern searches (described below) and
other regular-expression metacharacters, are still available.
MULTIPLE-PATTERN SEARCHES
You can put multiple patterns in a single search specifier. For example, consider
search [edict]> china||japan
The first part (“china”) will select all lines
that have “china” in them. Then, from among
those lines, the second part will select lines that
have “japan” in them.
The “||” is not part of any pattern -- it is
lookup's “pipe” mechanism.
The above example is very different from the single pattern
“china|japan” which would select any line that
had either “china” or “japan”.
With “china||japan”, you get lines that
have “china” and then also have “japan” as well.
Note that it is also different from the regular
expression “china.*japan” (or the wildcard
pattern “china*japan”) which would select lines
having “china, then maybe some stuff, then
japan”. But consider the case
when “japan” comes on the line
before “china”. Just for your comparison, the
multiple-pattern specifier “china||japan” is
pretty much the same as the single regular
expression “china.*japan|japan.*china”.
If you use “|!|” instead
of “||”, it will mean “...and then lines not matching...”.
Consider a way to find all lines of
kanjidic that do have a Halpern
number, but don't have a Nelson number:
search [edict]> <H\d+>|!|<N\d+>
If you then wanted to restrict the listing to those that
also had a “jinmeiyou” marking (kanjidic's “G9” field) and had a reading of
あき, you could make it:
search [edict]> <H\d+>|!|<N\d+>||<G9>||<あき>
A prepended ‘+’ would explain:
a match is “<H\d+>”
and not “<N\d+>”
and “<G9>”
and “<あき>”
The “|!|” and “||” can
be used to make up to ten separate regular expressions in any one search
specification.
Again, it is important to stress that “||” does not
mean “or” (as it does in a C program, or
as ‘|’ does within a regular expression). You
might find it convenient to
read “||” as “and also”, while
reading “|!|” as “but not”.
It is also important to stress that any whitespace around
the “||” and “|!|” construct is
not ignored, but kept as part of the regex on either side.
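The “and also” and “but not” reading of “||” and “|!|” amounts to filtering the surviving lines once per pattern. The Python sketch below illustrates that reading with Python's own regular expressions. It is not lookup's implementation: `multi_search` is a hypothetical name, and lookup's ‘<’ and ‘>’ word symbols are not translated.

```python
import re


def multi_search(spec, lines, max_patterns=10):
    """Apply a "a||b|!|c" style specifier to lines, one pattern at a time."""
    # Split on "|!|" or "||", keeping the separators; a single '|' is
    # left alone because it belongs to the regex itself.
    parts = re.split(r"(\|!\||\|\|)", spec)
    patterns = parts[0::2]
    operators = parts[1::2]
    if len(patterns) > max_patterns:
        raise ValueError("too many patterns in one search specification")

    selected = [ln for ln in lines if re.search(patterns[0], ln, re.IGNORECASE)]
    for op, pattern in zip(operators, patterns[1:]):
        rx = re.compile(pattern, re.IGNORECASE)
        if op == "||":    # "and also"
            selected = [ln for ln in selected if rx.search(ln)]
        else:             # "|!|": "but not"
            selected = [ln for ln in selected if not rx.search(ln)]
    return selected


sample = ["China and Japan", "Japan then China", "China only", "Japan only"]
```

Note that whitespace around the separators is kept as part of the neighboring pattern here as well, matching the warning above.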
COMBINATION SLOTS
Each file, when loaded, is assigned to a “slot” via
which subsequent references to the file are then made. The slot may then be
searched, have filters and flags set, etc.
A special kind of slot, called a “combination slot”, rather than
representing a single file, can represent
multiple previously-loaded slots. Searches against a combination slot
(or “combo slot” for short) search all those
previously-loaded slots associated with it (called “component
slots”). Combo slots are set up with the combine command.
A combo slot has no filter or modify spec, but can have a local prompt and flags
just like normal file slots. The flags, however, have special meanings with
combo slots. Most combo-slot flags act as a mask against the component-slot
flags; when acted upon as a member of the combo, a component-slot's flag will
be disabled if the corresponding combo-slot's flag is disabled.
Exceptions to this are the autokana, fuzz, and tag flags.
The autokana and fuzz flags govern a combo slot exactly the same
as a regular file slot. When a slot is searched as a component of a
combination slot, the component slot's fuzz (and autokana)
flags, or lack thereof, are ignored.
The tag flag is quite different altogether; see the tag command
for complete information.
Consider the following output from the files command:
┏━┳━━━━┯━━┳━━━┳━━━━━━━━━━━━━━
┃ 0┃F wcfh d│a I ┃ 2762k┃/usr/jfriedl/lib/edict
┃ 1┃FM cf  d│a I ┃  705k┃/usr/jfriedl/lib/kanjidic
┃ 2┃F  cfh@d│a   ┃    1k┃/usr/jfriedl/lib/local.words
┃*3┃FM cfhtd│a   ┃ combo┃kotoba (#2, #0)
┗━┻━━━━┷━━┻━━━┻━━━━━━━━━━━━━━
See the discussion of the files command below for basic explanation of
the output.
As can be seen, slot #3 is a combination slot with the
name “kotoba” with component slots two and
zero. When a search is initiated on this slot, first slot
#2 “local.words” will be searched, then slot
#0 “edict”. Because the combo slot's filter flag is
on, the component slots' filter flag will
remain on during the search. The combo slot's word flag is off,
however, so slot #0's word flag will be forced off during the search.
See the combine command for information about creating combo slots.
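The masking rule can be written down in a few lines of Python. This is a sketch of the rule as described in this section, not lookup's code: `effective_flags` is a hypothetical name, flags are modeled as plain booleans, and the tag flag is passed through unchanged because its real behavior is documented under the tag command.

```python
COMBO_GOVERNED = {"fuzz", "autokana"}  # taken from the combo slot as-is


def effective_flags(component, combo):
    """Flags in force for a component slot searched via a combo slot."""
    result = {}
    for name, value in component.items():
        if name in COMBO_GOVERNED:
            result[name] = combo.get(name, False)
        elif name == "tag":
            result[name] = value  # placeholder; see the tag command
        else:
            # Mask: on only if both the component and the combo have it on.
            result[name] = value and combo.get(name, False)
    return result


# Flags of slot #0 (edict) and combo slot #3 (kotoba) from the listing above.
edict = {"filter": True, "word": True, "fold": True, "fuzz": True,
         "highlight": True, "display": True}
kotoba = {"filter": True, "modify": True, "word": False, "fold": True,
          "fuzz": True, "highlight": True, "tag": True, "display": True}
in_force = effective_flags(edict, kotoba)
```

For the example listing, filter stays on while word is forced off, as the text above describes.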
Lookup has a built-in pager (à la more). Upon filling a screen
with text, the string
--MORE [space,return,c,q]--
is shown. A space will allow another screen of text; a return will allow one
more line. A ‘c’ will allow output text to
continue unpaged until the next command. A ‘q’
will flush output of the current command.
If supported by the OS, lookup's idea of the screen size is automatically
set upon startup and window resize.
Lookup must know the width of the screen both for the horizontal
input-line scrolling and for knowing when a long line wraps on the screen.
The pager parameters can be set manually with
the “pager” command.
COMMANDS
Any line intended to be a command must begin with the command-introduction
character (the default is a space, but can be set via
the “cmdchar” command). However, that character is
not part of the command itself and won't be shown in the following list of
commands.
There are a number of commands that work with the selected file or
selected slot (both meaning the same thing). The selected file is the
one indicated by an appended comma+digit, as mentioned above. If no such
indication is given, the default selected file is used (usually the
first file loaded, but can be changed with
the “select” command).
Some commands accept a boolean argument, such as to turn a flag on or
off. In all such cases, a “1” or “on” means to
turn the flag on, while a “0” or “off” is used
to turn it off. Some flags are per-file (“fuzz”, “fold”,
etc.), and a command to set such a flag normally sets the flag for the
selected file only. However, the default value inherited by subsequently
loaded files can be set by prepending “default” to
the command. This is particularly useful in the startup file before any files
are loaded (see the section STARTUP FILE).
Items separated by ‘|’ are mutually exclusive
possibilities (i.e. a boolean argument is “1|on|0|off”).
Items shown in brackets (‘[’ and ‘]’) are
optional. All commands that accept a boolean argument to set a flag or mode do
so optionally -- with no argument the command will report the current status
of the mode or flag.
Any command that allows an argument in quotes (such as load, etc.) allows the use
of single or double quotes.
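The boolean-argument convention (set with an argument, report without one, “default” for the inherited value) can be sketched in Python as follows. The names `parse_boolean` and `flag_command` and the dictionary-based storage are hypothetical and only illustrate the convention described above.

```python
def parse_boolean(arg):
    """Map "1"/"on" to True and "0"/"off" to False; None means no argument."""
    if arg is None:
        return None
    if arg in ("1", "on"):
        return True
    if arg in ("0", "off"):
        return False
    raise ValueError(f"expected 1|on|0|off, got {arg!r}")


def flag_command(flags, defaults, name, arg=None, default=False):
    """Set or report a flag; default=True targets the inherited default."""
    table = defaults if default else flags
    value = parse_boolean(arg)
    if value is not None:
        table[name] = value
    return table[name]  # with no argument this is just a status report
```

Here `flags` stands for the selected file's flags and `defaults` for the values that subsequently loaded files would inherit.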
The commands:
- [default] autokana [boolean]
- Automatic romaji → kana conversion for the selected
file is turned on or off (default is on). However,
if “default” is specified, the value to be
inherited as the default by subsequently-loaded files is set (or
reported).
Can be temporarily disabled by a
prepended ‘=’, as described in the INPUT
SYNTAX section.
- clear|cls
- Attempts to clear the screen. If you're using a kterm it'll just output the
appropriate tty control sequence. Otherwise it'll try to run
the “clear” command.
- cmdchar ['one-byte-char']
- The default command-introduction character is a space, but it may be
changed via this command. The single quotes surrounding the character are
required. If no argument is given, the current value is printed.
An input line consisting of a single question mark will also print the
current value (useful for when you don't know the current value).
Woe to the one that sets the command-introduction character to one of the
other special input-line characters, such as ‘+’, ‘/’,
etc.
- combine ["name"] [ num += ] slotnum ...
- Creates or adds file slots to a combination slot (see the COMBINATION SLOTS
section for general information). Note
that “combo” may be used as the command as well.
Assuming for this example that slots 0-2 are loaded with the files
curly, moe, and larry, we can create a combination
slot that will reference all three:
combo "three stooges" 2, 0, 1
The command will report
creating combo slot #3 (three stooges): 2 0 1
The name is optional, and will appear in the files list, and
may also be used to specify the slot as an argument to the select
command.
A search via the newly created combo slot would search in the order
specified on the combo command line: first larry, then
curly, and finally moe.
If you later load another file (say, jeffrey to slot #4), you can
then add it to the previously made combo:
combo 3 += 4
(the “+=” wording comes from the C programming
language where it means “add on to”). Adding
to a combination always adds slots to the end of the list.
You can take the opportunity of adding the slot to also change the name, if
you like:
combo "four stooges" 3 += 4
The reply would be
adding to combo slot #3(four stooges): 4
A file slot can be a component of any particular combo slot only once. When
reporting the created or added slot numbers, the number will appear in
parentheses if it had already been a member of the list.
Furthermore, only file slots can be component members of combo
slots. Attempting to combine combo slot X to combo slot Y
will result in having X's component file slots (rather than the
combo slot itself) added to Y.
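The membership rules for combo slots (search order follows the command line, additions go to the end, no duplicates, combos expand to their file slots) can be sketched in Python. The data layout, a dictionary from combo slot number to a list of component file slots, and the function names are assumptions made for the illustration.

```python
def expand(slotnum, combos):
    """A combo slot stands for its components; a file slot for itself."""
    return list(combos[slotnum]) if slotnum in combos else [slotnum]


def combine(combos, members, target):
    """Create combo slot `target` or append `members` to it.

    Returns the file slots actually added, in search order.
    """
    components = combos.setdefault(target, [])
    added = []
    for slotnum in members:
        for file_slot in expand(slotnum, combos):
            if file_slot not in components:  # a file slot appears only once
                components.append(file_slot)
                added.append(file_slot)
    return added


combos = {}
created = combine(combos, [2, 0, 1], target=3)  # combo "three stooges" 2, 0, 1
appended = combine(combos, [4], target=3)       # combo 3 += 4
repeated = combine(combos, [0], target=3)       # already a member
nested = combine(combos, [3, 6], target=5)      # combo slot 3 is expanded
```

The resulting list for slot 3 gives the order in which its component files would be searched.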
- command debug [boolean]
-
Sets the internal command parser debugging flag on or off (default is
off).
- debug [boolean]
-
Sets the internal general-debugging flag on or off (default is off).
- describe specifier
- This command will tell you how a character (or each character in a string)
is encoded in the various encoding methods:
lookup command> describe "気"
“気” as EUC is 0xb5a4 (181 164; 265 \244)
as JIS is 0x3524 ( 53 36; 65 \044 "5$")
as KUTEN is 2104 ( 0x1504; 25 \004)
as S-JIS is 0x8b43 (139 67; 213 \103)
The quotes surrounding the character or string to describe are optional. You
can also give a regular ASCII character and have the double-width version
of the character described....
indicating “A”, for example, would
describe “Ａ”. Specifier
can also be a four-digit kuten value, in which case the character with
that kuten will be described.
If a four-digit specifier has a hex digit in it, or if it is preceded
by “0x”, the value is taken as a JIS code.
You can precede the value
by “jis”, “sjis”, “euc”,
or “kuten” to force interpretation to the
requested code.
Finally, specifier can be a string of stripped JIS (JIS w/o the
kanji-in and kanji-out codes, or with the codes but without the escape
characters in them). For example “F|K\” would
describe the two characters 日 and 本.
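The relationship between the four code values can be reproduced with Python's standard codecs. This is a sketch of the arithmetic, not of the describe command itself: `describe` and `from_stripped_jis` are hypothetical helper names, and only two-byte JIS X 0208 characters are handled.

```python
def describe(ch):
    """Return the EUC, JIS, kuten and Shift-JIS codes of one character."""
    euc = ch.encode("euc_jp")
    if len(euc) != 2:
        raise ValueError("expected a two-byte JIS X 0208 character")
    jis = bytes(b & 0x7F for b in euc)           # EUC is JIS with the high bit set
    kuten = (jis[0] - 0x20, jis[1] - 0x20)       # row and cell, counted from 1
    sjis = ch.encode("shift_jis")
    return {
        "euc": euc.hex(),
        "jis": jis.hex(),
        "kuten": "%02d%02d" % kuten,
        "sjis": sjis.hex(),
    }


def from_stripped_jis(text):
    """Decode stripped JIS such as "F|K\\" by setting the high bit of each byte."""
    return bytes(b | 0x80 for b in text.encode("ascii")).decode("euc_jp")
```

Setting the high bit of each byte of a stripped JIS string gives its EUC form, which is why the four ASCII characters in the example above name two kanji.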
- encoding [euc|sjis|jis]
- The same as the -euc, -jis, and -sjis command-line options, sets the
encoding method for interactive input and output (or reports the current
status). More detail over the output encoding can be achieved with the
output encoding command. A separate encoding for input can be set
with the input encoding command.
- files [ - | long ]
- Lists what files are loaded in what slots, and some status information about
them, as with:
┃*0┃F wcfh d│a I ┃ 3749k┃/usr/jeff/lib/edict
┃ 1┃FM cf  d│a I ┃  754k┃/usr/jeff/lib/kanjidic
┏━┳━━━━━┯━━┳━━━┳━━━━━━━━━━━━━━
┃ 0┃F wcf h d │a I ┃ 2762k┃/usr/jfriedl/lib/edict
┃ 1┃FM cf   d │a I ┃  705k┃/usr/jfriedl/lib/kanjidic
┃ 2┃F  cfWh@d │a   ┃    1k┃/usr/jfriedl/lib/local.words
┃*3┃FM cf htd │a   ┃ combo┃kotoba (#2, #0)
┃ 4┃   cf   d │a   ┃  205k┃/usr/dict/words
┗━┻━━━━━┷━━┻━━━┻━━━━━━━━━━━━━━
The first section is the slot number, with
a¡È*¡Ébeside the default slot (as set
by the select command).
The second section shows per-slot flags and status. Letters are shown if the
flag is on, omitted if off. In the list below, related commands are given
for each item:
F … if there is a filter {but '#' if disabled}. (filter)
M … if there is a modify spec {but '%' if disabled}. (modify)
w … if word-preference mode is turned on. (word)
c … if case folding is turned on. (fold)
f … if fuzzification is turned on. (fuzz)
W … if wildcard-pattern mode is turned on (wildcard)
h … if highlighting is turned on. (highlight)
t … if there is a tag {but @ if disabled} (tag)
d … if found lines should be displayed (display)
─────────────────────────────────
a … if autokana is turned on (autokana)
P … if there is a file-specific local prompt (prompt)
I … if the file is loaded with a precomputed index (load)
d … if the display flag is on (display)
Note that the letters in the upper section directly correspond to
the “!!” sequence characters described in the
INPUT SYNTAX section.
If there is a digit at the end of the flag section, it indicates that only
#/10 of the file is actually loaded into memory (as opposed to the file
having been completely loaded). Unloaded files will be loaded while
lookup is idle, or when first used.
If the slot is a combination slot (as slot #3 is in the example above), that
is noted in the third section, and the combination name and component slot
numbers are noted in the fourth. Also, for combination slots (which have
no filter or modify specifications, only the flags),
F and/or M are shown if the corresponding mode is allowed
during searches via the combo slot. See the tag command for info
about t with respect to combination slots.
If an argument
(either “-” or “long” will
work) is given to the command, a short message about what the flags mean
is also printed.
- filter ["label"] [!] /regex/[i]
-
Sets the filter for the selected slot (which must contain a file and
not a combination). If a filter is set and active for a file, any line
matching the given regex is filtered from the output (if
the ‘!’ is put before the regex, any
line not matching the regex is filtered). The label, which
isn't required, merely acts as documentation in various diagnostics.
As an example, consider that edict lines often
have “(pn)” on them to indicate that the given
English is a place name. Often these place names can be a bother, so it
would be nice to elide them from the output unless specifically requested.
Consider the example:
lookup command> filter "name" /(pn)/
search [edict]> [きの]
機能 [きのう] /function/faculty/
帰納 [きのう] /inductive/
昨日 [きのう] /yesterday/
≪3 "name" lines filtered≫
In the example, ‘/’ characters are used to
delimit the start and stop of the regex (as is common with many programs).
However, any character can be used. A
final ‘i’, if present, indicates that the
regex should be applied in a case-insensitive manner.
The filter, once set, can be enabled or disabled with the other form of
the “filter” command (described below). It can
also be temporarily turned off (or, if disabled, temporarily turned on) by
the “!F!” line prefix.
Filtered lines can optionally be saved and then displayed if you so desire.
See the “saved list
size” and “show” commands.
Note that if you have saving enabled and only one line would be filtered, it
is simply printed at the end (rather than print a one line message about
how one line was filtered).
By the way, a better “name” filter for
edict would be:
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
as it would filter all entries that had only one English section, that
section being a name. It is also an example of using something other
than ‘/’ to delimit a regex, as it makes
things a bit easier to read.
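The filtering rule can be sketched in Python as follows. This is only an approximation of the behavior described above, not lookup's implementation: the function name and the sample entries are made up, and because Python's re module has no ‘<’ and ‘>’ word anchors, \b is used in their place.

```python
import re

def apply_filter(lines, pattern, invert=False, ignore_case=False):
    """Split lines into (shown, filtered).

    A line matching the regex is filtered; with invert (the '!' form),
    a line NOT matching it is filtered instead.
    """
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    shown, filtered = [], []
    for line in lines:
        hit = rx.search(line) is not None
        (filtered if hit != invert else shown).append(line)
    return shown, filtered

# The "better" name filter, with \b standing in for lookup's < and >.
NAME_ONLY = r"^[^/]+/[^/]*\bp[ln]\b[^/]*/$"

entries = [
    "機能 [きのう] /function/faculty/",
    "紀伊 [きい] /Kii (pn)/",              # made-up sample entry
    "京 [きょう] /capital/Kyoto (pn)/",     # made-up sample entry
]
shown, filtered = apply_filter(entries, NAME_ONLY)
for line in shown:
    print(line)
print(f'≪{len(filtered)} "name" lines filtered≫')
```

Only the entry whose single English section is a name is filtered; the entry that has a name alongside another gloss is still shown.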
- filter [boolean]
-
Enables or disables the filter for the selected slot. If no argument
is given, displays the current filter and status.
- [default] fold [boolean]
-
The selected slot's case folding is turned on or off (default is on),
or reported if no argument given. However,
if “default” is specified, the value to be
inherited as the default by subsequently-loaded files is set (or
reported).
Can be temporarily toggled by the “!c!” line
prefix.
- [default] fuzz [boolean]
-
The selected slot's fuzzification is turned on or off (default is
on), or reported if no argument given. However,
if “default” is specified, the value to be
inherited as the default by subsequently-loaded files is set (or
reported).
Can be temporarily toggled by the “!f!” line
prefix.
- help [regex]
-
Without an argument gives a short help list. With an argument, lists only
commands whose help string is picked up by the given regex.
- [default] highlight [boolean]
-
Sets matched-string highlighting on or off for the selected slot
(default off), or reports the current status if no argument is given.
However, if “default” is specified, the value
to be inherited as the default by subsequently-loaded files is set (or
reported).
If on, shows in bold or reverse video (see below) that part of the line
which was matched by the search regex. If multiple regexes were
given, that part matched by the first regex is shown.
Note that a regex might match a portion of a line which is later removed by
a modify parameter. In this case, no highlighting is done.
Can be temporarily toggled by the “!h!” line
prefix.
- highlight style [bold | inverse | standout |
<___>]
-
Sets the style of highlighting for when highlighting is done. Inverse
(inverse video) and standout are the same. The default is
bold. You can also give an HTML tag, such
as “<BOLD>”, and items will be wrapped by
<BOLD>...</BOLD>. This would be particularly useful when the
output is going to a CGI, as when lookup has been built in a server
configuration.
Note that the highlighting is effected by using raw VT100/xterm control
sequences. This isn't very nice if your terminal doesn't
understand them. Sorry.
- if {expression} command...
-
If the evaluated expression is non-zero, the command will be
executed.
Note that {} rather than () surround the expression.
Expression may be comprised of numbers, operators, parentheses, etc.
In addition to the normal +, -, *, and /, are:
x && y …
! x … ‘not’ Yields 1 if x is zero, 0 if non-zero.
x & y … ‘and’ Yields 1 if both x and y are non-zero, 0 otherwise.
x | y … ‘or’ Yields 1 if x or y (or both) is non-zero, 0 otherwise.
There may also be the special tokens true and false which are
1 and 0 respectively.
There are also checked, matched, printed,
nonword, and filtered which correspond to the values printed
by the stats command.
An example use might be the following kind of thing in a computer-generated
script:
!d!expect this line
if {!printed} msg Oops! couldn't find "expect this line"
- input encoding [ euc | sjis ]
- Used to set (or report) what encoding to use when 8-bit bytes are found in
the interactive input (all flavors of JIS are always recognized). Also see
the encoding and output encoding commands.
- limit [value]
-
Sets the number of lines to print during any search before aborting (or
reports the current number if no value given). Default is 100.
Output limiting is disabled if set to zero.
- log [ to [+] file ]
-
Begins logging the program output to file (the Japanese encoding
method being the same as for screen output).
If “+” is given, the log is appended to any
text that might have previously been in file, in which case a
leading dashed line is inserted into the file.
If no arguments are given, reports the current logging status.
- log - | off
- If only “-” or off is given, any
currently-opened log file is closed.
- load [-now|-whenneeded] "filename"
-
Loads the named file to the next available slot. If a precomputed index is
found (as “filename.jin”) it is loaded
as well. Otherwise, an index is generated internally.
The file to be loaded (and the index, if loaded) will be loaded during idle
times. This allows a startup file to list many files to be loaded, but not
have to wait for each of them to load in turn. Using the
“-now” flag causes the load to happen
immediately, while using the
“-whenneeded” option (can be shortened to
“-wn”) causes the load to happen only when
the slot is first accessed.
Invoke lookup as
% lookup -writeindex filename
to generate and write an index file, which will then be automatically used
in the future.
If the file has already been loaded, the file is not re-read, but the
previously-read file is shared. The new slot will, however, have its own
separate flags, prompt, filter, etc.
- modify /regex/replace/[ig]
-
Sets the modify parameter for the selected file. If a file has
a modify parameter associated with it, each line selected during a search
will have that part of the line which matches regex (if any)
replaced by the replacement string before being printed.
Like the filter command, the delimiter need not
be ‘/’; any non-space character is fine. If a
final ‘i’ is given, the regex is applied in a
case-insensitive manner. If a final ‘g’ is
given, the replacement is done to all matches in the line, not just the
first part that might match regex.
The replacement may have embedded “\1”,
etc. in it to refer to parts of the matched text (see the tutorial on
regular expressions).
The modify parameter, once set, may be enabled or disabled with the other
form of the modify command (described below). It may also be temporarily
toggled via the “!m!” line prefix.
A silly example for the ultra-nationalist might be:
modify /<Japan>/Dainippon Teikoku/g
So that a line such as
日銀 [にちぎん] /Bank of Japan/
would come out as
日銀 [にちぎん] /Bank of Dainippon Teikoku/
As a real example of the modify command with kanjidic, consider that
it is likely that one is not interested in all the various fields each
entry has. The following can be used to remove the info on the U, N, Q, M,
E, B, C, and Y fields from the output:
modify /( [UNQMECBY]\S+)+//g,1
It's sort of complex, but works. Note that here the replacement part
is empty, meaning to just remove those parts which matched. The result of
such a search of 日 would normally print
日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 ＼
MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}
but with the above modify spec, appears more simply as
日 467c S4 G1 H3027 F1 P3-3-1 ニチ ジツ ひ -び -か {day}
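The effect of the two modify examples can be reproduced with Python's re.sub. This is a sketch under assumptions rather than lookup's own code: the function name is made up, Python's regex flavor stands in for lookup's, and \b replaces the ‘<’ and ‘>’ word anchors.

```python
import re

def modify(line, pattern, replacement, global_=False, ignore_case=False):
    """Rewrite a line before printing; count=0 means 'all matches' (the g flag)."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(pattern, replacement, line,
                  count=0 if global_ else 1, flags=flags)

kanjidic_line = ("日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 "
                 "MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}")
# Empty replacement: the matched U, N, Q, M, E, C, B, and Y fields are removed.
print(modify(kanjidic_line, r"( [UNQMECBY]\S+)+", "", global_=True))

# \b stands in for lookup's '<' and '>' word anchors.
print(modify("日銀 [にちぎん] /Bank of Japan/", r"\bJapan\b",
             "Dainippon Teikoku", global_=True))
```

Without the g flag only the first run of unwanted fields would be removed, which is why the kanjidic example needs it.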
- modify [boolean]
-
Enables or disables the modify parameter for the selected file, or
reports the current status if no argument is given.
- msg string
- The given string is printed.
Most likely used in a script as the target command of an if
command.
- output encoding [ euc | sjis | jis...]
- Used to set exactly what kind of encoding should be used for program
output (also see the input encoding command). Used when the
encoding command is not detailed enough for one's needs.
If no argument is given, reports the current output encoding. Otherwise,
arguments can usually be any reasonable dash-separated combination
of:
- euc
- Selects EUC for the output encoding.
- sjis
- Selects Shift-JIS for the output encoding.
- jis[78|83|90][-ascii|-roman]
- Selects JIS for the output encoding. If no year (78, 83, or 90) given, 78
is used. Can optionally specify
that “English” should be encoded as regular
ASCII (the default when JIS selected) or as JIS-ROMAN.
- 212
- Indicates that JIS X0212-1990 should be supported (ignored for Shift-JIS
output).
- no212
- Indicates that JIS X0212-1990 should not be supported (default
setting). This places JIS X0212-1990 characters under the domain of
disp, nodisp, code, or mark (described
below).
- hwk
- Indicates that half width kana should be left as-is
(default setting).
- nohwk
- Indicates that half width kana should be stripped
from the output. (not yet implemented).
- foldhwk
- Indicates that half width kana should be folded to
their full-width counterparts. (not yet implemented).
- disp
- Indicates that non-displayable characters (such as JIS X0212-1990
while the output encoding method is Shift-JIS) should be passed along
anyway (most likely resulting in screen garbage).
- nodisp
- Indicates that non-displayable characters should be quietly
stripped from the output.
- code
- Indicates that non-displayable characters should be printed as
their octal codes (default setting).
- mark
- Indicates that non-displayable characters should be printed
as “★”.
Of course, not all options make sense in all combinations, or at all times. When
the current (or new) output encoding is reported, a complete and exact
specifier representing the output encoding selected is shown. An example might
be “jis78-ascii-no212-hwk-code”.
- pager [ boolean | size ]
- Turns on or off an output pager, sets its idea of the screen size, or
reports the current status.
Size can be a single number indicating the number of lines to be
printed between “MORE?” prompts (usually a few
lines less than the total screen height, the default being 20 lines). It
can also be two numbers in the
form “#x#” where the first number is the width
(in half-width characters; default 80) and the second is the
lines-per-page as above.
If the pager is on, every page of output will result in
a “MORE?” prompt, at which there are four
possible responses. A space will allow one more full page to print. A
return will allow one more line.
A ‘c’ (for “continue”)
will allow the rest of the output (for the current command) to proceed
without pause, while
a ‘q’ (for “quit”)
will flush the output for the current command.
If supported by the OS, the pager size parameters are set appropriately from
the window size upon startup or window resize.
The default pager status is “off”.
- [local] prompt "string"
- Sets the prompt string. If “local” is
indicated, sets the prompt string for the selected slot only.
Otherwise, sets the global default prompt string.
Prompt strings may have the special %-sequences shown below, with related
commands given in parentheses:
%N … the default slot's file or combo name.
%n … like %N, but any leading path is not shown if a filename.
%# … the default slot's number.
%S … the “command-introduction” character (cmdchar)
%0 … the running program's name
%F='string' … string shown if filtering enabled (filter)
%M='string' … string shown if modification enabled (modify)
%w='string' … string shown if word mode on (word)
%c='string' … string shown if case folding on (fold)
%f='string' … string shown if fuzzification on (fuzz).
%W='string' … string shown if wildcard-pat. mode on (wildcard).
%d='string' … string shown if displaying on (display).
%C='string' … string shown if currently entering a command.
%l='string' … string shown if logging is on (log).
%L … the name of the current output log, if any (log)
For the tests (%f, etc), you can put ‘!’ just
after the ‘%’ to reverse the sense of the test
(i.e. %!f="no fuzz"). The reverse of %F is if a filter is
installed but disabled (i.e. string will never be shown if there is
no filter for the default file). The modify %M works comparably.
Also, you can use an alternative form for the items that take an argument
string. Replacing the quotes with parentheses will treat string as
a recursive prompt specifier. For example, the specifier
%C='command'%!C(%f='fuzzy 'search:)
would result in a “command” prompt if entering a
command, while it would result in either a “fuzzy
search:” or a “search:” prompt
if not entering a command. The parenthesized constructs may be nested.
Note that the letters of the test constructs are the same as the letters for
the “!!” sequences described in INPUT SYNTAX.
An example of a nice prompt command might be:
prompt "%C(%0 command)%!C(%w'*'%!f'raw '%n)> "
With this prompt specification, the prompt would normally appear
as “filename> ” but when
fuzzification is turned off as “raw
filename> ”. And if word-preference mode
is on, the whole thing has a “*” prepended.
However if a command is being entered, the prompt would then
become “name command”, where
name was the program's name (system dependent, but most
likely “lookup”).
The default prompt format string is “%C(%0 command)%!C(search
[%n])> ”.
- regex debug [boolean]
-
Sets the internal regex debugging flag (turn on if you want billions of
lines of stuff spewed to your screen).
- saved list size [value]
-
During a search, lines that match might be elided from the output due to
filters or word-preference mode. This command sets the number of such
lines to remember during any one search, such that they may be later
displayed (before the next search) by the show command.
The default is 100.
- select [ num | name | . ]
-
If num is given, sets the default slot to that slot number. If
name is given, sets the default slot to the first slot found
with a file (or combination) loaded with that name. The
incantation “select .” merely sets the default
slot to itself, which can be useful in script files where you want to
indicate that any subsequent flags changes should work with whatever file
was the default at the time the script was sourced.
If no argument is given, simply reports the current default slot
(also see the files command).
In command files loaded via the source command, or as the startup
file, commands dealing with per-slot items (flags, local prompt, filters,
etc.) work with the file or slot last selected. The last such
selected slot remains selected once the load is complete.
Interactively, the default slot will become the selected slot for
subsequent searches and commands that aren't augmented with an
appended “,#” (as described in the INPUT
SYNTAX section).
- show
-
Shows any lines elided from the previous search (either due to a
filter or word-preference mode).
Will apply any modifications (see
the “modify” command) if modifications are
enabled for the file. You can use
the “!m!” line prefix as well with this
command (in this case, put
the “!m!” before the command-indicator
character).
The length of the list is controlled by the “saved list
size” command.
- source "filename"
-
Commands are read from filename and executed.
In the file, all lines beginning with “#” are
ignored as comments (note that comments must appear on a line by
themselves, as “#” is a reasonable character
to have within commands).
Lines whose first non-blank character
is “=”, “!”, or “+” are
considered searches, while all other non-blank lines are considered
lookup commands. Therefore, there is no need for lines to begin
with the command-introduction character. However, leading whitespace is
always OK.
For search lines, take care that any trailing whitespace is deleted if
undesired, as trailing whitespace (like all non-leading whitespace) is
kept as part of the regular expression.
Within a command file, commands that modify per-file flags and such always
work with the most-recently loaded (or selected) file. Therefore,
something along the lines of
load "my.word.list"
set word on
load "my.kanji.list"
set word off
set local prompt "enter kanji> "
would work as might make intuitive sense.
Since a script file must have a load or select before any
per-slot flag is set, one can use “select
.” to facilitate command scripts that are to work
with “the current slot”.
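The way lines of a command file are sorted into comments, searches, and commands can be sketched as below. The function name is hypothetical, and one detail is an assumption: the text only says comment lines begin with “#”, so recognizing “#” after leading whitespace is a guess based on leading whitespace always being allowed.

```python
def classify_line(line: str) -> str:
    """Return 'blank', 'comment', 'search', or 'command' for one script line."""
    text = line.lstrip()            # leading whitespace is always OK
    if not text:
        return "blank"
    if text.startswith("#"):        # assumption: also after leading whitespace
        return "comment"
    if text[0] in "=!+":            # these first characters mark a search line
        return "search"
    return "command"                # no command-introduction character needed

script = [
    "# load my word list",
    'load "my.word.list"',
    "  word on",
    "!d!expect this line",
    "",
]
for raw in script:
    print(f"{classify_line(raw):8} {raw!r}")
```

Note that a “#” later in a line, as in a filter command that uses it as a regex delimiter, does not start a comment.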
- spinner [value]
- Set the value of the spinner (A silly little feature). If set to a
non-zero value, will cause a spinner to spin while a file is being
checked, one increment per value lines in the file actually checked
against the search specifier. Default is off (i.e. zero).
- stats
- Shows information about how many lines of the text file were checked
against the last search specifier, and how many lines matched and were
printed.
- tag [boolean] ["string"]
- Enable, disable, or set the tag for the selected slot.
If the slot is not a combination slot, a tag string may be set (the
quotes are required).
If a tag string is set and enabled for a file, the string is prepended to
each matching output line printed.
Unlike the filter and modify commands which automatically
enable the function when a parameter is set, a tag is not
automatically enabled when set. It can be enabled while being set
via “tag on "string"”, or could be enabled subsequently
via just “tag on”. If the selected slot is a
combination slot, only the enable/disable status may be changed (on by
default). No tag string may be set.
The reason for the special treatment lies in the special nature of how tags
work in conjunction with combination files.
During a search when the selected slot is a combination slot, each file
which is a member of the combination has its per-file flags disabled if
their corresponding flag is disabled in the original combination slot.
This allows the combination slot's flags to act as
a “mask” to blot out each component file's
per-file flags.
The tag flag, however, is special in that the component file's tag flag is
turned on if the combination slot's tag flag is turned on (and, of
course, the component file has a tag string registered).
The intended use of this is that one might set a (disabled) tag to a file,
yet direct searches against that file will have no prepended tag.
However, if the file is searched as part of a combination slot (and the
combination slot's tag flag is on), the tag will be prepended,
allowing one to easily understand from which file an output line
comes.
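The difference between ordinary flags and the tag flag during a combination-slot search can be put into a few lines of Python. This is one reading of the description above, not lookup's code; the Slot class, its field names, and the flag names in the example are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    flags: dict             # e.g. {"fold": True, "fuzz": False}
    tag_on: bool = False
    tag_string: str = ""    # empty means no tag string is registered

def effective_settings(combo: Slot, member: Slot) -> dict:
    """Settings used for one member file while searching via a combo slot."""
    # Ordinary flags: the combo acts as a mask, so a flag stays on only if
    # it is on in both the member file and the combination slot.
    result = {name: on and combo.flags.get(name, False)
              for name, on in member.flags.items()}
    # The tag flag is special: it follows the combo's tag flag (provided the
    # member has a tag string), regardless of the member's own tag flag.
    result["tag"] = combo.tag_on and bool(member.tag_string)
    return result

combo = Slot(flags={"fold": True, "fuzz": False}, tag_on=True)
local_words = Slot(flags={"fold": True, "fuzz": True},
                   tag_on=False, tag_string="》")
print(effective_settings(combo, local_words))
```

Here the member's fuzz flag is masked off by the combo, while its disabled tag is switched on because the combo's tag flag is on.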
- verbose [boolean]
-
Sets verbose mode on or off, or reports the current status (default on).
Many commands reply with a confirmation if verbose mode is turned on.
- version
- Reports the current version of the program.
- [default] wildcard [boolean]
-
The selected slot's patterns are considered wildcard patterns if
turned on, regular expressions if turned off. The current status is
reported if no argument given. However,
if “default” is specified, the pattern-type to
be inherited as the default by subsequently-loaded files is set (or
reported).
Can be temporarily toggled by the “!W!” line
prefix.
When wildcard patterns are selected, the changed metacharacters
are: “*” means “any
stuff”, “?” means “any
one
character”, while “+” and “.” become
unspecial. Other regex items such
as “|”, “(”, “[”, etc.
are unchanged.
What “*” and “?” will
actually match depends upon the status of word-mode, as well as on the
pattern itself. If word-mode is on, or if the pattern begins with the
start-of-word “<” or “[”, only
non-spaces will be matched. Otherwise, any character will be matched.
In summary, when wildcard mode is on, the input pattern is affected in the
following ways:
* is changed to the regular expression .* or \S*
? is changed to the regular expression . or \S
+ is changed to the regular expression \+
. is changed to the regular expression \.
Because filename patterns are often called “filename
globs”, the command “glob” can
be used in place of “wildcard”.
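A rough Python sketch of the wildcard translation follows. It is not lookup's code: the function name and the nonspace_only parameter are invented, and it assumes the non-space variants of “*” and “?” are \S* and \S, which is inferred from the statement that only non-spaces are matched in word mode. Characters inside a character class and escaped characters are not treated specially here.

```python
import re

def glob_to_regex(pattern: str, nonspace_only: bool = False) -> str:
    """Translate the four changed metacharacters; leave everything else alone."""
    any_char = r"\S" if nonspace_only else "."
    table = {
        "*": any_char + "*",    # "any stuff"
        "?": any_char,          # "any one character"
        "+": r"\+",             # no longer special
        ".": r"\.",             # no longer special
    }
    return "".join(table.get(ch, ch) for ch in pattern)

print(glob_to_regex("gr?y"))
print(glob_to_regex("local.w*"))
print(glob_to_regex("a+b*", nonspace_only=True))
print(bool(re.fullmatch(glob_to_regex("local.w*"), "local.words")))
```

Items such as “|”, “(”, and “[” pass through untouched, matching the note above that they are unchanged.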
- [default] word|wordpreference [boolean]
- The selected file's word-preference mode is turned on or off (default is
off), or reports the current setting if no argument is specified. However,
if “default” is specified, the value to be
inherited as the default by subsequently-loaded files is set (or
reported).
In word-preference mode, entries are searched for as if the search
regex had a leading ‘<’ and a
trailing ‘>’, resulting in a list of
entries with a whole-word match of the regex. However, if there are none,
but there are non-word entries, the non-word entries are shown
(the “saved list” is used for this -- see that
command). This makes it an “if there are whole words like
this, show me, otherwise show me whatever you've got” mode.
If there are both word and non-word entries, the non-word entries are
remembered in the saved list (rather than any possible filtered entries
being remembered there).
One caveat: if a search matches a line in more than one place, and the first
is not a whole-word, while one of the others is, the line
will be considered non-whole-word. For example, the
search 「japan」 with word-preference mode on
will not list an entry such as “/Japanese/language in
Japan/”, as the
first “Japan” is part
of “Japanese” and not a whole word. If you
really need just whole-word entries, use
the ‘<’ and ‘>’ yourself.
The mode may be temporarily toggled via
the “!w!” line prefix.
The rules defining what lines are filtered, remembered, discarded, and shown
for each permutation of search are rather complex, but the end result is
rather intuitive.
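The core of word-preference mode, including the first-match caveat, can be sketched in Python. This is a simplified reading of the description, not lookup's implementation: the function name and sample entries are made up, filters are ignored, and a “word” boundary is approximated by checking whether the neighboring characters are alphanumeric.

```python
import re

def word_preference_search(lines, pattern, ignore_case=True):
    """Return (shown, saved): whole-word hits are preferred over the rest."""
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    word_hits, nonword_hits = [], []
    for line in lines:
        m = rx.search(line)             # only the FIRST match is examined
        if m is None:
            continue
        before = line[m.start() - 1] if m.start() > 0 else " "
        after = line[m.end()] if m.end() < len(line) else " "
        is_word = not (before.isalnum() or after.isalnum())
        (word_hits if is_word else nonword_hits).append(line)
    if word_hits:
        return word_hits, nonword_hits  # non-word lines go to the saved list
    return nonword_hits, []             # nothing better, so show these

entries = [                             # sample lines made up for illustration
    "日本 [にほん] /Japan/",
    "日本語 [にほんご] /Japanese/language in Japan/",
    "漆器 [しっき] /japanned ware/",
]
shown, saved = word_preference_search(entries, "japan")
print(shown)
print(saved)
```

The second entry lands in the saved list even though it contains a whole-word “Japan”, because its first match is inside “Japanese”.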
- quit | leave | bye | exit
-
Exits the program.
STARTUP FILE
If the file “~/.lookup” is present, commands are
read from it during
lookup startup.
The file is read in the same way as the
source command reads files (see
that entry for more information on file format, etc.)
However, if there had been files loaded via command-line arguments, commands
within the startup file to load files (and their associated commands such as
to set per-file flags) are ignored.
Similarly, any use of the command-line flags -euc, -jis, or -sjis will disable
in the startup file the commands dealing with setting the input and/or output
encodings.
The special treatment mentioned in the above two paragraphs only applies to
commands within the startup file itself, and does not apply to commands in
command-files that might be
sourced from within the startup file.
The following is a reasonable example of a startup file:
## turn verbose mode off during startup file processing
verbose off
prompt "%C([%#]%0)%!C(%w'*'%!f'raw '%n)> "
spinner 200
pager on
## The filter for edict will hit for entries that
## have only one English part, and that English part
## having a pl or pn designation.
load ~/lib/edict
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
highlight on
word on
## The filter for kanjidic will hit for entries without a
## frequency-of-use number. The modify spec will remove
## fields with the named initial code (U,N,Q,M,E, and Y)
load ~/lib/kanjidic
filter "uncommon" !/<F\d+>/
modify /( [UNQMEY]\S+)+//g
## Use the same filter for my local word file,
## but turn off by default.
load ~/lib/local.words
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
filter off
highlight on
word on
## Want a tag for my local words, but only when
## accessed via the combo below
tag off "》"
combine "words" 2 0
select words
## turn verbosity back on for interactive use.
verbose on
COMMAND-LINE ARGUMENTS
With the use of a startup file, command-line arguments are rarely needed. In
practical use, they are only needed to create an index file, as in:
lookup -write textfile
Any command line arguments that aren't flags are taken to be files which are
loaded in turn during startup. In this case,
any “load”, “filter”,
etc. commands in the startup file are ignored.
The following flags are supported:
- -help
- Reports a short help message and exits.
- -write
- Creates index files for the named files and exits.
No startup file is read.
- -euc
- Sets the input and output encoding method to EUC (currently the default).
Exactly the same as the “encoding
euc” command.
- -jis
- Sets the input and output encoding method to JIS. Exactly the same as
the “encoding jis” command.
- -sjis
- Sets the input and output encoding method to Shift-JIS. Exactly the same
as the “encoding sjis” command.
- -v -version
- Prints the version string and exits.
- -norc
-
Indicates that the startup file should not be read.
- -rc file
- The named file is used as the startup file, rather than the
default “~/.lookup”. It is an error for the
file not to exist.
- -percent num
-
When an index is built, letters that appear on more than num percent
(default 50) of the lines are elided from the index. The thought is that
if a search will have to check most of the lines in a file anyway, one may
as well save the large amount of space in the index file needed to
represent that information, and the time/space tradeoff shifts, as the
indexing of oft-occurring letters provides a diminishing return.
Smaller indexes can be made by using a smaller number.
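The idea behind eliding frequent letters can be illustrated with a toy index in Python. The actual index format is not described in this manual, so everything below (the function names, the per-character sets of line numbers) is an assumption made purely to show the tradeoff.

```python
from collections import defaultdict

def build_index(lines, percent=50):
    """Map each character to the line numbers containing it.

    Characters found on more than `percent` percent of the lines are left
    out of the index and reported separately as elided.
    """
    index = defaultdict(set)
    for number, line in enumerate(lines):
        for ch in set(line):
            index[ch].add(number)
    limit = len(lines) * percent / 100
    elided = {ch for ch, nums in index.items() if len(nums) > limit}
    kept = {ch: nums for ch, nums in index.items() if ch not in elided}
    return kept, elided

def candidate_lines(index, elided, word, total):
    """Line numbers that still need checking for a literal `word`."""
    candidates = set(range(total))
    for ch in set(word):
        if ch in elided:
            continue                    # too common to narrow the search
        candidates &= index.get(ch, set())
    return candidates

lines = ["apple pie", "banana", "cherry tart", "date"]
index, elided = build_index(lines, percent=50)
print(sorted(elided))
print(candidate_lines(index, elided, "tart", len(lines)))
```

A search made up only of elided letters has to check every line, which is the cost accepted in exchange for the smaller index.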
- -noindex
-
Indicates that any files loaded via the command line should not be loaded
with any precomputed index, but recalculated on the fly.
- -verbose
-
Has metric tons of stats spewed whenever an index is created.
- -port ###
- For the (undocumented) server configuration only, tells which port to
listen on.
OPERATING SYSTEM CONSIDERATIONS
I/O primitives and behaviors vary with the operating system. On my operating
system, I can “read” a file by mapping it into
memory, which is a pretty much instant procedure regardless of the size of the
file. When I later access that memory, the appropriate sections of the file
are automatically read into memory by the operating system as needed.
This results in
lookup starting up and presenting a prompt very quickly,
but causes the first few searches that need to check a lot of lines in the
file to go more slowly (as lots of the file will need to be read in). However,
once the bulk of the file is in, searches will go very fast. The win here is
that the rather long file-load times are amortized over the first few (or few
dozen, depending upon the situation) searches rather than always faced right
at command startup time.
On the other hand, on an operating system without the mapping ability,
lookup would start up very slowly as all the files and indexes are read
into memory, but would then search quickly from the beginning, all the file
already having been read.
To get around the slow startup, particularly when many files are loaded,
lookup uses
lazy loading if it can: a file is not actually read
into memory at the time the
load command is given. Rather, it will be
read when first actually accessed. Furthermore, files are loaded while
lookup is idle, such as when waiting for user input. See the
files command for more information.
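The memory-mapping approach described above can be tried out with Python's standard mmap module. This is only a demonstration of the general technique, not of how lookup itself is written; the function name and the throwaway demo file are made up.

```python
import mmap
import os
import tempfile

def count_matching_lines(path: str, needle: bytes) -> int:
    """Count lines containing `needle`, reading the file through a memory map."""
    with open(path, "rb") as f:
        # Creating the map is cheap; the OS pulls pages in only as they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = 0
            for line in iter(mapped.readline, b""):
                if needle in line:
                    count += 1
            return count

# Build a small throwaway file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"gray sky\ngrey sky\nblue sky\n")
    demo_path = tmp.name
try:
    print(count_matching_lines(demo_path, b"gr"))
finally:
    os.remove(demo_path)
```

With a large file, the first scan pays for reading the pages from disk, while later scans find them already in memory, which is the amortization described above.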
REGULAR EXPRESSIONS, A BRIEF TUTORIAL
Regular expressions (“regex” for short) are
a “code” used to indicate what kind of text you're
looking for. They're how one searches for things in the
editors “vi”, “stevie”, “mifes” etc.,
or with the grep commands. There are differences among the various regex
flavors in use -- I'll describe the flavor used by
lookup here. Also,
in order to be clear for the common case, I might tell a few lies, but nothing
too heinous.
The regex 「a」 means “any line with
an ‘a’ in it.” Simple enough.
The regex 「ab」 means “any line with
an ‘a’ immediately followed by
a ‘b’”. So the line
I am feeling flabby
would “match” the
regex 「ab」 because, indeed, there's
an “ab” on that line. But it wouldn't match the
line
this line has no a followed _immediately_ by a b
because, well, what the line says is true.
In most cases, letters and numbers in a regex just mean that you're looking for
those letters and numbers in the order given. However, there are some special
characters used within a regex.
A simple example would be a period. Rather than indicate that you're looking for
a period, it means “any character”. So the silly
regex 「.」 would mean “any line that
has any character on it.” Well, maybe not so silly... you can
use it to find non-blank lines.
But more commonly it's used as part of a larger regex. Consider the
regex 「gray」. It wouldn't match the line
The sky was grey and cloudy.
because of the different spelling (grey vs. gray). But the
regex 「gr.y」 asks for “any line with
a ‘g’, ‘r’, some
character, and then a ‘y’”. So
this would
get “grey” and “gray”.
A special construct somewhat similar to ‘.’ would
be the
character class. A character class starts with
a ‘[’ and ends with
a ‘]’, and will match any character given in
between. An example might be
gr[ea]y
which would match lines with
a ‘g’, ‘r’,
an ‘e’
or
an ‘a’, and then
a ‘y’. Inside a character class you can list as
many characters as you want to.
For example the simple regex 「x[0123456789]y」 would
match any line with a digit sandwiched between
an ‘x’ and a ‘y’.
The order of the characters within the character class doesn't really
matter... 「[5134672890]」 would be the same
as 「[0123456789]」.
But as a short cut, you could put 「[0-9]」 instead
of 「[0123456789]」. So the character
class 「[a-z]」 would match any lower-case letter,
while the character class 「[a-zA-Z0-9]」 would
match any letter or digit.
The character ‘-’ is special within a character
class, but only if it's not the first thing. Another character that's special
in a character class is ‘^’, if it
is the
first thing. It “inverts” the class so that it
will match any character
not listed. The
class 「[^a-zA-Z0-9]」 would match any line with
spaces or punctuation on it.
There are some special short-hand sequences for some common character classes.
The sequence 「\d」 means “digit”, and is the same as 「[0-9]」. 「\w」 means “word
element” and is the same as 「[0-9a-zA-Z_]」. 「\s」 means “space-type thing” and
is the same as 「[ \t]」 (「\t」 means tab). You can also use 「\D」, 「\W」, and
「\S」 to mean anything that is not a digit, word element, or space-type thing,
respectively.
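The short-hands and the inverted class can be checked against their spelled-out forms with a small Python sketch. One assumption to flag: Python's \d and \w also accept non-ASCII digits and letters unless the re.ASCII flag is given, so the flag is used here to stay close to the definitions in this manual. The sample strings are invented.

```python
import re

samples = ["room 101", "abc", "hello, world", "x_y"]

# \d should agree with the spelled-out class [0-9] on every sample.
digit_agrees = all(
    bool(re.search(r"\d", text, re.ASCII)) == bool(re.search(r"[0-9]", text))
    for text in samples
)

# \w should pick out exactly the same characters as [0-9a-zA-Z_].
word_agrees = all(
    re.findall(r"\w", text, re.ASCII) == re.findall(r"[0-9a-zA-Z_]", text)
    for text in samples
)

# Strings holding at least one character that is not a letter or digit.
non_alnum_hits = [t for t in samples if re.search(r"[^a-zA-Z0-9]", t)]

# Strings holding a non-word character; '_' counts as a word element.
non_word_hits = [t for t in samples if re.search(r"\W", t, re.ASCII)]

print(digit_agrees, word_agrees)
print(non_alnum_hits)
print(non_word_hits)
```

The difference between the last two lists is the underscore: it is not a letter or digit, but it is a word element.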
Another special character would be ‘?’. This means “maybe one of whatever was
just before it, but none is fine too”. In the regex 「bikes? for rent」, the
“whatever” would be the ‘s’, so this would match lines with either “bikes for
rent” or “bike for rent”.
Parentheses are also special, and can group things together. In the regex
big (fat harry)? deal
the “whatever” for the ‘?’ would be “fat harry”. But be careful to pay attention
to details... this regex would match
I don't see what the big fat harry deal is!
but not
I don't see what the big deal is!
That's because if you take away the “whatever” of the ‘?’, you end up with
big  deal
Notice that there are
two spaces between the words, and the regex didn't
allow for that. The regex to get either line above would be
big (fat harry )?deal
or
big( fat harry)? deal
Do you see how they're essentially the same?
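The space problem is easy to see by running all three variants against both lines. This is a sketch using Python's re module, which handles ‘?’ and parentheses the same way for these patterns; the variable names and the “bikess” sample are just for illustration.

```python
import re

with_harry = "I don't see what the big fat harry deal is!"
without_harry = "I don't see what the big deal is!"

# Leaves a space on each side of the optional group, so two spaces are
# required whenever the group is absent.
naive = re.compile(r"big (fat harry)? deal")
# Either fix folds one of those spaces into the optional group.
fixed_a = re.compile(r"big (fat harry )?deal")
fixed_b = re.compile(r"big( fat harry)? deal")

both = (with_harry, without_harry)
naive_results = [bool(naive.search(s)) for s in both]
fixed_a_results = [bool(fixed_a.search(s)) for s in both]
fixed_b_results = [bool(fixed_b.search(s)) for s in both]

# '?' applied to a single character rather than a group.
bikes = re.compile(r"bikes? for rent")
bike_results = [
    bool(bikes.search(s))
    for s in ("bikes for rent", "bike for rent", "bikess for rent")
]

print(naive_results, fixed_a_results, fixed_b_results, bike_results)
```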
Similar to ‘?’ is ‘*’, which means “any number, including none, of whatever's
right in front”. It more or less means that whatever is tagged with ‘*’ is
allowed, but not required, so something like
I (really )*hate peas
would match “I hate peas”, “I really hate peas!”, “I really really hate peas”,
etc.
Similar to both ‘?’ and ‘*’ is ‘+’, which means “at least one of whatever just
in front, but more is fine too”. The regex 「mis+pelling」 would match
“mispelling”, “misspelling”, “missspelling”, etc. Actually, it's just the same
as 「miss*pelling」 but simpler to type. The regex 「ss*」 means “an ‘s’, followed
by zero or more ‘s’”, while 「s+」 means “one or more ‘s’”. Both are really the
same.
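A quick way to convince yourself that 「mis+pelling」 and 「miss*pelling」 accept the same strings is to run both over a handful of spellings. This sketch uses Python's re module, where ‘*’ and ‘+’ behave as described here; the word list is invented.

```python
import re

# '*' allows any number of repetitions of the group, including none.
peas = re.compile(r"I (really )*hate peas")
peas_results = [
    bool(peas.search(s))
    for s in ("I hate peas", "I really hate peas!", "I really really hate peas")
]

# '+' requires at least one 's'; 'ss*' spells out the same requirement.
plus = re.compile(r"mis+pelling")
star = re.compile(r"miss*pelling")

words = ["mipelling", "mispelling", "misspelling", "missspelling"]
plus_hits = [w for w in words if plus.search(w)]
star_hits = [w for w in words if star.search(w)]

print(peas_results)
print(plus_hits == star_hits, plus_hits)
```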
The special character ‘|’ means “or”. Unlike ‘+’, ‘*’, and ‘?’, which act on the
thing immediately before, the ‘|’ is more “global”: it chooses between
everything on its left and everything on its right, up to the enclosing
parentheses. The regex
give me (this|that) one
would match lines that had “give me this one” or “give me that one” in them.
You can even combine more than two:
give me (this|that|the other) one
How about:
[Ii]t is a (nice |sunny |bright |clear )*day
Here, the “whatever” immediately before the ‘*’ is
(nice |sunny |bright |clear )
So this regex would match all the following lines:
It is a day.
I think it is a nice day.
It is a clear sunny day today.
If it is a clear sunny nice sunny sunny sunny bright day then....
Notice how the 「[Ii]t」 matches either “It” or “it”?
Note that the above regex would also match
fruit is a day
because it indeed fulfills all requirements of the regex, even though the “it”
is really part of the word “fruit”. To address concerns like this, which are
common, there are ‘<’ and ‘>’, which mean “word break”. The regex 「<it」 would
match any line with “it” beginning a word, while 「it>」 would match any line
with “it” ending a word. And, of course, 「<it>」 would match any line with the
word “it” in it.
Going back to the regex to find grey/gray, that would make more sense, then, as
<gr[ae]y>
which would match only the words “grey” and “gray”.
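The “fruit” problem and its cure can be reproduced in Python, with one important difference to keep in mind: Python's re module does not give ‘<’ and ‘>’ any special meaning, so this sketch substitutes \b (word boundary) for both. That substitution is an assumption made for the example, not lookup syntax advice.

```python
import re

# Without a word break, the "it" inside "fruit" is good enough.
day = re.compile(r"[Ii]t is a (nice |sunny |bright |clear )*day")
# \b here plays the role of lookup's '<'.
day_word = re.compile(r"\b[Ii]t is a (nice |sunny |bright |clear )*day")

lines = [
    "It is a day.",
    "I think it is a nice day.",
    "It is a clear sunny day today.",
    "If it is a clear sunny nice sunny sunny sunny bright day then....",
    "fruit is a day",
]

loose_hits = [line for line in lines if day.search(line)]
strict_hits = [line for line in lines if day_word.search(line)]

# \b on both sides plays the role of '<' and '>' around gr[ae]y.
grey_word = re.compile(r"\bgr[ae]y\b")
grey_hits = [w for w in ("grey", "gray", "greyhound", "gry") if grey_word.search(w)]

print(len(loose_hits), len(strict_hits), grey_hits)
```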
Somewhat similar are ‘^’ and ‘$’, which mean “beginning of line” and “end of
line”, respectively (but not in a character class, of course). So the regex
「^fun」 would find any line that begins with the letters “fun”, while 「^fun>」
would find any line that begins with the word “fun”. 「^fun$」 would find any
line that was exactly “fun”. Finally, 「^\s*fun\s*$」 would match any line that
was exactly “fun”, but perhaps also had leading and/or trailing whitespace.
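The four “fun” patterns can be compared side by side with a short Python sketch. As an assumption for this example, \b stands in for ‘>’ (which Python's re module does not treat specially), and the sample lines carry no trailing newline. The hits helper and the sample lines are made up.

```python
import re

lines = ["fun", "funny business", "fun and games", "  fun  ", "no fun"]


def hits(pattern):
    """Return the sample lines that the pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]


begins_with_letters = hits(r"^fun")      # line starts with the letters "fun"
begins_with_word = hits(r"^fun\b")       # line starts with the word "fun"
exactly_fun = hits(r"^fun$")             # line is "fun" and nothing else
padded_fun = hits(r"^\s*fun\s*$")        # "fun" with optional surrounding whitespace

print(begins_with_letters)
print(begins_with_word)
print(exactly_fun)
print(padded_fun)
```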
That's pretty much it. There are more complex things, some of which I'll mention
in the list below, but even with these few simple constructs one can specify
very detailed and complex patterns.
Let's summarize some of the special things in regular expressions:
Items that are basic units:
  char       Any non-special character matches itself.
  \char      Special chars, when preceded by \, become non-special.
  .          Matches any one character (except \n).
  \n         Newline.
  \t         Tab.
  \r         Carriage Return.
  \f         Formfeed.
  \d         Digit. Just a short-hand for [0-9].
  \w         Word element. Just a short-hand for [0-9a-zA-Z_].
  \s         Whitespace. Just a short-hand for [\t \n\r\f].
  \## \###   Two or three digit octal number indicating a single byte.
  [chars]    Matches a character if it's one of the characters listed.
  [^chars]   Matches a character if it's not one of the ones listed.
             The \char items above can be used within a character class,
             but not the items below.
  \D         Anything not \d.
  \W         Anything not \w.
  \S         Anything not \s.
  \a         Any ASCII character.
  \A         Any multibyte character.
  \k         Any (not half-width) katakana character (including ー).
  \K         Any character not \k (except \n).
  \h         Any hiragana character.
  \H         Any character not \h (except \n).
  (regex)    Parens make the regex one unit.
  (?:regex)  [from perl5] Grouping-only parens -- can't use for \# (below).
  \c         Any JISX0208 kanji (kuten rows 16-84).
  \C         Any character not \c (except \n).
  \#         Match whatever was matched by the #th paren from the left.
With “☆” to indicate one “unit” as above, the following may be used:
  ☆?         A ☆ allowed, but not required.
  ☆+         At least one ☆ required, but more ok.
  ☆*         Any number of ☆ ok, but none required.
There are also ways to match “situations”:
  \b         A word boundary.
  <          Same as \b.
  >          Same as \b.
  ^          Matches the beginning of the line.
  $          Matches the end of the line.
Finally, the “or” is
  reg1|reg2  Match if either reg1 or reg2 match.
Note that “\k” and the like aren't allowed in character classes, so something
such as 「[\k\h]」 to try to get all kana won't work. Use 「(\k|\h)」 instead.
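For readers experimenting outside of lookup, the same “any kana” idea can be sketched in Python. Python's re module has no \k or \h, so the Unicode ranges below are assumptions chosen to approximate them (hiragana small a through n, katakana small a through small ke, plus the long-vowel mark); they are not taken from lookup's source. The sample strings are written as escapes so the example does not depend on the file's encoding.

```python
import re

# Assumed stand-ins for lookup's \h and \k (not lookup's own definitions).
HIRAGANA = r"[\u3041-\u3093]"
KATAKANA = r"[\u30a1-\u30f6\u30fc]"  # \u30fc is the long-vowel mark

# The alternation form, mirroring (\k|\h).
kana_alt = re.compile("(" + KATAKANA + "|" + HIRAGANA + ")")
# In Python the ranges can also be merged into one class, unlike [\k\h] in lookup.
kana_class = re.compile(r"[\u3041-\u3093\u30a1-\u30f6\u30fc]")

samples = {
    "tokyo_kana": "\u3068\u3046\u304d\u3087\u3046",  # hiragana reading of Tokyo
    "tokyo_kanji": "\u6771\u4eac",                    # the kanji for Tokyo
    "coffee": "\u30b3\u30fc\u30d2\u30fc",             # katakana with long-vowel marks
    "ascii": "Tokyo",
}

alt_hits = sorted(name for name, text in samples.items() if kana_alt.search(text))
class_hits = sorted(name for name, text in samples.items() if kana_class.search(text))

print(alt_hits)
print(class_hits)
```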
BUGS¶
Needs full support for half-width katakana and JIS X 0212-1990.
Non-EUC (JIS & SJIS) items not tested well.
Probably won't work on non-UNIX systems.
Screen control codes (for clear and highlight commands) are hard-coded for
ANSI/VT100/kterm.
AUTHOR¶
Jeffrey Friedl (jfriedl@nff.ncl.omron.co.jp)
INFO¶
Jim Breen's text files
edict and
kanjidic and their documentation
can be found in “pub/nihongo” on
ftp.cc.monash.edu.au (130.194.1.106).
Information on input and output encoding and codes can be found in Ken Lunde's
Understanding Japanese Information Processing
(日本語情報処理)
published by O'Reilly and Associates. ISBN 1-56592-043-0. There is also a
Japanese edition published by SoftBank.
A program to convert files among the various encoding methods is Dr. Ken
Lunde's
jconv, which can also be found on ftp.cc.monash.edu.au.
Jconv is also useful for converting halfwidth katakana (which
lookup doesn't yet support well) to full-width.