table of contents
- NAME
- SYNOPSIS
- DESCRIPTION
- THIS MANUAL
- OVERVIEW OF MAJOR FEATURES
- REGULAR EXPRESSIONS
- JAPANESE CHARACTER ENCODING METHODS
- STARTUP
- INPUT
- BRIEF EXAMPLE
- JAPANESE INPUT
- READLINE INPUT
- ROMAJI FLAVOR
- INPUT SYNTAX
- FILENAME-LIKE WILDCARD MATCHING
- MULTIPLE-PATTERN SEARCHES
- COMBINATION SLOTS
- PAGER
- COMMANDS
- STARTUP FILE
- COMMAND-LINE ARGUMENTS
- OPERATING SYSTEM CONSIDERATIONS
- REGULAR EXPRESSIONS, A BRIEF TUTORIAL
- BUGS
- AUTHOR
- INFO
LOOKUP(1) | General Commands Manual | LOOKUP(1) |
April 22nd, 1994
Interactive input and output encoding, however, may be selected via the -jis,
-sjis, and -euc invocation flags (default is -euc), or by various commands to
the program (described later).
Make sure to use the encoding appropriate for your system. If you're using kterm
under the X Window System, you can use lookup's -jis flag to match
kterm's default JIS encoding. Or, you might use kterm's “-km euc” startup
option (or menu selection) to put kterm into EUC mode.
Also, I have found kterm's scrollbar (“-sb -sl 500”) to
be quite useful.
With many “English” fonts in Japan, the character that
normally prints as a backslash (halfwidth version of ＼) in The
States appears as a yen symbol (the half-width version of ￥). How
it will appear on your system is a function of what font you use and what
output encoding method you choose, which may be different from the font and
method that was used to print this manual (both of which may be different from
what's printed on your keyboard's appropriate key). Make sure to keep this in
mind while reading.
Lookup gains much of its search speed by constructing an index of the
file(s) to be searched. Since building the index can be time consuming itself,
you can have lookup write the built index to a file that can be quickly
loaded the next time you run the program. Index files will be given
a “.jin” (Jeffrey's Index) ending.
Let's build the indices for edict and kanjidic now:
You can now re-start lookup, automatically using the pre-computed index
files, as:
The input syntax may perhaps at first seem odd, but has been designed to be
powerful and concise. A bit of time invested to learn it well will pay off
greatly when you need it.
Given the input:
Let's try another:
Let's give a command to turn this option off, so
that ‘f’ and ‘F’ won't be considered
the same. Here's an odd point about lookup's input syntax: the default
setting is that all command lines must begin with a space. The space is the
(default) command-introduction character and tells the input parser to expect
a command rather than a search regular expression. It is a common mistake
at first to forget the leading space when issuing a command. Be careful.
Try the command “ fold” to report the current status of
case-folding. Notice that as soon as you type the space, the prompt changes to
You can actually turn it off with “ fold off”. Now try the
search for “fukushima” again. Notice that this time the
entries with “Fukushima” aren't listed? Now try the search
string “Fukushima” and see that the entries
with “fukushima” aren't listed.
Case folding is usually very convenient (it also makes corresponding katakana
and hiragana match the same), so don't forget to turn it back on:
Try the command “ fuzz” (again, don't forget the
command-introduction space). You'll see that fuzzification is turned on. Turn
it off with “ fuzz off” and
try “/tokyo” (which will convert as you type) again. This
time you only get the lines which
have “ときょ” exactly (well,
case folding is still on, so it might match katakana as well).
In a fuzzy search, length of vowels is ignored
-- “と” is considered the same
as “とう”, for example. Also, the
presence or absence of any “っ” character is
ignored, and the pairs じ ぢ, ず づ,
え ゑ, and お を are considered
identical in a fuzzy search.
It might be convenient to consider a fuzzy search to be
a “pronunciation search”.
Special note: fuzzification will not be performed if a regular
expression “*”, “+”, or “?” modifies
a non-ASCII character. This is not an issue when input patterns are
filename-like wildcard patterns (discussed below).
In addition to kana fuzziness, there's one special case for kanji when fuzziness
is on. The kanji repeater mark “々” will be
recognized such
that “時々” and “時時” will
match each other.
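A fuzzified pattern is still an ordinary regular expression, so its effect can be tried outside lookup. The sketch below takes the expanded pattern that this manual shows lookup printing for a fuzzified わかる (see the ‘+’ prefix under INPUT SYNTAX) and checks a few strings against it. Python's re module stands in for lookup's own regex engine here, which is an assumption, and the sample strings are invented for illustration.

```python
import re

# Expanded form reported for a fuzzified "わかる": vowel-lengthening
# characters are allowed after each kana, and a small っ is optional.
FUZZY_WAKARU = re.compile("わ[ぁあー]*っ?か[ぁあー]*る[ぅうおぉー]*")

def fuzzy_match(text):
    """Return True if the whole of text matches the fuzzified pattern."""
    return FUZZY_WAKARU.fullmatch(text) is not None

for candidate in ["わかる", "わっかる", "わかーる", "わける"]:
    print(candidate, fuzzy_match(candidate))
```

The first three candidates differ only in vowel length or in the presence of っ, which is what makes the search behave like a pronunciation search; the last differs in an actual kana and is rejected.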
Turn fuzzification back on (“fuzz on”), and search for all
whole words which sound like “tokyo”. That search
would be specified as:
Besides the kana conversion, you can use any cut-and-paste that your windowing
system might provide to get Japanese text onto the search line.
Cut “ときょ” from somewhere
and paste onto the search line. When hitting enter to run the search, you'll
notice that it is done without fuzzification (even if the fuzzification flag
was “on”). That's because there's no
leading ‘/’. Not only does a
leading ‘/’ indicate that you want the romaji-to-kana
conversion, but that you want it done fuzzily.
So, if you'd like fuzzy cut-and-paste, just type a
leading ‘/’ before pasting (or go back and prepend one
after pasting).
These examples have all been pretty simple, but you can use all the power that
regexes have to offer. As a slightly more complex example, the
search “<gr[ea]y>” would look for all lines with the
words “grey” or “gray” in them. Since
the ‘[’ isn't the first character of the line, it doesn't
mean what was mentioned above (start-of-word romaji). In this case, it's just
the regular-expression “class” indicator.
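To experiment with a whole-word pattern like this outside lookup, one option is Python's re module. Python has no ‘<’ and ‘>’ word markers, so the sketch below rewrites them to \b with a hypothetical helper, to_python_regex. The rewrite is deliberately naive (it would mangle a pattern containing a literal ‘<’ or ‘>’), and Python's notion of a word boundary may not agree with lookup's in every case.

```python
import re

def to_python_regex(lookup_regex):
    """Naively map lookup's '<' and '>' word markers to Python's \\b."""
    return lookup_regex.replace("<", r"\b").replace(">", r"\b")

# Case folding is on by default in lookup, hence IGNORECASE here.
pattern = re.compile(to_python_regex("<gr[ea]y>"), re.IGNORECASE)

def has_match(line):
    """Return True if the line contains 'grey' or 'gray' as a whole word."""
    return pattern.search(line) is not None

for line in ["a grey cat", "Gray skies", "greyhound racing"]:
    print(line, has_match(line))
```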
If you feel more comfortable using
filename-like “*.txt” wildcard patterns, you can use
the “wildcard on” command to have patterns be considered
this way.
This has been a quick introduction to the basics of lookup.
It can be very powerful and much more complex. Below is a detailed description
of its various parts and features.
In exactly what situations the automatic conversion will be done is intended to
be rather intuitive once the basic idea is learned. However, at any
time, one can use control-space to convert the ASCII to the left of the
cursor to kana. This can be particularly useful when needing to enter kana on
a command line (where auto conversion is never done; see below).
Long vowels can be entered by repeating the vowel, or
with ‘-’ or ‘^’.
In situations where an “n” could be vague, as
in “na” being な or んあ,
use a single quote to force ん.
Therefore, 「kenichi」→けにち
while 「ken'ichi」→けんいち.
The romaji has been richly extended with many non-standard combinations such as
ふぁ or ちぇ, which are represented in
intuitive
ways: 「fa」→ふぁ, 「che」→ちぇ,
etc.
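The effect of the single quote can be seen in a toy converter. The sketch below is not lookup's converter: it is a greedy longest-match lookup over a tiny hand-picked table (ROMAJI_TO_KANA and romaji_to_hiragana are names made up for this illustration) that covers only the examples in this section.

```python
# Tiny illustrative table; a real romaji table is far larger.
ROMAJI_TO_KANA = {
    "ke": "け", "ni": "に", "chi": "ち", "i": "い",
    "n'": "ん", "n": "ん",
    "fa": "ふぁ", "che": "ちぇ",
}

def romaji_to_hiragana(text):
    """Greedy longest-match conversion; unknown characters pass through."""
    out = []
    pos = 0
    while pos < len(text):
        for length in (3, 2, 1):
            chunk = text[pos:pos + length]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                pos += length
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)

for word in ["kenichi", "ken'ichi", "fa", "che"]:
    print(word, romaji_to_hiragana(word))
```

Because the longest match wins, the unquoted “ni” is taken as に, while “n'” consumes the quote and leaves the following “i” to stand alone.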
Various other mappings of interest:
Other lines are taken as search regular expressions, with the following special
cases:
When wildcard mode is on, only
“*”, “?”, “+”, and “.” are
affected. See the entry for the “wildcard” command below
for details.
Other features, such as the multiple-pattern searches (described below) and
other regular-expression metacharacters are available.
The above example is very different from the single pattern
“china|japan” which would select any line that had
either “china” or “japan”.
With “china||japan”, you get lines that
have “china” and then also
have “japan” as well.
Note that it is also different from the regular
expression “china.*japan” (or the wildcard
pattern “china*japan”) which would select lines
having “china, then maybe some stuff, then japan”. But
consider the case when “japan” comes on the line
before “china”.
Just for your comparison, the multiple-pattern
specifier “china||japan” is pretty much the same as the
single regular expression “china.*japan|japan.*china”.
If you use “|!|” instead of “||”, it
will mean “...and then lines not matching...”.
Consider a way to find all lines of kanjidic that do have a Halpern
number, but don't have a Nelson number:
Again, it is important to stress that “||” does not
mean “or” (as it does in a C program, or
as ‘|’ does within a regular expression). You might find
it convenient to read “||” as “and
also”, while reading “|!|” as “but
not”.
It is also important to stress that any whitespace around
the “||” and “|!|” construct is
not ignored, but kept as part of the regex on either side.
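The “and also” / “but not” reading can be modelled as a chain of filters. The following is a minimal sketch, assuming Python's re module in place of lookup's regex engine and a hypothetical helper name, multi_pattern_search. It splits the specifier on “||” and “|!|”, keeps surrounding whitespace as part of each pattern, and applies the patterns in order with case folding on.

```python
import re

def multi_pattern_search(lines, spec):
    """Filter lines through patterns joined by '||' (and also) or '|!|' (but not)."""
    # The capture group keeps the separators in the result list:
    # [pattern, separator, pattern, separator, pattern, ...]
    parts = re.split(r"(\|!\||\|\|)", spec)
    selected = [ln for ln in lines if re.search(parts[0], ln, re.IGNORECASE)]
    for separator, pattern in zip(parts[1::2], parts[2::2]):
        want_match = (separator == "||")
        selected = [
            ln for ln in selected
            if (re.search(pattern, ln, re.IGNORECASE) is not None) == want_match
        ]
    return selected

sample = ["China and Japan", "Japan then China", "china only", "japan only"]
print(multi_pattern_search(sample, "china||japan"))
print(multi_pattern_search(sample, "china|!|japan"))
print(multi_pattern_search(sample, "china|japan"))
print(multi_pattern_search(sample, "china.*japan"))
```

A single ‘|’ is left untouched by the split, so “china|japan” stays one ordinary regex meaning “or”, while “china.*japan” misses the line where “japan” comes first.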
A special kind of slot, called a “combination slot”, rather
than representing a single file, can represent multiple previously-loaded
slots. Searches against a combination slot (or “combo
slot” for short) search all those previously-loaded slots associated
with it (called “component slots”).
Combo slots are set up with the combine command.
A combo slot has no filter or modify spec, but can have a local prompt and flags
just like normal file slots. The flags, however, have special meanings with
combo slots. Most combo-slot flags act as a mask against the component-slot
flags; when acted upon as a member of the combo, a component-slot's flag will
be disabled if the corresponding combo-slot's flag is disabled.
Exceptions to this are the autokana, fuzz, and tag flags.
The autokana and fuzz flags govern a combo slot exactly the same
as a regular file slot. When a slot is searched as a component of a
combination slot, the component slot's fuzz (and autokana)
flags, or lack thereof, are ignored.
The tag flag is quite different altogether; see the tag command
for complete information.
Consider the following output from the files command:
As can be seen, slot #3 is a combination slot with the
name “kotoba” with component slots two and zero.
When a search is initiated on this slot, first slot
#2 “local.words” will be searched, then slot
#0 “edict”.
Because the combo slot's filter flag is on, the component slots'
filter flag will remain on during the search. The combo slot's
word flag is off, however, so slot #0's word flag will be
forced off during the search.
See the combine command for information about creating combo slots.
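The masking rule can be written down in a few lines. This is only a sketch of the rule as described in this section, not lookup's code: effective_flags and the dictionaries are hypothetical, the flag sets are a subset of those in the files listing, and the tag flag is left out because its behaviour is described separately under the tag command.

```python
OVERRIDE_FLAGS = {"autokana", "fuzz"}  # the combo's value replaces the component's
SKIPPED_FLAGS = {"tag"}                # handled differently; not modelled here

def effective_flags(component, combo):
    """Flags in force for a component slot while searched via a combo slot."""
    result = {}
    for name, value in component.items():
        if name in SKIPPED_FLAGS:
            continue
        if name in OVERRIDE_FLAGS:
            result[name] = combo.get(name, False)
        else:
            # Mask: a flag stays on only if both slots have it on.
            result[name] = value and combo.get(name, False)
    return result

# Slot #0 (edict) and combo slot #3 (kotoba), reduced to four flags.
edict = {"filter": True, "word": True, "fold": True, "fuzz": True}
kotoba = {"filter": True, "word": False, "fold": True, "fuzz": True}
print(effective_flags(edict, kotoba))
```

With these values the filter flag stays on and the word flag is forced off, matching the walk-through above.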
If supported by the OS, lookup's idea of the screen size is automatically
set upon startup and window resize. Lookup must know the width of the
screen both for the horizontal input-line scrolling and for knowing when
a long line wraps on the screen.
The pager parameters can be set manually with
the “pager” command.
There are a number of commands that work with the selected file or
selected slot (both meaning the same thing). The selected file is the
one indicated by an appended comma+digit, as mentioned above. If no such
indication is given, the default selected file is used (usually the
first file loaded, but can be changed with
the “select” command).
Some commands accept a boolean argument, such as to turn a flag on or
off. In all such cases,
a “1” or “on” means to turn the flag
on, while a “0” or “off” is used to
turn it off. Some flags are per-file
(“fuzz”, “fold”, etc.), and a
command to set such a flag normally sets the flag for the selected file only.
However, the default value inherited by subsequently loaded files can be set
by prepending “default” to the command. This is
particularly useful in the startup file before any files are loaded (see the
section STARTUP FILE).
Items separated by ‘|’ are mutually exclusive possibilities
(i.e. a boolean argument is “1|on|0|off”).
Items shown in brackets (‘[’ and ‘]’)
are optional. All commands that accept a boolean argument to set a flag or
mode do so optionally -- with no argument the command will report the current
status of the mode or flag.
Any command that allows an argument in quotes (such as load, etc.) allows the use
of single or double quotes.
The commands:
The file is read in the same way as the source command reads files (see
that entry for more information on file format, etc.).
However, if there had been files loaded via command-line arguments, commands
within the startup file to load files (and their associated commands such as
to set per-file flags) are ignored.
Similarly, any use of the command-line flags -euc, -jis, or -sjis will disable
in the startup file the commands dealing with setting the input and/or output
encodings.
The special treatment mentioned in the above two paragraphs only applies to
commands within the startup file itself, and does not apply to commands in
command-files that might be sourced from within the startup file.
The following is a reasonable example of a startup file:
The following flags are supported:
This results in lookup starting up and presenting a prompt very quickly,
but causes the first few searches that need to check a lot of lines in the
file to go more slowly (as lots of the file will need to be read in). However,
once the bulk of the file is in, searches will go very fast. The win here is
that the rather long file-load times are amortized over the first few (or few
dozen, depending upon the situation) searches rather than always faced right
at command startup time.
On the other hand, on an operating system without the mapping ability,
lookup would start up very slowly as all the files and indexes are read
into memory, but would then search quickly from the beginning, all the file
already having been read.
To get around the slow startup, particularly when many files are loaded,
lookup uses lazy loading if it can: a file is not actually read
into memory at the time the load command is given. Rather, it will be
read when first actually accessed. Furthermore, files are loaded while
lookup is idle, such as when waiting for user input. See the
files command for more information.
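The trade-off described here is the usual one for memory-mapped files: mapping is nearly instant, and the operating system reads pages from disk only when they are first touched. The sketch below shows the idea with Python's mmap module on a throwaway temporary file. It is an illustration of the general technique only and makes no claim about how lookup itself maps its files; find_in_mapped_file is a made-up helper.

```python
import mmap
import os
import tempfile

def find_in_mapped_file(path, needle):
    """Map the file read-only and return the byte offset of needle, or -1."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Pages are brought in by the OS only as find() touches them.
            return mapped.find(needle)

# Build a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"first line\ntranquil /calm/\nlast line\n")
    sample_path = tmp.name

offset = find_in_mapped_file(sample_path, b"tranquil")
os.remove(sample_path)
print(offset)
```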
The regex 「a」 means “any line with
an ‘a’ in it.”
Simple enough.
The regex 「ab」 means “any line with
an ‘a’ immediately followed by
a ‘b’”. So the line
In most cases, letters and numbers in a regex just mean that you're looking for
those letters and numbers in the order given. However, there are some special
characters used within a regex.
A simple example would be a period. Rather than indicate that you're looking for
a period, it means “any character”. So the silly
regex 「.」 would mean “any line that has any
character on it.” Well, maybe not so silly... you can use it to find
non-blank lines.
But more commonly it's used as part of a larger regex. Consider the
regex 「gray」. It wouldn't match the line
A special construct somewhat similar to ‘.’ would be the
character class. A character class starts with
a ‘[’ and ends with a ‘]’, and will
match any character given in between. An example might be
For example the simple regex 「x[0123456789]y」 would match
any line with a digit sandwiched between an ‘x’ and
a ‘y’.
The order of the characters within the character class doesn't really
matter... 「[5134672890]」 would be the same
as 「[0123456789]」.
But as a short cut, you could put 「[0-9]」 instead
of 「[0123456789]」. So the character
class 「[a-z]」 would match any lower-case letter, while the
character class 「[a-zA-Z0-9]」 would match any letter or
digit.
The character ‘-’ is special within a character class, but
only if it's not the first thing. Another character that's special in a
character class is ‘^’, if it is the first thing.
It “inverts” the class so that it will match any character
not listed. The class 「[^a-zA-Z0-9]」 would match
any line with spaces or punctuation on them.
There are some special short-hand sequences for some common character classes.
The sequence 「\d」 means “digit”, and
is the same as 「[0-9]」.
「\w」 means “word element” and is the
same as 「[0-9a-zA-Z_]」.
「\s」 means “space-type thing” and is
the same as 「[ \t]」 (「\t」 means tab).
You can also use 「\D」, 「\W」,
and 「\S」 to mean things not a digit, word element,
or space-type thing.
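Character classes and their short-hands behave much the same in most regex engines, so they can be tried in Python. In this sketch the re.ASCII flag is used so that \d and \w match the ASCII-only definitions given above. Note that Python's \s also matches newline and a few other characters besides space and tab, which is a difference from the definition in this manual. The helper name line_matches and the sample lines are invented.

```python
import re

def line_matches(regex, line):
    """Return True if regex matches anywhere in line (ASCII-only classes)."""
    return re.search(regex, line, re.ASCII) is not None

checks = [
    (r"x[0123456789]y", "ax5yb"),   # digit sandwiched between x and y
    (r"x[0-9]y", "ax5yb"),          # same thing with a range
    (r"x\dy", "ax5yb"),             # same thing with the short-hand
    (r"[^a-zA-Z0-9]", "abc123"),    # nothing but letters and digits
    (r"[^a-zA-Z0-9]", "abc 123"),   # contains a space
    (r"\W", "abc_123"),             # underscore counts as a word element
    (r"\S", " \t"),                 # only space-type things
]
for regex, line in checks:
    print(repr(regex), repr(line), line_matches(regex, line))
```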
Another special character would be ‘?’. This
means “maybe one of whatever was just before it, none is fine
too”. In the regex 「bikes? for rent」,
the “whatever” would be the ‘s’, so
this would match lines with either “bikes for
rent” or “bike for rent”.
Parentheses are also special, and can group things together. In the regex
That's because if you take away the “whatever” of
the ‘?’, you end up with
Similar to ‘?’ is ‘*’, which
means “any number, including none, of whatever's right in
front”. It more or less means that whatever is tagged
with ‘*’ is allowed, but not required, so something like
Similar to
both ‘?’ and ‘*’ is ‘+’,
which means “at least one of whatever just in front, but more is
fine too”. The regex 「mis+pelling」 would
match “mispelling”, “misspelling”, “missspelling”,
etc. Actually, it's just the same as 「miss*pelling」 but
simpler to type. The
regex 「ss*」 means “an ‘s’,
followed by zero or more ‘s’”,
while 「s+」 means “one or
more ‘s’”. Both are really the same.
The special character ‘|’ means “or”.
Unlike ‘+’, ‘*’,
and ‘?’ which act on the thing immediately before,
the ‘|’ is more “global”.
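The claim that 「mis+pelling」 and 「miss*pelling」 are the same can be checked mechanically. A small sketch using Python's re module (an assumption: these three quantifiers behave the same way there as in lookup); whole is a made-up helper that requires the regex to match the entire string.

```python
import re

def whole(regex, text):
    """Return True if regex matches the entire text."""
    return re.fullmatch(regex, text) is not None

words = ["mipelling", "mispelling", "misspelling", "missspelling"]
for word in words:
    # The two columns agree for every word: s+ is the same as ss*.
    print(word, whole("mis+pelling", word), whole("miss*pelling", word))

# '?' allows zero or one of the preceding item, but no more.
for phrase in ["bike for rent", "bikes for rent", "bikess for rent"]:
    print(phrase, whole("bikes? for rent", phrase))
```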
You can even combine more than two:
How about:
Here, the “whatever” immediately before
the ‘*’ is
Note that the above regex would also match
Going back to the regex to find grey/gray, that would make more sense, then, as
Somewhat similar are ‘^’ and ‘$’, which
mean “beginning of line” and “end of
line”, respectively (but, not in a character class, of course). So
the regex 「^fun」 would find any line that begins with the
letters “fun”, while 「^fun>」 would
find any line that begins with the word “fun”.
「^fun$」 would find any line that was
exactly “fun”.
Finally, 「^\s*fun\s*$」 would match any line
that had “fun” exactly, but perhaps also had leading and/or
trailing whitespace.
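The anchor examples can be tried in Python as well. One substitution is needed and is an assumption of this sketch: Python has no ‘>’ end-of-word marker, so 「^fun>」 is written as ^fun\b. The helper found and the sample lines are invented for illustration.

```python
import re

def found(regex, line):
    """Return True if regex matches somewhere in line."""
    return re.search(regex, line) is not None

checks = [
    (r"^fun", "funny story"),        # begins with the letters "fun"
    (r"^fun\b", "funny story"),      # does not begin with the word "fun"
    (r"^fun\b", "fun times"),        # begins with the word "fun"
    (r"^fun$", "fun"),               # exactly "fun"
    (r"^fun$", "fun times"),         # more than just "fun"
    (r"^\s*fun\s*$", "  fun \t"),    # "fun" padded with whitespace
    (r"^\s*fun\s*$", " fun x"),      # something else on the line
]
for regex, line in checks:
    print(repr(regex), repr(line), found(regex, line))
```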
That's pretty much it. There are more complex things, some of which I'll mention
in the list below, but even with these few simple constructs one can specify
very detailed and complex patterns.
Let's summarize some of the special things in regular expressions:
Non-EUC (JIS & SJIS) items not tested well.
Probably won't work on non-UNIX systems.
Screen control codes (for clear and highlight commands) are hard-coded for
ANSI/VT100/kterm.
Information on input and output encoding and codes can be found in Ken Lunde's
Understanding Japanese Information Processing
(日本語情報処理)
published by O'Reilly and Associates. ISBN 1-56592-043-0. There is also a
Japanese edition published by SoftBank.
A program to convert files among the various encoding methods is Dr. Ken
Lunde's jconv, which can also be found on ftp.cc.monash.edu.au.
Jconv is also useful for converting halfwidth katakana (which
lookup doesn't yet support well) to full-width.
NAME
lookup - interactive file search and display
SYNOPSIS
lookup [ args ] [ file ... ]
DESCRIPTION
Lookup allows the quick interactive search of text files. It supports ASCII, JIS-ROMAN, and Japanese EUC Packed formatted text, and has an integrated romaji→kana converter.
THIS MANUAL
Lookup is flexible for a variety of applications. This manual will, however, focus on the application of searching Jim Breen's edict (Japanese-English dictionary) and kanjidic (kanji database). Being familiar with the content and format of these files would be helpful. See the INFO section near the end of this manual for information on how to obtain these files and their documentation.
OVERVIEW OF MAJOR FEATURES
The following just mentions some major features to whet your appetite to actually read the whole manual (-:
- Romaji-to-Kana Converter
- Lookup can convert romaji to kana for you, even “on the fly” as you type.
- Fuzzy Searching
- Searches can be a bit “vague” or “fuzzy”, so that you'll be able to find “東京” even if you try to search for “ときょ” (the proper yomikata being “とうきょう”).
- Regular Expressions
- Uses the powerful and expressive regular expression for searching. One can easily specify complex searches to the effect of “I want lines that look like such-and-such, but not like this-and-that, but that also have this particular characteristic....”
- Wildcard “Glob” Patterns
- Optionally, can use well-known filename wildcard patterns instead of full-fledged regular expressions.
- Filters
- You can have lookup not list certain lines that would otherwise match your search, yet can optionally save them for quick review. For example, you could have all name-only entries from edict filtered from normal output.
- Automatic Modifications
- Similarly, you can do a standard search-and-replace on lines just before they print, perhaps to remove information you don't care to see on most searches. For example, if you're generally not interested in kanjidic's info on Chinese readings, you can have them removed from lines before printing.
- Smart Word-Preference Mode
- You can have lookup list only entries with whole words that match your search (as opposed to an embedded match, such as finding “the” inside “them”), but if no whole-word matches exist, will go ahead and list any entry that matches the search.
- Handy Features
- Other handy features include a dynamically settable and parameterized prompt, automatic highlighting of that part of the line that matches your search, an output pager, readline-like input with horizontal scrolling for long input lines, a “.lookup” startup file, automated programmability, and much more. Read on!
REGULAR EXPRESSIONS
Lookup makes liberal use of regular expressions (or regex for short) in controlling various aspects of the searches. If you are not familiar with the important concepts of regexes, read the tutorial appendix of this manual before continuing.
JAPANESE CHARACTER ENCODING METHODS
Internally, lookup works with Japanese packed-format EUC, and all files loaded must be encoded similarly. If you have files encoded in JIS or Shift-JIS, you must first convert them to EUC before loading (see the INFO section for programs that can do this).
STARTUP
Let's assume that your copy of edict is in ~/lib/edict. You can start the program simply with
lookup ~/lib/edict
You'll note that lookup spends some time building an index before the default “lookup> ” prompt appears.
lookup -write ~/lib/edict ~/lib/kanjidic
This will create the index files
~/lib/edict.jin ~/lib/kanjidic.jin
and exit.
lookup ~/lib/edict ~/lib/kanjidic
You should then be presented with the prompt without having to wait for the index to be constructed (but see the section on Operating System concerns for possible reasons of delay).
INPUT
There are basically two types of input: searches and commands. Commands do such things as tell lookup to load more files or set flags. Searches report lines of a file that match some search specifier (where lines to search for are specified by one or more regular expressions).
BRIEF EXAMPLE
Assuming you've started lookup with edict and kanjidic as noted above, let's try a few searches. In these examples, the “search [edict]> ” is the prompt. Note that the space after the ‘>’ is part of the prompt.
search [edict]> tranquil
lookup will report all lines with the string “tranquil” in them. There are currently about a dozen such lines, two of which look like:
安らか [やすらか] /peaceful (an)/tranquil/calm/restful/
安らぎ [やすらぎ] /peace/tranquility/
Notice that lines with “tranquil” and “tranquility” matched? This is because “tranquil” was embedded in the word “tranquility”. You could restrict the search to only the word “tranquil” by prepending the special “start of word” symbol ‘<’ and appending the special “end of word” symbol ‘>’ to the regex, as in:
search [edict]> <tranquil>
This is the regular expression that says “the beginning of a word, followed by a ‘t’, ‘r’, ..., ‘l’, which is at the end of a word.” The current version of edict has just three matching entries.
search [edict]> fukushima
This is a search for the “English” fukushima -- ways to search for kana or kanji will be explored later. Note that among the several lines selected and printed are:
副島 [ふくしま] /Fukushima (pn,pl)/
木曽福島 [きそふくしま] /Kisofukushima (pl)/
By default, searches are done in a case-insensitive manner -- ‘F’ and ‘f’ are treated the same by lookup, at least so far as the matching goes. This is called case folding.
“lookup command> ” as a reminder that now you're typing a command rather than a search specification.
lookup command> fold
The reply should be “file #0's case folding is on”
lookup command> fold on
JAPANESE INPUT
Lookup has an automatic romaji→kana converter. A leading ‘/’ indicates that romaji is to follow. Try typing “/tokyo” and you'll see it convert to “/ときょ” as you type. When you hit return, lookup will list all lines that have a “ときょ” somewhere in them. Well, sort of. Look carefully at the lines which match. Among them (if you had case folding back on) you'll see:
キリスト教 [キリストきょう] /Christianity/
東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
凸鏡 [とっきょう] /convex lens/
The first one has “ときょ” in it (as “トきょ”, where the katakana “ト” matches in a case-insensitive manner to the hiragana “と”), but you might consider the others unexpected, since they don't have “ときょ” in them. They're close (“とうきょ” and “とっきょ”), but not exact. This is the result of lookup's “fuzzification”.
search [edict]> /<tokyo>
(again, the “tokyo” will be converted to “ときょ” as you type). My copy of edict has the three lines
東京 [とうきょう] /Toukyou (pl)/Tokyo/current capital of Japan/
特許 [とっきょ] /special permission/patent/
凸鏡 [とっきょう] /convex lens/
This kind of whole-word romaji-to-kana search is so common, there's a special short cut. Instead of typing “/<tokyo>”, you can type “[tokyo]”. The leading ‘[’ means “start romaji” and “start of word”. Were you to type “<tokyo>” instead (without a leading ‘/’ or ‘[’ to indicate romaji-to-kana conversion), you would get all lines with the English whole-word “tokyo” in them. That would be a reasonable request as well, but not what we want at the moment.
READLINE INPUT
The actual keystrokes are read by a readline-ish package that is pretty standard. In addition to just typing away, the following keystrokes are available:
^B / ^F     move left/right one character on the line
^A / ^E     move to the start/end of the line
^H / ^G     delete one character to the left/right of the cursor
^U / ^K     delete all characters to the left/right of the cursor
^P / ^N     previous/next lines on the history list
^L or ^R    redraw the line
^D          delete char under the cursor, or EOF if line is empty
^space      force romaji conversion (^@ on some systems)
If automatic romaji-to-kana conversion is turned on (as it is by default), there are certain situations where the conversion will be done, as we saw above. Lower-case romaji will be converted to hiragana, while upper-case romaji to katakana. This usually won't matter, though, as case folding will treat hiragana and katakana the same in the searches.
ROMAJI FLAVOR
Most flavors of romaji are recognized. Special or non-obvious items are mentioned below. Lowercase are converted to hiragana, uppercase to katakana.
wo →を we→ゑ wi→ゐ
VA →ヴァ VI→ヴィ VU→ヴ VE→ヴェ VO→ヴォ
di →ぢ dzi→ぢ dya→ぢゃ dyu→ぢゅ dyo→ぢょ
du →づ tzu→づ dzu→づ
(the following kana are all smaller versions of the regular kana)
xa →ぁ xi→ぃ xu→ぅ xe→ぇ xo→ぉ
xu →ぅ xtu→っ xwa→ゎ xka→ヵ xke→ヶ xya→ゃ xyu→ゅ xyo→ょ
INPUT SYNTAX
Any input line beginning with a space (or whichever character is set as the command-introduction character) is processed as a command to lookup rather than a search spec. Automatic kana conversion is never done on these lines (but forced conversion with control-space may be done at any time).
- ?
- A line consisting of a single question mark will report the current command-introduction character (the default is a space, but can be changed with the “cmdchar” command).
- =
- If a line begins with ‘=’, the line (without the ‘=’) is taken as a search regular expression, and no automatic (or internal -- see below) kana conversion is done anywhere on the line (although again, conversion can always be forced with control-space). This can be used to initiate a search where the beginning of the regex is the command-introduction character, or in certain situations where automatic kana conversion is temporarily not desired.
- /
- A line beginning with ‘/’ indicates
romaji input for the whole line. If automatic kana conversion is turned
on, the conversion will be done in real-time, as the romaji is typed.
Otherwise it will be done internally once the line is entered.
Regardless, the presence of the
leading ‘/’ indicates that any kana (either converted
or cut-and-pasted in) should be “fuzzified” if
fuzzification is turned on.
- [
- A line beginning with ‘[’ is taken to be romaji (just as a line beginning with ‘/’), and the converted romaji is subject to fuzzification (if turned on). However, if ‘[’ is used rather than ‘/’, an implied ‘<’ “beginning of word” is prepended to the resulting kana regex. Also, any ending ‘]’ on such a line is converted to the “ending of word” specifier ‘>’ in the resulting regex.
- !
- Various flags can be toggled for the duration of a
particular search by prepending a “!!” sequence to the
input line.
!F! … Filtration is toggled for this line (filter)
!M! … Modification is toggled for this line (modify)
!w! … Word-preference mode is toggled for this line (word)
!c! … Case folding is toggled for this line (fold)
!f! … Fuzzification is toggled for this line (fuzz)
!W! … Wildcard-pattern mode is toggled for this line (wildcard)
!r! … Raw. Force fuzzification off for this line
!h! … Highlighting is toggled for this line (highlight)
!t! … Tagging is toggled for this line (tag)
!d! … Displaying is on for this line (display)
The letters can be combined, as in “!cf!”.
- +
- A ‘+’ prepended to anything above will
cause the final search regex to be printed. This can be useful to see when
and what kind of fuzzification and/or internal kana conversion is
happening. Consider:
search [edict]> +/わかる
a match is “わ[ぁあー]*っ?か[ぁあー]*る[ぅうおぉー]*”
Due to the leading ‘/’, the kana is fuzzified, which explains the somewhat complex resulting regex. For comparison, note:
search [edict]> +わかる
a match is “わかる”
search [edict]> +!/わかる
a match is “わかる”
As the ‘+’ shows, these are not fuzzified. The first one has no leading ‘/’ or ‘[’ to induce fuzzification, while the second has the ‘!’ line prefix (which is the default version of “!f!”), which toggles fuzzification mode to “off” for that line.
- ,
- The default of all searches and most commands is to work
with the first file loaded (edict in these examples). One can
change this default (see the “select” command) or, by
appending a comma+digit sequence at the end of an input line, force that
line to work with another previously-loaded file. An
appended “,1” works with the first extra file loaded (in
these examples, kanjidic). An
appended “,2” works with the 2nd extra file loaded,
etc.
search [edict]> [ときょと]
東京都 [とうきょうと] /Tokyo Metropolitan area/
cutting and pasting the 都 from above, and adding a “,1” to search kanjidic:
search [edict]> 都,1
都 4554 N4769 S11 ..... ト ツ みやこ {metropolis} {capital}
FILENAME-LIKE WILDCARD MATCHING
When wildcard-pattern mode is selected, patterns are considered as extended “*.txt”-like patterns. This is often more convenient for users not familiar with regular expressions. To have this mode selected by default, put
default wildcard on
into your “.lookup” file (see “STARTUP FILE” below).
MULTIPLE-PATTERN SEARCHES
You can put multiple patterns in a single search specifier. For example consider
search [edict]> china||japan
The first part (“china”) will select all lines that have “china” in them. Then, from among those lines, the second part will select lines that have “japan” in them. The “||” is not part of any pattern -- it is lookup's “pipe” mechanism.
search [edict]> <H\d+>|!|<N\d+>
If you then wanted to restrict the listing to those that also had a “jinmeiyou” marking (kanjidic's “G9” field) and had a reading of あき, you could make it:
search [edict]> <H\d+>|!|<N\d+>||<G9>||<あき>
A prepended ‘+’ would explain:
a match is “<H\d+>”
and not “<N\d+>”
and “<G9>”
and “<あき>”
The “|!|” and “||” can be used to make up to ten separate regular expressions in any one search specification.
COMBINATION SLOTS
Each file, when loaded, is assigned to a “slot” via which subsequent references to the file are then made. The slot may then be searched, have filters and flags set, etc.
┏━┳━━━━┯━━┳━━━┳━━━━━━━━━━━━━━
┃ 0┃F wcfh d│a I ┃ 2762k┃/usr/jfriedl/lib/edict
┃ 1┃FM cf  d│a I ┃  705k┃/usr/jfriedl/lib/kanjidic
┃ 2┃F  cfh@d│a   ┃    1k┃/usr/jfriedl/lib/local.words
┃*3┃FM cfhtd│a   ┃ combo┃kotoba (#2, #0)
┗━┻━━━━┷━━┻━━━┻━━━━━━━━━━━━━━
See the discussion of the files command below for basic explanation of the output.
PAGER
Lookup has a built-in pager (à la more). Upon filling a screen with text, the string
--MORE [space,return,c,q]--
is shown. A space will allow another screen of text; a return will allow one more line. A 'c' will allow output text to continue unpaged until the next command. A 'q' will flush output of the current command.
COMMANDS
Any line intended to be a command must begin with the command-introduction character (the default is a space, but can be set via the "cmdchar" command). However, that character is not part of the command itself and won't be shown in the following list of commands.
- [default] autokana [boolean]
-
- clear|cls
-
- cmdchar ['one-byte-char']
- The default command-introduction character is a space, but
it may be changed via this command. The single quotes surrounding the
character are required. If no argument is given, the current value is
printed.
- combine ["name"] [ num += ] slotnum ...
-
combo "three stooges" 2, 0, 1
The command will report
creating combo slot #3 (three stooges): 2 0 1
combo 3 += 4
(the "+=" wording comes from the C programming language where it means "add on to"). Adding to a combination always adds slots to the end of the list.
combo "four stooges" 3 += 4
The reply would be
adding to combo slot #3(four stooges): 4
- command debug [boolean]
-
- debug [boolean]
-
- describe specifier
- This command will tell you how a character (or each
character in a string) is encoded in the various encoding methods:
lookup command> describe "気"
"気" as EUC is 0xb5a4 (181 164; \265 \244)
as JIS is 0x3524 ( 53 36; \065 \044 "5$")
as KUTEN is 2104 ( 0x1504; \025 \004)
as S-JIS is 0x8b43 (139 67; \213 \103)
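The EUC, JIS and KUTEN values reported by describe are related by simple byte arithmetic. The following Python sketch shows that relationship for a single two-byte character. The function name is hypothetical and the code relies on Python's standard "euc_jp" codec, so treat it as an illustration rather than as how lookup computes its output.

```python
def describe_char(ch):
    """Return the EUC, JIS and KUTEN codes of one JIS X 0208 character."""
    euc = ch.encode("euc_jp")
    if len(euc) != 2:
        raise ValueError("expected a two-byte JIS X 0208 character")
    # JIS is the EUC code with the high bit of each byte cleared.
    jis = bytes(b & 0x7F for b in euc)
    # KUTEN is the row and column, each counted from 1 (JIS byte minus 0x20).
    ku, ten = jis[0] - 0x20, jis[1] - 0x20
    return {"EUC": euc.hex(), "JIS": jis.hex(), "KUTEN": ku * 100 + ten}

print(describe_char("気"))
```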
- encoding [euc|sjis|jis]
- The same as the -euc, -jis, and -sjis command-line options: sets the encoding method for interactive input and output (or reports the current status). Finer control over the output encoding is available through the output encoding command. A separate encoding for input can be set with the input encoding command.
- files [ - | long ]
-
┃*0┃F wcfh d│a I┃ 3749k┃/usr/jeff/lib/edict
┃ 1┃FM cf  d│a I┃  754k┃/usr/jeff/lib/kanjidic
┏━━┳━━━━━━━━━┯━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃ 0┃F wcf h d│a I┃ 2762k┃/usr/jfriedl/lib/edict
┃ 1┃FM cf   d│a I┃  705k┃/usr/jfriedl/lib/kanjidic
┃ 2┃F  cfWh@d│a  ┃    1k┃/usr/jfriedl/lib/local.words
┃*3┃FM cf htd│a  ┃ combo┃kotoba (#2, #0)
┃ 4┃   cf   d│a  ┃  205k┃/usr/dict/words
┗━━┻━━━━━━━━━┷━━━┻━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The first section is the slot number, with a "*" beside the default slot (as set by the select command).
F … if there is a filter {but '#' if disabled}. (filter)
M … if there is a modify spec {but '%' if disabled}. (modify)
w … if word-preference mode is turned on. (word)
c … if case folding is turned on. (fold)
f … if fuzzification is turned on. (fuzz)
W … if wildcard-pattern mode is turned on (wildcard)
h … if highlighting is turned on. (highlight)
t … if there is a tag {but @ if disabled} (tag)
d … if found lines should be displayed (display)
─────────────────────────────────
a … if autokana is turned on (autokana)
P … if there is a file-specific local prompt (prompt)
I … if the file is loaded with a precomputed index (load)
d … if the display flag is on (display)
Note that the letters in the upper section directly correspond to the "!!" sequence characters described in the INPUT SYNTAX section.
- filter ["label"] [!] /regex/[i]
-
lookup command> filter "name" /(pn)/
search [edict]> [きの]
機能 [きのう] /function/faculty/
帰納 [きのう] /inductive/
昨日 [きのう] /yesterday/
≪3 "name" lines filtered≫
In the example, '/' characters are used to delimit the start and stop of the regex (as is common with many programs). However, any character can be used. A final 'i', if present, indicates that the regex should be applied in a case-insensitive manner.
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
as it would filter all entries that had only one English section, that section being a name. It is also an example of using something other than '/' to delimit a regex, as it makes things a bit easier to read.
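To see what such a filter does to a list of matching lines, here is a rough Python sketch. It is not lookup's code: the function name and the sample entry for "Kino" are hypothetical, and lookup's '<' and '>' word breaks are approximated with Python's \b.

```python
import re

# Python rendering of the manual's filter regex, with <...> written as \b...\b.
NAME_FILTER = re.compile(r"^[^/]+/[^/]*\bp[ln]\b[^/]*/$")

def apply_filter(lines, regex):
    """Return the lines to display and the number of lines filtered out."""
    shown = [ln for ln in lines if regex.search(ln) is None]
    return shown, len(lines) - len(shown)

entries = [
    "機能 [きのう] /function/faculty/",
    "木野 [きの] /(pn) Kino/",          # hypothetical name-only entry
    "昨日 [きのう] /yesterday/",
]
shown, hidden = apply_filter(entries, NAME_FILTER)
print(shown, hidden)
```

Only the entry whose single English section carries a pn or pl marking is hidden; an entry with a second English section would fail the trailing "/$" and still be shown.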
- filter [boolean]
-
- [default] fold [boolean]
-
- [default] fuzz [boolean]
-
- help [regex]
-
- [default] highlight [boolean]
-
- highlight style [bold | inverse | standout | <___>]
-
- if {expression} command...
-
! x … yields 0 if x is non-zero, 1 if x is zero.
x && y …
! x … 'not' Yields 1 if x is zero, 0 if non-zero.
x & y … 'and' Yields 1 if both x and y are non-zero, 0 otherwise.
x | y … 'or' Yields 1 if x or y (or both) is non-zero, 0 otherwise.
!d!expect this line
if {!printed} msg Oops! couldn't find "expect this line"
- input encoding [ euc | sjis ]
- Used to set (or report) what encoding to use when 8-bit bytes are found in the interactive input (all flavors of JIS are always recognized). Also see the encoding and output encoding commands.
- limit [value]
-
- log [ to [+] file ]
-
- log - | off
- If only "-" or off is given, any currently-opened log file is closed.
- load [-now|-whenneeded] "filename"
-
% lookup -writeindex filename
to generate and write an index file, which will then be automatically used in the future.
- modify /regex/replace/[ig]
-
modify /<Japan>/Dainippon Teikoku/g
So that a line such as
日銀 [にちぎん] /Bank of Japan/
would come out as
日銀 [にちぎん] /Bank of Dainippon Teikoku/
As a real example of the modify command with kanjidic, consider that it is likely that one is not interested in all the various fields each entry has. The following can be used to remove the info on the U, N, Q, M, E, B, C, and Y fields from the output:
modify /( [UNQMECBY]\S+)+//g,1
It's sort of complex, but works. Note that here the replacement part is empty, meaning to just remove those parts which matched. The result of such a search of 日 would normally print
日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 ＼
MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}
but with the above modify spec, appears more simply as
日 467c S4 G1 H3027 F1 P3-3-1 ニチ ジツ ひ -び -か {day}
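The effect of that modify spec can be checked with an ordinary regex substitution. The Python sketch below applies the same pattern and an empty replacement to the sample line. It assumes Python's \S behaves like lookup's here, and the function name is hypothetical.

```python
import re

# The regex from the manual's kanjidic example; the replacement is empty.
FIELD_PATTERN = re.compile(r"( [UNQMECBY]\S+)+")

def strip_fields(line):
    """Remove every run of fields whose code letter is U, N, Q, M, E, C, B or Y."""
    # re.sub replaces all matches, like the 'g' flag of the modify command.
    return FIELD_PATTERN.sub("", line)

entry = ("日 467c U65e5 N2097 B72 B73 S4 G1 H3027 F1 Q6010.0 MP5.0714 "
         "MN13733 E62 Yri4 P3-3-1 ニチ ジツ ひ -び -か {day}")
print(strip_fields(entry))
```

The leading space inside the group is what keeps the pattern from touching codes in the middle of a field: only whole, space-separated fields starting with one of the listed letters are removed.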
- modify [boolean]
-
- msg string
- The given string is printed.
- output encoding [ euc | sjis | jis...]
- Used to set exactly what kind of encoding should be used
for program output (also see the input encoding command). Used when
the encoding command is not detailed enough for one's needs.
- euc
- Selects EUC for the output encoding.
- sjis
- Selects Shift-JIS for the output encoding.
- jis[78|83|90][-ascii|-roman]
- Selects JIS for the output encoding. If no year (78, 83, or 90) is given, 78 is used. Can optionally specify that "English" should be encoded as regular ASCII (the default when JIS selected) or as JIS-ROMAN.
- 212
- Indicates that JIS X0212-1990 should be supported (ignored for Shift-JIS output).
- no212
- Indicates that JIS X0212-1990 should not be supported (default setting). This places JIS X0212-1990 characters under the domain of disp, nodisp, code, or mark (described below).
- hwk
- Indicates that half width kana should be left as-is (default setting).
- nohwk
- Indicates that half width kana should be stripped from the output. (not yet implemented).
- foldhwk
- Indicates that half width kana should be folded to their full-width counterparts. (not yet implemented).
- disp
- Indicates that non-displayable characters (such as JIS X0212-1990 while the output encoding method is Shift-JIS) should be passed along anyway (most likely resulting in screen garbage).
- nodisp
- Indicates that non-displayable characters should be quietly stripped from the output.
- code
- Indicates that non-displayable characters should be printed as their octal codes (default setting).
- mark
- Indicates that non-displayable characters should be printed as "★".
- pager [ boolean | size ]
- Turns on or off an output pager, sets its idea of the
screen size, or reports the current status.
- [local] prompt "string"
- Sets the prompt string. If "local" is
indicated, sets the prompt string for the selected slot only.
Otherwise, sets the global default prompt string.
%N … the default slot's file or combo name.
%n … like %N, but any leading path is not shown if a filename.
%# … the default slot's number.
%S … the "command-introduction" character (cmdchar)
%0 … the running program's name
%F='string' … string shown if filtering enabled (filter)
%M='string' … string shown if modification enabled (modify)
%w='string' … string shown if word mode on (word)
%c='string' … string shown if case folding on (fold)
%f='string' … string shown if fuzzification on (fuzz).
%W='string' … string shown if wildcard-pat. mode on (wildcard).
%d='string' … string shown if displaying on (display).
%C='string' … string shown if currently entering a command.
%l='string' … string shown if logging is on (log).
%L … the name of the current output log, if any (log)
For the tests (%f, etc), you can put '!' just after the '%' to reverse the sense of the test (i.e. %!f="no fuzz"). The reverse of %F is if a filter is installed but disabled (i.e. string will never be shown if there is no filter for the default file). The modify %M works comparably.
%C='command'%!C(%f='fuzzy 'search:)
would result in a "command" prompt if entering a command, while it would result in either a "fuzzy search:" or a "search:" prompt if not entering a command. The parenthesized constructs may be nested.
prompt "%C(%0 command)%!C(%w'*'%!f'raw '%n)> "
With this prompt specification, the prompt would normally appear as "filename> " but when fuzzification is turned off as "raw filename> ". And if word-preference mode is on, the whole thing has a "*" prepended. However, if a command is being entered, the prompt would then become "name command", where name was the program's name (system dependent, but most likely "lookup").
- regex debug [boolean]
-
- saved list size [value]
-
- select [ num | name | . ]
-
- show
-
- source "filename"
-
load "my.word.list"
set word on
load "my.kanji.list"
set word off
set local prompt "enter kanji> "
would work as might make intuitive sense.
- spinner [value]
- Set the value of the spinner (A silly little feature). If set to a non-zero value, will cause a spinner to spin while a file is being checked, one increment per value lines in the file actually checked against the search specifier. Default is off (i.e. zero).
- stats
- Shows information about how many lines of the text file were checked against the last search specifier, and how many lines matched and were printed.
- tag [boolean] ["string"]
- Enable, disable, or set the tag for the selected
slot.
- verbose [boolean]
-
- version
- Reports the current version of the program.
- [default] wildcard [boolean]
-
* is changed to the regular expression .*
? is changed to the regular expression .
+ is changed to the regular expression \+
. is changed to the regular expression \.
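The wildcard translation amounts to a character-by-character rewrite of the pattern. A minimal Python sketch of the idea follows. It is an assumption, not taken from lookup's source, that '+' and '.' are made literal by escaping them with a backslash, and the function name is hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a filename-like wildcard pattern into a regular expression."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")        # any run of characters
        elif ch == "?":
            out.append(".")         # exactly one character
        elif ch in "+.":
            out.append("\\" + ch)   # assumed: these become literal characters
        else:
            out.append(ch)
    return "".join(out)

print(wildcard_to_regex("*.txt"))
```

Other regex characters are passed through untouched in this sketch, which is one reading of the word "extended" in the description of wildcard-pattern mode.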
- [default] word|wordpreference [boolean]
- The selected file's word-preference mode is turned on or
off (default is off), or reports the current setting if no argument is
specified. However, if "default" is specified, the
value to be inherited as the default by subsequently-loaded files is set
(or reported).
- quit | leave | bye | exit
-
STARTUP FILE
If the file "~/.lookup" is present, commands are read from it during lookup startup.
## turn verbose mode off during startup file processing
verbose off
prompt "%C([%#]%0)%!C(%w'*'%!f'raw '%n)> "
spinner 200
pager on
## The filter for edict will hit for entries that
## have only one English part, and that English part
## having a pl or pn designation.
load ~/lib/edict
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
highlight on
word on
## The filter for kanjidic will hit for entries without a
## frequency-of-use number. The modify spec will remove
## fields with the named initial code (U,N,Q,M,E, and Y)
load ~/lib/kanjidic
filter "uncommon" !/<F\d+>/
modify /( [UNQMEY]
## Use the same filter for my local word file,
## but turn off by default.
load ~/lib/local.words
filter "name" #^[^/]+/[^/]*<p[ln]>[^/]*/$#
filter off
highlight on
word on
## Want a tag for my local words, but only when
## accessed via the combo below
tag off "》"
combine "words" 2 0
select words
## turn verbosity back on for interactive use.
verbose on
COMMAND-LINE ARGUMENTS
With the use of a startup file, command-line arguments are rarely needed. In practical use, they are only needed to create an index file, as in:
lookup -write textfile
Any command-line arguments that aren't flags are taken to be files which are loaded in turn during startup. In this case, any "load", "filter", etc. commands in the startup file are ignored.
- -help
- Reports a short help message and exits.
- -write
- Creates index files for the named files and exits. No startup file is read.
- -euc
- Sets the input and output encoding method to EUC (currently the default). Exactly the same as the "encoding euc" command.
- -jis
- Sets the input and output encoding method to JIS. Exactly the same as the "encoding jis" command.
- -sjis
- Sets the input and output encoding method to Shift-JIS. Exactly the same as the "encoding sjis" command.
- -v -version
- Prints the version string and exits.
- -norc
-
- -rc file
- The named file is used as the startup file, rather than the default "~/.lookup". It is an error for the file not to exist.
- -percent num
-
- -noindex
-
- -verbose
-
- -port ###
- For the (undocumented) server configuration only, tells
which port to listen on.
OPERATING SYSTEM CONSIDERATIONS
I/O primitives and behaviors vary with the operating system. On my operating system, I can "read" a file by mapping it into memory, which is a pretty much instant procedure regardless of the size of the file. When I later access that memory, the appropriate sections of the file are automatically read into memory by the operating system as needed.
REGULAR EXPRESSIONS, A BRIEF TUTORIAL
Regular expressions ("regex" for short) are a "code" used to indicate what kind of text you're looking for. They're how one searches for things in the editors "vi", "stevie", "mifes" etc., or with the grep commands. There are differences among the various regex flavors in use -- I'll describe the flavor used by lookup here. Also, in order to be clear for the common case, I might tell a few lies, but nothing too heinous.
I am feeling flabby
would "match" the regex 「ab」 because, indeed, there's an "ab" on that line. But it wouldn't match the line
this line has no a followed _immediately_ by a b
because, well, what the line says is true.
The sky was grey and cloudy.
because of the different spelling (grey vs. gray). But the regex 「gr.y」 asks for "any line with a 'g', 'r', some character, and then a 'y'". So this would get "grey" and "gray".
gr[ea]y
which would match lines with a 'g', 'r', an 'e' or an 'a', and then a 'y'. Inside a character class you can list as many characters as you want to.
big (fat harry)? deal
the "whatever" for the '?' would be "fat harry". But be careful to pay attention to details... this regex would match
I don't see what the big fat harry deal is!
but not
I don't see what the big deal is!
big  deal
Notice that there are two spaces between the words, and the regex didn't allow for that. The regex to get either line above would be
big (fat harry )?deal
or
big( fat harry)? deal
Do you see how they're essentially the same?
I (really )*hate peas
would match "I hate peas", "I really hate peas!", "I really really hate peas", etc.
give me (this|that) one
would match lines that had "give me this one" or "give me that one" in them.
give me (this|that|the other) one
[Ii]t is a (nice |sunny |bright |clear )*day
(nice |sunny |bright |clear )
So this regex would match all the following lines:
It is a day.
I think it is a nice day.
It is a clear sunny day today.
If it is a clear sunny nice sunny sunny sunny bright day then....
Notice how the 「[Ii]t」 matches either "It" or "it"?
fruit is a day
because it indeed fulfills all requirements of the regex, even though the "it" is really part of the word "fruit". To address concerns like this, which are common, there are '<' and '>', which mean "word break". The regex 「<it」 would match any line with "it" beginning a word, while 「it>」 would match any line with "it" ending a word. And, of course, 「<it>」 would match any line with the word "it" in it.
<gr[ae]y>
which would match only the words "grey" and "gray".
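The tutorial examples can be tried out in Python, whose re module is close to the flavor described here for these simple cases. The sketch below is only an approximation: it assumes lookup's '<' and '>' can be replaced by Python's \b, and the helper names are hypothetical.

```python
import re

def lookup_to_python(regex):
    # Assumption: lookup's < and > (word breaks) map onto Python's \b.
    return regex.replace("<", r"\b").replace(">", r"\b")

def matches(regex, line):
    """True if the (lookup-style) regex matches anywhere in the line."""
    return re.search(lookup_to_python(regex), line) is not None

DAY = "[Ii]t is a (nice |sunny |bright |clear )*day"

print(matches("gr[ea]y", "The sky was grey and cloudy."))
print(matches("big (fat harry)? deal", "I don't see what the big deal is!"))
print(matches("big (fat harry )?deal", "I don't see what the big deal is!"))
print(matches(DAY, "fruit is a day"))
print(matches("<" + DAY, "fruit is a day"))
```

The last two calls show the point of the word-break markers: without '<' the "it" inside "fruit" is enough for a match, and with it the match is rejected.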
Items that are basic units:
char … any non-special character matches itself.
\char … special chars, when preceded by \, become non-special.
. … Matches any one character (except \n).
\n … Newline.
\t … Tab.
\r … Carriage Return.
\f … Formfeed.
\d … Digit. Just a short-hand for [0-9].
\w … Word element. Just a short-hand for [0-9a-zA-Z_].
\s … Whitespace. Just a short-hand for [\t \n\r\f].
\## \### … Two or three digit octal number indicating a single byte.
[chars] … Matches a character if it's one of the characters listed.
[^chars] … Matches a character if it's not one of the ones listed.
The \char items above can be used within a character class, but not the items below.
\D … Anything not \d.
\W … Anything not \w.
\S … Anything not \s.
\a … Any ASCII character.
\A … Any multibyte character.
\k … Any (not half-width) katakana character (including ー).
\K … Any character not \k (except \n).
\h … Any hiragana character.
\H … Any character not \h (except \n).
(regex) … Parens make the regex one unit.
(?:regex) … [from perl5] Grouping-only parens -- can't use for \# (below).
\c … Any JISX0208 kanji (kuten rows 16-84).
\C … Any character not \c (except \n).
\# … Match whatever was matched by the #th paren from the left.
With "☆" to indicate one "unit" as above, the following may be used:
☆? … A ☆ allowed, but not required.
☆+ … At least one ☆ required, but more ok.
☆* … Any number of ☆ ok, but none required.
There are also ways to match "situations":
\b … A word boundary.
< … Same as \b.
> … Same as \b.
^ … Matches the beginning of the line.
$ … Matches the end of the line.
Finally, the "or" is
reg1|reg2 … Match if either reg1 or reg2 match.
Note that "\k" and the like aren't allowed in character classes, so something such as 「[\k\h]」 to try to get all kana won't work. Use 「(\k|\h)」 instead.