NAME
ra-index - index files for use with remembrance agent software
SYNOPSIS
ra-index [--version] [-v] [-d] [-s] <base-dir> <source1>
  [<source2>] [...] [-e <excludee1> [<excludee2>] [...]]
DESCRIPTION
ra-index and ra-retrieve make up the Savant search engine, an
  information retrieval engine designed as a back-end for the Remembrance Agent
  (RA). Given a collection of the user's accumulated email, usenet news
  articles, papers, saved HTML files and other text notes, the RA attempts to
  find those documents which are most relevant to the user's current context.
  That is, it searches this collection of text for the documents which bear the
  highest word-for-word similarity to the text the user is currently editing, in
  the hope that they will also bear high conceptual similarity and thus be
  useful to the user's current work. With the Emacs front-end, these suggestions
  are continuously displayed in a small buffer at the bottom of the user's
  window. If a suggestion looks useful, the full text can be retrieved with a
  single command.
 
The Remembrance Agent works in two stages. First, the user's collection of text
  documents is indexed into a database saved in a vector format. After the
  database is created, the second stage of the Remembrance Agent is run from
  Emacs, where it periodically takes a sample of text from the working buffer
  and finds those documents from the collection that are most similar. It
  summarizes the top documents in a small Emacs window and allows you to
  retrieve the entire text of any one with a keystroke. See the README file for
  information on using the Emacs front-end.
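 
The two stages can be illustrated with a small Python sketch. It is not
  Savant's code: the function names (tokenize, build_index, query), the
  tokenizer, the smoothed IDF formula and the cosine ranking are all
  assumptions chosen for illustration, using the TF-IDF weighting that this
  page names as Savant's core algorithm.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Stage 1: turn each document into a TF-IDF weighted term vector."""
    term_counts = {name: Counter(tokenize(body)) for name, body in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in counts.items()}
        for name, counts in term_counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query(index, text, top=3):
    """Stage 2: rank indexed documents against a sample of the current text."""
    idf, vectors = index
    counts = Counter(tokenize(text))
    # Words that never occur in the collection carry no weight.
    qvec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(((cosine(qvec, v), name) for name, v in vectors.items()),
                    reverse=True)
    return [name for score, name in scored[:top] if score > 0.0]
```

Building the index once and calling query repeatedly mirrors the split
  between ra-index and the retrieval stage, though the real database lives on
  disk rather than in memory.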
 
 
At its core Savant is a text-retrieval search engine that uses a standard TF-IDF
  algorithm, but it also uses a template system to recognize different kinds of
  documents and extract various field information. For example, ra-index
  can recognize subject lines and address information from email files and file
  this information separately. It can also pull apart file archives into
  separate documents, e.g. RMAIL files are indexed as separate email documents.
  Finally, there are filters defined for many document types to remove
  extraneous information like HTML tags that might otherwise cause problems in
  retrieval. These are all precompiled in a template structure. The structure
  is not currently well documented, but anyone who wants to experiment with it
  will find it defined in the source file templates/conftemplates.c.
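 
As a rough illustration of field extraction, the sketch below pulls the
  subject, the addresses and the body out of a single email message using
  Python's standard email package. It does not reproduce Savant's template
  system; the function name extract_fields and the choice of fields are
  assumptions made for this example.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_fields(raw_message):
    """Split one RFC 822 style message into separately indexable fields."""
    msg = message_from_string(raw_message)
    # Collect every address mentioned in the From and To headers.
    pairs = getaddresses(msg.get_all("From", []) + msg.get_all("To", []))
    return {
        "subject": msg.get("Subject", ""),
        "addresses": [addr for _name, addr in pairs],
        "body": msg.get_payload(),  # a plain string for non-multipart mail
    }
```

Keeping the fields apart lets a retrieval engine weight a subject match
  differently from a body match, which is the kind of separation the paragraph
  above describes.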
 
The RA is primarily designed as a proactive information provider that
  continually gives you information that might be relevant to your current
  environment, but Savant can also be used as a standard text and information
  retrieval search engine.
 
USAGE
To index, you must have a set of source text-files, and a directory Savant can
  put database files into. The <source> arguments may be files or
  directories. If a directory is in the list, Savant will use all its contents,
  recursing into all subdirectories. Non-text files and backup files (those
  ending in ~ or beginning with #) are ignored. Savant also ignores dot-files
  (those starting with .) and, unless -s is given, symbolic links. Any files or directories specified
  after the optional -e flag will be excluded. Savant will use any files it
  finds to create a database in the specified base directory, which must already
  exist. The optional -v argument (verbose) directs Savant to report its
  progress as it indexes. For example,
ra-index -v ~/RA-indexes/mail ~/RMAIL ~/Rmail-files -e ~/Rmail-files/Old-files
 
will build a database in the ~/RA-indexes/mail directory, made up of emails from
  my RMAIL file plus all files and subdirectories of ~/Rmail-files, excluding
  files and directories in ~/Rmail-files/Old-files.
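 
The file-selection rules described above can be approximated in Python as
  follows. This is a sketch of the documented behaviour, not of ra-index
  itself: the names is_skipped_name and collect_sources are hypothetical, the
  non-text check is omitted because this page does not say how Savant detects
  text files, and applying the name rules only to directory contents (not to
  sources named explicitly) is an assumption.

```python
import os


def is_skipped_name(name):
    """Backup files (trailing ~ or leading #) and dot-files are skipped."""
    return name.endswith("~") or name.startswith("#") or name.startswith(".")


def collect_sources(sources, excludes=(), follow_symlinks=False):
    """Return the files that would be indexed, in a stable order."""
    excluded = {os.path.abspath(p) for p in excludes}
    found = []

    def visit(path):
        path = os.path.abspath(path)
        if path in excluded:
            return  # listed after -e
        if os.path.islink(path) and not follow_symlinks:
            return  # symbolic links are ignored unless -s is given
        if os.path.isdir(path):
            # Recurse into every subdirectory (link cycles are not guarded).
            for name in sorted(os.listdir(path)):
                if not is_skipped_name(name):
                    visit(os.path.join(path, name))
        elif os.path.isfile(path):
            found.append(path)

    for source in sources:
        visit(source)
    return found
```

Calling collect_sources(["RMAIL", "Rmail-files"],
  excludes=["Rmail-files/Old-files"]) would list roughly the files that the
  example command above hands to the indexer.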
 
ra-index can build databases in any directory you like, but the Emacs
  interface for the Remembrance Agent expects a particular structure. For each
  database you want to make, you should create a directory, and all these
  directories should live in the same parent directory. For example, for my own
  use I have a directory ~/RA-indexes/, and within that are the directories
  ~/RA-indexes/mail/, ~/RA-indexes/papers/, etc. which actually contain the
  database files.
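 
One way to set up this layout is sketched below in Python. The helper name
  plan_databases and the shape of its databases argument are hypothetical; the
  only ra-index arguments used are those documented on this page (-v, the base
  directory, the sources, and -e followed by the exclusions). The function
  creates the directories and returns the command lines without running them.

```python
import os


def plan_databases(parent_dir, databases):
    """Create one directory per database under parent_dir and return the
    ra-index argument list that would build each one."""
    commands = []
    for name, spec in databases.items():
        base_dir = os.path.join(parent_dir, name)
        # ra-index requires the base directory to exist already.
        os.makedirs(base_dir, exist_ok=True)
        argv = ["ra-index", "-v", base_dir] + list(spec["sources"])
        if spec.get("excludes"):
            argv += ["-e"] + list(spec["excludes"])
        commands.append(argv)
    return commands
```

Each returned list can be passed to subprocess.run(argv, check=True) on a
  machine where ra-index is installed. Paths containing ~ should go through
  os.path.expanduser first, since no shell is involved to expand them.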
 
OPTIONS
  -v
      Verbose mode. Print useful information.
 
  -d
      Debug mode. Print not-so-useful information.
 
  -e
      Exclude all filenames and directories which follow.
 
  -s
      Follow symbolic links when indexing.
 
  --version
      Print version information.
 
SEE ALSO
ra-retrieve(1)
AUTHOR
Bradley Rhodes, MIT Media Lab. Please send comments and questions to
  ra-bugs@media.mit.edu. New versions and updates can be found at
  http://www.media.mit.edu/~rhodes/RA/
 
COPYRIGHT
All code included in versions up to and including 2.09:
 Copyright (C) 2001 Massachusetts Institute of Technology.
 
All modifications subsequent to version 2.09 are copyright Bradley Rhodes or
  their respective authors.
 
Developed by Bradley Rhodes at the Media Laboratory, MIT, Cambridge,
  Massachusetts, with support from British Telecom and Merrill Lynch.
 
This program is free software; you can redistribute it and/or modify it under
  the terms of the GNU General Public License as published by the Free Software
  Foundation; either version 2 of the License, or (at your option) any later
  version. For commercial licensing under other terms, please consult the MIT
  Technology Licensing Office.
 
This program may be subject to the following US and/or foreign patents
  (pending): "Method and Apparatus for Automated, Context-Dependent
  Retrieval of Information," MIT Case No. 7870TS. If any of these patents
  are granted, royalty-free license to use this and derivative programs under
  the GNU General Public License are hereby granted.
 
This program is distributed in the hope that it will be useful, but WITHOUT ANY
  WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
  A PARTICULAR PURPOSE. See the GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License along with
  this program; if not, write to the Free Software Foundation, Inc., 59 Temple
  Place - Suite 330, Boston, MA 02111-1307, USA.
 
BUGS
Dates are not currently indexed, so a date query returns no suggestions.
Requires GNU make to compile.
The template structure isn't documented.