The CNIDR ISEARCH Text Searching System
Erik Scott, Scott Technologies, Inc.
Archie Warnock, A/WWW Enterprises
Features of the 1.13 Release
Isearch is a software system for searching through large amounts of
text. The system allows a user to very quickly find out what documents
are available that contain certain words. Unlike older search systems,
Isearch does not use a list of keywords or an abstract; every word of
every document can be checked. This allows greatly improved chances of
discovering new information in old collections.
Consider this real-world example: CNIDR uses Isearch
to index and search a collection of over 2000 AIDS-related patents
issued by the U.S. Patent and Trademark Office. This collection of XXX
megabytes of raw text can be searched in less than 1 second. A
researcher looking for patents containing either the word
"needle" or the word "syringe" can submit the query
and get results back about as fast as his desktop machine can display
them.
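
The idea behind such a query can be sketched in a few lines. The following is a minimal illustration, not Isearch's actual implementation: it assumes a simple in-memory inverted index, and the names `tokenize`, `build_index`, and `search_any`, along with the three sample documents, are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search_any(index, terms):
    """Return ids of documents containing at least one of the terms (OR)."""
    hits = set()
    for term in terms:
        # .get() avoids creating empty entries for unknown words
        hits |= index.get(term.lower(), set())
    return sorted(hits)

# Hypothetical sample collection, for illustration only
docs = {
    "patent-1": "A disposable syringe with a retractable tip.",
    "patent-2": "Needle guard assembly for safe disposal.",
    "patent-3": "Method for assaying antibodies in serum.",
}
index = build_index(docs)
print(search_any(index, ["needle", "syringe"]))  # ['patent-1', 'patent-2']
```

Because every word is indexed ahead of time, answering the query only requires looking up two entries and merging them; the documents themselves are never rescanned.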
ISEARCH Features:
- Searches large collections using a Free-Text search: no reliance
on keywords, abstracts, or human-generated indexes.
- Handles very large collections: collections of over 1 gigabyte
(roughly 1,000 megabytes) can be handled on modest servers. Essentially
unlimited textbases can be searched with careful layout and planning.
- Very sophisticated result sorting: The documents most likely to be
useful are returned first. Ranking is based on statistical analysis of
word frequencies and is generalized for a wide variety of subjects and
user skill levels.
- Fast: documents are machine-indexed before searching, so
non-matching documents needn't be read in. Fast enough to make optical
media a reasonable solution, and extremely responsive with cheap SCSI
disks.
- Works well with OCR document storage and retrieval systems: no
need for people to classify documents, and the statistical ranking
method is forgiving of OCR errors. Potentially millions of pages can be
made searchable for little more than photocopy costs.
- Handles a wide range of document types: can handle text in formats
from raw ASCII dumps to richly formatted SGML. Convenient doctype
interface allows handling of entirely new and unusual formats in a
matter of hours. Good supply of free and commercial doctypes available
from third parties.
- Efficient use of disk resources: Indexes are relatively compact,
generally smaller than the original collection, and yet contain
references to every word in the textbase.
- Text maintenance commands: old documents can be deleted instantly
and new data can be added without having to re-index the entire
collection.
- Portable and Scalable: works well on Unix machines from Linux PCs
to Crays. Takes advantage of Very Large Memory (VLM) technology for
Digital AlphaServers. Support for Windows NT in 3Q96.
- Integrates smoothly with World Wide Web (WWW) and ANSI Z39.50
servers: Anyone can search an Isearch textbase using their favorite web
browser. When used with CNIDR's Isite package, Isearch can be used
through a Z39.50 session to interoperate with library automation
software. Isearch and Isite together form a three-tier client-server
architecture to allow essentially unlimited capacity growth.
- Easy to customize: The modular, object-oriented structure of
Isearch means that new features can be added independently of the
Isearch core. Third-party extension is facilitated by
well-defined Application Programming Interfaces (APIs) implemented in
C++.
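
To illustrate what ranking by word-frequency statistics can look like, here is a minimal sketch. It assumes a generic TF-IDF weighting, which is a common textbook scheme and not necessarily the formula Isearch uses; the function `rank` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(docs, query):
    """Return matching document ids, highest summed TF-IDF score first."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in tokenize(query):
        # Document frequency: how many documents contain the term
        df = sum(1 for c in counts.values() if term in c)
        if df == 0:
            continue
        # Rarer terms get a larger weight (assumed smoothing: log(1 + N/df))
        idf = math.log(1 + n_docs / df)
        for doc_id, c in counts.items():
            if term in c:
                tf = c[term] / sum(c.values())  # relative frequency in the document
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Sort by descending score, breaking ties by document id
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample collection, for illustration only
docs = {
    "a": "needle needle needle guard",
    "b": "syringe with needle cap",
    "c": "serum assay method",
}
print(rank(docs, "needle syringe"))  # ['a', 'b']
```

Document "a" ranks first because the query word "needle" makes up most of its text, while "c" contains neither query word and is not returned at all.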