The CNIDR ISEARCH Text Searching System

Erik Scott, Scott Technologies, Inc.

Archie Warnock, A/WWW Enterprises

Features of the 1.13 Release

Isearch is a software system for searching through large amounts of text. The system lets a user very quickly find out which documents contain certain words. Unlike older search systems, Isearch does not rely on a list of keywords or an abstract; every word of every document can be checked. This greatly improves the chances of discovering new information in old collections.

As a real-world example, CNIDR uses Isearch to index and search a collection of over 2000 AIDS-related patents issued by the U.S. Patent and Trademark Office. This collection of XXX megabytes of raw text can be searched in less than 1 second. A researcher looking for patents containing either the word "needle" or the word "syringe" can submit the query and get results back about as fast as the researcher's desktop machine can display them.

ISEARCH Features:

Searches large collections using a Free-Text search: no reliance on keywords, abstracts, or human-generated indexes.
Handles very large collections: collections of over 1 gigabyte (about 1,000 megabytes) can be handled on modest servers. Essentially unlimited textbases can be searched with careful layout and planning.
Very sophisticated result sorting: The documents most likely to be useful are returned first. Ranking is based on statistical analysis of word frequencies and is generalized for a wide variety of subjects and user skill levels.
Fast: documents are machine-indexed before searching, so non-matching documents needn't be read in. Fast enough to make optical media a reasonable solution, and extremely responsive with cheap SCSI disks.
Works well with OCR document storage and retrieval systems: no need for people to classify documents, and the statistical ranking method is forgiving of OCR errors. Potentially millions of pages can be made searchable for little more than photocopy costs.
Handles a wide range of document types: can handle text in formats from raw ASCII dumps to richly formatted SGML. Convenient doctype interface allows handling of entirely new and unusual formats in a matter of hours. Good supply of free and commercial doctypes available from third parties.
Efficient use of disk resources: Indexes are relatively compact, generally smaller than the original collection, and yet contain references to every word in the textbase.
Text maintenance commands: old documents can be deleted instantly and new data can be added without having to re-index the entire collection.
Portable and Scalable: works well on Unix machines from Linux PCs to Crays. Takes advantage of Very Large Memory (VLM) technology for Digital AlphaServers. Support for Windows NT in 3Q96.
Integrates smoothly with World Wide Web (WWW) and ANSI Z39.50 servers: Anyone can search an Isearch textbase using their favorite web browser. When used with CNIDR's Isite package, Isearch can be used through a Z39.50 session to interoperate with library automation software. Isearch and Isite together form a three-tier client-server architecture to allow essentially unlimited capacity growth.
Easy to customize: The modular, object-oriented structure of Isearch means that new features can be added independently of the Isearch core. Third-party extension is facilitated by well-defined Application Programming Interfaces (APIs) implemented in C++.
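
To make a few of the features above concrete (free-text indexing of every word, frequency-based ranking, an "either word" query such as "needle" or "syringe", and adding or deleting documents without re-indexing), here is a minimal sketch in Python. It is not Isearch code and does not use the Isearch API: the class TinyIndex, its method names, and the TF-IDF-style scoring formula are all assumptions chosen for illustration, and Isearch's actual ranking method and index layout may differ.

```python
import math
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy full-text index (hypothetical, not the Isearch implementation)."""

    def __init__(self):
        # word -> {doc_id: number of occurrences in that document}
        self.postings = defaultdict(dict)
        self.known = set()  # every doc_id ever added
        self.live = set()   # doc_ids that have not been deleted

    @staticmethod
    def tokenize(text):
        # Every word of the text is indexed; no keyword list or abstract.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Index one new document without touching existing entries."""
        if doc_id in self.known:
            raise ValueError(f"doc_id already used: {doc_id!r}")
        self.known.add(doc_id)
        self.live.add(doc_id)
        for word, count in Counter(self.tokenize(text)).items():
            self.postings[word][doc_id] = count

    def delete(self, doc_id):
        """Delete by marking: postings stay in place but are ignored."""
        self.live.discard(doc_id)

    def search_any(self, *words):
        """OR query: rank live documents containing any of the words."""
        scores = defaultdict(float)
        for word in words:
            hits = {d: c for d, c in self.postings.get(word.lower(), {}).items()
                    if d in self.live}
            if not hits:
                continue
            # Assumed weighting: words found in fewer documents count more.
            idf = math.log(1 + len(self.live) / len(hits))
            for doc_id, count in hits.items():
                scores[doc_id] += count * idf
        # Highest score first; ties broken by doc_id for stable output.
        return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    idx = TinyIndex()
    idx.add("p1", "A hypodermic needle with a needle guard")
    idx.add("p2", "A disposable syringe")
    idx.add("p3", "An antiviral compound")
    print(idx.search_any("needle", "syringe"))  # ['p1', 'p2']
```

Only the postings for the query words are consulted, so documents that match neither word are never read. Deletion in this sketch just drops the identifier from the live set, which is one simple way to make a delete take effect immediately; reclaiming the space would need a separate clean-up pass.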