In the beginning, there was "grep".
Grep was good, but it lacked subtlety. It lacked speed. And while grep was cheap and hence widely used, it wasn't a text searching system.
It wasn't even a text searching program.
It was a regular expression matching program. It didn't sort or rank or try to get along well with others.
So the computer scientists of the world created Information Retrieval. It wasn't a product, it was a subject. They wrote many papers.
From the Information Retrieval people (who now called themselves "IR" so their colleagues would think they were developing remote controls for TVs) came text search systems. Systems such as "SMART". Systems such as "WAIS".
WAIS was written by a guy who worked for a supercomputer company. He wanted to create an application for those big, hulking machines. He gave away a watered-down version of his program so people would see how nice a full-strength version of it would be on big, hulking machines. The supercomputer company that he worked for was going bankrupt, so he took his WAIS and he started his own company to sell WAIS.
The National Science Foundation saw hope in the free version of WAIS, so they worked with MCNC (it doesn't stand for anything, they're just MCNC) to create CNIDR (and it does stand for something: Truth, Justice, and the... no, actually, it used to stand for the Clearinghouse for Networked Information Discovery and Retrieval, but lately they've changed the "Clearinghouse" to "Center" since that's easier to type with one hand). CNIDR begat freeWAIS, which was the old, free WAIS (hence the name) with a series of enhancements.
But freeWAIS had problems. At its heart was the crippled stump of a search system. Small desktop machines had caught up with the old, hulking machines, but freeWAIS still only handled small collections. CNIDR said, "This sucks." I know, I heard them. So they created a new text search system from the ground up. And they called it Isite. Isearch is the part of Isite that actually does the searching.
And it was good.
1. A Quick Example of Isearch Use
2. Indexing Collections
    2.1 Arranging the Text
    2.2 Running the Indexer
    2.3 Deciding on Doctypes
3. Searching
    3.1 Simple Searches
    3.2 Searching in Subfields
    3.3 Boolean Searches
    3.4 Notes on Ranking
    3.5 Wildcards
    3.6 Prefixes and Suffixes
4. Maintaining Your Data
    4.1 Removing old Files
    4.2 Adding new Files
    4.3 Moving Files Around
    4.4 Viewing Information
5. Performance Issues
    5.1 Location of Files for Speed
    5.2 How Much RAM is Enough?
    5.3 CPU Load vs. User Skill
1. A Quick Example of Isearch Use.
This section assumes that Isearch has already been installed, and that it lives in the directory /local/project/Isearch-1.09.09. Naturally, that name will change depending on the version number and the preferences of your site, so remember to substitute your own directory name. If you haven't installed Isearch yet, see the file "QuickStart" in the Isearch documentation directory.
In this example, we'll index two text files from the Isearch distribution, "COPYRIGHT" and "README". We only picked these files because we know everyone has them. Create a directory for the textbase:
sti-gw% mkdir /local/text/example1
sti-gw% cp COPYRIGHT README /local/text/example1
Now create a directory to hold the indexes:
sti-gw% mkdir /local/indexes/example1
Now we can index the text:
sti-gw% Iindex -d /local/indexes/example1/EX1 /local/text/example1/*
Iindex 1.09.09
Building document list ...
Building database /local/indexes/example1/EX1:
Parsing files ...
Parsing /local/text/example1/COPYRIGHT ...
Parsing /local/text/example1/README ...
Indexing 1004 words ...
Merging index ...
Database files saved to disk.
sti-gw% ls -l /local/indexes/example1/
total 16
-rw-r--r-- 1 escott staff 85 Apr 11 12:24 EX1.dbi
-rw-r--r-- 1 escott staff 4016 Apr 11 12:24 EX1.inx
-rw-r--r-- 1 escott staff 24 Apr 11 12:24 EX1.mdg
-rw-r--r-- 1 escott staff 40 Apr 11 12:24 EX1.mdk
-rw-r--r-- 1 escott staff 520 Apr 11 12:24 EX1.mdt
That indexed the text in /local/text/example1 and created indexes in /local/indexes/example1. Now we can use those indexes to search for text:
sti-gw% Isearch -d /local/indexes/example1/EX1 fee fee
Isearch 1.09.09
Searching database /local/indexes/example1/EX1:
1 document(s) matched your query, 1 document(s) displayed.
Score File
1. 100 /local/text/example1/COPYRIGHT
Select file #:
The word "fee" occurs only in the file COPYRIGHT. Note
that because of a bug in version 1.09.09 (actually, it's believed
to be a bug in gcc that Isearch exposes) the search term has to
be entered twice. This is corrected in later versions.
2. Indexing Collections.
This section will discuss the issues surrounding indexing of text
with Isearch, including how to arrange the text, decide on a doctype
to use, and actually run the indexer.
2.1 Arranging the Text.
Arranging the text to be indexed is fairly straightforward. Good practice dictates that the textbase be given its own directory hierarchy to make maintenance simple. It's also a good idea to not mix up the text files with the indexes. Maintenance is simpler when they are separate, and there is also some performance gain to be realized when the indexes and text files are on separate disks.
The Iindex command will need to be given a list of files to index and directories to traverse. If you give Iindex the "-r" option then it will recursively plunge headlong into subdirectories indexing everything it can find. For simple collections this is probably a good thing, but for more subtle applications you'll probably want to do the filespace traversal yourself. This is especially true if you are using some kind of version control system like SCCS or RCS.
Unix filesystems impose virtually no penalty for using subdirectories, so feel free to impose quite a bit of organization on your tree of text.
Note that if you have a huge number of small files, you'll want to either use a lot of subdirectories or else combine the small files into larger ones. Consider the case of 100,000 files of one line each (an extreme example, but we see it all the time). If you put 100,000 files into one Unix directory, file access will be incredibly slow. Consider one of two solutions:
1) Create 100 subdirectories and put 1000 files into each subdirectory.
-or-
2) Concatenate all the files into one 100,000 line file. Index this file using the "ONELINE" doctype (discussed in Section 2.3).
The second solution is strongly preferred.
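Solution 2 is easy to script. Here is a minimal Python sketch of the idea; it is not part of the Isearch distribution, and the function name and paths are made up for illustration. It assumes every small file holds one short record, and it writes one record per line so the combined file has one logical document per line. Write the combined file somewhere outside the source directory, or it will try to swallow itself.

```python
import os

def concatenate_one_liners(source_dir, combined_path):
    """Write the contents of every regular file in source_dir to
    combined_path, one source file per output line.
    Returns the number of lines written."""
    count = 0
    with open(combined_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(source_dir)):
            path = os.path.join(source_dir, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories and other non-files
            with open(path, encoding="utf-8", errors="replace") as src:
                # Collapse internal line breaks so each source file
                # becomes exactly one line in the combined file.
                record = " ".join(src.read().split())
            if record:
                out.write(record + "\n")
                count += 1
    return count
```

The resulting file is what you would then hand to Iindex with the "ONELINE" doctype.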
Do not put text or index files into a network-mounted filesystem.
These files will be hit hard and hit often, and can bring a network
to its knees. Don't let vendor claims of caching performance
fool you either: Isearch and AFS (or DFS) do not get along well
(the files are accessed in ways generally guaranteed to *not*
be cached). So far, experiments with using CacheFS to back up
accesses to NFS 3.0 mounted partitions haven't been very promising
either. Cheap SCSI disks are below 30 cents per megabyte now.
Consider dedicating two spindles, preferably on separate SCSI
controllers, to Isearch alone and you'll see a lot better performance.
2.2 Running the Indexer.
Iindex takes several command line options. Here is a list of the flags along with a brief discussion of each:
Iindex -d INDEXNAME -a newFile1 newFile2 newFile3
instead of adding the files one command at a time.
Hi, I'm the first document.
###
Greetings to all from the second document.
If you index with:
Iindex -d mySeparatorExample -s "###" exampleFile
then you'll have two logical documents. A search for "greetings" will match the second paragraph, but not the first.
Use of the -s option can give you, in essence, a quick way to fake having new doctypes.
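To make the idea of a separator concrete, here is a small Python sketch of splitting one file into logical documents. It only illustrates the concept: exactly how Iindex treats the separator text (whether it must sit on its own line, or which document it belongs to) is not covered here, so the behavior below is an assumption, and both function names are hypothetical.

```python
def split_documents(text, separator):
    """Split text into logical documents at each occurrence of separator,
    dropping pieces that are empty after trimming whitespace."""
    return [part.strip() for part in text.split(separator) if part.strip()]

def matching_documents(documents, word):
    """Return the 1-based positions of the documents that contain word
    (case-insensitive, compared against whole words)."""
    word = word.lower()
    hits = []
    for position, doc in enumerate(documents, start=1):
        words = [w.strip(".,;:!?'\"") for w in doc.lower().split()]
        if word in words:
            hits.append(position)
    return hits

example = (
    "Hi, I'm the first document.\n"
    "###\n"
    "Greetings to all from the second document.\n"
)
docs = split_documents(example, "###")
print(len(docs))                              # 2
print(matching_documents(docs, "greetings"))  # [2]
```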
If you don't specify a doctype, then the "SIMPLE" doctype is used by default. SIMPLE doesn't really do much; it assumes that there is one document per file, no fields, and that presentation is handled by just dumping the contents of the file.
For more information, see the files "dtconf.inf", "BSn.doc", and "STI.doc" in the doctype directory.
ls /local/text/example2 > myListOfFiles
Iindex -d exampleIndex -f myListOfFiles
causes Iindex to read the file "myListOfFiles" and then index every file it sees in there. This is very useful for huge lists of files. By default, older Unix systems can't pass command lines more than 10240 characters long. If you have several thousand files to index, the command line could quickly become too long.
(Note to the confused: No, you would never type a line that long. But consider how big a command line can get through filename expansion with wildcards. Try this:
echo /usr/man/*/* | wc
The wordcount utility reports that "echo /usr/man/*/*" expanded into a line 122634 characters long.
(Subnote to the clueful: So, if that command expanded to 122K, how could it be passed to "echo" so I could wordcount it? Simple. I used Solaris 2.5, which allows 1 megabyte command lines. Nonetheless, there are a lot of Ultrix boxes still in use so it's worth noting.)).
Iindex -d watchThisItsCool `find /local/text/deeptree -type f \! -name "s.*"`
will cause Iindex to index all files below "/local/text/deeptree" except for those that begin with "s." (SCCS files).
If you're using the "sccs" convenience shell, then you'll want to ignore "SCCS" directories.
Find is such a powerful command that you should spend a while getting to know it better if you haven't already. It is especially useful for maintaining collections of text by doing things like weeding out old versions of files automatically.
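If you would rather do the filespace traversal in a script, the same filtering can be combined with the -f option. The following Python sketch (the function name is hypothetical) walks a tree, skips "SCCS" directories and "s." files, and writes one path per line, which is the same shape of list the "ls" example above produces. That Iindex wants exactly one path per line is an assumption drawn from that example. Keep the list file outside the tree being walked.

```python
import os

def write_file_list(root, list_path):
    """Walk root and write one file path per line to list_path, skipping
    SCCS directories and SCCS history files (names starting with "s.").
    Returns the number of paths written."""
    count = 0
    with open(list_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            # Pruning dirnames in place stops os.walk from descending
            # into SCCS directories at all.
            dirnames[:] = sorted(d for d in dirnames if d != "SCCS")
            for name in sorted(filenames):
                if name.startswith("s."):
                    continue
                out.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count
```

You could then run something like "Iindex -d exampleIndex -f myListOfFiles" against the file it writes.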
Once you've decided what options to use, let Iindex bang away
creating indexes. Don't be surprised if this takes a long time,
especially if you have more text than will fit into RAM.
2.3 Deciding on Doctypes.
Doctypes are modules that explain how to find logical documents and the fields within those documents. They also contain the code to display the documents that are found.
The default doctype is "SIMPLE". It tells Iindex that there is one logical document per file, that there are no fields to index separately, and that when displaying these documents they will just be dumped out without any formatting. SIMPLE really is the simplest doctype you can have.
The other CNIDR-supplied doctype is "SGMLTAG". This doctype will make one logical document per file, will index text occurring inside of SGML-like markup pairs as fields, and will present the text by dumping it out without a lot of special formatting. This doctype can be used to index HTML web pages for making searchable WWW sites.
There are many other doctypes available. See the files "BSn.doc"
and "STI.doc" in the doctype directory for more information.
New doctypes are being added all the time, so look for other
"*.doc" files in that directory.
3. Searching.
This section describes the "Isearch" command itself.
Isearch is used to search through the text collection after it
has been indexed with Iindex.
3.1 Simple Searches.
The simplest searches in Isearch look for either one word or any of a list of words, anywhere in any file in the collection. Consider searching for "poetry" or "rhyme" or "verse" in a collection of journal articles:
Isearch -d MLAabstracts poetry rhyme verse
This will display a list of matching items. The items will have their "headlines" displayed. To select one for viewing, enter the number of the item and press "Return". To exit, press "Return" by itself.
Note that adding more search terms will generally cause more items
to be returned, rather than fewer. This is counterintuitive,
but isn't the big problem that it seems to be. Isearch has a
very sophisticated ranking technique that will try to decide what
documents are most likely to be useful and will return them at
the top of the list. This ranking method actually works better
for longer queries than for short ones.
3.2 Searching in Subfields.
Consider the following HTML text:
<title> Cool Page </title> This is my excellent Web Page.
If you index this text with the SGMLTAG doctype, then you can search for words that occur anywhere in the document, or you can search for words that only occur in the title (inside the <title> and </title> markers).
For example:
sti-gw% Iindex -d why -t sgmltag /local/text/testfile
Iindex 1.13
Building document list ...
Building database why:
Parsing files ...
Parsing /local/text/testfile ...
Indexing 7 words ...
Merging index ...
Database files saved to disk.
sti-gw% Isearch -d why web
Isearch 1.13
Searching database why:
1 document(s) matched your query, 1 document(s) displayed.
Score File
1. 100 /local/text/testfile
Cool Page
Select file #:
sti-gw% Isearch -d why title/web
Isearch 1.13
Searching database why:
0 document(s) matched your query, 0 document(s) displayed.
In the first example, we searched for "web", and lo
and behold we found it. In the second example we looked for "title/web",
which means "find me occurrences of 'web' inside the 'title'
field". There aren't any of those, so Isearch says there
were no matching documents.
3.3 Boolean Searches.
This discussion of boolean searches assumes you have at least
version 1.13. If you don't, then the brief section on "rpn"
queries will apply to you but not the rest of this section. And
if you aren't using at least 1.13 then I strongly urge you
to upgrade because you're living with too many bugs.
3.3.1 "Infix" Notation Boolean Searches
Infix notation is the old familiar notation you remember from Cobol, Fortran, C, and TI calculators.
Reusing our example from section 3.2, we still have the Cool Page document. To search for a document that contained both "cool" and "page", we could use a query like:
sti-gw% Isearch -d why -infix cool and page
To search for a document with either cool or page (or even both) we can use:
sti-gw% Isearch -d why -infix cool or page
To search for documents that contain "cool" but not "page", we can say:
sti-gw% Isearch -d why -infix cool andnot page
Note that "andnot" is one word.
Note further that you can say "&&" instead of "and", "||" instead of "or", and "&!" instead of "andnot". This is handy for recovering C programmers.
You can group things with parentheses to make more complicated queries, but remember that most Unix shells place special meaning in parentheses. Protect them with quotation marks like this:
sti-gw% Isearch -d why -infix "(page or doc)" and "(cool or cold)"
3.3.2 "RPN" Notation Boolean Searches
RPN notation is the old familiar notation you remember from Forth, HP calculators, and your compiler writing class.
You probably shouldn't be using RPN unless you know what you're doing, so here's that last query from the previous section in rpn:
sti-gw% Isearch -d why -rpn page doc or cool cold or and
In a nutshell, you "push" search terms onto a stack
and then feed enough operators to that stack to pop them all off.
What could be more obvious?
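For those who skipped the compiler writing class, here is a minimal Python sketch of stack evaluation. It is not Isearch's code: the dictionary standing in for the index, the document numbers, and the function name are all made up, and accepting "andnot" in an rpn query is an assumption carried over from the infix operators.

```python
def evaluate_rpn(tokens, index):
    """Evaluate an RPN boolean query against index, a dict that maps
    each word to the set of document ids containing it."""
    operators = {
        "and": lambda left, right: left & right,
        "or": lambda left, right: left | right,
        "andnot": lambda left, right: left - right,
    }
    stack = []
    for token in tokens:
        op = operators.get(token.lower())
        if op is None:
            # A search term: push its (possibly empty) result set.
            stack.append(set(index.get(token.lower(), ())))
        else:
            if len(stack) < 2:
                raise ValueError("operator %r needs two operands" % token)
            right = stack.pop()
            left = stack.pop()
            stack.append(op(left, right))
    if len(stack) != 1:
        raise ValueError("query left %d result sets on the stack" % len(stack))
    return stack[0]

# Hypothetical index: word -> ids of the documents that contain it.
index = {"page": {1, 2}, "doc": {3}, "cool": {1, 3}, "cold": {4}}
result = evaluate_rpn("page doc or cool cold or and".split(), index)
print(sorted(result))  # [1, 3]
```

Each term pushes one result set and each operator pops two and pushes one, so a well-formed query ends with exactly one set left on the stack.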
3.4 Notes on Ranking.
In sections above, we have alluded to a sophisticated ranking algorithm. We use an algorithm developed by Gerard Salton for the SMART retrieval system. Without going into great detail, both the search terms and the resulting documents are scored. Documents that contain more instances of higher scoring words get ranked higher.
What makes a high-scoring word? Basically, the fewer documents the word occurs in, the more important it must be. This is often bogus, but more often it's useful.
A high-scoring document is one that contains a lot of instances of search terms.
To combine the scores, the document and the search term scores are multiplied for each matching search term. These products are totaled over the set of all the search terms. Those who are good at math will recognize this as a "dot product". Finally, the scores are normalized so the highest score is 1.0.
Generally speaking, using a search word that appears in nearly every document won't affect the ranking of results. It will, however, slow down the query tremendously so it still isn't a good idea.
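The description above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the formula Isearch uses: the logarithmic term weight is a common choice that we are assuming here, and the function name and sample documents are invented.

```python
import math
from collections import Counter

def rank(documents, query_terms):
    """Rank documents (a dict of name -> text) against query_terms.
    Returns (name, score) pairs, best first, scaled so the top score is 1.0."""
    counts = {name: Counter(text.lower().split())
              for name, text in documents.items()}
    terms = [t.lower() for t in query_terms]

    # Term score: the fewer documents a word occurs in, the higher it scores.
    weights = {}
    for term in terms:
        doc_freq = sum(1 for c in counts.values() if c[term] > 0)
        weights[term] = math.log(len(counts) / doc_freq) if doc_freq else 0.0

    # Document score: sum over terms of (term score * occurrences),
    # in other words a dot product.
    raw = {}
    for name, c in counts.items():
        if any(c[t] > 0 for t in terms):
            raw[name] = sum(weights[t] * c[t] for t in terms)

    best = max(raw.values(), default=0.0)
    ranked = [(name, score / best if best > 0 else 0.0)
              for name, score in raw.items()]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked

documents = {
    "a": "knee injury knee surgery",
    "b": "knee brace",
    "c": "elbow surgery",
}
ranked = rank(documents, ["knee", "surgery"])
for name, score in ranked:
    print("%s %.2f" % (name, score))
```

Note that in this sketch a word that occurs in every document gets a term score of zero, which matches the observation that such words don't change the ranking.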
While we're on the subject, it's possible to manually alter the search term weightings, albeit crudely. Consider this case:
sti-gw% Isearch -d why cool:5 page
This tells Isearch to look for "cool" or "page", but give "cool" five times the normal weight. The following example does the opposite:
sti-gw% Isearch -d why cool:-5 page
This tells Isearch to find documents that contain "page", but if they contain "cool" then subtract the value of that match five times. In other words, "I really want to see documents with 'page' in them, but if they also contain 'cool' then rank them really low". This is more useful than boolean queries with "andnot". Consider a set of documents on, say, databases. You might want a query like:
sti-gw% Isearch -d databases OODBMS andnot RDBMS
Problem is, if there is a good article on OODBMSes that just happens
to say "Unlike braindamaged RDBMSes, our system can solve
world hunger" then the above query will skip that document.
Using a negative term weight will push that document down on
the list, but at least it will still be there.
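Here is a rough Python sketch of why a negative weight demotes a document instead of dropping it. It is a toy scoring scheme (weight times number of occurrences), not the real Isearch ranking, and keeping every document that contains at least one query word is an assumption. The function names and sample documents are hypothetical.

```python
def parse_weighted_term(token):
    """Split "word:weight" into (word, weight). The weight defaults to 1."""
    word, sep, weight = token.rpartition(":")
    if sep and word:
        try:
            return word.lower(), int(weight)
        except ValueError:
            pass  # not a numeric weight, treat the whole token as a word
    return token.lower(), 1

def score_documents(documents, query_tokens):
    """Score each document as the sum of weight * occurrences, keeping
    any document that contains at least one of the query words."""
    terms = [parse_weighted_term(t) for t in query_tokens]
    scores = {}
    for name, text in documents.items():
        words = text.lower().split()
        counts = [(weight, words.count(word)) for word, weight in terms]
        if any(n for _, n in counts):
            scores[name] = sum(weight * n for weight, n in counts)
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

docs = {"plain": "a page about pages", "both": "a cool page"}
print(score_documents(docs, ["cool:-5", "page"]))
# [('plain', 1), ('both', -4)]
```

The document containing "cool" sinks to the bottom of the list, but it is still on the list.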
3.5 Wildcards
Isearch supports a limited wildcard facility. You can append an asterisk to the end of a search term to match any word that begins with a given prefix. For example:
sti-gw% Isearch -d diseases chol*
will find any document with a word that begins with "chol",
including "cholesterol", and yes, even, "cholesteral".
This is useful for finding misspelled words in addition to its
obvious uses.
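A trailing wildcard is cheap to support when the indexed words are kept in sorted order, because all the words sharing a prefix sit next to each other. The Python sketch below shows that idea with a plain sorted list. How Isearch actually stores and scans its index is not described here, so treat the data structure and the function name as assumptions.

```python
import bisect

def prefix_matches(sorted_words, pattern):
    """Return the words in sorted_words matched by pattern. A trailing "*"
    means "any word starting with this prefix"; otherwise the match is exact."""
    if not pattern.endswith("*"):
        i = bisect.bisect_left(sorted_words, pattern)
        found = i < len(sorted_words) and sorted_words[i] == pattern
        return [pattern] if found else []
    prefix = pattern[:-1]
    # Binary search for the first candidate, then scan while the prefix holds.
    start = bisect.bisect_left(sorted_words, prefix)
    end = start
    while end < len(sorted_words) and sorted_words[end].startswith(prefix):
        end += 1
    return sorted_words[start:end]

vocabulary = sorted(["cholera", "cholesteral", "cholesterol", "chord", "knee"])
print(prefix_matches(vocabulary, "chol*"))
# ['cholera', 'cholesteral', 'cholesterol']
```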
3.6 Prefixes and Suffixes
Isearch will allow you to slightly change the output when it prints a document. The following query shows this:
sti-gw% Isearch -d why -prefix "<large>" -suffix "</large>" cool
Isearch 1.13
Searching database why:
1 document(s) matched your query, 1 document(s) displayed.
Score File
1. 100 /local/text/testfile
Cool Page
Select file #: 1
<title> <large>Cool</large> Page </title> This is my excellent Web Page.
The -prefix option lets you specify a string to be printed immediately before a word that matches. The -suffix option is for a string to be printed after the word. Note the SGML markup on either side of "Cool" in the above example.
NOTE: If you're going to specify SGML or HTML on a command line,
protect it with quotation marks. Otherwise, the arrows will be
interpreted as file redirection operators with completely unexpected
results.
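The effect of -prefix and -suffix can be imitated with a short Python sketch. This is only an illustration of the wrapping behavior shown in the example output; the function name is hypothetical and the whole-word, case-insensitive matching rule is an assumption.

```python
import re

def highlight(text, term, prefix, suffix):
    """Wrap each whole-word, case-insensitive occurrence of term in text
    with prefix and suffix, keeping the original capitalization."""
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    # A function replacement avoids any backslash processing of the
    # prefix and suffix strings.
    return pattern.sub(lambda m: prefix + m.group(0) + suffix, text)

print(highlight("<title> Cool Page </title>", "cool", "<large>", "</large>"))
# <title> <large>Cool</large> Page </title>
```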
4. Maintaining Your Data.
After you index your collection and start searching in it, you'll
eventually have to update the data. This section describes how
to add, remove, and relocate your files. The command used for
this is "Iutil".
4.1 Removing old Files.
Removing an old file from consideration in searches takes two steps. The first step is to find the document "key" for the one to be removed:
sti-gw% Iutil -d huh -v
Iutil 1.13
DocType: [Key] (Start - End) File (* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
USMARC: [1839] (839 - 1577) /local/text/marc/demomarc
In this case, we used Iutil's -v option to display the table of documents. The key for each document appears inside the brackets. We'll delete the second document, key number 1839:
sti-gw% Iutil -d huh -del 1839
Iutil 1.13
Marking documents as deleted ...
1 document(s) marked as deleted.
To see the deletion:
sti-gw% Iutil -d huh -v
Iutil 1.13
DocType: [Key] (Start - End) File (* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
USMARC: [1839] (839 - 1577) /local/text/marc/demomarc *
The asterisk at the end of the last line means that entry has been marked as deleted.
After deleting one or more documents, it's a good idea to force the deletion with Iutil -c:
sti-gw% Iutil -d huh -c
Iutil 1.13
Cleaning up database (removing deleted documents) ...
1 document(s) were removed.
sti-gw% Iutil -d huh -v
Iutil 1.13
DocType: [Key] (Start - End) File (* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
You should NEVER remove an actual data file from the collection
until you run Iutil -c to commit the changes. Otherwise, Isearch
will display an error message when you try to search.
4.2 Adding new Files.
There are two ways to add a new file to an existing collection: either reindex the whole collection from scratch or use incremental indexing. If you have enough RAM to reindex the whole collection, you should probably do that. It's actually faster to start from scratch than it is to add a few files piecemeal. If you don't have that much RAM, then incremental indexing is your only choice.
Here's an example, adding the file /local/text/example1/COPYRIGHT:
sti-gw% Iindex -d huh -a /local/text/example1/COPYRIGHT
Iindex 1.13
Building document list ...
Adding to database huh:
Parsing files ...
Parsing /local/text/example1/COPYRIGHT ...
Indexing 5100 words ...
Merging index ...
Database files saved to disk.
As a general rule, you want to specify as many new files as possible in a single command. Don't add them one at a time, even if there are as few as two, because you'll be here for many minutes as it is.
And a note: You (generally) can't modify an existing data file
and expect to get correct results. You'll have to delete the
old one and index the new one.
4.3 Moving Files Around.
Sometimes you'll want to move files around in your directory structure, and the last thing you want to have to do is reindex all of them. Here's how to relocate the files from the above example:
sti-gw% Iutil -d huh -newpaths
Iutil 1.13
Scanning database for file paths ...
Enter new path or <Return> to leave unchanged:
Path=[/local/text/example1/]
> /export/home/escott
Done.
4.4 Viewing Information.
We've already seen "Iutil -v" to look at index files to see what they contain. Here are some other options to Iutil to see different things:
sti-gw% Iutil -d huh -v
Iutil 1.13
DocType: [Key] (Start - End) File (* indicates deleted record)
USMARC: [10] (0 - 838) /local/text/marc/demomarc
USMARC: [1839] (839 - 1577) /local/text/marc/demomarc
Pretty basic, right? The "-vi" option shows some summary information:
sti-gw% Iutil -d huh -vi
Iutil 1.13
Database name: /local/project/Isearch-1.13-nrn/bin/huh
Global document type: USMARC
Total number of documents: 2
Documents marked as deleted: 0
And the "-vf" option shows information about fields:
sti-gw% Iutil -d huh -vf
Iutil 1.13
The following fields are defined in this database:
001 008 010 LCCN 040 020 ISBN 043
[and so forth. The numbers above are also field names for the
USMARC record format.]
5. Performance Issues.
There are several performance issues with Isearch. They can be
roughly divided into "indexing concerns" and "search
concerns". Indexing concerns only pop up when files are indexed,
but searching issues will show up every day.
5.1 Location of Files for Speed.
Location of files in your system is important for good speed. Good speed is, in turn, essential for supporting a lot of simultaneous users.
In general, it's a good idea to locate the index and the data files on separate disk drives. Do not make the mistake of putting them in separate partitions on the same drive: this is worse than doing nothing. Generally, the disk usage pattern will alternate between hitting the index files and the data files in short, alternating bursts. Near the conclusion of a search the index files will be accessed sequentially, but until then the pattern appears essentially random to the operating system. Putting the two sets of files on separate spindles will help minimize the amount of head motion. This is most noticeable for searching, but affects indexing to some degree. The biggest gains will be for the largest collections. If you only have 100 megs of data or so, then you'll be hard pressed to measure the gain.
In a related note: consider RAID. In particular, consider using
RAID level 0 with a small granularity so you can get the most
I/O operations per second. Most RAID management software will
let you crank the configuration toward either "faster streaming
speed" or "more operations per second" and you
definitely want the latter. The only time Isearch will stream
a device is during the result set generation phase of a search
when the index file is essentially streamed (and, incidentally,
the textbase storage device is being hit in a purely random fashion).
These bursts don't last long, so most of the time the small granule
size is a winner. Since most Unixes will transfer one page of
data from a device at a minimum, use the "pagesize"
command and set your chunk size to that.
5.2 How Much RAM is Enough?
Answer: how much can you fit into the box?
Better answer: Even that may not be enough.
Indexing is pretty slow with Isearch. To speed it up, Iindex will try to use lots of RAM. The "-m" option is used to specify how much RAM to allocate for buffering data. Note that Iindex will use considerably more than what you specify. On my little Sparc5 with 48 megs of RAM and an additional 16 megs of swap space, I can usually only use "-m 6" even though I have considerably more virtual memory available:
elvis% swap -s
total: 41304k bytes allocated + 8720k reserved = 50024k used, 15620k available
elvis% Iindex -d huh -m 7 -a /local/text/example1/COPYRIGHT
Iindex 1.13
Building document list ...
Adding to database huh:
Parsing files ...
Virtual memory exceeded in `new'
Granted, I'm running CDE/Motif, netscape, and a few goodies, but nothing really serious. To index on this machine I usually shut down to single-user mode and run Iindex from the console. It's ugly, but it's also a lot faster.
If you need to index a lot of data, you need a lot of RAM to make it go quickly.
Searching is another issue entirely. More RAM helps, essentially without limit, but the reward isn't as great. The basic search algorithm runs in about 50 lines of code. The problem starts when that basic algorithm starts finding matches. When it does, those matches are stored in a result set. These result sets can get pretty big in a hurry. Imagine searching for "knee" in a textbase on knee injuries: the "knee" result set will get huge. Booleans won't help you, either. In fact, they make it worse. Imagine searching for "knee" and "dislocation". You'll still have to drag around the monster "knee" result set, and then you'll have to make a second one for "dislocation", and then finally you'll have to create a third one that combines the two. Isearch will delete the first two, but only after the third one has been finished, so for a while all three are in memory at once. It can add up in a hurry.
While the relational database people have spent years optimizing for this situation, Isearch isn't that clever yet. Moral of the story: make sure your users know to use "good quality" search terms. They'll get better results and your office won't look like SIMM City.