|
ISEARCH TUTORIAL

The Complete Guide to Using and Configuring the CNIDR Isearch System with Examples, Advice, and Editorial Comment

Erik Scott, Scott Technologies, Inc.

In the beginning, there was "grep". Grep was good, but it lacked subtlety. It lacked speed. And while grep was cheap and hence widely used, it wasn't a text searching system. It wasn't even a text searching program. It was a regular expression matching program. It didn't sort or rank or try to get along well with others.

So the computer scientists of the world created Information Retrieval. It wasn't a product, it was a subject. They wrote many papers. From the Information Retrieval people (who now called themselves "IR" so their colleagues would think they were developing remote controls for TVs) came text search systems. Systems such as "SMART". Systems such as "WAIS".

WAIS was written by a guy who worked for a supercomputer company. He wanted to create an application for those big, hulking machines. He gave away a watered-down version of his program so people would see how nice a full-strength version of it would be on big, hulking machines. The supercomputer company that he worked for was going bankrupt, so he took his WAIS and he started his own company to sell WAIS.

The National Science Foundation saw hope in the free version of WAIS, so they worked with MCNC (it doesn't stand for anything, they're just MCNC) to create CNIDR (and it does stand for something: Truth, Justice, and the... no, actually, it used to stand for the Clearinghouse for Networked Information Discovery and Retrieval, but lately they've changed the "Clearinghouse" to "Center" since that's easier to type with one hand).

CNIDR begat freeWAIS, which was the old, free WAIS (hence the name) with a series of enhancements. But freeWAIS had problems. At its heart was the crippled stump of a search system. Small desktop machines had caught up with the old, hulking machines, but freeWAIS still only handled small collections.
CNIDR said, "This sucks." I know, I heard them. So they created a new text search system from the ground up. And they called it Isite. Isearch is the part of Isite that actually does the searching. And it was good.
1. A Quick Example of Isearch Use.

This section assumes that Isearch has already been installed. We'll assume that Isearch has been installed in the directory /local/project/Isearch-1.09.09. Naturally, that name will change depending on the version number and the preferences of your site, so remember to substitute your own directory name. If you haven't installed Isearch yet, see the file "QuickStart" in the Isearch documentation directory.

In this example, we'll index two text files from the Isearch distribution, "COPYRIGHT" and "README". We only picked these files because we know everyone has them. Create a directory for the textbase:

    sti-gw% mkdir /local/text/example1
    sti-gw% cp COPYRIGHT README /local/text/example1

Now create a directory to hold the indexes:

    sti-gw% mkdir /local/indexes/example1

Now we can index the text:

    sti-gw% Iindex -d /local/indexes/example1/EX1 /local/text/example1/*
    Iindex 1.09.09
    Building document list ...
    Building database /local/indexes/example1/EX1:
    Parsing files ...
    Parsing /local/text/example1/COPYRIGHT ...
    Parsing /local/text/example1/README ...
    Indexing 1004 words ...
    Merging index ...
    Database files saved to disk.
    sti-gw% ls -l /local/indexes/example1/
    total 16
    -rw-r--r-- 1 escott staff   85 Apr 11 12:24 EX1.dbi
    -rw-r--r-- 1 escott staff 4016 Apr 11 12:24 EX1.inx
    -rw-r--r-- 1 escott staff   24 Apr 11 12:24 EX1.mdg
    -rw-r--r-- 1 escott staff   40 Apr 11 12:24 EX1.mdk
    -rw-r--r-- 1 escott staff  520 Apr 11 12:24 EX1.mdt

That indexed the text in /local/text/example1 and created indexes in /local/indexes/example1. Now we can use those indexes to search for text:

    sti-gw% Isearch -d /local/indexes/example1/EX1 fee fee
    Isearch 1.09.09
    Searching database /local/indexes/example1/EX1:
    1 document(s) matched your query, 1 document(s) displayed.
    Score File
    1. 100 /local/text/example1/COPYRIGHT
    Select file #:

The word "fee" occurs only in the file COPYRIGHT. Note
that because of a bug in version 1.09.09 (actually, it's believed
to be a bug in gcc that Isearch exposes) the search term has to
be entered twice. This is corrected in later versions.
This section will discuss the issues surrounding indexing of text
with Isearch, including how to arrange the text, decide on a doctype
to use, and actually run the indexer.
Arranging the text to be indexed is fairly straightforward. Good practice dictates that the textbase be given its own directory hierarchy to make maintenance simple. It's also a good idea to not mix up the text files with the indexes. Maintenance is simpler when they are separate, and there is also some performance gain to be realized when the indexes and text files are on separate disks.

The Iindex command will need to be given a list of files to index and directories to traverse. If you give Iindex the "-r" option then it will recursively plunge headlong into subdirectories indexing everything it can find. For simple collections this is probably a good thing, but for more subtle applications you'll probably want to do the filespace traversal yourself. This is especially true if you are using some kind of version control system like SCCS or RCS. Unix filesystems impose virtually no penalty for using subdirectories, so feel free to impose quite a bit of organization on your tree of text.

Note that if you have a huge number of small files, you'll want to either use a lot of subdirectories or else combine the small files into larger ones. Consider the case of 100,000 files of one line each (an extreme example, but we see it all the time). If you put 100,000 files into one Unix directory, file access will be incredibly slow. Consider one of two solutions:

    1) Create 100 subdirectories and put 1000 files into each subdirectory.
       -or-
    2) Concatenate all the files into one 100,000 line file. Index this file using the "ONELINE" doctype (discussed in Section 2.3).

The second solution is strongly preferred.
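The second solution can be sketched in a few lines of Python. This is only an illustration, not part of Isearch: the function name concatenate_one_line_files and the paths are hypothetical, and it assumes each small file holds a single short record. The combined file should live outside the source directory so it is not swept up in its own input.

```python
from pathlib import Path


def concatenate_one_line_files(source_dir, combined_path):
    """Write the content of every regular file in source_dir as one line
    of combined_path. Returns the number of lines written.

    Files are visited in sorted order so the result is reproducible.
    """
    count = 0
    with open(combined_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).iterdir()):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8").strip()
            if not text:
                continue  # skip empty files rather than emit blank records
            # Collapse stray internal newlines so each file stays one record.
            out.write(" ".join(text.splitlines()) + "\n")
            count += 1
    return count
```

The resulting file is what you would then hand to Iindex with the "ONELINE" doctype.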
Do not put text or index files into a network-mounted filesystem.
These files will be hit hard and hit often, and can bring a network
to its knees. Don't let vendor claims of caching performance
fool you either: Isearch and AFS (or DFS) do not get along well
(the files are accessed in ways generally guaranteed to *not*
be cached). So far, experiments with using CacheFS to back up
accesses to NFS 3.0 mounted partitions haven't been very promising
either. Cheap SCSI disks are below 30 cents per megabyte now.
Consider dedicating two spindles, preferably on separate SCSI
controllers, to Isearch alone and you'll see a lot better performance.
Iindex takes several command line options. Here is a list of the flags along with a brief discussion of each:
Once you've decided what options to use, let Iindex bang away
creating indexes. Don't be surprised if this takes a long time,
especially if you have more text than will fit into RAM.
A doctype is a module that explains how to find logical documents and fields within those documents. It also has code to display the documents that it finds. The default doctype is "SIMPLE". It tells Iindex that there is one logical document per file, that there are no fields to index separately, and that when displaying these documents they will just be dumped out without any formatting. SIMPLE really is the simplest doctype you can have.

The other CNIDR-supplied doctype is "SGMLTAG". This doctype will make one logical document per file, will index text occurring inside of SGML-like markup pairs as fields, and will present the text by dumping it out without a lot of special formatting. This doctype can be used to index HTML web pages for making searchable WWW sites.
There are many other doctypes available. See the files "BSn.doc"
and "STI.doc" in the doctype directory for more information.
New doctypes are being added all the time, so look for other
"*.doc" files in that directory.
This section describes the "Isearch" command itself.
Isearch is used to search through the text collection after it
has been indexed with Iindex.
The simplest searches in Isearch look for either one word or any of a list of words in any location within every file in the collection. Consider searching for "poetry" or "rhyme" or "verse" in a collection of journal articles:

    Isearch -d MLAabstracts poetry rhyme verse

This will display a list of matching items. The items will have their "headlines" displayed. To select one for viewing, enter the number of the item and press "Return". To exit, press "Return" by itself.
Note that adding more search terms will generally cause more items
to be returned, rather than fewer. This is counterintuitive,
but isn't the big problem that it seems to be. Isearch has a
very sophisticated ranking technique that will try to decide what
documents are most likely to be useful and will return them at
the top of the list. This ranking method actually works better
for longer queries than for short ones.
Consider the following HTML text:

    <title> Cool Page </title> This is my excellent Web Page.

If you index this text with the SGMLTAG doctype, then you can search for words that occur anywhere in the document, or you can search for words that only occur in the title (inside the <title> and </title> markers). For example:

    sti-gw% Iindex -d why -t sgmltag /local/text/testfile
    Iindex 1.13
    Building document list ...
    Building database why:
    Parsing files ...
    Parsing /local/text/testfile ...
    Indexing 7 words ...
    Merging index ...
    Database files saved to disk.

    sti-gw% Isearch -d why web
    Isearch 1.13
    Searching database why:
    1 document(s) matched your query, 1 document(s) displayed.
    Score File
    1. 100 /local/text/testfile
    Cool Page
    Select file #:

    sti-gw% Isearch -d why title/web
    Isearch 1.13
    Searching database why:
    0 document(s) matched your query, 0 document(s) displayed.

In the first example, we searched for "web", and lo
and behold we found it. In the second example we looked for "title/web",
which means "find me occurrences of 'web' inside the 'title'
field". There aren't any of those, so Isearch says there
were no matching documents.
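To make the difference between a plain search and a fielded search concrete, here is a small Python sketch. It is not how the SGMLTAG doctype is implemented; the function names index_fields and matches are hypothetical, and the sketch assumes simple, non-nested tag pairs like the one in the example.

```python
import re

# Matches a simple, non-nested <name> ... </name> pair.
TAG_PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL | re.IGNORECASE)
WORD = re.compile(r"\w+")


def index_fields(text):
    """Return (all_words, words_by_field) for one SGML-like document."""
    # Words anywhere in the document, with the markup itself stripped out.
    stripped = re.sub(r"<[^>]*>", " ", text)
    all_words = {w.lower() for w in WORD.findall(stripped)}
    # Words that occur inside each tag pair, keyed by tag name.
    fields = {}
    for name, body in TAG_PAIR.findall(text):
        fields.setdefault(name.lower(), set()).update(
            w.lower() for w in WORD.findall(body)
        )
    return all_words, fields


def matches(query, text):
    """True if query ('word' or 'field/word') matches the document text."""
    all_words, fields = index_fields(text)
    if "/" in query:
        field, word = query.split("/", 1)
        return word.lower() in fields.get(field.lower(), set())
    return query.lower() in all_words
```

With the Cool Page text, matches("web", ...) succeeds because "web" appears somewhere in the document, while matches("title/web", ...) fails because the title field only contains "cool" and "page".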
This discussion of boolean searches assumes you have at least
version 1.13. If you don't, then the brief section of "rpn"
queries will apply to you but not the rest of this section. And
if you aren't using at least 1.13 then I strongly urge you
to upgrade because you're living with too many bugs.
Infix notation is the old familiar notation you remember from Cobol, Fortran, C, and TI calculators. Reusing our example from section 3.2, we still have the Cool Page document. To search for a document that contains both "cool" and "page", we could use a query like:

    sti-gw% Isearch -d why -infix cool and page

To search for a document with either cool or page (or even both) we can use:

    sti-gw% Isearch -d why -infix cool or page

To search for documents that contain "cool" but not "page", we can say:

    sti-gw% Isearch -d why -infix cool andnot page

Note that "andnot" is one word. Note further that you can say "&&" instead of "and", "||" instead of "or", and "&!" instead of "andnot". This is handy for recovering C programmers.

You can group things with parentheses to make more complicated queries, but remember that most Unix shells place special meaning in parentheses. Protect them with quotation marks like this:

    sti-gw% Isearch -d why -infix "(page or doc)" and "(cool or cold)"

3.3.2 "RPN" Notation Boolean Searches

RPN notation is the old familiar notation you remember from Forth, HP calculators, and your compiler writing class. You probably shouldn't be using RPN unless you know what you're doing, so here's that last query from the previous section in rpn:

    sti-gw% Isearch -d why -rpn page doc or cool cold or and

In a nutshell, you "push" search terms onto a stack
and then feed enough operators to that stack to pop them all off.
What could be more obvious?
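For those who find it less than obvious, the stack discipline can be sketched in Python. This is an illustration of RPN evaluation in general, not Isearch's own code: evaluate_rpn and the postings dictionary (word to set of document numbers) are hypothetical, and accepting "andnot" as an RPN operator is an assumption carried over from the infix section.

```python
def evaluate_rpn(tokens, postings):
    """Evaluate an RPN boolean query against postings (word -> set of doc ids)."""
    operators = {
        "and": lambda left, right: left & right,
        "or": lambda left, right: left | right,
        "andnot": lambda left, right: left - right,  # assumed to exist in RPN too
    }
    stack = []
    for token in tokens:
        word = token.lower()
        if word in operators:
            if len(stack) < 2:
                raise ValueError("operator %r needs two operands" % token)
            right = stack.pop()
            left = stack.pop()
            stack.append(operators[word](left, right))
        else:
            # A search term: push the set of documents that contain it.
            stack.append(set(postings.get(word, ())))
    if len(stack) != 1:
        raise ValueError("query left %d items on the stack" % len(stack))
    return stack[0]
```

Feeding it the tokens of "page doc or cool cold or and" pushes "page" and "doc", replaces them with their union, does the same for "cool" and "cold", and finally intersects the two unions.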
In sections above, we have alluded to a sophisticated ranking algorithm. We use an algorithm developed by Gerald Salton for the SMART retrieval system. Without going into great detail, both the search terms and the resulting documents are scored. Documents that contain more instances of higher scoring words get ranked higher. What makes a high-scoring word? Basically, the fewer documents the word occurs in, the more important it must be. This is often bogus, but more often it's useful. A high-scoring document is one that contains a lot of instances of search terms. To combine the scores, the document and the search term scores are multiplied for each matching search term. These products are totaled over the set of all the search terms. Those who are good at math will recognize this as a "dot product". Finally, the scores are normalized so the highest score is 1.0.

Generally speaking, using a search word that appears in nearly every document won't affect the ranking of results. It will, however, slow down the query tremendously so it still isn't a good idea.

While we're on the subject, it's possible to manually alter the search term weightings, albeit crudely. Consider this case:

    sti-gw% Isearch -d why cool:5 page

This tells Isearch to look for "cool" or "page", but give "cool" five times the normal weight. The following example does the opposite:

    sti-gw% Isearch -d why cool:-5 page

This tells Isearch to find documents that contain "page", but if they contain "cool" then subtract the value of that match five times. In other words, "I really want to see documents with "page" in them, but if they also contain "cool" then rank them really low". This is more useful than boolean queries with "andnot". Consider a set of documents on, say, databases. You might want a query like:

    sti-gw% Isearch -d databases OODBMS andnot RDBMS

Problem is, if there is a good article on OODBMSes that just happens
to say "Unlike braindamaged RDBMSes, our system can solve
world hunger" then the above query will skip that document.
Using a negative term weight will push that document down on
the list, but at least it will still be there.
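The scoring described above can be sketched in Python. Treat this as a rough illustration under stated assumptions rather than the formula Isearch uses: the function rank is hypothetical, the term score log(N / df) is one common way to make rarer words count for more, the document score is simply the number of occurrences, and the user weight plays the role of the ":5" or ":-5" suffix.

```python
import math
from collections import Counter


def rank(query, documents):
    """Rank documents for a weighted query.

    query:     list of (term, user_weight) pairs
    documents: dict mapping a document name to its list of words
    Returns (name, score) pairs, best first, scaled so the top score is 1.0.
    """
    n_docs = len(documents)
    counts = {name: Counter(w.lower() for w in words)
              for name, words in documents.items()}
    scores = {}
    for name, term_counts in counts.items():
        total, matched = 0.0, False
        for term, user_weight in query:
            term = term.lower()
            if term_counts[term] == 0:
                continue  # this document does not contain the term
            matched = True
            # Assumed term score: the fewer documents contain it, the higher.
            doc_freq = sum(1 for c in counts.values() if term in c)
            term_score = math.log(n_docs / doc_freq)
            # One product of the "dot product": term score times occurrences.
            total += user_weight * term_score * term_counts[term]
        if matched:
            scores[name] = total
    top = max(scores.values(), default=0.0)
    if top > 0:
        scores = {name: s / top for name, s in scores.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note how a term that occurs in every document gets a term score of log(1) = 0 and so cannot change the ordering, and how a negative user weight pushes a document down the list without removing it.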
Isearch supports a limited wildcard facility. You can append an asterisk to the end of a search term to match any word that begins with a given prefix. For example:

    sti-gw% Isearch -d diseases chol*

will find any document with a word that begins with "chol",
including "cholesterol", and yes, even, "cholesteral".
This is useful for finding misspelled words in addition to its
obvious uses.
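Prefix matching of this kind is cheap when the indexed words are kept in sorted order, because every word sharing a prefix sits in one contiguous run. The Python sketch below shows the idea; expand_prefix is a hypothetical helper, and nothing here claims that Isearch stores its index this way.

```python
import bisect


def expand_prefix(pattern, sorted_words):
    """Return the words matched by pattern; a trailing '*' means 'any ending'.

    sorted_words must be a sorted list of distinct lowercase words.
    """
    if not pattern.endswith("*"):
        # Exact term: binary search for the word itself.
        i = bisect.bisect_left(sorted_words, pattern)
        if i < len(sorted_words) and sorted_words[i] == pattern:
            return [pattern]
        return []
    prefix = pattern[:-1]
    # Binary search to the first word >= prefix, then walk the run.
    start = bisect.bisect_left(sorted_words, prefix)
    end = start
    while end < len(sorted_words) and sorted_words[end].startswith(prefix):
        end += 1
    return sorted_words[start:end]
```

Given a vocabulary containing "cholera", "cholesteral", and "cholesterol", the pattern "chol*" expands to all three, misspelling included.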
Isearch will allow you to slightly change the output when it prints a document. The following query shows this:

    sti-gw% Isearch -d why -prefix "<large>" -suffix "</large>" cool
    Isearch 1.13
    Searching database why:
    1 document(s) matched your query, 1 document(s) displayed.
    Score File
    1. 100 /local/text/testfile
    Cool Page
    Select file #: 1
    <title> <large>Cool</large> Page </title> This is my excellent Web Page.

The -prefix option lets you specify a string to be printed immediately before a word that matches. The -suffix option is for a string to be printed after the word. Note the SGML markup on either side of "Cool" in the above example.
NOTE: If you're going to specify SGML or HTML on a command line,
protect it with quotation marks. Otherwise, the arrows will be
interpreted as file redirection operators with completely unexpected
results.
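The effect of -prefix and -suffix can be imitated in a couple of lines of Python, which may help when deciding what strings to pass. This is a sketch only: highlight is a hypothetical function, and it assumes whole-word, case-insensitive matching, which is consistent with the example output but not confirmed as Isearch's exact rule.

```python
import re


def highlight(text, term, prefix, suffix):
    """Wrap every whole-word, case-insensitive occurrence of term in text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    # Keep the original capitalization of each match; only add the wrappers.
    return pattern.sub(lambda m: prefix + m.group(0) + suffix, text)
```

Applied to the Cool Page text with "<large>" and "</large>", it reproduces the highlighted line shown above. Unlike a real doctype, this naive version would also wrap a search term that happens to match a tag name.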
After you index your collection and start searching in it, you'll
eventually have to update the data. This section describes how
to add, remove, and relocate your files. The command used for
this is "Iutil".
Removing an old file from consideration in searches requires two steps. The first step is to find the document "key" for the one to be removed:

    sti-gw% Iutil -d huh -v
    Iutil 1.13
    DocType: [Key] (Start - End) File (* indicates deleted record)
    USMARC: [10] (0 - 838) /local/text/marc/demomarc
    USMARC: [1839] (839 - 1577) /local/text/marc/demomarc

In this case, we used Iutil's -v option to display the table of documents. The key for each document appears inside the brackets. We'll delete the second document, key number 1839:

    sti-gw% Iutil -d huh -del 1839
    Iutil 1.13
    Marking documents as deleted ...
    1 document(s) marked as deleted.

To see the deletion:

    sti-gw% Iutil -d huh -v
    Iutil 1.13
    DocType: [Key] (Start - End) File (* indicates deleted record)
    USMARC: [10] (0 - 838) /local/text/marc/demomarc
    USMARC: [1839] (839 - 1577) /local/text/marc/demomarc *

The asterisk at the end of the last line means that entry has been marked as deleted. After deleting one or more documents, it's a good idea to force the deletion with Iutil -c:

    sti-gw% Iutil -d huh -c
    Iutil 1.13
    Cleaning up database (removing deleted documents) ...
    1 document(s) were removed.

    sti-gw% Iutil -d huh -v
    Iutil 1.13
    DocType: [Key] (Start - End) File (* indicates deleted record)
    USMARC: [10] (0 - 838) /local/text/marc/demomarc

You should NEVER remove an actual data file from the collection
until you run Iutil -c to commit the changes. Otherwise, Isearch
will display an error message when you try to search.
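The mark-then-clean sequence is a common pattern: deletion first sets a flag, and a later pass physically removes the flagged records. A toy Python model of it follows. The class DocumentTable and its methods are hypothetical and say nothing about Iutil's actual file formats; the sketch only shows why the record still exists between the two steps.

```python
class DocumentTable:
    """Toy document table with two-step deletion: mark first, clean up later."""

    def __init__(self):
        self._records = {}  # key -> {"path": str, "deleted": bool}

    def add(self, key, path):
        self._records[key] = {"path": path, "deleted": False}

    def mark_deleted(self, key):
        """Flag a record as deleted; returns the number of records marked."""
        record = self._records.get(key)
        if record is None or record["deleted"]:
            return 0
        record["deleted"] = True
        return 1

    def listing(self):
        """Rows of (key, path, deleted_flag), sorted by key."""
        return [(k, r["path"], r["deleted"])
                for k, r in sorted(self._records.items())]

    def cleanup(self):
        """Physically drop every marked record; returns how many were removed."""
        doomed = [k for k, r in self._records.items() if r["deleted"]]
        for key in doomed:
            del self._records[key]
        return len(doomed)
```

Until cleanup() runs, a marked record is still in the table and still points at its data file, which is the reason for the warning above about not removing data files too early.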
There are two ways to add a new file to an existing collection: either reindex the whole collection from scratch or use incremental indexing. If you have enough RAM to reindex the whole collection, you should probably do that. It's actually faster to start from scratch than it is to add a few files piecemeal. If you don't have that much RAM, then you have no choice: you must add the files incrementally. Here's an example, adding the file /local/text/example1/COPYRIGHT:

    sti-gw% Iindex -d huh -a /local/text/example1/COPYRIGHT
    Iindex 1.13
    Building document list ...
    Adding to database huh:
    Parsing files ...
    Parsing /local/text/example1/COPYRIGHT ...
    Indexing 5100 words ...
    Merging index ...
    Database files saved to disk.

As a general rule, you want to specify as many new files to add at once as possible. Don't add them one at a time, even if there are only two, because a single incremental run already takes many minutes.
And a note: You (generally) can't modify an existing data file
and expect to get correct results. You'll have to delete the
old one and index the new one.
Sometimes you'll want to move files around in your directory structure, and the last thing you want to have to do is reindex all of them. Here's how to relocate the files from the above example:

    sti-gw% Iutil -d huh -newpaths
    Iutil 1.13
    Scanning database for file paths ...
    Enter new path or <Return> to leave unchanged:
    Path=[/local/text/example1/]
    > /export/home/escott
    Done.

4.4 Viewing Information.

We've already seen "Iutil -v" to look at index files to see what they contain. Here are some other options to Iutil to see different things:

    sti-gw% Iutil -d huh -v
    Iutil 1.13
    DocType: [Key] (Start - End) File (* indicates deleted record)
    USMARC: [10] (0 - 838) /local/text/marc/demomarc
    USMARC: [1839] (839 - 1577) /local/text/marc/demomarc

Pretty basic, right? The "-vi" option shows some summary information:

    sti-gw% Iutil -d huh -vi
    Iutil 1.13
    Database name: /local/project/Isearch-1.13-nrn/bin/huh
    Global document type: USMARC
    Total number of documents: 2
    Documents marked as deleted: 0

And the "-vf" option shows information about fields:

    sti-gw% Iutil -d huh -vf
    Iutil 1.13
    The following fields are defined in this database:
    001 008 010 LCCN 040 020 ISBN 043

[and so forth. The numbers above are also field names for the
USMARC record format.]
There are several performance issues with Isearch. They can be
roughly divided into "indexing concerns" and "search
concerns". Indexing concerns only pop up when files are indexed,
but searching issues will show up every day.
Location of files in your system is important for good speed. Good speed is, in turn, essential for supporting a lot of simultaneous users. In general, it's a good idea to locate the index and the data files on separate disk drives. Do not make the mistake of putting them in separate partitions on the same drive: this is worse than doing nothing.

Generally, the disk usage pattern will alternate between hitting the index files and the data files in short bursts. Near the conclusion of a search the index files will be accessed sequentially, but until then the pattern appears essentially random to the operating system. Putting the two sets of files on separate spindles will help minimize the amount of head motion. This is most noticeable for searching, but affects indexing to some degree. The biggest gains will be for the largest collections. If you only have 100 megs of data or so, then you'll be hard pressed to measure the gain.
In a related note: consider RAID. In particular, consider using
RAID level 0 with a small granularity so you can get the most
I/O operations per second. Most RAID management software will
let you crank the configuration toward either "faster streaming
speed" or "more operations per second" and you
definitely want the latter. The only time Isearch will stream
a device is during the result set generation phase of a search
when the index file is essentially streamed (and, incidentally,
the textbase storage device is being hit in a purely random fashion).
These bursts don't last long, so most of the time the small granule
size is a winner. Since most Unixes will transfer one page of
data from a device at a minimum, use the "pagesize"
command and set your chunk size to that.
Answer: how much can you fit into the box? Better answer: Even that may not be enough. Indexing is pretty slow with Isearch. To speed it up, Iindex will try to use lots of RAM. The "-m" option is used to specify how much RAM to allocate for buffering data. Note that Iindex will use considerably more than what you specify. On my little Sparc5 with 48 megs of RAM and an additional 16 megs of swap space, I can usually only use "-m 6" even though I have considerably more virtual memory available:

    elvis% swap -s
    total: 41304k bytes allocated + 8720k reserved = 50024k used, 15620k available
    elvis% Iindex -d huh -m 7 -a /local/text/example1/COPYRIGHT
    Iindex 1.13
    Building document list ...
    Adding to database huh:
    Parsing files ...
    Virtual memory exceeded in `new'

Granted, I'm running CDE/Motif, netscape, and a few goodies, but nothing really serious. To index on this machine I usually shut down to single-user mode and run Iindex from the console. It's ugly, but it's also a lot faster. If you need to index a lot of data, you need a lot of RAM to make it go quickly.

Searching is another issue entirely. More RAM helps, essentially without limit, but the reward isn't as great. The basic search algorithm runs in about 50 lines of code. The problem starts when that basic algorithm starts finding matches. When it does, those matches are stored in a result set. These result sets can get pretty big in a hurry. Imagine searching for "knee" in a textbase on knee injuries: the "knee" result set will get huge. Booleans won't help you, either. In fact, they make it worse. Imagine searching for "knee" and "dislocation". You'll still have to drag around the monster "knee" result set, and then you'll have to make a second one for "dislocation", and then finally you'll have to create a third one that combines the two (for an "and" query, the documents they have in common). Isearch will delete the first two when the third one is created, but only after the third one has been finished. It can add up in a hurry.
While the relational database people have spent years optimizing for this situation, Isearch isn't that clever yet. Moral to the story: make sure your users know to use "good quality" search terms. They'll get better results and your office won't look like SIMM City. |