(c) By Nancy Mulvany (June 4, 1996)
Steve G. Steinberg, "Seek and Ye Shall Find (Maybe)," Wired May 1996 (4.05).
Steinberg's article provides an excellent overview of the state of information classification and retrieval on the Web. The article begins with a discussion of classification and provides detailed discussion of Yahoo!, Inktomi, Architext's Excite, and Oracle's ConText. In my opinion, the most interesting part of the article is at the end, in the section subtitled "What's the purpose?"
As an index writer I had to get over the fact that this article is not about indexing as I know and practice it. I'll admit that it was a little difficult for me. The cover of Wired promotes the article as "the quest for the ultimate index" and running heads throughout the magazine identify the various sections of the article as "Indexing the Web."
The article begins by discussing John Wilkins' (1668 A.D.) classification scheme. Attempts to classify knowledge go back much further than the 17th century. In Western thought we could start with Aristotle (384-322 B.C.) and move forward from there. The reason for bringing this up is that the desire to classify knowledge has been a very strong tradition in the West and continues to be; that in itself is curious and interesting.
In the second paragraph we find that "the dream of organizing all knowledge has been thoroughly discredited." The very next paragraph begins with:
But recently there have been hints of entirely new ways to classify knowledge, new systems for sorting and storing information that avoid the pitfalls of the past and can work on unimaginable large corpuses. The long-moribund fields of knowledge organization and information retrieval are, once again, showing signs of life. The reason, of course, is the Web.
My, my. I think there are some ASIS members who might not agree with the description of their field of expertise as "long-moribund." The failure of the computer-science geeks to familiarize themselves with the work of the information science nerds for the past twenty years has wasted a lot of time and resources. I have to disagree that "there have been hints of entirely new ways to classify knowledge."
On page 113 the problem with Yahoo! is summed up as "a category scheme where users have a hard time guessing where they'll find what they're looking for." Yes, that is the problem with "top-down" classification systems. Unless users intimately understand the classification scheme, they will have trouble locating specific information. More user-friendly classification systems are actually thesauri that provide structure and pointers for users so that they can learn the scheme. The problem with point-of-view in classification is an old one.
What's needed, I decided, is an index of the Web. A concordance that keeps track of every word on every Web site. (Steinberg, p. 113)
I had not planned to quote myself, but my comments are the most succinct:
An index is not a concordance, a list of all the words that appear in a document. A concordance lacks analysis and synthesis. It is simply a list of words. A concordance, even in alphabetic order, is not a "systematic guide to the items contained in or concepts derived from a collection." (Mulvany, Indexing Books: 1994, p. 4)
Concordances and inverted word lists (aka "indexes") have been around for a long time. The contribution made by Inktomi is processing method and scalability. The well-documented problems with KWIC, KWOC, KWAC, inverted word lists and concordances remain.
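For readers who have not seen one, an inverted word list can be sketched in a few lines. The following is a minimal illustration only, not a description of how Inktomi works; the function name and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_word_list(documents):
    """Map every word to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Every token is recorded as-is: no analysis, no synthesis.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


# Hypothetical sample pages, invented for illustration.
sample_pages = {
    "page1": "Indexes provide context for readers",
    "page2": "A concordance lists every word",
    "page3": "Every word is listed without context",
}

word_list = build_inverted_word_list(sample_pages)
for word, pages in word_list.items():
    print(word, pages)
```

Note that "lists" and "listed" end up as separate, unrelated entries, and nothing gathers the three pages under a shared concept. That is the gap between a word list and an authored index.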
The following excerpt is the one that made me go ballistic!
But after using Inktomi more, I started to wonder if an index really satisfied my desire for organizing knowledge. I could usually find what I was looking for, but I felt as if I was poking around in the dark. I remembered something Jerry Yang had told me at Yahoo!: "The difference between a catalog and an index is that a catalog provides context." That made sense now.
.... Indexes not only don't provide context for the document, they don't provide context for the keywords.
.... While indexes solve the problems of subjectivity and scale that plague classification schemes, they don't impose enough order. (Steinberg, p. 174)
If "concordance" or "inverted word list" were substituted for "index" in the paragraphs above, I would have no problem with the statements. Steinberg is truly not at fault here. Those of us who write indexes lost our grip on the word "index" a long time ago. The NISO committee that is revising the ANSI standard on indexes has put the nail in the coffin by failing to distinguish between authored indexes and machine-generated lists; that is the primary reason why I resigned from the committee.
Excite is a little more difficult to comment on because Steinberg was not provided with an in-depth discussion of the techniques used. Apparently it works with inverted word lists derived from documents and performs statistical analysis. This is not a new approach. In 1988 Gerard Salton outlined many statistical techniques in his book, Automatic Text Processing. The Journal of the American Society for Information Science (ASIS) has for years published many articles about statistical analysis of text.
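To give a flavor of what statistical analysis of a word list can look like, here is a minimal sketch of one widely known term-weighting scheme, term frequency multiplied by inverse document frequency. It is offered only as a generic illustration: the article does not say which techniques Excite uses, and the function names and sample documents below are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf_weights(documents):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(documents)
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


# Hypothetical sample documents, invented for illustration.
sample_docs = {
    "d1": "web index web search",
    "d2": "book index",
    "d3": "web catalog",
}

weights = tf_idf_weights(sample_docs)
for doc_id, terms in weights.items():
    print(doc_id, {term: round(weight, 3) for term, weight in terms.items()})
```

A term that appears in few documents ("search", "book") receives a higher weight than one that appears in many ("web", "index"). The arithmetic ranks words; it does not analyze what a document is about.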
Of all the systems discussed, ConText is the most ambitious and, in my opinion, the most interesting. Is it a new approach? It's hard to say. From what Steinberg writes and from other things I have heard and read, it seems to combine various automatic text processing methods. Steinberg is impressed with ConText's ability to classify documents. The examples he provides do indeed look good. But we come full circle back to Yahoo! with this. We end up with classification. Granted, the ConText classification scheme is much larger than that of Yahoo!. Theoretically, the ConText ontology can be more precise. However, in practice, because of its size, it will be far more difficult for users to figure it out.
The following quote from page 182 is most disturbing:
As automated indexing becomes available, we will begin to depend on it. It will encourage people to write plainly, without metaphors or double entendres that might confuse a search engine. After all, everyone wants people to be able to find what they have written.
I dread the day when, as a writer, my primary audience is a search engine! Steinberg is probably right about this: people will start writing with search engine retrieval in mind. It will not be unlike the proliferation of words in scientific article titles, where the greater the number of words, the greater the article's retrievability via a KWIC list.
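The effect of title length on a KWIC (keyword in context) list is easy to demonstrate. The sketch below is a simplified illustration with a hypothetical stop list and made-up titles; real KWIC layouts vary.

```python
# Hypothetical stop list: words too common to serve as keywords.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic_entries(titles):
    """Return (keyword, rotated title) pairs, one per significant title word."""
    entries = []
    for title in titles:
        words = title.split()
        for position, word in enumerate(words):
            keyword = word.lower().strip(".,:;!?")
            if not keyword or keyword in STOP_WORDS:
                continue
            before = " ".join(words[:position])
            after = " ".join(words[position:])
            # Rotate the title so the keyword comes first.
            rotated = f"{after} / {before}" if before else after
            entries.append((keyword, rotated))
    return sorted(entries)


# Made-up titles, invented for illustration.
short_title = "Protein Folding"
long_title = "Kinetics and Thermodynamics of Protein Folding in Yeast"

print(len(kwic_entries([short_title])), "entries for the short title")
print(len(kwic_entries([long_title])), "entries for the long title")
for keyword, line in kwic_entries([short_title, long_title]):
    print(f"{keyword:15} {line}")
```

Each additional significant word in a title buys another entry point in the list, which is exactly the incentive to pad titles.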
I have written indexes for books. I have tested automatic indexing programs that claim to do concept analysis. I have compared the results of machine analysis with the indexing performed by a human. In regard to index quality and usability, the machine-generated "indexes" are extremely poor.
One thing is clear: automatic text processing methods are not going to go away. We need to address their gross deficiencies. We need to be realistic about what automatic techniques can and cannot do. Lastly, we need to come to terms with what we will give up if we truly come to depend on them.