Based on a paper presented at the 34th Clinic on Library Applications of Data Processing (Cochrane & Johnson, 1998)
© 1998. Jessica L. Milstead. All Rights Reserved
A thesaurus is more than a simple equivalence list, the kind of "thesaurus" most often supported by text retrieval packages. While equivalence lists are vital to effective information retrieval, they are not enough. They can only suggest other ways of expressing an idea which is already in the user's mind; they do not remind the user of related ideas that might be valuable in searching.
A true thesaurus has equivalence relationships, but it also supports other kinds of relationships, such as genus-species, and provides navigation assistance by means of scope notes and other aids. In other words, a thesaurus is a tool designed to aid users in finding their way around a vocabulary database. In addition to its traditional use as an authority for the terms used in indexing the database, it offers reminders of terms the user might not even have considered.
The ANSI/NISO standard for thesauri (National Information Standards Organization, 1994) provides the best available information on what thesauri should do and how they should be built, but it predates the recent explosion of full text and powerful search engines, and it is not an adequate guide to future needs and potential.
The first thesauri were produced before electronic searching became widely available, but their full development coincided with the growth of online bibliographic databases. The earliest electronic files consisted only of titles, bibliographic descriptions, and indexing; if you were lucky there were abstracts, but this was not to be taken for granted when storage space was a very precious commodity and acquiring anything in electronic form generally meant rekeying it. In this environment, indexing had to be of high quality if information was to be retrieved at all. Hence the obvious need for thesauri.
Today it is beginning to seem as if all information is available in full text. However, this is not true, nor will it be true in the immediate future. Vast numbers of legacy documents remain, and converting these to searchable text is an expensive, long-term proposition. Furthermore, many documents are still being produced only in printed form. Therefore, thesauri and indexing will continue to have a place, at least for a while, in facilitating access to documents for which electronic text is not available. Their long-run value, however, depends on integration with full-text search.
We should distinguish kinds of tools for facilitating access to full text on the basis of the attention they give to semantics. Older, exact-match (Boolean) systems give no attention to semantics. The search terms must appear in the document for it to be retrieved; if a term appears at all the document will be retrieved regardless of whether the term is important to the meaning of the document or not. Another approach relies on statistical information -- co-occurrence of words in the document, frequency, etc. Boolean and statistically-based systems have been found to have comparable retrieval performance, but to produce very different retrieval sets. That is, searches of the same database using a Boolean engine and a statistically-based one often produce about the same number of relevant hits, but there may be little overlap between the two sets of hits.
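The difference between the two approaches can be made concrete with a toy example. The sketch below is only an illustration: the three documents are invented, the function names `boolean_and` and `frequency_ranked` are assumptions, and raw term frequency stands in for the much richer statistics a real engine would use.

```python
from collections import Counter

# Toy collection; the texts are invented for illustration only.
DOCS = {
    "d1": "thesaurus design for indexing and thesaurus maintenance",
    "d2": "search engines rank full text; a thesaurus is mentioned once",
    "d3": "statistical ranking of full text",
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(".,;:").lower() for w in text.split())
    return [w for w in words if w]


def boolean_and(query, docs):
    """Exact match: ids of documents that contain every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def frequency_ranked(query, docs):
    """Rank documents by total occurrences of the query terms (zero scores dropped)."""
    terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(boolean_and("thesaurus full text", DOCS))
print(frequency_ranked("thesaurus full text", DOCS))
```

On this toy collection the Boolean query returns only the one document that contains all three words, while the frequency ranking also returns documents that match only some of them, so the two result sets differ even though the query is the same.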
Intelligent retrieval systems integrate statistical and semantic information, as well as a full battery of linguistic techniques, to retrieve more useful results. Such a system may contain an extensive lexicon, not just of word meanings and equivalents, but of word types and relationships. Text is parsed, to a greater or lesser extent depending on the system, and there are often tools for disambiguation of terms. Phrases rather than just single words can also be handled. The most powerful systems actually can determine syntactic or structural meaning, permitting them to retrieve a concept expressed in words that are not actually in the lexicon.
Any of these types of system could produce better results by taking advantage of the presence of controlled-vocabulary indexing. In a Boolean system the chances of retrieving relevant documents that do not happen to contain the words of the search query are improved, though precision is not helped unless the search is specifically limited to controlled vocabulary terms. In either statistical or intelligent systems, the index terms could be weighted more heavily than the running text. Unfortunately, systems generally do not take advantage of indexing in this way.
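Weighting index terms more heavily than running text can be sketched in a few lines. This is a minimal illustration, not a description of any existing system: the function name `field_weighted_score`, the `index_weight` value of 3.0, and the two sample documents are all assumptions made for the example, and terms are treated as single words for simplicity.

```python
from collections import Counter

# Invented sample documents: running text as tokens, plus assigned index terms.
DOC_A = {"text": ["trucks", "are", "motor", "vehicles"], "index": ["trucks"]}
DOC_B = {"text": ["trucks", "and", "more", "trucks"], "index": []}


def field_weighted_score(query_terms, text_tokens, index_terms, index_weight=3.0):
    """Occurrences in running text, plus index_weight for each matching index term."""
    text_counts = Counter(text_tokens)
    assigned = set(index_terms)
    score = 0.0
    for term in query_terms:
        score += text_counts[term]       # one point per occurrence in the text
        if term in assigned:
            score += index_weight        # extra credit for an assigned index term
    return score


for name, doc in (("A", DOC_A), ("B", DOC_B)):
    print(name, field_weighted_score(["trucks"], doc["text"], doc["index"]))
```

With the index weight applied, document A outranks document B even though B repeats the word more often; setting `index_weight=0.0` reverses the order. The appropriate weight is a tuning decision, and the value used here is arbitrary.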
Searchers consistently state that they need indexed, searchable full text (Pritchard-Schoch, 1993). For some kinds of queries, statistical techniques applied to the full text have been satisfactory, while others just cannot be answered satisfactorily without indexing.
The basic design of thesauri to date, then, has been as indexing aids, with the expectation that searchers would be able to use these aids as a guide to searching. The notation used in term relationships is abstruse; the fact that "BT" and "NT" mean that two terms are related hierarchically is obvious only to specialists. Furthermore, database producers frequently do not mount their thesauri on their search systems. And even if the thesaurus is mounted, the search system may not support the full range of navigational information. In other words, the thesaurus is an indexing aid which we hope can also be used for searching, but we frequently haven't put much effort into making this use possible, let alone easy.
Thesauri are known to be underused by searchers; this is probably due at least partly to the fact that the thesaurus for a database is unlikely to be readily available to them. Even if it is available, it may be only in paper form, or as an online list with little or no user aid in the interface. Even without significant changes in the nature of the thesaurus itself, provision of effective thesaurus navigation tools in interfaces to search systems should increase searcher use substantially. Permitting the searcher to switch seamlessly between navigating the thesaurus and searching the database can only improve access.
An obvious way in which a thesaurus can be applied directly in retrieval is to use the relationships as a means of expanding the search. Research, however, has shown that these relationships must be used with caution. In general, expanding a search to include the narrower terms tends to improve recall without great sacrifice in precision. Expanding to include broader or related terms, while it does improve recall, typically has a significant negative impact on precision.
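Expansion limited to narrower terms might look like the following sketch. The thesaurus fragment is hypothetical (only the truck/motor vehicle pairing echoes an example used elsewhere in this paper), the dictionary layout is an assumption, and `expand_narrower` is an illustrative name. Broader and related terms are stored but deliberately ignored, in line with the caution about precision.

```python
# Hypothetical thesaurus fragment: each term maps relationship codes to terms.
THESAURUS = {
    "motor vehicles": {"NT": ["trucks", "automobiles"], "BT": ["vehicles"], "RT": ["highways"]},
    "trucks": {"NT": ["pickup trucks"], "BT": ["motor vehicles"], "RT": []},
    "automobiles": {"NT": [], "BT": ["motor vehicles"], "RT": []},
    "pickup trucks": {"NT": [], "BT": ["trucks"], "RT": []},
}


def expand_narrower(term, thesaurus):
    """Return the term followed by its narrower terms at every level, depth first."""
    expanded = []
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue                     # guards against cycles or repeated terms
        seen.add(current)
        expanded.append(current)
        narrower = thesaurus.get(current, {}).get("NT", [])
        stack.extend(reversed(narrower))  # reversed so terms come out in listed order
    return expanded


# The expanded list would be ORed together in the actual search.
print(expand_narrower("motor vehicles", THESAURUS))
```

A term that is not in the thesaurus is returned unchanged, so the function can be applied to any query term without special handling.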
Over the years there have been proposals for end-user thesauri designed specifically to facilitate searching. An end-user thesaurus differs from a conventional thesaurus in two primary ways: its term inclusion and organization, and its displays. It is designed to reflect and organize the total specialized vocabulary of users in a field, rather than to provide a limited list of authorized terms. It gives more information about the scope of terms, and its displays are designed around the way in which users approach information (Anderson & Rowley, 1992; Bates, 1990).
End-user thesauri have not been widely implemented, for a number of possible reasons. Conventional thesauri are costly to develop and maintain; the additional access in an end-user thesaurus would be even more costly. At the same time, until recently there seems not to have been a real understanding that the more full text there is, the more help users need in navigating it, even with a powerful search engine.
A semantic network would serve some of the same purposes as an end-user thesaurus, but would be just as costly to develop. Use of existing semantic networks such as WordNet has not shown great improvement in retrieval results, perhaps because they are not directly focused on the vocabulary of a particular user group.
The availability of powerful text analysis software could change this scenario dramatically. The same analysis used to provide relevance-ranked search results could be used to suggest candidate terms for indexing without manual development of a rule base. Already-indexed documents would be used to train the text analysis software, which would then assign candidate index terms to the documents for indexer review. Without human review, of course, the same scenario would produce automatic indexing.
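The training scenario can be illustrated with a deliberately simple stand-in for real text analysis software. The sketch below uses plain word co-occurrence counts, which is an assumption chosen for brevity and not the method of any particular product; the training documents, the function names `train` and `suggest`, and the scoring rule are all invented for the example.

```python
from collections import Counter, defaultdict

# Invented training set: (words of the document, index terms assigned by an indexer).
TRAINING = [
    (["diesel", "truck", "engine"], ["trucks"]),
    (["truck", "freight", "road"], ["trucks", "freight transport"]),
    (["rail", "freight", "wagon"], ["freight transport"]),
]


def train(indexed_docs):
    """For each index term, count the words of the documents it was assigned to."""
    word_counts = defaultdict(Counter)
    for words, index_terms in indexed_docs:
        for term in index_terms:
            word_counts[term].update(words)
    return word_counts


def suggest(words, word_counts, top_n=3):
    """Rank index terms by the share of their training words found in the new document."""
    present = set(words)
    scores = {}
    for term, counts in word_counts.items():
        total = sum(counts.values())
        overlap = sum(counts[w] for w in present)
        if overlap:
            scores[term] = overlap / total
    # Best score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


model = train(TRAINING)
print(suggest(["truck", "engine", "repair"], model))
```

The output is a ranked list of candidate terms for an indexer to accept or reject. Accepting the top candidates without review would turn the same procedure into automatic indexing.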
To date, however, the vendors of text analysis software have concentrated their development and marketing efforts on websites and corporate intranets, with an emphasis on categorization and routing applications. Unfortunately, a system which effectively distinguishes among 100 or 1,000 categories may not scale up to 10,000 or more highly specific terms without redesign. This may be why thesaurus-based text analysis products are only now beginning to show up in the market.
Machine-aided indexing (MAI) assumes a developed thesaurus, and ongoing maintenance and refinement of the term assignment criteria. It shifts much of the analysis effort away from review of individual documents to maintenance of the vocabulary and retraining of the system. In fact, use of MAI does not reduce the need for a thesaurus; if anything, it increases the demands made on these tools, and as a result is bringing more of their limitations to light.
For these reasons it seems likely that metadata developments will have relatively little impact one way or the other on thesauri. Producers who are concerned with providing standardized subject access to their resources will use thesauri to determine the content of the element(s) allocated to subject metadata. Those who are less concerned about subject access will put unstandardized keywords in such metadata elements if they use them at all.
The number of kinds of relationships in the present design is limited -- and yet even this specification of types is probably only of marginal direct value to many users. As indicated above, users do not necessarily recognize that "BT" and "NT" mean a relationship is hierarchical and "Use" and "UF" mean the terms are equivalent, while "RT" means the relationship is something else, that something being unspecified.
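One modest step an interface could take is to spell the notation out. The sketch below is an assumption about how such a display might be produced: the wording of the labels, the function name `describe`, and the sample entry (including the term "lorries") are invented for illustration.

```python
# Plain-language labels for the conventional relationship codes.
LABELS = {
    "USE": "Use instead",
    "UF": "Used for",
    "BT": "Broader term",
    "NT": "Narrower term",
    "RT": "Related term",
}


def describe(term, relations):
    """Render one thesaurus entry with spelled-out relationship labels."""
    lines = [term]
    for code in ("USE", "UF", "BT", "NT", "RT"):
        for target in relations.get(code, []):
            lines.append(f"  {LABELS[code]}: {target}")
    return "\n".join(lines)


entry = {"BT": ["motor vehicles"], "UF": ["lorries"], "RT": ["highways"]}
print(describe("trucks", entry))
```

Spelling out "RT" as "Related term" does not, of course, say what the relationship is; the label is friendlier, but the underlying lack of specificity remains.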
For a thesaurus developer, even deciding when a relationship is hierarchical or part/whole can be difficult. The decision is fairly easy when concrete objects (e.g., truck/motor vehicle) are the issue. However, in a world where the same entity may be a "particle" (i.e., concrete) or a "wave" (not concrete), depending on how the observer happens to be looking at the entity at the moment, deciding whether something is a "thing" or a "process" may not only be difficult, it is likely to be futile.
If the distinction between hierarchical and other relationships is that porous in fact, of how much value is it to users? We find ourselves wanting to say: "These terms are very closely related, while these others, though less related, might still be useful for you," but this way of looking at relationships is not necessarily hierarchical. Instead, it involves weighting, but there is no way to build weighting into a standard thesaurus -- and the appropriate weights will be a subjective decision in any case.
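If weights were available, a display along these lines would be easy to produce. The following sketch is hypothetical throughout: the standard provides no such weights, the numbers shown are invented and as subjective as the text suggests, and `suggest_related` and the 0.5 threshold are assumptions made for the example.

```python
# Invented weights between 0 and 1; a standard thesaurus has no such values.
RELATED = {
    "trucks": [("motor vehicles", 0.9), ("freight transport", 0.6), ("highways", 0.3)],
}


def suggest_related(term, related, close_threshold=0.5):
    """Split a term's links into closely related and possibly useful, strongest first."""
    links = sorted(related.get(term, []), key=lambda pair: (-pair[1], pair[0]))
    close = [t for t, weight in links if weight >= close_threshold]
    other = [t for t, weight in links if weight < close_threshold]
    return close, other


close, other = suggest_related("trucks", RELATED)
print("Very closely related:", ", ".join(close))
print("Might still be useful:", ", ".join(other))
```

Note that the grouping ignores whether a link is hierarchical, which is the point: closeness, not relationship type, drives what the user sees.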
At the same time, text analysis software theoretically can make use of much richer semantic analysis, not only of the relationships between terms, but of the kind of term, e.g., a process, thing, or property. Historically, this kind of analysis has been even more labor-intensive than that required to develop the relationships in a standard thesaurus. For instance, efforts such as the Cyc project have involved manual development of a knowledge base that would permit automatic analysis. On the near horizon, though, are systems which will automate development of abstractions such as relationships among concepts.
Displaying the relationships of a thesaurus in print has always involved compromises. For instance, the typical alphabetical display can only show a single level of upward and downward hierarchical relationships. Thesauri which include the full hierarchy of terms in the alphabetical display become much more voluminous. If the full hierarchical display is relegated to a separate listing, it can be difficult in the alphabetical display to show where to enter the hierarchical listing to see the full hierarchy of the term.
While electronic display of a thesaurus can ameliorate some of the limitations of the print display, making it possible, for instance, to switch back and forth between alphabetical and hierarchical display, the limitations of the screen are substituted for the limitations of the printed page. The screen display does offer possibilities of flexibility and customization that simply are not possible in print. Unfortunately, the thesauri which are currently publicly available on the Web frequently are less rich in access than the print forms of the same products.
Graphical displays have not been much used to date, but Plumb Design has implemented an interesting display of the WordNet semantic network that permits moving around and changing focus within the display.
Given all the problems and limitations, how is it possible to remain positive about the need for continued use of thesauri? There are two fundamental reasons, one philosophical and one pragmatic:
Anderson, James D., and Frederick A. Rowley. "Building End-User Thesauri From Full-Text." In Barbara H. Kwasnik and Raya Fidel, eds. Advances in Classification Research, Volume 2; Proceedings of the 2nd ASIS SIG/CR Classification Research Workshop, October 27, 1991. Medford, NJ: Learned Information, 1992. p. 1-13.
Bates, Marcia J. "Design for a Subject Search Interface and Online Thesaurus for a Very Large Records Management Database." In: American Society for Information Science. Annual Meeting. Proceedings, v.27. Medford, NJ: Learned Information, 1990. p. 20-28.
Cochrane, Pauline A., and Eric H. Johnson, eds. Visualizing Subject Access for 21st Century Information Resources; Proceedings of the 34th Annual Clinic on Library Applications of Data Processing, March 2-4, 1997. Champaign, IL: Graduate School of Library and Information Science, University of Illinois, 1998. p. 28-38.
Hlava, Marjorie M.K., and Richard Hainebach. "Machine Aided Indexing: European Parliament Study and Results." In 17th National Online Meeting. Proceedings. Medford, NJ: Information Today, 1996. p. 137-158.
Milstead, Jessica L., and Susan Feldman. "Metadata: Cataloging by Any Other Name..." Online 23(1):24-31, January-February 1999. (Also available at: http://www.onlineinc.com/onlinemag/metadata/)
National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. (ANSI/NISO Z39.19-1993)
Pritchard-Schoch, Teresa. "Natural Language Comes of Age." Online 17(3):33-43, May 1993.