Use of Thesauri in the Full-Text Environment


Based on a paper presented at the 34th Clinic on Library Applications of Data Processing (Cochrane & Johnson, 1998)

© 1998. Jessica L. Milstead. All Rights Reserved

Introduction

The information retrieval world has changed dramatically in recent years, with the immense increase in availability of searchable full text -- and the increasing availability of powerful engines for searching the text. It is reasonable to ask whether there is any place left for thesauri in this new information retrieval environment. I believe there is a place for thesauri -- or something like them -- but they must change in order to continue to be of value, and it is hard to predict just what the changes will be.

A thesaurus is more than a simple equivalence list, the kind of "thesaurus" most often supported by text retrieval packages. While equivalence lists are vital to effective information retrieval, they are not enough. They can only suggest other ways of expressing an idea which is already in the user's mind; they do not remind the user of related ideas that might be valuable in searching.

A true thesaurus has equivalence relationships, but it also supports other kinds of relationships, such as genus-species, and provides navigation assistance by means of scope notes and other aids. In other words, a thesaurus is a tool designed to aid users in finding their way around a vocabulary database. In addition to its traditional use as an authority for the terms used in indexing the database, it offers reminders of terms the user might not even have considered.

The ANSI/NISO standard for thesauri (National Information Standards Organization, 1994) provides the best available information on what thesauri should do and how they should be built, but it predates the recent explosion of full text and powerful search engines, and it is not an adequate guide to future needs and potential.

The first thesauri were produced before electronic searching became widely available, but their full development coincided with the growth of online bibliographic databases. The earliest electronic files consisted only of titles, bibliographic descriptions, and indexing; if you were lucky there were abstracts, but this was not to be taken for granted when storage space was a very precious commodity and acquiring anything in electronic form generally meant rekeying it. In this environment, indexing had to be of high quality if information was to be retrieved at all. Hence the obvious need for thesauri.

Today it is beginning to seem as if all information is available in full text. However, this is not true, nor will it be true in the immediate future. Vast numbers of legacy documents remain, and converting these to searchable text is an expensive, long-term proposition. Furthermore, many documents are still being produced only in printed form. Therefore, thesauri and indexing will continue to have a place -- at least for a while -- in facilitating access to documents for which electronic text is not available. Their long-run value, however, depends on integration with full-text search.

Thesauri and Search Engines

Thesauri actually have a place at both ends of the information access process, at both storage and retrieval. The amount of electronically accessible full text is so immense, and is growing so fast, that users need all the help they can get in accessing it. The explosive growth of Web search engines, with their primitive algorithms, has had some rather unfortunate effects, to my mind. Some of these engines appear to have been developed by people who saw a need, but who had not the vaguest idea that there was already a history of development of tools to fulfill similar needs. There is little evidence that some of these developers had ever used either Dialog or a library catalog.

We should distinguish kinds of tools for facilitating access to full text on the basis of the attention they give to semantics. Older, exact-match (Boolean) systems give no attention to semantics. The search terms must appear in the document for it to be retrieved; if a term appears at all the document will be retrieved regardless of whether the term is important to the meaning of the document or not. Another approach relies on statistical information -- co-occurrence of words in the document, frequency, etc. Boolean and statistically-based systems have been found to have comparable retrieval performance, but to produce very different retrieval sets. That is, searches of the same database using a Boolean engine and a statistically-based one often produce about the same number of relevant hits, but there may be little overlap between the two sets of hits.
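
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the three toy documents, the function names, and the use of raw word frequency as the "statistical" score are my assumptions here, not the algorithm of any particular engine.

```python
from collections import Counter

# Toy collection; the texts are invented for illustration only.
DOCS = {
    "d1": "thesaurus design for indexing and thesaurus maintenance",
    "d2": "search engines rank documents by word statistics",
    "d3": "a thesaurus is mentioned once in this note on search engines",
}

def tokenize(text):
    """Lowercase and split on whitespace (no stemming, no stopword list)."""
    return text.lower().split()

def boolean_and(query, docs=DOCS):
    """Exact match: retrieve a document only if every query word occurs in it."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))

def frequency_ranked(query, docs=DOCS):
    """Statistical stand-in: rank documents by how often the query words occur."""
    terms = tokenize(query)
    scores = {}
    for d, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score:
            scores[d] = score
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For the query "thesaurus search", the Boolean function returns only the document that contains both words, however incidental the mention, while the frequency ranking returns all three documents in a different order. Even on this tiny scale the two result sets differ.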

Intelligent retrieval systems integrate statistical and semantic information, as well as a full battery of linguistic techniques, to retrieve more useful results. Such a system may contain an extensive lexicon, not just of word meanings and equivalents, but of word types and relationships. Text is parsed, to a greater or lesser extent depending on the system, and there are often tools for disambiguation of terms. Phrases rather than just single words can also be handled. The most powerful systems actually can determine syntactic or structural meaning, permitting them to retrieve a concept expressed in words that are not actually in the lexicon.

Any of these types of system could produce better results by taking advantage of the presence of controlled-vocabulary indexing. In a Boolean system the chances of retrieving relevant documents that do not happen to contain the words of the search query are improved, though precision is not helped unless the search is specifically limited to controlled vocabulary terms. In either statistical or intelligent systems, the index terms could be weighted more heavily than the running text. Unfortunately, this is generally not the case.
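
One way such weighting might look is sketched below. The function name, the weights of 3.0 and 1.0, and the restriction to single-word query terms are all assumptions made for the illustration; a real system would choose and tune these for itself.

```python
def field_weighted_score(query_terms, index_terms, text,
                         index_weight=3.0, text_weight=1.0):
    """Score one document, counting a hit in the indexing more than a hit in the text.

    Assumes single-word query terms; the default weights are arbitrary.
    """
    text_words = text.lower().split()
    index_set = {t.lower() for t in index_terms}
    score = 0.0
    for term in query_terms:
        term = term.lower()
        if term in index_set:
            # The indexer judged this concept important to the document.
            score += index_weight
        # Each occurrence in the running text adds a smaller amount.
        score += text_weight * text_words.count(term)
    return score
```

With these weights, a document indexed under "thesauri" but never using the word in its text still outscores a document that merely mentions the word twice in passing.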

Searchers consistently state that they need indexed, searchable full text (Pritchard-Schoch, 1993). For some kinds of queries, statistical techniques applied to the full text have been satisfactory, while others just cannot be answered satisfactorily without indexing.

Use of Thesauri in Searching

In the traditional scenario, an indexer uses the thesaurus to select index terms for inclusion in the document record. Then the searcher, hopefully referring to the same thesaurus, selects terms which seem likely to produce relevant results, and searches the indexing, retrieving on the basis of exact match. Even if the searcher has not referred to the thesaurus, s/he is aided by the indexing because if the query words appear in the indexing, then all documents indexed with those words will be retrieved, whether the words happen to appear in the text or not.

The basic design of thesauri to date, then, has been as indexing aids, with the expectation that searchers would be able to use these aids as a guide to searching. The notation used in term relationships is abstruse; the fact that "BT" and "NT" mean that two terms are related hierarchically is obvious only to specialists. Furthermore, database producers frequently do not mount their thesauri on their search systems. And even if the thesaurus is mounted, the search system may not support the full range of navigational information. In other words, the thesaurus is an indexing aid which we hope can also be used for searching, but we frequently haven't put much effort into making this use possible, let alone easy.

Thesauri are known to be underused by searchers; this is probably due at least partly to the fact that the thesaurus for a database is unlikely to be readily available to them. Even if it is available, it may be only in paper form, or as an online list with little or no user aid in the interface. Even without significant changes in the nature of the thesaurus itself, provision of effective thesaurus navigation tools in interfaces to search systems should increase searcher use substantially. Permitting the searcher to switch seamlessly between navigating the thesaurus and searching the database can only improve access.

An obvious way in which a thesaurus can be applied directly in retrieval is to use the relationships as a means of expanding the search. Research, however, has shown that these relationships must be used with caution. In general, expanding a search to include the narrower terms tends to improve recall without great sacrifice in precision. Expanding to include broader or related terms, while it does improve recall, typically has a significant negative impact on precision.
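
A minimal sketch of this kind of expansion follows. The tiny thesaurus, the dictionary layout, and the function names are hypothetical; the sketch only reflects the finding above by exploding narrower terms by default and adding broader or related terms only on request.

```python
from collections import deque

# Hypothetical thesaurus fragment: term -> relationship code -> list of terms.
THESAURUS = {
    "motor vehicles": {"NT": ["trucks", "automobiles"], "BT": ["vehicles"], "RT": ["highways"]},
    "trucks": {"NT": ["pickup trucks"], "BT": ["motor vehicles"]},
    "automobiles": {"BT": ["motor vehicles"]},
    "pickup trucks": {"BT": ["trucks"]},
}

def narrower_terms(term, thesaurus):
    """Collect narrower terms at every level below `term`, breadth first."""
    found, seen, queue = [], {term}, deque([term])
    while queue:
        current = queue.popleft()
        for nt in thesaurus.get(current, {}).get("NT", []):
            if nt not in seen:
                seen.add(nt)
                found.append(nt)
                queue.append(nt)
    return found

def expand_query(term, thesaurus, include_broader=False, include_related=False):
    """Expand a search term; broader and related terms are opt-in because
    they tend to cost precision."""
    terms = [term] + narrower_terms(term, thesaurus)
    entry = thesaurus.get(term, {})
    if include_broader:
        terms += [t for t in entry.get("BT", []) if t not in terms]
    if include_related:
        terms += [t for t in entry.get("RT", []) if t not in terms]
    return terms
```

The expanded list would then be combined with OR in the search itself. Making the broader and related expansions explicit options leaves the precision trade-off in the hands of the searcher or the interface designer.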

Over the years there have been proposals for end-user thesauri designed specifically to facilitate searching. An end-user thesaurus differs from a conventional thesaurus in two primary ways: its term inclusion and organization, and its displays. It is designed to reflect and organize the total specialized vocabulary of users in a field, rather than to provide a limited list of authorized terms. It gives more information about the scope of terms, and its displays are designed around the way in which users approach information (Anderson & Rowley, 1992; Bates, 1990).

End-user thesauri have not been widely implemented, for a number of possible reasons. Conventional thesauri are costly to develop and maintain; the additional access in an end-user thesaurus would be even more costly. At the same time, until recently there seems not to have been a real understanding that the more full text there is, the more help users need in navigating it, even with a powerful search engine.

A semantic network would serve some of the same purposes as an end-user thesaurus, but would be just as costly to develop. Use of existing semantic networks such as WordNet has not shown great improvement in retrieval results, perhaps because they are not directly focused on the vocabulary of a particular user group.

Changes in Indexing

Meanwhile, as more organizations make use of machine-aided or even automatic indexing, the demands on their thesauri have increased. For many years a few organizations such as NASA, the American Petroleum Institute, and the Defense Technical Information Center have been using machine-aided indexing (MAI). In these older MAI systems, the text of titles and abstracts is run against a rule base; when a rule is matched the applicable thesaurus term is assigned to the document. The indexer reviews these candidate index terms, adding and deleting as appropriate. While their users have found that the systems increase indexer productivity significantly, there has been no great move to MAI by other database producers in the 20 or more years that these systems have been in use. Within the past few years one MAI shell system has become commercially available (Hlava & Hainebach, 1996).
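
The rule-matching step in such a system might be sketched roughly as follows. The rules, the thesaurus terms, and the simple substring test are invented for illustration and are far cruder than the rule bases of the production systems mentioned above.

```python
# Hypothetical rule base: (phrase to look for in the text, thesaurus term to suggest).
RULES = [
    ("crude oil", "Petroleum"),
    ("petroleum", "Petroleum"),
    ("drilling rig", "Drilling equipment"),
]

def suggest_index_terms(title, abstract, rules=RULES):
    """Run title and abstract against the rule base and collect candidate terms."""
    text = f"{title} {abstract}".lower()
    candidates = []
    for phrase, term in rules:
        # A rule fires when its phrase occurs anywhere in the text.
        if phrase in text and term not in candidates:
            candidates.append(term)
    return candidates

def review(candidates, add=(), delete=()):
    """The indexer's pass: drop wrong suggestions, then add missed terms."""
    kept = [t for t in candidates if t not in delete]
    return kept + [t for t in add if t not in kept]
```

The output of the first function is only a list of candidates. The second function stands in for the human review that distinguishes machine-aided from automatic indexing.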

The availability of powerful text analysis software could change this scenario dramatically. The same analysis used to provide relevance-ranked search results could be used to suggest candidate terms for indexing without manual development of a rule base. Already-indexed documents would be used to train the text analysis software, which would then assign candidate index terms to the documents for indexer review. Without human review, of course, the same scenario would produce automatic indexing.
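
As a rough sketch of the idea, and not a description of any vendor's product, the training step could be as simple as counting which index terms accompany which words in already-indexed documents. The function names and the scoring rule below are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def train(indexed_docs):
    """Learn word -> index-term associations from (text, [index terms]) pairs."""
    word_term = defaultdict(Counter)
    for text, terms in indexed_docs:
        for word in set(text.lower().split()):
            for term in terms:
                word_term[word][term] += 1
    return word_term

def suggest(text, word_term, top_n=3):
    """Suggest candidate index terms for a new text, strongest first.

    Each known word votes for the terms it was seen with, in proportion
    to how often it accompanied each term in the training documents.
    """
    scores = Counter()
    for word in set(text.lower().split()):
        counts = word_term.get(word)
        if not counts:
            continue
        total = sum(counts.values())
        for term, n in counts.items():
            scores[term] += n / total
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]
```

No rule base is written by hand here; the indexing already done supplies the evidence. Passing the suggestions to an indexer gives machine-aided indexing, and accepting them unreviewed gives automatic indexing.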

To date, however, the vendors of text analysis software have concentrated their development and marketing efforts on websites and corporate intranets, with an emphasis on categorization and routing applications. Unfortunately, a system which effectively distinguishes among 100 or 1,000 categories may not scale up to 10,000 or more highly specific terms without redesign. This may be why thesaurus-based text analysis products are only now beginning to show up in the market.

MAI assumes a developed thesaurus, and ongoing maintenance and refinement of the term assignment criteria. It shifts much of the analysis effort away from review of individual documents to maintenance of the vocabulary and retraining of the system. In fact, use of MAI does not reduce the need for a thesaurus; if anything, it increases the demands made on these tools, and as a result is bringing more of their limitations to light.

Metadata and Thesauri

At first glance, it would seem self-evident that the great variety of metadata development efforts underway today would have a significant impact on thesauri. However, this is not the case so far. Metadata standards, from the Dublin Core to the CDWA (Categories for the Description of Works of Art) and beyond, are limited almost entirely to formats and frameworks, meaning standardization of the tags and of packages for them. For resource-discovery metadata, the only concern with content of the tags has been for a few limited fields such as type of resource. Metadata formats are likely to provide a means to specify the authority used for the content of the tag, but they are designed as packages and it is up to the user to design the content of the package (Milstead & Feldman, 1999).

For these reasons it seems likely that metadata developments will have relatively little impact one way or the other on thesauri. Producers who are concerned with providing standardized subject access to their resources will use thesauri to determine the content of the element(s) allocated to subject metadata. Those who are less concerned about subject access will put unstandardized keywords in such metadata elements if they use them at all.

Problems of Thesaurus Design

There are fundamental problems in the basic design of thesauri that make them less than optimally useful for more powerful retrieval scenarios. There is no reason to expect that a tool designed for Boolean search on index terms will be optimized when full text is searched by a powerful engine. Unfortunately, the ways in which thesauri could be redesigned to be more useful are not immediately obvious.

The number of kinds of relationships in the present design is limited -- and yet even this specification of types is probably only of marginal direct value to many users. As indicated above, users do not necessarily recognize that "BT" and "NT" mean a relationship is hierarchical and "Use" and "UF" mean the terms are equivalent, while "RT" means the relationship is something else -- that something being unspecified.
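
An interface can at least spare users the codes themselves. The sketch below spells out the conventional meanings of the abbreviations when displaying an entry; the label wording, the function name, and the sample entry are my own choices for the illustration.

```python
# Conventional expansions of the relationship codes; the wording is a suggestion.
PLAIN_LABELS = {
    "USE": "Use instead",
    "UF": "Used for",
    "BT": "Broader term",
    "NT": "Narrower term",
    "RT": "Related term",
}

def describe(term, relations, labels=PLAIN_LABELS):
    """Render one thesaurus entry with plain-language relationship labels."""
    lines = [term]
    for code in ("USE", "UF", "BT", "NT", "RT"):
        for other in relations.get(code, []):
            lines.append(f"  {labels[code]}: {other}")
    return lines
```

This changes only the presentation. The "related term" label is still no more informative than "RT" about what the relationship actually is.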

For a thesaurus developer, even deciding when a relationship is hierarchical or part/whole can be difficult. The decision is fairly easy when concrete objects (e.g., truck/motor vehicle) are the issue. However, in a world where the same entity may be a "particle" (i.e., concrete) or a "wave" (not concrete), depending on how the observer happens to be looking at the entity at the moment, deciding whether something is a "thing" or a "process" may not only be difficult, it is likely to be futile.

If the distinction between hierarchical and other relationships is that porous in fact, of how much value is it to users? We find ourselves wanting to say: "These terms are very closely related, while these others, though less related, might still be useful for you," but this way of looking at relationships is not necessarily hierarchical. Instead, it involves weighting, but there is no way to build weighting into a standard thesaurus -- and the appropriate weights will be a subjective decision in any case.
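
To make the idea concrete, here is a sketch of what weighted relationships might look like if a thesaurus format allowed them. Nothing like this exists in the standard thesaurus; the terms, the weights, and the threshold are invented, and the weights in particular are exactly the kind of subjective judgment described above.

```python
# Hypothetical weighted links: term -> {other term: closeness between 0 and 1}.
WEIGHTED_LINKS = {
    "trucks": {"motor vehicles": 0.9, "freight transport": 0.6, "highways": 0.3},
}

def related_terms(term, links=WEIGHTED_LINKS, threshold=0.5):
    """Return (term, weight) pairs at or above the threshold, strongest first."""
    pairs = links.get(term, {}).items()
    return sorted(((t, w) for t, w in pairs if w >= threshold),
                  key=lambda p: (-p[1], p[0]))
```

Lowering the threshold corresponds to saying "these others, though less related, might still be useful for you", with no claim about whether any of the links is hierarchical.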

At the same time, text analysis software theoretically can make use of much richer semantic analysis, not only of the relationships between terms, but of the kind of term, e.g., a process, thing, or property. Historically, this kind of analysis has been even more labor-intensive than that required to develop the relationships in a standard thesaurus. For instance, efforts such as the Cyc project have involved manual development of a knowledge base that would permit automatic analysis. On the near horizon, though, are systems which will automate development of abstractions such as relationships among concepts.

Displaying the relationships of a thesaurus in print has always involved compromises. For instance, the typical alphabetical display can only show a single level of upward and downward hierarchical relationships. Thesauri which include the full hierarchy of terms in the alphabetical display become much more voluminous. If the full hierarchical display is relegated to a separate listing, it can be difficult in the alphabetical display to show where to enter the hierarchical listing to see the full hierarchy of the term.
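
The difference between the two displays is easy to see in a small sketch. The thesaurus fragment and the function names are hypothetical, and the hierarchy is assumed to contain no cycles.

```python
# Hypothetical thesaurus fragment: term -> relationship code -> list of terms.
THESAURUS = {
    "vehicles": {"NT": ["motor vehicles"]},
    "motor vehicles": {"BT": ["vehicles"], "NT": ["automobiles", "trucks"]},
    "automobiles": {"BT": ["motor vehicles"]},
    "trucks": {"BT": ["motor vehicles"], "NT": ["pickup trucks"]},
    "pickup trucks": {"BT": ["trucks"]},
}

def alphabetical_entry(term, thesaurus):
    """One entry as in a typical alphabetical display: one level up, one level down."""
    entry = thesaurus.get(term, {})
    lines = [term]
    lines += [f"  BT {bt}" for bt in sorted(entry.get("BT", []))]
    lines += [f"  NT {nt}" for nt in sorted(entry.get("NT", []))]
    return lines

def full_hierarchy(term, thesaurus, depth=0):
    """The whole tree below `term`, indented by level (assumes no cycles)."""
    lines = ["  " * depth + term]
    for nt in sorted(thesaurus.get(term, {}).get("NT", [])):
        lines.extend(full_hierarchy(nt, thesaurus, depth + 1))
    return lines
```

The alphabetical entry for "motor vehicles" never shows "pickup trucks", which sits two levels down; only the full hierarchical listing does. Printing the second form under every term is what makes a thesaurus so much more voluminous.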

While electronic display of a thesaurus can ameliorate some of the limitations of the print display, making it possible, for instance, to switch back and forth between alphabetical and hierarchical display, the limitations of the screen are substituted for the limitations of the printed page. The screen display does offer possibilities of flexibility and customization that simply are not possible in print. Unfortunately, the thesauri which are currently publicly available on the Web frequently are less rich in access than the print forms of the same products.

Graphical displays have not been much used to date, but Plumb Design has implemented an interesting display of the WordNet semantic network that permits moving around and changing focus within the display.

Future of Thesauri

These tools, originally designed to facilitate consistent analysis of documents at input to an information retrieval system, are already well on their way to becoming vital retrieval tools as well. In fact, thesauri may soon be used more at retrieval than at input. They may work behind the scenes much of the time; while users should certainly have access to any available vocabulary aids if they want them, we need to design our interfaces so that users need not interact directly with the thesaurus to any greater extent than they wish or need to.

Given all the problems and limitations, how is it possible to remain positive about the need for continued use of thesauri? There are two fundamental reasons, one philosophical and one pragmatic:

A thesaurus can become the basis of a more extensive semantic network, providing information not just on what terms are used in indexing, but on how they are used within the system. Most often a semantic network includes richer relationships than a thesaurus, but there is no reason not to build the less sophisticated system, using it as a resource when it becomes feasible to develop the more powerful system.

References

Anderson, James D., and Frederick A. Rowley. "Building End-User Thesauri From Full-Text." In Barbara H. Kwasnik and Raya Fidel, eds. Advances in Classification Research, Volume 2; Proceedings of the 2nd ASIS SIG/CR Classification Research Workshop, October 27, 1991. Medford, NJ: Learned Information, 1992. p. 1-13.

Bates, Marcia J. "Design for a Subject Search Interface and Online Thesaurus for a Very Large Records Management Database." In: American Society for Information Science. Annual Meeting. Proceedings, v.27. Medford, NJ: Learned Information, 1990. p. 20-28.

Cochrane, Pauline A., and Eric H. Johnson, eds. Visualizing Subject Access for 21st Century Information Resources; Proceedings of the 34th Annual Clinic on Library Applications of Data Processing, March 2-4, 1997. Champaign, IL: Graduate School of Library and Information Science, University of Illinois, 1998. p. 28-38.

Hlava, Marjorie M.K., and Richard Hainebach. "Machine Aided Indexing: European Parliament Study and Results." In 17th National Online Meeting. Proceedings. Medford, NJ: Information Today, 1996. p. 137-158.

Milstead, Jessica L., and Susan Feldman. "Metadata: Cataloging by Any Other Name..." Online 23(1):24-31, January-February 1999. (Also available at: http://www.onlineinc.com/onlinemag/metadata/)

National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. (ANSI/NISO Z39.19-1993)

Pritchard-Schoch, Teresa. "Natural Language Comes of Age." Online 17(3):33-43, May 1993.