© 1998. Jessica L. Milstead. All Rights Reserved
This workshop brings together practitioners of library/information science and biological systematics at the intersection of their disciplines: the task of organization of information for later retrieval. Organization of information is a fundamental area of research in information science; information scientists try to understand the nature of information, how humans process it, and how best to organize it to facilitate use. For taxonomists, on the other hand, organization of information is more a basic tool than an area of research. They are more concerned with actually getting information organized than with developing theories about the process. Through this workshop, information scientists can learn about the information issues facing a discipline which is particularly dependent on organizational aspects of its information, while taxonomists can gain familiarity with tools and techniques which they can use to facilitate organization of their information.
One of these tools is the information retrieval thesaurus. This tool was developed over 30 years ago, and is the subject of both national and international standards. The goal of this paper is to provide an overview of what information retrieval thesauri are, to describe the provisions of the thesaurus standard, and to show the potential applicability of thesauri both to taxonomy and to organization of information in the life sciences in general.
The National Information Standards Organization (NISO) is accredited by the American National Standards Institute for the setting of standards in the area of information and library science. This includes such topics as indexing, abstracting, and standard numbering schemes for documents. The standard for information retrieval thesauri (National Information Standards Organization, 1993) is part of this program. The standards for library cataloging are not part of NISO's program; these standards, such as the Anglo-American Cataloguing Rules, predate the establishment of NISO by many years. There are also two international standards for thesauri: one for monolingual thesauri (International Organization for Standardization, 1986) which is similar to the NISO standard except in its provisions for non-English language thesauri, and another for multilingual thesauri (International Organization for Standardization, 1985) -- thesauri which themselves are in multiple languages.
All NISO standards are voluntary; there are no "standards police" to enforce compliance. In fact, many of the standards are more in the nature of guidelines, making recommendations for best practice rather than promulgating requirements.
The standard for thesauri defines a thesaurus "as a controlled vocabulary arranged in a known order," with specified types of relationships, identified by standardized relationship indicators. A controlled vocabulary is a subset of natural language, consisting of preferred and nonpreferred terms. The primary purposes of a thesaurus are identified as promotion of consistency in the indexing of documents and facilitation of searching. The definition in the international standard is similar. While the context of thesaurus standards is tags for labeling of the content of documents, the principles of thesaurus development can be applied to a wide variety of tasks of information organization.
A thesaurus may be used as a terminological authority, but usually in a very limited sense. If it supports a database or a family of databases, it serves as the authority for the terms used in indexing those databases. This does not imply any prescriptive authoritativeness in the outside world, but if the thesaurus is to serve its users it must reflect their usage of terms. Thus, it may well serve as a guide to the terms actually in use in a field -- not the same thing as an authority which prescribes usage.
The word "thesaurus" is most familiar to the general public in the context of the tool originally developed by Roget. This kind of thesaurus is similar to an information retrieval thesaurus in that it contains some hierarchical organization, and gives equivalents. However, its basic purpose is quite different. A Roget-type thesaurus is designed to aid writers in finding the word or phrase that most precisely expresses a nuance of meaning, or that brings variety to their writing. In contrast, an information retrieval thesaurus conflates equivalents and near-equivalents for the purpose of authorizing a subset of natural language which will bring together closely related information to facilitate retrieval.
In the context of information retrieval, especially of software used for retrieval from full text, "thesaurus" is sometimes used in a more limited sense, and I want to be sure that the distinction is clearly made here. Some text retrieval packages which offer a "thesaurus" are using the term to mean a dictionary of term equivalents, whether built-in or developed by the user. These tools are useful, but they are far from the richness of a true information retrieval thesaurus which contains, in addition to term equivalents, hierarchical and associative relationships and additional authority information about terms. For purposes of this paper, a "thesaurus" is an ANSI/NISO standard thesaurus.
Another important point to keep in mind is that a thesaurus is a database which is composed of records. In most thesauri, these records are for individual terms; they contain links for the relationships of a given term to other terms, as well as a variety of other kinds of information such as the date when a term was declared valid or invalid, its scope in the database, or the authority for its use. This definition implies that in a generic sense a thesaurus and a taxonomy are closely related.
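A minimal sketch of such a term record, in Python, may make the idea concrete. The field names (use, used_for, scope_note, and so on) are illustrative assumptions, not names prescribed by the standard; the sample terms are taken from the examples later in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermRecord:
    """One record in a thesaurus database (field names are hypothetical)."""
    term: str
    use: Optional[str] = None                            # USE: preferred term, if this is an entry term
    used_for: List[str] = field(default_factory=list)    # UF: nonpreferred equivalents
    broader: List[str] = field(default_factory=list)     # BT
    narrower: List[str] = field(default_factory=list)    # NT
    related: List[str] = field(default_factory=list)     # RT
    scope_note: str = ""                                 # scope of the term in the database
    history_note: str = ""                               # e.g. when the term became valid or invalid

    @property
    def is_preferred(self) -> bool:
        # A term that points elsewhere with USE is not itself used in indexing.
        return self.use is None


coleoptera = TermRecord(term="Coleoptera", used_for=["Beetle"], narrower=["Abacidus"])
beetle = TermRecord(term="Beetle", use="Coleoptera")
```

A record for a taxon could carry its authors, references, and dates in further note fields of the same kind.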
Most thesauri are designed to facilitate access to the information contained within one database or a group of specific databases, rather than to cover a discipline as a whole. This is not a requirement, however, and one of the most important thesauri, the Art and Architecture Thesaurus, is explicitly designed to cover a set of designated subject fields. It has been put to use in indexing a number of different databases, but the scope or requirements of these databases have not constrained its development, though they have sometimes resulted in development of extensions or new capabilities for the thesaurus.
The standard contains considerable detail on structure of terms which need not be repeated here. Basically, single concepts are preferred:
Parasites and Infection (as separate terms), not Parasitic infection
This is a guideline rather than an inflexible rule, and it is common in thesauri serving large databases to include complex terms for concepts in the core subject area of the thesaurus; this facilitates access for users and reduces false retrievals.
Multiword terms are written in direct (adjective-noun) rather than inverted order:
Mumps virus not Virus, mumps
Homographs are differentiated by means of a qualifier, usually in parentheses; this provision could be used to distinguish ambiguous taxa, e.g.:
Lemur (Hapalemur) USE: Hapalemur
The members of the genus Hapalemur were formerly classified in the genus Lemur; this reference provides access via the obsolete name.
Three basic types of relationship are permitted by a standard thesaurus: equivalence, hierarchical, and associative; a variety of specialized relationships may be included within one of these three types. Equivalence relationships are defined between terms which are synonyms or near-synonyms, as well as terms which are treated as equivalent for purposes of the thesaurus and/or of the database which it supports. The relationship between an accepted taxon name and its synonym is an equivalence relationship. The conventional notation for this type of relationship is UF/USE (Used For/Use), e.g.:
Coleoptera UF: Beetle
Beetle USE: Coleoptera
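The equivalence relationship can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a prescribed implementation: the table name USE and the two function names are hypothetical, and each nonpreferred term is assumed to point to exactly one preferred term.

```python
from typing import Dict, List

# USE references: nonpreferred entry term -> preferred term.
USE: Dict[str, str] = {
    "Beetle": "Coleoptera",
    "Lemur (Hapalemur)": "Hapalemur",
}


def preferred(term: str) -> str:
    """Return the preferred term for an entry term, or the term itself."""
    return USE.get(term, term)


def used_for(term: str) -> List[str]:
    """Derive the reciprocal UF list of a preferred term from the USE references."""
    return sorted(entry for entry, target in USE.items() if target == term)
```

Deriving UF from USE, rather than storing both, keeps the two halves of each reciprocal pair from drifting apart.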
In standard thesauri, there are three kinds of hierarchical relationship: generic-specific, instance, and part-whole. The conventional notation is BT/NT (Broader Term/Narrower Term). The relationship between one taxon and another of lower rank is a classic example of a generic-specific relationship, e.g.:
Coleoptera NT: Abacidus
Abacidus BT: Coleoptera
The instance and part-whole relationships appear to have little use in a vocabulary which is strictly limited to taxonomy, but if the taxonomic terms are embedded among other kinds of terms, these other kinds of hierarchical relationships may be of value. An example of the instance relationship is that between Mountain ranges and Alps; the relationship between the leg and the knee is a whole/part relationship.
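As a sketch of how the reciprocal BT/NT pair might be maintained, the Python fragment below records both directions in one step and then walks down the hierarchy to collect everything beneath a term. The function names are hypothetical, and the sketch does not distinguish the generic, instance, and part-whole subtypes.

```python
from collections import defaultdict
from typing import Dict, List, Set

broader: Dict[str, Set[str]] = defaultdict(set)    # term -> its Broader Terms
narrower: Dict[str, Set[str]] = defaultdict(set)   # term -> its Narrower Terms


def add_bt(term: str, bt: str) -> None:
    """Record 'term BT bt' and the reciprocal 'bt NT term' together."""
    broader[term].add(bt)
    narrower[bt].add(term)


def all_narrower(term: str) -> List[str]:
    """Collect every term below `term`, at any depth."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        # .get() avoids creating empty entries for leaf terms.
        for nt in narrower.get(stack.pop(), ()):
            if nt not in found:
                found.add(nt)
                stack.append(nt)
    return sorted(found)


add_bt("Abacidus", "Coleoptera")
add_bt("Gasterosteiformes", "Actinopterygii")
add_bt("Gasterosteidae", "Gasterosteiformes")
```

A search on a higher taxon could use all_narrower to include documents indexed under any of its subordinate taxa.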
The final type of standard thesaurus relationship is the associative relationship; this relationship is defined negatively, and can thus include a number of different conceptual types. An associative relationship is any nonhierarchical semantic relationship between a pair of preferred or authorized terms. In thesauri, this relationship will be present in such cases as a material and its properties, or an organism and its role, e.g., parasite. The notation conventionally used is RT/RT (Related Term/Related Term), e.g.:
Cestoda RT: Parasites
Parasites RT: Cestoda
Potential uses for this relationship in strictly taxonomic thesauri include linking living and extinct taxa which are considered to be related, e.g., Carnivora and Creodonta, as well as linking other sister taxa where the relationship is considered especially significant. The thesaurus standard excludes the definition of relationships between all siblings (terms with the same Broader Term) because this is redundant.
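Because the associative relationship is symmetric, a thesaurus file needs to record it in both directions. A minimal Python sketch, with a hypothetical function name and using the term pairs mentioned above:

```python
from collections import defaultdict
from typing import Dict, Set

related: Dict[str, Set[str]] = defaultdict(set)   # term -> its Related Terms


def add_rt(a: str, b: str) -> None:
    """Record 'a RT b' and the reciprocal 'b RT a' together."""
    if a == b:
        raise ValueError("a term cannot be related to itself")
    related[a].add(b)
    related[b].add(a)


add_rt("Cestoda", "Parasites")
add_rt("Carnivora", "Creodonta")
```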
Another part of the navigational apparatus of a standard thesaurus is notes. These may be of any number and kind, depending on the requirements of the application. All of the supporting information -- authors, references, dates, etc. -- that is found in the record for a taxon would be included in a series of specialized notes in a thesaurus application. However, the relationships to other taxa which are often included in notes in taxonomic files usually can be better expressed by means of the Related Term relationship in a thesaurus.
A variety of other organizational devices may be adopted in a thesaurus in order to provide additional means of navigating within the structure. Two of the most common are classifications of terms and "node labels." While term classifications are related to the hierarchical BT/NT structure, they are likely to be more pragmatic, deviating from strict hierarchy in the interest of developing additional useful groupings of terms. Classifications range from flat categories (broad classes with a single-level listing of terms included in the class), as in the Thesaurus of ERIC Descriptors, to full structures, complete with notations, as in the tree structure of Medical Subject Headings (MeSH).
Node labels are most commonly used to organize hierarchical displays. A classic example is the following:
Automobiles
    <by body type>
        coupe
        station wagon
    <by model>
        Buick
        Chrysler
    <by size>
        compact
        midsize
This device is most useful when there are a large number of Narrower Terms under a specific Broader Term; the organization makes it easier to grasp the organization of the list and to find useful terms without scanning a long, diverse array. The node labels are not themselves used in indexing; their purpose is purely organizational.
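The distinction between node labels and indexing terms can be sketched in Python as follows. The data layout (a mapping from node label to the terms grouped under it) and the function names are assumptions made for illustration only.

```python
from typing import Dict, List

# Narrower Terms of "Automobiles", grouped under node labels.
# The labels organize the display only; they are not indexing terms.
AUTOMOBILES: Dict[str, List[str]] = {
    "by body type": ["coupe", "station wagon"],
    "by model": ["Buick", "Chrysler"],
    "by size": ["compact", "midsize"],
}


def render(term: str, groups: Dict[str, List[str]]) -> str:
    """Lay out a hierarchical display with node labels in angle brackets."""
    lines = [term]
    for label, terms in groups.items():
        lines.append(f"    <{label}>")
        lines.extend(f"        {t}" for t in terms)
    return "\n".join(lines)


def indexable_terms(groups: Dict[str, List[str]]) -> List[str]:
    """Return only the terms that may be assigned in indexing."""
    return sorted(t for terms in groups.values() for t in terms)
```

Calling render("Automobiles", AUTOMOBILES) produces the grouped display, while indexable_terms(AUTOMOBILES) returns the six terms without the labels.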
Like a taxonomy, a thesaurus is never "finished." New findings, and reinterpretation and restating of what is already known, require that terms be added, changed, and occasionally deleted. Continued usefulness of the thesaurus requires an ongoing commitment to updating.
The most common application of thesauri is for vocabularies dominated by topical subjects. These vocabularies typically present the most difficult problems from a navigational point of view. The concepts are more diverse and abstract, and it can be difficult for both indexers and searchers to determine the most appropriate term(s) for labeling a document or using in a query.
However, a thesaurus structure can be used for almost any kind of information. Any kind of relationship between information items can be included within the standard thesaurus relationships -- equivalence, hierarchical, and associative -- though some applications can benefit from more detailed differentiation of relationship types.
Even if the conventional relationship notations (BT/NT, etc.) are not employed, the principles of thesaurus construction are of value in organizing vocabularies. The basic relationship types and the criteria for establishing them are still applicable. A thesaurus structure is hospitable to storage of any kind of useful information about an item, whether that item is a journal article, an organism, or a work of art.
A pragmatic reason for considering development of a thesaurus as an aid to organization of taxonomic information is the availability of reasonably priced off-the-shelf software. While this software has its limitations, several packages are hospitable to a variety of user-defined relationships, and they can generate reports in a number of structures and formats, including HTML, permitting files to be made available for searching via the Web. Some packages even have CGI interfaces, making real-time updating possible. The URLs for some packages in common use today are given in the references at the end of this paper.
A review of the literature and some searching of the Web turned up no cases of use of a standard thesaurus structure to organize specifically taxonomic information. However, a standard thesaurus of taxonomic terms has been developed for use in-house at Chemical Abstracts Service (CAS) (Priestley, 1998). CAS needed a taxonomic vocabulary for the specialized purpose of providing access to organisms of chemical interest. While high value was attached to use of currently accepted taxon names in indexing, and to hierarchical relationships, the full level of detail in taxonomic information was not considered important to CAS purposes. For this reason, genera are linked directly to the family, without attempting to distinguish intermediate levels such as tribes. The family is a useful point for grouping of taxa of similar chemistry, and plant families typically are more stable than tribes.
This example typifies the use of a thesaurus structure for a special-purpose taxonomic vocabulary. CAS does not register new taxa, nor does it attempt to act as an authority for taxonomic decisions. Instead, their goal is to provide access to literature about organisms whose chemistry is of interest. While a thesaurus of general subjects was being developed at the same time as the taxonomic thesaurus, the two tools were kept separate because the number of taxonomic terms (~100,000) would have overwhelmed the number of general subject terms (~15,000), even though the latter were more heavily used and had more complex relationships. A few common taxonomic names were included in the general subjects thesaurus, and further rationalization of the relationship between the two thesauri is planned.
Another use of the thesaurus standard as a guideline was found during this workshop. Randy Bellew of the Museum Informatics Project is using the standard for guidance in development of vocabularies under the aegis of this project.
A number of taxonomic files may be found on the Web and elsewhere which contain thesaurus-type information, and which would be amenable to a thesaurus structure. That is, they are organized hierarchically, they show synonymy, and/or they contain a variety of notes about each taxon. The only element usually lacking is the associative relationship, though at least some files contain information in notes that would readily translate into an associative relationship, e.g., specification of sister taxa.
An example of a possible standard thesaurus display may be found in the Appendix. Brief extracts from the Zoological Record Search Guide Systematic Thesaurus and Master Index are shown, followed in each case by the same data structured with standard thesaurus relationships. It is important to note that this is an example, not a proposal. In a real thesaurus design some additional special devices would almost certainly be used, for example to show the time period when a specific relationship was valid.
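One way such a device might look, sketched in Python: each Broader Term link carries the range of Zoological Record volumes for which it was valid. The structure and names are hypothetical; the volume ranges are those given for Gasterosteidae in the Appendix.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatedBT:
    """A BT link valid for a range of volumes (hypothetical structure)."""
    term: str
    broader: str
    first_vol: int
    last_vol: Optional[int] = None   # None means the link is still valid


LINKS = [
    DatedBT("Gasterosteidae", "Gasterosteoidei", 115, 127),
    DatedBT("Gasterosteidae", "Gasterosteiformes", 128),
]


def broader_at(term: str, volume: int) -> List[str]:
    """Return the Broader Terms of `term` that were valid for `volume`."""
    return [
        link.broader
        for link in LINKS
        if link.term == term
        and link.first_vol <= volume
        and (link.last_vol is None or volume <= link.last_vol)
    ]
```

With these links, broader_at("Gasterosteidae", 120) gives Gasterosteoidei and broader_at("Gasterosteidae", 130) gives Gasterosteiformes.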
Since thesauri can be used as aids to organizing and searching almost any kind of information, it is worth looking briefly at their value for the life sciences in general. A thesaurus can facilitate provision of access to a database by providing a controlled vocabulary that aids both in consistent organization of the information and in searching it.
The primary example of such a thesaurus is that of Zoological Record, which in addition to its hierarchically arranged Systematic Thesaurus of taxa contains a Subject Thesaurus of non-taxonomic terms which is organized hierarchically in broad categories. The terms from both hierarchical displays are listed alphabetically in a Master Index. Other thesauri such as AGROVOC (1995), Medical Subject Headings, and EMTREE include taxa of interest in the subject area among other terms.
Taxonomists need structured information tools such as thesauri for purposes other than organization of the taxonomy itself. A thesaurus structure can be used to organize the various characteristics used in descriptions of organisms. For instance, the relationships among terms such as "glaucous" or "pubescent" used to describe the surface of plant parts could be shown in a way that would facilitate their effective use in keys and elsewhere.
NISO Z39.19 thesauri provide a standardized means of organizing many kinds of information, including both conceptual and taxonomic information. Even if a specialized notation is preferred for relationships, the principles of design can still be applied. A thesaurus can be used as a tool to integrate taxonomic and non-taxonomic information. And, pragmatically, the availability of off-the-shelf software may make it possible to record the structural information of a taxonomy with minimal programming or application development.
Zoological Record kindly gave permission for use of their data and for its reformatting as an example. David Priestley of Chemical Abstracts Service provided information on the policies followed in development of the CAS in-house taxonomic thesaurus; acknowledgment is also due to CAS for permission to use this information. Any errors of transcription, manipulation, or interpretation are, however, the responsibility of the author.
AGROVOC: Multilingual Agricultural Thesaurus. 3rd ed. Rome: Food and Agriculture Organization, 1995.
Art and Architecture Thesaurus. Santa Monica, CA: Getty Information Institute. (http://www.gii.getty.edu/vocabulary/aat.html)
EMTREE Thesaurus. Amsterdam: Elsevier Science. Annual.
International Organization for Standardization. Documentation -- Guidelines for the Establishment and Development of Monolingual Thesauri. 2nd ed. n.p.: ISO, 1986. (ISO 2788-1986(E)). (Available in the U.S. from American National Standards Institute)
International Organization for Standardization. Documentation -- Guidelines for the Establishment and Development of Multilingual Thesauri. n.p.: ISO, 1985. (ISO 5964-1985(E)). (Available in the U.S. from American National Standards Institute)
National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. 69p. (ANSI/NISO Z39.19-1993) (http://www.niso.org/obtainst.html)
Medical Subject Headings. Bethesda, MD: National Library of Medicine. Annual.
Priestley, David. Personal communication, April 3, 1998.
Hierarch (Systematics Information Systems Pty): http://www.ozemail.com.au/~sisnsw/hierarch.htm
Lexico (Project Management Enterprises, Inc.): http://www.pmei.com/lexico/lexico.html
MultiTes (MultiSystems, Inc.): http://www.multites.com
TCS (Liu-Palmer Inc.): http://www.liu-palmer.com
Extract from Zoological Record Systematic Thesaurus
Note: For space considerations, not all positions of every taxon are shown. The extract is more complete for lower-level taxa.
Pisces (Group)
    Actinopterygii (Class)
        Gasterosteiformes (Order): Valid position for v.128-; see also under Acanthopterygii
            Gasterosteidae (Family): Valid position for v.128-; see also under Gasterosteoidei. Related term Aulorhynchidae
            Gasterosteoidei (Suborder): Valid for v.115-127; for v.128- indexed under Gasterosteiformes
                Aulorhynchidae (Family): Valid for v.115-127; for v.128- indexed under Gasterosteidae
                Gasterosteidae (Family): Valid position for v.115-127; see also under Gasterosteiformes. Related term Aulorhynchidae
Zoological Record Systematic Thesaurus Data in Standard Thesaurus Format
Note: Node labels are used in this example to assure that the position of a particular taxon is unambiguously indicated. This might not be necessary if every taxon were unambiguously placed in the hierarchy.
Pisces
    <Class>
    Actinopterygii
        <Order>
        Gasterosteiformes
            <Family>
            Gasterosteidae
            <Suborder>
            Gasterosteoidei
                <Family>
                Aulorhynchidae
                Gasterosteidae
Extract from Zoological Record Master Index
Actinopterygii, E-110, E-146
Aulorhynchidae (v.115-127), E-167
    indexed under Gasterosteidae (v.128-), E-116, E-167
Gasterosteidae, E-116, E-167
    see also Aulorhynchidae
Gasterosteiformes, E-115, E-167
Gasterosteoidei (v.115-127), E-167
    indexed under Gasterosteiformes (v.128-), E-115, E-167
Zoological Record Master Index Formatted as Standard Alphabetical Display
Actinopterygii
    BT Pisces
    NT <Order>
       Gasterosteiformes
Aulorhynchidae
    BT Gasterosteoidei
    RT Gasterosteidae
    HN v.128- indexed under Gasterosteidae; valid for v.115-127
Gasterosteidae
    BT Gasterosteiformes
       Gasterosteoidei
    RT Aulorhynchidae
    HN v.128- under Gasterosteiformes; v.115-127 under Gasterosteoidei
Gasterosteiformes
    BT Actinopterygii
    NT <Family>
       Gasterosteidae
       <Suborder>
       Gasterosteoidei
Gasterosteoidei
    BT Gasterosteiformes
    NT <Family>
       Aulorhynchidae
       Gasterosteidae
    HN v.128- indexed under Gasterosteiformes; valid for v.115-127
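An alphabetical display of this kind can be generated mechanically from term records. The Python sketch below is illustrative only: the record layout and function name are assumptions, the position of UF in the tag order is assumed (the other tags follow the order used above), and node labels are omitted for brevity.

```python
from typing import Dict, List

# Each record maps a relationship indicator to its values.
RECORDS: Dict[str, Dict[str, List[str]]] = {
    "Gasterosteidae": {
        "BT": ["Gasterosteiformes", "Gasterosteoidei"],
        "RT": ["Aulorhynchidae"],
        "HN": ["v.128- under Gasterosteiformes; v.115-127 under Gasterosteoidei"],
    },
    "Actinopterygii": {"BT": ["Pisces"], "NT": ["Gasterosteiformes"]},
}

# Order in which the indicators appear under each term.
ORDER = ["UF", "BT", "NT", "RT", "HN"]


def alphabetical_display(records: Dict[str, Dict[str, List[str]]]) -> str:
    """List terms alphabetically, each followed by its tagged relationships."""
    lines: List[str] = []
    for term in sorted(records):
        lines.append(term)
        for tag in ORDER:
            for i, value in enumerate(records[term].get(tag, [])):
                # Print the tag once; align later values beneath the first.
                prefix = tag if i == 0 else "  "
                lines.append(f"    {prefix} {value}")
    return "\n".join(lines)
```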