Indexicon, The Only Fully Automatic Indexer: A Review

© 1994 Nancy Mulvany and Jessica Milstead

Just what is automatic indexing? Basically, it is the assignment of index terms to information items by a computer algorithm rather than by a human being. Automatic indexing became a major research area in the late 1950s, as one element of the developing interest in automatic text analysis. At that time, computers were beginning to be applied to work other than number-crunching. This was also the period of growing awareness of the "information explosion."

For the first 25 years or so of research in automatic indexing, efforts were severely constrained both by the computing power available relative to the demands of text analysis and by the lack of material in usable electronic form. Only in the past decade has this situation begun to change, as computers have become vastly more powerful and more text has become available.

At the same time, however, the primary emphasis had shifted from "automatic indexing" to "natural language retrieval," i.e., from algorithms for assigning index terms to text, to algorithms that directly aid search of the text itself. This searching of the text itself naturally implies online search, not development of a printed index.

More recently, there has been a resurgence of interest in automatic indexing, as it has become clear that it is not realistic to reprocess large amounts of text (gigabytes, that is, thousands of megabytes) for every query put to an information system. The emphasis remains, however, on search of electronically stored information, not of printed texts.

The automatic or machine-aided (automatic with human review of the results) indexing systems in use today are those of large database producers such as Reuters, NASA, or the American Petroleum Institute. These systems are all designed for use with Boolean or other search methods that rely on combining separate concepts at the time of search, rather than building an index string that fully represents the different aspects of a piece of information.

Automatic indexing systems can have a number of components, though not all may be present in a single system.

Automatic indexing systems vary greatly in sophistication. The most simpleminded are those that merely compare text against a stored vocabulary. These systems cannot give very good results, because it is impossible to encode explicitly in a vocabulary all the different ways a concept may be expressed.

For instance, the concept of nonnative species of plants may be referred to in a variety of ways such as:

nonnative species

exotic species

exotics

introduced plants

invaders

barbarians

seeds arriving in ship ballast

seed contaminants

garden escapes

A simple automatic indexing system cannot begin to detect these means of expressing the concept. A sophisticated one will have more success because it will take context more into account, but it is still unlikely to catch all of the variants.
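The limitation of vocabulary matching can be illustrated with a minimal sketch. The vocabulary contents, the function name, and the sample sentences are all hypothetical, chosen only to mirror the nonnative-species example above; this is not a description of any particular product.

```python
import re

# Hypothetical stored vocabulary: surface phrase -> index term.
VOCABULARY = {
    "nonnative species": "nonnative species",
    "exotic species": "nonnative species",
    "introduced plants": "nonnative species",
}

def match_vocabulary(text, vocabulary=VOCABULARY):
    """Return the index terms whose stored phrases occur verbatim in the text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return {term for phrase, term in vocabulary.items() if phrase in normalized}

# A phrase that happens to be in the vocabulary is found.
found = match_vocabulary("Exotic species often crowd out native grasses.")

# The same concept, worded differently, is missed entirely.
missed = match_vocabulary(
    "Many weeds began as garden escapes or arrived in ship ballast."
)
```

The second call returns an empty set: the concept is present in the sentence, but none of the stored phrases is, and adding more phrases to the vocabulary only postpones the problem.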

A different approach is to detect phrases by means of syntactic analysis, and select indexing terms (words and phrases) on the basis of frequency, rather than relying on a preexisting vocabulary. Our testing shows that this is the approach taken by Indexicon. Phrases are detected, and rearranged to make headings and modifiers. Unfortunately, as our review has shown, the algorithms used cannot substitute for human intelligence in synthesizing entries.

Indexicon: Overview and Operation

Indexicon for WordPerfect (version 1.00b) was tested on a 486DX 33MHz system running Windows 3.10 with 16Mb of RAM and WordPerfect 6.0a for Windows. Indexicon is an add-on product for WordPerfect 6 for Windows. Indexicon embeds tags for index entries in WordPerfect files. The index is generated by WordPerfect. The publisher, Iconovex Corp., plans to release versions of Indexicon for other document processing programs in the future.

The installation of Indexicon was quick and smooth. Indexicon sets itself up as an additional option on the WordPerfect Tools Menu and as an icon on the Toolbar. The program is distributed in a box with two disks and comes with a ten-page User’s Guide.

Operation of the program is quite simple, just as the Indexicon box claims: "CLICK. You’ve just indexed an entire document." The Indexicon submenu offers four options: Create Index, Remove Index, Edit Exclusions, and Help.

The Create Index option is the command that embeds tags for index entries in the text file. When this command is chosen, there are two options to set. First, you can have the index automatically generated at the end of the document in one pass. When Indexicon is finished marking the document, WordPerfect’s index generator will be called and the index will appear at the end of the document. Second, the Level of Detail for the index is set here. The manual states, "Indexicon offers six index levels for you to choose from. Level 1 includes only the most significant indexable terms; level 6 includes all the indexable terms." After these two options are set, clicking the OK button sets Indexicon in motion.

However, before the text is indexed, it is likely that it will be necessary to Edit Exclusions. This command allows you to mark text in the document that will be excluded from indexing. There are a few text structures that Indexicon will automatically exclude, such as the Table of Contents. The Edit Exclusion Zones box allows the user to mark text to be excluded, review exclusion zones already present and remove the zone if desired, and remove all exclusion zones.

After Exclusion Zones have been marked and the Level of Detail set up, Indexicon marks index entries in the file. The User’s Guide contains a small section called "Getting the Best Index." Iconovex recommends that users generate indexes with different levels of detail so as to find the best level for their documents. Users are told that Indexicon "may skip over single words or short, isolated phrases such as those that are often used for headings." The manual suggests that users mark such phrases manually using the WordPerfect indexing facility: "the easiest way to produce the final index is to use WordPerfect’s indexing features first, then run Indexicon."

Testing and Results

The claims made on Indexicon’s box influenced the nature of our testing. Here are some quotes from the box:

Indexicon pays for itself after 50 pages of referencing* (*Based on professional indexing pricing of $3 per page)

The Standard for Indexing is Here!

With just a click of the mouse, you create back-of-the-book indexes

Produce professional quality indexes at a rate of up to 50 pages per minute

We set out to answer the following questions: Is Indexicon a cost-effective replacement for a human indexer? Does Indexicon adhere to publishing industry standards for indexing? Can Indexicon create a back-of-the-book index that conforms to the "Function of an Index" definition in the American, British and international standards (ANSI Z39.4-199x Draft 4.1, BS:3700, and ISO 999)? Can Indexicon produce a professional quality index?

In order to answer these questions we compared Indexicon’s index entries with those of professional indexers. Two texts were used for the test: Chapter 5, "Arrangement of Entries," from Indexing Books (University of Chicago Press, 1994) and Chapter 5, "Sorting the Index," from Macrex User Guide (Bayside Indexing Service, 1993).

These texts were imported into WordPerfect. Page breaks were inserted so that the pagination matched that of the published books. In both texts, many exclusion zones were marked. Indexicon’s Level of Detail was set at six, the maximum. We did not set the program for automatic index generation since we wanted to run the WordPerfect index generator separately. Also, we did not manually mark any index entries. Indexicon claims to be a fully automatic indexer, so that is what was tested.

Carolyn McGovern wrote the index for Indexing Books and Ty Koontz wrote the Macrex User Guide index. Entries for the individual chapters were extracted from the full indexes. No cross-references were included because Indexicon cannot provide cross-references. The McGovern and Koontz index entries were imported into the Macrex Indexing Program and indexes were generated conforming to the University of Chicago Press style.

The McGovern and Koontz indexes are of professional quality. These indexes conform to the de facto U.S. publishing industry standard format as outlined in The Chicago Manual of Style. In addition, these indexes meet the Function of an Index criteria of the national and international standards. The McGovern entries are from a general reference book about indexing. The entire index for the book averaged approximately seven entries per page. The Koontz entries are from a technical reference manual for an indexing program. The entire index for the manual averaged approximately 15 entries per page. Both texts were heavily indexed by the human indexers, which is why we set Indexicon’s Level of Detail to the maximum.

Figure 1 and Figure 2 display the McGovern and Indexicon entries for Chapter 5 of Indexing Books. Figure 3 and Figure 4 display the Koontz and Indexicon entries for Chapter 5 of the Macrex User Guide. Even a cursory examination of the Indexicon entries indicates a lack of the analysis and synthesis expected in a professional quality index. Even a simple task such as merging singular and plural forms of entries is not performed ("ampersand/ampersands"). The program found no continuous discussion of any topic in the Indexing Books chapter.
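To show how small the singular/plural task is, here is a rough sketch of merging headings that differ only by a trailing "s". The function names and the page numbers are hypothetical, and stripping a final "s" is a deliberately crude assumption rather than real stemming; irregular plurals would need a proper word list.

```python
from collections import defaultdict

def singular_key(heading):
    """Crude normalization: lowercase and strip one trailing 's'."""
    key = heading.lower()
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key

def merge_plural_forms(entries):
    """Merge (heading, page) pairs whose headings differ only by a final 's'.

    Returns a dict mapping the display heading to its sorted page numbers.
    """
    pages = defaultdict(set)
    label = {}
    for heading, page in entries:
        key = singular_key(heading)
        pages[key].add(page)
        # Keep the longer (plural) form as the display heading.
        if key not in label or len(heading) > len(label[key]):
            label[key] = heading
    return {label[key]: sorted(found) for key, found in pages.items()}

# Page numbers below are made up for illustration.
merged = merge_plural_forms(
    [("ampersand", 118), ("ampersands", 121), ("commas", 119)]
)
```

With these made-up inputs the two ampersand headings collapse into a single "ampersands" entry carrying both page references.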

A particularly glaring problem with the Indexicon listing is the omission of entries. Recall that Indexicon was set at the maximum Level of Detail (includes "all the indexable terms"). A simple comparison of Figures 1 and 2 indicates a serious problem with the thoroughness of the Indexicon list. Surely terms such as "adjectives," "commas," and "biographies" would be expected to be recognized by a semantic analysis engine. Also note that there are no entries for "sorting," "filing order," or "arrangement"--major themes of the chapter. Lack of completeness in any index is a serious flaw.

The Indexicon list (Fig. 2) also lists silly entries, such as "dogs and cats" and "funeral of character." In fairness to the program, these text strings should have been marked as exclusion zones. These phrases appeared within sentences in the text. The Exclusion Zones that were marked were the lengthy examples that appeared as display material. It would have been very tedious to mark each and every word or phrase that we did not want indexed. We did not have to indicate any Exclusion Zones for the human indexer; the human indexer did not need to be told that terms used in the displayed examples of various alphabetizing methods should not be indexed. After running Indexicon without Exclusion Zones, it was apparent that we had to mark the many lists of examples in the text.

Not only does Indexicon suffer from a lack of completeness as noted above, it is also incapable of recognizing concepts, i.e., terms that do not appear verbatim in the text. For example, on page 123 of the text there is a discussion of the arrangement of words that are spelled alike; these words are called homographs. Despite the claim in an Iconovex press release that Indexicon "understands the subtleties of the English language," in our tests the program has demonstrated absolutely no ability to provide conceptual entries.

Lastly, formatting in the text is not carried over to the index entries. Throughout the text the phrase, "Chicago Manual of Style," appears in italic. This formatting is not included in the embedded index entries for this phrase.

The Indexicon entries for the Macrex User Guide were also closely examined. Indexicon doesn’t know about:

As an example, here is how Indexicon treated a related group of entries for the University of Chicago, its Press, and its style manual.


TEXT                                            INDEXICON ENTRY  
University of Chicago Press recommendations     Chicago
                                                   University of

letter-by-letter style of the University        Chicago
   of Chicago Press                                letter-by-letter style of 
                                                   University of

Chicago Manual of Style, 14th ed., 17.109       no entry

University of Chicago Press                     Chicago
                                                   University of

Chicago Manual of Style (17.97)                 Chicago Manual of Style

"Chicago-style" letter-by-letter sorting        Chicago-style letter-by-letter

17.97 of the 14th edition 
   of the Manual of Style                       Manual of Style
                                                   17.97 of 14th edition of

University of Chicago Press                     Chicago
                                                   University of 

(Chicago Manual of Style, 17.97)                no entry

These entries indicate several things (aside from the painful result of reliance on transcribing the wording of the text):

Its criteria for selecting terms for indexing are indeterminate. Most natural language processing software relies on word frequency as an important criterion for selection. Frequency within the document is always treated in relation to frequency in general text -- that is, the criterion is not the absolute frequency of a word in the text being analyzed, but its frequency in the text relative to its frequency in some body of text used for comparison. Sometimes software of this sort uses medium frequency as a criterion, on the theory that words that occur very often are common, non-information bearing words, while those that occur very, very rarely or only once, are typically not of as much value either. For one thing, almost all misspellings fall into the latter category.
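The relative-frequency criterion described above can be sketched in a few lines. This is a generic illustration of the idea, not Indexicon's method, which is undocumented; the function names, the thresholds, the add-one smoothing, and the tiny sample texts are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency_terms(document, reference, min_count=2, min_ratio=2.0):
    """Select words that are frequent in the document relative to a reference text.

    Words occurring fewer than min_count times are skipped (rare words and
    most misspellings); words whose rate is not at least min_ratio times the
    reference rate are skipped as common, non-information-bearing words.
    """
    doc_counts = Counter(tokenize(document))
    ref_counts = Counter(tokenize(reference))
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    selected = {}
    for word, count in doc_counts.items():
        if count < min_count:
            continue
        doc_rate = count / doc_total
        # Add-one smoothing so words absent from the reference do not divide by zero.
        ref_rate = (ref_counts[word] + 1) / (ref_total + 1)
        ratio = doc_rate / ref_rate
        if ratio >= min_ratio:
            selected[word] = ratio
    return selected

document = "sort the index then sort the subentries and sort the index again"
reference = "the cat and the dog sat on the mat and the cat slept again then"
selected = relative_frequency_terms(document, reference)
```

On this toy input "sort" and "index" are selected, "the" is rejected as common, and the words that occur only once are dropped. A real system would use a large body of general text as the reference.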

"Subentry" occurs a few times, and is never indexed. On the other hand, "MACREX" occurs on almost all pages, frequently several times, and it is faithfully indexed. "Sort," the subject of the test chapter, occurs about as often as "MACREX" -- far more often than it would in natural language -- but receives no entry. Furthermore, two "words" which occur only once are indexed. These are "howeverthat" and "seriesare." These look like typos, but they are Indexicon-generated, by dropping an em-dash between two words.

It likes "and" phrases, and regularly indexes them, e.g., Conjunctions and articles.

Two occurrences of "prepositions, conjunctions, and articles" generate: Conjunctions and articles.

One occurrence of "prepositions, articles and conjunctions" generates: Conjunctions, articles and.

One occurrence of "prepositions and articles" generates: Prepositions and articles.

These entries give rise to several questions: Why only the one entry in the index for "prepositions"; why no entry under "A" for "articles"? And why the phrase in one case and the modifier in the other? A partial answer to the second question is possible: Indexicon only changes the word order of a phrase by making a heading-modifier combination. But why it made this particular sequence instead of, for instance, "Articles and conjunctions" is hard to determine. Perhaps because the "Conjunctions and articles" entries occurred first, and the software is smart enough to put the occurrences of "conjunctions" next to each other but not smart enough to go any further?
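The heading-modifier rearrangement seen in the entries quoted in this review can be imitated with a simple rotation of the phrase around a chosen word. The sketch below is a guess at the surface pattern only: the function name is hypothetical, the pivot word is supplied by hand because how the program picks it is exactly what cannot be determined, and dropping "the" is inferred from the sample entries.

```python
def invert_phrase(phrase, pivot):
    """Split a phrase at the pivot word and return (heading, modifier).

    The heading runs from the pivot to the end of the phrase; the words
    before the pivot become the modifier. Occurrences of "the" are dropped.
    """
    words = [word for word in phrase.split() if word.lower() != "the"]
    index = [word.lower() for word in words].index(pivot.lower())
    heading = " ".join(words[index:])
    modifier = " ".join(words[:index])
    return heading[0].upper() + heading[1:], modifier

entry = invert_phrase(
    "17.97 of the 14th edition of the Manual of Style", "Manual"
)
```

Nothing in such a rotation involves deciding what the passage is about, which is consistent with entries that read as transcribed text rather than synthesized headings.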

We had expected Indexicon to perform better with the text from a software user manual. This sample chapter is from a technical document that is a reference manual. However, the problems that plagued Indexicon in the Indexing Books text become even more obvious in the technical documentation text.

By its very nature, technical documentation calls for detailed indexing. The lack of completeness in the Indexicon list (compare Figures 3 and 4) is glaring. One recurring theme in this chapter is the default settings of the program. All programs are shipped with some sort of default, built-in settings that can often be changed by users. As Figure 4 indicates, there is no entry for "default". Also, the program demonstrates no ability to identify concepts or to gather related information together.

Indexicon introduces strange entries that it creates itself. For example, the entry "Machine and space occupied memory" is derived from a paragraph that discusses the amount of memory available for sorting. In particular, it appears that the entry results from the strange parsing of this sentence: "The actual figure may vary according to the amount of RAM in your machine and the space occupied by memory-resident programs."

Another document we used for testing was The National Information Infrastructure: Agenda for Action. This 91Kb document is the Administration’s policy paper about the national information infrastructure (aka Information Superhighway). It contains references to many people, states, and programs that are already in place. Here is one paragraph from the NII paper along with the Indexicon entries for that paragraph.

In May 1993, Governor Jim Hunt announced the creation of the North Carolina Information Highway, a network of fiber optics and advanced switches capable of transmitting the entire 33-volume Encyclopedia [sic] Britannica in 4.7 seconds. This network, which will be deployed in cooperation with BellSouth, GTE, and Carolina Telephone, is a key element of North Carolina’s economic development strategy.

Indexicon entries (in order of appearance)

Carolina Information Highway
   creation of North
Carolina’s economic development strategy
   key eleme
Optics and advanced switches
   network of fiber
Encyclopedia Britannica
   33-volume
BellSouth
Telephone
   Carolina

From this example, it appears that Indexicon:

Indexicon and WordPerfect

As noted earlier, Indexicon does not replace the WordPerfect index generator. Indexicon’s function is to embed index tags in WordPerfect text files. After the tags are embedded, WordPerfect generates the index. This allows users to add their own embedded index entries to files in addition to those added by Indexicon. This symbiotic relationship between Indexicon and WordPerfect also means that the generated index will reflect WordPerfect’s index generation limitations.

Indexes generated by WordPerfect can only have main headings and subentries; sub-subentries are not allowed. Each level is limited to 64 characters. Print formatting, such as italic, is not automatically carried from the text to the index entry. The sorting used by WordPerfect does not conform to any recognized method of alphabetizing. For example, WordPerfect generates index entries in the following order:

"Names, Names, Names"

abcd

Intel 80386

Intel 8088

type font

type foundry

Type, Mary Alice

TYPE-ADF command

Type/Specs Inc.

typeface

typeset

This arrangement order is unacceptable. As we know, standards for indexes primarily address the presentation of indexes. Indexicon will have to dispense with its claim, "The Standard for Indexing is Here!", so long as it is tied so closely with another program that fails to meet the presentation criteria of any standard for indexes.
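The order shown above is what a plain character-code comparison with case folded produces, as the following sketch demonstrates. That is an observation about this sample, not a statement of how WordPerfect is implemented. The letter-by-letter function is a simplified approximation (it ignores everything except letters and digits) and omits the finer rules of any published standard; both function names are hypothetical.

```python
import re

# The entries in the order WordPerfect generated them.
ENTRIES = [
    '"Names, Names, Names"',
    "abcd",
    "Intel 80386",
    "Intel 8088",
    "type font",
    "type foundry",
    "Type, Mary Alice",
    "TYPE-ADF command",
    "Type/Specs Inc.",
    "typeface",
    "typeset",
]

def character_code_order(entries):
    """Sort by raw character codes after folding case."""
    return sorted(entries, key=str.lower)

def letter_by_letter_order(entries):
    """Simplified letter-by-letter sort: ignore spaces and punctuation."""
    def key(entry):
        return re.sub(r"[^a-z0-9]", "", entry.lower())
    return sorted(entries, key=key)

by_code = character_code_order(list(reversed(ENTRIES)))
by_letter = letter_by_letter_order(ENTRIES)
```

Sorting by character code puts the quotation mark, space, comma, hyphen, and slash ahead of every letter, which is why the quoted title files first and "type font" precedes "typeface". In the simplified letter-by-letter order the quoted title files under N and "typeface" precedes "type font".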

Another shortcoming of the Indexicon-WordPerfect relationship is that editing of index entry tags must still be performed using WordPerfect’s tedious, error-prone process. Since Indexicon does not replace WordPerfect’s index generator, we are far removed from the day when the index is dynamically linked to the embedded tags so that the index itself can be edited rather than editing the tags.

Indexicon performs its tasks quickly and easily as claimed. The NII document (91Kb) took 4 minutes, 25 seconds for Indexicon to mark. Indexicon is even faster removing its tags; it took only 32 seconds to remove the tags from the NII document.

Customer Support

Customer support is available by phone (not toll free), fax, and by CompuServe email. We contacted Iconovex by phone with a query. The call was handled promptly and the problem resolved quickly.

Summary

Unfortunately the documentation, while adequate for running Indexicon, does not offer any explanation of how Indexicon decides to choose index entries. Given the strange results, we have had to guess at what is going on behind the scenes. The program stumbles over simple matters, such as singular and plural forms of words, personal names, and names of states; even "New York City" is not recognized.

Indexicon has not demonstrated the ability to produce a professional quality book index. Even applying the most minimal of criteria, Indexicon cannot create a sensible and thorough back-of-the-book index. Indexicon, even when set at the maximum Level of Detail, does not provide a complete and thorough index. Far too many obvious and important topics are missing. The program demonstrates no ability to identify concepts in the text. Despite the claim that Indexicon "understands the subtleties of the English language," it cannot create cross-references, the lack of which severely compromises the quality of any index. The program introduces errors and nonsensical entries; e.g., "Howeverthat" and "Machine and space occupied memory."

Although Indexicon performs its tasks very quickly, ultimately the cost-effectiveness of this program will be eradicated by the time it would take to prepare the text and clean up the index. While technical publications departments may be attracted by the idea of automatically producing indexes for their documentation, Indexicon is not the shortcut it claims to be.

First, the text must be prepared before Indexicon is run. Exclusion zones must be marked. The best way to approach this tedious task would be to run Indexicon without Exclusion Zones. Then you will see the types of material which must be marked for exclusion. In our sample texts, some of the necessary Exclusion Zones were obvious and easy to mark. However, the process bogs down when it is necessary to go into specific paragraphs and mark individual words and phrases within sentences.

Because Indexicon will overlook so many important index entries, it will be necessary to insert tags manually for these entries in your text files. For recurring oversights that appear verbatim in the text, one can build a control list in a WordPerfect concordance file. This will automate part of the process. However, since so many entries in a properly written index do not appear verbatim in the text, it will be necessary to mark many entries by hand.

After Indexicon has been run it will be necessary to clean up the index. If one chooses a low Level of Detail there will be less to clean up, but there will be even more entries that will need to be manually inserted. Technical writers will be tempted to edit only the generated index, not the embedded index tags within the text file. This common practice defeats the entire purpose of embedding index entries in text files because when the index is generated at a later date the original, unedited index tags remain in the file. However, even in our small test files the prospect of editing Indexicon’s index entry tags in WordPerfect was overwhelming. Since no special text formatting, such as italics, is carried from the text to the index, it will have to be added manually. Lastly, since Indexicon is incapable of producing cross-references, they will have to be added manually.

As users have indicated so clearly (Grech 1992), the index is the most important component of technical documentation. Many technical publications managers are well aware of the importance of the index and would welcome software tools that would help improve the quality of their indexes. Indexicon is not the answer (compare Figures 3 and 4).

Automatic indexing was never intended to produce back-of-the-book indexes. As Indexicon demonstrates so well, back-of-the-book indexes cannot be automatically generated. A proper book index is much more than text strings formed into index structures as main headings and subentries. The logical domain for automatic indexing methods is the electronic environment where these methods are used in conjunction with sophisticated searching techniques. In regard to the production of a book index, this is the wrong tool for the job--a bigger dictionary, a more extensive thesaurus, more clever parsing and weighting algorithms will not help. Our test results (Figs. 1-4) could be used as a Turing Test for Indexes. Can you tell which ones were done by a computer?

The claims made on the Indexicon box are disturbing given the long tradition of book indexing. Indexicon is not a replacement for a human indexer. We could have reused the title of the front page article in the Sept./Oct. 1992 Key Words (Milstead 1992), "No, You Can’t Be Replaced by a Computer." It is unfortunate that Iconovex has focused on the application of an inappropriate technology rather than addressing the market’s need for improved computer-aided indexing tools (Mulvany 1994).

Notes

Grech, Christine. 1992. "Computer Documentation Doesn’t Pass Muster." PC Computing (April):212-14 and Dataquest Worldwide Services Group. 1993. Desktop Software Support: User Wants and Needs, 1993 Annual Edition. Framingham, MA: Dataquest Incorporated, p. 124.

Milstead, Jessica. 1992. "No, You Can’t Be Replaced by a Computer." Key Words (Sept/Oct):1.

Mulvany, Nancy C. 1994. Indexing Books. Chicago: University of Chicago Press, 277-279 and Mulvany, Nancy C. 1994. "Embedded Indexing Software: Users Speak Out." In The Changing Landscapes of Indexing: The Proceedings of the 26th Annual Meeting of the American Society of Indexers. Port Aransas, TX: American Society of Indexers, 41-51.

Indexicon for WordPerfect: System Requirements

Iconovex Corp., 7448 West 78th St., Bloomington, MN 55439; (612) 943-0292

Suggested Retail Price: $149.99

Author Bios

Nancy Mulvany is a past president of the American Society of Indexers and the author of Indexing Books. She has been involved in design issues related to indexing software since 1985. She can be reached at nmulvany@bayside-indexing.com

Jessica Milstead is Principal of The JELEM Company. She specializes in machine-aided indexing and thesaurus development. She can be reached at milstead@jelem.com

The authors would like to thank Carolyn McGovern and Ty Koontz for permission to use their work in our review.



This article first appeared in Key Words, Sep/Oct 1994 (Vol. 2, Number 5).





