The Nature of Linguistic Data and the Requirements of a Computing Environment for Linguistic Research

Gary F. Simons
Summer Institute of Linguistics
gary_simons@sil.org


This paper was originally drafted in 1993 as a chapter for a book proposed by Lawler and Dry, Computers and the Ordinary Working Linguist. The version presented here is a revision that was published in 1996 as an article in the journal Dutch Studies on Near Eastern Languages and Literature, volume 2, number 1, pages 111-128. (Note, however, that the bibliography has been annotated to add Web links and updated to report the eventual details of works originally cited as "forthcoming".)

The book finally came out in 1998 with a new title, and the paper was further revised and expanded by about 20%. The citation for the full published version is:

Simons, Gary F. 1998. The nature of linguistic data and the requirements of a computing environment for linguistic research. In Using Computers in Linguistics: a practical guide, John M. Lawler and Helen Aristar Dry (eds.). London and New York: Routledge. Pages 10-25.

Routledge maintains a Web site for the book which includes an on-line appendix that gives links to many information resources that are relevant to topics covered in this paper.


Table of contents

1. The multilingual nature of linguistic data
2. The sequential nature of linguistic data
3. The hierarchical nature of linguistic data
4. The multidimensional nature of linguistic data
5. The highly integrated nature of linguistic data
6. The separation of information from format
7. Toward a computing environment for linguistic research
References

The progress made in the last decade toward harnessing the power of electronic computers as a tool for the ordinary working linguist (OWL) has been phenomenal. As the decade of the 80s dawned, virtually no OWLs were using computers, but the personal computer revolution was just beginning and it was possible to foresee its impact on our discipline (Simons 1980). Now, fifteen years later, the personal computer is commonplace; battery-powered laptops have even made computing a routine part of life for the field linguist. But despite widespread success at getting hardware into the hands of linguists, we have fallen short of realizing the full potential of computing for the OWL. Why is this? Because commercial software does not meet all the requirements of the linguist, and the linguistic community has not yet been able to develop the software that will fill the gap.

Other articles in this book (particularly the survey by Antworth) document the software that is currently available to the OWL. There are many good tools that linguists have put to good use, but I think it is fair to say that this body of tools, for the most part, remains inaccessible to the average OWL. There are two chief reasons for this. First, there is a friendliness gap--many programs are hard to use because their one-of-a-kind user interfaces are hard to learn and easy to forget if not used regularly. The recent emergence of graphical user interface standards (such as for Windows and Macintosh) is doing much to solve this problem. Second, there is a semantic gap--many current programs model data in terms of computationally convenient objects (like files with lines and characters, or with records and fields), and then require the user to understand how these map onto the objects of the problem domain (like grammatical categories, lexical entries, and phonemes). In cases where programs do present a semantically transparent model of the problem domain, the programmer has typically had to build it from scratch using underlying objects like files, lines, and characters. While the results can be excellent, the process of developing such software is correspondingly slow.

As we look to the future, better (and faster) progress in developing software for linguists is going to depend on using methods that better model the nature of the data we are trying to manipulate. The first five sections of this article discuss five essential characteristics of linguistic data which any successful software for the OWL must account for, namely, that the data are multilingual, sequential, hierarchically structured, multidimensional, and highly integrated. The sixth section discusses a further requirement, namely, that the software must maintain a distinction between the information in the data and the appearance it receives when it is formatted for display. The concluding section briefly describes a computing environment being developed by the Summer Institute of Linguistics to meet these (and other) requirements for a foundation on which to build better software for the OWL.

1. The multilingual nature of linguistic data

Every instance of textual information entered into a computer is expressing information in some language (whether natural or artificial). The data that linguists work with typically include information in many languages. In a document like a bilingual dictionary, the chunks of data switch back and forth between different languages. In other documents, the use of multiple languages may be nested, such as when an English text quotes a paragraph in German which discusses some Greek words. Such multilingualism is a fundamental property of the textual data with which OWLs work.

Many computerists have conceived of the multilingual data problem as a special characters problem. This approach considers the multilingualism problem to be solved when all the characters needed for writing the languages being worked with can be displayed both on the screen and in printed output. This has been difficult to achieve in the MS-DOS environment with off-the-shelf software since the operating system views the world in terms of a single set of 256 characters. Linguists have had to resort to using character shape editors (see, for instance, Simons 1989b) to define a customized character set that contains all the characters they need to use in a particular job. The limit of having only 256 characters is exacerbated by the fact that each combination of a diacritic with a base character must be treated as a single composite character. For instance, to correctly display a lowercase Greek alpha with no breathing, a smooth breathing, or a rough breathing, and with no accent, an acute accent, a grave accent, or a circumflex accent, one would need to define twelve different characters in order to display all the possible combinations of diacritics on a lowercase alpha.
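
The arithmetic behind that figure is easy to check. The following minimal Python sketch is purely illustrative (the mark names are my own labels for the diacritic choices just listed):

     from itertools import product

     breathings = ("none", "smooth", "rough")
     accents = ("none", "acute", "grave", "circumflex")

     # Each (breathing, accent) combination must be a separate
     # precomposed character in a 256-character set.
     print(len(list(product(breathings, accents))))   # prints 12

Multiplied across the seven Greek vowels (to say nothing of uppercase forms), such composites quickly exhaust a 256-character inventory.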

The Windows and Macintosh environments have made a significant advance beyond this. Rather than a single character inventory, these operating systems provide a font system. Data in languages with different writing systems can be represented in different fonts. This means that the total character inventory is not limited by the number of possible character codes. There is still the limit of 256 possible character codes in a font, but one could put Roman characters in one font, Greek characters in another font, and Arabic characters in still another. By switching between different fonts in the application software, the user can access and display as many characters as are needed. Note, however, that the Windows font system still has the limitation that every combination of base character plus diacritic must be treated as a separate composite character.

The Macintosh font manager (Apple 1985) offers a significant advance in supporting overstriking diacritics. An overstriking diacritic is a character that is superimposed on-the-fly over a separate base character (somewhat like a dead key on a conventional typewriter). It is possible to build thousands of composites dynamically from a single font of 255 characters. Thus, for instance, all the European languages with Roman-based writing systems can be rendered with the basic Macintosh extended character set. But in spite of the refinements, these are still special-character approaches--they say that if two characters look the same, they should be represented by the same character code, and conversely, if they look different, they should have different codes.

The special-character approach encodes information in terms of its visual form. In so doing it causes us both to underdifferentiate and to overdifferentiate important semantic (or functional) distinctions that are present in the encoded information. We underdifferentiate when we use the same character codes to represent words in different languages. For instance, the character sequence die represents rather different information when it encodes a German word as opposed to an English word.

We overdifferentiate when we use different character codes to represent contextual variants of the same letter in a single language. For instance, the lowercase sigma in Greek has one form if it is word initial or medial, and a second form if it is word final. An even more dramatic example is Arabic in which nearly every letter of the alphabet appears in one of four variant forms depending on whether the context is word initial, word medial, word final, or freestanding. Another type of overdifferentiation occurs when single composite characters are used to represent the combination of base characters with diacritics that represent functionally independent information. For instance, in the example given above of using twelve different composite characters to encode the possible combinations of Greek lowercase alpha with breathing marks and accents, the single functional unit (namely, lowercase alpha) is represented by twelve different character codes. Similarly, the single functional unit of rough breathing would be represented in four of these character codes, and in two dozen others for the other six vowels.

To represent our data in a semantically transparent way, it is necessary to do two things. First, we must explicitly encode the language that each particular datum is in; this makes it possible to use the same character codes for different languages without any ambiguity or loss of information. Second, we need to encode characters at a functional level and let the computer handle the details of generating the correct context-sensitive display of form.

It was Joseph Becker, in his seminal article "Multilingual Word Processing" (1984), who pointed out the necessity to distinguish form and function in the computer implementation of writing systems. He observed that character encoding should consistently represent the same information unit by the same character code. He then defined rendering as the process of converting the encoded information into the correct graphic form for display. He observed correctly that for any writing system, this conversion from functional elements to formal elements is defined by regular rules, and therefore the computer should perform this conversion automatically. Elsewhere I have described a formalism for dealing with this process (Simons 1989a).
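
As a concrete illustration of such a rule, consider the word-final variant of Greek lowercase sigma discussed above. The following sketch is illustrative only--it is not Simons' formalism, and a real rendering rule would consider richer context--but it shows the division of labor: the data contain a single functional code for sigma, and the graphic form is chosen automatically at display time.

     SIGMA, FINAL_SIGMA = "\u03c3", "\u03c2"

     def render(encoded: str) -> str:
         """Convert functionally encoded Greek into its display form,
         choosing the final-sigma variant from context."""
         out = []
         for i, ch in enumerate(encoded):
             word_final = i + 1 == len(encoded) or not encoded[i + 1].isalpha()
             out.append(FINAL_SIGMA if ch == SIGMA and word_final else ch)
         return "".join(out)

     print(render("λογοσ"))   # -> λογος, with the final form chosen by rule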

The writing system is the most visible aspect of language data; thus we tend to think first of rendering when we think of multilingual computing. But the language a particular datum is in governs much more than just its rendering on the screen or in printed output; it governs many other aspects of data processing. One of these is keyboarding: a multilingual computing environment would know that part of the definition of a language is its conventions for keyboarding, and would automatically switch keyboard layouts based on the language of the datum under the system cursor.

Another language-dependent aspect of data processing is the collating sequence that defines the alphabetical order for sorted lists in the language. For instance, the character sequence ll comes between li and lo in English, but in Spanish it is a separate "letter" of the alphabet and occurs between lu and ma. Still other language-dependent aspects are rules for finding word boundaries, sentence boundaries, and possible hyphenation points. Then there are language-specific conventions for formatting times, dates, and numbers.
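
The Spanish example can be made concrete with a small sketch (illustrative only; in a true multilingual environment the collating sequence would come from the language's definition rather than being hard-coded):

     # Traditional Spanish collation: "ll" is a single letter ordered
     # after "l", so "llano" sorts after "luz".
     ALPHABET = [chr(c) for c in range(ord("a"), ord("l") + 1)] + ["ll"] + \
                [chr(c) for c in range(ord("m"), ord("z") + 1)]
     RANK = {letter: i for i, letter in enumerate(ALPHABET)}

     def spanish_key(word):
         key, i = [], 0
         while i < len(word):
             if word[i:i + 2] == "ll":     # the digraph counts as one letter
                 key.append(RANK["ll"]); i += 2
             else:
                 key.append(RANK.get(word[i], -1)); i += 1
         return key

     print(sorted(["luz", "llano", "lino", "lobo"], key=spanish_key))
     # -> ['lino', 'lobo', 'luz', 'llano']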

There are two recent developments in the computing industry which bode well for our prospects of having a truly multilingual computing environment. The first of these is the Unicode standard for character encoding (Unicode Consortium 1991). The Unicode Consortium, composed of representatives from some of the leading commercial software and hardware vendors, has developed a single set of character codes for all the characters of all the major writing systems of the world (including the International Phonetic Alphabet). This system uses two bytes (16 bits) to encode each character. Version 1.0 of Unicode defines codes for 28,706 characters; this leaves room to define over 35,000 more characters. A major aim of Unicode is to make it possible for computer users to exchange highly multilingual documents with full confidence that the recipient will be able to correctly display the text. The definition of the standard is quick to emphasize, however, that it is only a standard for the interchange of character codes. Unicode itself does not address the question of context-sensitive rendering nor of any of the language-dependent aspects of data processing. In fact, it is ironic that Unicode fails to account for the most fundamental thing one must know in order to process a stream of character data, namely, what language it is encoding. Unicode is not by itself a solution to the problem of multilingual computing, but the support promised by key vendors like Microsoft and Apple is likely to make it a part of the solution.

The second recent development is the incorporation of the WorldScript component into version 7.1 of the Macintosh operating system (Ford and Guglielmo 1992). About five years ago, Apple developed an extension to their font manager called the script manager (Apple 1988). It handles particularly difficult font problems like the huge character inventory of Japanese and the context-sensitive rendering of consonant shapes in Arabic. A script system, in conjunction with a package of "international utilities," is able to handle just about all the language-dependent aspects of data processing mentioned above (Davis 1987). The script manager's greatest failing was that only one non-Roman script system could be installed in the operating system. WorldScript has changed this. It is now possible to install as many script systems as one needs. Nothing comparable is yet available for Windows users, but the trade press has reported that Apple intends to port this technology to the Windows platform. As software developers make their programs aware of this technology, adequately multilingual computing may become a widespread reality.

2. The sequential nature of linguistic data

The stream of speech is a succession of sound that unfolds in temporal sequence. Written text is similarly sequential in nature, as word follows word and sentence follows sentence. The order of the words and sentences is, of course, a significant part of the information in text, since changing the order of constituents can change the meaning of the text.

Word processors excel at modeling the sequential nature of text, but fall short in modeling the other aspects of the information structure discussed below in sections 3 through 5. In particular, word processors do not allow us to represent the multidimensional and highly integrated nature of text. Ironically, database management systems, which excel at modeling multidimensionality and integratedness, are generally weak in dealing with sequentiality. The relational database model, for instance, does not inherently support the notion of sequence at all. Relations are, by definition, unordered. To represent sequence in a database model, one must add fields to store explicit sequence numbers and then manipulate these fields to put pieces of text in order or to test whether items are adjacent and so forth. Parunak (1982) has used such an approach to model Biblical text in a relational database. Stonebraker and others (1983) have developed extensions to the relational database model that make it better able to cope with texts (including, for instance, the notion of ordered relations), but these extensions have not become commonplace in commercially available database systems.
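
A small sketch makes the workaround concrete. Using Python's standard sqlite3 module (the table and sentence are invented for illustration), order must be carried in an explicit field, and adjacency must be tested arithmetically:

     import sqlite3

     db = sqlite3.connect(":memory:")
     db.execute("CREATE TABLE word (seq INTEGER, form TEXT)")
     db.executemany("INSERT INTO word VALUES (?, ?)",
                    [(2, "morning"), (1, "good"), (3, "everyone")])

     # Relations are unordered: reconstructing the text requires
     # sorting on the added sequence-number field.
     print([form for (form,) in
            db.execute("SELECT form FROM word ORDER BY seq")])

     # Testing adjacency likewise requires arithmetic on that field.
     print(db.execute("""SELECT a.form, b.form FROM word a, word b
                         WHERE b.seq = a.seq + 1""").fetchall())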

3. The hierarchical nature of linguistic data

The data we deal with as linguists are highly structured. This is true of the primary data we collect, as well as of the secondary and tertiary data we create to record our analyses and interpretations. One aspect of that structuring, namely hierarchy, is discussed in this section. Two other aspects, the multidimensionality and the interrelatedness of data elements, are discussed in the next two sections.

Hierarchy is a fundamental characteristic of data structures in linguistics. The notion of hierarchy is familiar in syntactic analysis where, for instance, a sentence may contain clauses which contain phrases which contain words. Similar hierarchical structuring can be observed at higher levels of text analysis, such as when a narrative is made up of episodes which are made up of paragraphs and so on. We see hierarchy in the structure of a lexicon when the lexicon is made up of entries which contain sense subentries which in turn contain things like definitions and examples. Even meanings, when they are represented as feature structures which allow embedded feature structures as feature values, exhibit hierarchical structure. The list of examples is almost limitless.

As fundamental as hierarchy is, it is ironic that the tools that are most accessible to personal computer users--word processors, spreadsheets, and database managers--do not really support it. There is little question about this assessment of spreadsheets; they simply provide a two-dimensional grid of cells in which to place simple data values. In the case of database management systems (like dBase or 4th Dimension) and even card filing systems (like askSam or HyperCard), a programmer can construct hierarchical data structures, but such a task would be beyond the average user. This is because the inherent model of these systems is that data are organized as a flat sequence of records or cards.

Even word processors do not do a good job at modeling hierarchy. They essentially treat textual data as a sequence of paragraphs. They typically support no structure below this. For instance, if a dictionary entry were represented as a paragraph, the typical word processor would have no way of modeling the hierarchical structure of elements (like headword, etymology, sense subentries, and examples) within the entry. Rather, word processors can only model the contents of a dictionary entry as a sequence of characters; it would be up to the user to impose the internal structure mentally. Going up the hierarchy from paragraph, word processors do a little better, but it is done by means of special paragraph types rather than by modeling true hierarchy. For instance, if a document has a structure of chapters, sections, and subsections, this is imposed by putting the title of each element in a heading paragraph of level 1, 2, or 3, respectively. Under some circumstances, such as in an outline view, the word processor can interpret these level numbers to manipulate the text in terms of its hierarchical structure.

A new generation of document processing systems with a data model that is adequate to handle the hierarchical structure in textual data is beginning to emerge. They are based on an information markup language called SGML, for Standard Generalized Markup Language (Goldfarb 1990, Herwijnen 1990, Cover 1992). SGML is not a program; it is a standard which describes how textual data should be represented in ASCII files so that data files can be interchanged among programs and among users without losing any information (especially information about the structure of the text). In 1986 SGML was adopted by the leading body for international standards (ISO 1986); since that time it has gained momentum in the computing industry to the extent that SGML compatibility is now beginning to appear in popular software products.

The basic model of SGML is a hierarchical one. It views textual data as being comprised of content elements which are of different types and which embed inside each other. For instance, the following is a sample of what a dictionary entry for the word abacus might look like in an SGML-conforming interchange format:

     <entry>
        <headword>abacus</headword>
        <etymology>L. abacus, from Gr. abax</etymology>
        <paradigm>pl. -cuses, or -ci</paradigm>
        <sense n=1><pos>n</pos>
           <def>a frame with beads sliding back and forth
                on wires for doing arithmetic</def></sense>
        <sense n=2><pos>n</pos>
           <def>in architecture, a slab forming the top of
                the capital of a column</def></sense>
     </entry>

Each element of the text is delimited by an opening tag and a matching closing tag. An opening tag consists of the name of the element type enclosed in angle brackets. The matching closing tag adds a slash after the left angle bracket. In this example, the entry element contains five elements: a headword, an etymology, paradigm information, and two sense subentries. Each sense element embeds two elements: pos (part of speech) and definition. The sense elements also use the attribute n to encode the number of the sense.

Rather than forcing the data to fit a built-in model of hierarchical structure (like a word processor does), SGML allows the model of data structure to be as rich and as deep as necessary. An SGML-conforming data file is tied to a user-definable Document Type Definition. The DTD lists all the element types allowed in the document, and then specifies the allowed structure of each in terms of what other element types it can contain and in what order. The DTD is a machine-readable document with a formal syntax prescribed by the SGML standard. This makes it possible for SGML-based application software to read the DTD and to understand the structure of the text being processed. Because the DTD is a plain ASCII file, it is also human readable and thus serves as formal documentation, showing other potential users of a data set how it is encoded.
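
A rough sketch of what such validation amounts to is given below. The content models here are invented to match the sample entry; a real DTD states them in SGML's own formal syntax and also constrains ordering and repetition, which this sketch omits.

     # Each element type declares which child element types it may
     # contain; a document tree is then checked against the models.
     CONTENT_MODEL = {
         "entry":    {"headword", "etymology", "paradigm", "sense"},
         "sense":    {"pos", "def"},
         "headword": set(), "etymology": set(), "paradigm": set(),
         "pos": set(), "def": set(),          # character data only
     }

     def validate(element, children):
         for child, grandchildren in children:
             if child not in CONTENT_MODEL[element]:
                 raise ValueError(f"<{child}> is not valid inside <{element}>")
             validate(child, grandchildren)

     # The structure of the abacus entry shown above:
     entry = [("headword", []), ("etymology", []), ("paradigm", []),
              ("sense", [("pos", []), ("def", [])]),
              ("sense", [("pos", []), ("def", [])])]
     validate("entry", entry)   # passes; an invalid nesting would raise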

Perhaps the greatest impact of a formal definition of possible document structure is that it helps to close the semantic gap between the user and the computer application. This is particularly true when the formal model of the structure matches the model in the minds of practitioners in the domain, and when the formal model uses the same names for the data element types that domain specialists would use to name the corresponding real-world objects. For instance, an SGML-based document editor starts up by reading in the DTD for the type of document the user wants to create (whether it be, for instance, the transcription of a conversation or a bilingual dictionary). The editor then helps the user by showing what element types are possible at any given point in the document. If the user attempts to create an invalid structure, the editor steps in and explains what would be valid at that point. The formal definition of structure can help close the semantic gap when data are processed, too. For instance, an information retrieval tool that knows the structure of the documents in its database can assist the user in formulating queries on that database.

The academic community has recognized the potential of SGML for modeling linguistic (and related) data. The Text Encoding Initiative (TEI) is a large-scale international project to develop SGML-based standards for encoding textual data, including its analysis and interpretation (Burnard 1991). It has been sponsored jointly by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Linguistic and Literary Computing and has involved scores of scholars working in a variety of subcommittees (Hockey 1989-92). Guidelines for the encoding of machine-readable texts have now been published (Sperberg-McQueen and Burnard 1994) and are being followed by a number of projects. The TEI proposal for markup of linguistic analysis depends heavily on feature structures; see Langendoen and Simons (1995) for a description of the approach and a discussion of its rationale.

While the power of SGML to model the hierarchical structure in linguistic data takes us beyond what is possible in word processors, spreadsheets, and database managers, it still does not provide a complete solution. It falls short in the two aspects of linguistic data considered in the next two sections. The attributes of SGML elements cannot themselves store other elements; thus the multidimensional nature of complex data elements must be modeled as hierarchical containment. To get at the network of relationships among elements that reflect the integrated nature of linguistic data, SGML offers a pointing mechanism (through IDs and IDREFs in attribute values), but there is no semantic validation of pointers. Any pointer can point to any element; there is no mechanism for specifying constraints on pointers in the DTD. The only relationship between element types that can be modeled in the DTD is hierarchical inclusion.

4. The multidimensional nature of linguistic data

A conventional text editing program views text as a one-dimensional sequence of characters. A tool like an SGML-based editor adds a second dimension--namely, the hierarchical structure of the text. But from the perspective of a linguist, the stream of speech which we represent as a one-dimensional sequence of characters has form and meaning in many simultaneous dimensions (Simons 1987). The speech signal itself simultaneously comprises articulatory segments, pitch, timing, and intensity. A given stretch of speech can be simultaneously viewed in terms of its phonetic interpretation, its phonemic interpretation, its morphophonemic interpretation, its morphemic interpretation, or its lexemic interpretation. We may view its structure from a phonological perspective in terms of syllables, stress groups, and pause groups, or from a grammatical perspective in terms of morphemes, words, phrases, clauses, sentences, and so on.

The meaning of the text also has many dimensions and levels. There is the phonological meaning of devices like alliteration and rhyme. There is the lexical meaning of the morphemes, and of compounds and idioms which they form. There is the functional meaning carried by the constituents of a grammatical construction. In looking at the meaning of a whole utterance, there is the literal meaning versus the figurative, the denotative versus the connotative, the explicit versus the implicit. All of these dimensions, and more, lurk behind that one-dimensional sequence of characters which we have traditionally thought of as text.

There are already some programs designed for the OWL which handle this multidimensional view of text rather well, namely, interlinear text processing systems like IT (Simons and Versaw 1987, Simons and Thomson 1988) and Shoebox (Davis and Wimbish 1993). In these programs, the user defines the dimensions of analysis that are desired. The program then steps through the text helping the user to fill in appropriate annotations on morphemes, words, and sentences for all the dimensions. Another class of programs which is good at modeling the multidimensional nature of linguistic data is database managers: when a database record is used to represent a single object of data, the many fields of the record can be used to represent the many dimensions of information that pertain to it.

While interlinear text processors and database managers handle the multidimensional nature of linguistic data well, they fall short by not supporting the full hierarchical nature of the data. To adequately model linguistic data, the OWL needs a system which has the fully general, user-definable hierarchy of elements (such as SGML offers) in which the elements not only contain the smaller elements which are their parts, but also allow for a record-like structure of fields which can simultaneously store multiple dimensions of information concerning the elements.
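
A minimal sketch of the hybrid being called for (the names and data are invented): each element carries both a record of annotations, for the many dimensions, and a list of child elements, for the hierarchy.

     from dataclasses import dataclass, field

     @dataclass
     class Element:
         etype: str                                        # e.g. "word", "morpheme"
         annotations: dict = field(default_factory=dict)   # the dimensions
         children: list = field(default_factory=list)      # the hierarchy

     word = Element("word",
                    annotations={"phonemic": "kampo", "gloss": "field"},
                    children=[Element("morpheme",
                                      annotations={"form": "kampo", "pos": "n"})])
     print(word.annotations["gloss"], word.children[0].annotations["pos"])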

5. The highly integrated nature of linguistic data

Hierarchies of data elements with annotations in multiple dimensions are still not enough. Hierarchy, by itself, implies that the only relationships between data elements are those inherent in their relative positions in the hierarchy of parts within wholes. But for the database on which linguistic research is based, this only scratches the surface. Crosscutting the basic hierarchical organization of the elements is a complex network of associations between them.

For instance, the words that occur in a text are composed of morphemes. Those morphemes are defined and described in the lexicon (rather than in the text). The relationship between the surface word form and its underlying form as a string of lexical morphemes is described in the morphophonology. When a morpheme in an analyzed text is glossed to convey its sense of meaning, that gloss is really an attribute of one of the senses of meaning listed in the lexicon entry for that morpheme. The part-of-speech code for that use of the morpheme in the text is another attribute of that same lexical subentry. The part-of-speech code itself does not ultimately belong to the lexicon. It is the grammar which enumerates and defines the possible parts of speech, and the use of a part-of-speech code in the lexicon is really a pointer to its description in the grammar. The examples which are given in the lexicon or the grammar relate back to the text from which they were taken. Cultural terms which are defined in the lexicon and cultural activities which are exemplified in texts relate to their full analysis and description in an ethnography. All the above are examples of how the different parts of a field linguist's database are conceptually integrated by direct links of association. Weber (1986) has discussed this network-like nature of the linguistic database in his description of a futuristic style of computer-based reference grammar.

This network of associations is part of the information structure that is inherent in the phenomena we study. To maximize the usefulness of computing in our research, our computational model of the data must match this inherent structure. Having direct links between related bits of information in the database has an obvious benefit of making it easy and fast to retrieve related information.

An even more fundamental benefit has to do with the integrity of the data and the quality of the resulting work. Because the information structures we deal with in research are networks of relationships, we can never make a hypothesis in one part of the database without affecting other hypotheses elsewhere in the database. Having the related information linked together makes it possible to immediately check the impact of a change in the database.

The addition of associative links to the data structure also makes it possible to achieve the virtue of normalization, a concept which is well-known in relational database theory (Smith 1985). In a fully normalized database, any given piece of information exists only once in the database. Any use of that information is by referring to the single instance rather than by making a copy of it. For instance, the spelling of a part-of-speech code would ideally exist only once in a linguistic database (as an attribute of a part-of-speech entry in the grammar). Rather than repeating that code in all the lexical entries for morphemes with that part of speech, the lexical entries would point to the single part-of-speech entry in the grammatical part of the database; from this they would retrieve the single instance of the spelling of the part-of-speech code as needed. When the analyst decides to change the spelling of the abbreviation, all references are simultaneously "updated" since they now point to a changed spelling. This avoids the ubiquitous database problem known as "update anomaly" which happens when some copies of a single conceptual entity get updated while others do not.
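
The idea is easy to sketch (the class names are invented for illustration): the spelling of the code exists exactly once, and every lexical entry holds a pointer to it rather than a copy.

     class PartOfSpeech:
         def __init__(self, abbreviation):
             self.abbreviation = abbreviation

     class LexicalEntry:
         def __init__(self, headword, pos):
             self.headword = headword
             self.pos = pos                   # a pointer, not a copy

     noun = PartOfSpeech("n")
     entries = [LexicalEntry("abacus", noun), LexicalEntry("abbey", noun)]

     noun.abbreviation = "N."                 # change the single instance...
     print({e.headword: e.pos.abbreviation for e in entries})
     # ...and every entry reflects it; no update anomaly is possible.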

6. The separation of information from format

It is imperative that any system for manipulating linguistic data maintain the distinction between information and format. In printed media, we use variations in format to signal different kinds of information. For instance, in a dictionary entry, bold type might be used to indicate the headword, square brackets might delimit the etymology, while italics with a trailing period might mark the part-of-speech code. The bold type is not really the information--it is the fact that the emboldened form is the headword. Similarly, the square brackets (even though they are characters in the display) are not really part of the data; they simply indicate that the delimited information is the etymology.

Generalized markup (the GM in SGML) is the notion of marking up a document by identifying its information structure rather than its display format (Coombs, Renear, and DeRose 1987). For instance, in a dictionary entry one should insert a markup tag to say, "The following is the headword" (as does the <headword> tag in the SGML example given above in section 3), rather than putting typesetting codes to say, "The following should be in 12 point bold Helvetica type." In the generalized markup approach, each different type of information is marked by a different markup tag, and then details of typesetting are specified in a separate document which is often called a style sheet (Johnson and Beach 1988). The style sheet declares for each markup tag what formatting parameters are to be associated with the content of the marked up element when it is output for display.
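
A toy sketch of this division follows (the formatting vocabulary is invented, and real style sheets are far richer): content is marked only by information type, and a separate table supplies the display parameters, echoing the dictionary formatting described above.

     STYLE_SHEET = {
         "headword":  {"bold": True},
         "etymology": {"before": "[", "after": "]"},
         "pos":       {"italic": True, "after": "."},
     }

     def display(element, content):
         style = STYLE_SHEET.get(element, {})
         text = style.get("before", "") + content + style.get("after", "")
         # Asterisks stand in for bold and italic in this plain-text sketch.
         if style.get("bold"):   text = "**" + text + "**"
         if style.get("italic"): text = "*" + text + "*"
         return text

     print(display("headword", "abacus"),
           display("etymology", "L. abacus, from Gr. abax"),
           display("pos", "n"))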

The separation of content and structure from display formatting has many advantages. (1) It allows authors to defer formatting decisions. (2) It ensures that formatting of a given element type will be consistent throughout. (3) It makes it possible to change formats globally by changing only a single description in the style sheet. (4) It allows the same document to be formatted in a number of different styles for different publishers or purposes. (5) It makes documents portable between systems. And perhaps most important of all for our purposes, (6) it makes possible computerized analysis and retrieval based on structural information in the text.

The lure of WYSIWYG ("what you see is what you get") word processors for building a linguistic database (like a dictionary) must be avoided at all costs when "what you see is all you get." On the other hand, a database manager which allows one to model the information structure correctly, but cannot produce nicely formatted displays is not much use either. The OWL needs a hybrid system that combines the notion of generalized markup for faithfully storing the information structure of the data with the notion of style sheets that can transform the information into conventionally formatted displays.

7. Toward a computing environment for linguistic research

The above sections have discussed six requirements for a computing environment that manages linguistic data:

  1. The data are multilingual, so the computing environment must be able to keep track of what language each datum is in, and then display and process it accordingly.
  2. The data in text unfold sequentially, so the computing environment must be able to represent the text in proper sequence.
  3. The data are hierarchically structured, so the computing environment must be able to build hierarchical structures of arbitrary depth.
  4. The data are multidimensional, so the computing environment must be able to attach many kinds of analysis and interpretation to a single datum.
  5. The data are highly integrated, so the computing environment must be able to store and follow associative links between related pieces of data.
  6. While doing all of the above to model the information structure of the data correctly, the computing environment must be able to present conventionally formatted displays of the data.

It is possible to find software products that meet some of these requirements, but we are not aware of any that can meet them all. Consequently, the Summer Institute of Linguistics (through its Academic Computing Department) has embarked on a project to build such a computing environment for the OWL. We call it CELLAR--for Computing Environment for Linguistic, Literary, and Anthropological Research. This name reflects our belief that these requirements are not unique to linguists--virtually any scholar working with textual data will have the same requirements.

Fundamentally, CELLAR is an object-oriented database system (Rettig, Simons, and Thomson 1993). Borgida (1985) gives a nice summary of the advantages of modeling information as objects. Zdonik and Maier (1990) offer more extensive readings. Booch (1994) and Coad and Yourdon (1991) teach the methodology that is used in analyzing a domain to build an object-oriented information model for it.

In CELLAR each data element is modeled as an object. Each object has a set of named attributes which record the many dimensions of information about it (addressing requirement 4 above). An attribute value can be a basic object like a string, a number, a picture, or a sound; every string stores an indication of the language which it encodes (requirement 1; see Simons and Thomson (1998) for a detailed discussion of CELLAR's multilingual component). An attribute can store a single value or a sequence of values (requirement 2). An attribute value can also be one or more complex objects which are the parts of the original object, thus modeling the hierarchical structure of the information (requirement 3). Or, an attribute value can be one or more pointers to objects stored elsewhere in the database to which the original object is related (requirement 5).
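
These kinds of attribute value can be summarized in a sketch (Python stand-ins, not CELLAR's actual notation), keyed to the requirements they address:

     from dataclasses import dataclass, field

     @dataclass
     class String:                 # requirement 1: a string knows its language
         text: str
         language: str

     @dataclass
     class Sense:
         gloss: String

     @dataclass
     class LexEntry:               # requirement 4: many named attributes
         headword: String                             # a basic-object value
         senses: list = field(default_factory=list)   # an ordered sequence of
                                                      # owned parts (reqs. 2, 3)
         pos: object = None                           # a pointer to an object
                                                      # elsewhere (req. 5)

     noun = object()   # stands in for a part-of-speech entry in the grammar
     entry = LexEntry(String("abacus", "English"),
                      senses=[Sense(String("counting frame", "English"))],
                      pos=noun)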

Each object is an instance of a general class. Each class is sanctioned by a user-definable "class definition" which describes what all instances of the class have in common. This includes definitions of all the attributes with constraints on what their values can be, definitions of virtual attributes which compute their values on-the-fly by performing queries on the database, definitions of parsers which know how to convert plain ASCII files into instances of the class, definitions of views which programmatically build formatted displays of instances of the class, and definitions of tools which provide graphical user interfaces for manipulating instances of the class. The latter two features address requirement 6; see Simons (1997) for a fuller discussion of this aspect of CELLAR.

CELLAR is really a tool for building tools. Programmers will be able to use CELLAR to build class definitions that model the content, format, and behavior of linguistic data objects. These models are the tools that OWLs will use. Because CELLAR's model of data inherently supports the very nature of linguistic data, the programmer can quickly build semantically transparent models of linguistic data.

A beta version of CELLAR was first released to the public in December 1995 as part of the product named LinguaLinks. LinguaLinks uses CELLAR to implement applications for phonological analysis, interlinear text analysis, lexical database management, and other tasks typically performed by field linguists. See the project's home page on the Internet (http://www.sil.org/cellar) for the latest information concerning availability.

References

Apple Computer. 1985. "The font manager," in Inside Macintosh 1:215-240 (with updates in 4:27-48, 1986). Reading, MA: Addison-Wesley.

_____. 1988. "The script manager," in Inside Macintosh 5:293-322. Reading, MA: Addison-Wesley.

Becker, Joseph D. 1984. "Multilingual word processing," Scientific American 251(1):96-107.

Booch, Grady. 1994. Object-oriented analysis and design with applications, 2nd ed. Redwood City, CA: Benjamin/Cummings Publishing Co.

Borgida, Alexander. 1985. "Features of languages for the development of information systems at the conceptual level." IEEE Software 2(1): 63-72.

Burnard, Lou D. 1991. "An introduction to the Text Encoding Initiative," in Daniel I. Greenstein (ed.), Modeling Historical Data: towards a standard for encoding and exchanging machine-readable texts. (Halbgraue Reihe zur Historischen Fachinformatik, Serie A, Historische Quellenkunden, Band 1.) Max-Planck-Institut für Geschichte.

Coad, Peter and Edward Yourdon. 1991. Object-oriented analysis, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall.

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup systems and the future of scholarly text processing," Communications of the ACM 30(11):933-947.

Cover, Robin. 1992. "Standard Generalized Markup Language: annotated bibliography and list of references," <TAG>: The SGML newsletter 5(3):4-12, 5(4):13-24, 5(5):25-36. (See http://www.oasis-open.org/cover/ for Cover's Web site which features an up-to-date version of this bibliography and a wealth of pointers to SGML resources.)

Davis, Daniel W. and John S. Wimbish. 1993. The linguist's Shoebox: an integrated data management and analysis tool (version 2.0). Waxhaw, NC: Summer Institute of Linguistics.

Davis, Mark E. 1987. "The Macintosh script system," Newsletter for Asian and Middle Eastern Languages on Computer 2(1&2):9-24.

Ford, Ric and Connie Guglielmo. 1992. "Apple's new technology and publishing strategies," MacWeek (September 28, 1992), 38-40.

Goldfarb, Charles F. 1990. The SGML handbook. Oxford: Oxford University Press.

Herwijnen, Eric van. 1990. Practical SGML. Dordrecht: Kluwer Academic Publishers.

Hockey, Susan. 1989-92. "Chairman's report," Literary and Linguistic Computing 4(4):300-302, 5(4):334-346, 6(4):299, 7(4):244-245.

ISO. 1986. Information processing--text and office systems--Standard Generalized Markup Language (SGML). ISO 8879-1986 (E). Geneva: International Organization for Standardization, and New York: American National Standards Institute.

Johnson, Jeff and Richard J. Beach. 1988. "Styles in document editing systems," IEEE Computer 21(1):32-43.

Langendoen, D. Terence and Gary F. Simons. 1995. "A rationale for the TEI recommendations for feature structure markup," Computers and the Humanities 29:191-209.

Parunak, H. Van Dyke. 1982. "Database design for biblical texts," in Richard W. Bailey (ed.), Computing in the Humanities. North Holland Publishing Company, 149-161.

Rettig, Marc, Gary F. Simons, and John V. Thomson. 1993. "Extended objects," Communications of the ACM 36(8):19-24.

Simons, Gary F. 1980. "The impact of on-site computing on field linguistics," Notes on Linguistics 16:7-26.

_____. 1987. "Multidimensional text glossing and annotation," Notes on Linguistics 39:53-60.

_____. 1989a. "The computational complexity of writing systems," in Ruth M. Brend and David G. Lockwood (eds.), The fifteenth LACUS forum. Lake Bluff, IL: Linguistic Association of Canada and the United States, 538-553.

_____. 1989b. "Working with special characters," in Priscilla M. Kew and Gary F. Simons (eds.), Laptop publishing for the field linguist: an approach based on Microsoft Word. (Occasional Publications in Academic Computing 14.) Dallas, TX: Summer Institute of Linguistics, 109-118.

_____. 1997. "Conceptual modeling versus visual modeling: a technological key to building consensus," Computers and the Humanities 30(4):303-319. (See the longer working paper version at http://www.sil.org/cellar/ach94/ach94.html.)

Simons, Gary F. and John V. Thomson. 1988. How to use IT: interlinear text processing on the Macintosh. Edmonds, WA: Linguist's Software.

_____. 1998. "Multilingual data processing in the CELLAR environment," in John Nerbonne (ed.), Linguistic Databases. Stanford, CA: Center for the Study of Language and Information, 203-234. (The original working paper is available at http://www.sil.org/cellar/mlingdp/mlingdp.html.)

Simons, Gary F. and Larry Versaw. 1987. How to use IT: a guide to interlinear text processing. Dallas, TX: Summer Institute of Linguistics. (3rd edition, 1992)

Smith, Henry C. 1985. "Database design: composing fully normalized tables from a rigorous dependency diagram," Communications of the ACM 28(8):826-838.

Sperberg-McQueen, C. M. and Lou Burnard. 1994. Guidelines for the encoding and interchange of machine-readable texts. Chicago and Oxford: Text Encoding Initiative. (See also http://www.uic.edu/orgs/tei.)

Stonebraker, Michael, Heidi Stettner, Nadene Lynn, Joseph Kalash, and Antonin Guttman. 1983. "Document processing in a relational database system," ACM Transactions on Office Information Systems 1(2):143-188.

Unicode Consortium. 1991. The Unicode standard: worldwide character encoding, version 1.0, volume 1. Reading, MA: Addison-Wesley. (Version 2.0 published 1996; see also http://www.unicode.org/.)

Weber, David. 1986. "Reference grammars for the computational age," Notes on Linguistics 33:28-38.

Zdonik, Stanley B. and David Maier (eds.). 1990. Readings in object-oriented database systems. San Mateo, CA: Morgan Kaufmann Publishers.

