SIL Electronic Working Papers 1998-006, December 1998
Copyright © 1998 Gary F. Simons and Summer Institute of Linguistics, Inc.
All rights reserved.
A paper presented at Markup Technologies '98, Chicago, 19-20 Nov 1998
Gary F. Simons
The large SGML DTDs in widespread use (e.g. HTML, DocBook, ISO 12083, CALS, EAD, TEI) offer the advantage of standardization, but for a particular project they often carry the disadvantage of being too large or too general. A given project might be better served by a DTD that is no bigger than is needed to solve the specific problem at hand, and that is even customized to meet special requirements of the problem domain. Furthermore, the project might prefer for the data it produces to meet the different syntactic constraints of XML conformity. This paper demonstrates how architectural processing can be used to develop a problem-specific XML DTD for a particular project without losing the advantage of conforming to a widely-used SGML DTD. As an example, the paper discusses the markup for a dictionary of the Sikaiana language (Solomon Islands) and develops a small XML application for the purpose derived from the TEI (Text Encoding Initiative) DTD. The TEI Guidelines offer a mechanism for building TEI-conformant applications; the paper concludes by proposing an alternative approach to TEI conformance based on architectures.
The work described in this paper began as an effort to perform a particular markup task. Back in 1983, while doing linguistic field work in the Solomon Islands, I helped anthropologist William Donner (then a graduate student at the University of Pennsylvania) to produce a bilingual dictionary of the Sikaiana language [Don87]. For the purpose, we devised a one-of-a-kind markup system. Now, fifteen years later, we would like to put this data in a form that can be shared on the Web; conversion into a standardized form of markup is needed. The leading standard for the markup of dictionaries is the SGML-based TEI (Text Encoding Initiative) DTD [SMB94]. But using this DTD presents three main problems for this project, because what we really want is to: (1) deliver an XML application, (2) use a customized markup scheme that better fits the problem domain, and (3) use a small DTD that is no bigger than is needed for the application.
This is, in fact, a general problem. The large SGML DTDs in widespread use (e.g. HTML, DocBook, ISO 12083, CALS, EAD, TEI) offer the advantage of standardization, but for a particular project they often carry the disadvantage of being too large or too general. A given project might be better served by a DTD that is no bigger than is needed to solve the specific problem at hand, and that is even customized to meet special requirements of the problem domain. Furthermore, the project might prefer to constrain the data it produces to stay within the bounds of XML conformity.
This paper demonstrates how architectural processing can be used to develop a problem-specific XML DTD for a particular project without losing the advantage of conforming to a widely used SGML DTD. Section 2 begins by elaborating on the three problems mentioned above. The paper uses SGML architectures to develop a solution to these problems. Section 3 gives an overview of architectures. Then section 4 describes exactly how they are used to solve the three problems. Section 5 shows how an SGML parser that includes an architectural processor can be used to simultaneously validate data against both a problem-specific XML DTD and a widely used SGML DTD. Finally, section 6 discusses the issue of TEI conformance. The TEI Guidelines offer an elaborate mechanism for building new applications that are both customized and conforming. The paper concludes by proposing an alternative approach to TEI conformance based on architectures.
Using one of the large SGML DTDs in widespread use may pose some problems: one might really want to deliver an XML application, one might really want a customized markup scheme that better fits the problem domain, or one might want a small DTD that is no bigger than is really needed for the exact application. These three problems are discussed in more detail in the following three subsections.
With the growing popularity of XML and its potential for the publication of structured information on the Web, content developers want to use XML applications. However, our most widely used standards for structuring information are SGML applications like HTML, DocBook, ISO 12083, CALS, EAD, and TEI. A transition to XML poses two kinds of problems.
First, there are problems of XML well-formedness. A data file that is valid with respect to a given SGML application (including both the SGML declaration and the DTD) is likely not to be well-formed XML. For instance, SGML applications typically allow many of the end tags to be omitted, while XML prohibits this. Most SGML applications use '>' to close empty tags and processing instructions, while XML uses '/>' and '?>' respectively. [Cla97] gives a detailed listing of differences between SGML and XML.
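The effect of omitted end tags and SGML-style empty tags is easy to observe with any XML parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree as the parser; the helper name is_well_formed and the two sample fragments (including the target value) are illustrative and are not taken from the project's data.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# SGML-style fragment (illustrative): the end tag of <form> is omitted
# and the empty <ptr> element is closed with '>' rather than '/>'.
sgml_style = '<entry><form>pili<xr><ptr target="vaka"></xr></entry>'

# The same content with every end tag present and XML empty-tag syntax.
xml_style = '<entry><form>pili</form><xr><ptr target="vaka"/></xr></entry>'

print(is_well_formed(sgml_style))
print(is_well_formed(xml_style))
```

An SGML parser configured with a suitable declaration and DTD could accept the first fragment; an XML parser rejects it before any DTD is consulted.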
Second, there are problems of XML validity. Many features of SGML DTDs are not allowed in XML DTDs. For instance, XML DTDs do not allow the tag minimization characters in element declarations, nor do they support inclusion exceptions or exclusion exceptions in content models. Furthermore, they allow PCDATA in content models only under very restricted conditions. For the sample problem of marking up the Sikaiana dictionary, these differences between SGML DTDs and XML DTDs mean that the TEI DTD cannot be used to deliver a fully XML solution. While the TEI DTD could be used with an SGML parser (with the right SGML declaration) to validate a well-formed XML file, it could not be used with an XML parser.
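As a rough illustration of the DTD-level differences, the following Python sketch flags two of the SGML-only features named above in a single element declaration: tag minimization characters and inclusion or exclusion exceptions. It is a simple pattern match under stated assumptions (one declaration per string, whitespace before an exception), not a DTD parser; the function name and the sample declarations are hypothetical and are not quoted from the TEI DTD.

```python
import re

# Tag minimization characters ("-" or "O", given twice) after the element name.
MINIMIZATION = re.compile(r"<!ELEMENT\s+\S+\s+[-Oo]\s+[-Oo]\s")
# Inclusion "+(...)" or exclusion "-(...)" exceptions after the content model.
EXCEPTION = re.compile(r"\)[*+?]?\s+[+-]\(")

def sgml_only_features(declaration):
    """List the SGML-only features found in one element declaration."""
    found = []
    if MINIMIZATION.search(declaration):
        found.append("tag minimization")
    if EXCEPTION.search(declaration):
        found.append("exception")
    return found

# Hypothetical declarations, written for this example only.
sgml_decl = "<!ELEMENT sense - O (gramGrp, def)* +(note) >"
xml_decl = "<!ELEMENT sense ( gramGrp, def ) >"

print(sgml_only_features(sgml_decl))
print(sgml_only_features(xml_decl))
```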
In a particular markup project, it may be desirable, or even necessary, to build a customized DTD. While the large widely-used DTD may handle the essential features needed for the job, it could be that different names may make more sense for certain elements or attributes, or that new elements or attributes need to be added, or that it is more convenient to encode certain combinations of elements with fixed attribute values as new element types.
For instance, in the Sikaiana dictionary it is common for definitions to include embedded words in the Sikaiana language. Consider this entry:
hakamaatele, v. for the chief (aliki) to pray by calling out the names of spirits (aitu).
The TEI DTD prescribes that the definition in this entry be marked up as follows:
<def>for the chief (<foreign lang="SIK">aliki</foreign>) to pray by calling out the names of spirits (<foreign lang="SIK">aitu</foreign>)</def>
However, the tag <foreign lang="SIK"> is so common in this application that one would like to abbreviate it as <sik>. That is,
<def>for the chief (<sik>aliki</sik>) to pray by calling out the names of spirits (<sik>aitu</sik>)</def>
As another example of where a customization is needed, consider the following entry:
pili, v. to run aground, of a boat or canoe; te vaka ni pili i te popolani, 'the boat ran ashore on the reef'; Idiom: toku vaka ni pili, 'I have made a mess of things (lit., my ship has wrecked)'.
This entry provides two example sentences, the second of which is explicitly marked as an idiom. The following is how the first example is marked up according to the TEI DTD (where <eg> is "example," <q> is "quoted," and <tr> is "translation"):
<eg><q>te vaka ni pili i te popolani,</q> <tr>the boat ran ashore on the reef</tr></eg>
But there is no really satisfactory way to mark up the idiom. The TEI has no tag for an idiom, nor does the <eg> element have a type attribute. By adding a type attribute to <eg> and <tr>, we could distinguish a normal example from an idiom and a normal translation from a literal one. For instance,
<eg type="idiom"><q>toku vaka ni pili,</q> <tr>I have made a mess of things</tr> <tr type="lit">my ship has wrecked</tr></eg>
Even better would be to add two new elements, <idiom> and <lit>, so that literal translations could be constrained to occur only with idioms. For instance,
<idiom><q>toku vaka ni pili,</q> <tr>I have made a mess of things</tr> <lit>my ship has wrecked</lit></idiom>
Five years ago, a cover story in Byte [PTUM93] decried the problem of "fatware"--software that just keeps getting bigger and bigger with each release without returning commensurate benefit to the user. Niklaus Wirth, in his plea for lean software [Wir95], sums up the situation thus: "Software's girth has surpassed its functionality."
I wonder if we aren't seeing a similar phenomenon with some of our favorite DTDs. Whether they have grown by accretion or were huge by original design, many widely-used DTDs are so large that a typical markup project needs only a fraction of the functionality in the DTD. In the world of software, the average user is much more likely to be successful in using a single-purpose tool that is focused on the task at hand than in trying to figure out how to apply a multipurpose tool that has more features than are needed for the task [Sim98]. The same must be true in the world of markup: a DTD that is focused on the task at hand must be easier for people to use than a large one that is full of features that will not be applied.
For the Sikaiana dictionary project, the TEI DTD proved to be huge in comparison to the subset of elements and attributes that were actually used. Having a DTD that is limited to just the elements and attributes that are used in a project simplifies many tasks like building project-specific software, specifying stylesheets, shipping the DTD with the data, and documenting markup practice.
Even more significant for the Sikaiana dictionary project than reducing the fat of unused elements and attributes was the matter of reducing the fat of overly permissive content models for the elements that actually were used. In the first case we want to reduce the size of the DTD; in the second case we want to reduce the size of the document space it generates. The TEI's model for dictionary markup is a descriptive one; the DTD aims to provide the user a means of tagging anything that could be encountered in previously published dictionaries. But in tagging the Sikaiana dictionary, our purpose was to be prescriptive; we wanted to specify constraints on how we would structure individual entries and then ensure that all the entries consistently followed that structure.
This point is easy to illustrate. Consider the following entry from the Sikaiana dictionary (where <gramGrp> is "grammatical information group," <pos> is "part of speech," and <re> is "related entry"):
atamai, 1. vs. to be intelligent, skillful, clever, knowledgeable. Hano pe a koe e atamai, aliki ei koe, 'if you are intelligent, you will become chief'. CAUS: hakaatamai 'to instruct, to make intelligent.' 2. n, location. the right side, as opposed to the left: te lima atamai, 'the right hand'; te vahi atamai, 'the right side'. ANT: vvale.
The TEI markup for the first sense is as follows:
<sense n="1">
  <gramGrp><pos>vs</pos></gramGrp>
  <def>to be intelligent, skillful, clever, knowledgeable.</def>
  <eg><q>Hano pe a koe e atamai, aliki ei koe,</q>
    <tr>if you are intelligent, you will become chief</tr></eg>
  <re type="causative">
    <form>hakaatamai</form>
    <def>to instruct, to make intelligent</def></re>
</sense>
The content model for <sense> as it is used in the Sikaiana dictionary is as follows:
( gramGrp, def, (eg | idiom | note)*, usg?, (xr | re)* )
That is, a sense contains an obligatory grammatical information group and definition, followed by optional examples and idioms which may have notes interspersed, an optional usage comment, and optional semantic cross-references and related entries for derivative forms. Contrast this with the content model for <sense> in the TEI DTD:
( sense | %m.dictionaryTopLevel | %m.phrase | #PCDATA )*
where,
<!ENTITY % m.dictionaryTopLevel "def | eg | etym | form | gramGrp | note | re | trans | usg | xr" >
The entity reference %m.phrase expands to more than fifty phrase-level elements that can occur in paragraph-level elements throughout the TEI DTD. The result of this content model is that the TEI <sense> allows recursion of senses, inclusion of more than fifty phrase elements, and free-standing PCDATA, all of which we do not want to allow in the Sikaiana dictionary. For the dictionary-specific elements that we do want to use, the TEI DTD has no required elements and puts no constraints on the order of the elements. This situation is far from satisfactory for any particular project that wants to enforce a consistent pattern in the structure of its entries.
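The difference between the two content models can be made concrete with a small check. The following Python sketch, offered only as an illustration, rewrites the project's content model for <sense> as a regular expression over the names of the child elements and tests two samples against it. The function name and the abbreviated sample data are assumptions of this example, and the check looks only at the order of child elements; it ignores attributes and stray character data, which a validating parser would also examine.

```python
import re
import xml.etree.ElementTree as ET

# ( gramGrp, def, (eg | idiom | note)*, usg?, (xr | re)* )
# written as a regular expression over child names, each followed by a space.
SENSE_MODEL = re.compile(
    r"gramGrp def (?:(?:eg|idiom|note) )*(?:usg )?(?:(?:xr|re) )*"
)

def sense_is_valid(sense):
    """Check the order of a <sense> element's children against the model."""
    children = "".join(child.tag + " " for child in sense)
    return SENSE_MODEL.fullmatch(children) is not None

# Abbreviated version of the first sense of the entry for atamai.
good = ET.fromstring(
    '<sense n="1"><gramGrp><pos>vs</pos></gramGrp>'
    "<def>to be intelligent</def>"
    "<eg><q>Hano pe a koe e atamai</q><tr>if you are intelligent</tr></eg>"
    '<re type="causative"><form>hakaatamai</form></re></sense>'
)

# Permitted by the TEI content model quoted above, but not by the project
# model: the definition comes first and a sense is nested inside a sense.
loose = ET.fromstring(
    "<sense><def>right side</def><gramGrp><pos>n</pos></gramGrp>"
    "<sense><def>nested</def></sense></sense>"
)

print(sense_is_valid(good))
print(sense_is_valid(loose))
```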
These problems can be addressed by using architectural processing. The HyTime standard [ISO92][DD94] first introduced the concept of architectural forms as a way to associate standardized semantics with elements in user-defined DTDs. Since then the concept has been generalized and formally adopted into SGML as part of the SGML Extended Facilities in the 1997 revision of the HyTime standard [ISO97]. An excellent tutorial introduction to SGML architectures can be found in [Kim97]. An in-depth explanation of a particular application of architectures can be found in [Sim97]. See [Cov98] for an up-to-date listing of other resources relating to SGML architectures and their application.
An SGML architecture is an SGML document type that is used as a basis for deriving new document types. (For instance, [Meg98] includes three chapters on how to design new DTDs by deriving them from architectures.) In the same way that a class may be based on a superclass in object-oriented programming, a document type may be based on an architecture. Each of the elements in an architectural DTD is called an architectural form. An architectural form attribute is used on an element of the user document to specify the architectural form on which it is based. For instance, if one were using HTML as an architecture and html as the architectural form attribute, the tag <para html="P"> in a user document would say that this <para> element is derived from (or inherits the semantics of) HTML's <P> element.
An architecture is defined by a DTD. It is often called a meta-DTD to emphasize its higher level function, but its syntax is just like a normal DTD. We can exploit this fact in solving the problem at hand by using the existing widely-used SGML DTD as an architecture. We then write a problem-specific XML DTD to embody the constraints of the project and use an architectural form attribute to map the elements of the XML DTD onto the elements of the SGML architecture. Corkern [Cor97] proposes the same solution for the corporate setting in which many groups within the company must use the same standardized DTD; each group can comply by having its own "authoring DTD" that is architecturally derived from the "corporate DTD". Fortunately, the architectural processing mechanism has built-in provision for automatically mapping user elements onto architectural elements that have the same name. Thus, when the problem-specific DTD is essentially a subset of a widely-used DTD that is being used as an architecture, very little setup is needed to achieve the right mappings. See section 4.4 for an example.
The basic strategy is to build a problem-specific XML DTD that declares a widely-used SGML DTD as its base architecture. The following subsections give details of how the three problems of section 2 are addressed by this strategy.
The problem of transition to XML is addressed in two ways. The requirement that the data can be validated by an XML parser is met by having the problem-specific DTD be a valid XML DTD. The requirement to assure XML well-formedness and validity while also assuring validity with respect to the SGML architecture is met by altering the SGML declaration used with the architectural DTD so that it can accept XML syntax in the document instance. In this way, an SGML parser with an architecture engine can parse an XML document and simultaneously validate it against the customized XML DTD and the base SGML DTD.
For the Sikaiana dictionary project, the following two changes needed to be made to the SGML declaration used with the TEI:
First, modify the DELIM section of the SGML declaration used with the base DTD to define general delimiters for "null end tag" and "processing instruction close" that are consistent with the XML syntax for closing empty tags and processing instructions. The resulting DELIM section looks like this:
DELIM GENERAL  SGMLREF
               NET "/>"
               PIC "?>"
      SHORTREF SGMLREF
Second, in the FEATURES section of the SGML declaration, set OMITTAG to NO. This prohibits tags from being omitted (as is required by an XML document) and allows the two tag minimization characters in an element declaration to be left out (as is required by an XML DTD).
The problem of fatware is addressed by creating a DTD for the project that omits declarations for all the elements and attributes of the base DTD that are not used. This also requires that content models be revised to no longer reference omitted elements. At the same time, content models should be tightened to embody any additional constraints the project wants to enforce. For instance, elements that are optional in the base DTD could be required in the project DTD, or elements that may occur in free order in the base DTD could be constrained to a particular order in the project DTD. The discussion of <sense> in section 2.3 provides an example.
If a good sample document already exists, an easy way to proceed with creating the customized DTD is to use a DTD generator like FRED [Sha95]. FRED is a free service on the Web that analyzes a submitted SGML document and returns the DTD that is deduced for it. In the case of the Sikaiana dictionary project, the DTD returned by FRED was very close to what we wanted and was easy to modify. By contrast, the TEI DTD was so much bigger and more permissive than what we wanted that starting from scratch would have been easier than trying to edit it.
The problem of customization is addressed by modifying the problem-specific XML DTD as needed. Elements that have the same name as the corresponding element in the architectural DTD will automatically map to the right architectural form. Any other elements must be explicitly mapped by using an architectural form attribute. For instance, consider the example from section 2.2 of mapping from the abbreviated tag <sik> to the architectural equivalent <foreign lang="SIK">. Given that tei is the name declared for the architectural form attribute, the customization is achieved by the following definitions in the XML DTD:
<!ELEMENT sik (#PCDATA) >
<!ATTLIST sik
          tei  NMTOKEN #FIXED "foreign"
          lang CDATA   #FIXED "SIK" >
The other example from section 2.2 was the discrimination of normal example sentences from idioms; the latter differ in that they require different display formatting and may include a literal translation. This part of the XML DTD would be coded as follows:
<!ELEMENT idiom (q, tr, lit?) >
<!ATTLIST idiom tei NMTOKEN #FIXED "eg" >
<!ELEMENT q (#PCDATA) >
<!ELEMENT tr (#PCDATA) >
<!ELEMENT lit (#PCDATA) >
<!ATTLIST lit tei NMTOKEN #FIXED "tr" >
These declarations say that the custom elements <idiom> and <lit> are really just specializations of the TEI <eg> and <tr> elements, respectively. The elements <q> and <tr> need no ATTLIST declaration to associate them with an architectural form since the architectural processing mechanism automatically associates an element with an architectural form of the same name.
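To make the effect of these mappings concrete, the following Python sketch imitates, in a very reduced form, the derivation of an architectural instance from a project document. It is not how an architecture engine is implemented: the lookup table ARC_MAP is a hypothetical stand-in for the #FIXED tei attributes declared in the DTD, and the function name is likewise an assumption of this example. Elements absent from the table keep their own names, which mirrors the automatic mapping of same-named elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical table mirroring the #FIXED attributes in the project DTD:
# custom element -> (architectural form, fixed attributes it carries).
ARC_MAP = {
    "sik": ("foreign", {"lang": "SIK"}),
    "idiom": ("eg", {}),
    "lit": ("tr", {}),
}

def to_architectural(elem):
    """Return a copy of elem with custom elements renamed to TEI forms."""
    # Elements not in the table map to the form of the same name.
    form, fixed = ARC_MAP.get(elem.tag, (elem.tag, {}))
    out = ET.Element(form, {**elem.attrib, **fixed})
    out.text, out.tail = elem.text, elem.tail
    for child in elem:
        out.append(to_architectural(child))
    return out

src = ET.fromstring(
    "<idiom><q>toku vaka ni pili,</q> <tr>I have made a mess of things</tr> "
    "<lit>my ship has wrecked</lit></idiom>"
)
arc = to_architectural(src)
print(ET.tostring(arc, encoding="unicode"))
```

The printed result uses only element names known to the TEI: the <idiom> becomes an <eg>, and the <lit> becomes a second <tr>.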
The complete problem-specific DTD for the Sikaiana dictionary project is as follows:
<!-- sikaiana.dtd -->
<!-- XML DTD for the Sikaiana dictionary project -->
<!ELEMENT SikDict (teiHeader, text) >
<!ATTLIST SikDict tei NMTOKEN #FIXED "TEI.2" >
<!ELEMENT text (front, body) >
<!-- Declarations for teiHeader and front are omitted to save space -->
<!ELEMENT body (entry+) >
<!ELEMENT entry ( form+, ( xr | note+ | (etym?, sense+, (xr | re)*) )) >
<!ATTLIST entry
          id ID    #REQUIRED
          n  CDATA #IMPLIED >
<!ELEMENT form (#PCDATA) >
<!ATTLIST form type (headword|alternate) "headword" >
<!-- Etymology -->
<!ELEMENT etym (#PCDATA | lang | mentioned | gloss )* >
<!ELEMENT lang (#PCDATA) ><!-- language name, abbrev -->
<!ELEMENT mentioned (#PCDATA) ><!-- a source form -->
<!ELEMENT gloss (#PCDATA) ><!-- gloss of source form -->
<!-- Senses of meaning -->
<!ELEMENT sense ( gramGrp, def, (eg | idiom | note)*, usg?, (xr | re)* ) >
<!ATTLIST sense n CDATA #IMPLIED >
<!-- Grammatical information -->
<!ELEMENT gramGrp ( pos, gram? ) >
<!ELEMENT pos (#PCDATA) ><!-- part of speech -->
<!ELEMENT gram (#PCDATA) ><!-- further grammar note -->
<!ATTLIST gram type CDATA #FIXED "note" >
<!-- Definitions -->
<!ELEMENT def ( #PCDATA | sik )* >
<!ELEMENT sik (#PCDATA) ><!-- Sikaiana word(s) -->
<!ATTLIST sik
          tei  NMTOKEN #FIXED "foreign"
          lang CDATA   #FIXED "SIK" >
<!ELEMENT usg (#PCDATA) ><!-- a usage note -->
<!-- Examples -->
<!ELEMENT eg ( q, tr, usg? ) >
<!ELEMENT idiom ( q, tr, lit? ) >
<!ATTLIST idiom tei NMTOKEN #FIXED "eg" >
<!ELEMENT q (#PCDATA) ><!-- the example text -->
<!ELEMENT tr (#PCDATA) ><!-- free translation -->
<!ELEMENT lit (#PCDATA) ><!-- literal translation -->
<!ATTLIST lit tei NMTOKEN #FIXED "tr" >
<!ELEMENT note (#PCDATA) >
<!-- Cross-references -->
<!ELEMENT xr (ptr+) ><!-- a semantic cross-ref. -->
<!ATTLIST xr type ( seeAlso | antonym | generic | contrast | causative | transitive | whole | synonym | other | stative ) #REQUIRED >
<!ELEMENT ptr EMPTY ><!-- a pointer to an entry -->
<!ATTLIST ptr target CDATA #REQUIRED >
<!-- The target is CDATA so that files for individual letters of the alphabet can be validated without being swamped by missing IDs. In TEI, target is IDREF, so that full validation of cross-references occurs when parsed with the -Atei option. -->
<!-- Related entry; i.e., a grammatical derivative -->
<!ELEMENT re ( form, gramGrp?, def? ) >
<!ATTLIST re type ( singular | other | causative | passive | plural | repeatedAction | stative | oneTimeAction | transitive) #REQUIRED >
In comparing this DTD to the full TEI DTD, we see a situation that is like the difference between a single-purpose software tool and a general-purpose tool. In software, a key technology for building task-centered applications is to use a scripting language to build many single-purpose tools around a single many-purpose component [Sim98]. Analogously in markup, a key technology for building task-centered applications is to use architectural processing to map many single-purpose DTDs onto a single many-purpose DTD.
In order to employ this technique of building a problem-specific XML DTD that is derived from a widely-used SGML DTD, one must use an SGML parser that incorporates a full architectural processing engine. The SP parser by James Clark [Cla98] is an example of such a parser.
First, we want to validate our project documents against just the problem-specific DTD. Use the -w xml command line option to run the SP parser in XML mode. In this mode, the parser issues warnings about anything in the DTD that is not valid in XML. For instance,
nsgmls -w xml -c xml.soc myData.xml
where xml.soc is an SGML Open catalog containing:
SGMLDECL xml.dcl
DOCTYPE SikDict sikaiana.dtd
That is, the standard SGML declaration for XML (supplied with the SP parser) is used and sikaiana.dtd (see section 4.4) is the DTD that is used when the document element is <SikDict>.
Second, we want to use architectural processing to validate our project documents against the TEI DTD as well. The secret to setting up the parser to use architectural processing is to insert a declaration of the base architecture into the DTD. For this purpose, we create a new version of the project DTD named sik_tei.dtd:
<!-- sik_tei.dtd
     Sikaiana Dictionary Project DTD for mapping to TEI with SP parser-->
<!-- First, declare that this application is based on the TEI architecture -->
<?IS10744 ArcBase tei ?>
<!ENTITY % teiDTD SYSTEM "mypizza.dtd" >
<!NOTATION tei SYSTEM>
<!ATTLIST #NOTATION tei
          arcDocF  NAME  #FIXED TEI.2
          arcFormA NAME  #FIXED tei
          ArcDTD   CDATA #FIXED "%teiDTD" >
<!-- Now declare the elements of the Sikaiana dictionary application -->
<!ENTITY % sikaianaDTD SYSTEM "sikaiana.dtd">
%sikaianaDTD;
A base architecture named tei is declared by means of the processing instruction. Following this is the architectural support declaration. It consists of a notation declaration and an attribute definition list that sets options which control the architecture engine. In this case, ArcDocF specifies the generic identifier for the document element of the architectural document, arcFormA identifies the architectural form attribute, and ArcDTD specifies the file which contains the architectural DTD. The file mypizza.dtd is a customized DTD downloaded from the TEI "Pizza Chef" [OUCS98]; it is a 98K file containing just the core tag set and the base tag set for dictionaries. Finally, the problem-specific DTD from section 4.4 is included without change; thus we still have a DTD for the document type SikDict.
Further, we create an alternative SGML Open catalog named tei.soc to use when we want to validate against both the problem-specific DTD and the TEI DTD. It uses the modified version of the TEI SGML declaration created in section 4.1 and the version of the problem-specific DTD that sets up architectural processing. That is,
SGMLDECL tei_xml.dcl
DOCTYPE SikDict sik_tei.dtd
Running the SP tools with the -A command line parameter invokes architectural processing for the named architecture. Thus,
nsgmls -A tei -c tei.soc myData.xml
causes the document to be validated against both the problem-specific DTD and the architectural DTD. Note that the sgmlnorm member of the SP family can go even a step further to translate a project document into the equivalent architectural document.
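When there are many data files, the two validation passes can be scripted. The following Python sketch builds the two command lines shown above and runs them in turn. The function names are hypothetical, the catalog names are the ones used in this section, and the sketch assumes that nsgmls is on the search path and signals validation errors through a nonzero exit status; that assumption should be checked against the SP documentation before relying on it.

```python
import subprocess

def nsgmls_command(catalog, document, architecture=None):
    """Build the nsgmls command line for one of the two validation passes."""
    if architecture is None:
        # Pass 1: XML mode, validating against the project DTD only.
        options = ["-w", "xml"]
    else:
        # Pass 2: architectural processing for the named base architecture.
        options = ["-A", architecture]
    return ["nsgmls", *options, "-c", catalog, document]

def validate(document):
    """Run both passes; True only if each one exits with status 0.

    Assumption: a nonzero exit status means that errors were reported.
    """
    passes = [
        nsgmls_command("xml.soc", document),
        nsgmls_command("tei.soc", document, architecture="tei"),
    ]
    results = [
        subprocess.run(cmd, capture_output=True, text=True) for cmd in passes
    ]
    return all(result.returncode == 0 for result in results)

print(nsgmls_command("tei.soc", "myData.xml", architecture="tei"))
```

Calling validate("myData.xml") for each file in the project would then report which files pass both the problem-specific DTD and the architectural DTD.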
The TEI Guidelines devote one chapter to the issue of conformance and another to mechanisms for modifying the DTD in a conforming manner [SMB94]. As the guidelines explain, the target uses of the DTD demanded that extension be possible:
The document type declaration provided by the TEI is intended to cover as wide a variety of document types and processing needs as proved feasible. It is impossible, however, for any finite list of text elements to cover every need of textual research and processing. As a result, extension of the TEI DTD has no effect on strict TEI conformance, as long as certain restrictions are observed; these have the effect of ensuring that later users of a file can easily see what changes have been made to the DTDs and what the new tags are intended to mean. [Section 28.5.3]
An extended TEI DTD is TEI conformant if it meets two basic requirements: (1) all extensions are documented in a prescribed way, and (2) all modifications are made in the DTD subset of the document (that is, the actual TEI DTD files may not be modified). To support DTD modification via the DTD subset, the TEI DTD was implemented using an ingenious system of entities:
In short, virtually any change (including wholesale redefinition) is conformant, as long as it is done using the prescribed mechanisms. Such a liberal view of conformance is probably troubling to most. The guidelines partially address this in section 29.1 by defining two classes of modifications: clean modifications versus unclean modifications. The implication is that the former are preferred over the latter:
Note that a modification that renames an element without creating a conflict with an existing element name is considered clean (section 29.1.2) since the set of documents matching the modified DTD is isomorphic to the set of documents matching the original DTD.
The TEI DTD was developed before the notion of SGML architectures was generalized. Had architectures existed, the TEI could have avoided devising its elaborate system of extension by adopting an architectural approach to conformance. Such an approach might work something like the following.
The TEI notion of original DTD corresponds to the architectural DTD and the TEI notion of modified DTD corresponds to the derived problem-specific DTD. A problem-specific DTD would be TEI conformant if it declared the TEI DTD to be its base architecture. Such a definition is comparable in its liberality to the TEI's definition. What is more significant is the distinction between clean and unclean conformance and the contribution the architectural approach can make to that question.
In the TEI approach to conformance, the notion of unclean versus clean has a formal definition in terms of the overlap or non-overlap of the two sets of documents matched by the two DTDs. In the TEI approach, the SGML parser cannot validate a modification as being clean or not; this is simply a matter for the DTD designer to reason about. The architectural approach, however, can change this. For both documents and DTDs, we could define two kinds of conformance in terms of parser behavior as follows:
This definition of clean conformance has essentially the same coverage as the TEI definition. The TEI definition has three basic cases which correspond as follows in the architectural approach:
This architectural approach to defining clean conformance has a major advantage over the TEI approach, namely, the SGML parser can formally test clean conformance for any user document. By simultaneously validating a document against its own DTD and its architectural DTD, clean conformance is achieved when no errors are reported for either DTD. When a document is valid against its own DTD, but generates errors with respect to the architectural DTD, then it is unclean conformance. When this happens there are two cases:
This approach does have one major weakness: the SGML parser can only verify that a particular document instance conforms to the architecture; it cannot verify that the problem-specific DTD conforms in the general case to the architectural DTD. That is, there is no way to ensure in advance that a particular problem-specific DTD only accepts documents that are also architecturally valid. For a case like the Sikaiana dictionary project, in which there is a closed set of data files, and we can easily validate them all against both DTDs, this limitation does not pose a problem. On the other hand, in a case like an industrial setting, where a run-time validation error could bring production to a screeching halt, this limitation could be a serious one.
The prototypical use of architectural processing has been to annotate one DTD with respect to the forms (or semantic elements) of another. This paper has demonstrated the application of architectural processing for a different purpose, namely, to indirectly validate a small DTD developed for a particular project against a large widely-used DTD that it is meant to be based on. By using this technique, a DTD developer can enjoy the benefits of a customized XML DTD without losing the benefits of the intellectual effort that went into developing the widely-used SGML DTD. By the same token, a project can have the advantages of delivering a customized XML application without losing the advantages of conforming to one of the widely-used SGML applications.
[Cla97] Clark, J. (1997) "Comparison of SGML and XML," World Wide Web Consortium NOTE-sgml-xml-971215. <http://www.w3.org/TR/NOTE-sgml-xml>.
[Cla98] Clark, J. (1998) SP: An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language, version 1.3. <http://jclark.com/sp/>. See especially "Architectural form processing," <http://jclark.com/sp/archform.htm>.
[Cor97] Corkern, C. (1997) "From architectures to authoring DTDs," SGML/XML '97 Conference Proceedings, pages 263-268. Alexandria, VA: Graphic Communications Association.
[Cov98] Cover, R. (1998) "Architectural Forms and SGML/XML Architectures," in The SGML/XML Web Page. <http://www.oasis-open.org/cover/topics.html#archForms>.
[DD94] DeRose, S. and Durand, D. (1994) Making Hypermedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. See especially pages 79-90.
[Don87] Donner, W. (1987) Sikaiana Vocabulary: Na male ma na talatala o Sikaiana. Honiara, Solomon Islands: published by the author through a grant from the South Pacific Cultures Fund of the Australian government. 267 pp.
[ISO92] International Organization for Standardization. (1992) ISO/IEC 10744. Hypermedia/Time-based Structuring Language: HyTime.
[ISO97] International Organization for Standardization. (1997) "Architectural Form Definition Requirements (AFDR)," Annex A.3 of ISO/IEC N1920, Information Processing--Hypermedia/Time-based Structuring Language (HyTime), Second edition 1997-08-01. <http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html>.
[Kim97] Kimber, W. E. (1997) "A tutorial introduction to SGML architectures," an ISOGEN International Corporation workpaper. <http://www.isogen.com/papers/archintro.html>.
[Meg98] Megginson, D. (1998) Structuring XML Documents. Charles F. Goldfarb Series on Open Information Management. Upper Saddle River, NJ: Prentice Hall.
[OUCS98] Oxford University Computing Services, Humanities Computing Unit. (1998) "The Pizza Chef: a TEI tag set selector," an interactive service on the Web. <http://www.oucs.ox.ac.uk/humanities/TEI/pizza.htm>.
[PTUM93] Perratore, E., T. Thompson, J. Udell, and R. Malloy. (1993) "Fighting fatware," Byte (April 1993), pp. 98-108.
[Sha95] Shafer, K. (1995) "Creating DTDs via the GB-Engine and Fred," a paper presented at SGML '95. <http://www.oclc.org/fred/docs/sgml95.html>. The software is available at <http://www.oclc.org/fred/>.
[Sim97] Simons, G. (1997) "Using architectural forms to map SGML data into an object-oriented database," SGML/XML '97 Conference Proceedings, pages 449-459. Alexandria, VA: Graphic Communications Association. A fuller workpaper is available at <http://www.sil.org/cellar/import/>.
[Sim98] Simons, G. (1998) "In search of task-centered software: building single-purpose tools from multipurpose components," SIL Electronic Working Paper 1998-004. Dallas: Summer Institute of Linguistics. <http://www.sil.org/silewp/1998/004/>.
[SMB94] Sperberg-McQueen, C. M. and L. Burnard (eds.). (1994) Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative. <http://www-tei.uic.edu/orgs/tei/p3/elect.html>. See especially chapter 12, "Print dictionaries," chapter 28, "Conformance," and chapter 29, "Modifying the TEI DTD."
[Wir95] Wirth, N. (1995) "A plea for lean software," IEEE Computer (February 1995), pp. 64-68.
Date created: 29-Dec-1998