The SGML model versus the object model, and the problem of converting from one to the other
Importing SGML data into CELLAR by means of architectural forms
Gary F. Simons
Summer Institute of Linguistics
Last revised: 12 November 1997
- The SGML model in a nutshell
- The object model in a nutshell
- An unsatisfactory default mapping from elements to objects
- An example of the kind of mapping we need
- The fundamental problem
- A basic architecture for mapping SGML data into objects
The problems inherent in importing SGML data into an object database stem from the differences between the SGML model of data and the object model of data. In speaking of the "object model of data," I am referring specifically to the way object databases [Cat97] and conceptual modeling languages [Bor85] represent information. Such systems replace the simple instance variables of an object-oriented programming language with attributes that encapsulate integrity constraints and the semantics of relationships to other objects.
In this document, the SGML model and the object model are reduced to their most basic features in order to make comparison easy. Then the problem of converting data from the SGML model to the object model is discussed. First a fully automatic approach to translation is demonstrated, but it is shown to be unusable because of a fundamental problem: sometimes an SGML element corresponds to an object, sometimes to an attribute, and sometimes to both. This sets the stage for the solution adopted in this working paper: an automatic translation guided by formal mapping rules, which are derived from a human analysis of how elements in the source SGML data relate to objects and attributes in the target object model. SGML architectural forms are used to encode the formal mapping.
In SGML, the fundamental unit of data representation is the element. Each element must have a generic identifier; it may optionally have a number of attributes or content or both. Each attribute has a name and a value; the value is represented by a string of characters. The content of an element may consist of character data or embedded elements or a combination of both. These generalizations may be expressed in terms of the following declarations:
<!ELEMENT element - - (attr* & content?) >
<!ATTLIST element
    gi     NAME   #REQUIRED >
<!ELEMENT attr    - O EMPTY >
<!ATTLIST attr
    name   NAME   #REQUIRED
    value  CDATA  #IMPLIED >
<!ELEMENT content - - (#PCDATA | element)* >
In the object model, the fundamental unit of data representation is the object. Each object must have a class, and is either a primitive object that stores primitive data like a string or a number, or is a complex object that has attributes. Each attribute has a name and a value; the value consists of embedded objects. These generalizations may be expressed in terms of the following declarations:
<!ELEMENT object - - (attr)* >
<!ATTLIST object
    class  NAME   #REQUIRED >
<!ELEMENT attr   - - (primitiveObject | object)* >
<!ATTLIST attr
    name   NAME   #REQUIRED >
<!ELEMENT primitiveObject - - (#PCDATA) >
<!ATTLIST primitiveObject
    class  NAME   #REQUIRED >
Element and object are superficially similar: generic identifier corresponds to class, both have attributes, and both occur recursively. They differ fundamentally, however, in the nature of the attributes and the recursion. With elements, the attributes cannot contain embedded structure; the recursion of elements is allowed only within the content of an element. With objects, there is no specialized notion of content; rather, the recursive embedding of further objects takes place within the attributes.
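The contrast between the two nutshell models can also be sketched as data structures. The following Python fragment is only an illustration: the class names Element, PrimitiveObject, and Obj are my own and are not part of SGML or CELLAR. It shows that element attributes hold flat strings while recursion happens in the content, whereas object attributes themselves hold the embedded objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


# SGML model: attribute values are plain strings; recursion lives in the content.
@dataclass
class Element:
    gi: str                                            # generic identifier
    attrs: Dict[str, str] = field(default_factory=dict)
    content: List[Union[str, "Element"]] = field(default_factory=list)


# Object model: a primitive object stores primitive data such as a string.
@dataclass
class PrimitiveObject:
    cls: str                                           # e.g. "String"
    data: str


# Object model: a complex object has no separate content; recursion lives
# inside the attribute values.
@dataclass
class Obj:
    cls: str
    attrs: Dict[str, List[Union[PrimitiveObject, "Obj"]]] = field(default_factory=dict)


# An element whose attribute is a flat string and whose content is character data.
phrase = Element("phrase", {"rend": "ital"}, ["an italic phrase"])

# An object whose attribute value is itself a list of embedded objects.
title = Obj("TitleStatement",
            {"maintitle": [PrimitiveObject("String", "The main title")]})
```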
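- Convert every instance of <element gi=X> to <object class=X>.
- Convert every instance of <attr name=X value=Y> to <attr name=X><primitiveObject class="String">Y</primitiveObject></attr>.
- Convert every instance of <content> to <attr name="content">.
- Embed every instance of #PCDATA within the tags <primitiveObject class="String"> and </primitiveObject>.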
For example, the following sample SGML element contains an instance of each of the four conditions listed above:
<phrase rend="ital">an italic phrase</phrase>
Following the nutshell model of SGML in section 1, this corresponds to the following semantic representation:
<element gi="phrase">
  <attr name="rend" value="ital">
  <content>an italic phrase</content>
</element>
This would be converted into the following object representation by the proposed default mapping:
<object class="phrase">
  <attr name="rend">
    <primitiveObject class="String">ital</primitiveObject>
  </attr>
  <attr name="content">
    <primitiveObject class="String">an italic phrase</primitiveObject>
  </attr>
</object>
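Because the default mapping is purely mechanical, it is easy to sketch as a short recursive function. The Python below is a minimal illustration under my own assumptions: the names Element, Primitive, Obj, and default_map are hypothetical, and the sketch ignores the possibility that a source element already has an attribute named content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:                       # SGML side
    gi: str
    attrs: Dict[str, str] = field(default_factory=dict)
    content: List[Union[str, "Element"]] = field(default_factory=list)


@dataclass
class Primitive:                     # object side: primitive object
    cls: str
    data: str


@dataclass
class Obj:                           # object side: complex object
    cls: str
    attrs: Dict[str, list] = field(default_factory=dict)


def default_map(el: Element) -> Obj:
    """Apply the default element-to-object mapping recursively."""
    obj = Obj(cls=el.gi)                       # the generic identifier becomes the class
    for name, value in el.attrs.items():       # each string value becomes a String object
        obj.attrs[name] = [Primitive("String", value)]
    if el.content:                             # the content becomes an attribute "content"
        obj.attrs["content"] = [
            Primitive("String", item) if isinstance(item, str)   # wrap PCDATA
            else default_map(item)                               # recurse into subelements
            for item in el.content
        ]
    return obj


phrase = Element("phrase", {"rend": "ital"}, ["an italic phrase"])
result = default_map(phrase)
```

Applied to the sample phrase element, result holds an object of class phrase with the two attributes rend and content, each containing a single String object.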
The default transformation described in the preceding section can easily be done on any SGML document, but it will seldom yield a result that actually fits the conceptual model of a target object database. Consider, for instance, the following simplistic SGML document:
<!DOCTYPE document SYSTEM "document.dtd">
<document>
  <creationDate>12-Jun-97</creationDate>
  <title>
    <maintitle>The main title</maintitle>
    <subtitle>a subtitle</subtitle>
  </title>
  <authors>
    <author>
      <name>First Author</name>
      <affil>Some Company</affil>
    </author>
    <author>
      <name>Second Author</name>
      <affil>Another Company</affil>
    </author>
  </authors>
  <p>An introductory paragraph</p>
  <div1><!-- The first section --></div1>
  <div1><!-- The second section --></div1>
</document>
The above represents a typical approach to encoding a document in SGML. But compare it to the following, which is equally typical of how a Document class might be defined in an object database:
class Document has
    creationDate : Date
    title        : TitleStatement
    authors      : sequence of Person
    content      : sequence of Paragraph or Division
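For readers more at home in a programming language, the same target schema might be approximated with Python dataclasses. This is only a sketch: the attributes of TitleStatement follow the maintitle and subtitle elements of the sample document, while the attributes of Person, Paragraph, and Division are assumptions of mine, since the class definition above does not spell them out. Reading the year "97" as 1997 is likewise an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union


@dataclass
class TitleStatement:
    maintitle: str
    subtitle: str = ""


@dataclass
class Person:
    name: str
    affil: str = ""          # assumed from the <affil> element in the sample


@dataclass
class Paragraph:
    text: str                # assumed; Paragraph is not defined in the text


@dataclass
class Division:              # assumed; Division is not defined in the text
    content: List[Union[Paragraph, "Division"]] = field(default_factory=list)


@dataclass
class Document:
    creationDate: date
    title: TitleStatement
    authors: List[Person] = field(default_factory=list)
    content: List[Union[Paragraph, Division]] = field(default_factory=list)


# The sample SGML document expressed as objects in this schema.
doc = Document(
    creationDate=date(1997, 6, 12),
    title=TitleStatement("The main title", "a subtitle"),
    authors=[Person("First Author", "Some Company"),
             Person("Second Author", "Another Company")],
    content=[Paragraph("An introductory paragraph"), Division(), Division()],
)
```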
The default mapping proposed in section 3 would first
go wrong by putting all the subelements within the document in a single attribute
named content; instead we want to map them into four different attributes.
The first three subelements (<creationDate>, <title>, and <authors>) correspond
to Document attributes of the same name. The remaining subelements
(<p> and two instances of <div1>)
correspond to objects that go into the Document attribute named
content (which happens not to be explicitly tagged). Though the first
three subelements correspond to attributes, they differ significantly in
the way they do so.
<creationDate> additionally carries
the information that the embedded PCDATA content should be mapped onto a
basic object of class Date.
<title> not only corresponds
to the attribute title but also to an object of class TitleStatement
(which in turn has attributes maintitle and subtitle). By contrast,
<authors> corresponds to the attribute and nothing more;
each embedded <author> element corresponds to an object
of class Person.
This example illustrates the following fundamental result when comparing the SGML model to the object model: some SGML elements encode an object, some encode an attribute, and still others simultaneously encode both. (We see in the full CELLAR architecture that still other relationships are possible.) The basic challenge of importing SGML data into an object database is to determine which of these cases holds for each of the element types occurring in the data, and then to express formally how each maps onto the corresponding classes and attributes of the target database schema.
The HyTime standard [ISO92] first introduced the concept of architectural forms as a way to associate standardized semantics with elements in user-defined DTDs [DD94]. Now that this notion has been generalized in the SGML Extended Facilities (defined in Annex A of the revised HyTime standard [ISO97]), we can use it to good advantage in solving the problem at hand. Architectural forms provide a mechanism we can use to express the semantics of how SGML elements map onto the object model. See [Cov97] for pointers to other applications of architectural forms.
There are two basic element forms in the architecture, <object> and
<attr>. Rather than having
a third form for the case when an element corresponds to both an object and
an attribute, this case is treated as being a mapping to an object, and the
object form adds an architectural attribute to name the attribute it also
maps to. The basic definitions of these two forms are as follows (see the
main paper for their full definition):
<!ELEMENT object - - (object | attr | #PCDATA)* >
<!ATTLIST object
    class        -- Create this class of CELLAR object --
                 CDATA  #REQUIRED
    parentAttr   -- Put the object in this attr of its parent --
                 CDATA  #IMPLIED
    contentAttr  -- Put embedded objects in this attribute --
                 CDATA  #IMPLIED
    pcdataClass  -- Create this class for embedded PCDATA --
                 CDATA  "String" >

<!ELEMENT attr - - (object | #PCDATA)* >
<!ATTLIST attr
    contentAttr  -- Put embedded objects in this attribute --
                 CDATA  #IMPLIED
    pcdataClass  -- Create this class for embedded PCDATA --
                 CDATA  "String" >
The easiest way to explain these forms is by example. In the illustrative
document in section 4, the
<document> element corresponds to an object of class Document;
the element content (unless an embedded element names a specific target
attribute) goes into the content attribute of the object. The
<document> element would be annotated as follows to indicate
its mapping into the object model:
<document cellar=object class="Document" contentAttr="content">
This says that in the architecture named cellar, this
<document> element corresponds to an
<object> element whose class is "Document" and
whose contentAttr is "content".
The <creationDate> element corresponds to an attribute.
Its content goes into the creationDate attribute, and the embedded
PCDATA needs to be converted into Date objects. Thus,
<creationDate cellar=attr contentAttr="creationDate" pcdataClass="Date">
The <title> element corresponds to a TitleStatement object,
but it also corresponds to an attribute in that it maps into the title
attribute of its parent object (that is, the Document). Thus,
<title cellar=object class="TitleStatement" parentAttr="title">
The <authors> element corresponds to the
authors attribute; thus,
<authors cellar=attr contentAttr="authors">
An SGML parser that performs architectural processing can take elements annotated like this and translate them into elements of the target architecture. Return to the main paper for an explanation of how this works.
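To make the effect of these annotations concrete, here is a rough Python sketch of a translator driven by them. It is not the processing model defined in the main paper, only one plausible reading of the two forms. The names RULES, fill, and build are hypothetical, the annotations are keyed by generic identifier rather than read from the elements themselves, and the rules marked "assumed" cover elements that the text above does not annotate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:                       # a parsed SGML element
    gi: str
    content: List[Union[str, "Element"]] = field(default_factory=list)


@dataclass
class Primitive:
    cls: str
    data: str


@dataclass
class Obj:
    cls: str
    attrs: Dict[str, list] = field(default_factory=dict)


# Architectural annotations, keyed by generic identifier.
RULES = {
    "document":     {"form": "object", "class": "Document", "contentAttr": "content"},
    "creationDate": {"form": "attr", "contentAttr": "creationDate", "pcdataClass": "Date"},
    "title":        {"form": "object", "class": "TitleStatement", "parentAttr": "title"},
    "authors":      {"form": "attr", "contentAttr": "authors"},
    "maintitle":    {"form": "attr", "contentAttr": "maintitle"},                     # assumed
    "author":       {"form": "object", "class": "Person"},                            # assumed
    "name":         {"form": "attr", "contentAttr": "name"},                          # assumed
    "p":            {"form": "object", "class": "Paragraph", "contentAttr": "text"},  # assumed
}


def fill(owner: Obj, items, content_attr, pcdata_class):
    """Place each content item into an attribute of the owning object."""
    for item in items:
        if isinstance(item, str):
            if item.strip():         # skip whitespace-only PCDATA
                owner.attrs.setdefault(content_attr, []).append(
                    Primitive(pcdata_class, item))
            continue
        rule = RULES[item.gi]
        if rule["form"] == "attr":
            # An attr-form element creates no object of its own;
            # its content lands in the named attribute of the owner.
            fill(owner, item.content, rule["contentAttr"],
                 rule.get("pcdataClass", "String"))
        else:
            # An object-form element goes into its parentAttr if it names one,
            # otherwise into the attribute currently receiving content.
            target = rule.get("parentAttr", content_attr)
            owner.attrs.setdefault(target, []).append(build(item))


def build(el: Element) -> Obj:
    """Create the object for an object-form element and fill its attributes."""
    rule = RULES[el.gi]
    obj = Obj(rule["class"])
    fill(obj, el.content, rule.get("contentAttr"), rule.get("pcdataClass", "String"))
    return obj


doc = Element("document", [
    Element("creationDate", ["12-Jun-97"]),
    Element("title", [Element("maintitle", ["The main title"])]),
    Element("authors", [Element("author", [Element("name", ["First Author"])])]),
    Element("p", ["An introductory paragraph"]),
])
result = build(doc)
```

In this sketch the Document object ends up with four attributes (creationDate, title, authors, and content) rather than the single content attribute produced by the default mapping, because each subelement either names its own target attribute or falls through to the contentAttr of its parent.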
Document date: 12-Nov-1997