This document is a roadmap for the Darwin Information Typing Architecture: what it is and how it applies to technical documentation. It is also a product of the architecture, having been written entirely in XML and produced using the principles described here...
The Darwin Information Typing Architecture (DITA) is an XML-based, end-to-end architecture for authoring, producing, and delivering technical information. This architecture consists of a set of design principles for creating "information-typed" modules at a topic level and for using that content in delivery modes such as online help and product support portals on the Web.
At the heart of DITA (Darwin Information Typing Architecture), representing the generic building block of a topic-oriented information architecture, is an XML document type definition (DTD) called "the topic DTD." The extensible architecture, however, is the defining part of this design for technical information; the topic DTD, or any schema based on it, is just an instantiation of the design principles of the architecture.
This architecture and DTD were designed by a cross-company workgroup representing user assistance teams from across IBM. After an initial investigation in late 1999, the workgroup developed the architecture collaboratively during 2000 through postings to a database and weekly teleconferences. The architecture has been placed on IBM's developerWorks Web site as an alternative XML-based documentation system, designed to exploit XML as its encoding format. With the delivery of these significant updates, which contain enhancements for consistency and flexibility, we consider the DITA design to be past its prototype stage.
IBM, with millions of pages of documentation for its products, has its own very complex SGML DTD, IBMIDDoc, which has supported this documentation since the early 1990s. The workgroup had to consider from the outset, "Why not just convert IBMIDDoc or use an existing XML DTD such as DocBook, or TEI, or XHTML?" The answer requires some reflection about the nature of technical information.
First, both SGML and XML are recognized as meta languages that allow communities of data owners to describe their information assets in ways that reflect how they develop, store, and process that information. Because knowledge representation is so strongly related to corporate cultures and community jargon, most attempts to define a universal DTD have ended up either unused or unfinished. The ideal for information interchange is to share the semantics and the transformational rules for this information with other data-owning communities.
Second, most companies rely on many delivery systems, or process their information in ways that differ widely from company to company. Therefore any attempt at a universal tool set also proves futile. The ideal for tools management is to base a processing architecture on standards, to leverage the contributed experience of many others, and to solve common problems in a broad community.
Third, most attempts to formalize a document description vocabulary (DTD or schema) have been done as information modelling exercises to capture the current business practices of data owners. This approach tends to encode legacy practices into the resulting DTDs or vocabularies. The ideal for future extensibility in DTDs for technical information (or any information that is continually exploited at the leading edge of technology) is to build the fewest presumptions about the "top-down" processing system into the design of the DTD.
In the beginning, the workgroup tried to understand the role of XML in this leading edge of information technology. As the work progressed, the team became aware that any DTD design effort would have to account for a plurality of vocabularies, a tools-agnostic processing paradigm, and a legacy-free view of information structures. Many current DTDs incorporate ways to deal with some of these issues, but the breadth of the issues led to more than just a DTD. To support many products, brands, companies, styles, and delivery methods, the entire authoring-to-delivery process had to be considered. What resulted was a range of recommendations that required us to represent our design, not just as a DTD, but as an information architecture.
As the "Architecture" part of DITA's name suggests, DITA has unifying features that serve to organize and integrate information:
The various information architectures for online deliverables all tend to focus on the idea of topics as the main design point for such information. A topic is a unit of information that describes a single task, concept, or reference item. The information category (concept, task, or reference) is its information type (or infotype). A new information type can be introduced by specialization from the structures in the base topic DTD. Typed topics are easily managed within content management systems as reusable, stand-alone units of information. For example, selected topics can be gathered, arranged, and processed within a delivery context to provide a variety of deliverables. These deliverables might be groups of recently updated topics for review, helpsets for building into a user assistance application, or even chapters or sections in a booklet that are printed from user-selected search results or "shopping lists."
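The idea of gathering and arranging selected topics for different deliverables can be sketched in a few lines of Python. This is only an illustration: the `Topic` class, the `build_deliverable` function, and the sample topic IDs and titles are all hypothetical and are not part of DITA itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    """A stand-alone unit of information (hypothetical model, not a DITA API)."""
    topic_id: str
    infotype: str  # "concept", "task", or "reference"
    title: str


def build_deliverable(topics, wanted_ids):
    """Gather the selected topics and arrange them in the requested order."""
    by_id = {t.topic_id: t for t in topics}
    return [by_id[topic_id] for topic_id in wanted_ids]


# A small, made-up repository of typed topics.
repository = [
    Topic("c_overview", "concept", "What the product does"),
    Topic("t_install", "task", "Installing the product"),
    Topic("r_messages", "reference", "Message reference"),
]

# The same topics feed two different delivery contexts.
helpset = build_deliverable(repository, ["c_overview", "t_install", "r_messages"])
review_packet = build_deliverable(repository, ["t_install"])

print([t.title for t in helpset])
print([t.title for t in review_packet])
```

Because each topic stands alone, the same source unit appears in both deliverables without being copied or rewritten.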
Through topic granularity and topic type specialization, DITA brings the following benefits of the object-oriented model to information sets:
DITA can be considered object-oriented in that:
With discipline and ingenuity, some of the benefits of topic information sets can be provided through a book DTD. In particular, techniques for chunking can generate topics out of a book DTD. In DITA, the converse approach is possible: a book can be assembled from a set of DITA topics. In both cases, however, the adaptation is secondary to the primary purpose of the DTD. That is, if you are primarily authoring books, it makes the most sense to use a DTD that is designed for books. If you are primarily authoring topics, it makes sense to use a DTD that is designed for topics and can scale to large, processable collections of topics.
The Darwin Information Typing Architecture defines a set of relationships between the document parts, processors, and communities of users of the information.
The Darwin Information Typing Architecture has the following layers that relate to specific design points expressed in its core DTD, topic.
Delivery contexts | ||
---|---|---|
helpset | aggregate printing | Web site; information portal |
Typed topic structures | |||
---|---|---|---|
topic | concept | task | reference |
Specialized vocabularies (domains) across information types | |||
---|---|---|---|
Typed topic: | concept | task | reference |
Included domains: |
|
Common structures | ||
---|---|---|
metadata | OASIS (CALS) table |
A typed topic, whether concept, task, or reference, is a stand-alone unit of ready-to-be-published information. Above it are any processing applications that may be driven by a superset DTD; below it are the two types of content models that form the basis of all specialized DTDs within the architecture. We will look at each of these layers in more detail.
This domain represents the processing layer for topical information. Topics can be processed singly or within a delivery context that relates multiple topics to a defined deliverable. Delivery contexts also include document management systems, authoring units, packages for translation, and more.
delivery contexts | ||
---|---|---|
helpset | aggregate printing | Web site; information portal |
The typed topics represent the fundamental structuring layer for DITA topic-oriented content. The basis of the architecture is the topic structure, from which the concept, task, and reference structures are specialized. Extensibility to other typed topics is possible by further specialization.
typed topic structures | |||
---|---|---|---|
topic | concept | task | reference |
The four information types (topic, concept, task, and reference) represent the primary content categories used in the technical documentation community. Moreover, specialized information types, based on the original four, can be defined as required.
As a notable feature of this architecture, communities can define or extend additional information types that represent their own data. Examples of such content include product support information, programming message descriptions, and GUI definitions. Besides the ability to type topics and define specific content models therein, DITA also provides the ability to extend tag vocabularies that pertain to a domain. Domain specialization takes the place of what had been called "shared structures" in DITA's original design.
Commonly, when a set of infotyped topics is used within a domain of knowledge, such as computer software or hardware, a common vocabulary is shared across the infotyped topics. However, the same infotyped topic can be used across domains that have different vocabularies and semantics. For example, a hardware reference topic might refer to diagnostic codes while a software reference topic might refer to error message numbers, with neither domain necessarily needing to expose the other domain's unique vocabulary to its own writers.
specialized vocabularies (domains) across information types | |||
---|---|---|---|
Typed topic: | concept | task | reference |
Included domains: |
|
The basic domains defined as examples for DITA include:
Domain | Elements |
---|---|
highlighting | b, u, i, tt, sup, sub |
software | msgph, msgblock, msgnum, cmdname, varname, filepath, userinput, systemoutput |
programming | codeph, codeblock, option, var, parmname, synph, oper, delim, sep, apiname, parml, plentry, pt, pd, syntaxdiagram, synblk, groupseq, groupchoice, groupcomp, fragment, fragref, synnote, synnoteref, repsep, kwd |
user interface | uicontrol, wintitle, menucascade, shortcut |
By following the rules for specializing a new domain of content, you can extend, replace, or remove these domains. Moreover, content specialization enables you to name and extend any content element in the scope of DITA infotyped topics for a more semantically significant role in a new domain.
To enable specialized vocabulary, you declare a parameter entity equivalent for every element used in a DTD (such as topic or one of its specializations), and then use the parameter entities instead of literal element tokens within the content models of that DTD. Later, a front-end DTD "shell" redefines an element's parameter entity to include both the original element and the domain elements derived from that element. After entity substitution, the derived domain elements are therefore allowed anywhere the original element is allowed. In effect, a domain-agnostic topic can be easily extended for different domains by simply changing the scope of entity set inclusions in the shell, which formalizes the vocabulary extensions within that typed topic or family of typed topics.
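The substitution mechanism can be imitated in Python to show its effect on a content model. This is a simulation under stated assumptions, not a DTD parser: the base elements `ph` and `keyword`, the paragraph content model, and the assignment of each domain element to a base element are chosen for illustration, and `shell_entities` and `expand` are hypothetical helper names.

```python
import re

# Base elements of a hypothetical topic DTD. Content models refer to them
# through parameter entities (%ph;) instead of literal element tokens.
BASE_ELEMENTS = ["ph", "keyword"]
P_CONTENT_MODEL = "(#PCDATA | %ph; | %keyword;)*"

# Domain entity sets: base element -> elements derived from it.
# The base-element assignments here are assumptions for illustration.
HIGHLIGHT_DOMAIN = {"ph": ["b", "u", "i", "tt", "sup", "sub"]}
SOFTWARE_DOMAIN = {
    "ph": ["msgph", "filepath", "userinput", "systemoutput"],
    "keyword": ["msgnum", "cmdname", "varname"],
}


def shell_entities(*domains):
    """Redefine each entity as the original element plus its derived elements."""
    merged = {name: [name] for name in BASE_ELEMENTS}
    for domain in domains:
        for base, derived in domain.items():
            merged[base].extend(derived)
    return {name: " | ".join(tokens) for name, tokens in merged.items()}


def expand(model, entities):
    """Perform entity substitution on a content model string."""
    return re.sub(r"%([A-Za-z][\w.-]*);", lambda m: entities[m.group(1)], model)


# A shell with no domains, and a shell that includes the highlighting domain.
print(expand(P_CONTENT_MODEL, shell_entities()))
print(expand(P_CONTENT_MODEL, shell_entities(HIGHLIGHT_DOMAIN)))
```

The content model string itself never changes. Only the set of domains passed to the shell differs, which is the point of routing every element token through an entity.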
One of the design points of DITA has been to exploit the reuse of common substructures within the world of XML. Accordingly, the topic DTD incorporates the OASIS table model (known originally as the CALS table model). It also has a defined set of metadata that might be shared directly with the metadata models of quite different DTDs or schemas.
common structures | ||
---|---|---|
metadata | OASIS (CALS) table |
The metadata structure defines document control information for individual topics, higher-level processing DTDs, or HTML documents that are associated with the metadata as side files or as database records.
The table structure provides presentational semantics for body-level content. The OASIS/CALS table display model is supported in many popular XML editors.
A company that has specific information needs can define specialized topic types. For example, a product group might identify three main types of reference topic: messages, utilities, and APIs. By creating a specialized topic type for each type of content, the product architect can be assured that each type of topic has the appropriate content. In addition, the specialized topics make XML-aware search more useful because users can make fine-grained distinctions. For example, a user could search for "xyz" only in messages or only in APIs, as well as search for "xyz" across reference topics in general.
There are rules for how to specialize safely: each new information type must map to an existing one and must be more restrictive in the content that it allows. With such specialization, new information types can use generic processing streams for translation, print, and Web publishing. Although a product group can override or extend these processes, they get the full range of existing processes by default without any extra work or maintenance.
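Two consequences of this mapping rule, default processing and scoped search, can be sketched in Python. The sketch assumes a simple parent table for the message, utility, and API types from the example above. The `PARENT` table, the function names, the sample topics, and the toy processors are hypothetical and do not describe how any DITA tool is implemented.

```python
# Hypothetical specialization map: each new type names the existing type
# that it maps to.
PARENT = {
    "concept": "topic",
    "task": "topic",
    "reference": "topic",
    "message": "reference",
    "utility": "reference",
    "api": "reference",
}


def ancestry(infotype):
    """Return the type followed by every type it was specialized from."""
    chain = [infotype]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def pick_processor(infotype, processors):
    """Use the most specific registered processor, falling back to ancestors."""
    for name in ancestry(infotype):
        if name in processors:
            return processors[name]
    raise LookupError(f"no processor for {infotype!r}")


def search(topics, text, scope):
    """Find topics containing text whose type is, or specializes, scope."""
    return [topic_id for topic_id, infotype, body in topics
            if text in body and scope in ancestry(infotype)]


# Only "message" overrides the generic output; "api" inherits it.
processors = {
    "topic": lambda title: f"<h1>{title}</h1>",
    "message": lambda title: f"<h1 class='msg'>{title}</h1>",
}

topics = [
    ("m1", "message", "xyz failed to start"),
    ("a1", "api", "call xyz() to begin"),
    ("c1", "concept", "what xyz is for"),
]

print(pick_processor("api", processors)("xyz()"))
print(search(topics, "xyz", "message"))
print(search(topics, "xyz", "reference"))
```

The API type gets the generic topic output without any work by its owners, while the message type overrides it. The same ancestry lookup lets a search be narrowed to messages or widened to all reference topics.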
A corporation can have a series of DTDs that represent a consistent set of information descriptions, each of which emphasizes the value of specialization for those new information types.
The technical documentation community that designed this architecture defined the basic architecture and shared resources. The content owned by specified communities (within or outside of the defining community) can reuse processors, styles, and other features already defined. However, those communities are responsible for their unique business processes based on the data that they manage. They can manage data by creating a further specialization from one of the base types.
The following figure represents how communities, as "content owners at the topic level," can specialize their content based on the core architecture.
In this figure, the overlap represents the common architecture and tools shared between content-owning communities that use this information architecture. New communities that define typed documents according to the architecture can then use the same tools at the outset, and refine their content-specific tools as needed.
© Copyright International Business Machines Corp., 2002, 2003. All rights reserved.
The information provided in this document has not been submitted to any formal IBM test and is distributed "AS IS," without warranty of any kind, either express or implied. The use of this information or the implementation of any of these techniques described in this document is the reader's responsibility and depends on the reader's ability to evaluate and integrate them into their operating environment. Readers attempting to adapt these techniques to their own environments do so at their own risk.