This article was originally printed in DB2 Magazine.
Your company is probably already using it. It’s time to figure out how to manage it.
XML Origins
Back in the early 1980s, when I worked as a systems engineer in an IBM branch office, I didn’t use a word processor to create letters, reports, and other items of business documentation (the personal computer was quite new at the time); instead, I used an IBM product called the Document Composition Facility (DCF), which ran on a mainframe server under the Virtual Machine (VM) operating system. More specifically, I used a text formatter, called SCRIPT, that was a component of DCF. Sitting at a “dumb terminal” (the term “thin client” was not yet in vogue), I would enter the text of my document and insert “tags” into the file to control the appearance of the printed document. For example, I’d indicate the start of a paragraph with this tag (the tags were preceded by a colon and followed by a period):
:p.The IBM 3350 disk drive is a high-performance…
If I wanted to boldface a word, I would use the :hp2. tag (the :ehp2. tag denotes the point at which the use of the bold typeface is to stop):
:p.The 3033 computer offers a whopping :hp2.16 megabytes:ehp2. of central storage…
These tags, and others used to format SCRIPT files, comprise what is known as the Generalized Markup Language (GML).
GML was originally developed by IBM in the 1960s. In the mid-1980s, a GML descendant called SGML (Standard Generalized Markup Language) made the scene. SGML is a metalanguage. Just as metadata defines data, the SGML metalanguage can be used to define document markup languages. This means that SGML is extensible; however, SGML is also complex, and, as a result, it tends to be used in specialized circumstances. [Wikipedia entries for GML and SGML helped jog my memory about the relationship between SCRIPT and DCF, and the relationship between GML and SGML.]
In the early 1990s, HTML (HyperText Markup Language), which is based on SGML tagging, was developed to control the appearance of content displayed via one of those new-fangled things called a Web browser.
HTML is great for making data look good when displayed on a Web page, but it doesn’t confer meaning to data. You might dispute that contention because you can look at an HTML-formatted price list (for example) on a Web page and tell that it’s a price list. But that’s because your brain understands written language and visual cues, and you have seen jillions of price lists in your lifetime. Your brain is a lot more powerful than most computers (in some ways, more powerful than any computer), so what works for you and me is not such a good solution when a machine has to interpret data in a document (or file). In an era of rapidly growing data interchange between organizations running all manner of computer systems, what was needed was a data-defining markup language flexible enough to adapt to new requirements yet simple enough to be used for general-purpose applications.
That language showed up in 1998. It’s called Extensible Markup Language, but it’s better known as XML.
Why XML is Important
Of course, XML didn’t make it possible for organizations to exchange data electronically — that had already been going on for several decades. But XML was a catalyst that greatly expanded the world of electronic inter-system data exchange. How so? Consider the benefits delivered by XML:
Productivity. It’s typically easier for an organization’s application developers to code programs that process data in XML format vs. legacy formats that rely on the positioning of data to confer meaning (as in, “the 10-character field beginning at offset 12 from the beginning of the record is the account code”).
Flexibility. Because XML is very open, the number of organizations that can use it is very large. If a manufacturing firm that exchanges data with suppliers in XML format wants to add a new company to its supplier network, chances are that supplier is already XML-capable (or could become so quickly).
Adaptability. Suppose an organization wants to exchange data with peers in XML format, but the data in question is very specialized and some fields can’t be properly described by any existing XML tags. No problem: the organizations can develop their own tags to suit their needs (and these tags can eventually be made universally available, just as a dictionary is periodically expanded with new entries).
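To make the productivity point concrete, here is a minimal Python sketch that reads the same record in a position-dependent layout and in XML. The record layout, the field names, and the sample values are all invented for illustration (only the "10-character field at offset 12" idea comes from the text above); treat it as a sketch, not as any real interchange format.

```python
import xml.etree.ElementTree as ET

# Hypothetical fixed-width layout: the meaning of each field depends
# entirely on where it sits in the record.
#   offsets 0-11  : customer id
#   offsets 12-21 : account code (the 10-character field at offset 12)
#   offsets 22-29 : amount in cents
FIXED_RECORD = "CUST00000042" + "ACCT001234" + "00015000"


def parse_fixed(record):
    """Slice the record by offset; any layout change breaks this code."""
    return {
        "customer": record[0:12],
        "account_code": record[12:22],
        "amount_cents": int(record[22:30]),
    }


# The same (made-up) data in XML: the tags carry the meaning.
XML_RECORD = """<payment>
  <customer>CUST00000042</customer>
  <accountCode>ACCT001234</accountCode>
  <amountCents>15000</amountCents>
</payment>"""

# Same data again, with the elements reordered and an extra element added.
REORDERED_XML = """<payment>
  <memo>rent</memo>
  <amountCents>15000</amountCents>
  <accountCode>ACCT001234</accountCode>
  <customer>CUST00000042</customer>
</payment>"""


def parse_xml(text):
    """Look fields up by name; order and extra elements do not matter."""
    root = ET.fromstring(text)
    return {
        "customer": root.findtext("customer"),
        "account_code": root.findtext("accountCode"),
        "amount_cents": int(root.findtext("amountCents")),
    }


print(parse_fixed(FIXED_RECORD))
print(parse_xml(XML_RECORD))
print(parse_xml(REORDERED_XML))
```

All three calls produce the same dictionary. The difference is that the XML reader keeps working when a partner reorders fields or adds a new one, while the offset-based reader would have to be recoded.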
Basically, it’s all about speed. To succeed in business (or even in nonprofit activities), organizations have to be able to respond quickly and effectively to shifting market dynamics and changing customer needs. Corporations need to be able to forge working partnerships with other companies in short order to address new opportunities. People need to be able to rapidly solve business problems without being slowed by complex rules around the exchange of data with third parties. Organizations that leverage XML gain agility in return.
Compare and Contrast
I recently had the opportunity to talk XML with Chris Eaton, a
senior product manager at IBM’s Toronto Lab (home of DB2 for Linux, Unix, and
Windows). Chris shared an example that really impressed upon me the value of XML
as a data-describing language.
FIX (Financial Information eXchange) is a protocol used by financial services
organizations for the exchange of data pertaining to securities transactions
(such as stock trades). Listing 1 shows a trade involving the purchase of 1,000
shares of IBM stock in the FIX protocol.
Listing 2 shows the same information in the FIXML protocol, a recently developed
XML representation of FIX. If you were an application programmer, which format
would you rather work with?
Note that although XML is certainly more intuitive than many legacy
data-exchange protocols, it can also result in larger message sizes. Some years
ago, that might have been a problem; indeed, a number of data-exchange protocols
were likely designed to minimize the size of a data transmission (in order to
optimize throughput and make efficient use of costly network, server, and disk
storage systems). Today, network bandwidth, processor speeds, and disk subsystem
capacities are much greater (and cheaper) than they were when I got started in
IT, and it seems that many companies are more than willing to trade somewhat
greater IT infrastructure resource consumption for significantly improved
organizational productivity.
Some XML Lingo
Before getting into DB2 9’s XML support, I’ll define a few terms (assisted by the handout for an XML presentation delivered a few years ago by IBM’s Susan Malaika):
The Document Type Definition (DTD) defines the “grammar” (tags and such) for a given type of XML document. As Susan put it, a DTD is to a type of XML document what DDL is to a DB2 table. An XML document structure can also be defined by an XML schema.
An element of an XML document is a piece bracketed by a particular tag and its associated “end” tag (for example, <title>Big Book of DB2 Facts</title>).
An attribute in an XML document is a value assignment within an element (for example, <chapter id="1">).
The root element is the outermost of the nested elements within an XML document (for example, if an XML document describing a book begins with a <book> tag and ends with a </book> tag, and all other elements such as “chapter” and “price” are nested therein, “book” is the root element of the document).
An XML document is well-formed if it can be parsed without error by an XML parser. Error-free parsing occurs when the document conforms to a few rules:
It has exactly one root element.
Each “begin” tag is matched by an “end” tag (for example, <section> and </section>).
All elements are properly nested (if the “begin” tag for element A precedes the “begin” tag for element B, the “end” tag for element B precedes the “end” tag for element A).
Attribute values appear in quotes (<price date="12/22/2006">).
Certain characters are not used in tags or values (an example of a disallowed character is “<”, so 3<5 is specified as 3&lt;5).
An XML document is valid if it is well-formed and it complies with a specific DTD or XML schema.
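The well-formedness rules can be tried out with any XML parser. The sketch below uses the parser in Python's standard library; the helper name is_well_formed and the sample documents are made up for illustration, with one sample per rule.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as a complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents, roughly one per well-formedness rule.
SAMPLES = [
    ("well-formed document",
     '<book><chapter id="1"><title>Big Book of DB2 Facts</title></chapter></book>'),
    ("two root elements", "<book></book><book></book>"),
    ("missing end tag", "<book><section></book>"),
    ("improper nesting", "<a><b></a></b>"),
    ("unquoted attribute value", "<price date=12/22/2006>40</price>"),
    ("raw < in a value", "<expr>3<5</expr>"),
    ("escaped < in a value", "<expr>3&lt;5</expr>"),
]

for label, text in SAMPLES:
    print(f"{label}: {is_well_formed(text)}")
```

Only the first and last samples parse without error. Note that this checks well-formedness only: the standard-library parser used here does not validate a document against a DTD or an XML schema, so validity would need a validating parser.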
DB2 9 XML Support: The Real Deal
With XML being so pervasive, it makes sense that organizations would want to store XML documents in a DBMS, right? Well, the relational DBMS for people who are XML-oriented is DB2 9.
“Why is that?” you might ask. “Other relational DBMSs have provided XML support for a while now. Even DB2 has had XML support for years in the form of the XML Extender.” True enough, but DB2 9 takes XML support to a whole new level.
Before DB2 9, you basically had two options when faced with the task of storing XML data (which is inherently hierarchical) in a relational DBMS:
You could “shred” the document, storing the element and attribute values in the columns of various tables in the database.
You could store documents in their entirety as character large objects (CLOBs)
in a table column.
There are several problems associated with the “shredding” approach: the number
of tables required can be large, queries needed to “put Humpty Dumpty together
again” can be quite complex, and a change to the XML schema that defines a
document type can create a real headache in terms of changing the associated
relational schema.
The CLOB approach has its own problems; the primary one is the fact that searching for particular element and attribute specifications within XML documents stored as CLOBs requires XML parsing — something that drives up CPU consumption and lengthens response time.
Enter DB2 9. Now, for the first time, you can store an XML document as a true XML data type, and it will be stored as an entity in the DB2 database (no shredding needed). But — unlike the CLOB approach — DB2 has visibility down to the element and attribute level of the document. In other words, the relational DBMS is aware of the hierarchical structure of the XML document. You can build indexes on the XML documents, with keys generated based on specified attribute patterns. You can query the data in the XML documents using SQL or the XQuery language. You can take advantage of built-in XML parsing and validation functions.
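The difference between the CLOB approach and a structure-aware store can be illustrated with a small Python analogy. This is not DB2 code and says nothing about how DB2 9 is implemented internally; the documents, the function names, and the dictionary used as an "index" are all assumptions made for the sake of the sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical documents held as plain strings, standing in for a CLOB column.
DOCS = [
    '<book id="1"><title>Big Book of DB2 Facts</title><price>40</price></book>',
    '<book id="2"><title>XML Basics</title><price>25</price></book>',
    '<book id="3"><title>SQL Recipes</title><price>40</price></book>',
]


def clob_search(docs, path, value):
    """CLOB style: the store sees opaque text, so every search re-parses
    every document before it can look at an element."""
    hits = []
    for position, text in enumerate(docs):
        root = ET.fromstring(text)  # parsing cost is paid on each query
        if root.findtext(path) == value:
            hits.append(position)
    return hits


def build_path_index(docs, path):
    """Structure-aware style: parse each document once, when it is stored,
    and remember which documents hold which value at the given path."""
    index = defaultdict(list)
    for position, text in enumerate(docs):
        found = ET.fromstring(text).findtext(path)
        if found is not None:
            index[found].append(position)
    return index


price_index = build_path_index(DOCS, "price")

print(clob_search(DOCS, "price", "40"))  # parses all three documents
print(price_index["40"])                 # a lookup, no parsing at query time
```

Both lines print the same document positions. The point of the analogy is where the parsing work happens: on every search in the CLOB case, versus once up front when the store understands the document structure and can maintain an index over it.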
What’s the payoff? Enhanced ease of use, and improved (sometimes dramatically)
query performance.
Here’s how I think of DB2 9’s XML support compared to what was there before: the
old capabilities were like a singer who speaks English and records a song in
Japanese without knowing the language — he just sings phonetically, getting the
pronunciation right but having no understanding of the meaning of the sounds
he’s making. The new technology is like the singer after he’s learned to speak
Japanese. He actually understands what he’s singing, and what’s more, because he
speaks both English and Japanese, he can respond in Japanese to a question posed
in English, or vice versa. The XML Extender made it possible for DB2 to
pronounce XML, if you will. But DB2 9 understands XML.
Your Job: Bridge the Gap
There may well be an XML-related knowledge gap in your organization. You know about DB2 9’s groundbreaking XML-supporting capabilities, but you may not know how your organization is using XML today or planning to use XML in the future. Some of your colleagues who are application developers and architects may know how your organization is using XML, but they don’t know anything about the advanced XML support provided by DB2 9. Do you see where I’m going with this? Talk to some of your senior application development people. Find out what they’re doing with XML, and let them know what DB2 9 can do with XML data. You could end up making their lives easier and their applications faster. They might buy you lunch as a thank-you.
Hey, it could happen.
Robert Catterall
is a Director of Engineering at CheckFree Corporation. As part of the
Company's Technology Strategy and Planning group, Robert works to establish
corporate-wide standards for the use of information technology in CheckFree
applications and systems. Robert is a past president of the International DB2
Users Group (IDUG), and a member of IDUG's Speakers Hall of Fame. He has a
Bachelor's Degree from Rice University and an MBA from Southern Methodist
University.