Validating XML Documents in .NET

Jul 29, 11:00 pm

Article Author: Dan Wahlin

This article is part of Dan Wahlin’s ‘XML Support in ASP.NET’ Guru week

XML represents an excellent mechanism for exchanging data between distributed systems due to its ability to describe data and maintain its structure during the exchange. Because of XML’s extensible nature it’s crucial that the different systems and applications involved in the exchange agree upon the structure of the XML document so that data can be extracted and used appropriately.

The process of checking an XML document to ensure that it follows specific guidelines is referred to as "validation". This second article in the XML Support in .NET series will explore the different alternatives for validating XML documents and detail how to programmatically validate XML documents in the .NET framework. Before jumping into a discussion on XML validation support in .NET, let’s examine how to create documents that can be used to validate XML.

How can XML Documents be Validated?

There are two major players in the world of XML document validation: Document Type Definitions (DTDs) and XML Schemas. While both can be used to validate XML, each brings with it a definite set of pros and cons, as you’ll see in the next few sections.

Document Type Definitions

DTDs have been around for many years and evolved from XML’s parent language called Standard Generalized Markup Language (SGML). Although relatively old compared to many Web technologies, DTDs do an excellent job of describing the structure that an XML document should follow. DTDs allow you to define important pieces of an XML document such as elements, attributes, and entities. Although I won’t present a complete discussion of creating DTDs in this article, Listing 1 shows a simple DTD that can be used to validate the XML document shown in Listing 2.


<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Customers (Customer*)>
<!ELEMENT Customer (CompanyName, ContactName, ContactTitle, Address, City, Zip, Phone, Fax)>
<!ATTLIST Customer CustomerID CDATA #REQUIRED
>
<!ELEMENT CompanyName (#PCDATA)>
<!ELEMENT Address (#PCDATA)>
<!ELEMENT City (#PCDATA)>
<!ELEMENT ContactName (#PCDATA)>
<!ELEMENT ContactTitle (#PCDATA)>
<!ELEMENT Fax (#PCDATA)>
<!ELEMENT Phone (#PCDATA)>
<!ELEMENT Zip (#PCDATA)>

Listing 1. A DTD can be used to validate the structure of an XML document.




<?xml version="1.0"?>
<!-- The following statement references the DTD in Listing 1 -->
<!DOCTYPE Customers SYSTEM "Customers.dtd">
<Customers>
    <Customer CustomerID="32">
        <CompanyName>ACME Corp</CompanyName>
        <ContactName>John Doe</ContactName>
        <ContactTitle>Sales Representative</ContactTitle>
        <Address>1234 Anywhere St.</Address>
        <City>Phoenix</City>
        <Zip>85244</Zip>
        <Phone>123-123-1234</Phone>
        <Fax>123-123-1235</Fax>
    </Customer>
</Customers>

Listing 2. An XML document contains metadata that describes the data it contains. This example describes customer data by using different elements and an attribute. The XML document can be validated against the DTD shown in Listing 1.



Referring back to Listing 1 you’ll see that it contains several element definitions and describes how each of the elements can be nested. For example, the DTD specifies that the Customers root element can have 0 or more children named Customer (as determined by the * character):


<!ELEMENT Customers (Customer*)>

The data between the parentheses is referred to as the Content Model for the element and determines what can be between the element’s start and end tags.

The Customer element acts as the parent for 8 different child elements as shown in the following DTD element definition:


<!ELEMENT Customer (CompanyName, ContactName, 
          ContactTitle, Address, City, Zip, 
          Phone, Fax)>

The order that these child elements appear within the Customer element is very significant. The CompanyName element must appear first and Fax must appear last with the other elements (ContactName, ContactTitle, etc.) appearing in order between these two elements. The attribute on the Customer element named CustomerID is also defined in the DTD by using the ATTLIST keyword:


<!ATTLIST Customer
      CustomerID CDATA #REQUIRED
>

This definition says that the attribute contains character data (CDATA), which is essentially an ordinary string value, and that it is required to appear in the XML document.

Looking through the DTD element and attribute definitions in Listing 1 you’ll quickly notice that although they do an excellent job of outlining the structure that the XML document must follow, they do a very poor job of describing the different data types that the elements and single attribute can contain. DTDs don’t support data types such as integer, float, date, etc. In fact, elements in DTDs can only contain child elements, parsed character data (PCDATA), or a combination of both. PCDATA is similar to a primitive string data type in many programming languages. This lack of data type support can present a rather large problem when distributed systems exchange XML documents as there is no way to ensure that valid data is being received using DTDs. As the XML documents are parsed and imported into data stores, such as relational databases that require specific data types, processes tend to fail when incorrect data is inserted or updated.

Aside from their lack of data type support, DTDs also have a few other flaws, including a lack of support for XML namespaces and the fact that they are not themselves written using the syntax rules outlined in the XML specification. Even with these flaws, DTDs are still used extensively throughout the world to validate an XML document’s structure, and many validating XML parsers are only capable of validating against DTDs.

XML Schemas

With the W3C’s release of the XML Schema specification, a new and more powerful way of validating XML documents is now available. The .NET platform contains excellent support for Schemas and uses them not only for XML document validation, but also for working with relational data. I’ll provide more information about how Schemas are used in ADO.NET in Thursday’s article.

XML Schemas offer many advantages over DTDs including support for a robust set of data types, XML namespaces, and the XML syntax rules. Plus, XML Schemas are quite customizable, which allows for the creation of custom data types. There are a few cons associated with XML Schemas, however: they are arguably more complex to create than DTDs and are generally more verbose since they follow the XML syntax rules.

Due to their complexity, a complete discussion won’t be presented here. However, Listing 3 contains a sample XML Schema that can be used to validate the XML document in Listing 4.


<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
    <xs:element name="Customers">
        <xs:complexType>
            <xs:sequence>
                <xs:element ref="Customer"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="Customer">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="CompanyName" type="xs:string"/>
                <xs:element name="ContactName" type="xs:string"/>
                <xs:element name="ContactTitle" type="xs:string"/>
                <xs:element name="Address" type="xs:string"/>
                <xs:element name="City" type="xs:string"/>
                <xs:element name="Zip" type="xs:int"/>
                <xs:element name="Phone" type="xs:string"/>
                <xs:element name="Fax" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="CustomerID" type="xs:int" use="required"/>
        </xs:complexType>
    </xs:element>
</xs:schema>

Listing 3. XML Schemas offer many advantages over DTDs including support for data types, namespaces, and the XML syntax rules.




<?xml version="1.0"?>
<Customers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="Customers.xsd">
    <Customer CustomerID="32">
        <CompanyName>ACME Corp</CompanyName>
        <ContactName>John Doe</ContactName>
        <ContactTitle>Sales Representative</ContactTitle>
        <Address>1234 Anywhere St.</Address>
        <City>Phoenix</City>
        <Zip>85244</Zip>
        <Phone>123-123-1234</Phone>
        <Fax>123-123-1235</Fax>
    </Customer>
</Customers>

Listing 4. XML documents can reference an existing Schema by using the noNamespaceSchemaLocation or schemaLocation attributes. This example uses the noNamespaceSchemaLocation attribute since no namespaces are used in the document.



Looking at the Schema in Listing 3 you’ll see that it uses a specific namespace prefix (xs) to identify elements and data types defined in the Schema specification. Elements are defined using the element tag, while attributes are defined using the attribute tag as shown in the two definitions below:


<xs:element name="CompanyName" type="xs:string"/>

<xs:attribute name="CustomerID" type="xs:int" use="required"/>

Within an element, a complexType element can be found:


<xs:element name="Customers">
    <xs:complexType>
        <xs:sequence>
            <xs:element ref="Customer"/>
        </xs:sequence>
    </xs:complexType>
</xs:element>

The complexType element is roughly equivalent to the parentheses used to define the content model of an element in DTDs (refer back to Listing 1). The start complexType tag equates with the opening parenthesis and the end complexType tag equates with the closing parenthesis. Within the complexType tag you may find several different child elements. In the above example the sequence tag is used to specify the order of child elements of the Customers parent element. The sequence tag replaces the comma used in DTDs.

To see this more clearly, the following DTD definition defines an element called person that can have child elements named firstName, lastName, and birthDate. The order that the child elements must appear in is determined by reading from left to right within the parentheses, with the comma character acting as the separator:


<!ELEMENT person (firstName,lastName,birthDate)>
<!ELEMENT firstName (#PCDATA)>
<!ELEMENT lastName (#PCDATA)>
<!ELEMENT birthDate (#PCDATA)>

This definition can be rewritten to look like the following using the XML Schema language:


<xs:element name="person">
    <!--
        The complexType element is similar to the 
        "(" and ")" characters in DTDs 
    -->
    <xs:complexType>
        <!--
           The sequence element is similar to the 
           comma in DTDs as it determines the order
           in which the child elements must appear
        -->
        <xs:sequence>
            <xs:element name="firstName" type="xs:string" />
            <xs:element name="lastName" type="xs:string" />
            <xs:element name="birthDate" type="xs:date" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

Although the Schema version is much more verbose than the DTD version, Schemas have the major benefit of being able to use a variety of data types as shown in Figure 1.

Figure 1. The XML Schema specification defines many different data types that can be used to validate data contained within an XML document.



The data types shown in Figure 1 are a huge improvement over the extremely limited set of data types in DTDs. Instead of defining the Zip element as simply containing PCDATA, you can now ensure that the data within the element is an integer:


<xs:element name="Zip" type="xs:int" />

Dates can also be validated to ensure that a correct date is being used. For example, the element named birthDate shown earlier would have to contain a correct date in order to be considered valid since its type attribute contains a value of xs:date. The following would not be valid based upon the element’s definition in the Schema, because February never has 30 days (notice that the format for valid dates is year-month-day):


<birthDate>2002-02-30</birthDate>

There may be situations when you need to restrict or extend the Schema types shown in Figure 1. For example, the Zip element shown earlier in Listing 4 may need to contain data that has 5 numeric digits followed by a dash (-) character and 4 more numeric digits (zip+4). This custom data type can be created using the simpleType element:


<xs:simpleType name="stZipCode">
      <xs:restriction base="xs:string">
            <xs:pattern value="\d{5}-\d{4}" />
      </xs:restriction>
</xs:simpleType>

Notice that a regular expression is being used within the value attribute of the xs:pattern element. This allows for a great deal of flexibility and control over the data contained within an element or attribute. To assign the simpleType to the Zip element the following syntax can be used:


<xs:element name="Zip" type="stZipCode" />
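
To see the pattern facet at work, here is a minimal C# sketch that checks two sample Zip values against the stZipCode type. It relies on the XmlValidatingReader class covered later in this article. The ZipPatternDemo class, the one-element Schema, and the in-memory strings are assumptions made purely for illustration, and compilers targeting .NET 2.0 or later will flag XmlValidatingReader as obsolete.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ZipPatternDemo
{
    // Hypothetical Schema: a lone Zip element typed with the stZipCode pattern.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:simpleType name='stZipCode'>
    <xs:restriction base='xs:string'>
      <xs:pattern value='\d{5}-\d{4}' />
    </xs:restriction>
  </xs:simpleType>
  <xs:element name='Zip' type='stZipCode' />
</xs:schema>";

    // Returns the validation messages raised for the XML (an empty list means valid).
    public static List<string> Validate(string xml)
    {
        List<string> errors = new List<string>();
        XmlValidatingReader vReader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        try
        {
            vReader.ValidationType = ValidationType.Schema;
            // Supply the Schema from a string instead of a file on disk.
            vReader.Schemas.Add(null, new XmlTextReader(new StringReader(Xsd)));
            vReader.ValidationEventHandler +=
                (sender, args) => errors.Add(args.Message);
            while (vReader.Read()) { }
        }
        finally
        {
            vReader.Close();
        }
        return errors;
    }

    public static void Main()
    {
        // A zip+4 value matches the pattern; a bare five-digit value does not.
        Console.WriteLine("85244-1234 errors: " + Validate("<Zip>85244-1234</Zip>").Count);
        Console.WriteLine("85244 errors: " + Validate("<Zip>85244</Zip>").Count);
    }
}
```

Collecting messages in a list rather than writing to a page control keeps the sketch independent of ASP.NET.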

By using XML Schemas, an XML document’s structure and data can be validated quite thoroughly in order to determine if the document is appropriate to use in an application. There is much more to Schemas than can be covered in this article, but you have seen some of the different features that Schemas offer for validating XML. Now let’s take a look at how to programmatically validate XML documents in .NET.

Validating XML in the .NET Platform

Now that you’ve seen an overview of the two main ways to validate XML, let’s examine how XML can be programmatically validated against a DTD or Schema using the .NET platform. Looking in the System.Xml namespace you’ll find two classes that can be used for this purpose named XmlTextReader and XmlValidatingReader. Both classes provide a forward-only, cursor-style model for reading a stream of XML tokens.

The XmlTextReader can be used to read XML quickly and efficiently but cannot validate XML against a DTD or Schema unless it is used along with the XmlValidatingReader class. Although you’ll have the opportunity to see the basics of working with the XmlTextReader class here, more details about using it will be provided in a later article in this series.

Before seeing an example of validating XML, it’s important that you understand how the XmlTextReader and XmlValidatingReader work together. First, the XmlTextReader’s constructor must be called and a path to the XML document to validate must be passed. This class has several different versions of the constructor but the following one will work for our purposes:


[C#]
public XmlTextReader(string url);

Once the XmlTextReader object is instantiated and the XML document to parse and validate is loaded into it, the XmlTextReader object can be passed to the XmlValidatingReader’s constructor:


XmlValidatingReader vReader = new XmlValidatingReader(reader);

Before the XmlValidatingReader’s Read() method is called and the XML document is parsed, its ValidationEventHandler event must be wired up to a ValidationEventHandler delegate, which is found in the System.Xml.Schema namespace.

If you’re not familiar with delegates, they allow events to be hooked up to event handler methods. By doing this, any problems found during the validation process can be passed to a central event handler, which can generate an error message, perform logging functionality, or do some other task. The following code demonstrates how to hook the XmlValidatingReader up to an event handler in C#:


vReader.ValidationEventHandler += 
    new ValidationEventHandler(ValidationCallBack);

The XmlValidatingReader’s ValidationType property should also be set to a valid enumeration value. Acceptable values include ValidationType.Auto, ValidationType.DTD, ValidationType.Schema, ValidationType.None, and ValidationType.XDR. When you know you’ll be validating against an XML Schema you can use the following code to let the XmlValidatingReader know that a Schema will be used:


vReader.ValidationType = ValidationType.Schema;

An example of tying all of these steps together to validate an XML document against an XML Schema is shown in Listing 5. This example is contained within a file named ValidationSampleSchema.aspx.cs in the downloadable code.


//Requires: using System.Xml; using System.Xml.Schema;

//Assume document is valid to start
protected bool valid = true;

private void Page_Load(object sender, System.EventArgs e) {
    string xmlPath = Server.MapPath("XML/Customers(Schema).xml");
    XmlTextReader reader = null;
    XmlValidatingReader vReader = null;
    try {
        //Load XML into XmlTextReader
        reader = new XmlTextReader(xmlPath);

        //Load XmlTextReader into XmlValidatingReader
        vReader = new XmlValidatingReader(reader);

        //Set the validation type
        vReader.ValidationType = ValidationType.Schema;

        //Hook up the XmlValidatingReader’s
        //ValidationEventHandler to a call
        //back method named ValidationCallBack
        vReader.ValidationEventHandler +=
            new ValidationEventHandler(this.ValidationCallBack);

        //Read through the XML document by calling
        //the Read() method
        while (vReader.Read()) {}

        //Check boolean field named valid (located at top of code)
        if (this.valid)
            this.lblOutput.Text = "Validation was Successful!";
    }
    catch (System.Exception exp) {
        //Report problems such as a missing file or XML
        //that is not well-formed instead of hiding them
        this.valid = false;
        this.lblOutput.Text = "Validation failed! Error is: " + exp.Message;
    }
    finally {
        //Close readers (either may still be null if an
        //exception was thrown before it was created)
        if (vReader != null) vReader.Close();
        if (reader != null) reader.Close();
    }
}

public void ValidationCallBack(object sender, ValidationEventArgs args) {
    this.lblOutput.Text = "Validation failed! " + "Error is: " + args.Message;
    this.valid = false;
}

Listing 5. By combining the XmlTextReader and XmlValidatingReader classes, XML documents can be validated against DTDs and XML Schemas. This example demonstrates validating against an XML Schema.
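
Listing 5 validates against a Schema; the same steps apply to a DTD once ValidationType.DTD is selected. The sketch below is a hedged illustration rather than part of the downloadable code. The DtdOrderDemo class, the trimmed two-element DTD embedded as an internal subset, and the explicit DtdProcessing setting (a property available on XmlTextReader in .NET Framework 4.0 and later) are all assumptions chosen so that it can run without external files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DtdOrderDemo
{
    // Hypothetical, shortened Customers DTD declared inline in the document.
    const string DocType = @"<!DOCTYPE Customers [
  <!ELEMENT Customers (Customer*)>
  <!ELEMENT Customer (CompanyName, ContactName)>
  <!ATTLIST Customer CustomerID CDATA #REQUIRED>
  <!ELEMENT CompanyName (#PCDATA)>
  <!ELEMENT ContactName (#PCDATA)>
]>";

    // Validates the given root element markup against the inline DTD and
    // returns every message passed to the validation callback.
    public static List<string> Validate(string body)
    {
        List<string> errors = new List<string>();
        XmlTextReader reader = new XmlTextReader(new StringReader(DocType + body));
        reader.DtdProcessing = DtdProcessing.Parse; // make sure the DTD is read
        XmlValidatingReader vReader = new XmlValidatingReader(reader);
        try
        {
            vReader.ValidationType = ValidationType.DTD;
            vReader.ValidationEventHandler +=
                (sender, args) => errors.Add(args.Message);
            while (vReader.Read()) { }
        }
        finally
        {
            vReader.Close();
        }
        return errors;
    }

    public static void Main()
    {
        // Children in the declared order.
        string ordered = "<Customers><Customer CustomerID=\"32\">" +
            "<CompanyName>ACME Corp</CompanyName>" +
            "<ContactName>John Doe</ContactName></Customer></Customers>";
        // Same children, but ContactName appears before CompanyName.
        string swapped = "<Customers><Customer CustomerID=\"32\">" +
            "<ContactName>John Doe</ContactName>" +
            "<CompanyName>ACME Corp</CompanyName></Customer></Customers>";

        Console.WriteLine("Ordered document errors: " + Validate(ordered).Count);
        foreach (string message in Validate(swapped))
            Console.WriteLine("Swapped document: " + message);
    }
}
```

The swapped document is well-formed XML, so only the validating reader, not the plain XmlTextReader, notices that the content model was violated.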



If an error occurs during validation of the XML document, the ValidationCallBack method will be called and a ValidationEventArgs object will be passed in as a parameter. This object exposes several important properties that can be used to access error information such as Exception, Message, and Severity. The Exception property returns an XmlSchemaException object that can be used to access the line and column position (through the LineNumber and LinePosition properties) where the error occurred in the XML document. This can be helpful in locating the error and can be used in logging processes.
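
As a rough sketch of how a callback might use these properties, the following C# example records the severity, line, and position of each problem. The ErrorDetailsDemo class, the small person Schema, and the sample document containing the invalid date discussed earlier are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ErrorDetailsDemo
{
    // Hypothetical Schema: a person element with a single xs:date child.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='person'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='birthDate' type='xs:date' />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Sample document with an impossible date (February 30).
    public const string Sample = @"<person>
  <birthDate>2002-02-30</birthDate>
</person>";

    // Returns one formatted entry per validation event.
    public static List<string> Validate(string xml)
    {
        List<string> report = new List<string>();
        XmlValidatingReader vReader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        try
        {
            vReader.ValidationType = ValidationType.Schema;
            vReader.Schemas.Add(null, new XmlTextReader(new StringReader(Xsd)));
            vReader.ValidationEventHandler += (sender, args) =>
            {
                // The Exception property carries the location details.
                XmlSchemaException ex = args.Exception;
                report.Add(args.Severity + " at line " + ex.LineNumber +
                    ", position " + ex.LinePosition + ": " + args.Message);
            };
            while (vReader.Read()) { }
        }
        finally
        {
            vReader.Close();
        }
        return report;
    }

    public static void Main()
    {
        foreach (string entry in Validate(Sample))
            Console.WriteLine(entry);
    }
}
```

The same formatted string could just as easily be appended to a log file instead of being written to the console.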

Dynamically Assigning Schemas to XML Documents

Listing 5 demonstrates validating an XML document that specifically references a Schema by using the noNamespaceSchemaLocation attribute (refer back to Listing 4). What if you’d like to validate an XML document against a Schema that is not actually referenced within the XML document? This can be done by using the Schemas property of the XmlValidatingReader. This property can contain a collection of XmlSchema objects that can be added using the collection’s Add() method. The Add() method is overloaded and can accept a variety of input parameters from strings to XmlSchema objects (see the SDK for more details). The following code shows how to add the Schema shown in Listing 3 into the collection so that the XML document can dynamically be validated against the Schema.


vReader.Schemas.Add(null,Server.MapPath("Schemas/Customers.xsd"));

This code passes null as the first argument to indicate that the Schema does not use a targetNamespace (targetNamespace is a Schema attribute that identifies the namespace of the elements and attributes the Schema describes), and passes the physical path to the Schema as the second argument. The complete code for this example can be found in the file named ValidationExampleDynamicSchema.aspx.cs in the downloadable code. Although an XML Schema is used in the example, XDR (XML-Data Reduced) schemas can be used as well.
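
The same idea can be sketched outside of an ASP.NET page. In the example below, which is an illustration and not the code from ValidationExampleDynamicSchema.aspx.cs, the XML document carries no Schema reference at all and the Schema is handed to the Schemas collection through the Add() overload that accepts an XmlReader. The DynamicSchemaDemo class and the shortened Customers Schema are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DynamicSchemaDemo
{
    // Hypothetical, shortened version of the Customers Schema.
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'
    elementFormDefault='qualified'>
  <xs:element name='Customers'>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref='Customer' />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name='Customer'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='CompanyName' type='xs:string' />
      </xs:sequence>
      <xs:attribute name='CustomerID' type='xs:int' use='required' />
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Validates xml against the supplied Schema text; neither string
    // needs to mention the other.
    public static List<string> Validate(string xml, string xsd)
    {
        List<string> errors = new List<string>();
        XmlValidatingReader vReader =
            new XmlValidatingReader(new XmlTextReader(new StringReader(xml)));
        try
        {
            vReader.ValidationType = ValidationType.Schema;
            // null: the Schema declares no targetNamespace
            vReader.Schemas.Add(null, new XmlTextReader(new StringReader(xsd)));
            vReader.ValidationEventHandler +=
                (sender, args) => errors.Add(args.Message);
            while (vReader.Read()) { }
        }
        finally
        {
            vReader.Close();
        }
        return errors;
    }

    public static void Main()
    {
        // CustomerID must be an xs:int, so "abc" should be reported.
        string xml = "<Customers><Customer CustomerID=\"abc\">" +
            "<CompanyName>ACME Corp</CompanyName></Customer></Customers>";
        foreach (string message in Validate(xml, Xsd))
            Console.WriteLine(message);
    }
}
```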

Expanding Entities with the XmlValidatingReader

Before concluding our discussion on validating XML documents, let’s take a look at another useful task the XmlValidatingReader can perform that doesn’t actually involve validation. Although DTDs are typically used to validate an XML document, they can also be used to hold entity definitions. An entity is simply a placeholder for frequently used data and can be compared to include files, macros, variables, etc. Entities can be defined within a DTD using the ENTITY keyword as shown below:


<!ENTITY address "1234 Anywhere St.">

Once defined, an entity reference can be added in multiple places within an XML document by prefixing the entity name with an ampersand character and following the name with a semicolon character:


<Address>&address;</Address>

Once parsed, the entity in the XML document will be "expanded" so that the proper data (1234 Anywhere St. in this example) shows up in the XML as shown below:


<Address>1234 Anywhere St.</Address>

This entity expansion involves the XmlValidatingReader even though the DTD may only be used to define entities rather than defining the elements and attributes that an XML document can contain. Listing 7 shows the necessary code to expand entities defined within a DTD that are referenced within an XML document. The code contains several comments to explain what is happening.


private void Page_Load(object sender,System.EventArgs e) {
    Response.ContentType = "text/xml";
    XmlTextReader reader = null;
    XmlValidatingReader vReader = null;

    try {
        reader = new XmlTextReader(
            Server.MapPath("XML/Customers(Entities).xml"));
        vReader = new XmlValidatingReader(reader);

        //Set ValidationType to none since we don’t
        //want to validate but do want to expand entities
        vReader.ValidationType = ValidationType.None;

        //Set EntityHandling property to ExpandEntities
        //so that &address; gets expanded (ExpandCharEntities
        //expands character entities only and would return
        //&address; as an entity reference node)
        vReader.EntityHandling = EntityHandling.ExpandEntities;

        //Load XmlValidatingReader into XmlDocument and
        //then write out the contents to the web page
        //to show that the entities were indeed expanded
        XmlDocument doc = new XmlDocument();
        doc.Load(vReader);
        doc.Save(Response.Output);
    }
    catch {}
    finally {
        //Close readers (either may still be null)
        if (vReader != null) vReader.Close();
        if (reader != null) reader.Close();
    }
}

Listing 7. Although the XmlValidatingReader is normally used to validate XML documents, it can also be used for the sole purpose of expanding entities. The complete code for this listing is located in a file named ExpandEntities.aspx.cs.
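
For readers who want to try entity expansion without a web page or a separate DTD file, here is a minimal console-style sketch of the same technique. The EntityDemo class and the in-memory document, which declares the address entity in an internal DTD subset, are assumptions made for illustration, as is the DtdProcessing setting (available on XmlTextReader in .NET Framework 4.0 and later).

```csharp
using System;
using System.IO;
using System.Xml;

public static class EntityDemo
{
    // Hypothetical document: the entity is declared in an internal subset.
    public const string Sample = @"<!DOCTYPE Customer [
  <!ENTITY address ""1234 Anywhere St."">
]>
<Customer><Address>&address;</Address></Customer>";

    // Loads the XML through an XmlValidatingReader that expands entities
    // and returns the markup of the root element.
    public static string Expand(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.DtdProcessing = DtdProcessing.Parse; // the DTD must be read
        XmlValidatingReader vReader = new XmlValidatingReader(reader);
        try
        {
            // No validation is wanted, only entity expansion.
            vReader.ValidationType = ValidationType.None;
            vReader.EntityHandling = EntityHandling.ExpandEntities;

            XmlDocument doc = new XmlDocument();
            doc.Load(vReader);
            return doc.DocumentElement.OuterXml;
        }
        finally
        {
            vReader.Close();
        }
    }

    public static void Main()
    {
        Console.WriteLine(Expand(Sample));
    }
}
```

Returning only the root element’s markup leaves out the DOCTYPE declaration, which makes it easier to see whether the reference was replaced by its text.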

Building a Reusable Validation Object

Because the process of validating XML documents is largely a repetitive process, a reusable object can be built to encapsulate much of the functionality needed to validate XML against DTDs or Schemas. Listing 8 puts together many of the different concepts presented earlier in the article to build a class named Validator. A live version of the Validator class in action can be found at http://www.xmlforasp.net/content.aspx?content=SchemaValidator.

Although all of the specifics will not be covered here, the Validator class demonstrates more advanced features of XML validation such as using the XmlParserContext class to dynamically assign DTDs. The class also provides logging capabilities so that validation errors can be tracked easily.


public class Validator { 
    bool _valid;      //Track if XML is valid
    bool _logError;   //Track if we log any errors
    string _logFile;  //Track logfile location
    string _validationErrors = String.Empty;

    XmlTextReader xmlReader = null;
    XmlValidatingReader vReader = null;

    //The Validate() method accepts the XML to validate, a schema
    //collection (if needed), an array containing info needed to
    //dynamically assign a DTD, a Boolean indicating if any errors
    //should be logged, and the path to the log file. It returns
    //a custom ValidationStatus object
    public ValidationStatus Validate(object xml,
        XmlSchemaCollection schemaCol, string[] dtdInfo,
        bool logError,string logFile) {

        _logError = logError;
        _logFile = logFile;
        _valid = true;
        try {
            //Determine how XML document to validate was passed
            if (xml is StringReader)
                xmlReader = new XmlTextReader((StringReader)xml);
            if (xml is String)
                xmlReader = new XmlTextReader((String)xml);

            //Handle dynamically adding DTD reference
            if (dtdInfo != null && dtdInfo.Length > 0) {
                //Use XmlParserContext to assign DTD root name plus
                //DTD definitions
                XmlParserContext context =
                    new XmlParserContext(null,null,dtdInfo[0],
                        "",dtdInfo[1],"",dtdInfo[1],"",
                        XmlSpace.Default);
                xmlReader.MoveToContent();
                vReader =
                    new XmlValidatingReader(xmlReader.ReadOuterXml(),
                        XmlNodeType.Element,context);
                vReader.ValidationType = ValidationType.DTD;
            } else { //Handle other cases
                vReader = new XmlValidatingReader(xmlReader);
                vReader.ValidationType = ValidationType.Auto;
                if (schemaCol != null) {
                    vReader.Schemas.Add(schemaCol);
                }
            }
            vReader.ValidationEventHandler +=
                new ValidationEventHandler(this.ValidationCallBack);
            // Parse through XML
            while (vReader.Read()){}
        } catch {
            _valid = false;
        } finally { //Close our readers
            if (xmlReader != null) {
                xmlReader.Close();
            }
            if (vReader != null) {
                vReader.Close();
            }
        }
        ValidationStatus status = new ValidationStatus();
        status.Status = _valid;
        status.ErrorMessages = _validationErrors;
        return status;
    }

    private void ValidationCallBack(object sender,
        ValidationEventArgs args) {

        _valid = false; //hit callback so document has a problem
        DateTime today = DateTime.Now;
        StreamWriter writer = null;
        try {
            if (_logError) {
                writer = new StreamWriter(_logFile,true,Encoding.ASCII);
                writer.WriteLine("Validation error in XML: ");
                writer.WriteLine();
                writer.WriteLine(args.Message + " " + today.ToString());
                writer.WriteLine();
                if (xmlReader.LineNumber > 0) {
                    writer.WriteLine("Line: "+ xmlReader.LineNumber +
                        " Position: " + xmlReader.LinePosition);
                }
                writer.WriteLine();
                writer.Flush();
            } else {
                _validationErrors = args.Message + " Line: " +
                    xmlReader.LineNumber + " Column:" +
                    xmlReader.LinePosition + "\n\n";
            }
        } catch {}
        finally {
            if (writer != null) {
                writer.Close();
            }
        }
    }
}

public struct ValidationStatus {
    public bool Status;
    public string ErrorMessages;
}

Listing 8. The Validator class encapsulates several different features of XML validation in the .NET platform (Validator.cs).



Listing 9 shows how to leverage the Validator class to create an online validation utility that allows an end user to input an XML document and Schema into two text boxes. Any errors found when the XML document is validated are reported.


private void btnSubmit_Click(object sender, System.EventArgs e) {
    try {
        XmlDocument doc = new XmlDocument();
        //Load XML document
        doc.LoadXml(this.txtXml.Text);

        //Load XML Schema
        XmlSchema schema = XmlSchema.Read(
            new StringReader(this.txtSchema.Text),null);
        XmlSchemaCollection schemaCol = new XmlSchemaCollection();
        schemaCol.Add(schema);

        Validator validator = new Validator();
        ValidationStatus status = validator.Validate(
            new StringReader(doc.OuterXml),schemaCol,null,false,null);
        if (status.Status) {
            this.lblStatus.ForeColor = Color.Navy;
            this.lblStatus.Text = "Validation of XML was SUCCESSFUL!";
        } else {
            this.lblStatus.ForeColor = Color.Red;
            this.lblStatus.Text = "Validation of the XML Document " +
                "failed! Error message(s):<p /> " + status.ErrorMessages;
        }
    }
    catch (Exception exp) {
        this.lblStatus.ForeColor = Color.Red;
        this.lblStatus.Text = exp.Message;
    }
}

Listing 9. By using the Validator class shown in Listing 8, XML documents can easily be validated against DTDs or Schemas (SchemaValidator.aspx).



Figure 2 shows the output of the utility:

Figure 2. The Validator utility is an online tool that allows people to validate XML documents against Schemas.

Conclusion

Validation of XML documents is an important piece of the data exchange process. Although both DTDs and XML Schemas can be used to validate an XML document, Schemas provide many advantages, such as providing support for namespaces and data types.

The .NET platform contains several different classes that can be used to validate XML documents against DTDs or XML Schemas. By leveraging these classes you can ensure that XML data is valid and catch potential errors before the data touches other parts of an application.

Article Information
Author Dan Wahlin
Chief Technical Editor John R. Chapman
Project Manager Helen Cuthill
Reviewers Andy Krowczyk, Saurabh Nandu

