Quick ‘n Dirty: Comparing two XML documents in .NET

Yesterday I was tasked with comparing two XML documents for equality (in .NET). The catch is that they didn’t have to be *exactly* equal: a few fields (timestamps, GUIDs, etc.) were allowed to differ and the documents would still be considered a match. My first thought was to parse and recursively walk the first document, grab the XPath for each node I ran into and then look for the corresponding node in the second document. Then I’d walk the second document and search the first to make sure the second didn’t contain anything extra.

After a little Googling, I found a tool that did a ton of the work for me: the Microsoft XML Diff and Patch tool. It’s exposed as a .NET library available on NuGet. It’s pretty flexible and includes options to ignore namespaces, element order, whitespace and more, but it doesn’t let you specify elements to skip. The result is a boolean indicating whether the documents match under the options you chose. The tool also writes a third XML document (a “diffgram”) to an XmlWriter, describing the changes needed to make the documents equal. This isn’t an XSLT; it’s an XML document that can be fed back into the tool to patch one document so that it matches the other.

This is a Quick ‘n Dirty, so I won’t go into tons of depth. Instead I’ll include some base code to get things up and running and show you how I handled the fields I wanted to ignore. First, the comparison code, given two strings of XML:

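// Namespaces this snippet relies on:
// using System.IO; using System.Text; using System.Xml; using Microsoft.XmlDiffPatch;

// Ignore differences that don't matter for this comparison.
XmlDiff xmlDiff = new XmlDiff(XmlDiffOptions.IgnoreChildOrder | XmlDiffOptions.IgnoreComments | XmlDiffOptions.IgnoreNamespaces | XmlDiffOptions.IgnoreWhitespace | XmlDiffOptions.IgnoreXmlDecl);

// The diffgram (the "diff" XML) is collected here.
StringBuilder diffgramStringBuilder = new StringBuilder();
bool xmlComparisonResult = false;
using (StringReader legacySr = new StringReader(legacyResultXml), nextgenSr = new StringReader(serializedNextgenResponse))
using (XmlReader legacyReader = XmlReader.Create(legacySr), nextgenReader = XmlReader.Create(nextgenSr))
using (StringWriter sw = new StringWriter(diffgramStringBuilder))
using (XmlWriter diffgramWriter = XmlWriter.Create(sw))
{
    // True only if the documents match under the options above.
    xmlComparisonResult = xmlDiff.Compare(nextgenReader, legacyReader, diffgramWriter);
}
// The writers are disposed (and flushed) here, so diffgramStringBuilder is now complete.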

At this point we have the raw comparison result in xmlComparisonResult and the “diff” XML document in diffgramStringBuilder. The diffgram is a little hard to read because it doesn’t identify elements by name or XPath; instead it uses 1-based indexes into the document. Here’s a result:

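<?xml version="1.0" encoding="utf-16"?>
<xd:xmldiff version="1.0" srcDocHash="..." options="..." fragments="no" xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
    <xd:node match="1"> <!--This is the root element-->
        <xd:change match="@timestamp">2013-08-08T15:27:48.750</xd:change>
        <xd:change match="@guid">9BB80998-428D-4FD5-B6AB-34F6B469E3FD</xd:change>
        <xd:node match="1"> <!--The first child of the root, we'll call it CHILD1-->
            <xd:node match="3"> <!--The third child of CHILD1, we'll call it SUBCHILD3-->
                <xd:node match="1"> <!--The first child of SUBCHILD3, we'll call it SUBSUBCHILD1-->
                    <xd:add type="2" name="isactive">1</xd:add>
                </xd:node>
            </xd:node>
        </xd:node>
    </xd:node>
</xd:xmldiff>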

As you can see, the key to finding the elements that are different lies in the xd:node elements of the diff XML. The match attribute on each of these elements is the 1-based index of a child of the “current” element, starting at the root. Each difference appears in an xd:change, xd:add or xd:remove element somewhere in the diff. In this case, the root element has mismatches on timestamp and guid, and /CHILD1/SUBCHILD3/SUBSUBCHILD1 is missing a field named isactive.
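If you want element names instead of index chains, you can walk the source document with the same indexes. Here’s a minimal sketch of that idea. `DiffgramPaths.Resolve` is a hypothetical helper of my own naming, not part of the XmlDiff library, and it assumes each match value is a single number that counts element children only, in document order. Text nodes, or options like IgnoreChildOrder, might throw the numbering off, so treat it as a starting point.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper (not part of XmlDiff): turns a chain of 1-based
// xd:node match indexes into a readable element path.
public static class DiffgramPaths
{
    public static string Resolve(XDocument source, IEnumerable<int> matchIndexes)
    {
        var names = new List<string>();
        XContainer current = source;
        foreach (int index in matchIndexes)
        {
            // Assumption: only element children are counted, in document order.
            XElement child = current.Elements().ElementAtOrDefault(index - 1);
            if (child == null)
            {
                throw new ArgumentException("No element child at index " + index);
            }
            names.Add(child.Name.LocalName);
            current = child;
        }
        return "/" + string.Join("/", names);
    }
}
```

For a source document shaped like the example, the chain 1, 1, 3, 1 (the nested xd:node match values, outermost first) resolves to the root element’s name followed by /CHILD1/SUBCHILD3/SUBSUBCHILD1.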

Now we know what’s different, but the last piece of the puzzle is handling the fields that are OK to be different. The boolean result of the comparison is false because of the differences, but if isactive were set on that child element and only timestamp and guid were different, we’d want to consider this a match. Here’s the code I used to “whitelist” attributes by name. Note that my comparison doesn’t consider location in the document, but this is a Quick ‘n Dirty: it’s here to give you a quick idea of how to handle it should you need to do something similar.

// Namespaces this snippet relies on:
// using System.Collections.Generic; using System.Linq; using System.Xml.Linq;
bool xmlMatches = true;
List<string> failedXmlCompares = new List<string>();
if (!xmlComparisonResult)
{
    // Attribute names that are allowed to differ.
    List<string> skippedDiffs = new List<string>();
    skippedDiffs.Add("timestamp");
    skippedDiffs.Add("guid");

    // The documents were not precisely equal. That's OK in our case - there are a few fields that are allowed to differ.
    XDocument xdoc = XDocument.Parse(diffgramStringBuilder.ToString());
    // Collect every xd:change, xd:add and xd:remove element in the diffgram.
    var changes = xdoc.Descendants().Where(d => d.Name.LocalName == "change").ToList();
    changes.AddRange(xdoc.Descendants().Where(d => d.Name.LocalName == "add").ToList());
    changes.AddRange(xdoc.Descendants().Where(d => d.Name.LocalName == "remove").ToList());
    foreach (var single in changes)
    {
        foreach (var attribute in single.Attributes())
        {
            // "match" (on change/remove) and "name" (on add) both identify the affected field.
            if (attribute.Name != "match" && attribute.Name != "name")
            {
                continue;
            }

            // Attribute targets are written as "@name"; strip the "@" before comparing.
            string attributeValue = attribute.Value.Replace("@", "");
            if (skippedDiffs.Contains(attributeValue))
            {
                continue; // This difference is OK
            }

            if (!failedXmlCompares.Contains(attributeValue))
            {
                failedXmlCompares.Add(attributeValue);
            }
            xmlMatches = false;
        }
    }
}
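
If you do need to care about location, one option is to carry the xd:node index chain along while walking the diffgram and whitelist on the full location instead of the bare name. The sketch below is only that, a sketch: `DiffgramLocations`, `Collect` and `Unexpected` are hypothetical names of my own, not part of the XmlDiff library, and the location strings (like /1/@timestamp) are a made-up notation built from the match and name attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper (not part of XmlDiff): reports each difference in a
// diffgram together with the chain of xd:node match indexes leading to it.
public static class DiffgramLocations
{
    public static List<string> Collect(XElement parent, string path = "")
    {
        var found = new List<string>();
        foreach (XElement child in parent.Elements())
        {
            string kind = child.Name.LocalName;
            if (kind == "node")
            {
                // Descend, extending the path with this node's 1-based index.
                found.AddRange(Collect(child, path + "/" + (string)child.Attribute("match")));
            }
            else if (kind == "change" || kind == "add" || kind == "remove")
            {
                // change/remove carry "match"; add carries "name".
                string target = (string)child.Attribute("match") ?? (string)child.Attribute("name");
                found.Add(path + "/" + target);
            }
        }
        return found;
    }

    // Returns the differences whose location is not in the allowed set.
    public static List<string> Unexpected(string diffgramXml, ICollection<string> allowed)
    {
        XElement root = XDocument.Parse(diffgramXml).Root;
        return Collect(root).Where(location => !allowed.Contains(location)).ToList();
    }
}
```

With the example diffgram, allowing /1/@timestamp and /1/@guid would leave only the isactive addition under /1/1/3/1 in the returned list, so the documents would be a match whenever that list comes back empty.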
