I would like to speed up the process of validating a batch of XML files against the same single XML schema (XSD). The only restriction is that I am in a PHP environment.
My current problem is that the schema I would like to validate against includes the fairly complex XHTML schema of 2755 lines (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd). Even for very simple data this takes a long time (around 30 seconds per validation). As I have thousands of XML files in my batch, this doesn't scale well.
To validate the XML files I have tried both DOMDocument::schemaValidate and DOMDocument::schemaValidateSource from the standard PHP XML libraries.
I am thinking that the PHP implementation fetches the XHTML schema via HTTP and builds some internal representation (possibly a DOMDocument) and that this is thrown away when the validation is completed. I was thinking that some option for the XML-libs might change this behaviour to cache something in this process for reuse.
I've built a simple test setup which illustrates my problem:
test-schema.xsd
<xs:schema attributeFormDefault="unqualified"
    elementFormDefault="qualified"
    targetNamespace="http://myschema.example.com/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myschema="http://myschema.example.com/"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xs:import
        schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
        namespace="http://www.w3.org/1999/xhtml">
    </xs:import>
    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="MyHTMLElement">
                    <xs:complexType>
                        <xs:complexContent>
                            <xs:extension base="xhtml:Flow"></xs:extension>
                        </xs:complexContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
test-data.xml
<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://myschema.example.com/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
  <MyHTMLElement>
    <xhtml:p>This is an XHTML paragraph!</xhtml:p>
  </MyHTMLElement>
</Root>
schematest.php
<?php
$data_dom = new DOMDocument();
$data_dom->load('test-data.xml');
// Multiple validations using the schemaValidate method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidate: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidate('test-schema.xsd')) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}
// Loading schema into a string.
$schema_source = file_get_contents('test-schema.xsd');
// Multiple validations using the schemaValidateSource method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidateSource: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidateSource($schema_source)) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}
Running this schematest.php file produces the following output:
schemaValidate: Attempt #1 returns Valid! in 30 seconds.
schemaValidate: Attempt #2 returns Valid! in 30 seconds.
schemaValidate: Attempt #3 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.
Any help or suggestions on how to solve this issue are very welcome!
You can safely subtract 30 seconds from the timing values as overhead.
Remote requests to the W3C servers are delayed on purpose, because most libraries do not cache the documents (even though the HTTP headers suggest doing so). But read for yourself:
The W3C servers are slow to return DTDs. Is the delay intentional?
Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs and schema (DTD, XSD, ENT, MOD, etc.) from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site. We recommend HTTP caching or catalog files to improve performance.
The W3C tries to keep the number of requests low, which is understandable. PHP's DOMDocument is based on libxml, and libxml allows you to set an external entity loader. The whole Catalog support section of the libxml documentation is relevant in this case.
To solve the issue in question, set up a catalog.xml file:
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>
Save a copy of the two .xsd files, under the names given in that catalog file, next to the catalog (relative paths as well as absolute file:///... paths work if you prefer a different directory).
Then ensure your system's environment variable XML_CATALOG_FILES is set to the filename of the catalog.xml file. Once everything is set up, the validation just runs through:
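Fetching the two files can be scripted so that the W3C servers are hit only once per file. The following is a minimal sketch, not part of the original setup: the function name fetchSchemasOnce is hypothetical, and it assumes allow_url_fopen is enabled so that copy() can read from an HTTP URL.

```php
<?php
/**
 * Download each schema once into $dir; files that already exist are skipped.
 * Hypothetical helper: the name and the return value are illustrative only.
 *
 * @param array  $urlToFile map of remote URL => local file name
 * @param string $dir       existing, writable target directory
 * @return array list of local paths that were newly written
 */
function fetchSchemasOnce(array $urlToFile, string $dir): array
{
    $fetched = [];
    foreach ($urlToFile as $url => $name) {
        $target = rtrim($dir, '/') . '/' . $name;
        if (is_file($target)) {
            continue; // already cached locally, no request needed
        }
        // copy() accepts URLs as the source when allow_url_fopen is on (assumption).
        if (!@copy($url, $target)) {
            throw new RuntimeException("Could not fetch $url to $target");
        }
        $fetched[] = $target;
    }
    return $fetched;
}
```

Called with the two URLs from the catalog and the catalog's directory, a second run should find both files on disk and make no request at all.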
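If changing the system environment is not an option, the variable can be set from within the script itself. This is a sketch under assumptions: the helper name catalogEnvValue is made up for illustration, the path handling assumes a Unix-like system, and whether libxml picks up a value set with putenv() depends on the platform and on the call happening before libxml first resolves anything through a catalog. Because libxml treats XML_CATALOG_FILES as a whitespace-separated list, spaces in the path are percent-encoded.

```php
<?php
/**
 * Build a value for XML_CATALOG_FILES from a local catalog path.
 * Hypothetical helper; assumes Unix-style absolute paths.
 */
function catalogEnvValue(string $catalogPath): string
{
    $absolute = realpath($catalogPath);
    if ($absolute === false) {
        throw new InvalidArgumentException("Catalog file not found: $catalogPath");
    }
    // The variable holds a whitespace-separated list, so a raw space
    // would split one path into two entries.
    return 'file://' . str_replace(' ', '%20', $absolute);
}

// Assumption: catalog.xml sits next to this script.
$catalog = __DIR__ . '/catalog.xml';
if (is_file($catalog)) {
    // Must run before the first load/validate call that needs the catalog.
    putenv('XML_CATALOG_FILES=' . catalogEnvValue($catalog));
}
```

If the validation still takes around 30 seconds after this, setting the variable in the shell or web server configuration instead is the more reliable route.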
schemaValidate: Attempt #1 returns Valid! in 0 seconds.
schemaValidate: Attempt #2 returns Valid! in 0 seconds.
schemaValidate: Attempt #3 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 0 seconds.
If it still takes a long time, that is a sign that the environment variable does not point to the right location. I have covered setting the variable, along with diverse edge cases such as filenames containing spaces, in a blog post.
Alternatively, it is possible to create a simple external entity loader callback function that uses a URL => file mapping to the local file system, in the form of an array:
$mapping = [
     'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd'
         => 'schema/xhtml1-transitional.xsd',
     'http://www.w3.org/2001/xml.xsd'
         => 'schema/xml.xsd',
];
As this shows, I've placed a verbatim copy of these two XSD files into a subdirectory called schema. The next step is to use libxml_set_external_entity_loader to activate the callback function with the mapping. Files that already exist on disk are preferred and loaded directly. If the routine encounters something that is neither an existing file nor a mapped URL, a RuntimeException is thrown with a detailed message:
libxml_set_external_entity_loader(
    function ($public, $system, $context) use ($mapping) {
        if (is_file($system)) {
            return $system;
        }
        if (isset($mapping[$system])) {
            return __DIR__ . '/' . $mapping[$system];
        }
        $message = sprintf(
            "Failed to load external entity: Public: %s; System: %s; Context: %s",
            var_export($public, true), var_export($system, true),
            strtr(var_export($context, true), [" (\n  " => '(', "\n " => '', "\n" => ''])
        );
        throw new RuntimeException($message);
    }
);
After setting this external entity loader, the delay caused by the remote requests is gone.
And that's it. See Gist. Take care: this external entity loader has been written for loading the XML file to validate from disk and for "resolving" the XSD URIs to local filenames. Other kinds of operations (e.g. DTD-based validation) might need some code changes or extensions. The XML catalog is the preferable solution, as it also works with other tools.
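To tie this back to the batch in the question: once the schema imports resolve locally (through either approach), the thousands of files can be validated in a plain loop. The following is only a sketch; the function name validateBatch and the shape of the result array are illustrative assumptions. Note that schemaValidate still re-parses the schema on every call, but it now does so from local files.

```php
<?php
/**
 * Validate many XML files against one schema (hypothetical helper).
 *
 * @param array  $xmlFiles   list of paths to the XML files
 * @param string $schemaFile path to the XSD
 * @return array map of file path => list of error messages (empty list = valid)
 */
function validateBatch(array $xmlFiles, string $schemaFile): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $results = [];

    foreach ($xmlFiles as $file) {
        $dom = new DOMDocument();
        $valid = $dom->load($file) && $dom->schemaValidate($schemaFile);

        $errors = [];
        foreach (libxml_get_errors() as $error) {
            $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
        }
        libxml_clear_errors();

        if (!$valid && $errors === []) {
            $errors[] = 'validation failed without a libxml message';
        }
        // Warnings on an otherwise valid document are not reported as failures.
        $results[$file] = $valid ? [] : $errors;
    }

    // Restore the caller's error handling mode.
    libxml_use_internal_errors($previous);
    return $results;
}
```

For example, validateBatch(glob('batch/*.xml'), 'test-schema.xsd') would return one entry per file, where the batch directory name is again just an assumption.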