XML is currently one of the most common data exchange formats and you are, no doubt, familiar with it already. There is a practically endless supply of XML data available via HTTP on the Web. XML has been used for over 10 years for a multitude of specialized and interesting formats like RSS, FOAF (friend of a friend) and microformats. Interestingly, one general direction all these representations share is the ability to link information automatically through references in the data. And of course HTML, although it doesn't have XML's requirement of being well-formed, is clearly the same kind of data representation.
PHP comes with more than one way to work with XML. There are facilities to work with XML's DOM (Document Object Model) or to parse it SAX-style with event handlers. Both make XML access in PHP look a lot like it does in some other languages. Not every useful function is provided, but we can add our own. There are also extensions that turn XML into PHP objects and back, which lets us work with the structures we're already used to.
DOMDocument
Many programming languages provide an API to work with XML's DOM. This method reads the entire XML document into memory, which allows random access to any node and makes it easy to change the document itself. In PHP, an XML DOM document is created by instantiating the DOMDocument class and loading XML into it.
$dom = new DOMDocument ();
$dom->loadXML ('<root><node attr="value">text value</node></root>');
echo $dom->saveXML();
<?xml version="1.0"?>
<root><node attr="value">text value</node></root>
saveXML turns the DOM back into its string representation, and save() can be used to write that string to a file. We don't have to dump the whole DOM document with saveXML; a child node from the document can be specified instead.
echo $dom->saveXML ($dom->documentElement), "\n";
echo $dom->saveXML ($dom->documentElement->firstChild), "\n";
<root><node attr="value">text value</node></root>
<node attr="value">text value</node>
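Since save() is mentioned but not shown, here is a minimal sketch of a round trip through a file. The file name and the use of the system temp directory are assumptions made for the example, not something the earlier code relies on.

```php
<?php
// Sketch: write a DOM to disk with save() and read it back with load().
$dom = new DOMDocument();
$dom->loadXML('<root><node attr="value">text value</node></root>');

// Hypothetical target path in the system temp directory.
$path = sys_get_temp_dir() . '/dom-save-example.xml';

// save() returns the number of bytes written, or false on failure.
$bytes = $dom->save($path);
if ($bytes === false) {
    throw new RuntimeException("Could not write $path");
}

// Load the file into a second document to confirm the round trip.
$copy = new DOMDocument();
$copy->load($path);
$roundTrip = $copy->saveXML($copy->documentElement);
echo $roundTrip, "\n";

// Clean up the temporary file.
unlink($path);
```

The file on disk also contains the XML declaration; passing documentElement to saveXML leaves it out of the echoed string.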
Being able to create a DOM object from a string of XML is convenient, but very often the XML will come from a file or a source on the Web. The load() method is used to construct the document from an outside source. This is an XML configuration file I found on my system.
$dom = new DOMDocument ();
$dom->load ('moz-bindings.xml');
echo $dom->saveXML ($dom->documentElement);
<bindings xmlns="http://www.mozilla.org/xbl">
<binding id="numericfield">
<implementation>
<constructor>
this.keypress = CheckIsDigit ;
</constructor>
<method name="CheckIsDigit">...
To load an XML document from the Web:
$dom->load ('http://techishard.wordpress.com/feed/');
echo $dom->saveXML ($dom->documentElement);
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" …
xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
<channel>
<title>Tech Is Hard</title>
<atom:link href="http://techishard.wordpress.com/feed/" rel="self" type="application/rss+xml"/>
<link>http://techishard.wordpress.com</link>
<description>But thinking hard beats working hard</description>...
Having two different methods to load a DOM Document is inconvenient. The following function is handy:
/**
 * Overloaded load method.
 *
 * If an XML string is passed (e.g. "<root></root>") then loadXML will be
 * used; if a file is passed, then load is used.
 *
 * @param DOMDocument $dom The document to load into
 * @param mixed $mXml Either a string of XML or path to XML file
 * @param int $options libxml option constants passed through to the loader
 * @return bool true on success, false on failure
 */
function LoadXml(DOMDocument $dom, $mXml, $options = 0) {
    if ($dom && $mXml) {
        // A leading "<" means XML text; anything else is treated as a path or URL
        return ($mXml[0] == '<') ? $dom->loadXML($mXml, $options) : $dom->load($mXml, $options);
    }
    else {
        throw new DOMException("Missing argument(s). A DOMDocument and XML are required.");
    }
}
We can safely assume that XML will start with “<”. (If we wanted to be extra flexible, we could trim any leading spaces from the $mXml argument first.) LoadXml can be used on any source:
$dom = new DOMDocument ();
LoadXml ($dom, '<root><node attr="value">text value</node></root>');
LoadXml ($dom, 'moz-bindings.xml');
LoadXml ($dom, 'http://techishard.wordpress.com/feed/');
LoadXml ($dom, '');
The last call will throw a DOMException, because of the argument check in the function.
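The "extra flexible" variant mentioned above might look like the following sketch. The name loadXmlTrimmed is made up for illustration, and throwing InvalidArgumentException instead of DOMException is my own assumption. Otherwise it follows the same idea as LoadXml.

```php
<?php
/**
 * Hypothetical variant of LoadXml that ignores leading whitespace
 * before deciding between loadXML() and load().
 */
function loadXmlTrimmed(DOMDocument $dom, string $xml, int $options = 0): bool {
    $trimmed = ltrim($xml);
    if ($trimmed === '') {
        // Assumption: an empty argument is treated as a caller error.
        throw new InvalidArgumentException('XML text or a path is required.');
    }
    // A leading "<" means XML text; anything else is a path or URL.
    return $trimmed[0] === '<'
        ? $dom->loadXML($trimmed, $options)
        : $dom->load($trimmed, $options);
}

$dom = new DOMDocument();
// The leading blanks and newline would otherwise send this string to load().
$ok = loadXmlTrimmed($dom, "  \n<root><node attr=\"value\">text value</node></root>");
$rootName = $dom->documentElement->nodeName;
echo $rootName, "\n";
```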
A third way to load a DOMDocument is very convenient for “scraping” data from Web pages: loadHTML and loadHTMLFile. As noted, HTML doesn't have to follow the same rules as XML, and browsers have been figuring out how to display it for years by making assumptions. Without this functionality, one has to fetch the HTML contents of a page and pick out the desired data with regular expressions. With loadHTML or loadHTMLFile, a regular HTML page can be turned into a DOM document and all the links on the page can be listed. (loadHTML prints a lot of warnings about the HTML not being perfect, so I'm turning off warnings.)
error_reporting(error_reporting() & ~E_WARNING);
$dom = new DOMDocument ();
$dom->loadHTMLFile ("http://finance.yahoo.com/");
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
print "LINK {$anchor->getAttribute ('href')} \"{$anchor->nodeValue}\"";
}
LINK #yuhead-search "Skip to search."
LINK https://edit.yahoo.com/config/eval_register?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/ " New User? Register"
LINK https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/ " Sign In"
LINK http://help.yahoo.com/l/us/yahoo/finance/ "Help"
LINK http://www.yahoo.com/bin/set/?ilc=37 "Make Y! My Homepage"...
getElementsByTagName returns a DOMNodeList, which can be looped through like an array but is actually an object. A specific node can be selected with getElementById if it has a unique id attribute. To get the yield for 30-year Treasury bonds from this Yahoo page:
echo $dom->getElementById("yfs_l84_^tyx")->nodeValue;
3.03
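To make the "looks like an array, but is an object" point concrete, here is a small offline sketch. The HTML string is made up to stand in for a fetched page. It also uses libxml_use_internal_errors() to keep parser complaints quiet, which is an alternative I'm assuming is acceptable in place of changing error_reporting.

```php
<?php
// Collect libxml's complaints about imperfect HTML instead of printing them.
libxml_use_internal_errors(true);

// Made-up, deliberately sloppy HTML (the <p> is never closed).
$html = '<html><body><p>Links:'
      . '<a href="/help">Help</a>'
      . '<a href="/login">Sign In</a>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$anchors = $dom->getElementsByTagName('a');

// DOMNodeList is an object: it has a length property and an item() method.
$count = $anchors->length;
$first = $anchors->item(0)->getAttribute('href');

// It can still be walked with foreach, like an array.
$links = [];
foreach ($anchors as $anchor) {
    $links[$anchor->getAttribute('href')] = $anchor->nodeValue;
}
print_r($links);
```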
SimpleXML
SimpleXML is another very popular way of working with XML in PHP. As the name implies, it hides a lot of the details imposed by the strict grammar of XML. Because an XML element can have its own attributes and text value, along with children of the exact same type, this is represented as a set of nested objects whose names mimic the XML names.
$a = new SimpleXMLElement (
'<a id="1">' .
'<b>abc' .
'<c at1="hello" at2="world">foo</c>' .
'<c>bar</c>' .
'</b>' .
'</a>');
$a is now an object with two properties:
- @attributes = array ('id' => '1')
- b = SimpleXMLElement object with two properties:
- 0 = “abc”
- c = array of SimpleXMLElements
The objects have special methods to access values that make printing the raw objects misleading, but make normal access very easy:
echo "#", $a,'<br/>';
echo "#", $a['id'],'<br/>';
echo "#", $a->b,'<br/>';
echo "#", $a->b->c,'<br/>';
echo "#", $a->b->c[0],'<br/>';
echo "#", $a->b->c[1],'<br/>';
echo "#", $a->b->c['at1'],'<br/>';
echo "#", $a->b->c[0]['at1'],'<br/>';
echo "#", $a->b->c['at2'],'<br/>';
echo "#", $a->b->c[0]['at2'],'<br/>';
#
#1
#abc
#foo
#foo
#bar
#hello
#hello
#world
#world
When a node is referenced directly, like $a->b, the text value of the node is returned. Attributes are available as $node[attribute-name]. Child elements are named properties (the b in $a->b). Notice how referencing an array of elements ($a->b->c) without the index automatically gives you the first element. A SimpleXMLElement can be built from a string or a path to XML, just like the LoadXml function earlier, but it has to be told the path is a URL in the third parameter.
$rss = new SimpleXMLElement ('http://techishard.wordpress.com/feed/', 0, true);
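The access rules above are easiest to see with explicit casts and loops. The following sketch reuses the small document from the earlier example. echo converts elements to strings implicitly, but elsewhere (array keys, comparisons) a (string) cast is usually needed.

```php
<?php
$a = new SimpleXMLElement(
    '<a id="1"><b>abc<c at1="hello" at2="world">foo</c><c>bar</c></b></a>'
);

// Attributes and text values are SimpleXMLElement objects until cast.
$id   = (string) $a['id'];
$text = (string) $a->b;

// Looping over $a->b->c visits every <c>, not only the first one.
$values = [];
foreach ($a->b->c as $c) {
    $values[] = (string) $c;
}

// attributes() iterates over all attributes of a single element.
$attrs = [];
foreach ($a->b->c[0]->attributes() as $name => $value) {
    $attrs[$name] = (string) $value;
}

print_r($values);
print_r($attrs);
```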
Note that SimpleXML isn't flexible enough to handle the same Web page that DOMDocument's loadHTML can. But RSS feeds are well-formed XML.
function rssSimpleList ($xml) {
$li = array();
foreach ($xml->channel->item as $item) {
$li [] = '<li><a href="' . $item->link . '">' .
$item->title . '</a><p>' . $item->description . '</p></li>';
}
return '<ul>' . join ($li) . '</ul>';
}
print rssSimpleList ($rss);

Congratulations, you just built an RSS feed reader.
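For trying this without a network connection, a feed can be built from a string instead of a URL. The sketch below uses a made-up two-item feed and a hypothetical variant, rssEscapedList, that runs each value through htmlspecialchars. That escaping is my own assumption about what is wanted; a real feed's description often contains HTML that should be printed as-is.

```php
<?php
// Made-up feed standing in for a real RSS document.
$feed = '<rss version="2.0"><channel><title>Example</title>'
      . '<item><title>First &amp; foremost</title>'
      . '<link>http://example.com/1</link>'
      . '<description>One</description></item>'
      . '<item><title>Second</title>'
      . '<link>http://example.com/2</link>'
      . '<description>Two</description></item>'
      . '</channel></rss>';

// With only one argument, the constructor treats the string as XML text.
$rss = new SimpleXMLElement($feed);

// Hypothetical variant of rssSimpleList that escapes each value for HTML.
function rssEscapedList(SimpleXMLElement $xml): string {
    $li = [];
    foreach ($xml->channel->item as $item) {
        $li[] = '<li><a href="' . htmlspecialchars((string) $item->link) . '">'
              . htmlspecialchars((string) $item->title) . '</a><p>'
              . htmlspecialchars((string) $item->description) . '</p></li>';
    }
    return '<ul>' . implode('', $li) . '</ul>';
}

$html = rssEscapedList($rss);
echo $html, "\n";
```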
XML Parser
This is another popular way to handle XML documents. It's event based, which means that as the parser reads the XML document, it notifies the program of different events like the start and end of elements. So the code is written as handler functions for the events that matter to us (anything that doesn't have a handler is just ignored).
The style of use is a little unusual. A parser still has to be created before any functions are called, but instead of the object-oriented syntax of DOMDocument, the parser variable is passed to each function.
// event handlers
function startit ($parser, $name, $attribs) {
echo "<br/>begin ", $name;
foreach ($attribs as $attr => $val) {
printf ("<br/>@%s = '%s'", $attr, $val);
}
}
function cdata ($parser, $data) {
echo "<br/>text()=", $data;
}
function endit ($parser, $name) {
echo "<br/>end ", $name;
}
// make a new parser
$fooParse = xml_parser_create ();
xml_parser_set_option ($fooParse, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler ($fooParse, "startit", "endit");
xml_set_character_data_handler ($fooParse, "cdata");
xml_parse ($fooParse, '<root><node attr="value">text value</node></root>', true);
begin root
begin node
@attr = 'value'
text()=text value
end node
end root
xml_parser_set_option ($fooParse, XML_OPTION_CASE_FOLDING, 0) keeps the node names in their original case. The main advantage of XML Parser is that it doesn't load the entire DOM into memory at once. It begins calling event handlers as soon as it can. Notice the 3rd argument to xml_parse; it tells the parser whether or not this is the end of the data. So it's also possible to stream the data through the parser in a file reading loop. It would be logical to assume there's a significant speed advantage, depending on how the XML is being used, due to those two factors.
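The file-reading loop mentioned above could be sketched as follows. The temporary file, the tiny 16-byte chunk size and the closures that just collect element names are all assumptions chosen to keep the example short. A real program would read larger chunks from a real file.

```php
<?php
// Write a made-up document to a temporary file so there is something to stream.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    '<root><node attr="value">text value</node><node>more</node></root>'
);

$names = [];
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    // Start handler: record each element name as it is opened.
    function ($parser, $name, $attribs) use (&$names) {
        $names[] = $name;
    },
    // End handler: nothing to do in this sketch.
    function ($parser, $name) {
    }
);

$fp = fopen($path, 'rb');
while (!feof($fp)) {
    // Feed the parser one small chunk at a time.
    $chunk = fread($fp, 16);
    // The third argument is true only once the end of the file is reached.
    if (xml_parse($parser, $chunk, feof($fp)) === 0) {
        throw new RuntimeException(
            xml_error_string(xml_get_error_code($parser))
        );
    }
}
fclose($fp);
unlink($path);

print_r($names);
```

The handlers fire as each chunk arrives, so the whole document never has to be held in memory at once.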
Creating and Modifying XML
In the very first example, we created an XML document.
$dom->loadXML ('<root><node attr="value">text value</node></root>');
Saving $dom to a file would make it permanent. But often, the entire document doesn't get loaded all at once; new elements have to be added to or removed from an existing DOM. Text and attribute contents get changed. Creating new elements in DOMDocument is a two-step process. First, the element is created in the document, and then it has to be appended to an existing element. Starting with the simple XML from the first example, adding an element in the $dom above is done like:
$addedNode = $dom->createElement("added-node", "new text");
// can append to document itself
// which actually results in invalid XML document (no root)
$dom->appendChild($addedNode);
<root><node attr="value">text value</node></root>
<added-node>new text</added-node>
// placed inside the root element (documentElement gives the root element)
$node = $dom->documentElement->appendChild($addedNode);
<root><node attr="value">text value</node><added-node>new text</added-node></root>
// inside the root's child
$node = $dom->documentElement->firstChild->appendChild($addedNode);
<root><node attr="value">text value<added-node>new text</added-node></node></root>
createElement and appendChild both return a DOMNode. Notice that there weren't multiple copies of the new node, even though $addedNode was appended three times: it was only created once, so each appendChild call moved the same node to its new parent. That is why two steps (create and append) are needed to add a new node. Attributes are added to an element with
$node->setAttribute("new-attr", "and its value");
<root><node attr="value">text value<added-node new-attr="and its value">new text</added-node></node></root>
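The move-not-copy behavior, along with the removing and changing mentioned at the start of this section, can be checked with a short sketch. cloneNode, removeChild and the nodeValue property are standard DOM features, but the sequence of steps here is just one illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><node attr="value">text value</node></root>');
$root = $dom->documentElement;

$added = $dom->createElement('added-node', 'new text');

// Appending the same node twice moves it; only the last position remains.
$root->appendChild($added);
$root->firstChild->appendChild($added);
$afterMove = $dom->saveXML($root);

// cloneNode(true) makes a deep copy, which can live in a second place.
$root->appendChild($added->cloneNode(true));
$afterClone = $dom->saveXML($root);

// Change the text of the original node and remove the copy again.
$added->nodeValue = 'changed text';
$root->removeChild($root->lastChild);
$final = $dom->saveXML($root);

echo $afterMove, "\n", $afterClone, "\n", $final, "\n";
```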
Here are a couple of shortcut functions for some of the work we just did.
/**
 * Creates a new element in the document and appends it to the element
 *
 * @param DOMNode $node to append the new node to
 * @param string $name of the new node
 * @param string $value [optional] of the new node
 * @param array $attributes keyed array of attribute name/val
 * @return DOMElement the new created element
 */
function insertChild(DOMNode $node, $name, $value = '', $attributes = array()) {
    // An empty-string default avoids passing null to createElement
    $newNode = $node->appendChild ($node->ownerDocument->createElement ($name, $value));
    return setAttributes ($newNode, $attributes);
}
/**
 * Adds an array of attributes to the element
 *
 * @param DOMElement $elem DOM element to add attributes to
 * @param array $attributes
 * @return DOMElement the element with these attributes
 */
function setAttributes(DOMElement $elem, array $attributes = array()) {
    foreach ($attributes as $name => $value) {
        $elem->setAttribute($name, $value);
    }
    return $elem;
}
$dom = new DOMDocument ();
$dom->loadXML ('<root><node attr="value">text value</node></root>');
insertChild ($dom->documentElement,
"inserted", "all at once",
array ("foo" => "bar", "date" => date("r"))
);
<root><node attr="value">text value</node><inserted foo="bar" date="Thu, 04 Oct 2012 22:47:29 -0600">all at once</inserted></root>
In insertChild, there's some nested code that references the $node's ownerDocument property, to get the DOMDocument to use for createElement, which conveniently returns a new DOMNode to append to the $node. I hope that shows the value of a few well thought out methods or functions and how much work it can save us.