Sunday, March 20, 2011

Parsing HTML with XPath/XMLHttpRequest

I'm trying to download an HTML page and parse it using XMLHttpRequest (on the most recent Safari browser). Unfortunately, I can't get it to work!

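var url = "";

var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", url);

xmlhttp.onreadystatechange = function () {
  // Only process the response once the request has completed.
  if (xmlhttp.readyState !== 4) return;
  var response = xmlhttp.responseText;
  // Parse the downloaded markup as XML and query it with XPath.
  var doc = new DOMParser().parseFromString(response, "text/xml");
  var nodes = document.evaluate("//a/text()", doc, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (var i = 0; i < nodes.snapshotLength; i++) {
    var thisElement = nodes.snapshotItem(i);
    // (the rest of the loop body is not shown in the original post)
  }
};
xmlhttp.send(null);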

The text gets downloaded successfully (response contains the valid HTML) and is parsed into a tree correctly (doc represents a valid DOM for the page). However, nodes.snapshotLength is 0, even though the query is valid and should return results. Any ideas on what's going wrong?

From Stack Overflow
  • HTML is not XML. The two are not interchangeable. Unless the "HTML" is actually XHTML, you will not be able to use XPath to process it.

    Mike : I understand that - but Safari should be (and is, into the doc object) processing the "ugly" HTML into a nice, tidy, XHTML-compliant DOM, which should be able to be used with XPath, right?
    John Saunders : I was unaware of this magic cleanup feature of Safari.
  • If you have either:

    • a JS library, or
    • a modern browser with the querySelectorAll method available (Safari is one),

    you can try using CSS selectors to query the DOM instead of XPath.
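The CSS selector suggestion above can be sketched roughly as follows. This is only an illustration, not code from the original thread: linkTexts and sampleHtml are hypothetical names, and the browser-only part assumes that DOMParser accepts the "text/html" type, which current browsers do but the Safari of the time may not have.

```javascript
// Collect the text of every <a> element using a CSS selector instead of XPath.
// `doc` can be any object that exposes querySelectorAll (a Document or an Element).
function linkTexts(doc) {
  var anchors = doc.querySelectorAll("a");
  var texts = [];
  for (var i = 0; i < anchors.length; i++) {
    texts.push(anchors[i].textContent);
  }
  return texts;
}

// Browser-only usage: parse the markup as HTML rather than XML.
// Assumption: DOMParser supports "text/html" in the browser running this.
if (typeof DOMParser !== "undefined") {
  var sampleHtml = "<p><a href='/one'>First</a> and <a href='/two'>Second</a></p>";
  var parsed = new DOMParser().parseFromString(sampleHtml, "text/html");
  console.log(linkTexts(parsed));
}
```

In the question's code, the same function could be called on the document produced from xmlhttp.responseText, provided that document was parsed as HTML.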

