Sunday, March 20, 2011

Parsing HTML with XPath/XMLHttpRequest

I'm trying to download an HTML page with XMLHttpRequest and query it with XPath (on the most recent version of Safari). Unfortunately, I can't get it to work!

var url = "http://google.com";

var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", url);

xmlhttp.onreadystatechange = function () {
  // readyState 4 means the request has completed
  if (xmlhttp.readyState == 4) {
    var response = xmlhttp.responseText;
    // Parse the downloaded markup with the XML parser
    var doc = new DOMParser().parseFromString(response, "text/xml");
    console.log(doc);
    // Select the text node of every <a> element
    var nodes = document.evaluate("//a/text()", doc, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    console.log(nodes);
    console.log(nodes.snapshotLength);
    for (var i = 0; i < nodes.snapshotLength; i++) {
      var thisElement = nodes.snapshotItem(i);
      console.log(thisElement.nodeName);
    }
  }
};
xmlhttp.send(null);

The text is downloaded successfully (response contains the valid HTML) and is parsed into a tree correctly (doc represents a valid DOM for the page). However, nodes.snapshotLength is 0, even though the query is valid and should return results. Any ideas on what's going wrong?

From Stack Overflow
  • HTML is not XML. The two are not interchangeable. Unless the "HTML" is actually XHTML, you will not be able to use XPath to process it.

    Mike : I understand that - but Safari should be (and is, into the doc object) processing the "ugly" HTML into a nice, tidy, XHTML-compliant DOM, which should be able to be used with XPath, right?
    John Saunders : I was unaware of this magic cleanup feature of Safari.
  • If you have either:

    • a JS library, or
    • a modern browser with the querySelectorAll method available (Safari is one)

    you can try using CSS selectors to query the DOM instead of XPath.
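
A minimal sketch of the CSS-selector approach from the last answer. The helper names linkTexts and parseHtml are made up for illustration, and the sketch assumes a browser whose DOMParser accepts "text/html", which the thread does not confirm for the Safari version in question.

```javascript
// Collect the trimmed text of every <a> element.
// `doc` is anything that exposes querySelectorAll (a Document or an Element).
function linkTexts(doc) {
  var anchors = doc.querySelectorAll("a");
  var texts = [];
  for (var i = 0; i < anchors.length; i++) {
    texts.push(anchors[i].textContent.trim());
  }
  return texts;
}

// Parse a string with the HTML parser instead of the XML parser.
// Assumption: DOMParser supports the "text/html" type in this browser.
function parseHtml(html) {
  return new DOMParser().parseFromString(html, "text/html");
}

// Browser-only demo; skipped where DOMParser is not defined.
if (typeof DOMParser !== "undefined") {
  var sample = '<p><a href="/x"> first </a><a href="/y">second</a></p>';
  console.log(linkTexts(parseHtml(sample)));
}
```

In the question's code, linkTexts(doc) would stand in for the document.evaluate call and the snapshot loop, provided doc was produced by an HTML parse rather than an XML one.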
