Code Answer: Parsing HTML with XPath/XMLHttpRequest

I'm trying to download an HTML page and parse it using XMLHttpRequest (on the most recent Safari browser). Unfortunately, I can't get it to work!

var url = "http://google.com";

var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", url);

xmlhttp.onreadystatechange = function () {
  // readyState 4 means the request has completed
  if (xmlhttp.readyState == 4) {
    var response = xmlhttp.responseText;
    var doc = new DOMParser().parseFromString(response, "text/xml");
    console.log(doc);
    var nodes = document.evaluate("//a/text()", doc, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    console.log(nodes);
    console.log(nodes.snapshotLength);
    for (var i = 0; i < nodes.snapshotLength; i++) {
      var thisElement = nodes.snapshotItem(i);
      console.log(thisElement.nodeName);
    }
  }
};
xmlhttp.send(null);

The text gets downloaded successfully (response contains the valid HTML) and is parsed into a tree correctly (doc represents a valid DOM for the page). However, nodes.snapshotLength is 0, even though the query is valid and should return results. Any ideas on what's going wrong?

From stackoverflow

HTML is not XML, and the two are not interchangeable. Unless the "HTML" is actually XHTML, parsing it as "text/xml" will not give you a document you can reliably process with XPath.
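For the XHTML case, one detail can still produce an empty result: XPath 1.0 matches an unprefixed name such as a only against elements in no namespace, while XHTML elements live in the XHTML namespace. The following is a minimal sketch under that assumption, not a diagnosis of the code in the question. The function name xhtmlLinkTexts and the prefix "x" are illustrative choices.

```javascript
// Sketch: XPath over a document parsed as XML whose elements are in the
// XHTML namespace. xhtmlLinkTexts and the "x" prefix are hypothetical names.
var XHTML_NS = "http://www.w3.org/1999/xhtml";

function xhtmlLinkTexts(doc) {
  // Map a prefix of our choosing to the XHTML namespace so that
  // "x:a" can match namespaced <a> elements.
  function resolver(prefix) {
    return prefix === "x" ? XHTML_NS : null;
  }
  // Evaluate on the parsed document itself rather than on the page's
  // own document object.
  var result = doc.evaluate("//x:a/text()", doc, resolver,
                            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var texts = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    texts.push(result.snapshotItem(i).nodeValue);
  }
  return texts;
}
```

In a browser this could be called as xhtmlLinkTexts(new DOMParser().parseFromString(source, "application/xhtml+xml")), where source is assumed to hold well-formed XHTML.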

Mike : I understand that - but Safari should be (and is, into the doc object) processing the "ugly" HTML into a nice, tidy, XHTML-compliant DOM, which should be able to be used with XPath, right?

John Saunders : I was unaware of this magic cleanup feature of Safari.
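The cleanup Mike describes is the job of an HTML parser, not an XML one. As a rough sketch, assuming a browser whose DOMParser accepts "text/html" (an assumption that may not hold for the Safari version in the question), the page can be parsed as HTML and then checked with a counting query. In such HTML documents, current browsers are expected to match unprefixed names like a. The names parseHtml and countLinks are illustrative.

```javascript
// Sketch only: assumes a browser whose DOMParser accepts "text/html".
// parseHtml and countLinks are hypothetical helper names.
function parseHtml(html) {
  // An HTML parser repairs unclosed or misnested tags instead of
  // rejecting them the way an XML parser does.
  return new DOMParser().parseFromString(html, "text/html");
}

function countLinks(doc) {
  // count(//a) yields a number, a quick way to check that the query
  // matches anything before walking a snapshot.
  var result = doc.evaluate("count(//a)", doc, null,
                            XPathResult.NUMBER_TYPE, null);
  return result.numberValue;
}
```

Inside the onreadystatechange handler, countLinks(parseHtml(xmlhttp.responseText)) would then report how many a elements the query sees.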

If you are using either:
- a JS library, or
- a modern browser with the querySelectorAll method available (Safari is one),
you can try using CSS selectors to query the DOM instead of XPath.
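A minimal sketch of the selector approach, assuming doc is an already parsed document that supports querySelectorAll. The function name linkTextsBySelector is illustrative. Note that textContent returns all text inside each link, which is slightly broader than the direct text nodes selected by //a/text().

```javascript
// Sketch: collect link text with a CSS selector instead of XPath.
// linkTextsBySelector is a hypothetical helper name.
function linkTextsBySelector(doc) {
  var anchors = doc.querySelectorAll("a");
  var texts = [];
  for (var i = 0; i < anchors.length; i++) {
    // textContent includes text from nested elements inside the link.
    texts.push(anchors[i].textContent);
  }
  return texts;
}
```

Called as linkTextsBySelector(doc), this returns an array of strings, one per a element found.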

Sunday, March 20, 2011
