jQuery, hammers and nails, and screen scraping

Nov 24, 2008

I am currently playing around with jQuery, and I am in a mode where jQuery is my hammer and every imaginable web problem looks like a nail to be hammered with it. So I am trying it out on various kinds of web problems, from UI beautification (widgets like accordions and tabs, details like rounded corners) to more structural, backend-ish things.

One thing I just did is very similar to screen scraping, but in a more intelligent way. I call it “intelligent screen scraping”. I’m pretty happy with my result.

In short, imagine you have two websites. Site A has a page which is something like this:

<h1>Current news</h1>
<a href="...url....">News item 1</a>
<a href="...url....">News item 2</a>
<h1>Old news</h1>
<a href="...url....">Old news item 1</a>
<a href="...url....">Old news item 2</a>
<a href="...url....">Old news item 3</a>

Your mission: re-post the first paragraph of every item under “Current news” (but NOT the old news) to site B.

Now, traditionally, you would build some sort of server-side aggregation thing. Maybe with RSS or something. But I didn’t have that luxury, and let’s say for simplicity that both of these were simple HTML pages without any backend “system”.

So I took this nail and hammered it with jQuery. On page B, I put a piece of jQuery which, as soon as page B is loaded, goes out and fetches page A by Ajax. Then it does the following magic on the DOM of page A:

$(data).find('#layerName h1:first').nextAll()
.parent().find('h1:last').prevAll().find('a').each(function(i) {
    // ... handle each matched link here ...
});

Which translates to the following in English:

  • In layer “layerName”…
  • … find the first “h1” element.
  • Grab all the sibling elements after that h1.
  • Then, switch back to parent context. (Sidenote: the parent–child contexts are a bit tricky to figure out in jQuery, especially in chains like this. Sometimes queries span the whole subtree, while sometimes they cover only immediate children. The best recipe I’ve found is “read the jQuery docs, google, and experiment”.)
  • Now, find the LAST “h1”… (we’re still working with children of layerName)
  • … and grab all the siblings BEFORE it. (Ending up with all the elements after the first “h1” and before the last “h1”.)
  • From all the results that you have now, grab all “a” elements.
  • For each element, execute a function. (Which in my case does the second part of the mission above: it executes a separate Ajax request for each matched link, fetches the content, grabs the first paragraph, and sticks everything into the right place on page B.)
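Pulling those steps together, here is a rough sketch of what the whole thing could look like on page B. The page A URL, the `#news` target container, and the `p:first` selector for “first paragraph” are my own placeholder assumptions, not details from the actual sites:

```javascript
// Sketch: when page B loads, fetch page A, walk its DOM for the
// "current news" links, then re-post each item's first paragraph.
$(function() {
  // '/pageA.html' is a hypothetical URL for page A
  $.get('/pageA.html', function(data) {
    $(data).find('#layerName h1:first').nextAll()
      .parent().find('h1:last').prevAll().find('a').each(function(i) {
        var url = $(this).attr('href');
        var title = $(this).text();
        // Second Ajax round-trip: fetch the item itself and
        // grab its first paragraph.
        $.get(url, function(itemHtml) {
          var firstPara = $(itemHtml).find('p:first').text();
          // '#news' is an assumed container on page B
          $('#news').append('<h2>' + title + '</h2><p>' + firstPara + '</p>');
        });
      });
  });
});
```

Note that the inner `$.get` calls complete asynchronously, so the items may land on page B in a different order than they appear on page A.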

A bit intimidating to look at, but it makes perfect sense. To me, anyway. And this result in particular is nothing extraordinary; it’s just an example of the class of problems you can solve with jQuery.

Now, this exact same thing could have been done with “classic” screen scraping. But I called the jQuery stuff “intelligent screen scraping” because, from my point of view, it has these advantages:

  • No ugly parsing. I’ve also done old-school server-side screen scraping where all you got was some ugly, invalid HTML and you had to hammer it with a bunch of regexes to find what you needed. And whenever a single byte changed on the server side, you had to go and fix what got broken. The beauty of doing this client-side is that all the parsing is taken care of for you and you can work with a clean DOM.
  • Zero server maintenance, overhead, and caching. Well, this can be an advantage or not, depending on your particular business needs. When everything happens client-side and each client maintains their own cache of a page that gets a million hits per hour, then maybe server-side caching is more justified. But at the same time, fetching Ajax content this way is also subject to client-side caching and proxies. For non-script content, caching is on by default in jQuery, although you can turn it off if you want. I just find it beautiful that you don’t have to run anything on the server, and there is no magic homebrew aggregation script that breaks at 3AM on a Saturday morning, forcing someone to scramble and figure it out.
  • All the subrequests happen in the security context of the end user. I find this important and beautiful from the architecture and maintenance perspective. If you are doing anything server-side, you don’t have the user’s security context (their browser, cookies, etc.), so typically you have some extra “system user” that has privileges to access and aggregate all the content. But then, when a user comes, you must do extra work to validate whether they have permission to really see what they are requesting. If you do client-side screen scraping, you are always operating in the “correct” security context of the actual end user, so you do not have to do any extra server-side homebrew security work.
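On the caching point: jQuery caches non-script GET responses by default, and you can turn that off per request via the `cache` option of `$.ajax` (available since jQuery 1.2). A sketch, again with a made-up URL for page A:

```javascript
// Fetch page A with caching disabled — jQuery appends a timestamp
// parameter to the URL so intermediate caches are bypassed and you
// always get a fresh copy.
$.ajax({
  url: '/pageA.html',   // hypothetical URL for page A
  cache: false,
  success: function(data) {
    // ... run the same scraping chain on `data` as above ...
  }
});
```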

I’m not saying that this is an ideal solution to the sort of aggregation need that I described above. But I did find that for my particular need, I got the work done with far less code than anything serverside would have been.