XForms Everywhere

3/31/2005

Screen-Scraping with OPS

Filed under: General — Erik Bruchez @ 2:48 am

A couple of days ago I read an article by Brian Goetz on IBM developerWorks about screen-scraping with XQuery.

The idea behind HTML screen-scraping is to access an HTML page on the web and extract information from it. The developerWorks article proposes two ideas:

  1. Using JTidy to convert the HTML to well-formed XML
  2. Using XQuery to extract and reformat the data

I have to admit that I was not familiar with the term “screen-scraping”, but in fact we’ve had some examples of this technique in OPS for a very long time. In particular, the URL Generator example was retrieving the latest CNN headline by accessing the HTML of the CNN web site, and the Google Spell-Checker example was doing even more complex HTML-based interaction with Google.

The URL generator processor in OPS is able to produce XML from data fetched from a URL that you pass it. That source data may be of four different types:

  • XML. In this case, it is just parsed
  • HTML. In this case, it is automatically cleaned-up (with JTidy!) and converted into XML
  • Text. In this case, the text is encapsulated within XML
  • Binary. In this case, the binary is Base64-encoded and encapsulated within XML

So it is trivial to extract an existing HTML page from OPS with code like this in XPL:

  <p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
      <config>
        <url>http://weather.yahoo.com/</url>
        <content-type>text/html</content-type>
      </config>
    </p:input>
    <p:output name="data" id="page"/>
  </p:processor>

Now, the one thing we did not do in the URL generator example was use XQuery: we used XSLT instead. In fact, XQuery and XSLT are very similar in what they can accomplish in such a use case, the XQuery syntax being a little lighter. OPS had an XQuery processor as well, illustrated by the XQuery sandbox example. We were using a fairly old version of Qexo, which implemented a version of the XQuery spec that was not quite up to date. I used the opportunity to move to Saxon. Saxon is primarily known as an XSLT transformer, but because XQuery 1.0 is so close to XPath 2.0 and XSLT 2.0, Saxon also implements XQuery. Since Saxon is already the default XSLT transformer in OPS, implementing an XQuery processor based on it was a breeze. Here is the Yahoo! Weather example from the article, written in XPL:

  <p:config xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <!-- Fetch the page; HTML is tidied and converted to XML -->
    <p:processor name="oxf:url-generator">
      <p:input name="config">
        <config>
          <url>http://weather.yahoo.com/</url>
          <content-type>text/html</content-type>
        </config>
      </p:input>
      <p:output name="data" id="page" debug="page"/>
    </p:processor>
    <!-- Extract and reformat the data with XQuery -->
    <p:processor name="oxf:xquery">
      <p:input name="config">
        <xquery>
          <html>
            <body>
              <table>
                {
                  for $d in //td[contains(a/small/text(), "New York, NY")]
                  return
                    for $row in $d/parent::tr/parent::table/tr
                    where contains($d/a/small/text()[1], "New York")
                    return
                      <tr>
                        <td>{data($row/td[1])}</td>
                        <td>{data($row/td[2])}</td>
                        <td>{$row/td[3]//img}</td>
                      </tr>
                }
              </table>
            </body>
          </html>
        </xquery>
      </p:input>
      <p:input name="data" href="#page"/>
      <p:output name="data" id="html-page"/>
    </p:processor>
    <!-- Serialize the result as HTML -->
    <p:processor name="oxf:html-serializer">
      <p:input name="config">
        <config/>
      </p:input>
      <p:input name="data" href="#html-page"/>
    </p:processor>
  </p:config>

As usual with XPL, the XML pipeline language of OPS, notice how easy it is to connect small components and make them work together without writing a single line of Java and without going through compilation and deployment.

Now here is how you could write the XQuery fragment with XSLT:

  <p:processor name="oxf:xslt" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
      <!-- The xsl prefix is declared here so that xsl:version is in scope -->
      <html xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <body>
          <table>
            <xsl:for-each select="//td[contains(a/small/text(), 'New York, NY')]">
              <xsl:variable name="d" select="."/>
              <xsl:if test="contains($d/a/small/text()[1], 'New York')">
                <xsl:for-each select="$d/parent::tr/parent::table/tr">
                  <xsl:variable name="row" select="."/>
                  <tr>
                    <td><xsl:value-of select="$row/td[1]"/></td>
                    <td><xsl:value-of select="$row/td[2]"/></td>
                    <td><xsl:copy-of select="$row/td[3]//img"/></td>
                  </tr>
                </xsl:for-each>
              </xsl:if>
            </xsl:for-each>
          </table>
        </body>
      </html>
    </p:input>
    <p:input name="data" href="#page"/>
    <p:output name="data" ref="data"/>
  </p:processor>

This Yahoo! Weather example using XQuery is now in CVS and should have already shown up in the unstable builds.

The Yahoo! Weather Screen-Scraping Example in OPS

3/30/2005

New RSS Feeds

Filed under: General — Alessandro Vernet @ 5:17 pm

We are now serving the RSS feeds for this blog through FeedBurner. FeedBurner makes sure the feed is well formatted so it works optimally on your reader. The URLs for the new feeds are:

You can continue to use the old URLs if you wish as they will redirect your feed reader to the above URLs. Send us a comment if you notice any problem with the new RSS feeds.

3/28/2005

The OPS Blog Sample Application, Part I

Filed under: General — Erik Bruchez @ 7:31 am

Just before this weekend, I launched a mini-project to create a new “Blog” sample application for Orbeon PresentationServer (OPS). The idea had been suggested independently by two users of OPS, and there was also my own inclination to write another cool example application for OPS that leverages XML.

Where do we Start?

I have been using w.bloggar, a blog client, since the beginning of the year. While limited, it actually handles some of the basic functionality of a blog. So, thinking in terms of services, the first thing I wanted to do was to understand better the functionality provided by the XML-RPC-based Blogger and MetaWeblog APIs (the latter being an extension of the former). My quick analysis is that those APIs essentially manage the following entities:

  • Blogs. A blog is an entity which defines an individual user’s blog hosted by the application. A blog is identified by a blog id. It provides a URL, a name, and has associated categories, identified by name, that also provide associated URLs for the HTML and RSS versions of pages related to each category.

  • Posts. A post is an entity which defines a post within a blog. A post is identified by a post id. Its main features are a title, a link, and a “description” (the actual content of the post). Other RSS 2.0 attributes can be used as well, in particular a publication date and the associated categories.

First Goal

My first goal was to get something running quickly. “Running” is here defined by the following steps:

  1. Visibility from my blog client. I should be able to configure my blog client to access the OPS blog sample, even if the data exchanged is static and doesn’t actually accomplish anything.

  2. Support basic operations. I should be able to create a post with my blog client and persist it. Then, retrieve it, edit it, and update it again.

  3. View blog posts. I should be able to visit a URL and access the last posts with title, date and content in HTML. Let’s call this page the recent-posts page.

Storage Format

One of the initial ideas was that we would follow the spirit of OPS as much as we can. This implies sticking to using XML-friendly approaches. One of those is that we would use XML storage. OPS comes with an embedded eXist database, so why not use it?

How do you “design” an XML database schema? This is something not many of us are used to doing. I figured that the simpler the better: I would create two collections: a blogs collection, and a posts collection. Let’s say user lambda has a blog: he would simply have an XML document describing that blog under the blogs collection. Then each of his posts will be a separate document under the posts collection.

This is not the only possible solution. Since posts are related to a unique blog anyway, you could embed all of a user’s posts within the blog document. The benefit is that a single document would do, but then you work more with document updates rather than creating new documents for posts. This also means that a blog document could become pretty big. For now, we’ll go with the first approach. I am waiting for comments on this!

Here is an example of a blog document according to this design:

  <blog>
    <blog-id>blog123</blog-id>
    <username>ebruchez</username>
    <name>My Cool Blog</name>
    <categories>
      <category>
        <name>General Stuff</name>
        <name>Cool Stuff</name>
      </category>
    </categories>
  </blog>

And here is an example of a post document:

  <post>
    <post-id>post456</post-id>
    <username>ebruchez</username>
    <blog-id>blog123</blog-id>
    <title>Post du Jour</title>
    <description>What a day…</description>
    <published>true</published>
    <date-created>2004-03-28T10:00:00</date-created>
    <categories>
      <category-name>General Stuff</category-name>
    </categories>
    <comments>
      <comment></comment>
    </comments>
  </post>

Note that post comments are not part of the MetaWeblog API, but here I decided to store them along with each post. Again, a different strategy could consist in creating yet another collection for comments.

Once this basic format was established, I created stub Relax NG schemas to validate those two types of documents.
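The stub schemas themselves are not shown here. As a rough guess at what a minimal Relax NG schema (XML syntax) for the post document above might look like, consider the sketch below. The datatypes chosen for `published` and `date-created`, and the choice to allow any number of `category-name` and `comment` children, are assumptions made for illustration.

```xml
<!-- Sketch only: a minimal Relax NG schema for the example post document.
     Datatypes and cardinalities are assumptions, not the actual stub schema. -->
<element name="post" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="post-id"><text/></element>
  <element name="username"><text/></element>
  <element name="blog-id"><text/></element>
  <element name="title"><text/></element>
  <element name="description"><text/></element>
  <!-- Assumed to be an xsd:boolean ("true" / "false") -->
  <element name="published"><data type="boolean"/></element>
  <!-- Assumed to be an xsd:dateTime such as 2004-03-28T10:00:00 -->
  <element name="date-created"><data type="dateTime"/></element>
  <element name="categories">
    <zeroOrMore>
      <element name="category-name"><text/></element>
    </zeroOrMore>
  </element>
  <element name="comments">
    <zeroOrMore>
      <element name="comment"><text/></element>
    </zeroOrMore>
  </element>
</element>
```

A stricter version could constrain the id formats or require at least one category. A stub that only pins down the element structure is usually enough to catch malformed documents early.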

Hooking Up XML-RPC

This is actually quite straightforward with OPS: an XML-RPC call consists of an XML document sent as the body of an HTTP request. A response consists of an XML document sent back. Such a model is implemented in just a few lines with the OPS Request generator, XML converter and HTTP serializer.

I then created a dispatcher in XPL that calls individual pipelines based on the XML-RPC method requested. So far, in the order in which I introduced them: blogger.getUsersBlogs, metaWeblog.getCategories, metaWeblog.newPost, metaWeblog.getRecentPosts, metaWeblog.getPost, metaWeblog.editPost, blogger.deletePost.
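To make the dispatching concrete: in the XML-RPC format, the request body is a `methodCall` document whose `methodName` element names the method, so that element is what a dispatcher can key on. Here is a sketch of what a `metaWeblog.getPost` request might look like. The post id, username and password values are made up for illustration.

```xml
<?xml version="1.0"?>
<!-- Sketch of an XML-RPC request body; parameter values are illustrative only -->
<methodCall>
  <!-- A dispatcher can select a pipeline based on this value -->
  <methodName>metaWeblog.getPost</methodName>
  <params>
    <!-- Parameters are positional: post id, username, password -->
    <param><value><string>post456</string></value></param>
    <param><value><string>ebruchez</string></value></param>
    <param><value><string>private</string></value></param>
  </params>
</methodCall>
```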

I should note that the XML-RPC format is very verbose. While well suited to mapping back and forth to good old function- or method-based languages, it is far from a document-based approach to services, which would have been far more natural here. It is therefore almost necessary to introduce a conversion layer from the XML-RPC API to a simpler format for internal use. Consider the following short example:

  <params>
    <param>
      <value>
        <string>705B0BBB-8DF7-DB98-1FA1-B416860AA61B</string>
      </value>
    </param>
    <param>
      <value>
        <string>ebruchez</string>
      </value>
    </param>
    <param>
      <value>
        <string>private</string>
      </value>
    </param>
  </params>

This could be represented, in a document-oriented service, as follows:

  <edit-post>
    <post-id>705B0BBB-8DF7-DB98-1FA1-B416860AA61B</post-id>
    <username>ebruchez</username>
    <password>private</password>
  </edit-post>

Which do you prefer? Unfortunately, the conversion task between one format and another cannot be easily automated, because the XML-RPC format’s parameters do not have names.
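Because the parameters are unnamed, any conversion has to rely on their position, and so has to be written once per method. As a sketch of what such a mapping could look like for the two documents above (this is not necessarily how the sample's conversion layer is written), here is a small XSLT stylesheet. It assumes the input root is the `params` element shown above, that the parameter order is post id, username, password, and that every value is wrapped in `string`.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Sketch only: positional mapping from XML-RPC params to a named document.
       Assumes the order post id, username, password. -->
  <xsl:template match="/params">
    <edit-post>
      <post-id><xsl:value-of select="param[1]/value/string"/></post-id>
      <username><xsl:value-of select="param[2]/value/string"/></username>
      <password><xsl:value-of select="param[3]/value/string"/></password>
    </edit-post>
  </xsl:template>
</xsl:stylesheet>
```

The positional `param[1]`, `param[2]`, `param[3]` lookups are exactly what makes a generic conversion impractical: a different method needs a different mapping.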

Note that I also wrote short Relax NG schemas to validate XML-RPC requests and responses. Those schemas are hooked up in the XML-RPC dispatcher written in XPL and make sure we do not process or generate garbage. Long live Relax NG and XPL!

The bottom line is that the logic is now in place that implements the APIs mentioned above by hooking them up to eXist.

Overall Architecture of the Blog Sample


First Page

The recent-posts page was almost trivial to implement. It consists of a page model that calls a data access pipeline that retrieves the posts for a given user’s blog. The page view just formats this data in HTML.
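As an illustration of how little such a page view needs, here is a sketch written as a simplified XSLT stylesheet. It assumes, hypothetically, that the page model hands the view a `posts` root element wrapping post documents in the format shown earlier. The wrapper name and the HTML layout are not taken from the actual sample.

```xml
<!-- Sketch only: formats a hypothetical <posts> document as HTML -->
<html xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <body>
    <xsl:for-each select="/posts/post">
      <!-- Title, creation date, then the post content -->
      <h2><xsl:value-of select="title"/></h2>
      <p><xsl:value-of select="date-created"/></p>
      <div><xsl:value-of select="description"/></div>
    </xsl:for-each>
  </body>
</html>
```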

What Next?

I think that the following tasks are required to make the application usable:

  • XML-RPC Authentication. Right now, no authentication is done at all for the XML-RPC calls. I can only reiterate my wish that simple HTTP authentication could be used!

  • Comments Page. A page showing an individual post with simple text comments.

  • Admin Page. This is needed to create a blog and related categories.

The source code is available from CVS, under src/examples/web/examples/blog.

3/2/2005

Benefits of Using XML Technologies for Web Applications

Filed under: General — Alessandro Vernet @ 6:23 pm

Reason 1: XML is the standard for exchanging data

Increasingly, the data that a web application needs is available in XML format. Business logic is increasingly kept separate from the presentation layer and made available to it as a reusable service, typically through web services. Consequently, web apps have to deal more and more with data in XML format.

Reason 2: XML is a very appropriate format to describe systems

The algorithmic parts of an application are in the business logic, and as noted earlier, business logic is increasingly kept out of the presentation layer. This means that the presentation layer, instead of being programmed, can be described using a set of specific, higher-level, standard languages. The different aspects of the presentation layer can be described in XML files. For instance, page templates can be written in XSLT that generates XHTML, web forms can be described with XForms, and the data entered by the user can be validated with W3C XML Schema or Relax NG. By combining XML technologies, the presentation layer can be written entirely in XML files that can be easily modified to adapt to end users' requirements.

Conclusion

For those two reasons, XML documents quickly become a centerpiece of web applications, both as a data exchange format and as a way to describe applications. It makes sense to use XML technologies built to deal with those XML documents, rather than going through the process of transforming XML into objects. Java provides a wonderful platform, but it creates tunnel vision among certain architects who have come to think of everything in terms of objects. To some people equipped with a hammer, everything looks like a nail, but everyone else will want to use XML tools to work on XML documents. Don’t you?
