A Lesson in the Beauty of Data - XML Parsing on the Front End: to CLJS or not to CLJS?

I recently needed to parse some XML in a front-end-only app. In a sense, browsers are just big XML (≅ HTML) processors, so, embracing the Clojure principle of being a hosted language, it seemed desirable to use the built-in power of my browser. This effort turned out to be a rabbit hole, though. Consider the following:

A Native Approach

(let [s "<title>Tech.ToryAnderson.com</title>" ;; 1
      p (js/DOMParser.) ;; 2
      doc (.parseFromString p s "text/xml") ;; 3
      xp (.evaluate doc "/title" doc nil js/XPathResult.ANY_TYPE nil)] ;; 4
  (.-resultType xp)) ;; 5

  1. A little piece of XML in its true form (string)
  2. Make thing #1 (not data; a DOMParser)
  3. Use thing #1 to ingest our XML string with its parseFromString method and produce a different thing #2 (an XMLDocument, because we said “text/xml”)
  4. Turns out XMLDocuments like thing #2 have an evaluate method that can read an XPath string. This returns a new thing #3 (an XPathResult)
  5. Thing #3 has a resultType property, a magic number that indicates what kind of thing it is (here 4, an unordered node iterator, which is what ANY_TYPE resolves to for a node-set query). It also has a numberValue property, but that only holds a number if it is a number-type thing; on a node-type thing like ours, reading it throws an error.

However, we have not yet figured out how to do something as simple as counting the number of results that came back from the search. This may involve looking into node snapshots (a new type of thing) and various other types of thing. (Tip: the simplest route is to wrap the XPath query in its SQL-like count() function and ask for a number-type result.)
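
As a minimal sketch of that tip, assuming a browser environment and a made-up two-item XML string (the helper name count-nodes is hypothetical, not part of any API), wrapping the path in XPath's count() and requesting NUMBER_TYPE gives a result whose numberValue can be read directly:

```clojure
;; Sketch only: needs a browser (js/DOMParser and js/XPathResult).
;; `count-nodes` is a hypothetical helper name used for illustration.
(defn count-nodes
  "Count the nodes matched by the XPath expression `path` in XML string `s`."
  [s path]
  (let [doc (.parseFromString (js/DOMParser.) s "text/xml")
        ;; count(...) makes the expression number-typed, so asking for
        ;; NUMBER_TYPE and then reading numberValue is legal.
        xp  (.evaluate doc (str "count(" path ")") doc nil
                       js/XPathResult.NUMBER_TYPE nil)]
    (.-numberValue xp)))

(count-nodes "<rss><item/><item/></rss>" "//item") ;; => 2
```

That is still three object types and one embedded query language for what amounts to a call to count.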

Worth it?

Is all this worth staying “in box” and employing our host? Our costs include needing to learn at least three APIs (one for each object type) as well as deciphering which types and methods work at each step of the journey. We also must know how to write XPath, which has a miniature syntax all its own. All of this works as a counterpoint that highlights the value of Clojure’s data-oriented philosophy: in Clojure all things boil down to six or seven language primitives, there is a single universal API structure, and there are neither classes (things) nor methods (distinct APIs).

Solution: Introducing Hickory

Hickory, a CLJ/CLJS library that stopped development years ago because it’s truly DONE, has a slogan that is precisely what we want: “HTML as data.” It is a solution that works in ClojureScript or Clojure, and allows us to bring to bear all the Clojure machinery.

The Code

(ns toryanderson.xml
  "XML functions for parsing the RSS feed info"
  (:require [clojure.string :as str]
            [hickory.core :as h]
            [hickory.select :as s]))

(def string->hickory (comp h/as-hickory h/parse))

(def get-items
  "Get item elements from root hickory parse"
  (partial s/select (s/tag :item)))

(def get-titles
  "Get title elements from root hickory parse"
  (partial s/select
           (s/child 
            (s/tag :item)
            (s/tag :title))))

(defn get-title-from-item
  "Given a hickory product of `get-items`, get the title string of that item"
  [i]
  (-> i :content second :content first))

(defn get-link-from-item
  "Given a hickory product of `get-items`, get the relative link portion of that item "
  [i]
  (-> (get-in i [:content 4])
      (str/split #"\\")
      first))

(defn get-description-from-item
  "Get the description that goes with an item"
  [i]
  (get-in i [:content 9 :content 0]))

Then I call it like so:

(defn rss-box
  "A box containing the RSS content from one of the blogs"
  [rss-key]
  (let [amount 5
        s (rss-key @RSS) ;; where I store the XML strings I grabbed from my feeds
        hick (xml/string->hickory s)
        items (->> hick xml/get-items)]
    (into [:div.rss-box]
          (for [i (take amount items)
                :let [t (xml/get-title-from-item i)
                      l (xml/get-link-from-item i)
                      d (xml/get-description-from-item i)
                      summary (if-not (str/blank? d)
                                d
                                t)]]
            [:a.rss-link {:href l :data-title summary} t]))))
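
The item accessors above reach into :content by position (second, index 4, index 9), which depends on the exact layout of the feed, whitespace nodes included. A possible alternative, sketched here under assumptions, is to look children up by tag. The helpers child-by-tag and text-of are hypothetical names, and sample-item is a hand-written map that is only assumed to resemble what Hickory produces for one item; real parse output may differ.

```clojure
;; Hypothetical helpers; not part of Hickory itself.
(defn child-by-tag
  "First child element of `node` whose :tag is `tag`, or nil if none."
  [node tag]
  (first (filter #(= tag (:tag %)) (:content node))))

(defn text-of
  "Concatenate the string children of `node`; returns an empty string for nil."
  [node]
  (apply str (filter string? (:content node))))

;; Hand-written stand-in for one parsed item (assumed shape).
(def sample-item
  {:type :element, :tag :item, :attrs nil,
   :content ["\n"
             {:type :element, :tag :title, :attrs nil, :content ["Hello"]}
             "\n"
             {:type :element, :tag :description, :attrs nil, :content ["A post"]}]})

(text-of (child-by-tag sample-item :title))       ;; => "Hello"
(text-of (child-by-tag sample-item :description)) ;; => "A post"
(text-of (child-by-tag sample-item :author))      ;; => "" (missing child)
```

Because a missing child yields nil and text-of turns nil into an empty string, the str/blank? check in rss-box would keep working unchanged.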

Cost Analysis

  1. Xpath vs Hickory.select

    Instead of learning the xpath way of encoding a query, I need to learn the Hickory way. There are a few advantages here, though:

    1. REPL interaction. I know of no better way to learn a new framework than to have a REPL to play with hands-on. The feedback loop with an in-Clojure solution is beautiful: evaluate each step in the REPL and retrace if your selector missed, all without leaving your code. With XPath I would still have REPL interaction, but I’d be hitting many more inscrutable JavaScript errors for failed code (no nil-punning to smooth things along), and I’d be getting back various types of JavaScript objects that resist easy inspection without looking up their documentation. I feel Hickory wins here (a counter-argument would be that the browser is your REPL in the native solution, but that’s a subject for another post).

    2. Uniform API. Modelled on org.clojure/data.xml, nodes follow a familiar “zipper” walking structure and, most importantly, have a uniform (and immutable) structure. I can use the entire Clojure toolkit of data-transformation and traversal functions at any step.

    3. Learning that matters. When I learn a new object and class API in any language (JavaScript in this case), it’s learning that has a very small scope. Knowing how an XPathResult works has a very limited benefit to my programming skills. Abstractions like composing selector functions, applying functional traversals, and using list comprehensions are concepts that not only benefit every project I think about in the future, but often shed light on my daily life as well. I can’t overstate the impact that Clojure has had on my life philosophies.

  2. The Native Solution

    The very real benefit I had originally shot for with the fully-JavaScript version was that I would need no libraries to do a job that the browser should be very well suited for. That part is true: it would have avoided the extra package size and the computational expense of transforming things into Clojure and Hickory structures. However, the cost to the developer is precisely the opposite. JavaScript is an inferior language to Clojure (at least for the reasons stated previously), and it shows in the poor exchange it offers between developer learning/effort and lasting skill/value.
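
To make the uniform-API point concrete: because a parse is ordinary nested maps and vectors, the counting question that was so awkward with XPathResult can be sketched with core functions alone. The sample-doc below is hand-written and only assumed to resemble a Hickory parse, and count-tag is a hypothetical helper name.

```clojure
(defn count-tag
  "Number of elements with tag `tag` anywhere under hickory-style `root`."
  [root tag]
  (->> (tree-seq map? :content root) ; walk every node, depth-first
       (filter #(= tag (:tag %)))    ; strings have no :tag, so they drop out
       count))

;; Hand-written stand-in for a parsed feed (assumed shape).
(def sample-doc
  {:type :document
   :content [{:type :element, :tag :rss, :attrs nil,
              :content [{:type :element, :tag :item, :attrs nil, :content ["a"]}
                        {:type :element, :tag :item, :attrs nil, :content ["b"]}]}]})

(count-tag sample-doc :item) ;; => 2
```

No new object type and no second query language are involved; tree-seq, filter, and count are the same functions used on any other Clojure data.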

Success!

At the end of the day I have successfully produced feed-lists, with links, of the most recent posts from my four primary blogs. This is done entirely on the front-end so it is compatible with my cheap shared hosting provider (which wouldn’t allow access to anything other than PHP on their server).

Tory Anderson
Web App Engineer, Digital Humanist, Researcher, Computer Psychologist