Thursday, January 08, 2009

Haskell XML Processing Using HXT


There seem to be two ways to process XML in Haskell: HaXml and HXT. HXT seems to be the more capable of the two and understands XML namespaces. Since namespace support is key for my needs, I started with HXT. As often happens, I ran into some issues and spent more than a few hours trying to understand them.

Here is a simple HXT program that parses an XML document and writes it to the terminal. The XML document is embedded in the code. It works as expected.

--{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
--{-# OPTIONS -fglasgow-exts -cpp #-}

import System.Environment
import System.IO

   
import Text.XML.HXT.Arrow

main :: IO ()
main = do
  runX ( readString [(a_validate, v_0)
                    , (a_encoding, isoLatin1)]
         "<a><b/><b/></a>"
         >>>
         writeDocument [] "/dev/tty"
       )             
  return ()

Okay, that works beautifully. I will now add a filter to select only the XML elements from the document.

--{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
--{-# OPTIONS -fglasgow-exts -cpp #-}

import System.Environment
import System.IO

   
import Text.XML.HXT.Arrow

main :: IO ()
main = do
  runX ( readString [(a_validate, v_0)
                    , (a_encoding, isoLatin1)]
         "<a><b/><b/></a>"
         >>>
         deep isElem
         >>>
         writeDocument [] "/dev/tty"
       )             
  return ()

That produces the entire document. Say I am interested in extracting only the elements named b, so I will enhance my filter.


--{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
--{-# OPTIONS -fglasgow-exts -cpp #-}

import System.Environment
import System.IO

   
import Text.XML.HXT.Arrow

main :: IO ()
main = do
  runX ( readString [(a_validate, v_0)
                    , (a_encoding, isoLatin1)]
         "<a><b/><b/></a>"
         >>>
         deep (isElem >>> hasName "b")
         >>>
         writeDocument [] "/dev/tty"
       )             
  return () 
 
This is where my productivity came to a screeching halt: nothing useful came out. I started suspecting that my filter wasn't specified right, that HXT had a silly bug, and so on.
After significant frustration, I started reading the HXT tutorial. It turns out that readDocument (and likewise readString) returns a rose tree headed by a root node, and writeDocument expects to receive a tree of the same shape. Once I added
the hasName filter, the arrow no longer produced something acceptable to writeDocument. These filter functions silently ignore errors. Phew! The fix is
to wrap the result of the filter in a root element.

--{-# LANGUAGE Arrows, NoMonomorphismRestriction #-}
--{-# OPTIONS -fglasgow-exts -cpp #-}

import System.Environment
import System.IO

   
import Text.XML.HXT.Arrow

main :: IO ()
main = do
  runX ( readString [(a_validate, v_0)
                    , (a_encoding, isoLatin1)]
         "<a><b/><b/></a>"
         >>>
         root [] [deep (isElem >>> hasName "b")]
         >>>
         writeDocument [] "/dev/tty"
       )             
  return () 
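
One way to check that the filter itself was matching all along is to leave writeDocument out and look at what runX returns. The sketch below assumes the same HXT API as the listings above (import Text.XML.HXT.Arrow and the a_validate style options). The helper matchedNames is a hypothetical name of mine, not part of HXT. With newer HXT releases the import and the option syntax would need adjusting.

```haskell
import Text.XML.HXT.Arrow

-- Hypothetical helper (not part of HXT): run the same filter as above,
-- but collect the names of the matched elements instead of writing them.
matchedNames :: String -> IO [String]
matchedNames xml =
  runX ( readString [ (a_validate, v_0)
                    , (a_encoding, isoLatin1) ]
           xml
         >>>
         deep (isElem >>> hasName "b")
         >>>
         getName        -- turn each matched element into its name
       )

main :: IO ()
main = matchedNames "<a><b/><b/></a>" >>= print
```

If the list that comes back contains one entry per b element, the filter is fine and the problem lies in what is handed to writeDocument.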

The HXT tutorial shows a canonical way to write HXT code, and I should probably switch to that style for serious work.
Hope this serves as a reminder to RTFM next time around!
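
For reference, here is a rough sketch of what that style might look like for this example, modelled on my reading of the tutorial rather than copied from it. It keeps the arrow in a separate function, filters with processChildren so the root node survives, and turns the error status into an exit code. The name application and the use of "-" as the output target (which I assume means stdout) are my choices. The API is the same one used in the listings above.

```haskell
import System.Exit
import Text.XML.HXT.Arrow

-- Read the XML string, keep only the b elements under the root,
-- write the result to dst, and return HXT's error status.
application :: String -> String -> IOSArrow b Int
application src dst
  = readString [ (a_validate, v_0)
               , (a_encoding, isoLatin1) ]
      src
    >>>
    -- processChildren rewrites the children of the root node in place,
    -- so writeDocument still receives a tree headed by a root.
    processChildren (deep (isElem >>> hasName "b"))
    >>>
    writeDocument [] dst      -- "-" is assumed to select stdout
    >>>
    getErrStatus

main :: IO ()
main = do
  [rc] <- runX (application "<a><b/><b/></a>" "-")
  if rc >= c_err
    then exitWith (ExitFailure 1)
    else exitWith ExitSuccess
```

Compared with root [] [...], this keeps the original root node rather than building a new one. For this small document either approach should hand writeDocument a tree it accepts.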
