I've been exploring the Stack Overflow data dumps and thus far taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document-order by a particular user all ran into nasty thrashing.
import Control.Monad
import Text.HTML.TagSoup
userid = "83805"
main = do
  posts <- liftM parseTags (readFile "posts.xml")
  print $ head $ map (fromAttrib "Id") $
                 filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
                 posts
import Text.XML.HXT.Arrow
import Text.XML.HXT.XPath
userid = "83805"
main = do
  runX $ readDoc "posts.xml" >>> posts >>> arr head
  where
    readDoc = readDocument [ (a_tagsoup, v_1)
                           , (a_parse_xml, v_1)
                           , (a_remove_whitespace, v_1)
                           , (a_issue_warnings, v_0)
                           , (a_trace, v_1)
                           ]
posts :: ArrowXml a => a XmlTree String
posts = getXPathTrees byUserId >>>
        getAttrValue "Id"
  where byUserId = "/posts/row/@OwnerUserId='" ++ userid ++ "'"
import Control.Monad
import Control.Monad.Error
import Control.Monad.Trans.Maybe
import Data.Either
import Data.Maybe
import Text.XML.Light
userid = "83805"
main = do
  [posts,votes] <- forM ["posts", "votes"] $
    liftM parseXML . readFile . (++ ".xml")
  let ps = elemNamed "posts" posts
  putStrLn $ maybe "<not present>" show
           $ filterElement (byUser userid) ps
elemNamed :: String -> [Content] -> Element
elemNamed name = head . filter ((==name).qName.elName) . onlyElems
byUser :: String -> Element -> Bool
byUser id e = maybe False (==id) (findAttr creator e)
  where creator = QName "OwnerUserId" Nothing Nothing
Where did I go wrong? What is the proper way to process hefty XML documents with Haskell?
I notice you're doing String IO in all these cases. You absolutely must use either Data.Text or Data.Bytestring(.Lazy) if you hope to process large volumes of text efficiently, as String == [Char], which is an inappropriate representation for very large flat files.
That then implies you'll need to use a Haskell XML library that supports bytestrings. The couple-of-dozen xml libraries are here: http://hackage.haskell.org/packages/archive/pkg-list.html#cat:xml
I'm not sure which support bytestrings, but that's the condition you're looking for.
Below is an example that uses hexpat:
{-# LANGUAGE PatternGuards #-}
module Main where
import Text.XML.Expat.SAX
import qualified Data.ByteString.Lazy as B
userid = "83805"
main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing
ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (StartElement "row" as)
  | Just ouid <- lookup "OwnerUserId" as = ouid == uid
  | otherwise = False
ownedBy _ _ = False
The definition of ownedBy is a little clunky. Maybe a view pattern instead:
{-# LANGUAGE ViewPatterns #-}
module Main where
import Text.XML.Expat.SAX
import qualified Data.ByteString.Lazy as B
userid = "83805"
main :: IO ()
main = B.readFile "posts.xml" >>= print . earliest
  where earliest :: B.ByteString -> SAXEvent String String
        earliest = head . filter (ownedBy userid) . parse opts
        opts = ParserOptions Nothing Nothing
ownedBy :: String -> SAXEvent String String -> Bool
ownedBy uid (ownerUserId -> Just ouid) = uid == ouid
ownedBy _ _ = False
ownerUserId :: SAXEvent String String -> Maybe String
ownerUserId (StartElement "row" as) = lookup "OwnerUserId" as
ownerUserId _ = Nothing
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With