From the Couch 5 – SAX Like XML Processing

I spent this afternoon writing some classes to do XML processing similar to SAX (Simple API for XML). It’s not a full SAX implementation, but it does operate in a similar fashion.

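WARNING: This is a very simple implementation, and many things aren’t handled. For example, the code below gives no special treatment to comments, CDATA sections, processing instructions, or entity references.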

The basic idea behind SAX is that the parser does not build up any data structures. Instead, it calls methods in a handler for the start and end of each element, and for the characters between tags.

The simplest use (in Suneido) would be:

xr = new XmlReader
xr.Parse(xmltext)

This won’t do much, since it uses the default handler class, XmlContentHandler, whose methods do nothing.

Here is a very simple XmlContentHandlerSample:

XmlContentHandler
    { 
    StartElement(qname, atts)
        { Print('START', qname, atts) }
    EndElement(qname)
        { Print('END', qname) }
    Characters(s)
        { Print(s) }
    }

(We don’t have to derive from XmlContentHandler if we’re going to implement all the methods, but it’s a good idea.)

We can now use this with:

xr = new XmlReader
xr.SetContentHandler(XmlContentHandlerSample)
xr.Parse(xmltext)

(If your handler uses instance variables, you’d have to pass an instance rather than the class itself, e.g. new MyHandler.)
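
As a minimal sketch of a handler that keeps state in instance variables, here is a hypothetical XmlElementCounter that counts how often each element name starts. The class name and its GetCounts method are assumptions made for illustration; they are not part of the classes described in this article.

```suneido
// XmlElementCounter (hypothetical name, not part of the original classes)
XmlContentHandler
    {
    New()
        {
        // maps element name => number of start tags seen
        .counts = Object()
        }
    StartElement(qname, atts)
        {
        if not .counts.Member?(qname)
            .counts[qname] = 0
        .counts[qname] = .counts[qname] + 1
        }
    GetCounts()
        {
        return .counts
        }
    }

// usage: pass an instance, because the handler has state
xr = new XmlReader
xr.SetContentHandler(counter = new XmlElementCounter)
xr.Parse('<body><tag color="red" size="12">chars</tag><solo /></body>')
counter.GetCounts()
```

For the sample text, StartElement is called once each for body, tag, and solo, so each of those names should end up with a count of 1. EndElement and Characters are inherited from XmlContentHandler and do nothing.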

If xmltext was:

<body><tag color="red" size="12">chars</tag><solo /></body>

we would get the following output:

START body #()
START tag #(size: "12", color: "red")
chars
END tag
START solo #()
END solo
END body

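For simplicity, the attributes are passed as a Suneido object, rather than an instance of an Attribute class. As the output shows, the attributes are not necessarily listed in the order they appeared in the tag.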

Notice that we get START and END for the solo tag even though it did not have separate opening and closing tags.

Implementation

We’ll start with XmlReader – the main component.

We initialize the content handler to the default XmlContentHandler.

class
    {
    New()
        {
        .contentHandler = XmlContentHandler
        }

We allow setting a different content handler.

    SetContentHandler(contentHandler)
        {
        .contentHandler = contentHandler
        }

And now the main part. Parse consists of one large loop. Each iteration processes characters (if present) and then a tag. Processed text is removed from the beginning of text and when the text is empty, we’re done.

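    Parse(text)
        {
        forever
            {
            // pass any text before the next '<' to the handler
            i = text.Find('<')
            if i > 0
                .contentHandler.Characters(text.Substr(0, i))
            // drop that text and the '<'; stop when nothing is left
            text = text.Substr(i + 1)
            if text is ''
                break
            // the tag is everything up to the next '>', so a '>' inside
            // an attribute value would end the tag early
            j = text.Find('>')
            tag = text.Substr(0, j)
            qname = tag.BeforeFirst(' ')
            atts = .attributes(tag.AfterFirst(' '))
            if qname.Prefix?('/')
                // end tag, e.g. </tag>
                .contentHandler.EndElement(qname.Substr(1))
            else if tag.Suffix?('/')
                {
                // empty element, e.g. <solo /> or <solo/>
                qname = qname.Tr('/')
                .contentHandler.StartElement(qname, atts)
                .contentHandler.EndElement(qname)
                }
            else
                // start tag, e.g. <tag color="red">
                .contentHandler.StartElement(qname, atts)
            // drop the tag and its '>'
            text = text.Substr(j + 1)
            }
        }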
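
To make the loop concrete, here is a small trace of Parse on a one-element document. It uses the XmlContentHandlerSample shown earlier; the comments follow the Parse code step by step and are a reading of that code, not captured output.

```suneido
xr = new XmlReader
xr.SetContentHandler(XmlContentHandlerSample)
xr.Parse('<a>hi</a>')
// iteration 1: text is '<a>hi</a>' and i is 0, so Characters is skipped;
//     tag is 'a', so StartElement('a', #()) is called;
//     text becomes 'hi</a>'
// iteration 2: i is 2, so Characters('hi') is called;
//     tag is '/a', so EndElement('a') is called;
//     text becomes ''
// iteration 3: nothing is left, so text is '' and the loop breaks
```

Each iteration therefore consumes at most one run of characters and exactly one tag, which is why the text shrinks until it is empty.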

A private method is used to process the attributes for a tag. It uses the built-in Scanner class to simplify reading quoted strings.

    attributes(s)
        {
        atts = Object()
        for (scan = Scanner(s); scan isnt (name = scan.Next()); )
            {
            if name is '/'
                break
            if scan.Type() is SCAN.WHITESPACE or scan.Type() is SCAN.NEWLINE
                continue
            if scan.Type() isnt SCAN.IDENTIFIER
                throw "XmlReader: expecting identifier"
            if scan.Next() isnt '='
                throw "XmlReader: expecting '='"
            scan.Next()
            if scan.Type() isnt SCAN.STRING
                throw "XmlReader: expecting string"
            atts[name] = scan.Value()
            }
        return atts
        }
    }

Errors are handled by throwing exceptions. A real SAX implementation would also have an error handler similar to the content handler.
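
Since errors surface as exceptions, a caller can wrap Parse in try/catch. The following is a minimal sketch; the input deliberately has an unquoted attribute value, which, reading the attributes method above, should reach the "XmlReader: expecting string" throw.

```suneido
xr = new XmlReader
try
    // color=red is not quoted, so attributes() should reject it
    xr.Parse('<tag color=red>chars</tag>')
catch (e)
    Print('parse failed:', e)
```

Note that the exception is thrown before StartElement is called for the offending tag, so a handler may already have received events for earlier parts of the document.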

XmlContentHandler is simple:

class
    {
    StartElement(qname, atts)
        {
        }
    EndElement(qname)
        {
        }
    Characters(string)
        {
        }
    }

It’s often useful to output XML in a similar fashion. Here is XmlWriter, a content handler that simply puts the XML back into a string:

XmlContentHandler
    {
    New()
        {
        .text = ""
        .element = false
        }
    StartElement(qname, atts)
        {
        .flush()
        s = '<' $ qname
        for m in atts.Members()
            s $= ' ' $ m $ '=' $ '"' $ atts[m] $ '"'
        .element = s
        }
    Characters(string)
        {
        .flush()
        .text $= string
        }
    EndElement(qname)
        {
        if .element is false
            .text $= '</' $ qname $ '>'
        else
            {
            .text $= .element $ ' />'
            .element = false
            }
        }
    flush()
        {
        if .element is false
            return
        .text $= .element $ '>'
        .element = false
        }
    GetText()
        {
        .flush()
        return .text
        }
    }

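The only complication here is detecting empty elements and outputting <tag /> instead of <tag></tag>. To handle this, StartElement saves the unfinished start tag in .element instead of writing it immediately. If EndElement is called while .element is still set, the element was empty and is written as <tag />. Any other event first calls flush, which completes the start tag with '>'. The GetText method returns the accumulated text. For example: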

xr = new XmlReader
xr.SetContentHandler(xw = new XmlWriter)
xr.Parse('<body><tag color="red" size="12">chars</tag><solo /></body>')
xw.GetText()

=> '<body><tag size="12" color="red">chars</tag><solo /></body>'
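
Because XmlWriter is just a content handler, it can also be driven directly, without an XmlReader, to generate XML from code. A minimal sketch, using only the methods listed above:

```suneido
xw = new XmlWriter
xw.StartElement('body', #())
xw.StartElement('tag', #(color: 'red'))
xw.Characters('chars')
xw.EndElement('tag')
xw.StartElement('solo', #())
xw.EndElement('solo')   // nothing was written since the start tag, so this is an empty element
xw.EndElement('body')
xw.GetText()
```

Following the XmlWriter code, this builds '<body><tag color="red">chars</tag><solo /></body>'. Attribute values are written as-is, so the caller is responsible for any quoting or escaping they need.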

A good improvement would be to “pretty print” the XML – adding newlines and indenting to make it more human readable.
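
One rough way to approach this is a second writer that tracks nesting depth. The sketch below is a hypothetical XmlPrettyWriter, not part of the download. It assumes the built-in string Repeat method and makes no attempt to preserve whitespace or to emit <tag /> for empty elements.

```suneido
// XmlPrettyWriter (hypothetical sketch, not part of sax.zip)
XmlContentHandler
    {
    New()
        {
        .text = ""
        .depth = 0
        }
    StartElement(qname, atts)
        {
        s = '<' $ qname
        for m in atts.Members()
            s $= ' ' $ m $ '=' $ '"' $ atts[m] $ '"'
        .line(s $ '>')
        .depth = .depth + 1
        }
    Characters(string)
        {
        .line(string)
        }
    EndElement(qname)
        {
        .depth = .depth - 1
        .line('</' $ qname $ '>')
        }
    line(s)
        {
        // two spaces of indent per nesting level (assumes string.Repeat)
        .text $= '  '.Repeat(.depth) $ s $ "\n"
        }
    GetText()
        {
        return .text
        }
    }
```

Every start tag, end tag, and run of characters goes on its own line, indented by its depth. That changes the whitespace inside elements, so it suits display better than round-tripping data.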

These classes will be in stdlib in the next release. Until then, you can download sax.zip and use LibraryView > Import Records to load it into a library. The download also includes simple tests, XmlReaderTest and XmlWriterTest. Let me know what you think. I haven’t done much testing, so if you find any problems, please report them in the User Group.

Next, I’d like to write a simple version of XML-RPC (XML remote procedure calls). (Using these classes to process the XML, of course!)

References

SAX2 – Processing XML Efficiently With Java, David Brownell, O’Reilly, 2002

(c) Suneido Software Corporation - Open Source Integrated Database and Programming Language