I spent this afternoon writing some classes to do XML processing similar to SAX (Simple API for XML). It’s not a full SAX implementation, but it does operate in a similar fashion.
WARNING: This is a very simple implementation; there are lots of things that aren’t handled.
The basic idea behind SAX is that the parser does not build up any data structures. Instead, it calls methods in a handler for the start and end of each element, and for the characters between tags.
The simplest use (in Suneido) would be:
	xr = new XmlReader
	xr.Parse(xmltext)
This won’t do much since it will use the default handler class – XmlContentHandler, which doesn’t do anything.
Here is a very simple XmlContentHandlerSample:
	XmlContentHandler
		{
		StartElement(qname, atts)
			{
			Print('START', qname, atts)
			}
		EndElement(qname)
			{
			Print('END', qname)
			}
		Characters(s)
			{
			Print(s)
			}
		}
(We don’t have to derive from XmlContentHandler if we’re going to implement all the methods, but it’s a good idea.)
We can now use this with:
	xr = new XmlReader
	xr.SetContentHandler(XmlContentHandlerSample)
	xr.Parse(xmltext)
(If your handler uses instance variables, you have to pass an instance rather than the class itself, e.g. new MyHandler.)
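As a rough sketch of a handler that needs an instance, here is one that counts elements. The name ElementCounter and its GetCount method are made up for illustration and are not part of sax.zip; the code assumes it is saved as a library record called ElementCounter.

```suneido
// ElementCounter (hypothetical): counts StartElement calls
XmlContentHandler
	{
	New()
		{
		.count = 0 // instance variable, so an instance is required
		}
	StartElement(qname, atts)
		{
		.count += 1
		}
	GetCount()
		{
		return .count
		}
	}
```

It would be used like this, keeping a reference to the instance so the count can be read afterwards:

```suneido
xr = new XmlReader
xr.SetContentHandler(counter = new ElementCounter)
xr.Parse('<body><tag color="red" size="12">chars</tag><solo /></body>')
counter.GetCount()
	=> 3
```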
If xmltext was:
<body><tag color="red" size="12">chars</tag><solo /></body>
we would get the following output:
	START body #()
	START tag #(size: "12", color: "red")
	chars
	END tag
	START solo #()
	END solo
	END body
For simplicity, the attributes are passed as a Suneido object, rather than an instance of an Attribute class.
Notice that we get START and END for the solo tag even though it did not have separate opening and closing tags.
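Because the attributes arrive as an ordinary Suneido object, a handler can look them up by name. A minimal sketch, using a made-up handler (not part of sax.zip) that reports any element with a color attribute:

```suneido
// Hypothetical handler: prints elements that carry a color attribute
XmlContentHandler
	{
	StartElement(qname, atts)
		{
		// atts is a plain object, so the usual object methods apply
		if atts.Member?('color')
			Print(qname, 'has color', atts.color)
		}
	}
```

With the sample xmltext above, only the tag element has a color attribute, so it should be the only one reported.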
Implementation
We’ll start with XmlReader – the main component.
We initialize the content handler to the default XmlContentHandler.
	class
		{
		New()
			{
			.contentHandler = XmlContentHandler
			}
We allow setting a different content handler.
		SetContentHandler(contentHandler)
			{
			.contentHandler = contentHandler
			}
And now the main part. Parse consists of one large loop. Each iteration processes characters (if present) and then a tag. Processed text is removed from the beginning of text and when the text is empty, we’re done.
		Parse(text)
			{
			forever
				{
				i = text.Find('<')
				if i > 0
					.contentHandler.Characters(text.Substr(0, i))
				text = text.Substr(i + 1)
				if text is ''
					break
				j = text.Find('>')
				tag = text.Substr(0, j)
				qname = tag.BeforeFirst(' ')
				atts = .attributes(tag.AfterFirst(' '))
				if qname.Prefix?('/')
					.contentHandler.EndElement(qname.Substr(1))
				else if tag.Suffix?('/')
					{
					qname = qname.Tr('/')
					.contentHandler.StartElement(qname, atts)
					.contentHandler.EndElement(qname)
					}
				else
					.contentHandler.StartElement(qname, atts)
				text = text.Substr(j + 1)
				}
			}
A private method is used to process the attributes for a tag. It uses the built-in Scanner class to simplify reading quoted strings.
		attributes(s)
			{
			atts = Object()
			for (scan = Scanner(s); scan isnt (name = scan.Next()); )
				{
				if name is '/'
					break
				if scan.Type() is SCAN.WHITESPACE or scan.Type() is SCAN.NEWLINE
					continue
				if scan.Type() isnt SCAN.IDENTIFIER
					throw "XmlReader: expecting identifier"
				if scan.Next() isnt '='
					throw "XmlReader: expecting '='"
				scan.Next()
				if scan.Type() isnt SCAN.STRING
					throw "XmlReader: expecting string"
				atts[name] = scan.Value()
				}
			return atts
			}
		}
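To make one of the limits of this loop concrete: Parse hands the text between tags to Characters untouched, so entity references such as &amp;amp; are not decoded. A small sketch using the XmlContentHandlerSample from earlier (the input string is just an illustration):

```suneido
xr = new XmlReader
xr.SetContentHandler(XmlContentHandlerSample)
// the entity reference is passed through as-is, not turned into '&'
xr.Parse('<a>x &amp; y</a>')
```

Going by the code above, this should print START a #(), then the line x &amp;amp; y exactly as written, then END a. A handler that needs the decoded text would have to translate entities itself in Characters.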
Errors are handled by throwing exceptions. A real SAX implementation would also have an error handler similar to the content handler.
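Since errors surface as exceptions, a caller that wants to survive bad input can wrap the call in try/catch. A minimal sketch, where the malformed input (an attribute with no value) is chosen only to trigger one of the throws in attributes:

```suneido
xr = new XmlReader
try
	xr.Parse('<tag color>chars</tag>')
catch (e)
	// e is the string that was thrown
	Print('parse failed:', e)
```

Here the scanner reaches the end of the attribute text without finding '=', so the exception caught should be the "XmlReader: expecting '='" string. Note that parsing stops at the first error; any handler calls made before that point have already happened.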
XmlContentHandler is simple:
	class
		{
		StartElement(qname, atts)
			{
			}
		EndElement(qname)
			{
			}
		Characters(string)
			{
			}
		}
It’s often useful to output XML in a similar fashion. Here is XmlWriter, a content handler that simply puts the XML back into a string:
	XmlContentHandler
		{
		New()
			{
			.text = ""
			.element = false
			}
		StartElement(qname, atts)
			{
			.flush()
			s = '<' $ qname
			for m in atts.Members()
				s $= ' ' $ m $ '=' $ '"' $ atts[m] $ '"'
			.element = s
			}
		Characters(string)
			{
			.flush()
			.text $= string
			}
		EndElement(qname)
			{
			if .element is false
				.text $= '</' $ qname $ '>'
			else
				{
				.text $= .element $ ' />'
				.element = false
				}
			}
		flush()
			{
			if .element is false
				return
			.text $= .element $ '>'
			.element = false
			}
		GetText()
			{
			.flush()
			return .text
			}
		}
The only complication here is detecting empty elements and outputting <tag /> instead of <tag></tag>. To handle this, StartElement saves the tag in .element and then EndElement checks for it. The GetText method allows retrieving the text. For example:
	xr = new XmlReader
	xr.SetContentHandler(xw = new XmlWriter)
	xr.Parse('<body><tag color="red" size="12">chars</tag><solo /></body>')
	xw.GetText()
		=> '<body><tag size="12" color="red">chars</tag><solo /></body>'
A good improvement would be to “pretty print” the XML – adding newlines and indenting to make it more human readable.
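One possible starting point for that, as a rough sketch only: a separate handler that tracks nesting depth and writes each tag and each run of characters on its own indented line. The name XmlPrettyWriter and the private line method are hypothetical and not part of sax.zip. To keep it short, this version does not collapse empty elements into <tag />, and it assumes string.Repeat is available for building the indent.

```suneido
// XmlPrettyWriter (hypothetical): indents output by nesting depth
XmlContentHandler
	{
	New()
		{
		.text = ""
		.depth = 0
		}
	StartElement(qname, atts)
		{
		s = '<' $ qname
		for m in atts.Members()
			s $= ' ' $ m $ '="' $ atts[m] $ '"'
		.line(s $ '>')
		.depth += 1 // children are indented one level deeper
		}
	Characters(string)
		{
		.line(string)
		}
	EndElement(qname)
		{
		.depth -= 1 // closing tag lines up with its opening tag
		.line('</' $ qname $ '>')
		}
	line(s)
		{
		// two spaces per level, one item per line
		.text $= '  '.Repeat(.depth) $ s $ "\n"
		}
	GetText()
		{
		return .text
		}
	}
```

Putting character data on its own line adds whitespace that was not in the original document, so this is only suitable where that whitespace is not significant.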
These classes will be in stdlib in the next release. Until then, you can download sax.zip and use LibraryView > Import Records to load it into a library. This also includes simple XmlReaderTest and XmlWriterTest. Let me know what you think. I haven’t done much testing, so if you find any problems, please report them in the User Group.
Next, I’d like to write a simple version of XML-RPC (XML remote procedure calls). (Using these classes to process the XML, of course!)
References
SAX2 – Processing XML Efficiently With Java, David Brownell, O’Reilly, 2002