Wednesday 15 June 2016

Writing a parser of XML 4

In the previous post I discussed how to parse an XML document. Since the title of all of these posts claims to write a parser for XML, one would assume that the parsing itself would be the main and most challenging task. I believed the same for the better part of the development of this program. But that was not the case.

I noticed a severe dependency in my parser: it depends on the XML document being presented to it in a certain format. It does not expect any newlines or spaces between elements, and it would implode if presented with a document formatted any other way. I tried to fix it but was too tired to change how my parser worked, so I decided to take a bypass. I told myself that it does not matter what formatting is applied to the XML document, as it is in my hands to call the parser with a string. Hence I decided to take the string from the XML document, strip it of any newlines, spaces and tabs, and then pass it along to the parser.

This also sounded simple enough, as all one needs to do is ignore three characters in the string read from the file: \n, \t, ' '. But then I realized there was another twist in the tale. It's fine for me to remove the spaces between elements, because ultimately they don't matter. But I cannot blindly do the same for the whole document. The data inside some elements should be passed along and stored as it is; we cannot afford to ignore the formatting there. Hence the whole idea of simply running through the string and dropping every space and newline as one encounters them went down the drain.
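To make the problem concrete, here is a small Python sketch. It is not taken from the parser itself; the function name strip_all_whitespace and the sample document are made up purely for illustration.

```python
def strip_all_whitespace(xml: str) -> str:
    # Naive approach: drop every newline, tab and space, wherever it appears.
    return "".join(ch for ch in xml if ch not in "\n\t ")


# Hypothetical sample document, indented the way a human would write it.
doc = "<book>\n\t<title>War and Peace</title>\n</book>"
stripped = strip_all_whitespace(doc)
print(stripped)
```

The whitespace between the elements is gone, which is what I wanted, but the spaces inside the title are gone too, so the stored data would no longer match the document.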

I needed a function capable of recognizing whether the thing in hand was another element (in which case all preceding spaces and newlines are to be ignored) or simple data (in which case preserving the text format is important). And of course this function needed to be recursive for it to work on the entire document. If you haven't realized, this looks an awful lot like our ExtractElement function.

So what I did was simply create another function, named it cleanText() and then copied the code from ExtractElement function and modified it to meet my needs.
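The idea can be sketched in a few lines of Python. To be clear, this is not the actual cleanText() function: it is a simplified, non-recursive version written only to show the rule, and it assumes that a stretch of text between two tags is either pure formatting (only spaces, tabs and newlines) or real data. The name clean_text and the sample document are assumptions made for this example.

```python
def clean_text(xml: str) -> str:
    out = []
    i = 0
    n = len(xml)
    while i < n:
        if xml[i] == "<":
            # Copy a tag verbatim, up to and including its closing '>'.
            end = xml.find(">", i)
            if end == -1:
                end = n - 1  # unterminated tag: copy the rest as it is
            out.append(xml[i:end + 1])
            i = end + 1
        else:
            # Collect everything up to the next tag (or the end of input).
            end = xml.find("<", i)
            if end == -1:
                end = n
            run = xml[i:end]
            # A run made only of spaces, tabs and newlines is formatting
            # between elements, so it is dropped. Anything else is element
            # data and is kept exactly as written.
            if run.strip(" \t\n") != "":
                out.append(run)
            i = end
    return "".join(out)


# Hypothetical sample document.
doc = "<book>\n\t<title> War and Peace </title>\n</book>"
print(clean_text(doc))
```

Here the newline and tab between the elements disappear, while the text inside the title element, including its leading and trailing spaces, is passed along untouched.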

Believe it or not, this took the most time. Before arriving at a working solution I tried many versions, and while working on every one of them I would eventually hit a brick wall.

Again, check out the code at the following address, and look for the cleanText() function to see what I have discussed here:


