Wednesday 15 June 2016

Writing a parser for XML 3

Parsing and storing the contents of the XML document:

Okay, so up to this point we have a valid XML document. Now our job is to parse out the details and store them in appropriate classes. Since the primary purpose of these classes is storage, they shouldn't be made complex by adding various methods that take turns going through the document, each extracting only what it needs. What I prefer is one centralized function that does the job.

The first thing that needs to be done is the class definition of our whole document. Think about what an XML document can contain:

1) A header containing things like version, encoding, etc.
2) A root element.
3) Children of the root element.

Remember that these definitions are recursive in nature, so an element can contain as many elements as it needs, and the names of these elements can be anything. Furthermore, an element can have attributes. And finally, there can be some actual data associated with an element.
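To make this concrete, here is a rough sketch of what such storage classes might look like in C++. The names Element and Document and all of their members are my own assumptions for illustration, not a fixed design; the point is only that the classes hold data and nothing else.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical storage classes: names and members are illustrative assumptions.
struct Element {
    std::string name;                               // tag name, e.g. "book"
    std::map<std::string, std::string> attributes;  // attribute name -> value
    std::string data;                               // text content, if any
    std::vector<Element> child;                     // nested elements, in document order
};

struct Document {
    std::string version;   // from the header, e.g. "1.0"
    std::string encoding;  // from the header, e.g. "UTF-8"
    Element root;          // the single root element
};
```

Because Element holds a vector of Element, the recursive nature of the document falls out of the type itself: every child can carry its own attributes, data and children.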

Keeping all of this in mind, we have to write a function (or a collection of functions) that can siphon off all this information and store it in our objects.

Getting things like the version and encoding is easy. It's simply a matter of running through a string, and only the first line of the file at that.
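As a minimal sketch of that string scan, the helper below pulls one quoted value out of the declaration line. The function name headerValue is hypothetical, and it assumes there is no whitespace around the = sign, which a more careful version would have to allow for.

```cpp
#include <string>

// Hypothetical helper: returns the quoted value that follows `key=` in the
// declaration line, or an empty string if the key is absent or malformed.
// Assumes no whitespace around '='.
std::string headerValue(const std::string& line, const std::string& key) {
    const std::string needle = key + "=";
    std::size_t pos = line.find(needle);
    if (pos == std::string::npos) return "";
    pos += needle.size();                    // now at the opening quote
    if (pos >= line.size()) return "";
    const char quote = line[pos];            // either " or '
    if (quote != '"' && quote != '\'') return "";
    const std::size_t end = line.find(quote, pos + 1);
    if (end == std::string::npos) return "";
    return line.substr(pos + 1, end - pos - 1);
}
```

Calling headerValue(firstLine, "version") and headerValue(firstLine, "encoding") on the first line of the file would then give the two values to store in the document object.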

The bigger matter is forming a correct hierarchy of all the elements of the XML document. Each document must have a root element, so the first order of business is to find it. Once we have the root element, its name (and its attribute list) can be stored. Then it is time to see what is contained inside this element. If it is simple string data, we store it in the data member of the element. If it is another element, we repeat the same process for that element. This is nothing but a version of the good old DFS. Anyhow, the algorithm described above roughly looks like:

ExtractElements(Element element):
    If element contains only simple data:
        element.data = the text contained inside
        return to the calling place
    For each new element encountered inside:
        element.child.push_back(new element)
        ExtractElements(new element)
    return


The above functionality is handled inside the parse() function, which relies on various other functions. Each of them contributes to this extraction process.
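The recursion described above could be sketched in C++ roughly as follows. This is not the actual parse() code, only a simplified illustration under some loud assumptions: the input has already been validated, tags have the plain form <name>...</name> with no attributes, comments or self-closing tags, and an element holds either text or child elements but not both. The Element struct here is a trimmed-down stand-in and extractElements is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Trimmed-down stand-in for the storage class (attributes left out).
struct Element {
    std::string name;
    std::string data;
    std::vector<Element> child;
};

// Reads the element whose opening tag starts at text[pos] and advances pos
// past its closing tag. Assumes validated input with plain <name>...</name> tags.
void extractElements(const std::string& text, std::size_t& pos, Element& element) {
    // Opening tag: "<name>"
    const std::size_t nameEnd = text.find('>', pos);
    if (nameEnd == std::string::npos) { pos = text.size(); return; }
    element.name = text.substr(pos + 1, nameEnd - pos - 1);
    pos = nameEnd + 1;

    while (true) {
        const std::size_t lt = text.find('<', pos);
        if (lt == std::string::npos || lt + 1 >= text.size()) {
            pos = text.size();               // malformed input: stop quietly
            return;
        }
        if (text[lt + 1] == '/') {
            // Closing tag reached. With no children, the text before it is the data.
            if (element.child.empty()) element.data = text.substr(pos, lt - pos);
            const std::size_t close = text.find('>', lt);
            pos = (close == std::string::npos) ? text.size() : close + 1;
            return;
        }
        // Another element starts here: store it and descend into it (depth first).
        pos = lt;
        element.child.emplace_back();
        extractElements(text, pos, element.child.back());
    }
}
```

Starting with pos at the root's opening tag, one call fills in the whole tree: each child is fully extracted before its next sibling is looked at, which is exactly the DFS order.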
