Wednesday 29 June 2016

The Day's Deeds

The day presented itself with a challenging invitation: an invitation to admit a mistake and confront its consequences. What feels like ages ago, I was assigned a task to have a presentation ready in one day on two very essential but potentially insipid topics. I was to seek help from any fellow trainees. I chose Amisha and Deepti, for I was new to my training and they seemed very helpful and creative (which they are). Fast forward to that day: the presentation happened with misplaced sparks and leaks of energy. Anyway, it was time to submit the presentation before sir. So we did. 
The first submission was made subject to comments and suggestions. And this started a sequence that remains alive till date. People comment, the presentation is changed accordingly, and people still comment. Not to say the presentation did not deserve its share of constructive criticism; it aspired to be excellent and it is not, still. While all of this was happening, I managed to sneak myself out of the conversation and hence away from responsibility. And this careless ignorance was finally recognized by sir, today. I want no further delay on this one as well. The presentation needs closure. Hopefully that will happen the following day.
Talking about SIM, Ranvir is doing some benevolent work with the user interface. His work is heartily appreciated. And more such helping hands are readily welcomed. I decided to add some way in which the user will be able to submit queries about the database of SIM. After all, this was the intended purpose: analysis based on queries. There were two ways in which I could have approached this:
  1. Provide a simple field for the user to enter any query they want on the set of tables.
  2. Give only some pre-defined queries for the user to use.
So which of the two approaches did I decide to follow? Both. Neither of these is entirely in alignment with what SIM is supposed to be, and neither is entirely wrong. But more importantly, neither of these is complete yet. 
I was able to learn how to make data travel between pages, views and models using the POST and GET methods in HTTP and Django's functions to handle all of this. I also learnt how to submit raw SQL queries to Django models. There is one problem though: I was not able to query tables where I wanted to select only a subset of all the fields in the table. Either the query returned all the columns (when I selected all the columns or had the primary key as part of the needed subset of columns), or it showed an error when I chose only a few columns and did not include the primary key among them. The error basically stated that I was not allowed to make queries without having primary keys in them. This is again a little weird. 
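The restriction looks roughly like this. A sketch only, not SIM's actual code: the `Joint` model and its fields are invented for illustration, and this runs only inside a configured Django project.

```python
# Hypothetical model, standing in for one of SIM's tables.
from django.db import models

class Joint(models.Model):
    id = models.IntegerField(primary_key=True)
    x = models.FloatField()
    y = models.FloatField()

# In a view, raw SQL can be run against the model:
# Joint.objects.raw("SELECT * FROM myapp_joint")       # works: all columns
# Joint.objects.raw("SELECT id, x FROM myapp_joint")   # works: pk included
# Joint.objects.raw("SELECT x, y FROM myapp_joint")    # complains when iterated,
#   because Django insists the primary key appear in every raw query --
#   it needs the pk to identify each model instance it builds.
```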
A Django Guru promised a visit tomorrow, maybe I'll ask him about it.

Monday 27 June 2016

Integrating Components of SIM

It feels like there are two parts of SIM. Posts on my blog confirm that too. 
One part is responsible for displaying all the completed work to the user. I am not talking about UI here. UI needs major improvement but is not the priority right now. This part only displays the tables created by SIM in a very modest manner. Not to overstate things but the Django framework is used for this purpose. 
The other component is the one which is also the oldest, and the one that is still farthest from being complete. The parsing part requires all the business logic, most interactions with the database and the bulk of the time investment. Unlike the previous part, this one started out as small C++ functions coming together and eventually getting too complex to handle with efficacy. Hence it was decided to take this part and try to make it professional. By professional I mean avoiding code written from scratch and using some tool instead. Flex and Bison were selected for this purpose.
But today was the day to bring these together. Although neither part is completely complete, this marriage was long due and much anticipated. I knew that if done well, this combination would push the whole project to a meaningful form like nothing SIM has seen so far. And so it did. But first let me quickly mention how it all happened:

1) Write a homepage view in Django that gives two options, one to see the tables and other to get a file from user and parse it into the database. 
2) Write subsequent views and map urls to make sure the site's structure is coherent and intuitive.
3) Devise a way in which the user can upload a file for parsing. References were taken from another project that used Django and required the user to upload files: Civil Octave.
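The upload step above can be sketched roughly as follows. All names here (the template `upload.html`, the field `input_file`, the `parse_into_database` helper) are placeholders, not SIM's or Civil Octave's actual code.

```python
# Hypothetical Django view for receiving a file to parse.
from django.http import HttpResponseRedirect
from django.shortcuts import render

def upload_file(request):
    if request.method == 'POST' and request.FILES.get('input_file'):
        uploaded = request.FILES['input_file']   # an UploadedFile object
        text = uploaded.read()                   # hand this to the parser
        # parse_into_database(text)              # placeholder for the Flex/Bison side
        return HttpResponseRedirect('/tables/')
    # GET: show a form containing <input type="file" name="input_file">
    return render(request, 'upload.html')
```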

Next up is improvements in the UI and loads of parsing.

Saturday 25 June 2016

Presenting PageRank Pedagogically

First things first, I am not entirely sure what pedagogically means. Forgive me if you know and I am wrong. Look it up if you don't know and be impressed by the title.
Facing the possibility of an impromptu presentation, late in the night yesterday, I declared that I would give a presentation on PageRank. It was a great opportunity to share the awesomeness of the topic with others, and it most certainly was an easy way out, as I already had a good enough grasp on the topic from my previous endeavors with big data and MapReduce.
Here is how the presentation went (at least in my head).
[Shifting to the stage]
Let me start by asking a simple question. What is a search engine? Try and describe it in the most abstract way possible.
It's a thing that takes a query as input and returns a response. Where query is a collection of keywords and response is a list of web pages. 
The search engine looks for the keywords in the entire set of pages and returns those pages that match the keywords in the query. There are two ways of doing this search and match process:
The stupid way and the wise way. Wait, did you expect a more technical classification? Don't worry, I'll deliver that too:
Stupid way: Brute force. This means that for every keyword in the query, the search engine will go to each page and struggle through the entire page to see if the keyword is found on that page. This sounds really uncomplicated. And it works fine if the number of pages is limited to the first few multiples of a hundred. But apply the same approach to the World Wide Web and you'll suffer at the hands of the WWW's unfathomable size and the limitations of the hardware at your disposal. 
Wise way: Hashing. Hashing may sound super geeky and convoluted (which I'll say it is for the purpose of impressing the reader) but it's really not. Remember when we are in a classroom and the teacher takes attendance? What does the teacher use to reference each student? A roll number. What is this thing called? Hashing. 
Hashing is the process of assigning a key (possibly unique) to something real. Like a roll number (key) to a student (something real). 
In order to somehow use this thing for a search engine, the first thing we need to do is divide every page and hence the whole web of pages into a list of keywords (a very long list). This constitutes the list of keys. And to every key, associate another list. This time the list of pages (or links to pages) where that keyword is found at least once. 
This approach bests the brute force method already. Now the search engine does not have to traverse every page in order to find a match for a query. That has already been done for it. All the search engine needs to do is break down the query into keywords, look up the pages where each keyword is found (using the hash function), and return the intersection (or union) of these pages to the user. 
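The keyword-to-pages idea above is a toy inverted index, and fits in a few lines. The pages and queries here are made-up strings; a real engine would add stemming, ranking and much more.

```python
# A toy inverted index: keyword -> set of pages containing it.
pages = {
    "page1": "the quick brown fox",
    "page2": "the lazy dog",
    "page3": "the quick dog jumps",
}

# Build the index once, ahead of any query.
index = {}
for name, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(name)

def search(query):
    """Return the intersection of the page sets for every keyword."""
    sets = [index.get(word, set()) for word in query.split()]
    return set.intersection(*sets) if sets else set()

print(sorted(search("quick dog")))   # -> ['page3']
```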
Suppose for a query, there are a hundred pages that match. Now how does the search engine decide which of these hundred pages is the most relevant? Is there a way to rank all the pages on the web? Yes there is. And it's called PageRank (or finding PageRank).
PageRank is an algorithm first introduced by Larry Page of Google. So what in God's name is the PageRank of a page?
PageRank is a measure of how popular a page is. Now this is an important point. The purpose of PageRank was to rank pages in order to decide which one is more relevant. And the criterion used to determine this relevance is the page's popularity. The algorithm rests on the assumption that popularity implies relevance. Ask yourself a question: how could an utterly irrelevant or mediocre page be popular? It can't. Hence the assumption is quite appealing, at least to me.
Now, let me introduce the notion of a random surfer. A random surfer is a guy sitting in front of his computer, browsing the Web. Suppose the random surfer is presently on a webpage A. From here, the random surfer can take one of the two following actions:
  1. Write a url in the address bar of the browser and jump to any page on the web, randomly.
  2. Follow any of the links on page A and land on other pages depending upon which link has been clicked. 
Let's denote the probability that the random surfer will go with the first option by d. So automatically the probability for the second option will be (1-d).
Let me work out a mathematical formula for PageRank. But before we go into that, I am going to introduce some symbols:
PR(A): PageRank of the page A.
N: Total number of pages in the system.
n(A): Number of links going out from page A to other pages on the web.
P(e): Probability of happening of event e.
Let's get started then:
PR(A) = How popular page A is.
= How likely it is that the random surfer will land on this page.
= P(Random surfer lands on page A).
= P(Random surfer jumps randomly from somewhere to page A OR Random surfer follows any of the incoming links from other pages to A).
= P(Random surfer jumps randomly from somewhere to page A) + P(Random surfer follows any of the incoming links from other pages to A).     (1)
(The two options are mutually exclusive, so the probability of their OR is simply the sum of their probabilities.)
Now, P(Random surfer jumps randomly from somewhere to page A) can be calculated by dividing the probability of a jump (d) by the total number of candidate pages after the jump (N). So this is equal to d/N.
P(Random surfer follows any of the incoming links from other pages to A) can be found using the following concept:
Assume that from the entire set of pages, [P1, P2, P3, ....., Pn] is the list of pages with links to A. Now the probability of reaching A after following links is the sum of the probability of following any one of these n links. 
So this is equal to: P(being on P1 and clicking on the link to A) + P(being on P2 and clicking on the link to A) + ..... + P(being on Pn and clicking on the link to A).
Converting this into a nice simple summation equation we get:
P(Random surfer follows any of the incoming links from other pages to A) = Summation( P(being on Pi and clicking on the link to A) ), where i goes from 1 to n.
The probability of being on page Pi is nothing but PR(Pi). And if Pi has 10 links going out from it, then only the one that comes to A will matter. So we'll divide PR(Pi) by n(Pi), which is 10 in this case. 
Finally, this probability of the random surfer choosing to follow links is to be multiplied by (1-d), which is the probability of choosing this whole option in the first place.
Substitute all of this in equation (1):
PR(A) = d/N + (1-d) * Summation(  PR(Pi) / n(Pi)  ) where i goes from 1 to n.
This formula is applied to every page in the system. An iterative method is used. In the beginning, every page is assigned an equal PR of 1/N. Then this formula is applied to each page until the PR values of all the pages stop changing much. The final stable PR is the actual PR of every page.
Since this is nothing but a probability distribution, the sum of the PRs of all pages after each iteration will be equal to one.
To check out a simulation of this algorithm have a peek at the following innocuous, innocent code:
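Here is a minimal sketch of such a simulation on a made-up four-page web (the link graph and the number of iterations are invented for illustration; d here is the jump probability, as defined above).

```python
# links[p] is the list of pages that p links out to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
d = 0.15          # probability of a random jump
N = len(links)

# Start every page at an equal PR of 1/N, then iterate
# PR(A) = d/N + (1 - d) * sum( PR(Pi) / n(Pi) ) over pages Pi linking to A.
pr = {p: 1.0 / N for p in links}
for _ in range(100):
    new_pr = {}
    for page in links:
        incoming = [p for p in links if page in links[p]]
        new_pr[page] = d / N + (1 - d) * sum(pr[p] / len(links[p]) for p in incoming)
    pr = new_pr

# The PRs still sum to one, and C (three incoming links) ranks highest.
print({p: round(v, 3) for p, v in sorted(pr.items())})
```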

Friday 24 June 2016

Adding templates and views to Sim

Up till this point, I had completed work on the admin side of things (and by that I mean getting tables to the admin site). Today it was the turn to give some presentable interface to the whole thing. The way I decided to go about interfacing the display of all these tables was by using an index page. The index page was intended to contain links to all the different tables that the user was allowed to access. Clicking on each of these links would open the table in front of the user. That was it. This was my goal the entire time. 
First of all I designed an index page. It was a very basic HTML page with a bunch of links to other HTML pages. The URL of each of these pages was mapped to one template: the table template. This template took as a parameter the name of the table that the user wanted to see. And then the contents of that table were rendered on the page. I must add this included a lot of typing. An annoying lot. 
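The "one template, many tables" wiring can be sketched like this. The pattern names, the `show_table` view and the `fetch_rows` helper are all hypothetical, not SIM's actual code.

```python
# Hypothetical urls.py fragment (Django 1.x style, as used in 2016).
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^$', views.index),                                # index page with links
    url(r'^table/(?P<table_name>\w+)/$', views.show_table), # one url per table
]

# views.py: a single view renders the single table template for
# whichever table name arrives in the url.
# def show_table(request, table_name):
#     rows = fetch_rows(table_name)   # placeholder for the actual lookup
#     return render(request, 'table.html', {'name': table_name, 'rows': rows})
```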
Abstraction makes things sound easy, whether it be a problem or a solution. But the pain lies in the details. Everything needed to make this possible required a "how to" on Google. By this point, I don't even remember all that I searched for today. That is perhaps a bad thing. I should be able to recall what I have looked for and hence found on the internet. Otherwise what is the point of doing all of this if I can't repeat it without the internet? But wait, what is the browser history for? It is indeed a good idea to link my browser history for the day to my daily diary. After all, searching the web constitutes a major part of activities during training. 
Anyway, during the later part of the day I asked Tamandeep (a fellow trainee) for help adding some beauty to my HTML. He agreed. Hopefully he'll do a good job. Presently there are only two people working on SIM (me and an occupied Amarjeet); if all goes well then, who knows, maybe there is a third contributor in the making. All hinges upon the sincerity of his efforts. 
Training for today ended with the sequel to a presentation on Octave. It was continued by Amarjeet as an attempt to introduce Octave to students of Civil Engineering. Today's presentation was mostly about interpolation. Amar introduced the topic nicely to everyone. And since the topic of the presentation was one all the Civil students could relate to, the seminar soon became very interactive. I mean people were literally jumping off their seats to go on to the stage and explain their points. All in all, a good exciting session.

Thursday 23 June 2016

Sim on Server

If I had to describe Sim in a very abstract way I would probably classify its components as the following:
  1. Parsing modules that take a file and parse it. The parsed results are stored in a database.
  2. A Database containing results from all the parsing plus some other standard tables.
  3. A Django site that takes the tables in the database and displays them.
The goal was to deploy this onto a server. Since the whole interface will eventually be in Django, the first thing that needed to be done was to make the server capable of dealing with Django. Internet searches led me to believe that mod-wsgi was the thing for it.
Now I had to install mod-wsgi on the server. But I couldn't because I didn't have the permissions for it. Anyway, I tried to install it only for me and not for every user on the server. The hope was that I wouldn't need root access for that. I didn't know if that was even possible. But I went ahead with it anyway.
Eventually I came to the realization that there wasn't any straightforward or permanent way of doing this (as far as I could tell). I had to install it exclusively or find some alternative. But any alternative would also have required some installations, so it would have been stupid to run from it. I simply asked the admin for help.
As it turned out, I didn't need mod-wsgi. I needed to create a virtual environment, and from there I had the power to install everything I needed. So I did. Now it was time to bring the database to the server.
After learning how to transfer the database from one machine to the other, I was only a few setting changes away from achieving my goal. And……… I hit another brick wall.
Access, and hence my ability to create a database (by the name of Sim), was denied. Only after asking the ebullient admin, Amrit (ebullient because he had just repaired his laptop), did I come to know that I need to prefix the database name with my user name to be able to create one. So many rules. There should be a guide document for new users of the experimental server to handle such frequently asked questions.

Wednesday 22 June 2016

Adding More tables to Sim application in Django

Yesterday I decided to have tables in separate databases and then have separate apps, one for each of these databases. But HS Rai sir pointed out that this was probably not the right way to go. And that made sense, because such rigorous diversification in the initial stages of the project doesn't promise a coherent future. Things could get messier and messier if I were to keep this approach up.
Now the aim was to have one database (by the name of Sim, of course) and have this database contain all the tables. Then came the issue of tables having the same names. To counter this I had two ideas:
  1. Change the names of the tables by appending their categories to their names. This way there will not be any redundancy in table names.
  2. Combine all tables with the same name into one and then add another field (say category) that can differentiate rows from different input tables.
The second approach sounds a bit cleaner off the bat. But it would require the tables having the same name to have the exact same number and type of fields as well. This was not the case. So I had to drop this approach. It may be something I have to come back to in the future, but for now I had to go with the first approach, i.e. changing the names of the tables.
Once the table names were changed, things seemed easy. All I had to do was import the changed database into MySQL, then import the MySQL schema into my models.py file, and then, a couple of Django commands later, I should have been good to go. But no. There was something that I forgot. Something that I learnt (complained) about the hard way last time around. Yes, the necessity of having a primary key in every table.
First of all I decided to see if there already were any fields in all or some of these tables that could rightfully be set as the primary key. But I wasn't that lucky. So I decided to go all commando on the database and simply add a surrogate primary key to each table. I named this field id on purpose (knowing Django would hate me if I named the primary key anything else). This field was an integer with the auto-increment feature, for all the tables.
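Per table, the fix boils down to one MySQL statement. A small sketch (the table names below are invented; the real ones came from the mdb files):

```python
# Generate the ALTER statement that adds a surrogate auto-increment id
# to a table, which is exactly the shape of primary key Django expects.
tables = ["joint", "member", "load_case"]

def add_id_sql(table):
    return ("ALTER TABLE {} "
            "ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;".format(table))

for t in tables:
    print(add_id_sql(t))
```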
After doing this, I had to face no other problem. All the tables are there (with data) on the Django site.
Next order of business is to get this thing running on the experimental server.

Tuesday 21 June 2016

Django: Modeling into multiple databases

Models in Django are tables in a database. Each app in Django is allowed to have as many models as it needs in the models.py file. These models are then migrated to a database in your database product of choice. I had done only this so far.

But today I needed multiple databases to store multiple tables. I followed these steps in an attempt to complete the task:

1) Convert mdb files into MySQL tables. This was done using the mdbtools kit in Ubuntu.
2) Create new apps in the Django site.
3) Export the tables into the models.py file of the corresponding app.
4) Configure stuff in other files.
5) Run the server to see these work out.

Upon doing all of this, I was able to see the tables just fine, but there was no sign of any data in them. This was because I had simply imported the database schema from SQL into my models. I never created a link for Django to also pick up the data from the tables. So I decided to follow the same method I used for SIM:

1) Clean up MySQL, removing all the databases needed by Django.
2) Recreate these databases from the exported models.py files in Django.
3) Finally, populate the tables with data somehow.

The approach was simple, but since I had to deploy these models from more than one application into more than one database, I ran into some problems:

1) Every migration went into the default database, so I had to learn to customize where the migrations should go.
2) Was not able to properly implement the above point.

So at the end of the day, I am stuck struggling to migrate my Django models into their different databases. 
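From what I have read so far, the missing piece is a database router. A rough sketch, assuming made-up app labels and database aliases (`bridges`, `roads`, etc.), not my actual project files:

```python
# A router that sends each app's reads, writes and migrations to its own
# database. The aliases must also exist in settings.DATABASES.
class PerAppRouter(object):
    routes = {'bridges': 'bridges_db', 'roads': 'roads_db'}

    def db_for_read(self, model, **hints):
        return self.routes.get(model._meta.app_label)

    def db_for_write(self, model, **hints):
        return self.routes.get(model._meta.app_label)

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Unrouted apps fall through to the default database.
        return self.routes.get(app_label, 'default') == db

# settings.py then needs DATABASES with the extra aliases and
# DATABASE_ROUTERS = ['myproject.routers.PerAppRouter'], and each app is
# migrated with: python manage.py migrate <app> --database=<alias>
```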

Once this is done, then I'll have to figure out how to populate a mysql table directly from an mdb file. 

Monday 20 June 2016

Spaces and Tabs in Flex

As I have mentioned in previous blog posts, Flex is a scanner generator: it builds scanners that look for patterns defined/desired by the programmer. The whole thing begins with something known as regular expressions.

Regular expressions provide a way of describing textual patterns to the computer or a programming language. They form the basis of many advanced concepts in computer science, with pattern matching itself being one of them. Learning regular expressions is absolutely crucial if one wishes to learn Flex or any related scanner. 

In Flex, we have to specify a regular expression. This tells Flex what it should be looking for in the input text. And with each such regular expression we also have to specify some action code. This action code is what will run if Flex finds a match for the provided regular expression. One can use as many regular expressions as one needs. 

One of the most commonly used characters in any text file is the space. Every word of every sentence, and then every sentence itself, is delimited using a space. So it becomes very important to deal with spaces in the input file. In most cases, spaces are to be ignored, i.e. the action code corresponding to the regular expression describing a space shall do nothing. But sometimes, when spaces become important to the grammar of our input, we need to identify spaces, acknowledge them, and write apt action code to deal with them. 

I needed to do the exact same thing while writing up a Flex file for a STAAD Pro input file under the project SIM. I assumed spaces would work the same way as other characters do, and hence I directly wrote a simple expression for identifying one or more consecutive occurrences of a space as:

[ ]+ 

There is a space inside those square brackets, which is the character we are looking for. And the + sign indicates that Flex should expect any of the characters inside the square brackets (in this case the space) at least once. 

So far so good. But this did not work out well for me. This regular expression was leading me into trouble. Syntax error after syntax error. And it was only then that I realized that there was something wrong with my regular expression. Actually, it was pointed out to me by a friend, Amarjeet Singh. The thing is that the whitespace in my input was not limited to the space character. The right way of doing this turned out to be:

[ \t]+

That is a character class that matches either a space or a tab (\t is the escape sequence for the tab character); the characters inside the brackets are alternatives, not a sequence. So my input file evidently had tabs in it as well. Now that is something to take a note of.  
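The same character class behaves identically in Python's re module, which makes it easy to see what [ \t]+ actually matches (the sample line is invented, loosely shaped like STAAD input):

```python
import re

# [ \t]+ matches a run of one or more characters, each of which is
# either a space or a tab -- alternatives, not a sequence.
line = "MEMBER \t  INCIDENCES"
print(re.split(r'[ \t]+', line))     # -> ['MEMBER', 'INCIDENCES']

# [ ]+ is the same idea restricted to spaces only, so it would
# stop short at the first tab.
print(re.findall(r'[ ]+', "a  b c"))  # -> ['  ', ' ']
```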

Saturday 18 June 2016

Constraints on models in Django

Okay, so I cannot take the risk of being discovered as wrong, and hence foolish, by someone else or even myself at some later point in time. Hence the following is just what I know till now. I make no claims about the truth of the following statements.
In Django, while connecting to a database, we write models in the models.py file. 
A model is nothing but a python class that defines a table structure. This definition is then used to convert the models.py file into an actual database populated with as many tables as there are models in the file. 
I am intentionally skipping over the details of how to write models and then apply them to a database. I might come back and edit this post, but for now I assume the reader knows this already or has an alternative source to learn it from.
What I would like to say is that I found certain constraints on the way we are allowed to define these models. And the models in question aren't so complicated either. I mean, databases are a very important part of any application. And one would want their database schema to be absolutely accurate and efficient. Flexibility in designing a database for an application is of paramount importance. You don't want any compromises, do you?
But when designing tables through Django's models, there are some limitations on what we are allowed to do with our tables. Some of these restraints are as follows:
1) No composite keys:
Okay, so one field isn't always enough to uniquely identify all other fields in a table. Someone should go tell Django that. Apparently you are not allowed to have a primary key consisting of more than one column in the table.
2)  No table without a primary key:
Okay, so this isn't perhaps a bad thing. All tables should have a primary key. While most of them do, some of them don't quite need one. But Django does not allow this to be our choice. It will step in if we forget to specify the primary key constraint on any field: in that case Django will automatically add another field by the name of id and make it the primary key, with the auto-increment feature.
3) Key attributes must have default values:
This isn't so bad either, but still, the choice should be ours. Yet Django will throw an error if you forget to mention the default value of a key attribute, whether it be the primary key or a foreign key.
4) The names of key attributes end a certain way:
Assume that you have a key field by the name of "abc". When you migrate the changes to the database, you will see that Django has automatically appended "_id" to that field's name. Now this is annoying. I mean, what is with Django and the word "id"?
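The constraints above can be seen together in one models.py fragment. This is illustrative only (the Record/Category models are invented) and runs only inside a Django project:

```python
from django.db import models

class Category(models.Model):
    name = models.CharField(max_length=50)
    # No primary_key anywhere here, so Django silently adds
    # id = models.AutoField(primary_key=True) for us (constraint 2).

class Record(models.Model):
    # Constraint 1: only ONE field may carry primary_key=True;
    # a composite key over several columns is not allowed.
    id = models.AutoField(primary_key=True)

    # Constraint 3 (as I experienced it): a key attribute without a
    # default can draw complaints during migration.
    category = models.ForeignKey('Category', default=1)
    # Constraint 4: the database column created for the field above is
    # named category_id, whether you like it or not.
```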

Thursday 16 June 2016

Writing a parser for XML 5

Previously I discussed how I needed to, and did in fact, manipulate the input string from the XML document to make it acceptable to my program. So after doing that, there comes the question: can my program display the parsed and stored XML document in the same nicely formatted way out of the box? The answer is NO. A strict no.

Whenever one opens up an XML file in the browser, they see a nicely formatted hierarchy of the elements. And then there is always the option of using CSS with the XML file to make it even more appealing to the eye. I did not do any of this. All I wanted was to be able to present that nice hierarchy. That is it.

Printing the stored content was not difficult at all. All it took was a simple recursive function that indented a step further every time it was called with the next generation of elements. Of course, this required the function to know what the level of indentation was at the previous step of the hierarchy. And hence I had to pass my function a string as well, one that only contained the number of spaces/tabs there need to be for that particular call. This way I am able to offset the next generation by one extra level of indentation than the present generation, simply by calling the function with some added spaces. 
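The indentation idea can be sketched like this. My actual parser is C++; this Element class is a made-up Python stand-in for its storage classes.

```python
class Element:
    def __init__(self, name, data="", children=None):
        self.name = name
        self.data = data
        self.children = children or []

def show(element, indent=""):
    """Return the element's lines, recursing with one extra indent level."""
    lines = [indent + "<" + element.name + ">"]
    if element.data:
        lines.append(indent + "  " + element.data)
    for child in element.children:
        lines.extend(show(child, indent + "  "))   # the added spaces
    lines.append(indent + "</" + element.name + ">")
    return lines

doc = Element("note", children=[Element("to", "Tove"), Element("from", "Jani")])
print("\n".join(show(doc)))
```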

Wednesday 15 June 2016

Writing a parser of XML 4

In the previous post I discussed how to parse an XML document. Since the title of all of these posts claims to write a parser for XML, one would assume that doing the parsing bit would be the main and most challenging task. Even I held the same belief for the better part of the development of this program. But that was not the case. 

I noticed a severe dependency in my parser. It depends on a certain format in which the XML document is presented to it. It does not expect any new lines or spaces between various elements. And by that I mean that it would implode if presented with a document with any other formatting. I tried to fix it but was too tired to change how my parser worked. So I decided to take a bypass. I told myself, it does not matter what formatting is applied to the XML document, as it is in my hands to call the parser with a string. Hence I decided to take the string from the XML document, strip it of any newlines, spaces and tabs, and then pass it along to the parser for parsing. 

This also sounded simple enough, as all one needs is to ignore three characters in a string from a file: \n, \t, ' '. But then I realized another twist in the tale. It's fine for me to remove the spaces between elements, because ultimately they don't matter. But I cannot blindly do the same for the whole document. The data inside some elements should be passed along and stored as it is. We cannot afford to ignore the formatting of that. Hence the whole idea of simply running through the string and ignoring every space and newline as one encounters them went down the drain. 

I needed a function capable of recognizing whether the thing at hand was another element (in which case all preceding spaces and newlines are to be ignored) or simple data (in which case preservation of the text format is important). And of course this function needed to be recursive in nature for it to work on the entire document. If you haven't realized, this looks an awful lot like our ExtractElement function. 

So what I did was simply create another function, named it cleanText() and then copied the code from ExtractElement function and modified it to meet my needs.

Believe it or not, this took the most time. Before actually writing a working solution, I tried many versions. And while working on every one of these versions, I would eventually hit a brick wall. 

Again, check out the code at the following address, and look for the cleanText() function to see what I have discussed here:



Writing a parser for XML 3

Parsing and storing the contents of the XML document:

Okay, so till this point, we have a valid XML document. Now our job is to parse out the details and store them into appropriate classes. Since the primary purpose of these classes is storage, they shouldn't be made complex by adding various methods that take turns parsing the document one by one and extracting only what they need. What I prefer is one centralized function that does the job.

The first thing that needs to be done is a class definition for our whole document. Think about what an XML document can contain:

1) A header containing things like version, encoding, etc.
2) A root element.
3) Children to root element.

Remember, these definitions are recursive in nature. So an element can contain as many elements as it needs. And the names of these elements can be anything. Furthermore, an element can have attributes. And finally, there can be some actual data associated with an element. 

Keeping all of this in mind, we have to write a function (or a collection of functions) that can siphon all this information and store it into our objects. 

Getting things like version and encoding is easy. It's simply a matter of running through a string, and that too only the first line of the file.

The bigger matter is forming a correct hierarchy of all the elements of the XML document. Each document must have a root element. So the first order of business is to find this root element. Once we have the root element, its name (and the attribute list) can be stored. Then it is time to see what is contained inside this element. If it is simple string data, then we need to store it in the data member of the element. If it is another element, then we can repeat the same process for that element. This is nothing but a version of the good old DFS. Anyhow, the algorithm described above roughly looks like:

ExtractElements(Element element):
1) If element contains simple data:
2)      element.data = the text contained inside
3)      return to the calling place
4) For each new element encountered inside:
5)      element.child.push_back(the new element)
6)      ExtractElements(element.child.back())
7) return


The above functionality is handled inside the parse() function using various other functions. Each function contributes to this extraction process. 
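To make the recursion concrete, here is a toy Python sketch of the same DFS idea (not the actual parse() code; it assumes an already-validated document, ignores attributes, and uses plain dicts instead of the storage classes to stay self-contained):

```python
import re

def extract(tokens, pos):
    """tokens: a list of '<name>', '</name>', and text strings."""
    name = tokens[pos][1:-1]                   # strip '<' and '>'
    element = {"name": name, "data": "", "children": []}
    pos += 1
    while tokens[pos] != "</" + name + ">":
        if tokens[pos].startswith("<"):        # a nested element: recurse (DFS)
            child, pos = extract(tokens, pos)
            element["children"].append(child)
        else:                                  # simple string data
            element["data"] = tokens[pos]
            pos += 1
    return element, pos + 1                    # skip past the closing tag

def parse(xml):
    # split into tags and the text between them, dropping whitespace-only bits
    tokens = [t for t in re.split(r"(<[^>]+>)", xml) if t.strip()]
    root, _ = extract(tokens, 0)
    return root

tree = parse("<a><b>hello</b><c>world</c></a>")
```

Each recursive call fully finishes one element before returning to its parent, which is exactly the depth-first order described above.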

Tuesday 14 June 2016

Django for SIM

SIM converts a file into a number of database tables. This is done in order to make the analysis process easier. Also, doing so allows us to see the details of more than one structure in one place. Moreover, the analyst will be able to compare and contrast the features and facts of these different structures. Skipping over many other uses: finally, when completed, a user of SIM will be able to actually generate the file from the database as well. This will be a stellar feature. Components from different structures could then be used to create a new one.
But to do all of this we need an interface. An interface using which anyone can upload files to parse, or access the data of already parsed files, and whatnot. We decided to pursue the way of the web for this interface.
Django is a Python-based web framework. And Django is what we are going to use to put our database onto a nice and simple web page, and then slowly turn this into a working web application.

Problems/ Difficulties:
Django is easy to use. So they all said. But of course I needed to check that for myself and then form any opinion of it. I am sort of a technically challenged fella. This simply means it takes me time to get myself properly acquainted with new software, let alone start using it for some productive work. Having said that, I was able to install and use Django in a matter of an hour or so. I also followed along with a tutorial and was able to create a poll app with modest success. All was good up till this point.
But then I redid everything for SIM. There were many burning questions, things that I didn't quite understand but just applied to get the job done. And then things took a turn south.
The major problem is that the tables created by SIM constantly use an attribute by the name of ID. But for some reason this is not allowed by Django unless you make that field the primary key, which I can't, because some of these fields are not supposed to be the primary key for their respective tables. This is a weird thing. I hope to find the reason and hence the solution for this.

Monday 13 June 2016

Writing a parser for XML 2

Any good software must have a defined list of requirements. Requirements that the software needs to fulfill. And by inheritance, any good program must have that virtue as well. Also, this requirement set should be ready and locked before any development begins. Although that was not the strategy opted for this program, I would still like to start by specifying the functions my parser needs to perform:

1) Check to see if the XML file is valid.
2) Parse and store details about the elements in the XML file, and the file itself, in the program.
3) Use this stored information to display the contents in a well-formatted form.

In this post, I'll be writing how I achieved the first of the above three points i.e. checking the validity of the document.

Checking for the validity of the file:

Before checking the body of the document, we must see if it contains a valid header or not. This is easy enough, as the header contains only one line, and that too with a fixed syntax. The major part of this validation consists of validating the body of the document.
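That header check can be sketched roughly like this (the regex here is my own simplification, not the full XML declaration grammar):

```python
import re

# Matches e.g. <?xml version="1.0" encoding="UTF-8"?>  (simplified)
HEADER = re.compile(r'^<\?xml\s+version="([^"]+)"(?:\s+encoding="([^"]+)")?\s*\?>$')

def check_header(line):
    match = HEADER.match(line.strip())
    if not match:
        return None                       # invalid header
    version = match.group(1)
    encoding = match.group(2) or "UTF-8"  # encoding is optional
    return version, encoding
```

A single match on the first line either rejects the file or hands us the version and encoding for storage.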
To do this check I took advantage of the fact that XML does not allow any leniency in the ordering of the tags. If a tag opens in XML, then it can be followed only by its child elements, the data inside the tag, or the closing tag itself with the same name.
Also, since the names of the tags can be anything, the question of matching to see if the tags used mean anything to XML goes out the window. For all my program cares, a tag name can be any string without spaces in it.
Now, to check the validity, all I have to see is whether the tag names match and whether the order of opening and closing of the tags is correct.
This is rather easy. All one needs is a stack. So create one for yourself or use an inbuilt one. 
At the checking stage we do not care about the data contained in the tags. All we need is some code that will recognize all tag openings and closings in the exact order in which they appear in the input file. Everything else is to be ignored at this stage. Once we have this, follow along the following algorithm:

1) If the encountered tag is an opening tag:
2)       Push the tag name onto the stack.
3) If the encountered tag is a closing tag:
4)       Pop a tag name from the stack.
5)       Compare the two, the popped tag and the closing one.
6)       If they don't match:
7)              Document is INVALID
8)              exit.
9) Repeat till end of the document is reached.  

This simple algorithm does the job just fine. Some exceptions, of course, need to be handled if encountered. For example, there must be some restrictions on what a tag name could be. That must also be taken care of in this algorithm itself.
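The stack-based check above can be sketched in Python like so (a simplification: attributes, comments, and self-closing tags are ignored):

```python
import re

def is_valid(body):
    stack = []
    # find every opening or closing tag, in order of appearance
    for closing, name in re.findall(r"<(/?)([A-Za-z_][\w.-]*)\s*>", body):
        if not closing:
            stack.append(name)            # opening tag: push its name
        else:
            if not stack or stack.pop() != name:
                return False              # mismatch, or close without open
    return not stack                      # leftovers mean unclosed tags
```

Note the two extra failure modes the numbered algorithm leaves implicit: a closing tag with nothing on the stack, and tags still left open at the end of the document.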


The XML Parser


The checkElements() function (look for it in the above link) is used for this job.

Sunday 12 June 2016

Writing a parser for XML

XML stands for Extensible Markup Language. It is used to tag data for the purpose of making it (the data) easier for a variety of software to learn and use. But I guess you knew that already. So let me go directly to the things that are less obvious.

Yesterday, while attending a presentation on XML, I engaged in an oddly spirited discussion on why it is that one should learn and then use XML when there are many RDBMS packages available that seem to do the same job. Whether comparing the two is fair or even relevant is a whole other issue. But during the discussion, I found myself automatically inclined towards XML (without any solid reason). And in order to have some leverage over the others in the discussion, I uttered, rather nonchalantly, that XML is easier to parse. If that were true, then surely a point goes to XML over RDBMS (again, I am not at all qualified to make this comparison).

Today, following the presentation, I took it upon myself to write a parser for XML. My motivations for this undertaking are twofold:
1) If I am able to do it, then that will surely mean that it is easy.
2) Even if it turns out to be easy, no one needs to know that. And then it can serve as a great endorsement for my false prowess in computer science.

Before getting into the details (in subsequent posts) of how I did it (am doing it), I would like to state once and for all, the following about the scope and ambition of my XML parser:

A) Competence and performance wise, it is very modest. No fancy algorithms. Pure and simple hacks in the name of parsing based entirely on my understanding of the matter.
B) It is not complete. At least not as of this moment. This incompleteness is not merely a choice but also a simple implication of the fact that writing a full parser for XML would require complete knowledge of it and rigorous coding and testing periods. All of which I do not intend to promise.

I feel like I have provided some essential introduction in this post. You can check out the code at the following address:


Do look for other posts on this very blog to completely understand the code, and then maybe write one such parser for your own amusement.

Thursday 9 June 2016

Initial thoughts about SIM grammar

A lot had been done in SIM, parsing-wise, before the project was left in abeyance due to other things. When I retook the project during training, I thought it would be easier if I could simply shift to a tool for scanning and parsing. And I still stick by this belief that a tool is indeed the better way to go. Following are the reasons why:

1) A tool would come with a huge amount of guidance and help from Internet communities and forums.
2) A tool would have inbuilt functions and/or objects to do stuff that would otherwise require much thought and effort to do on my own.
3) A tool would make maintenance of the project much easier, as a less esoteric brand of code will be used.
4) And lastly, using a tool would require me to learn one. And learning new things is kind of like the whole point of this training period. 

While most of the points mentioned above still hold good and strong, some of them have started to turn on me. For instance, the tools I chose to use, flex and bison, do not have very extensive written support online. Simply put, there aren't many books and tutorials on these. And whatever is available is either the same stuff everywhere or too complex to make much sense to a beginner like me. There really should be a better book for learning flex and bison.

And secondly, there are a few issues with compilation as well. Compiling the files with a C compiler is easy as pie, but the same can't be said about a C++ compiler. 

Technical issues apart, the logical construction of the whole parsing process is also not as intuitive as I thought. Either I am being naive in the approach that I am taking, or bison is not as simple as the textbook describes it to be.

Wednesday 8 June 2016

Flex and Bison for SIM

SIM stands for Structure Information Modeling. SIM parses a file and converts it into tabular information that can prove instrumental in analysis. Now the question arises: what type of file does SIM parse? To answer this question I must introduce a piece of software first:

STAAD PRO is software used by civil engineers to design buildings and structures in 3D. All one needs to know about this software is that it always has a file associated with each design, one that contains all the commands the designer may or may not enter in order to design things in the structure. Now this file (having a .std extension) contains everything there is to know about the design. And hence we parse it to extract all this information and store it in a much more comprehensible format, i.e. a database. And this of course gives us the freedom of collecting information regarding various designs in one place instead of juggling that many files.

So far so good. The main focus of this post is the parsing bit of the file. The commands used in a .std file come from a wide range of fixed commands described in the technical manual for STAAD PRO. The amount of syntactic flexibility and sheer vocabulary make this a language in its own right. This means that, in essence, SIM needs to handle this language, not for the purpose of implementing it, but for the purpose of transforming it into a much more abstract form. And of course, since the parsing is going to get more and more complex, ad hoc methods of parsing will soon run out of convenience and efficiency.


Flex and Bison seem to be perfect for such a job. Other parts of SIM shall remain the same (and evolve thus), but the parsing of the file needs to be done using flex and bison. I need to understand the grammar of the STAAD PRO command language in order to encode it using bison. This is the difficult part. Forming the regular expressions should be easy. At least I hope that is the case, because otherwise I would have to work even harder. And who likes to work hard? I don't.

Tuesday 7 June 2016

Flex and Bison

Continuing the habit of missing the whole day by about 5 or 10 minutes and then finding the time to write a post for the said day, here I am, attempting to write a compendium of my second day at training in the form of this blog post.

For the most part of the day I wrestled with tutorials and books, all for the singular purpose of learning flex and bison. And it's a little discouraging to admit that I have only started to fully comprehend the meaning and usage of flex and bison. The theoretical bit is everything one can ask for. It's interesting, logical, a little philosophical and, most importantly, intuitively easy.

Flex: It lets us define some patterns (regular expressions) and then scans through the given input to find every match there is to the defined patterns. These matching bits are called tokens.
Bison: It lets us define some rules that the tokens of the language must abide by. Relations and dependencies among tokens of various kinds, etc. are described in this file.

Without going into much detail, let me draw out how things work using a general sequence of events:

1) We define various tokens and their types in the bison file.
2) The bison file also includes the grammar rules to be followed by these tokens.
3) The flex file describes using regular expressions how to identify these various tokens.
4) Then flex actually goes on to find these tokens in the input file. 
5) On receiving these tokens in such combinations as defined in the grammar rules, certain actions are triggered. These actions are basically C/C++ code written in either of the two files.
6) The whole process continues till the end of the file is reached.
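This is not flex and bison themselves (they generate C code from .l and .y files), but the same token-then-grammar-rule-then-action flow from the sequence above can be sketched in Python for a toy language of additions like "1 + 2 + 3":

```python
import re

# step 1/3: token kinds and the patterns that identify them
TOKEN_PATTERNS = [("NUMBER", r"\d+"), ("PLUS", r"\+"), ("SKIP", r"\s+")]

def scan(text):
    """The flex-like step: find tokens by pattern matching."""
    tokens, pos = [], 0
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            match = re.match(pattern, text[pos:])
            if match:
                if kind != "SKIP":          # whitespace is discarded
                    tokens.append((kind, match.group()))
                pos += len(match.group())
                break
        else:
            raise ValueError("no pattern matches at: " + text[pos:])
    return tokens

def parse(tokens):
    """The bison-like step: a grammar rule triggers actions.
    grammar: expr -> NUMBER (PLUS NUMBER)*
    """
    result = int(tokens[0][1])              # action on the first NUMBER
    for i in range(1, len(tokens), 2):
        assert tokens[i][0] == "PLUS"
        result += int(tokens[i + 1][1])     # action on each PLUS NUMBER
    return result

value = parse(scan("1 + 2 + 3"))
```

The division of labour mirrors the real tools: scan() knows only patterns, parse() knows only token kinds and the rule they must satisfy.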

I believe it is now time to try and implement some of these concepts in a project, SIM, and learn further as needed.

Monday 6 June 2016

First Day of 6 Months Training

I am typing this blog entry at 12:00 AM. So by the time this is finished, I will technically already have entered day two of my training, which is what I am supposed to cover in this blog. Anyhow, 6 June 2016, the day I started my six months' training as part of my Bachelor of Engineering.

I am taking my training at the Testing and Consultancy Cell, GNDEC, Ludhiana, under the mentorship of Dr. H S Rai. To state the obvious, I am very excited about this. There are a great many things to do and, more importantly, a great many things to learn.


Now, coming to the point: what did I do today? Well, firstly, today was kind of an afterthought. I almost had to postpone my joining day. I say this because otherwise the account of my work on the first day would invite some understandable criticism and skeptical disappointment. I was able to carve out only a couple of hours today, during which I started learning about Flex and Bison. (Why? Well, more on that in later posts.)


These are tools for lexical analysis as well as semantic parsing. I have only started reading into them. More detailed blogs about them are sure to follow. They are interesting, very interesting indeed. 
Oh, I almost forgot. I did manage to install both flex and bison on my laptop in the Ubuntu operating system. It's pretty simple:

Fire up the terminal and type in the following commands:

sudo apt-get install flex    (installs flex)
sudo apt-get install bison   (installs bison)