First thing first, I am not entirely sure what pedagogically means. Forgive if you know and I am wrong. Look it up if you don't know and be impressed by the title.
Facing the possibility of an impromptu presentation, late in the night yesterday, I declared to give a presentation on PageRank. It was a great opportunity to share the awesomeness of the topic with others and it most certainly was an easy way out as I already had a good enough grasp on the topic from my previous endeavors with big data and map reduce.
Here is how the presentation went (at least in my head).
[Shifting to the stage]
Let me start by asking a simple question. What is a search engine? Try and describe it in the most abstract way possible.
It's a thing that takes a query as input and returns a response. Where query is a collection of keywords and response is a list of web pages.
The search engine looks for the keywords in the entire set of pages and returns those pages that match the keywords in the query. There are two ways of doing this search and match process:
The stupid way and the wise way. Wait, did you expect a more technical classification? Don't worry, I'll deliver that too:
Stupid way: Brute force. This means that for every keyword in the query, the search engine will go to each page and struggle through the entire page to see if the keyword is found on that page. This sounds really uncomplicated. And it works fine if the number of pages is limited to some earlier multiples of a hundred. But apply the same approach on the World Wide Web and you'll suffer at the hands of WWW's unfathomable size and the limitations of hardware at your exposure.
Wise way: Hashing. Hashing may sound super geeky and convoluted (which I'll say it is for the purposes of impressing the reader) but it's really not. Remember when we are in a classroom and the teachers takes the attendance? What does the teacher use to reference each student? A roll number. What is this thing called? Hashing.
Hashing is the process of assigning a key (possibly unique) to something real. Like a roll number (key) to a student (something real).
In order to somehow use this thing for a search engine, the first thing we need to do is divide every page and hence the whole web of pages into a list of keywords (a very long list). This constitutes the list of keys. And to every key, associate another list. This time the list of pages (or links to pages) where that keyword is found at least once.
This approach bests the brute force method already. Now the search engine does not have to traverse through every page in order to fine a match for a query. That has already been done for it. All the search engine needs to do is break down the query into keywords. Look for pages where each keyword in the query is found (using the hash function). Return the intersection (or union) of these pages to the user.
Suppose for a query, there are a hundred pages that match. Now how does the search engine decides which of these hundred pages is the most relevant? Is there a way to rank all the pages on the web? Yes there is. And it's called PageRank (or finding PageRank).
PageRank is an algorithm first introduced by Larry Page of Google. So what in the God's name is PageRank of a page?
PageRank is a measure of how popular a page is. Now this is an important point. The purpose of PageRank was to rank pages for the purpose of deciding which one is more relevant. And the criteria they are using to determine this relevance is the page's popularity. This algorithm functions on the assumption that popularity implies relevance. Ask yourself a question, how could an utterly irrelevant or mediocre page be popular? It can't. Hence the assumption is quite appealing, at least to me.
Now, let me introduce the notion of a random surfer. A random surfer is guy sitting in front of his computer, browsing the Web. Suppose the random surfer is presently on a webpage A. From here, the random surfer can take one of the two following actions:
- Write a url in the address bar of the browser and jump to any page on the web, randomly.
- Follow any of the links on page A and land on other pages depending upon which link has been clicked.
Let's denote the probability that the random surfer will go with the first option by d. So automatically the probability for the second option will be (1-d).
Let me work out a mathematical formula for PageRank. But before we go into that, I am going to introduce some symbols:
PR(A): PageRank of the page A.
N: Total number of pages in the system.
n(A): Number of links going out from page A to other pages on the web.
P(e): Probability of happening of event e.
Let's get started then:
PR(A) = How popular page A is.
= How likely it is that the random surfer will land on this page.
= P(Random surfer lands on page A).
= P(Random surfer jumps randomly from somewhere to page A OR Random surfer follows any of the incoming links from other pages to A).
= P(Random surfer jumps randomly from somewhere to page A) + P(Random surfer follows any of the incoming links from other pages to A). (1)
Now, P(Random surfer jumps randomly from somewhere to page A) can be calculated by dividing the probability of a jump (d) by the total number of contenders for after the jump (N). So this is equal to d/N.
P(Random surfer follows any of the incoming links from other pages to A) can be found using the following concept:
Assume that from the entire set of pages, [P1, P2, P3, ....., Pn] is the list of pages with links to A. Now the probability of reaching A after following links is the sum of the probability of following any one of these n links.
So this is equal to: P(being on P1 and clicking on the link to A) + P(being on P2 and clicking on the link to A) + ..... + P(being on Pn and clicking on the link to A).
Converting this into a nice simple summation equation we get:
P(Random surfer follows any of the incoming links from other pages to A) = Summation(being on Pi and clicking on the link to A), where i goes from 1 to n.
The probability of being on page Pi is nothing but PR(Pi). And if Pi has 10 links going out from it then only that link will matter that comes to A. So we'll divide PR(Pi) by n(P) which is 10 in this case.
Finally this probability of the random surfing choosing to follow links is to be multiplied by (1-d) which is the probability of choosing this whole option in the first place.
Substitute all of this in equation (1):
PR(A) = d/N + (1-d) * Summation( PR(Pi) / n(Pi) ) where i goes from 1 to n.
This formula is applied for every page in the system. An iterative method is used. Basically in the beginning, every page is assigned an equal PR of 1/N. Then this formula is applied to each page until the value of PR of all of the pages stops to change much. The final stable is PR is the actual PR of every page.
Since this is nothing but a probability distribution, the sum of PRs of every page after each iteration will be equal to one.
To check out a simulation of this algorithm have a peek at the following innocuous, innocent code:
No comments:
Post a Comment