Web Crawler Python Program

In this project, you will implement the core of a web crawler and then crawl the following URLs (treated as domains for the purposes of this assignment) and paths:

.ics.uci.edu/
.cs.uci.edu/
.informatics.uci.edu/
.stat.uci.edu/
today.uci.edu/department/information_computer_sciences/*
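
The scope rules above can be sketched as a small URL filter. This is only an illustration, not part of the assignment's starter code: the name in_scope and the constants are hypothetical, and it assumes that the bare domains (for example ics.uci.edu without a subdomain) are also in scope and that only http and https URLs should be crawled.

```python
from urllib.parse import urlparse

# Suffixes copied from the list above; all names here are hypothetical.
ALLOWED_SUFFIXES = (
    ".ics.uci.edu",
    ".cs.uci.edu",
    ".informatics.uci.edu",
    ".stat.uci.edu",
)
TODAY_HOST = "today.uci.edu"
TODAY_PATH_PREFIX = "/department/information_computer_sciences/"


def in_scope(url: str) -> bool:
    """Return True if the URL falls under one of the listed domains or paths."""
    parsed = urlparse(url)
    # Assumption: only web pages are of interest, so skip mailto:, ftp:, etc.
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if host == TODAY_HOST:
        # today.uci.edu is only in scope under the one listed path.
        return parsed.path.startswith(TODAY_PATH_PREFIX)
    # The leading dot in each suffix keeps unrelated hosts such as
    # economics.uci.edu from matching ".ics.uci.edu".
    # Assumption: the bare domain (suffix without the dot) also counts.
    return any(host.endswith(s) or host == s[1:] for s in ALLOWED_SUFFIXES)
```

Matching on the dotted suffix rather than a plain substring is the main design point: a substring test for "ics.uci.edu" would wrongly accept hosts such as economics.uci.edu.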

As a concrete deliverable of this project, besides the code itself, you must submit a report that answers the following questions:

How many unique pages did you find? For the purposes of this assignment, uniqueness is established ONLY by the URL, with the fragment part discarded. So, for example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep using the above definition when counting the unique pages in this assignment.
What is the longest page in terms of the number of words? (HTML markup doesn’t count as words)
What are the 50 most common words in the entire set of pages crawled under these domains? (Ignore English stop words, which can be found, for example, here.) Submit the list of common words ordered by frequency.
How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. Each line of this list should contain a URL and a number, for example:
http://vision.ics.uci.edu, 10 (not the actual number here)
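
One possible way to collect the numbers for the four questions above while crawling is sketched below. It is a minimal illustration under stated assumptions, not a reference solution: the class CrawlStats and its methods are hypothetical names, the text passed in is assumed to already have its HTML markup removed, a "word" is assumed to be a run of letters or digits, STOP_WORDS is a tiny placeholder for a full English stop-word list, and every host ending in .ics.uci.edu (including www) is assumed to count as a subdomain.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urldefrag, urlparse

# Placeholder only; substitute a complete English stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


class CrawlStats:
    """Hypothetical accumulator for the report questions."""

    def __init__(self):
        self.unique_urls = set()                  # question 1
        self.longest_page = ("", 0)               # question 2: (url, word count)
        self.word_counts = Counter()              # question 3
        self.subdomain_pages = defaultdict(set)   # question 4

    def record(self, url, text):
        """Record one crawled page; `text` is assumed to be markup-free."""
        # Uniqueness is decided by the URL with its fragment discarded.
        url = urldefrag(url).url
        if url in self.unique_urls:
            return
        self.unique_urls.add(url)

        # Assumption: a word is a run of ASCII letters or digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        if len(words) > self.longest_page[1]:
            self.longest_page = (url, len(words))
        self.word_counts.update(w for w in words if w not in STOP_WORDS)

        host = (urlparse(url).hostname or "").lower()
        if host.endswith(".ics.uci.edu"):
            self.subdomain_pages[host].add(url)

    def top_words(self, n=50):
        """Most common non-stop words, ordered by frequency."""
        return self.word_counts.most_common(n)

    def subdomain_report(self):
        """Lines of the form 'URL, number', ordered alphabetically by host."""
        # Assumption: report every subdomain with an http:// prefix,
        # following the example line in the assignment.
        return [
            f"http://{host}, {len(urls)}"
            for host, urls in sorted(self.subdomain_pages.items())
        ]
```

A short usage example: after calling record("http://vision.ics.uci.edu/a#x", page_text) and then record("http://vision.ics.uci.edu/a#y", page_text), the set unique_urls holds a single entry, because the two URLs differ only in their fragments.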

Sample Solution