Thursday, April 9, 2009

Assignment 7

I would like to credit Ross Day with helping our group on this assignment.
The first step in working through chapter 4 of PCI (Programming Collective Intelligence) was to get a small set of pages to index. The author has provided such a set at http://kiwitobes.com/wiki. I fetched one of those pages with the following code:

>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]

The actual crawler code uses the Beautiful Soup API. BeautifulSoup.py was easy to download from the Beautiful Soup website. I put BeautifulSoup.py in my working directory and ran:
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib import urlopen
>>> soup=BeautifulSoup(urlopen('http://google.com'))
>>> soup.head.title
>>> links=soup('a')
>>> len(links)
16
>>> links[0]
<a href="...">Images</a>
>>> links[0].contents[0]
u'Images'

This was just to make sure that BeautifulSoup.py works.
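For comparison, here is a minimal sketch of the same link-extraction idea using only the standard library. It assumes Python 3 (the session above is Python 2) and parses a made-up HTML string instead of fetching a live page. The class name LinkCollector and the sample markup are my own illustrations and are not part of Beautiful Soup or the book's code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for each <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


# Hypothetical page content, used so the example needs no network access.
SAMPLE_HTML = (
    '<html><head><title>Demo</title></head><body>'
    '<a href="/images">Images</a> <a href="/maps">Maps</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(len(collector.links))
print(collector.links[0])
```

Beautiful Soup is more forgiving of messy real-world HTML than this sketch, which is probably why the book uses it.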

Back to the crawling... I tried to find a website that would work with the crawler, but we could not find a single .html site that did. Here is an example of what I tried:

>>> import searchengine
>>> pagelist=['http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)

Indexing http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm
Could not parse page http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm

I'm not sure what we are doing wrong. We tried a simple HTML website and it still would not crawl.

The important thing is that the crawler is supposed to go through the website and index what it finds. Once the pages are indexed and stored, the information can be queried and we can discover interesting things about that website. Results can be given content-based rankings, in which criteria such as word frequency, document location, and word distance are used to score the pages.
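To make the three criteria concrete, here is a rough sketch that scores two made-up pages for a two-word query. It assumes Python 3, and the page contents, URLs, and function names are all hypothetical. It is a simplified illustration of the ideas and not the book's searchengine code, which works from a database index.

```python
def frequency_score(words, query):
    """Total number of times the query words appear (higher is better)."""
    return sum(words.count(q) for q in query)


def location_score(words, query):
    """Sum of the first position of each query word (lower is better)."""
    return sum(words.index(q) for q in query)


def distance_score(words, query):
    """Smallest gap between the two query words (lower is better)."""
    first, second = query
    pos1 = [i for i, w in enumerate(words) if w == first]
    pos2 = [i for i, w in enumerate(words) if w == second]
    return min(abs(i - j) for i in pos1 for j in pos2)


# Hypothetical "indexed" pages: URL -> list of words in page order.
pages = {
    'a.html': 'python is a programming language and python is popular'.split(),
    'b.html': 'a programming language such as python'.split(),
}
query = ['python', 'programming']

# Only pages that contain every query word are scored.
matches = {url: words for url, words in pages.items()
           if all(q in words for q in query)}

scores = {
    url: {
        'frequency': frequency_score(words, query),
        'location': location_score(words, query),
        'distance': distance_score(words, query),
    }
    for url, words in matches.items()
}

for url in sorted(scores):
    print(url, scores[url])
```

Note that a high frequency score is good, while low location and distance scores are good, so the raw numbers cannot be compared directly.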

Once these scores are obtained, it is common to normalize them, for example to a range of 0 to 1, so that scores from different criteria are easy to read and compare.
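Here is a minimal sketch of one way to do that normalization, assuming Python 3. The function name normalize_scores, its small_is_better flag, and the example numbers are my own illustrations. The best page always ends up with 1.0, whether the raw metric treats high values or low values as better.

```python
def normalize_scores(scores, small_is_better=False):
    """Rescale a {url: raw_score} dict so the best score becomes 1.0."""
    tiny = 0.00001  # guards against dividing by zero
    if small_is_better:
        # Best page has the lowest raw score; others get lowest / theirs.
        lowest = max(tiny, min(scores.values()))
        return {url: lowest / max(tiny, value)
                for url, value in scores.items()}
    # Best page has the highest raw score; others get theirs / highest.
    highest = max(scores.values()) or tiny
    return {url: float(value) / highest for url, value in scores.items()}


# Hypothetical raw scores for two pages.
raw_frequency = {'a.html': 3, 'b.html': 2}   # higher is better
raw_location = {'a.html': 3, 'b.html': 6}    # lower is better

by_frequency = normalize_scores(raw_frequency)
by_location = normalize_scores(raw_location, small_is_better=True)

print(by_frequency)
print(by_location)
```

Once every metric is on the same 0 to 1 scale, the scores can be weighted and added together to produce one ranking per page.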
