Wednesday, April 29, 2009

Assignment 9

Our final group presentation was on the topic of movie clustering. The group compared the results of clustering the list of movies from IMDB based on genre and then keywords. The clustering was very memory intensive and took quite a while to complete.Here are some of the things the group found:First, when the group clustered the movies based on genre they were some interesting results. The groups were much broader and general. Some of the clusters made sense, while other had no correlation whatsoever. Here are three pictures of what we found after some genre clustering:

This is a snippet of the, "Scarface," genre clustering:

This is a snippet of the war movie genre clustering:








This is a snippet of the, "Last Action Hero," genre clustering:



Now, the group decided to cluster the same list of movies based on the keywords associated with them rather than their genre. These took much longer but produced much more accurate results. Here some of the results:


Here are the results of the, "Scarface," keywords clustering:






Now, the results of war movie keyword clustering:





And finally, the results from, "Shanghai Noon," keyword clustering:






As you can see by the pictures, the keyword clustering was much more specific and more accurate. The genre clustering was very general. One conclusion that can be made is that with clustering you are faced with an age old engineering trade-off. More effort with better results or less effort with okay results.

Saturday, April 25, 2009

Assignment 8



PCI Chapter 5, Optimization



We as a group divided the chapter down and each created slides based on our assigned sections. Once completed we assembled the slides, and there were some interesting things that we learned.



One of the most important pieces to an optimization problem is the cost function. The cost function is also the most difficult things to determine. An optimization problem tries to minimize the cost function. There are a few methods of optimization. The first is Random Searching. Random Searching is just what the name implies. It is just random guessing. It is only really good for using as a baseline against other algorithms.




Hill Climbing is a variation of Random Searching. It finds its closest neighbors and finds the best value. It is only able to find the local minimum rather than the global one. Simulated Annealing uses random searching to find an initial value. The algorithm looks for progressively better values. It is also much more efficient when it comes to finding the global minimum. Genetic Algorithms start with a set of random solutions. It takes the best solutions and the rest are considered modifications of the best ones. It continues over an over until no improvement is shown. Time must also be spent deciding on how to represent the solution. Depending on the type of problem, the solution may vary on how it is represented.



One last thing to keep in mind is displaying the data. As with all data the visualization has to be a healthy mix of human understandable format, and enough detail to still be relevant. The algorithms described before do a good job of finding the solution, but displaying it can be a whole other issue. It has to be human readable and still retain its relevance.



Below are 2 example of the same data:



BAD






GOOD





These 2 pictures show the exact same data, but the second is much more human readable. This because the lines are not crossed. Keeping lines from crossing is the most common way of making data easy to read. This is done by counting lines, keeping track of lengths and positions and using an algorithm to make sure that none of these lines are crossing using this data. Ironically enough, this is very often done using a genetic algorithm. Also, to keep the data from being oddly distanced, min and max line lengths are often declared. This keeps thedata easy to read and it also keeps it auto genereated, which is important when dealing with very large data sets.

Thursday, April 9, 2009

Assignment 7

I would like to credit Ross Day with helping our group on this assignment.
The first step in going through chapter 4 in PCI was to create a small set of pages that will need to be indexed. The author has provided such as list at http://kiwitobes.com/wiki. This was accomplished with the following code:

>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]

The actual crawler code we are going to use, uses the Beautiful Soup API. BeautifulSoup.py was very easily downloaded from the Beautiful website. The BeautifulSoup.py was put in my working directory and:
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib import urlopen
>>> soup=BeautifulSoup(urlopen('http://google.com'))
>>> soup.head.title
>>> links=soup('a')
>>> len(links)
16
>>> links[0] Images
>>> links[0].contents[0]

u'Images'

This is to make sure the BeautifulSoup.py works.

Back to the crawling...I tried to find a website that would work for the crawler. We could not find a single .html website that would work. Here is an example of what I tried:

>>> pagelist=['http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)

Indexing

http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm
Could not parse page
http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm

I'm not sure what it is that we are doing wrong. We tried to find a simple html website and it still would not crawl. The important thing is that the crawler is supposed to go through the website and index what it finds. Once indexed and stored the information can be queried and we can discover interesting things about that website. You can have content-based rankings where based on different criteria such as: Word Frequency, Document location, and Word distance are used to score the websites.

Once these scores are obtained it is common to normalize the scores so that they are easy to read and understand.

Assignment 6

For this assignments we had to cluster a large data set of movies in any way. We took the data from IMBDs database of movies and decided after doing some tests to cluster by genre. We did this for 2 reasons, the binary true or false that determined if a movie fit a genre would allow movies to be part of multiple genres, subsequently providing good clustering results (hopefully), and it would also give the user something they can easily look at and determine if the clustering worked or not. The first thing we did was filter the movies, the data set had thousands of movies in the list and I only wanted about 2000. But we did not want any 2000.
We wanted the 2000 most popular movies so that we could look and see if these clusters make any sense once we got that far. So we ran a filter on the text file that took the movies with the most votes. In this case it was the movies with over 10,000 votes. The resulting data set was only a couple dozen over 2000, my desired goal.

The filter program is shown here:

def filterit2(self, ifilename):
# this will go through the movies.tab file and only save US
movies
# in addition it will add to the dictionary the genre values
self.movieMatrix = {}
ofile = open("US22.txt", 'w') logfile = open("log.txt", 'w')
ifile = open(ifilename)
line = ifile.readline()
# remove header
line = ifile.readline()
while line != '':
components = line.split('\t')
if len(components) > 23:
title = components[0] + "//" + components[1]

votes = int(components[5])
if votes > 10000 and title in self.movieList:
values = ""
for i in range(17, 24):
values = values + components[i].strip() + '\t'
logfile.write(title + "\n")
ofile.write(title + '\t' + values + '\n')
self.movieMatrix[title] = values
line = ifile.readline()
ofile.close()
logfile.close()
ifile.close()

Now that we had the data set it was time to trim the fat. The first thing we did was take the data into excel and name the columns. Then we deleted all but the genres and the movie names. Then as per usual we copied the data out of excel and pasted it into notepadd++ and saved it as a text file. We then ran the clustering algorithm on the data and at first we were convinced that it was not functioning. It almost immediately froze python and did nothing. After trying our luck with some debugging I realized that it was not frozen, despite the "not responding" message I was getting. Python was contusing to use more and more ram, until about a half an hour had passes and they began to free up the memory from python. About 15 minutes after the memory began to free up it finished. I think it is important to note here that it took an entire 15 minutes. This is a combination of the number of movies being clustered and the amount of attributes it is being clustered by. When I did a data set of 1000 with the same attributes it took only 5 minutes, leading my to believe the work load on clustering has a very high curve or is exponential.

After it finished clustering we printed my results and to my surprise they were almost always very accurate. When looking at the data I noticed that while there were the occasional anomalies, most of the data ended up in pretty specific categories, not broad like "action" or "romance" but very specific like "Comedy, action, law enforcement movies" such as the example posted below:

Last Action Hero
-
-
Last Boy Scout,
-
The Lethal Weapon
Lethal Weapon 2 Lethal Weapon 3
-
-
- Pirates of the Caribbean: The Curse of the Black Pearl Psycho
- Rush Hour Rush Hour 2
-

- Shanghai Noon
- Starsky & Hutch Team America: World PoliceThere is a movie in this example that does not fit completely, pirates of the Caribbean, its action and comedy. But it is not law enforcement, at least not predominantly in the plot. But the rest of the movies fit very accurately, even matching movies with their sequels, with nothing to go on but genre. This result is duplicated over and over again in my output. Also the movies, showing clustering by their tabbing, are more closely related to their closer tabbed neighbors than the ones that are farther away