Art Padilla: 2009

Wednesday, April 29, 2009

Assignment 9

Our final group presentation was on the topic of movie clustering. The group compared the results of clustering a list of movies from IMDB, first by genre and then by keywords. The clustering was very memory intensive and took quite a while to complete. Here are some of the things the group found. First, when the group clustered the movies by genre, there were some interesting results. The clusters were much broader and more general. Some of the clusters made sense, while others had no correlation whatsoever. Here are three pictures of what we found after some genre clustering:

This is a snippet of the "Scarface" genre clustering:

This is a snippet of the war movie genre clustering:

This is a snippet of the "Last Action Hero" genre clustering:

Next, the group decided to cluster the same list of movies based on the keywords associated with them rather than their genre. This took much longer but produced much more accurate results. Here are some of the results:

Here are the results of the "Scarface" keyword clustering:

Now, the results of war movie keyword clustering:

And finally, the results from the "Shanghai Noon" keyword clustering:

As you can see from the pictures, the keyword clustering was much more specific and more accurate, while the genre clustering was very general. One conclusion that can be made is that clustering presents you with an age-old engineering trade-off: more effort for better results, or less effort for okay results.

Saturday, April 25, 2009

Assignment 8

PCI Chapter 5, Optimization

As a group we divided the chapter up, and each of us created slides based on our assigned sections. Once they were completed we assembled the slides, and there were some interesting things that we learned.

One of the most important pieces of an optimization problem is the cost function. The cost function is also the most difficult thing to determine. An optimization problem tries to minimize the cost function. There are a few methods of optimization. The first is Random Searching, which is just what the name implies: random guessing. It is only really good for use as a baseline against other algorithms.

Hill Climbing starts from a random solution, looks at its closest neighbors, and moves to the best one, repeating until no neighbor is better. It is only able to find a local minimum rather than the global one. Simulated Annealing also starts from a random solution and looks for progressively better values, but early on it will sometimes accept a worse solution, and it does so less and less often as its "temperature" drops. This makes it much better at escaping local minima and finding the global minimum. Genetic Algorithms start with a set of random solutions. Each round keeps the best solutions and fills the rest of the population with modifications of the best ones. This continues over and over until no improvement is shown. Time must also be spent deciding how to represent the solution, since the representation varies with the type of problem.
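To make the difference between these methods concrete, here is a minimal sketch written for Python 3. The cost function, the TARGET list, the DOMAIN ranges, and all function names are made up for illustration and are not the book's code; the toy cost has a single minimum, so hill climbing always succeeds here, which would not be true on a bumpier problem.

```python
import math
import random

TARGET = [3, 7, 1, 9]        # hypothetical best solution for this toy problem
DOMAIN = [(0, 10)] * 4       # allowed (min, max) range for each value

def cost(sol):
    """Toy cost function: squared distance from TARGET (lower is better)."""
    return sum((s - t) ** 2 for s, t in zip(sol, TARGET))

def random_solution(domain, rng):
    return [rng.randint(lo, hi) for lo, hi in domain]

def random_search(domain, costf, rng, tries=200):
    """Baseline: guess at random and keep the cheapest guess."""
    return min((random_solution(domain, rng) for _ in range(tries)), key=costf)

def hill_climb(domain, costf, rng):
    """Move to the best neighbor until no neighbor improves the cost."""
    sol = random_solution(domain, rng)
    while True:
        neighbors = []
        for i, (lo, hi) in enumerate(domain):
            if sol[i] > lo:
                neighbors.append(sol[:i] + [sol[i] - 1] + sol[i + 1:])
            if sol[i] < hi:
                neighbors.append(sol[:i] + [sol[i] + 1] + sol[i + 1:])
        best = min(neighbors, key=costf)
        if costf(best) >= costf(sol):
            return sol          # local minimum reached
        sol = best

def anneal(domain, costf, rng, temp=1000.0, cool=0.95):
    """Like hill climbing, but sometimes accepts a worse solution while hot."""
    sol = random_solution(domain, rng)
    while temp > 0.1:
        i = rng.randint(0, len(domain) - 1)
        cand = sol[:]
        # Nudge one value up or down, staying inside the allowed range.
        cand[i] = min(max(cand[i] + rng.choice([-1, 1]), domain[i][0]), domain[i][1])
        old_cost, new_cost = costf(sol), costf(cand)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if new_cost < old_cost or rng.random() < math.exp(-(new_cost - old_cost) / temp):
            sol = cand
        temp *= cool
    return sol

rng = random.Random(0)
for name, method in [("random", random_search), ("hill climb", hill_climb), ("anneal", anneal)]:
    result = method(DOMAIN, cost, rng)
    print(name, result, cost(result))
```

Passing in a seeded random.Random keeps the runs repeatable, which makes it easier to compare the methods against the random baseline.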

One last thing to keep in mind is displaying the data. As with all data, the visualization has to be a healthy mix of a human-understandable format and enough detail to still be relevant. The algorithms described above do a good job of finding the solution, but displaying it can be a whole other issue.

Below are two examples of the same data:

BAD

GOOD

These two pictures show the exact same data, but the second is much more human readable. This is because the lines are not crossed. Keeping lines from crossing is the most common way of making data easy to read. This is done by counting crossed lines, keeping track of lengths and positions, and using an algorithm that uses this data to make sure that none of the lines cross. Ironically enough, this is very often done using a genetic algorithm. Also, to keep the data from being oddly spaced, minimum and maximum line lengths are often declared. This keeps the data easy to read and also keeps the layout automatically generated, which is important when dealing with very large data sets.
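Counting crossed lines can itself serve as the cost function for the layout. The sketch below (Python 3) shows one way to do the counting; the names segments_cross and count_crossings, the four people, and their coordinates are all made up for illustration.

```python
from itertools import combinations

def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 strictly crosses segment p3-p4."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    den = (y4 - y3) * (x2 - x1) - (x4 - x3) * (y2 - y1)
    if den == 0:
        return False  # parallel lines never cross
    # ua and ub say how far along each segment the intersection point lies.
    ua = ((x4 - x3) * (y1 - y3) - (y4 - y3) * (x1 - x3)) / den
    ub = ((x2 - x1) * (y1 - y3) - (y2 - y1) * (x1 - x3)) / den
    return 0 < ua < 1 and 0 < ub < 1

def count_crossings(positions, links):
    """Cost of a layout: how many pairs of links cross each other."""
    lines = [(positions[a], positions[b]) for a, b in links]
    return sum(1 for (a, b), (c, d) in combinations(lines, 2)
               if segments_cross(a, b, c, d))

links = [("A", "C"), ("B", "D")]
bad = {"A": (0, 0), "C": (2, 2), "B": (0, 2), "D": (2, 0)}
good = {"A": (0, 0), "C": (2, 0), "B": (0, 2), "D": (2, 2)}
print(count_crossings(bad, links))   # 1
print(count_crossings(good, links))  # 0
```

An optimizer would then try different positions and keep the layout with the lowest count.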

Thursday, April 9, 2009

Assignment 7

I would like to credit Ross Day with helping our group on this assignment.
The first step in going through chapter 4 of PCI was to find a small set of pages to index. The author has provided such a set at http://kiwitobes.com/wiki. A page from it was downloaded with the following code:

>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]

The actual crawler code we are going to use relies on the Beautiful Soup API. BeautifulSoup.py was very easily downloaded from the Beautiful Soup website. I put BeautifulSoup.py in my working directory and ran:
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib import urlopen
>>> soup=BeautifulSoup(urlopen('http://google.com'))
>>> soup.head.title
>>> links=soup('a')
>>> len(links)
16
>>> links[0]
Images
>>> links[0].contents[0]
u'Images'

This is to make sure the BeautifulSoup.py works.

Back to the crawling... I tried to find a website that would work with the crawler. We could not find a single .html website that would work. Here is an example of what I tried:

>>> pagelist=['http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)

Indexing http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm
Could not parse page http://rosemary.umw.edu/~stephen/cpsc330/milestone3.htm

I'm not sure what we are doing wrong. We tried to find a simple HTML website and it still would not crawl. The important thing is that the crawler is supposed to go through the website and index what it finds. Once the information is indexed and stored, it can be queried and we can discover interesting things about that website. You can use content-based ranking, where criteria such as word frequency, document location, and word distance are used to score the pages.

Once these scores are obtained it is common to normalize the scores so that they are easy to read and understand.
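As a rough illustration of scoring and normalizing, here is a small Python 3 sketch. It does not use the crawler's database; the function names, the pages dictionary, and the example URLs are all hypothetical. Normalizing maps every score into the range 0 to 1, with 1 being the best page, whether the raw metric is "bigger is better" (word frequency) or "smaller is better" (word distance).

```python
def frequency_scores(pages, query_words):
    """Score each page by how many times the query words appear in it."""
    scores = {}
    for url, text in pages.items():
        words = text.lower().split()
        scores[url] = sum(words.count(w.lower()) for w in query_words)
    return scores

def normalize_scores(scores, small_is_better=False):
    """Rescale scores so the best page gets 1.0 and the rest fall below it."""
    vsmall = 0.00001  # avoids division by zero
    if small_is_better:
        minscore = min(scores.values())
        return {u: float(minscore) / max(vsmall, s) for u, s in scores.items()}
    maxscore = max(scores.values())
    if maxscore == 0:
        maxscore = vsmall
    return {u: float(s) / maxscore for u, s in scores.items()}

# Hypothetical page texts standing in for indexed content.
pages = {
    "http://example.com/a": "python is fun and python is readable",
    "http://example.com/b": "a short page about python",
}
raw = frequency_scores(pages, ["python"])
print(raw)
print(normalize_scores(raw))
```

Once every metric is on the same 0 to 1 scale, different metrics can be combined with weights without one of them swamping the others.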

Assignment 6

For this assignment we had to cluster a large data set of movies in any way we liked. We took the data from IMDB's database of movies and decided, after doing some tests, to cluster by genre. We did this for two reasons. First, the binary true or false that determines whether a movie fits a genre allows movies to be part of multiple genres, which should (hopefully) provide good clustering results. Second, it gives the user something they can easily look at to determine whether the clustering worked or not. The first thing we did was filter the movies: the data set had thousands of movies in the list and we only wanted about 2000. But we did not want just any 2000.
We wanted the 2000 most popular movies so that we could look and see whether the clusters made any sense once we got that far. So we ran a filter on the text file that kept the movies with the most votes, in this case the movies with over 10,000 votes. The resulting data set was only a couple dozen over 2000, our desired goal.

The filter program is shown here:

def filterit2(self, ifilename):
    # this will go through the movies.tab file and only save US movies
    # in addition it will add to the dictionary the genre values
    self.movieMatrix = {}
    ofile = open("US22.txt", 'w')
    logfile = open("log.txt", 'w')
    ifile = open(ifilename)
    line = ifile.readline()
    # remove header
    line = ifile.readline()
    while line != '':
        components = line.split('\t')
        if len(components) > 23:
            title = components[0] + "//" + components[1]
            votes = int(components[5])
            if votes > 10000 and title in self.movieList:
                values = ""
                for i in range(17, 24):
                    values = values + components[i].strip() + '\t'
                logfile.write(title + "\n")
                ofile.write(title + '\t' + values + '\n')
                self.movieMatrix[title] = values
        line = ifile.readline()
    ofile.close()
    logfile.close()
    ifile.close()

Now that we had the data set it was time to trim the fat. The first thing we did was take the data into Excel and name the columns. Then we deleted everything but the genres and the movie names. Then, as per usual, we copied the data out of Excel, pasted it into Notepad++, and saved it as a text file. We then ran the clustering algorithm on the data, and at first we were convinced that it was not functioning. It almost immediately froze Python and did nothing. After trying our luck with some debugging I realized that it was not frozen, despite the "not responding" message I was getting. Python was continuing to use more and more RAM, until about half an hour had passed and it began to free up memory. About 15 minutes after the memory began to free up, it finished. I think it is important to note here that it took an entire 15 minutes. This is a combination of the number of movies being clustered and the number of attributes they are being clustered by. When I ran a data set of 1000 with the same attributes it took only 5 minutes, leading me to believe the workload of clustering has a very steep curve or is exponential.
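One way to reason about that growth is to count comparisons. The Python 3 sketch below assumes a naive hierarchical clustering that compares every pair of remaining clusters in each round and then merges the closest pair; the function name pair_comparisons is made up, and real timings also depend on caching, the distance function, and memory. Under that assumption the work grows roughly with the cube of the number of movies, which is steep but polynomial rather than exponential.

```python
def pair_comparisons(n):
    """Pairs examined by a naive hierarchical clustering of n items.

    Each round looks at every pair of the k remaining clusters,
    which is k * (k - 1) / 2 pairs, then merges one pair so that
    k - 1 clusters are left for the next round.
    """
    total = 0
    for k in range(n, 1, -1):
        total += k * (k - 1) // 2
    return total

small = pair_comparisons(1000)
large = pair_comparisons(2000)
print(small)           # 166666500
print(large)           # 1333333000
print(large / small)   # about 8, so doubling the data is about 8x the work
```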

After it finished clustering we printed the results, and to my surprise they were almost always very accurate. Looking at the data I noticed that, while there were occasional anomalies, most of the data ended up in pretty specific categories: not broad ones like "action" or "romance" but very specific ones like "comedy, action, law enforcement movies", such as the example posted below:

Last Action Hero
Last Boy Scout, The
Lethal Weapon
Lethal Weapon 2
Lethal Weapon 3
Pirates of the Caribbean: The Curse of the Black Pearl
Psycho
Rush Hour
Rush Hour 2
Shanghai Noon
Starsky & Hutch
Team America: World Police

There is a movie in this example that does not fit completely: Pirates of the Caribbean. It is action and comedy, but it is not law enforcement, at least not predominantly in the plot. But the rest of the movies fit very accurately, even matching movies with their sequels, with nothing to go on but genre. This result is duplicated over and over again in my output. Also, the tabbing of the output shows the clustering: movies are more closely related to their closer tabbed neighbors than to the ones that are farther away.

Wednesday, March 4, 2009

Assignment 5

I have been exploring the use of visualization for processing data. It has been very interesting, although not all of the visualizations make sense as artistic representations. There were, however, some that I found helpful, like the one showing frequently occurring surnames found in the 2000 census. A picture is shown below.

In this visualization each surname is given a bubble, and the number of times the name was found determines the size of the bubble. This is one of the more easily understood visualizations.

Another visualization I found shows the average S&P home price index for 14 different states between 1998 and 2008, which is pictured below.

Some of the data that I entered was the Governator's favorability ratings by quarter. This information was pulled from swivel.com.

Wednesday, February 18, 2009

Assignment 4

Part1

I had some issues using feedparser and Python with the code in the book. Even when I used the code exactly as it appears in the book, I got an error saying that it was not able to parse.

Part2

As a group we decided to present data showing who died and who survived on the Titanic! The X axis is the sex of the passenger, the Y axis is the class of the passenger (1st, 2nd, 3rd, and crew), and blue means survived while red means did not survive. The data shows something that most people already know, but it is still interesting to see the clustering of the data prove the point: most men died, and almost all of the female 1st class passengers survived.

Another way we decided to cluster the data was by the age and sex of the passenger, and whether or not they survived. What makes this more interesting is the fact that a fair number of female children did not survive. The upper-right area is female child passengers, and red (did not survive) is the dominant color in that area. You can manipulate the visualizations in many ways in order to discover interesting trends in the data set.

Part 3

For this part I decided to look up the number of deaths in Iraq by month. This is what I found.

I was also able to determine whether they were from other coalition forces. Here is the number of UK military deaths that were combat related.

Sunday, February 8, 2009

Assignment 3

To create a recommendation system with Python, we started from some of the recommender code and put together a program that recommends artists based on the band name entered.
This is a portion of the recommender:

import pylast

key = 'b8a9a83c0e60d30d237eea0dcdcf055a'
secret = 'a0a206e56d580a78e8715155106371fa'
sk = ''
# raw_input returns the typed text as a string; Python 2's input() would try to evaluate it
bandName = raw_input('Please enter band name:')
artist = pylast.Artist(bandName, key, secret, sk)
# artists that last.fm considers similar to the one entered
similar = artist.get_similar()
print similar

# the artist's most popular tracks
tracks = artist.get_top_tracks()
print "\n"
print "\n"
print "\n"
print tracks

Tuesday, January 27, 2009

Assignment 2

Continuing with Chapter 2, I have come up with the same answer as the textbook for "Recommending Items" on pg 17 of PCI.

For

>>> recommendations.getRecommendations(recommendations.critics,'Toby')
[(3.3477895267131013, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.5309807037655649, 'Just My Luck')]

And for

>>> recommendations.getRecommendations(recommendations.critics,'Toby',
... similarity=recommendations.sim_distance)
[(3.5002478401415877, 'The Night Listener'), (2.7561242939959363, 'Lady in the Water'), (2.5946144209447373, 'Just My Luck')]

Matching Products

This section uncovered a discrepancy with the textbook in

>>> recommendations.topMatches(movies,'Superman Returns')
[(0.65795169495976946, 'You, Me, and Dupree'), (0.48795003647426888, 'Lady in the Water'), (0.11180339887498941, 'Snakes on a Plane'), (-0.17984719479905439, 'The Night Listener'), (-0.46625240412015717, 'Just My Luck')]

The textbook supplies (-0.422, 'Just My Luck') for the last entry, which is not simply a rounded version of the -0.466 computed here.
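The movies dictionary passed to topMatches is not defined in the session shown here. Presumably it came from flipping the critics dictionary so that movies become the keys, which is what the transformPrefs call used later in this post does. A minimal Python 3 sketch of that flip follows; the function name transform_prefs and the tiny ratings dictionary are made up for illustration and are not the book's data.

```python
def transform_prefs(prefs):
    """Flip {person: {item: score}} into {item: {person: score}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, score in ratings.items():
            result.setdefault(item, {})[person] = score
    return result

# Hypothetical ratings, just to show the shape of the data.
people = {
    "Ann": {"Movie A": 4.0, "Movie B": 2.5},
    "Bob": {"Movie A": 3.0},
}
movies = transform_prefs(people)
print(movies)
```

With the dictionary flipped this way, the same similarity functions that compare people can compare movies instead.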

I went to Google for feedparser, downloaded the zip file, extracted it, placed the feedparser.py file in my library, and proceeded with the textbook's instructions.

Set up pydelicious and got about 2 pages of popular posts on programming.

So far so good, I used the files of code that were provided on the class website. Then I typed the following:

>>> from pydelicious import get_popular,get_userposts,get_urlposts

>>> from deliciousrec import *

>>> delusers=initializeUserDict('programming')

>>> delusers ['arturousmc']={}

>>> fillItems(delusers)

>>> import random

>>> user=delusers.keys( )[random.randint(0,len(delusers)-1)]

>>> user

u'chaostheory'

>>> import recommendations

>>> recommendations.topMatches(delusers,user)

[(0.11907894736842106, u'synewaves'), (0.11907894736842106, u'mangosi'), (0.05131578947368421, u'xulu'), (0.05131578947368421, u'wdr1'), (0.05131578947368421, u'thomd')]

>>> recommendations.getRecommendations(delusers,user)[0:10]

[(0.19082672706681769, u'http://colorschemedesigner.com/'), (0.17667044167610421, u'http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html'), (0.14665911664779163, u'http://woork.blogspot.com/2009/01/beautiful-datepickers-and-calendars-for.html'), (0.14665911664779163, u'http://nettuts.com/freebies/cheat-sheets/jquery-cheat-sheet/'), (0.13250283125707815, u'http://www.noupe.com/tools/15-incredible-mac-apps-for-freelance-web-designers.html'), (0.13250283125707815, u'http://css.dzone.com/news/how-to-develop-a-firefox-exten'), (0.10249150622876559, u'http://www.vectorials.com/index.html'), (0.10249150622876559, u'http://www.templatemonster.com/'), (0.10249150622876559, u'http://www.smashingmagazine.com/2009/01/20/50-extremely-useful-php-tools/'), (0.10249150622876559, u'http://www.smashingmagazine.com/2008/01/14/monday-inspiration-data-visualization-and-infographics/')]

>>> url=recommendations.getRecommendations(delusers,user)[0][1]

>>> recommendations.topMatches(recommendations.transformPrefs(delusers),url)

[(0.48976000741676007, u'http://www.yelp.com/biz/delancey-street-foundation-movers-los-angeles#hrid:4SlRsxSrZDu8DbEvrCWdhg'), (0.48976000741676007, u'http://www.webstandards.org/action/acid2/guide/'), (0.48976000741676007, u'http://www.wasabi.net.cn/'), (0.48976000741676007, u'http://www.theonion.com/content/news/obama_disappointed_cabinet_failed'), (0.48976000741676007, u'http://www.schematic.com/#//')]

Finally I have added a recommendation engine to del.icio.us!

In building the item comparison dataset, I added the code the text asks for to recommendations.py, and the following happened:

>>> reload(recommendations)

>>> itemsim=recommendations.calculateSimilarItems(recommendations.critics)

>>> itemsim

{'Lady in the Water': [(0.40000000000000002, 'You, Me, and Dupree'), (0.2857142857142857, 'The Night Listener'), (0.22222222222222221, 'Snakes on a Plane'), (0.21052631578947367, 'Just My Luck'), (0.090909090909090912, 'Superman Returns')], 'Snakes on a Plane': [(0.22222222222222221, 'Lady in the Water'), (0.18181818181818182, 'The Night Listener'), (0.16666666666666666, 'Superman Returns'), (0.10526315789473684, 'Just My Luck'), (0.05128205128205128, 'You, Me, and Dupree')], 'You, Me, and Dupree': [(0.40000000000000002, 'Lady in the Water'), (0.18181818181818182, 'Just My Luck'), (0.14814814814814814, 'The Night Listener'), (0.053333333333333337, 'Superman Returns'), (0.05128205128205128, 'Snakes on a Plane')], 'Just My Luck': [(0.21052631578947367, 'Lady in the Water'), (0.18181818181818182, 'You, Me, and Dupree'), (0.13333333333333333, 'The Night Listener'), (0.10526315789473684, 'Snakes on a Plane'), (0.063492063492063489, 'Superman Returns')], 'Superman Returns': [(0.16666666666666666, 'Snakes on a Plane'), (0.10256410256410256, 'The Night Listener'), (0.090909090909090912, 'Lady in the Water'), (0.063492063492063489, 'Just My Luck'), (0.053333333333333337, 'You, Me, and Dupree')], 'The Night Listener': [(0.2857142857142857, 'Lady in the Water'), (0.18181818181818182, 'Snakes on a Plane'), (0.14814814814814814, 'You, Me, and Dupree'), (0.13333333333333333, 'Just My Luck'), (0.10256410256410256, 'Superman Returns')]}

>>> reload(recommendations)

>>> recommendations.getRecommendedItems(recommendations.critics,itemsim,'Toby')

[(4.5, 'Lady in the Water')]

Using the MovieLens Dataset
I was successful in downloading the datasets from a site I googled. I was not successful in loading them for this part of the assignment. The error I keep getting is as follows. No luck with Python tonight, maybe Arizona will have better luck than I'm having!

>>> prefs=recommendations.loadMovieLens()

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python26\lib\recommendations.py", line 163, in loadMovieLens
    for line in open(path+'/u.item'):
IOError: [Errno 2] No such file or directory: 'C:Python26/Lib/u.item'
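The error means Python could not find u.item at the path it was given. Below is a minimal Python 3 sketch of a loader that builds the file paths with os.path.join and checks that the files exist before reading them. The name load_movielens is made up, and the sketch assumes the usual MovieLens layout: u.item with fields separated by "|" and u.data with four tab-separated fields (user, movie id, rating, timestamp).

```python
import os

def load_movielens(path):
    """Return {user_id: {movie_title: rating}} from u.item and u.data in path."""
    item_file = os.path.join(path, "u.item")
    data_file = os.path.join(path, "u.data")
    for name in (item_file, data_file):
        if not os.path.isfile(name):
            raise IOError("expected MovieLens file not found: %s" % name)

    # Map movie ids to titles.
    movies = {}
    with open(item_file, encoding="latin-1") as f:
        for line in f:
            if not line.strip():
                continue
            movie_id, title = line.split("|")[0:2]
            movies[movie_id] = title

    # Build the nested ratings dictionary.
    prefs = {}
    with open(data_file, encoding="latin-1") as f:
        for line in f:
            if not line.strip():
                continue
            user, movie_id, rating, _timestamp = line.rstrip("\n").split("\t")
            prefs.setdefault(user, {})[movies[movie_id]] = float(rating)
    return prefs
```

On Windows, passing the folder as a raw string such as r"C:\Python26\Lib" (or with forward slashes) avoids problems with backslashes in the path; the folder shown is only an example taken from the traceback.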

WEKA

The installation was pretty easy. I ran through the sample data and understand its format and processes.

I used the dataset provided on the class website that loaded directly on to WEKA. This was the part it returned.

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 235 77.5578 %

Incorrectly Classified Instances 68 22.4422 %

Kappa statistic 0.5443

Mean absolute error 0.1044

Root mean squared error 0.2725

Relative absolute error 52.0476%

Root relative squared error 86.5075 %

Total Number of Instances 303

According to the results there were 235 correctly classified instances, or about 78%, and 68 incorrectly classified instances that made up the other 22%. The majority were correctly classified, so I would say this is a good result.
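The percentages in the summary can be checked directly from the two counts. A quick sketch in Python 3, with variable names of my own choosing:

```python
correct = 235
incorrect = 68
total = correct + incorrect

accuracy = 100.0 * correct / total
error_rate = 100.0 * incorrect / total

print(total)                 # 303
print(round(accuracy, 4))    # 77.5578
print(round(error_rate, 4))  # 22.4422
```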

Sunday, January 25, 2009

Assignment 1

Python was installed successfully. I went through the tutorial posted on the assignment page and managed to get all the starter programs right.

Euclidean Distance

Then I got to the recommendations file, which happened to work. However, once I tried to add the snippet to the recommendations file, I couldn't get it to work. I kept getting an indentation error. Finally I got the answer by using the correct indentation on each line.

Pearson Correlation

The code gave me a very difficult time during this section. Once I noticed that the line 'from math import sqrt' was missing, I was able to get an answer equal to the textbook's.
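For reference, here is a minimal Python 3 sketch of a Pearson correlation score, including the sqrt import that was missing. The function follows the usual sum-based formula; the ratings dictionary is made up for illustration and is not the book's data.

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    """Pearson correlation between two people's ratings of shared items."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0

    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1_sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2_sq = sum(prefs[p2][it] ** 2 for it in shared)
    p_sum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)

    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return num / den

# Hypothetical ratings: "b" always rates twice as high as "a",
# and "c" ranks the items in the opposite order.
ratings = {
    "a": {"x": 1.0, "y": 2.0, "z": 3.0},
    "b": {"x": 2.0, "y": 4.0, "z": 6.0},
    "c": {"x": 3.0, "y": 2.0, "z": 1.0},
}
print(sim_pearson(ratings, "a", "b"))  # 1.0
print(sim_pearson(ratings, "a", "c"))  # -1.0
```

The first result shows why Pearson is useful: two people who agree on the ordering get a perfect score even if one of them consistently gives higher ratings.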

Manhattan Distance

After many attempts at ensuring that each piece of code had the proper indentation, I finally sought out the blog of a more experienced classmate. Thank you James Gallagher.

The code I used was

# returns a distance-based similarity score for person1 and person2
def man_distance(prefs,person1,person2):
    # Get the list of shared_items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]: si[item]=1
    # if they have no ratings in common, return 0
    if len(si)==0: return 0
    # Add up the absolute values of all the differences
    ManDistance = [ abs(prefs[person1][item] - prefs[person2][item]) for item in si ]
    return (1/(1+sum(ManDistance)))

This produced an answer of 0.18181818181818182.
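The score is 1 divided by (1 plus the sum of the absolute rating differences), so 0.1818... corresponds to a total difference of 4.5. The Python 3 sketch below reproduces that number under an assumption: that the two people compared were Lisa Rose and Gene Seymour, with the ratings typed in as I recall them from the book's sample data. Treat both the pairing and the ratings as illustrative.

```python
def man_distance(prefs, person1, person2):
    """Manhattan-distance similarity: 1 / (1 + sum of absolute differences)."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    total = sum(abs(prefs[person1][item] - prefs[person2][item]) for item in shared)
    return 1 / (1 + total)

# Assumed sample ratings for two critics (see the note above).
critics = {
    "Lisa Rose": {
        "Lady in the Water": 2.5, "Snakes on a Plane": 3.5,
        "Just My Luck": 3.0, "Superman Returns": 3.5,
        "You, Me and Dupree": 2.5, "The Night Listener": 3.0,
    },
    "Gene Seymour": {
        "Lady in the Water": 3.0, "Snakes on a Plane": 3.5,
        "Just My Luck": 1.5, "Superman Returns": 5.0,
        "The Night Listener": 3.0, "You, Me and Dupree": 3.5,
    },
}

# Differences: 0.5 + 0.0 + 1.5 + 1.5 + 1.0 + 0.0 = 4.5, so the score is 1 / 5.5.
print(man_distance(critics, "Lisa Rose", "Gene Seymour"))
```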

I have come to understand that Python is a great tool for people who aren't programmers to use. However, I don't think I will be a programmer myself.

Art Padilla