For this assignment we had to cluster a large data set of movies in any way we chose. We took the data from IMDb's movie database and, after running some tests, decided to cluster by genre. We did this for two reasons. First, each genre is a binary true-or-false attribute, so a movie can belong to several genres at once, which we hoped would give good clustering results. Second, genre gives the reader something they can easily look at to judge whether the clustering worked or not. The first step was to filter the movies: the data set listed thousands of movies and we only wanted about 2000.
But we did not want just any 2000. We wanted the 2000 most popular movies so that we could tell whether the clusters made sense once we got that far. So we ran a filter on the text file that kept the movies with the most votes, in this case the movies with over 10,000 votes. The resulting data set was only a couple dozen over 2000, our target.
The filter program is shown here:
def filterit2(self, ifilename):
    # Go through the movies.tab file and only save US movies.
    # In addition, add the genre values to the dictionary.
    self.movieMatrix = {}
    ofile = open("US22.txt", 'w')
    logfile = open("log.txt", 'w')
    ifile = open(ifilename)
    # skip the header line
    line = ifile.readline()
    # read the first data line
    line = ifile.readline()
    while line != '':
        components = line.split('\t')
        # only rows with all genre columns present (indices 17..23)
        if len(components) > 23:
            title = components[0] + "//" + components[1]
            votes = int(components[5])
            # keep movies with over 10,000 votes that are in self.movieList
            if votes > 10000 and title in self.movieList:
                values = ""
                for i in range(17, 24):
                    values = values + components[i].strip() + '\t'
                logfile.write(title + "\n")
                ofile.write(title + '\t' + values + '\n')
                self.movieMatrix[title] = values
        line = ifile.readline()
    ofile.close()
    logfile.close()
    ifile.close()
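The filter above uses a fixed threshold of 10,000 votes, which happened to leave a little over 2000 movies. As a rough sketch of an alternative, the snippet below selects exactly the N most-voted entries instead. The function name `top_n_by_votes` and the `(title, votes)` pairs are hypothetical and are not part of the original program; the sample values are made up for illustration.

```python
import heapq

def top_n_by_votes(rows, n):
    """Return the n (title, votes) pairs with the highest vote counts.

    `rows` is any iterable of (title, votes) pairs. This helper is a
    hypothetical illustration, not part of the original filter.
    """
    return heapq.nlargest(n, rows, key=lambda row: row[1])

# Made-up sample data, only to show the call.
sample = [("A", 12000), ("B", 500), ("C", 48000), ("D", 10001)]
top_two = top_n_by_votes(sample, 2)
print(top_two)
```

With this approach the size of the result is fixed in advance, at the cost of not knowing the vote cutoff until the data has been read.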
Now that we had the data set, it was time to trim the fat. First we loaded the data into Excel and named the columns, then deleted everything except the genres and the movie names. As usual, we then copied the data out of Excel, pasted it into Notepad++, and saved it as a text file.
We then ran the clustering algorithm on the data, and at first we were convinced that it was not functioning: it froze Python almost immediately and appeared to do nothing. After some debugging we realized that it was not frozen, despite the "not responding" message. Python continued to use more and more RAM until, after about half an hour, it began to free the memory again. About 15 minutes after the memory began to free up, it finished. It is important to note here that it took an entire 15 minutes. This is due to a combination of the number of movies being clustered and the number of attributes they are clustered by. A data set of 1000 movies with the same attributes took only 5 minutes, which leads us to believe that the clustering workload grows very steeply with the size of the data set, possibly exponentially.
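The two timings reported above (1000 movies in 5 minutes, 2000 movies in 15 minutes) can be used for a rough check on how the running time grows. The sketch below assumes the time follows a power law, t ~ n**k, and solves for k from the two measurements. The function name `growth_exponent` is hypothetical, and two data points only give a crude estimate.

```python
import math

def growth_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Timings taken at face value from the text: sizes in movies, times in minutes.
k = growth_exponent(1000, 5.0, 2000, 15.0)
print(round(k, 2))
```

Under these assumptions, doubling the data set tripled the running time, which corresponds to an exponent between 1 and 2. That would point to growth that is faster than linear but polynomial rather than exponential, although more measurements would be needed to say so with any confidence.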
After it finished clustering we printed the results, and to our surprise they were almost always very accurate. Looking at the data, we noticed that, apart from occasional anomalies, most movies ended up in quite specific categories: not broad ones like "action" or "romance", but narrow ones like "comedy, action, law enforcement movies", as in the example below:
Last Action Hero
Last Boy Scout, The
Lethal Weapon
Lethal Weapon 2
Lethal Weapon 3
Pirates of the Caribbean: The Curse of the Black Pearl
Psycho
Rush Hour
Rush Hour 2
Shanghai Noon
Starsky & Hutch
Team America: World Police

One movie in this example does not fit completely: Pirates of the Caribbean is action and comedy, but it is not about law enforcement, at least not predominantly in the plot. The rest of the movies fit very accurately, even matching movies with their sequels, with nothing to go on but genre. This result is repeated over and over again in our output. In addition, the tabbing in the output shows the clustering: movies are more closely related to neighbors at a nearby tab level than to those that are farther away.
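The clustering program itself is not shown in this report, so the following is only a minimal sketch of one way tabbed output like this could be produced: naive agglomerative (bottom-up) clustering over 0/1 genre flags, printed with one tab per nesting level. The function names, the choice of Manhattan distance, the averaging of merged vectors, and the toy titles and genre flags are all assumptions made for illustration.

```python
def manhattan(a, b):
    """Sum of absolute differences between two genre vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster(movies):
    """Merge the two closest clusters repeatedly until one tree remains.

    `movies` maps a title to a list of 0/1 genre flags. A leaf of the
    returned tree is a title string; a merge is a (left, right) tuple.
    """
    clusters = [(title, list(vec)) for title, vec in movies.items()]
    while len(clusters) > 1:
        best = None
        # Scan every pair of clusters for the smallest distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = manhattan(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        left, right = clusters[i], clusters[j]
        # The merged cluster is represented by the average of its two parts.
        merged_vec = [(x + y) / 2 for x, y in zip(left[1], right[1])]
        # Delete the higher index first so the lower index stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(((left[0], right[0]), merged_vec))
    return clusters[0][0]

def render(tree, depth=0):
    """Return output lines: '-' marks a merge, tabs show the nesting depth."""
    if isinstance(tree, str):
        return ["\t" * depth + tree]
    lines = ["\t" * depth + "-"]
    for child in tree:
        lines.extend(render(child, depth + 1))
    return lines

# Toy data with made-up flags in the order (action, comedy, romance).
movies = {
    "Cop Comedy 1": [1, 1, 0],
    "Cop Comedy 2": [1, 1, 0],
    "Romance 1": [0, 0, 1],
    "Romance 2": [0, 1, 1],
}
tree = cluster(movies)
print("\n".join(render(tree)))
```

In this sketch, titles that share a parent "-" were merged early and are therefore the most similar, which matches the observation that closely tabbed neighbors are more closely related. Because every merge rescans all remaining pairs, this naive version slows down quickly as the number of movies grows.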