# Initialize Otter
import otter
grader = otter.Notebook("final-project.ipynb")
# Run this cell to set up the notebook, but please don't change it.
import numpy as np
import math
from datascience import *
# These lines set up the plotting functionality and formatting.
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
In this project, we are exploring movie screenplays. We'll be trying to predict each movie's genre from the text of its screenplay. In particular, we have compiled a list of 5,000 words that occur in conversations between movie characters. For each movie, our dataset tells us the frequency with which each of these words occurs in certain conversations in its screenplay. All words have been converted to lowercase.
Run the cell below to read the movies table. It may take up to a minute to load.
movies = Table.read_table('data/movies.csv')
movies.where("Title", "wild wild west").select(0, 1, 2, 3, 4, 14, 49, 1042, 4004)
Title | Year | Rating | Genre | # Words | breez | england | it | bravo |
---|---|---|---|---|---|---|---|---|
wild wild west | 1999 | 4.3 | comedy | 3446 | 0 | 0 | 0.0212635 | 0 |
The above cell prints a few columns of the row for the comedy movie Wild Wild West. The movie contains 3446 words. The word "it" appears 74 times, making up $\frac{74}{3446} \approx 0.021$ of the words in the movie. The word "england" doesn't appear at all. This numerical representation of a body of text, one that describes only the frequencies of individual words, is called a bag-of-words representation. A lot of information is discarded in this representation: the order of the words, the context of each word, who said what, the cast of characters and actors, etc. However, a bag-of-words representation is often used for machine learning applications as a reasonable starting point, because a great deal of information is also retained and expressed in a convenient and compact format. In this project, we will investigate whether this representation is sufficient to build an accurate genre classifier.
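For intuition, here is a minimal sketch of how a bag-of-words frequency vector could be computed from a toy line of dialogue. The line below is hypothetical; the real dataset aggregates whole screenplays through a more involved pipeline.
from collections import Counter
# Hypothetical line of dialogue (the dataset was built from full screenplays).
line = "it is what it is"
words = line.lower().split()
counts = Counter(words)                                 # raw counts: {'it': 2, 'is': 2, 'what': 1}
frequencies = {w: c / len(words) for w, c in counts.items()}
frequencies['it']                                       # 2/5 = 0.4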
All movie titles are unique. The row_for_title function provides fast access to the one row for each title.
Note: All movies in our dataset have their titles lower-cased.
title_index = movies.index_by('Title')
def row_for_title(title):
"""Return the row for a title, similar to the following expression (but faster)
movies.where('Title', title).row(0)
"""
return title_index.get(title)[0]
row_for_title('the terminator')
For example, the fastest way to find the frequency of "none" in the movie The Terminator is to access the 'none' item from its row. Check the original table to see if this worked for you!
row_for_title('the terminator').item('none')
0.0009633911368015
Set expected_row_sum to the number that you expect will result from summing all proportions in each row, excluding the first five columns.
# Set expected_row_sum to a number that's the (approximate) sum of each row of word proportions.
expected_row_sum = 1
expected_row_sum
1
grader.check("q1_0")
q1_0
passed!
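As a sanity check (a sketch, assuming the column layout described above), we can sum the word proportions in a single row and confirm the total is close to 1:
# Sum the word proportions (everything after the first five columns) of row 0.
word_columns = movies.drop(np.arange(5))
np.sum(np.array(list(word_columns.row(0))))   # approximately 1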
This dataset was extracted from a dataset from Cornell University. After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of 5000 common words in each movie.
print('Words with frequencies:', movies.drop(np.arange(5)).num_columns)
print('Movies with genres:', movies.num_rows)
Words with frequencies: 5000
Movies with genres: 370
The columns other than "Title", "Year", "Rating", "Genre", and "# Words" in the movies table are all words that appear in some of the movies in our dataset. These words have been stemmed, or abbreviated heuristically, in an attempt to make different inflected forms of the same base word into the same string. For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.
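To make the idea concrete, here is a toy suffix-stripping stemmer. It is purely illustrative and is not the stemmer that was used to build this dataset:
def toy_stem(word):
    """Crude illustration of stemming: strip one of a few common suffixes."""
    for suffix in ['erial', 'er', 'ed', 'e']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word
[toy_stem(w) for w in ['manage', 'manager', 'managed', 'managerial']]
# ['manag', 'manag', 'manag', 'manag']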
Stemming makes it a little tricky to search for the words you want to use, so we have provided another table that will let you see examples of unstemmed versions of each stemmed word. Run the code below to load it.
# Just run this cell.
vocab_mapping = Table.read_table('data/stem.csv')
stemmed = np.take(movies.labels, np.arange(5, len(movies.labels)))
vocab_table = Table().with_column('Stem', stemmed).join('Stem', vocab_mapping)
vocab_table.take(np.arange(1100, 1110))
Stem | Word |
---|---|
bond | bonding |
bone | bone |
bone | boning |
bone | bones |
bonu | bonus |
book | bookings |
book | books |
book | booking |
book | booked |
book | book |
Assign stemmed_message to the stemmed version of the word "vegetables".
stemmed_message = vocab_table.where('Word', 'vegetables').column(0).item(0)
stemmed_message
'veget'
grader.check("q1_1_1")
q1_1_1
passed!
What stem in the dataset has the most words that are shortened to it? Assign most_stem to that stem.
most_stem = vocab_table.group('Stem').sort('count', descending=True).column(0).item(0)
most_stem
'gener'
grader.check("q1_1_2")
q1_1_2
passed!
What is the longest word in the dataset whose stem wasn't shortened? Assign that to longest_uncut. Break ties alphabetically from Z to A (so if your options are "albatross" or "batman", you should pick "batman").
# In our solution, we found it useful to first add columns with
# the length of the word and the length of the stem,
# and then to add a column with the difference between those lengths.
# What will the difference be if the word is not shortened?
len_stem = vocab_table.apply(len, 'Stem')
len_word = vocab_table.apply(len, 'Word')
tbl_with_lens = vocab_table.with_columns('stem length', len_stem, 'word length', len_word)
tbl_with_dif = tbl_with_lens.with_column('difference', tbl_with_lens.column('word length') - tbl_with_lens.column('stem length'))
uncut = tbl_with_dif.where('difference', 0)
# Among the longest uncut words, the built-in max breaks ties from Z to A.
longest_words = uncut.where('word length', max(uncut.column('word length'))).column('Word')
longest_uncut = max(longest_words)
longest_uncut
'misunderstand'
grader.check("q1_1_3")
q1_1_3
passed!
Let's explore our dataset before trying to build a classifier. To start, we'll look at the relationship between the proportions of different words.
The first association we'll investigate is the association between the proportion of words that are "outer" and the proportion of words that are "space".
As usual, we'll investigate our data visually before performing any numerical analysis.
Run the cell below to plot a scatter diagram of space proportions vs. outer proportions and to create the outer_space table.
# Just run this cell!
outer_space = movies.select("outer", "space")
outer_space.scatter("outer", "space")
plots.axis([-0.001, 0.0025, -0.001, 0.005]);
plots.xticks(rotation=45);
Looking at that chart, it is difficult to tell whether there is an association. Calculate the correlation coefficient for the association between the proportion of words that are "outer" and the proportion of words that are "space" for every movie in the dataset, and assign it to outer_space_r.
# Our solution took multiple lines
# these two arrays should make your code cleaner!
outer = movies.column("outer")
space = movies.column("space")
outer_su = (outer - np.mean(outer)) / np.std(outer)
space_su = (space - np.mean(space)) / np.std(space)
outer_space_r = np.mean(outer_su * space_su)
outer_space_r
0.2829527833012746
grader.check("q1_2_1")
q1_2_1
passed!
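Since correlations in standard units come up again below, the same recipe can be packaged as a helper. This is just a convenience sketch; the exercises below compute it inline:
def correlation(tbl, x_label, y_label):
    """Correlation coefficient between two columns, computed via standard units."""
    x = tbl.column(x_label)
    y = tbl.column(y_label)
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)
correlation(movies, "outer", "space")   # matches outer_space_r above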
Choose two different words in the dataset (other than "outer" and "space") with a correlation higher than 0.2 or smaller than -0.2, and plot a scatter plot with a line of best fit for them. The code to plot the scatter plot and line of best fit is given for you; you just need to calculate the correct values for r, slope, and intercept.
Hint: It's easier to think of words with a positive correlation, i.e. words that are often mentioned together.
Hint 2: Try to think of common phrases or idioms.
word_x = "soldier"
word_y = "war"
# These arrays should make your code cleaner!
arr_x = movies.column(word_x)
arr_y = movies.column(word_y)
x_su = (arr_x - np.mean(arr_x)) / np.std(arr_x)
y_su = (arr_y - np.mean(arr_y)) / np.std(arr_y)
r = np.mean(x_su * y_su)
slope = r * np.std(arr_y) / np.std(arr_x)
intercept = np.mean(arr_y) - slope * np.mean(arr_x)
# DON'T CHANGE THESE LINES OF CODE
movies.scatter(word_x, word_y)
max_x = max(movies.column(word_x))
plots.title(f"Correlation: {r}, magnitude greater than .2: {abs(r) >= 0.2}")
plots.plot([0, max_x * 1.3], [intercept, intercept + slope * (max_x*1.3)], color='gold');
We're going to use our movies dataset for two purposes: first, to train movie genre classifiers, and second, to test the performance of those classifiers. Hence, we need two different datasets: training and test.
The purpose of a classifier is to classify unseen data that is similar to the training data. Therefore, we must ensure that there are no movies that appear in both sets. We do so by splitting the dataset randomly. The dataset has already been permuted randomly, so it's easy to split. We just take the top for training and the rest for test.
Run the code below (without changing it) to separate the datasets into two tables.
# Here we have defined the proportion of our data
# that we want to designate for training as 17/20ths
# of our total dataset. 3/20ths of the data is
# reserved for testing.
training_proportion = 17/20
num_movies = movies.num_rows
num_train = int(num_movies * training_proportion)
num_test = num_movies - num_train
train_movies = movies.take(np.arange(num_train))
test_movies = movies.take(np.arange(num_train, num_movies))
print("Training: ", train_movies.num_rows, ";",
"Test: ", test_movies.num_rows)
Training: 314 ; Test: 56
Draw a horizontal bar chart with two bars that show the proportion of Comedy movies in each dataset. Complete the function comedy_proportion first; it should help you create the bar chart.
def comedy_proportion(table):
    # Return the proportion of movies in a table that have the Comedy genre.
    return table.where('Genre', 'comedy').num_rows / table.num_rows
# Your solution may take multiple lines. Start by creating a table.
# If you get stuck, think about what sort of table you need for barh to work
dataset_array = make_array('Training', 'Test')
proportions_array = make_array(comedy_proportion(train_movies), comedy_proportion(test_movies))
comedy_proportions = Table().with_columns('Dataset', dataset_array, 'Proportion', proportions_array)
comedy_proportions.barh('Dataset', 'Proportion')
K-Nearest Neighbors (k-NN) is a classification algorithm. Given some numerical attributes (also called features) of an unseen example, it decides whether that example belongs to one or the other of two categories based on its similarity to previously seen examples. Predicting the category of an example is called labeling, and the predicted category is also called a label.
The features we have for each movie are the proportions of its words that are particular words, and the labels are two movie genres: comedy and thriller. The algorithm requires many previously seen examples for which both the features and labels are known: that's the train_movies table.
To build understanding, we're going to visualize the algorithm instead of just describing it.
In k-NN, we classify a movie by finding the k movies in the training set that are most similar according to the features we choose. We call those movies with similar features the nearest neighbors. The k-NN algorithm assigns the movie to the most common category among its k nearest neighbors.
Let's limit ourselves to just 2 features for now, so we can plot each movie. The features we will use are the proportions of the words "water" and "feel" in the movie. Taking the movie Monty Python and the Holy Grail (in the test set), 0.000804074 of its words are "water" and 0.0010721 are "feel". This movie appears in the test set, so let's imagine that we don't yet know its genre.
First, we need to make our notion of similarity more precise. We will say that the distance between two movies is the straight-line distance between them when we plot their features in a scatter diagram.
This distance is called the Euclidean ("yoo-KLID-ee-un") distance, whose formula is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
For example, in the movie Clerks. (in the training set), 0.00016293 of all the words in the movie are "water" and 0.00154786 are "feel". Its distance from Monty Python and the Holy Grail on this 2-word feature set is $\sqrt{(0.000804074 - 0.000162933)^2 + (0.0010721 - 0.00154786)^2} \approx 0.000798379$. (If we included more or different features, the distance could be different.)
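We can verify that arithmetic directly, using the proportions quoted above:
# Distance between "monty python and the holy grail" and "clerks." on (water, feel).
np.sqrt((0.000804074 - 0.000162933)**2 + (0.0010721 - 0.00154786)**2)   # about 0.000798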
A third movie, The Avengers (in the training set), has a "water" proportion of 0 and a "feel" proportion of 0.00103173.
The function below creates a plot to display the "water" and "feel" features of a test movie and some training movies. As you can see in the result, Monty Python and the Holy Grail is more similar to Clerks. than to The Avengers based on these features, which makes sense: both are comedy movies, while The Avengers is a thriller.
# Just run this cell.
def plot_with_two_features(test_movie, training_movies, x_feature, y_feature):
"""Plot a test movie and training movies using two features."""
test_row = row_for_title(test_movie)
distances = Table().with_columns(
x_feature, [test_row.item(x_feature)],
y_feature, [test_row.item(y_feature)],
'Color', ['unknown'],
'Title', [test_movie]
)
for movie in training_movies:
row = row_for_title(movie)
distances.append([row.item(x_feature), row.item(y_feature), row.item('Genre'), movie])
distances.scatter(x_feature, y_feature, group='Color', labels='Title', s=30)
training = ["clerks.", "the avengers"]
plot_with_two_features("monty python and the holy grail", training, "water", "feel")
plots.axis([-0.001, 0.0011, -0.004, 0.008]);
Compute the Euclidean distance (defined in the section above) between the two movies Monty Python and the Holy Grail and The Avengers, using the water and feel features only. Assign it the name one_distance.
Note: If you have a row, you can use item to get a value from a column by its name. For example, if r is a row, then r.item("Genre") is the value in column "Genre" in row r.
Hint: Remember the function row_for_title, redefined for you below.
title_index = movies.index_by('Title')
python = row_for_title("monty python and the holy grail")
avengers = row_for_title("the avengers")
one_distance = ((python.item('water') - avengers.item('water'))**2 + (python.item('feel')- avengers.item('feel'))**2) ** 0.5
one_distance
0.0008050869157478146
grader.check("q2_1_1")
q2_1_1
passed!
Below, we've added a third training movie, The Silence of the Lambs. Before, the point closest to Monty Python and the Holy Grail was Clerks., a comedy movie. However, now the closest point is The Silence of the Lambs, a thriller movie.
training = ["clerks.", "the avengers", "the silence of the lambs"]
plot_with_two_features("monty python and the holy grail", training, "water", "feel")
plots.axis([-0.001, 0.0011, -0.004, 0.008]);
Complete the function distance_two_features that computes the Euclidean distance between any two movies, using two features. The last two lines call your function to show that Monty Python and the Holy Grail is closer to The Silence of the Lambs than it is to Clerks.
def distance_two_features(title0, title1, x_feature, y_feature):
"""Compute the distance between two movies with titles title0 and title1
Only the features named x_feature and y_feature are used when computing the distance.
"""
row0 = row_for_title(title0)
row1 = row_for_title(title1)
return ((row0.item(x_feature) - row1.item(x_feature))**2 + (row0.item(y_feature) - row1.item(y_feature))**2) ** 0.5
for movie in make_array("clerks.", "the silence of the lambs"):
movie_distance = distance_two_features(movie, "monty python and the holy grail", "water", "feel")
print(movie, 'distance:\t', movie_distance)
clerks. distance: 0.0007983810687227716
the silence of the lambs distance: 0.00022256314855564847
grader.check("q2_1_2")
q2_1_2
passed!
Define the function distance_from_python so that it works as described in its documentation.
Note: Your solution should not use arithmetic operations directly. Instead, it should make use of existing functionality above!
def distance_from_python(title):
"""The distance between the given movie and "monty python and the holy grail",
based on the features "water" and "feel".
This function takes a single argument:
title: A string, the name of a movie.
"""
return distance_two_features("monty python and the holy grail", title, "water", "feel")
grader.check("q2_1_3")
q2_1_3
passed!
Using the features "water" and "feel", what are the names and genres of the 5 movies in the training set closest to Monty Python and the Holy Grail? To answer this question, make a table named close_movies containing those 5 movies with columns "Title", "Genre", "water", and "feel", as well as a column called "distance from python" that contains each movie's distance from Monty Python and the Holy Grail. The table should be sorted in ascending order by "distance from python".
# Your solution may take multiple lines.
distances_from_python = make_array()
for title in train_movies.column('Title'):
    distances_from_python = np.append(distances_from_python, distance_from_python(title))
close_movies = (train_movies
                .with_column("distance from python", distances_from_python)
                .select("Title", "Genre", "water", "feel", "distance from python")
                .sort("distance from python")
                .take(np.arange(5)))
close_movies
Title | Genre | water | feel | distance from python |
---|---|---|---|---|
alien | thriller | 0.00070922 | 0.00124113 | 0.000193831 |
tomorrow never dies | thriller | 0.000888889 | 0.000888889 | 0.00020189 |
the silence of the lambs | thriller | 0.000595948 | 0.000993246 | 0.000222563 |
innerspace | comedy | 0.000522193 | 0.00104439 | 0.00028324 |
some like it hot | comedy | 0.000528541 | 0.000951374 | 0.00030082 |
grader.check("q2_1_4")
q2_1_4
passed!
Next, we'll classify Monty Python and the Holy Grail based on the genres of the closest movies. To do so, define the function most_common so that it works as described in its documentation below.
def most_common(label, table):
"""The most common element in a column of a table.
This function takes two arguments:
label: The label of a column, a string.
table: A table.
It returns the most common value in that column of that table.
In case of a tie, it returns any one of the most common values.
"""
return table.group(label).sort('count', descending=True).column(0).item(0)
# Calling most_common on your table of 5 nearest neighbors classifies
# "monty python and the holy grail" as a thriller movie, 3 votes to 2.
most_common('Genre', close_movies)
'thriller'
grader.check("q2_1_5")
q2_1_5
passed!
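For instance, on a toy table (hypothetical values), the function behaves as expected:
toy = Table().with_column('Genre', make_array('comedy', 'thriller', 'comedy'))
most_common('Genre', toy)   # 'comedy' (2 votes to 1)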
Write a function called distance to compute the Euclidean distance between two arrays of numerical features (e.g. arrays of the proportions of times that different words appear). The function should be able to calculate the Euclidean distance between two arrays of arbitrary (but equal) length.
Next, use the function you just defined to compute the distance between the first and second movie in the training set using all of the features. (Remember that the first five columns of your tables are not features.)
Note: To convert rows to arrays, use np.array. For example, if t is a table, np.array(t.row(0)) converts row 0 of t into an array.
Note: If you're working offline, depending on the versions of your packages, you may need to convert rows to arrays using the following instead: np.array(list(t.row(0)))
def distance(features_array1, features_array2):
"""The Euclidean distance between two arrays of feature values."""
return np.sqrt(np.sum((features_array1 - features_array2)**2))
first_movie = np.array(train_movies.drop(np.arange(0, 5)).row(0))
second_movie = np.array(train_movies.drop(np.arange(0, 5)).row(1))
distance_first_to_second = distance(first_movie, second_movie)
distance_first_to_second
0.03335446890881317
grader.check("q3_0")
q3_0
passed!
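As a quick consistency check (a sketch reusing the python and avengers rows defined earlier), distance applied to two-element feature arrays should agree with one_distance from question 2.1.1:
python_features = make_array(python.item('water'), python.item('feel'))
avengers_features = make_array(avengers.item('water'), avengers.item('feel'))
distance(python_features, avengers_features)   # should equal one_distance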
Unfortunately, using all of the features has some downsides. One clear downside is computational -- computing Euclidean distances just takes a long time when we have lots of features. You might have noticed that in the last question!
So we're going to select just 20. We'd like to choose features that are very discriminative. That is, features which lead us to correctly classify as much of the test set as possible. This process of choosing features that will make a classifier work well is sometimes called feature selection, or, more broadly, feature engineering.
In this question, we will help you get started on selecting more effective features for distinguishing comedy from thriller movies. The plot below (generated for you) shows the average frequency with which each word occurs in comedy movies on the horizontal axis and the average frequency with which it occurs in thriller movies on the vertical axis.
Note: The line graphed is the line of best fit, NOT the line y = x.
The following questions ask you to interpret the plot above. For each question, select one of the following choices and assign its number to the provided name.
1. The word is common in both comedy and thriller movies
2. The word is uncommon in comedy movies and common in thriller movies
3. The word is common in comedy movies and uncommon in thriller movies
4. The word is uncommon in both comedy and thriller movies
5. It is not possible to say from the plot
What properties does a word in the bottom left corner of the plot have? Your answer should be a single integer from 1 to 5, corresponding to the correct statement from the choices above.
bottom_left = 4
grader.check("q3_1_1")
q3_1_1
passed!
Question 3.1.2
What properties does a word in the bottom right corner have?
bottom_right = 3
grader.check("q3_1_2")
q3_1_2
passed!
Question 3.1.3
What properties does a word in the top right corner have?
top_right = 1
grader.check("q3_1_3")
q3_1_3
passed!
Question 3.1.4
What properties does a word in the top left corner have?
top_left = 2
grader.check("q3_1_4")
q3_1_4
passed!
Question 3.1.5
If we see a movie with a lot of words that are common for comedy movies but uncommon for thriller movies, what would be a reasonable guess about the genre of the movie? Assign movie_genre_guess to the number corresponding to your answer:
1. It is a thriller movie.
2. It is a comedy movie.
movie_genre_guess = 2
grader.check("q3_1_5")
q3_1_5
passed!
Using the plot above, make an array of at least 10 common words that you think might let you distinguish between comedy and thriller movies. Make sure to choose words that are frequent enough that every movie contains at least one of them. Don't just choose the most frequent words, though; you can do much better.
You might want to come back to this question later to improve your list, once you've seen how to evaluate your classifier.
# Set my_features to an array of at least 10 features (strings that are column labels)
my_features = make_array('kill', 'mari', 'uh', 'well', 'dead', 'love', 'realli', 'great', 'yeah', 'gun')
# Select those features from both the train and test sets
train_my_features = train_movies.select(my_features)
test_my_features = test_movies.select(my_features)
grader.check("q3_1_6")
q3_1_6
passed!
This test makes sure that you have chosen words such that at least one appears in each movie. If you can't find words that satisfy this test just through intuition, try writing code to print out the titles of movies that do not contain any words from your list, then look at the words they do contain.
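One way to write that check (a sketch, assuming the my_features array defined above):
# Titles of movies that contain none of the chosen words.
total_proportion = sum(movies.column(word) for word in my_features)
movies.where(total_proportion == 0).column('Title')   # should be an empty array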
In two sentences or less, describe how you selected your features.
I selected my features by choosing words that were farther from the line of best fit, since that means they are more closely associated with one genre over the other.
Next, let's classify the first movie from our test set using these features. You can examine the movie by running the cells below. Do you think it will be classified correctly?
print("Movie:")
test_movies.take(0).select('Title', 'Genre').show()
print("Features:")
test_my_features.take(0).show()
Movie:
Title | Genre |
---|---|
new nightmare | thriller |
Features:
kill | mari | uh | well | dead | love | realli | great | yeah | gun |
---|---|---|---|---|---|---|---|---|---|
0.000729129 | 0 | 0 | 0.00401021 | 0.000364564 | 0.00109369 | 0.00401021 | 0.00109369 | 0.00109369 | 0 |
As before, we want to look for the movies in the training set that are most like our test movie. We will calculate the Euclidean distances from the test movie (using my_features) to all movies in the training set. You could do this with a for loop, but to make it computationally faster, we have provided a function, fast_distances, to do this for you. Read its documentation to make sure you understand what it does. (You don't need to understand the code in its body unless you want to.)
# Just run this cell to define fast_distances.
def fast_distances(test_row, train_table):
"""Return an array of the distances between test_row and each row in train_rows.
Takes 2 arguments:
test_row: A row of a table containing features of one
test movie (e.g., test_my_features.row(0)).
train_table: A table of features (for example, the whole
table train_my_features)."""
assert train_table.num_columns < 50, "Make sure you're not using all the features of the movies table."
counts_matrix = np.asmatrix(train_table.columns).transpose()
diff = np.tile(np.array(list(test_row)), [counts_matrix.shape[0], 1]) - counts_matrix
np.random.seed(0) # For tie breaking purposes
distances = np.squeeze(np.asarray(np.sqrt(np.square(diff).sum(1))))
eps = np.random.uniform(size=distances.shape)*1e-10 #Noise for tie break
distances = distances + eps
return distances
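For intuition, here is a loop-based sketch of what fast_distances computes. It is much slower and omits the tie-breaking noise, so it is for illustration only:
def slow_distances(test_row, train_table):
    """Loop-based equivalent of fast_distances (illustration only; no tie-breaking noise)."""
    test_features = np.array(list(test_row))
    dists = make_array()
    for i in np.arange(train_table.num_rows):
        row_features = np.array(list(train_table.row(i)))
        dists = np.append(dists, distance(test_features, row_features))
    return dists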
Use the fast_distances function provided above to compute the distance from the first movie in the test set to all the movies in the training set, using your set of features. Make a new table called genre_and_distances with one row for each movie in the training set and two columns:
- the "Genre" of the training movie
- the "Distance" from the first movie in the test set
Ensure that genre_and_distances is sorted in ascending order by distance to the first test movie.
# Your solution may take multiple lines of code.
genre_and_distances = (train_movies
    .with_column('Distance', fast_distances(test_my_features.row(0), train_my_features))
    .select('Genre', 'Distance')
    .sort('Distance'))
genre_and_distances
Genre | Distance |
---|---|
comedy | 0.00153825 |
comedy | 0.0016134 |
comedy | 0.00190262 |
comedy | 0.00192253 |
thriller | 0.0021041 |
comedy | 0.00215782 |
thriller | 0.00218006 |
comedy | 0.00242911 |
thriller | 0.00244957 |
thriller | 0.00245682 |
... (304 rows omitted)
grader.check("q3_1_8")
q3_1_8
passed!
Now compute the 7-nearest neighbors classification of the first movie in the test set. That is, decide on its genre by finding the most common genre among its 7 nearest neighbors in the training set, according to the distances you've calculated. Then check whether your classifier chose the right genre. (Depending on the features you chose, your classifier might not get this movie right, and that's okay.)
# Set my_assigned_genre to the most common genre among these.
closest_7 = genre_and_distances.take(np.arange(7))
my_assigned_genre = most_common('Genre', closest_7)
# Set my_assigned_genre_was_correct to True if my_assigned_genre
# matches the actual genre of the first movie in the test set.
my_assigned_genre_was_correct = my_assigned_genre == test_movies.column('Genre').item(0)
print("The assigned genre, {}, was{}correct.".format(my_assigned_genre, " " if my_assigned_genre_was_correct else " not "))
The assigned genre, comedy, was not correct.
grader.check("q3_1_9")
q3_1_9
passed!
Now we can write a single function that encapsulates the whole process of classification.
Write a function called classify. It should take the following four arguments:
1. A row of features for the movie to classify (e.g., test_my_features.row(0)).
2. A table of features for the training movies, one row per movie (e.g., train_my_features).
3. An array of labels for the training movies, one per row of the feature table.
4. k, the number of neighbors to use in classification.
It should return the class a k-nearest neighbor classifier picks for the given row of features (the string 'comedy' or the string 'thriller').
def classify(test_row, train_rows, train_labels, k):
    """Return the most common class among the k nearest neighbors to test_row."""
    distances = fast_distances(test_row, train_rows)
    genre_and_distances = Table().with_columns('Genre', train_labels, 'Distance', distances).sort('Distance')
    return most_common('Genre', genre_and_distances.take(np.arange(k)))
grader.check("q3_2_1")
q3_2_1
passed!
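For example, this call repeats the 7-nearest-neighbor classification of the first test movie from question 3.1.9, using the features defined above:
classify(test_my_features.row(0), train_my_features, train_movies.column('Genre'), 7)
# 'comedy', matching my_assigned_genre above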
Assign tron_genre to the genre predicted by your classifier for the movie "tron" in the test set, using 13 neighbors and using your chosen features.
# Define a row called tron_features.
tron_features = test_movies.where('Title', are.equal_to('tron')).select(my_features).row(0)
tron_genre = classify(tron_features, train_my_features, train_movies.column('Genre'), 13)
tron_genre
'comedy'
grader.check("q3_2_2")
q3_2_2
passed!
Finally, when we evaluate our classifier, it will be useful to have a classification function that is specialized to use a fixed training set and a fixed value of k.
Create a classification function that takes as its argument a row containing your chosen features and classifies that row using the 13-nearest neighbors algorithm with train_my_features as its training set.
def classify_feature_row(row):
return classify(row, train_my_features, train_movies.column('Genre'), 13)
# When you're done, this should produce 'thriller' or 'comedy'.
classify_feature_row(test_my_features.row(0))
'thriller'
grader.check("q3_2_3")
q3_2_3
passed!
Now that it's easy to use the classifier, let's see how accurate it is on the whole test set.
Question 3.3.1. Use classify_feature_row and apply to classify every movie in the test set. Assign these guesses as an array to test_guesses. Then, compute the proportion of correct classifications.
test_guesses = test_my_features.apply(classify_feature_row)
proportion_correct = np.sum(test_movies.column('Genre') == test_guesses) / test_movies.num_rows
proportion_correct
0.75
grader.check("q3_3_1")
q3_3_1
passed!
Question 3.3.2. An important part of evaluating your classifiers is figuring out where they make mistakes. Assign the name test_movie_correctness to a table with three columns, 'Title', 'Genre', and 'Was correct'. The last column should contain True or False depending on whether or not the movie was classified correctly.
# Feel free to use multiple lines of code
# but make sure to assign test_movie_correctness to the proper table!
test_movie_correctness = test_movies.select('Title', 'Genre').with_column(
    'Was correct', test_movies.column('Genre') == test_guesses)
test_movie_correctness
test_movie_correctness.sort('Was correct', descending = True).show(56)
Title | Genre | Was correct |
---|---|---|
new nightmare | thriller | True |
the body snatcher | thriller | True |
godzilla | thriller | True |
rear window | thriller | True |
u turn | thriller | True |
jason goes to hell: the final friday | thriller | True |
the crow: salvation | thriller | True |
ed wood | comedy | True |
storytelling | comedy | True |
halloween h20: 20 years later | thriller | True |
gone in sixty seconds | thriller | True |
the butterfly effect | thriller | True |
edtv | comedy | True |
black rain | thriller | True |
bringing out the dead | thriller | True |
basic | thriller | True |
detroit rock city | comedy | True |
panic room | thriller | True |
juno | comedy | True |
my girl 2 | comedy | True |
nick of time | thriller | True |
airplane ii: the sequel | comedy | True |
suburbia | comedy | True |
body of evidence | thriller | True |
twelve monkeys | thriller | True |
the game | thriller | True |
sleepy hollow | thriller | True |
his girl friday | comedy | True |
mulholland dr. | thriller | True |
spare me | thriller | True |
annie hall | comedy | True |
jackie brown | thriller | True |
monty python and the holy grail | comedy | True |
star trek: the wrath of khan | thriller | True |
batman returns | thriller | True |
suspect zero | thriller | True |
sphere | thriller | True |
o brother where art thou? | comedy | True |
what lies beneath | thriller | True |
wonder boys | comedy | True |
the war of the worlds | thriller | True |
what women want | comedy | True |
the grifters | thriller | False |
smoke | comedy | False |
mystery of the wax museum | thriller | False |
fast times at ridgemont high | comedy | False |
the fifth element | thriller | False |
hannibal | thriller | False |
misery | thriller | False |
smokey and the bandit | comedy | False |
backdraft | thriller | False |
tron | thriller | False |
happy birthday wanda june | comedy | False |
cruel intentions | thriller | False |
the thin man | comedy | False |
three kings | comedy | False |
grader.check("q3_3_2")
q3_3_2
passed!
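One way to look for patterns (a sketch using the table above) is to group the misclassified movies by genre:
test_movie_correctness.where('Was correct', False).group('Genre')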
Question 3.3.3. Do you see a pattern in the types of movies your classifier misclassifies? In two sentences or less, describe any patterns you see in the results or any other interesting findings from the table above. If you need some help, try looking up the movies that your classifier got wrong on Wikipedia.
The classifier tends to associate violent words with thrillers, which is why movies such as 'Happy Birthday Wanda June' were misclassified as thrillers.
At this point, you've gone through one cycle of classifier design. Let's summarize the steps:
1. Split the available data into a training set and a test set.
2. Choose a classification algorithm (here, k-nearest neighbors).
3. Identify some features.
4. Define a classifier function using your features and the training set.
5. Evaluate its performance (the proportion of correct classifications) on the test set.
Now that you know how to evaluate a classifier, it's time to build a better one.
Develop a classifier with better test-set accuracy than classify_feature_row. Your new function should have the same arguments as classify_feature_row and return a classification. Name it another_classifier. Then, check your accuracy using code from earlier.
You can use more or different features, or you can try different values of k. (Of course, you still have to use train_movies as your training set!)
Make sure you don't reassign any previously used variables here, such as proportion_correct from the previous question.
# To start you off, here's a list of possibly-useful features
# Feel free to add or change this array to improve your classifier
new_features = make_array("laugh", "marri", "dead", "heart", "cop", 'kill', 'uh', 'well', 'love', 'realli', 'great', 'yeah', 'gun')
train_new = train_movies.select(new_features)
test_new = test_movies.select(new_features)
def another_classifier(row):
return classify(row, train_new, train_movies.column('Genre'), 20)
new_guesses = test_new.apply(another_classifier)
new_correct = np.sum(test_movies.column('Genre') == new_guesses) / test_movies.num_rows
new_correct
0.7857142857142857
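To choose k, one can sweep a few values and compare test accuracy (a sketch; note that repeatedly tuning against the test set risks overfitting to it):
for k in make_array(5, 9, 13, 17, 21):
    guesses = test_new.apply(lambda row: classify(row, train_new, train_movies.column('Genre'), k))
    print(k, np.mean(test_movies.column('Genre') == guesses))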
Question 4.2
Do you see a pattern in the mistakes your new classifier makes? What about in the improvement from your first classifier to the second one? Describe in two sentences or less.
Hint: You may not be able to see a pattern.
Two more movies were classified correctly, but movies with themes that aren't typical for their genre, such as 'Happy Birthday Wanda June', were still misclassified.
Question 4.3
Briefly describe what you tried to improve your classifier.
We added more words to the features array and increased the number of nearest neighbors.