Are Movies Getting Better or Worse?

Sun 26 January 2014

Inspired by a recent post by Greg Reda where he analyzes IMDB data to answer the question "what date in history has an equal number of films made before and after it?" (read the post to find out!), I decided to do some data exploration of the IMDB data myself. I downloaded the ratings data for all the movies listed on IMDB to try to answer the question: Are movies getting better or worse over time? The hypothesis is that the average user rating of the movies released in a given year is a good proxy for the quality of the movies released that year. In this exercise, we'll see whether this average rating is getting better, getting worse, or staying the same.

First, let's download the data set and unzip it -

gunzip ratings.list.gz

The data file contains some documentation, the Top 250 list, the Bottom 10 list, the full data set and some more documentation. In addition to movie ratings, it also contains data on TV shows and multi-episode TV mini-series. Let's clean all this up by removing the extraneous lines from the start and the end of the file and filtering out the TV/mini-series data. We can easily do this from the command line itself -

tail -n+297 ratings.list | head -n -147 | grep -v -e '{.*}' | grep -v '(TV)' > ratings_cleaned.list
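The cleaned file is still fixed-width: each record carries a vote distribution, a vote count, the rating, and finally the title with the release year in parentheses. A quick way to sanity-check the column positions is to slice an illustrative line (the line below is made up, but follows the layout of ratings.list):

```python
import re

# Illustrative (made-up) record in the fixed-width layout of ratings.list:
# the first field is the vote distribution, then the vote count,
# the rating sits at columns 27-29, and the title (with year) follows.
line = "      0000000125  1408336  9.2  An Example Movie (1994)"

rating = line[27:30]                                  # fixed-width slice of the rating
year = re.search(r'\([0-9]+\)', line).group(0)[1:-1]  # strip the parentheses around the year

print(rating, year)
```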

To answer the question we're concerned with, we need the year of release for a given movie and its corresponding rating on IMDB. The clean data file still contains data that we don't care about, and the year of release is not cleanly separated into its own column. We thus extract the information we want using the Python script below and save the result in a clean CSV file. I could have done this in R itself, but I'm a firm believer in using the right tool for the job, and Python is simply better for text processing.

import re

of = open('ratings_cleaned.list')

for line in of.readlines():
        rating = line[27:30]  # the rating sits in a fixed-width column
        year = re.search('\([0-9]+\)', line).group(0)[1:-1]  # year is in parentheses
        print rating + "," + year

Save the script (as, say, extract_ratings.py - the name is up to you) and redirect its output to a CSV file:

python extract_ratings.py > ratings_cleaned.csv
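Some titles in the list carry an unknown year (IMDB marks these with question marks), in which case re.search returns None and .group raises an error. A slightly more defensive version of the extraction (a sketch - the function name and sample lines below are made up for illustration) simply skips such records:

```python
import re

YEAR_RE = re.compile(r'\(([0-9]{4})\)')

def extract(lines):
    """Yield (rating, year) pairs, skipping records without a 4-digit year."""
    for line in lines:
        rating = line[27:30]          # rating lives in a fixed-width column
        m = YEAR_RE.search(line)
        if m is None:                 # e.g. titles listed with an unknown year
            continue
        yield rating, m.group(1)

sample = [
    "      0000000125  1408336  9.2  An Example Movie (1994)",
    "      0000000125      336  6.1  Unknown Year Movie (????)",
]
print(list(extract(sample)))  # only the first line yields a pair
```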

Now we're ready to do some analysis in R. Let's read in the newly created CSV file and format the data properly into a data frame. We filter out the years before 1900 to narrow down the scope.

movies <- read.csv("ratings_cleaned.csv", header = F)
names(movies) <- c("Rating", "Year")
movies <- movies[movies$Year >= 1900, ]
head(movies)
##   Rating Year
## 1    6.3 2006
## 2    8.5 2013
## 3    6.6 2012
## 4    6.3 2011
## 5    6.3 2010
## 6    6.2 1986

To answer the question we started with, let's aggregate the data by year and calculate the yearly average of the ratings -

avg_by_year <- aggregate(Rating ~ Year, data = movies, mean)
names(avg_by_year) <- c("Year", "AvgRating")
head(avg_by_year)
##   Year AvgRating
## 1 1900     5.070
## 2 1901     5.296
## 3 1902     5.287
## 4 1903     5.078
## 5 1904     5.131
## 6 1905     5.249
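The same per-year averaging can also be done without leaving Python. A minimal sketch using only the standard library (the inline rows below stand in for the contents of the CSV, purely for illustration):

```python
from collections import defaultdict

# (rating, year) rows as produced by the extraction step; inlined here for illustration
rows = [("6.3", "2006"), ("8.5", "2013"), ("6.6", "2012"), ("6.1", "2006")]

totals = defaultdict(lambda: [0.0, 0])   # year -> [sum of ratings, count]
for rating, year in rows:
    totals[year][0] += float(rating)
    totals[year][1] += 1

avg_by_year = {year: s / n for year, (s, n) in sorted(totals.items())}
print(avg_by_year)
```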

To make it easy to visualize, let's plot a graph of the year and the average ratings of the movies released in that year -

library(ggplot2)
ggplot(avg_by_year, aes(x = Year, y = AvgRating)) + geom_line(lwd = 1.2) + theme_bw()

(Plot: average IMDB rating by year of release, 1900 onwards)

Aha! If we are to believe the IMDB ratings, it seems like the average movie rating is actually going up. If we use this as a proxy, we can postulate that the quality of movies is getting better over time.

Of course, the data is not perfect. There's likely recency bias as well as availability bias in the data set (note: IMDB does account for the number of reviews when calculating the weighted rating). This post is just meant to illustrate a fun little exercise in data analysis!
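For the curious: the weighted rating IMDB has published for its Top 250 is a Bayesian-style estimate, WR = (v/(v+m))*R + (m/(v+m))*C, where R is the movie's mean rating, v its number of votes, m a minimum-votes threshold, and C the mean rating across all movies. A quick sketch (the m and C values below are made up for illustration):

```python
def weighted_rating(R, v, m, C):
    """Bayesian-style weighted rating: pulls low-vote movies toward the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical numbers: a 9.0-rated movie with only 50 votes vs. one with 50,000 votes,
# with a minimum-votes threshold m=1000 and a global mean C=6.9.
print(weighted_rating(9.0, 50, 1000, 6.9))     # dragged toward the global mean
print(weighted_rating(9.0, 50000, 1000, 6.9))  # stays close to its raw 9.0
```

This is why a rarely-voted obscure title can't sit at the top of the rankings on a handful of enthusiastic reviews alone.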