Do you like movies? Hollywood blockbusters or independent films? If you are a start-up production company looking for countries in which your new films would be successful, how can you gain insight into those target markets? Or perhaps you are simply a movie lover who wants ideas for films similar to the ones you already like.
To answer such questions, we carried out research on which movie topics are popular and how diverse those topics are. We investigated the distribution of topics across the world, and we describe some interesting results of the topic analyses here. We hope you enjoy it and make some interesting discoveries!
Mining of Plot Summaries
Our idea was to gather metadata about movies from the web, primarily plot summaries, and to analyze it with machine learning algorithms, specifically topic modeling (Latent Dirichlet Allocation) and distributed representation of words (Doc2Vec), to uncover the characteristics, themes, and topics of thousands of movies. You can find our detailed analysis and code here.
What are the popular movie topics across several movie-loving countries?
The topics of popular movies are an important indicator of the interests or issues of a society, because the films we make and watch are inevitably affected by our social, political, and historical background. We looked at the plots of the 50 most popular films in each of 6 countries and identified a wide spectrum of topics, from 'love, relationship' to 'war', 'terrorism', and 'press, politics'. The top 2 topics represent our relationships with others (love, family, friendship), indicating that these are at the heart of our lives. Two of the top 5 topics are related to 'crime', showing an interest in the darker side of life across the world.
Are there any differences in popular topics across countries?
Comparing the popular topic distribution across the 6 countries (click the square box "Global Top 10 Topic Distribution by Country" in the above chart), we could see that the popularity of different topics is quite similar across the countries. We noticed one interesting difference for South Korea: the most popular topic in South Korea is 'crime, mystery', while 'love, family' is the most popular in the other countries.
How do topics vary across the countries the movies were produced in?
To gain more insight, we took a different viewpoint: if the movies produced in a specific country show a distinctive interest in particular topics, this can be interpreted as a result of preferences or common socio-political interests in that country. We therefore analyzed the topic distribution by country of origin for movies produced in about 70 countries since 2007.
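The aggregation behind a per-country topic distribution can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the analysis: it assumes each movie already carries a country of origin and a dictionary of topic weights (for example, as produced by a topic model), and the function names, record layout, and toy numbers are all hypothetical.

```python
from collections import defaultdict


def topic_share_by_country(movies):
    """Average the per-movie topic weights within each country of origin."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for movie in movies:
        country = movie["country"]
        counts[country] += 1
        for topic, weight in movie["topics"].items():
            sums[country][topic] += weight
    # Dividing by the number of movies gives the mean weight per topic.
    return {
        country: {topic: total / counts[country] for topic, total in topics.items()}
        for country, topics in sums.items()
    }


def rank_topics(shares, top_n=10):
    """Return topic labels ordered from highest to lowest average weight."""
    return sorted(shares, key=shares.get, reverse=True)[:top_n]


# Hypothetical toy records; the weights are made up for illustration only.
movies = [
    {"country": "Country A", "topics": {"love, relationship": 0.6, "crime, police, underworld": 0.4}},
    {"country": "Country A", "topics": {"love, relationship": 0.3, "crime, police, underworld": 0.7}},
    {"country": "Country B", "topics": {"war": 0.8, "love, relationship": 0.2}},
]

shares = topic_share_by_country(movies)
ranking_a = rank_topics(shares["Country A"])
ranking_b = rank_topics(shares["Country B"])
```

Averaging weights, rather than counting only each movie's single strongest topic, keeps secondary themes visible in the country profile; either choice would be a reasonable one for this kind of comparison.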
The above map shows the number of movies produced in each country. The size of a bubble represents the total number of movies produced in that country. As expected, the United States is the largest producer of movies in the world, and Canada, the United Kingdom, Germany, and India also produce a considerable number of movies. You can see the topic distribution of each country in detail by clicking its bubble on the map, and you can order the topics by hovering over the name of a country in the bar chart.
Among those countries, India shows somewhat different behavior. Indian movies show a relatively high preference for the topic 'crime, police, underworld', which ranked 4th there while it did not rank in the top 10 for the other countries. But, of course, 'love, relationship' is also big in India. Movies produced in South Africa, Kenya, Egypt, and Cuba show the highest preference for 'society, culture'. This could be attributed to underlying conflicts in their societies. Japanese, South Korean, and Russian movies show a relatively high interest in the topic 'spies, terrorism' compared to other countries. Vietnamese movies show a particularly high preference for 'war', indicating that the country's historical background may affect the topics of its movies. These observations indicate that socio-political and historical backgrounds have a significant influence on the topics of movies.
Are there any differences between the preferences of moviegoers and the preferences of movie makers?
We compared 'popular topics of movies viewed in a country' with 'prevalent topics of movies produced in a country'. Comparing the top 5 topics in each category for each country, we found that 2-3 of the top 5 topics differ. Based on these observations, we see that there are some differences between moviegoers' preferences and movie makers' preferences.
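A comparison of this kind can be sketched in a few lines. The snippet below is a minimal illustration, assuming two ranked lists of topic labels per country; the function name and the example rankings are hypothetical and do not reproduce the actual results.

```python
def top_n_difference(viewed_ranking, produced_ranking, n=5):
    """Split the top-n topics into shared, viewed-only and produced-only lists."""
    viewed = viewed_ranking[:n]
    produced = produced_ranking[:n]
    shared = [topic for topic in viewed if topic in produced]
    only_viewed = [topic for topic in viewed if topic not in produced]
    only_produced = [topic for topic in produced if topic not in viewed]
    return shared, only_viewed, only_produced


# Hypothetical rankings for one country, most popular topic first.
viewed = ["love, family", "love, relationship", "crime, mystery", "war", "press, politics"]
produced = ["love, relationship", "love, family", "society, culture", "crime, mystery", "spies, terrorism"]

shared, only_viewed, only_produced = top_n_difference(viewed, produced)
print(f"{len(only_viewed)} of the top 5 viewed topics are not among the top 5 produced topics")
```

The comparison ignores the order within the top 5 and only asks whether a topic appears in both lists, which matches a statement of the form "2-3 of the top 5 topics differ".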
Distributed Representation of Movie Plots
Another powerful technique in natural language processing is the distributed representation of words or collections of words. The Word2Vec algorithm and its variant Doc2Vec have become quite popular recently because their efficient implementations capture the semantic content of a corpus almost as effectively as much more complex deep learning models. For this analysis, we applied the Doc2Vec algorithm to movie plot summaries from Wikipedia.
The interactive plot below contains movies in the Wikipedia corpus from 2000 to 2015. For plotting, we used a 2-D projection (using t-SNE) of the high-dimensional vectors. Movies are colour coded by country of origin.
Further, we created another Doc2Vec model that learns a vector representation for each country from the plots of all movies from that country. Again, we used t-SNE for the 2-D visualization. European countries and Asian countries form clusters that are clearly distinct from each other.
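Once movies or countries are represented as vectors, closeness between them is commonly measured with cosine similarity. The sketch below shows that idea only; it assumes the vectors have already been learned elsewhere, and the three-dimensional vectors, labels, and function names are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name, vectors, top_n=3):
    """Rank all other entries by cosine similarity to the named entry."""
    query = vectors[name]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical low-dimensional vectors; real embeddings have many more dimensions.
vectors = {
    "country_a": [0.9, 0.1, 0.0],
    "country_b": [0.8, 0.2, 0.1],
    "country_c": [0.0, 0.2, 0.9],
}

neighbours = most_similar("country_a", vectors, top_n=2)
```

The same lookup works for individual movie vectors, which is one way to suggest films similar to one you already like.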
Data Sources
- movie metadata from IMDb (www.imdb.com)
- movie metadata from Wikipedia (wiki.dbpedia.org)
- box office top movies from Box Office Mojo (www.boxofficemojo.com)