Violence Rating Prediction from Movie Scripts

Table of Contents

  1. Introduction
  2. Dataset
  3. Methodology
     a. Features
     b. Models
  4. Experiments
  5. Results

Introduction

Violence is an important narrative tool, despite some of its ill effects. It is used to enhance a viewer’s experience, boost movie profits (Barranco, Rader, and Smith 2017; Thompson and Yokota 2004), and facilitate global market reach (Sparks, Sherry, and Lubsen 2005). Including violent content may modify a viewer’s perception of how exciting a movie is by intensifying the sense of relief when a plot-line is resolved favorably (Topel 2007; Sparks, Sherry, and Lubsen 2005).

There are many reasons why identifying violent content from movie scripts is useful:

1) Such a measure can provide filmmakers with an objective assessment of how violent a movie is

2) It can help identify subtleties that producers may otherwise not pick up on, since violence is traditionally measured through action rather than language

3) It could suggest appropriate changes to a movie script even before production begins

4) It can provide social scientists with empirical evidence for creating awareness of negative stereotypes in film

With an increase in movie production (777 movies released in 2017; up 8% from 2016 (Motion Picture Association of America 2017)), there is an increased demand for scalable tools to identify violent content. Here we describe machine learning models that characterize aspects of violent content in movies solely from the language used in scripts.

To date, most efforts to automatically detect violence in movies have used audio- and video-based classifiers (e.g., Ali and Senan 2018; Dai et al. 2015). Others have examined the effect and prevalence of violence in films using human annotations to identify violent content (e.g., Yokota and Thompson 2000; Webb et al. 2007). From a linguistic point of view, the task of analyzing violent content in movie scripts is closely related to detecting abusive language. Abusive language (Waseem et al. 2017) is an umbrella term generally used to group together offensive language, hate speech, cyberbullying, and trolling (Burnap and Williams 2015; Nobata et al. 2016).

Dataset

We use the movie scripts collected by Ramakrishna et al. (2017). This dataset contains 945 Hollywood movies from 12 different genres (1920–2016). Violent content ratings were provided by Common Sense Media (CSM). Of the 945 movie scripts in the dataset, we found 732 movies (76.64%) for which CSM had ratings. The distribution of rating labels is given in Table 1. To balance the negative skew of the rating distribution, we selected two cut-offs and encoded violent content as a three-level categorical variable (LOW: rating < 3, MED: rating = 3, HIGH: rating > 3).

| Rating | 0 | 1 | 2 | 3 | 4 | 5 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No. | 40 | 48 | 83 | 261 | 135 | 165 | 732 |
| % | 5.46 | 6.56 | 11.34 | 35.66 | 18.44 | 22.54 | 100 |

Table 1 Raw frequency (No.) and percentage (%) distribution of violence ratings.
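
The binning of CSM ratings into the three classes can be sketched as follows; this is a minimal example, and the data frame and column names are illustrative rather than those of the released corpus.

```python
import pandas as pd

# Illustrative data; the real CSM violence ratings are integers in 0-5 for 732 movies.
ratings = pd.DataFrame({"movie": ["A", "B", "C"], "csm_violence": [1, 3, 5]})

def to_three_level(rating):
    """Map a 0-5 CSM violence rating onto the three-level label used here."""
    if rating < 3:
        return "LOW"
    if rating == 3:
        return "MED"
    return "HIGH"

ratings["label"] = ratings["csm_violence"].apply(to_three_level)
print(ratings[["movie", "label"]])
```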

Methodology

Preprocessing

Movie scripts often contain both the actors' dialogue (or utterances) and scene descriptions. We preprocessed our data to keep only the actors' utterances and discarded scene descriptions (e.g., camera panning, explosion, interior, exterior).
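
The released corpus provides structured script annotations, but for raw screenplays a first-pass filter along the following lines could separate dialogue from scene description. The regular expressions assume the common screenplay layout (deeply indented, all-caps character cues; INT./EXT. scene headings) and are illustrative only.

```python
import re

SCENE_HEADING = re.compile(r"^\s*(INT\.|EXT\.|FADE|CUT TO)", re.IGNORECASE)
CHARACTER_CUE = re.compile(r"^\s{10,}[A-Z][A-Z .'\-]+\s*$")  # deeply indented, all-caps name

def extract_utterances(script_text):
    """Keep dialogue only; drop scene descriptions (camera directions, INT./EXT. headings)."""
    utterances, current, in_dialogue = [], [], False

    def flush():
        if current:
            utterances.append(" ".join(current))
            current.clear()

    for line in script_text.splitlines():
        if SCENE_HEADING.match(line):
            flush()
            in_dialogue = False
        elif CHARACTER_CUE.match(line):
            flush()
            in_dialogue = True
        elif in_dialogue and line.strip():
            current.append(line.strip())      # dialogue line spoken by the current character
        elif not line.strip():
            flush()
            in_dialogue = False
    flush()
    return utterances
```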

Features

We collected language features from all the utterances. Our features can be divided into five categories: N-grams, Linguistic and Lexical, Sentiment, Abusive Language and Distributional Semantics.

These features were obtained at two units of analysis:

1) Utterance-level: text in each utterance is considered independently to be used in sequence models

2) Movie-level: all the utterances are treated as a single document for classification models

Because movie genres are generally related to the amount of violence in a movie (e.g., romance vs. horror), we evaluated all our models by including movie genre as a one-hot encoded feature vector.

A summary of the features extracted can be found in Table 2.

| Feature | Utterance-level | Movie-level |
| --- | --- | --- |
| N-grams | TF (IDF) | TF (IDF) |
| Linguistic | TF | TF |
| Sentiment | Scores | Functionals |
| Abusive Language | TF | TF |
| Semantic | Average | Average |

Table 2 Summary of feature representation at utterance and movie level. TF: term frequency; IDF: inverse document frequency; Functionals: mean, variance, maximum, minimum, and range.

N-grams

We included unigrams and bigrams to capture the relationship between word usage and violent content. Because screenwriters often portray violence using offensive words as well as their censored versions, we included 3-, 4- and 5-character n-grams as additional features.
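
A sketch of the n-gram extraction with scikit-learn; the choice between TF and TF-IDF weighting follows Table 2, and the example strings are made up.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = ["You'll pay for this, you b****!", "Let's get out of here."]  # made-up examples

# Word unigrams and bigrams.
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
# Character 3-5 grams (within word boundaries) can still match censored spellings like "b****".
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

X_ngrams = hstack([word_vec.fit_transform(utterances), char_vec.fit_transform(utterances)])
```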

Linguistic

We include: the number of punctuation marks (periods, quotes, question marks), the number of repetitions, and the number of capitalized letters. We also include word percentages across 192 lexical categories using Empath (Fast, Chen, and Bernstein 2016).
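
One possible implementation of these counts plus the Empath categories is sketched below; the repetition count is our own interpretation, since the text does not define it precisely.

```python
import re
from empath import Empath  # pip install empath

lexicon = Empath()

def linguistic_features(utterance):
    """Surface-level counts plus Empath category proportions (a sketch of the feature set above)."""
    feats = {
        "n_periods": utterance.count("."),
        "n_quotes": utterance.count('"') + utterance.count("'"),
        "n_questions": utterance.count("?"),
        "n_capital_letters": sum(ch.isupper() for ch in utterance),
        # Approximation: count tokens that occur again later in the utterance.
        "n_repetitions": len(re.findall(r"\b(\w+)\b(?=.*\b\1\b)", utterance.lower())),
    }
    feats.update(lexicon.analyze(utterance, normalize=True))  # ~192 lexical category proportions
    return feats
```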

Sentiment

We used two sentiment analysis tools that are commonly used to process text:

* AFINN-111 (Nielsen 2011): For a particular sentence, it produces a single score (-5 to +5) by summing the valence ratings of all words.

* Valence Aware Dictionary and sEntiment Reasoner (VADER) (Gilbert 2014): A lexicon- and rule-based sentiment analyzer that produces a score (-1 to +1) for a document.

We estimated a movie-level sentiment score using the statistical functionals (mean, variance, max, min and range) across all the utterances in the script.

Additionally, we obtain the percentage of words in the lexical categories of positive and negative emotions from Empath.
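
The two utterance-level scores and the movie-level functionals could be computed along these lines. This is a sketch; note that the afinn package ships a newer word list by default, so loading AFINN-111 specifically may need extra configuration.

```python
import numpy as np
from afinn import Afinn                                                # pip install afinn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # pip install vaderSentiment

afinn = Afinn()                      # default word list is newer than AFINN-111
vader = SentimentIntensityAnalyzer()

def utterance_sentiment(utterance):
    """Two scores per utterance: summed AFINN valence and VADER compound score."""
    return [afinn.score(utterance), vader.polarity_scores(utterance)["compound"]]

def movie_sentiment(utterances):
    """Movie-level functionals (mean, variance, max, min, range) over all utterance scores."""
    s = np.array([utterance_sentiment(u) for u in utterances])
    return np.concatenate([s.mean(0), s.var(0), s.max(0), s.min(0), s.max(0) - s.min(0)])
```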

Semantics

In our feature set we include a 300-dimensional word2vec word representation trained on a large news corpus (Mikolov et al. 2013). We obtained utterance-level embeddings by averaging the word representations in an utterance. Similarly, we obtained movie-level embeddings by averaging all the utterance-level embeddings.
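
Averaging pre-trained embeddings can be done with gensim; the path to the vector file and the whitespace tokenization below are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; we use 300-dimensional word2vec vectors trained on a news corpus.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed_utterance(tokens, dim=300):
    vecs = [kv[w] for w in tokens if w in kv]          # skip out-of-vocabulary words (gensim >= 4)
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def embed_movie(utterances):
    return np.mean([embed_utterance(u.split()) for u in utterances], axis=0)
```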

Abusive Language

We collected the following Explicit Abusive Language features (a small counting sketch follows this list):

a. Number of insults and hate blacklist words from [Hatebase](http://www.hatebase.org)

b. Cross-domain lexicon of abusive words from (Wiegand et al. 2018)

c. Human annotated hate-speech terms collected by (Davidson et al. 2017)
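
A minimal counting sketch over these resources; the file names are placeholders, the real lexicons must be obtained from the sources above, and the Davidson et al. list also contains multi-word n-grams that this simple unigram matching would miss.

```python
def load_lexicon(path):
    """One term per line; lower-cased set for fast lookup."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Placeholder file names for the three resources listed above.
lexicons = {
    "hatebase": load_lexicon("hatebase_terms.txt"),
    "wiegand_2018": load_lexicon("wiegand_abusive_lexicon.txt"),
    "davidson_2017": load_lexicon("davidson_hate_ngrams.txt"),
}

def abusive_term_counts(tokens):
    """Term-frequency counts of matches against each lexicon (unigram matching only)."""
    tokens = [t.lower() for t in tokens]
    return {name: sum(t in lex for t in tokens) for name, lex in lexicons.items()}
```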

Models

We train two types of models:

a. Linear Support Vector Classifier (LinearSVC) using the movie-level features to classify a movie script into one of three categories of violent content (i.e., LOW/MED/HIGH).

b. Recurrent Neural Networks (RNN) to investigate if context can improve violence identification. 
We consider two main forms of context: conversational context, and movie genre. 
The former refers to what is being said in relation to what has been previously said. 
This follows from the fact that most utterances are not independent from one another, but rather follow a thread of conversation.
The latter takes into account that utterances in a movie follow a particular theme set by the movie's genre (e.g., action, sci-fi).
[Figure 1](#fig1) shows the proposed architecture; a minimal implementation sketch follows the figure caption.

Figure 1 Each utterance is represented as a vector of concatenated feature types. The sequence of utterances is fed to an RNN with attention, resulting in an H-dimensional representation. This vector is then concatenated with the genre representation and fed to the softmax layer for classification.
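
A minimal tf.keras sketch of this architecture. The dimensions and the specific attention formulation are assumptions; the text only states that an attention-equipped RNN produces an H-dimensional vector, which is concatenated with the genre one-hot vector and fed to a softmax layer.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_UTTERANCES, FEAT_DIM, N_GENRES, H = 200, 350, 12, 16  # illustrative sizes

utt_in = layers.Input(shape=(MAX_UTTERANCES, FEAT_DIM), name="utterance_features")
genre_in = layers.Input(shape=(N_GENRES,), name="genre_one_hot")

# GRU over the utterance sequence, returning one H-dimensional state per utterance.
states = layers.GRU(H, return_sequences=True)(utt_in)

# Simple attention: a scalar score per utterance, softmax-normalized over the sequence,
# then a weighted sum of the GRU states (one possible reading of "RNN with attention").
scores = layers.Dense(1, activation="tanh")(states)
weights = layers.Softmax(axis=1)(scores)
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([states, weights])

merged = layers.Concatenate()([context, genre_in])
merged = layers.Dropout(0.5)(merged)
out = layers.Dense(3, activation="softmax")(merged)  # LOW / MED / HIGH

model = Model([utt_in, genre_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```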

Implementation

LinearSVC was implemented using scikit-learn (Pedregosa et al. 2011). Features were centered and scaled using sklearn's robust scaler. We estimated model performance and the optimal penalty parameter through nested 5-fold cross-validation.
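
A sketch of this nested cross-validation setup with scikit-learn; the feature matrix, labels, and the grid of penalty values are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(732, 500))                   # placeholder movie-level feature matrix
y = rng.choice(["LOW", "MED", "HIGH"], size=732)  # placeholder labels

pipe = make_pipeline(RobustScaler(), LinearSVC())  # centering and scaling, then the classifier
inner = GridSearchCV(pipe, {"linearsvc__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1_macro")
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="f1_macro")  # nested 5-fold CV
print(outer_scores.mean())
```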

RNN models were implemented in Keras (Chollet et al. 2015). We used the Adam optimizer with a mini-batch size of 16 and a learning rate of 0.001. To prevent over-fitting, we used a dropout rate of 0.5 and trained until convergence (i.e., consecutive losses differing by less than $10^{-8}$). For the RNN layer, we evaluated Gated Recurrent Units (Cho et al. 2014) and Long Short-Term Memory cells (Hochreiter and Schmidhuber 1997). Both models were trained with the number of hidden units $H \in \{4, 8, 16, 32\}$.

Experiments

Baselines

We consider both explicit and implicit abusive language classifiers. For the explicit case, we trained SVCs using lexicon-based approaches. The lexicons considered were the word list from Hatebase, the manually curated n-gram list from (Davidson et al. 2017), and the cross-domain lexicon from (Wiegand et al. 2018). Additionally, we compare against implementations of two state-of-the-art models for implicit abusive language classification: (Nobata et al. 2016), a LinearSVC trained on Linguistic, N-gram, and Semantic features plus the Hatebase lexicon, and (Pavlopoulos, Malakasiotis, and Androutsopoulos 2017), an RNN with deep attention.

Results

| Models | Prec | Rec | F-score |
| --- | --- | --- | --- |
| **Abusive Language Classifiers** | | | |
| Hatebase | 29.8 | 37.4 | 30.5 |
| (Davidson et al. 2017) | 13.6 | 33.1 | 19.3 |
| (Wiegand et al. 2018) | 28.3 | 34.8 | 26.8 |
| (Nobata et al. 2016) | 55.4 | 54.5 | 54.8 |
| (Pavlopoulos et al. 2017) | 53.3 | 52.0 | 52.5 |
| **Semantic-only (word2vec)** | | | |
| Linear SVC | 56.3 | 55.8 | 56.0 |
| GRU (16) | 53.6 | 52.3 | 51.5 |
| LSTM (16) | 52.7 | 54.0 | 52.5 |
| **Movie-level features** | | | |
| Linear SVC | 60.5 | 58.4 | 59.1 |
| **Utterance-level features** | | | |
| GRU (4) | 52.4 | 49.5 | 49.5 |
| GRU (8) | 58.2 | 58.2 | 58.2 |
| GRU (16) | 60.9 | 60.0 | 60.4 |
| GRU (32) | 58.8 | 58.4 | 58.4 |
| LSTM (4) | 54.1 | 54.2 | 52.1 |
| LSTM (8) | 56.6 | 57.6 | 57.0 |
| LSTM (16) | 57.4 | 57.2 | 57.2 |
| LSTM (32) | 56.4 | 56.2 | 55.9 |

Table 3 Classification results: 5-fold cross-validation precision (Prec), recall (Rec) and macro-averaged F1 scores for each classifier. The number in parentheses is the number of hidden units in each RNN layer.

Table 3 shows the macro-averaged classification performance of the baseline models and our proposed models. Precision, recall and F-score (F1) for all models were estimated using 5-fold cross-validation. Consistent with previous work, lexicon-based approaches resulted in a higher number of false positives, leading to high recall but low precision (Schmidt and Wiegand 2017); in contrast, both the implicit abusive language classifiers and our methods achieve a better balance between precision and recall. Models trained on the complete feature set performed better than the other models, and the difference in performance is significant (permutation tests, $n = 10^5$, all $p < 0.05$). This suggests that the additional language features contribute to classification performance. The best performance is obtained with a 16-unit GRU with attention (GRU-16) trained on all features. GRU-16 performed significantly better than the baselines (permutation test, smallest $\Delta = 0.056$, $n = 10^5$, all $p < 0.05$) and better than the RNN models trained on word2vec only ($\Delta = 0.079$, $n = 10^5$, $p < 0.05$). We found no statistically significant difference between the LinearSVC trained on movie-level features and GRU-16 (permutation test, $p > 0.05$).
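
The kind of paired permutation test used for these comparisons can be sketched as follows; the exact permutation scheme is not spelled out in the text, so this prediction-swapping variant is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def permutation_test(y_true, pred_a, pred_b, n=10_000, seed=0):
    """Paired permutation test on macro-F1: randomly swap the two systems' predictions
    per movie and count how often the permuted difference reaches the observed one.
    Increase n toward 1e5 for tighter p-values."""
    rng = np.random.default_rng(seed)
    pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
    observed = abs(f1_score(y_true, pred_a, average="macro")
                   - f1_score(y_true, pred_b, average="macro"))
    count = 0
    for _ in range(n):
        swap = rng.random(len(y_true)) < 0.5       # which movies have their predictions swapped
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        diff = abs(f1_score(y_true, a, average="macro")
                   - f1_score(y_true, b, average="macro"))
        count += diff >= observed
    return (count + 1) / (n + 1)
```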

Conclusion

We present an approach to identify violence from the language used in movie scripts. This can help filmmakers edit content and help social scientists understand representations in movies. Our work is the first to study how linguistic features can be used to predict violence in movies at both the utterance and movie levels. This comes with certain limitations. For example, our approach does not account for modifications in post-production (e.g., an actor delivering a line with a threatening tone). We aim to address this in future work with multimodal approaches that use audio, video and text.

References

Barranco, R. E.; Rader, N. E.; and Smith, A. 2017. Violence at the box office. Communication Research 44(1):77–95.

Chollet, F., et al. 2015. Keras. https://keras.io.

Dai, Q.; Zhao, R.; Wu, Z.; Wang, X.; Gu, Z.; Wu, W.; and Jiang, Y. 2015. Fudan-Huawei at MediaEval 2015: Detecting violent scenes and affective impact in movies with deep learning. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.

Davidson, T.; Warmsley, D.; Macy, M. W.; and Weber, I. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montreal, Quebec, Canada, May 15-18, 2017, 512–515.

Fast, E.; Chen, B.; and Bernstein, M. S. 2016. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 4647–4657. ACM.

Gilbert, C. H. E. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14).

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, 3111–3119. Curran Associates, Inc.

Motion Picture Association of America. 2017. Theme report: A comprehensive analysis and survey of the theatrical and home entertainment market environment (THEME) for 2017. Technical report. [online] Accessed: 07/25/2018.

Nielsen, F. Å. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, Heraklion, Crete, Greece, May 30, 2011, 93–98.

Nobata, C.; Tetreault, J.; Thomas, A.; Mehdad, Y.; and Chang, Y. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, 145–153.

Pavlopoulos, J.; Malakasiotis, P.; and Androutsopoulos, I. 2017. Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1125–1135. Association for Computational Linguistics.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct):2825–2830.

Schmidt, A., and Wiegand, M. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 1–10.

Sparks, G. G.; Sherry, J.; and Lubsen, G. 2005. The appeal of media violence in a full-length motion picture: An experimental investigation. Communication Reports 18(1-2):21–30.

Thompson, K. M., and Yokota, F. 2004. Violence, sex, and profanity in films: Correlation of movie ratings with content. Medscape General Medicine 6(3).

Waseem, Z.; Davidson, T.; Warmsley, D.; and Weber, I. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, 78–84. Vancouver, BC, Canada: Association for Computational Linguistics.

Wiegand, M.; Ruppenhofer, J.; Schmidt, A.; and Greenberg, C. 2018. Inducing a lexicon of abusive words – a feature-based approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1046–1056.

Yokota, F., and Thompson, K. M. 2000. Violence in G-rated animated films. JAMA 283(20):2716–2720.