Rating Predictor for BoardGameGeek reviews

May 11th, 2020

Predicting the rating of a review using the BoardGameGeek reviews dataset

The goal of this project is to use the corpus of reviews in this dataset to learn the relationship between each review's text and its rating.

Once the model is trained on the review data, we ask the user to input a new review and predict its rating.

We begin by importing all the basic libraries:

!pip install -q pandas numpy nltk scikit-learn matplotlib seaborn wordcloud ipython
import random
import gc
import re
import string
import pickle
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd
from IPython.display import display, HTML

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split, KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.feature_extraction.text import TfidfVectorizer

import seaborn as sns
from matplotlib import pyplot as plt

import nltk
from nltk.corpus import stopwords
nltk.download(['words', 'stopwords', 'wordnet'], quiet=True)

True
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Here we define the paths of the dataset and the pickle files.

We will be pickling our dataset after every major step to save time, because with this much data the pre-processing steps take a while to run.

I will also be invoking the garbage collector often to free memory, since we need every bit of it given the volume of input data.

DATA_DIR = "/content/drive/My Drive/Colab Notebooks/data/"

DATASET = DATA_DIR + "bgg-13m-reviews.csv"

REVIEWS_PICKLE = DATA_DIR + "reviews.pkl"
REVIEWS_DATA_CHANGED = False

CLEAN_TEXT_PICKLE = DATA_DIR + "clean_text.pkl"
LEMMATIZED_PICKLE = DATA_DIR + "lemmatized.pkl"
STOPWORDS_PICKLE = DATA_DIR + "stopwords.pkl"
CLEANED_PICKLE = DATA_DIR + "cleaned_reviews.pkl"

MODEL_PICKLE = DATA_DIR + "model.pkl"
VOCABULARY_PICKLE = DATA_DIR + "vocab.pkl"
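
The notebook repeats the same pattern several times: load a pickle if it exists, otherwise compute the data and pickle it for next time. As a minimal sketch of that pattern (the load_or_compute helper below is hypothetical and not part of the original notebook), it could be factored out like this:

def load_or_compute(pickle_path, compute_fn, force=False):
    # Load a previously pickled DataFrame, or compute it and pickle the result.
    if os.path.isfile(pickle_path) and not force:
        print("Loading from pickle:", pickle_path)
        return pd.read_pickle(pickle_path, compression="gzip")
    df = compute_fn()
    df.to_pickle(pickle_path, compression="gzip")
    return df

# Example usage (assumes DATASET and REVIEWS_PICKLE are defined as above):
# reviews = load_or_compute(
#     REVIEWS_PICKLE,
#     lambda: pd.read_csv(DATASET, usecols=['comment', 'rating']))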

Loading the data

Now that we've got the formalities out of our way, let's start exploring our dataset and see how we can solve this problem.

First, we need to load the dataset like so:

reviews = pd.DataFrame()

# Load the pickled reviews if they exist and the data hasn't changed,
# otherwise read the CSV and pickle it for the next run.
if os.path.isfile(REVIEWS_PICKLE) and not REVIEWS_DATA_CHANGED:
    print('Pickled reviews found. Loading from pickle file.')
    reviews = pd.read_pickle(REVIEWS_PICKLE, compression="gzip")
else:
    print("Reading reviews csv")
    reviews = pd.read_csv(DATASET, usecols=['comment', 'rating'])
    print("Pickling dataset")
    reviews.to_pickle(REVIEWS_PICKLE, compression="gzip")

reviews = reviews.reindex(columns=['comment', 'rating'])

# Shuffle the rows so later sampling isn't ordered by game
reviews = reviews.sample(frac=1, random_state=3).reset_index(drop=True)

gc.collect()
display(reviews)
Pickled reviews found. Loading from pickle file.

          comment                                              rating
0         NaN                                                     6.5
1         NaN                                                     8.0
2         NaN                                                     8.0
3         NaN                                                     7.0
4         http://boardgamers.ro/saboteur-recenzie/                6.0
...       ...                                                     ...
13170068  "Solid deck-builder in which you need to strat...       6.5
13170069  I adore this game. Not only is it an incredibl...      10.0
13170070  NaN                                                     9.0
13170071  NaN                                                     6.0
13170072  Still the easiest game to get to the table. Es...       6.0

13170073 rows × 2 columns

Dropping rows with empty reviews

Great! We've got our dataset loaded.

Consecutive runs should also be faster now, because we've pickled the data.

As we can see, there are several rows with no review text and just a rating.

We will remove these rows as we cannot use them.

empty_reviews = reviews.comment.isna().sum()
print("Dropping {} rows from dataset".format(empty_reviews))

reviews.dropna(subset=['comment'], inplace=True)
reviews = reviews.reset_index(drop=True)

print("We now have {} reviews remaining".format(reviews.shape[0]))

gc.collect()
display(reviews)

Dropping 10531901 rows from dataset
We now have 2638172 reviews remaining
         comment                                               rating
0        http://boardgamers.ro/saboteur-recenzie/                 6.0
1        A nice family game but a little light for my o...        5.0
2        "This was the first modern board game that my ...        8.0
3        Great two player game. In fact, how did Knizia...        9.0
4        Risk but with more pieces.                               5.0
...      ...                                                      ...
2638167  Great light game. Theme and simplicity make it...        7.0
2638168  Detailed board All sleeved                               9.0
2638169  "Solid deck-builder in which you need to strat...        6.5
2638170  I adore this game. Not only is it an incredibl...       10.0
2638171  Still the easiest game to get to the table. Es...        6.0

2638172 rows × 2 columns

Cleaning the reviews

Wow! That was a huge difference. We've shed off about 10 million rows!

Now we need to begin cleaning up these reviews, as they contain a lot of noise.

For this dataset, we will be removing the following patterns from the text:

  • Words shorter than 3 or longer than 15 characters
  • Special characters
  • URLs
  • HTML tags
  • All kinds of redundant whitespace
  • Words that have numbers in the middle
  • Words that begin or end with a number

Once all this is done, we tokenize the reviews and save them back into the dataframe.

def clean_text(text):
    text = text.lower().strip()
    text = " ".join([w for w in text.split() if len(w) >= 3 and len(w) <= 15])
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w+\d+\w*', '', text)
    text = re.sub(r'\d+\w+\d*', '', text)
    # Replace any remaining non-word characters with spaces (replacing them
    # with '' would glue neighbouring words together before tokenizing)
    text = re.sub(r'\W+', ' ', text)
    text = tokenizer.tokenize(text)
    return text


if os.path.isfile(CLEAN_TEXT_PICKLE) and not REVIEWS_DATA_CHANGED:
    print("Pickled clean text found. Loading from pickle file.")
    reviews = pd.read_pickle(CLEAN_TEXT_PICKLE, compression="gzip")
else:
    print("Cleaning reviews")
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    reviews['comment'] = reviews['comment'].apply(clean_text)
    print("Pickling cleaned dataset")
    reviews.to_pickle(CLEAN_TEXT_PICKLE, compression="gzip")

gc.collect()
display(reviews)
Pickled clean text found. Loading from pickle file.

         comment                                               rating
0        [currently, this, sits, list, favorite, game]           10.0
1        [know, says, how, many, plays, but, many, many...       10.0
2        [will, never, tire, this, game, awesome]                10.0
3        [this, probably, the, best, game, ever, played...       10.0
4        [fantastic, game, got, hooked, games, all, ove...       10.0
...      ...                                                      ...
2638167  [horrible, party, game, dumping, this, one]              3.0
2638168  [difficult, build, anything, all, with, the, i...        3.0
2638169  [lego, created, version, pictionary, only, you...        3.0
2638170  [this, game, very, similar, creationary, comes...        2.5
2638171  [this, game, was, really, bad, worst, that, pl...        2.0

2638172 rows × 2 columns
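
As a quick sanity check (the review string below is made up for illustration, not taken from the dataset), we can run clean_text on a single string and see that URLs, numbers, punctuation and very short or very long words are stripped before tokenizing:

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # clean_text uses this global
sample = "Played this 3 times at GenCon2019 -- see https://example.com. LOVED it!"
print(clean_text(sample))
# -> roughly ['played', 'this', 'times', 'see', 'loved', 'it']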

Calculating unique words

We've gotten rid of most of the oddities in our text data.

But with a corpus this large, the same word shows up in many inflected forms, so it's better to lemmatize them. Before we do that, let's see how many unique words we currently have.

uniq_words = []

unique_words = reviews.explode('comment').comment.nunique()
uniq_words.append(unique_words)

print("Unique words before lemmatizing {}".format(unique_words))

Unique words before lemmatizing 313013

Lemmatizing

Wow, that's a lot of words. Let's see what happens after lemmatizing our reviews.

def lemmatize_data(text):
    return [lemmatizer.lemmatize(w) for w in text]


if os.path.isfile(LEMMATIZED_PICKLE) and not REVIEWS_DATA_CHANGED:
    print("Pickled lemmatized reviews found. Loading from pickle file.")
    reviews = pd.read_pickle(LEMMATIZED_PICKLE, compression="gzip")
else:
    print("Lemmatizing reviews")
    lemmatizer = nltk.stem.WordNetLemmatizer()
    reviews['comment'] = reviews['comment'].apply(lemmatize_data)
    print("Pickling lemmatized reviews")
    reviews.to_pickle(LEMMATIZED_PICKLE, compression="gzip")

display(reviews)
gc.collect()

unique_words = reviews.explode('comment').comment.nunique()
uniq_words.append(unique_words)
print("Unique words after lemmatizing {}".format(unique_words))
Pickled lemmatized reviews found. Loading from pickle file.

         comment                                               rating
0        [currently, this, sits, list, favorite, game]           10.0
1        [know, say, how, many, play, but, many, many, ...       10.0
2        [will, never, tire, this, game, awesome]                10.0
3        [this, probably, the, best, game, ever, played...       10.0
4        [fantastic, game, got, hooked, game, all, over...       10.0
...      ...                                                      ...
2638167  [horrible, party, game, dumping, this, one]              3.0
2638168  [difficult, build, anything, all, with, the, i...        3.0
2638169  [lego, created, version, pictionary, only, you...        3.0
2638170  [this, game, very, similar, creationary, come,...        2.5
2638171  [this, game, wa, really, bad, worst, that, pla...        2.0

2638172 rows × 2 columns

Unique words after lemmatizing 300678
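
A quick aside: WordNetLemmatizer defaults to treating every token as a noun, which is why oddities like "wa" (from "was") and "ha" (from "has") appear in the lemmatized output above. A small illustration (not from the original notebook) of the difference a POS tag makes:

lem = nltk.stem.WordNetLemmatizer()
print(lem.lemmatize("games"))         # game
print(lem.lemmatize("was"))           # wa  -- stripped like a plural noun
print(lem.lemmatize("was", pos="v"))  # be  -- handled correctly as a verb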

Removing stopwords and non-English words

Now let's remove the stopwords as well, since they are mostly redundant for training and mainly help convey context.

In addition to the standard stopword lists, I've identified a fair number of custom, domain-specific words (e.g. "game", "player", "card") to remove as well.

def remove_stopwords(text):
    # Keep words that are not stopwords and appear in the English corpus;
    # non-alphabetic tokens are kept regardless (this mirrors the original precedence)
    words = [w for w in text
             if (w not in stop_words and w in words_corpus) or not w.isalpha()]
    words = list(filter(lambda word: len(word) > 2, set(words)))
    return words


if os.path.isfile(STOPWORDS_PICKLE) and not REVIEWS_DATA_CHANGED:
    print("Reading stopwords pickle")
    reviews = pd.read_pickle(STOPWORDS_PICKLE, compression="gzip")
else:
    words_corpus = set(nltk.corpus.words.words())

    stop_words_json = {"en": ["a", "a's", "able", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against", "ain't", "aint", "all", "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "an", "and", "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere", "apart", "appear", "appreciate", "appropriate", "are", "aren't", "arent", "around", "as", "aside", "ask", "asking", "associated", "at", "available", "away", "awfully", "b", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "both", "brief", "but", "by", "c", "c'mon", "cmon", "cs", "c's", "came", "can", "can't", "cannot", "cant", "cause", "causes", "certain", "certainly", "changes", "clearly", "co", "com", "come", "comes", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn't", "course", "currently", "d", "definitely", "described", "despite", "did", "didn't", "different", "do", "does", "doesn't", "doesn", "doing", "don't", "done", "down", "downwards", "during", "e", "each", "edu", "eg", "eight", "either", "else", "elsewhere", "enough", "entirely", "especially", "et", "etc", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "f", "far", "few", "fifth", "first", "five", "followed", "following", "follows", "for", "former", "formerly", "forth", "four", "from", "further", "furthermore", "g", "get", "gets", "getting", "given", "gives", "go", "goes", "going", "gone", "got", "gotten", "greetings", "h", "had", "hadn't", "hadnt", "happens", "hardly", "has", "hasn't", "hasnt", "have", "haven't", "havent", "having", "he", "he's", "hes", "hello", "help", "hence", "her", "here", "here's", "heres", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "hi", "him", "himself", "his", "hither", "hopefully", "how", "howbeit", "however", "i", "i'd", "id", "i'll", "i'm", "im", "i've", "ive", "ie", "if", "ignored", "immediate", "in", "inasmuch", "inc", "indeed", "indicate", "indicated", "indicates", "inner", "insofar", "instead", "into", "inward", "is", "isn't", "isnt", "it", "it'd", "itd", "it'll", "itll", "it's", "its", "itself", "j", "just", "k", "keep", "keeps", "kept", "know", "known", "knows", "l", "last", "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "let's", "lets", "like", "liked", "likely", "little", "look", "looking", "looks", "ltd", "m", "mainly", "many", "may", "maybe", "me", "mean", "meanwhile", "merely", "might", "more", "moreover", "most", "mostly", "much", "must", "my", "myself", "n", "name", "namely", "nd", "near", "nearly", "necessary", "need", "needs", "neither", "never", "nevertheless", "new", "next", "nine", "no", "nobody", "non", "none", "noone", "nor", "normally", "not", "nothing", "novel", "now", "nowhere", "o", "obviously", "of", "off", "often", "oh", "ok", "okay", "old", "on", "once", "one", "ones", "only", "onto", "or", "other", "others", "otherwise", "ought", "our", "ours", "ourselves", "out", "outside", "over", "overall", "own", "p", "particular", "particularly", "per", "perhaps", "placed", "please", "plus", "possible", "presumably", "probably", "provides", "q", "que", "quite", "qv", "r", "rather", "rd", "re", "really", "reasonably", "regarding", "regardless", "regards",
    "relatively", "respectively", "right", "s", "said", "same", "saw", "say", "saying", "says", "second", "secondly", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "shall", "she", "should", "shouldn't", "shouldnt", "since", "six", "so", "some", "somebody", "somehow", "someone", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specified", "specify", "specifying", "still", "sub", "such", "sup", "sure", "t", "t's", "ts", "take", "taken", "tell", "tends", "th", "than", "thank", "thanks", "thanx", "that", "that's", "thats", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "there's", "theres", "thereafter", "thereby", "therefore", "therein", "theres", "thereupon", "these", "they", "they'd", "theyd", "they'll", "theyll", "they're", "theyre", "theyve", "they've", "think", "third", "this", "thorough", "thoroughly", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "took", "toward", "towards", "tried", "tries", "truly", "try", "trying", "twice", "two", "u", "un", "under", "unfortunately", "unless", "unlikely", "until", "unto", "up", "upon", "us", "use", "used", "useful", "uses", "using", "usually", "uucp", "v", "value", "various", "very", "via", "viz", "vs", "w", "wa", "want", "wants", "was", "wasn't", "wasnt", "way", "we", "we'd", "we'll", "we're", "we've", "weve", "welcome", "well", "went", "were", "weren't", "werent", "what", "what's", "whats", "whatever", "when", "whence", "whenever", "where", "wheres", "where's", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "who's", "whos", "whoever", "whole", "whom", "whose", "why", "will", "willing", "wish", "with", "within", "without", "won't", "wont", "wonder", "would", "wouldn't", "wouldny", "x", "y", "yes", "yet", "you", "you'd", "youd", "you'll", "youll", "you're", "youre", "you've", "youve", "your", "yours", "yours", "yourself", "yourselves", "z", "zero"]}

    stop_words_json_en = set(stop_words_json['en'])

    stop_words_nltk_en = set(stopwords.words('english'))

    custom_stop_words = ["rating", "ish", "havn", "dice", "end", "set", "doesnt", "give", "find", "doe", "system", "tile", "table", "deck", "box", "made", "part", "based", "worker", "wife", "put", "havent", "game", "play", "player", "one", "two", "card", "ha", "wa", "dont", "board", "time", "make", "rule", "thing", "version", "mechanic", "year", "theme", "rating", "family", "child", "money", "edition", "collection", "piece", "wasnt", "didnt"]

    stop_words = stop_words_nltk_en.union(
        stop_words_json_en, custom_stop_words)

    print("Removing {} stopwords from text".format(len(stop_words)))
    print()

    reviews['comment'] = reviews['comment'].apply(remove_stopwords)

    print("Pickling stopwords data")
    reviews.to_pickle(STOPWORDS_PICKLE, compression="gzip")

gc.collect()

print("After removing stopwords:")
display(reviews)
Reading stopwords pickle
After removing stopwords:

         comment                                               rating
0        [list, favorite]                                         10.0
1        [uncounted]                                              10.0
2        [tire, awesome]                                          10.0
3        [negotiation, skill, thinking]                           10.0
4        [fantastic, hooked]                                      10.0
...      ...                                                      ...
2638167  [party, horrible, dumping]                                3.0
2638168  [difficult, included, build]                              3.0
2638169  [create, abstract, fun, limited, number, pictu...         3.0
2638170  [similar, creationary]                                    2.5
2638171  [bad, worst, genre]                                       2.0

2638172 rows × 2 columns

Let's see what the word count is now

unique_words = reviews.explode('comment').comment.nunique()
uniq_words.append(unique_words)
print(unique_words)

37633

Wow! That's a marked difference compared to what we started with. This will definitely help during the training phase, as it reduces training time and computation.

Moreover, it should also improve accuracy, since we are now targeting the important keywords that contribute to a particular sentiment.

Let's plot how the unique word count dropped at each cleaning stage.

x = [1, 2, 3]
y = [313013, 300678, 37633]

fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(x, y, color="#663399")
ax.set_ylabel('Unique Words')
ax.set_xlabel('Cleaning stage')
ax.set_xticks(x)
ax.set_xticklabels(['Cleaned', 'Lemmatized', 'Stopwords removed'])
plt.show()

[Plot: unique word count after each cleaning stage]

Finishing up and saving progress

In the process of all this cleaning, several rows will have ended up empty.

Let's remove them and then pickle our dataset so that we don't have to do all this cleaning again.

def drop_empty_reviews(df):
    df = df.drop(df[~df.comment.astype(bool)].index)
    return df


if os.path.isfile(CLEANED_PICKLE) and not REVIEWS_DATA_CHANGED:
    print("Reading cleaned reviews pickle")
    reviews = pd.read_pickle(CLEANED_PICKLE, compression="gzip")
else:
    reviews = drop_empty_reviews(reviews)
    print("Pickling cleaned reviews data")
    reviews.to_pickle(CLEANED_PICKLE, compression="gzip")

REVIEWS_DATA_CHANGED = False

gc.collect()

display(reviews)
Reading cleaned reviews pickle

         comment                                               rating
0        [list, favorite]                                         10.0
1        [uncounted]                                              10.0
2        [tire, awesome]                                          10.0
3        [negotiation, skill, thinking]                           10.0
4        [fantastic, hooked]                                      10.0
...      ...                                                      ...
2638167  [party, horrible, dumping]                                3.0
2638168  [difficult, included, build]                              3.0
2638169  [create, abstract, fun, limited, number, pictu...         3.0
2638170  [similar, creationary]                                    2.5
2638171  [bad, worst, genre]                                       2.0

2405729 rows × 2 columns

Exploring our dataset

Now that we've completed our cleaning, we need to assess what kind of model we should use to predict our ratings.

But first, let's see how many distinct values we have in the ratings column.

reviews.rating.nunique()

3213

We've got 3k+ unique values, which would suit a regression model.

But since we will ultimately be predicting single-digit ratings, I'm going to round off the column. This lets us try both regression and classification and pick whichever gives the better result.

Do note that rounding off generally isn't recommended as it leads to loss of information and will affect the outcome.

I'm only doing this due to computational constraints. You should see better results in Regression models without rounding the values.
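
If you have the compute to spare, one option (not done in this notebook) is to keep a full-precision copy of the column before rounding, so regression experiments can still use the original target:

# Hypothetical: preserve the original ratings for regression experiments
reviews['rating_raw'] = reviews['rating'].copy()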

reviews.rating = reviews.rating.round().astype(int)

reviews
         comment                                               rating
0        [list, favorite]                                           10
1        [uncounted]                                                10
2        [tire, awesome]                                            10
3        [negotiation, skill, thinking]                             10
4        [fantastic, hooked]                                        10
...      ...                                                       ...
2638167  [party, horrible, dumping]                                  3
2638168  [difficult, included, build]                                3
2638169  [create, abstract, fun, limited, number, pictu...           3
2638170  [similar, creationary]                                      2
2638171  [bad, worst, genre]                                         2

2405729 rows × 2 columns

We should have 11 distinct values (0 to 10) after this. Let's see how many ratings of each category we have.

reviews.rating.value_counts()

8     595145
7     520442
6     480153
9     219429
5     200143
10    140037
4     127085
3      66466
2      38184
1      18634
0         11
Name: rating, dtype: int64

We can see that we have fewer than 50 reviews with a zero rating. Their contribution is negligible, so we're going to remove them.

class_distrib = reviews.rating.value_counts(ascending=True)
classes_to_drop = class_distrib[class_distrib < 50].index.values

for rating in classes_to_drop:
    rows_to_drop = reviews[reviews.rating == rating].index.to_list()
    reviews.drop(rows_to_drop, inplace=True)
    print("We have {} reviews remaining".format(reviews.shape))

We have (2405718, 2) reviews remaining

Let's visualize the distribution of ratings a bit better with a bar plot.

fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(reviews.rating.value_counts().index,
            reviews.rating.value_counts(), color="#663399")
ax.set_xlabel("Rating")
ax.set_ylabel("Count")
fig.show()

[Bar plot: number of reviews per rating]

Splitting dataset for training, validation and testing

Looks like the data roughly follows a normal distribution, with most of the reviews lying between 6 and 9. This imbalance will definitely bias the model.

Now that we have an idea of our course of action, let's split the dataset for training and testing. We will use only around 10% of the data, mainly because of its volume and our computational constraints.

X_train, X_test, y_train, y_test = train_test_split(
    reviews.comment, reviews.rating, train_size=0.1, test_size=0.1)
X_dev, X_test, y_dev, y_test = train_test_split(X_test, y_test, train_size=0.5)

X_train = X_train.apply(' '.join)
X_dev = X_dev.apply(' '.join)
X_test = X_test.apply(' '.join)

gc.collect()

print("Number of records chosen for training: {}".format(X_train.shape[0]))
print("Number of records chosen for development: {}".format(X_dev.shape[0]))
print("Number of records chosen for testing: {}".format(X_test.shape[0]))

Number of records chosen for training: 240571
Number of records chosen for development: 120286
Number of records chosen for testing: 120286
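
Given the class imbalance noted above, a stratified split would keep the rating distribution identical across the train, dev and test sets. A minimal sketch of that variant (not what this notebook actually ran):

# Hypothetical stratified version of the split above
X_train, X_test, y_train, y_test = train_test_split(
    reviews.comment, reviews.rating,
    train_size=0.1, test_size=0.1, stratify=reviews.rating)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_test, y_test, train_size=0.5, stratify=y_test)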

Great! We've got that sorted. Since we have 10 unique values in ratings and we're dealing with text, let's start off with a Multinomial Naive Bayes Classifier.

But, before we do that, we need to calculate the TF-IDF values for our reviews.

TF is the term-frequency for every word in the review i.e. the number of times a word has appeared in a review. This is fairly straightforward to calculate by using a counter per review.

IDF (Inverse Document Frequency) is slightly trickier. Essentially, IDF is the weight of a word across all reviews: a measure of how common or rare the word is in the corpus. This helps us understand which words should be prioritized over others. Common words get an IDF close to 0 and contribute little to the model, while rarer words get larger weights and contribute the most.

It is calculated by taking the total number of reviews, dividing it by the number of reviews that contain the word, and taking the logarithm to dampen the large values that can arise. This is what the final formula looks like:

$\mathrm{tfidf}_{i,d} = \mathrm{tf}_{i,d} \cdot \mathrm{idf}_{i}$
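
To make the formula concrete, here is a tiny worked example on a made-up three-review corpus using the textbook definition above (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its numbers differ slightly):

import math

docs = [["great", "game"], ["boring", "game"], ["great", "fun"]]
N = len(docs)

def idf(word):
    df = sum(word in d for d in docs)  # number of reviews containing the word
    return math.log(N / df)

# "game" appears in 2 of 3 reviews, "boring" in only 1
print(round(idf("game"), 3))    # 0.405 -> common word, low weight
print(round(idf("boring"), 3))  # 1.099 -> rare word, high weight

# tf-idf of "boring" in the second review (tf = 1)
print(round(1 * idf("boring"), 3))  # 1.099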

vectorizer = TfidfVectorizer()

X_train_vec = vectorizer.fit_transform(X_train)

X_dev_vec = vectorizer.transform(X_dev)
X_test_vec = vectorizer.transform(X_test)

Selecting a model

We will be testing the following models

  • Multinomial Naive Bayes (Works well with text data)
  • Linear Regression (Target column is a number so maybe it will work)
  • Logistic Regression (Works well with categorical data)

First let's give Naive Bayes a shot mainly due to the flexibility we get with text data. Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.

This is what the final classification rule looks like:

$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$

accuracy_log = []

mnbc = MultinomialNB()

mnbc.fit(X_train_vec, y_train)

mnbc_predicted = mnbc.predict(X_dev_vec)

acc_score = accuracy_score(mnbc_predicted, y_dev)
mnbc_accuracy = round(acc_score * 100, 2)
mnbc_report = classification_report(
    y_dev, mnbc_predicted, digits=4, zero_division=False)

gc.collect()

print("Accuracy for Multinomial Naive Bayes = {}%".format(mnbc_accuracy))
print()
accuracy_log.append(acc_score)
print("Classification Report for Multinomial Naive Bayes")
print(mnbc_report)
Accuracy for Multinomial Naive Bayes = 29.4%

Classification Report for Multinomial Naive Bayes
              precision    recall  f1-score   support

           1     0.0000    0.0000    0.0000       926
           2     0.0000    0.0000    0.0000      1845
           3     0.0000    0.0000    0.0000      3357
           4     0.3286    0.0036    0.0071      6451
           5     0.1967    0.0047    0.0093     10112
           6     0.2816    0.3333    0.3053     24033
           7     0.2678    0.2212    0.2423     25999
           8     0.3073    0.7239    0.4314     29635
           9     0.2568    0.0035    0.0068     10955
          10     0.4318    0.0054    0.0108      6973

    accuracy                         0.2940    120286
   macro avg     0.2071    0.1296    0.1013    120286
weighted avg     0.2724    0.2940    0.2221    120286
1print("Confusion Matrix for Multinomial Naive Bayes:")
2
3mnb_cm = confusion_matrix(y_dev, mnbc_predicted, normalize='all')
4
5fig, ax = plt.subplots(figsize=(15, 15))
6sns.heatmap(mnb_cm, annot=True, linewidths=0.5, ax=ax,
7 cmap="Purples", linecolor="#663399", fmt="f", square=True)
8fig.show()
1Confusion Matrix for Multinomial Naive Bayes:

[Heatmap: normalized confusion matrix for Multinomial Naive Bayes]

Let's see how well our model works for a few real-world reviews

def check_real_review_predictions(model, vectorizer):
    real_reviews = ["This game is absolutely pathetic. Horrible story and characters. Will never play this game again.",
                    "Worst.",
                    "One of the best games I've ever played. Amazing story and characters. Recommended.",
                    "good."]

    # Rebuild a vectorizer from the fitted vocabulary (this mirrors how the
    # deployed Flask app works, though fit_transform here recomputes IDF from
    # just these few reviews rather than from the training corpus)
    test_vectorizer = TfidfVectorizer(vocabulary=vectorizer.vocabulary_)
    vectorized_review = test_vectorizer.fit_transform(real_reviews)
    predicted = model.predict(vectorized_review)
    combined = np.vstack((real_reviews, predicted)).T
    return combined


preds = check_real_review_predictions(mnbc, vectorizer)
preds
array([['This game is absolutely pathetic. Horrible story and characters. Will never play this game again.',
        '7'],
       ['Worst.', '6'],
       ["One of the best games I've ever played. Amazing story and characters. Recommended.",
        '8'],
       ['good.', '7']], dtype='<U97')

Well, this is exactly what we feared: all of the predicted ratings fall between 6 and 8.

Our model is being biased by the skewed input data. I have a feeling that some form of regression would work slightly better here, so let's give Linear Regression a shot.

linreg = LinearRegression(n_jobs=4)

linreg.fit(X_train_vec, y_train)

linreg_predicted = linreg.predict(X_dev_vec)

acc_score = accuracy_score(np.round(linreg_predicted), y_dev)
linreg_accuracy_round = round(acc_score * 100, 2)

gc.collect()

print("Accuracy for Linear Regression = {}%".format(linreg_accuracy_round))
accuracy_log.append(acc_score)

Accuracy for Linear Regression = 27.47%

The accuracy is a little worse but let's check with some of our reviews before we dismiss this idea.

preds = check_real_review_predictions(linreg, vectorizer)
preds

array([['This game is absolutely pathetic. Horrible story and characters. Will never play this game again.',
        '4.437034389776194'],
       ['Worst.', '2.3588930267669355'],
       ["One of the best games I've ever played. Amazing story and characters. Recommended.",
        '9.024957161994099'],
       ['good.', '7.021012417756805']], dtype='<U97')

Perfect. Looks like we are on the right track. Since this is mainly a classification problem, Logistic Regression would be a better fit here. Let's try that:

logreg = LogisticRegression(n_jobs=4)

logreg.fit(X_train_vec, y_train)

logreg_predicted = logreg.predict(X_dev_vec)

acc_score = accuracy_score(logreg_predicted, y_dev)
logreg_accuracy = round(acc_score * 100, 2)
logreg_report = classification_report(
    y_dev, logreg_predicted, digits=4, zero_division=False)

gc.collect()

print("Accuracy for Logistic Regression = {}%".format(logreg_accuracy))
print()
accuracy_log.append(acc_score)
print("Classification Report for Logistic Regression")
print(logreg_report)
Accuracy for Logistic Regression = 29.94%

Classification Report for Logistic Regression
              precision    recall  f1-score   support

           1     0.2321    0.0594    0.0946       926
           2     0.2030    0.0580    0.0902      1845
           3     0.2054    0.0316    0.0547      3357
           4     0.2100    0.0632    0.0972      6451
           5     0.2145    0.0562    0.0890     10112
           6     0.2888    0.3937    0.3332     24033
           7     0.2679    0.2854    0.2764     25999
           8     0.3337    0.5489    0.4151     29635
           9     0.2703    0.0481    0.0817     10955
          10     0.3354    0.1565    0.2134      6973

    accuracy                         0.2994    120286
   macro avg     0.2561    0.1701    0.1745    120286
weighted avg     0.2818    0.2994    0.2647    120286

While our accuracy is better, the solver hasn't fully converged. Let's increase the maximum number of iterations and see where it converges.

We can then use that and see how our model does. The default value is 100, so we will increase it to 1000.

logreg = LogisticRegression(max_iter=1000, n_jobs=4)

logreg.fit(X_train_vec, y_train)

logreg_predicted = logreg.predict(X_dev_vec)

acc_score = accuracy_score(logreg_predicted, y_dev)
logreg_accuracy = round(acc_score * 100, 2)
logreg_report = classification_report(
    y_dev, logreg_predicted, digits=4, zero_division=False)

gc.collect()

print("Accuracy for Logistic Regression = {}%".format(logreg_accuracy))
print()
accuracy_log.append(acc_score)
print("Classification Report for Logistic Regression")
print(logreg_report)
print("Solution converged at iteration = {}".format(logreg.n_iter_[-1]))
Accuracy for Logistic Regression = 30.08%

Classification Report for Logistic Regression
              precision    recall  f1-score   support

           1     0.3687    0.0713    0.1195       926
           2     0.2318    0.0569    0.0914      1845
           3     0.2272    0.0349    0.0604      3357
           4     0.2103    0.0629    0.0969      6451
           5     0.2022    0.0552    0.0867     10112
           6     0.2902    0.3916    0.3334     24033
           7     0.2717    0.2817    0.2766     25999
           8     0.3310    0.5616    0.4165     29635
           9     0.2674    0.0504    0.0848     10955
          10     0.3666    0.1428    0.2056      6973

    accuracy                         0.3008    120286
   macro avg     0.2767    0.1709    0.1772    120286
weighted avg     0.2849    0.3008    0.2651    120286

Solution converged at iteration = 637

Let's have a look at the confusion matrix and see if things got any better.

1print("Confusion Matrix for Logistic Regression:")
2
3logreg_cm = confusion_matrix(y_dev, logreg_predicted, normalize='all')
4
5fig, ax = plt.subplots(figsize=(15, 15))
6sns.heatmap(logreg_cm, annot=True, linewidths=0.5, ax=ax,
7 cmap="Purples", linecolor="#663399", fmt="f", square=True)
8fig.show()
1Confusion Matrix for Logistic Regression:

[Heatmap: normalized confusion matrix for Logistic Regression]

Our accuracy has improved slightly, the solution has converged, and the confusion matrix looks a little better. Let's see how it classifies our real-world reviews.

preds = check_real_review_predictions(logreg, vectorizer)
preds

array([['This game is absolutely pathetic. Horrible story and characters. Will never play this game again.',
        '1'],
       ['Worst.', '1'],
       ["One of the best games I've ever played. Amazing story and characters. Recommended.",
        '10'],
       ['good.', '7']], dtype='<U97')

Perfect! These ratings roughly correspond to what we were expecting.

Before we move ahead, let's check the consistency of these metrics with K-Fold cross validation.

Since we are dealing with imbalanced data, a stratified K-Fold that preserves the rating distribution in each fold would be the ideal choice; for simplicity the loop below uses a plain K-Fold, and a stratified sketch follows the results.

kfold = KFold(n_splits=3)
fold = 1

for train_index, test_index in kfold.split(X_train, y_train):
    # kfold.split yields positional indices, so select rows with iloc rather
    # than matching against the (non-sequential) pandas index labels
    X_tr, X_te = X_train.iloc[train_index], X_train.iloc[test_index]
    y_tr, y_te = y_train.iloc[train_index], y_train.iloc[test_index]

    vectorizer = TfidfVectorizer()

    X_tr_vec = vectorizer.fit_transform(X_tr)
    X_te_vec = vectorizer.transform(X_te)

    logreg.fit(X_tr_vec, y_tr)

    logreg_predicted = logreg.predict(X_te_vec)

    acc_score = accuracy_score(logreg_predicted, y_te)
    logreg_accuracy = round(acc_score * 100, 2)

    gc.collect()

    print("Accuracy for Logistic Regression Fold {} = {}%".format(
        fold, logreg_accuracy))
    print()
    fold += 1
Accuracy for Logistic Regression Fold 1 = 30.94%

Accuracy for Logistic Regression Fold 2 = 31.6%

Accuracy for Logistic Regression Fold 3 = 28.53%
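
As mentioned above, a stratified variant would keep each fold's rating distribution identical to the full training set. A minimal sketch of what that would look like (StratifiedKFold needs the labels passed to split; this is not what the notebook actually ran):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)
for fold, (train_index, test_index) in enumerate(skf.split(X_train, y_train), start=1):
    X_tr, X_te = X_train.iloc[train_index], X_train.iloc[test_index]
    y_tr, y_te = y_train.iloc[train_index], y_train.iloc[test_index]
    # ...vectorize, fit and score exactly as in the plain K-Fold loop above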

Looks like Logistic Regression is performing fairly well compared to its linear counterpart, or even Naive Bayes for that matter. Let's see how the model performs on the unseen test data.

logreg.fit(X_train_vec, y_train)
final_predicted = logreg.predict(X_test_vec)

acc_score = accuracy_score(final_predicted, y_test)
logreg_accuracy = round(acc_score * 100, 2)

print("Final Accuracy for test data with Logistic Regression = {}%".format(
    logreg_accuracy))

Final Accuracy for test data with Logistic Regression = 30.16%

While this isn't the accuracy we were hoping for, it looks like our accuracy is plateauing, likely because of the inherent imbalance in the dataset.

Given that, I will train the final model on around 50% of the data and pickle it so that we can use it in our online Flask application to predict new reviews entered by a user.

X_train, _, y_train, _ = train_test_split(
    reviews.comment, reviews.rating, train_size=0.5, test_size=0.01)
X_train = X_train.apply(' '.join)

print("Number of records chosen for training the final model = {}".format(
    X_train.shape[0]))

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

logreg.fit(X_train_vec, y_train)
gc.collect()

Number of records chosen for training the final model = 1202859

42

We will pickle this model and the vocabulary of the TF-IDF vectorizer. This will be used in the Flask application that we will be building.

# Remove any stale pickles before writing the new model and vocabulary
if os.path.isfile(MODEL_PICKLE):
    print('Pickled model already present')
    os.remove(MODEL_PICKLE)

if os.path.isfile(VOCABULARY_PICKLE):
    print('Pickled vocab already present')
    os.remove(VOCABULARY_PICKLE)

print("Pickling model")
with open(MODEL_PICKLE, "wb") as f:
    pickle.dump(logreg, f)

print("Pickling vocabulary")
with open(VOCABULARY_PICKLE, "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)

Pickling model
Pickling vocabulary

Challenges faced

The biggest challenges that I faced revolved around cleaning the dataset and dealing with imbalanced data.

Every model that I tested gave similar accuracy, which led me to conclude that the dataset itself was creating an implicit bias.

The dataset was difficult to clean and to build models on due to the sheer volume of data and the hardware constraints.

Even while attempting to deal with the imbalance, there were issues with accuracy, because methods such as SMOTE and StratifiedKFold do not work well when the features are sparse text vectors.

The other challenge I faced was deploying the model to a production environment. If I used a majority of the data to train the model, it became too heavy and every AJAX call to the server took a long time to load.

Thus, I ended up loading the model as soon as the server started and caching it in the memory thereby speeding up consecutive requests.
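
For reference, here is a minimal sketch of how such a Flask endpoint could load the pickles once at startup and serve predictions. The file names follow the constants above, but the route name, request format and preprocessing shown here are hypothetical and not the actual deployed code:

import pickle
from flask import Flask, request, jsonify
from sklearn.feature_extraction.text import TfidfVectorizer

app = Flask(__name__)

# Load once at startup and keep in memory so every request is fast
model = pickle.load(open("model.pkl", "rb"))
vocabulary = pickle.load(open("vocab.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    review = request.json["review"]
    vectorizer = TfidfVectorizer(vocabulary=vocabulary)
    features = vectorizer.fit_transform([review])
    rating = int(model.predict(features)[0])
    return jsonify({"rating": rating})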

My contribution

Built a Flask application that predicts the rating of any review. Please visit the page here and enter a review. You should see the prediction for that review. This application has been deployed on Heroku and the form is available on my Portfolio website that is built using GatsbyJS. The code for this application is available on my GitHub.

You can also see a demo of this application on YouTube.

A copy of the dataset and its pickle files are available here, and the IPython Notebook can be downloaded from here. (Right-click and Save As to download the file.)

References

StratifiedKFold

LogisticRegression

Sending POST Request via Axios

SMOTE Oversampling

Regression Methods

Pickling ML models

Multi-class Text Classification

Multinomial Naive Bayes

TF-IDF Vectorizer