Real or Not? NLP with Disaster Tweets

2020 March 3rd

The problem at hand is a standard use case for Natural Language Processing. We have a dataset consisting of many tweets. Some of these tweets announce a real emergency and some do not. We need to use Natural Language Processing to classify which is which.

We are given two files. The first one is train.csv which contains the following columns:

  • id
  • keyword
  • location
  • text
  • target

This file contains 7,613 tweets that are already classified and will be used to train our model.

The second file is test.csv which contains the following columns:

  • id
  • keyword
  • location
  • text

This file contains 3,263 unlabelled tweets for which the model has to predict the target class.

To solve this problem we need to do the following:

  • Import the train.csv and test.csv datasets
  • Perform text pre-processing such as lowercasing, removing links and punctuation, and removing stopwords
  • Tokenize the processed data
  • Train the model using train.csv
  • Generate predictions for test.csv using the trained model

Let us begin by importing the necessary libraries: NumPy, Pandas, NLTK, scikit-learn, Matplotlib, Seaborn and WordCloud.

import numpy as np
import pandas as pd

# text processing libraries
import re
import string
import nltk
from nltk.corpus import stopwords

# sklearn
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

import random

nltk.download(['stopwords', 'wordnet'])
[nltk_data] Downloading package stopwords to /home/karan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/karan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

True
train_csv = "./data/train.csv"
test_csv = "./data/test.csv"
submission_csv = "./data/submission.csv"

We load the datasets using Pandas:

train = pd.read_csv(train_csv)
print('Training data shape: ', train.shape)
print(train.head())

print()

test = pd.read_csv(test_csv)
print('Testing data shape: ', test.shape)
print(test.head())
Training data shape:  (7613, 5)
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...

   target
0       1
1       1
2       1
3       1
4       1

Testing data shape:  (3263, 4)
   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN          Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan

Now that the datasets are loaded, we need to begin exploring them.

Firstly, we should see how many columns have missing data and assess if any of these columns will be used to train the model. If yes, we need to try our best to fill them. If not, we can leave them as they are.

Let's see how many values are missing in both datasets for every column:

1print("Missing values in training dataset")
2print(train.isnull().sum(), end='\n\n')
3
4print("Missing values in testing dataset")
5print(test.isnull().sum(), end='\n\n')
Missing values in training dataset
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Missing values in testing dataset
id             0
keyword       26
location    1105
text           0
dtype: int64

We know for a fact that the text and the target columns will be used for training. Since both these columns do not have missing data, we can proceed. We will consider filling the remaining columns if need be.
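If we did end up needing those columns, a minimal sketch of one way to fill them is shown below; the 'none' and 'unknown' placeholder values (and the filled variable) are purely illustrative and are not applied in the rest of this notebook.

# Sketch (not applied in this notebook): fill missing keyword/location values
filled = train.copy()
filled['keyword'] = filled['keyword'].fillna('none')
filled['location'] = filled['location'].fillna('unknown')
print(filled.isnull().sum())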

Now, we also need to make sure there is enough data for each class to train the model. If one class is heavily under-represented, the model may simply learn to favour the majority class and perform poorly on the minority class. Since we only have two target values, 0 and 1, let us check the count of each in the training set:

print(train['target'].value_counts())

sns.barplot(x=train['target'].value_counts().index,
            y=train['target'].value_counts(), palette='Purples_r')
0    4342
1    3271
Name: target, dtype: int64

<matplotlib.axes._subplots.AxesSubplot at 0x7fa0734cb280>

[Bar plot: tweet counts per target class]

This is a preview of what a disaster and a non-disaster tweet look like:

# Disaster tweet
disaster_tweets = train[train['target'] == 1]['text']
print(disaster_tweets.values[1])

# Not a disaster tweet
non_disaster_tweets = train[train['target'] == 0]['text']
print(non_disaster_tweets.values[1])
Forest fire near La Ronge Sask. Canada
I love fruits

Let us also look at the top 20 keywords to understand the general trend of the tweets:

sns.barplot(y=train['keyword'].value_counts()[:20].index,
            x=train['keyword'].value_counts()[:20], orient='h', palette='Purples_r')
<matplotlib.axes._subplots.AxesSubplot at 0x7fa070fb6670>

[Bar plot: top 20 keywords by count]

Most of these keywords seem to be related to disasters, so the vocabulary of the tweets is clearly worth exploring further.

Let us now see how many of the tweets that mention the word "disaster" in their text actually correspond to a real disaster:

train.loc[train['text'].str.contains(
    'disaster', na=False, case=False)].target.value_counts()
1    102
0     40
Name: target, dtype: int64

As we can see, about 71% of the tweets that mention "disaster" are labelled as real disasters, which is worth keeping in mind.
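As a quick sanity check, here is a small sketch (the disaster_mask name is just for illustration) that reproduces that percentage from the counts above:

# Sketch: share of "disaster"-mentioning tweets that are real disasters
disaster_mask = train['text'].str.contains('disaster', na=False, case=False)
counts = train.loc[disaster_mask].target.value_counts()
print(counts[1] / counts.sum())  # 102 / (102 + 40), roughly 0.72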

Let us now explore the "location" column:

train['location'].value_counts()
USA                         104
New York                     71
United States                50
London                       45
Canada                       29
                           ...
North West London             1
Holly Springs, NC             1
Ames, Iowa                    1
( ?å¡ ?? ?å¡),                1
London/Bristol/Guildford      1
Name: location, Length: 3341, dtype: int64

We can see that there are far too many unique values. We need to consolidate them to understand the trend in this data. For now, we will brute-force this with a hand-written mapping, although there are certainly better ways to do it (a small sketch of one alternative follows the results below).

# Replacing the ambiguous location names with standard names
train['location'].replace({'United States': 'USA',
                           'New York': 'USA',
                           "London": 'UK',
                           "Los Angeles, CA": 'USA',
                           "Washington, D.C.": 'USA',
                           "California": 'USA',
                           "Chicago, IL": 'USA',
                           "Chicago": 'USA',
                           "New York, NY": 'USA',
                           "California, USA": 'USA',
                           "FLorida": 'USA',
                           "Nigeria": 'Africa',
                           "Kenya": 'Africa',
                           "Everywhere": 'Worldwide',
                           "San Francisco": 'USA',
                           "Florida": 'USA',
                           "United Kingdom": 'UK',
                           "Los Angeles": 'USA',
                           "Toronto": 'Canada',
                           "San Francisco, CA": 'USA',
                           "NYC": 'USA',
                           "Seattle": 'USA',
                           "Earth": 'Worldwide',
                           "Ireland": 'UK',
                           "London, England": 'UK',
                           "New York City": 'USA',
                           "Texas": 'USA',
                           "London, UK": 'UK',
                           "Atlanta, GA": 'USA',
                           "Mumbai": "India"}, inplace=True)

sns.barplot(y=train['location'].value_counts()[:5].index,
            x=train['location'].value_counts()[:5], orient='h', palette='Purples_r')
<matplotlib.axes._subplots.AxesSubplot at 0x7fa07143c940>

[Bar plot: top 5 locations after consolidation]

train['location'].value_counts()
USA                         445
UK                          118
Africa                       51
India                        46
Worldwide                    45
                           ...
Quito, Ecuador.               1
Im Around ... Jersey          1
New Brunswick, NJ             1
England.                      1
London/Bristol/Guildford      1
Name: location, Length: 3312, dtype: int64

As we can see, the locations are consolidated much better now.
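For reference, here is a minimal sketch of a slightly less manual alternative: normalise the raw strings first, so that variants that differ only in case or trailing punctuation (e.g. "England." vs "England") collapse automatically, and then map a small table of canonical aliases. The normalize_location function and the aliases table below are illustrative only and are not part of the original notebook.

# Sketch (illustrative): normalise location strings before mapping aliases
aliases = {'united states': 'USA', 'new york': 'USA', 'london': 'UK',
           'united kingdom': 'UK', 'mumbai': 'India'}


def normalize_location(location):
    if pd.isna(location):
        return location
    key = location.lower().strip().strip('.,')
    return aliases.get(key, location)


# e.g. train['location'] = train['location'].apply(normalize_location)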


Now, we begin exploring the most important column, which is the "text" column.

train['text'].head()
0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
Name: text, dtype: object

Before we begin working on our model, we need to clean up the text column. There are a few basic transformations we can apply, such as making the text lowercase, removing links, removing HTML tags, removing punctuation and removing words containing numbers, like so:

def clean_text(text):
    """Make text lowercase, remove text in square brackets, remove links,
    remove HTML tags, remove punctuation, remove newlines and remove words
    containing numbers."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


# Applying the cleaning function to both test and training datasets
train['text'] = train['text'].apply(lambda x: clean_text(x))
test['text'] = test['text'].apply(lambda x: clean_text(x))

disaster_tweets = disaster_tweets.apply(lambda x: clean_text(x))
non_disaster_tweets = non_disaster_tweets.apply(lambda x: clean_text(x))

# Let's take a look at the updated text
print(train['text'].head())
print()
print(test['text'].head())
0    our deeds are the reason of this earthquake ma...
1                forest fire near la ronge sask canada
2    all residents asked to shelter in place are be...
3     people receive wildfires evacuation orders in...
4    just got sent this photo from ruby alaska as s...
Name: text, dtype: object

0                   just happened a terrible car crash
1    heard about earthquake is different cities sta...
2    there is a forest fire at spot pond geese are ...
3                apocalypse lighting spokane wildfires
4           typhoon soudelor kills in china and taiwan
Name: text, dtype: object

As we can see, the text looks a lot better now. Next, we break each sentence down into a list of words; this process is called tokenization.

# Tokenizing the training and the test set
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
train['text'] = train['text'].apply(lambda x: tokenizer.tokenize(x))
test['text'] = test['text'].apply(lambda x: tokenizer.tokenize(x))
train['text'].head()
0    [our, deeds, are, the, reason, of, this, earth...
1        [forest, fire, near, la, ronge, sask, canada]
2    [all, residents, asked, to, shelter, in, place...
3    [people, receive, wildfires, evacuation, order...
4    [just, got, sent, this, photo, from, ruby, ala...
Name: text, dtype: object

Next, we remove stopwords: common structural words that occur very frequently and hold a sentence together but carry little information for classification, so they aren't needed here.

stop_words = set(stopwords.words('english'))


def remove_stopwords(text):
    # Keep only the tokens that are not English stopwords
    words = [w for w in text if w not in stop_words]
    return words


train['text'] = train['text'].apply(lambda x: remove_stopwords(x))
print(train.head())
print()
test['text'] = test['text'].apply(lambda x: remove_stopwords(x))
print(test.head())
   id keyword location                                               text  \
0   1     NaN      NaN  [deeds, reason, earthquake, may, allah, forgiv...
1   4     NaN      NaN      [forest, fire, near, la, ronge, sask, canada]
2   5     NaN      NaN  [residents, asked, shelter, place, notified, o...
3   6     NaN      NaN  [people, receive, wildfires, evacuation, order...
4   7     NaN      NaN  [got, sent, photo, ruby, alaska, smoke, wildfi...

   target
0       1
1       1
2       1
3       1
4       1

   id keyword location                                               text
0   0     NaN      NaN                   [happened, terrible, car, crash]
1   2     NaN      NaN  [heard, earthquake, different, cities, stay, s...
2   3     NaN      NaN  [forest, fire, spot, pond, geese, fleeing, acr...
3   9     NaN      NaN         [apocalypse, lighting, spokane, wildfires]
4  11     NaN      NaN          [typhoon, soudelor, kills, china, taiwan]

After removing the stopwords, we are essentially left with tokenized keywords. We can see the most common words and their prominence by generating a word cloud like so:

def purple_color_func(word, font_size, position, orientation, random_state=None,
                      **kwargs):
    # Return a random shade of purple for each word
    return "hsl(270, 50%%, %d%%)" % random.randint(50, 60)


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[26, 8])
wordcloud1 = WordCloud(background_color='white',
                       color_func=purple_color_func,
                       random_state=3,
                       width=600,
                       height=400).generate(" ".join(disaster_tweets))
ax1.imshow(wordcloud1)
ax1.axis('off')
ax1.set_title('Disaster Tweets', fontsize=40)

wordcloud2 = WordCloud(background_color='white',
                       color_func=purple_color_func,
                       random_state=3,
                       width=600,
                       height=400).generate(" ".join(non_disaster_tweets))
ax2.imshow(wordcloud2)
ax2.axis('off')
ax2.set_title('Non Disaster Tweets', fontsize=40)
Text(0.5, 1.0, 'Non Disaster Tweets')

[Word clouds: disaster tweets vs. non-disaster tweets]

We will now lemmatize these tokens to bring each word to its base dictionary form, i.e. the lemma, and then join the tokens back into a single string.
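As a quick illustration of what the lemmatizer does, here are a few words from the dataset and their lemmas (these match the cleaned output shown further below; the example_lemmatizer name is just for illustration):

# Quick illustration: WordNet lemmas for a few words from the tweets
example_lemmatizer = nltk.stem.WordNetLemmatizer()
print(example_lemmatizer.lemmatize('deeds'))      # deed
print(example_lemmatizer.lemmatize('cities'))     # city
print(example_lemmatizer.lemmatize('wildfires'))  # wildfire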

lemmatizer = nltk.stem.WordNetLemmatizer()


def combine_text(list_of_text):
    # Lemmatize each token and join the tokens back into a single string
    combined_text = ' '.join(lemmatizer.lemmatize(token)
                             for token in list_of_text)
    return combined_text


train['text'] = train['text'].apply(lambda x: combine_text(x))
test['text'] = test['text'].apply(lambda x: combine_text(x))

print(train['text'].head())
print()
print(test['text'].head())
0           deed reason earthquake may allah forgive u
1                forest fire near la ronge sask canada
2    resident asked shelter place notified officer ...
3    people receive wildfire evacuation order calif...
4    got sent photo ruby alaska smoke wildfire pour...
Name: text, dtype: object

0                          happened terrible car crash
1    heard earthquake different city stay safe ever...
2    forest fire spot pond goose fleeing across str...
3                 apocalypse lighting spokane wildfire
4                   typhoon soudelor kill china taiwan
Name: text, dtype: object

Now that we're done with pre-processing the text, we can begin working on our model. Before we do that, we need to transform these words into numerical vectors. These vectors measure how strongly each word is present in a tweet, weighted against how common the word is across all tweets, so that "non-informational" words that appear everywhere are down-weighted. We will be using TF-IDF to achieve this. We need to vectorize both our train and test datasets to use them in our model like so:
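For reference, here is a small sketch of the weighting scheme TfidfVectorizer applies with its default smooth inverse document frequency; the idf_weight helper and the example numbers below are purely illustrative.

from math import log


def idf_weight(n_documents, document_frequency):
    # scikit-learn's default (smooth_idf=True) inverse document frequency:
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1; tf-idf = tf * idf, and each
    # document vector is then L2-normalised.
    return log((1 + n_documents) / (1 + document_frequency)) + 1


# e.g. a term appearing in 100 of the 7,613 training tweets
print(idf_weight(7613, 100))  # roughly 5.3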

tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
train_tfidf = tfidf.fit_transform(train['text'])
test_tfidf = tfidf.transform(test["text"])

print(train_tfidf)
print()
print(test_tfidf)
  (0, 5830)	0.4283910272487035
  (0, 3630)	0.442802667923881
  (0, 212)	0.3877668938463252
  (0, 5829)	0.2746947051361022
  (0, 2800)	0.3039078384184142
  (0, 7604)	0.32581037903448745
  (0, 2328)	0.442802667923881
  :	:
  (7612, 4426)	0.21234459847134035
  (7612, 1335)	0.21637434089753566
  (7612, 10334)	0.2314861276039809

  (0, 9230)	0.6106120272454298
  (0, 4170)	0.5433993122977064
  (0, 2061)	0.4048592115698027
  (0, 1427)	0.4098282059408354
  :	:
  (3262, 91)	0.3866807595647437
  (3262, 90)	0.3563580209819999
Let us begin training our classifier using the vectors we just generated. We will be using a Multinomial Naive Bayes classifier, first checking its F1 score with 5-fold cross-validation on the training data and then fitting it on the full training set so we can predict the class of the tweets in the test dataset, like so:

# Fitting a simple Naive Bayes on TFIDF
clf_NB_TFIDF = MultinomialNB()
scores = model_selection.cross_val_score(
    clf_NB_TFIDF, train_tfidf, train["target"], cv=5, scoring="f1")
print(scores)
[0.57703631 0.58502203 0.62051282 0.60203139 0.74344718]
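Each of the five numbers above is the F1 score on one cross-validation fold; averaging them gives a single summary figure:

# Mean F1 score across the five folds
print(scores.mean())  # roughly 0.63 for the scores printed above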
clf_NB_TFIDF.fit(train_tfidf, train["target"])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Finally, we can run our trained classifier on the tweets from our test set. The predictions are stored in the file data/submission.csv.

df = pd.DataFrame()
predictions = clf_NB_TFIDF.predict(test_tfidf)
df["id"] = test['id']
df["target"] = predictions
print(df)

df.to_csv(submission_csv, index=False)
         id  target
0         0       1
1         2       0
2         3       1
3         9       1
4        11       1
...     ...     ...
3258  10861       1
3259  10865       0
3260  10868       1
3261  10874       1
3262  10875       1

[3263 rows x 2 columns]

The notebook can be downloaded here if you'd like to review it.

References:

Intro to NLP - Kaggle

NLP Tutorials

Getting started with feature vectors

TF-IDF Vectorizer