Comparing Sentiment Analysis Models
Part 1 with TextAnalysis.jl | Amazon Reviews Dataset
Task Description
Use the amazon review data from Kaggle to test the efficiency of our Sentiment Analysis models that live in TextAnalysis.jl. Compare it with models in ScikitLearn and Spacy python libraries. Upload your results as an issue in the TextAnalysis package.
Some basic machine learning knowledge is useful for this task.
Special thanks to Ayush Kaushal; an exemplary mentor without whom this task wouldn't be possible.
Find below, the julia
part of the task. The python notebook would be attached too but would have sparse documentation.
The process of algorithmically identifying and categorizing opinions expressed in text to determine the user’s attitude toward the subject of the document (or post).
This is how I understand it.
Importing Required Packages
using TextAnalysis, FileIO
I would be working on the test data since the train one is humongous and my laptop was unable to render that in Jupyter every single time even when left for about an hour.
So, declaring the test reviews as review
as evident by the code below.
reviews = Document("text/test.ft.txt")
Getting to know some of our data.
We can see that the .txt
file contains reviews in the form of :
"label1(/2) space ...the review..."
Exploratory data analysis also reveals that reviews beginning with __label__2
are positive reviews. That means that their sentiment score would also be higher (I'll demonstrate that in a sec...)
Similarly, reviews beginning with __label__1
are negative reviews and so their sentiment score should evidently be lower.
sent = SentimentAnalyzer()
#seeing how the data is arranged.
readlines("text/test.ft.txt")[1:3]
Now, I'll see that for the first 10 reviews in our dataset, what the actual label is and what sentiment score does our model return. From this we'll be able to know that the model isn't perfect and does indeed predict wrong sentiments for some reviews. Thus, developing a need to do text pre-processing to make the reviews comparable and remove unnecessary stuff like urls and other things generally not contributing to the read/feel of the review. !
tab = "SNo. | Label | Prediction Score | Should be | Predicted | Correct/Incorrect "
println(tab)
println("-"^(length(tab)+5))
for i in 1:15
label = readlines("text/test.ft.txt")[i][1:10];
review = readlines("text/test.ft.txt")[i][11:end];
review = StringDocument(review);
pred = sent(review);
if label == "__label__2"
should_be = "+ve"
else
should_be = "-ve"
end
if pred >= 0.5
pred_be = "+ve"
elseif pred < 0.5
pred_be = "-ve"
end
if pred_be == should_be
correct = "Correct"
else
correct = "Incorrect"
end
println("$i | $label | $pred | $should_be | $pred_be | $correct ")
end
It's clear that our model isn't optimal since out of 15 samples, only 4 were correct predictions. However, I went a little too harsh on the model since in some cases, like in
14 | __label__2 | 0.45876318 | +ve | -ve | Incorrect
the model was within some limit of correct predictions. So yeah, sorry Mr. Sentiment Analyzer.
Moving on towards trying to improve the accuracy of predcitions by performing some general pre-defined text-processing functions in TextAnalysis package. But first, I want to know the length of our test data set so I can make batches of processing accrodinly to my computational powers.
test_data = readlines("text/test.ft.txt")
length(test_data)
Ok, so now we know the size of the data we're dealing with let's get started with the pre-processing.
A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.
A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
test_labels = []
test_string = []
fal_pos = 0
fal_neg = 0
tru_pos = 0
tru_neg = 0
for i in 1:length(test_data)
label =test_data[i][1:10];
push!(test_labels, label);
review = test_data[i][11:end];
push!(test_string, review)
#after adding reviews and labels in their respective arrays.
#I'll perform pre-processing on individual reviews.
review_sd = StringDocument(review)
remove_corrupt_utf8!(review_sd)
stem!(review_sd)
remove_case!(review_sd)
#remove_indefinite_articles!(review_sd)
#remove_definite_articles!(review_sd)
if label == "__label__2"
should_be = "+ve"
else
should_be = "-ve"
end
pred = sent(review_sd)
if pred >= 0.5
pred_be = "+ve"
elseif pred < 0.5
pred_be = "-ve"
end
if pred_be == "+ve" && should_be == "+ve"
tru_pos += 1
elseif pred_be == "-ve" && should_be == "-ve"
tru_neg += 1
elseif pred_be == "-ve" && should_be == "+ve"
fal_pos += 1
elseif pred_be == "+ve" && should_be == "-ve"
fal_neg += 1
end
end
Ahh! Finally it's complete.
We get BoundsError: attempt to access 32×5000 Array{Float32,2} at index [Base.Slice(Base.OneTo(32)), 5001]
error however on seeing this issue on TextAnalysis package.
Ref:BoundsError in sentiment analysis I've decided to ignore it. Let's get on towards calculating our predictions metrices: Precision / F1Score / Recall.
$$P = \frac{T_p}{T_p+F_p}$$
$$R = \frac{T_p}{T_p + F_n}$$
$$ F1 = \frac{2 \cdot P\cdot R}{P+ R} $$
precision = tru_pos / (tru_pos + fal_pos)
println("Precision is $precision")
recall = tru_pos / (tru_pos + fal_neg)
println("Recall is $recall")
f1score = (2 * precision * recall) / (precision + recall)
#f1score is from 0 --> 1
println("F1Score is $f1score.")