Note: Part 2 of this notebook is accomplished with TensorFlow and can be found here.

Task Description

Use the Amazon review data from Kaggle to test the efficiency of our Sentiment Analysis models that live in TextAnalysis.jl. Compare them with models in the ScikitLearn and spaCy Python libraries. Upload your results as an issue in the TextAnalysis package.

Some basic machine learning knowledge is useful for this task.

Special thanks to Ayush Kaushal; an exemplary mentor without whom this task wouldn't be possible.

Below is the Julia part of the task. The Python notebook will be attached too, but with sparse documentation.

Sentiment analysis is the process of algorithmically identifying and categorizing opinions expressed in text to determine the user’s attitude toward the subject of the document (or post).

This is how I understand it.

Importing Required Packages

using TextAnalysis, FileIO

I'll be working on the test data, since the training set is huge and my laptop was unable to render it in Jupyter, even when left for about an hour.

So the test reviews are loaded as reviews, as the code below shows.

reviews = Document("text/test.ft.txt")
FileDocument("text/test.ft.txt", TextAnalysis.DocumentMetadata(Languages.English(), "text/test.ft.txt", "Unknown Author", "Unknown Time"))

Getting to know some of our data.

We can see that the .txt file contains reviews in the form:

"__label__1(/2) space ...the review..."

Exploratory data analysis also reveals that reviews beginning with __label__2 are positive reviews. That means their sentiment score should also be higher (I'll demonstrate that in a sec...). Similarly, reviews beginning with __label__1 are negative reviews, so their sentiment score should be lower.
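The line format described above can be split into a label and a review text with a few lines of Julia. This is only a sketch: the helper name parse_review and the sample line are made up for illustration, and it assumes every line has exactly one label followed by a space.

```julia
# Split one line of the review file into (is_positive, review_text).
# Assumes the line starts with "__label__1" or "__label__2" followed by a space.
function parse_review(line::AbstractString)
    label, text = split(line, ' '; limit=2)   # split only at the first space
    return (label == "__label__2", String(text))
end

# Hypothetical sample line in the same format as the data file.
sample = "__label__1 Batteries died within a year: the charger stopped working"
is_positive, text = parse_review(sample)
println(is_positive)
println(text)
```

Splitting at the first space avoids hard-coding the label length, which the fixed [1:10] slice used later in this notebook relies on.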

Loading our pre-trained Sentiment Analyzer to score these lines.

sent = SentimentAnalyzer()
┌ Info: CUDAdrv.jl failed to initialize, GPU functionality unavailable (set JULIA_CUDA_SILENT or JULIA_CUDA_VERBOSE to silence or expand this message)
└ @ CUDAdrv C:\Users\shekh\.julia\packages\CUDAdrv\3EzC1\src\CUDAdrv.jl:69
Sentiment Analysis Model Trained on IMDB with a 88587 word corpus
#seeing how the data is arranged.
3-element Array{String,1}:
 "__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing \"Who was that singing ?\""                                                                                                                                                                                                                                                                                         
 "__label__2 One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it."
 "__label__1 Batteries died within a year ...: I bought this charger in Jul 2003 and it worked OK for a while. The design is nice and convenient. However, after about a year, the batteries would not hold a charge. Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 

Now, for the first 15 reviews in our dataset, I'll compare the actual label with the sentiment score our model returns. This will show that the model isn't perfect and does predict the wrong sentiment for some reviews, which motivates text pre-processing to make the reviews comparable and to remove things like URLs that don't contribute to the read/feel of the review.

tab = "SNo. |  Label  | Prediction Score | Should be |  Predicted  | Correct/Incorrect  "
println(tab)
# Read the file once instead of twice per iteration.
lines = readlines("text/test.ft.txt")
for i in 1:15
    label = lines[i][1:10]                      # "__label__1" or "__label__2"
    review = StringDocument(lines[i][11:end])   # the rest of the line is the review
    pred = sent(review)                         # score between 0 and 1
    should_be = label == "__label__2" ? "+ve" : "-ve"
    pred_be = pred >= 0.5 ? "+ve" : "-ve"       # 0.5 is the decision threshold
    correct = pred_be == should_be ? "Correct" : "Incorrect"
    println("$i   | $label  | $pred  |  $should_be  |  $pred_be | $correct  ")
end
SNo. |  Label  | Prediction Score | Should be |  Predicted  | Correct/Incorrect  
1   | __label__2  | 0.39506337  |  +ve  |  -ve | Incorrect  
2   | __label__2  | 0.5314957  |  +ve  |  +ve | Correct  
3   | __label__1  | 0.52432084  |  -ve  |  +ve | Incorrect  
4   | __label__2  | 0.5501878  |  +ve  |  +ve | Correct  
5   | __label__2  | 0.5919624  |  +ve  |  +ve | Correct  
6   | __label__1  | 0.61544746  |  -ve  |  +ve | Incorrect  
7   | __label__1  | 0.732198  |  -ve  |  +ve | Incorrect  
8   | __label__1  | 0.55473757  |  -ve  |  +ve | Incorrect  
9   | __label__2  | 0.4127747  |  +ve  |  -ve | Incorrect  
10   | __label__1  | 0.58470565  |  -ve  |  +ve | Incorrect  
11   | __label__2  | 0.5855292  |  +ve  |  +ve | Correct  
12   | __label__1  | 0.51694876  |  -ve  |  +ve | Incorrect  
13   | __label__1  | 0.5547061  |  -ve  |  +ve | Incorrect  
14   | __label__2  | 0.45876318  |  +ve  |  -ve | Incorrect  
15   | __label__1  | 0.52366424  |  -ve  |  +ve | Incorrect  

It's clear that our model isn't optimal: out of 15 samples, only 4 were predicted correctly. However, I was a little too harsh on the model, since in some cases, like in

14 | __label__2 | 0.45876318 | +ve | -ve | Incorrect

the score was close to the 0.5 threshold, so the prediction was a near miss. So yeah, sorry Mr. Sentiment Analyzer.

Moving on to improving the accuracy of the predictions by applying some of the pre-defined text-processing functions in the TextAnalysis package. But first, I want to know the length of our test data set, so I can process it in batches sized according to my computational power.

test_data = readlines("text/test.ft.txt")

OK, so now that we know the size of the data we're dealing with, let's get started with the pre-processing.

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
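The four outcomes defined above can be tallied with a small helper. This is a minimal sketch rather than code from the notebook: the function name confusion_counts, its threshold keyword, and the sample arrays are assumptions made for illustration. The scores are the first five from the table printed earlier, with true marking a __label__2 (positive) review.

```julia
# Tally confusion-matrix counts from true labels and raw scores.
# A score at or above `threshold` counts as a positive prediction.
function confusion_counts(actual::Vector{Bool}, scores::Vector{<:Real}; threshold=0.5)
    tp = fp = tn = fn = 0
    for (is_pos, score) in zip(actual, scores)
        predicted_pos = score >= threshold
        if predicted_pos && is_pos
            tp += 1          # predicted +ve, actually +ve
        elseif predicted_pos && !is_pos
            fp += 1          # predicted +ve, actually -ve
        elseif !predicted_pos && is_pos
            fn += 1          # predicted -ve, actually +ve
        else
            tn += 1          # predicted -ve, actually -ve
        end
    end
    return (tp=tp, fp=fp, tn=tn, fn=fn)
end

actual = [true, true, false, true, true]
scores = [0.39506337, 0.5314957, 0.52432084, 0.5501878, 0.5919624]
counts = confusion_counts(actual, scores)
println(counts)
```

Note the direction of the two error types: a positive review scored below the threshold is a false negative, and a negative review scored at or above it is a false positive.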

test_labels = String[]
test_string = String[]
fal_pos = 0
fal_neg = 0
tru_pos = 0
tru_neg = 0
for i in 1:length(test_data)
    label = test_data[i][1:10]
    push!(test_labels, label)
    review = test_data[i][11:end]
    push!(test_string, review)
    # The label and review are now stored in their respective arrays;
    # next, score the individual review.
    review_sd = StringDocument(review)
    if label == "__label__2"
        should_be = "+ve"
    else
        should_be = "-ve"
    end
    pred = sent(review_sd)
    if pred >= 0.5
        pred_be = "+ve"
    else
        pred_be = "-ve"
    end
    if pred_be == "+ve" && should_be == "+ve"
        tru_pos += 1
    elseif pred_be == "-ve" && should_be == "-ve"
        tru_neg += 1
    elseif pred_be == "-ve" && should_be == "+ve"
        # Predicted -ve but actually +ve: a false negative.
        # Note: the run recorded below had these two counters swapped, so its
        # printed precision and recall are interchanged (F1 is unaffected).
        fal_neg += 1
    elseif pred_be == "+ve" && should_be == "-ve"
        # Predicted +ve but actually -ve: a false positive.
        fal_pos += 1
    end
end
BoundsError: attempt to access 32×5000 Array{Float32,2} at index [Base.Slice(Base.OneTo(32)), 5001]

 [1] throw_boundserror(::Array{Float32,2}, ::Tuple{Base.Slice{Base.OneTo{Int64}},Int64}) at .\abstractarray.jl:538
 [2] checkbounds at .\abstractarray.jl:503 [inlined]
 [3] _getindex at .\multidimensional.jl:669 [inlined]
 [4] getindex at .\abstractarray.jl:981 [inlined]
 [5] embedding(::Array{Float32,2}, ::Array{Float64,1}) at C:\Users\shekh\.julia\packages\TextAnalysis\pcFQf\src\sentiment.jl:27
 [6] (::TextAnalysis.var"#24#25"{Dict{Symbol,Any}})(::Array{Float64,1}) at C:\Users\shekh\.julia\packages\TextAnalysis\pcFQf\src\sentiment.jl:40
 [7] get_sentiment(::TextAnalysis.var"#26#27", ::Array{String,1}, ::Dict{Symbol,Any}, ::Dict{String,Any}) at C:\Users\shekh\.julia\packages\TextAnalysis\pcFQf\src\sentiment.jl:59
 [8] (::TextAnalysis.SentimentModel)(::Function, ::Array{String,1}) at C:\Users\shekh\.julia\packages\TextAnalysis\pcFQf\src\sentiment.jl:87
 [9] SentimentAnalyzer at C:\Users\shekh\.julia\packages\TextAnalysis\pcFQf\src\sentiment.jl:103 [inlined] (repeats 2 times)
 [10] top-level scope at .\In[33]:28
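The traceback above shows a BoundsError raised from an array lookup inside the analyzer, which stops the whole loop. One possible way to keep a long run going is to catch that error per review and skip the offending input. The sketch below uses a hypothetical stand-in scorer (toy_scorer, backed by a two-entry table) instead of the real SentimentAnalyzer, and the helper name safe_score is also an assumption.

```julia
# Hypothetical stand-in for a model with a fixed-size lookup table:
# it fails with a BoundsError when the input has more words than table entries.
const table = [0.2, 0.8]
toy_scorer(text) = table[length(split(text))]

# Return the score, or `nothing` if the scorer hits a BoundsError.
# Any other kind of error is rethrown unchanged.
function safe_score(scorer, text)
    try
        return scorer(text)
    catch err
        err isa BoundsError || rethrow()
        return nothing
    end
end

println(safe_score(toy_scorer, "great"))           # within the table
println(safe_score(toy_scorer, "far too long"))    # skipped
```

Skipped reviews should be counted separately, since dropping them changes which part of the data set the metrics describe.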

Ahh! Finally it's complete.

Well, almost: the loop stops with BoundsError: attempt to access 32×5000 Array{Float32,2} at index [Base.Slice(Base.OneTo(32)), 5001]. There is already an issue about this on the TextAnalysis package.

Ref:BoundsError in sentiment analysis. I've decided to ignore it, so the counts used below cover only the reviews processed before the error. Let's move on to calculating our prediction metrics: Precision / F1Score / Recall.

$$P = \frac{T_p}{T_p+F_p}$$

$$R = \frac{T_p}{T_p + F_n}$$

$$ F1 = \frac{2 \cdot P\cdot R}{P+ R} $$
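As a quick check of the three formulas, here is a small worked example. The helper name prf and the counts passed to it (3 true positives, 1 false positive, 1 false negative) are illustrative assumptions, not the notebook's results.

```julia
# Compute precision, recall and F1 from confusion-matrix counts.
function prf(tp, fp, fn)
    p = tp / (tp + fp)            # P = Tp / (Tp + Fp)
    r = tp / (tp + fn)            # R = Tp / (Tp + Fn)
    f1 = 2 * p * r / (p + r)      # harmonic mean of P and R
    return (precision=p, recall=r, f1=f1)
end

m = prf(3, 1, 1)
println(m)
```

With these counts, precision and recall are both 3/4, so F1 is 0.75 as well. F1 is symmetric in P and R, so swapping the two leaves it unchanged.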


precision = tru_pos / (tru_pos + fal_pos)
println("Precision is $precision")
Precision is 0.583117838593833
recall = tru_pos / (tru_pos + fal_neg)
println("Recall is $recall")
Recall is 0.5144996465068449
f1score = (2 * precision * recall) / (precision + recall)
#f1score is from 0 --> 1
println("F1Score is $f1score.")
F1Score is 0.5466638895622987.

End of Report