Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Chapter 3: Classification

Exercise: Question 3

Problem Statement: Build a spam classifier (a more challenging exercise)

  • Download examples of spam and ham from Apache SpamAssassin's public datasets.
  • Unzip the datasets and familiarize yourself with the data format.
  • Split the datasets into a training set and a test set.
  • Write a data preparation pipeline to convert each email into a feature vector. The pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word.
  • You may add hyperparameters to the preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL", replace all numbers with "NUMBER", or perform stemming.

Optional: try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.

Official data description

  • spam: 500 spam messages, all received from non-spam-trap sources.

  • easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).

  • hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.

  • easy_ham_2: 1400 non-spam messages. A more recent addition to the set.

  • spam_2: 1397 spam messages. Again, more recent.

Total count: 6047 messages, with about a 31% spam ratio
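As a quick sanity check on the figures above, the totals can be recomputed from the per-folder counts. This is only arithmetic on the numbers quoted in the description. The second ratio assumes, as in the rest of this notebook, that only the easy_ham and spam folders are used.

```python
# Message counts copied from the official data description above.
counts = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}

total = sum(counts.values())
spam_total = sum(n for name, n in counts.items() if name.startswith("spam"))
spam_ratio = spam_total / total

# Ratio for the two folders this notebook actually works with.
used_ratio = counts["spam"] / (counts["spam"] + counts["easy_ham"])

print(total, spam_total, round(spam_ratio, 3), round(used_ratio, 3))
```

So the full corpus is about 31% spam, but the subset used here is only about 17% spam, which is worth keeping in mind when reading accuracy numbers later on.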

import tarfile
import os
import urllib.request

down_path = "http://spamassassin.apache.org/old/publiccorpus/"
ham_url = down_path + "20030228_easy_ham.tar.bz2"
spam_url = down_path + "20030228_spam.tar.bz2"
spam_path = os.path.join("datasets", "spam")

def fetch_spam_data(spam_url=spam_url, spam_path=spam_path):
    if not os.path.isdir(spam_path):
        os.makedirs(spam_path)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        tar_bz2_file = tarfile.open(path)
        tar_bz2_file.extractall(path=spam_path)
        tar_bz2_file.close()
fetch_spam_data()
ham_directory = os.path.join(spam_path, "easy_ham")
spam_directory = os.path.join(spam_path, "spam")
ham_filenames = [name for name in sorted(os.listdir(ham_directory)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(spam_directory)) if len(name) > 20]
print(len(ham_filenames))
print(len(spam_filenames))
2500
500
# Parse the emails with Python's email module, using the default policy
import email
import email.parser
import email.policy

def get_mails(is_spam, file, spam_path=spam_path):
    if is_spam:
        directory = "spam"
    else:
        directory = "easy_ham"
    with open(os.path.join(spam_path, directory, file), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)
ham_emails = [get_mails(is_spam=False, file=name) for name in ham_filenames]
spam_emails = [get_mails(is_spam=True, file=name) for name in spam_filenames]
print(ham_emails[42].get_content().strip())
< >
> I downloaded a driver from the nVidia website and installed it using RPM.
> Then I ran Sax2 (as was recommended in some postings I found on the net),
but
> it still doesn't feature my video card in the available list. What next?


hmmm.

Peter.

Open a terminal and as root type
lsmod
you want to find a module called
NVdriver.

If it isn't loaded then load it.
#insmod NVdriver.o
Oh and ensure you have this module loaded on boot.... else when you reboot
you might be in for a nasty surprise.

Once the kernel module is loaded

#vim /etc/X11/XF86Config

in the section marked
Driver I have "NeoMagic"
you need to have
Driver "nvidia"

Here is part of my XF86Config

Also note that using the card you are using you 'should' be able to safely
use the FbBpp 32 option .

Section "Module"
 Load  "extmod"
 Load  "xie"
 Load  "pex5"
 Load  "glx"
 SubSection "dri"    #You don't need to load this Peter.
  Option     "Mode" "666"
 EndSubSection
 Load  "dbe"
 Load  "record"
 Load  "xtrap"
 Load  "speedo"
 Load  "type1"
EndSection

#Plus the Modelines for your monitor should be singfinicantly different.

Section "Monitor"
 Identifier   "Monitor0"
 VendorName   "Monitor Vendor"
 ModelName    "Monitor Model"
 HorizSync    28.00-35.00
 VertRefresh  43.00-72.00
        Modeline "800x600" 36 800 824 896 1024 600 601 603 625
 Modeline "1024x768" 49 1024 1032 1176 1344 768 771 777 806
EndSection

Section "Device"

 Identifier  "Card0"
 Driver      "neomagic" #Change this to "nvidia"... making sure the modules
are in the correct path
 VendorName  "Neomagic" # "Nvidia"
 BoardName   "NM2160"
 BusID       "PCI:0:18:0"
EndSection

Section "Screen"
 Identifier "Screen0"
 Device     "Card0"
 Monitor    "Monitor0"
 DefaultDepth 24
 SubSection "Display"
  Depth     1
 EndSubSection
 SubSection "Display"
  Depth     4
 EndSubSection
 SubSection "Display"
  Depth     8
 EndSubSection
 SubSection "Display"
  Depth     15
 EndSubSection
 SubSection "Display"
  Depth     16
 EndSubSection
 SubSection "Display"
  Depth     24
  #FbBpp   32 #Ie you should be able lto uncomment this line
  Modes   "1024x768" "800x600" "640x480" # And add in higher resulutions as
desired.
 EndSubSection
EndSection


-- 
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie
print(spam_emails[42].get_content().strip())
Help wanted.  We are a 14 year old fortune 500 company, that is
growing at a tremendous rate.  We are looking for individuals who
want to work from home.

This is an opportunity to make an excellent income.  No experience
is required.  We will train you.

So if you are looking to be employed from home with a career that has
vast opportunities, then go:

http://www.basetel.com/wealthnow

We are looking for energetic and self motivated people.  If that is you
than click on the link and fill out the form, and one of our
employement specialist will contact you.

To be removed from our link simple go to:

http://www.basetel.com/remove.html


7749doNL1-136DfsE5701lGxl2-486pAKM7127JwoR4-054PCfq9499xMtW0-594hucS91l66

Some emails are actually multipart, with images and attachments. Let's look at the various types of structures.

def email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()
    
from collections import Counter

def structure_count(emails):
    structures = Counter()
    for email in emails:
        structure = email_structure(email)
        structures[structure] += 1
    return structures    
structure_count(ham_emails).most_common()
[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]
structure_count(spam_emails).most_common()
[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]

We can see that ham is mostly plain text, and a fair number of ham messages are signed with PGP. Spam, on the other hand, contains a lot of HTML (on its own or alongside plain text), and none of the spam messages in this sample are PGP-signed. Email structure therefore looks like a useful feature for classification.
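One way to act on this observation would be to turn the structure into a few binary features. The sketch below is not part of the pipeline built later in this notebook. The function name structure_features and the toy message are hypothetical, and it only uses the standard library's email package.

```python
from email.message import EmailMessage

def structure_features(message):
    """Return simple binary features describing a message's MIME structure."""
    # walk() visits the message itself and every nested part.
    content_types = [part.get_content_type() for part in message.walk()]
    return {
        "is_multipart": message.is_multipart(),
        "has_html": "text/html" in content_types,
        "has_plain": "text/plain" in content_types,
        "has_pgp_signature": "application/pgp-signature" in content_types,
    }

# Hypothetical toy message: a plain-text body with an HTML alternative.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("plain body")
demo.add_alternative("<html><body>html body</body></html>", subtype="html")

features = structure_features(demo)
print(features)
```

Features like these could be appended to the word-count vector, though whether they help would need to be checked with cross-validation.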

#email_headers
for header, value in spam_emails[42].items():
    print(header,"-->",value)
Return-Path --> <bill@bluemail.dk>
Delivered-To --> zzzz@localhost.spamassassin.taint.org
Received --> from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 98B7343F99	for <zzzz@localhost>; Mon, 26 Aug 2002 10:12:43 -0400 (EDT)
Received --> from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Mon, 26 Aug 2002 15:12:43 +0100 (IST)
Received --> from smtp.easydns.com (smtp.easydns.com [205.210.42.30])	by webnote.net (8.9.3/8.9.3) with ESMTP id TAA11952;	Fri, 23 Aug 2002 19:49:56 +0100
From --> bill@bluemail.dk
Received --> from bluemail.dk (klhtnet.klht.pvt.k12.ct.us [206.97.9.2])	by smtp.easydns.com (Postfix) with SMTP	id 754E52CFFB; Fri, 23 Aug 2002 14:49:52 -0400 (EDT)
Reply-To --> bill@bluemail.dk
Message-ID --> <003d35d40cab$6883b2c8$6aa10ea4@khnqja>
To --> byrt5@hotmail.com
Subject --> FORTUNE 500 COMPANY HIRING, AT HOME REPS.
MiME-Version --> 1.0
Content-Type --> text/plain; charset="iso-8859-1"
X-Priority --> 3 (Normal)
X-MSMail-Priority --> Normal
X-Mailer --> Microsoft Outlook Express 6.00.2462.0000
Importance --> Normal
Date --> Fri, 23 Aug 2002 14:49:52 -0400
Content-Transfer-Encoding --> 8bit

A networking person would tell you that these headers carry a wealth of information that could be used for classification. I still need to read up on how spam shows up in the headers, so for now let's just look at the "Subject" header.
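To give an idea of what could be pulled out of the "Subject" header, here is a small sketch. The feature names and the subject_features helper are made up for illustration, and the raw message is a cut-down copy of the headers printed above. Nothing here is used by the pipeline further down.

```python
import email
import email.policy

SUBJECT = "FORTUNE 500 COMPANY HIRING, AT HOME REPS."
RAW = (
    "From: bill@bluemail.dk\n"
    "Subject: " + SUBJECT + "\n"
    "\n"
    "Help wanted.\n"
)

def subject_features(message):
    """Compute a few hand-made features from the Subject header."""
    subject = message["Subject"] or ""
    letters = [ch for ch in subject if ch.isalpha()]
    # Share of alphabetic characters that are upper case (0.0 if no letters).
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "subject_length": len(subject),
        "subject_upper_ratio": upper_ratio,
        "subject_has_digits": any(ch.isdigit() for ch in subject),
    }

msg = email.message_from_string(RAW, policy=email.policy.default)
feats = subject_features(msg)
print(feats)
```

An all-caps subject like this one gives an upper-case ratio of 1.0, which is the kind of signal a classifier might pick up on.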

spam_emails[42]["Subject"]
'FORTUNE 500 COMPANY HIRING, AT HOME REPS.'
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array(ham_emails + spam_emails, dtype=object)  # dtype=object keeps each email as a single array element
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Feature-Engineering

#ToDo
#- Convert HTML to plain text (using BS4 or regex)
import re 
from html import unescape

def htmlTOtext(html):
    text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text) 
#checking htmlTOtext
htmlSPAM = []
for email in X_train[y_train==1]:
    if email_structure(email) == "text/html":
        htmlSPAM.append(email)
sampleSPAM = htmlSPAM[5]
print(sampleSPAM.get_content().strip()[:1000], "...")
<html><body><center>

<table bgcolor="663399" border="2" width="999" cellspacing="0" cellpadding="0">
  <tr>
    <td colspan="3" width="999"> <hr><font color="yellow"> 
<center>
<font size="7"> 
<br><center><b>Get 12 FREE VHS or DVDs! </b><br>
<table bgcolor="white" border="2" width="500">
  <tr>    <td>
 <font size="7"> <font color="003399"><center>Click <a href="http://www.bozomber.com/porno/index.html"> HERE For Details!</a>
<font size="5"><br>
</td></tr></table> <br> 

<table bgcolor="#CCFF33" border="2" width="600">
  <tr>    <td><center><center><font size="6"><font color="6633CC"><br>
We Only Have HIGH QUALITY <br>Porno Movies to Choose From!<br><br>
 
 "This is a <i>VERY SPECIAL, LIMITED TIME OFFER</i>."<br><br> Get up to 12 DVDs absolutely FREE,<br> with<a href="http://www.bozomber.com/porno/index.html"> NO COMMITMENT!</a> 
 <br><br>
There's <b>no better deal anywhere</b>.<br>
There's <i>no catches</i> and <i>no gimmicks</i>. <br>You only pay for the shipping,<br> and the DVDs  ...
print(htmlTOtext(sampleSPAM.get_content())[:1000], "...")
Get 12 FREE VHS or DVDs!
  Click  HYPERLINK  HERE For Details!
We Only Have HIGH QUALITY Porno Movies to Choose From!
 "This is a VERY SPECIAL, LIMITED TIME OFFER." Get up to 12 DVDs absolutely FREE, with HYPERLINK  NO COMMITMENT!
There's no better deal anywhere.
There's no catches and no gimmicks. You only pay for the shipping, and the DVDs are absolutely free!
Take a Peak at our HYPERLINK   Full Catalog!
 High quality cum filled titles such as:
 HYPERLINK  500 Oral Cumshots 5
Description: 500 Oral Cum Shots! I need hot jiz on my face! Will you cum in my mouth?
 Dozens of Dirty Hardcore titles such as:
 HYPERLINK  Amazing Penetrations No. 17
Description: 4 full hours of amazing penetrations with some of the most beautiful women in porn!
 From our "Sexiest Innocent Blondes" collections:
 HYPERLINK  Audition Tapes
Description: Our girls go from cute, young and innocent, to screaming sex goddess
 beggin' to have massive cocks in their tight, wet pussies and asses!
 ...
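The regex approach works well enough on this sample, but regexes can trip over unusual markup. As an alternative, here is a sketch that uses the standard library's html.parser instead (BeautifulSoup, mentioned in the ToDo, would be another option). The class and function names are hypothetical, and the whitespace handling is simpler than in htmlTOtext: everything is collapsed onto one line.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skip <head>, and mark links with HYPERLINK."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities such as &amp; are decoded
        self.chunks = []
        self.in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "a" and not self.in_head:
            self.chunks.append(" HYPERLINK ")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

    def handle_data(self, data):
        if not self.in_head:
            self.chunks.append(data)

def html_to_text_parser(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())

demo_html = (
    "<html><head><title>x</title></head><body>"
    '<b>Get 12 FREE DVDs!</b> Click <a href="http://example.com">HERE</a> &amp; save'
    "</body></html>"
)
demo_text = html_to_text_parser(demo_html)
print(demo_text)  # Get 12 FREE DVDs! Click HYPERLINK HERE & save
```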
#Great! Now let's write a function that takes an email as input and returns its content as plain text, whatever its format is:
def emailTOtext(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            return content
        else:
            html = content
    if html:
        return htmlTOtext(html)

NSFW Below

not me, but the data is NSFW

print(emailTOtext(sampleSPAM)[:100], "...")
Get 12 FREE VHS or DVDs!
  Click  HYPERLINK  HERE For Details!
We Only Have HIGH QUALITY Porno Movi ...

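Let's do some more text preprocessing, namely stemming: reducing each word to its stem so that variants such as "Computing" and "Computed" are counted together.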

import nltk
stemmer = nltk.PorterStemmer()
for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive", "Technology", "Convulated"):
    print(word, "-->", stemmer.stem(word))
Computations --> comput
Computation --> comput
Computing --> comput
Computed --> comput
Compute --> comput
Compulsive --> compuls
Technology --> technolog
Convulated --> convul

Let's also do as the problem statement says and replace all URLs with "URL".

import urlextract
urlextractor = urlextract.URLExtract()
#try
print(urlextractor.find_urls("My personal website is talktosharmadhav.netlify.com and I like to surf wikipedia.com and keep my code on www.github.com/pseudocodenerd I just watched this https://www.youtube.com/watch?v=_7QRpuhz-90"))
['talktosharmadhav.netlify.com', 'wikipedia.com', 'www.github.com/pseudocodenerd', 'https://www.youtube.com/watch?v=_7QRpuhz-90']

lol nice, it works
Now, let's put all this together into a text transformer

from sklearn.base import BaseEstimator, TransformerMixin

class dopeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
        
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = emailTOtext(email)
            
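            # Caveat: numbers and punctuation are handled before URL extraction, so
            # find_urls never sees intact URLs (note 'http', 'www', 'com' in the output).
            if self.replace_numbers:
                # Caveat: the decimal part only matches when an exponent follows,
                # so "3.14" becomes "NUMBER.NUMBER" rather than "NUMBER".
                text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', str(text))
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)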
            if self.replace_urls and urlextractor is not None:
                urls = list(set(urlextractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)  
    
    def fit(self, X, y=None):
        return self
            
sampleX = X_train[:2]
sampleXwordcount = dopeTransformer().fit_transform(sampleX)
print(sampleXwordcount)
[Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'R': 1})
 Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'by': 3, 'jefferson': 2, 'I': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'to': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'http': 1, 'www': 1, 'postfun': 1, 'com': 1, 'pfp': 1, 'worboi': 1, 'html': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'To': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'E': 1, 'remsburg': 1, 'letter': 1, 'william': 1, 'short': 1, 'again': 1, 'becom': 1, 'most': 1, 'pervert': 1, 'system': 1, 'that': 1, 'ever': 1, 'shone': 1, 'man': 1, 'absurd': 1, 'untruth': 1, 'were': 1, 'perpetr': 1, 'upon': 1, 'a': 1, 'larg': 1, 'band': 1, 'dupe': 1, 'import': 1, 'led': 1, 'paul': 1, 'first': 1, 'great': 1, 'corrupt': 1})]
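Notice the 'http', 'www', 'postfun', 'com', 'pfp', 'worboi' and 'html' tokens in the second Counter. The URL was split up by the punctuation step before the URL extractor got to see it, so it was never replaced. Below is a sketch of the cleaning steps in a different order (URLs, then numbers, then punctuation). It is an illustration only and was not used to produce any of the results in this notebook. To stay self-contained it assumes a simple regex in place of urlextract, so bare domains like "wikipedia.com" are not caught, and the number regex is a variant that treats the decimal part and the exponent as separately optional.

```python
import re

# Assumption: only http(s):// and www. prefixes are treated as URLs here.
URL_RE = re.compile(r'(?:https?://|www\.)\S+', flags=re.I)
# Digits, optional decimal part, optional exponent.
NUMBER_RE = re.compile(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?')

def clean_text(text):
    """Tokenize text after replacing URLs and numbers with placeholders."""
    text = URL_RE.sub(' URL ', text)       # 1. URLs first, while they are still intact
    text = NUMBER_RE.sub('NUMBER', text)   # 2. then numbers
    text = re.sub(r'\W+', ' ', text)       # 3. punctuation last
    return text.split()

tokens = clean_text(
    "Visit http://www.postfun.com/pfp/worboi.html for 3.14 reasons, or 42!"
)
print(tokens)
# ['Visit', 'URL', 'for', 'NUMBER', 'reasons', 'or', 'NUMBER']
```

One limitation of this regex: punctuation that directly follows a URL (a trailing comma, say) gets swallowed into the match. That is harmless here, since the whole match is replaced by the placeholder.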

With the word counts in hand, we need to convert them into vectors that a classifier can use.

from scipy.sparse import csr_matrix

class dopeVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocab_size=1000):
        self.vocab_size = vocab_size

    def fit(self, X, y=None):
        # Build the vocabulary: an ordered list of the most common words
        countT = Counter()
        for word_count in X:
            for word, count in word_count.items():
                # Cap each email's contribution so one long email cannot dominate
                countT[word] += min(count, 10)
        mostCommon = countT.most_common()[:self.vocab_size]
        self.mostCommon = mostCommon
        self.vocab = {word: index + 1 for index, (word, count) in enumerate(mostCommon)}
        return self
    
    def transform(self, X, y=None):
        # Column 0 collects every word that is not in the vocabulary
        R, C, Data = [], [], []
        for r, word_count in enumerate(X):
            for word, count in word_count.items():
                R.append(r)
                C.append(self.vocab.get(word,0))
                Data.append(count)
        return csr_matrix((Data, (R, C)), shape=(len(X), self.vocab_size + 1))
sampleVectorX = dopeVectorTransformer(vocab_size=5)
sampleVectors = sampleVectorX.fit_transform(sampleXwordcount)
print(sampleVectors)
sampleVectors.toarray()
print(sampleVectorX.vocab)
  (0, 0)	6
  (1, 0)	115
  (1, 1)	11
  (1, 2)	9
  (1, 3)	8
  (1, 4)	3
  (1, 5)	3
{'the': 1, 'of': 2, 'and': 3, 'all': 4, 'christian': 5}

nice
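In case the 6 and the 115 in column 0 look odd: every word that is not in the vocabulary is mapped to column 0, and csr_matrix sums duplicate (row, column) entries, so column 0 ends up holding the total count of out-of-vocabulary words. The sketch below reproduces that logic with plain lists so it can be followed by hand. to_dense_row is a hypothetical helper, not part of the transformer.

```python
from collections import Counter

def to_dense_row(word_count, vocab, vocab_size):
    """Build one dense row; unknown words accumulate in column 0."""
    row = [0] * (vocab_size + 1)
    for word, count in word_count.items():
        row[vocab.get(word, 0)] += count
    return row

vocab = {'the': 1, 'of': 2, 'and': 3, 'all': 4, 'christian': 5}

# First sample email from above: none of its six words are in the vocabulary.
first = Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'R': 1})
first_row = to_dense_row(first, vocab, vocab_size=5)
print(first_row)  # [6, 0, 0, 0, 0, 0]

# Toy example mixing known and unknown words.
toy = Counter({'the': 11, 'of': 9, 'jefferson': 2, 'I': 2})
toy_row = to_dense_row(toy, vocab, vocab_size=5)
print(toy_row)  # [4, 11, 9, 0, 0, 0]
```

Note that these are counts, whereas the problem statement asks for presence or absence. Replacing each count with min(count, 1) would give the binary version.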

# Let's run this on the entire training set now that we have tested it
from sklearn.pipeline import Pipeline

pre_processing = Pipeline([("email_to_word_count", dopeTransformer()),
                          ("wordcount_to_vector", dopeVectorTransformer()),
                          ])
X_final = pre_processing.fit_transform(X_train)
#finally
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression(solver="liblinear", random_state=42)  # explicit solver avoids the default-solver FutureWarning
score = cross_val_score(model, X_final, y_train, cv=3, verbose=3)
score.mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.985, total=   0.1s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s
[CV]  ................................................................
[CV] .................................. , score=0.99125, total=   0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.5s finished
0.9870833333333332

That is a mean cross-validation accuracy of about 98.7%. Dope.
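Keep in mind that this number is accuracy, and only about one email in six here is spam, so a classifier that always answers "ham" would already score around 83%. The problem statement asks for precision and recall, so those are the numbers to check next. Here is a minimal sketch of how the two are computed, on made-up labels (1 = spam, 0 = ham). scikit-learn's precision_score and recall_score do the same job on real predictions.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class, labelled 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for illustration only.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 1]

precision, recall = precision_recall(y_true_demo, y_pred_demo)
accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(round(precision, 3), round(recall, 3), accuracy)  # 0.667 0.667 0.75
```

In this toy case accuracy is 75% while precision and recall are both about 67%, which shows how accuracy alone can flatter a classifier on an imbalanced dataset.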