As a fledgling Data Science student at Lambda School, it’s easy to get lost in Python’s vast ecosystem of data science libraries, such as pandas, NumPy, and scikit-learn, and lose sight of the underlying algorithms we depend on for our analysis.
As part of our Computer Science (CS) coursework, which we share with our Lambda Web Development compatriots, our objective is to “take it back to the streets” and implement a common algorithm in native code rather than leaning on an existing library. This exercise flexes our CS learning and refreshes our understanding of the algorithmic underpinnings of the libraries we use every day. Challenge accepted!
As a fan of the Go programming language, I decided to add a personal twist and use Go (https://golang.org) instead of Python as the implementation language of choice.
Kick It With Go
Go is a statically typed, compiled language initially designed at Google by Robert Griesemer, Rob Pike, and Ken Thompson (Wikipedia). It is fast becoming the lingua franca of networked computing leveraged by prominent projects such as Docker, Kubernetes, and Terraform as well as a common choice for server-side systems at companies such as Google, Uber, Twitch, and Dropbox.
The Problem: Spam Message Classification
A classic problem of the Internet age was combating the major pain point of email, the new form of communication we had embraced. That pain was spam: unwanted email messages, mostly tiresome advertisements and outright scams. (How many Nigerian princes are there, anyway?)
The issue is this: how can a computer program determine that this bunch of text is a birthday wish from Mom while that gaggle of words is pitching some new form of snake oil?
One answer lies in Bayesian inference, which builds upon Bayes’ Theorem to classify email messages as spam (unwanted) or ham (wanted).
Bayes’ Theorem is a mathematical expression that describes the probability of an event occurring in terms of conditions known prior to that event. In the case of our pesky spam message (a nasty event), can we predict that the next email we receive is spam based on what we currently know about the words included in that email?
We don’t know with absolute certainty that the next message is spam but a smart computer program can quickly read the inbound message’s words and “know” how often those words were included in spam messages observed in the past.
Computer programs are fast (usually) so the hope is that our program can read the bag of words from the message, assess how those words have been used in messages in the past (humans have marked past messages as spam/unwanted or ham/desired) and classify that message as spam and remove it from our view.
So we have a possible future event: that my next email will be spam. Let’s call that future event “A”. And we have known events that have occurred in the past: past emails, containing particular sets of words, that have been judged spam or ham. Let’s call that event “B”.
Bayes’ Theorem gives us a way to express the probability of spam (“A”) given that the presence of a bunch of words has resulted in messages deemed to be spam in the past (“B”). Here’s that formula from Wikipedia:
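Written out (in the standard form that matches the Wikipedia article), the theorem reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```

Here A is the event that the next message is spam, and B is the event that the message contains our observed set of words.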
One step we can take with our basic spam filtering approach is to assume that each message word is independent of every other. That is, we naively assume that if we come across the word “cold” we have no extra information that the word “beer” will also be in the message. This assumption is the “naive” in the naive Bayesian approach. It may be unrealistic (in the case of my personal network… perhaps extremely so) but we’ll go with it.
But making this assumption allows us to simplify the Bayes Theorem expression above for spam modeling purposes.
In short, if we take the naive approach, then given a set of tokens, the probability that a message is spam can be expressed by what we’ve analyzed in our dataset: the probability that those same tokens appeared in spam messages, divided by the sum of the probabilities that the set of tokens appears in spam and in ham messages. Intuitively, if a bunch of words almost always appears in spam messages but rarely in ham messages, the probability of spam will tend towards 100%.
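Under that naive assumption (and, for simplicity, treating spam and ham as equally likely before we read any words, an assumption I’m making here), the simplified expression looks like:

```latex
P(\text{spam} \mid \text{tokens}) =
  \frac{P(\text{tokens} \mid \text{spam})}
       {P(\text{tokens} \mid \text{spam}) + P(\text{tokens} \mid \text{ham})}
```

where each of the two conditional probabilities is just the product of the individual token probabilities, thanks to the independence assumption.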
A Message Object
We’ll jump into Go code by describing what our program means by a message. Using Go’s struct object, the logic defines a message as a data structure with two attributes: Text (the message string) and isSpam (a true/false flag indicating if the message is spam). Here’s a little snippet of code that makes this definition.
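A minimal sketch of that definition might look like the following (the sample messages are illustrative placeholders of my own, not rows from the actual dataset):

```go
package main

import "fmt"

// message represents one email: its raw text and its known spam/ham label.
type message struct {
	Text   string // the message string
	isSpam bool   // true/false flag: is this message spam?
}

func main() {
	// a small batch of labeled messages, like the one used to fit the model
	myMessages := []message{
		{Text: "You have WON a guaranteed prize!", isSpam: true},
		{Text: "Are we still on for lunch tomorrow?", isSpam: false},
	}
	fmt.Println(len(myMessages), "messages; first is spam:", myMessages[0].isSpam)
}
```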
A batch of messages, such as the one we’ll use for fitting our model, is simply a slice of message structs, represented by the myMessages variable defined above.
Tokenize Message Words
To mathematically process words in a message, we’ll need to get a handle on how to define words from all of the characters that can be included in a message. For example, how does our process interpret punctuation such as the question mark at the end of this sentence?
To help, we’ll need to “tokenize” words to eliminate extraneous and ambiguous characters and treat words in a consistent manner across all messages. To do that, we’ll fashion a rudimentary tokenization process using these basic rules:
- generate a unique token string for each allowable word in all messages
- ignore spaces, so each word is the text between spaces (i.e., split the message string into individual words wherever embedded spaces occur)
- lowercase all alpha characters
- and… throw out all characters other than the lowercase letters a to z and the digits 0 to 9
So the message “Hey! hey… Why didn’t you get 2 DQ peanut buster parfaits?!”, when tokenized, would generate this list of tokens: hey, why, didnt, you, get, 2, dq, peanut, buster, parfaits
After tokenizing all of our words we’ll have a count of how many times each word appears in a spam message or ham (non-spam) message. Presumably, some spammy words are far more prevalent in spam messages than not.
Create a Classifier Model
Similar to what we do with some of the Python ecosystem’s sophisticated statistical inference and machine learning libraries, we’ll create a model object to classify our messages. As with messages above, the logic stands up a Go struct to house the model’s attributes.
We’ll fit our dataset to the model and fill in these object attribute details based on that operation. In particular, we’ll know based on our data:
- our unique set of tokens
- the number of spam messages in which a token is found
- the number of ham messages in which a token is found
- the total number of messages we have processed
- the total number of spam messages
- … and the number of ham messages
To fit the classifier model, the code reads in a CSV file of spam/ham messages downloaded from Kaggle (Dataset: Spam Text Message Classification — thanks TeamAI & Kaggle! link)
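Skipping the CSV plumbing, the fitting step itself can be sketched like this. The struct and field names are illustrative (the repo’s actual names may differ), and the tokenizer here is a simplified stand-in (lowercase, split on spaces, dedupe) for the fuller one described earlier:

```go
package main

import (
	"fmt"
	"strings"
)

// message mirrors the struct defined earlier.
type message struct {
	Text   string
	isSpam bool
}

// classifier holds the fitted counts listed above (field names illustrative).
type classifier struct {
	tokens    map[string]bool // our unique set of tokens
	spamCount map[string]int  // number of spam messages containing each token
	hamCount  map[string]int  // number of ham messages containing each token
	nMsgs     int             // total messages processed
	nSpam     int             // ... of which spam
	nHam      int             // ... of which ham
}

// tokenize is a simplified stand-in for the tokenizer described earlier.
func tokenize(text string) []string {
	seen := map[string]bool{}
	var out []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if !seen[w] {
			seen[w] = true
			out = append(out, w)
		}
	}
	return out
}

// fit tallies per-token spam/ham counts over a batch of labeled messages.
func (c *classifier) fit(msgs []message) {
	c.tokens = map[string]bool{}
	c.spamCount = map[string]int{}
	c.hamCount = map[string]int{}
	for _, m := range msgs {
		c.nMsgs++
		if m.isSpam {
			c.nSpam++
		} else {
			c.nHam++
		}
		// tokenize returns unique tokens, so each increment means
		// "one more message of this class contains the token"
		for _, tok := range tokenize(m.Text) {
			c.tokens[tok] = true
			if m.isSpam {
				c.spamCount[tok]++
			} else {
				c.hamCount[tok]++
			}
		}
	}
}

func main() {
	var c classifier
	c.fit([]message{
		{Text: "win a free prize now", isSpam: true},
		{Text: "lunch at noon today", isSpam: false},
	})
	fmt.Println(c.nMsgs, c.nSpam, c.nHam, c.spamCount["free"], c.hamCount["lunch"])
	// → 2 1 1 1 1
}
```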
Okay… the reason we’re here. Let’s do a prediction. Our task is to calculate the naive Bayesian metric, generating a value that we can interpret as a prediction.
To generate a prediction the logic will:
- accept a text string
- convert the string to a set of tokens
- for each token, calculate the probability of it appearing in a spam and ham message
- for the set of tokens, calculate the overall probability of that set appearing in a spam and in a ham message; in this case, instead of multiplying very small numbers, the logic adds up the natural log of each value
- return the set’s spam and ham values and generate the overall naive Bayesian metric we talked about above
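Those steps can be sketched as follows. The classifier struct, its field names, and the additive-smoothing constant k are assumptions of mine (smoothing keeps a token never seen in one class from zeroing out the whole product); the input is an already-tokenized message:

```go
package main

import (
	"fmt"
	"math"
)

// classifier carries the fitted counts described earlier (names illustrative).
type classifier struct {
	spamCount map[string]int // number of spam messages containing each token
	hamCount  map[string]int // number of ham messages containing each token
	nSpam     int            // total spam messages seen during fitting
	nHam      int            // total ham messages seen during fitting
}

// k is an additive-smoothing constant (my assumption, not from the post).
const k = 0.5

// predict returns the naive Bayes spam probability for a tokenized message.
func (c *classifier) predict(tokens []string) float64 {
	var logSpam, logHam float64
	for _, tok := range tokens {
		// smoothed probability of this token appearing in each class
		pTokSpam := (float64(c.spamCount[tok]) + k) / (float64(c.nSpam) + 2*k)
		pTokHam := (float64(c.hamCount[tok]) + k) / (float64(c.nHam) + 2*k)
		// add natural logs instead of multiplying very small numbers
		logSpam += math.Log(pTokSpam)
		logHam += math.Log(pTokHam)
	}
	pSpam := math.Exp(logSpam)
	pHam := math.Exp(logHam)
	// the overall naive Bayes metric: spam / (spam + ham)
	return pSpam / (pSpam + pHam)
}

func main() {
	// toy counts standing in for a fitted model
	c := &classifier{
		spamCount: map[string]int{"prince": 40, "winner": 25},
		hamCount:  map[string]int{"prince": 1, "birthday": 30},
		nSpam:     50,
		nHam:      50,
	}
	fmt.Printf("%.3f\n", c.predict([]string{"prince", "winner"}))
	// → 0.999
}
```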
Take a Look!
For Go developers or those interested in checking out some Go code, take a gander at my GitHub repo: https://github.com/danoand/CS-Data-Science-Build-Week-1. (Still a noob at Go — be gentle!)
The Go modeling code resides in the modeling.go file in the root of the repo. The comparable Python model can be found in the python/app.py file.
Better yet! The repo is a web application that displays an interactive web page where you can run predictions on messages from the underlying dataset or on a message you type in yourself. You can compare the results from our native Go model and a Python scikit-learn model.
Navigate to https://bayes.dananderson.dev to take a look!