I've been trying to educate myself about machine learning because it's all the rage. Also, understanding it on some level seems essential to grasp what the humongous Internet companies are doing to our lives.
Word2Vec has gotten some attention because it seems to extract sophisticated semantic information from a document collection just by looking at word usage in context. Synonyms are its most basic result, found by looking for the nearest neighbors of the vectors representing words (based on cosine similarity of the vectors).
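To make the nearest-neighbor idea concrete, here is a minimal sketch in plain Python rather than Spark. The three-dimensional vectors and the helper names are invented for illustration; a trained Word2Vec model would supply vectors with far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(word, vectors, k=3):
    """Return the k (word, score) pairs most similar to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up vectors, for illustration only.
toy_vectors = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.8, 0.2, 0.1],
    "small": [-0.7, 0.3, 0.1],
    "tweet": [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors("big", toy_vectors, k=2)
print(neighbors)
```

With these toy numbers, "large" comes out as the closest neighbor of "big" because their vectors point in nearly the same direction.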
What, I wondered, might Word2Vec tell me about the Tweeter-in-Chief? I had a pretty good idea, to be honest: next to nothing, because Word2Vec relies on context and lots of data. But that wasn't really the point of the exercise. The point was to use the tools as if I were doing something real.
I found a data collection of the tweets issued by the TiC since election day, which turns out to be not really very much data, and I started my journey.
The Setup
I'm focusing on Spark for data exploration these days. And I wanted to do this in a notebook, because Data Science! I've been making some contributions to the BeakerX project, which evolved from a standalone notebook system to a set of kernels and extensions for Jupyter. So, Spark's ML lib in a Jupyter/BeakerX Scala notebook it is.
The BeakerX Scala kernel doesn't have built-in Spark integration, but it does have a command to automatically download libraries (and their dependencies) and put them on the classpath. Specifying the spark-core, spark-sql, and spark-mllib libraries does the trick. Almost. More about that soon.
Reading and Preprocessing the Data
The data was nicely prepared in JSON, suitable for reading in with Spark's JSON reader. The data provider had also noted that the data contained many duplicates, an artifact of the way it was collected from Twitter. So, the first step was to remove duplicates and extract a single text field.
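The deduplication step can be sketched in plain Python. This is not the Spark code from the notebook, and it rests on assumptions: one JSON object per line, a field named "text", and duplicates defined as identical text. The sample records are invented.

```python
import json

def unique_texts(json_lines, field="text"):
    """Parse one JSON object per line; return distinct text values in first-seen order."""
    seen = set()
    texts = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get(field)
        if text is None or text in seen:
            continue  # skip records without text, or already-seen text
        seen.add(text)
        texts.append(text)
    return texts

# Invented sample records; the field name "text" is an assumption.
sample = [
    '{"id": 1, "text": "Big rally tonight!"}',
    '{"id": 2, "text": "Big rally tonight!"}',
    '{"id": 3, "text": "Thank you Alabama!"}',
]
texts = unique_texts(sample)
print(texts)
```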
Next comes tokenization. I opted for a regular expression tokenizer that split on non-word characters. That means my tokens did not include at-signs: nytimes and @nytimes are the same token. It seemed like the right thing, but writing this now I wonder if I should test its impact.
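Here is a rough stand-in for that tokenizer using Python's re module instead of Spark. The function name is hypothetical, and lowercasing is my assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]

# The at-sign is a non-word character, so it disappears during the split.
tokens_with_at = tokenize("Thanks @nytimes!")
tokens_without_at = tokenize("nytimes")
print(tokens_with_at, tokens_without_at)
```

Because the split discards the "@", a mention of an account and a plain use of the same word end up as one vocabulary entry.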
To Stop or Not To Stop
Most text analysis removes stop words—common words that occur so frequently that they usually don't help differentiate documents. Word2Vec relies on context, though. Should I remove stop words or not? I went to the web to try to answer that question. I didn't find an authoritative answer, but the balance seemed to be toward removing stop words. So I did.
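Stop word removal itself is a simple filter. The sketch below uses a tiny hand-picked list purely for illustration; real stop lists are much longer, and the names here are hypothetical.

```python
# A tiny, hand-picked stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "will", "be"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words while preserving the order of the remaining tokens."""
    return [token for token in tokens if token not in stop_words]

filtered = remove_stop_words(["the", "wall", "will", "be", "built"])
print(filtered)
```

One consequence worth keeping in mind: Word2Vec looks at a window of neighboring tokens, so dropping stop words pulls the remaining content words closer together in that window.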
Library Issues
Now I'm ready to run Word2Vec on my prepared data. But no. I'm getting a missing class at runtime. It turns out that there is a bug in the handling of transitive dependencies for libraries, apparently inherited from Ivy. It takes a while to figure out the problem: I need to manually add the arpack dependency because the dependency resolver is loading the arpack source jar instead of the binary jar. (I was lucky to find a bug report against SBT that mentioned this problem.) I also notice, after the missing library is resolved, a complaint about not finding native libraries for the linear algebra routines, which isn't fatal but hurts performance. The solution for that is at least documented.
Word2Vec finally runs, producing a model that I can query.
What Did I Get?
Word2VecModel supports a couple of handy built-in queries that save the trouble of feeding the results into a separate step to find nearest neighbors with cosine similarity. So, here's where I get to use the BeakerX notebook's table display functionality. I generated two tables. The first table iterates over the vocabulary and shows me, for each word, its three closest synonyms. The second table shows me just the closest synonym, but it includes the score. That makes it easy to sort by score and see the best matches. Which is where things get weird.
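The second table, each word paired with its single closest match and sorted by score, can be sketched outside Spark as follows. Spark's Word2VecModel does offer a findSynonyms method, though whether that is the query used here is my guess. The two-dimensional vectors and placeholder words are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_matches(vectors):
    """One row per word: its closest other word and the score, best scores first."""
    rows = []
    for word, vec in vectors.items():
        candidates = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != word
        ]
        match, score = max(candidates, key=lambda pair: pair[1])
        rows.append({"word": word, "match": match, "score": score})
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows

# Made-up vectors with placeholder words, for illustration only.
toy_vectors = {
    "alpha": [1.0, 0.0],
    "beta": [0.9, 0.1],
    "gamma": [0.0, 1.0],
    "delta": [0.2, 0.9],
}

rows = best_matches(toy_vectors)
for row in rows:
    print(row["word"], row["match"], round(row["score"], 3))
```

Sorting by score puts the tightest pairs at the top, which is exactly where odd pairings caused by data artifacts tend to surface.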
Tickets to the White House?
One of the top pairings is "tickets" and "whitehouse". Huh? I look at the tweets containing "tickets", and they are all promoting post-election campaign rallies. They all have a link to a website for tickets. The tweets containing "whitehouse" actually contain "@whitehouse" and they seem to be mentions/retweets of material published by the @whitehouse feed, also with links. So it turns out that the shared context is the tokens generated by a URL: "https", "t", "co", followed by a unique string.
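A small Python sketch shows how a non-word tokenizer turns shortened links into shared context. Both tweets and their link IDs are invented, and the tokenizer is a stand-in for the Spark one, with lowercasing assumed.

```python
import re

def tokenize(text):
    """Lowercase the text and split on runs of non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]

# Invented tweets with placeholder link IDs; not taken from the data set.
rally_tweet = "Join me at the rally! Tickets: https://t.co/abc123"
retweet = "RT @WhiteHouse: Watch live https://t.co/xyz789"

# Tokens that appear in both tweets.
shared = set(tokenize(rally_tweet)) & set(tokenize(retweet))
print(sorted(shared))
```

The only tokens the two tweets have in common are the pieces of the link prefix, so to Word2Vec the words near those links look like they keep the same company.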
Revisiting the data cleaning process, I figure the best thing to do is just remove all the URLs from the text before tokenization. So I add a regular expression substitution to map from text to text without URLs. Works great, except that I'm still seeing some URLs.
Your Data Is Never as Clean as You Think
I was looking for "https://t.co/\w+" in my regular expression. But, surprise, some of the URLs actually look like "https://t.c…". It seems that the original data extraction somehow truncated the tweets. I don't know exactly how or why. I fix the URL regular expression so that anything that starts with "https:" and has no spaces up to a "…" will also get removed from the input. I decide not to worry about the other truncations; I just want those URLs gone.
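The widened pattern might look something like this in Python. It is a sketch of the idea rather than the expression used in the notebook; the combined pattern, the function name, and the whitespace cleanup are my own additions, and the sample text is invented.

```python
import re

# First alternative: "https:" followed by non-space characters up to an ellipsis
# (a truncated link). Second alternative: a complete t.co short link.
URL_PATTERN = re.compile(r"https:\S*?…|https://t\.co/\w+")

def strip_urls(text):
    """Remove short links, including ones cut off with an ellipsis, then tidy whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

clean_full = strip_urls("Tickets here: https://t.co/abc123 see you there")
clean_truncated = strip_urls("Watch the full speech https://t.c…")
print(clean_full)
print(clean_truncated)
```

The truncated alternative is listed first so that a link cut off partway through its ID is removed together with its trailing ellipsis instead of leaving the ellipsis behind.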
Results
As I expected, there were no real surprises, and precious little of what you might expect showed up either. The interesting close synonyms are things like "repeal" and "border" (linked by both being legislative priorities?), and "state" and "alabama" (clearly semantically linked).
Looking past the closest match to the second or third, a few fun things turn up for the more interesting words.
"obamacare" is close to "repeal" and "healthcare". The closest match to "russia" is "hillary", and "russians" is close to "phony". "democrat" is close to "republican". "foxnews" is close to "seanhannity". "nytimes" is close to "media" and the closest match to "times" is "dishonest". "cnn" is close to "russia". "fake" is close to "media". "hurricaneharvey" (presumably a hashtag) is close to "fema". "god" is close to "military".
Somehow the closest match for "obama" is "ivankatrump". The latter is clearly an @ reference; the match seems to come from retweets of Ivanka talking about the Trump administration, which overlap with direct remarks about the Obama administration. "Administration" seems to be the common context, from my eyeballing it.
Conclusions
It's always educational to get hands-on and end-to-end with software, even for a toy project. A lot of the value comes from working through all the little things that crop up, since nothing ever works exactly as anticipated. Applying tools that really require a large dataset to a tiny one is not going to result in amazing insights, of course. Even if it didn't yield particularly interesting results, I think it was worth doing.