## Finding the Right Words

At Warby Parker, we get a lot of tweets. Our social media team does a great job of responding to each one and recording various metadata that the team uses for reporting purposes. One type of post that we often see from our Home Try-On customers is an informal poll to Warby Parker, friends, and family regarding Home Try-On choices. Here’s an example of one that we posted ourselves, featuring Chris Becker from our Tech team:

The Social Media team recently wanted to track this type of post in order to better engage customers and help them with their Home Try-Ons, as well as gauge customer involvement in our Home Try-On program. By providing meaningful responses to these posts, Warby Parker can both assist customers through the process and provide a better customer experience. The team approached the Data Science team to automate this tracking, both for incoming tweets and for historical data.

#### The Problem

From a data science perspective, this is simply a text classification problem. Given the tweet text, the task is to classify whether it is a Home Try-On poll or not. In particular, we need to map the contents of the post to one of two classes: “Positive” if the post is a Home Try-On poll and “Negative” otherwise. The issues faced were the following:

1. Dealing with natural language is always an issue. People can sometimes use the exact same words and mean entirely different things. For example, consider the following sentences: “Meatloaf for dinner, great!” and “Meatloaf for dinner? Great…”. The same words were used, but one implied that meatloaf for dinner was a good thing, and the other implied the opposite.

2. There is an inherent class imbalance. Warby Parker gets a lot of tweets, and only a small fraction of them are the type of posts we would label as positive. In fact, a classifier that simply labels all tweets as negative would have an accuracy of about 99%.

3. There were many unlabeled data points. When our Social Media team approached us with the problem, Warby Parker had received well over 100,000 tweets to date but had fewer than 100 positive-labeled examples.

#### Feature Selection

Selecting numeric features from a text post usually involves text pre-processing, followed by text vectorization. In our case, minimal text pre-processing was used. Punctuation was removed, and words from a given stopword list were filtered out (words such as the, and, and of).

For text vectorization, we used term frequency-inverse document frequency (tf-idf), which weights each term's frequency in a document by how rare that term is across the corpus. For a given term (or word) $t$, a given document $d$, and a corpus $D$, tf-idf is calculated by

$tfidf(t, d) = tf(t,d) \times idf(t, D)$,

where $tf(t,d)$, the term frequency, is defined by

$tf(t, d) = \frac{f(t, d)}{N_{d}}$,

where $f(t,d)$ is the number of times $t$ appears in $d$, and $N_{d}$ is the number of terms in document $d$. The inverse document frequency, $idf(t,D)$, is the logarithm of the ratio of the total number of documents to the number of documents containing term $t$:

$idf(t,D) = \log\left ( \frac{N}{\left | \left \{ d \in D : t\in d \right \} \right |} \right )$,

where $N$ is the number of documents in the corpus.
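To make the definitions concrete, here is a small hand computation of these formulas (the toy corpus is invented for illustration):

```python
import math

# Toy corpus: each document is a list of terms, assumed already pre-processed.
docs = [
    ["help", "pick", "glasses"],
    ["love", "glasses"],
    ["pick", "two", "pick"],
]

def tf(term, doc):
    # f(t, d) / N_d: raw count divided by the number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "pick" appears twice in the three-term third document and in 2 of 3 documents,
# so its tf-idf there is (2/3) * log(3/2):
print(tfidf("pick", docs[2], docs))
```

Note that a term appearing in every document gets $idf = \log(1) = 0$, so tf-idf automatically downweights ubiquitous words.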

Using the scikit-learn Python package, we can perform both stopword filtering and tf-idf vectorization at once.
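A sketch of that step (the corpus here is invented, and the parameter values are illustrative rather than the exact ones we used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the tweet text.
corpus = [
    "help me pick my home try on frames",
    "which pair should I keep",
    "just ordered a home try on box",
]

# Built-in English stopword list; max_df and sublinear_tf are illustrative.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95, sublinear_tf=True)
X = vectorizer.fit_transform(corpus)  # sparse document-term tf-idf matrix

print(X.shape)  # (number of documents, size of learned vocabulary)
```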

In the above code snippet, we initialize a TfidfVectorizer object, where we specify the stopword list. We then call .fit_transform(corpus), which builds the tf-idf model of the corpus and transforms the corpus into a document-term matrix. The max_df parameter specifies the maximum document frequency we will consider: terms appearing in a greater fraction of documents than this threshold are ignored. The sublinear_tf parameter, when True, changes the term frequency calculation to:

$tf_{new} = 1 + \log(tf)$.

This transformation, as the name implies, is sublinear, which in practice lessens the effect of frequently occurring words. Since this calculation is only for the term frequency—and thus only affects the numerator of the $tf-idf$ calculation—having a sublinear transform is a good way to reduce the effect of very common words.

#### Hand-made features

Because of the nature of tweets in general, and of tweets about our Home Try-On program in particular, some regularities can be exploited. We created a few hand-made features that performed well:

1. True/False value indicating whether the tweet contained a URL

2. The domain of the URL, or None if no URL was present

3. How far into the tweet the URL appears, expressed as the starting index divided by tweet length

In general, the tweets polling for Home Try-On help included URLs, so feature #1 proved to be a good first filter on incoming tweets. The remaining features helped indicate what type of content the URL was likely to contain.
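The three URL features above can be sketched roughly as follows (the regex and helper name are ours, for illustration, not necessarily what ran in production):

```python
import re

# Capture the domain of the first http(s) URL in a tweet.
URL_RE = re.compile(r"https?://([^/\s]+)\S*")

def url_features(tweet):
    match = URL_RE.search(tweet)
    has_url = match is not None                              # feature 1
    domain = match.group(1) if match else None               # feature 2
    position = match.start() / len(tweet) if match else None # feature 3
    return has_url, domain, position

print(url_features("Help me pick! https://warbyparker.com/hto vote below"))
print(url_features("no link here"))
```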

#### Class Imbalances and Unlabeled Data

Upon receiving the data from our Social Media team, we had a large corpus of tweets but very few labeled points. Worse yet, the natural proportion of positive to negative examples was low (i.e., most tweets, although unlabeled, were not positive examples). We dealt with this problem iteratively:

1. Build a classifier on the data we have, inferring the labels of unlabeled points (using semi-supervised learning methods)

2. Examine the classifier's performance, fixing any obviously misclassified examples (using good old-fashioned elbow grease)

3. Repeat

To infer labels of unlabeled points, we utilized the LabelSpreading class from scikit-learn. This is a semi-supervised learning technique that propagates labels from a small number of labeled points to the unlabeled ones. Using it is rather straightforward:
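A minimal sketch on a toy feature matrix (scikit-learn's convention is to mark unlabeled points with -1):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two well-separated clusters; only one point in each is labeled.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, -1, -1, 1, -1, -1])  # -1 means "unlabeled"

model = LabelSpreading(kernel="rbf")
model.fit(X, y)
y_inferred = model.transduction_  # inferred label for every point

print(y_inferred)
```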

and voila! The variable y_inferred now contains an inferred label for each point.

#### Model Selection

After feature selection, we built both a multinomial Naive Bayes model and an SVM model for classification. The SVM model outperformed the Naive Bayes model and is what I will discuss here.

SVMs come out of the box with a few hyperparameters. The C parameter specifies how sensitive the model is to errors (effectively acting as a regularization parameter), and the radial basis function (RBF) kernel comes with a gamma parameter specifying the width of the kernel.

In general, optimizing hyperparameters is a difficult problem, but we can approximate the optimization through sampling. It has been shown that random sampling can be more effective than exhaustive approaches like grid search[1]. We can easily perform random hyperparameter search in scikit-learn as follows:
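A sketch of that search on synthetic data (the distribution scales and settings are illustrative; current scikit-learn versions expose RandomizedSearchCV under sklearn.model_selection):

```python
from scipy.stats import expon
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the tweet features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Sample C and gamma from exponential distributions rather than a fixed grid.
param_distributions = {"C": expon(scale=100), "gamma": expon(scale=0.1)}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions=param_distributions,
    n_iter=20,   # number of hyperparameter configurations to sample
    n_jobs=-1,   # run candidates in parallel across all cores
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```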

When specifying hyperparameters, we provide a scipy distribution object from which to sample; the RandomizedSearchCV object takes care of the sampling. In this case, we used scipy.stats.expon, an exponential distribution. The n_iter parameter specifies how many hyperparameter configurations to sample (and thus how many models to build), and the n_jobs parameter specifies how many jobs to run concurrently.

#### Conclusion

Binary text classification can always be messy, but by taking a few precautions, you can avoid many headaches. In particular, keeping in mind the types of behaviors present in the domain and exploiting these behaviors through hand-made features, noting the proportion of positive to negative samples, and searching for more optimal hyperparameters all help reduce the ambiguities.

You might be asking yourself: how well did our classifier perform? Well, on a corpus of 74,016 examples, 10,520 of which were labeled and 63,496 of which were unlabeled, the classifier had a cross-validation F-score of 0.92 (0.88 for positive and 0.96 for negative). It turns out that people tend to use a lot of the same words when making a Home Try-On poll post! For example, the word ‘pick’ appeared in about 34% of positive examples and only about 2% of negative examples, and the word ‘help’ appeared in about 47% of positive versus about 10% of negative examples.

Oh, and by the way, Chris ended up choosing the Beckett, and we think it was a good choice.

[1] : Bergstra, James, and Yoshua Bengio. “Random search for hyper-parameter optimization.” The Journal of Machine Learning Research 13 (2012): 281-305.
