Extracting value from social and news data

Machine learning and AI techniques have opened new avenues for quantitative fund managers to derive value from both traditional and non-traditional data sources. Investors are starting to obtain powerful insights from conversations and interactions taking place in the news and on Twitter. Many are finding that sentiment data is a largely untapped source of alpha, distinct from the traditional financial data and strategies previously available.

This is creating a massive opportunity, as everyone is looking for an edge in getting relevant market-moving information ahead of other market participants. To capture it, however, investors must embrace some quantitative practices. Identifying a reliable signal isn’t as simple as reading the news or following the right people on Twitter. It requires human intuition to underpin the strategy, infrastructure to handle large volumes of data and machine learning to model that data.

Cleaning and handling social data

Recent breakthroughs in technology have made transforming the many varieties of data more effective. One method used is to employ natural language processing to extract and tag relevant information buried within troves of unstructured text.

“This involves defining words in the correct context,” says Arun Verma, Senior Quantitative Researcher and Head of Quant Research Solutions at Bloomberg. “Cook and Apple on their own could refer to a recipe, but together in a string of text, it likely involves the company Apple (AAPL).”

A related technique is “named entity disambiguation,” which determines which items in a Twitter stream or news article refer to a given company. It’s a necessary step in processing text for analysis.
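As a rough illustration of the idea, the sketch below links a story to a ticker only when the company name appears alongside enough supporting context words. The cue list, the `min_cues` threshold and the function names are all hypothetical and chosen for the example; a production system would learn such associations from data rather than hard-code them.

```python
import re

# Hypothetical context cues for one ticker (an assumption for illustration only).
CONTEXT_CUES = {
    "AAPL": {
        "names": {"apple"},
        "cues": {"cook", "iphone", "cupertino", "shares", "earnings"},
    },
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def link_entities(text, min_cues=2):
    """Return tickers whose name appears together with at least `min_cues` context cues."""
    tokens = set(tokenize(text))
    linked = []
    for ticker, entry in CONTEXT_CUES.items():
        name_found = bool(tokens & entry["names"])
        cue_count = len(tokens & entry["cues"])
        if name_found and cue_count >= min_cues:
            linked.append(ticker)
    return linked


company_story = "Tim Cook says Apple iPhone shipments rose"
recipe_story = "Cook the apple slices in butter"
company_links = link_entities(company_story)  # name plus two cues
recipe_links = link_entities(recipe_story)    # name plus only one cue
```

Requiring two cues rather than one is what keeps the recipe sentence from being linked to the company in this toy setup, since "cook" alone is ambiguous.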

Before fitting a model, though, a human being must label the stories in the training set, the portion of labeled data used to teach a model. An algorithm learns the relationships in this sample data, and the trained classifier is then evaluated on the remaining observations. Using high-quality, labeled training data improves the chances that the model will find a pattern that repeats itself.

To label the text accurately, human experts curate the training set and assign a sentiment score to each story from the perspective of a long-term investor in the company. They focus solely on the text rather than the outcome, so the scores don’t reflect any subsequent price movement. Once models are in development, further testing can be used to check the accuracy of the manual classification.

Finding a signal in the noise

From here, the labeled data can be fed into a machine-learning model such as a Support Vector Machine (SVM) that determines whether the story belongs to a specific class.

An SVM training algorithm classifies the text into two groups with different features. When dealing with sentiment, where stories are scored positive, negative or neutral, a more nuanced approach is required. Verma observes that Bloomberg applies multiple SVMs and pairwise classification to convert a multi-class problem like sentiment into a series of two-class problems.
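The pairwise decomposition can be sketched in a few lines. The code below enumerates the two-class problems for three sentiment labels and combines the winners by majority vote. The function names and the rule of returning `None` on a tie are assumptions made for this illustration, not a description of any particular production system.

```python
from collections import Counter
from itertools import combinations

CLASSES = ("positive", "neutral", "negative")


def pairwise_problems(classes=CLASSES):
    """List every two-class problem needed to cover a multi-class task."""
    return list(combinations(classes, 2))


def vote(pair_winners):
    """Combine binary results by majority vote.

    `pair_winners` maps each (class_a, class_b) pair to the label that the
    binary classifier for that pair chose. Returns None when no class wins
    outright, i.e. the story has no clear sentiment.
    """
    ranked = Counter(pair_winners.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


pairs = pairwise_problems()

clear_story = {
    ("positive", "neutral"): "positive",
    ("positive", "negative"): "positive",
    ("neutral", "negative"): "neutral",
}
ambiguous_story = {
    ("positive", "neutral"): "positive",
    ("positive", "negative"): "negative",
    ("neutral", "negative"): "neutral",
}
clear_label = vote(clear_story)
ambiguous_label = vote(ambiguous_story)
```

With three classes there are three pairs, so a story can end in a three-way tie; those are the cases that need a further tie-breaking step.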

Each SVM operates in a high-dimensional space and follows the bag-of-words framework, in which each story is represented by counts of words drawn from a catalog of terms related to finance and investing. That way the training algorithm can discover an optimal separator between each pair of classes: positive-neutral, positive-negative or negative-neutral.
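To make the bag-of-words idea concrete, here is a minimal sketch that turns text into word-count vectors and fits a linear separator for one class pair (positive versus negative). The six-word vocabulary, the four labeled sentences and the hyperparameters are invented for the example, and the trainer is plain sub-gradient descent on the regularized hinge loss, a simple stand-in for a production SVM solver.

```python
import re
from collections import Counter

# Hypothetical six-word finance vocabulary; a real catalog would be far larger.
VOCAB = ["beat", "growth", "upgrade", "miss", "lawsuit", "downgrade"]


def bag_of_words(text, vocab=VOCAB):
    """Represent text as a vector of counts, one entry per vocabulary word."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocab]


def decision(w, b, x):
    """Signed score of x relative to the separating hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Fit a linear SVM by sub-gradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1:
                # Inside the margin or misclassified: step toward the example.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with margin: apply only the regularization shrinkage.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


# Toy labeled stories (+1 positive, -1 negative), invented for illustration.
docs = [
    ("Earnings beat and strong growth", +1),
    ("Analyst upgrade on growth", +1),
    ("Earnings miss and lawsuit", -1),
    ("Downgrade after lawsuit", -1),
]
X = [bag_of_words(text) for text, _ in docs]
y = [label for _, label in docs]
w, b = train_linear_svm(X, y)

new_positive = decision(w, b, bag_of_words("Upgrade and beat"))
new_negative = decision(w, b, bag_of_words("Lawsuit follows miss"))
```

The sign of `decision` gives the predicted side of the separator, and its magnitude is a rough measure of how far the story sits from the boundary.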

“The results of the three binary classifiers are fed into a new machine-learning model like K-Nearest Neighbors (KNN) to classify stories without a clear sentiment into one of the three classes,” says Verma. KNN analyzes and categorizes stories in real time based on cases from training data found in the neighborhood of the target story.
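A minimal sketch of that last step follows, assuming each story is summarized by the three scores produced by the binary classifiers. The score vectors, labels and the choice of `k` are hypothetical values picked for the example.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """Label `query` by majority vote among its k nearest training examples.

    `examples` is a list of (vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training cases: one score from each of three binary classifiers.
training_cases = [
    ((0.9, 0.8, -0.1), "positive"),
    ((0.7, 0.9, 0.0), "positive"),
    ((-0.6, 0.1, -0.7), "neutral"),
    ((-0.8, -0.1, -0.5), "neutral"),
    ((0.1, -0.9, 0.8), "negative"),
    ((-0.1, -0.7, 0.9), "negative"),
]

unclear_story = (0.2, -0.6, 0.5)
label = knn_classify(unclear_story, training_cases, k=3)
```

Here the two closest training cases are both negative, so the story is assigned to the negative class even though a third, more distant neighbor is neutral.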

To check whether the machine learning model is performing well, the next step is to construct a confusion matrix that maps predicted classifications against actual classes. The correct predictions fall on the diagonal, while misclassifications sit in the off-diagonal entries. The matrix tests not only the algorithms but also the human experts who labeled the initial data, providing a starting point for making improvements.
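A confusion matrix is straightforward to build by hand. In the sketch below, rows are the actual classes and columns the predicted ones; the six labeled stories are made up for illustration.

```python
LABELS = ["positive", "neutral", "negative"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Count stories by (actual class, predicted class); rows are actual."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of stories on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for six stories.
actual = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
predicted = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]

matrix = confusion_matrix(actual, predicted)
overall = accuracy(matrix)
```

In this toy case four of six stories land on the diagonal. The off-diagonal cells show which confusions occur, such as a negative story predicted as positive, and that is usually more informative than the single accuracy figure.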

Of course, fixing every problem can lead to overfitting. The classic way to handle errors without overfitting is to divide the data set into three groups: a training set, a validation set and a test set. When improvements on the training set are no longer matched on the validation set, it’s a strong signal to stop fine-tuning the model; the test set is held back for a final, unbiased check.
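One way to sketch this discipline is shown below, assuming the held-out validation scores decide when to stop tuning and the test set is kept aside for a final check. The split proportions, the `patience` parameter and the list of scores are hypothetical values chosen for the example.

```python
import random


def three_way_split(items, train=0.6, validation=0.2, seed=0):
    """Shuffle and split items into training, validation and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def stopping_round(validation_scores, patience=2):
    """Return the tuning round to keep.

    Scans the validation scores in order and stops once `patience` rounds
    have passed without beating the best score seen so far.
    """
    best_round, best_score = 0, validation_scores[0]
    for i, score in enumerate(validation_scores[1:], start=1):
        if score > best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            break
    return best_round


train_set, validation_set, test_set = three_way_split(range(10))

# Hypothetical validation accuracy after each round of fine-tuning.
scores = [0.60, 0.66, 0.70, 0.69, 0.68, 0.75]
keep = stopping_round(scores, patience=2)
```

With a patience of two rounds, tuning stops after the scores at rounds 3 and 4 fail to beat round 2, so the later value of 0.75 is never reached. That is the trade-off: a stricter stopping rule guards against chasing noise but can also end tuning early.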

“In the end, we want the machine to help improve human behavior, and vice versa,” says Gautam Mitra, founder and CEO of OptiRisk Systems, who spoke at a recent webinar on social and news data.

Long-term performance from sentiment

When the model and data are in good shape, they can combine to be a powerful tool for predicting price movement. Naturally, positive information about a company or industry might translate to greater buying activity, whereas negative press may precede a sell-off.

During a recent webinar, Verma demonstrated the benefit of trading sentiment with three different long-short strategies: long the top 1/3 of stocks and short the bottom 1/3; long the top 5% and short the bottom 5%; and a proportional portfolio of long and short positions bounded by the mean.

Each strategy ranks stocks by daily sentiment before the market opens and exits its positions at the end of the trading day. The results exhibit a strong synergy between news and Twitter data, with the combination outperforming any individual factor over a 1-year period.
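The portfolio construction behind such strategies can be sketched as follows. The tickers and sentiment scores are invented, and the reading of the "proportional" strategy (weights proportional to each stock's sentiment minus the cross-sectional mean) is an assumption made for this example rather than a description of the method shown in the webinar.

```python
def quantile_long_short(scores, fraction):
    """Equal-weight long the top `fraction` of stocks and short the bottom `fraction`.

    `scores` maps ticker to daily sentiment. Assumes fraction <= 0.5 and at
    least two stocks, so the long and short baskets never overlap.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    weights = {ticker: 0.0 for ticker in ranked}
    for ticker in ranked[:n]:
        weights[ticker] = 1.0 / n
    for ticker in ranked[-n:]:
        weights[ticker] = -1.0 / n
    return weights


def proportional_long_short(scores):
    """Weight each stock by its sentiment relative to the cross-sectional mean."""
    mean = sum(scores.values()) / len(scores)
    demeaned = {ticker: s - mean for ticker, s in scores.items()}
    gross_long = sum(d for d in demeaned.values() if d > 0)
    if gross_long == 0:
        return {ticker: 0.0 for ticker in scores}
    # Deviations from the mean sum to zero, so longs total +1 and shorts -1.
    return {ticker: d / gross_long for ticker, d in demeaned.items()}


# Hypothetical pre-open sentiment scores for six tickers.
sentiment = {"A": 0.9, "B": 0.5, "C": 0.1, "D": -0.2, "E": -0.4, "F": -0.9}

top_third = quantile_long_short(sentiment, 1 / 3)
top_five_pct = quantile_long_short(sentiment, 0.05)
proportional = proportional_long_short(sentiment)
```

With only six tickers the 5% strategy collapses to a single long and a single short position; in a realistic universe of several hundred stocks it would hold a small basket on each side.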

This article was written for the Bloomberg Professional Services blog.