Part 2 – Anomaly Detection on Twitter Sentiment Stream
This post resumes the research our Senior Data Scientist, Ilana Lichtenstein, has been undertaking – this is part 2 of her series.
PART 2 – ANOMALY DETECTION ON TWITTER SENTIMENT STREAM FOLLOWING MICROSOFT SQL SERVER 2016 ANNOUNCEMENT
In Part 1, I introduced the topic of anomaly detection in a streaming analytics context. In this part, I will briefly describe three simple anomaly detection tools and apply them to find unusual single records within a stream. First, let’s dig into the dataset we will be using.
MICROSOFT SQL SERVER ANNOUNCEMENT – TWEETING NOW…
On 7 March 2016, Microsoft announced that the new release of SQL Server 2016 will have various improved features, including an in-memory database, improved encryption and security, and predictive analytics, and that it will be compatible with enterprise Linux distributions.
STEP 1: GET THE TWITTER STREAM
We used the streamR package to stream tweets matching the keyword “Microsoft SQL Server” for a one-hour period on 8 March 2016. It is worth noting that the API returns only a limited subset of matching tweets, but since this is a freebie use-case, we will work with what we have access to for now.
STEP 2: CREATE A SENTIMENT STREAM FROM THE DATA
The next step in our use-case is to perform sentiment analysis using Timothy Jurka’s sentiment package (no longer hosted on CRAN). Our approach follows the one outlined in Mining Twitter with R – Sentiment Analysis with sentiment (https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment). We then plotted the sentiment score of each tweet against its time stamp.
R PACKAGES & METHODS OVERVIEW
We explored the following anomaly detection packages in R.
The tsoutliers function in the forecast package identifies single-point additive outliers. It first identifies and removes any seasonal component, then smooths the seasonally adjusted data with the supsmu function. Anomalies are those residuals that lie more than 3*IQR below the 1st quartile or above the 3rd quartile of the set of residuals.
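The final step, flagging residuals that fall far outside the quartiles, can be sketched in a few lines. This is a minimal, library-free illustration in Python rather than the forecast package's own code: the function name iqr_outliers and the residual values are made up for the example, and the seasonal adjustment and smoothing steps are assumed to have been done already.

```python
import statistics

def iqr_outliers(residuals, k=3.0):
    """Return indices of residuals lying more than k*IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, r in enumerate(residuals) if r < lower or r > upper]

# Synthetic residuals (illustrative only, not the tweet data).
residuals = [0.1, -0.2, 0.0, 0.3, -0.1, 5.0, 0.2, -0.3, 0.1, 0.0]
print(iqr_outliers(residuals))  # prints [5]
```

Note that the threshold k is fixed at 3 in this rule; a smaller k would flag more points.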
AnomalyDetectionVec, from Twitter’s AnomalyDetection package, also detects additive outliers, using the S-H-ESD (Seasonal Hybrid Extreme Studentized Deviate) method developed at Twitter, which is an adaptation of the generalised ESD test. S-H-ESD is similar to tsoutliers, but it employs robust statistics to smooth the seasonally adjusted data and uses the generalised ESD test to identify anomalies among the residuals. AnomalyDetectionTs is an alternative method in the same package, which we do not have time to cover here. For more details, see:
- Generalised ESD, see: http://www.real-statistics.com/students-t-distribution/identifying-outliers-using-t-distribution/generalized-extreme-studentized-deviate-test
- Twitter’s S-H-ESD see: http://www.slideshare.net/arunkejariwal/statistical-learning-based-anomaly-detection-twitter
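To give a feel for why the robust variant matters, here is a rough Python sketch of the iterative part of a generalised ESD test: repeatedly take the point furthest from the centre, record its studentised deviation, and remove it. With robust=True the mean and standard deviation are swapped for the median and the median absolute deviation (scaled by 1.4826, the usual normal-consistency constant, which is an assumption here). The function name and data are hypothetical, and the comparison of each statistic against a t-distribution critical value is deliberately left out, so this is not a complete test.

```python
import statistics

def esd_candidates(values, k, robust=False):
    """Return up to k (original_index, statistic) pairs, most extreme first.

    The critical-value comparison of the full generalised ESD test is omitted.
    """
    remaining = list(enumerate(values))
    picked = []
    for _ in range(k):
        if len(remaining) < 3:
            break
        data = [v for _, v in remaining]
        if robust:
            centre = statistics.median(data)
            # Scaled median absolute deviation as the robust spread estimate.
            spread = 1.4826 * statistics.median(abs(v - centre) for v in data)
        else:
            centre = statistics.mean(data)
            spread = statistics.stdev(data)
        if spread == 0:
            break
        # Locate the remaining point that deviates most from the centre.
        pos, (idx, val) = max(
            enumerate(remaining), key=lambda item: abs(item[1][1] - centre)
        )
        picked.append((idx, abs(val - centre) / spread))
        del remaining[pos]
    return picked

# Synthetic scores with two extreme values (illustrative only).
values = [10, 11, 9, 10, 12, 10, 11, 40, 10, 9, 38, 10]
classical = esd_candidates(values, 2)
robust = esd_candidates(values, 2, robust=True)
print([i for i, _ in classical], [round(s, 2) for _, s in classical])
print([i for i, _ in robust], [round(s, 2) for _, s in robust])
```

Both versions pick the same two points in this toy series, but the robust statistics are far larger, because the extreme values inflate the mean and standard deviation while barely moving the median and MAD.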
The tsoutliers package (not to be confused with the tsoutliers method above) can find additive outliers as well as other types of outliers, such as innovation outliers, level shifts, temporary changes and seasonal level shifts. More details can be found at http://www.jalobe.com/doc/tsoutliers.pdf.
USE CASE 1 – UNUSUAL SINGLE TWEET
We start by looking at the original sentiment vector where each entry in the vector represents an individual tweet.
Figure 1: plot of sentiment score against timestamp. Figure 2: anomalies detected by AnomalyDetectionVec, with parameters: period=15, max_anoms=0.01.
METHOD 1: FORECAST::TSOUTLIERS
tsoutliers flagged 154 indices (roughly 23% of all indices) as outliers.
METHOD 2: ANOMALYDETECTION::ANOMALYDETECTIONVEC
AnomalyDetectionVec relies on the generalised ESD method (described briefly above), which assumes the inliers are approximately normally distributed. A QQ-plot of the inliers obtained by applying the tsoutliers method above shows this assumption is reasonable for our data (figure not included). We ran AnomalyDetectionVec with the results limited to 1% of data points as anomalies (6 anomalies); the results are shown in Figure 2. AnomalyDetectionVec requires us to specify a period, which we set somewhat arbitrarily to 15, since the data is not seasonal. Looking up the indices reported as anomalies, the positive tweets contain text like:
Microsoft SQL Server 2016 will support Linux; astoundingly sensible! http://blogs.microsoft.com/blog/2016/03/07/announcing-sql-server-on-linux/ …
Microsoft determines Linux has added enough features to support its advanced SQL platform, plans to port it over https://blogs.microsoft.com/blog/2016/03/07/announcing-sql-server-on-linux/ …
Amazing, given how Microsoft once viewed Linux! Times sure have changed.... https://twitter.com/buckwoodymsft/status/706941622395801601 …
It's a new world! #Microsoft will bring #SQLserver to #Linux! http://blogs.microsoft.com/blog/2016/03/07/announcing-sql-server-on-linux/ …
Note: the tweets shown above are the original tweets, so their time stamps differ from those in our sample. The tweets in our sample were a mix of original tweets and retweets captured within the one-hour collection period.
We also applied a generalised ESD test to the data (rosnerTest from the EnvStats package), which finds 3 anomalies, 2 of which are among the 6 reported by AnomalyDetectionVec.
USE CASE 2 – BINNING DATA
Here we consider the average sentiment in each minute of streamed data. We summarise each bin with the mean rather than the median, since we want the bin score to remain sensitive to outlying tweets within the bin.
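The effect of that choice is easy to see on a toy example. The Python sketch below groups (time stamp, score) pairs into one-minute bins and reports both the mean and the median for each bin. The function name bin_by_minute and the scores are invented for illustration and are not taken from our tweet sample.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

def bin_by_minute(records):
    """Group (timestamp, score) pairs into one-minute bins.

    Returns a list of (minute, mean_score, median_score), ordered by minute.
    """
    bins = defaultdict(list)
    for stamp, score in records:
        # Truncate each time stamp to the start of its minute.
        bins[stamp.replace(second=0, microsecond=0)].append(score)
    return [(minute, mean(scores), median(scores))
            for minute, scores in sorted(bins.items())]

# Synthetic records (illustrative only): one strongly positive tweet at 08:23.
records = [
    (datetime(2016, 3, 8, 8, 22, 5), 1.0),
    (datetime(2016, 3, 8, 8, 22, 40), 2.0),
    (datetime(2016, 3, 8, 8, 22, 55), 3.0),
    (datetime(2016, 3, 8, 8, 23, 10), 1.0),
    (datetime(2016, 3, 8, 8, 23, 30), 2.0),
    (datetime(2016, 3, 8, 8, 23, 50), 12.0),
]
binned = bin_by_minute(records)
for minute, avg, med in binned:
    print(minute.strftime("%H:%M"), avg, med)
# prints:
# 08:22 2.0 2.0
# 08:23 5.0 2.0
```

The single extreme tweet lifts the mean of its bin from 2.0 to 5.0 while the median stays at 2.0, which is why the mean is the more useful summary when the goal is to surface unusual minutes.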
Figure 3: plot of binned sentiment score against time stamp. Figure 4: anomalies detected by AnomalyDetectionVec, with parameters: period=15, max_anoms=0.05.
METHOD 1: FORECAST::TSOUTLIERS
This method finds only one outlier in the data, at index 8 (time stamp “8:23”).
METHOD 2: ANOMALYDETECTION::ANOMALYDETECTIONVEC
AnomalyDetectionVec (this time allowing up to 5% of the data points to be detected as anomalies) locates indices 8 and 18 as outliers: see Figure 4. Running EnvStats::rosnerTest also finds exactly these two indices as anomalies.
tsoutliers may not find global anomalies that lie in a region with a temporary level change (where a few consecutive data points are, on average, a bit higher or lower than points outside the region). The fitted model incorporates the temporary change, so the residuals calculated for the region may not be large in magnitude and hence will not be flagged as anomalies. This is the case for index 18. In contrast, AnomalyDetectionVec normalises by subtracting the stream’s median value from the seasonally adjusted data, which means temporary changes remain in the adjusted data and points within them can still be identified as anomalies.
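The difference between the two kinds of residual can be illustrated with a small Python sketch. It is only an analogy: a running median stands in for a fitted curve that follows the local level, and a single global median stands in for the normalisation described above. Neither function reproduces what the R packages do internally, and the series is synthetic.

```python
from statistics import median

def global_residuals(series):
    """Residuals after subtracting one median for the whole series."""
    centre = median(series)
    return [x - centre for x in series]

def local_residuals(series, half_window=2):
    """Residuals after subtracting a running median that tracks the local level."""
    out = []
    for i, x in enumerate(series):
        lo, hi = max(0, i - half_window), min(len(series), i + half_window + 1)
        out.append(x - median(series[lo:hi]))
    return out

# Synthetic series: a temporarily raised region (positions 5 to 9)
# containing one higher point at position 7.
series = [0.0, 0.1, -0.1, 0.0, 0.1, 1.0, 1.1, 1.6, 1.0, 0.9,
          0.0, -0.1, 0.1, 0.0, 0.1]
spike = 7
print(round(global_residuals(series)[spike], 2))
print(round(local_residuals(series)[spike], 2))
```

Measured against the global median, the point at position 7 deviates by about 1.5. Measured against its raised neighbourhood, it deviates by only about 0.6, so a rule applied to the local residuals is much less likely to flag it.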
METHOD 3: TSOUTLIERS::TSO
We ran tso with cval set to 2.6, which detected an additive outlier at index 8 and a temporary change at index 17 (Figure 5). For comparison with the other methods, we reran the tso function restricting the type of outliers to additive only (Figure 6).
The restricted run detects index 18 as an additive outlier (as the AnomalyDetectionVec method did), as well as indices 17 and 49.
Figure 5: plot generated by tso when including all anomaly types. Figure 6: plot generated when restricting anomaly type to Additive Outliers only.
I was curious why index 18 was absent from the original tso results (see Figure 5). It may be because tso works by iteratively detecting outliers and removing their effects, until no more outliers are detected. Perhaps removal of the temporary change effect at index 17 modified the data around this index (including at index 18), so that no additive outliers were present in the adjusted data. Limiting the iterative function to only detect and remove additive outliers addresses this issue (see Figure 6).
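A toy Python example makes this conjecture concrete. A temporary change is modelled here as a jump that decays geometrically; the decay rate of 0.7, the jump size, the extra bump and the function names are all assumptions chosen for illustration, and the code does not reproduce tso's estimation procedure.

```python
def temporary_change_effect(length, start, weight, delta=0.7):
    """A jump of `weight` at `start` that decays by a factor `delta` per step."""
    return [weight * delta ** (t - start) if t >= start else 0.0
            for t in range(length)]

def remove_effect(series, effect):
    """Subtract an estimated outlier effect from the series."""
    return [x - e for x, e in zip(series, effect)]

# Synthetic series: a temporary change starting at position 17,
# plus a small extra bump at position 18 (illustrative only).
effect = temporary_change_effect(30, start=17, weight=2.0)
series = list(effect)
series[18] += 0.4

adjusted = remove_effect(series, effect)
print(round(series[18], 2), round(adjusted[18], 2))
```

Before the adjustment, position 18 sits about 1.8 above the baseline. Once the temporary change is removed, only the 0.4 bump is left, which could easily fall below the detection threshold. When the search is limited to additive outliers, no such adjustment is made and the point keeps its full height.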
So, which anomaly detection algorithm should you use on your data stream? Well, as you would expect me to do, I will sit on the fence and suggest that this really depends on your data, and the types of anomalies you want to find in it.
tsoutliers is a good choice for straightforward additive outlier detection. However, global anomalies that occur in a region with local effects may be missed. There is no input parameter in the function signature for setting the threshold.
AnomalyDetectionVec is a robust method that also detects additive outliers, regardless of local effects. Although not discussed here, it is worth noting that AnomalyDetectionVec assumes no linear trend in the data. If your data does contain a linear trend, you can remove it yourself or use the longterm_period parameter of AnomalyDetectionVec (see the package help files and http://probdist.com/twitters-anomaly-detection-package). AnomalyDetectionVec also requires you to know the periodicity of your data. If your data does not have a seasonal component, consider using a generalised ESD method instead.
tso is great for finding different types of outliers. As residuals are calculated from a fitted ARIMA model (or structural time series model), trend is taken care of, so the user doesn’t need to worry about removing it first. Be clear in advance about which types of outliers you wish to detect, as including some types might limit the ability to identify others.