Anomaly Detection on Big Data Streams
This series on streaming anomaly detection presents a detailed study into a Forefront Proof of Concept (PoC). Written by our Senior Data Scientist, Ilana Lichtenstein, this series explores some of the terminology, options we reviewed, some of the science, the path we set down, and our final results in streaming anomaly detection.
Part 1 – Introduction to Real-Time Anomaly and Change Point Detection in Big Data Streams
April 13, 2016
By Ilana Lichtenstein
So, you’ve jumped on the IoT bandwagon. You are streaming data left, right and centre. In fact, you may be streaming data from thousands of different data sources at once, be they ISP servers, sensors on manufacturing conveyor belts, transactional systems or people. The big question is: what do we do with all that live incoming data? That’s going to depend on why you collected the data in the first place. You do know, right? One thing we do know is that while companies are scrambling to collect “big data”, they ultimately need to get some value and insight from this investment. This blog looks at one type of stream mining analysis in strong demand by companies today – anomaly and change-point detection.
ANOMALY AND CHANGE-POINT DETECTION
Figure 1 – Cookie Monster and an appropriate anomaly detection test
Companies aren’t interested in every single record in a data stream. However, a company will probably be interested when something unusual occurs.
For example, a technology company may need to identify unusual traffic or performance measurements through one of its servers. A breakdown of this use case can be found over at Nikolay Laptev’s blog. Or a manufacturing plant manager may wish to promptly identify a faulty piece of equipment in need of near-term repair, otherwise known as preventative maintenance. As a final example, an organisation may wish to identify unusual public responses to a product launch or press release. By analysing a Twitter stream, they can identify unusual tweets, taking into account text and meta-data (tweet time and geographic location, number of followers and retweets).
While impacting a variety of different businesses in very different ways, all the above use-cases have three things in common:
- They involve streaming large amounts of data in real time (as the events are generated)
- They involve looking for unusual events in the data stream
- They involve finding, reporting and acting on those unusual events quickly
This multi-part series looks at a few different approaches to finding unusual events and changes in a live data stream. In this first part, I’ll provide an overview of some approaches to this problem. Parts 2-4 will look at some use-cases, providing code and data so that readers can repeat the exercises. Since I am a bit of an R junkie, the use-cases will be carried out using R packages. Part 2 will look at finding single point anomalies in a stream of sentiment data. Part 3 will look at detecting change points within a stream. Part 4 will look at some adaptive approaches to anomaly detection, designed more specifically with streaming data in mind. Finally, Part 5 will look at the development of a SAPUI5 app that seamlessly coordinates streaming sensor data arriving at our Smart Data Streaming server (SAP HANA SPS10) while performing real-time anomaly detection using a popular R package. So, let’s get on with it.
THE “STUFF OF STREAMS”…
Hey, I liked the pun! A data stream has been formally described as “an ordered and potentially unbounded sequence of objects”. Informally, streaming data is basically data that is transferred record by record from one or more sources to a server or other device in real-time. Typically, a record has a time-stamp and at least one value. We can describe the values next to the time-stamp as fields or dimensions.
WHAT IS AN ANOMALY OR A CHANGE-POINT ANYWAY?
Good question. An anomaly, for our purposes, is an unusual single occurrence in a sequence of events. We can consider a time-series setting. The time-series is an ordered set of records, and an anomaly is a record with a value that is significantly higher or lower than its expected value, given the fitted time-series model (see Figure 2). We could also consider a set of multiple time-series, such as the log files showing traffic volumes passing through many servers. In this case, an anomaly is a single time-series showing significantly different properties to the other time-series in the set.
On the other hand, a change-point is a point in the data stream where a more lasting change occurs in the characteristics or parameters describing the stream. Other terms such as concept drift or level shift are also used to describe such behaviour. Different tools and packages available for change-point analysis use different terminology, define the changes differently and use different methods to identify them, but ultimately change-point analysis is about finding sustained change in the process we are measuring through our data stream. For an introduction to these topics, request a copy of the following publication through ResearchGate.
Figure 2 – A visualisation of a time-series showing individual points or records that are anomalies
STREAM PROCESSING APPROACHES
Traditional data mining techniques for classification and clustering of data have largely been developed with batch data in mind, and these methods do not always perform well in a data streaming environment. This is driving the development of adaptive stream mining algorithms.
WRAPPER (OR MICRO-BATCHING) VERSUS ADAPTIVE APPROACHES
A wrapper approach involves processing a fixed number of records at a time in a window using conventional machine learning methods. Either a sliding window (taking overlapping windows) or jumping window (records are processed in a single window, only once) approach can be used. In adaptive approaches, any models or parameter set used to describe the stream are updated with each new record as it arrives. Typically, such approaches have a mechanism for weighting the most recent records and gradually forgetting older records, discounting their influence. For more on evolving prediction models see relevant chapters in “Outlier Detection for Temporal Data: A Survey”. In this series of blog posts, most of the R packages we discuss would rely on utilising a wrapper approach in a streaming environment.
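To make the distinction concrete, here is a minimal sketch of the two windowing styles and of one simple adaptive update. It is written in Python purely as a language-neutral illustration, not as the code behind any of the R packages in this series. The function names (`jumping_windows`, `sliding_windows`, `forgetting_mean`) and the `alpha` forgetting factor are my own illustrative choices, and the exponentially weighted mean is just one common way of discounting older records.

```python
from collections import deque
from typing import Iterable, Iterator, List


def jumping_windows(stream: Iterable[float], size: int) -> Iterator[List[float]]:
    """Non-overlapping windows: every record is processed exactly once."""
    window: List[float] = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield window
            window = []  # a trailing partial window is never emitted


def sliding_windows(stream: Iterable[float], size: int, step: int = 1) -> Iterator[List[float]]:
    """Overlapping windows of `size` records, advancing `step` records at a time."""
    window: deque = deque(maxlen=size)  # the oldest record drops out automatically
    for count, record in enumerate(stream, start=1):
        window.append(record)
        if count >= size and (count - size) % step == 0:
            yield list(window)


def forgetting_mean(stream: Iterable[float], alpha: float) -> Iterator[float]:
    """Adaptive estimate updated on every record (hypothetical example).

    Each new record gets weight `alpha`; the influence of older records
    shrinks by a factor of (1 - alpha) with every update.
    """
    estimate = None
    for record in stream:
        if estimate is None:
            estimate = record
        else:
            estimate = alpha * record + (1 - alpha) * estimate
        yield estimate
```

In a wrapper approach, each emitted window would be handed to a conventional batch method; in the adaptive version there is no window at all, only a running estimate.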
Any machine learning method applied in a streaming environment must be able to deal with concept change within the stream. A concept generally refers to a model specifying the value of a target variable. When the statistical properties of this model change over time (e.g. the expected value of a variable, relationship to another variable, or variance), we say there is concept change.
We discussed above that sometimes the aim is to identify this change through change-point analysis. Sometimes, however, this is not the primary goal of the machine learning algorithm. For example, in the case of anomaly detection, the algorithm must update variables to reflect the new concept and continue with the task of finding anomalies. In general, wrapper approaches must ensure the window processed is sufficiently large to allow accurate estimation of parameters (such as trend and seasonal coefficients in an ARIMA model), but also sufficiently small that the processing can still be done in real-time and the approach can deal with concept drift within the series.
SOME ADDITIONAL TECHNICAL ISSUES
When considering which package to use in a streaming environment (R or otherwise), ultimately users have to consider how the algorithm will sit within the overall stream mining architecture. Specifically,
- Memory and caching: which records (rows) and fields (columns) from the data stream must be retained and for how long?
- Latency and speed: will the algorithm be able to access and process the incoming data efficiently so that, if needed:
  - the model is rapidly adapted,
  - model features are quickly passed to other processes in the pipeline (no bottleneck),
  - computing resources are freed up,
  - users are quickly alerted to the anomalies?
- Accuracy: The algorithm should be evaluated for sensitivity and specificity. Evaluation is different in batch and streaming environments.
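As a reminder of what the accuracy item measures, here is a small Python sketch that computes sensitivity and specificity from labelled records. It assumes you already have ground-truth anomaly labels, which is often not the case in practice. The function name and the toy labels in the usage note are made up for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(actual: Sequence[bool], flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for a detector's flags.

    sensitivity = true anomalies that were flagged / all true anomalies
    specificity = normal records left unflagged   / all normal records
    """
    if len(actual) != len(flagged):
        raise ValueError("actual and flagged must have the same length")
    tp = sum(1 for a, f in zip(actual, flagged) if a and f)
    fn = sum(1 for a, f in zip(actual, flagged) if a and not f)
    tn = sum(1 for a, f in zip(actual, flagged) if not a and not f)
    fp = sum(1 for a, f in zip(actual, flagged) if not a and f)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one anomaly and one normal record")
    return tp / (tp + fn), tn / (tn + fp)
```

For example, with true labels `[0, 0, 1, 0, 1, 0]` and detector flags `[0, 1, 1, 0, 0, 0]`, one of the two anomalies is caught and three of the four normal records are left alone.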
More to come – enjoy the series!
Part 2 – Anomaly Detection on Twitter Sentiment Stream following Microsoft SQL Server 2016 Announcement
This post resumes the research our Senior Data Scientist, Ilana Lichtenstein, has been undertaking – this is part 2 of her series.
May 23, 2016
By Ilana Lichtenstein
In Part 1, I introduced the topic of anomaly detection in a streaming analytics context. In this part, I will briefly describe and apply three simple anomaly detection tools to find unusual single records within a stream. First, let’s dig into the dataset we will be using.
MICROSOFT SQL SERVER ANNOUNCEMENT – TWEETING NOW…
Microsoft announced on 7 March 2016 that the new release of SQL Server 2016 will have various improved features, including an in-memory database, improved encryption and security, and predictive analytics, and that it will be compatible with enterprise Linux distributions.
STEP 1: GET THE TWITTER STREAM
We used the streamR package to stream tweets for a one-hour period on 8 March 2016 with the keyword “Microsoft SQL Server”. It is worth noting that the API returns only a limited subset of matched tweets. But since we are doing a freebie use-case, we will work with what we have access to for now.
STEP 2: CREATE A SENTIMENT STREAM FROM THE DATA
The next step in our use-case is to perform sentiment analysis using Timothy Jurka’s sentiment package (no longer hosted on CRAN). Our sentiment analysis approach follows that outlined in Mining Twitter with R – Sentiment Analysis with sentiment (https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment). We plotted the resulting sentiment score for each tweet against its time-stamp.
R PACKAGES & METHODS OVERVIEW
We explored the following anomaly detection packages in R.
The tsoutliers function in the forecast package identifies single point additive outliers. It first identifies and removes seasonal components and then smooths the seasonally adjusted data with the supsmu function. Anomalies are identified as those residuals that lie more than 3*IQR below the 1st quartile or above the 3rd quartile of the set of residuals.
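The quartile rule is easy to sketch. The Python snippet below applies a 3*IQR fence to a list of residuals that are assumed to have already been seasonally adjusted and smoothed; it does not reproduce those earlier steps, and it is not the forecast package’s code. The function name `iqr_outliers` and the multiplier argument `k` are illustrative. Note that the returned indices are 0-based, unlike R’s 1-based indices.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(residuals: Sequence[float], k: float = 3.0) -> List[int]:
    """Return 0-based indices of residuals outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates between order statistics,
    # treating the data as including its own minimum and maximum.
    q1, _, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, r in enumerate(residuals) if r < lower or r > upper]
```

With the made-up residuals `[0.1, -0.2, 0.0, 0.3, -0.1, 5.0, 0.2, -0.3]`, only the value 5.0 falls outside the fence.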
The AnomalyDetectionVec method in the AnomalyDetection package also detects additive outliers, using the S-H-ESD (Seasonal-Hybrid Extreme Studentized Deviate) test developed within Twitter. It is an adaptation of the generalised ESD method. The S-H-ESD method is similar to tsoutliers but employs robust statistics to smooth the seasonally-adjusted data, and uses the generalised ESD method to identify anomalies among the residuals. AnomalyDetectionTs is an alternative method in the same package, which we do not have time to cover here. For more details, see:
- Generalised ESD, see: http://www.real-statistics.com/students-t-distribution/identifying-outliers-using-t-distribution/generalized-extreme-studentized-deviate-test
- Twitter’s S-H-ESD see: http://www.slideshare.net/arunkejariwal/statistical-learning-based-anomaly-detection-twitter
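For a feel of how the ESD family works, the Python sketch below computes only the sequence of test statistics: at each step it finds the most extreme remaining point, records how many spread units it sits from the centre, and removes it. The comparison of each statistic against a critical value from the t-distribution, which decides how many candidates are actually declared anomalies, is deliberately left out. The `robust` switch (median and scaled MAD instead of mean and standard deviation) is my reading of the “hybrid” idea, and the 1.4826 scaling constant is the usual normal-consistency factor. All names here are illustrative, and this is not the AnomalyDetection package’s code.

```python
import statistics
from typing import List, Sequence, Tuple


def esd_statistics(values: Sequence[float], max_anoms: int,
                   robust: bool = False) -> List[Tuple[int, float]]:
    """Return [(original 0-based index, test statistic)] for up to max_anoms steps."""
    remaining = list(enumerate(values))
    results: List[Tuple[int, float]] = []
    for _ in range(max_anoms):
        if len(remaining) < 3:
            break  # too few points left to estimate a spread
        data = [v for _, v in remaining]
        if robust:
            centre = statistics.median(data)
            spread = 1.4826 * statistics.median([abs(v - centre) for v in data])
        else:
            centre = statistics.mean(data)
            spread = statistics.stdev(data)
        if spread == 0:
            break  # all remaining points are (robustly) identical
        # Most extreme remaining point, measured from the current centre.
        pos = max(range(len(data)), key=lambda j: abs(data[j] - centre))
        index, value = remaining.pop(pos)
        results.append((index, abs(value - centre) / spread))
    return results
```

Because a large outlier inflates the mean and standard deviation but barely moves the median and MAD, the robust statistic for an obvious spike is typically much larger than the classical one.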
The tso function in the tsoutliers package (not to be confused with the tsoutliers method above) is able to find additive outliers as well as other types of outliers, such as innovational outliers, level shifts, temporary changes and seasonal level shifts. More details can be found at http://www.jalobe.com/doc/tsoutliers.pdf.
USE CASE 1 – UNUSUAL SINGLE TWEET
We start by looking at the original sentiment vector where each entry in the vector represents an individual tweet.
Figure 1: plot of sentiment score against timestamp.
Figure 2: anomalies detected by AnomalyDetectionVec, with parameters: period=15, max_anoms=0.01.
METHOD 1: FORECAST::TSOUTLIERS
tsoutliers returned 154 indexes (roughly 23% of all indexes) as outliers.
METHOD 2: ANOMALYDETECTION::ANOMALYDETECTIONVEC
AnomalyDetectionVec relies on the generalised ESD method (described briefly above), which assumes the inliers are approximately normally distributed. A QQ-plot of the inliers, derived by applying the tsoutliers method above, shows this assumption is reasonable for our data (figure not included). We ran AnomalyDetectionVec, limiting the results to at most 1% of data points as anomalies (6 anomalies). Results are shown in Figure 2. AnomalyDetectionVec requires us to specify a period, which we set somewhat arbitrarily to 15, since the data is not seasonal. Looking up the indices reported to contain anomalies, positive tweets match text like:
Note: these are original tweets, thus their time stamps are incorrect. The tweets in our sample were a collection of original and re-tweets within the one hour period of collection.
We also applied a generalised ESD test to the data (rosnerTest from the EnvStats package), which finds 3 anomalies, 2 of which are among the 6 reported by AnomalyDetectionVec.
USE CASE 2 – BINNING DATA
Here we consider the average sentiment in each minute of streamed data. We summarise each bin with the mean rather than the median, since we want to remain sensitive to the scores of outliers within each bin.
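The binning step can be sketched as follows. This is a Python illustration, not the R code used for the analysis. It assumes each record is a `(timestamp, score)` pair, and the function name `bin_by_minute` and the `summarise` argument are hypothetical. Passing `median` instead of `mean` shows why the mean was preferred here: a single extreme score shifts the mean of its bin but can leave the median untouched.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median
from typing import Callable, Iterable, List, Tuple


def bin_by_minute(records: Iterable[Tuple[datetime, float]],
                  summarise: Callable[[List[float]], float] = mean
                  ) -> List[Tuple[datetime, float]]:
    """Group scores by the minute of their timestamp and summarise each bin."""
    bins = defaultdict(list)
    for timestamp, score in records:
        # Truncate to the start of the minute to get the bin key.
        bins[timestamp.replace(second=0, microsecond=0)].append(score)
    return [(minute, summarise(scores)) for minute, scores in sorted(bins.items())]


# Made-up scores: one extreme tweet in the first minute.
example = [
    (datetime(2016, 3, 8, 8, 23, 5), 0.0),
    (datetime(2016, 3, 8, 8, 23, 30), 0.0),
    (datetime(2016, 3, 8, 8, 23, 50), 9.0),
    (datetime(2016, 3, 8, 8, 24, 10), 1.0),
]
by_mean = bin_by_minute(example)
by_median = bin_by_minute(example, summarise=median)
```

In this toy example the first bin’s mean is pulled up by the extreme score, while its median stays at the typical value.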
Figure 3: plot of binned sentiment score against time stamp.
Figure 4: anomalies detected by AnomalyDetectionVec, with parameters: period=15, max_anoms=0.05.
METHOD 1: FORECAST::TSOUTLIERS
This method finds only one outlier in the data, at index 8 (at time stamp “8:23”).
METHOD 2: ANOMALYDETECTION::ANOMALYDETECTIONVEC
AnomalyDetectionVec (this time allowing up to 5% of the data points to be detected as anomalies) locates indices 8 and 18 as outliers: see Figure 4. Running EnvStats::rosnerTest also finds exactly these two indices as anomalies.
tsoutliers may not find global anomalies that lie in a region with a temporary level change (a few data points are on average a bit higher or lower than points outside the region). The fitted model incorporates the temporary change and so calculated residuals for the region may not be large in magnitude (hence will not be found to be anomalies). This is the case for index 18. In contrast, AnomalyDetectionVec does its normalisation by subtracting the stream’s median value from the seasonally-adjusted data, which means temporary changes remain in the adjusted data, and will be correctly identified as anomalies.
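A toy example helps show the effect. The Python sketch below uses a running median as a stand-in for a local smoother (it is not the supsmu fit that tsoutliers uses) and compares its residuals with those obtained by subtracting one global median. The series, the function name `running_median` and the window width are all made up for illustration.

```python
from statistics import median
from typing import List, Sequence


def running_median(values: Sequence[float], half_width: int) -> List[float]:
    """Local fit: median of the points within half_width positions of each index."""
    fitted = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        fitted.append(median(values[lo:hi]))
    return fitted


# Flat series with a temporary level change at indices 5-9 and a peak at index 7.
series = [0.0] * 5 + [2.0, 2.0, 2.5, 2.0, 2.0] + [0.0] * 5

local_fit = running_median(series, half_width=2)
local_residuals = [y - f for y, f in zip(series, local_fit)]

global_centre = median(series)
global_residuals = [y - global_centre for y in series]
```

At index 7 the local fit has absorbed the level change, so the local residual is small, whereas the residual against the global median is still large and would stand out.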
METHOD 3: TSOUTLIERS::TSO
We ran tso with cval set to 2.6, which detected an additive outlier at index 8 and a temporary change at index 17 (Figure 5). For comparison with the other methods, we reran the tso function restricting the type of outliers to additive only (Figure 6).
The restricted run detects index 18 as an additive outlier (as the AnomalyDetectionVec method did), as well as indices 17 and 49.
Figure 5: plot generated by tso when including all anomaly types.
Figure 6: plot generated when restricting anomaly type to Additive Outliers only.
I was curious why index 18 was absent from the original tso results (see Figure 5). It may be because tso works by iteratively detecting outliers and removing their effects, until no more outliers are detected. Perhaps removal of the temporary change effect at index 17 modified the data around this index (including at index 18), so that no additive outliers were present in the adjusted data. Limiting the iterative function to only detect and remove additive outliers addresses this issue (see Figure 6).
So, which anomaly detection algorithm should you use on your data stream? Well, as you would expect me to do, I will sit on the fence and suggest that this really depends on your data, and the types of anomalies you want to find in it.
tsoutliers is a good choice for straightforward additive outlier detection. However, global anomalies that occur in a region with local effects may be missed. There is no input parameter in the function signature for setting the threshold.
AnomalyDetectionVec is a robust method that also detects additive outliers regardless of local effects. Although not discussed here, it’s worth noting that AnomalyDetectionVec assumes no linear trend in the data. If your data does contain a linear trend, you can remove it yourself, or utilise the longterm_period parameter in the AnomalyDetectionVec method (see package help files and http://probdist.com/twitters-anomaly-detection-package). AnomalyDetectionVec also requires you to know the periodicity of your data. If your data does not have a seasonal component, consider using a generalised ESD method.
tso is great for finding different types of outliers. As residuals are calculated from a fitted ARIMA model (or structural time series model), trend is taken care of, so the user doesn’t need to worry about first removing it. Be clear in advance about which types of outliers you wish to detect, as inclusion of some types might limit the ability to identify other types.