Part 1 – Introduction to Real-Time Anomaly and Change Point Detection in Big Data Streams
So, you’ve jumped on the IoT bandwagon. You are streaming data left, right and centre. In fact, you may be streaming data from thousands of different data sources at once, be they ISP servers, sensors on manufacturing conveyor belts, transactional systems or people. The big question, is what do we do with all that live incoming data? That’s going to depend on why you collected the data in the first place. You do know, right? One thing we do know, is that while companies are scrambling to collect “big data”, they ultimately need to get some value and insight from this investment. This blog looks at one type of stream mining analysis in strong demand by companies today – anomaly and change-point detection.
ANOMALY AND CHANGE-POINT DETECTION
Figure 1 – Cookie Monster and an appropriate anomaly detection test
Companies aren’t interested in every single record in a data stream, however, a company will probably be interested when something unusual occurs.
For example, a technology company may need to identify unusual traffic or performance measurements through one of its servers. A break down of this use case can be found over at Nikolay Laptev’s blog. Or a manufacturing plant manager may wish to promptly identify a faulty piece of equipment in need of near-term repair, otherwise known as preventative maintenance. As a final example, an organisation may wish to identify unusual public responses to a product launch or press release. By analysing a Twitter stream, they can identify unusual tweets, taking into account text and meta-data (tweet time and geographic location, number of followers and retweets).
While impacting a variety of different businesses in very different ways, all the above use-cases have 3 things in common:
- They involve streaming large amounts of data in real time (as the events are generated)
- They involve looking for unusual events in the data stream
- They involve finding, reporting and acting on those unusual events quickly
This multi-part series looks at a few different approaches to finding unusual events and changes in a live data stream. In this first part, I’ll provide an overview of some approaches to this problem. Parts 2-4 will look at some use-cases providing code and data so the reader can repeat any exercises. Part 2 will look at finding single point anomalies in a stream of sentiment data. Part 3 will look at detecting change points within a stream. Since I am a bit of an R junkie, the use-cases will be carried out using R packages. Part 4 will look at some adaptive approaches to anomaly detection, designed more specifically with streaming data in mind. Finally, Part 5 will look at the development of a SAPUI5 app that seamlessly coordinates streaming sensor data arriving at our Smart Data Streaming server (SAP HANA SPS10) while performing real-time anomaly detection using a popular R package. So, let’s get on with it.
THE “STUFF OF STREAMS”…
Hey, I liked the pun! A data stream has been formally described as “an ordered and potentially unbounded sequence of objects”. Informally, streaming data is basically data that is transferred record by record from one or more sources to a server or other device in real-time. Typically, a record has a time-stamp and at least one value. We can describe the values next to the time-stamp as fields or dimensions.
WHAT IS AN ANOMALY OR A CHANGE-POINT ANYWAY?
Good question. An anomaly, for our purposes, is an unusual single occurrence in a sequence of events. We can consider a time-series setting. The time-series is an ordered set of records, and an anomaly is a record with a value that is significantly higher or lower than its expected value, given the fitted time-series model (see Figure 2). We could could also consider a set of multiple time-series such as the log files showing traffic volumes passing through many servers. In this case, an anomaly is a single time-series showing significantly different properties to other time-series in the set.
Figure 2 – A visualisation of a time-series showing individual points or records that are anomalies
On the other hand, a change-point is a point in the data stream where a more lasting change occurs in the characteristics or parameters describing the stream. Other terms such as concept drift or level shift are also used to describe such behaviour. Different tools and packages available for change-point analysis use different terminology, define the changes differently and use different methods to identify them, but ultimately change-point analysis is about finding sustained change in the process we are measuring through our data stream. For an introduction to these topics, request a copy of the following publication through ResearchGate.
STREAM PROCESSING APPROACHES
Traditional data mining techniques for classification and clustering of data have been largely developed with batch data in mind. But these methods do not always perform well in a data streaming environment. This is leading the development of adaptive stream mining algorithms.
WRAPPER (OR MICRO-BATCHING) VERSUS ADAPTIVE APPROACHES
A wrapper approach involves processing a fixed number of records at a time in a window using conventional machine learning methods. Either a sliding window (taking overlapping windows) or jumping window (records are processed in a single window, only once) approach can be used. In adaptive approaches, any models or parameter set used to describe the stream are updated with each new record as it arrives. Typically, such approaches have a mechanism for weighting the most recent records and gradually forgetting older records, discounting their influence. For more on evolving prediction models see relevant chapters in “Outlier Detection for Temporal Data: A Survey”. In this series of blog posts, most of the R packages we discuss would rely on utilising a wrapper approach in a streaming environment.
Any machine learning method applied in a streaming environment must be able to deal with concept change within the stream. A concept generally refers to a model specifying the value of a target variable. When the statistical properties of this model change over time (e.g. the expected value of a variable, relationship to another variable, or variance), we say there is concept change.
We discussed above that sometimes the aim is to identify this change through change-point analysis, however, sometimes this is not the primary goal of the machine learning algorithm. For example, in the case of anomaly detection, the algorithm must update variables to reflect the new concept and continue with the task of finding anomalies. In general, wrapper approaches must ensure the window processed is sufficiently large to allow accurate estimation of parameters (such as trend and seasonal coefficients in an ARIMA model), but also sufficiently small that the processing can still be done in real-time and the approach can deal with concept drift within the series.
SOME ADDITIONAL TECHNICAL ISSUES
When considering which package to use in a streaming environment (R or otherwise), ultimately users have to consider how the algorithm will sit within the overall stream mining architecture. Specifically,
- Memory and caching: which records (rows) and fields (columns) from the data stream must be retained and for how long?
- Latency and speed: Will the algorithm be able to access and process the incoming data efficiently so that, if needed:
- the model is rapidly adapted,
- model features are quickly passed to other processes in the pipeline (no bottleneck),
- computing resources freed up,
- users quickly alerted to the anomalies?
- Accuracy: The algorithm should be evaluated for sensitivity and specificity. Evaluation is different in batch and streaming environments.
More to come – enjoy the series!