ITS632_Chapter9PPT1.ppt

Data Mining
Anomaly Detection

Lecture Notes for Chapter 9

Introduction to Data Mining

by

Tan, Steinbach, Kumar

        © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *

        Anomaly/Outlier Detection

        • What are anomalies/outliers?
        • The set of data points that are considerably different than the remainder of the data
        • Variants of Anomaly/Outlier Detection Problems
        • Given a database D, find all the data points x  D with anomaly scores greater than some threshold t
        • Given a database D, find all the data points x  D having the top-n largest anomaly scores f(x)
        • Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
        • Applications:
        • Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection

        Importance of Anomaly Detection

        Ozone Depletion History

        • In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
        • Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
        • The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

        Anomaly Detection

        • Challenges
        • How many outliers are there in the data?
        • Method is unsupervised
        • Validation can be quite challenging (just like for clustering)
        • Finding needle in a haystack
        • Working assumption:
        • There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

        Anomaly Detection Schemes

        • General Steps
        • Build a profile of the “normal” behavior
        • Profile can be patterns or summary statistics for the overall population
        • Use the “normal” profile to detect anomalies
        • Anomalies are observations whose characteristics
          differ significantly from the normal profile
        • Types of anomaly detection
          schemes
        • Graphical & Statistical-based
        • Distance-based
        • Model-based

                  Statistical Approaches

                  • Assume a parametric model describing the distribution of the data (e.g., normal distribution)
                  • Apply a statistical test that depends on
                  • Data distribution
                  • Parameter of distribution (e.g., mean, variance)
                  • Number of expected outliers (confidence limit)

                  Grubbs’ Test

                  • Detect outliers in univariate data
                  • Assume data comes from normal distribution
                  • Detects one outlier at a time, remove the outlier, and repeat
                  • H0: There is no outlier in data
                  • HA: There is at least one outlier
                  • Grubbs’ test statistic:
                  • Reject H0 if:

                  Statistical-based – Likelihood Approach

                  • Assume the data set D contains samples from a mixture of two probability distributions:
                  • M (majority distribution)
                  • A (anomalous distribution)
                  • General Approach:
                  • Initially, assume all the data points belong to M
                  • Let Lt(D) be the log likelihood of D at time t
                  • For each point xt that belongs to M, move it to A
                  • Let Lt+1 (D) be the new log likelihood.
                  • Compute the difference,  = Lt(D) – Lt+1 (D)
                  • If  > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A

                  Limitations of Statistical Approaches

                  • Most of the tests are for a single attribute
                  • In many cases, data distribution may not be known
                  • For high dimensional data, it may be difficult to estimate the true distribution

                  Distance-based Approaches

                  • Data is represented as a vector of features
                  • Three major approaches
                  • Nearest-neighbor based
                  • Density based
                  • Clustering based

                  Nearest-Neighbor Based Approach

                  • Approach:
                  • Compute the distance between every pair of data points
                  • There are various ways to define outliers:
                  • Data points for which there are fewer than p neighboring points within a distance D
                  • The top n data points whose distance to the kth nearest neighbor is greatest
                  • The top n data points whose average distance to the k nearest neighbors is greatest

                  Outliers in Lower Dimensional Projection

                  • In high-dimensional space, data is sparse and notion of proximity becomes meaningless
                  • Every point is an almost equally good outlier from the perspective of proximity-based definitions
                  • Lower-dimensional projection methods
                  • A point is an outlier if in some lower dimensional projection, it is present in a local region of abnormally low density

                  Clustering-Based

                  • Basic idea:
                  • Cluster the data into groups of different density
                  • Choose points in small cluster as candidate outliers
                  • Compute the distance between candidate points and non-candidate clusters.
                  • If candidate points are far from all other non-candidate points, they are outliers

                            Base Rate Fallacy in Intrusion Detection

                            • I: intrusive behavior,
                              I: non-intrusive behavior
                              A: alarm
                              A: no alarm
                            • Detection rate (true positive rate): P(A|I)
                            • False alarm rate: P(A|I)
                            • Goal is to maximize both
                            • Bayesian detection rate, P(I|A)
                            • P(I|A)