ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions (Paper)

2022 | IEEE Transactions on Knowledge and Data Engineering | Cited by 380
Anomaly Detection Techniques and Applications | Water Systems and Optimization | Fault Detection and Control Systems

Abstract

Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the “rare events” that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions.
Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
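The three steps described in the abstract (per-dimension empirical CDF, per-dimension tail probabilities, aggregation across dimensions) can be illustrated with a minimal standard-library sketch. This is not the authors' released implementation. The function names `tail_probabilities` and `outlier_scores` are hypothetical, and the aggregation rule used here (summing the negative log of the smaller of the two tail probabilities in each dimension) is an assumption, since the abstract does not spell out the exact rule.

```python
import math
from bisect import bisect_left, bisect_right


def tail_probabilities(column):
    """Empirical left and right tail probabilities for every value in one dimension."""
    n = len(column)
    ordered = sorted(column)
    # Left tail: fraction of samples <= v. Right tail: fraction of samples >= v.
    left = [bisect_right(ordered, v) / n for v in column]
    right = [(n - bisect_left(ordered, v)) / n for v in column]
    return left, right


def outlier_scores(rows):
    """Aggregate tail probabilities across dimensions (assumed rule, see lead-in)."""
    scores = [0.0] * len(rows)
    for column in zip(*rows):  # iterate over dimensions
        left, right = tail_probabilities(column)
        for i, (p_left, p_right) in enumerate(zip(left, right)):
            # Each value counts itself, so neither probability is ever zero.
            scores[i] += -math.log(min(p_left, p_right))
    return scores


# Toy data: four similar points and one point that is extreme in both dimensions.
data = [(1.0, 2.0), (1.1, 2.1), (0.9, 1.9), (1.2, 2.2), (8.0, -5.0)]
scores = outlier_scores(data)
most_anomalous = max(range(len(data)), key=scores.__getitem__)
```

Because the score depends only on per-dimension ranks, there is nothing to tune, and each dimension's contribution to a point's score can be inspected separately, which matches the parameter-free and interpretable properties claimed above. In this sketch the point that is extreme in both dimensions receives the largest score.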