Graph-based anomaly detection identifies unusual data points by analyzing graph structures, where nodes represent data points and edges depict relationships, such as distances or similarities. In this tutorial, I explain how to detect anomalies using a graph-based method in Python. The tutorial covers:
- Introduction to Graph method
- Generating test data
- Graph construction
- Anomaly detection
- Conclusion
- Source code listing
Introduction to Graph method
Graph-based anomaly detection uses graph structures to identify unusual data points. Each data point is treated as a node, and edges represent relationships such as distances or similarities between nodes. Anomalies are identified by analyzing graph properties like node degree, clustering coefficients, or connectivity patterns.
Key concept
In graph-based anomaly detection, the graph is constructed as follows:
- Nodes: Represent individual data points, each serving as a vertex in the graph.
- Edges: Represent relationships between nodes, usually based on a similarity or distance threshold that decides if two nodes should be connected.
- Adjacency Matrix: A matrix that defines the connectivity of the graph, where each element indicates whether a pair of nodes is connected by an edge.
- Node Degree refers to the number of edges connected to it, representing its level of connectivity in the graph. Nodes with significantly high or low degrees compared to the average may be considered anomalies, as they could indicate data points that deviate from the norm in terms of relationships with other points in the graph.
The graph-based method is useful when relationships between data points are crucial, such as in social networks, supply chains, or traffic systems, and when traditional methods like clustering or density estimation fail to effectively capture anomalies.
Generating test data
We'll start by loading the required libraries for this tutorial. You can install those libraries using pip command.
Next, we'll generate synthetic data using make_blob() function and visualize it in a graph.
Graph construction
First, we need to compute the distance matrix matrix using distance_matrix() function. The distance_threshold defines the maximum allowable distance between two points for them to be considered "connected" in the graph. A smaller threshold results in fewer edges, connecting only data points that are very close, making the graph sparse. Conversely, a larger threshold increases the number of edges, potentially leading to a denser graph.
Next, we construct the adjacency matrix. The adjacency matrix is built by checking pairwise distances between data points and comparing them to a predefined threshold. If the distance between two points is below the threshold, they are considered "connected."
Finally, we convert the adjacency matrix into a graph structure using NetworkX library, where each node represents a data point, and edges are established between points that are within a specified distance threshold.
Anomaly detection
Each node in the graph has a degree, which indicates the number of edges (connections) it has. A node with a lower degree is less connected and may be an outlier. We collect the degrees of all the nodes into a list for further analysis.
We compute the value below which 1% of the node degrees fall, and use this value as the threshold to identify anomalous nodes. Then we can find all the nodes whose degree is less than or equal to the 1st percentile degree threshold.
Using the extracted anomalous points, we can visualize them within the graph for better analysis.
Conclusion
In this tutorial we explored graph-based anomaly detection, where we constructed a graph based on pairwise distances and analyzed node degrees to identify anomalies. Nodes with low degrees, representing isolated data points, were identified as potential outliers. This method uses graph structure to effectively detect anomalies in datasets. Full source code is provided below.
Source code listing
No comments:
Post a Comment