DataTechNotes: Graph-based Anomaly Detection Example

Graph-based anomaly detection identifies unusual data points by analyzing graph structures, where nodes represent data points and edges depict relationships, such as distances or similarities. In this tutorial, I explain how to detect anomalies using a graph-based method in Python. The tutorial covers:

Introduction to Graph method
Generating test data
Graph construction
Anomaly detection
Conclusion
Source code listing

Introduction to Graph method

Graph-based anomaly detection uses graph structures to identify unusual data points. Each data point is treated as a node, and edges represent relationships such as distances or similarities between nodes. Anomalies are identified by analyzing graph properties like node degree, clustering coefficients, or connectivity patterns.

Key concept
In graph-based anomaly detection, the graph is constructed as follows:

Nodes: Represent individual data points, each serving as a vertex in the graph.
Edges: Represent relationships between nodes, usually based on a similarity or distance threshold that decides if two nodes should be connected.
Adjacency Matrix: A matrix that defines the connectivity of the graph, where each element indicates whether a pair of nodes is connected by an edge.
Node Degree refers to the number of edges connected to it, representing its level of connectivity in the graph. Nodes with significantly high or low degrees compared to the average may be considered anomalies, as they could indicate data points that deviate from the norm in terms of relationships with other points in the graph.

The graph-based method is useful when relationships between data points are crucial, such as in social networks, supply chains, or traffic systems, and when traditional methods like clustering or density estimation fail to effectively capture anomalies.

Generating test data

We'll start by loading the required libraries for this tutorial. You can install those libraries using pip command.

 
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from sklearn.datasets import make_blobs
from scipy.spatial import distance_matrix

Next, we'll generate synthetic data using make_blob() function and visualize it in a graph.

 
# Generate synthetic data
data, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4, random_state=42)
plt.figure(figsize=(10, 8))
plt.scatter(data[:, 0], data[:, 1])
plt.title("Test Dataset")
plt.show()

Graph construction

First, we need to compute the distance matrix matrix using distance_matrix() function. The distance_threshold defines the maximum allowable distance between two points for them to be considered "connected" in the graph. A smaller threshold results in fewer edges, connecting only data points that are very close, making the graph sparse. Conversely, a larger threshold increases the number of edges, potentially leading to a denser graph.

 
# Compute distance matrix
distance_threshold = 0.5
distances = distance_matrix(data, data)

Next, we construct the adjacency matrix. The adjacency matrix is built by checking pairwise distances between data points and comparing them to a predefined threshold. If the distance between two points is below the threshold, they are considered "connected."

 
# Create adjacency matrix
adj_matrix = (distances < distance_threshold).astype(int)
np.fill_diagonal(adj_matrix, 0)

Finally, we convert the adjacency matrix into a graph structure using NetworkX library, where each node represents a data point, and edges are established between points that are within a specified distance threshold.

# Create graph
graph = nx.from_numpy_array(adj_matrix)

Anomaly detection

Each node in the graph has a degree, which indicates the number of edges (connections) it has. A node with a lower degree is less connected and may be an outlier. We collect the degrees of all the nodes into a list for further analysis.

 
# Calculate node degrees
node_degrees = np.array([deg for _, deg in graph.degree()])

We compute the value below which 1% of the node degrees fall, and use this value as the threshold to identify anomalous nodes. Then we can find all the nodes whose degree is less than or equal to the 1st percentile degree threshold.

 
# Identify anomalies
degree_threshold = np.percentile(node_degrees, 1)
anomalies = np.where(node_degrees <= degree_threshold)[0]

Using the extracted anomalous points, we can visualize them within the graph for better analysis.

 
# Visualize anomalies
plt.figure(figsize=(10, 8))
plt.scatter(data[:, 0], data[:, 1], c="blue", label="Normal Points")
plt.scatter(data[anomalies, 0], data[anomalies, 1], c="red", label="Anomalies")
plt.title("Graph-Based Anomaly Detection")
plt.legend()
plt.grid()
plt.show()
 

Conclusion

In this tutorial we explored graph-based anomaly detection, where we constructed a graph based on pairwise distances and analyzed node degrees to identify anomalies. Nodes with low degrees, representing isolated data points, were identified as potential outliers. This method uses graph structure to effectively detect anomalies in datasets. Full source code is provided below.

Source code listing

 
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from sklearn.datasets import make_blobs
from scipy.spatial import distance_matrix

# Generate synthetic data
data, _ = make_blobs(n_samples=350, centers=1, cluster_std=0.4, random_state=42)
plt.figure(figsize=(10, 8))
plt.scatter(data[:, 0], data[:, 1])
plt.title("Test Dataset")
plt.show()

# Compute distance matrix
distance_threshold = 0.5
distances = distance_matrix(data, data)

# Create adjacency matrix
adj_matrix = (distances < distance_threshold).astype(int)
np.fill_diagonal(adj_matrix, 0)

# Create graph
graph = nx.from_numpy_array(adj_matrix)

# Calculate node degrees
node_degrees = np.array([deg for _, deg in graph.degree()])

# Identify anomalies
degree_threshold = np.percentile(node_degrees, 1)
anomalies = np.where(node_degrees <= degree_threshold)[0]

# Visualize anomalies
plt.figure(figsize=(10, 8))
plt.scatter(data[:, 0], data[:, 1], c="blue", label="Normal Points")
plt.scatter(data[anomalies, 0], data[anomalies, 1], c="red", label="Anomalies")
plt.title("Graph-Based Anomaly Detection")
plt.legend()
plt.grid()
plt.show()
 

DataTechNotes

Pages

Graph-based Anomaly Detection Example

No comments:

Post a Comment