Data Inspection and Outlier Detection

First, let’s import the PyTorch modules we need, along with NumPy and Matplotlib.

from torchvision.datasets import Imagenette
from torch.utils.data import DataLoader
from torchvision.transforms import v2
import torch

import numpy as np
import matplotlib.pyplot as plt

Next, let’s prepare the datasets. In general, you will need two separate sets of images:

Reference Dataset - This dataset is used to extract reference image features and to fit the outlier detection models. In ML, this is typically the training dataset.
Inference Dataset - This dataset is treated as incoming new data on which you want to perform outlier detection. In ML, this may be a testing or validation dataset, or any new samples in production.

For this tutorial, we will use a subset of the ImageNet dataset called Imagenette. The Reference data is drawn from the Imagenette training split (every 10th image), whereas the Inference data comes from the validation split.

# Transforms
TRANSFORMS = v2.Compose([v2.ToImage(), 
                         v2.ToDtype(torch.float32, scale=True), 
                         v2.CenterCrop(size=(160,160)),
                         v2.Resize(size=(224,224))])
NORMALIZE = v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# Datasets
train_ds = Imagenette(root="./example_data", split='train', size='160px', transform=TRANSFORMS, download=True)
# Keep every 10th training image as the reference set
ref_set = torch.utils.data.Subset(train_ds, indices=list(range(0, len(train_ds), 10)))
val_ds = Imagenette(root="./example_data", split='val', size='160px', transform=TRANSFORMS)

# DataLoaders
ref_loader = DataLoader(ref_set, batch_size=32, shuffle=False)
inf_loader = DataLoader(val_ds, batch_size=6, shuffle=True)

# Labels mapping
CLASS_NAMES = ["tench", "English springer", "cassette player", 
               "chain saw", "church", "French horn", "garbage truck", 
               "gas pump", "golf ball", "parachute"]

LOGIT2NAME = {
    0: "tench",
    1: "English springer",
    2: "cassette player",
    3: "chain saw",
    4: "church",
    5: "French horn",
    6: "garbage truck",
    7: "gas pump",
    8: "golf ball",
    9: "parachute"
}

Data Inspector Module Setup

You are now ready to create an Outlier Detector: the Reference dataset is used to learn the typical distribution of visual features, and the fitted model is then applied to the Inference dataset. The Obz AI package is designed to be highly modular and customizable. As a first step, we’ll import the feature extractors and the outlier detection algorithms we want to use.

from obzai.data_inspector.extractor import FirstOrderExtractor, CLIPExtractor
from obzai.data_inspector.detector import GMMDetector, PCAReconstructionLossDetector

First Order Extractor and GMM Detector

FirstOrderExtractor is a simple and fast tool that extracts first-order statistical features from images. These features summarize general properties of the pixel intensity values, such as mean, variance, and skewness. They are useful, for example, for identifying images that are overly bright or dark, or whose intensities vary much more than those in the Reference dataset.

Note: First-order statistical features are invariant to the arrangement of pixels; in other words, they do not capture spatial relationships within the image.

At any point, you can view the list of features that FirstOrderExtractor computes by accessing its .feature_names attribute. This helps you understand exactly which statistics are being extracted from your images.

first_order_extrc = FirstOrderExtractor()
first_order_extrc.feature_names

['entropy', 'min', 'max', '10th_percentile', '90th_percentile', 'mean', 'median', 'interquartile_range', 'range', 'mean_absolute_deviation', 'robust_mean_absolute_deviation', 'root_mean_square', 'skewness', 'kurtosis', 'variance', 'uniformity']

GMMDetector is an outlier detection method based on a Gaussian Mixture Model (GMM). To configure the GMMDetector, provide the following arguments:

extractors - A sequence of Extractor objects that process your data. Currently, only the FirstOrderExtractor is accepted.
n_components - The number of Gaussian components in the mixture model. This controls the complexity of the model and how finely it can separate data clusters.
outlier_quantile - The quantile threshold that determines what is considered an outlier. Data points whose score falls below this quantile are classified as outliers.
show_progress - If set to True, a progress bar will be displayed during feature extraction to visualize operation progress.
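
To make the role of outlier_quantile more concrete, here is a minimal NumPy-only sketch of likelihood-based outlier detection. It is not the GMMDetector implementation: for brevity it fits a single Gaussian (that is, a one-component mixture), and the helper names fit_gaussian and log_likelihood, the synthetic feature matrix, and the exact thresholding rule are all illustrative assumptions.

```python
import numpy as np

def fit_gaussian(features):
    """Fit a single multivariate Gaussian (a one-component mixture)."""
    mean = features.mean(axis=0)
    # A small ridge keeps the covariance matrix invertible
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mean, cov

def log_likelihood(features, mean, cov):
    """Per-sample log-density under the fitted Gaussian."""
    d = features.shape[1]
    diff = features - mean
    inv = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every row
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
ref_features = rng.normal(size=(1000, 4))      # stand-in for reference features
mean, cov = fit_gaussian(ref_features)

# Threshold: the 1% quantile of the reference log-likelihoods
ref_scores = log_likelihood(ref_features, mean, cov)
threshold = np.quantile(ref_scores, 0.01)

# Five typical samples plus one that lies far from the reference data
new_features = np.vstack([rng.normal(size=(5, 4)), np.full((1, 4), 8.0)])
new_scores = log_likelihood(new_features, mean, cov)
is_outlier = new_scores < threshold            # low likelihood => outlier
print(is_outlier)
```

With a threshold chosen this way, about 1% of the reference samples themselves fall below it, which is why a small outlier rate on in-distribution inference data is expected rather than alarming.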

After initialization, fit the model to your reference data by calling the .fit() method. The reference data must be provided as a PyTorch DataLoader object.

gmm_detector = GMMDetector(extractors=[first_order_extrc], n_components=3, outlier_quantile=0.01, show_progress=True)
gmm_detector.fit(ref_loader)

That’s all! Your GMM-based Outlier Detector is ready to use. You can pass batches of images from the inference data to the .detect() method. This method returns a named tuple with:

img_features - extracted features for each image in the batch.
outliers - boolean vector indicating whether each sample in the batch is an outlier.
scores - detection score for each sample in the batch.

We can run this outlier detection on a single batch of images, and display the outputs.

# Example code to run inference on a single batch
image_batch, _ = next(iter(inf_loader))
detection_results = gmm_detector.detect(image_batch)

print("Outlier detection results for a batch:")
print(detection_results.outliers)

print("First Order Features for a batch:")
print(detection_results.img_features)

print("Detection scores:")
print(detection_results.scores)

We can now run this outlier detection on roughly 1000 samples from the inference data.

## Example code to run inference on multiple batches (up to 1000 samples)
# Process multiple batches from the inference loader
firstorder_total_samples = 0
firstorder_total_outliers = 0
firstorder_outliers = []
firstorder_features = []
firstorder_scores = []

for batch_images, _ in inf_loader:
    # Process batch
    batch_results = gmm_detector.detect(batch_images)
    firstorder_total_outliers += batch_results.outliers.sum()
    firstorder_total_samples += len(batch_results.outliers)

    firstorder_outliers.append(batch_results.outliers.flatten())
    firstorder_features.append(batch_results.img_features['FirstOrderExtractor'].T)
    firstorder_scores.append(batch_results.scores.flatten())
    
    # Break after 1000 samples
    if firstorder_total_samples >= 1000:
        break

print(f"Processed {firstorder_total_samples} samples")
print(f"Found {firstorder_total_outliers} first-order outliers ({(firstorder_total_outliers/firstorder_total_samples)*100:.2f}%)")

We can visualize the results with a bar chart of outlier status and a histogram of the GMM-based scores. In the second plot, the y-axis uses a log scale.

# Convert lists to numpy arrays for easier manipulation
firstorder_outliers_array = np.concatenate(firstorder_outliers)
firstorder_scores_array = np.concatenate(firstorder_scores)

# Calculate counts for normal and outlier samples
# (minlength=2 keeps both bins even when no outliers were found)
firstorder_outlier_counts = np.bincount(firstorder_outliers_array.astype(int), minlength=2)

# Create a two-panel figure with a minimal look
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))

# Plot 1: Outlier Status Histogram with minimal style
ax1.bar(['Normal', 'Outlier'], firstorder_outlier_counts, color='black')
ax1.set_title('Distribution of Outliers', pad=20)
ax1.set_ylabel('Count')
percentage = (firstorder_outlier_counts[1] / len(firstorder_outliers_array)) * 100
ax1.text(1, firstorder_outlier_counts[1]/2, f'{percentage:.1f}%', ha='center', va='center', color='white')

# Plot 2: Detection Scores Distribution with minimal style
ax2.hist(firstorder_scores_array, bins=500, color='black', edgecolor='none')
ax2.set_yscale('log')  # Set y-axis to logarithmic scale
ax2.set_title('Distribution of Detection Scores', pad=20)
ax2.set_xlabel('Score')
ax2.set_ylabel('Count (log scale)')

plt.tight_layout()
plt.show()
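
One detail of the counting step deserves a short illustration. np.bincount returns only as many bins as the largest value it sees, so a run with no outliers yields a single bin, and indexing the "outlier" count then fails. The toy flag arrays below are made up for this sketch; passing minlength=2 guarantees both bins are present.

```python
import numpy as np

flags_none = np.array([False, False, False])        # no outliers found
flags_some = np.array([False, True, False, True])   # two outliers found

# Without minlength, the all-normal case produces a single bin
counts_short = np.bincount(flags_none.astype(int))
# With minlength=2, the "outlier" bin exists and is zero
counts_none = np.bincount(flags_none.astype(int), minlength=2)
counts_some = np.bincount(flags_some.astype(int), minlength=2)

print(counts_short)  # [3]
print(counts_none)   # [3 0]
print(counts_some)   # [2 2]
```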

For completeness, it helps to look at a few images the detector considers normal and a few it flags as outliers.

# Collect up to three normal and three outlier images from the inference data
normal_images = []
outlier_images = []

for batch_images, _ in inf_loader:
    detection_results = gmm_detector.detect(batch_images)

    # Add normal images until we have three
    normal_idx = (~detection_results.outliers).nonzero()[0]
    for idx in normal_idx[:3 - len(normal_images)]:
        normal_images.append(batch_images[idx].cpu().numpy())

    # Add outlier images until we have three
    outlier_idx = detection_results.outliers.nonzero()[0]
    for idx in outlier_idx[:3 - len(outlier_images)]:
        outlier_images.append(batch_images[idx].cpu().numpy())

    # Stop once we have enough of both
    if len(normal_images) == 3 and len(outlier_images) == 3:
        break

# Create figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(10, 5))

# Plot normal images
for i, img in enumerate(normal_images):
    axes[0, i].imshow(img.transpose(1, 2, 0))
    axes[0, i].axis('off')
    axes[0, i].set_title('Within Range', fontsize=10)

# Plot outlier images
for i, img in enumerate(outlier_images):
    axes[1, i].imshow(img.transpose(1, 2, 0))
    axes[1, i].axis('off')
    axes[1, i].set_title('Outside of Range (Outliers)', fontsize=10)


plt.suptitle('Examples of Outliers based on First Order Features', fontsize=10)
plt.tight_layout()
plt.show()
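
The outliers above are flagged purely on intensity statistics. As a quick check of the earlier note that first-order features ignore pixel arrangement, the sketch below computes a handful of such statistics with NumPy and compares an image with a randomly shuffled copy of itself. The function first_order_features and its definitions (population moments, a 32-bin histogram entropy) are common textbook choices assumed for illustration; they are not necessarily the formulas used by FirstOrderExtractor.

```python
import numpy as np

def first_order_features(image):
    """A few first-order statistics computed over all pixel intensities."""
    x = np.asarray(image, dtype=np.float64).ravel()
    mean = x.mean()
    std = x.std()
    # Histogram-based entropy over a fixed intensity range of [0, 1]
    hist, _ = np.histogram(x, bins=32, range=(0.0, 1.0))
    p = hist[hist > 0] / x.size
    return {
        "mean": mean,
        "variance": std ** 2,
        "skewness": ((x - mean) ** 3).mean() / std ** 3,
        "kurtosis": ((x - mean) ** 4).mean() / std ** 4,
        "entropy": -(p * np.log2(p)).sum(),
        "range": x.max() - x.min(),
    }

rng = np.random.default_rng(0)
image = rng.random((3, 64, 64))                 # synthetic image in [0, 1)

# Same pixels, random arrangement: spatial structure is destroyed
shuffled = rng.permutation(image.ravel()).reshape(image.shape)
# Same arrangement, darker pixels
dark = 0.2 * image

original = first_order_features(image)
permuted = first_order_features(shuffled)
darkened = first_order_features(dark)

for name in original:
    print(f"{name:>9}: {original[name]:.4f}  {permuted[name]:.4f}  {darkened[name]:.4f}")
```

The shuffled copy produces the same features as the original, while the darkened copy shifts the mean and variance. A detector built on these features can therefore catch exposure problems but not, for example, an image whose content is scrambled, which motivates the embedding-based approach in the next section.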

CLIP Extractor and PCA Reconstruction Loss Detector

GMMDetector applied to first-order features provides a straightforward quantification that is invariant to the spatial arrangement of objects. We also provide a more advanced outlier detection method based on principal component analysis (PCA), which is suitable for high-dimensional embedding features.

Here, we look at how to use a deep learning model (CLIP; Contrastive Language-Image Pre-training) to extract embedding features that take spatial relationships into account. To handle such high-dimensional features, we use PCAReconstructionLossDetector as the outlier detection algorithm.

(A schematic illustrating the reconstruction loss idea will be added here.)

To set up the PCAReconstructionLossDetector, you need to provide the following arguments:

extractor: This is an already-instantiated extractor object that will be used to turn images into feature vectors. It is best to use an extractor that outputs features with high dimensions, such as CLIPExtractor.
n_components: This is the number of principal components (PCs). In other words, it controls how many dimensions will be in the reduced latent space.
outlier_quantile: This value is used to decide if an image should be considered an outlier. If the reconstruction loss (the difference between the original and reconstructed image in feature space) for an image is higher than this quantile, then the detector will classify that image as an outlier.
show_progress: Set this to True if you want to see a progress bar while features are being extracted. If you don’t want to see a progress bar, set it to False.

Note that, unlike GMMDetector, PCAReconstructionLossDetector treats an unusually high score (here, the reconstruction loss) as evidence of an outlier.
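
The reconstruction loss idea can be sketched in a few lines of NumPy. This is a simplified illustration and not the PCAReconstructionLossDetector implementation: the helpers fit_pca and reconstruction_loss, the synthetic features, and the choice of thresholding at the upper 1% of reference losses are all assumptions made for the example.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_loss(features, mean, components):
    """Mean squared error between features and their PCA reconstruction."""
    centered = features - mean
    latent = centered @ components.T        # project into the latent space
    reconstructed = latent @ components     # map back to feature space
    return ((centered - reconstructed) ** 2).mean(axis=1)

rng = np.random.default_rng(0)
# Reference features lying close to a 3-dimensional subspace of a 20-d space
basis = rng.normal(size=(3, 20))
ref = rng.normal(size=(500, 3)) @ basis + 0.05 * rng.normal(size=(500, 20))

mean, components = fit_pca(ref, n_components=3)
ref_loss = reconstruction_loss(ref, mean, components)
threshold = np.quantile(ref_loss, 0.99)     # upper tail of reference losses

typical = rng.normal(size=(1, 3)) @ basis   # lies in the reference subspace
atypical = 3.0 * rng.normal(size=(1, 20))   # does not
losses = reconstruction_loss(np.vstack([typical, atypical]), mean, components)
is_outlier = losses > threshold             # high loss => outlier
print(losses, is_outlier)
```

Samples that resemble the reference data are reconstructed well from a few principal components, whereas samples with unfamiliar structure lose much of their signal in the projection. With that picture in mind, the detector is set up and fitted as follows.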

clip_extractor = CLIPExtractor()
pca_detector = PCAReconstructionLossDetector(extractor=clip_extractor, n_components=128, outlier_quantile=0.01, show_progress=True)

pca_detector.fit(ref_loader)

After fitting this outlier detection algorithm on our reference data, we can apply it to samples from the inference data.

## Example code to run inference on a single batch
# image_batch, _ = next(iter(inf_loader))
# detection_results = pca_detector.detect(image_batch)

# print("CLIP features:")
# print(detection_results.img_features)

# print("Outlier status:")
# print(detection_results.outliers)

# print("PCA Reconstruction Loss")
# print(detection_results.scores)

# print("2D PCA Components")
# print(detection_results.projector_results.pca_coords)

# print("2D UMAP Components")
# print(detection_results.projector_results.umap_coords)

## Example code to run inference on multiple batches (up to 1000 samples)
# Process multiple batches from the inference loader
clip_total_samples = 0
clip_total_outliers = 0
clip_outliers = []
clip_features = []
clip_scores = []

for batch_images, _ in inf_loader:
    # Process batch
    batch_results = pca_detector.detect(batch_images)
    clip_total_outliers += batch_results.outliers.sum()
    clip_total_samples += len(batch_results.outliers)

    clip_outliers.append(batch_results.outliers.flatten())
    clip_features.append(batch_results.img_features['CLIPExtractor'].T)
    clip_scores.append(batch_results.scores.flatten())
    
    # Break after 1000 samples
    if clip_total_samples >= 1000:
        break

print(f"Processed {clip_total_samples} samples")
print(f"Found {clip_total_outliers} outliers ({(clip_total_outliers/clip_total_samples)*100:.2f}%)")

# Convert lists to numpy arrays for easier manipulation
clip_outliers_array = np.concatenate(clip_outliers)
clip_scores_array = np.concatenate(clip_scores)

# Calculate counts for normal and outlier samples
# (minlength=2 keeps both bins even when no outliers were found)
clip_outlier_counts = np.bincount(clip_outliers_array.astype(int), minlength=2)

# Create a two-panel figure with a minimal look
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))

# Plot 1: Outlier Status Histogram with minimal style
ax1.bar(['Normal', 'Outlier'], clip_outlier_counts, color='black')  
ax1.set_title('Distribution of Outliers', pad=20)
ax1.set_ylabel('Count')
percentage = (clip_outlier_counts[1] / len(clip_outliers_array)) * 100
ax1.text(1, clip_outlier_counts[1]/2, f'{percentage:.1f}%', ha='center', va='center', color='white')

# Plot 2: Detection Scores Distribution with minimal style
ax2.hist(clip_scores_array, bins=500, color='black', edgecolor='none')
ax2.set_yscale('log')  # Set y-axis to logarithmic scale
ax2.set_title('Distribution of Detection Scores', pad=20)
ax2.set_xlabel('Score')
ax2.set_ylabel('Count (log scale)')

plt.tight_layout()
plt.show()

We can now visualize some of the samples that the CLIP-based detection method considers outliers.

# Collect up to three normal and three outlier images from the inference data
normal_images = []
outlier_images = []

for batch_images, _ in inf_loader:
    detection_results = pca_detector.detect(batch_images)

    # Add normal images until we have three
    normal_idx = (~detection_results.outliers).nonzero()[0]
    for idx in normal_idx[:3 - len(normal_images)]:
        normal_images.append(batch_images[idx].cpu().numpy())

    # Add outlier images until we have three
    outlier_idx = detection_results.outliers.nonzero()[0]
    for idx in outlier_idx[:3 - len(outlier_images)]:
        outlier_images.append(batch_images[idx].cpu().numpy())

    # Stop once we have enough of both
    if len(normal_images) == 3 and len(outlier_images) == 3:
        break

# Create figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(10, 5))

# Plot normal images
for i, img in enumerate(normal_images):
    axes[0, i].imshow(img.transpose(1, 2, 0))
    axes[0, i].axis('off')
    axes[0, i].set_title('Within Range', fontsize=10)

# Plot outlier images
for i, img in enumerate(outlier_images):
    axes[1, i].imshow(img.transpose(1, 2, 0))
    axes[1, i].axis('off')
    axes[1, i].set_title('Outside of Range (Outliers)', fontsize=10)

plt.suptitle('Examples of Outliers based on CLIP', fontsize=10)
plt.tight_layout()
plt.show()

Inspecting reference features

When you fit an Outlier Detector, it internally extracts and stores a set of reference features from your data. To access them for inspection or further analysis, call the .return_reference_features() method on your detector instance. Let’s see how to retrieve the reference features from a GMMDetector object:

refs_gmm = gmm_detector.return_reference_features()

Similarly, you can retrieve reference features from your PCA Reconstruction Loss Detector.

refs_pca = pca_detector.return_reference_features()

This type of detector also produces 2-dimensional representations of the data points, computed with the PCA and UMAP algorithms. These are particularly useful for visualization on a scatter plot. You can retrieve these embedded coordinates by calling .return_reference_2D_components():

embedding_coords = pca_detector.return_reference_2D_components()
print("## PCA Coordinates of Reference Set ##")
print(embedding_coords.pca_coords)
print("## UMAP Coordinates of Reference Set ##")
print(embedding_coords.umap_coords)

PCA Coordinates of Reference Set

        x_coor      y_coor
0     9.501494    7.013512
1    17.202290   12.576481
2    15.603800   10.027860
3    14.825594   10.396540
4    11.999342   10.963834
..         ...         ...
942   5.190470  -10.450700
943   6.155877  -10.435268
944   4.931119  -11.205107
945   7.275045  -11.389244
946   1.419314   -6.772340

[947 rows x 2 columns]

UMAP Coordinates of Reference Set

        x_coor      y_coor
0     9.501342    7.013336
1    17.202326   12.576518
2    15.603857   10.027987
3    14.825516   10.396494
4    11.999347   10.963912
..         ...         ...
942   5.190484  -10.450880
943   6.155859  -10.435368
944   4.931139  -11.205077
945   7.275099  -11.389211
946   1.419346   -6.772443

[947 rows x 2 columns]
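
For intuition about what such 2D coordinates represent, the following sketch projects a feature matrix onto its first two principal components with NumPy. It is an illustration under assumptions, not the detector's own code: the helper pca_2d and the synthetic features are hypothetical, and UMAP is omitted because it needs a separate package.

```python
import numpy as np

def pca_2d(features):
    """Project features onto their first two principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Synthetic 16-d features whose variance decreases along the columns
features = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.1, 16)

coords = pca_2d(features)
x_coor, y_coor = coords[:, 0], coords[:, 1]
print(coords.shape)
```

The first coordinate carries at least as much variance as the second, and the two are uncorrelated, so a scatter plot of x_coor against y_coor shows the two directions along which the reference features vary most.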
