From 12,000 Alerts to Actionable Intelligence: ML-Powered Anomaly Detection


Infrastructure · Nov 2024


Sariph Shrestha · 13 min read

The Alert Fatigue Crisis

Every large enterprise monitoring setup shares the same pathology: too many alerts, not enough signal. A typical deployment we audit has 12,000+ alerts per day across infrastructure, application, and business metrics. Of those, roughly 97% are false positives — threshold-based alerts firing on normal variance, transient spikes, or scheduled maintenance.

The result is alert fatigue. On-call engineers learn to ignore alerts. Dashboards become noise. When a real incident occurs, the signal is buried in a wall of irrelevant notifications. Mean time to detect (MTTD) stretches to hours. Mean time to resolve (MTTR) stretches further.

HALO, our ML-powered anomaly detection platform, addresses this by replacing static thresholds with learned baselines, correlating related anomalies into unified incidents, and generating automated root cause hypotheses. This post describes the ML methodology in detail.

Feature Engineering for Time Series Anomaly Detection

Raw metric values are poor features for anomaly detection. A CPU utilization of 85% is alarming at 3 AM on a Sunday but normal at 2 PM on a Monday during a batch processing window. Effective anomaly detection requires features that capture context.

We engineer four categories of features for each metric:

Temporal features. Hour of day (cyclically encoded as sin/cos), day of week, month, holiday indicator, minutes since last deployment, minutes since last scheduled job. These capture periodic patterns and known event context.

```python
from datetime import datetime

import numpy as np

def encode_temporal_features(timestamp: datetime) -> dict[str, float]:
    hour = timestamp.hour + timestamp.minute / 60.0
    dow = timestamp.weekday()
    return {
        "hour_sin": np.sin(2 * np.pi * hour / 24),
        "hour_cos": np.cos(2 * np.pi * hour / 24),
        "dow_sin": np.sin(2 * np.pi * dow / 7),
        "dow_cos": np.cos(2 * np.pi * dow / 7),
        "is_weekend": float(dow >= 5),
        "is_business_hours": float(9 <= hour <= 17 and dow < 5),
    }
```

Statistical features (rolling windows). For each metric, we compute rolling statistics over multiple windows (5min, 15min, 1hr, 6hr, 24hr): mean, standard deviation, median, min, max, rate of change, coefficient of variation. This gives the model a multi-scale view of recent behavior.
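As a sketch, the per-window statistics can be computed over the trailing samples of a 1-minute series. The function name, the array representation, and the window labels here are illustrative, not HALO's actual code:

```python
import numpy as np

def rolling_stats(values: np.ndarray, window: int, label: str) -> dict[str, float]:
    """Rolling statistics over the trailing `window` samples of a 1-minute series.

    `label` names the window in feature keys, e.g. "5min" for window=5.
    """
    w = values[-window:]
    mean = float(np.mean(w))
    std = float(np.std(w))
    return {
        f"mean_{label}": mean,
        f"std_{label}": std,
        f"median_{label}": float(np.median(w)),
        f"min_{label}": float(np.min(w)),
        f"max_{label}": float(np.max(w)),
        # Rate of change: last value minus first value in the window
        f"roc_{label}": float(w[-1] - w[0]),
        # Coefficient of variation: dispersion relative to the mean
        f"cv_{label}": std / mean if mean else 0.0,
    }
```

Calling this once per window size — e.g. `(5, "5min")`, `(15, "15min")`, `(60, "1hr")` — and merging the dicts yields the multi-scale view described above.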

Lag features. Values at the same time 1 day ago, 1 week ago, and 4 weeks ago. These help the model learn periodic patterns. A metric that is always high on Monday mornings should not trigger an anomaly on Monday morning.
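At 1-minute granularity these lags are simple index offsets. A minimal sketch (feature names and the NaN convention for missing history are assumptions, not HALO's actual code):

```python
import numpy as np

# Minute offsets for 1 day, 1 week, and 4 weeks at 1-minute granularity
LAG_MINUTES = {"lag_1d": 1_440, "lag_1w": 10_080, "lag_4w": 40_320}

def lag_features(values: np.ndarray) -> dict[str, float]:
    """Look up the metric value at the same time 1d/1w/4w ago.

    `values` is a 1-minute series ending at "now"; missing history yields NaN.
    """
    n = len(values)
    return {
        name: float(values[n - 1 - lag]) if n > lag else float("nan")
        for name, lag in LAG_MINUTES.items()
    }
```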

Cross-metric features. Correlation between related metrics: CPU utilization with request rate, memory usage with active connections, error rate with response latency. An increase in CPU that correlates with an increase in request rate is likely normal scaling. An increase in CPU with no change in request rate is suspicious.
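One way to capture this relationship as a feature is a rolling Pearson correlation between the two series; a sketch under that assumption (the window size and flat-series handling are illustrative):

```python
import numpy as np

def cross_metric_corr(a: np.ndarray, b: np.ndarray, window: int = 60) -> float:
    """Pearson correlation between two metrics over the trailing window.

    Near +1 suggests the metrics move together (e.g. CPU tracking request
    rate); near 0 suggests one is moving on its own, which is suspicious.
    """
    x, y = a[-window:], b[-window:]
    if np.std(x) == 0 or np.std(y) == 0:
        return 0.0  # a flat series carries no correlation signal
    return float(np.corrcoef(x, y)[0, 1])
```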

Each metric produces ~60 features. For a service with 50 monitored metrics, the model ingests 3,000 features per inference cycle.

Model Architecture

We use an ensemble of three model types, each capturing different anomaly patterns:

Isolation Forest for Point Anomalies

Isolation Forests excel at detecting individual data points that are statistically unusual in the feature space. They are fast to train, fast to infer, and require no assumptions about the data distribution.

We train one Isolation Forest per service (not per metric) so that cross-metric relationships are captured. The contamination parameter is set to 0.01 based on historical labeling — roughly 1% of data points in our labeled dataset are true anomalies.
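A minimal sketch of the per-service setup using scikit-learn's `IsolationForest`. The contamination value comes from the labeling described above; the other hyperparameters and the 0-1 score mapping are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def train_service_detector(features: np.ndarray) -> IsolationForest:
    """Fit one Isolation Forest on the full per-service feature matrix.

    contamination=0.01 mirrors the ~1% anomaly rate in historical labels.
    """
    model = IsolationForest(
        n_estimators=200,
        contamination=0.01,
        random_state=42,
    )
    model.fit(features)
    return model

def point_anomaly_score(model: IsolationForest, x: np.ndarray) -> np.ndarray:
    """Map score_samples (higher = more normal) to a 0-1 anomaly score."""
    raw = model.score_samples(x)  # roughly in [-1, 0]; lower = more anomalous
    return np.clip(-raw, 0.0, 1.0)
```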

LSTM Autoencoder for Temporal Anomalies

Point anomaly detectors miss temporal patterns — sequences of values that are individually normal but collectively anomalous (a slow, steady increase in memory usage over hours, for example). Our LSTM autoencoder is trained to reconstruct normal time series windows (60-minute sequences at 1-minute granularity). At inference time, high reconstruction error indicates the sequence deviates from learned normal behavior.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features: int, hidden_size: int = 128, n_layers: int = 2):
        super().__init__()
        self.encoder = nn.LSTM(
            input_size=n_features, hidden_size=hidden_size,
            num_layers=n_layers, batch_first=True, dropout=0.2,
        )
        self.decoder = nn.LSTM(
            input_size=hidden_size, hidden_size=hidden_size,
            num_layers=n_layers, batch_first=True, dropout=0.2,
        )
        self.output_layer = nn.Linear(hidden_size, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, n_features)
        _, (hidden, cell) = self.encoder(x)
        # Repeat last hidden state for decoder input
        decoder_input = hidden[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        decoder_output, _ = self.decoder(decoder_input, (hidden, cell))
        return self.output_layer(decoder_output)

    def anomaly_score(self, x: torch.Tensor) -> torch.Tensor:
        reconstruction = self.forward(x)
        return torch.mean((x - reconstruction) ** 2, dim=(1, 2))
```

The autoencoder is retrained weekly on a rolling 30-day window of data, ensuring the baseline adapts to gradual infrastructure changes (new deployments, capacity changes, traffic growth).

Prophet-based Seasonal Decomposition

For metrics with strong seasonal patterns (daily traffic curves, weekly batch job patterns), we use Facebook Prophet to decompose the time series into trend, seasonality, and residual components. Anomalies are detected in the residual — what remains after accounting for expected seasonal behavior.

This catches anomalies that the other models miss: a metric that is within its normal range but at the wrong time. A request rate of 500 RPS is normal at 2 PM but anomalous at 2 AM.
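Given a Prophet forecast row (`yhat` plus the `yhat_lower`/`yhat_upper` uncertainty interval that Prophet emits), the residual can be turned into a 0-1 score. This is a sketch of one plausible mapping, not HALO's exact scoring function:

```python
def prophet_residual_score(
    actual: float, yhat: float, yhat_lower: float, yhat_upper: float
) -> float:
    """Score an observation against a Prophet forecast's uncertainty interval.

    Returns 0 inside the interval and grows toward 1 as the residual
    exceeds it, i.e. the value is wrong for this time of day.
    """
    half_width = max((yhat_upper - yhat_lower) / 2.0, 1e-9)
    residual = abs(actual - yhat)
    if residual <= half_width:
        return 0.0
    # How many interval half-widths beyond the expected band, capped at 1
    return float(min(1.0, (residual - half_width) / (2.0 * half_width)))
```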

Ensemble Scoring

Each model produces an anomaly score (0 to 1). The ensemble score is a weighted combination:

```python
def ensemble_score(
    isolation_score: float,
    autoencoder_score: float,
    prophet_score: float,
    weights: tuple[float, float, float] = (0.3, 0.4, 0.3),
) -> float:
    scores = [isolation_score, autoencoder_score, prophet_score]
    weighted = sum(s * w for s, w in zip(scores, weights))
    # Boost score if multiple models agree
    n_flagged = sum(1 for s in scores if s > 0.7)
    agreement_boost = 0.15 * max(0, n_flagged - 1)
    return min(1.0, weighted + agreement_boost)
```

The agreement boost is important: when multiple independent models flag the same data point, confidence increases nonlinearly. A point flagged by all three models is very likely a true anomaly.

The Correlation Engine

Detecting individual anomalies is half the problem. The other half is correlating related anomalies into incidents. When a database slows down, you see anomalies in: database query latency, application response time, error rates, queue depth, and potentially upstream service latency. These are not five incidents — they are one incident with five symptoms.

HALO's correlation engine uses a service dependency graph (automatically discovered via distributed tracing data) combined with temporal proximity. Two anomalies are correlated if:

1. They occur within a configurable time window (default: 5 minutes).
2. The affected services have a dependency relationship (direct or within 2 hops).
3. The anomaly patterns are consistent with the dependency direction (upstream anomalies should precede downstream anomalies).

Correlated anomalies are merged into a single incident. The incident is annotated with a probable root cause — the earliest anomaly in the dependency chain — and a blast radius — all services affected directly or transitively.

Real-Time Inference Architecture

HALO processes metrics in real-time with the following architecture:

Ingestion. Metrics flow from Prometheus exporters, CloudWatch, Datadog agents, and custom instrumentation into a Kafka topic. Each message contains a service identifier, metric name, timestamp, value, and labels.
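A message carrying those fields might look like the following dataclass; the field names and JSON encoding are illustrative assumptions, not HALO's actual wire format:

```python
import json
from dataclasses import dataclass, field

@dataclass
class MetricMessage:
    """Shape of one Kafka metric message (field names are illustrative)."""
    service: str
    metric: str
    timestamp: float  # epoch seconds
    value: float
    labels: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(self.__dict__)
```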

Feature computation. A Flink streaming job consumes the Kafka topic and computes the engineered features described above. Rolling window statistics are maintained in Flink state. The job outputs enriched feature vectors to a second Kafka topic.

Model inference. A pool of Python inference workers consumes feature vectors and runs the ensemble model. Inference latency is ~15ms per metric per cycle (1-minute cycles). Workers are stateless and horizontally scalable.

Correlation and alerting. Detected anomalies flow to the correlation engine (a stateful service backed by Redis for the temporal window) which groups them into incidents. Incidents above a severity threshold are pushed to PagerDuty, Slack, and the HALO dashboard.

End-to-end latency from metric emission to alert delivery: under 90 seconds. This is fast enough for real-time incident detection while allowing sufficient data for meaningful statistical features.

Production Results

Across three enterprise deployments:

  • Alert volume: 12,000/day to 720/day (94% reduction in noise)
  • False positive rate: 97% to 12% (8x improvement)
  • Mean time to detect (MTTD): 23 minutes to 3.2 minutes
  • Mean time to resolve (MTTR): 47 minutes to 6 minutes (87% reduction)
  • Estimated annual savings from prevented downtime: $340K per deployment

The 12% false positive rate is intentional — we tune for high recall (catching real incidents) at the expense of some false positives. Missing a real incident is far more costly than investigating a false alarm.

Key Lessons

The most important lesson from building HALO: anomaly detection is a context problem, not a statistics problem. The same metric value can be normal or anomalous depending on time, correlated metrics, recent deployments, and business context. Feature engineering that captures this context is more impactful than model architecture choices. An Isolation Forest with excellent features outperforms a transformer with poor features every time.

Anomaly Detection · ML · HALO · Time Series · Infrastructure

Want to build something similar?

We help enterprises design and deploy production-grade AI systems. Let's discuss your project.

Schedule a Consultation