HALO - Infrastructure Monitoring
Built an ML-powered observability platform that reduced MTTR by 87% and eliminated 94% of alert noise, saving $340K annually.
The Challenge
The client operated a large-scale SaaS platform serving 2,000+ enterprise customers across multiple AWS regions. Their monitoring stack — a patchwork of Datadog, PagerDuty, and custom CloudWatch alarms — was generating approximately 12,000 alerts per day. The operations team had determined through careful analysis that 97% of these alerts were false positives or non-actionable noise: threshold breaches that self-resolved, correlated symptoms of a single root cause firing as independent alerts, and alerts on metrics that had drifted from their original baselines without threshold updates.
The human cost was severe. The SRE team of eight engineers was spending 60% of their time triaging alerts rather than improving infrastructure. Mean time to resolution (MTTR) sat at 47 minutes because engineers had to wade through noise to find the signal. Alert fatigue had set in — the team had unconsciously developed a habit of dismissing alerts that "looked routine," which meant real incidents were occasionally missed. Two P1 incidents in the previous quarter had been caught by customers before the on-call team, resulting in SLA breaches that cost the company $180K in credits.
The client had attempted to solve this with increasingly complex static thresholds and alert routing rules. They had over 3,400 alerting rules in their system, many of which contradicted each other or referenced deprecated infrastructure. The maintenance burden of the rules themselves had become a full-time job. They needed a fundamentally different approach — one that could learn what "normal" looks like for each metric, adapt to seasonal patterns, and correlate related alerts into actionable incidents automatically.
The Solution
We built an intelligent monitoring platform that replaces static thresholds with ML-powered anomaly detection and automated alert correlation. The system ingests metrics from the client's existing telemetry sources — Prometheus, CloudWatch, and application-level custom metrics — and builds dynamic baseline models for each metric stream independently.
The anomaly detection engine uses an ensemble approach combining three model types: a seasonal decomposition model (STL) that captures daily, weekly, and monthly patterns; an isolation forest for detecting point anomalies in high-dimensional metric spaces; and an LSTM autoencoder trained on rolling 30-day windows for detecting subtle distributional shifts. Each model votes independently, and the ensemble requires consensus from at least two models before flagging an anomaly. This multi-model approach reduced false positive rates from 97% to under 6% while maintaining 99.2% true positive recall on historical incident data.
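The two-of-three consensus vote can be sketched in a few lines. This is a minimal, pure-Python illustration: the three detectors below (z-score, same-phase seasonal comparison, recent-window drift) are simplified stand-ins for the production STL, isolation forest, and LSTM autoencoder models, and all function names are hypothetical.

```python
from statistics import mean, stdev

def zscore_vote(history, value, k=3.0):
    """Point-anomaly stand-in: flag values more than k sigma from the mean."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) / sigma > k

def seasonal_vote(history, value, period=24, k=3.0):
    """Seasonal stand-in: compare against the same phase in earlier periods."""
    phase = history[len(history) % period::period]  # e.g. same hour of day
    if len(phase) < 2:
        return False
    mu, sigma = mean(phase), stdev(phase)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > k

def drift_vote(history, value, window=24, k=1.0):
    """Distribution-shift stand-in: recent-window mean vs long-run mean."""
    recent = history[-window:] + [value]
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return mean(recent) != mu
    return abs(mean(recent) - mu) / sigma > k

def ensemble_is_anomaly(history, value, min_votes=2):
    """Flag only when at least min_votes of the three detectors agree."""
    votes = [zscore_vote(history, value),
             seasonal_vote(history, value),
             drift_vote(history, value)]
    return sum(votes) >= min_votes
```

The consensus requirement is what drives down false positives: a single noisy detector cannot fire an alert on its own.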
The alert correlation engine is the second core innovation. When anomalies are detected across multiple metrics, the system uses a temporal-spatial correlation algorithm to group them into incidents. It analyzes metric co-occurrence patterns, infrastructure topology (which services talk to which), and deployment timelines to determine whether multiple anomalies share a root cause. A single database connection pool exhaustion, for example, previously generated 47 independent alerts across application latency, error rates, queue depths, and downstream service health checks. Our correlation engine collapses these into a single incident with a probable root cause and affected blast radius.
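The grouping step behaves like connected-component clustering over a "related" relation: two anomalies belong to the same incident if they are close in time and their services are adjacent in the topology graph. A hedged sketch under those assumptions (the data shapes and the `correlate` name are illustrative, not the production API):

```python
from collections import deque

def correlate(anomalies, topology, window=300):
    """Group anomalies into incidents via BFS over a relatedness relation.

    anomalies: list of (timestamp_seconds, service_name)
    topology:  dict mapping service -> set of downstream/upstream neighbors
    window:    max seconds between anomalies to consider them related
    """
    def related(a, b):
        ta, sa = a
        tb, sb = b
        adjacent = sa == sb or sb in topology.get(sa, ()) or sa in topology.get(sb, ())
        return abs(ta - tb) <= window and adjacent

    incidents, seen = [], set()
    for i in range(len(anomalies)):
        if i in seen:
            continue
        component, queue = [], deque([i])
        seen.add(i)
        while queue:  # breadth-first expansion of one incident
            j = queue.popleft()
            component.append(anomalies[j])
            for k in range(len(anomalies)):
                if k not in seen and related(anomalies[j], anomalies[k]):
                    seen.add(k)
                    queue.append(k)
        incidents.append(component)
    return incidents
```

Transitivity matters here: the database anomaly relates to the API-layer anomaly, which relates to the web-tier anomaly, so all three collapse into one incident even though the database and web tier are not direct neighbors.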
We also built a predictive alerting layer that identifies emerging issues 15-30 minutes before they breach impact thresholds. The system monitors metric trajectory and rate-of-change patterns, comparing them against historical incident precursors. When it detects a metric following a trajectory that historically preceded an outage, it fires a predictive alert with confidence scoring and recommended remediation steps. This gives the SRE team a window to intervene before customers are affected.
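At its simplest, trajectory-based prediction is a least-squares slope over recent samples extrapolated to the breach threshold. The sketch below is an illustrative reduction of that idea (the production system compares against historical incident precursors; `minutes_to_breach` is a hypothetical helper):

```python
def minutes_to_breach(samples, threshold, horizon=30):
    """Predict minutes until a rising metric crosses `threshold`.

    samples: list of (minute, value) observations
    Returns the ETA in minutes if a breach is projected within `horizon`,
    otherwise None (flat/falling trend, or breach too far out).
    """
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    v_mean = sum(v for _, v in samples) / n
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        return None
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / den
    last_t, last_v = samples[-1]
    if slope <= 0 or last_v >= threshold:
        return None  # not trending toward a breach, or already breached
    eta = (threshold - last_v) / slope
    return eta if eta <= horizon else None
```

A metric climbing 2 units/minute at 70 against a threshold of 100 yields a 15-minute warning, inside the 15-30 minute window described above.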
Technical Architecture
The platform is architected as a streaming data pipeline built for real-time processing at scale. The ingestion layer collects metrics via a Kafka-based message bus that handles 50,000+ metrics per second. Prometheus exporters, CloudWatch metric streams, and custom application instrumentation all publish to partitioned Kafka topics. A Flink-based stream processor performs real-time aggregation, downsampling, and feature extraction before routing processed metrics to the analysis layer.
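The downsampling step amounts to windowed aggregation keyed by metric name and time bucket. A pure-Python stand-in for what the Flink job does conceptually (the `rollup` function and event shape are illustrative assumptions, not the production job):

```python
from collections import defaultdict

def rollup(events, window=60):
    """Downsample raw metric events into per-window aggregates.

    events: iterable of (metric_name, epoch_seconds, value)
    Returns {(metric_name, window_start): {count, min, max, mean}}.
    """
    buckets = defaultdict(list)
    for name, ts, value in events:
        # Align each event to the start of its time window.
        buckets[(name, ts // window * window)].append(value)
    return {
        key: {"count": len(vs), "min": min(vs), "max": max(vs),
              "mean": sum(vs) / len(vs)}
        for key, vs in buckets.items()
    }
```

Shipping these compact rollups to the analysis layer, instead of every raw sample, is what keeps the 50,000 metrics/second firehose tractable for the ML services downstream.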
The analysis layer runs the ML ensemble as a set of containerized microservices on Kubernetes. Each model type (STL, isolation forest, LSTM autoencoder) runs as an independent service with its own scaling characteristics — the LSTM models are GPU-accelerated for inference speed, while the statistical models run on CPU-optimized instances. Model outputs are aggregated by an ensemble coordinator service that applies voting logic and anomaly confidence scoring. The correlation engine runs as a separate stateful service that maintains a sliding window of recent anomalies and applies graph-based correlation algorithms using the infrastructure topology as a constraint graph.
The storage layer uses TimescaleDB for time-series metric storage with automated retention policies, PostgreSQL for incident records and correlation metadata, and Redis for real-time state caching (current baselines, active incidents, model feature buffers). The presentation layer is a React dashboard served via FastAPI, with WebSocket connections for real-time alert streaming. The dashboard provides a topology-aware incident view, metric drill-downs with anomaly overlays, and a correlation explanation panel that shows why specific alerts were grouped. The entire platform runs on AWS EKS with Terraform-managed infrastructure, and model retraining is orchestrated via Airflow on a weekly cadence.
Results
Mean time to resolution dropped from 47 minutes to 6 minutes
Daily alert volume reduced from ~12,000 raw alerts to ~720 correlated, actionable incidents
$340K in annual savings from reduced SRE triage time and eliminated SLA breach credits
Incidents predicted 15-30 minutes before customer impact
Tech Stack
Kafka · Apache Flink · Kubernetes (AWS EKS) · TimescaleDB · PostgreSQL · Redis · FastAPI · React · Terraform · Airflow · Prometheus · CloudWatch
"Before HALO, our on-call rotation was a nightmare. Engineers were burned out from alert fatigue, and real incidents got lost in the noise. Now our team gets actionable incidents, not thousands of meaningless alerts. The predictive alerting alone has prevented at least three major outages."
Want results like these?
We help enterprises design and deploy production-grade AI systems. Let's discuss your project.