The Challenge

A Fortune 500 industrial manufacturer operating three high-throughput production lines was incurring an average of 14 days of unplanned downtime per quarter. Each hour of unexpected shutdown cost the company approximately $87,000 in lost production, expedited repairs, and downstream supply chain disruptions. Over the prior fiscal year, unplanned equipment failures had cost the organization north of $10M.

The maintenance operation was entirely reactive. When a critical motor, compressor, or hydraulic press failed, the team scrambled to diagnose the issue, source replacement parts, and get the line running again. The average time to recovery was 9.2 hours. Worse, the unpredictability made production scheduling unreliable, causing cascading delays to customer deliveries and straining relationships with key accounts.

The factory floor had over 200 sensors across the three production lines — vibration monitors, temperature probes, pressure gauges, current sensors, and acoustic detectors. These sensors generated roughly 2TB of time-series data per month. But the data flowed into historian databases where it sat unused. No one had the tooling or expertise to turn raw sensor telemetry into actionable maintenance intelligence.

Our Approach

Weeks 1-3: Data Pipeline & Sensor Audit

We started with a physical walkthrough of all three production lines alongside the maintenance engineering team. Understanding the equipment — what fails, how it fails, and what the early warning signs look like — was essential before touching any data. We cataloged 47 critical assets and mapped each to its associated sensor feeds.

We then built a real-time data ingestion pipeline using Apache Kafka to stream sensor data from the factory's existing OPC-UA gateways. This replaced the batch-dump-to-historian approach with a continuous flow of time-series data into InfluxDB, a purpose-built time-series database. We also backfilled 18 months of historical data from the historian databases, aligning it with maintenance work order records to create labeled training data — a painstaking process that required close collaboration with the maintenance team to verify failure timestamps and root causes.
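To make the pipeline concrete, here is a minimal sketch of the transform step that sits between the Kafka consumer and the InfluxDB writer: converting one JSON sensor message into InfluxDB line protocol. The message schema (`line`, `asset`, `sensor_type`, `values`, `ts` fields) is a hypothetical illustration, not the actual gateway payload format.

```python
import json

def to_line_protocol(msg: bytes) -> str:
    """Convert one JSON sensor message (as consumed from a Kafka topic)
    into InfluxDB line protocol: measurement,tags fields timestamp."""
    reading = json.loads(msg)
    tags = f"line={reading['line']},asset={reading['asset']}"
    fields = ",".join(f"{k}={v}" for k, v in reading["values"].items())
    ts_ns = int(reading["ts"] * 1e9)  # epoch seconds -> nanoseconds
    return f"{reading['sensor_type']},{tags} {fields} {ts_ns}"

# Example message for a hypothetical vibration sensor on a hydraulic press
msg = json.dumps({
    "line": "A", "asset": "press_07", "sensor_type": "vibration",
    "values": {"rms_mm_s": 4.2, "peak_mm_s": 11.8},
    "ts": 1700000000.0,
}).encode()
print(to_line_protocol(msg))
# vibration,line=A,asset=press_07 rms_mm_s=4.2,peak_mm_s=11.8 1700000000000000000
```

Keeping this transform pure (bytes in, string out) makes it easy to unit-test independently of the Kafka and InfluxDB infrastructure.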

Weeks 4-6: Feature Engineering & Exploratory Analysis

Raw sensor readings alone are poor predictors of failure. The signal is in the patterns — subtle shifts in vibration frequency, slow temperature drift, pressure oscillation changes. To capture these patterns, we engineered more than 120 features from the raw sensor streams.
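A minimal sketch of this kind of feature engineering, computing a small illustrative subset (time-domain statistics, drift slope, dominant frequency) over one fixed-length window of a sensor stream. The feature names and window shape are assumptions for illustration, not the production feature set.

```python
import numpy as np

def window_features(x: np.ndarray, fs: float) -> dict:
    """Summarize one window of a raw sensor stream into a few
    failure-prediction features (illustrative subset)."""
    # Time-domain statistics: overall energy and impulsiveness
    rms = float(np.sqrt(np.mean(x ** 2)))
    crest = float(np.max(np.abs(x)) / rms)
    # Slow drift: slope of a least-squares line fit, in units per second
    t = np.arange(len(x)) / fs
    slope = float(np.polyfit(t, x, 1)[0])
    # Dominant frequency from the magnitude spectrum
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    dom_freq = float(np.fft.rfftfreq(len(x), d=1 / fs)[np.argmax(spectrum)])
    return {"rms": rms, "crest_factor": crest,
            "drift_slope": slope, "dominant_freq_hz": dom_freq}

# Sanity check: a 50 Hz sine sampled at 1 kHz should show a 50 Hz peak
fs = 1000.0
t = np.arange(0, 1, 1 / fs)
feats = window_features(np.sin(2 * np.pi * 50 * t), fs)
```

In practice, features like these would be computed on rolling windows per sensor channel and joined on asset and timestamp to form the training table.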

Exploratory analysis revealed that 78% of historical failures showed detectable anomalies in at least one feature channel 48-96 hours before the failure event. This confirmed that a 72-hour prediction window was achievable.
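The lead-time analysis behind that 78% figure can be sketched as follows: for each historical failure, measure the gap between the earliest anomaly flag and the failure event, then check whether it falls in the 48-96 hour band. The function and data here are a hypothetical illustration of that bookkeeping.

```python
from datetime import datetime, timedelta

def detection_lead_hours(anomaly_times, failure_time):
    """Hours between the earliest anomaly flag and the failure event,
    or None if no anomaly preceded the failure."""
    prior = [t for t in anomaly_times if t < failure_time]
    if not prior:
        return None
    return (failure_time - min(prior)).total_seconds() / 3600.0

# Hypothetical example: anomalies flagged 60h and 12h before a failure
failure = datetime(2024, 3, 10, 6, 0)
anomalies = [failure - timedelta(hours=60), failure - timedelta(hours=12)]
lead = detection_lead_hours(anomalies, failure)
assert 48 <= lead <= 96  # falls inside the observed 48-96 hour band
```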

Weeks 7-9: Model Development & Validation

We built an ensemble model combining gradient-boosted trees (LightGBM) for tabular feature classification with a 1D convolutional neural network (CNN) for raw waveform pattern detection on vibration data. The ensemble approach let us leverage the strengths of both architectures — the interpretability and robustness of tree-based models with the pattern recognition capabilities of deep learning.
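The combination step of such an ensemble can be as simple as a weighted blend of the two models' failure probabilities. This is a minimal sketch of that blending, with a hypothetical weight; in practice the weight would be tuned on the validation split, and the inputs would come from the trained LightGBM and CNN models.

```python
import numpy as np

def ensemble_risk(p_trees: np.ndarray, p_cnn: np.ndarray,
                  w: float = 0.6) -> np.ndarray:
    """Blend the tabular model's failure probabilities with the
    waveform CNN's. w (weight on the tree model) is illustrative."""
    return w * p_trees + (1 - w) * p_cnn

# Hypothetical per-asset probabilities from the two models
p_trees = np.array([0.10, 0.80, 0.55])
p_cnn = np.array([0.20, 0.90, 0.35])
risk = ensemble_risk(p_trees, p_cnn)
```

A simple weighted blend has the advantage that each model's contribution to an alert remains easy to explain to the maintenance team.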

We trained on 14 months of data and validated on the held-out 4 months. The model achieved 92% precision and 89% recall on the validation set: it caught roughly 9 out of 10 actual failures, and 92% of its alerts corresponded to genuine developing faults. We implemented a tiered alert system: yellow alerts for elevated risk (24-72 hours out), red alerts for imminent failure (under 24 hours).
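The tiered alert logic described above can be sketched as a simple mapping from risk score and predicted lead time to an alert level. The thresholds here are hypothetical placeholders, not the deployed values.

```python
def alert_tier(risk: float, hours_to_failure: float,
               red_threshold: float = 0.8,
               yellow_threshold: float = 0.5) -> str:
    """Map a model risk score and predicted lead time to the tiered
    alert levels (thresholds are illustrative)."""
    if risk >= red_threshold and hours_to_failure < 24:
        return "red"      # imminent failure: act this shift
    if risk >= yellow_threshold and hours_to_failure <= 72:
        return "yellow"   # elevated risk: schedule an inspection
    return "green"        # normal operation

assert alert_tier(0.9, 10) == "red"
assert alert_tier(0.6, 48) == "yellow"
```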

Weeks 10-12: Dashboard, Deployment & Team Training

We built a Grafana-based operational dashboard that gave maintenance supervisors a real-time view of equipment health across all three lines. Each asset displayed a health score from 0-100, trend indicators, and drill-down views showing which sensor channels were driving risk elevation. When the model issued an alert, the dashboard displayed the predicted failure mode, recommended inspection steps, and suggested parts to have on hand.
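One simple way to derive such a 0-100 health score is a linear mapping from the blended failure-risk probability; the sketch below assumes that mapping, which is an illustration rather than the deployed formula.

```python
def health_score(risk: float) -> int:
    """Convert a failure-risk probability (0-1) into a 0-100 health
    score for dashboard display (higher = healthier). The linear
    mapping is an assumption for illustration."""
    clamped = min(max(risk, 0.0), 1.0)  # guard against out-of-range inputs
    return round(100 * (1 - clamped))

assert health_score(0.0) == 100
assert health_score(0.25) == 75
```

Writing scores like this back into InfluxDB lets Grafana render the per-asset health panels and trend indicators directly from standard time-series queries.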

Deployment was staged by production line — Line A in week 10, Line B in week 11, Line C in week 12 — with the maintenance team shadowing the system and providing feedback that we incorporated in real time. We trained 12 maintenance engineers and 4 supervisors on the dashboard and alert response procedures.

Tech Stack

Python, scikit-learn, TensorFlow, Apache Kafka, InfluxDB, Grafana, AWS IoT, Docker

Results

Over the first two quarters following full deployment:

"We've been talking about predictive maintenance for five years. Arkyon actually delivered it — not a PowerPoint, not a pilot that never scales, but a system our maintenance crews use every single shift. The ROI paid for the project in the first six weeks."

J.T. — VP of Operations, Fortune 500 Manufacturer

What Made This Work