The Challenge
A Fortune 500 industrial manufacturer operating three high-throughput production lines was losing an average of 14 days of unplanned downtime per quarter. Each hour of unexpected shutdown cost the company approximately $87,000 in lost production, expedited repairs, and downstream supply chain disruptions. Over the prior fiscal year, unplanned equipment failures had cost the organization north of $10M.
The maintenance operation was entirely reactive. When a critical motor, compressor, or hydraulic press failed, the team scrambled to diagnose the issue, source replacement parts, and get the line running again. The average time to recovery was 9.2 hours. Worse, the unpredictability made production scheduling unreliable, causing cascading delays to customer deliveries and straining relationships with key accounts.
The factory floor had over 200 sensors across the three production lines — vibration monitors, temperature probes, pressure gauges, current sensors, and acoustic detectors. These sensors generated roughly 2TB of time-series data per month. But the data flowed into historian databases where it sat unused. No one had the tooling or expertise to turn raw sensor telemetry into actionable maintenance intelligence.
Our Approach
Weeks 1-3: Data Pipeline & Sensor Audit
We started with a physical walkthrough of all three production lines alongside the maintenance engineering team. Understanding the equipment — what fails, how it fails, and what the early warning signs look like — was essential before touching any data. We cataloged 47 critical assets and mapped each to its associated sensor feeds.
We then built a real-time data ingestion pipeline using Apache Kafka to stream sensor data from the factory's existing OPC-UA gateways. This replaced the batch-dump-to-historian approach with a continuous flow of time-series data into InfluxDB, a purpose-built time-series database. We also backfilled 18 months of historical data from the historian databases, aligning it with maintenance work order records to create labeled training data — a painstaking process that required close collaboration with the maintenance team to verify failure timestamps and root causes.
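The label-alignment step can be sketched roughly as follows: each historical feature window is marked positive if a verified failure occurred within the prediction horizon after the window's start. This is a minimal illustration, not the production pipeline; the function name, inputs, and 72-hour horizon are assumptions for the sketch.

```python
from datetime import datetime, timedelta

def label_windows(window_starts, failure_times, horizon_hours=72):
    """Label each feature window 1 if a verified failure falls within
    `horizon_hours` after the window start, else 0 (illustrative sketch)."""
    horizon = timedelta(hours=horizon_hours)
    labels = []
    for start in window_starts:
        hit = any(start <= ft <= start + horizon for ft in failure_times)
        labels.append(1 if hit else 0)
    return labels
```

In practice this is where the painstaking verification work happens: a mislabeled failure timestamp shifts the positive window and teaches the model the wrong precursor pattern.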
Weeks 4-6: Feature Engineering & Exploratory Analysis
Raw sensor readings alone are poor predictors of failure. The signal is in the patterns — subtle shifts in vibration frequency, slow temperature drift, pressure oscillation changes. We engineered 120+ features from the raw sensor streams:
- Statistical features: rolling means, standard deviations, skewness, and kurtosis over 1-hour, 6-hour, and 24-hour windows
- Frequency-domain features: FFT-derived dominant frequencies and spectral energy for vibration sensors, critical for detecting bearing degradation
- Cross-sensor correlations: temperature-vibration coupling coefficients that shift when equipment enters early failure modes
- Operational context: production load, cycle counts since last maintenance, ambient conditions, and shift patterns
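The first two feature families above can be sketched in a few lines of NumPy. This is a simplified illustration under assumed inputs (a 1-D array of sensor readings, a known sample rate), not the production feature code; function names are hypothetical.

```python
import numpy as np

def rolling_stats(x, window):
    """Rolling mean and standard deviation over a fixed window —
    a minimal sketch of the statistical features."""
    means, stds = [], []
    for i in range(window, len(x) + 1):
        seg = x[i - window:i]
        means.append(seg.mean())
        stds.append(seg.std())
    return np.array(means), np.array(stds)

def dominant_frequency(signal, sample_rate_hz):
    """FFT-derived dominant frequency of a vibration waveform,
    skipping the DC component — the kind of frequency-domain
    feature used for bearing-degradation detection."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    return freqs[1:][np.argmax(spectrum[1:])]  # bin 0 is DC
```

For a bearing developing a defect, the dominant frequency and its harmonics drift and gain energy well before vibration amplitude alone looks abnormal, which is why these transforms matter more than raw readings.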
Exploratory analysis revealed that 78% of historical failures showed detectable anomalies in at least one feature channel 48-96 hours before the failure event. This confirmed that a 72-hour prediction window was achievable.
Weeks 7-9: Model Development & Validation
We built an ensemble model combining gradient-boosted trees (LightGBM) for tabular feature classification with a 1D convolutional neural network (CNN) for raw waveform pattern detection on vibration data. The ensemble approach let us pair the strengths of both architectures — the interpretability and robustness of tree-based models with the pattern-recognition capabilities of deep learning.
We trained on 14 months of data and validated on the held-out 4 months. The model achieved 92% precision and 89% recall on the validation set, meaning it correctly predicted 9 out of 10 actual failures while keeping false alarms to a manageable rate. We implemented a tiered alert system: yellow alerts for elevated risk (24-72 hours out), red alerts for imminent failure (under 24 hours).
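The tiered alerting described above can be sketched as a simple mapping from the ensemble's risk score and predicted time-to-failure to an alert level. The thresholds here are illustrative placeholders, not the calibrated production values, and the function name is hypothetical.

```python
def alert_tier(risk_score, hours_to_failure,
               yellow_threshold=0.5, red_threshold=0.8):
    """Map an ensemble risk score (0-1) and predicted time-to-failure
    to the tiered alert levels: red for imminent failure (< 24 h),
    yellow for elevated risk (24-72 h out), none otherwise.
    Thresholds are illustrative, not the production values."""
    if risk_score >= red_threshold and hours_to_failure < 24:
        return "red"      # imminent: inspect this shift
    if risk_score >= yellow_threshold and hours_to_failure <= 72:
        return "yellow"   # elevated: schedule an inspection window
    return "none"
```

Matching the tiers to thresholds the maintenance team could reason about mattered as much as the model itself: a yellow alert maps to "schedule it", a red alert to "act now".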
Weeks 10-12: Dashboard, Deployment & Team Training
We built a Grafana-based operational dashboard that gave maintenance supervisors a real-time view of equipment health across all three lines. Each asset displayed a health score from 0-100, trend indicators, and drill-down views showing which sensor channels were driving risk elevation. When the model issued an alert, the dashboard displayed the predicted failure mode, recommended inspection steps, and suggested parts to have on hand.
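The 0-100 health score shown on the dashboard can be derived from the model's failure probability with a mapping along these lines. This is a minimal sketch under an assumed linear mapping; the production scoring may weight recent trend and sensor confidence differently.

```python
def health_score(failure_probability):
    """Convert a model failure probability (0-1) into the 0-100
    health score displayed per asset. A minimal linear sketch;
    the production mapping may differ."""
    p = min(max(failure_probability, 0.0), 1.0)  # clamp to [0, 1]
    return round((1.0 - p) * 100)
```

A single bounded number per asset is what makes the at-a-glance view work for supervisors; the drill-down views then expose which sensor channels are driving the score down.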
Deployment was staged by production line — Line A in week 10, Line B in week 11, Line C in week 12 — with the maintenance team shadowing the system and providing feedback that we incorporated in real time. We trained 12 maintenance engineers and 4 supervisors on the dashboard and alert response procedures.
Tech Stack
- Apache Kafka — real-time streaming of sensor data from the factory's OPC-UA gateways
- InfluxDB — time-series storage for live and backfilled sensor data
- LightGBM + 1D CNN — the prediction ensemble (tabular features and raw vibration waveforms)
- Grafana — the operational dashboard for maintenance supervisors
Results
Over the first two quarters following full deployment:
- 45% reduction in unplanned downtime — from 14 days per quarter to 7.7 days, with continued improvement as the model learned from new data
- $4.7M in annual savings — combining avoided downtime costs, reduced emergency repair premiums, and optimized spare parts inventory
- 92% precision on failure predictions — the model correctly predicted failures an average of 68 hours in advance, giving the team ample time to schedule maintenance during planned windows
- Maintenance shifted from reactive to proactive — the ratio of planned to unplanned maintenance work orders went from 35:65 to 72:28
- Mean time to recovery dropped 61% — when failures did occur, the team already had diagnostic context and parts staged, cutting MTTR from 9.2 hours to 3.6 hours
"We've been talking about predictive maintenance for five years. Arkyon actually delivered it — not a PowerPoint, not a pilot that never scales, but a system our maintenance crews use every single shift. The ROI paid for the project in the first six weeks."
J.T. — VP of Operations, Fortune 500 Manufacturer
What Made This Work
- Domain immersion — spending the first week on the factory floor with maintenance engineers gave us physical intuition about failure modes that no amount of data exploration alone could provide
- Feature engineering depth — the 120+ engineered features, especially frequency-domain transforms, were the difference between a demo-quality model and a production-grade one
- Tiered alerting — yellow and red alert tiers matched the maintenance team's existing workflow, making adoption natural rather than disruptive
- Staged rollout with feedback loops — deploying one line at a time let us incorporate real-world feedback and build trust with the maintenance team before scaling