
1- Overview

An end-to-end predictive maintenance system for aircraft turbofan engines. RUL prediction on NASA CMAPSS dataset using Random Forest, XGBoost, sklearn Pipelines & MLflow.

turbofan_cmapss.png


2- Context & Motivation

Problem Statement

Aviation maintenance traditionally operates in two modes: scheduled (replace every N hours regardless of condition) and unscheduled (repair after failure). The first wastes money by replacing healthy components, while the second risks catastrophic in-flight failure. Predictive maintenance is the third way: maintenance operations are scheduled when, and only when, a component is approaching failure.

The business case is substantial: a single unplanned engine removal can quickly cost hundreds of thousands of dollars in AOG (Aircraft on Ground) fees and emergency parts, before even accounting for safety consequences.

Technical challenges

An aircraft turbofan engine degrades continuously through thousands of operational cycles. The challenge is to:

  • Extract a clean degradation signal from noisy, high-dimensional time-series data
  • Predict the number of remaining cycles with sufficient accuracy to schedule maintenance in advance
  • Generalize to engines the model has never observed

These goals are framed in terms of the RUL (Remaining Useful Life): the number of operational cycles an engine has left before failure. Predicting it accurately is the central challenge of predictive maintenance, and the same approach applies to a broad range of scientific and industrial applications.

rul.png

Project Goal & Methodology

Build an end-to-end ML pipeline from raw sensor data to operational risk assessment, using the NASA CMAPSS benchmark dataset, validated against the official NASA test set and the PHM 2008 competition scoring function.

Raw CMAPSS data → EDA & exploration → Feature engineering → Model training & tuning → NASA benchmark & risk analysis

3- Dataset

CMAPSS (Commercial Modular Aero-Propulsion System Simulation) was released by NASA Ames Research Center for the 2008 PHM challenge and remains the standard benchmark for RUL prediction. It simulates run-to-failure degradation of a turbofan engine’s high-pressure compressor.

This project uses the FD001 subset: 100 training engines and 100 test engines, all operating under a single condition with a single fault mode.

Each row in the dataset represents one engine at one operational cycle, with 21 sensor readings and 3 operating condition settings:

| # | Symbol | Description | Unit |
|---|---|---|---|
| Engine | | | |
| Cycle | | | |
| Setting 1 | | Altitude | ft |
| Setting 2 | | Mach Number | M |
| Setting 3 | | TRA (Throttle-Resolver Angle) | deg |
| Sensor 1 | T2 | Total temperature at fan inlet | °R |
| Sensor 2 | T24 | Total temperature at LPC outlet | °R |
| Sensor 3 | T30 | Total temperature at HPC outlet | °R |
| Sensor 4 | T50 | Total temperature at LPT outlet | °R |
| Sensor 5 | P2 | Pressure at fan inlet | psia |
| Sensor 6 | P15 | Total pressure in bypass-duct | psia |
| Sensor 7 | P30 | Total pressure at HPC outlet | psia |
| Sensor 8 | Nf | Physical fan speed | rpm |
| Sensor 9 | Nc | Physical core speed | rpm |
| Sensor 10 | epr | Engine pressure ratio | |
| Sensor 11 | Ps30 | Static pressure at HPC outlet | psia |
| Sensor 12 | phi | Ratio of fuel flow to Ps30 | pps/psi |
| Sensor 13 | NRf | Corrected fan speed | rpm |
| Sensor 14 | NRc | Corrected core speed | rpm |
| Sensor 15 | BPR | Bypass ratio | |
| Sensor 16 | farB | Burner fuel-air ratio | |
| Sensor 17 | htBleed | Bleed enthalpy | |
| Sensor 18 | Nf_dmd | Demanded fan speed | rpm |
| Sensor 19 | PCNfR_dmd | Demanded corrected fan speed | rpm |
| Sensor 20 | W31 | HPT coolant bleed | lbm/s |
| Sensor 21 | W32 | LPT coolant bleed | lbm/s |

  • LPC/HPC: Low/High Pressure Compressor
  • LPT/HPT: Low/High Pressure Turbine

For engine 1, some sensors show a clear monotonic degradation trend, while others are flat or pure noise:

Sensor evolution over cycles for engine 1

Engine lifetimes in the training set range from 128 to 362 cycles, giving the following distribution:

Engine lifetime distribution


4- Feature Engineering

Raw sensor readings require significant transformation before they can be used as model inputs. Four operations are applied in sequence:

  • RUL target (clip=120): max_cycles − t
  • Low variance removal (var<1e-5): 7 sensors + 3 settings removed
  • Temporal features (window=5): rolling mean/std, diff
  • Low RUL correlation removal (|corr|<0.1): 28 features retained

RUL computation and clipping

For each engine: $$\text{RUL}(t) = \text{max\_cycles} - t$$

The RUL is then clipped at 120 cycles. Above this value the engine is considered healthy, and predicting a precise RUL has little operational relevance: maintenance is not scheduled 300 cycles in advance. This clipping is standard in the CMAPSS literature and also speeds up convergence, so that the model focuses on the region where prediction matters.
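The labelling step can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual code: the column names `unit` and `cycle` and the helper `add_rul` are assumptions made for the example.

```python
import pandas as pd

RUL_CLIP = 120  # clip value stated in the text


def add_rul(df: pd.DataFrame, clip: int = RUL_CLIP) -> pd.DataFrame:
    """Return a copy of df with a clipped 'rul' column.

    Assumes hypothetical column names 'unit' (engine id) and 'cycle'.
    """
    out = df.copy()
    # In run-to-failure training data, the last observed cycle is the failure cycle.
    max_cycles = out.groupby("unit")["cycle"].transform("max")
    out["rul"] = (max_cycles - out["cycle"]).clip(upper=clip)
    return out


# Toy data: engine 1 runs 5 cycles, engine 2 runs 3 cycles; clip lowered to 3 for readability.
toy = pd.DataFrame({"unit": [1] * 5 + [2] * 3, "cycle": [1, 2, 3, 4, 5, 1, 2, 3]})
labelled = add_rul(toy, clip=3)
print(labelled["rul"].tolist())  # [3, 3, 2, 1, 0, 2, 1, 0]
```

With the clip applied, the first cycles of a long-lived engine all share the same target, which gives the RUL curve its flat-then-linear shape.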

Low-variance feature removal

Features whose variance falls below $10^{-5}$ are constant (or near-constant) and carry no useful information.
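A minimal sketch of this filter, assuming the features sit in a pandas DataFrame. The helper name `drop_low_variance` and the toy column names are hypothetical, and pandas' default sample variance (`ddof=1`) is an assumption about how the variance is computed.

```python
import pandas as pd

VAR_THRESHOLD = 1e-5  # threshold stated in the text


def drop_low_variance(df, feature_cols, threshold=VAR_THRESHOLD):
    """Split feature_cols into (kept, dropped) according to their variance."""
    variances = df[feature_cols].var()  # sample variance (ddof=1), an assumption
    kept = [c for c in feature_cols if variances[c] >= threshold]
    dropped = [c for c in feature_cols if variances[c] < threshold]
    return kept, dropped


toy = pd.DataFrame({
    "sensor_a": [518.67, 518.67, 518.67, 518.67],  # constant reading
    "sensor_b": [641.8, 642.1, 642.4, 642.9],      # varies over cycles
})
kept, dropped = drop_low_variance(toy, ["sensor_a", "sensor_b"])
print(kept, dropped)  # ['sensor_b'] ['sensor_a']
```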

Temporal features

Mechanical degradation is a slow process. To capture the underlying trend while smoothing out noise, a 5-cycle rolling mean and standard deviation are computed, together with a first-order difference that captures rapid changes. For each retained sensor, this yields:

  • rolling_mean_5 for local trend
  • rolling_std_5 for local variability
  • diff_1 (first-order difference) for the rate of change
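The three features listed above could be computed per engine as follows. This is a sketch under assumptions: the column names `unit` and `cycle`, the helper `add_temporal_features`, and the handling of the first cycles (`min_periods=1`, missing values filled with 0) are illustrative choices, not necessarily those of the project.

```python
import pandas as pd

WINDOW = 5  # rolling window length stated in the text


def add_temporal_features(df, sensor_cols, window=WINDOW):
    """Add rolling mean, rolling std and first-order difference for each sensor."""
    out = df.sort_values(["unit", "cycle"]).copy()
    for col in sensor_cols:
        # Group by engine so that windows never span two different engines.
        per_engine = out.groupby("unit")[col]
        out[f"{col}_rolling_mean_{window}"] = per_engine.transform(
            lambda s: s.rolling(window, min_periods=1).mean()
        )
        out[f"{col}_rolling_std_{window}"] = per_engine.transform(
            lambda s: s.rolling(window, min_periods=1).std()
        ).fillna(0.0)  # std is undefined for a single observation (assumed fill: 0)
        out[f"{col}_diff_1"] = per_engine.diff().fillna(0.0)
    return out


toy = pd.DataFrame({
    "unit": [1] * 6 + [2] * 2,
    "cycle": [1, 2, 3, 4, 5, 6, 1, 2],
    "s": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 10.0],
})
feats = add_temporal_features(toy, ["s"])
print(feats[["unit", "cycle", "s_rolling_mean_5", "s_diff_1"]])
```

Grouping by engine matters here: without it, the first rows of engine 2 would be averaged with the last rows of engine 1.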

Correlation filtering

Features whose absolute correlation with the target satisfies $|\text{corr}(x, \text{RUL})| \leq 0.1$ are removed. The final feature set contains 28 features.
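A possible implementation of this filter is sketched below. The helper `filter_by_rul_correlation` and the toy columns are hypothetical, and the use of the Pearson coefficient (the default of `DataFrame.corrwith`) is an assumption, since the text does not name the correlation measure.

```python
import pandas as pd


def filter_by_rul_correlation(df, feature_cols, target="rul", min_abs_corr=0.1):
    """Keep features whose absolute correlation with the target exceeds min_abs_corr."""
    corr = df[feature_cols].corrwith(df[target]).abs()  # Pearson by default (assumed)
    # Constant features give NaN, which fails the comparison and is dropped too.
    return corr[corr > min_abs_corr].index.tolist()


toy = pd.DataFrame({
    "rul": [5, 4, 3, 2, 1, 0],
    "trend": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],    # perfectly anti-correlated with RUL
    "noise": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],  # zero correlation with RUL
})
kept = filter_by_rul_correlation(toy, ["trend", "noise"])
print(kept)  # ['trend']
```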

Feature correlation with RUL


5- Model Selection & Training

The split is performed by engine unit (not randomly across rows) to prevent data leakage: the same engine can’t appear in both training and test sets. A standard progression from simple to more complex models is then followed.
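An engine-level split can be sketched with NumPy alone. The helper `split_by_engine`, the 20% hold-out fraction and the seed are assumptions made for illustration; the text does not state the actual proportions.

```python
import numpy as np


def split_by_engine(units, test_fraction=0.2, seed=42):
    """Return boolean (train_mask, test_mask) so that no engine is in both sets."""
    units = np.asarray(units)
    rng = np.random.default_rng(seed)
    engines = rng.permutation(np.unique(units))
    n_test = max(1, int(round(test_fraction * len(engines))))
    test_mask = np.isin(units, engines[:n_test])  # all rows of the held-out engines
    return ~test_mask, test_mask


# Toy data: 10 engines with 4 cycles each.
units = np.repeat(np.arange(1, 11), 4)
train_mask, test_mask = split_by_engine(units)
train_engines = set(units[train_mask])
test_engines = set(units[test_mask])
print(len(train_engines), len(test_engines))  # 8 2
```

A row-level random split would instead scatter the cycles of one engine across both sets, and neighbouring cycles are nearly identical, which inflates the test metrics.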

Baseline

The baseline model systematically predicts the mean RUL of the training set for every observation. It is the minimum benchmark to outperform.

Linear Regression

The first supervised model. It is preceded by a StandardScaler so that all features share a common scale, and it tests whether the degradation signal can be at least partly captured by a linear relationship.
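A scaler-plus-regression pipeline of this kind can be sketched with scikit-learn. The synthetic data below is purely illustrative and stands in for the engineered CMAPSS features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: three features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = 120 - 3.0 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The scaler is fitted on the training rows only, then reused at predict time.
linear_model = make_pipeline(StandardScaler(), LinearRegression())
linear_model.fit(X[:150], y[:150])
r2 = linear_model.score(X[150:], y[150:])  # R² on the held-out rows
print(round(r2, 3))
```

Keeping the scaler inside the pipeline means its statistics are never computed on validation or test rows.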

Random Forest

As a reminder, a Random Forest is an ensemble of decision trees trained with bagging: each tree sees a random bootstrap sample of the data and, at each split, considers only a random subset of features. The final prediction is the average across all trees. This double randomness reduces variance (and therefore overfitting) while keeping each tree individually interpretable.

Random Forest

Before hyperparameter tuning, a vanilla Random Forest is trained with default parameters, and the top 15 features by importance are selected. Feature importance is measured by the mean decrease in impurity across all splits (for a regression forest, the reduction in squared error, the counterpart of Gini importance in classification), which provides interpretable feature rankings. Reducing dimensionality at this stage limits overfitting and speeds up the grid search.
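The selection step could look like the sketch below, which ranks features with scikit-learn's impurity-based `feature_importances_`. The helper `top_k_features`, the feature names and the synthetic data are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def top_k_features(X, y, names, k=15, seed=42):
    """Fit a default Random Forest and return the k most important feature names."""
    forest = RandomForestRegressor(random_state=seed)  # default hyperparameters
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1][:k]
    return [names[i] for i in order]


# Synthetic data: the target depends strongly on f0, weakly on f1, not on the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 10.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
names = [f"f{i}" for i in range(5)]

selected = top_k_features(X, y, names, k=2)
print(selected)
```

To avoid leaking information, the ranking should be computed on training engines only.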

XGBoost

Gradient-boosted trees: rather than building trees independently of one another (like RF), XGBoost builds them sequentially, each tree correcting the residual errors of the previous ones. This makes it more data-efficient, but also more prone to overfitting without regularization.

XGBoost

Before hyperparameter tuning, a vanilla XGBoost is trained with default parameters, and the top 15 features by importance are selected. XGBoost uses gain-based feature importance (improvement in loss from each split), which ranks features differently from RF's impurity-based importance.

Hyperparameter tuning (Random Forest and XGBoost)

A GridSearchCV() is run over the key regularization parameters:

| Model | Parameters tuned |
|---|---|
| Random Forest | n_estimators ∈ {100, 200, 300} |
| | max_depth ∈ {None, 10, 20} |
| | min_samples_leaf ∈ {1, 3, 5} |
| XGBoost | n_estimators ∈ {100, 200, 300} |
| | max_depth ∈ {3, 5, 7} |
| | learning_rate ∈ {0.05, 0.1, 0.2} |

Preventing data leakage with GroupKFold for cross-validation

A random K-fold split would place different cycles of the same engine in both the training and validation folds. GroupKFold partitions by engine unit instead, ensuring that each validation fold contains only engines unseen during training and therefore avoiding this leakage.

GroupKFold cross-validation (k=3, split by engine unit):

  • Fold 1: Val (engines 1–33), Train (engines 34–100)
  • Fold 2: Train (engines 1–33), Val (engines 34–66), Train (engines 67–100)
  • Fold 3: Train (engines 1–66), Val (engines 67–100)
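Combining the grid search with engine-level folds might look like the sketch below. The synthetic data, the RMSE-based scoring string and the reduced `demo_grid` (used only to keep the example fast) are assumptions; `full_grid` reproduces the Random Forest grid listed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in: 12 engines with 20 cycles each.
rng = np.random.default_rng(0)
n_engines, n_cycles = 12, 20
groups = np.repeat(np.arange(n_engines), n_cycles)  # engine id of every row
X = rng.normal(size=(n_engines * n_cycles, 4))
y = 50.0 - 10.0 * X[:, 0] + rng.normal(size=n_engines * n_cycles)

cv = GroupKFold(n_splits=3)

# Sanity check: count engines that appear on both sides of any fold (should be 0).
shared = sum(
    len(set(groups[train_idx]) & set(groups[val_idx]))
    for train_idx, val_idx in cv.split(X, y, groups)
)

full_grid = {  # Random Forest grid from the text
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 3, 5],
}
demo_grid = {"n_estimators": [100], "max_depth": [None, 10], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=demo_grid,  # swap in full_grid for the complete search
    cv=cv,
    scoring="neg_root_mean_squared_error",  # assumed selection metric
)
search.fit(X, y, groups=groups)  # groups are forwarded to GroupKFold
print(shared, search.best_params_)
```

Note that scikit-learn's GroupKFold balances fold sizes rather than assigning contiguous engine ranges, so the exact engine-to-fold mapping will differ from the illustration above.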

6- Results

Each model is wrapped in a sklearn Pipeline (scaler → model). MLflow experiment tracking is implemented and enables direct run comparison in the MLflow UI.

Metrics (RMSE, MAE, R², NASA Score)

Standard regression metrics for model selection:

$$ \mathbf{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

$$ \mathbf{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$

$$ \mathbf{R^2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} {\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
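As a worked example, the three metrics can be computed directly from their definitions with NumPy. The helper `regression_metrics` and the toy values are illustrative only.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1.0 - ss_res / ss_tot),
    }


y_true = [100, 80, 60, 40]
y_pred = [90, 85, 60, 50]
metrics = regression_metrics(y_true, y_pred)
print(metrics)  # {'rmse': 7.5, 'mae': 6.25, 'r2': 0.8875}

# Predicting the mean of y_true everywhere gives R² = 0 by construction.
mean_only = regression_metrics(y_true, [70, 70, 70, 70])
```

The last two lines show why a mean-predicting baseline sits at an R² of about zero: its residual sum of squares equals the total sum of squares.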

In addition, the NASA asymmetric scoring function for operational validation is introduced.

For each engine, the individual score is:

$$s_i = \begin{cases} e^{-d_i/13} - 1 & \text{if } d_i < 0 \text{ (early prediction)} \\ e^{d_i/10} - 1 & \text{if } d_i \geq 0 \text{ (late prediction)} \end{cases}$$

where $d_i = \hat{y}_i - y_i$ is the prediction error.

The total score is $S = \sum_i s_i$, and lower is better. Late predictions ($d_i \geq 0$) are penalized more heavily than early ones because an undetected failure is far more costly than a preventive maintenance action.
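The scoring function translates directly into code. The sketch below uses only the standard library; the function name `nasa_score` is an illustrative choice.

```python
import math


def nasa_score(y_true, y_pred):
    """Sum of the asymmetric per-engine penalties, with d = predicted - true RUL."""
    total = 0.0
    for true_rul, pred_rul in zip(y_true, y_pred):
        d = pred_rul - true_rul
        if d < 0:
            total += math.exp(-d / 13) - 1  # early prediction: milder penalty
        else:
            total += math.exp(d / 10) - 1   # late prediction: harsher penalty
    return total


early = nasa_score([50], [40])  # 10 cycles early
late = nasa_score([50], [60])   # 10 cycles late
print(round(early, 3), round(late, 3))  # 1.158 1.718
```

For the same absolute error of 10 cycles, the late prediction costs roughly 50% more than the early one, and the gap widens exponentially as the error grows.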

NASA scoring function asymmetry

Model comparison

Model comparison — RMSE, MAE, R²

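| Model | RMSE (test) | MAE (test) | R² (test) |
|---|---|---|---|
| Baseline | 39.92 | 35.26 | 0.000 |
| Linear Regression | 19.00 | 15.39 | 0.773 |
| Random Forest | 15.90 | 10.92 | 0.841 |
| RF Tuned | 15.51 | 10.71 | 0.849 |
| XGBoost | 16.98 | 11.48 | 0.819 |
| XGBoost Tuned | 15.53 | 11.18 | 0.849 |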

Tuning narrows the train/test gap substantially for both RF and XGBoost, confirming that the initial vanilla models were overfitting. RF Tuned and XGBoost Tuned reach essentially identical performance (RMSE of 15.51 vs 15.53, R² of 0.849 for both). The tuned Random Forest is selected as the final model for its robustness and interpretability.

NASA benchmark

The final model is retrained on the full training set and evaluated on the official NASA test file, a completely held-out set never seen during development.

| Metric | Value |
|---|---|
| RMSE | 17.34 cycles |
| MAE | 12.10 cycles |
| R² | 0.806 |
| NASA Score | 902.7 |

NASA benchmark — true vs predicted RUL

True vs predicted RUL

A NASA score of 902.7 is an encouraging result that would rank the model among the top 6 of the PHM 2008 competition results. However, this comparison should be interpreted with caution, because only the FD001 dataset, the easiest subset, was used here.

Operational risk analysis

In a real maintenance context, maintenance is triggered when the predicted RUL falls below an operational threshold determined by a human operator. For aerospace applications, this threshold should be set conservatively, given the catastrophic cost of an in-flight failure.

To translate our model performance into business value, we categorize predictions into three operational zones:

| Zone | Condition | Operational consequence |
|---|---|---|
| Safe | $\lvert d \rvert \leq 13$ cycles | Maintenance planned correctly |
| Early warning | $d < -13$ cycles | Unnecessary early intervention |
| Danger | $d > 13$ cycles | Engine may fail before maintenance, critical safety risk |

The threshold of 13 cycles is derived from the NASA scoring function scale parameter for early predictions.
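The zoning rule can be expressed as a small function. The name `risk_zone` and the example values are hypothetical; the 13-cycle tolerance and the sign convention $d = \hat{y} - y$ come from the text.

```python
TOLERANCE = 13  # cycles, from the NASA score scale parameter for early predictions


def risk_zone(true_rul, predicted_rul, tolerance=TOLERANCE):
    """Classify one prediction into an operational zone, with d = predicted - true."""
    d = predicted_rul - true_rul
    if d > tolerance:
        return "Danger"         # RUL overestimated: engine may fail before maintenance
    if d < -tolerance:
        return "Early warning"  # RUL underestimated: unnecessary early intervention
    return "Safe"


zones = [risk_zone(50, 70), risk_zone(50, 30), risk_zone(50, 63), risk_zone(50, 37)]
print(zones)  # ['Danger', 'Early warning', 'Safe', 'Safe']
```

Errors of exactly ±13 cycles fall in the Safe zone, matching the non-strict inequality in the table.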

Operational risk per engine


7- Conclusion & Next Steps

Physical Interpretation

T50 (LPT outlet temperature) emerges as the dominant predictor of RUL. This result is consistent with the single fault mode of the FD001 dataset. Because T50 sits at the very end of the gas path, it acts as a natural integrator of all upstream degradation mechanisms, making it the most informative single signal for RUL estimation.

The use of a 5-cycle rolling mean across all selected features is physically motivated: mechanical degradation is a slow, cumulative process whose signature is more reliably captured over several consecutive cycles than in any instantaneous measurement.

Limitations and next steps

  • FD001 is the simplest CMAPSS subset: single operating condition, single fault mode. Extending to FD002/FD003/FD004 (multiple conditions, multiple fault modes) would better reflect real-world complexity.

  • The RUL clipping at 120 cycles is a modeling assumption. In production, a two-stage model could first classify whether the engine is in its degradation phase before predicting RUL.

  • LSTM / Transformer architectures are known to outperform classical ML models on this benchmark by explicitly modeling temporal dependencies. This is what I will focus on next.


Python · pandas · numpy · scikit-learn · sklearn Pipeline · XGBoost · MLflow · matplotlib · seaborn · joblib