Week 7 | Session 5: Model Evaluation, Overfitting, Random Forest & Error Metrics

Course: Supply Chain Digitization — Module 3: Analytics in SCM



Step 1 — Measuring Model Performance on Test Data

Test data: 300 observations held out from the 1,000-retailer dataset and never used during training. For each of the 300 retailers, yᵢ is the actual order quantity and ŷᵢ is the quantity predicted by the regression tree.

  • MSE formula (test data): MSE = (1/n) × Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)², with n = 300
  • Result at depth = 2 (test): MSE (test) = 56,62,511 (Indian digit grouping, i.e. 5,662,511)
  • Result at depth = 2 (training): MSE (train) = 59,96,464 (i.e. 5,996,464)
from sklearn.metrics import mean_squared_error
# Predict on test data
y_pred_test = reg_tree.predict(X_test)
# Compute MSE on test data
mse_test = mean_squared_error(Y_test, y_pred_test)
print("Test MSE:", mse_test)
# Output: Test MSE: 5662511.xx
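
To make the formula concrete, here is a minimal sketch that computes MSE by hand and checks it against `mean_squared_error`. The five order quantities are made-up illustration values, not rows from the course dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted order quantities for five test retailers
y_actual = np.array([1200.0, 950.0, 1800.0, 400.0, 1500.0])
y_predicted = np.array([1000.0, 1000.0, 1500.0, 500.0, 1500.0])

# MSE = (1/n) * sum of squared differences between actual and predicted
squared_errors = (y_actual - y_predicted) ** 2
mse_manual = squared_errors.sum() / len(y_actual)

# The library function should give the same number
mse_sklearn = mean_squared_error(y_actual, y_predicted)

print("Manual MSE: ", mse_manual)   # 28500.0
print("sklearn MSE:", mse_sklearn)
```

The single error of 300 units contributes 90,000 of the 142,500 total squared error, which shows how strongly MSE weights large misses.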

Overfitting — Training vs. Test MSE Simulation

Hypothesis: If we split nodes more (deeper tree) → more refined segments → lower MSE.

  • For TRAINING data: Yes. Training MSE keeps falling as depth increases, because every extra split can only fit the training retailers more closely.
  • For TEST data: MSE initially decreases, then increases. Beyond a certain depth the model starts memorising the training retailers instead of learning general patterns, so it fails on new retailers. This is overfitting.

| Tree Depth | Training MSE | Test MSE | Interpretation |
| --- | --- | --- | --- |
| 2 (baseline) | 59,96,464 | 56,62,511 | Starting point. Test MSE slightly lower than training. |
| 3, 4 | ↓ Reduced | ↓ Reduced | Both improving; more splits are helping. |
| 5 ← Optimal | ↓ Reduced | Minimum ★ | Best test performance. Optimal tree depth = 5. |
| 6 | ↓ Still reducing | ↑ Starts rising | Overfitting begins. Test performance degrades. |
| 7–12 | ↓ → 0 (memorising) | ↑ Keeps rising | Severe overfitting. Test predictions become unreliable. |
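
The depth experiment can be sketched as a simple loop. The snippet below uses synthetic data (an assumption, since the retailer dataset is not reproduced here), so the optimal depth it reports will generally differ from the depth = 5 found in class; what it illustrates is the pattern of training MSE falling while test MSE eventually turns upward.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the retailer data: 1000 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))
y = 500 + 300 * np.sin(X[:, 0]) + 40 * X[:, 1] + rng.normal(0, 100, size=1000)

# 70% training, 30% test, as in the course
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}  # depth -> (training MSE, test MSE)
for depth in range(2, 13):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, tree.predict(X_train))
    mse_test = mean_squared_error(y_test, tree.predict(X_test))
    results[depth] = (mse_train, mse_test)
    print(f"depth={depth:2d}  train MSE={mse_train:10.1f}  test MSE={mse_test:10.1f}")

# The depth with the lowest test MSE is the candidate "optimal" depth
best_depth = min(results, key=lambda d: results[d][1])
print("Depth with lowest test MSE:", best_depth)
```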

Random Forest Algorithm — The Solution to Overfitting

[Figure: Overfitting and Random Forest concept]

Random Forest = many decision trees built on randomised subsets of the data and features. It is a form of ensemble modelling: instead of 1 model, build k models and combine their predictions. Averaging across many trees lets the individual trees' errors partly cancel out, so the combined prediction is more accurate and more stable than that of a single tree.

Two sources of randomness:

  1. Bootstrap sampling: Draw k random samples with replacement from the training data. Each tree trains on a different sample (some retailers repeated, some left out).
  2. Feature randomisation: Each tree uses a random subset of features (e.g. 3 out of 10), so different trees focus on different aspects. Many implementations, including scikit-learn, redraw this random subset at every split rather than once per tree.

Algorithm steps:

  1. Bootstrap sampling: Create k different training samples.
  2. Randomise features: Each tree uses a random subset of features.
  3. Build k trees: Build k regression trees; each gives one predicted demand (ŷ).
  4. Combine predictions: Take the average of the k predicted demands: Ŷ_final = (ŷ₁ + ŷ₂ + … + ŷₖ) / k
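
The four steps can be sketched by hand with ordinary regression trees. This is an illustrative sketch, not the course's implementation: `fit_simple_forest` and `predict_simple_forest` are hypothetical helper names, the data is synthetic, and scikit-learn's `max_features` draws the random feature subset at each split rather than once per tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_simple_forest(X, y, k=20, max_depth=5, max_features=3, seed=0):
    """Steps 1-3: build k trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for i in range(k):
        # Step 1: bootstrap sample of n row indices, drawn WITH replacement
        idx = rng.integers(0, n, size=n)
        # Step 2: limit each split to a random subset of the features
        tree = DecisionTreeRegressor(
            max_depth=max_depth, max_features=max_features, random_state=i
        )
        # Step 3: fit one regression tree on this sample
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_simple_forest(trees, X):
    """Step 4: average the k per-tree predictions."""
    per_tree = np.array([t.predict(X) for t in trees])  # shape (k, n_rows)
    return per_tree.mean(axis=0)

# Synthetic demo data: 200 retailers, 10 numeric features (assumed values)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(200, 10))
y_demo = 1000 + 800 * X_demo[:, 0] - 300 * X_demo[:, 1] + rng.normal(0, 50, 200)

forest = fit_simple_forest(X_demo, y_demo, k=20, max_depth=5, max_features=3)
preds = predict_simple_forest(forest, X_demo[:5])
print("Averaged predictions for 5 retailers:", preds)
```

In practice `sklearn.ensemble.RandomForestRegressor(n_estimators=20, max_depth=5)` packages the same idea in a single estimator.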

Results — Random Forest vs. Single Regression Tree


| Model | Max Depth | Test MSE | Better? |
| --- | --- | --- | --- |
| Single Regression Tree (depth 2) | 2 | 56,62,511 | Reference |
| Single Regression Tree (depth 5) | 5 | ~44,xx,xxx | Better than depth 2 |
| Random Forest (20 trees, depth 5) | 5 (per tree) | 37,21,603 | Best ✓ (~34% lower than depth 2) |
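
The "~34% lower" figure follows directly from the two reported test MSE values. A short check, using only the numbers in the table (the variable names are illustrative):

```python
# Test MSE values reported above, written without digit grouping
mse_tree_depth2 = 5_662_511    # 56,62,511: single regression tree, depth 2
mse_random_forest = 3_721_603  # 37,21,603: random forest, 20 trees, depth 5

# Relative reduction = (reference - new) / reference
reduction = (mse_tree_depth2 - mse_random_forest) / mse_tree_depth2
print(f"Relative reduction in test MSE: {reduction:.1%}")  # 34.3%
```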

Demand Forecast Error Metrics — All 5 Measures

Error term: eₜ = Fₜ − Dₜ (where Fₜ = forecast demand and Dₜ = actual demand in period t)

| Metric | Full Name | Formula | When to Use / Interpretation |
| --- | --- | --- | --- |
| MSE | Mean Squared Error | (1/n) × Σ(Fₜ − Dₜ)² | Penalises large errors. Good for model comparison. In squared units, not the original units of demand. |
| MAD | Mean Absolute Deviation | (1/n) × Σ\|Fₜ − Dₜ\| | Easier to interpret (same units as demand). Less sensitive to outliers than MSE. |
| MAPE | Mean Absolute % Error | (100/n) × Σ\|Fₜ − Dₜ\| / Dₜ | Most interpretable for management (“off by X% on average”). Scale-independent. Undefined when actual demand Dₜ = 0. |
| Bias | Forecast Bias | (1/n) × Σ(Fₜ − Dₜ) | Measures systematic over/under-estimation. Should be close to 0. |
| Tracking Signal | Tracking Signal | Σ(Fₜ − Dₜ) / MAD = n × Bias / MAD | Signals whether the model is drifting. Uses the cumulative (not mean) error, so it can exceed ±1. Triggers a review if it falls outside a band of about ±4 to ±6. |
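
A minimal sketch of all five measures in one helper, using the error convention eₜ = Fₜ − Dₜ from above. `forecast_error_metrics` is a hypothetical name and the four-period demand series is invented for illustration. The tracking signal is computed here as the cumulative error divided by MAD (the common textbook convention, equal to n × Bias / MAD), which is an assumption about the intended definition.

```python
import numpy as np

def forecast_error_metrics(forecast, actual):
    """Return MSE, MAD, MAPE, Bias and Tracking Signal for two equal-length series."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    errors = forecast - actual                  # e_t = F_t - D_t

    mse = np.mean(errors ** 2)                  # squared units
    mad = np.mean(np.abs(errors))               # same units as demand
    mape = 100 * np.mean(np.abs(errors) / np.abs(actual))  # assumes no zero demand
    bias = np.mean(errors)                      # > 0 means over-forecasting
    tracking_signal = errors.sum() / mad        # cumulative error / MAD; assumes MAD > 0

    return {
        "MSE": mse,
        "MAD": mad,
        "MAPE": mape,
        "Bias": bias,
        "TrackingSignal": tracking_signal,
    }

# Invented example: four periods with constant actual demand of 100 units
metrics = forecast_error_metrics(
    forecast=[110, 90, 120, 100],
    actual=[100, 100, 100, 100],
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For this series the errors are 10, −10, 20 and 0, giving MSE = 150, MAD = 10, MAPE = 10%, Bias = 5 and a tracking signal of 2, which is inside the usual review band.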

Model Selection Framework — 7-Step Process


| # | Step | Example from This Course |
| --- | --- | --- |
| 1 | Understand the data | 7 retailer features: encoded categoricals. Checked for outliers. |
| 2 | Choose evaluation metric | MSE chosen. MAPE better for management reporting. |
| 3 | Split data | 70% training, 30% test. Test data used ONLY for final evaluation. |
| 4 | Experiment with multiple algorithms | Tried regression tree vs. random forest. RF wins. |
| 5 | Hyper-parameter tuning | Tested depths 2–12 to find optimal depth = 5. |
| 6 | Consider interpretability | Single tree: very interpretable. Random forest: harder to explain but more accurate. |
| 7 | Check resource constraints | 20-tree random forest is manageable. 1000+ tree XGBoost may need cloud. |
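
Steps 1 and 3 of the framework can be sketched as follows. The column names (`region`, `store_size_sqft`, `order_qty`) and the random values are hypothetical stand-ins for the retailer features, which are not listed in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical retailer table with one categorical and one numeric feature
rng = np.random.default_rng(1)
retailers = pd.DataFrame({
    "region": rng.choice(["North", "South", "East", "West"], size=1000),
    "store_size_sqft": rng.integers(500, 5000, size=1000),
    "order_qty": rng.integers(100, 5000, size=1000),
})

# Step 1: encode the categorical column as dummy (0/1) variables
X = pd.get_dummies(retailers.drop(columns="order_qty"), columns=["region"])
y = retailers["order_qty"]

# Step 3: 70% training, 30% test; the test rows are set aside for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print("Training rows:", len(X_train), "| Test rows:", len(X_test))
```

With 1,000 retailers this yields 700 training rows and 300 test rows, matching the split used throughout the session.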

  • Analytics + Big Data: Data → Model → Decision → Value. Big Data 6 Vs.
  • Types of Analytics: Descriptive → Diagnostic → Predictive → Prescriptive.
  • Predictive Maintenance: Classification tree (Gini). Output: 4 leaf node rules. Python implementation.
  • Demand Forecasting: Regression tree (MSE). Output: continuous order quantities. Python: Dummy variables.
  • Random Forest & Evaluation: Overfitting, Bootstrap, Error metrics (MSE, MAD, MAPE).