
Week 7 | Session 3: Building the Regression Tree — MSE, Splitting Logic & Worked Predictions

Course: Supply Chain Digitization — Module 3: Analytics in SCM



4 Retailer Predictions — Applying the Regression Tree

| Retailer | Region | Balance (₹L) | Location | Age (yrs) | Size (K sqft) | Promo (0/1) | Holidays | Predicted Demand | Node |
|---|---|---|---|---|---|---|---|---|---|
| A | West | 10 | Urban | 1 | 28 ≤ 30.5 ✓ | 1 ✓ | 3 | 2360 units | Node 4 |
| B | East | 14 | Rural | 23 > 17.5 ✓ | 33 > 30.5 ✓ | 0 | 1 | 8227 units | Node 6 |
| C | North | 3 | Semi-Urban | 12 | 20 ≤ 30.5 ✓ | 1 ✓ | 2 | 2360 units | Node 4 |
| D | South | 20 | Urban | 20 | 12 ≤ 30.5 ✓ | 0 ✗ | 2 | 943 units | Node 3 |

  • Retailers A & C: Land in the same node (Node 4, 2360 units) despite being in different regions and locations. The tree never splits on region or location, so size and promotion alone determine these predictions.
  • Retailer B: Large store + old store. Lands in Node 6, which has the highest predicted demand and the lowest support (only 32 training observations).
  • Retailer D: Small store + no promotion. Lands in Node 3, the leaf with the lowest predicted demand.
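The four predictions above can be reproduced by walking the tree by hand. The sketch below is only an illustration: the function name `predict_demand`, the argument names, and the label "Node 5" for large, younger stores are assumptions, and Node 5's mean is left as `None` because this section does not state it.

```python
from typing import Optional, Tuple

SIZE_CUTOFF = 30.5  # K sq ft, root split (Node 0)
AGE_CUTOFF = 17.5   # years, split at Node 2


def predict_demand(size_k_sqft: float, promo: int, age_yrs: float) -> Tuple[str, Optional[int]]:
    """Walk the depth-2 tree and return (leaf name, predicted demand in units)."""
    if size_k_sqft <= SIZE_CUTOFF:      # Node 1: split on promotion
        if promo == 1:
            return "Node 4", 2360
        return "Node 3", 943
    if age_yrs > AGE_CUTOFF:            # Node 2: split on age
        return "Node 6", 8227
    return "Node 5", None               # Node 5's mean is not given in this section


# Feature values as read from the retailer table.
retailers = {
    "A": {"size_k_sqft": 28, "promo": 1, "age_yrs": 1},
    "B": {"size_k_sqft": 33, "promo": 0, "age_yrs": 23},
    "C": {"size_k_sqft": 20, "promo": 1, "age_yrs": 12},
    "D": {"size_k_sqft": 12, "promo": 0, "age_yrs": 20},
}
predictions = {name: predict_demand(**features) for name, features in retailers.items()}
for name, (node, units) in predictions.items():
    print(name, node, units)
```

Region, balance, location and holidays are not arguments at all, which mirrors the point that the fitted tree never uses them.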

Why split on Size at 30.5K sq ft? Why Promotion at Node 1? At each node the algorithm picks the variable + cutoff that reduces MSE the most; human judgment plays no part in the choice.

3-Step Algorithm for Building a Regression Tree

  1. Place all training data in the root node (Node 0): 700 observations. Predicted demand = ȳ = 2270 (the simple average, used as the baseline prediction). MSE = 8,151,813.
  2. Split the root using the variable + cutoff that gives the MAXIMUM reduction in MSE: all variables and all possible cutoffs were tried, and Size ≤ 30.5K sq ft gave the largest reduction.
  3. Repeat Step 2 for each internal node until a stopping criterion is met: Node 1 is split on Promotion, Node 2 on Age. Splitting stops at depth 2.
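The three steps can be sketched as a short recursive routine. This is a minimal illustration, not the course's implementation: the names `mse`, `best_split` and `build_tree`, the choice of midpoints between distinct values as candidate cutoffs, and the six-row toy dataset (columns: size, promo) are all assumptions made for the example.

```python
import numpy as np


def mse(y):
    """Mean squared deviation of y around its own mean."""
    return float(np.mean((y - y.mean()) ** 2))


def best_split(X, y):
    """Step 2: try every (feature, cutoff) pair, keep the largest MSE reduction."""
    n, parent = len(y), mse(y)
    best = None  # (reduction, feature index, cutoff)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutoffs: midpoints between consecutive distinct values.
        for cutoff in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= cutoff
            child = (left.sum() * mse(y[left]) + (~left).sum() * mse(y[~left])) / n
            if best is None or parent - child > best[0]:
                best = (parent - child, j, float(cutoff))
    return best


def build_tree(X, y, depth=0, max_depth=2):
    """Steps 1 and 3: hold (X, y) in a node, then split recursively."""
    node = {"n": len(y), "prediction": float(y.mean()), "mse": mse(y)}
    split = best_split(X, y) if depth < max_depth else None
    if split is None or split[0] <= 0:
        return node  # leaf: depth limit reached or no useful split
    _, j, cutoff = split
    left = X[:, j] <= cutoff
    node.update(feature=j, cutoff=cutoff,
                left=build_tree(X[left], y[left], depth + 1, max_depth),
                right=build_tree(X[~left], y[~left], depth + 1, max_depth))
    return node


# Toy data (made up for illustration): columns are size (K sq ft) and promo (0/1).
X = np.array([[10, 0], [12, 1], [20, 0], [25, 1], [35, 0], [40, 1]], dtype=float)
y = np.array([900, 2300, 1000, 2400, 5000, 5200], dtype=float)
tree = build_tree(X, y)
print(tree["feature"], tree["cutoff"])                   # root split
print(tree["left"]["feature"], tree["left"]["cutoff"])   # split of the small-store node
```

On this toy data the root splits on size and the small-store node then splits on promo, the same pattern as the tree in this session, although the numbers themselves are invented.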

Mean Squared Error (MSE) — The Splitting Criterion


Definition: MSE = Average of squared differences between actual demand and predicted demand (mean). Purpose: Measures how spread out the Y values are around their mean within a node. Lower MSE = less variance = more homogeneous group = better prediction.

MSE = (1/n) × Σᵢ₌₁ⁿ (yᵢ − ȳ)² (where yᵢ = actual demand for retailer i, ȳ = mean demand of the node, which is also its predicted demand, n = number of observations in the node)

  • Squared error: Using squares ensures negative and positive errors don’t cancel out, and penalises large deviations more heavily.
  • Mean: Dividing by n normalises for node size → comparable across nodes of different sizes.
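A small numeric sketch may help make the two bullet points concrete. The function name `node_mse` and the demand values below are hypothetical and are not taken from the course data.

```python
def node_mse(demands):
    """MSE of a node: mean squared deviation from the node's own mean."""
    n = len(demands)
    mean = sum(demands) / n
    return sum((y - mean) ** 2 for y in demands) / n


# Two hypothetical nodes with the same mean (950 units) but different spread.
tight_node = [900, 950, 1000]
spread_node = [0, 950, 1900]

# Without squaring, positive and negative deviations cancel to zero.
spread_mean = sum(spread_node) / len(spread_node)
signed_sum = sum(y - spread_mean for y in spread_node)

print(signed_sum)               # 0.0
print(node_mse(tight_node))     # about 1666.67
print(node_mse(spread_node))    # about 601666.67
```

Both nodes would predict 950 units, yet the second has a far larger MSE, which is why MSE rather than the mean alone is used to judge how homogeneous a node is.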

At each node: Try all (variable, cutoff) combinations. Select the (variable, cutoff) that gives the LARGEST reduction in MSE → this is the optimal split.


| Node | Obs. (n) | Predicted Demand (ȳ) | MSE Formula | MSE Value | Interpretation |
|---|---|---|---|---|---|
| Node 0 (Root) | 700 | 2270 | Σ(yᵢ − 2270)² / 700 | 8,151,813 | Baseline: no feature info. High MSE = bad prediction. |
| Node 1 (Size ≤ 30.5K) | 612 | 1902 | Σ(yᵢ − 1902)² / 612 | 6,605,698 | MSE reduced vs. Node 0. More homogeneous group. |
| Node 2 (Size > 30.5K) | 88 | 4829 | Σ(yᵢ − 4829)² / 88 | 11,412,707 | Higher MSE: large stores vary widely in demand. |
| Node 3 (Small, No Promo) | 198 | 943 | Σ(yᵢ − 943)² / 198 | 2,384,088 | Lower MSE: no-promo small stores cluster tightly around 943. |

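Before split: All retailers in one group → mean ȳ = 2270. Retailers range from 0 to 8000+ units → MSE very high. After size split: Smaller stores (Node 1) average 1902 and larger stores (Node 2) average 4829. Node 1's MSE falls to 6,605,698, while Node 2's rises to 11,412,707 because large stores vary widely. Weighted by node size (612 and 88 observations), the children's combined MSE is about 7,210,008, which is below the root's 8,151,813, so the split still reduces MSE overall.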
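The size of the reduction from the root split can be checked directly from the node table. The sketch below assumes the usual size-weighted average of the child MSEs as the "after split" figure; the variable names are illustrative.

```python
# Figures from the node table: (observations, mean demand, MSE).
root = (700, 2270, 8_151_813)
node1 = (612, 1902, 6_605_698)    # Size <= 30.5K
node2 = (88, 4829, 11_412_707)    # Size > 30.5K

n_total = node1[0] + node2[0]

# Size-weighted mean of the children recovers the root mean (up to rounding).
weighted_mean = (node1[0] * node1[1] + node2[0] * node2[1]) / n_total

# Size-weighted MSE of the children, and the reduction relative to the root.
weighted_child_mse = (node1[0] * node1[2] + node2[0] * node2[2]) / n_total
reduction = root[2] - weighted_child_mse

print(n_total)                      # 700
print(round(weighted_mean, 2))      # 2269.97
print(round(weighted_child_mse))    # 7210008
print(round(reduction))             # 941805
```

Node 2 on its own has a higher MSE than the root, but it holds only 88 of the 700 observations, so the weighted figure still drops by roughly 942,000.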


Stopping Criteria — When to Stop Splitting

| Stopping Criterion | Definition | Applied in This Example |
|---|---|---|
| Max tree depth | Stop splitting once the tree reaches a pre-set number of levels from the root node | Depth = 2 used here. Not split further. |
| Min. observations per node | Do not split if the node has fewer than a minimum number of observations | Node 6 has only 32 obs (about 5% of 700). If threshold = 10%, it would trigger a stop. |
| Min. MSE reduction (Delta threshold) | Do not split if the max possible MSE reduction is below a threshold value δ | Prevents trivially small improvements from creating unnecessary complexity. |

  • Overfitting in a regression tree: With no stopping criterion, the tree keeps splitting until each leaf holds a single retailer → MSE on training data = 0 → but predictions for new retailers (test data) are unreliable.
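The three stopping criteria can be combined into one check that runs before each split. This is a sketch under assumptions: the function name `should_stop`, its parameters, the default thresholds, and the `best_reduction` values passed in the examples are hypothetical.

```python
def should_stop(depth, n_node, n_total, best_reduction,
                max_depth=2, min_fraction=0.10, min_reduction=0.0):
    """Return the first stopping criterion that applies, or None to keep splitting."""
    if depth >= max_depth:
        return "max depth"
    if n_node / n_total < min_fraction:
        return "min observations"
    if best_reduction <= min_reduction:
        return "min MSE reduction"
    return None


# Root node: depth 0, all 700 observations, a large (hypothetical) reduction available.
print(should_stop(0, 700, 700, best_reduction=900_000.0))               # None

# A depth-2 leaf under the depth limit used in this session.
print(should_stop(2, 198, 700, best_reduction=1_000.0))                 # max depth

# A 32-observation node (about 5% of 700) under a 10% minimum, with a deeper limit.
print(should_stop(2, 32, 700, best_reduction=1_000.0, max_depth=5))     # min observations

# A split whose best reduction is below a hypothetical delta of 100.
print(should_stop(1, 612, 700, best_reduction=50.0, min_reduction=100.0))  # min MSE reduction
```

Setting `max_depth` very high, `min_fraction` to 0 and `min_reduction` to 0 removes most of the brakes, which is the overfitting scenario described above.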

Classification Tree vs. Regression Tree — Algorithm Comparison

| Aspect of Building the Tree | Classification Tree (Machine Failure) | Regression Tree (Demand Forecast) |
|---|---|---|
| Target variable (Y) | Categorical (Fail / Not Fail) | Continuous (Order Quantity in units) |
| Leaf node prediction | Majority class label + probability | Mean (ȳ) of all Y values in that leaf |
| Splitting criterion | Gini Index or Entropy (impurity reduction) | Mean Squared Error (MSE) (variance reduction) |
| Stopping criteria | max depth, min obs, min Gini reduction | max depth, min obs, min MSE reduction (δ) |
| Core difference | Uses Gini/Entropy as split quality metric | Uses MSE as split quality metric. Apart from the leaf prediction (mean instead of majority class), this is the ONLY change needed! |
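The comparison can be illustrated by writing the split evaluation once and passing the impurity measure in as an argument. The names `gini`, `mse` and `split_gain`, and the tiny label and demand lists, are assumptions made for this sketch.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def mse(values):
    """Mean squared deviation of numeric values around their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def split_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
    return impurity(parent) - weighted


# Classification: hypothetical machine-failure labels.
labels = ["Fail", "Fail", "OK", "OK"]
gini_gain = split_gain(labels, labels[:2], labels[2:], impurity=gini)

# Regression: hypothetical demand values (units).
demand = [900, 1000, 5000, 5200]
mse_gain = split_gain(demand, demand[:2], demand[2:], impurity=mse)

print(gini_gain)   # 0.5
print(mse_gain)    # 4305625.0
```

The same `split_gain` serves both tree types; only the `impurity` argument changes, alongside the leaf prediction rule.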

  • 4 retailer predictions: A & C → Node 4 (2360, promo + small store) | B → Node 6 (8227, large + old) | D → Node 3 (943, small, no promo)
  • Build algorithm: 3 steps: place all data in root → split using variable/cutoff with max MSE reduction → repeat until a stopping criterion is met
  • MSE formula: (1/n) × Σ(yᵢ − ȳ)² | Measures within-group variance | Lower MSE = better, more homogeneous group
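  • MSE values: Node 0 = 8,151,813 → Node 1 = 6,605,698 → Node 3 = 2,384,088 (MSE falls along this path; Node 2 = 11,412,707 is higher than the root, but the size-weighted MSE of Nodes 1 and 2 is still lower than Node 0's)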
  • Stopping criteria: Max depth | Min observations per node | Min MSE reduction per split (δ threshold)
  • vs. Classification tree: MSE replaces Gini/Entropy as the splitting metric, and the leaf predicts the mean instead of the majority class. All other steps are identical.
  • Next sessions: Python implementation of regression tree for this demand forecasting case + error metrics (MAE, RMSE, MAPE)