
Week 7 | Session 4: Regression Tree in Python — Dummy Variables & Full Implementation

Course: Supply Chain Digitization — Module 3: Analytics in SCM



Problem: Region has values “South”, “East”, “West”, “North” — these are TEXT, not numbers. ML models (decision tree, random forest, logistic regression, etc.) cannot perform mathematical operations on text directly.

Solution: Convert categorical text into binary (0/1) numerical dummy variables — one column per category (minus one base category).

  • Rule: If a categorical variable has m categories → create (m − 1) dummy variables. One category is left out as the “base category”.
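  • Why (m − 1)? The mth category is implied when all (m − 1) dummy variables = 0. Including all m adds no information and, in models with an intercept (e.g. linear or logistic regression), creates perfect multicollinearity (the dummy variable trap).
  • Applies to: every model that expects numeric inputs, which includes the decision tree built in this session. Always encode categorical variables first.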
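The (m − 1) rule can be seen on a small example. The sketch below uses a made-up five-row column (the values are illustrative, not taken from demand.csv) and assumes the text values are lowercase, so that the dummy names match the ones used in these notes.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from demand.csv
toy = pd.DataFrame({"region": ["south", "east", "west", "north", "east"]})

# drop_first=True drops the first category in sorted order ("east"),
# leaving m - 1 = 3 dummy columns; dtype=int stores them as 0/1
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

The two "east" rows come out as all zeros, which is how the base category is represented.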

Dummy Variables Created — Region and Location

| Variable | Categories (m) | Dummy Variables Created (m − 1) | Base Category (not created) | Interpretation Rule |
|---|---|---|---|---|
| Region | 4 (N, S, E, W) | region_north, region_south, region_west | East (all three dummies = 0) | If region_south = 1 → South. If all three = 0 → East (base). |
| Location | 3 (Rural, Semi-Urban, Urban) | location_semi_urban, location_urban | Rural (both dummies = 0) | If location_urban = 1 → Urban. If both = 0 → Rural (base). |

How to Read Dummy Variables — Worked Examples

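| Obs. | Actual Region | region_north | region_south | region_west | Actual Location | location_semi_urban | location_urban |
|---|---|---|---|---|---|---|---|
| 978 | South | 0 | 1 | 0 | Rural (base) | 0 | 0 |
| 979 | West | 0 | 0 | 1 | Semi-Urban | 1 | 0 |
| 981 | East (base) | 0 | 0 | 0 | Urban | 0 | 1 |
| 983 | North | 1 | 0 | 0 | Semi-Urban | 1 | 0 |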
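The reading rule (a dummy equal to 1 names the category, all zeros means the base) can be written as a small helper. This is only a sketch: the function name decode is hypothetical, and the location_urban values for observations 978, 979 and 983 are inferred from their stated locations.

```python
import pandas as pd

def decode(row, prefix, base):
    """Return the category whose dummy equals 1, or the base if all are 0."""
    for col, value in row.items():
        if col.startswith(prefix) and value == 1:
            return col[len(prefix):]
    return base

# Dummy values for the four worked observations
rows = pd.DataFrame(
    {
        "region_north": [0, 0, 0, 1],
        "region_south": [1, 0, 0, 0],
        "region_west": [0, 1, 0, 0],
        "location_semi_urban": [0, 1, 0, 1],
        "location_urban": [0, 0, 1, 0],
    },
    index=[978, 979, 981, 983],
)

regions = {obs: decode(r, "region_", "east") for obs, r in rows.iterrows()}
locations = {obs: decode(r, "location_", "rural") for obs, r in rows.iterrows()}

print(regions)
print(locations)
```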

Variable Count — Before vs. After Encoding

| Variable | Before Encoding | After Encoding |
|---|---|---|
| Region | 1 column (text) | 3 columns: region_north, region_south, region_west |
| Location | 1 column (text) | 2 columns: location_semi_urban, location_urban |
| Numerical vars | 5 columns | 5 columns (unchanged) |
| TOTAL FEATURES (X) | 7 columns | 10 columns (3 + 2 + 5) |
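The column count can be checked before fitting anything. The sketch below builds a toy frame with the same shape of problem (two text columns and five numeric ones); the numeric column names num_1 to num_5 and all values are placeholders, not the real demand.csv fields.

```python
import pandas as pd

# Toy frame: 2 text columns + 5 numeric columns (names and values are placeholders)
toy = pd.DataFrame({
    "region": ["north", "south", "east", "west", "east", "south"],
    "location": ["rural", "urban", "semi_urban", "rural", "urban", "rural"],
    "num_1": [1, 2, 3, 4, 5, 6],
    "num_2": [10, 20, 30, 40, 50, 60],
    "num_3": [0, 1, 0, 1, 0, 1],
    "num_4": [5.5, 6.5, 7.5, 8.5, 9.5, 10.5],
    "num_5": [100, 90, 80, 70, 60, 50],
})

# Expected width: numeric columns stay as they are, each text column becomes m - 1 dummies
text_cols = toy.select_dtypes(exclude="number").columns
n_numeric = toy.shape[1] - len(text_cols)
expected = n_numeric + sum(toy[c].nunique() - 1 for c in text_cols)

encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy.shape[1], "->", encoded.shape[1], "(expected", expected, ")")
```

If the encoded width differs from the expected one, a category is usually missing from the data or spelled in more than one way.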

# ── STEP 1: Import libraries and load data ────────────────────────
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("demand.csv")

# ── STEP 2: Define X (features) and Y (target) ────────────────────
X_features = list(df.columns)
X_features.remove("order_quantity")
X_df = df[X_features]
Y = df["order_quantity"]

# ── STEP 3: Encode Categorical Variables (CRITICAL STEP) ──────────
# With drop_first=True, pd.get_dummies() creates m-1 dummies per text
# column by dropping the first category in sorted order (the base).
# dtype=int stores the dummies as 0/1 rather than True/False.
encoded_df = pd.get_dummies(X_df, drop_first=True, dtype=int)
X = encoded_df  # now 10 columns instead of 7

# ── STEP 4: Train-Test Split (70% train, 30% test) ────────────────
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42)

# ── STEP 5: Build Regression Tree ─────────────────────────────────
# DecisionTreeRegressor is used because the target is CONTINUOUS
regr = DecisionTreeRegressor(max_depth=2)
regr.fit(X_train, Y_train)

# ── STEP 6: Print / Visualise Tree ────────────────────────────────
plt.figure(figsize=(15, 10))
# Pass the names as a plain list; some scikit-learn versions reject a pandas Index
tree.plot_tree(regr, feature_names=list(X_train.columns), filled=True)
plt.show()
  • pd.get_dummies(..., drop_first=True): Built-in pandas function. Creates all dummy variables automatically and removes the base category column. Do this BEFORE the train-test split so that the training and test sets share exactly the same dummy columns.
  • DecisionTreeRegressor: Used instead of DecisionTreeClassifier because the target (order_quantity) is CONTINUOUS. This is the key code difference.
  • No criterion needed: The regressor's default splitting criterion is squared error (MSE), i.e. criterion="squared_error" in current scikit-learn versions.
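Why the encoding has to come before the split can be shown on a toy frame. The sketch below uses a plain positional slice in place of train_test_split so the outcome is deterministic; the column name size and all values are illustrative only.

```python
import pandas as pd

# Toy data: the last two rows (the "test" part) contain only two of the four regions
toy = pd.DataFrame({
    "region": ["east", "north", "south", "west", "north", "south"],
    "size": [10, 20, 30, 40, 50, 60],
})

# Encode first, then split: both parts share one set of columns
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
train_ok, test_ok = encoded.iloc[:4], encoded.iloc[4:]

# Split first, then encode each part separately: the columns no longer agree
train_bad = pd.get_dummies(toy.iloc[:4], drop_first=True, dtype=int)
test_bad = pd.get_dummies(toy.iloc[4:], drop_first=True, dtype=int)

print("encode then split:", list(train_ok.columns), list(test_ok.columns))
print("split then encode:", list(train_bad.columns), list(test_bad.columns))
```

In the second case the test part sees only "north" and "south", so "north" silently becomes its base category and the model would receive columns it was not trained on.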

Output Verification — Python Matches Theory Exactly

| Node | Condition | Python Output (ȳ, n, MSE) | Theory Output (Session 7.2) | Match? |
|---|---|---|---|---|
| 0 | All 700 training obs. | ȳ=2270, n=700, MSE=8,151,813 | ȳ=2270, n=700, MSE=8,151,813 | ✓ Match |
| 1 | Size ≤ 30.5K sq ft | ȳ=1902, n=612 | ȳ=1902, n=612 | ✓ Match |
| 2 | Size > 30.5K sq ft | ȳ=4829, n=88 | ȳ=4829, n=88 | ✓ Match |
| 3 | Size ≤ 30.5K, Promo = 0 | ȳ=943, n=198 | ȳ=943, n=198 | ✓ Match |
| 4 | Size ≤ 30.5K, Promo = 1 | ȳ=2360, n=414 | ȳ=2360, n=414 | ✓ Match |
| 5 | Size > 30.5K, Age ≤ 17.5 | ȳ=2887.3, n=56 | ȳ=2887, n=56 | ✓ Match (rounded) |
| 6 | Size > 30.5K, Age > 17.5 | ȳ=8226.87, n=32 | ȳ=8227, n=32 | ✓ Match (rounded) |
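The per-node values in a table like this can also be read directly from a fitted tree rather than from the plot. The sketch below fits a depth-2 tree on synthetic data (generated here only for illustration, not the demand data) and prints ȳ, n and MSE for each node using the tree_ attribute of the fitted regressor. Note that scikit-learn numbers nodes depth-first, so its node ids do not follow the level-by-level numbering used in the table.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(200, 2))
y = np.where(X[:, 0] > 30, 4800.0, 1900.0) + rng.normal(0, 100, size=200)

regr = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
t = regr.tree_

# For a regression tree: value = node mean, impurity = node MSE
for node in range(t.node_count):
    mean = t.value[node, 0, 0]
    n = t.n_node_samples[node]
    mse = t.impurity[node]
    print(f"node {node}: mean={mean:.1f}  n={n}  MSE={mse:.1f}")

# Consistency check: the children's weighted mean reproduces the root mean
left, right = t.children_left[0], t.children_right[0]
weighted = (
    t.n_node_samples[left] * t.value[left, 0, 0]
    + t.n_node_samples[right] * t.value[right, 0, 0]
) / t.n_node_samples[0]
print("root mean:", t.value[0, 0, 0], " weighted child mean:", weighted)
```

The same check works by hand on the table: (612 × 1902 + 88 × 4829) / 700 ≈ 2270, the root mean.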

Code Difference: Classification Tree vs. Regression Tree

| Code Element | Classification Tree (Machine Failure) | Regression Tree (Demand Forecast) |
|---|---|---|
| sklearn import | DecisionTreeClassifier | DecisionTreeRegressor |
| Splitting criterion | criterion="gini" | Default: squared_error (MSE) |
| Dummy variable encoding | Not needed | REQUIRED: Region and Location are categorical |
| Target variable type | Categorical (0/1) | Continuous (units ordered) |

  • Categorical encoding: Region (4 cats → 3 dummies, East=base) | Location (3 cats → 2 dummies, Rural=base). 7 features → 10 after encoding.
  • Dummy rule: m categories → (m−1) dummy variables. Base category = all zeros. Apply to ALL ML models.
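  • Code change: DecisionTreeRegressor replaces DecisionTreeClassifier, and criterion="gini" is dropped because the regressor defaults to squared error. Apart from the added encoding step, the workflow is the same.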
  • Key code steps: Import → define X/Y → encode categoricals → split 70/30 → build RegTree (max_depth=2) → print
  • Output matches theory: All 7 nodes match the hand-built tree on ȳ and n (nodes 5 and 6 after rounding), and the root MSE matches. Python replicates the theory tree.
  • Next session: Test model on 300 held-out observations. Compute MAE, RMSE, MAPE. Assess model quality.