Module 02 of 12 · 50 min read · Beginner

The machine-learning toolbox

Regression, classification, clustering, trees, ensembles, neural networks — at a glance, with their best use cases.

Learning objectives

By the end of this module, you should be able to:

  • 01 Identify when to reach for regression, classification, clustering, trees, ensembles, or neural networks
  • 02 Pick the right model family for a real analyst problem (credit, churn, anomaly detection)
  • 03 Recognise the bias-variance tradeoff and what overfitting actually looks like
  • 04 Avoid the common mistake of using deep learning when a simpler model would do better

Most of your analyst career, you will not be training large language models. You will be doing pretty ordinary things to pretty ordinary tabular data: predicting whether a loan will default, whether a customer will churn, whether a transaction is fraudulent. For these problems, the right toolbox is older and smaller than the AI hype suggests.

Linear and logistic regression

Still the workhorses. Use linear regression for continuous outcomes (house prices, next quarter's sales) and logistic regression for binary outcomes (default, churn, click). Their interpretability makes them the right choice when you need to justify the model to a regulator, a manager, or a court: each coefficient has a direct reading, namely the change in the outcome (or, for logistic regression, in the log-odds of the outcome) for a one-unit change in that feature, holding the other features fixed.
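To make the interpretability point concrete, here is a minimal from-scratch sketch of a one-feature logistic regression fitted by gradient descent. The feature (a debt-to-income ratio), the synthetic data, and the function names are all assumptions made for illustration; in practice you would use a library such as scikit-learn or statsmodels rather than hand-rolled code.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Synthetic, hypothetical data: x is a debt-to-income ratio, y = 1 means default.
random.seed(0)
xs = [random.uniform(0, 2) for _ in range(300)]
ys = [1 if random.random() < sigmoid(-2.0 + 2.0 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)  # multiplicative change in the odds of default per unit of x
print(f"b1 = {b1:.2f}, odds ratio per unit of x = {odds_ratio:.2f}")
```

The fitted b1 is a log-odds effect, so exp(b1) reads as "a one-unit increase in the ratio multiplies the odds of default by this factor". That is the kind of statement you can put in front of a regulator.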

Trees, forests, and gradient boosting

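A decision tree splits the feature space recursively by asking yes/no questions: 'Is income > KES 50,000? Is age > 35?' A single deep tree tends to overfit, but ensembles of trees are remarkably powerful. Random forests average many trees trained on bootstrap samples of the data (bagging), while gradient-boosted trees add trees one at a time, each correcting the errors of those before it (boosting).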
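How does a tree pick its yes/no question? One common criterion is Gini impurity: try every candidate threshold and keep the one that leaves the two sides as pure as possible. The sketch below does this for a single feature on made-up data (the incomes and labels are hypothetical, and the helper names are not from any library).

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = pure, 0.5 = evenly mixed)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value > threshold?' question."""
    best = (None, gini(labels))  # start from the impurity with no split at all
    n = len(values)
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: monthly income in KES and a default label.
income = [20_000, 30_000, 40_000, 50_000, 60_000, 80_000, 100_000, 120_000]
default = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, impurity = best_split(income, default)
print(threshold, impurity)  # 50000 0.0
```

A full tree repeats this search inside each side of the split, over every feature, until a stopping rule is met. That recursion is also why an unconstrained tree can keep splitting until it has memorised the training set.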

Gradient-boosted tree implementations (XGBoost, LightGBM, CatBoost) have been behind a large share of the winning entries in Kaggle competitions on tabular data. For Kenyan credit-scoring problems where your features are income, age, employment, M-Pesa transaction patterns, and loan history, gradient boosting is usually the right answer.
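The core idea of boosting fits in a few lines: start from a constant prediction, then repeatedly fit a small tree to the current errors and add a damped version of it to the model. The sketch below is a toy regression version with squared error and one-split trees ("stumps") on synthetic data. It is an illustration of the idea only, not how XGBoost, LightGBM, or CatBoost are implemented; those libraries add regularisation, second-order information, and many engineering optimisations.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regressor: predict the mean residual on each side of a threshold."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Return (predict_fn, training_mse_history) for squared-error boosting."""
    base = sum(ys) / len(ys)  # round 0: predict the mean
    stumps = []
    preds = [base] * len(xs)
    history = []
    for _ in range(rounds):
        # Residuals are the negative gradient of squared error (up to a constant).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Synthetic data: a smooth curve that no single stump can represent.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

predict, history = boost(xs, ys)
print(f"training MSE after round 1: {history[0]:.3f}, after round 50: {history[-1]:.3f}")
```

Each stump on its own is a weak model, but the training error falls round after round as the stumps accumulate. The learning rate controls how much each round is trusted; smaller values need more rounds but tend to generalise better.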

Clustering and dimensionality reduction

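k-means clustering partitions data into k clusters by assigning each point to its nearest cluster centre, so that points in the same cluster are similar to one another. It is useful for customer segmentation. Principal Component Analysis (PCA) projects high-dimensional data onto fewer dimensions while preserving as much of the variance as possible, which makes it useful for exploring data and as a preprocessing step.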
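As a rough illustration of what k-means does under the hood, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. The customer data (monthly spend, monthly visits) is invented for the example, and a real project would normally use a library implementation with smarter initialisation and feature scaling.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))


def kmeans(points, k, iters=20, seed=0):
    """Return (centroids, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stopped changing
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customers: (monthly spend in thousands of KES, visits per month).
customers = [(1, 1), (1, 2), (2, 1), (2, 2),
             (10, 10), (10, 11), (11, 10), (11, 11)]

centroids, labels = kmeans(customers, k=2)
print(sorted(centroids))  # [(1.5, 1.5), (10.5, 10.5)]
```

The cluster numbers themselves are arbitrary (which group is called 0 depends on the initialisation), so interpret segments by their centroids, not by their labels.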

Neural networks — when they're worth it

Neural networks dominate when the input is unstructured: text, images, audio, video. They do not generally beat gradient boosting on tabular data. The exception: very large tabular datasets (millions of rows, hundreds of features) where a network can exploit patterns trees miss.

The simple-model bias

Senior analysts benchmark a simple model first: logistic regression, then gradient boosting. If a more complex model beats that benchmark by only 1-2 percentage points, ship the simple one. The cost of complexity (debugging, monitoring, retraining, regulatory scrutiny) usually outweighs so small an accuracy gain.
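The rule of thumb can be written down as an explicit decision rule, which is handy when you want the team to agree on the bar before anyone sees results. The function name and the 2-percentage-point default threshold are assumptions chosen to match the guidance above; pick a threshold that reflects your own costs.

```python
def choose_model(simple_score, complex_score, min_gain=0.02):
    """Ship the complex model only if it beats the simple benchmark by more than min_gain.

    Scores are held-out metrics on the same scale (for example, accuracy in [0, 1]).
    The default min_gain of 0.02 (2 percentage points) is an illustrative assumption.
    """
    return "complex" if complex_score - simple_score > min_gain else "simple"


print(choose_model(0.88, 0.89))  # simple
print(choose_model(0.80, 0.90))  # complex
```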

Bias-variance and overfitting

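Every model sits somewhere on a tradeoff between underfitting (too rigid, misses real patterns: high bias) and overfitting (too flexible, learns noise: high variance). Overfitting shows up as a model that scores much better on its training data than on data it has never seen. The remedies are more data, regularisation, and simpler models. Cross-validation does not cure overfitting, but it is how you detect it, provided you have the discipline to validate on data the model never saw during training.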
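To see what overfitting looks like in numbers, the sketch below compares training and held-out accuracy for a very flexible model (1-nearest-neighbour, which memorises the training set) and a smoother one (15-nearest-neighbours). Nearest-neighbour is used here only because it makes the effect easy to show in a few lines; the synthetic data, the 20% label noise, and the choice of k are all assumptions for illustration.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (one feature)."""
    neighbours = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def accuracy(train, data, k):
    """Share of points in `data` that a k-NN model built on `train` labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


def make_data(n, rng, noise=0.2):
    """True rule: y = 1 when x > 0.5. A fraction `noise` of labels is flipped at random."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


rng = random.Random(42)
train = make_data(200, rng)
test = make_data(200, rng)  # held-out data the model never sees during fitting

results = {}
for k in (1, 15):
    results[k] = (accuracy(train, train, k), accuracy(train, test, k))
    print(f"k={k:2d}  train accuracy={results[k][0]:.2f}  held-out accuracy={results[k][1]:.2f}")
```

With k=1 every training point is its own nearest neighbour, so training accuracy is perfect even though a fifth of the labels are noise, and the held-out score is where the memorised noise shows. A large gap between the two columns is the signature of overfitting; a model with similar (if unspectacular) scores on both is the one you can trust.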

Exercise

You are building a credit-scoring model for a Kenyan bank. The dataset is 80,000 loan records with 25 features (income, age, employment, M-Pesa transaction patterns, prior loan history, etc.) and a binary default label. Junior data scientists on the team are pushing to use a deep neural network. Walk through how you would respond: what would you benchmark first, what would you measure, and when (if ever) would the neural network be the right choice?

Key takeaways

  • Most analyst problems are still solved by linear/logistic regression or tree-based ensembles, not neural networks
  • Tree ensembles, and gradient-boosted trees in particular (XGBoost, LightGBM, CatBoost), are the strongest general-purpose models for tabular data and frequent winners of Kaggle tabular competitions
  • Neural networks dominate when the input is unstructured (text, images, audio) — not when it's tabular
  • Always benchmark a simple model first. If logistic regression gets 88% and your neural net gets 89%, ship logistic regression

Further reading

  1. An Introduction to Statistical Learning (2nd Edition)
     Gareth James, Daniela Witten, Trevor Hastie & Robert Tibshirani · Springer · 2021

  2. The Elements of Statistical Learning
     Trevor Hastie, Robert Tibshirani & Jerome Friedman · Springer · 2009

  3. XGBoost: A Scalable Tree Boosting System
     Tianqi Chen & Carlos Guestrin · KDD · 2016

LeadAfrik · Public Economics Hub