The machine-learning toolbox — AI for Analysts Module 2

For most of your career as an analyst, you will not be training large language models. You will be doing pretty ordinary things to pretty ordinary tabular data: predicting whether a loan will default, whether a customer will churn, whether a transaction is fraudulent. For these problems, the right toolbox is older and smaller than the AI hype suggests.

Linear and logistic regression

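These are still the workhorses. Use linear regression for continuous outcomes (house prices, next quarter's sales) and logistic regression for binary outcomes (default, churn, click). Their interpretability makes them the right choice when you need to justify the model to a regulator, a manager, or a court: each linear coefficient is the expected change in the outcome per unit change in a feature, holding the other features fixed, and each logistic coefficient is the corresponding change in the log-odds of the outcome.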
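To make the interpretability point concrete, here is a minimal sketch that fits a one-feature logistic regression by gradient descent and reads the slope as an odds ratio. Everything in it is an assumption for illustration: the data are synthetic, the feature x is a made-up risk score between 0 and 5, and the names fit_logistic, TRUE_B0 and TRUE_B1 are hypothetical. In practice you would use a statistics library rather than hand-written gradient descent.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.2, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient descent on log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Synthetic data (an assumption): defaults are drawn from a known logistic model,
# so the estimate can be compared with the true slope.
random.seed(0)
TRUE_B0, TRUE_B1 = -2.0, 0.8
xs = [random.uniform(0.0, 5.0) for _ in range(1000)]
ys = [1 if random.random() < sigmoid(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)  # multiplicative change in the odds per unit of x

print(f"estimated slope b1 = {b1:.2f} (true value {TRUE_B1})")
print(f"each extra unit of x multiplies the odds of default by about {odds_ratio:.2f}")
```

The sentence you would give a regulator comes from the last line: a one-unit increase in the feature multiplies the odds of default by exp(b1), with the other features (none here) held fixed.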

Trees, forests, and gradient boosting

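A decision tree splits the feature space recursively, asking yes/no questions: 'Is income > KES 50,000? Is age > 35?' A single deep tree tends to overfit, but ensembles of trees are remarkably powerful. Random forests average many trees, each trained on a bootstrap sample of the data (bagging). Gradient-boosted trees add trees one at a time, each correcting the errors of the trees before it (boosting).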
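How a tree picks a question such as 'Is income > KES 50,000?' can be sketched as a search for the threshold that leaves the two sides as pure as possible. The example below uses Gini impurity, one common purity measure, on ten made-up loan records; the data and the function names gini and best_split are hypothetical, and a real tree would repeat this search over every feature and then recurse into each side.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value > threshold' question."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split at all is the starting point
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        threshold = (lo + hi) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: monthly income in KES thousands, 1 = defaulted.
income = [20, 25, 30, 40, 45, 55, 60, 70, 80, 90]
default = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

threshold, impurity = best_split(income, default)
print(f"best question: is income > {threshold}?  weighted Gini = {impurity:.2f}")
# prints: best question: is income > 50.0?  weighted Gini = 0.16
```

On this toy data the best cut falls at 50, because everyone above it repaid and four of the five borrowers below it defaulted.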

Gradient-boosted tree implementations (XGBoost, LightGBM, CatBoost) have a long record of winning tabular-data Kaggle competitions. For Kenyan credit-scoring problems where your features are income, age, employment, M-Pesa transaction patterns, and loan history, gradient boosting is usually the right answer.
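The core boosting loop is small enough to sketch by hand: start from the mean, then repeatedly fit a tiny tree to the current errors and add a damped version of it to the prediction. The version below assumes squared-error loss, one-split trees (stumps), a single feature, and a made-up target; fit_stump and boost are hypothetical names. It illustrates the idea only and is not how XGBoost, LightGBM, or CatBoost are implemented, since those libraries add regularisation, faster split finding, and many other refinements.

```python
import math


def fit_stump(xs, residuals):
    """One-split regression tree: predict the mean residual on each side of a threshold."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, rounds=30, learning_rate=0.3):
    """Gradient boosting for squared error: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Made-up data: a smooth curve that no single stump can capture.
xs = [i / 10.0 for i in range(40)]
ys = [math.sin(x) for x in xs]
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)

predict, history = boost(xs, ys)
print(f"training MSE: mean only {baseline_mse:.4f}, "
      f"after 1 round {history[0]:.4f}, after {len(history)} rounds {history[-1]:.4f}")
```

The training error falls with every round, which is exactly why boosted models need a held-out set: left unchecked, the same loop will eventually start fitting noise.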

Clustering and dimensionality reduction

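k-means clustering partitions the data into k clusters by assigning each point to its nearest cluster centre and then moving each centre to the mean of its points, repeating until nothing changes. It is useful for customer segmentation. Principal Component Analysis (PCA) projects high-dimensional data onto a smaller number of dimensions chosen to preserve as much of the variance as possible, which is useful for exploring data and as a preprocessing step.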
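The k-means procedure can be sketched in plain Python as follows. The six customers are made up, and the function initialises the centres with the first k points purely to keep the example deterministic; that choice is an assumption of this sketch, and library implementations typically use smarter initialisation.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, iters=20):
    centroids = list(points[:k])  # naive initialisation: an assumption of this sketch
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Hypothetical customers: (purchases per week, average basket in KES thousands).
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
             (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, labels = kmeans(customers, k=2)
print(labels)  # prints: [0, 0, 0, 1, 1, 1]
```

Because k-means relies on Euclidean distance, features measured on very different scales are usually standardised first; otherwise the largest-scale feature dominates the clustering.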

Neural networks — when they're worth it

Neural networks dominate when the input is unstructured: text, images, audio, video. They do not generally beat gradient boosting on tabular data. The exception: very large tabular datasets (millions of rows, hundreds of features) where a network can exploit patterns trees miss.

The simple-model bias

Senior analysts benchmark a simple model first: logistic regression or gradient boosting. If a more complex model beats it by only 1-2 percentage points on held-out data, ship the simple one. The cost of complexity (debugging, monitoring, retraining, regulatory scrutiny) usually outweighs so small an accuracy gain.
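One way to make this rule mechanical is to score both models on the same held-out data and apply an explicit threshold. The sketch below assumes ROC AUC as the metric and a minimum gain of 0.02, taken from the upper end of the 1-2 point range above; both are assumptions rather than fixed rules, and the scores, roc_auc, and choose_model are hypothetical.

```python
def roc_auc(labels, scores):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def choose_model(auc_simple, auc_complex, min_gain=0.02):
    """Prefer the simple model unless the complex one wins by more than min_gain."""
    return "complex" if auc_complex - auc_simple > min_gain else "simple"


# Made-up held-out labels (1 = default) and predicted default probabilities.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
simple_scores = [0.10, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90]

print(roc_auc(labels, simple_scores))  # prints: 0.9375
print(choose_model(0.78, 0.79))        # prints: simple
print(choose_model(0.78, 0.84))        # prints: complex
```

The pairwise AUC here is quadratic in the number of records, which is fine for teaching but too slow for large datasets. The threshold itself is a business decision and should reflect what a point of accuracy is actually worth.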

Bias-variance and overfitting

Every model is a tradeoff between underfitting (too rigid, misses patterns: high bias) and overfitting (too flexible, learns noise: high variance). The remedies for overfitting are more data and regularisation. The way to detect it is cross-validation, together with the discipline to validate on data the model never saw during training.
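The gap between training and held-out performance is easy to demonstrate with a model that memorises its training data. The sketch below uses a 1-nearest-neighbour classifier, chosen here only because it is the simplest fully flexible model, on synthetic data with 15% label noise, and compares accuracy on the training set with 5-fold cross-validated accuracy. All names and numbers are assumptions for illustration.

```python
import random


def one_nn_predict(train, x):
    """Label of the training point closest to x (1-nearest-neighbour)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, test):
    hits = sum(one_nn_predict(train, x) == y for x, y in test)
    return hits / len(test)


def k_fold_accuracy(data, k=5):
    """Average accuracy over k folds, each held out once while the rest train the model."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k


# Synthetic data: the true rule is y = 1 when x > 0.5, with 15% of labels flipped.
random.seed(1)
data = []
for _ in range(200):
    x = random.random()
    y = 1 if x > 0.5 else 0
    if random.random() < 0.15:
        y = 1 - y
    data.append((x, y))

train_acc = accuracy(data, data)  # every point is its own nearest neighbour
cv_acc = k_fold_accuracy(data)    # each point is scored by a model that never saw it

print(f"accuracy on training data: {train_acc:.2f}")
print(f"5-fold cross-validated accuracy: {cv_acc:.2f}")
```

The training accuracy is perfect because the model has memorised the noise. The cross-validated figure is the honest estimate, and the difference between the two is the overfitting.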

Exercise

You are building a credit-scoring model for a Kenyan bank. The dataset is 80,000 loan records with 25 features (income, age, employment, M-Pesa transaction patterns, prior loan history, etc.) and a binary default label. Junior data scientists on the team are pushing to use a deep neural network. Walk through how you would respond: what would you benchmark first, what would you measure, and when (if ever) would the neural network be the right choice?