Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
What will I learn
- You will learn how linear regression works -- mathematically, not just intuitively;
- the cost function (MSE) formalized with proper notation;
- gradient descent with analytically derived gradients;
- how to implement the entire thing in pure NumPy -- no libraries hiding the work;
- how to watch the model converge by tracking loss over epochs;
- feature scaling and why it makes gradient descent dramatically faster;
- train/test splits -- the difference between memorization and real learning;
- what the learned parameters actually mean.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch (this post)
Nine episodes. Nine episodes of building intuition, learning NumPy, turning data into numbers, making predictions from gut feeling, finding patterns, formalizing loss functions, watching training loops converge, and covering the math -- linear algebra in #8, calculus and probability in #9. That was the preparation. This is the payoff.
Today we build our first real machine learning model from absolute scratch. No scikit-learn. No PyTorch. No library doing the heavy lifting behind a .fit() call. Just you, me, and NumPy. Every formula derived, every gradient understood, every line of code doing exactly what we tell it to do. By the end of this episode you'll have a working linear regression implementation that you built yourself and that you can explain to anyone who asks how it works.
And honestly? This is my favourite episode to write so far. Because this is where everything connects. The dot product from episode #8 computes our predictions. The MSE from episode #6 measures our error. The chain rule from episode #9 gives us the gradients. And the training loop from episode #7 ties it all together into a learning machine. All nine episodes, converging into one working model ;-)
Let's go.
The model, formalized
Back in episodes #5 through #7, we fit lines to apartment data. A line has a slope and an intercept: prediction = slope * sqm + intercept. That's linear regression with one feature. But real data has multiple features -- square meters, number of rooms, age of the building, floor level, whether there's a balcony. We need to handle ALL of them at once.
Linear regression predicts a continuous output as a weighted sum of inputs:
prediction = w1*x1 + w2*x2 + ... + wn*xn + b
Or in matrix notation (remember episode #8?):
prediction = Xw + b
Where:
- X is the data matrix, shape (m, n) -- m samples, n features
- w is the weight vector, shape (n,) -- one weight per feature
- b is the bias (scalar) -- the intercept
- prediction is the predictions vector, shape (m,)
Now here's a trick we mentioned briefly in episode #8 that becomes essential here: the bias trick. We append a column of ones to X, which lets us fold the bias into the weight vector. Instead of having Xw + b with separate w and b, we get Xw where the last element of w is the bias (because it multiplies the column of ones).
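The equivalence is easy to check numerically. A quick sketch with made-up numbers (not the apartment data yet):

```python
import numpy as np

# Tiny example: 2 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([10.0, 20.0])
b = 5.0

# Separate weights and bias
pred_separate = X @ w + b

# Bias trick: append a column of ones, fold b into the weight vector
X_aug = np.column_stack([X, np.ones(len(X))])  # shape (2, 3)
w_aug = np.append(w, b)                        # shape (3,)
pred_folded = X_aug @ w_aug

print(np.allclose(pred_separate, pred_folded))  # True
```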
Let's set this up with some synthetic apartment data where we KNOW the true relationship -- just like in episode #7:
```python
import numpy as np

np.random.seed(42)

# Generate data: price = 2500*sqm + 800*rooms - 500*age + 15000 + noise
# We KNOW these true values. The model has to discover them.
m = 100  # samples
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
noise = np.random.randn(m) * 15000
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + noise

# Feature matrix with bias column
X = np.column_stack([sqm, rooms, age, np.ones(m)])  # (100, 4)
y = price  # (100,)

print(f"X shape: {X.shape}")  # (100, 4)
print(f"y shape: {y.shape}")  # (100,)
print(f"Features: sqm, rooms, age, bias")
print(f"\nFirst 3 samples:")
for i in range(3):
    print(f"  {sqm[i]:.0f} sqm, {rooms[i]:.0f} rooms, {age[i]:.0f} yrs old "
          f"-> EUR {price[i]:,.0f}")
```
That np.column_stack with np.ones(m) is the bias trick in action. Our model becomes prediction = Xw where w has 4 elements: three feature weights plus the bias. This simplification makes both the math and the code cleaner -- one matrix multiply handles everything, no special cases for the intercept.
The cost function: Mean Squared Error
We need a way to measure "how wrong is the model, overall?" One single number that captures the total quality of all predictions across all samples. We already know the answer from episode #6 -- Mean Squared Error:
J(w) = (1/m) * sum((yi - prediction_i)^2)
Square the errors (so negatives don't cancel positives), average across all samples, done. Lower is better. The notation J(w) emphasizes that the cost depends on the weights -- change the weights, the predictions change, and the cost changes with them.
```python
def compute_cost(X, y, w):
    m = len(y)
    predictions = X @ w
    cost = (1/m) * np.sum((y - predictions) ** 2)
    return cost

# Initialize weights to zeros -- terrible starting point, but standard
w = np.zeros(X.shape[1])
initial_cost = compute_cost(X, y, w)
print(f"Initial cost (all weights zero): {initial_cost:,.0f}")
print(f"(This is enormous because we're predicting EUR 0 for everything)")
```
With all weights at zero, every prediction is EUR 0. Every apartment priced at zero. The cost is astronomical. That's fine -- the whole point of training is to start terrible and systematically get better. We saw the same pattern in episode #7 when we initialized slope and intercept to zero and watched them crawl toward the true values.
Deriving the gradient -- for real this time
Here's where episode #9's calculus pays off. In episodes #6 and #7 I handed you the gradient formulas and said "trust me, we'll derive them later." Later is now.
We need the partial derivative of J(w) with respect to the weight vector w -- how does the cost change when we nudge each weight? Starting from:
J(w) = (1/m) * sum((yi - Xi*w)^2)
Applying the chain rule (the outer function is squaring, the inner function is the error):
dJ/dw = -(2/m) * X_transpose * (y - Xw)
Let me verify this makes sense piece by piece:
- `(y - Xw)` is the error vector -- how wrong each prediction is, for all m samples
- `X_transpose` is X flipped so its shape goes from (m, n) to (n, m) -- remember the transpose from episode #8
- `X_transpose @ (y - Xw)` correlates each feature column with the errors -- it tells us which features are most "responsible" for the mistakes
- The result is a vector with one element per weight -- each element says "nudge this weight in this direction to reduce the cost"
This is the exact same X.T @ errors expression we saw at the end of episode #8. Back then I told you it computes gradients for all weights simultaneously. Now you can see why -- the matrix multiplication does a dot product between each feature column and the error vector, and those dot products ARE the partial derivatives.
Let's implement it:
```python
def compute_gradient(X, y, w):
    m = len(y)
    predictions = X @ w
    errors = y - predictions
    gradient = -(2/m) * (X.T @ errors)
    return gradient

# Check gradient at initial weights
grad = compute_gradient(X, y, w)
print(f"Gradient shape: {grad.shape}")  # (4,)
print(f"Gradient values: {grad}")
```
But how do we know our analytical gradient formula is correct? We verify it numerically. Remember the central difference method from episode #9? Nudge one weight slightly in both directions and measure how the cost changes:
```python
# Numerical gradient check for weight 0 (sqm weight)
epsilon = 1e-5
w_plus = w.copy()
w_plus[0] += epsilon
w_minus = w.copy()
w_minus[0] -= epsilon
numerical_grad = (compute_cost(X, y, w_plus) - compute_cost(X, y, w_minus)) / (2 * epsilon)
analytical_grad = compute_gradient(X, y, w)[0]

print(f"Gradient check for w[0] (sqm weight):")
print(f"  Numerical:  {numerical_grad:,.2f}")
print(f"  Analytical: {analytical_grad:,.2f}")
print(f"  Difference: {abs(numerical_grad - analytical_grad):,.6f}")

# Check ALL weights
numerical_grads = np.zeros_like(w)
for i in range(len(w)):
    wp = w.copy()
    wp[i] += epsilon
    wm = w.copy()
    wm[i] -= epsilon
    numerical_grads[i] = (compute_cost(X, y, wp) - compute_cost(X, y, wm)) / (2 * epsilon)

analytical_grads = compute_gradient(X, y, w)
print(f"\nFull gradient check:")
for i, name in enumerate(["sqm", "rooms", "age", "bias"]):
    diff = abs(numerical_grads[i] - analytical_grads[i])
    print(f"  {name:>5s}: numerical={numerical_grads[i]:>12.2f} "
          f"analytical={analytical_grads[i]:>12.2f} diff={diff:.6f}")
```
If the numerical and analytical gradients match (difference less than 0.01 or so), our math is correct. This gradient checking technique is essential -- always verify your gradients when building models from scratch. I cannot stress this enough. I've had bugs in gradient code that produced plausible-looking training curves (loss going down, parameters moving) but converged to wrong values. The numerical check catches those bugs. It's slow (two cost evaluations per weight), so you'd never use it during actual training, but as a verification tool it's invaluable ;-)
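That loop can be packaged into a small reusable helper. A sketch -- the `check_gradient` name and the 0.01 default tolerance are my choices, based on the rule of thumb above:

```python
import numpy as np

def check_gradient(cost_fn, grad_fn, w, epsilon=1e-5, tol=1e-2):
    """Compare an analytical gradient against central differences.
    Returns True when every component agrees within tol."""
    analytical = grad_fn(w)
    numerical = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += epsilon
        wm[i] -= epsilon
        numerical[i] = (cost_fn(wp) - cost_fn(wm)) / (2 * epsilon)
    return bool(np.all(np.abs(numerical - analytical) < tol))

# Self-test on a function with a known gradient: f(w) = sum(w^2)
cost = lambda w: np.sum(w ** 2)
grad = lambda w: 2 * w  # analytical gradient of sum(w^2)
print(check_gradient(cost, grad, np.array([1.0, -3.0, 2.5])))  # True
```

A deliberately wrong gradient function (say, `lambda w: 3 * w`) would make the check return False, which is exactly the silent-bug scenario it exists to catch.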
Gradient descent: the full implementation
Now we combine everything into the complete training loop. If you've been following since episode #7, this will feel familiar -- same four-step structure, just with more features and proper matrix notation:
```python
def gradient_descent(X, y, learning_rate=1e-9, n_epochs=1000):
    m, n = X.shape
    w = np.zeros(n)
    history = []
    for epoch in range(n_epochs):
        # STEP 1: Forward pass -- predict with current weights
        predictions = X @ w
        # STEP 2: Compute cost (MSE)
        cost = (1/m) * np.sum((y - predictions) ** 2)
        history.append(cost)
        # STEP 3: Compute gradient (calculus + linear algebra)
        gradient = -(2/m) * (X.T @ (y - predictions))
        # STEP 4: Update weights -- step downhill
        w = w - learning_rate * gradient
        if epoch % 200 == 0:
            print(f"Epoch {epoch:>5d}  cost={cost:>14,.0f}  "
                  f"w={np.array2string(w, precision=1, separator=', ')}")
    return w, history

# Train!
print("Training linear regression from scratch:\n")
w_learned, history = gradient_descent(X, y, learning_rate=1e-9, n_epochs=1000)

print(f"\nLearned weights:")
print(f"  sqm:   {w_learned[0]:>8.1f}  (true: 2500)")
print(f"  rooms: {w_learned[1]:>8.1f}  (true: 800)")
print(f"  age:   {w_learned[2]:>8.1f}  (true: -500)")
print(f"  bias:  {w_learned[3]:>8.1f}  (true: 15000)")
```
Run it and watch the cost column. That same characteristic curve from episode #7 -- steep drop followed by a long flat tail. The weights will move toward the true values but probably won't hit them exactly. Two reasons: (a) we only have 100 noisy samples, so the best the model can do still has residual error from the noise, and (b) the learning rate is tiny at 1e-9, so convergence is slow.
Which brings us to a practical problem you've probably already noticed.
Feature scaling: why it matters (a lot)
That learning rate of 1e-9 is ridiculously small. Why can't we use something bigger like 0.01? Because our features have wildly different scales. Square meters ranges from 30 to 150. Rooms ranges from 1 to 5. Age ranges from 0 to 40. The sqm feature dominates the gradient because its values are the largest -- so the gradient component for sqm is huge compared to rooms. Using a learning rate large enough for rooms would cause sqm to overshoot and diverge. Using a rate small enough for sqm makes rooms and age learn painfully slowly.
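You can see the imbalance directly by printing the per-feature gradient magnitudes at the zero-weight starting point. A standalone sketch that regenerates the same synthetic data (seed 42) so it runs on its own:

```python
import numpy as np

np.random.seed(42)
m = 100
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + np.random.randn(m) * 15000

X = np.column_stack([sqm, rooms, age, np.ones(m)])
w = np.zeros(4)

# Gradient at the all-zeros starting point: errors are just the prices
gradient = -(2 / m) * (X.T @ (price - X @ w))

# The sqm component dwarfs the others because sqm has the largest raw values
for name, g in zip(["sqm", "rooms", "age", "bias"], gradient):
    print(f"{name:>5s}: |gradient| = {abs(g):,.0f}")
```

The sqm component comes out more than an order of magnitude larger than the rooms component, which is why a single learning rate cannot serve both features at their raw scales.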
The fix: standardize each feature to zero mean and unit variance. After scaling, all features live on roughly the same range (centered around 0 with standard deviation 1), and a single learning rate works equally well for all of them.
```python
# Standardize features (NOT the bias column -- that stays as ones)
means = X[:, :3].mean(axis=0)
stds = X[:, :3].std(axis=0)
X_scaled = X.copy()
X_scaled[:, :3] = (X[:, :3] - means) / stds

print(f"Before scaling - feature ranges:")
print(f"  sqm:   {X[:, 0].min():.0f} to {X[:, 0].max():.0f}")
print(f"  rooms: {X[:, 1].min():.0f} to {X[:, 1].max():.0f}")
print(f"  age:   {X[:, 2].min():.0f} to {X[:, 2].max():.0f}")

print(f"\nAfter scaling - feature ranges:")
print(f"  sqm:   {X_scaled[:, 0].min():.2f} to {X_scaled[:, 0].max():.2f}")
print(f"  rooms: {X_scaled[:, 1].min():.2f} to {X_scaled[:, 1].max():.2f}")
print(f"  age:   {X_scaled[:, 2].min():.2f} to {X_scaled[:, 2].max():.2f}")

# NOW we can use a much larger learning rate
print(f"\nTraining with scaled features:\n")
w_scaled, hist_scaled = gradient_descent(X_scaled, y, learning_rate=0.01, n_epochs=1000)
```
Look at the difference. With scaled features, we jumped from learning_rate=1e-9 to learning_rate=0.01 -- seven orders of magnitude bigger! And the model converges WAY faster because all features contribute equally to the gradient. No single feature is drowning out the others.
This is not optional knowledge. Feature scaling is something you'll do in essentially every ML project. When someone's model "isn't learning," bad scaling is one of the first things to check. The mechanics are simple (subtract mean, divide by standard deviation), but forgetting to do it will waste hours of debugging time. I've been there. More than once ;-)
The learned weights will be in "scaled space" -- they correspond to the standardized features, not the original ones. To interpret them in original units, you need to convert back. We'll do that in a moment.
Convergence analysis: are we there yet?
Let's look at how the cost evolves during training. This is the same convergence analysis we did in episode #7, but now with a proper multi-feature model:
```python
history = np.array(hist_scaled)

print("Convergence analysis:\n")
checkpoints = [0, 10, 50, 100, 200, 500, 999]
for i in checkpoints:
    reduction = (1 - history[i] / history[0]) * 100
    print(f"  Epoch {i:>4d}: cost={history[i]:>14,.0f} "
          f"({reduction:>5.1f}% reduction from start)")

# How many epochs to reach 99% of total improvement?
total_improvement = history[0] - history[-1]
for i, cost in enumerate(history):
    if history[0] - cost >= 0.99 * total_improvement:
        print(f"\n99% of improvement reached at epoch {i}")
        break
```
You'll see the same pattern from episode #7: 90% of the improvement happens in the first 10-20% of epochs. The rest is fine-tuning. The loss drops sharply early on when the parameters are far from optimal (big gradients, big steps), then flattens out as we approach the minimum (small gradients, tiny steps). This is why early stopping works -- you almost never need to run to full convergence. The practical improvement from epoch 200 to epoch 1000 is often negligible compared to what happened in the first 100.
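As a sketch of what early stopping could look like in this loop -- the stopping rule here (`min_rel_improvement`, `patience`) is a hypothetical choice of mine, not something prescribed by the series:

```python
import numpy as np

def train_with_early_stopping(X, y, lr=0.01, max_epochs=5000,
                              min_rel_improvement=1e-6, patience=5):
    """Stop once the relative cost improvement stays below a threshold
    for `patience` consecutive epochs (illustrative criterion)."""
    m, n = X.shape
    w = np.zeros(n)
    prev_cost = np.inf
    stale = 0
    for epoch in range(max_epochs):
        preds = X @ w
        cost = np.mean((y - preds) ** 2)
        w -= lr * (-(2 / m) * (X.T @ (y - preds)))
        if prev_cost - cost < min_rel_improvement * prev_cost:
            stale += 1
            if stale >= patience:
                return w, epoch
        else:
            stale = 0
        prev_cost = cost
    return w, max_epochs

# Demo on a tiny well-scaled problem: y = 3*x + 1 + noise
np.random.seed(0)
X = np.column_stack([np.random.randn(200), np.ones(200)])
y = 3.0 * X[:, 0] + 1.0 + 0.1 * np.random.randn(200)
w, stopped_at = train_with_early_stopping(X, y)
print(f"stopped at epoch {stopped_at}, weights = {w.round(2)}")
```

On this toy problem training halts well before the `max_epochs` cap, with the weights already at the values the noise allows.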
Train/test split: the real test
So far we've been training and evaluating on the same data. But as we discussed way back in episodes #4 and #7 -- that's cheating. A model could memorize every data point perfectly and still be useless on new data it hasn't seen. The true measure of a model is generalization: does it work on fresh data?
```python
# Shuffle and split: 80% train, 20% test
indices = np.random.permutation(m)
split = int(0.8 * m)
X_train = X_scaled[indices[:split]]
y_train = y[indices[:split]]
X_test = X_scaled[indices[split:]]
y_test = y[indices[split:]]

print(f"Train: {X_train.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")

# Train only on training data
w_final, _ = gradient_descent(X_train, y_train, learning_rate=0.01, n_epochs=1000)

# Evaluate on BOTH sets
train_preds = X_train @ w_final
test_preds = X_test @ w_final
train_mae = np.abs(y_train - train_preds).mean()
test_mae = np.abs(y_test - test_preds).mean()
train_mse = ((y_train - train_preds) ** 2).mean()
test_mse = ((y_test - test_preds) ** 2).mean()

print(f"\nTrain MAE: EUR {train_mae:,.0f}")
print(f"Test MAE:  EUR {test_mae:,.0f}")
print(f"Train MSE: {train_mse:,.0f}")
print(f"Test MSE:  {test_mse:,.0f}")

# The ratio tells you about overfitting
ratio = test_mae / train_mae
print(f"\nTest/Train MAE ratio: {ratio:.3f}")
print(f"(Close to 1.0 = good generalization)")
```
If the test MAE is close to the train MAE, the model generalized well -- it learned the underlying pattern (price depends on sqm, rooms, age) rather than memorizing the specific noise in the training data. For a simple linear model with 100 samples and a linear ground truth, this should work nicely. The test/train MAE ratio should be close to 1.0.
If the ratio were significantly above 1 -- say, test MAE is 2x or 3x the train MAE -- that would be a red flag for overfitting. With linear regression on well-scaled data, overfitting is rare because the model is so simple (only 4 parameters for 100 samples). But when we move to more complex models -- decision trees, neural networks, models with thousands or millions of parameters -- overfitting becomes the central challenge. We're planting a seed here that we'll come back to again and again.
What the weights actually mean
One of the biggest advantages of linear regression over more complex models is interpretability. The weights have a direct, concrete meaning: each weight quantifies the effect of one feature on the prediction.
But our weights are currently in "scaled space" because we standardized the features. To interpret them in original units, we need to convert back:
```python
# Convert scaled weights back to original feature space
# For each feature: w_original = w_scaled / std
# For the bias: it absorbs the mean-shifting
w_original = np.zeros(4)
w_original[:3] = w_scaled[:3] / stds
w_original[3] = w_scaled[3] - np.sum(w_scaled[:3] * means / stds)

print("Interpreted weights (original scale):\n")
print(f"  Each additional sqm adds:   EUR {w_original[0]:>+,.0f}")
print(f"  Each additional room adds:  EUR {w_original[1]:>+,.0f}")
print(f"  Each year of age subtracts: EUR {w_original[2]:>+,.0f}")
print(f"  Base price:                 EUR {w_original[3]:>+,.0f}")

print(f"\nTrue values for comparison:")
print(f"  sqm coefficient:   2500")
print(f"  rooms coefficient: 800")
print(f"  age coefficient:   -500")
print(f"  base price:        15000")
```
This is interpretability. You can explain to a non-technical person: "each extra square meter adds roughly EUR 2,500 to the predicted price, each extra room adds about EUR 800, and each year of age reduces the price by about EUR 500." Try doing that with a neural network that has 10 million parameters. You can't. (Well, you can try -- there are interpretability techniques -- but it's orders of magnitude harder.)
This interpretability is why linear regression is still heavily used in fields like economics, medicine, and social science, even though "fancier" models exist. When you need to understand why a model makes a prediction, not just what it predicts, linear regression delivers that transparency for free.
Packaging it: a clean reusable class
Let's take everything we've built and package it into a clean class. This is not just about code organization -- it's about building the habit of creating reusable models with a .fit() and .predict() interface. Every ML library you'll ever use follows this pattern:
```python
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_epochs=1000):
        self.lr = learning_rate
        self.n_epochs = n_epochs
        self.weights = None
        self.means = None
        self.stds = None
        self.history = []

    def _scale(self, X):
        """Standardize features to zero mean, unit variance."""
        return (X - self.means) / self.stds

    def fit(self, X, y):
        # Compute scaling parameters from training data
        self.means = X.mean(axis=0)
        self.stds = X.std(axis=0)
        self.stds[self.stds == 0] = 1.0  # avoid division by zero
        # Scale and add bias column
        X_s = self._scale(X)
        X_b = np.column_stack([X_s, np.ones(len(X_s))])
        m, n = X_b.shape
        self.weights = np.zeros(n)
        self.history = []
        for epoch in range(self.n_epochs):
            preds = X_b @ self.weights
            cost = np.mean((y - preds) ** 2)
            self.history.append(cost)
            gradient = -(2/m) * (X_b.T @ (y - preds))
            self.weights -= self.lr * gradient
        return self

    def predict(self, X):
        X_s = self._scale(X)
        X_b = np.column_stack([X_s, np.ones(len(X_s))])
        return X_b @ self.weights

# Usage -- raw features, no manual scaling needed
raw_features = np.column_stack([sqm, rooms, age])
model = LinearRegression(learning_rate=0.01, n_epochs=500)
model.fit(raw_features, y)
preds = model.predict(raw_features)
mae = np.abs(y - preds).mean()
print(f"Model MAE: EUR {mae:,.0f}")
print(f"Final cost: {model.history[-1]:,.0f}")
print(f"Weights: {model.weights.round(2)}")
```
Notice something? This looks a LOT like how scikit-learn works. model.fit(X, y) to train, model.predict(X) to predict. That's not a coincidence -- scikit-learn's LinearRegression does essentially the same thing under the hood (though it uses a closed-form solution instead of gradient descent by default, which we'll discuss in the next episode). When we start using scikit-learn later in this series, you'll already know what .fit() does internally. No black box. No magic. Just the math we built today.
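As a tiny preview of that closed-form route (the details wait for the next episode), NumPy ships a least-squares solver that fits the unscaled feature matrix directly -- no learning rate, no epochs. A sketch on the same synthetic data, regenerated here so it runs standalone:

```python
import numpy as np

np.random.seed(42)
m = 100
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + np.random.randn(m) * 15000

X = np.column_stack([sqm, rooms, age, np.ones(m)])

# Closed-form least squares: minimizes ||Xw - y||^2 in one call,
# no feature scaling needed
w_closed, *_ = np.linalg.lstsq(X, price, rcond=None)
for name, true_v, w_v in zip(["sqm", "rooms", "age", "bias"],
                             [2500, 800, -500, 15000], w_closed):
    print(f"{name:>5s}: {w_v:>10.1f} (true: {true_v})")
```

Because the solver minimizes the training MSE exactly, its fit is at least as good on this data as the true generating weights (the noise guarantees it can't be worse).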
Evaluating predictions: a closer look
Let's see how our model does on individual predictions. Sometimes aggregate numbers like MAE hide important information about where the model struggles:
```python
# Train on 80%, test on 20%
raw_train = raw_features[indices[:split]]
raw_test = raw_features[indices[split:]]
y_tr = y[indices[:split]]
y_te = y[indices[split:]]

model2 = LinearRegression(learning_rate=0.01, n_epochs=1000)
model2.fit(raw_train, y_tr)
test_predictions = model2.predict(raw_test)
test_errors = y_te - test_predictions

print("Individual test predictions:\n")
print(f"{'Sqm':>5s} {'Rooms':>5s} {'Age':>5s} {'Actual':>10s} "
      f"{'Predicted':>10s} {'Error':>10s}")
print("-" * 55)
for i in range(min(10, len(y_te))):
    s, r, a = raw_test[i]
    print(f"{s:>5.0f} {r:>5.0f} {a:>5.0f} EUR {y_te[i]:>9,.0f} "
          f"EUR {test_predictions[i]:>9,.0f} {test_errors[i]:>+10,.0f}")

# Error distribution
print(f"\nError statistics:")
print(f"  Mean error:       EUR {test_errors.mean():>+,.0f} (should be near 0)")
print(f"  Std of errors:    EUR {test_errors.std():>,.0f}")
print(f"  Max overpredict:  EUR {test_errors.min():>,.0f}")
print(f"  Max underpredict: EUR {test_errors.max():>,.0f}")
```
The mean error should be close to zero -- meaning the model doesn't systematically over-predict or under-predict. The standard deviation of errors gives you the typical spread. Individual errors can be large (remember, our data has EUR 15,000 standard deviation of noise built in), but if they scatter randomly around zero, the model is doing its job. It captured the signal and left behind the noise, which is exactly what we want.
The complete model in perspective
Let me put what we built today into perspective. The core of linear regression is remarkably simple:
- Prediction: `y_hat = X @ w` (matrix multiply from episode #8)
- Cost: `J = mean((y - y_hat)^2)` (MSE from episode #6)
- Gradient: `dJ/dw = -(2/m) * X.T @ (y - y_hat)` (calculus from episode #9)
- Update: `w = w - lr * gradient` (gradient descent from episodes #6 and #7)
Four lines in a loop. That's it. The same four-step skeleton from episode #7, now generalized with linear algebra to handle any number of features. The model itself is just a weight vector -- nothing else. Training finds good weights. Prediction is a single matrix multiply. Apart from feature scaling (which is standard practice), there's really nothing else to it.
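Those four lines, condensed into a self-contained toy run (the standardized data here is made up for this snippet):

```python
import numpy as np

np.random.seed(1)
# Toy standardized data: y = 4*x1 - 2*x2 + 0.5 + small noise
X = np.column_stack([np.random.randn(50), np.random.randn(50), np.ones(50)])
y = X @ np.array([4.0, -2.0, 0.5]) + 0.01 * np.random.randn(50)

w, lr, m = np.zeros(3), 0.1, len(y)
for _ in range(500):
    y_hat = X @ w                          # 1. predict
    cost = np.mean((y - y_hat) ** 2)       # 2. measure (MSE)
    grad = -(2 / m) * (X.T @ (y - y_hat))  # 3. gradient
    w -= lr * grad                         # 4. step downhill

print(f"weights: {w.round(2)}")  # close to [4.0, -2.0, 0.5]
print(f"final cost: {cost:.6f}")
```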
And yet this "simple" model is used EVERYWHERE. Predicting house prices (exactly what we just did). Estimating ad click-through rates. Forecasting sales revenue. Analyzing drug dosage effects. Measuring the impact of policy changes. Anywhere you have a roughly linear relationship between features and a continuous target, linear regression is the tool of first choice. Not because it's the most powerful model -- it isn't -- but because it's fast, interpretable, hard to screw up, and often good enough.
Having said that, "good enough" is relative. Linear regression assumes the relationship is actually linear. If the true relationship is curved, periodic, or has complex interactions between features, a straight line (or hyperplane in higher dimensions) won't capture it. We'll encounter this limitation soon and learn models that handle nonlinearity -- decision trees, random forests, and eventually neural networks. But we'll always come back to linear regression as the baseline. If a fancy model can't beat linear regression on your data, you need to seriously question whether the fancy model is worth its complexity.
Let's recap
We built a complete linear regression model from scratch today. No shortcuts, no libraries hiding the work. Let me summarize every key concept:
- Linear regression predicts `y_hat = Xw` -- a weighted sum of features (the bias trick absorbs the intercept into the weight vector);
- The cost function (MSE) measures average squared prediction error -- a single number that captures overall model quality;
- The gradient `dJ/dw = -(2/m) * X.T @ (y - Xw)` tells us exactly how to adjust each weight to reduce the cost -- derived from calculus, verified numerically;
- Always verify gradients numerically when building from scratch -- if analytical and numerical gradients disagree, your math is wrong and training will silently produce garbage;
- Feature scaling (standardization) puts all features on equal footing, allowing much larger learning rates and dramatically faster convergence;
- Train/test split is mandatory -- evaluating only on training data is meaningless because the model could be memorizing instead of learning;
- Linear regression weights are directly interpretable: each weight quantifies a feature's effect on the prediction, in units you can explain to anyone;
- The `.fit()` / `.predict()` pattern we built here is the universal interface for ML models -- every library follows it.
Next episode, we'll take this further. There are things we glossed over today -- the normal equation (an analytical solution that skips gradient descent entirely), R-squared (a more intuitive measure than MSE), and what happens when you apply linear regression to real-world messy data where the assumptions don't hold perfectly. The model we built today is the engine. Next time, we learn to drive it properly.
Thanks, and until next time! Questions? Feel free to leave them in the comments ;-)