Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
What will I learn
- You will learn how linear regression works -- mathematically, not just intuitively;
- the cost function (MSE) formalized with proper notation;
- gradient descent with analytically derived gradients;
- how to implement the entire thing in pure NumPy -- no libraries hiding the work;
- how to watch the model converge by tracking loss over epochs;
- feature scaling and why it makes gradient descent dramatically faster;
- train/test splits -- the difference between memorization and real learning;
- what the learned parameters actually mean.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch (this post)
Nine episodes. Nine episodes of building intuition, learning NumPy, turning data into numbers, making predictions from gut feeling, finding patterns, formalizing loss functions, watching training loops converge, and covering the math -- linear algebra in #8, calculus and probability in #9. That was the preparation. This is the payoff.
Today we build our first real machine learning model from absolute scratch. No scikit-learn. No PyTorch. No library doing the heavy lifting behind a .fit() call. Just you, me, and NumPy. Every formula derived, every gradient understood, every line of code doing exactly what we tell it to do. By the end of this episode you'll have a working linear regression implementation that you built yourself and that you can explain to anyone who asks how it works.
And honestly? This is my favourite episode to write so far. Because this is where everything connects. The dot product from episode #8 computes our predictions. The MSE from episode #6 measures our error. The chain rule from episode #9 gives us the gradients. And the training loop from episode #7 ties it all together into a learning machine. All nine episodes, converging into one working model ;-)
Let's go.
The model, formalized
Back in episodes #5 through #7, we fit lines to apartment data. A line has a slope and an intercept: prediction = slope * sqm + intercept. That's linear regression with one feature. But real data has multiple features -- square meters, number of rooms, age of the building, floor level, whether there's a balcony. We need to handle ALL of them at once.
Linear regression predicts a continuous output as a weighted sum of inputs:
prediction = w1*x1 + w2*x2 + ... + wn*xn + b
Or in matrix notation (remember episode #8?):
prediction = Xw + b
Where:
- X is the data matrix, shape (m, n) -- m samples, n features
- w is the weight vector, shape (n,) -- one weight per feature
- b is the bias (scalar) -- the intercept
- prediction is the predictions vector, shape (m,)
Now here's a trick we mentioned briefly in episode #8 that becomes essential here: the bias trick. We append a column of ones to X, which lets us fold the bias into the weight vector. Instead of having Xw + b with separate w and b, we get Xw where the last element of w is the bias (because it multiplies the column of ones).
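The equivalence is easy to check numerically. A quick sketch with made-up numbers (not the apartment data yet):

```python
import numpy as np

# Tiny example: 2 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([10.0, 20.0])
b = 5.0

# Separate weights and bias
pred_separate = X @ w + b

# Bias trick: append a column of ones, fold b into the weight vector
X_aug = np.column_stack([X, np.ones(len(X))])  # shape (2, 3)
w_aug = np.append(w, b)                        # shape (3,)
pred_folded = X_aug @ w_aug

print(np.allclose(pred_separate, pred_folded))  # True
```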
Let's set this up with some synthetic apartment data where we KNOW the true relationship -- just like in episode #7:
```python
import numpy as np

np.random.seed(42)

# Generate data: price = 2500*sqm + 800*rooms - 500*age + 15000 + noise
# We KNOW these true values. The model has to discover them.
m = 100  # samples
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
noise = np.random.randn(m) * 15000
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + noise

# Feature matrix with bias column
X = np.column_stack([sqm, rooms, age, np.ones(m)])  # (100, 4)
y = price  # (100,)

print(f"X shape: {X.shape}")  # (100, 4)
print(f"y shape: {y.shape}")  # (100,)
print(f"Features: sqm, rooms, age, bias")
print(f"\nFirst 3 samples:")
for i in range(3):
    print(f"  {sqm[i]:.0f} sqm, {rooms[i]:.0f} rooms, {age[i]:.0f} yrs old "
          f"-> EUR {price[i]:,.0f}")
```
That np.column_stack with np.ones(m) is the bias trick in action. Our model becomes prediction = Xw where w has 4 elements: three feature weights plus the bias. This simplification makes both the math and the code cleaner -- one matrix multiply handles everything, no special cases for the intercept.
The cost function: Mean Squared Error
We need a way to measure "how wrong is the model, overall?" One single number that captures the total quality of all predictions across all samples. We already know the answer from episode #6 -- Mean Squared Error:
J(w) = (1/m) * sum((yi - prediction_i)^2)
Square the errors (so negatives don't cancel positives), average across all samples, done. Lower is better. The notation J(w) emphasizes that the cost depends on the weights -- change the weights, the predictions change, and the cost changes with them.
```python
def compute_cost(X, y, w):
    m = len(y)
    predictions = X @ w
    cost = (1/m) * np.sum((y - predictions) ** 2)
    return cost

# Initialize weights to zeros -- terrible starting point, but standard
w = np.zeros(X.shape[1])
initial_cost = compute_cost(X, y, w)
print(f"Initial cost (all weights zero): {initial_cost:,.0f}")
print(f"(This is enormous because we're predicting EUR 0 for everything)")
```
With all weights at zero, every prediction is EUR 0. Every apartment priced at zero. The cost is astronomical. That's fine -- the whole point of training is to start terrible and systematically get better. We saw the same pattern in episode #7 when we initialized slope and intercept to zero and watched them crawl toward the true values.
Deriving the gradient -- for real this time
Here's where episode #9's calculus pays off. In episodes #6 and #7 I handed you the gradient formulas and said "trust me, we'll derive them later." Later is now.
We need the partial derivative of J(w) with respect to the weight vector w -- how does the cost change when we nudge each weight? Starting from:
J(w) = (1/m) * sum((yi - Xi*w)^2)
Applying the chain rule (the outer function is squaring, the inner function is the error):
dJ/dw = -(2/m) * X_transpose * (y - Xw)
Let me verify this makes sense piece by piece:
- `(y - Xw)` is the error vector -- how wrong each prediction is, for all m samples
- `X_transpose` is X flipped so its shape goes from (m, n) to (n, m) -- remember the transpose from episode #8
- `X_transpose @ (y - Xw)` correlates each feature column with the errors -- it tells us which features are most "responsible" for the mistakes
- The result is a vector with one element per weight -- each element says "nudge this weight in this direction to reduce the cost"
This is the exact same X.T @ errors expression we saw at the end of episode #8. Back then I told you it computes gradients for all weights simultaneously. Now you can see why -- the matrix multiplication does a dot product between each feature column and the error vector, and those dot products ARE the partial derivatives.
Let's implement it:
```python
def compute_gradient(X, y, w):
    m = len(y)
    predictions = X @ w
    errors = y - predictions
    gradient = -(2/m) * (X.T @ errors)
    return gradient

# Check gradient at initial weights
grad = compute_gradient(X, y, w)
print(f"Gradient shape: {grad.shape}")  # (4,)
print(f"Gradient values: {grad}")
```
But how do we know our analytical gradient formula is correct? We verify it numerically. Remember the central difference method from episode #9? Nudge one weight slightly in both directions and measure how the cost changes:
```python
# Numerical gradient check for weight 0 (sqm weight)
epsilon = 1e-5
w_plus = w.copy()
w_plus[0] += epsilon
w_minus = w.copy()
w_minus[0] -= epsilon
numerical_grad = (compute_cost(X, y, w_plus) - compute_cost(X, y, w_minus)) / (2 * epsilon)
analytical_grad = compute_gradient(X, y, w)[0]

print(f"Gradient check for w[0] (sqm weight):")
print(f"  Numerical:  {numerical_grad:,.2f}")
print(f"  Analytical: {analytical_grad:,.2f}")
print(f"  Difference: {abs(numerical_grad - analytical_grad):,.6f}")

# Check ALL weights
numerical_grads = np.zeros_like(w)
for i in range(len(w)):
    wp = w.copy()
    wp[i] += epsilon
    wm = w.copy()
    wm[i] -= epsilon
    numerical_grads[i] = (compute_cost(X, y, wp) - compute_cost(X, y, wm)) / (2 * epsilon)

analytical_grads = compute_gradient(X, y, w)
print(f"\nFull gradient check:")
for i, name in enumerate(["sqm", "rooms", "age", "bias"]):
    diff = abs(numerical_grads[i] - analytical_grads[i])
    print(f"  {name:>5s}: numerical={numerical_grads[i]:>12.2f} "
          f"analytical={analytical_grads[i]:>12.2f} diff={diff:.6f}")
```
If the numerical and analytical gradients match (difference less than 0.01 or so), our math is correct. This gradient checking technique is essential -- always verify your gradients when building models from scratch. I cannot stress this enough. I've had bugs in gradient code that produced plausible-looking training curves (loss going down, parameters moving) but converged to wrong values. The numerical check catches those bugs. It's slow (two cost evaluations per weight), so you'd never use it during actual training, but as a verification tool it's invaluable ;-)
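That loop can be packaged into a small reusable helper. A sketch -- the `check_gradient` name and the 0.01 default tolerance are my choices, based on the rule of thumb above:

```python
import numpy as np

def check_gradient(cost_fn, grad_fn, w, epsilon=1e-5, tol=1e-2):
    """Compare an analytical gradient against central differences.
    Returns True when every component agrees within tol."""
    analytical = grad_fn(w)
    numerical = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += epsilon
        wm[i] -= epsilon
        numerical[i] = (cost_fn(wp) - cost_fn(wm)) / (2 * epsilon)
    return bool(np.all(np.abs(numerical - analytical) < tol))

# Self-test on a function with a known gradient: f(w) = sum(w^2)
cost = lambda w: np.sum(w ** 2)
grad = lambda w: 2 * w  # analytical gradient of sum(w^2)
print(check_gradient(cost, grad, np.array([1.0, -3.0, 2.5])))  # True
```

A deliberately wrong gradient function (say, `lambda w: 3 * w`) would make the check return False, which is exactly the silent-bug scenario it exists to catch.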
Gradient descent: the full implementation
Now we combine everything into the complete training loop. If you've been following since episode #7, this will feel familiar -- same four-step structure, just with more features and proper matrix notation:
```python
def gradient_descent(X, y, learning_rate=1e-9, n_epochs=1000):
    m, n = X.shape
    w = np.zeros(n)
    history = []
    for epoch in range(n_epochs):
        # STEP 1: Forward pass -- predict with current weights
        predictions = X @ w
        # STEP 2: Compute cost (MSE)
        cost = (1/m) * np.sum((y - predictions) ** 2)
        history.append(cost)
        # STEP 3: Compute gradient (calculus + linear algebra)
        gradient = -(2/m) * (X.T @ (y - predictions))
        # STEP 4: Update weights -- step downhill
        w = w - learning_rate * gradient
        if epoch % 200 == 0:
            print(f"Epoch {epoch:>5d}  cost={cost:>14,.0f}  "
                  f"w={np.array2string(w, precision=1, separator=', ')}")
    return w, history

# Train!
print("Training linear regression from scratch:\n")
w_learned, history = gradient_descent(X, y, learning_rate=1e-9, n_epochs=1000)

print(f"\nLearned weights:")
print(f"  sqm:   {w_learned[0]:>8.1f}  (true: 2500)")
print(f"  rooms: {w_learned[1]:>8.1f}  (true: 800)")
print(f"  age:   {w_learned[2]:>8.1f}  (true: -500)")
print(f"  bias:  {w_learned[3]:>8.1f}  (true: 15000)")
```
Run it and watch the cost column. That same characteristic curve from episode #7 -- steep drop followed by a long flat tail. The weights will move toward the true values but probably won't hit them exactly. Two reasons: (a) we only have 100 noisy samples, so the best the model can do still has residual error from the noise, and (b) the learning rate is tiny at 1e-9, so convergence is slow.
Which brings us to a practical problem you've probably already noticed.
Feature scaling: why it matters (a lot)
That learning rate of 1e-9 is ridiculously small. Why can't we use something bigger like 0.01? Because our features have wildly different scales. Square meters ranges from 30 to 150. Rooms ranges from 1 to 5. Age ranges from 0 to 40. The sqm feature dominates the gradient because its values are the largest -- so the gradient component for sqm is huge compared to rooms. Using a learning rate large enough for rooms would cause sqm to overshoot and diverge. Using a rate small enough for sqm makes rooms and age learn painfully slowly.
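You can see the imbalance directly by printing the per-feature gradient magnitudes at the zero-weight starting point. A standalone sketch that regenerates the same synthetic data (seed 42) so it runs on its own:

```python
import numpy as np

np.random.seed(42)
m = 100
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + np.random.randn(m) * 15000

X = np.column_stack([sqm, rooms, age, np.ones(m)])
w = np.zeros(4)

# Gradient at the all-zeros starting point: errors are just the prices
gradient = -(2 / m) * (X.T @ (price - X @ w))

# The sqm component dwarfs the others because sqm has the largest raw values
for name, g in zip(["sqm", "rooms", "age", "bias"], gradient):
    print(f"{name:>5s}: |gradient| = {abs(g):,.0f}")
```

The sqm component comes out more than an order of magnitude larger than the rooms component, which is why a single learning rate cannot serve both features at their raw scales.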
The fix: standardize each feature to zero mean and unit variance. After scaling, all features live on roughly the same range (centered around 0 with standard deviation 1), and a single learning rate works equally well for all of them.
```python
# Standardize features (NOT the bias column -- that stays as ones)
means = X[:, :3].mean(axis=0)
stds = X[:, :3].std(axis=0)
X_scaled = X.copy()
X_scaled[:, :3] = (X[:, :3] - means) / stds

print(f"Before scaling - feature ranges:")
print(f"  sqm:   {X[:, 0].min():.0f} to {X[:, 0].max():.0f}")
print(f"  rooms: {X[:, 1].min():.0f} to {X[:, 1].max():.0f}")
print(f"  age:   {X[:, 2].min():.0f} to {X[:, 2].max():.0f}")

print(f"\nAfter scaling - feature ranges:")
print(f"  sqm:   {X_scaled[:, 0].min():.2f} to {X_scaled[:, 0].max():.2f}")
print(f"  rooms: {X_scaled[:, 1].min():.2f} to {X_scaled[:, 1].max():.2f}")
print(f"  age:   {X_scaled[:, 2].min():.2f} to {X_scaled[:, 2].max():.2f}")

# NOW we can use a much larger learning rate
print(f"\nTraining with scaled features:\n")
w_scaled, hist_scaled = gradient_descent(X_scaled, y, learning_rate=0.01, n_epochs=1000)
```
Look at the difference. With scaled features, we jumped from learning_rate=1e-9 to learning_rate=0.01 -- seven orders of magnitude bigger! And the model converges WAY faster because all features contribute equally to the gradient. No single feature is drowning out the others.
This is not optional knowledge. Feature scaling is something you'll do in essentially every ML project. When someone's model "isn't learning," bad scaling is one of the first things to check. The mechanics are simple (subtract mean, divide by standard deviation), but forgetting to do it will waste hours of debugging time. I've been there. More than once ;-)
The learned weights will be in "scaled space" -- they correspond to the standardized features, not the original ones. To interpret them in original units, you need to convert back. We'll do that in a moment.
Convergence analysis: are we there yet?
Let's look at how the cost evolves during training. This is the same convergence analysis we did in episode #7, but now with a proper multi-feature model:
```python
history = np.array(hist_scaled)

print("Convergence analysis:\n")
checkpoints = [0, 10, 50, 100, 200, 500, 999]
for i in checkpoints:
    reduction = (1 - history[i] / history[0]) * 100
    print(f"  Epoch {i:>4d}: cost={history[i]:>14,.0f} "
          f"({reduction:>5.1f}% reduction from start)")

# How many epochs to reach 99% of total improvement?
total_improvement = history[0] - history[-1]
for i, cost in enumerate(history):
    if history[0] - cost >= 0.99 * total_improvement:
        print(f"\n99% of improvement reached at epoch {i}")
        break
```
You'll see the same pattern from episode #7: 90% of the improvement happens in the first 10-20% of epochs. The rest is fine-tuning. The loss drops sharply early on when the parameters are far from optimal (big gradients, big steps), then flattens out as we approach the minimum (small gradients, tiny steps). This is why early stopping works -- you almost never need to run to full convergence. The practical improvement from epoch 200 to epoch 1000 is often negligible compared to what happened in the first 100.
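As a sketch of what early stopping could look like in this loop -- the stopping rule here (`min_rel_improvement`, `patience`) is a hypothetical choice of mine, not something prescribed by the series:

```python
import numpy as np

def train_with_early_stopping(X, y, lr=0.01, max_epochs=5000,
                              min_rel_improvement=1e-6, patience=5):
    """Stop once the relative cost improvement stays below a threshold
    for `patience` consecutive epochs (illustrative criterion)."""
    m, n = X.shape
    w = np.zeros(n)
    prev_cost = np.inf
    stale = 0
    for epoch in range(max_epochs):
        preds = X @ w
        cost = np.mean((y - preds) ** 2)
        w -= lr * (-(2 / m) * (X.T @ (y - preds)))
        if prev_cost - cost < min_rel_improvement * prev_cost:
            stale += 1
            if stale >= patience:
                return w, epoch
        else:
            stale = 0
        prev_cost = cost
    return w, max_epochs

# Demo on a tiny well-scaled problem: y = 3*x + 1 + noise
np.random.seed(0)
X = np.column_stack([np.random.randn(200), np.ones(200)])
y = 3.0 * X[:, 0] + 1.0 + 0.1 * np.random.randn(200)
w, stopped_at = train_with_early_stopping(X, y)
print(f"stopped at epoch {stopped_at}, weights = {w.round(2)}")
```

On this toy problem training halts well before the `max_epochs` cap, with the weights already at the values the noise allows.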
Train/test split: the real test
So far we've been training and evaluating on the same data. But as we discussed way back in episodes #4 and #7 -- that's cheating. A model could memorize every data point perfectly and still be useless on new data it hasn't seen. The true measure of a model is generalization: does it work on fresh data?
```python
# Shuffle and split: 80% train, 20% test
indices = np.random.permutation(m)
split = int(0.8 * m)
X_train = X_scaled[indices[:split]]
y_train = y[indices[:split]]
X_test = X_scaled[indices[split:]]
y_test = y[indices[split:]]

print(f"Train: {X_train.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")

# Train only on training data
w_final, _ = gradient_descent(X_train, y_train, learning_rate=0.01, n_epochs=1000)

# Evaluate on BOTH sets
train_preds = X_train @ w_final
test_preds = X_test @ w_final
train_mae = np.abs(y_train - train_preds).mean()
test_mae = np.abs(y_test - test_preds).mean()
train_mse = ((y_train - train_preds) ** 2).mean()
test_mse = ((y_test - test_preds) ** 2).mean()

print(f"\nTrain MAE: EUR {train_mae:,.0f}")
print(f"Test MAE:  EUR {test_mae:,.0f}")
print(f"Train MSE: {train_mse:,.0f}")
print(f"Test MSE:  {test_mse:,.0f}")

# The ratio tells you about overfitting
ratio = test_mae / train_mae
print(f"\nTest/Train MAE ratio: {ratio:.3f}")
print(f"(Close to 1.0 = good generalization)")
```
If the test MAE is close to the train MAE, the model generalized well -- it learned the underlying pattern (price depends on sqm, rooms, age) rather than memorizing the specific noise in the training data. For a simple linear model with 100 samples and a linear ground truth, this should work nicely. The test/train MAE ratio should be close to 1.0.
If the ratio were significantly above 1 -- say, test MAE is 2x or 3x the train MAE -- that would be a red flag for overfitting. With linear regression on well-scaled data, overfitting is rare because the model is so simple (only 4 parameters for 100 samples). But when we move to more complex models -- decision trees, neural networks, models with thousands or millions of parameters -- overfitting becomes the central challenge. We're planting a seed here that we'll come back to again and again.
What the weights actually mean
One of the biggest advantages of linear regression over more complex models is interpretability. The weights have a direct, concrete meaning: each weight quantifies the effect of one feature on the prediction.
But our weights are currently in "scaled space" because we standardized the features. To interpret them in original units, we need to convert back:
```python
# Convert scaled weights back to original feature space
# For each feature: w_original = w_scaled / std
# For the bias: it absorbs the mean-shifting
w_original = np.zeros(4)
w_original[:3] = w_scaled[:3] / stds
w_original[3] = w_scaled[3] - np.sum(w_scaled[:3] * means / stds)

print("Interpreted weights (original scale):\n")
print(f"  Each additional sqm adds:   EUR {w_original[0]:>+,.0f}")
print(f"  Each additional room adds:  EUR {w_original[1]:>+,.0f}")
print(f"  Each year of age subtracts: EUR {w_original[2]:>+,.0f}")
print(f"  Base price:                 EUR {w_original[3]:>+,.0f}")

print(f"\nTrue values for comparison:")
print(f"  sqm coefficient:   2500")
print(f"  rooms coefficient: 800")
print(f"  age coefficient:   -500")
print(f"  base price:        15000")
```
This is interpretability. You can explain to a non-technical person: "each extra square meter adds roughly EUR 2,500 to the predicted price, each extra room adds about EUR 800, and each year of age reduces the price by about EUR 500." Try doing that with a neural network that has 10 million parameters. You can't. (Well, you can try -- there are interpretability techniques -- but it's orders of magnitude harder.)
This interpretability is why linear regression is still heavily used in fields like economics, medicine, and social science, even though "fancier" models exist. When you need to understand why a model makes a prediction, not just what it predicts, linear regression delivers that transparency for free.
Packaging it: a clean reusable class
Let's take everything we've built and package it into a clean class. This is not just about code organization -- it's about building the habit of creating reusable models with a .fit() and .predict() interface. Every ML library you'll ever use follows this pattern:
```python
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_epochs=1000):
        self.lr = learning_rate
        self.n_epochs = n_epochs
        self.weights = None
        self.means = None
        self.stds = None
        self.history = []

    def _scale(self, X):
        """Standardize features to zero mean, unit variance."""
        return (X - self.means) / self.stds

    def fit(self, X, y):
        # Compute scaling parameters from training data
        self.means = X.mean(axis=0)
        self.stds = X.std(axis=0)
        self.stds[self.stds == 0] = 1.0  # avoid division by zero
        # Scale and add bias column
        X_s = self._scale(X)
        X_b = np.column_stack([X_s, np.ones(len(X_s))])
        m, n = X_b.shape
        self.weights = np.zeros(n)
        self.history = []
        for epoch in range(self.n_epochs):
            preds = X_b @ self.weights
            cost = np.mean((y - preds) ** 2)
            self.history.append(cost)
            gradient = -(2/m) * (X_b.T @ (y - preds))
            self.weights -= self.lr * gradient
        return self

    def predict(self, X):
        X_s = self._scale(X)
        X_b = np.column_stack([X_s, np.ones(len(X_s))])
        return X_b @ self.weights

# Usage -- raw features, no manual scaling needed
raw_features = np.column_stack([sqm, rooms, age])
model = LinearRegression(learning_rate=0.01, n_epochs=500)
model.fit(raw_features, y)
preds = model.predict(raw_features)
mae = np.abs(y - preds).mean()
print(f"Model MAE: EUR {mae:,.0f}")
print(f"Final cost: {model.history[-1]:,.0f}")
print(f"Weights: {model.weights.round(2)}")
```
Notice something? This looks a LOT like how scikit-learn works. model.fit(X, y) to train, model.predict(X) to predict. That's not a coincidence -- scikit-learn's LinearRegression does essentially the same thing under the hood (though it uses a closed-form solution instead of gradient descent by default, which we'll discuss in the next episode). When we start using scikit-learn later in this series, you'll already know what .fit() does internally. No black box. No magic. Just the math we built today.
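As a tiny preview of that closed-form route (the details wait for the next episode), NumPy ships a least-squares solver that fits the unscaled feature matrix directly -- no learning rate, no epochs. A sketch on the same synthetic data, regenerated here so it runs standalone:

```python
import numpy as np

np.random.seed(42)
m = 100
sqm = np.random.uniform(30, 150, m)
rooms = np.random.randint(1, 6, m).astype(float)
age = np.random.uniform(0, 40, m)
price = 2500 * sqm + 800 * rooms - 500 * age + 15000 + np.random.randn(m) * 15000

X = np.column_stack([sqm, rooms, age, np.ones(m)])

# Closed-form least squares: minimizes ||Xw - y||^2 in one call,
# no feature scaling needed
w_closed, *_ = np.linalg.lstsq(X, price, rcond=None)
for name, true_v, w_v in zip(["sqm", "rooms", "age", "bias"],
                             [2500, 800, -500, 15000], w_closed):
    print(f"{name:>5s}: {w_v:>10.1f} (true: {true_v})")
```

Because the solver minimizes the training MSE exactly, its fit is at least as good on this data as the true generating weights (the noise guarantees it can't be worse).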
Evaluating predictions: a closer look
Let's see how our model does on individual predictions. Sometimes aggregate numbers like MAE hide important information about where the model struggles:
```python
# Train on 80%, test on 20%
raw_train = raw_features[indices[:split]]
raw_test = raw_features[indices[split:]]
y_tr = y[indices[:split]]
y_te = y[indices[split:]]

model2 = LinearRegression(learning_rate=0.01, n_epochs=1000)
model2.fit(raw_train, y_tr)
test_predictions = model2.predict(raw_test)
test_errors = y_te - test_predictions

print("Individual test predictions:\n")
print(f"{'Sqm':>5s} {'Rooms':>5s} {'Age':>5s} {'Actual':>10s} "
      f"{'Predicted':>10s} {'Error':>10s}")
print("-" * 55)
for i in range(min(10, len(y_te))):
    s, r, a = raw_test[i]
    print(f"{s:>5.0f} {r:>5.0f} {a:>5.0f} EUR {y_te[i]:>9,.0f} "
          f"EUR {test_predictions[i]:>9,.0f} {test_errors[i]:>+10,.0f}")

# Error distribution
print(f"\nError statistics:")
print(f"  Mean error:       EUR {test_errors.mean():>+,.0f} (should be near 0)")
print(f"  Std of errors:    EUR {test_errors.std():>,.0f}")
print(f"  Max overpredict:  EUR {test_errors.min():>,.0f}")
print(f"  Max underpredict: EUR {test_errors.max():>,.0f}")
```
The mean error should be close to zero -- meaning the model doesn't systematically over-predict or under-predict. The standard deviation of errors gives you the typical spread. Individual errors can be large (remember, our data has EUR 15,000 standard deviation of noise built in), but if they scatter randomly around zero, the model is doing its job. It captured the signal and left behind the noise, which is exactly what we want.
The complete model in perspective
Let me put what we built today into perspective. The core of linear regression is remarkably simple:
- Prediction: `y_hat = X @ w` (matrix multiply from episode #8)
- Cost: `J = mean((y - y_hat)^2)` (MSE from episode #6)
- Gradient: `dJ/dw = -(2/m) * X.T @ (y - y_hat)` (calculus from episode #9)
- Update: `w = w - lr * gradient` (gradient descent from episodes #6 and #7)
Four lines in a loop. That's it. The same four-step skeleton from episode #7, now generalized with linear algebra to handle any number of features. The model itself is just a weight vector -- nothing else. Training finds good weights. Prediction is a single matrix multiply. Apart from feature scaling (which is standard practice), there's really nothing else to it.
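Those four lines, condensed into a self-contained toy run (the standardized data here is made up for this snippet):

```python
import numpy as np

np.random.seed(1)
# Toy standardized data: y = 4*x1 - 2*x2 + 0.5 + small noise
X = np.column_stack([np.random.randn(50), np.random.randn(50), np.ones(50)])
y = X @ np.array([4.0, -2.0, 0.5]) + 0.01 * np.random.randn(50)

w, lr, m = np.zeros(3), 0.1, len(y)
for _ in range(500):
    y_hat = X @ w                          # 1. predict
    cost = np.mean((y - y_hat) ** 2)       # 2. measure (MSE)
    grad = -(2 / m) * (X.T @ (y - y_hat))  # 3. gradient
    w -= lr * grad                         # 4. step downhill

print(f"weights: {w.round(2)}")  # close to [4.0, -2.0, 0.5]
print(f"final cost: {cost:.6f}")
```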
And yet this "simple" model is used EVERYWHERE. Predicting house prices (exactly what we just did). Estimating ad click-through rates. Forecasting sales revenue. Analyzing drug dosage effects. Measuring the impact of policy changes. Anywhere you have a roughly linear relationship between features and a continuous target, linear regression is the tool of first choice. Not because it's the most powerful model -- it isn't -- but because it's fast, interpretable, hard to screw up, and often good enough.
Having said that, "good enough" is relative. Linear regression assumes the relationship is actually linear. If the true relationship is curved, periodic, or has complex interactions between features, a straight line (or hyperplane in higher dimensions) won't capture it. We'll encounter this limitation soon and learn models that handle nonlinearity -- decision trees, random forests, and eventually neural networks. But we'll always come back to linear regression as the baseline. If a fancy model can't beat linear regression on your data, you need to seriously question whether the fancy model is worth its complexity.
Let's recap
We built a complete linear regression model from scratch today. No shortcuts, no libraries hiding the work. Let me summarize every key concept:
- Linear regression predicts `y_hat = Xw` -- a weighted sum of features (the bias trick absorbs the intercept into the weight vector);
- The cost function (MSE) measures average squared prediction error -- a single number that captures overall model quality;
- The gradient `dJ/dw = -(2/m) * X.T @ (y - Xw)` tells us exactly how to adjust each weight to reduce the cost -- derived from calculus, verified numerically;
- Always verify gradients numerically when building from scratch -- if analytical and numerical gradients disagree, your math is wrong and training will silently produce garbage;
- Feature scaling (standardization) puts all features on equal footing, allowing much larger learning rates and dramatically faster convergence;
- Train/test split is mandatory -- evaluating only on training data is meaningless because the model could be memorizing instead of learning;
- Linear regression weights are directly interpretable: each weight quantifies a feature's effect on the prediction, in units you can explain to anyone;
- The `.fit()` / `.predict()` pattern we built here is the universal interface for ML models -- every library follows it.
Next episode, we'll take this further. There are things we glossed over today -- the normal equation (an analytical solution that skips gradient descent entirely), R-squared (a more intuitive measure than MSE), and what happens when you apply linear regression to real-world messy data where the assumptions don't hold perfectly. The model we built today is the engine. Next time, we learn to drive it properly.
Thanks, and until next time! Questions? Feel free to leave them in the comments ;-)