It learns by falling.

A network starts knowing nothing — random weights, a meaningless wash across the plane. Then it learns: for every weight, backprop asks which way nudges the error down, and gradient descent takes one small step that way. Do it a few thousand times and a decision boundary carves itself out of the dots — intelligence as nothing but data and the slope of a loss. The whole story is the learning rate η: get it right and the loss glides down; push it too hard and the descent overshoots and explodes.

[Interactive figure: learning, carving a boundary out of scattered dots. Readouts: accuracy, loss, epoch, net. Legend: ■ class +1 · ■ class −1 · line: boundary.]

set η just right and the loss glides down · push η to the top + Reset to watch it overshoot · try Spiral, then add hidden units

The maths Cauchy 1847 · Rosenblatt 1958 · Rumelhart–Hinton–Williams 1986

The neuron · forward

a = act(W x + b)

Each neuron takes a weighted sum of the layer below plus a bias, then bends it through a non-linearity. Stack two such layers and the network can carve curved, disconnected regions — not just a single straight cut.
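
As a minimal sketch of the forward rule a = act(W x + b), the snippet below stacks two such layers in plain Python. The helper name `dense`, the layer sizes, the weight values and the choice of tanh are all illustrative assumptions, not the instrument's own settings.

```python
import math

def dense(W, x, b, act):
    """One layer: a = act(W x + b), with W given as a list of rows."""
    return [act(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical weights, picked by hand for illustration only.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]   # 3 hidden units, 2 inputs
b1 = [0.0, -0.5, 0.1]
W2 = [[1.0, -2.0, 0.5]]                        # 1 output unit, 3 inputs
b2 = [0.2]

x = [0.3, -0.7]                         # one point on the plane
h = dense(W1, x, b1, math.tanh)         # hidden activations
y_hat = dense(W2, h, b2, math.tanh)[0]  # network output, between -1 and 1
```

The sign of `y_hat` can be read as the predicted class, so the set of points where it crosses zero is the boundary.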

The loss · L

L = ½ ⟨(ŷ − y)²⟩

How wrong the network is, averaged over every point. Squared so that big misses dominate and the surface is smooth — a landscape in weight space the optimiser can roll down.
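
A small worked example of L = ½ ⟨(ŷ − y)²⟩ shows why big misses dominate. The function name `half_mse` and the two sets of predictions are made up for illustration; both sets have the same total absolute error.

```python
def half_mse(y_hat, y):
    """L = 1/2 * mean((y_hat - y)^2) over all points."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

y = [1.0, -1.0, 1.0, -1.0]             # labels: class +1 / class -1

spread = [0.75, -0.75, 0.75, -0.75]    # four misses of 0.25 each
single = [1.0, -1.0, 1.0, 0.0]         # one miss of 1.0, the rest exact

loss_spread = half_mse(spread, y)      # 0.5 * (4 * 0.0625) / 4 = 0.03125
loss_single = half_mse(single, y)      # 0.5 * 1.0 / 4 = 0.125
```

The single large miss costs four times as much as the same error spread over four points.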

Gradient descent · the rule

θ ← θ − η ∇_θL

The whole of learning. The gradient ∇L points uphill, so step the opposite way, scaled by the learning rate η. Repeat. There is no cleverer secret underneath modern AI than this line.
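
The rule θ ← θ − η ∇_θL can be sketched on a toy loss whose minimum is known in advance. The bowl-shaped loss, the starting point and η = 0.1 below are assumptions chosen so the result is easy to check; a real network only changes where the gradient comes from.

```python
def grad(theta):
    """Gradient of the toy loss L = 1/2 * ((t0 - 1)^2 + 4 * (t1 + 2)^2)."""
    t0, t1 = theta
    return [t0 - 1.0, 4.0 * (t1 + 2.0)]

def descend(theta, eta, steps):
    """Repeat theta <- theta - eta * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * g_i for t, g_i in zip(theta, g)]
    return theta

# The minimum of the toy loss sits at (1, -2).
theta = descend([0.0, 0.0], eta=0.1, steps=200)
```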

Backpropagation · ∇L

δ_l = (W_{l+1}ᵀ δ_{l+1}) ⊙ act′(z_l)

How to get ∇L cheaply: the chain rule, run backwards. The error at the output is propagated layer by layer, and each weight’s gradient is its incoming activation times the error flowing back through it.
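
The backward recursion can be sketched for a tiny tanh network and checked against a finite-difference estimate. Everything here is an illustrative assumption: the parameter layout (a list of `(W, b)` pairs), the weight values, the single training point, and the use of tanh at every layer.

```python
import math

def forward(params, x):
    """Return the activations of every layer, input included."""
    acts = [x]
    for W, b in params:
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b_i
             for row, b_i in zip(W, b)]
        acts.append([math.tanh(v) for v in z])
    return acts

def loss(params, x, y):
    return 0.5 * (forward(params, x)[-1][0] - y) ** 2

def backprop(params, x, y):
    """Weight gradients of L = 1/2 * (y_hat - y)^2 for one point."""
    acts = forward(params, x)
    # tanh'(z) = 1 - tanh(z)^2, so act'(z_l) is read off the activations.
    delta = [(acts[-1][0] - y) * (1.0 - acts[-1][0] ** 2)]
    grads = []
    for l in range(len(params) - 1, -1, -1):
        a_in = acts[l]
        # dL/dW[i][j] = error flowing back (delta[i]) * incoming activation
        grads.insert(0, [[d * a for a in a_in] for d in delta])
        if l > 0:
            W = params[l][0]
            # delta_l = (W_{l+1}^T delta_{l+1}) * act'(z_l)
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * (1.0 - a_in[j] ** 2) for j in range(len(a_in))]
    return grads

# Hypothetical 2-3-1 network and one labelled point.
params = [([[0.5, -0.3], [0.8, 0.2], [-0.6, 0.9]], [0.1, -0.2, 0.0]),
          ([[0.7, -0.5, 0.4]], [0.05])]
x, y = [0.3, -0.7], 1.0
grads = backprop(params, x, y)

# Finite-difference check on one first-layer weight.
eps = 1e-6
params[0][0][2][1] += eps
up = loss(params, x, y)
params[0][0][2][1] -= 2 * eps
down = loss(params, x, y)
params[0][0][2][1] += eps
numeric = (up - down) / (2 * eps)
```

One backward pass yields every weight gradient at once, whereas the finite-difference check needs two extra forward passes per weight. That difference is the sense in which backpropagation is cheap.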

Universal approximation · Cybenko 1989

any continuous f, to any ε

Why one hidden layer is, in principle, enough: a wide enough network of these neurons can approximate any continuous function on a bounded region to within any error ε. Capacity is the catch: Spiral needs more units to bend the boundary far enough.
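
To make the role of width concrete, the sketch below builds a one-hidden-layer sigmoid network by hand (no training) as a staircase of steep steps, and measures how the worst-case error shrinks as units are added. The target sin(x) on [0, π], the steepness factor and the helper names are assumptions for illustration; this is one simple construction in one dimension, not the theorem's proof.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def staircase_net(f, a, b, n_hidden):
    """Hand-set one-hidden-layer sigmoid network for f on [a, b]."""
    h = (b - a) / n_hidden
    steep = 20.0 / h                       # hidden weight: sharp steps
    units = []
    for k in range(1, n_hidden + 1):
        centre = a + (k - 0.5) * h         # where this unit switches on
        jump = f(a + k * h) - f(a + (k - 1) * h)   # its output weight
        units.append((centre, jump))
    return lambda x: f(a) + sum(j * sigmoid(steep * (x - c)) for c, j in units)

def max_error(f, net, a, b, samples=2000):
    """Largest gap between f and the network on an evenly spaced grid."""
    xs = [a + i * (b - a) / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

err_8 = max_error(math.sin, staircase_net(math.sin, 0.0, math.pi, 8),
                  0.0, math.pi)
err_64 = max_error(math.sin, staircase_net(math.sin, 0.0, math.pi, 64),
                   0.0, math.pi)
```

With 8 units the staircase is visibly coarse; with 64 the worst gap is several times smaller. More units buy a smaller ε.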

When η is too large · η ↑ → ∞

overshoot → oscillate → diverge

The same descent, pushed too hard, is the route to chaos. Raise η and the step overshoots the minimum, then overshoots back — a period-2 oscillation that doubles into divergence, exactly the bifurcation The Cascade draws.
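
The overshoot → oscillate → diverge sequence can be seen on the simplest possible loss. The sketch assumes a one-dimensional quadratic bowl, where each step just multiplies θ by (1 − η × curvature); the three η values are picked to land in the three regimes. A pure quadratic shows the sign-flipping and the blow-up but not period doubling, which needs a non-quadratic loss.

```python
def run(eta, steps=20, theta=1.0, curvature=1.0):
    """Gradient descent on L = 1/2 * curvature * theta^2; returns the path."""
    path = [theta]
    for _ in range(steps):
        theta -= eta * curvature * theta   # gradient is curvature * theta
        path.append(theta)
    return path

glide = run(0.5)    # multiplier 0.5: slides straight down to the minimum
wobble = run(1.8)   # multiplier -0.8: overshoots, flips sign, still shrinks
blowup = run(2.2)   # multiplier -1.2: flips sign and grows every step
```

For this bowl the edge of stability sits at η = 2 / curvature: below it the flips die out, above it they grow without bound.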

This is the rack’s learning instrument: the question of where intelligence comes from, reduced to its mechanism. There is no understanding inside the network, only weights and the slope of an error, yet from data and gradient steps alone a structure emerges that generalises. It is the twin of The Lens (INST·25): both pull signal from data, but the Lens infers a hidden state with a known model while the Descent learns the model itself. Its failure mode belongs to The Cascade (INST·02), where too large a step turns smooth convergence into period-doubling chaos, and its honest, noisier cousin is The Walk (INST·19): stochastic gradient descent is this same downhill roll with the loss estimated from a random handful of points each step. It is The Perception Engine’s “explain the present” with the model fitted, not given.
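
The stochastic variant can be sketched in a few lines: the same update, with the gradient estimated from a random handful of points. The one-weight model y = w·x, the toy data, the batch size of 4 and the fixed seed are all assumptions for illustration.

```python
import random

rng = random.Random(0)                     # fixed seed so the run repeats

# Toy data on the line y = 2x, so the ideal weight is w = 2.
data = [(x, 2.0 * x) for x in (i / 10.0 for i in range(1, 21))]

def batch_grad(w, batch):
    """Gradient of 1/2 * mean((w*x - y)^2) with respect to w, on one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

w, eta = 0.0, 0.1
for step in range(500):
    batch = rng.sample(data, 4)            # a random handful of points
    w -= eta * batch_grad(w, batch)        # same rule, estimated gradient

full_grad = batch_grad(w, data)            # gradient over every point
```

Each step sees only four points, so the step size varies from batch to batch, yet the weight still settles at the value that zeroes the full gradient.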

Controls: Learning rate η (0.500, good) · Hidden units (8 × 2 layers) · Dataset · Activation



TheIceJi
