It learns the way by walking it.

An agent is dropped into a grid it has never seen, with one rule: reach the reward. It has no map and no model — only trial, error, and a single number per move, the action-value Q, nudged toward the reward plus the best it now sees ahead. Do that enough and the value floods backward from the goal, the policy crystallizes into arrows, and a route appears out of nothing. This is learning to decide — the last verb of the engine, after analyse, simulate and predict.

[Interactive gridworld. Initial status: exploring, no path to the reward found yet; 0% success; 0 episodes. Legend: ■ value / agent · ■ pit / cliff · line = learned greedy path]

watch the value flood back from the goal · drag ε to 0 to stop exploring · try Cliff and see ε gamble into the drop

The maths (Bellman 1957 · Watkins 1989 · DeepMind 2015)

Bellman optimality: the fixed point

V*(s) = max_a [ r + γ Σ_s′ P(s′|s,a) V*(s′) ]

The value of a state is the best action’s immediate reward plus the discounted value of wherever it lands you. Optimal play is exactly the solution of this self-referential equation — the present worth of the whole future, folded into one line.
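As a rough illustration of that fixed point, the sketch below repeats the Bellman backup until the values stop changing. It assumes a hypothetical four-state corridor with deterministic moves, so the sum over s′ collapses to a single next state. The names `step` and `value_iteration` are illustrative. Unlike the agent on this page, this version is handed a model of the world; it shows what the equation's solution looks like, not how the agent finds it.

```python
# Value iteration on a hypothetical 4-state corridor; state 3 is the goal.
GAMMA = 0.95
N_STATES = 4
GOAL = 3
ACTIONS = (-1, +1)  # step left, step right


def step(s, a):
    """Deterministic model (an assumption): returns (next_state, reward)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)


def value_iteration(tol=1e-9):
    """Apply V(s) <- max_a [r + gamma * V(s')] until no value moves by tol."""
    v = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == GOAL:
                continue  # terminal state: nothing lies ahead, value stays 0
            best = max(r + GAMMA * v[s2]
                       for s2, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v


V = value_iteration()
print(V)  # values shrink by a factor of gamma per step away from the goal
```

Each step away from the goal costs one factor of γ, which is the "flood backward" made numeric.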

The action-value Q: expected return

Q(s,a) = E[ Σ_t γ^t r_t | s,a ]

How good it is to take action a in state s and then play greedily forever after — the expected discounted reward. Learn this table and the policy is free: in every state, pick the action with the highest Q.
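"The policy is free" can be read literally. Given a table of Q-values, the policy is one argmax per state. The table below is made up for illustration, and `greedy_policy` is a hypothetical helper name.

```python
# Hypothetical Q-table: Q[state][action] for two states and four moves.
Q = {
    "A": {"up": 0.1, "right": 0.7, "down": -0.2, "left": 0.0},
    "B": {"up": 0.4, "right": 0.3, "down": 0.9, "left": 0.2},
}


def greedy_policy(q_table):
    """For each state, pick the action with the highest Q-value."""
    return {s: max(actions, key=actions.get) for s, actions in q_table.items()}


policy = greedy_policy(Q)
print(policy)  # {'A': 'right', 'B': 'down'}
```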

Q-learning update: temporal difference

Q(s,a) ← Q(s,a) + α[ r + γ max_a′ Q(s′,a′) − Q(s,a) ]

The one rule. Nudge the estimate toward the reward you just got plus the best value you now see ahead; the bracket is the TD error — the surprise. Learn the future from a better guess of the future, with no model of the world at all. This single line drives The Descent’s deep-RL descendants.
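A minimal sketch of the rule in a tabular setting, under stated assumptions: a hypothetical 4×4 grid with no pits, reward 1 on reaching the goal and 0 elsewhere, and α, γ and ε set to the page's slider defaults. The names `step`, `train` and `greedy_path` are illustrative and not taken from the page's own implementation.

```python
import random

ALPHA, GAMMA, EPSILON = 0.30, 0.95, 0.20      # the page's slider defaults
SIZE, START, GOAL = 4, (0, 0), (3, 3)         # hypothetical 4x4 grid
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right


def step(state, action):
    """Move one cell, staying put at walls; reward 1 only on reaching GOAL."""
    r = min(max(state[0] + MOVES[action][0], 0), SIZE - 1)
    c = min(max(state[1] + MOVES[action][1], 0), SIZE - 1)
    nxt = (r, c)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(SIZE) for c in range(SIZE)}
    for _episode in range(episodes):
        s = START
        for _t in range(max_steps):
            if rng.random() < EPSILON:
                a = rng.randrange(4)           # explore
            else:
                best = max(q[s])               # exploit, ties broken at random
                a = rng.choice([i for i in range(4) if q[s][i] == best])
            s2, reward, done = step(s, a)
            # Target: reward plus discounted best value ahead (0 past the goal).
            target = reward + (0.0 if done else GAMMA * max(q[s2]))
            q[s][a] += ALPHA * (target - q[s][a])   # alpha times the TD error
            s = s2
            if done:
                break
    return q


def greedy_path(q, limit=50):
    """Follow the highest-Q action from START until GOAL or the step limit."""
    path, s = [START], START
    while s != GOAL and len(path) < limit:
        a = max(range(4), key=lambda i: q[s][i])
        s, _, _ = step(s, a)
        path.append(s)
    return path


q_table = train()
path = greedy_path(q_table)
print(path)
```

After training, the greedy rollout should start at `START` and end at `GOAL`. The update never consults `step` as a model; it only sees the transitions the agent actually walked.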

Explore vs. exploit: ε-greedy

a = argmax_a′ Q(s,a′) w.p. 1−ε, else random

Most of the time take the best action you know; a fraction ε of the time gamble on a random one — because the only way to find a better path is to risk a worse one. Too little ε and the agent locks onto the first route it finds; too much and it never commits.
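The selection rule fits in a few lines. This sketch assumes the random action is drawn uniformly from all actions, the best one included, so with four actions and ε = 0.20 the best-known action is chosen about 1 − ε + ε/4 = 0.85 of the time. `epsilon_greedy` is a hypothetical helper name and the Q-values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2, 0.0]   # action 1 is the current best guess
picks = [epsilon_greedy(q_values, 0.20, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)  # close to 0.85 in expectation
```

With `epsilon=0.0` the function always returns the argmax, which is the "drag ε to 0" case: no more gambling, and no more discovery.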

This is the rack’s decide instrument, the verb that closes AxionCore’s loop after analyse, simulate and predict. Where The Descent (INST·27) learns a function from labelled answers, the agent here is told nothing but a sparse reward and must discover the answer itself, the value of every move estimated from its own later estimates. Its α is The Descent’s η by another name, and pushed too hard it thrashes the same way; its value field is a stationary structure in the style of The Rank, poured through a graph; its ε-exploration is the random kick of The Walk, and annealing ε from high to low is the cooling of The Anneal. Intelligence as nothing but reward and iteration.
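Annealing ε can be sketched as a schedule that starts high and cools toward a floor. The geometric form and all three constants below are assumptions chosen for illustration; the page does not specify a schedule, and `annealed_epsilon` is a hypothetical name.

```python
def annealed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Geometric decay from start toward end; the constants are illustrative."""
    return end + (start - end) * decay ** episode


schedule = [annealed_epsilon(e) for e in (0, 100, 500)]
print(schedule)  # starts at 1.0 and falls toward the 0.05 floor
```

Early episodes are almost pure exploration; late ones almost pure exploitation, with a small residual ε so the agent never stops testing its route entirely.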

Exploration ε 0.20 (balanced)

Learning rate α 0.30

Discount γ 0.95

Gridworld


booting INST·42…


TheIceJi
