An agent is dropped into a grid it has never seen, with one rule: reach the reward. It has no map and no model — only trial, error, and a single number per move, the action-value Q, nudged toward the reward plus the best it now sees ahead. Do that enough and the value floods backward from the goal, the policy crystallizes into arrows, and a route appears out of nothing. This is learning to decide — the last verb of the engine, after analyse, simulate and predict.
The value of a state is the best action’s immediate reward plus the discounted value of wherever it lands you. Optimal play is exactly the solution of this self-referential equation — the present worth of the whole future, folded into one line.
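Written out for the deterministic case the sentence above describes, the equation can be sketched as follows. The symbols are assumptions introduced here for illustration, not taken from the page: V* for the optimal value, r(s, a) for the immediate reward, s' for the state that action a leads to from s, and γ (between 0 and 1) for the discount factor.

```latex
V^{*}(s) \;=\; \max_{a}\,\bigl[\, r(s,a) + \gamma\, V^{*}(s') \,\bigr]
```

V* appears on both sides, which is the self-reference: the value of a state is defined through the value of its successor. When moves are random rather than deterministic, the bracket is usually replaced by an expectation over s'.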
The action-value Q(s, a) is how good it is to take action a in state s and then play greedily forever after: the expected discounted reward. Learn this table and the policy is free: in every state, pick the action with the highest Q.
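A minimal sketch of "the policy is free", assuming the table is stored as a dictionary of dictionaries. The state names, action names and numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable

QTable = Dict[Hashable, Dict[str, float]]


def greedy_policy(q: QTable) -> Dict[Hashable, str]:
    """Map each state to its highest-Q action (ties go to the first action listed)."""
    return {state: max(actions, key=actions.get) for state, actions in q.items()}


# Hypothetical two-state table; the values are made up for this example.
q_table: QTable = {
    "start": {"left": 0.1, "right": 0.7},
    "middle": {"left": 0.2, "right": 0.9},
}

policy = greedy_policy(q_table)
print(policy)
```

No planning or search happens here: once the table is learned, reading off the policy is one argmax per state.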
The one rule. Nudge the estimate toward the reward you just got plus the best value you now see ahead; the bracket is the TD error — the surprise. Learn the future from a better guess of the future, with no model of the world at all. This single line drives The Descent’s deep-RL descendants.
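The rule described above is commonly written as Q(s, a) ← Q(s, a) + α [ r + γ max Q(s', ·) − Q(s, a) ], where the bracket is the TD error. Below is a minimal tabular sketch under stated assumptions: a hypothetical five-cell corridor stands in for the grid, the values α = 0.5 and γ = 0.9 are illustrative, and the agent moves purely at random while learning. Random behaviour is enough for this toy case because the update always bootstraps from the best next action, not from the action actually taken.

```python
import random

ACTIONS = ("left", "right")
N_STATES, GOAL = 5, 4  # hypothetical corridor: cells 0..4, reward on reaching cell 4


def step(state, action):
    """Deterministic move; the wall at cell 0 blocks further movement left."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def q_update(q, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning step in place and return the TD error."""
    best_next = 0.0 if done else max(q[s_next].values())
    td_error = r + gamma * best_next - q[s][a]  # the bracket: the surprise
    q[s][a] += alpha * td_error                 # nudge the estimate by a fraction alpha
    return td_error


rng = random.Random(0)  # fixed seed so the run is repeatable
q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}

for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        a = rng.choice(ACTIONS)  # purely random behaviour, for simplicity
        s_next, r, done = step(s, a)
        q_update(q, s, a, r, s_next, done)
        s = s_next

for s in range(GOAL):
    print(s, {a: round(v, 3) for a, v in q[s].items()})
```

In this corridor the value of moving right should approach 1 in the cell next to the goal, then 0.9 and 0.81 in the cells behind it: the reward flooding backward, discounted once per step.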
Most of the time take the best action you know; a fraction ε of the time gamble on a random one — because the only way to find a better path is to risk a worse one. Too little ε and the agent locks onto the first route it finds; too much and it never commits.
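The selection rule above can be sketched in a few lines. The function name `epsilon_greedy`, the four action names and the Q values are hypothetical, and ε = 0.1 is only an example setting.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))   # explore: gamble on any action
    return max(q_row, key=q_row.get)     # exploit: highest current estimate


rng = random.Random(0)
q_row = {"up": 0.1, "down": 0.0, "left": 0.4, "right": 0.3}  # hypothetical estimates

picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10_000)]
share_best = picks.count("left") / len(picks)
print(f"best-known action chosen {share_best:.1%} of the time")
```

Because the random draw can also land on the best-known action, that action is chosen with probability 1 − ε + ε/4 here, about 92.5% for four actions and ε = 0.1.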
This is the rack's decide instrument, the verb that closes AxionCore's loop after analyse, simulate and predict. Where The Descent (INST·27) learns a function from labelled answers, the agent here is told nothing but a sparse reward and must discover the answer itself, the value of every move estimated from its own later estimates. Its α is The Descent's η by another name, and pushed too hard it thrashes the same way; its value field is a stationary structure in the style of The Rank, poured through a graph; its ε-exploration is the random kick of The Walk, and annealing ε from high to low is the cooling of The Anneal. Intelligence as nothing but reward and iteration.