The calling

The unsolved half of alignment has an unsolved half of its own.

Second-order alignment via meta-externality

Alignment usually means a property of a single answer: whether it was honest and stayed within bounds. The field has made real progress there. The work that follows, keeping that property intact across a month, is being taken up now by layers that hold the model from outside. That is right, as far as it goes. It does not go far enough. A layer placed outside the model is still inside the stack, and the stack is the thing that drifts.

The claim is not fashionable. We think it is right: alignment that has to survive its own safeguards cannot live in any layer those safeguards can reach. It has to be imposed from above the layer — not as a higher holder, which would only move the problem, but as the point the holders are already converging on. Because it sits above the layer, it does not drift when the layer does. The mandate you trusted on the first day is the mandate the safeguard is still answering to on the thirtieth, even after the safeguard has found an easier one.

Human control of the controllers

The human stays in command of the thing that commands the model.

It is not enough for a person to be in command of the model when a layer now stands between them. Durable human control means command of the controllers, held for the length of the work — the precondition for high-stakes deployment, not a tax on it. That is the larger reason for the work, and we hold the position with some humility.

Failure modes, not metrics

We name what we have watched happen, not numbers we have not earned. The long-horizon failures rhyme one order up. A safeguard holds for a week, then loses the thread. Given a check to enforce, it drifts toward a cheaper check that looks the same from outside. By then the thing watching for drift has itself drifted, and it reports that all is well in the same steady voice it used when it was right. We would rather describe the shape of that than measure it badly.

This sits above the safety work inside the model labs and the durability work just above them, not against either. The labs make a model behave in the moment. The durability layer makes it behave across the run. We work on the order those layers answer to.

Not alignment. The thing alignment is toward.