Second-order alignment

They built a layer to keep the model still. Nobody asked what keeps the layer still.

Put a durability layer above a frontier model and it works, for a while. But the layer is itself a strategy, and a strategy played for a month has an equilibrium the first move never sees. The layer drifts toward it. We sit above the layer, at the fixed point the whole stack is already falling toward — and we do not fall, because there is nothing above us to fall from.

Two layers drift. One does not. That one is the product.

The open problem

For a year the question was whether a model could hold its judgment across a long task. A handful of labs took that question seriously, and built a layer above the model to hold it. The layer holds. Then, more slowly, the layer does the same thing the model did.

The better question — the one that ends the regress — is who holds the holder. Every answer to it so far has been another holder. We work on the answer that is not.

Read the thesis →

What we are

A second-order alignment lab.

The frontier settled capability. A handful of labs are settling durability. We work on the layer the durability layer answers to — the one place the regress is allowed to stop. The calling →

One click in

Thesis Capability was never the constraint. Durability wasn't either. Where the regress comes from, and the order above it where the regress is allowed to end. Read the thesis → Calling The unsolved half of alignment has an unsolved half of its own. The larger claim, held with some humility, and why human control of the controllers comes first. The calling → The Lab An open problem we have, on principle, not named. How the lab works, the stack it sits above, and a dated log in place of a press office. Enter the lab →

Above the stack

We are tied to no single layer. We sit above all of them.

RLHF Constitutional methods Agent harnesses Oversight layers Durability labs

Not the model. Not the layer above it. The direction the layers agree is up.

Request access