UnSearch

Understanding Search in Transformers

“[T]here is no technique that would allow us to lay out in any satisfactory way what kinds of knowledge, reasoning, or goals a model is using when it produces some output.” – Sam Bowman

Most recent ML papers open with a long description of how transformers have been incredibly successful across a huge variety of tasks. Capabilities are advancing rapidly, but our understanding of how transformers do what they do remains limited. Our project addresses this gap through work in mechanistic interpretability, focusing in particular on how transformers represent search processes and goals.

Our Research

Transformers Use Causal World Models in Maze-Solving Tasks

Using sparse autoencoders and attention analysis, we discover and intervene on world models in maze-solving transformers.
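
For readers unfamiliar with the technique, here is a minimal sketch of a sparse autoencoder trained on model activations. It illustrates the general method only, not the paper's implementation; every dimension, hyperparameter, and the data source below is a placeholder:

```python
# Generic sparse autoencoder (SAE) sketch: reconstruct activations through a
# wide hidden layer while an L1 penalty keeps the hidden code sparse.
# All sizes and coefficients are placeholders, not values from the paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU encoding yields a non-negative code that the L1 penalty can sparsify
        z = torch.relu(self.encoder(x))
        return self.decoder(z), z

sae = SparseAutoencoder(d_model=128, d_hidden=1024)   # placeholder sizes
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                       # sparsity strength (placeholder)

activations = torch.randn(4096, 128)  # stand-in for transformer activations
for batch in activations.split(256):
    recon, z = sae(batch)
    # reconstruction loss plus L1 penalty on the hidden code
    loss = (recon - batch).pow(2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```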

Structured World Representations in Maze-Solving Transformers

We train transformers on mazes and use linear probing to show that they form internal representations of the entire maze; we also find evidence for Adjacency Heads, which attend to valid "next moves".
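
Linear probing in general works by fitting a linear classifier on frozen activations: if a held-out probe decodes a property well above a shuffled-label baseline, that property is plausibly linearly represented. The sketch below uses random stand-in data; the activations, labels, and layer choice are all assumptions, not our experimental setup:

```python
# Minimal linear-probing sketch: can a maze property (e.g., "is there a wall
# between two adjacent cells?") be linearly decoded from activations?
# Data here is random stand-in data, not activations from our models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 128))      # activations at some layer (placeholder)
labels = rng.integers(0, 2, size=2000)   # wall / no-wall labels (placeholder)

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# High held-out accuracy (relative to a shuffled-label baseline) suggests
# the property is linearly represented in the activations.
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```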

A Configurable Library for Generating and Manipulating Maze Datasets

The paper accompanying our maze-dataset library, which was used in our first two papers and is publicly available.
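
Basic usage looks like the following, adapted from the example in the paper; exact argument names may vary between library versions:

```python
# Generate a small dataset of randomized depth-first-search mazes with
# maze-dataset, following the example from the accompanying paper.
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators

cfg = MazeDatasetConfig(
    name="demo",
    grid_n=5,                                  # 5x5 lattice of cells
    n_mazes=4,                                 # number of mazes to generate
    maze_ctor=LatticeMazeGenerators.gen_dfs,   # randomized depth-first search
)
dataset = MazeDataset.from_config(cfg)
```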

Understanding Mesa-optimization Using Toy Models

A LessWrong post detailing and motivating our research agenda.