Understanding Search in Transformers

“[T]here is no technique that would allow us to lay out in any satisfactory way what kinds of knowledge, reasoning, or goals a model is using when it produces some output.” – Sam Bowman

Most recent ML papers open with a long description of how Transformers have been incredibly successful across a huge variety of tasks. Capabilities are advancing rapidly, but our understanding of how Transformers do what they do remains limited. Recognizing this gap, our project contributes to the field of mechanistic interpretability, focusing in particular on search and goal representations in transformers.

Our Research

Structured World Representations in Maze-Solving Transformers

Transformers trained to solve mazes form linear representations of the maze, and we find evidence for Adjacency Heads that attend to valid "next moves".

Understanding Mesa-optimization Using Toy Models

A LessWrong post detailing and motivating our research agenda.