We want to understand what models are actually doing when they produce outputs. This is early work, and most questions do not have clear answers yet.
Tracing how information flows through a model by swapping activations at specific sites.
Studying how individual neurons simultaneously respond to many different, unrelated concepts.
Finding the small subnetworks inside large models that are responsible for specific capabilities.