Future Research Directions from Transformer Interpretability Read
Yesterday we did a paper read of Anthropic's transformer interpretability paper, "Towards Monosemanticity" (https://transformer-circuits.pub/2023/monosemantic-features/index.html). From the discussion, here are the future areas of research we think would be interesting:
1. Replace the autoencoder with UMAP + clustering: UMAP plus clustering can provide a similar function, is possibly more mathematically sound than an autoencoder, and, given our experience, is likely more stable.
2. Extend to LLMs beyond a single transformer: use this approach on spans larger than single tokens, i.e., multiple embeddings (sentences, functions, etc.).
3. Use interpretability insights: how do you use an insight to improve or correct an LLM generation that is wrong? Steering vectors are one example.
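For idea 1, a minimal sketch of what the UMAP + clustering pipeline could look like. Everything here is illustrative: the "activations" are synthetic, PCA stands in for UMAP (umap-learn's `umap.UMAP(n_components=k).fit_transform(x)` has the same interface), and a tiny k-means stands in for something like HDBSCAN, so the sketch runs with only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MLP activations: two well-separated "features"
# in a 64-d space. In the paper's setting these would be real transformer
# MLP activations collected over a dataset.
n, d = 400, 64
centers = rng.normal(size=(2, d))
acts = np.vstack([centers[i] + 0.1 * rng.normal(size=(n // 2, d)) for i in range(2)])

def reduce_dims(x, k=2):
    """PCA stand-in for UMAP: project onto the top-k principal components.
    Swap in umap.UMAP(n_components=k).fit_transform(x) for the real thing."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T

def kmeans(x, k=2, iters=50, seed=0):
    """Minimal k-means; in practice a density-based method like HDBSCAN
    may be a better fit for discovering features without fixing k."""
    r = np.random.default_rng(seed)
    cent = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                cent[j] = x[labels == j].mean(axis=0)
    return labels

emb = reduce_dims(acts, k=2)   # low-dimensional "feature map"
labels = kmeans(emb, k=2)      # candidate feature clusters
print(emb.shape, np.unique(labels))
```

The appeal is that each cluster in the reduced space is a candidate interpretable feature, playing the role the sparse autoencoder's dictionary elements play in the paper.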
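For idea 2, one simple way to move beyond single tokens is to pool per-token activations into one vector per span (sentence, function, etc.) and then run the same feature-finding machinery on those. A hedged NumPy sketch with made-up activations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-token activations for three "sentences" of varying length,
# each (num_tokens x hidden_dim); in practice these come from a transformer layer.
sentences = [rng.normal(size=(t, 16)) for t in (5, 9, 3)]

def pool(tokens, how="mean"):
    """Collapse a variable-length run of token activations into a single
    fixed-size vector, so sentence/function-level spans become first-class
    inputs to the interpretability pipeline."""
    if how == "mean":
        return tokens.mean(axis=0)
    if how == "max":
        return tokens.max(axis=0)
    raise ValueError(f"unknown pooling: {how}")

span_vecs = np.stack([pool(s) for s in sentences])
print(span_vecs.shape)  # one embedding per multi-token span
```

Mean pooling is only the simplest choice; attention-weighted pooling or taking the final-token state are obvious alternatives to compare.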
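For idea 3, the steering-vector recipe is often described as: take the difference of mean activations on contrastive prompt sets, then add that direction (scaled) to the hidden state at generation time. A toy NumPy sketch; the "positive"/"negative" activation sets and the scale `alpha` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

# Hypothetical activations collected on contrastive prompt sets
# (e.g. completions with vs. without some trait); purely synthetic here.
acts_pos = rng.normal(size=(50, d)) + 1.0
acts_neg = rng.normal(size=(50, d))

# A common recipe: steering vector = difference of mean activations.
steer = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def apply_steering(hidden, vec, alpha=4.0):
    """Nudge a hidden state along the (normalized) steering direction;
    in a real model this would be added at a chosen layer during decoding."""
    return hidden + alpha * vec / np.linalg.norm(vec)

h = rng.normal(size=d)     # stand-in for one hidden state during generation
h2 = apply_steering(h, steer)

# The steered state's projection onto the steering direction increases.
u = steer / np.linalg.norm(steer)
print(float(h2 @ u - h @ u))
```

This is where an interpretability insight becomes an intervention: once a feature direction is identified, pushing along (or against) it is a concrete lever on incorrect generations.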
