Future Research Directions from Transformer Interpretability Read
Yesterday we did a paper read of Anthropic's transformer interpretability paper, "Towards Monosemanticity" (https://transformer-circuits.pub/2023/monosemantic-features/index.html). From the discussion, here are the future areas of research we think would be interesting:
1. Replace the autoencoder with UMAP + clustering: UMAP plus clustering can provide a similar function, is possibly more mathematically sound than an autoencoder, and, given our experience, is likely more stable.
2. Extend to LLMs beyond a single transformer: use this approach on spans larger than single tokens, i.e., multiple embeddings (sentences, functions, etc.).
3. Use interpretability insights: how do you use an insight to improve or correct an LLM generation that is wrong? Steering vectors are one example.
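For idea 1, a minimal sketch of what the UMAP + clustering pipeline could look like. Everything here is illustrative: the "activations" are synthetic, PCA stands in for UMAP (umap-learn's `umap.UMAP(n_components=k).fit_transform(x)` has the same interface), and a tiny k-means stands in for something like HDBSCAN, so the sketch runs with only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MLP activations: two well-separated "features"
# in a 64-d space. In the paper's setting these would be real transformer
# MLP activations collected over a dataset.
n, d = 400, 64
centers = rng.normal(size=(2, d))
acts = np.vstack([centers[i] + 0.1 * rng.normal(size=(n // 2, d)) for i in range(2)])

def reduce_dims(x, k=2):
    """PCA stand-in for UMAP: project onto the top-k principal components.
    Swap in umap.UMAP(n_components=k).fit_transform(x) for the real thing."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T

def kmeans(x, k=2, iters=50, seed=0):
    """Minimal k-means; in practice a density-based method like HDBSCAN
    may be a better fit for discovering features without fixing k."""
    r = np.random.default_rng(seed)
    cent = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                cent[j] = x[labels == j].mean(axis=0)
    return labels

emb = reduce_dims(acts, k=2)   # low-dimensional "feature map"
labels = kmeans(emb, k=2)      # candidate feature clusters
print(emb.shape, np.unique(labels))
```

The appeal is that each cluster in the reduced space is a candidate interpretable feature, playing the role the sparse autoencoder's dictionary elements play in the paper.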
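For idea 2, one simple way to move beyond single tokens is to pool per-token activations into one vector per span (sentence, function, etc.) and then run the same feature-finding machinery on those. A hedged NumPy sketch with made-up activations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-token activations for three "sentences" of varying length,
# each (num_tokens x hidden_dim); in practice these come from a transformer layer.
sentences = [rng.normal(size=(t, 16)) for t in (5, 9, 3)]

def pool(tokens, how="mean"):
    """Collapse a variable-length run of token activations into a single
    fixed-size vector, so sentence/function-level spans become first-class
    inputs to the interpretability pipeline."""
    if how == "mean":
        return tokens.mean(axis=0)
    if how == "max":
        return tokens.max(axis=0)
    raise ValueError(f"unknown pooling: {how}")

span_vecs = np.stack([pool(s) for s in sentences])
print(span_vecs.shape)  # one embedding per multi-token span
```

Mean pooling is only the simplest choice; attention-weighted pooling or taking the final-token state are obvious alternatives to compare.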
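For idea 3, the steering-vector recipe is often described as: take the difference of mean activations on contrastive prompt sets, then add that direction (scaled) to the hidden state at generation time. A toy NumPy sketch; the "positive"/"negative" activation sets and the scale `alpha` are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

# Hypothetical activations collected on contrastive prompt sets
# (e.g. completions with vs. without some trait); purely synthetic here.
acts_pos = rng.normal(size=(50, d)) + 1.0
acts_neg = rng.normal(size=(50, d))

# A common recipe: steering vector = difference of mean activations.
steer = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def apply_steering(hidden, vec, alpha=4.0):
    """Nudge a hidden state along the (normalized) steering direction;
    in a real model this would be added at a chosen layer during decoding."""
    return hidden + alpha * vec / np.linalg.norm(vec)

h = rng.normal(size=d)     # stand-in for one hidden state during generation
h2 = apply_steering(h, steer)

# The steered state's projection onto the steering direction increases.
u = steer / np.linalg.norm(steer)
print(float(h2 @ u - h @ u))
```

This is where an interpretability insight becomes an intervention: once a feature direction is identified, pushing along (or against) it is a concrete lever on incorrect generations.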
