Ilya B. to Mikyo's point, we're exploring fine-tuned smaller models vs SLMs for evaluating responses, while still using the LLM-as-a-Judge approach. We're hoping to add some fun augmentation via human-labeled data to further improve performance, but that's the general direction we're moving in.