Jane W., curious where you landed on this. At Google we’d lean more on encoder-style models rather than generative decoders, to get much more stable scores that we could tune with human ratings.