In case you are wondering about GPT-3.5-Instruct:
We tested it against our Task-Based Evals with very mixed results: performance is highly task dependent. Included below are two widely varying results against a golden test dataset.
Details of how this works are in the docs for our soon-to-be-officially-released Eval library:
https://docs.arize.com/phoenix/concepts/llm-evals