As you think about evaluations, consider whether LLM-as-a-judge or a plain code check is the right way to measure performance. In your case above, it's a bit unclear whether your eval could just be a code check.
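To illustrate the code-check side: when "correct" is something you can verify deterministically (valid JSON, required fields present, exact match), a small function is cheaper and more reliable than a judge. A minimal sketch, assuming your outputs are supposed to be JSON with `answer` and `confidence` keys (those key names are just illustrative):

```python
import json

def code_check_eval(output: str, required_keys=("answer", "confidence")) -> bool:
    """Deterministic eval: pass if the output is valid JSON containing
    the required keys. No LLM judge needed for this kind of check."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(key in parsed for key in required_keys)

print(code_check_eval('{"answer": "42", "confidence": 0.9}'))  # True
print(code_check_eval("not json at all"))                      # False
```

If your correctness criterion looks like this, start here; reserve LLM-as-a-judge for fuzzier qualities like tone or relevance.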
llm_classify is a lower-level LLM-as-a-judge operator that can be used inside an evaluator class. It's a nice place to start if your data is already in a dataframe: just write an eval prompt and pass it in.
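A hedged sketch of that workflow is below. The template text, column names (`input`, `output`), and rail labels are illustrative, and the actual `llm_classify` call is commented out because it needs a judge model and API key:

```python
# Rows you would evaluate; column names must match the template variables.
records = [
    {"input": "What is 2 + 2?", "output": "4"},
    {"input": "Capital of France?", "output": "Berlin"},
]

# An eval prompt with {variable} placeholders filled from each dataframe row.
CORRECTNESS_TEMPLATE = """You are judging whether an answer is correct.
Question: {input}
Answer: {output}
Respond with exactly one word: "correct" or "incorrect"."""

# Rails constrain the judge's free-text reply to a fixed label set.
RAILS = ["correct", "incorrect"]

# Illustrative call (requires phoenix and a configured judge model):
# import pandas as pd
# from phoenix.evals import OpenAIModel, llm_classify
# results = llm_classify(
#     dataframe=pd.DataFrame(records),
#     template=CORRECTNESS_TEMPLATE,
#     model=OpenAIModel(model="gpt-4o-mini"),  # any supported judge model
#     rails=RAILS,
# )
# results then carries one label per input row.
```

The key design point is the rails: they snap whatever the judge model says to one of your labels, so downstream aggregation stays clean.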
Experiments are the tests you want to run over time. Currently we show the results in tabular form, but the team is adding graphs to the experiments section.