Hi Xander S., Anthony P.,
Sorry if my question was not clear enough; I will try to explain it with examples.
I tried to reproduce your quickstarts from the documentation, including the following:
- Evals (https://docs.arize.com/phoenix/evaluation/evals)
- Datasets and experiments (https://docs.arize.com/phoenix/datasets-and-experiments/quickstart-datasets)
As I understand it, one of the main differences between Evals and Experiments is that in Experiments we also have a task to perform before evaluation. Is that correct?
Then I found that in both scenarios you use OpenAIModel as the example evaluation model.
For our use case, we want to experiment with the LangChain/LangGraph frameworks and use different LLMs (and Runnable chains built on them) that integrate with LangChain as evaluators.
In the Evals quickstart, you use predefined evaluators with OpenAIModel as their eval model:
```python
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator

# Set your OpenAI API key (e.g. via the OPENAI_API_KEY environment variable)
eval_model = OpenAIModel(model="gpt-4o-mini")

# Define your evaluators
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_evaluator = QAEvaluator(eval_model)
```
But I could not find a way to use my LangChain LLM/chain with your pre-built evaluators, or to create a custom evaluator, for the run_evals function used in this quickstart.
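To make clear what I would need from such a custom evaluator, here is a rough, framework-free sketch of the "classify with rails" pattern I understand the pre-built evaluators to follow (fill a template, make one judge call, snap the completion onto an allowed label). All names here are my own, not Phoenix API; `call_llm` is a stand-in for a LangChain chain invocation:

```python
from typing import Callable, Sequence

NOT_PARSABLE = "NOT_PARSABLE"

def classify_with_rails(
    template: str,
    variables: dict,
    rails: Sequence[str],
    call_llm: Callable[[str], str],
) -> str:
    """Fill the template, call the judge once, snap the raw output to a rail."""
    prompt = template.format(**variables)
    raw = call_llm(prompt).lower().strip()
    return raw if raw in rails else NOT_PARSABLE

# Offline demo with a stub judge in place of a real LLM call:
label = classify_with_rails(
    "Is '{answer}' correct for '{question}'? Reply accurate or inaccurate.",
    {"question": "2+2", "answer": "4"},
    rails=("accurate", "inaccurate"),
    call_llm=lambda prompt: " Accurate\n",
)
print(label)  # accurate
```

If the LLM interface expected by HallucinationEvaluator/QAEvaluator were documented, I could plug a LangChain chain in as the `call_llm` piece above.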
For the Experiments case, you are right that we can create a custom evaluator as a function and use it in the run_experiment function.
I created an example of a custom evaluator for a LangChain Runnable chain (based on the example from this quickstart), and it works correctly:
```python
from typing import Any, Dict

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from phoenix.experiments.evaluators import create_evaluator

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).
QUESTION: {question}
REFERENCE_ANSWER: {reference_answer}
ANSWER: {answer}
ACCURACY (accurate / inaccurate):
"""

@create_evaluator(kind="llm")  # need the decorator, or kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    prompt = ChatPromptTemplate.from_messages([("user", eval_prompt_template)])
    output_parser = StrOutputParser()
    chain = prompt | llm | output_parser
    response = chain.invoke(
        {
            "question": input["question"],
            "reference_answer": expected["answer"],
            "answer": output,
        }
    ).lower().strip()
    return 1.0 if response == "accurate" else 0.0
```
But I am confused: OpenAIModel works in both cases when passed to the evaluators argument of run_evals and run_experiment, yet for LangChain I can only use Experiments via the @create_evaluator decorator, and that evaluator cannot be reused in the Evals case. Passing the same custom evaluator to run_evals fails with the following error:
AttributeError: 'SyncEvaluator' object has no attribute 'default_concurrency'
Is there a universal way to create a custom evaluator that works for both the Evals and Experiments cases?
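In case it helps to show what I am after, here is a minimal, offline sketch of the workaround I am considering: factor the scoring logic into a plain function with the judge injected, so the same core could be wrapped once for run_experiment and, if it were supported, once for run_evals. `fake_judge` is a hypothetical stand-in for the LangChain chain's `.invoke`:

```python
from typing import Callable, Dict

def make_accuracy_scorer(
    invoke_judge: Callable[[Dict[str, str]], str],
) -> Callable[[str, str, str], float]:
    """Wrap an injected judge call, mapping its verdict to a 1.0/0.0 score."""
    def score(question: str, reference_answer: str, answer: str) -> float:
        verdict = invoke_judge(
            {
                "question": question,
                "reference_answer": reference_answer,
                "answer": answer,
            }
        ).lower().strip()
        return 1.0 if verdict == "accurate" else 0.0
    return score

# Hypothetical stand-in for chain.invoke so the sketch runs offline:
def fake_judge(inputs: Dict[str, str]) -> str:
    return "accurate" if inputs["answer"] == inputs["reference_answer"] else "inaccurate"

scorer = make_accuracy_scorer(fake_judge)
print(scorer("What is 2+2?", "4", "4"))  # 1.0
print(scorer("What is 2+2?", "4", "5"))  # 0.0
```

The framework-specific adapters would then be thin wrappers around `scorer`, so the judging chain itself is defined only once.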