Hi, could you please tell me whether it is possible to use LangChain LLM models with arize-phoenix-evals out of the box? If there is no such support at the moment, do you have any documentation on creating custom models for this purpose? Thank you.
Phoenix does support LangChain applications, including tracing and evaluation, but there is no explicit mention of out-of-the-box support for LangChain LLM models with Arize Phoenix Evals in the provided data sources. However, you can create custom models for evaluation purposes. According to the Phoenix documentation, you can use the LangChainInstrumentor to trace LangChain applications, which suggests some level of integration.

For creating custom models, you can define a new class inheriting from BaseEvalModel and implement the necessary methods for your custom LLM, as shown in the GitHub issue. This involves setting up an HTTP client and parsing responses to integrate your custom LLM with Phoenix Evals.

Would you like more detailed guidance on creating custom models? Sources:
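To make the pattern from that GitHub issue concrete, here is a minimal, runnable sketch of the "HTTP client plus response parsing behind a single prompt-in, text-out call" idea. Everything here is a hypothetical stand-in: `MyCustomModel`, `fake_transport`, and the response shape are illustrative only, and a real implementation would inherit from Phoenix's actual base class and use a real HTTP client.

```python
# Sketch of the wrapper pattern described in the GitHub issue: hide a custom
# LLM behind one "prompt in, completion text out" call. MyCustomModel and
# fake_transport are hypothetical stand-ins; a real version would subclass
# the Phoenix Evals base model class and call an actual HTTP endpoint.
from typing import Callable


class MyCustomModel:
    """Adapter exposing a Phoenix-style __call__ over any completion backend."""

    def __init__(self, transport: Callable[[str], dict]):
        self._transport = transport  # e.g. an HTTP client posting to your LLM

    def __call__(self, prompt: str) -> str:
        raw = self._transport(prompt)             # send the request
        return raw["choices"][0]["text"].strip()  # parse the response


def fake_transport(prompt: str) -> dict:
    # Stand-in for a real HTTP call so the sketch runs offline.
    return {"choices": [{"text": f"echo: {prompt}"}]}


model = MyCustomModel(fake_transport)
print(model("hello"))  # → echo: hello
```

The key design point is that the evaluation framework only ever sees the single callable, so the transport and parsing details stay swappable.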
- Phoenix Tracing Integrations: LangChain
- GitHub Issue: Custom LLM
I saw this PR, but the bot's positive answer is not consistent with the answer in the PR, and I can't find any additional information about current support or about creating custom LLMs (including LangChain models), so our question is still relevant.
The same question about LangChain LLM support also applies to using it in the Prompt Playground.
Hey there Dmitry, thanks for reaching out! As far as I am aware, we only support the models listed here https://docs.arize.com/phoenix/api/evaluation-models#supported-llm-providers for built-in Phoenix evals. In particular, you may be able to configure LiteLLM to proxy to other models that aren't explicitly listed in the documentation. Does anyone else on the team have experience with evals and custom models?
Anthony P. thank you for your answer. We need to use some LLMs that are available via LangChain integration packages, so if there is no current support for LangChain LLMs, it would be great to find out how to properly build that support ourselves. I also found that Experiments use these models as well, so by implementing this we could cover both. The main task now is to get information on the correct way to customize for this.
You can also create your own evaluators, in which you could invoke LangChain-wrapped LLM providers. You can follow the custom evaluators guide here: https://docs.arize.com/phoenix/datasets-and-experiments/how-to-experiments/using-evaluators#custom-evaluators
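Judging by that guide, the evaluator shape for experiments can be as simple as a plain function over the task output (treat the exact signature as something to verify against the docs). A minimal runnable example, into which a LangChain chain call could later be substituted:

```python
# Minimal custom evaluator: a plain function scoring the task output.
# Experiment evaluators can be this simple; an LLM-as-judge version would
# replace the string check with e.g. chain.invoke(...) on a LangChain chain.
def contains_keyword(output: str) -> bool:
    return "keyword" in output.lower()


print(contains_keyword("The KEYWORD appears here."))  # → True
print(contains_keyword("Nothing relevant."))          # → False
```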
Yes, but we need not custom evaluators but a custom LLM model from a LangChain integration package, and LangChain doesn't currently have direct support, based on the information from https://docs.arize.com/phoenix/api/evaluation-models#supported-llm-providers
Apologies for the delay in response here. Would you mind pointing me to some documentation for one of the langchain integrations you would like to use, so that I can get a better sense of the functionality you are looking for?
Hi Xander S., Anthony P. Sorry if my question was not clear enough; I will try to explain it with examples. I tried to reproduce the quickstarts from your documentation, including the following:
- Evals (https://docs.arize.com/phoenix/evaluation/evals)
- Datasets and experiments (https://docs.arize.com/phoenix/datasets-and-experiments/quickstart-datasets)

As I understand it, one of the main differences between Evals and Experiments is that in Experiments we also have a task to perform before evaluation; is that correct? Then I found that in both scenarios you use OpenAIModel as the example evaluation model. For our cases, we want to experiment with the langchain/langgraph frameworks and to use different LLMs (and Runnable chains based on them) that have LangChain integrations as evaluators.

For the Evals quickstart, I found that you use predefined evaluators with OpenAIModel as the eval model for them:
```python
# Set your OpenAI API key
eval_model = OpenAIModel(model="gpt-4o-mini")

# Define your evaluators
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_evaluator = QAEvaluator(eval_model)
```

But I didn't find a way to use my LLM/chain from LangChain with your pre-built evaluators, or to create a custom evaluator, for example for the run_evals function that you use in this quickstart.

For the Experiments case, I understand you are correct that we can create a custom evaluator as a function and use it in the run_experiment function. I created an example of a custom evaluator for a LangChain Runnable chain (based on the example from this quickstart) and it works correctly:
```python
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
)

eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).
QUESTION: {question}
REFERENCE_ANSWER: {reference_answer}
ANSWER: {answer}
ACCURACY (accurate / inaccurate):
"""


@create_evaluator(kind="llm")  # need the decorator or the kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    prompt = ChatPromptTemplate.from_messages(
        [("user", eval_prompt_template)]
    )
    output_parser = StrOutputParser()
    chain = prompt | llm | output_parser
    response = chain.invoke({
        "question": input["question"],
        "reference_answer": expected["answer"],
        "answer": output,
    }).lower().strip()
    return 1.0 if response == "accurate" else 0.0
```

But I am confused: even though you use OpenAIModel in both cases, passing it to the evaluators argument of both the run_evals and run_experiment functions, and it works correctly, for LangChain usage I can only cover Experiments via the @create_evaluator decorator, and this evaluator can't be used in the Evals case. The same custom evaluator fails with the following error:

AttributeError: 'SyncEvaluator' object has no attribute 'default_concurrency'

Is there any universal way to create a custom evaluator for both the Evals and Experiments cases?
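While the two interfaces differ, one way to avoid duplicating logic is to keep the LLM-judging code in a single framework-agnostic function and wrap it twice: once in the signature run_experiment expects, and once in whatever shape the Evals path can consume (here just a row-wise callable for a dataframe loop). A runnable sketch of the pattern, with the LangChain chain stubbed out by `judge` so it runs offline; all names are hypothetical and only the structure is the point:

```python
# Pattern: framework-agnostic scoring core plus thin per-framework adapters.
# judge() stands in for chain.invoke({...}) on a real LangChain chain.
def judge(question: str, reference: str, answer: str) -> str:
    # Stand-in LLM verdict: returns "accurate" or "inaccurate".
    return "accurate" if answer == reference else "inaccurate"


def score(question: str, reference: str, answer: str) -> float:
    return 1.0 if judge(question, reference, answer) == "accurate" else 0.0


# Adapter 1: Experiments-style evaluator signature for run_experiment
# (would additionally carry the @create_evaluator(kind="llm") decorator).
def accuracy_evaluator(input: dict, output: str, expected: dict) -> float:
    return score(input["question"], expected["answer"], output)


# Adapter 2: row-wise callable for evaluating a dataframe of examples,
# pending first-class custom-model support in run_evals.
def accuracy_rowwise(row: dict) -> float:
    return score(row["question"], row["reference_answer"], row["answer"])


print(accuracy_evaluator({"question": "q"}, "Paris", {"answer": "Paris"}))  # → 1.0
print(accuracy_rowwise({"question": "q", "reference_answer": "Paris", "answer": "Lyon"}))  # → 0.0
```

The adapters stay a few lines each, so if Phoenix later adds a native LangChain model, only the adapters need to change, not the scoring logic.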
Hey Dmitry, thanks for the clarification! We don't currently support a LangChain model, although please feel free to file an enhancement request. In general, run_experiment was designed to be as flexible as possible, so that users can use whatever third-party libraries they want inside it. We don't typically expect users to implement their own models, although you can certainly give it a shot for LangChain by taking a look at the OpenAI model and the other model interfaces within that folder. Hope this helps!
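For anyone attempting this, a rough, stdlib-only sketch of what such a wrapper might look like. The `BaseEvalModel` below is a stand-in, not the real Phoenix class; the actual base class, method names, and signatures should be copied from the model interfaces in the Phoenix source folder mentioned above, and `FakeChatModel` stands in for a real LangChain chat model such as ChatOpenAI.

```python
# Hypothetical sketch: adapting a LangChain chat model to an eval-model
# interface. BaseEvalModel here is a stdlib stand-in that only mimics the
# general shape of Phoenix's model classes; verify everything against the
# real source before relying on it.
from abc import ABC, abstractmethod


class BaseEvalModel(ABC):  # stand-in, NOT the real Phoenix class
    def __call__(self, prompt: str) -> str:
        return self._generate(prompt)

    @abstractmethod
    def _generate(self, prompt: str) -> str: ...


class LangChainModel(BaseEvalModel):
    """Wraps any object exposing LangChain's .invoke(prompt) -> message."""

    def __init__(self, chat_model):
        self._chat_model = chat_model

    def _generate(self, prompt: str) -> str:
        message = self._chat_model.invoke(prompt)
        # LangChain chat models return a message object carrying .content
        return getattr(message, "content", str(message))


class FakeChatModel:  # offline stand-in for e.g. ChatOpenAI
    def invoke(self, prompt: str):
        return type("Msg", (), {"content": f"judged: {prompt}"})()


model = LangChainModel(FakeChatModel())
print(model("Is the answer accurate?"))  # → judged: Is the answer accurate?
```

Since LangChain chat models share the Runnable `.invoke` interface, one wrapper like this should cover any provider available through a LangChain integration package.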
