Hi, I am new to Phoenix, and overall think it's a pretty splendid project, but I get the feeling it's rapidly evolving to the point where the documentation is leaving me quite confused as a newcomer. An updated evaluation framework (v2) was just released, but is https://arize-phoenix.readthedocs.io/projects/evals/en/latest/index.html the sole source of documentation for it? It seems that the main docs at https://arize.com/docs/phoenix/evaluation/llm-evals still use the legacy framework. It leaves me with some questions:
How do I use evals together with LiteLLM with a custom model or endpoint? The LLM class seems very restricted.
How does the new evals framework interact with the dataset/experiments framework?
To use LiteLLM with a custom model or endpoint in Phoenix Evals v2, instantiate LiteLLMModel and pass model-specific parameters via the model_kwargs argument. For custom endpoints (e.g., Anthropic, Ollama, or your own proxy), set the required environment variables or include endpoint details in model_kwargs as needed by LiteLLM. Example:

```
import os

from phoenix.evals import LiteLLMModel

os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"  # for Ollama

model = LiteLLMModel(
    model="ollama/llama3",
    model_kwargs={"api_base": "http://your-custom-endpoint"},
)
```

You can find more details in the official Phoenix Evals v2 documentation and the LiteLLMModel usage guide. The main docs at arize.com still reference the legacy framework, so prefer the readthedocs site for v2.

Would you like a step-by-step example for a specific provider or endpoint?

Sources:
- LiteLLMModel Example
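For completeness, here is how a LiteLLMModel object like the one above is typically plugged into an actual eval run with the legacy framework's llm_classify helper. This is a minimal sketch, assuming the relevance template and a toy dataframe; the data and model choice are purely illustrative:

```
import pandas as pd

from phoenix.evals import (
    LiteLLMModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Toy data; the relevance template expects "input" and "reference" columns.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source LLM observability platform."],
    }
)

# Any LiteLLM route works here, e.g. a local Ollama model.
model = LiteLLMModel(model="ollama/llama3")

# Classify each row against the template's rails (relevant / unrelated).
evals_df = llm_classify(
    dataframe=df,
    model=model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(evals_df[["label", "explanation"]])
```

Note that this is the legacy-framework path; as the next message points out, LiteLLMModel is not part of the v2 API.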
You are correct: LiteLLMModel is part of the legacy Phoenix evaluation framework and does not appear in the API listing for the new v2 evaluation framework (see v2 docs). The v2 framework uses a new abstraction, typically via the LLM class and adapters, and does not expose LiteLLMModel as a public API.

Would you like more detail on how to use custom models or endpoints with the v2 evaluation framework?

Sources:
- see v2 docs
To use a custom model or endpoint with the Phoenix Evals v2 framework, you should use the new LLM abstraction, which allows you to specify the provider, model, and connection details directly. For example, to use a custom endpoint (such as a LiteLLM proxy or an OpenAI-compatible API), you can do:

```
from phoenix.evals.preview.llm import LLM

llm = LLM(
    provider="openai",  # or "litellm" if supported
    model="your-model-name",
    api_key="your-api-key",
    base_url="http://your-custom-endpoint",
)
```

This approach is documented in the phoenix-evals preview README. The v2 framework is designed to be adapter-based, so you can use any OpenAI-compatible endpoint by specifying the correct provider and base URL.

Would you like a full example of running an evaluation with this setup?

Sources:
- phoenix-evals preview README
Yes, you can use LiteLLM with Ollama by setting the appropriate environment variable (OLLAMA_API_BASE) and initializing the LLM with the correct model name. The v2 framework's LLM abstraction does not officially document support for arbitrary kwargs, but for LiteLLM/Ollama the standard approach is:

```
import os

from phoenix.evals.preview.llm import LLM

os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"

llm = LLM(provider="litellm", model="ollama/llama3")
```

This pattern is shown in the Phoenix documentation and code examples. Passing additional kwargs directly to LLM is not documented for v2, so rely on environment variables for endpoint configuration.

Would you like a complete evaluation example using this setup?

Sources:
- Phoenix evaluation-models.md
RunLLM litellm does not appear to be a valid provider value at all; it can only be used as a client? Could you also touch on my second question, "how does the new evals framework interact with the dataset/experiments framework?" Namely, can I do evaluation as part of executing an experiment, or will I have to execute it with run-evals or similar and then upload manually?
You are correct: in the Phoenix Evals v2 framework, litellm is not a valid provider; it is used as a client argument, while the provider should be the actual model provider (e.g., "openai", "anthropic", etc.). For example:

```
from phoenix.evals.preview.llm import LLM

llm = LLM(provider="openai", model="gpt-3.5-turbo", client="litellm")
```

This is confirmed in the LLM wrapper documentation.

Regarding integration with the dataset/experiments framework: as of Evals v2, evaluation is decoupled from the experiments framework. You typically run evaluations separately (e.g., with run-evals or programmatically), then upload or log results to Phoenix. Direct integration, where evaluation is performed as part of an experiment execution, is not currently documented for v2; you must run evals and experiments as separate steps and manage results accordingly (see v2 docs).

Would you like more detail or a workflow example for this process?

Sources:
- LLM wrapper documentation
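To make that two-step workflow concrete, here is a minimal sketch of the classic "run evals, then log results back" loop over traces. It uses the legacy llm_classify helper, the hallucination template, and the documented get_qa_with_reference helper purely as an illustration; the v2 equivalents are not settled yet:

```
import phoenix as px
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.trace import SpanEvaluations

client = px.Client()

# 1. Pull question/answer/reference rows (indexed by span id) from Phoenix.
queries_df = get_qa_with_reference(client)

# 2. Run the eval with whatever model object you configured above.
evals_df = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# 3. Log the results back so they show up in the Phoenix UI.
client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=evals_df)
)
```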
Apologies for the out-of-date documentation. We're working on a full revamp. Let us know if there's anything we can do in the meantime.
Mikyo sounds awesome 🙂 We are currently using GitHub Models for prototyping, and it generally plays nice with LangChain/LangGraph if I either treat it as an OpenAI model with a custom api_base/api_key, OR if I use LiteLLM with the github/ prefix. Is there anything I can do to run v2 evals with either of those modes? Something like wrapping a LangChain chat interface or passing api_base/api_key?
would also be interested to hear a human perspective on how evals v2 is intended to work with datasets/experiments 🙂
This is one of the core motivations actually. They should "just work" on most cases. We should have much better documentation on this in the coming week.
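For anyone reading along, the general shape of the experiments hook already looks like this today. A minimal sketch, assuming a dataset already uploaded to Phoenix; the dataset name, the task body, and my_chain are placeholders, and v2 LLM evaluators are expected to slot into the evaluators list the same way as the plain function below:

```
import phoenix as px
from phoenix.experiments import run_experiment

# Assumes the dataset was previously uploaded, e.g. via px.Client().upload_dataset(...).
dataset = px.Client().get_dataset(name="my-prototype-dataset")

def task(input):
    # Call your application here (LangChain/LangGraph chain, etc.).
    # The key used below depends on your dataset's input schema.
    return my_chain.invoke(input["question"])

def contains_answer(output, expected) -> bool:
    # A trivial code-based evaluator; LLM evaluators plug in the same way.
    return expected["answer"].lower() in str(output).lower()

experiment = run_experiment(
    dataset,
    task,
    evaluators=[contains_answer],
    experiment_name="prototype-eval",
)
```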
Let me check with the team. If it works with LangChain or LiteLLM it should work but we may have not known about the github prefix. (https://github.com/Arize-ai/phoenix/issues/9685)
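Until that's confirmed, one thing worth trying is the OpenAI-compatible route with the v2 LLM wrapper shown earlier in the thread. This is a sketch only: the base URL below is illustrative (use whatever GitHub Models endpoint already works for you with LangChain), and the constructor arguments mirror the preview README example above, not a documented GitHub Models integration:

```
import os

from phoenix.evals.preview.llm import LLM

# Treat GitHub Models as a plain OpenAI-compatible endpoint:
# point base_url at the inference endpoint and authenticate with a GitHub token.
llm = LLM(
    provider="openai",
    model="gpt-4o-mini",  # placeholder model name
    api_key=os.environ["GITHUB_TOKEN"],
    base_url="https://models.github.ai/inference",  # illustrative URL
)
```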