I have a little POC in progress on Phoenix. I am trying to tool up around a model of f(x) = y, where f is the prompt, x is the input, and y is the output. From there, you can say _metric(f(x1), f(x2)), etc. My concern is that Phoenix doesn't seem to have a facility for tracking f data alongside the x and y data. So my question is: what is everyone doing to examine the values of f, e.g. the prompt template, its subdependencies, RAG stuff, the LM version used … everything that goes into configuring an LM prompt?
To proceed with your proof of concept in Phoenix, you can leverage its capabilities for evaluating and experimenting with models. Here's a general approach:
1. Define Your Model and Data: Ensure that your model f(x) = y is well-defined, where f is the prompt, x is the input, and y is the output. You can use Phoenix to log and trace these inputs and outputs.
2. Evaluate Metrics: To evaluate metrics like _metric(f(x1), f(x2)), you can use Phoenix's evaluation framework. You may need to create a custom evaluator if the metric is not predefined. This involves preparing your data, defining the evaluation logic, and logging the results back to Phoenix.
Would you like more detail on setting up custom evaluators or logging evaluations in Phoenix?
Sources:
- Bring Your Own Evaluator
- Phoenix User Guide
y will be stored as experiment run outputs, or reference outputs if you upload them with the original dataset
an evaluation can be run over all of the above
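For the "tracking f alongside x and y" part, here is a rough sketch in plain Python, not a Phoenix API. It assumes f can be described by a small config object, and every name in it (`PromptConfig`, `fingerprint`, `make_record`, `exact_match`) is made up for illustration. The idea is to hash the full prompt configuration so that every (x, y) record carries an id for the exact f that produced it:

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptConfig:
    """Everything that defines f. All field names here are hypothetical."""

    template: str
    model: str
    retriever: str = "none"
    params: tuple = ()  # e.g. (("temperature", 0.0),)

    def fingerprint(self) -> str:
        # Stable short hash of the whole config: same f, same id.
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


def exact_match(y1: str, y2: str) -> float:
    """Toy stand-in for _metric(f(x1), f(x2))."""
    return 1.0 if y1.strip().lower() == y2.strip().lower() else 0.0


def make_record(f: PromptConfig, x: str, y: str) -> dict:
    """Bundle f, x, and y so they can be stored and compared together."""
    return {"f": asdict(f), "f_id": f.fingerprint(), "x": x, "y": y}
```

A dict like the one `make_record` returns could then be attached as metadata to whatever experiment or run record you keep, so the f side stays queryable next to the inputs and outputs.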
I will give that a shot
what are people doing for dashboards? I'm thinking of something like a list of LM functions with their current SOTA accuracy, plus other metadata about that SOTA experiment
like a leaderboard
but that’s just one possible dashboard
if you run multiple experiments against the same dataset, you can run the same metric across all of them and see how it evolves
right
is there a way to customize the dataset index to show the best experiment metrics?
you mean how each experiment is ordered?
not currently, as far as I know
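In the meantime, a leaderboard like the one described can be built outside the UI once the experiment results are exported. This is only a sketch under assumptions: it takes each run to be a plain dict with hypothetical keys `function`, `experiment`, and `accuracy`, and it does not call Phoenix at all.

```python
from typing import Iterable


def leaderboard(runs: Iterable[dict]) -> list[dict]:
    """Keep the best-scoring experiment per LM function, best first."""
    best: dict[str, dict] = {}
    for run in runs:
        current = best.get(run["function"])
        if current is None or run["accuracy"] > current["accuracy"]:
            best[run["function"]] = run
    return sorted(best.values(), key=lambda r: r["accuracy"], reverse=True)


# Made-up rows, for illustration only.
example_runs = [
    {"function": "summarize", "experiment": "exp-1", "accuracy": 0.71},
    {"function": "summarize", "experiment": "exp-2", "accuracy": 0.78},
    {"function": "classify", "experiment": "exp-3", "accuracy": 0.90},
]

for row in leaderboard(example_runs):
    print(row["function"], row["experiment"], row["accuracy"])
```

Any extra metadata about the winning experiment (prompt version, model, dataset id) can ride along as more keys on each row.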
are people using Arize, the paid platform, with Phoenix to do dashboards?
I think so, I'd refer you to 🔒[private user] to talk about how people are using the main platform
