Tracking Prompt Configurations in Phoenix for Effective Model Evaluation

·Nov 26, 2024 06:16 PM

I have a little POC in progress on Phoenix. I am trying to tool up around a model of f(x) = y, where f is the prompt, x is the input, and y is the output. From there, you can compute _metric(f(x1), f(x2)), etc. My concern is that Phoenix doesn’t seem to have a facility for tracking f data alongside the x and y data. So my question is: what is everyone doing to examine the values of f, e.g. the prompt template, its subdependencies, RAG stuff, the LM version used … everything that goes into configuring an LM prompt?

16 comments


RunLLM
APP
·
To proceed with your proof of concept in Phoenix, you can leverage its capabilities for evaluating and experimenting with models. Here's a general approach:
1. Define Your Model and Data: Ensure that your model f(x) = y is well-defined, where f is the prompt, x is the input, and y is the output. You can use Phoenix to log and trace these inputs and outputs.
2. Evaluate Metrics: To evaluate metrics like _metric(f(x1), f(x2)), you can use Phoenix's evaluation framework. You may need to create a custom evaluator if the metric is not predefined. This involves preparing your data, defining the evaluation logic, and logging the results back to Phoenix.
Would you like more detail on setting up custom evaluators or logging evaluations in Phoenix?
Sources: - Bring Your Own Evaluator - Phoenix User Guide
Dustin N.
·
hi aaron s. I think you can consider using Phoenix experiments for this. The inputs you store in the dataset can contain both f and x as keys of a dictionary or other nested structure, and the task you define can be a call to f(x)
Dustin N.
·
y will be stored as experiment run outputs, or reference outputs if you upload them with the original dataset
Dustin N.
·
an evaluation can be run over all of the above
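A minimal sketch of the layout suggested above, in plain Python rather than the Phoenix API: each example's input carries both f (the prompt configuration) and x, the task computes f(x), and an evaluator scores the output against a reference. The names `examples`, `fake_llm`, `task`, and `exact_match` are hypothetical, and `fake_llm` is a stand-in for a real model call.

```python
from statistics import mean

# Hypothetical dataset: every input stores f (prompt config) next to x.
examples = [
    {
        "input": {
            "f": {"template": "Uppercase this: {text}", "model": "stub-model-v1"},
            "x": {"text": "hello"},
        },
        "expected": "HELLO",
    },
    {
        "input": {
            "f": {"template": "Uppercase this: {text}", "model": "stub-model-v1"},
            "x": {"text": "phoenix"},
        },
        "expected": "PHOENIX",
    },
]


def fake_llm(prompt: str, model: str) -> str:
    """Stand-in for a real model call: uppercases the text after the colon."""
    return prompt.split(": ", 1)[1].upper()


def task(example_input: dict) -> str:
    """Compute y = f(x): render the template with x, then call the model."""
    f, x = example_input["f"], example_input["x"]
    prompt = f["template"].format(**x)
    return fake_llm(prompt, f["model"])


def exact_match(output: str, expected: str) -> float:
    """Evaluator: 1.0 when the output equals the reference, else 0.0."""
    return float(output == expected)


# Each run keeps f, x, and y together, so f can be inspected alongside results.
runs = [
    {"input": ex["input"], "output": task(ex["input"]), "expected": ex["expected"]}
    for ex in examples
]
accuracy = mean(exact_match(r["output"], r["expected"]) for r in runs)
```

Because f lives inside the input record, the template and model version travel with every run and can be compared across runs without a separate tracking store.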
aaron s.
·
I will give that a shot
aaron s.
·
what are people doing for dashboards? I am thinking like I have a list of LM functions and their current SOTA accuracy, and other metadata about that SOTA experiment
aaron s.
·
like a leaderboard
aaron s.
·
but that’s just one possible dashboard
Dustin N.
·
if you run multiple experiments against the same dataset, you can run the same metric across all of them and see how it evolves
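One way to picture the leaderboard idea, sketched in plain Python and not tied to any Phoenix feature: given per-example scores from several experiments over the same dataset, aggregate the same metric for each and sort. The experiment names, prompt configs, and scores below are made-up illustration data, and `leaderboard` is a hypothetical helper.

```python
from statistics import mean

# Made-up results: same dataset and metric, different prompt configs (f).
experiments = {
    "exp-a": {"f": {"template": "v1", "model": "stub-model-v1"}, "scores": [1.0, 0.0, 1.0, 1.0]},
    "exp-b": {"f": {"template": "v2", "model": "stub-model-v1"}, "scores": [1.0, 1.0, 1.0, 1.0]},
    "exp-c": {"f": {"template": "v1", "model": "stub-model-v2"}, "scores": [0.0, 0.0, 1.0, 1.0]},
}


def leaderboard(results: dict) -> list:
    """Return one row per experiment, best mean score first."""
    rows = [
        {"experiment": name, "accuracy": mean(run["scores"]), **run["f"]}
        for name, run in results.items()
    ]
    return sorted(rows, key=lambda row: row["accuracy"], reverse=True)


board = leaderboard(experiments)
for row in board:
    print(row["experiment"], row["accuracy"], row["template"], row["model"])
```

The top row is the current best configuration for that dataset, with its prompt metadata attached.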
aaron s.
·
right
aaron s.
·
is there a way to customize the dataset index to show the best experiment metrics?
Dustin N.
·
you mean how each experiment is ordered?
Dustin N.
·
not currently, as far as I know
aaron s.
·
are people using Arize, the paid platform, with Phoenix to do dashboards?
Dustin N.
·
I think so, I'd refer you to 🔒[private user] to talk about how people are using the main platform
