No error messages, just the logs I shared above.
Yes, I wonder why the evaluations are not getting logged 🤔
But my traces are being sent via this endpoint, so why aren't the evaluations being sent to the same one?
If we do not provide any endpoint inside px.Client(), then by default it takes it from the environment variable, right?
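For reference, here is a minimal sketch of that fallback behaviour as I understand it (the helper name `resolve_collector_endpoint` is hypothetical — it just mirrors the assumed lookup order: explicit argument, then the `PHOENIX_COLLECTOR_ENDPOINT` environment variable, then localhost):

```python
import os

def resolve_collector_endpoint(default="http://127.0.0.1:6006"):
    # Hypothetical helper mirroring the assumed lookup order used by
    # px.Client(): the PHOENIX_COLLECTOR_ENDPOINT environment variable
    # if set, otherwise a localhost default.
    return os.environ.get("PHOENIX_COLLECTOR_ENDPOINT", default)

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://arize-ui.default.svc.cluster.local:80"
print(resolve_collector_endpoint())  # → http://arize-ui.default.svc.cluster.local:80
```

So if the environment variable is set in the deployment, constructing `px.Client()` with no arguments should pick it up.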
Actually, I have set the endpoint like this in the Kubernetes deployment:
- name: PHOENIX_COLLECTOR_ENDPOINT
value: http://arize-ui.default.svc.cluster.local:80/v1/traces
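One thing worth checking (an assumption on my part, not a confirmed diagnosis): `PHOENIX_COLLECTOR_ENDPOINT` is generally expected to be the *base* URL of the collector, with the client appending its own paths, so including the `/v1/traces` suffix may break `px.Client()`'s REST calls (evaluation upload) even while OTLP trace export happens to work. The base form would look like:

```yaml
- name: PHOENIX_COLLECTOR_ENDPOINT
  value: http://arize-ui.default.svc.cluster.local:80
```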
The traces are being sent to the UI, but the evaluations are not. We have used px.Client() like this:
logger.info("Getting input, output and reference from traces ...")
test_spans = px.Client().get_spans_dataframe()
logger.info(f"############# {test_spans}")
input_output_df = get_qa_with_reference(px.Client())
logger.info(f"####### {input_output_df}")

This is the log which I am getting now:
The evaluation based on dataset:100001. 5 questions to run
2024-03-05T18:59:57.761503+0000 INFO [job_builder] - Getting input, output and reference from traces ...
2024-03-05T18:59:58.119939+0000 INFO [job_builder] - ############# name ... attributes.status
context.span_id ...
b7ff27afb7de0567 AzureChatOpenAI ... None
12edcc7d11ab52bb LLMChain ... None
ad29ee6a9b54714d StuffDocumentsChain ... None
163cd80dbe74b0db ConversationalRetrievalWithScoreChain ... None
ac51b20dccd0fb28 AzureChatOpenAI ... None
... ... ... ...
4025ad8114ed6e9c get_answer ... OK
e2704ad13a15b2a6 get_answer ... OK
345b2087aa39b26b get_answer ... OK
23fdd79e69456cf2 get_answer ... OK
2154118ee030ff6a get_answer ... OK
[527 rows x 29 columns]
2024-03-05T18:59:58.218531+0000 INFO [job_builder] - ####### Empty DataFrame
Columns: [input, output]
Index: []
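When get_qa_with_reference comes back empty even though get_spans_dataframe shows rows, a quick check is to look at which span kinds are present, since (as far as I understand it) the helper joins retriever and LLM spans by their input/output attributes. A sketch of that check — the toy dataframe and the `span_kind` column name are assumptions based on the usual OpenInference attributes; in the real run, inspect `test_spans` the same way:

```python
import pandas as pd

# Toy stand-in for px.Client().get_spans_dataframe(); with the real
# dataframe, run the same value_counts() on the span-kind column.
test_spans = pd.DataFrame({
    "name": ["AzureChatOpenAI", "get_answer"],
    "span_kind": ["LLM", "CHAIN"],  # a RETRIEVER kind would also be expected
})
print(test_spans["span_kind"].value_counts())
# If no RETRIEVER/LLM spans with input and output values show up,
# get_qa_with_reference has nothing to join and returns an empty DataFrame.
```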
2024-03-05T18:59:58.305156+0000 INFO [job_builder] - Running evaluations: dict_values(['Correctness', 'Hallucination', 'Toxicity', 'Groundtruth'])
2024-03-05T18:59:58+0000 WARNING [executor] - 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
run_evals | | 0/0 (0.0%) | ⏳ 00:00<? | ?it/s
run_evals | | 0/0 (0.0%) | ⏳ 00:00<? | ?it/s
2024-03-05T18:59:58.622718+0000 INFO [job_builder] - Log evaluation results to UI for: dict_values(['Correctness', 'Hallucination', 'Toxicity', 'Groundtruth'])

This is my main code where I have implemented tracing:
def submit(self):
    init_logger()
    # to do: change to support local server and remote server based on the server set
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://arize.zscaler.site"
    session = px.launch_app()
    LangChainInstrumentor().instrument()
    ai_dresult = {}
    question_num = 0
    for question, human_answer in self.question_answer_pool.items():
        ai_answer = self.chat_app_run(question)
        ai_dresult[question] = ai_answer
        question_num += 1
    logger.info(f"The evaluation based on dataset:{self.dataset_id}. {question_num} questions to run")
    # log the traces
    logger.info("Getting input, output and reference from traces ...")
    input_output_df = get_qa_with_reference(px.Client())
    input_output_df["correct_answer"] = input_output_df["input"].apply(
        lambda x: self.question_answer_pool[x])
    input_output_df["ai_answer"] = input_output_df["input"].apply(
        lambda x: ai_dresult[x])
    retrieved_documents_df = get_retrieved_documents(px.Client())
    if self.evaluators_qa_with_reference:
        evaluations_list = self.evaluators_qa_with_reference.values()
        logger.info(f"Running evaluations: {evaluations_list}")
        names = list(self.evaluators_qa_with_reference.values())
        df_results = run_evals(
            dataframe=input_output_df,
            evaluators=list(self.evaluators_qa_with_reference.keys()),
            provide_explanation=True,
        )
        logger.info(f"Log evaluation results to UI for: {evaluations_list}")
        for index, df_result in enumerate(df_results):
            px.Client().log_evaluations(
                SpanEvaluations(eval_name=names[index], dataframe=df_result)
            )
            self.dashboard_data[names[index]] = df_result.mean(numeric_only=True)["score"]
    if self.evaluators_retrieved_documents:
        evaluations_retrieved = self.evaluators_retrieved_documents.values()
        logger.info(f"Running evaluations: {evaluations_retrieved}")
        relevance_eval_df = run_evals(
            dataframe=retrieved_documents_df,
            evaluators=list(self.evaluators_retrieved_documents.keys()),
            provide_explanation=True,
        )[0]
        logger.info(f"Log evaluation results to UI: {evaluations_retrieved}")
        px.Client().log_evaluations(
            DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
        )
        self.dashboard_data["Relevance"] = relevance_eval_df.mean(numeric_only=True)["score"]
    if self.run_outputtone:
        logger.info("Run evaluation: outputtone")
        output_tone_df = run_output_tone_evaluation(input_output_df)
        logger.debug(f"output_tone_df:{output_tone_df}")
        logger.info("Log evaluation results to UI: outputtone")
        px.Client().log_evaluations(
            SpanEvaluations(eval_name="output_tone", dataframe=output_tone_df))
    # to do: implement the function
    self.dashboard_data["timestamp"] = datetime.now().timestamp()
    if self.dataset_id > BENCHMARK_DATASET_ID_MIN and self.project_name != PROJECT_NAMES['RANDOM']:
        logger.info("Log evaluations scores to BigQuery")
        log_scores_to_bigQuery(self.dashboard_data)
    # to do: check condition, only do this for running on local trace server
    import time
    time.sleep(self.job_save_seconds_for_local_run)
This is my usage_example.py:

from llamaas.qa.chatbot_qa.job import JobBuilder
from llamaas.qa.chatbot_qa.utils import load_from_json
from llamaas.model_connector.azure_resources import DevGPT4_32K
from llamaas.qa.chatbot_qa.constants import *
from llamaas.qa.eval_utils.eval_models import get_azure_openai_model
from llamaas.orchestrator.doc_qa.scoped_doc_qa_app import ScopedDocQaApp
from llamaas.qa.chatbot_qa.server import TraceServer

# [required] the dataset used for evaluation; format follows ./questions_answers_template.json
json_questions_answers = load_from_json("./questions_answers_template.json")
# [required] the instance of the app under evaluation
chat_app = ScopedDocQaApp(
    vector_store_path="/Users/priya/datasci/All")
# [required] the method name to get an answer; the evaluation job will pass the question as a parameter
run_method = "get_answer"
# [optional] the Azure model used for evaluation; if not set, DevGPT4_32K is used by default (set in set_tasks)
eval_model = get_azure_openai_model(DevGPT4_32K)

job = (JobBuilder()
       # [required] the question_answers dataset, same format as questions_answers_template.json
       .set_question_answer_pool(json_questions_answers)
       # [optional] trace server IP; for a local run set "127.0.0.1" (also the default if not set)
       .set_server(TraceServer("127.0.0.1"))
       # [required] the model used for evaluation, and the evaluation tasks (True: run the evaluation; False (default): do not run)
       .set_tasks(
           eval_model=eval_model,
           correctness=True,
           hallucination=True,
           toxicity=True,
           groundtruth=True,
           relevance=True,
           outputtone=True)
       # [required] the chat_app to be evaluated and how to run it; the system will pass "question" as a parameter
       .set_chat_app_run(chat_app, run_method)
       # [required] the model used in chat_app (str), used when logging to Metric_store for model comparison
       .set_chat_app_model("gpt-4")
       # [required] project name, used for the Metric_store table; the name should be listed in constants.PROJECT_NAMES — for a new project, add the name to constants.PROJECT_NAMES first
       .set_project_name(PROJECT_NAMES["DOC_GPT"])
       # [optional] for a local run, how long (in seconds) to keep the job alive for checking results; default: 0
       .set_job_save_seconds_for_local_run(1000)
       .build()
       )
job.submit()

This is the error I am facing; I tried adding and removing the port.
No.
If I set the remote URL, the traces are not sent from any of the places; but it works fine with localhost.
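To separate a networking problem from a configuration problem, it may help to validate the URL shape before involving Phoenix at all. `check_endpoint` below is a hypothetical helper (once the URL looks right, a plain `curl` against it from inside the pod would confirm reachability):

```python
from urllib.parse import urlparse

def check_endpoint(url):
    # Hypothetical sanity check: the collector endpoint is assumed to be a
    # bare base URL (scheme + host + optional port), with no /v1/traces path.
    parsed = urlparse(url)
    problems = []
    if parsed.scheme not in ("http", "https"):
        problems.append("missing or unexpected scheme")
    if parsed.path not in ("", "/"):
        problems.append(f"unexpected path {parsed.path!r}")
    return problems

print(check_endpoint("https://arize.zscaler.site"))                              # no problems
print(check_endpoint("http://arize-ui.default.svc.cluster.local:80/v1/traces"))  # flags the path
```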
