Building Custom Evaluators with an LLM in Code
Hi team, I want to build a custom evaluation that uses a custom prompt as well. Can I add that to the code here?
(
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    QAEvaluator,
    SummarizationEvaluator,
) = map(
    lambda args: _create_llm_evaluator_subclass(*args),
    (
        (
            "HallucinationEvaluator",
            EvalCriteria.HALLUCINATION.value,
            'Leverages an LLM to evaluate whether a response (stored under an "output" column) is a hallucination given a query (stored under an "input" column) and one or more retrieved documents (stored under a "reference" column).',  # noqa: E501
        ),
        (
            "RelevanceEvaluator",
            EvalCriteria.RELEVANCE.value,
            'Leverages an LLM to evaluate whether a retrieved document (stored under a "reference" column) is relevant or irrelevant to the corresponding query (stored under the "input" column).',  # noqa: E501
        ),
        (
            "ToxicityEvaluator",
            EvalCriteria.TOXICITY.value,
            'Leverages an LLM to evaluate whether the string stored under the "input" column contains racist, sexist, chauvinistic, biased, or otherwise toxic content.',  # noqa: E501
        ),
        (
            "QAEvaluator",
            EvalCriteria.QA.value,
            'Leverages an LLM to evaluate whether a response (stored under an "output" column) is correct or incorrect given a query (stored under an "input" column) and one or more retrieved documents (stored under a "reference" column).',  # noqa: E501
        ),
        (
            "SummarizationEvaluator",
            EvalCriteria.SUMMARIZATION.value,
            'Leverages an LLM to evaluate whether a summary (stored under an "output" column) provides an accurate synopsis of an input document (stored under an "input" column).',  # noqa: E501
        ),
    ),
)
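The factory above only binds a class name, an evaluation-criteria template, and a docstring to a generic LLM-evaluator subclass, so in principle a custom evaluator with your own prompt can be produced the same way. Below is a minimal, self-contained sketch of that pattern; note that `LLMEvaluator`, `_make_evaluator`, and `CONCISENESS_TEMPLATE` here are hypothetical stand-ins, not the library's actual base class or API, so check the real signature of `_create_llm_evaluator_subclass` before relying on it.

```python
# Self-contained sketch of the subclass-factory pattern shown above.
# LLMEvaluator, _make_evaluator, and CONCISENESS_TEMPLATE are hypothetical
# stand-ins for illustration only.


class LLMEvaluator:
    """Minimal stand-in base: holds a prompt template and renders it."""

    template: str = ""

    def render_prompt(self, **columns: str) -> str:
        # Fill the template's placeholders from dataframe-style column values.
        return self.template.format(**columns)


def _make_evaluator(name: str, template: str, docstring: str) -> type:
    # Mirror _create_llm_evaluator_subclass: dynamically build a subclass
    # that carries its own prompt template and docstring.
    return type(name, (LLMEvaluator,), {"template": template, "__doc__": docstring})


# A custom prompt for a new evaluation criterion.
CONCISENESS_TEMPLATE = (
    "Is the following response concise? Answer 'concise' or 'verbose'.\n"
    "Response: {output}"
)

ConcisenessEvaluator = _make_evaluator(
    "ConcisenessEvaluator",
    CONCISENESS_TEMPLATE,
    'Uses an LLM to judge whether a response (stored under an "output" column) is concise.',
)

# The dynamically created class renders prompts like any hand-written subclass.
prompt = ConcisenessEvaluator().render_prompt(output="It works.")
```

If the library exposes its factory (or the base class it subclasses) publicly, passing your own name and prompt template through the same call site should yield an evaluator that sits alongside the built-in five.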