Evaluating GPT-4 Answers for Document Chunking Strategies

·Sep 09, 2024 10:07 PM

Hello, I'm working on a Q&A system where a document is split into chunks using multiple text-splitting techniques, and I generate answers with GPT-4 based on these splits. My goal is to evaluate whether the generated answer for each splitter comes from the correct chunk (where the ground-truth answer is located). First, I divide the document into chunks using different splitting strategies. Then, for each question, I have a chunk (splitter_chunk_ground_truth) that contains the correct answer from the document. Finally, GPT-4 generates an answer for each splitting strategy. As part of evaluating the effectiveness of each splitter, I want to check whether the answer GPT-4 generated for that splitter comes from the correct chunk. Does anyone know an approach that would help solve this problem?

4 comments


Jason
·
Mahi If I understand you correctly, you have a retrieval step and a specific chunk that is considered correct. Say it was in the 3rd position: retrieval returns [chunk1 chunk2 chunk3] and the third is the correct one, so [0 0 1] in terms of search relevance. You are doing a chat.completion generation with Chunk1, Chunk2, and Chunk3 in the context window to generate an answer: Gen(Chunk1 Chunk2 Chunk3) -> Agen. You are trying to make sure Agen comes from Chunk 3?

Jason

You could consider an Eval template for an LLM Eval:

You are given a list of text chunks, and the task was to generate an answer based on one of them. Ensure that the answer comes from the correct chunk and evaluate whether the answer corresponds to the expected chunk.

The list of chunks is provided as follows:

Chunks:
{chunks}

Your task is to check if the following generated answer comes from the correct chunk.

Generated Answer:
{generated_answer}

Correct Chunk Index:
{correct_chunk_index}

Now, determine if the generated answer comes from the correct chunk. You should compare the generated answer to the text of the chunk at the specified index and provide the evaluation. 

Your response must be a single word, either "relevant" or "irrelevant", and should not contain any text or characters aside from that. The string "relevant" means that the Generated Answer is derived from the Correct Chunk. The string "irrelevant" means that the Generated Answer is not derived from the Correct Chunk.
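
As a rough sketch of how the template above could be wired up: the snippet below fills in the placeholders and parses the one-word verdict. The template string is abbreviated, and `llm` is a hypothetical callable (prompt in, text out) standing in for whatever client you use; none of these helper names come from a specific library.

```python
from typing import Callable, Optional, Sequence

# Abbreviated form of the eval template; placeholder names match the original.
EVAL_TEMPLATE = (
    "Chunks:\n{chunks}\n\n"
    "Generated Answer:\n{generated_answer}\n\n"
    "Correct Chunk Index:\n{correct_chunk_index}\n\n"
    'Respond with a single word, either "relevant" or "irrelevant".'
)


def build_eval_prompt(
    chunks: Sequence[str], generated_answer: str, correct_chunk_index: int
) -> str:
    """Fill the template, prefixing each chunk with its 0-based index."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return EVAL_TEMPLATE.format(
        chunks=numbered,
        generated_answer=generated_answer,
        correct_chunk_index=correct_chunk_index,
    )


def parse_label(raw: str) -> Optional[str]:
    """Normalize the model output; return None if it is not a valid label."""
    label = raw.strip().strip('".').lower()
    return label if label in {"relevant", "irrelevant"} else None


def evaluate(
    llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    chunks: Sequence[str],
    generated_answer: str,
    correct_chunk_index: int,
) -> Optional[str]:
    """Run one eval and return 'relevant', 'irrelevant', or None."""
    prompt = build_eval_prompt(chunks, generated_answer, correct_chunk_index)
    return parse_label(llm(prompt))
```

Returning None for anything other than the two labels makes it easy to count and inspect malformed judge outputs separately instead of silently treating them as "irrelevant".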

Jason
·
Also, depending on whether you are testing pre-production with ground truth, you might also be able to use traditional search metrics such as MRR (mean reciprocal rank).
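
For reference, a minimal sketch of MRR over binary relevance lists like the [0 0 1] example earlier in the thread, assuming one list per question with a 1 wherever a retrieved chunk is the ground-truth chunk. The function names are illustrative only.

```python
from typing import Sequence


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[int]]) -> float:
    """Average the reciprocal rank over all questions (0.0 for no questions)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(run) for run in runs) / len(runs)


# One relevance list per question, for a single splitter.
splitter_runs = [[0, 0, 1], [1, 0, 0]]
print(mean_reciprocal_rank(splitter_runs))
```

Computing this per splitter gives one number per strategy that reflects how high the ground-truth chunk tends to rank, independently of what GPT-4 does with the context.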
Mahi
·
Thank you for your reply, Jason. For now, I will manually determine which chunk is the correct one. This is also for research purposes, so that's okay. But what if I skip this step? Is there a way to figure it out without specifying the correct chunk? And what if multiple chunks each contain part of the answer?