Measuring Performance Improvement in Fine-Tuning: Key Insights
My challenge to you is: how are you measuring the improved performance of your fine-tune? Is it on questions that are the same as those in the fine-tuning data set? Or do you expect it to answer new questions about your data that are not in the fine-tuning set, and if so, how does it get that information? Fine-tuning is typically not the same as the original training runs. Most methods freeze weights, and many use LoRA (Low-Rank Adaptation), which tunes a much more limited set of parameters. I wouldn't assume the same ability to pick up new information as the original training runs. It might work, but I would be very skeptical depending on how much information you are trying to impart - really test it.

If you are having issues with retrieval, check whether the retrieved chunks even have a chance of answering the question. You can use RAG evals per chunk - we will have an example notebook on this next week. If all chunks are irrelevant, there are a lot of options for fixing the retrieval step so the LLM has a good chance of answering. If the chunks do not contain the answer, your LLM not only has no chance, it is more likely to hallucinate.
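As a rough illustration of a per-chunk retrieval check, here is a minimal sketch. A real RAG eval would usually use an LLM judge or an embedding-similarity score; the token-overlap heuristic, the helper names, and the 0.3 threshold below are all illustrative assumptions, not the method from the upcoming notebook.

```python
import string

def _tokens(text: str) -> set:
    # Lowercase and strip surrounding punctuation; a crude tokenizer
    # standing in for a proper relevance judge.
    return {w.strip(string.punctuation) for w in text.lower().split()}

def chunk_relevance(question: str, chunk: str) -> float:
    """Fraction of question tokens that also appear in the chunk (0.0-1.0)."""
    q = _tokens(question)
    if not q:
        return 0.0
    return len(q & _tokens(chunk)) / len(q)

def evaluate_retrieval(question: str, chunks: list, threshold: float = 0.3):
    """Score each retrieved chunk and report whether any clears the bar."""
    scores = [chunk_relevance(question, c) for c in chunks]
    return scores, any(s >= threshold for s in scores)

question = "What year was the warranty policy updated?"
chunks = [
    "Our warranty policy was updated in 2021 to cover accidental damage.",
    "Shipping is free on orders over $50.",
]
scores, answerable = evaluate_retrieval(question, chunks)
```

If `answerable` comes back `False` for most questions in your test set, the problem is upstream of the LLM: fix chunking, embeddings, or the retriever before touching the generation step.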
