If we modify the prompt by repeating it twice in the same context window, the results change drastically. Very strange.
Claude now predicts the "relevant" class a lot more often.
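
For reference, the 2x modification can be sketched roughly as below. This is only a minimal illustration: the helper name `repeat_prompt`, the blank-line separator, and the example prompt text are all assumptions, not the exact setup used in these tests.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return `prompt` repeated `times` times, joined by `separator`.

    Hypothetical helper: the separator between copies is an assumption.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


# Hypothetical prompt text, for illustration only.
base_prompt = "Classify the following document as relevant or not relevant."
doubled_prompt = repeat_prompt(base_prompt)
print(doubled_prompt)
```

The doubled string would then be sent as a single message, so both copies sit in the same context window.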
🧪 Test results with the 2x prompt modification: 0.74 F1 (much better), but only 0.59 precision.
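
As a sanity check, the recall implied by these two numbers can be backed out from the F1 definition, F1 = 2PR / (P + R). The sketch below assumes both scores are the binary metrics for the "relevant" class; since the reported values are rounded, the result is only approximate. The function name `implied_recall` is made up for this example.

```python
def implied_recall(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


# Values reported above (rounded to two decimals).
recall = implied_recall(f1=0.74, precision=0.59)
print(f"implied recall: {recall:.2f}")  # prints "implied recall: 0.99"
```

A recall this close to 1 alongside 0.59 precision fits the observation that the "relevant" class is now predicted much more often: almost every relevant item is caught, at the cost of many false positives.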