There is an interesting extension to this experiment that may be fun to explore:
What if you let DeepSeek continue generating its final output after the reasoning stage, while a cheaper model simultaneously generates its own output from that same reasoning, and then you eval the two by comparing their final outputs?
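A minimal sketch of how that comparison could be wired up, assuming an OpenAI-compatible endpoint where DeepSeek's reasoner model returns its chain of thought in a `reasoning_content` field (the helper names, prompt wording, and the exact-match eval are placeholders, not a definitive harness):

```python
import json
import urllib.request

API_URL = "https://api.deepseek.com/chat/completions"  # OpenAI-compatible endpoint (assumed)
API_KEY = "sk-..."  # placeholder key

def _chat(model: str, messages: list) -> dict:
    """POST a chat completion request and return the first message dict."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]

def reason_and_answer(prompt: str):
    """Get both the reasoning trace and DeepSeek's own final answer."""
    msg = _chat("deepseek-reasoner", [{"role": "user", "content": prompt}])
    return msg["reasoning_content"], msg["content"]

def answer_from_reasoning(prompt: str, reasoning: str) -> str:
    """Have a cheaper model write the final answer from the borrowed reasoning."""
    msg = _chat("deepseek-chat", [{  # hypothetical choice of cheaper model
        "role": "user",
        "content": f"{prompt}\n\nReasoning to use:\n{reasoning}\n\n"
                   "Using only the reasoning above, state the final answer.",
    }])
    return msg["content"]

def outputs_agree(a: str, b: str) -> bool:
    """Crude eval: case/whitespace-normalized exact match.
    An LLM judge could replace this for free-form answers."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(a) == norm(b)
```

The interesting metric would be the agreement rate of `outputs_agree` across a benchmark set: high agreement would suggest most of the value lives in the reasoning trace rather than in the model that writes the final answer.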