got it, so it sounds like for the task itself it's best to start with a ground-truth dataset and run experiments over it, rather than working backwards by trying to create a custom eval first.
would the custom eval for the task then be more for monitoring? and to craft the llm-as-a-judge eval itself, would we leverage the experiment/dataset workflow like i described? see the sketch below for what i mean.
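to make sure i'm describing the same thing, here's a rough sketch of that workflow: score the judge against the ground-truth set, and only promote it to a monitoring eval once its agreement rate looks good. all the names, the dataset shape, and `judge_with_llm` are made up placeholders, not any particular library's API:

```python
# hypothetical ground-truth dataset: inputs paired with expected answers
ground_truth = [
    {"input": "What is our refund window?", "expected": "30 days"},
    {"input": "Do we ship internationally?", "expected": "yes, to 40+ countries"},
]

def judge_with_llm(question: str, answer: str, expected: str) -> bool:
    """Placeholder for a real judge-model call; here a crude token-overlap check."""
    expected_tokens = set(expected.lower().split())
    answer_tokens = set(answer.lower().split())
    return len(expected_tokens & answer_tokens) / len(expected_tokens) > 0.5

def run_experiment(task_fn) -> float:
    """Run the task over the ground-truth set and measure how often the
    llm judge agrees with a simple exact-match check -- that agreement
    rate is what tells us the judge is trustworthy enough for monitoring."""
    agreements = 0
    for row in ground_truth:
        answer = task_fn(row["input"])
        exact = row["expected"].lower() in answer.lower()
        judged = judge_with_llm(row["input"], answer, row["expected"])
        agreements += int(exact == judged)
    return agreements / len(ground_truth)

if __name__ == "__main__":
    # stand-in for the real task under evaluation
    fake_task = lambda q: "we offer a 30 days refund window"
    print(f"judge/ground-truth agreement: {run_experiment(fake_task):.0%}")
```

is that roughly the idea -- validate the judge on the labeled set first, then reuse it as the monitoring eval once agreement is high enough?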