I'm currently trying to compute recall, precision, and F1 evals for a binary classification of our LLM outputs on a dataset, but I'm not sure what makes the most sense in Phoenix. For example, for recall I attempted to build a custom eval metric comparing the output and the expected label, but the issue is that I can't return None for examples whose ground truth label is negative (since only positives matter for recall) without causing an error. Let me know what y'all think, ty!
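For context, here's a pure-Python sketch of what I'm effectively after (the labels are made up for illustration). The point is that recall is an aggregate over the whole dataset, so if the metrics are computed once over all examples, no per-example None is needed for negative ground truth:

```python
# Hypothetical per-example binary labels: 1 = positive, 0 = negative.
expected = [1, 0, 1, 1, 0]  # ground-truth labels from the dataset
outputs = [1, 0, 0, 1, 1]   # binary judgments derived from the LLM outputs

# Count true positives, false positives, and false negatives once,
# across the whole dataset -- negatives just never contribute to recall.
tp = sum(e == 1 and o == 1 for e, o in zip(expected, outputs))
fp = sum(e == 0 and o == 1 for e, o in zip(expected, outputs))
fn = sum(e == 1 and o == 0 for e, o in zip(expected, outputs))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(precision, recall, f1)  # each 2/3 for this toy data
```

What I can't figure out is the cleanest way to map this onto Phoenix, where each eval seems to want a score per example.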