Hello everyone, I’m Devvrat Bhardwaj, a backend engineer at AIMon Labs, where we build low-latency, highly accurate evaluation models that provide deterministic assessments for LLMs—focused on hallucination detection, instruction-following evaluation, context relevance, and other core quality dimensions.

I’ve been exploring Arize Phoenix and am interested in integrating AIMon’s evaluation models as custom evaluators. I’m here to learn how others are using Phoenix for evaluation and to see where I might be able to contribute to the codebase. Here’s a quick overview of a couple of models we think could bring value to the Phoenix ecosystem:
HDM-2 (Hallucination Detection Model): Outperforms competing models, and even leading LLMs, on multiple hallucination detection benchmarks, and is optimized for speed and reliability. A paper describing the model is available.
AIMon’s Instruction Following Evaluation Model: A robust alternative to heuristic-based evaluation, this model performs at a level comparable to large models like GPT-4o and o3-mini, while being far more efficient in both cost and latency.
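To make the integration idea concrete, here is a minimal sketch of how an external evaluation model could be wrapped as a dataframe-level evaluator, returning the label/score columns that Phoenix-style evaluators typically produce. The `detect_hallucination` function is a hypothetical stand-in for an AIMon API call (a keyword-containment heuristic, purely for illustration); the real HDM-2 client would go in its place.

```python
import pandas as pd

def detect_hallucination(context: str, response: str) -> float:
    # Hypothetical stand-in for an AIMon HDM-2 call: returns a
    # hallucination score in [0, 1]. A real integration would invoke
    # the AIMon client here instead of this containment check.
    return 0.0 if response in context else 1.0

def hallucination_eval(df: pd.DataFrame) -> pd.DataFrame:
    """Score each (context, response) row and return the score/label
    columns that Phoenix-style dataframe evaluators typically emit."""
    scores = [
        detect_hallucination(row["context"], row["response"])
        for _, row in df.iterrows()
    ]
    return pd.DataFrame({
        "score": scores,
        "label": ["hallucinated" if s > 0.5 else "factual" for s in scores],
    })

df = pd.DataFrame({
    "context": ["Paris is the capital of France."],
    "response": ["Paris is the capital of France."],
})
print(hallucination_eval(df)["label"].tolist())  # → ['factual']
```

The dataframe-in, dataframe-out shape keeps the wrapper agnostic to how the rows were collected, so the same evaluator could run over traced spans or an offline eval set.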
Thank you :)
