Hi! We're self-hosting a multimodal LLM on vLLM and are looking for an open-source LLM observability tool. Does Phoenix support multimodal image + text inputs, and can you hook it up to self-hosted LLMs?
Hi Tom, thanks for your interest in Phoenix! We currently do not support either evals or tracing for non-text inputs. While we do support self-hosted LLMs for running evals, we do not have auto-instrumentors for tracing them unless you are using them with either LangChain or LlamaIndex as an orchestration framework.
If you're using vLLM as an OpenAI-compatible server, you can try our OpenAI instrumentor on the openai client and see whether it captures the HTTP payloads the way you want. If not, we can enhance it for you.
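A minimal sketch of the setup Roger describes, assuming the openinference OpenAI instrumentor and a Phoenix instance are installed and running; the base URL, project name, and model name are placeholders, not details from this thread:

```python
# Point the standard OpenAI client at the self-hosted vLLM server,
# then let Phoenix's OpenAI instrumentor capture the payloads as traces.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

# Register a tracer provider that exports spans to a running Phoenix instance.
tracer_provider = register(project_name="vlm-classifier")  # placeholder project name
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# base_url is the vLLM OpenAI-compatible endpoint (placeholder address);
# vLLM typically ignores the API key, so any non-empty string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="my-llava-model",  # placeholder model name
    messages=[{"role": "user", "content": "Describe this image."}],
)
```

Because the instrumentor hooks the openai client itself, requests made through httpx directly would not be captured this way.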
Thanks Roger, I'll try this out 👍! Phoenix looks amazing for our use case and would be my first choice, but it's really important for us to be able to view the image inputs as well - is there anything I can do to bump up adding support for this on your roadmap? How much work would this be?
Hey Tom M. - good to see ya! I think we have to dogfood some of these multimodal use cases and see what the payload structure is. If it's base64-encoded small images, we may be able to add a way to opt in to capturing this data. Tom, which multimodal LLM are you using? Are you self-hosting? If so, are you using a Python client? I ask because our observability works on top of existing clients, so we'd probably have to scope it with regard to the tech stack you are using. Feel free to file a ticket so you can follow along as we scope it!
Hey again Mikyo! It's another team in my company working on this that I'm supporting, but from what I can see: yep, it's base64-encoded small images. We're using a customised LLaVA model that we've self-hosted with vLLM on Kubernetes. We're using a Python client - just plain httpx to send the requests; we aren't using any frameworks at the moment (although I think we're open to choosing one). I'll check in with the team to confirm and put this into a ticket, thanks!
Great! So the rest of our tech stack, other than python:
Database: Postgres
Monitoring and Logging: Datadog, Grafana, Prometheus
Containerization and Orchestration: Docker and Kubernetes (EKS)
Infrastructure: Hosted on AWS (use s3 for storage too), SQS for message queues
There are some other components, but nothing else that immediately comes to mind as relevant - let me know if there's anything else you're wondering about that I might have missed!
So with Phoenix we'd look to make use of the latest persistence features in version 4.0+, probably with Postgres. And we'd look to slot the tracing into our production API for classifying images for different types of harmful content, which we send to our visual LLM.
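For reference, Phoenix 4.x can be pointed at Postgres for persistence via an environment variable; a hedged sketch of the deployment config, with hostname and credentials as placeholders:

```shell
# Use Postgres instead of the default SQLite for Phoenix persistence
# (connection string values are placeholders).
export PHOENIX_SQL_DATABASE_URL="postgresql://phoenix:password@postgres-host:5432/phoenix"
python -m phoenix.server.main serve
```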
I think the main area we'd need to understand is what inference looks like for this visual LLM - e.g. what the IO payload is and how the visual image elements are sent / referenced (base64-encoded images? URLs?). Capturing the labeling would probably not be too difficult on our end. Sounds like you would be using custom instrumentation then, since it sounds like you are not using any wrappers to call your LLM. Do you have an evaluation strategy in place for this visual classifier?
We've self-hosted our model on vLLM; its API is OpenAI-compatible. Here's our client for reference - we send images as base64-encoded, and we're not using any wrappers, but we can easily use one if it makes things easier:
import base64

import filetype
import httpx


class VLMClient:
    def __init__(self, vlm_model: str = VLM_MODEL, vllm_url: str = VLLM_URL):
        self._vlm_model = vlm_model
        self._vllm_client = httpx.AsyncClient(base_url=vllm_url)
        if VLLM_HEALTHCHECK:
            wait_for_ready(
                server_url=vllm_url,
                wait_seconds=VLLM_READY_TIMEOUT,
                health_endpoint="health",
            )

    @property
    def vlm_model(self) -> str:
        return self._vlm_model

    async def __call__(
        self,
        prompt: str,
        image_bytes: bytes | None = None,
        image_filetype: filetype.Type | None = None,
        max_tokens: int = 10,
    ) -> str:
        # Assemble the message content
        message_content: list[dict[str, str | dict]] = [
            {
                "type": "text",
                "text": prompt,
            }
        ]
        if image_bytes is not None:
            if image_filetype is None:
                image_filetype = filetype.guess(image_bytes)
                if image_filetype is None:
                    raise ValueError("Could not determine image filetype")
            if image_filetype not in ALLOWED_IMAGE_TYPES:
                raise ValueError(
                    f"Image type {image_filetype} is not supported. "
                    f"Allowed types: {ALLOWED_IMAGE_TYPES}"
                )
            image_b64 = base64.b64encode(image_bytes).decode("utf-8")
            message_content.append(
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{image_filetype.mime};base64,{image_b64}",
                    },
                }
            )
        # Put together the request payload
        payload = {
            "model": self.vlm_model,
            "messages": [{"role": "user", "content": message_content}],
            "max_tokens": max_tokens,
            # "logprobs": True,
            # "top_logprobs": 1,
        }
        response = await self._vllm_client.post("/v1/chat/completions", json=payload)
        response = response.json()
        response_text: str = (
            response.get("choices")[0].get("message", {}).get("content", "").strip()
        )
        return response_text

Re. evaluation strategy, I just asked our team:
Measuring accuracy on known eval sets with wide coverage of policy areas, including edge cases, and reviewing mistakes to look for patterns
Also interested in using LLMs as judges, for example using GPT to say which model output is best
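The first part of that strategy - accuracy on labeled eval sets, broken down by policy area - can be sketched in plain Python; the function name, record shape, and labels here are illustrative, not from the team's actual pipeline:

```python
from collections import defaultdict


def accuracy_by_policy_area(records):
    """records: iterable of (policy_area, expected_label, predicted_label)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for area, expected, predicted in records:
        totals[area] += 1
        if predicted == expected:
            correct[area] += 1
    # Per-area accuracy makes it easy to spot which policy areas the
    # visual classifier struggles with.
    return {area: correct[area] / totals[area] for area in totals}


# Illustrative eval records for a harmful-content classifier.
records = [
    ("violence", "unsafe", "unsafe"),
    ("violence", "safe", "unsafe"),
    ("self-harm", "unsafe", "unsafe"),
    ("self-harm", "safe", "safe"),
]
scores = accuracy_by_policy_area(records)
# scores["violence"] == 0.5, scores["self-harm"] == 1.0
```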
That's exciting Tom M., it means our OpenAI tracing will most likely work for you if you use it. Here's the ticket for the image message parsing: https://github.com/Arize-ai/openinference/issues/495
Great, thanks Mikyo!!
