Hi guys, let me give some context. We use multimodal models to generate product tags from both text and images. Here is one of our blog posts about this project; it gives more context on what we do and why. The screenshot shows an image belonging to a given product class and how we extract tags like color, style, or theme.
As input we pass an image and a text prompt; as output we return a list of tag values. So the output is the same format we discussed with you before.
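To make the output contract concrete, here's a minimal sketch of how a model response could be parsed into tag values. This is just an illustration: the JSON schema and the tag names (`color`, `style`, `theme` from the screenshot) are my assumption about the format, not necessarily exactly what we return.

```python
import json

def parse_tags(model_response: str) -> dict[str, list[str]]:
    """Parse a model's JSON response into {tag category: [values]}.

    Assumes the prompt asks the model to answer with a JSON object like
    {"color": ["red"], "style": ["casual"], "theme": ["summer"]}.
    Tag names here are hypothetical examples, not the real schema.
    """
    tags = json.loads(model_response)
    # keep only non-empty string values per category
    return {
        category: [v for v in values if isinstance(v, str) and v]
        for category, values in tags.items()
    }

response = '{"color": ["red", "white"], "style": ["casual"], "theme": []}'
print(parse_tags(response))
```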
We are now developing automated evaluation pipelines. For now they are meant to run offline, to test different prompts before shipping them to prod. I was wondering whether I can use arize-phoenix for Gemini vision models 🙂
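Independent of whether arize-phoenix covers Gemini vision, the offline comparison itself could be as simple as set-based tag scoring per product. A minimal sketch (the metric choice and the sample tags are my assumptions, not something we've agreed on):

```python
def tag_metrics(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 for one product's predicted tags."""
    tp = len(predicted & gold)  # tags the prompt got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# compare two hypothetical prompt variants against the same gold labels
gold = {"red", "casual", "summer"}
prompt_a = {"red", "casual"}           # misses "summer"
prompt_b = {"red", "casual", "beach"}  # adds one wrong tag
print(tag_metrics(prompt_a, gold))
print(tag_metrics(prompt_b, gold))
```

Averaging these over a labeled sample would give a single score per prompt, which is all the offline comparison needs before prod.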