OpenAI's Multimodal Model: Evaluation and Key Takeaways
Here's a write-up from OpenAI on the evaluation and red-teaming of their new multimodal model. Some takeaways:
Multimodality opens the model up to a new set of jailbreaks and attack vectors, such as embedding text in the image or combining the text prompt with images in ways that circumvent the model's ability to refuse harmful requests. OpenAI describes mitigations at both the system and model level; for example, they mention using OCR to extract text from input images and evaluating the extracted text for harmful intent.
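A minimal sketch of that system-level mitigation, assuming the OCR step has already run (the blocklist phrases, function names, and the simple substring check are all illustrative stand-ins, not OpenAI's actual pipeline):

```python
# Hypothetical sketch: screen both the user's text prompt and any
# OCR-extracted text from the input image, so that text hidden inside
# the image cannot bypass a text-only safety filter. A real system
# would use an OCR engine (e.g. Tesseract) and a learned classifier
# rather than this toy blocklist.

BLOCKLIST = {"how to build a weapon", "bypass safety"}  # illustrative only

def looks_harmful(text: str, blocklist=BLOCKLIST) -> bool:
    """Return True if the text appears to carry harmful intent."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocklist)

def moderate_image_prompt(prompt: str, ocr_text: str) -> str:
    """Decide whether a combined text+image request should be refused."""
    # Check the prompt and the image-embedded text independently:
    # either channel alone can smuggle in a harmful request.
    if looks_harmful(prompt) or looks_harmful(ocr_text):
        return "refuse"
    return "allow"
```

For instance, `moderate_image_prompt("describe this image", "how to build a weapon please")` would return `"refuse"` even though the text prompt itself is benign.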
The system as a whole is optimized for recall in detecting and refusing problematic or harmful requests. One example is what OpenAI calls ungrounded inferences: the model's propensity to make (often biased) assumptions about people in input images. According to OpenAI's testing, the full system refuses such prompts 100% of the time, even when the model in isolation does not. That said, the system can still occasionally be induced into producing harmful or even hateful content.
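To make the recall-oriented tuning concrete, here is a toy example (all scores and labels are made up, not OpenAI's data): lowering a harm classifier's decision threshold raises recall on harmful requests at the cost of precision, i.e. more false refusals of benign requests.

```python
# Toy illustration of tuning a harmful-request classifier for recall.
# Lowering the decision threshold catches more harmful requests
# (higher recall) but also refuses more benign ones (lower precision).

def precision_recall(scores, labels, threshold):
    """Compute precision and recall for 'refuse if score >= threshold'."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

scores = [0.95, 0.80, 0.60, 0.40, 0.20]    # classifier's harm scores
labels = [True, True, True, False, False]  # ground truth: harmful?

# Strict threshold: misses one harmful request (recall 2/3, precision 1.0).
p_hi, r_hi = precision_recall(scores, labels, 0.7)
# Lenient threshold: catches every harmful request (recall 3/3),
# but also refuses one benign one (precision 3/4).
p_lo, r_lo = precision_recall(scores, labels, 0.3)
```

A recall-optimized deployment accepts the lenient threshold's extra false refusals in exchange for missing fewer genuinely harmful requests.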
Based on a pilot program, GPT-4V shows impressive potential for helping visually impaired users understand the content of images, with the caveat that it is not reliable enough for sensitive tasks such as reading prescription drug information.
It sounds like GPT-4V was trained at the same time as the language-only GPT-4 model, and is only being released now after a period of evaluation and improvement of its ability to identify and refuse harmful requests and attacks.
