Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?

Hugging Face Blog, 2024-03-05