Multi-modal video data-pipelines for machine learning with minimal human supervision

ArXiv cs.CV, 2026-05-26. Authors: Mihai-Cristian Pîrvu, Marius Leordeanu

Abstract

arXiv:2510.14862v2. The real world is inherently multi-modal. Our tools observe it and take digital snapshots of it, such as videos or sounds, but much of it is lost in the process. Similarly, for actions and information passed between humans, languages serve as a written form of communication. Traditionally, machine learning models have been unimodal (e.g. rgb -> semantic or text -> sentiment_class). Recent trends move towards bi-modality, where images and text are learned together; however, to truly understand the world, we need to integrate all of these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. To do this, we run a fully autonomous data-pipeline on top of raw videos, using pre-trained experts and procedural combinations between them. We also open-source the pipeline.
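The idea of running pre-trained experts on raw frames and then deriving further modalities from their outputs by fixed rules can be illustrated with a minimal sketch. This is not the authors' open-sourced pipeline: the `Representation` class, `run_pipeline`, and the two "expert" functions are hypothetical names invented here, and the experts are trivial stand-ins rather than real networks. The sketch only shows how each modality can declare its dependencies so that they are computed first, with no human labels involved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Frame = List[List[Tuple[int, int, int]]]  # rows of (r, g, b) pixels


@dataclass
class Representation:
    """One output modality, computed from the raw frame plus its dependencies."""
    name: str
    compute: Callable[[Frame, Dict[str, object]], object]
    deps: List[str] = field(default_factory=list)


def fake_depth_expert(frame, _deps):
    # Stand-in for a pre-trained depth network (assumption): brightness as fake depth.
    return [[sum(px) / (3 * 255) for px in row] for row in frame]


def fake_semantic_expert(frame, _deps):
    # Stand-in for a pre-trained segmentation network (assumption):
    # label each pixel by its dominant colour channel.
    labels = ("red", "green", "blue")
    return [[labels[px.index(max(px))] for px in row] for row in frame]


def near_mask(_frame, deps):
    # Procedural combination: a fixed rule over expert outputs, no learning.
    return [[d > 0.5 and s != "blue" for d, s in zip(d_row, s_row)]
            for d_row, s_row in zip(deps["depth"], deps["semantic"])]


def run_pipeline(frame, reps):
    """Compute every representation for one frame, dependencies first."""
    by_name = {r.name: r for r in reps}
    done = {}

    def resolve(name, stack=()):
        if name in done:
            return done[name]
        if name in stack:
            raise ValueError(f"cyclic dependency at {name!r}")
        rep = by_name[name]
        inputs = {d: resolve(d, stack + (name,)) for d in rep.deps}
        done[name] = rep.compute(frame, inputs)
        return done[name]

    for r in reps:
        resolve(r.name)
    return done


# Declaration order does not matter; dependencies are resolved on demand.
REPRESENTATIONS = [
    Representation("near_mask", near_mask, deps=["depth", "semantic"]),
    Representation("depth", fake_depth_expert),
    Representation("semantic", fake_semantic_expert),
]

frame = [[(255, 200, 200), (0, 0, 40)],
         [(10, 250, 10), (250, 250, 255)]]
outputs = run_pipeline(frame, REPRESENTATIONS)
print(outputs["near_mask"])  # [[True, False], [False, False]]
```

In a real setting each stand-in would wrap an actual pre-trained model and the loop would run over every frame of a video; the dependency-first ordering is the only part this sketch is meant to convey.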
