Continual Visual and Verbal Learning Through a Child's Egocentric Input 文章

ArXiv CS.CV2026-06-04NEWSen作者: Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

摘要

arXiv:2606.05115v1 Announce Type: new Abstract: Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone.

Continual Visual and Verbal Learning Through a Child's Egocentric Input 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (4)

相关技术