Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese

ArXiv CS.CL | 2026-06-25 | PAPER | en | Authors: Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa

Details

Source site
ArXiv CS.CL
Authors
Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa
Article type
PAPER
Language
en
Publication date
2026-06-25

Abstract

arXiv:2606.19626v2, announce type: replace-cross.

AI pipelines that reason quantitatively over technical text depend on input in which physical quantities, numbers, units, and symbolic expressions arrive intact; when these entities fragment at tokenization, the errors propagate downstream. Byte-Pair Encoding, which is optimized for vocabulary compression, is blind to such entities and splits them into arbitrary subwords, a problem aggravated in technical Brazilian Portuguese. We present TOTEN, a knowledge-based system whose input representation preserves each technical entity as a whole, typed unit: vocabulary is not derived statistically but classified declaratively under a formal ontology of engineering entities (OEE). The core is a triple: types, principles, and invariants; a classifier mapping raw text into typed regions; and instantiators yielding a self-descriptive representation.
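To make the idea of "a classifier mapping raw text into typed regions" concrete, here is a minimal sketch. It is not TOTEN's classifier: the names `Region`, `classify`, `UNITS`, and `QUANTITY` are hypothetical, the unit list is a tiny illustrative sample, and a single regular expression stands in for the declarative ontology the abstract describes. It only shows how a quantity written in Brazilian Portuguese notation (decimal comma, dot as thousands separator) could be kept as one typed region instead of being split.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumption: a tiny illustrative unit list, not the paper's ontology.
UNITS = ["m/s²", "km/h", "kPa", "kg", "°C", "mm", "cm", "km", "m", "V", "A", "N"]

# Longest units first so that "km/h" is tried before "km" and "m".
_unit_alt = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# Number in pt-BR notation (e.g. 9,81 or 1.200,5), optional space, then a unit.
QUANTITY = re.compile(
    rf"(?<![\w.,])\d+(?:\.\d{{3}})*(?:,\d+)?\s?(?:{_unit_alt})(?![A-Za-z])"
)


@dataclass(frozen=True)
class Region:
    kind: str  # "QUANTITY" or "TEXT" (hypothetical type labels)
    text: str  # exact substring of the input


def classify(raw: str) -> list[Region]:
    """Split raw text into typed regions that concatenate back to the input."""
    regions: list[Region] = []
    pos = 0
    for match in QUANTITY.finditer(raw):
        if match.start() > pos:
            regions.append(Region("TEXT", raw[pos:match.start()]))
        regions.append(Region("QUANTITY", match.group(0)))
        pos = match.end()
    if pos < len(raw):
        regions.append(Region("TEXT", raw[pos:]))
    return regions


if __name__ == "__main__":
    for region in classify("A aceleração é 9,81 m/s² e a pressão 101,3 kPa."):
        print(region.kind, repr(region.text))
```

In this sketch, each quantity such as `9,81 m/s²` comes back as a single `QUANTITY` region, and joining the region texts reproduces the input exactly, so nothing is lost or reordered. A real system would presumably need far richer types and rules than one pattern can express.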
