TOTEN: A Knowledge-Based System for Structure-Preserving Representation of Physical Quantities and Technical Notation in Brazilian Portuguese
Details
- Source site
- arXiv cs.CL
- Authors
- Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa
- Article type
- PAPER
- Language
- en
- Publication date
- 2026-06-25
Abstract
arXiv:2606.19626v2. Announce type: replace-cross. AI pipelines that reason quantitatively over technical text depend on input in which physical quantities, numbers, units, and symbolic expressions arrive intact; when these entities fragment at tokenization, errors propagate downstream. Byte-Pair Encoding, optimized for vocabulary compression, is blind to such entities and fragments them into arbitrary subwords, a problem aggravated in technical Brazilian Portuguese. We present TOTEN, a knowledge-based system whose input representation preserves each technical entity as a whole, typed unit: vocabulary is not derived statistically but classified declaratively under a formal ontology of engineering entities (OEE). The core is a triple: types, principles, and invariants; a classifier mapping raw text into typed regions; and instantiators yielding a self-descriptive representation.
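The idea of classifying raw text into typed regions can be illustrated with a minimal Python sketch. This is not TOTEN's implementation: the abstract does not describe the OEE types, the classifier, or the instantiators in detail, so the `UNITS` list, the `Region` record, and the `classify` and `parse_number` functions below are all hypothetical names chosen for illustration. The sketch assumes the Brazilian Portuguese number convention ("." groups thousands, "," marks decimals) and keeps each quantity as one whole, typed unit instead of letting it split into subwords.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, hand-declared unit inventory (an assumption for this sketch;
# the real OEE vocabulary is not given in the abstract).
UNITS = ["kPa", "MPa", "m/s", "kg", "km", "mm", "cm", "°C", "kW", "m", "V", "A"]

# Longest units first, so that "m/s" is tried before "m".
_UNIT_ALT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# Brazilian Portuguese numbers: "." groups thousands, "," is the decimal mark.
_NUMBER = r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?"

QUANTITY_RE = re.compile(
    rf"(?P<number>{_NUMBER})\s?(?P<unit>{_UNIT_ALT})(?![A-Za-z])"
)


@dataclass(frozen=True)
class Region:
    kind: str                      # "QUANTITY" or "TEXT"
    text: str                      # exact source span, kept verbatim
    value: Optional[float] = None  # parsed magnitude for QUANTITY regions
    unit: Optional[str] = None     # unit symbol for QUANTITY regions


def parse_number(s: str) -> float:
    """Convert a pt-BR formatted number such as '1.250,5' to a float."""
    return float(s.replace(".", "").replace(",", "."))


def classify(text: str) -> List[Region]:
    """Split text into TEXT and QUANTITY regions that cover it exactly."""
    regions: List[Region] = []
    pos = 0
    for m in QUANTITY_RE.finditer(text):
        if m.start() > pos:
            regions.append(Region("TEXT", text[pos:m.start()]))
        regions.append(
            Region("QUANTITY", m.group(0),
                   parse_number(m.group("number")), m.group("unit"))
        )
        pos = m.end()
    if pos < len(text):
        regions.append(Region("TEXT", text[pos:]))
    return regions
```

Calling `classify("A bomba opera a 1.250,5 kPa e 12,5 m/s.")` yields five regions, two of them typed as quantities, and concatenating the `text` fields reproduces the input unchanged. That lossless-cover property is one simple example of the kind of invariant a structure-preserving representation might enforce; whether TOTEN defines its invariants this way is not stated in the abstract.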