Rethinking Molecular Text Representations for LLMs: An Empirical Study

ArXiv CS.AI, 2026-06-03. Authors: Arun Raja, Garrett M. Morris, Kian Ming A. Chai

Abstract

arXiv:2606.03057v1. Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations and eight chemical tasks. We benchmark 16 LLMs across five model families, including reasoning and non-reasoning variants, chemistry-specialized LLMs, and closed frontier models. Performance is strongly representation-dependent and no single representation wins on every task, though CML performs best overall, followed by MolJSON, InChI, and then canonical SMILES. Explicit structured text representations (CML and MolJSON) dominate structural tasks; IUPAC names dominate semantic tasks, winning molecule retrieval for all 16 LLMs; and SMILES variants are rarely optimal despite their prevalence in pretraining.
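To make the comparison concrete, here is a minimal sketch of how one molecule (ethanol) might be presented to an LLM in several text representations. It is an illustration only and not the authors' benchmark code. The `json_graph` layout is a hypothetical atom/bond schema and not the paper's MolJSON format, and `build_prompts` and `count_heavy_atoms` are made-up helper names.

```python
import json

# Ethanol in several text representations. Hydrogens are left implicit.
# ASSUMPTION: "json_graph" is a hypothetical atom/bond schema for
# illustration; it is not the MolJSON format evaluated in the paper.
ETHANOL = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "iupac": "ethanol",
    "json_graph": json.dumps({
        "atoms": [
            {"id": 0, "element": "C"},
            {"id": 1, "element": "C"},
            {"id": 2, "element": "O"},
        ],
        "bonds": [
            {"from": 0, "to": 1, "order": 1},
            {"from": 1, "to": 2, "order": 1},
        ],
    }),
}


def build_prompts(representations, question):
    """Return one prompt per representation, all asking the same question."""
    return {
        name: f"Molecule ({name}): {text}\nQuestion: {question}"
        for name, text in representations.items()
    }


def count_heavy_atoms(json_text):
    """Count atoms listed in the hypothetical JSON graph (hydrogens implicit)."""
    return len(json.loads(json_text)["atoms"])


if __name__ == "__main__":
    prompts = build_prompts(ETHANOL, "How many heavy atoms does this molecule have?")
    for name, prompt in prompts.items():
        print(f"--- {name} ---\n{prompt}\n")
    print("Reference answer:", count_heavy_atoms(ETHANOL["json_graph"]))
```

In the explicit graph form the atoms and bonds can be read off directly, whereas SMILES and InChI encode the same structure implicitly. This is one plausible reading of why the abstract reports structured representations doing well on structural tasks.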
