Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease

ArXiv CS.AI, 2026-06-06. Authors: Hamed Khosravi, Milad Khanchi, Mobina Noori, Srinjoy Das, Abdullah Al-Mamun, Imtiaz Ahmed

Abstract

arXiv:2403.00965v2

Only a small fraction of patients with chronic kidney disease (CKD) progress to dialysis, creating severe class imbalance that limits the performance of machine learning models for early dialysis prediction. This challenge is compounded by the binary structure of electronic health record (EHR) data, for which most existing augmentation methods were not designed. We propose Binary Gaussian Copula Synthesis (BGCS), a two-stage data augmentation method tailored to binary clinical data. BGCS first generates synthetic minority-class samples using a Gaussian copula framework that explicitly models pairwise dependencies among binary features, then applies a fine-tuned GPT-2 classifier to filter out clinically implausible samples before training.
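The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simple variant of a Gaussian copula for binary data: each feature's marginal rate sets a probit threshold, and the Pearson correlation of the binary columns stands in for the latent Gaussian correlation (an approximation; the abstract does not say how the dependencies are estimated). The function names `fit_binary_copula`, `sample_binary_copula`, and `filter_samples` are hypothetical, and `plausibility_fn` is a placeholder for the fine-tuned GPT-2 classifier, which is not reproduced here.

```python
from statistics import NormalDist

import numpy as np


def fit_binary_copula(X):
    """Estimate probit thresholds and an approximate latent correlation."""
    X = np.asarray(X, dtype=float)
    # Clip marginal rates away from 0 and 1 so the thresholds stay finite.
    p = np.clip(X.mean(axis=0), 1e-3, 1 - 1e-3)
    # Assumption: Pearson correlation of the binary columns approximates
    # the latent Gaussian correlation (this is not a tetrachoric fit).
    corr = np.nan_to_num(np.corrcoef(X, rowvar=False), nan=0.0)
    np.fill_diagonal(corr, 1.0)
    thresholds = np.array([NormalDist().inv_cdf(pj) for pj in p])
    return thresholds, corr


def sample_binary_copula(thresholds, corr, n, seed=0):
    """Stage 1: draw correlated Gaussians and threshold them to 0/1."""
    rng = np.random.default_rng(seed)
    d = len(thresholds)
    # A small diagonal jitter keeps the Cholesky factorization stable.
    chol = np.linalg.cholesky(corr + 1e-6 * np.eye(d))
    z = rng.standard_normal((n, d)) @ chol.T
    # P(z_j < threshold_j) matches the observed marginal rate of feature j.
    return (z < thresholds).astype(int)


def filter_samples(samples, plausibility_fn, cutoff=0.5):
    """Stage 2: keep rows whose plausibility score meets the cutoff.

    plausibility_fn is a hypothetical stand-in for the classifier; any
    callable mapping a row to a float score can be used.
    """
    scores = np.array([plausibility_fn(row) for row in samples])
    return samples[scores >= cutoff]
```

In use, one would fit on the minority-class rows only, draw as many synthetic rows as needed to rebalance the training set, and pass them through the filter before training. The default cutoff of 0.5 is an illustrative choice, not a value taken from the paper.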