Using CART to generate partially synthetic public use microdata (Paper)
2005, cited by 219
Data Mining Algorithms and Applications; Privacy-Preserving Technologies in Data; Data Quality and Management
Abstract
To limit disclosure risks, one approach is to release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. This article presents and evaluates the use of classification and regression trees (CART) to generate partially synthetic data. Two potential applications of CART are studied via simulation: (i) generating synthetic data for sensitive variables; and (ii) generating synthetic data for variables that are key identifiers.
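The idea in the abstract can be illustrated with a minimal sketch. It is not the article's procedure: the abstract does not specify how values are drawn, so this sketch assumes the simplest variant, in which a regression tree is grown on the sensitive variable and each flagged value is replaced by a random draw from the observed values in the same leaf. The function names (`grow_tree`, `leaf_values`, `synthesize`), the toy age/income data, and the rule for flagging records are all hypothetical.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (regression-tree impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def grow_tree(rows, y, min_leaf=5):
    """Grow a small regression tree; each leaf stores its observed y values."""
    best = None  # (impurity after split, feature index, threshold)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [v for r, v in zip(rows, y) if r[j] <= t]
            right = [v for r, v in zip(rows, y) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    # Stop when no admissible split reduces the impurity.
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}
    _, j, t = best
    li = [i for i, r in enumerate(rows) if r[j] <= t]
    ri = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([rows[i] for i in li], [y[i] for i in li], min_leaf),
        "right": grow_tree([rows[i] for i in ri], [y[i] for i in ri], min_leaf),
    }

def leaf_values(tree, row):
    """Return the observed y values in the leaf that `row` falls into."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

def synthesize(rows, y, replace_mask, m=3, seed=0, min_leaf=5):
    """Create m partially synthetic copies of y.

    Assumption: flagged values are replaced by a uniform random draw from the
    donors in the same leaf; unflagged values are released unchanged.
    """
    rng = random.Random(seed)
    tree = grow_tree(rows, y, min_leaf)
    datasets = []
    for _ in range(m):
        copy = []
        for row, value, flagged in zip(rows, y, replace_mask):
            copy.append(rng.choice(leaf_values(tree, row)) if flagged else value)
        datasets.append(copy)
    return datasets

# Toy data (hypothetical): income depends on age; high incomes are flagged.
demo_rng = random.Random(1)
ages = [demo_rng.randint(20, 65) for _ in range(200)]
rows = [(a,) for a in ages]
income = [1000 * a + demo_rng.gauss(0, 2000) for a in ages]
mask = [v > 50000 for v in income]
synthetic = synthesize(rows, income, mask, m=3)
```

Each element of `synthetic` is one released copy of the income variable. Releasing several copies mirrors the multiple imputations mentioned in the abstract, so that an analyst can account for the extra variability introduced by the replacement step.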