Manual DUOpop: A Comprehensive Guide to Usage

This vignette explains, step by step, how to run a synthesis within DUO. The DUOpop package was created for this purpose, so the first step is to load it.

DUOpop is built on top of the well-known synthpop package, so synthpop must be installed first. Which other packages you install depends on your own workflow. To load the packages, use the built-in function load_packages() from DUOpop.
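The arguments of load_packages() are not documented here, so the sketch below is a base-R stand-in rather than DUOpop's own function. The helper name attach_packages() is hypothetical. It attaches each installed package and reports the ones that are missing.

```r
# Hypothetical helper, NOT DUOpop's load_packages(): attach every package
# that is installed and report the ones that are not.
attach_packages <- function(pkgs) {
  ok <- vapply(
    pkgs,
    function(p) {
      suppressWarnings(require(p, character.only = TRUE, quietly = TRUE))
    },
    logical(1)
  )
  if (any(!ok)) {
    message("Not installed: ", paste(pkgs[!ok], collapse = ", "))
  }
  invisible(ok)  # named logical vector: TRUE = attached
}

# synthpop is the only package this vignette names explicitly
status <- attach_packages(c("synthpop"))
```

Packages reported as missing can be installed with install.packages() before running the helper again.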

The next step is to load the data. This vignette uses a built-in dataset from the synthpop package. When the data come from an in-house database, it can be useful to write a separate module for loading them.

Gender	Age	Marital Status	Depression Score	Smoker?	Income	Year Married
FEMALE	57	MARRIED	6	NO	800	1979
MALE	20	SINGLE	0	NO	350	NA
FEMALE	18	SINGLE	0	NO	NA	NA
FEMALE	78	WIDOWED	16	NO	900	1958
FEMALE	54	MARRIED	4	YES	1,500	1980
MALE	20	SINGLE	5	NO	-8	NA
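A minimal sketch of the loading step follows. It assumes that the built-in dataset is SD2011, which ships with synthpop and contains the seven variables summarized in this vignette. The object name observed is chosen here for illustration.

```r
library(synthpop)

# Assumption: the vignette's example data are these seven columns of SD2011
vars <- c("sex", "age", "marital", "depress", "smoke", "income", "ymarr")
observed <- SD2011[, vars]

# Inspect the first rows
head(observed)
```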

After loading the data, it’s useful to gain some insight into your dataframe. The built-in function dataframe_comparer() from DUOpop can be used for this. For each variable it reports the data type, the number of unique values, the minimum, median, mean and maximum, and the counts of NA, filled and empty values. If data preparation is needed, it is recommended to create a separate module for it.

Variable Names	Type	Unique Values (Count)	Unique Values	Min/Median/Mean/Max	Count (NA)	Count (Filled)	Count (Empty)	Percentage (NA)
sex	factor	2	FEMALE, MALE	Not Applicable	0	5,000	0	0.00 %
age	numeric	79	Not Applicable	Min = 16; Median = 49; Mean = 47.7; Max = 97	0	5,000	0	0.00 %
marital	factor	6	MARRIED, SINGLE, WIDOWED, DIVORCED, DE FACTO SEPARATED, LEGALLY SEPARATED	Not Applicable	9	4,991	0	0.18 %
depress	numeric	22	Not Applicable	Min = 0; Median = 4; Mean = 4.5; Max = 21	89	4,911	0	1.78 %
smoke	factor	2	NO, YES	Not Applicable	10	4,990	0	0.20 %
income	numeric	406	Not Applicable	Min = -8; Median = 1200; Mean = 1411.1; Max = 16000	683	4,317	0	13.66 %
ymarr	numeric	74	Not Applicable	Min = 1937; Median = 1981; Mean = 1981.3; Max = 2011	1,320	3,680	0	26.40 %
Summary of observed data with variable insights.
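To show what such a per-variable summary involves, here is a rough base-R approximation of a few of the columns above. It is not DUOpop's implementation. The function name summarise_variables() is hypothetical, and the toy data reuse a few values from the first rows of the example dataset.

```r
# Hypothetical stand-in for a per-variable summary (not dataframe_comparer())
summarise_variables <- function(df) {
  data.frame(
    variable = names(df),
    type     = vapply(df, function(x) class(x)[1], character(1)),
    n_unique = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    n_na     = vapply(df, function(x) sum(is.na(x)), integer(1)),
    pct_na   = vapply(df, function(x) round(100 * mean(is.na(x)), 2), numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only
toy <- data.frame(
  sex    = factor(c("FEMALE", "MALE", "FEMALE", "FEMALE")),
  age    = c(57, 20, 18, 78),
  income = c(800, 350, NA, 900)
)

summarise_variables(toy)
```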

After the observed data have been inspected and prepared, they can be synthesized with the built-in function syn() from the synthpop package. The function accepts several parameters; type ?syn in your console to see which ones.

Gender	Age	Marital Status	Depression Score	Smoker?	Income	Year Married
MALE	19	SINGLE	1	YES	-8	NA
FEMALE	69	WIDOWED	9	NO	978	1961
MALE	44	DIVORCED	0	NO	1,000	1994
MALE	16	MARRIED	0	NO	NA	2007
FEMALE	31	SINGLE	5	NO	-8	NA
MALE	54	MARRIED	0	NO	400	1976
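A synthesis call might look like the sketch below. It again assumes the SD2011 example data. The seed value is arbitrary, and treating -8 as a missing-value code for income through cont.na is an illustrative choice, not necessarily the setting used to produce the table above.

```r
library(synthpop)

# Assumption: same seven SD2011 columns as in the rest of this vignette
vars <- c("sex", "age", "marital", "depress", "smoke", "income", "ymarr")
observed <- SD2011[, vars]

# seed makes the run reproducible; cont.na lists the codes that syn() should
# treat as missing values for a continuous variable (illustrative choice)
sds <- syn(observed, seed = 2024, cont.na = list(income = c(NA, -8)))

# The synthetic data frame is stored in the $syn element of the result
synthetic <- sds$syn
head(synthetic)
```

With the default settings, syn() generates one synthetic dataset with the same number of rows as the observed data.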

The synthetic data can now be compared with the observed data. The dataframe_comparer() function is used again, this time with two arguments: the observed data and the synthetic data.

Variable Name	Observed Data Type	Synthetic Data Type	Level of Agreement	Observed Unique Values (Count)	Synthetic Unique Values (Count)	Observed Unique Values	Synthetic Unique Values	Observed Min/Median/Mean/Max	Synthetic Min/Median/Mean/Max	Observed Count (NA)	Synthetic Count (NA)	Observed Count (Filled)	Synthetic Count (Filled)	Observed Count (Empty)	Synthetic Count (Empty)	Observed Percentage (NA)	Synthetic Percentage (NA)
sex	factor	factor	Exact Match	2	2	FEMALE, MALE	MALE, FEMALE	Not Applicable	Not Applicable	0	0	5,000	5,000	0	0	0.00 %	0.00 %
age	numeric	numeric	Non-Factor	79	79	Not Applicable	Not Applicable	Min = 16; Median = 49; Mean = 47.7; Max = 97	Min = 16; Median = 49; Mean = 47.6; Max = 97	0	0	5,000	5,000	0	0	0.00 %	0.00 %
marital	factor	factor	Exact Match	6	6	MARRIED, SINGLE, WIDOWED, DIVORCED, DE FACTO SEPARATED, LEGALLY SEPARATED	SINGLE, WIDOWED, DIVORCED, MARRIED, LEGALLY SEPARATED, DE FACTO SEPARATED	Not Applicable	Not Applicable	9	14	4,991	4,986	0	0	0.18 %	0.28 %
depress	numeric	numeric	Non-Factor	22	22	Not Applicable	Not Applicable	Min = 0; Median = 4; Mean = 4.5; Max = 21	Min = 0; Median = 4; Mean = 4.5; Max = 21	89	84	4,911	4,916	0	0	1.78 %	1.68 %
smoke	factor	factor	Exact Match	2	2	NO, YES	YES, NO	Not Applicable	Not Applicable	10	13	4,990	4,987	0	0	0.20 %	0.26 %
income	numeric	numeric	Non-Factor	406	314	Not Applicable	Not Applicable	Min = -8; Median = 1200; Mean = 1411.1; Max = 16000	Min = -8; Median = 1200; Mean = 1391.9; Max = 16000	683	692	4,317	4,308	0	0	13.66 %	13.84 %
ymarr	numeric	numeric	Non-Factor	74	74	Not Applicable	Not Applicable	Min = 1937; Median = 1981; Mean = 1981.3; Max = 2011	Min = 1937; Median = 1982; Mean = 1981.7; Max = 2011	1,320	1,371	3,680	3,629	0	0	26.40 %	27.42 %
Comparison between observed and synthetic data variables.
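The "Level of Agreement" column appears to label factors whose sets of values coincide, regardless of order, and to mark numeric variables as "Non-Factor". The sketch below reproduces that reading in base R. It is an interpretation of the table, not DUOpop's code; the helper name level_agreement() and the "Mismatch" label are hypothetical.

```r
# Hypothetical helper: compare the sets of values of two columns
level_agreement <- function(obs, syn) {
  if (!is.factor(obs) || !is.factor(syn)) {
    return("Non-Factor")
  }
  obs_vals <- unique(as.character(obs[!is.na(obs)]))
  syn_vals <- unique(as.character(syn[!is.na(syn)]))
  # setequal() ignores order, so "NO, YES" matches "YES, NO"
  if (setequal(obs_vals, syn_vals)) "Exact Match" else "Mismatch"
}

level_agreement(factor(c("NO", "YES", NA)), factor(c("YES", "NO")))
level_agreement(c(16, 49, 97), c(16, 49, 97))
```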

Besides comparing the dataframes, it’s useful to gain some insight into the utility of the synthetic data. The built-in function utility_evaluation() in DUOpop can be used for this.

Utility is measured by the Standardized Mean Squared Propensity Error (SpMSE). The metric evaluates how similar a synthetic dataset is to the real dataset by training a classifier to distinguish between them. The underlying pMSE is the mean squared difference between the predicted probabilities (propensity scores) and the value expected when the two datasets cannot be told apart, which is 0.5 when they have the same number of records. The standardized version divides the pMSE by its expected value under a correct synthesis model. A lower SpMSE means the synthetic data are harder to distinguish from the real data, which suggests better utility. In the resulting plot, a low SpMSE is shown in green and a high SpMSE in red; intermediate scores are shown in yellow and orange. Links to sites that explain the SpMSE in more detail can be found on the synthpop website.
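The calculation can be sketched in base R as follows. This is an illustration under stated assumptions, not the code behind utility_evaluation(): it uses a main-effects logistic regression as the classifier, needs complete data without NA values, and relies on the expected pMSE that holds for a logistic model with k coefficients, (k - 1)(1 - c)^2 c / N, where c is the share of synthetic records and N the total number of records. The function name spmse() and the toy data are hypothetical.

```r
set.seed(1)
n <- 500

# Toy "observed" and "synthetic" data drawn from the same distributions
observed  <- data.frame(age = rnorm(n, 48, 18), depress = rpois(n, 4.5))
synthetic <- data.frame(age = rnorm(n, 48, 18), depress = rpois(n, 4.5))

spmse <- function(obs, syn) {
  combined <- rbind(obs, syn)
  combined$is_syn <- rep(c(0, 1), c(nrow(obs), nrow(syn)))

  # Classifier: logistic regression predicting whether a record is synthetic
  fit <- glm(is_syn ~ ., data = combined, family = binomial())
  p <- fitted(fit)                 # propensity scores

  N  <- nrow(combined)
  cc <- nrow(syn) / N              # share of synthetic records (0.5 here)
  pmse <- mean((p - cc)^2)

  # Expected pMSE under a correct synthesis model (logistic regression case)
  k <- length(coef(fit))
  expected <- (k - 1) * (1 - cc)^2 * cc / N

  c(pMSE = pmse, S_pMSE = pmse / expected)
}

spmse(observed, synthetic)
```

Because both toy datasets come from the same distributions, the standardized value should stay small; shifting one of the synthetic columns makes it grow quickly.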

Privacy can be measured with the built-in privacy_evaluation() function in DUOpop, which calculates the Distance to the Closest Record (DCR). The DCR evaluates the privacy of a synthetic dataset by measuring the distance between each real record and its closest synthetic record. A high average minimal distance indicates stronger privacy, because it implies that synthetic records are not direct copies of real ones. Low distances may point to a risk of re-identification. More information can be found on the website of Frontiers in Big Data.
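The idea behind the DCR can be illustrated with a small base-R sketch. It assumes numeric columns only and Euclidean distance on standardized values; the distance measure and the handling of factors in privacy_evaluation() may differ. The function name dcr() and the toy data are hypothetical.

```r
# Hypothetical helper: for each real record, the distance to its closest
# synthetic record (numeric columns, Euclidean distance, standardized scale)
dcr <- function(obs, syn) {
  mu <- colMeans(obs)
  s  <- apply(obs, 2, sd)
  o <- scale(as.matrix(obs), center = mu, scale = s)
  y <- scale(as.matrix(syn), center = mu, scale = s)
  vapply(
    seq_len(nrow(o)),
    function(i) sqrt(min(rowSums(sweep(y, 2, o[i, ])^2))),
    numeric(1)
  )
}

# Toy data for illustration only
observed  <- data.frame(age = c(57, 20, 18, 78), depress = c(6, 0, 0, 16))
copied    <- observed        # "synthetic" data that copy the real records
perturbed <- observed + 5    # "synthetic" data that differ from every record

mean(dcr(observed, copied))     # average minimal distance for exact copies
mean(dcr(observed, perturbed))  # larger average minimal distance
```

Exact copies give a distance of zero for every record, which is the situation the metric is meant to flag.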

Make sure the synthetic data and the workspace image are saved after running your synthesis. It can save you a lot of time ;). Good luck!
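One way to do this with base R is sketched below. The file names and the placeholder data frame are illustrative, and tempdir() stands in for your own project folder.

```r
# Placeholder for the synthetic data produced earlier
synthetic <- data.frame(sex = c("MALE", "FEMALE"), age = c(19, 69))

out_dir <- tempdir()  # replace with your project folder

# Save the synthetic data as a single R object
saveRDS(synthetic, file.path(out_dir, "synthetic_data.rds"))

# Save every object in the workspace
save.image(file.path(out_dir, "synthesis_workspace.RData"))

# Later: read the synthetic data back in
restored <- readRDS(file.path(out_dir, "synthetic_data.rds"))
```

The saved workspace can be restored in a new session with load().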