This vignette explains, step by step, how to run a synthesis within DUO. The DUOpop package was created for this purpose, so the first step is to load DUOpop.
The DUOpop package is built on top of the well-known synthpop package. synthpop and several other packages need to be installed; which ones you need is up to you. To load them, use the built-in function load_packages() from the DUOpop package.
# Load other packages
load_packages("tidyverse", "odbc", "readr", "glue", "writexl", "synthpop", "flextable")

The next step is to load data. For this vignette we use a built-in dataset from the synthpop package. When using data from an in-house database, it might be useful to create a module for loading your data.
# Load data to synthesize
data_observed <- SD2011 %>%
select(sex, age, marital, depress, smoke, income, ymarr)
head(data_observed)

| Gender | Age | Marital Status | Depression Score | Smoker? | Income | Year Married |
|---|---|---|---|---|---|---|
| FEMALE | 57 | MARRIED | 6 | NO | 800 | 1,979 |
| MALE | 20 | SINGLE | 0 | NO | 350 | |
| FEMALE | 18 | SINGLE | 0 | NO | | |
| FEMALE | 78 | WIDOWED | 16 | NO | 900 | 1,958 |
| FEMALE | 54 | MARRIED | 4 | YES | 1,500 | 1,980 |
| MALE | 20 | SINGLE | 5 | NO | -8 | |
After loading the data, it’s useful to gain some insight into your dataframe. The built-in function dataframe_comparer() from the DUOpop package can be used for this. For each variable, it reports the data type, the minimum and maximum values, the number of missing values, and more. If data preparation is needed, it is recommended to create a separate module for it.
# Analyze observed data
observed_data_check <- dataframe_comparer(data_observed)
print(observed_data_check)

| Variable Names | Type | Unique Values (Count) | Unique Values | Min/Median/Mean/Max | Count (NA) | Count (Filled) | Count (Empty) | Percentage (NA) |
|---|---|---|---|---|---|---|---|---|
| sex | factor | 2 | FEMALE, MALE | Not Applicable | 0 | 5,000 | 0 | 0.00 % |
| age | numeric | 79 | Not Applicable | Min = 16; Median = 49; Mean = 47.7; Max = 97 | 0 | 5,000 | 0 | 0.00 % |
| marital | factor | 6 | MARRIED, SINGLE, WIDOWED, DIVORCED, DE FACTO SEPARATED, LEGALLY SEPARATED | Not Applicable | 9 | 4,991 | 0 | 0.18 % |
| depress | numeric | 22 | Not Applicable | Min = 0; Median = 4; Mean = 4.5; Max = 21 | 89 | 4,911 | 0 | 1.78 % |
| smoke | factor | 2 | NO, YES | Not Applicable | 10 | 4,990 | 0 | 0.20 % |
| income | numeric | 406 | Not Applicable | Min = -8; Median = 1200; Mean = 1411.1; Max = 16000 | 683 | 4,317 | 0 | 13.66 % |
| ymarr | numeric | 74 | Not Applicable | Min = 1937; Median = 1981; Mean = 1981.3; Max = 2011 | 1,320 | 3,680 | 0 | 26.40 % |

Summary of observed data with variable insights.
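To give a feel for what such a per-variable overview involves, here is a minimal base R sketch. It is not the DUOpop implementation: summarise_columns() is a hypothetical helper written for this illustration, and it is run on the airquality dataset that ships with R rather than on SD2011.

```r
# Hypothetical helper for illustration only; not part of DUOpop.
summarise_columns <- function(df) {
  # Number of missing values per column
  n_missing <- vapply(df, function(col) sum(is.na(col)), integer(1))
  data.frame(
    variable    = names(df),
    type        = vapply(df, function(col) class(col)[1], character(1)),
    # Distinct non-missing values per column
    n_unique    = vapply(df, function(col) length(unique(col[!is.na(col)])), integer(1)),
    n_missing   = n_missing,
    pct_missing = round(100 * n_missing / nrow(df), 2),
    row.names   = NULL
  )
}

# airquality is a base R dataset that contains some missing values
overview <- summarise_columns(airquality)
print(overview)
```

The real dataframe_comparer() reports more than this (for example the unique factor levels and the median and mean), so treat the sketch as a simplified picture of the idea.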
After inspecting and preparing the observed data, the data can be synthesized. The built in function syn() from the synthpop package is used for synthesizing. Several parameters can be given to the function. Type ?syn in your console to see which parameters are accepted.
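The exact call used for this vignette is not shown, so the following is only a sketch of what a basic synthesis could look like. The seed value is an arbitrary assumption, added so the run can be repeated; the object name sds matches the one used later in this vignette.

```r
library(synthpop)

# Same variables as in the vignette, selected with base R subsetting
data_observed <- SD2011[, c("sex", "age", "marital", "depress",
                            "smoke", "income", "ymarr")]

# Synthesize with synthpop's default settings.
# The seed is an arbitrary choice for reproducibility (assumption).
sds <- syn(data_observed, seed = 2024)

# The synthetic data frame is stored in the `syn` element
head(sds$syn)
```

Other arguments, such as method or cont.na, may be worth setting for your own data; ?syn lists them.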
| Gender | Age | Marital Status | Depression Score | Smoker? | Income | Year Married |
|---|---|---|---|---|---|---|
| MALE | 19 | SINGLE | 1 | YES | -8 | |
| FEMALE | 69 | WIDOWED | 9 | NO | 978 | 1,961 |
| MALE | 44 | DIVORCED | 0 | NO | 1,000 | 1,994 |
| MALE | 16 | MARRIED | 0 | NO | 2,007 | |
| FEMALE | 31 | SINGLE | 5 | NO | -8 | |
| MALE | 54 | MARRIED | 0 | NO | 400 | 1,976 |
Now the synthetic data can be compared to the observed data. The dataframe_comparer() function can be used again for this. This time it takes two arguments: the observed data and the synthetic data.
# Compare the synthetic data to the observed data
comparison_observed_syn <- dataframe_comparer(data_observed, sds$syn)
comparison_observed_syn

| Variable Name | Observed Data Type | Synthetic Data Type | Level of Agreement | Observed Unique Values (Count) | Synthetic Unique Values (Count) | Observed Unique Values | Synthetic Unique Values | Observed Min/Median/Mean/Max | Synthetic Min/Median/Mean/Max | Observed Count (NA) | Synthetic Count (NA) | Observed Count (Filled) | Synthetic Count (Filled) | Observed Count (Empty) | Synthetic Count (Empty) | Observed Percentage (NA) | Synthetic Percentage (NA) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sex | factor | factor | Exact Match | 2 | 2 | FEMALE, MALE | MALE, FEMALE | Not Applicable | Not Applicable | 0 | 0 | 5,000 | 5,000 | 0 | 0 | 0.00 % | 0.00 % |
| age | numeric | numeric | Non-Factor | 79 | 79 | Not Applicable | Not Applicable | Min = 16; Median = 49; Mean = 47.7; Max = 97 | Min = 16; Median = 49; Mean = 47.6; Max = 97 | 0 | 0 | 5,000 | 5,000 | 0 | 0 | 0.00 % | 0.00 % |
| marital | factor | factor | Exact Match | 6 | 6 | MARRIED, SINGLE, WIDOWED, DIVORCED, DE FACTO SEPARATED, LEGALLY SEPARATED | SINGLE, WIDOWED, DIVORCED, MARRIED, LEGALLY SEPARATED, DE FACTO SEPARATED | Not Applicable | Not Applicable | 9 | 14 | 4,991 | 4,986 | 0 | 0 | 0.18 % | 0.28 % |
| depress | numeric | numeric | Non-Factor | 22 | 22 | Not Applicable | Not Applicable | Min = 0; Median = 4; Mean = 4.5; Max = 21 | Min = 0; Median = 4; Mean = 4.5; Max = 21 | 89 | 84 | 4,911 | 4,916 | 0 | 0 | 1.78 % | 1.68 % |
| smoke | factor | factor | Exact Match | 2 | 2 | NO, YES | YES, NO | Not Applicable | Not Applicable | 10 | 13 | 4,990 | 4,987 | 0 | 0 | 0.20 % | 0.26 % |
| income | numeric | numeric | Non-Factor | 406 | 314 | Not Applicable | Not Applicable | Min = -8; Median = 1200; Mean = 1411.1; Max = 16000 | Min = -8; Median = 1200; Mean = 1391.9; Max = 16000 | 683 | 692 | 4,317 | 4,308 | 0 | 0 | 13.66 % | 13.84 % |
| ymarr | numeric | numeric | Non-Factor | 74 | 74 | Not Applicable | Not Applicable | Min = 1937; Median = 1981; Mean = 1981.3; Max = 2011 | Min = 1937; Median = 1982; Mean = 1981.7; Max = 2011 | 1,320 | 1,371 | 3,680 | 3,629 | 0 | 0 | 26.40 % | 27.42 % |

Comparison between observed and synthetic data variables.
In addition to comparing the dataframes, it’s useful to gain some insight into the utility of the synthetic data. The built-in function utility_evaluation() in the DUOpop package can be used for this.
Utility is measured by a metric called Standardized Mean Squared Propensity Error (SpMSE, shown as S_pMSE in the output). The underlying pMSE evaluates how similar a synthetic dataset is to the real dataset by training a classifier to distinguish between them. It is the mean squared difference between the predicted probabilities (propensity scores) and the value expected when the two datasets cannot be told apart, which is 0.5 when both contain the same number of records. The standardized version divides the pMSE by the value expected for a correct synthesis, so scores are comparable across variables. A lower SpMSE indicates that the synthetic data is harder to distinguish from the real data, suggesting better utility. In the plot below, a low SpMSE score is shown in green and a high SpMSE score in red; scores in between are shown in yellow and orange. Links to websites that explain more about the SpMSE can be found on the synthpop website.
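The idea behind the propensity score can be sketched in a few lines of base R. This is an illustration on simulated data, not the code that utility_evaluation() or synthpop runs: the data, the logistic regression model, and all object names are assumptions made for the example. The standardization step uses the large-sample expectation for a logistic propensity model, (k - 1)(1 - c)^2 c / N, with k the number of model parameters including the intercept, c the share of synthetic records, and N the total number of records.

```r
set.seed(1)
n <- 500

# Simulated stand-ins for an observed and a synthetic dataset (assumption)
observed  <- data.frame(x = rnorm(n, mean = 50, sd = 10), y = rbinom(n, 1, 0.3))
synthetic <- data.frame(x = rnorm(n, mean = 50, sd = 10), y = rbinom(n, 1, 0.3))

# Stack both datasets and flag the synthetic rows
combined <- rbind(observed, synthetic)
combined$is_synthetic <- rep(c(0, 1), each = n)

# Propensity model: try to tell synthetic rows from observed rows
fit <- glm(is_synthetic ~ x + y, data = combined, family = binomial)
propensity <- fitted(fit)

# Share of synthetic rows; 0.5 here because both sets have n rows
c_share <- mean(combined$is_synthetic)

# pMSE: mean squared distance of the propensity scores from c_share
pmse <- mean((propensity - c_share)^2)

# Standardize by the expected pMSE under a correct synthesis
k <- length(coef(fit))
expected_pmse <- (k - 1) * (1 - c_share)^2 * c_share / nrow(combined)
s_pmse <- pmse / expected_pmse

pmse
s_pmse
```

Because both simulated datasets come from the same distribution, the classifier has little to work with and the propensity scores stay close to 0.5.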
# Test the post hoc utility
utility_result <- utility_evaluation(sds, data_observed)
utility_result$plot
utility_result$`1-dim`$tab.utility
#> S_pMSE
#> sex 0.3940239
#> age 0.6017068
#> marital 1.6663994
#> depress 69.7620607
#> smoke 0.4198061
#> income 551.4557728
#> ymarr 964.7075616

Privacy can be measured using the built-in privacy_evaluation() function in the DUOpop package. Within this function the Distance to the Closest Record (DCR) is calculated. The DCR metric evaluates the privacy of a synthetic dataset by measuring the similarity between each real record and its closest synthetic record. A high average minimal distance indicates stronger privacy, as it implies synthetic records are not direct copies of real ones. Lower distances may suggest a risk of re-identification. More information can be found on the website of Frontiers in Big Data.
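A minimal base R sketch of a distance-to-closest-record calculation is given below. It is not the DUOpop implementation: the simulated numeric data, the use of Euclidean distance, the split into training and holdout records, and the helper dcr() are all assumptions made for illustration.

```r
set.seed(42)

# Simulated numeric records (assumption); rows are records, columns are variables
train     <- matrix(rnorm(200), ncol = 2)
holdout   <- matrix(rnorm(200), ncol = 2)
synthetic <- matrix(rnorm(200), ncol = 2)

# Hypothetical helper: for each row of `from`, the Euclidean distance
# to its nearest row in `to`
dcr <- function(from, to) {
  apply(from, 1, function(record) {
    min(sqrt(rowSums(sweep(to, 2, record)^2)))
  })
}

dcr_train   <- dcr(synthetic, train)
dcr_holdout <- dcr(synthetic, holdout)

# Average distance to the closest training and holdout record
mean(dcr_train)
mean(dcr_holdout)

# Share of synthetic records that lie closer to a training record
# than to a holdout record
mean(dcr_train < dcr_holdout)
```

Comparing against a holdout set gives a baseline: if synthetic records are systematically much closer to the training records than to records the synthesizer never saw, that may point to copying.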
# Test the privacy evaluation
privacy_result <- privacy_evaluation(data_observed)
#>
#> Synthesis
#> -----------
#> sex age marital depress smoke income ymarr
privacy_result
#> share_training_data n_syn n_closer
#> 0.6352 2500.0000 906.0000
#> n_farther n_equal mean_distance_train
#> 230.0000 1364.0000 1.4692
#> mean_distance_holdout
#> 1.7792

Make sure the synthetic data and the workspace image are saved after running your synthesis. It can save you a lot of time ;). Good luck!