Manual DUOpop: A Comprehensive Guide to Usage

This vignette explains, step by step, how to run a synthesis within DUO. The DUOpop package was created for this purpose, so the first step is to load it.

# Load the DUOpop package 
library(DUOpop)

The DUOpop package is built on top of the well-known synthpop package. synthpop and several other packages need to be installed; which ones you need is up to you. To load them, use the built-in load_packages() function from the DUOpop package.

# Load other packages 
load_packages("tidyverse", "odbc", "readr", "glue", "writexl", "synthpop", "flextable")
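
How load_packages() works internally is not shown in this vignette. Assuming it installs whatever is missing and then attaches each package, a rough base R equivalent might look like the sketch below. The name load_packages_sketch is hypothetical and is not part of DUOpop.

```r
# Hypothetical stand-in for load_packages(); not the DUOpop implementation.
load_packages_sketch <- function(...) {
  pkgs <- c(...)
  for (pkg in pkgs) {
    # Install only when the package is not available yet
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # Attach the package by its name given as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Example with packages that ship with R, so nothing gets installed
loaded <- load_packages_sketch("stats", "utils")
```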

The next step is to load the data. This vignette uses a built-in dataset from the synthpop package. When using data from an in-house database, it may be useful to create a separate module for loading your data.


# Load data to synthesize 
data_observed <- SD2011 %>%
  select(sex, age, marital, depress, smoke, income, ymarr)

head(data_observed)

Gender  Age  Marital Status  Depression Score  Smoker?  Income  Year Married
FEMALE  57   MARRIED         6                 NO       800     1,979
MALE    20   SINGLE          0                 NO       350
FEMALE  18   SINGLE          0                 NO
FEMALE  78   WIDOWED         16                NO       900     1,958
FEMALE  54   MARRIED         4                 YES      1,500   1,980
MALE    20   SINGLE          5                 NO       -8

After loading the data, it is useful to gain some insight into your data frame. The built-in dataframe_comparer() function from the DUOpop package can be used for this. It reports each variable's data type, its minimum and maximum values, the number of missing values, and more. If data preparation is needed, it is recommended to create a separate module for it.

# Analyze observed data   
observed_data_check <- dataframe_comparer(data_observed)

print(observed_data_check)

Variable Names  Type     Unique Values (Count)  Count (NA)  Count (Filled)  Count (Empty)  Percentage (NA)
sex             factor   2                      0           5,000           0              0.00 %
age             numeric  79                     0           5,000           0              0.00 %
marital         factor   6                      9           4,991           0              0.18 %
depress         numeric  22                     89          4,911           0              1.78 %
smoke           factor   2                      10          4,990           0              0.20 %
income          numeric  406                    683         4,317           0              13.66 %
ymarr           numeric  74                     1,320       3,680           0              26.40 %

Variable Names  Unique Values                                                              Min/Median/Mean/Max
sex             FEMALE, MALE                                                               Not Applicable
age             Not Applicable                                                             Min = 16; Median = 49; Mean = 47.7; Max = 97
marital         MARRIED, SINGLE, WIDOWED, DIVORCED, DE FACTO SEPARATED, LEGALLY SEPARATED  Not Applicable
depress         Not Applicable                                                             Min = 0; Median = 4; Mean = 4.5; Max = 21
smoke           NO, YES                                                                    Not Applicable
income          Not Applicable                                                             Min = -8; Median = 1200; Mean = 1411.1; Max = 16000
ymarr           Not Applicable                                                             Min = 1937; Median = 1981; Mean = 1981.3; Max = 2011

Summary of observed data with variable insights.
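
To give an idea of where such figures come from, here is a minimal base R sketch that computes a few of the same columns (type, distinct values, missing counts) for a small toy data frame. The function summarise_variables and the toy data are made up for illustration; this is not how dataframe_comparer() is implemented.

```r
# Toy data frame standing in for the observed data
toy <- data.frame(
  sex    = factor(c("FEMALE", "MALE", "FEMALE", NA)),
  age    = c(57, 20, 18, 78),
  income = c(800, 350, NA, 900)
)

# One row per variable: type, distinct non-missing values, missing counts
summarise_variables <- function(df) {
  n_na <- vapply(df, function(x) sum(is.na(x)), integer(1))
  data.frame(
    variable   = names(df),
    type       = vapply(df, function(x) class(x)[1], character(1)),
    n_unique   = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
    count_na   = n_na,
    count_fill = nrow(df) - n_na,
    pct_na     = round(100 * n_na / nrow(df), 2),
    row.names  = NULL
  )
}

toy_summary <- summarise_variables(toy)
print(toy_summary)
```

Missing values are left out of the distinct-value count here, which matches the table above: marital has nine missing values but only six unique values.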

After inspecting and preparing the observed data, you can synthesize it with the built-in syn() function from the synthpop package. The function accepts several parameters; type ?syn in your console to see which ones.

# Synthesize data 
sds <- syn(data_observed)
head(sds$syn)

Gender  Age  Marital Status  Depression Score  Smoker?  Income  Year Married
MALE    19   SINGLE          1                 YES      -8
FEMALE  69   WIDOWED         9                 NO       978     1,961
MALE    44   DIVORCED        0                 NO       1,000   1,994
MALE    16   MARRIED         0                 NO               2,007
FEMALE  31   SINGLE          5                 NO       -8
MALE    54   MARRIED         0                 NO       400     1,976
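
As an illustration of the parameters listed under ?syn, the sketch below sets seed so the run can be reproduced and uses cont.na to tell syn() that -8 in income is a code for a missing value rather than a real amount. Reading -8 as a missing code is an assumption about how this dataset is coded; check your own data before copying it. The seed value and the name sds_seeded are arbitrary.

```r
library(synthpop)

# Same variables as above, selected with base R
data_observed <- SD2011[, c("sex", "age", "marital", "depress", "smoke", "income", "ymarr")]

# seed makes the run reproducible; cont.na treats -8 in income as a missing code;
# print.flag = FALSE suppresses the progress output
sds_seeded <- syn(
  data_observed,
  seed       = 2024,
  cont.na    = list(income = -8),
  print.flag = FALSE
)

head(sds_seeded$syn)
```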

The synthetic data can now be compared to the observed data, again with dataframe_comparer(). This time the function takes two arguments: the observed data and the synthetic data.

# Compare the synthetic data to the observed data 
comparison_observed_syn <- dataframe_comparer(data_observed, sds$syn)

comparison_observed_syn

Variable Name: sex
Level of Agreement: Exact Match
                       Observed        Synthetic
Data Type              factor          factor
Unique Values (Count)  2               2
Unique Values          FEMALE, MALE    MALE, FEMALE
Min/Median/Mean/Max    Not Applicable  Not Applicable
Count (NA)             0               0
Count (Filled)         5,000           5,000
Count (Empty)          0               0
Percentage (NA)        0.00 %          0.00 %

Variable Name: age
Level of Agreement: Non-Factor
                       Observed                                              Synthetic
Data Type              numeric                                               numeric
Unique Values (Count)  79                                                    79
Unique Values          Not Applicable                                        Not Applicable
Min/Median/Mean/Max    Min = 16; Median = 49; Mean = 47.7; Max = 97          Min = 16; Median = 49; Mean = 47.6; Max = 97
Count (NA)             0                                                     0
Count (Filled)         5,000                                                 5,000
Count (Empty)          0                                                     0
Percentage (NA)        0.00 %                                                0.00 %

Variable Name: marital
Level of Agreement: Exact Match
                       Observed                                                                   Synthetic
Data Type              factor                                                                     factor
Unique Values (Count)  6                                                                          6
Unique Values          MARRIED, SINGLE, WIDOWED, DIVORCED, DE FACTO SEPARATED, LEGALLY SEPARATED  SINGLE, WIDOWED, DIVORCED, MARRIED, LEGALLY SEPARATED, DE FACTO SEPARATED
Min/Median/Mean/Max    Not Applicable                                                             Not Applicable
Count (NA)             9                                                                          14
Count (Filled)         4,991                                                                      4,986
Count (Empty)          0                                                                          0
Percentage (NA)        0.18 %                                                                     0.28 %

Variable Name: depress
Level of Agreement: Non-Factor
                       Observed                                              Synthetic
Data Type              numeric                                               numeric
Unique Values (Count)  22                                                    22
Unique Values          Not Applicable                                        Not Applicable
Min/Median/Mean/Max    Min = 0; Median = 4; Mean = 4.5; Max = 21             Min = 0; Median = 4; Mean = 4.5; Max = 21
Count (NA)             89                                                    84
Count (Filled)         4,911                                                 4,916
Count (Empty)          0                                                     0
Percentage (NA)        1.78 %                                                1.68 %

Variable Name: smoke
Level of Agreement: Exact Match
                       Observed        Synthetic
Data Type              factor          factor
Unique Values (Count)  2               2
Unique Values          NO, YES         YES, NO
Min/Median/Mean/Max    Not Applicable  Not Applicable
Count (NA)             10              13
Count (Filled)         4,990           4,987
Count (Empty)          0               0
Percentage (NA)        0.20 %          0.26 %

Variable Name: income
Level of Agreement: Non-Factor
                       Observed                                              Synthetic
Data Type              numeric                                               numeric
Unique Values (Count)  406                                                   314
Unique Values          Not Applicable                                        Not Applicable
Min/Median/Mean/Max    Min = -8; Median = 1200; Mean = 1411.1; Max = 16000   Min = -8; Median = 1200; Mean = 1391.9; Max = 16000
Count (NA)             683                                                   692
Count (Filled)         4,317                                                 4,308
Count (Empty)          0                                                     0
Percentage (NA)        13.66 %                                               13.84 %

Variable Name: ymarr
Level of Agreement: Non-Factor
                       Observed                                              Synthetic
Data Type              numeric                                               numeric
Unique Values (Count)  74                                                    74
Unique Values          Not Applicable                                        Not Applicable
Min/Median/Mean/Max    Min = 1937; Median = 1981; Mean = 1981.3; Max = 2011  Min = 1937; Median = 1982; Mean = 1981.7; Max = 2011
Count (NA)             1,320                                                 1,371
Count (Filled)         3,680                                                 3,629
Count (Empty)          0                                                     0
Percentage (NA)        26.40 %                                               27.42 %

Comparison between observed and synthetic data variables.
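
How DUOpop derives the Level of Agreement is not documented in this vignette. Judging by the comparison output alone, factors whose sets of values coincide are labelled "Exact Match" even when the order differs, and numeric variables are labelled "Non-Factor". A sketch of such a check could look as follows; the function level_agreement and the "Mismatch" label are hypothetical.

```r
# Hypothetical check; the "Mismatch" label is made up for this sketch
level_agreement <- function(observed, synthetic) {
  if (!is.factor(observed) || !is.factor(synthetic)) {
    return("Non-Factor")
  }
  # Compare the values that actually occur, ignoring order and missing values
  obs_values <- unique(as.character(observed[!is.na(observed)]))
  syn_values <- unique(as.character(synthetic[!is.na(synthetic)]))
  if (setequal(obs_values, syn_values)) "Exact Match" else "Mismatch"
}

obs_smoke <- factor(c("NO", "NO", "YES", NA))
syn_smoke <- factor(c("YES", "NO", "NO", "NO"))

agreement_factor  <- level_agreement(obs_smoke, syn_smoke)
agreement_numeric <- level_agreement(c(57, 20, 18), c(19, 69, 44))
agreement_missing <- level_agreement(obs_smoke, factor(c("NO", "NO", "NO", "NO")))
```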

Besides comparing the data frames, it is useful to gain some insight into the utility of the synthetic data. The built-in utility_evaluation() function in the DUOpop package can be used for this.

Utility is measured by a metric called the Standardized Mean Squared Propensity Error (SpMSE, shown as S_pMSE in the output). The SpMSE evaluates how similar a synthetic dataset is to the real dataset by training a classifier to distinguish between them. It computes the mean squared error of the predicted probabilities (propensity scores) against the expected value of 0.5 (a random guess when both datasets have the same number of records) and standardizes it by the value expected for a correct synthesis. A lower SpMSE indicates that the synthetic data is harder to distinguish from the real data, suggesting better utility. In the plot below, a low SpMSE score is shown in green and a high SpMSE score in red. Scores in between are shown in yellow and orange. Links to websites that explain more about the SpMSE can be found on the synthpop website.

# Test the post hoc utility
utility_result <- utility_evaluation(sds, data_observed)

utility_result$plot

utility_result$`1-dim`$tab.utility
#>              S_pMSE
#> sex       0.3940239
#> age       0.6017068
#> marital   1.6663994
#> depress  69.7620607
#> smoke     0.4198061
#> income  551.4557728
#> ymarr   964.7075616
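
The propensity idea behind this metric can be sketched in base R with a logistic regression as the classifier. The code below uses simulated toy data and computes only the unstandardized mean squared propensity error; it does not reproduce the standardization or the exact models used by synthpop, and the function name pmse is made up for this example.

```r
set.seed(1)

# Toy "observed" and "synthetic" data drawn from the same distribution
observed  <- data.frame(age = rnorm(500, 48, 17), income = rnorm(500, 1400, 600))
synthetic <- data.frame(age = rnorm(500, 48, 17), income = rnorm(500, 1400, 600))

pmse <- function(observed, synthetic) {
  # Stack both data sets and flag the synthetic rows
  stacked <- rbind(
    cbind(observed,  is_synthetic = 0),
    cbind(synthetic, is_synthetic = 1)
  )
  # Logistic regression as the classifier
  fit <- glm(is_synthetic ~ ., data = stacked, family = binomial())
  propensity <- fitted(fit)
  # Share of synthetic rows: 0.5 when both data sets have equal size
  share_syn <- mean(stacked$is_synthetic)
  mean((propensity - share_syn)^2)
}

pmse_similar <- pmse(observed, synthetic)

# A clearly shifted synthetic data set is easier to tell apart
shifted <- transform(synthetic, age = age + 30)
pmse_shifted <- pmse(observed, shifted)
```

When the classifier cannot separate the two data sets, all propensity scores stay close to 0.5 and the error is near zero; the shifted data set yields a visibly larger value.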

Privacy can be measured with the built-in privacy_evaluation() function from the DUOpop package, which calculates the Distance to the Closest Record (DCR). The DCR metric evaluates the privacy of a synthetic dataset by measuring the distance between each synthetic record and its closest real record. A high average minimal distance indicates stronger privacy, as it implies synthetic records are not direct copies of real ones. Lower distances may suggest a risk of re-identification. More information can be found on the website of Frontiers in Big Data.

# Test the privacy evaluation
# Store the result under its own name so the function is not shadowed
privacy_result <- privacy_evaluation(data_observed)
#> 
#> Synthesis
#> -----------
#>  sex age marital depress smoke income ymarr

privacy_result
#>   share_training_data                 n_syn              n_closer 
#>                0.6352             2500.0000              906.0000 
#>             n_farther               n_equal   mean_distance_train 
#>              230.0000             1364.0000                1.4692 
#> mean_distance_holdout 
#>                1.7792
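
The field names in this output suggest that each synthetic record is compared with its closest record in a training part and in a holdout part of the observed data. That reading is an assumption, but the counts are consistent with it: n_closer, n_farther and n_equal add up to n_syn, and (906 + 1364 / 2) / 2500 gives the reported share of 0.6352. The sketch below mimics the idea on numeric toy data with Euclidean distance; the distance measure and the helper closest_distance are illustrative choices, not DUOpop's.

```r
set.seed(42)

# Distance from each row of `from` to its closest row in `to` (Euclidean)
closest_distance <- function(from, to) {
  from <- as.matrix(from)
  to   <- as.matrix(to)
  apply(from, 1, function(row) {
    min(sqrt(colSums((t(to) - row)^2)))
  })
}

# Toy numeric data: a training half, a holdout half, and "synthetic" records
# that are near-copies of training rows (a deliberately bad case for privacy)
train     <- matrix(rnorm(200), ncol = 2)
holdout   <- matrix(rnorm(200), ncol = 2)
synthetic <- train[1:50, ] + matrix(rnorm(100, sd = 0.05), ncol = 2)

d_train   <- closest_distance(synthetic, train)
d_holdout <- closest_distance(synthetic, holdout)

n_closer  <- sum(d_train < d_holdout)
n_farther <- sum(d_train > d_holdout)
n_equal   <- sum(d_train == d_holdout)

# Ties count half; a share well above 0.5 means the synthetic records
# sit closer to the training data than to unseen data
share_training_data <- (n_closer + n_equal / 2) / length(d_train)
```

Data with factors, such as the dataset in this vignette, needs a distance that handles categorical variables, which this sketch leaves out.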

Make sure to save the synthetic data and the workspace image after running your synthesis. It can save you a lot of time ;). Good luck!