Author: Cameron Fraser

Using CTGAN to synthesise fake patient data

Being the member of the Computational Oncology lab with no more than A levels (which I never sat!) and an unconditional offer to study Computer Science at Imperial College London from October 2021 has been a great opportunity, one I am extremely grateful for, albeit a bit daunting when sitting alongside colleagues with their variety of PhDs.

A large issue in the medical world is that patient data is highly confidential and private, which makes this limited resource difficult to get hold of. One potential solution is a GAN (Generative Adversarial Network) called CTGAN, which generates realistic synthetic patient data based on real, private patient data. The goal of the project I have been working on is to take real tabular patient data, train a CTGAN model on it, and have the model output synthetic data that preserves the correlations between the columns of the real data. The model can generate as many synthetic patients as one desires, the synthetic data can undergo the same analysis techniques that researchers would use on real patient data, and it can be made publicly available, as no private data is accessible through it.

There are many constraints that can be placed on the GAN to make the synthetic data more realistic, as sometimes data needs to be constrained. As of right now, there are four constraints that can be placed on the model. First, there is the ‘Custom Formula’ constraint, which could be used to preserve a formula such as ‘years taking prescription = age – age when prescription started’. Second, the ‘Greater Than’ constraint would ensure that ‘age’ is always greater than ‘age when prescription started’. Third, the ‘Unique Combinations’ constraint could be used to restrict ‘City’ to only be synthesised when the appropriate Country column is generated alongside it. Finally, there is the APII (Anonymising Personally Identifiable Information) constraint, which ensures that no private information is copied from the real data into the synthetic data. APII works with many different confidential fields, such as Name, Address, Country or Telephone Number, by replacing these fields with pre-set, non-existent data entries from a large database called Faker Providers.
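To make the first two constraints concrete, here is a minimal sketch of how they could be checked on a table. This is my own illustration with invented column names, not the library's actual constraint API:

```python
import pandas as pd

# Hypothetical patient table; the column names are invented for illustration.
df = pd.DataFrame({
    "age": [60, 45, 72],
    "age_started": [50, 40, 65],
    "years_on_prescription": [10, 5, 7],
})

# 'Custom Formula' constraint: years_on_prescription = age - age_started
formula_ok = (df["years_on_prescription"] == df["age"] - df["age_started"]).all()

# 'Greater Than' constraint: age must always exceed age_started
greater_ok = (df["age"] > df["age_started"]).all()

print(formula_ok, greater_ok)  # both True for this table
```

A constraint like this can be used either to reject invalid synthetic rows or to recompute the dependent column after sampling.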

I have been working on adding a new constraint called ‘Custom ID’. This constraint applies when two columns are simply encodings of each other, which occurs most frequently with ID columns. Without this constraint, the one-to-one correspondence between a discrete column and its respective ID column would not be preserved. The constraint works by first comparing each column against every other column in the data; if any encodings are found, the discrete column is kept and the numeric ID column is removed. This is necessary because the ID column, although an encoding of a discrete column, would be identified by CTGAN as continuous and therefore modelled incorrectly. Once the ID column is removed, a lookup table is created linking each value in the discrete column to its respective ID. The CTGAN model is then trained on the data without the ID column. Finally, when sampling synthetic data, the ID is added back into the synthetic data using the lookup table.
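The steps above can be sketched in pandas. This is a simplified illustration of the idea, not my actual constraint code, and the column names are hypothetical:

```python
import pandas as pd

def find_encoding_pairs(df):
    """Find (discrete, id) column pairs that are one-to-one encodings of each other."""
    pairs = []
    for a in df.columns:
        for b in df.columns:
            if a == b:
                continue
            # a and b encode each other if every value of a maps to exactly one
            # value of b and vice versa; keep the discrete column, drop the numeric ID.
            if (df.groupby(a)[b].nunique().eq(1).all()
                    and df.groupby(b)[a].nunique().eq(1).all()
                    and pd.api.types.is_numeric_dtype(df[b])
                    and not pd.api.types.is_numeric_dtype(df[a])):
                pairs.append((a, b))
    return pairs

df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "Gender ID": [0, 1, 1, 0],
    "Age": [34, 29, 51, 46],
})

pairs = find_encoding_pairs(df)  # [("Gender", "Gender ID")]

# Build a lookup table per pair, then drop the ID columns before training CTGAN.
lookups = {d: df.drop_duplicates(d).set_index(d)[i] for d, i in pairs}
train_df = df.drop(columns=[i for _, i in pairs])

# After sampling, restore each ID column from its lookup table.
synthetic = pd.DataFrame({"Gender": ["F", "M", "F"], "Age": [40, 33, 58]})
for d, i in pairs:
    synthetic[i] = synthetic[d].map(lookups[d])
```

Comparing every pair of columns is quadratic in the number of columns but only a single pass over the rows per pair, which is why the approach stays fast on wide tables.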

This solution has the advantage of running quickly, as its time complexity does not grow badly with the number of rows in the real data. It is also easy to use, as it can be turned on and off with one input. Finally, my solution will identify all of the ID columns in the data and create lookup tables for each of them. One limitation is that the Custom ID constraint will not detect three columns that are all encodings of one another (e.g. ‘Gender’, ‘Gender Abbreviation’ and ‘Gender ID’), although this situation occurs rarely.

As I have the long-term goal of synthesising patient data, the synthetic data must be secure in the sense that it must not be possible to reverse engineer the real data from the synthetic. One test, available in SDGym, is the LogisticDetection metric, which shuffles the real and synthetic data together and passes them to a discriminator that attempts to flag each row as real or synthetic. This test showed that rows can be correctly identified as real or synthetic only just over 50% of the time, which is barely better than random guessing and suggests the discriminator cannot reliably tell the two apart. However, when it comes to analysis on medical data, I feel there are still steps to be taken to make the synthetic data more accurate before serious analysis can begin.
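The idea behind a detection metric like this can be sketched as follows. This is a simplified, hypothetical re-implementation rather than SDGym's actual code: label real rows 1 and synthetic rows 0, fit a logistic-regression discriminator, and measure its cross-validated accuracy. Here both "real" and "synthetic" stand-ins are drawn from the same distribution, so the accuracy should land near 50%:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins for real and synthetic tables (numeric features only, for simplicity).
real = rng.normal(size=(500, 4))
synthetic = rng.normal(size=(500, 4))

# Stack the rows and label them: 1 = real, 0 = synthetic.
X = np.vstack([real, synthetic])
y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])

# Cross-validated accuracy of a discriminator trying to separate the two sources.
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"detection accuracy: {acc:.2f}")  # close to 0.5 when indistinguishable
```

An accuracy well above 0.5 would mean the synthetic data is easy to tell apart from the real data; an accuracy near 0.5 is the desired outcome.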

Dipping my toes into the world of machine learning has been extremely fascinating, and I have learned many new things about both machine learning and coding more generally. I now realise how important the complexity of my code is: if the code has bad time complexity, the program could take days to run, which is not practical. Another lesson I learned the hard way is not to be afraid to restart. I completely rewrote my code after finishing a previous, working solution to the Custom ID constraint, because that code was too complex and was taking hours to run. Doing so allowed me to learn from my mistakes and reach a much better solution.

I now hope to train a CTGAN model on the GlioCova dataset, which contains medical records of over 50,000 cancer patients, and to measure how well CTGAN performs on a large, relational database.