Inference attacks
Inference attacks make up the majority of attacks. The simplest inference attacks derive from early code breaking, where codes were created by swapping individual letters or by substituting glyphs for alphabetic characters. Once the relative frequencies of the symbols in such a substitution cipher are found and compared with the known letter frequencies of the language, messages using the code can be easily cracked.
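The frequency-counting idea can be sketched in a few lines of Python. This is only an illustration: the function names and the reference letter order are assumptions made here, and a real attack would refine the first guess using letter pairs and known words, because a simple rank match is rarely correct for every letter.

```python
from collections import Counter

# Approximate English letter order, most frequent first. This is an assumption:
# published frequency tables differ slightly, especially in the lower ranks.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def rank_letters(text):
    """Return the letters found in text, most frequent first (ties alphabetical)."""
    counts = Counter(ch for ch in text.upper() if ch.isalpha())
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def guess_mapping(ciphertext, reference=ENGLISH_ORDER):
    """Pair the n-th most frequent cipher letter with the n-th reference letter."""
    return dict(zip(rank_letters(ciphertext), reference))

# Hypothetical ciphertext in which Q is the most common symbol.
mapping = guess_mapping("QQQQ WWW Q RR W Z")
print(mapping)  # {'Q': 'E', 'W': 'T', 'R': 'A', 'Z': 'O'}
```

The longer the intercepted text, the closer its symbol counts come to the reference frequencies, and the more reliable the first guess becomes.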
In a similar fashion, a pseudonymised database may have the City field disguised. Cities have varying populations, and if the number of records under each disguised value is in proportion to known city populations, those cities can be identified. This tentative identification can be firmed up by looking at the distribution of addresses within the city, or by other auxiliary information (also called background information) on individual cities. On its own this example seems fairly harmless, but it is not when combined with other information.
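A minimal sketch of this matching, assuming the records are spread across cities roughly in proportion to population: rank the disguised codes by record count, rank the known cities by population, and pair them off. The city names, codes and figures below are invented for illustration.

```python
from collections import Counter

def match_cities(disguised_values, known_populations):
    """Pair disguised city codes with real cities by rank of size.

    disguised_values: the City field of every record, e.g. ["C1", "C2", ...]
    known_populations: mapping of real city name -> population
    """
    counts = Counter(disguised_values)
    # Most frequent code first; ties broken by code for a stable result.
    codes = [c for c, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    # Largest city first.
    cities = sorted(known_populations, key=lambda name: -known_populations[name])
    return dict(zip(codes, cities))

# Hypothetical data: 100 records and three made-up cities.
records = ["C1"] * 50 + ["C2"] * 30 + ["C3"] * 20
populations = {"Alphaville": 500_000, "Betatown": 300_000, "Gammaburg": 200_000}
print(match_cities(records, populations))
# {'C1': 'Alphaville', 'C2': 'Betatown', 'C3': 'Gammaburg'}
```

The result is only a tentative identification. Cities of similar size are easily confused, which is why the attacker then checks it against auxiliary information.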
Methods
The easiest way of revealing individual data is to combine two or more databases. Once the individual is identified, an inference attack can then use the GPS location and movement data of a user, possibly with some auxiliary information, to deduce other personal data such as their home and place of work, interests and social network, and even home in on religion, health condition or confidential business data belonging to the user’s employer.
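Combining databases can be sketched as a join on the fields the two datasets share. In the example below a pseudonymised dataset is linked to a public one on birth year and postcode. All field names and records are hypothetical, and a match is accepted only when exactly one public record fits.

```python
def link_records(pseudonymised, public, keys):
    """Attach a name to each pseudonymised record that matches exactly one
    public record on the shared fields listed in keys."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    linked = []
    for record in pseudonymised:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            linked.append({**record, "name": candidates[0]["name"]})
    return linked

# Hypothetical datasets.
pseudonymised = [
    {"id": "p1", "birth_year": 1980, "postcode": "AB1", "diagnosis": "asthma"},
    {"id": "p2", "birth_year": 1975, "postcode": "CD2", "diagnosis": "diabetes"},
]
public = [
    {"name": "Alice Example", "birth_year": 1980, "postcode": "AB1"},
    {"name": "Bob Example", "birth_year": 1975, "postcode": "CD2"},
    {"name": "Carol Example", "birth_year": 1975, "postcode": "CD2"},
]
for row in link_records(pseudonymised, public, ["birth_year", "postcode"]):
    print(row["name"], row["diagnosis"])  # Alice Example asthma
```

Record p2 stays unlinked here because two people share its birth year and postcode. Adding one more shared field, such as sex, could be enough to separate them.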
Data about people and their activities is passed around for research purposes. Advances in medicine can be made by finding patterns in existing patient and biomedical data. When such data is pseudonymised and made public or shared, it can unintentionally reveal sensitive data about an individual, such as their medical condition, or how much they were charged for treatment.
Every inference attack is slightly different because it depends on the data and relies on drilling down to a unique combination of characteristics. Available information such as friendships or group relationships in social media datasets can be used to infer sensitive properties that reveal hidden values or behaviours. Some techniques infer values or behaviours from customers’ transactions by combining auxiliary information with changes over time in the output of recommender systems. Bayesian inference can be used to gain knowledge about communication patterns and profile information of users.
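As a rough illustration of the Bayesian approach, the sketch below updates a belief about a hidden profile attribute from the groups a user is seen to belong to. It assumes the group memberships are independent given the attribute (a naive Bayes simplification). The attribute values, group names and probabilities are all invented.

```python
def posterior(prior, likelihoods, observations):
    """Return P(hidden value | observations) for each candidate value.

    prior: mapping value -> P(value)
    likelihoods: mapping value -> {observation: P(observation | value)}
    observations: the group memberships actually seen for the user
    """
    scores = {}
    for value, p in prior.items():
        for obs in observations:
            p *= likelihoods[value][obs]  # multiply in each piece of evidence
        scores[value] = p
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}

# Hypothetical numbers: the attacker starts undecided between values A and B.
prior = {"A": 0.5, "B": 0.5}
likelihoods = {
    "A": {"group_x": 0.8, "group_y": 0.5},
    "B": {"group_x": 0.2, "group_y": 0.5},
}
belief = posterior(prior, likelihoods, ["group_x", "group_y"])
print(round(belief["A"], 3))  # 0.8
```

Membership of group_y tells the attacker nothing, since it is equally likely under both values, while membership of group_x moves the estimate for A from 0.5 to about 0.8.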
Mitigation
Naive suppression, such as pseudonymising datasets, does not prevent privacy breaches. The more useful a record is for scientific or marketing research, the more vulnerable it is to inference attacks.
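One way to see why pseudonymising alone falls short is to count how many records remain unique on the fields that are left in place. The sketch below is a hypothetical check, not a standard tool, and the field names and records are invented.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Fraction of records whose combination of the given fields is unique."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    unique = sum(1 for count in combos.values() if count == 1)
    return unique / len(records)

# Hypothetical pseudonymised records: names removed, other fields untouched.
records = [
    {"id": "p1", "birth_year": 1980, "postcode": "AB1"},
    {"id": "p2", "birth_year": 1975, "postcode": "CD2"},
    {"id": "p3", "birth_year": 1975, "postcode": "CD2"},
    {"id": "p4", "birth_year": 1990, "postcode": "EF3"},
]
print(unique_fraction(records, ["birth_year", "postcode"]))  # 0.5
```

Here half of the records can still be singled out by birth year and postcode even though no names are present. Each additional field kept for research purposes tends to push that fraction higher.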