Assets¶
Welcome to the greenhouse’s hidden treasures: the juicy, leafy data crops adversaries are after. Whether tucked safely behind fences or left out for the taking, every dataset tells a story. Here is what is growing, and what makes each variety so appealing.
Data releases¶
Think of data as different levels of greenhouse access. Sometimes it is just the head gardener poking about with good intentions. Other times, the gates are flung open and tourists can trample through at will. Here are the flavours:
Internal secondary research¶
This is the “we are just looking around the shed” kind. Internal teams reuse data they already had permission for, usually with some vague notion of anonymisation. The risk of external intruders is low, in theory, since access is tightly controlled. But let us not pretend an overconfident intern could not leave the door ajar.
External secondary research¶
Here, outsiders are allowed into the garden, but only after signing a blood pact (or a data-sharing agreement). With anonymisation, contracts, and a few watchdogs, it feels secure. That is, until someone gets clever with auxiliary information and finds a way to pick the lock. Trust, but verify, and then lock the compost bin.
Public release¶
The wild meadow: data tossed out into the open, ripe for the picking. Anyone with a basket and bad intentions can gather what they like. It is impossible to know who is harvesting or what tools they have brought. Here, assumptions about adversary motivation and technical skill do not just backfire. They explode like overripe fruit.
Auxiliary information¶
You might think your data is safe, anonymised and stripped of names. But adversaries are tenacious compost sniffers, and they will dig through the mulch to find what they need.
- They collect crumbs: bits of information from petitions, forums, even a colleague’s old review of a garden centre.
- They buy old maps: open datasets, neighbourhood statistics, and that quaint survey from 2009.
- They inherit leftovers: mergers, acquisitions, and shadowy black-market data swaps.
- And sometimes they find hints inside the target dataset itself, like a cryptic note left in a flowerpot.
In short: they will mix and match with whatever is lying around to figure out what is really growing in your greenhouse.
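To make the mulch-digging concrete, here is a minimal linkage-attack sketch in Python. Every record, name, and quasi-identifier below is invented for illustration; the point is only the join itself.

```python
# Hypothetical linkage attack: join an "anonymised" dataset with auxiliary
# information on shared quasi-identifiers (here, zip code and age).
anonymised = [
    {"id": "r1", "zip": "2041", "age": 34, "diagnosis": "asthma"},
    {"id": "r2", "zip": "2041", "age": 58, "diagnosis": "diabetes"},
    {"id": "r3", "zip": "3067", "age": 34, "diagnosis": "migraine"},
]

# Crumbs gathered from public sources (a forum post, an old petition).
auxiliary = [
    {"name": "A. Gardener", "zip": "2041", "age": 58},
]

def link(aux_row, records):
    """Return all records matching the auxiliary quasi-identifiers."""
    return [r for r in records
            if r["zip"] == aux_row["zip"] and r["age"] == aux_row["age"]]

matches = link(auxiliary[0], anonymised)
# A single match re-identifies the record despite the missing name.
```

Nothing clever is needed: a plain equality join on two unremarkable columns is enough when only one record fits.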
Target dataset¶
The main crop: the big collection of records labelled “anonymised” but still suspiciously personal. The names and phone numbers are gone, and the remaining quasi-identifiers have been blurred with techniques like k-anonymity. But with enough auxiliary information and determination, the roots are still traceable. See mitigations for what those techniques can and cannot do.
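As a rough illustration of what k-anonymity actually measures, here is a toy check. The records are invented, and their quasi-identifiers are assumed to be already generalised into age bands and zip prefixes:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns.
    A release is k-anonymous if every combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Invented records with generalised quasi-identifiers.
records = [
    {"age_band": "30-39", "zip3": "204", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "204", "diagnosis": "migraine"},
    {"age_band": "50-59", "zip3": "204", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["age_band", "zip3"])
# k == 1: the lone 50-59 record sits in a class of its own, so this
# release is not even 2-anonymous despite the removed names.
```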
Biometric data¶
Faces, voices, fingerprints, gait patterns, and retina scans. Biometric data is uniquely dangerous because it cannot be changed. A breached password can be reset. A leaked face cannot be replaced.
Facial recognition databases built from scraped social media images, voice prints extracted from customer service calls, and gait signatures lifted from CCTV footage all create permanent linking keys that persist across any number of future datasets. See also: metadata and browser fingerprinting.
Location and mobility data¶
GPS trails, mobile cell tower logs, transport card swipes, and geofenced advertising data. Location data is among the most re-identifying categories that exist. A 2013 study by de Montjoye et al. demonstrated that just four approximate spatio-temporal points are enough to uniquely identify 95% of individuals in a mobility dataset.
Continuous location data reveals home address, workplace, religious attendance, medical appointments, political activity, and intimate relationships, all without a single explicit label.
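A toy version of that unicity measurement can be sketched as follows. The traces, cell IDs, and hours are all invented, so the resulting numbers say nothing about real populations; the sketch only shows the shape of the test.

```python
import random

# Toy unicity check in the spirit of de Montjoye et al. (2013): how often
# do p randomly chosen spatio-temporal points pin down exactly one trace?
# Each trace is an invented set of (cell_id, hour) observations.
traces = {
    "u1": {("c1", 8), ("c2", 9), ("c3", 18), ("c1", 22)},
    "u2": {("c1", 8), ("c4", 9), ("c3", 18), ("c5", 22)},
    "u3": {("c6", 8), ("c2", 9), ("c7", 18), ("c1", 22)},
}

def unicity(traces, p, trials=200, seed=0):
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        user = rng.choice(list(traces))
        points = set(rng.sample(sorted(traces[user]), p))
        matches = [u for u, t in traces.items() if points <= t]
        unique += (matches == [user])
    return unique / trials

# Even in this tiny toy, four points always single out one trace.
```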
Behavioural and temporal patterns¶
The rhythm of behaviour over time: when you wake up, how long you spend on particular pages, the order in which you complete tasks, your response latency to notifications. Temporal patterns are highly distinctive and extremely difficult to anonymise, because the structure of someone’s day is often more identifying than any single data point within it.
Behavioural fingerprints extracted from mouse movement, keystroke dynamics, and scroll behaviour can re-identify individuals across sessions even after all explicit identifiers have been removed.
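As a hedged sketch of the idea, the following matches a nameless session to enrolled users by nearest keystroke-timing profile. The timing vectors, user names, and feature choice (mean inter-key intervals for a few digraphs) are all invented:

```python
import math

# Behavioural-fingerprint sketch: nearest-neighbour matching of keystroke
# timing profiles across sessions. All numbers are invented, in milliseconds.
enrolled = {
    "user_a": [110.0, 95.0, 180.0, 140.0],
    "user_b": [220.0, 170.0, 260.0, 210.0],
}

def distance(a, b):
    """Euclidean distance between two timing feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reidentify(session, profiles):
    """Return the enrolled identity whose profile is closest."""
    return min(profiles, key=lambda u: distance(session, profiles[u]))

# A fresh session with no identifiers attached, but a familiar rhythm.
anonymous_session = [115.0, 92.0, 175.0, 145.0]
match = reidentify(anonymous_session, enrolled)
```

Real systems use richer features and proper classifiers, but the core mechanism is exactly this: rhythm as identifier.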
Graph and network data¶
Your social graph: who you are connected to, how often you communicate, the structure of your relationships. Graph data is particularly resistant to anonymisation because the re-identifying information is structural rather than attribute-based. Removing names leaves the shape of the network intact, and that shape is often sufficient.
This includes contact lists, follower networks, communication metadata, and co-authorship or co-participation graphs.
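A minimal sketch of that structural matching, assuming the adversary holds a named auxiliary copy of the same toy network. Every node and edge below is invented:

```python
# Structural re-identification: names are stripped from a graph, but each
# node's (degree, sorted neighbour degrees) signature survives and can be
# matched against an auxiliary named copy of the same network.
named_edges = [("alice", "bob"), ("alice", "carol"),
               ("bob", "carol"), ("carol", "dave")]
anon_edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n3"), ("n3", "n4")]

def signatures(edges):
    """Map each node to (degree, tuple of sorted neighbour degrees)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return {node: (len(nbrs), tuple(sorted(len(adj[n]) for n in nbrs)))
            for node, nbrs in adj.items()}

named_sig = signatures(named_edges)
anon_sig = signatures(anon_edges)

# Re-identify every anonymous node whose signature is unique in the graph.
mapping = {a: n
           for a, sa in anon_sig.items()
           if list(anon_sig.values()).count(sa) == 1
           for n, sn in named_sig.items() if sn == sa}
# n3 and n4 are matched outright; n1 and n2 survive only because they
# happen to share a signature.
```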
Device fingerprints¶
Browser configuration, screen resolution, installed fonts, GPU rendering characteristics, time zone, language settings, and hardware identifiers combine to form a fingerprint that is often unique to a single device and therefore a single person. These fingerprints persist across incognito sessions, VPNs, and cleared cookies.
See also: browser fingerprinting.
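The combination step can be sketched like this, with invented attribute values and invented population frequencies for the entropy estimate; real fingerprinting surveys measure those frequencies empirically.

```python
import hashlib
import math

# Device-fingerprint sketch: hash a handful of browser attributes into a
# stable identifier, then estimate how identifying the combination is via
# surprisal (-log2 of each value's population frequency).
attributes = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "screen": "2560x1440",
    "timezone": "Pacific/Auckland",
    "fonts": "Fira Sans;Noto Serif;DejaVu Mono",
}

def fingerprint(attrs):
    """Order-independent hash of all attribute key/value pairs."""
    blob = "|".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

# Invented frequencies: the fraction of devices sharing each value.
freq = {"user_agent": 0.02, "screen": 0.10, "timezone": 0.01, "fonts": 0.001}
bits = sum(-math.log2(p) for p in freq.values())
# Roughly 25 bits on these made-up numbers: enough, if independent, to
# single out one device among tens of millions.
```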
Synthetic data¶
Artificially generated records designed to mimic a real dataset without containing any real individuals. Increasingly used in healthcare and finance as a privacy-preserving alternative to sharing raw data.
Synthetic data deserves its own entry here because it is an asset that carries concealed risk. It is not automatically safe: generative models trained on small or unusual populations may reproduce identifying patterns; membership inference attacks can sometimes establish whether a particular individual’s data contributed to the training process. “It is synthetic” is not a guarantee of safety. It is a starting point for a more careful conversation.
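A deliberately leaky toy version of that membership-inference failure mode, using invented one-dimensional “records” so the mechanics are visible:

```python
# Toy membership inference against synthetic data: if synthetic records sit
# unusually close to a real record, that record probably contributed to
# training. All values are invented one-dimensional "records".
training = [1.0, 2.0, 9.0]
synthetic = [1.1, 2.05, 8.9]   # a leaky generator hugging its training set
outsiders = [5.0, 6.5]

def min_distance(x, dataset):
    return min(abs(x - s) for s in dataset)

def infer_member(x, synthetic, threshold=0.5):
    """Guess membership: close to some synthetic record => likely a member."""
    return min_distance(x, synthetic) < threshold

members_flagged = [infer_member(x, synthetic) for x in training]
outsiders_flagged = [infer_member(x, synthetic) for x in outsiders]
# On this toy data the attack separates members from non-members perfectly,
# which is exactly the risk "it is synthetic" does not rule out.
```

A well-behaved generator would not cluster this tightly around its training points, which is why the conversation has to be about the model and the population, not the label “synthetic”.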