Hired or paid by data analysis agencies, politicians and clients in other industries, data scientists and analysts are considered to be among the most likely potential adversaries: they have the motivation to attempt a re-identification (identity disclosure, link disclosure or content disclosure), and they have the necessary knowledge and tools.
Data scientists are data wranglers. They take data points, structured and unstructured, and use mathematics, statistics and programming to clean, manage and organise data. They then apply industry knowledge, contextual understanding and a critical attitude towards existing assumptions to uncover hidden solutions. A few data scientists work for universities and other institutes, answering some very interesting questions and/or working on societal or medical problems (though the road to hell may be paved with good intentions), but most work for businesses trying to solve business challenges. One might say they find opportunities for businesses to make more money. A whole new industry.
Data science and data mining are distinct terms, but in practice they often go hand in hand:
Data mining is about finding trends in data sets, and using these trends to identify future patterns. It often involves analysing vast amounts of structured data, and it is a technique mostly used by businesses to surface new trends. If you know how to navigate data and have a bit of statistical knowledge, you can do it yourself.
Data science spans big (unstructured) data analytics, data mining, predictive modelling, data visualisation, mathematics and statistics. It is a field of scientific study that aims to build data-centric products for organisations, and it finds application in social analysis, in building predictive models and in unearthing new facts in various domains. Doing it requires extensive knowledge of machine learning, programming and the domain it is applied to, and it often includes data mining.
Programmatic markets are considered to be among the most likely potential adversaries to broker re-identified data for their customers, who place ads on people’s screens.
Main motivation is financial gain. Predators minimise the energy spent in their hunt for prey, and the human hunt for money likewise tends to take the easiest, opportunistic path. Click fraud created a full-blown ecosystem: adTech that aggressively monitored users and shoved ads into their faces, and fraud bots that figured out how to get past the countermeasures. Note that it was considered fraud when a bot tricked an advertiser into thinking an ad had been seen, but totally legitimate for an advertiser to trick users into seeing an ad.
Advertisers/marketers want to advertise a product or service to sell it.
Publishers own one or several websites/social media and display advertisements for payment.
Data brokers collect information from users and sell user impressions (profiles) to advertisers.
The rise of data-driven ad techniques has accelerated the evolution of the advertising ecosystem. Brands in-sourced ad operations, and agencies became data brokers with a different, more hands-on and data-rich partnership approach than was seen in the past.
Programmatic refers to the automation of buying and trafficking processes for audience-targeted ads. It excludes direct publisher and ad network contracts specified by standard insertion orders, but it includes various hybrid models in which direct buys are supported by automated processes. These go under the general heading of “programmatic direct”: the application of software to automate and optimise the placement of ads sold directly to advertisers by publishers at a human-negotiated price.
The programmatic industry is growing incredibly fast, and it forms a complex ecosystem of complementary and competitive product-category niches:
A Demand-Side Platform (DSP) is a technology that advertisers use to buy ad impressions in an automated way. DSPs allow advertisers to bid on inventory from a variety of media owners/publishers.
Supply-Side Platforms (SSPs) use similar technology to a DSP, except that it is used by media owners to manage their ad inventory and sell impressions programmatically. SSPs allow media owners to connect their inventory to multiple ad exchanges.
Service providers: creative and media agencies and trading desks that specialise in creating advertising and in buying and managing media placement.
Media and data providers: data providers and the software providers for the ecosystem, which manage ad sales on behalf of publishers (SSPs), programmatic ad buying (DSPs), and, via data management platforms (DMPs), data from multiple sources that is made available to marketers to build segments and targets.
Marketplaces: Ad exchanges for buying and selling digital media, ad networks that aggregate inventory, and data brokers/exchanges that buy and sell consumer data that can be associated with ad media through an identifier such as a cookie ID.
Once a media owner/publisher sells an ad impression, an ad server traffics the ad itself, putting it in front of the target consumer whose impression the advertiser purchased.
Consumers are the people being grazed upon via impressions. Every visit to an e-commerce site or a retail store counts as trackable commercial behaviour. Even when there are no ads, the tracking is there.
Data brokers collect, analyse, combine and package some of our most sensitive personal information and sell it as a commodity to each other, to advertisers and even to public authorities, often without our direct knowledge, let alone our consent. Brokers are considered to be among the most likely potential adversaries that have the motivation to attempt a re-identification (identity disclosure, link disclosure and content disclosure) and have the necessary tools.
Main motivation is financial gain. Where there is money being made, there is a market, and there are middlemen brokering data, many of whom do not even consider themselves data brokers. While this industry has been around for decades, most people have never heard of data brokers. Thanks to advances in data science and its role in enabling the current internet marketing and advertising ecosystem, it has grown into a multibillion-dollar global industry that operates in the shadows with virtually no oversight.
Some data broker products are beneficial or harmless, others are a threat to privacy.
Credit bureaus have played, and still play, a critical data brokerage role in mediating access to financial data. They began building databases in the mid-20th century to catalogue us and our habits for marketing, fraud detection and credit scoring purposes, and they have since adapted to ingest and process the streams of information we make available about ourselves today.
Police in both the United States and Europe purchase information and assistance to profile people based on personal data.
Political parties are targeting their digital outreach based on details of individual behaviour.
Employers routinely turn to data brokers to purchase reports regarding job candidates.
In the US, one data broker disclosed in a government filing that “they buy our health information, electronic health records, prescriptions, claims data, and they also put in information about our health from social media.” These “longitudinal” health profiles are then sold to thousands of clients, including the federal government.
In the European Union, the GDPR was established, but public authorities and civil society struggle to apply its rules in concrete ways, and regulatory guidance is still incomplete.
Information about millions of people is sold to corporate and governmental actors in both the US and Europe. Data brokers, and the profiling techniques used, are giving large institutions more visibility than ever before into people’s private lives.
Black markets are considered to be among the most likely adversaries to broker re-identified data and the knowledge and tools on how to re-identify data, in order to make money.
Motivation is financial gain. Personally identifiable information obtained illegally through de-anonymisation techniques is sold in underground marketplaces, which are themselves a form of anonymisation platform: buyers and sellers stay anonymous.
Information that falls into the wrong hands can be used for further hacking, coercion, extortion, and intimidation leading to privacy concerns and enormous costs for individuals and organisations who fall victim.
Marketers and advertisers
Marketers are considered to be among the most likely potential adversaries with the motivation to attempt a re-identification (identity disclosure, link disclosure and content disclosure): they buy data directly from brokers and are considered to have the necessary tools, or they act as indirect adversaries by hiring data analytics agencies that buy data from data brokers. Advertisers are no direct threat, but they are important enabling players.
Traditional marketing research often involves assessing the overall market for a good or service, surveying consumers about their likes and dislikes, and conducting focus groups to gauge consumer responses to a new product.
The growth of information technology has transformed market research, with a growing number of analysts learning about consumer preferences and buying habits by mining massive sets of quantitative data and employing complex algorithms to uncover patterns and correlations that enable more effective marketing and advertising.
Most used are correlations between different factors and variables in large data sets, often measured in terabytes. Data mining gives businesses enormous amounts of information about their customers’ behaviours and buying habits, enabling them to market and advertise their goods and services more effectively.
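A minimal sketch of this kind of correlation mining, using invented numbers for two variables a marketer might care about (real pipelines run over terabytes, not ten rows):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: monthly site visits and monthly spend for ten customers.
visits = [2, 5, 1, 8, 4, 9, 3, 7, 6, 10]
spend = [20, 48, 15, 77, 42, 90, 28, 70, 61, 95]

# A coefficient near 1 suggests the variables rise together; a marketer
# would read that as "frequent visitors spend more" and target accordingly.
print(round(pearson(visits, spend), 3))
```

At scale the same idea is run across thousands of variable pairs, which is why spurious correlations are an occupational hazard of this kind of mining.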
Amazon’s feature-matching algorithm, which tells a potential customer that people who like one particular product also like certain other items, is one example. Amazon’s “Frequently Bought Together” and iTunes’ “Genius Recommendations” features make recommendations similar to what we already like.
Another is credit card issuers generating, for their customer service representatives, lists of products and services that consumers are likely to buy, based on the characteristics of customer credit card accounts.
Motivation is financial gain. Three commonly mentioned benefits of using a recommendation engine are:
Increase in revenue.
Increase in customer satisfaction leading to customer retention.
Elimination of the need for market research.
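A “Frequently Bought Together”-style feature can be approximated with simple pair counting over purchase baskets. This is a sketch with invented baskets, not Amazon’s actual algorithm:

```python
from collections import Counter
from itertools import combinations

# Invented purchase baskets; in practice these come from transaction logs.
baskets = [
    {"laptop", "mouse", "usb_hub"},
    {"laptop", "mouse"},
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "usb_hub"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def frequently_bought_together(item, top_n=3):
    """Return the items most often co-purchased with `item`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]

print(frequently_bought_together("laptop"))
```

Even this toy version shows why such engines need no market research: the recommendations fall straight out of the purchase data itself.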
Insurance companies are considered to be among the most likely potential adversaries to have the motivation to attempt a re-identification, and are considered to have the necessary tools.
Main motivation is financial gain. Insurance companies collect data from many sources and merge it in many ways to calculate risk (expected lifespan, weather patterns for farming insurance, car accidents, etc.), then calculate premiums based on that risk. The more and better data they have, the better they can price the risk and the more they can manage their profits.
One of the most important uses of data analysis for insurance companies is determining policy premiums. For example, automobile, home and health insurance companies use data from telematics (in-vehicle telecommunication devices), IoT devices and wearables (Fitbit, Apple Watch, etc.) to track their customers in order to predict and calculate risks.
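How telematics data might feed into a premium can be sketched as below. The base premium, the factor names and the weights are all invented for illustration; real actuarial models are far more elaborate:

```python
# Hypothetical telematics-based premium adjustment.
BASE_PREMIUM = 600.0  # assumed annual premium in euros

# Invented risk weights applied to telematics-derived driving features.
RISK_WEIGHTS = {
    "hard_braking_per_100km": 0.04,  # each braking event adds 4%
    "night_driving_share": 0.50,     # all-night driving would add 50%
    "speeding_share": 0.80,          # share of distance over the limit
}

def premium(profile):
    """Scale the base premium by a risk multiplier built from the profile."""
    multiplier = 1.0
    for factor, weight in RISK_WEIGHTS.items():
        multiplier += weight * profile.get(factor, 0.0)
    return round(BASE_PREMIUM * multiplier, 2)

cautious = {"hard_braking_per_100km": 0.5, "night_driving_share": 0.05,
            "speeding_share": 0.02}
risky = {"hard_braking_per_100km": 4.0, "night_driving_share": 0.40,
         "speeding_share": 0.25}

print(premium(cautious), premium(risky))
```

The design point is that each extra data stream adds another term to the multiplier, which is exactly why more tracking means more finely priced risk.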
Insurers also use data to improve the detection of fraud and criminal activity through data management and predictive modelling: the variables in every claim are matched against the profiles of past fraudulent claims, and when there is a match the claim is flagged for further investigation.
Information gained from call-centre data, customer e-mails, social media, user forums and user behaviour while logged into the insurers’ sites enables insurers to build unique customer profiles of behaviours, habits and needs, anticipating future behaviour for up-selling and cross-selling products.
Since the insurance industry is founded on estimating future events and measuring the risk/value of those events, massive datasets, with their volume, velocity, veracity and variety, have become an essential tool for insurers. With new data sources such as telematics, sensors, government records, customer interactions and social media, the opportunity to use big data to determine risk and claims and to enhance customer experience (higher predictive accuracy) is very appealing to this industry.
Employers are considered to be among the most likely potential adversaries to hire data analytics agencies that buy data from brokers; employers have always performed background checks.
Web scraping and data mining software is available to any individual or organisation and can be used by staff as part of a suitability check when hiring a candidate, or for a promotion.
Scraping can cover anything and everything associated with someone’s digital presence online: social media such as Facebook, Twitter and location-based services, and data from association sites, school repositories and community events. It enables an employer to create a time map of an individual’s life over a specified search period, without the individual’s explicit permission.
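Once the data is scraped, building the time map itself is trivial: merge timestamped events from all sources and sort them chronologically. The events below are invented; the scraping step that would produce them is out of scope:

```python
from datetime import datetime

# Invented scraped events: (ISO timestamp, source, observed activity).
events = [
    ("2021-06-03T18:20", "twitter", "posted from a protest"),
    ("2021-06-01T09:00", "linkedin", "started new job"),
    ("2021-06-02T21:45", "location", "checked in at a bar"),
]

# The "time map" is simply the merged events in chronological order.
time_map = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))
for ts, source, activity in time_map:
    print(ts, source, activity)
```

The hard part is collection, not analysis, which is why this capability is within reach of any employer with off-the-shelf tools.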
It can be used to ascertain information that cannot be legally asked in interviews. Political views, age, sexual orientation, marital status, ethnicity, offspring and life stage are common examples.
Things that may have been socially acceptable in the past but are no longer welcome can come back to haunt an individual. Old posts, tweets and comments can damage careers, and because the subject is unable to explain things in or out of context, anything can be used against someone.
National law enforcement and intelligence agencies are considered to be among the most likely adversaries to have the motivation to attempt a re-identification, and are considered to have the necessary tools.
The “many eyes” (Five Eyes and more) and law enforcers use information from the advertising ecosystem as a cheap and easy way of tracking individual users across multiple devices, locations and accounts. The adoption of these new tools is no surprise given their low cost compared to other forms of surveillance, and it also seems to be a response to public and political pressure to investigate crimes online, especially when, after a high-profile crime, journalists unearth social media profiles full of warning signs.
Law enforcement uses social media in many ways, including:
Browsing social media
Creating accounts for sharing information with law enforcement
Obtaining information from social media companies
Creating fake profiles and personas
Data mining techniques commonly used by law enforcement are:
Entity extraction to automatically identify people, organisations, vehicles and personal details in unstructured data such as police reports. Even if entity extraction provides only basic information, it can be used as auxiliary information to accelerate an investigation by rapidly providing precise details from large amounts of unstructured data, including social media such as Instagram, Facebook and Twitter. That makes it a structural de-anonymisation attack leading to identity and link disclosures; it also means the activity of particular people can be monitored, a content disclosure threat (see the last point in this list).
Clustering techniques that group similar characteristics together into classes in order to gain intelligence by maximising or minimising similarities; for example, to identify suspects or criminal groups conducting crimes in similar ways. Clustering can also be applied to discover criminal relations by cross-referencing entities in criminal records.
Association rules, used to discover recurring items in databases in order to create pattern rules and detect potential future events. For example, in network security, sequential pattern mining as an association rule is useful for identifying sequences of recurring items in order to define patterns and prevent attacks.
Classification, for analysing unstructured data to discover common properties among criminal entities. It has been used together with inferential statistics to predict crime trends, and it can dramatically narrow down different criminal entities and organise them into predefined classes.
String comparison is used to reveal deceptive information in criminal records by comparing structured text fields. This requires highly intensive computational capabilities.
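As a toy illustration of the entity-extraction technique from the list above, assuming an invented report and deliberately naive patterns (production extractors use trained models, not two regexes):

```python
import re

# Invented unstructured "police report" text.
report = ("Suspect J. DOE was seen leaving the scene at 23:14 in a grey "
          "van, plate AB-123-CD, and later contacted M. SMITH by phone.")

# Naive patterns: a plate-like token and an initial-plus-surname token.
plates = re.findall(r"\b[A-Z]{2}-\d{3}-[A-Z]{2}\b", report)
names = re.findall(r"\b[A-Z]\. [A-Z]{2,}\b", report)

print(plates)
print(names)
```

Even this crude pass turns free text into structured, linkable entities, which is precisely what makes it useful as auxiliary information in a de-anonymisation attack.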
Text mining techniques were considered to be the next step in the evolution of data mining and criminal intelligence technologies in 2016.
Law enforcement agencies (and intelligence agencies) claim that this is an inexpensive strategy with little impact on people’s privacy because it relies only on so-called publicly available information. A tweet is considered not private because, by its nature, you cannot control its audience. Does that automatically make it public, or within the space of the police? Both Evanna Hu and Millie Graham Wood argue that social media do not fit easily into either the category of public or private; they are instead a pseudo-private space, where there is an expectation of privacy from the state.
This category includes an ecosystem of lawyers, journalists, (privacy) activists and gray hat hackers. Gray hats are likely to focus on the targets that are easiest to re-identify (those presenting unique or unlikely characteristics, test results, etc.) and have the necessary tools.
To maintain the data industry, privacy needs to be protected, so anonymisation techniques appear, which are then (at least theoretically) attacked again. This is yet another business around data, with new adversaries, some of whom have genuine ethical concerns about the data industry.
Most, if not all, attacks by these adversaries are done for demonstration purposes and have no direct relation to the target(s). In these attacks the adversary attempts to re-identify any person in the data, to prove insufficient anonymisation (something that may work towards solving the problem) and possibly to shame or embarrass the data custodian (something that has never solved a problem).
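The classic demonstration attack links an “anonymised” release to public auxiliary data on quasi-identifiers such as ZIP code, birth date and sex. All records below are invented:

```python
# Released "anonymised" medical records: names removed, but the
# quasi-identifier triple (zip, dob, sex) survives.
released = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "dob": "1962-01-12", "sex": "M", "diagnosis": "asthma"},
]

# Auxiliary public data (e.g. a voter roll) carrying the same triple.
voter_roll = [
    {"name": "A. Example", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "B. Sample", "zip": "02140", "dob": "1980-03-02", "sex": "M"},
]

def link(records, aux):
    """Join the two datasets on the quasi-identifier triple."""
    matches = []
    for r in records:
        for v in aux:
            if (r["zip"], r["dob"], r["sex"]) == (v["zip"], v["dob"], v["sex"]):
                matches.append((v["name"], r["diagnosis"]))
    return matches

print(link(released, voter_roll))
```

When the triple is unique in both datasets, the join restores a name to a sensitive record: an identity disclosure achieved with a few lines of code.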
Not direct adversaries, but opportunists: politicians are considered to be among the most likely to hire data analytics professionals and agencies that use programmatic markets and/or buy data from brokers.
Political preferences and tendencies of the population can be assessed using classification and opinion mining (sentiment analysis) techniques, drawing both on patterns affected by socio-economic variables and on voter sentiment analysis. Some known strategies and patterns that have been studied are:
Organisational affiliation strategies
Electoral affiliation strategies
Patterns of party systems transformation
Organisational affiliation patterns of politicians
Electoral affiliation patterns of voters
Those are the usual suspects, but other things in relation to elections have been studied as well.
The ability to find and target specific groups online has already been developed by data scientists for marketers and advertisers. The ad technology that knows who you are and keeps playing that ad can be applied to politics.
It is very cheap to target specific individuals with advertising, which in a political context could be fake news, misinformation or propaganda. Sending persistent, targeted messaging to key voters via social media platforms is extremely cheap, while the cost of detecting those messages automatically is significantly higher. And there you have it. The cost per click is around 25 cents, and the cost per view less than 10 euro per thousand. Influencing state elections costs on the order of a few hundred euro, and European elections on the order of a few thousand euro. And while the more the targeted voters differ, the greater the cost to reach them, it would still be in those orders of magnitude.
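A back-of-envelope check of these figures (the per-click and per-thousand-view prices come from the text; the audience sizes are assumptions):

```python
# Prices from the text: ~0.25 EUR per click, under 10 EUR per 1000 views.
COST_PER_CLICK = 0.25
COST_PER_THOUSAND_VIEWS = 10.0  # upper bound

def campaign_cost(views, clicks):
    """Total spend for a given number of ad views and clicks."""
    return views / 1000 * COST_PER_THOUSAND_VIEWS + clicks * COST_PER_CLICK

# Assumed scale: saturating 50,000 key voters with views, plus 1,000 clicks.
state_level = campaign_cost(views=50_000, clicks=1_000)
print(state_level)
```

At these prices, even a generously sized state-level campaign lands in the hundreds of euros, consistent with the orders of magnitude cited above.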
The Cambridge Analytica story was about how a company was able to use and abuse our personal information to target us in ways we can’t even see, let alone understand. Global Science Research created quizzes and surveys inside the social network designed to engage users; the surveys used inflammatory language designed to stir up the kind of emotion that would prompt an interaction. Cambridge Analytica then used artificial intelligence systems to build “psychographic profiles” (behavioural profiles) which combined Facebook data with information gathered from other “top commercial data providers”, including specific information about voter demographics, geographics, purchase history and personal interests.
The scandal that followed revealed that Facebook is not just bigger than any nation state on Earth, it also plays a pivotal role in their elections. And what many overlook is that Cambridge Analytica was part of a much bigger company, SCL, which had worked as a defence contractor for governments and militaries around the world, then branched into elections in developing countries, and, only in its final iteration, entered western politics.
Until now this has mostly been a social media phenomenon. Given the unchecked opportunistic nature of humankind, I expect it to grow to include every other platform on the internet: assistance apps, top-100 lists of websites, …
Election security is getting a lot of attention. Most of the discussions seem to have focused on voting machines, technological attacks, or the election process as a whole, not on influence campaigns.
Not a direct adversary but an important player with a possible conflict of interest.
Regulators that are to oversee the data industry work for governments that depend on mass-scale adTech-based surveillance to make law enforcement and government spying cost-effective, and on taxing this new, and very rich, industry. Financial gain again.
In the European Union and the EFTA member countries, most Data Protection Authorities (DPAs) were created following the implementation of EU Directive 95/46/EC. They primarily deal with specific cases on the basis of inquiries from public authorities or private individuals, or with cases taken up on an agency’s own initiative.
The old EU Data Protection Directive of 1995 was outdated. It failed to cover, for example, social networking sites, cloud computing, location-based services, smart cards and biometric data.
In 2012 the European Commission proposed a comprehensive reform of the EU’s data protection rules to strengthen privacy rights and boost Europe’s digital economy. Unlike directives, the new GDPR that came into force in 2018 does not require national governments to pass any enabling legislation. It is directly binding and applicable.
From the perspective of users, the GDPR can also be seen as an arrogation: a legal tool for the normalisation of the appropriation of their data.