Research vs Data Privacy: What Does Ethical Data Research Look Like in the GDPR Era?

Guest blog post by Leitha Matz, Speaker, Strategist, Advisor, FinTech Leader

New Battlegrounds In Our Databases

Imagine you’re a data analyst (or maybe you’re just data-curious) and you want to do some work with interesting data that focus on human behavior. And why wouldn’t you? Humans are so weird and fascinating!

As researchers, students and curious people, the world’s wealth of available data looks like an all-you-can-eat buffet. Entire classes of personal data that were once logged (with inevitable human error) through self-reporting or overworked grad students are now available electronically. Like disinterested stalkers, devices ranging from our phones to our watches and cars log the minutiae of our movements, our heartbeats, our click trails, our sleep habits, our purchases.

Of course, we’d like the data we’re working with to be gathered transparently and with the consent of users. Unfortunately, the current data landscape isn’t that black and white. Data ethics is increasingly occupying public consciousness, and with this shift, the way we conduct data research needs to change too.

Data Managers Are Now Data Ethicists

I worked in the eCommerce grocery business in the early 2000s, a time when data collection and analysis in the commercial sector was a much quieter affair. We were initially excited to find patterns (“Look! They increase their spend on organic products at the same time they start buying baby products!” “They buy iced coffee when the daytime high hits 73F!”), and we were even more excited to create algorithms that beat the recommendation predictions of our merchandising staff.

Back then, there was a certain innocence across the industry. Those were the days when data mining produced recommendations, like, “Hey, we noticed you bought peanut butter. You might also like some bananas, sandwich bread or paper towels.”

But for those of us reading science fiction, there were already accurate predictions of today’s data abuses. From startups to leviathans like Apple, Google, Netflix and Facebook, private firms have been pushing the boundaries of acceptable data usage, and they have no incentives to initiate the sort of ethical reviews that academia routinely uses to prevent malfeasance.

Our news outlets now issue daily reports of data-driven emotional manipulation (1), election hacking (2), theft (3), harassment (4) and stalking (5). This inevitably triggered legislation like the European Data Protection Directive (DPD) and its sequel, the GDPR (General Data Protection Regulation).

These early, untested pieces of legislation will need to see some time in court before we understand all their implications, and that means we’re in for a bumpy ride as we navigate the uncharted waters of the world’s newest data guardians.

So data management in 2019 is no longer just the work of data cleansing, analysis and visualization. Even if philosophy was never part of your university curriculum, legal and cultural standards call for us to concern ourselves with the ethical questions of how our data was gathered, and whether the process respected the consent of our subjects.

When you take a closer look at our all-you-can-eat data buffet, it starts to reveal itself as a multi-course feast of ethical and legal problems. So how do we move forward?

DPD? GDPR? Where Do We Go From Here?

Legal codes need clarification and consolidation. In an insightful talk (6) between University of Zurich Vice President Christian Schwarzenegger and epidemiologist Milo Puhan, we learn that in Switzerland, data protection is governed at the canton level, “which presents enormous challenges when it comes to national projects across cantonal borders or to procurement agreements, because the provisions aren’t the same.” Wading through that kind of complexity is a waste of time and human effort.

When it comes to individual privacy, key ideals for research data are usage transparency and informed consent, but actually achieving these goals is trickier than it sounds.

The GDPR calls for “data minimization” (that only the personal data required for a particular purpose is processed) and “storage limitation” (that data may only be stored for as long as necessary for that stated purpose). Even the “principle of purpose limitation” (that personal data may be gathered and used only for specified, clear and legitimate purposes) restricts the kinds of discovery and serendipity that have historically produced critical scientific, industrial and medical insights.
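In practice, data minimization often comes down to filtering each record against a per-purpose allowlist before it is stored or shared. Here is a minimal sketch of that idea; the purposes and field names are hypothetical, not from any real system:

```python
# Hypothetical GDPR-style data minimization: keep only the fields
# needed for the stated purpose before a record is stored or shared.
PURPOSE_FIELDS = {
    "delivery": {"name", "street", "city", "postcode"},
    "analytics": {"postcode", "order_total"},  # no direct identifiers
}

def minimize(record: dict, purpose: str) -> dict:
    """Drop every field not required for the given purpose."""
    allowed = PURPOSE_FIELDS[purpose]
    return {k: v for k, v in record.items() if k in allowed}

order = {
    "name": "A. Customer",
    "street": "1 Main St",
    "city": "Berlin",
    "postcode": "10115",
    "order_total": 42.50,
    "birthday": "1980-01-01",
}

print(minimize(order, "analytics"))
# The analytics pipeline never sees name, address or birthday.
```

The same allowlist can double as documentation of the “specified, clear and legitimate purposes” the regulation asks for, since it states exactly which fields each purpose consumes.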

For example, I’m happy to consent to 23andMe using my genomic data for drug interaction research (7). I’m not so happy if that consent leads to drug companies using my genomic data for targeted pricing. Similarly, companies like Strava are now selling commuter data to cities, which will presumably use it to improve infrastructure. (8) That seems positive, unless individuals can be identified through that data, as inadvertently happened when the NYC taxi authority released weakly hashed taxi ride information (9), or Netflix released aggregated film viewing data (10) that could be de-aggregated and used to identify individuals.
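The taxi case is worth pausing on, because it shows how easily “anonymization” fails when the identifier space is small: hashing every possible ID lets an attacker build a reverse lookup table. This sketch uses a made-up plate format, not the real NYC medallion scheme, purely to illustrate the brute-force idea:

```python
import hashlib

def anonymize(plate: str) -> str:
    """'Anonymize' an identifier by hashing it (the flawed approach)."""
    return hashlib.md5(plate.encode()).hexdigest()

# Suppose plates are "T" plus four digits: only 10,000 candidates exist.
released_hash = anonymize("T1234")  # what a published dataset would contain

# An attacker hashes every candidate once and inverts the mapping.
rainbow = {anonymize(f"T{n:04d}"): f"T{n:04d}" for n in range(10_000)}

recovered = rainbow[released_hash]
print(recovered)  # T1234 — the "anonymous" record is re-identified
```

With a search space this small, the whole table is built in milliseconds; real protection requires techniques like keyed hashing with a secret, or removing the identifier entirely.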

Private Data for Public Data Research: A Promising Model?

An industry example that I believe shows promise for both consumer consent and research benefit is Apple ResearchKit, which focuses on gathering subjects and consent for access to their health data.

In the best-case scenario, this product helps researchers use Apple’s consistent user interfaces and its broad audience to gain informed consent from qualified subjects. According to a CBInsights report (11), the Apple Heart Study recruited more than 400,000 people in a year, and a mobile Parkinson’s study known as mPower gained consent from more than 10,000 enrollees. These are great numbers in a field where gathering qualified, consenting subjects is a challenge.

Clearly, there are downsides to this kind of consent process as well. For one, Apple users tend to be wealthier, so they may be a poor representation of the general population. Access to the data for study reproducibility is an open question, and the cost of access may also be prohibitive.

But I think ResearchKit at least provides a model for data researchers to access commercially harvested data that might otherwise be locked in inaccessible silos or gathered under questionable conditions.

TL;DR: What Is the Future of Ethical Data Research?

Naturally, research works best in an environment of freedom. Researchers need to explore and experiment during the discovery phase, and the broader community needs freedom to access experimental data for examination and reproducibility in the publishing and distribution phase.

On the other hand, the requirements of individual privacy include informed consent, transparency about intended data usage, minimization of data collection and storage, revocability of the data, and the right to be forgotten.

We can meet around a common table and work together on these concerns, but private companies who now collect so much of the world’s data need to be both represented and held responsible. Projects which use private data “donations” for public research are certainly promising. But ultimately, we need more planning, more conversation, and more cooperation between legislators and corporations, data managers, and data subjects.
