In the mix…Facebook “breach” of public data, data-mining for everyone, thinking through the Panton Principles, and BEST PRACTICES Act in Congress

July 30th, 2010 by Grace Meng

1) Facebook’s in privacy trouble again. Ron Bowes created a downloadable file containing information on 100 million searchable Facebook profiles, including the URL, name, and unique ID.  What’s interesting is that it’s not exactly a breach.  As Facebook pointed out, the information was already public.  What Facebook will likely never admit, though, is that there is a qualitative difference between information that is publicly available, and information that is organized into an easily searchable database.  This is what we as a society are struggling to define — if “public” means more public than ever before, how do we balance our societal interests in both privacy and disclosure?

2) Can data mining go mainstream? The article doesn’t actually say much, but it does at least raise an important question.  The value of data and data-mining is immense, as corporations and large government agencies know well.  Will those tools every be available to individuals?  Smaller businesses and organizations?  And what would that mean for them?  It’s a big motivator for us at the Common Data Project — if data doesn’t belong to anyone, and it’s been collected from us, shouldn’t we all be benefiting from data?

3) In the same vein is a new blog by Peter Murray-Rust discussing open knowledge/open data issues, focusing on the Panton Principles for open science data.

4) A new data privacy bill has been introduced in Congress called “Building Effective Strategies to Promote Responsibility Accountability Choice Transparency Innovation Consumer Expectations and Safeguards” Act, aka “BEST PRACTICES Act.”  The Information Law Group has posted Part One of FAQs on this proposed bill.

Although the bill is still being debated and rewritten, some of its provisions indicate that the author of the bill knows a bit more about data and privacy issues than many other Congressional representatives.

  • The information regulated by the Act goes beyond the traditional, American definition of personally identifiable information.  “The definition of “covered information” in the Act does not require such a combination – each data element stands on its own and may not need to be tied to or identify a specific person. If I, as an individual, had an email address that was wildwolf432@hotmail.com, that would would appear to satisfy the definition of covered information even if my name was not associated with it.”
  • Notice is required when information will be merged or combined with other data.
  • There’s some limited push to making more information accessible to users: “covered entities, upon request, must provide individuals with access to their personal files.” However, they only have to if “the entity stores such file in a manner that makes it accessible in the normal course of business,” which I’m guessing would apply to much of the data collected by internet companies.

Google buys Metaweb: Can corporations acquire the halo effect of the underdog?

July 26th, 2010 by Grace Meng

Google recently bought Metaweb, a major semantic web company.  The value of Metaweb to Google is obvious — as ReadWriteWeb notes, “For the most part,…Google merely serves up links to Web pages; knowing more about what is behind those links could allow the search giant to provide better, more contextual results.” But what does the purchase mean for Metaweb?

Big companies buy small companies all the time.  Some entrepreneurs create their start-ups with that goal in mind — get something going and then make a killing when Google buys it.  But what do you think of a company when it seems to be doing something different and then is bought by Google?

Metaweb was never a nonprofit, but like Wikipedia, it has had a similar, community-driven vibe.  Freebase, its database of entities, is crowd-sourced, open, and free.  Google promises that Freebase will remain free, but will the community of people who contribute to Freebase feel the same contributing free labor to a mega-corporation?  Is there anything keeping Google from changing its mind in the future about keeping Freebase free?  How will the culture of Metaweb change as its technologies evolve within Google?

This isn’t to say that Metaweb’s goals have necessarily been compromised by its purchase by Google.   Many people feel like this is the best thing that could have happened to the semantic web.

(Though a few feel, “They didn’t make it big. In fact, this means they failed at their mission of better organizing the world’s information so that rich apps could be built around it. They never got to the APPS part. FAIL!”, and at least one person is concerned Google bought Freebase to kill it.)

But what did you think when Google bought the nonprofit Gapminder, of Hans Rosling’s famous TED talk?

Or when eBay bought a 25% stake in Craigslist?

Or outside the tech world, when Unilever bought Ben & Jerry’s?

Can a company or organization maintain any high-minded mission to be driven by principles other than profit when they’re bought by a major publicly held corporation?

This isn’t just an abstract question for us.  One of the biggest reasons why we chose to be a 501(c)(3) nonprofit organization is that we wanted to make sure no one running the Common Data Project would be tempted to sell its greatest asset, the data it hopes to bring together, for short-term profit.  As a nonprofit, CDP is not barred from making profits, but no profits can inure to the benefit of any private individual or shareholder.  Also as a nonprofit, should CDP dissolve, it cannot merely sell its assets to the highest bidder but must transfer them to another nonprofit organization with a similar mission.

We’re still working on understanding the legal distinctions between IRS-recognized tax-exempt organizations and for-profit businesses.  We were surprised when we first found out that Gapminder, a Swedish nonprofit, had been bought by Google.  Swedish nonprofit law may differ from U.S. nonprofit law.  But it appears Hans Rosling did not stand to make a dime.  Google only bought the software and the website, and whatever that was worth went to the organization itself.  So in a way, the experience of Gapminder supports the idea that being a nonprofit does make a difference in restricting the profit motives of individuals.  Alex Selkirk, as the founder and President of CDP, will never make a windfall through CDP.

The fact that CDP is not profit-driven, and will never be profit-driven makes a difference to us.  Does it make a difference to you?

In the mix..government surveillance, HIPAA updates, and user control over online data

July 19th, 2010 by Grace Meng

1) The U.S. government comes up with some, um, interesting names for its surveillance programs.  “Perfect Citizen” sounds like it’s right out of Orwell. As the article points out, there are some major unanswered questions.  How do they collect this data?  Where do they get it?  Do they use it just to look for interesting patterns that then lead them to identify specific individuals, or are all the individuals apparent and visible from the get-go?  And what are the regulations around re-use of this data?

2) Health and Human Services has issued proposed updated regulations to HIPAA, the law regulating how personal health information is shared. CDT has made some comments about how these regulations will affect patient privacy, data security, and enforcement.  HIPAA, to some extent, lays out some useful standards on things like how electronic health information should transmitted.  But it also has been controversial for suppressing information-sharing, even when it is legal and warranted.

So what if instead of talking about what we can’t do, what if we started talking about what we can do with electronic health data?  I’m not imagining a list of uses where anything outside of the list is barred, but rather an outline of the kinds of uses that are useful.  The whole point of electronic health records is to make information more easily shareable so care is more continuous and comprehensive and research more efficient and effective.

I love this bit from an interview with a neuroscientist who studies dog brains because, “dogs aren’t covered by Hipaa! Their records aren’t confidential!”

3) A start-up called Bynamite is trying to give users control over the information they share with advertisers online. It’s another take on something we’ve seen from Google and BlueKai, where users get to see what interests have been associated with them.  Like those services, Bynamite allows you to remove interests that don’t pertain to you or that you don’t want to share.  Bynamite then goes further by opting you out of networks that won’t let you make these choices.  That definitely sounds easier to managing P3P, and easier than reading through the policies of all the companies that participate in the National Advertising Initiative.

I agree with Professor Acquisti that all of us, when we use Google or any other free online service, are paying for our use of the service with our personal information, and that Bynamite is trying to make that transaction more explicit.  But I wonder if the value of the data companies have gained is explicit.  Is the price of the transaction fair?  Does 1 hour of free Google search equal x amounts of personal data bits?  Can you even put a dollar value on that transaction, given that the true value of all this data is in aggregate?

The accompanying blog post to this article cites a study demonstrating how hard it is to assign a dollar value to privacy.  The study subjects clearly did value “privacy,” but the price they put on it depended on how much they felt they had any privacy to begin with!

In the mix…new organizational structures, giant list of data brokers, governments sharing citizens’ financial data, and what IT security has to do with Lady Gaga

July 9th, 2010 by Grace Meng

1) More on new kinds of organizational structures for entities that want to form for philanthropic purposes but not fit into the IRS definition of a nonprofit.

2) CDT shone a spotlight on Spokeo, a data broker last week.  Who are other data brokers? Don’t be shocked, there are A LOT of them.  What they do, they mainly do out of the spotlight shone on companies like Facebook, but with very real effects.  In 2005, ChoicePoint sold data to identity thieves posing as a legitimate business.

3) The U.S. has come to an agreement with Europe on sharing finance data, which the U.S. argues is an essential tool of counterterrorism.  The article doesn’t say exactly how these investigations work, whether specific suspects are targeted or whether large amounts of financial data are combed for suspicious activity.  It does make me wonder, given how data crosses borders more easily than any other resource, how will Fourth Amendment protections in the U.S. (and similar protections in other countries) apply to these international data exchanges?  There is also this pithy quote:

Giving passengers a way to challenge the sharing of their personal data in United States courts is a key demand of privacy advocates in Europe — though it is not clear under what circumstances passengers would learn that their records were being misused or were inaccurate.

4) Don’t mean to focus so much on scary data stuff, but 41% of IT professionals admit to abusing privileges.  In a related vein, it turns out a disgruntled soldier accused of illegally downloading classified data managed to do it by disguising his CDs as Lady Gaga CDs.  Even better,

He was able to avoid detection not because he kept a poker face, they said, but apparently because he hummed and lip-synched to Lady Gaga songs to make it appear that he was using the classified computer’s CD player to listen to music.

The New York Times is definitely getting cheekier.

In the mix…philanthropic entities, who’s online doing what, data brokers, and data portability

July 5th, 2010 by Grace Meng

1) Mimi and I are constantly discussing what it means to be a nonprofit organization, whether it’s a legal definition or a philosophical one.  We both agree, though, that our current system is pretty narrow, which is why it’s interesting to see states considering new kinds of entities, like the low-profit LLC.

2) This graphic of who’s online and what they’re doing isn’t going to tell you anything you don’t already know, but I like the way it breaks down the different ways to be online.  (via FlowingData) At CDP, as we work on creating a community for the datatrust, we want to create avenues for different levels of participation.  I’d be curious to see this updated for 2010, and to see if and how people transition from being passive userd to more active userd of the internet.

3) CDT has filed a complaint against Spokeo, a data broker, alleging, “Consumers have no access to the data underlying Spokeo’s conclusions, are not informed of adverse determinations based on that data, and have no opportunity to learn who has accessed their profiles.” We’ve been wondering when people would start to look at data businesses, which have even less reason to care about individuals’ privacy than businesses with customers like Google and Facebook.  We’re interested to see what happens.

4) The Data Portability Project is advocating for every site to have a Portability Policy that states clearly what data visitors can take in and take out. The organization believes “a lot more economic value could be created if sites realized the opportunity of an Internet whose sites do not put borders around people’s data.” (via Techcrunch)  It definitely makes sense to create standards, though I do wonder how standards and icons like the ones they propose would be useful to the average internet user.

Common Data Project looking for a partner organization to open up access to sensitive data.

June 30th, 2010 by The Common Data Project

Looking for a partner...

The Common Data project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

  1. Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;
  2. Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a couple of different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

  • A data exchange to share sensitive information between members.
  • An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.
  • A way to share sensitive data with allies in a way that doesn’t violate individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

A big update for the Common Data Project

June 29th, 2010 by The Common Data Project

There’s been a lot going on at the Common Data Project, and it can be hard to keep track.  Here’s a quick recap.

Our Mission

The Common Data Project’s mission is to encourage and enable the disclosure of personal data for public use and research.

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open.  But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

We’ve refined what that means, what the datatrust is and what the datatrust is not.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder, Alex Selkirk, that specializes in large-scale data collection systems, to develop a prototype of the datatrust.  The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released.  Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying data, using techniques that are frankly inexact and possibly ineffective.  The datatrust, in other words, could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways.  Right now, sensitive data is aggregated and released as statistics.  A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.

We have a demo of PINQ

An illustration of how you can safely query a sensitive data set through differential privacy: a relatively new, quantitative approach to protecting privacy.

We’ve also developed an accompanying privacy risk calculator.

To help us visualize the consequences of tweaking different levers in differential privacy.

For CDP, improved privacy technology is only one part of the datatrust concept.

We’ve also been working on a number of organizational and policy issues:

A Quantifiable Privacy Guarantee: We are working through how differential privacy can actually yield a “measurable privacy guarantee” that is meaningful to the layman. (Thus far, it has been only a theoretical possibility. A specific “quantity” for the so-called “measurable privacy guarantee” has yet to be agreed upon by the research community.)

Building Community and Self-Governance: We’re wrapping up a blog series looking at online information-sharing communities and self-governance structures and how lessons learned from the past few years of experimentation in user-generated and user-monitored content can apply to a data-sharing community built around a datatrust.

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission.  We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

Licensing Personal Information: We proposed a “Creative Commons” style license for sharing personal data and we’re following the work of others developing licenses for data. In particular, what does it mean to “give up” personal information to a third-party?

Privacy Policies: We published a guide to reading online privacy policies for the curious layman: An analysis of their pitfalls and ambiguities which was re-published up by the IAPP and picked up by the popular technology blog, Read Write Web.

We’ve also started researching the issues we need to address to develop our own privacy policy.  In particular, we’ve been working on figuring out how we will deal with government requests for information.  We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers.  We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Education: We regularly publish in-depth essays and news commentary on our blog: myplaceinthecrowd.org covering topics such as: the risk of re-identification with current methods of anonymization and the value of open datasets that are available for creative reuse.

We have a lot to work on, but we’re excited to move forward!

Who has your data and how can the government get it?

June 28th, 2010 by Grace Meng

Who has your data? And how can the government get it?

The questions are more complicated than they might seem.

In the last month, we’ve seen Facebook criticized and scrutinized at every turn for the way they collect and share their users’ data.  Much of that criticism was deserved, but what was missing in that discussion were the companies that have your data without even your knowledge, let alone your consent.

The relationship between a user and Facebook is at least relatively straightforward.  The user knows his or her data has been placed in Facebook, and legislation could be updated relatively easily to protect his or her expectation of privacy in that data.

But what about the data consumer service companies share with third parties?

Pharmacies sell prescription data that includes you; cellphone-related businesses sell data that includes you.

So much of the data economy involves companies and businesses that don’t necessarily have you as a customer, and thus even less incentive to protect your interests.

What about data that’s supposedly de-identified or anonymized?  We know that such data can be combined with another dataset to re-identify people.  Could the government seek that kind of data and avoid getting even a subpoena?  Increasingly, the companies that have data about you aren’t even the companies you initially transacted with.  How will existing privacy laws, even proposed reforms by the Digital Due Process coalition, deal with this reality?

These are all questions that consume us at the Common Data Project for good reason.  As an organization dedicated to enabling the safe disclosure of personal information, we are committed to talking about privacy and anonymity in measurable ways, rather than with vague promises.

If you read a typical privacy policy, you’ll see language that goes something like this,

Google only shares personal information with other companies or individuals outside of Google in the following limited circumstances:…

We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request

We think the datatrust needs to be do better than that. We want to know exactly what “enforceable government request” means.  We want to think creatively about what individual privacy rights mean when organizations are sharing information with each other. We’ve written up the aspects that seem most directly relevant to our project here, including 1) a quick overview of federal privacy law; 2) implications for data collectors today; and 3) implications for the datatrust.

We ultimately have more questions than answers.  But we definitely can’t assume we know everything there is to know.  Even at the Supreme Court, where the Justices seem to have some trouble understanding how pagers and text messages work, they understand that the world is changing quickly.  (See City of Ontario v. Quon.)  We all need to be asking questions together.

So take a look.  Let us know if there are issues we’re missing. What are some other questions we should be asking?

The Datatrust Product: What it is. What it’s not.

June 21st, 2010 by Mimi Yin

The datatrust has always been a big-tent project, but over the last few months, we’ve done a lot of paring down.  We’re getting closer to something that feels like a product and less like a vague hope for a better future!

The following is an attempt to describe the datatrust “technology product” by way of comparison with existing websites and services. The “Governance and Policies” aspect of the datatrust was covered in a separate post.

Sensitive information about us. They have it, we don’t.

Today, most of the sensitive data about us (e.g. medical records, personal finance data, online search history) is inaccessible to us and to those who represent the public: elected officials, government agencies, advocacy groups, researchers.

Our Mission: Democratizing Access to Sensitive Data

While a significant movement has grown up around opening up government data, there are few efforts to gain public access to sensitive “personal information” data, most of it held in the private sector.

Our goal for the datatrust is to create an open marketplace for information to democratize access to some of the most sensitive and valuable data there is, to help us answer difficult policy and societal questions. .

Which brings us to the question: What is a datatrust?

A datatrust will be an online service that allows organizations to make sensitive data available to the public and provides researchers, policymakers and application developers with a way to directly query that data.

We believe the datatrust is only possible with 1) technical innovations that will allow us to provide a quantifiable and enforceable privacy guarantee; and 2) governance and policy innovations that will inspire public confidence.

The datatrust will include a data catalog, a registry of queries and their privacy risks, and a collaboration network for both data donors and data users.

We realize that as a new breed of service, the datatrust is difficult to conceptualize. So, we thought it might be helpful to compare it to some existing websites and services.

A Data Catalog

Like data.gov, the datatrust will provide ways to browse and search a “catalog” of available data.

A Query-able Database of “Raw Data”

Unlike data.gov, datatrust data will be released in “raw” form, not in pre-digested aggregate reports.

Unlike data.gov, datatrust data will not be viewable or downloadable.

Instead, the datatrust will provide a way to directly query raw data.

An “Automated” Privacy Filter

Unlike most open government data releases, the datatrust will not rely on labor-intensive and subjective anonymization methods. Existing methods like scrubbing, swapping or synthesizing data limit the accuracy and usefulness of the data.

By contrast, the datatrust will makes use of new privacy technologies to provide a measurable and enforceable privacy guarantee that treats individual privacy as a value-able asset with a quantifiable limit on re-use.

As a result, the datatrust will keep track of the amount of privacy risk incurred by each query

Privacy protection will happen on-the-fly, thereby automating the “anonymization” aspect of releasing data.

An Open Collaboration Network

Because the datatrust will maintain an open history of all queries and data users, it will also become an important open registry of how data is being used and analyzed. This in turn can become the foundation for a community of data donors and data users, who will collaborate on collecting and analyzing data for research and data-driven software applications.

Like Amazon, the datatrust will do a better job of describing and browsing data sets as well as eliciting user feedback and data-mining actual usage (as opposed to self-reported usage) to help users find relevant data sets.


Like Wikipedia, the datatrust will depend on an invested and active community to curate and manage the data.

Unlike Wikipedia and Yelp (but like Facebook and LinkedIn), the datatrust will require its users to maintain real and active identities in order to build a quality rating system for evaluating data and data use, based on actual usage and individual reputations (as opposed to explicit user ratings).

Not A Generic Set of Tools for Working With Data

Unlike Swivel, the datatrust is not a generic tool set for working with and visualizing data.


Unlike Ning (a consumer platform for creating your own social network), the datatrust is not a consumer platform for creating your own data-sharing networks. It is also not a developer toolkit for building data-driven services.


Unlike Freebase, the datatrust is not Wikipedia for structured data.

Not A Data-Driven Service for Consumers

You should not expect to come to the datatrust to find out if people like you are also experiencing worse than average allergies this year.

Unlike Mint or Patients Like Me, the datatrust is not a personal data-sharing service focused on offering a consumer service (personal finance manager in the case of Mint) or sharing a specific kind of data (tracking chronic diseases in the case of Patiens Like Me)..

But application builders like Mint.com as well as researchers may find the datatrust useful in allowing them to provide services and collect data in new ways from larger groups of people, due to the measurable privacy guarantee provided by the datatrust.

The datatrust is just about data.

The datatrust is a sensitive data release engine and we will build tools insofar as it helps our Data Donors get more data to Data Users. However, it stops short of directly serving consumers. We think that is better left to those with a passion for a specific cause and the domain expertise to serve their constituents well.

In the mix…democratizing access to data, data literacy, and predictable responses to proposed privacy bill

June 18th, 2010 by Grace Meng

1) Infochimps launched their API. People often ask, are you guys doing something similar?  Yes, in that we are also interested in democratizing access to data, but we’re focusing on a narrower area — information that’s too sensitive and too personal to release in the usual channels. In any case, we’re excited to see more movement in this direction.

2) Wikipedia began a trial of a new tool called “Pending Changes.” To deal with glaring inaccuracies and vandalism, Wikipedia made certain entries off-limits for off-the-cuff editing.  The trade-off, however, was that first-time editors to these articles couldn’t get that immediate thrill of seeing their edits.  Wikipedia’s trying out a compromise, a tab in which these edits are visible as “pending changes.”  It’s always fascinating to see all the different spaces in which people in a community can interact online — this is a new one.

3) The Info Law Group posted various groups’ reactions to the privacy bill proposed by Representative Rick Boucher. Here’s Part I, here’s Part II. Fairly predictable, but it still never ceases to amuse me how far apart industry groups are from consumer advocates.

4) Great discussion continues on the concept of “data literacy.” I love this guest post from David Eaves on the Open Knowledge Foundation blog, with the awesome line:

It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.

Get Adobe Flash playerPlugin by wpburn.com wordpress themes