http://blogs.wsj.com/digits/2012/12/07/sites-sharing-personal-details-the-journal-methodology/

December 7, 2012

Sites Sharing Personal Details: The Journal's Methodology

By WSJ Staff

The Wall Street Journal tested 50 of the most popular sites in the U.S. -- plus WSJ.com and 20 additional websites in sensitive categories -- to identify what data about their registered users the sites passed to other companies.

In choosing sites to test, the Journal drew primarily from Web measurement firm ComScore's list of the 1,000 most popular sites, as of June 2012. The Journal tested the top 50 sites that allowed users to register, excluding sites that required a real-world account, such as banking sites.

In addition, the Journal selected popular sites in categories deemed to be sensitive -- children's sites, political sites, medical sites and sites for dating and relationships. These sites were selected from the ComScore list as well as from a list by measurement firm Quantcast.

The Journal also tested its own site, WSJ.com.

The Journal's methodology was based in part on techniques used by Balachander Krishnamurthy of AT&T Labs and Konstantin Naryshkin and Craig Wills of Worcester Polytechnic Institute in a 2011 study. [1] For each site on its list, the Journal created an account, entering name, username, email address, birth date, location, password and other information.

The Journal followed each site's suggested registration procedure, including email confirmation when necessary. In addition to registering, the Journal logged out of each account, logged back in, and browsed all known types of pages on the site -- for instance, article pages, profile pages and settings pages. The Journal cleared its test computer of tracking files, known as cookies, between browsing sessions.

During each browsing session, the Journal used mitmproxy, an open-source software program, to inspect the data being transmitted to and from the sites. This method reveals all data being passed via the Web browser. The results therefore serve as a "lower bound" for data sharing; companies can also pass data behind the scenes, outside the browser. Transfers of information to the sites themselves -- or to domains owned by these "first-party" sites -- were not counted as data leakage unless the domain served a significantly different purpose from that of the original site.
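
The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the Journal's actual tooling: the function names, the format of the captured-request records and the example domains are all assumptions, and matching hosts by domain suffix is only a rough stand-in for the judgment about which domains count as "first-party."

```python
from urllib.parse import urlsplit, unquote_plus


def is_first_party(host, first_party_domains):
    """True if host equals, or is a subdomain of, a first-party domain."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in first_party_domains)


def find_third_party_leaks(requests, first_party_domains, personal_details):
    """Return sorted (host, label) pairs for details sent to other domains.

    requests: iterable of dicts with a "url" string and an optional "body"
              string (a hypothetical export of captured browser traffic).
    personal_details: dict mapping a label to the value entered at signup.
    """
    leaks = set()
    for req in requests:
        host = urlsplit(req["url"]).hostname or ""
        if is_first_party(host, first_party_domains):
            continue  # transfers to the site itself are not counted
        # Decode percent-encoding so "jane%40mail.example" matches the email.
        haystack = unquote_plus(req["url"] + " " + req.get("body", "")).lower()
        for label, value in personal_details.items():
            if value.lower() in haystack:
                leaks.add((host, label))
    return sorted(leaks)


# Hypothetical captured traffic for an imaginary site, site.example.
captured = [
    {"url": "http://site.example/profile?user=janedoe"},
    {"url": "http://ads.tracker.example/pixel?email=jane%40mail.example"},
]
details = {"username": "janedoe", "email": "jane@mail.example"}
leaks = find_third_party_leaks(captured, {"site.example"}, details)
```

In this made-up capture, only the second request counts: the first goes to the site's own domain, while the second carries the email address to a different domain. A plain substring search like this would miss details that are hashed or otherwise encoded before being sent.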

Also excluded was information that, while related to identity, might not actually divulge information about the specific user. For example, a social-networking site might structure its public Web pages in the format http://site.com/username; although this address contains a person's username, it is not clear that the person viewing that page is, in fact, the user.
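
The username example can be made concrete with a short Python sketch. The Journal does not describe how it identified such cases, so the helper below and its rule (a username that appears as a path segment of a page address but in no query parameter) are purely illustrative assumptions about how ambiguous mentions might be flagged for exclusion.

```python
from urllib.parse import urlsplit, parse_qsl, unquote


def username_only_in_path(page_url, username):
    """True if username is a path segment of page_url but in no query value.

    Such an address, e.g. http://site.com/username, may simply be a public
    profile page that anyone could be viewing.
    """
    parts = urlsplit(page_url)
    segments = [unquote(s).lower() for s in parts.path.split("/") if s]
    query_values = [v.lower() for _, v in parse_qsl(parts.query)]
    name = username.lower()
    in_path = name in segments
    in_query = any(name in v for v in query_values)
    return in_path and not in_query
```

Under this assumed rule, http://site.com/janedoe would be treated as ambiguous, while an address such as http://site.com/account?user=janedoe would not, because the username there is passed as an explicit parameter.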

In rare cases, the Journal conducted closer inspections of data being sent, based on information seen during the registration process. Extra data observed in this manner is recorded in notes for each site. The Journal also sought confirmation and comment from each of the sites that transmitted or received data.

-- Jeremy Singer-Vine and Jennifer Valentino-DeVries

[1] http://web.cs.wpi.edu/~cew/papers/w2sp11.pdf