http://online.wsj.com/article/SB10001424127887323783704578247842499724794.html

January 17, 2013

A Little Digging Unmasks DNA Donor Names

Experts Identify People by Matching Y-Chromosome Markers to Genealogy Sites, Obits; Researchers' Privacy Promises 'Empty'

By AMY DOCKSER MARCUS

Genetic information stored anonymously in databases doesn't always stay that way, a new study revealed, raising concern about how much privacy participants in research projects can expect in the Internet era.

Tension has long existed between the need to share data to drive medical discoveries and the fact that many people don't want personal health information disclosed. The growing use of genetic sequencing makes this even more challenging because genetic data reveal information not only about an individual, but also about his or her relatives.

In a paper published Thursday in the journal Science, researchers reported that they were able to determine the identities of nearly 50 people who had submitted genetic information as part of scientific studies. The people were told that no identifying information would be included in the studies but were warned of the remote possibility that at some point in the future, their identities might become known.

"We have been pretending that by removing enough information from databases that we can make people anonymous. We have been promising privacy, and this paper demonstrates that for a certain percent of a population, those promises are empty," said John Wilbanks, chief commons officer at Sage Bionetworks, a nonprofit organization that promotes data sharing, who wasn't involved in the study.

The public and the scientific community are concerned about DNA privacy because genetic information, which can show susceptibility to certain diseases and other ailments, might be used by insurers, employers or others to discriminate against people.

In the new study, the researchers, led by the Whitehead Institute for Biomedical Research in Cambridge, Mass., used the genetic information of people whose genomes had been anonymously published as part of the 1000 Genomes Project, an international collaboration to create a public catalog of data from at least 1,000 people of different ethnic and population groups.

Using a computer algorithm, the researchers focused on identifying unique genetic markers on the Y chromosome of men in the project. They searched publicly accessible genealogy databases that contain both Y chromosome information and men's surnames.

Such genealogy sites, which people join in hopes of compiling their family tree, sometimes include Y chromosome data because it is passed from father to son and can be traced back generations. Some genealogy sites group such genetic information with surnames.
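The matching step described above can be pictured with a short sketch. This is not the researchers' algorithm, which the article does not detail; it is a minimal illustration that assumes each genealogy entry pairs a surname with repeat counts at a few Y-chromosome markers. The record layout, the function names, the thresholds, the placeholder surnames and every repeat count below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenealogyRecord:
    """Hypothetical genealogy-site entry: a surname plus Y-chromosome marker values."""
    surname: str
    markers: dict  # marker name -> repeat count


def count_mismatches(query, record_markers):
    """Return (markers typed in both profiles, how many of those differ)."""
    shared = query.keys() & record_markers.keys()
    mismatches = sum(1 for m in shared if query[m] != record_markers[m])
    return len(shared), mismatches


def candidate_surnames(query, records, min_shared=3, max_mismatches=0):
    """List surnames whose marker profile agrees with the anonymous query profile.

    min_shared and max_mismatches are illustrative thresholds, not values
    taken from the study.
    """
    hits = []
    for rec in records:
        shared, mismatches = count_mismatches(query, rec.markers)
        if shared >= min_shared and mismatches <= max_mismatches:
            hits.append((rec.surname, shared, mismatches))
    # Best candidates first: fewest mismatches, then most shared markers.
    hits.sort(key=lambda h: (h[2], -h[1], h[0]))
    return hits


# Invented data: placeholder surnames and made-up repeat counts.
records = [
    GenealogyRecord("Example-A", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    GenealogyRecord("Example-B", {"DYS19": 15, "DYS390": 23, "DYS391": 10, "DYS393": 13}),
    GenealogyRecord("Example-C", {"DYS19": 14, "DYS390": 24, "DYS391": 11}),
]
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

matches = candidate_surnames(query, records)
```

A surname match on its own names a paternal line rather than a person, which is why further public information is needed to single out an individual.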

When they got a match to a surname, the researchers ran numerous Internet searches to collect data on each individual's family tree, including obituaries, which often list the names of a deceased's family members. They also searched for demographic data on the public website of the Coriell Institute for Medical Research, a nonprofit in Camden, N.J., that houses collections of genetic material.
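How a surname and a piece of demographic data can narrow the field together is easy to sketch. The example below is an assumption-laden illustration, not the study's procedure: it supposes a list of people assembled from public sources such as obituaries, and filters it by surname and by an age reported for the anonymous donor. The record layout, the function name, the tolerance and all names and years are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    """Hypothetical entry compiled from public sources such as obituaries."""
    name: str
    surname: str
    birth_year: int


def narrow_candidates(people, surname, study_year, reported_age, age_tolerance=1):
    """Keep people with the matched surname whose birth year fits the donor's age.

    age_tolerance allows for the birthday falling before or after the
    sampling date; the value is an illustrative assumption.
    """
    target_birth_year = study_year - reported_age
    return [
        p for p in people
        if p.surname == surname
        and abs(p.birth_year - target_birth_year) <= age_tolerance
    ]


# Invented data: placeholder names and years.
people = [
    PublicRecord("Person 1", "Example-A", 1950),
    PublicRecord("Person 2", "Example-A", 1975),
    PublicRecord("Person 3", "Example-C", 1950),
]

shortlist = narrow_candidates(people, "Example-A", study_year=2010, reported_age=60)
```

Each additional public attribute shrinks the shortlist further, which is the reason removing a single field, such as age, makes the search harder to repeat.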

With the family-tree data, they were able to identify nearly 50 men and women who participated in genetic studies. "It only takes one male," said Yaniv Erlich, a Whitehead fellow, who led the research team. "With one male, we can find even distant relatives."

Dr. Erlich said the technique works best for the group with the highest participation in genetic genealogy services: upper- and middle-class Caucasian Americans. The researchers estimated that their technique could identify the last names of 12% of U.S. Caucasian males in similar DNA studies.

The researchers didn't disclose the names of the DNA donors they discovered.

Hank Greely, director of the Center for Law and the Biosciences at Stanford University, said the study raises important questions about expectations of privacy. In an age when genetic information is being collected as part of medical care and can be correlated with personal information people freely post online, Mr. Greely said the medical and scientific communities need to be clear that "we cannot promise people confidentiality."

The issue of protecting privacy in genetic studies isn't new. In 2008, geneticist David Craig showed that he was able to trace pooled genetic data that was available online back to an individual who had participated. As a result, the National Institutes of Health and the Wellcome Trust tightened access to these collections of DNA. The scientific community protested, but officials felt they had no choice because participants had been explicitly promised anonymity, said Eric Green, director of the National Human Genome Research Institute at NIH.

Dr. Green, who co-wrote a perspective piece that accompanied Thursday's paper, said steps already have been taken to make it more difficult for others to duplicate the results of Dr. Erlich's team. For instance, the ages of the people in the study no longer are publicly available on the Coriell website, said Courtney Kronenthal, director of communications and development at Coriell.

David Altshuler, co-chair of the steering committee for the 1000 Genomes Project, said the people who enrolled in the study were told that all steps would be taken to keep them anonymous, but that technological advances one day might make it possible to identify them.

Dr. Altshuler said he favors offering research participants a variety of options about how much to share.

"If they choose to share that's a very admirable thing because by sharing freely, progress for everyone is accelerated, and if someone is not comfortable we should respect that too and find ways for them to still participate in research," he said.
