[House Hearing, 113 Congress]
[From the U.S. Government Publishing Office]

SCIENTIFIC INTEGRITY AND TRANSPARENCY

=======================================================================

HEARING

BEFORE THE

SUBCOMMITTEE ON RESEARCH

COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY
HOUSE OF REPRESENTATIVES

ONE HUNDRED THIRTEENTH CONGRESS

FIRST SESSION

__________

TUESDAY, MARCH 5, 2013

__________

Serial No. 113-10

__________

Printed for the use of the Committee on Science, Space, and Technology

Available via the World Wide Web: http://science.house.gov

U.S. GOVERNMENT PRINTING OFFICE
79-929 WASHINGTON : 2013
-----------------------------------------------------------------------
For sale by the Superintendent of Documents, U.S. Government Printing Office,
http://bookstore.gpo.gov. For more information, contact the GPO Customer Contact Center, U.S. Government Printing Office. Phone 202-512-1800, or 866-512-1800 (toll-free). E-mail, [email protected].

COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY

HON. LAMAR S. SMITH, Texas, Chair
DANA ROHRABACHER, California        EDDIE BERNICE JOHNSON, Texas
RALPH M. HALL, Texas                ZOE LOFGREN, California
F. JAMES SENSENBRENNER, JR.,        DANIEL LIPINSKI, Illinois
    Wisconsin                       DONNA F. EDWARDS, Maryland
FRANK D. LUCAS, Oklahoma            FREDERICA S. WILSON, Florida
RANDY NEUGEBAUER, Texas             SUZANNE BONAMICI, Oregon
MICHAEL T. McCAUL, Texas            ERIC SWALWELL, California
PAUL C. BROUN, Georgia              DAN MAFFEI, New York
STEVEN M. PALAZZO, Mississippi      ALAN GRAYSON, Florida
MO BROOKS, Alabama                  JOSEPH KENNEDY III, Massachusetts
RANDY HULTGREN, Illinois            SCOTT PETERS, California
LARRY BUCSHON, Indiana              DEREK KILMER, Washington
STEVE STOCKMAN, Texas               AMI BERA, California
BILL POSEY, Florida                 ELIZABETH ESTY, Connecticut
CYNTHIA LUMMIS, Wyoming             MARC VEASEY, Texas
DAVID SCHWEIKERT, Arizona           JULIA BROWNLEY, California
THOMAS MASSIE, Kentucky             MARK TAKANO, California
KEVIN CRAMER, North Dakota          VACANCY
JIM BRIDENSTINE, Oklahoma
RANDY WEBER, Texas
CHRIS STEWART, Utah
VACANCY
------

Subcommittee on Research

HON. LARRY BUCSHON, Indiana, Chair
STEVEN M. PALAZZO, Mississippi      DANIEL LIPINSKI, Illinois
MO BROOKS, Alabama                  ZOE LOFGREN, California
STEVE STOCKMAN, Texas               AMI BERA, California
CYNTHIA LUMMIS, Wyoming             ELIZABETH ESTY, Connecticut
JIM BRIDENSTINE, Oklahoma           EDDIE BERNICE JOHNSON, Texas
LAMAR S. SMITH, Texas

C O N T E N T S

Tuesday, March 5, 2013

Witness List

Hearing Charter

Opening Statements

Statement by Representative Larry Bucshon, Chairman, Subcommittee
on Research, Committee on Science, Space, and Technology, U.S.
House of Representatives
Written Statement

Statement by Representative Daniel Lipinski, Ranking Minority
Member, Subcommittee on Research, Committee on Science, Space,
and Technology, U.S. House of Representatives
Written Statement

Witnesses:

Dr. Bruce Alberts, Editor-in-Chief, Science Magazine and
Professor Emeritus of Biochemistry and Biophysics, University
of California - San Francisco
Oral Statement
Written Statement

Dr. Victoria Stodden, Assistant Professor of Statistics, Columbia
University
Oral Statement
Written Statement

Dr. Stanley Young, Assistant Director for Bioinformatics,
National Institute of Statistical Sciences
Oral Statement
Written Statement

Mr. Sayeed Choudhury, Associate Dean for Research Data Management
at Johns Hopkins University and Hodson Director of the Digital
Research and Curation Center
Oral Statement
Written Statement

Discussion

Appendix I: Answers to Post-Hearing Questions

Dr. Bruce Alberts, Editor-in-Chief, Science Magazine and
Professor Emeritus of Biochemistry and Biophysics, University
of California - San Francisco

Dr. Victoria Stodden, Assistant Professor of Statistics, Columbia
University

Dr. Stanley Young, Assistant Director for Bioinformatics,
National Institute of Statistical Sciences

Mr. Sayeed Choudhury, Associate Dean for Research Data Management
at Johns Hopkins University and Hodson Director of the Digital
Research and Curation Center

SCIENTIFIC INTEGRITY AND TRANSPARENCY

----------

TUESDAY, MARCH 5, 2013

House of Representatives,
Subcommittee on Research
Committee on Science, Space, and Technology,
Washington, D.C.

The Subcommittee met, pursuant to call, at 10:01 a.m., in
Room 2318 of the Rayburn House Office Building, Hon. Larry
Bucshon [Chairman of the Subcommittee] presiding.
[GRAPHIC] [TIFF OMITTED] T9929.001

U.S. HOUSE OF REPRESENTATIVES

COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY

SUBCOMMITTEE ON RESEARCH

Hearing Charter

Scientific Integrity and Transparency

Tuesday, March 5, 2013
10:00 a.m. to 12:00 p.m.
2318 Rayburn House Office Building

Purpose

At 10 AM on Tuesday, March 5, 2013, the Subcommittee on Research
will hold a hearing titled Scientific Integrity and Transparency. This
hearing will provide Members an opportunity to understand the problem
of access to underlying data from published research funded by the
federal government, and why access to this underlying data is vital to
scientific integrity and transparency for peer reviewed research. On
March 29th, 2012 the Investigation and Oversight Subcommittee held a
hearing entitled, ``Federally Funded Research: Examining Public Access
and Scholarly Publication Interests.'' \1\ The focus of this past
hearing was on open access to publications, whereas the focus of this
hearing is on open access to data used in federal research.
---------------------------------------------------------------------------
\1\ http://science.house.gov/hearing/subcommittee-investigations-and-oversight-hearing-examining-public-access-and-scholarly

---------------------------------------------------------------------------
Witnesses

Prof. Bruce Alberts, Professor of Biochemistry,
University of California San Francisco

Prof. Victoria Stodden, Assistant Professor of
Statistics, Columbia University

Dr. Stanley Young, Assistant Director for Bioinformatics,
National Institute of Statistical Sciences

Mr. Sayeed Choudhury, Associate Dean for Research Data
Management at Johns Hopkins University and Hodson Director of the
Digital Research and Curation Center

Overview

The bedrock of the scientific process is the ability to replicate
the experimental claims made by researchers. These claims include both
the generation of data and the analysis of data by computer software
and code. Scientists rarely reproduce the work of others since they
have neither the time nor the resources to reliably replicate the work
of their colleagues; instead, they often trust these claims and rely on
the peer review process and their colleagues to share their data and
analysis methods when needed. This exchange allows for scientists and
companies to exploit the latest insights to develop new directions in
their research, and allows them to maximize the impact of federal
research investment. Thus, scientific progress cannot occur unless
there is a strong culture of integrity and transparency.
Unfortunately, the current system has demonstrated several flaws.
The current incentive system rewards researchers who publish in
journals, but preparation of data for others' use is not an important
part of this reward structure. The process of peer review, which the
scientific community views as its primary means to check scientific
integrity in journal publications, oftentimes does not try to replicate
the results of submitted papers. Fellow researchers conducting the peer
review for publication rarely ask for the original data of the
submitted paper they are reviewing, and focus instead on whether the
claims made in the paper are plausible. They simply assume the
underlying data is valid. According to a recent study by Young and Karr,
upwards of 90% of clinical trial claims for new medicines cannot be replicated.
\2\ The inability to replicate published results is not unique to
clinical trials and occurs across scientific disciplines. \3\
---------------------------------------------------------------------------
\2\ http://science.house.gov/sites/republicans.science.house.gov/files/documents/hearings/HHRG-112-SY20-WState-SYoung-20120203.pdf
\3\ ``Again, and again, and again.'' Science, Vol. 334, p. 1225,
2 December 2011.
---------------------------------------------------------------------------
This hearing will attempt to understand the scope of the problem
with scientific integrity, especially how thoroughly researchers deal
with underlying data. This issue of scientific integrity should be
differentiated from cases of scientists knowingly and intentionally
committing scientific fraud, fabricating data, or plagiarizing, though
these might be inter-related depending on individual circumstances.
This hearing will focus primarily on how data is collected, shared, and
analyzed by the scientific community and policies for what, how, and
when federally funded research data should be shared, as well as the
cost of making this data available to the scientific community and
public. Current federal laws governing the sharing of data include the
Data Access Act (DAA) of 1999 and the Information Quality Act (IQA) of
2001. \4\ Introduced by Senator Richard Shelby, the DAA (sometimes
known as ``the Shelby Amendment'' within the science community)
requires that data from federally funded research be made available
under the Freedom of Information Act procedures. The IQA requires the
OMB to issue regulations for ensuring the quality and integrity of all
information disseminated by federal agencies. However, the Government
Accountability Office reported in September 2007 that federal agencies
rarely monitor whether researchers make data available. \5\
---------------------------------------------------------------------------
\4\ National Research Council, Ensuring the Integrity,
Accessibility, and Stewardship of Research Data in the Digital Age
(Washington, DC: National Academy Press), 2009.
\5\ http://www.gao.gov/products/GAO-07-1172
---------------------------------------------------------------------------
In response to these aforementioned issues, the Office of Science
and Technology Policy (OSTP) released guidance to federal agencies on
February 22nd about increasing access to the results of federally
funded scientific research which includes a discussion about access to
non-classified digital data. In this memo, OSTP outlines the following
principles for federal funding agencies to follow when issuing a data
access plan \6\:
---------------------------------------------------------------------------
\6\ http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp-public-access-memo-2013.pdf
---------------------------------------------------------------------------

Maximize access to scientific data created with federal
funds;

Ensure that researchers develop data management plans,
and allow the inclusion of costs in proposals, along with proper
evaluation of these proposals;

Include mechanisms to ensure compliance with data
management plans and policies;

Promote the deposit of data in publicly accessible
databases;

Encourage cooperation with the private sector to improve
data access and compatibility;

Develop approaches for identifying/providing appropriate
attribution to data sets;

Support the training, education and workforce development
related to data management; and

Provide assessment of long-term needs for the
preservation of scientific data.

This hearing will address how such principles might best be
implemented by federal research agencies and members of the scientific
community conducting such research.
Chairman Bucshon. The Subcommittee on Research will now
come to order.
Good morning. Welcome to today's hearing entitled
``Scientific Integrity and Transparency.'' In front of you are
packets containing the written testimonies, biographies and
Truth-in-Testimony disclosures for today's witness panel. I
recognize myself for five minutes for an opening statement.
I want to welcome everyone to today's Research Subcommittee
hearing on the issue of scientific integrity and transparency.
An editorial in the March 29, 2012, edition of Nature
magazine was entitled: ``Must try harder: too many sloppy mistakes
are creeping into scientific papers. Lab heads must look more
rigorously at the data and at themselves.'' I found this
editorial particularly interesting because of my background as
a cardiothoracic surgeon and my professional interest in
medicine. The editorial goes on to cite a recent study
contained in this specific issue by Glenn Begley and Lee Ellis,
which analyzes the low number of cancer research studies that
have been converted into clinical success, and concludes that a
major factor is the overall poor quality of published
preclinical data. This is one of the many similar studies that
I have read.
The growing lack of scientific integrity and transparency
has many causes but one thing is very clear: without open
access to data, there can be neither integrity nor transparency
from the conclusions reached by the scientific community.
Furthermore, when there is no reliable access to data, the
progress of science is impeded and leads to inefficiencies in
the scientific discovery process. Important results cannot be
verified, and confidence in scientific claims dwindles.
The Federal Government is the main sponsor of basic
scientific research, with over $140 billion spent in fiscal
year 2013. The American scientific community has made enormous
contributions in many scientific fields from federally
sponsored research. I believe our Nation's scientists will
continue to develop the breakthrough discoveries and
innovations of tomorrow. However, scientists receiving federal
funding need to be accountable and responsible stewards of
taxpayers' resources. Hardworking Americans trust our
scientists to be genuine and authentic in the way they conduct
and share federally funded research.
The focus of this hearing will be on scientific research
data funded by the Federal Government. There are key issues
that data-sharing policies should address including what is
data, how it should be shared, when it should be shared, and
what potential costs might result in making this data available
to the research community. We want to maximize access to data
while protecting personal privacy, avoid any negative impact on
intellectual property rights and innovation, and preserve data
without ridiculous cost or administrative burdens.
In an attempt to begin addressing this issue, the Office of
Science and Technology Policy released guidelines on February
22nd of this year that recognized the problem of data access.
These guidelines, intended for federal science agencies, are to
be followed when determining a policy for public access to
scientific data in digital formats. As part of this hearing, I
look forward to hearing the witnesses' opinions on these
federal guidelines.
Our witnesses today offer input from a variety of
scientific fields, as this problem is not exclusive to one
scientific field, community or discipline. I would like to
thank them for coming and taking the time to offer their
expertise. I would also like to thank Ranking Member Lipinski
and everyone else participating in today's hearing.
[The prepared statement of Mr. Bucshon follows:]

Prepared Statement of Chairman Larry Bucshon

I want to welcome everyone to today's Research subcommittee hearing
on the issue of scientific integrity and transparency.
An editorial in the March 29, 2012 edition of Nature magazine was
entitled: ``Must try harder: too many sloppy mistakes are creeping into
scientific papers. Lab heads must look more rigorously at the data--and
at themselves.'' I found this editorial particularly interesting
because of my background as a cardiothoracic surgeon and my
professional interest in medicine. The editorial goes on to cite a
recent study (contained in this specific issue) by Glenn Begley and Lee
Ellis which analyzes the low number of cancer-research studies that
have been converted into clinical success, and concludes that ``a major
factor is the overall poor quality of published pre-clinical data.''
This is one of many similar studies that I have read.
The growing lack of scientific integrity and transparency has many
causes but one thing is very clear: without open access to data, there
can be neither integrity nor transparency from the conclusions reached
by the scientific community. Furthermore, when there is no reliable
access to data, the process of science is impeded and leads to
inefficiencies in the scientific discovery process. Important results
cannot be verified, and confidence in scientific claims dwindles.
The federal government is the main sponsor of basic science
research, with over $140 billion spent in fiscal year 2013. The
American scientific community has made enormous contributions in many
scientific fields from federally sponsored research. I believe our
nation's scientists will continue to develop the breakthrough
discoveries and innovations of tomorrow. However, scientists receiving
federal funding need to be accountable and responsible stewards of
taxpayer resources. Hard-working Americans trust our scientists to be
genuine and authentic in the way they conduct and share federally
funded research.
The focus of this hearing will be on scientific research data
funded by the federal government. There are key issues that data-
sharing policies should address including: what is data, how it should
be shared, when it should be shared, and what potential costs might
result in making this data available to the research community. We want
to maximize access to data while protecting personal privacy, avoid any
negative impact on intellectual property rights and innovation, and
preserve data without ridiculous cost or administrative burdens. In an
attempt to begin addressing this issue, the Office of Science and
Technology Policy released guidelines on February 22nd of this year
that recognized the problem of data access. These guidelines, intended
for federal science agencies, are to be followed when determining a
policy for public access to scientific data in digital formats. As part
of this hearing, I look forward to hearing the witnesses' opinions on
these federal guidelines.
Our witnesses today offer input from a variety of scientific
fields, as this problem is not exclusive to one scientific field,
community, or discipline. I'd like to thank them for coming and taking
time to offer their expertise. I'd also like to thank Ranking Member
Lipinski and everyone else participating in today's hearing.

Chairman Bucshon. With that, I now recognize the Ranking
Member, the gentleman from Illinois, Mr. Lipinski, for an
opening statement.
Mr. Lipinski. Thank you, Chairman Bucshon. I think this is
our third hearing in three weeks, and we have another one next
week, so I will now label you the hardest-working Chairman in
Washington, D.C. So it is good to be at work here and I want to
thank all the witnesses for being here.
The United States has for decades represented the world's
gold standard for scientific integrity. But no one should
mistake this observation as an argument for complacency. In the
COMPETES Act of 2007, which we worked on in this Subcommittee,
then-Subcommittee Chairman Brian Baird included a provision on
Responsible Conduct of Research that required every institution
receiving NSF grant funding to provide training on the ethical
conduct of science to all students and postdocs covered under
those grants. Today, all U.S. research universities have
implemented research ethics training for their STEM students
and trainees, which we all can agree is a good thing.
The bigger challenge to the progress of science is not
misconduct, but rather poor methodology and bad statistical
analysis that take a long time to uncover. Or for that matter,
discoveries in one field that have broad multidisciplinary
relevance but take time to be known in other fields. To that
end, the open sharing of scientific data is good for science
and it is good for society. We must, of course, respect issues
of privacy and intellectual property. But the more data are
open, the faster we will validate new theories and overturn old
ones, and the more efficiently we will transform new
discoveries into innovations that will create jobs and make us
healthier and more prosperous. The movement toward open data is
not primarily about scientific integrity; it is mostly about
speeding up the process of scientific discovery and innovation.
However, there are some big challenges to the widespread
implementation of open data. Someone must define what exactly
data sharing is going to mean and how it is going to be done,
beginning with a standard. The February 22nd OSTP memo, which
the Chairman mentioned, on increasing access to the results of
federally funded scientific research, which by the way was also
a direct response to requirements in the COMPETES Act, takes on
many of these issues in detail. But specifically, here are some
questions that we have to consider, and some of these questions
were questions raised by the Chairman. First, what does it
entail and how much does it cost for researchers to develop a
data management plan and to prepare their own data for sharing?
Do they have adequate assistance from professional information
managers? Are funding agencies sufficiently aware of the costs
and skills required for good data management plans, and how
should they evaluate and budget for data management proposals?
What are the IT infrastructure needs for data sharing,
including technical standards, and what, if any, scientific or
technical barriers exist to developing that infrastructure?
What are the most important factors to consider in the
economics of digital data access and preservation? What should
be the respective roles of science agencies, universities, and
the private sector in supporting and preserving public
databases? How can these groups work together to minimize costs
and maximize benefit to the scientific community? And finally,
are there any policy or legal barriers for sustainable digital
access and preservation?
In light of the majority's suggestion of a possible
legislative outcome for this hearing, I hope that today's
dialogue will include a thoughtful discussion of some of these
practical issues of implementation. I know that all four expert
witnesses before us have a lot to contribute to this discussion
and I look forward to learning from them because this is
certainly something that is important for us to pursue but we
need to make sure that we are covering all our bases here and
do this in the right manner.
With that, I yield back.
[The prepared statement of Mr. Lipinski follows:]

Prepared Statement of Ranking Minority Member Daniel Lipinski

Thank you, Chairman Bucshon, and thanks to all of the witnesses for
being here.
The U.S. has for decades represented the world's gold standard for
scientific integrity. But no one should mistake this observation as an
argument for complacency. In the COMPETES Act of 2007, which we worked
on in this subcommittee, then-Subcommittee Chairman Brian Baird
included a provision on Responsible Conduct of Research that required
every institution receiving NSF grant funding to provide training on
the ethical conduct of science to all students and postdocs covered
under those grants. Today, all U.S. research universities have
implemented research ethics training for their STEM students and
trainees.
The bigger challenge to the progress of science is not misconduct,
but rather poor methodology and bad statistical analysis that take a
long time to uncover. Or for that matter, discoveries in one field that
have broad multidisciplinary relevance but take time to be known in
other fields. To that end, the open sharing of scientific data is good
for science and it's good for society. We must, of course, respect
issues of privacy and intellectual property. But the more data are
open, the faster we will validate new theories and overturn old ones,
and the more efficiently we will transform new discoveries into
innovations that will create jobs and make us healthier and more
prosperous. The movement toward open data is not primarily about
scientific integrity, it's mostly about speeding up the process of
scientific discovery and innovation.
However, there are some big challenges to the widespread
implementation of open data. Someone must define what exactly data
sharing is going to mean and how it is going to be done, beginning with
a standard. The February 22nd OSTP memo on increasing access to the
results of federally funded scientific research, which by the way was
also a direct response to a requirement in COMPETES, takes on many of
these issues in detail.

Specifically, we must consider such questions as:

What does it entail and how much does it cost for
researchers to develop a data management plan and to prepare their own
data for sharing? Do they have adequate assistance from professional
information managers?

Are funding agencies sufficiently aware of the costs and
skills required for good data management plans, and how should they
evaluate and budget for data management proposals?

What are the IT infrastructure needs for data-sharing,
including technical standards, and what, if any, scientific or
technical barriers exist to developing that infrastructure?

What are the most important factors to consider in the
economics of digital data access and preservation?

What should be the respective roles of science agencies,
universities, and the private sector in supporting and preserving
public databases? How can these groups work together to minimize costs
and maximize benefit to the scientific community?

And finally, are there any policy or legal barriers for
sustainable digital access and preservation?

In light of the Majority's suggestion of a possible legislative
outcome for this hearing, I hope that today's dialogue will include a
thoughtful discussion of some of these practical issues of
implementation. I know that all four expert witnesses before us have a
lot to contribute to this discussion and I look forward to learning
from them.

With that I yield back.

Chairman Bucshon. Thank you, Mr. Lipinski.
If there are Members who wish to submit additional opening
statements, your statements will be added to the record at this
point.
At this time I would like to introduce our witnesses. Our
first witness is Dr. Bruce Alberts, Editor-in-Chief of Science
Magazine and Professor Emeritus of Biochemistry and Biophysics
at the University of California-San Francisco. Welcome. Our
next witness is Dr. Victoria Stodden, Assistant Professor of
Statistics at Columbia University. Our third witness is Dr.
Stanley Young, the Assistant Director of Bioinformatics at the
National Institutes of Statistical Sciences. That was hard to
say. Our fourth and final witness today is Mr. Sayeed
Choudhury, Associate Dean for Research Data Management at Johns
Hopkins University and Hodson Director of the Digital Research
and Curation Center.
As our witnesses should know, spoken testimony is limited
to five minutes each after which Members of the Committee will
have five minutes each to ask questions.
I now recognize Dr. Alberts to present his oral testimony.

TESTIMONY OF DR. BRUCE ALBERTS,

EDITOR-IN-CHIEF, SCIENCE MAGAZINE AND

PROFESSOR EMERITUS OF BIOCHEMISTRY AND BIOPHYSICS,

UNIVERSITY OF CALIFORNIA - SAN FRANCISCO

Dr. Alberts. It is a pleasure to be here today. I would
just like to start by emphasizing something that Science
Magazine covers repeatedly, which is the fact that our strength
in science and technology in the United States underlies both
our economic success and our military dominance in the world.
As you all know, many other nations are increasingly making
investments in this area, and I find it distressing that
although this Committee has long supported fundamental, long-
term scientific research, the investment in the United States
has been stagnant for many years. The investment in this kind
of research was 1.25 percent of GDP in 1985 and has dropped to 0.87
percent of GDP in 2013, a big drop, and of course, the current
sequester will now make our situation even worse. I believe
that this is dangerous for America's future, for my
grandchildren's future.
But this hearing, of course, is to focus on the quality and
not the quantity of U.S. research. I would like to address
first the data availability issue, which of course is crucial
for science. Science builds by one scientist testing and
building on and maybe refuting the data of other scientists,
very much a community endeavor. And the privilege of publishing
in a journal like ours demands data sharing. Otherwise science
doesn't work.
So our journal has been working on this. This is a special
issue we published, 14 long articles about all these issues,
February 2011, and we are publishing more and more about this.
It is accompanied by a survey, a useful survey of scientists,
how they use data and whether they have enough access. And we
have stressed over and over again that our policy is ``that all
data necessary to understand, assess and extend the conclusions
of the manuscript must be available to any reader of Science.''
In this issue, we announced a new policy. This includes
computer codes involved in the creation or analysis of data,
and I am pleased to say that we are getting good compliance
with those policies.
Of course, there are problems that remain. You will hear
about them from the rest of the group here. But one I would
like to emphasize is guaranteeing funding for the public
databases, the critical ones, funding long term so that the
community and journals like ours can rely on them. This is
really a major issue. In my field, the protein database, for
example, is absolutely crucial. It has got 100,000 different
protein coordinates in it. You know, if funding lapses, then we
lose all this, and these places play major roles in setting
standards as well.
And secondly, I would like to emphasize that we need tools
for interacting with the largest data sets that are now
increasingly provided as supplemental online information and
journal publications like ours, so when we demand the data, we
put the data not in the written paper but most of it in a big
electronic supplement, and other journals are doing that as
well, but we need ways to help people analyze that data who are
not the original authors. And of course, every journal needs to
stress clear and complete presentation of all the materials and
methods that were used in the research.
So the other issue is data reproducibility. Mr. Chairman,
you quoted from that paper. My conclusion, in talking to
people at Genentech who would agree with that paper that you
cited from Bayer Health Care, is that the scientific standards
are lower in some fields of science than in others and that we
need to work on setting higher standards.
In addition, human cells are incredibly complex and it is
easy to get a result that looks right when it is really wrong,
and one can easily be fooled. Every scientist must be trained
to be highly suspicious about his or her own results, and this
again is a major issue. And finally, I believe we are
overemphasizing research directly aimed at finding drugs at the
expense of the high-quality discovery-driven basic research
that is urgently needed to improve the search for disease
treatments. We are just mostly stabbing in the dark.
So my suggestions for improving this situation would demand
a community effort from scientific journals like ours. We have
new policies in the last three years that every senior author
for each part of the results being published must confirm that
he or she has personally reviewed the original data generated
by that unit, specifying where exactly those results appear in
the paper. It used to be that we wanted one author to take
responsibility. That is totally unreasonable now. Half of our
papers have authors in different countries. We would have to
have a set of senior authors. We are developing checklists in
various fields of science to help journals and scientists.
There is a biosketch issue. People should not be listing huge
lists of publications to impress other people who are giving
them grant funds. They need to focus on their five or ten most
important contributions, and quality is critical, not quantity.
And funding agencies have a role to play here as well.
I just want to emphasize my own role at universities. I am
still teaching. I am going to be teaching a 2-week minicourse
on ethics and research standards this May, so I am very much
involved in these issues. Thank you.
[The prepared statement of Dr. Alberts follows:]

[GRAPHICS] [TIFF OMITTED] T9929.005 through T9929.011

Chairman Bucshon. Thank you.
I now recognize Dr. Stodden for five minutes to present her
testimony.

TESTIMONY OF DR. VICTORIA STODDEN,

ASSISTANT PROFESSOR OF STATISTICS,

COLUMBIA UNIVERSITY

Dr. Stodden. Thank you for the privilege of addressing you,
and thank you for your very lucid comments. I agree with just
about everything both of you have said, and I also agree on how
important this issue is. So I would like to spend my remaining
time on two aspects. One is, I would like to scope the problem
for you, and the second is, I would like to scope the action I
think that is available to you here.
So the first thing I want to say is that there is not a
crisis of integrity in terms of scientists and scientists'
behavior. What has happened in science is that like all sectors
of the economy, and all across America, we are taking advantage
of technological revolutions. What we are doing is using far
more computers, far more data-oriented and data-driven
research, far more high-powered investigation in all the
research all across the sciences. This isn't just in the life
sciences. This is in engineering, this is in English
departments who are doing word counts in Shakespeare. This is
something that is really pervasive in the scientific enterprise
as a whole, and this is something that is having ramifications
in the way that we disseminate and communicate science. It is
not a question of personal integrity.
So what this means is, to scope the issue, I think that we
need to think about this issue in terms of reproducibility, so
as Dr. Alberts outlined, open data itself is a very broad
notion. I think this needs to be scoped to data and software
required to reproduce published results, and what that means to
a scientist is clear. There are details, of course, but that is
something that a scientist can understand. This is something
that institutions in the scientific enterprise can understand.
And I reiterate that it is not just about data, it must include
the codes and the software that take that data to the published
results so that those results can be validated and verified.
You mentioned in your opening remarks about statistical
errors, about other issues. I would like to scope the problem
to this computational issue, which I believe is reflected in
the language around this, digital data, and the reason for that
is clarity. As a statistician, I agree with you that there are
lots of statistical errors in the literature that are
being worked out. This is in part because doing computational
work is new to many fields, and I believe the core issue is
sharing data and sharing code. Things like biological
materials or the mathematics and the statistics will
work out as corollary issues. Right now the issue needs to be
scoped on data and code that allow those results to be
understood, validated and reproduced by other members in the
community.
So secondly, I would like to talk about the scope of action
that I think is available and important for you to think about.
The first thing is, as Dr. Alberts outlined, scientists are
very interested in these issues of reproducibility. As we know,
it is a cornerstone. We don't accept scientific findings until
there is replication, until there is validation by other
people--at least that is the theory. And in my testimony, I
included two articles that are in some sense manifestos from
computational scientists calling for greater reproducibility.
The reason computational scientists are banding together and
creating these manifestos is because there is a collective
action problem. It does take time to make your data available
and to make your software available. It is easier to hack
things up on your machine and produce a paper and never really
look at the code or the data in the sense of sharing it. That
does take extra time. So what this means is that scientists who
want to do reproducible research and share the code and data
that replicate their results are at a disadvantage because
they don't receive credit for this right now. They generally
receive credit for the publications. So steps like what Science
magazine has taken with data-sharing requirements and code-
sharing requirements are extraordinary and laudable and very
important. This is Science, though, our highest-impact journal,
and it is much harder for lower-impact journals to demand that
of the authors. But this is where the federal funding agencies
come in as another lever that exerts pressure on scientists and
what they are required to do.
So in these manifestos that I included in my testimony, you
will see computational scientist after computational scientist
calling for help in a broad sense because people who stick
their nose out get it cut off and we need the federal funding
agencies to work in an integrated way to help overcome this
collective action problem.
Now, how does this happen? This happens through the
creation of and financial support for repositories that can
house code and can house data, and this is something that can't
just happen, I don't believe, from added money on grants, on
NIH grants and so on, that are supposed to fund these things in
an ethereal way. I think this is more serious and this is
something that needs to be directly confronted, more similar to
a mandate when you take federal funds for your research.
Now, on standards: as Dr. Alberts mentioned, the Protein Data
Bank and these institutional repositories, and other institutional
repositories, are very important for setting standards. They
come from the community level. I don't believe they come from
the federal level down. But this needs to be addressed and
recognized. There is no point in saying we need to have
reproducibility, we need to share data, we need to share code
when they don't know where to put it and there aren't ways for
people to share it and access it and curate it.
So I will move to questions here.
[The prepared statement of Dr. Stodden follows:]

[GRAPHICS] [TIFF OMITTED] T9929.012 through T9929.037

Chairman Bucshon. Thank you very much.
I recognize Dr. Young for five minutes to present his
testimony.

TESTIMONY OF DR. STANLEY YOUNG,

ASSISTANT DIRECTOR FOR BIOINFORMATICS,

NATIONAL INSTITUTE OF STATISTICAL SCIENCES

Dr. Young. Thank you for the opportunity of testifying.
As an abstract principle, the sharing of research data is a
noble goal and meets with little opposition. However, when data
sharing is attempted in a particular circumstance, the
conflicting interests of the parties can thwart the exchange.
So said Joe Cecil of the Justice Department in 1985.
What is the current status of science in general and data
availability in particular? First, where are we with science
claims? In 2005, John Ioannidis published two papers of
interest. In one, he asserted that 90 percent of the claims
made in science papers are wrong in the sense that they are not
expected to replicate. In another, he noted that five out of
six papers based on observational studies failed to replicate.
I published a paper in 2011 and showed that of 52 hypotheses
suggested from observational studies, none replicated in the
expected direction and five were statistically significant, but
in the opposite direction. Begley and Ellis reported that 47
out of 53 claims made in major science journals failed to
usefully replicate.
Where are we on data sharing? John Ioannidis selected 10
papers from each of 50 of the highest-impact journals--New
England Journal of Medicine, Nature, Science, et cetera--and
asked, is the data used in these papers publicly available?
Overall, only 47 of 500 papers deposited full primary raw data
online. None of the 149 papers not subjected to data
availability policies made their full primary data publicly
available.
I report on two personal experiences. Dr. Beate Ritz of
UCLA made a claim in Environmental Health Perspectives that air
pollution in L.A. county leads to low birth weights. Dr.
Frederica Perera of Columbia University asserted in the journal
Pediatrics that air pollution decreased IQ in children. NIEHS
provided funding for both studies. In both cases, I asked for
the data sets from the authors. I also asked for help from
NIEHS. I resorted to FOI. I received neither data set.
Recently, I was informed that NIEHS does not have the legal
authority to compel an author to provide data that was
funded by them. Operationally, NIH funding, the Shelby
amendment, etc. mean very little with respect to data
availability. Mostly, authors do not provide data sets used in
their publications. It is technically easy to share data used
in publications. Others will discuss reproducible research, so
I will leave that aside.
Just why are we in this situation, where most claims do not
replicate and authors will not make data sets available? In a
long and illustrious career, Edwards Deming made the point that
if a system is failing, it is not the workers' fault--that is,
the scientists'--it is the fault of management, in this case
funding agencies and journal editors. For over 30 years,
workers have been admonished to do their work better and to
make their data sets available. It was reported in Science in
1988 that there were serious problems with observational
studies. Nothing has changed in 25 years.
Congress, funding agencies and journal editors need to step
up and manage the scientific process. They should require
authors to deposit data sets on publication of their papers.
Funding of data set construction and analysis should be
separate. They should require data analysis strategies that
demonstrate reproducibility. For example, any claim should be
replicated in a separate data set before publication. Remember,
the reliability of current scientific claims is only 10 to 20
percent. John Holdren's thing on the Office of Science and
Technology Policy I think is a welcomed thing in this area.
It is not enough to agree with sharing data. It is almost
30 years since Joe Cecil stated the problem. Management should
make the depositing of data sets on publication mandatory. This
is a management problem; it is not a science worker problem.
Thank you very much.
[The prepared statement of Dr. Young follows:]

[GRAPHICS] [TIFF OMITTED] T9929.038 through T9929.039

Chairman Bucshon. Thank you.
I now recognize Mr. Choudhury to present his testimony,
five minutes.

TESTIMONY OF MR. SAYEED CHOUDHURY,

ASSOCIATE DEAN FOR RESEARCH DATA MANAGEMENT

AT JOHNS HOPKINS UNIVERSITY AND HODSON DIRECTOR

OF THE DIGITAL RESEARCH AND CURATION CENTER

Mr. Choudhury. Chairman Bucshon, Ranking Member Lipinski,
Members of the Subcommittee, thank you for the opportunity to
be here today.
I have been asked to address questions related to data
sharing, access and preservation. I would like to do so from
the perspective of infrastructure development. The other
witnesses have already addressed the importance of persistent
scientific data archives for reproducibility. I believe that
strategic investments in data infrastructure also have
important implications for our overall competitiveness.
There are important lessons from our historical
infrastructure development that are relevant as we consider
data sharing, access and preservation. The development of
railroads initially led to systems that served regional
networks but eventually merged into a national network through
a standard track gauge. With the development of automobiles, we
adapted from early mistakes to adjust drivers' behavior through
education, driving rules and seat belts. The development of the
Internet reflects a layered approach of different technologies
connected through a key component in the form of two protocols
known as TCP and IP.
Broadly speaking, successful infrastructure development has
relied on a flexible balance of community and national
approaches, social aspects relating to human behavior, and key
components. In each case, as infrastructure evolved through
community efforts, we reached the point where national
coordination moved us to a more cohesive situation. In previous
cases, the more cohesive infrastructure led to greater societal
benefits from both the private and public sector. I believe we
have reached a similar point with certain aspects of data
infrastructure.
From a policy perspective, the recent Executive Memorandum
from the Office of Science and Technology Policy provides a
useful framework for federal policies that would maximize data
sharing, access and preservation. The memorandum acknowledges
the need for flexibility by federal agencies for the
communities they support balanced with the need for uniform
guidelines when appropriate. There is one specific example that
I will mention in my oral remarks. The memorandum outlines the
need for appropriate data attribution and citation. The method
for meeting this need is the persistent identifier, which is a
long-lasting reference to data. You can think of persistent
identifiers as an improved version of Web site addresses such
as Congress.gov. It is a rough analogy, but the persistent
identifier may be compared to having the same role as track
gauge in the development of railroads.
From an economics perspective, there is a greater need for
understanding of costs. For example, some cost studies focus
only on storage, ignoring related costs such as data center
operations or longer-term costs related to preservation.
Preservation of data ensures that we can extract value for the
long term, noting that with data, preservation issues can arise
in as little as five years. The development of data
preservation infrastructure represents a case where effective
partnerships could be formed between the public sector, private
sector and university sector, in which I include libraries and
national laboratories. It is possible that the private sector
will not focus on data preservation because there are
unresolved research problems, it is unlikely to be profitable,
and it benefits from large-scale coordination. Federal agencies
could provide the funding for research, prototypes and initial
deployment of data preservation infrastructure. The university
sector could then set up production systems that the scientific
community and private sector could exploit for discovery and
profit.
From a technology perspective, it is important to remember
that there are different types of data and different stages of
scientific projects. Consequently, there is a need for a
layered approach to diverse systems spanning individual
researchers to large-scale national projects. Even with this in
mind, it is possible to identify gaps that are common across
this landscape. For example, today's storage systems work well
for many purposes but they do not currently meet some
preservation requirements. It is worth mentioning that some
storage companies view this situation as an opportunity for
co-development with the university sector.
From a non-technical perspective, scientists do their best
to manage their data but they do not always have a full
understanding. Raising awareness and reinforcing the importance
of data sharing, access and preservation will be important.
This type of awareness building and education is similar to the
adjustment of automobile drivers' behaviors over time.
In conclusion, I believe that we have an important
opportunity to advance our data networks into more cohesive,
large-scale infrastructure that will advance the scientific
process and generate benefits for the public sector, industry
and the scientific community.
I thank you again for the opportunity to be here, and I
look forward to answering your questions.
[The prepared statement of Mr. Choudhury follows:]

[GRAPHICS] [TIFF OMITTED] T9929.040 through T9929.045

Chairman Bucshon. Thank you very much, and I thank all the
witnesses for their testimony, reminding Members that the
Committee rules limit questioning to five minutes. The Chair
will at this point open the round of questions. The Chair
recognizes himself for five minutes.
As a cardiothoracic surgeon, I am very interested in this
issue because I have to translate what is written into clinical
practice, and so this type of issue really does affect real
people. I can tell you the difficulty that people like me have
in figuring out when to change your clinical practice, when you
are doing something that turns out wasn't the right thing to
do, it is a very difficult process that is ongoing, so I am
very interested in this particular subject.
I will start with Dr. Young. Could you give me some
examples of where State and federal regulations were made
without public release of data used to make those regulations?
Dr. Young. Yes. I have taken an interest pro bono in air
pollution questions, and an expert in the area worked with me
and we developed 100 papers that are key papers in that area.
Then being a statistician, I selected 50 of those papers at
random and asked the authors for the data sets. I received no
data sets at all. Many of these data sets were funded by the
Federal Government and there are many regulations that are
based on these data sets. They are key data sets. For the most
part, these data sets are not available.
Chairman Bucshon. Just so you know, I had the same problem
getting the data out of the Federal Government. It can be an
issue.
Mr. Choudhury, could you tell me what specific
infrastructure and technology requirements there are for the
storage of scientific research data?
Mr. Choudhury. There are several layers that are necessary
to actually preserve scientific data. It begins with storage,
which is basically just the bits residing on a hard disc or a
tape or even in the cloud, but eventually we also need to do
things to ensure data protection. We then also need to
do things to ensure that we can migrate the data over time, so
as we start to use new storage systems or if we have new file
formats, we have to be able to move those data into those new
environments. As Dr. Stodden mentioned, we also need to have
access to the software or the tools that process the data
because in many cases, it is not sufficient just to get access
to the data alone. So the actual preservation of the data is
this complex set of layers that go beyond storage. Storage is
necessary but it isn't sufficient. So we have to do all these
other things to understand the context and the reusability of
the data as well.
Chairman Bucshon. Do you think currently that university
libraries or national laboratories are equipped for this type
of infrastructure?
Mr. Choudhury. At Johns Hopkins, we have taken an approach
of looking at two stages. The first is prior to investigators
submitting proposals--they need some sort of consultation and
support to develop their data management plans. In this
respect, I do believe that the university sector, and
particularly university libraries, have stepped up very well. I
think most research university libraries are providing that
kind of consultation to their investigators.
The second stage is that once an award is made, then we
actually have to handle the data and we actually have to start
preserving it for the long term. In this respect, there is a
subset of that library community that has come forward to help
provide that kind of support, and then there is the long-term
preservation need, and even there, it is a smaller subset
again. It is in the preservation of the data where I think
there remain some research questions; ultimately, when
they are addressed, the support can migrate into the
university library sector.
Chairman Bucshon. Great. Dr. Alberts, on February 11, 2011,
in a Science magazine editorial, you write, ``We will ask
authors to provide a specific statement regarding the
availability and curation of data as part of their
acknowledgments requesting that reviewers consider this as a
responsibility of the authors.'' Do you think this self-
policing policy works in practice?
Dr. Alberts. We find that it has been working for Science
magazine. Our senior author, deputy editor, Brooks Hanson, has
been deeply involved in this. On rare occasions we have had to
make authors do things that they should have done themselves
but I guess we are fortunate we have the threat, which is, we
are not going to publish any more papers from you, and they
want to publish in Science magazine, and as Victoria said, not
every journal can make that threat. So I think this is a very
important issue to emphasize. We haven't talked about the
fact--I am a biochemist, and I had lots of data from my
laboratory when I was an active scientist. Not all of it should
be preserved. I mean, if I tried to preserve everything, I
couldn't find anything. So we also need different fields to
decide what it is that we really need to preserve and make
available. There is so much material being collected now that
it is really important to get standards for different fields of
what needs to be preserved and what needs to be put in your
publication.
Chairman Bucshon. Great. Thank you all. I now yield to Mr.
Lipinski from Illinois.
Mr. Lipinski. Thank you. I wanted to start out by saying I
am sort of going back to my days as a social scientist and
thinking about not just the research I did and the data that I
had but also thinking about behavior, and it is--there are not
rewards generally for having--someone had mentioned, I think
Dr. Stodden, that you are rewarded for a result in a
publication but you are not rewarded--the rewards aren't there
to spend the time and the effort to have the data in a format
even that is accessible to others, and if you are talking about
going further than that, how exactly you went through and you
analyzed the data. I can't tell you how much paper I had
printed out of different ways, all these different models that
I ran and trying to keep track of all that. So it is not simple
to do and there have to be incentives. So somehow the culture
has to be changed. And the question is, how do we change that
culture? Now, the National Science Foundation requires that you
have a data management plan when you are applying for a grant,
so the NSF puts that in there.
My question is, in a short period of time if you can do it,
how do we change this, and should this be a situation where it
is data available upon request or should it all be available?
Should it be put out there published somewhere or put on a site
that everyone can access? And how far do we go with the data?
Is it, okay, this is how I analyze it, this is the statistical
package I used, this is how exactly I did it. So let me start
with Dr. Stodden. I mean, what is your quick sort of suggestion
on it for your 30,000 foot? What would you do if you could?
Dr. Stodden. So I think the efforts that have been taken so
far are really this on request and so on, and there are a
number of experiments and studies, and Dr. Young mentioned a
couple, where that doesn't seem to work as well. You don't
simply get the response. So I think it is time to move forward
to this being a standard. Now, having said that, as Dr. Alberts
said, there are data sets and problems of different importance,
and you can imagine investing a lot more time curating a data
set that has broad use and applicability and might underlie 50
or 100 studies and so on versus a one-off. But the change is
really something that I believe scientists are willing to do
and are working on standards. For example, in economics this is
a very forward-thinking community and many of the journals have
standards and they do engage in data sharing and code sharing
but not even as much as they would like. And so I think the
complexity of the problem means that it really is not a one-
size-fits-all solution. As you mentioned, it is something that
comes from the field.
But I would suggest that this is a standard: it should
be understood that the code and the data go open for
reproducibility. And changing the culture is something
scientists are talking about. There is a special issue I can
point you to in Computing in Science and Engineering that is
called Changing the Culture, and it is about giving these
rewards. So as Dr. Choudhury mentioned, having these persistent
identifiers allows citation for data and for code. NSF's steps
towards allowing scholarly objects like data and code to be
listed on the biosketch, and not just publications, are a real step in this
direction, and I think the scientific community will sort out
how it values data contribution and code contribution and
publication contribution. They may not be all valued equally
but we have a long history of doing this. Not all publications
are valued equally. But I think that bringing this through
citation and having citation standards is a way to really
change the culture and reward people.
And I will add one last point, which is there is a
generational difference here because with these changes in
technology, for young people and young scientists and people who
want to go into research, it is very natural to share
data and to share code, and it is discouraging for them to
enter a situation where suddenly this is not the norm. So this
is something where I think there is also this opportunity that
the culture is changing naturally on its own just with time as
younger people come in and have these expectations for sharing
what they are doing digitally. And so that is also something to
capitalize on. And again, I go back to the testimony in that
there is this collective action problem because, as you
mentioned, it takes time, and so something particularly from
federal agencies that can help push through that is really very
important.
Mr. Lipinski. I thank you. My time is up. I yield back.
Chairman Bucshon. I now yield to Mr. Stockman for five
minutes.
Mr. Stockman. I have a question for Dr. Alberts. My wife is
a NASA privacy officer, and I want to follow up on something
the Chairman related. In February in your editorial, you wrote,
``We recognize that exceptions may be needed to these general
requirements for sharing data, for example, to preserve the
privacy of individuals or in some cases when data or materials are
obtained from third parties and for security reasons but we
accept those rare exceptions.'' Is this your view today?
Dr. Alberts. For example, we had an experience with a
Department of Energy lab where they weren't allowed to give us
the code because presumably it had some security implications.
So we do encounter those one-off occasions. But they have been
rare. So we have to live with the law, and we try our best to
do what we can.
Mr. Stockman. Do you see other exceptions?
Dr. Alberts. Not that--I don't know of any exceptions since
that policy was made.
Mr. Stockman. Okay. The other question I have is for all
the witnesses. Many of you today also practice science. You are
also members of the United States scientific community. You
have been a world leader in producing first-class research. How
do you envision the mechanism of enforcing the sharing of data
without hindering the process of scientific discovery and
simultaneously minimizing the administrative burden of a
scientist? Because I know a lot of professors a lot of the
time fill out more paperwork than they do research. If
you could each just go quickly through the----
Dr. Alberts. Well, I think Victoria said it right. We need
to mobilize our communities. I mean, I am a cell biologist and
the American Society for Cell Biology used to help us. What does
it mean for our community, and we have to take responsibility
for it, and it is going to be different for statisticians.
Different people will have different requirements and it has to
make sense, and I agree with you that it has gone way overboard
now at universities. Every time I want to do anything, I have
to fill out a form. So I think we should try to avoid
legislating more flat requirements. You know, if I want to
interview students, graduate students at UCSF about their
career options, I have to fill out a 50-page human use form.
It drives me nuts. So this Committee might work on pushing back
on some of the meaningless paper and get some requirements that
are more meaningful.
Dr. Stodden. That is a great question, and I think it goes
back to these issues of reproducibility. If you are publishing
a paper where you claim that data and code are out there and
available for it to be reproducible, then that is in a sense
the starting point of standards in a community. Now, as Dr.
Alberts mentioned, this will change for different communities
and different research problems and they can be quite
different, but there needs to be this expectation that the
results, the computational results will be reproducible and
then when you go and get your hands dirty and you try and do
the reproducibility, then if it doesn't work or it does work,
then that is value too in the community, and I think that
scaffolding and that framework is really there. It is a
question of moving towards this default of openness rather than
the default of being closed and then you request and so on, and
as I was mentioning to Ranking Member Lipinski, the default
needs to be open, and then as you mentioned, we have exceptions
for confidentiality and so on but those are the exceptions, and
then the standard is really about reproducibility.
Dr. Young. The first thing to keep in mind is that many
estimates say that 80 to 90 percent of the claims that appear
in scientific papers are wrong in the sense that they will not
replicate. So I would focus on cost per valid result.
Additional costs can be put into reproducible research and
things like that. The total number of claims that are checked
will go down but the number of valid claims can easily go up if
we do our research better. Thank you.
Mr. Choudhury. I think one thing that is becoming clear is
data management is a complex and demanding set of activities on
its own. It may not be reasonable to expect scientists to
conduct their own data management but rather work with a set of
professionals who sit somewhere between the domain sciences,
say, library information science. So I think there is a
workforce development issue here. We don't expect scientists to
be experts in IT systems or other kinds of systems. We provide
support for them, and I think data management may be in that
category.
Mr. Stockman. Thank you. I yield back.
Chairman Bucshon. I now recognize Mr. Bera from California.
Mr. Bera. Thank you, Mr. Chairman.
Now, to start off with, I would want to make sure we don't
give the impression that our scientific community and our
research institutions are producing faulty data. We maintain a
competitive advantage. As a scientist myself, as someone who
spent countless hours in the lab as a medical student and has
spent time as a faculty member and associate dean at the
University of California-Davis, working with our medical
students and our resident physicians, we maintain a competitive
superiority in our research institutions, and I think Dr.
Alberts touched on the importance of the federal investment in
our research institutions. We also need to recognize our
journals and particularly our leading peer review journals.
There is a rigorous process having again submitted articles and
worked with countless students that you go through as you are
submitting articles. Replicability is an important component
but also putting the information out there so others can look
at it and provide feedback is very important. So we want to be
conscious of that as well.
As we set up our research institutions, we often are doing
it and our trials are in a very transparent way, you know,
funding multi-center trials. When we look at major projects
like the Human Genome Project, as we talk about brain mapping,
we will set that up in as transparent a way as possible using
multiple of our institutions. And it isn't always just about
replicability. It is about sharing that data and working
together, but at the same time--and my question is this--as we
move into this era of wanting to share data, we also have to
maintain our competitive advantage. We do have competitor
nations that every day are trying to get to our data and get to
the research institutions. We talk about cybersecurity on this
Committee. We need to be very conscious of what we are putting
out there as well.
I would direct a question to Dr. Alberts. You talked about
the importance of research funding as well as the threats to
research funding in our academic institutions. Why don't you
touch on that, and then if the rest of the panel wants to talk
about how we move forward in kind of an open, transparent way
but maintaining our competitive advantage and protecting those
discoveries that we are making.
Dr. Alberts. As I wrote in my written testimony, I referred
to this major project from the National Academy of Sciences
when I was president to explain to Congress and the public how
fundamental knowledge produces breakthroughs. The first
pamphlet we produced was on the global positioning system.
Somewhere started with the fact that physicists invented atomic
clocks. They won a Nobel Prize but everybody thought it was
useless because it enabled us to keep time to a billionth of a
second, and why should we want to do that. Well, you follow
this progression, and I recommend that whole series. It is
still up on the Web. That combined with many other findings of
knowledge about the world enabled us to put up these 24
satellites that produce this wonderful device that we all use
and the military uses, and we did that over and over.
And what has been true in the United States, remarkably,
and I don't think people recognize this, we have been a magnet
for the most talented people from all around the world coming
here, and you just look at Silicon Valley and places like that.
So if we don't keep our leading position as scientific
research, a place to come to, our universities, then those
people won't come here and they won't subsequently contribute
their genius to the American economy and the American strength
of our Nation. So I am quite worried right now because many
other countries, China, for one, they see this very clearly.
This is where we have our competitive advantage and they are
trying to gain it, and if we don't pay attention to that, I
think we are going to lose this game. We are taking it for
granted that all these great people are going to come to this
country but they are not going to do that anymore if we are not
the best place to do research.
Dr. Stodden. So I couldn't agree with your comments more,
and also with Dr. Alberts that American science is absolutely
superb, and as evidence of this, I believe our discussion today
actually reflects the high integrity and the honesty of that
community in trying to grapple with these problems. I mean,
these manifestos and so on I put in the testimony here, these
are scientists who are concerned about the quality of the
science and trying to fix it. This is not anything other than
the highest-integrity profession.
I also want to make one quick comment about corollary
benefits of open data, going back to your earlier point, which
is, you probably gathered by now that I think reproducibility
is important but there are also issues in terms of access to
the technology. So if you have the ability, the software tools
and the data to replicate those results and those findings, not
only can you therefore build on them more easily as well as
validating them but it also opens them to industry and to
others who can then capitalize on this for commercial use. I
mean, whatever they see as appropriate. So it opens all of
these avenues towards economic growth that can't be overlooked
that are extremely important.
And to your point about, well, what if open data helps our
competitors, I think that there is a long history in the United
States of being able to capitalize on this and move ahead, and
I don't think that maintaining a closure around our scientific
enterprise does anything but restrict American enterprise and
competitiveness internationally and also threaten the integrity
of our results. I mean, science moves forward, as Dr. Alberts
mentioned, through skepticism and through questioning and
through transparency and openness, and being able to share
those methods and giving others the tools to replicate and also
build on, commercialize, capitalize on all of this, I think is
an avenue towards economic growth and an avenue towards STEM
understanding too. When it is open, you can imagine smart high
school kids getting their hands on this stuff and figuring
things out and playing with it, and that is very real.
Chairman Bucshon. Thank you. I now yield to Ms. Lummis for five
minutes.
Ms. Lummis. Thank you, Mr. Chairman.
Now, my first question is for any of you who cares to
answer. It is about OSTP guidance. My question is, do you think
that the guidance provides appropriate flexibility to agencies
in developing plans to improve access to federally funded
research?
Dr. Young. Stan Young. I read the guidelines very
carefully. I think they are a major advance forward. The
history is that if scientists are not compelled to make their
data sets available, they generally don't make it available.
The American Psychological Association, for example, just
started a huge effort on reproducibility. Their journals, there
are 50 of them, have the author sign a paper saying I will make
my data set available. Studies have shown that two-thirds of
the authors that have signed those statements do not make their
data sets available, so I think there is--some scientists are
great. In general, there is no data sharing.
Mr. Choudhury. I do think the memorandum provides a good
deal of flexibility for federal agencies and the communities
they support. I do think it is also important to think about
those opportunities where something may be uniform across
different agencies. Another example that I would give is the
memo talks very clearly about enforcing data management plans.
Well, most reviewers in these early days don't even know what
constitutes a good data management plan, so I think providing
guidelines to reviewers about what constitutes a rigorous data
management plan would be a very important thing that any
federal agency could do, and it would, of course, be customized
to their communities.
Ms. Lummis. Well, I had an experience like you have
mentioned with the greater Yellowstone interagency brucellosis
committee where we were trying to get data on elk and the
transmission of brucellosis from elk to bison, bison to
domestic livestock, and it was tremendously important because
we finally have that disease pretty well isolated to the
greater Yellowstone area after trying for, what, almost 100
years now to isolate it because it does--it used to be
prevalent in milk cows, but after years of destroying entire
herds of dairy cattle, we finally have that disease isolated to
the greater Yellowstone area. But it is raising havoc, and
there was a woman who was an employee of Yellowstone National
Park who gave her entire career paid by the taxpayers to
studying elk and she would not share her data with us. I mean,
she was taxpayer funded. So I have had personal experience with
your frustrations here.
Another question. Could you comment on the difference
between what has been written in statute versus what is
happening in practice regarding obtaining data in federally
funded research, you know, any of you in your experience?
Dr. Young. I have a lot of experience asking for data sets,
and I will call out the country of Finland. Every time I ask a
scientist in Finland to send me a data set, I get it in return
email. Given the electronic age that we are in, it is
reasonably easy to pass data sets around. My experience in the
United States is not nearly so good. I mentioned requests for
50 data sets in the area of air pollution, and I got none. The
psychologists know very well that data sharing, even though it
is compelled by their journals, it is not done there. There is
a huge difference between what beautiful-thinking people say
about sharing data, and then Joe Cecil is right. In practice,
quite often it is to the advantage of the person that holds the
data not to share it, and so there is a real problem and a
difference. NIEHS or NIH, for example, has a wonderful
data-sharing policy. However, they have no legal authority to compel
anyone to share data, and so many times I have gone all the way
up through very high levels of the NIH asking for data sets and
have not gotten them. So the practice is very different from
the publicity.
Dr. Stodden. I would like to just reiterate Stan's point
there. Both NIH and NSF grant guidelines require data sharing,
and even encourage software sharing, and these have been around
for at least a decade, and it seems to be unenforceable. And so
when the Executive Memorandum talked about mechanisms for
enforceability, I found that very exciting because, like Stan
says, things can be on paper and then without that enforcement,
then things don't proceed, and that, I think, is a real bridge
to breaking the collective action problem and providing those
incentives for sharing and rewarding scientists to do this.
Ms. Lummis. Thank you, panel. My time is up, so I will
yield back to the Chairman.
Chairman Bucshon. Thank you. I now yield to Mr. Palazzo for
five minutes.
Mr. Palazzo. Thank you, Mr. Chairman.
Dr. Stodden, allowing open access to federally funded
scientific data may also create new business opportunities.
What are your thoughts on this issue?
Dr. Stodden. I think the evidence is clear, and one of the
reasons that scientific research is funded by the Federal
Government is because we can discover scientific facts and
inventions and so on that then can, among other things,
undergird economic growth through these creations of
opportunity for industry. So something like economic open data
and open methods that allow reproduction of these discoveries,
I don't think it can help but fuel economic growth in the sense
that you can take these discoveries--scientists don't develop
things for market. They don't do commercialization or full
development, particularly not of software and so on. And then
it is perfectly plausible that these can be taken out and
developed into products and taken to market if that is viable,
and I think that that is something that is a very compelling
reason behind open data and open code.
Mr. Palazzo. Do you have any examples of products and
services that companies may be able to offer?
Dr. Stodden. So, for example, some of my background is in
image processing and working on standards like the JPEG 2000
standard. So this came out of academic research on how to do
image compression and then that is released openly with open
code, and that is something that can be implemented and become
standard in the Web for faster loading of Facebook or whatever
it is or Flickr or whatnot, and it is these types of things
that are done in the scientific labs and then sometimes, as Dr.
Alberts said, you don't even see the end application. You are
making these discoveries and then it takes ingenuity and
industry to then turn it into different other applications, but
this happens absolutely all the time.
Mr. Palazzo. And I think you mentioned this in your
testimony, that it is definitely a potential economic growth
area for our country?
Dr. Stodden. Absolutely.
Mr. Palazzo. Now, on the flip side, allowing open access to
federally funded scientific research and the impact, or what
would be the impact on intellectual property rights,
innovation, U.S. competitiveness and things of that nature?
Dr. Stodden. That is a great question, and it has,
unfortunately, a complex answer that I tried to touch on in my
testimony. The intellectual property structure that affects
scientists was not designed for science, and there are two
principal ways that it touches scientific output, and one is
copyright and the other is patents, and copyright is something
that works against--in the scientific context that works
against openness in the sense that a scientist who produces
code or produces other copyrighted outputs like a paper, I
actually would need to give you explicit permission to do this.
The default is not openness. So this is something I mentioned
in my testimony, that maybe this is something that we need to
rethink how the intellectual property system interacts with
scientists who have a completely different normative structure to
say, for example, a poet or someone creating a movie or
something like this, it is a very different model.
The other way that it interacts is through patents, and
this is largely around inventions, not touching so much the
computational work that we have been discussing today but
software is patentable, and I can imagine--and this is actually
increasing now, that patentable code is something that is
coming out of the academic institution. So I think this is
something that we need to think about very carefully. If you
think back to 1980 and Bayh-Dole, this was something that was
put into place to encourage transparency, the idea being that
giving these intellectual property rights to institutions would
then allow them to patent and give them this incentive, a
financial incentive, to be open. Now if we have standards of
reproducibility where code is open and data is open, it doesn't
make sense to have that same incentive to patent because it
actually becomes more of a barrier because in 1980, no one
imagined you would just go to a repository or GitHub or
whatnot and click and get the code. It had to be this whole
thing through a tech transfer and so on, which is completely
different and now that is the barrier. So I think there is some
careful thinking that needs to happen in terms of IP and also
around how we collaborate with industry too. Industry has very
fruitful collaborations with academia, and those need to be
worked out in terms of what intellectual property remains over
the scientific output so that industry has--essentially they
can sort of get some return on their investment.
Mr. Palazzo. I yield back, Mr. Chairman.
Chairman Bucshon. Thank you very much. I would like to
thank all the witnesses for their valuable and very interesting
testimony and the Members for their questions. The Members of
the Committee may have additional questions for you, and
we will ask you to respond to those in writing. The record will
remain open for two weeks for additional comments and written
questions from Members.
The witnesses are excused and the hearing is adjourned.
Thank you, everyone.
[Whereupon, at 11:06 a.m., the Subcommittee was adjourned.]

Appendix I

----------

Answers to Post-Hearing Questions

Responses by Dr. Bruce Alberts

[GRAPHICS] [TIFF OMITTED] T9929.047-T9929.051

Responses by Dr. Victoria Stodden

[GRAPHICS] [TIFF OMITTED] T9929.052-T9929.057

Responses by Dr. Stanley Young

[GRAPHICS] [TIFF OMITTED] T9929.058-T9929.063

Responses by Mr. Sayeed Choudhury

[GRAPHICS] [TIFF OMITTED] T9929.064-T9929.073