[House Hearing, 113 Congress]
[From the U.S. Government Publishing Office]



 
                 SCIENTIFIC INTEGRITY AND TRANSPARENCY

=======================================================================

                                HEARING

                               BEFORE THE

                        SUBCOMMITTEE ON RESEARCH

              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY
                        HOUSE OF REPRESENTATIVES

                    ONE HUNDRED THIRTEENTH CONGRESS

                             FIRST SESSION

                               __________

                         TUESDAY, MARCH 5, 2013

                               __________

                           Serial No. 113-10

                               __________

 Printed for the use of the Committee on Science, Space, and Technology


       Available via the World Wide Web: http://science.house.gov



                  U.S. GOVERNMENT PRINTING OFFICE
79-929                    WASHINGTON : 2013
-----------------------------------------------------------------------
For sale by the Superintendent of Documents, U.S. Government Printing Office, 
http://bookstore.gpo.gov. For more information, contact the GPO Customer Contact Center, U.S. Government Printing Office. Phone 202-512-1800, or 866-512-1800 (toll-free). E-mail, [email protected].  


              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY

                   HON. LAMAR S. SMITH, Texas, Chair
DANA ROHRABACHER, California         EDDIE BERNICE JOHNSON, Texas
RALPH M. HALL, Texas                 ZOE LOFGREN, California
F. JAMES SENSENBRENNER, JR.,         DANIEL LIPINSKI, Illinois
    Wisconsin                        DONNA F. EDWARDS, Maryland
FRANK D. LUCAS, Oklahoma             FREDERICA S. WILSON, Florida
RANDY NEUGEBAUER, Texas              SUZANNE BONAMICI, Oregon
MICHAEL T. McCAUL, Texas             ERIC SWALWELL, California
PAUL C. BROUN, Georgia               DAN MAFFEI, New York
STEVEN M. PALAZZO, Mississippi       ALAN GRAYSON, Florida
MO BROOKS, Alabama                   JOSEPH KENNEDY III, Massachusetts
RANDY HULTGREN, Illinois             SCOTT PETERS, California
LARRY BUCSHON, Indiana               DEREK KILMER, Washington
STEVE STOCKMAN, Texas                AMI BERA, California
BILL POSEY, Florida                  ELIZABETH ESTY, Connecticut
CYNTHIA LUMMIS, Wyoming              MARC VEASEY, Texas
DAVID SCHWEIKERT, Arizona            JULIA BROWNLEY, California
THOMAS MASSIE, Kentucky              MARK TAKANO, California
KEVIN CRAMER, North Dakota           VACANCY
JIM BRIDENSTINE, Oklahoma
RANDY WEBER, Texas
CHRIS STEWART, Utah
VACANCY
                                 ------                                

                        Subcommittee on Research

                   HON. LARRY BUCSHON, Indiana, Chair
STEVEN M. PALAZZO, Mississippi       DANIEL LIPINSKI, Illinois
MO BROOKS, Alabama                   ZOE LOFGREN, California
STEVE STOCKMAN, Texas                AMI BERA, California
CYNTHIA LUMMIS, Wyoming              ELIZABETH ESTY, Connecticut
JIM BRIDENSTINE, Oklahoma            EDDIE BERNICE JOHNSON, Texas
LAMAR S. SMITH, Texas


                            C O N T E N T S

                         Tuesday, March 5, 2013

Witness List

Hearing Charter

                           Opening Statements

Statement by Representative Larry Bucshon, Chairman, Subcommittee 
  on Research, Committee on Science, Space, and Technology, U.S. 
  House of Representatives
    Written Statement

Statement by Representative Daniel Lipinski, Ranking Minority 
  Member, Subcommittee on Research, Committee on Science, Space, 
  and Technology, U.S. House of Representatives
    Written Statement

                               Witnesses:

Dr. Bruce Alberts, Editor-in-Chief, Science Magazine and 
  Professor Emeritus of Biochemistry and Biophysics, University 
  of California - San Francisco
    Oral Statement
    Written Statement

Dr. Victoria Stodden, Assistant Professor of Statistics, Columbia 
  University
    Oral Statement
    Written Statement

Dr. Stanley Young, Assistant Director for Bioinformatics, 
  National Institute of Statistical Sciences
    Oral Statement
    Written Statement

Mr. Sayeed Choudhury, Associate Dean for Research Data Management 
  at Johns Hopkins University and Hodson Director of the Digital 
  Research and Curation Center
    Oral Statement
    Written Statement

Discussion

             Appendix I: Answers to Post-Hearing Questions

Dr. Bruce Alberts, Editor-in-Chief, Science Magazine and 
  Professor Emeritus of Biochemistry and Biophysics, University 
  of California - San Francisco

Dr. Victoria Stodden, Assistant Professor of Statistics, Columbia 
  University

Dr. Stanley Young, Assistant Director for Bioinformatics, 
  National Institute of Statistical Sciences

Mr. Sayeed Choudhury, Associate Dean for Research Data Management 
  at Johns Hopkins University and Hodson Director of the Digital 
  Research and Curation Center


                 SCIENTIFIC INTEGRITY AND TRANSPARENCY

                              ----------                              


                         TUESDAY, MARCH 5, 2013

                  House of Representatives,
                                   Subcommittee on Research
               Committee on Science, Space, and Technology,
                                                   Washington, D.C.

    The Subcommittee met, pursuant to call, at 10:01 a.m., in 
Room 2318 of the Rayburn House Office Building, Hon. Larry 
Bucshon [Chairman of the Subcommittee] presiding.

                     U.S. HOUSE OF REPRESENTATIVES

              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY

                        SUBCOMMITTEE ON RESEARCH

                            Hearing Charter

                 Scientific Integrity and Transparency

                         Tuesday, March 5, 2013
                        10:00 a.m. to 12:00 p.m.
                   2318 Rayburn House Office Building

Purpose

    At 10 AM on Tuesday, March 5, 2013, the Subcommittee on Research 
will hold a hearing titled Scientific Integrity and Transparency. This 
hearing will provide Members an opportunity to understand the problem 
of access to underlying data from published research funded by the 
federal government, and why access to this underlying data is vital to 
scientific integrity and transparency for peer reviewed research. On 
March 29th, 2012 the Investigation and Oversight Subcommittee held a 
hearing entitled, ``Federally Funded Research: Examining Public Access 
and Scholarly Publication Interests.'' \1\ The focus of this past 
hearing was on open access to publications, whereas the focus of this 
hearing is on open access to data used in federal research.
---------------------------------------------------------------------------
    \1\ http://science.house.gov/hearing/subcommittee-investigations-and-oversight-hearing-examining-public-access-and-scholarly
---------------------------------------------------------------------------
Witnesses

      Prof. Bruce Alberts, Professor of Biochemistry, 
University of California San Francisco

      Prof. Victoria Stodden, Assistant Professor of 
Statistics, Columbia University

      Dr. Stanley Young, Assistant Director for Bioinformatics, 
National Institute of Statistical Sciences

      Mr. Sayeed Choudhury, Associate Dean for Research Data 
Management at Johns Hopkins University and Hodson Director of the 
Digital Research and Curation Center

Overview

    The bedrock of the scientific process is the ability to replicate 
the experimental claims made by researchers. These claims include both 
the generation of data and the analysis of data by computer software 
and code. Scientists rarely reproduce the work of others since they 
neither have the time nor the resources to reliably replicate the work 
of their colleagues; instead, they often trust these claims and rely on 
the peer review process and their colleagues to share their data and 
analysis methods when needed. This exchange allows for scientists and 
companies to exploit the latest insights to develop new directions in 
their research, and allows them to maximize the impact of federal 
research investment. Thus, scientific progress cannot occur unless 
there is a strong culture of integrity and transparency.
    Unfortunately, the current system has demonstrated several flaws. 
The current incentive system rewards researchers who publish in 
journals, but preparation of data for others' use is not an important 
part of this reward structure. The process of peer review, which the 
scientific community views as its primary means to check scientific 
integrity in journal publications, oftentimes does not try to replicate 
the results of submitted papers. Fellow researchers conducting the peer 
review for publication rarely ask for the original data of the 
submitted paper they are reviewing, and focus instead on whether the 
claims made in the paper are plausible. They simply assume the 
underlying data is valid. According to a recent study by Young and Karr, 
upwards of 90% of clinical trial claims for new medicines cannot be 
replicated. \2\ The inability to replicate published results is not 
unique to clinical trials and occurs across scientific disciplines. \3\
---------------------------------------------------------------------------
    \2\ http://science.house.gov/sites/republicans.science.house.gov/files/documents/hearings/HHRG-112-SY20-WState-SYoung-20120203.pdf
    \3\ ``Again, and again, and again.'' p1225 Science Vol 334 2 
December 2011
---------------------------------------------------------------------------
    This hearing will attempt to understand the scope of the problem 
with scientific integrity, especially how thoroughly researchers deal 
with underlying data. This issue of scientific integrity should be 
differentiated from cases of scientists knowingly and intentionally 
committing scientific fraud, fabricating data, or plagiarizing, though 
these might be inter-related depending on individual circumstances. 
This hearing will focus primarily on how data is collected, shared, and 
analyzed by the scientific community and policies for what, how, and 
when federally funded research data should be shared, as well as the 
cost of making this data available to the scientific community and 
public. Current federal laws governing the sharing of data include the 
Data Access Act (DAA) of 1999 and the Information Quality Act (IQA) of 
2001. \4\ Introduced by Senator Richard Shelby, the DAA (sometimes 
known as ``the Shelby Amendment'' within the science community) 
requires that data from federally funded research be made available 
under the Freedom of Information Act procedures. The IQA requires the 
OMB to issue regulations for ensuring the quality and integrity of all 
information disseminated by federal agencies. However, the Government 
Accountability Office reported in September 2007 that federal agencies 
rarely monitor whether researchers make data available. \5\
---------------------------------------------------------------------------
    \4\ National Research Council, Ensuring the Integrity, 
Accessibility, and Stewardship of Research Data in the Digital Age 
(Washington, DC: National Academy Press), 2009.
    \5\ http://www.gao.gov/products/GAO-07-1172
---------------------------------------------------------------------------
    In response to these issues, the Office of Science and Technology 
Policy (OSTP) released guidance to federal agencies on February 22nd 
about increasing access to the results of federally funded scientific 
research, which includes a discussion about access to non-classified 
digital data. In this memo, OSTP outlines the following principles for 
federal funding agencies to follow when issuing a data access plan \6\:
---------------------------------------------------------------------------
    \6\ http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp-public-access-memo-2013.pdf
---------------------------------------------------------------------------

      Maximize access to scientific data created with federal 
funds;

      Ensure that researchers develop data management plans, 
and allow inclusion for costs in proposals along with proper 
evaluations of these proposals;

      Include mechanisms to ensure compliance with data 
management plans and policies;

      Promote the deposit of data in publicly accessible 
databases;

      Encourage cooperation with the private sector to improve 
data access and compatibility;

      Develop approaches for identifying/providing appropriate 
attribution to data sets;

      Support the training, education and workforce development 
related to data management; and

      Provide assessment of long-term needs for the 
preservation of scientific data.

    This hearing will address how such principles might best be 
implemented by federal research agencies and members of the scientific 
community conducting such research.
    Chairman Bucshon. The Subcommittee on Research will now 
come to order.
    Good morning. Welcome to today's hearing entitled 
``Scientific Integrity and Transparency.'' In front of you are 
packets containing the written testimonies, biographies and 
Truth-in-Testimony disclosures for today's witness panel. I 
recognize myself for five minutes for an opening statement.
    I want to welcome everyone to today's Research Subcommittee 
hearing on the issue of scientific integrity and transparency.
    An editorial in the March 29, 2012, edition of Nature 
magazine was entitled: ``Must try harder: too many sloppy mistakes 
are creeping into scientific papers. Lab heads must look more 
rigorously at the data and at themselves.'' I found this 
editorial particularly interesting because of my background as 
a cardiothoracic surgeon and my professional interest in 
medicine. The editorial goes on to cite a recent study 
contained in this specific issue by Glenn Begley and Lee Ellis, 
which analyzes the low number of cancer research studies that 
have been converted into clinical success, and concludes that a 
major factor is the overall poor quality of published 
preclinical data. This is one of the many similar studies that 
I have read.
    The growing lack of scientific integrity and transparency 
has many causes but one thing is very clear: without open 
access to data, there can be neither integrity nor transparency 
from the conclusions reached by the scientific community. 
Furthermore, when there is no reliable access to data, the 
progress of science is impeded and leads to inefficiencies in 
the scientific discovery process. Important results cannot be 
verified, and confidence in scientific claims dwindles.
    The Federal Government is the main sponsor of basic 
scientific research, with over $140 billion spent in fiscal 
year 2013. The American scientific community has made enormous 
contributions in many scientific fields from federally 
sponsored research. I believe our Nation's scientists will 
continue to develop the breakthrough discoveries and 
innovations of tomorrow. However, scientists receiving federal 
funding need to be accountable and responsible stewards of 
taxpayers' resources. Hardworking Americans trust our 
scientists to be genuine and authentic in the way they conduct 
and share federally funded research.
    The focus of this hearing will be on scientific research 
data funded by the Federal Government. There are key issues 
that data-sharing policies should address including what is 
data, how it should be shared, when it should be shared, and 
what potential costs might result in making this data available 
to the research community. We want to maximize access to data 
while protecting personal privacy, avoid any negative impact on 
intellectual property rights and innovation, and preserve data 
without ridiculous cost or administrative burdens.
    In an attempt to begin addressing this issue, the Office of 
Science and Technology Policy released guidelines on February 
22nd of this year that recognized the problem of data access. 
These guidelines, intended for federal science agencies, are to 
be followed when determining a policy for public access to 
scientific data in digital formats. As part of this hearing, I 
look forward to hearing the witnesses' opinions on these 
federal guidelines.
    Our witnesses today offer input from a variety of 
scientific fields, as this problem is not exclusive to one 
scientific field, community or discipline. I would like to 
thank them for coming and taking the time to offer their 
expertise. I would also like to thank Ranking Member Lipinski 
and everyone else participating in today's hearing.
    [The prepared statement of Mr. Bucshon follows:]

              Prepared Statement of Chairman Larry Bucshon

    I want to welcome everyone to today's Research subcommittee hearing 
on the issue of scientific integrity and transparency.
    An editorial in the March 29, 2012 edition of Nature magazine was 
entitled: ``Must try harder: too many sloppy mistakes are creeping into 
scientific papers. Lab heads must look more rigorously at the data--and 
at themselves.'' I found this editorial particularly interesting 
because of my background as a cardiothoracic surgeon and my 
professional interest in medicine. The editorial goes on to cite a 
recent study (contained in this specific issue) by Glenn Begley and Lee 
Ellis which analyzes the low number of cancer-research studies that 
have been converted into clinical success, and concludes that ``a major 
factor is the overall poor quality of published pre-clinical data.'' 
This is one of many similar studies that I have read.
    The growing lack of scientific integrity and transparency has many 
causes but one thing is very clear: without open access to data, there 
can be neither integrity nor transparency from the conclusions reached 
by the scientific community. Furthermore, when there is no reliable 
access to data, the process of science is impeded and leads to 
inefficiencies in the scientific discovery process. Important results 
cannot be verified, and confidence in scientific claims dwindles.
    The federal government is the main sponsor of basic science 
research, with over $140 billion spent in fiscal year 2013. The 
American scientific community has made enormous contributions in many 
scientific fields from federally sponsored research. I believe our 
nation's scientists will continue to develop the breakthrough 
discoveries and innovations of tomorrow. However, scientists receiving 
federal funding need to be accountable and responsible stewards of 
taxpayer resources. Hard-working Americans trust our scientists to be 
genuine and authentic in the way they conduct and share federally 
funded research.
    The focus of this hearing will be on scientific research data 
funded by the federal government. There are key issues that data-
sharing policies should address including: what is data, how it should 
be shared, when it should be shared, and what potential costs might 
result in making this data available to the research community. We want 
to maximize access to data while protecting personal privacy, avoid any 
negative impact on intellectual property rights and innovation, and 
preserve data without ridiculous cost or administrative burdens. In an 
attempt to begin addressing this issue, the Office of Science and 
Technology Policy released guidelines on February 22nd of this year 
that recognized the problem of data access. These guidelines, intended 
for federal science agencies, are to be followed when determining a 
policy for public access to scientific data in digital formats. As part 
of this hearing, I look forward to hearing the witnesses' opinions on 
these federal guidelines.
    Our witnesses today offer input from a variety of scientific 
fields, as this problem is not exclusive to one scientific field, 
community, or discipline. I'd like to thank them for coming and taking 
time to offer their expertise. I'd also like to thank Ranking Member 
Lipinski and everyone else participating in today's hearing.

    Chairman Bucshon. With that, I now recognize the Ranking 
Member, the gentleman from Illinois, Mr. Lipinski, for an 
opening statement.
    Mr. Lipinski. Thank you, Chairman Bucshon. I think this is 
our third hearing in three weeks, and we have another one next 
week, so I will now label you the hardest-working Chairman in 
Washington, D.C. So it is good to be at work here and I want to 
thank all the witnesses for being here.
    The United States has for decades represented the world's 
gold standard for scientific integrity. But no one should 
mistake this observation as an argument for complacency. In the 
COMPETES Act of 2007, which we worked on in this Subcommittee, 
then-Subcommittee Chairman Brian Baird included a provision on 
Responsible Conduct of Research that required every institution 
receiving NSF grant funding to provide training on the ethical 
conduct of science to all students and postdocs covered under 
those grants. Today, all U.S. research universities have 
implemented research ethics training for their STEM students 
and trainees, which we all can agree is a good thing.
    The bigger challenge to the progress of science is not 
misconduct, but rather poor methodology and bad statistical 
analysis that take a long time to uncover. Or for that matter, 
discoveries in one field that have broad multidisciplinary 
relevance but take time to be known in other fields. To that 
end, the open sharing of scientific data is good for science 
and it is good for society. We must, of course, respect issues 
of privacy and intellectual property. But the more data are 
open, the faster we will validate new theories and overturn old 
ones, and the more efficiently we will transform new 
discoveries into innovations that will create jobs and make us 
healthier and more prosperous. The movement toward open data is 
not primarily about scientific integrity; it is mostly about 
speeding up the process of scientific discovery and innovation.
    However, there are some big challenges to the widespread 
implementation of open data. Someone must define what exactly 
data sharing is going to mean and how it is going to be done, 
beginning with a standard. The February 22nd OSTP memo, which 
the Chairman mentioned, on increasing access to the results of 
federally funded scientific research, which by the way was also 
a direct response to requirements in the COMPETES Act, takes on 
many of these issues in detail. But specifically, here are some 
questions that we have to consider, and some of these questions 
were questions raised by the Chairman. First, what does it 
entail and how much does it cost for researchers to develop a 
data management plan and to prepare their own data for sharing? 
Do they have adequate assistance from professional information 
managers? Are funding agencies sufficiently aware of the costs 
and skills required for good data management plans, and how 
should they evaluate and budget for data management proposals? 
What are the IT infrastructure needs for data sharing, 
including technical standards, and what, if any, scientific or 
technical barriers exist to developing that infrastructure? 
What are the most important factors to consider in the 
economics of digital data access and preservation? What should 
be the respective roles of science agencies, universities, and 
the private sector in supporting and preserving public 
databases? How can these groups work together to minimize costs 
and maximize benefit to the scientific community? And finally, 
are there any policy or legal barriers for sustainable digital 
access and preservation?
    In light of the majority's suggestion of a possible 
legislative outcome for this hearing, I hope that today's 
dialogue will include a thoughtful discussion of some of these 
practical issues of implementation. I know that all four expert 
witnesses before us have a lot to contribute to this discussion 
and I look forward to learning from them because this is 
certainly something that is important for us to pursue but we 
need to make sure that we are covering all our bases here and 
do this in the right manner.
    With that, I yield back.
    [The prepared statement of Mr. Lipinski follows:]

     Prepared Statement of Ranking Minority Member Daniel Lipinski

    Thank you Chairman Bucshon and thanks to all of the witnesses for 
being here.
    The U.S. has for decades represented the world's gold standard for 
scientific integrity. But no one should mistake this observation as an 
argument for complacency. In the COMPETES Act of 2007, which we worked 
on in this subcommittee, then Subcommittee Chairman Brian Baird 
included a provision on Responsible Conduct of Research that required 
every institution receiving NSF grant funding to provide training on 
the ethical conduct of science to all students and postdocs covered 
under those grants. Today, all U.S. research universities have 
implemented research ethics training for their STEM students and 
trainees.
    The bigger challenge to the progress of science is not misconduct, 
but rather poor methodology and bad statistical analysis that take a 
long time to uncover. Or for that matter, discoveries in one field that 
have broad multidisciplinary relevance but take time to be known in 
other fields. To that end, the open sharing of scientific data is good 
for science and it's good for society. We must, of course, respect 
issues of privacy and intellectual property. But the more data are 
open, the faster we will validate new theories and overturn old ones, 
and the more efficiently we will transform new discoveries into 
innovations that will create jobs and make us healthier and more 
prosperous. The movement toward open data is not primarily about 
scientific integrity, it's mostly about speeding up the process of 
scientific discovery and innovation.
    However, there are some big challenges to the widespread 
implementation of open data. Someone must define what exactly data 
sharing is going to mean and how it is going to be done, beginning with 
a standard. The February 22nd OSTP memo on increasing access to the 
results of federally funded scientific research, which by the way was 
also a direct response to a requirement in COMPETES, takes on many of 
these issues in detail.

Specifically, we must consider such questions as:

      What does it entail and how much does it cost for 
researchers to develop a data management plan and to prepare their own 
data for sharing? Do they have adequate assistance from professional 
information managers?

      Are funding agencies sufficiently aware of the costs and 
skills required for good data management plans, and how should they 
evaluate and budget for data management proposals?

      What are the IT infrastructure needs for data-sharing, 
including technical standards, and what, if any, scientific or 
technical barriers exist to developing that infrastructure?

      What are the most important factors to consider in the 
economics of digital data access and preservation?

      What should be the respective roles of science agencies, 
universities, and the private sector in supporting and preserving 
public databases? How can these groups work together to minimize costs 
and maximize benefit to the scientific community?

      And finally, are there any policy or legal barriers for 
sustainable digital access and preservation?

    In light of the Majority's suggestion of a possible legislative 
outcome for this hearing, I hope that today's dialogue will include a 
thoughtful discussion of some of these practical issues of 
implementation. I know that all four expert witnesses before us have a 
lot to contribute to this discussion and I look forward to learning 
from them.

    With that I yield back.

    Chairman Bucshon. Thank you, Mr. Lipinski.
    If there are Members who wish to submit additional opening 
statements, your statements will be added to the record at this 
point.
    At this time I would like to introduce our witnesses. Our 
first witness is Dr. Bruce Alberts, Editor-in-Chief of Science 
Magazine and Professor Emeritus of Biochemistry and Biophysics 
at the University of California-San Francisco. Welcome. Our 
next witness is Dr. Victoria Stodden, Assistant Professor of 
Statistics at Columbia University. Our third witness is Dr. 
Stanley Young, the Assistant Director of Bioinformatics at the 
National Institutes of Statistical Sciences. That was hard to 
say. Our fourth and final witness today is Mr. Sayeed 
Choudhury, Associate Dean for Research Data Management at Johns 
Hopkins University and Hodson Director of the Digital Research 
and Curation Center.
    As our witnesses should know, spoken testimony is limited 
to five minutes each after which Members of the Committee will 
have five minutes each to ask questions.
    I now recognize Dr. Alberts to present his oral testimony.

                TESTIMONY OF DR. BRUCE ALBERTS,

             EDITOR-IN-CHIEF, SCIENCE MAGAZINE AND

       PROFESSOR EMERITUS OF BIOCHEMISTRY AND BIOPHYSICS,

            UNIVERSITY OF CALIFORNIA - SAN FRANCISCO

    Dr. Alberts. It is a pleasure to be here today. I would 
just like to start by emphasizing something that Science 
Magazine covers repeatedly, which is the fact that our strength 
in science and technology in the United States underlies both 
our economic success and our military dominance in the world. 
As you all know, many other nations are increasingly making 
investments in this area, and I find it distressing that 
although this Committee has long supported fundamental, long-
term scientific research, the investment in the United States 
has been stagnant for many years. The investment in this kind 
of research was 1.25 percent of GDP in 1985 and has dropped to 0.87 
percent of GDP in 2013, a big drop, and of course, the current 
sequester will now make our situation even worse. I believe 
that this is dangerous for America's future, for my 
grandchildren's future.
    But this hearing, of course, is to focus on the quality and 
not the quantity of U.S. research. I would like to address 
first the data availability issue, which of course is crucial 
for science. Science builds by one scientist testing and 
building on and maybe refuting the data of other scientists, 
very much a community endeavor. And the privilege of publishing 
in a journal like ours demands data sharing. Otherwise science 
doesn't work.
    So our journal has been working on this. This is a special 
issue we published, 14 long articles about all these issues, 
February 2011, and we are publishing more and more about this. 
It is accompanied by a survey, a useful survey of scientists, 
how they use data and whether they have enough access. And we 
have stressed over and over again that our policy is ``that all 
data necessary to understand, assess and extend the conclusions 
of the manuscript must be available to any reader of Science.'' 
In this issue, we announced a new policy. This includes 
computer codes involved in the creation or analysis of data, 
and I am pleased to say that we are getting good compliance 
with those policies.
    Of course, there are problems that remain. You will hear 
about them from the rest of the group here. But one I would 
like to emphasize is guaranteeing funding for the public 
databases, the critical ones, funding long term so that the 
community and journals like ours can rely on them. This is 
really a major issue. In my field, the protein database, for 
example, is absolutely crucial. It has got 100,000 different 
protein coordinates in it. You know, if funding lapses, then we 
lose all this, and these places play major roles in setting 
standards as well.
    And secondly, I would like to emphasize that we need tools 
for interacting with the largest data sets that are now 
increasingly provided as supplemental online information in 
journal publications like ours, so when we demand the data, we 
put the data not in the written paper but most of it in a big 
electronic supplement, and other journals are doing that as 
well, but we need ways to help people analyze that data who are 
not the original authors. And of course, every journal needs to 
stress clear and complete presentation of all the materials and 
methods that were used in the research.
    So the other issue is data reproducibility. Mr. Chairman, 
you quoted from that paper. My conclusion, from talking to 
people at Genentech who would agree with that paper that you 
cited from Bayer Health Care, is that the scientific standards 
are lower in some fields of science than others and that we 
need to work on setting higher standards.
    In addition, human cells are incredibly complex and it is 
easy to get a result that looks right when it is really wrong, 
and one can easily be fooled. Every scientist must be trained 
to be highly suspicious about his or her own results, and this 
again is a major issue. And finally, I believe we are 
overemphasizing research directly aimed at finding drugs at the 
expense of the high-quality discovery-driven basic research 
that is urgently needed to improve the search for disease 
treatments. We are just mostly stabbing in the dark.
    So my suggestions for improving this situation would demand 
a community effort from scientific journals like ours. We have 
new policies in the last three years that every senior author 
for each part of the results being published must confirm that 
he or she has personally reviewed the original data generated 
by that unit, specifying where exactly those results appear in 
the paper. It used to be that we wanted one author to take 
responsibility. That is totally unreasonable now. Half of our 
papers have authors in different countries. We would have to 
have a set of senior authors. We are developing checklists in 
various fields of science to help journals and scientists. 
There is a biosketch issue. People should not be listing huge 
lists of publications to impress other people who are giving 
them grant funds. They need to focus on their five or ten most 
important contributions, and quality is critical, not quantity. 
And funding agencies have a role to play here as well.
    I just want to emphasize my own role at universities. I am 
still teaching. I am going to be teaching a 2-week minicourse 
on ethics and research standards this May, so I am very much 
involved in these issues. Thank you.
    [The prepared statement of Dr. Alberts follows:]

    [GRAPHICS] [TIFF OMITTED] T9929.005 through T9929.011
    
    
    
    Chairman Bucshon. Thank you.
    I now recognize Dr. Stodden for five minutes to present her 
testimony.

               TESTIMONY OF DR. VICTORIA STODDEN,

               ASSISTANT PROFESSOR OF STATISTICS,

                      COLUMBIA UNIVERSITY

    Dr. Stodden. Thank you for the privilege of addressing you, 
and thank you for your very lucid comments. I agree with just 
about everything both of you have said, and I also agree on how 
important this issue is. So I would like to spend my remaining 
time on two aspects. One is, I would like to scope the problem 
for you, and the second is, I would like to scope the action I 
think that is available to you here.
    So the first thing I want to say is that there is not a 
crisis of integrity in terms of scientists and scientists' 
behavior. What has happened in science is that like all sectors 
of the economy, and all across America, we are taking advantage 
of technological revolutions. What we are doing is using far 
more computers, far more data-oriented and data-driven 
research, far more high-powered investigation in all the 
research all across the sciences. This isn't just in the life 
sciences. This is in engineering, this is in English 
departments who are doing word counts in Shakespeare. This is 
something that is really pervasive in the scientific enterprise 
as a whole, and this is something that is having ramifications 
in the way that we disseminate and communicate science. It is 
not a question of personal integrity.
    So what this means is, to scope the issue, I think that we 
need to think about this issue in terms of reproducibility, so 
as Dr. Alberts outlined, open data itself is a very broad 
notion. I think this needs to be scoped to data and software 
required to reproduce published results, and what that means to 
a scientist is clear. There are details, of course, but that is 
something that a scientist can understand. This is something 
that institutions in the scientific enterprise can understand. 
And I reiterate that it is not just about data, it must include 
the codes and the software that take that data to the published 
results so that those results can be validated and verified.
    You mentioned in your opening remarks about statistical 
errors, about other issues. I would like to scope the problem 
to this computational issue, which I believe is reflected in 
the language around this, digital data, and the reason for that 
is clarity. I agree with you that as a statistician, there are 
lots of statistical errors that are in the literature that are 
being worked out. This is in part because doing computational 
work is new to many fields, and I believe the core issue is 
sharing data, sharing code and things like sort of biological 
materials or the mathematics and the statistics, those will 
work out as corollary issues. Right now the issue needs to be 
scoped on data and code that allow those results to be 
understood, validated and reproduced by other members in the 
community.
    So secondly, I would like to talk about the scope of action 
that I think is available and important for you to think about. 
The first thing is, as Dr. Alberts outlined, scientists are 
very interested in these issues of reproducibility. As we know, 
it is a cornerstone. We don't accept scientific findings until 
there is replication, until there is validation by other 
people--at least that is the theory. And in my testimony, I 
included two articles that are in some sense manifestos from 
computational scientists calling for greater reproducibility. 
The reason computational scientists are banding together and 
creating these manifestos is because there is a collective 
action problem. It does take time to make your data available 
and to make your software available. It is easier to hack 
things up on your machine and produce a paper and never really 
look at the code or the data in the sense of sharing it. That 
does take extra time. So what this means is that scientists who 
want to do reproducible research and sharing the code and data 
that replicates their results are at a disadvantage because 
they don't receive credit for this right now. They generally 
receive credit for the publications. So steps like what Science 
Magazine has taken with data-sharing requirements and code-
sharing requirements are extraordinary and laudable and very 
important. This is Science, though, our highest-impact journal, 
and it is much harder for lower-impact journals to demand that 
of the authors. But this is where the federal funding agencies 
come in as another lever that exerts pressure on scientists and 
what they are required to do.
    So in these manifestos that I included in my testimony, you 
will see computational scientist after computational scientist 
calling for help in a broad sense because people who stick 
their nose out get it cut off and we need the federal funding 
agencies to work in an integrated way to help overcome this 
collective action problem.
    Now, how does this happen? This happens through the 
creation of and financial support for repositories that can 
house code and can house data, and this is something that can't 
just happen, I don't believe, from added money on grants, on 
NIH grants and so on, that are supposed to fund these things in 
an ethereal way. I think this is more serious and this is 
something that needs to be directly confronted, more similar to 
a mandate when you take federal funds for your research.
    Now, standards, as Dr. Alberts mentioned, the protein data 
bank and these institutional repositories, other institutional 
repositories are very important for setting standards. They 
come from the community level. I don't believe they come from 
the federal level down. But this needs to be addressed and 
recognized. There is no point in saying we need to have 
reproducibility, we need to share data, we need to share code 
when they don't know where to put it and there aren't ways for 
people to share it and access it and curate it.
    So I will move to questions here.
    [The prepared statement of Dr. Stodden follows:]

    [GRAPHICS] [TIFF OMITTED] T9929.012 through T9929.037
    
    
    
    Chairman Bucshon. Thank you very much.
    I recognize Dr. Young for five minutes to present his 
testimony.

                TESTIMONY OF DR. STANLEY YOUNG,

             ASSISTANT DIRECTOR FOR BIOINFORMATICS,

           NATIONAL INSTITUTE OF STATISTICAL SCIENCES

    Dr. Young. Thank you for the opportunity of testifying.
    As an abstract principle, the sharing of research data is a 
noble goal and meets with little opposition. However, when data 
sharing is attempted in a particular circumstance, the 
conflicting interests of the parties can thwart the exchange. 
So said Joe Cecil of the Justice Department in 1985.
    What is the current status of science in general and data 
availability in particular? First, where are we with science 
claims? In 2005, John Ioannidis published two papers of 
interest. In one, he asserted that 90 percent of the claims 
made in science papers are wrong in the sense that they are not 
expected to replicate. In another, he noted that five out of 
six papers based on observational studies failed to replicate. 
I published a paper in 2011 and showed that of 52 hypotheses 
suggested from observational studies, none replicated in the 
expected direction and five were statistically significant, but 
in the opposite direction. Begley and Ellis reported that 47 
out of 53 claims made in major science journals failed to 
usefully replicate.
    Where are we on data sharing? John Ioannidis selected 10 
papers from each of 50 of the highest-impact journals--New 
England Journal of Medicine, Nature, Science, et cetera--and 
asked, is the data used in these papers publicly available? 
Overall, only 47 of 500 papers deposited full primary raw data 
online. None of the 149 papers not subjected to data 
availability policies made their full primary data publicly 
available.
    I report on two personal experiences. Dr. Beate Ritz of 
UCLA made a claim in Environmental Health Perspectives that air 
pollution in L.A. county leads to low birth weights. Dr. 
Frederica Perera of Columbia University asserted in the journal 
Pediatrics that air pollution decreased IQ in children. NIEHS 
provided funding for both studies. In both cases, I asked for 
the data sets from the authors. I also asked for help from 
NIEHS. I resorted to FOI. I received neither data set. 
Recently, I was informed that NIEHS does not have the legal 
authority to compel an author to provide data that was 
funded by them. Operationally, NIH funding, the Shelby 
amendment, etc. mean very little with respect to data 
availability. Mostly, authors do not provide data sets used in 
their publications. It is technically easy to share data used 
in publications. Others will discuss reproducible research, so 
I will leave that aside.
    Just why are we in this situation, where most claims do not 
replicate and authors will not make data sets available? In a 
long and illustrious career, Edwards Deming made the point that 
if a system is failing, it is not the workers' fault--that is 
the scientist--it is the fault with management, in this case 
funding agencies and journal editors. For over 30 years, 
workers have been admonished to do their work better and to 
make their data sets available. It was reported in Science in 
1988 that there were serious problems with observational 
studies. Nothing has changed in 25 years.
    Congress, funding agencies and journal editors need to step 
up and manage the scientific process. They should require 
authors to deposit data sets on publication of their papers. 
Funding of data set construction and analysis should be 
separate. They should require data analysis strategies that 
demonstrate reproducibility. For example, any claim should be 
replicated in a separate data set before publication. Remember, 
the reliability of current scientific claims is only 10 to 20 
percent. John Holdren's thing on the Office of Science and 
Technology Policy I think is a welcomed thing in this area.
    It is not enough to agree with sharing data. It is almost 
30 years since Joe Cecil stated the problem. Management should 
make the depositing of data sets on publication mandatory. This 
is a management problem; it is not a science worker problem.
    Thank you very much.
    [The prepared statement of Dr. Young follows:]

    [GRAPHICS] [TIFF OMITTED] T9929.038 and T9929.039
    
    
    
    Chairman Bucshon. Thank you.
    I now recognize Mr. Choudhury to present his testimony, 
five minutes.

               TESTIMONY OF MR. SAYEED CHOUDHURY,

          ASSOCIATE DEAN FOR RESEARCH DATA MANAGEMENT

        AT JOHNS HOPKINS UNIVERSITY AND HODSON DIRECTOR

          OF THE DIGITAL RESEARCH AND CURATION CENTER

    Mr. Choudhury. Chairman Bucshon, Ranking Member Lipinski, 
Members of the Subcommittee, thank you for the opportunity to 
be here today.
    I have been asked to address questions related to data 
sharing, access and preservation. I would like to do so from 
the perspective of infrastructure development. The other 
witnesses have already addressed the importance of persistent 
scientific data archives for reproducibility. I believe that 
strategic investments in data infrastructure also have 
important implications for our overall competitiveness.
    There are important lessons from our historical 
infrastructure development that are relevant as we consider 
data sharing, access and preservation. The development of 
railroads initially led to systems that served regional 
networks but eventually merged into a national network through 
a standard track gauge. With the development of automobiles, we 
adapted from early mistakes to adjust drivers' behavior through 
education, driving rules and seat belts. The development of the 
Internet reflects a layered approach of different technologies 
connected through a key component in the form of two protocols 
known as TCP and IP.
    Broadly speaking, successful infrastructure development has 
relied on a flexible balance of community and national 
approaches, social aspects relating to human behavior, and key 
components. In each case, as infrastructure evolved through 
community efforts, we reached the point where national 
coordination moved us to a more cohesive situation. In previous 
cases, the more cohesive infrastructure led to greater societal 
benefits from both the private and public sector. I believe we 
have reached a similar point with certain aspects of data 
infrastructure.
    From a policy perspective, the recent Executive Memorandum 
from the Office of Science and Technology Policy provides a 
useful framework for federal policies that would maximize data 
sharing, access and preservation. The memorandum acknowledges 
the need for flexibility by federal agencies for the 
communities they support balanced with the need for uniform 
guidelines when appropriate. There is one specific example that 
I will mention in my oral remarks. The memorandum outlines the 
need for appropriate data attribution and citation. The method 
for meeting this need is the persistent identifier, which is a 
long-lasting reference to data. You can think of persistent 
identifiers as an improved version of Web site addresses such 
as Congress.gov. It is a rough analogy, but the persistent 
identifier may be compared to having the same role as track 
gauge in the development of railroads.
    From an economics perspective, there is a greater need for 
understanding of costs. For example, some cost studies focus 
only on storage, ignoring related costs such as data center 
operations or longer-term costs related to preservation. 
Preservation of data ensures that we can extract value for the 
long term, noting that with data, preservation issues can arise 
in as little as five years. The development of data 
preservation infrastructure represents a case where effective 
partnerships could be formed between the public sector, private 
sector and university sector, in which I include libraries and 
national laboratories. It is possible that the private sector 
will not focus on data preservation because there are 
unresolved research problems, it is unlikely to be profitable, 
and it benefits from large-scale coordination. Federal agencies 
could provide the funding for research, prototypes and initial 
deployment of data preservation infrastructure. The university 
sector could then set up production systems that the scientific 
community and private sector could exploit for discovery and 
profit.
    From a technology perspective, it is important to remember 
that there are different types of data and different stages of 
scientific projects. Consequently, there is a need for a 
layered approach to diverse systems spanning individual 
researchers to large-scale national projects. Even with this in 
mind, it is possible to identify gaps that are common across 
this landscape. For example, today's storage systems work well 
for many purposes but they do not currently meet some 
preservation requirements. It is worth mentioning that some 
storage companies view this situation as an opportunity for 
co-development with the university sector.
    From a non-technical perspective, scientists do their best 
to manage their data but they do not always have a full 
understanding. Raising awareness and reinforcing the importance 
of data sharing, access and preservation will be important. 
This type of awareness building and education is similar to the 
adjustment of automobile drivers' behaviors over time.
    In conclusion, I believe that we have an important 
opportunity to advance our data networks into more cohesive, 
large-scale infrastructure that will advance the scientific 
process and generate benefits for the public sector, industry 
and the scientific community.
    I thank you again for the opportunity to be here, and I 
look forward to answering your questions.
    [The prepared statement of Mr. Choudhury follows:]

    [GRAPHICS] [TIFF OMITTED] T9929.040 through T9929.045
    
    
    
    Chairman Bucshon. Thank you very much, and I thank all the 
witnesses for their testimony, reminding Members that the 
Committee rules limit questioning to five minutes. The Chair 
will at this point open the round of questions. The Chair 
recognizes himself for five minutes.
    As a cardiothoracic surgeon, I am very interested in this 
issue because I have to translate what is written into clinical 
practice, and so this type of issue really does affect real 
people. I can tell you the difficulty that people like me have 
in figuring out when to change your clinical practice, when you 
are doing something that turns out wasn't the right thing to 
do, it is a very difficult process that is ongoing, so I am 
very interested in this particular subject.
    I will start with Dr. Young. Could you give me some 
examples of where State and federal regulations were made 
without public release of data used to make those regulations?
    Dr. Young. Yes. I have taken an interest pro bono in air 
pollution questions, and an expert in the area worked with me 
and we developed 100 papers that are key papers in that area. 
Then being a statistician, I selected 50 of those papers at 
random and asked the authors for the data sets. I received no 
data sets at all. Many of these data sets were funded by the 
Federal Government and there are many regulations that are 
based on these data sets. They are key data sets. For the most 
part, these data sets are not available.
    Chairman Bucshon. Just so you know, I had the same problem 
getting the data out of the Federal Government. It can be an 
issue.
    Mr. Choudhury, could you give me what specific 
infrastructure technology requirements are required for the 
storage of scientific data research?
    Mr. Choudhury. There are several layers that are necessary 
to actually preserve scientific data. It begins with storage, 
which is basically just the bits residing on a hard disc or a 
tape or even in the cloud, but eventually we also need to do 
things to ensure data protection. We also need to have to then 
do things to ensure that we can migrate the data over time, so 
as we start to use new storage systems or if we have new file 
formats, we have to be able to move those data into those new 
environments. As Dr. Stodden mentioned, we also need to have 
access to the software or the tools that process the data 
because in many cases, it is not sufficient just to get access 
to the data alone. So the actual preservation of the data is 
this complex set of layers that go beyond storage. Storage is 
necessary but it isn't sufficient. So we have to do all these 
other things to understand the context and the reusability of 
the data as well.
    Chairman Bucshon. Do you think currently that university 
libraries or national laboratories are equipped for this type 
of infrastructure?
    Mr. Choudhury. At Johns Hopkins, we have taken an approach 
of looking at two stages. The first is prior to investigators 
submitting proposals--they need some sort of consultation and 
support to develop their data management plans. In this 
respect, I do believe that the university sector, and 
particularly university libraries, have stepped up very well. I 
think most research university libraries are providing that 
kind of consultation to their investigators.
    The second stage is that once an award is made, then we 
actually have to handle the data and we actually have to start 
preserving it for the long term. In this respect, there is a 
subset of that library community that has come forward to help 
provide that kind of support, and then there is the long-term 
preservation need, and even there, it is a smaller subset 
again. It is in the preservation of the data where I think 
there remains some research questions which ultimately when 
they are addressed they can migrate the support into the 
university library sector.
    Chairman Bucshon. Great. Dr. Alberts, on February 11, 2011, 
in a Science magazine editorial, you write, ``We will ask 
authors to provide a specific statement regarding the 
availability and curation of data as part of their 
acknowledgments requesting that reviewers consider this as a 
responsibility of the authors.'' Do you think this self-
policing policy works in practice?
    Dr. Alberts. We find that it has been working for Science 
magazine. Our senior author, deputy editor, Brooks Hanson, has 
been deeply involved in this. On rare occasions we have had to 
make authors do things that they should have done themselves 
but I guess we are fortunate we have the threat, which is, we 
are not going to publish any more papers from you, and they 
want to publish in Science magazine, and as Victoria said, not 
every journal can make that threat. So I think this is a very 
important issue to emphasize. We haven't talked about the 
fact--I am a biochemist, and I had lots of data from my 
laboratory when I was an active scientist. Not all of it should 
be preserved. I mean, if I tried to preserve everything, I 
couldn't find anything. So we also need different fields to 
decide what it is that we really need to preserve and make 
available. There is so much material being collected now that 
it is really important to get standards for different fields of 
what needs to be preserved and what needs to be put in your 
publication.
    Chairman Bucshon. Great. Thank you all. I now yield to Mr. 
Lipinski from Illinois.
    Mr. Lipinski. Thank you. I wanted to start out by saying I 
am sort of going back to my days as a social scientist and 
thinking about not just the research I did and the data that I 
had but also thinking about behavior, and it is--there are not 
rewards generally for having--someone had mentioned, I think 
Dr. Stodden, that you are rewarded for a result in a 
publication but you are not rewarded--the rewards aren't there 
to spend the time and the effort to have the data in a format 
even that is accessible to others, and if you are talking about 
going further than that, how exactly you went through and you 
analyzed the data. I can't tell you how much paper I had 
printed out of different ways, all these different models that 
I ran and trying to keep track of all that. So it is not simple 
to do and there has to be incentives. So somehow the culture 
has to be changed. And the question is, how do we change that 
culture? Now, the National Science Foundation requires that you 
have a data management plan when you are applying for a grant, 
so the NSF puts that in there.
    My question is, in a short period of time if you can do it, 
how do we change this, and should this be a situation where it 
is data available upon request or should it all be available? 
Should it be put out there published somewhere or put on a site 
that everyone can access? And how far do we go with the data? 
Is it, okay, this is how I analyze it, this is the statistical 
package I used, this is how exactly I did it. So let me start 
with Dr. Stodden. I mean, what is your quick sort of suggestion 
on it for your 30,000 foot? What would you do if you could?
    Dr. Stodden. So I think the efforts that have been taken so 
far are really this on request and so on, and there are a 
number of experiments and studies, and Dr. Young mentioned a 
couple, where that doesn't seem to work as well. You don't 
simply get the response. So I think it is time to move forward 
to this being a standard. Now, having said that, as Dr. Alberts 
said, there are data sets and problems of different importance, 
and you can imagine investing a lot more time curating a data 
set that has broad use and applicability and might underlie 50 
or 100 studies and so on versus one one-off. But the change is 
really something that I believe scientists are willing to do 
and are working on standards. For example, in economics this is 
a very forward-thinking community and many of the journals have 
standards and they do engage in data sharing and code sharing 
but not even as much as they would like. And so I think the 
complexity of the problem means that it really is not a one-
size-fits-all solution. As you mentioned, it is something that 
comes from the field.
    But I would suggest that this is a standard that it should 
be understood that this code and the data go open for 
reproducibility and changing the culture is something 
scientists are talking about. There is a special issue I can 
point you to in Computing in Science and Engineering that is 
called Changing the Culture, and it is about giving these 
rewards. So as Dr. Choudhury mentioned, having these persistent 
identifiers allows citation for data and for code. NSF steps 
towards allowing scholarly objects like data and code listed on 
the biosketch and not just publication is a real step in this 
direction, and I think the scientific community will sort out 
how it values data contribution and code contribution and 
publication contribution. They may not be all valued equally 
but we have a long history of doing this. Not all publications 
are valued equally. But I think that bringing this through 
citation and having citation standards is a way to really 
change the culture and reward people.
    And I will add one last point, which is there is a 
generational difference here because these changes in 
technology, young people and young scientists and people who 
want to go into research, it is very natural for them to share 
data and to share code, and it is discouraging for them to 
enter a situation where suddenly this is not the norm. So this 
is something where I think there is also this opportunity that 
the culture is changing naturally on its own just with time as 
younger people come in and have these expectations for sharing 
what they are doing digitally. And so that is also something to 
capitalize on. And again, I go back to the testimony in that 
there is this collective action problem because, as you 
mentioned, it takes time, and so something particularly from 
federal agencies that can help push through that is really very 
important.
    Mr. Lipinski. I thank you. My time is up. I yield back.
    Chairman Bucshon. I now yield to Mr. Stockman for five 
minutes.
    Mr. Stockman. I have a question for Dr. Alberts. My wife is 
a NASA privacy officer, and I want to follow up on something 
the Chairman related. In February in your editorial, you wrote, 
``We recognize that exceptions may be needed to these general 
requirements for sharing data, for example, to preserve the 
privacy of individuals or in some cases when data or materials are 
obtained from third parties and for security reasons but we 
accept those rare exceptions.'' Is this your view today?
    Dr. Alberts. For example, we had an experience with a 
Department of Energy lab where they weren't allowed to give us 
the code because presumably it had some security implications. 
So we do encounter those one-off occasions. But they have been 
rare. So we have to live with the law, and we try our best to 
do what we can.
    Mr. Stockman. Do you see other exceptions?
    Dr. Alberts. Not that--I don't know of any exceptions since 
that policy was made.
    Mr. Stockman. Okay. The other question I have is for all 
the witnesses. Many of you today also practice science. You are 
also members of the United States scientific community. You 
have been a world leader in producing first-class research. How 
do you envision the mechanism of enforcing the sharing of data 
without hindering the process of scientific discovery and 
simultaneously minimizing the administrative burden of a 
scientist? Because I know a lot of professors and everything a 
lot of time fill out more paperwork than they do research. If 
you could each just go quickly through the----
    Dr. Alberts. Well, I think Victoria said it right. We need 
to mobilize our communities. I mean, I am a cell biologist and 
the American Society of Cell Biology used to help us. What does 
it mean for our community, and we have to take responsibility 
for it, and it is going to be different for statisticians. 
Different people will have different requirements and it has to 
make sense, and I agree with you that it has gone way overboard 
now at universities. Every time I want to do anything, I have 
to fill out a form. So I think we should try to avoid 
legislating more flat requirements. You know, if I want to 
interview students, graduate students at UCSF about their 
career options, I have to fill out a 50-page human use form. 
It drives me nuts. So this Committee might work on pushing back 
on some of the meaningless paper and get some requirements that 
are more meaningful.
    Dr. Stodden. That is a great question, and I think it goes 
back to these issues of reproducibility. If you are publishing 
a paper where you claim that data and code are out there and 
available for it to be reproducible, then that is in a sense 
the starting point of standards in a community. Now, as Dr. 
Alberts mentioned, this will change for different communities 
and different research problems and they can be quite 
different, but there needs to be this expectation that the 
results, the computational results will be reproducible and 
then when you go and get your hands dirty and you try and do 
the reproducibility, then if it doesn't work or it does work, 
then that is value too in the community, and I think that 
scaffolding and that framework is really there. It is a 
question of moving towards this default of openness rather than 
the default of being closed and then you request and so on, and 
as I was mentioning to Ranking Member Lipinski, the default 
needs to be open, and then as you mentioned, we have exceptions 
for confidentiality and so on but those are the exceptions, and 
then the standard is really about reproducibility.
    Dr. Young. The first thing to keep in mind is that many 
estimates say that 80 to 90 percent of the claims that appear 
in scientific papers are wrong in the sense that they will not 
replicate. So I would focus on cost per valid result. 
Additional costs can be put into reproducible research and 
things like that. The total number of claims that are checked 
will go down but the number of valid claims can easily go up if 
we do our research better. Thank you.
    Mr. Choudhury. I think one thing that is becoming clear is 
data management is a complex and demanding set of activities on 
its own. It may not be reasonable to expect scientists to 
conduct their own data management but rather work with a set of 
professionals who sit somewhere between the domain sciences, 
say, library information science. So I think there is a 
workforce development issue here. We don't expect scientists to 
be experts in IT systems or other kinds of systems. We provide 
support for them, and I think data management may be in that 
category.
    Mr. Stockman. Thank you. I yield back.
    Chairman Bucshon. I now recognize Mr. Bera from California.
    Mr. Bera. Thank you, Mr. Chairman.
    Now, to start off with, I would want to make sure we don't 
give the impression that our scientific community and our 
research institutions are producing faulty data. We maintain a 
competitive advantage. As a scientist myself, as someone who 
spent countless hours in the lab as a medical student and has 
spent time as a faculty member and associate dean at the 
University of California-Davis, working with our medical 
students and our resident physicians, we maintain a competitive 
superiority in our research institutions, and I think Dr. 
Alberts touched on the importance of the federal investment in 
our research institutions. We also need to recognize our 
journals and particularly our leading peer review journals. 
There is a rigorous process having again submitted articles and 
worked with countless students that you go through as you are 
submitting articles. Replicability is an important component 
but also putting the information out there so others can look 
at it and provide feedback is very important. So we want to be 
conscious of that as well.
    As we set up our research institutions, we often are doing 
it and our trials are in a very transparent way, you know, 
funding multi-center trials. When we look at major projects 
like the Human Genome Project, as we talk about brain mapping, 
we will set that up in as transparent a way as possible using 
multiple of our institutions. And it isn't always just about 
replicability. It is about sharing that data and working 
together, but at the same time--and my question is this--as we 
move into this era of wanting to share data, we also have to 
maintain our competitive advantage. We do have competitor 
nations that every day are trying to get to our data and get to 
the research institutions. We talk about cybersecurity on this 
Committee. We need to be very conscious of what we are putting 
out there as well.
    I would direct a question to Dr. Alberts. You talked about 
the importance of research funding as well as the threats to 
research funding in our academic institutions. Why don't you 
touch on that, and then if the rest of the panel wants to talk 
about how we move forward in kind of an open, transparent way 
but maintaining our competitive advantage and protecting those 
discoveries that we are making.
    Dr. Alberts. As I wrote in my written testimony, I referred 
to this major project from the National Academy of Sciences 
when I was president to explain to Congress and the public how 
fundamental knowledge produces breakthroughs. The first 
pamphlet we produced was on the global positioning system. 
It started with the fact that physicists invented atomic 
clocks. They won a Nobel Prize but everybody thought it was 
useless because it enabled us to keep time to a billionth of a 
second, and why should we want to do that. Well, you follow 
this progression, and I recommend that whole series. It is 
still up on the Web. That combined with many other findings of 
knowledge about the world enabled us to put up these 24 
satellites that produce this wonderful device that we all use 
and the military uses, and we did that over and over.
    And what has been true in the United States, remarkably, 
and I don't think people recognize this, we have been a magnet 
for the most talented people from all around the world coming 
here, and you just look at Silicon Valley and places like that. 
So if we don't keep our leading position as scientific 
research, a place to come to, our universities, then those 
people won't come here and they won't subsequently contribute 
their genius to the American economy and the American strength 
of our Nation. So I am quite worried right now because many 
other countries, China, for one, they see this very clearly. 
This is where we have our competitive advantage and they are 
trying to gain it, and if we don't pay attention to that, I 
think we are going to lose this game. We are taking it for 
granted that all these great people are going to come to this 
country but they are not going to do that anymore if we are not 
the best place to do research.
    Dr. Stodden. So I couldn't agree with your comments more, 
and also with Dr. Alberts that American science is absolutely 
superb, and as evidence of this, I believe our discussion today 
actually reflects the high integrity and the honesty of that 
community in trying to grapple with these problems. I mean, 
these manifestos and so on I put in the testimony here, these 
are scientists who are concerned about the quality of the 
science and trying to fix it. This is not anything other than 
the highest-integrity profession.
    I also want to make one quick comment about corollary 
benefits of open data, going back to your earlier point, which 
is, you probably gathered by now that I think reproducibility 
is important but there are also issues in terms of access to 
the technology. So if you have the ability, the software tools 
and the data to replicate those results and those findings, not 
only can you therefore build on them more easily as well as 
validating them but it also opens them to industry and to 
others who can then capitalize on this for commercial use. I 
mean, whatever they see as appropriate. So it opens all of 
these avenues towards economic growth that can't be overlooked 
that are extremely important.
    And to your point about, well, what if open data helps our 
competitors, I think that there is a long history in the United 
States of being able to capitalize on this and move ahead, and 
I don't think that maintaining a closure around our scientific 
enterprise does anything but restrict American enterprise and 
competitiveness internationally and also threaten the integrity 
of our results. I mean, science moves forward, as Dr. Alberts 
mentioned, through skepticism and through questioning and 
through transparency and openness, and being able to share 
those methods and giving others the tools to replicate and also 
build on, commercialize, capitalize on all of this, I think is 
an avenue towards economic growth and an avenue towards STEM 
understanding too. When it is open, you can imagine smart high 
school kids getting their hands on this stuff and figuring 
things out and playing with it, and that is very real.
    Chairman Bucshon. Thank you. I now yield to Ms. Lummis for 
five minutes.
    Ms. Lummis. Thank you, Mr. Chairman.
    Now, my first question is for any of you who cares to 
answer. It is about OSTP guidance. My question is, do you think 
that the guidance provides appropriate flexibility to agencies 
in developing plans to improve access to federally funded 
research?
    Dr. Young. Stan Young. I read the guidelines very 
carefully. I think they are a major advance forward. The 
history is that if scientists are not compelled to make their 
data sets available, they generally don't make it available. 
The American Psychological Association, for example, just 
started a huge effort on reproducibility. Their journals, there 
are 50 of them, have the author sign a paper saying I will make 
my data set available. Studies have shown that two-thirds of 
the authors that have signed those statements do not make their 
data sets available, so I think there is--some scientists are 
great. In general, there is no data sharing.
    Mr. Choudhury. I do think the memorandum provides a good 
deal of flexibility for federal agencies and the communities 
they support. I do think it is also important to think about 
those opportunities where something may be uniform across 
different agencies. Another example that I would give is the 
memo talks very clearly about enforcing data management plans. 
Well, most reviewers in these early days don't even know what 
constitutes a good data management plan, so I think providing 
guidelines to reviewers about what constitutes a rigorous data 
management plan would be a very important thing that any 
federal agency could do, and it would, of course, be customized 
to their communities.
    Ms. Lummis. Well, I had an experience like you have 
mentioned with the greater Yellowstone interagency brucellosis 
committee where we were trying to get data on elk and the 
transmission of brucellosis from elk to bison, bison to 
domestic livestock, and it was tremendously important because 
we finally have that disease pretty well isolated to the 
greater Yellowstone area after trying for, what, almost 100 
years now to isolate it because it does--it used to be 
prevalent in milk cows, but after years of destroying entire 
herds of dairy cattle, we finally have that disease isolated to 
the greater Yellowstone area. But it is raising havoc, and 
there was a woman who was an employee of Yellowstone National 
Park who gave her entire career paid by the taxpayers to 
studying elk and she would not share her data with us. I mean, 
she was taxpayer funded. So I have had personal experience with 
your frustrations here.
    Another question. Could you comment on the difference 
between what has been written in statute versus what is 
happening in practice regarding obtaining data in federally 
funded research, you know, any of you in your experience?
    Dr. Young. I have a lot of experience asking for data sets, 
and I will call out the country of Finland. Every time I ask a 
scientist in Finland to send me a data set, I get it in return 
email. Given the electronic age that we are in, it is 
reasonably easy to pass data sets around. My experience in the 
United States is not nearly so good. I mentioned requests for 
50 data sets in the area of air pollution, and I got none. The 
psychologists know very well that data sharing, even though it 
is compelled by their journals, it is not done there. There is 
a huge difference between what beautiful-thinking people say 
about sharing data, and then Joe Cecil is right. In practice, 
quite often it is to the advantage of the person that holds the 
data not to share it, and so there is a real problem and a 
difference. NIEHS or NIH, for example, has a wonderful data-
sharing policy. However, they have no legal authority to compel 
anyone to share data, and so many times I have gone all the way 
up through very high levels of the NIH asking for data sets and 
have not gotten them. So the practice is very different from 
the publicity.
    Dr. Stodden. I would like to just reiterate Stan's point 
there. Both NIH and NSF grant guidelines require data sharing, 
and even encourage software sharing, and these have been around 
for at least a decade, and it seems to be unenforceable. And so 
when the Executive Memorandum talked about mechanisms for 
enforceability, I found that very exciting because, like Stan 
says, things can be on paper and then without that enforcement, 
then things don't proceed, and that, I think, is a real bridge 
to breaking the collective action problem and providing those 
incentives for sharing and rewarding scientists to do this.
    Ms. Lummis. Thank you, panel. My time is up, so I will 
yield back to the Chairman.
    Chairman Bucshon. Thank you. I now yield to Mr. Palazzo for 
five minutes.
    Mr. Palazzo. Thank you, Mr. Chairman.
    Dr. Stodden, allowing open access to federally funded 
scientific data may also create new business opportunities. 
What are your thoughts on this issue?
    Dr. Stodden. I think the evidence is clear, and one of the 
reasons that scientific research is funded by the Federal 
Government is because we can discover scientific facts and 
inventions and so on that then can, among other things, 
undergird economic growth through these creations of 
opportunity for industry. So something like economic open data 
and open methods that allow reproduction of these discoveries, 
I don't think it can help but fuel economic growth in the sense 
that you can take these discoveries--scientists don't develop 
things for market. They don't do commercialization or full 
development, particularly not of software and so on. And then 
it is perfectly plausible that these can be taken out and 
developed into products and taken to market if that is viable, 
and I think that that is something that is a very compelling 
reason behind open data and open code.
    Mr. Palazzo. Do you have any examples of products and 
services that companies may be able to offer?
    Dr. Stodden. So, for example, some of my background is in 
image processing and working on standards like the JPEG 2000 
standard. So this came out of academic research on how to do 
image compression and then that is released openly with open 
code, and that is something that can be implemented and become 
standard in the Web for faster loading of Facebook or whatever 
it is or Flickr or whatnot, and it is these types of things 
that are done in the scientific labs and then sometimes, as Dr. 
Alberts said, you don't even see the end application. You are 
making these discoveries and then it takes ingenuity and 
industry to then turn it into different other applications, but 
this happens absolutely all the time.
    Mr. Palazzo. And I think you mentioned this in your 
testimony, that it is definitely a potential economic growth 
area for our country?
    Dr. Stodden. Absolutely.
    Mr. Palazzo. Now, on the flip side, allowing open access to 
federally funded scientific research and the impact, or what 
would be the impact on the intellectual property rights, 
innovation and U.S. competitiveness and things of that nature?
    Dr. Stodden. That is a great question, and it has, 
unfortunately, a complex answer that I tried to touch on in my 
testimony. The intellectual property structure that affects 
scientists was not designed for science, and there are two 
principal ways that it touches scientific output, and one is 
copyright and the other is patents, and copyright is something 
that works against--in the scientific context that works 
against openness in the sense that a scientist who produces 
code or produces other copyrighted outputs like a paper, I 
actually would need to give you explicit permission to do this. 
The default is not openness. So this is something I mentioned 
in my testimony, that maybe this is something that we need to 
rethink how the intellectual property system interacts with 
scientists who have a completely different normative structure to, 
say, for example, a poet or someone creating a movie or 
something like this, it is a very different model.
    The other way that it interacts is through patents, and 
this is largely around inventions, not touching so much the 
computational work that we have been discussing today but 
software is patentable, and I can imagine--and this is actually 
increasing now, that patentable code is something that is 
coming out of the academic institution. So I think this is 
something that we need to think about very carefully. If you 
think back to 1980 and Bayh-Dole, this was something that was 
put into place to encourage transparency, the idea being that 
giving these intellectual property rights to institutions would 
then allow them to patent and give them this incentive, a 
financial incentive, to be open. Now if we have standards of 
reproducibility where code is open and data is open, it doesn't 
make sense to have that same incentive to patent because it 
actually becomes more of a barrier because in 1980, no one 
imagined you would just go to a repository or GitHub or 
whatnot and click and get the code. It had to be this whole 
thing through a tech transfer and so on, which is completely 
different and now that is the barrier. So I think there is some 
careful thinking that needs to happen in terms of IP and also 
around how we collaborate with industry too. Industry has very 
fruitful collaborations with academia, and those need to be 
worked out in terms of what intellectual property remains over 
the scientific output so that industry has--essentially they 
can sort of get some return on their investment.
    Mr. Palazzo. I yield back, Mr. Chairman.
    Chairman Bucshon. Thank you very much. I would like to 
thank all the witnesses for their valuable and very interesting 
testimony and the Members for their questions. The Members of 
the Committee may have additional questions for you, and 
we will ask you to respond to those in writing. The record will 
remain open for two weeks for additional comments and written 
questions from Members.
    The witnesses are excused and the hearing is adjourned. 
Thank you, everyone.
    [Whereupon, at 11:06 a.m., the Subcommittee was adjourned.]


                               Appendix I

                              ----------                              


                   Answers to Post-Hearing Questions


Responses by Dr. Bruce Alberts



[GRAPHICS] [TIFF OMITTED] T9929.047 through T9929.051

Responses by Dr. Victoria Stodden

[GRAPHICS] [TIFF OMITTED] T9929.052 through T9929.057

Responses by Dr. Stanley Young

[GRAPHICS] [TIFF OMITTED] T9929.058 through T9929.063

Responses by Mr. Sayeed Choudhury

[GRAPHICS] [TIFF OMITTED] T9929.064 through T9929.073