[House Hearing, 113 Congress]
[From the U.S. Government Publishing Office]



 
                       NEXT GENERATION COMPUTING 
                         AND BIG DATA ANALYTICS 

=======================================================================

                             JOINT HEARING

                               BEFORE THE

                       SUBCOMMITTEE ON RESEARCH &
                       SUBCOMMITTEE ON TECHNOLOGY

              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY
                        HOUSE OF REPRESENTATIVES

                    ONE HUNDRED THIRTEENTH CONGRESS

                             FIRST SESSION

                               __________

                       WEDNESDAY, APRIL 24, 2013

                               __________

                           Serial No. 113-22

                               __________

 Printed for the use of the Committee on Science, Space, and Technology

       Available via the World Wide Web: http://science.house.gov


                               ----------
                         U.S. GOVERNMENT PRINTING OFFICE 

80-561 PDF                       WASHINGTON : 2013 

              COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY

                   HON. LAMAR S. SMITH, Texas, Chair
DANA ROHRABACHER, California         EDDIE BERNICE JOHNSON, Texas
RALPH M. HALL, Texas                 ZOE LOFGREN, California
F. JAMES SENSENBRENNER, JR.,         DANIEL LIPINSKI, Illinois
    Wisconsin                        DONNA F. EDWARDS, Maryland
FRANK D. LUCAS, Oklahoma             FREDERICA S. WILSON, Florida
RANDY NEUGEBAUER, Texas              SUZANNE BONAMICI, Oregon
MICHAEL T. McCAUL, Texas             ERIC SWALWELL, California
PAUL C. BROUN, Georgia               DAN MAFFEI, New York
STEVEN M. PALAZZO, Mississippi       ALAN GRAYSON, Florida
MO BROOKS, Alabama                   JOSEPH KENNEDY III, Massachusetts
RANDY HULTGREN, Illinois             SCOTT PETERS, California
LARRY BUCSHON, Indiana               DEREK KILMER, Washington
STEVE STOCKMAN, Texas                AMI BERA, California
BILL POSEY, Florida                  ELIZABETH ESTY, Connecticut
CYNTHIA LUMMIS, Wyoming              MARC VEASEY, Texas
DAVID SCHWEIKERT, Arizona            JULIA BROWNLEY, California
THOMAS MASSIE, Kentucky              MARK TAKANO, California
KEVIN CRAMER, North Dakota           ROBIN KELLY, Illinois
JIM BRIDENSTINE, Oklahoma
RANDY WEBER, Texas
CHRIS STEWART, Utah
VACANCY
                                 ------                                

                        Subcommittee on Research

                   HON. LARRY BUCSHON, Indiana, Chair
STEVEN M. PALAZZO, Mississippi       DANIEL LIPINSKI, Illinois
MO BROOKS, Alabama                   ZOE LOFGREN, California
STEVE STOCKMAN, Texas                AMI BERA, California
CYNTHIA LUMMIS, Wyoming              ELIZABETH ESTY, Connecticut
JIM BRIDENSTINE, Oklahoma            EDDIE BERNICE JOHNSON, Texas
LAMAR S. SMITH, Texas
                                 ------                                

                       Subcommittee on Technology

                  HON. THOMAS MASSIE, Kentucky, Chair
JIM BRIDENSTINE, Oklahoma            FREDERICA S. WILSON, Florida
RANDY HULTGREN, Illinois             SCOTT PETERS, California
DAVID SCHWEIKERT, Arizona            DEREK KILMER, Washington
                                     EDDIE BERNICE JOHNSON, Texas
LAMAR S. SMITH, Texas



                            C O N T E N T S

                       Wednesday, April 24, 2013

                                                                   Page
Witness List.....................................................     2

Hearing Charter..................................................     3

                           Opening Statements

Statement by Representative Larry Bucshon, Chairman, Subcommittee 
  on Research, Committee on Science, Space, and Technology, U.S. 
  House of Representatives.......................................     8
    Written Statement............................................     9

Statement by Representative Daniel Lipinski, Ranking Minority 
  Member, Subcommittee on Research, Committee on Science, Space, 
  and Technology, U.S. House of Representatives..................    10
    Written Statement............................................    11

Statement by Representative Thomas Massie, Chairman, Subcommittee 
  on Technology, Committee on Science, Space, and Technology, 
  U.S. House of Representatives..................................    12
    Written Statement............................................    13


Statement by Representative Frederica S. Wilson, Ranking Minority 
  Member, Subcommittee on Technology, Committee on Science, 
  Space, and Technology, U.S. House of Representatives...........    13
    Written Statement............................................    14

                               Witnesses:

Dr. David McQueeney, Vice President, Technical Strategy and 
  Worldwide Operations, IBM Research
    Oral Statement...............................................    16
    Written Statement............................................    18

Dr. Michael Rappa, Director, Institute for Advanced Analytics, 
  Distinguished University Professor, North Carolina State 
  University
    Oral Statement...............................................    26
    Written Statement............................................    28

Dr. Farnam Jahanian, Assistant Director for the Computer and 
  Information Science and Engineering (CISE) Directorate, 
  National Science Foundation
    Oral Statement...............................................    36
    Written Statement............................................    38

Discussion.......................................................    55

             Appendix I: Answers to Post-Hearing Questions

Dr. Michael Rappa, Director, Institute for Advanced Analytics, 
  Distinguished University Professor, North Carolina State 
  University.....................................................    76

Dr. Farnam Jahanian, Assistant Director for the Computer and 
  Information Science and Engineering (CISE) Directorate, 
  National Science Foundation....................................    79

            Appendix II: Additional Material for the Record

IDC IVIEW report, The Digital Universe in 2020: Big Data, Bigger 
  Digital Shadows, and Biggest Growth in the Far East, submitted 
  by Representative Derek Kilmer, Subcommittee on Technology, 
  Committee on Science, Space, and Technology, U.S. House of 
  Representatives................................................    86


            NEXT GENERATION COMPUTING AND BIG DATA ANALYTICS

                              ----------                              


                       WEDNESDAY, APRIL 24, 2013

                  House of Representatives,
                                 Subcommittee on Research &
                                 Subcommittee on Technology
               Committee on Science, Space, and Technology,
                                                   Washington, D.C.

    The Subcommittees met, pursuant to call, at 10:04 a.m., in 
Room 2318 of the Rayburn House Office Building, Hon. Larry 
Bucshon [Chairman of the Subcommittee on Research] presiding.

[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]

    Chairman Bucshon. All right. This joint hearing of the 
Subcommittee on Research and the Subcommittee on Technology 
will come to order.
    Good morning, and welcome to today's joint hearing entitled 
``Next Generation Computing and Big Data Analytics.'' In front 
of you are packets containing the written testimony, 
biographies and Truth in Testimony disclosures for today's 
witnesses.
    Before I get started, since this is a joint hearing 
involving two Subcommittees, I want to explain how we will 
operate procedurally so all Members understand how the 
question-and-answer period will be handled. As always, we will 
alternate rounds of questioning between majority and minority 
Members. The Chairmen and Ranking Members of the Research and 
Technology Subcommittees will be recognized first. Then we will 
recognize Members present at the gavel in order of seniority on 
the full Committee, and those coming in after the gavel will be 
recognized in order of their arrival. I now recognize myself 
for five minutes for an opening statement.
    Again, I would like to welcome everyone to today's hearing 
where we will examine how advancements in information 
technology and data analytics enable private and public sector 
organizations to provide greater value to their customers and 
citizens. Industry, academia, and government are all interested 
in determining how to extract value, gain insights, and make 
better decisions based on the wealth of data that is generated 
today. In recent years, ``big data'' has become the popular 
term used to encompass this phenomenon.
    TechAmerica, an information technology trade association, 
defines big data as ``large volumes of high-velocity, complex 
and variable data that require advanced techniques and 
technologies to enable the capture, storage, distribution, 
management, and analysis of the information.''
    Big data offers a range of opportunities for private 
industry to reduce costs and increase profitability. It can 
enable scientists to make discoveries on a previously 
unreachable scale. And it can allow governments to identify 
ways to serve their citizens more efficiently.
    The McKinsey Global Institute predicts that effective 
information management can provide $300 billion in annual value 
to the U.S. health care sector alone. TechAmerica released a 
report last year highlighting how big data initiatives can 
improve the efficiency and effectiveness of government 
services, and through the use of advanced computing power and 
analytic techniques, universities and Federal laboratories can 
drive new research initiatives that will significantly increase 
our scientific knowledge base.
    There are also various challenges associated with big data 
that the Committee will explore today. McKinsey has estimated 
that the U.S. will face a shortfall of 140,000 to 190,000 
professionals with significant technical depth in data 
analytics, and a further 1.5 million managers and analysts who 
can work effectively with big data analysis by 2018. Committee 
Members will be interested to learn 
how industry, academia, and government are addressing this 
shortfall.
    While the term ``big data'' is relatively new, public and 
private organizations have been investing in computing power 
and data analytics for a number of years. In March of last 
year, the Obama Administration announced a Big Data Research 
and Development Initiative, including $200 million in new 
funding across six different government departments and 
agencies. I am interested to learn how effectively these 
programs are being coordinated across the different Federal 
agencies to ensure that taxpayer dollars are being leveraged 
effectively. Finally, privacy and security are major concerns 
when private and public organizations are collecting, 
analyzing, and disseminating massive data sets.
    We have an excellent panel of witnesses ranging across 
industry, academia, and government. I would like to extend my 
appreciation to each of our witnesses for taking the time and 
effort to appear before us today. We look forward to your 
testimony.
    [The prepared statement of Mr. Bucshon follows:]

 Prepared Statement of Subcommittee on Research Chairman Larry Bucshon

    Good morning, I would like to welcome everyone to today's hearing 
where we will examine how advancements in information technology and 
data analytics enable private and public sector organizations to 
provide greater value to their customers and citizens.
    Industry, academia, and government are all interested in 
determining how to extract value, gain insights, and make better 
decisions based on the wealth of data that is generated today. In 
recent years, ``Big Data'' has become the popular term used to 
encompass this phenomenon.
    TechAmerica, an information technology trade association, defines 
Big Data as ``large volumes of high velocity, complex and variable data 
that require advanced techniques and technologies to enable the 
capture, storage, distribution, management, and analysis of the 
information.''
    Big Data offers a range of opportunities for private industry to 
reduce costs and increase profitability. It can enable scientists to 
make discoveries on a previously unreachable scale. And it can allow 
governments to identify ways to serve their citizens more efficiently.
    The McKinsey Global Institute predicts that effective information 
management can provide $300 billion in annual value to the US health 
care sector alone. TechAmerica released a report last year highlighting 
how Big Data initiatives can improve the efficiency and effectiveness 
of government services. And, through the use of advanced computing 
power and analytics techniques, universities and federal laboratories 
can drive new research initiatives that will significantly increase our 
scientific knowledge-base.
    There are also various challenges associated with Big Data that the 
Committee will explore today. McKinsey has estimated that the US will 
face a shortfall of 140,000 to 190,000 professionals with significant 
technical depth in data analytics, and a further 1.5 million managers 
and analysts who can work effectively with big data analysis by 2018. 
Committee members will be interested to 
learn how industry, academia, and government are addressing this 
shortfall.
    While the term Big Data is relatively new, public and private 
organizations have been investing in computing power and data analytics 
for a number of years. In March of last year, the Obama Administration 
announced a ``Big Data Research and Development Initiative,'' including 
$200 million in new funding across six different federal departments 
and agencies. I am interested to learn how effectively these programs 
are being coordinated across the different federal agencies to ensure 
that taxpayer dollars are being leveraged effectively.
    Finally, privacy and security are major concerns when private and 
public organizations are collecting, analyzing, and disseminating 
massive data sets. We have an excellent panel of witnesses ranging 
across industry, academia and government. I'd like to extend my 
appreciation to each of our witnesses for taking the time and effort to 
appear before us today. We look forward to your testimony.

    Chairman Bucshon. I will now yield to Mr. Lipinski for his 
opening statement.
    Mr. Lipinski. Thank you. I want to thank you, Chairman 
Bucshon, and I want to thank Chairman Massie for holding this 
hearing. I want to welcome and thank the witnesses for being 
here.
    Today's hearing gives us an opportunity to talk about the 
new tools and analytics that are being developed for big data. 
As Chairman Bucshon stated, big data can be thought of as large 
volumes of complex and diverse types of data that change 
rapidly with time.
    In basic scientific research in national security as well 
as in economic sectors ranging from energy to health care, big 
data challenges are becoming fundamentally important. 
Effectively dealing with big data can impact how we do business 
and how we think about the world.
    As a Member of the Research Subcommittee for several years, 
I have watched as the amount and complexity of data has grown 
by leaps and bounds. The field of astronomy is a great example. 
When the Sloan Digital Sky Survey started work in 2000, its 
telescope in New Mexico collected more data in a few weeks than 
had been collected in the history of astronomy, and that 
telescope will be surpassed when the Large Synoptic Survey 
Telescope begins scientific operations in 2020. LSST will 
photograph the entire sky every few days, producing data at a 
rate almost 100 times greater than the Sloan Survey. But data 
is useless without the means to store and analyze it in an 
efficient manner.
    The types of data are changing as well. Data has gone from 
being mostly numbers entered into Excel spreadsheets to data 
coming from sensors, cell phone cameras and millions of email 
messages. In fact, it is estimated that over 85 percent of data 
generated today are these kinds of unstructured data, data like 
videos and emails. The change in the volume and variety of data 
as well as how fast data is being produced and changed creates 
almost limitless opportunities. For example, since 
cybersecurity data is massive, varied, and changing quickly, 
big data technologies have the potential to detect and prevent 
cyber attacks before they happen. I know that organizations 
like IBM are developing technologies to do just that. 
Additionally, big data could be used to establish new business 
models, create transparency, improve decision-making and reduce 
inefficiencies within businesses and government.
    But along with the opportunities, there are a number of 
challenges. We need new tools and software packages to manage, 
organize, and analyze all these different kinds of data. 
Additionally, we will need an analytic workforce to ensure the 
gains of big data. These challenges necessitate involvement 
from government, academia and the private sector. That is why I 
am happy to see all those sectors represented here today.
    The government has and will continue to play an 
instrumental role in this area. For instance, the Networking 
and Information Technology Research and Development program, or 
NITRD, created an interagency big data group that is 
coordinating Federal efforts in technologies, research, 
competitions, and workforce development for big data. We had a 
hearing on the NITRD program back in February, and I expect 
that we will be able to take a broader look at many of the same 
issues in today's hearing.
    In some cases, agencies have teamed up to issue joint 
solicitations. For example, NSF and NIH have a joint big data 
grant program that awarded nearly $15 million in grants to 
eight teams of researchers last year. These first awards 
went to projects focused on designing new tools for big data 
and new data analytic approaches. We will be hearing more about 
these and other interagency activities from Dr. Jahanian in his 
testimony. We will also learn more about specific programs at 
NSF, one of the leading agencies in Federal big data efforts on 
both the analytics side and the computational resources side.
    As I mentioned before, one of the areas being coordinated 
through NITRD is workforce development for big data. Several 
agencies, including NSF, have education activities to support a 
new generation of big data researchers. As we will likely hear 
from all of the witnesses, we face a looming shortage of 
workers with the skills needed to analyze and manage large, 
complex and high-velocity data sets. There is some overlap with 
the broader STEM skills we so often speak about in this 
committee, but there are also unique skills required to address 
the big challenges of big data. We need to consider how to 
build those skills into STEM curricula, especially at the 
undergraduate and graduate levels. I look forward to hearing 
from our witnesses about the current educational efforts and 
what additional initiatives may be necessary.
    And finally, since big data involves different types of 
data that can be produced and transferred quickly, there are 
concerns over privacy. We need to ensure that we strike the 
right balance between exploring and implementing all of the 
potential benefits of big data while also protecting 
individuals' personal information.
    I look forward to hearing the witnesses' testimony and our 
discussion today, and I yield back the balance of my time.
    [The prepared statement of Mr. Lipinski follows:]

             Prepared Statement of Subcommittee on Research
                Ranking Minority Member Daniel Lipinski

    Thank you, Chairmen Bucshon and Massie for holding this hearing on 
examining the next generation of computing and big data analytics. I 
want to welcome and thank the witnesses for being here today.
    Today's hearing gives us an opportunity to talk about the new tools 
and analytics that are being developed for big data. Big data can be 
thought of as large volumes of complex and diverse types of data that 
are also high velocity--meaning they change rapidly with time.
    As a member of the Research Subcommittee for several years now, I 
have watched as the amount and complexity of data has grown by leaps 
and bounds. The field of astronomy is a great example. When the Sloan 
Digital Sky Survey started work in 2000, its telescope in New Mexico 
collected more data in a few weeks than had been collected in the 
history of astronomy. And that telescope will be surpassed when the 
Large Synoptic Survey Telescope goes online in about 2020. LSST will 
photograph the entire sky every few days. That's difficult for any of 
us to wrap our heads around.
    The types of data are changing as well. Data has gone from being 
mostly numbers entered in excel spreadsheets to data coming from 
sensors, cellphone cameras, and millions of email messages. In fact, it 
is estimated that over 85 percent of data generated today are these 
kinds of unstructured data--data like videos or emails.
    The change in the volume and variety of data as well as how fast 
data is being produced and changed creates almost limitless 
opportunities. For example, since cybersecurity data is massive, 
varied, and changing quickly, big data technologies have the potential 
to detect and prevent cyber attacks before they even happen. I know 
that organizations like IBM are developing technologies to do just 
that. Additionally, big data could be used to establish new business 
models, create transparency, improve decision-making, and reduce 
inefficiencies within businesses and government.
    But along with the opportunities, there are a number of challenges. 
We need new tools and software packages to manage, organize, and 
analyze all these different kinds of data. Additionally, we will need 
an analytic workforce to ensure the gains of big data. These challenges 
necessitate involvement from government, academia, and the private 
sector. That is why I am happy to see all those sectors represented 
today.
    The government has and will continue to play an instrumental role 
in this area. For instance, the Networking and Information Technology 
Research and Development--or NITRD--program created an interagency big 
data group that is coordinating federal efforts in technologies, 
research, competitions, and workforce development for big data.
    In some cases, agencies have teamed up to issue joint 
solicitations. For example, NSF and NIH have a joint big data grant 
program that awarded nearly $15 million in grants to eight teams of 
researchers last year. These first awarded grants went to projects 
focused on designing new tools for big data and new data analytic 
approaches. We will hear more about these and other interagency 
activities from Dr. Jahanian in his testimony. We will also learn more 
about specific programs at NSF, one of the leading agencies in federal 
big data efforts on both the analytics side and the computational 
resources side.
    As I mentioned before, one of the areas being coordinated through 
NITRD is workforce development for big data. Several 
agencies, including NSF, have education activities to support a new 
generation of big data researchers. As you will likely hear from all of 
the witnesses, we face a looming shortage of workers with the skills 
needed to analyze and manage large, complex, and high-velocity data 
sets. There is some overlap with the broader STEM skills we often speak 
of in this committee. But there are also some unique skills required to 
address the challenges of big data. We need to consider how to build 
those skills into STEM curricula, especially at the undergraduate and 
graduate levels. I look forward to hearing from our witnesses about the 
current educational efforts and what additional initiatives may be 
necessary.
    Finally, since big data involves different types of data that can 
be produced and transferred quickly, there are concerns over privacy. 
We need to ensure that we strike the right balance between exploring 
and implementing all of the potential benefits of big data while also 
protecting individuals' personal information.
    I look forward to hearing the witnesses' testimonies and to our 
discussion today.

    Chairman Bucshon. Thank you, Mr. Lipinski. The Chair now 
recognizes the Chairman of the Subcommittee on Technology, Mr. 
Massie, for five minutes for his opening statement.
    Mr. Massie. Thank you, Chairman.
    Good morning. Today we are examining an issue that we hear 
a lot about. ``Big data'' is a popular new term that can mean a 
lot of different things. The scientific community, though, has 
generated and used big data since before there was the term ``big 
data.'' In fact, in 1991 this Committee authored the High 
Performance Computing Act, which organized the Federal agency 
research, development, and training efforts in support of 
advanced computing.
    Individual researchers have always been faced with 
difficult decisions about their data: what to keep, what to 
toss, what to verify with additional experiments. And as our 
computing power has increased, so has the luxury of storing 
more data. Incorporating computer power to process more 
scientific data is transforming laboratories across the 
country.
    At the same time, the ability to analyze large amounts of 
data across multiple networked platforms is also transforming 
the private sector. Through big data applications, businesses 
have not only revealed previously hidden efficiency 
improvements in their internal operations, but, more 
importantly, also uncovered entirely new types of businesses 
built around data that was previously not accessible due to its 
size and complexity.
    Today's hearing will examine the hype around big data. Is 
the United States the most innovative Nation in big data? Is 
our regulatory system creating any burdens on businesses? Could 
public-private partnerships with the Federal agencies be 
improved to allow for more data innovations?
    I thank our witnesses for their participation today 
and I look forward to hearing their testimony. Thank you. I 
yield back.
    [The prepared statement of Mr. Massie follows:]

            Prepared Statement of Subcommittee on Technology
                         Chairman Thomas Massie

    Good Morning. Today we are examining an issue that we hear a lot 
about. ``Big Data'' is a popular new term that can mean a lot of 
different things.
    The scientific community has generated and used Big Data before 
there was Big Data. In fact, in 1991 this Committee authored the High 
Performance Computing Act, which organized the federal agency research, 
development and training efforts in support of advanced computing.
    Individual researchers have always been faced with difficult 
decisions about their data: what to keep, what to toss, what to verify 
with additional experiments. As our computing power has increased, so 
has the luxury of storing more data. Today, managing this data allows 
for better-informed experiments, more exact metrics, and perhaps 
significantly longer doctoral theses. Incorporating computer power to 
process more scientific data is transforming laboratories across the 
country.
    At the same time, the ability to analyze large amounts of data 
across multiple networked platforms is also transforming the private 
sector. Through Big Data applications, businesses have not only 
revealed previously hidden efficiency improvements in their internal 
operations, but also uncovered entirely new types of business built 
around data that was previously not accessible due to its size and 
complexity.
    Today's hearing will examine the hype around Big Data. Is the 
United States the most innovative nation in Big Data? Is our regulatory 
system creating any burdens on businesses? Could public-private 
partnerships with the federal agencies be improved to allow for more 
data innovations?
    I thank our witnesses for their participation today and look 
forward to hearing their testimony.

    Chairman Bucshon. Thank you, Mr. Massie. The Chair now 
recognizes Ms. Wilson for five minutes for her opening 
statement.
    Ms. Wilson. First of all, I would like to thank both 
Chairman Bucshon and Chairman Massie for holding this joint 
hearing, and thank you to all of our witnesses for being here 
today. Welcome.
    This morning's hearing provides us with the opportunity to 
discuss one of the newest buzzwords in Washington, and you know 
we have many buzzwords here. This one: big data. This buzzword 
is not an exaggeration. A computer that used to take up the 
space of this entire room now fits in the palm of your hand. It 
is remarkable.
    Just as computers have gotten immensely smaller, they have 
also gotten immensely more powerful. Instead of talking about 
megabytes, we are now talking about petabytes and zettabytes--
quadrillions and sextillions of units of information. It 
boggles the mind. Collecting and storing this huge volume of 
data would have been impossible just a few years ago.
    I am looking forward to your testimony and learning more 
about the benefits of big data to society. As I understand it, 
big data has the potential to improve nearly all sectors of 
society. The National Cancer Institute is funding a prototype 
in biological big data that could lead to new advances in 
cancer treatment. Companies and agencies are using big data to 
run controlled experiments that improve decision-making. 
Scientists at Florida International University in my district 
are using big data to advance understanding of topics including 
cybersecurity, social networks and cloud computing.
    But there are challenges. In order to reap all the benefits 
of complex and broadly available data, we need new technologies 
and software. We also need a workforce, a workforce with the 
skills necessary to analyze data of such great volume and 
complexity. A recent study estimates that the United States is 
in need of 190,000 additional data scientists.
    In thinking about this hearing on big data, I couldn't help 
but think about the tragic events last week in Boston. The 
marathon bombings may be one of the most photographed attacks 
in history. The Massachusetts State Police asked the public to 
share the photos and videos taken on that awful day. Now all of 
this digital information has been and is being used by the 
Boston Police Department and the FBI in their investigation. It 
appears that this data has been instrumental in helping to 
identify the individuals who were involved.
    Examples like this one demonstrate how important it is that 
we develop and attain the tools and the skills people need to 
analyze tremendous amounts of complex data. Big data can not 
only lead to amazing scientific discoveries; it can also save 
lives.
    As we learn more about these opportunities and challenges 
today, I hope our witnesses will offer recommendations on how 
the Federal Government can help create the new tools, software 
and workforce needed to realize the full potential of big data.
    Chairman Bucshon, Chairman Massie, thank you again for 
holding this hearing, and I yield back the balance of my time.
    [The prepared statement of Ms. Wilson follows:]

            Prepared Statement of Subcommittee on Technology
              Ranking Minority Member Frederica S. Wilson
    I'd like to thank both Chairman Bucshon and Chairman Massie for 
holding this joint hearing. And thank you to all of our witnesses for 
being here today.
    This morning's hearing provides us with the opportunity to discuss 
one of the newest buzz-words in Washington and around the world--``big 
data.''
    This buzz-word is not an exaggeration: A computer that used to take 
up the space of this entire room now fits in the palm of your hand. It 
is remarkable.
    Just as computers have gotten immensely smaller, they have also 
gotten immensely more powerful. Instead of talking about megabytes, we 
are now talking about petabytes and zettabytes--quadrillions and 
sextillions of units of information. It boggles the mind. Collecting 
and storing this huge volume of data would have been impossible just a 
few years ago.
    I'm looking forward to the testimony of today's witnesses and 
learning more about the benefits of ``big data'' to society.
    As I understand it, big data has the potential to improve nearly 
all sectors of society. The National Cancer Institute is funding a 
prototype in biological ``big data'' that could lead to new advances in 
cancer treatment. Companies and agencies are using ``big data'' to run 
controlled experiments that improve decision-making. Scientists at 
Florida International University--in my district--are using ``big 
data'' to advance understanding of topics including cybersecurity, 
social networks, and cloud computing.
    But there are challenges. In order to reap all the benefits of 
complex and broadly available data, we need new technologies and 
software. We also need a workforce with the skills necessary to analyze 
data of such great volume and complexity. A recent study estimates that 
the United States is in need of 190,000 additional data scientists.
    In thinking about this hearing on ``big data,'' I couldn't help but 
think about the tragic events last week in Boston. The marathon 
bombings may be one of the most photographed attacks in history. The 
Massachusetts State Police asked the public to share the photos and 
videos taken on that awful day. Now, all of this digital information 
has been and is being used by the Boston Police Department and the FBI 
in their investigation. It appears that this data has been instrumental 
in helping to identify the individuals who were involved.
    Examples like this one demonstrate how important it is that we 
develop and attain the tools and the skilled people needed to analyze 
tremendous amounts of complex data. Big data can not only lead to 
amazing scientific discoveries--it can also save lives.
    As we learn more about these opportunities and challenges today, I 
hope our witnesses will offer recommendations on how the federal 
government can help create the new tools, software, and workforce 
needed to realize the full potential of ``big data.''

    Chairman Bucshon. Thank you, Ms. Wilson.
    If there are Members who wish to submit additional opening 
statements, your statements will be added to the record at this 
point.
    It is now time to introduce our panel of witnesses. Our 
first witness is Dr. David McQueeney, the Vice President of 
Technical Strategy and Worldwide Operations at IBM Research. In 
this capacity, he is responsible for setting the direction of 
IBM's overall research strategy across 12 worldwide labs and 
leading the global operations and information systems teams. 
Dr. McQueeney's background covers a wide range of disciplines, 
spending about half of his career as a researcher and research 
executive and half in IBM's customer-focused areas. He holds an 
M.S. and Ph.D. in solid-state physics from Cornell University 
and an A.B. in physics from Dartmouth College. Welcome.
    Our second witness is Dr. Michael Rappa, the Executive 
Director of the Institute for Advanced Analytics and Faculty 
Member of the Department of Computer Science at North Carolina 
State University. Dr. Rappa has 25 years of experience as a 
professor working across academic disciplines at the 
intersection of management and computing. He began his teaching 
career at the University of Minnesota, where he earned his 
doctorate. Welcome.
    And our final witness is Dr. Farnam Jahanian, the Assistant 
Director for the Computer and Information Science and 
Engineering Directorate at the National Science Foundation and 
a frequent visitor to our Subcommittee. He oversees CISE's 
mission to uphold the Nation's leadership in computer and 
information science and engineering. He also serves as Co-chair 
of the Networking and Information Technology Research and 
Development, or NITRD, Subcommittee of the National Science and 
Technology Council Committee on Technology, providing overall 
coordination for the activities of 14 government agencies. Dr. 
Jahanian holds a master's degree and a Ph.D. in computer 
science from the University of Texas at Austin. Welcome again.
    As our witnesses should know, spoken testimony is limited 
to five minutes each after which Members of the Committee have 
five minutes each to ask questions. Your written testimony will 
be included in the record of the hearing.
    I now recognize our first witness, Dr. McQueeney, for five 
minutes for his testimony.

       TESTIMONY OF DR. DAVID MCQUEENEY, VICE PRESIDENT,

          TECHNICAL STRATEGY AND WORLDWIDE OPERATIONS,

                          IBM RESEARCH

    Dr. McQueeney. Good morning, Mr. Chairman, Ranking Members, 
Members of the Subcommittees. Thank you for the opportunity to 
testify today. My written testimony covers next-generation 
computing, big data and analytics, workforce development and 
the role of government. In my five minutes, I will focus on 
areas where I can offer critical insights from my personal 
experience.
    Computing today is undergoing profound change. We are 
moving from computing based on processors that are programmed 
to follow a predesigned sequence of instructions to cognitive 
computing systems based on massive amounts of data evolving 
into systems that can learn. This new approach will require new 
strategies in hardware and in software and improved skills to 
maintain U.S. leadership. Cognitive systems will digest and 
exploit massive data volumes. Sources such as mobile phones, 
video and social networks generate as much data in two days in 
2013 as in all of human history prior to 2003.
    Advanced analytics can be thought of as tools for using 
all this data to make decisions based on facts rather than intuition. 
The challenge is to transform latent data into actionable 
information to decide what to do next. For example, the Memphis 
Police Department is using data analytics to map crime hotspots 
and find patterns. As a result, they have been able to reduce 
crime by 30 percent with no increase in overall police 
manpower.
    To run advanced analytics, it is essential to have the most 
powerful computing systems. However, current supercomputing 
systems are reaching performance levels that will stagnate 
without significant innovation. We must move to the next 
generation of large-scale computing called exascale computing, 
a thousand times faster than today's petascale machines.
    The United States needs to invest now in the research and 
development for exascale systems to maintain strategic and 
economic leadership. Government-funded research on domain 
skills, especially at our national laboratories, should target 
systems for modeling, simulation, and analytics on big data.
    Before 2005, the United States had a clear lead in the 
global supercomputing race. Today, we are still ahead but the 
rest of the world is catching up rapidly. To stay ahead will 
require new skills and knowledge and new types of decision-
making. Nearly two million IT jobs will be created by 2015 in 
the United States to support big data, and the job candidates 
with analytic skills will get these jobs.
    Industry is developing many collaborative skills programs, 
as enumerated in my testimony. I highlight our announcement 
today with Rensselaer Polytechnic Institute to offer a graduate 
degree program in the fall of 2013, the Master of Science in 
Business Analytics.
    Privacy must be considered in the design of big data 
systems. Big data does not require the sacrifice of personal 
privacy. When personal information is used, designed-in 
processes such as IBM's Privacy by Design can protect privacy. 
When people understand how information is used, have the 
ability to set data usage policies, and enjoy the benefits of 
the analysis, they tend not to have privacy concerns.
    The government's role should focus on research and skills. 
First, Federal research investment in high-performance 
computing is critical to big data. Industry needs university-
based exploratory research into numerous areas including system 
design, flexible software defined environments, and IT 
infrastructure.
    Second, IBM strongly supports the reauthorization of the 
Department of Energy High End Computing Revitalization Act of 
2004 to be offered by Representative Hultgren. This bill will 
improve high-end computing R&D at the DOE and strengthen 
government industry partnerships for exascale platforms. IBM 
has a long history of successful partnerships with DOE. This 
partnership established computational simulation as an 
essential tool for scientific inquiry and gave the United 
States world leadership in high-performance computing. 
The challenge ahead is to continue this growth. Past Federal 
investments in HPC-related research, particularly at DOE's 
national labs, have underpinned mission-critical supercomputers 
at DOD, NASA, NOAA, and in the intelligence agencies.
    Third, the Professional Science Master's program supported 
by NSF is particularly relevant as it provides advanced 
training in science or mathematics and develops workplace 
skills valued by employers. Finally, Congress should 
reauthorize the Carl D. Perkins Act and the Federal work-study 
program and restructure them to align workforce needs and big 
data.
    In conclusion, there exists today a tremendous abundance of 
data about our world. New cognitive computing capabilities will 
help determine which countries and businesses will thrive. The 
United States should support advanced computing and build its 
workforce to seize the future.
    Thank you, and I welcome your questions.
    [The prepared statement of Dr. McQueeney follows:]

    [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
    
    Chairman Bucshon. Thank you, Dr. McQueeney.
    I now recognize Dr. Rappa for five minutes for his 
testimony.

           TESTIMONY OF DR. MICHAEL RAPPA, DIRECTOR,

               INSTITUTE FOR ADVANCED ANALYTICS,

              DISTINGUISHED UNIVERSITY PROFESSOR,

                NORTH CAROLINA STATE UNIVERSITY

    Dr. Rappa. Good morning, Chairman Bucshon, Chairman Massie, 
Ranking Member Lipinski, Ranking Member Wilson and other 
Members of the Subcommittee. I appreciate the opportunity to be 
here this morning to speak with you about data analytics and 
the role institutions of higher learning can play in advancing 
the field.
    I offer this morning's testimony on my own behalf, drawing 
on my experience over the past 25 years as a professor and as 
director of a research and educational institute.
    I think it is important to start with the fact that the 
world is changing around data very rapidly, and our ability to 
use data productively has become central to what we do as a 
society, as has already been noted. A generation 
ago, data was scarce, expensive, time-consuming to collect and 
difficult to analyze. Today, data is everywhere.
    Advances in computer technology and powerful analytic tools 
make it possible not only to collect vast quantities of data 
but also to analyze and draw insights from data to solve 
pressing problems, from increasing operational efficiency to combating 
fraud, to better health care, to protecting national security. 
Data is everywhere. The question is, how well are we prepared 
to use it? We have the data, the technology, the methods and 
tools, all of which continue to advance. The national 
challenge, in my view, going forward will be our ability to 
educate a data-savvy workforce that has the analytical skills 
to put data into action. Estimates of the talent gap, as we 
have heard, are large and growing.
    This is a dire but solvable problem. As we have shown at NC 
State, working closely with employers and focusing on their 
needs, we can produce the kind of talent that is so desperately 
needed today. We do it quickly in just 10 months with a 
domestic student population ranging in age from their early 20s to 
their late 50s, many of whom are returning to school. We have 
done this now for six years economically with consistently high 
student outcomes using a sustainable and scalable business 
model based on self-financed tuition.
    What it comes down to is creative innovation in how we 
organize graduate education, which allows us to engage with 
employers more productively and yields high-quality results in 
the skills and readiness of our graduates.
    I encourage the Committee to focus its attention on 
workforce needs, to encourage the government to seek out 
innovation in higher education and to promote new and novel 
learning models. This is a solvable problem. With the proper 
incentives, focused resources, open collaboration with 
industry, we can produce the analytics professionals needed to 
extract value from big data and to move the economy forward. As 
I said, we have done this ourselves now for six straight years 
to great effect. In another week we will graduate a class of 80 
students from the Master of Science in Analytics program, 95 
percent of whom are already placed in jobs. They 
are literally the most sought-after and highest-paid graduates 
of the university.
    So we can do this. It is a solvable problem. Thank you 
again for your time. I will be glad to answer any questions.
    [The prepared statement of Dr. Rappa follows:]

    [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
    
    Chairman Bucshon. Thank you for your testimony.
    I now recognize our final witness, Dr. Jahanian, for five 
minutes for his testimony.

               TESTIMONY OF DR. FARNAM JAHANIAN,

            ASSISTANT DIRECTOR FOR THE COMPUTER AND

           INFORMATION SCIENCE AND ENGINEERING (CISE)

            DIRECTORATE, NATIONAL SCIENCE FOUNDATION

    Dr. Jahanian. Good morning, Chairman Massie, Chairman 
Bucshon, Ranking Members Wilson and Lipinski, and Members of 
the Subcommittee. It is my pleasure to be back here to discuss 
the next generation of computing and big data analytics.
    Today we live in an era of data and information enabled by 
advanced technologies that surround us. Data is generated by 
modern experimental methods, scientific instruments such as 
telescopes and particle accelerators, large-scale simulators, 
Internet transactions, email, video images, clickstreams, and 
widespread deployment of sensors everywhere. Approximately 90 
percent of the data in the world today were created in the last 
two years alone. However, when we talk about big data, it is 
important to emphasize not only the enormous volume of data 
being generated but also the velocity, heterogeneity and 
complexity of data that now confronts us.
    Why is big data important? Several others have alluded to 
this already. Data represents a transformative new currency. 
Big data is increasingly important to all facets of our 
Nation's discovery and innovation ecosystem. First, insights 
and more accurate predictions from large and complex 
collections of data are creating opportunities in new markets, 
driving the creation of IT products and services and boosting 
the productivity of businesses. Second, advances in our ability 
to store, integrate, and extract meaning and information from 
data are accelerating the pace of discovery in almost every 
science and engineering discipline. Third, big data has the 
potential to solve many of the Nation's most pressing 
challenges from health care and education to cybersecurity and 
public safety, yielding enormous societal benefits and ensuring 
sustained U.S. competitiveness.
    Let me share with you just a few examples of the promise of 
big data. These are all grounded in research that is funded by 
the Federal Government or in work that is done in the private 
sector. By integrating biomedical, 
clinical and scientific data, we can predict the onset of 
diseases and identify unwanted drug interactions. By coupling 
roadway sensors, traffic cameras, and individual GPS devices, we 
can reduce traffic congestion and generate significant savings 
in time and fuel. By accurately predicting natural disasters 
such as hurricanes and tornadoes, we can employ lifesaving and 
preventative measures that mitigate their potential impact. By 
correlating disparate data streams through text mining, image 
analysis and face recognition, we can enhance public safety and 
public security. By integrating emerging technologies such as 
MOOCs and inverted classrooms with knowledge from research 
about how people learn, we can transform formal and informal 
education.
    What does this mean for scientific discovery? Data-driven 
discovery, also called the fourth paradigm, is revolutionizing 
scientific exploration and engineering innovations. It enables 
extraction of new knowledge, provides novel approaches to 
driving discovery and decision-making, yields increasingly 
accurate predictions and provides deeper understanding of 
causal relationships based on advanced data analysis.
    What is government doing to ensure we harness this 
potential? As was mentioned already, in 2011 the U.S. Networking 
and Information Technology Research and Development Program, 
also called NITRD, formed a big data senior steering group to 
identify, initiate and coordinate big data research and 
development activities across the government to ensure that 
Federal agencies, the scientific research enterprise, and 
the public maximally benefit from data-driven discovery. In March 
2012, the National Big Data R&D Initiative was launched, 
focusing the steering group's efforts on the tools, 
technologies and human capital needed to move from data to 
knowledge to action. We see exciting new partnership 
opportunities with the private sector, state and local 
governments, academia and nonprofits.
    At NSF, we have identified four major investment areas that 
address current challenges and promise to serve as the 
foundation of a comprehensive long-term agenda: first, investment 
in foundational research to advance big data techniques and 
technologies; second, support for building new 
interdisciplinary research communities; third, investment in 
education and workforce development; and finally, development 
and deployment of cyber infrastructure to capture, manage, 
analyze, and share digital data.
    I should add that NSF's investment in cyber infrastructure 
includes advanced computational resources that support data-
enabled science. In particular, the newly dedicated Blue 
Waters, Stampede and Yellowstone supercomputers will expand our 
Nation's computational capabilities significantly.
    In summary, big data represents enormous opportunities for 
our Nation. Investments in big data research and education will 
advance the frontier of knowledge, further fostering 
innovation, creating new economic opportunities, and yielding 
new approaches to addressing national priorities.
    Thank you again for this opportunity. I would be happy to 
answer any questions.
    [The prepared statement of Dr. Jahanian follows:]

    [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
    
    Chairman Bucshon. Thank you for your testimony. I would 
like to thank all the witnesses for their testimony. I am 
reminding Members that Committee rules limit questioning to 
five minutes, and the Chair at this point will recognize 
himself for five minutes to start the questions.
    First, Dr. Jahanian, the Administration announced their Big 
Data Research and Development Initiative in March 2012 
including $200 million in new commitments for big data research 
initiatives. However, the National Science Foundation, 
Department of Defense, Department of Energy, and other agencies 
have had significant research programs in data analytics that 
predated the initiative. How has the Administration's 
initiative changed the ways these agency research programs are 
coordinated and are we effectively managing and leveraging our 
research investments across agencies?
    Dr. Jahanian. Thank you for your question. You are 
absolutely right that it is not that suddenly last March we 
woke up and said boy, data is really important, we need to do 
something about it. There has been significant investment by 
the Federal sector and private sector in areas having to do 
with data. The challenges we face are many--stewardship of 
digital data and software, for example. Many data sets, as was 
mentioned, are poorly organized or unstructured. Many 
data sets are heterogeneous. The utility of data is also 
limited by our ability to interpret them. Many data are being 
collected at a scale at which we can't even store them, let alone 
analyze them. Also, large and linked data sets may be exploited 
to identify individuals and so there are also the privacy 
issues. So there are enormous challenges that we face.
    As you alluded to, on March 29, 2012, OSTP in concert with 
a number of Federal agencies launched the National Big Data 
Research and Development Initiative. It expands the scope of 
our activities in 
several directions, for example, state-of-the-art core 
technologies that we need to collect, store, preserve, manage 
and analyze data; harnessing these technologies to accelerate 
the pace of discovery; and supporting responsible stewardship 
and sustainable business models for big data.
    There are a number of cross-coordination efforts taking 
place under NITRD. Let me start with NSF. All NSF directorates, 
for example, are participating in this. A multidisciplinary 
panel of experts makes recommendations on funding. Furthermore, 
big data is coordinated through a senior steering group that 
reports to the assistant directors at NSF, because it involves 
every science and engineering discipline.
    As far as the Federal Government is concerned, the Big Data 
R&D Initiative is coordinated through the NITRD Subcommittee. 
As you know, I chair the Subcommittee. There is a senior 
steering group that regularly meets to coordinate the 
activities on many of the fronts that I alluded to. There are 
also enormous opportunities not only in terms of joint 
solicitations but there are a number of workshops that we are 
holding jointly with other agencies including NIH, NIST, DOE, 
DOD to advance the frontiers of knowledge and exploration in 
big data.
    I should also mention that when it comes to this 
initiative, we can't forget that the private sector plays a 
significant role. When we think about innovation and discovery 
ecosystems, not only are we talking about universities, we are 
talking about scientists and engineers, you know, a rich, 
talented labor force, investments in research and education, 
and of course, a vibrant private sector. So there are a number 
of programs that we have at NSF that attempt to connect the 
dots when it comes to transfer of knowledge.
    Chairman Bucshon. Thank you. I am glad to hear there is 
quite a bit of coordination at the Federal level because I 
think all of us are concerned about that and, again, about 
investing the taxpayer dollar wisely.
    Dr. Rappa, I also serve on the Education and Workforce 
Committee, and I have four children, ages 9 through 20, and I 
have a really strong interest in how we get young people 
interested in different fields of study, and obviously we have 
a tremendous challenge not only with this area but many others. 
What are your ideas on how we engage young people in 
understanding what opportunities there are in this area and 
what the jobs of the future might hold? I mean, how do we do 
that? Because, you know, when you go to a high-school class, 
and I talk to a lot of high-school classes, not many people 
speak up, when you ask them what they want to be, to say they 
want to analyze big data. So how do you do that? What is your 
recommendation?
    Dr. Rappa. Well, thank you very much for your question. I 
understand exactly what you are saying, but I think things are 
changing. It is exactly true that your average 8-year-old 
doesn't say they want to grow up to be, for example, a 
statistician. It is not common, unless they are really 
interested in sports; then you see a sort of nexus there because 
of the relationship. But what is changing is that education--in 
my case, at the graduate level--is reaching further down the 
pipeline into undergraduate education, and even touching upon 
high school, where students begin to understand how data is 
really used in action. So it is not just about creating 
knowledge or understanding but also about applying that 
knowledge. Our whole education is driven around the application 
of that knowledge, and so students really understand--and 
increasingly undergraduates understand--that this kind of 
graduate education is going to lead them to a very interesting, 
compelling professional life.
    Chairman Bucshon. Well, thank you, because I think we do 
need to have this type of information gravitate down, even to 
middle-school kids, to get them interested, and there is a 
program based in Indianapolis called Project Lead the Way, which 
I know very well, that is beginning to do that at the high-
school level, and it is showing some success.
    But my time has expired, so although I would love to talk 
more about that, at this point I am going to yield to Ms. Wilson 
for five minutes for her questions.
    Ms. Wilson. Thank you, Mr. Chair.
    Along those lines, can either one of you tell me what skills 
are necessary for the big data workforce? I heard you say 
something about an analytical something. And as you are 
speaking, I would also like to hear from you what role community 
colleges can play in preparing the next-generation workforce for 
big data.
    Dr. Rappa. Thank you very much for your question. I would 
like to try my hand at that. What is interesting and novel about 
what we have done around the education is that we started from 
scratch in building an entirely new graduate degree program, and 
we really wanted to address this question of what skills were 
needed. We focused on looking at the employer as the customer, 
in a sense--the individuals who buy our product, the students--
and really tried to understand the skills that they need. Where 
that brings you is that there are these technical skills, which 
are important--programming, math, and statistics--but employers 
really want much more than that. They want individuals who can 
work well in teams, who can communicate insights to decision 
makers, who can actually use the tools and apply the knowledge 
in an organizational context, and so we have structured the 
whole education to build a very balanced set of skills, as 
opposed to what I think is the conventional approach in graduate 
education, and to some extent undergraduate education, of 
focusing on the technical skills almost exclusively. So really 
what we need to do is approach the whole student. Now, I think 
community colleges can play a very important role, because you 
can begin to channel pipelines where students can go and get the 
prerequisite knowledge that they need--the early levels of math 
and statistics--before they go on to graduate education. Thank 
you.
    Dr. McQueeney. I would just like to comment that a lot of 
the focus in the past has been on the doctoral level of 
education, as Dr. Rappa just talked about, and while we continue 
to have a strong need for Ph.D.'s in computer science, 
electrical engineering, and mathematics, the biggest skill gap 
that we see is, quite frankly, at the master's level: people who 
may not have the mathematical skills to create an entirely new 
type of data analysis but who have more than basic IT skills and 
who can understand the implications of using different 
analytical techniques--given a problem and a data set with 
certain statistical properties, what would be the appropriate 
analytical technique to use, and when you apply that technique, 
how can you be sure that the results will be reliable and 
proper? So a lot of our focus has been on creating an 
intermediate level of skill that includes a basic understanding 
of how to use these tools, even if it would fall to someone with 
more of a Ph.D. level of training to create new analytical 
approaches.
    Dr. Jahanian. Representative Wilson, I want to echo 
something that has been said. If you think about big data, let 
us just step back. There are three related problems that go 
beyond big data; they affect our entire IT workforce--computer 
science, computational science, and so on. These problems have 
to do with underproduction, which everybody recognizes; 
underrepresentation; and pipeline issues. Chairman Bucshon 
already alluded to this: we need to worry about our high 
schools, we need to worry about the pipeline. I have three kids, 
and I know where we lose our kids. It is not in master's or 
Ph.D. programs; we lose the interest of our kids in high schools 
and middle schools. That has to be fixed, and there are a number 
of pilot programs that we have initiated that try to address 
that issue.
    Let me share with you one piece of anecdotal evidence that 
provides data on this. Annualized Bureau of Labor Statistics 
data predict that each year we will have about 140,000 job 
openings in computing and, broadly speaking, IT-related jobs, 
but we are only producing about 100,000 qualified individuals, 
including master's, Ph.D., undergraduate, and community college 
graduates. In fact, many of these jobs would be available to 
individuals who have two-year or four-year degrees.
    Another data point that I want to share with you is that 62 
percent of all newly created STEM job openings between 2010 and 
2020 will be in computing and IT. Let us not forget that. And 
that includes data, that includes computational skills and many 
of the other skills that the other witnesses alluded to. Thank 
you.
    Ms. Wilson. Just in my 16--oh, 10, 9, 8--remaining seconds: 
how do we begin to get children interested in these sorts of 
skills? I know every little child has an iPad. They can work 
these computers better than adults. What do you think we can do 
to stimulate that all the way from K-12 into the community 
colleges so we will have more IT graduates? Do you suggest we 
outfit classrooms with iPads, or what do you think?
    Dr. McQueeney. I think there is an intrinsic curiosity in 
younger folks about a lot of the tools they use to communicate 
with each other, which have tremendously greater scalability 
than the tools that I used to communicate with my friends.
    Ms. Wilson. Right.
    Dr. McQueeney. So the essence of what a large community's 
opinion is on a topic of interest could involve the opinions of 
thousands or millions of people, and I think a lot of the young 
folks I talk to when I visit K-12 programs, or in programs like 
eWeek, have an intrinsic sense not only of the device and the 
technology but of the reach of that device and technology, which 
is the beginning of an appreciation of what we are really 
talking about with big data--that there are trends they can 
reach with that device--and I think that fires their imagination 
in a very powerful way.
    Chairman Bucshon. Thank you. I will now recognize Mr. 
Massie, Chairman Massie, for his questioning.
    Mr. Massie. Thank you, Chairman.
    So one of the questions that I have, as we deal with the 
interface between government and private industry here, is: are 
you aware of any government data sets that we need to get more 
into the public domain for usage? For instance, I think we have 
done a pretty good job of getting some of the mapping data out 
there, but some of that map information is old--it goes back to 
the 1940s and 1950s--and I know the government has been paying 
for LIDAR mapping, which is high-resolution terrain mapping, and 
I am kind of concerned that that is not getting out there. Are 
you aware of that, and are there any other data sets that the 
public has paid for and that would be useful to the public, that 
we might want to work on getting out?
    Dr. McQueeney. I think the government has done an excellent 
job and has had many initiatives that were very focused on 
getting that valuable data out so it could be used. You 
mentioned LIDAR. I know that one very promising use for LIDAR is 
to do something like an inventory of the forests in the 
country--to actually be able to conduct a definitive inventory. 
Right now, the agencies that are responsible for that use a 
statistical sampling technique, but in a world where you can 
take LIDAR images and process that enormous data volume, you are 
able to move from a statistical sampling basis, which is all we 
could do before, to a more definitive approach and get a very, 
very good picture of one of our more valuable natural resources, 
one that needs tremendous amounts of stewardship. So I think 
that is an example of a data set that could be extremely 
valuable. But I think in general the government is very well and 
properly focused on getting those valuable data sources out. 
Basic weather data would be another good example that can be 
built on to add extra value.
    Mr. Massie. Are the other witnesses aware of any data sets 
that we need to promote more?
    Dr. Jahanian. I want to highlight a couple of things. I am 
sure you are aware of data.gov, which is a Web site that makes a 
lot of government data sets available; the goal is to increase 
public access to high-value, machine-readable data sets 
generated by the government, in the hope of creating new 
economic value. There are also a number of activities 
encouraging the private sector and entrepreneurs to develop 
applications on top of that data. It is not just making the data 
available but also making the data valuable, so there are a 
number of essential activities related to that.
    There was a recent Wall Street Journal article, actually, 
that highlighted at least a dozen different kinds of government 
data sets that have been made available--from labor and health 
violations to flu incidence, energy prices, offshore activities, 
solar information, and so on--that are interesting. From the 
National Science Foundation's point of view, I should mention 
that, as you may know, we have a number of large facilities. 
LSST was mentioned; NEON is another facility that collects, and 
will be collecting, a lot of data. The science and engineering 
community needs that data, and many Federal agencies are working 
very hard to make it available. There are a number of issues 
having to do with open access that go beyond the scope of this 
question.
    Mr. Massie. Let me ask a follow-up question to that. Big 
data, like any other data, could be misused, altered, hacked, or 
illegally accessed, and sometimes it may just be an honest 
mistake: we share data that we probably shouldn't have. For 
instance, some farm data got out there that could really 
compromise our food safety, if people know where all the food 
sources are. How do we balance the desire for privacy--actually, 
the constitutional right to privacy--with sharing all of this 
data, now that everybody is under a microscope?
    Dr. Rappa. I thank you for your question, and I would like 
to turn it around a little bit. Each year we work with about 16 
or 17 organizations--including three government agencies--that 
share data under a confidentiality agreement, and we put teams 
of students to work on very complex analytics projects with that 
data. So while I applaud open sharing--I do think the government 
is doing a good job at sharing data openly, and it is a very 
important thing to do--I think there is also an opportunity to 
engage the academic community in other ways to help understand 
that data, which might mitigate some of these issues around the 
privacy element.
    Mr. Massie. Dr. McQueeney?
    Dr. McQueeney. Yes, that is an excellent question. Thank 
you for that. One of the things that we can do is to get data 
about the data. We call it metadata. So we analyze the data, and 
we don't just look at what information we can get from it; we 
also describe the data, perhaps in terms of its sensitivity--is 
this more or less sensitive from the point of view of privacy or 
security or secrecy?--and we can then tag those data sets with 
metadata that describes the implications of using that data. We 
can then build policies into the systems that handle the data--
policies that look not only at the data but at the metadata 
describing its contents and the implications of sharing and 
combining it. So we can actually build into the foundation of 
big data systems the ability to interpret policies that we have 
set in a very conscious and clear-eyed way, and as those systems 
process the data, they can be respectful of that metadata. The 
medical community has actually done a lot of very good work 
around patient confidentiality while still getting very good 
pattern analysis of different kinds of outcomes.
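    A minimal sketch of the metadata-tagging idea Dr. McQueeney 
describes might look like the following Python: each data set 
carries tags about its sensitivity, and a policy check consults 
those tags before two sets may be combined. The field names and 
the specific rules are illustrative assumptions, not IBM's 
actual implementation.

        from dataclasses import dataclass
        from enum import IntEnum

        class Sensitivity(IntEnum):
            PUBLIC = 0
            INTERNAL = 1
            PRIVATE = 2     # e.g., data derived from individuals
            RESTRICTED = 3  # e.g., patient-level medical records

        @dataclass
        class Metadata:
            """Data about the data: tags that travel with a data set."""
            name: str
            sensitivity: Sensitivity
            contains_pii: bool

        def may_combine(a: Metadata, b: Metadata) -> bool:
            # Linking two PII-bearing sets raises re-identification
            # risk, so this policy refuses it outright.
            if a.contains_pii and b.contains_pii:
                return False
            # Otherwise the combination inherits the stricter rating,
            # and RESTRICTED material may not be combined at all.
            return max(a.sensitivity, b.sensitivity) < Sensitivity.RESTRICTED

        census = Metadata("census-block-counts", Sensitivity.PUBLIC, False)
        claims = Metadata("insurance-claims", Sensitivity.PRIVATE, True)
        print(may_combine(census, claims))  # True: only one side has PII
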
    Mr. Massie. Thank you very much. My time has expired. I 
appreciate your answer and concern for that question, Mr. 
Chairman.
    Chairman Bucshon. Thank you, Mr. Massie. I now recognize Dr. 
Bera for five minutes for his questions.
    Mr. Bera. Thank you, Mr. Chairman, and thank you for the 
series of hearings that we have had on the Subcommittee. It has 
been great.
    You know, big data is incredibly important--how we manage 
data, given the rapidity with which the world is changing. When 
I think back to being a high-school student, for me it was going 
to the index cards and looking things up in the encyclopedia. 
Now my daughter has vast access. When I did rounds in the 
hospital, we would have to race down to the library to get 
information, but now, before we are even finished presenting, 
the medical students or the residents can just look up the 
latest data on a device like this and get access to the most 
accurate and timely information. So it is incredibly important 
that we make these investments, not only to manage the data but 
to sort that data and then to make sure it is accessible. It is 
a critical priority that we have that workforce, both at the 
professional level and at the management level, and I think the 
number that I read was that we need about 1.5 million managers. 
So there is a huge need but also a huge opportunity.
    When I think back to the talent that has been impacted in 
the last four years in the recession, there are a large number 
of extremely intelligent and talented individuals in their 30s 
and 40s who have been hit hard. These are folks like myself who 
were trained for a 20th-century workforce but now find 
themselves in a 21st-century economy.
    Dr. Rappa, are there some best practices--and these aren't 
individuals who need to get a graduate degree; they are talented 
individuals--where we could take them and quickly train them for 
this new economy? Are there examples?
    Dr. Rappa. Right. So we do offer it as a graduate degree, 
but we do this in 10 months, and indeed, a fairly substantial 
portion of our population are people who are coming back from 
the workforce to go through this, and some of them are in 
exactly the position that you describe: they were transitioning, 
their companies were faltering. The key really is short 
duration. Ten months is actually a very reasonable amount of 
time, because you can build the skills that you need--if it is 
too short, you can't accumulate the skills--but the key thing is 
that you have clearly demonstrated ROI on that education, 
because the person who is coming in to do this has to know that 
they have a very high probability of getting a job when they 
leave, and at a particular salary, so that they can justify the 
investment of money and time. That is really what we have done.
    Mr. Bera. Dr. McQueeney, are there potentially any 
examples--you know, again, a lot of these folks are also paying 
their mortgages and have to continue to pay their bills--of 
possibly even doing an advanced work-study type of program, 
where you recruit this talent and they get on-the-job training 
as opposed to a traditional school model?
    Dr. McQueeney. Yes. In fact, there is a related topic here 
that I think is quite interesting, which is the application of 
big data and analytics back onto the educational process itself. 
You have seen the great upsurge in videos that attempt to 
replace traditional brick-and-mortar classroom attendance and 
coursework, and you have seen a number of startup companies 
formed in this space. If you look at the education process, each 
of us really learns quite differently. Some of us may learn more 
from hearing or from seeing or from working problems, and great 
teachers, great professors, are sensitive to how their different 
students learn and are capable of presenting material in 
alternate ways to make sure they reach all the students. With 
electronic delivery of course materials and monitoring of 
student progress, we generate digital exhaust, if you will, that 
describes how each student is learning and how that student 
responds to the instruction, and for the parts of the 
instruction that are delivered electronically, we actually have 
the ability to do analytics and run an optimization process, so 
that each of us on the panel might not get the same length of 
lecture on five different topics; it might be adjusted to our 
historical learning patterns.
    So we have worked with a number of universities and other 
non-traditional educational institutions to apply big data and 
analytics techniques to the education and training process 
itself.
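    The kind of optimization Dr. McQueeney sketches can be 
caricatured in a few lines of Python: allocate lecture time in 
proportion to what a student has not yet mastered. The mastery 
scores and the proportional rule are invented for illustration, 
not drawn from any actual IBM or university system.

        def allocate_minutes(total, mastery):
            """Split lecture time across topics in proportion to how
            much of each topic remains unmastered (1.0 = mastered)."""
            need = {topic: 1.0 - score for topic, score in mastery.items()}
            scale = total / sum(need.values())
            return {topic: round(n * scale) for topic, n in need.items()}

        # One student's history: strong on recursion, weak on pointers.
        print(allocate_minutes(60, {"recursion": 0.9,
                                    "pointers": 0.3,
                                    "sorting": 0.6}))
        # {'recursion': 5, 'pointers': 35, 'sorting': 20}
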
    Mr. Bera. Great. In my last 30 seconds: so we have access to 
data, but I think one element that we should also be conscious 
of is the quality of the data, because there certainly is very 
good-quality data, but at the same time there is very poor-
quality data out there. Would any of you like to comment on how 
we monitor quality?
    Dr. Rappa. I think most data starts off as bad data, for the 
most part, unless it is being collected in a highly careful way. 
Just as we hear about big data today, we are going to hear about 
bad data in the future. Most projects start out with an enormous 
front end devoted to understanding, cleaning, and cultivating 
the data to make it useful, and that is an important part of the 
educational process.
    Dr. Jahanian. I would just add that there are a number of 
techniques that have been developed, and are in development, for 
data exploration, data cleaning, and so on. Furthermore, when we 
talk about large-scale data sets, there are statistical 
techniques being applied that really take care of the noise and 
of some of these inconsistencies, and that is one of the 
attractions of big data.
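    As a rough illustration of the cleaning step the witnesses 
describe, a first pass over a raw table might look like the 
following Python, using pandas; the column name and the robust 
outlier rule are assumptions made only for the example.

        import pandas as pd

        def clean(df):
            """A plausible first pass over a raw table of readings."""
            df = df.drop_duplicates()           # repeated records
            df = df.dropna(subset=["reading"])  # missing values
            # A median-based outlier rule is robust: one huge spike
            # cannot inflate the scale estimate the way it inflates a
            # standard deviation, so the spike itself is still caught.
            med = df["reading"].median()
            mad = (df["reading"] - med).abs().median()
            return df[(df["reading"] - med).abs() <= 5 * mad]

        raw = pd.DataFrame(
            {"reading": [1.0, 1.1, 0.9, 1.0, 250.0, None, 1.1]})
        print(clean(raw))  # the 250.0 spike and the missing row are gone
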
    Mr. Bera. Great. Thank you.
    Chairman Massie. [Presiding] Thank you, Mr. Bera. I now 
recognize Mr. Schweikert from Arizona for five minutes.
    Mr. Schweikert. Thank you, Mr. Chairman.
    This is one of those types of conversations, you know, where 
we could all sit around, buy you some well-caffeinated coffee, 
and talk for hours and still have no idea whether we made any 
progress.
    Doctor, is it McQueeney?
    Dr. McQueeney. Yes.
    Mr. Schweikert. First, you are with IBM?
    Dr. McQueeney. Yes.
    Mr. Schweikert. In your testimony, help me do a little 
ferreting out here. Hardware technology or IT talent, what is 
your biggest bottleneck right now?
    Dr. McQueeney. There are bottlenecks in a number of areas. 
If I look at the hardware itself, the biggest challenge in 
getting from the petascale to the exascale is actually the power 
dissipation of the systems. The new technology work that we are 
doing is to make the computations more efficient in terms of 
floating-point operations per watt, so that if you assembled a 
system a thousand times bigger than today's supercomputers, you 
could house it and cool it.
    Mr. Schweikert. You don't want to take down the power grid?
    Dr. McQueeney. The power grid might not, in fact, be able to 
supply enough power if we didn't make some innovations. That is 
a good point.
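    The scale of the power problem Dr. McQueeney describes can 
be checked with back-of-the-envelope arithmetic; the efficiency 
figures below are rough circa-2013 assumptions, not IBM numbers.

        exaflop = 1e18        # target: 10**18 floating-point ops/second
        flops_per_watt = 2e9  # ~2 GFLOPS/W, a rough 2013-era efficiency
        print(exaflop / flops_per_watt / 1e6, "MW")  # 500.0, a power plant

        # To fit the roughly 20 MW budget often cited for a practical
        # exascale machine, efficiency must reach ~50 GFLOPS per watt:
        print(exaflop / 20e6 / 1e9, "GFLOPS/W needed")  # 50.0
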
    Mr. Schweikert. But hasn't your company actually been one of 
the leaders in producing some of those breakthroughs?
    Dr. McQueeney. In fact, we have, and a lot of that history 
goes back to work that started with the Department of Energy 
many years ago, and this bears on an interesting historical 
point. In a time when we are concerned about making investments 
efficiently, if I go back to the beginning of the ASCI program 
with the Department of Energy, for the nuclear weapons stockpile 
stewardship program, the Department of Energy scientists did a 
very careful analysis of what were the core algorithms--the core 
analytics, in today's language--that needed to run at a certain 
level to provide the mission capability they needed, and they 
found that the then-current path of supercomputing was going to 
take five years to produce a machine that they needed in one or 
two years. The analysis they did was thorough enough to reveal 
that there weren't bottlenecks everywhere; at that time the 
bottlenecks were mostly in interprocessor communication. So they 
made a very thoughtful, very surgical investment in accelerating 
just the piece that was needed to close their mission gap, and 
that was the beginning of a very long run of government-industry 
collaboration.
    Mr. Schweikert. But you are in some ways heading towards 
where my question is. So for that bottleneck, in today's world, 
do I find the technology out in the private sector around the 
world, competing and producing high-end supercomputing, or is it 
coming out of a government lab? And I know the pop-culture 
terminology is ``public-private partnership,'' but in reality 
they operate in pretty substantially different silos.
    Dr. McQueeney. The real forcing function for a breakthrough 
is a critical mission need. So in the case of high-performance 
computing, it has often been a government agency with a 
critical mission that----
    Mr. Schweikert. But they were doing a specific request for 
how they wanted to manage their data?
    Dr. McQueeney. That is correct, and once that technology is 
available, it can be consumed very rapidly in lots of other 
applications that could take great advantage of it but didn't 
have a compelling enough need to get over that hurdle. That is 
when the dispersal of the technology starts.
    Mr. Schweikert. Just as an aside, only because I had some 
acquaintances who were involved--I used to be an old SQL 
programmer, so I am way out of date now--IBM was actually 
running a fascinating large-data project sweeping data sets from 
the world's social media, gathering them, and looking for 
trends. Can you, in 30 seconds or so, tell me your knowledge of 
that?
    Dr. McQueeney. Yeah, we have analyzed public social media 
sources with several of our customers, and we can gain a lot of 
insights. For example, retailers can gain insights about trends 
among their clients. Transportation agencies can gain insights 
about likely traffic congestion. There are many sources of 
public data, both social media and other forms, that can be 
analyzed to reveal patterns about how people conduct their daily 
activities--patterns that are very useful for optimizing public 
infrastructure.
    Mr. Schweikert. Forgive me, I am blind as a bat without 
these. Is it Dr. Rappa?
    Dr. Rappa. Yes.
    Mr. Schweikert. Isn't my single biggest problem in big data 
right now noise--that when I put data set after data set 
together and build on them, small incremental errors actually 
create really bad decisions in the end?
    Dr. Rappa. Well, I think part of the education around 
handling big data deals very squarely with the quality of the 
data and how to clean it and cultivate it to reduce the noise, 
to----
    Mr. Schweikert. But you and I can go over a long series of 
public policies--state, national, military, others--that we 
built on really gigantic analyzed data sets, and they were 
wrong.
    Dr. Rappa. Well, I think that, you know, the challenge here 
is education. So as I alluded to earlier, we have teams of 
students----
    Mr. Schweikert. Is it education or developing educational 
skepticism?
    Dr. Rappa. It is developing the education around how to 
squarely understand the inherent challenges in data. Data is 
not born clean. It isn't born ready to be analyzed.
    Mr. Schweikert. And when you and I build our model, the way 
we weight it, you know, because we start plugging in human 
factors--you and I bring our biases and we----
    Dr. Rappa. And this is why we really need a focused 
education squarely around how you draw insights from data, 
because there are these inherent problems in data, especially as 
you scale data up, as you combine different data sets, as you 
combine different types of data.
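    Mr. Schweikert's worry about compounding error can be made 
concrete with a toy calculation; the one-percent error rate per 
merge is an arbitrary assumption, chosen only to show how 
quickly small linkage errors accumulate.

        # Suppose each time two data sets are linked, 1% of records
        # are matched incorrectly. Correctness compounds
        # multiplicatively, so a chain of merges is far worse than 1%.
        error_per_merge = 0.01
        correct = 1.0
        for merge in range(1, 11):
            correct *= 1 - error_per_merge
            print(f"after {merge:2d} merges: "
                  f"{100 * (1 - correct):4.1f}% bad")
        # after 10 merges, roughly 9.6% of linked records are wrong
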
    Mr. Schweikert. Thank you, Doctor, and Mr. Chairman, thank 
you for your tolerance. It is just one of my great fears. And 
look, I am a data freak--I mean, you have got to see the servers 
and stuff I have at home--but I have learned that when we make 
big-time public policy on something we all know is right, we 
keep making huge, very costly mistakes.
    Chairman Massie. Thank you, Mr. Schweikert. I now recognize 
Mr. Hultgren from Illinois for five minutes.
    Mr. Hultgren. Thank you, Mr. Chairman. Thank you all for 
being here. First of all, I just want to thank Dr. McQueeney: I 
appreciate your mention of, and support for, the exascale 
computing bill I am currently authoring. I am very excited about 
its potential and foresee a huge shift in our national computing 
capabilities.
    I do have a few questions, and I guess I would address this 
first one to Dr. McQueeney and also Dr. Jahanian--is that right? 
I am sorry. I wonder if you could comment briefly on where the 
United States stands, in your opinion, in worldwide computing 
leadership. I know the fastest supercomputer is one metric, but 
what do you use as a metric for big data to determine which 
countries are using it most effectively?
    Dr. McQueeney. The common thing that is cited in these 
discussions is the Top500 supercomputers list. It is compiled 
twice a year, as you well know, and we have usually been at the 
top of that list, and we continue to account for the majority of 
the systems on it. But other countries have noticed the success 
that we have had with government leading the way on high-
performance computing breakthroughs. Once those systems are 
built, they find hundreds and thousands of other applications, 
each with a client that might not have been able to fund that 
breakthrough themselves but can certainly utilize it. Other 
countries have popped up at the top of that list because they 
are interested in emulating the success we have had in leading 
the way with innovation and then seeing that innovation used 
broadly across the commercial sector. So the Top500 list is a 
very technical, perhaps very geeky, measure of who is on top, 
and I would say that we are still in a leadership position 
there, but our position has been stronger in the past than it is 
today.
    If you turn to more of a business view, you would want to 
look at the companies that are taking the best advantage of data 
sources, either to drive value in their companies or to provide 
benefits such as public safety or health benefits, and there 
again I think we are in a good position. But it is a very 
different kind of skill--a conversation we didn't quite finish 
before: the skill to build these large systems is a very 
focused, very large-scale, very capital-intensive activity, but 
the skills to use these systems are more focused on creativity 
and are actually better exercised by large numbers of small 
teams. In fact, the NSF has been a leader in fostering that kind 
of innovation, where thousands and thousands of groups can build 
innovative applications and take advantage of these systems.
    Mr. Hultgren. Thanks. Dr. Jahanian?
    Dr. Jahanian. Yes, just a couple of quick comments. There is 
no question that we continue to maintain our leadership 
worldwide in this area, and there is no doubt that continued 
investment in this area is extremely important to the future of 
the country. As I mentioned just a few minutes ago, NSF's 
investments in Blue Waters and Stampede, as well as the 
Yellowstone supercomputing center, represent a range of 
investments that we make in high-performance computing, 
addressing the needs not only of the top five percent of 
applications that have exceptionally high computational needs 
but also of a broad spectrum of researchers across the country 
in science and engineering who need computational resources.
    A couple of data points about Blue Waters, for example, 
which is at the University of Illinois: in terms of its 
computing power, if you could multiply two numbers together 
every second, it would take 32 million years to do what Blue 
Waters does in one second. That is the astonishing power of Blue 
Waters. In terms of storage capacity, memory capacity, and so 
on, there is a similar kind of scale.
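    Dr. Jahanian's figure checks out as simple arithmetic: one 
operation per second for 32 million years is about 10^15 
operations, i.e., one second of sustained petascale computation.

        seconds_per_year = 365.25 * 24 * 3600  # about 3.16e7
        ops = 32e6 * seconds_per_year          # one op/s for 32M years
        print(f"{ops:.2e}")                    # ~1.01e+15, a petaflop-second
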
    The second point that I want to make is that we view 
computation and data as two sides of the same coin. You really 
need to address both. So when we talk about computational 
capabilities, we also have to worry about the cyberinfrastructure 
to manage, curate, and serve data to the science and engineering 
community, and the investment in cyberinfrastructure has to be 
balanced between the computation side and the management and 
curation of data.
    Mr. Hultgren. Let me have--my time is running out but I 
have a follow-up question to the two of you as well if you 
could both comment in the time I have. It seems to me that 
exascale computing is focused on solving discrete problems that 
necessitate massive computing power and speed. Are these 
different problems than those we are addressing through big 
data analytical tools and how do these two terms, how are they 
different, how are they similar?
    Dr. McQueeney. Historically, we have tended to talk about 
them differently, but as we project how the exascale systems 
will be built and used, and as we look at the growing importance 
of big data analytic systems, we see that the platforms on which 
both will depend will be much more common than separate. In 
fact, we see no conflict between investments in what we have 
classically called HPC and what we are now calling big data 
analytics, and both are changing. The way we use an exascale 
system will not be the same way we use a petascale system. There 
isn't time here to go into it, but it actually morphs in a 
direction that has much more in common with what we will do in 
big data and analytics.
    Dr. Jahanian. I would just add that many of the problems the 
business community and the science and engineering community 
need solved are being addressed today through different kinds of 
computational architectures that don't necessarily require 
exascale computing--including weather modeling and a number of 
other applications that have been mentioned. So it is really 
important to consider the investment in exascale computing 
within the spectrum of investments that we make to support the 
computational and data needs of the entire science and 
engineering community and, of course, the private sector.
    Mr. Hultgren. Thank you so much. Chairman, thank you. I 
yield back.
    Chairman Massie. I now recognize Mr. Lipinski from Illinois 
for five minutes.
    Mr. Lipinski. Thank you, Mr. Chairman. I am glad that Dr. 
Jahanian mentioned Blue Waters there. We were just there not 
that long ago, but since you covered that, I can move on to a 
different area.
    Dr. McQueeney, in your testimony you talk about how the 
Federal Government needs to invest in big data if the U.S. is 
going to maintain its leadership and competitive edge in this 
area. The needs and potential benefits of big data for the 
Federal Government align closely with those of private industry 
in a number of areas. If that is the case, how can the Federal 
Government more effectively partner with industry to achieve 
common goals, and do you believe that industry has sufficient 
input into the Federal Government's research agenda as it 
relates to big data?
    Dr. McQueeney. I do think we have sufficient input. I think 
we have excellent dialogs with the relevant agencies and 
national laboratories, and I think the roles are complementary. 
I go back to the story about the early days of the ASCI program, 
where through a collaboration we realized that the key piece of 
a supercomputing system that needed to be accelerated was not 
the entire investment. We could ride on the commercial 
investments for most of the components of the supercomputing 
systems of that time except for one, which was the high-
bandwidth switching between processors. That kind of thoughtful 
connection between the leaders in commercial computing and the 
leaders on the government side has historically been able to 
identify which areas are critical to attaining government 
mission imperatives, where we can leverage commercial 
technology, and where we need to accelerate it in a surgical 
fashion. So it has, in our view, been a very good partnership, 
based on very high-bandwidth technical communication, 
understanding of applications, and knowing when the government 
should be leveraging commercial investments and when it needs to 
accelerate parts of that investment to attain unique mission 
goals. And again, as I have said before, once those barriers are 
crossed, in terms of either the scalability of the system or the 
internal bandwidth of the system, it opens up thousands of new 
applications where there were problems ready to be analyzed but 
those applications weren't large enough to drive that 
breakthrough. That is how the effect works: the leadership comes 
from some of the government agencies and then is realized 
broadly across industry. That is the essence of where this 
leadership has come from so successfully over the years.
    Mr. Lipinski. I want to follow up with Dr. Rappa on that. 
Dr. Rappa, you discussed the importance of public-private 
partnerships to realizing the benefits of big data and stated 
specifically that we must intensify and accelerate the national 
investment in proven models. What characteristics make a 
public-private partnership successful and what models should we 
be investing in? What were you referring to there?
    Dr. Rappa. Well, first of all, we have been doing this now 
for six years, so I think we have a fairly interesting, novel 
model for producing talent in this field, with a proven track 
record based on data--based on the market value of the 
graduates. But I think it comes partly from the university 
community, partly from the academic community. Obviously we have 
a set of missions to educate students, but we need to do that, I 
think, by trying to really understand the employer: what are 
they looking for when they hire talent, and what kinds of skills 
do they need in order to be effective on the job? And I think 
employers need to be open to working with the academic 
community. You know, there is a certain amount of dissonance 
that naturally occurs, because these are two different worlds 
with different missions, but I think we have shown that it is 
possible--with organizational innovation, with a focused effort, 
with a sense of openness--to engage the private sector in a very 
positive way, not just at NC State but at other universities. 
There are many, many examples now; I hope we are providing some 
leadership, but other universities are working with our model 
and also pursuing other creative models to do this. There are 
probably about two dozen around the country already.
    Mr. Lipinski. Thank you. Dr. Jahanian, anything you want to 
add about public-private partnerships?
    Dr. Jahanian. Yes, indeed. There is no question that the 
innovation ecosystem in this country includes academia, the 
private sector, government investment, and a talent-rich 
workforce. The private sector is investing heavily in cloud 
computing, as you know, and in making computational resources 
available. I think there are opportunities for Federal 
investment to leverage that and make some of those commercially 
available resources accessible to our researchers, to our 
scientists and engineers, who could rely on those systems. We 
have announced a number of partnerships--one with IBM and 
Google, another one with Microsoft--that make some of those 
resources available to the research community.
    Dr. McQueeney already mentioned that there is high-bandwidth 
communication between the private sector and various Federal 
agencies. I can tell you from NSF's perspective that it is a 
very, very rich collaboration. I have a number of senior leaders 
from the private sector who serve on my advisory committee, 
advising us on our portfolio and our investments, in addition to 
the academics who serve on it.
    The final comment that I want to make is that there are a 
number of programs at NSF--and I know you are familiar with all 
of them, including SBIR and I-Corps--that focus on the transfer 
of knowledge from lab to practice. The Federal Government 
invests heavily in advancing the frontiers of knowledge. 
Accelerating programs such as I-Corps and SBIR serves a 
tremendous purpose, and here again there are opportunities to 
engage the private sector and accelerate the transfer of 
knowledge to practice to benefit the Nation. Thank you.
    Mr. Lipinski. Thank you.
    Chairman Massie. Thank you, Mr. Lipinski. I now recognize 
Mr. Bridenstine from Oklahoma for five minutes.
    Mr. Bridenstine. Thank you, Mr. Chairman.
    I also serve on the House Armed Services Committee, and I am 
aware that the Department of Defense is moving towards cloud-
based computing solutions. This of course creates some 
consternation about security issues--cyber hacking and other 
cyber crimes--and I was wondering whether any of your 
organizations are involved in helping the Department of Defense 
work through these issues, and what those solutions might be, if 
you could share with us on that.
    Dr. McQueeney. Sure, if I could start. You are quite right 
to raise the concern about security for any systems used by the 
Defense Department especially, although it would be true for all 
Federal agencies. When you move to a cloud computing model, 
there is an extra imperative to be concerned about security, and 
if you think of it as the DOD might think of it, if that 
environment should be compromised by an enemy, it is a bigger 
piece of resource than an individual machine, so it requires 
special vigilance. Now, the good news, technically, is that with 
the way we handle virtualization--which is the foundation of how 
cloud computing is delivered, from a compute virtualization 
point of view--there are actually sophisticated techniques that 
can provide additional security in a virtualized environment 
beyond what we can provide when things run on bare metal. We 
have additional abilities to instrument the operation of that 
cloud and to very rapidly detect any kind of pattern or behavior 
that is indicative of a threat.
    We did a project a number of years ago with the U.S. Air 
Force--and they graciously let us write a short press release on 
it--where we built a cloud computing environment that was at the 
cutting edge a few years ago. We instrumented it very 
thoroughly, watching the packets flowing on the interconnect 
network that built the cloud in question, and we very carefully 
isolated it from the rest of the world, introduced known cyber 
attacks into it, and were able to show that if we knew the 
patterns of command and control, as the defense folks might say, 
of these cyber attacks, we could actually spot them assembling 
themselves and interrupt them before they had a chance to 
launch. So having tremendous control over the environment out of 
which we were getting compute resources gave us the ability to 
do additional security and additional monitoring: even if we 
assumed the security was not perfect and could be breached, 
could we essentially detect that breach in real time and 
interrupt it before it did damage? I thought that was a very 
forward-looking piece of work, driven by the Air Force CIO's 
office.
    Mr. Bridenstine. Excellent. Go ahead.
    Dr. Jahanian. As you alluded to, these new environments, 
whether mobile platforms or cloud computing, are introducing new 
challenges, and we recognize that attackers and defenders are 
coevolving and that there are enormous challenges in protecting 
our critical infrastructure and our cyberinfrastructure.
    I want to mention NSF's Secure and Trustworthy Cyberspace 
program, a research program addressing many of the challenges we 
alluded to--one that addresses not only the technology issues 
but also the transition to practice. Furthermore, the NITRD 
research and development subcommittee has a working group that 
focuses on coordinating activity on cybersecurity across the 
various agencies, and there is a rich dialog among those 
agencies on that issue.
    Mr. Bridenstine. Excellent. Are there any other things the 
Department of Defense could do to help you with the objective of 
securing cloud computing for the Department of Defense?
    Dr. Rappa. So I am currently co-directing a project with a 
colleague at NC State--the Science of Security project, done in 
collaboration with Carnegie Mellon University and the University 
of Illinois--in which we are trying to bring together large, 
multidisciplinary groups of faculty to really try to understand 
the underpinnings of the security problem and how to produce 
science around it. It is a very long-term challenge, but it is 
one that I think has to start with getting faculty across 
different disciplines focused on it. It has certainly been a 
tremendous opportunity, and I look forward to it moving into the 
future.
    Dr. McQueeney. Yeah, Dr. Rappa makes a very interesting 
point, to close the loop here. The cybersecurity problem is 
itself a big-data and fast-data problem. In fact, some of the 
advanced persistent threats that we see today depend on 
breaching an infrastructure and then lying dormant for several 
months; what the attacker is trying to do is wait out how long 
you keep your log file data, so that when they launch 
themselves, it is difficult to do forensics. What we have 
learned is that these log files are actually the essence of the 
big data you need in order to do pattern analysis and pattern 
discovery for forensics, should any attack occur. So in fact, 
most of the science behind big data--including data at rest, 
large-scale computation, and fast data arriving in very high-
speed streams--is directly relevant to the subject of cyber 
defense.
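    A toy version of that pattern analysis, in Python, scans 
retained connection logs for the suspiciously regular ``phone 
home'' beaconing typical of dormant malware; the log format and 
the regularity threshold are invented for illustration, not 
drawn from any IBM product.

        from collections import defaultdict
        from statistics import pstdev

        # (timestamp in seconds, source host, destination) log entries
        log = [(0, "ws-17", "203.0.113.9"),
               (3600, "ws-17", "203.0.113.9"),
               (7200, "ws-17", "203.0.113.9"),
               (10800, "ws-17", "203.0.113.9"),
               (40, "ws-02", "intranet"),
               (95, "ws-02", "intranet")]

        by_pair = defaultdict(list)
        for ts, src, dst in log:
            by_pair[(src, dst)].append(ts)

        for (src, dst), times in by_pair.items():
            if len(times) < 4:
                continue  # too few contacts to judge regularity
            gaps = [b - a for a, b in zip(times, times[1:])]
            mean_gap = sum(gaps) / len(gaps)
            # Near-constant gaps suggest an automated beacon rather
            # than a human; flag the pair for forensic review.
            if pstdev(gaps) < 0.05 * mean_gap:
                print(f"possible beacon: {src} -> {dst}, ~{mean_gap:.0f}s")
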
    Mr. Bridenstine. Thank you.
    Chairman Massie. Thank you, Mr. Bridenstine. If the Ranking 
Member is amenable, I think we will do another round of 
questions.
    Ms. Wilson. Yes.
    Chairman Massie. Did you have something to introduce into 
the record?
    Ms. Wilson. I do. Thank you, Mr. Chair. Mr. Kilmer has lots 
of conflicts; as we saw, he came to the meeting and then had to 
leave. I want to ask unanimous consent on behalf of Mr. Kilmer 
to introduce a report on big data from IDC into the record, and 
then I have a question.
    Chairman Massie. Without objection, so ordered. It will be 
entered into the record.
    [The information appears in Appendix II]
    Ms. Wilson. Thank you. This question is for everyone.
    We have all had several discussions lately about the value 
of NSF-funded research to society and how we might assess that 
value based on the grant proposal. I think we might use big data 
instructively here. It is an incredibly interdisciplinary field, 
where tools developed in the pursuit of one narrow research 
question--let us say in the social sciences--might have profound 
applications across many fields of science, and even in many 
sectors of the economy, that can't possibly be anticipated at 
the time of the proposal. What is the potential for data 
analytics developed in one seemingly irrelevant little corner to 
have unintended benefits for other fields and societal 
applications? And if you have concrete examples, that would be 
even better for our understanding. Thank you.
    Dr. Jahanian. Okay. I guess I will start. There is no 
question that there are all sorts of explorations we are doing 
in the area of big data whose potential impact we can't even 
begin to see. I will give you an example. NSF, along with other 
agencies and the private sector, has been investing in what is 
known as machine learning. These investments have taken place 
over at least 20 or 30 years; in fact, IBM has also led efforts 
in this area. I can tell you that it is the investments of the 
last 20 or 30 years that have come to fruition, such that 
machine learning algorithms now essentially allow us to look at 
these large data sets, identify trends, and adapt. They have a 
broad range of applications--from weather forecasting to 
financial modeling to biomedical research and so on--that have 
had tremendous, tremendous impact, and now we use these 
techniques as if they were off-the-shelf solutions that you can 
buy. They have come to fruition through years of investment, so 
that is one example.
    We are investing in all sorts of areas--in natural language 
understanding, in information retrieval, in various algorithms 
and automated, scalable approaches to reasoning--that could be 
applied, for example, to understanding the relationship between 
gene sequence, structure, and biological function. These are 
essentially the kinds of investments we are making. For some of 
them, we can already see how they will come to fruition; some 
rely on decades of investment that we have already made in 
computational and data-intensive techniques.
    Dr. McQueeney. If I could offer you an example from the 
medical world: one of the critical problems in medicine is the 
loss of premature infants to infections, and physicians have 
struggled for a long time with identifying the onset of an 
infection at a very early point, because these infections can 
grow exponentially; the earlier you can intercept them, the more 
likely you are to have a lifesaving benefit for someone as 
vulnerable as a premature infant. We have done work with the 
Toronto Hospital for Sick Kids, where a physician had an idea 
about all the instrumentation in the NICU. You have probably 
been in a hospital room or an intensive-care room: there are all 
these instruments around the bed, and someone comes in every 
half an hour and writes down the numbers, but the instruments 
are producing readings continuously. This physician had the idea 
that if we kept all that data--stored it as it came out of the 
machines in real time, which was a tremendous aggregation from a 
data-velocity point of view--and correlated it with the eventual 
issues that these premature infants had, we might be able to 
detect patterns, using techniques such as the machine learning 
we were just hearing about, that would give us early 
identification of an upcoming infection and the ability to treat 
it before it got out of control. And her theories were 
absolutely correct. There were signatures in the data that gave 
up to 24 hours' advance notice of the onset of an infection, 
which was time enough for the doctors, in many cases, to provide 
some kind of lifesaving therapy. So there is an example of very, 
very deep mathematics and computer science being applied to a 
problem where the data was being produced every day by these 
instruments but wasn't being captured, wasn't being looked at, 
and wasn't being correlated with results--and doing so produced 
a fantastic outcome.
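    A drastically simplified sketch of that kind of early-
warning monitor appears below; the windowing scheme and the 
thresholds are invented for illustration and bear no relation to 
the hospital's actual clinical models.

        from collections import deque

        class EarlyWarning:
            """Flag a sustained fall in a vital-sign variability
            measure relative to an infant's own baseline."""

            def __init__(self, window=60, threshold=0.5):
                self.recent = deque(maxlen=window)
                self.baseline = None
                self.threshold = threshold

            def observe(self, variability):
                """Feed one reading; return True to raise an alert."""
                self.recent.append(variability)
                if len(self.recent) < self.recent.maxlen:
                    return False               # still filling the window
                avg = sum(self.recent) / len(self.recent)
                if self.baseline is None:
                    self.baseline = avg        # learn this infant's norm
                    return False
                # Alert only when a whole window sits well below the
                # baseline, never on a single noisy reading.
                return avg < self.threshold * self.baseline

        monitor = EarlyWarning()
        # Streaming one reading per second: alert = monitor.observe(x)
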
    Dr. Rappa. I would just sum up by saying that big data is 
really part of a decades-long process that started with 
computerization in the 1940s and 1950s and became interconnected 
through the Internet in the 1970s, 1980s, and 1990s. In the 
world we are turning into, data is going to be everywhere. It is 
going to affect exactly what happens here. It is going to affect 
hospitals, universities--literally every corner of the economy--
and so we need to take approaches that develop understanding 
around big data: how it is applied, how the tools of analytics 
are applied across virtually every sector of the economy. I 
would take a very broad view, not looking at it as specifically 
a realm of computer technology or some other isolated realm, but 
looking at it as, unfortunately, the big thing it is.
    Dr. Jahanian. May I offer another example that I was 
reminded of? I am thinking of the work by Daphne Koller and her 
collaborators at Stanford on classifying breast cancer via image 
analysis. As you know, 40,000 women die from this disease each 
year. By applying image analysis techniques to thousands and 
thousands of biopsy images, they were able to identify, out of 
6,000 possible cellular features, a small subset that was 
predictive of survival time among breast cancer patients. What 
is really surprising is that the best predictor of survival they 
identified was not from the cancerous tissue itself but from the 
surrounding tissue, and that has led to new kinds of treatments, 
new kinds of diagnostic techniques, and very personalized 
treatment that could aim to improve survival times in patients. 
That is a very, very concrete example.
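    The feature-winnowing step in such a study can be 
caricatured in a few lines of Python: rank thousands of 
candidate image features by how strongly each correlates with 
survival time, then keep the handful that stand out. The data 
below are random stand-ins with one planted signal, not the 
study's actual features.

        import numpy as np

        rng = np.random.default_rng(0)
        n_patients, n_features = 200, 6000
        features = rng.normal(size=(n_patients, n_features))
        # Plant one genuinely predictive feature among the noise.
        survival = 60 + 10 * features[:, 42] + rng.normal(size=n_patients)

        # Score every feature by absolute correlation with survival.
        scores = np.abs([np.corrcoef(features[:, j], survival)[0, 1]
                         for j in range(n_features)])
        print(np.argsort(scores)[::-1][:5])  # feature 42 ranks first
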
    Another example is the work that Google did during the H1N1 
virus. I will be very brief about this. Before a vaccine was 
discovered, we wanted to track the spread of the disease. Google 
engineers used data that had nothing directly to do with the 
virus: billions of web searches from around the world, together 
with publicly available historical data on flu trends, to 
predict the spread of the flu virus down to small regions across 
the world. This is a remarkable application of data that one 
would never have thought could be applicable to something like 
the H1N1 virus.
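    In spirit, the approach amounts to regressing officially 
reported flu incidence on the volumes of flu-related search 
queries, then using the fitted model to ``nowcast'' regions and 
weeks where official counts are slow to arrive. A minimal mock-
up with synthetic numbers:

        import numpy as np

        rng = np.random.default_rng(1)
        weeks = 100
        # Weekly volumes of three flu-related query phrases (synthetic).
        queries = rng.poisson(lam=[120, 80, 40], size=(weeks, 3))
        flu = (0.02 * queries @ np.array([1.0, 0.5, 0.25])
               + rng.normal(scale=0.3, size=weeks))

        # Fit incidence ~ query volumes on historical weeks, then
        # predict a new week from search activity alone.
        X = np.column_stack([queries, np.ones(weeks)])
        coef, *_ = np.linalg.lstsq(X, flu, rcond=None)
        new_week = np.array([150, 95, 60, 1])
        print(f"predicted incidence: {new_week @ coef:.2f}")
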
    Ms. Wilson. Thank you very much.
    Chairman Massie. Thank you, Ms. Wilson, and thank you for 
that excellent example of how a private company can find 
information in the data.
    We got a little bit out of order, so the last question is 
going to be mine; I reserve five minutes for myself. The 
question I want to ask is this: we have heard about banks that 
are too big to fail, and we also know that the Internet is now 
too big to fail. We recently passed the CISPA bill in the House, 
which is somewhat controversial, but some people felt it was 
necessary because the Internet is so big and pervasive in our 
lives. So my question to you is, are there any big data sets 
that are too big to fail? In other words, are there data sets so 
pervasive that, almost by osmosis, we have become dependent upon 
them--for instance, weather data and early-warning systems? Not 
all of those, I imagine, are government systems. Some of them 
are private, but possibly the government is relying on these 
systems, and so I would be remiss if I didn't ask this question 
now, before something fails. Tell us: what is too big to fail 
right now? What would we bail out, and is there sufficient 
redundancy in the collection, storage, and access of these data 
sets? Thank you.
    Dr. McQueeney. Well, first, I would just like to say that we 
were delighted to support that cyber bill, and I congratulate 
you on such broad bipartisan support in the House for getting it 
acted upon.
    Data sets have the property that they can often be 
subdivided and replicated, so we have a lot of techniques by 
which we can assure the continuity of data if we take the time 
to do it. If there were very valuable historical records--on 
things like long-term weather trends--that were stored in only 
one place, that actually could be a concern, because that is 
literally irreplaceable data. But essentially all of the IT 
techniques needed to take those large data sets, segment them, 
and replicate them in different secure places so they could be 
re-created do exist. I think you raise an interesting point, 
though: it is worthwhile to periodically check that we are being 
appropriately vigilant with digital archives that are so 
valuable.
    Chairman Massie. Dr. Jahanian?
    Dr. Jahanian. I don't have a specific example. What I can 
tell you is that, similar to the issue of cybersecurity, as the 
Nation's critical infrastructure--and, more generally, the 
Internet--plays a vital role in integrating the economic, 
political, and societal fabric of our society, we are going to 
become more and more dependent on data. Data is going to play an 
increasingly significant role in our day-to-day lives, and for 
that reason, I think the same sorts of issues that apply to all 
the IT solutions we take for granted will increasingly apply to 
data.
    From the research and engineering community's point of view, 
it is not just failure of the data that concerns us; making the 
data accessible, and accessible to a broad community of 
scientists and engineers, is an issue that we are quite 
concerned about.
    Chairman Massie. Thank you very much. I was part of the 
bipartisan group on CISPA--opposing CISPA, actually--but that is 
okay.
    I want to thank the witnesses for their valuable testimony 
and the Members for their questions today. The Members of the 
Committee may have additional questions for you, and we will ask 
that you respond to those in writing. The record will remain 
open for two weeks for additional comments and written questions 
from the Members.
    The witnesses are excused and this hearing is adjourned.
    [Whereupon, at 11:35 a.m., the Subcommittees were 
adjourned.]



                               Appendix I

                              ----------                              


                   Answers to Post-Hearing Questions

Responses by Dr. Michael Rappa

[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]

Responses by Dr. Farnam Jahanian

[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]

                              Appendix II

                              ----------                              


                   Additional Material for the Record

   IDC IVIEW, The Digital Universe in 2020: Big Data, Bigger Digital 
       Shadows, and Biggest Growth in the Far East, submitted by 
                      Representative Derek Kilmer

[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]