[House Hearing, 113 Congress]
[From the U.S. Government Publishing Office]
NEXT GENERATION COMPUTING
AND BIG DATA ANALYTICS
=======================================================================
JOINT HEARING
BEFORE THE
SUBCOMMITTEE ON RESEARCH &
SUBCOMMITTEE ON TECHNOLOGY
COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY
HOUSE OF REPRESENTATIVES
ONE HUNDRED THIRTEENTH CONGRESS
FIRST SESSION
__________
WEDNESDAY, APRIL 24, 2013
__________
Serial No. 113-22
__________
Printed for the use of the Committee on Science, Space, and Technology
Available via the World Wide Web: http://science.house.gov
----------
U.S. GOVERNMENT PRINTING OFFICE
80-561 PDF WASHINGTON : 2013
COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY
HON. LAMAR S. SMITH, Texas, Chair
DANA ROHRABACHER, California EDDIE BERNICE JOHNSON, Texas
RALPH M. HALL, Texas ZOE LOFGREN, California
F. JAMES SENSENBRENNER, JR., DANIEL LIPINSKI, Illinois
Wisconsin DONNA F. EDWARDS, Maryland
FRANK D. LUCAS, Oklahoma FREDERICA S. WILSON, Florida
RANDY NEUGEBAUER, Texas SUZANNE BONAMICI, Oregon
MICHAEL T. McCAUL, Texas ERIC SWALWELL, California
PAUL C. BROUN, Georgia DAN MAFFEI, New York
STEVEN M. PALAZZO, Mississippi ALAN GRAYSON, Florida
MO BROOKS, Alabama JOSEPH KENNEDY III, Massachusetts
RANDY HULTGREN, Illinois SCOTT PETERS, California
LARRY BUCSHON, Indiana DEREK KILMER, Washington
STEVE STOCKMAN, Texas AMI BERA, California
BILL POSEY, Florida ELIZABETH ESTY, Connecticut
CYNTHIA LUMMIS, Wyoming MARC VEASEY, Texas
DAVID SCHWEIKERT, Arizona JULIA BROWNLEY, California
THOMAS MASSIE, Kentucky MARK TAKANO, California
KEVIN CRAMER, North Dakota ROBIN KELLY, Illinois
JIM BRIDENSTINE, Oklahoma
RANDY WEBER, Texas
CHRIS STEWART, Utah
VACANCY
------
Subcommittee on Research
HON. LARRY BUCSHON, Indiana, Chair
STEVEN M. PALAZZO, Mississippi DANIEL LIPINSKI, Illinois
MO BROOKS, Alabama ZOE LOFGREN, California
STEVE STOCKMAN, Texas AMI BERA, California
CYNTHIA LUMMIS, Wyoming ELIZABETH ESTY, Connecticut
JIM BRIDENSTINE, Oklahoma EDDIE BERNICE JOHNSON, Texas
LAMAR S. SMITH, Texas
------
Subcommittee on Technology
HON. THOMAS MASSIE, Kentucky, Chair
JIM BRIDENSTINE, Oklahoma FREDERICA S. WILSON, Florida
RANDY HULTGREN, Illinois SCOTT PETERS, California
DAVID SCHWEIKERT, Arizona DEREK KILMER, Washington
EDDIE BERNICE JOHNSON, Texas
LAMAR S. SMITH, Texas
C O N T E N T S
Wednesday, April 24, 2013
Witness List
Hearing Charter
Opening Statements
Statement by Representative Larry Bucshon, Chairman, Subcommittee
on Research, Committee on Science, Space, and Technology, U.S.
House of Representatives
Written Statement
Statement by Representative Daniel Lipinski, Ranking Minority
Member, Subcommittee on Research, Committee on Science, Space,
and Technology, U.S. House of Representatives
Written Statement
Statement by Representative Thomas Massie, Chairman, Subcommittee
on Technology, Committee on Science, Space, and Technology,
U.S. House of Representatives
Written Statement
Statement by Representative Frederica S. Wilson, Ranking Minority
Member, Subcommittee on Technology, Committee on Science,
Space, and Technology, U.S. House of Representatives
Written Statement
Witnesses:
Dr. David McQueeney, Vice President, Technical Strategy and
Worldwide Operations, IBM Research
Oral Statement
Written Statement
Dr. Michael Rappa, Director, Institute for Advanced Analytics,
Distinguished University Professor, North Carolina State
University
Oral Statement
Written Statement
Dr. Farnam Jahanian, Assistant Director for the Computer and
Information Science and Engineering (CISE) Directorate,
National Science Foundation
Oral Statement
Written Statement
Discussion
Appendix I: Answers to Post-Hearing Questions
Dr. Michael Rappa, Director, Institute for Advanced Analytics,
Distinguished University Professor, North Carolina State
University
Dr. Farnam Jahanian, Assistant Director for the Computer and
Information Science and Engineering (CISE) Directorate,
National Science Foundation
Appendix II: Additional Material for the Record
IDC IVIEW report, The Digital Universe in 2020: Big Data, Bigger
Digital Shadows, and Biggest Growth in the Far East, submitted
by Representative Derek Kilmer, Subcommittee on Technology,
Committee on Science, Space, and Technology, U.S. House of
Representatives
NEXT GENERATION COMPUTING AND BIG DATA ANALYTICS
----------
WEDNESDAY, APRIL 24, 2013
House of Representatives,
Subcommittee on Research &
Subcommittee on Technology
Committee on Science, Space, and Technology,
Washington, D.C.
The Subcommittees met, pursuant to call, at 10:04 a.m., in
Room 2318 of the Rayburn House Office Building, Hon. Larry
Bucshon [Chairman of the Subcommittee on Research] presiding.
Chairman Bucshon. All right. This joint hearing of the
Subcommittee on Research and the Subcommittee on Technology
will come to order.
Good morning, and welcome to today's joint hearing entitled
``Next Generation Computing and Big Data Analytics.'' In front
of you are packets containing the written testimony,
biographies and Truth in Testimony disclosures for today's
witnesses.
Before I get started, since this is a joint hearing
involving two Subcommittees, I want to explain how we will
operate procedurally so all Members understand how the
question-and-answer period will be handled. As always, we will
alternate rounds of questioning between majority and minority
Members. The Chairmen and Ranking Members of the Research and
Technology Subcommittees will be recognized first. Then we will
recognize Members present at the gavel in order of seniority on
the full Committee and those coming in after the gavel will be
recognized in order of their arrival. I now recognize myself
for five minutes for an opening statement.
Again, I would like to welcome everyone to today's hearing
where we will examine how advancements in information
technology and data analytics enable private and public sector
organizations to provide greater value to their customers and
citizens. Industry, academia, and government are all interested
in determining how to extract value, gain insights, and make
better decisions based on the wealth of data that is generated
today. In recent years, ``big data'' has become the popular
term used to encompass this phenomenon.
TechAmerica, an information technology trade association,
defines big data as ``large volumes of high-velocity, complex
and variable data that require advanced techniques and
technologies to enable the capture, storage, distribution,
management, and analysis of the information.''
Big data offers a range of opportunities for private
industry to reduce costs and increase profitability. It can
enable scientists to make discoveries on a previously
unreachable scale. And it can allow governments to identify
ways to serve their citizens more efficiently.
The McKinsey Global Institute predicts that effective
information management can provide $300 billion in annual value
to the U.S. health care sector alone. TechAmerica released a
report last year highlighting how big data initiatives can
improve the efficiency and effectiveness of government
services, and through the use of advanced computing power and
analytic techniques, universities and Federal laboratories can
drive new research initiatives that will significantly increase
our scientific knowledge base.
There are also various challenges associated with big data
that the Committee will explore today. McKinsey has estimated
that the U.S. will face a shortfall of 140,000 to 190,000
professionals with significant technical depth in data
analytics, and a further shortfall of an additional 1.5 million
managers and analysts who can work effectively with big data
analysis by 2018. Committee Members will be interested to learn
how industry, academia, and government are addressing this
shortfall.
While the term ``big data'' is relatively new, public and
private organizations have been investing in computing power
and data analytics for a number of years. In March of last
year, the Obama Administration announced a Big Data Research
and Development Initiative, including $200 million in new
funding across six different government departments and
agencies. I am interested to learn how effectively these
programs are being coordinated across the different Federal
agencies to ensure that taxpayer dollars are being leveraged
effectively. Finally, privacy and security are major concerns
when private and public organizations are collecting,
analyzing, and disseminating massive data sets.
We have an excellent panel of witnesses ranging across
industry, academia, and government. I would like to extend my
appreciation to each of our witnesses for taking the time and
effort to appear before us today. We look forward to your
testimony.
[The prepared statement of Mr. Bucshon follows:]
Prepared Statement of Subcommittee on Research Chairman Larry Bucshon
Good morning, I would like to welcome everyone to today's hearing
where we will examine how advancements in information technology and
data analytics enable private and public sector organizations to
provide greater value to their customers and citizens.
Industry, academia, and government are all interested in
determining how to extract value, gain insights, and make better
decisions based on the wealth of data that is generated today. In
recent years, ``Big Data'' has become the popular term used to
encompass this phenomenon.
TechAmerica, an information technology trade association, defines
Big Data as ``large volumes of high velocity, complex and variable data
that require advanced techniques and technologies to enable the
capture, storage, distribution, management, and analysis of the
information.''
Big Data offers a range of opportunities for private industry to
reduce costs and increase profitability. It can enable scientists to
make discoveries on a previously unreachable scale. And it can allow
governments to identify ways to serve their citizens more efficiently.
The McKinsey Global Institute predicts that effective information
management can provide $300 billion in annual value to the US health
care sector alone. TechAmerica released a report last year highlighting
how Big Data initiatives can improve the efficiency and effectiveness
of government services. And, through the use of advanced computing
power and analytics techniques, universities and federal laboratories
can drive new research initiatives that will significantly increase our
scientific knowledge-base.
There are also various challenges associated with Big Data that the
Committee will explore today. McKinsey has estimated that the US will
face a shortfall of 140,000 to 190,000 professionals with significant
technical depth in data analytics, and a further shortfall of an
additional 1.5 million managers and analysts who can work effectively
with big data analysis by 2018. Committee members will be interested to
learn how industry, academia, and government are addressing this
shortfall.
While the term Big Data is relatively new, public and private
organizations have been investing in computing power and data analytics
for a number of years. In March of last year, the Obama Administration
announced a ``Big Data Research and Development Initiative,'' including
$200 million in new funding across six different federal departments
and agencies. I am interested to learn how effectively these programs
are being coordinated across the different federal agencies to ensure
that taxpayer dollars are being leveraged effectively.
Finally, privacy and security are major concerns when private and
public organizations are collecting, analyzing, and disseminating
massive data sets. We have an excellent panel of witnesses ranging
across industry, academia and government. I'd like to extend my
appreciation to each of our witnesses for taking the time and effort to
appear before us today. We look forward to your testimony.
Chairman Bucshon. I will now yield to Mr. Lipinski for his
opening statement.
Mr. Lipinski. Thank you. I want to thank you, Chairman
Bucshon, and I want to thank Chairman Massie for holding this
hearing. I want to welcome and thank the witnesses for being
here.
Today's hearing gives us an opportunity to talk about the
new tools and analytics that are being developed for big data.
As Chairman Bucshon stated, big data can be thought of as large
volumes of complex and diverse types of data that change
rapidly with time.
In basic scientific research, in national security, as well
as in economic sectors ranging from energy to health care, big
data challenges are becoming fundamentally important.
Effectively dealing with big data can impact how we do business
and how we think about the world.
As a Member of the Research Subcommittee for several years,
I have watched as the amount and complexity of data has grown
by leaps and bounds. The field of astronomy is a great example.
When the Sloan Digital Sky Survey started work in 2000, its
telescope in New Mexico collected more data in a few weeks than
had been collected in the history of astronomy, and that
telescope will be surpassed when the Large Synoptic Survey
Telescope begins scientific operations in 2020. LSST will
photograph the entire sky every few days, producing data at a
rate almost 100 times greater than the Sloan Survey. But data
is useless without the means to store and analyze it in an
efficient manner.
The types of data are changing as well. Data has gone from
being mostly numbers entered into Excel spreadsheets to data
coming from sensors, cell phone cameras and millions of email
messages. In fact, it is estimated that over 85 percent of data
generated today are these kinds of unstructured data, data like
videos and emails. The change in the volume and variety of data
as well as how fast data is being produced and changed creates
almost limitless opportunities. For example, since
cybersecurity data is massive, varied, and changing quickly,
big data technologies have the potential to detect and prevent
cyber attacks before they happen. I know that organizations
like IBM are developing technologies to do just that.
Additionally, big data could be used to establish new business
models, create transparency, improve decision-making and reduce
inefficiencies within businesses and government.
But along with the opportunities, there are a number of
challenges. We need new tools and software packages to manage,
organize, and analyze all these different kinds of data.
Additionally, we will need an analytic workforce to ensure the
gains of big data. These challenges necessitate involvement
from government, academia and the private sector. That is why I
am happy to see all those sectors represented here today.
The government has played and will continue to play an
instrumental role in this area. For instance, the Networking
and Information Technology Research and Development program, or
NITRD, created an interagency big data group that is
coordinating Federal efforts in technologies, research,
competitions, and workforce development for big data. We had a
hearing on the NITRD program back in February, and I expect
that we will be able to take a broader look at many of the same
issues in today's hearing.
In some cases, agencies have teamed up to issue joint
solicitations. For example, NSF and NIH have a joint big data
grant program that awarded nearly $15 million of grants to
eight teams of researchers last year. These first awarded grants
went to projects focused on designing new tools for big data
and new data analytic approaches. We will be hearing more about
these and other interagency activities from Dr. Jahanian in his
testimony. We will also learn more about specific programs at
NSF, one of the leading agencies in Federal big data efforts on
both the analytics side and the computational resources side.
As I mentioned before, one of the areas being coordinated
through NITRD is workforce development for big data. Several
agencies, including NSF, have education activities to support a
new generation of big data researchers. As we will likely hear
from all of the witnesses, we face a looming shortage of
workers with the skills needed to analyze and manage large,
complex and high-velocity data sets. There is some overlap with
the broader STEM skills we so often speak about in this
committee, but there are also unique skills required to address
the big challenges of big data. We need to consider how to
build those skills into STEM curricula, especially at the
undergraduate and graduate levels. I look forward to hearing
from our witnesses about the current educational efforts and
what additional initiatives may be necessary.
And finally, since big data involves different types of
data that can be produced and transferred quickly, there are
concerns over privacy. We need to ensure that we strike the
right balance between exploring and implementing all of the
potential benefits of big data while also protecting
individuals' personal information.
I look forward to hearing the witnesses' testimony and our
discussion today, and I yield back the balance of my time.
[The prepared statement of Mr. Lipinski follows:]
Prepared Statement of Subcommittee on Research
Ranking Minority Member Daniel Lipinski
Thank you, Chairmen Bucshon and Massie for holding this hearing on
examining the next generation of computing and big data analytics. I
want to welcome and thank the witnesses for being here today.
Today's hearing gives us an opportunity to talk about the new tools
and analytics that are being developed for big data. Big data can be
thought of as large volumes of complex and diverse types of data that
are also high velocity--meaning they change rapidly with time.
As a member of the Research Subcommittee for several years now, I
have watched as the amount and complexity of data has grown by leaps
and bounds. The field of astronomy is a great example. When the Sloan
Digital Sky Survey started work in 2000, its telescope in New Mexico
collected more data in a few weeks than had been collected in the
history of astronomy. And that telescope will be surpassed when the
Large Synoptic Survey Telescope goes online in about 2020. LSST will
photograph the entire sky every few days. That's difficult for any of
us to wrap our heads around.
The types of data are changing as well. Data has gone from being
mostly numbers entered in Excel spreadsheets to data coming from
sensors, cellphone cameras, and millions of email messages. In fact, it
is estimated that over 85 percent of data generated today are these
kinds of unstructured data--data like videos or emails.
The change in the volume and variety of data as well as how fast
data is being produced and changed creates almost limitless
opportunities. For example, since cybersecurity data is massive,
varied, and changing quickly, big data technologies have the potential
to detect and prevent cyber attacks before they even happen. I know
that organizations like IBM are developing technologies to do just
that. Additionally, big data could be used to establish new business
models, create transparency, improve decision-making, and reduce
inefficiencies within businesses and government.
But along with the opportunities, there are a number of challenges.
We need new tools and software packages to manage, organize, and
analyze all these different kinds of data. Additionally, we will need
an analytic workforce to ensure the gains of big data. These challenges
necessitate involvement from government, academia, and the private
sector. That is why I am happy to see all those sectors represented
today.
The government has played and will continue to play an instrumental role
in this area. For instance, the Networking and Information Technology
Research and Development--or NITRD--program created an interagency big
data group that is coordinating federal efforts in technologies,
research, competitions, and workforce development for big data.
In some cases, agencies have teamed up to issue joint
solicitations. For example, NSF and NIH have a joint big data grant
program that awarded nearly $15 million of grants to eight teams of
researchers last year. These first awarded grants went to projects
focused on designing new tools for big data and new data analytic
approaches. We will hear more about these and other interagency
activities from Dr. Jahanian in his testimony. We will also learn more
about specific programs at NSF, one of the leading agencies in federal
big data efforts on both the analytics side and the computational
resources side.
As I mentioned before, one of the areas being coordinated through
NITRD is the workforce development needs for big data. Several
agencies, including NSF, have education activities to support a new
generation of big data researchers. As you will likely hear from all of
the witnesses, we face a looming shortage of workers with the skills
needed to analyze and manage large, complex, and high-velocity data
sets. There is some overlap with the broader STEM skills we often speak
of in this committee. But there are also some unique skills required to
address the challenges of big data. We need to consider how to build
those skills into STEM curricula, especially at the undergraduate and
graduate levels. I look forward to hearing from our witnesses about the
current educational efforts and what additional initiatives may be
necessary.
Finally, since big data involves different types of data that can
be produced and transferred quickly, there are concerns over privacy.
We need to ensure that we strike the right balance between exploring
and implementing all of the potential benefits of big data while also
protecting individuals' personal information.
I look forward to hearing the witnesses' testimonies and to our
discussion today.
Chairman Bucshon. Thank you, Mr. Lipinski. The Chair now
recognizes the Chairman of the Subcommittee on Technology, Mr.
Massie, for five minutes for his opening statement.
Mr. Massie. Thank you, Chairman.
Good morning. Today we are examining an issue that we hear
a lot about. ``Big data'' is a popular new term that can mean a
lot of different things. The scientific community, though, has
generated and used big data before there was the term ``big
data.'' In fact, in 1991 this Committee authored the High
Performance Computing Act, which organized the Federal agency
research, development, and training efforts in support of
advanced computing.
Individual researchers have always been faced with
difficult decisions about their data: what to keep, what to
toss, what to verify with additional experiments. And as our
computing power has increased, so has the luxury of storing
more data. Incorporating computer power to process more
scientific data is transforming laboratories across the
country.
At the same time, the ability to analyze large amounts of
data across multiple networked platforms is also transforming
the private sector. Through big data applications, businesses
have not only revealed previously hidden efficiency
improvements in their internal operations, but, more
importantly, also uncovered entirely new types of businesses
built around data that was previously not accessible due to its
size and complexity.
Today's hearing will examine the hype around big data. Is
the United States the most innovative Nation in big data? Is
our regulatory system creating any burdens on businesses? Could
public-private partnerships with the Federal agencies be
improved to allow for more data innovations?
I thank our witnesses for their participation today
and I look forward to hearing their testimony. Thank you. I
yield back.
[The prepared statement of Mr. Massie follows:]
Prepared Statement of Subcommittee on Technology
Chairman Thomas Massie
Good Morning. Today we are examining an issue that we hear a lot
about. ``Big Data'' is a popular new term that can mean a lot of
different things.
The scientific community has generated and used Big Data before
there was Big Data. In fact, in 1991 this Committee authored the High
Performance Computing Act, which organized the federal agency research,
development and training efforts in support of advanced computing.
Individual researchers have always been faced with difficult
decisions about their data: what to keep, what to toss, what to verify
with additional experiments. As our computing power has increased, so
has the luxury of storing more data. Today, managing this data allows
for better-informed experiments, more exact metrics, and perhaps
significantly longer doctoral theses. Incorporating computer power to
process more scientific data is transforming laboratories across the
country.
At the same time, the ability to analyze large amounts of data
across multiple networked platforms is also transforming the private
sector. Through Big Data applications, businesses have not only
revealed previously hidden efficiency improvements in their internal
operations, but also uncovered entirely new types of business built
around data that was previously not accessible due to its size and
complexity.
Today's hearing will examine the hype around Big Data. Is the
United States the most innovative nation in Big Data? Is our regulatory
system creating any burdens on businesses? Could public-private
partnerships with the federal agencies be improved to allow for more
data innovations?
I thank our witnesses for their participation today and look
forward to hearing their testimony.
Chairman Bucshon. Thank you, Mr. Massie. The Chair now
recognizes Ms. Wilson for five minutes for her opening
statement.
Ms. Wilson. First of all, I would like to thank both
Chairman Bucshon and Chairman Massie for holding this joint
hearing, and thank you to all of our witnesses for being here
today. Welcome.
This morning's hearing provides us with the opportunity to
discuss one of the newest buzzwords in Washington, and you know
we have many buzzwords here. This one: big data. This buzzword
is not an exaggeration. A computer that used to take up the
space of this entire room now fits in the palm of your hand. It
is remarkable.
Just as computers have gotten immensely smaller, they have
also gotten immensely more powerful. Instead of talking about
megabytes, we are now talking about petabytes and zettabytes--
quadrillions and sextillions of units of information. It
boggles the mind. Collecting and storing this huge volume of
data would have been impossible just a few years ago.
I am looking forward to your testimony and learning more
about the benefits of big data to society. As I understand it,
big data has the potential to improve nearly all sectors of
society. The National Cancer Institute is funding a prototype
in biological big data that could lead to new advances in
cancer treatment. Companies and agencies are using big data to
run controlled experiments that improve decision-making.
Scientists at Florida International University in my district
are using big data to advance understanding of topics including
cybersecurity, social networks and cloud computing.
But there are challenges. In order to reap all the benefits
of complex and broadly available data, we need new technologies
and software. We also need a workforce, a workforce with the
skills necessary to analyze data of such great volume and
complexity. A recent study estimates that the United States is
in need of 190,000 additional data scientists.
In thinking about this hearing on big data, I couldn't help
but think about the tragic events last week in Boston. The
marathon bombings may be one of the most photographed attacks
in history. The Massachusetts State Police asked the public to
share the photos and videos taken on that awful day. Now all of
this digital information has been and is being used by the
Boston Police Department and the FBI in their investigation. It
appears that this data has been instrumental in helping to
identify the individuals who were involved.
Examples like this one demonstrate how important it is that
we develop and attain the tools and the skills people need to
analyze tremendous amounts of complex data. Big data can not
only lead to amazing scientific discoveries; it can also save
lives.
As we learn more about these opportunities and challenges
today, I hope our witnesses will offer recommendations on how
the Federal Government can help create the new tools, software
and workforce needed to realize the full potential of big data.
Chairman Bucshon, Chairman Massie, thank you again for
holding this hearing, and I yield back the balance of my time.
[The prepared statement of Ms. Wilson follows:]
Prepared Statement of Subcommittee on Technology
Ranking Minority Member Frederica S. Wilson
I'd like to thank both Chairman Bucshon and Chairman Massie for
holding this joint hearing. And thank you to all of our witnesses for
being here today.
This morning's hearing provides us with the opportunity to discuss
one of the newest buzz-words in Washington and around the world--``big
data.''
This buzz-word is not an exaggeration: A computer that used to take
up the space of this entire room now fits in the palm of your hand. It
is remarkable.
Just as computers have gotten immensely smaller, they have also
gotten immensely more powerful. Instead of talking about megabytes, we
are now talking about petabytes and zettabytes--quadrillions and
sextillions of units of information. It boggles the mind. Collecting
and storing this huge volume of data would have been impossible just a
few years ago.
I'm looking forward to the testimony of today's witnesses and
learning more about the benefits of ``big data'' to society.
As I understand it, big data has the potential to improve nearly
all sectors of society. The National Cancer Institute is funding a
prototype in biological ``big data'' that could lead to new advances in
cancer treatment. Companies and agencies are using ``big data'' to run
controlled experiments that improve decision-making. Scientists at
Florida International University--in my district--are using ``big
data'' to advance understanding of topics including cybersecurity,
social networks, and cloud computing.
But there are challenges. In order to reap all the benefits of
complex and broadly available data, we need new technologies and
software. We also need a workforce with the skills necessary to analyze
data of such great volume and complexity. A recent study estimates that
the United States is in need of 190,000 additional data scientists.
In thinking about this hearing on ``big data,'' I couldn't help but
think about the tragic events last week in Boston. The marathon
bombings may be one of the most photographed attacks in history. The
Massachusetts State Police asked the public to share the photos and
videos taken on that awful day. Now, all of this digital information
has been and is being used by the Boston Police Department and the FBI
in their investigation. It appears that this data has been instrumental
in helping to identify the individuals who were involved.
Examples like this one demonstrate how important it is that we
develop and attain the tools and the skilled people needed to analyze
tremendous amounts of complex data. Big data can not only lead to
amazing scientific discoveries--it can also save lives.
As we learn more about these opportunities and challenges today, I
hope our witnesses will offer recommendations on how the federal
government can help create the new tools, software, and workforce
needed to realize the full potential of ``big data.''
Chairman Bucshon. Thank you, Ms. Wilson.
If there are Members who wish to submit additional opening
statements, your statements will be added to the record at this
point.
It is now time to introduce our panel of witnesses. Our
first witness is Dr. David McQueeney, the Vice President of
Technical Strategy and Worldwide Operations at IBM Research. In
this capacity, he is responsible for setting the direction of
IBM's overall research strategy across 12 worldwide labs and
leading the global operations and information systems teams.
Dr. McQueeney's background covers a wide range of disciplines,
spending about half of his career as a researcher and research
executive and half in IBM's customer-focused areas. He holds an
M.S. and Ph.D. in solid-state physics from Cornell University
and an A.B. in physics from Dartmouth College. Welcome.
Our second witness is Dr. Michael Rappa, the Executive
Director of the Institute for Advanced Analytics and Faculty
Member of the Department of Computer Science at North Carolina
State University. Dr. Rappa has 25 years of experience as a
professor working across academic disciplines at the
intersection of management and computing. He began his teaching
career at the University of Minnesota where he earned his
doctorate degree. Welcome.
And our final witness is Dr. Farnam Jahanian, the Assistant
Director for the Computer and Information Science and
Engineering Directorate at the National Science Foundation and
a frequent visitor to our Subcommittee. He oversees the CISE's
mission to uphold the Nation's leadership in computer and
information science and engineering. He also serves as Co-chair
of the Networking and Information Technology Research and
Development, or NITRD, Subcommittee of the National Science and
Technology Council Committee on Technology, providing overall
coordination for the activities of 14 government agencies. Dr.
Jahanian holds a master's degree and a Ph.D. in computer
science from the University of Texas at Austin. Welcome again.
As our witnesses should know, spoken testimony is limited
to five minutes each after which Members of the Committee have
five minutes each to ask questions. Your written testimony will
be included in the record of the hearing.
I now recognize our first witness, Dr. McQueeney, for five
minutes for his testimony.
TESTIMONY OF DR. DAVID MCQUEENEY, VICE PRESIDENT,
TECHNICAL STRATEGY AND WORLDWIDE OPERATIONS,
IBM RESEARCH
Dr. McQueeney. Good morning, Mr. Chairman, Ranking Members,
Members of the Subcommittees. Thank you for the opportunity to
testify today. My written testimony covers next-generation
computing, big data and analytics, workforce development and
the role of government. In my five minutes, I will focus on
areas where I can offer critical insights from my personal
experience.
Computing today is undergoing profound change. We are
moving from computing based on processors that are programmed
to follow a predesigned sequence of instructions to cognitive
computing systems based on massive amounts of data evolving
into systems that can learn. This new approach will require new
strategies in hardware and in software and improved skills to
maintain U.S. leadership. Cognitive systems will digest and
exploit massive data volumes. Tools such as mobile phones,
videos and social networks generate as much data in two days in
2013 as in all of human history prior to 2003.
Advanced analytics can be thought of as tools for infusing
all this data to make decisions on facts rather than intuition.
The challenge is to transform latent data into actionable
information to decide what to do next. For example, the Memphis
Police Department is using data analytics to map crime hotspots
and find patterns. As a result, they have been able to reduce
crime by 30 percent with no increase in overall police
manpower.
To run advanced analytics, it is essential to have the most
powerful computing systems. However, current supercomputing
systems are reaching performance levels that will stagnate
without significant innovation. We must move to the next
generation of large-scale computing called exascale computing,
a thousand times faster than today's petascale machines.
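[Editor's illustration, not part of the testimony: the ``thousand times'' figure follows from the standard SI prefixes, where peta denotes 10^15 and exa denotes 10^18. The short Python sketch below works through that arithmetic. The names PREFIX_EXPONENT and scale_ratio are hypothetical and chosen only for this example.]

```python
# SI prefixes expressed as powers of ten (standard definitions).
PREFIX_EXPONENT = {
    "mega": 6,
    "giga": 9,
    "tera": 12,
    "peta": 15,
    "exa": 18,
    "zetta": 21,
}


def scale_ratio(larger: str, smaller: str) -> int:
    """Return how many times bigger one SI prefix is than another."""
    return 10 ** (PREFIX_EXPONENT[larger] - PREFIX_EXPONENT[smaller])


# Exascale versus petascale: 10^18 / 10^15.
exa_over_peta = scale_ratio("exa", "peta")
print(exa_over_peta)
```

[The same table covers the storage units mentioned earlier in the hearing: scale_ratio("zetta", "mega") gives 10^15, the gap between a zettabyte and a megabyte.]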
The United States needs to invest now in the research and
development for exascale systems to maintain strategic and
economic leadership. Government-funded research on domain
skills, especially at our national laboratories, should target
systems for modeling, simulation, and analytics on big data.
Before 2005, the United States had a clear lead in the
global supercomputing race. Today, we are still ahead but the
rest of the world is catching up rapidly. To stay ahead will
require new skills and knowledge and new types of decision-
making. Nearly two million IT jobs will be created by 2015 in
the United States to support big data, and the job candidates
with analytic skills will get these jobs.
Industry is developing many collaborative skills programs,
as enumerated in my testimony. I highlight our announcement
today with Rensselaer Polytechnic Institute to offer a graduate
degree program in the fall of 2013, the Master of Science in
Business Analytics.
Privacy must be considered in the design of big data
systems. Big data does not require the sacrifice of personal
privacy. When personal information is used, design-in processes
such as IBM's Privacy By Design can protect privacy. When
people understand how information is used, have the
ability to set data usage policies, and enjoy the benefits of the
analysis, they tend not to have privacy concerns.
The government's role should focus on research and skills.
First, Federal research investment in high-performance
computing is critical to big data. Industry needs university-
based exploratory research into numerous areas including system
design, flexible software defined environments, and IT
infrastructure.
Second, IBM strongly supports the reauthorization of the
Department of Energy High End Computing Revitalization Act of
2004 to be offered by Representative Hultgren. This bill will
improve high-end computing R&D at the DOE and strengthen
government-industry partnerships for exascale platforms. IBM
has a long history of successful partnerships with DOE. This
partnership established computational simulation as an
essential tool for scientific inquiry and led to world
leadership in the United States in high-performance computing.
The challenge ahead is to continue this growth. Past Federal
investments in HPC-related research, particularly at DOE's
national labs, have underpinned mission-critical supercomputers
at DOD, NASA, NOAA, and in the intelligence agencies.
Third, the professional science masters program supported
by NSF is particularly relevant as it provides advanced
training in science or mathematics and develops workplace
skills valued by employers. Finally, Congress should
reauthorize the Carl D. Perkins Act and the Federal work-study
program and restructure them to align workforce needs and big
data.
In conclusion, there exists today a tremendous abundance of
data about our world. New cognitive computing capabilities will
help determine which countries and businesses will thrive. The
United States should support advanced computing and build its
workforce to seize the future.
Thank you, and I welcome your questions.
[The prepared statement of Dr. McQueeney follows:]
[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
Chairman Bucshon. Thank you, Dr. McQueeney.
I now recognize Dr. Rappa for five minutes for his
testimony.
TESTIMONY OF DR. MICHAEL RAPPA, DIRECTOR,
INSTITUTE FOR ADVANCED ANALYTICS,
DISTINGUISHED UNIVERSITY PROFESSOR,
NORTH CAROLINA STATE UNIVERSITY
Dr. Rappa. Good morning, Chairman Bucshon, Chairman Massie,
Ranking Member Lipinski, Ranking Member Wilson and other
Members of the Subcommittee. I appreciate the opportunity to be
here this morning to speak with you about data analytics and
the role institutions of higher learning can play in advancing
the field.
I am giving this morning's testimony on my own behalf, as
someone who has been a professor and director of a research and
educational institute for the past 25 years.
I think it is important to start with the fact that the
world is changing around data very rapidly and our ability to
productively use it becomes a very central part of what we do
as a society today, as has been heard already. A generation
ago, data was scarce, expensive, time consuming to collect and
difficult to analyze. Today, data is everywhere.
Advances in computer technology and powerful analytic tools
make it possible not only to collect vast quantities of data
but also analyze and draw insights from data to solve pressing
problems from increasing operational efficiency to combating
fraud, to better health care, to protecting national security.
Data is everywhere. The question is, how well are we prepared
to use it? We have the data, the technology, the methods and
tools, all of which continue to advance. The national
challenge, in my view, going forward will be our ability to
educate a data-savvy workforce that has the analytical skills
to put data into action. Estimates of the talent gap as we have
heard are large and growing.
This is a dire but solvable problem. As we have shown at NC
State, working closely with employers and focusing on their
needs, we can produce the kind of talent that is so desperately
needed today. We do it quickly in just 10 months with a
domestic student population ranging from their early 20s to
their late 50s, many of whom are returning to school. We have
done this now for six years economically with consistently high
student outcomes using a sustainable and scalable business
model based on self-financed tuition.
What it comes down to is creative innovation, how we
organize graduate education, allowing us to engage with
employers more productively to yield high-quality results in
the skills and readiness of our graduates.
I encourage the Committee to focus its attention on
workforce needs, to encourage the government to seek out
innovation in higher education and to promote new and novel
learning models. This is a solvable problem. With the proper
incentives, focused resources, open collaboration with
industry, we can produce the analytics professionals needed to
extract value from big data and to move the economy forward. As
I said, we have done this ourselves now for six straight years to
great effect. We will graduate a class in a matter of another
week, 80 students in the Master of Science in Analytics
Program, with already 95 percent of them placed in jobs. They
are literally the most sought after and highest-paid graduates
of the university.
So we can do this. It is a solvable problem. Thank you
again for your time. I will be glad to answer any questions.
[The prepared statement of Dr. Rappa follows:]
[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
Chairman Bucshon. Thank you for your testimony.
I now recognize our final witness, Dr. Jahanian, for five
minutes for his testimony.
TESTIMONY OF DR. FARNAM JAHANIAN,
ASSISTANT DIRECTOR FOR THE COMPUTER AND
INFORMATION SCIENCE AND ENGINEERING (CISE)
DIRECTORATE, NATIONAL SCIENCE FOUNDATION
Dr. Jahanian. Good morning, Chairman Massie, Chairman
Bucshon, Ranking Members Wilson and Lipinski, and Members of
the Subcommittee. It is my pleasure to be back here to discuss
the next generation of computing and big data analytics.
Today we live in an era of data and information enabled by
advanced technologies that surround us. Data is generated by
modern experimental methods, scientific instruments such as
telescopes and particle accelerators, large-scale simulators,
Internet transactions, email, video images, clickstreams, and
widespread deployment of sensors everywhere. Approximately 90
percent of the data in the world today were created in the last
two years alone. However, when we talk about big data, it is
important to emphasize not only the enormous volume of data
being generated but also the velocity, heterogeneity and
complexity of data that now confronts us.
Why is big data important? Several others have alluded to
this already. Data represents a transformative new currency.
Big data is increasingly important to all facets of our
Nation's discovery and innovation ecosystem. First, insights
and more accurate predictions from large and complex
collections of data are creating opportunities in new markets,
driving the creation of IT products and services and boosting
the productivity of businesses. Second, advances in our ability
to store, integrate, and extract meaning and information from
data are accelerating the pace of discovery in almost every
science and engineering discipline. Third, big data has the
potential to solve many of the Nation's most pressing
challenges from health care and education to cybersecurity and
public safety, yielding enormous societal benefits and ensuring
sustained U.S. competitiveness.
Let me share with you just a few examples of the promise of
big data. These are all grounded in research that is funded by
the Federal Government or by the private sector, the work that
is done in the private sector. By integrating biomedical,
clinical and scientific data, we can predict the onset of
diseases and identify unwanted drug interactions. By coupling
roadway sensors, traffic cameras, individual GPS devices, we
can reduce traffic congestion and generate significant savings
in time and fuel. By accurately predicting natural disasters
such as hurricanes and tornadoes, we can employ lifesaving and
preventative measures that mitigate their potential impact. By
correlating disparate data streams through text mining, image
analysis and face recognition, we can enhance public safety and
public security. By integrating emerging technologies such as
MOOCs and inverted classrooms with knowledge from research
about how people learn, we can transform formal and informal
education.
What does this mean for scientific discovery? Data-driven
discovery, also called the fourth paradigm, is revolutionizing
scientific exploration and engineering innovations. It enables
extraction of new knowledge, provides novel approaches to
driving discovery and decision-making, yields increasingly
accurate predictions and provides deeper understanding of
causal relationships based on advanced data analysis.
What is government doing to ensure we harness this
potential? As was mentioned already, in 2011 the U.S. Networking
and Information Technology Research and Development Program,
also called NITRD, formed a big data senior steering group to
identify, initiate and coordinate big data research and
development activities across the government to ensure that
Federal agencies, the scientific research enterprise, and the
public maximally benefit from data-driven discovery. In March
2012, the National Big Data R&D Initiative was launched,
directing the senior steering group's attention to the tools,
technologies and human capital needed to move from data to
knowledge to action. We see exciting new partnership
opportunities with the private sector, state and local
governments, academia and nonprofits.
At NSF, we have identified four major investment areas that
address current challenges and promise to serve as the
foundation of a comprehensive long-term agenda: first, investment
in foundational research to advance big data techniques and
technologies; second, support for building new
interdisciplinary research communities; third, investment in
education and workforce development; and finally, development
and deployment of cyber infrastructure to capture, manage,
analyze and share digital data.
I should add that NSF's investment in cyber infrastructure
includes advanced computational resources that support data-
enabled science. In particular, the newly dedicated Blue
Waters, Stampede and Yellowstone supercomputers will expand our
Nation's computational capabilities significantly.
In summary, big data represents enormous opportunities for
our Nation. Investments in big data research and education will
advance the frontier of knowledge, further fostering
innovation, creating new economic opportunities, and yielding
new approaches to addressing national priorities.
Thank you again for this opportunity. I would be happy to
answer any questions.
[The prepared statement of Dr. Jahanian follows:]
[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
Chairman Bucshon. Thank you for your testimony. I would
like to thank all the witnesses for their testimony. I am
reminding Members that Committee rules limit questioning to
five minutes, and the Chair at this point will recognize
himself for five minutes to start the questions.
First, Dr. Jahanian, the Administration announced their Big
Data Research and Development Initiative in March 2012
including $200 million in new commitments for big data research
initiatives. However, the National Science Foundation,
Department of Defense, Department of Energy, and other agencies
have had significant research programs in data analytics that
predated the initiative. How has the Administration's
initiative changed the ways these agency research programs are
coordinated and are we effectively managing and leveraging our
research investments across agencies?
Dr. Jahanian. Thank you for your question. You are
absolutely right that it is not that suddenly last March we
woke up and said boy, data is really important, we need to do
something about it. There has been significant investment by
the Federal sector and private sector in areas having to do
with data. The challenges we face are many--stewardship of
digital data and software, for example. Many data sets, as was
mentioned, are poorly organized or unstructured. Many
data sets are heterogeneous. The utility of data is also
limited by our ability to interpret them. Many data are being
collected at a scale that we can't even store them, let alone
analyze them. Also, large and linked data sets may be exploited
to identify individuals and so there are also the privacy
issues. So there are enormous challenges that we face.
As you alluded to, on March 29, 2012, OSTP in concert with
a number of Federal agencies launched the national Big Data
Research Initiative. It expands the scope of our activities in
several directions, for example, state-of-the-art core
technologies that we need to collect, store, preserve, manage
and analyze data, harnessing these technologies to accelerate
the pace of discovery, supporting responsible stewardship, for
example, and sustainable business models for big data.
There are a number of cross-coordination efforts taking
place under NITRD. Let me start with NSF. All NSF directorates,
for example, are participating in this. A multidisciplinary
panel of experts is making a recommendation on funding of
this. Furthermore, big data is being coordinated through a
senior steering group that reports to the assistant directors
at NSF for all the coordination because it involves every
science and engineering discipline.
As far as the Federal Government is concerned, the Big Data
R&D Initiative is coordinated through the NITRD Subcommittee.
As you know, I Chair the Subcommittee. There is a senior
steering group that regularly meets to coordinate the
activities on many of the fronts that I alluded to. There are
also enormous opportunities not only in terms of joint
solicitations but there are a number of workshops that we are
holding jointly with other agencies including NIH, NIST, DOE,
DOD to advance the frontiers of knowledge and exploration in
big data.
I should also mention that when it comes to this
initiative, we can't forget that the private sector plays a
significant role. When we think about innovation and discovery
ecosystems, not only are we talking about universities, we are
talking about scientists and engineers, you know, a rich,
talented labor force, investments in research and education,
and of course, a vibrant private sector. So there are a number
of programs that we have at NSF that attempt to connect the
dots when it comes to transfer of knowledge.
Chairman Bucshon. Thank you. I am glad to hear there is
quite a bit of coordination at the Federal level because I
think all of us are concerned about that, and again, investing
the taxpayer dollar wisely.
Dr. Rappa, I also serve on the Education and Workforce
Committee, and I have got children age 9 through 20, four of
them, and I have a really strong interest in how we get young
people interested in different fields of study, and obviously
we have a tremendous challenge not only with this area but many
others, and do you think that--what are your ideas on how we
engage young people in understanding what opportunities there
are in this area and what the jobs of the future might hold? I
mean, how do we do that? Because, you know, when you go to a
high-school class, and I talk to a lot of high-school classes,
people say, you know, not many people come up when you ask them
what they want to be, you know, they want to analyze big data.
So how do you do that? What is your recommendation?
Dr. Rappa. Well, thank you very much for your question, and
I understand exactly what you are saying, and I think that
things are changing. You know, I think it is exactly true that
your average 8-year-old doesn't say they want to grow up, for
example, to be a statistician. It is not common, unless they
are really interested in sports. Then you see a sort of nexus
there because of the relationship. But I think what is changing
is that it is really about producing education, in my case, at
the graduate level, reaching further into the pipeline down
into undergraduate education and even touching upon high school
where people begin--where students begin to understand how data
is really used in action. So it is really about creating, not
just sort of creating knowledge or understanding but also
applying that knowledge. And when our students--our whole
education is driven around the application of that knowledge,
and so students really understand, and increasingly
undergraduates understand that this kind of graduate education
is going to lead them to a very interesting, compelling
professional life.
Chairman Bucshon. Well, thank you, because I think that we
do--you know, we do need to have this type of information
gravitate down, even to middle-school kids to get them
interested, and there is a program in Indianapolis called
Project Lead the Way, which I know very well, that is beginning to
do that at the high-school level, and it is showing some
success.
But my time is expired, so I would love to talk more about
that but at this point I am going to yield to Ms. Wilson for
five minutes for her questions.
Ms. Wilson. Thank you, Mr. Chair.
Along those lines, can you tell me either one of you what
skills are necessary for the big data workforce? I heard you
say something about an analytical something. And also as you
are speaking, I would like to hear from you what role can
community colleges play in preparing the next-generation
workforce for big data.
Dr. Rappa. Thank you very much for your question. I would
like to try my hand at that. So what is sort of interesting and
novel about what we have done around the education, we really
started from scratch in building an entirely new graduate degree
program, and we really wanted to address this question of what
skills were needed, and we focused ourselves really looking at
the employer as the customer in a sense, the person, the
individuals who buy our product and the students and really
tried to understand the skills that they need, and really where
that brings you is that there are these technical skills, which
are important in programming, in math and statistics, but
employers really want much more than that. They want
individuals who can work well in teams, who can communicate
these insights to decision makers, who can actually use the
tools and apply the knowledge in an organizational context, and
so we have structured the whole education to build a very
balanced set of skills as opposed to what I think is really the
conventional approach in graduate education and to some extent
undergraduate education to focus on the technical skills almost
exclusively. And so really what we need to do is sort of
approach the whole student. Now, I think community colleges can
play a very important role because you can really begin to
channel pipelines where students can go and get the
prerequisite knowledge that they need, the early levels of math
and statistics, before they go on to graduate education. Thank
you.
Dr. McQueeney. I would just like to comment that a lot of
the focus in the past has been on the graduate level of
education, as Dr. Rappa just talked about, and while we
continue to have a strong need for Ph.D.'s in computer science
and electrical engineering and mathematics, the biggest skill gap
that we see is at the master's level, quite frankly, of people
who may not have the mathematical skills to create an entire
new type of analysis of data but who have more than basic IT
skills who actually can understand the implications of using
different analytical techniques given a problem, given a data
set with certain statistical properties, what would be the
appropriate analytical technique to use, and when you apply
that technique, how could you be sure that the results would be
reliable and proper, and so a lot of our focus has been on
creating an intermediate level of skill that has the basic
understanding of how to use these tools even if it would fall
on someone with more of a Ph.D. level of training to create new
analytical approaches.
Dr. Jahanian. Representative Wilson, I want to echo
something that has been said. If you think about big data, let
us just step back. There are three related problems that go
beyond big data. They include all of our IT workforce, computer
science, computational science and so on. These problems have
to do with underproduction, which everybody recognizes,
underrepresentation and then pipeline issues. Chairman Bucshon
already alluded to this, that we need to worry about our high
schools, we need to worry about the pipeline. I have three
kids, and I know where we lose our kids, it is not in masters
or Ph.D., we lose the interest of our kids in high schools and
middle schools, so that has to be fixed, and there are a number
of programs that we have initiated, pilot programs that try to
address that issue.
Let me share with you one anecdotal sort of evidence that
provides data on this. Annualized Bureau of Labor Statistics
data predicts that each year we need about 140,000 job
openings. We will have 140,000 job openings in computing and
broadly speaking IT-related jobs but we are only producing
about 100,000 qualified individuals including masters, Ph.D.,
undergraduate and community colleges. In fact, many of these
jobs would be available to individuals who have two-year or
four-year degrees.
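[Editor's illustration, not part of the testimony: the two figures cited above, roughly 140,000 computing and IT job openings per year against roughly 100,000 qualified individuals produced per year, imply an annual shortfall that can be worked out directly. The Python sketch below does that arithmetic. The variable names are hypothetical, and the inputs are the rounded numbers quoted in the testimony, not independent data.]

```python
# Rounded annual figures as quoted in the testimony.
openings_per_year = 140_000
qualified_per_year = 100_000

# Openings that go unfilled each year if both figures hold.
shortfall = openings_per_year - qualified_per_year

# Share of openings that the current pipeline can fill.
fill_rate = qualified_per_year / openings_per_year

print(f"Annual shortfall: {shortfall:,}")
print(f"Share of openings filled: {fill_rate:.1%}")
```

[On these figures, about 40,000 openings a year would go unfilled, meaning the pipeline covers a little over 70 percent of demand.]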
Another data point that I want to share with you is that 62
percent of all newly created STEM job openings between 2010 and
2020 will be in computing and IT. Let us not forget that. And
that includes data, that includes computational skills and many
of the other skills that the other witnesses alluded to. Thank
you.
Ms. Wilson. Just in my 16--oh, 10, 9, 8--what would you
suggest that we begin to--how do we begin to get children
interested in these sort of skills? I know every little child
has an iPad. They can work these computers better than adults.
What do you think we can do to stimulate that all the way from
K-12 and into the community colleges so we will have more IT
graduates? Do you suggest we buy each one--we outfit classrooms
with iPads, or what do you think?
Dr. McQueeney. I think there is an intrinsic curiosity in
younger folks about a lot of the tools they use to communicate
with each other that have tremendously greater scalability than
the tools that I use to communicate with my friends.
Ms. Wilson. Right.
Dr. McQueeney. So the essence of what is a large
community's opinion on a topic of interest could involve the
opinions of thousands or millions of people and so I think a
lot of the young folks I talk to when I visit K-12 programs or,
you know, in programs like eWeek, they have an intrinsic sense
not only of the device and the technology but they have a sense
of the reach of that device and technology which is the
beginning of an appreciation of really what we are talking
about with big data, that there are trends that they can reach
with that device, and I think that fires their imagination in a
very powerful way.
Chairman Bucshon. Thank you. I will now recognize Mr.
Massie, Chairman Massie, for his questioning.
Mr. Massie. Thank you, Chairman.
So one of the questions that I have as we deal with the
interface between government and private industry here is, are
you aware of any government data sets that we need to get more
into the public domain for usage? For instance, I think we have
done a pretty good job about getting some of the mapping stuff
out there but some of that map information is old, goes back to
the 1940s and 1950s, and I know the government has been paying
for LIDAR mapping, which is a high-resolution terrain mapping,
and I am kind of concerned that that is not getting out there.
Are you aware of that, and are there any other data sets that
would be useful to the public that the public has paid for that
we might want to work on getting out to the public?
Dr. McQueeney. I think the government has done an excellent
job and had many initiatives that were very focused on getting
that valuable data out so it could be used. You mentioned
LIDAR. I know that one of the uses that is very promising for
LIDAR is to do something like an inventory of the forests in
the country, to actually be able to conduct a definitive
inventory. Right now, the agencies that are responsible for
that use a statistical sampling technique but in a world where
you can take LIDAR images and process that enormous data
volume, you are able to move then from a statistical sampling
basis, which is all we could do before, to a more definitive
approach to get a very, very good picture of one of the more
valuable natural resources that needs tremendous amounts of
stewardship. So I think that is an example of a data set that
could be extremely valuable. But I think in general, the
government is very well and properly focused on getting those
valuable data sources out. Weather would be another--basic
weather data would be another good example that can be built on
to add extra value.
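[Editor's illustration, not part of the testimony: the contrast drawn above between a statistical sampling estimate and a definitive inventory can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the plot counts are synthetic random numbers, not forest or LIDAR data, and the function names census_total and sampled_total are hypothetical.]

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data (assumption): tree counts on 10,000 equal-sized plots.
plots = [rng.randint(50, 150) for _ in range(10_000)]


def census_total(counts):
    """Definitive inventory: count every plot."""
    return sum(counts)


def sampled_total(counts, sample_size, rng):
    """Sampling estimate: survey a random subset and scale up its mean."""
    sample = rng.sample(counts, sample_size)
    return len(counts) * sum(sample) / sample_size


exact = census_total(plots)
estimate = sampled_total(plots, 500, rng)
relative_error = abs(estimate - exact) / exact

print(f"census={exact} estimate={estimate:.0f} error={relative_error:.2%}")
```

[The sampled figure lands close to the census total but carries an error that depends on which plots happen to be drawn. A full count removes that sampling error, at the cost of processing every plot.]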
Mr. Massie. Are the other witnesses aware of any data sets
that we need to promote more?
Dr. Jahanian. I want to highlight a couple of things. I am
sure you are aware of data.gov, which is a Web site that makes
a lot of government data sets available, and the goal here is
to increase public access to high-value machine readable data
sets that are generated by the government. Hopefully it will
create new economic values. There are also a number of
activities in encouraging the private sector, entrepreneurs to
develop applications on top of that data. It is not just making
the data available but also making the data valuable so there
are a number of essential activities related to that.
There was a recent Wall Street Journal article actually
that highlighted at least a dozen different kinds of government
data sets that have been made available from labor and health
violations to flu incidents, energy prices, offshore activities,
solar information, and so on and so on that are interesting.
From the National Science Foundation's point of view, I should
mention that as you may know, we have a number of large
facilities--LSST was mentioned, NEON, which is another facility
that collects a lot of data, will be collecting a lot of data.
The science and engineering community needs that data, and many
Federal agencies are working very hard to make that data
available. There are a number of issues having to do with open
access that go beyond the scope of this question.
Mr. Massie. Let me ask a follow-up question to that. So big
data like any other data could be misused, altered, hacked,
illegally accessed, and sometimes it may just be an honest
mistake. We share data that we probably shouldn't have, for
instance, where some farm data got out there, and it could
really compromise our food safety if people know where all the
food sources are. How do we balance the desire for privacy,
actually the constitutional right to privacy, with sharing all
of this data now that everybody is under a microscope?
Dr. Rappa. I thank you for your question, and I would like
to sort of just turn it a little bit because we do work--each
year we work with about 16, 17 organizations that share data
under a confidentiality agreement including three government
agencies in which we put teams of students working on very
complex analytics projects, and so while I applaud, and I think
it is very important and I do think the government is doing a
good job at sharing data openly, it is a very important thing
to do, I think there is also an opportunity to engage the
academic community in other ways to help understand that data,
which might mitigate some of these issues around the privacy
element.
Mr. Massie. Dr. McQueeney?
Dr. McQueeney. Yes, that is an excellent question. Thank
you for that. One of the things that we can do is to get data
about the data. We call it metadata. So we analyze the data and
we don't just look at what information we can get from the data
but we describe the data perhaps in terms of its sensitivity--
is this more or less sensitive from a point of view of privacy
or security or secrecy--and we can then tag those data sets
with metadata that describes the implications of using that
data and then we can build into the systems that handle the
data policies that look not only at the data but the metadata
that describes what are the contents and what are the
implications of sharing and combining that data and so we can
actually build into the foundation of big data systems the
ability to interpret policies that we have set in a very
conscious and clear-eyed way and as they process the data they
can be respectful of that metadata. The medical community has
actually done a lot of very good work around patient
confidentiality while still getting very good pattern analysis
of different kinds of outcomes.
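    [Illustrative sketch, not part of the testimony. One way to
picture the approach Dr. McQueeney describes is to attach a small
metadata record to each data set and have the system consult a
policy before sharing or combining data. The sensitivity levels,
the field names and the escalation rule below are hypothetical
assumptions chosen for illustration; they do not describe any IBM
or government system.]

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels; a real policy would define its own."""
    PUBLIC = 0
    RESTRICTED = 1
    CONFIDENTIAL = 2


@dataclass(frozen=True)
class DatasetTag:
    """Metadata describing a data set, kept separate from the data itself."""
    name: str
    sensitivity: Sensitivity
    has_personal_data: bool


def may_share(tag: DatasetTag, clearance: Sensitivity) -> bool:
    """A data set may be released only to a clearance at or above its level."""
    return tag.sensitivity <= clearance


def combined_sensitivity(a: DatasetTag, b: DatasetTag) -> Sensitivity:
    """Sensitivity of the result of joining two data sets.

    Assumed rule: the result is at least as sensitive as the more sensitive
    input, and joining two personal data sets raises it one level (capped).
    """
    level = max(a.sensitivity, b.sensitivity)
    if a.has_personal_data and b.has_personal_data:
        level = min(level + 1, Sensitivity.CONFIDENTIAL)
    return Sensitivity(level)


# Made-up example tags.
weather = DatasetTag("weather", Sensitivity.PUBLIC, False)
outcomes = DatasetTag("patient_outcomes", Sensitivity.RESTRICTED, True)
addresses = DatasetTag("home_addresses", Sensitivity.RESTRICTED, True)

print(may_share(weather, Sensitivity.PUBLIC))
print(may_share(outcomes, Sensitivity.PUBLIC))
print(combined_sensitivity(outcomes, addresses).name)
```

    [The point of the sketch is that the policy reads only the
tags, so the implications of combining two data sets can be
checked before any of the underlying records are touched.]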
Mr. Massie. Thank you very much. My time expired. I
appreciate your answer and concern for that question, Mr.
Chairman.
Mr. Bucshon. Thank you, Mr. Massie. I now recognize Dr.
Bera for five minutes for his questions.
Mr. Bera. Thank you, Mr. Chairman, and thank you for the
series of hearings that we have had on the Subcommittee. It has
been great.
You know, big data is incredibly important and how we
manage data and with the rapidity of how the world is changing.
I mean, when I think back to being a high-school student, and
for me it was going and looking at the index cards, walking
down and looking in the encyclopedia. Now, when my daughter,
you know, she has vast access, or when I do rounds in the
hospital, we would have to race down to the library to get
information but now before we are even finished presenting, the
medical students or the residents can just look at the latest
data on, you know, a device like this and get access to the
most accurate and timely information. So it is incredibly
important that we make these investments to not only manage the
data, to sort that data and then to make sure it is accessible.
It is a critical priority that we have that workforce both at
the professional level but then also at the management level
and I think the number that I read was we need about 1.5
million managers. So there is a huge need but also a huge
opportunity.
When I think back to the talent that has been impacted in
the last four years in the recession, you know, there are a
large number of extremely intelligent and talented individuals
in their 30s and 40s who have been hit hard. These are folks
like myself that were trained for a 20th-century workforce but
now we find ourselves in a 21st-century economy.
Dr. Rappa, are there some best practices--and these aren't
individuals that need to get a graduate degree, you know, they
are talented individuals--where we could take them and quickly
train them for this new economy? Are there examples?
Dr. Rappa. Right. So we do offer it as a graduate degree
but we do this in 10 months, and indeed, a good, fairly
substantial, larger portion of our population are people who
are returning from--or coming from the workforce to go through
this and some of them are in exactly the position that you say.
They were transitioning, their companies were faltering. And so
the key really with this is short duration. Ten months is
actually a very reasonably good time because you could build
the skills that you need. If it is too short, you can't
accumulate the skills but the key thing is that you have really
demonstrated ROI on that education because that person who is
coming in to do that has to know that they have a very high
probability of getting a job when they leave and at a
particular salary rate so that they can justify the investment
and time, and that is really what we have done.
Mr. Bera. Dr. McQueeney, are there potentially any
examples--you know, again, a lot of these folks are also paying
their mortgage, they have to continue to foot their bills--of
possibly even doing an advanced work-study type of program
where you recruit this talent and they are getting on-the-job
training as opposed to a traditional school model?
Dr. McQueeney. Yes. In fact, there is a related topic here
that I think is quite interesting, which is the application of
big data and analytics back on to the educational process
itself. You have seen the great upsurge in videos that attempt
to replace traditional brick-and-mortar classroom attendance,
coursework. You have seen a number of startup companies formed
in this space. If you look at the education process, each of us
really learns quite differently. Some of us may learn more from
hearing or from seeing or from working problems, and great
teachers, great professors are sensitive to how their different
students learn and are capable of presenting material in
alternate ways to make sure they reach all the students. With
electronic delivery of course materials and monitoring of
student progress, we generate digital exhaust, if you will,
that describes how that student is learning, how that student
responds to the instruction, and for the parts of the
instruction that are delivered electronically, we actually have
the ability to do analytics and to do an optimization process
so that each of us on the panel might not get the same length
of lecture on five different topics. It might be adjusted to
our historical learning patterns.
So we have worked with a number of universities and other,
you know, non-traditional educational institutions to apply the
big data and analytics techniques to the education and training
process itself.
Mr. Bera. Great. In my last 30 seconds, so we have access
to data. I think one element that we should also be conscious
of is the quality of the data because there certainly is very
good-quality data but at the same time there is very poor-
quality data that is out there and, you know, any of you who
want to comment on how we monitor quality?
Dr. Rappa. I think most data starts off as bad data, for
the most part, unless it is being collected in a highly careful
way. And so it is, you know--I think just as we hear about big
data today, we are going to hear about bad data in the future.
Most projects start out where you have an enormous front end to
them: really understanding, cleaning and cultivating that data
to make it useful, and that is an important part of the
educational process.
Dr. Jahanian. I would just add that there are a number of
techniques that have been developed and are in development
dealing with data exploration, data cleaning and so on.
Furthermore, when we talk about large-scale data sets, there
are statistical techniques that are being applied that really
take care of the noise, take care of some of these
inconsistencies, and that is one of the attractions of big
data.
Mr. Bera. Great. Thank you.
Chairman Massie. [Presiding] Thank you, Mr. Bera. I now
recognize Mr. Schweikert from Arizona for five minutes.
Mr. Schweikert. Thank you, Mr. Chairman.
This is one of those types of conversations, you know, we
could all sit around and buy you some well-caffeinated coffee
and talk for hours and still have no idea if we made any
progress.
Doctor, is it McQueeney?
Dr. McQueeney. Yes.
Mr. Schweikert. First, you are with IBM?
Dr. McQueeney. Yes.
Mr. Schweikert. In your testimony, help me do a little
ferreting out here. Hardware technology or IT talent, what is
your biggest bottleneck right now?
Dr. McQueeney. There are bottlenecks in a number of areas.
If I looked at the hardware itself, the biggest challenge
getting from the petascale to the exascale is actually the
power dissipation of the systems. The new technology work that
we are doing is to get the computations more efficient in terms
of floating point operations per watt so that if you assembled
a system a thousand times bigger than today's supercomputers you
could house it and cool it.
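    [Illustrative sketch, not part of the testimony. The scaling
argument can be made concrete with a little arithmetic: at a fixed
number of floating point operations per watt, a system a thousand
times faster draws a thousand times the power. The efficiency
figure and the power budget below are hypothetical placeholders,
not measurements of any real machine.]

```python
def power_watts(flops: float, flops_per_watt: float) -> float:
    """Power drawn by a system at a given speed and energy efficiency."""
    return flops / flops_per_watt


def required_efficiency(flops: float, power_budget_watts: float) -> float:
    """Flops per watt needed to reach a target speed within a power budget."""
    return flops / power_budget_watts


PETASCALE = 1e15           # operations per second
EXASCALE = 1e18            # a thousand times petascale
ASSUMED_EFFICIENCY = 1e9   # hypothetical: one gigaflop per watt
ASSUMED_BUDGET = 20e6      # hypothetical: a 20-megawatt facility

petascale_power = power_watts(PETASCALE, ASSUMED_EFFICIENCY)
exascale_power = power_watts(EXASCALE, ASSUMED_EFFICIENCY)

print(f"petascale: {petascale_power / 1e6:.0f} MW")
print(f"exascale at the same efficiency: {exascale_power / 1e6:.0f} MW")
print(f"efficiency needed within budget: "
      f"{required_efficiency(EXASCALE, ASSUMED_BUDGET) / 1e9:.0f} gigaflops per watt")
```

    [Under these assumed numbers the exascale system would need
a fifty-fold gain in efficiency to fit the same facility, which is
the kind of gap the testimony refers to.]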
Mr. Schweikert. You don't want to take down the power grid?
Dr. McQueeney. The power grid may not in fact be able to
supply enough power if we didn't make some innovations. That is
a good point.
Mr. Schweikert. But hasn't your company actually been one
of the leaders at producing some of those breakthroughs?
Dr. McQueeney. In fact, we have, and in fact, a lot of that
history goes back to work that started with the Department of
Energy many years ago, and this bears on an interesting
historical point. In a time when we are concerned about making
investments efficiently, if I go back to the beginning of the
ASCI program with the Department of Energy to do the nuclear
weapons stockpile stewardship program, the Department of Energy
scientists did a very careful analysis of what were the core
algorithms, the core analytics, if you will, in today's
language, that needed to be done at a certain level to provide
the mission that they needed to provide, and they found that
the current path at that time of supercomputing was going to
take five years to produce a machine that they needed in one or
two years. The analysis they did was thorough enough to reveal
that there weren't bottlenecks everywhere but at that time
there were bottlenecks mostly in the interprocessor
communication. So they made a very thoughtful, very surgical
investment in accelerating just the piece that was needed to
close their mission gap, which was the beginning of a very long
run of government-industry collaboration.
Mr. Schweikert. But you are in some ways heading towards
where my question is. So if that bottleneck, in today's world,
do I find the technology if I went out to the private sector
around the world that is competing and producing high-end
supercomputing or is it coming out of a government lab? And I
know the pop culture terminology is ``public-private
partnership'' but the reality, they do operate in pretty
substantially different silos.
Dr. McQueeney. The real forcing function for a breakthrough
is a critical mission need. So in the case of high-performance
computing, it has often been a government agency with a
critical mission that----
Mr. Schweikert. But they were doing a specific request for
how they wanted to manage their data?
Dr. McQueeney. That is correct, and once that technology is
available, it can be consumed very rapidly in lots of other
applications that could take great advantage of it but didn't
have a compelling enough need to get over that hurdle. That is
when the dispersal of technology starts.
Mr. Schweikert. Just as an aside, only because I had some
acquaintances who were--I used to be an old SQL programmer so I
am way out of date now. IBM was actually running a fascinating
large data project where they were doing sweeping data sets
through the world's social media and gathering it and looking
for trends. Can you in 30 seconds or so tell me your knowledge
on that?
Dr. McQueeney. Yeah, we have analyzed the public social
media sources with several of our customers and we can gain a
lot of insights. For example, you know, retailers can gain
insights about trends and their clients. Transportation
agencies can gain insights about likely traffic congestion.
There are many sources of public data, both social media and
other forms that can be analyzed to reveal patterns about how
people conduct their daily activities that are very useful for
optimizing the public infrastructure.
Mr. Schweikert. Forgive me, I am blind as a bat without
these. Is it Dr. Rappa?
Dr. Rappa. Yes.
    Mr. Schweikert. Isn't my single biggest problem in big data
right now noise, in that when I put data set after data set
after data set and build on it, just small incremental
errors actually create really bad decisions on the end?
Dr. Rappa. Well, I think part of the education around
handling big data deals very squarely with the quality of the
data and how to clean it and cultivate it to reduce the noise,
to----
Mr. Schweikert. But you and I can go over a long series of
public policies, both state, national, you know, military,
others, where we built it on really gigantic analyzed data sets
and it was wrong.
Dr. Rappa. Well, I think that, you know, the challenge here
is education. So as I alluded to earlier, we have teams of
students----
Mr. Schweikert. Is it education or developing educational
skepticism?
Dr. Rappa. It is developing the education around how to
squarely understand the inherent challenges in data. Data is
not born clean. It isn't born ready to be analyzed.
Mr. Schweikert. And when you and I build our model, the way
we weight, you know, because we start plugging in human factors
that, you know, you and I bring our biases and we----
Dr. Rappa. And this is why we really need a focused
education squarely around how do you draw insights from data
because there are these inherent problems in data, especially
as you scale them up, as you combine different data sets, as
you combine different types of data.
Mr. Schweikert. Thank you, Doctor, and Mr. Chairman, thank
you for tolerating. It is just one of my great fears. And look,
I am a data freak. I mean, you have got to see the servers and
stuff I have at home. But I have learned when we make big-time
public policy on something we all know is right, we keep making
huge, very costly mistakes.
Chairman Massie. Thank you, Mr. Schweikert. I now recognize
Mr. Hultgren from Illinois for five minutes.
Mr. Hultgren. Thank you, Mr. Chairman. Thank you all for
being here. First of all, I just want to thank Dr. McQueeney
too. I appreciate your mention and your support for the
exascale computing bill I am currently authoring. I am very
excited about the potential there and see some huge shift in
our national computing capabilities and I am very excited about
that, so I appreciate your mention and support of that.
I do have a few questions, and first I guess I would
address this one to Dr. McQueeney and also Dr. Jahanian. Is
that right? I am sorry. I wonder if you could comment briefly
on where the United States stands in your opinion in worldwide
computing leadership? I know the metric of the fastest
supercomputer is one metric but what do you use as a metric for
big data to determine which countries are using it most
effectively?
Dr. McQueeney. The common thing that is cited in these
discussions is the top 500 supercomputers list. That is
something that is compiled twice a year, as you well know, and
we have usually been at the top of that list. We have continued
to be the majority of the systems on that list but other
countries have noticed the success that we had in, you know,
government leading the way on high-performance computing
breakthroughs. Once those systems are built, they find hundreds
and thousands of other applications, each with a client that
might not have been able to fund that breakthrough themselves
but can certainly utilize it. Other countries have popped up on
the top of that list because they are interested in emulating
the success we have had in leading the way with innovation and
then seeing that innovation used broadly across the commercial
sector. So the top 500 list is a very technical, perhaps very
geeky measure of who is on top, and I would say that we are
still in a leadership position there but it has been stronger
in the past than it is today.
If you turn to more of a business view, you would want to
look at the companies that were taking the best advantage of
data sources, either to drive value in their companies or to
provide benefits such as public safety or health benefits, and
there again I think we are in a good position but it is a very
different kind of skill, a conversation we didn't quite finish
before about the skill to build these large systems is a very
focused, very large-scale, very capital-intensive activity but
the skills to use these systems are more focused on creativity
and are actually better done by large groups of small teams. In
fact, you know, the NSF has been a leader in fostering that
kind of innovation where thousands and thousands of groups can
build innovative applications and take advantage of these
systems.
Mr. Hultgren. Thanks. Dr. Jahanian?
Dr. Jahanian. Yes, just a couple of quick comments. There
is no question that we continue to maintain our leadership
worldwide in this area, and there is no doubt that continued
investment in this area is extremely important to the future of
the country. As I mentioned just a few minutes ago, NSF's
investment in Blue Waters, Stampede, as well as the Yellowstone
supercomputing centers represent a range of investments that we
make in high-performance computing, addressing the needs of not
only the top five percent of applications that have
exceptionally high computational needs but also a broad
spectrum of researchers across the country in science and
engineering who would need computational resources.
A couple of comments. Just look at Blue Waters, for
example, which is at University of Illinois. A couple of data
points about it. It has--if you could--just the computing power
of it, if you could multiply two numbers together every second,
it would take 32 million years to do what Blue Waters does in
one second. That is astonishing power, for example, of Blue
Waters. In terms of storage capacity, memory capacity and so
on, there is a similar kind of scale.
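    [Illustrative sketch, not part of the testimony. The
comparison can be checked with simple arithmetic: the number of
seconds in 32 million years is the number of one-per-second
multiplications that would have to fit into a single second. The
365.25-day year is an assumption made here for the calculation.]

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumes a 365.25-day year
YEARS_QUOTED = 32_000_000               # figure quoted in the testimony

# One multiplication per second for 32 million years, done in one second.
ops_per_second = YEARS_QUOTED * SECONDS_PER_YEAR

print(f"{ops_per_second:.2e} multiplications per second")  # 1.01e+15
```

    [The result is about ten to the fifteenth operations per
second, that is, roughly one petaflop.]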
The second point that I want to make is, we view
computation and data to be two sides of the same coin. You
really need to address both. So when we talk about
computational capabilities, we also have to worry about cyber
infrastructure to manage, to curate, to serve data to science
and engineering community, and the investment in cyber
infrastructure has to be balanced between the computation side
of it as well as management and curation of data.
Mr. Hultgren. Let me have--my time is running out but I
have a follow-up question to the two of you as well if you
could both comment in the time I have. It seems to me that
exascale computing is focused on solving discrete problems that
necessitate massive computing power and speed. Are these
different problems than those we are addressing through big
data analytical tools and how do these two terms, how are they
different, how are they similar?
Dr. McQueeney. Historically, we have tended to talk about
them differently, but as we project how the exascale systems
will be built and how they will be used and we look at the
growing importance of big data analytic systems, we see that
the platforms on which these systems will both depend will be
much more common than separate, and in fact, we see that there
is no conflict between investments in classically what we have
called HPC and what we are now calling big data analytics, and
both are changing actually. The way we use an exascale system
will not be the same way that we use a petascale system. There
isn't time here to go into it, but it actually morphs into a
direction that is much more common with what we will do in big
data and analytics.
Dr. Jahanian. I would just add that many of the problems
that the business community needs, the science and engineering
community needs are being addressed today through different
kinds of computational architectures that don't necessarily
require exascale computing today, including weather
modeling, a number of other applications that have been
mentioned. So it is really important to consider the investment
in exascale computing in the spectrum of investment that we
make to support computational and data needs of the entire
science and engineering community and of course the private
sector.
Mr. Hultgren. Thank you so much. Chairman, thank you. I
yield back.
Chairman Massie. I now recognize Mr. Lipinski from Illinois
for five minutes.
Mr. Lipinski. Thank you, Mr. Chairman. I am glad that Dr.
Jahanian mentioned Blue Waters there. We were just there not
that long ago, but since you covered that, I can move on to a
different area.
Dr. McQueeney, in your testimony you talk about how the
Federal Government needs to invest in big data if the U.S. is
going to maintain its leadership and competitive edge in this
area. The needs and potential benefits of big data for the
Federal Government align closely with those of private industry
in a number of areas. If that is the case, how can the Federal
Government more effectively partner with industry to achieve
common goals and do you believe that industry has sufficient
input in the Federal Government's research agenda as it relates
to big data?
Dr. McQueeney. I do think we have sufficient input. I think
we have excellent dialogs with the relevant agencies and
national laboratories, and I think the roles are complementary.
I go back to the story about the early days of the ASCI
program where through a collaboration we realized that the key
piece of a supercomputing system that needed to be accelerated
was not the entire investment. We could ride on the commercial
investments for most of the components of the supercomputing
systems at that time except for one, which was the high-
bandwidth switching between processors. And so that kind of
thoughtful connection between the leaders in commercial
computing and the leaders on the government side has been able
historically to identify which areas are critical to attain
government mission imperatives and where we can leverage
commercial technology and where we need to accelerate that in a
surgical fashion. So it has, in our view, been a very good
partnership based on very high-bandwidth technical
communications, understanding of applications and knowing when
the government should be leveraging commercial investments and
when they need to accelerate parts of that investment to attain
unique mission goals, and again, as I have said before, once
those barriers are crossed in terms of either the scalability
of the system or the internal bandwidth of the system, it opens
up thousands of new applications where there were ready
problems to be analyzed but those applications weren't large
enough to drive that breakthrough. So that is how the effect
works of the leadership coming from some of the government
agencies and then being realized broadly across industry. That
is the essence of where this leadership has come from so
successfully over the years.
Mr. Lipinski. I want to follow up with Dr. Rappa on that.
Dr. Rappa, you discussed the importance of public-private
partnerships to realizing the benefits of big data and stated
specifically that we must intensify and accelerate the national
investment in proven models. What characteristics make a
public-private partnership successful and what models should we
be investing in? What were you referring to there?
Dr. Rappa. Well, I think first of all, we have been doing
this now for six years and so I think we do have a fairly
interesting, novel model for producing talent in this field
with a kind of proven track record based on data, based on
market value of the graduates, but I think it comes really, you
know, partly from the university community, partly from the
academic community. Obviously we have a set of missions to
educate students but we need to also, I think, do that by
trying to really understand the employer, what are they looking
for when they hire talent, what are the kinds of skills that
they need in order to be effective on the job, and I think
employers need to sort of be open to working with the academic
community. You know, there is a certain amount of dissonance
that naturally occurs because there are two different worlds
with different missions but I think it is really--I think we
have shown that it is possible with organizational innovation,
with a focused effort, with a sense of openness to engage the
private sector in a very positive way, not just at NC State but
at other universities. There are many, many examples now that I
hope we are providing some leadership on but that other
universities are working with our model but also pursuing other
creative models to do this. There are probably about two dozen
around the country already.
Mr. Lipinski. Thank you. Dr. Jahanian, anything you want to
add about public-private partnerships?
Dr. Jahanian. Yes, indeed. There is no question that when
we think about the innovation ecosystem in this country, it
includes academia, it includes the private sector, it includes
government investment and a talent-rich workforce. The private
sector is investing heavily in cloud computing, as you know. It
is investing heavily in making computational resources also
available. I think there are opportunities for the Federal
investment to leverage that and make some of that available. Of
course that is commercially available today to our researchers,
to our scientists and engineers who could rely on those
systems. We have announced a number of partnerships, one with
IBM and Google, another one with Microsoft that make some of
those resources available to the research community.
Dr. McQueeney already mentioned this, that there is high-
bandwidth communication between the private sector and various
Federal agencies. I can tell you from NSF's perspective, it is
a very, very rich collaboration. On my advisory committee, I
have a number of senior leaders from the private sector who
serve on my advisory committee advising us on our portfolio, on
our investments in addition to academics who serve on my
advisory committee.
The final comment that I want to make is, there are a
number of programs at NSF, and I know you are familiar with all
of them, including SBIR, including I-Corps and so on that focus
on transfer of knowledge from lab to practice. Federal
Government invests heavily in advancing frontiers of knowledge.
For us to accelerate those programs such as I-Corps, SBIR and
so on serves a tremendous purpose, and here again, there are
opportunities to engage the private sector and accelerate the
transfer of knowledge to practice to benefit the Nation. Thank
you.
Mr. Lipinski. Thank you.
Chairman Massie. Thank you, Mr. Lipinski. I now recognize
Mr. Bridenstine from Oklahoma for five minutes.
Mr. Bridenstine. Thank you, Mr. Chairman.
I also serve on the House Armed Services Committee, and I
am aware that the Department of Defense is moving towards
cloud-based computing solutions, and this of course creates
some consternation about security issues, cyber hacking, other
cyber crimes, and I was wondering if any of your organizations
are involved in helping the Department of Defense work through
these issues and what those solutions might be, if you could
share with us on that?
Dr. McQueeney. Sure, if I could start? You are quite right
to raise the concern about security for any systems used by the
Defense Department especially, although it would be true for
all Federal agencies. And when you move to a cloud computing
model, there is an extra imperative to be concerned about
security, and if you think of it in terms of how the DOD might
think of it, if that environment should be compromised by an
enemy, it is a bigger piece of resource than an individual
machine so it requires special vigilance. Now, the good news
technically is, the way we handle virtualization, which is the
foundation of how cloud computing is delivered from a compute
virtualization point of view, there are actually sophisticated
techniques that can provide additional security in a
virtualized environment that we can provide even when using
things running on bare metal. We have additional abilities to
instrument the operation of that cloud and to very rapidly
detect any kind of pattern or behavior that is indicative of a
threat.
We did a project a number of years ago with the U.S. Air
Force and they graciously let us write a short press release on
it where we built a cloud computing environment that was at the
cutting edge a few years ago. We instrumented it very
thoroughly with watching the packets flowing on the
interconnected network that built the cloud in question and we
very carefully isolated it from the rest of the world,
introduced known cyber attacks into it and were able to show
that if we knew the patterns of command and control, as the
defense folks might say, of these cyber attacks, we could
actually spot them assembling themselves and interrupt them
before they had a chance to launch. So having tremendous
control over the environment out of which we were getting
compute resources gave us abilities to do additional security
and additional monitoring, even if we assumed the security was
not perfect and could be breached, could we essentially in real
time detect that breach and interrupt it before it stopped. I
thought that was a very forward-looking piece of work that was
driven by the Air Force CIO's office.
Mr. Bridenstine. Excellent. Go ahead.
Dr. Jahanian. As you alluded to, these new environments,
whether it is mobile platforms or cloud computing, are
introducing new challenges, and we recognize that attackers and
defenders are coevolving and there are enormous challenges to
protecting our critical infrastructure and our cyber
infrastructure.
I wanted to mention NSF's Secure and Trustworthy Cyberspace
program, which is a research program addressing many of the
challenges that we alluded to, and this is a research program
that addresses not only the technology issues but also
transition to practice. Furthermore, the NITRD research and
development subcommittee has a working group that focuses on
coordination of activity across various agencies on
cybersecurity and there is rich dialog involving various
agencies on that issue.
Mr. Bridenstine. Excellent. Are there any other things that
the Department of Defense could do to help you guys with the
objective of securing cloud computing for the Department of
Defense?
Dr. Rappa. So I am currently co-directing a project with a
colleague at NC State, which is the science of security project
that is done in collaboration with Carnegie-Mellon University
and University of Illinois, and we are trying to bring together
large groups, multidisciplinary groups of faculty to really try
to understand the underpinning of the security problem and how
to produce science around it. It is a very long-term challenge
but it is one which I think has to start with getting the
faculty across different disciplines focused on it and
certainly I think it has been a tremendous opportunity and I
look forward to moving into the future.
Dr. McQueeney. Yeah, Dr. Rappa makes a very interesting
point, to close the loop here. The cybersecurity problem is
itself a big data and fast-data problem, and in fact, with some
of the advanced persistent threats that we see today, which
depend on breaching an infrastructure and then laying dormant
for several months, what the attacker is trying to do is to
wait out how long you keep your log file data so that when they
launch themselves, it is difficult to do forensics, and so what
we have learned is that these log files are actually the
essence of the big data you need to do pattern analysis,
pattern discovery on forensics, you know, should any attack
occur. So in fact, most of the science behind big data
including data at rest and large-scale computation and
fast-data that are eating very high-speed streams is directly
relevant to the subject of cyber defense.
Mr. Bridenstine. Thank you.
Chairman Massie. Thank you, Mr. Bridenstine. If the Ranking
Member is amenable to this, I think we will do another round of
questions?
Ms. Wilson. Yes.
Chairman Massie. Did you have something to introduce into
the record?
Ms. Wilson. I do. Thank you, Mr. Chair. Mr. Kilmer has lots
of conflicts. As we saw him come to the meeting, he had to
leave, and I want to ask unanimous consent on behalf of Mr.
Kilmer to introduce a report on big data from IDC into the
record, and then I have a question.
Chairman Massie. Without objection, so ordered. It will be
set into the record.
[The information appears in Appendix II]
Ms. Wilson. Thank you. This question is for everyone.
We have all had several discussions lately about the value
of NSF-funded research to society and how we might certify that
value based on the grant proposal. I think we might use big
data instructively here. It is an incredibly interdisciplinary
field where tools are developed in the pursuit of one narrow
research question, let us say in the social sciences might have
profound applications across many fields of science and even in
many sectors of the economy that can't possibly be anticipated
at the time of the proposal. What is the potential for data
analytics being developed in one little seemingly irrelevant
corner having unintended benefits to other fields and societal
applications? And if you have concrete examples, that would be
even better for us to understand. Thank you.
Dr. Jahanian. Okay. I guess I will start. There is no
question there are all sorts of explorations that we are doing
in the area of big data that we can't even begin to see the
potential impact of it. I will give you an example. NSF has
been investing and other agencies with the private sector in
what is known as the area of machine learning. These
investments have taken place for at least 20 or 30 years. In
fact, IBM has also led efforts in this area. I can tell you
that it is investments of the last 20 or 30 years that have
come to fruition such that these machine learning algorithms
essentially allow us to look at these large data sets and
identify trends and be able to adapt. Essentially, they have a
broad range of applications from weather forecasting to
financial modeling to biomedical research and so on that have
had tremendous, tremendous impact and now we use these
techniques as if they are off-the-shelf solutions available
that you can buy. These are through years of investment that we
have made that have come to fruition, so that is an example of
that.
We are investing in all sorts of areas in natural language
understanding, in information retrieval, in various algorithms
and approaches to automated scalable approaches to reasoning
that could be applied to understanding relationship between
gene sequence structure and biological functions. These are all
essentially the kinds of investments that we are making that
some of it we could see how it comes to fruition. Some of it
relies on decades of investment that we have already made in
computational techniques and data-intensive techniques.
Dr. McQueeney. If I could offer you an example from the
medical world, one of the critical problems in medicine is the
loss of premature infants due to infections, and physicians
have struggled for a long time with identifying the onset of an
infection at a very early point because as these infections can
grow exponentially, the earlier you can intercept them, the
more likely you are to have a lifesaving benefit for someone
who is very vulnerable such as a premature infant. We have done
work with the Toronto Hospital for Sick Kids where a physician
up there had an idea that all the instrumentation in the NICU
that is--you know, you have probably been in a hospital room or
intensive-care room, all the instruments around the bed,
someone comes in every half an hour and writes down those
numbers but the instruments are producing readings
continuously, and this physician had the idea that if we kept
all that data and we stored all that data as it came out of the
machines in real time, which was a tremendous aggregation from
a velocity of data point of view and correlated with the
eventual issues that these premature infants had, we might be
able to detect patterns using techniques such as machine
learning that we were just hearing about that would give us an
early identification of an upcoming infection, the ability to
treat it before it got out of control, and her theories were
absolutely correct. There were signatures in the data that gave
up to 24 hours advance notice of an onset of an infection that
was time for the doctors to in many cases provide some kind of
lifesaving therapy. So there is an example of very, very deep
mathematics, computer science being applied to a problem where
the data was being produced every day by these instruments and
it wasn't being captured and it wasn't being looked at and it
wasn't being correlated with results to produce a fantastic
outcome.
Dr. Rappa. I would just sum up by saying that really big
data is part of a decades-long process that really started with
computerization in the 1940s and 1950s and eventually got
interconnected through the Internet in the 1970s, 1980s and
1990s that the world that we are turning into, data is going to
be everywhere. It is going to affect exactly what happens here.
It is going to affect hospitals, universities, every corner of
the economy literally, and so we need to take approaches to
that to try to develop understanding around big data, how it is
applied, how the tools of analytics are applied across, you
know, virtually every sector of the economy, and so I would
take a very broad view, not looking at it as specifically, you
know, a realm of computer technology or some other sort of
isolated realm but looking at it as, you know, unfortunately as
the big thing it is.
Dr. Jahanian. May I offer another example as I was thinking
about it? I am reminded of the work by Daphne Koller and her
collaborators at Stanford on classifying breast cancer via
image analysis. As you know, 40,000 women die from this disease
each year. By extending essentially image analysis techniques
to hundreds of, I should say thousands and thousands of biopsy
images, they were able to identify a subset of cellular
features. Out of 6,000 possible features, they were able to
essentially identify a few of them that were predictive of
survival time among breast cancer patients. What is really
surprising is that the feature that they identified, it wasn't
just from--the best feature, I should say, that is a predictor
of survival, was not from the cancerous tissue itself but it
was from the surrounding tissue, and that has led to new kinds
of treatments. It has led to new kinds of diagnosis techniques
and also a very personalized treatment that could aim to
improve survival times in patients. That is a very, very
concrete example.
Another example is the work that Google had done during
H1N1 virus. I will be very brief about this. Before they
actually discovered a vaccine, we wanted to track the spread of
disease. Google engineers used data that had nothing to do with
the virus directly from billions of essentially web searches
from around the world together from publicly available,
essentially historic data on flu trends, to predict the spread
of flu virus down to small regions in the country--or across
the world, rather. This is a remarkable essentially application
of data that one would have never thought could be applicable
to something like H1N1 virus.
Ms. Wilson. Thank you very much.
Chairman Massie. Thank you, Ms. Wilson. Thank you for that
very excellent example of how we can use--a private company can
find information in the data.
We got a little bit out of order so the last question is
going to be mine. I reserve five minutes for myself. And the
question I want to ask is, we have heard about banks that are
too big to fail, and we also know that the Internet is now too
big to fail. We recently in the House passed a CISPA bill which
is somewhat controversial but some people felt it was necessary
to do because the Internet was so big and pervasive in our
lives. So my question to you is, are there any big data sets
that are too big to fail? In other words, are there ones that
are pervasive that we have let through osmosis become--we have
become too dependent upon or maybe not too dependent but we are
dependent upon these data sets, for instance, weather, you
know, and early warning systems? Not all of those, I imagine,
are government systems. Some of them are private but possibly
the government is relying on these systems and so I would be
remiss if I didn't ask this question now before something
fails, but tell us what is too big to fail right now? What
would we bail out, and is there sufficient redundancy in the
collection, storage and access of these data sets? Thank you.
Dr. McQueeney. Well, first, I would just like to say that
we were delighted to support that cyber bill, and I
congratulate you on such broad bipartisan support in the House
for getting that acted upon.
Data sets have the property that they can often be
subdivided and often be replicated, and so we have a lot of
techniques by which we can assure the continuity of data if we
take the time to do it, and if there were very valuable
historical records on things like long-term weather trends that
were only stored in one place, that actually could be a concern
because that is literally irreplaceable data. But essentially
all of the IT techniques needed to take those large data sets
and segment them and replicate them in different secure places
so they could be re-created do exist but I think you raise an
interesting point, that it is worthwhile to periodically check
that we are being appropriately vigilant with the digital
archives that are so valuable.
Chairman Massie. Dr. Jahanian?
Dr. Jahanian. I don't have a specific example. What I can
tell you is that similar to the issue of cybersecurity, as the
Nation's critical infrastructure and more generally the
Internet is playing a vital role in integrating the economic,
you know, political, societal fabric of our society, we are
going to become more and more dependent on data, and data is
going to play an increasingly significant role in our
day-to-day lives, and for that reason, I think the same sort of issues
that apply to all sorts of IT solutions that we take for
granted will increasingly be applied to data.
From a research and engineering community's point of view,
it is not just failure of the data but making that data
accessible and also making the data accessible to broad
community of scientists and engineers is an issue that we are
quite concerned about.
Chairman Massie. Thank you very much. I was part of the
bipartisan on CISPA, opposing CISPA actually, but that is okay.
I want to thank the witnesses for their valuable testimony
and the Members for their questions today. The Members in the
Committee may have additional questions for you, and we will
ask that you respond to those in writing. The record will
remain open for two weeks for additional comments and written
questions from the Members.
The witnesses are excused and this hearing is adjourned.
[Whereupon, at 11:35 a.m., the Subcommittees were
adjourned.]
Appendix I
----------
Answers to Post-Hearing Questions
Responses by Dr. Michael Rappa
[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
Responses by Dr. Farnam Jahanian
[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]
Appendix II
----------
Additional Material for the Record
IDC IVIEW, The Digital Universe in 2020: Big Data, Bigger Digital
Shadows, and Biggest Growth in the Far East, submitted by
Representative Derek Kilmer
[GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]