[House Hearing, 113 Congress] [From the U.S. Government Publishing Office] NEXT GENERATION COMPUTING AND BIG DATA ANALYTICS ======================================================================= JOINT HEARING BEFORE THE SUBCOMMITTEE ON RESEARCH & SUBCOMMITTEE ON TECHNOLOGY COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY HOUSE OF REPRESENTATIVES ONE HUNDRED THIRTEENTH CONGRESS FIRST SESSION __________ WEDNESDAY, APRIL 24, 2013 __________ Serial No. 113-22 __________ Printed for the use of the Committee on Science, Space, and Technology Available via the World Wide Web: http://science.house.gov ---------- U.S. GOVERNMENT PRINTING OFFICE 80-561 PDF WASHINGTON : 2013 COMMITTEE ON SCIENCE, SPACE, AND TECHNOLOGY HON. LAMAR S. SMITH, Texas, Chair DANA ROHRABACHER, California EDDIE BERNICE JOHNSON, Texas RALPH M. HALL, Texas ZOE LOFGREN, California F. JAMES SENSENBRENNER, JR., DANIEL LIPINSKI, Illinois Wisconsin DONNA F. EDWARDS, Maryland FRANK D. LUCAS, Oklahoma FREDERICA S. WILSON, Florida RANDY NEUGEBAUER, Texas SUZANNE BONAMICI, Oregon MICHAEL T. McCAUL, Texas ERIC SWALWELL, California PAUL C. BROUN, Georgia DAN MAFFEI, New York STEVEN M. PALAZZO, Mississippi ALAN GRAYSON, Florida MO BROOKS, Alabama JOSEPH KENNEDY III, Massachusetts RANDY HULTGREN, Illinois SCOTT PETERS, California LARRY BUCSHON, Indiana DEREK KILMER, Washington STEVE STOCKMAN, Texas AMI BERA, California BILL POSEY, Florida ELIZABETH ESTY, Connecticut CYNTHIA LUMMIS, Wyoming MARC VEASEY, Texas DAVID SCHWEIKERT, Arizona JULIA BROWNLEY, California THOMAS MASSIE, Kentucky MARK TAKANO, California KEVIN CRAMER, North Dakota ROBIN KELLY, Illinois JIM BRIDENSTINE, Oklahoma RANDY WEBER, Texas CHRIS STEWART, Utah VACANCY ------ Subcommittee on Research HON. LARRY BUCSHON, Indiana, Chair STEVEN M. 
PALAZZO, Mississippi DANIEL LIPINSKI, Illinois MO BROOKS, Alabama ZOE LOFGREN, California STEVE STOCKMAN, Texas AMI BERA, California CYNTHIA LUMMIS, Wyoming ELIZABETH ESTY, Connecticut JIM BRIDENSTINE, Oklahoma EDDIE BERNICE JOHNSON, Texas LAMAR S. SMITH, Texas ------ Subcommittee on Technology HON. THOMAS MASSIE, Kentucky, Chair JIM BRIDENSTINE, Oklahoma FREDERICA S. WILSON, Florida RANDY HULTGREN, Illinois SCOTT PETERS, California DAVID SCHWEIKERT, Arizona DEREK KILMER, Washington EDDIE BERNICE JOHNSON, Texas LAMAR S. SMITH, Texas

C O N T E N T S

Wednesday, April 24, 2013

Witness List
Hearing Charter

Opening Statements

Statement by Representative Larry Bucshon, Chairman, Subcommittee on Research, Committee on Science, Space, and Technology, U.S. House of Representatives
    Written Statement
Statement by Representative Daniel Lipinski, Ranking Minority Member, Subcommittee on Research, Committee on Science, Space, and Technology, U.S. House of Representatives
    Written Statement
Statement by Representative Thomas Massie, Chairman, Subcommittee on Technology, Committee on Science, Space, and Technology, U.S. House of Representatives
    Written Statement
Statement by Representative Frederica S. Wilson, Ranking Minority Member, Subcommittee on Technology, Committee on Science, Space, and Technology, U.S. House of Representatives
    Written Statement

Witnesses:

Dr. David McQueeney, Vice President, Technical Strategy and Worldwide Operations, IBM Research
    Oral Statement
    Written Statement
Dr. Michael Rappa, Director, Institute for Advanced Analytics, Distinguished University Professor, North Carolina State University
    Oral Statement
    Written Statement
Dr. Farnam Jahanian, Assistant Director for the Computer and Information Science and Engineering (CISE) Directorate, National Science Foundation
    Oral Statement
    Written Statement

Discussion

Appendix I: Answers to Post-Hearing Questions

Dr. Michael Rappa, Director, Institute for Advanced Analytics, Distinguished University Professor, North Carolina State University
Dr. Farnam Jahanian, Assistant Director for the Computer and Information Science and Engineering (CISE) Directorate, National Science Foundation

Appendix II: Additional Material for the Record

IDC IVIEW report, The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, submitted by Representative Derek Kilmer, Subcommittee on Technology, Committee on Science, Space, and Technology, U.S. House of Representatives

NEXT GENERATION COMPUTING AND BIG DATA ANALYTICS
----------
WEDNESDAY, APRIL 24, 2013

House of Representatives, Subcommittee on Research & Subcommittee on Technology, Committee on Science, Space, and Technology, Washington, D.C.

The Subcommittees met, pursuant to call, at 10:04 a.m., in Room 2318 of the Rayburn House Office Building, Hon. Larry Bucshon [Chairman of the Subcommittee on Research] presiding.

Chairman Bucshon. All right.
This joint hearing of the Subcommittee on Research and the Subcommittee on Technology will come to order. Good morning, and welcome to today's joint hearing entitled ``Next Generation Computing and Big Data Analytics.'' In front of you are packets containing the written testimony, biographies and Truth in Testimony disclosures for today's witnesses. Before I get started, since this is a joint hearing involving two Subcommittees, I want to explain how we will operate procedurally so all Members understand how the question-and-answer period will be handled. As always, we will alternate rounds of questioning between majority and minority Members. The Chairmen and Ranking Members of the Research and Technology Subcommittees will be recognized first. Then we will recognize Members present at the gavel in order of seniority on the full Committee and those coming in after the gavel will be recognized in order of their arrival. I now recognize myself for five minutes for an opening statement. Again, I would like to welcome everyone to today's hearing where we will examine how advancements in information technology and data analytics enable private and public sector organizations to provide greater value to their customers and citizens. Industry, academia, and government are all interested in determining how to extract value, gain insights, and make better decisions based on the wealth of data that is generated today. In recent years, ``big data'' has become the popular term used to encompass this phenomenon. TechAmerica, an information technology trade association, defines big data as ``large volumes of high-velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.'' Big data offers a range of opportunities for private industry to reduce costs and increase profitability. It can enable scientists to make discoveries on a previously unreachable scale. 
And it can allow governments to identify ways to serve their citizens more efficiently. The McKinsey Global Institute predicts that effective information management can provide $300 billion in annual value to the U.S. health care sector alone. TechAmerica released a report last year highlighting how big data initiatives can improve the efficiency and effectiveness of government services. And through the use of advanced computing power and analytic techniques, universities and Federal laboratories can drive new research initiatives that will significantly increase our scientific knowledge base. There are also various challenges associated with big data that the Committee will explore today. McKinsey has estimated that by 2018 the U.S. will face a shortfall of 140,000 to 190,000 professionals with significant technical depth in data analytics, and a further shortfall of 1.5 million managers and analysts who can work effectively with big data analysis. Committee Members will be interested to learn how industry, academia, and government are addressing this shortfall. While the term ``big data'' is relatively new, public and private organizations have been investing in computing power and data analytics for a number of years. In March of last year, the Obama Administration announced a Big Data Research and Development Initiative, including $200 million in new funding across six different government departments and agencies. I am interested to learn how effectively these programs are being coordinated across the different Federal agencies to ensure that taxpayer dollars are being leveraged effectively. Finally, privacy and security are major concerns when private and public organizations are collecting, analyzing, and disseminating massive data sets. We have an excellent panel of witnesses ranging across industry, academia, and government. I would like to extend my appreciation to each of our witnesses for taking the time and effort to appear before us today.
We look forward to your testimony.

[The prepared statement of Mr. Bucshon follows:]

Prepared Statement of Subcommittee on Research Chairman Larry Bucshon

Good morning, I would like to welcome everyone to today's hearing where we will examine how advancements in information technology and data analytics enable private and public sector organizations to provide greater value to their customers and citizens. Industry, academia, and government are all interested in determining how to extract value, gain insights, and make better decisions based on the wealth of data that is generated today. In recent years, ``Big Data'' has become the popular term used to encompass this phenomenon. TechAmerica, an information technology trade association, defines Big Data as ``large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.'' Big Data offers a range of opportunities for private industry to reduce costs and increase profitability. It can enable scientists to make discoveries on a previously unreachable scale. And it can allow governments to identify ways to serve their citizens more efficiently. The McKinsey Global Institute predicts that effective information management can provide $300 billion in annual value to the US health care sector alone. TechAmerica released a report last year highlighting how Big Data initiatives can improve the efficiency and effectiveness of government services. And, through the use of advanced computing power and analytics techniques, universities and federal laboratories can drive new research initiatives that will significantly increase our scientific knowledge base. There are also various challenges associated with Big Data that the Committee will explore today.
McKinsey has estimated that the US will face a shortfall of 140,000 to 190,000 professionals with significant technical depth in data analytics, and a further shortfall of an additional 1.5 million managers and analysts who can work effectively with big data analysis by 2018. Committee members will be interested to learn how industry, academia, and government are addressing this shortfall. While the term Big Data is relatively new, public and private organizations have been investing in computing power and data analytics for a number of years. In March of last year, the Obama Administration announced a ``Big Data Research and Development Initiative,'' including $200 million in new funding across six different federal departments and agencies. I am interested to learn how effectively these programs are being coordinated across the different federal agencies to ensure that taxpayer dollars are being leveraged effectively. Finally, privacy and security are major concerns when private and public organizations are collecting, analyzing, and disseminating massive data sets. We have an excellent panel of witnesses ranging across industry, academia and government. I'd like to extend my appreciation to each of our witnesses for taking the time and effort to appear before us today. We look forward to your testimony. Chairman Bucshon. I will now yield to Mr. Lipinski for his opening statement. Mr. Lipinski. Thank you. I want to thank you, Chairman Bucshon, and I want to thank Chairman Massie for holding this hearing. I want to welcome and thank the witnesses for being here. Today's hearing gives us an opportunity to talk about the new tools and analytics that are being developed for big data. As Chairman Bucshon stated, big data can be thought of as large volumes of complex and diverse types of data that change rapidly with time. 
In basic scientific research, in national security, and in economic sectors ranging from energy to health care, big data challenges are becoming fundamentally important. Effectively dealing with big data can impact how we do business and how we think about the world. As a Member of the Research Subcommittee for several years, I have watched as the amount and complexity of data has grown by leaps and bounds. The field of astronomy is a great example. When the Sloan Digital Sky Survey started work in 2000, its telescope in New Mexico collected more data in a few weeks than had been collected in the history of astronomy, and that telescope will be surpassed when the Large Synoptic Survey Telescope begins scientific operations in 2020. LSST will photograph the entire sky every few days, producing data at a rate almost 100 times greater than the Sloan Survey. But data is useless without the means to store and analyze it in an efficient manner. The types of data are changing as well. Data has gone from being mostly numbers entered into Excel spreadsheets to data coming from sensors, cell phone cameras and millions of email messages. In fact, it is estimated that over 85 percent of data generated today are these kinds of unstructured data, data like videos and emails. The change in the volume and variety of data as well as how fast data is being produced and changed creates almost limitless opportunities. For example, since cybersecurity data is massive, varied, and changing quickly, big data technologies have the potential to detect and prevent cyber attacks before they happen. I know that organizations like IBM are developing technologies to do just that. Additionally, big data could be used to establish new business models, create transparency, improve decision-making and reduce inefficiencies within businesses and government. But along with the opportunities, there are a number of challenges.
We need new tools and software packages to manage, organize, and analyze all these different kinds of data. Additionally, we will need an analytic workforce to ensure the gains of big data. These challenges necessitate involvement from government, academia and the private sector. That is why I am happy to see all those sectors represented here today. The government has and will continue to play an instrumental role in this area. For instance, the Networking and Information Technology Research and Development program, or NITRD, created an interagency big data group that is coordinating Federal efforts in technologies, research, competitions, and workforce development for big data. We had a hearing on the NITRD program back in February, and I expect that we will be able to take a broader look at many of the same issues in today's hearing. In some cases, agencies have teamed up to issue joint solicitations. For example, NSF and NIH have a joint big data grant program that awarded nearly $15 million of grants to eight teams of researchers last year. These first award grants went to projects focused on designing new tools for big data and new data analytic approaches. We will be hearing more about these and other interagency activities from Dr. Jahanian in his testimony. We will also learn more about specific programs at NSF, one of the leading agencies in Federal big data efforts on both the analytics side and the computational resources side. As I mentioned before, one of the areas being coordinated through NITRD is workforce development for big data. Several agencies, including NSF, have education activities to support a new generation of big data researchers. As we will likely hear from all of the witnesses, we face a looming shortage of workers with the skills needed to analyze and manage large, complex and high-velocity data sets. 
There is some overlap with the broader STEM skills we so often speak about in this committee, but there are also unique skills required to address the big challenges of big data. We need to consider how to build those skills into STEM curricula, especially at the undergraduate and graduate levels. I look forward to hearing from our witnesses about the current educational efforts and what additional initiatives may be necessary. And finally, since big data involves different types of data that can be produced and transferred quickly, there are concerns over privacy. We need to ensure that we strike the right balance between exploring and implementing all of the potential benefits of big data while also protecting individuals' personal information. I look forward to hearing the witnesses' testimony and our discussion today, and I yield back the balance of my time. [The prepared statement of Mr. Lipinski follows:] Prepared Statement of Subcommittee on Research Ranking Minority Member Daniel Lipinski Thank you, Chairmen Bucshon and Massie for holding this hearing on examining the next generation of computing and big data analytics. I want to welcome and thank the witnesses for being here today. Today's hearing gives us an opportunity to talk about the new tools and analytics that are being developed for big data. Big data can be thought of as large volumes of complex and diverse types of data that are also high velocity--meaning they change rapidly with time. As a member of the Research Subcommittee for several years now, I have watched as the amount and complexity of data has grown by leaps and bounds. The field of astronomy is a great example. When the Sloan Digital Sky Survey started work in 2000, its telescope in New Mexico collected more data in a few weeks than had been collected in the history of astronomy. And that telescope will be surpassed when the Large Synoptic Survey Telescope goes online in about 2020. LSST will photograph the entire sky every few days. 
That's difficult for any of us to wrap our heads around. The types of data are changing as well. Data has gone from being mostly numbers entered in Excel spreadsheets to data coming from sensors, cellphone cameras, and millions of email messages. In fact, it is estimated that over 85 percent of data generated today are these kinds of unstructured data--data like videos or emails. The change in the volume and variety of data as well as how fast data is being produced and changed creates almost limitless opportunities. For example, since cybersecurity data is massive, varied, and changing quickly, big data technologies have the potential to detect and prevent cyber attacks before they even happen. I know that organizations like IBM are developing technologies to do just that. Additionally, big data could be used to establish new business models, create transparency, improve decision-making, and reduce inefficiencies within businesses and government. But along with the opportunities, there are a number of challenges. We need new tools and software packages to manage, organize, and analyze all these different kinds of data. Additionally, we will need an analytic workforce to ensure the gains of big data. These challenges necessitate involvement from government, academia, and the private sector. That is why I am happy to see all those sectors represented today. The government has and will continue to play an instrumental role in this area. For instance, the Networking and Information Technology Research and Development--or NITRD--program created an interagency big data group that is coordinating federal efforts in technologies, research, competitions, and workforce development for big data. In some cases, agencies have teamed up to issue joint solicitations. For example, NSF and NIH have a joint big data grant program that awarded nearly $15 million of grants to eight teams of researchers last year.
These first awarded grants went to projects focused on designing new tools for big data and new data analytic approaches. We will hear more about these and other interagency activities from Dr. Jahanian in his testimony. We will also learn more about specific programs at NSF, one of the leading agencies in federal big data efforts on both the analytics side and the computational resources side. As I mentioned before, one of the areas being coordinated through NITRD is the workforce development needs for big data. Several agencies, including NSF, have education activities to support a new generation of big data researchers. As you will likely hear from all of the witnesses, we face a looming shortage of workers with the skills needed to analyze and manage large, complex, and high-velocity data sets. There is some overlap with the broader STEM skills we often speak of in this committee. But there are also some unique skills required to address the challenges of big data. We need to consider how to build those skills into STEM curricula, especially at the undergraduate and graduate levels. I look forward to hearing from our witnesses about the current educational efforts and what additional initiatives may be necessary. Finally, since big data involves different types of data that can be produced and transferred quickly, there are concerns over privacy. We need to ensure that we strike the right balance between exploring and implementing all of the potential benefits of big data while also protecting individuals' personal information. I look forward to hearing the witnesses' testimonies and to our discussion today. Chairman Bucshon. Thank you, Mr. Lipinski. The Chair now recognizes the Chairman of the Subcommittee on Technology, Mr. Massie, for five minutes for his opening statement. Mr. Massie. Thank you, Chairman. Good morning. Today we are examining an issue that we hear a lot about. ``Big data'' is a popular new term that can mean a lot of different things. 
The scientific community, though, generated and used big data before there was the term ``big data.'' In fact, in 1991 this Committee authored the High Performance Computing Act, which organized the Federal agency research, development, and training efforts in support of advanced computing. Individual researchers have always been faced with difficult decisions about their data: what to keep, what to toss, what to verify with additional experiments. And as our computing power has increased, so has the luxury of storing more data. Incorporating computer power to process more scientific data is transforming laboratories across the country. At the same time, the ability to analyze large amounts of data across multiple networked platforms is also transforming the private sector. Through big data applications, businesses have not only revealed previously hidden efficiency improvements in their internal operations, but, more importantly, also uncovered entirely new types of businesses built around data that was previously not accessible due to its size and complexity. Today's hearing will examine the hype around big data. Is the United States the most innovative Nation in big data? Is our regulatory system creating any burdens on businesses? Could public-private partnerships with the Federal agencies be improved to allow for more data innovations? I thank our witnesses for their participation today and I look forward to hearing their testimony. Thank you. I yield back.

[The prepared statement of Mr. Massie follows:]

Prepared Statement of Subcommittee on Technology Chairman Thomas Massie

Good morning. Today we are examining an issue that we hear a lot about. ``Big Data'' is a popular new term that can mean a lot of different things. The scientific community generated and used Big Data before there was Big Data.
In fact, in 1991 this Committee authored the High Performance Computing Act, which organized the federal agency research, development and training efforts in support of advanced computing. Individual researchers have always been faced with difficult decisions about their data: what to keep, what to toss, what to verify with additional experiments. As our computing power has increased, so has the luxury of storing more data. Today, managing this data allows for better-informed experiments, more exact metrics, and perhaps significantly longer doctoral theses. Incorporating computer power to process more scientific data is transforming laboratories across the country. At the same time, the ability to analyze large amounts of data across multiple networked platforms is also transforming the private sector. Through Big Data applications, businesses have not only revealed previously hidden efficiency improvements in their internal operations, but also uncovered entirely new types of business built around data that was previously not accessible due to its size and complexity. Today's hearing will examine the hype around Big Data. Is the United States the most innovative nation in Big Data? Is our regulatory system creating any burdens on businesses? Could public-private partnerships with the federal agencies be improved to allow for more data innovations? I thank our witnesses for their participation today and look forward to hearing their testimony.

Chairman Bucshon. Thank you, Mr. Massie. The Chair now recognizes Ms. Wilson for five minutes for her opening statement.

Ms. Wilson. First of all, I would like to thank both Chairman Bucshon and Chairman Massie for holding this joint hearing, and thank you to all our witnesses for being here today. Welcome. This morning's hearing provides us with the opportunity to discuss one of the newest buzzwords in Washington, and you know we have many buzzwords here. This one: big data. This buzzword is not an exaggeration.
A computer that used to take up the space of this entire room now fits in the palm of your hand. It is remarkable. Just as computers have gotten immensely smaller, they have also gotten immensely more powerful. Instead of talking about megabytes, we are now talking about petabytes and zettabytes-- quadrillions and sextillions of units of information. It boggles the mind. Collecting and storing this huge volume of data would have been impossible just a few years ago. I am looking forward to your testimony and learning more about the benefits of big data to society. As I understand it, big data has the potential to improve nearly all sectors of society. The National Cancer Institute is funding a prototype in biological big data that could lead to new advances in cancer treatment. Companies and agencies are using big data to run controlled experiments that improve decision-making. Scientists at Florida International University in my district are using big data to advance understanding of topics including cybersecurity, social networks and cloud computing. But there are challenges. In order to reap all the benefits of complex and broadly available data, we need new technologies and software. We also need a workforce, a workforce with the skills necessary to analyze data of such great volume and complexity. A recent study estimates that the United States is in need of 190,000 additional data scientists. In thinking about this hearing on big data, I couldn't help but think about the tragic events last week in Boston. The marathon bombings may be one of the most photographed attacks in history. The Massachusetts State Police asked the public to share the photos and videos taken on that awful day. Now all of this digital information has been and is being used by the Boston Police Department and the FBI in their investigation. It appears that this data has been instrumental in helping to identify the individuals who were involved. 
Examples like this one demonstrate how important it is that we develop and attain the tools and the skills people need to analyze tremendous amounts of complex data. Big data can not only lead to amazing scientific discoveries; it can also save lives. As we learn more about these opportunities and challenges today, I hope our witnesses will offer recommendations on how the Federal Government can help create the new tools, software and workforce needed to realize the full potential of big data. Chairman Bucshon, Chairman Massie, thank you again for holding this hearing, and I yield back the balance of my time.

[The prepared statement of Ms. Wilson follows:]

Prepared Statement of Subcommittee on Technology Ranking Minority Member Frederica S. Wilson

I'd like to thank both Chairman Bucshon and Chairman Massie for holding this joint hearing. And thank you to all of our witnesses for being here today. This morning's hearing provides us with the opportunity to discuss one of the newest buzzwords in Washington and around the world--``big data.'' This buzzword is not an exaggeration: A computer that used to take up the space of this entire room now fits in the palm of your hand. It is remarkable. Just as computers have gotten immensely smaller, they have also gotten immensely more powerful. Instead of talking about megabytes, we are now talking about petabytes and zettabytes--quadrillions and sextillions of units of information. It boggles the mind. Collecting and storing this huge volume of data would have been impossible just a few years ago. I'm looking forward to the testimony of today's witnesses and learning more about the benefits of ``big data'' to society. As I understand it, big data has the potential to improve nearly all sectors of society. The National Cancer Institute is funding a prototype in biological ``big data'' that could lead to new advances in cancer treatment.
Companies and agencies are using ``big data'' to run controlled experiments that improve decision-making. Scientists at Florida International University--in my district--are using ``big data'' to advance understanding of topics including cybersecurity, social networks, and cloud computing. But there are challenges. In order to reap all the benefits of complex and broadly available data, we need new technologies and software. We also need a workforce with the skills necessary to analyze data of such great volume and complexity. A recent study estimates that the United States is in need of 190,000 additional data scientists. In thinking about this hearing on ``big data,'' I couldn't help but think about the tragic events last week in Boston. The marathon bombings may be one of the most photographed attacks in history. The Massachusetts State Police asked the public to share the photos and videos taken on that awful day. Now, all of this digital information has been and is being used by the Boston Police Department and the FBI in their investigation. It appears that this data has been instrumental in helping to identify the individuals who were involved. Examples like this one demonstrate how important it is that we develop and attain the tools and the skilled people needed to analyze tremendous amounts of complex data. Big data can not only lead to amazing scientific discoveries--it can also save lives. As we learn more about these opportunities and challenges today, I hope our witnesses will offer recommendations on how the federal government can help create the new tools, software, and workforce needed to realize the full potential of ``big data.''

Chairman Bucshon. Thank you, Ms. Wilson. If there are Members who wish to submit additional opening statements, your statements will be added to the record at this point. It is now time to introduce our panel of witnesses. Our first witness is Dr.
David McQueeney, the Vice President of Technical Strategy and Worldwide Operations at IBM Research. In this capacity, he is responsible for setting the direction of IBM's overall research strategy across 12 worldwide labs and leading the global operations and information systems teams. Dr. McQueeney's background covers a wide range of disciplines, spending about half of his career as a researcher and research executive and half in IBM's customer-focused areas. He holds an M.S. and Ph.D. in solid-state physics from Cornell University and an A.B. in physics from Dartmouth College. Welcome. Our second witness is Dr. Michael Rappa, the Executive Director of the Institute for Advanced Analytics and Faculty Member of the Department of Computer Science at North Carolina State University. Dr. Rappa has 25 years of experience as a professor working across academic disciplines at the intersection of management and computing. He began his teaching career at the University of Minnesota where he earned his doctorate degree. Welcome. And our final witness is Dr. Farnam Jahanian, the Assistant Director for the Computer and Information Science and Engineering Directorate at the National Science Foundation and a frequent visitor to our Subcommittee. He oversees the CISE's mission to uphold the Nation's leadership in computer and information science and engineering. He also serves as Co-chair of the Networking and Information Technology Research and Development, or NITRD, Subcommittee of the National Science and Technology Council Committee on Technology, providing overall coordination for the activities of 14 government agencies. Dr. Jahanian holds a master's degree and a Ph.D. in computer science from the University of Texas at Austin. Welcome again. As our witnesses should know, spoken testimony is limited to five minutes each after which Members of the Committee have five minutes each to ask questions. Your written testimony will be included in the record of the hearing. 
I now recognize our first witness, Dr. McQueeney, for five minutes for his testimony. TESTIMONY OF DR. DAVID MCQUEENEY, VICE PRESIDENT, TECHNICAL STRATEGY AND WORLDWIDE OPERATIONS, IBM RESEARCH Dr. McQueeney. Good morning, Mr. Chairman, Ranking Members, Members of the Subcommittees. Thank you for the opportunity to testify today. My written testimony covers next-generation computing, big data and analytics, workforce development and the role of government. In my five minutes, I will focus on areas where I can offer critical insights from my personal experience. Computing today is undergoing profound change. We are moving from computing based on processors that are programmed to follow a predesigned sequence of instructions to cognitive computing systems based on massive amounts of data evolving into systems that can learn. This new approach will require new strategies in hardware and in software and improved skills to maintain U.S. leadership. Cognitive systems will digest and exploit massive data volumes. Tools such as mobile phones, videos and social networks generate as much data in two days in 2013 as in all of human history prior to 2003. Advanced analytics can be thought of as tools for infusing all this data to make decisions on facts rather than intuition. The challenge is to transform latent data into actionable information to decide what to do next. For example, the Memphis Police Department is using data analytics to map crime hotspots and find patterns. As a result, they have been able to reduce crime by 30 percent with no increase in overall police manpower. To run advanced analytics, it is essential to have the most powerful computing systems. However, current supercomputing systems are reaching performance levels that will stagnate without significant innovation. We must move to the next generation of large-scale computing called exascale computing, a thousand times faster than today's petascale machines. 
The United States needs to invest now in the research and development for exascale systems to maintain strategic and economic leadership. Government-funded research on domain skills, especially at our national laboratories, should target systems for modeling, simulation, and analytics on big data. Before 2005, the United States had a clear lead in the global supercomputing race. Today, we are still ahead but the rest of the world is catching up rapidly. To stay ahead will require new skills and knowledge and new types of decision-making. Nearly two million IT jobs will be created by 2015 in the United States to support big data, and the job candidates with analytic skills will get these jobs. Industry is developing many collaborative skills programs, as enumerated in my testimony. I highlight our announcement today with Rensselaer Polytechnic Institute to offer a graduate degree program in the fall of 2013, the Master of Science in Business Analytics. Privacy must be considered in the design of big data systems. Big data does not require the sacrifice of personal privacy. When personal information is used, design-in processes such as IBM's Privacy By Design can protect privacy. When people understand how information is used, have the ability to set data usage policies, and enjoy the benefits of the analysis, they tend not to have privacy concerns. The government's role should focus on research and skills. First, Federal research investment in high-performance computing is critical to big data. Industry needs university-based exploratory research into numerous areas including system design, flexible software-defined environments, and IT infrastructure. Second, IBM strongly supports the reauthorization of the Department of Energy High End Computing Revitalization Act of 2004 to be offered by Representative Hultgren. This bill will improve high-end computing R&D at the DOE and strengthen government-industry partnerships for exascale platforms.
IBM has a long history of successful partnerships with DOE. This partnership established computational simulation as an essential tool for scientific inquiry and led to world leadership in the United States in high-performance computing. The challenge ahead is to continue this growth. Past Federal investments in HPC-related research, particularly at DOE's national labs, have underpinned mission-critical supercomputers at DOD, NASA, NOAA, and in the intelligence agencies. Third, the professional science masters program supported by NSF is particularly relevant as it provides advanced training in science or mathematics and develops workplace skills valued by employers. Finally, Congress should reauthorize the Carl D. Perkins Act and the Federal work-study program and restructure them to align workforce needs and big data. In conclusion, there exists today a tremendous abundance of data about our world. New cognitive computing capabilities will help determine which countries and businesses will thrive. The United States should support advanced computing and build its workforce to seize the future. Thank you, and I welcome your questions. [The prepared statement of Dr. McQueeney follows:] [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT] Chairman Bucshon. Thank you, Dr. McQueeney. I now recognize Dr. Rappa for five minutes for his testimony. TESTIMONY OF DR. MICHAEL RAPPA, DIRECTOR, INSTITUTE FOR ADVANCED ANALYTICS, DISTINGUISHED UNIVERSITY PROFESSOR, NORTH CAROLINA STATE UNIVERSITY Dr. Rappa. Good morning, Chairman Bucshon, Chairman Massie, Ranking Member Lipinski, Ranking Member Wilson and other Members of the Subcommittee. I appreciate the opportunity to be here this morning to speak with you about data analytics and the role institutions of higher learning can play in advancing the field. I am going to draw this morning's testimony on my own behalf as a professor and director of a research institute, educational institute for over the past 25 years.
I think it is important to start with the fact that the world is changing around data very rapidly and our ability to productively use it becomes a very central part of what we do as a society today, as has been heard already. A generation ago, data was scarce, expensive, time consuming to collect and difficult to analyze. Today, data is everywhere. Advances in computer technology and powerful analytic tools make it possible not only to collect vast quantities of data but also analyze and draw insights from data to solve pressing problems from increasing operational efficiency to combating fraud, to better health care, to protecting national security. Data is everywhere. The question is, how well are we prepared to use it? We have the data, the technology, the methods and tools, all of which continue to advance. The national challenge, in my view, going forward will be our ability to educate a data-savvy workforce that has the analytical skills to put data into action. Estimates of the talent gap as we have heard are large and growing. This is a dire but solvable problem. As we have shown at NC State, working closely with employers and focusing on their needs, we can produce the kind of talent that is so desperately needed today. We do it quickly in just 10 months with a domestic student population ranging from their early 20s to their late 50s, many of whom are returning to school. We have done this now for six years economically with consistently high student outcomes using a sustainable and scalable business model based on self-financed tuition. What it comes down to is creative innovation, how we organize graduate education, allowing us to engage with employers more productively to yield high-quality results in the skills and readiness of our graduates. I encourage the Committee to focus its attention on workforce needs, to encourage the government to seek out innovation in higher education and to promote new and novel learning models. 
This is a solvable problem. With the proper incentives, focused resources, open collaboration with industry, we can produce the analytics professionals needed to extract value from big data and to move the economy forward. As I said, we have done this ourselves now for 6 straight years to great effect. We will graduate a class in a matter of another week, 80 students in the Master of Sciences and Analytics Program, with already 95 percent of them placed in jobs. They are literally the most sought after and highest-paid graduates of the university. So we can do this. It is a solvable problem. Thank you again for your time. I will be glad to answer any questions. [The prepared statement of Dr. Rappa follows:] [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT] Chairman Bucshon. Thank you for your testimony. I now recognize our final witness, Dr. Jahanian, for five minutes for his testimony. TESTIMONY OF DR. FARNAM JAHANIAN, ASSISTANT DIRECTOR FOR THE COMPUTER AND INFORMATION SCIENCE AND ENGINEERING (CISE) DIRECTORATE, NATIONAL SCIENCE FOUNDATION Dr. Jahanian. Good morning, Chairman Massie, Chairman Bucshon, Ranking Members Wilson and Lipinski, and Members of the Subcommittee. It is my pleasure to be back here to discuss the next generation of computing and big data analytics. Today we live in an era of data and information enabled by advanced technologies that surround us. Data is generated by modern experimental methods, scientific instruments such as telescopes and particle accelerators, large-scale simulators, Internet transactions, email, video images, clickstreams, and widespread deployment of sensors everywhere. Approximately 90 percent of the data in the world today were created in the last two years alone. However, when we talk about big data, it is important to emphasize not only the enormous volume of data being generated but also the velocity, heterogeneity and complexity of data that now confronts us. Why is big data important? 
Several others have alluded to this already. Data represents a transformative new currency. Big data is increasingly important to all facets of our Nation's discovery and innovation ecosystem. First, insights and more accurate predictions from large and complex collections of data are creating opportunities in new markets, driving the creation of IT products and services and boosting the productivity of businesses. Second, advances in our ability to store, integrate, and extract meaning and information from data are accelerating the pace of discovery in almost every science and engineering discipline. Third, big data has the potential to solve many of the Nation's most pressing challenges from health care and education to cybersecurity and public safety, yielding enormous societal benefits and ensuring sustained U.S. competitiveness. Let me share with you just a few examples of the promise of big data. These are all grounded in research that is funded by the Federal Government or by the private sector, the work that is done in the private sector. By integrating biomedical, clinical and scientific data, we can predict the onset of diseases and identify unwanted drug interactions. By coupling roadway sensors, traffic cameras, individual GPS devices, we can reduce traffic congestion and generate significant savings in time and fuel. By accurately predicting natural disasters such as hurricanes and tornadoes, we can employ lifesaving and preventative measures that mitigate their potential impact. By correlating disparate data streams through text mining, image analysis and face recognition, we can enhance public safety and public security. By integrating emerging technologies such as MOOCs and inverted classrooms with knowledge from research about how people learn, we can transform formal and informal education. What does this mean for scientific discovery? 
Data-driven discovery, also called the fourth paradigm, is revolutionizing scientific exploration and engineering innovations. It enables extraction of new knowledge, provides novel approaches to driving discovery and decision-making, yields increasingly accurate predictions and provides deeper understanding of causal relationships based on advanced data analysis. What is government doing to ensure we harness this potential? As was mentioned already, in 2011 the U.S. Networking and Information Technology Research and Development Program, also called NITRD, formed a big data senior steering group to identify, initiate and coordinate big data research and development activities across the government to ensure that Federal agencies, the scientific research enterprise, and the public maximally benefit from data-driven discovery. In March 2012, the National Big Data R&D Initiative was launched, sharpening the steering group's focus on the tools, technologies and human capital needed to move from data to knowledge to action. We see exciting new partnership opportunities with the private sector, state and local governments, academia and nonprofits. At NSF, we have identified four major investment areas that address current challenges and promise to serve as the foundation of a comprehensive long-term agenda: first, investment in foundational research to advance big data techniques and technologies; second, support for building new interdisciplinary research communities; third, investment in education and workforce development; and finally, development and deployment of cyber infrastructure to capture, manage, analyze, and share digital data. I should add that NSF's investment in cyber infrastructure includes advanced computational resources that support data-enabled science. In particular, the newly dedicated Blue Waters, Stampede and Yellowstone supercomputers will expand our Nation's computational capabilities significantly.
In summary, big data represents enormous opportunities for our Nation. Investments in big data research and education will advance the frontier of knowledge, further fostering innovation, creating new economic opportunities, and yielding new approaches to addressing national priorities. Thank you again for this opportunity. I would be happy to answer any questions. [The prepared statement of Dr. Jahanian follows:] [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT] Chairman Bucshon. Thank you for your testimony. I would like to thank all the witnesses for their testimony. I am reminding Members that Committee rules limit questioning to five minutes, and the Chair at this point will recognize himself for five minutes to start the questions. First, Dr. Jahanian, the Administration announced their Big Data Research and Development Initiative in March 2012 including $200 million in new commitments for big data research initiatives. However, the National Science Foundation, Department of Defense, Department of Energy, and other agencies have had significant research programs and data analytics that predated the initiative. How has the Administration's initiative changed the ways these agency research programs are coordinated and are we effectively managing and leveraging our research investments across agencies? Dr. Jahanian. Thank you for your question. You are absolutely right that it is not that suddenly last March we woke up and said boy, data is really important, we need to do something about it. There has been significant investment by the Federal sector and private sector in areas having to do with data. The challenges we face are many--stewardship of digital data and software, for example. Many data sets, as was mentioned, are too poorly organized or also unstructured. Many data sets are heterogeneous. The utility of data is also limited by our ability to interpret them. Many data are being collected at a scale that we can't even store them, let alone analyze them. 
Also, large and linked data sets may be exploited to identify individuals and so there are also the privacy issues. So there are enormous challenges that we face. As you alluded to, on March 29, 2012, OSTP in concert with a number of Federal agencies launched the national Big Data Research Initiative. It expands the scope of our activities in several directions, for example, state-of-the-art core technologies that we need to collect, store, preserve, manage and analyze data, harnessing these technologies to accelerate pace of discovery, supporting responsible stewardship, for example, and sustainable business models for big data. There are a number of cross-coordination efforts taking place under NITRD. Let me start with NSF. All NSF directorates, for example, are participating in this. A multidisciplinary panel of experts are making a recommendation on funding of this. Furthermore, big data is being coordinated through a senior steering group that reports to the assistant directors at NSF for all the coordination because it involves every science and engineering discipline. As far as the Federal Government is concerned, the Big Data R&D Initiative is coordinated through the NITRD Subcommittee. As you know, I Chair the Subcommittee. There is a senior steering group that regularly meets to coordinate the activities on many of the fronts that I alluded to. There are also enormous opportunities not only in terms of joint solicitations but there are a number of workshops that we are holding jointly with other agencies including NIH, NIST, DOE, DOD to advance the frontiers of knowledge and exploration in big data. I should also mention that when it comes to this initiative, we can't forget that the private sector plays a significant role. 
When we think about innovation and discovery ecosystems, not only are we talking about universities, we are talking about scientists and engineers, you know, a rich, talented labor force, investments in research and education, and of course, a vibrant private sector. So there are a number of programs that we have at NSF that attempt to connect the dots when it comes to transfer of knowledge. Chairman Bucshon. Thank you. I am glad to hear there is quite a bit of coordination at the Federal level because I think all of us are concerned about that, and again, investing the taxpayer dollar wisely. Dr. Rappa, I also serve on the Education and Workforce Committee, and I have got children age 9 through 20, four of them, and I have a really strong interest in how we get young people interested in different fields of study, and obviously we have a tremendous challenge not only with this area but many others, and do you think that--what are your ideas on how we engage young people in understanding what opportunities there are in this area and what the jobs of the future might hold? I mean, how do we do that? Because, you know, when you go to a high-school class, and I talk to a lot of high-school class, people say, you know, not many people come up when you ask them what they want to be, you know, they want to analyze big data. So how do you do that? What is your recommendation? Dr. Rappa. Well, thank you very much for your question, and I understand exactly what you are saying, and I think that things are changing. You know, I think it is exactly true that your average 8-year-old doesn't say they want to grow up, for example, to be a statistician. It is not common, unless they are really interested in sports. Then you see a sort of nexus there because of the relationship. 
But I think what is changing is that it is really about producing education, in my case, at the graduate level, reaching further into the pipeline down into undergraduate education and even touching upon high school where people begin--where students begin to understand how data is really used in action. So it is really about creating, not just sort of creating knowledge or understanding but also applying that knowledge. And when our students--our whole education is driven around the application of that knowledge, and so students really understand, and increasingly undergraduates understand that this kind of graduate education is going to lead them to a very interesting, compelling professional life. Chairman Bucshon. Well, thank you, because I think that we do--you know, we do need to have this type of information gravitate down, even to middle-school kids to get them interested, and there is a program in Indianapolis called Project Lead the Way who I know very well that is beginning to do that at the high-school level, and it is showing some success. But my time is expired, so I would love to talk more about that but at this point I am going to yield to Ms. Wilson for five minutes for her questions. Ms. Wilson. Thank you, Mr. Chair. Along those lines, can you tell me either one of you what skills are necessary for the big data workforce? I heard you say something about an analytical something. And also as you are speaking, I would like to hear from you what role can community colleges play in preparing the next-generation workforce for big data. Dr. Rappa. Thank you very much for your question. I would like to try my hand at that. 
So what is sort of interesting and novel about what we have done around the education, we really started from scratch in building an entire new graduate degree program, and we really wanted to address this question of what skills were needed, and we focused ourselves really looking at the employer as the customer in a sense, the person, the individuals who buy our product and the students and really tried to understand the skills that they need, and really where that brings you is that there is these technical skills which are important in programming, in math and statistics, but employers really want much more than that. They want individuals who can work well in teams, who can communicate these insights to decision makers, who can actually use the tools and apply the knowledge in an organizational context, and so we have structured the whole education to build a very balanced set of skills as opposed to what I think is really the conventional approach in graduate education and to some extent undergraduate education to focus on the technical skills almost exclusively. And so really what we need to do is sort of approach the whole student. Now, I think community colleges can play a very important role because you can really begin to channel pipelines where students can go and get the prerequisite knowledge that they need, the early levels of math and statistics, before they go on to graduate education. Thank you. Dr. McQueeney. I would just like to comment that a lot of the focus in the past has been on the graduate level of education, as Dr. 
Rappa just talked about, and while we continue to have a strong need for Ph.D.'s in computer science, electrical engineering, and mathematics, the biggest skill gap that we see is at the masters level, quite frankly, of people who may not have the mathematical skills to create an entire new type of analysis of data but who have more than basic IT skills, who actually can understand the implications of using different analytical techniques given a problem, given a data set with certain statistical properties, what would be the appropriate analytical technique to use, and when you apply that technique, how could you be sure that the results would be reliable and proper, and so a lot of our focus has been on creating an intermediate level of skill that has the basic understanding of how to use these tools even if it would fall on someone with more of a Ph.D. level of training to create new analytical approaches. Dr. Jahanian. Representative Wilson, I want to echo something that has been said. If you think about big data, let us just step back. There are three related problems that go beyond big data. It includes all of our IT workforce, computer science, computational science and so on. These problems have to do with underproduction, which everybody recognizes, underrepresentation and then pipeline issues. Chairman Bucshon already alluded to this, that we need to worry about our high schools, we need to worry about the pipeline. I have three kids, and I know where we lose our kids, it is not in masters or Ph.D., we lose the interest of our kids in high schools and middle schools, so that has to be fixed, and there are a number of programs that we have initiated, pilot programs that try to address that issue. Let me share with you one anecdotal sort of evidence that provides data on this. Annualized Bureau of Labor Statistics data predicts that each year we need about 140,000 job openings.
We will have 140,000 job openings in computing and broadly speaking IT-related jobs but we are only producing about 100,000 qualified individuals including masters, Ph.D., undergraduate and community colleges. In fact, many of these jobs would be available to individuals who have two year or four year degrees. Another data point that I want to share with you is that 62 percent of all newly created STEM job openings between 2010 and 2020 will be in computing and IT. Let us not forget that. And that includes data, that includes computational skills and many of the other skills that the other witnesses alluded to. Thank you. Ms. Wilson. Just in my 16--oh, 10, 9, 8--what would you suggest that we begin to--how do we begin to get children interested in these sort of skills? I know every little child has an iPad. They can work these computers better than adults. What do you think we can do to stimulate that all the way from K-12 and into the community colleges so we will have more IT graduates? Do you suggest we buy each one--we outfit classrooms with iPads, or what do you think? Dr. McQueeney. I think there is an intrinsic curiosity in younger folks about a lot of the tools they use to communicate with each other that have tremendously greater scalability than the tools that I use to communicate with my friends. Ms. Wilson. Right. Dr. McQueeney. So the essence of what is a large community's opinion on a topic of interest could involve the opinions of thousands or millions of people and so I think a lot of the young folks I talk to when I visit K-12 programs or, you know, in programs like eWeek, they have an intrinsic sense not only of the device and the technology but they have a sense of the reach of that device and technology which is the beginning of an appreciation of really what we are talking about with big data, that there are trends that they can reach with that device, and I think that fires their imagination in a very powerful way. Chairman Bucshon. Thank you. 
I will now recognize Mr. Massie, Chairman Massie, for his questioning. Mr. Massie. Thank you, Chairman. So one of the questions that I have as we deal with the interface between government and private industry here is, are you aware of any government data sets that we need to get more into the public domain for usage? For instance, I think we have done a pretty good job about getting some of the mapping stuff out there but some of that map information is old, goes back to the 1940s and 1950s, and I know the government has been paying for LIDAR mapping, which is a high-resolution terrain mapping, and I am kind of concerned that that is not getting out there. Are you aware of that, and are there any other data sets that would be useful to the public that the public has paid for that we might want to work on getting out to the public? Dr. McQueeney. I think the government has done an excellent job and had many initiatives that were very focused on getting that valuable data out so it could be used. You mentioned LIDAR. I know that one of the uses that is very promising for LIDAR is to do something like an inventory of the forests in the country, to actually be able to conduct a definitive inventory. Right now, the agencies that are responsible for that use a statistical sampling technique but in a world where you can take LIDAR images and process that enormous data volume, you are able to move then from a statistical sampling basis, which is all we could do before, to a more definitive approach to get a very, very good picture of one of the more valuable natural resources that needs tremendous amounts of stewardship. So I think that is an example of a data set that could be extremely valuable. But I think in general, the government is very well and properly focused on getting those valuable data sources out. Weather would be another--basic weather data would be another good example that can be built on to add extra value. Mr. Massie. 
Are the other witnesses aware of any data sets that we need to promote more? Dr. Jahanian. I want to highlight a couple of things. I am sure you are aware of data.gov, which is a Web site that makes a lot of government data sets available, and the goal here is to increase public access to high-value machine-readable data sets that are generated by the government. Hopefully it will create new economic value. There are also a number of activities in encouraging the private sector, entrepreneurs to develop applications on top of that data. It is not just making the data available but also making the data valuable so there are a number of essential activities related to that. There was a recent Wall Street Journal article actually that highlighted at least a dozen different kinds of government data sets that have been made available from labor and health violations to flu incidents, energy prices, offshore activities, solar information, and so on and so on that are interesting. From the National Science Foundation's point of view, I should mention that as you may know, we have a number of large facilities--LSST was mentioned, NEON, which is another facility that collects a lot of data, will be collecting a lot of data. The science and engineering community needs that data, and many Federal agencies are working very hard to make that data available. There are a number of issues having to do with open access that go beyond the scope of this question. Mr. Massie. Let me ask a follow-up question to that. So big data like any other data could be misused, altered, hacked, illegally accessed, and sometimes it may just be an honest mistake. We share data that we probably shouldn't have, for instance, where some farm data got out there and it could really compromise our food safety if people know where all the food sources are.
How do we balance the desire for privacy, actually the constitutional right to privacy, with sharing all of this data now that everybody is under a microscope? Dr. Rappa. I thank you for your question, and I would like to sort of just turn it a little bit because we do work--each year we work with about 16, 17 organizations that share data under a confidentiality agreement including three government agencies in which we put teams of students working on very complex analytics projects, and so while I applaud, and I think it is very important and I do think the government is doing a good job at sharing data openly, it is a very important thing to do, I think there is also an opportunity to engage the academic community in other ways to help understand that data, which might mitigate some of these issues around the privacy element. Mr. Massie. Dr. McQueeney? Dr. McQueeney. Yes, that is an excellent question. Thank you for that. One of the things that we can do is to get data about the data. We call it metadata. So we analyze the data and we don't just look at what information we can get from the data but we describe the data perhaps in terms of its sensitivity--is this more or less sensitive from a point of view of privacy or security or secrecy--and we can then tag those data sets with metadata that describes the implications of using that data and then we can build into the systems that handle the data policies that look not only at the data but the metadata that describes what are the contents and what are the implications of sharing and combining that data and so we can actually build into the foundation of big data systems the ability to interpret policies that we have set in a very conscious and clear-eyed way and as they process the data they can be respectful of that metadata. The medical community has actually done a lot of very good work around patient confidentiality while still getting very good pattern analysis of different kinds of outcomes. Mr. Massie. 
Thank you very much. My time expired. I appreciate your answer and concern for that question, Mr. Chairman. Mr. Bucshon. Thank you, Mr. Massie. I now recognize Dr. Bera for five minutes for his questions. Mr. Bera. Thank you, Mr. Chairman, and thank you for the series of hearings that we have had on the Subcommittee. It has been great. You know, big data is incredibly important and how we manage data and with the rapidity of how the world is changing. I mean, when I think back to being a high-school student, and for me it was going and looking at the index cards, walking down and looking in the encyclopedia. Now, when my daughter, you know, she has vast access, or when I do rounds in the hospital, we would have to race down to the library to get information but now before we are even finished presenting, the medical students or the residents can just look at the latest data on, you know, a device like this and get access to the most accurate and timely information. So it is incredibly important that we make these investments to not only manage the data, to sort that data and then to make sure it is accessible. It is a critical priority that we have that workforce both at the professional level but then also at the management level and I think the number that I read was we need about 1.5 million managers. So there is a huge need but also a huge opportunity. When I think back to the talent that has been impacted in the last four years in the recession, you know, there are a large number of extremely intelligent and talented individuals in their 30s and 40s who have been hit hard. These are folks like myself that were trained for a 20th-century workforce but now we find ourselves in a 21st-century economy. Dr. Rappa, are there some best practices--and these aren't individuals that need to get a graduate degree, you know, they are talented individuals--where we could take them and quickly train them for this new economy? Are there examples? Dr. Rappa. Right. 
So we do offer it as a graduate degree but we do this in 10 months, and indeed, a good, fairly substantial, larger portion of our population are people who are returning from--or coming from the workforce to go through this and some of them are in exactly the position that you say. They were transitioning, their companies were faltering. And so the key really with this is short duration. Ten months is actually a very reasonably good time because you could build the skills that you need. If it is too short, you can't accumulate the skills but the key thing is that you have really demonstrated ROI on that education because that person who is coming in to do that has to know that they have a very high probability of getting a job when they leave and at a particular salary rate so that they can justify the investment and time, and that is really what we have done. Mr. Bera. Dr. McQueeney, are there potentially any examples--you know, again, a lot of these folks are also paying their mortgage, they have to continue to foot their bills--of possibly even doing an advanced work-study type of program where you recruit this talent and they are getting on-the-job training as opposed to a traditional school model? Dr. McQueeney. Yes. In fact, there is a related topic here that I think is quite interesting, which is the application of big data and analytics back on to the educational process itself. You have seen the great upsurge in videos that attempt to replace traditional brick-and-mortar classroom attendance, coursework. You have seen a number of startup companies formed in this space. If you look at the education process, each of us really learns quite differently. Some of us may learn more from hearing or from seeing or from working problems, and great teachers, great professors are sensitive to how their different students learn and are capable of presenting material in alternate ways to make sure they reach all the students. 
With electronic delivery of course materials and monitoring of student progress, we generate digital exhaust, if you will, that describes how that student is learning, how that student responds to the instruction, and for the parts of the instruction that are delivered electronically, we actually have the ability to do analytics and to do an optimization process so that each of us on the panel might not get the same length of lecture on five different topics. It might be adjusted to our historical learning patterns. So we have worked with a number of universities and other, you know, non-traditional educational institutions to apply the big data and analytics techniques to the education and training process itself. Mr. Bera. Great. In my last 30 seconds, so we have access to data. I think one element that we should also be conscious of is the quality of the data because there certainly is very good-quality data but at the same time there is very poor-quality data that is out there and, you know, any of you who want to comment on how we monitor quality? Dr. Rappa. I think most data starts off as bad data, for the most part, unless it is being collected in a highly careful way. And so it is, you know--I think just as we hear about big data today, we are going to hear about bad data in the future. Most projects start out where you have enormous front end to them to really understanding, cleaning and cultivating that data to make it useful, and that is an important part of the educational process. Dr. Jahanian. I would just add that there are a number of techniques that have been developed and are in development dealing with data exploration, data cleaning and so on. Furthermore, when we talk about large-scale data sets, there are statistical techniques that are being applied that really take care of the noise, take care of some of these inconsistencies, and that is one of the attractions of big data. Mr. Bera. Great. Thank you. Chairman Massie. 
[Presiding] Thank you, Mr. Bera. I now recognize Mr. Schweikert from Arizona for five minutes. Mr. Schweikert. Thank you, Mr. Chairman. This is one of those types of conversations, you know, we could all sit around and buy you some well-caffeinated coffee and talk for hours and still have no idea if we made any progress. Doctor, is it McQueeney? Dr. McQueeney. Yes. Mr. Schweikert. First, you are with IBM? Dr. McQueeney. Yes. Mr. Schweikert. In your testimony, help me do a little ferreting out here. Hardware technology or IT talent, what is your biggest bottleneck right now? Dr. McQueeney. There are bottlenecks in a number of areas. If I looked at the hardware itself, the biggest challenge getting from the petascale to the exascale is actually the power dissipation of the systems. The new technology work that we are doing is to get the computations more efficient in terms of floating point operations per watt so that if you assembled a system thousand times bigger than today's supercomputers you could house it and cool it. Mr. Schweikert. You don't want to take down the power grid? Dr. McQueeney. The power grid may not in fact be able to supply enough power if we didn't make some innovations. That is a good point. Mr. Schweikert. But hasn't your company actually been one of the leaders at producing some of those breakthroughs? Dr. McQueeney. In fact, we have, and in fact, a lot of that history goes back to work that started with the Department of Energy many years ago, and this bears on an interesting historical point. 
In a time when we are concerned about making investments efficiently, if I go back to the beginning of the ASCI program with the Department of Energy to do the nuclear weapons stockpile stewardship program, the Department of Energy scientists did a very careful analysis of what were the core algorithms, the core analytics, if you will, in today's language, that needed to be done at a certain level to provide the mission that they needed to provide, and they found that the current path at that time of supercomputing was going to take five years to produce a machine that they needed in one or two years. The analysis they did was thorough enough to reveal that there weren't bottlenecks everywhere but at that time there were bottlenecks mostly in the interprocessor communication. So they made a very thoughtful, very surgical investment in accelerating just the piece that was needed to close their mission gap, which was the beginning of a very long run of government-industry collaboration. Mr. Schweikert. But you are in some ways heading towards where my question is. So if that bottleneck, in today's world, do I find the technology if I went out to the private sector around the world that is competing and producing high-end supercomputing or is it coming out of a government lab? And I know the pop culture terminology is ``public-private partnership'' but the reality, they do operate in pretty substantially different silos. Dr. McQueeney. The real forcing function for a breakthrough is a critical mission need. So in the case of high-performance computing, it has often been a government agency with a critical mission that---- Mr. Schweikert. But they were doing a specific request for how they wanted to manage their data? Dr. McQueeney. That is correct, and once that technology is available, it can be consumed very rapidly in lots of other applications that could take great advantage of it but didn't have a compelling enough need to get over that hurdle. 
That is when the dispersal of technology starts. Mr. Schweikert. Just as an aside, only because I had some acquaintances who were--I used to be an old SQL programmer so I am way out of date now. IBM was actually running a fascinating large data project where they were doing sweeping data sets through the world's social media and gathering it and looking for trends. Can you in 30 seconds or so tell me your knowledge on that? Dr. McQueeney. Yeah, we have analyzed the public social media sources with several of our customers and we can gain a lot of insights. For example, you know, retailers can gain insights about trends and their clients. Transportation agencies can gain insights about likely traffic congestion. There are many sources of public data, both social media and other forms that can be analyzed to reveal patterns about how people conduct their daily activities that are very useful for optimizing the public infrastructure. Mr. Schweikert. Forgive me, I am blind as a bat without these. Is it Dr. Rappa? Dr. Rappa. Yes. Mr. Schweikert. Isn't my single biggest problem in big data right now is noise that when I put data set after data set after data set and build on it, that just small incremental errors actually create really bad decisions on the end? Dr. Rappa. Well, I think part of the education around handling big data deals very squarely with the quality of the data and how to clean it and cultivate it to reduce the noise, to---- Mr. Schweikert. But you and I can go over a long series of public policies, both state, national, you know, military, others, where we built it on really gigantic analyzed data sets and it was wrong. Dr. Rappa. Well, I think that, you know, the challenge here is education. So as I alluded to earlier, we have teams of students---- Mr. Schweikert. Is it education or developing educational skepticism? Dr. Rappa. It is developing the education around how to squarely understand the inherent challenges in data. Data is not born clean. 
It isn't born ready to be analyzed. Mr. Schweikert. And when you and I build our model, the way we weight, you know, because we start plugging in human factors that, you know, you and I bring our biases and we---- Dr. Rappa. And this is why we really need a focused education squarely around how do you draw insights from data because there are these inherent problems in data, especially as you scale them up, as you combine different data sets, as you combine different types of data. Mr. Schweikert. Thank you, Doctor, and Mr. Chairman, thank you for tolerating. It is just one of my great fears. And look, I am a data freak. I mean, you have got to see the servers and stuff I have at home. But I have learned when we make big-time public policy on something we all know is right, we keep making huge, very costly mistakes. Chairman Massie. Thank you, Mr. Schweikert. I now recognize Mr. Hultgren from Illinois for five minutes. Mr. Hultgren. Thank you, Mr. Chairman. Thank you all for being here. First of all, I just want to thank Dr. McQueeney too. I appreciate your mention and your support for the exascale computing bill I am currently authoring. I am very excited about the potential there and see some huge shift in our national computing capabilities and I am very excited about that, so I appreciate your mention and support of that. I do have a few questions, and first I guess I would address this one to Dr. McQueeney and also Dr. Jahanian. Is that right? I am sorry. I wonder if you could comment briefly on where the United States stands in your opinion in worldwide computing leadership? I know the metric of the fastest supercomputer is one metric but what do you use as a metric for big data to determine which countries are using it most effectively? Dr. McQueeney. The common thing that is cited in these discussions is the top 500 supercomputers list. That is something that is compiled twice a year, as you well know, and we have usually been at the top of that list. 
We have continued to be the majority of the systems on that list but other countries have noticed the success that we had in, you know, government leading the way on high-performance computing breakthroughs. Once those systems are built, they find hundreds and thousands of other applications, each with a client that might not have been able to fund that breakthrough themselves but can certainly utilize it. Other countries have popped up on the top of that list because they are interested in emulating the success we have had in leading the way with innovation and then seeing that innovation used broadly across the commercial sector. So the top 500 list is a very technical, perhaps very geeky measure of who is on top, and I would say that we are still in a leadership position there but it has been stronger in the past than it is today. If you turn to more of a business view, you would want to look at the companies that were taking the best advantage of data sources, either to drive value in their companies or to provide benefits such as public safety or health benefits, and there again I think we are in a good position but it is a very different kind of skill, a conversation we didn't quite finish before about the skill to build these large systems is a very focused, very large-scale, very capital-intensive activity but the skills to use these systems are more focused on creativity and are actually better done by large groups of small teams. In fact, you know, the NSF has been a leader in fostering that kind of innovation where thousands and thousands of groups can build innovative applications and take advantage of these systems. Mr. Hultgren. Thanks. Dr. Jahanian? Dr. Jahanian. Yes, just a couple of quick comments. There is no question that we continue to maintain our leadership worldwide in this area, and there is no doubt that continued investment in this area is extremely important to the future of the country. 
As I mentioned just a few minutes ago, NSF's investment in Blue Waters, Stampede, as well as the Yellowstone supercomputing centers represent a range of investments that we make in high-performance computing, addressing the needs of not only the top five percent of application that have exceptionally high computational needs but also a broad spectrum of researchers across the country in science and engineering who would need computational resources. A couple of comments. Just look at Blue Waters, for example, which is at University of Illinois. A couple of data points about it. It has--if you could--just the computing power of it, if you could multiply two numbers together every second, it would take 32 million years to do what Blue Waters does in one second. That is astonishing power, for example, of Blue Waters. In terms of storage capacity, memory capacity and so on, there is a similar kind of scale. The second point that I want to make is, we view computation and data to be two sides of the same coin. You really need to address both. So when we talk about computational capabilities, we also have to worry about cyber infrastructure to manage, to curate, to serve data to science and engineering community, and the investment in cyber infrastructure has to be balanced between the computation side of it as well as management and curation of data. Mr. Hultgren. Let me have--my time is running out but I have a follow-up question to the two of you as well if you could both comment in the time I have. It seems to me that exascale computing is focused on solving discrete problems that necessitate massive computing power and speed. Are these different problems than those we are addressing through big data analytical tools and how do these two terms, how are they different, how are they similar? Dr. McQueeney. 
Historically, we have tended to talk about them differently, but as we project how the exascale systems will be built and how they will be used and we look at the growing importance of big data analytic systems, we see that the platforms on which these systems will both depend will be much more common than separate, and in fact, we see that there is no conflict between investments in classically what we have called HPC and what we are now calling big data analytics, and both are changing actually. The way we use an exascale system will not be the same way that we use a petascale system. There isn't time here to go into it, but it actually morphs into a direction that is much more common with what we will do in big data and analytics. Dr. Jahanian. I would just add that many of the problems that the business community needs, the science and engineering community needs are being addressed today through different kind of computational architectures that doesn't necessarily require today to have exascale computing including weather modeling, a number of other applications that have been mentioned. So it is really important to consider the investment in exascale computing in the spectrum of investment that we make to support computational and data needs of the entire science and engineering community and of course the private sector. Mr. Hultgren. Thank you so much. Chairman, thank you. I yield back. Chairman Massie. I now recognize Mr. Lipinski from Illinois for five minutes. Mr. Lipinski. Thank you, Mr. Chairman. I am glad that Dr. Jahanian mentioned Blue Waters there. We were just there not that long ago, but since you covered that, I can move on to a different area. Dr. McQueeney, in your testimony you talk about how the Federal Government needs to invest in big data if the U.S. is going to maintain its leadership and competitive edge in this area. 
The needs and potential benefits of big data for the Federal Government align closely with those of private industry in a number of areas. If that is the case, how can the Federal Government more effectively partner with industry to achieve common goals and do you believe that industry has sufficient input in the Federal Government's research agenda as it relates to big data? Dr. McQueeney. I do think we have sufficient input. I think we have excellent dialogs with the relevant agencies and national laboratories, and I think the roles are complementary. I go back to the story about the early days of the ASCI program where through a collaboration we realized that the key piece of a supercomputing system that needed to be accelerated was not the entire investment. We could ride on the commercial investments for most of the components of the supercomputing systems at that time except for one, which was the high-bandwidth switching between processors. And so that kind of thoughtful connection between the leaders in commercial computing and the leaders on the government side has been able historically to identify which areas are critical to attain government mission imperatives and where we can leverage commercial technology and where we need to accelerate that in a surgical fashion. So it has, in our view, been a very good partnership based on very high-bandwidth technical communications, understanding of applications and knowing when the government should be leveraging commercial investments and when they need to accelerate parts of that investment to attain unique mission goals, and again, as I have said before, once those barriers are crossed in terms of either the scalability of the system or the internal bandwidth of the system, it opens up thousands of new applications where there were ready problems to be analyzed but those applications weren't large enough to drive that breakthrough. 
So that is how the effect works of the leadership coming from some of the government agencies and then being realized broadly across industry. That is the essence of where this leadership has come from so successfully over the years. Mr. Lipinski. I want to follow up with Dr. Rappa on that. Dr. Rappa, you discussed the importance of public-private partnerships to realizing the benefits of big data and stated specifically that we must intensify and accelerate the national investment in proven models. What characteristics make a public-private partnership successful and what models should we be investing in? What were you referring to there? Dr. Rappa. Well, I think first of all, we have been doing this now for six years and so I think we do have a fairly interesting, novel model for producing talent in this field with a kind of proven track record based on data, based on market value of the graduates, but I think it comes really, you know, partly from the university community, partly from the academic community. Obviously we have a set of missions to educate students but we need to also, I think, do that by trying to really understand the employer, what are they looking for when they hire talent, what are the kinds of skills that they need in order to be effective on the job, and I think employers need to sort of be open to working with the academic community. You know, there is a certain amount of dissonance that naturally occurs because there are two different worlds with different missions but I think it is really--I think we have shown that it is possible with organizational innovation, with a focused effort, with a sense of openness to engage the private sector in a very positive way, not just at NC State but at other universities. There are many, many examples now that I hope we are providing some leadership on but that other universities are working with our model but also pursuing other creative models to do this. 
There are probably about two dozen around the country already. Mr. Lipinski. Thank you. Dr. Jahanian, anything you want to add about public-private partnerships? Dr. Jahanian. Yes, indeed. There is no question that when we think about the innovation ecosystem in this country, it includes academia, it includes the private sector, it includes government investment and a talent-rich workforce. The private sector is investing heavily in cloud computing, as you know. It is investing heavily in making computational resources also available. I think there are opportunities for the Federal investment to leverage that and make some of that available. Of course that is commercially available today to our researchers, to our scientists and engineers who could rely on those systems. We have announced a number of partnerships, one with IBM and Google, another one with Microsoft that make some of those resources available to the research community. Dr. McQueeney already mentioned this, that there is high-bandwidth communication between the private sector and various Federal agencies. I can tell you from NSF's perspective, it is a very, very rich collaboration. On my advisory committee, I have a number of senior leaders from the private sector who serve on my advisory committee advising us on our portfolio, on our investments in addition to academics who serve on my advisory committee. The final comment that I want to make is, there are a number of programs at NSF, and I know you are familiar with all of them, including SBIR, including I-Corps and so on that focus on transfer of knowledge from lab to practice. Federal Government invests heavily in advancing frontiers of knowledge. For us to accelerate those programs such as I-Corps, SBIR and so on serves a tremendous purpose, and here again, there are opportunities to engage the private sector and accelerate the transfer of knowledge to practice to benefit the Nation. Thank you. Mr. Lipinski. Thank you. Chairman Massie. 
Thank you, Mr. Lipinski. I now recognize Mr. Bridenstine from Oklahoma for five minutes. Mr. Bridenstine. Thank you, Mr. Chairman. I also serve on the House Armed Services Committee, and I am aware that the Department of Defense is moving towards cloud-based computing solutions, and this of course creates some consternation about security issues, cyber hacking, other cyber crimes, and I was wondering if any of your organizations are involved in helping the Department of Defense work through these issues and what those solutions might be, if you could share with us on that? Dr. McQueeney. Sure, if I could start? You are quite right to raise the concern about security for any systems used by the Defense Department especially, although it would be true for all Federal agencies. And when you move to a cloud computing model, there is an extra imperative to be concerned about security, and if you think of it in terms of the DOD might think of it, if that environment should be compromised by an enemy, it is a bigger piece of resource than an individual machine so it requires special vigilance. Now, the good news technically is, the way we handle virtualization, which is the foundation of how cloud computing is delivered from a compute virtualization point of view, there are actually sophisticated techniques that can provide additional security in a virtualized environment that we can provide even when using things running on bare metal. We have additional abilities to instrument the operation of that cloud and to very rapidly detect any kind of pattern or behavior that is indicative of a threat. We did a project a number of years ago with the U.S. Air Force and they graciously let us write a short press release on it where we built a cloud computing environment that was at the cutting edge a few years ago. 
We instrumented it very thoroughly with watching the packets flowing on the interconnected network that built the cloud in question and we very carefully isolated it from the rest of the world, introduced known cyber attacks into it and were able to show that if we knew the patterns of command and control, as the defense folks might say, of these cyber attacks, we could actually spot them assembling themselves and interrupt them before they had a chance to launch. So having tremendous control over the environment out of which we were getting compute resources gave us abilities to do additional security and additional monitoring, even if we assumed the security was not perfect and could be breached, could we essentially in real time detect that breach and interrupt it before it stopped. I thought that was a very forward-looking piece of work that was driven by the Air Force CIO's office. Mr. Bridenstine. Excellent. Go ahead. Dr. Jahanian. As you alluded to, these new environments, whether it is mobile platforms or cloud computing, are introducing new challenges, and we recognize that attackers and defenders are coevolving and there are enormous challenges to protecting our critical infrastructure and our cyber infrastructure. I wanted to mention NSF's Secure and Trustworthy Cyberspace program, which is a research program addressing many of the challenges that we alluded to, and this is a research program that addresses not only the technology issues but also transition to practice. Furthermore, the NITRD research and development subcommittee has a working group that focuses on coordination of activity across various agencies on cybersecurity and there is rich dialog involving various agencies on that issue. Mr. Bridenstine. Excellent. Are there any other things that the Department of Defense could do to help you guys with the objective of securing cloud computing for the Department of Defense? Dr. Rappa. 
So I am currently co-directing a project with a colleague at NC State, which is the science of security project that is done in collaboration with Carnegie Mellon University and University of Illinois, and we are trying to bring together large groups, multidisciplinary groups of faculty to really try to understand the underpinning of the security problem and how to produce science around it. It is a very long-term challenge but it is one which I think has to start with getting the faculty across different disciplines focused on it and certainly I think it has been a tremendous opportunity and I look forward to moving into the future. Dr. McQueeney. Yeah, Dr. Rappa makes a very interesting point, to close the loop here. The cybersecurity problem is itself a big data and fast-data problem, and in fact, with some of the advanced persistent threats that we see today, which depend on breaching an infrastructure and then lying dormant for several months, what the attacker is trying to do is to wait out how long you keep your log file data so that when they launch themselves, it is difficult to do forensics, and so what we have learned is that these log files are actually the essence of the big data you need to do pattern analysis, pattern discovery on forensics, you know, should any attack occur. So in fact, most of the science behind big data including data at rest and large-scale computation and fast-data that are eating very high-speed streams is directly relevant to the subject of cyber defense. Mr. Bridenstine. Thank you. Chairman Massie. Thank you, Mr. Bridenstine. If the Ranking Member is amenable to this, I think we will do another round of questions? Ms. Wilson. Yes. Chairman Massie. Did you have something to introduce into the record? Ms. Wilson. I do. Thank you, Mr. Chair. Mr. Kilmer has lots of conflicts. As we saw him come to the meeting, he had to leave, and I want to ask unanimous consent on behalf of Mr. 
Kilmer to introduce a report on big data from IDC into the record, and then I have a question. Chairman Massie. Without objection, so ordered. It will be set into the record. [The information appears in Appendix II] Ms. Wilson. Thank you. This question is for everyone. We have all had several discussions lately about the value of NSF-funded research to society and how we might certify that value based on the grant proposal. I think we might use big data instructively here. It is an incredibly interdisciplinary field where tools are developed in the pursuit of one narrow research question, let us say in the social sciences might have profound applications across many fields of science and even in many sectors of the economy that can't possibly be anticipated at the time of the proposal. What is the potential for data analytics being developed in one little seemingly irrelevant corner having unintended benefits to other fields and societal applications? And if you have concrete examples, that would be even better for us to understand. Thank you. Dr. Jahanian. Okay. I guess I will start. There is no question there are all sorts of explorations that we are doing in the area of big data that we can't even begin to see the potential impact of it. I will give you an example. NSF has been investing and other agencies with the private sector in what is known as the area of machine learning. These investments have taken place for at least 20 or 30 years. In fact, IBM has also led efforts in this area. I can tell you that it is investments of the last 20 or 30 years that have come to fruition such that these machine learning algorithms essentially allow us to look at these large data sets and identify trends and be able to adapt. 
Essentially, they have a broad range of applications from weather forecasting to financial modeling to biomedical research and so on that have had tremendous, tremendous impact and now we use these techniques as if they are off-the-shelf solutions available that you can buy. These are through years of investment that we have made that have come to fruition, so that is an example of that. We are investing in all sorts of areas in natural language understanding, in information retrieval, in various algorithms and approaches to automated scalable approaches to reasoning that could be applied to understanding relationship between gene sequence structure and biological functions. These are all essentially the kinds of investments that we are making that some of us we could see how it comes to fruition. Some of it relies on decades of investment that we have already made in computational techniques and data-intensive techniques. Dr. McQueeney. If I could offer you an example from the medical world, one of the critical problems in medicine is the loss of premature infants due to infections, and physicians have struggled for a long time with identifying the onset of an infection at a very early point because as these infections can grow exponentially, the earlier you can intercept them, the more likely you are to have a lifesaving benefit for someone who is very vulnerable such as a premature infant. 
We have done work with the Toronto Hospital for Sick Kids where a physician up there had an idea that all the instrumentation in the NICU that is--you know, you have probably been in a hospital room or intensive-care room, all the instruments around the bed, someone comes in every half an hour and writes down those numbers but the instruments are producing readings continuously, and this physician had the idea that if we kept all that data and we stored all that data as it came out of the machines in real time, which was a tremendous aggregation from a velocity of data point of view and correlated with the eventual issues that these premature infants had, we might be able to detect patterns using techniques such as machine learning that we were just hearing about that would give us an early identification of an upcoming infection, the ability to treat it before it got out of control, and her theories were absolutely correct. There were signatures in the data that gave up to 24 hours advance notice of an onset of an infection that was time for the doctors to in many cases provide some kind of lifesaving therapy. So there is an example of very, very deep mathematics, computer science being applied to a problem where the data was being produced every day by these instruments and it wasn't being captured and it wasn't being looked at and it wasn't being correlated with results to produce a fantastic outcome. Dr. Rappa. I would just sum up by saying that really big data is part of a decades-long process that really started with computerization in the 1940s and 1950s and eventually got interconnected through the Internet in the 1970s, 1980s and 1990s that the world that we are turning into, data is going to be everywhere. It is going to affect exactly what happens here. 
It is going to affect hospitals, universities, every corner of the economy literally, and so we need to take approaches to that to try to develop understanding around big data, how it is applied, how the tools of analytics are applied across, you know, virtually every sector of the economy, and so I would take a very broad view, not looking at it as specifically, you know, a realm of computer technology or some other sort of isolated realm but looking at it as, you know, unfortunately as the big thing it is. Dr. Jahanian. May I offer another example as I was thinking about it? I am reminded of the work by Daphne Koller and her collaborators at Stanford on classifying breast cancer via image analysis. As you know, 40,000 women die from this disease each year. By extending essentially image analysis techniques to hundreds of, I should say thousands and thousands of biopsy images, they were able to identify a subset of cellular features. Out of 6,000 possible features, they were able to essentially identify a few of them that were predictive of survival time among breast cancer patients. What is really surprising is that the feature that they identified, it wasn't just from--the best feature, I should say, that is a predictor of survival, was not from the cancerous tissue itself but it was from the surrounding tissue, and that has led to new kinds of treatments. It has led to new kinds of diagnosis techniques and also a very personalized treatment that could aim to improve survival times in patients. That is a very, very concrete example. Another example is the work that Google had done during H1N1 virus. I will be very brief about this. Before they actually discovered a vaccine, we wanted to track the spread of disease. 
Google engineers used data that had nothing to do with the virus directly from billions of essentially web searches from around the world together from publicly available, essentially historic data on flu trends, to predict the spread of flu virus down to small regions in the country--or across the world, rather. This is a remarkable essentially application of data that one would have never thought could be applicable to something like H1N1 virus. Ms. Wilson. Thank you very much. Chairman Massie. Thank you, Ms. Wilson. Thank you for that very excellent example of how we can use--a private company can find information in the data. We got a little bit out of order so the last question is going to be mine. I reserve five minutes for myself. And the question I want to ask is, we have heard about banks that are too big to fail, and we also know that the Internet is now too big to fail. We recently in the House passed a CISPA bill which is somewhat controversial but some people felt it was necessary to do because the Internet was so big and pervasive in our lives. So my question to you is, are there any big data sets that are too big to fail? In other words, are there ones that are pervasive that we have let through osmosis become--we have become too dependent upon or maybe not too dependent but we are dependent upon these data sets, for instance, weather, you know, and early warning systems? Not all of those, I imagine, are government systems. Some of them are private but possibly the government is relying on these systems and so I would be remiss if I didn't ask this question now before something fails, but tell us what is too big to fail right now? What would we bail out, and is there sufficient redundancy in the collection, storage and access of these data sets? Thank you. Dr. McQueeney. 
Well, first, I would just like to say that we were delighted to support that cyber bill, and I congratulate you on such broad bipartisan support in the House for getting that acted upon. Data sets have the property that they can often be subdivided and often be replicated, and so we have a lot of techniques by which we can assure the continuity of data if we take the time to do it, and if there were very valuable historical records on things like long-term weather trends that were only stored in one place, that actually could be a concern because that is literally irreplaceable data. But essentially all of the IT techniques needed to take those large data sets and segment them and replicate them in different secure places so they could be re-created do exist but I think you raise an interesting point, that it is worthwhile to periodically check that we are being appropriately vigilant with the digital archives that are so valuable. Chairman Massie. Dr. Jahanian? Dr. Jahanian. I don't have a specific example. What I can tell you is that similar to the issue of cybersecurity, as Nation's critical infrastructure and more generally the Internet is playing a vital role in integrating the economic, you know, political, societal fabric of our society, we are going to become more and more dependent on data, and data is going to play an increasingly significant role in our day-to-day lives, and for that reason, I think the same sort of issues that apply to all sorts of IT solutions that we take for granted will increasingly be applied to data. From a research and engineering community's point of view, it is not just failure of the data but making that data accessible and also making the data accessible to broad community of scientists and engineers is an issue that we are quite concerned about. Chairman Massie. Thank you very much. I was part of the bipartisan on CISPA, opposing CISPA actually, but that is okay. 
I want to thank the witnesses for their valuable testimony and the Members for their questions today. The Members in the Committee may have additional questions for you, and we will ask that you respond to those in writing. The record will remain open for two weeks for additional comments and written questions from the Members. The witnesses are excused and this hearing is adjourned. [Whereupon, at 11:35 a.m., the Subcommittees were adjourned.] Appendix I ---------- Answers to Post-Hearing Questions Responses by Dr. Michael Rappa [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT] Responses by Dr. Farnam Jahanian [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT] Appendix II ---------- Additional Material for the Record IDC IVIEW, The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, submitted by Representative Derek Kilmer [GRAPHIC(S) NOT AVAILABLE IN TIFF FORMAT]