Training students to extract value from big data : summary of a workshop, Washington, D.C., 11-12 April 2014.

Author(s)
Mellody, M. (Rapp.)
Year
Summary

Data sets–whether in science and engineering, economics, health care, public policy, or business–have been growing rapidly; the recent National Research Council (NRC) report Frontiers in Massive Data Analysis documented the rise of “big data,” as systems are routinely returning terabytes, petabytes, or more of information (National Research Council, 2013). Big data has become pervasive because of the availability of high-throughput data collection technologies, such as information-sensing mobile devices, remote sensing, radiofrequency identification readers, Internet log records, and wireless sensor networks. Science, engineering, and business have rapidly transitioned from the longstanding state of striving to develop information from scant data to a situation in which the challenge is now that the amount of information exceeds a human’s ability to examine, let alone absorb, it. Web companies–such as Yahoo, Google, and Amazon–commonly work with data sets that consist of billions of items, and those data sets are likely to grow by an order of magnitude or more as the Internet of Things matures. In other words, the size and scale of data, which can be overwhelming today, are only increasing. In addition, data sets are increasingly complex, which potentially compounds problems such as missing information and other quality concerns, data heterogeneity, and differing data formats. Advances in technology have made it easier to assemble and access large amounts of data. Now, a key challenge is to develop the experts needed to draw reliable inferences from all that information. The nation’s ability to make use of the data depends heavily on the availability of a workforce that is properly trained and ready to tackle these high-need areas. A report from McKinsey & Company (Manyika et al., 2011) has predicted shortfalls of 150,000 data analysts and 1.5 million managers who are knowledgeable about data and their relevance.
It is becoming increasingly important to enlarge the pool of qualified scientists and engineers who can extract value from big data. Training students to be capable in exploiting big data requires experience with statistical analysis, machine learning, and computational infrastructure that permits the real problems associated with massive data to be revealed and, ultimately, addressed. The availability of repositories (of both data and software) and computational infrastructure will be necessary to train the next generation of data scientists. Analysis of big data requires cross-disciplinary skills, including the ability to make modelling decisions while balancing trade-offs between optimization and approximation, all while being attentive to useful metrics and system robustness. To develop those skills in students, it is important to identify whom to teach, that is, the educational background, experience, and characteristics of a prospective data science student; what to teach, that is, the technical and practical content that should be taught to the student; and how to teach, that is, the structure and organization of a data science program. The topic of training students in big data is timely, as universities are already experimenting with courses and programs tailored to the needs of students who will work with big data. Eight university programs have been or will be launched in 2014 alone. The workshop that is the subject of this report was designed to enable participants to learn and benefit from emerging insights while innovation in education is still ongoing. On April 11-12, 2014, the standing Committee on Applied and Theoretical Statistics (CATS) convened a workshop to discuss how best to train students to use big data. CATS is organized under the auspices of the NRC Board on Mathematical Sciences and Their Applications. To conduct the workshop, a planning committee was first established to refine the topics, identify speakers, and plan the agenda.
The workshop was held at the Keck Center of the National Academies in Washington, D.C., and was sponsored by the National Science Foundation (NSF). About 70 persons–including speakers, members of the parent committee and board, invited guests, and members of the public–participated in the 2-day workshop. The workshop was also webcast live, and at least 175 persons participated remotely. A complete statement of task is shown in Box 1.1. The workshop explored the following topics:
* The need for training in big data.
* Curricula and coursework, including suggestions at different instructional levels and suggestions for a core curriculum.
* Examples of successful courses and curricula.
* Identification of the principles that should be delivered, including sharing of resources.
Although the title of the workshop was “Training Students to Extract Value from Big Data,” the term big data is not precisely defined. CATS, which initiated the workshop, has tended to use the term massive data in the past, which implies data on a scale for which standard tools are not adequate. The terms data analytics and data science are also becoming common. They seem to be broader, with a focus on using data–maybe of unprecedented scale, but maybe not–in new ways to inform decision making. This workshop was not developed to explore any particular one of these definitions or to develop definitions. But one impetus for the workshop was the current fragmented view of what is meant by analysis of big data, data analytics, or data science. New graduate programs are introduced regularly, and they have their own notions of what is meant by those terms and, most important, of what students need to know to be proficient in data-intensive work. What are the core subjects in data science? By illustration, this workshop began to answer that question.
It is clear that training in big data, data science, or data analytics requires a multidisciplinary foundation that includes at least computer science, machine learning, statistics, and mathematics, and that curricula should be developed with the active participation of at least these disciplines. The chapters of this summary provide a variety of perspectives about those elements and about their integration into courses and curricula. Although the workshop summarized in this report aimed to span the major topics that students need to learn if they are to work successfully with big data, not everything could be covered. For example, tools that might supplant MapReduce, such as Spark, are likely to be important, as are advances in deep learning. Means by which humans can interact with and absorb huge amounts of information–such as visualization tools, iterative analysis, and human-in-the-loop systems–are critical. And such basic skills as data wrangling, cleaning, and integration will continue to be necessary for anyone working in data science. Educators who design courses and curricula must consider a wide array of skill requirements. The present report has been prepared by the workshop rapporteur as a factual summary of what occurred at the workshop. The planning committee’s role was limited to planning and convening the workshop. The views contained in the report are those of individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the NRC. (Author/publisher)

Publication

Library number
20150555 ST [electronic version only]
Edition

Washington, D.C., National Academies Press (NAP), 2014, XII + 54 p., 18 ref. - ISBN 0-309-31437-2 / ISBN 978-0-309-31437-4

Our collection

This publication is one of the other publications that we hold in our collection alongside the SWOV publications.