CIOReview
| | September 20179CIOReviewin applied mathematics. Wickham and Perez are leaders in the open source R and Python communities. R and Python are programming languages. But R and Python are also vibrant ecosystems of code, ideas and methodologies at the center of modern data science. Wickham and Perez have contributed to multiple projects in the R and Python communities. But each is responsible for at least one Big Idea in data science. Wickham is most well known in the R community for having authored ggplot2, which is arguably the most elegant and powerful graphics package in the data science world. John Tukey, the father of data science, emphasized the importance of visualization and exploratory data analysis as a necessary prelude to model and algorithm construction. Wickham's ggplot embodies a deep philosophy of visualization called the "grammar of graphics." A grammar is the "fundamental principles or rules of an art or science." Instead of approaching every visualization or graphic as a one-off, each with its own set of rules and logic, the grammar of graphics imposes a well defined structure for building any visualization. Much like in Ikebana, the Japanese art of flower arrangement, visualization is built up in a series of layers or aesthetics while adhering to a set of simple rules. Working with a well-defined grammar frees up the data scientist to focus and reveal the underlying structure, pattern, and meaning of data.Wickham recently has gone further by creating a new set of powerful libraries called "tidyverse", which is built on similar grammatical principles. At the core of tidyverse is a package called dplyr, which embodies a "grammar of data manipulation." The "two-by-four" of the data science world is a data structure called a data frame. A large part of data analysis, especially during the exploratory data analysis phase, consists of manipulating, arranging, and re-arranging data frames. The set and sequence of operations on data frame is a core part of data analysis. It also has tended to be tedious and arcane. Wickham's dplyr package provides an elegant and consistent method for handling the most common data manipulation tasks faced by data scientists. Wickham's Big Idea has been to take a grammatical approach to data science and making that the basis of a series of innovative R libraries. His Big Idea has fundamentally transformed how data scientists approach their daily job.Perez comes from the world of modern science which is increasingly built on computation and data-intensive modes of scientific discovery. Alan Turing award winner Jim Gray referred to this new approach to scientific investigation as the Fourth Paradigm. If we look at the detection and confirmation of gravitational waves, for example, much of the experimental work relied on sophisticated data analysis and data science. The preferred tool set for the analysis of gravitational waves is Python. Indeed Python is now the de facto standard for scientific computing and data analysis.As a graduate student Perez, along with some of his colleagues, developed an interactive command shell tool called IPython along with a web-based interface called the IPython notebook. Now called Project Jupyter, Perez's work is utilized on a daily basis by tens of thousands of scientists and data scientists. Perez's great insight, which led to his Big Idea, was to notice that the available tools in the programming world don't support the iterative workflow of science. Empirical science progresses in a cycle: we pose questions, form hypotheses, conduct experiments, collect and analyze data. Based on the data we make adjustments to our questions and hypotheses. The cycle then begins all over again. Scientific innovation is sparked in part by how rapidly we can iterate each cycle.As an open source innovator Perez led the development of a tool set that more closely matches the scientific workflow. As an added benefit the Jupyter notebook paradigm provides a mechanism for ensuring reproducible research. Much like a laboratory notebook the Jupyter environment gives researchers the ability to annotate their work, collaborate in teams, share and disseminate their findings. It is inevitable that very soon research papers will be published in the style of a Jupyter notebook, where the narrative of the paper is only a part of the research. The published paper, in the form of an interactive notebook, will also contain the data (or links to it) and the computation in the form of code. As data-driven intelligent software begins to dominate the internet, spawning further software innovations, we need to acknowledge that the emerging technologies stand on the shoulder of giants like Hadley Wickham and Fernando Perez. The pervasiveness of digital data is being driven by the Internet of Things (IoT), which means that everything in the physical world potentially transmits and receives data
< Page 8 | Page 10 >