50 Years of Data Science
Drawing on work by Tukey, Chambers, Breiman and Cleveland, Stanford statistics professor David Donoho presents a vision of data science based on the activities of people who are ‘learning from data’.
- John Tukey’s The Future of Data Analysis asserts that Statistics must become concerned with the handling and processing of data, its size, and its visualization.
- John Chambers’s S language, the predecessor of R, is the forerunner of the “notebook” concept, in which an academic paper can be made reproducible, scripted, and shareable (e.g. Jupyter Notebook).
- Leo Breiman’s Two Cultures notes that a strict concern with prediction accuracy is different from inference about models, and that the former is under-represented in academia but prevalent in industry, where it has turned into “machine learning.”
- William S. Cleveland’s 2001 paper Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics addressed academic statistics departments and proposed a plan to reorient their work.
Donoho’s paper reviews the recent commentary about data science in the popular media, and the debate over how or whether Data Science is really different from Statistics.
He also describes an academic field dedicated to improving the activity of learning from data in an evidence-based manner. His premise is that this new field is a better academic enlargement of statistics and machine learning than today’s Data Science Initiatives, while being able to accommodate the same short-term goals.
He proposes to call the following collection of activities a would-be field, “Greater Data Science”:
- Data Exploration and Preparation
- Data Representation and Transformation
- Computing with Data
- Data Modeling
- Data Visualization and Presentation
- Science about Data Science
Donoho contends that information technology skills are at a premium, but that scientific understanding and statistical insight should be firmly in the driver’s seat.
Check out the thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.”