Faculty in the Statistics Department at the University of California, Berkeley have developed an integrated program of research and education to support undergraduate research experiences, graduate research traineeships, and postdoctoral fellowships. The common research theme of the training activities is how to leverage the predictive power of statistical machine learning to address questions of causality and interpretability. The project aims to prepare the next generation of statisticians and data scientists to tackle new, important problems that arise from the analysis of massive data. Intuitively, it seems that more reliable and precise inferences can be drawn from larger data sets. However, decisions and interventions must be interpretable and justified by statistical measures of uncertainty, and such measures are challenging to obtain in this setting. This program will infuse ideas, energy, and resources into all levels of the educational program in an integrated way, from the undergraduate major to the postdoctoral experience, recruiting students and preparing them to participate in the extraordinary range of opportunities in this exciting new field.

The research in this project will pursue theory to bridge the gap between causal inference and machine learning research, with topics including high-dimensional inference, multiple testing, causal inference with interference, and causality and gene expression. The project is at the frontiers of statistics and data science, with potential impact far beyond the discipline of statistics. The project plans to redesign and expand the engagement of undergraduates in research through a graduate student mentorship program; to design new courses at the graduate and undergraduate levels, including an introductory course that builds on connections between data science, social sciences, and ethics; and to enhance graduate research training via a research symposium. The program will include a graduate professional development training series that addresses topics in technology, presentation and writing skills, and building an inclusive science community. The project will also provide significant training in teaching for graduate students and postdoctoral associates. Through a combination of channels, the innovations in training will spread to other institutions and disciplines, for example by demonstrating the power of machine learning in policy and education settings where causal inference is central. The program also includes the development of educational materials, with plans to disseminate them widely throughout the broader community. The project will emphasize recruitment and retention efforts aimed at increasing the diversity of domestic students.