Data science underpins modern AI and many advances in healthcare, yet human judgment permeates every stage of the data science life cycle. These judgment calls introduce hidden uncertainties that go well beyond sampling variability and drive many of the risks associated with AI.
We introduce veridical data science, grounded in three fundamental principles—Predictability, Computability, and Stability (PCS)—to make such uncertainties explicit and assessable and to aggregate reality-checked algorithms for better results. The PCS framework unifies and extends best practices in statistics and machine learning and is illustrated through case studies including identifying causal drivers of heart disease and brain regions, reducing cost of prostate cancer detection, and improving uncertainty quantification beyond standard conformal prediction.
We then cover 2 software packages and best practices for implementing PCS principles in practice. We start with vflow, a Python package that makes it easy to implement and evaluate the effect of judgment calls on a data-science pipeline. We demonstrate its usage on popular real-world datasets from the BLADE benchmark. Then, we will introduce MERITS that a veridical simulation study should satisfy. We accompany this with simChef, an R package for cooking up reproducible, high-quality simulations in a flexible, efficient, and low-code manner.
Bin
Yu is CDSS Chancellor's Distinguished Professor in
Statistics, EECS, Center for Computational Biology, and Senior Advisor
at the Simons Institute for the Theory of Computing, all at UC
Berkeley. Her research focuses on the practice and theory of
statistical machine learning, veridical data science, responsible and
safe AI, and solving interdisciplinary data problems in neuroscience,
genomics, and precision medicine. She and her team have developed
algorithms such as iterative random forests (iRF), stability-driven
NMF, adaptive wavelet distillation (AWD), Contextual Decomposition for
Transformers (CD-T), SPEX and ProxySPEX for interpreting deep learning
models, especially for compositional interpretability.
She is a member of the National Academy of Sciences and of the American Academy of Arts and Sciences. She was a Guggenheim Fellow, President of Institute of Mathematical Statistics (IMS), and delivered the Tukey Lecture of the Bernoulli Society, the Breiman Lecture at NeurIPS, the IMS Rietz Lecture, and the Wald Memorial Lectures (the highest honor of IMS), and Distinguished Achievement Award and Lecture (formerly Fisher Lecture) of COPSS (Committee of Presidents of Statistical Societies). She holds an Honorary Doctorate from The University of Lausanne. She is on the Editorial Board of Proceedings of National Academy of Science (PNAS) and a co-editor of the Harvard Data Science Review (HDSR).
Tiffany
Tang is a Clare Boothe Luce Assistant Professor at the
University of Notre Dame in the Department of Applied and
Computational Mathematics and Statistics. Her research interests
broadly range from interpretable machine learning to responsible
applications of AI in healthcare to open-source software development.
Previously, she completed her PhD in statistics at UC Berkeley,
advised by Bin Yu, and a postdoctoral fellowship in the University of
Michigan Statistics department with Liza Levina and Ji Zhu.
Chandan
Singh is a researcher in the Deep learning group at MSR,
working on interpretability and LLMs, with the broad goal of improving
science and medicine using data. Recently, he has focused on how to
reliably use LLMs to extract new insights in clinical data and
language fMRI. Separately, he has also worked on developing highly
accurate transparent models, such as improving linear models and
decision trees. He received his PhD from UC Berkeley in 2022 advised
by Professor Bin Yu.
In this workshop, we present methods to define and estimate the causal effects of categorical, continuous, and multivariate exposures. The methods are based on a generalization of the static and dynamic interventions that may be familiar to some of you. This generalization has been recently called modified treatment policies (MTPs). MTPs are hypothetical interventions where the post-intervention exposure is defined as a modification of the natural value of the exposure that can also depend on the unit’s history. This short course will introduce the lmtp R package for estimating the causal effects of these general estimands in both point-treatment and longitudinal studies. We will discuss identification of MTPs, estimation with a targeted minimum-loss based estimator and a sequentially doubly-robust estimator, and provide guidance on estimator choice and software usage.
Iván
Díaz is an Associate Professor of Biostatistics at New
York University Grossman School of Medicine. His research focuses on
the development of non-parametric statistical methods for causal
inference from observational and randomized studies with complex
datasets, using machine learning. This includes but is not limited to
mediation analysis, methods for continuous exposures, longitudinal
data including survival analysis, and efficiency guarantees with
covariate adjustment in randomized trials. He also works applying
these methods to healthcare research, including in neurology, critical
care, opioid use research, and other areas.
Kara
Rudolph is an Associate Professor of Epidemiology at
Columbia University, Mailman School of Public Health. Her research
interests are in developing and applying causal inference methods to
inform the prevention and treatment of opioid use disorder, the
prevention of violence, and understanding health effects of the
environment. She is interested in approaches for transportability and
data fusion, mediation, effect heterogeneity, and complex
exposures.
Nick
Williams is a Senior Data Analyst in Columbia
University's Mailman School of Public Health, Department of
Epidemiology. His interests are in the development of statistical
computing tools for novel causal inference methods. He's the author
and maintainer of multiple R packages.
This half-day short course offers a selective, statistics-first introduction to the foundations of transfer learning and closely related multi-task learning ideas. The central question is how distribution shift between source and target populations affects generalization, and how explicit statistical assumptions make principled knowledge transfer possible.
The course covers three core threads: domain adaptation bounds expressed through divergence measures between source and target distributions; covariate shift and density-ratio based reweighting for valid risk estimation and estimator construction; and posterior drift addressed via biased regularization as a route to "safe transfer" that guards against negative transfer. Throughout, the emphasis is on precise problem formulations, the key theoretical results that follow from them, and the intuition for when transfer helps, when it hurts, and why.
The format is lecture-based, with guided derivations, brief conceptual check-ins, and short numerical illustrations accompanying each thread to make the theoretical takeaways concrete. Participants will leave with a unified perspective on divergence-based analysis, covariate shift correction, and biased regularization—one that can inform both applied methodological choices and further theoretical research.
Participants should have a working knowledge of mathematical statistics at the level of a first-year Ph.D. course, including probability, expectation and concentration inequalities, and basic asymptotics. Familiarity with standard supervised learning (regression and classification, empirical risk minimization, bias–variance trade-off) and with regularized estimators such as ridge and lasso will be assumed. Comfort with linear algebra and multivariable calculus is expected. No coding background is required: the course is theoretical and methodological in focus, and some numerical illustrations may be shown by the instructor if time permits.
Yang
Feng is a Professor of Biostatistics in the School of
Global Public Health at New York University, where he is also
affiliated with the Center for Data Science. He received his Ph.D. in
Operations Research from Princeton University in 2010.
Dr. Feng’s research focuses on the theoretical and methodological foundations of machine learning, high-dimensional statistics, network models, and nonparametric statistics. His work addresses critical applications in Alzheimer’s disease prognosis, cancer subtype classification, genomics, electronic health records, and biomedical imaging, with the goal of enabling more accurate risk assessment and clinical decision-making. He has published over 70 peer-reviewed papers in leading journals across statistics, machine learning, econometrics, and medicine. His research has been supported by grants from the National Institutes of Health (NIH) and the National Science Foundation (NSF), including the NSF CAREER Award.
Currently, Dr. Feng serves as the Review Editor for the Journal of the American Statistical Association (JASA) and The American Statistician (2026–2028). He also serves as an Associate Editor for several premier journals, including JASA Theory & Methods, the Journal of Business & Economic Statistics, the Journal of Computational & Graphical Statistics, and the Annals of Applied Statistics. He is a Fellow of the American Statistical Association (2022) and the Institute of Mathematical Statistics (2023), and has been an elected member of the International Statistical Institute since 2017.
Diffusion generative models have emerged as a cornerstone of modern generative AI, delivering state-of-the-art performance across a wide range of data generation tasks. At their core, diffusion models seek to gradually transform pure noise into new data samples that emulate a target data distribution, accomplished by learning to reverse a forward stochastic process that progressively converts data into Gaussian noise. Despite their empirical successes, the statistical and algorithmic foundations of diffusion models remain far from mature. This lack of fundamental understanding limits their broader adoption, especially in applications that demand interpretability and reproducibility.
This short course provides a timely introduction to diffusion models and presents recent progress toward understanding their striking effectiveness, with an emphasis on core principles and statistical insights. We will examine the fundamental mechanisms of score-based diffusion models; characterize the statistical limits of learning score functions; analyze the convergence behavior of diffusion-based samplers; explore how these models adapt to unknown low-dimensional data structures; discuss conditional generation via diffusion guidance; and highlight ideas for accelerating inference through higher-order approximations. Throughout this short course, we will connect theoretical advances to practical applications, illustrating how fundamental insights can inform effective algorithm design.
Basic linear algebra and basic probability.
Yuxin
Chen is a Professor of Statistics and Data Science at the
University of Pennsylvania. Before joining UPenn, he was an assistant
professor of electrical and computer engineering at Princeton
University. He completed his Ph.D. in Electrical Engineering at
Stanford University and was also a postdoc scholar at Stanford
Statistics. His current research interests include high-dimensional
statistics, diffusion models, reinforcement learning, and
optimization. He has received the Alfred P. Sloan Research Fellowship,
the Leo Breiman junior award, the SIAM Activity Group on Imaging
Science Best Paper Prize, the ICCM Best Paper Award (gold medal), and
was selected as a finalist for the Best Paper Prize for Young
Researchers in Continuous Optimization. He has also received the
Princeton Graduate Mentoring Award.
Yuting
Wei is an Associate Professor in the Department of
Statistics and Data Science at the Wharton School of the University of
Pennsylvania. Prior to joining Penn in 2021, she spent two years as an
Assistant Professor at Carnegie Mellon University and one year at
Stanford University as a Stein Fellow. She received her Ph.D. in
Statistics from the University of California, Berkeley. Dr. Wei is a
recipient of the 2025 Gottfried E. Noether Early Career Scholar Award,
the Google Research Scholar Award, the NSF CAREER Award, and the Erich
L. Lehmann Citation from the Berkeley Statistics Department. Her
research interests lie in learning from high-dimensional and
structured data, and advancing the theoretical foundations of
reinforcement learning and diffusion models.
Optimization lies at the heart of modern data science, offering scalable solutions for high-dimensional problems in statistics and machine/deep learning. The first part of the course will cover: (i) the fundamentals of gradient-based optimization and (ii) advanced optimization methods. These algorithms will be illustrated through applications in high-dimensional statistics and machine learning, including sparse regression, matrix completion, graphical models and feed-forward neural networks. The second part will explore key recent developments in optimization driven by challenges in machine and deep learning. It will briefly cover: (i) Federated and distributed learning, where decentralized optimization techniques enable efficient model training across multiple devices while preserving data privacy. (ii) Minimax optimization, a powerful framework for adversarial learning, robust statistics, and generative modeling. (iii) Bilevel optimization, which has gained prominence in the last 2-3 years for applications such as hyperparameter tuning, meta-learning, and reinforcement learning. The course will balance core concepts with sufficient technical depth, providing an accessible yet insightful perspective on the latest advances in optimization.
George
Michailidis is a Professor in the Department of
Statistics & Data Science at UCLA, where he joined the faculty in
2022. His academic career includes 17 years on the faculty in
Statistics and EECS at the University of Michigan, followed by roles
as a Professor of Statistics and Computer Science and the Founding
Director of the Informatics Institute at the University of Florida.
His research focuses on high-dimensional statistics for complex,
temporally dependent data; optimization for modern machine learning
(encompassing minimax, bilevel, and federated architectures); change
point analysis; and interpretable machine learning. Bridging theory
and practice, he regularly collaborates across engineering,
biomedicine, and finance to deploy his methods on large-scale
datasets. Alongside his standard teaching duties in optimization for
data science and machine learning, he has also delivered a
corresponding short course on multiple occasions.
This short course is designed for researchers in statistics and data analysis who are eager to explore the latest trends in deep learning and apply these methods to solve complex statistical problems. The course focuses on cutting-edge topics in the deep learning community, including transformers, diffusion models, reinforcement learning and large language models. In this short course participants will gain hands-on experience in exploring and applying deep learning methodologies to tackle various statistical challenges. Basic knowledge of Python programming will be helpful but not necessary.
Hongtu
Zhu is the Kenan Distinguished Professor of
Biostatistics, Statistics, Radiology, Computer Science and Genetics at
the University of North Carolina at Chapel Hill. He was a DiDi Fellow
and Chief Scientist of Statistics at DiDi Chuxing between 2018 and
2020 and held the Endowed Bao-Shan Jing Professorship in Diagnostic
Imaging at MD Anderson Cancer Center between 2016 and 2018. He is an
internationally recognized expert in statistical learning, medical
image analysis, precision medicine, biostatistics, artificial
intelligence, and big data analytics. He received an established
investigator award from the Cancer Prevention Research Institute of
Texas in 2016, the INFORMS Daniel H. Wagner Prize for Excellence in
Operations Research Practice in 2019, the IMS 2027 Medallion award and
Lecture, and the COPSS 2025 Snedecor Award. He has published more than
365 papers in top journals, including Nature, Science, Cell, Nature
Genetics, Nature Communication, PNAS, AOS, JASA, Biometrika, and
JRSSB, as well as presenting 65+ conference papers at top conferences,
including meetings for Neurips, ICLR, ICML, AAAI, and KDD. He is the
coordinating editor of JASA and the editor of JASA ACS.
Xiao
Wang is Head and J.O. Berger and M.E. Bock Professor of
Statistics at Purdue University. He earned his Ph.D. from the
University of Michigan, and his research centers on machine learning,
nonparametric statistics, and functional data analysis with particular
emphasis on developing methods for high-dimensional and complex data.
His work has been featured in leading statistical journals and machine
learning conferences, and he is a fellow of the Institute of
Mathematical Statistics (IMS) and the American Statistical Association
(ASA). He currently serves as an associate editor for JASA,
Technometrics, and Lifetime Data Analysis.
Runpeng
Dai obtained his B.S. in Statistics from Shanghai
University of Finance and Economics and is now a PhD candidate in
Department of Biostatistics at University of North Carolina at Chapel
Hill. His research interest lies in Reinforcement learning and Large
language model. He has several internship experiences in Tencent
Seattle AI labs and DiDi Chuxing.