Berkeley Statistics Annual Research Symposium (BSTARS)

The Berkeley Statistics Annual Research Symposium (BSTARS) surveys the latest research developments in the department, with an emphasis on possible applications to statistical problems encountered in industry. The conference consists of keynote lectures given by faculty members, talks by PhD students about their thesis work, and presentations of industrial research by alliance members. The day-long symposium gives our graduate students, faculty, and industry partners an opportunity to connect, discuss research, and review the newest developments happening on campus and in the field of statistics. This year's event had a full schedule of 25 thunder talks, 12 posters, and a fascinating presentation by our keynote speaker.



BSTARS 2017 will be March 23rd, 1:30pm-8:30pm, at The Alumni House, UC Berkeley.

1:30-2:00 Arrival, coffee, and pastries
2:00-2:10 Welcome and opening remarks
  • Michael Jordan, Statistics Department Professor and Chair
  • Frances Hellman, Dean, Math & Physical Sciences
2:10-3:10 Thunder Talks
3:10-3:30 Break
3:30-4:30 Thunder Talks 2
4:30-5:50 Poster Sessions
5:50-6:30 Keynote by Professor Jasjeet Sekhon
6:30-8:30 Dinner

Keynote Speaker


Causal Inference in the Age of Big Data

The rise of massive datasets that provide fine-grained information about human beings and their behavior offers unprecedented opportunities for evaluating the effectiveness of social, behavioral, and medical treatments. With the availability of fine-grained data, researchers and policy makers are increasingly unsatisfied with estimates of average treatment effects based on experimental samples that are unrepresentative of populations of interest. Instead, they seek to target treatments to particular populations and subgroups. However, even large-scale experiments are often underpowered to detect the effects of interest.  In order to increase power, researchers are increasingly making modeling assumptions and combining experiments with large-scale observational data. Because of these inferential challenges, Machine Learning (ML) is now being used for evaluating and predicting the effectiveness of interventions in a wide range of domains from technology firms to clinical medicine and election campaigns. However, there are a number of issues that arise with the use of ML for causal inference. For example, although ML and related statistical models are good for prediction, they are not designed to estimate causal effects. Instead, they focus on predicting observed outcomes. Treatment effects, however, are never directly observed, and creating validation datasets where ground truth is known is difficult. Such validation is of particular importance because although ML algorithms have been designed to overcome prediction challenges when the data generating process is unknown, they cannot overcome bias when treatment assignment is a function of variables that are not observed. In this talk, I will discuss some recent methodological developments and examples of using ML methods to draw causal inferences.

Jasjeet S. Sekhon is Robson Professor of Political Science and Statistics and Senior Fellow at the Berkeley Institute for Data Science. His current research focuses on methods for causal inference in observational and experimental studies and evaluating social science, public health, medical, and digital interventions. He advises companies and election campaigns on statistical methods and on the design, implementation and analysis of large-scale experiments.

Thunder Talks by Industry Alliance Program Members

Microsoft Presenter: Dr. Gireeja Ranade, Postdoc Researcher, MSR

State Street Presenter: Dr. John Arabadjis, Managing Director, GX Labs

StubHub Presenters:
Dr. Xin Heng, Senior Manager, Data Science
Dr. Corey Reed, Lead Data Scientist

Citadel Presenter: Dr. Mikhail Traskin, Quantitative Researcher

Clindata Insight Presenter: Dr. Peng Yang, President

Deloitte Presenter: Dr. Bambo Sosina, Senior Specialist

Voleon Presenter: Dr. Stephen Reid, Research Staff

TGS Management Presenters:
Dr. Philip Naecker, Chief Technology Officer
Michael Junge, Director of Talent Acquisition

Thunder Talks and Poster Presentations of PhD Students and Postdocs

Rebecca Barter, Statistics Department

Acute Rejection in Kidney Transplant Patients With HIV: Developing Strategies for Dynamic Prediction

Over the past few decades, HIV has evolved from a death sentence to a manageable chronic condition with HAART therapies drastically extending the life expectancy of HIV positive individuals. As a consequence of prolonged survival, HIV-associated conditions such as kidney and liver disease are resulting in an increased demand for organ transplantation. While transplantation is proving effective in terms of patient survival, HIV-positive patients exhibit a surprisingly high rate of kidney rejection relative to their HIV-negative counterparts. Together with the Narwal lab at UCSF, we are developing novel analytic strategies for dynamically predicting and understanding kidney rejection based on 'omics data measured from a range of graft biopsy samples taken over time. In this talk I will discuss our first steps towards an analytic solution and discuss the challenges presented by this problem.

Billy Fang, Statistics Department

On the Risk of Convex-constrained Least Squares Estimators Under Misspecification

The constrained least squares estimator is a natural estimator of the mean of an isotropic Gaussian vector when it is known that the mean lies in some convex set. The risk of this estimator is known to be characterized by the statistical dimension of a tangent cone. As a step toward understanding the behavior of the estimator when the mean does not lie in the convex set (model misspecification), we prove an analogous characterization of risk in the misspecified case when the convex set is a polyhedron, and show that the risk can be much smaller than in the well-specified setting.  This is joint work with Professor Adityanand Guntuboyina.

Ryan Giordano, Statistics Department

How Bad Could it Be? Worst-case Prior Sensitivity Estimates for Variational Bayes

In Bayesian analysis, the posterior follows from the data and a choice of a prior and a likelihood. One hopes that the posterior is robust to reasonable variation in the choice of prior and choice of likelihood, since this choice is made by the modeler and is necessarily somewhat subjective. For example, the process of prior elicitation may be prohibitively time-consuming; two practitioners may have irreconcilable subjective prior beliefs; or the model may be so complex and high-dimensional that humans cannot reasonably express their prior beliefs as formal distributions. All of these circumstances might give rise to a range of reasonable prior choices.  A posterior quantity is “robust” to the extent that it does not change much when calculated under these different prior choices.

Variational Bayes is an approximate Bayesian inference methodology that is increasingly popular due to its fast performance on large datasets.  We provide relatively easy-to-use techniques to quantify the robustness of Variational Bayes posterior mean estimates to an expressive non-parametric class of prior perturbations.  We also provide tools to calculate the worst-case additive prior perturbation within a suitably defined metric ball, effectively answering the question "how bad could it be?"  We apply our techniques to a range of models and compare our results to Markov Chain Monte Carlo estimates when available.

Sören Künzel, Statistics Department

Heterogeneous Treatment Effect Estimation Using Random Forest

There is growing interest in estimating heterogeneous treatment effects in experimental and observational studies, with applications ranging from personalized medicine to online advertisement recommendation systems. We describe a number of algorithms that are based on random forest to estimate the conditional average treatment effect (CATE) function, and we compare them using theoretical results under a simple causal model and simulation studies.

Lihua Lei, Statistics Department

Bayesian Inference on Gaussian Graphical Models With Soft G-Wishart Distributions

The Gaussian graphical model has been a popular tool in statistics to model the dependence or the causality among a group of variables. We propose a new family of conjugate priors and build a Bayesian hierarchical model based on it. We then develop a fully automatic and tuning-free procedure to estimate the graph structure as well as the precision matrix. Via intensive simulations, we show that our method consistently outperforms the existing frequentist approaches, e.g. the graphical LASSO, in finite samples.

Jamie Murdoch, Statistics Department

Peeking into the Black Box: A Step Towards Interpretable Deep Learning

Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight into the underlying learned patterns. In this talk I consider the popular Long Short-Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state-of-the-art LSTMs on sentiment analysis and question answering into a set of representative phrases. This work is joint with Facebook Artificial Intelligence Research.

Kellie Ottoboni, Statistics Department

Simple Random Sampling: Not So Simple

The theory of inference from simple random samples (SRSs) is fundamental in statistics; many statistical techniques and formulae assume that the data are an SRS. True random samples are rare; in practice, people tend to draw samples by using pseudo-random number generators (PRNGs) and algorithms that map a set of pseudo-random numbers into a subset of the population. Most statisticians take for granted that the software they use "does the right thing," producing samples that can be treated as if they are SRSs. In fact, the PRNG and the algorithm for drawing samples matter enormously. We show, using basic counting principles, that some widely used methods cannot generate all SRSs of a given size, and those that can do not always do so with equal frequencies in simulations. We compare the "randomness" and computational efficiency of commonly used PRNGs to PRNGs based on cryptographic hash functions, which avoid these pitfalls. We judge these PRNGs by their ability to generate SRSs and find in simulations that their relative merits vary by seed, population and sample size, and sampling algorithm. These results are not just limited to SRSs but have implications for all resampling methods, including the bootstrap, MCMC, and Monte Carlo integration.
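
The counting argument mentioned in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical PRNG whose entire internal state fits in `state_bits` bits and a deterministic sampling algorithm, so that each seed leads to at most one sample. The function name and the example sizes are illustrative and are not taken from the talk.

```python
from fractions import Fraction
from math import comb


def reachable_fraction_upper_bound(population: int, sample_size: int, state_bits: int) -> Fraction:
    """Upper bound on the share of distinct SRSs a seeded PRNG can produce.

    Assumption: a deterministic sampling algorithm driven by a PRNG with
    2**state_bits possible states can emit at most 2**state_bits samples.
    """
    n_samples = comb(population, sample_size)  # number of possible SRSs
    n_states = 2 ** state_bits                 # number of possible states
    return min(Fraction(1), Fraction(n_states, n_samples))


if __name__ == "__main__":
    # With a 32-bit state, not every 10-subset of 50 items can be reached.
    bound = reachable_fraction_upper_bound(50, 10, 32)
    print(f"At most {float(bound):.1%} of samples are reachable")
```

A bound below 1 means some samples can never be drawn, whatever the seed. A bound equal to 1 does not by itself guarantee that all samples occur with equal frequency.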

Fanny Perraudeau, Biostatistics Department

Zero-inflated Negative Binomial Model for Single-cell RNA

Single-cell RNA sequencing (scRNA-Seq) is a powerful and relatively young technique for characterizing molecular states of individual cells through their transcriptional profiles. It represents a major advance with respect to standard bulk RNA sequencing, which is only capable of measuring gene expression levels averaged over millions of cells. Accessing cell-to-cell variability is crucial for disentangling complex heterogeneous tissues, and for understanding dynamic biological processes, such as embryo development and cancer. Because of the tiny amount of RNA present in a single cell, the input material needs to go through many rounds of amplification before being sequenced. This results in a strong amplification bias as well as the presence of dropouts, i.e., genes that fail to be detected even though they are expressed in the sample. Here, we present a general and flexible zero-inflated negative binomial (ZINB) model that leads to low dimensional representations of the data by accounting for zero-inflation (dropouts) and for the count nature of the data.
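
As a rough illustration of what zero inflation means for a single count, the sketch below evaluates the standard zero-inflated negative binomial probability mass function. The parameter names (`mu` for the mean, `theta` for the dispersion, `pi` for the dropout probability) are assumptions made for this example; the model in the talk is more general, and this is not its implementation.

```python
from math import exp, lgamma, log


def nb_pmf(y: int, mu: float, theta: float) -> float:
    """Negative binomial pmf with mean mu > 0 and dispersion theta > 0.

    Under this parameterization the variance is mu + mu**2 / theta.
    """
    log_p = (
        lgamma(y + theta) - lgamma(theta) - lgamma(y + 1)
        + theta * log(theta / (theta + mu))
        + y * log(mu / (theta + mu))
    )
    return exp(log_p)


def zinb_pmf(y: int, mu: float, theta: float, pi: float) -> float:
    """Zero-inflated NB: with probability pi the count is a structural zero."""
    base = (1.0 - pi) * nb_pmf(y, mu, theta)
    # Zeros come from two sources: dropouts and the NB component itself.
    return pi + base if y == 0 else base


if __name__ == "__main__":
    # Zero inflation raises the probability of observing a zero count.
    print(nb_pmf(0, mu=5.0, theta=2.0), zinb_pmf(0, mu=5.0, theta=2.0, pi=0.3))
```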

Farzad Pourbabaee, Economics Department

Large Deviations of Factor-models With Regularly-varying Tails: Asymptotics and Efficient Estimation

I analyze the large deviation probability of factor models generated from components with regularly-varying tails, a large subclass of heavy-tailed distributions. An efficient sampling method (based on conditioning) for tail probability estimation of this class is introduced and shown to exponentially outperform the classical Monte Carlo estimator in terms of coverage probability and/or confidence interval length. The theoretical results are applied to financial portfolios, verifying that the deviation probability of returns on portfolios of many securities is robust against the idiosyncratic risks of individual assets.

Aaditya Ramdas

Liberals Meet Conservatives:  Deep Generative Adversarial Networks (GANs) Meet Kernel Two Sample Tests

I will talk about recent work that combines the most classical topic in statistics (hypothesis testing) with the most modern hammer in ML (deep learning). Generative adversarial networks (GANs) are at the current frontier of fully unsupervised learning of complex data like natural images. Imagine that we have a complex dataset (like handwritten digits, or landscapes, or cat photos) and we would like to learn a generative model that can output realistic new data with similar qualities to the input data. GANs attempt to accomplish this by setting two deep networks against each other in a two-player minimax game --- a generator produces new data, and a discriminator attempts to differentiate input data from generated data. However, the training of such GANs has been a major practical issue, and it has also been unclear how to evaluate such generative models (likelihood-free, label-free model criticism). I will describe recent progress towards both of these goals, achieved by deriving new results on hyperparameter tuning for maximizing the power of kernel two sample testing. This work has been accepted for publication at a deep learning conference, ICLR 2017.

Sujayam Saha, Statistics Department

Better FDR Control With Covariates

In multiple testing problems where informative covariates can be naturally assigned to hypotheses, e.g. hypotheses are arranged in spacetime, we propose and validate a likelihood-based method that provides more powerful inference while retaining control over the false discovery rate. We establish a general alternating maximization framework for computing our estimate and present both theoretical and numerical performance. Further, we present an application of our methodology in detecting neural activation.

Courtney Schiffman, Statistics Department

Variable Selection in Untargeted Metabolomics Studies

Untargeted metabolomics research allows for a widespread study of possible biomarkers associated with a variety of diseases. In these high-dimensional, untargeted studies involving highly correlated metabolites, accurate and stable variable selection methods are key. We developed a variable selection method for untargeted metabolomics research, which relies on several parametric and machine learning variable selection methods to identify metabolites associated with disease and exposure. We show that our variable selection method identifies potentially significant metabolites which can easily be missed if relying upon standard variable selection methods in metabolomics research.

Yotam Shem-Tov, Economics Department

Efficient Variance Estimators of Different Sample Average

Since Neyman (1923/1990), the question of how to conduct inference for the Sample Average Treatment Effect (SATE) using the difference-in-means estimator has remained open. No consistent variance estimator exists, and various conservative estimators have been suggested. We show that when the estimand of interest is the Sample Average Treatment Effect of the Treated (SATT, or SATC for controls), a consistent and substantially more efficient variance estimator exists. Although these estimands are equal to the SATE both in expectation and asymptotically, potentially large differences in both efficiency and coverage can occur with the change of estimand, even asymptotically. We provide analytical results, simulations, and a real data empirical application to illustrate the gains and concerns from a change of estimand. When the estimand of interest is the SATT (or SATC), a confidence interval for the SATE can be inefficient and provide incorrect coverage. We derive new variance formulas that provide both efficiency gains and correct coverage for the SATT (and SATC). We show that inference on the SATT and SATC yields a decomposition of the uncertainty concerning the SATE into two components: (i) uncertainty over the effect of the treatment on the units who have been exposed to the treatment regime; and (ii) uncertainty over the effect the treatment will have on units who have not been exposed under the current treatment allocation.
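
For readers unfamiliar with the baseline this abstract starts from, here is a minimal sketch of the difference-in-means estimator together with the classical conservative (Neyman-style) variance estimate. It is the textbook starting point only, not the new estimators proposed in the talk, and the function names and toy data are assumptions made for illustration.

```python
from statistics import mean, variance


def difference_in_means(treated, control):
    """Point estimate: mean outcome of treated units minus mean of controls."""
    return mean(treated) - mean(control)


def neyman_variance(treated, control):
    """Classical conservative variance estimate s1^2/n1 + s0^2/n0.

    Uses the sample variance (n - 1 denominator) within each arm; it is
    conservative for the SATE because the variance of unit-level effects
    cannot be estimated from the observed data.
    """
    return variance(treated) / len(treated) + variance(control) / len(control)


if __name__ == "__main__":
    treated = [3.0, 5.0, 7.0]   # toy outcomes, made up for this example
    control = [1.0, 2.0, 3.0]
    print(difference_in_means(treated, control), neyman_variance(treated, control))
```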

Simon Walter, Statistics Department

Estimating Where a Model is Valid

In several problems of practical interest a model holds for only a subset of the feature space; for example, if a material is subject to stress, it is well known that the distortion of the material can be described by a linear model, provided the stress applied is not too extreme. We describe and investigate the properties of an extension of the Patient Rule Induction Method (PRIM) that permits simultaneous estimation of the parameters of a model and the subset of the feature space where the model is valid. Our extensions are tailored to estimate rectangular, convex and star-shaped subsets.

Shusen Wang, Statistics Department

Sketched Ridge Regression

Matrix sketching, including random projection and random column/row selection, turns big matrices into much smaller ones without losing too much useful information. Previous work has applied matrix sketching to speed up least squares regression (LSR) on n >> d data. Theoretical analysis of sketched LSR has been well established and refined. How the results extend to ridge regression is as yet unclear. Our recent work studies two types of sketched ridge regression (the classical sketch and the Hessian sketch) from the optimization and statistical perspectives and draws many useful conclusions. The optimization analysis shows that the sketched solutions can be nearly as good as the optimal one; in contrast, the statistical analysis clearly indicates that the two sketched solutions significantly increase bias or variance. Our conclusion is that the practical usefulness of sketched ridge regression may be very limited. We also propose a simple method, which we call model averaging, to improve the quality of the sketched solution, both theoretically and empirically. We argue that model averaging has several very useful applications in practice.

Yu Wang, Statistics Department

Global Identifiability of Complete Dictionary Learning Using L1 Minimization

Global identifiability is a fundamental property to consider in dictionary learning. Suppose the data come from a generative model with infinite sample size: will the L1 minimization framework be able to recover the reference dictionary? In the complete case, we obtain a necessary and sufficient condition for the reference dictionary to be the unique "sharp" local minimum among all possible dictionaries. This result sheds light on the global identifiability of general sparse coding problems.

Chelsea Zhang, Statistics Department

Generating Smooth, Optimal Trajectories for Robots

In robotics, trajectory optimization algorithms seek a minimum-cost path from a start to a goal configuration. Popular trajectory optimization software, such as TrajOpt, represents the trajectory as a sequence of linear segments between waypoints. However, this parameterization requires many waypoints for a semblance of smoothness. We propose to represent trajectories as elements of a reproducing kernel Hilbert space (RKHS), which are inherently smooth. We adapt TrajOpt to use the RKHS parameterization of trajectories and provide a preliminary implementation.