Berkeley Statistics Annual Research Symposium (BSTARS)
The Berkeley Statistics Annual Research Symposium (BSTARS) surveys the latest research developments in the department, with an emphasis on possible applications to statistical problems encountered in industry. The conference consists of keynote lectures given by faculty members, talks by PhD students about their thesis work, and presentations of industrial research by alliance members.
Schedule
BSTARS 2017 will be March 23rd, 1:30pm-8:30pm, at The Alumni House, UC Berkeley.
1:30-2:00  Arrival, coffee, and pastries
2:00-2:10  Welcome and opening remarks

2:10-3:10  Thunder Talks
3:10-3:30  Break
3:30-4:30  Thunder Talks 2
4:30-5:50  Poster Sessions
5:50-6:30  Keynote by Professor Jasjeet Sekhon
6:30-8:30  Dinner
Keynote Speaker
Causal Inference in the Age of Big Data
Jasjeet S. Sekhon is Robson Professor of Political Science and Statistics and Senior Fellow at the Berkeley Institute for Data Science. His current research focuses on methods for causal inference in observational and experimental studies and evaluating social science, public health, medical, and digital interventions. He advises companies and election campaigns on statistical methods and on the design, implementation and analysis of large-scale experiments.
Thunder Talks by Industry Alliance Program Members
Microsoft Presenter: Dr. Gireeja Ranade, Postdoc Researcher, MSR
State Street Presenter: Dr. John Arabadjis, Managing Director, GX Labs
StubHub Presenters:
Dr. Xin Heng, Senior Manager, Data Science
Dr. Corey Reed, Lead Data Scientist
Citadel Presenter: Dr. Mikhail Traskin, Quantitative Researcher
Clindata Insight Presenter: Dr. Peng Yang, President
Deloitte Presenter: Dr. Bambo Sosina, Senior Specialist
Voleon Presenter: Dr. Stephen Reid, Research Staff
TGS Management Presenters:
Dr. Philip Naecker, Chief Technology Officer
Michael Junge, Director of Talent Acquisition
Thunder Talks and Poster Presentations of PhD Students and Postdocs
Rebecca Barter, Statistics Department
Acute Rejection in Kidney Transplant Patients With HIV: Developing Strategies for Dynamic Prediction
Over the past few decades, HIV has evolved from a death sentence to a manageable chronic condition with HAART therapies drastically extending the life expectancy of HIV-positive individuals. As a consequence of prolonged survival, HIV-associated conditions such as kidney and liver disease are resulting in an increased demand for organ transplantation. While transplantation is proving effective in terms of patient survival, HIV-positive patients exhibit a surprisingly high rate of kidney rejection relative to their HIV-negative counterparts. Together with the Narwal lab at UCSF, we are developing novel analytic strategies for dynamically predicting and understanding kidney rejection based on 'omics data measured from a range of graft biopsy samples taken over time. In this talk I will discuss our first steps towards an analytic solution and discuss the challenges presented by this problem.
Billy Fang, Statistics Department
On the Risk of Convex-constrained Least Squares Estimators Under Misspecification
The constrained least squares estimator is a natural estimator of the mean of an isotropic Gaussian vector when it is known that the mean lies in some convex set. The risk of this estimator is known to be characterized by the statistical dimension of a tangent cone. As a step toward understanding the behavior of the estimator when the mean does not lie in the convex set (model misspecification), we prove an analogous characterization of risk in the misspecified case when the convex set is a polyhedron, and show that the risk can be much smaller than in the well-specified setting. This is joint work with Professor Adityanand Guntuboyina.
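As a rough illustration of what a convex-constrained least squares estimator computes, the sketch below projects an observed vector onto one particular polyhedral convex set: the cone of nondecreasing sequences. The abstract does not name a specific constraint set, so the choice of the monotone cone, the pool-adjacent-violators approach, and the function name monotone_least_squares are all assumptions made for the example.

```python
def monotone_least_squares(y):
    """Least squares projection of y onto nondecreasing sequences.

    Uses pool-adjacent-violators: neighbouring blocks whose means are
    out of order are merged and replaced by their pooled mean.
    The monotone cone is an illustrative choice of convex set.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


if __name__ == "__main__":
    # The observation violates monotonicity at positions 2 and 3.
    print(monotone_least_squares([1.0, 3.0, 2.0, 4.0]))
```

When the observation already lies in the set, the estimator returns it unchanged; otherwise it returns the closest point of the set in Euclidean distance.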
Ryan Giordano, Statistics Department
How Bad Could it Be? Worst-case Prior Sensitivity Estimates for Variational Bayes
In Bayesian analysis, the posterior follows from the data and a choice of a prior and a likelihood. One hopes that the posterior is robust to reasonable variation in the choice of prior and choice of likelihood, since this choice is made by the modeler and is necessarily somewhat subjective. For example, the process of prior elicitation may be prohibitively time-consuming; two practitioners may have irreconcilable subjective prior beliefs; or the model may be so complex and high-dimensional that humans cannot reasonably express their prior beliefs as formal distributions. All of these circumstances might give rise to a range of reasonable prior choices. A posterior quantity is “robust” to the extent that it does not change much when calculated under these different prior choices.
Sören Künzel, Statistics Department
Heterogeneous Treatment Effect Estimation Using Random Forest
There is growing interest in estimating heterogeneous treatment effects in experimental and observational studies, with applications ranging from personalized medicine to online advertisement recommendation systems. We describe a number of algorithms based on random forests that estimate the conditional average treatment effect (CATE) function, and we compare them using theoretical results under a simple causal model and simulation studies.
Lihua Lei, Statistics Department
Bayesian Inference on Gaussian Graphical Models With Soft G-Wishart Distributions
The Gaussian graphical model has been a popular tool in statistics to model the dependence or the causality among a group of variables. We propose a new family of conjugate priors and build a Bayesian hierarchical model based on that. We then develop a fully automatic and tuning-free procedure to estimate the graph structure as well as the precision matrix. Via intensive simulations, we show that our method consistently outperforms the existing frequentist approaches, e.g. the graphical LASSO, in finite samples.
Jamie Murdoch, Statistics Department
Peeking into the Black Box: A Step Towards Interpretable Deep Learning
Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight into the underlying learned patterns. In this talk I consider the popular Long Short-Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state-of-the-art LSTMs on sentiment analysis and question answering into a set of representative phrases. This work is joint with Facebook Artificial Intelligence Research.
Kellie Ottoboni, Statistics Department
Simple Random Sampling: Not So Simple
The theory of inference from simple random samples (SRSs) is fundamental in statistics; many statistical techniques and formulae assume that the data are an SRS. True random samples are rare; in practice, people tend to draw samples by using pseudorandom number generators (PRNGs) and algorithms that map a set of pseudorandom numbers into a subset of the population. Most statisticians take for granted that the software they use "does the right thing," producing samples that can be treated as if they are SRSs. In fact, the PRNG and the algorithm for drawing samples matter enormously. We show, using basic counting principles, that some widely used methods cannot generate all SRSs of a given size, and those that can do not always do so with equal frequencies in simulations. We compare the "randomness" and computational efficiency of commonly used PRNGs to PRNGs based on cryptographic hash functions, which avoid these pitfalls. We judge these PRNGs by their ability to generate SRSs and find in simulations that their relative merits vary by seed, population and sample size, and sampling algorithm. These results are not just limited to SRSs but have implications for all resampling methods, including the bootstrap, MCMC, and Monte Carlo integration.
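One way to picture the kind of counting argument mentioned above is a pigeonhole bound: a deterministic sampling routine driven by a PRNG with b bits of internal state can produce at most 2^b distinct samples, while the number of possible samples of size k from a population of n is "n choose k". The sketch below compares the two; the function name and the 32-bit example are assumptions for illustration and are not taken from the talk.

```python
import math


def unreachable_fraction(population_size, sample_size, state_bits):
    """Lower bound on the fraction of simple random samples that a
    deterministic sampler seeded by a PRNG with `state_bits` bits of
    state can never produce (pigeonhole bound, illustrative only).
    """
    n_samples = math.comb(population_size, sample_size)
    n_states = 2 ** state_bits  # at most one distinct sample per PRNG state
    if n_samples <= n_states:
        return 0.0  # the bound says nothing in this case
    return 1 - n_states / n_samples


if __name__ == "__main__":
    # Samples of size 10 from a population of 50, with a 32-bit state.
    print(unreachable_fraction(50, 10, 32))
```

A bound of zero does not mean every sample is reachable or equally likely; it only means this counting argument cannot rule that out.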
Fanny Perraudeau, Biostatistics Department
Zero-inflated Negative Binomial Model for Single-cell RNA
Single-cell RNA sequencing (scRNA-Seq) is a powerful and relatively young technique for characterizing molecular states of individual cells through their transcriptional profiles. It represents a major advance with respect to standard bulk RNA sequencing, which is only capable of measuring gene expression levels averaged over millions of cells. Accessing cell-to-cell variability is crucial for disentangling complex heterogeneous tissues, and for understanding dynamic biological processes, such as embryo development and cancer. Because of the tiny amount of RNA present in a single cell, the input material needs to go through many rounds of amplification before being sequenced. This results in a strong amplification bias as well as the presence of dropouts, i.e., genes that fail to be detected even though they are expressed in the sample. Here, we present a general and flexible zero-inflated negative binomial (ZINB) model that leads to low-dimensional representations of the data by accounting for zero-inflation (dropouts) and for the count nature of the data.
Farzad Pourbabaee, Economics Department
Large Deviations of Factor-models With Regularly-varying Tails: Asymptotics and Efficient Estimation
I analyze the large deviation probability of factor models generated from components with regularly-varying tails, a large subclass of heavy-tailed distributions. An efficient sampling method (based on conditioning) for tail probability estimation of this class is introduced and shown to exponentially outperform the classical Monte Carlo estimator, in terms of coverage probability and/or confidence interval length. The obtained theoretical results are applied to financial portfolios, verifying that the deviation probability of returns on portfolios of many securities is robust to the idiosyncratic risks of individual assets.
Aaditya Ramdas
Liberals Meet Conservatives: Deep Generative Adversarial Networks (GANs) Meet Kernel Two Sample Tests
I will talk about recent work that combines the most classical topic in statistics (hypothesis testing) with the most modern hammer in ML (deep learning). Generative adversarial networks (GANs) are at the current frontier of fully unsupervised learning of complex data like natural images. Imagine that we have a complex dataset (like handwritten digits, or landscapes, or cat photos) and we would like to learn a generative model that can output realistic new data with similar qualities to the input data. GANs attempt to accomplish this by setting two deep networks against each other in a two-player minimax game: a generator produces new data, and a discriminator attempts to differentiate input data from generated data. However, the training of such GANs has been a major practical issue, and it has also been unclear how to evaluate such generative models (likelihood-free, label-free model criticism). I will describe recent progress towards both of these goals, achieved by deriving new results on hyperparameter tuning for maximizing the power of kernel two-sample testing. This work has been accepted for publication at a deep learning conference, ICLR 2017.
Sujayam Saha, Statistics Department
Better FDR Control With Covariates
In multiple testing problems where informative covariates can be naturally assigned to hypotheses, e.g. hypotheses are arranged in space-time, we propose and validate a likelihood-based method that provides more powerful inference while retaining control of the false discovery rate. We establish a general alternating maximization framework for computing our estimate and present both theoretical and numerical performance. Further, we present an application of our methodology in detecting neural activation.
Courtney Schiffman, Statistics Department
Variable Selection in Untargeted Metabolomics Studies
Untargeted metabolomics research allows for a widespread study of possible biomarkers associated with a variety of diseases. In these high-dimensional, untargeted studies involving highly correlated metabolites, accurate and stable variable selection methods are key. We developed a variable selection method for untargeted metabolomics research, which relies on several parametric and machine learning variable selection methods to identify metabolites associated with disease and exposure. We show that our variable selection method identifies potentially significant metabolites which can easily be missed if relying upon standard variable selection methods in metabolomics research.
Yotam Shem-Tov, Economics Department
Efficient Variance Estimators of Different sample Average
Since Neyman (1923/1990), the question of how to conduct inference for the Sample Average Treatment Effect (SATE) using the difference-in-means estimator has remained open. No consistent variance estimator exists, and various conservative estimators have been suggested. We show that when the estimand of interest is the Sample Average Treatment Effect of the Treated (SATT or SATC for controls) a consistent and substantially more efficient variance estimator exists. Although these estimands are equal to the SATE both in expectation and asymptotically, potentially large differences in both efficiency and coverage can occur by the change of estimand, even asymptotically. We provide analytical results, simulations, and a real data empirical application to illustrate the gains and concerns from a change of estimand. When the estimand of interest is the SATT (or SATC) a confidence interval for the SATE can be inefficient and provide incorrect coverage. We derive new variance formulas that provide both efficiency gains and correct coverage for the SATT (and SATC). We show that inference on the SATT and SATC yields a decomposition of the uncertainty concerning the SATE into two components: (i) uncertainty over the effect of the treatment on the units who have been exposed to the treatment regime; and (ii) uncertainty over the effect the treatment will have on units who have not been exposed under the current treatment allocation.
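For readers unfamiliar with the baseline, the sketch below computes the difference-in-means estimate together with the classical conservative variance estimate usually attributed to Neyman: the sum of the two within-group sample variances, each divided by its group size. This is the standard textbook estimator, not the new SATT/SATC variance formulas from the talk, and the function name and toy data are assumptions for illustration.

```python
from statistics import mean, variance


def difference_in_means(treated, control):
    """Difference-in-means estimate and the classical conservative
    (Neyman-style) variance estimate for a two-arm experiment.

    `treated` and `control` are sequences of observed outcomes, each
    with at least two values. Illustrative baseline only.
    """
    estimate = mean(treated) - mean(control)
    # Sample variances use the n - 1 denominator.
    var_estimate = (variance(treated) / len(treated)
                    + variance(control) / len(control))
    return estimate, var_estimate


if __name__ == "__main__":
    effect, var_hat = difference_in_means([3.0, 5.0, 7.0], [1.0, 2.0, 3.0])
    print(effect, var_hat)
```

The estimator is called conservative because, for the SATE, its expectation is at least the true variance of the difference in means, so intervals built from it tend to be too wide rather than too narrow.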
Simon Walter, Statistics Department
Estimating Where a Model is Valid
In several problems of practical interest a model holds for only a subset of the feature space; for example, if a material is subject to stress, it is well known that the distortion of the material can be described by a linear model, provided the stress applied is not too extreme. We describe and investigate the properties of an extension of the Patient Rule Induction Method (PRIM) that permits simultaneous estimation of the parameters of a model and the subset of the feature space where the model is valid. Our extensions are tailored to estimate rectangular, convex and star-shaped subsets.
Shusen Wang, Statistics Department
Sketched Ridge Regression
Matrix sketching, including random projection and random column/row selection, turns big matrices into much smaller ones without losing too much useful information. Previous work has applied matrix sketching to speed up least squares regression (LSR) on n >> d data. Theoretical analysis of sketched LSR has been well established and refined. How the results extend to ridge regression is not yet clear. Our recent work studies two types of sketched ridge regression, the classical sketch and the Hessian sketch, from both the optimization and the statistical perspective, and draws many useful conclusions. The optimization analysis shows that the sketched solutions can be nearly as good as the optimal; in contrast, the statistical analysis clearly indicates that the two sketched solutions significantly increase bias or variance. Our conclusion is that the practical usefulness of sketched ridge regression may be very limited. We also propose a simple method, which we call model averaging, to improve the quality of the sketched solution, both theoretically and empirically. We argue that model averaging has several very useful applications in practice.
Yu Wang, Statistics Department
Global Identifiability of Complete Dictionary Learning Using L1 Minimization
Global identifiability is a fundamental property to consider in dictionary learning. Suppose the data come from a generative model with infinite sample size: will the L1 minimization framework be able to recover the reference dictionary? In the complete case, we obtain a necessary and sufficient condition for the reference dictionary to be the unique "sharp" local minimum among all possible dictionaries. This result sheds light on the global identifiability of general sparse coding problems.
Chelsea Zhang, Statistics Department
Generating Smooth, Optimal Trajectories for Robots
In robotics, trajectory optimization algorithms seek a minimum-cost path from a start to a goal configuration. Popular trajectory optimization software, such as TrajOpt, represents the trajectory as a sequence of linear segments between waypoints. However, this parameterization requires many waypoints for a semblance of smoothness. We propose to represent trajectories as elements of a reproducing kernel Hilbert space (RKHS), which are inherently smooth. We adapt TrajOpt to use the RKHS parameterization of trajectories and provide a preliminary implementation.