# BSTARS 2019

**2019 Berkeley Statistics Annual Research Symposium (BSTARS)**

The Berkeley Statistics Annual Research Symposium (BSTARS) surveys the latest research developments in the department, with an emphasis on possible applications to statistical problems encountered in industry. The conference consists of keynote lectures given by faculty members, talks by PhD students about their thesis work, and presentations of industrial research by alliance members. The day-long symposium gives our graduate students, faculty, and industry partners an opportunity to connect, discuss research, and review the newest developments happening on campus and in the field of statistics.

## Schedule

BSTARS 2019 will be March 21st, 1:30pm-8:30pm, at The Alumni House, UC Berkeley.

1:30-2:00 Arrival, coffee, and pastries

2:00-2:10 Welcome and opening remarks

David Culler, Interim Dean, Data Science

2:10-3:15 Thunder Talks

3:15-3:30 Break

3:30-4:55 Thunder Talks 2

4:55-6:00 Poster Sessions

6:00-6:45 Keynote by Professor Rasmus Nielsen

6:45-8:30 Dinner, from Cancun in Berkeley

## Keynote Speaker: Rasmus Nielsen

Rasmus Nielsen is a professor of computational biology at UC Berkeley (since 2008) and a professor of biology at the University of Copenhagen (since 2004). He received his PhD in Integrative Biology from UC Berkeley in 1998, spent two years as a postdoc at Harvard University, and held his first faculty position at Cornell University from 2000 to 2004 in the Department of Biometrics (now BSCB). His research focuses on developing statistical and computational methods for analyses of genomic data. His methods, distributed in popular packages such as IM and PAML, have been used in numerous scientific studies. He is a Senior Editor for Genetics and an Associate Editor for Molecular Biology and Evolution.

His work is on statistical and population genetic analyses of genomic data, in particular methods for detecting natural selection, describing population genetic variation, inferring demography, and methods for association mapping. Much of his current research concerns statistical analysis of next-generation sequencing data, both in the context of medical genetics and population genetics. Many of the methods he has developed are heavily used by other researchers, including the phylogeny based methods for detecting positive selection implemented in PAML, the methods for inferring demographic histories implemented in the IM and IMa programs, the method for detecting selective sweeps implemented in SweepFinder, and the methods for analysing Next Generation Sequencing (NGS) data implemented in ANGSD.

**Lab**: http://www.nielsenlab.org/

## Thunder Talks & Poster Presentations by PhDs and Post-Docs

**Amir Gholami**, Post-Doc EECS

*Neural Networks Through the Lens of the Hessian*: The de facto method for training neural networks is Stochastic Gradient Descent (SGD) over randomly selected batches. However, the performance of SGD is very sensitive to hyper-parameter tuning, and currently there is no clear way to find these hyper-parameters except brute-force search. Another important problem is that in practice SGD is run for a fixed amount of time, without knowing whether a local minimum has actually been reached. In this talk, we will present a novel Hessian-based framework for analyzing neural network training and inference. This framework has led to significant progress on a number of problems, ranging from speeding up training to model compression.
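As a toy illustration of the kind of quantity a Hessian-based analysis tracks (this sketch is ours, not code from the talk), the top Hessian eigenvalue of a loss can be found by power iteration using only Hessian-vector products. In deep learning the products would come from automatic differentiation; here we use a small explicit quadratic for clarity:

```python
def hessian_vector_product(hessian, v):
    """Multiply an explicit Hessian matrix by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in hessian]

def top_eigenvalue(hessian, n_iter=100):
    """Power iteration: repeatedly apply the Hessian to a vector and
    normalize; the Rayleigh quotient converges to the top eigenvalue."""
    v = [1.0] * len(hessian)
    for _ in range(n_iter):
        hv = hessian_vector_product(hessian, v)
        norm = sum(x * x for x in hv) ** 0.5
        v = [x / norm for x in hv]
    hv = hessian_vector_product(hessian, v)
    return sum(a * b for a, b in zip(hv, v))

# Hessian of the quadratic loss 0.5 * (4 * w1**2 + w2**2); eigenvalues 4 and 1
H = [[4.0, 0.0], [0.0, 1.0]]
sharpness = top_eigenvalue(H)  # converges to 4.0
```

The same iteration scales to neural networks because it never forms the Hessian, only its action on a vector.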

**Boying Gong**, PhD Biostatistics

*Differentially Methylated Region Detection with Change Point Models*: Whole-genome bisulfite sequencing (WGBS) provides a precise measure of methylation across the genome, yet presents a challenge in identifying regions that are differentially methylated (DMRs) between different conditions. Many methods have been developed, focusing primarily on the setting of two-group comparison. We develop MethCP, a DMR detection method for WGBS data, which is applicable to a wide range of experimental designs beyond two-group comparisons, such as time-course data. MethCP identifies DMRs based on change point detection, which naturally segments the genome and provides region-level differential analysis. For simple two-group comparison, we show that our method outperforms existing methods in accurately detecting the complete DM region on a simulated dataset and an Arabidopsis dataset. Moreover, we show that MethCP is capable of detecting wide regions with small effect sizes, which are common in some settings but poorly detected by existing techniques. We also demonstrate the use of MethCP for time-course data on another dataset following methylation throughout seed germination in Arabidopsis.
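Change point detection of the kind MethCP builds on can be illustrated with the classical CUSUM statistic for a single mean shift (a generic sketch, not the MethCP implementation):

```python
def cusum_changepoint(x):
    """Locate a single change point in the mean: pick the split that
    maximizes the scaled difference between left and right segment means."""
    n = len(x)
    total = sum(x)
    best_k, best_stat, left = None, -1.0, 0.0
    for k in range(1, n):
        left += x[k - 1]
        mean_l = left / k
        mean_r = (total - left) / (n - k)
        # scale by sqrt(k*(n-k)/n) so the statistic is comparable across splits
        stat = abs(mean_l - mean_r) * (k * (n - k) / n) ** 0.5
        if stat > best_stat:
            best_stat, best_k = stat, k
    return best_k, best_stat

# Toy methylation-level track with a shift after the fourth position
data = [0.1, -0.2, 0.0, 0.1, 2.1, 1.9, 2.0, 2.2]
k, stat = cusum_changepoint(data)  # k == 4: split between the two regimes
```

Applied genome-wide, each detected segment becomes a candidate region for region-level differential analysis.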

**Zoe Vernon**, PhD Statistics

*Leveraging Molecular Data in Drug Discovery*: Detecting relationships between differential expression measures is increasingly important as the cost of genomic technologies continues to decrease and large-scale profiling of molecular features in diseases, and of their changing expression after exposure to drugs, becomes more readily available. Previous work has shown that drugs which reverse the expression of disease-associated genes have the potential to be efficacious for treating the disease in question. Statistics that summarize this reversal require detection of local negative dependence at the extremes of the expression profiles. This pattern of association is often missed by standard measures of correlation, particularly in the presence of outliers. Additionally, previous statistics in this field rely on applying thresholds to the disease expression profiles to isolate genes believed to be associated with the disease. The arbitrary nature of these thresholds and the resulting loss of information are problematic. We propose a rank-based count statistic designed to detect local dependence that does not use such a threshold. The statistic is robust to outliers, and we have derived results about its asymptotic behavior. In simulation studies and real data, our statistic is comparable to or outperforms other measures.

**Koulik Khamaru**, PhD Statistics

*Model Misspecification and EM Convergence*: Mixture models play a central role in statistical applications, where they are used to capture heterogeneity of the data arising from several underlying subpopulations. In the setting of Gaussian mixture models, one popular way to estimate the underlying subpopulations is the EM algorithm. While recent years have witnessed substantial progress in proving an "exponential" convergence rate of EM for correctly specified Gaussian mixture models with well-separated mixing components, the behavior of the EM algorithm under misspecified models is not well understood. Indeed, in contrast to the statistical minimax rate of n^(-1/2) in correctly specified Gaussian mixture models, depending on the type of misspecification, the statistical minimax rates of misspecified models can be as large as n^(-1/4) or n^(-1/8), which suggests that analyzing the behavior of the EM algorithm in these settings is rather challenging.

In a recent series of works, we explain a connection between the rates of convergence of the EM algorithm and the statistical minimax rates. More concretely, in contrast to the "exponential" convergence of EM in correctly specified models, the EM algorithm exhibits a "polynomial" rate of convergence in misspecified Gaussian mixture models; furthermore, the degree of this polynomial rate is related to the statistical minimax rates of n^(-1/4) or n^(-1/8). Finally, the characterization of the slow convergence of EM under over-specified singular mixtures is not merely of theoretical interest. In fact, the model-dependent rate of convergence of the EM algorithm may provide novel, practical methodologies for model selection.
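For concreteness, here is a minimal EM iteration for a two-component Gaussian location mixture (a generic sketch of the algorithm being analyzed, not the talk's code; equal weights and a known shared variance are assumed to keep it short):

```python
import math
import random

def em_two_gaussians(data, mu_init=(-1.0, 1.0), sigma=1.0, n_iter=50):
    """EM for a two-component Gaussian mixture with equal weights and
    known shared variance; only the two means are estimated."""
    mu1, mu2 = mu_init
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each point
        resp = []
        for x in data:
            p1 = math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
            p2 = math.exp(-0.5 * ((x - mu2) / sigma) ** 2)
            resp.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means
        w1 = sum(resp)
        w2 = len(data) - w1
        mu1 = sum(r * x for r, x in zip(resp, data)) / w1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / w2
    return mu1, mu2

random.seed(0)
# Correctly specified, well-separated case: EM recovers the means quickly
data = [random.gauss(-2, 1) for _ in range(500)] + \
       [random.gauss(2, 1) for _ in range(500)]
mu1, mu2 = em_two_gaussians(data)
```

In this well-specified, well-separated regime EM converges fast; the talk's point is that under misspecification the same iteration can slow to a polynomial rate tied to the minimax rates above.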

**Aurelien Bibaut**, PhD Biostatistics

*More Efficient Policy Value Evaluation through Regularized Targeted Learning*: We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In particular, we introduce a novel doubly-robust estimator for the OPE problem in RL, based on the Targeted Maximum Likelihood Estimation principle from the statistical causal inference literature. We also introduce several variance reduction techniques that lead to impressive performance gains in off-policy evaluation. We show empirically that our estimator uniformly wins over existing off-policy evaluation methods across multiple RL environments and various levels of model misspecification. Finally, we further the existing theoretical analysis of estimators for the RL off-policy estimation problem by showing their rate of convergence and characterizing their asymptotic distribution.

**Amanda Glazer**, PhD Statistics

*Look Who’s Talking: Gender Differences in Academic Job Talks*: Prior research has shown that in academic job talks in engineering, female presenters received more questions and interruptions than male presenters. We seek to replicate and expand on this work by analyzing academic job talks in STEM fields over the past few years at UC Berkeley. Our coding scheme includes additional measures to differentiate between types of questions and comments in order to further understand gender-based differences. Our method of analysis differs from previous work in that we use permutation tests to evaluate gender-based differences. Initial results from Engineering replicate previous findings: women face more comments than men.
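A two-sample permutation test of the sort mentioned can be sketched in a few lines. The data and the difference-in-means statistic below are hypothetical illustrations, not the study's actual coding scheme or results:

```python
import random

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means: shuffle the
    pooled data and count how often the permuted difference is at least
    as extreme as the observed one."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm

# Hypothetical counts of audience comments per talk, by presenter gender
women = [8, 10, 9, 12, 11]
men = [5, 6, 7, 5, 8]
p = permutation_test(women, men)
```

Because the null distribution is built by reshuffling the observed labels, the test makes no distributional assumptions about the comment counts.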

**Yue You**, PhD Biostatistics

*Targeted Learning of the Population Size based on Capture-Recapture Designs*: We propose a modern method to estimate population size based on capture-recapture designs of K (>1) samples. The observed data are formulated as a biased sample of n i.i.d. K-dimensional vectors of binary indicators from a conditional distribution given the vector not equal to 0, where the k-th component indicates that the subject was caught by the k-th sample. The target quantity is the probability that the vector is not equal to 0. We focus on models assuming a single constraint so that the target quantity is identified by a nonparametric statistical target parameter. We provide solutions for common constraints (K-way additive interaction being 0 and conditional independence). For the K-way multiplicative interaction being 0, the statistical target parameter is only defined when the probability on each of the 2^K cells is positive, and the MLE naturally suffers from the curse of dimensionality. We propose a targeted MLE that combines machine learning to smooth across the cells while targeting the fit towards the single-valued target parameter of interest. For each problem, we provide simulations under assumption violations, inference with confidence intervals, and experimental designs, implemented in R software.
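For the simplest case, K = 2 under a conditional-independence constraint, the population-size estimand reduces to the classical Lincoln-Petersen estimator. The sketch below shows that textbook special case as an illustration, not the targeted MLE proposed in the talk:

```python
def lincoln_petersen(n1, n2, m):
    """Classical Lincoln-Petersen estimate of population size from a
    two-sample capture-recapture design: n1 subjects caught in sample 1,
    n2 in sample 2, m caught in both (assumes the samples are independent)."""
    if m == 0:
        raise ValueError("no recaptures: estimate is undefined")
    return n1 * n2 / m

# 100 subjects caught in sample 1; sample 2 catches 80, of which 20
# were already seen in sample 1
N_hat = lincoln_petersen(100, 80, 20)  # -> 400.0
```

The general-K, general-constraint setting of the talk replaces this closed form with a nonparametric target parameter estimated by targeted maximum likelihood.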

**Saad Mouti**, Post-Doc CDAR

*Sustainable Responsible Investing and the Cross-Section of Return and Risk*: The identification of factors that predict the cross-section of stock returns has been a focus of asset pricing theory for decades. We address this challenging problem for both equity performance and risk, the latter through the maximum drawdown measure. We test a variety of regression-based models used in the field of supervised learning including penalized linear regression, tree-based models, and neural networks. Using empirical data in the US market from January 1980 to June 2018, we find that a number of firm characteristics succeed in explaining the cross-sectional variation of active returns and maximum drawdown, and that the latter has substantially better predictability. Non-linear models materially add to the predictive power of linear models. Finally, environmental, social, and governance impact enhances predictive power for non-linear models when the number of variables is reduced.

**Sara Stoudt**, PhD Statistics

*Species Distribution and Abundance Models: The Good, The Bad, and The Not Identifiable*: Ecologists commonly make strong parametric assumptions when formulating statistical models. Such assumptions have sparked repeated debates in the literature about statistical identifiability of species distribution and abundance models, among others. At issue is whether the assumption of a particular parametric form serves to impose artificial statistical identifiability that should not be relied upon or instead whether such an assumption is part and parcel of statistical modeling. We borrow from the econometrics literature to introduce a broader view of the identifiability problem than has been taken in ecological debates. In particular we review the concept of nonparametric identifiability and show what can go wrong when we lack this strong form of identifiability.

**Nima Hejazi**, PhD Biostatistics

*Robust Inference on the Causal Effects of Stochastic Interventions under Two-Phase Sampling, with Applications in Vaccine Efficacy Trials*: Much of the focus of statistical causal inference has been devoted to assessing the effects of static interventions, which specify a fixed contrast of counterfactual intervention values to evaluate a given causal effect. Under violations of the assumption of positivity, the evaluation of such interventions faces a host of problems, chief among them non-identification and inefficiency. Stochastic interventions provide a promising solution to these fundamental issues, by allowing for the counterfactual intervention distribution to be defined as a function of its natural (observed) distribution. While such approaches are promising, real data analyses are often further complicated by economic constraints, such as when the primary variable of interest is far more expensive to collect than auxiliary covariates. Two-phase sampling schemes offer a promising solution to such problems --- unfortunately, their use produces side effects that require further adjustment when inference remains the principal goal of a study. We present a novel approach for use in such settings: An augmented targeted minimum loss-based estimator for the causal effects of stochastic interventions, with guarantees of consistency, efficiency, and multiple robustness even in the presence of two-phase sampling. We illustrate the utility of employing our proposed nonparametric estimator via simulation study, demonstrating that it attains fast convergence rates even when incorporating flexible machine learning estimators; moreover, we introduce two recent open source software implementations of the methodology, the txshift and tmle3shift R packages. 
Using data from a recent HIV vaccine efficacy trial, we show that the proposed methodology obtains efficient inference on a parameter defined as the overall risk of HIV infection in the vaccine arm of an efficacy trial, under arbitrary posited shifts of the distribution of an immune response marker away from its observed distribution in the efficacy trial. The resultant technique provides a highly interpretable variable importance measure for ranking multiple immune responses based on their utility as immunogenicity study endpoints in future HIV-1 vaccine trials that evaluate putatively improved versions of the vaccine.

**Miyabi Ishihara**, PhD Statistics

*Characterization of Spatial and Temporal Trends of Extreme Precipitation using Functional Principal Component Analysis*: Characterizing variability and changes in precipitation, including extreme precipitation, is important for understanding and monitoring natural hazards. Past studies have often used a frequency-based approach to quantify changes in the events, which requires us to define an anomalous event by calculating a frequency of threshold exceedance and aggregating across space. However, the choice of the threshold value, time window, and spatial boundary for defining anomalies is not trivial. Therefore, a method that allows a characterization of precipitation without any prior specification of anomaly criteria (such as regional boundaries or fixed temporal windows) is beneficial. This study uses functional principal component analysis (FPCA) to characterize seasonal mean and extreme precipitation using measurements from the Global Historical Climatology Network Daily over the contiguous United States. FPCA is a flexible method that allows us to identify modes of temporal variability and spatial patterns of precipitation variability at a variety of scales. Using this method, we also characterize nonlinear trends in the distribution of precipitation and detect anomalous spatio-temporal events.

**Runjing ‘Bryan’ Liu**, PhD Statistics

*Evaluating Sensitivity to the Stick Breaking Prior in Bayesian Nonparametrics*: A central question in many probabilistic clustering problems is how many distinct clusters are present in a particular dataset. A Bayesian nonparametric (BNP) model addresses this question by placing a generative process on cluster assignment. However, like all Bayesian approaches, BNP requires the specification of a prior. In practice, it is important to quantitatively establish that the prior is not too informative, particularly when the particular form of the prior is chosen for mathematical convenience rather than because of a considered subjective belief.

We derive local sensitivity measures for a truncated variational Bayes (VB) approximation and approximate nonlinear dependence of a VB optimum on prior parameters using a local Taylor series approximation. Using a stick-breaking representation of a Dirichlet process, we consider perturbations both to the scalar concentration parameter and to the functional form of the stick-breaking distribution.
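The truncated stick-breaking construction referred to above can be simulated directly (a generic sketch of the standard construction, not the talk's VB machinery):

```python
import random

def stick_breaking_weights(alpha, truncation, seed=0):
    """Truncated stick-breaking construction of Dirichlet-process mixture
    weights: draw v_k ~ Beta(1, alpha) and set pi_k = v_k * prod_{j<k}(1 - v_j),
    i.e. break off a v_k fraction of the remaining stick at each step."""
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component absorbs the leftover stick
    return weights

weights = stick_breaking_weights(alpha=2.0, truncation=20)
```

The concentration parameter alpha and the Beta(1, alpha) form are exactly the two ingredients whose perturbation the sensitivity analysis studies.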

Unlike previous work on local Bayesian sensitivity for BNP, we pay special attention to the ability of our sensitivity measures to extrapolate to different priors, rather than treating the sensitivity as a measure of robustness per se. Extrapolation motivates the use of multiplicative perturbations to the functional form of the prior for VB. Additionally, we linearly approximate only the computationally intensive part of inference -- the optimization of the global parameters -- and retain the nonlinearity of easily computed quantities as functions of the global parameters.

We apply our methods to estimate sensitivity of the expected number of distinct clusters present in the Iris dataset to the BNP prior specification. We evaluate the accuracy of our approximations by comparing to the much more expensive process of re-fitting the model.

**Dangxing Chen**, Post-Doc CDAR

*Predicting Portfolio Return Volatility at Medium Horizons*: Commercially available factor models provide good predictions of short-horizon (e.g. one day or one week) portfolio volatility, based on estimated portfolio factor loadings and responsive estimates of factor volatility. These predictions are of significant value to certain short-term investors, such as hedge funds. However, they provide limited guidance to long-term investors, such as Defined Benefit pension plans, insurance companies, sovereign wealth funds, endowments, and individual owners of Defined Contribution pension plans. Because return volatility is variable and mean-reverting, the square root rule for extrapolating short-term volatility predictions to medium-horizon (one year to ten years) risk predictions systematically overstates (understates) medium-horizon risk when short-term volatility is high (low). In this paper, we propose a computationally feasible method for extrapolating to medium horizon risk predictions in one-factor models that substantially outperforms the square root rule.
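The square-root rule and its failure under mean reversion can be illustrated with a toy AR(1) variance process (illustrative numbers and dynamics of our choosing, not the paper's model):

```python
import math

def sqrt_rule(short_vol, horizon_days):
    """Square-root-of-time rule: scale one-day vol to a longer horizon."""
    return short_vol * math.sqrt(horizon_days)

def mean_reverting_vol(daily_var, long_run_var, phi, horizon_days):
    """Horizon vol when expected daily variance reverts toward its long-run
    level at rate phi per day (toy AR(1) dynamics); the horizon variance
    is the sum of the expected daily variances."""
    total, v = 0.0, daily_var
    for _ in range(horizon_days):
        total += v
        v = long_run_var + phi * (v - long_run_var)
    return math.sqrt(total)

# Current daily vol 2% vs. a 1% long-run level: the square-root rule
# overstates one-year (252-day) risk relative to the mean-reverting forecast
high_sqrt = sqrt_rule(0.02, 252)
high_mr = mean_reverting_vol(0.02 ** 2, 0.01 ** 2, 0.97, 252)

# Current daily vol 0.5%, below the long-run level: it now understates risk
low_sqrt = sqrt_rule(0.005, 252)
low_mr = mean_reverting_vol(0.005 ** 2, 0.01 ** 2, 0.97, 252)
```

The two cases reproduce the overstate/understate asymmetry described above: extrapolating today's volatility ignores its pull back toward the long-run level.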

**Jake Soloff**, PhD Statistics

*Nonparametric Maximum Likelihood for the Poisson Compound Decision Problem*: The problem of estimating a collection of means based on independent Poisson counts has a classical empirical Bayes solution due to Robbins. This estimator has some problems, however, stemming from the fact that it uses the empirical probability mass function to estimate a mixed Poisson distribution. We study the nonparametric maximum likelihood estimator for a mixed Poisson distribution and the advantages conferred to the resulting nonparametric empirical Bayes estimator.
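Robbins' classical rule is simple enough to state in one line: the estimate for an observed count x is (x + 1) * N(x + 1) / N(x), where N(x) is the number of observations equal to x. The sketch below (our illustration, not the talk's code) also shows the instability that motivates the NPMLE alternative:

```python
from collections import Counter

def robbins_estimator(counts):
    """Robbins' empirical Bayes rule for Poisson means:
    theta_hat(x) = (x + 1) * N(x + 1) / N(x),
    where N(x) is the number of observations equal to x."""
    freq = Counter(counts)
    return {x: (x + 1) * freq.get(x + 1, 0) / freq[x] for x in freq}

# Six units with observed counts; note the degenerate estimate at x = 2,
# an artifact of plugging the empirical pmf into the Bayes formula
counts = [0, 0, 0, 1, 1, 2]
est = robbins_estimator(counts)  # {0: 2/3, 1: 1.0, 2: 0.0}
```

The estimate of 0 at the largest observed count, caused by N(3) = 0, is exactly the kind of empirical-pmf artifact that smoothing through the nonparametric MLE of the mixing distribution avoids.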

**Kellie Ottoboni**, PhD Statistics

*SUITE in Practice: Piloting Risk-limiting Post-election Audits in Michigan*: Risk-limiting audits (RLAs) provide statistical evidence that election outcomes are correct by examining random samples of paper ballots. SUITE is an RLA procedure for stratified samples, where ballots are divided into distinct groups and samples are drawn independently from each group. Michigan used SUITE for its first pilot RLAs of the November 2018 election. I'll talk about the Jupyter notebook tool I wrote to run the audit from start to finish and what I learned from observing a post-election audit.