BSTARS Conference 2016

Berkeley Statistics Annual Research Symposium (BSTARS)

The Berkeley Statistics Annual Research Symposium (BSTARS) surveys the latest research developments in the department, with an emphasis on possible applications to statistical problems encountered in industry. The conference consists of keynote lectures given by faculty members, talks by PhD students about their thesis work, and presentations of industrial research by alliance members.

Schedule

BSTARS 2016 will be held on March 14, 1:30pm-8:30pm, at The Alumni House, UC Berkeley.

1:30-2:00 Arrival, coffee, and pastries
2:00-2:10 Welcome and opening remarks
  • Michael Jordan, Statistics Department Professor and Chair
  • Frances Hellman, Dean, Math & Physical Sciences
2:10-3:10 Thunder Talks
3:10-3:30 Break
3:30-4:30 Thunder Talks 2
4:30-5:50 Poster Sessions
5:50-6:30 Keynote by Teaching Professor Ani Adhikari
6:30-8:30 Dinner

Keynote Speaker

Ani Adhikari, Teaching Professor of Statistics at UC Berkeley, will discuss how, in response to the exploding demand for training in Data Science, the University has embarked on the ambitious project of designing a new data science education curriculum, with plans to grow into a new major and minor in Data Science. The pilot course in this program has been a great success, with nearly 500 students enrolled this semester.

Professor Adhikari has received the Distinguished Teaching Award at UC Berkeley and the Dean's Award for Distinguished Teaching at Stanford University. While her research interests are centered on applications of statistics in the natural sciences, her primary focus has always been on teaching and mentoring students. She teaches courses at all levels and has a particular affinity for teaching statistics to students who have little mathematical preparation. She received her undergraduate degree from the Indian Statistical Institute, and her Ph.D. in Statistics from Berkeley.

 

Thunder Talks by Industry Alliance Members

Yang Wang, Data Scientist 

Citadel

Dr. Tao Shi, Sr. Quantitative Researcher 
Dr. Yuhong Wu, Sr. Quantitative Researcher

Deloitte

Dr. David Steier, Director, Deloitte Consulting LLP

Dr. Miklos Racz, Postdoctoral Researcher, Microsoft Research

Dr. Azin Ashkan, Senior Researcher

Dr. Jeffrey R. Bohn, Chief Science Officer and Head of GX Labs
State Street Global Exchange

Dr. Peter Frazier, Staff Data Scientist 

Veracyte

Dr. Jing Huang, Vice President of Bioinformatics 

Thunder Talks and Poster Presentations of PhD Students

Rebecca Barter, Statistics Department

Superheat: Supervised Heatmaps for Visualizing Complex Data

Technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Accordingly, computationally intensive statistical and machine learning algorithms are being used to seek answers to increasingly complex questions. Although visualization has the potential to be a powerful aid to the modern information extraction process, visualizing high-dimensional data is an ongoing challenge. Here, we introduce the supervised heatmap, a new graph that builds upon existing clustered heatmaps commonly used in fields such as bioinformatics. Supervised heatmaps have two primary aims: to provide a means of visual extraction of the information contained within high-dimensional datasets, and to provide a visual assessment of the performance of model fits to these datasets.

Wilson Cai, Biostatistics Program

Symmetric Tensor Regression with Applications in Neuroimaging Data Analysis

Classical regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate covariates of more complex forms, such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of these high-throughput data due to their ultrahigh dimensionality and complex structure. In this article, we consider regression with symmetric tensor covariates. Such data occur naturally in many applications, such as functional neuroimaging and network data analysis. The proposed method allows the trait to be continuous, binary, a count of events, or multivariate. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation. This method will require the implementation of a fast, highly scalable estimation algorithm. The effectiveness of the new method is demonstrated on both synthetic and real imaging data.

Ryan Copus and Hannah Laqueur, both PhD students in the School of Law and MA students in the Biostatistics Program

Machines Learning Justice: The Case for Judgmental Bootstrapping of Legal Decisions

"Justice," the trope goes, "is what the judge ate for breakfast." We propose a new tool for reducing inconsistency in legal decision making: Judgmental Bootstrapping Models ("JBMs") built with machine learning methods. By providing judges with recommendations generated from statistical models of themselves, JBMs can help those judges make more consistent, fairer, and better decisions. To illustrate these advantages, we build a JBM of release decisions for the California Board of Parole Hearings. The JBM correctly classifies 79% of validation-set parole decisions, and if the model would have recommended against parole but the Board nonetheless granted it, the Board was two and a half times more likely to be reversed.

Aditya Devarakonda, EECS Department

Avoiding Communication in Machine Learning

The unprecedented volume of data currently being processed by data analytics frameworks requires scalable machine learning algorithms. However, existing methods scale to just hundreds of processors, after which inter-processor communication dominates and becomes the primary bottleneck. Recent results in numerical linear algebra suggest that large speedups are possible by re-organizing algorithms to avoid inter-processor communication. The work on communication-avoiding Krylov subspace methods is particularly relevant and we show how applying similar algorithmic transformations can lead to faster, more scalable machine learning algorithms.

Ryan Giordano, Statistics Department

Robust Inference with Variational Bayes

In Bayesian analysis, the posterior follows from the data and a choice of prior and likelihood. One usually hopes that the posterior is robust to reasonable variation in the choice of prior and likelihood, since this choice is made by the modeler and is necessarily somewhat subjective. Despite the fundamental importance of the problem and a considerable body of literature, the tools of robust Bayes are not commonly used in practice, in part due to the difficulty of calculating robustness measures from MCMC draws. In contrast to MCMC, variational Bayes (VB) techniques are readily amenable to robustness analysis. We develop local prior robustness measures for mean-field variational Bayes (MFVB), a VB technique that imposes a particular factorization assumption on the variational posterior approximation, and demonstrate the accuracy of our method on a range of real-world problems.

Christine Kuang, Statistics Department

Topic-Sentiment Model with Document-Level Covariates

Text data analysis is becoming increasingly important with the rapid growth of text data. Two methods of text analysis are topical analysis and sentiment analysis. Both have equally valuable applications in making inference about social and political cultures, attitudes, and processes. We propose a model based on the Structural Topic Model (STM) which simultaneously detects both topic and sentiment of text data. We present the contributions of the model with an application example.

Karl Kumbier, Statistics Department

Detecting Spatial Gene Expression Patterns in Late-stage Drosophila Melanogaster

Spatially defined gene interactions have long been known to take part in developmental processes. The recent abundance of spatial gene expression data is opening up new opportunities to understand local gene-gene interactions that are highly differentiated across spatial regions. We analyze spatial gene expression patterns from a large microscopy dataset of Drosophila embryos at developmental stages 9-10. We will discuss the challenges inherent in representing embryo images during these stages and present an organ classification and registration model that modifies state-of-the-art computer vision algorithms to produce mid-level image features well suited to bioimaging tasks. By combining our classification model with non-negative matrix factorization, we produce parts-based representations of spatial gene expression in various organ systems. Our so-called “principal patterns” of gene expression are interpretable, low-dimensional representations of the data that serve as a late-stage analogue to the Drosophila fate map.

Soumendu Mukherjee, Statistics Department

Stochastic Block Model with Noise

We consider the Stochastic Block Model (SBM) with noisy observations and discuss how much noise is tolerable, from a statistical perspective, for consistent community detection (we use several popular community detection algorithms for this purpose, including spectral clustering). The noise parameters are unidentifiable per se. However, under models that incorporate node covariates, e.g. by modeling the edge probabilities as suitable functions of node covariates, one can estimate the noise parameters.

Kellie Ottoboni, Statistics Department

Model-based Matching for Causal Inference in Observational Studies

Drawing causal inferences from nonexperimental data is difficult due to the presence of confounders, variables that affect both the selection into treatment groups and the outcome. Matching methods can be used to subset the data into groups that are comparable with respect to important variables, but matching often fails to create sufficient balance between groups. Model-based matching is a nonparametric matching method that groups observations expected to be alike if none had received the treatment. We use model-based matching to conduct stratified permutation tests of association between the treatment and outcome, controlling for other variables. Under standard assumptions from the causal inference literature, model-based matching can be used to estimate average treatment effects.
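As a rough illustration of a stratified permutation test in general (not the authors' implementation), the sketch below shuffles treatment labels only within each stratum and compares the observed statistic with its permutation distribution. The function name, the choice of test statistic (sum of within-stratum differences in means), and the assumption that strata are already given are all illustrative; the abstract does not specify how strata are built from the model.

```python
import random

def stratified_permutation_test(outcomes, treated, strata, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a treatment/outcome association,
    shuffling treatment labels only within each stratum (hypothetical helper)."""
    rng = random.Random(seed)

    # Group observation indices by stratum label.
    groups = {}
    for i, s in enumerate(strata):
        groups.setdefault(s, []).append(i)

    def statistic(labels):
        # Assumed statistic: sum over strata of (treated mean - control mean).
        # Strata lacking either group contribute nothing.
        total = 0.0
        for idx in groups.values():
            t = [outcomes[i] for i in idx if labels[i] == 1]
            c = [outcomes[i] for i in idx if labels[i] == 0]
            if t and c:
                total += sum(t) / len(t) - sum(c) / len(c)
        return total

    observed = statistic(treated)
    labels = list(treated)
    extreme = 0
    for _ in range(n_perm):
        # Permute labels within each stratum, leaving stratum sizes intact.
        for idx in groups.values():
            shuffled = [labels[i] for i in idx]
            rng.shuffle(shuffled)
            for i, lab in zip(idx, shuffled):
                labels[i] = lab
        if abs(statistic(labels)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)

# Toy data: two strata of four observations each, two treated per stratum.
outcomes = [5.1, 4.9, 6.0, 6.2, 1.0, 1.2, 2.1, 2.0]
treated = [0, 0, 1, 1, 0, 0, 1, 1]
strata = ["a", "a", "a", "a", "b", "b", "b", "b"]
observed, p_value = stratified_permutation_test(outcomes, treated, strata)
```

Permuting within strata holds the stratifying variable fixed, so the test targets association between treatment and outcome conditional on stratum membership.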

Sujayam Saha, Statistics Department

High-dimensional Inference Using Random Projections (Joint Work with Aditya Guntuboyina and Bin Yu)

Random projections, and more generally sketching algorithms, have been used with great success to mitigate the curse of dimensionality in prediction problems in machine learning. Here we explore the use of random projections to obtain computational gains and wider applicability in statistical inference for high-dimensional linear models. We propose a novel method of constructing confidence intervals (and assigning p-values) for coefficients of individual predictors.

Nihar Shah, EECS Department

Estimation from Pairwise Comparisons: Statistical and Computational Aspects

Data in the form of pairwise comparisons between various items arises in many applications, including crowdsourcing, sports, and others. There are various parametric models for analyzing pairwise comparison data, including the Bradley-Terry-Luce (BTL) and Thurstone models, but their reliance on strong parametric assumptions is limiting. In this work, we study a flexible model for pairwise comparisons, under which the probabilities of outcomes are required only to satisfy a natural form of stochastic transitivity; this class includes parametric models as special cases, but is considerably more general. Despite this greater flexibility, we show that the matrix of probabilities can be estimated at the same rate as in standard parametric models. On the other hand, unlike in the BTL and Thurstone models, computing the minimax-optimal estimator in the stochastically transitive model is not tractable, and we explore various computationally tractable alternatives.
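To make the relationship between the model classes concrete, here is a minimal sketch that builds a BTL comparison matrix and checks a transitivity condition. It assumes the strong form of stochastic transitivity (if i tends to beat j and j tends to beat k, then P(i beats k) is at least the larger of the two probabilities); the abstract does not state which form is used, and the function names are illustrative.

```python
from itertools import permutations

def btl_matrix(weights):
    """P[i][j] = probability that item i beats item j under the BTL model."""
    n = len(weights)
    return [[0.5 if i == j else weights[i] / (weights[i] + weights[j])
             for j in range(n)] for i in range(n)]

def satisfies_sst(P, tol=1e-12):
    """Check strong stochastic transitivity (assumed form) for a matrix P."""
    n = len(P)
    for i, j, k in permutations(range(n), 3):
        if P[i][j] >= 0.5 and P[j][k] >= 0.5:
            if P[i][k] < max(P[i][j], P[j][k]) - tol:
                return False
    return True

btl = btl_matrix([4.0, 2.0, 1.0])

# Transitive, but not representable as BTL: the log-odds are not additive.
non_parametric = [[0.5, 0.6, 0.9],
                  [0.4, 0.5, 0.6],
                  [0.1, 0.4, 0.5]]

# Rock-paper-scissors style cycle: violates transitivity.
cyclic = [[0.5, 0.9, 0.1],
          [0.1, 0.5, 0.9],
          [0.9, 0.1, 0.5]]

results = (satisfies_sst(btl), satisfies_sst(non_parametric), satisfies_sst(cyclic))
```

The second matrix passes the check even though no choice of BTL weights reproduces it, which illustrates how the transitivity-only class can be strictly larger than the parametric one.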

Yuting Wei, Statistics Department

Adaptive Estimation of Planar Convex Sets

In this work, we consider adaptive estimation of an unknown planar compact, convex set from noisy measurements of its support function on a uniform grid. Both the problem of estimating the support function at a point and that of estimating the convex set are studied. Data-driven adaptive estimators are proposed and their optimality properties are established. For pointwise estimation, it is shown that the estimator optimally adapts to every compact, convex set instead of a collection of large parameter spaces as in the conventional minimax theory in nonparametric estimation literature. For set estimation, the estimators adaptively achieve the optimal rate of convergence. In both these problems, our analysis makes no smoothness assumptions on the unknown sets.
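For readers unfamiliar with the measurement model, the sketch below evaluates the support function h_K(theta) = max over x in K of (x1 cos theta + x2 sin theta) for a convex polygon and simulates noisy readings on a uniform angular grid. The polygon input, the Gaussian noise, and the function names are assumptions made for illustration; the estimators from the talk are not shown.

```python
import math
import random

def support_function(vertices, theta):
    """Support function of the convex hull of `vertices` in direction theta.
    For a polygon the maximum is attained at a vertex."""
    ux, uy = math.cos(theta), math.sin(theta)
    return max(x * ux + y * uy for x, y in vertices)

def noisy_support_measurements(vertices, n, sigma, seed=0):
    """Simulate (theta_i, h_K(theta_i) + noise) on a uniform grid of n angles.
    Gaussian noise with standard deviation sigma is an assumption."""
    rng = random.Random(seed)
    grid = [2 * math.pi * i / n for i in range(n)]
    return [(t, support_function(vertices, t) + rng.gauss(0.0, sigma))
            for t in grid]

# The square [-1, 1] x [-1, 1].
square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
h_axis = support_function(square, 0.0)              # distance to the right edge
h_diag = support_function(square, math.pi / 4)      # distance to the corner
samples = noisy_support_measurements(square, n=8, sigma=0.1)
```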

Fanny Yang, EECS Department

Statistical and Computational Guarantees for the Baum-Welch Algorithm

The Hidden Markov Model (HMM) is one of the mainstays of statistical modeling of discrete time series, with applications including speech recognition, computational biology, computer vision and econometrics. Estimating an HMM from its observation process is often addressed via the Baum-Welch algorithm, which is known to be susceptible to local optima. In this paper, we first give a general characterization of the basin of attraction associated with any global optimum of the population likelihood. By exploiting this characterization, we provide non-asymptotic finite sample guarantees on the Baum-Welch updates, guaranteeing geometric convergence to a small ball of radius on the order of the minimax rate around a global optimum. As a concrete example, we prove a linear rate of convergence for a hidden Markov mixture of two isotropic Gaussians given a suitable mean separation and an initialization within a ball of large radius around (one of) the true parameters. To our knowledge, these are the first rigorous local convergence guarantees to global optima for the Baum-Welch algorithm in a setting where the likelihood function is nonconvex.

Poster Session

The following graduate students and postdocs will also be presenting posters:

Postdocs: Kimon Fountoulakis - Statistics Department, Fred Roosta - Statistics Department

PhD students: Jianbo Chen - Statistics Department, Fanny Perraudeau - Biostatistics Program

MA students of the Statistics Department (presenting in groups of five to six): Aksam Ahmad, Siyao Chang, Dylan Daniels, Thibault Doutre, Trong Hoang Duong, Yueqi Feng, Boying Gong, Yuchao Guo, Jianglong Huang, Sicun Huang, Alanna Iverson, Mengfei Jiang, Mingyung Kim, Chenzhi Li, Zhuangdi Li, Hao Lyu, Fengshi Niu, Tomofumi Ogawa, Jamie Palumbo, Weiyan Shi, Shamindra Shrotriya, Peter Sujan, Chenyu Wang, Chih-Hui Wang, Cenzhuo Yao, Lu Zhang, Tianyi Zhang, Hongfei Zhao, Luyun Zhao, Ye Zhi, Xinyue Zhou, Yun Zhou, and Jiang Zhu.