BSTARS 2015

Berkeley Statistics Annual Research Symposium (BSTARS)

The Berkeley Statistics Annual Research Symposium (BSTARS) surveys the latest research developments in the department, with an emphasis on possible applications to statistical problems encountered in industry. The conference consists of keynote lectures given by faculty members, talks by PhD students about their thesis work, and presentations of industrial research by alliance members.

Schedule

BSTARS 2015 will be March 30, 1:30pm-8:30pm, at The Alumni House, UC Berkeley.

1:30-2:00 Arrival, coffee, and pastries

2:00-2:20 Welcome and opening remarks

- 5 min Philip Stark, Statistics Department Professor and Chair

- 5 min Frances Hellman, Dean of the College of Letters and Science

2:20-3:00 Thunder Talks

3:00-3:20 Break

3:20-4:00 Thunder Talks 2

4:00-5:20 Poster Sessions

5:20-6:20 Keynote by Professor Michael Jordan

6:30-8:30 Dinner

Keynote Speaker

Michael I. Jordan is the Pehong Chen Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley. He received his Masters in Mathematics from Arizona State University, and earned his PhD in Cognitive Science in 1985 from the University of California, San Diego. He was a professor at MIT from 1988 to 1998. His research interests bridge the computational, statistical, cognitive and biological sciences, and have focused in recent years on Bayesian nonparametric analysis, probabilistic graphical models, spectral methods, kernel machines and applications to problems in distributed computing systems, natural language processing, signal processing and statistical genetics. Prof. Jordan is a member of the National Academy of Sciences, a member of the National Academy of Engineering and a member of the American Academy of Arts and Sciences. He is a Fellow of the American Association for the Advancement of Science. He has been named a Neyman Lecturer and a Medallion Lecturer by the Institute of Mathematical Statistics. He received the David E. Rumelhart Prize in 2015 and the ACM/AAAI Allen Newell Award in 2009. He is a Fellow of the AAAI, ACM, ASA, CSS, IEEE, IMS, ISBA and SIAM.

Thunder Talks and/or Poster Sessions by Industry Alliance Members

Adobe Presenter: Matt Hoffman, Research Scientist

Deloitte Presenter: James Guszcza, Chief Data Scientist

Genentech Presenter: Thomas Bengtsson, Principal Statistical Scientist

Genomic Health Presenters:

Gregory Alexander, Director, Biostatistics

Michael Crager, Senior Biomedical Data Manager

Francois Collin, Senior Statistician

Global quantitative measures using next-generation sequencing for breast cancer presence outperform individual tumor markers in plasma

Hua Analytical Technology Presenter: Michael Xuan, Chairman

Microsoft Presenter: Parikshit Gopalan, Researcher, Azure Storage Team

Uber Presenter: David Purdy, Data Science Manager

Technicolor Presenter: Kevin Xu, Researcher

Veracyte Presenter: Jing Huang, VP of Bioinformatics

Thunder Talks and/or Poster Presentations of PhD Students

Reza Abbasi Asl, EECS Department

Feature Representation for Modeling Visual Cortex Area V4: Dictionary Learning vs. Deep Convolutional Networks

Several studies have been conducted to model different areas of the visual cortex of the brain; however, there is no unified understanding of the extrastriate visual regions such as area V4 and their functional roles in visual processing. As a collaboration between the Yu Group in the Statistics Department and the Gallant Lab in the Psychology Department at UC Berkeley, a deep convolutional network is designed to model V4 neurons in the brain. It has been shown that this model slightly outperforms the dictionary learning approach in terms of predicting neural activity. In particular, a mixed model is introduced to illustrate the importance of the first and second layers of both models and to allow a fair comparison between them. This is joint work with Yuansi Chen, Adam Bloniarz, Jack Gallant, and Bin Yu.

Yuansi Chen, EECS Department

Invariant Modeling of Visual Cortex Area V4 via Scattering Transform

We build predictive models of neural activity to understand tuning properties of areas of the extrastriate visual cortex. Areas V1 and V2 are typically modeled by applying direction and speed sensitive Gabor wavelet filters to natural stimuli. To model regions further along the ventral and dorsal visual streams, we introduce more complex representations of images based on the scattering transform. The orientation tuning of V4 voxels is presented visually.

Arturo Fernandez, Statistics Department

A Low Rank Pipeline for Linear Models

Low-rank approximation theory has recently attracted attention and seen advances in its application to machine learning problems. At the same time, full QR or SVD decompositions can be costly. Here we explore a randomized approach to matrix factorization and apply it to linear model fitting methods such as Ordinary Least Squares and the LASSO. In doing so, we tie in different concepts from numerical linear algebra, statistics, and optimization theory.

Inna Gerlovina, Biostatistics Department

Small Sample Inference and Edgeworth Expansions

For a relatively small sample size, the true sampling distribution of even asymptotically linear estimators might be quite far from normal. Relying on asymptotics in these cases might result in poor approximation and consequently faulty conclusions. First, we explore the extent of possible departures from normality in the context of high-dimensional data and multiple testing procedures that require estimating distal tail probabilities. Second, we present a method that applies Edgeworth expansions to data analysis and uses higher empirical moments to increase the accuracy of approximation.
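As a rough illustration of the idea, and not the method presented in the talk, the sketch below evaluates the textbook one-term Edgeworth approximation to the distribution of a standardized sample mean, Phi(x) - skewness / (6 sqrt(n)) * (x^2 - 1) * phi(x). It assumes the mean is standardized by the true standard deviation and plugs in the empirical skewness as the higher moment. The function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_skewness(data):
    """Empirical skewness: third central moment over variance ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5


def edgeworth_cdf(x, skewness, n):
    """One-term Edgeworth approximation to P(sqrt(n) * (mean - mu) / sigma <= x).

    Assumption for this sketch: the statistic is standardized with the true
    sigma; a studentized statistic has a different correction term.
    """
    correction = skewness / (6.0 * math.sqrt(n)) * (x * x - 1.0) * normal_pdf(x)
    return normal_cdf(x) - correction
```

With zero skewness the correction vanishes and the normal approximation is recovered; for skewed data and small n the correction shifts the tail probabilities, which is where the normal approximation tends to be least reliable.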

Christine Kuang, Statistics Department

Natural Language Processing Methods for Solving Accessory Pollution

Accessory pollution is a problem of interest for eBay search. Accessory pollution occurs when search results are cluttered with the accessories of the items searched, rather than the item itself. My talk will go through the different natural language processing methods and features I researched during my summer internship at eBay that could aid in solving this problem.

Miles Lopes, Statistics Department

A Residual Bootstrap for High-Dimensional Regression with Near Low-Rank Designs

We study the residual bootstrap (RB) method in the context of high-dimensional linear regression. Specifically, we analyze the distributional approximation of linear contrasts obtained from ridge regression. When regression coefficients are estimated via least squares, classical results show that RB consistently approximates the laws of contrasts, provided that p << n, where the design matrix is of size n-by-p. Up to now, relatively little work has considered how additional structure in the linear model may extend the validity of RB to the setting where p/n ~ 1. In this setting, we propose a version of RB that resamples residuals obtained from ridge regression. Our main structural assumption on the design matrix is that it is nearly low rank --- in the sense that its singular values decay according to a power-law profile. Under a few extra technical assumptions, we derive a simple criterion for ensuring that RB consistently approximates the law of a given contrast.

Phillip Moritz, EECS Department

Simulation-based Reinforcement Learning Using Policy Gradients

Reinforcement learning is an area of machine learning concerned with how agents can map observations of their environment to actions in order to maximize a given reward function. Most such problems in the real world have a high dimensional state space and the optimal mapping from observations to actions is highly nonlinear. This makes it necessary to use a rich class of function approximators to represent policies. In the supervised setting, neural networks have proven to be effective because of their rich representational power, their compact representation and the fact that they can be optimized effectively using gradient-based optimization; because of this, many useful models like convolutional and recurrent networks have been developed. In the reinforcement learning setting, gradients of the policy can still be obtained using the likelihood ratio method. This leads to practical algorithms for reinforcement learning that do not require a model for the environment dynamics. We survey some of these policy gradient algorithms and present an algorithm ("Trust Region Policy Optimization") that is effective in practice. We present results on learning how to play ATARI games from vision data using reinforcement learning and on 3D robotic walking using a modern physics simulator with realistic contact dynamics. This is joint work with John Schulman and Sergey Levine from UC Berkeley.
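The likelihood ratio method mentioned above can be sketched on a toy problem. The example below is a plain REINFORCE-style update with a running-average baseline on a multi-armed bandit with a softmax policy. It is not Trust Region Policy Optimization, and the bandit, the reward noise, and all names and parameter values are assumptions made for illustration.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    """Likelihood-ratio policy gradient on a toy bandit (hypothetical setup).

    The gradient of the expected reward is estimated from one sample as
    (reward - baseline) * grad log pi(action); no model of the reward
    distribution is needed.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(mean_rewards)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(mean_rewards[action], 1.0)
        # For a softmax policy, d/d prefs[k] of log pi(action)
        # equals 1{k == action} - probs[k].
        for k in range(len(prefs)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * (reward - baseline) * grad_log
        # Running average of rewards, used to reduce variance.
        baseline += (reward - baseline) / t
    return softmax(prefs)
```

Calling `reinforce_bandit([0.0, 2.0])` returns the learned action probabilities, which should concentrate on the second, higher-reward arm.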

Robert Nishihara, EECS Department

On the Convergence Rate of Decomposable Submodular Function Minimization

Submodular functions describe a variety of discrete problems in machine learning, signal processing and computer vision. However, minimizing submodular functions poses a number of algorithmic challenges. Recent work introduced an easy-to-use, parallelizable algorithm for minimizing submodular functions that decompose as the sum of "simple" submodular functions. Empirically, this algorithm performs extremely well, but no theoretical analysis was given. In this paper, we show that the algorithm converges linearly, and we provide upper and lower bounds on the rate of convergence. Our proof relies on the geometry of submodular polyhedra and draws on results from spectral graph theory.

Kellie Ottoboni, Statistics Department

Is Salt Bad for Nations?

The World Health Organization (WHO) is running a major campaign to reduce salt consumption worldwide. However, the main lines of evidence that salt is bad come from observational studies on hypertension. We investigated WHO’s real outcome of interest: mortality. We collected data on mortality, alcohol, tobacco, sodium, and economic factors in 36 countries in 1990 and 2010. Using a nonparametric permutation test, we studied whether sodium intake helps predict change in life expectancy after accounting for other known health predictors.
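A minimal sketch of a permutation test for association between two variables is shown below. It is a simplification: it uses the absolute Pearson correlation as the test statistic and does not adjust for other predictors, so it is not the analysis described in the abstract. The function names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for association between x and y.

    Under the null hypothesis that y is exchangeable with respect to x,
    shuffling y gives draws from the null distribution of the statistic.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Adding one to numerator and denominator keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)
```

In a setting like the one in the abstract, `x` might hold a per-country sodium measure and `y` the change in life expectancy, with the covariate adjustment handled separately before the outcomes are permuted.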

Fanny Perraudeau, Biostatistics Department

Gene Expression Response to Occupational Benzene Exposure

Benzene, a ubiquitous environmental pollutant, has long been implicated in the induction of Acute Myeloid Leukemia (AML). However, its mutagenic effect and its mode of action are still unclear. This study used RNA-sequencing (RNA-seq) technology to examine the effect of benzene exposure on gene expression in an occupationally exposed cohort in China.

Suzette Puente, Statistics Department

Modeling of Animal Movement and an Application to Elk and Hunter Data

We demonstrate the use of stochastic differential equations and potential functions to model the change in an animal's position in response to covariates. We also consider serial correlation, since it is reasonable to believe that an animal's step size is autocorrelated with its previous p steps. We apply these methods to data (GPS recordings) from the Starkey Experimental Forest and Range in Oregon to model the movement of elk in response to environmental factors and the presence of hunters.

Jeffrey Regier, Statistics Department

Scalable Variational Inference for a Generative Model of Astronomical Images

A central problem in astronomy is to infer the locations and other latent properties of stars and galaxies appearing in telescopic images. In these images, each pixel records a count of the photons originating from stars, galaxies, and the background that entered a particular region of a telescope's lens during an exposure. Each count is well modeled as a Poisson random variable, whose rate parameter is a deterministic function of the latent properties of nearby stars and galaxies. In this talk, I present a generative, probabilistic model of astronomical images, as well as a scalable procedure for inferring the latent properties of imaged stars and galaxies from the model. Experimental results suggest that principled probabilistic models are a viable alternative to ad hoc approaches.

Funan Shi, Statistics Department

Detecting treatment effects in clinical trials for Alzheimer’s disease with florbetapir-PET: An alternative statistical approach to SUVr

Based on Positron Emission Tomography (PET), SUVr is currently the de facto measurement for gauging the accumulation of Amyloid plaques in Alzheimer's Disease (AD) patients. Because SUVr is a ratio, its non-linear properties yield undesirable statistical properties when noise is present in the denominator, which in turn may inflate sample size requirements. Upon discovering that multiple aspects of our PET data (a recent Ph2 AD clinical trial, and the ADNI AV-45 PET repository) exhibit linear properties, it became apparent that the linear regression framework was a more statistically sound approach to detecting longitudinal changes in Amyloid burden than SUVr. Simulations verify that the linear method can be significantly more powerful than SUVr in detecting longitudinal changes (including potential treatment effects).

Linda Tran, Statistics Department

A regularized particle filter for high-dimensional state-space models

Many interesting forecasting problems can be cast as hidden Markov models with non-linear state transitions. Particle filters are a suite of sampling algorithms that have been developed to estimate the hidden states in these models, but they have issues with high-dimensional state spaces. The number of samples typically required is more than the dimensions of the state, thus increasing the computational complexity of the algorithm. To resolve this issue, we introduce a simple tweak to the algorithm and demonstrate that better forecasts can be made using fewer samples than the dimensions of the state.
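For readers unfamiliar with particle filters, the sketch below shows a standard bootstrap particle filter on a one-dimensional state-space model. It does not include the regularization proposed in the talk, and the transition function, the noise levels, and all names are assumptions chosen only for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              proc_sd=1.0, obs_sd=1.0, seed=0):
    """Return the filtered posterior mean of the hidden state at each step.

    Hypothetical model for this sketch:
        x_t = 0.5 * x_{t-1} + 2 * sin(x_{t-1}) + N(0, proc_sd^2)
        y_t = x_t + N(0, obs_sd^2)
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in observations:
        # 1. Propagate each particle through the non-linear transition.
        particles = [0.5 * x + 2.0 * math.sin(x) + rng.gauss(0.0, proc_sd)
                     for x in particles]
        # 2. Weight particles by the Gaussian observation likelihood
        #    (computed in log space to avoid underflow).
        log_w = [-0.5 * ((y - x) / obs_sd) ** 2 for x in particles]
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * x for w, x in zip(weights, particles)))
        # 3. Resample with replacement in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means
```

In high dimensions the weights in step 2 tend to collapse onto a handful of particles, which is the difficulty the abstract refers to.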

Siqi Wu, Statistics Department

Relating Developmental Transcription Factors (TFs) Based on Drosophila Embryonic Gene Expression Images

Siqi Wu¹,², Antony Joseph¹,², Ann S. Hammonds², William W. Fisher², Richard Weiszmann², Susan E. Celniker², Bin Yu¹ and Erwin Frise²

¹Department of Statistics, University of California Berkeley, Berkeley, CA 94720

²Department of Genome Dynamics, Division of Life Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720

TFs play a central role in controlling gene expression. A fundamental problem in systems biology is to understand the interactions between the TFs. We combined Nonnegative Matrix Factorization with a new stability model selection criterion to decompose the expression patterns of all known TFs into a group of data-driven “principal patterns”. The representation of the expression patterns as learned principal patterns allows for a compact and interpretable representation. Based on the learned patterns, we constructed spatially local TF networks that agreed well with the known gap-gene network.

Shijing Yao, EECS Department

Coordinate Descent by Stochastic Search

The Multi-angle Imaging SpectroRadiometer (MISR) is a scientific instrument aboard the Terra satellite, which was launched by NASA in 1999. This device was designed to measure the intensity of solar radiation reflected by the Earth’s planetary surface and atmosphere (in various directions and spectral bands). Retrieval of Aerosol Optical Depth (AOD) from the atmosphere through the MISR image is one of the goals. The current operational retrieval algorithm searches a pre-defined solution space and finds the optimal one by setting a threshold on a chi-square statistic. This approach suffers from limited accuracy and leads to severe underestimation for high AOD values. Moreover, missing retrievals can occur when none of the pre-defined solutions meets the threshold criteria. A recently developed Hierarchical Bayesian model with an MCMC method was shown to improve spatial resolution and coverage. However, compared to the threshold method, it was reported to be computationally expensive and slow. In addition, the algorithm did not clearly define its convergence criteria and systematically overestimated low AOD values.

In this work, we propose a new algorithm based on Approximate Bayesian Computation (ABC) to retrieve AOD. Our algorithm is 100x faster than the MCMC method while retaining its accuracy. We tested the retrievals against the AERONET ground measurements (the industry standard) in the Baltimore-Washington DC metropolitan area, and the results show excellent agreement. We believe that fast AOD retrieval at high spatial resolution will make the algorithm well suited to eventual adoption in urban-level remote sensing.
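The general idea behind ABC can be sketched with a basic rejection sampler. The example below is a toy: it infers the mean of a Gaussian with known unit variance under a uniform prior, using the sample mean as the summary statistic. The forward model, the prior range, the tolerance, and the function name are all assumptions, and the sketch has no connection to the MISR retrieval algorithm itself.

```python
import random


def abc_rejection(observed, n_draws=20000, tolerance=0.1, seed=0):
    """Approximate posterior draws for a Gaussian mean via ABC rejection.

    Each candidate parameter is drawn from the prior, a synthetic data set
    is simulated from it, and the candidate is kept only if the simulated
    summary statistic lands within `tolerance` of the observed one.
    """
    rng = random.Random(seed)
    n = len(observed)
    obs_mean = sum(observed) / n
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # hypothetical prior
        simulated_mean = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if abs(simulated_mean - obs_mean) <= tolerance:
            accepted.append(theta)
    return accepted
```

The accepted values approximate the posterior without ever evaluating a likelihood; a smaller tolerance gives a closer approximation at the cost of fewer accepted draws.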

Yuchen Zhang

Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing

Crowdsourcing is a popular paradigm for effectively collecting labels at low cost. The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.