Summarizing large-scale, multiple-document news data: sparse methods and human validation
News media significantly drives the course of events, and understanding how it does so has long been an active and important area of research. As the amount of online news media grows, there is ever more information calling for analysis and an ever-widening range of inquiries one might pursue. We believe subject-specific summarization of multiple news documents at once can help. In this paper we adapt scalable statistical techniques to perform this summarization under a predictive framework built on a vector space model of documents. We reduce corpora of many millions of words to a few representative key-phrases that describe a specified subject of interest, and we propose this as a tool for the study of news media.
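As a rough, hedged sketch of this predictive framing (the toy corpus, the subject query, the scikit-learn utilities, and the use of the Lasso as the sparse predictor are all illustrative assumptions here, not the paper's actual data or pipeline), one can vectorize documents into phrase counts, label them by subject, and read off the nonzero coefficients of a sparse fit as the key-phrase summary:
\begin{verbatim}
# Minimal sketch: toy corpus, subject query, and regularization strength
# are illustrative assumptions only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

docs = [
    "the senate debated the new immigration bill",
    "immigration reform stalls amid senate gridlock",
    "local team wins the championship game",
    "weather forecast predicts heavy rain this weekend",
]

# Label each document by whether it concerns the subject of interest.
subject = "immigration"
y = np.array([float(subject in d) for d in docs])

# Vector space model: each document becomes a row of phrase counts.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs).toarray()
phrases = vectorizer.get_feature_names_out()

# Drop phrases containing the subject term so the summary describes the
# subject rather than simply repeating it.
keep = [i for i, p in enumerate(phrases) if subject not in p]
X, phrases = X[:, keep], phrases[keep]

# Fit a sparse predictor of the label; its nonzero features are the summary.
model = Lasso(alpha=0.05).fit(X, y)
print([phrases[i] for i in np.flatnonzero(model.coef_)])
\end{verbatim}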
We consider the efficacy of four feature selection approaches: phrase co-occurrence, phrase correlation, $L^1$-regularized logistic regression (L1LR), and $L^1$-regularized linear regression (the Lasso). We evaluate each under many different pre-processing choices, and to compare the resulting summarizers we conduct a survey in which non-expert human readers rate the generated summaries. Because data pre-processing decisions are important, we also study the impact of several techniques for vectorizing the documents and for identifying which documents concern a given subject.
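To make the four feature selection engines concrete, the following hedged sketch scores a tiny hand-made phrase-count matrix with each of them; the data, thresholds, and regularization strengths are illustrative assumptions, not the settings studied in the paper.
\begin{verbatim}
# Four feature selection engines on a toy phrase-count matrix.
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

phrases = np.array(["senate", "reform", "championship", "rain", "bill"])
X = np.array([[2, 0, 0, 0, 1],   # rows: documents, columns: phrase counts
              [1, 1, 0, 0, 0],
              [0, 0, 2, 0, 0],
              [0, 0, 0, 3, 0]], dtype=float)
y = np.array([1.0, 1.0, 0.0, 0.0])  # 1 = document concerns the subject

# 1. Phrase co-occurrence: total phrase count within the subject documents.
cooccurrence = X[y == 1].sum(axis=0)

# 2. Phrase correlation: correlation of each phrase count with the label.
correlation = np.array([np.corrcoef(X[:, j], y)[0, 1]
                        for j in range(X.shape[1])])

# 3. L1-regularized logistic regression (L1LR): keep nonzero coefficients.
l1lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# 4. L1-regularized linear regression (the Lasso): keep nonzero coefficients.
lasso = Lasso(alpha=0.1).fit(X, y)

summaries = {
    "co-occurrence": phrases[cooccurrence > 0],
    "correlation":   phrases[correlation > 0.5],
    "L1LR":          phrases[np.flatnonzero(l1lr.coef_[0])],
    "Lasso":         phrases[np.flatnonzero(lasso.coef_)],
}
print(summaries)
\end{verbatim}
The two regression-based engines keep only the coefficients that survive the $L^1$ penalty, which is what keeps their summaries short.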
We find that the Lasso is the best choice of feature selection engine: it consistently produces high-quality summaries across the many pre-processing schemes and subjects. Our findings also reinforce many years of work suggesting that the tf-idf representation is a strong choice of vector space, but only for longer units of text. Though we focus here on print media (newspapers), our methods are general and could be applied to any corpus, even one of considerable size.
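For reference, one common variant of the tf-idf weighting (the exact scheme used in the paper may differ) assigns phrase $t$ in document $d$ the weight
\[
\mathrm{tfidf}(t, d) \;=\; \mathrm{tf}(t, d)\cdot \log\frac{N}{\mathrm{df}(t)},
\]
where $\mathrm{tf}(t,d)$ is the count of $t$ in $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.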