KDD Summer School on Mining the Big Data

Program Schedule at a Glance

Program (Tentative)

TITLE: Mining Heterogeneous Information Networks (Slides)
SPEAKER: Jiawei Han (UIUC)

Real-world objects are largely interconnected, forming complex, heterogeneous but semi-structured information networks. Unlike studies of social network analysis in which friendship networks or web page networks form homogeneous information networks, heterogeneous information networks reflect complex and structured relationships among objects of multiple types. For example, in a university network, objects of multiple types, such as students, professors, courses, and departments, and relationships of multiple types, such as teach and advise, are intertwined, providing rich information.
We explore new methodologies for mining hidden knowledge in such heterogeneous information networks, including integrated ranking and clustering, classification, data integration, trust analysis, role discovery and prediction. We show that structured information networks are informative, and link analysis on such networks is powerful at uncovering critical knowledge hidden in large networks. We also present a few promising research directions on mining heterogeneous information networks.

Jiawei Han is Abel Bliss Professor in Engineering in the Department of Computer Science at the University of Illinois. His research covers data mining, information network analysis, and database systems, and he has over 600 publications.
He served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (TKDD) and on the editorial boards of several other journals. Jiawei has received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), the IEEE Computer Society W. Wallace McDowell Award (2009), and the Daniel C. Drucker Eminent Faculty Award (2011). He is a Fellow of ACM and IEEE. He is currently the Director of the Information Network Academic Research Center (INARC), supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of the U.S. Army Research Lab. His book with Micheline Kamber and Jian Pei, "Data Mining: Concepts and Techniques" (Morgan Kaufmann), has been used worldwide as a textbook.

TITLE: Large Graph Mining - Patterns, Tools and Cascade Analysis (Slides)
SPEAKER: Christos Faloutsos (CMU)

What do graphs look like? How do they evolve over time? How can we handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (e.g., "eigenSpokes"). For tools, we present "oddBall" for discovering anomalies and patterns, as well as an overview of the PEGASUS system, which is designed for handling billion-node graphs and runs on top of Hadoop. Finally, for cascades and propagation, we present results on epidemic thresholds as well as fast immunization algorithms.

Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, the SIGKDD Innovations Award (2010), eighteen "best paper" awards (including two "test of time" awards), and four teaching awards. He has served as a member of the executive committee of SIGKDD; he has published over 200 refereed articles, 11 book chapters, and one monograph. He holds five patents and has given over 30 tutorials and over 10 invited distinguished lectures. His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data.

TITLE: Modeling Opinions and Beyond in Social Media (Slides)
SPEAKER: Bing Liu (UIC)

Social media has two main distinctive characteristics: social networks and user opinions. Both topics have received extensive research attention over the past decade. In this lecture, I will focus on user opinions. Opinions are important because they are central to almost all human activities and are key influencers of our behaviors. Our beliefs and perceptions of reality, and the choices we make, are to a considerable degree conditioned on how others see and evaluate the world. For this reason, when we need to make a decision, we often seek out the opinions of others. This is true not only for individuals but for organizations as well. Modeling and mining of opinions has been a major research direction in the past 10 years. Its inception and rapid growth coincide with those of social media on the Web, e.g., reviews, blogs, micro-blogs, Twitter, forum discussions, and social networks, because for the first time in human history, we have a huge volume of opinionated data recorded in digital form. In this lecture, I will first define a model of the problem and a model of the opinionated posting at the abstraction level. I will then discuss some detailed models of relations among opinion components from both the linguistic perspective and the statistical modeling perspective. Beyond that, I will also discuss the modeling of comments about opinions and more. Finally, we will move “behind the scenes” to model the behaviors of people who post opinions and to discover their possible hidden motives.

Bing Liu is a professor of Computer Science at the University of Illinois at Chicago (UIC). He received his PhD in Artificial Intelligence from the University of Edinburgh. Before joining UIC, he was with the National University of Singapore. His current research interests include opinion mining and sentiment analysis, opinion spam (e.g., fake reviews) detection, Web mining, and data mining. He has published extensively in leading conferences and journals in these fields, and has also written a textbook titled "Web Data Mining: Exploring Hyperlinks, Contents and Usage Data" published by Springer (first and second editions). Due to his research on fake review detection, he was featured in a front page article of The New York Times on Jan 27, 2012. In professional service, Liu has served as program chair of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the ACM Conference on Web Search and Data Mining (WSDM), the SIAM Conference on Data Mining (SDM), the ACM Conference on Information and Knowledge Management (CIKM), and the Pacific Asia Conference on Data Mining (PAKDD). He has also served extensively as a senior program committee member, track chair, and area chair for data mining, Web mining, natural language processing, and AI conferences. Additionally, he was or is on the editorial boards of many leading journals, e.g., Data Mining and Knowledge Discovery, ACM Transactions on the Web, and IEEE Transactions on Knowledge and Data Engineering.

TITLE: Two Computational Paradigms for Big Data (Slides)
SPEAKER: Ravi Kumar (Google)

This tutorial will discuss two non-conventional computational models for analyzing massive data. The first is data streams: in this model, data arrives in a stream and the algorithm is tasked with computing a function of the data without explicitly storing it. We will present a few algorithms in this model and pose some challenges arising in this setting. The second is map-reduce: in this model, data is distributed across many machines and computation is done as a sequence of map and reduce operations. As before, we will present a few algorithms in this model and discuss their scalability.
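
As a rough illustration of the two models described above (not taken from the tutorial itself, whose specific algorithms are not listed here), the following Python sketch shows one well-known streaming algorithm, the Misra-Gries frequent-items summary, and a single-process imitation of a map-reduce word count. The function names (`misra_gries`, `map_reduce`, `word_count_mapper`, `word_count_reducer`) are hypothetical and chosen only for this example; a real map-reduce system would distribute the map, shuffle, and reduce phases across machines.

```python
from collections import defaultdict


def misra_gries(stream, k):
    """Streaming model: one pass, at most k - 1 counters kept in memory.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to survive in the returned counters (counts are lower bounds).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def map_reduce(records, mapper, reducer):
    """Map-reduce model, imitated in one process (assumption: no cluster).

    Map emits (key, value) pairs, the shuffle groups values by key,
    and reduce folds each group into a single result.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group by key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


if __name__ == "__main__":
    # "a" occurs 6 times out of 10, which is more than 10 / 3.
    print(misra_gries(iter("aabacadaae"), k=3))
    print(map_reduce(["big data", "Big graphs"],
                     word_count_mapper, word_count_reducer))
```

The streaming function never stores the stream, only a bounded set of counters, which is the constraint the data stream model imposes. The map-reduce function keeps the mapper and reducer free of shared state, which is what allows a real system to run them on separate machines.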

Ravi Kumar has been a senior staff research scientist at Google since June 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! He obtained his PhD in Computer Science from Cornell University in 1998. His primary interests are web and data mining, social networks, algorithms for large data sets, and theory of computation. He serves on the editorial boards of JACM, TKDD, and TKDE.

TITLE: Methods for Mining Social Media and Networks (Slides: part 1, part 2)
SPEAKER: Jure Leskovec (Stanford)

The tutorial investigates computational techniques for modeling social networks and social media. The first part will discuss methods for extracting and tracking information and contagion as they spread over the network. We will examine methods for extracting temporal patterns by which information popularity grows and fades over time. We show how to quantify and maximize the influence of media outlets on the popularity and attention given to a particular piece of content, and how to build predictive models of information diffusion and adoption. The second part will focus on models for extracting structure from social networks and predicting the emergence of new links in social networks. In particular, we will examine methods based on Supervised Random Walks for learning to rank nodes on a graph and consequently recommend new friendships in social networks. We will also consider the problem of detecting dense overlapping clusters in networks and present efficient model-based methods for network community detection.

Jure Leskovec is an assistant professor of Computer Science at Stanford University. His research focuses on the mining and modeling of large social and information networks as a way to study phenomena across the social, technological, and natural worlds. The problems he investigates are motivated by large-scale data, the Web, and online media.
Jure holds a bachelor’s degree in computer science from the University of Ljubljana, Slovenia, and a Ph.D. in machine learning from Carnegie Mellon University. Prior to joining Stanford, he worked as a postdoctoral researcher at Cornell University. Jure has authored the Stanford Network Analysis Platform (SNAP), a general-purpose network analysis and graph mining library that easily scales to massive networks with hundreds of millions of nodes and billions of edges. He received the ACM KDD dissertation award, a Microsoft Research Faculty Fellowship, and an Alfred P. Sloan Fellowship, and appeared on the IEEE Intelligent Systems magazine "AI's 10 to Watch" list. Jure also holds three patents.

TITLE: Managing and Mining Billion-Node Graphs (Slides)
SPEAKER: Haixun Wang (MSRA)

We are facing challenges at all levels, ranging from infrastructures to programming models, for managing and mining large graphs. Many graph algorithms are ad hoc in the sense that each of them assumes that the underlying graph data can be organized in a certain way that maximizes the performance of the algorithm. In other words, there is no standard graph system on which graph algorithms are developed and optimized. In response to this situation, many graph systems have been proposed recently. In this tutorial, we discuss several representative systems. Still, we focus on providing perspectives from a variety of standpoints on the goals and the means for developing a general-purpose graph system. We highlight the challenges posed by the graph data, the constraints of architectural design, the different types of application needs, and the power of different programming models that support such needs.

Haixun Wang is a senior researcher at Microsoft Research Asia in Beijing, China, where he manages the Data Management, Analytics, and Services group. Before joining Microsoft, he was a research staff member at the IBM T. J. Watson Research Center for 9 years. He was Technical Assistant to Stuart Feldman (Vice President of Computer Science of IBM Research) from 2006 to 2007, and Technical Assistant to Mark Wegman (Head of Computer Science of IBM Research) from 2007 to 2009. Haixun Wang has published more than 120 research papers in refereed international journals and conference proceedings. He is on the editorial boards of Distributed and Parallel Databases (DAPD), IEEE Transactions on Knowledge and Data Engineering (TKDE), Knowledge and Information Systems (KAIS), and Journal of Computer Science and Technology (JCST). He is PC co-Chair of CIKM 2012, ICMLA 2011, and WAIM 2011. Haixun Wang received the ER 2008 Conference best paper award (DKE 25 year award) and the ICDM 2009 Best Student Paper runner-up award.