Day 1 | May 14

Nando de Freitas - University of British Columbia
Big data: Statistical and computational challenges and opportunities

I will review several successful consumer products and argue that huge datasets played a central role in their construction. Among these products, I will describe Zite, a personalized magazine built by my team and acquired by CNN, which uses massive classifiers to label nearly a billion documents and personalize content for millions of users. I will then discuss two machine learning ideas at the core of many of these products: Ensemble methods and Bayesian optimization. I will briefly present new theoretical results on online random forests, a very popular ensemble method, and point out theoretical problems and practical challenges in scaling this technique further. I will conclude with an overview of Bayesian optimization, present our theoretical advances in this area, and share our experiences in applying this technique to automatic algorithm configuration, information extraction, massive online analytics and intelligent user interfaces.
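A minimal sketch of one possible Bayesian optimization loop, assuming a Gaussian-process surrogate and the expected-improvement acquisition rule (neither is stated in the abstract); the objective below is a cheap stand-in for an expensive evaluation such as automatic algorithm configuration:

```python
# Toy Bayesian optimization: fit a GP surrogate to past evaluations,
# then pick the next point by expected improvement (minimisation).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                      # placeholder "expensive" function
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(5, 1))    # a few initial random evaluations
y = objective(X).ravel()

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    cand = np.linspace(-2, 2, 500).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = cand[np.argmax(ei)]       # most promising point under the surrogate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next)[0])

print("best x:", X[np.argmin(y)], "best value:", y.min())
```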

Peter Richtarik - University of Edinburgh
Big Data Optimization: why Parallelizing like Crazy and being Lazy can be Good

Optimization with big data calls for new approaches, and for theory that helps us understand what we can and cannot expect. In the big data domain classical approaches are not applicable, as the computational cost of even a single iteration is often prohibitive; these methods were developed at a time when problems of such enormous size were rare. We thus need new methods that are simple, modest in their data-handling and memory requirements, and scalable. Our ability to solve truly huge-scale problems goes hand in hand with our ability to utilize modern parallel computing architectures such as multicore processors, graphical processing units and computer clusters. In this talk I will describe a new approach to big data (convex) optimization which uses what may seem to be an 'excessive' amount of randomization and utilizes what may look like a 'crazy' parallelization scheme. I will explain why this approach is in fact efficient, effective and well suited to big data optimization tasks arising in many fields, including machine and statistical learning, social media and engineering. Time permitting, I may comment on other optimization methods suited to big data applications which also utilize randomization and parallelization.
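As a toy illustration of the kind of randomised method the talk refers to (the specific algorithms and parallelisation scheme are the speaker's own and are not reproduced here), the sketch below runs serial randomised coordinate descent on a least-squares problem, updating one randomly chosen coordinate per step:

```python
# Randomised coordinate descent for min_x 0.5 * ||A x - b||^2.
# Each step picks a coordinate uniformly at random, performs an exact
# minimisation along that coordinate, and updates the residual cheaply.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 50
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.01 * rng.standard_normal(m)

x = np.zeros(n)
r = A @ x - b                          # residual, maintained incrementally
col_norms = (A ** 2).sum(axis=0)       # per-coordinate curvature constants

for _ in range(20000):
    j = rng.integers(n)                # randomisation: pick a coordinate at random
    delta = -(A[:, j] @ r) / col_norms[j]
    x[j] += delta
    r += delta * A[:, j]               # cheap update, no full pass over A

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

In a parallel variant, many coordinates would be updated simultaneously with step sizes adjusted for their interaction, which is roughly the regime the talk's theory addresses.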

Chris Nott - IBM
Some Big Data Case Studies from IBM

This session will share insight into the use of big data, exploring examples of how IBM big data and analytics technologies have been applied to deliver value.

John O'Donovan - Financial Times
Using Data for Insight in the Media

Data is used extensively in the media, both to reinforce stories and to gain insight into metrics and usage. The amount of data can quickly become overwhelming; this talk will show some of the tools used to deal with the problems of managing big data.

Thomas Lansdall-Welfare - University of Bristol
Big data content analysis of online media

The analysis of media content has long been a key task in the social sciences, and its automation and application to large amounts of online media promises to revolutionise the practice of media analysis. In this talk I will give an overview of work done at Bristol during the past few years on the content analysis of online media, with a particular focus on sentiment analysis of Twitter content, notably the detection of major changes in the collective mood in the UK for the period from 2009 to 2012. It will cover the gathering, storage, analysis and visualisation of Twitter content, with an eye to automating or aiding tasks that are of interest to social science researchers. This will include a discussion of data management, data visualisation and key statistical tasks necessary for the content analysis of large amounts of online media. This is joint work with Bill Lampos, Nello Cristianini and the Pattern Analysis Group at the University of Bristol.
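As a hedged illustration of one simple approach to the sentiment-analysis component (the Bristol group's actual methods are not detailed in the abstract), the sketch below scores tweets against small positive and negative word lists and aggregates a daily mood signal; the word lists, dates and tweets are placeholders:

```python
# Toy lexicon-based mood scoring: count positive and negative words per
# tweet, then aggregate scores by day to get a collective-mood time series.
from collections import defaultdict
from datetime import date

POSITIVE = {"happy", "great", "love", "good", "win"}
NEGATIVE = {"sad", "angry", "hate", "bad", "lose"}

def tweet_score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Placeholder data: (date, tweet text) pairs.
tweets = [
    (date(2011, 4, 29), "what a great day, love it"),
    (date(2011, 4, 29), "so happy today"),
    (date(2011, 8, 8), "angry and sad about the news"),
]

daily = defaultdict(list)
for day, text in tweets:
    daily[day].append(tweet_score(text))

for day in sorted(daily):
    scores = daily[day]
    print(day, sum(scores) / len(scores))   # mean mood score per day
```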

Peter Gilks - Barclays
How Barclays are Using Tableau Software to Liberate Data

Within Barclays retail bank we have large amounts of really interesting and useful data that goes mostly unnoticed by the majority of colleagues. We have taken it upon ourselves to help liberate this data from its shackles and make it available and accessible to as many colleagues as possible. One of the key ways we are doing this is by using Tableau Software to explore large datasets quickly and visually, find the hidden insights, and turn the data into accessible and informative charts, stories and dashboards. I will demonstrate how we do this and show some examples of how simple 'data liberation' has made a significant impact on how we do things like identify customer product needs, tackle the root causes of complaints, make life easier for new customers and spot geographic hotspots.

Marianne Bouchart - Bloomberg News
Appraising Big Financial Data Today

Marianne Bouchart, Web Producer EMEA initiating data journalism projects at Bloomberg News, will talk about big financial data. She will demonstrate how her organization deals with it on a daily basis in terms of data journalism, data visualisations and analysis. She'll also tackle the question of trust in social media data within the financial industry and showcase how Bloomberg journalists and clients integrate Twitter data, as well as blogs and other online sources, in their workflow.

Andrei Grigoriev - SAP
Big Genome Data - Statistical Analysis in Life Sciences

Day 2 | May 15

Zoubin Ghahramani - Cambridge University
Modelling Big Data through Bayesian Machine Learning

Building models of complex systems is central to many problems in science, engineering, and commerce. The probabilistic framework for modelling uses probability theory to represent and manipulate all forms of uncertainty. By applying Bayes' rule (which used to be called "inverse probability") a system can infer unknown quantities and learn models from data. This approach forms the basis of probabilistic machine learning, and works best when the models are realistic and flexible. Flexibility is achieved by allowing nonparametric models, whose complexity can grow with the data. I will give a quick review of some of our work in Bayesian modelling, including recent work on matrix factorisation, recommendation, and network modelling. I will pay particular attention to the issue of scalability: how to get Bayesian methods to scale to large data sets.
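A minimal worked example of the Bayes'-rule inference the abstract describes, using a conjugate Beta-Binomial model (the model and numbers are illustrative, not taken from the talk):

```python
# Bayes' rule with a conjugate Beta prior: infer an unknown success rate
# theta from observed counts. The posterior is available in closed form,
# which is what makes this toy case cheap; the nonparametric models the
# talk describes require approximate inference instead.
from scipy.stats import beta

a0, b0 = 1.0, 1.0            # uniform Beta(1, 1) prior over theta
successes, failures = 37, 463

a_post, b_post = a0 + successes, b0 + failures
posterior = beta(a_post, b_post)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```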

Patrick Wolfe - UCL
Big Data and Statistical Network Analysis: Challenges and Opportunities at the Forefront of Modern Statistics

We are surrounded by networks and other forms of "big data" that challenge the assumptions of traditional statistical data analysis. From Facebook's graph search to large-scale recommender systems like Amazon's, how do we model mathematical objects that describe relationships and interactions rather than points in space? How do we turn unbounded data rates to our advantage? And what does it all mean for modern methods of detection and estimation? In this talk I will share research perspectives on these questions, drawing on my work with partners across a variety of professional sectors and academic disciplines. The combination of challenges and opportunities presented by this area makes it unique and exciting at the forefront of modern mathematics, statistics, and computing.
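One concrete example of a statistical model for relational data of the kind the abstract alludes to is the stochastic blockmodel; it is named here purely as an illustration, since the talk's own models are not specified. A minimal sampler:

```python
# Sample an undirected graph from a two-block stochastic blockmodel:
# nodes in the same community connect with higher probability than
# nodes in different communities.
import numpy as np

rng = np.random.default_rng(0)
n = 100
labels = rng.integers(2, size=n)                 # community assignment per node
P = np.array([[0.15, 0.02],
              [0.02, 0.12]])                     # block connection probabilities

probs = P[labels[:, None], labels[None, :]]      # n x n edge-probability matrix
upper = np.triu(rng.random((n, n)) < probs, k=1) # sample each edge once
adjacency = upper | upper.T                      # symmetrise

print("edges:", adjacency.sum() // 2)
print("within-block edge fraction:",
      adjacency[labels[:, None] == labels[None, :]].sum() / adjacency.sum())
```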

Paul Wilkinson - Cloudera
Introduction to Cloudera Impala

Cloudera Impala is a modern SQL engine for Hadoop. Impala provides fast, interactive SQL queries directly on your data stored in HDFS and HBase. This talk gives a high-level overview of Impala and its architecture, and provides a comparison to Apache Hive and Google's Dremel.
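A small sketch of how an interactive Impala query might be issued from Python, assuming the impyla client library and a reachable Impala daemon; the host, port and table name are placeholders, not taken from the talk:

```python
# Issue an interactive SQL query against Impala via impyla (DB-API style).
# Host, port and the pageviews table are assumptions for illustration.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# Impala executes the query directly over data stored in HDFS/HBase.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM pageviews
    WHERE dt = '2013-05-15'
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")

for url, hits in cur.fetchall():
    print(url, hits)

cur.close()
conn.close()
```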

Stefan Sperber - Last.fm
Data Rules Everything Around Me: Big Data Processing at Last.fm

This session will give an overview of how (Big) Data is used at Last.fm on a day-to-day basis. For this, three case studies will be presented: generating listening charts, matching and integrating third-party data, and user-behaviour analysis. I will describe the resulting challenges, sketch out solutions we developed and explain why we often favour simpler approaches to a Big Data problem over more complex ones.
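As a toy version of the first case study, the sketch below builds a listening chart by counting plays per artist from a small scrobble log; the records and field layout are placeholders rather than Last.fm's actual schema, and at scale the same aggregation would run as a distributed job:

```python
# Build a simple listening chart: count plays per artist from scrobble
# records and print the top entries.
from collections import Counter

# Placeholder scrobbles: (user_id, artist, track)
scrobbles = [
    ("u1", "Radiohead", "Reckoner"),
    ("u2", "Radiohead", "Nude"),
    ("u1", "Burial", "Archangel"),
    ("u3", "Radiohead", "Reckoner"),
    ("u2", "Burial", "Archangel"),
    ("u3", "Daft Punk", "Around the World"),
]

plays = Counter(artist for _, artist, _ in scrobbles)

for rank, (artist, count) in enumerate(plays.most_common(3), start=1):
    print(rank, artist, count)
```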

Rachel Schutt - Columbia University & Johnson Research Lab
Educating Next Generation Data Scientists

Data Science is an emerging field in industry, yet it is not well-defined as an academic discipline (or even in industry, for that matter). I proposed the "Introduction to Data Science" course at Columbia University in March 2012, based on my experience as a statistician at Google and on a series of informal interviews and conversations I had with data scientists in industry and with professors. The course was offered in the Autumn of 2012, and was the first course at Columbia to have the term "Data Science" in the title. Throughout the semester we examined the central question of "What is Data Science?" from two perspectives: (1) Data Science is what Data Scientists do, and (2) Data Science has the potential to be a deep academic discipline. I'll discuss the process I went through to create and design a Data Science course, how I engaged the data science community, and how we examined that central question.

Yoram Bachrach - Microsoft Research
Information Aggregation and Fingerprinting For Big Datasets

Recent technological changes have made it easier than ever to collect massive amounts of data, but have also made it more difficult to distill them into knowledge that can be used for decision making. I will discuss how we can quantify the level of collective intelligence such data allows us to achieve, and demonstrate how it depends on the method we use to process the data. I will demonstrate this notion of collective intelligence by showing how to combine the opinions of experts using probabilistic graphical models that allow us both to reach the best decision and to determine who the most competent experts are. For domains where the raw information streams are too costly to store, I will also discuss fingerprinting techniques that allow capturing only the 'essence' of the data that is required to reach a decision, and show how these can be used for big data recommender systems.
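The abstract does not name a specific fingerprinting technique; as one standard example of compressing a stream into a small sketch that still supports the similarity queries used in recommender systems, here is a minimal MinHash implementation:

```python
# MinHash fingerprints: each set of items a user has interacted with is
# compressed into K minimum hash values. The fraction of matching values
# between two fingerprints estimates the Jaccard similarity of the
# original sets, without storing the raw streams.
import hashlib

K = 64  # fingerprint size

def _hash(item, seed):
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def fingerprint(items):
    return [min(_hash(x, seed) for x in items) for seed in range(K)]

def estimated_jaccard(fp_a, fp_b):
    return sum(a == b for a, b in zip(fp_a, fp_b)) / K

user_a = {"item%d" % i for i in range(100)}
user_b = {"item%d" % i for i in range(50, 150)}   # true Jaccard = 50/150

fp_a, fp_b = fingerprint(user_a), fingerprint(user_b)
print("estimated similarity:", estimated_jaccard(fp_a, fp_b))
print("true similarity:", len(user_a & user_b) / len(user_a | user_b))
```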

Martin Goodson - QuBit
Online analytics: large scale behavioural modelling at Qubit

Qubit helps businesses to improve their websites by analysing the behaviour of website users. We build models to predict the user's intentions and then tailor websites to the user. Qubit collects around a billion data points per day, including which pages were visited by users, as well as page-based activity right down to mouse movements and clicks. In this talk I'll summarise the infrastructure and tools we use to do statistical analysis on large datasets, along with an outline of some of our ongoing research projects. These projects range from the analysis of free text to the prediction of purchase events from clickstream data. I'll also describe the analysis challenges we are trying to solve.
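As a hedged sketch of the purchase-prediction task (Qubit's actual features, models and infrastructure are not described in the abstract), the example below fits a logistic-regression model on a few simple session-level clickstream features using synthetic data:

```python
# Predict purchase events from simple session-level clickstream features:
# pages viewed, clicks, and seconds on site. The data is synthetic; real
# features would be derived from the raw event stream.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
pages = rng.poisson(5, n)
clicks = rng.poisson(10, n)
seconds = rng.exponential(120, n)

# Synthetic label: purchase probability rises with engagement.
logit = -4 + 0.2 * pages + 0.05 * clicks + 0.005 * seconds
purchased = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([pages, clicks, seconds])
X_train, X_test, y_train, y_test = train_test_split(
    X, purchased, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```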

Adrian Jones - SAS
Big Data... The Story So Far

SAS works with customers to solve ever more challenging problems with increasing amounts of data. SAS is focused on exploiting the data to derive meaningful value in a timeframe appropriate to the situation. The "Big Data" wave of recent years has driven the demand and influenced SAS' approach to the development and deployment of analytics. This talk will look at how SAS views the "Big Data" trend and share some of the lessons learned from the journey, based on experiences in the field.