Spark Summit 2014 - Day 2

2014-07-01

https://blog.arjon.es/2014/spark-summit-2014-day-2/

#spark
#data

Following the Spark Summit 2014 - Day 2

# The Emergence of the Enterprise Data Hub

Mike Olson (Chief Strategy Officer, Cloudera)

Cloudera plans to port Hadoop Ecosystem to Spark, as replacement to M/R.
Cloudera will keep support Impala, among Spark components. IMHO, it is split efforts and I can understand why they are doing this, beside biz decision of course!

# The Future of Spark

Patrick Wendell (Databricks)

## Goals of project

Empower Data scientists and engineers to do their job
Expressive & clean API
Unified runtime across many environments
Powerful standard libraries

## API

Focus on API stability on Spark 1.0+ (breaking patchs are automatically rejected)
- Minor: Every 3 months (1.1 August), 1.2, 1.3
- Maintenance are kept active 1.0.1, 1.0.2, etc

## Future is about libraries

Focus on high-level libraries
Packaged and distributed w/ Spark to provide full inter-operability

## Spark SQL

More active process
Notion of schema RDDs
Focus now are:
- Optimization
- Language extension (towards SQL92)
- Integration

## What about Shark?

Will be replaced by Spark SQL.
JDBC server component preview on 1.0.1
Final release to 1.1

## Spark Core

Allow extension/innovation by defining internal API’s
Internal Storage API
Spark shuffle API (sort-based, pipeline)

## Timeline

Spark 1.0.1

JSON Support

Spark 1.1

Generalized Shuffle Interface
MLlib stats algorithms
JDBC Server
Sort-based shuffle

Spark 1.2

Refactor Storage Engine

Spark 1.3+

SparkR

# Next Spark Summit will be NYC! Early 2015

# Beyond Analytics — Building Data Products for Data Natives

Monica Rogati - @mrogati (VP of Data at Jawbone)

## Data Natives:

Beyond digital natives, expect smart and seamlessly adapt
Expect things to KNOW what they want, ie: Expect the thermostat programs itself
The promise: better, richer, easier lives
quite not there yet!

## Data Products:

Context, Personalization by Using Data, from You, Others and The World

How data product can drive life changes (eat, sleep, exercise, achieve your goals)

Data Science is not about charts and Graphs is about delivery better experiences

## Analytics + Exploration to Build Data Products

Good Instrumentation
Reliable Data Flow (fault tolerance, scalable)
Data Cleanup
Fast Iteration (if it takes 30min to have a top distro, we not gonna check the data)
Good UX

More than that:

The virtuous cycle of smart interactions: More & better data comes from better UX, ie: Auto-complete for food app.

Mutable Ideas