We often have to work with JSON data sets, but now and then data comes in CSV format. I received a great tip from @diegodellera, who told me about textql, a tool that executes SQL against structured text like CSV or TSV.
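To make the idea of running SQL against CSV concrete, here is a minimal sketch in Python. It is not textql itself and does not use its command line; it only illustrates the general approach of loading delimited text into an in-memory SQLite table and querying it. The function name `query_csv`, the table name `data`, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one SQL query.

    Hypothetical helper for illustration only: the first CSV row is
    treated as the header, and every value is stored as text.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', records)

    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Made-up sample data, not taken from the original post.
sample = "name,city\nana,Lisbon\nbob,Porto\ncarla,Lisbon\n"
print(query_csv(sample, "SELECT city, COUNT(*) FROM data GROUP BY city ORDER BY city"))
```

A dedicated tool is still more convenient for files on disk, since it takes care of delimiters, headers, and output formatting for you.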
Ever since I got to know News.me, it has surprised me with its simplicity and yet its power to recommend the best stories to read. I always thought it would be fun to try to build something similar, so I decided to create a PoC of Twitter’s top stories using Apache Spark.
I took the Track B: Advanced Apache Spark Workshop, and I can say it was really great to learn more about Spark internals and its libraries. The Databricks team was awesome. All slides and training material are already online: Spark Summit 2014 Training.
After our first day at Spark Summit 2014, I was very excited to try Spark SQL with JSON manipulation, so I downloaded and compiled the SNAPSHOT version of Spark to try this feature.
Following the Spark Summit 2014 - Day 2
My personal notes during Spark Summit 2014.
Following the Spark Summit 2014 - Notes from the afternoon talks
Most of the people I work with understand the concept of an MVP and believe in the Lean Startup paradigm for building products. I have found that agreeing on how to evaluate and choose which features must be part of the MVP is harder to achieve.
By default, Spark uses Java serialization, which is widely regarded as a poor choice: serialization and deserialization are costly in CPU and memory, and the bytes it generates are not compact.