Working with CSV files on shell

We often have to work with JSON data sets, but now and then data comes in CSV format. I received a great tip from @diegodellera, who told me about textql, a tool to execute SQL against structured text like CSV or TSV.

2014-07-29

https://blog.arjon.es/2014/working-with-csv-files-on-shell/

#bash
#data

Processing Twitter’s top stories with Apache Spark (part 1)

Ever since I got to know News.me, it has surprised me with its simplicity and yet its power to recommend the best stories to read. I always thought it would be fun to try to build something similar, so I decided to create a PoC of Twitter’s top stories using Apache Spark.

2014-07-07

https://blog.arjon.es/2014/processing-twitters-top-stories-with-apache-spark-part-1/

Spark Summit - Training Day

I took Track B: Advanced Apache Spark Workshop, and I can say it was really great to learn more about Spark internals and its libraries. The Databricks team was awesome. All slides and training material are already online: Spark Summit 2014 Training.

2014-07-03

https://blog.arjon.es/2014/spark-summit-training-day/

#spark
#data

Processing JSON with Spark SQL

After our first day at Spark Summit 2014, I was very excited to try Spark SQL for JSON manipulation, so I downloaded and compiled the SNAPSHOT version of Spark to try this feature.

2014-07-01

https://blog.arjon.es/2014/processing-json-with-spark-sql/

#json
#spark

Spark Summit 2014 - Day 2

Notes from following Spark Summit 2014, Day 2.

2014-07-01

https://blog.arjon.es/2014/spark-summit-2014-day-2/

#spark
#data

Spark Summit 2014 - Day 2 (Afternoon)

Notes from following Spark Summit 2014, Day 2 (afternoon).

2014-07-01

https://blog.arjon.es/2014/spark-summit-2014-day-2-afternoon/

#spark
#data

Spark Summit 2014 - Day 1

My personal notes from Spark Summit 2014.

2014-06-30

https://blog.arjon.es/2014/spark-summit-2014-day-1/

#spark
#data

Spark Summit 2014 - Day 1 (Afternoon)

Notes from the afternoon talks at Spark Summit 2014.

2014-06-30

https://blog.arjon.es/2014/spark-summit-2014-day-1-afternoon/

#spark
#data

Great questions for Product Development and MVP definition

Most of the people I work with understand the concept of an MVP and believe in the Lean Startup paradigm for building products. I have found that agreeing on how to evaluate and choose which features must be part of the MVP is harder to achieve.

2014-04-28

https://blog.arjon.es/2014/great-questions-for-product-development-and-mvp-definition/

#product

How to change default serializer on Apache Spark Shell

By default, Spark uses Java serialization, which is widely regarded as a poor choice: serialization and deserialization are costly in CPU and memory, and the generated bytes are not compact.

2014-04-14

https://blog.arjon.es/2014/how-to-change-default-serializer-on-apache-spark-shell/

#spark