A few days ago I received an email from a student at Universidad Tecnológica Nacional asking for advice on what skills he needed to acquire to be hired as a Big Data Engineer. I felt it was something worth writing about, and hopefully it can generate a healthy debate and help more people.
ABSTRACT: Apache Spark is a new distributed processing framework for big data, written in Scala with wrappers for Python and Java, which has been drawing a lot of attention from the community for its power, ease of use, and processing speed. It is already being called the replacement for Apache Hadoop.
Working with JSON datasets is a really common task nowadays; almost any API will output information in this format. Yet JSON is still complex to manipulate when compared with plain text combined with common unix commands like cut, awk, sed, etc.
To reduce this gap, jq was developed with exactly this paradigm in mind: jq is like sed for JSON data. This post will walk through the details of how to select fields (projection), flatten arrays, filter JSON documents based on a field value, and convert JSON to CSV/TSV.
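To give a flavor of what the post covers, here is a minimal sketch of each operation in jq; the file name (users.json) and the fields (name, country, tags) are made up for illustration:

```bash
# projection: keep only the name field of each object
jq '.[] | .name' users.json

# flatten an array field into one value per line
jq '.[] | .tags[]' users.json

# filter objects based on a field value
jq '.[] | select(.country == "AR")' users.json

# convert to CSV; -r prints raw strings instead of JSON-quoted ones
jq -r '.[] | [.name, .country] | @csv' users.json

# same idea for TSV
jq -r '.[] | [.name, .country] | @tsv' users.json
```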
ABSTRACT: Working with big volumes of data is a complicated task, but it’s even harder if you have to do everything in real time and try to figure it all out yourself. This session will use practical examples to discuss architectural best practices and lessons learned when solving real-time social media analytics, sentiment analysis, and data visualization decision-making problems with AWS.
2014-11-15 EDIT: Check the Talk recording and Slides here
On November 13 I will be sharing the stage with Socialmetrix’s Solutions Architect Sebastian Montini at Amazon AWS re:Invent. We will talk about our experience developing Socialmetrix’s big data and real-time infrastructure, our architecture’s evolution, and lessons learned.
This talk was presented at the Maestría en Explotación de Datos y Descubrimiento del Conocimiento, as part of its 10th anniversary, under the theme Hablemos de Big Data (Big Data Talks).
A small collection of tips & tricks I have learned working with Spark so far; I hope it helps you as well. If you have more tricks, please let me know!
Today I launched a Spark job that was taking too long to complete, and I had forgotten to start it through screen, so I needed to find a way to keep it running after disconnecting my terminal from the cluster.
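The post walks through the actual fix; as a rough sketch of one common approach in bash (not necessarily the exact steps from the post), you can suspend the running job, resume it in the background, and detach it from the shell so it survives the logout. The jar name, class, and log file below are hypothetical:

```bash
# the job is still running in the foreground of the SSH session:
# 1. suspend it with Ctrl-Z
# 2. resume it in the background
bg %1
# 3. tell the shell not to send it SIGHUP when the session closes
disown -h %1

# for future runs, start it detached from the terminal in the first place
nohup spark-submit --class com.example.MyJob my-job.jar > job.log 2>&1 &
```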
The book “DATA + DESIGN | a simple introduction to preparing and visualizing information” is an excellent reference for creating visualizations for several types of data; it guides you through simple and complex data with very clear dos and don’ts. On top of all that, it is free.
An introduction (in Spanish) to Apache Spark, the in-memory distributed processing framework: Spark basics, RDDs, and a demo of the SparkSQL and Spark Streaming libraries.
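As a companion to that talk description, here is a minimal Scala sketch of the kind of thing the demo covers: an RDD word count plus a Spark SQL query over JSON. It uses the classic SparkContext/SQLContext API; the file paths, app name, and table/field names are made up for illustration, and Spark Streaming is left out for brevity.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkIntroDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("spark-intro-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // RDD basics: load a (hypothetical) text file and run a word count
    val lines  = sc.textFile("data/sample.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    // Spark SQL: load a (hypothetical) JSON file as a DataFrame and query it
    val sqlContext = new SQLContext(sc)
    val people = sqlContext.read.json("data/people.json")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

    sc.stop()
  }
}
```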