Last week I had the opportunity to share Socialmetrix's experience installing and configuring Datastax Analytics clusters on Azure. Datastax offers a commercial solution as a single bundle containing Cassandra, Spark, and Solr, fully integrated. The talks were given at the Argentina Big Data Meetup, hosted by Jampp, and at the Nardoz Meetup, hosted by Medallia.
Tag: spark
We have been using the Spark Standalone deploy mode for more than a year now, but recently I tried Azure's HDInsight, which runs on Hadoop 2.6 (YARN deploy).
After provisioning the servers, all the small tests worked fine: I was able to run Spark-Shell and read and write to Blob Storage, until I tried to write to a Datastax Cassandra cluster, which constantly returned an error message: Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {10.0.1.4}:9042
This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3; sometimes Apache Flume fails at the last phase, renaming the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
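The teaser doesn't include the workaround code itself, but the root cause is easy to demonstrate locally (the file name events.gz.tmp below is made up for the demo): textFile chooses a decompression codec by file extension, while the bytes themselves remain perfectly valid gzip.

```shell
# Local demo: a ".gz.tmp" suffix hides the gzip codec from extension-based
# detection, but the content is still valid gzip, so decompressing it
# explicitly works regardless of the file name.
echo "hello spark" | gzip > events.gz.tmp   # simulate Flume's stuck temp file
gzip -dc events.gz.tmp                      # prints "hello spark"
```

The same idea carries over to Spark: read the raw bytes and decompress them yourself, or rename the objects back to .gz before reading.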
Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid further blockers I decided to build a Vagrant image instead, and to complement the package with Apache Zeppelin as the UI.
This Vagrant image builds on Debian Jessie, with Oracle Java, Apache Spark 1.4.1, and Zeppelin (built from the master branch).
This talk describes Socialmetrix's experience after more than a year using Apache Spark in production, the reasons that led us to move from Hadoop+Hive to Spark, and the facts we took into account to support this decision.
ABSTRACT: Apache Spark is a new distributed processing framework for big data, written in Scala with wrappers for Python and Java, which has been drawing a lot of attention from the community for its power, ease of use, and processing speed. It is already being called the replacement for Apache Hadoop.
A small collection of tips & tricks I have learned working with Spark so far; I hope it helps you as well. If you have more tricks, please let me know!
Today I launched a Spark job that was taking too long to complete, and I had forgotten to start it through screen, so I needed to find a way to keep it running after disconnecting my terminal from the cluster.
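The teaser doesn't say which trick the post settled on; a common fix when screen was forgotten is nohup plus disown, sketched here with a short sleep standing in for the actual spark-submit command:

```shell
# Keep a job alive after the terminal goes away: nohup shields it from
# SIGHUP, output is redirected to a log file, and disown (a bash builtin)
# removes it from the shell's job table so logging out does not kill it.
# 'sleep 1; echo finished' stands in for the real spark-submit invocation.
nohup sh -c 'sleep 1; echo finished' > job.log 2>&1 &
disown
```

For a job that is already running, pressing Ctrl+Z and then issuing bg and disown detaches it without a restart.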
An introduction (in Spanish) to Apache Spark, the in-memory distributed processing framework. Covers Spark basics and RDDs, and includes a demo of the SparkSQL and Spark Streaming libraries.
Ever since I got to know News.me, it has surprised me with its simplicity and yet its power to recommend the best stories to read. I always thought it would be fun to try to build something similar, so I decided to create a PoC of Twitter's top stories using Apache Spark.