Last week I had the opportunity to share Socialmetrix's experience installing and configuring Datastax Analytics clusters on Azure. Datastax offers a commercial solution as a single bundle containing Cassandra, Spark, and Solr, fully integrated. The talks were given at the Argentina Big Data Meetup, hosted by Jampp, and at the Nardoz Meetup, hosted by Medallia.
Tag: spark
We have been using the Spark Standalone deploy mode for more than a year now, but recently I tried Azure's HDInsight, which runs on Hadoop 2.6 (YARN deploy).
After provisioning the servers, all the small tests worked fine: I was able to run Spark-Shell and read and write to Blob Storage, until I tried to write to a Datastax Cassandra cluster, which constantly returned an error message: Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {10.0.1.4}:9042
This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3; sometimes Apache Flume fails at the last phase, renaming the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
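The teaser doesn't include the workaround code itself, but the root cause is easy to demonstrate locally (the file name events.gz.tmp below is made up for the demo): textFile chooses a decompression codec by file extension, while the bytes themselves remain perfectly valid gzip.

```shell
# Local demo: a ".gz.tmp" suffix hides the gzip codec from extension-based
# detection, but the content is still valid gzip, so decompressing it
# explicitly works regardless of the file name.
echo "hello spark" | gzip > events.gz.tmp   # simulate Flume's stuck temp file
gzip -dc events.gz.tmp                      # prints "hello spark"
```

The same idea carries over to Spark: read the raw bytes and decompress them yourself, or rename the objects back to .gz before reading.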
Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid further blockers I decided to build a Vagrant image instead, and to complement the package with Apache Zeppelin as the UI.
This Vagrant image builds on Debian Jessie, with Oracle Java, Apache Spark 1.4.1, and Zeppelin (built from the master branch).
This talk describes Socialmetrix's experience after more than a year using Apache Spark in production, the reasons that led us to move from Hadoop+Hive to Spark, and the facts we took into account to support this decision.
ABSTRACT: Apache Spark is a new distributed processing framework for big data, written in Scala with wrappers for Python and Java, which has been drawing a lot of attention from the community for its power, ease of use, and processing speed. It is already being called the replacement for Apache Hadoop.
A small collection of tips & tricks I have learned working with Spark so far; I hope it helps you as well. If you have more tricks, please let me know!
Today I launched a Spark job that was taking too long to complete, and I had forgotten to start it through screen, so I needed to find a way to keep it running after disconnecting my terminal from the cluster.
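The teaser doesn't say which trick the post settled on; a common fix when screen was forgotten is nohup plus disown, sketched here with a short sleep standing in for the actual spark-submit command:

```shell
# Keep a job alive after the terminal goes away: nohup shields it from
# SIGHUP, output is redirected to a log file, and disown (a bash builtin)
# removes it from the shell's job table so logging out does not kill it.
# 'sleep 1; echo finished' stands in for the real spark-submit invocation.
nohup sh -c 'sleep 1; echo finished' > job.log 2>&1 &
disown
```

For a job that is already running, pressing Ctrl+Z and then issuing bg and disown detaches it without a restart.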
An introduction (in Spanish) to Apache Spark, the in-memory distributed processing framework. Covers Spark basics and RDDs, and includes a demo of the SparkSQL and Spark Streaming libraries.
Ever since I got to know News.me, it has surprised me with its simplicity and yet its power to recommend the best stories to read. I always thought it would be fun to try to build something similar, so I decided to create a PoC of Twitter's top stories using Apache Spark.