
Mutable Ideas

Tag: data

Data science approach to organizing my playlist

A couple of years ago I created a Spotify playlist where I added every track I liked, as the main repository of things I’d like to listen to, no matter what mood I was in when I added the song. As time went by, this playlist became less enjoyable to listen to because of the abrupt changes in rhythm: it jumps from a Metal song straight into Bossa Nova, which is very annoying. This post covers a few data science approaches I applied to organize this playlist, what worked and what didn’t.
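The excerpt doesn’t name the approaches, but for a flavor of what organizing a playlist “by data” can look like, here is a hedged sketch (not necessarily the post’s method): cluster tracks on audio features such as tempo and energy, then split the playlist by cluster. The feature values below are made up; in practice they could come from Spotify’s audio-features endpoint via a client library such as spotipy.

```python
# Illustrative sketch only -- the feature matrix is hypothetical.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# One row per track, columns = (tempo, energy, danceability)
features = np.array([
    [180.0, 0.95, 0.40],   # fast and aggressive
    [ 75.0, 0.30, 0.70],   # calm and danceable
    [ 82.0, 0.35, 0.65],
    [170.0, 0.90, 0.45],
])

# Scale the features, then group similar tracks together
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features)
)
print(labels)  # tracks sharing a label would go to the same sub-playlist
```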

Where to Find Datasets to Learn Big Data & Data Science

Sometimes you just need data to learn how an algorithm works, to run a stress test, or simply to have an excuse to spin up several machines in a cluster and watch them crunch the data. More often than not, it is incredibly hard to obtain data, and a few colleagues I’ve talked to have had the same problem, so this post is a collection of links and references to datasets I know have been open sourced. Please contribute =)

Reading compressed data with Spark using unknown file extensions

This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3, and sometimes Apache Flume fails at the last phase, renaming the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
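The post’s exact code isn’t shown in this excerpt, so the sketch below illustrates one possible workaround, assuming PySpark and a placeholder S3 path: textFile picks the decompression codec from the file extension, so .gz.tmp files can instead be read as raw bytes with binaryFiles and gunzipped by hand.

```python
# Workaround sketch (assumed, not necessarily the post's exact code).
import gzip
import io
from pyspark import SparkContext

sc = SparkContext(appName="read-gz-tmp")

def gunzip_lines(path_and_bytes):
    # binaryFiles yields (path, raw_bytes); decompress manually because the
    # ".gz.tmp" extension stops Spark from selecting the gzip codec itself.
    _path, raw = path_and_bytes
    with gzip.GzipFile(fileobj=io.BytesIO(bytes(raw))) as gz:
        return gz.read().decode("utf-8").splitlines()

# Placeholder path -- point it at the bucket/prefix holding the .gz.tmp files
lines = sc.binaryFiles("s3n://my-bucket/logs/*.gz.tmp").flatMap(gunzip_lines)
print(lines.take(5))
```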

Vagrant + Spark + Zeppelin a toolbox to the Data Analyst (or Data Scientist)

Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid further blockers I decided to build a Vagrant image instead and complement the package with Apache Zeppelin as the UI. This Vagrant box builds on Debian Jessie, with Oracle Java, Apache Spark 1.4.1 and Zeppelin (built from the master branch).
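For orientation, here is a minimal Vagrantfile sketch of what such a box might look like. The base box name, forwarded port, memory, and download URL are assumptions, and it installs OpenJDK for brevity where the post uses Oracle Java; the Zeppelin build from master is omitted.

```ruby
# Minimal sketch, not the post's actual Vagrantfile.
Vagrant.configure("2") do |config|
  config.vm.box = "debian/jessie64"                            # Debian Jessie base box (assumed)
  config.vm.network "forwarded_port", guest: 8080, host: 8080  # Zeppelin web UI

  config.vm.provider "virtualbox" do |vb|
    vb.memory = 4096
  end

  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y openjdk-7-jdk wget
    # Spark 1.4.1 download URL assumed; pick the Hadoop build you need
    wget -q https://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
    tar -xzf spark-1.4.1-bin-hadoop2.6.tgz -C /opt
    # Zeppelin would be cloned and built from the master branch here (omitted)
  SHELL
end
```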