Using virtual environments lets you isolate a project's dependencies from those installed in system directories. Each .venv
contains its own version of the Python binaries and its own dependencies. What follows is a simple demonstration of how venv could be organized to share the project effectively.
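As a minimal sketch of the idea, the standard-library `venv` module can create a project-local `.venv` folder. The helper names `create_project_venv` and `venv_python` are hypothetical, chosen for illustration, and pip is left out by default to keep the example fast.

```python
import sys
import venv
from pathlib import Path


def create_project_venv(project_dir: Path, with_pip: bool = False) -> Path:
    """Create a .venv folder inside project_dir and return its path.

    Hypothetical helper: the folder name ".venv" follows the excerpt above.
    """
    env_dir = Path(project_dir) / ".venv"
    # venv.create builds the isolated environment (interpreter link + config file).
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside the environment.

    The layout differs between Windows (Scripts/) and POSIX systems (bin/).
    """
    if sys.platform == "win32":
        return Path(env_dir) / "Scripts" / "python.exe"
    return Path(env_dir) / "bin" / "python"
```

One common convention, assumed here rather than taken from the post, is to keep `.venv` out of version control and share a dependency list (for example a `requirements.txt`) so that each collaborator rebuilds the environment locally.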
Tag: data
A couple of years ago I created a Spotify playlist where I add every track I like, as a main repository of things I’d like to listen to, no matter what mood I was in when I added the song. Over time, this playlist became less enjoyable to listen to because of the abrupt changes in rhythm: it jumps from a Metal song to Bossa Nova, which is very annoying. This post covers a few data science approaches I applied to organize this playlist, including what worked and what didn’t.
Sometimes you just need data to learn how an algorithm works, to run a stress test, or just to have an excuse to spin up several machines in a cluster and see how it crunches the data. More often than not, it is incredibly hard to obtain data, and a few colleagues I’ve talked to have had the same problem, so this post is a collection of links and references for datasets I know have been open sourced. Please contribute =)
Although the tag cloud may seem a somewhat outdated and much-criticized visualization format, I have no doubt it can be useful sometimes. And if you can create one with only a few keystrokes, that is pretty sweet. Below I’ll show a technique for extracting Twitter #hashtags, but you can apply it to virtually any text source.
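The hashtag-extraction step can be sketched in a few lines of Python. This is an illustration under assumptions, not necessarily the approach used in the post: the function name `count_hashtags` and the simple `#\w+` pattern are hypothetical choices, and the resulting counts are the kind of input a tag cloud tool would weight words by.

```python
import re
from collections import Counter

# Assumed pattern: a '#' followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def count_hashtags(texts):
    """Count hashtags across an iterable of strings, ignoring case."""
    counts = Counter()
    for text in texts:
        # findall returns only the captured group, i.e. the tag without '#'.
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


if __name__ == "__main__":
    sample = ["Loving #Spark and #BigData", "#spark rocks"]
    for tag, freq in count_hashtags(sample).most_common():
        print(tag, freq)
```

Because the function accepts any iterable of strings, the same counting works for text sources other than tweets.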
This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3. Sometimes Apache Flume fails in the last phase, renaming the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
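The excerpt does not describe the workaround itself, so the following is only a sketch of the underlying idea: a gzip stream can be decompressed based on its content, regardless of the file extension. It uses plain Python rather than Spark, and the helper name `read_gzip_lines` is hypothetical.

```python
import gzip


def read_gzip_lines(path):
    """Decompress a gzip file by content, ignoring its extension.

    Works for names such as "events.gz.tmp" because gzip.open
    does not inspect the file name.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

In a Spark job, one possible way to apply the same idea (again an assumption, not the post's method) would be to load the raw bytes with `SparkContext.binaryFiles` and decompress them inside a map function.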
Recently I built an environment to help me teach Apache Spark. My initial plan was to use Docker, but I ran into some issues, especially on older machines, so to avoid more blockers I decided to build a Vagrant image and complement the package with Apache Zeppelin as the UI.
This Vagrant image builds on Debian Jessie, with Oracle Java, Apache Spark 1.4.1 and Zeppelin (from the master branch).
Several tutorials assume you own a data set. Often that is not the case, and you can’t take advantage of the tutorial because you don’t have data to play along with. To comply with social networks’ Terms and Conditions you can’t publish your data sets, but you can create your own! Follow along with these few commands.
This post shows how to create heatmaps of conversations taking place on Twitter. It is a proof-of-concept technique to learn more about our current datasets; this knowledge will later be applied to the product development cycle. My objective here is to share a simple way to create a quick visualization and be able to run an internal demo.
A few days ago I received an email from a student at Universidad Tecnológica Nacional asking me for advice about what skills he needed to acquire to be hired as a Big Data Engineer. I felt it was something worth writing about, and hopefully it can generate a healthy debate and help more people.
The book “DATA + DESIGN | a simple introduction to preparing and visualizing information” is an excellent reference for creating visualizations of several types of data. It guides you through simple and complex data with very clear Dos and Don’ts. On top of that, it is free.