Often we have to work with JSON data sets, but now and then data comes in CSV format. I received a great tip from @diegodellera, who told me about textql: execute SQL against structured text like CSV or TSV.
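As a rough sketch of what querying a CSV with textql looks like: the sample CSV below is made up, and the `-header` and `-sql` flags are taken from textql's README, so double-check them against your installed version.

```shell
# Build a tiny, made-up sample CSV to query.
cat > people.csv <<'EOF'
name,age,city
ana,34,lisbon
bruno,28,porto
carla,41,lisbon
EOF

# With textql installed, something like this runs SQL straight on the CSV
# (flag names per the textql README; verify against your version):
#   textql -header -sql "SELECT city, count(*) FROM people GROUP BY city" people.csv
```

The table name textql uses is derived from the file name, so the query above refers to `people`.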
I took the TRACK B: Advanced Apache Spark Workshop and I can say it was really great to learn more about Spark internals and its libraries. The Databricks team was awesome. All slides and training material are already online: Spark Summit 2014 Training.
Following the Spark Summit 2014 - Day 2
My personal notes during Spark Summit 2014.
Following the Spark Summit 2014 - Afternoon talks notes
Today I had to quickly find the most frequent hashtags in my smallish dataset. After some research I found an awesome shell tool for manipulating JSON: jq, a grep+sed+awk for JSON.
With jq everything else was simple: just pipe a few commands together:
$ cat tweets.json |
    jq '.text' |                 # select the text field on my JSON
    tr 'A-Z' 'a-z' |             # convert text to lower case
    egrep -o '#[0-9a-z_]+' |     # select the hashtags
    sort | uniq -c |             # count the number of distinct hashtags
    sort -nr | head -10          # reverse sort by frequency and get the top 10

An alternative is to read the hashtags jq already parsed from the tweet entities:

$ cat tweets.json |
    jq -r '.entities.hashtags[].text' |
    sort | uniq -c |
    sort -nr | head -10

A couple of minutes later, the output was:
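The counting stage of the pipeline above can be tried without jq or a real tweet dump: the sketch below feeds a few made-up tweet texts straight into the same tr/grep/sort/uniq chain, so it only needs coreutils and grep.

```shell
# Minimal self-contained sketch of the hashtag-counting stage.
# The three sample tweets are made up for illustration.
printf '%s\n' \
  'Loving #Spark at the summit! #BigData' \
  'More #spark internals #BIGDATA' \
  'Shell tricks with #jq' |
  tr 'A-Z' 'a-z' |            # normalize case so #Spark and #spark match
  grep -oE '#[0-9a-z_]+' |    # pull out just the hashtags, one per line
  sort | uniq -c |            # count each distinct hashtag
  sort -nr | head -10         # most frequent first, top 10
```

This prints each hashtag with its count, most frequent first: #spark and #bigdata twice each, #jq once. Lower-casing before counting is what makes #Spark and #spark collapse into one bucket.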
It is a collection of programs for processing delimited-text data through the command line or using shell scripts.