Twitter JSON Manipulation
Today I had to quickly find the most frequent hashtags in my smallish dataset. After some research I found an awesome shell tool for manipulating JSON: jq, a grep+sed+awk for JSON.
With jq, everything else was simple: just pipe a few commands together:
$ cat tweets.json | \
  jq -r '.entities.hashtags[].text' | \
  sort | uniq -c | sort -nr | head -10
That uses the hashtags Twitter already parsed into the entities field. To fold case and work from the raw tweet text instead:
$ cat tweets.json |
  jq -r '.text' |           # select the text field of each tweet
  tr 'A-Z' 'a-z' |          # convert text to lower case
  grep -Eo '#[0-9a-z_]+' |  # extract the hashtags
  sort | uniq -c |          # count occurrences of each hashtag
  sort -nr | head -10       # reverse sort by frequency and keep the top 10
A couple of minutes later, the output was:
487 #bigdata
131 #java
59 #analytics
34 #truoptik
33 #cloud
24 #jobs
16 #job
15 #healthcare
15 #hadoop
15 #followfriday
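If you want to see just the counting half of the pipeline in isolation, here is a self-contained sketch on two invented tweet texts (the file path and the tweets themselves are made up, and the jq extraction step is skipped since the input is already plain text):

```shell
# Tiny invented sample: one tweet text per line
cat > /tmp/tweet_texts.txt <<'EOF'
Loving #BigData and #Hadoop today
More #bigdata experiments
EOF

# Same counting pipeline as above, minus the jq step
tr 'A-Z' 'a-z' < /tmp/tweet_texts.txt |
  grep -Eo '#[0-9a-z_]+' |
  sort | uniq -c | sort -nr
# prints #bigdata with count 2, then #hadoop with count 1
```

The `tr` step is what makes #BigData and #bigdata collapse into one bucket before `uniq -c` does the counting.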
Of course there are more scalable approaches, but for a small dataset this works just fine without any setup.
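One jq detail that trips people up: by default jq prints strings as JSON, quotes and all, which would leave stray quote characters flowing through the pipeline; the `-r` (raw output) flag emits the bare text instead:

```shell
echo '{"text":"hello #Foo"}' | jq '.text'     # prints "hello #Foo" (with quotes)
echo '{"text":"hello #Foo"}' | jq -r '.text'  # prints hello #Foo
```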