Wednesday, March 4, 2015

The 10 commandments for hiring Data Scientists

As a Data Scientist (whatever that means), I get a lot of job offers over LinkedIn and other channels. Although I’m not actively looking for a job, I still go through them: one, because I’m curious to find out what exactly organizations look for in a Data Scientist, and two, to amuse myself. This post is about the latter part. It amuses me to no end what some people want in a Data Scientist, and I’ve made a consolidated list for all the recruiters and organizations who are looking to hire one (or more).

Warning: If satire is not your cup of tea (coffee/soda), you should most certainly not refrain from not reading this article.

Rule 1: Thou shalt not have a freaking clue what Data Science is all about.

But you should still throw in terms like Artificial Intelligence, deep learning, Neural Networks, SVM (admit it, you don’t even know what it stands for). You are not concerned with whether the applicant can apply his knowledge to solve the problem at hand; all you really want to know is whether he knows the difference between supervised and unsupervised learning.
In short, don’t worry about what the applicant can do, just worry about how much he can memorize and regurgitate. Oh! and have him describe the Apriori algorithm in the telephonic interview.

Rule 2: Thou shalt use a fishing net to grab as many as you can, and figure out what to do with them later.

Have like 10 or 15 openings for the same post. Be very vague about what exactly you expect these people to do. Better yet, have a complete lack of understanding of what your problems are and how you think Data Scientists can help you solve them. Just be sure to mention you have tons of data. Yeah, that’ll make them bite.

Rule 3: Thou shalt put ‘Data Scientist’ even if what you really want is a code monkey.

Well, why not? I mean, you can’t attract developers to work for you with ‘We need you to work 12 hours a day, 7 days a week, 365 days a year’. But hey, if you just change that position from Software Developer to Data-Scientist, lo and behold, the bees come flying to your honeypot. And if they can fix your crappy website CSS on the side while doing data science-y stuff, it’s a win-win.

Rule 4: Thou shalt never mention salary range or benefits.

It’s not like Data-Scientists are in hot demand or anything. They should be grateful you even put up a job post for them to see and apply to. And who do they think they are, demanding top-notch compensation for the effort and hard work they put into acquiring their skills? And if you really think about it, free laundry is all the benefits anyone needs anyway.

Rule 5: Thou shalt not care about the age v/s experience paradox.

We want a PhD with 10 years of work experience, who’s young and has the zeal and energy of someone just out of college, coz you know we need his skills to make more people click our in-your-face video pop-up ads. Plus we really can’t just say under-30 single male (that would get us sued), so we just go with young, energetic, likes to work in a startup environment, doesn’t mind staying up late in the office (hey, free pizza and sodas!).

Rule 6: Thou shalt extol the virtues of working in a startup in a way that would make a Bangladeshi garment factory owner blush.

  • Long never ending hours at work - CHECK
  • Jack of All trades job duties - CHECK
  • Low pay but promise of Stock Option - CHECK
  • No real usable health care coverage - CHECK
  • Foosball/Ping-Pong table - CHECK
  • Screwing over loyal employees by selling and cashing out - PRICELESS

Rule 7: Thine post shalt be scattered with worthless terms like web-scale, big data, well-funded startup.

As if terms like ‘leverage’ and ‘synergy’ were not enough, our dear data scientist must know how to work with ‘BIG DATA’. The more meaningless and worthless terms in our job posting, the better; it will allow us to hire the crème de la crème of analytics talent. Also throw in the fact that all the founders have PhDs in bio-informatics or AI or machine learning etc., coz you know that is so critical to have when it comes to effective leadership.
Also, when was the last time someone advertised themselves as a ‘piss-poorly funded startup’?

Rule 8: Thine Data-Scientists must know every programming language under the sun (even the ones not yet invented).

Coz you know programming is where it’s at. If you can’t code you can’t do jack. And what do you mean you only know R or Python? All the cool kids are using Ruby or Node.js. And don’t tell us you can’t write enterprise applications using Java/J2EE/EJBs. Oh! and please do explain in great detail how a Hashtable works. No job interview is complete without it. In short, if you see an IP address, the first thing that should cross your mind is a Visual Basic GUI.

Rule 9: Hadoop Hadoop Hadoop, wait I’m forgetting something, Ah! yes Hadoop.

GEORGE: “Why don’t they have Hadoop in the mix?”

JERRY: “What do you need Hadoop for?”

GEORGE: “Hadoop is now the number one data crunching framework in America.”

JERRY: “You know why? Because people like to say ‘Hadoop.’ ‘Excuse me, do you have any Hadoop?’ ‘We need more Hadoop.’ ‘Where is Hadoop? No Hadoop?’”

Rule 10: Thine Data-Scientist should be able to code, design web-apps, be an Agile Scrum master, Software Architect, Project Manager, Product Manager, Sales/Marketing Guru. Did I mention a unicorn?

OK, no satire on this one. Just straight up practical advice. Data scientists are not all-knowing superhuman beings. Figure out what it is that your organization wants to do with data and hire well-trained and decently experienced people who can solve your challenges. If looking for novice employees, make sure the job has enough breathing room for them to grow into gradually. And lastly, don’t go looking for unicorns because a) they don’t exist, and b) the best thing they can do is make 6-to-8-year-old girls scream in high pitch.


Footnote: The difference between the percentage of animals harmed in writing this post and the percentage of animals that would have been harmed had I not written this post is not statistically significant.

Sunday, February 15, 2015

Visualizing India v/s Pakistan One Day International Results

This is my small effort to pick up the streamgraph support in R developed by Bob Rudis (described here).

What you see is a per-year aggregation of the results of all India v/s Pakistan One Day Internationals. I pulled the records from Wikipedia and used rvest by Hadley Wickham for extracting the results. After that, a little data munging using dplyr and lubridate and voilà.



You can see an interactive version of the chart above over at rpubs.

Blues are India and greens are Pakistan, in accordance with their team colors. India had an abysmal record against Pakistan right up until the mid-90s, but it has picked up quite a bit after that. And of course India has won all 6 of its Cricket World Cup matches against Pakistan.

As of today the tally stands at: India 51 wins and Pakistan 72 wins. Below is a detailed breakdown.


Running a chi-square test for dependency between the result and the venue didn't find any association between the two, which in layman's terms means the results have been unrelated to the venue.

For the nerds (Oh sorry Data Scientists), the code is shown below.


setwd("~/code/R/workspaces/cricket")
library(stringr)
library(rvest)
library(lubridate)
library(dplyr)
library(streamgraph)

# Wikipedia is our best go to source
indvspak <- html('https://en.wikipedia.org/wiki/List_of_ODI_cricket_matches_played_between_India_and_Pakistan')
# Summary table
results.summary <- indvspak %>% html_node('.wikitable') %>% html_table()

# Any dependency between venue and result ?
chisq.test(results.summary[2:3,3:5])

# The XPATH expression below was obtained using Chrome's Element Inspector.
results <-  indvspak %>%
  html_node(xpath='//*[@id="mw-content-text"]/table[4]') %>% html_table()

# Sensible headers
colnames(results) <- c('MatchNum','Date','Winner','WonBy','Venue','MoM')

# Fix Date
results$Date <- ymd(str_replace(results$Date,'^0([0-9]{4}-[0-9]{2}-[0-9]{2}).*$','\\1'))
# Extract just the year in a new field
results$year <- year(results$Date)

# So that we get our colors as per team colors
results$Winner <- factor(results$Winner,levels=c('India','Pakistan','No result'),ordered=T)

results %>% select(year,Winner) %>%
  group_by(year,Winner) %>% tally() %>%
  streamgraph("Winner", "n", "year", offset="zero", interpolate="linear") %>%
  sg_legend(show=TRUE,
            label="Ind v/s Pak One Day International Results : Over the years") %>%
  sg_axis_x(1, "year", "%Y") %>%
  sg_colors("GnBu")

Monday, January 5, 2015

How to use Twitter’s Search REST API most effectively.

This blog post will discuss various techniques to use Twitter’s search REST API most effectively, given the constraints and limits of the said API. I’ll be using Python for demonstration, but any client library that supports the Twitter REST API will do.

Introduction

Twitter provides the REST search API for searching tweets from Twitter’s search index. This is different from using the streaming filter API, in that the latter is real-time and starts giving you results from the moment you start the query, while the former is retrospective and will give you results from the past, going as far back as the search index allows (usually the last 7 days). While the streaming API seems like the thing to use when you want to track a certain query in real time, there are situations where you may want to use the regular REST search API. You may also want to combine the two approaches, i.e. start two searches, one using the streaming filter API to go forward in time and one using the REST search API to go backwards in time, in order to get some ongoing and past context for your search term.
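
Here is a minimal sketch of the forward-looking half, assuming Tweepy’s 3.x streaming classes (tweepy.StreamListener and tweepy.Stream) and a user-context auth object named auth, since the streaming endpoints do not accept app-only auth; the class name and tracked hashtag are placeholders of my own. The backward-looking half is the REST search loop covered later in this post.

import tweepy

class ForwardListener(tweepy.StreamListener):
    # Collects matching tweets in real time, moving forward from "now".
    def on_status(self, status):
        # You could write each tweet to a file here, just like the
        # search loop later in this post does.
        print(status.id, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. when rate limited.
        return False

# 'auth' is assumed to be a tweepy.OAuthHandler carrying user credentials.
stream = tweepy.Stream(auth, ForwardListener())
# This call blocks and tracks the term going forward in time; run it in a
# separate process alongside the backward REST search.
stream.filter(track=['#someHashtag'])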

Either way, if the REST Search API is something you want to use, then there are a few limitations you need to be aware of and some techniques you can use to maximize the resources the API gives you. This post will explore approaches to use the REST search API optimally in order to find as much information as fast as possible and yet remain within the constraints of the API. To start with, the API Rate Limit page details the limits of various Twitter APIs, and as per that page the limit for the Search API is 180 requests per 15-minute window for per-user authentication. Now here’s the kicker: most code samples on the internet for the search API use the Access Token auth method, which is limited to the aforementioned 180 requests per 15 mins, and per request you can ask for a maximum of 100 tweets, giving you a grand total of 18,000 tweets per 15-minute window. If you download 18K tweets before the 15 minutes are up, you won’t be able to get any more results until your window expires and you search again. You also need to be aware of the following limitations of the search API.

Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.

and

Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead.

What this means is that, using the search API, you are not going to get all the tweets that match your search criteria, even if they exist within your desired timeframe. This is an important point to keep in mind when drawing conclusions about the size of the dataset obtained using the search REST API.

The problem

So given this background information, can we do something about the following points?

  • Could we query at a rate faster than 18K tweets/15 mins?
  • Could we maintain a search context across our API rate limit window, so as to avoid getting duplicate results when searching repeatedly over a long period of time?
  • Could we do something about the fact that not all tweets matching the search criteria will be returned by the API?

And the answer to all three questions is YES. There wouldn’t be a point to this blog post if the answers were no, would there?

The Solution

I’ll be using Python and the excellent Tweepy library for this purpose, but any library in any programming language that supports Twitter’s REST APIs will do.

To start with our first question about being able to search at a rate greater than 18K tweets/15 mins: the solution is to use Application-only Auth instead of the Access Token Auth. Application-only auth has higher limits, precisely up to 450 requests per 15-minute window, again with a maximum of 100 tweets per request. This gives a rate of 45,000 tweets/15-min, which is 2.5 times the Access Token limit.

The code sample below shows how to use App Only Auth using the Tweepy API.

import sys
import tweepy

# Replace the API_KEY and API_SECRET with your application's key and secret.
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
 
api = tweepy.API(auth, wait_on_rate_limit=True,
                   wait_on_rate_limit_notify=True)
 
if (not api):
    print ("Can't Authenticate")
    sys.exit(-1)
 
# Continue with rest of code

The secret is the AppAuthHandler instead of the more common OAuthHandler which you find being used in lots of code samples. This sets up App-only Auth and gives you the higher limits. Also, as an added bonus, notice the wait_on_rate_limit & wait_on_rate_limit_notify flags set to True. These make the Tweepy API call automatically wait (sleep) when it hits the rate limit and continue upon expiry of the window. This saves you from having to program that part manually, which, as you’ll shortly see, keeps your program much simpler and more elegant.

Next we tackle the second question about maintaining a search context when querying repeatedly over a long time frame. REST APIs by their very nature are stateless, i.e. there is no implicit context maintained by the server between successive calls to the same API which could tell it what results have already been sent to the client. So we need a way for the client to tell the API server where it is in a search result set, so that the server can send the next batch of results (this is called pagination). The search REST API allows this by accepting two input parameters, viz. max_id & since_id, which serve as the upper and lower bounds on the unique IDs that Twitter assigns each tweet. By manipulating these two inputs during successive calls to the search API you can paginate your results. Below is a code sample that does just that.

import sys
import jsonpickle
import os

searchQuery = '#someHashtag'  # this is what we're searching for
maxTweets = 10000000 # Some arbitrary large number
tweetsPerQry = 100  # this is the max the API permits
fName = 'tweets.txt' # We'll store the tweets in a text file.


# If results from a specific ID onwards are required, set since_id to that ID.
# Otherwise default to no lower limit and go as far back as the API allows.
sinceId = None

# If results only below a specific ID are required, set max_id to that ID.
# Otherwise default to no upper limit and start from the most recent tweet matching the search query.
max_id = -1

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break

print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))

The above code will write all the downloaded tweets to a text file, each line representing a tweet encoded in JSON format. The tweets in the file are in reverse order of creation timestamp, i.e. going from most recent to oldest. There’s probably some room for beautifying the above code, but it works and can download literally millions of tweets at the optimal rate of 45K tweets/15-mins. Just run the code in a background process and it will go back as far as the search API allows, until it has exhausted all the results. What’s more, using the initial values for max_id and/or since_id you can fetch results up to and from arbitrary IDs. This is really helpful if you want to run the program repeatedly to fetch newer results since the last run: just look up the max ID (the ID of the first line) from the previous run and set that as since_id for the next run. If you have to stop your program before exhausting all the possible results and rerun it to fetch the remaining ones, you can look up the min ID (the ID of the last line) and pass that as max_id for the next run to start from that ID and below.
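
For example, a small helper along these lines (the function name boundary_ids is my own, not from the original post) can pull those boundary IDs back out of the saved file, assuming each line is the plain JSON written by the loop above:

import json

def boundary_ids(fname='tweets.txt'):
    # The file is ordered newest-to-oldest, so the first line holds the
    # largest tweet ID and the last line holds the smallest.
    with open(fname) as f:
        lines = f.read().splitlines()
    newest = json.loads(lines[0])['id']
    oldest = json.loads(lines[-1])['id']
    return newest, oldest

newest_id, oldest_id = boundary_ids()
# To fetch only tweets newer than the previous run:  sinceId = newest_id
# To resume an interrupted run further back in time: max_id = oldest_id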

Now we look at our third question: given that the search results will not contain all possible matching tweets, can we do something about it? The answer is yes, but it gets a bit tricky. The idea is this: among the tweets you have fetched there will be quite a lot of retweets, and chances are that some of the original tweets behind those retweets are not in the results you downloaded. But each retweet also embeds the entire original tweet object in its JSON representation. So if we pick out these original tweets from the retweets, we can augment our results by including the missing originals in the result set. We can easily do this because each tweet is assigned a unique ID, allowing us to use set operations to pick out only the missing tweets.

This approach is not as complicated as it sounds and can be easily accomplished in any programming language. I have working code written in R (not shown here); I leave it as an exercise for the reader to implement it in Python or whichever language they prefer (see the pseudo-code in the technical notes below). From my tests with various search queries, I get anywhere from 2% to 10% more tweets this way, so it’s a worthwhile exercise, and it completes your dataset in the sense that you have the original tweet of every retweet found in your dataset.

Conclusion

I highlighted some of the limitations of Twitter’s search REST API and how you can best use it within the fullest allowed rate limit. I also explained approaches to paginate results, as well as to extend the result set by another 2% to 10% by extracting missing original tweets from the retweets. Using these approaches you should be able to download a whole lot more tweets at a much faster rate.

Technical Notes:

  • Tweepy also has an api.Cursor method which could possibly replace the whole while loop in the second code sample, but it seems the Cursor API suffers from a memory leak and will eventually crash your program. Hence my approach is based on a modification of this answer on Stack Overflow.
  • For extracting the missing original tweets from the retweets, think of the following pseudo-code (a Python sketch follows this list).
    • Store all downloaded tweets in a set (say set A)
    • From this set filter out the retweets & extract the original tweet from these retweets (say set B)
    • Insert in set A all unique tweets from set B that are not already in set A
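
A minimal Python sketch of that pseudo-code, assuming the JSON-lines file produced by the download loop (the function name and the use of dictionaries keyed by tweet ID are my own choices, not from the original R code):

import json

def augment_with_originals(fname='tweets.txt'):
    # Set A: all downloaded tweets, keyed by their unique tweet ID
    tweets = {}
    with open(fname) as f:
        for line in f:
            t = json.loads(line)
            tweets[t['id']] = t

    # Set B: original tweets embedded inside the retweets we downloaded;
    # a retweet carries the full original under 'retweeted_status'
    originals = {}
    for t in tweets.values():
        if 'retweeted_status' in t:
            orig = t['retweeted_status']
            originals[orig['id']] = orig

    # Insert into set A every original from set B that is not already there
    missing = {oid: o for oid, o in originals.items() if oid not in tweets}
    tweets.update(missing)
    print("Added {0} missing original tweets".format(len(missing)))
    return tweets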