Sunday, February 15, 2015

Visualizing India v/s Pakistan One Day International Results

This is my small effort to pick up the streamgraph support in R developed by Bob Rudis (described here).

What you see is a per-year aggregation of the results of all India v/s Pakistan One Day Internationals. I pulled the records from Wikipedia and used rvest by Hadley Wickham to extract the results. After that, a little data munging with dplyr and lubridate and voilà.



You can see an interactive version of the chart above over at rpubs.

Blues are India and greens are Pakistan, in accordance with their team colors. India had an abysmal record against Pakistan right up until the mid-90s, but it has picked up quite a bit since then. And of course India has won all 6 of its Cricket World Cup matches against Pakistan.

As of today the tally stands at: India 51 wins and Pakistan 72 wins. Below is a detailed breakdown.


Running a chi-square test for dependence between the result and the venue didn't find any association between the two, which in layman's terms means the results have been unrelated to the venue.

For the nerds (oh, sorry, Data Scientists), the code is shown below.


setwd("~/code/R/workspaces/cricket")
library(stringr)
library(rvest)
library(lubridate)
library(dplyr)
library(streamgraph)

# Wikipedia is our best go-to source
indvspak <- html('https://en.wikipedia.org/wiki/List_of_ODI_cricket_matches_played_between_India_and_Pakistan')
# Summary table
results.summary <- indvspak %>% html_node('.wikitable') %>% html_table()

# Any dependency between venue and result?
chisq.test(results.summary[2:3,3:5])

# The XPATH expression below was obtained using Chrome's Element Inspector.
results <-  indvspak %>%
  html_node(xpath='//*[@id="mw-content-text"]/table[4]') %>% html_table()

# Sensible headers
colnames(results) <- c('MatchNum','Date','Winner','WonBy','Venue','MoM')

# Fix Date: extract the yyyy-mm-dd portion from the raw cell and parse it
results$Date <- ymd(str_replace(results$Date,'^0([0-9]{4}-[0-9]{2}-[0-9]{2}).*$','\\1'))
# Extract just the year in a new field
results$year <- year(results$Date)

# So that we get our colors as per team colors
results$Winner <- factor(results$Winner,levels=c('India','Pakistan','No result'),ordered=T)

results %>% select(year,Winner) %>%
  group_by(year,Winner) %>% tally() %>%
  streamgraph("Winner", "n", "year", offset="zero", interpolate="linear") %>%
  sg_legend(show=TRUE,
            label="Ind v/s Pak One Day International Results : Over the years") %>%
  sg_axis_x(1, "year", "%Y") %>%
  sg_colors("GnBu")

Monday, January 5, 2015

How to use Twitter’s Search REST API most effectively.

This blog post will discuss various techniques to use Twitter’s search REST API most effectively, given the constraints and limits of the said API. I’ll be using Python for demonstration, but any library that supports the Twitter REST API will do.

Introduction

Twitter provides the REST search API for searching tweets from Twitter’s search index. This is different from using the streaming filter API, in that the latter is real-time and starts giving you results from the point of query, while the former is retrospective and gives you results from the past, as far back as the search index goes (usually the last 7 days). While the streaming API seems like the thing to use when you want to track a certain query in real time, there are situations where you may want to use the regular REST search API. You may also want to combine the two approaches, i.e. start two searches, one using the streaming filter API to go forward in time and one using the REST search API to go backwards in time, in order to get both ongoing and past context for your search term.

Either way, if the REST search API is something you want to use, there are a few limitations you need to be aware of and some techniques you can use to maximize what the API gives you. This post will explore approaches to use the REST search API optimally, in order to find as much information as fast as possible while remaining within the constraints of the API. To start with, the API Rate Limit page details the limits of the various Twitter APIs, and as per that page the limit for the search API is 180 requests per 15-minute window for per-user authentication. Now here’s the kicker: most code samples on the internet for the search API use the Access Token auth method, which is limited to the aforementioned 180 requests per 15 minutes, and per request you can ask for a maximum of 100 tweets, giving you a grand total of 18,000 tweets per 15 minutes. If you download 18K tweets before the 15 minutes are up, you won’t be able to get any more results until your 15-minute window expires and you search again. You also need to be aware of the following limitations of the search API.

Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.

and

Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead.

What this means is that, using the search API, you are not going to get all the tweets that match your search criteria, even if they exist within your desired timeframe. This is an important point to keep in mind when drawing conclusions about the size of the dataset obtained from the search REST API.

The problem

So given this background information, can we do something about the following points?

  • Could we query at a rate faster than 18K tweets/15 minutes?
  • Could we maintain a search context across our API rate limit window, so as to avoid getting duplicate results when searching repeatedly over a long period of time?
  • Could we do something about the fact that not all tweets matching the search criteria will be returned by the API?

And the answer to all three questions is YES. There wouldn’t be a point to this blog post if the answers were no, would there?

The Solution

I’ll be using Python and the excellent Tweepy library for this purpose, but any library in any programming language that supports Twitter’s REST APIs will do.

Let’s start with our first question, about being able to search at a rate greater than 18K tweets per 15 minutes. The solution is to use Application-only auth instead of Access Token auth. Application-only auth has higher limits, precisely 450 requests per 15-minute window, and again with a maximum of 100 tweets per request this gives a rate of 45,000 tweets per 15 minutes, which is 2.5 times the Access Token limit.

The code sample below shows how to use App Only Auth using the Tweepy API.

import sys

import tweepy

# Replace the API_KEY and API_SECRET with your application's key and secret.
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

if not api:
    print("Can't Authenticate")
    sys.exit(-1)

# Continue with rest of code

The secret is AppAuthHandler instead of the more common OAuthHandler you find in lots of code samples. This sets up App-only auth and gives you the higher limits. Also, as an added bonus, notice the wait_on_rate_limit and wait_on_rate_limit_notify flags set to True. These make the Tweepy API calls automatically wait (sleep) when they hit the rate limit and continue once the window expires. This saves you from having to program that part manually, which, as you’ll shortly see, keeps the program much simpler and more elegant.
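As a quick sanity check you can ask Twitter itself which window you're in. The snippet below is a rough sketch that assumes the api object created above and the layout of Twitter's application/rate_limit_status response; the exact numbers depend on the auth method used.

# Inspect the current rate-limit window for the search endpoint.
limits = api.rate_limit_status()
search_window = limits['resources']['search']['/search/tweets']
print("limit: {0}, remaining: {1}".format(search_window['limit'],
                                          search_window['remaining']))
# Under app-only auth 'limit' should report 450; under user auth it reports 180.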

Next we tackle the second question, about maintaining a search context when querying repeatedly over a long time frame. REST APIs are by their very nature stateless, i.e. the server maintains no implicit context between successive calls to the same API that tells it which results have already been sent to the client. So what we need is a way for the client to tell the API server where it is in the search results, so that the server can send the next set of results (this is called pagination). The search REST API allows this by accepting two input parameters, max_id and since_id, which serve as the upper and lower bounds on the unique IDs that Twitter assigns to each tweet. By manipulating these two inputs during successive calls to the search API, you can paginate your results. Below is a code sample that does just that.

import sys
import jsonpickle
import os

searchQuery = '#someHashtag'  # this is what we're searching for
maxTweets = 10000000 # Some arbitrary large number
tweetsPerQry = 100  # this is the max the API permits
fName = 'tweets.txt' # We'll store the tweets in a text file.


# If results from a specific ID onwards are required, set since_id to that ID.
# Otherwise default to no lower limit and go back as far as the API allows.
sinceId = None

# If only results below a specific ID are required, set max_id to that ID.
# Otherwise default to no upper limit and start from the most recent tweet matching the search query.
max_id = -1

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break

print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))

The above code will write all the downloaded tweets to a text file, with each line representing one tweet encoded in JSON format. The tweets in the file are in reverse order of their creation timestamp, i.e. from most recent to oldest. There’s probably some room for beautifying the above code, but it works and can download literally millions of tweets at the optimal rate of 45K tweets per 15 minutes. Just run the code in a background process and it will go back as far as the search API allows, until it has exhausted all the results. What’s more, using the initial values for max_id and/or since_id you can fetch results up to and from arbitrary IDs. This is really helpful if you want to run the program repeatedly to fetch newer results since the last run: just look up the max ID (the ID of the first line) from the previous run and set that as since_id for the next run. If you have to stop your program before exhausting all possible results and want to rerun it to fetch the rest, look up the min ID (the ID of the last line) and pass that as max_id for the next run to start from that ID and work downwards.
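As a rough sketch of that bookkeeping (assuming the tweets.txt file produced above, ordered newest to oldest), you can pull both boundary IDs straight out of the file:

import json

# Read the saved tweets: one JSON object per line, newest first.
with open('tweets.txt') as f:
    lines = f.read().splitlines()

newest_id = json.loads(lines[0])['id']   # pass as since_id next time to fetch only newer tweets
oldest_id = json.loads(lines[-1])['id']  # pass as max_id to resume an interrupted backwards crawl

print("since_id for the next run: {0}".format(newest_id))
print("max_id to resume from: {0}".format(oldest_id))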

Now we look at our third question: given that the search results will not contain all possible matching tweets, can we do something about it? The answer is yes, but it gets a bit tricky. The idea is this: among the tweets you have fetched there will be quite a lot of retweets, and chances are that some of the original tweets behind those retweets are not in the downloaded results. But each retweet also embeds the entire original tweet object in its JSON representation. So if we pick out these original tweets from the retweets, we can augment our results by adding the missing originals to the result set. We can easily do this because each tweet has a unique ID, which lets us use set operations to pick out only the missing tweets.

This approach is not as complicated as it sounds and can easily be accomplished in any programming language. I have working code written in R (not shown here); I leave it as an exercise to the reader to implement it in Python or whichever language they prefer (a rough sketch is given below). From my tests with various search queries I get anywhere from 2% to 10% more tweets this way, so it’s a worthwhile exercise, and it completes your dataset in the sense that you have the original tweet of every retweet found in your dataset.
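For reference, here is a rough sketch of that idea in Python, working off the tweets.txt file produced earlier. It assumes the id and retweeted_status fields of Twitter's tweet JSON and is meant to illustrate the set logic, not reproduce the R code.

import json

# Load the downloaded tweets (one JSON object per line), keyed by tweet ID -- "set A".
with open('tweets.txt') as f:
    tweets = {t['id']: t for t in (json.loads(line) for line in f)}

# A retweet embeds the full original tweet under 'retweeted_status' -- collect those as "set B".
originals = {t['retweeted_status']['id']: t['retweeted_status']
             for t in tweets.values() if 'retweeted_status' in t}

# Keep only the originals we don't already have and add them to the dataset.
missing = {tid: tw for tid, tw in originals.items() if tid not in tweets}
tweets.update(missing)

print("Recovered {0} original tweets missing from the search results".format(len(missing)))

Since tweets are keyed by their unique IDs, the dictionary does the de-duplication for us.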

Conclusion

I highlighted some of the limitations of Twitter’s search REST API and how you can use it to the fullest allowed rate limit. I also explained how to paginate results, as well as how to extend the result set by another 2% to 10% by extracting missing original tweets from the retweets. Using these approaches you should be able to download a whole lot more tweets at a much faster rate.

Technical Notes:

  • Tweepy also has an api.Cursor method which could possibly replace the whole while loop in the second code sample, but it seems the Cursor API suffers from a memory leak and will eventually crash your program. Hence my approach is based on a modification of this answer on Stack Overflow.
  • For extracting the missing original tweets from retweets, think of the following pseudo-code (this is what the sketch above implements).
    • Store all downloaded tweets in a set (say set A)
    • From this set filter out the retweets & extract the original tweet from these retweets (say set B)
    • Insert in set A all unique tweets from set B that are not already in set A

Saturday, December 13, 2014

Slides from my talk at Elasticsearch DC Meetup Dec 11 '14.

On Dec 11th, 2014, I presented a talk on 'Scaling Elasticsearch for Production'; the slides are below.

The video is also available at the Elasticsearch site.