Tuesday, May 5, 2015

Book Review : Data Driven Security


I work with the two authors of this book; in fact, one of them is my manager. But a) I don’t like to suck up to my colleagues and b) I’m sure they don’t like being sucked up to either. If you still think my review will be biased, stop reading now and go watch some cat videos.

Data Driven Security is a first-of-its-kind book that aims to achieve the impossible: to integrate all 3 dimensions of ‘Data Science’, namely a) Math and Statistical Knowledge, b) Coding/Hacking Skills, and c) Domain Knowledge, the domain in this case being Information Security. If these 3 dimensions are unknown to you, look at the figure on the right.
Traditionally, books on data science have tackled only one dimension at a time, or at best two. This book is unique in that it tackles all 3. That is worth mentioning, especially when you consider that concepts like statistical and machine learning are not part of traditional InfoSec tools. Traditional InfoSec tools are based on signature matching, i.e. determining whether a threat matches something in a set of already known badness such as a virus, malware, network activity, IP address or domain name. This approach is always playing catch-up, and the good guys are always one step (in fact several steps) behind the bad guys. This is where data driven security comes in. The idea is to use data analysis techniques for security research and to build the next generation of InfoSec tools that can spot badness before it is known. A fascinating field, trust me.
The challenge of writing on such a subject cannot be overstated. The book needs to be approachable by readers coming in from any of the three dimensions, and each of the three is so vast that you can find hundreds of books dedicated to just one of them. So have the authors been successful in this endeavor ? Read on to find out…

At a glance

The book (ISBN: 978-1-118-79372-5) was published by Wiley Publications in February 2014. Wiley has been publishing some really interesting titles over the past few years in the Data Analysis and Statistics domain. The book is absolutely gorgeous from cover to cover. The page quality is very high and a lot of effort has gone into making the code and figures look stunning. It is one of the most visually pleasing books I have in my collection (alongside anything by Stephen Few and Edward R. Tufte). All the code presented is properly commented, something I seldom find in technical books. There is a ton of code in this book, which is expected as coding is one of the skills in data science. The authors have done a great job presenting code in Python and R. Both Python and R offer libraries and interactive environments for data analysis, and by presenting the code in 2 languages the authors have made the book accessible to a wide audience.
In addition to the book, the authors maintain a companion website (datadrivensecurity.info), a blog and a podcast where they discuss all things InfoSec and data science.


Chapter 1: The Journey to Data Driven Security

Chapter 1 starts with a brief history of data analysis, from the classical statistical analysis techniques of the nineteenth and twentieth centuries to the modern algorithmic approaches of the twenty-first century. There are enough anecdotes to convince you of the importance of data analysis in case you are still wondering why you should analyze data in the first place. The chapter then explores the skill sets required for data analysis: Domain Expertise, Data Management, Programming, Statistics and finally Visualization. Each topic is given its due credit and you’ll learn how each of these pieces fits into the mosaic. The chapter rounds off with a very important section, ‘Centering on a Question’. Your data analysis can easily lead you in many directions if you don’t have a proper research question framed (yours truly is guilty of this many times over). In short, you learn the history of data analysis, the skill sets required for it and the importance of framing the right question(s).

Chapter 2: Building Your Analytics Toolbox: A Primer on Using R and Python for Security Analysis

Chapter 2 is all about setting up Python and R development environments for your data analysis. The authors start by explaining their reasons for using Python and R, and more importantly why both (avoiding the situation of having a hammer as your only tool). Next you learn how to set up Python using the Canopy distribution and R using RStudio. You also get some sample code to test your respective setups. The chapter then introduces you to the concept of a data frame, a tabular data structure often used in data analysis. You get a taste of both R’s native data.frame and Python’s DataFrame from the pandas package. Lastly you’ll see how to organize your code for a typical analysis project. I recommend you don’t skip this chapter even if you’re familiar with Python, R or both.

Chapter 3: Learning the “Hello World” of Security Data Analysis

Chapter 3 is where things get real. You start by importing AlienVault’s IP Reputation database into your Python and R environments. You then get a feel for the data by inspecting the various fields, their data types and the appropriate statistical summaries of them. Next you perform some basic charting using R’s ggplot2 and Python’s Matplotlib. Even if you are familiar with the basics of exploratory data analysis (EDA), I suggest you don’t skip the ‘Homing In on a Question’ section; this is where you’ll learn how to use EDA to answer specific questions about your data. There are quite a few examples here, in both Python and R, that use bar charts and heatmaps to explore relationships between various fields and derive answers from these relationships.

Chapter 4: Performing Exploratory Security Data Analysis

Chapter 4 dives into Exploratory Data Analysis (EDA) for InfoSec research. You get to know the details of IPv4 addresses, how they can be grouped together using Autonomous System Numbers (ASNs) and why that is useful. You also learn how to use a GeoIP service like MaxMind’s free GeoIP API to tie an IP address to coordinates on a map. Once you have the geo coordinates you can use charting APIs to plot the IPs on a world map. After that you see how to augment IP addresses with other useful attributes such as the IANA block information for that IP.
Next the chapter dives into the basics of correlation analysis and lays down some of its core concepts. Lastly you get to build some graph data structures and visualize the graph nature of the relationships between IP addresses in the ZeuS botnet. All these techniques are foundations of initial exploratory analysis, and although this chapter covers quite a few diverse but related concepts from both statistics and information security, it does a good job of tying it all together. This will be the foundation for the analysis in later chapters.
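To make the IPv4 handling described for Chapter 4 a little more concrete, here is a minimal sketch in Python using only the standard library. It is not the authors' code: the lookup table, the labels and the helper names ip_to_int and label_for are hypothetical, and the address blocks and AS numbers are reserved documentation values rather than real routing data.

```python
import ipaddress

# Hypothetical lookup table: network block -> label.
# The blocks are reserved documentation ranges (RFC 5737) and the AS numbers
# come from the range reserved for documentation (RFC 5398); none of this is
# real routing data.
BLOCK_LABELS = {
    ipaddress.ip_network("192.0.2.0/24"): "AS64496 (example)",
    ipaddress.ip_network("198.51.100.0/24"): "AS64497 (example)",
}


def ip_to_int(ip: str) -> int:
    """Return the 32-bit integer behind a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(ip))


def label_for(ip: str) -> str:
    """Return the label of the first block containing ip, or 'unknown'."""
    addr = ipaddress.IPv4Address(ip)
    for network, label in BLOCK_LABELS.items():
        if addr in network:
            return label
    return "unknown"


if __name__ == "__main__":
    for ip in ("192.0.2.1", "198.51.100.77", "203.0.113.9"):
        print(ip, ip_to_int(ip), label_for(ip))
```

Treating an address as an integer is what makes range checks and grouping cheap; a real analysis would swap the toy table for an actual ASN or GeoIP data source.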

Chapter 5: From Maps to Regression

Chapter 5 starts with the basic concepts of plotting geographical maps. It walks you through plotting with latitude and longitude data, plotting per-country stats using choropleth plots, and zooming in on a specific country (USA in this example). Plotting some numbers on a geographical map is pointless unless it enables you to derive some information or insight from that plot. The chapter looks at a potentially interesting data point and then uses box plots to see if the data point is indeed an outlier. Finally the mapping part concludes by showing you how to aggregate the data at county level.
The last part is a quick introduction to regression analysis (there are multitudes of books written on just this one subject). You learn how to build regression models and perform analysis based on model parameters. You also see some caveats you need to keep in mind when interpreting regression models. Finally you see how to apply regression analysis to check whether reported alien sightings have any impact on the infection rate of the ZeroAccess rootkit. Yes, you read that right, and the authors are not fools: they chose these 2 variables to prove a point about multicollinearity, a common problem in regression analysis.
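For readers who have not met multicollinearity before, the following is a minimal sketch in Python (standard library only) of the kind of trap the alien-sightings example points at. The numbers are entirely made up for illustration and are not taken from the book; the helper name pearson is my own. The assumption baked into the toy data is that both sightings and infections simply scale with population.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-region data (illustrative only, not from the book).
population = [1, 2, 3, 4, 5, 6, 7, 8]          # millions of people
sightings = [10, 24, 27, 44, 50, 72, 63, 88]   # reported sightings
infections = [50, 100, 150, 200, 260, 312, 364, 416]

# Raw counts look strongly related...
r_raw = pearson(sightings, infections)

# ...but the would-be predictor is itself almost a copy of population.
r_predictors = pearson(sightings, population)

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2).
vif = 1 / (1 - r_predictors ** 2)

# Per-capita rates remove the shared dependence on population.
sightings_rate = [s / p for s, p in zip(sightings, population)]
infections_rate = [i / p for i, p in zip(infections, population)]
r_rate = pearson(sightings_rate, infections_rate)

if __name__ == "__main__":
    print("raw correlation:", round(r_raw, 3))
    print("sightings vs population:", round(r_predictors, 3))
    print("VIF:", round(vif, 1))
    print("per-capita correlation:", round(r_rate, 3))
```

When two predictors are this tightly correlated, a regression that includes both cannot reliably separate their effects, which is why a variance inflation factor well above 10 is often treated as a warning sign.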

Chapter 6: Visualizing Security Data

Chapter 6 is all about a picture speaking louder than a thousand words. Effective visualization is a key foundation of data analysis. The chapter starts by explaining the need for visualization and takes a semi-deep dive into visual perception and why it matters when building effective visualizations. These are topics that deserve their own books, let alone chapters, yet the authors manage to convey the gist of it all in the first few sections of the chapter.
The chapter then moves on to specific examples of visualization like bar charts, bubble charts, treemaps, and distribution visualization using histograms and density plots. You also get a taste of visualizing time series data. Lastly you get to build a movie from your data. (I kid you not.)

Chapter 7: Learning from Security Breaches

Chapter 7 is devoted to the art of examining and analyzing security breaches. The authors introduce you to the VERIS framework, developed by one of the authors for capturing information related to data breaches, which is used in Verizon’s annual Data Breach Investigations Report (DBIR). Before examining the details of the VERIS framework, the authors explain why it is necessary to analyze data breaches, what sort of research questions can be answered and what to consider when designing a data collection framework for this purpose.
Next the authors introduce the VERIS framework, its various sections, and the enumerations used in them. You learn how VERIS tracks assets, actors, threats and actions, and how they affect the Confidentiality, Integrity & Availability (CIA triad) of the breached data. You also learn how to code up discovery/response and the subsequent impact of the data breach on the victim organization.
Next you get to play with real-life breach data captured in the VERIS Community Database (VCDB). VCDB is a project that captures publicly disclosed data breaches and encodes them in the VERIS format. The VERIS format is a JSON specification, and you see code examples of basic uni-variate and bi-variate analysis using bar charts and heatmaps.

Chapter 8: Breaking Up with Your Relational Database

RDBMS, NoSQL and everything in between: that’s what Chapter 8 is all about. With a quick primer on SQL/RDBMS you get your feet wet with MariaDB (a MySQL fork). You learn how to create a small schema for storing InfoSec entities, and you see the difference in speed between a disk backed v/s a memory backed storage engine. From RDBMS we move to NoSQL (Not Only SQL, and not No SQL). The authors first explore BerkeleyDB, a very popular key-value datastore. You have sample code in both R and Python for interacting with BerkeleyDB. Next the chapter deals with Redis, a very popular data-structure datastore. You learn about the various data structures supported by Redis and a couple of its advanced features. The authors also tackle Hadoop & MapReduce for processing security data at scale, touch on MongoDB, and make passing references to Elasticsearch and Neo4j. Overall the chapter deals with some very popular RDBMSs and NoSQL databases, and provides you with code samples to interact with them in Python and R.

Chapter 9: Demystifying Machine Learning

Chapter 9 is all about Machine Learning in the InfoSec domain. Now let’s get this straight: ML is a vast and sprawling topic. There are entire books devoted to just certain aspects of it. Even so, the authors have managed to cover enough ground to pique your interest in ML if you haven’t been exposed to it yet. The chapter starts by defining ML, not an easy thing to do. It then shows you how to build a model to distinguish malware from non-malware using classification techniques. After that it deals with model validation techniques and issues, the risks of overfitting, and feature selection, which are some of the common things you do when building an ML model. Next the chapter looks at various supervised and unsupervised learning techniques. Finally you get 3 examples: clustering breach data, multidimensional scaling of victim industries, and hierarchical clustering of victim industries.
It is impossible to do full justice to ML even in a whole book let alone a single chapter, but you still get enough to get you started.

Chapter 10: Designing Effective Security Dashboards

‘A Dashboard is not an Automobile’. Chapter 10 is about creating effective InfoSec dashboards. The chapter introduces you to bullet graphs (a creation of Stephen Few) as a saner and more efficient alternative to gauges and dials. You also see examples of other interesting dashboard visualizations like sparklines. The authors have some good advice about the dos and don’ts of designing dashboards.
Next the authors work through a concrete example of conveying and managing security via dashboards. They stress the simple yet extremely effective bar charts and bullet graphs, as opposed to fancier but confusing UI elements like 3D charts, pie charts etc. To illustrate this point the authors have provided a couple of dashboard makeover examples.
Finally the authors talk about designing dashboards for InfoSec. Stressing two simple questions, a) What is going on ? & b) So what ?, the authors explain what should and what should not be presented on an InfoSec dashboard and how to present it most effectively.

Chapter 11: Building Interactive Security Visualizations

Chapter 11 is all about interactive visualizations, interactive being the keyword. You learn when to move from static to dynamic visualization, and more importantly why. As the authors point out, prefer static, and go dynamic only if dynamism augments the visualization, aids in exploration, or illuminates a topic in a way that can’t be done using static images. The authors present an example of each of these three cases and discuss the pros and cons of dynamic visualization in these contexts. Next the authors present ways to create dynamic visualizations using Tableau, a very popular Business Intelligence and Visualization tool, and also using D3.js, a free and open source JavaScript charting library. As with other chapters, you get to tie this topic back to InfoSec by designing an interactive threat explorer using jQuery, Vega and Opentip.

Chapter 12: Moving Towards Data-Driven Security

In Chapter 12 the authors provide their own advice for InfoSec research, based on their experience and acumen. They recommend ‘panning for gold’ rather than ‘drilling for oil’; that is to say, don’t get bogged down in one specific focus, but explore the data first and then focus on the questions you want to ask. They offer practical advice on the various roles one can play in the InfoSec domain, ranging from the Hacker, Coder, Data Munger, Visualizer, Thinker and Statistician to the Security Domain Expert. For each role they provide a list of resources to sharpen your skill sets. Lastly they offer tips on moving your entire organization towards data-driven security and building security data teams.


Appendix A provides a vast list of web links, from Data Cleansing, Analytics and Visualization tools to aggregation sites and blogs to follow. There is a ton of material worth checking out and bookmarking here.

Conclusion and Other thoughts

So how do I rate this book ? This is a rather difficult question considering that nothing like this has ever been attempted before. Sure there are plenty of books about traditional InfoSec research and tools, and there are even more books on Statistics, and Machine Learning, and Visualization, not to mention gazillions of books on Programming/Coding. But a book that touches all 3 aspects of Data Science is indeed very rare.
Having said that, I like this book very much. It covers every aspect of Data Science with a focus on InfoSec, in just enough detail to do it justice. The code samples are great, but more important is the very serious advice the authors have to offer (albeit in a lighter tone). This book is by no means a small achievement, not only among InfoSec books but among Data Science books as well. I don’t see any reason why this book should not be in your collection if you deal with InfoSec and/or Data Science. Even if your domain is not InfoSec, if you are interested in Data Science I would still highly recommend this book, as it will show you how to make Data Science work for your domain using InfoSec as an example.

Wednesday, March 4, 2015

The 10 commandments for hiring Data Scientists

As a Data Scientist (whatever it means), I get a lot of job offers over LinkedIn and other channels. Although I’m not actively looking for a job, I still go through them: firstly because I’m curious to find out what exactly organizations look for in a Data Scientist, and secondly to amuse myself. This post is about the latter part. It amuses me to no end what some people want in a Data Scientist, and I’ve made a consolidated list for all the recruiters and organizations who are looking to hire one (or more).

Warning : If satire is not your cup of tea (coffee/soda) you should most certainly not refrain from not reading this article.

Rule 1: Thou shalt not have a freaking clue what Data Science is all about.

But you should still throw in terms like Artificial Intelligence, deep learning, Neural Networks, SVM (admit it, you don’t even know what it stands for). You are not concerned whether the applicant can apply his knowledge to solve the problem at hand, all you really want to know is whether he knows the difference between supervised and unsupervised learning.
In short don’t worry about what the applicant can do, just worry about how much he can memorize and regurgitate. Oh! and have him describe the Apriori algorithm over the telephonic interview.

Rule 2: Thou shalt use a fishing net to grab as many as you can, and figure out what to do with them later.

Have like 10 or 15 openings for the same post. Be very vague about what it is that you exactly expect these people to do. Better yet have a complete lack of understanding of what your problems are and how you think Data Scientists can help you solve them. Just be sure to mention you have tons of data. Yeah that’ll make them bite.

Rule 3: Thou shalt put ‘Data Scientist’ even if what you really want is a code monkey.

Well why not? I mean you can’t attract developers to work for you with ‘We need you to work 12 hours a day, 7 days a week, 365 days a year’. But hey if you just change that position from Software Developer to Data-Scientist, lo and behold the bees come flying to your honeypot. And if they can fix your crappy website CSS code on the side while doing data science-y stuff it’s a win-win.

Rule 4: Thou shalt never mention salary range or benefits.

It’s not like Data-Scientists are in hot demand or anything. They should be grateful you even put up a job post for them to see and apply. And who do they think they are, demanding top notch compensation for the effort and hard work they put into acquiring their skills? And if you really think about it, free laundry is all the benefits someone needs anyways.

Rule 5: Thou shalt not care about the age v/s experience paradox.

We want a PhD. with 10 years of work experience, who’s young and has the zeal and energy of someone just out of the college, coz you know we need his skills to make more people click our in-your-face video pop-up ads. Plus we really can’t just say under–30 single male (that would get us sued), so we just go with young, energetic, likes to work in a startup environment, doesn’t mind staying up late in office (hey free pizza and sodas!).

Rule 6: Thou shalt extol the virtues of working in a startup in a way that would make a Bangladeshi garment factory owner blush.

  • Long never ending hours at work - CHECK
  • Jack of All trades job duties - CHECK
  • Low pay but promise of Stock Option - CHECK
  • No real usable health care coverage - CHECK
  • Foosball/Ping-Pong table - CHECK
  • Screwing over loyal employees by selling and cashing out - PRICELESS

Rule 7: Thine post shalt be scattered with worthless terms like web-scale, big data, well-funded startup.

As if terms like ‘leverage’ and ‘synergy’ were not enough. Our dear data scientist must know how to work with ‘BIG DATA’. The more meaningless and worthless terms in our job posting the better; it will allow us to hire the crème de la crème of analytics talent. Also throw in the fact that all the founders have PhDs in bio-informatics or AI or machine learning etc. coz you know that is so critical to have when it comes to effective leadership.
Also when was the last time someone advertised themselves as a ‘piss-poorly funded startup’?

Rule 8: Thine Data-Scientists must know every programming language under the sun (even the ones not invented so far).

Coz you know programming is where it’s at. If you can’t code you can’t do jack. And what do you mean you only know R or Python ? All the cool kids are using Ruby or Node.js. And don’t tell us you can’t write enterprise applications using Java/J2EE/EJBs. Oh! and please do explain in great detail how a Hashtable works. No job interview is complete without it. In short if you see an IP address the first thing that should cross your mind is Visual Basic GUI.

Rule 9: Hadoop Hadoop Hadoop, wait I’m forgetting something, Ah! yes Hadoop.

GEORGE: “Why don’t they have Hadoop in the mix?”

JERRY: “What do you need Hadoop for?”

GEORGE: “Hadoop is now the number one data crunching framework in America.”

JERRY: “You know why? Because people like to say ‘Hadoop.’ ‘Excuse me, do you have any Hadoop?’ ‘We need more Hadoop.’ ‘Where is Hadoop? No Hadoop?’”

Rule 10: Thine Data-Scientist should be able to code, design web-apps, be an Agile Scrum master, Software Architect, Project Manager, Product Manager, Sales/Marketing Guru. Did I mention a unicorn ?

OK no satire on this one. Just straight up practical advice. Data scientists are not all-knowing superhuman beings. Figure out what it is that your organization wants to do with data and hire well trained and decently experienced people who can solve your challenges. If you are looking for novice employees, make sure the job has enough breathing room for them to grow into it gradually. And lastly don’t go looking for unicorns because a) they don’t exist, and b) the best thing they can do is make 6–8 year old girls scream in high pitch.

footnote : The difference in percentage of animals harmed in writing this post and percentage of animals that would have been harmed had I not written this post is not statistically significant.

Sunday, February 15, 2015

Visualizing India v/s Pakistan One Day International Results

This is my small effort to pick up the streamgraph support in R developed by Bob Rudis. (Described here).

What you see is per year aggregations of results of all India v/s Pakistan One Day Internationals. I pulled the records from Wikipedia and used rvest by Hadley Wickham for extracting the results. After that a little data munging using dplyr and lubridate and voilà.

You can see an interactive version of the chart above over at rpubs.

Blues are India and greens are Pakistan, in accordance with their team colors. India had an abysmal record against Pakistan right up until the mid 90s, but it has picked up quite a bit after that. And of course India has won all 6 of its Cricket World Cup matches against Pakistan.

As of today the tally stands at: India 51 wins and Pakistan 72 wins. Below's a detailed breakdown.

Running a chi-square test for dependency between the result and the venue didn't find any statistically significant association between the two, which in layman's terms means there is no evidence that the results depend on the venue.
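For anyone curious what such a test computes, here is a minimal sketch in Python (standard library only, rather than the R used for the rest of this post). The counts in the table are made up purely for illustration and are NOT the actual India v/s Pakistan results; the function name chi_square_statistic and the venue categories are my own assumptions.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if venue and winner were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts, NOT the real match data.
# Rows: venue type (e.g. in India, in Pakistan, neutral).
# Columns: winner (India, Pakistan).
observed = [
    [10, 12],
    [8, 14],
    [30, 40],
]

stat = chi_square_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value of the chi-square distribution for 2 degrees of freedom
# at the 5% significance level.
CRITICAL_5PCT_DF2 = 5.991

if __name__ == "__main__":
    print("chi-square:", round(stat, 3), "dof:", dof)
    if stat < CRITICAL_5PCT_DF2:
        print("No significant association at the 5% level")
    else:
        print("Association is significant at the 5% level")
```

A statistic below the critical value only means the data give no evidence of an association; it does not prove that venue and result are independent.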

For the nerds (Oh sorry Data Scientists), the code is shown below.


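# Packages used below
library(rvest)       # scraping
library(dplyr)       # data munging
library(stringr)     # str_replace
library(lubridate)   # ymd, year
library(streamgraph) # the chart itself

# Wikipedia is our best go-to source
# (read_html() replaces html(), which newer rvest releases deprecate)
indvspak <- read_html('https://en.wikipedia.org/wiki/List_of_ODI_cricket_matches_played_between_India_and_Pakistan')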
# Summary table
results.summary <- indvspak %>% html_node('.wikitable') %>% html_table()

# Any dependency between venue and result ?

# The XPATH expression below was obtained using Chrome's Element Inspector.
results <-  indvspak %>%
  html_node(xpath='//*[@id="mw-content-text"]/table[4]') %>% html_table()

# Sensible headers
colnames(results) <- c('MatchNum','Date','Winner','WonBy','Venue','MoM')

# Fix Date
results$Date <- ymd(str_replace(results$Date,'^0([0-9]{4}-[0-9]{2}-[0-9]{2}).*$','\\1'))
# Extract just the year in a new field
results$year <- year(results$Date)

# So that we get our colors as per team colors
results$Winner <- factor(results$Winner,levels=c('India','Pakistan','No result'),ordered=T)

results %>% select(year,Winner) %>%
  group_by(year,Winner) %>% tally() %>%
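  streamgraph("Winner", "n", "year", offset="zero", interpolate="linear") %>%
  # NOTE: the call that wrapped this label was lost in the original post;
  # sg_legend() is an assumption based on its 'label' argument.
  sg_legend(show=TRUE, label="Ind v/s Pak One Day International Results : Over the years") %>%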
  sg_axis_x(1, "year", "%Y") %>%