Tuesday, May 5, 2015
I work with the two authors of this book. In fact, one of them is my manager. But a) I don’t like to suck up to my colleagues and b) I’m sure they don’t like being sucked up to either. Despite this, if you think my review will be biased, stop reading now. Go watch some cat videos.
Data Driven Security is a first-of-its-kind book that aims to achieve the impossible: to be a book that integrates all 3 dimensions of ‘Data Science’, namely a) Math and Statistical Knowledge, b) Coding/Hacking skills, and c) Domain Knowledge, the domain in this case being Information Security. If these 3 dimensions are unknown to you, look at the figure on the right.
Traditionally, books on data science have tackled only one dimension at a time, or at best two. This book is unique in that it tackles all 3. That is worth mentioning, especially when you consider that concepts like statistical and machine learning are not part of traditional InfoSec tools. Traditional InfoSec tools are based around signature matching, i.e. determining whether a threat matches a set of already known badness such as a virus, malware, network activity, an IP address, or a domain name. This approach is always playing catch-up, and the good guys are always one step (in fact several steps) behind the bad guys. This is where data driven security comes in. The idea is to use data analysis techniques for security research and build the next generation of InfoSec tools that can spot badness before it is known. A fascinating field, trust me. The challenge of writing on such a subject cannot be overstated. The book needs to be approachable by readers coming in from any of the three dimensions. Also, each of the three dimensions is so vast that you can find hundreds of books dedicated to just one of them. So have the authors been successful in this endeavor? Read on to find out…
At a glance
The book (ISBN: 978–1–118–79372–5) was published by Wiley Publications in Feb 2014. Wiley has been publishing some really interesting titles over the past few years in the Data Analysis and Statistics domain. The book is absolutely gorgeous from cover to cover. The page quality is very high and a lot of effort has gone into making the code and figures look stunning. It is one of the most visually pleasing books I have in my collection (alongside anything by Stephen Few and Edward R. Tufte). All code presented is properly commented, something I seldom find in technical books. There is a ton of code in this book, which is expected as coding is one of the skills in data science. The authors have done a great job presenting code in Python and R. Both Python and R offer libraries and interactive environments for data analysis, and by presenting the code in 2 languages the authors have made the book accessible to a wide audience.
In addition to the book, the authors run a website for it (datadrivensecurity.info), a blog, and a podcast where they discuss all things InfoSec and data science.
Chapter 1: The Journey to Data Driven Security
Chapter 1 starts with a brief history of data analysis, from the classical statistical analysis techniques of the Nineteenth and Twentieth centuries to the modern algorithmic approaches of the Twenty-First century. There are enough anecdotes to convince you of the importance of data analysis, in case you are still wondering why you should analyze data in the first place. The chapter then explores the skill sets required for data analysis: Domain expertise, Data Management, Programming, Statistics and finally Visualization. Each topic is given its due credit and you’ll learn how each of these pieces fits into the mosaic. The chapter rounds off with a very important section, ‘Centering on a Question’. Your data analysis can easily lead you in many directions if you don’t have a proper research question framed (Yours truly is guilty of this many times over). In short, you learn the history of data analysis, the skill sets required for it, and the importance of framing the right question(s).
Chapter 2: Building Your Analytics Toolbox: A Primer on Using R and Python for Security Analysis
Chapter 2 is all about setting up Python and R development environments for your data analysis. The authors start by explaining their reasons for using Python and R, and more importantly why both (avoiding the situation of having a hammer as your only tool). Next you learn how to set up Python using the Canopy distribution and R using RStudio. You also get some sample code to test your respective setups. Next the chapter introduces you to the concept of a data frame, a tabular data structure often used in data analysis. You get a taste of both R’s native data.frame and Python’s DataFrame from the pandas package. Lastly you’ll see how to organize your code for a typical analysis project. I recommend you don’t skip this chapter even if you’re familiar with either Python or R, or even both.
Chapter 3: Learning the “Hello World” of Security Data Analysis
Chapter 3 is where things get real. You start by importing AlienVault’s IP Reputation database into your Python and R environments. You then get a feel for the data by inspecting the various fields and their data types and producing appropriate statistical summaries of them. Next you perform some basic charting using R’s ggplot2 and Python’s Matplotlib. Even if you are familiar with the basics of exploratory data analysis (EDA), I suggest you don’t skip the ‘Homing In on a Question’ section; this is where you’ll learn how to use EDA to answer specific questions about your data. There are quite a few examples here, in both Python and R, that use bar charts and heatmaps to explore relationships between various fields and derive answers from these relationships.
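To give a flavor of that first pass over the data, here is a minimal sketch using only Python’s standard library. It is not the book’s code: the rows are made up (the addresses come from ranges reserved for documentation), and the ‘#’-delimited column layout is an assumption about how such a reputation file might be laid out.

```python
import csv
import io
from collections import Counter
from statistics import mean, median

# Made-up sample rows. Assumed layout (not confirmed by the review):
# IP#Reliability#Risk#Type#Country#Locale#Coords#x
SAMPLE = """\
192.0.2.10#4#2#Scanning Host#CN#Beijing#39.9,116.4#11
192.0.2.77#6#3#Malicious Host#US#Dallas#32.8,-96.8#3
198.51.100.5#2#2#Scanning Host#CN#Shanghai#31.2,121.5#11
198.51.100.9#9#5#C&C#RU#Moscow#55.8,37.6#2
203.0.113.40#4#2#Scanning Host#US#Seattle#47.6,-122.3#11
"""

FIELDS = ["ip", "reliability", "risk", "type", "country", "locale", "coords", "x"]


def load_rows(text):
    """Parse '#'-delimited text into a list of dicts with integer scores."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter="#")
    rows = []
    for row in reader:
        row["reliability"] = int(row["reliability"])
        row["risk"] = int(row["risk"])
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)

# Statistical summary of one numeric field.
reliability = [r["reliability"] for r in rows]
summary = {
    "min": min(reliability),
    "median": median(reliability),
    "mean": mean(reliability),
    "max": max(reliability),
}

# Counts per category: the numbers behind a bar chart.
by_country = Counter(r["country"] for r in rows)
by_type = Counter(r["type"] for r in rows)

# Counts per pair of fields: the numbers behind a heatmap.
risk_by_country = Counter((r["country"], r["risk"]) for r in rows)

print(summary)
print(by_country.most_common())
print(by_type.most_common())
```

The single-field counts are what you would feed to a bar chart, and the pairwise counts are the cells of a heatmap.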
Chapter 4: Performing Exploratory Security Data Analysis
Chapter 4 dives into Exploratory Data Analysis (EDA) for InfoSec research. You get to know the details of IPv4 addresses, how they can be grouped together using Autonomous System Numbers (ASNs), and why that is useful. You also learn how to use a GeoIP service like MaxMind’s free GeoIP API to tie an IP address to coordinates on a map. Once you have the geo coordinates you can use charting APIs to plot the IPs on a world map. After that you see how to augment IP addresses with other useful attributes, such as the IANA block information for that IP.
Next the chapter dives into the basics of correlation analysis and lays down some of its core concepts. Lastly you get to build some graph data structures and visualize the graph nature of the relationships between IP addresses in the ZeuS botnet. All these techniques are foundations of initial exploratory analysis, and although this chapter covers quite a few diverse but related concepts from both statistics and information security, it does a good job of tying it all together. This will be the foundation for the analysis in later chapters.
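As a rough illustration of treating IPv4 addresses as numbers and grouping them by the block they belong to, here is a small sketch with Python’s standard ipaddress module. The block-to-owner table and the owner labels are hypothetical stand-ins for real ASN or IANA data, and none of this is taken from the book.

```python
import ipaddress
from collections import defaultdict

# Hypothetical lookup table: CIDR block -> owner label.
# A real analysis would load ASN or IANA allocation data instead.
BLOCKS = {
    ipaddress.ip_network("192.0.2.0/24"): "AS-EXAMPLE-1",
    ipaddress.ip_network("198.51.100.0/24"): "AS-EXAMPLE-2",
}


def owner_of(ip_text):
    """Return the label of the first block containing the address, else 'unknown'."""
    ip = ipaddress.ip_address(ip_text)
    for network, owner in BLOCKS.items():
        if ip in network:
            return owner
    return "unknown"


# An IPv4 address is just a 32-bit integer in disguise.
as_int = int(ipaddress.ip_address("192.0.2.10"))
round_trip = str(ipaddress.ip_address(as_int))

# Group a handful of documentation-range addresses by owner.
ips = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "203.0.113.40"]
groups = defaultdict(list)
for ip in ips:
    groups[owner_of(ip)].append(ip)

for owner, members in groups.items():
    print(owner, members)
```

Grouping by block is what lets you ask questions such as "which networks host most of the bad addresses?" instead of staring at individual IPs.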
Chapter 5: From Maps to Regression
Chapter 5 starts with the basic concepts of plotting geographical maps. It walks you through plotting with latitude and longitude data, plotting per-country stats using choropleth plots, and zooming in on a specific country (the USA in this example). Plotting some numbers on a geographical map is pointless unless it enables you to derive some information or insight from that plot. The chapter looks at a potentially interesting data point and then uses box plots to see if the data point is indeed an outlier. Finally the mapping part concludes by showing you how to aggregate the data at county level.
The last part is a quick introduction to regression analysis (there are multitudes of books written on just this one subject). You learn how to build regression models and perform analysis based on model parameters. You also see some caveats you need to keep in mind when interpreting regression models. Finally you see how to apply regression analysis to check whether reported alien sightings have any impact on the infection rate of the ZeroAccess rootkit. Yes, you read that right, and the authors are not fools: they chose these 2 variables to prove a point about multicollinearity, a common problem in regression analysis.
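To make "build a regression model and read its parameters" a little more concrete, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The numbers are made up and the variable names are only illustrative; this is not the book’s data, code, or model, and it does not attempt to show multicollinearity, which needs at least two predictors.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up per-region figures, purely for illustration.
sightings = [1.0, 2.0, 3.0, 4.0, 5.0]
infections = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(sightings, infections)
r2 = r_squared(sightings, infections, intercept, slope)
print(f"infections ~ {intercept:.2f} + {slope:.2f} * sightings (R^2 = {r2:.3f})")
```

A high R² here only says the line fits the numbers well. It says nothing about one variable causing the other, which is exactly the kind of caveat the chapter warns about.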
Chapter 6: Visualizing Security Data
Chapter 6 is all about a picture speaking louder than a thousand words. Effective visualization is a key foundation of data analysis. The chapter starts by explaining the need for visualization and then takes a semi-deep dive into visual perception and why it matters in building effective visualizations. These are topics that deserve their own books, let alone chapters, yet the authors manage to convey the gist of it all in the first few sections of the chapter.
The chapter then moves on to specific examples of visualization such as bar charts, bubble charts, treemaps, and distribution visualization using histograms and density plots. You also get a taste of visualizing time series data. Lastly you get to build a movie from your data. (I kid you not.)
Chapter 7: Learning from Security Breaches
Chapter 7 is devoted to the art of examining and analyzing security breaches. The authors introduce you to the VERIS framework, developed by one of the authors for capturing information related to data breaches to be used in Verizon’s annual Data Breach Investigations Report (DBIR). Before examining the details of the VERIS framework, the authors explain why it is necessary to analyze data breaches, what sort of research questions can be answered, and what some of the considerations are when designing a data collection framework for this purpose.
Next the authors introduce the VERIS framework, its various sections, and the enumerations used in them. You learn how VERIS tracks assets, actors, threats, and actions, and how they affect the Confidentiality, Integrity, & Availability (CIA triad) of the breached data. You also learn how to code up discovery/response and the subsequent impact of the data breach on the victim organization.
Next you get to play with some real-life data captured in the VERIS Community Database (VCDB). VCDB is a project that captures publicly disclosed data breaches and encodes them in the VERIS format. The VERIS format is a JSON specification, and you see code examples of doing basic uni-variate and bi-variate analysis with bar charts and heatmaps.
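Here is a rough sketch of what uni-variate and bi-variate counting over JSON incident records can look like in plain Python. The four records are made up and only loosely shaped like VERIS incidents (top-level actor and action objects keyed by category); real VCDB records carry far more detail, and this is not the book’s code.

```python
import json
from collections import Counter

# Made-up incidents, loosely VERIS-shaped. Real records have many more fields.
RECORDS_JSON = """
[
  {"actor": {"external": {}}, "action": {"hacking": {}, "malware": {}}},
  {"actor": {"internal": {}}, "action": {"error": {}}},
  {"actor": {"external": {}}, "action": {"hacking": {}}},
  {"actor": {"external": {}}, "action": {"social": {}, "malware": {}}}
]
"""

incidents = json.loads(RECORDS_JSON)

# Uni-variate: how often does each action category appear?
action_counts = Counter(
    action for incident in incidents for action in incident["action"]
)

# Bi-variate: how often does each (actor, action) pair appear?
pair_counts = Counter(
    (actor, action)
    for incident in incidents
    for actor in incident["actor"]
    for action in incident["action"]
)

# A poor man's bar chart in text form.
for action, count in action_counts.most_common():
    print(f"{action:<8} {'#' * count}")

for (actor, action), count in sorted(pair_counts.items()):
    print(f"{actor:<9} {action:<8} {count}")
```

An incident can involve several actions at once, so the action counts can add up to more than the number of incidents.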
Chapter 8: Breaking Up with Your Relational Database
RDBMS, NOSQL and everything in between: that’s what Chapter 8 is all about. With a quick primer on SQL/RDBMS you get your feet wet with MariaDB (a MySQL fork). You learn how to create a small schema for storing InfoSec entities, and you see the difference in speed between a disk-backed and a memory-backed storage engine. From RDBMS we move to NOSQL (Not Only SQL, and not No SQL). The authors first explore BerkeleyDB, a very popular key-value datastore. You have sample code in both R and Python for interacting with BerkeleyDB. Next the chapter deals with Redis, a very popular data-structure datastore. You learn about the various data structures supported by Redis and a couple of its advanced features. The authors also tackle Hadoop & MapReduce for processing security data at scale, touch on MongoDB, and make passing references to Elasticsearch and Neo4j. Overall the chapter deals with some very popular RDBMS and NOSQL databases, and provides code samples to interact with them in Python and R.
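If you have never used a key-value datastore, the idea is simple enough to sketch with Python’s standard library. To be clear, this swaps in the stdlib dbm.dumb module as a stand-in; it is not BerkeleyDB and not the book’s code, and the keys and values are made up.

```python
import dbm.dumb
import os
import tempfile

# dbm.dumb is a simple stdlib key-value store, used here as a stand-in.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reputation")

    # Write: keys and values are plain bytes, with no schema and no SQL.
    with dbm.dumb.open(path, "c") as db:
        db[b"192.0.2.10"] = b"Scanning Host"
        db[b"198.51.100.9"] = b"C&C"

    # Read: look things up by key, falling back to a default when absent.
    with dbm.dumb.open(path, "r") as db:
        looked_up = db[b"192.0.2.10"].decode()
        missing = db.get(b"203.0.113.40", b"unknown").decode()
        stored_keys = sorted(key.decode() for key in db.keys())

print(looked_up, missing, stored_keys)
```

The trade-off is the one the chapter title hints at: lookups by key are trivial, but anything resembling a join or an ad hoc query is now your job.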
Chapter 9: Demystifying Machine Learning
Chapter 9 is all about Machine Learning in the InfoSec domain. Now let’s get this straight: ML is a very vast topic. There are entire books devoted to just certain aspects of it. Even so, the authors manage to cover enough ground to pique your interest in ML if you haven’t been exposed to it yet. The chapter starts by defining ML, not an easy thing to do. It then shows you how to build a model that distinguishes malware from non-malware using classification techniques. Then the chapter deals with model validation techniques and issues, the risks of overfitting, and feature selection, which are some of the common things you do when building an ML model. Next the chapter looks at various supervised and unsupervised learning techniques. Finally you get 3 examples: clustering breach data, multidimensional scaling of victim industries, and hierarchical clustering of victim industries.
It is impossible to do full justice to ML even in a whole book let alone a single chapter, but you still get enough to get you started.
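For a taste of what unsupervised clustering does, here is a toy k-means in plain Python. K-means is just one common clustering technique, and I am not claiming it is the one the book uses; the points are made-up two-feature observations and the starting centroids are picked by hand to keep the run deterministic.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, max_iter=20):
    """Assign points to the nearest centroid, recompute centroids, repeat."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = mean of its members; keep the old one if it has none.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Made-up observations with two numeric features each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
]

# Hand-picked starting centroids (real code would usually pick them at random).
centroids, labels = kmeans(points, [points[0], points[3]])
print(centroids)
print(labels)
```

Nobody tells the algorithm which group a point belongs to; the labels fall out of the distances alone, which is the essence of unsupervised learning.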
Chapter 10: Designing Effective Security Dashboards
A ‘Dashboard is not an Automobile’. Chapter 10 is about creating effective InfoSec Dashboards. The chapter introduces you to bullet graphs (a creation of Stephen Few) as a much saner and more efficient alternative to gauges and dials. You also see examples of other interesting dashboard visualizations like Sparklines. The authors have some good advice about dos and don’ts when designing dashboards.
Next the authors work through a concrete example of conveying and managing security via Dashboards. The authors stress simple yet extremely effective bar charts and bullet graphs, as opposed to fancier but confusing UI elements like 3D charts and pie charts. To illustrate this point the authors provide a couple of Dashboard makeover examples.
Finally the authors talk about designing dashboards for InfoSec. Stressing two simple questions, a) What is going on? and b) So what?, the authors explain what should and what should not be presented on an InfoSec Dashboard and how to present it most effectively.
Chapter 11: Building Interactive Security Visualizations
Chapter 12: Moving Towards Data-Driven Security
In Chapter 12 the authors provide their own advice for InfoSec research, based on their experience and acumen. They recommend ‘panning for gold’ rather than ‘drilling for oil’; that is to say, not getting bogged down in a specific focus, but exploring the data and then focusing on the questions you want to ask. They offer practical advice on the various roles one can play in the InfoSec domain, ranging from Hacker, Coder, Data Munger, Visualizer, Thinker, and Statistician to Security Domain Expert. For each role they provide a list of resources to sharpen your skill sets. Lastly they offer tips on moving your entire organization towards data-driven security and building security data teams.
Appendix A provides a vast list of web links, from Data Cleansing, Analytics and Visualization tools to aggregation sites and blogs to follow. There is a ton of material worth checking out and bookmarking here.
Conclusion and Other thoughts
So how do I rate this book? This is a rather difficult question, considering that nothing like this has ever been attempted before. Sure, there are plenty of books about traditional InfoSec research and tools, and there are even more books on Statistics, Machine Learning, and Visualization, not to mention gazillions of books on Programming/Coding. But a book that touches all 3 aspects of Data Science is indeed very rare.
Having said that, I like this book very much. It covers every aspect of Data Science with a focus on InfoSec, in just enough detail to do it justice. The code samples are great, but more important is the very serious advice the authors have to offer (albeit in a lighter tone). This book is by no means a small achievement, not only among InfoSec books but among Data Science books as well. I don’t see any reason why this book should not be in your collection if you deal with InfoSec and/or Data Science. Even if your domain is not InfoSec, if you are interested in Data Science I would still highly recommend this book, as it will show you how to make Data Science work for your domain using InfoSec as an example.