Saturday, December 13, 2014

Slides from my talk at Elasticsearch DC Meetup Dec 11 '14.

On Dec 11th, 2014 I presented a talk on 'Scaling Elasticsearch for Production'; below are the slides. Video should be coming soon.

Saturday, March 29, 2014

Book Review : Data Smart

Data Smart: Using Data Science to Transform Information into Insight by John Foreman
My rating: 5 of 5 stars

Full 5 stars.
There is no reason why you should not buy this book if you are even remotely connected with things like 'Data Science', 'Analytics', 'Forecasting', etc.
I enjoyed all the chapters, especially Chapters 4 (Optimization), 6 (Regression), and 8 (Forecasting).
Seriously buy this book, now.

It's a very easy read, and yet the author does not skimp on important concepts, so you get the best of both worlds: a good solid foundation and practical implementation.

One thing I like is that, about 90% of the time, the subject matter and the spreadsheet diagrams are on the same set of pages, so you don't have to go back and forth between pages to sync text and images.

View all my reviews

Saturday, February 22, 2014

Searching, Analyzing & Visualizing Security Feeds

If you work with Computer or Network Security, then terms like CVE, CPE, CWE, CCE, etc. should be very familiar to you. If not, you're in the wrong field :).

For those who don't work in these fields but are curious about them, these are some of the security-related feeds provided by independent organizations such as MITRE or NIST as part of the "Making Security Measurable" initiative. These feeds provide metadata about things related to Computer/Network Security, such as standard names for platforms/operating-systems/software/hardware, and standard names for common vulnerabilities, weaknesses, and configurations. Using these standard names helps different vendors identify and tag security vulnerabilities, platforms, etc. in an unambiguous way.
Almost every vendor in this space relies on these feeds and incorporates them into their products in one way or another. We use them too, but ...

We have more than one security-related product, and each provides a unique take on Computer/Network security. This is not unique to our way of working; you'll see this pattern across the whole security industry, with organizations having products catering to SIEM (Security Information and Event Management), VM (Vulnerability Management), Log Management, Compliance, etc.

In our case each of these product offerings takes in some or all of these feeds (which are provided in XML format), parses the feed, and loads it into an RDBMS.
The problems with this approach are ...

  • Each product team has its own code base for parsing these feeds and its own DB schema for representing them. So a lot of work is duplicated, and there are no standards across products on how to model these feeds in each product.
  • Each product only takes in a subset of the available feeds, and brings in new feeds if and when needed. So it's a never-ending cycle for each new feed: write parsers, design the DB schema, and code the ETL procedures.
  • Due to the rather complex nature of these feed formats, and the rather limited ability to model such complex structures in an RDBMS, we end up throwing away a lot of information, cherry-picking only important attributes such as ID, name, description, title, etc. to keep our models simple.
  • And perhaps most important, there is no text search capability within or across feeds. These feeds have some very elaborate descriptions and code samples, so full-text search is very critical.

So while this approach works, as you can see it's not very efficient: duplication of work, no standard model across products, and limitations of the storage platform result in discarding information that could potentially be useful.


So what's needed is:

  • An independent meta store that can handle any and all available feeds.
  • A flexible storage platform not limited by the shortcomings of an RDBMS, which can in turn retain almost all the information available in the feeds.
  • A very simple REST API for each product to tap into the meta store.
  • A full-text search as well as a field-based search interface as part of the REST API.
  • A framework/API for descriptive statistical analysis and exploratory analysis of the information store.
  • A very simple dashboard to visualize these feeds, for a bird's-eye view.

As most of these feeds are published as XML, initially I thought of using an 'XML DB' for the job. I have quite a bit of experience with 'Oracle XMLDB', and although it offers a much more flexible platform for storage, we'd still need to define a proper DB schema based on the underlying XML schema to take full advantage of 'XMLDB' features, and we'd still need to write our own API layer on top of XMLDB. So the amount of effort is not reduced significantly. Not to mention any statistical analysis or dashboarding is added effort.


So what's the alternative for developing such a solution? ... Elasticsearch.

Why? Because ...

  • ES offers a very flexible storage model; not only is it a full-text search product but also a very competent storage platform for unstructured or semi-structured data (a NoSQL DB, if you will).
  • It works very well schema-less, and the effort to define an index mapping (the equivalent of a DB schema) is minimal compared to traditional RDBMS schema design. So you can work in a mixed environment where you define mappings for a key set of attributes and leave the rest to ES's dynamic mapping.
  • Full-text search is ES's bread and butter, and it also works well for field-based querying/filtering.
  • ES provides statistical APIs (facets in pre-1.0 releases, and aggregations in 1.0+ releases) out of the box, without us having to write a single piece of code (a sketch follows this list).
  • ES provides Kibana for building dashboards on data stored in ES. Kibana does all the heavy lifting, and you can build very intuitive dashboards in a very short amount of time.
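As a taste of those aggregations, here's a minimal sketch of counting CVE entries per severity through the Perl client. The 'cve' index and 'severity' field names here are illustrative assumptions, not necessarily what our feed store actually uses.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

# Count CVE entries per severity value with a terms aggregation (ES 1.0+).
my $result = $es->search(
    index => 'cve',
    body  => {
        size => 0,    # no hits needed, just the aggregation buckets
        aggs => {
            by_severity => {
                terms => { field => 'severity' },
            },
        },
    },
);

printf "%-10s %d\n", $_->{key}, $_->{doc_count}
    for @{ $result->{aggregations}{by_severity}{buckets} };
```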
So what needs to be done by us? Well, not much really; here's what I've done so far ...

  • A simple Perl script which downloads the data feeds (XMLs), converts XML to JSON (trust me, with Perl this is a breeze, 2 lines of code), does minimal normalization of the data if needed, and then uses the Elasticsearch Perl API to bulk index the feed. (The whole script is about 100 to 120 lines; a stripped-down sketch follows this list.)
  • Define a minimal set of mappings for the feeds. Again, the idea is to make heavy use of ES's dynamic mappings for most fields and only provide explicit mappings for a select few key attributes.
  • Build Kibana dashboards for search as well as for summarizing feed data in graphs.
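To make that concrete, here's a stripped-down sketch of what such a script can look like. The feed file name, index name, and field names are illustrative assumptions, and the real script also handles the download and a bit of normalization.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Search::Elasticsearch;

# Assume the feed XML (e.g. an NVD CVE feed) has already been downloaded.
my $feed_file = 'nvdcve-2.0-recent.xml';    # illustrative file name

# The promised 2 lines: slurp the XML into a Perl data structure; the ES
# client serializes it to JSON on the way out.
my $feed = XMLin( $feed_file, ForceArray => ['entry'], KeyAttr => [] );

my $es = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

# The bulk helper batches index requests and flushes them in chunks.
my $bulk = $es->bulk_helper( index => 'cve', type => 'entry' );

for my $entry ( @{ $feed->{entry} } ) {
    $bulk->index( { id => $entry->{id}, source => $entry } );
}
$bulk->flush;    # push any remaining buffered documents
```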
In short, the coding effort was a very small Perl script, an ES mapping template, and Kibana dashboard building, all accomplished in a matter of hours, as opposed to the current approach, which requires days/weeks for each new feed we want to work with. The mapping template, for instance, boils down to something like the sketch below.
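This is a sketch pushed through the same Perl client; the template name, index pattern, and the two explicitly mapped fields are illustrative choices.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

# Explicitly map only a few key attributes; everything else in the feeds
# is left to ES's dynamic mapping. (ES 1.x syntax.)
$es->indices->put_template(
    name => 'security-feeds',    # illustrative template name
    body => {
        template => 'cve*',      # applied to any index matching this pattern
        mappings => {
            entry => {
                properties => {
                    # exact-match filtering on IDs, so don't analyze them
                    id      => { type => 'string', index => 'not_analyzed' },
                    # analyzed for full-text search
                    summary => { type => 'string' },
                },
            },
        },
    },
);
```

Everything not listed under properties simply gets mapped dynamically the first time it is indexed, which is what keeps the per-feed effort so low.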

Overall I'm very pleased and satisfied with what has been achieved. Below are some Kibana dashboards I've built.
Please note that as nice as Kibana is, what matters more to us is the full-text search capability we get from ES, along with the very easy and intuitive REST API that any product can use to tap into this feed store. Not to mention the ridiculously small amount of time spent putting this all together.
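For example, any product can run a full-text search across every feed index with just a few lines against that API (a sketch; the search phrase is just an illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new( nodes => ['localhost:9200'] );

# Full-text search across all feed indices via ES's catch-all _all field.
my $results = $es->search(
    index => '_all',
    body  => {
        query => { match => { _all => 'buffer overflow' } },
    },
);

print "$_->{_id}\n" for @{ $results->{hits}{hits} };
```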


Dashboard screenshots:

CVEs by Score (Also Adobe Ouch..)

CVEs by OS

CPEs

CCEs

CWEs