Bhaskar Karambelkar's Blog

Data Scientists And Software Engineering


Tags: Data Science Software Engineering

Coding != Software Engineering

As has been mentioned ad nauseam: Anyone can code. What has not been said enough is that there is a lot more to coding than merely assembling a set of instructions in a programming language of your choice to make the machine do what you want. Not as catchy as ‘Anyone can code!’, is it?

As a professional software developer turned data scientist, I feel compelled to share some software engineering wisdom with my fellow data scientists who may have followed a less coding-heavy path. If you are an academic researcher turned data scientist, or perhaps a data analyst used to point-and-click GUI tools or Excel (the horror!), then learning to code in a programming language can be a liberating and exhilarating, but also very scary, experience. But fear no more. This post and the follow-up posts in this series are just for professionals like you. This series will introduce you to software engineering and encourage you to explore it in more detail, so that you can become proficient at writing good code no matter the programming language.

So What Is Software Engineering?

Instead of referencing a formal definition, which you can easily look up using Google, let me tell you what I think the practice of software engineering aims to accomplish. Writing more or less working code is the easy part. What software engineering aims to accomplish is making the code portable, concise, relatively bug free, secure, performant within given constraints, and reusable, all with limited manpower and budget. And believe me, this is not as easy as you might think, nor is it a natural extension of the practice of coding. By that I mean that you can’t learn software engineering, let alone master it, just by coding more and more. Proper software engineering is an art and a science unto itself, of which coding skills are an important but nonetheless small piece.

I chose to explain software engineering this way because the term itself is somewhat controversial and debated. So instead of drowning you in the controversy about the term, I present my understanding of the intent behind software engineering. But be forewarned that this is one person’s opinion, so take it for what it’s worth.

So Why Must Data Scientists Care?

For many reasons. As I mentioned previously, more and more data analysis is now done in code rather than in point-and-click GUI tools. This places the added burden of learning how to code on a data analyst or researcher who may not have had exposure to coding before. Even if you have taken a programming language class before, it was mostly meant to teach you the syntax of the language rather than software engineering.

The implication of the above is that you may write code that works but has all sorts of issues. It may not be portable on account of using non-portable APIs/libraries. It may not be optimized or concise, and as a result may not perform well at scale. It may have a large surface area for bugs and security vulnerabilities. It may not be maintainable in the long run, and hence prevent you from reproducing your results in the future. Given all this, I would go as far as to argue that if you care about the veracity and reproducibility of your research/analysis, then you absolutely must care about software engineering.
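To make the portability point a little more concrete, here is a minimal Python sketch. The directory and file names are hypothetical, and the helper `data_file` is made up for illustration. It contrasts a hard-coded Windows-style path with one built through the standard library's `pathlib`, which picks the right separator for whatever operating system the code runs on.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Non-portable: the separator is baked into the string, so this only
# resolves correctly on Windows. (Hypothetical file name.)
fragile_path = "data\\raw\\survey.csv"


def data_file(base_dir, *parts):
    """Join path components using the conventions of the current OS."""
    return Path(base_dir).joinpath(*parts)


# Portable: the same call works unchanged on Linux, macOS and Windows.
portable_path = data_file("data", "raw", "survey.csv")

# The "pure" path classes show how the same components render per platform,
# regardless of the machine this script happens to run on.
posix_view = str(PurePosixPath("data", "raw", "survey.csv"))
windows_view = str(PureWindowsPath("data", "raw", "survey.csv"))

print(portable_path.parts)
print(posix_view)
print(windows_view)
```

The analysis code never needs to know which separator is in use. It only deals with the path components, which is what keeps a script reproducible when a colleague runs it on a different machine.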

Cliff Notes for Software Engineering

I must warn you that this page is only meant to get you started on software engineering, not to tell you everything there is to know about it. Even so, following most of my advice below will make you a better coder (ahem, software engineer).

  • Have a basic understanding of how computer systems work. The hardware, the software (including OS/Kernel), the network, the Internet. This will go a long way, I promise.
  • Try and pick up at least 2 or 3 programming languages. Broaden your repertoire.
  • Learn about various programming paradigms like imperative, object-oriented, functional etc.
  • Many modern languages don’t strictly fall under a single paradigm. Form a habit of recognizing which features adhere to which paradigms.
  • Embrace each programming language’s idiosyncrasies rather than fight them. If in doubt, always remember that people a lot smarter than you put hours into developing the language/API/library you are using.
  • Have a good understanding of the standard library of a programming language. It will prevent you from inefficiently duplicating functionality that was at your disposal from the get go.
  • In addition to the standard library, research some of the leading third-party libraries/APIs available. Someone has usually solved the problem before you. Your skill lies in finding that solution rather than duplicating it.
  • Start picking up on how to distinguish efficient from inefficient code. Efficiency can be defined in terms of performance, conciseness, resource consumption etc.
  • Teach yourself the principles of application security and secure coding. Not being a full-time software developer doesn’t absolve you from writing secure code.
  • Think beyond your immediate use case. Think of future use cases, or use cases by users other than yourself. “It suffices for my needs” is a narrow mindset.
  • Write less code and more comments. Think of that someone who has to read your code six months or a year from now. Even if that someone is you, I can tell you from experience that reading properly commented code can do wonders to lower your stress levels.
  • Be critical of your coding abilities rather than being confident about them. Let that imposter syndrome be your motivation to improve.
  • Automate your testing. Be it unit tests or integration tests, take out the human as much as possible from the equation.
  • Learn about software delivery pipelines. Continuous integration (CI), automated deployments, and DevOps are not just buzzwords. They play a critical part in your overall product development.
  • Familiarize yourself with distributed computing, cloud environments, virtualization and container technologies.

The last four are special and deserve to be separated out from the rest. Always follow them no matter how big or small your program/script is, and no matter how pressed for time you are. Excuse the shouting, because they are that important.
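A few of the bullets in the list above (know the standard library, write comments, automate your testing) are easier to appreciate with a small example. What follows is only a minimal Python sketch: the function `summarize` and its test are hypothetical and not taken from any real project. Rather than hand-rolling a sort-and-pick-the-middle routine, it leans on the standard library's `statistics` and `collections` modules, and it checks its own behaviour with plain assertions that a test runner such as pytest could pick up.

```python
import statistics
from collections import Counter


def summarize(values):
    """Return the median and the most frequent value of a non-empty list.

    Both calculations come from the standard library, so there is no
    hand-written sorting or counting logic to get subtly wrong.
    """
    if not values:
        raise ValueError("values must not be empty")
    median = statistics.median(values)
    # most_common(1) returns a list holding one (value, count) pair.
    most_common_value, _count = Counter(values).most_common(1)[0]
    return {"median": median, "mode": most_common_value}


def test_summarize():
    """Automated check: no human needs to eyeball the output."""
    result = summarize([3, 1, 2, 3])
    assert result["median"] == 2.5  # middle of sorted [1, 2, 3, 3]
    assert result["mode"] == 3      # 3 appears twice, the rest once

    # Empty input should fail loudly instead of returning nonsense.
    try:
        summarize([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")


test_summarize()
print(summarize([3, 1, 2, 3]))
```

The test is barely longer than the function itself, yet it documents the intended behaviour and will flag any later change that breaks it.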




Anything More?

Yes! A lot more. Over the course of this series I will expand on each of my bullet points in a separate blog post that dives deep into that point. In the meantime, feel free to look up software engineering and the controversy around it. If you have any comments to share, find me on Twitter (link at the bottom).