Data Debt

Thu 13 February 2014

There have been some conversations about technical debt lately. Perhaps it was sparked by this presentation by Joshua Slayton of AngelList where he suggests that accumulating technical debt is okay since it gives you speed in return (as long as you pay it off later). It got a lot of reactions on Twitter, this being my favorite:

While I haven't yet made up my mind on this, I believe there's another kind of debt that companies should pay attention to these days: data debt.

With an ever-increasing emphasis on storing and measuring everything, it is easy to cut corners in the name of expediency. Moreover, the new breed of NoSQL databases makes it easy to just dump the data into storage without imposing any logical structure on the underlying information. While this is not wrong per se, letting this fester unchecked can, in the long run and in the worst case, make analysis of this data cumbersome to the point of futility. And if you're running a data business, you have no business accumulating any of this debt!

Here are a few rules of thumb to follow to avoid this debt or pay it down faster:

  • If at all possible, impose structure on the collected data. Don't put it off out of laziness.
  • Use a consistent naming convention for databases, documents and collections.
  • Keep a master list of all data sources, along with information on frequency of collection, notes on intended use (if known), dependencies on other data, etc.
  • Monitor everything. It sucks when a process fails and you only find out weeks later.
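
To make these rules a little more concrete, here is a minimal Python sketch of what a master list with basic checks might look like. Everything in it is an assumption made for illustration: the `DataSource` fields, the snake_case naming pattern, the 1.5x grace factor for staleness, and the example source names and dates are all hypothetical, not a prescribed design.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical convention: lowercase snake_case, e.g. "web_clickstream_daily".
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


@dataclass
class DataSource:
    """One entry in the master list of data sources."""
    name: str
    frequency: timedelta                      # how often new data is expected
    intended_use: str = "unknown"
    depends_on: List[str] = field(default_factory=list)
    last_collected: Optional[datetime] = None


def schema_errors(record, schema):
    """Check a record against a {field_name: type} schema; return a list of problems."""
    errors = []
    for key, expected in schema.items():
        if key not in record:
            errors.append(f"missing field: {key}")
        elif not isinstance(record[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(record[key]).__name__}"
            )
    return errors


def naming_violations(sources):
    """Return the names that do not follow the naming convention."""
    return [s.name for s in sources if not NAME_PATTERN.match(s.name)]


def stale_sources(sources, now, grace=1.5):
    """Return names of sources never collected, or overdue by more than grace * frequency."""
    return [
        s.name
        for s in sources
        if s.last_collected is None or now - s.last_collected > s.frequency * grace
    ]


# Example registry with made-up sources and timestamps.
now = datetime(2014, 2, 13, 12, 0)
registry = [
    DataSource("web_clickstream_daily", timedelta(days=1), "funnel analysis",
               last_collected=datetime(2014, 2, 13, 1, 0)),
    DataSource("TwitterMentions", timedelta(hours=1),
               depends_on=["web_clickstream_daily"],
               last_collected=datetime(2014, 1, 20, 9, 0)),
]

print(naming_violations(registry))   # ['TwitterMentions']
print(stale_sources(registry, now))  # ['TwitterMentions']
print(schema_errors({"user_id": "42"}, {"user_id": int, "event": str}))
```

Run on a schedule, a check like `stale_sources` would flag a failed collection process within hours instead of weeks. In a real setup the registry would more likely live in a shared file or table than in code, but the idea is the same.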

What are some other rules of thumb that you follow? Would love to hear your thoughts on this.