The Exponential Time Value of Big Data

Kitenga worked on big data problems throughout 2008 and into 2009 with a bioinformatics company that builds amazing products around a complex biological ontology. Their ontology is carefully maintained and curated by teams of PhD biologists who read journal articles and translate the core findings of experimental research into a computationally useful hierarchy of relationships.

Our work with them was to design a new search technology that leveraged the ontology to make it easier for researchers to find and use journal articles. The problem space began with around 17 million abstracts, each about 1 KB in size, but was complicated by the extensive markup that resulted from applying the ontology to cell lines, proteins, biological processes, DNA sequences, and a range of other features of the text. The team was moving fast to add and test features like author disambiguation, related document discovery, and efficient user interface elements for slicing and dicing search result sets. UI features were the primary milestones, scheduled in two-week development cycles following agile practices.

And something interesting happened that we've seen before: the indexing time (we were using a customized Solr installation) began to interfere with our ability to do agile development work. Indexing times stretched out to more than four days, running day and night on 16 GB Linux machines with multi-core processors. We would do initial dev work on slices of the document collection, of course, but even reasonable slices could take 30 minutes to complete, given load times for ontologies, construction of finite state taggers, and other features of the indexing pipeline. And there is nothing more heartbreaking than running an indexing job for four days and having it crash due to a network failure or a resource emergency from another production or development group.

I call this effect the time value of big data (borrowing from the "time value of money"), and it is easy to underestimate how much it affects problem solving as problems scale up (or even scale out). There is often an exponential relationship between time and data size as the problems get bigger. Failures take weeks to reproduce, debug, and analyze. Days are spent re-engineering logging strategies to work out which piece of anomalous data broke the system. Pagers go off in the night as systems fail. RDBMS queries spend days processing as materialized tables spill out to swap disk. Disk runs out.
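One way to see how steeply time grows with data size is to time a few slices of increasing size and fit a curve to the measurements before committing to a full run. The sketch below is only an illustration under stated assumptions: it fits a power law (time = a * size^b), which is a modeling choice of this example rather than something measured on the project described here, and the slice sizes and timings are made-up placeholder numbers. The function names are hypothetical. An exponent b noticeably above 1 suggests that doubling the data will more than double the run time.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of time = a * size**b in log-log space.

    Returns the pair (a, b). Requires at least two distinct sizes.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the log-log line
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from log space
    return a, b


def predict_time(a, b, size):
    """Extrapolate the fitted curve to a new collection size."""
    return a * size ** b


if __name__ == "__main__":
    # Hypothetical slice timings (documents, minutes); not real measurements.
    slice_sizes = [50_000, 200_000, 800_000]
    slice_minutes = [30.0, 150.0, 760.0]

    a, b = fit_power_law(slice_sizes, slice_minutes)
    full_minutes = predict_time(a, b, 17_000_000)  # collection size from the text
    print(f"fitted exponent b = {b:.2f}")
    print(f"projected full run = {full_minutes / 60 / 24:.1f} days")
```

An extrapolation like this is rough, since fixed startup costs (such as loading ontologies) and resource limits (such as spilling to swap) bend the curve in ways a single power law does not capture, but it can flag a multi-day run while there is still time to change the design.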

Effective planning can reduce the impact by building scalability into the architecture from the outset, flattening the time value curve for big data problems, but expertise and intuition are often trumped by the need to get to a prototype. That's why I recommend taking a step back once the first prototype is completed and assessed, looking at the scalability issues, and making decisions early enough that you are not locked into a path that carries you forward with steadily diminishing returns on your development effort.