|
Techniques for getting around this limitation borrow heavily from peer-to-peer (P2P) thinking, using methods like consistent hashing (mapping a data identifier to an address or location so that the answer is the same regardless of which node is asked). These methods have not yet fully penetrated the world of Hadoop and related technologies, precisely because they solve a problem that is only now reaching the threshold of interest. After all, if the big data cluster is a homogeneous collection of compute nodes, the worst that can happen is that a few fail here and there. For a larger and more heterogeneous computing and storage ecosystem, those assumptions start to fray. Projects like Voldemort and related distributed databases borrow from P2P methods to reduce the impact of failures. Here at Kitenga, current research interests include the idea that nodes may enter and leave the storage fabric at any time due to a range of factors, including geographical and environmental constraints, and that their failure modes need to be managed to maximize their reliable contribution to the data solution. Moreover, the data from these sources can't be treated like traditional relational tables. Instead, complex geospatial query operations need to run efficiently over the distributed nodes, delivering noise-free query results while reducing failures. |
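
To make the consistent hashing idea concrete, here is a minimal sketch in Python. It is not how Voldemort, Hadoop, or Kitenga implement it; the `HashRing` class, the node names, the `tile-N` keys, and the choice of MD5 with 64 virtual positions per node are all assumptions made for illustration. Each node is hashed onto a ring at several positions, and a key belongs to the first node position at or after the key's own hash.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas  # virtual positions per node (assumed value)
        self._keys = []           # sorted ring positions
        self._owners = {}         # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread values over the ring, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        """Place a node on the ring at `replicas` pseudo-random positions."""
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:  # extremely unlikely collision: skip it
                continue
            bisect.insort(self._keys, pos)
            self._owners[pos] = node

    def remove(self, node):
        """Drop every ring position owned by the departing node."""
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                self._keys.remove(pos)

    def lookup(self, key):
        """Return the node owning the first position at or after hash(key)."""
        if not self._keys:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]


# Hypothetical node names and keys, purely for demonstration.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"tile-{i}" for i in range(1000)]

before = {k: ring.lookup(k) for k in keys}
ring.remove("node-b")  # a node departs the storage fabric
after = {k: ring.lookup(k) for k in keys}

# Keys that changed owner; each of them previously lived on node-b.
moved = [k for k in keys if before[k] != after[k]]

# A second ring built independently with the same membership gives the same
# answers, which is the "regardless of which node is asked" property.
other = HashRing(["node-c", "node-a"])
same_answers = all(other.lookup(k) == after[k] for k in keys)
```

The point of the design is that a node's departure only reassigns the keys that node held; everything else stays where it was, which is why the approach suits a fabric where nodes come and go. A real system would also replicate each key to several successors on the ring so that a departure does not lose data.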