by Angela Guess
Philip Howard discusses some misconceptions about what constitutes Big Data in a recent article: “There is a lot of confusion around ‘big data’. People naturally assume that big data means lots of data. Which is true. But it isn’t any old data or, at least, it doesn’t have to be. The reason I bring this up is because just last week I heard about a company investigating the possibility of using Hadoop to store and support the analysis of several years’ worth of transactional data. Now, it is possible to think of reasons to use Hadoop for this purpose: it might be cheaper or you might prefer Java programmers to SQL developers but this is not the sort of environment where Hadoop would naturally spring to mind as an application. Moreover, I don’t care how large your organisation is, you won’t need huge quantities of disk for a few years of transactions. This isn’t, relatively speaking, ‘big’ data, it’s actually pretty small data but if you are used to storing only 3 months’ worth of data then maybe it looks big.”
Howard continues, “So we need to be clear about what we mean by big. Generally speaking we are at least talking about hundreds of terabytes and more often petabytes. The next thing to think about big data is what sort of data it is. Hadoop and MapReduce are particularly useful when it comes to analysing semi-structured and unstructured data, while traditional data warehouses, using traditional analytic techniques, are not. On the other hand, you can do things with structured data in a conventional data warehouse that would be much more difficult to do using MapReduce. So there is a good case for treating Hadoop and MapReduce on the one hand and data warehousing on the other, as complementary. However, if you are going to start looking at using Hadoop for transactional data then they become competitive, which is something else entirely.”
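To put Howard's sizing point in rough numbers, here is a back-of-the-envelope sketch. The transaction volume, record size, and retention period are hypothetical figures chosen only for illustration and do not come from his article. The 100 TB threshold is just one reading of his "hundreds of terabytes" remark.

```python
def estimated_size_tb(transactions_per_day: int,
                      bytes_per_transaction: int,
                      years: float) -> float:
    """Rough raw storage estimate in decimal terabytes (10**12 bytes)."""
    total_bytes = transactions_per_day * bytes_per_transaction * 365 * years
    return total_bytes / 10**12


# Hypothetical workload: 10 million transactions a day,
# about 1 KB per record, kept for 5 years.
size_tb = estimated_size_tb(
    transactions_per_day=10_000_000,
    bytes_per_transaction=1_000,
    years=5,
)

# Assumed lower bound for "big", based on "hundreds of terabytes".
BIG_DATA_THRESHOLD_TB = 100

print(f"Estimated size: {size_tb:.2f} TB")
print("Big data by this yardstick?", size_tb >= BIG_DATA_THRESHOLD_TB)
```

Under these assumed figures the raw total comes to about 18 TB, well short of the scale Howard describes. Indexes, replication, and wider records would raise the figure, so treat this as an order-of-magnitude check only.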
photo credit: Lauren Manning
















