Atanu Biswas and Bimal Roy
(Professors at the Indian Statistical Institute, Kolkata)
In the era of Big Data, we are like the blind men of the parable standing in front of an elephant, trying to conceptualize the big animal from glimpses alone. Unless we understand the full nature of Big Data, we each feel only one part of the elephant's body and conclude that the elephant is like a winnowing basket (ear), a plowshare (tusk), a plow (trunk), or a pillar (foot), and so on.
It’s not easy to see why the fancy analytics tools we use for small datasets cannot simply be replicated when our data grows. For example, to find the simple average of ‘n’ numbers, we add them and divide the sum by ‘n’. On the face of it, the approach is the same whether ‘n’ is 100 or 100 billion. However, if the numbers are large and positive, the sum of 100 billion of them might overflow the computer’s arithmetic. So we need to adjust the algorithm appropriately to find the average. That’s the extra bit of cosmetic surgery needed for handling Big Data.
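One simple adjustment is to update a running mean rather than accumulate a raw sum, so no intermediate quantity ever grows beyond the scale of the data itself. A minimal sketch in Python (the function name is ours):

```python
def running_mean(numbers):
    """Compute the average incrementally, never holding the full sum:
    after the k-th number, mean = mean + (x_k - mean) / k."""
    mean = 0.0
    for k, x in enumerate(numbers, start=1):
        mean += (x - mean) / k
    return mean

print(running_mean([10, 20, 30, 40]))  # 25.0
```

The intermediate value stays near the magnitude of the data, whether we feed it 100 numbers or 100 billion.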
Consider the example of simple multiplication. We learned the multiplication tables in our elementary classes, and they are stored in our memory. However, we need additional techniques for multiplying two big numbers, say with hundreds of digits. Using the tables from memory, we multiply the first number by every digit of the second, one by one, starting from the unit place. The result for each digit of the second number is written in its own row: we multiply each digit of the first number, starting from the unit digit, and use the ‘carrying’ method to pass any residual to the next digit. For each subsequent digit of the second number, we move to a new row, shifted one place to the left. Finally, we add all the rows. This algorithm for multiplication is a derivative of the knowledge of tables, combined with some special techniques. It can be interpreted as a Big Data problem: special techniques, on top of the standard multiplication tables for single digits, are needed to solve it.
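The steps above translate directly into code. The sketch below is a plain schoolbook implementation that uses only single-digit products and carrying, just as described:

```python
def multiply(a: str, b: str) -> str:
    """Schoolbook long multiplication of two non-negative integers given
    as decimal strings, built only from single-digit 'table' products."""
    result = [0] * (len(a) + len(b))
    # Walk both numbers from the unit digit leftwards.
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10   # keep one digit in place
            carry = total // 10          # carry the residual leftwards
        result[i + len(b)] += carry      # final carry of this row
    digits = ''.join(map(str, reversed(result))).lstrip('0')
    return digits or '0'

print(multiply("123456789", "987654321"))  # 121932631112635269
```

Each pass of the inner loop is one row of the hand calculation; the `result` array plays the role of adding the rows at the end.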
Consider another simple mathematical problem, that of sorting. Suppose we are to sort 5 numbers in increasing order. In our elementary classes, we could easily sort them just by looking at the numbers; certainly, some algorithm runs within our brain to sort them manually. A smarter student might at once sort 10 such numbers, maybe 20. But, certainly, we cannot sort 100 numbers, or say 100,000 numbers, just by looking at them. We need some algorithm on top of our unexplainable internal one, which we learn in higher classes. This is another example of a Big Data problem that we have been tackling for years.
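One such algorithm from higher classes is merge sort: split the list in half, sort each half, and merge the two sorted halves. This is one of several standard choices, sketched here minimally:

```python
def merge_sort(nums):
    """Sort by recursively sorting halves and merging them.
    The same divide-and-merge idea scales far beyond what we can eyeball."""
    if len(nums) <= 1:
        return list(nums)
    mid = len(nums) // 2
    left = merge_sort(nums[:mid])
    right = merge_sort(nums[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([42, 7, 19, 3, 58]))  # [3, 7, 19, 42, 58]
```

Notably, merging also works when the two halves live in separate files on disk, which is why this family of algorithms reappears in Big Data settings.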
Data analytics mostly comprises statistical methodologies like regression analyses, classification and clustering techniques, some standard estimation and testing procedures, etc. While most of these theories are neatly developed in the statistical literature and easily applied to small or moderate datasets, one might need to manipulate them intelligently and devise novel techniques for unusual formats of data. But the real challenge, even for standard ready-to-use techniques, lies in the limitations of using loads of data with a huge number of variables. One reason is the presence of ‘spurious’ or ‘nonsense’ correlations among different variables: the more variables we handle, the more such correlations we face. And unless we can identify and carefully separate out the unimportant variables, and learn to handle lots of spurious correlations, we cannot aspire to meaningful analyses of the data. That is a daunting task, and theoretically challenging too. In addition, even a standard regression analysis with loads of data and, say, 10,000 variables needs additional computational techniques. For example, we might need to invert a 10,000×10,000 matrix, which becomes intractable.
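How easily spurious correlations arise can be seen in a small simulation: generate a target and many candidate variables as pure noise, and some candidate will still appear strongly correlated with the target by chance alone. The sizes below are illustrative, and the function names are ours:

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def max_spurious_correlation(n_obs=50, n_vars=1000, seed=1):
    """Largest |correlation| between a random target and n_vars
    independent random variables -- all 'nonsense' by construction."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    return max(
        abs(pearson([rng.gauss(0, 1) for _ in range(n_obs)], y))
        for _ in range(n_vars)
    )

print(max_spurious_correlation())  # typically well above 0.3
```

With 50 observations and 1,000 noise variables, the best-looking correlation is substantial despite there being no relationship at all, which is exactly why screening variables naively is dangerous.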
Then, what about the ocean of data? With the advent of computers and the internet, and with virtually everything connected through the ‘Internet of Things’, a gigantic amount of data is generated continuously, and the ever-expanding horizon of ‘Data’ is growing faster than ever. An IBM report from mid-2017 stated that 2.5 quintillion bytes of data are created per day, and according to a Forbes article (2015), by 2020 about 1.7 megabytes of new information per second was expected to be created for every human being on the planet. Such Big Data is a boon and a curse at the same time. Are we really capable of leveraging it? With present expertise, the answer is a blunt ‘no’. A computer cannot handle more data than fits in its RAM at one time, be it 4 GB or 64 GB. Thus, we need to devise statistical techniques that accommodate this huge data in a stage-by-stage way – exactly as the additional techniques of ‘carrying’ and writing line by line are deployed for multiplying two big numbers, on top of the a priori knowledge of multiplying two single digits. And only top statisticians collaborating with excellent computational experts might produce such techniques, and then only case by case.
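The stage-by-stage idea can be sketched for the earlier averaging example: summarize each RAM-sized chunk by a small pair of numbers, then combine the summaries, so the full dataset never needs to be in memory at once. The helper names below are ours:

```python
def chunk_summary(chunk):
    """Per-chunk summary small enough to keep: (count, chunk mean)."""
    n = len(chunk)
    return n, sum(chunk) / n

def combine_means(summaries):
    """Merge per-chunk (count, mean) pairs into the overall mean,
    weighting each chunk by its size -- a 'carrying' step for statistics."""
    total_n, mean = 0, 0.0
    for n, m in summaries:
        total_n += n
        mean += (m - mean) * n / total_n
    return mean

chunks = [[1, 2, 3], [4, 5], [6]]  # stand-ins for RAM-sized pieces
print(combine_means(chunk_summary(c) for c in chunks))  # 3.5
```

Each chunk could live on a different disk or machine; only the tiny summaries travel, which is the essence of the map-reduce style of computation.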
The quest for understanding the whole elephant would continue.