Big data - proceed with caution
While the collection and analysis of big data hold great promise, quantity doesn't always translate to quality. The quantity of data has two dimensions: the number of records and the number of variables. On the first dimension, one can argue that good old statistical sampling techniques are still relevant. If the variability in the data is captured by a random sample, there will be very little incremental benefit, if any, from running the analysis on all rows.
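The sampling argument can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic normally distributed "population", the sizes, and the hypothetical helper name compare_sample_to_population. It only shows that summary statistics from a modest random sample land close to those computed on every row.

```python
import random
import statistics


def compare_sample_to_population(population_size=200_000, sample_size=5_000, seed=42):
    """Compare summary statistics of a simple random sample with the full data.

    The population is synthetic (normal, mean 100, standard deviation 15);
    these values are illustrative assumptions, not figures from the article.
    """
    rng = random.Random(seed)
    population = [rng.gauss(100.0, 15.0) for _ in range(population_size)]

    # Simple random sample without replacement.
    sample = rng.sample(population, sample_size)

    return {
        "population_mean": statistics.fmean(population),
        "sample_mean": statistics.fmean(sample),
        "population_stdev": statistics.pstdev(population),
        "sample_stdev": statistics.stdev(sample),
    }


result = compare_sample_to_population()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The sample here is 2.5% of the rows. The caveat is the one the text already makes: this holds only when the sample captures the variability in the data, so rare events or small subgroups may need stratified or larger samples.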
The second dimension of big data is the number of variables. A data set of only 10 variables, each with 10 distinct values, yields potentially 10 billion pattern combinations (10 to the power of 10). As the number of variables increases, so does the potential for extracting spurious and inexplicable patterns and correlations.
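The arithmetic and the risk of spurious correlations can both be checked with a short sketch. The data below are pure random noise, so any strong correlation found is spurious by construction. The row count, the 0.36 threshold (roughly the 5% significance cutoff for 30 rows), and the helper names pearson and count_spurious_pairs are assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_variables, n_rows=30, threshold=0.36, seed=7):
    """Count variable pairs whose |correlation| exceeds the threshold.

    Every column is independent random noise, so each pair counted here
    is a false pattern.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)]
               for _ in range(n_variables)]
    count = 0
    for i in range(n_variables):
        for j in range(i + 1, n_variables):
            if abs(pearson(columns[i], columns[j])) > threshold:
                count += 1
    return count


# 10 variables with 10 distinct values each.
pattern_combinations = 10 ** 10
print(f"pattern combinations: {pattern_combinations:,}")

few = count_spurious_pairs(10)    # 45 candidate pairs
many = count_spurious_pairs(50)   # 1,225 candidate pairs
print(f"spurious pairs with 10 variables: {few}")
print(f"spurious pairs with 50 variables: {many}")
```

The number of candidate pairs grows quadratically with the number of variables, so the number of chance "discoveries" grows with it even though no real relationship exists anywhere in the data.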
The key question remains: how can analytics handle streaming data that keeps increasing in volume? Each method, technique or algorithm has an optimal point, after which there are diminishing returns, then a plateau and then degradation, while computational requirements continue to grow. Some suggest that algorithms need to be rewritten to move the optimal point further down the path toward data infinity.
What makes this more challenging is that big data is characterized more by data variety and velocity than by sheer volume. Data doesn't only come in a standard structured format; it also arrives as a stream of free text, pictures, sounds and whatever else may come into play. And it comes with a high degree of variability, where formats within the stream can change as the data are captured.
All this requires analytical technologies to be redesigned so that they can take advantage of massively parallel processing architectures, exploit heterogeneous data at high volume and velocity, and still produce robust and accurate models. Some argue that traditional statistical methods, which raise more questions than they answer, may not survive in this data flood era, and that new machine-learning methods are needed to deal with "big data noise" and see the "big picture around the corner".
There is often the naïve assumption that analysis will happen on this big data exactly as it was collected. In most cases, only relevant subsets of the data will be needed for analysis. These subsets will be integrated with other relevant data sources and most likely aggregated to allow for knowledge induction and generalization.
Data management is another big challenge. Data is collected at different points in time and from different locations (temporal and spatial variability), and it comes in different formats without adequate metadata describing who, what, when, how and from where. This can pose serious problems when contextualizing and acting on intelligence extracted from such data.
And then there are issues around system and component design. Since not all big data and not all business requirements are the same, designers will need to carefully consider the functionality, conceptual model, organization and interfaces that would meet the needs of end-users. Answering the business question is more important than processing all the data, so it is important to know how much data is enough for a given set of business questions, since this will drive the design and architecture of the processing system.
Then there is the challenge of data ownership, privacy and security. Who owns Twitter or Facebook data: the service providers that store the data, or the account holders? There are serious attempts by researchers to develop algorithms that automatically randomize personal data within large data collections to mitigate privacy concerns. International Data Corporation has suggested five levels of increasing security: privacy, compliance-driven, custodial, confidential and lockdown. There is still work ahead to define these security levels with respect to analytical exploitation before any legislative measures are in place.
One must not forget what the end goal is here. It is about business value and the advantage of being able to make decisions, founded on big data analytics, that were beyond reach before. The main challenge here is to prioritize big data analytical engagements so that resources are used on high-priority, high-value business questions. Successful completion of such complex big data analytics projects will require multiple experts from different domains and different locations to share data and analytical technologies, to provide input and to explore the results together.
Therefore, big data analytic systems must support collaboration in an equally big way! Lastly, the results of analytics must be interpretable to end-users and relevant to the questions at hand. Measures of relevance and interest are needed to rank and reduce the sea of patterns so that only relevant, non-trivial and potentially useful results are presented. And in presenting and disseminating the results of analytics, the method of visualization plays a special role in both interpretation and collaboration.
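As one possible illustration of ranking and reducing patterns by a measure of interest, the sketch below scores co-occurring item pairs by lift and drops pairs below a minimum support. The choice of lift and support, the toy baskets, and the helper name rank_pairs_by_lift are assumptions made here; the text above does not prescribe any particular measure.

```python
from itertools import combinations


def rank_pairs_by_lift(transactions, min_support=0.2):
    """Rank item pairs by lift, keeping only pairs above a minimum support.

    support(a, b) = share of transactions containing both items
    lift(a, b)    = support(a, b) / (support(a) * support(b))
    A lift above 1 means the items co-occur more often than chance predicts.
    """
    n = len(transactions)
    item_counts = {}
    pair_counts = {}
    for basket in transactions:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    ranked = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # too rare to be worth presenting
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        ranked.append(((a, b), support, lift))

    # Most interesting patterns first.
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


# Toy data, invented for this example.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
    ["milk"],
]
top_patterns = rank_pairs_by_lift(baskets, min_support=0.3)
for pair, support, lift in top_patterns:
    print(pair, f"support={support:.2f}", f"lift={lift:.2f}")
```

Of the six pairs that occur in the toy data, only two pass the support filter, and the ranking puts the pair that co-occurs more often than chance at the top. A real system would combine such scores with domain relevance to the business question.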
Not all of these challenges will be equally relevant in every situation, but it is helpful to be aware of them. There will be no U-turn on big data or on big data analytics. The issue is how to address these challenges while keeping an eye on the ball, which is to ensure that big data technologies deliver on their promise of providing better answers, quicker, to more complex questions.