Big data - proceed with caution
While the collection and analysis of big data hold great promise, quantity doesn't always translate into quality.
Quantity of data has two dimensions: the number of records and the number of variables. On the first, one can argue that good old statistical sampling techniques are still relevant: if the variability is captured in a random sample, there is very little incremental benefit, if any, in analysing all the rows.
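As a rough sketch of that argument – using synthetic numbers, not real data – a modest random sample already estimates the average about as well as the full table does:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical "full table": 10 million purchase amounts (illustrative only).
    full = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

    # A 1% simple random sample without replacement.
    sample = rng.choice(full, size=100_000, replace=False)

    print(f"full-data mean   : {full.mean():.2f}")
    print(f"sample mean (1%) : {sample.mean():.2f}")
    # Standard error of the sample mean: already small at this sample size.
    print(f"approx. std error: {sample.std(ddof=1) / np.sqrt(sample.size):.3f}")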
The second dimension of big data is the number of variables. A data set of only 10 variables with 10 distinct values each already gives potentially 10 billion pattern combinations, and as the number of variables increases, so does the potential for extracting spurious and inexplicable patterns and correlations.
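The arithmetic is simply k to the power of n, and a short sketch shows how quickly the pattern space explodes as variables are added:

    # Number of possible value combinations for n variables with k distinct values each.
    def pattern_space(n_variables: int, k_values: int = 10) -> int:
        return k_values ** n_variables

    for n in (5, 10, 15, 20):
        print(f"{n:>2} variables -> {pattern_space(n):,} combinations")
    # 10 variables with 10 values each -> 10,000,000,000 (the 10 billion mentioned above).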
The key question remains: how can analytics handle streaming data that keeps increasing in volume? Every method, technique or algorithm has an optimal point, after which come diminishing returns, a plateau and then degradation, while computational requirements continue to grow. Some suggest that algorithms need to be rewritten to push that optimal point further down the path towards data infinity.
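One hedged way to see where that optimal point sits is a learning curve – model accuracy against training-set size until it flattens. The sketch below assumes scikit-learn and a synthetic classification task, chosen purely for illustration:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    # Synthetic stand-in for a large, noisy data set (illustrative only).
    X, y = make_classification(n_samples=50_000, n_features=20, n_informative=5,
                               flip_y=0.1, random_state=0)

    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.02, 1.0, 8), cv=3)

    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:>6} rows -> validation accuracy {score:.3f}")
    # Accuracy typically flattens long before all rows are used,
    # while the cost of fitting keeps growing.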
What makes it more challenging is that big data is characterized more by variety and velocity than by sheer volume. Data doesn't only come in a standard structured format; it arrives in streams of free text, pictures, sounds and whatever else comes into play. And it comes with a high degree of variability, where formats within the stream can change even as the data are being captured.
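One hypothetical way of coping with that variety is to wrap every incoming item in a common envelope before any analysis happens; the field names below are assumptions, not a prescribed schema:

    from datetime import datetime, timezone

    # Wrap heterogeneous stream items in a common envelope so downstream
    # analytics can at least route them by type and capture time.
    def normalize(raw: dict) -> dict:
        kind = ("text" if "text" in raw
                else "image" if "image_bytes" in raw
                else "audio" if "audio_bytes" in raw
                else "unknown")
        return {
            "kind": kind,
            "received_at": datetime.now(timezone.utc).isoformat(),
            "payload": raw,                      # keep the original, whatever its shape
            "schema_hint": sorted(raw.keys()),   # formats may drift from record to record
        }

    print(normalize({"text": "great service!", "lang": "en"})["kind"])   # -> text
    print(normalize({"image_bytes": b"\x89PNG..."})["kind"])             # -> image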
All of this necessitates that analytical technologies be redesigned so that they can take advantage of massively parallel processing architectures, exploit heterogeneous data arriving at high volume and velocity, and still produce robust and accurate models. Some argue that traditional statistical methods, which often raise more questions than they answer, may not survive this data flood, and that new machine-learning methods are needed to deal with "big data noise" and see the "big picture around the corner".
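As a minimal, single-machine sketch of the split-the-work-and-combine idea behind parallel processing (not any particular vendor's architecture):

    from multiprocessing import Pool
    import numpy as np

    # Each worker computes partial statistics on its own chunk;
    # the partial results are then combined into a global answer.
    def partial_stats(chunk):
        return chunk.sum(), len(chunk)

    if __name__ == "__main__":
        data = np.random.default_rng(0).normal(size=4_000_000)
        chunks = np.array_split(data, 8)
        with Pool(processes=8) as pool:
            partials = pool.map(partial_stats, chunks)
        total, count = map(sum, zip(*partials))
        print(f"parallel mean: {total / count:.4f}")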
There is often the naïve assumption that analysis will happen on big data exactly as it has been collected. In most cases, only relevant subsets of the data will be needed for analysis; these will be integrated with other relevant data sources and most likely aggregated to allow for knowledge induction and generalization.
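A small, hypothetical pandas sketch of that subset-integrate-aggregate flow (the column names are invented for illustration):

    import pandas as pd

    # Hypothetical event log and reference table.
    events = pd.DataFrame({
        "customer_id": [1, 1, 2, 3, 3, 3],
        "channel":     ["web", "store", "web", "web", "web", "store"],
        "amount":      [20.0, 35.0, 12.5, 80.0, 15.0, 22.0],
    })
    segments = pd.DataFrame({"customer_id": [1, 2, 3],
                             "segment": ["retail", "retail", "premium"]})

    # Only the relevant subset (web events), integrated with another source,
    # then aggregated to a level suitable for generalization.
    web = events[events["channel"] == "web"]
    summary = (web.merge(segments, on="customer_id")
                  .groupby("segment")["amount"]
                  .agg(["count", "mean", "sum"]))
    print(summary)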
Another big challenge is data management. Data is collected at different points in time and from different locations – temporal and spatial variability – and arrives in different formats, often without adequate metadata describing who, what, when, how and from where. This can pose serious problems when contextualizing and acting on intelligence extracted from such data.
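One possible shape for such metadata – the field names are assumptions, not a standard – is a simple record carried alongside every captured item:

    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    # A minimal, hypothetical metadata envelope covering the "who, what, when,
    # how and from where" described above.
    @dataclass
    class RecordMetadata:
        who: str        # producing system or person
        what: str       # data set, sensor or feed name
        when: str       # capture timestamp (ISO 8601, UTC)
        how: str        # capture method or instrument
        where: str      # geographic or logical origin

    meta = RecordMetadata(
        who="pos-terminal-17", what="transaction-feed",
        when=datetime.now(timezone.utc).isoformat(),
        how="card-swipe", where="store-042")
    print(asdict(meta))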
Then there are issues around system and component design. Not all big data and business requirements are the same, so designers will need to consider carefully the functionality, conceptual model, organization and interfaces that meet the needs of end users. Answering the business question is more important than processing all the data, so it is important to know how much data is enough for a given set of business questions, since this will drive the design and architecture of the processing system.
Then there is the challenge of data ownership, privacy and security. Who owns Twitter or Facebook data – the service providers where the data is stored, or the account holders? There are serious attempts by researchers to develop algorithms that automatically randomize personal data within large data collections to mitigate privacy concerns. International Data Corporation has suggested five levels of increasing security: privacy, compliance-driven, custodial, confidential and lockdown. There is still work ahead to define these security levels with respect to analytical exploitation before any legislative measures are in place.
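The randomization algorithms themselves aren't specified here, but as an illustrative stand-in, noise can be added to a sensitive attribute so that individual values are masked while aggregates remain usable:

    import numpy as np

    rng = np.random.default_rng(7)

    # Laplace-noise perturbation in the spirit of differential privacy;
    # this is a sketch, not the specific algorithms referenced above.
    def randomize(values, sensitivity=1.0, epsilon=0.5):
        noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(values))
        return values + noise

    ages = np.array([34, 29, 61, 45, 52], dtype=float)
    print(randomize(ages))          # individual values are perturbed...
    print(randomize(ages).mean())   # ...while aggregates stay roughly informative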
One must not forget what the end goal is here: business value, and the advantage of being able to make decisions founded on big data analytics that were beyond reach before. The main challenge is to prioritize big data analytical engagements so that resources are spent on high-priority, high-value business questions. Successful completion of such complex big data analytics projects will require multiple experts from different domains and different locations to share data as well as analytical technologies, and to provide input and share in the exploration of results.
Big data analytic systems must therefore support collaboration in an equally big way! Lastly, the results of analytics must be interpretable to end users and relevant to the questions at hand; measures of relevance and interest are needed to rank and reduce the sea of patterns so that only relevant, non-trivial and potentially useful results are presented. And in presenting and disseminating those results, visualization plays a special role for both interpretation and collaboration.
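As a hypothetical example of such an interest measure, mined patterns can be ranked by lift and trivial ones filtered out before anything reaches the end user:

    # Hypothetical mined patterns with support, confidence and the base rate
    # of the consequent; lift is one common measure used to rank them.
    patterns = [
        {"rule": "bread -> butter", "support": 0.20, "confidence": 0.70, "base_rate": 0.60},
        {"rule": "nappies -> beer", "support": 0.02, "confidence": 0.45, "base_rate": 0.10},
        {"rule": "milk -> bread",   "support": 0.30, "confidence": 0.65, "base_rate": 0.62},
    ]

    for p in patterns:
        p["lift"] = p["confidence"] / p["base_rate"]

    # Keep only patterns that beat the base rate by a margin, most interesting first.
    ranked = sorted((p for p in patterns if p["lift"] > 1.2),
                    key=lambda p: p["lift"], reverse=True)
    for p in ranked:
        print(f'{p["rule"]:<18} lift={p["lift"]:.2f}')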
Not all of these challenges may be equally relevant in every situation, but it helps at least to be aware of them. There will be no U-turn on big data, or on big data analytics; the issue is how to address these challenges while keeping an eye on the ball, which is to ensure that big data technologies deliver on their promise of providing better answers, more quickly, to more complex questions.
Goran Dragosavac