Dealing with Data Variety

Written by Bryan Smith

January 30, 2017
Why the third “V” of Big Data is driving huge investment in 2017 and unlocking access to external data.

Here at ThinkData, we pride ourselves on solving the “variety” issue associated with accessing external data. It’s truly the problem we built our company around solving. For those of you familiar with Gartner’s infamous “3 V’s of Big Data”, you will know that the idea dates back to 2001, when the analyst Doug Laney, then at META Group (which Gartner acquired in 2005), framed big data in terms of high volume, high velocity, and high variety.

The years following led to an explosion in activity focused on adopting the proper tools and processes needed to get a handle on Big Data.

Companies needed to start properly collecting and organizing their own data in order to start mining it for new insight and opportunity. Once caught up on the ability to store massive amounts of data, these organizations needed to learn how to manage the firehose of new data that was being pushed into their newly created lake of usable information. Ultimately, the “Volume” and “Velocity” sides of Big Data took the major focus of an organization’s Big Data efforts, and rightfully so. Organizations were taking stock of their inventory and learning how to better manage their newly discovered information supply chain.

As we fast forward to 2017, we’re seeing the focus shift from how to handle the size and speed of data to how to deal with the seemingly endless variety of data created by both internal and external sources. A lot of this focus has been driven by the undeniable value of leveraging external data, which by its nature comes in a variety of structures and formats.

We’re not the only ones seeing this trend. Dealing with the “Variety” of Big Data was recently highlighted in Tableau’s white paper on the “Top 10 Big Data Trends for 2017”.

In Tableau’s words, data formats are multiplying and connectors [pulling in multiple datasets in varying formats] are becoming crucial. This truly is one of the largest barriers facing those who are trying to get the most out of Big Data, and the problem is only amplified when you start looking outside of your environment to access third-party or publicly available external data.

Since the introduction of open data to the marketplace, there have been numerous predictions made regarding its social, economic, and political impact. In 2011, the European Commission projected that by opening up data the EU stood to have €40 billion per year injected into its economy. Two years later, McKinsey published a frequently referenced paper that valued open data at $3 trillion per year. More recently, Gartner estimated that within two years 80% of organizations would be consuming open data.

Although they vary in their conclusions, these predictions share similar characteristics. Firstly, they all recognize that open data represents a hugely untapped and incredibly valuable resource. Secondly, and more importantly, they rest on the underlying assumption that once open data is made publicly available, it automatically becomes accessible.

Available ≠ Accessible

Varying formats, structures, access requirements, and release schedules, along with a general lack of standards in even the most basic cases (“VARIETY”), make available external data wholly inaccessible. In order to unlock the value of external data, it is necessary to adopt a solution that can standardize and normalize any source of external data and present it in a way that matches an organization’s own internal standards.

So, how do we deal with the issue? At ThinkData, we believe that to manage external data and all of its variety, organizations will have to adopt a new data processing framework. This framework will need to manage several functions, among them:

FIND — A systematic approach to finding the right sources of external data, possibly thousands.

ORGANIZE — Record, maintain, and update the data. Verify and track metadata.

NORMALIZE — Standardize raw data feeds and generate common formatting.

TRANSFORM — Define ideal output. Transform the data to internal specifications before it is transported into the organization.

SHARE — Manage access enterprise-wide to unlock data value to the fullest extent.
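To make the NORMALIZE and TRANSFORM steps a little more concrete, here is a minimal Python sketch. It is an illustration only, not a description of ThinkData's product: the two feeds, their field names, the sample values, and the internal schema (city, population, updated) are all made up for the example. It shows two sources that describe the same kind of record in different formats being mapped onto one internal standard before anything reaches the organization's own systems.

```python
import csv
import io
import json
from datetime import datetime

# Two hypothetical external feeds describing the same kind of record.
# All names and values below are invented sample data.
CSV_FEED = 'City Name,Pop,Last Updated\n" springfield","1,000",30/01/2017\n'
JSON_FEED = '[{"municipality": "SHELBYVILLE", "residents": 2000, "asOf": "2017-01-30"}]'


def parse_csv(text):
    """Parse a CSV feed into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Parse a JSON feed that is already a list of objects."""
    return json.loads(text)


def normalize(record, field_map, date_format):
    """Rename source fields to the internal names and coerce their types.

    field_map maps internal field name -> source field name.
    """
    raw = {internal: record[source] for internal, source in field_map.items()}
    return {
        "city": raw["city"].strip().title(),
        # Accept both numbers and strings with thousands separators.
        "population": int(str(raw["population"]).replace(",", "")),
        # Emit every date as ISO 8601, whatever the source used.
        "updated": datetime.strptime(raw["updated"], date_format).date().isoformat(),
    }


# Per-source configuration: how to parse it and how its fields map inward.
SOURCES = [
    (CSV_FEED, parse_csv,
     {"city": "City Name", "population": "Pop", "updated": "Last Updated"},
     "%d/%m/%Y"),
    (JSON_FEED, parse_json,
     {"city": "municipality", "population": "residents", "updated": "asOf"},
     "%Y-%m-%d"),
]

records = [
    normalize(row, field_map, date_format)
    for feed, parser, field_map, date_format in SOURCES
    for row in parser(feed)
]

for record in records:
    print(record)
```

Keeping the per-source details (parser, field mapping, date format) in configuration rather than in the normalizing code is one possible design choice: under that assumption, adding a new external source means adding one more entry to the list rather than writing new processing logic.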

This system is not intended to displace traditional data infrastructure, but to integrate with it and keep it efficient. A solution such as this moves data processing tasks outside of an organization’s core internal infrastructure, so the organization does not waste storage, slow down performance, or spend resources on processing tasks that can take place outside of its environment. It will also help businesses avoid the inevitable scenario of bombarding their data lake with external data that is not yet ready to be leveraged, creating big messes that are not easily cleaned up (think oil spill).

Variety is inevitable when dealing with external data. But rather than tacking the problem onto existing Big Data infrastructure, take advantage of new technologies that will allow you to standardize data outside your existing infrastructure. This keeps your internal environment clean and organized, eliminating a lot of stress on data systems, scientists, and engineers. Through a solution like the one outlined above, any external data can be identified, transformed, and constantly monitored, made ready for access whenever your organization needs it. This is how the complex issue of variety in external data is boiled down to a manageable process.

2017 is definitely the year of solving the data variety issue and finally unlocking the true value external data holds for your organization.