Data lakes are the billion-dollar evolution of data warehousing, but they are not the panacea for data management headaches, says Shemas Eivers, chairman of Cork-based tech firm Client Solutions.
We are now creating more bytes worldwide daily than there are grains of sand, but any view of data being the “new oil” or some other such valuable resource misses the mark. Oil is finite. Exponential data growth is a certainty and is hurtling towards mind-boggling volumes.
The latest solution to the vast expansion is the concept of the data lake, touted as the evolution of data warehouses. Data warehouses are viewed as being too constrained and limited in their ability to consume data, while the capacity of great data lakes to absorb information is considered vastly superior. As fintech, artificial intelligence and cybersecurity advance at breakneck speed, having a lake to pour all our data into sounds like a great solution to the problem.
A data lake is where you pour large data volumes in unstructured or semi-structured form. A data warehouse is a well-structured data set that has been processed through validation layers, although, as with everything in IT, these boundaries are often blurred.
The giants of the industry are convinced. SAP, Microsoft, Oracle and IBM are among those investing heavily in data lakes. The data lakes market is forecast to grow at a compound rate of 35% per year until 2026, according to data analytics firm ‘The Research Insights’, and to expand in value from $2.53bn in 2016 to $8.81bn by 2021, according to research by ‘Markets and Markets’.
There you have it: a data lake is the solution to all your IT department’s issues. Let’s pour all the data we have into such lakes and let users at it as they please and everybody will be happy, right?
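The two forecasts come from different firms and cover different periods, so they need not agree. As a rough check, the compound annual growth rate implied by the second forecast can be worked out from its start and end values. The sketch below assumes the standard compound growth formula and a five-year span from 2016 to 2021; the function name is illustrative only.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start value, end value and span."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted in the text: $2.53bn in 2016 rising to $8.81bn by 2021.
rate = implied_cagr(2.53, 8.81, years=2021 - 2016)

print(f"Implied compound growth: {rate:.1%} per year")  # about 28% per year
```

On these figures the implied rate comes out somewhat below the 35% headline forecast, which is a small reminder that two numbers about "the same market" can rest on different definitions and time frames.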
Wrong, in my opinion. Lakes look calm, inviting and safe but many a horror film has lake scenes. Letting users swim in data lakes is fraught with risks. Just like real lakes, people can drown. They can feel very cold or insecure when they try to tread water while peering into the darker recesses.
Analytics is the current vogue message in many organisations, but an IT department may be abdicating its duty by using data lakes as the core solution. The desire should be to have the lake as a temporary puddle in most situations. The promise is that a user-friendly visualisation tool will seamlessly allow quick and easy reporting and analysis on the data in the lake. This may work for specialist users such as data scientists or for simple data sets.
The reality is that the average user, even working with the best tools, still needs to understand the data, because it is quite easy to get the wrong answers, and that is nearly worse than no answer. I believe that individual users will all too often drown in their data lake because the work has not been done to help them navigate it properly.
Real data is often complex, error-ridden, messy and not easily merged with different data sets, even when it relates to the same core data. If it were easy, then the data warehouse architect would have little to do. If the data hasn’t been validated and is not curated on an ongoing basis, then a fancy tool, just like new swimming trunks, will not automatically make you a strong swimmer.
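To make the point concrete, here is a minimal sketch of how the same raw data can give two different answers depending on whether anyone has curated it. The records, field names and cleaning rule are all hypothetical and chosen only for illustration; real curation involves far more than tidying up names.

```python
from collections import defaultdict

# Hypothetical raw records, as they might land in a lake from two systems.
raw_sales = [
    {"customer": "Acme Ltd", "amount": 100.0},
    {"customer": "acme ltd ", "amount": 250.0},  # same customer, keyed differently
    {"customer": "Bolt plc", "amount": 75.0},
]

def clean_name(name):
    """Illustrative curation rule: collapse whitespace and ignore case."""
    return " ".join(name.split()).lower()

def total_by_customer(records, normalise=None):
    """Sum amounts per customer, optionally normalising the customer key first."""
    totals = defaultdict(float)
    for record in records:
        key = record["customer"]
        if normalise is not None:
            key = normalise(key)
        totals[key] += record["amount"]
    return dict(totals)

naive = total_by_customer(raw_sales)                          # straight from the lake
curated = total_by_customer(raw_sales, normalise=clean_name)  # after basic curation

print(len(naive), "customers in the naive view")
print(len(curated), "customers in the curated view")
```

A user pointing a visualisation tool at the raw records would report three customers and understate the largest one; the curated view shows two. Neither tool nor user did anything wrong, but only one answer is right.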
We’ve been here before over the past 40 years, and each time the tools are flashier and the hype is greater, but the excuses this time are no different from before: data volume is so huge, needs are so diverse, and users still can’t define their requirements. This last excuse is incorrect, of course: the idea that requirements should lead the data warehouse design is often cited, but it is a myth. A properly designed data warehouse should NOT be based on user requirements but should simply allow flexible access to all the data in scope.
One can certainly use data lake technologies and visualisation tools, but you still need foundations. Applying some preparation and organisation to the data is far better for everyone than just letting people work willy-nilly at the raw data, producing different answers because of different interpretations or a failure to recognise or address data issues appropriately.
IT within an organisation has a responsibility to help that process as much as it can; we are supposed to be the professionals. You wouldn’t let any user produce your accounts or your tax returns just because you’ve given them a fancy tool.
If swimming in a data lake is part of the solution, then data lifeguards and data swimming instructors are the very minimum requirements we must think about before we let users dip their toes in the water.
- Shemas Eivers is chairman and co-founder of Client Solutions, which partners with some of the largest and most innovative technology companies in the world, including Microsoft, SAP, BMC Remedy and MicroStrategy.