After some intense analysis, I’ve come to the conclusion that the data lake concept is only marginally beneficial to a business. Specifically, it benefits a business that needs to collect clickstream data or track web information, and only if that business is very high volume.
- While the data lake concept is much hyped and marketed, it is simply a low-cost technology for storing raw (and I repeat, raw) data using 1980s techniques and then analyzing it with an arcane, functionally limited language like “R”.
My primary concern is always to organize and process data based on solid data quality principles. As such, a data lake (an HDFS or Hadoop cluster) can serve only as an initial landing area, as opposed to a staging area; appropriate data quality processes still need to be applied afterwards in a traditional fashion.
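To make the landing-vs-staging distinction concrete, here is a minimal sketch of the kind of quality gate I have in mind, in Python with pandas. The column names, rules, and the `promote_to_staging` function are hypothetical illustrations, not part of any particular platform: raw clickstream rows land as-is, and only rows that pass basic quality rules are promoted to staging.

```python
import pandas as pd

# Hypothetical raw clickstream records as they might land in the lake:
# untyped strings, with duplicates and missing values left intact.
raw = pd.DataFrame({
    "user_id": ["u1", "u2", "u2", None, "u3"],
    "page": ["/home", "/cart", "/cart", "/home", "/buy"],
    "ts": ["2024-01-01 10:00", "2024-01-01 10:05",
           "2024-01-01 10:05", "2024-01-01 10:07", "bad-timestamp"],
})

def promote_to_staging(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic data quality rules before data leaves the landing area."""
    out = df.copy()
    # Rule 1: every row must identify a user.
    out = out.dropna(subset=["user_id"])
    # Rule 2: exact duplicate rows are collapsed.
    out = out.drop_duplicates()
    # Rule 3: timestamps must parse; unparseable rows are rejected.
    out["ts"] = pd.to_datetime(out["ts"], errors="coerce")
    out = out.dropna(subset=["ts"])
    return out.reset_index(drop=True)

staged = promote_to_staging(raw)
print(len(staged))  # → 2 rows survive the quality gate
```

The point of the sketch is that the cleansing logic lives outside the lake: the landing area keeps everything, and the rules decide what is trustworthy enough to analyze.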
I believe a business may make a major mistake by simply pursuing the implementation of a data lake without fully realizing its place in the overall information architecture needed to provide the business with accurate answers to business questions.
Furthermore, considering that Hadoop (a.k.a. the data lake) was designed as a low-cost method of collecting and storing vast amounts of data, expensive hardware should definitely be avoided.
As I said, before you embark on a “big data” Hadoop or data lake project, be sure you understand the implications and the data quality and cleansing work that will be required to make the information analytically acceptable and actionable.
You might also want to consider that Google abandoned MapReduce and Hadoop “years ago”:
Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System