Data lakes are an emerging concept in the world of ICT. Because they can accommodate both structured and unstructured data without fixed capacity limits, the concept is gaining immense popularity, especially among large enterprises that generate millions to trillions of data points every day. Data lakes allow enterprises to absorb and store these massive volumes of valuable business data in a highly scalable and efficient manner.

However, data lakes on their own are of little value to an enterprise. Data stored in an unstructured manner can lead to chaos, delayed decisions and generally inefficient processes. The main challenge of a simple data lake architecture is that raw data is stored with little to no visibility into the lake's contents. To meet the needs of users, data lakes must be equipped with defined mechanisms for processes such as data cataloging and data protection.

Without these elements in place, data cannot be easily found, nor can it be trusted to be accurate, resulting in 'swamp'-like data stores. This is known as data swamping and is one of the worst possible outcomes of a data lake implementation. To meet the needs of enterprise users, data lakes need additional controls to maintain governance, semantic consistency and access.
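
To make the cataloging point concrete, here is a minimal sketch in Python (a toy illustration with hypothetical dataset names and paths, not a real catalog service) of the kind of metadata registry that keeps raw files in a lake discoverable:

```python
# Minimal metadata catalog sketch: maps each raw dataset in the lake
# to the descriptive metadata that keeps it discoverable and trusted.
catalog = {}

def register_dataset(name, path, schema, owner):
    """Record a dataset's location, fields and steward in the catalog."""
    catalog[name] = {"path": path, "schema": schema, "owner": owner}

def find_datasets(field):
    """Discover every registered dataset that exposes a given field."""
    return [n for n, meta in catalog.items() if field in meta["schema"]]

register_dataset("pos_sales", "s3://lake/raw/pos/",
                 ["store_id", "sku", "amount"], "retail-team")
register_dataset("clickstream", "s3://lake/raw/web/",
                 ["session_id", "page", "ts"], "web-team")

print(find_datasets("store_id"))  # every dataset carrying a store identifier
```

Without such a registry, a user has no way to know which objects in the lake contain a given field, which is exactly how a lake drifts into a swamp.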

Here, we delve into some of the more common questions that are asked when enterprises debate the need for a big data solution such as a data lake.  

 

1. What is the justification for a ‘data lake’ project?

A recent study conducted by Aberdeen Group found that organisations that utilise data lakes outperform their competitors by 9% in organic revenue growth.

Data lakes help such companies store structured data, such as sales figures, alongside unstructured data, such as log files, IoT sensor data from stores, social media data and customer clickstream data, in order to identify and act upon potential business insights faster. This helps business owners boost productivity and increase customer retention.

 

2. What is the difference between a data warehouse and a data lake?

Data lakes are fundamentally different from data warehouses: a data lake ingests both structured and unstructured data, and is capable of storing data in its raw format.

For example, unstructured data could be a Twitter stream, which in its native form cannot be ingested by a data warehouse. Because they can ingest all forms of data, data lakes accommodate data of any size or format.

In comparison, a data warehouse is a database optimised to analyse relational data coming from transactional systems and line-of-business applications. Data warehouses follow a schema-first method, where the data schema is defined in advance. This facilitates optimisations for fast SQL queries, whose results are typically used for operational reporting and analysis. Data inside a data warehouse is cleaned, enriched and transformed to make it a 'single source of truth' that users can trust.
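
The contrast can be sketched in a few lines of Python (a toy illustration, using in-memory lists in place of real storage systems): the warehouse validates every record against a predefined schema before loading it, while the lake accepts any record as-is and defers interpretation until read time.

```python
# Schema-on-write vs schema-on-read, reduced to the bare mechanics.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}

warehouse, lake = [], []

def load_warehouse(record):
    """Schema-on-write: reject anything not matching the predefined schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("schema mismatch")
    # Coerce each field to its declared type before storing.
    warehouse.append({k: WAREHOUSE_SCHEMA[k](v) for k, v in record.items()})

def load_lake(record):
    """Schema-on-read: store the raw record untouched; interpret it later."""
    lake.append(record)

load_lake({"tweet": "great store!", "user": "@jane"})  # raw, unstructured
load_warehouse({"order_id": "42", "amount": "9.99"})   # validated, typed
```

The trade-off follows directly: the warehouse gives fast, trustworthy queries over a known shape, while the lake keeps everything, including data whose shape was unknown at ingestion time.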

 

3. Who are the users of a data lake?

Anyone in the organisation who is authorised to extract insights from the wealth of data that has accumulated can make use of the organisation's data lake.

 

4. I could use an alternative solution such as Dropbox and achieve the same results, right?

It's not as easy as it sounds. If storage is the only factor being evaluated, then yes, Dropbox may actually be a more cost-effective solution. However, in addition to data storage, a data lake can offer a number of built-in or third-party analytical tools. As a result, analysts can swiftly run through millions of data points to extract priceless business insights.

 

5. How much will a data lake cost me?

Data lakes are ideally suited to cloud computing, which offers advantages such as high performance, scalability, reliability, availability, diverse analytical engines and invaluable economies of scale. Data lakes that leverage AWS S3 as the underlying storage technology benefit from 99.999999999% (eleven nines) designed durability, at prices as competitive as $5 per terabyte.

 

6. Can you provide a typical real-life use case for a data lake?

With a fully fledged data lake solution and the relevant analytical tools in place, a business can dynamically predict performance indicators such as upcoming sales figures or even the company's share price. A data lake can gather data from sources such as point-of-sale (POS) systems, weather reports, social media and more, to infer foreseeable business outcomes.

Such an activity involves processing structured and unstructured data, and performing data transformations whose output can then be fed into an appropriate machine learning model to derive the desired predictions.
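
A toy sketch of that flow, using made-up daily figures and an ordinary least-squares line (fitted by hand in standard-library Python) standing in for a real machine learning model:

```python
# Structured POS sales totals and a weather-derived feature, keyed by date.
pos = {"2024-06-01": 1200.0, "2024-06-02": 1500.0, "2024-06-03": 900.0}
weather = {"2024-06-01": 21.0, "2024-06-02": 26.0, "2024-06-03": 16.0}

# Transformation step: align the two sources on date into (feature, target) pairs.
days = sorted(pos)
xs = [weather[d] for d in days]   # feature: temperature
ys = [pos[d] for d in days]       # target: daily sales

# Minimal "model": ordinary least-squares slope and intercept, computed by hand.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict_sales(temperature):
    """Forecast a day's sales from its expected temperature."""
    return slope * temperature + intercept

print(predict_sales(24.0))  # forecast for a 24-degree day
```

In a production pipeline, the alignment step would run over the lake's raw files at scale and the line fit would be replaced by a proper model, but the shape of the workflow is the same: join, transform, fit, predict.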

Bonus: Can you recommend a suitable data lake solution for my team?

Traditional data lake platforms are built on top of the open source Hadoop ecosystem. Mitra Innovation's Praedictio data lake platform tackles the inherent challenges of traditional data lake solutions by leveraging AWS cloud services to provide an inherently scalable and secure data lake. To learn more about Praedictio, download the whitepaper, 'Introducing Praedictio: an effective business predictions framework'.

 

Follow us as we explore the newest frontiers in ICT innovation and apply these technologies to solving real-world problems faced by enterprises, organisations and individuals. Thank you for reading! 🙂

Dushan Devinda

Intern - Research and Development | Mitra Innovation
