Delta Lake – Deliver reliability and high performance on your Data Lake

In my earlier blog, I discussed the need for and features of the Lake House Architecture.  Delta Lake is an open-source project that aims to implement the Lake House architecture with pluggable components for storage and compute.

In modern data architecture, at least three types of systems are used in combination: the Data Warehouse, the Data Lake and the Streaming system.  This combination demands a large ecosystem, from ingestion services such as Amazon Kinesis, Apache Kafka and Azure Event/IoT Hub, through distributed computing engines like Hadoop, Spark, Flink and Storm, down to Data Lake sinks like HDFS, S3, ADLS and GCS, or even queryable systems like Data Warehouses, Databases, NoSQL stores and so on.

Unfortunately, Data Lakes on their own are weak in the performance and quality guarantees required to support high-end business processes.  As a result, until recently the only alternative that offered performance, security, scalability and concurrency was the Data Warehouse.  But Data Warehouse solutions bring unwanted baggage with them, such as cost and the inability to deal with diverse data formats.

Delta Lake can be seen as a boon: a solution to the problems the industry faces in becoming truly ‘Data-Driven Decision Making’ enterprises.

The Delta Lake:

It is an open-source project that allows building the Lake House Architecture on top of Data Lakes like HDFS, S3, ADLS and GCS, using compute engines like Spark, Flink, PrestoDB and Trino, through APIs for languages such as Java, Python, Scala, SQL, Rust, Ruby and so on.

At the heart of Delta Lake, data is represented as a table.  There can be multiple tables representing different kinds of data.  These tables differ from database tables in that they can accommodate streaming data on the fly, and operations on them may change their results as fresh data is pumped in.  The key point to notice is the near real-time performance of these operations, as the tables and their data are managed diligently by the compute engine despite the big data volumes.
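
To make this concrete, here is a minimal quickstart sketch using PySpark and the open-source delta-spark package (installed via pip); the table path and column names are purely illustrative assumptions.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Build a SparkSession with the Delta Lake extension and catalog enabled.
builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table (Parquet data files + transaction log).
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back like any other Spark source.
spark.read.format("delta").load("/tmp/delta/events").show()
```

The write produces ordinary Parquet data files plus a _delta_log directory holding the transaction log, which is what turns the folder into a Delta table.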

Delta Lake offers a list of key features, discussed below…

  • ACID Transactions: A typical data-processing lifecycle has multiple data pipelines ingesting data from one or more sources, applying different transformations concurrently and writing the processed data to data sinks concurrently (see the Lambda Architecture).  This sounds easy, but realizing it is challenging: implementation and maintenance are tedious, and ensuring data integrity is even harder because of the lack of transaction management.  Delta Lake brings ACID transactions with the strongest isolation level to address this concern seamlessly.  That’s great!
  • Open Formats: The Apache Parquet format has a unique native proposition of efficient compression and encoding schemes. It is an open format supported by most Big Data solutions.  Delta Lake uses it as the native format to store and work with all the data. Its columnar structure multiplies performance when dealing with Big Data analytics.
  • Schema Enforcement and Evolution: Some data systems validate the schema only while reading, not while writing, which lets corrupt data enter the system; if this goes unnoticed until irreversible damage is done, the system becomes vulnerable to inconsistency.  Schema enforcement in Delta Lake safeguards data quality by rejecting writes of corrupt or mis-shaped data. The schema represents the column names, types and constraints; only data abiding by the schema enters the system, and anything else is rejected outright.  It also assures that the schema does not change until you make the affirmative choice to change it, thereby preventing data dilution.  Schema evolution, in turn, lets you easily change the current schema of a table to accommodate data that changes over time; it can be made to adopt new columns automatically while appending or overwriting data (see the first sketch after this list).
  • Delta Sharing: ‘Delta Sharing’ is the industry’s first open protocol for secure sharing of data across tools, applications and organizations, irrespective of the compute and storage platforms.  It facilitates sharing live data without any intermediate staging storage, and it can share terabyte-scale data sets reliably and efficiently, with easy governance, auditing and tracking.
  • Scalable Metadata Handling: In big data, even the metadata is big and needs the same management as any other data, including distributed and replicated storage.  Delta Lake treats metadata as normal data, leveraging the distributed processing power of the compute engine and distributed, partitioned storage across files with ease. This naturally makes metadata handling highly scalable and available.
  • Unified Batch and Stream Sources and Sinks: A table in Delta Lake is a major building block in batch as well as streaming data pipelines. A table can be populated with static data, or automatically and dynamically populated with real-time data. Relationships among tables can easily be established to entertain even complex queries. Streaming ingest, batch historic backfill and interactive queries all just work out of the box (see the streaming sketch after this list).
  • Delta Everywhere: Connectors and plug-ins that link a variety of tools, applications and platforms through Delta Sharing are one of the important needs of the hour when designing a collaborative and accommodating system. API support for multiple languages makes it possible for a single system to be developed by teams with different skill sets.
  • Audit History and Time Travel: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes. Data versioning enables rollbacks and reproducible machine-learning experiments (see the time-travel sketch after this list).
  • Upserts and Deletes: To support use cases like Slowly Changing Dimension (SCD) operations, Change Data Capture and streaming upserts, Insert, Merge, Update and Delete operations are essential.  Rest assured, Delta Lake provides extensive support for them through its API (see the MERGE sketch after this list).
  • Compatibility with the Apache Spark API: Delta Lake offers 100% compatibility with the Apache Spark API.  Developers can quickly and easily run existing Spark data pipelines in a Delta Lake environment with minimal changes.
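
Schema enforcement and evolution can be sketched as follows, continuing with the SparkSession and the /tmp/delta/events table from the quickstart sketch above; the extra country column is an assumption for the example.

```python
# A DataFrame with a column that does not exist in the table's current schema.
new_rows = spark.createDataFrame([(3, "gamma", "IN")], ["id", "label", "country"])

# Schema enforcement: without mergeSchema, this append is rejected because
# "country" is not part of the table's schema.
# new_rows.write.format("delta").mode("append").save("/tmp/delta/events")

# Schema evolution: with mergeSchema, the table schema evolves to add the column.
(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/events"))
```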
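The unified batch and stream behaviour can be illustrated with a short streaming sketch, again reusing the quickstart table; the copy path and checkpoint location are illustrative assumptions.

```python
# The same Delta table works as a streaming source...
stream = spark.readStream.format("delta").load("/tmp/delta/events")

# ...and a Delta table works equally well as the streaming sink.
query = (stream.writeStream.format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
         .start("/tmp/delta/events_copy"))
```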
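Audit history and time travel can be explored through the DeltaTable utility and the versionAsOf read option; reading version 0 here assumes the quickstart table has seen at least one earlier commit.

```python
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/tmp/delta/events")
# Full audit trail: one row per commit made to the table.
dt.history().select("version", "timestamp", "operation").show()

# Time travel: read the table exactly as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```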
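Finally, an upsert with the Delta MERGE API might look like the sketch below; the updates DataFrame and the join key are assumptions for the example.

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/events")
updates = spark.createDataFrame([(2, "beta-v2"), (4, "delta")], ["id", "label"])

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()       # update rows whose id already exists
    .whenNotMatchedInsertAll()    # insert brand-new rows
    .execute())
```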

In a nutshell, the features of Delta Lake together improve both the manageability and performance of working with data in object storage, and they enable a “Lake House” paradigm that combines the key features of data warehouses and data lakes, augmented with high performance and scalability at favourable cost economics.

Delta Lake Data Maturity Layers:

In a typical data-processing pipeline, data changes its shape as it passes through different processing stages.  After ingesting data from the source, operations like cleaning, de-duplicating, enriching, transforming and converting into a dimensional model are inevitable steps before the data is ready for actual analytical processing.  This leads to layers of maturity levels across the data pipeline.  Databricks Delta Lake has suggested and popularized the following maturity levels.

The Bronze Layer: This layer contains the raw data ingested directly from source files.  The structure of these tables thus resembles the structure of the ingested data.  The data here is in its pre-cleaning state, awaiting further processing.  Bronze tables store the data in the Delta format irrespective of the format of the ingested data (JSON, Parquet, Avro, ORC). The tables in the Bronze Layer are further enriched by cleaning, de-duplicating, transforming and wrangling to produce a new version of the data, which constitutes the Silver Layer.

The Silver Layer: This layer contains tables with cleaned and enriched data, offering a refined view of the data.  It may be necessary to perform look-ups, joins with other Bronze/Silver tables, updates and so on to prepare the data model in this layer.  The Silver Layer represents the up-to-date data ready for analytical processing.

The Gold Layer: The data from the Silver Layer is converted into a dimensional model and aggregated to form the Gold Layer.  The aggregations done here can be used for reporting, business processing and dashboard design by end users or other applications.
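
As a rough illustration, a Bronze → Silver → Gold pipeline in PySpark might look like the sketch below; the paths, column names and aggregation are assumptions, not a prescribed layout, and the SparkSession from the quickstart sketch is reused.

```python
from pyspark.sql import functions as F

# Bronze: raw ingest stored as Delta with minimal changes.
raw = spark.read.json("/landing/orders/")
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: cleaned, de-duplicated and validated view of the Bronze data.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
silver = bronze.dropDuplicates(["order_id"]).filter(F.col("amount") > 0)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: aggregated, reporting-ready model built from Silver.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/customer_spend")
```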

Delta Lake Implementation platforms:

  • Delta Lake on Spark: The Delta Lake distribution is available for Spark 2.x and 3.x. Delta Lake 0.7.0 and later run on Spark 3.x with the delta-core package built for Scala 2.12, while the earlier 0.6.x releases target Spark 2.4. A configuration sketch follows after this list.
  • Azure Databricks: A data analytics platform optimized for the Microsoft Azure cloud services platform. It is an enterprise-level platform designed around the Spark engine to offer solutions for building analytics processing pipelines.  Azure Databricks supports Delta Lake development around and upon the Spark engine and the Databricks File System (DBFS).  This is one of the platforms offering complete support for Delta Lake within its storage and compute environments.
  • Azure Synapse Analytics: Azure Synapse Analytics is a managed, unified Azure service offering end-to-end solutions for all types of business analytics operations. One of its features is the Spark Pool, which enables data engineers and data scientists to design data pipelines either interactively or in batch mode, using languages like Scala, PySpark and .NET. Synapse Analytics also has a service called the Serverless SQL Pool, which offers a queryable layer over otherwise non-queryable data formats. Interestingly, the Serverless SQL Pool can be queried to read Delta Lake files and serve them to reporting tools. A Delta Lake table has the same format irrespective of whether it was created using Apache Spark, Azure Databricks or any other product that uses the Delta Lake format. The Spark API of the Spark Pool supports querying and modifying Delta Lake tables from code. Full support for Delta Lake is yet to be introduced.
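
For the plain “Delta Lake on Spark” option, attaching the connector to an existing Spark installation is mostly a matter of session configuration, as in the sketch below; the artifact version shown is only an example and must match your Spark and Scala build.

```python
from pyspark.sql import SparkSession

# Pull the open-source Delta connector from Maven and enable its SQL extension
# and catalog; io.delta:delta-core_2.12:2.4.0 pairs with Spark 3.4 (example only).
spark = (
    SparkSession.builder.appName("delta-on-spark")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```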

Many organizations are using and contributing to Delta Lake today; a few names are Databricks, Alibaba Group, Viacom and Informatica.  The structure of data changes over time as business concerns and requirements change; however, with the help of Delta Lake, adding new dimensions as the data changes is simple. Delta Lake improves the performance, reliability and manageability of Data Lakes.

Above all, Delta Lake has made it extremely easy for organizations to adopt ‘Data-Driven Decision Making’ and to create value out of their data.

Blog by Mr. Chandrashekhar Deshpande (MCT)

Technology expertise: Analytics, Machine Learning and Artificial Intelligence, the Hadoop ecosystem and Azure HDInsight, Apache Spark, Azure Databricks, Azure ML Services, Azure Cognitive Services, Java, Python, Scala, GoF and JEE patterns and frameworks.