What is ACID and how does it impact Data Lake storage environments? – Part 2

--

Delta Lake Framework

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

Specifically, Delta Lake offers:

· ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.

· Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.

· Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.

· Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.

· Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (see the time-travel sketch after this list).

· Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
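To make the time-travel point concrete, here is a minimal PySpark sketch. The table path /tmp/delta/employee and the session configuration are assumptions for illustration, not code from the original setup.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath; the two configs below
# enable Delta Lake's SQL extensions and catalog in a plain Spark session.
spark = (
    SparkSession.builder
    .appName("delta-time-travel")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the table as it exists today ...
current_df = spark.read.format("delta").load("/tmp/delta/employee")

# ... and as it existed at an earlier version, e.g. for an audit or a rollback.
v0_df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/delta/employee")
)

v0_df.show()
```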

Insert Data

Delta Lake supports creating tables directly based on the path using DataFrameWriter (Scala or Java/Python). Delta Lake also supports creating tables in the metastore using standard DDL CREATE TABLE. When you create a table in the metastore using Delta Lake, it stores the location of the table data in the metastore. This pointer makes it easier for other users to discover and refer to the data without having to worry about exactly where it is stored.
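For the metastore route, a minimal sketch might look like the following; the table name employee, its columns, and the path /tmp/delta/employee are assumptions for illustration. The path-based DataFrameWriter route appears in the next sketch.

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Metastore table: standard DDL that also records where the data lives,
# so other users can find the table without knowing the storage path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS employee (
        empId   INT,
        empName STRING,
        dept    STRING,
        salary  DOUBLE
    )
    USING DELTA
    LOCATION '/tmp/delta/employee'
""")
```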

Generate some new employee records, load them into a DataFrame, and write the DataFrame to the Delta Lake table as shown below. Once the data is loaded, the schema can be displayed to confirm the structure.
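A minimal PySpark sketch of this step, assuming the same hypothetical employee table and path as above (the sample values are made up for illustration):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# A few sample employee records (values are made up for illustration).
new_employees = [
    (677509, "Arun",  "Engineering", 85000.0),
    (677510, "Meena", "Finance",     72000.0),
    (677511, "Ravi",  "Engineering", 91000.0),
    (677512, "Priya", "Marketing",   68000.0),
    (677513, "Kumar", "Engineering", 79000.0),
]

df = spark.createDataFrame(new_employees, ["empId", "empName", "dept", "salary"])

# Append the new records to the Delta table and inspect the schema.
df.write.format("delta").mode("append").save("/tmp/delta/employee")
df.printSchema()
```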

We can query the employee table data through a Spark DataFrame, and the five rows are returned.
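A sketch of that query, again assuming the hypothetical path used above:

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Read the Delta table back into a DataFrame and show the five rows.
employees = spark.read.format("delta").load("/tmp/delta/employee")
employees.show(5)
```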

Update Data

The following example demonstrates how to update data in a Delta table. Delta Lake also supports a merge operation, although it is not shown here; update and merge can be applied in a number of ways depending on your code base.
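As one illustrative sketch, using the DeltaTable API from the delta-spark package; the predicate and the new value are assumptions for this example:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

delta_table = DeltaTable.forPath(spark, "/tmp/delta/employee")

# Update in place: give everyone in Engineering a 10% raise.
delta_table.update(
    condition="dept = 'Engineering'",
    set={"salary": "salary * 1.10"},
)
```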

Delete Data

You can remove data that matches a predicate from a Delta table. For instance, to delete empId 677509, you can run the following:
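A minimal sketch of the delete, again with the DeltaTable API and the hypothetical path used in the earlier examples:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a SparkSession already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

delta_table = DeltaTable.forPath(spark, "/tmp/delta/employee")

# Remove all rows matching the predicate, in this case a single employee.
delta_table.delete("empId = 677509")
```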

Conclusion

Hadoop and the cloud transformed the big data processing landscape and allowed engineers to build efficient data pipelines. However, there are critical gaps in how engineers manage the storage layer for big data, both on-premises and in the cloud; they have had to rely on workarounds and build complicated data pipelines to deliver data to consumers. If you want the fully featured Delta Lake framework, you need to go with Databricks.


Written by Selvam Rangasamy, Senior Data Engineer & Architect

I am a Big Data Engineer & Solution Architect with experience in various cloud and big data distributed systems, primarily Hadoop and AWS Cloud services.