What Are ACID Properties and How Do They Affect Data Lake Storage Environments? – Part 1

In the RDBMS world, ACID support is taken for granted, but in the Cloud/Hadoop world, implementing ACID guarantees has been one of the major challenges.

I decided to compare four open-source projects that address this problem: Delta Lake, Hudi, Iceberg, and Hive. The idea is simple: prepare an environment for all four technologies and compare them from the Apache Spark and consumption perspectives, including Hive. Before that, let's look at what ACID means.

A transaction is a collection of instructions. To maintain the integrity of a database, all transactions must obey ACID properties. ACID is an acronym for atomicity, consistency, isolation, and durability. Let’s go over each of these properties.

1. Atomicity

A transaction is an atomic unit: either all the instructions within a transaction execute successfully, or none of them do. Consider a transaction that transfers 20 dollars from A's bank account to B's bank account. If any of its instructions fails, the entire transaction should abort and roll back.
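The transfer described above can be sketched with Python's built-in sqlite3 module, used here only as a stand-in for an RDBMS. The table name `accounts`, the helper names `transfer` and `balances`, and the starting balances are assumptions made for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; apply both writes or neither."""
    try:
        # As a context manager, the connection commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row[0] < 0:
                # Abort midway: the debit above must not survive.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 50), ("B", 100)])
conn.commit()

first_ok = transfer(conn, "A", "B", 20)     # succeeds: both updates are applied
after_first = balances(conn)
second_ok = transfer(conn, "A", "B", 500)   # fails midway: the debit is rolled back
after_second = balances(conn)
```

The second call fails after A has already been debited, yet the balances afterwards match those before the call, which is the all-or-nothing behaviour atomicity requires.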

2. Consistency

A database starts in a consistent state, and it should remain consistent after every transaction. Suppose that the transfer in the previous example fails after Write(A_b), that is, after A's balance has been debited but before B's has been credited, and the transaction is not rolled back. The database is then inconsistent: the sum of A's and B's money after the transaction no longer equals the amount they had before it.
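A minimal sketch of that failure in plain Python follows. The dictionary `db` stands in for the database, `A_b` comes from the example above, and `B_b`, the starting balances, and the function names are hypothetical.

```python
def total(db):
    """The invariant: a transfer must not change the combined balance."""
    return db["A_b"] + db["B_b"]

def transfer_without_rollback(db, amount, fail_after_debit):
    a = db["A_b"]            # Read(A_b)
    db["A_b"] = a - amount   # Write(A_b)
    if fail_after_debit:
        return               # simulated failure; the debit is left in place
    b = db["B_b"]            # Read(B_b)
    db["B_b"] = b + amount   # Write(B_b)

db = {"A_b": 50, "B_b": 100}
before = total(db)
transfer_without_rollback(db, 20, fail_after_debit=True)
after = total(db)
consistent = (before == after)   # False: 20 has vanished from the system
```

Running the same function with `fail_after_debit=False` leaves the total unchanged, so the inconsistency comes from the partial write rather than from the transfer logic.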

3. Isolation

If multiple transactions run concurrently, they should not affect each other; that is, the result should be the same as if the transactions had run sequentially. Suppose T1 adds 20% interest to B's savings account, T2 adds 20 pounds to the same account, and B_bal is initially 100. If a context switch occurs after T1 computes B_bal *= 1.2 but before it writes the result, T2 will read the stale value of 100, because the updated value has not yet been written back to the database. This violates the isolation property, as the result differs from the one that would have been obtained if T1 had finished before T2.
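The interleaving can be replayed deterministically in plain Python. This is only a simulation of the schedule, not real concurrency; the function names are hypothetical, and integer arithmetic is used for the 20% interest to keep the numbers exact.

```python
def run_serial(initial):
    """T1 finishes completely before T2 starts."""
    bal = initial
    bal = bal + bal * 20 // 100   # T1: add 20% interest
    bal = bal + 20                # T2: add 20
    return bal

def run_interleaved(initial):
    """T1 is interrupted after computing, before writing back."""
    db = {"B_bal": initial}
    t1_local = db["B_bal"]                        # T1: Read(B_bal)
    t1_local = t1_local + t1_local * 20 // 100    # T1: compute, not yet written
    # --- context switch to T2 ---
    t2_local = db["B_bal"]                        # T2: reads the stale value
    t2_local = t2_local + 20
    db["B_bal"] = t2_local                        # T2: Write(B_bal)
    # --- context switch back to T1 ---
    db["B_bal"] = t1_local                        # T1: Write(B_bal), overwriting T2
    return db["B_bal"]

serial_result = run_serial(100)
interleaved_result = run_interleaved(100)
```

With an initial balance of 100, the serial schedule ends at 140 while the interleaved one ends at 120: T2's deposit is lost, which is exactly the kind of anomaly isolation is meant to rule out.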

4. Durability

Changes that have been committed to the database should persist even in the event of a software or hardware failure. For instance, if B's account contains $120, this information should not disappear after such a failure.
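As a rough illustration, the sketch below commits B's balance to a file-backed SQLite database, closes the connection, and reads the value back through a new connection. Closing and reopening only stands in for a restart; it does not prove crash safety. The file name and table layout are assumptions.

```python
import os
import sqlite3
import tempfile

# A file-backed database, so committed data outlives the connection.
path = os.path.join(tempfile.mkdtemp(), "bank.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('B', 120)")
conn.commit()   # after this point the change is expected to survive
conn.close()    # stand-in for the process going away

# A fresh connection sees the committed balance.
reopened = sqlite3.connect(path)
balance_after_restart = reopened.execute(
    "SELECT balance FROM accounts WHERE name = 'B'"
).fetchone()[0]
reopened.close()
```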

One of the biggest challenges when working with data lakes is updating data in existing datasets and extracting only the changes made since the latest ingestion.
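To make those two operations concrete, here is a toy sketch in plain Python. It is not the API of Delta Lake, Hudi, Iceberg, or Hive; the functions `upsert` and `changes_since`, the record keys, and the integer commit IDs are all hypothetical.

```python
def upsert(table, batch, commit_id):
    """Insert new keys and overwrite existing ones, stamping each row with commit_id."""
    for key, value in batch.items():
        table[key] = {"value": value, "commit": commit_id}

def changes_since(table, last_commit):
    """Return only the rows written after the given commit."""
    return {
        key: row["value"]
        for key, row in table.items()
        if row["commit"] > last_commit
    }

table = {}
upsert(table, {"u1": "alice", "u2": "bob"}, commit_id=1)     # initial load
upsert(table, {"u2": "bobby", "u3": "carol"}, commit_id=2)   # one update, one insert

incremental = changes_since(table, last_commit=1)
```

Here `incremental` contains only `u2` and `u3`, the rows touched by the second commit. On a data lake the same idea has to work across immutable files and concurrent writers, which is where ACID support becomes essential.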

A Quick Comparison

The table below compares all four frameworks; the following parts of this series will cover the details.

--

Selvam Rangasamy, Senior Data Engineer & Architect

I am a Big Data Engineer & Solution Architect with experience in various cloud and big data distribution systems, primarily Hadoop and AWS cloud services.