Why have I written this series? I have worked on six projects built on Hadoop (Hortonworks/Cloudera) and AWS. In those projects the data ingestion layer wrote to HDFS and S3, but that storage does not by itself support features such as ACID transactions, incremental data loading, or handling of duplicate data. We therefore used HBase, DynamoDB, and some custom scripts to get that functionality, which took a good amount of development effort and introduced some bugs. We also faced performance issues in HBase (which is good for random read/write operations), for example around row keys and deletes, and we had to provision and maintain separate infrastructure for those frameworks. My suggestion is to try Apache Hudi or Delta Lake in your project, and to consider Apache Iceberg if the workload is read-heavy.
A Quick Comparison
See the table below for a comparison of all four frameworks.
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.
· Iceberg avoids unpleasant surprises. Schema evolution works and won’t inadvertently un-delete data. Users don’t need to know about partitioning to get fast queries.
· Schema evolution supports add, drop, update, or rename, and has no side-effects.
· Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries.
· Partition layout evolution can update the layout of a table as data volume or query patterns…
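The hidden partitioning idea in the bullets above can be illustrated with a toy model. This is plain Python, not the Iceberg API: the class `HiddenPartitionTable` and its methods are hypothetical names invented for this sketch. It assumes a day-based partition derived from a timestamp column, so the reader filters only on the timestamp and still reads only the relevant partitions.

```python
from collections import defaultdict
from datetime import datetime


class HiddenPartitionTable:
    """Toy table (hypothetical, not Iceberg) that derives a day partition from each row."""

    def __init__(self):
        self.partitions = defaultdict(list)  # maps a date to the rows stored for that day

    def append(self, row):
        # The partition value is computed from the data itself;
        # the writer never supplies a separate partition column.
        self.partitions[row["ts"].date()].append(row)

    def scan(self, start, end):
        """Return rows with start <= ts <= end and the number of partitions read."""
        # Partition pruning: only days overlapping the requested range are touched.
        wanted = [d for d in self.partitions if start.date() <= d <= end.date()]
        rows = [r for d in wanted for r in self.partitions[d] if start <= r["ts"] <= end]
        return rows, len(wanted)


table = HiddenPartitionTable()
for day in (1, 2, 3):
    table.append({"ts": datetime(2024, 1, day, 12, 0), "event": f"e{day}"})

# The query mentions only the timestamp, never the partition layout.
rows, partitions_read = table.scan(datetime(2024, 1, 2), datetime(2024, 1, 2, 23, 59, 59))
print([r["event"] for r in rows], partitions_read)
```

In this sketch the query touches one of the three day partitions even though the caller knows nothing about how the table is laid out.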
Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. Upsert refers to the ability to insert records into an existing dataset if they do not already exist or to update them if they do. By efficiently managing how data is laid out in HDFS, Amazon S3, and Google Cloud Storage, Hudi allows data to be ingested and updated in near real time. Hudi carefully maintains metadata of the actions performed on the dataset to help ensure that the actions are atomic and consistent.
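To make the upsert semantics concrete, here is a minimal sketch in plain Python. It is not the Hudi API: the function `upsert`, the in-memory `table` dictionary, and the `id` record key are assumptions made for illustration only.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def upsert(table: Dict[object, Record], incoming: Iterable[Record], key: str = "id") -> Dict[str, int]:
    """Insert records whose key is new; update records whose key already exists."""
    stats = {"inserted": 0, "updated": 0}
    for record in incoming:
        record_key = record[key]
        if record_key in table:
            # Existing key: merge the new field values over the stored record.
            table[record_key] = {**table[record_key], **record}
            stats["updated"] += 1
        else:
            # New key: store a copy of the record.
            table[record_key] = dict(record)
            stats["inserted"] += 1
    return stats


table: Dict[object, Record] = {}
first = upsert(table, [{"id": "u1", "city": "Pune"}, {"id": "u2", "city": "Delhi"}])
second = upsert(table, [{"id": "u2", "city": "Mumbai"}, {"id": "u3", "city": "Chennai"}])
print(first, second, table["u2"])
```

The second batch updates `u2` in place and inserts `u3`, so the table never holds two rows for the same key. A framework such as Hudi applies the same idea to files in a data lake rather than to an in-memory dictionary.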
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Specifically, Delta Lake offers:
· ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
· Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
· Streaming and batch unification: A table in Delta Lake is…
In the RDBMS world ACID support is taken for granted, but in the cloud/Hadoop world implementing ACID guarantees has been one of the major challenges.
I decided to compare several related open-source projects: Delta Lake, Hudi, Iceberg, and Hive. The idea is simple: prepare an environment for all four technologies and compare them from the Apache Spark and consumption perspectives, including Hive. Before that, let us look at what ACID means.
A transaction is a collection of instructions. To maintain the integrity of a database, all transactions must obey ACID properties. ACID is an acronym for atomicity, consistency, isolation, and durability. …
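As a small illustration of atomicity, the sketch below uses Python's built-in `sqlite3` module, not any of the four frameworks. The `accounts` table, the `transfer` helper, and the balances are hypothetical. The second transfer fails halfway through, and the whole transaction is rolled back so that no partial update remains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            # This debit violates the CHECK constraint when funds are insufficient.
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a name -> balance dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok_small = transfer(conn, "alice", "bob", 30)   # both updates succeed and commit
ok_large = transfer(conn, "alice", "bob", 500)  # debit fails, so the credit is undone too
print(ok_small, ok_large, balances(conn))
```

The failed transfer had already credited `bob` before the debit was rejected, yet the final balances show only the first transfer. That all-or-nothing behavior is what atomicity means.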