What Are ACID Functions and How Do They Impact Data Lake Storage Environments? - Part 3


Apache Hudi

Apache Hudi is an open-source data management framework that simplifies incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. Upsert means inserting a record into an existing dataset if it does not already exist, or updating it if it does. By efficiently managing how data is laid out in HDFS, S3, and GCP, Hudi allows data to be ingested and updated in near real time. Hudi carefully maintains metadata about the actions performed on the dataset to help ensure that those actions are atomic and consistent.

These features make Hudi suitable for the following use cases:

  • Working with streaming data from sensors and other Internet of Things (IoT) devices that require specific data insertion and update events.
  • Complying with data privacy regulations in applications where users might choose to be forgotten or modify their consent for how their data can be used.
  • Implementing a change data capture (CDC) system that allows you to apply changes to a dataset over time.

Insert Data

This section shows how to launch the interactive Spark shell and use it to work with Hudi on Hadoop. We can also use the Hudi DeltaStreamer utility or other tools to write to a dataset. Throughout this section, the examples work with datasets from the Spark shell while connected to the master node over SSH as the default hadoop user.
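
As a rough sketch of the session setup, the snippet below assumes PySpark rather than the Scala shell, and assumes the Hudi Spark bundle jar is already on the classpath (for example through `--packages` or `--jars`; the exact artifact depends on your Spark and Hudi versions, so it is left out here). The helper names `hudi_spark_conf` and `build_session` are illustrative and not part of Hudi.

```python
def hudi_spark_conf():
    """Spark settings commonly applied when working with Hudi."""
    return {
        # Hudi relies on Kryo serialization
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Enables Hudi's Spark SQL extensions
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    }


def build_session(app_name="hudi-demo"):
    """Create a SparkSession with the settings above (requires pyspark)."""
    # Imported lazily so hudi_spark_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in hudi_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Inside an already running PySpark shell the `spark` session exists, so the same settings would instead be passed as `--conf` flags at launch.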

Generate some new employee records, load them into a DataFrame, and write the DataFrame to the Hudi table. Once the write completes, the data is loaded and the DataFrame schema can be printed to verify it.
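
The insert step might look like the following in PySpark. This is a sketch under assumptions: the table name, base path, column names, and all sample values are made up for illustration (only the empid values 677509 and 940761 come from this article), and `spark` is an existing SparkSession with Hudi available.

```python
TABLE_NAME = "employee"            # assumed table name
BASE_PATH = "/tmp/hudi/employee"   # hypothetical path; use an HDFS or S3 location in practice
COLUMNS = ["empid", "name", "dept", "salary", "updated_at"]  # assumed schema


def sample_employees():
    """Five made-up employee rows for the initial load."""
    return [
        (677509, "Employee A", "sales", 52000, 1),
        (940761, "Employee B", "sales", 48000, 1),
        (100001, "Employee C", "finance", 61000, 1),
        (100002, "Employee D", "finance", 57000, 1),
        (100003, "Employee E", "hr", 45000, 1),
    ]


def hudi_write_options(operation):
    """Hudi datasource options for the given write operation."""
    return {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.recordkey.field": "empid",        # unique key per record
        "hoodie.datasource.write.partitionpath.field": "dept",     # partition column
        "hoodie.datasource.write.precombine.field": "updated_at",  # larger value wins on duplicate keys
        "hoodie.datasource.write.operation": operation,
    }


def insert_employees(spark, base_path=BASE_PATH):
    """Load the sample rows into a DataFrame and write them as a Hudi table."""
    df = spark.createDataFrame(sample_employees(), COLUMNS)
    df.printSchema()
    (df.write.format("hudi")
        .options(**hudi_write_options("insert"))
        .mode("overwrite")  # creates the table, replacing any existing data at base_path
        .save(base_path))
    return df
```

The record key, partition path, and precombine fields are the three choices Hudi needs to locate and deduplicate records; which columns play those roles here is an assumption.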

We can query the employee table through a Spark DataFrame; the query returns 5 rows.
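
A query along these lines could be sketched as follows in PySpark. The base path, view name, and business column names are assumptions; `_hoodie_commit_time` is one of the metadata columns Hudi adds to every record. Older Hudi releases may need a partition glob appended to the load path.

```python
BASE_PATH = "/tmp/hudi/employee"   # hypothetical table location


def snapshot_query(view="employee_snapshot"):
    """SQL that lists employees together with the commit that last wrote them."""
    return (
        "SELECT _hoodie_commit_time, empid, name, dept, salary "
        f"FROM {view} ORDER BY empid"
    )


def read_employees(spark, base_path=BASE_PATH, view="employee_snapshot"):
    """Read the latest snapshot of the Hudi table and query it with Spark SQL."""
    df = spark.read.format("hudi").load(base_path)
    df.createOrReplaceTempView(view)
    result = spark.sql(snapshot_query(view))
    result.show()
    return result
```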

Upserting data also works by writing a DataFrame. Unlike the previous insert example, this write performs two operations: it updates the existing record with empid 677509 and adds a new record with empid 185760. Hudi also offers separate update and incremental options, which are not shown here.
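
The upsert described here can be sketched as below. The column names, values, table name, and path are assumptions for illustration; `apply_upsert` is a plain-Python model of upsert semantics, included only to make the expected outcome concrete, and is not how Hudi implements it.

```python
TABLE_NAME = "employee"            # assumed table name
BASE_PATH = "/tmp/hudi/employee"   # hypothetical table location
COLUMNS = ["empid", "name", "dept", "salary", "updated_at"]  # assumed schema


def upsert_rows():
    """One change to an existing key and one brand-new key (values are made up)."""
    return [
        (677509, "Employee A", "sales", 58000, 2),    # existing empid: update
        (185760, "Employee F", "finance", 63000, 2),  # new empid: insert
    ]


def apply_upsert(existing, incoming, key_index=0):
    """Model of upsert: incoming rows replace rows with the same key, others are added."""
    merged = {row[key_index]: row for row in existing}
    merged.update({row[key_index]: row for row in incoming})
    return sorted(merged.values())


def upsert_employees(spark, base_path=BASE_PATH):
    """Write the changed rows to the existing Hudi table with the upsert operation."""
    options = {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.recordkey.field": "empid",
        "hoodie.datasource.write.partitionpath.field": "dept",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
    }
    df = spark.createDataFrame(upsert_rows(), COLUMNS)
    # "append" adds a new commit to the table instead of replacing it
    df.write.format("hudi").options(**options).mode("append").save(base_path)
    return df
```

Applied to a five-row table that already contains empid 677509, this pair of changes leaves six rows: one updated in place and one added.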

Delete Data

There are several ways to delete records; here I use a simple delete of the record with empid 940761.
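
One way to express such a delete in PySpark is to read the records to remove and write them back with the `delete` operation. This is a sketch, not the only approach; the table name, path, and the columns used as key, partition, and precombine fields are assumptions.

```python
TABLE_NAME = "employee"            # assumed table name
BASE_PATH = "/tmp/hudi/employee"   # hypothetical table location


def hudi_delete_options():
    """Hudi datasource options for a hard delete."""
    return {
        "hoodie.table.name": TABLE_NAME,
        "hoodie.datasource.write.recordkey.field": "empid",
        "hoodie.datasource.write.partitionpath.field": "dept",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "delete",
    }


def delete_filter(empid):
    """Filter expression selecting the record to delete."""
    return f"empid = {int(empid)}"


def delete_employee(spark, empid=940761, base_path=BASE_PATH):
    """Select the matching record and issue a delete against the Hudi table."""
    to_delete = spark.read.format("hudi").load(base_path).filter(delete_filter(empid))
    (to_delete.write.format("hudi")
        .options(**hudi_delete_options())
        .mode("append")  # a delete is recorded as a new commit on the existing table
        .save(base_path))
```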

Conclusion

Hudi is an excellent framework for the data storage layer on Hadoop and cloud platforms, supporting all DML operations along with ACID transactions and time travel. Apache Hudi fills a big void for processing data on top of HDFS, S3, GCP, and similar storage, and it mostly coexists nicely with these technologies.

--

Selvam Rangasamy, Senior Data Engineer & Architect

I am a Big Data Engineer & Solution Architect with experience in various cloud and big data distribution systems, primarily Hadoop and AWS cloud services.