What Is ACID and How Does It Affect Data Lake Storage Environments? Part 4

Apache Iceberg

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table.

User experience

· Iceberg avoids unpleasant surprises. Schema evolution works and won’t inadvertently un-delete data. Users don’t need to know about partitioning to get fast queries.

· Schema evolution supports add, drop, update, or rename, and has no side effects.

· Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries.

· Partition layout evolution can update the layout of a table as data volume or query patterns change.

· Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes.

· Version rollback allows users to quickly correct problems by resetting tables to a good state.

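· It can read petabyte-scale tables from a single server, and the framework is designed for read-heavy workloads.

· At the time of writing, it does not support UPDATE or MERGE statements.

· At the time of writing, it does not support deleting individual records; a delete must cover an entire partition (or whole data files).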
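To make the time travel and version rollback points above more concrete, here is a minimal toy model in plain Python. It is not Iceberg's API: the `ToyTable` class and its method names are hypothetical, and it only illustrates the idea that every write produces a new immutable snapshot while the table keeps a pointer to the current one.

```python
class ToyTable:
    """Toy model of snapshot-based tables (illustrative only, not Iceberg code)."""

    def __init__(self):
        self._snapshots = {}   # snapshot id -> immutable tuple of rows
        self._current = None   # id of the snapshot that readers see by default
        self._next_id = 1

    def append(self, rows):
        """Create a new snapshot containing the current rows plus the new ones."""
        base = self._snapshots[self._current] if self._current is not None else ()
        snapshot_id = self._next_id
        self._next_id += 1
        self._snapshots[snapshot_id] = base + tuple(rows)
        self._current = snapshot_id
        return snapshot_id

    def read(self, snapshot_id=None):
        """Read the current snapshot, or an older one (time travel)."""
        sid = self._current if snapshot_id is None else snapshot_id
        if sid is None:
            return []
        return list(self._snapshots[sid])

    def rollback_to(self, snapshot_id):
        """Reset the table to a known good snapshot (version rollback)."""
        if snapshot_id not in self._snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self._current = snapshot_id


if __name__ == "__main__":
    table = ToyTable()
    good = table.append(["row-1", "row-2"])
    table.append(["bad-row"])
    print(table.read(good))   # time travel: the table as of the first snapshot
    table.rollback_to(good)
    print(table.read())       # after rollback the bad row is no longer visible
```

Because old snapshots are never modified, reading by snapshot id always returns the same rows, which is what makes queries reproducible in this model.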

Insert Data

Iceberg comes with catalogs that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under spark.sql.catalog.(catalog_name).

The setup used here creates a path-based catalog named local for tables under $PWD/warehouse and adds support for Iceberg tables to Spark’s built-in catalog.
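As a rough sketch of that configuration, the snippet below assembles the catalog properties and renders them as a spark-sql command line. The property keys and class names follow the Iceberg Spark documentation; the helper function names, the /tmp/warehouse path, and the runtime package coordinate are assumptions for illustration, so choose the Iceberg runtime that matches your Spark version.

```python
import shlex


def iceberg_catalog_conf(catalog_name, warehouse):
    """Build the spark.sql.catalog.* properties for a path-based Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Add Iceberg table support to Spark's built-in session catalog.
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        # Path-based (Hadoop) catalog registered under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


def spark_sql_command(conf, package):
    """Render the properties as a spark-sql invocation (hypothetical helper)."""
    args = ["spark-sql", "--packages", package]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    conf = iceberg_catalog_conf("local", "/tmp/warehouse")
    # Assumed package coordinate; adjust to your Spark and Iceberg versions.
    print(spark_sql_command(conf, "org.apache.iceberg:iceberg-spark3-runtime:0.10.0"))
```

The same dictionary can be passed to a SparkSession builder through repeated `.config(key, value)` calls if you prefer to start Spark from Python.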

After the initial load, we can query the employee table through a Spark DataFrame; it returns 5 rows.

An incremental insert then appends one more row to the same table.

Querying the employee table through the Spark DataFrame again now returns 6 rows.
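The load and incremental insert described above could be sketched as follows. The table name `local.db.employee`, the column layout, and every sample row are hypothetical (only empid 677511 appears in this article), and `load_and_count` expects a SparkSession that already has an Iceberg catalog named local configured.

```python
TABLE = "local.db.employee"  # hypothetical fully qualified table name

# Hypothetical schema; the article does not list the employee columns.
CREATE_SQL = (
    f"CREATE TABLE IF NOT EXISTS {TABLE} "
    "(empid BIGINT, name STRING, dept STRING) USING iceberg"
)

# Five made-up rows for the first load, one more for the incremental insert.
INITIAL_ROWS = [
    (677511, "emp_a", "sales"),
    (677512, "emp_b", "sales"),
    (677513, "emp_c", "hr"),
    (677514, "emp_d", "it"),
    (677515, "emp_e", "it"),
]
EXTRA_ROWS = [(677516, "emp_f", "finance")]


def insert_sql(rows):
    """Render an INSERT statement; values are trusted literals in this sketch."""
    values = ", ".join(
        f"({empid}, '{name}', '{dept}')" for empid, name, dept in rows
    )
    return f"INSERT INTO {TABLE} VALUES {values}"


def load_and_count(spark):
    """Create the table, load it twice, and return the row count after each load."""
    spark.sql(CREATE_SQL)
    spark.sql(insert_sql(INITIAL_ROWS))
    first_count = spark.table(TABLE).count()
    spark.sql(insert_sql(EXTRA_ROWS))
    second_count = spark.table(TABLE).count()
    return first_count, second_count
```

Starting from an empty table, the two counts would be 5 and 6. Each INSERT commits a new table snapshot, so the first state stays available for time travel.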

Delete Data

Delete queries accept a filter to match rows to delete. Iceberg can delete data as long as the filter matches entire partitions of the table, or it can determine that all rows of a file match. If a file contains some rows that should be deleted and some that should not, Iceberg will throw an exception. In this example, the row with empid 677511 sits alone in its partition, which is why it could be deleted; in any other case, only entire partitions can be dropped.
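The rule above can be illustrated with a small toy model in plain Python. This is not Iceberg code: `metadata_delete`, `PartialFileMatch`, and the file paths are hypothetical names, and the model only mirrors the decision of dropping whole files, keeping untouched files, and failing on files that match partially.

```python
class PartialFileMatch(Exception):
    """Raised when a file holds both matching and non-matching rows."""


def metadata_delete(files, predicate):
    """Return the files that remain after a whole-file delete.

    files maps a data file path to its list of rows (dicts).
    predicate returns True for rows that should be deleted.
    """
    kept = {}
    for path, rows in files.items():
        matches = [predicate(row) for row in rows]
        if rows and all(matches):
            continue  # every row matches: drop the whole file
        if any(matches):
            # Only some rows match: the file would need rewriting, so fail.
            raise PartialFileMatch(f"{path} matches the filter only partially")
        kept[path] = rows  # no row matches: keep the file untouched
    return kept


if __name__ == "__main__":
    files = {
        "empid=677511/file-a.parquet": [{"empid": 677511}],
        "empid=677512/file-b.parquet": [{"empid": 677512}],
    }
    remaining = metadata_delete(files, lambda row: row["empid"] == 677511)
    print(sorted(remaining))  # only the file for empid 677512 remains
```

If the two employees shared one file, the same call would raise `PartialFileMatch`, which corresponds to the exception case described above.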

Conclusion

Iceberg was built for huge tables. It is used in production where a single table can contain tens of petabytes of data, and even these huge tables can be read without a distributed SQL engine. At the time of writing, its major drawback is the lack of support for row-level update and delete operations.

Selvam Rangasamy-Senior Data Engineer & Architect

I am a Big Data Engineer & Solution Architect with experience in various cloud and big data distribution systems, primarily Hadoop and AWS cloud services.