As ChatGPT becomes more popular, I often come across customers and data users citing ChatGPT responses in their discussions. I love the excitement around ChatGPT and the desire to learn about modern data architectures like data lakehouses, data meshes and data fabrics. ChatGPT is a great resource for gaining high-level insights and developing awareness of any technology. However, caution is needed when diving deeper into a particular technology. ChatGPT is trained on historical data and, depending on how a question is phrased, it may offer inaccurate or misleading information.
I took the free version of ChatGPT for a test drive (in March 2023) and asked a few simple questions about the data lakehouse and its components. Here are some of the answers that weren’t right, along with my explanation of where and why each went wrong. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself, which I will count as my 2023 contribution to social good.
I thought this was a pretty comprehensive list. One key component that is missing is a common table format that can be used by all analytics services that access lakehouse data. In a data lakehouse, the table format matters because it acts as an abstraction layer, making it easy to access all the structured and unstructured data in the lakehouse at the same time through any engine or tool. The table format supplies the structure that a plain data lake lacks, using a schema or metadata definition to bring the data closer to what a data warehouse offers. Popular table formats include Apache Iceberg, Delta Lake, Apache Hudi, and Hive ACID.
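To make the abstraction-layer idea concrete, here is a minimal Python sketch of what a table format does conceptually: it keeps a schema plus a list of snapshots, each pointing at data files, so that any engine can ask the same metadata which files to read. The names `TableMetadata`, `Snapshot`, and `plan_scan` are hypothetical and invented for illustration; this is a toy model, not how Apache Iceberg, Delta Lake, or any other real format is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the data files visible at one commit."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class TableMetadata:
    """Toy table metadata (hypothetical): a schema plus a snapshot history."""
    schema: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)

    def current_files(self) -> Tuple[str, ...]:
        # The latest snapshot defines what readers see right now.
        return self.snapshots[-1].data_files if self.snapshots else ()

    def commit(self, new_files: List[str]) -> Snapshot:
        # Each commit appends a new snapshot; older snapshots stay unchanged.
        snapshot = Snapshot(
            snapshot_id=len(self.snapshots) + 1,
            data_files=self.current_files() + tuple(new_files),
        )
        self.snapshots.append(snapshot)
        return snapshot


def plan_scan(engine_name: str, table: TableMetadata) -> List[str]:
    """Any engine plans its reads from the same metadata, not from raw folders."""
    return [f"{engine_name} reads {path}" for path in table.current_files()]


if __name__ == "__main__":
    table = TableMetadata(schema={"id": "long", "event": "string"})
    table.commit(["data/part-0001.parquet"])
    table.commit(["data/part-0002.parquet"])
    # Two different engines resolve the same set of files from one definition.
    print(plan_scan("sql-engine", table))
    print(plan_scan("ml-notebook", table))
```

The point of the sketch is only that the schema and file list live in shared metadata rather than inside any one engine, which is what lets multiple tools work on the same data consistently.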
Also, the data lake layer is not limited to cloud object storage. Many companies still have vast amounts of data on premises, and data lakes are not limited to public clouds. They can be built on premises or as hybrid deployments using private clouds, HDFS stores, or Apache Ozone.
At Cloudera, we also provide machine learning as part of our lakehouse, so data scientists have easy access to reliable data from the data lakehouse to quickly launch new machine learning projects and build and deploy new models for advanced analytics.
I like how ChatGPT started this answer, but it quickly jumps into features and even gets the feature comparison wrong. Features are not the only way to determine which table format is better. The choice also depends on compatibility, openness, versatility and other factors that ensure wider use of the data by diverse users, security and management, and the future reliability of your architecture.
Here’s a high-level feature comparison chart if you want a breakdown of what’s available in Delta Lake vs. Apache Iceberg.
This answer is a bit dangerous because it is wrong, and it shows why I think these tools are not ready for deeper analysis. At first glance it may seem like a reasonable answer, but its premise is wrong, which makes you question the entire answer and the other answers as well. Saying “Delta Lake is built on top of Apache Iceberg” is incorrect, because the two are completely different, unrelated table formats, and neither was conceived from the other. They were created by different organizations to solve common data problems.
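One way to see that these are independent formats is to look at how each lays out its metadata on storage. The sketch below is a rough heuristic, assuming a table directory on a local filesystem: Delta Lake keeps its transaction log under `_delta_log/`, Apache Iceberg keeps `*.metadata.json` files under `metadata/`, and Apache Hudi keeps its timeline under `.hoodie/`. The function name `detect_table_format` is hypothetical, and a real deployment would normally ask the catalog or metastore rather than inspect folders.

```python
from pathlib import Path


def detect_table_format(table_dir: str) -> str:
    """Guess a table's format from its on-disk layout (heuristic sketch only)."""
    root = Path(table_dir)

    # Delta Lake: transaction log directory at the table root.
    if (root / "_delta_log").is_dir():
        return "delta"

    # Apache Iceberg: table metadata files under metadata/.
    metadata_dir = root / "metadata"
    if metadata_dir.is_dir() and any(metadata_dir.glob("*.metadata.json")):
        return "iceberg"

    # Apache Hudi: timeline directory at the table root.
    if (root / ".hoodie").is_dir():
        return "hudi"

    return "unknown"
```

Because each format has its own metadata structure, a table written by one cannot simply be read as the other; nothing in a Delta Lake table depends on Iceberg metadata, or the reverse.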
I’m impressed that ChatGPT got this one mostly right, even though it made a few mistakes with our product names and missed a few components that are important to a lakehouse implementation.
CDP components that support the data lakehouse architecture include:
- The Apache Iceberg table format, integrated with CDP to provide structure to the vast amount of structured and unstructured data in your data lake.
- Data services, including a cloud-native data warehouse called CDW, a data engineering service called CDE, a data streaming service called Data in Motion and a machine learning service called CML.
- Cloudera Shared Data Experience (SDX), which provides a unified data catalog with automatic data profiles, unified security and unified management of all your data across both public and private clouds.
ChatGPT is a great tool for getting a high-level understanding of new technologies, but I’d say use it carefully, validate its responses, and only rely on it in the awareness phase of the buying cycle. Once you move into the consideration or comparison phase, it is not yet reliable.
Also, ChatGPT’s responses are continually being updated, so hopefully these errors will be fixed before you read this blog.
To learn more about Cloudera’s lakehouse, visit the web page, and if you’re ready to get started, watch the Cloudera Now demo.