Data Lakehouse Architectural Options
AWS and Databricks are the two main proponents of the data lakehouse concept, and their reference designs largely shape how a lakehouse’s architecture is described. Data lakehouse systems usually include the following five layers: ingestion, storage, metadata, API, and consumption.
The first layer, ingestion, pulls data from various sources and delivers it to the storage layer. It combines data streaming and batch processing, using various protocols to connect to internal and external resources (e.g., an RDBMS, a NoSQL database, or a CRM application). Components used during the ingestion stage might include AWS Database Migration Service (DMS) to import data and Apache Kafka to stream data.
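To make the mix of batch and streaming concrete, here is a minimal sketch in plain Python. It is not tied to DMS or Kafka: the `Record` type, the reader functions, and the source names are all hypothetical, and in-memory lists stand in for real extracts and message queues. The point is only that both paths hand the next layer records in one common shape.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Record:
    source: str              # hypothetical source label, e.g. "crm"
    mode: str                # "batch" or "stream"
    payload: Dict[str, Any]  # the row or event as received


def read_batch(rows: List[Dict[str, Any]], source: str) -> Iterator[Record]:
    """Emit every row of a finished extract (for example, a nightly dump)."""
    for row in rows:
        yield Record(source=source, mode="batch", payload=row)


def read_stream(events: Iterable[Dict[str, Any]], source: str) -> Iterator[Record]:
    """Emit events one at a time as they arrive (for example, from a queue)."""
    for event in events:
        yield Record(source=source, mode="stream", payload=event)


def ingest(*readers: Iterator[Record]) -> List[Record]:
    """Collect records from all readers so the next layer sees one shape."""
    landed: List[Record] = []
    for reader in readers:
        landed.extend(reader)
    return landed


# Stand-in data: a batch extract from a CRM and a short stream of click events.
crm_rows = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Lin"}]
click_events = iter([{"customer_id": 1, "page": "/pricing"}])

landed = ingest(read_batch(crm_rows, "crm"), read_stream(click_events, "clickstream"))
```

A real pipeline would write these records to the storage layer continuously rather than collecting them in a list.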
The second layer stores the data on cost-effective platforms such as Amazon S3. Client tools can read objects directly from the data store, so many APIs and other components can access and use the same data. A data lakehouse is best suited to cloud repository services that separate storage from compute, although it can also run on-premises.
The third layer, the metadata layer, is the main component that differentiates data lakehouses from other storage architectures. It is a centralized catalog that provides metadata about each object in the data lake, which lets users implement various management features (e.g., ACID transactions, caching, versioning, and zero-copy cloning).
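One way to picture how a catalog over immutable objects can offer versioning, atomic commits, and zero-copy cloning is a toy version log. The sketch below is an illustration under simplifying assumptions, not how any particular lakehouse product implements its metadata layer; the `TableLog` class and the object keys are hypothetical, and it ignores concurrent writers entirely.

```python
from typing import FrozenSet, Iterable, List, Optional


class TableLog:
    """Toy metadata log: each version is the set of object keys in the table."""

    def __init__(self, versions: Optional[List[FrozenSet[str]]] = None) -> None:
        # Version 0 is the empty table unless an existing history is supplied.
        self.versions: List[FrozenSet[str]] = list(versions) if versions else [frozenset()]

    def commit(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        """Publish a new version in one step, so readers never see a partial change."""
        current = self.versions[-1]
        missing = set(remove) - current
        if missing:
            raise ValueError(f"cannot remove unknown objects: {sorted(missing)}")
        self.versions.append(frozenset((current - set(remove)) | set(add)))
        return len(self.versions) - 1

    def snapshot(self, version: int = -1) -> FrozenSet[str]:
        """Return the object keys visible at a given version (latest by default)."""
        return self.versions[version]

    def clone(self) -> "TableLog":
        """Zero-copy clone: only the metadata is copied, data objects are shared."""
        return TableLog(self.versions)


orders = TableLog()
v1 = orders.commit(add=["orders/part-000.parquet", "orders/part-001.parquet"])
v2 = orders.commit(add=["orders/part-002.parquet"], remove=["orders/part-000.parquet"])

# The clone diverges from the original without copying any data objects.
dev_copy = orders.clone()
dev_copy.commit(add=["orders/part-003.parquet"])
```

Because data objects are never modified in place, an older version stays readable for as long as its objects are retained, and a clone costs only a copy of the log.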
The metadata layer supports data warehouse schema designs such as star and snowflake schemas. It allows organizations to manage schemas and provides data auditing and governance functionality. Schema management includes evolution and enforcement features; enforcement lets users control data quality by rejecting writes that don’t conform to the table’s schema.
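Schema enforcement and evolution can be sketched in a few lines. This is a simplified illustration: the `Schema` alias, the `enforce` and `evolve` helpers, and the `orders` columns are hypothetical, a schema is reduced to a mapping from column name to Python type, and evolution is limited to adding columns.

```python
from typing import Any, Dict, List

Schema = Dict[str, type]  # column name -> expected Python type (a simplification)


class SchemaError(ValueError):
    """Raised when a write or a schema change is not allowed."""


def enforce(schema: Schema, rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reject the whole write if any row has missing, extra, or mistyped columns."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaError(f"row {i}: columns {sorted(row)} do not match {sorted(schema)}")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise SchemaError(f"row {i}: {column!r} is not {expected.__name__}")
    return rows


def evolve(schema: Schema, new_columns: Schema) -> Schema:
    """Return a new schema with added columns; existing columns are left unchanged."""
    clash = set(schema) & set(new_columns)
    if clash:
        raise SchemaError(f"columns already exist: {sorted(clash)}")
    return {**schema, **new_columns}


orders_schema: Schema = {"order_id": int, "amount": float}

# A conforming write passes through unchanged.
accepted = enforce(orders_schema, [{"order_id": 1, "amount": 9.5}])

# A write with a mistyped column is rejected before anything is stored.
try:
    enforce(orders_schema, [{"order_id": "2", "amount": 3.0}])
    rejected = False
except SchemaError:
    rejected = True

# Evolution produces a new schema version rather than mutating the old one.
orders_schema_v2 = evolve(orders_schema, {"currency": str})
```

Rejecting the whole batch, rather than dropping only the bad rows, keeps the write all-or-nothing, which fits the transactional guarantees the metadata layer is meant to provide.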
A unified management interface makes auditing and access control easier.
The fourth layer hosts several APIs that allow end-users to process data quickly and perform advanced analytics tasks. For instance, a metadata API helps identify the objects required for a given application. Some ML libraries can read open formats such as Parquet and query the metadata layer directly. Other APIs help developers optimize data structures and transformations.
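As an illustration of how a metadata API might identify the objects an application needs, the sketch below keeps per-object minimum and maximum values for one column and returns only the objects that can contain matching rows. The `ObjectStats` type, the `order_date` column, and the object keys are assumptions made for the example, not part of any specific product's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObjectStats:
    key: str       # object key in the store
    min_date: str  # smallest order_date in the object (ISO 8601, so strings sort by date)
    max_date: str  # largest order_date in the object


def objects_for_range(catalog: List[ObjectStats], start: str, end: str) -> List[str]:
    """Return keys of objects whose date range overlaps [start, end]."""
    return [o.key for o in catalog if o.min_date <= end and o.max_date >= start]


catalog = [
    ObjectStats("orders/2024-01.parquet", "2024-01-01", "2024-01-31"),
    ObjectStats("orders/2024-02.parquet", "2024-02-01", "2024-02-29"),
    ObjectStats("orders/2024-03.parquet", "2024-03-01", "2024-03-31"),
]

# Only the February and March objects can hold rows in this range.
needed = objects_for_range(catalog, "2024-02-15", "2024-03-10")
```

The query engine then reads only the returned objects, skipping the rest without opening them.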
The fifth layer, data consumption, includes tools and applications that support analytics tasks like data visualization, ML jobs, queries, and business intelligence (BI) dashboards.