GCP Data Lakehouse
The recommended data architecture on GCP is:

- Data Format: Parquet
- Storage: Google Cloud Storage (GCS)
- ETL: Dataflow
- Lakehouse: BigQuery
The recommended workflow is:

- GRAX writes data to GCS in Parquet
- Dataflow (see the pipeline sketch below)
  - Is notified when new Parquet data is available
  - A Python script reads the new Parquet data
  - Extracts objects and fields for downstream use
  - Transforms fields into computed fields for business logic
  - Loads the results into BigQuery
- BigQuery (see the query sketch below)
  - Queries and joins multiple data sets:
    - GRAX ETL data
    - GRAX Data Lake data
    - Additional data sets
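
The Dataflow step might look like the following minimal Apache Beam sketch, which reads newly written GRAX Parquet from GCS, derives an example computed field, and appends the rows to BigQuery. The project, bucket, dataset, table, and field names (`my-gcp-project`, `gs://my-grax-bucket`, `StageName`, and so on) are placeholder assumptions, and the real transformation logic depends on your objects and business rules.

```python
# Minimal Apache Beam sketch of the Dataflow step: read GRAX Parquet from
# GCS, add a computed field, and append the rows to BigQuery.
# All resource names below are placeholders.
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
from apache_beam.options.pipeline_options import PipelineOptions


def add_computed_fields(record):
    """Example business logic; real computed fields depend on your schema."""
    record["is_closed_won"] = record.get("StageName") == "Closed Won"
    return record


def run():
    options = PipelineOptions(
        runner="DataflowRunner",              # use DirectRunner to test locally
        project="my-gcp-project",             # placeholder project
        region="us-central1",
        temp_location="gs://my-temp-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # Placeholder path; in practice the object path comes from the
            # notification for the newly written Parquet file.
            | "ReadParquet" >> ReadFromParquet(
                "gs://my-grax-bucket/parquet/Opportunity/*.parquet")
            | "Transform" >> beam.Map(add_computed_fields)
            | "LoadToBigQuery" >> beam.io.WriteToBigQuery(
                "my-gcp-project:grax.opportunity",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```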
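
Once the data is loaded, joining GRAX ETL output with other data sets is a standard BigQuery query. The sketch below uses the `google-cloud-bigquery` client; the dataset, table, and column names are illustrative assumptions.

```python
# Sketch of querying BigQuery to join GRAX ETL output with another data set.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT o.Id, o.Name, o.is_closed_won, t.region
    FROM `my-gcp-project.grax.opportunity` AS o      -- GRAX ETL data
    JOIN `my-gcp-project.sales.territories` AS t     -- additional data set
      ON o.OwnerId = t.owner_id
    WHERE o.is_closed_won
"""

for row in client.query(query).result():
    print(row["Id"], row["Name"], row["region"])
```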
Anti-patterns are:

- Reading entire Parquet files instead of only the columns you need (see the column-pruning sketch below)
- Polling GCS for new Parquet files instead of receiving a push notification (see the Pub/Sub sketch below)
- Moving or copying GRAX data in GCS just to track what has been ingested; track ingestion state separately and leave the GRAX-written objects in place
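
To avoid the first anti-pattern, read only the columns your pipeline needs. A minimal sketch with `pyarrow`, assuming a placeholder bucket, object path, and column list:

```python
# Sketch of reading only the needed columns from a GRAX Parquet object in GCS
# rather than the whole file. Bucket, path, and column names are placeholders.
import pyarrow.parquet as pq
from pyarrow import fs

gcs = fs.GcsFileSystem()

table = pq.read_table(
    "my-grax-bucket/parquet/Opportunity/part-0001.parquet",  # placeholder path
    columns=["Id", "Name", "StageName"],  # only the fields needed downstream
    filesystem=gcs,
)
print(table.num_rows)
```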
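
To avoid polling, configure the bucket to publish object notifications to a Pub/Sub topic and react to those messages. A hedged sketch of the consumer side, assuming a placeholder project and a subscription that already receives the bucket's notifications:

```python
# Sketch of consuming GCS object notifications from Pub/Sub so the pipeline
# is triggered by a push rather than by polling the bucket.
# Project and subscription names are placeholders.
import concurrent.futures

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-gcp-project", "grax-parquet-sub")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # GCS notifications carry the event type, bucket, and object name as attributes.
    if message.attributes.get("eventType") == "OBJECT_FINALIZE":
        gcs_uri = f"gs://{message.attributes['bucketId']}/{message.attributes['objectId']}"
        print(f"New Parquet object: {gcs_uri}")  # kick off the Dataflow job here
    message.ack()


streaming_pull = subscriber.subscribe(subscription, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=60)  # listen for one minute in this example
    except concurrent.futures.TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()
```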