GCP Data Lakehouse

The recommended data architecture on GCP is:

  • Data Format: Parquet

  • Storage: Google Cloud Storage (GCS)

  • ETL: Dataflow

  • Lakehouse: BigQuery

The recommended workflow is:

  • GRAX writes data to GCS in Parquet

  • Dataflow (see the pipeline sketch after this list)

    • Notified when new Parquet is available

    • Python script reads new Parquet data

    • Extracts objects and fields for downstream

    • Transforms fields into computed fields for business logic

    • Loads into BigQuery

  • BigQuery

    • Queries and joins multiple data sets

      • GRAX ETL data

      • GRAX data lake data

      • Additional data sets
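
As a rough illustration of the Dataflow step above, the sketch below covers the extract/transform/load portion: it reads only the needed columns from newly arrived Parquet in GCS, adds a computed field for business logic, and appends the result to BigQuery. The project, bucket, dataset, table, column, and computed-field names are assumptions for illustration, not part of the GRAX product.

```python
# Minimal Apache Beam sketch of the Dataflow ETL step (names are assumptions).
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
from apache_beam.options.pipeline_options import PipelineOptions

def add_computed_fields(record):
    # Transform: derive an example computed field from source columns.
    record["is_closed_won"] = record.get("StageName") == "Closed Won"
    return record

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # assumption
        region="us-central1",                 # assumption
        temp_location="gs://my-bucket/tmp",   # assumption
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # Extract: read only the columns needed downstream, not whole files.
            | ReadFromParquet(
                "gs://my-grax-bucket/parquet/Opportunity/*.parquet",  # assumption
                columns=["Id", "StageName", "Amount"],
            )
            # Transform: add computed fields for business logic.
            | beam.Map(add_computed_fields)
            # Load: append rows into the BigQuery lakehouse table.
            | beam.io.WriteToBigQuery(
                "my-project:grax_lakehouse.opportunity",  # assumption
                schema="Id:STRING,StageName:STRING,Amount:FLOAT,is_closed_won:BOOLEAN",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```

Once loaded, the table can be queried and joined in BigQuery with GRAX data lake data and any additional data sets using standard SQL.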

Anti-patterns are:

  • Reading entire Parquet files instead of only the specific columns needed (illustrated in the sketch after this list)

  • Polling for new Parquet files instead of receiving a push notification

  • Moving GRAX data to track ingestion
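
The sketch below shows the preferred alternatives to the first two anti-patterns: reacting to a GCS object notification delivered through Pub/Sub instead of polling, and reading only the Parquet columns needed instead of the whole file. The project, subscription, and column names are assumptions for illustration.

```python
# Minimal sketch: push notifications + column-pruned Parquet reads (names are assumptions).
import json

import gcsfs
import pyarrow.parquet as pq
from google.cloud import pubsub_v1

fs = gcsfs.GCSFileSystem()

def handle_notification(message: pubsub_v1.subscriber.message.Message) -> None:
    # GCS object notifications arrive as JSON describing the new object.
    event = json.loads(message.data)
    path = f"{event['bucket']}/{event['name']}"
    if path.endswith(".parquet"):
        # Column pruning: read only the fields needed downstream.
        table = pq.read_table(fs.open(path), columns=["Id", "StageName", "Amount"])
        print(f"{path}: {table.num_rows} rows, columns {table.column_names}")
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "grax-parquet-events")  # assumption
future = subscriber.subscribe(subscription, callback=handle_notification)
future.result()  # block and process notifications as they arrive
```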
