GCP Data Lakehouse

The recommended data architecture on GCP is:

  • Data Format: Parquet

  • Storage: Google Cloud Storage (GCS)

  • ETL: Dataflow

  • Lakehouse: BigQuery

The recommended workflow is:

  • GRAX writes data to GCS in Parquet

  • Dataflow (see the pipeline sketch after this list)

    • Notified when new Parquet is available

    • Python script reads new Parquet data

    • Extracts objects and fields for downstream

    • Transforms fields into computed fields for business logic

    • Loads into BigQuery

  • BigQuery

    • Queries and joins multiple data sets

      • GRAX ETL data

      • GRAX data lake data

      • Additional data sets
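
As a rough illustration of the Dataflow step above, the sketch below covers the extract/transform/load portion: it reads only the needed columns from newly arrived Parquet in GCS, adds a computed field for business logic, and appends the result to BigQuery. The project, bucket, dataset, table, column, and computed-field names are assumptions for illustration, not part of the GRAX product.

```python
# Minimal Apache Beam sketch of the Dataflow ETL step (names are assumptions).
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
from apache_beam.options.pipeline_options import PipelineOptions

def add_computed_fields(record):
    # Transform: derive an example computed field from source columns.
    record["is_closed_won"] = record.get("StageName") == "Closed Won"
    return record

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # assumption
        region="us-central1",                 # assumption
        temp_location="gs://my-bucket/tmp",   # assumption
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # Extract: read only the columns needed downstream, not whole files.
            | ReadFromParquet(
                "gs://my-grax-bucket/parquet/Opportunity/*.parquet",  # assumption
                columns=["Id", "StageName", "Amount"],
            )
            # Transform: add computed fields for business logic.
            | beam.Map(add_computed_fields)
            # Load: append rows into the BigQuery lakehouse table.
            | beam.io.WriteToBigQuery(
                "my-project:grax_lakehouse.opportunity",  # assumption
                schema="Id:STRING,StageName:STRING,Amount:FLOAT,is_closed_won:BOOLEAN",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

if __name__ == "__main__":
    run()
```

Once loaded, the table can be queried and joined in BigQuery with GRAX data lake data and any additional data sets using standard SQL.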

Anti-patterns are:

  • Reading entire Parquet files instead of only the specific columns needed (illustrated in the sketch after this list)

  • Polling for new Parquet files instead of receiving a push notification

  • Moving GRAX data to track ingestion
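
The sketch below shows the preferred alternatives to the first two anti-patterns: reacting to a GCS object notification delivered through Pub/Sub instead of polling, and reading only the Parquet columns needed instead of the whole file. The project, subscription, and column names are assumptions for illustration.

```python
# Minimal sketch: push notifications + column-pruned Parquet reads (names are assumptions).
import json

import gcsfs
import pyarrow.parquet as pq
from google.cloud import pubsub_v1

fs = gcsfs.GCSFileSystem()

def handle_notification(message: pubsub_v1.subscriber.message.Message) -> None:
    # GCS object notifications arrive as JSON describing the new object.
    event = json.loads(message.data)
    path = f"{event['bucket']}/{event['name']}"
    if path.endswith(".parquet"):
        # Column pruning: read only the fields needed downstream.
        table = pq.read_table(fs.open(path), columns=["Id", "StageName", "Amount"])
        print(f"{path}: {table.num_rows} rows, columns {table.column_names}")
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "grax-parquet-events")  # assumption
future = subscriber.subscribe(subscription, callback=handle_notification)
future.result()  # block and process notifications as they arrive
```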
