# GCP Data Lakehouse

The recommended data architecture on GCP is:

* Data Format: Parquet
* Storage: Google Cloud Storage (GCS)
* ETL: Dataflow
* Lakehouse: BigQuery

The recommended workflow is:

* GRAX writes data to GCS in Parquet format
* Dataflow (see the pipeline sketch after this list)
  * Receives a notification when new Parquet data is available
  * A Python pipeline reads the new Parquet data
  * Extracts the objects and fields needed downstream
  * Transforms fields into computed fields that encode business logic
  * Loads the results into BigQuery
* BigQuery (see the query sketch after this list)
  * Queries and joins multiple data sets
    * GRAX ETL data
    * GRAX Data Lake data
    * Additional data sets
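
A minimal sketch of the Dataflow stage as an Apache Beam streaming pipeline, not GRAX's actual implementation: the project, subscription, table, and column names (`my-project`, `new-grax-parquet`, `my_dataset.accounts`, `Id`, `Name`, `AnnualRevenue`) and the computed field are hypothetical placeholders, and it assumes GCS object-finalize notifications are already wired to a Pub/Sub topic (see the anti-patterns below).

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import pyarrow.parquet as pq


def read_selected_columns(message_bytes):
    """Parse a GCS notification and read only the columns we need."""
    event = json.loads(message_bytes)
    path = f"gs://{event['bucket']}/{event['name']}"
    # Read specific columns, not the whole file (see anti-patterns below).
    # pyarrow resolves gs:// paths with its built-in GCS filesystem.
    table = pq.read_table(path, columns=["Id", "Name", "AnnualRevenue"])
    yield from table.to_pylist()


def add_computed_fields(row):
    """Derive a computed field that encodes business logic."""
    row["is_enterprise"] = (row.get("AnnualRevenue") or 0) > 1_000_000
    return row


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadNotifications" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/new-grax-parquet")
        | "ReadParquet" >> beam.FlatMap(read_selected_columns)
        | "Transform" >> beam.Map(add_computed_fields)
        | "LoadToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.accounts",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```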
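
And a minimal sketch of the BigQuery stage using the Python client, joining the GRAX ETL output with other data sets; the project, dataset, and table names (`grax_etl.accounts`, `grax_datalake.opportunities`, `crm_extras.territories`) are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Join GRAX ETL data with GRAX Data Lake data and an additional data set.
sql = """
SELECT a.Id, a.Name, o.Amount, t.region
FROM `my-project.grax_etl.accounts` AS a
JOIN `my-project.grax_datalake.opportunities` AS o
  ON o.AccountId = a.Id
LEFT JOIN `my-project.crm_extras.territories` AS t
  ON t.account_id = a.Id
"""

for row in client.query(sql).result():
    print(row.Id, row.Name, row.Amount, row.region)
```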

Anti-patterns are:

* Reading entire Parquet files instead of only the columns you need
* Polling for new Parquet files instead of subscribing to push notifications
* Moving or copying GRAX data just to track ingestion progress
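
To make the first two points concrete, here is a minimal sketch: read only the needed columns with pyarrow, and configure GCS to push object-finalize events to a Pub/Sub topic (a one-time setup) instead of polling the bucket. The bucket, topic, and column names are hypothetical placeholders.

```python
import pyarrow.parquet as pq
from google.cloud import storage

# Good: read only the columns you need, not the entire file.
table = pq.read_table(
    "gs://my-grax-bucket/path/to/object.parquet",
    columns=["Id", "Name"],
)

# Good: one-time setup so GCS pushes new-object events to Pub/Sub,
# removing the need to poll the bucket for new Parquet files.
client = storage.Client()
bucket = client.bucket("my-grax-bucket")
notification = bucket.notification(
    topic_name="new-grax-parquet",
    event_types=["OBJECT_FINALIZE"],
    payload_format="JSON_API_V1",
)
notification.create()
```

For the third point, one approach is to record which objects have been processed in your own metadata (for example, the object names) rather than moving or renaming the GRAX data itself.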
