GRAX Data Lake automatically organizes your CRM data as Parquet files in S3 for arbitrary on-demand analytics queries.
GRAX Data Lakehouse uses AWS Glue to catalog your data lake data, and AWS Athena to query it by SQL queries that run on the Athena in a serverless and scalable fashion.
IAM
Create a access key and secret with access to:
S3 Parquet data (read only)
S3 Athena results data (write)
Athena and Glue resources
First use an admin role to create a policy for data lake resources, then a user and access key.
Next create a table around the Parquet data. Here we create a small table with a few standard Salesforce fields, as well as GRAX internal fields for managing versions.
QUERY="CREATE EXTERNAL TABLE IF NOT EXISTS object_account (
id string,
name string,
grax__idseq string,
grax__deleted timestamp
)
STORED AS PARQUET
LOCATION 's3://$BUCKET/parquet/v2/org=$ORG/object=Account/'
tblproperties ('parquet.compress'='SNAPPY');"
aws athena start-query-execution \
--query-string $QUERY \
--query-execution-context Database=default \
--result-configuration OutputLocation=s3://$BUCKET/athena-results \
--output text
Query
Finally we can query our data lake with aws athena. First we count how many Account versions we have and select one record:
Your GRAX data lake has rows for every version of every record. However many analytics questions start by only looking at the most current "live" data. Here we create a view that reads all versions but returns just the latest "live" data.
QUERY="CREATE OR REPLACE VIEW object_account_live AS
WITH max_idseq AS (
SELECT id AS mid, MAX(grax__idseq) AS max_idseq
FROM object_account
GROUP BY 1
),
live AS (
SELECT *
FROM object_account o
JOIN max_idseq m ON m.mid = o.id
AND grax__idseq = max_idseq
AND grax__deleted IS NULL
)
SELECT * FROM live
"
aws athena start-query-execution \
--query-string $QUERY \
--query-execution-context Database=default \
--result-configuration OutputLocation=s3://$BUCKET/athena-results \
--output text