AWS Data Lakehouse
GRAX Data Lake automatically organizes your CRM data as Parquet files in S3 for arbitrary on-demand analytics queries.
GRAX Data Lakehouse uses AWS Glue to catalog your data lake and Amazon Athena to query it with SQL in a serverless, scalable fashion.
IAM
Create an access key and secret with access to:
- S3 Parquet data (read only)
- S3 Athena results data (write)
- Athena and Glue resources
First use an admin role to create a policy for data lake resources, then a user and access key.
export ACCOUNTID=87338EXAMPLE
export AWS_PROFILE=admin
export AWS_REGION=us-east-1
export BUCKET=my-grax-bucket
cat >policy.json <<EOF
{
  "Statement": [
    {
      "Action": [
        "athena:StopQueryExecution",
        "athena:StartQueryExecution",
        "athena:ListQueryExecutions",
        "athena:GetQueryResults",
        "athena:GetQueryExecution"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:athena:$AWS_REGION:$ACCOUNTID:workgroup/primary"
    },
    {
      "Action": [
        "s3:ListBucketMultipartUploads",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::$BUCKET"
    },
    {
      "Action": "s3:GetObject",
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::$BUCKET/parquet/v2/*"
    },
    {
      "Action": [
        "s3:PutObject",
        "s3:ListMultipartUploadParts",
        "s3:GetObject",
        "s3:AbortMultipartUpload"
      ],
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::$BUCKET/athena-results/*"
    },
    {
      "Action": "glue:*",
      "Condition": {
        "ForAnyValue:StringEquals": {
          "aws:CalledVia": "athena.amazonaws.com"
        }
      },
      "Effect": "Allow",
      "Resource": ["arn:aws:glue:$AWS_REGION:$ACCOUNTID:*"]
    }
  ],
  "Version": "2012-10-17"
}
EOF
aws iam create-policy --policy-name datalake-athena --policy-document file://policy.json
aws iam create-user --user-name datalake-athena
aws iam attach-user-policy --user-name datalake-athena --policy-arn arn:aws:iam::$ACCOUNTID:policy/datalake-athena
aws iam create-access-key --user-name datalake-athena
{
  "AccessKey": {
    "UserName": "datalake-athena",
    "AccessKeyId": "AKIA4WWOSMD6PEXAMPLE",
    "Status": "Active",
    "SecretAccessKey": "<REDACTED>",
    "CreateDate": "2024-07-25T21:03:08+00:00"
  }
}
Save the access key and secret for next steps.
Databases and Tables
Next, configure an environment with your data lake credentials, then list which objects are in your data lake.
# configure AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA4WWOSMD6PEXAMPLE
export AWS_SECRET_ACCESS_KEY=<REDACTED>
export AWS_REGION=us-east-1
export BUCKET=my-grax-bucket
export ORG=00D46000001EXAMPLE
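You can see which objects have Parquet data by listing the prefixes under your org. Note that the policy above only allows S3 access when called via Athena, so run this listing with a profile that has direct S3 read access, such as the admin profile used earlier:
aws s3 ls --profile admin s3://$BUCKET/parquet/v2/org=$ORG/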
Next, create a table over the Parquet data. Here we create a small table with a few standard Salesforce fields, plus the GRAX internal fields used for managing versions.
QUERY="CREATE EXTERNAL TABLE IF NOT EXISTS object_account (
id string,
name string,
grax__idseq string,
grax__deleted timestamp
)
STORED AS PARQUET
LOCATION 's3://$BUCKET/parquet/v2/org=$ORG/object=Account/'
tblproperties ('parquet.compress'='SNAPPY');"
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://$BUCKET/athena-results \
  --output text
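start-query-execution returns immediately with a query execution ID. When scripting these steps, you can poll get-query-execution until the state reaches SUCCEEDED before reading results (a sketch; substitute the execution ID printed by the previous command):
aws athena get-query-execution \
  --query-execution-id <execution-id> \
  --query 'QueryExecution.Status.State' \
  --output text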
If you need to query all objects and fields, an AWS Glue crawler can automatically scan S3 and create the tables.
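For example, a crawler over the Parquet prefix might look like the following sketch. Creating and running a crawler requires broader permissions than the Athena-scoped policy above grants, so use the admin profile, and note that GraxGlueCrawlerRole is a placeholder for an IAM role the crawler assumes, which needs Glue and S3 read permissions:
aws glue create-crawler \
  --profile admin \
  --name grax-datalake \
  --role arn:aws:iam::$ACCOUNTID:role/GraxGlueCrawlerRole \
  --database-name default \
  --targets "{\"S3Targets\":[{\"Path\":\"s3://$BUCKET/parquet/v2/org=$ORG/\"}]}"
aws glue start-crawler --profile admin --name grax-datalake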
Query
Finally, we can query our data lake with the aws athena CLI. First we count how many Account versions we have and select one record:
QUERY="SELECT COUNT(*) FROM object_account"
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://$BUCKET/athena-results \
  --output text
654059a8-8455-4ecf-b539-3a694847aa15
aws athena get-query-results --query-execution-id 654059a8-8455-4ecf-b539-3a694847aa15
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "_col0"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "126178"
          }
        ]
      }
    ]
  }
}
QUERY="SELECT Id, Name FROM object_account LIMIT 1"
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://$BUCKET/athena-results \
  --output text
f7717a2a-19ef-4b81-9d1a-858abb847a6a
aws athena get-query-results --query-execution-id f7717a2a-19ef-4b81-9d1a-858abb847a6a
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "Id"
          },
          {
            "VarCharValue": "Name"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "0014600000zEXAMPLE"
          },
          {
            "VarCharValue": "Example Acct"
          }
        ]
      }
    ]
  }
}
Views
Your GRAX data lake has rows for every version of every record. However, many analytics questions start by looking only at the most current "live" data. Here we create a view that reads all versions but returns just the latest "live" rows.
QUERY="CREATE OR REPLACE VIEW object_account_live AS
WITH max_idseq AS (
SELECT id AS mid, MAX(grax__idseq) AS max_idseq
FROM object_account
GROUP BY 1
),
live AS (
SELECT *
FROM object_account o
JOIN max_idseq m ON m.mid = o.id
AND grax__idseq = max_idseq
AND grax__deleted IS NULL
)
SELECT * FROM live
"
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://$BUCKET/athena-results \
  --output text
Now we can query the live data easily:
QUERY="SELECT COUNT(*) FROM object_account_live"
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://$BUCKET/athena-results \
  --output text
e8a35172-27c2-418b-911b-7cd470837797
aws athena get-query-results --query-execution-id e8a35172-27c2-418b-911b-7cd470837797
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "_col0"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "23157"
          }
        ]
      }
    ]
  }
}
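Because the underlying table keeps every version, you can also query a record's history directly. For example, this sketch lists all versions of a single Account using the fields defined above (substitute a record ID from your own data):
QUERY="SELECT id, grax__idseq, grax__deleted FROM object_account WHERE id = '0014600000zEXAMPLE' ORDER BY grax__idseq"
aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://$BUCKET/athena-results \
  --output text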
Resetting Your Data Lake
First, disable all objects in GRAX Data Lake. This stops new data from being written.
Next, clear out the Parquet data from S3:
aws s3 rm --recursive s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/
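Deleting data is destructive, so consider previewing the removal first; aws s3 rm supports a --dryrun flag that prints what would be deleted without removing anything:
aws s3 rm --dryrun --recursive s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/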
Next, delete all objects in GRAX Data Lake. This resets each object back to the beginning of time.
Finally, re-enable all objects in GRAX Data Lake. This rewrites all data from the beginning of time.