DuckDB Data Lakehouse
GRAX Data Lake automatically replicates your Salesforce data as Parquet files in S3 for arbitrary on-demand analytics queries.
DuckDB lets you query your data lake data with a rich dialect of SQL from your laptop. First install it from https://duckdb.org or Mac Homebrew:
brew install duckdbIAM
Create a access key and secret with access to:
S3 parquet data (read only)
First use an admin role to create a policy for the data lake access, then a user and access key with the policy.
export AWS_PROFILE=admin
export BUCKET=my-grax-bucket
cat >policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket*",
"s3:GetBucket*",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::$BUCKET",
"arn:aws:s3:::$BUCKET/*"
]
}
]
}
EOF
aws iam create-policy --policy-name datalake-duckdb --policy-document file://policy.json
aws iam create-user --user-name datalake-duckdb
aws iam attach-user-policy --user-name datalake-duckdb --policy-arn arn:aws:iam::873385320700:policy/datalake-duckdb
aws iam create-access-key --user-name datalake-duckdbSave the access key and secret for next steps.
Query
Next configure an environment with your data lake credentials, then list what objects are in your data lake.
Finally we can query our data lake with duckdb. First we count how many Account versions we have and select one record:
Ad-Hoc Queries and Views
Your GRAX data lake has rows for every version of every record. However many analytics questions start by only looking at the most current "live" data. Here we create a view that reads all versions but returns just the latest "live" data.
Now we can query the live data easily:
Cached Data and Tables
Ad-hoc queries in the data lake are extremely powerful. As you ask questions of the data, you find common patterns that will be more efficient to cache locally versus always querying the data lake.
A common pattern with Salesforce data is turning user IDs into user names. We can easily create a persistent table with this data.
Now when you specify the local grax.duckdb database, you can always query users quickly and without hitting the data lake at all.
Finally we can mix and match, querying the data lake for Accounts and the local cache for users:
Last updated
Was this helpful?

