Next configure an environment with your data lake credentials, then list what objects are in your data lake.
# configure AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA4WWOSMD6PEXAMPLE
export AWS_SECRET_ACCESS_KEY=<REDACTED>
export BUCKET=my-grax-bucket
# verify access and data lake data
aws s3 ls s3://$BUCKET/parquet/v2/
PRE org=00D46000001EXAMPLE/
aws s3 ls s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/
PRE object=Account/
PRE object=AccountContactRelation/
PRE object=Asset/
PRE object=Case/
PRE object=Contact/
PRE object=Opportunity/
PRE object=User/
Finally we can query our data lake with duckdb. First we count how many Account versions we have and select one record:
duckdb grax.duckdb
-- use AWS env vars
CREATE OR REPLACE PERSISTENT SECRET s3 (TYPE S3, PROVIDER CREDENTIAL_CHAIN);
-- count all Account versions
SELECT COUNT(*) FROM READ_PARQUET("s3://my-grax-s3bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet");
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 125562 │
└──────────────┘
SELECT Id, Name FROM READ_PARQUET("s3://my-grax-s3bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet") LIMIT 1;
┌────────────────────┬──────────────┐
│ Id │ Name │
│ varchar │ varchar │
├────────────────────┼──────────────┤
│ 0014600000zEXAMPLE │ Example Acct │
└────────────────────┴──────────────┘
Ad-Hoc Queries and Views
Your GRAX data lake has rows for every version of every record. However many analytics questions start by only looking at the most current "live" data. Here we create a view that reads all versions but returns just the latest "live" data.
CREATE OR REPLACE VIEW object_account_live AS
WITH object_account AS (
SELECT *
FROM READ_PARQUET("s3://my-grax-s3bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet")
),
object_account_max_idseq AS (
SELECT id AS mid, MAX(grax__idseq) AS max_idseq
FROM object_account
GROUP BY 1
),
object_account_live AS (
SELECT *
FROM object_account o
JOIN object_account_max_idseq m ON m.mid = o.id
AND grax__idseq = max_idseq
AND grax__deleted IS NULL
)
SELECT * FROM object_account_live;
Now we can query the live data easily:
SELECT COUNT(*) FROM object_account_live;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 23145 │
└──────────────┘
SELECT Id, Name, LastModifiedById, source__modified FROM object_account_live ORDER BY source__modified DESC LIMIT 1;
┌────────────────────┬───────────────┬────────────────────┬──────────────────────────┐
│ Id │ Name │ LastModifiedById │ source__modified │
│ varchar │ varchar │ varchar │ timestamp with time zone │
├────────────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ 0014600000zEXAMPLE │ Example Acct │ 00546000001EXAMPLE │ 2024-07-25 10:31:06-07 │
└────────────────────┴───────────────┴────────────────────┴──────────────────────────┘
Cached Data and Tables
Ad-hoc queries in the data lake are extremely powerful. As you ask questions of the data, you find common patterns that will be more efficient to cache locally versus always querying the data lake.
A common pattern with Salesforce data is turning user IDs into user names. We can easily create a persistent table with this data.
CREATE OR REPLACE TABLE user AS
WITH object_user AS (
SELECT * FROM READ_PARQUET("s3://my-grax-s3bucket/parquet/v2/org=00D46000001EXAMPLE/object=User/**/*.parquet")
),
object_user_max AS (
SELECT id AS mid, MAX(grax__idseq) AS max_idseq
FROM object_user
GROUP BY 1
),
object_user_live AS (
SELECT *
FROM object_user o
JOIN object_user_max m ON m.mid = o.id
AND grax__idseq = max_idseq
AND grax__deleted IS NULL
)
SELECT Id, Name FROM object_user_live;
Now when you specify the local grax.duckdb database, you can always query users quickly and without hitting the data lake at all.
duckdb grax.duckdb
SELECT COUNT(*) FROM user;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 103 │
└──────────────┘
SELECT Id, Name, LastModifiedById, source__modified FROM object_account_live ORDER BY source__modified DESC LIMIT 1;
Finally we can mix and match, querying the data lake for Accounts and the local cache for users:
SELECT
a.Id,
a.Name,
u.Name AS LastModifiedByName
FROM object_account_live a
JOIN user u ON u.Id = a.LastModifiedById
ORDER BY a.source__modified DESC
LIMIT 1;
┌────────────────────┬──────────────┬─────────────────────┐
│ Id │ Name │ LastModifiedByName │
│ varchar │ varchar │ varchar │
├────────────────────┼──────────────┼─────────────────────┤
│ 0014600000zEXAMPLE │ Example Acct │ Jane Example │
└────────────────────┴──────────────┴─────────────────────┘