DuckDB Data Lake

GRAX Data Lake automatically replicates your Salesforce data as Parquet files in S3 so you can run arbitrary analytics queries on demand.

DuckDB lets you query your data lake with a rich dialect of SQL, right from your laptop. First install it from https://duckdb.org or via Homebrew on macOS:

brew install duckdb
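
You can confirm the install worked:

duckdb --version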

IAM

Create an access key and secret with access to:

  • S3 parquet data (read only)

First use an admin role to create a policy granting data lake access, then create a user and an access key with the policy attached.

export AWS_PROFILE=admin
export BUCKET=my-grax-bucket

cat >policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket*",
        "s3:GetBucket*",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::$BUCKET",
        "arn:aws:s3:::$BUCKET/*"
      ]
    }
  ]
}
EOF

aws iam create-policy --policy-name datalake-duckdb --policy-document file://policy.json
aws iam create-user --user-name datalake-duckdb
# attach the policy using the Arn returned by create-policy (873385320700 is this example's account ID)
aws iam attach-user-policy --user-name datalake-duckdb --policy-arn arn:aws:iam::873385320700:policy/datalake-duckdb
aws iam create-access-key --user-name datalake-duckdb
{
    "AccessKey": {
        "UserName": "datalake-duckdb",
        "AccessKeyId": "AKIA4WWOSMD6PEXAMPLE",
        "Status": "Active",
        "SecretAccessKey": "<REDACTED>",
        "CreateDate": "2024-07-25T21:03:08+00:00"
    }
}

Save the access key and secret for next steps.
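
If you prefer not to export raw keys in every shell, one alternative is to store them in a named AWS profile (a sketch; the profile name datalake-duckdb is arbitrary):

aws configure set aws_access_key_id AKIA4WWOSMD6PEXAMPLE --profile datalake-duckdb
aws configure set aws_secret_access_key <REDACTED> --profile datalake-duckdb
export AWS_PROFILE=datalake-duckdb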

Query

Next, configure your environment with the data lake credentials, then list which objects are in your data lake.

# configure AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA4WWOSMD6PEXAMPLE
export AWS_SECRET_ACCESS_KEY=<REDACTED>
export BUCKET=my-grax-bucket

# verify access and data lake data
aws s3 ls s3://$BUCKET/parquet/v2/
    PRE org=00D46000001EXAMPLE/

aws s3 ls s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/
    PRE object=Account/
    PRE object=AccountContactRelation/
    PRE object=Asset/
    PRE object=Case/
    PRE object=Contact/
    PRE object=Opportunity/
    PRE object=User/
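
Before running large scans, it can help to gauge how much Account data is in the lake. The standard aws s3 ls flags summarize a prefix:

aws s3 ls s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/object=Account/ --recursive --human-readable --summarize | tail -2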

Finally, we can query the data lake with DuckDB. First we count how many Account versions we have, then select a single record:

duckdb grax.duckdb
-- create an S3 secret using the AWS credential chain (the env vars above)
CREATE OR REPLACE PERSISTENT SECRET s3 (TYPE S3, PROVIDER CREDENTIAL_CHAIN);

-- count all Account versions
SELECT COUNT(*) FROM READ_PARQUET("s3://my-grax-bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet");

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       125562 │
└──────────────┘

SELECT Id, Name FROM READ_PARQUET("s3://my-grax-bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet") LIMIT 1;

┌────────────────────┬──────────────┐
│         Id         │     Name     │
│      varchar       │   varchar    │
├────────────────────┼──────────────┤
│ 0014600000zEXAMPLE │ Example Acct │
└────────────────────┴──────────────┘
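
If your bucket lives in a region other than your AWS default, the secret can name it explicitly; REGION is a standard option on DuckDB S3 secrets (us-east-1 below is only an example):

CREATE OR REPLACE PERSISTENT SECRET s3 (TYPE S3, PROVIDER CREDENTIAL_CHAIN, REGION 'us-east-1');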

Ad-Hoc Queries and Views

Your GRAX data lake has rows for every version of every record. However, many analytics questions start by looking only at the most current "live" data. Here we create a view that reads all versions but returns only the latest "live" version of each record.

CREATE OR REPLACE VIEW object_account_live AS
WITH object_account AS (
  -- all versions of all Account records
  SELECT *
  FROM READ_PARQUET("s3://my-grax-bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet")
),
object_account_max_idseq AS (
  -- latest version sequence per record Id
  SELECT id AS mid, MAX(grax__idseq) AS max_idseq
  FROM object_account
  GROUP BY 1
),
object_account_live AS (
  -- keep only each record's latest version, excluding deleted records
  SELECT *
  FROM object_account o
  JOIN object_account_max_idseq m ON m.mid = o.id
  AND grax__idseq = max_idseq
  AND grax__deleted IS NULL
)
SELECT * FROM object_account_live;
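
The same "latest live version" filter can be written more compactly with a window function; DuckDB's QUALIFY clause filters on window results (a sketch that returns the same rows as the view above):

SELECT *
FROM READ_PARQUET("s3://my-grax-bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet")
QUALIFY grax__idseq = MAX(grax__idseq) OVER (PARTITION BY id)
  AND grax__deleted IS NULL;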

Now we can query the live data easily:

SELECT COUNT(*) FROM object_account_live;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        23145 │
└──────────────┘

SELECT Id, Name, LastModifiedById, source__modified FROM object_account_live ORDER BY source__modified DESC LIMIT 1;

┌────────────────────┬───────────────┬────────────────────┬──────────────────────────┐
│         Id         │     Name      │  LastModifiedById  │     source__modified     │
│      varchar       │    varchar    │      varchar       │ timestamp with time zone │
├────────────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ 0014600000zEXAMPLE │ Example Acct  │ 00546000001EXAMPLE │ 2024-07-25 10:31:06-07   │
└────────────────────┴───────────────┴────────────────────┴──────────────────────────┘
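
Because the lake keeps every version, historical questions are also one query away, for example finding the most heavily edited Accounts (a sketch over the same Parquet path):

SELECT Id, COUNT(*) AS versions
FROM READ_PARQUET("s3://my-grax-bucket/parquet/v2/org=00D46000001EXAMPLE/object=Account/**/*.parquet")
GROUP BY Id
ORDER BY versions DESC
LIMIT 5;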

Cached Data and Tables

Ad-hoc queries in the data lake are extremely powerful. As you ask questions of the data, you will find common patterns that are more efficient to cache locally than to query from the data lake every time.

A common pattern with Salesforce data is resolving user IDs into user names. We can easily create a persistent local table with this data.

CREATE OR REPLACE TABLE user AS
WITH object_user AS (
  -- all versions of all User records
  SELECT * FROM READ_PARQUET("s3://my-grax-bucket/parquet/v2/org=00D46000001EXAMPLE/object=User/**/*.parquet")
),
object_user_max AS (
  -- latest version sequence per record Id
  SELECT id AS mid, MAX(grax__idseq) AS max_idseq
  FROM object_user
  GROUP BY 1
),
object_user_live AS (
  -- keep only each record's latest version, excluding deleted records
  SELECT *
  FROM object_user o
  JOIN object_user_max m ON m.mid = o.id
  AND grax__idseq = max_idseq
  AND grax__deleted IS NULL
)
SELECT Id, Name FROM object_user_live;

Now, whenever you open the local grax.duckdb database, you can query users quickly without hitting the data lake at all.

duckdb grax.duckdb
SELECT COUNT(*) FROM user;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│          103 │
└──────────────┘

Finally, we can mix and match, querying the data lake for Accounts and the local cache for Users:

SELECT
  a.Id,
  a.Name,
  u.Name AS LastModifiedByName
FROM object_account_live a
JOIN user u ON u.Id = a.LastModifiedById
ORDER BY a.source__modified DESC
LIMIT 1;

┌────────────────────┬──────────────┬─────────────────────┐
│         Id         │     Name     │ LastModifiedByName  │
│      varchar       │ varchar      │       varchar       │
├────────────────────┼──────────────┼─────────────────────┤
│ 0014600000zEXAMPLE │ Example Acct │ Jane Example        │
└────────────────────┴──────────────┴─────────────────────┘
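
From here, results can be handed off to other tools. DuckDB's COPY statement writes any query result to a file, for example CSV (a sketch; the output filename is arbitrary):

COPY (
  SELECT a.Id, a.Name, u.Name AS LastModifiedByName
  FROM object_account_live a
  JOIN user u ON u.Id = a.LastModifiedById
) TO 'accounts_with_owners.csv' (HEADER, DELIMITER ',');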