Snowflake Tables from Data Lake

This guide sets up a Snowflake integration that reads Salesforce data from your GRAX Data Lake S3 bucket, loads it into Snowflake tables, and keeps those tables refreshed on a schedule.

Every command runs inside a Snowflake worksheet except one edit to an AWS IAM role trust policy in Step 2. All SQL is given as copy-paste blocks; run them top-to-bottom.

The example names (GRAX_DATA, GRAX_WH, GRAX_S3_INT, etc.) can be changed to match your naming standards, but if you change one, update every occurrence. Values written as <ALL_CAPS_IN_ANGLE_BRACKETS> are placeholders you must replace with your own values before running the SQL.

Before you start

Gather the items below. If anything is missing, stop here and contact GRAX Support or your AWS administrator.

Item
Details

S3 bucket name

The S3 bucket used by your GRAX application for the Data Lake. Available in your GRAX environment configuration.

Salesforce org ID

The 15- or 18-character ID of the org whose data you want to load. Also in your GRAX environment configuration.

AWS IAM role ARN

A role Snowflake will assume to read from S3, with s3:GetObject and s3:ListBucket on the parquet/v2/ prefix of the bucket. If you don't already have one, follow Snowflake's S3 storage integration guidearrow-up-right (Steps 1 and 2 only; the rest is covered below), or ask GRAX Support.

AWS trust-policy edit access

You will paste two values into that role's trust policy in Step 2. This is the only AWS-side change required.

Snowflake privileges

ACCOUNTADMIN, or a role with CREATE INTEGRATION and EXECUTE TASK granted at the account level.

Snowflake database and warehouse

Used to hold the tables and run the copy task. The examples below use GRAX_DATA and GRAX_WH. If you don't have them yet, create them with the block below the table.

CREATE DATABASE IF NOT EXISTS GRAX_DATA;
CREATE WAREHOUSE IF NOT EXISTS GRAX_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

The rest of this guide uses the PUBLIC schema, which exists by default in every Snowflake database, so no schema creation is needed. If you want to isolate this integration from other workloads in GRAX_DATA, create a dedicated schema now and substitute its name for PUBLIC everywhere below:

CREATE SCHEMA IF NOT EXISTS GRAX;

Step 1: Create the storage integration

The storage integration is Snowflake's record of which AWS role it will assume and which S3 locations it is permitted to read. Replace the three placeholders below with the role ARN, bucket name, and Org ID you gathered above.

Step 2: Finish the AWS trust policy

The integration you just created has a generated IAM user and external ID. Paste both into the IAM role's trust policy so the role will accept Snowflake's assume-role calls.

From the output, note two values:

  • STORAGE_AWS_IAM_USER_ARN, which looks like arn:aws:iam::123456789012:user/abc1-s-v2st1234

  • STORAGE_AWS_EXTERNAL_ID, which looks like ACME_SFCRole=1_abcDEF123...

In the AWS console (or your preferred IAM tooling), edit the role's trust policy to match the block below, substituting both values:

Save the trust policy. Snowflake validates the connection the first time it reads from S3, at the end of Step 3.

Step 3: Create the schema, file format, and external stage

These three objects are the plumbing that turns the storage integration into something you can query. Replace the two placeholders with the same bucket name and Org ID you used in Step 1.

Verify the connection by listing what the stage sees:

You should get back Parquet files organized under object=<SalesforceObjectName>/ prefixes (for example object=Account/, object=Contact/). Do not continue until LIST @GRAX_STAGE returns files; every later step depends on it.

Step 4: Load your first object

This step creates a Snowflake table for the Account object and loads data into it. The INFER_SCHEMA function reads a sample of the Parquet files and generates the correct column definitions automatically, so you don't have to hand-write the schema. MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE tells COPY INTO to map Parquet fields to Snowflake columns by name rather than position, which protects you from column-order changes in future GRAX exports.

The block is safe to re-run. CREATE TABLE IF NOT EXISTS leaves an existing table untouched, ALTER TABLE ... SET is idempotent if already applied, and COPY INTO only loads files it hasn't loaded before. Schema evolution is enabled so new Salesforce fields appear as new columns on the next copy instead of breaking it.

To confirm it worked:

To load another object, repeat the two statements above and change every occurrence of Account (case-sensitive, since it is part of the S3 path) and ACCOUNT (the Snowflake table name) to the Salesforce API name of the object you want, for example Contact, Opportunity, or Lead. The Salesforce API name is what GRAX uses as the folder name under object=.../ in S3.

Step 5: Automate ongoing refresh

Step 5 replaces the per-object COPY INTO from Step 4 with a stored procedure that loops over every object you care about, plus a scheduled task that runs the procedure on a cadence. The procedure uses CREATE TABLE IF NOT EXISTS, so it creates missing tables on first run and leaves existing ones alone, and it sets ENABLE_SCHEMA_EVOLUTION = TRUE, so new Salesforce fields appear as new columns without breaking the copy.

Edit the ARRAY_CONSTRUCT(...) list below to include every Salesforce object you want to keep in Snowflake.

Create the task that calls the procedure on a schedule. The example runs every 60 minutes; adjust SCHEDULE if you want a different cadence.

To trigger the task immediately instead of waiting for the first scheduled run:

Verify and query

Check that the task is scheduled and see recent runs:

Query your Salesforce data:

Last updated

Was this helpful?