# AWS Data Lakehouse

GRAX Data Lake automatically organizes your CRM data as Parquet files in S3 for arbitrary on-demand analytics queries.

GRAX Data Lakehouse uses AWS Glue to catalog your data lake data, and AWS Athena to query it by SQL queries that run on the Athena in a serverless and scalable fashion.

## CloudFormation Quick Deploy

Use [CloudFormation Quick Deploy](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateURL=https://s3.amazonaws.com/grax-public-templates/master/cloudformation/lakehouse.yml\&stackName=lakehouse) to set up a Lakehouse.&#x20;

You will need to switch to the correct region that your GRAX deployment and data is in.&#x20;

By default the stack creates a new S3 Bucket, Glue Data Catalog and Athena Database all configured properly for storing and querying data.

Connect the S3 Bucket to GRAX with the provide Role ARN.

### Existing Bucket

If you already have a bucket with GRAX data, fill in `S3BucketName` .&#x20;

The template will configure Glue and Athena to use the existing bucket and data.

<figure><img src="/files/7wM9tYK1XZnPb080R6Gs" alt=""><figcaption></figcaption></figure>

After you create the stack you will need to configure your bucket to send event notifications to the SQS queue in `SQSQueueArn` output. If your org ID is `00D46000001EXAMPLE`, use `parquet/v2/ORG%3D00D46000001EXAMPLE` as your notification prefix filter to avoid issues with the `=` special character.

{% embed url="<https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html>" %}

{% embed url="<https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html>" %}

### IAM Access Keys

To connect the lakehouse to an external system like Metabase or your laptop, set `S3AccessMethod` to `User` . The IAM key and secret is available in Secrets Manager.

<figure><img src="/files/ChRcX22YiwTeIEpXvMLj" alt=""><figcaption></figcaption></figure>

## Query

Next configure an environment with your data lake credentials, then list what objects are in your data lake.

```bash
# configure AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA4WWOSMD6PEXAMPLE
export AWS_SECRET_ACCESS_KEY=<REDACTED>
export AWS_REGION=us-east-1
export BUCKET=my-grax-bucket
export ORG=00D46000001EXAMPLE
```

Finally we can query our data lake with `aws athena`. First we count how many Account versions we have and select one record:

```bash
QUERY="SELECT COUNT(*) FROM object_account"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text

654059a8-8455-4ecf-b539-3a694847aa15

aws athena get-query-results --query-execution-id 654059a8-8455-4ecf-b539-3a694847aa15
```

```json
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "_col0"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "126178"
          }
        ]
      }
    ]
  }
}
```

```bash
QUERY="SELECT Id, Name FROM object_account LIMIT 1"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text

f7717a2a-19ef-4b81-9d1a-858abb847a6a

aws athena get-query-results --query-execution-id f7717a2a-19ef-4b81-9d1a-858abb847a6a
```

```json
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "Id"
          },
          {
            "VarCharValue": "Name"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "0014600000zEXAMPLE"
          },
          {
            "VarCharValue": "Example Acct"
          }
        ]
      }
    ]
  }
}
```

## Views

Your GRAX data lake has rows for every version of every record. However many analytics questions start by only looking at the most current "live" data. Here we create a view that reads all versions but returns just the latest "live" data.

```bash
QUERY="CREATE OR REPLACE VIEW object_account_live AS
WITH max_idseq AS (
  SELECT id AS mid, MAX(grax__idseq) AS max_idseq
  FROM object_account
  GROUP BY 1
),
live AS (
  SELECT *
  FROM object_account o
  JOIN max_idseq m ON m.mid = o.id
  AND grax__idseq = max_idseq
  AND grax__deleted IS NULL
)
SELECT * FROM live
"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text
```

Now we can query the live data easily:

```bash
QUERY="SELECT COUNT(*) FROM object_account_live"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text

e8a35172-27c2-418b-911b-7cd470837797

aws athena get-query-results --query-execution-id e8a35172-27c2-418b-911b-7cd470837797
```

```json
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "_col0"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "23157"
          }
        ]
      }
    ]
  }
}
```

## Resetting Your Data Lake

First disable all objects in GRAX Data Lake. This stops new data from writing.

Next, clear out the Parquet data from S3:

```bash
aws s3 rm --recursive s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/
```

Next delete all objects in GRAX Data Lake. This will reset objects back to the beginning of time.

Finally, re-enable all objects in GRAX Data Lake. This will rewrite all data from the beginning of time.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://documentation.grax.com/reuse-data/data-lake/aws-data-lakehouse.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
