# AWS Data Lakehouse

GRAX Data Lake automatically organizes your CRM data as Parquet files in S3 for arbitrary on-demand analytics queries.

GRAX Data Lakehouse uses AWS Glue to catalog your data lake data, and AWS Athena to query it by SQL queries that run on the Athena in a serverless and scalable fashion.

## CloudFormation Quick Deploy

Use [CloudFormation Quick Deploy](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateURL=https://s3.amazonaws.com/grax-public-templates/master/cloudformation/lakehouse.yml\&stackName=lakehouse) to set up a Lakehouse.&#x20;

You will need to switch to the correct region that your GRAX deployment and data is in.&#x20;

By default the stack creates a new S3 Bucket, Glue Data Catalog and Athena Database all configured properly for storing and querying data.

Connect the S3 Bucket to GRAX with the provide Role ARN.

### Existing Bucket

If you already have a bucket with GRAX data, fill in `S3BucketName` .&#x20;

The template will configure Glue and Athena to use the existing bucket and data.

<figure><img src="https://4150568565-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwHKnqFEg4DROpG3KCq3D%2Fuploads%2FMnwPwjtH6Tp5j2buGOYI%2FCloudFormation%20-%20Stack%20lakehouse.jpeg?alt=media&#x26;token=65ea86f5-0c3d-493e-8e5e-a97adf81adae" alt=""><figcaption></figcaption></figure>

After you create the stack you will need to configure your bucket to send event notifications to the SQS queue in `SQSQueueArn` output. If your org ID is `00D46000001EXAMPLE`, use `parquet/v2/ORG%3D00D46000001EXAMPLE` as your notification prefix filter to avoid issues with the `=` special character.

{% embed url="<https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html>" %}

{% embed url="<https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html>" %}

### IAM Access Keys

To connect the lakehouse to an external system like Metabase or your laptop, set `S3AccessMethod` to `User` . The IAM key and secret is available in Secrets Manager.

<figure><img src="https://4150568565-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwHKnqFEg4DROpG3KCq3D%2Fuploads%2FVEwDYr1lIhgLkdSwqara%2Flakehouse-athena-access-key%20_%20Secrets%20Manager%20_%20us-east-1%20%C2%B7%2011.24am%20%C2%B7%2008-07.jpeg?alt=media&#x26;token=e2d3b144-e4e6-48cf-a207-cabca6f5d8eb" alt=""><figcaption></figcaption></figure>

## Query

Next configure an environment with your data lake credentials, then list what objects are in your data lake.

```bash
# configure AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA4WWOSMD6PEXAMPLE
export AWS_SECRET_ACCESS_KEY=<REDACTED>
export AWS_REGION=us-east-1
export BUCKET=my-grax-bucket
export ORG=00D46000001EXAMPLE
```

Finally we can query our data lake with `aws athena`. First we count how many Account versions we have and select one record:

```bash
QUERY="SELECT COUNT(*) FROM object_account"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text

654059a8-8455-4ecf-b539-3a694847aa15

aws athena get-query-results --query-execution-id 654059a8-8455-4ecf-b539-3a694847aa15
```

```json
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "_col0"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "126178"
          }
        ]
      }
    ]
  }
}
```

```bash
QUERY="SELECT Id, Name FROM object_account LIMIT 1"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text

f7717a2a-19ef-4b81-9d1a-858abb847a6a

aws athena get-query-results --query-execution-id f7717a2a-19ef-4b81-9d1a-858abb847a6a
```

```json
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "Id"
          },
          {
            "VarCharValue": "Name"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "0014600000zEXAMPLE"
          },
          {
            "VarCharValue": "Example Acct"
          }
        ]
      }
    ]
  }
}
```

## Views

Your GRAX data lake has rows for every version of every record. However many analytics questions start by only looking at the most current "live" data. Here we create a view that reads all versions but returns just the latest "live" data.

```bash
QUERY="CREATE OR REPLACE VIEW object_account_live AS
WITH max_idseq AS (
  SELECT id AS mid, MAX(grax__idseq) AS max_idseq
  FROM object_account
  GROUP BY 1
),
live AS (
  SELECT *
  FROM object_account o
  JOIN max_idseq m ON m.mid = o.id
  AND grax__idseq = max_idseq
  AND grax__deleted IS NULL
)
SELECT * FROM live
"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text
```

Now we can query the live data easily:

```bash
QUERY="SELECT COUNT(*) FROM object_account_live"

aws athena start-query-execution \
   --query-string $QUERY \
   --query-execution-context Database=default \
   --result-configuration OutputLocation=s3://$BUCKET/athena-results \
   --output text

e8a35172-27c2-418b-911b-7cd470837797

aws athena get-query-results --query-execution-id e8a35172-27c2-418b-911b-7cd470837797
```

```json
{
  "ResultSet": {
    "Rows": [
      {
        "Data": [
          {
            "VarCharValue": "_col0"
          }
        ]
      },
      {
        "Data": [
          {
            "VarCharValue": "23157"
          }
        ]
      }
    ]
  }
}
```

## Resetting Your Data Lake

First disable all objects in GRAX Data Lake. This stops new data from writing.

Next, clear out the Parquet data from S3:

```bash
aws s3 rm --recursive s3://$BUCKET/parquet/v2/org=00D46000001EXAMPLE/
```

Next delete all objects in GRAX Data Lake. This will reset objects back to the beginning of time.

Finally, re-enable all objects in GRAX Data Lake. This will rewrite all data from the beginning of time.
