Heroku Data Lakehouse

GRAX Data Lake automatically organizes your CRM data as Parquet files in S3 for arbitrary on-demand analytics queries.

AWS Athena lets you run SQL queries against your data lake in a serverless, scalable fashion.

Heroku is a Platform as a Service (PaaS) that lets you deploy applications without worrying about the underlying infrastructure.

GRAX provides a Heroku Add-on to make querying your Athena Data Lake easy.

Connection Details

When you add the GRAX Data Lake add-on to your Heroku application, it sets environment variables that you can use to make an AWS Athena connection.

They are:

GRAX_AWS_ACCESS_KEY_ID=
GRAX_AWS_SECRET_ACCESS_KEY=
GRAX_AWS_REGION=
GRAX_S3_STAGING_DIR=
GRAX_ATHENA_WORKGROUP=
GRAX_ATHENA_DATABASE=
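
Before querying, it can be useful to confirm that all of these variables are present in your app's environment. The following is a minimal sketch that assumes only the variable names listed above:

import os

# Config vars the GRAX Data Lake add-on sets (see the list above)
REQUIRED_VARS = [
    "GRAX_AWS_ACCESS_KEY_ID",
    "GRAX_AWS_SECRET_ACCESS_KEY",
    "GRAX_AWS_REGION",
    "GRAX_S3_STAGING_DIR",
    "GRAX_ATHENA_WORKGROUP",
    "GRAX_ATHENA_DATABASE",
]

# Fail fast if the add-on has not populated the environment yet
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing GRAX config vars: {', '.join(missing)}")
print("All GRAX Data Lake environment variables are set.")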

Jupyter Notebook

You can deploy a configurable JupyterHub installation that reads these add-on environment variables and automatically configures a client that returns query results as a pandas DataFrame.

import grax_athena

# Run SQL against the data lake; the result is returned as a pandas DataFrame
df = grax_athena.query_data_lake("SELECT COUNT(*) FROM object_account")
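
Since the result is an ordinary pandas DataFrame, the standard pandas API applies, for example:

# Inspect the result with standard pandas calls
print(df.head())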

Connecting from Python

You can use any Athena client to connect to and query the data lake. Here is an example using the pyathena client:

import os
from pyathena import connect

# Retrieve AWS credentials from environment variables
aws_access_key_id = os.environ.get('GRAX_AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('GRAX_AWS_SECRET_ACCESS_KEY')
aws_region = os.environ.get('GRAX_AWS_REGION', 'us-east-1')

# Athena connection parameters
s3_staging_dir = os.environ.get('GRAX_S3_STAGING_DIR')
athena_database = os.environ.get('GRAX_ATHENA_DATABASE')
athena_workgroup = os.environ.get('GRAX_ATHENA_WORKGROUP')

# Establish connection to Athena
conn = connect(aws_access_key_id=aws_access_key_id,
               aws_secret_access_key=aws_secret_access_key,
               s3_staging_dir=s3_staging_dir,
               work_group=athena_workgroup,
               region_name=aws_region)
print("Connection to Athena established successfully.")

# Execute the query
cursor = conn.cursor()
query = f"SELECT COUNT(*) FROM {athena_database}.\"object_account\""
cursor.execute(query)

# Fetch and print the result
result = cursor.fetchone()
count = result[0]
print(f"Number of rows in object_account: {count}")