Heroku Data Lakehouse

GRAX Data Lake automatically organizes your CRM data as Parquet files in S3 for arbitrary on-demand analytics queries.

AWS Athena lets you query your data lake data by SQL queries that run on the Athena in a serverless and scalable fashion.

Heroku is a serverless PaaS (Platform as a Service) that allows you to deploy applications without worrying about the underlying infrastructure.

GRAX provides a Heroku Add-on to make querying your Athena Data Lake easy.

Connection Details

When you add the GRAX Data Lake add-on to your Heroku application, it exposes environment variables that allow you to make an AWS Athena connection.

They are:

GRAX_AWS_ACCESS_KEY_ID=
GRAX_AWS_SECRET_ACCESS_KEY=
GRAX_AWS_REGION=
GRAX_S3_STAGING_DIR=
GRAX_ATHENA_WORKGROUP=
GRAX_ATHENA_DATABASE=

Jupyter Notebook

You can deploy a configurable JupyterHub installation that expects the add-on environment variables from the add-on and automatically configures a client that returns a pandas data frame.

import grax_athena
df = grax_athena.query_data_lake("SELECT COUNT(*) FROM object_account")

Connecting from Python

You can use any Athena client to connect to and query the data lake. Here is an example using the pyathena client:

import os
from pyathena import connect

# Retrieve AWS credentials from environment variables
aws_access_key_id = os.environ.get('GRAX_AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('GRAX_AWS_SECRET_ACCESS_KEY')
aws_region = os.environ.get('GRAX_AWS_REGION', 'us-east-1')

# Athena connection parameters
s3_staging_dir = os.environ.get('GRAX_S3_STAGING_DIR')
athena_database = os.environ.get('GRAX_ATHENA_DATABASE')
athena_workgroup = os.environ.get('GRAX_ATHENA_WORKGROUP')

# Establish connection to Athena
conn = connect(aws_access_key_id=aws_access_key_id,
                aws_secret_access_key=aws_secret_access_key,
                s3_staging_dir=s3_staging_dir,
                work_group=athena_workgroup,
                region_name=aws_region)
print("Connection to Athena established successfully.")

# Execute the query
cursor = conn.cursor()
query = f"SELECT COUNT(*) FROM {athena_database}.\"object_account\""
cursor.execute(query)

# Fetch and print the result
result = cursor.fetchone()
count = result[0]
print(f"Number of rows in object_account: {count}")

Last updated 6 months ago

Was this helpful?