Data Lake FAQ

Some Data Lake use cases involve reading the Parquet files with a third-party tool. Take care to grant that access securely, without exposing the rest of the data in the bucket. The steps below outline how to do this with AWS S3 and IAM. Steps may vary for other cloud providers and services.

Cross Account Guide

This guide assumes that the consuming Principal is in a different AWS Account than the one that owns the S3 Bucket. If the consuming Principal is in the same AWS Account as the one that owns the S3 Bucket, then the steps below can be simplified.

Determine the consuming Principal

The first step is to determine the Principal that will consume the Parquet files. This is typically an IAM user, an IAM role, or an entire AWS Account. This example assumes the Principal is any identity in a specific AWS Account, which is written as that account's root ARN (arn:aws:iam::[AWS_ACCOUNT_NUMBER]:root).
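
As a small illustration of the account-wide Principal described above, the sketch below builds the root ARN from an account number. The helper name `account_root_arn` and the sample account number are hypothetical and chosen only for this example.

```python
import re


def account_root_arn(account_id: str) -> str:
    """Return the ARN that stands for an entire AWS Account as a Principal."""
    # AWS account numbers are 12-digit strings; reject anything else early.
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError(f"expected a 12-digit AWS account number, got {account_id!r}")
    return f"arn:aws:iam::{account_id}:root"


# Sample (made-up) account number, for illustration only.
print(account_root_arn("111122223333"))
```

Validating the account number up front is a design choice of this sketch; it avoids producing a policy that fails only when it is applied.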

Set or Modify the S3 Bucket Policy

The next step is to set or modify the S3 Bucket Policy so that the Principal can list and read the Parquet files and nothing else in the bucket. Either add the statements below to the existing Bucket Policy or create a new Bucket Policy. A new Policy should look something like this, with [MY_BUCKET_NAME] and [AWS_ACCOUNT_NUMBER] replaced with the appropriate values:

{
    "Version": "2012-10-17",
    "Id": "Policy1611277539797",
    "Statement": [
        {
            "Sid": "Parquet_Cross_Account_ListBucket",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::[AWS_ACCOUNT_NUMBER]:root"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::[MY_BUCKET_NAME]",
            "Condition": {
                "StringLike": {
                    "s3:prefix": "parquet/*"
                }
            }
        },
        {
            "Sid": "Parquet_Cross_Account_GetObject",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::[AWS_ACCOUNT_NUMBER]:root"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::[MY_BUCKET_NAME]/parquet/*"
        }
    ]
}
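
When a Bucket Policy already exists, the two statements have to be merged into it rather than written over it. The following is a minimal sketch of that merge, assuming statements are matched by their Sid. The function names `parquet_statements` and `merge_into_policy`, the bucket name, and the account number are all hypothetical.

```python
import json


def parquet_statements(bucket: str, account_id: str) -> list:
    """Build the two cross-account statements shown in the policy above."""
    root_arn = f"arn:aws:iam::{account_id}:root"
    return [
        {
            "Sid": "Parquet_Cross_Account_ListBucket",
            "Effect": "Allow",
            "Principal": {"AWS": root_arn},
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            # Listing is limited to keys under the parquet/ prefix.
            "Condition": {"StringLike": {"s3:prefix": "parquet/*"}},
        },
        {
            "Sid": "Parquet_Cross_Account_GetObject",
            "Effect": "Allow",
            "Principal": {"AWS": root_arn},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/parquet/*",
        },
    ]


def merge_into_policy(existing, new_statements: list) -> dict:
    """Return a policy holding the existing statements plus the new ones.

    Existing statements whose Sid matches a new statement are replaced.
    The input policy is not modified.
    """
    policy = dict(existing) if existing else {"Version": "2012-10-17"}
    current = policy.get("Statement", [])
    if isinstance(current, dict):
        # A policy may carry a single statement as an object instead of a list.
        current = [current]
    new_sids = {statement["Sid"] for statement in new_statements}
    kept = [s for s in current if s.get("Sid") not in new_sids]
    policy["Statement"] = kept + new_statements
    return policy


if __name__ == "__main__":
    # No existing policy: this produces a fresh one like the JSON above.
    fresh = merge_into_policy(None, parquet_statements("my-bucket", "111122223333"))
    print(json.dumps(fresh, indent=4))
```

With boto3, the current policy could be read with `get_bucket_policy` (its `Policy` field is a JSON string) and the merged result written back with `put_bucket_policy(Bucket=..., Policy=json.dumps(policy))`. Because `put_bucket_policy` replaces the whole policy, merging first keeps any statements that are already in place.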

Create an IAM Policy

The next step is to create an IAM Policy that grants read access to the Parquet files. This Policy will be attached to the role created in the next step, so it defines what the role can do once it has been assumed. The Policy should look something like this, with [MY_BUCKET_NAME] replaced with the appropriate value:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::[MY_BUCKET_NAME]",
            "Condition": {
                "StringLike": {
                    "s3:prefix": "parquet/*"
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::[MY_BUCKET_NAME]/parquet/*"
        }
    ]
}
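
The same Policy can also be created programmatically. The sketch below assumes the caller passes in a boto3 IAM client (for example `boto3.client("iam")`) and uses its `create_policy` call. The function names and the policy name `ParquetReadOnly` are hypothetical.

```python
import json


def parquet_read_policy(bucket: str) -> dict:
    """Build the read-only policy document shown above for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": "parquet/*"}},
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/parquet/*",
            },
        ],
    }


def create_parquet_policy(iam, bucket: str, policy_name: str = "ParquetReadOnly") -> str:
    """Create the managed policy and return its ARN.

    `iam` is expected to be a boto3 IAM client supplied by the caller.
    """
    response = iam.create_policy(
        PolicyName=policy_name,
        # The API takes the policy document as a JSON string.
        PolicyDocument=json.dumps(parquet_read_policy(bucket)),
    )
    return response["Policy"]["Arn"]
```

The returned ARN is what gets attached to the role in the next step.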

Create an IAM Role

The next step is to create an IAM Role. Attach the IAM Policy created above to the role, and set the role's Trust Policy so that the Principal is allowed to assume it. The Trust Policy should look something like this, with [AWS_ACCOUNT_NUMBER] replaced with the appropriate value:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Parquet_Cross_Account",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::[AWS_ACCOUNT_NUMBER]:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
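
Creating the role and attaching the Policy can be sketched in the same way. This example assumes a boto3 IAM client is passed in and uses its `create_role` and `attach_role_policy` calls. The role name `ParquetCrossAccountReader` and the function names are hypothetical.

```python
import json


def trust_policy(account_id: str) -> dict:
    """Build the Trust Policy shown above for one consuming account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Parquet_Cross_Account",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_parquet_role(iam, account_id: str, policy_arn: str,
                        role_name: str = "ParquetCrossAccountReader") -> str:
    """Create the role, attach the read policy, and return the role ARN.

    `iam` is expected to be a boto3 IAM client supplied by the caller.
    """
    role = iam.create_role(
        RoleName=role_name,
        # The Trust Policy controls who may assume the role.
        AssumeRolePolicyDocument=json.dumps(trust_policy(account_id)),
    )
    # The attached policy controls what the role may do once assumed.
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role["Role"]["Arn"]
```

Keeping the Trust Policy (who may assume the role) separate from the attached Policy (what the role may do) mirrors the two JSON documents in this guide.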

Assume the Role

At this point, any identity within the allowed Principal scope can assume the role. Access requires assuming the role first; resources in that account cannot interact with the Parquet files directly.
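
From the consuming side, assuming the role and listing the Parquet files might look like the sketch below. It assumes boto3 is installed and that the caller passes in an STS client (for example `boto3.client("sts")`). The function names, the session name `parquet-reader`, and the `client_factory` parameter are hypothetical; the factory exists only so the function can be exercised without AWS access.

```python
def session_kwargs(assume_role_response: dict) -> dict:
    """Map an STS AssumeRole response to boto3 client credential arguments."""
    creds = assume_role_response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def list_parquet_keys(sts, role_arn: str, bucket: str, client_factory=None) -> list:
    """Assume the role, then list object keys under the parquet/ prefix."""
    if client_factory is None:
        # boto3 is a third-party package; it is imported only when needed.
        import boto3
        client_factory = boto3.client

    response = sts.assume_role(RoleArn=role_arn, RoleSessionName="parquet-reader")
    # The S3 client uses the temporary credentials of the assumed role.
    s3 = client_factory("s3", **session_kwargs(response))
    # The prefix must fall under parquet/ to satisfy the s3:prefix condition.
    # A single call returns at most 1000 keys; larger listings need pagination.
    listing = s3.list_objects_v2(Bucket=bucket, Prefix="parquet/")
    return [obj["Key"] for obj in listing.get("Contents", [])]
```

The temporary credentials expire, so a long-running consumer would need to assume the role again when they do.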

