Data Lake (formerly History Stream)


This documentation is for Data Lake v2. For v1, see the Data Lake v1 (formerly History Stream) documentation.

Data Lake writes data from GRAX backups to your storage bucket in Parquet format. You can use the written data to build data warehouses, merge data with other systems, and more.

Data Lake v2 is an iteration on our original Data Lake v1 offering. Compared to v1, Data Lake v2 writes data much sooner after GRAX backs up data from Salesforce, while increasing reliability and safety.

Getting Started

Enable Objects

To start, browse to Data Lake and enable your first objects. Most of your data lake data will reference standard Salesforce objects and fields, so we recommend enabling these first:

  • Account

  • AccountContactRelation

  • Case

  • Contact

  • Lead

  • Opportunity

  • OpportunityStage

  • User

Enable objects by clicking the "Add Objects" button in Data Lake and moving the objects you want to enable to the right-hand column. To include all fields for an object, ensure the "Include all fields for new objects" checkbox is selected at the bottom of the Configure page.

If any objects were added without "Include all fields for new objects" selected, you will receive a configuration pop-up for each object, allowing you to exclude fields from that object. Skipping the configuration pop-up for an object will add the object with all fields in a paused state, allowing you to revisit the configuration later.

Excluding Fields

To exclude fields from an object, deselect "Include all fields for new objects" when adding it, as described above.

To exclude fields from an object already added to Data Lake:

  1. Pause writing for the object you wish to configure by clicking pause in the Actions column.

  2. Click the gear symbol in the Actions column to open object configuration.

  3. Select the fields to exclude from this object and click "Update".

Excluding fields from an object does not remove data already written to Data Lake. Excluded fields are removed from future Data Lake writes. To rewrite the object's history without those fields, remove and re-add the object, excluding the fields.

Object Status

After you enable an object, it will take some time to populate the data lake. The status will go from "Backfilling" (with a percentage) to "Current" when it is complete. The initial backfill may take a while because it writes out data for every version of every record in your backup history.

Object Actions

You can pause and resume individual objects in the object actions column.

Once an object is paused, there are several actions available:

  • Clicking the gear symbol opens the object configuration screen, allowing you to exclude fields from the object.

  • Clicking the delete action deletes the object configuration. This does not remove any data already written to Data Lake. Parquet files already written can be manually removed if desired.

Write Format

Paths

When Data Lake v2 is enabled for an object, it begins writing all data in GRAX backups for that object. Data Lake v2 writes files to your storage bucket with paths that look like:

parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5e66e800.parquet
parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5f42a201.parquet
parquet/v2/org=X/object=Account/batch=05e0c89c0/data-16f61d5d004.parquet

The batch=05e0c89c0 portion of the path groups files into separate prefixes to optimize performance in S3 and other object stores. It will increase as more data is written. For example, batch=444444444 will not be used after batch=555555555. The batch value is not related to the data contained in the files; it is only for grouping files.

The 16f5e66e800 portion of data-16f5e66e800.parquet ensures unique and increasing filenames. It increases with each file. For example, data-44444444444.parquet will not be written after data-55555555555.parquet. Like the batch value, it is not related to the data contained in the file.
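As an illustration, here is a minimal sketch that lists the Parquet files written for one object, assuming the storage bucket is S3 and boto3 credentials are already configured; the bucket name and org value are placeholders for your own values:

import boto3

# List Data Lake v2 files for the Account object.
# Bucket name and org value are placeholders; substitute your own.
s3 = boto3.client("s3")
prefix = "parquet/v2/org=X/object=Account/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="your-grax-bucket", Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])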

File Data

Each file contains a varying number of rows.

Rows are meant to reflect the full state of a record version (the combination of Id + source__modified) at the time of writing to Data Lake, not at the time of the relevant change. For example, record versions that are deleted before Data Lake v2 is enabled will have grax__deleted set.

File sizes vary with data but should generally not be above 100 MiB.

Within each data-16f5e66e800.parquet file, the data looks roughly like:

Field | Type | Meaning
------|------|--------
Id | string | Salesforce Record ID.
source__modified | timestamp | The time the record version was modified in Salesforce (SystemModstamp).
grax__idseq | string | A per-Id value that increases with record changes.
grax__deleted | timestamp | If the record has been deleted or archived, the time the record was deleted or archived.
grax__deletesource | string | If the record has been deleted or archived, the source of the delete: grax for archives or source for Salesforce.
grax__purged | timestamp | If the record has been purged, the time the record was purged.
grax__restoredfrom | string | If the record was restored from another record, the ID of the record this record was restored from.
<record field name>, such as Name | string | Salesforce record field data, with the value from Salesforce (converted to a string).
<record field name>_<type>, such as IsDeleted_b | indicated by suffix | For non-string fields, one or more additional typed fields and values: _b for boolean, _i for integer, _f for float, _ts for timestamp, _t for time, and _d for date.

Only Id, source__modified, and grax__idseq are guaranteed to be present.
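Because every other column depends on the data in a given file, it can help to inspect a file's schema before relying on a field. A minimal sketch using pyarrow, with a placeholder local file path:

import pyarrow.parquet as pq

# Path is a placeholder; point it at any data-*.parquet file downloaded
# from your storage bucket.
table = pq.read_table("data-16f5e66e800.parquet")

# Only Id, source__modified, and grax__idseq are guaranteed to appear;
# the remaining columns vary with the data in this particular file.
print(table.schema.names)

for required in ("Id", "source__modified", "grax__idseq"):
    assert required in table.schema.names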

A full example for an Account Parquet file might look like:

Id | source__modified | grax__idseq | grax__deleted | grax__deletesource | grax__purged | grax__restoredfrom | Name | Fax
---|---|---|---|---|---|---|---|---
1 | t4 | a | | | | | Alice |
2 | t5 | q | | | t1 | | |
1 | t5 | b | | | | | Alicia |
3 | t1 | x | t3 | grax | | | Bob | 800-555-1212
1 | t6 | c | | | | | Allison |
4 | t8 | a | | | | 3 | Bob |
4 | t9 | g | | | | 3 | Bill | 888-111-5555

Important Notes

There will be duplicates. To achieve low write time and increase safety, v2 can produce duplicate records. Use the combination of Id and MAX(grax__idseq) to determine the latest write for the latest version of a record. Use the combination of Id, source__modified, and MAX(grax__idseq) to determine the latest write per version of a record. Duplicates were already possible with v1, but they were harder to remove without source__modified and grax__idseq.

Data can appear out of order. Similarly, for performance and reliability reasons, v2 can write data that is logically out of order. For example, data-X.parquet may be written first with Id=1, source__modified=t2 while data-Y.parquet may be written later with Id=1 and source__modified=t1. This can also happen within the same data-X.parquet file. Because grax__idseq increases for each Id and source__modified combination, use it to order data.
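As an illustration of both rules, here is a minimal DuckDB sketch; the glob path is a placeholder for wherever the object's Parquet files are readable, and union_by_name accounts for the per-file schemas noted below:

import duckdb

# Placeholder glob over the object's Data Lake v2 files.
files = "parquet/v2/org=X/object=Account/*/*.parquet"

# Latest write per record version: one row per (Id, source__modified).
latest_per_version = duckdb.sql(f"""
    SELECT * EXCLUDE (rn) FROM (
        SELECT *, row_number() OVER (
            PARTITION BY Id, source__modified
            ORDER BY grax__idseq DESC
        ) AS rn
        FROM read_parquet('{files}', union_by_name=true)
    ) WHERE rn = 1
""")

# Latest write for the latest version of each record: one row per Id.
latest_per_record = duckdb.sql(f"""
    SELECT * EXCLUDE (rn) FROM (
        SELECT *, row_number() OVER (
            PARTITION BY Id
            ORDER BY grax__idseq DESC
        ) AS rn
        FROM read_parquet('{files}', union_by_name=true)
    ) WHERE rn = 1
""")

print(latest_per_record.df())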

Each file can have a different schema. The Parquet schema for each file is based on the data in that file. Only fields with non-empty values are considered and written.

Files' schemas may include fields that contain no data. It's also possible that a file's Parquet schema will include a field that has no data within the file.

Typed fields may vary or clash over time. As schema and data change, a typed field such as Custom_Field__c_b may stop receiving data if the schema and data indicate the field has changed to a string. Likewise, Custom_Field__c_i may stop receiving data in favor of Custom_Field__c_f if the schema and data indicate the field has changed from an integer to a float.
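A hedged sketch of coalescing such variants into one column, again with DuckDB; Custom_Field__c_i and Custom_Field__c_f are hypothetical typed variants of the same Salesforce field, and the glob path is a placeholder:

import duckdb

# Merge integer and float variants of a hypothetical custom field into a
# single numeric column; union_by_name fills missing columns with NULLs.
rows = duckdb.sql("""
    SELECT
        Id,
        source__modified,
        coalesce(
            try_cast(Custom_Field__c_f AS DOUBLE),
            try_cast(Custom_Field__c_i AS DOUBLE)
        ) AS custom_field_value
    FROM read_parquet('parquet/v2/org=X/object=Account/*/*.parquet',
                      union_by_name=true)
""")
print(rows.df())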

Files written after a record has been purged will contain no field data for that record.

Frequently Asked Questions

What are the key differences between Data Lake v1 and v2?

The key differences between Data Lake v1 and v2 are:

  • Reduced delay from data being added in GRAX to writing to Data Lake

  • Increased throughput when backfilling newly enabled objects and handling large volumes of changes

  • Improved writing intelligence to reliably keep objects up-to-date

  • System improvements to remove possibilities of missed writes observed in v1

  • Different path structure (a v2 prefix, with day=YYYY-MM-DD/hr=HH replaced by batch=444444444)

  • Increased max file size (10 MB for v1 to 100 MB for v2)

  • Addition of source__modified, grax__idseq, and grax__restoredfrom fields

  • Addition of typed fields for non-string values

  • Removal of grax__added field

How do I switch from Data Lake v1 to Data Lake v2?

You can turn on Data Lake v2 by navigating to the Data Lake section of your GRAX application and clicking Data Lake v2; you are then ready to start configuring objects for writing. Select the objects you would like to use with Data Lake v2. Once the initial backfill is complete and an object shows a "Current" status, you can disable that object in Data Lake v1. Make sure to enable any processing rules and triggers for v2 data prior to disabling v1 to ensure there is no data loss.

What happens to my old parquet files if I previously used Data Lake v1?

Data previously written with Data Lake v1 remains usable and accessible during the switch to v2 and afterwards. Data Lake v2 will duplicate the data written with v1, and the v1 files are safe to delete once they are no longer being used and v2 has finished the initial backfill.

When deleting, be sure to only delete files under parquet/org=X/... in your bucket. Do not delete files in other parts of the bucket.
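As a cautious starting point, a dry-run sketch that only lists the v1 files without deleting anything; the bucket name and org value are placeholders, and the v1 prefix parquet/org=X/ does not overlap the v2 prefix parquet/v2/:

import boto3

# Dry run: list v1 Parquet files so they can be reviewed before deletion.
# The parquet/org=X/ prefix never matches v2 files, which live under parquet/v2/.
s3 = boto3.client("s3")
v1_prefix = "parquet/org=X/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="your-grax-bucket", Prefix=v1_prefix):
    for obj in page.get("Contents", []):
        print("candidate for deletion:", obj["Key"])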

Why isn't there a folder for each date in my storage bucket?

Parquet files are only written when Data Lake receives new or updated data. If Auto Backup is not running, or there is a connectivity issue with Salesforce (SFDC), no new files will be written for those dates. Once Auto Backup resumes, any missed data will be backed up and written to files on the date the backup process restarts.
