Data Lake

Data Lake writes data from GRAX backups to your storage bucket in Parquet format. You can use the written data to build data warehouses, merge data with other systems, and more.

Getting Started

Enable Objects

To start, browse to Data Lake and enable your first objects. Most of your data lake data will reference standard Salesforce objects and fields, so we recommend enabling these first:

  • Account

  • AccountContactRelation

  • Case

  • Contact

  • Lead

  • Opportunity

  • OpportunityStage

  • User

Enable objects by clicking the Add Objects button in Data Lake and moving the objects you want to enable to the right-hand column. To include all fields for an object, ensure the Include all fields for new objects checkbox is selected at the bottom of the Configure page.

If any objects were added without Include all fields for new objects selected, you will receive a configuration pop-up for each object, allowing you to exclude fields from that object. Skipping the configuration pop-up for an object will add the object with all fields in a paused state, allowing you to revisit the configuration later.

Excluding Fields

To exclude fields from an object, deselect Include all fields for new objects when adding it, as described above.

To exclude fields from an object already added to Data Lake:

  1. Pause writing for the object you wish to configure by clicking pause in the Actions column.

  2. Click the gear symbol in the Actions column to open object configuration.

  3. Select the fields to exclude from this object and click Update.

Object Status

After you enable an object, it will take some time to populate the data lake. The status changes from Backfilling (with a percentage) to Current when it is complete. The initial backfill may take a while because it writes out data for every version of every record in your backup history for that object.

Object Actions

You can pause and resume individual objects in the Actions column.

Once an object is paused, there are several actions available:

  • Clicking the delete action removes the object configuration. This does not remove any data already written to Data Lake; Parquet files already written can be removed manually if desired.

  • Clicking the gear symbol opens the object configuration screen, allowing you to exclude fields from the object.

Write Format

Paths

When Data Lake is enabled for an object, it begins writing all data in GRAX backups for that object. Data Lake writes files to your storage bucket with paths that look like:

parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5e66e800.parquet
parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5f42a201.parquet
parquet/v2/org=X/object=Account/batch=05e0c89c0/data-16f61d5d004.parquet

The batch=05e0c89c0 portion of the path groups files into separate prefixes to optimize performance in S3 and other object stores. It increases as more data is written; for example, batch=444444444 will not be used after batch=555555555. The batch value is not related to the data contained in the files; it exists only to group files.

The 16f5e66e800 portion of data-16f5e66e800.parquet ensures unique, increasing filenames. It increases with each file; for example, data-44444444444.parquet will not be written after data-55555555555.parquet. Like the batch value, it is not related to the data contained in the file.
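
If you need to enumerate an object's files programmatically, a listing along the lines of the Python sketch below works against this layout. It is a minimal sketch that assumes an AWS S3 bucket; your-grax-bucket, org=X, and object=Account are placeholders, so substitute the bucket name and prefix you actually see in your storage bucket.

import boto3

# Minimal sketch: list the Data Lake Parquet files for one object.
# "your-grax-bucket", org=X, and object=Account are placeholders; use the
# bucket and prefix from your own storage bucket.
s3 = boto3.client("s3")
prefix = "parquet/v2/org=X/object=Account/"

keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="your-grax-bucket", Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            keys.append(obj["Key"])

# batch= prefixes and data-... filenames increase over time, so sorting the
# keys lists files in roughly the order they were written.
for key in sorted(keys):
    print(key)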

File Data

Each file contains a varying number of rows.

Rows are meant to reflect the full state of a record version (the combination of Id + source__modified) at the time of writing to Data Lake, not at the time of the relevant change. For example, record versions that are deleted before Data Lake is enabled will have grax__deleted set.

File sizes vary with data but should generally not be above 100 MiB.

Within each data-16f5e66e800.parquet file, the data looks roughly like:

Field | Type | Meaning
Id | string | Salesforce Record ID.
source__modified | timestamp | The time the record version was modified in Salesforce (SystemModstamp).
grax__idseq | string | A per-Id value that increases with record changes.
grax__deleted | timestamp | If the record has been deleted or archived, the time the record was deleted or archived.
grax__deletesource | string | If the record has been deleted or archived, the source of the delete. Will be grax for archives or source for Salesforce.
grax__purged | timestamp | If the record has been purged, the time the record was purged.
grax__restoredfrom | string | If the record was restored from another record, the ID of the record this record was restored from.
<record field name>, such as Name | string | Salesforce record field data, with the value from Salesforce (converted to a string).
<record field name>_<type>, such as IsDeleted_b | indicated by suffix | For non-string fields, one or more additional typed fields and values: _b for boolean, _i for integer, _f for float, _ts for timestamp, _t for time, and _d for date.

Only Id, source__modified, and grax__idseq are guaranteed to be present.
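
One way to see which columns a given file actually contains is to read it and print its schema, as in the minimal Python (pyarrow) sketch below. The filename is a placeholder; download a file from your storage bucket first, or read it through a filesystem layer such as s3fs.

import pyarrow.parquet as pq

# Minimal sketch: inspect one Data Lake file. The filename is a placeholder
# for a file downloaded from your storage bucket.
table = pq.read_table("data-16f5e66e800.parquet")

# Only Id, source__modified, and grax__idseq are guaranteed to appear.
# Other grax__* columns and typed-suffix columns (_b, _i, _f, _ts, _t, _d)
# are present only when this particular file contains data for them.
print(table.schema)

df = table.to_pandas()
print(df[["Id", "source__modified", "grax__idseq"]].head())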

A full example for an Account Parquet file might look like:

Id | source__modified | grax__idseq | grax__deleted | grax__deletesource | grax__purged | grax__restoredfrom | Name | Fax
1 | t4 | a | | | | | Alice |
2 | t5 | q | | | t1 | | |
1 | t5 | b | | | | | Alicia |
3 | t1 | x | t3 | grax | | | Bob | 800-555-1212
1 | t6 | c | | | | | Allison |
4 | t8 | a | | | | 3 | Bob |
4 | t9 | g | | | | 3 | Bill | 888-111-5555

Important Notes

There will be duplicates. To keep write times low and increase safety, Data Lake can produce duplicate records.

  • Use the combination of Id and MAX(grax__idseq) to determine the latest write for the latest version of a record.

  • Use the combination of Id, source__modified, and MAX(grax__idseq) to determine the latest write per version of a record.

Data can appear out of order. Similarly, for performance and reliability reasons, Data Lake can write data that may logically be out of order. For example, data-X.parquet may be written first with Id=1, source__modified=t2, while data-Y.parquet may be written afterward with Id=1 and source__modified=t1. This can also happen within a single data-X.parquet file.

  • Use the fact that grax__idseq will increase for each Id and source__modified combination to order data.
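
The minimal Python (pandas) sketch below applies both notes. It assumes the object's files have already been loaded into a single DataFrame named df (one way to build it is shown at the end of these notes), and it treats the lexicographic order of grax__idseq as its increasing order, an assumption you should verify against your data.

import pandas as pd

# Minimal sketch: df is assumed to be a pandas DataFrame holding every row
# written for one object. grax__idseq is a string; sorting it assumes its
# lexicographic order matches its increasing order.

# Latest write for the latest version of each record: Id + MAX(grax__idseq).
latest = df.sort_values("grax__idseq").groupby("Id").tail(1)

# Latest write per version of a record:
# Id + source__modified + MAX(grax__idseq).
latest_per_version = (
    df.sort_values("grax__idseq")
      .groupby(["Id", "source__modified"])
      .tail(1)
)

# Optionally drop records whose latest version is a delete or archive.
if "grax__deleted" in latest.columns:
    latest = latest[latest["grax__deleted"].isna()]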

Each file can have a different schema. The Parquet schema for each file is based on the data in that file. Only fields with non-empty values are considered and written.

Files' schemas may include fields that contain no data. Although only fields with non-empty values are considered, a file's Parquet schema may still include a field that has no data within that file.

Typed fields may vary or clash over time. As schema and data change, a typed field such as Custom_Field__c_b may stop receiving data if the field changes to a string, or Custom_Field__c_i may stop receiving data in favor of Custom_Field__c_f if the field changes from an integer to a float.

Files written after a record has been purged will contain no field data for that record.
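
Because each file can carry a different schema, one hedged approach is to read the files individually and union their columns, as in the minimal Python (pandas) sketch below; pandas.concat fills columns missing from a given file with NaN. The account-files directory is a placeholder for a local copy of one object's files, and Custom_Field__c is a hypothetical field used only to illustrate reconciling a typed-column clash.

import glob
import pandas as pd

# Minimal sketch: union files whose Parquet schemas differ. "account-files/"
# is a placeholder for a local copy of one object's files from your bucket.
paths = glob.glob("account-files/**/*.parquet", recursive=True)
frames = [pd.read_parquet(path) for path in paths]

# concat unions the column names; columns missing from a given file are
# filled with NaN for that file's rows.
df = pd.concat(frames, ignore_index=True, sort=False)

# Hypothetical example of a typed-column clash: prefer the float column where
# present, falling back to the integer column.
if {"Custom_Field__c_f", "Custom_Field__c_i"} <= set(df.columns):
    df["Custom_Field__c_num"] = df["Custom_Field__c_f"].fillna(df["Custom_Field__c_i"])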

Frequently Asked Questions

How many total objects can be enabled for Data Lake?

You can enable up to 100 objects for a single org.

How many objects should be enabled at the same time?

You should add all required objects to the enabled objects list; this ensures objects are pushed to Data Lake as efficiently as possible. GRAX continuously writes files for each object as capacity becomes available.

How is the Data Lake data organized?

The format for the folders looks like this: parquet/org=1234/object=myObject/day=nn/hour=nn. New folders are added to your storage bucket for each version/modification based on the date and time the version was backed up, following the same format and always based on UTC time.

Why isn't there a folder for each date in my storage bucket?

Parquet files are only written when Data Lake receives new or updated data. If Backup is not running, or there is a connectivity issue with Salesforce (SFDC), no new files will be written for those dates. Once Backup resumes, any missed data will be backed up and written to files on the date the backup process restarts.

Why are some hourly folders missing?

Folders are only created for hours that contained data updates/changes; if there are no updates to an object in any given hour, no folders or files are created for that hour.

What is the estimated size of parquet files that are written?

The size varies based on the amount and size of the data, but you can expect a range of 10 MB to 40 MB per folder/file.

Why does the record count for an object in Data Lake not match the record count shown in Backup?

For Backup, the totals are split into "records" and "versions." For Data Lake, records and versions are combined into a single "total records written" number. Keep in mind that there might also be a small discrepancy between the totals if an object is still catching up in Data Lake. Additionally, records backed up via Legacy Backup are not included in the Backup totals but are included in the Data Lake numbers.

How do I rewrite an object in Data Lake if new attributes have been added or field permissions have been updated?

You can reset the object for Backup on the “Tools and Diagnostics” page within the “Settings” tab. This re-runs the data for that object so updated versions are captured and added to your Data Lake files. It does not remove previously written data from the Parquet files, and it will duplicate any records that were already written and are still present in the GRAX dataset.

Should I continue to run Backup and Archive jobs while objects are being enabled for Data Lake?

Yes, GRAX is designed to have all these tasks run concurrently without diminished performance.

What factors impact the speed of Data Lake processing?

Many variables affect the speed, including the amount and size of data, available CPU, other app activity, and so on. There is no set time frame for how long writing an object (or objects) takes, as this is determined by factors specific to each org.

How can I get Data Lake objects written faster?

Data Lake speed depends on various factors such as the size and number of objects. There is no action you can take to increase the writing speed, but GRAX is continuously working to maximize the efficiency of this feature.

Is there any effect on the speed/performance in other areas of the app when running Data Lake?

Functionality that needs to scan backup data (Global Search, archive jobs, etc.) might run slightly slower than usual, but you should not see a significant difference or related errors.

What are the necessary CPU/VM resources needed to run Data Lake effectively?

Customers must have at least the minimum requirements for GRAX as defined here. This allows Data Lake to run as intended, but having better CPU and RAM resources improves speed and capacity.

Can I turn Data Lake on and off if needed?

Yes, you can remove objects from the enabled list to pause Data Lake for a specific object. Once those objects are added back to the enabled list, Data Lake picks up where it left off. This might be useful if prioritizing certain objects or if internal flows need to be adjusted.
