# Data Lake

Data Lake writes data from GRAX backups to your storage bucket in [Parquet](https://parquet.apache.org/) format. You can use the written data to build data warehouses, merge data with other systems, and more.

## Getting Started

### Enable Objects

To start, browse to Data Lake and enable your first objects by clicking `Add Objects`.

To help you get started, we’ve preselected the most common objects. You can customize these selections at any time, including removing any that were preselected:

* Account
* AccountContactRelation
* Campaign
* CampaignMember
* Case
* Contact
* Event
* Lead
* Opportunity
* OpportunityLineItem
* Task
* User

<figure><img src="https://4150568565-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwHKnqFEg4DROpG3KCq3D%2Fuploads%2FWqsdJXuuna9n9ITeLZc1%2FPreselected%20Data%20Lake%20objects.png?alt=media&#x26;token=74de640b-2b1a-4aa3-855c-98a859344edc" alt=""><figcaption></figcaption></figure>

By default, all fields are written. If you’d like to write only specific fields, uncheck `Include all fields for new objects` at the bottom of the `Configure` page.

If you add objects without `Include all fields for new objects` selected, a configuration pop-up appears for each object, allowing you to [exclude](#excluding-fields) fields from that object. Skipping the pop-up for an object adds the object with all fields in a paused state, so you can revisit the configuration later.

### Excluding Fields

To exclude fields from an object, deselect `Include all fields for new objects` when adding it, as described above.

To exclude fields from an object already added to Data Lake:

1. Pause writing for the object you wish to configure by clicking pause in the `Actions` column.
2. Click the gear symbol in the `Actions` column to open object configuration.
3. Select the fields to exclude from this object and click `Update`.

<figure><img src="https://4150568565-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwHKnqFEg4DROpG3KCq3D%2Fuploads%2FICCsHuuR5o7lok0yGCT8%2FField%20exclusions.png?alt=media&#x26;token=fea8a432-f3c4-43f9-9b20-266204c565af" alt=""><figcaption></figcaption></figure>

{% hint style="warning" %}
Excluding fields from an object does not remove data that has already been written to the Data Lake. The excluded fields will only be omitted from future writes. To rewrite the object’s history without those fields, remove the object and then re-add it with the fields excluded.
{% endhint %}

### Object Status

After you enable an object, it takes some time to populate the Data Lake. The status moves from `Backfilling` (with a percentage) to `Current` when it is complete. The initial backfill may take a while, as it writes out data for every version of every record in your backup history.

#### Object Actions

You can pause and resume individual objects using the controls in the `Actions` column.

Once an object is paused, there are several actions available:

* Clicking <img src="https://4150568565-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwHKnqFEg4DROpG3KCq3D%2Fuploads%2F89TIaDBZsvvoNASBpPJe%2FScreenshot%202025-05-12%20at%208.41.53%E2%80%AFAM%20copy.png?alt=media&#x26;token=aec23521-ec3d-4537-aa30-ae2029012919" alt="" data-size="line">will delete the object configuration. This does **not** remove any data already written to Data Lake. Parquet files already written can be manually removed if desired.
* Clicking <img src="https://4150568565-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FwHKnqFEg4DROpG3KCq3D%2Fuploads%2FV87L1RLxuAQgG1eVSP0T%2FScreenshot%202025-05-12%20at%208.41.53%E2%80%AFAM.png?alt=media&#x26;token=4a1a487d-5907-46a7-a8e3-75a38fe5aba6" alt="" data-size="line">opens the object configuration screen, allowing you to [exclude fields](#excluding-fields) from the object.

## Write Format

### Paths

When Data Lake is enabled for an object, it begins writing all data in GRAX backups for that object. Data Lake writes files to your storage bucket with paths that look like:

```
parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5e66e800.parquet
parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5f42a201.parquet
parquet/v2/org=X/object=Account/batch=05e0c89c0/data-16f61d5d004.parquet
```

The `batch=05e0c89c0` portion of the path groups files into separate prefixes to [optimize performance in S3 and other object stores](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html). It increases as more data is written: for example, `batch=444444444` will not be used after `batch=555555555`. The `batch` value is not related to the data contained in the files; it exists only to group them.

The `16f5e66e800` portion of `data-16f5e66e800.parquet` ensures unique, increasing filenames. It increases with each file: for example, `data-44444444444.parquet` will not be written after `data-55555555555.parquet`. Like the `batch` value, it is not related to the data contained in the file.
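
Because the path segments use `key=value` (Hive-style) naming, they are straightforward to parse. A minimal Python sketch, using one of the example paths above:

```python
# Split a Data Lake path into its Hive-style key=value components.
path = "parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5e66e800.parquet"
parts = dict(
    segment.split("=", 1)
    for segment in path.split("/")
    if "=" in segment
)
print(parts)  # {'org': 'X', 'object': 'Account', 'batch': '05e0be100'}
```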

### File Data

Each file contains a varying number of rows.

Rows are meant to reflect the full state of a record version (the combination of `Id` + `source__modified`) at the time of writing to Data Lake, not at the time of the relevant change. For example, record versions that are deleted before Data Lake is enabled will have `grax__deleted` set.

File sizes vary with the data but should generally not exceed 100 MiB.

Within each `data-16f5e66e800.parquet` file, the data looks roughly like:

| Field                                               | Type                | Meaning                                                                                                                                                                  |
| --------------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `Id`                                                | string              | Salesforce Record ID.                                                                                                                                                    |
| `source__modified`                                  | timestamp           | The time the record version was modified in Salesforce (SystemModstamp).                                                                                                 |
| `grax__idseq`                                       | string              | A per-`Id` value that increases with record changes.                                                                                                                     |
| `grax__deleted`                                     | timestamp           | If the record has been deleted or archived, the time the record was deleted or archived.                                                                                 |
| `grax__deletesource`                                | string              | If the record has been deleted or archived, the source of the record delete. Will be `grax` for archives or `source` for Salesforce.                                     |
| `grax__purged`                                      | timestamp           | If the record has been purged, the time the record was purged.                                                                                                           |
| `grax__restoredfrom`                                | string              | If the record was restored from another record, the ID of the record this record was restored from.                                                                      |
| `<record field name>`, such as `Name`               | string              | Salesforce record field data, with the value from Salesforce (converted to a string).                                                                                    |
| `<record field name>_<type>`, such as `IsDeleted_b` | indicated by suffix | For non-string fields, one or more additional typed fields and values: `_b` for boolean, `_f` for float/integer, `_ts` for timestamp, `_t` for time, and `_d` for date. |

Only `Id`, `source__modified`, and `grax__idseq` are guaranteed to be present.

A full example for an Account Parquet file might look like:

| `Id` | `source__modified` | `grax__idseq` | `grax__deleted` | `grax__deletesource` | `grax__purged` | `grax__restoredfrom` | `Name`    | `Fax`          |
| ---- | ------------------ | ------------- | --------------- | ------------------- | -------------- | -------------------- | --------- | -------------- |
| `1`  | `t4`               | `a`           |                 |                     |                |                      | `Alice`   |                |
| `2`  | `t5`               | `q`           |                 |                     | `t1`           |                      |           |                |
| `1`  | `t5`               | `b`           |                 |                     |                |                      | `Alicia`  |                |
| `3`  | `t1`               | `x`           | `t3`            | `grax`              |                |                      | `Bob`     | `800-555-1212` |
| `1`  | `t6`               | `c`           |                 |                     |                |                      | `Allison` |                |
| `4`  | `t8`               | `a`           |                 |                     |                | `3`                  | `Bob`     |                |
| `4`  | `t9`               | `g`           |                 |                     |                | `3`                  | `Bill`    | `888-111-5555` |
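
Because only `Id`, `source__modified`, and `grax__idseq` are guaranteed to be present, it helps to inspect a file's schema before relying on any other column. A minimal sketch using [pyarrow](https://arrow.apache.org/docs/python/), assuming a hypothetical local copy of one file:

```python
import pyarrow.parquet as pq

# Hypothetical local copy of one Data Lake file.
table = pq.read_table("data-16f5e66e800.parquet")

# Columns present in *this* file; other files for the same object
# may carry a different set of columns.
print(table.schema)

df = table.to_pandas()
# Only these three columns are guaranteed to be present.
print(df[["Id", "source__modified", "grax__idseq"]].head())
```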

## Important Notes

**There will be duplicates.** To achieve low write time and increase safety, Data Lake can produce duplicate records.

* Use the combination of `Id` and `MAX(grax__idseq)` to determine the latest write for the latest version of a record.
* Use the combination of `Id`, `source__modified`, and `MAX(grax__idseq)` to determine the latest write per version of a record (see the sketch after this list).
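
A minimal pandas sketch of both deduplication patterns. The glob pattern and local mirror of the bucket are hypothetical, and the sketch assumes that string comparison on `grax__idseq` reflects its documented per-`Id` ordering:

```python
import glob

import pandas as pd

# Combine every file written for one object (hypothetical local mirror).
files = glob.glob("parquet/v2/org=X/object=Account/batch=*/data-*.parquet")
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

# Latest write for the latest version of each record:
# the row with the maximum grax__idseq per Id.
latest = df.sort_values(["Id", "grax__idseq"]).drop_duplicates("Id", keep="last")

# Latest write per record version:
# the maximum grax__idseq per (Id, source__modified) pair.
latest_per_version = (
    df.sort_values(["Id", "source__modified", "grax__idseq"])
      .drop_duplicates(["Id", "source__modified"], keep="last")
)
```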

**Data can appear out of order.** Similarly, for performance and reliability reasons, Data Lake can write data that is logically out of order. For example, `data-X.parquet` may be written first with `Id=1`, `source__modified=t2`, while `data-Y.parquet` may be written later with `Id=1` and `source__modified=t1`. This can also happen within the same `data-X.parquet` file.

* Use the fact that `grax__idseq` increases for each `Id` and `source__modified` combination to order data, as in the sketch below.
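
For example, replaying one record's full history in logical order is a sort on `grax__idseq`, regardless of the order in which the files were written. The loading lines repeat the previous sketch, and the record ID is hypothetical:

```python
import glob

import pandas as pd

files = glob.glob("parquet/v2/org=X/object=Account/batch=*/data-*.parquet")
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)

# Replay the full version history of one record in logical order.
history = df[df["Id"] == "001xx000003DGbQAAW"].sort_values("grax__idseq")
print(history[["Id", "source__modified", "grax__deleted"]])
```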

**Each file can have a different schema.** The Parquet schema for each file is based on the data in that file. Only fields with non-empty values are considered and written.

**Files' schemas may include fields that contain no data.** A file's Parquet schema may declare a field even when no row in that file has a value for it.
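
Because each file's schema reflects only that file's data, reading many files as one dataset works best if you unify their schemas first. A sketch using pyarrow (paths hypothetical): `unify_schemas` merges the per-file column sets, and scanning with the unified schema fills columns a given file lacks with nulls.

```python
import glob

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

files = glob.glob("parquet/v2/org=X/object=Account/batch=*/data-*.parquet")

# Merge the per-file schemas into one superset schema.
unified = pa.unify_schemas([pq.read_schema(f) for f in files])

# Scan all files as one dataset; missing columns come back as nulls.
dataset = ds.dataset(files, schema=unified, format="parquet")
table = dataset.to_table()
```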

**Typed fields may vary or clash over time.** As schema and data change over time, a typed field such as `Custom_Field__c_b` may stop receiving data if the schema and data indicate the field has changed to a string. Similarly, `Custom_Field__c_d` may stop receiving data in favor of `Custom_Field__c_ts` if the schema and data indicate the field has changed from a date to a timestamp.
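
When a field's type has changed over time, its values may be split across two suffixed columns. A hedged sketch, using a toy frame and a hypothetical custom field that moved from date (`_d`) to timestamp (`_ts`), that coalesces the two variants:

```python
import pandas as pd

# Toy frame standing in for combined Data Lake rows: older rows carry the
# date variant, newer rows the timestamp variant (hypothetical field name).
df = pd.DataFrame(
    {
        "Id": ["1", "1"],
        "Custom_Field__c_d": ["2023-01-05", None],
        "Custom_Field__c_ts": [None, "2024-02-01T09:30:00"],
    }
)

# Prefer the newer timestamp column, falling back to the older date column.
df["Custom_Field__c_best"] = pd.to_datetime(df["Custom_Field__c_ts"]).fillna(
    pd.to_datetime(df["Custom_Field__c_d"])
)
```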

**Files written after a record has been purged will contain no field data for that record.**

## Frequently Asked Questions

#### How many objects should be enabled at the same time?

You should add all required objects to the `enabled objects` list; this ensures objects are pushed to Data Lake as efficiently as possible. GRAX continuously writes files for each object as capacity becomes available.

#### Why does the record count for an object in Data Lake not match the record count shown in Backup?

For Backup, the numbers are split into "records" and "versions." For Data Lake, records and versions are combined into the "total records written" number. Keep in mind that there may also be a small discrepancy between the totals if an object is still catching up in Data Lake. Additionally, records backed up via Legacy Backup are not included in the Backup totals but are included in the Data Lake numbers.

#### How do I rewrite an object in Data Lake if new attributes have been added or field permissions have been updated?

You can remove an object from the Data Lake configuration and then [add](#enable-objects) it back. This does not remove previously written data from the Parquet files, and it duplicates any records that were already written and are still present in the GRAX dataset.

{% hint style="info" %}
Before removing the object, review the object configuration to check if fields were [excluded](#excluding-fields), so you can exclude the same fields when adding the object back.
{% endhint %}

#### Should I continue to run Backup and Archive jobs while objects are being enabled for Data Lake?

Yes, GRAX is designed to have all these tasks run concurrently without diminished performance.

#### What factors impact the speed of Data Lake processing?

Many variables affect speed, including the amount and size of data, available CPU, other app activity, and more. There is no set time frame for how long writing an object (or objects) takes, as this is determined by factors specific to each org.

#### How can I get Data Lake objects written faster?

Data Lake speed depends on various factors, such as the size and quantity of objects. There is no action you can take to increase writing speed, but GRAX continuously works to maximize the efficiency of this feature.

#### Is there any effect on the speed/performance in other areas of the app when running Data Lake?

Functionality that needs to scan backup data (Global Search, archive jobs, etc.) might run a bit slower than usual, but there should not be any noticeable difference or related errors.

#### What are the necessary CPU/VM resources needed to run Data Lake effectively?

Customers must have at least the minimum requirements for GRAX as defined [here](https://app.gitbook.com/s/d9R0vX0Xh14BwQLNh5fI/requirements/technical-requirements). This allows Data Lake to run as intended, but having better CPU and RAM resources improves speed and capacity.

#### Can I turn Data Lake on and off if needed?

Yes, you can remove objects from the enabled list to pause Data Lake for a specific object. Once those objects are added back to the enabled list, Data Lake picks up where it left off. This might be useful if prioritizing certain objects or if internal flows need to be adjusted.
