Data Lake (formerly History Stream)
Note
This documentation is for Data Lake v2. For v1, see the v1 documentation.
Data Lake writes data from GRAX backups to your storage bucket in Parquet format. You can use the written data to build data warehouses, merge data with other systems, and more.
Data Lake v2 is an iteration on our original Data Lake v1 offering. Compared to v1, Data Lake v2 writes data much sooner after GRAX backs it up from Salesforce, while increasing reliability and safety.
Write format
Paths
When Data Lake v2 is enabled for an object, it begins writing all data in GRAX backups for that object. Data Lake v2 writes files to your storage bucket with paths that look like:
parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5e66e800.parquet
parquet/v2/org=X/object=Account/batch=05e0be100/data-16f5f42a201.parquet
parquet/v2/org=X/object=Account/batch=05e0c89c0/data-16f61d5d004.parquet
The batch=05e0c89c0 portion of the path groups files into separate prefixes to optimize performance in S3 and other object stores. It will increase as more data is written. For example, batch=444444444 will not be used after batch=555555555. The batch value is not related to the data contained in the files; it is only for grouping files.
The 16f5e66e800 portion of data-16f5e66e800.parquet ensures unique and increasing filenames. It will increase with each file. For example, data-44444444444.parquet will not be written after data-55555555555.parquet. Similar to the batch value, it is not related to the data contained in the file.
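The path layout above can be split into its parts with a regular expression. This is a minimal sketch: the names `PATH_RE`, `parse_path`, and `group_by_batch` are hypothetical, and the batch and file identifiers are treated as opaque strings because the documentation only says that they increase.

```python
import re

# Pattern for the path layout shown above. The identifiers are kept as
# opaque strings; no meaning is assumed beyond what the docs state.
PATH_RE = re.compile(
    r"^parquet/v2/org=(?P<org>[^/]+)/object=(?P<object>[^/]+)"
    r"/batch=(?P<batch>[^/]+)/data-(?P<file_id>[^/.]+)\.parquet$"
)


def parse_path(path):
    """Return the path components as a dict, or None if the path does not match."""
    match = PATH_RE.match(path)
    return match.groupdict() if match else None


def group_by_batch(paths):
    """Group matching paths by (object, batch), preserving input order."""
    groups = {}
    for path in paths:
        parts = parse_path(path)
        if parts is None:
            continue  # skip keys that do not follow the v2 data file layout
        groups.setdefault((parts["object"], parts["batch"]), []).append(path)
    return groups
```

Since the batch value carries no information about the data, grouping by it is only useful for listing or processing files prefix by prefix.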
File data
Each file contains a varying number of rows.
Rows are meant to reflect the full state of a record version (the combination of Id + source__modified) at the time of writing to Data Lake, not at the time of the relevant change. For example, record versions that are deleted before Data Lake v2 is enabled will have grax__deleted set.
File sizes vary with data but should generally not be above 100 MiB.
Within each data-16f5e66e800.parquet file, the data looks roughly like:
Field | Type | Meaning |
---|---|---|
Id | string | Salesforce Record ID. |
source__modified | timestamp | The time the record version was modified in Salesforce (SystemModstamp). |
grax__idseq | string | A per-Id value that increases with record changes. |
grax__deleted | timestamp | If the record has been deleted or archived, the time the record was deleted or archived. |
grax__deletesource | string | If the record has been deleted or archived, the source of the record delete. Will be grax for archives or source for Salesforce. |
grax__purged | timestamp | If the record has been purged, the time the record was purged. |
grax__restoredfrom | string | If the record was restored from another record, the ID of the record this record was restored from. |
<record field name>, such as Name | string | Salesforce record field data, with the value from Salesforce (converted to a string). |
<record field name>_<type>, such as IsDeleted_b | indicated by suffix | For non-string fields, one or more additional typed fields and values: _b for boolean, _i for integer, _f for float, _ts for timestamp, _t for time, and _d for date. |
Only Id, source__modified, and grax__idseq are guaranteed to be present.
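The suffix convention in the table can be applied to column names to recover the base field and its type. The sketch below is illustrative only: `classify_column`, `TYPE_SUFFIXES`, and `SYSTEM_COLUMNS` are hypothetical names, and the function assumes that no plain string field happens to end in one of the suffixes (a field literally named `Plan_b` would be misclassified).

```python
# Suffix-to-type mapping taken from the field table above.
TYPE_SUFFIXES = {
    "_b": "boolean",
    "_i": "integer",
    "_f": "float",
    "_ts": "timestamp",
    "_t": "time",
    "_d": "date",
}

# Columns that are not Salesforce field data.
SYSTEM_COLUMNS = {"Id", "source__modified"}


def classify_column(name):
    """Return (base_field, type_name) for a column name.

    Assumption: any column ending in a known suffix is a typed field.
    """
    if name in SYSTEM_COLUMNS or name.startswith("grax__"):
        return name, "system"
    for suffix, type_name in TYPE_SUFFIXES.items():
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], type_name
    return name, "string"
```

For example, `classify_column("IsDeleted_b")` returns `("IsDeleted", "boolean")`, while `classify_column("Name")` returns `("Name", "string")`.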
A full example for an Account Parquet file might look like:
Id | source__modified | grax__idseq | grax__deleted | grax__deletesource | grax__purged | grax__restoredfrom | Name | Fax |
---|---|---|---|---|---|---|---|---|
1 | t4 | a | | | | | Alice | |
2 | t5 | q | | | t1 | | | |
1 | t5 | b | | | | | Alicia | |
3 | t1 | x | t3 | grax | | | Bob | 800-555-1212 |
1 | t6 | c | | | | | Allison | |
4 | t8 | a | | | | 3 | Bob | |
4 | t9 | g | | | | 3 | Bill | 888-111-5555 |
Important notes
There will be duplicates. To achieve low write time and increase safety, v2 can produce duplicate records. Use the combination of Id and MAX(grax__idseq) to determine the latest write for the latest version of a record. Use the combination of Id, source__modified, and MAX(grax__idseq) to determine the latest write per version of a record. This was already the case with v1, but it was harder to de-duplicate without source__modified and grax__idseq.
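The de-duplication rules above can be sketched in plain Python over rows that have already been read into dictionaries. The helper names are hypothetical, and comparing grax__idseq values as strings is an assumption that mirrors what MAX() does on a string column.

```python
def latest_by(rows, key_fields):
    """Keep, for each key, the row with the highest grax__idseq.

    Assumption: grax__idseq values compare correctly as plain strings.
    """
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in latest or row["grax__idseq"] > latest[key]["grax__idseq"]:
            latest[key] = row
    return latest


def latest_per_record(rows):
    """Latest write for the latest version of each record (key: Id)."""
    return latest_by(rows, ["Id"])


def latest_per_version(rows):
    """Latest write for each record version (key: Id + source__modified)."""
    return latest_by(rows, ["Id", "source__modified"])


# Illustrative rows based on the example table, with one duplicate write.
rows = [
    {"Id": "1", "source__modified": "t4", "grax__idseq": "a", "Name": "Alice"},
    {"Id": "1", "source__modified": "t5", "grax__idseq": "b", "Name": "Alicia"},
    {"Id": "1", "source__modified": "t5", "grax__idseq": "b", "Name": "Alicia"},
    {"Id": "1", "source__modified": "t6", "grax__idseq": "c", "Name": "Allison"},
    {"Id": "4", "source__modified": "t9", "grax__idseq": "g", "Name": "Bill"},
    {"Id": "4", "source__modified": "t8", "grax__idseq": "a", "Name": "Bob"},
]
current = latest_per_record(rows)
versions = latest_per_version(rows)
```

Here `current` holds one row per Id and `versions` holds one row per Id and source__modified pair, so the duplicate write for Id 1 at t5 collapses into a single entry.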
Data can appear out of order. Similarly, for performance and reliability reasons, v2 can write data that is logically out of order. For example, data-X.parquet may be written first with Id=1 and source__modified=t2, while data-Y.parquet may be written afterward with Id=1 and source__modified=t1. This can also happen within the same data-X.parquet file. Use the fact that grax__idseq will increase for each Id and source__modified combination to order data.
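As a small illustration of ordering by grax__idseq, the sketch below rebuilds the history of one record from rows that arrived out of order. The `history` helper is hypothetical, and string comparison of grax__idseq is again an assumption.

```python
def history(rows, record_id):
    """Return the rows for one Id, ordered by grax__idseq (compared as strings)."""
    matching = [row for row in rows if row["Id"] == record_id]
    return sorted(matching, key=lambda row: row["grax__idseq"])


# Rows for Id 1 from the example table, deliberately shuffled.
shuffled = [
    {"Id": "1", "source__modified": "t6", "grax__idseq": "c", "Name": "Allison"},
    {"Id": "1", "source__modified": "t4", "grax__idseq": "a", "Name": "Alice"},
    {"Id": "1", "source__modified": "t5", "grax__idseq": "b", "Name": "Alicia"},
]
names = [row["Name"] for row in history(shuffled, "1")]
```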
Each file can have a different schema. The Parquet schema for each file is based on the data in that file. Only fields with non-empty values are considered and written.
Files' schemas may include fields that contain no data. It's also possible that a file's Parquet schema will include a field that has no data within the file.
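Because schemas differ from file to file, a consumer usually needs to build a combined column list and fill in missing values before merging rows. The sketch below shows one way to do that in plain Python; the `unify` helper and the in-memory file representation (a dict with `columns` and `rows`) are assumptions for illustration, not part of Data Lake.

```python
def unify(files):
    """Merge per-file schemas into one column list and pad rows with None.

    Each file is assumed to be {"columns": [...], "rows": [dict, ...]}.
    """
    columns = []
    for f in files:
        for name in f["columns"]:
            if name not in columns:
                columns.append(name)  # keep first-seen order
    rows = [
        {name: row.get(name) for name in columns}
        for f in files
        for row in f["rows"]
    ]
    return columns, rows


file_a = {
    "columns": ["Id", "source__modified", "grax__idseq", "Name"],
    "rows": [{"Id": "1", "source__modified": "t4", "grax__idseq": "a", "Name": "Alice"}],
}
file_b = {
    # Fax is in the schema even though no row in this file has a value for it.
    "columns": ["Id", "source__modified", "grax__idseq", "grax__purged", "Fax"],
    "rows": [{"Id": "2", "source__modified": "t5", "grax__idseq": "q", "grax__purged": "t1"}],
}
columns, merged = unify([file_a, file_b])
```

Every merged row ends up with the same keys, so downstream code can treat a missing field and an empty field the same way.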
Typed fields may vary or clash over time. As schema and data change over time, a typed field such as Custom_Field__c_b may stop receiving data if the schema and data indicate that the field has changed to a string. Similarly, Custom_Field__c_i may stop receiving data in favor of Custom_Field__c_f if the schema and data indicate that the field has changed from an integer to a float.
Files written after a record has been purged will contain no field data for that record.
Key differences from v1
The key differences between Data Lake v1 and v2 are:
- Reduced delay from data being added in GRAX to writing to Data Lake
- Increased throughput when backfilling newly enabled objects and handling large volumes of changes
- Improved writing intelligence to reliably keep objects up-to-date
- System improvements to remove possibilities of missed writes observed in v1
- Different path structure (v2 prefix, day=YYYY-MM-DD/hr=HH to batch=444444444)
- Increased max file size (10 MB for v1 to 100 MB for v2)
- Addition of source__modified, grax__idseq, and grax__restoredfrom fields
- Addition of typed fields for non-string values
- Removal of grax__added field