Data Lake v1 FAQ
Note
This documentation is for Data Lake v1. For current documentation, visit the main Data Lake documentation.
How many total objects can be enabled for Data Lake?
You can enable up to 100 objects for a single org.
How many objects should be enabled at the same time?
You should add all required objects to the enabled objects list; this ensures objects are pushed to Data Lake as efficiently as possible. GRAX continuously writes files for each object as capacity becomes available.
How is the Data Lake data organized?
The format for the folders looks like this: parquet/org=1234/object=myObject/day=nn/hour=nn. New folders are added in your storage bucket for each version/modification, following the same format. Folder names are based on the date and time that the object/version was backed up, always in UTC.
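As a rough illustration of the folder layout described above, the following sketch splits a path into its key=value segments. The function name and the sample day and hour values are hypothetical; only the overall path shape comes from this FAQ.

```python
def parse_partition_path(path: str) -> dict:
    """Split a path like 'parquet/org=1234/object=myObject/day=nn/hour=nn'
    into a dictionary of its key=value segments."""
    parts = {}
    for segment in path.strip("/").split("/"):
        # Segments without '=' (such as the leading 'parquet') are skipped.
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


# Sample path; the day and hour values here are made up for illustration.
example = parse_partition_path("parquet/org=1234/object=myObject/day=07/hour=15")
```

Values are kept as strings because this FAQ does not specify how the day and hour segments are encoded beyond the nn placeholder.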
Why are some hourly folders missing?
Folders are only created for hours that contained data updates/changes; if there are no updates to an object in any given hour, no folders or files are created for that hour.
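Downstream scripts that read these folders should therefore expect gaps. A minimal sketch, assuming hour folders are named hour=nn with values from 00 to 23 (the numbering range is an assumption, and the function name is hypothetical):

```python
def hours_without_updates(hour_folders):
    """Return the hours (0-23, an assumed range) that have no folder,
    meaning the object had no updates in that hour."""
    present = {
        int(name.split("=", 1)[1])
        for name in hour_folders
        if name.startswith("hour=")
    }
    return [hour for hour in range(24) if hour not in present]


# Folder names for one object and day; sample values for illustration only.
missing = hours_without_updates(["hour=00", "hour=13", "hour=14"])
```

A gap reported by such a check indicates an hour with no changes, not a failed write.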
What is the estimated size of parquet files that are written?
The size varies based on the amount and size of the data, but users should expect a range of 10 MB to 40 MB per folder/file.
Why does the record count for an object in Data Lake not match the record count shown in Auto Backup?
For backup, the numbers are split up into "records" and "versions." For Data Lake, the number of records and versions is combined in the “total records written” number. Please keep in mind that there might also be a small discrepancy between the totals if an object is still catching up in Data Lake. Additionally, if you had records backed up via Legacy Backup, those numbers are not included in the Auto Backup totals but are shown in the Data Lake numbers.
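The relationship between the two sets of numbers can be sketched as simple arithmetic. The function and parameter names below are hypothetical, and the sample counts are made up; the result is only an estimate, since an object that is still catching up in Data Lake can show a smaller total.

```python
def expected_data_lake_total(records: int, versions: int, legacy_records: int = 0) -> int:
    """Estimate the Data Lake 'total records written' figure from the
    Auto Backup record and version counts, plus any Legacy Backup records
    (which Auto Backup totals do not include)."""
    return records + versions + legacy_records


# Made-up counts for illustration only.
estimate = expected_data_lake_total(records=1000, versions=250, legacy_records=40)
```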
How do I rewrite an object in Data Lake if new attributes have been added or field permissions have been updated?
You can reset the object for Auto Backup within the “Tools and Diagnostics” page within the “Settings” tab. This re-runs the data for that object so updated versions are captured and subsequently added to your Data Lake files. This does not remove the previously written data from the parquet files; instead, it duplicates any records that were already written and are still present in the GRAX dataset.
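Because a reset writes duplicates, consumers of the parquet data may want to filter repeated rows after loading them. A minimal sketch, assuming rows are already loaded as dictionaries and that a record version can be identified by the Id and SystemModstamp fields (both field choices are assumptions, not something this FAQ confirms about the written files):

```python
def drop_duplicate_versions(rows, key_fields=("Id", "SystemModstamp")):
    """Keep the first occurrence of each record version, identified by
    the assumed key fields, and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the second row repeats the first, as it would after a reset.
rows = [
    {"Id": "001A", "SystemModstamp": "2023-01-01T10:00:00Z", "Name": "Acme"},
    {"Id": "001A", "SystemModstamp": "2023-01-01T10:00:00Z", "Name": "Acme"},
    {"Id": "001A", "SystemModstamp": "2023-01-02T09:30:00Z", "Name": "Acme Corp"},
]
deduplicated = drop_duplicate_versions(rows)
```

Distinct versions of the same record are kept; only exact repeats of the same version are dropped.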
Should I continue to run Auto Backup and archive jobs while objects are being enabled for Data Lake?
Yes, GRAX is designed to have all these tasks run concurrently without diminished performance.
What factors impact the speed of Data Lake processing?
Many variables affect the speed, including the amount and size of data, available CPU, and other app activity. There is no set time frame for how long writing an object (or objects) takes, as this is determined by specific factors within each org.
How can I get Data Lake objects written faster?
Data Lake speed is dependent on various factors such as the size and quantity of objects. There is no action that can be taken to increase the writing speed, but GRAX is continuously working to maximize efficiency of this feature.
Is there any effect on the speed/performance in other areas of the app when running Data Lake?
Functionality that needs to scan backup data (Global Search, archive jobs, etc.) might run a bit slower than usual, but there should not be any noticeable difference or related errors.
What are the necessary CPU/VM resources needed to run Data Lake effectively?
Customers must have at least the minimum requirements for GRAX as defined here. This allows Data Lake to run as intended, but having better CPU and RAM resources improves speed and capacity.
Can I turn Data Lake on and off if needed?
Yes, you can remove objects from the enabled list to pause Data Lake for a specific object. Once those objects are added back to the enabled list, Data Lake picks up where it left off. This might be useful if prioritizing certain objects or if internal flows need to be adjusted.