Monitoring
Whether you design and deploy the infrastructure for GRAX yourself or use GRAX-provided designs/templates, installing GRAX often means running cloud infrastructure within your own environment. For safety, security, and compliance, the GRAX team cannot access or manage this infrastructure. Automated monitoring policies help ensure that issues with this infrastructure are noticed quickly and that downtime of your app remains low.
Application Support Policy
For more information about what's covered within the scope of GRAX support obligations, review our support documentation.
What can be monitored?
The exact observability/monitoring tools and configurations vary based on the cloud provider or environment used for installing GRAX, but at a high level the major components to monitor are:
- Load Balancer (if applicable)
- Instance Usage
- GRAX Service
- Postgres Database
- Overall Health and Replacement
Cost-based alerts for budgeting thresholds and forecasts can be configured separately from the GRAX infrastructure if required. See the documentation for your cloud provider of choice for more information. Automatic restriction of resources based on cost thresholds or budgets may cause interruption to your GRAX service.
Global services (like AWS's S3 or IAM) can be monitored at a per-service level but don't require further custom monitoring specific to your account.
Load Balancer
If your infrastructure deployment contains a load balancer for stable connectivity, it must be reachable on a given domain name with a valid certificate and have healthy targets behind it. The monitoring criteria are:
- Application domain is registered
- Application domain is non-expired
- Domain certificate is non-expired
- Domain certificate is assigned to the load balancer (for example, an AWS ALB)
- Certificate is installed on instance (if applicable)
- Certificate installed on instance is non-expired (if applicable)
- Load Balancer is reachable from intended network segment
- Targets are healthy (see below for health checks)
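The certificate expiry criteria above can be checked from any host that can reach the load balancer. The following is a minimal sketch using only the Python standard library, not a GRAX-provided tool. The function names, the default port of 443, and the timeout are assumptions made for illustration.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". A negative result means
    the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch notAfter from the certificate served at hostname:port.

    The default context verifies the chain and the hostname, so an invalid
    or already expired certificate raises ssl.SSLCertVerificationError here
    instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("grax.example.com"))` and alert when the result drops below a chosen number of days. The hostname and the alert threshold are placeholders to adapt to your environment.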
Instance Usage
The GRAX Application workload can be varied and inconsistent based on Salesforce usage. As such, occasional heavy-load periods and periods with almost no usage are normal. We recommend the following monitoring criteria:
- CPU usage should remain below 80% on average (4-8hr roll up)
- RAM usage should remain below 80% on average (4-8hr roll up)
- Temp directory total size should be at least 500GB
- Temp directory free space should be at least 15% of total size
- Network usage should remain below 80% on average (4-8hr roll up)
For more information about the required specifications of GRAX hardware, review the technical requirements document. If you use AWS, the AWS documentation covers instance and auto-scaling metrics in more detail.
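The instance usage criteria above can be expressed as a small set of threshold checks. This sketch assumes you already collect utilization percentages over a 4-8 hour window from your monitoring tool of choice. The function names, the metric names, and the reading of 500GB as decimal gigabytes are assumptions, not part of GRAX.

```python
import shutil
from statistics import mean
from typing import Dict, List, Sequence

MAX_AVERAGE_PERCENT = 80.0          # CPU, RAM and network ceiling from the criteria
MIN_TEMP_TOTAL_BYTES = 500 * 10**9  # 500GB, assuming decimal gigabytes
MIN_TEMP_FREE_FRACTION = 0.15       # at least 15% of the temp directory free


def usage_alerts(samples: Dict[str, Sequence[float]]) -> List[str]:
    """Return the metrics whose average over the window is 80% or more.

    `samples` maps a metric name (for example "cpu", "ram", "network") to
    utilization percentages collected over one roll-up window. Metrics
    with no samples are skipped rather than reported.
    """
    return [name for name, values in samples.items()
            if values and mean(values) >= MAX_AVERAGE_PERCENT]


def temp_dir_alerts(total_bytes: int, free_bytes: int) -> List[str]:
    """Check the temp directory size and free space criteria."""
    alerts = []
    if total_bytes < MIN_TEMP_TOTAL_BYTES:
        alerts.append("temp directory smaller than 500GB")
    if free_bytes < MIN_TEMP_FREE_FRACTION * total_bytes:
        alerts.append("temp directory free space below 15%")
    return alerts


def check_temp_dir(path: str) -> List[str]:
    """Run the temp directory checks against a real path on this host."""
    usage = shutil.disk_usage(path)
    return temp_dir_alerts(usage.total, usage.free)
```

Averaging over the whole window, and not alerting on single samples, matches the expectation that occasional heavy-load periods are normal. The path passed to `check_temp_dir` depends on where your installation keeps its temp directory.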
GRAX Service
Ensuring that the GRAX Application remains running on the instance is foundational to success. It's highly recommended to run GRAX as a service with an auto-restart configuration so that the app boots again after a fatal error.
External Health Endpoint
The GRAX service offers an endpoint for external health checks, such as those performed by AWS ALB Target Groups. Make a request with the following parameters to check whether the app is available:
Port: 8000 (default)
Path: /health
Method: GET
Protocol: HTTPS (HTTP1 Only)
If the GRAX service is running, the GET request above returns a status of 200. This endpoint is designed for load balancer registration and de-registration, not for instance replacement.
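The request described above can be issued from a script as well as from a load balancer. The following is a minimal sketch using Python's standard library, which speaks HTTP/1.1. The function names, the timeout, and the injectable `opener` parameter (included so the logic can be exercised without a network) are assumptions for illustration.

```python
import http.client
import urllib.error
import urllib.request


def build_health_url(host: str, port: int = 8000, scheme: str = "https") -> str:
    """Build the health check URL from the documented port and path."""
    return f"{scheme}://{host}:{port}/health"


def is_healthy(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True only if a GET request to `url` answers with HTTP 200."""
    request = urllib.request.Request(url, method="GET")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # urlopen raises HTTPError (an OSError subclass) for 4xx/5xx replies.
        # Refused connections, TLS failures and timeouts are OSErrors too.
        # All of them count as "not healthy" here.
        return False
```

For example, `is_healthy(build_health_url("grax.example.com"))` returns `True` only when the service answers with a 200. The hostname is a placeholder. If the instance uses a certificate that your host does not trust, the TLS handshake fails and the check reports unhealthy.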
Postgres Database
A valid connection to the app database is required for boot and operation of GRAX. Monitoring the GRAX database isn't unlike monitoring any other app database. Monitoring should cover the following:
- CPU usage should remain below 80% on average (4-8hr roll up)
- RAM usage should remain below 80% on average (4-8hr roll up)
- Total disk usage should remain below 80% (if applicable)
More options for monitoring Postgres are available depending on platform and vendor, including queue depth, IOPS statistics, and network throughput. For details on monitoring these metrics, see the monitoring documentation for AWS RDS or Azure Postgres, as applicable.
Overall Health and Replacement
The monitoring of the major components above, in combination with other standard cloud provider or environment monitors, may indicate a problem that requires action to recover from.
Traditional cloud operations best practices apply, and most problems require that an operator review metrics, logs and configuration to understand and resolve the issue.
Some perceived issues require no action but waiting. Examples of this include:
- The instance restarted and is performing automatic security updates and service configuration
- The GRAX service restarted and is performing an automatic database migration
- The Salesforce API is returning 500s indicating a service outage
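One common way to avoid alerting on transient conditions like these is to require several consecutive failed checks before raising an alert. The sketch below illustrates that idea. The function name and the default threshold of 3 are assumptions, not GRAX recommendations.

```python
from typing import Sequence


def should_alert(results: Sequence[bool], failure_threshold: int = 3) -> bool:
    """Alert only when the most recent `failure_threshold` checks all failed.

    `results` holds health check outcomes, oldest first, where True means
    the check passed. Fewer results than the threshold never trigger an alert.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    recent = results[-failure_threshold:]
    return len(recent) == failure_threshold and not any(recent)
```

Choose the threshold and the check interval together so that the resulting grace period covers a typical restart or migration in your environment.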
Other issues require manual review and action. Examples of this include:
- The instance CPU, memory, or network is sustained at or near 100% utilization, which indicates the workload should be moved to a larger instance
- The instance disk periodically fills up and GRAX crashes, which indicates the instance should be reconfigured with a disk location and size that meet the minimum requirements
- GRAX crashes while connecting to the database, which indicates the database configuration needs to be updated
The response to some issues can be fully automated. Examples of this include:
- An AWS instance status check indicates a hardware failure. An AWS Auto Scaling group can periodically evaluate this check and automatically replace the instance.
In all cases, GRAX is designed to be simple and resilient to problems. After any amount of downtime, it picks up where it left off once it resumes normal operations.