RightScale Grid is robust to transient failures in the AWS infrastructure while leaving the user free to design a global error handling strategy.
Transient errors in AWS are common and normal. Amazon has long recommended that users who receive such transient errors apply a careful strategy of retries and backoffs before declaring the operation failed. RightScale Grid implements exponentially backed-off retries on all EC2, S3, and SQS operations.
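The retry strategy described above can be illustrated with a minimal sketch. It is not RightScale Grid's actual code: the names `with_backoff`, `TransientError`, and `PermanentError`, as well as the attempt count, starting delay, and growth factor, are assumptions chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


class PermanentError(Exception):
    """Hypothetical marker raised once every retry has failed."""


def with_backoff(operation, max_attempts=5, base_delay=0.5, factor=2.0,
                 sleep=time.sleep):
    """Call operation(), retrying transient failures with growing delays.

    The delay starts at base_delay seconds and is multiplied by factor
    after each failed attempt (assumed values, not RightScale defaults).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                # All retries are used up: escalate to a permanent error.
                raise PermanentError(
                    f"gave up after {attempt} attempts") from exc
            sleep(delay)
            delay *= factor
```

A caller would wrap each AWS request in a function and pass it in, for example `with_backoff(lambda: fetch_message(queue))`, where `fetch_message` is a hypothetical helper that raises `TransientError` on a retryable failure.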
Certain execution errors are classified as 'permanent errors'. A work_unit is tagged as a permanent error if all retries of an AWS operation fail, if the required input files for the work_unit are missing, or if the application itself returns an error code. A permanent error interrupts processing of the work_unit and jumps immediately to the error routing and reporting logic. Once the errored work_unit is successfully reported, the rightworker resumes execution with the next pending work_unit.
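The control flow of a worker that handles permanent errors this way might look roughly like the sketch below. The dictionary layout of a work_unit, the function names, and the convention that a nonzero return code signals an application error are all assumptions made for illustration, not the rightworker's real interface. Exhausted AWS retries are assumed to surface as the same `PermanentError`.

```python
import os


class PermanentError(Exception):
    """Hypothetical marker for an error that ends a work_unit."""


def process_unit(unit, run_app):
    """Process one work_unit; raise PermanentError on unrecoverable faults."""
    # Missing input files are a permanent error.
    missing = [p for p in unit["inputs"] if not os.path.exists(p)]
    if missing:
        raise PermanentError(f"missing input files: {missing}")
    # Assumption: the application signals failure with a nonzero code.
    code = run_app(unit)
    if code != 0:
        raise PermanentError(f"application returned error code {code}")
    return {"id": unit["id"], "status": "ok"}


def worker_loop(units, run_app, report_result, report_error):
    """Process units in order; an error is reported, then work continues."""
    for unit in units:
        try:
            result = process_unit(unit, run_app)
        except PermanentError as exc:
            # Jump straight to error reporting, then move to the next unit.
            report_error({"id": unit["id"], "status": "error",
                          "reason": str(exc)})
            continue
        report_result(result)
```

The point of the design is that one failed work_unit never stops the worker: the error is reported and the loop moves on to the next pending unit.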
RightScale Grid has a highly configurable error handling system. Permanent errors can be reported to the result queue (alongside success results), to a special error queue, or to both. Additionally, RightScale Grid can report errors to a result server using HTTP POST requests. For information on how to configure error reporting, see the parameters result_queue, result_queue_ignore_errored, result_server, result_server_ignore_errored, and error_queue in RightScale Results Processing Logic.
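As a rough illustration of how these routing options could combine, the sketch below picks the destinations for a single result. The parameter names come from the text above, but their semantics here are assumptions: the `*_ignore_errored` flags are taken to mean "do not send errored work units to this destination", and the configuration is modeled as a plain dictionary. Consult RightScale Results Processing Logic for the actual behavior.

```python
def route(config, errored):
    """Return the (transport, target) pairs a result should be sent to.

    config is a dict keyed by the parameter names from the documentation;
    the interpretation of each key is an assumption, not confirmed behavior.
    """
    targets = []
    # Result queue: receives successes, and errors unless told to skip them.
    if config.get("result_queue") and not (
            errored and config.get("result_queue_ignore_errored")):
        targets.append(("queue", config["result_queue"]))
    # Result server: reported to with an HTTP POST request.
    if config.get("result_server") and not (
            errored and config.get("result_server_ignore_errored")):
        targets.append(("http_post", config["result_server"]))
    # Dedicated error queue: receives errored work units only.
    if errored and config.get("error_queue"):
        targets.append(("queue", config["error_queue"]))
    return targets
```

Under these assumptions, setting `result_queue_ignore_errored` together with `error_queue` sends errors only to the special error queue, while leaving the flag unset sends them to both queues.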
© 2006-2014 RightScale, Inc. All rights reserved.
RightScale is a registered trademark of RightScale, Inc. All other products and services may be trademarks or servicemarks of their respective owners.