
RightGrid Basic Example

Warning: This page contains outdated or otherwise non-applicable information for a product that is no longer supported by RightScale.

Objective

The purpose of this tutorial is to introduce you to some of the key concepts about RightGrid and demonstrate how to set up a RightGrid application of your own. In this tutorial, you will build a "Hello World" RightGrid application that demonstrates the batch processing functionality of the RightGrid system.  

Note: RightScale now offers the RightGrid One-click Application tutorial, which uses a macro to create most of the key RightGrid components for you.

Prerequisites

This tutorial only applies to Grid or Premium accounts.  If you have a Developer account and would like to upgrade, please contact sales@rightscale.com.

Overview

RightGrid Architecture

Before you set up a RightGrid application, it helps to understand a few basic concepts and how the different pieces fit together.  The diagram below highlights the main parts of a basic RightGrid application.

rightgrid_overview_diagram.gif

  • Work Unit - the meta-data that describes the task.  It contains the path to the input data and any other relevant information.
  • Job Producer - submits work units to an SQS input queue and stores data associated with the work units in a bucket on S3.  The job producer can be a web server that sends long-running tasks, or it could be a back-end system that sends large numbers of tasks to be performed on a specific dataset.
  • Job Consumer - parses the result messages.  If necessary, it can also update a central database accordingly.
  • RightGrid Configuration File (rightworker.yml) - the configuration file that defines the key variables of RightGrid.
  • 'Worker' ServerTemplate - the ServerTemplate that is used to create each worker instance in the scalable server array.
  • Customer Application - defines how each work unit should be processed by a worker instance.

Now that you understand the core components of RightGrid, we will show you how to set up a basic RightGrid application.

Steps

When creating a RightGrid application or porting an existing application to RightGrid, most users perform the following tasks:

Step 1: Create SQS Queues (Input and Output)
Step 2: Create an S3 bucket
Step 3: Create a Job Producer
Step 4: Create a Job Consumer
Step 5: Create a RightGrid Configuration File
Step 6: Create your Application/Kicker Class
Step 7: Configure a ServerTemplate for Worker Instances
Step 8: Create a Queue-based Server Array
Step 9: Launch a Worker Instance and Test the Results

 

 

Step 1 - Create SQS Queues (Input and Output)

The first step is to create your SQS Queues.  You'll need to create an SQS Input Queue to receive work unit messages from the job producer and an SQS Output Queue to receive the result messages after the work unit has been processed. Optionally, you can create an SQS Audit Queue where RightGrid will send its audit entries.

rightgrid_diagram_queues.gif

 

Go to Clouds -> AWS Global -> Queues.  Click the New Queue button. 

screen-RightGridQueueNew.png

  • Queue Name - a nickname for the SQS queue.  NOTE:  You will need to create unique names.
  • API Generation - the generation of the SQS Queue API. Select 2. 
    NOTE:  All SQS Queues for RightGrid must use the second generation API.
  • Visibility Timeout - after a worker instance takes a work unit from the input queue, the input queue will "hide" the job in order to prevent multiple worker instances from taking the same job.   The visibility timeout specifies the time in seconds that a job will remain hidden before it becomes available to other worker instances.  If a worker instance successfully receives a work unit, it will send a message to delete the work unit from the input queue.  The visibility timeout ensures that if a worker is unsuccessful receiving a work unit, the work unit will become visible and available again to other worker instances.  (Default = 30)
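The hide-and-reappear behavior of the visibility timeout can be illustrated with a small in-memory model. This is only a sketch: `ToyQueue` is a hypothetical stand-in written for this example, not the SQS API, and time is passed in as plain seconds so the behavior is easy to follow.

```ruby
# In-memory stand-in for an SQS queue (hypothetical, for illustration only).
class ToyQueue
  Message = Struct.new(:body, :hidden_until)

  def initialize(visibility_timeout)
    @visibility_timeout = visibility_timeout
    @messages = []
  end

  def send_message(body)
    @messages << Message.new(body, 0)
  end

  # Return the first visible message and hide it for the timeout period.
  # Returns nil when every message is currently hidden (or the queue is empty).
  def receive(now)
    msg = @messages.find { |m| m.hidden_until <= now }
    msg.hidden_until = now + @visibility_timeout if msg
    msg
  end

  # A worker deletes the message once it has handled the work unit.
  def delete(msg)
    @messages.delete_if { |m| m.equal?(msg) }
  end
end

queue = ToyQueue.new(30)
queue.send_message('work unit 1')

taken_by_a = queue.receive(0)   # worker A takes the job at t = 0s
seen_by_b  = queue.receive(10)  # worker B gets nil at t = 10s: the job is hidden
retried    = queue.receive(31)  # A never deleted it, so it is visible again
```

If worker A had called `delete` before the 30 seconds elapsed, the third `receive` would have returned nil as well.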

Click the Create button.  A confirmation window will appear where you can add a message to the queue as a test.  

Go back to Clouds -> AWS Global -> Queues and repeat the process by creating an SQS Output Queue.

You should now have an input queue and output queue.

screen-RightGridQueueList.png
NOTE: Similar to S3 bucket names, SQS Queue names must be unique.

Step 2 - Create an S3 Bucket

You will need to create an S3 bucket to store work unit data.  You can use the same S3 bucket for storing input and output (result) data.  

       rightgrid_overview_s3.gif

Go to Clouds -> AWS Global -> S3 Browser and click the New Bucket button.

screen-RightGridS3New.png

  • Bucket Name - a nickname for your S3 bucket.  NOTE:  You will need to create a unique bucket name.
  • Location - the geographic location of your S3 bucket (US East is the default). 

Step 3 - Create a Job Producer

The job producer does not have to be running on EC2.  It can be located anywhere on the Internet.  The code for the job producer and job consumer can be written in any programming language provided that you can upload/download data to S3 and send/receive work units from SQS queues.

rightgrid_diagram_job_producer.gif


A job producer performs the following tasks:

  • It breaks down data-sets into individual work units.
  • It uploads a work unit's matching input data files to a specified bucket on S3.
  • It sends a message about a work unit to the SQS Input Queue.  NOTE: Since the RightGrid daemon on a worker instance will automatically grab work units that become available in the input queue, it's important that the associated input data files for the work unit are already available on S3.


To create a job producer follow the steps below.  

  1. Install the RightScale AWS interface gem (right_aws).
  2. Add code to upload the input data files to the specified bucket on S3.  (See '# Get S3 and SQS handle')
  3. Add code to generate and encode a work unit.  (See '# Generate work units')
  4. Add code to send a work unit's message to the SQS input queue.

Like all SQS messages, input queue messages are limited to 256KB. In most cases, the input queue messages contain only the work unit meta-data while the actual input data files for the worker application are uploaded to S3.
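One way to respect the size limit is to check the encoded work unit before it is sent. The sketch below assumes the 256KB figure stated above; `encode_work_unit_checked` and `MAX_SQS_MESSAGE_BYTES` are hypothetical names introduced for this example and are not part of right_aws or RightGrid.

```ruby
require 'yaml'

# Size limit taken from the text above (256KB).
MAX_SQS_MESSAGE_BYTES = 256 * 1024

# Hypothetical helper: encode a work unit as YAML and refuse oversized messages,
# so that large payloads are uploaded to S3 instead of being put in the queue.
def encode_work_unit_checked(work_unit)
  body = work_unit.to_yaml
  if body.bytesize > MAX_SQS_MESSAGE_BYTES
    raise ArgumentError,
          "work unit is #{body.bytesize} bytes; upload the payload to S3 instead"
  end
  body
end

# Meta-data only: the actual input data lives on S3.
body = encode_work_unit_checked(:id => 1, :s3_download => ['dw_rightgrid_demo/in/Log1.log'])
puts body.bytesize
```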

Sample Code

The sample code below is written in Ruby.  Use this code as a template for creating your own job producer.

 

jobproducer.rb
require 'yaml'
require 'rubygems'
require 'right_aws'

def upload_file(bucket, key, data)
  bucket.put(key, data)
end

def enqueue_work_unit(queue, work_unit)
  queue.send_message(work_unit)
end

# Load jobspec
jobspec = YAML::load_file("oneshotspec.yml")

# Get S3 and SQS handle
s3 = RightAws::S3.new(jobspec[:access_key_id], jobspec[:secret_access_key])
bucket = s3.bucket(jobspec[:bucket], false)
sqs = RightAws::SqsGen2.new(jobspec[:access_key_id], jobspec[:secret_access_key])
inqueue = sqs.queue(jobspec[:inputqueue], false)

# Generate work units
for id in 1..jobspec[:number_of_units]
  puts "Generating work unit #{id}"
  filename = "in/Log#{id}.log"
  text = "HelloWorld!"

  work_unit = {
    :created_at => Time.now.utc.strftime('%Y-%m-%d %H:%M:%S %Z'),
    :s3_download => [File.join(jobspec[:bucket], filename)],
    :worker_name => jobspec[:worker_name],
    :id => id,
  }

  wu_yaml = work_unit.to_yaml
  upload_file(bucket, filename, text)
  enqueue_work_unit(inqueue, wu_yaml)
  puts wu_yaml
end

 

oneshotspec.yml

---
:name: OneshotJob
:worker_name: RGHelloWorld
:number_of_units: 5000
:bucket: dw_rightgrid_demo
:inputqueue: RG-Input
:outputqueue: RG-Output
:access_key_id: <AWS_ACCESS_KEY>
:secret_access_key: <AWS_SECRET_ACCESS_KEY>

 
Notice that the 'work_unit' section is written in the YAML format.  If you use a different format, you will have to create an encoder for the message.  Therefore, we recommend using the YAML format. 
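As a minimal sketch of what the YAML encoding looks like in practice, the example below encodes a work unit with the same keys as the sample producer and decodes it again. The `decode_work_unit` helper is a hypothetical name, and the use of `YAML.safe_load` with `permitted_classes` assumes a reasonably recent Ruby (2.6 or later); it is not necessarily how RightGrid's own decoder works.

```ruby
require 'yaml'

# Work unit with the same keys as the sample job producer.
work_unit = {
  :created_at  => '2014-01-01 00:00:00 UTC',
  :s3_download => ['dw_rightgrid_demo/in/Log1.log'],
  :worker_name => 'RGHelloWorld',
  :id          => 1
}

# Hypothetical helper: decode a YAML message body back into a hash.
# Symbol must be permitted because the work unit uses symbol keys.
def decode_work_unit(body)
  YAML.safe_load(body, permitted_classes: [Symbol])
end

encoded = work_unit.to_yaml
decoded = decode_work_unit(encoded)

puts encoded
puts decoded[:worker_name]
```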

 

Step 4 - Create a Job Consumer

Similar to the job producer, the job consumer can be located anywhere on the Internet. The job consumer typically parses a work unit's result message and data files.  It can also update a central database, if necessary. 
 

rightgrid_diagram_job_consumer2.gif

 

To create a job consumer that's compatible with the RightGrid framework:

  1. Install the RightScale AWS interface gem (right_aws).
  2. Add code to receive messages from the SQS output queue and decode the message.
  3. Add code to download the result files from S3.
  4. (Optional) Update a central database with the result data files.

Sample Code

The sample code below was written in Ruby.  Use this code as a template for creating your own job consumer.

 

jobconsumer.rb

require 'rubygems'
require 'yaml'
require 'right_aws'


def download_result(bucket, key)
    bucket.get(key)
end


def dequeue_entry(queue)
   queue.pop
end

# Load jobspec
jobspec = YAML::load_file("oneshotspec.yml")

# Get S3 and SQS handles
s3 = RightAws::S3.new(jobspec[:access_key_id], jobspec[:secret_access_key])
bucket = s3.bucket(jobspec[:bucket], false)
sqs = RightAws::SqsGen2.new(jobspec[:access_key_id], jobspec[:secret_access_key])
outputqueue = sqs.queue(jobspec[:outputqueue], false)

# Continually pop messages off the result queue
while true do
  msg = dequeue_entry(outputqueue)

  # pop returns nil when the queue is empty; wait briefly instead of spinning
  if msg.nil?
    sleep 5
    next
  end

  #Here is where you would:
  #  1. Decode msg
  #  2. Download result files from s3
  #  3. Update a central database/update job statistics
end
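The decode-and-update step left as comments in the loop could look roughly like the sketch below. It rests on assumptions: result messages are taken to be YAML-encoded hashes with a `:result` code where 0 means success (mirroring the result hash in the sample worker later in this tutorial), and `StubQueue` and `tally_results` are hypothetical names used so the example runs without AWS access.

```ruby
require 'yaml'

# Stand-in for the output queue (hypothetical). Like the real queue handle,
# pop returns nil once nothing is left.
class StubQueue
  def initialize(bodies)
    @bodies = bodies.dup
  end

  def pop
    @bodies.shift
  end
end

# Drain the queue, decode each result body and count successes and failures.
def tally_results(queue)
  tally = { :ok => 0, :failed => 0 }
  while (body = queue.pop)
    result = YAML.safe_load(body, permitted_classes: [Symbol, Time])
    key = result[:result] == 0 ? :ok : :failed
    tally[key] += 1
  end
  tally
end

bodies = [
  { :result => 0, :id => 1, :output => 'Goodbye World!' }.to_yaml,
  { :result => 1, :id => 2, :output => 'error' }.to_yaml
]
puts tally_results(StubQueue.new(bodies)).inspect
```

With a real queue handle you would decode the body of the popped message object rather than a plain string.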

 

 

Step 5 - Create a RightGrid configuration file (rightworker.yml)

The rightworker.yml configuration file is the heart of a RightGrid application. The config file sets variables needed by the RightGrid worker daemon in order to call the user's application with the correct parameters.  Be sure to place the rightworker.yml file in the same directory as the app worker.

The rightworker.yml file contains the following parameters:

  • Defines the environment (ex: development, staging, production).
  • Defines how to upload the results back to S3.
  • Defines how to send result messages to the corresponding queues.
  • Includes your AWS access and secret access keys.

The sample configuration below is written in YAML.  Use it as a template for creating your own rightworker.yml file.

 

rightworker.yml

development:
    RightWorkersDaemon:
        aws_access_key: <AWS_ACCESS_KEY>
        aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>
        log: RGHelloWorld.log
        email: yourName@yourSite.com
        halt_on_exit: true
        workers: 1
        user:
            custom_entry_a: user_entry_1
            custom_entry_b: user_entry_2
        queues:
            RG-Input:
                invocation_model: oneshot
                result_queue: RG-Output
                message_decoder: RightYamlDecoder
                s3_log: dw_rightgrid_demo/log/%{DATE}/%{MESSAGE_ID}
                s3_out: dw_rightgrid_demo/out/%{DATE}-%{TIME}-%{MESSAGE_ID}
                receive_message_timeout: 3600
                default_worker_name: RGHelloWorld
                life_time_after_fault: 7200
                s3_in: /tmp/s3_in
                s3_in_delete: false
                s3_in_overwrite: false
                s3_in_flat: true

In the sample code above, the parameters are defined in nested sections:

  • Environment
    • RightWorkersDaemon
      • User
      • Queues
'Environment' section:

The 'Environment' section is the highest-level section in the configuration file and is commonly used to create different configuration setups, such as for development, testing, and production.  You can have multiple environments and use a different RightGrid application for each environment.  Each environment section requires a subsection called 'RightWorkersDaemon.'

In this example, we are defining a 'development' environment.

'RightWorkers Daemon' section:

The 'RightWorkersDaemon' section holds all of the RightGrid-specific configuration information.  RightGrid will ignore any other subsections of the 'Environment' section. 

The 'RightWorkersDaemon' section includes the following variables: 

  • aws_access_key - your AWS access key ID.
  • aws_secret_access_key - your AWS secret key.
  • log - name of the RightGrid's log file.  This is not the application's log file.  NOTE: Stream names like STDOUT or STDERR are also allowed. 
  • email - (optional) an email address that will receive any error messages.
  • halt_on_exit - defines how an instance will be shutdown/terminated.
    • If set to true, the RightGrid daemon and the EC2 instance will both exit if no more jobs are in the input queue and it has been 55-59 minutes since the start of the last paying hour.  This process maximizes the usage time of all worker instances while minimizing overall usage costs since Amazon charges by whole hours.  (Useful for production environments.)
    • If set to false, the RightGrid daemon never exits and the instance is not terminated.  (Useful for debugging purposes.)  
  • workers - the number of RightGrid "workers" to be started on each worker instance.  Each "worker" can process one work unit at a time.  For small jobs, you might want to increase the number of "workers" per instance in order to maximize an instance's CPU usage.  (Default = 1)
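The shutdown rule described for halt_on_exit can be sketched as a small predicate. This is an illustration of the rule as worded above, not RightGrid's actual implementation; it assumes billing hours are counted from the instance's launch time, and both method names are hypothetical.

```ruby
# Minutes elapsed in the current (partial) billing hour, counted from launch.
def minutes_into_billing_hour(launched_at, now)
  ((now - launched_at) / 60).floor % 60
end

# True only when the input queue is empty AND the instance is 55-59 minutes
# into its current billing hour.
def shutdown_now?(launched_at, now, jobs_in_queue)
  return false unless jobs_in_queue.zero?
  (55..59).cover?(minutes_into_billing_hour(launched_at, now))
end

launched = Time.at(0)
puts shutdown_now?(launched, Time.at(30 * 60), 0)  # idle, but only 30 minutes in
puts shutdown_now?(launched, Time.at(56 * 60), 0)  # idle and inside the window
puts shutdown_now?(launched, Time.at(56 * 60), 4)  # inside the window, but jobs remain
```

Waiting for the end of the hour means an idle instance stays available for new work during time that has already been paid for.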
'User' section:

The 'User' section holds application-specific configuration information.  This is a useful way for passing common variables/information to all worker instances.  It can contain any number of key/value pairs.  RightGrid does not read this information, but rather passes it on to the do_work() method of the application class as part of the message_env hash.   In the sample code above, the 'custom_entry_a' and 'custom_entry_b' values will be passed to all worker instances.

'Queues' section:

The 'Queues' subsection defines one or more input queues to monitor. If multiple queues are specified, RightGrid will monitor them in round-robin order. The title of each queue subsection must be the exact name of the input queue.

The following variables are common to all queues:

  • invocation_model - can take one of two values: 'oneshot' or 'persistent'  (Default = oneshot)
  • message_decoder - the name of the Ruby class to use as a message codec.  The class must implement the codec interface described in the code below, and the file containing the codec class definition must reside in the working directory of RightGrid. 
  • result_queue - the SQS queue where result messages will be sent.  If omitted, results will be sent to no queue. 
  • s3_in - specifies a location on the local filesystem under which all S3 input data will be placed.  By default, this input data is staged to an automatically generated location on the local filesystem. 
  • s3_in_overwrite - if true, files already present on the local filesystem will be re-downloaded from S3 and overwritten when each new workunit requires them. (Default = True)
  • s3_in_delete -  if true, it will remove downloaded files when the worker finishes processing the workunit.  NOTE: Only files are removed.  Directory structures are left intact.  (Default = True)
  • s3_in_flat - controls the collapse of file hierarchies on S3 into a flat file space on the local filesystem.  If the downloaded file is not specified as 'local_path_and_name' then it:
    • is set to 'message_env['s3_in']/bucket/key' if s3_in_flat==false;
    • is set to 'message_env['s3_in']/filename' where 'filename' is a key base name without any bucket if s3_in_flat == true. If the file has its own local name specified, 's3_in_flat' does not affect it.  (Default = False)
  • s3_out - specifies a bucket and key on S3 under which RightGrid will upload any output files generated by the application. If omitted, output will not be uploaded. 
  • s3_log - specifies a bucket and key on S3 under which RightGrid will upload any log files generated by the application.  If omitted, logs will not be uploaded. 
  • receive_message_timeout - SQS visibility timeout for messages retrieved from the input queue.  While a message is invisible, other RightGrid instances will not be able to dequeue it.  Note that this does not guarantee that a workunit won't be processed by multiple RightGrid instances.
  • life_time_after_fault - if errors occur while processing a work unit, RightGrid will process it again for a maximum of 'life_time_after_fault' seconds. If the work unit hasn’t been successfully processed in that time interval, it is deleted. This parameter is only used if the message has a 'created_at' timestamp in its body.  (Default = 3600 seconds, or 1 hour)
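The effect of s3_in_flat on local file names can be sketched as follows. The helper name `local_path_for` is hypothetical; it only mirrors the two cases described for s3_in_flat above and ignores files that specify their own 'local_path_and_name'.

```ruby
# Hypothetical helper mirroring the s3_in_flat rule:
#   flat == false -> <s3_in>/<bucket>/<key>
#   flat == true  -> <s3_in>/<base name of key>
def local_path_for(s3_in, bucket, key, flat)
  if flat
    File.join(s3_in, File.basename(key))
  else
    File.join(s3_in, bucket, key)
  end
end

puts local_path_for('/tmp/s3_in', 'dw_rightgrid_demo', 'in/Log1.log', false)
puts local_path_for('/tmp/s3_in', 'dw_rightgrid_demo', 'in/Log1.log', true)
```

Note that with a flat layout, two keys that share a base name map to the same local file.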
Variables for "one-shot" queues: 
  • default_worker_name - the name of the worker class to invoke on work units; each work unit will be passed to the 'do_work()' method of this class.  The file containing the class definition should reside on the Ruby search path ($:).  In this example, we define 'RGHelloWorld' as the worker class in the sample code of the customer application below.
Variables for "persistent" queues: 
  • path_to_executable - the location of the application or kicker executable.

These are only a sample of the variables that can be defined in the rightworker.yml file.  For a complete list of all the variables, see the RightGrid User Guide.
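To check a configuration before deploying it, you can load the file yourself and inspect the queue settings for one environment. The sketch below fills in the defaults listed in the parameter descriptions above; `QUEUE_DEFAULTS` and `queue_settings` are hypothetical names for this example, and the code is not how the RightGrid daemon itself reads the file.

```ruby
require 'yaml'

# Defaults as listed in the parameter descriptions above.
QUEUE_DEFAULTS = {
  'invocation_model'      => 'oneshot',
  's3_in_overwrite'       => true,
  's3_in_delete'          => true,
  's3_in_flat'            => false,
  'life_time_after_fault' => 3600
}.freeze

# Return { queue_name => settings } for one environment, with defaults filled in.
# Raises KeyError if the environment or a required subsection is missing.
def queue_settings(yaml_text, environment)
  config = YAML.safe_load(yaml_text)
  daemon = config.fetch(environment).fetch('RightWorkersDaemon')
  daemon.fetch('queues').each_with_object({}) do |(name, settings), out|
    out[name] = QUEUE_DEFAULTS.merge(settings)
  end
end

sample = <<~CONFIG
  development:
    RightWorkersDaemon:
      workers: 1
      queues:
        RG-Input:
          result_queue: RG-Output
          s3_in_flat: true
CONFIG

settings = queue_settings(sample, 'development')
puts settings['RG-Input']['s3_in_flat']        # value set in the file
puts settings['RG-Input']['invocation_model']  # filled in from the defaults
```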

Step 6 - Create your application

RightGrid supports two ways of invoking the user's application: one shot or persistent.  In this example, we are using the 'one shot' invocation model where a new process is created for each new work unit that is received by a worker.

 

RGHelloWorld.rb

class RGHelloWorld

  def do_work(message_env, message)
    starttime = Time.now

    for j in 0...10000 do
      2 + 2
    end

    finishtime = Time.now

    result = {
      :result => 0,
      :id => message_env[:id],
      :starttime => starttime,
      :finishtime => finishtime,
      :created_at => message_env[:created_at],
      :output => "Goodbye World!"
    }
  end

end
 

Where message_env is given by the following hash:

message_env = {
  'tmp_dir' => "Directory for temp files",
  's3_in' => "Directory with the files downloaded from S3",
  'output_dir' => "Directory where the app or kicker should put output files to be uploaded to S3",
  'log_dir' => "Directory where the app or kicker should put log files to be uploaded to S3",
  'log_file' => "File for right_worker logs",
  'message_id' => "SQS message id",
  'controller' => "RightWorkersDaemon",
  'worker_name' => "name of user worker",
  'logger' => "handle to the RightGrid logger object",
  's3_downloaded_list' => {'bucket/key1' => 'local_file1', ..., 'bucket/keyN' => 'local_fileN'}
}
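A worker that actually uses message_env might read the downloaded input files and write its output into 'output_dir'. The sketch below is hypothetical: `UpcaseWorker` is an invented class, the message argument is assumed to be the decoded work unit hash with symbol keys, and the return value simply copies the shape of the result hash in the sample worker above. The dry run at the end builds a message_env by hand so the class can be tried locally without AWS access.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical worker: upper-cases each downloaded input file and writes the
# result into output_dir under the same base name.
class UpcaseWorker
  def do_work(message_env, message)
    message_env['s3_downloaded_list'].each_value do |local_file|
      text = File.read(local_file)
      target = File.join(message_env['output_dir'], File.basename(local_file))
      File.write(target, text.upcase)
    end
    # Assumed result shape, modeled on the sample worker's result hash.
    { :result => 0, :id => message[:id] }
  end
end

# Local dry run with a hand-built message_env.
Dir.mktmpdir do |dir|
  input = File.join(dir, 'Log1.log')
  File.write(input, 'HelloWorld!')
  out_dir = File.join(dir, 'out')
  FileUtils.mkdir_p(out_dir)

  env = {
    's3_downloaded_list' => { 'dw_rightgrid_demo/in/Log1.log' => input },
    'output_dir' => out_dir
  }
  result = UpcaseWorker.new.do_work(env, { :id => 1 })
  puts File.read(File.join(out_dir, 'Log1.log'))
  puts result.inspect
end
```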

 

 

Step 7 - Configure an EC2 'Worker' ServerTemplate

Each worker instance must be created with the same ServerTemplate.  RightScale provides the RightGrid Worker ServerTemplate, which is already configured to install RightGrid and automatically start the rightworker daemon.  Simply configure the ServerTemplate with the appropriate input parameters such as your SVN repository and add any custom RightScripts for your application and its dependencies.

Go to Design -> MultiCloud Marketplace -> ServerTemplates.  Find the latest version of the RSGrid Worker ServerTemplate.

Since you will probably customize the ServerTemplate, click the Clone button and rename the ServerTemplate "Worker."

Now go to the cloned ServerTemplate's Inputs tab and click the Edit link.

Each instance will need to be able to access your SVN repository in order to download your application code.

Define the following input parameters.

  • MON_PROCESS - set to Ignore (default).
  • OPT_GEMS_LIST - set to right_aws (default).
  • RAILS_ENV - development (default).
  • RSGRID_AUDIT_QUEUE - Specify the Amazon SQS queue to be used for audit entries from the worker code.
  • RSGRID_BUCKET - The Amazon S3 Bucket that is used to pass input files and results files between the job producer/Worker Code and the Worker Code/job consumer.
  • RSGRID_ERROR_QUEUE - Specify the Amazon SQS queue to be used for error messages. When processing an input message results in an error, the worker code places the result message in the error queue.
  • RSGRID_INPUT_QUEUE - Specify the Amazon SQS queue to be used for input messages. The job producer places new messages in the input queue and the worker code processes them.
  • RSGRID_OUTPUT_QUEUE - Specify the Amazon SQS queue to be used for output messages. The worker code processes input messages and places result messages in the output queue.
  • SYSLOG_SERVER - syslog.rightscale.com (default).

Step 8 - Create a Queue-based Server Array

The next step is to create a scalable server array for all worker instances.

You can create two types of server arrays, based on how you want the server array to resize.  You can either create a server array that will resize based on the number of jobs in the queue or based on the amount of time a job is in the queue.  For more information, see Server Arrays.

In this example we will create a server array based on the number of jobs in the queue. 

Go to Manage -> Arrays -> New.

screen-RightGridCreateArray.png

  • Nickname - name your server array.
  • Array type - Select: Queue-based
  • Deployment - each server array must be attached to a particular deployment.  A deployment can have multiple server arrays.   Select a deployment for the server array.  In this example, we've already created a "RightGrid Example" deployment. 
  • Active - the status of the server array. If active, the server array is enabled for scaling actions. For alert-based arrays, if you disable an array with running instances, the server array will no longer be able to scale up or down, but the instances will continue to run until they are manually terminated.
  • Cloud - the cloud infrastructure or EC2 region (US or EU) where the server array instances will be launched.
  • ServerTemplate - select the template that will be used to create the worker instances in the server array. Select: Worker from your Private ServerTemplates.
  • MultiCloud Image - select the MultiCloud Image to be used for your server array. Default is 'Inherit from ServerTemplate'.
  • Instance Type - select the Instance Type. Default is 'Inherit from ServerTemplate'.
  • SSH key - select the SSH key that will be used by all worker instances in the server array. Select your SSH key.
  • Security Group - select the security group that will be used by all worker instances in the server array. Select your Security Group.
  • Default min count - the minimum number of worker instances that must be operational in the server array.  You can set this value to zero to avoid idle server usage costs.  But, if you have time-sensitive tasks that need to be processed immediately, you might want to use a value of at least one worker instance.  Default = 1
  • Default max count - the maximum number of worker instances that can be operational at any given time in the server array. Default = 20
  • Availability Zone - specify which availability zone(s) new server instances should be launched into. Use the 'any' option to launch server instances randomly. Use the 'weighted' option to specify a desired ratio.
  • Elasticity function - defines how the server array should grow/shrink. Select sqs_queue_size
  • Elasticity params - defines the elasticity parameters that will be used to determine when a server array should be resized.
    • Items per instance - defines the number of queue items assigned to each worker instance.  Ex: If there are 50 items in the queue and "Items per instance" is set to 10, the server array will resize to 5 worker instances.  Enter: 20
  • Indicator - select the input SQS queue that contains all of the job tasks.  (ex: RG-Input)
  • Audit - (optional) select the audit SQS queue that will store audit entries.  Set this value to -none-.
  • Audit entry analysis - (optional) if checked, it will populate the Statistics tab for your array.

Click Save.

The next step is to activate your server array.  After saving your server array, find your server array and click the enable link.

Step 9 - Launch a Worker Instance and Test the Results

You're almost ready to launch your RightGrid application. 

The first step is to start the job producer to put work units into the SQS Input Queue.  Then go to Clouds -> AWS Global -> Queues and check your input queue to see the number of work units.

Next, go to your RightGrid deployment (ex: RightGrid Example).  Notice that the "RG Worker Array" is active and running. 

Click the Launch button.  An input confirmation window will appear.  Confirm that you are using the correct launch inputs and click the Launch button.

For example, with 500 work units sitting in the input queue and "Items per instance" set to 20, 25 instances would be needed.  However, because the server array defines a maximum of 20 instances, only 20 instances are launched.
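The sizing arithmetic in this step can be sketched as a small function. It is an illustration under assumptions rather than RightScale's actual elasticity function: the method name `desired_instances` is hypothetical, and rounding up to a whole instance is assumed.

```ruby
# Instances wanted for a given queue length, clamped to the server array's
# minimum and maximum counts. Rounding up is an assumption of this sketch.
def desired_instances(queue_size, items_per_instance, min_count, max_count)
  wanted = (queue_size.to_f / items_per_instance).ceil
  [[wanted, min_count].max, max_count].min
end

puts desired_instances(50, 10, 1, 20)   # 5
puts desired_instances(500, 20, 1, 20)  # 25 wanted, capped at the maximum of 20
puts desired_instances(0, 20, 1, 20)    # empty queue falls back to the minimum of 1
```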

Now check your SQS Queues.  You should now see the number of jobs in the output queue slowly increase as more jobs from the input queue are successfully processed.

NOTE:  The job producer and job consumer can run continuously.

Eventually all of the jobs will be processed and the server array will resize down to 1 worker instance.

Congratulations!  You've just created a basic RightGrid application and have seen how powerful and easy it is to use cloud computing resources for more efficient batch processing tasks.

Last modified
11:48, 26 Sep 2014

© 2006-2014 RightScale, Inc. All rights reserved.
RightScale is a registered trademark of RightScale, Inc. All other products and services may be trademarks or servicemarks of their respective owners.