Created: August 25, 2008
Updated: January 22, 2010
This page explains how to create Condor worker nodes in Amazon's EC2. It assumes that you have an Amazon EC2 account and that you are somewhat familiar with EC2 usage. It also assumes that you have an existing Condor pool (central manager) running on a host outside the cloud, and that you are familiar with Condor usage.
This page will describe how to create a pool that looks like this:

You will need the EC2 command-line tools, which can be downloaded here. There's a good video tutorial to get you started here. I will assume that you have installed and properly configured the command-line tools on your computer.
First you need to create an EC2 image that contains Condor. I assume you already know how to create an image, so I will cover only how Condor needs to be configured on it.
EC2 nodes are behind a firewall and NAT. That means the IP address used to reach a node from outside the cloud is different from the IP address used to reach it from within the cloud. Furthermore, the nodes do not have any interfaces (e.g. eth0) that are configured with the public IP. This is a problem because, by default, Condor takes the IP it advertises to the central manager from one of the worker node's interfaces, and all the IPs on those interfaces are non-routable. To get around this we need to configure Condor to advertise the public IP of the worker using some EC2 metadata and an obscure Condor configuration parameter.
First, we need to get the public IP and host name of the worker. EC2 has a simple RESTful interface that can be called from the worker using curl or wget to retrieve these values. To get the IP run:
wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-ipv4
For the public host name run:
wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-hostname
There are other metadata values that can be retrieved in the same way. You can get a list by running:
wget -q -O - http://instance-data.ec2.internal/latest/meta-data
You can also get the user data (i.e. the central manager host specified with -d when running ec2-run-instances) by running:
wget -q -O - http://instance-data.ec2.internal/latest/user-data
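The metadata calls above can be wrapped in a small helper so that a startup script does not repeat the URL everywhere. This is only a minimal sketch: the function names metadata_url and get_metadata are hypothetical, and it assumes wget is installed on the worker.

```shell
#!/bin/sh
# Base URL of the EC2 metadata interface, as used in the commands above.
METADATA_BASE="http://instance-data.ec2.internal/latest"

# Build the URL for a metadata key such as "public-ipv4".
# (metadata_url is a hypothetical helper name.)
metadata_url() {
    printf '%s/meta-data/%s\n' "$METADATA_BASE" "$1"
}

# Fetch a metadata value; the exit status is non-zero if the request fails.
# (get_metadata is a hypothetical helper name.)
get_metadata() {
    wget -q -O - "$(metadata_url "$1")"
}

# Example usage on a running instance:
#   PUBLIC_IP=$(get_metadata public-ipv4)
#   PUBLIC_HOSTNAME=$(get_metadata public-hostname)
```

Keeping the URL construction separate from the fetch makes it easy to check the helper without being on an EC2 instance.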
Next, we need to configure Condor to advertise the public IP rather than the private IP. We can do that using the Condor configuration parameter TCP_FORWARDING_HOST. Set TCP_FORWARDING_HOST to the public IP of the worker node. You also need to set PRIVATE_NETWORK_NAME. The way these two parameters work is, the worker advertises both the public IP and the private IP of the node. When another condor daemon wants to contact the worker, if the daemon's PRIVATE_NETWORK_NAME matches the PRIVATE_NETWORK_NAME of the worker, then it uses the private IP, otherwise it uses the public IP. We set the PRIVATE_NETWORK_NAME for EC2 workers based on the EC2 availability zone. You can get the availability zone from http://instance-data.ec2.internal/latest/meta-data/placement/availability-zone. The configuration ends up like this:
PRIVATE_NETWORK_NAME = amazon-ec2-<amazon-availability-zone>
TCP_FORWARDING_HOST = <public-ip-of-worker>
PRIVATE_NETWORK_INTERFACE = <private-ip-of-worker>
For example:
PRIVATE_NETWORK_NAME = amazon-ec2-us-east-1c
TCP_FORWARDING_HOST = 67.202.60.66
PRIVATE_NETWORK_INTERFACE = 10.253.191.243
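As a rough illustration, the three settings can be generated from values collected at boot. The helper name network_settings is hypothetical; the sketch assumes the availability zone, public IP, and private IP have already been fetched from the metadata interface (the private IP is available under the local-ipv4 metadata key).

```shell
#!/bin/sh
# Print the three network settings for one worker.
# Arguments: availability zone, public IP, private IP.
# (network_settings is a hypothetical helper name.)
network_settings() {
    zone=$1
    public_ip=$2
    private_ip=$3
    printf 'PRIVATE_NETWORK_NAME = amazon-ec2-%s\n' "$zone"
    printf 'TCP_FORWARDING_HOST = %s\n' "$public_ip"
    printf 'PRIVATE_NETWORK_INTERFACE = %s\n' "$private_ip"
}

# Example usage with the values shown above:
#   network_settings us-east-1c 67.202.60.66 10.253.191.243
```

The output can be redirected into the generated condor_config.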
With this configuration, any Condor daemons outside the cloud (i.e. the central manager daemons) will use the public IP to contact the worker, and any daemons within the cloud (i.e. condor_master) will use the private IP.
By default, EC2 workers are configured with a private hostname. When you execute the hostname command on a fresh EC2 instance you get something like "domU-12-31-38-01-B8-05.compute-1.internal." This is the value that Condor will try to report to the central manager. Because of the way we set TCP_FORWARDING_HOST it will work (setting HOSTALLOW_WRITE to *.compute-1.amazonaws.com even works because Condor does a reverse lookup on the IP), but it is rather confusing. Instead of using the private hostname, it is better to configure the node to use the public hostname by running:
hostname <public-hostname>
Of course, we can get the public hostname from http://instance-data.ec2.internal/latest/meta-data/public-hostname. It will be something like 'ec2-67-202-60-66.compute-1.amazonaws.com'.
The public hostname resolves to the private IP within the cloud, and the public IP outside the cloud, so changing the hostname shouldn't cause any problems for other services running on the worker.
EC2 is designed to start an operating system image on one or more virtual machines inside the cloud. It doesn't know anything about Condor or how to start and configure it. To get the workers to report to the central manager on startup, we need a script that starts Condor when the OS boots and shuts Condor down when the OS shuts down.
In my image this script is called /etc/init.d/condor. It simply collects the meta-data from the RESTful interface, sets the worker's hostname, generates condor_config, and starts the Condor worker daemons (condor_master and condor_startd).
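The following is a sketch of what such a script might look like. It is not the script from my image. The file paths, the location of the Condor binaries, the static settings file, and the function names are all assumptions made for illustration, and condor_off -master is just one way to stop the daemons.

```shell
#!/bin/sh
# Sketch of a worker init script in the spirit of /etc/init.d/condor.
# All paths and function names below are assumptions for illustration.

METADATA_BASE="http://instance-data.ec2.internal/latest"
CONDOR_SBIN="/usr/local/condor/7.0.4/sbin"        # assumed location of the daemons
CONDOR_CONFIG="/etc/condor/condor_config"         # assumed path of the generated config
STATIC_CONFIG="/etc/condor/condor_config.static"  # hypothetical file with the fixed settings
export CONDOR_CONFIG

# Fetch one value from the EC2 metadata interface.
fetch() {
    wget -q -O - "$METADATA_BASE/$1"
}

# Print the instance-specific settings.
# Arguments: central manager, availability zone, public IP, private IP.
instance_settings() {
    cat <<EOF
COLLECTOR_HOST = $1
PRIVATE_NETWORK_NAME = amazon-ec2-$2
TCP_FORWARDING_HOST = $3
PRIVATE_NETWORK_INTERFACE = $4
EOF
}

start() {
    # The user data is expected to hold the central manager host name.
    central_manager=$(fetch user-data) || return 1
    zone=$(fetch meta-data/placement/availability-zone) || return 1
    public_ip=$(fetch meta-data/public-ipv4) || return 1
    private_ip=$(fetch meta-data/local-ipv4) || return 1
    public_hostname=$(fetch meta-data/public-hostname) || return 1

    # Use the public hostname so the name reported to the pool is meaningful.
    hostname "$public_hostname"

    # Generate condor_config: instance-specific lines first, then the fixed part.
    instance_settings "$central_manager" "$zone" "$public_ip" "$private_ip" \
        > "$CONDOR_CONFIG"
    cat "$STATIC_CONFIG" >> "$CONDOR_CONFIG"

    # condor_master starts the other daemons named in DAEMON_LIST (the startd).
    "$CONDOR_SBIN/condor_master"
}

stop() {
    # Ask the master to shut down itself and the daemons it manages.
    "$CONDOR_SBIN/condor_off" -master
}

# Dispatch only when an action is given, so the file can also be sourced.
case "${1:-}" in
    start)   start ;;
    stop)    stop ;;
    restart) stop; start ;;
    "")      ;;
    *)       echo "Usage: $0 {start|stop|restart}" >&2 ;;
esac
```

Each metadata fetch aborts the start on failure, so a worker that cannot reach the metadata interface does not start Condor with a half-written configuration.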
The complete worker configuration looks like this after the startup script generates it:
COLLECTOR_HOST = <your-central-manager>
PRIVATE_NETWORK_NAME = <amazon-availability-zone>
TCP_FORWARDING_HOST = <public-ip-of-worker>
PRIVATE_NETWORK_INTERFACE = <private-ip-of-worker>

###############################################################################
# Pool settings
###############################################################################

# EC2 workers don't have shared filesystems or authentication
UID_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
USE_NFS = False
USE_AFS = False
USE_CKPT_SERVER = False

###############################################################################
# Local paths
###############################################################################

RELEASE_DIR = /usr/local/condor/7.0.4
LOCAL_DIR = /var/condor

# LOG and EXECUTE are set automatically by the startup script. They can't be
# changed here.
LOG = $(LOCAL_DIR)/log
EXECUTE = $(LOCAL_DIR)/execute
LOCK = $(LOG)

###############################################################################
# Security settings
###############################################################################

# Allow local host and the central manager to manage the node
HOSTALLOW_ADMINISTRATOR = $(FULL_HOSTNAME), $(COLLECTOR_HOST)

###############################################################################
# CPU usage settings
###############################################################################

# Don't count a hyperthreaded CPU as multiple CPUs
COUNT_HYPERTHREAD_CPUS = False

# No need to be nice
JOB_RENICE_INCREMENT = 0

# Leave this commented out. If your instance has more than one CPU (i.e. if
# you use a large instance or something) then condor will advertise one
# slot for each CPU.
#NUM_CPUS = 1

###############################################################################
# Daemon settings
###############################################################################

# Only master and startd, other daemons aren't needed on workers
DAEMON_LIST = MASTER, STARTD

SBIN = $(RELEASE_DIR)

ALL_DEBUG =

MASTER = $(SBIN)/condor_master
MASTER_ADDRESS_FILE = $(LOG)/.master_address
MASTER_LOG = $(LOG)/MasterLog
MASTER_CHECK_NEW_EXEC_INTERVAL = 86400

STARTD = $(SBIN)/condor_startd
STARTD_LOG = $(LOG)/StartdLog

STARTER = $(SBIN)/condor_starter
STARTER_STD = $(SBIN)/condor_starter.std
STARTER_LOG = $(LOG)/StarterLog
STARTER_LIST = STARTER, STARTER_STD

PROCD = $(SBIN)/condor_procd
PROCD_ADDRESS = $(LOG)/.procd_address
PROCD_LOG = $(LOG)/ProcLog
PROCD_MAX_SNAPSHOT_INTERVAL = 60

###############################################################################
# Classads
###############################################################################

# Run everything, all the time
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
WANT_VACATE = False
WANT_SUSPEND = True
SUSPEND_VANILLA = False
WANT_SUSPEND_VANILLA = True
KILL = False
STARTD_EXPRS = START

###############################################################################
# Network settings
###############################################################################

# TCP works better in the WAN. Note that you still need to open UDP ports on
# the Amazon firewall or your workers will become claimed, run a couple jobs,
# and never go back to Unclaimed. I'm not sure why this is the case.
UPDATE_COLLECTOR_WITH_TCP = True

# Use random numbers here so the workers don't all hit the collector at
# the same time. If there are many workers the collector can get overwhelmed.
UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)

# Port range for Amazon firewall
LOWPORT=40000
HIGHPORT=40050
I assume your central manager is configured to use host-based authentication. If that is not the case, you are on your own. You will need to change the configuration file stored in the image to use your preferred authentication method. That's all I'll say about it right now.
Assuming that you ARE using host-based authentication, you need to configure the central manager to accept connections from the workers. All Amazon workers (as of this writing) are using public hostnames in the compute-1.amazonaws.com domain, so you will need to add '*.compute-1.amazonaws.com' to the HOSTALLOW_WRITE expression in condor_config. This leaves your system pretty exposed, so you might want to configure better security.
You will also need to allow the workers to connect via TCP. This makes the status updates a bit more reliable than UDP, which is important because the EC2 workers are far away across a WAN from your central manager. Add this to your central manager configuration:
HIGHPORT = 40050
LOWPORT = 40000
UPDATE_COLLECTOR_WITH_TCP=True
COLLECTOR_SOCKET_CACHE_SIZE=1000
After making these changes make sure you restart Condor.
Finally, if your central manager host has a firewall, you need to make sure it allows incoming connections from *.compute-1.amazonaws.com on port 9618 and on ports 40000-40050.
Before you can run Condor workers on EC2 you will need to create a new EC2 security group to run the workers in. This security group will be configured to enable your worker nodes to communicate freely with your central manager.
First, create a group with the ec2-add-group command. For this document the group will be called 'condor', but you can call it whatever you want. Just make sure to replace the group name in all of the commands that follow.
$ ec2-add-group condor -d "Group for Condor workers"
The -d flag specifies the description of the new group. You can make sure the group was created by running ec2-describe-group.
Next, we need to allow the central manager to communicate with the worker nodes by opening up ports in the group's firewall. The worker nodes will be configured to use ports 40000-40050. We need to open both UDP and TCP for access from the central manager. We will allow these ports to be accessed by the central manager, which is specified using a CIDR address. In this case we will use the /32 CIDR suffix to specify that we only want to include one host instead of an entire subnet. We can configure this using the ec2-authorize command:
$ ec2-authorize condor -P tcp -p 40000-40050 -s <central-manager-ip>/32
$ ec2-authorize condor -P udp -p 40000-40050 -s <central-manager-ip>/32
The -P argument specifies the protocol, the -p argument specifies the port range, and the -s argument specifies the source subnet (in this case, a single host). Note that the workers are configured with UPDATE_COLLECTOR_WITH_TCP = True, but UDP is still required. I don't know why. If you don't enable UDP, then the workers will become claimed, run the job, but never go back to unclaimed.
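If you set up groups often, the two authorizations can be generated by a small loop. This is only a sketch: authorize_commands is a hypothetical helper that prints the commands rather than running them, so you can review them first, and it assumes the same 40000-40050 port range used on this page.

```shell
#!/bin/sh
# Print one ec2-authorize command per protocol for the given security group
# and central manager IP. Nothing is executed here.
# (authorize_commands is a hypothetical helper name.)
authorize_commands() {
    group=$1
    central_manager_ip=$2
    for proto in tcp udp; do
        printf 'ec2-authorize %s -P %s -p 40000-40050 -s %s/32\n' \
            "$group" "$proto" "$central_manager_ip"
    done
}

# Example usage (review the output, then pipe it to sh to run it):
#   authorize_commands condor <central-manager-ip>
```

Printing the commands first makes it harder to forget the group name or one of the two protocols.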
Now we are ready to launch workers. You can launch an instance of the image you created earlier with the ec2-run-instances command:
$ ec2-run-instances <image-id> -d <central-manager> -n <no-instances> -g condor
The -d flag is used to specify arbitrary user data. Your image should use the user data as the central manager host name (i.e. CONDOR_HOST in condor_config). The -n flag specifies the number of instances to launch. The -g flag specifies the security group we created earlier.
After a few minutes you should see the node(s) show up in condor_status:
$ condor_status
Name               OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
ec2-67-202-60-66.c LINUX  INTEL  Unclaimed  Idle      0.000   1706  0+00:01:10

             Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX      1      0        0          1        0           0         0
      Total      1      0        0          1        0           0         0
At this point you should be able to run jobs on the nodes.
The configuration described here is rather limited, allowing only the CONDOR_HOST parameter to be specified when starting a worker. This is the simplest solution that works, but it is not very flexible. An enhancement would be to create an image that allows users to add or change arbitrary Condor configuration parameters at start time. This could be accomplished by encoding the values in the user-data field, or by providing a user-data file (using the -f argument to ec2-run-instances).
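One way the user-data idea could work is for the startup script to treat the user data as a list of NAME = value lines and append them to condor_config. The sketch below is an assumption about how that might be done, not something the image described here implements. The function name filter_config_lines is hypothetical, and the filter only checks the shape of each line, not whether the values are sensible or safe.

```shell
#!/bin/sh
# Read text on stdin and keep only lines that look like "NAME = value".
# Comments, blank lines, and anything else are dropped.
# (filter_config_lines is a hypothetical helper name.)
filter_config_lines() {
    grep -E '^[A-Za-z_][A-Za-z0-9_]*[[:space:]]*='
}

# Example usage inside a startup script (the config path is an assumption):
#   wget -q -O - http://instance-data.ec2.internal/latest/user-data \
#       | filter_config_lines >> /etc/condor/condor_config
```

Because later settings in a Condor configuration file override earlier ones, appending the user-supplied lines at the end would let them replace the defaults baked into the image.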
As I mentioned earlier, this solution relies on host-based authentication. This is not ideal because it exposes the user's central manager to all hosts in the cloud, not just the ones owned and managed by the user. A more secure method would use the keypairs that EC2 uses to authenticate SSH (the -k argument to ec2-run-instances, which can be accessed using the RESTful metadata interface), or something like GSI security.