Generic spot instance fleet manager for running long-running jobs on AWS EC2 spot instances.

## Features

- Multi-AZ Failover: automatically tries other availability zones when spot capacity is unavailable
- Auto-Recovery: monitors instances and automatically restarts failed jobs
- S3 Checkpointing: downloads checkpoints from S3 when recovering jobs
- Email Alerts: get notified when instances go offline or jobs complete
- Configurable Profiles: define instance types and spot prices in JSON
## Setup

```bash
# Copy example configs
cp fleet.env.example fleet.env
cp configs/instances.json.example configs/instances.json
cp configs/profiles.json.example configs/profiles.json

# Edit fleet.env with your AWS settings
vim fleet.env
```

Edit `fleet.env` with your:

- `FLEET_KEY_NAME` - SSH key pair name
- `FLEET_SECURITY_GROUP` - Security group ID (must allow SSH)
- `FLEET_AMI_ID` - AMI to use
## Job Configuration

In `fleet.env`, configure:

```bash
JOB_NAME="my-job"
JOB_BUILD_CMD="make build"
JOB_START_CMD="./run.sh --start %START% --end %END% --checkpoint %CHECKPOINT%"
JOB_PROCESS_PATTERN="run.sh"
```

Edit `configs/instances.json`:
```json
{
  "instances": [
    {"num": 1, "ip": "", "start": 1000, "end": 2000, "desc": "Range1"},
    {"num": 2, "ip": "", "start": 2001, "end": 3000, "desc": "Range2"}
  ]
}
```

## Usage

```bash
# Launch a single instance
./scripts/launch-instance.sh launch --profile gpu-t4

# Check status
./scripts/launch-instance.sh status

# SSH into instance
./scripts/launch-instance.sh ssh
```

```bash
# Start job on instance 1
./scripts/recover-job.sh 1

# Start jobs on multiple instances
./scripts/recover-job.sh 1 2 3

# Start on all configured instances
./scripts/recover-job.sh all
```

```bash
# One-time check
./scripts/monitor-fleet.sh

# Continuous monitoring
./scripts/monitor-fleet.sh --watch

# Monitor with auto-recovery
./scripts/monitor-fleet.sh --watch --auto-recover
```

## Project Structure

```
ec2-spot-fleet/
├── fleet.env                  # Main configuration (create from example)
├── fleet.env.example          # Configuration template
├── scripts/
│   ├── launch-instance.sh     # Launch spot instances
│   ├── monitor-fleet.sh       # Monitor fleet, auto-recovery
│   ├── recover-job.sh         # Restart jobs on instances
│   └── lib/
│       ├── common.sh          # Shared functions
│       └── email.sh           # Email alerting
├── configs/
│   ├── instances.json         # Instance definitions (create from example)
│   ├── instances.json.example # Instance config template
│   ├── profiles.json          # Instance profiles (create from example)
│   └── profiles.json.example  # Profiles template
└── examples/
    └── long-running-job/      # Example job configuration
```
## Configuration Reference

| Variable | Description | Required |
|---|---|---|
| `FLEET_REGION` | AWS region | Yes |
| `FLEET_KEY_NAME` | SSH key pair name | Yes |
| `FLEET_SECURITY_GROUP` | Security group ID | Yes |
| `FLEET_AMI_ID` | AMI ID | Yes |
| `FLEET_PROJECT_TAG` | Project tag for resources | Yes |
| `FLEET_PROFILE` | Default instance profile | No |
| `FLEET_MAX_INSTANCES` | Safety limit | No |
| `FLEET_WORKSPACE` | Remote workspace path | No |
| `FLEET_SYNC_PATH` | Local path to sync | No |
| `JOB_START_CMD` | Command to start job | Yes (for jobs) |
| `JOB_PROCESS_PATTERN` | Pattern to detect running job | Yes (for jobs) |

See `fleet.env.example` for all options.
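For orientation, a minimal `fleet.env` might look like the sketch below. Every value here is a placeholder; `fleet.env.example` remains the authoritative template:

```shell
# fleet.env — minimal example (all values are placeholders)
FLEET_REGION="us-east-1"
FLEET_KEY_NAME="my-keypair"
FLEET_SECURITY_GROUP="sg-0123456789abcdef0"
FLEET_AMI_ID="ami-0123456789abcdef0"
FLEET_PROJECT_TAG="my-project"

JOB_NAME="my-job"
JOB_START_CMD="./run.sh --start %START% --end %END% --checkpoint %CHECKPOINT%"
JOB_PROCESS_PATTERN="run.sh"
```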
### instances.json

```jsonc
{
  "instances": [
    {
      "num": 1,        // Instance number (unique)
      "ip": "1.2.3.4", // Current IP (empty if not launched)
      "start": 1000,   // Start value for job
      "end": 2000,     // End value for job
      "desc": "Range1" // Description
    }
  ]
}
```

### profiles.json

```json
{
  "profiles": {
    "gpu-t4": {
      "type": "g4dn.xlarge",
      "spot_price": "0.30",
      "description": "1x T4 GPU"
    }
  }
}
```

## Placeholders

In `JOB_START_CMD`, `JOB_SETUP_CMD`, and `JOB_LOG_PATTERN`, use these placeholders:
| Placeholder | Replaced With |
|---|---|
| `%NUM%` | Instance number |
| `%START%` | Start value from instances.json |
| `%END%` | End value from instances.json |
| `%CHECKPOINT%` | Checkpoint file path |
| `%LOG%` | Log file path |
| `%WORKSPACE%` | Workspace directory |
## Email Alerts

Configure SMTP settings in `fleet.env`:

```bash
FLEET_ALERT_EMAIL="alerts@example.com"
FLEET_SMTP_HOST="smtp.gmail.com"
FLEET_SMTP_PORT=587
FLEET_SMTP_USER="sender@gmail.com"
FLEET_SMTP_CREDENTIALS="/path/to/credentials.env"
```

Create the credentials file (keep it out of version control):

```bash
# /path/to/credentials.env
SMTP_PASS="your-app-password"
```

## S3 Checkpointing

Configure an S3 bucket for checkpoints:

```bash
JOB_S3_BUCKET="my-checkpoints"
JOB_CHECKPOINT_PREFIX="checkpoint_"
```

Your job should:

- Read the checkpoint from the `%CHECKPOINT%` file on startup
- Periodically save progress to S3:
  `aws s3 cp checkpoint.txt s3://$JOB_S3_BUCKET/checkpoint_$NUM.txt`
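A job that cooperates with this scheme might follow the sketch below. This is an assumption about job structure, not part of the toolkit: the loop bounds, checkpoint format (a single counter), and upload cadence are all illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical checkpoint-aware job loop. CHECKPOINT, START, END, NUM,
# and JOB_S3_BUCKET would normally be injected via the placeholder mechanism.
CHECKPOINT="${CHECKPOINT:-checkpoint.txt}"
START="${START:-1000}"
END="${END:-2000}"

# Resume from the last saved position, or begin at START.
if [ -f "$CHECKPOINT" ]; then
  current=$(cat "$CHECKPOINT")
else
  current=$START
fi

while [ "$current" -le "$END" ]; do
  # ... do one unit of work for $current ...
  current=$((current + 1))

  # Persist progress locally after every unit, and to S3 every 100 units.
  echo "$current" > "$CHECKPOINT"
  if [ $((current % 100)) -eq 0 ] && [ -n "${JOB_S3_BUCKET:-}" ]; then
    aws s3 cp "$CHECKPOINT" "s3://$JOB_S3_BUCKET/checkpoint_${NUM:-1}.txt"
  fi
done
```

After a spot interruption, `recover-job.sh` can restart the job and the loop resumes from the last value written to the checkpoint file.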
## Security

- Never commit `fleet.env` with real credentials
- Add to `.gitignore`:

  ```
  fleet.env
  .state/
  configs/instances.json
  *-credentials.env
  ```

- Use SMTP credential files instead of embedding passwords
- Review the IAM permissions needed: `ec2:*` for instance management; `s3:GetObject` and `s3:PutObject` for checkpointing
## Troubleshooting

**Spot capacity unavailable.** All AZs are out of capacity for your instance type. Options:

- Wait and retry
- Try a different instance type/profile
- Use on-demand instances instead

**Cannot SSH into an instance:**

- Check that the security group allows port 22
- Verify the key pair name matches
- The instance may still be initializing (wait 1-2 minutes)

**Job not detected as running:**

- Check that `JOB_PROCESS_PATTERN` matches your process
- Verify with: `ssh ubuntu@IP 'pgrep -f "pattern"'`
## License

MIT