SimpleFold For AWS HealthOmics

This repository includes a Docker container and Terraform infrastructure for running SimpleFold protein structure prediction on AWS HealthOmics. The container packages SimpleFold with all required dependencies for use in Nextflow workflows.

>3J5R_1|Chains A, B, C, D|Transient receptor potential cation channel subfamily V member 1|Rattus norvegicus (10116)
LYDRRSIFDAVAQSNCQELESLLPFLQRSKKRLTDSEFKDPETGKTCLLKAMLNLHNGQNDTIALLLDVARKTDSLKQFVNASYTDSYYKGQTALHIAIERRNMTLVTLLVENGADVQAAANGDFFKKTKGRPGFYFGELPLSLAACTNQLAIVKFLLQNSWQPADISARDSVGNTVLHALVEVADNTVDNTKFVTSMYNEILILGAKLHPTLKLEEITNRKGLTPLALAASSGKIGVLAYILQREIHEPECRHLSRKFTEWAYGPVHSSLYDLSCIDTCEKNSVLEVIAYSSSETPNRHDMLLVEPLNRLLQDKWDRFVKRIFYFNFFVYCLYMIIFTAAAYYRPVEGLPPYKLKNTVGDYFRVTGEILSVSGGVYFFFRGIQYFLQRRPSLKSLFVDSYSEILFFVQSLFMLVSVVLYFSQRKEYVASMVFSLAMGWTNMLYYTRGFQQMGIYAVMIEKMILRDLCRFMFVYLVFLFGFSTAVVTLIEDGKYNSLYSTCLELFKFTIGMGDLEFTENYDFKAVFIILLLAYVILTYILLLNMLIALMGETVNKIAQESKNIWKLQRAITILDTEKSFLKCMRKAXXXXXXXXXXXX

↓↓↓

Features

SimpleFold ML Model: Apple's SimpleFold protein structure prediction model
PyTorch Backend: CUDA-optimized for GPU inference
Redis Integration: CCD database for mmCIF processing
Ready for HealthOmics: Pre-configured for AWS HealthOmics workflows
Automated Setup: One-click deployment with Terraform

Quick Start

Prerequisites

Docker
Terraform
AWS CLI
Nextflow (optional, for local testing)
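
Before starting, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch; the function name `missing_tools` is illustrative and not part of this repository.

```shell
# Print the name of every listed command that is not found on PATH.
# The exit status is non-zero if at least one command is missing.
missing_tools() (
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  [ "$status" -eq 0 ]
)

# Example usage:
#   missing_tools docker terraform aws || echo "install the tools printed above"
```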

Deployment with Terraform

Use Terraform to automatically create the necessary AWS resources and deploy the HealthOmics workflow:

# Clone repository
git clone https://github.com/bioteam/aho-simplefold.git

# Navigate to terraform directory
cd aho-simplefold/terraform

# Copy the example variables file and customize
cp terraform.tfvars.example terraform.tfvars

# Edit terraform.tfvars with your specific values
vi terraform.tfvars

# Initialize Terraform
terraform init

# Plan the deployment
terraform plan

# Apply the configuration
terraform apply

The Terraform configuration will:

✅ Create S3 bucket for data storage
✅ Create ECR repository for container images
✅ Create IAM execution role with necessary permissions
✅ Set up necessary AWS resources for HealthOmics workflows
✅ Output configuration values for use in workflows
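
Later commands in this README read these values with `terraform output -raw`. As a hedged sketch, a small wrapper can fail early when an output is missing or empty. The helper name `tf_out` and the `TF_DIR` variable are assumptions for illustration; the output names in the usage comment are the ones used elsewhere in this README.

```shell
# Directory holding the Terraform configuration (this README uses ./terraform).
TF_DIR="${TF_DIR:-terraform}"

# Print one Terraform output value; fail if it cannot be read or is empty.
tf_out() (
  name=$1
  value=$(cd "$TF_DIR" && terraform output -raw "$name") || {
    echo "could not read Terraform output: $name" >&2
    return 1
  }
  [ -n "$value" ] || { echo "Terraform output is empty: $name" >&2; return 1; }
  printf '%s\n' "$value"
)

# Example usage, from the repository root:
#   for name in s3_bucket_name ecr_repository_url aws_region execution_role_arn workflow_id; do
#     printf '%s = %s\n' "$name" "$(tf_out "$name")"
#   done
```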

Build Container and Push to ECR

After Terraform creates the necessary AWS resources, build the Docker container and push it to the created ECR repository:

Using Make Commands (Recommended)

# Return to the root repository directory
cd ..

# Complete deployment: build, login, and push
make deploy-container

Granular Make Commands (Alternative)

# Build container and tag with ECR repository URL
make build

# Authenticate Docker to ECR
make login

# Upload container to ECR
make push

Raw Docker Commands (For Reference)

# Build container and tag with ECR repository URL
docker build --platform=linux/amd64 -t $(cd terraform && terraform output -raw ecr_repository_url):latest .

# Authenticate Docker to ECR
aws ecr get-login-password --region $(cd terraform && terraform output -raw aws_region) | docker login --username AWS --password-stdin $(cd terraform && terraform output -raw ecr_repository_url)

# Upload container to ECR
docker push $(cd terraform && terraform output -raw ecr_repository_url):latest
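
The login command above passes the full repository URL to `docker login`. AWS's own examples log in to the registry host instead, which is the part of the URL before the first `/`. The helpers below are a sketch of deriving the host and region from a repository URL; the function names are hypothetical, and the region parsing assumes the usual `<account>.dkr.ecr.<region>.amazonaws.com` host layout.

```shell
# Registry host: everything before the first "/" of the repository URL.
ecr_registry_host() {
  printf '%s\n' "${1%%/*}"
}

# Region: the fourth dot-separated field of the registry host,
# assuming the form <account>.dkr.ecr.<region>.amazonaws.com
ecr_region() {
  ecr_registry_host "$1" | cut -d. -f4
}

# Example usage, from the repository root:
#   url=$(cd terraform && terraform output -raw ecr_repository_url)
#   aws ecr get-login-password --region "$(ecr_region "$url")" |
#     docker login --username AWS --password-stdin "$(ecr_registry_host "$url")"
```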

Make Targets Summary

Available Makefile targets for common operations:

# View all available commands
make help

# Complete deployment pipeline
make deploy    # Build, login to ECR, and push container

# Individual steps
make build     # Build Docker container
make login     # Login to ECR
make push      # Push container to ECR
make run       # Start HealthOmics workflow run
make clean     # Remove local Docker images

AWS HealthOmics Usage

After running terraform apply, everything is configured for HealthOmics:

1. Upload Input Files

# Upload your FASTA files to the S3 input directory (run from the repository root)
aws s3 cp your_protein.fasta s3://$(cd terraform && terraform output -raw s3_bucket_name)/input/

Note: This Nextflow workflow processes all FASTA files in the input directory. A single FASTA file may contain multiple sequences; the records are split into separate files and processed automatically.
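
To preview locally how a multi-record FASTA file breaks down into individual records, the sketch below splits a file with awk. It is not the workflow's own splitting code, and the function name and the `record_NNN.fasta` output names are assumptions for illustration.

```shell
# Split a multi-record FASTA file into one file per record.
# Output files are named record_001.fasta, record_002.fasta, ...
split_fasta() (
  input=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  awk -v dir="$outdir" '
    /^>/ {                       # a header line starts a new record
      if (out != "") close(out)
      n++
      out = sprintf("%s/record_%03d.fasta", dir, n)
    }
    out != "" { print > out }    # lines before the first header are skipped
  ' "$input"
)

# Example usage:
#   split_fasta your_protein.fasta split_records
```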

2. Run Workflow

Using Make Command (Recommended)

# Start a workflow run (uses values from Terraform outputs)
make run

Manual AWS CLI Command (Alternative)

# Start a workflow run (use values from Terraform outputs)
aws omics start-run \
    --role-arn "$(cd terraform && terraform output -raw execution_role_arn)" \
    --output-uri "s3://$(cd terraform && terraform output -raw s3_bucket_name)/out/" \
    --parameters "{\"input_dir\": \"s3://$(cd terraform && terraform output -raw s3_bucket_name)/input/\"}" \
    --workflow-id "$(cd terraform && terraform output -raw workflow_id)" \
    --name "simplefold-run-$(date +%Y%m%d-%H%M%S)"
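
The escaped quotes in the `--parameters` argument above are easy to get wrong. One option, sketched here, is to build the JSON and the run name in small helper functions. The function names are hypothetical. The JSON is built with plain `printf`, which is only safe because S3 bucket names cannot contain characters that need JSON escaping.

```shell
# Build the JSON document passed to `aws omics start-run --parameters`.
run_params_json() {
  printf '{"input_dir": "s3://%s/input/"}\n' "$1"
}

# Build a timestamped run name in the same format as the command above.
run_name() {
  printf 'simplefold-run-%s\n' "$(date +%Y%m%d-%H%M%S)"
}

# Example usage, from the repository root:
#   bucket=$(cd terraform && terraform output -raw s3_bucket_name)
#   aws omics start-run \
#       --role-arn "$(cd terraform && terraform output -raw execution_role_arn)" \
#       --output-uri "s3://$bucket/out/" \
#       --parameters "$(run_params_json "$bucket")" \
#       --workflow-id "$(cd terraform && terraform output -raw workflow_id)" \
#       --name "$(run_name)"
```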

Resource Configuration Reference

Default resource allocation per job:

CPU: 4 cores
Memory: 32 GB
GPU: 1x NVIDIA L40S (HealthOmics)
Storage: Auto-provisioned (HealthOmics)

See workflow/main.nf for task-specific resource settings.

Configuration

Nextflow Parameters

The workflow supports the following parameters (configured in workflow/nextflow.config.tpl):

params {
    // SimpleFold model settings
    simplefold_model = 'simplefold_100M'  // Model variant: simplefold_100M/360M/700M/1.1B/1.6B/3B
    num_steps = 500                       // Sampling steps
    tau = 0.01                            // Temperature parameter
    nsample_per_protein = 1               // Samples per protein
    enable_plddt = true                   // Enable pLDDT confidence scores
    
    // input path in S3 bucket
    input_dir = 's3://your-bucket/input/'
}

Terraform will automatically create a nextflow.config file from the template. If this configuration is modified, you must re-run terraform apply to update the workflow definition in S3.
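
A typo in `simplefold_model` would only surface once a run has started, so a quick check before re-running terraform apply may save time. The sketch below assumes that the abbreviated variants in the comment above expand to full names with a `simplefold_` prefix (for example `simplefold_360M`); verify the names against the SimpleFold release you are using. The function name is hypothetical.

```shell
# Return success if the argument is one of the assumed SimpleFold model names.
is_valid_simplefold_model() {
  case $1 in
    simplefold_100M|simplefold_360M|simplefold_700M|simplefold_1.1B|simplefold_1.6B|simplefold_3B)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example usage:
#   is_valid_simplefold_model simplefold_700M || echo "unknown model variant"
```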

Manually Create HealthOmics Workflow (Optional Reference)

You may also manually create the HealthOmics workflow definition via AWS Console or CLI, rather than using Terraform. To do so, first package and upload the workflow definition:

# Edit parameters as needed
cp workflow/nextflow.config.tpl workflow/nextflow.config
vi workflow/nextflow.config 

# Package and upload workflow definition
zip -r simplefold-workflow.zip workflow/
aws s3 cp simplefold-workflow.zip s3://$(cd terraform && terraform output -raw s3_bucket_name)/workflows/

# Create workflow definition in AWS Console or via CLI
WORKFLOW_ID=$(aws omics create-workflow \
    --name "SimpleFold-Workflow" \
    --engine "NEXTFLOW" \
    --no-cli-pager \
    --query 'id' --output text \
    --definition-uri s3://$(cd terraform && terraform output -raw s3_bucket_name)/workflows/simplefold-workflow.zip)
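
A newly created workflow is not usable immediately; HealthOmics reports a status that moves from CREATING to ACTIVE (or FAILED). The polling loop below is a sketch that uses `aws omics get-workflow` with the `WORKFLOW_ID` captured above. The function name, retry count, and delay are illustrative assumptions.

```shell
# Poll a HealthOmics workflow until it is ACTIVE.
# Arguments: workflow id, max attempts (default 30), seconds between attempts (default 10).
wait_for_workflow() (
  id=$1
  tries=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(aws omics get-workflow --id "$id" --query 'status' --output text) || return 1
    case $status in
      ACTIVE) return 0 ;;
      FAILED|DELETED) echo "workflow $id is $status" >&2; return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for workflow $id" >&2
  return 1
)

# Example usage:
#   wait_for_workflow "$WORKFLOW_ID" && echo "workflow is ready"
```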

Local Testing With Docker

Run Local Protein Structure Prediction With Docker

docker run --gpus all --platform=linux/amd64 --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  simplefold \
  simplefold --simplefold_model simplefold_100M \
    --num_steps 500 --tau 0.01 \
    --nsample_per_protein 1 --plddt \
    --fasta_path /app/input/3d06.fasta \
    --backend torch \
    --output_dir /app/output
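
When a bind-mount source directory does not exist, Docker typically creates it itself, and the result is usually owned by root. Creating `input/` and `output/` beforehand avoids that, and checking for FASTA files catches an empty input directory before the container starts. This is a minimal sketch; the function name is hypothetical.

```shell
# Create input/ and output/ under a base directory and list the FASTA inputs.
# The exit status is non-zero if no *.fasta file is present.
prepare_local_io() (
  base=${1:-.}
  mkdir -p "$base/input" "$base/output" || return 1
  found=1
  for f in "$base"/input/*.fasta; do
    [ -e "$f" ] || continue   # the glob stays literal when nothing matches
    echo "$f"
    found=0
  done
  return "$found"
)

# Example usage, from the repository root:
#   prepare_local_io . || echo "no FASTA files found in ./input"
```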

Output Files

SimpleFold generates several output files:

.cif files: Protein structure in mmCIF format
.json files: Prediction metadata and confidence scores
.npz files: Raw embeddings and intermediate representations
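
To check quickly that a run produced each kind of file, the sketch below counts outputs by extension. The directory layout inside the output folder is not documented here, so the search is recursive; the function name is an assumption for illustration.

```shell
# Count .cif, .json, and .npz files anywhere under a directory (default: output).
count_outputs() (
  dir=${1:-output}
  for ext in cif json npz; do
    n=$(find "$dir" -type f -name "*.$ext" | wc -l | tr -d ' ')
    printf '%s: %s\n' "$ext" "$n"
  done
)

# Example usage:
#   count_outputs output
```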

Troubleshooting

Common Issues

Docker Build Fails
- Ensure Docker is running and you have sufficient disk space
- Check CUDA compatibility if using GPU features

AWS Permissions
- Verify AWS CLI is configured: aws sts get-caller-identity
- Ensure your user has permissions to create IAM roles, S3 buckets, and ECR repositories

Container Registry Issues
- Check ECR login: aws ecr get-login-password --region your-region
- Verify image was pushed: aws ecr list-images --repository-name simplefold

HealthOmics Workflow Fails
- Check CloudWatch logs for detailed error messages
- Verify S3 bucket permissions and file paths
- Ensure execution role has all necessary permissions
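
For a failed HealthOmics run, the run's status and status message are often the quickest first clue before digging into CloudWatch. The sketch below wraps `aws omics get-run`; the function name is hypothetical, and the run ID is whatever `start-run` returned.

```shell
# Print the status and status message of a HealthOmics run, tab-separated.
run_status() (
  run_id=$1
  aws omics get-run --id "$run_id" \
    --query '[status, statusMessage]' --output text
)

# Example usage:
#   run_status 1234567
```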

Development

Customization

Model Parameters: Edit workflow/nextflow.config.tpl
Resource Requirements: Modify process settings in config
Container Customization: Edit Dockerfile

Container Details

Base Image: nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04
Platform: linux/amd64 (required for CUDA compatibility)
SimpleFold Version: Commit 1aa097f45aad9b961e1acd91728504d9c035f4f5
Dependencies: ESM models, PyTorch with CUDA support
Redis: Integrated with CCD database for mmCIF processing

Volume Mounts

/app/input: Input FASTA files
/app/output: Structure prediction outputs
/app/data: Optional data directory for mmCIF processing

Health Check

The container includes a Redis health check to ensure proper startup:

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD redis-cli -p 7777 ping || exit 1
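
Scripts that need Redis to be up before they continue can use a similar probe. The helper below is a generic retry loop, sketched as an assumption rather than taken from this repository. Unlike Docker's `--retries`, which counts consecutive failures before marking a container unhealthy, it stops at the first success.

```shell
# Run a command up to <retries> times, sleeping <interval> seconds between attempts.
# Succeeds as soon as the command succeeds; fails once the attempts are used up.
retry() (
  retries=$1
  interval=$2
  shift 2
  i=1
  while :; do
    "$@" >/dev/null 2>&1 && return 0
    [ "$i" -ge "$retries" ] && return 1
    sleep "$interval"
    i=$((i + 1))
  done
)

# Example usage, inside the container:
#   retry 3 30 redis-cli -p 7777 ping || echo "Redis did not respond"
```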
