
Script for running, scheduling and restoring local clearml server instances #288


Open
wants to merge 3 commits into master

Conversation

@OldaKodym commented Jun 17, 2025

A self-installing script with a command-line interface for live backups of local ClearML server instances.
It can create and restore ClearML snapshots, backing up Elasticsearch, MongoDB, Redis, and fileserver data without shutting down the server. It can also schedule recurring backups via cron jobs.

Simply run it with --help and go from there.
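For the cron-based scheduling, the installed entry would look roughly like the sketch below. The script path, flag names, and backup target here are illustrative assumptions, not the script's actual interface; the real options come from its `--help` output.

```shell
# Hypothetical crontab entry: run a snapshot every night at 02:00
# (script name, flags, and paths are placeholders for illustration)
0 2 * * * /usr/local/bin/clearml-backup --snapshot --target /mnt/backups/clearml >> /var/log/clearml-backup.log 2>&1
```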
[screenshot of the script's --help output]

I tested it on a dummy server instance as well as on a multi-TB instance while tasks were running, and successfully restored the snapshots in both cases. Still, there may be some loose ends or edge cases left to handle. I would be glad for any feedback.

@jkhenning (Member) commented

Hi @OldaKodym,

This looks really cool! How did you test it?

@OldaKodym commented Jun 22, 2025

Hello @jkhenning,

It was a pretty manual process. I created a dummy task with a random scalar chart, a debug sample image, and a text-file artifact, and also kept an infinite task running just to check that it doesn't break things. I let those run, ran the snapshot creation, killed the server, and deleted everything in /opt/clearml. Then I reinitialized the server, copied back the backed-up config and docker-compose.yaml, started the docker network of the new instance, and ran the snapshot restoration. Scalars, debug samples, and artifacts all reappeared with no apparent issues.

Something along these lines

import tempfile
import time

import numpy as np
from clearml import Task
from PIL import Image

def main():
    # Create a new ClearML task
    task = Task.init(project_name="testing_project", task_name="testing_task")
    logger = task.get_logger()

    # Log 100 random scalars
    for i in range(100):
        logger.report_scalar(title="random_testing_scalar", series="test_series", value=np.random.rand(), iteration=i)

    # Create a random image and log it as a debug sample
    random_image = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)
    img = Image.fromarray(random_image)
    logger.report_image(title="random_testing_image", series="sample", iteration=0, image=img)

    # Create a random artifact file and upload it
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp_file:
        tmp_file.write(b"This is a random artifact file.\n")
        tmp_file_path = tmp_file.name
    task.upload_artifact(name="testing_artifact", artifact_object=tmp_file_path)

    # Close the first task before initializing a new one in the same process
    task.close()

    # Create a long-running task that logs a scalar every 10 seconds forever,
    # to verify that continuous logging survives the snapshot/restore cycle
    task = Task.init(project_name="testing_project", task_name="testing_task_long")
    logger = task.get_logger()
    i = 0
    while True:
        logger.report_scalar(title="random_testing_scalar", series="test_series", value=np.random.rand(), iteration=i)
        i += 1
        time.sleep(10)

if __name__ == "__main__":
    main()

I repeated a similar process on our live server, although I skipped the fileserver to save some time, since that part is just rsyncing files back and forth anyway. Our ES instance holds ~80 GB, and I let a testing task run again to check that shards don't get corrupted by continuous logging while the snapshot is being created. Again, after deleting and restoring, everything seemed to be in order.
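The restore flow described above can be sketched roughly as follows. The backup script's name and CLI, the snapshot path, and the backup location are illustrative assumptions (only the /opt/clearml layout and docker-compose.yaml come from the description above):

```shell
# Stop the server and wipe its data (assumes the default /opt/clearml layout)
docker compose -f /opt/clearml/docker-compose.yaml down
sudo rm -rf /opt/clearml/data

# Re-initialize the server, then copy back the backed-up config and compose file
# (the /mnt/backups/clearml paths are hypothetical)
sudo cp -r /mnt/backups/clearml/config /opt/clearml/config
sudo cp /mnt/backups/clearml/docker-compose.yaml /opt/clearml/

# Bring the new instance's docker network up, then run the snapshot restoration
# (clearml-backup and its --restore flag are placeholder names)
docker compose -f /opt/clearml/docker-compose.yaml up -d
clearml-backup --restore /mnt/backups/clearml/<snapshot-name>
```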
