
Script for running, scheduling and restoring local clearml server instances #288


Open
wants to merge 3 commits into master

Conversation

@OldaKodym commented Jun 17, 2025

A self-installing script with a command-line interface for live backups of local ClearML server instances.
It can create and restore ClearML snapshots, backing up Elasticsearch, MongoDB, Redis, and fileserver data without shutting down the server. It can also schedule recurring backups via cron jobs.

Simply run it with --help and go from there.
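For the cron-based scheduling, the installed entry would look roughly like the sketch below. The script path, flag names, and backup target here are illustrative assumptions, not the script's actual interface; the real options come from its `--help` output.

```shell
# Hypothetical crontab entry: run a snapshot every night at 02:00
# (script name, flags, and paths are placeholders for illustration)
0 2 * * * /usr/local/bin/clearml-backup --snapshot --target /mnt/backups/clearml >> /var/log/clearml-backup.log 2>&1
```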
[screenshot of the script's --help output]

I tested it on a dummy server instance as well as on a multi-TB instance while tasks were running, and successfully restored the snapshots in both cases. Still, there may be some loose ends or edge cases left to handle. I would be glad for any feedback.

@jkhenning (Member) commented

Hi @OldaKodym,

This looks really cool! How did you test it?

@OldaKodym commented Jun 22, 2025

Hello @jkhenning,

It was a pretty manual process. I created a dummy task with a random scalar chart, a debug sample image, and a text-file artifact, and also kept an infinite task running just to check that it doesn't break things. I let those run, ran the snapshot creation, killed the server, and deleted everything in /opt/clearml. Then I reinitialized the server, copied back the backed-up config and docker-compose.yaml, started the docker network of the new instance, and ran the snapshot restoration. Scalars, debug samples, and artifacts all reappeared with no apparent issues.

Something along these lines

import tempfile
import time

import numpy as np
from clearml import Task
from PIL import Image

def main():
    # Create a new ClearML task
    task = Task.init(project_name="testing_project", task_name="testing_task")
    logger = task.get_logger()

    # Log 100 random scalars
    for i in range(100):
        logger.report_scalar(title="random_testing_scalar", series="test_series", value=np.random.rand(), iteration=i)

    # Create a random image and log it as a debug sample
    random_image = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)
    img = Image.fromarray(random_image)
    logger.report_image(title="random_testing_image", series="sample", iteration=0, image=img)

    # Create a random artifact file and upload it
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp_file:
        tmp_file.write(b"This is a random artifact file.\n")
        tmp_file_path = tmp_file.name
    task.upload_artifact(name="testing_artifact", artifact_object=tmp_file_path)

    # Close the first task before initializing a new one in the same process
    task.close()

    # Create a long-running task that logs a scalar every 10 seconds forever,
    # to verify that continuous logging survives the snapshot/restore cycle
    task = Task.init(project_name="testing_project", task_name="testing_task_long")
    logger = task.get_logger()
    i = 0
    while True:
        logger.report_scalar(title="random_testing_scalar", series="test_series", value=np.random.rand(), iteration=i)
        i += 1
        time.sleep(10)

if __name__ == "__main__":
    main()

I repeated a similar process on our live server, although I skipped the fileserver to save some time, since that part is just rsyncing files back and forth anyway. Our ES instance holds ~80 GB, and I let a testing task run again to check that shards don't get corrupted by continuous logging while the snapshot is being created. Again, after deleting and restoring, everything seemed to be in order.
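The restore flow described above can be sketched roughly as follows. The backup script's name and CLI, the snapshot path, and the backup location are illustrative assumptions (only the /opt/clearml layout and docker-compose.yaml come from the description above):

```shell
# Stop the server and wipe its data (assumes the default /opt/clearml layout)
docker compose -f /opt/clearml/docker-compose.yaml down
sudo rm -rf /opt/clearml/data

# Re-initialize the server, then copy back the backed-up config and compose file
# (the /mnt/backups/clearml paths are hypothetical)
sudo cp -r /mnt/backups/clearml/config /opt/clearml/config
sudo cp /mnt/backups/clearml/docker-compose.yaml /opt/clearml/

# Bring the new instance's docker network up, then run the snapshot restoration
# (clearml-backup and its --restore flag are placeholder names)
docker compose -f /opt/clearml/docker-compose.yaml up -d
clearml-backup --restore /mnt/backups/clearml/<snapshot-name>
```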
