[POC] Distributed scheduler #2002

DiegoTavares · 2025-09-27T01:19:05Z

A new scheduler meant to replace a portion of Cuebot's functionality.

- Implement FrameRange and FrameSet structs to parse and represent complex frame range syntaxes including stepped, inverse stepped, negative steps, and interleaved ranges
- Support chunking FrameSets into compact sub-ranges for dispatching
- Integrate FrameSet chunking in RqdDispatcher for precise frame chunking
- Improve dispatch error handling with distinct error types
- Update host DAO and models to include allocation info for resource checks
- Add .gitignore entry for /sandbox/kafka*
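As a rough illustration of the stepped-range parsing and chunking described above, here is a minimal sketch. `SteppedRange` and `chunk_specs` are hypothetical names, not the PR's `FrameRange`/`FrameSet` API, and only the `start-end` and `start-endxSTEP` forms with a positive step and non-negative frame numbers are handled; inverse, negative, and interleaved ranges are left out.

```rust
/// Hypothetical, simplified stand-in for a frame range. Only `start-end` and
/// `start-endxSTEP` with a positive step are supported in this sketch.
#[derive(Debug, PartialEq)]
pub struct SteppedRange {
    pub start: i32,
    pub end: i32,
    pub step: i32,
}

impl SteppedRange {
    pub fn parse(spec: &str) -> Result<Self, String> {
        // Split an optional "xSTEP" suffix off the range body.
        let (body, step) = match spec.split_once('x') {
            Some((body, step)) => (body, step.parse::<i32>().map_err(|e| e.to_string())?),
            None => (spec, 1),
        };
        if step <= 0 {
            return Err(format!("unsupported step: {}", step));
        }
        // A single frame such as "5" is treated as the range 5-5.
        let (start, end) = match body.split_once('-') {
            Some((start, end)) => (start, end),
            None => (body, body),
        };
        let start = start.parse::<i32>().map_err(|e| e.to_string())?;
        let end = end.parse::<i32>().map_err(|e| e.to_string())?;
        if end < start {
            return Err(format!("end {} is before start {}", end, start));
        }
        Ok(Self { start, end, step })
    }

    /// Expand the range into the individual frame numbers it covers.
    pub fn frames(&self) -> Vec<i32> {
        (self.start..=self.end).step_by(self.step as usize).collect()
    }
}

/// Split the frames into chunks of at most `chunk_size` frames and render
/// each chunk back into a compact spec string.
pub fn chunk_specs(range: &SteppedRange, chunk_size: usize) -> Vec<String> {
    let frames = range.frames();
    frames
        .chunks(chunk_size.max(1)) // guard against a zero chunk size
        .map(|chunk| {
            let first = chunk[0];
            let last = chunk[chunk.len() - 1];
            if chunk.len() == 1 {
                first.to_string()
            } else if range.step == 1 {
                format!("{}-{}", first, last)
            } else {
                format!("{}-{}x{}", first, last, range.step)
            }
        })
        .collect()
}
```

With this sketch, `1-10x2` in chunks of two frames yields the sub-ranges `1-3x2`, `5-7x2`, and `9`, each of which could be handed to a dispatcher as one unit of work.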

Signed-off-by: Diego Tavares <[email protected]>

The producer module produces an event on Kafka for each pending job. The consumer modules consume these events and book jobs on hosts, still relying on the database.

This version still has an issue when running multiple tests at the same time: the tests share a database instance, and they rely on it existing to work.

Optimized async + pgpool interaction, but still far from perfect.

Last commit before giving up on dashmap

There is a protection against processing multiple bookings on a single host at the same time on HostDao that uses a database lock. This protection is intended for multiple instances of the scheduler running at the same time. However, this logic was also being triggered by a single instance, which indicated there was a race condition in place. The race condition happens because hosts can belong to multiple groups at the same time.

Metrics being tracked:

**From `entrypoint.rs`:**
- `scheduler_jobs_queried_total` - Counter tracking total jobs queried from database
- `scheduler_jobs_processed_total` - Counter tracking total jobs processed

**From `matcher.rs`:**
- `scheduler_no_candidate_iterations_total` - Counter for NoCandidateAvailable occurrences
- `scheduler_candidates_per_layer` - Histogram tracking candidates needed to fully consume a layer (buckets: 1, 5, 10, 20, 50, 100)

**From `dispatcher/actor.rs`:**
- `scheduler_frames_dispatched_total` - Counter of successfully dispatched frames
- `scheduler_time_to_book_seconds` - Histogram measuring time from frame.updated_at until dispatch (buckets: 0.1, 0.5, 1, 5, 10, 30, 60, 120, 300 seconds)
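To make the bucket lists above concrete, the following is a minimal sketch of how a cumulative (Prometheus-style) histogram would count observations against the `scheduler_time_to_book_seconds` bounds. It uses only the standard library; the `Histogram` type here is hypothetical and is not the metrics library used by the PR, which is not stated on this page.

```rust
/// Bucket upper bounds (seconds) copied from the description of
/// `scheduler_time_to_book_seconds`.
pub const TIME_TO_BOOK_BUCKETS: [f64; 9] = [0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0];

/// Hypothetical cumulative histogram: each bucket counts every observation
/// that is less than or equal to its upper bound.
pub struct Histogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
    total: u64,
}

impl Histogram {
    pub fn new(bounds: &[f64]) -> Self {
        Self {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len()],
            total: 0,
        }
    }

    /// Record one observation, incrementing every bucket whose bound covers it.
    pub fn observe(&mut self, value: f64) {
        self.total += 1;
        for (i, bound) in self.bounds.iter().enumerate() {
            if value <= *bound {
                self.counts[i] += 1;
            }
        }
    }

    /// Cumulative per-bucket counts, in the same order as the bounds.
    pub fn cumulative(&self) -> &[u64] {
        &self.counts
    }

    /// Total number of observations, including those above the last bound.
    pub fn count(&self) -> u64 {
        self.total
    }
}
```

An observation larger than the last bound (for example a frame that waited 400 seconds) only shows up in the total count, which is why the largest bucket should sit above the booking delays one expects to see in practice.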

Adjust cache timeout values and enhance gRPC connection configuration with:
- Reduced idle and live times for connection cache
- Added connection, request, and keep-alive timeouts
- Configured keep-alive ping settings
- More robust endpoint creation with error handling
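The idle and live times mentioned above can be pictured with a small standard-library sketch. `CachedConnection` is a hypothetical type, not the PR's connection cache; it only shows how an entry could be expired either by total age (time to live) or by time since last use (time to idle).

```rust
use std::time::{Duration, Instant};

/// Hypothetical cache entry wrapping some connection handle `T`.
pub struct CachedConnection<T> {
    conn: T,
    created_at: Instant,
    last_used_at: Instant,
}

impl<T> CachedConnection<T> {
    pub fn new(conn: T, now: Instant) -> Self {
        Self {
            conn,
            created_at: now,
            last_used_at: now,
        }
    }

    /// Mark the entry as used, which resets its idle timer only.
    pub fn touch(&mut self, now: Instant) {
        self.last_used_at = now;
    }

    /// An entry expires when it is too old or has been idle for too long.
    pub fn is_expired(&self, now: Instant, time_to_live: Duration, time_to_idle: Duration) -> bool {
        let age = now.saturating_duration_since(self.created_at);
        let idle = now.saturating_duration_since(self.last_used_at);
        age >= time_to_live || idle >= time_to_idle
    }

    pub fn get(&self) -> &T {
        &self.conn
    }
}
```

Lower values for both durations mean that a stale channel to an unreachable host is dropped sooner, at the cost of reconnecting more often.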

This feature is essential for migrating booking to the new external scheduler.

Add an OS compatibility check to the validate_match method to ensure matched hosts satisfy the operating system requirement.
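A check of this kind might look like the sketch below. The function name, the `Option` inputs, and the rules for missing values are all assumptions made for illustration; the actual `validate_match` signature and semantics are not shown on this page.

```rust
/// Hypothetical OS check: returns true when the host may run the work.
pub fn os_is_compatible(required_os: Option<&str>, host_os: Option<&str>) -> bool {
    match (required_os, host_os) {
        // Assumption: no requirement means any host is acceptable.
        (None, _) => true,
        // Assumption: a host that reports no OS cannot satisfy a requirement.
        (Some(_), None) => false,
        // Assumption: names are compared case-insensitively after trimming.
        (Some(required), Some(actual)) => required.trim().eq_ignore_ascii_case(actual.trim()),
    }
}
```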

…tomicity

Use a central host store to prevent a split brain condition when a host belongs to multiple clusters at the same time.
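The idea of a central store can be sketched as follows, assuming each cluster only keeps a list of host ids while a single shared structure decides who may book a host. `HostStore` here is a hypothetical simplification built on `std::sync::Mutex`; it is not the PR's implementation, which the commit list suggests relies on scc and atomic operations.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical central store tracking which hosts are currently checked out.
pub struct HostStore {
    checked_out: Mutex<HashSet<String>>,
}

impl HostStore {
    pub fn new() -> Self {
        Self {
            checked_out: Mutex::new(HashSet::new()),
        }
    }

    /// Atomically claim a host. Returns false if it is already claimed,
    /// no matter which cluster the earlier claim came through.
    pub fn try_checkout(&self, host_id: &str) -> bool {
        self.checked_out.lock().unwrap().insert(host_id.to_string())
    }

    /// Release a host so it can be booked again.
    pub fn checkin(&self, host_id: &str) {
        self.checked_out.lock().unwrap().remove(host_id);
    }
}

/// Claim the first available host among the ids a cluster knows about.
pub fn checkout_from_cluster(store: &HostStore, cluster_hosts: &[&str]) -> Option<String> {
    for id in cluster_hosts {
        if store.try_checkout(id) {
            return Some((*id).to_string());
        }
    }
    None
}
```

Because every cluster goes through the same store, a host that appears in two clusters can only be claimed once until it is checked back in.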

Besides that, use host_stats for up-to-date memory information when updating the host cache.

To simplify testing, these changes are being migrated to a new PR

Entries were migrated to a new PR isolating the feature they were related to

The new option is defined as: ```yaml ```

DiegoTavares added 24 commits July 10, 2025 10:33

Setup new module for a job queue service

93c8317

Initial version of the distributed job-scheduler

425a393

[draft] dispatcher

1068970

Add job_resource cores limits to host_dao query

71aa58b

Merge branch 'master' into distributed_scheduler

378dde3

Signed-off-by: Diego Tavares <[email protected]>

Implement scheduler using kafka

c0a6193

The producer module produces an event on Kafka for each pending job. The consumer modules consume these events and book jobs on hosts, still relying on the database.

Compiles

afb000c

Make all memory fields bytesize

75ac963

Fix database memory values from bytes to kb

1596d02

Implement cluster logic using facility+show+tag

dd3d41a

Fix layer host candidate loop

40d8a2b

Remove dead files and old TODOs

b147b7f

Add integration tests

94d7a6c

This version still has an issue when running multiple tests at the same time: the tests share a database instance, and they rely on it existing to work.

Rename and refactor integration_tests to smoke_tests

d04eaa9

WIP: Add scheduler stress tests

804f2ae

Minor fixes

78ecb0e

First working stress tests

623fcc3

Update job fetcher to use fetch_all and stream processing

fd6305e

Optimized async + pgpool interaction, but still far from perfect.

Refactor modules

d959d16

Batch layer and frame queries

1b7a9ad

Fixed several dashmap related deadlocks

2b97f12

Last commit before giving up on dashmap

Convert host_cache to scc

3216258

Wrap HostCache in an Actor System using actix

2e9f651

DiegoTavares mentioned this pull request Sep 27, 2025

[POC] Distributed scheduler #1809

Closed

DiegoTavares added 5 commits September 26, 2025 18:27

Remove unnecessary debug statements

55f5e41

Migrate Dispatcher interface into an Actor

733f366

Migrate Dispatcher interface into an Actor

2dee076

Clean up warnings

2af6c54

DiegoTavares added 30 commits November 7, 2025 11:46

Disable stress tests on default testset

f6f2008

Merge branch 'master' into distributed_scheduler_2

bc962d8

Add optional host locking in dispatcher commands

aec329a

Refactor DatabaseConfig to use explicit connection params

401cf64

Update Scheduler Configuration and Config Model

550c7b8

Fix unit tests and warnings

886c194

Add URL encoding for database credentials

635b6ff

Add metrics for job query duration in scheduler

221e120

Add config option to turn off host booking

491feb3

This feature is essential for migrating booking to the new external scheduler.

Change log level from info to debug in matcher

3ac24dc

Add OS validation to host matching process

1816b49

Add an OS compatibility check to the validate_match method to ensure matched hosts satisfy the operating system requirement.

Ensure grpc connection cache is invalidated in any error condition

bb8a53c

Fix reference to facility ID in cluster feed loading

de1de29

[rebase] Update host cache and DAO to improve resource tracking and a…

f092f62

…tomicity

[rebase] Refactor host_cache

f75c8ba

Refactor host_cache

1ed312a

Use a central host store to prevent a split brain condition when a host belongs to multiple clusters at the same time.

Refactor HostStore with atomic operations and improved concurrency

125930f

Add debug logging for host cache and signal handling

f1c9016

Add debug log when host is considered stale in cache

ac35982

Change Id types to Uuid

178ba52

Replace host.ts_last_updated by host_stat.ts_ping

11aab71

Besides that, use host_stats for up-to-date memory information when updating the host cache.

Set PostgreSQL connection to UTC timezone

0e9bc63

Remove custom timestamp layers from tracing logs

e5d60bb

Fix case issue on facility pk on host_dao

c3581e6

Revert cuebot changes

8d35a31

To simplify testing, these changes are being migrated to a new PR

Remove unused opencue.properties entries

fafef7e

Entries were migrated to a new PR isolating the feature they were related to

Add retry count tracking for dispatched frames

00c318c

Add option to ignore a list of tags

9ffcb65

The new option is defined as: ```yaml ```
