
@DiegoTavares
Collaborator

A new scheduler meant to replace a portion of Cuebot's functionalities.

- Implement FrameRange and FrameSet structs to parse and represent complex frame range syntaxes, including stepped, inverse stepped, negative steps, and interleaved ranges
- Support chunking FrameSets into compact sub-ranges for dispatching
- Integrate FrameSet chunking in RqdDispatcher for precise frame chunking
- Improve dispatch error handling with distinct error types
- Update host DAO and models to include allocation info for resource checks
- Add .gitignore entry for /sandbox/kafka*
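
To illustrate the idea of expanding a frame range and chunking it into compact sub-ranges, here is a minimal sketch. It is not the PR's `FrameRange`/`FrameSet` implementation: the function names `expand_range`, `compact`, and `chunk_frames` are hypothetical, and it assumes a simplified syntax with only single frames, plain ranges, and positive steps (`5`, `1-10`, `1-10x3`). Inverse steps, negative steps, and interleaving are left out.

```rust
/// Expand a simplified frame range ("5", "1-10", or "1-10x3") into frames.
/// Assumption: non-negative frames and positive steps only.
fn expand_range(spec: &str) -> Result<Vec<u32>, String> {
    // Split off an optional "xN" step suffix.
    let (range, step) = match spec.split_once('x') {
        Some((range, step)) => {
            let step: usize = step
                .parse()
                .map_err(|_| format!("invalid step in '{}'", spec))?;
            (range, step)
        }
        None => (spec, 1),
    };
    if step == 0 {
        return Err(format!("step must be positive in '{}'", spec));
    }
    // A spec without '-' is a single frame.
    let (start, end) = match range.split_once('-') {
        Some((start, end)) => (start, end),
        None => (range, range),
    };
    let start: u32 = start
        .parse()
        .map_err(|_| format!("invalid start in '{}'", spec))?;
    let end: u32 = end
        .parse()
        .map_err(|_| format!("invalid end in '{}'", spec))?;
    if start > end {
        return Err(format!("start is after end in '{}'", spec));
    }
    Ok((start..=end).step_by(step).collect())
}

/// Render frames compactly: consecutive runs become "a-b", others stay single.
fn compact(frames: &[u32]) -> String {
    let mut parts = Vec::new();
    let mut i = 0;
    while i < frames.len() {
        // Extend j while the next frame continues the consecutive run.
        let mut j = i;
        while j + 1 < frames.len() && frames[j + 1] == frames[j] + 1 {
            j += 1;
        }
        if j > i {
            parts.push(format!("{}-{}", frames[i], frames[j]));
        } else {
            parts.push(frames[i].to_string());
        }
        i = j + 1;
    }
    parts.join(",")
}

/// Split frames into chunks of at most `chunk_size` and compact each chunk.
fn chunk_frames(frames: &[u32], chunk_size: usize) -> Vec<String> {
    frames.chunks(chunk_size.max(1)).map(compact).collect()
}
```

With a chunk size of 2, the frames of `1-5` would be dispatched as three chunks, the last one holding a single frame.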
The producer module produces events on Kafka for each pending job. The consumer modules consume
events and book jobs on hosts, still relying on the database.
This version still contains an issue when executing multiple tests at the same time, as tests
share a database instance and rely on it existing to work.
Optimized async + pgpool interaction, but still far from perfect.
Last commit before giving up on dashmap
There is a protection against processing multiple bookings on a single host at the same time on
HostDao that uses a database lock. This protection is intended for multiple instances of the
scheduler running at the same time. However, this logic was also being triggered by a single
instance, which indicated there was a race condition in place.

The race condition happens because hosts can belong to multiple groups at the same time.
Metrics being tracked:

**From `entrypoint.rs`:**
- `scheduler_jobs_queried_total` - Counter tracking total jobs queried from the database
- `scheduler_jobs_processed_total` - Counter tracking total jobs processed

**From `matcher.rs`:**
- `scheduler_no_candidate_iterations_total` - Counter for NoCandidateAvailable occurrences
- `scheduler_candidates_per_layer` - Histogram tracking candidates needed to fully consume a layer (buckets: 1, 5, 10, 20, 50, 100)

**From `dispatcher/actor.rs`:**
- `scheduler_frames_dispatched_total` - Counter of successfully dispatched frames
- `scheduler_time_to_book_seconds` - Histogram measuring time from frame.updated_at until dispatch (buckets: 0.1, 0.5, 1, 5, 10, 30, 60, 120, 300 seconds)
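
As a reading aid for the bucket lists above, the following sketch shows how an observation lands in cumulative buckets, in the style commonly used by Prometheus histograms. It is a standalone illustration using only the standard library, not the metrics code in this PR; the `Histogram` type and `time_to_book_histogram` helper are hypothetical names, and the assumption that the buckets are cumulative upper bounds is mine.

```rust
/// Minimal cumulative histogram: each bucket counts observations that are
/// less than or equal to its upper bound.
struct Histogram {
    bounds: Vec<f64>, // upper bounds, in ascending order
    counts: Vec<u64>, // counts[i] = number of observations <= bounds[i]
    total: u64,       // all observations, including those above the last bound
}

impl Histogram {
    fn new(bounds: &[f64]) -> Self {
        Histogram {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len()],
            total: 0,
        }
    }

    /// Record one observation in every bucket whose bound covers it.
    fn observe(&mut self, value: f64) {
        self.total += 1;
        for (bound, count) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *count += 1;
            }
        }
    }
}

/// Buckets listed for `scheduler_time_to_book_seconds`, in seconds.
fn time_to_book_histogram() -> Histogram {
    Histogram::new(&[0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0])
}
```

Under this reading, a frame booked after 45 seconds is counted in the 60, 120, and 300 second buckets, while one booked after more than 300 seconds only increases the total.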
Adjust cache timeout values and enhance gRPC connection configuration with:
- Reduced idle and live times for connection cache
- Added connection, request, and keep-alive timeouts
- Configured keep-alive ping settings
- More robust endpoint creation with error handling
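
The difference between idle time and live time in a connection cache can be sketched as below. This is an assumption about how such a policy typically behaves, not the cache used in this PR: `CachedConnection`, `CachePolicy`, and the field names are hypothetical, and no specific timeout values from the PR are implied.

```rust
use std::time::{Duration, Instant};

/// Bookkeeping for one cached connection (hypothetical type).
struct CachedConnection {
    created_at: Instant,
    last_used: Instant,
}

/// Expiry policy (hypothetical type): a connection is dropped when it has
/// been unused for `time_to_idle`, or when it has existed for `time_to_live`
/// regardless of how recently it was used.
struct CachePolicy {
    time_to_idle: Duration,
    time_to_live: Duration,
}

impl CachePolicy {
    fn is_expired(&self, conn: &CachedConnection, now: Instant) -> bool {
        let idle_for = now.duration_since(conn.last_used);
        let alive_for = now.duration_since(conn.created_at);
        idle_for >= self.time_to_idle || alive_for >= self.time_to_live
    }
}
```

Shorter idle and live times mean stale connections are rebuilt sooner, at the cost of more frequent reconnects.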
This feature is essential for migrating booking to the new external scheduler.
Add an OS compatibility check to the `validate_match` method to ensure matched hosts meet the
operating system requirement.
Use a central host store to prevent a split brain condition when a host belongs to multiple clusters
at the same time.
Besides that, use host_stats for up-to-date memory information when updating the host cache.
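
One way to picture a central host store is a single shared map that every cluster view checks hosts out of, so a host that belongs to several clusters can only be booked by one of them at a time. The sketch below is an assumption about the general technique, not the store implemented in this PR; `Host`, `HostStore`, `check_out`, and `check_in` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified host record (hypothetical fields).
#[derive(Clone, Debug, PartialEq)]
struct Host {
    id: String,
    idle_cores: u32,
    idle_memory_kb: u64,
}

/// Single source of truth shared by all cluster views.
/// Cloning the store clones the handle, not the data.
#[derive(Clone, Default)]
struct HostStore {
    hosts: Arc<Mutex<HashMap<String, Host>>>,
}

impl HostStore {
    /// Take a host out of the store if it has enough idle cores.
    /// While checked out, no other cluster view can see or book it.
    fn check_out(&self, id: &str, cores: u32) -> Option<Host> {
        let mut hosts = self.hosts.lock().unwrap();
        if hosts.get(id)?.idle_cores < cores {
            return None;
        }
        hosts.remove(id)
    }

    /// Return a host to the store with its updated resources.
    fn check_in(&self, host: Host) {
        let mut hosts = self.hosts.lock().unwrap();
        hosts.insert(host.id.clone(), host);
    }
}
```

Because both operations run under one lock, two cluster views cannot hold diverging copies of the same host, which is the split-brain condition the change aims to avoid.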
To simplify testing, these changes are being migrated to a new PR
Entries were migrated to a new PR isolating the feature they were related to
The new option is defined as:
```yaml
```