[POC] Distributed scheduler #2002
Draft
DiegoTavares wants to merge 91 commits into AcademySoftwareFoundation:master from DiegoTavares:distributed_scheduler_2
+18,138
−3,859
- Implement FrameRange and FrameSet structs to parse and represent complex frame range syntaxes, including stepped, inverse stepped, negative steps, and interleaved ranges
- Support chunking FrameSets into compact sub-ranges for dispatching
- Integrate FrameSet chunking in RqdDispatcher for precise frame chunking
- Improve dispatch error handling with distinct error types
- Update host DAO and models to include allocation info for resource checks
- Add .gitignore entry for /sandbox/kafka*
Signed-off-by: Diego Tavares <[email protected]>
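To illustrate the stepped and negative-step cases mentioned in this commit, here is a minimal sketch. It assumes a `start-endxstep` syntax and covers only that form; inverse stepped and interleaved ranges are left out. The names `parse_stepped` and `chunk_frames` are hypothetical and are not the PR's `FrameRange`/`FrameSet` API, and the chunks here are plain frame lists rather than compact sub-range strings.

```rust
/// Expand a range such as "1-10x3" (positive step) or "10-1x-3" (negative step)
/// into an ordered list of frame numbers. A bare "5" yields a single frame.
/// Hypothetical helper, not the PR's implementation.
fn parse_stepped(spec: &str) -> Result<Vec<i32>, String> {
    // Split off an optional "x<step>" suffix; the step defaults to 1.
    let (range, step) = match spec.split_once('x') {
        Some((r, s)) => (r, s.parse::<i32>().map_err(|e| format!("bad step: {}", e))?),
        None => (spec, 1),
    };
    if step == 0 {
        return Err("step must not be zero".to_string());
    }
    // Locate the separator '-', skipping a possible minus sign on the start frame.
    let sep = range.char_indices().skip(1).find(|&(_, c)| c == '-').map(|(i, _)| i);
    let (start, end) = match sep {
        Some(i) => {
            let start = range[..i].parse::<i32>().map_err(|e| format!("bad start: {}", e))?;
            let end = range[i + 1..].parse::<i32>().map_err(|e| format!("bad end: {}", e))?;
            (start, end)
        }
        None => {
            let frame = range.parse::<i32>().map_err(|e| format!("bad frame: {}", e))?;
            (frame, frame)
        }
    };
    if (step > 0 && start > end) || (step < 0 && start < end) {
        return Err("step direction does not match range direction".to_string());
    }
    let mut frames = Vec::new();
    let mut frame = start;
    while (step > 0 && frame <= end) || (step < 0 && frame >= end) {
        frames.push(frame);
        frame += step;
    }
    Ok(frames)
}

/// Split frames into chunks of at most `chunk_size` frames, preserving order.
/// A chunk size of 0 is treated as 1 to avoid a panic in `chunks`.
fn chunk_frames(frames: &[i32], chunk_size: usize) -> Vec<Vec<i32>> {
    frames.chunks(chunk_size.max(1)).map(|c| c.to_vec()).collect()
}
```

With this sketch, `parse_stepped("1-10x3")` expands to frames 1, 4, 7, 10, and `chunk_frames` with a chunk size of 3 groups them as `[1, 4, 7]` and `[10]`.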
The producer module publishes an event to Kafka for each pending job. The consumer modules consume these events and book jobs on hosts, still relying on the database.
This version still has an issue when multiple tests run at the same time: the tests share a database instance and rely on it existing to work.
Optimized async + pgpool interaction, but still far from perfect.
Last commit before giving up on dashmap
There is a protection against processing multiple bookings on a single host at the same time on HostDao that uses a database lock. This protection is intended for multiple instances of the scheduler running at the same time. However, this logic was also being triggered by a single instance, which indicated there was a race condition in place. The race condition happens because hosts can belong to multiple groups at the same time.
Metrics being tracked:

**From `entrypoint.rs`:**
- `scheduler_jobs_queried_total` - Counter tracking total jobs queried from database
- `scheduler_jobs_processed_total` - Counter tracking total jobs processed

**From `matcher.rs`:**
- `scheduler_no_candidate_iterations_total` - Counter for NoCandidateAvailable occurrences
- `scheduler_candidates_per_layer` - Histogram tracking candidates needed to fully consume a layer (buckets: 1, 5, 10, 20, 50, 100)

**From `dispatcher/actor.rs`:**
- `scheduler_frames_dispatched_total` - Counter of successfully dispatched frames
- `scheduler_time_to_book_seconds` - Histogram measuring time from frame.updated_at until dispatch (buckets: 0.1, 0.5, 1, 5, 10, 30, 60, 120, 300 seconds)
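To show how the bucket boundaries listed for `scheduler_time_to_book_seconds` partition observations, here is a std-only sketch of a cumulative histogram in the Prometheus style, where each bucket counts every observation less than or equal to its upper bound and a final +Inf bucket counts everything. The `Histogram` type is hypothetical and written for illustration; the PR presumably relies on a metrics library instead.

```rust
/// Cumulative histogram: `counts[i]` holds the number of observations
/// less than or equal to `bounds[i]`; the final slot acts as the +Inf bucket.
/// Hypothetical type for illustration only.
struct Histogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
    sum: f64,
}

impl Histogram {
    /// `bounds` must be sorted in ascending order.
    fn new(bounds: &[f64]) -> Self {
        Histogram {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len() + 1],
            sum: 0.0,
        }
    }

    /// Record one observation in every bucket whose upper bound covers it.
    fn observe(&mut self, value: f64) {
        self.sum += value;
        for (i, bound) in self.bounds.iter().enumerate() {
            if value <= *bound {
                self.counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        let last = self.counts.len() - 1;
        self.counts[last] += 1;
    }
}
```

With the time-to-book buckets above, a frame booked after 7 seconds is counted in the 10, 30, 60, 120, and 300 second buckets and in +Inf, while one booked after 400 seconds lands only in +Inf.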
Adjust cache timeout values and enhance gRPC connection configuration with:
- Reduced idle and live times for connection cache
- Added connection, request, and keep-alive timeouts
- Configured keep-alive ping settings
- More robust endpoint creation with error handling
This feature is essential for migrating booking to the new external scheduler.
Add an OS compatibility check to the validate_match method to ensure matched hosts satisfy the operating system requirement
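A check of this kind might look like the sketch below. The function name `os_compatible` and its parameters are hypothetical, and the rule that a missing requirement matches any host is an assumption, not something the PR states.

```rust
/// Return true when the host's operating system satisfies the requirement.
/// Assumption: no requirement means any host is acceptable, and a host
/// with an unknown OS never satisfies an explicit requirement.
fn os_compatible(required_os: Option<&str>, host_os: Option<&str>) -> bool {
    match required_os {
        None => true,
        Some(required) => host_os == Some(required),
    }
}
```

A matcher could call this alongside its resource checks and reject the candidate host as soon as it returns `false`.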
Use a central host store to prevent a split brain condition when a host belongs to multiple clusters at the same time.
Besides that, use host_stats for up-to-date memory information when updating the host cache.
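The central host store idea can be sketched as a single locked map with check-out and check-in semantics, so a host that belongs to several clusters can only be handed to one booking at a time. All names here (`HostStore`, `check_out`, `check_in`, the `Host` fields) are hypothetical and std-only; this is not the PR's implementation.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
struct Host {
    id: String,
    idle_cores: u32,
}

#[derive(Default)]
struct Inner {
    hosts: HashMap<String, Host>,
    clusters: HashMap<String, HashSet<String>>, // cluster name -> host ids
    checked_out: HashSet<String>,               // hosts currently being booked
}

/// Single source of truth for host state, shared by every cluster.
#[derive(Default)]
struct HostStore {
    inner: Mutex<Inner>,
}

impl HostStore {
    /// Register a host under every cluster it belongs to.
    fn insert(&self, host: Host, clusters: &[&str]) {
        let mut guard = self.inner.lock().unwrap();
        for cluster in clusters {
            guard
                .clusters
                .entry(cluster.to_string())
                .or_default()
                .insert(host.id.clone());
        }
        guard.hosts.insert(host.id.clone(), host);
    }

    /// Reserve a host from `cluster` with enough idle cores. While it is
    /// checked out, no other cluster can obtain the same host.
    fn check_out(&self, cluster: &str, min_cores: u32) -> Option<Host> {
        let mut guard = self.inner.lock().unwrap();
        let inner = &mut *guard;
        let id = inner
            .clusters
            .get(cluster)?
            .iter()
            .find(|id| {
                !inner.checked_out.contains(*id)
                    && inner.hosts.get(*id).map_or(false, |h| h.idle_cores >= min_cores)
            })
            .cloned()?;
        inner.checked_out.insert(id.clone());
        inner.hosts.get(&id).cloned()
    }

    /// Return a host with its updated state and make it available again.
    fn check_in(&self, host: Host) {
        let mut guard = self.inner.lock().unwrap();
        guard.checked_out.remove(&host.id);
        guard.hosts.insert(host.id.clone(), host);
    }
}
```

Because every cluster view resolves to the same entry, two bookings cannot both see the host's stale idle resources: the second caller either gets a different host or nothing until the first one checks the host back in.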
To simplify testing, these changes are being migrated to a new PR
Entries were migrated to a new PR isolating the feature they were related to
The new option is defined as: ```yaml ```
A new scheduler meant to replace a portion of Cuebot's functionality.