[cuebot/pycue/proto/sandbox/docs] Add full event-driven monitoring stack, enhance metrics, dashboards, and documentation #2086
Conversation
Implement event-driven monitoring infrastructure for collecting and storing render farm statistics with historical data access capabilities.

Key components:
- Define monitoring.proto with job/layer/frame/host lifecycle events
- Add KafkaEventPublisher for async event publishing to Kafka topics
- Create Elasticsearch client and consumer for historical data storage
- Hook event publishing into FrameCompleteHandler and HostReportHandler
- Extend PrometheusMetricsCollector with frame/job completion metrics
- Add MonitoringInterface gRPC service for historical data queries
- Create pycue monitoring wrapper with historical data API methods
- Add applicationContext-monitoring.xml Spring configuration for monitoring beans
- Update TestAppConfig to include monitoring context for tests

Configuration:
- All features disabled by default (opt-in via properties)
- Kafka: monitoring.kafka.enabled, monitoring.kafka.bootstrap.servers
- Elasticsearch: monitoring.elasticsearch.enabled, monitoring.elasticsearch.host

This enables:
- Extended memory prediction beyond the 3-day pycue API limit
- Real-time farm monitoring via Prometheus/Grafana dashboards
- Historical job/frame/layer analytics via Elasticsearch
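The property names above come straight from the commit; a minimal opencue.properties-style sketch of the opt-in flags might look like the following. The host/port values are placeholders, not taken from the PR.

```properties
# Opt-in monitoring configuration (everything disabled by default).
# Hostnames and ports below are illustrative placeholders.
monitoring.kafka.enabled=true
monitoring.kafka.bootstrap.servers=kafka:9092
monitoring.elasticsearch.enabled=true
monitoring.elasticsearch.host=elasticsearch:9200
```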
…earch client

- Fix Elasticsearch client library conflict by adding an explicit elasticsearch-rest-client 8.8.0 dependency to resolve the version mismatch with the Spring Boot managed 6.8.4
- Add docker-compose.monitoring-full.yml with the complete monitoring stack: Zookeeper, Kafka, Elasticsearch, Kibana, Prometheus, Grafana
- Add Prometheus configuration for scraping the cuebot metrics endpoint
- Add Grafana dashboard for OpenCue monitoring with panels for: frame completion rates, job completion by show, frame runtime distribution, memory usage, event queue metrics, and host reports
- concepts/render-farm-monitoring.md: Architecture overview, event types, Kafka topics, Elasticsearch storage, and Prometheus metrics concepts
- quick-starts/quick-start-monitoring.md: Step-by-step guide to deploy the monitoring stack with Docker Compose
- getting-started/deploying-monitoring.md: Production deployment guide for Kafka, Elasticsearch, Prometheus, and Grafana
- user-guides/render-farm-monitoring-guide.md: Configure Grafana dashboards, alerts, Kafka consumers, and Elasticsearch queries
- developer-guide/monitoring-development.md: Extend the monitoring system with custom events, metrics, and Elasticsearch indexing
- reference/monitoring-reference.md: Complete API reference for Kafka topics, event schemas, Prometheus metrics, and configuration options
- tutorials/monitoring-tutorial.md: Hands-on tutorial for building custom dashboards and processing monitoring events
- sandbox/monitor_events.py: Example Kafka consumer for monitoring events
Update nav_order across all documentation files for consistent navigation:
1) Run script docs/extract_nav_orders.py
2) Manually fix nav_order_index.txt
3) Run script docs/update_nav_order.py
OpenCue monitoring stack documentation
New script to load test jobs using PyOutline. The load test script submits jobs to OpenCue for monitoring testing. Usage:
python load_test_jobs.py # Uses defaults: 1000 jobs, batch size 50
python load_test_jobs.py -n 100 # Submit 100 jobs
python load_test_jobs.py -n 500 -b 25 # Submit 500 jobs in batches of 25
python load_test_jobs.py --num-jobs 100 --batch-size 10
$ python sandbox/load_test_jobs.py
Submitting 1000 jobs to OpenCue...
------------------------------------------------------------
Submitted 10/1000 jobs (10 successful, 0 failed)
Submitted 20/1000 jobs (20 successful, 0 failed)
Submitted 30/1000 jobs (30 successful, 0 failed)
Submitted 40/1000 jobs (40 successful, 0 failed)
Submitted 50/1000 jobs (50 successful, 0 failed)
Batch complete, pausing briefly...
Submitted 60/1000 jobs (60 successful, 0 failed)
Submitted 70/1000 jobs (70 successful, 0 failed)
Submitted 80/1000 jobs (80 successful, 0 failed)
Submitted 90/1000 jobs (90 successful, 0 failed)
Submitted 100/1000 jobs (100 successful, 0 failed)
Batch complete, pausing briefly...
...
Submitted 910/1000 jobs (910 successful, 0 failed)
Submitted 920/1000 jobs (920 successful, 0 failed)
Submitted 930/1000 jobs (930 successful, 0 failed)
Submitted 940/1000 jobs (940 successful, 0 failed)
Submitted 950/1000 jobs (950 successful, 0 failed)
Batch complete, pausing briefly...
Submitted 960/1000 jobs (960 successful, 0 failed)
Submitted 970/1000 jobs (970 successful, 0 failed)
Submitted 980/1000 jobs (980 successful, 0 failed)
Submitted 990/1000 jobs (990 successful, 0 failed)
Submitted 1000/1000 jobs (1000 successful, 0 failed)
Batch complete, pausing briefly...
------------------------------------------------------------
Load test complete!
Submitted: 1000
Failed: 0
Total frames: ~2000
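The output above implies a simple pacing scheme: pause after every batch_size submissions. A minimal sketch of that logic, with illustrative names (this is not the actual script):

```python
def submit_with_batching(num_jobs=1000, batch_size=50):
    """Sketch of the load-test pacing implied by the output above.

    Returns the 1-based job indices after which the script would pause.
    The real script submits a job at each step; here that call is elided.
    """
    pauses = []
    for i in range(1, num_jobs + 1):
        # submit_job(i) would go here in the real script
        if i % batch_size == 0:
            pauses.append(i)  # "Batch complete, pausing briefly..."
    return pauses

print(submit_with_batching(100, 25))  # pauses after jobs 25, 50, 75, 100
```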
…rd, and improve docs

- Implement cue_elasticsearch_index_queue_size Prometheus metric using @Autowired for ElasticsearchClient in PrometheusMetricsCollector
- Update Grafana dashboard panel colors and labels:
  - Frames Completed: DEAD (red), SUCCEEDED (green), WAITING (yellow)
  - Events Published: human-readable labels with consistent colors
- Add monitoring documentation screenshots for all components: Grafana, Prometheus, Kafka UI, Elasticsearch, Kibana
- Update all monitoring docs (Quick Start, Concepts, User Guides, Reference, Tutorials, Developer Guide) with visual references
- Add load_test_jobs.py script for generating test monitoring data
- Update monitor_events.py consumer script
@DiegoTavares / @lithorus
I don't think this diagram makes much sense. The Monitoring Manager is both the consumer and the producer? What is it consuming? Is the ESClient reading from Elasticsearch? If so, who is writing? I thought the writer would be consuming from the Kafka queue and writing to Elasticsearch.
…nitoringManager
- Correct data flow in architecture diagram to show:
- Service Layer -> KafkaEventPublisher -> Kafka
- Kafka -> KafkaEventConsumer -> ElasticsearchClient -> Elasticsearch
- Remove MonitoringManager from Key classes table (use correct names)
- Fix PrometheusMetrics -> PrometheusMetricsCollector (correct class name)
- Update code example to match actual implementation pattern with
kafkaEventPublisher.publishJobEvent() and proper error handling
- Add explicit data flow explanation for clarity
Thanks for catching this, Diego! You're right, the diagram was confusing and incorrect. Fixed. See the updated documentation: docs/_docs/developer-guide/monitoring-development.md

Data flow: Service Layer -> KafkaEventPublisher -> Kafka -> KafkaEventConsumer -> ElasticsearchClient -> Elasticsearch
Review threads (resolved):
- cuebot/src/main/java/com/imageworks/spcue/PrometheusMetricsCollector.java
- cuebot/src/main/java/com/imageworks/spcue/monitoring/KafkaEventPublisher.java
- cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
…theus

Per review feedback, remove Kafka-related metrics that add storage overhead without providing essential value:
- cue_monitoring_events_published_total
- cue_monitoring_events_dropped_total
- cue_monitoring_event_queue_size

Keep only cue_elasticsearch_index_queue_size as the single metric for monitoring the monitoring system.

Changes:
- Remove metric definitions and methods from PrometheusMetricsCollector
- Remove prometheusMetrics field and setter from KafkaEventPublisher
- Update applicationContext-monitoring.xml to remove property injection
- Update Grafana dashboard: replace 3 Kafka metric panels with a single Elasticsearch Index Queue Size panel
Per review feedback, remove the metric that monitors the monitoring system:
- cue_elasticsearch_index_queue_size

Elasticsearch health can be checked directly via Kibana or ES APIs.

Changes:
- PrometheusMetricsCollector: remove elasticsearchIndexQueueSize metric, elasticsearchClient field, and related setter methods
- Grafana dashboard: remove "Elasticsearch Index Queue Size" panel, adjust remaining panel positions
Add argparse support for configurable job submission:
- -n, --num-jobs: Number of jobs to submit (default: 1000)
- -b, --batch-size: Batch size for submission pauses (default: 50)

Allows flexible load testing without modifying the script.
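A minimal sketch of the flag parsing described above, using the standard library's argparse; the parser description is illustrative:

```python
import argparse

def parse_args(argv=None):
    # Mirrors the flags in the commit: -n/--num-jobs and -b/--batch-size.
    parser = argparse.ArgumentParser(description="Submit test jobs to OpenCue.")
    parser.add_argument("-n", "--num-jobs", type=int, default=1000,
                        help="Number of jobs to submit (default: 1000)")
    parser.add_argument("-b", "--batch-size", type=int, default=50,
                        help="Batch size for submission pauses (default: 50)")
    return parser.parse_args(argv)

args = parse_args(["-n", "500", "-b", "25"])
print(args.num_jobs, args.batch_size)  # 500 25
```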
- Add 'shot' label to cue_frames_completed_total counter
- Add 'shot' label to cue_jobs_completed_total counter
- Update recordFrameCompleted() to accept a shot parameter
- Update recordJobCompleted() to accept a shot parameter
- Update FrameCompleteHandler to pass frame.shot to metrics
- Update JobManagerSupport to fetch JobDetail and pass shot to metrics
- Add cue_job_core_seconds histogram to track total core seconds per job
- Record job core seconds on job completion using ExecutionSummary
- Include show and shot labels for filtering
- Add "Job Core Seconds Distribution" panel to Grafana dashboard
- Use buckets: 3600, 36000, 360000, 3600000, 36000000 (1h to 10000h)
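Prometheus histograms count observations in cumulative "le" (less-than-or-equal) buckets; with the bounds above, a 5-core-hour job lands in every bucket from 36000 upward. A small sketch of that bucketing, purely to illustrate how the chosen bounds partition job sizes:

```python
# Bucket upper bounds from the commit: 1h, 10h, 100h, 1000h, 10000h in seconds.
BUCKETS = [3600, 36000, 360000, 3600000, 36000000]

def cumulative_bucket_counts(observations, buckets=BUCKETS):
    """Count observations per cumulative 'le' bucket, Prometheus-histogram
    style: each bucket counts all values less than or equal to its bound."""
    counts = {}
    for le in buckets:
        counts[le] = sum(1 for v in observations if v <= le)
    counts["+Inf"] = len(observations)
    return counts

# Three jobs: 30 core-minutes, 5 core-hours, 200 core-hours.
print(cumulative_bucket_counts([1800, 18000, 720000]))
```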
Replace per-frame histogram metrics with layer-level aggregations to reduce metric cardinality and cost:
- Rename cue_frame_runtime_seconds to cue_layer_max_runtime_seconds
- Rename cue_frame_memory_bytes to cue_layer_max_memory_bytes
- Add shot label to both layer histograms
- Record metrics when a layer completes instead of per-frame
- Add highFrameSec field to ExecutionSummary for max frame runtime
- Update LayerDaoJdbc to fetch int_clock_time_high
- Update Grafana dashboard with new metric names and panel titles

This reduces metric volume since frames within a layer have similar runtime and memory characteristics.
…all types

Implement Elasticsearch search methods and wire up event publishing for job, layer, host, and proc events:
- ElasticsearchClient: Add search methods for historical job, frame, layer, and layer memory queries with filtering and pagination
- HistoricalDaoJdbc: Integrate with ElasticsearchClient to return actual query results instead of empty lists
- JobManagerSupport: Publish job events (JOB_FINISHED, JOB_KILLED) when jobs complete
- FrameCompleteHandler: Publish layer events (LAYER_COMPLETED) when layers finish
- HostReportHandler: Publish host events (HOST_STATE_CHANGED) when hardware state changes
- DispatchSupportService: Publish proc events (PROC_BOOKED, PROC_UNBOOKED) when procs are created/deleted
- applicationContext-service.xml: Wire kafkaEventPublisher to beans

All six Kafka event types are now indexed to Elasticsearch:
- opencue.frame.events (FRAME_COMPLETED, FRAME_FAILED, etc.)
- opencue.job.events (JOB_FINISHED, JOB_KILLED)
- opencue.layer.events (LAYER_COMPLETED)
- opencue.host.events (HOST_STATE_CHANGED)
- opencue.host.reports (HOST_REPORT, HOST_BOOT)
- opencue.proc.events (PROC_BOOKED, PROC_UNBOOKED)
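The topic-per-entity routing listed in this commit can be summarized as a simple lookup. Event names and topics are taken from the commit message as of this point in the history (host reports are removed in a later commit); the routing helper itself is an illustrative sketch, not code from the PR:

```python
# Event-type -> Kafka-topic routing from the commit message.
TOPIC_BY_EVENT = {
    "FRAME_COMPLETED": "opencue.frame.events",
    "FRAME_FAILED": "opencue.frame.events",
    "JOB_FINISHED": "opencue.job.events",
    "JOB_KILLED": "opencue.job.events",
    "LAYER_COMPLETED": "opencue.layer.events",
    "HOST_STATE_CHANGED": "opencue.host.events",
    "HOST_REPORT": "opencue.host.reports",
    "HOST_BOOT": "opencue.host.reports",
    "PROC_BOOKED": "opencue.proc.events",
    "PROC_UNBOOKED": "opencue.proc.events",
}

def topic_for(event_type):
    """Resolve the Kafka topic for an event type; raises KeyError if unknown."""
    return TOPIC_BY_EVENT[event_type]

print(topic_for("JOB_KILLED"))  # opencue.job.events
```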
Move kafkaEventPublisher assignment inside the null check to avoid unnecessary null assignments and keep related logic grouped together.

Changes:
- FrameCompleteHandler: Guard assignment with null check
- HostReportHandler: Guard assignment with null check
- DispatchSupportService: Guard assignment with null check
- JobManagerSupport: Guard assignment with null check, remove unused monitoringEventBuilder getter/setter
- applicationContext-service.xml: Remove monitoringEventBuilder property from jobManagerSupport bean (now auto-created in setter)
…eHandler

Per review feedback, remove exception handling that silently swallows errors in Prometheus metrics recording and Kafka event publishing. Exceptions should propagate to allow proper error visibility.

Changes:
- Remove try-catch around prometheusMetrics.recordLayerMaxRuntime/Memory
- Remove try-catch around kafkaEventPublisher.publishLayerEvent
- Remove try-catch around prometheusMetrics.recordFrameCompleted
- Remove try-catch around kafkaEventPublisher.publishFrameEvent
- Fix spotless formatting for LayerEvent builder call
Move MonitoringEventBuilder instantiation from inline setter creation to proper Spring dependency injection via applicationContext configuration.

Changes:
- Wire monitoringEventBuilder bean to dispatchSupport, jobManagerSupport, frameCompleteHandler, and hostReportHandler in applicationContext-service.xml
- Simplify setKafkaEventPublisher() setters to just assign the field
- Add setMonitoringEventBuilder() setters for Spring injection
- Remove try-catch blocks that silently swallowed exceptions in Prometheus metrics and Kafka event publishing (per review feedback)

This follows Spring DI best practices, where MonitoringEventBuilder is a shared singleton managed by the container rather than being manually instantiated in each class.
Implement FRAME_STARTED and FRAME_DISPATCHED event publishing to track how long frames wait in the queue before being dispatched to hosts.

Pickup time = FRAME_STARTED.timestamp - FRAME_DISPATCHED.timestamp

Changes:
- Add isFrameDispatchable() to DependDao to check if a frame has no pending deps
- Publish FRAME_STARTED events in DispatchSupportService on WAITING -> RUNNING
- Publish FRAME_DISPATCHED events in DependManagerService on DEPEND -> WAITING
- Add buildFrameStartedEvent/buildFrameDispatchableEvent to MonitoringEventBuilder
- Wire kafkaEventPublisher and monitoringEventBuilder to dependManager bean

Testing:
- Add MonitoringEventBuilderTests for event building validation
- Add PickupTimeTrackingTests for dependency satisfaction flow

Dashboard:
- Add Elasticsearch datasource with header.timestamp as time field
- Add Pickup Time Metrics row with 6 new panels:
  - Frames Started/Dispatchable stat panels
  - Pickup Time Events Over Time chart
  - Recent FRAME_STARTED/FRAME_DISPATCHED tables
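The pickup-time formula above can be sketched as a small consumer-side computation. Event dicts here are a simplified stand-in for the real protobuf payloads; timestamps are epoch milliseconds (the mapping format the PR uses for header.timestamp):

```python
def pickup_time_seconds(events):
    """Compute per-frame pickup time:
    FRAME_STARTED.timestamp - FRAME_DISPATCHED.timestamp, in seconds.

    Frames with only one of the two events are skipped.
    """
    dispatched, started = {}, {}
    for e in events:
        if e["type"] == "FRAME_DISPATCHED":
            dispatched[e["frame_id"]] = e["timestamp"]
        elif e["type"] == "FRAME_STARTED":
            started[e["frame_id"]] = e["timestamp"]
    return {
        fid: (started[fid] - dispatched[fid]) / 1000.0
        for fid in started
        if fid in dispatched
    }

events = [
    {"type": "FRAME_DISPATCHED", "frame_id": "f1", "timestamp": 1_700_000_000_000},
    {"type": "FRAME_STARTED", "frame_id": "f1", "timestamp": 1_700_000_012_500},
]
print(pickup_time_seconds(events))  # {'f1': 12.5}
```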
…asticsearch:
- Index overview and document count queries
- Pickup time tracking (FRAME_STARTED/FRAME_DISPATCHED events)
- Frame, job, layer, proc, and host event queries
- Time-based analytics and aggregations
- Correlation queries for tracing job/frame lifecycles
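A pickup-time query of the kind listed above might be built like this. The header.timestamp field and epoch-millis format are from the PR; the event_type field name and overall mapping are assumptions about the index schema, so treat this as a sketch:

```python
import json

def pickup_time_query(start_ms, end_ms, size=100):
    """Build an Elasticsearch query body selecting FRAME_STARTED and
    FRAME_DISPATCHED events inside a time window, oldest first."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    # event_type is an assumed field name for the event kind.
                    {"terms": {"event_type": ["FRAME_STARTED", "FRAME_DISPATCHED"]}},
                    # header.timestamp is mapped as date / epoch_millis in the PR.
                    {"range": {"header.timestamp": {"gte": start_ms, "lte": end_ms}}},
                ]
            }
        },
        "sort": [{"header.timestamp": "asc"}],
    }

print(json.dumps(pickup_time_query(0, 1_700_000_000_000))[:120])
```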
- Remove try-catch from Prometheus metrics recording so programming errors (wrong labels) fail loudly
- Keep Kafka event publishing exception handling, but properly log with a stack trace for debugging transient failures
…osition

- Add createIndexTemplates() to ElasticsearchClient to ensure header.timestamp is mapped as date type with epoch_millis format (fixes the Grafana "No data" issue)
- Refactor monitoring.proto to use a composition pattern: embed Job, Layer, Frame, Host messages instead of duplicating fields
- Update MonitoringEventBuilder to work with embedded proto messages
- Exclude the -serial compiler warning in build.gradle for protobuf-generated code
- Add unit tests for FRAME_STARTED and FRAME_DISPATCHED event building

The timestamp mapping fix resolves time-based filtering in Grafana dashboards for Pickup Time Metrics (FRAME_STARTED/FRAME_DISPATCHED events).
Remove conditional try/except around monitoring proto imports and MONITORING_AVAILABLE flag. Monitoring functions should always be available in pycue - if monitoring is disabled at cuebot level, it will return grpc.Status=UNIMPLEMENTED rather than failing at import. This allows toggling monitoring at the Cuebot end without requiring a new version of pycue.
Host reports are too large to store in Kafka/Elasticsearch due to their high frequency (~60s intervals) and data volume. Host metrics should use Prometheus instead.

Changes:
- Remove HostReportEvent and RunningFrameSummary from monitoring.proto
- Remove publishHostReportEvent from HostReportHandler, KafkaEventPublisher
- Remove host report indexing from ElasticsearchClient, KafkaEventConsumer
- Remove buildHostReportEvent from MonitoringEventBuilder
- Update documentation to note host metrics use Prometheus
- Keep HostEvent for state change audit trail (up/down/locked)
The producer now acts as topic admin, creating topics with explicit configuration rather than relying on auto-creation with defaults.

Configurable topic settings:
- monitoring.kafka.topic.partitions (default: 3)
- monitoring.kafka.topic.replication.factor (default: 1)
- monitoring.kafka.topic.retention.ms (default: 7 days)
- monitoring.kafka.topic.cleanup.policy (default: delete)
- monitoring.kafka.topic.segment.ms (default: 1 day)
- monitoring.kafka.topic.segment.bytes (default: 1GB)

Topics are created on initialization before the producer starts. TopicExistsException is handled gracefully for idempotent startup.
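The defaults above can be expressed as a simple merge of user-supplied properties over built-in defaults, which is the shape of what the topic-admin step does before creating topics. Property names are from the commit; the helper itself is a sketch, not the Java implementation:

```python
# Defaults from the commit message, expressed in concrete units.
TOPIC_DEFAULTS = {
    "monitoring.kafka.topic.partitions": 3,
    "monitoring.kafka.topic.replication.factor": 1,
    "monitoring.kafka.topic.retention.ms": 7 * 24 * 60 * 60 * 1000,  # 7 days
    "monitoring.kafka.topic.cleanup.policy": "delete",
    "monitoring.kafka.topic.segment.ms": 24 * 60 * 60 * 1000,        # 1 day
    "monitoring.kafka.topic.segment.bytes": 1024 ** 3,               # 1 GB
}

def topic_settings(overrides=None):
    """Merge user-supplied properties over the defaults; unknown keys pass
    through untouched, mirroring property-file overrides."""
    settings = dict(TOPIC_DEFAULTS)
    settings.update(overrides or {})
    return settings

print(topic_settings({"monitoring.kafka.topic.partitions": 12}))
```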
…dexer

Move Kafka-to-Elasticsearch event indexing from Cuebot to a standalone Rust service, addressing code review feedback to decouple the consumer from the Java codebase.

Rust kafka-es-indexer:
- Add rust/crates/kafka-es-indexer: standalone Kafka consumer that indexes OpenCue events (job, layer, frame, host, proc) to Elasticsearch
- Async Kafka consumer with configurable batch processing
- Elasticsearch bulk indexing with date-based indices
- Index templates with proper field mappings for all event types
- CLI with environment variable configuration

Cuebot cleanup:
- Remove Java KafkaEventConsumer and ElasticsearchClient classes
- Remove getJobHistory, getFrameHistory, getLayerHistory, getLayerMemoryHistory from HistoricalDao and HistoricalManager
- Update ManageMonitoring gRPC servant to return UNIMPLEMENTED with a message directing users to query Elasticsearch directly
- Keep KafkaEventPublisher for publishing events from Cuebot to Kafka
- Keep core job archival methods (getFinishedJobs, transferJob) intact

Infrastructure:
- Update docker-compose.monitoring-full.yml to include kafka-es-indexer
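"Date-based indices" here means each event lands in an index named for its topic and UTC day, so old indices can be dropped wholesale. The exact naming scheme is an assumption (it is not quoted from the indexer's code); this sketch only illustrates the idea:

```python
from datetime import datetime, timezone

def index_name(topic, ts_ms):
    """Hypothetical date-based index name for an event.

    Assumes dots in the topic become dashes and the UTC day is appended,
    e.g. "opencue.frame.events" -> "opencue-frame-events-2023.11.14".
    """
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y.%m.%d")
    return topic.replace(".", "-") + "-" + day

print(index_name("opencue.frame.events", 1_700_000_000_000))
```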
Update all monitoring documentation to reflect the decoupled architecture where Elasticsearch indexing is handled by the standalone Rust kafka-es-indexer service instead of Cuebot.

Documentation changes:
- Add kafka-es-indexer to component tables and architecture diagrams
- Update configuration examples with correct CLI args and env vars
- Remove stale Prometheus metrics (cue_monitoring_events_*, cue_elasticsearch_*)
- Remove opencue.host.reports topic (removed from pipeline)
- Replace Cuebot Elasticsearch config with kafka-es-indexer config
- Update alert examples to use existing metrics

Files updated:
- docs/_docs/concepts/render-farm-monitoring.md
- docs/_docs/developer-guide/monitoring-development.md
- docs/_docs/getting-started/deploying-monitoring.md
- docs/_docs/quick-starts/quick-start-monitoring.md
- docs/_docs/reference/monitoring-reference.md
- docs/_docs/tutorials/monitoring-tutorial.md
- docs/_docs/user-guides/render-farm-monitoring-guide.md
- rust/README.md: Add kafka-es-indexer to crates list
- sandbox/README.md: Add event streaming monitoring stack section
- opencue_monitoring images: opencue_monitoring_elasticsearch_kibana_dev_tools.png, opencue_monitoring_grafana_chart.png, opencue_monitoring_prometheus.png
@DiegoTavares / @lithorus
[cuebot/pycue/proto/sandbox/docs] Introduce full event-driven monitoring stack, enhance metrics, dashboards, and documentation
Introduce an event-driven monitoring infrastructure for OpenCue, enabling real-time and historical analysis of render farm activity with a fully integrated monitoring stack and comprehensive documentation.
This change adds a Kafka + Elasticsearch pipeline for collecting and storing job, layer, frame, and host lifecycle events, while integrating Prometheus and Grafana for live dashboards and operational visibility.
Core features:
- monitoring.proto with job / layer / frame / host lifecycle events
- KafkaEventPublisher for asynchronous event publishing
- Event publishing hooked into FrameCompleteHandler and HostReportHandler
- PrometheusMetricsCollector with cue_elasticsearch_index_queue_size metric (ElasticsearchClient)
- MonitoringInterface gRPC service for historical query access
- applicationContext-monitoring.xml Spring configuration
- TestAppConfig updated to include monitoring context

Monitoring stack infrastructure:

- docker-compose.monitoring-full.yml including Zookeeper, Kafka, Elasticsearch, Kibana, Prometheus, Grafana

Documentation and examples:

New & updated sandbox utilities:

- sandbox/monitor_events.py: Example Kafka consumer (enhanced)
- sandbox/load_test_jobs.py: Test data generator for monitoring validation

Configuration (opt-in):

Kafka:
- monitoring.kafka.enabled
- monitoring.kafka.bootstrap.servers

Elasticsearch:
- monitoring.elasticsearch.enabled
- monitoring.elasticsearch.host

Enables:
- Extended memory prediction beyond the 3-day pycue API limit
- Real-time farm monitoring via Prometheus/Grafana dashboards
- Historical job/frame/layer analytics via Elasticsearch