[cuebot/pycue/proto/sandbox/docs] Add full event-driven monitoring stack, enhance metrics, dashboards, and documentation #2086
Conversation
Implement event-driven monitoring infrastructure for collecting and storing render farm statistics with historical data access capabilities.

Key components:
- Define monitoring.proto with job/layer/frame/host lifecycle events
- Add KafkaEventPublisher for async event publishing to Kafka topics
- Create Elasticsearch client and consumer for historical data storage
- Hook event publishing into FrameCompleteHandler and HostReportHandler
- Extend PrometheusMetricsCollector with frame/job completion metrics
- Add MonitoringInterface gRPC service for historical data queries
- Create pycue monitoring wrapper with historical data API methods
- Add applicationContext-monitoring.xml Spring configuration for monitoring beans
- Update TestAppConfig to include monitoring context for tests

Configuration:
- All features disabled by default (opt-in via properties)
- Kafka: monitoring.kafka.enabled, monitoring.kafka.bootstrap.servers
- Elasticsearch: monitoring.elasticsearch.enabled, monitoring.elasticsearch.host

This enables:
- Extended memory prediction beyond the 3-day pycue API limit
- Real-time farm monitoring via Prometheus/Grafana dashboards
- Historical job/frame/layer analytics via Elasticsearch
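The property names above come straight from the commit; a minimal opencue.properties-style sketch of the opt-in flags might look like the following. The host/port values are placeholders, not taken from the PR.

```properties
# Opt-in monitoring configuration (everything disabled by default).
# Hostnames and ports below are illustrative placeholders.
monitoring.kafka.enabled=true
monitoring.kafka.bootstrap.servers=kafka:9092
monitoring.elasticsearch.enabled=true
monitoring.elasticsearch.host=elasticsearch:9200
```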
…earch client

- Fix Elasticsearch client library conflict by adding an explicit elasticsearch-rest-client 8.8.0 dependency to resolve the version mismatch with the Spring Boot managed 6.8.4
- Add docker-compose.monitoring-full.yml with the complete monitoring stack: Zookeeper, Kafka, Elasticsearch, Kibana, Prometheus, Grafana
- Add Prometheus configuration for scraping the cuebot metrics endpoint
- Add Grafana dashboard for OpenCue monitoring with panels for: frame completion rates, job completion by show, frame runtime distribution, memory usage, event queue metrics, and host reports
- concepts/render-farm-monitoring.md: Architecture overview, event types, Kafka topics, Elasticsearch storage, and Prometheus metrics concepts
- quick-starts/quick-start-monitoring.md: Step-by-step guide to deploy the monitoring stack with Docker Compose
- getting-started/deploying-monitoring.md: Production deployment guide for Kafka, Elasticsearch, Prometheus, and Grafana
- user-guides/render-farm-monitoring-guide.md: Configure Grafana dashboards, alerts, Kafka consumers, and Elasticsearch queries
- developer-guide/monitoring-development.md: Extend the monitoring system with custom events, metrics, and Elasticsearch indexing
- reference/monitoring-reference.md: Complete API reference for Kafka topics, event schemas, Prometheus metrics, and configuration options
- tutorials/monitoring-tutorial.md: Hands-on tutorial for building custom dashboards and processing monitoring events
- sandbox/monitor_events.py: Example Kafka consumer for monitoring events
Update nav_order across all documentation files for consistent navigation:
1) Run script docs/extract_nav_orders.py
2) Manually fix nav_order_index.txt
3) Run script docs/update_nav_order.py
OpenCue monitoring stack documentation
New script to load test jobs using PyOutline. The load test script submits jobs to OpenCue for monitoring testing. Usage:
python load_test_jobs.py # Uses defaults: 1000 jobs, batch size 50
python load_test_jobs.py -n 100 # Submit 100 jobs
python load_test_jobs.py -n 500 -b 25 # Submit 500 jobs in batches of 25
python load_test_jobs.py --num-jobs 100 --batch-size 10
$ python sandbox/load_test_jobs.py
Submitting 1000 jobs to OpenCue...
------------------------------------------------------------
Submitted 10/1000 jobs (10 successful, 0 failed)
Submitted 20/1000 jobs (20 successful, 0 failed)
Submitted 30/1000 jobs (30 successful, 0 failed)
Submitted 40/1000 jobs (40 successful, 0 failed)
Submitted 50/1000 jobs (50 successful, 0 failed)
Batch complete, pausing briefly...
Submitted 60/1000 jobs (60 successful, 0 failed)
Submitted 70/1000 jobs (70 successful, 0 failed)
Submitted 80/1000 jobs (80 successful, 0 failed)
Submitted 90/1000 jobs (90 successful, 0 failed)
Submitted 100/1000 jobs (100 successful, 0 failed)
Batch complete, pausing briefly...
...
Submitted 910/1000 jobs (910 successful, 0 failed)
Submitted 920/1000 jobs (920 successful, 0 failed)
Submitted 930/1000 jobs (930 successful, 0 failed)
Submitted 940/1000 jobs (940 successful, 0 failed)
Submitted 950/1000 jobs (950 successful, 0 failed)
Batch complete, pausing briefly...
Submitted 960/1000 jobs (960 successful, 0 failed)
Submitted 970/1000 jobs (970 successful, 0 failed)
Submitted 980/1000 jobs (980 successful, 0 failed)
Submitted 990/1000 jobs (990 successful, 0 failed)
Submitted 1000/1000 jobs (1000 successful, 0 failed)
Batch complete, pausing briefly...
------------------------------------------------------------
Load test complete!
Submitted: 1000
Failed: 0
Total frames: ~2000
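The output above implies a simple pacing scheme: pause after every batch_size submissions. A minimal sketch of that logic, with illustrative names (this is not the actual script):

```python
def submit_with_batching(num_jobs=1000, batch_size=50):
    """Sketch of the load-test pacing implied by the output above.

    Returns the 1-based job indices after which the script would pause.
    The real script submits a job at each step; here that call is elided.
    """
    pauses = []
    for i in range(1, num_jobs + 1):
        # submit_job(i) would go here in the real script
        if i % batch_size == 0:
            pauses.append(i)  # "Batch complete, pausing briefly..."
    return pauses

print(submit_with_batching(100, 25))  # pauses after jobs 25, 50, 75, 100
```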
…rd, and improve docs

- Implement cue_elasticsearch_index_queue_size Prometheus metric using @Autowired for ElasticsearchClient in PrometheusMetricsCollector
- Update Grafana dashboard panel colors and labels:
  - Frames Completed: DEAD (red), SUCCEEDED (green), WAITING (yellow)
  - Events Published: human-readable labels with consistent colors
- Add monitoring documentation screenshots for all components: Grafana, Prometheus, Kafka UI, Elasticsearch, Kibana
- Update all monitoring docs (Quick Start, Concepts, User Guides, Reference, Tutorials, Developer Guide) with visual references
- Add load_test_jobs.py script for generating test monitoring data
- Update monitor_events.py consumer script
@DiegoTavares / @lithorus
I don't think this diagram makes much sense. The Monitoring Manager is both the consumer and the producer? What is it consuming? Is the ESClient reading from Elasticsearch? If so, who is writing? I thought the writer would be consuming from the Kafka queue and writing to Elasticsearch.
…nitoringManager
- Correct data flow in architecture diagram to show:
- Service Layer -> KafkaEventPublisher -> Kafka
- Kafka -> KafkaEventConsumer -> ElasticsearchClient -> Elasticsearch
- Remove MonitoringManager from Key classes table (use correct names)
- Fix PrometheusMetrics -> PrometheusMetricsCollector (correct class name)
- Update code example to match actual implementation pattern with
kafkaEventPublisher.publishJobEvent() and proper error handling
- Add explicit data flow explanation for clarity
Thanks for catching this, Diego! You're right, the diagram was confusing and incorrect. Fixed. See the updated documentation: docs/_docs/developer-guide/monitoring-development.md

Data flow: Service Layer -> KafkaEventPublisher -> Kafka -> KafkaEventConsumer -> ElasticsearchClient -> Elasticsearch
Review threads (resolved):
- cuebot/src/main/java/com/imageworks/spcue/PrometheusMetricsCollector.java
- cuebot/src/main/java/com/imageworks/spcue/monitoring/KafkaEventPublisher.java
- cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java
…theus

Per review feedback, remove Kafka-related metrics that add storage overhead without providing essential value:
- cue_monitoring_events_published_total
- cue_monitoring_events_dropped_total
- cue_monitoring_event_queue_size

Keep only cue_elasticsearch_index_queue_size as the single metric for monitoring the monitoring system.

Changes:
- Remove metric definitions and methods from PrometheusMetricsCollector
- Remove prometheusMetrics field and setter from KafkaEventPublisher
- Update applicationContext-monitoring.xml to remove property injection
- Update Grafana dashboard: replace 3 Kafka metric panels with a single Elasticsearch Index Queue Size panel
Per review feedback, remove the metric that monitors the monitoring system:
- cue_elasticsearch_index_queue_size

Elasticsearch health can be checked directly via Kibana or ES APIs.

Changes:
- PrometheusMetricsCollector: remove elasticsearchIndexQueueSize metric, elasticsearchClient field, and related setter methods
- Grafana dashboard: remove "Elasticsearch Index Queue Size" panel, adjust remaining panel positions
Add argparse support for configurable job submission:
- -n, --num-jobs: Number of jobs to submit (default: 1000)
- -b, --batch-size: Batch size for submission pauses (default: 50)

Allows flexible load testing without modifying the script.
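A minimal sketch of the flag parsing described above, using the standard library's argparse; the parser description is illustrative:

```python
import argparse

def parse_args(argv=None):
    # Mirrors the flags in the commit: -n/--num-jobs and -b/--batch-size.
    parser = argparse.ArgumentParser(description="Submit test jobs to OpenCue.")
    parser.add_argument("-n", "--num-jobs", type=int, default=1000,
                        help="Number of jobs to submit (default: 1000)")
    parser.add_argument("-b", "--batch-size", type=int, default=50,
                        help="Batch size for submission pauses (default: 50)")
    return parser.parse_args(argv)

args = parse_args(["-n", "500", "-b", "25"])
print(args.num_jobs, args.batch_size)  # 500 25
```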
- Add 'shot' label to cue_frames_completed_total counter
- Add 'shot' label to cue_jobs_completed_total counter
- Update recordFrameCompleted() to accept a shot parameter
- Update recordJobCompleted() to accept a shot parameter
- Update FrameCompleteHandler to pass frame.shot to metrics
- Update JobManagerSupport to fetch JobDetail and pass shot to metrics
- Add cue_job_core_seconds histogram to track total core seconds per job
- Record job core seconds on job completion using ExecutionSummary
- Include show and shot labels for filtering
- Add "Job Core Seconds Distribution" panel to Grafana dashboard
- Use buckets: 3600, 36000, 360000, 3600000, 36000000 (1h to 10000h)
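Prometheus histograms count observations in cumulative "le" (less-than-or-equal) buckets; with the bounds above, a 5-core-hour job lands in every bucket from 36000 upward. A small sketch of that bucketing, purely to illustrate how the chosen bounds partition job sizes:

```python
# Bucket upper bounds from the commit: 1h, 10h, 100h, 1000h, 10000h in seconds.
BUCKETS = [3600, 36000, 360000, 3600000, 36000000]

def cumulative_bucket_counts(observations, buckets=BUCKETS):
    """Count observations per cumulative 'le' bucket, Prometheus-histogram
    style: each bucket counts all values less than or equal to its bound."""
    counts = {}
    for le in buckets:
        counts[le] = sum(1 for v in observations if v <= le)
    counts["+Inf"] = len(observations)
    return counts

# Three jobs: 30 core-minutes, 5 core-hours, 200 core-hours.
print(cumulative_bucket_counts([1800, 18000, 720000]))
```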
Replace per-frame histogram metrics with layer-level aggregations to reduce metric cardinality and cost:
- Rename cue_frame_runtime_seconds to cue_layer_max_runtime_seconds
- Rename cue_frame_memory_bytes to cue_layer_max_memory_bytes
- Add shot label to both layer histograms
- Record metrics when a layer completes instead of per-frame
- Add highFrameSec field to ExecutionSummary for max frame runtime
- Update LayerDaoJdbc to fetch int_clock_time_high
- Update Grafana dashboard with new metric names and panel titles

This reduces metric volume since frames within a layer have similar runtime and memory characteristics.
…all types

Implement Elasticsearch search methods and wire up event publishing for job, layer, host, and proc events:
- ElasticsearchClient: Add search methods for historical job, frame, layer, and layer memory queries with filtering and pagination
- HistoricalDaoJdbc: Integrate with ElasticsearchClient to return actual query results instead of empty lists
- JobManagerSupport: Publish job events (JOB_FINISHED, JOB_KILLED) when jobs complete
- FrameCompleteHandler: Publish layer events (LAYER_COMPLETED) when layers finish
- HostReportHandler: Publish host events (HOST_STATE_CHANGED) when hardware state changes
- DispatchSupportService: Publish proc events (PROC_BOOKED, PROC_UNBOOKED) when procs are created/deleted
- applicationContext-service.xml: Wire kafkaEventPublisher to beans

All six Kafka event types are now indexed to Elasticsearch:
- opencue.frame.events (FRAME_COMPLETED, FRAME_FAILED, etc.)
- opencue.job.events (JOB_FINISHED, JOB_KILLED)
- opencue.layer.events (LAYER_COMPLETED)
- opencue.host.events (HOST_STATE_CHANGED)
- opencue.host.reports (HOST_REPORT, HOST_BOOT)
- opencue.proc.events (PROC_BOOKED, PROC_UNBOOKED)
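The topic-per-entity routing listed in this commit can be summarized as a simple lookup. Event names and topics are taken from the commit message as of this point in the history (host reports are removed in a later commit); the routing helper itself is an illustrative sketch, not code from the PR:

```python
# Event-type -> Kafka-topic routing from the commit message.
TOPIC_BY_EVENT = {
    "FRAME_COMPLETED": "opencue.frame.events",
    "FRAME_FAILED": "opencue.frame.events",
    "JOB_FINISHED": "opencue.job.events",
    "JOB_KILLED": "opencue.job.events",
    "LAYER_COMPLETED": "opencue.layer.events",
    "HOST_STATE_CHANGED": "opencue.host.events",
    "HOST_REPORT": "opencue.host.reports",
    "HOST_BOOT": "opencue.host.reports",
    "PROC_BOOKED": "opencue.proc.events",
    "PROC_UNBOOKED": "opencue.proc.events",
}

def topic_for(event_type):
    """Resolve the Kafka topic for an event type; raises KeyError if unknown."""
    return TOPIC_BY_EVENT[event_type]

print(topic_for("JOB_KILLED"))  # opencue.job.events
```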
Move kafkaEventPublisher assignment inside the null check to avoid unnecessary null assignments and keep related logic grouped together.

Changes:
- FrameCompleteHandler: Guard assignment with null check
- HostReportHandler: Guard assignment with null check
- DispatchSupportService: Guard assignment with null check
- JobManagerSupport: Guard assignment with null check, remove unused monitoringEventBuilder getter/setter
- applicationContext-service.xml: Remove monitoringEventBuilder property from jobManagerSupport bean (now auto-created in setter)
…eHandler

Per review feedback, remove exception handling that silently swallows errors in Prometheus metrics recording and Kafka event publishing. Exceptions should propagate to allow proper error visibility.

Changes:
- Remove try-catch around prometheusMetrics.recordLayerMaxRuntime/Memory
- Remove try-catch around kafkaEventPublisher.publishLayerEvent
- Remove try-catch around prometheusMetrics.recordFrameCompleted
- Remove try-catch around kafkaEventPublisher.publishFrameEvent
- Fix spotless formatting for LayerEvent builder call
Move MonitoringEventBuilder instantiation from inline setter creation to proper Spring dependency injection via applicationContext configuration.

Changes:
- Wire monitoringEventBuilder bean to dispatchSupport, jobManagerSupport, frameCompleteHandler, and hostReportHandler in applicationContext-service.xml
- Simplify setKafkaEventPublisher() setters to just assign the field
- Add setMonitoringEventBuilder() setters for Spring injection
- Remove try-catch blocks that silently swallowed exceptions in Prometheus metrics and Kafka event publishing (per review feedback)

This follows Spring DI best practices, where MonitoringEventBuilder is a shared singleton managed by the container rather than being manually instantiated in each class.
Implement FRAME_STARTED and FRAME_DISPATCHED event publishing to track how long frames wait in the queue before being dispatched to hosts.

Pickup time = FRAME_STARTED.timestamp - FRAME_DISPATCHED.timestamp

Changes:
- Add isFrameDispatchable() to DependDao to check if a frame has no pending deps
- Publish FRAME_STARTED events in DispatchSupportService on WAITING -> RUNNING
- Publish FRAME_DISPATCHED events in DependManagerService on DEPEND -> WAITING
- Add buildFrameStartedEvent/buildFrameDispatchableEvent to MonitoringEventBuilder
- Wire kafkaEventPublisher and monitoringEventBuilder to dependManager bean

Testing:
- Add MonitoringEventBuilderTests for event building validation
- Add PickupTimeTrackingTests for dependency satisfaction flow

Dashboard:
- Add Elasticsearch datasource with header.timestamp as time field
- Add Pickup Time Metrics row with 6 new panels:
  - Frames Started/Dispatchable stat panels
  - Pickup Time Events Over Time chart
  - Recent FRAME_STARTED/FRAME_DISPATCHED tables
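The pickup-time formula above can be sketched as a small consumer-side computation. Event dicts here are a simplified stand-in for the real protobuf payloads; timestamps are epoch milliseconds (the mapping format the PR uses for header.timestamp):

```python
def pickup_time_seconds(events):
    """Compute per-frame pickup time:
    FRAME_STARTED.timestamp - FRAME_DISPATCHED.timestamp, in seconds.

    Frames with only one of the two events are skipped.
    """
    dispatched, started = {}, {}
    for e in events:
        if e["type"] == "FRAME_DISPATCHED":
            dispatched[e["frame_id"]] = e["timestamp"]
        elif e["type"] == "FRAME_STARTED":
            started[e["frame_id"]] = e["timestamp"]
    return {
        fid: (started[fid] - dispatched[fid]) / 1000.0
        for fid in started
        if fid in dispatched
    }

events = [
    {"type": "FRAME_DISPATCHED", "frame_id": "f1", "timestamp": 1_700_000_000_000},
    {"type": "FRAME_STARTED", "frame_id": "f1", "timestamp": 1_700_000_012_500},
]
print(pickup_time_seconds(events))  # {'f1': 12.5}
```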
…asticsearch:
- Index overview and document count queries
- Pickup time tracking (FRAME_STARTED/FRAME_DISPATCHED events)
- Frame, job, layer, proc, and host event queries
- Time-based analytics and aggregations
- Correlation queries for tracing job/frame lifecycles
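A pickup-time query of the kind listed above might be built like this. The header.timestamp field and epoch-millis format are from the PR; the event_type field name and overall mapping are assumptions about the index schema, so treat this as a sketch:

```python
import json

def pickup_time_query(start_ms, end_ms, size=100):
    """Build an Elasticsearch query body selecting FRAME_STARTED and
    FRAME_DISPATCHED events inside a time window, oldest first."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    # event_type is an assumed field name for the event kind.
                    {"terms": {"event_type": ["FRAME_STARTED", "FRAME_DISPATCHED"]}},
                    # header.timestamp is mapped as date / epoch_millis in the PR.
                    {"range": {"header.timestamp": {"gte": start_ms, "lte": end_ms}}},
                ]
            }
        },
        "sort": [{"header.timestamp": "asc"}],
    }

print(json.dumps(pickup_time_query(0, 1_700_000_000_000))[:120])
```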
- Remove try-catch from Prometheus metrics recording so programming errors (wrong labels) fail loudly
- Keep Kafka event publishing exception handling, but properly log with a stack trace for debugging transient failures
…osition

- Add createIndexTemplates() to ElasticsearchClient to ensure header.timestamp is mapped as date type with epoch_millis format (fixes the Grafana "No data" issue)
- Refactor monitoring.proto to use a composition pattern: embed Job, Layer, Frame, Host messages instead of duplicating fields
- Update MonitoringEventBuilder to work with embedded proto messages
- Exclude the -serial compiler warning in build.gradle for protobuf-generated code
- Add unit tests for FRAME_STARTED and FRAME_DISPATCHED event building

The timestamp mapping fix resolves time-based filtering in Grafana dashboards for Pickup Time Metrics (FRAME_STARTED/FRAME_DISPATCHED events).
Remove conditional try/except around monitoring proto imports and MONITORING_AVAILABLE flag. Monitoring functions should always be available in pycue - if monitoring is disabled at cuebot level, it will return grpc.Status=UNIMPLEMENTED rather than failing at import. This allows toggling monitoring at the Cuebot end without requiring a new version of pycue.
Host reports are too large to store in Kafka/Elasticsearch due to their high frequency (~60s intervals) and data volume. Host metrics should use Prometheus instead.

Changes:
- Remove HostReportEvent and RunningFrameSummary from monitoring.proto
- Remove publishHostReportEvent from HostReportHandler, KafkaEventPublisher
- Remove host report indexing from ElasticsearchClient, KafkaEventConsumer
- Remove buildHostReportEvent from MonitoringEventBuilder
- Update documentation to note host metrics use Prometheus
- Keep HostEvent for state change audit trail (up/down/locked)
The producer now acts as topic admin, creating topics with explicit configuration rather than relying on auto-creation with defaults.

Configurable topic settings:
- monitoring.kafka.topic.partitions (default: 3)
- monitoring.kafka.topic.replication.factor (default: 1)
- monitoring.kafka.topic.retention.ms (default: 7 days)
- monitoring.kafka.topic.cleanup.policy (default: delete)
- monitoring.kafka.topic.segment.ms (default: 1 day)
- monitoring.kafka.topic.segment.bytes (default: 1GB)

Topics are created on initialization before the producer starts. TopicExistsException is handled gracefully for idempotent startup.
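The defaults above can be expressed as a simple merge of user-supplied properties over built-in defaults, which is the shape of what the topic-admin step does before creating topics. Property names are from the commit; the helper itself is a sketch, not the Java implementation:

```python
# Defaults from the commit message, expressed in concrete units.
TOPIC_DEFAULTS = {
    "monitoring.kafka.topic.partitions": 3,
    "monitoring.kafka.topic.replication.factor": 1,
    "monitoring.kafka.topic.retention.ms": 7 * 24 * 60 * 60 * 1000,  # 7 days
    "monitoring.kafka.topic.cleanup.policy": "delete",
    "monitoring.kafka.topic.segment.ms": 24 * 60 * 60 * 1000,        # 1 day
    "monitoring.kafka.topic.segment.bytes": 1024 ** 3,               # 1 GB
}

def topic_settings(overrides=None):
    """Merge user-supplied properties over the defaults; unknown keys pass
    through untouched, mirroring property-file overrides."""
    settings = dict(TOPIC_DEFAULTS)
    settings.update(overrides or {})
    return settings

print(topic_settings({"monitoring.kafka.topic.partitions": 12}))
```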
…dexer

Move Kafka-to-Elasticsearch event indexing from Cuebot to a standalone Rust service, addressing code review feedback to decouple the consumer from the Java codebase.

Rust kafka-es-indexer:
- Add rust/crates/kafka-es-indexer: standalone Kafka consumer that indexes OpenCue events (job, layer, frame, host, proc) to Elasticsearch
- Async Kafka consumer with configurable batch processing
- Elasticsearch bulk indexing with date-based indices
- Index templates with proper field mappings for all event types
- CLI with environment variable configuration

Cuebot cleanup:
- Remove Java KafkaEventConsumer and ElasticsearchClient classes
- Remove getJobHistory, getFrameHistory, getLayerHistory, getLayerMemoryHistory from HistoricalDao and HistoricalManager
- Update ManageMonitoring gRPC servant to return UNIMPLEMENTED with a message directing users to query Elasticsearch directly
- Keep KafkaEventPublisher for publishing events from Cuebot to Kafka
- Keep core job archival methods (getFinishedJobs, transferJob) intact

Infrastructure:
- Update docker-compose.monitoring-full.yml to include kafka-es-indexer
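"Date-based indices" here means each event lands in an index named for its topic and UTC day, so old indices can be dropped wholesale. The exact naming scheme is an assumption (it is not quoted from the indexer's code); this sketch only illustrates the idea:

```python
from datetime import datetime, timezone

def index_name(topic, ts_ms):
    """Hypothetical date-based index name for an event.

    Assumes dots in the topic become dashes and the UTC day is appended,
    e.g. "opencue.frame.events" -> "opencue-frame-events-2023.11.14".
    """
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y.%m.%d")
    return topic.replace(".", "-") + "-" + day

print(index_name("opencue.frame.events", 1_700_000_000_000))
```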
Update all monitoring documentation to reflect the decoupled architecture where Elasticsearch indexing is handled by the standalone Rust kafka-es-indexer service instead of Cuebot.

Documentation changes:
- Add kafka-es-indexer to component tables and architecture diagrams
- Update configuration examples with correct CLI args and env vars
- Remove stale Prometheus metrics (cue_monitoring_events_*, cue_elasticsearch_*)
- Remove opencue.host.reports topic (removed from pipeline)
- Replace Cuebot Elasticsearch config with kafka-es-indexer config
- Update alert examples to use existing metrics

Files updated:
- docs/_docs/concepts/render-farm-monitoring.md
- docs/_docs/developer-guide/monitoring-development.md
- docs/_docs/getting-started/deploying-monitoring.md
- docs/_docs/quick-starts/quick-start-monitoring.md
- docs/_docs/reference/monitoring-reference.md
- docs/_docs/tutorials/monitoring-tutorial.md
- docs/_docs/user-guides/render-farm-monitoring-guide.md
- rust/README.md: Add kafka-es-indexer to crates list
- sandbox/README.md: Add event streaming monitoring stack section
- opencue_monitoring images: opencue_monitoring_elasticsearch_kibana_dev_tools.png, opencue_monitoring_grafana_chart.png, opencue_monitoring_prometheus.png
@DiegoTavares / @lithorus
[cuebot/pycue/proto/sandbox/docs] Introduce full event-driven monitoring stack, enhance metrics, dashboards, and documentation
Introduce an event-driven monitoring infrastructure for OpenCue, enabling real-time and historical analysis of render farm activity with a fully integrated monitoring stack and comprehensive documentation.
This change adds a Kafka + Elasticsearch pipeline for collecting and storing job, layer, frame, and host lifecycle events, while integrating Prometheus and Grafana for live dashboards and operational visibility.
Core features:
- monitoring.proto with job / layer / frame / host lifecycle events
- KafkaEventPublisher for asynchronous event publishing
- Event publishing hooked into FrameCompleteHandler and HostReportHandler
- PrometheusMetricsCollector with cue_elasticsearch_index_queue_size metric (ElasticsearchClient)
- MonitoringInterface gRPC service for historical query access
- applicationContext-monitoring.xml Spring configuration
- TestAppConfig updated to include monitoring context

Monitoring stack infrastructure:

- docker-compose.monitoring-full.yml including Zookeeper, Kafka, Elasticsearch, Kibana, Prometheus, Grafana

Documentation and examples:

New & updated sandbox utilities:

- sandbox/monitor_events.py: Example Kafka consumer (enhanced)
- sandbox/load_test_jobs.py: Test data generator for monitoring validation

Configuration (opt-in):

Kafka:
- monitoring.kafka.enabled
- monitoring.kafka.bootstrap.servers

Elasticsearch:
- monitoring.elasticsearch.enabled
- monitoring.elasticsearch.host

Enables:
- Extended memory prediction beyond the 3-day pycue API limit
- Real-time farm monitoring via Prometheus/Grafana dashboards
- Historical job/frame/layer analytics via Elasticsearch