Commit 56e7416

[docs] Update monitoring documentation
Update all monitoring documentation to reflect the decoupled architecture where Elasticsearch indexing is handled by the standalone Rust kafka-es-indexer service instead of Cuebot.

Documentation changes:
- Add kafka-es-indexer to component tables and architecture diagrams
- Update configuration examples with correct CLI args and env vars
- Remove stale Prometheus metrics (cue_monitoring_events_*, cue_elasticsearch_*)
- Remove opencue.host.reports topic (removed from pipeline)
- Replace Cuebot Elasticsearch config with kafka-es-indexer config
- Update alert examples to use existing metrics

Files updated:
- docs/_docs/concepts/render-farm-monitoring.md
- docs/_docs/developer-guide/monitoring-development.md
- docs/_docs/getting-started/deploying-monitoring.md
- docs/_docs/quick-starts/quick-start-monitoring.md
- docs/_docs/reference/monitoring-reference.md
- docs/_docs/tutorials/monitoring-tutorial.md
- docs/_docs/user-guides/render-farm-monitoring-guide.md
- rust/README.md - Add kafka-es-indexer to crates list
- sandbox/README.md - Add event streaming monitoring stack section
- opencue_monitoring images: opencue_monitoring_elasticsearch_kibana_dev_tools.png, opencue_monitoring_grafana_chart.png, opencue_monitoring_prometheus.png
1 parent 8f39cc3 commit 56e7416

12 files changed

+291
-145
lines changed

docs/_docs/concepts/render-farm-monitoring.md

Lines changed: 51 additions & 13 deletions
@@ -29,7 +29,36 @@ The monitoring system is built on an event-driven architecture that captures lif
2929

3030
## Architecture
3131

32-
The monitoring system consists of three main components:
32+
The monitoring system uses a decoupled architecture:
33+
34+
```
35+
┌────────────────────────────────────────────────────────────────────────────┐
36+
│ Cuebot │
37+
│ │
38+
│ ┌─────────────┐ ┌─────────────────────┐ │
39+
│ │ Service │────>│ KafkaEventPublisher │──────────> Kafka │
40+
│ │ Layer │ └─────────────────────┘ │ │
41+
│ └─────────────┘ │ │ │
42+
│ │ v │ │
43+
│ └─────────────>┌──────────────┐ │ │
44+
│ │ Prometheus │ │ │
45+
│ │ Metrics │ │ │
46+
│ └──────────────┘ │ │
47+
└────────────────────────────────────────────────────────────│───────────────┘
48+
49+
v
50+
┌────────────────────────────────────────────────────────────────────────────┐
51+
│ kafka-es-indexer (Rust) │
52+
│ │
53+
│ ┌───────────────────┐ ┌─────────────────────────┐ │
54+
│ │ Kafka Consumer │────────>│ Elasticsearch Client │ │
55+
│ │ (rdkafka) │ │ (bulk indexing) │ │
56+
│ └───────────────────┘ └─────────────────────────┘ │
57+
│ │ │
58+
└────────────────────────────────────────────│───────────────────────────────┘
59+
v
60+
Elasticsearch
61+
```
3362

3463
### Event publishing (Kafka)
3564

@@ -49,7 +78,7 @@ Events are published asynchronously to avoid impacting render farm performance.
4978

5079
### Historical storage (Elasticsearch)
5180

52-
The Kafka event consumer indexes events into Elasticsearch for long-term storage and analysis. This enables:
81+
A standalone Rust-based service (`kafka-es-indexer`) consumes events from Kafka and indexes them into Elasticsearch for long-term storage and analysis. This decoupled architecture enables:
5382

5483
- **Historical queries**: Search for jobs, frames, or hosts by any attribute
5584
- **Trend analysis**: Track metrics over time (job completion rates, failure patterns)
@@ -77,11 +106,6 @@ Cuebot exposes a `/metrics` endpoint compatible with Prometheus. Key metrics inc
77106
- `cue_booking_waiting_total` - Tasks waiting in booking queue
78107
- `cue_report_executed_total` - Host reports processed
79108

80-
**Monitoring system metrics:**
81-
- `cue_monitoring_events_published_total` - Events published to Kafka
82-
- `cue_monitoring_events_dropped_total` - Events dropped due to queue overflow
83-
- `cue_monitoring_event_queue_size` - Current event queue size
84-
85109
## Event types
86110

87111
### Job events
@@ -118,22 +142,36 @@ Host events monitor render node status:
118142

119143
## Configuration
120144

121-
The monitoring system is configured through Cuebot properties:
145+
### Cuebot configuration
146+
147+
Enable Kafka and Prometheus through Cuebot properties:
122148

123149
```properties
124150
# Kafka event publishing
125151
monitoring.kafka.enabled=true
126152
monitoring.kafka.bootstrap.servers=kafka:9092
127153

128-
# Elasticsearch storage
129-
monitoring.elasticsearch.enabled=true
130-
monitoring.elasticsearch.host=elasticsearch
131-
monitoring.elasticsearch.port=9200
132-
133154
# Prometheus metrics
134155
metrics.prometheus.collector=true
135156
```
136157

158+
### kafka-es-indexer configuration
159+
160+
The standalone Rust indexer (`rust/crates/kafka-es-indexer/`) is configured via environment variables or CLI arguments:
161+
162+
```bash
163+
# Using environment variables
164+
export KAFKA_BOOTSTRAP_SERVERS=kafka:9092
165+
export ELASTICSEARCH_URL=http://elasticsearch:9200
166+
kafka-es-indexer
167+
168+
# Or using CLI arguments
169+
kafka-es-indexer \
170+
--kafka-servers kafka:9092 \
171+
--elasticsearch-url http://elasticsearch:9200 \
172+
--index-prefix opencue
173+
```
174+
137175
Each component can be enabled or disabled independently based on your infrastructure needs.
138176

139177
## What's next?

docs/_docs/developer-guide/monitoring-development.md

Lines changed: 106 additions & 59 deletions
@@ -19,47 +19,62 @@ This guide explains how to extend, customize, and develop against the OpenCue mo
1919

2020
## Architecture overview
2121

22-
The monitoring system is implemented in Cuebot and consists of:
22+
The monitoring system uses a decoupled architecture with Cuebot publishing events to Kafka and a standalone Rust-based indexer consuming events for Elasticsearch storage:
2323

2424
```
2525
┌────────────────────────────────────────────────────────────────────────────┐
2626
│ Cuebot │
2727
│ │
2828
│ ┌─────────────┐ ┌─────────────────────┐ │
29-
│ │ Service │────>│ KafkaEventPublisher │───────> Kafka │
30-
│ │ Layer │ └─────────────────────┘ │ │
31-
│ └─────────────┘ │ │ │
32-
│ │ │ v │
33-
│ │ v ┌───────────────────┐ │
34-
│ │ ┌──────────────┐ │ KafkaEventConsumer│ │
35-
│ └─────────────>│ Prometheus │ └───────────────────┘ │
36-
│ │ Metrics │ │ │
37-
│ └──────────────┘ v │
38-
│ ┌───────────────────┐ │
39-
│ │ ElasticsearchClient│ │
40-
│ └───────────────────┘ │
41-
│ │ │
42-
└─────────────────────────────────────────────────────────│──────────────────┘
43-
v
44-
Elasticsearch
29+
│ │ Service │────>│ KafkaEventPublisher │──────────> Kafka │
30+
│ │ Layer │ └─────────────────────┘ │ │
31+
│ └─────────────┘ │ │ │
32+
│ │ v │ │
33+
│ └─────────────>┌──────────────┐ │ │
34+
│ │ Prometheus │ │ │
35+
│ │ Metrics │ │ │
36+
│ └──────────────┘ │ │
37+
└────────────────────────────────────────────────────────────│───────────────┘
38+
39+
v
40+
┌────────────────────────────────────────────────────────────────────────────┐
41+
│ kafka-es-indexer (Rust) │
42+
│ │
43+
│ ┌───────────────────┐ ┌─────────────────────────┐ │
44+
│ │ Kafka Consumer │────────>│ Elasticsearch Client │ │
45+
│ │ (rdkafka) │ │ (bulk indexing) │ │
46+
│ └───────────────────┘ └─────────────────────────┘ │
47+
│ │ │
48+
└────────────────────────────────────────────│───────────────────────────────┘
49+
v
50+
Elasticsearch
4551
```
4652

4753
**Data flow:**
4854
1. **Service Layer** (e.g., FrameCompleteHandler, HostReportHandler) generates events and calls KafkaEventPublisher
4955
2. **KafkaEventPublisher** serializes events as JSON and publishes them to Kafka topics
50-
3. **KafkaEventConsumer** subscribes to Kafka topics and receives published events
51-
4. **KafkaEventConsumer** uses **ElasticsearchClient** to index events into Elasticsearch for historical storage
56+
3. **kafka-es-indexer** (standalone Rust service) consumes events from Kafka topics
57+
4. **kafka-es-indexer** bulk indexes events into Elasticsearch for historical storage
5258
5. **Prometheus Metrics** are updated directly by the Service Layer and KafkaEventPublisher (for queue metrics)
5359

54-
### Key classes
60+
### Key components
5561

56-
| Class | Location | Purpose |
57-
|-------|----------|---------|
62+
| Component | Location | Purpose |
63+
|-----------|----------|---------|
5864
| `KafkaEventPublisher` | `com.imageworks.spcue.monitoring` | Publishes events to Kafka |
59-
| `KafkaEventConsumer` | `com.imageworks.spcue.monitoring` | Consumes events from Kafka for ES indexing |
60-
| `ElasticsearchClient` | `com.imageworks.spcue.monitoring` | Writes events to Elasticsearch |
6165
| `MonitoringEventBuilder` | `com.imageworks.spcue.monitoring` | Builds event payloads |
6266
| `PrometheusMetricsCollector` | `com.imageworks.spcue` | Exposes Prometheus metrics |
67+
| `kafka-es-indexer` | `rust/crates/kafka-es-indexer/` | Consumes Kafka, indexes to Elasticsearch |
68+
69+
### Why a separate indexer?
70+
71+
The Kafka-to-Elasticsearch indexer is implemented as a standalone Rust service rather than within Cuebot for several reasons:
72+
73+
- **Decoupling**: Cuebot focuses on core scheduling; indexing is a separate concern
74+
- **Scalability**: The indexer can be scaled independently from Cuebot
75+
- **Reliability**: Kafka buffering ensures events are not lost if Elasticsearch is temporarily unavailable
76+
- **Performance**: Rust provides efficient resource usage for high-throughput event processing
77+
- **Operational flexibility**: The indexer can be updated, restarted, or replayed without affecting Cuebot
6378

6479
## Adding new event types
6580

@@ -201,6 +216,8 @@ public static void setActiveJobs(String show, String state, int count) {
201216

202217
## Customizing Elasticsearch indexing
203218

219+
The `kafka-es-indexer` service handles all Elasticsearch indexing. It automatically routes events to indices based on the Kafka topic name.
220+
204221
### Index templates
205222

206223
Create custom index templates for new event types. Note that events use snake_case field names and include a `header` object:
@@ -234,27 +251,18 @@ Create custom index templates for new event types. Note that events use snake_ca
234251
}
235252
```
236253

237-
### Custom indexing logic
254+
### Index naming convention
238255

239-
Extend `ElasticsearchClient` to add custom indexing:
256+
The kafka-es-indexer creates daily indices using the pattern:
240257

241-
```java
242-
// ElasticsearchClient.java
243-
public void indexJobAdminEvent(MonitoringEvent event) {
244-
String indexName = "opencue-job-admin-" +
245-
LocalDate.now().format(DateTimeFormatter.ISO_DATE);
246-
247-
Map<String, Object> document = new HashMap<>();
248-
document.put("eventType", event.getEventType().name());
249-
document.put("timestamp", event.getTimestamp());
250-
document.put("jobId", event.getJobId());
251-
document.put("jobName", event.getJobName());
252-
document.putAll(event.getMetadataMap());
253-
254-
indexDocument(indexName, document);
255-
}
258+
```
259+
{topic-name-converted}-YYYY-MM-DD
256260
```
257261

262+
For example:
263+
- `opencue.job.events` → `opencue-job-events-2024-11-29`
264+
- `opencue.frame.events` → `opencue-frame-events-2024-11-29`
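
The naming convention can be sketched in a few lines of shell. This is only an illustration: `topic_to_index` is a hypothetical helper, not part of kafka-es-indexer, and it assumes the conversion amounts to replacing dots with hyphens and appending a date. Which date the indexer actually uses (event timestamp or wall clock, and in which time zone) is not stated here, so the UTC fallback below is an assumption.

```bash
# Hypothetical helper: derive a daily index name from a Kafka topic name.
# Assumes "replace dots with hyphens, then append YYYY-MM-DD".
topic_to_index() {
    topic="$1"
    # Use the date passed as the second argument, or today's UTC date (assumption).
    day="${2:-$(date -u +%Y-%m-%d)}"
    printf '%s-%s\n' "$(printf '%s' "$topic" | tr '.' '-')" "$day"
}

topic_to_index opencue.job.events 2024-11-29
topic_to_index opencue.frame.events 2024-11-29
```

With the date given explicitly, the two calls reproduce the example index names listed above.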
265+
258266
## Testing
259267

260268
### Unit testing event builders
@@ -312,15 +320,46 @@ public class KafkaEventPublisherIntegrationTest {
312320
| `monitoring.kafka.linger.ms` | `100` | Time to wait before sending batch |
313321
| `monitoring.kafka.acks` | `1` | Required acknowledgments |
314322

315-
### Elasticsearch configuration
323+
### kafka-es-indexer configuration
316324

317-
| Property | Default | Description |
318-
|----------|---------|-------------|
319-
| `monitoring.elasticsearch.enabled` | `false` | Enable ES storage |
320-
| `monitoring.elasticsearch.host` | `localhost` | ES host |
321-
| `monitoring.elasticsearch.port` | `9200` | ES port |
322-
| `monitoring.elasticsearch.scheme` | `http` | Connection scheme |
323-
| `monitoring.elasticsearch.index.prefix` | `opencue` | Index name prefix |
325+
The kafka-es-indexer is configured via command-line arguments, environment variables, or a YAML config file:
326+
327+
| CLI Argument | Env Variable | Default | Description |
328+
|--------------|--------------|---------|-------------|
329+
| `--kafka-servers` | `KAFKA_BOOTSTRAP_SERVERS` | `localhost:9092` | Kafka broker addresses |
330+
| `--kafka-group-id` | `KAFKA_GROUP_ID` | `opencue-elasticsearch-indexer` | Consumer group ID |
331+
| `--elasticsearch-url` | `ELASTICSEARCH_URL` | `http://localhost:9200` | Elasticsearch URL |
332+
| `--index-prefix` | `ELASTICSEARCH_INDEX_PREFIX` | `opencue` | Elasticsearch index prefix |
333+
| `--log-level` | `LOG_LEVEL` | `info` | Log level (debug, info, warn, error) |
334+
| `--config` | - | - | Path to YAML config file |
335+
336+
The indexer automatically subscribes to all OpenCue Kafka topics:
337+
- `opencue.job.events`
338+
- `opencue.layer.events`
339+
- `opencue.frame.events`
340+
- `opencue.host.events`
341+
- `opencue.proc.events`
342+
343+
Example with CLI arguments:
344+
345+
```bash
346+
kafka-es-indexer \
347+
--kafka-servers kafka:9092 \
348+
--kafka-group-id opencue-elasticsearch-indexer \
349+
--elasticsearch-url http://elasticsearch:9200 \
350+
--index-prefix opencue \
351+
--log-level info
352+
```
353+
354+
Example with environment variables:
355+
356+
```bash
357+
export KAFKA_BOOTSTRAP_SERVERS=kafka:9092
358+
export KAFKA_GROUP_ID=opencue-elasticsearch-indexer
359+
export ELASTICSEARCH_URL=http://elasticsearch:9200
360+
export ELASTICSEARCH_INDEX_PREFIX=opencue
361+
kafka-es-indexer
362+
```
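
When checking a deployment, it can help to see which values apply once the defaults from the table are taken into account. The sketch below is a hypothetical wrapper, not something shipped with the indexer. It only echoes the environment variables with the documented defaults filled in, and it treats an empty variable the same as an unset one, which may not match how the indexer itself behaves.

```bash
# Hypothetical helper: print the effective indexer settings, substituting the
# documented default for any variable that is unset or empty.
effective_indexer_config() {
    printf 'KAFKA_BOOTSTRAP_SERVERS=%s\n' "${KAFKA_BOOTSTRAP_SERVERS:-localhost:9092}"
    printf 'KAFKA_GROUP_ID=%s\n' "${KAFKA_GROUP_ID:-opencue-elasticsearch-indexer}"
    printf 'ELASTICSEARCH_URL=%s\n' "${ELASTICSEARCH_URL:-http://localhost:9200}"
    printf 'ELASTICSEARCH_INDEX_PREFIX=%s\n' "${ELASTICSEARCH_INDEX_PREFIX:-opencue}"
    printf 'LOG_LEVEL=%s\n' "${LOG_LEVEL:-info}"
}

effective_indexer_config
```

Running it in the same shell that launches `kafka-es-indexer` shows at a glance whether an override was exported as intended.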
324363

325364
### Prometheus configuration
326365

@@ -331,23 +370,14 @@ public class KafkaEventPublisherIntegrationTest {
331370

332371
## Debugging
333372

334-
### Enable debug logging
373+
### Enable debug logging in Cuebot
335374

336375
Add to `log4j2.xml`:
337376

338377
```xml
339378
<Logger name="com.imageworks.spcue.monitoring" level="DEBUG"/>
340379
```
341380

342-
### Check event queue status
343-
344-
Monitor the event queue via metrics:
345-
346-
```promql
347-
cue_monitoring_event_queue_size
348-
cue_monitoring_events_dropped_total
349-
```
350-
351381
### Verify Kafka connectivity
352382

353383
```bash
@@ -360,6 +390,23 @@ kafka-consumer-groups --bootstrap-server kafka:9092 \
360390
--group opencue-elasticsearch-indexer --describe
361391
```
362392

393+
### Debugging kafka-es-indexer
394+
395+
```bash
396+
# View indexer logs
397+
docker logs opencue-kafka-es-indexer
398+
399+
# Check indexer help
400+
docker exec opencue-kafka-es-indexer kafka-es-indexer --help
401+
402+
# Verify Elasticsearch indices are being created
403+
curl -s "http://localhost:9200/_cat/indices/opencue-*?v"
404+
405+
# Check event counts in Elasticsearch
406+
curl -s "http://localhost:9200/opencue-job-events-*/_count"
407+
curl -s "http://localhost:9200/opencue-frame-events-*/_count"
408+
```
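
The `_count` calls return a small JSON body, and for scripted checks it may be convenient to reduce that to a bare number. The following is a minimal sketch: `extract_count` is a hypothetical helper, and the sample body is illustrative data in the shape the `_count` API returns rather than output captured from a real cluster.

```bash
# Hypothetical helper: pull the numeric "count" field out of a _count response body.
extract_count() {
    printf '%s\n' "$1" |
        sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Illustrative sample body; against a live cluster, pass
# "$(curl -s "http://localhost:9200/opencue-job-events-*/_count")" instead.
sample='{"count":42,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}'
extract_count "$sample"
# prints: 42
```

A `sed` pattern keeps the sketch dependency-free; a JSON-aware tool such as `jq` would be the more robust choice where it is available.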
409+
363410
## Best practices
364411

365412
### Event design
