docs: add prometheus + grafana deployment guide #1019

31 changes: 31 additions & 0 deletions config/observability/prometheus/rbac.yaml
@@ -0,0 +1,31 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  - /debug/pprof/*
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-gateway-sa-metrics-reader
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: inference-gateway-sa-metrics-reader-role-binding
  namespace: monitoring
Contributor

ClusterRoleBinding is a cluster-scoped resource, so remove the namespace field (to avoid confusion).

subjects:
- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
Contributor

Most of these resources are already covered in https://gateway-api-inference-extension.sigs.k8s.io/guides/metrics-and-observability/#scrape-metrics-pprof-profiles. To avoid confusion and misconfiguration, this file should only contain ServiceAccount and an updated ClusterRoleBinding that includes an additional subjects entry for the new ServiceAccount:

- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: monitoring

Maybe use kubectl to patch the ClusterRoleBinding with this entry, so this file only contains the ServiceAccount.
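
The patch suggested here could look roughly like the following sketch. It only writes a JSON Patch file and prints the command that would apply it; nothing is sent to a cluster. The binding name is an assumption taken from this PR's rbac.yaml and may differ from the one created by the linked guide, and the file name is arbitrary.

```shell
# Assumption: the existing ClusterRoleBinding uses the name from this PR's
# rbac.yaml; confirm with `kubectl get clusterrolebinding` before patching.
BINDING_NAME="inference-gateway-sa-metrics-reader-role-binding"

# JSON Patch (RFC 6902): the path "/subjects/-" appends to the subjects list.
cat > add-monitoring-subject.json <<'EOF'
[
  {
    "op": "add",
    "path": "/subjects/-",
    "value": {
      "kind": "ServiceAccount",
      "name": "inference-gateway-sa-metrics-reader",
      "namespace": "monitoring"
    }
  }
]
EOF

# Print the command instead of running it, so the sketch is safe to try.
echo "kubectl patch clusterrolebinding ${BINDING_NAME} --type=json --patch-file add-monitoring-subject.json"
```

Appending with a JSON Patch leaves the subjects already in the binding untouched, which a merge patch on the `subjects` list would not.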

  name: inference-gateway-metrics-reader
---
27 changes: 27 additions & 0 deletions config/observability/prometheus/values.yaml
@@ -0,0 +1,27 @@
serviceAccounts:
  server:
    create: false
    name: inference-gateway-sa-metrics-reader

extraScrapeConfigs: |
  - job_name: 'inference-extension-epp'
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    scrape_interval: 5s
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: .*-epp$
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9090"
  - job_name: vllm
    scrape_interval: 5s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: vllm-llama3-8b-instruct
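
For readers checking which targets the `keep` rules in this values file retain: Prometheus anchors relabel regexes at both ends, so a value is kept only when the whole label value matches. The sketch below approximates that with `grep -Ex` (whole-line matching). Prometheus uses RE2 rather than POSIX ERE, but the two agree for patterns this simple. The service names and the `keeps` helper are hypothetical and exist only for illustration.

```shell
# keeps VALUE REGEX: succeed when REGEX matches the entire VALUE, mimicking
# the fully anchored matching Prometheus applies to relabel regexes.
keeps() {
  printf '%s\n' "$1" | grep -Eqx -- "$2"
}

# Hypothetical service names checked against the EPP job's first rule.
for svc in vllm-llama3-8b-instruct-epp vllm-llama3-8b-instruct epp-proxy; do
  if keeps "$svc" '.*-epp$'; then
    echo "keep $svc"
  else
    echo "drop $svc"
  fi
done

# The port rule is anchored too: "9090" matches, "19090" does not.
for port in 9090 19090; do
  if keeps "$port" '9090'; then
    echo "keep port $port"
  else
    echo "drop port $port"
  fi
done
```

Only `vllm-llama3-8b-instruct-epp` and port `9090` are reported as kept, so a service whose name merely contains `epp` is not scraped by the `inference-extension-epp` job.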
60 changes: 60 additions & 0 deletions site-src/guides/metrics-and-observability.md
@@ -126,6 +126,66 @@ PROFILE_NAME=heap
curl -H "Authorization: Bearer $TOKEN" localhost:9090/debug/pprof/$PROFILE_NAME -o profile.out
go tool pprof -png profile.out
```
## Setting Up Grafana + Prometheus

### Grafana

A simple Grafana deployment can be done with the following commands:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana --namespace monitoring --create-namespace
```

Get the Grafana URL to visit by running these commands in the same shell:

```bash
export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=grafana" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace monitoring port-forward $POD_NAME 3000
```
Comment on lines +142 to +145
Contributor

Simplify with a single command that port-forwards the grafana deployment:

kubectl -n monitoring port-forward deploy/grafana 3000

Since Grafana is not configured to use the default admin/admin login info, add a step to get the password from the secret. For example:

kubectl -n monitoring get secret grafana \
  -o go-template='{{ index .data "admin-password" | base64decode }}'

Add a note such as "You can now access the Grafana UI from http://127.0.0.1"


### Prometheus

Two types of Prometheus deployment are currently documented:

1. Self-hosted, using the Prometheus Helm chart
2. Google Managed Prometheus

=== "Self-Hosted"

Create Necessary ServiceAccount and RBAC Resources:
Contributor

nit: s/Necessary/the necessary/ and s/Resources/resources/

or if you update rbac.yaml to only include the SA:

s/Create Necessary ServiceAccount and RBAC Resources:/Create the necessary ServiceAccount resource:/

and then include the kubectl patch command to patch the ClusterRoleBinding with a subject for the monitoring ServiceAccount.


```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/observability/prometheus/rbac.yaml
```

Add the prometheus-community helm repository:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
```

Deploy the prometheus helm chart using this command:
```bash
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/observability/prometheus/values.yaml
```

You can add the Prometheus data source to Grafana by following [this guide](https://grafana.com/docs/grafana/latest/administration/data-source-management/).
By default, the Prometheus server host is `http://prometheus-server`.
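
As an alternative to adding the data source through the UI, it can be provisioned through the Grafana chart's `datasources` values. The sketch below only writes a values file. It assumes Grafana and Prometheus both run in the `monitoring` namespace, so the short service name resolves; the file name and the data source name are arbitrary choices, not something this guide prescribes.

```shell
# Assumption: Grafana and Prometheus share the "monitoring" namespace, so
# http://prometheus-server resolves from the Grafana pod.
cat > grafana-datasource-values.yaml <<'EOF'
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server
        isDefault: true
EOF
```

The file could then be applied with something like `helm upgrade grafana grafana/grafana --namespace monitoring -f grafana-datasource-values.yaml`. Review the effect on the existing release before running it.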

Note that the provided values file is minimal and works as-is after following the [Getting Started Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/). You might need to modify it for your environment.

=== "Google Managed"

If you run the inference gateway with [Google Managed Prometheus](https://cloud.google.com/stackdriver/docs/managed-prometheus), please follow the [instructions](https://cloud.google.com/stackdriver/docs/managed-prometheus/query)
to configure Google Managed Prometheus as the data source for the Grafana dashboard.

## Load Inference Extension dashboard into Grafana

Please follow the [Grafana instructions](https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/import-dashboards/) to load the dashboard JSON.
The dashboard can be found here: [Grafana Dashboard](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/tools/dashboards/inference_gateway.json).

## Prometheus Alerts
