background
We are heavy users of log alerts, and internally we have spent a lot of time and effort on log alerts.
We referenced Grafana's code and drew on Datadog's functionality to completely overhaul the alerting process.
enhancement
execute alerts asynchronously
Currently, if any log alert takes too long to run, it will block the execution of other alerts.
export default async () => {
  const now = new Date();
  const alerts = await getAlerts();
  logger.info(`Going to process ${alerts.length} alerts`);
  await Promise.all(alerts.map(alert => processAlert(now, alert)));
};
Modified to:
- Ensure that only one alert ID is executed at a time via runningAlerts.
- If an alert with the same ID has not finished executing, skip it.
Refer to Grafana: https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/schedule.go#L420
const runningAlerts = new Map<string, Date>();

export default async () => {
  const now = new Date();
  const alerts = await getAlerts();
  logger.info(`Going to process ${alerts.length} alerts`);
  alerts.forEach(alert => {
    const alertId = alert.id;
    if (runningAlerts.has(alertId)) {
      logger.error({
        message: 'Tick dropped because alert rule evaluation is too slow',
        alert_id: alertId,
        alertname: alert.name,
        time: now,
        lastEvaluation: runningAlerts.get(alertId),
      });
      scheduleRuleEvaluationsMissedTotal.inc({
        alertname: alert.name ?? 'unknown',
      });
      return;
    }
    runningAlerts.set(alertId, now);
    void processAlert(now, alert).finally(() => runningAlerts.delete(alertId));
  });
};
add retry
During alert execution, if either a ClickHouse query or a webhook request fails, the exception should be propagated upwards.
The retry is then initiated at the top level.
See https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/alert_rule.go#L285
const attempt = async (retries: number): Promise<void> => {
  try {
    await processAlert(now, alert);
  } catch (err) {
    logger.error({
      message: 'Failed to evaluate rule',
      retries,
      error: serializeError(err),
      alertname: alert.name,
      alertId,
    });
    if (retries < MAX_ATTEMPTS) {
      await new Promise<void>(resolve =>
        setTimeout(resolve, RETRY_DELAY_MS),
      );
      return attempt(retries + 1);
    }
    // no more retries
    return;
  }
};

void attempt(1).finally(() => {
  EvalDuration.observe(
    (new Date().getTime() - evalStart.getTime()) / 1000,
  );
  runningAlerts.delete(alertId);
});
execute alerts evenly
Spread alert executions evenly across a 1-minute window to reduce database load.
Refer to https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/schedule.go#L439
const alertId = alert.id;
const delayMs = (fnv.hash(alertId) % 60) * 1000;
setTimeout(() => {
  void processAlert(now, alert);
}, delayMs);
// Whether an alert runs at 00:01:30 or 00:01:59, it still queries the logs in [00:00:00, 00:01:00).
do not query historical data
In the old implementation, an alert would query data from its last execution time up to the current time.
For example, if a machine has been down for a month, querying that month of logs would trigger many meaningless alerts:
OK -> alert -> recovery -> alert.
Following the Grafana implementation, this is changed to query only the most recent window (the past 1 minute), as sketched below.
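A minimal sketch of the fixed-window calculation, assuming a 1-minute evaluation interval; the getEvaluationWindow helper and WINDOW_MS constant are hypothetical names, not the final implementation:

// Hypothetical sketch: always evaluate the most recent completed 1-minute window,
// regardless of when the previous evaluation ran.
const WINDOW_MS = 60 * 1000;

function getEvaluationWindow(now: Date): { startTime: Date; endTime: Date } {
  // Align the window end to the start of the current minute.
  const endMs = Math.floor(now.getTime() / WINDOW_MS) * WINDOW_MS;
  return {
    startTime: new Date(endMs - WINDOW_MS), // e.g. 00:00:00 (inclusive)
    endTime: new Date(endMs), // e.g. 00:01:00 (exclusive)
  };
}

At 00:01:30 or 00:01:59 this returns [00:00:00, 00:01:00), so a machine that has been down for a month evaluates only the latest minute instead of replaying a month of history.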
observability
Implemented according to https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/metrics/scheduler.go.
const EvaluationMissed = new Prometheus.Counter({
  name: 'schedule_rule_evaluations_missed_total',
  help: 'The total number of rule evaluations missed due to a slow rule evaluation or schedule problem.',
  labelNames: ['alertname'],
});

const EvalDuration = new Prometheus.Summary({
  name: 'rule_evaluation_duration_seconds',
  help: 'The time to evaluate a rule.',
  labelNames: ['type'],
  percentiles: [0.01, 0.05, 0.5, 0.9, 0.99],
});

const EvalRetry = new Prometheus.Counter({
  name: 'rule_evaluation_retry_total',
  help: 'The total number of rule evaluation retries.',
  labelNames: ['alertname'],
});

const EvalFailures = new Prometheus.Counter({
  name: 'rule_evaluation_failures_total',
  help: 'The total number of rule evaluation failures.',
  labelNames: ['alertname'],
});
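For reference, a hedged sketch of where the retry and failure counters could be incremented from the retry loop above; the helper name and call site are assumptions, not the final implementation:

// Assumption: called from the catch block of attempt() in the retry section above.
function recordEvaluationError(alertname: string, retries: number) {
  if (retries < MAX_ATTEMPTS) {
    // Another attempt will be scheduled.
    EvalRetry.inc({ alertname });
  } else {
    // No retries left; the evaluation ultimately failed.
    EvalFailures.inc({ alertname });
  }
}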
run alerting as a separate service
Currently, the API, UI, and AlertTask all run in a single deployment.
The AlertTask will be split into a separate deployment to reduce interference.
The alerting service then runs in multiple pods, with a leader elected via MongoDB so that only one pod actually evaluates alerts at a time.
MongoDB lock document:
lockValue: string; // Unique identifier of the lock holder (pod name)
acquiredAt: Date; // Lock acquisition time
expiresAt: Date; // Lock expiration time; the record is automatically cleaned up via a TTL index
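A minimal sketch of the lock document and its TTL index, assuming the Node.js MongoDB driver; the AlertTaskLock interface, the alertTaskLock collection name, the database name, and the helper name are hypothetical:

import { MongoClient } from 'mongodb';

interface AlertTaskLock {
  _id: string; // fixed lock key, e.g. 'alert-task-leader' (hypothetical)
  lockValue: string; // unique identifier of the lock holder (pod name)
  acquiredAt: Date; // lock acquisition time
  expiresAt: Date; // lock expiration time
}

// TTL index: MongoDB deletes the document shortly after expiresAt passes,
// so a crashed leader releases the lock automatically.
async function ensureLockIndexes(client: MongoClient) {
  await client
    .db('hyperdx') // database name is an assumption
    .collection<AlertTaskLock>('alertTaskLock')
    .createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 });
}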
Acquiring the lock (see the sketch after this list):
- Query MongoDB to see whether anyone currently holds the lock.
- If the current service holds the lock, return true and extend the lock's expiration time.
- If another service holds the lock, return false.
- If no service holds the lock, create a record and set the expiration time to 2 minutes.
- If error code 11000 (duplicate key) is returned, a concurrent service acquired the lock first; return false.
- Any other error is propagated to the caller.
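A hedged sketch of the acquire-lock flow described above, reusing the hypothetical AlertTaskLock collection from the previous sketch; names and the 2-minute TTL constant are assumptions:

const LOCK_ID = 'alert-task-leader'; // hypothetical fixed lock key
const LOCK_TTL_MS = 2 * 60 * 1000; // 2-minute expiration

async function tryAcquireLock(
  client: MongoClient,
  podName: string,
): Promise<boolean> {
  const locks = client.db('hyperdx').collection<AlertTaskLock>('alertTaskLock');
  const now = new Date();
  const expiresAt = new Date(now.getTime() + LOCK_TTL_MS);

  const existing = await locks.findOne({ _id: LOCK_ID });
  if (existing) {
    if (existing.lockValue === podName) {
      // We already hold the lock: extend its expiration.
      await locks.updateOne({ _id: LOCK_ID }, { $set: { expiresAt } });
      return true;
    }
    // Another pod holds the lock.
    return false;
  }

  try {
    // No one holds the lock: try to create it.
    await locks.insertOne({
      _id: LOCK_ID,
      lockValue: podName,
      acquiredAt: now,
      expiresAt,
    });
    return true;
  } catch (err: any) {
    if (err?.code === 11000) {
      // Duplicate key: a concurrent pod created the lock first.
      return false;
    }
    throw err; // any other error is propagated
  }
}

The scheduler entry point would call tryAcquireLock on each tick and skip alert evaluation whenever it returns false.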