Notify VM Creation Failure by GunaKKIBM · Pull Request #75 · IBM/power-access-cloud

GunaKKIBM · 2026-04-16T03:35:07Z

This PR notifies the user and admin via email if VM creation fails or the VM is in an error state.

GunaKKIBM · 2026-04-16T03:36:52Z

 	}
+
+	setupLog.Info("Attempting to connect to MongoDB...")
+	// TODO: Should we really fail, if connection to mongoDB fails?


Need suggestions here.
Should we exit if the MongoDB connection fails? Since the main functionality here is to provision and delete VMs, we could just log the failure and continue. Notifications would not be sent in that case.

mayuka-c · 2026-04-16T05:05:27Z

 		os.Exit(1)
 	}
+
+	setupLog.Info("Attempting to connect to MongoDB...")


Any reason for connecting to MongoDB at controller startup?
I feel we can make the DB connection during reconcile if we want to store some events on demand.

mayuka-c · 2026-04-16T05:08:16Z

+		if err := scope.NotifyServiceCreationFailure(errorMsg); err != nil {
+			scope.Logger.Error(err, "failed to create failure notification event")
+		}


This might produce events on every reconcile if the VM is in an error state, right?

You might want to make it idempotent: if the event for the given service ID and name already exists, there is no need to update the event in the DB.

I feel we can have an in-memory cache for now (maybe later an external cache like Redis), store the events there, and then periodically, maybe every 10-15 minutes, dump them into the DB. Without this you might end up making a lot of DB calls, I believe.

This is something that is supposed to be notified immediately. Given that the mail is sent based on events from the DB, I think caching here wouldn't be right.

Maybe it could be more like a rate limiter: for a given VM instance, if the failure is seen more than x times in n seconds, we can send the event.

Makes sense

There was one more suggestion that @anshuman-agarwala made today: we can get the older VM state and the current VM state. If the older state was error and the newer one is also error, do not send. If it transitioned from anything else to error (meaning a first-time error), then notify.

Please do check with him regarding this.

@anshuman-agarwala, I checked the PVM_instance_spec and the cloud docs as well. Neither clearly describes a previous health state or the VM's older state. The only field that relates closely here is PVMInstanceHealth.
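
If the cloud API exposes only the current health, the controller would have to remember the last state it observed in order to detect a transition into error. A minimal sketch of that bookkeeping follows. `stateTracker` and `Observe` are hypothetical, and the state strings "ACTIVE" and "ERROR" are placeholders rather than values taken from the PowerVS API.

```go
package main

import (
	"fmt"
	"sync"
)

// stateTracker remembers the last health state observed for each VM.
type stateTracker struct {
	mu   sync.Mutex
	last map[string]string
}

func newStateTracker() *stateTracker {
	return &stateTracker{last: make(map[string]string)}
}

// Observe stores the current state and reports whether the VM has just
// entered the error state, that is, the previous observation was not an error.
func (s *stateTracker) Observe(vmID, current string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	prev := s.last[vmID]
	s.last[vmID] = current
	return current == "ERROR" && prev != "ERROR"
}

func main() {
	tr := newStateTracker()
	for _, st := range []string{"ACTIVE", "ERROR", "ERROR", "ACTIVE", "ERROR"} {
		fmt.Println(st, "notify:", tr.Observe("vm-1", st))
	}
}
```

Like the in-memory rate limiter, this map is lost on restart. Storing the last observed state on the custom resource's status would be one way to make it durable.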

@mayuka-c, I have added a rate limiter. As you are already aware, it doesn't persist across restarts, but for now this should be OK.

It's a controller-runtime concept; you can take a look at adding an admission webhook or a shared informer to the controller for this. We can discuss further offline.

Signed-off-by: Guna K Kambalimath <Guna.Kambalimath@ibm.com>

anshuman-agarwala · 2026-04-23T06:18:00Z

+var (
+	notificationCache  = make(map[string]time.Time)
+	cacheMutex         sync.RWMutex
+	minIntervalMinutes = 30


Since the cache is in-memory, this can cause multiple emails in a short time if the controller goes into a restart loop, right?

Yes, this can happen. In the long run, we should persist this in the DB.

anshuman-agarwala · 2026-04-23T06:18:29Z

 	// collection exists
 	return true, nil
-}
+}


Was this a mistake?

I ran gofmt on a few files and this got added; will remove it.

anshuman-agarwala · 2026-04-23T06:21:08Z

+	logMessage := fmt.Sprintf("Service '%s' creation failed. Reason: %s", s.Service.Name, errorMessage)
+	event.SetLog(models.EventLogLevelERROR, logMessage)
+
+	dbCon, disconnect, err := connectDB(s.Logger)


It would be better to maintain a single long-lived connection instead of recreating the connection every time this method gets called. Hypothetically, if a bunch of VMs fail at the same time, this will create a bunch of connections to the DB, potentially causing a slowdown or crash on the DB as well.

The original change was actually like that:
#75 (comment)

Mayuka added this comment, which contradicts yours. I think maintaining a connection would be a better idea.

GunaKKIBM · 2026-04-23T01:15:03Z

+		if err := scope.NotifyServiceCreationFailure(errorMsg); err != nil {
+			scope.Logger.Error(err, "failed to create failure notification event")
+		}


@mayuka-c, I have added a rate limiter. As you are already aware, it doesn't persist across restarts, but for now this should be OK.

GunaKKIBM · 2026-04-23T01:58:57Z


 	// collection exists
 	return true, nil
-}


Looks like this got updated because of gofmt.

GunaKKIBM · 2026-04-23T06:25:49Z

 	// collection exists
 	return true, nil
-}
+}


I ran gofmt on a few files and this got added; will remove it.

GunaKKIBM · 2026-04-23T06:26:56Z

+	logMessage := fmt.Sprintf("Service '%s' creation failed. Reason: %s", s.Service.Name, errorMessage)
+	event.SetLog(models.EventLogLevelERROR, logMessage)
+
+	dbCon, disconnect, err := connectDB(s.Logger)


The original change was actually like that:
#75 (comment)

Mayuka added this comment, which contradicts yours. I think maintaining a connection would be a better idea.

GunaKKIBM · 2026-04-23T06:42:42Z

+var (
+	notificationCache  = make(map[string]time.Time)
+	cacheMutex         sync.RWMutex
+	minIntervalMinutes = 30


Yes, this can happen. In the long run, we should persist this in the DB.

GunaKKIBM commented Apr 16, 2026

GunaKKIBM force-pushed the Notify-VM-Creation-Failure branch from e880482 to a140814 on April 16, 2026 03:48

mayuka-c reviewed Apr 16, 2026

GunaKKIBM force-pushed the Notify-VM-Creation-Failure branch from a140814 to 0bc0116 on April 23, 2026 01:12

Notify VM Creation Failure

9dddd04

Signed-off-by: Guna K Kambalimath <Guna.Kambalimath@ibm.com>

GunaKKIBM force-pushed the Notify-VM-Creation-Failure branch from 0bc0116 to 9dddd04 on April 23, 2026 01:39

anshuman-agarwala suggested changes Apr 23, 2026

GunaKKIBM commented Apr 23, 2026
