Skip to content

Notify VM Creation Failure#75

Open
GunaKKIBM wants to merge 1 commit intoIBM:mainfrom
GunaKKIBM:Notify-VM-Creation-Failure
Open

Notify VM Creation Failure#75
GunaKKIBM wants to merge 1 commit intoIBM:mainfrom
GunaKKIBM:Notify-VM-Creation-Failure

Conversation

@GunaKKIBM
Copy link
Copy Markdown
Member

@GunaKKIBM GunaKKIBM commented Apr 16, 2026

This PR notifies the user and admin via email, if VM creation fails/VM is in error state.

Comment thread api/main.go Outdated
}

setupLog.Info("Attempting to connect to MongoDB...")
// TODO: Should we really fail, if connection to mongoDB fails?
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need suggestions here
Should we exit, if mongoDB connection fails? Given the main functionality here is to provision/delete the VM, we can just log, if mongoDB connection fails. And notifications won't be sent.

@GunaKKIBM GunaKKIBM force-pushed the Notify-VM-Creation-Failure branch from e880482 to a140814 Compare April 16, 2026 03:48
Comment thread api/main.go Outdated
os.Exit(1)
}

setupLog.Info("Attempting to connect to MongoDB...")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for connecting to mongoDB at controller start up?
I feel we can do the db connection during reconcile if we want to store some events on demand

Comment on lines +146 to +148
if err := scope.NotifyServiceCreationFailure(errorMsg); err != nil {
scope.Logger.Error(err, "failed to create failure notification event")
}
Copy link
Copy Markdown
Member

@mayuka-c mayuka-c Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might produce events during every reconcile right if VM is in error state?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to make it idempotent, if the event for given service id and name already exists, then do not need to update the event in DB

Copy link
Copy Markdown
Member

@mayuka-c mayuka-c Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we can have in-memory cache for now (maybe later external-cache like redis), store the events there and then peridiocally maybe every 10-15 mins dump in DB. Without this you might end up making lot of DB calls I believe?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something, that is supposed to be notified immediately, given the mail is sent based on events from DB, caching here wouldn't be right is what I think.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be, it could me more like a rate limiter, for a given VM instance, if the failure is seen more than x times in n seconds, we can send the event

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Copy link
Copy Markdown
Member

@mayuka-c mayuka-c Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one more which @anshuman-agarwala said today that we can get the older VM state and current VM state, if older was error and the newer is also error then do not send. If its transitioned from anything else to error (meaning first time error), then notify.

Pls do check with him regarding this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anshuman-agarwala , I checked the PVM_instance_spec. I checked the cloud docs as well. No spec clearly talks about previous health state. Only one field that could closely relate here is PVMInstanceHealth
I checked the cloud docs as well. It doesn't clearly specify anything about VM's older state.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayuka-c . I have added a rate limiter. As you are already aware, it doesn't persist across restarts. But, for now, this should look ok.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a controller-runtime concept, you can take a look at adding an admission webhook or a shared informer to the controller for this. We can discuss further offline.

@GunaKKIBM GunaKKIBM force-pushed the Notify-VM-Creation-Failure branch from a140814 to 0bc0116 Compare April 23, 2026 01:12
Signed-off-by: Guna K Kambalimath <Guna.Kambalimath@ibm.com>
@GunaKKIBM GunaKKIBM force-pushed the Notify-VM-Creation-Failure branch from 0bc0116 to 9dddd04 Compare April 23, 2026 01:39
var (
notificationCache = make(map[string]time.Time)
cacheMutex sync.RWMutex
minIntervalMinutes = 30
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the cache is in-memory this can cause multiple emails in a short time if the controller goes into a restart loop, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this can happen. In the long run, we should persist this using DB

// collection exists
return true, nil
}
} No newline at end of file
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a mistake?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.I ran gofmt, on a few files, and this got added, will remove it.

logMessage := fmt.Sprintf("Service '%s' creation failed. Reason: %s", s.Service.Name, errorMessage)
event.SetLog(models.EventLogLevelERROR, logMessage)

dbCon, disconnect, err := connectDB(s.Logger)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be better if we can maintain a single long living connection instead of recreating the connection every time this method gets called. Hypothetically if a bunch of VMs fail at the same time then this will create a bunch of connections to the DB potentially causing slowdown/crash on the DB as well.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original changes was actually like that,
#75 (comment)

Mayuka added this comment that contradicts with yours. I think, maintaining a connection would be a better idea

Comment on lines +146 to +148
if err := scope.NotifyServiceCreationFailure(errorMsg); err != nil {
scope.Logger.Error(err, "failed to create failure notification event")
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayuka-c . I have added a rate limiter. As you are already aware, it doesn't persist across restarts. But, for now, this should look ok.


// collection exists
return true, nil
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like, this got updated because of gofmt

// collection exists
return true, nil
}
} No newline at end of file
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.I ran gofmt, on a few files, and this got added, will remove it.

logMessage := fmt.Sprintf("Service '%s' creation failed. Reason: %s", s.Service.Name, errorMessage)
event.SetLog(models.EventLogLevelERROR, logMessage)

dbCon, disconnect, err := connectDB(s.Logger)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original changes was actually like that,
#75 (comment)

Mayuka added this comment that contradicts with yours. I think, maintaining a connection would be a better idea

var (
notificationCache = make(map[string]time.Time)
cacheMutex sync.RWMutex
minIntervalMinutes = 30
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this can happen. In the long run, we should persist this using DB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants