[AI] Implement RealtimeInputConfig and Manual Activity Signals for Live API #8080
/gemini review
Code Review
This pull request introduces comprehensive support for realtime input configuration in the Live API, specifically adding LiveRealtimeInputConfig and LiveActivityDetection. These additions allow for fine-grained control over how the model detects user activity, handles interruptions, and defines turn coverage. The PR includes new public methods sendStartActivityRealtime and sendStopActivityRealtime for manual activity signaling, along with corresponding updates to the Java-friendly LiveSessionFutures and internal serialization logic. Comprehensive unit and instrumentation tests have been added to verify the new configuration options. Review feedback focuses on minor documentation refinements to ensure consistent KDoc linking syntax and improved readability in the LiveSession class.
| * | ||
| * Only required when automatic activity detection is disabled via [LiveRealtimeInputConfig]. | ||
| */ | ||
| public suspend fun sendStartActivityRealtime() { |
The function name is not decided yet; the same goes for sendStopActivityRealtime.
| * starting and ending of the activity. | ||
| */ | ||
| @PublicPreviewAPI | ||
| public class LiveRealtimeInputConfig |
The name is not decided yet; it could be changed to RealtimeInputConfig.
| gotTurnComplete = true | ||
| } | ||
| // Stop collecting when there's a new handle AND turnComplete is true | ||
| !(gotTurnComplete && lastResumptionUpdate?.newHandle != null) |
This fixes an early return in takeWhile that occurred when lastResumptionUpdate arrived with a null newHandle.
If the takeWhile bug starts breaking CI, we can spin up a separate PR with just this fix.
I have removed the test fix from this PR.
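For readers following the takeWhile discussion, here is a minimal sketch of the stop condition using only the Kotlin standard library. It uses a Sequence rather than the SDK's actual stream type, and ResumptionUpdate, ServerMessage, and collectUntilResumable are hypothetical names invented for illustration; only the predicate itself is taken from the diff above.

```kotlin
// Hypothetical stand-in for a session resumption update; only newHandle matters here.
data class ResumptionUpdate(val newHandle: String?)

// Hypothetical server message: either field may be absent on any given message.
data class ServerMessage(
    val turnComplete: Boolean = false,
    val resumption: ResumptionUpdate? = null,
)

fun collectUntilResumable(messages: Sequence<ServerMessage>): List<ServerMessage> {
    var gotTurnComplete = false
    var lastResumptionUpdate: ResumptionUpdate? = null
    return messages.takeWhile { message ->
        if (message.resumption != null) lastResumptionUpdate = message.resumption
        if (message.turnComplete) gotTurnComplete = true
        // Stop only when the turn is complete AND a non-null handle has arrived.
        // An update whose newHandle is null must not end collection.
        !(gotTurnComplete && lastResumptionUpdate?.newHandle != null)
    }.toList()
}

fun main() {
    val messages = sequenceOf(
        ServerMessage(resumption = ResumptionUpdate(newHandle = null)),
        ServerMessage(turnComplete = true),
        ServerMessage(resumption = ResumptionUpdate(newHandle = "h1")),
        ServerMessage(),
    )
    // The first two messages are kept; the third satisfies the stop condition.
    println(collectUntilResumable(messages).size)
}
```

Note that takeWhile drops the element on which the predicate first returns false, so the message carrying the usable handle is itself not part of the result in this sketch.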
| ) | ||
|
|
||
| @Serializable | ||
| internal data class Internal( |
toInternal() and Internal are a bit long. Open to suggestions.
You can give Sensitivity its own Internal class and dedupe here.
| * Only required when automatic activity detection is disabled via [LiveRealtimeInputConfig]. | ||
| */ | ||
| public suspend fun sendStartActivityRealtime() { | ||
| sendFrame(BidiGenerateContentRealtimeInputSetup(activityStart = true).toInternal()) |
activity_start and activity_end are expected to be objects with no fields: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#activitystart
These two fields are set to booleans in BidiGenerateContentRealtimeInputSetup and are serialized to {} later.
Open to code optimization here.
I have considered using a sealed interface instead, like this:
internal sealed interface BidiRealtimeInput {
    data class Text(val text: String) : BidiRealtimeInput
    data class Audio(val data: InlineData) : BidiRealtimeInput
    data class Video(val data: InlineData) : BidiRealtimeInput
    data class MediaStream(val chunks: List<InlineData>) : BidiRealtimeInput
    object ActivityStart : BidiRealtimeInput
    object ActivityEnd : BidiRealtimeInput
}
and serialization works like this:
internal class BidiGenerateContentRealtimeInputSetup(val input: BidiRealtimeInput) {
    @Serializable internal class ActivityStart
    @Serializable internal class ActivityEnd
    @Serializable
    internal class Internal(val realtimeInput: BidiGenerateContentRealtimeInput) {
        @Serializable
        internal data class BidiGenerateContentRealtimeInput(
            val mediaChunks: List<InlineData.Internal>? = null, ...
        )
    }
    fun toInternal() = Internal(
        when (input) {
            is BidiRealtimeInput.Text -> Internal.BidiGenerateContentRealtimeInput(text = input.text)
            ...
            is BidiRealtimeInput.ActivityStart -> Internal.BidiGenerateContentRealtimeInput(activityStart = ActivityStart())
            ...
        }
    )
}
By doing so, we no longer use booleans for activityStart and activityEnd, but we are also restricted to mutually exclusive input actions (only one input action, whether Text, Audio, or a start signal, can be sent at a time).
If encoding works, and given that this is internal, feel free to go that way!
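To make the sealed-interface idea concrete, here is a stdlib-only sketch in which each input variant maps to exactly one wire field and the two activity signals encode as empty objects. The encode and escape helpers and the hand-rolled JSON are assumptions made so the example runs without dependencies (the PR uses kotlinx.serialization), and the key names simply mirror the property names quoted in the comment above.

```kotlin
// Trimmed-down version of the sealed interface proposed in the comment above.
sealed interface BidiRealtimeInput {
    data class Text(val text: String) : BidiRealtimeInput
    object ActivityStart : BidiRealtimeInput
    object ActivityEnd : BidiRealtimeInput
}

// Hypothetical helper: escapes only backslashes and double quotes.
fun escape(s: String): String = s.replace("\\", "\\\\").replace("\"", "\\\"")

// Hypothetical hand-rolled encoder standing in for kotlinx.serialization.
fun encode(input: BidiRealtimeInput): String {
    val field = when (input) {
        is BidiRealtimeInput.Text -> "\"text\":\"${escape(input.text)}\""
        // Activity signals carry no data, so they serialize as empty objects.
        is BidiRealtimeInput.ActivityStart -> "\"activityStart\":{}"
        is BidiRealtimeInput.ActivityEnd -> "\"activityEnd\":{}"
    }
    return "{\"realtimeInput\":{$field}}"
}

fun main() {
    println(encode(BidiRealtimeInput.ActivityStart))
    println(encode(BidiRealtimeInput.Text("hi")))
}
```

Because the when is exhaustive over the sealed interface, adding a new variant forces the encoder to be updated at compile time, and the type itself rules out sending two kinds of input in one frame.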
| ) { | ||
|
|
||
| /** How sensitive the model interprets speech activity. */ | ||
| public enum class Sensitivity { |
Public API shouldn't use enums.
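One common way to avoid a public enum is a class with a private constructor and predefined instances in a companion object. The sketch below illustrates that pattern only; the HIGH and LOW values and the describe function are illustrative assumptions, not the API this PR settled on.

```kotlin
// Enum-like class: callers cannot create instances, only use the predefined ones.
class Sensitivity private constructor(val name: String) {
    override fun toString(): String = name

    companion object {
        // Illustrative values; the real set of sensitivities is not confirmed here.
        val HIGH = Sensitivity("HIGH")
        val LOW = Sensitivity("LOW")
    }
}

// Hypothetical consumer: a `when` over such a class always needs an `else` branch.
fun describe(sensitivity: Sensitivity): String = when (sensitivity) {
    Sensitivity.HIGH -> "detects speech more readily"
    Sensitivity.LOW -> "detects speech less readily"
    else -> "unknown sensitivity"
}

fun main() {
    println(describe(Sensitivity.HIGH))
    println(Sensitivity.LOW)
}
```

A likely motivation for the rule is that callers can never write an exhaustive when over such a class, so adding a value later does not break their code the way a new enum constant can.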
| ) { | ||
| @Serializable | ||
| internal enum class StartSensitivity { | ||
| @SerialName("START_SENSITIVITY_UNSPECIFIED") UNSPECIFIED, |
Not 100% necessary, but if the LiveActivityDetection class is only ever created by devs, there's no need for an "unspecified" case.
| ) { | ||
|
|
||
| /** How a model handles user input activity. */ | ||
| public enum class ActivityHandling { |
Public API should have no enums.
| } | ||
|
|
||
| /** How the model considers which input is included in the user's turn. */ | ||
| public enum class TurnCoverage { |
Add support for RealtimeInputConfig to configure voice activity detection and manual activity signals in the Live API.