Multimodal Software Development #1145
AdamSobieski
started this conversation in
Ideas
Replies: 1 comment 1 reply
I think combining multimodal capabilities with WebRTC/ORTC in an agent framework opens up interesting possibilities for real-time communication features. The challenge would be managing the complexity of media streams alongside agent orchestration, but it could enable some powerful use cases, like collaborative coding with voice/video or AI assistants that handle multiple input types simultaneously. I'd be curious to see how you're planning to structure the agent interactions with the WebRTC layer.
Introduction
Hello. I am excited to share some ideas, for discussion, about simplifying the development of multimodal software, e.g., dialogue systems, using the new Agent Framework together with WebRTC and/or ORTC.
For Python development, there is the aiortc library (documentation), whose API follows the JavaScript WebRTC APIs while closely adhering to Python's standard asynchronous I/O framework, asyncio.
For UWP, iOS, and Android, there is the ORTC Lib. Its supported languages include C++, Swift/Objective-C, C#, and Java.
How could the development of multimodal artificial-intelligence systems (atop WebRTC and ORTC) be simplified?
Discussion
A Static Class
A new static class resembling `System.Console` could allow software developers to more easily make use of real-time communication capabilities. This new static class might be named something resembling `System.Rtc` or `System.Media.Services`.
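Since the post also mentions Python (via aiortc), here is a minimal Python sketch of what a console-like facade for real-time communication might feel like to use. Every name here (`Rtc`, `on`, `emit`) is hypothetical and invented for illustration; it is not part of any existing framework:

```python
# Hypothetical sketch: a Console-like facade for real-time communication.
# None of these names come from an existing library; they only illustrate
# how a static-class-style API (in the spirit of System.Rtc) might feel.

class Rtc:
    """A static-style facade over RTC events and media plumbing."""
    _handlers = {}

    @staticmethod
    def on(event, handler):
        # Register a callback for an RTC event, e.g. "track" or "message".
        Rtc._handlers.setdefault(event, []).append(handler)

    @staticmethod
    def emit(event, payload):
        # Dispatch an event to all registered handlers (stands in for
        # real signaling/media machinery).
        for handler in Rtc._handlers.get(event, []):
            handler(payload)

received = []
Rtc.on("message", received.append)
Rtc.emit("message", "hello")
print(received)  # ['hello']
```

The appeal of the static-class shape is that, like `System.Console`, it needs no setup before first use; the trade-off is global state, which the service-based approach below avoids.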
A new service could provide software developers with these capabilities. This might resemble `agent.GetService<IRtcService>()` or `agent.GetService<IMediaService>()`.
Workflow-based Approaches
A new workflow-based approach, e.g., using the Agent Framework, could provide software developers with media-processing capabilities.
A code sketch of some multimedia-related workflow executors.
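As a hedged stand-in for such a sketch, here is a chain of media "executors" built on plain asyncio. This is not the actual Agent Framework API, and the transcription step is a stub; the executor, queue, and transform names are all invented for illustration:

```python
import asyncio

# Hypothetical sketch of chained media "executors": each executor
# transforms items from its input queue and forwards results downstream.
# A None item is used as an end-of-stream sentinel.

class Executor:
    def __init__(self, transform):
        self.transform = transform
        self.inbox = asyncio.Queue()
        self.next = None  # downstream executor, if any

    async def run(self):
        while True:
            item = await self.inbox.get()
            if item is None:  # propagate end-of-stream and stop
                if self.next:
                    await self.next.inbox.put(None)
                return
            result = self.transform(item)
            if self.next:
                await self.next.inbox.put(result)

async def main():
    # Stub "transcriber": pretends to turn audio chunks into text.
    transcribe = Executor(lambda chunk: f"text({chunk})")
    collected = []
    collect = Executor(lambda text: collected.append(text))
    transcribe.next = collect

    tasks = [asyncio.create_task(transcribe.run()),
             asyncio.create_task(collect.run())]
    for chunk in ["a0", "a1"]:
        await transcribe.inbox.put(chunk)
    await transcribe.inbox.put(None)
    await asyncio.gather(*tasks)
    return collected

print(asyncio.run(main()))  # ['text(a0)', 'text(a1)']
```

In a real system, the queues would carry audio or video frames (e.g., from aiortc media tracks) rather than strings.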
Visual Programming
Developers could visually connect multimedia-processing components in dataflow diagrams (each component perhaps representing a service), e.g., a component for transcribing the speech in a video's audio track into text.
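One way such visually wired components could execute is as a dataflow graph evaluated in dependency order. A minimal sketch, with invented node names and stubbed processing in place of real media components:

```python
# Hypothetical sketch: components wired as a dataflow graph, roughly the
# way a visual editor might serialize them. Each node names the nodes it
# consumes; evaluation proceeds in dependency order via recursion.

def run_dataflow(nodes, outputs_wanted):
    """nodes: name -> (function, [input node names]); returns computed values."""
    values = {}

    def evaluate(name):
        if name not in values:
            func, inputs = nodes[name]
            values[name] = func(*(evaluate(i) for i in inputs))
        return values[name]

    return {name: evaluate(name) for name in outputs_wanted}

# A toy pipeline: video source -> extract audio track -> transcribe speech.
graph = {
    "video": (lambda: "video.mp4", []),
    "audio": (lambda v: f"audio-of({v})", ["video"]),
    "transcript": (lambda a: f"transcript-of({a})", ["audio"]),
}
print(run_dataflow(graph, ["transcript"]))
# {'transcript': 'transcript-of(audio-of(video.mp4))'}
```

A visual editor would only need to emit the `graph` structure; the execution engine stays the same regardless of how the diagram was drawn.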
Tool Use and Source-code Generation
Artificial-intelligence systems could generate source code or invoke tools, e.g., to insert executors into runtime-dynamic workflows that represent pending video content to be streamed over WebRTC or ORTC.
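A minimal sketch of that runtime-dynamic idea, where a "tool" (standing in for an AI system's tool call) appends a new executor to a pending workflow before it runs; all names here are hypothetical:

```python
# Hypothetical sketch: a runtime-dynamic workflow whose steps can be
# extended by tool calls before execution. The "tool" below stands in
# for an AI system generating or selecting a processing step.

class DynamicWorkflow:
    def __init__(self):
        self.steps = []

    def add_step(self, name, func):
        self.steps.append((name, func))

    def run(self, item):
        # Apply each step in order to the pending media item.
        for _, func in self.steps:
            item = func(item)
        return item

def add_watermark_tool(workflow):
    # A tool the agent might invoke to extend pending video processing.
    workflow.add_step("watermark", lambda clip: f"{clip}+watermark")

wf = DynamicWorkflow()
wf.add_step("encode", lambda clip: f"encoded({clip})")
add_watermark_tool(wf)  # the agent decides, at runtime, to add a step
print(wf.run("clip1"))  # encoded(clip1)+watermark
```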
Multimodal Dialogue Systems
Multimodal dialogue systems can be much more than synthetic talking heads or digital avatars with viseme-synchronized text-to-speech. They can be off-screen voice-over narrators that engage end-users in interactive, real-time conversation while other content is retrieved or generated, scheduled, streamed, and displayed on-screen by artificial-intelligence systems. Artificial-intelligence systems could produce the video content to be streamed to end-users concurrently with coherent, AI-generated voice-over narration.
Multimodal content could, for example, take the form of engaging educational documentaries, or of seamless sequences of video clips showing specific products in video-enhanced conversational storefronts.
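The concurrency idea above, narration proceeding while content is fetched behind the scenes, can be sketched with asyncio tasks. The narration and retrieval below are stubs that log events rather than producing real media, and all names are invented:

```python
import asyncio

# Hypothetical sketch: an off-screen narrator "speaks" (logs lines)
# while video content is concurrently fetched, mirroring the idea of
# narration covering content retrieval/generation latency.

async def narrate(lines, log):
    for line in lines:
        log.append(("narration", line))
        await asyncio.sleep(0)  # yield so other tasks can make progress

async def fetch_content(clips, log):
    for clip in clips:
        await asyncio.sleep(0)  # simulate retrieval latency
        log.append(("video", clip))

async def main():
    log = []
    await asyncio.gather(
        narrate(["Welcome.", "Here is the first product."], log),
        fetch_content(["clip-1", "clip-2"], log),
    )
    return log

events = asyncio.run(main())
print(events)
```

In a real system, the narration task would drive text-to-speech over one WebRTC audio track while the fetch task feeds a video track, with a scheduler keeping the two coherent.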
Conclusion
What do you think about multimodal software development using WebRTC and/or ORTC with the Agent Framework? Thank you.