Multimodal Software Development #1145
AdamSobieski
started this conversation in
Ideas
Replies: 1 comment 1 reply
I think combining multimodal capabilities with WebRTC/ORTC in an agent framework opens up interesting possibilities for real-time communication features. The challenge would be managing the complexity of media streams alongside agent orchestration, but it could enable some powerful use cases, like collaborative coding with voice/video or AI assistants that handle multiple input types simultaneously. I'd be curious to see how you're planning to structure the agent interactions with the WebRTC layer.
Introduction
Hello. I am excited to share some ideas, for discussion, about simplifying the development of multimodal software, e.g., dialogue systems, using the new Agent Framework together with WebRTC and/or ORTC.
For Python development, there is the aiortc library (documentation), whose API follows the JavaScript WebRTC APIs while closely adhering to Python's standard asynchronous I/O framework, asyncio.
For UWP, iOS, and Android, there is the ORTC Lib. Its supported languages include C++, Swift/Objective-C, C#, and Java.
How could the development of multimodal artificial-intelligence systems (atop WebRTC and ORTC) be simplified?
Discussion
A Static Class
A new static class resembling `System.Console` could allow software developers to more easily make use of real-time communication capabilities. This new static class might be named something resembling `System.Rtc` or `System.Media.Services`.
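Since the post also mentions Python (via aiortc), here is a minimal Python sketch of what a console-like facade for real-time communication might feel like to use. Every name here (`Rtc`, `on`, `emit`) is hypothetical and invented for illustration; it is not part of any existing framework:

```python
# Hypothetical sketch: a Console-like facade for real-time communication.
# None of these names come from an existing library; they only illustrate
# how a static-class-style API (in the spirit of System.Rtc) might feel.

class Rtc:
    """A static-style facade over RTC events and media plumbing."""
    _handlers = {}

    @staticmethod
    def on(event, handler):
        # Register a callback for an RTC event, e.g. "track" or "message".
        Rtc._handlers.setdefault(event, []).append(handler)

    @staticmethod
    def emit(event, payload):
        # Dispatch an event to all registered handlers (stands in for
        # real signaling/media machinery).
        for handler in Rtc._handlers.get(event, []):
            handler(payload)

received = []
Rtc.on("message", received.append)
Rtc.emit("message", "hello")
print(received)  # ['hello']
```

The appeal of the static-class shape is that, like `System.Console`, it needs no setup before first use; the trade-off is global state, which the service-based approach below avoids.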
A new service could provide software developers with these capabilities. This might resemble `agent.GetService<IRtcService>()` or `agent.GetService<IMediaService>()`.
Workflow-based Approaches
A new workflow-based approach, e.g., using the Agent Framework, could provide software developers with media-processing capabilities.
A code sketch of some multimedia-related workflow executors.
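As a hedged stand-in for such a sketch, here is a chain of media "executors" built on plain asyncio. This is not the actual Agent Framework API, and the transcription step is a stub; the executor, queue, and transform names are all invented for illustration:

```python
import asyncio

# Hypothetical sketch of chained media "executors": each executor
# transforms items from its input queue and forwards results downstream.
# A None item is used as an end-of-stream sentinel.

class Executor:
    def __init__(self, transform):
        self.transform = transform
        self.inbox = asyncio.Queue()
        self.next = None  # downstream executor, if any

    async def run(self):
        while True:
            item = await self.inbox.get()
            if item is None:  # propagate end-of-stream and stop
                if self.next:
                    await self.next.inbox.put(None)
                return
            result = self.transform(item)
            if self.next:
                await self.next.inbox.put(result)

async def main():
    # Stub "transcriber": pretends to turn audio chunks into text.
    transcribe = Executor(lambda chunk: f"text({chunk})")
    collected = []
    collect = Executor(lambda text: collected.append(text))
    transcribe.next = collect

    tasks = [asyncio.create_task(transcribe.run()),
             asyncio.create_task(collect.run())]
    for chunk in ["a0", "a1"]:
        await transcribe.inbox.put(chunk)
    await transcribe.inbox.put(None)
    await asyncio.gather(*tasks)
    return collected

print(asyncio.run(main()))  # ['text(a0)', 'text(a1)']
```

In a real system, the queues would carry audio or video frames (e.g., from aiortc media tracks) rather than strings.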
Visual Programming
Developers could visually connect multimedia-processing components in dataflow diagrams (each component perhaps representing a service), e.g., a component for transcribing the speech in a video's audio track into text.
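One way such visually wired components could execute is as a dataflow graph evaluated in dependency order. A minimal sketch, with invented node names and stubbed processing in place of real media components:

```python
# Hypothetical sketch: components wired as a dataflow graph, roughly the
# way a visual editor might serialize them. Each node names the nodes it
# consumes; evaluation proceeds in dependency order via recursion.

def run_dataflow(nodes, outputs_wanted):
    """nodes: name -> (function, [input node names]); returns computed values."""
    values = {}

    def evaluate(name):
        if name not in values:
            func, inputs = nodes[name]
            values[name] = func(*(evaluate(i) for i in inputs))
        return values[name]

    return {name: evaluate(name) for name in outputs_wanted}

# A toy pipeline: video source -> extract audio track -> transcribe speech.
graph = {
    "video": (lambda: "video.mp4", []),
    "audio": (lambda v: f"audio-of({v})", ["video"]),
    "transcript": (lambda a: f"transcript-of({a})", ["audio"]),
}
print(run_dataflow(graph, ["transcript"]))
# {'transcript': 'transcript-of(audio-of(video.mp4))'}
```

A visual editor would only need to emit the `graph` structure; the execution engine stays the same regardless of how the diagram was drawn.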
Tool Use and Source-code Generation
Artificial-intelligence systems could generate source code or invoke tools, e.g., to insert executors into runtime-dynamic workflows that represent pending video content to be streamed over WebRTC or ORTC.
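A minimal sketch of that runtime-dynamic idea, where a "tool" (standing in for an AI system's tool call) appends a new executor to a pending workflow before it runs; all names here are hypothetical:

```python
# Hypothetical sketch: a runtime-dynamic workflow whose steps can be
# extended by tool calls before execution. The "tool" below stands in
# for an AI system generating or selecting a processing step.

class DynamicWorkflow:
    def __init__(self):
        self.steps = []

    def add_step(self, name, func):
        self.steps.append((name, func))

    def run(self, item):
        # Apply each step in order to the pending media item.
        for _, func in self.steps:
            item = func(item)
        return item

def add_watermark_tool(workflow):
    # A tool the agent might invoke to extend pending video processing.
    workflow.add_step("watermark", lambda clip: f"{clip}+watermark")

wf = DynamicWorkflow()
wf.add_step("encode", lambda clip: f"encoded({clip})")
add_watermark_tool(wf)  # the agent decides, at runtime, to add a step
print(wf.run("clip1"))  # encoded(clip1)+watermark
```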
Multimodal Dialogue Systems
Multimodal dialogue systems can be much more than synthetic talking heads or digital avatars with viseme-synchronized text-to-speech. They can be off-screen voice-over narrators that engage end-users in interactive, real-time conversation while other content is retrieved or generated, scheduled, streamed, and displayed on-screen by artificial-intelligence systems. Artificial-intelligence systems could produce the video content to be streamed to end-users concurrently with coherent, AI-generated voice-over narration.
Multimodal content could, for example, take the form of engaging educational documentaries, or of seamless sequences of video clips showing specific products in video-enhanced conversational storefronts.
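The concurrency idea above, narration proceeding while content is fetched behind the scenes, can be sketched with asyncio tasks. The narration and retrieval below are stubs that log events rather than producing real media, and all names are invented:

```python
import asyncio

# Hypothetical sketch: an off-screen narrator "speaks" (logs lines)
# while video content is concurrently fetched, mirroring the idea of
# narration covering content retrieval/generation latency.

async def narrate(lines, log):
    for line in lines:
        log.append(("narration", line))
        await asyncio.sleep(0)  # yield so other tasks can make progress

async def fetch_content(clips, log):
    for clip in clips:
        await asyncio.sleep(0)  # simulate retrieval latency
        log.append(("video", clip))

async def main():
    log = []
    await asyncio.gather(
        narrate(["Welcome.", "Here is the first product."], log),
        fetch_content(["clip-1", "clip-2"], log),
    )
    return log

events = asyncio.run(main())
print(events)
```

In a real system, the narration task would drive text-to-speech over one WebRTC audio track while the fetch task feeds a video track, with a scheduler keeping the two coherent.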
Conclusion
What do you think about multimodal software development using WebRTC and/or ORTC with the Agent Framework? Thank you.