Commit 8b26ed4

Add "For Developers" guides - Debugging, Encryption and SDP
1 parent e339453 commit 8b26ed4

4 files changed, +342 -2 lines changed
guides/for_developers/fd_debugging.md
# WebRTC debugging

**It is also worth taking a look at [debugging](../advanced/debugging.md).**

In most cases, when **something** does not work, we try to find the problem according to the following workflow:

1. Check whether the session has been negotiated successfully.
1. Check whether the connection (ICE and DTLS) has been established.
1. Check whether RTP packets are demuxed, and frames assembled and decoded.
1. Check QoE statistics - freezes, jitter, packet loss, bitrate, fps.

```mermaid
flowchart TD
  S["Types of problems in WebRTC"] --> Session["Session negotiation (number of tracks, codecs)"]
  S --> Connection["Connection establishment (ICE and DTLS)"]
  S --> Playout["Playout (demuxing, packetization, decoding)"]
  S --> QoE["QoE (freezes, low quality, low fps)"]
```

## Session Negotiation

Here, we just validate that the SDP offer/answer looks the way it should.
In particular:

1. Check the number of audio and video mlines.
1. Check if any mlines are rejected (indicated either by port 0 in the m-line or by a=inactive).
    In most cases, the port is set to 9 (which means automatic negotiation by ICE); if ICE is already in progress or this is a subsequent negotiation, it might be set to the port currently used by the ICE agent. Port 0 appears when someone stops a transceiver via [`stop()`](https://developer.mozilla.org/en-US/docs/Web/API/RTCRtpTransceiver/stop).
1. Check mline directions (a=sendrecv/sendonly/recvonly/inactive).
1. Check codecs, their profiles and payload types.
1. The number of mlines cannot change between offer and answer.
    This means that if one side offers that it is only willing to receive a single audio track,
    all the other side can do is either confirm it will be sending or decline and say it won't be sending.
    If the other side also wants to send, additional negotiation has to be performed **in this case**.

The SDP offer/answer can easily be checked in chrome://webrtc-internals (in Chromium-based browsers) or in about:webrtc (in Firefox).

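The mline checks above can be sketched as a small parser over the raw SDP text. This is a hypothetical helper (not part of any WebRTC library), and it only looks at the `m=` line and direction attributes:

```typescript
// Summarize each mline in an SDP string: media kind, port, and direction.
type MLine = { kind: string; port: number; direction: string };

function summarizeMlines(sdp: string): MLine[] {
  const result: MLine[] = [];
  for (const line of sdp.split(/\r?\n/)) {
    const m = line.match(/^m=(\w+) (\d+) /);
    if (m) {
      // Direction defaults to sendrecv when no direction attribute is present.
      result.push({ kind: m[1], port: Number(m[2]), direction: "sendrecv" });
    } else if (result.length > 0) {
      const d = line.match(/^a=(sendrecv|sendonly|recvonly|inactive)$/);
      if (d) result[result.length - 1].direction = d[1];
    }
  }
  return result;
}

// A mline counts as rejected when its port is 0 or its direction is inactive.
function rejected(mlines: MLine[]): MLine[] {
  return mlines.filter((m) => m.port === 0 || m.direction === "inactive");
}
```
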
## Connection establishment

The WebRTC connection state (PeerConnection or PC state) is a combination of the ICE connection state and the DTLS state.
In particular, the PC is in state connected when both ICE and DTLS are in state connected.

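As a rough sketch of that combination rule (simplified; real browsers track more states, e.g. disconnected, and more edge cases):

```typescript
// Derive an aggregate PC state from the ICE and DTLS transport states.
// This is an illustrative simplification, not the exact browser algorithm.
type TransportState = "new" | "connecting" | "connected" | "failed" | "closed";

function pcConnectionState(ice: TransportState, dtls: TransportState): TransportState {
  if (ice === "failed" || dtls === "failed") return "failed";
  if (ice === "closed" || dtls === "closed") return "closed";
  if (ice === "connected" && dtls === "connected") return "connected";
  if (ice === "new" && dtls === "new") return "new";
  return "connecting";
}
```
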
The whole flow looks like this:

1. ICE searches for a pair of local and remote addresses that can be used to send and receive data.
1. Once a valid pair is found, ICE changes its state to connected and the DTLS handshake is started.
1. Once the DTLS handshake finishes, DTLS changes its state to connected, and so does the whole PC.
1. In the meantime, ICE continues checking other pairs of local and remote addresses in case there is a better path.
    If there is, ICE seamlessly switches to it - the transmission is not stopped or interrupted.

More on ICE, its state changes, failures and restarts can be found in the section devoted to ICE.

In most cases, the DTLS handshake works correctly. Most problems are related to ICE, as it's a pretty complex protocol.

Debugging ICE:

1. Check the ICE candidates grid in chrome://webrtc-internals or about:webrtc.
1. Turn on debug logs in ex_ice or chromium (via a command line argument). FF exposes all ICE logs in about:webrtc->Connection Log.
    Every implementation (ex_ice, chromium, ff) is very verbose.
    You can compare what's happening on both sides.
1. Try to filter out some of the local network interfaces and remove STUN/TURN servers to reduce the complexity of the ICE candidate grid, the amount of logs and the number of connectivity checks.
    In ex_webrtc, this is possible via [configuration options](https://hexdocs.pm/ex_webrtc/0.14.0/ExWebRTC.PeerConnection.Configuration.html#t:options/0).
1. Use Wireshark.
    Use filters on src/dst ip/port, and the udp and stun protocols.
    This way you can analyze the whole STUN/ICE/TURN traffic between a single local and remote address.

Debugging DTLS:

DTLS problems are really rare.
We used Wireshark or turned on [debug logs in ex_dtls](https://hexdocs.pm/ex_dtls/0.17.0/readme.html#debugging).

## Playout

If both session negotiation and connection establishment went well, and you can observe that packets are flowing but nothing is visible in the web browser, the problem might be in RTP packet demuxing, frame assembly or frame decoding on the client side.

1. We heavily rely on chrome://webrtc-internals here.
1. Check counters: packetsReceived, framesReceived, framesDecoded, framesDropped.
1. E.g. if packetsReceived increases but framesReceived does not, it means that there is a problem in assembling video frames from RTP packets. This can happen when:
    1. the web browser is not able to correctly demux incoming RTP streams, possibly because the sender uses an incorrect payload type in RTP packets (different than the one announced in the SDP) or does not include MID in RTP headers.
        Keep in mind that MID MAY be sent only at the beginning of the transmission to save bandwidth.
        This is enough to create a mapping between SSRC and MID on the receiver side.
    1. the marker bit in the RTP header is incorrectly set by the sender (although this is codec dependent; in the case of video, the marker bit is typically set when an RTP packet contains the end of a video frame)
    1. media is incorrectly packed into the RTP packet payload because of bugs in the RTP payloader
1. E.g. if packetsReceived and framesReceived increase but framesDecoded does not, it probably means errors in the decoding process.
    In this case, framesDropped will probably also increase.
1. framesDropped may also increase when frames are assembled too late, i.e. their playout time has already passed.
1. Check browser logs.
    Some of the errors (e.g. decoder errors) might be logged.

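The counter reasoning above can be condensed into a small triage helper. This is a hypothetical sketch working on two snapshots of the standard inbound-rtp stats fields, not a real diagnostic tool:

```typescript
// Guess where the playout pipeline stalls, given two getStats() snapshots.
// Field names follow the standard RTCInboundRtpStreamStats dictionary.
type InboundStats = {
  packetsReceived: number;
  framesReceived: number;
  framesDecoded: number;
};

function diagnosePlayout(prev: InboundStats, curr: InboundStats): string {
  const packets = curr.packetsReceived - prev.packetsReceived;
  const frames = curr.framesReceived - prev.framesReceived;
  const decoded = curr.framesDecoded - prev.framesDecoded;
  if (packets === 0) return "no packets flowing - check connection";
  if (frames === 0) return "packets flow but frames are not assembled - check demuxing, payload types, MID";
  if (decoded === 0) return "frames assembled but not decoded - check decoder errors in browser logs";
  return "playout pipeline looks healthy";
}
```
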
## QoE

The hardest thing to debug.
Mostly because it very often depends on a lot of factors (network conditions, hardware, sender capabilities, mobile devices).
Problems with QoE are hard to reproduce and very often don't occur in a local/office environment.

1. We heavily rely on chrome://webrtc-internals here.
1. Check counters: nackCount, retransmittedPacketsSent, packetsLost.
    Retransmissions (RTX) are a must-have.
    Without RTX, even 1% of packet loss will have a very big impact on QoE.
1. Check the incoming/outgoing bitrate and its stability.
1. Check jitterBufferDelay/jitterBufferEmittedCount_in_ms - this is the avg time each video frame spends in the jitter buffer before being emitted for playout.
1. The jitter buffer is adjusted dynamically.

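The jitterBufferDelay/jitterBufferEmittedCount metric can be computed from two stats samples. A sketch, assuming the standard cumulative stats fields (jitterBufferDelay in seconds, jitterBufferEmittedCount as a frame count):

```typescript
// Average time (ms) a frame spent in the jitter buffer between two getStats()
// samples. Both inputs are cumulative counters, so we work on deltas.
type JbStats = { jitterBufferDelay: number; jitterBufferEmittedCount: number };

function avgJitterBufferDelayMs(prev: JbStats, curr: JbStats): number {
  const frames = curr.jitterBufferEmittedCount - prev.jitterBufferEmittedCount;
  if (frames === 0) return 0; // no frames emitted in this interval
  return ((curr.jitterBufferDelay - prev.jitterBufferDelay) / frames) * 1000;
}
```
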
## Debugging in production

1. Dump WebRTC stats via getStats() into a database for later analysis.
    1. getStats() can still be called after the PC has failed or has been closed.
    1. Continuous storage of WebRTC stats as time series might be challenging.
        We don't have a lot of experience doing it.
1. Come up with custom metrics that will allow you to observe the scale of a given problem or monitor how something changes over time.
    1. E.g. if you feel like you very often encounter ICE failures, count them and compare them to successful workflows or to the number of complete and successful SDP offer/answer exchanges.
        This way you will see the scale of the problem and you can observe how it changes over time, after introducing fixes or new features.
    1. It's important to look at numbers instead of specific cases, as there will always be someone who needs to refresh the page, restart the connection, etc.
        What matters is the ratio of such problems and how it changes over time.
    1. E.g. this is a quote from Sean DuBois working on WebRTC in OpenAI:
        > We have metrics of how many people post an offer compared to how many people get to connected [state]. It’s never alarmed on a lot of users.

        Watch the full interview [here](https://www.youtube.com/watch?v=HVsvNGV_gg8) and read the blog [here](https://webrtchacks.com/openai-webrtc-qa-with-sean-dubois/#h).
1. Collect user feedback (on a scale of 1-3/1-5, via emoji) and observe how it changes.

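The offers-vs-connected ratio mentioned above can be tracked with a trivial counter. A minimal sketch (hypothetical class, not tied to any metrics library):

```typescript
// Track how many sessions that posted an offer actually reached the
// connected state; the ratio is what you would alert on.
class IceMetrics {
  private offers = 0;
  private connected = 0;

  onOffer(): void { this.offers++; }
  onConnected(): void { this.connected++; }

  successRatio(): number {
    // With no data yet, report a perfect ratio rather than dividing by zero.
    return this.offers === 0 ? 1 : this.connected / this.offers;
  }
}
```
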
## MOS

Initially, MOS was simply asking people for their feedback on a scale from 1 to 5 and then computing the average.
Right now, we have algorithms that aim to calculate audio/video quality on the same scale but using WebRTC stats: jitter, bitrate, packet loss, resolution, codecs, freezes, etc.
An example can be found here: https://github.com/livekit/rtcscore-go

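To make the idea concrete, here is a deliberately naive toy that maps a single input (packet loss) onto the 1-5 scale. Real estimators such as rtcscore combine many more inputs (jitter, bitrate, codec, resolution, freezes), so treat this purely as an illustration of the shape of such a function:

```typescript
// Toy MOS-like score from packet loss percentage only. NOT a real MOS
// algorithm - just shows how stats are squashed onto the 1-5 scale.
function naiveMos(packetLossPct: number): number {
  const score = 5 - packetLossPct * 0.4; // arbitrary illustrative penalty
  return Math.max(1, Math.min(5, score)); // clamp to the MOS range
}
```
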
## chrome://webrtc-internals

1. Based on the [getStats()](https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection/getStats) API.
1. getStats() does not return derivatives.
    They depend on the frequency of calls to getStats() and have to be calculated by the user.
1. chrome://webrtc-internals can be dumped and then analyzed using: https://fippo.github.io/webrtc-dump-importer/
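Computing such a derivative is a matter of sampling the cumulative counters twice. A sketch for bitrate from the cumulative bytesReceived field:

```typescript
// getStats() counters are cumulative, so a rate is derived from two samples
// taken intervalMs apart. bits per millisecond happens to equal kbps.
function bitrateKbps(prevBytes: number, currBytes: number, intervalMs: number): number {
  if (intervalMs <= 0) throw new Error("interval must be positive");
  return ((currBytes - prevBytes) * 8) / intervalMs;
}
```
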
# WebRTC encryption

In WebRTC, there are two types of data:

* Media
* Arbitrary Data

## Media

Media is audio or video.
It's sent using RTP - a protocol that adds timestamps, sequence numbers and other information
that UDP lacks but that is needed for retransmissions (RTX), correct audio/video demultiplexing, sync and playout, and so on.

## Arbitrary data

By arbitrary data we mean anything that is not audio or video.
This can be chat messages, signalling in game dev, files, etc.
Arbitrary data is sent using SCTP - a transport protocol like UDP/TCP but with a lot of custom features.
In the context of WebRTC, two of them are the most important - reliability and transmission order.
They are configurable and, depending on the use case, we can send data reliably/unreliably and in order/unordered.
SCTP has never been widely adopted in the industry.
A lot of network devices don't support SCTP datagrams and are optimized for TCP traffic.
Hence, in WebRTC, SCTP is encapsulated in DTLS and then in UDP.
Users do not interact with SCTP directly; instead, they use an abstraction layer built on top of it called Data Channels.
Data Channels do not add additional fields/headers to the SCTP payload.

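The reliability and ordering knobs surface in the browser API as the `ordered`, `maxRetransmits` and `maxPacketLifeTime` options of `RTCDataChannelInit`. A small sketch classifying a channel configuration (the classifier itself is hypothetical):

```typescript
// Subset of RTCDataChannelInit relevant to reliability and ordering.
type ChannelInit = {
  ordered?: boolean;          // defaults to true
  maxRetransmits?: number;    // setting this makes the channel unreliable
  maxPacketLifeTime?: number; // ditto, expressed as a time budget in ms
};

function reliability(init: ChannelInit): string {
  const reliable =
    init.maxRetransmits === undefined && init.maxPacketLifeTime === undefined;
  const order = init.ordered === false ? "unordered" : "in-order";
  return `${reliable ? "reliable" : "unreliable"}, ${order}`;
}
```
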
## Encryption

```mermaid
flowchart TD
  subgraph Media - optional
    M(["Media"]) --> R["RTP/RTCP"]
  end
  subgraph ArbitraryData - optional
    A["Arbitrary Data"] --> SCTP["SCTP"]
  end
  R --> S["SRTP/SRTCP"]
  D["DTLS"] -- keying material --> S
  I["ICE"] --> U["UDP"]
  SCTP --> D
  S --> I
  D --> I
```

1. Media is encapsulated into RTP packets but not into DTLS datagrams.
1. In the context of media, DTLS is only used to obtain the keying material that is used to create the SRTP/SRTCP context.
1. RTP packet **payloads** are encrypted using SRTP.
1. RTP headers are not encrypted - we can see and analyze them in Wireshark without configuring encryption keys.
1. DTLS datagrams, among other fields, contain a 48-bit sequence number (and a 16-bit epoch) in their headers.

## E2E Flow

1. Establish ICE connection
2. Perform DTLS handshake
3. Create SRTP/SRTCP context using keying material obtained from the DTLS context
4. Encapsulate media into RTP, encrypt using SRTP, and send using ICE (UDP)
5. Encapsulate arbitrary data into SCTP, encrypt using DTLS, and send using ICE (UDP)

Points 1 and 2 are mandatory, no matter whether we send media, arbitrary data or both.
WebRTC communication is **ALWAYS** encrypted.

## TLS/DTLS handshake

See:
* https://tls12.xargs.org/
* https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange#
* https://webrtcforthecurious.com/docs/04-securing/
* https://www.ibm.com/docs/en/cloud-paks/z-modernization-stack/2023.4?topic=handshake-tls-12-protocol

1. TLS uses asymmetric cryptography but, depending on the TLS version and cipher suites, it is used for different purposes.
1. In TLS-RSA, we use the server's public key from the server's cert to encrypt the pre-master secret and send it from the client to the server.
    Then, both sides use the client random, server random and pre-master secret to create the master secret.
1. In DH-TLS, the server's public key from the server's cert is not used to encrypt anything.
    Instead, both sides generate priv/pub key pairs and exchange pub keys with each other.
    The pub key is based on the priv key, and both of them are generated per connection.
    They are not related to e.g. the server's pub key that's included in the server's cert.
    All params are sent unencrypted.
1. Regardless of the TLS version, the server's cert is used to ensure the server's identity.
    This cert is signed by a Certificate Authority (CA).
    The CA computes a hash of the certificate and encrypts it using the CA's private key.
    The result is known as the signature and is included in the server's cert.
    The client takes the cert's signature and verifies it using the CA's public key.
1. In a standard TLS handshake, the server MUST send its certificate to the client, but
    the client only sends its certificate when explicitly requested by the server.
1. In DTLS-SRTP in WebRTC, both sides MUST send their certificates.
1. In DTLS-SRTP in WebRTC, both sides generate self-signed certificates.
    1. Alternatively, certs can be configured when creating a peer connection: https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection/RTCPeerConnection#certificates
1. Fingerprints of these certs are included in the SDP offer/answer and are checked once the DTLS-SRTP handshake is completed, i.e.
    we take the fingerprint from the SDP (which is assumed to be received via a secure channel) and check it against the fingerprint
    of the cert received during the DTLS-SRTP handshake.
1. The result of the DTLS-SRTP handshake is the master secret, which is then used to create so-called keying material.

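The fingerprint check boils down to hashing the certificate's DER bytes and comparing with the value carried in the SDP (`a=fingerprint:sha-256 AB:CD:...`). A sketch of that comparison, assuming SHA-256:

```typescript
import { createHash } from "node:crypto";

// Compute an SDP-style colon-separated SHA-256 fingerprint of DER cert bytes.
function sha256Fingerprint(der: Uint8Array): string {
  const hex = createHash("sha256").update(der).digest("hex").toUpperCase();
  return hex.match(/.{2}/g)!.join(":");
}

// Compare the fingerprint announced in the SDP with the cert actually
// presented during the DTLS-SRTP handshake (case-insensitive).
function fingerprintsMatch(sdpFingerprint: string, der: Uint8Array): boolean {
  return sdpFingerprint.toUpperCase() === sha256Fingerprint(der);
}
```
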
## Keying material

See:
* https://datatracker.ietf.org/doc/html/rfc5705#section-4
* https://datatracker.ietf.org/doc/html/rfc5764#section-4.2

Keying material is used to create SRTP encryption keys and is derived from the master secret established during the DTLS-SRTP handshake:

```
keying_material = PRF(master_secret, label, client_random + server_random + context_value_length + context_value)
```

* PRF is defined by TLS/DTLS
* context_value and context_value_length are optional and are not used in WebRTC
* label is used to allow a single master secret to be used for many different purposes.
    This is because the PRF gives the same output for the same input.
    Using exactly the same keying material in different contexts would be insecure.
    In WebRTC, this is the string "EXTRACTOR-dtls_srtp"
* the length of the keying material is configurable and depends on the SRTP protection profile
* the keying material is divided into four parts, as shown below:

```mermaid
flowchart TD
  K["KeyingMaterial"] --> CM["ClientMasterKey"]
  K --> SM["ServerMasterKey"]
  K --> CS["ClientMasterSalt"]
  K --> SS["ServerMasterSalt"]
```

They are then fed into the SRTP KDF (key derivation function), which is another PRF (dependent on the SRTP protection profile) that produces the actual encryption keys.
The client uses ClientMasterKey and ClientMasterSalt, while the server uses ServerMasterKey and ServerMasterSalt.
By client and server we mean DTLS roles, i.e. the client is the side that initiates the DTLS handshake.

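The four-way split can be sketched directly. The key and salt lengths below assume the AES128_CM_SHA1_80 profile (16-byte master keys, 14-byte master salts, per RFC 5764):

```typescript
// Split DTLS-exported keying material into the four SRTP parts, in the
// order defined by RFC 5764: client key, server key, client salt, server salt.
const KEY_LEN = 16;  // AES128_CM master key
const SALT_LEN = 14; // master salt

function splitKeyingMaterial(km: Uint8Array) {
  if (km.length < 2 * (KEY_LEN + SALT_LEN)) throw new Error("keying material too short");
  let off = 0;
  const take = (n: number) => km.slice(off, (off += n));
  return {
    clientMasterKey: take(KEY_LEN),
    serverMasterKey: take(KEY_LEN),
    clientMasterSalt: take(SALT_LEN),
    serverMasterSalt: take(SALT_LEN),
  };
}
```
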
### Protection profiles

Some of the protection profiles:

* AES128_CM_SHA1_80
* AES128_CM_SHA1_32
* AEAD_AES_128_GCM
* AEAD_AES_256_GCM

Meaning:

* AES128_CM - encryption algorithm (AES in counter mode) with a 128-bit key
* SHA1_80 - auth function for creating an 80-bit message authentication code (MAC)
* AEAD_AES_128_GCM - AES in Galois/Counter Mode, which both encrypts and authenticates

Most of the SRTP protection profiles use AES_CM as the KDF.

guides/for_developers/fd_sdp.md
# WebRTC SDP

WebRTC uses SDP offer/answer to negotiate session parameters (number of audio/video tracks, their directions, codecs, etc.).
The way they are exchanged between both sides is not standardized.
Very often it is a WebSocket.
WebRTC was standardized by (among others) Google, Cisco and Mozilla.
Cisco and Mozilla insisted on compatibility with SIP and the telephone industry, hence a lot of strange things in WebRTC are present to allow for WebRTC <-> SIP interoperability (e.g. SDP, DTMF).

## General information

* an mline starts with `m=` and continues until the next mline or the end of the SDP
* an mline represents a transceiver or a data channel
* audio/video mlines have a direction - sendrecv, recvonly, sendonly, inactive
* an mline can be rejected - in such a case, its direction is set to inactive
* when a transceiver is stopped, the port number in the mline is set to 0
* a port number of 9 in the mline means that the connection address will be set dynamically via ICE
* SDP can include ICE candidates, but it doesn't have to.
  In particular, when you create the first offer it won't have any ICE candidates, but if you wait a couple of seconds and read peerConnection.localDescription, it will contain the ICE candidates that were gathered throughout this time.
* the offerer can offer to both send and receive
* an mline includes a list of supported codecs.
  They are sorted in preference order.
* the sender can switch between negotiated codecs without informing the receiver about this fact.
  The receiver has to be prepared for receiving any payload type accepted in the SDP answer.
  This is e.g. used to switch between audio from the microphone and DTMF.
* each codec has its payload type - a number that identifies it and is included in the RTP packet header
* fmtp stands for format parameters and denotes additional codec parameters, e.g. profile level, minimal packetization length, etc.
* a lot of identifiers are obsolete (ssrc, cname, trackid in msid), but some implementations still rely on them (e.g. pion requires SSRC to be present in the SDP to correctly demux incoming RTP streams). See RFC 8843, section 9.2 for the correct RTP demuxer algorithm.
* rtcp-fb is an RTCP feedback supported by the offerer/answerer.
  Example feedbacks are used to request keyframes, retransmissions or to allow for congestion control implementations.

32+
33+
1. Number of mlines in SDP answer MUST be the same as in the offer.
34+
1. Number of mlines MUST NOT decrease between subsequent offer/answers.
35+
1. SDP answer can exclude codecs, rtp header extensions, and rtcp feedbacks that were offered but are not supported by the answerer
36+
37+
38+
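The first rule is mechanical enough to check in code. A hypothetical validator that only counts `m=` lines:

```typescript
// Count mlines in an SDP string.
function countMlines(sdp: string): number {
  return sdp.split(/\r?\n/).filter((l) => l.startsWith("m=")).length;
}

// Rule 1: the answer must carry exactly as many mlines as the offer.
function validateAnswer(offer: string, answer: string): boolean {
  return countMlines(offer) === countMlines(answer);
}
```
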
## Dictionary

* SDP munging - manual SDP string modification to enable/disable some WebRTC features.
  It happens in between createOffer/createAnswer and setLocalDescription.
  E.g. when experimental support for a new codec was introduced, it could be enabled via SDP munging.

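As an illustration of what munging looks like in practice, here is a hypothetical helper that reorders payload types on the video mline so a preferred codec is listed first (codec preference is one of the classic munging use cases; the function and its behavior are an assumption, not a standard API):

```typescript
// Move the given payload type to the front of every video mline's codec list,
// since codecs are listed in preference order.
function preferPayloadType(sdp: string, pt: string): string {
  return sdp
    .split(/\r?\n/)
    .map((line) => {
      if (!line.startsWith("m=video")) return line;
      const parts = line.split(" ");
      const header = parts.slice(0, 3); // "m=video", port, proto
      const pts = parts.slice(3).filter((p) => p !== pt);
      return [...header, pt, ...pts].join(" ");
    })
    .join("\n");
}
```
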
## Negotiating bidirectional P2P connection

See also the [Mastering Transceivers](../advanced/mastering_transceivers.md) guide.

When the other side is a casual peer, in most cases we want to both send and receive a single audio and video track.
This is the most common case.
Hence, when we add an audio or video track via addTrack and create an offer via createOffer,
this offer will have mlines with directions set to sendrecv to allow for immediate, bidirectional session establishment.

When the other side is an SFU, we have at least 3 options:
* the server sends, via the signaling channel, information to the client on how many audio and video tracks there already are in the room.
  The client sends an SDP offer including both its own tracks and the server's tracks.
  This requires a single negotiation.
* the client sends an SDP offer including only its own tracks.
  After this is negotiated successfully, the server sends its SDP offer.
  This requires two negotiations.
* we use two separate peer connections, one for sending and one for receiving.
  This way the client and server can send their offers in parallel.
  This was used e.g. by LiveKit.
