What is WebRTC? (Part 1 ~ Intro)

Susmit · Published in Huddle 01 · Jan 6, 2021

This is a transcription of a talk I presented at Google Developer Group, Jaipur.

Have you ever wondered how Huddle 01 works? Well, this blog explores exactly how it works, down to the core. The magic happens with the help of a technology called WebRTC.

WebRTC in Brief

As the name suggests, WebRTC is the backbone of real-time communication over the internet. At present, the majority of internet applications are built on top of client-server (request/response) mechanisms. For real-time audio, video, and data exchange from browser to browser, we need efficient protocols for audio/video processing and for networking.

WebRTC is currently used for a variety of use cases such as conferencing, live streaming, CDNs, cloud gaming, and more.

WebRTC in Depth

WebRTC is itself a collection of protocols and standards that work hand in hand to establish peer-to-peer exchange of audio, video, and data between browsers, independent of any third-party plugins. WebRTC abstracts all of this into a real-time communication feature that any web app can use through simple JavaScript APIs:

  • MediaStream: acquisition of audio/video streams
  • RTCPeerConnection: communication of audio/video data
  • RTCDataChannel: communication of arbitrary application data

With these APIs, any web app can offer peer-to-peer communication. They abstract away the technical details of signalling, NAT traversal, connection establishment, security, and communication.
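
To make this concrete, here is a minimal sketch (my own illustration, written as if inside an async function) of how the three APIs fit together; the signalling needed to reach the remote peer is deliberately left out:

```javascript
// Acquire local media with MediaStream / getUserMedia.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

// RTCPeerConnection carries the audio/video to the remote peer.
const pc = new RTCPeerConnection();
stream.getTracks().forEach(track => pc.addTrack(track, stream));

// RTCDataChannel carries arbitrary application data over the same connection.
const chat = pc.createDataChannel('chat');
chat.onopen = () => chat.send('hello, peer!');
```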

The underlying protocols also determine the performance characteristics of p2p communication: latency, message sizes, jitter, round-trip time. WebRTC primarily uses UDP, as opposed to TCP, to deliver an adaptive stream. UDP provides faster transmission than TCP, but:

  1. It is not reliable: packets may be lost or arrive out of order.
  2. UDP is too thin to carry an adaptive stream on its own and requires additional logic on top to achieve it.

WebRTC can also be integrated with existing communication systems such as VoIP, SIP, and the PSTN.

WebRTC Audio / Video Engines

For a basic media-conferencing web app, we need to:

  • Acquire the user's audio/video streams for communication; this is handled by a simple API (MediaStream) rather than relying on third-party sources.
  • Process the media streams to maintain quality, keep audio and video packets synchronised, and adjust the bitrate dynamically to adapt to the unpredictable bandwidth between users.
  • Decode incoming streams: when a peer receives streams, it must decode them and compensate for network jitter and packet loss.

WebRTC does all of this with its audio/video engines under its API, which manages the full lifecycle of establishing and maintaining a p2p connection.

Audio (Voice) Engine

The audio engine handles everything from capturing raw audio data directly from the sound card up to transporting it over the network. Along the way it performs several tasks, which can be categorised as follows:

Audio Codecs The audio codecs used in WebRTC are iSAC, iLBC, G.711, G.722, and Opus. Opus is the most widely used because of its performance over fluctuating networks, while in VoIP applications iSAC and iLBC are preferred; a short sketch after the list below shows how a web app can express a codec preference.

  • The internet Speech Audio Codec (iSAC) is a standard developed by Global IP Solutions. iSAC is already used by VoIP applications to provide audio communication that adjusts to the available bandwidth (also called bandwidth-adaptive). iSAC is intended for wideband network conditions where bitrates may be low and packet loss, delay, or jitter are common (conditions that are typical over wide area networks).
  • The Internet Low Bitrate Codec (iLBC) is a narrowband speech codec for VoIP and streaming audio defined in RFCs 3951 and 3952. iLBC is well suited to poor network conditions with limited bitrates and is more robust to lost speech frames: audio quality degrades gracefully as network conditions deteriorate.
  • Opus is a lossy audio coding format standardized in RFC 6716, and incorporates technology already deployed by Skype. Opus is a flexible codec that can handle a wide range of audio applications, including Voice over IP, videoconferencing, in-game chat, and even remote live music performances. It can scale from low bitrate narrowband speech to very high quality stereo music.
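
As a rough sketch (an illustration, not part of the original talk), a web app can nudge negotiation toward Opus by reordering the codec capabilities on a transceiver, assuming `pc` is an existing RTCPeerConnection with an audio transceiver:

```javascript
// Find the audio transceiver and the codecs the browser can receive.
const transceiver = pc.getTransceivers().find(t => t.receiver.track.kind === 'audio');
const { codecs } = RTCRtpReceiver.getCapabilities('audio');

// List Opus first so it is preferred during offer/answer negotiation.
const isOpus = c => c.mimeType.toLowerCase() === 'audio/opus';
transceiver.setCodecPreferences([
  ...codecs.filter(isOpus),
  ...codecs.filter(c => !isOpus(c)),
]);
```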

Acoustic Echo Canceler (AEC) The Echo Canceler is a signal processor that removes any acoustic echo from the voice. This is a needed component because WebRTC was developed with end-user browser-based communication in mind. This means that most WebRTC users will have an integrated camera, speaker and microphone where the output from the speaker will be picked up by the active microphone. Without echo canceling, you would end up with feedback and an unusable audio stream.

Noise Reduction (NR) Noise Reduction is another signal processing component developed to deal with the common conditions of WebRTC and VoIP deployments. Specifically, computers emit a lot of background noise like the spinning of a fan, or the hum of an electrical wire. Noise Reduction reduces these noises to enhance the quality of the audio stream.
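
Both of these processing steps can be requested through getUserMedia constraints. A small sketch (assuming an async context; browser support for each flag varies):

```javascript
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,   // acoustic echo canceler (AEC)
    noiseSuppression: true,   // noise reduction (NR)
    autoGainControl: true,    // keep the microphone level steady
  },
});
```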

Video Engine

VideoEngine, as the name would suggest, provides much the same functionality as AudioEngine, only tailored to streaming video. VideoEngine also is similarly intended to take a raw video capture from the device and prepare it for transport over the web.

Codecs VP8 is the preferred codec in WebRTC applications, as it performs well at low latency and has a friendly usage license. Other codec options include VP9, H.264, and AV1.

Raw video is bandwidth intensive, and the key functionality that VP8 provides is video compression.

Video jitter buffer The jitter buffer helps conceal packet loss. The jitter buffer works by collecting and storing incoming media packets in a buffer, and decides when to pass them along to the decoder and playback engine. It makes that decision based on the packets it has collected, the packets it is still waiting for and the timing required to playback the media.
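
The jitter buffer itself is not scriptable, but its effect is observable. A hedged sketch (assuming `pc` is an established RTCPeerConnection, inside an async function) of reading receiver-side jitter statistics via getStats():

```javascript
const report = await pc.getStats();
report.forEach(stat => {
  if (stat.type === 'inbound-rtp' && stat.kind === 'video') {
    // Average time a frame spends waiting in the jitter buffer before playback.
    const avgBufferDelay = stat.jitterBufferDelay / stat.jitterBufferEmittedCount;
    console.log('jitter (s):', stat.jitter, 'avg jitter buffer delay (s):', avgBufferDelay);
  }
});
```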

Image enhancements Any image enhancements are dependent on the WebRTC implementation. Typically, the VideoEngine will remove any video noise from the image captured by the web cam.

All of the processing is done directly by the browser, and even more importantly, the browser dynamically adjusts its processing pipeline to account for the continuously changing parameters of the audio and video streams and networking conditions. Once all of this work is done, the web application receives the optimised media stream, which it can then output to the local screen and speakers, forward to its peers, or post-process using one of the HTML5 media APIs!

Acquiring Audio and Video with getUserMedia

The Media Capture and Streams W3C specification defines a set of new JavaScript APIs that enable the application to request audio and video streams from the platform, as well as a set of APIs to manipulate and process the acquired media streams. The MediaStream object is the primary interface that enables all of this functionality.

MediaStream carries one or more synchronized tracks
  • The MediaStream object consists of one or more individual tracks (MediaStreamTrack).
  • Tracks within a MediaStream object are synchronized with one another.
  • The input source can be a physical device such as a microphone or webcam, a local or remote file from the user’s hard drive, or a remote network peer.
  • The output of a MediaStream can be sent to one or more destinations: a local video or audio element, JavaScript code for post-processing, or a remote peer.

A MediaStream object represents a real-time media stream and allows the application code to acquire data, manipulate individual tracks, and specify outputs. All the audio and video processing, such as noise cancellation, equalisation, image enhancement, and more are automatically handled by the audio and video engines.
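
A small sketch (my own illustration) of working with tracks on an acquired stream, assuming `stream` came from getUserMedia and a `<video id="preview">` element exists on the page:

```javascript
// Route the stream to a local playback destination.
document.querySelector('#preview').srcObject = stream;

// Manipulate individual tracks.
const [audioTrack] = stream.getAudioTracks();
const [videoTrack] = stream.getVideoTracks();
audioTrack.enabled = false;               // "mute" without stopping the track
console.log(videoTrack.getSettings());    // the resolution/frame rate actually in use
```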

However, the features of the acquired media stream are constrained by the capabilities of the input source: a microphone can emit only an audio stream, and some webcams can produce higher-resolution video streams than others. As a result, when requesting media streams in the browser, the getUserMedia() API allows us to specify a list of mandatory and optional constraints to match the needs of the application.

The getUserMedia() API is responsible for requesting access to the microphone and camera from the user, and acquiring the streams that match the specified constraints—that’s the whirlwind tour.
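
As a rough sketch (the exact values are illustrative, not requirements; modern browsers express constraints with ideal/exact/max rather than the older mandatory/optional form):

```javascript
try {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: {
      width:  { ideal: 1280 },   // preferred, but the browser may deliver less
      height: { ideal: 720 },
      frameRate: { max: 30 },    // hard upper bound
    },
  });
  // ...use the stream
} catch (err) {
  console.error('Camera/microphone access failed or was denied:', err);
}
```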

The provided APIs also enable the application to manipulate individual tracks, clone them, modify constraints, and more. Further, once the stream is acquired, we can feed it into a variety of other browser APIs:

  • Web Audio API enables processing of audio in the browser.
  • Canvas API enables capture and post-processing of individual video frames.
  • CSS3 and WebGL APIs can apply a variety of 2D/3D effects on the output stream.
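
For instance, a short sketch (assuming `stream` was already acquired) of routing the stream into the Web Audio API for in-browser processing:

```javascript
const audioCtx = new AudioContext();
const source = audioCtx.createMediaStreamSource(stream);   // use the MediaStream as an audio source
const analyser = audioCtx.createAnalyser();                 // e.g. to drive a volume meter or visualiser
source.connect(analyser);
```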

To make a long story short, getUserMedia() is a simple API to acquire audio and video streams from the underlying platform. The media is automatically optimized, encoded, and decoded by the WebRTC audio and video engines and is then routed to one or more outputs.

Real-Time Network Transports (routing data to peers)

Real-time communication is time-sensitive. As a result, audio and video streaming applications are designed to tolerate intermittent packet loss: the audio and video codecs can fill in small data gaps, often with minimal impact on the output quality. Similarly, applications must implement their own logic to recover from lost or delayed packets carrying other types of application data. Timeliness and low latency can be more important than reliability.
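
WebRTC lets application data opt into the same trade-off. A hedged sketch (assuming `pc` is an existing RTCPeerConnection) of a data channel configured to favour timeliness over reliability:

```javascript
const telemetry = pc.createDataChannel('telemetry', {
  ordered: false,       // no head-of-line blocking: late messages do not delay newer ones
  maxRetransmits: 0,    // never retransmit; a lost update is simply superseded by the next
});
telemetry.onopen = () => telemetry.send(JSON.stringify({ x: 0, y: 0 }));
```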

Audio and video streaming in particular have to adapt to the unique properties of our brains. Turns out we are very good at filling in the gaps but highly sensitive to latency delays. Add some variable delays into an audio stream, and “it just won’t feel right,” but drop a few samples in between, and most of us won’t even notice!

The requirement for timeliness over reliability is the primary reason why the UDP protocol is a preferred transport for delivery of real-time data. TCP delivers a reliable, ordered stream of data: if an intermediate packet is lost, then TCP buffers all the packets after it, waits for a retransmission, and then delivers the stream in order to the application.

By comparison, UDP offers the following “non-services”:

  • No guarantee of message delivery: no acknowledgments, retransmissions, or timeouts.
  • No guarantee of order of delivery: no packet sequence numbers, no reordering, no head-of-line blocking.
  • No connection state tracking: no connection establishment or teardown state machines.
  • No congestion control: no built-in client or network feedback mechanisms.

WebRTC uses UDP at the transport layer: latency and timeliness are critical. We also need mechanisms to traverse the many layers of NATs and firewalls, negotiate the parameters for each stream, provide encryption of user data, implement congestion and flow control, and more!

UDP is the foundation for real-time communication in the browser, but to meet all the requirements of WebRTC, the browser also needs a large supporting cast of protocols and services above it.

  • ICE: Interactive Connectivity Establishment (RFC 5245)
  • STUN: Session Traversal Utilities for NAT (RFC 5389)
  • TURN: Traversal Using Relays around NAT (RFC 5766)
  • SDP: Session Description Protocol (RFC 4566)
  • DTLS: Datagram Transport Layer Security (RFC 6347)
  • SCTP: Stream Control Transmission Protocol (RFC 4960)
  • SRTP: Secure Real-Time Transport Protocol (RFC 3711)

ICE, STUN, and TURN are necessary to establish and maintain a peer-to-peer connection over UDP. DTLS is used to secure all data transfers between peers; encryption is a mandatory feature of WebRTC. Finally, SCTP and SRTP are the application protocols used to multiplex the different streams, provide congestion and flow control, and provide partially reliable delivery and other additional services on top of UDP.
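
In application code, the ICE servers are the only part of this stack we configure directly. A hedged sketch (the URLs and credentials below are placeholders, not real endpoints):

```javascript
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.example.org:3478' },   // STUN: discover our public address/port
    {                                          // TURN: relay traffic when no direct path exists
      urls: 'turn:turn.example.org:3478',
      username: 'user',
      credential: 'secret',
    },
  ],
});
```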

RTCPeerConnection API

Despite the many protocols involved in setting up and maintaining a peer-to-peer connection, the application API exposed by the browser is relatively simple. The RTCPeerConnection interface is responsible for managing the full life cycle of each peer-to-peer connection.

  • RTCPeerConnection manages the full ICE workflow for NAT traversal.
  • RTCPeerConnection sends automatic (STUN) keepalives between peers.
  • RTCPeerConnection keeps track of local streams.
  • RTCPeerConnection keeps track of remote streams.
  • RTCPeerConnection triggers automatic stream renegotiation as required.
  • RTCPeerConnection provides the APIs needed to generate a connection offer, accept an answer, query the connection for its current state, and more.

In short, RTCPeerConnection encapsulates all the connection setup, management, and state within a single interface.
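
A simplified sketch of the caller's side (inside an async function); `signaling` stands for a hypothetical message channel to the other peer, such as a WebSocket, and is not part of WebRTC itself:

```javascript
// Trickle ICE candidates to the peer as they are discovered.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send({ candidate });
};

// Generate the connection offer and hand it to the signalling channel.
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
signaling.send({ sdp: pc.localDescription });

// Apply the peer's answer and remote candidates as they arrive.
signaling.onmessage = async ({ sdp, candidate }) => {
  if (sdp) await pc.setRemoteDescription(sdp);
  if (candidate) await pc.addIceCandidate(candidate);
};
```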

There are four basic steps that need to happen to establish a WebRTC connection. Each of these steps should complete before the next one takes over.

1. Signalling (upcoming in part 2)

2. Connecting (upcoming in part 3)

3. Securing (upcoming in part 4)

4. Communicating (upcoming in part 5)

5. Use cases and Applications (upcoming in part 6)

6. Debugging (upcoming in part 7)

Inspiration for the content and diagrams was obtained from the references mentioned below.

Further References
