How to Analyze and Interpret Real-Time Video Network Data

Written By Aditya Sharma


When dealing with real-time communications, latency is a problem. CDNs can address last-mile quality issues, but they are not an option for real-time streaming because they introduce network latency that can kill communication. This article is the first in a series of two in which Jerod Venema (CEO and co-founder of LiveSwitch) discusses how to analyze the problem.

My product owner recently coined an acronym that we now use internally at LiveSwitch: CNCS. It was born after we discussed simulcast, SVC, RED packets, and other mechanisms for mitigating poor-quality networks, and debated which options were most effective under which conditions. Crappy Network Conditions Support is a crucial part of how LiveSwitch operates.

Latency is one of the biggest challenges in real-time communication: to stay conversational, you have roughly a 200ms budget end to end. Traditional CDNs have plenty of techniques for addressing last-mile quality issues and overcoming a user's poor wifi, but none of them work for real-time streaming, because they all introduce network latency that can kill communication. As I was typing this, I took a regular phone call from a good friend; there was a lag of about 1.5 seconds between us, and it was bad enough that we had to hang up and redial.

So what can we do? What are our options if we can't buffer and can't fix the end user's network, but still need to keep latency below 200ms and maintain high-quality communication?

Identify Potential Sources of Lag Within The System

When it comes to real-time communication, there are many components. Let's take a look at what happens when you send data from your camera directly to the server.

[access the camera]
  -> [receive an image]
  -> [scale the image]
  -> [convert the image to the right format]
  -> [encode the frame]
  -> [packetize the encoded data]
  -> [wrap the packet]
  -> [encrypt the packet]
  -> [place the packet on the network card]
  -> [network card transmits to wifi access point]
  -> [wifi access point transmits to router]
  -> [router transmits to the server over the public internet]
  -> [TURN(S) server]
  -> [YOU HAVE ARRIVED]

The flowchart shows that every arrow is a handoff point where a queue may form, and every item in [square brackets] is also a potential point of latency. The receiver must then reverse the entire process. Once we understand the flow, we can identify where problems could occur and get feedback from the system to help us adjust accordingly.
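To make those handoff points visible, it helps to instrument each stage. The TypeScript sketch below is not LiveSwitch code; it assumes hypothetical stage functions (capture, scale, convert, encode, and so on) and simply wraps each one with a timer so you can see where a frame's budget is actually being spent.

// A minimal sketch of per-stage instrumentation for a send pipeline.
// The stage functions passed in (capture, scale, encode, ...) are
// hypothetical placeholders; substitute whatever your pipeline actually calls.

type Stage<I, O> = (input: I) => Promise<O>;

interface StageTiming {
  name: string;
  ms: number;
}

// Wrap a stage so that every invocation records how long it took.
function timed<I, O>(name: string, stage: Stage<I, O>, log: StageTiming[]): Stage<I, O> {
  return async (input: I) => {
    const start = performance.now();
    const output = await stage(input);
    log.push({ name, ms: performance.now() - start });
    return output;
  };
}

// Example usage: run one frame through the pipeline and report against the budget.
async function processOneFrame(
  rawFrame: unknown,
  stages: { name: string; fn: Stage<any, any> }[]
) {
  const log: StageTiming[] = [];
  let data: any = rawFrame;
  for (const s of stages) {
    data = await timed(s.name, s.fn, log)(data);
  }
  const total = log.reduce((sum, t) => sum + t.ms, 0);
  console.table(log); // per-stage latency
  console.log(`total: ${total.toFixed(1)}ms of a ~33.3ms budget at 30fps`);
  return data;
}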

For the purposes of this discussion, I want to point out that audio behaves in much the same way as video, and that video can come from sources other than a camera. Here, however, we will assume a camera's video feed is being used. There are differences: audio cannot be NACKed, and a screen share degrades differently from a camera because of its use case. We will address those later.

It is also worth mentioning that, despite the title, we will be looking at lag in the entire system. You cannot adjust for the network alone and still be successful; you have to consider the whole system.

Remember that everything in this list happens very quickly. At 30fps, each frame has only 33.3 milliseconds to complete the entire journey.

Step 1: Camera Access

The actual camera access is a surprising source of latency. It can take a second or two to start the camera in hardware, so we get a brief pause, but it is a one-time startup expense. It is worth noting, however, that modern cameras often (but not always!) provide timestamps for the images they capture, as do audio devices.
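If you want to see that startup cost for yourself in a browser, you can time the standard getUserMedia call. A minimal sketch (note that the first call may also include the user's permission prompt, so run it more than once):

// Measure the one-time camera startup cost in a browser using only the
// standard MediaDevices API.
async function measureCameraStartup(): Promise<number> {
  const start = performance.now();
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const elapsed = performance.now() - start;
  console.log(`camera started in ${elapsed.toFixed(0)}ms`);
  // Stop the tracks so the camera light turns off again.
  stream.getTracks().forEach(track => track.stop());
  return elapsed;
}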

It is possible to get data from a camera whose timestamps differ from those of your microphone, which can lead to lip-sync issues. There are not many ways to detect this; we have found that pre-recorded video and audio with known data embedded in them are the best method. Even that can produce false positives, because it leaves the physical device out of the loop.

If you're getting timestamps from the underlying hardware, the solution is to make sure users select both a microphone AND a camera from the same physical device, so that their timestamps come from the same source.
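In the browser, MediaDeviceInfo exposes a groupId that is shared by devices belonging to the same physical hardware, so a rough sketch of "pick the mic and camera from the same device" could look like this (the pairing strategy itself is an assumption, not the only way to do it):

// Find a camera and a microphone that share a groupId, i.e. belong to the
// same physical device (e.g. a webcam with a built-in mic).
async function findMatchedPair(): Promise<{ camera: MediaDeviceInfo; mic: MediaDeviceInfo } | null> {
  // Note: labels and groupIds are only fully populated after the user has
  // granted media permissions at least once.
  const devices = await navigator.mediaDevices.enumerateDevices();
  const cameras = devices.filter(d => d.kind === "videoinput");
  const mics = devices.filter(d => d.kind === "audioinput");
  for (const camera of cameras) {
    const mic = mics.find(m => m.groupId === camera.groupId);
    if (mic) return { camera, mic };
  }
  return null; // no matched pair; fall back to OS timestamps (see below)
}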

Another solution is to use timestamps from the OS. This works if you replace the audio and video timestamps with a single source and ensure the timestamps are applied as early as possible in both capture elements. It is not a perfect solution, but it serves as a worst-case fallback.
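Here is a minimal sketch of that fallback, assuming your capture elements let you attach your own timestamps (the Frame shape and the stampAtCapture helper are hypothetical, not part of any particular API):

// Stamp both audio and video frames from one monotonic OS clock instead of
// trusting each device's own timestamps.

interface Frame {
  kind: "audio" | "video";
  data: ArrayBuffer;
  timestampMs: number;
}

// A single monotonic source shared by both capture paths.
const clock = () => performance.now();

function stampAtCapture(kind: "audio" | "video", data: ArrayBuffer): Frame {
  // Apply the timestamp as early as possible, right at the capture boundary,
  // so queuing later in the pipeline does not skew A/V sync.
  return { kind, data, timestampMs: clock() };
}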

Step 2: Receiving the Image

This seems like an absurd point to include: why care how long it takes to receive an image locally, in code? The problem is scale. A 320x240 image is no problem on any modern PC, but it takes real time to load a buffer with a 1080p or 4K image. A 4K frame is 24/8 * 3840 * 2160 bytes, or approximately 24.9MB, and don't forget that we move a frame every 33 milliseconds.
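To put numbers on that, here is the raw-buffer arithmetic for a few common resolutions at 24 bits (3 bytes) per pixel; this is plain arithmetic, not any particular library's API:

// Raw (unencoded) frame sizes at 24 bits per pixel, and the resulting
// data rate at 30fps.
const BYTES_PER_PIXEL = 24 / 8;

function frameStats(width: number, height: number, fps = 30) {
  const bytesPerFrame = width * height * BYTES_PER_PIXEL;
  const mbPerFrame = bytesPerFrame / 1_000_000;
  const mbPerSecond = mbPerFrame * fps;
  return { resolution: `${width}x${height}`, mbPerFrame, mbPerSecond };
}

console.table([
  frameStats(320, 240),   // ~0.23 MB per frame
  frameStats(1920, 1080), // ~6.2 MB per frame
  frameStats(3840, 2160), // ~24.9 MB per frame, ~747 MB/s at 30fps
]);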

When dealing with high-resolution images, it is better to pass buffers around by pointer than to copy the data. If the pipeline cannot keep up with the frame rate and quality, frames may be dropped on the sender side. A sender-side dropped frame count indicates one of two things (see the sketch after the list below):

  • The CPU or GPU cannot process the amount of data we are pushing through it.
  • The server is telling us there is network loss and we need to send fewer frames.
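In a browser, one practical way to tell these two cases apart is the sender-side WebRTC statistics: the outbound-rtp report exposes a qualityLimitationReason (support varies by browser) that reads "cpu" when the local pipeline cannot keep up and "bandwidth" when network feedback is forcing the encoder down. A rough sketch, assuming an already-connected RTCPeerConnection with a video sender:

// Distinguish the two cases using sender-side WebRTC stats.
async function checkSenderHealth(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach(report => {
    if (report.type === "outbound-rtp" && (report as any).kind === "video") {
      const r = report as any;
      if (r.qualityLimitationReason === "cpu") {
        console.warn("sender limited by CPU/GPU: reduce resolution or encoding cost");
      } else if (r.qualityLimitationReason === "bandwidth") {
        console.warn("sender limited by network feedback: send fewer/smaller frames");
      } else {
        console.log(`frames encoded: ${r.framesEncoded}, sent: ${r.framesSent}`);
      }
    }
  });
}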
