what is acceptable jitter for voip and videoconferencing?
Dear nanog-ers: I go back many, many years with the baseline numbers for managing VoIP networks, including things like Cisco LLQ, DiffServ, fqm, prioritizing VLANs, and running VoIP networks entirely separately... I worked on codecs, such as oslec, and on early SIP stacks, but that was over 20 years ago. The thing is, I have been unable to find much research (as yet) as to why my number exists. Over here I am taking a poll as to what number is most correct (10ms, 30ms, 100ms, 200ms): https://www.linkedin.com/feed/update/urn:li:ugcPost:7110029608753713152/ but I am even more interested in finding citations to support various viewpoints, including mine, and in learning how SLAs are met to deliver it. -- Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html Dave Täht CSO, LibreQos
On Tue, Sep 19, 2023 at 5:11 PM Dave Taht <dave.taht@gmail.com> wrote:
The thing is, I have been unable to find much research (as yet) as to why my number exists. Over here I am taking a poll as to what number is most correct (10ms, 30ms, 100ms, 200ms),
Hi Dave, I don't know your use case, but bear in mind that jitter impacts gaming as well, and not necessarily in the same way it impacts VoIP and videoconferencing. VoIP has the luxury of dynamically growing its jitter buffer. Gaming... often does not. Just mentioning it so you don't get blindsided. Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
On Wed, 20 Sept 2023 at 03:15, Dave Taht <dave.taht@gmail.com> wrote:
I go back many, many years with the baseline numbers for managing VoIP networks, including things like Cisco LLQ, DiffServ, fqm, prioritizing VLANs, and running VoIP networks entirely separately... I worked on codecs, such as oslec, and on early SIP stacks, but that was over 20 years ago.
I don't believe LLQ has utility in hardware-based routers; packets stay inside a hardware-based router for single-digit microseconds, with nanoseconds of jitter. For software-based devices, I'm sure the situation is different. Practical example: a tier-1 network running three vendors, with no LLQ, can carry traffic across the globe with lower jitter (microseconds) than I get pinging 127.0.0.1 on my M1 laptop, because I have to do context switches and the network does not. This is in the BE queue, measured in real operation over long periods, without any engineering effort to achieve low jitter.
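As an aside, the loopback comparison above is easy to reproduce. Below is a minimal sketch (assuming Python 3; the port, sample count, and payload are arbitrary choices, not from this thread) that measures RTT variation over a UDP echo on 127.0.0.1, where the only "network" is the kernel and its context switches:

```python
# Minimal sketch: measure RTT jitter over loopback with a UDP echo.
# Illustrates that even 127.0.0.1 shows context-switch-induced jitter;
# port and sample count are illustrative.
import socket
import statistics
import threading
import time

HOST, PORT, SAMPLES = "127.0.0.1", 50007, 1000

def echo_server(sock):
    while True:
        data, addr = sock.recvfrom(64)
        if data == b"stop":
            return
        sock.sendto(data, addr)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind((HOST, PORT))
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rtts = []
for _ in range(SAMPLES):
    t0 = time.perf_counter_ns()
    client.sendto(b"ping", (HOST, PORT))
    client.recvfrom(64)
    rtts.append((time.perf_counter_ns() - t0) / 1000)  # microseconds

client.sendto(b"stop", (HOST, PORT))
print(f"median RTT {statistics.median(rtts):.1f} us, "
      f"stdev (jitter proxy) {statistics.pstdev(rtts):.1f} us")
```

On a loaded machine the deviation can easily reach tens or hundreds of microseconds, which is the point about scheduler noise versus hardware forwarding.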
The thing is, I have been unable to find much research (as yet) as to why my number exists. Over here I am taking a poll as to what number is most correct (10ms, 30ms, 100ms, 200ms),
I know there are academic papers as well as vendor graphs showing the impact of jitter on quality. Here is one: https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses - it appears to roughly say that 20ms is fine for G.711. But I'm sure this is actually very complex to answer well, and the choice of codec greatly impacts the answer: WhatsApp uses Opus, Skype uses SILK (maybe Teams too?). And there are many more rare/exotic codecs optimised for very specific scenarios, like massive packet loss. -- ++ytti
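For what it's worth, the jitter number most RTP stacks actually report (and that studies like the one above tend to measure against) is the RFC 3550 section 6.4.1 interarrival jitter estimate, a smoothed mean deviation of transit time. A small sketch of that calculation follows; the 20 ms framing and the delay spikes are synthetic, for illustration only:

```python
# Sketch of the RFC 3550 (section 6.4.1) interarrival jitter estimator,
# the number most RTP stacks report. Sample timestamps are synthetic.
def rtp_jitter(send_times, recv_times):
    """Running estimate J(i) = J(i-1) + (|D| - J(i-1)) / 16,
    where D is the change in one-way transit time between packets."""
    j = 0.0
    prev_transit = None
    for s, r in zip(send_times, recv_times):
        transit = r - s
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            j += (d - j) / 16.0
        prev_transit = transit
    return j

# A 20 ms stream with a couple of delayed packets (times in ms):
send = [i * 20.0 for i in range(10)]
recv = [s + 5.0 for s in send]
recv[4] += 12.0   # one packet held up 12 ms in a queue
recv[7] += 25.0   # another held up 25 ms
print(f"estimated jitter: {rtp_jitter(send, recv):.2f} ms")
```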
On Sep 20, 2023, at 2:46 AM, Saku Ytti <saku@ytti.fi> wrote:
Skype uses SILK (maybe Teams too?).
We run Teams Telephony in $DAYJOB, and it does use SILK. https://learn.microsoft.com/en-us/microsoftteams/platform/bots/calls-and-mee...
On Wed, 20 Sept 2023 at 19:06, Chris Boyd <cboyd@gizmopartners.com> wrote:
We run Teams Telephony in $DAYJOB, and it does use SILK.
https://learn.microsoft.com/en-us/microsoftteams/platform/bots/calls-and-mee...
Looks like codecs still are rapidly evolving in walled gardens. I just learned about 'Satin'. https://en.wikipedia.org/wiki/Satin_(codec) https://ibb.co/jfrD6yk - notice the 'payload description' from the Teams admin portal. So at least in some cases Teams switches from SILK to Satin; the wiki suggests 1-on-1 calls only, but I can't confirm or deny this. -- ++ytti
Looks like codecs still are rapidly evolving in walled gardens. I just learned about 'Satin'.
Yeah. There are also some open-source ones, like Lyra from Google, with v2 released last year. https://opensource.googleblog.com/2022/09/lyra-v2-a-better-faster-and-more-v... OTOH, the VoIP specifications here in Italy mandate the use of only G.711 and G.729 for calls between landline providers. Transcode, or use only G.729/G.711, making "Killing Them Softly with Quality Issues" sound like a good title for a song. Brian
I think it all goes back to the earliest MOS tests ("Hold up the number of fingers for how good the sound is"), and every once in a while somebody actually does some testing to look for correlations. Though it's 15 years old, I like this thesis for the writer's reporting: https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses In particular, a table in the thesis shows the correlation, and it is consistent with what I would expect. [inline image: the thesis's table correlating jitter with MOS] Lee
My understanding has always been that 30ms was set based on human perceptibility. 30ms was the average point at which the average person could start to detect artifacts in the audio. On Tue, Sep 19, 2023 at 8:13 PM Dave Taht <dave.taht@gmail.com> wrote:
The thing is, I have been unable to find much research (as yet) as to why my number exists. Over here I am taking a poll as to what number is most correct (10ms, 30ms, 100ms, 200ms), https://www.linkedin.com/feed/update/urn:li:ugcPost:7110029608753713152/ [...]
Artifacts in audio are a product of packet loss, or of jitter causing codec issues, which lead to human-perceptible audio anomalies; they are not so much a product of latency by itself. Two-way voice is remarkably NOT terrible on a 495ms-RTT geostationary satellite connection as long as there is little or no packet loss. On Thu, Sep 21, 2023 at 12:47 PM Tom Beecher <beecher@beecher.cc> wrote:
My understanding has always been that 30ms was set based on human perceptibility. 30ms was the average point at which the average person could start to detect artifacts in the audio.
Thank you all for your answers here, on the poll itself, and for papers like this one. The consensus seems to be settling around 30ms for VoIP, with a few interesting outliers and viewpoints. https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses Something that came up in reading that, and that I half remember from my early days of working with VoIP (on Asterisk), was that silence suppression (with comfort noise) that sent no packets at all was in general worse than sending silence (or comfort noise) packets, for two reasons: one was NAT bindings timing out, and the other was that a steady stream also helped control congestion and had smaller jitter swings. So in the deployments I was doing then, I universally disabled this feature on the phones I was using. In my mind, this thesis demonstrated that point (to an extreme!), particularly for a network that is packet- (not byte-) buffer limited: https://www.duo.uio.no/bitstream/handle/10852/45274/1/thesis.pdf But my question now is: are we still doing silence suppression (not sending packets) on VoIP nowadays? On Thu, Sep 21, 2023 at 2:55 PM Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
Artifacts in audio are a product of packet loss, or of jitter causing codec issues, which lead to human-perceptible audio anomalies; they are not so much a product of latency by itself. Two-way voice is remarkably NOT terrible on a 495ms-RTT geostationary satellite connection as long as there is little or no packet loss.
On Thu, Sep 21, 2023 at 12:47 PM Tom Beecher <beecher@beecher.cc> wrote:
My understanding has always been that 30ms was set based on human perceptibility. 30ms was the average point at which the average person could start to detect artifacts in the audio.
-- Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html Dave Täht CSO, LibreQos
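On the silence-suppression point above: one way deployments keep the packet stream alive while still saving bandwidth is to send RFC 3389 comfort-noise frames (static RTP payload type 13) instead of going fully quiet, which also keeps NAT bindings open. A minimal sketch of what such a sender looks like on the wire; the peer address, SSRC, and noise level are invented for illustration:

```python
# Minimal sketch: send RTP comfort-noise packets (RFC 3389, payload
# type 13) during silence so NAT bindings stay open and the packet
# stream stays steady. Address, SSRC, and noise level are illustrative.
import socket
import struct
import time

def rtp_cn_packet(seq, ts, ssrc, noise_level_dbov=70):
    # 12-byte RTP header: V=2, P=0, X=0, CC=0, M=0, PT=13 (CN)
    header = struct.pack("!BBHII", 0x80, 13, seq & 0xFFFF,
                         ts & 0xFFFFFFFF, ssrc)
    return header + bytes([noise_level_dbov])  # one byte: noise level in -dBov

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer = ("192.0.2.10", 4000)   # example address (RFC 5737 range)
seq, ts, ssrc = 0, 0, 0x12345678
for _ in range(5):            # one CN packet per 20 ms frame time
    sock.sendto(rtp_cn_packet(seq, ts, ssrc), peer)
    seq += 1
    ts += 160                 # 20 ms at 8 kHz
    time.sleep(0.020)
```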
On Thu, Sep 21, 2023 at 6:28 AM Tom Beecher <beecher@beecher.cc> wrote:
My understanding has always been that 30ms was set based on human perceptibility. 30ms was the average point at which the average person could start to detect artifacts in the audio.
Hi Tom, Jitter doesn't necessarily cause artifacts in the audio. Modern applications implement what's called a "jitter buffer." As the name implies, the buffer collects and delays audio for a brief time before playing it for the user. This allows time for the packets which have been delayed a little longer (jitter) to catch up with the earlier ones before they have to be played for the user. Smart implementations can adjust the size of the jitter buffer to match the observed variation in delay so that sound quality remains the same regardless of jitter. Indeed, on Zoom I barely noticed audio artifacts for a friend who was experiencing 800ms jitter. Yes, really, 800ms. We had to quit our gaming session because it caused his character actions to be utterly spastic, but his audio came through okay. The problem, of course, is that instead of the audio delay being the average packet delay, it becomes the maximum packet delay. You start to have problems with people talking over each other because when they start they can't yet hear the other person talking. "Sorry, go ahead. No, you go ahead." Regards, Bill Herrin -- William Herrin bill@herrin.us https://bill.herrin.us/
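For the record, adaptive jitter buffers of the kind Bill describes typically track a smoothed delay estimate plus some multiple of the observed deviation, much like TCP's RTO calculation. A toy sketch; the 1/16 gains and the 4x safety factor are illustrative, not taken from any particular implementation:

```python
# Toy sketch of an adaptive jitter buffer: the playout delay tracks
# observed delay variation, so audio stays smooth but total latency
# creeps toward the worst-case packet delay. Constants are illustrative.
class AdaptiveJitterBuffer:
    def __init__(self, safety_factor=4.0):
        self.mean_delay = 0.0
        self.var = 0.0          # smoothed |deviation|, like TCP's RTTVAR
        self.safety = safety_factor

    def on_packet(self, network_delay_ms):
        dev = abs(network_delay_ms - self.mean_delay)
        self.mean_delay += (network_delay_ms - self.mean_delay) / 16.0
        self.var += (dev - self.var) / 16.0

    def playout_delay(self):
        # Hold audio long enough that late packets usually arrive in time.
        return self.mean_delay + self.safety * self.var

buf = AdaptiveJitterBuffer()
for d in [40, 42, 45, 41, 120, 43, 44, 300, 42]:   # ms, with spikes
    buf.on_packet(d)
    print(f"delay {d:3d} ms -> playout delay {buf.playout_delay():6.1f} ms")
```

Note how the delay spikes push the playout delay up and it only decays slowly: exactly the "audio delay becomes the maximum packet delay" effect described above.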
On Thu, Sep 21, 2023 at 3:34 PM William Herrin <bill@herrin.us> wrote:
On Thu, Sep 21, 2023 at 6:28 AM Tom Beecher <beecher@beecher.cc> wrote:
My understanding has always been that 30ms was set based on human perceptibility. 30ms was the average point at which the average person could start to detect artifacts in the audio.
Hi Tom,
Jitter doesn't necessarily cause artifacts in the audio. Modern applications implement what's called a "jitter buffer." As the name implies, the buffer collects and delays audio for a brief time before playing it for the user. This allows time for the packets which have been delayed a little longer (jitter) to catch up with the earlier ones before they have to be played for the user. Smart implementations can adjust the size of the jitter buffer to match the observed variation in delay so that sound quality remains the same regardless of jitter.
Indeed, on Zoom I barely noticed audio artifacts for a friend who was experiencing 800ms jitter. Yes, really, 800ms. We had to quit our gaming session because it caused his character actions to be utterly spastic, but his audio came through okay.
The problem, of course, is that instead of the audio delay being the average packet delay, it becomes the maximum packet delay.
Yes. I talked to this point in my APNIC session here: https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over... I called it "riding the TCP sawtooth": the compensating VoIP delay becomes equal to the maximum size of the buffer, and thus controls the jitter that way. Sometimes to unreasonable extents, like the 800ms in your example.
You start to have problems with people talking over each other because when they start they can't yet hear the other person talking. "Sorry, go ahead. No, you go ahead."
Regards, Bill Herrin
-- William Herrin bill@herrin.us https://bill.herrin.us/
-- Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html Dave Täht CSO, LibreQos
On 9/21/23 3:31 PM, William Herrin wrote:
Jitter doesn't necessarily cause artifacts in the audio. Modern applications implement what's called a "jitter buffer." As the name implies, the buffer collects and delays audio for a brief time before playing it for the user. This allows time for the packets which have been delayed a little longer (jitter) to catch up with the earlier ones before they have to be played for the user. Smart implementations can adjust the size of the jitter buffer to match the observed variation in delay so that sound quality remains the same regardless of jitter.
Indeed, on Zoom I barely noticed audio artifacts for a friend who was experiencing 800ms jitter. Yes, really, 800ms. We had to quit our gaming session because it caused his character actions to be utterly spastic, but his audio came through okay.
When I wrote my first implementation of telnet ages ago, I was both amused and annoyed by the go-ahead option. Obviously patterned after audio meat-space protocols, but I was never convinced it wasn't a solution in search of a problem. I wonder if CDMA was really an outgrowth of those protocols? But it's my impression that gaming is far more affected by latency, and thus by jitter buffers, than voice. Don't some ISPs even cater to gamers on latency? Mike
On 9/21/23 17:04, Michael Thomas wrote:
When I wrote my first implementation of telnet ages ago, I was both amused and annoyed by the go-ahead option. Obviously patterned after audio meat-space protocols, but I was never convinced it wasn't a solution in search of a problem. I wonder if CDMA was really an outgrowth of those protocols?
Typically seen with half-duplex implementations, like "Over" in two-way radio. Still used in TTY/TDD as "GA".
But it's my impression that gaming is far more affected by latency, and thus by jitter buffers, than voice. Don't some ISPs even cater to gamers on latency?
Yep. Dilithium crystal futures are up due to gaming industry demand. ;-) -- Jay Hennigan - jay@west.net Network Engineering - CCIE #7880 503 897-8550 - WB6RDV
On 9/22/23 9:42 AM, Jay Hennigan wrote:
On 9/21/23 17:04, Michael Thomas wrote:
When I wrote my first implementation of telnet ages ago, I was both amused and annoyed by the go-ahead option. Obviously patterned after audio meat-space protocols, but I was never convinced it wasn't a solution in search of a problem. I wonder if CDMA was really an outgrowth of those protocols?
Typically seen with half-duplex implementations, like "Over" in two-way radio. Still used in TTY/TDD as "GA".
Did that ever actually occur over the Internet such that telnet would need it? Half duplex seems pretty clearly to be an L1/L2 problem. IIRC, it was something of a pain to implement. Mike
Telnet sessions were often initiated from half-duplex terminals. Pushing that flow control across the network helped those users. -- Mark Andrews
On 23 Sep 2023, at 06:25, Michael Thomas <mike@mtcc.com> wrote:
Did that ever actually occur over the Internet such that telnet would need it? Half duplex seems pretty clearly to be an L1/L2 problem. IIRC, it was something of a pain to implement.
Mike
On 9/22/23 1:54 PM, Mark Andrews wrote:
Telnet sessions were often initiated from half-duplex terminals. Pushing that flow control across the network helped those users.
I'm still confused. Did it require the telnet users to actually take action? Like, would they manually need to enable the GA option? It's very possible that I didn't fully implement it, if so, since I didn't realize that. Mike
The implementation would look at the terminal characteristics and enable it as required. -- Mark Andrews
On 23 Sep 2023, at 08:33, Michael Thomas <mike@mtcc.com> wrote:
On 9/22/23 1:54 PM, Mark Andrews wrote: Telnet sessions were often initiated from half-duplex terminals. Pushing that flow control across the network helped those users.
I'm still confused. Did it require the telnet users to actually take action? Like, would they manually need to enable the GA option? It's very possible that I didn't fully implement it, if so, since I didn't realize that.
Mike
Hi Dave, You did not say: is it interactive? Because we could use big buffers and convert jitter into latency (some STBs have sub-second buffers). Then jitter effectively becomes zero (more precisely: not a problem), and we deal only with the consequences of latency. Hence, your question is not really about jitter; it is about latency.

Across all five (or six?) senses, the human is roughly a 25ms-resolution machine (limited by the animal part of our brain, the limbic system). Anything faster is "real-time". Even echo cancellation is not needed: we hear the echo but cannot separate the signals. A dog has 2x better resolution, a cat 3x. They probably hate the picture on cheap monitors (PAL/SECAM had 50Hz, NTSC had 60Hz).

That 25ms budget is for everything, round trip. 8ms is spent just on visualization, on the best screens (120Hz). The budget typically left for the networking part (speed of light in fiber) is about 5ms one way (1000km, or do you prefer miles?). Maybe less, depending on rendering in the GPU (3ms?), processing in the app (3ms?), the sensor capturing the initial signal (1ms?), and so on. The worst problem is that the jitter buffer is subtracted from the same 25ms budget. Hence, it is easy for the jitter buffer to consume the roughly 10ms we typically have for networking and leave us with just 1ms, which pushes us to install MEC (distributed servers in every municipality). Accounting for the jitter buffer, it is pretty hard to be "real-time" for humans. Hint: "pacing" is the solution. The application should send packets at equal intervals. This has been very much adopted by all the OTTs, and pacing has many other positive effects on networking besides.

The next level is our reaction time (the possibility to click). That is 150ms for some people, 250ms on average. Hence, gaming is quite affected by 50ms of one-way latency, because 2*50ms becomes comparable to 150ms and it affects the gaming experience. In addition to seeing the delay, we lose time: the enemy shoots us first.

The next level (for non-interactive applications) is limited only by the memory you can devote to the jitter buffer. Cinema would be fine even with a 5s jitter buffer, except for zapping (channel-change) time, but that is a different story. Eduard
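To make the pacing hint concrete: the usual trick is to key transmissions to an absolute schedule rather than sleeping a fixed amount after each send, so timing error does not accumulate. A minimal sketch; the destination, payload, and 20 ms interval are invented for illustration:

```python
# Sketch of pacing: transmit on an absolute 20 ms schedule rather than
# sleep-after-send, so per-packet timing error does not accumulate.
# Destination and payload are illustrative.
import socket
import time

def paced_send(sock, peer, frames, interval_s=0.020):
    next_deadline = time.monotonic()
    for frame in frames:
        sock.sendto(frame, peer)
        next_deadline += interval_s
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)   # absolute schedule absorbs send-time jitter

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
paced_send(sock, ("192.0.2.10", 4000), [b"\x00" * 160] * 50)
```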
participants (12):
- Brian Turnbow
- Chris Boyd
- Dave Taht
- Eric Kuhnke
- Howard, Lee
- Jay Hennigan
- Mark Andrews
- Michael Thomas
- Saku Ytti
- Tom Beecher
- Vasilenko Eduard
- William Herrin