Video Encoding Quality: A lot of useless data

shanix · January 6, 2022, 2:21am

Yeah I’m gonna pretend this is a research paper. It’s not, but it helps me with formatting and collecting my thoughts.

Abstract

Video compression is a science of art. It’s math that’s viewed subjectively, ephemerally, and smeared 20 to 60 times per second. So it’s no wonder that we argue all the time about settings without being able to quantify the way video makes us feel. I’m not going to present anything to change your mind.

Introduction

So I got bored one day and wanted to know, “how much does transcoding a file in Plex hurt the quality?” Pretty simple question, right? How bad can it possibly be? So I grabbed a video in my library, encoded it, and watched it again. Didn’t look too bad. But then I realized it was already compressed from a higher quality source, so maybe it was so low quality that I didn’t notice how bad it was? So I encoded it again, same settings. And it still looked fine.

That’s when I remembered that when my server transcodes, it uses an Nvidia 1060 to encode. Maybe the GPU makes it look worse? I watched a few minutes of it, making sure the GPU was transcoding, and again, didn’t notice a problem. So I did what any sane person would do - I grabbed a bunch of different files, set up a bunch of machines in my homelab, and started encoding like my life depended on it.

Thanks to some previous research, I know that there’s some math out there to actually quantify the difference in quality between reference and compressed video. Peak Signal-to-Noise Ratio is the classic, and Structural Similarity Index Measure was made for exactly this. And on top of that, noted Internet Content Delivery Company Netflix developed VMAF for their entire library of content. So I used those three metrics to compare the 450 final encodes I created.

Methods

Encoder Settings

You can find my encoding/calculation scripts, encoder presets, and ramblings at this GitHub repo. In short, I selected 9 videos to serve as “sources” for comparison (numbered 0 through 8, in the order listed):

A WEB-DL of a Hit TV Show that I own on Bluray. This was my initial test, because I wanted to see how bad the quality could get if it came initially compressed.
A chapter of a digitally produced movie ripped from one of my blurays, to represent the “best possible quality” source that we as consumers can acquire. I know that movies are mastered in the Gbps range or higher, and I think there’s one master available, but the chance that someone has an original master copy to compress is so slim I didn’t bother. (I also don’t have one. Netflix pls gib.) I specify digitally produced because I wanted to avoid film grain as an issue.
A chapter of an animated film, to see how well animation compresses (hint: VERY well)
The same chapter of an animated film, to see how well the Animation Tune works.
A high-action scene from a movie released on bluray to see quality loss for a hard-to-encode video.
The same movie above, in 4k, because I also own it on 4k. God, that takes a while to encode.
An older film re-released on bluray with heavy grain. As the opposite end of the spectrum from Source 1, I wanted to see how bad heavy grain makes an encode.
An older film re-released on bluray with heavy grain, with a denoise filter applied (note that denoising is CPU bound, and is also not available on Plex for transcoding, so this is mostly because I wanted to see it).
A chapter of a movie, heavily compressed (CRF 20 with slow preset), then compared against the original bluray source. This is probably the closest to realistic we’ll see.

All (except 3 and 7) were encoded with the following settings:

Video Encoder: H.264 (x264) / H.264 (NVENC) / H.264 (QSV) / H.265 (x265, 10-bit) / H.265 (NVENC, 8-bit)¹ / H.265 (QSV, 10-bit)
- These are effectively what are being tested
Framerate: Same as source
Encoder Preset: Slow (or Quality for QSV)
Encoder Tune: None
Encoder Profile & Level: Auto
Fast Decode: Disabled
Constant Quality, 22 RF
No filters
Audio passed through, no subtitle burn-in (or subtitles at all)

Encodes for Source 3 had the Tune set to Animation to evaluate its usage, but otherwise remained the same (and thus, produced fewer encodes because some would be equivalent to encodes from Source 2).

Encodes for Source 7 had the Denoise filter set to NLMeans with the Ultralight Preset and Tune set to None. This is what I use for encoding grainy material and wanted to evaluate encode speed/quality.

All settings can be loaded into HandBrake v1.4.2 from the linked github repo for verification/repetition.

Encodes Produced

All source files (save 3 and 7) were encoded once with each of the six encoders (h264, h265, h264 w/ nvenc, h265 w/ nvenc, h264 w/ QSV, h265 w/ QSV), then each output was encoded again with each of the same six encoders (e.g. encoded with h264 w/ QSV, then that output used as the input for an h265 encode). This produced 42 files per source: six first level encodes and 36 second level encodes (six per first level encode).

As mentioned earlier, sources 3 and 7 had a different number of encodes produced.

Source 3, the animated film with the animation tune, produced 36 encodes total. Since the Tune is only available for software encoding, not hardware accelerated, I effectively added two more encoding settings rather than an additional six. I also didn’t create the same encodes that Source 2 created except first level, since they logically would be the same. Thus, there were three encodes for the original six encoders (one first level, two second level that had Tune set to Animation), and nine encodes for the two Tune encoders (one first level and eight second level).

Source 7, the grainy film, had a similar setup to Source 3. However, since denoising is a filter, it’s CPU bound, not GPU bound. I was able to see the GPU doing some work, but not on the same scale as with other sources. And since the filter is available for all encoders, the count doubled to 12 possible encoders (all as stated, with and without denoising). As with Source 3, encodes that were already produced for Source 6 (grainy film without denoising) are not repeated for Source 7 unless needed as inputs for second level encoding. This resulted in 120 encodes for Source 7: 12 first level encodes, six second level encodes for each of the first six encoders, and 12 second level encodes for each of the six new encoders.
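The encode counts above can be sanity-checked with a few lines of arithmetic. This is only a sketch of the counting described in this section; the function names are made up for illustration and are not taken from the repo scripts.

```python
# Six base encoders: h264, h265, and the NVENC and QSV variants of each.
BASE_ENCODERS = 6

def standard_source_encodes(encoders: int = BASE_ENCODERS) -> int:
    """Each encoder once, then every first level output through every encoder again."""
    first_level = encoders
    second_level = encoders * encoders
    return first_level + second_level

def animation_tune_source_encodes() -> int:
    """Source 3: 3 encodes for each of the six original encoders, 9 for each of the two Tune encoders."""
    original = 6 * (1 + 2)  # one first level, two second level (with Tune)
    tuned = 2 * (1 + 8)     # one first level, eight second level
    return original + tuned

def denoise_source_encodes() -> int:
    """Source 7: 12 first level encodes, plus 6 or 12 second level encodes per encoder."""
    first_level = 12
    second_level = 6 * 6 + 6 * 12
    return first_level + second_level

# Seven sources use the standard layout; Sources 3 and 7 are the special cases.
total = 7 * standard_source_encodes() + animation_tune_source_encodes() + denoise_source_encodes()
print(standard_source_encodes(), animation_tune_source_encodes(), denoise_source_encodes(), total)
```

The total works out to the 450 final encodes mentioned in the Introduction.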

Hardware

All CPU encodes were encoded on a 3900X. All NVENC encodes were encoded on a system with a 3900X and a 1080ti running drivers v497.29. All QuickSync encodes were encoded on an E-2146G. I would have tested on a 3770 I have in my homelab but the encoder kept crashing no matter what settings I used, so I decided to not bother. Disappointing, but I can build another system in the future to compare. I also considered purchasing an HP 290 as recommended by the fine folks over at Serverbuilds.net, but considering those are listed as having the same generation iGPU, I decided it wasn’t worth it. I also had a P400 I could have tested with, but since it’s Pascal like the 1080ti, it wasn’t worth the setup time.

Gods, I wish someone would have given me free hardware for this. There’s still time folks, I bet Nvidia or AMD would love to show off how good their new hardware is! And Intel, hey, I hear Alder Lake and Xe want to compete too!

Results

Spreadsheet of results can be found here. For those opposed to Google, CSV files of the output are available in the Github repo (though you’ll miss out on my high quality highlighting, such a loss).

Each sheet (or CSV file) represents the summarized output of a source’s encodes. Column explanation:

Index, Source, and Output are just file information, used for tracking. Output is probably the only one to really care about, since it’s the description of the encoder(s) for each encode.
Encode FPS is the encoder’s rate of work done averaged across the entire encode duration. Higher is generally better.
Bitrate and Bitrate (kbps) are the bit rate of the video stream, in bits per second and kilobits per second. Generally, lower is better.
Under the VMAF Header:
- Mean is the average VMAF score of each frame. Higher is better. 6 points is generally considered to be the Just Noticeable Difference ².
- 1% Low and 0.1% Low are the averages of the lowest 1% and 0.1% scores. If a source had 3000 frames, then the 1% low is the average of the worst 30 frames, and 0.1% is the average of the worst 3 frames.
- Min is the minimum VMAF score.
- Harmonic Mean is the… Harmonic Mean… of the scores. It’s the reciprocal of the average of the reciprocals (the number of frames divided by the sum of each frame’s reciprocal score). Usually it lands very close to the Mean (as you can see in the findings, it’s between 0% and 0.2% different). It’s useful because it reduces the impact of large values and gets dragged down hard by small ones. So if the Mean is 80 but the Harmonic Mean is 20, well, there are a LOT of really bad frames hiding behind the good ones.
- Mean Diff is the percent difference between Mean and Harmonic Mean. I added it to the table as a way of quickly checking if the means were out of touch. And generally, they aren’t, which means there aren’t a lot of low quality frames in most encodes.
- Bitrate/Quality and Bitrate/Quality (H) are, and I cannot stress this enough, COMPLETE BULLSHIT. VMAF is a measure of relative quality (i.e. how good the encode looks compared to the original), not of absolute quality, and these metrics only really work with absolute measures. I used this as a rough measure of how many kilobits per second it takes to “gain” one VMAF point. This is best seen when comparing GPU to CPU encodes. Quite often, the GPU encodes are higher quality but with massive file sizes, so their B/Q values are massive as well. The (H) variant divides Bitrate by the Harmonic Mean; the plain one divides by the Mean.
Under PSNR and SSIM:
- Median corresponds to the median value of the scores. It’s not the mean, and I’m realizing this while typing this up, and I don’t feel like going back and calculating. Whatever.
- 1% Low and 0.1% Low are as before, the average value of the lowest 1% and 0.1% scores for all frames.
- Min is as before, the lowest score.
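For anyone who wants to recompute these columns from per-frame scores, here is a minimal sketch. It assumes the scores are already loaded as a plain list of floats and that bitrate is in kbps; `low_average` and `summarize` are illustrative names, not functions from the repo.

```python
from statistics import fmean, harmonic_mean

def low_average(scores, one_in):
    """Average of the worst 1/one_in of the frames (100 for 1% low, 1000 for 0.1% low)."""
    count = max(1, len(scores) // one_in)  # always keep at least one frame
    return fmean(sorted(scores)[:count])

def summarize(scores, bitrate_kbps):
    """Summary columns for one encode, from per-frame scores and the video bitrate."""
    mean = fmean(scores)
    # Frame count divided by the sum of reciprocals. A single 0 score makes this 0,
    # which would break the (H) division below, so real data may need a guard.
    h_mean = harmonic_mean(scores)
    return {
        "mean": mean,
        "harmonic_mean": h_mean,
        "mean_diff_pct": (mean - h_mean) / mean * 100,
        "low_1pct": low_average(scores, 100),
        "low_0_1pct": low_average(scores, 1000),
        "min": min(scores),
        "bitrate_per_quality": bitrate_kbps / mean,
        "bitrate_per_quality_h": bitrate_kbps / h_mean,
    }

# Hypothetical clip: 80% of frames score 100, 20% score 5.
scores = [100.0] * 2400 + [5.0] * 600
summary = summarize(scores, bitrate_kbps=4000)
print(summary)
```

In this made-up clip the Mean is 81 while the Harmonic Mean is about 20.8, which is the kind of gap the Mean Diff column is there to catch.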

PSNR and SSIM scores have been coded based off widely accepted values.

PSNR is flagged as Yellow above 45 dB, Red below 35 dB, and Green in between. It’s commonly accepted that PSNR over 45 dB indicates data that users will not notice (i.e. you’ve wasted data by sending them quality they can’t perceive) and below 35 dB will be noticeably not good (i.e. you shouldn’t’ve encoded this segment so hard, they’ll notice artifacting) ².
SSIM is flagged as Green above .99, Yellow between .88 and .99, and Red below .88. Researchers have mapped subjective values to SSIM scores ³, and the rough metric is >= .99 is “Imperceptible”, .99 > SSIM >= .95 is “Perceptible but not annoying”, .95 > SSIM >= .88 is “Slightly annoying”, and below .88 is annoying or worse.
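The colour coding boils down to a couple of threshold checks. The sketch below mirrors the cut-offs listed above; how exact boundary values (e.g. an SSIM of exactly .99) are treated is an assumption on my part, and the function names are illustrative only.

```python
def psnr_flag(psnr_db: float) -> str:
    """Yellow above 45 dB (wasted bits), Red below 35 dB (visible artifacts), Green otherwise."""
    if psnr_db > 45:
        return "yellow"
    if psnr_db < 35:
        return "red"
    return "green"

def ssim_flag(ssim: float) -> str:
    """Green at .99 and up, Yellow from .88 up to .99, Red below .88."""
    if ssim >= 0.99:
        return "green"
    if ssim >= 0.88:
        return "yellow"
    return "red"

def ssim_rating(ssim: float) -> str:
    """Subjective label for an SSIM score, using the mapping quoted above."""
    if ssim >= 0.99:
        return "Imperceptible"
    if ssim >= 0.95:
        return "Perceptible but not annoying"
    if ssim >= 0.88:
        return "Slightly annoying"
    return "Annoying or worse"

print(psnr_flag(41.2), ssim_flag(0.97), ssim_rating(0.97))
```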

Discussion

With all that information thrown at you, here are my conclusions:

GPUs add a fuckton of data for minimal quality add. For example, WEB-DL 00100 vs 00300, h264 vs. h264 with NVENC. If you looked at quality or FPS, you’d say it’s the best. 277FPS (which would be 11.5 concurrent transcodes!) and a VMAF score of 96.75, it blows h264 out of the water. Except, if you look at bitrate, it’s nearly three times as much data! No wonder it’s so high quality, it’s barely compressing the file at all! In fact, this data isn’t on the spreadsheet, but 00300 is only 7.5% smaller than the reference file (compared to 00100 being 63% smaller than reference). This is repeated in every case, every encoder.
- GPUs are fine if you have a client that can’t direct play/stream one of your videos and your CPU can’t keep up, but should be avoided otherwise. If you use a GPU to pre-encode video, just stop now. If you keep doing it, you’re an idiot. CPU encoding will take longer but you’ll end up with less storage used (and, if you’re coming from Quicksync, probably higher quality) per video.
QuickSync, even on Coffee Lake, is less than ideal. I can’t say the HP290 (or whatever the contemporary version is) QSV box ain’t a good/value option for a Plex server, but I would not use that iGPU for encoding video ahead of time unless I absolutely had to (and neither I nor you have to).
There’s… not actually much of a quality loss from twice encoded video. Shocked, honestly. Even for the WEB-DL, which is effectively thrice encoded, there wasn’t a massive loss of quality in cases like 00101 or even 00602 (though I don’t have the original original to compare against). Looking at Source 2 and Source 3, it’s clear that if you have more data to work with you’ll have better quality encodes (Surprising, I know). But even encoding an h264 WEB-DL to h265 would be barely noticeable for up to 80% space savings. I’m not gonna start re-encoding these videos, but it’s made me less apprehensive about it.
- If any of your clients have to transcode, you might be able to rest easy knowing the quality loss ain’t that bad, actually. Maybe.
- I do want to note that twice encoding generally doesn’t do more than shave a few percent off the total file size. Generally encoding from h265 to h264 results in a higher file size, but only if you’ve encoded the h265 yourself. If you’re ripping a 4k bluray (which are almost always h265), then h264 will still be smaller.
The Animation Tune is totally worth it, for 2D animated content. Animation compresses really well, that’s been known for a while, but it’s great to see it proved again. I want to point out that 30900, just a straight h265 with tune, is 3 percent of the reference file size with a VMAF score of just under 94. What the fuck.
Compressing already compressed media is probably the dumbest thing I’ve done, and I’ve willingly done all of this work already, so it doesn’t get much dumber. But it’s good to prove that, yes, at a certain point you end up wasting CPU/GPU cycles. If at all possible, always encode from the highest quality source you can find, just so the encoder has as much data as possible to throw away.
Denoising grainy content is worth it, if you can stomach the encode times. The average bitrate of all denoised encodes is about 2Mbps lower than the average of all grainy encodes, for less than a point lost in the VMAF score, half a decibel in PSNR, and 0.03 points in SSIM. From a user perspective, it’s a big savings on data for barely any quality loss.
- Scene rules recommend encoding grainy content with average bitrate, not CRF, which I’ll probably investigate eventually. Scene Rules are accepted for a reason.
There is a LOT of data that can be compressed in 4k releases. Compress away.
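The NVENC example in the first conclusion is easy to check by hand. The sketch below assumes the source plays at 24 fps (the post doesn’t state the frame rate) and uses the quoted file-size reductions; the function names are illustrative only.

```python
def concurrent_transcodes(encode_fps: float, playback_fps: float = 24.0) -> float:
    """How many real-time streams an encoder running at encode_fps could feed."""
    return encode_fps / playback_fps

def relative_size(percent_smaller: float) -> float:
    """Output size as a fraction of the reference, given how many percent smaller it is."""
    return 1 - percent_smaller / 100

streams = concurrent_transcodes(277)                     # h264 NVENC encode, 00300
size_ratio = relative_size(7.5) / relative_size(63)      # 00300 file vs. 00100 file
print(f"{streams:.1f} real-time streams, NVENC file is {size_ratio:.1f}x the x264 file")
```

The file-size ratio comes out a bit under the “nearly three times” bitrate figure, which is plausible since file size also includes the passed-through audio while the bitrate column covers only the video stream.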

All jokes aside, please, if you take anything from all this, let it be this one thing: STOP USING YOUR FUCKING GPU TO PRE-ENCODE VIDEO. IT AIN’T SAVING YOU THAT MUCH SPACE AND IT AIN’T THAT MUCH BETTER QUALITY.

Flaws

Lack of trans-generational hardware for hardware comparison (e.g. no 3080ti vs 1080ti, no v2 QSV vs. v6 QSV), would’ve been nice to see how things have/n’t improved over the years. If I ever get a 30 series card I’ll probably update the spreadsheet if I notice a big difference.
Lack of AMD Hardware. Would have liked to see how they compare too, even if few people use their hardware encoder.
Use of HandBrake rather than ffmpeg. I’d happily use ffmpeg if I didn’t have a day job that I put my mental energy into. HandBrake has a GUI, saves presets as JSON, and can run those presets from the command line. Any performance or quality loss is worth it.
- Ah fuck, catch me learning ffmpeg within a year to update this.
I really should have used average bitrate with the presets that Plex uses, since that was the original reason for all of this. It’s still useful to know that encoding from one codec to another isn’t a major loss in quality, whether you use a GPU or not, so long as your source has enough data that it can still discard things. It might even make it faster, like 10100 vs. 10101, or 10200 vs. 10202 (which makes sense, less data means less work for the encoder, for better and worse).
- Sadly, I don’t know exactly what Plex is doing, beyond resolution and possibly average bitrate (average bitrate is the only thing that makes sense considering options are “Resolution Bit Rate”). Maybe one of their engineers will tell me, and I can benchmark for them, lol.
Not testing other RF values. I think it’d be useful to have a bit more of a spread so people can start figuring out where they want to encode media. But, in my very honest and gatekeeping opinion, that’s a journey everyone has to undertake alone.
- I did the math while I was waiting for the 4k content to be VMAF’d/PSNR’d/SSIM’d and if I ended up testing denoising (both algorithms and all strengths), all the stated encoders (with GPUs enabled and disabled), with and without the Animation Tune, and every RF in increments of 2 from 18 to 30 (inclusive), I’d end up with like 69,000 encodes per source. Pretty nice, but also, I want to use my computers at some point this decade. And I categorically refuse to do 69,000 encodes of 4k, when it takes on average about 6.5 minutes per encode (so about literally 317 days STRAIGHT of just encoding, not even computing scores). I’d definitely buy a lot more hardware to parallelize things.
Not encoding to 720p or 480p and comparing with VMAF. It can do the comparison, as long as ffmpeg scales the encode back up to source size while doing so. Since Plex defaults to 720p 2Mbps, that’s an obvious target to check next time I’m inspired for this kind of hell.
Not sleeping enough. That has nothing to do with encoding but I should be sleeping more either way.
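The back-of-the-envelope estimate in the RF bullet above is simple to reproduce. This sketch uses the rounded figures from that bullet (69,000 encodes, 6.5 minutes each); the machine count is a hypothetical parameter for the “buy more hardware” scenario.

```python
# RF values from 18 to 30 inclusive, in steps of 2.
RF_VALUES = list(range(18, 31, 2))

ENCODES_PER_SOURCE = 69_000   # rounded figure quoted above
MINUTES_PER_4K_ENCODE = 6.5   # average quoted above

def total_days(encodes: int, minutes_each: float, machines: int = 1) -> float:
    """Wall-clock days of non-stop encoding, split evenly across identical machines."""
    return encodes * minutes_each / machines / (60 * 24)

print(len(RF_VALUES), "RF values:", RF_VALUES)
print(f"{total_days(ENCODES_PER_SOURCE, MINUTES_PER_4K_ENCODE):.0f} days on one machine")
print(f"{total_days(ENCODES_PER_SOURCE, MINUTES_PER_4K_ENCODE, machines=4):.0f} days on four machines")
```

With the rounded 69,000 figure this lands at roughly 311 days on one machine, in the same ballpark as the 317 quoted above, which presumably came from the unrounded encode count.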

Footnotes/References

¹ HandBrake v1.4.2 does not support 10 bit for NVENC encoding. This issue seems to say it does and would be deployed in v1.4.0, yet, it ain’t for me. Perhaps it’s a hardware limitation.
² Finding the Just Noticeable Difference with Netflix VMAF
³ Mapping SSIM and VMAF scores to subjective ratings

Thanks

I have to express my heartfelt thanks to (in no particular order):

jlesage and the maintainers of ffmpeg-quality-metrics, for saving me so much goddamn headache through all of this.
Jan Ozer, whose book Video Encoding by the Numbers inspired this. Fantastic read, should be required for anyone hosting their own Plex (or similar) server. So much information made available and easy to follow.
DenverCoder9, for their immense help getting this off the ground.

TL;DR

STOP USING YOUR FUCKING GPU TO ENCODE VIDEO THAT YOU’LL TRANSCODE LATER. If I catch any of y’all using Tdarr to pre-encode your media with your Nvidia or Intel GPUs I’ll rip your head off and shit in your shoulders. When setting up your encoding profiles, you don’t have to worry much if transcoding is hurting your experience, unless you’re a sicko who transcodes 1080p content to 720p or lower.

monotux · January 8, 2022, 7:05pm

This is amazing, and reminded me of a prior job[^1] with the dryness and the level of details. Well done!

Do you think the GPU bitrate increase is due to the hardware generation, and that it might improve with a newer card?

And how do you handle these data, do you just avoid transcoding?

[^1]: Working with speech quality assessments at a telecom company, odds are that all of you have made a phone call through the systems I tested. That company had something like 80% of the world market at that point!

shanix · January 8, 2022, 8:06pm

> This is amazing, and reminded me of a prior job[^1] with the dryness and the level of details. Well done!

Thank you, that means a lot!

> Do you think the GPU bitrate increase is due to the hardware generation, and that it might improve with a newer card?

Honestly, I have no idea exactly why the bitrate is so much larger (everything I’ve read boils down to “that’s just how GPU acceleration is”) but I do intend on testing with newer hardware when I get my hands on it. I actually have a system with a 3770k I was hoping to use for testing QSV v2 vs. v6 but it kept erroring out when trying to encode. So I’m stuck waiting for a 3000 series card to show up at a reasonable price.

> And how do you handle these data, do you just avoid transcoding?

I actually started this project to see how bad transcoding could be, and if my data has any correlation to the real world I’ll still keep transcoding media if I need to. But I also pre-encode all the media I own on bluray before it goes on my Plex server, so, it’s not been a big issue.

easyrhino · January 11, 2022, 8:25pm

This info doesn’t help me directly but I love that you compiled it. Thanks!

shanix · January 11, 2022, 8:44pm

I would be surprised if it helped anyone, and thank you nonetheless!

JDM_WAAAT · January 11, 2022, 9:28pm

You’ll fit in here nicely.

shanix · January 12, 2022, 12:42am

Spreadsheet updated with data from 4k content, grainy content (both compressed normally and with a denoise filter), and content whose reference was heavily compressed before being twice encoded (then compared against the original bluray reference).

Dissertation and scripts coming soon.

shanix · January 12, 2022, 4:02am

Main post has been updated. I have no idea why I formatted it like a research paper, but enjoy what is arguably the best and most coherent paper I’ve ever written. Including my senior thesis.

owenthewizard · June 8, 2022, 8:24pm

Wow, this is fantastic!