It's a big omission to claim "minimalist" but then have no information about code size. Nonetheless, as someone who has written an H.261 through H.263 decoder as a learning exercise, it's good to see more people writing video codecs. Getting high performance may not be straightforward, but the algorithms themselves are well-defined by the standard.
Access to left/top macroblock values is done with direct offsets in memory instead of copying their values to a buffer beforehand.
I made use of this technique too, so I think it's not particularly novel nor non-obvious. The performance-sensitivity of video decoding necessarily means avoiding any extraneous data movement whenever possible.
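For anyone unfamiliar with the trick, here is a rough sketch of what "direct offsets instead of copying" means. This is an invented layout for illustration, not edge264's actual structs:

```c
#include <stdint.h>

/* Sketch of the neighbour-offset idea (invented layout): macroblock state
 * lives in one contiguous array in decoding order, so the left neighbour is
 * at mb[-1] and the top neighbour is one macroblock row back, with no copying
 * into a separate context buffer first. */
typedef struct {
    int8_t  nz_count[24];   /* coded-coefficient counts per 4x4 block */
    int16_t mv[16][2];      /* motion vectors per 4x4 block           */
} MBInfo;

static inline int left_nz(const MBInfo *mb, int blk)
{
    return mb[-1].nz_count[blk];           /* previous macroblock in memory */
}

static inline int top_nz(const MBInfo *mb, int blk, int mbs_per_row)
{
    return mb[-mbs_per_row].nz_count[blk]; /* one macroblock row back       */
}
```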
Also worth noting: H.264 patents have already expired in most of the world: https://meta.wikimedia.org/wiki/Have_the_patents_for_H.264_M...
Talk by one of the authors - https://archive.fosdem.org/2025/schedule/event/fosdem-2025-5...
I like the `VARIANTS` env. var [0] to take advantage of newer x86_64 extensions if one’s processor has them.
CachyOS is a whole distro compiled with these flags where possible, which is appealing.
[0] https://github.com/tvlabs/edge264#compiling-and-testing
I wonder why they use multiple executables instead of something like function multiversioning [0]
[0] https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning....
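For reference, GCC-style function multiversioning looks roughly like this (a minimal sketch; the function name is invented):

```c
#include <stddef.h>

/* GCC emits one clone per listed target plus an IFUNC resolver; the dynamic
 * linker picks the best clone at load time, so the call itself is still an
 * indirect call through the GOT/PLT rather than a direct, inlinable one.  */
__attribute__((target_clones("avx2", "sse4.1", "default")))
void add_residual(short *dst, const short *res, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += res[i];
}
```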
Function multiversioning would require indirect jumps/indirect calls, wouldn't it? Separate executables can do static jumps/calls.
On Linux it uses IFUNC, resolved at load/dynamic relocation time, so at runtime it's the same cost as any other (relocatable) function call. But they're "static" in that the target isn't a calculated address, so it's pretty easy for a superscalar CPU to follow.
So it does have some limitations like not being inlined, same as any other external function.
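At the source level it's roughly this (a hand-written glibc IFUNC on Linux/ELF with GCC or Clang; names invented):

```c
/* The resolver runs once during dynamic relocation; its result is written
 * into the GOT slot, so every later call is an ordinary, already-resolved
 * indirect call. */
static void idct_generic(short *blk) { (void)blk; /* scalar fallback */ }

__attribute__((target("avx2")))
static void idct_avx2(short *blk)    { (void)blk; /* AVX2 path */ }

static void (*resolve_idct(void))(short *)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? idct_avx2 : idct_generic;
}

void idct(short *blk) __attribute__((ifunc("resolve_idct")));
```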
Since TEXTREL is basically gone these days (for good reasons!), IFUNC is the same as any other call relocatable to a target not in the same DSO: it goes through either the GOT or the PLT, either of which ends up as an indirect call (or branch, if the compiler feels like it and the PLT isn't involved). Which is what the person you're replying to said :)
A relocatable call within the same DSO can be a PC-relative relocation, which is not a relocation at all when you load the DSO and ends up as a plain PC-relative branch or call.
What about essentially duplicating the entire executable a few times, and jumping to the right version at the very beginning of execution?
You have bigger binaries, but the logistics are simplified compared to shipping multiple binaries and you should get the same speed as multiple binaries with fully inlined code.
Since they don't seem to be doing that, my question is: what's the caveat I'm missing? (Or are the bigger binaries enough of a caveat by themselves?)
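A toy sketch of what I mean (in practice the two copies would be the whole decoder compiled twice with different -march settings and linked under different names; the stubs here are just placeholders):

```c
#include <stdio.h>

/* Placeholder stubs standing in for two full copies of the decoder. */
static void decode_stream_generic(const char *path) { printf("generic path: %s\n", path); }
static void decode_stream_avx2(const char *path)    { printf("avx2 path: %s\n", path); }

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "input.264";
    __builtin_cpu_init();
    /* Branch exactly once at startup; the chosen copy then runs with all of
     * its internal calls inlined and statically resolved. */
    if (__builtin_cpu_supports("avx2"))
        decode_stream_avx2(path);
    else
        decode_stream_generic(path);
    return 0;
}
```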
There's no need to do any of that: a table of function pointers to DSP functions works fine.
It can be useful to duplicate the entire code for 8-bit vs 10-bit pixels because that does affect nearly everything.
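Something along these lines, i.e. roughly the FFmpeg/dav1d pattern (names invented, not edge264's actual code): fill a struct of function pointers once at init based on detected CPU features, then call through it in the hot path.

```c
#include <stddef.h>
#include <stdint.h>

/* One-time init fills the table; the per-block hot path calls through these
 * pointers, which predict well since they never change after init. */
typedef struct {
    void (*idct4x4)(int16_t *blk);
    void (*deblock_luma)(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);
} H264DSPContext;

void idct4x4_c(int16_t *blk);
void idct4x4_avx2(int16_t *blk);
void deblock_luma_c(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);
void deblock_luma_avx2(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);

void h264_dsp_init(H264DSPContext *dsp)
{
    dsp->idct4x4      = idct4x4_c;
    dsp->deblock_luma = deblock_luma_c;
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        dsp->idct4x4      = idct4x4_avx2;
        dsp->deblock_luma = deblock_luma_avx2;
    }
#endif
}
```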
Ideally you only need to duplicate until you hit the first not-inlined function call; at that point there’s nothing gained and it’s just a waste of binary size.
This may eventually be better for people working in the cloud. Shame there's no Apple Silicon support.
(See also Cisco's openh264, which supports decoding)
Don't all Apple Silicon devices have extremely good (in both speed and feature coverage) H.264 hardware decoders already?
Yes. If there's a hole in macOS's VideoToolbox support, it's the middling quality of their hardware-accelerated encoder, so people who want high-quality encodes will generally use x264/x265 for that.
Yes, H.264 is in hardware on Apple Silicon.
But as a software decoder which is specifically made to not use hardware APIs for decoding, I am not sure why they skipped ARM64 on non-Linux platforms.
But why go through the trouble of building and shipping a software decoder for a platform you know has no devices which need such a thing? On the other hand, it's not too hard to find ARM64 Linux devices which need an efficient software decoder (either because there isn't a hardware one at all, the one that is there is limited in feature support, or the one that is there is hybrid but written so poorly that a good software decoder is more efficient).
What if there were some intelligence to test for extensions and auto-switch to them when available? If you specify it manually, it already supports x64-specific instructions via the ${VARIANTS} env var.
https://github.com/tvlabs/edge264/blob/5a3c19fc0ccacb03f9841...
Out of curiosity, what does “in hardware” actually mean in this context? Is it pure VHDL? Microcode that leverages special primitives? Something else?
In the case of Apple AVD, it's a multi-stage system with a bunch of special primitives, orchestrated by a Cortex-M3 with firmware. Codec-specific frontends emit IR which a less specialized backend can execute.
https://github.com/eiln/avd
This really heavily depends on the device, though. There are all sorts of "hardware" video decoders, ranging from fairly generic vector coprocessors running firmware to "pure" HDL/VLSI-level implementations. Usually on more modern or advanced hardware you'll see more of it become general purpose, since a lot of the later stages can be shared across codecs, saving area vs. a pure hardware implementation.
I don't see why this would support Linux arm64 but not macOS.
Anyway, you can just use libavcodec, which is faster (because of frame based multithreading) and doesn't operate on the mistaken belief that it's a good idea to use SIMD intrinsics.
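For reference, enabling frame threading in libavcodec is just a couple of fields at open time (a minimal sketch, error handling omitted):

```c
#include <libavcodec/avcodec.h>

/* Minimal decoder setup with frame-based multithreading enabled.
 * thread_count = 0 lets FFmpeg pick one thread per core. */
AVCodecContext *open_h264_decoder(void)
{
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    AVCodecContext *ctx  = avcodec_alloc_context3(codec);
    ctx->thread_count = 0;               /* auto: one thread per core       */
    ctx->thread_type  = FF_THREAD_FRAME; /* decode whole frames in parallel */
    avcodec_open2(ctx, codec, NULL);
    return ctx;
}
```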
I had no issues getting this to build, pass tests, and render a video on ARM64 macOS.