I designed my own fast game streaming video codec – PyroWave

themaister.net ・ 430 points ・ Bogdanp ・ 4 days ago ・ 138 comments

Almondsetat ・ 4 days ago

VC-2 is an intra-only, wavelet-based, ultra-low-latency codec developed by the BBC years ago for exactly this purpose. It is royalty-free, and currently the only implementations are in ffmpeg and in the official BBC repository, and are CPU-based. I am planning to make a CUDA-accelerated version for my master's thesis, since the Vulkan implementations made at GSoC last year still suck quite a bit. I would suggest people look into this codec.
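
To give a feel for what "wavelet-based" means here, the following is a minimal sketch of one level of an integer Haar lifting step on a single row of samples. It is only an illustration of the general technique: the function names are made up for this example, and it is not claimed to be the filter or the code path used by VC-2, ffmpeg, or PyroWave.

```python
def haar_forward(signal):
    """One level of an integer Haar lifting transform on an even-length list.

    Returns (low, high): low holds floor-averages of each sample pair,
    high holds the differences. Only integer adds and shifts are used.
    """
    low, high = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        d = b - a           # detail (high-pass) coefficient
        s = a + (d >> 1)    # approximation (low-pass) coefficient
        low.append(s)
        high.append(d)
    return low, high


def haar_inverse(low, high):
    """Undo haar_forward exactly (the lifting steps are reversible)."""
    out = []
    for s, d in zip(low, high):
        a = s - (d >> 1)
        b = a + d
        out.extend([a, b])
    return out


row = [10, 12, 12, 12, 200, 202, 50, 40]
low, high = haar_forward(row)
restored = haar_inverse(low, high)
```

In smooth regions the `high` coefficients are small, which is what makes them cheap to quantize and code; each row can be transformed independently, which suits an intra-only, low-latency design.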

_kb ・ 3 days ago

Definitely a neat codec! You can get COTS hardware en/decoders that use it via https://atlona.com/omnistream-av-over-ip/.
averne_ ・ 3 days ago

Do you mind going in some detail as to why they suck? Not a dig, just genuinely curious.
- Almondsetat ・ 3 days ago
  
  95% GPU usage but only 2x faster than the reference SIMD encoder/decoder
- actionfromafar ・ 3 days ago
  
  What I wonder is, how do you get the video frames to be compressed from the video card into the encoder?
  The only frame capture APIs I know take the image from the GPU to CPU RAM; then you can put it back into the GPU for encoding.
  Are there APIs which can sidestep the "load to CPU RAM" part?
  Or is it implied that a game streaming codec has to be implemented with custom GPU drivers?
  
  Almondsetat ・ 3 days ago
  
  Some capture cards (Blackmagic comes to mind) have worked together with NVIDIA to expose DMA access. This way video frames are automatically transferred from the card to the GPU memory bypassing the RAM and CPU. I think all GPU manufacturers expose APIs to do this, but it's not that common in consumer products.
  
  Const-me ・ 3 days ago
  
  > Are there APIs which can sidestep the "load to CPU RAM" part?
  On Windows, that API is Desktop Duplication. The API delivers D3D11 textures, usually in BGRA8_UNORM format. When HDR is enabled, you need a slightly different API method which can deliver HDR frames in RGBA16_FLOAT pixel format.
  
  mmozeiko ・ 3 days ago
  
  There's also Windows.Graphics.Capture. It lets you get a texture not only for the whole desktop, but also for individual windows.
  
  LtdJorge ・ 3 days ago
  
  On Linux you should look into GStreamer and dmabuf.
oplav ・ 3 days ago

In your experience, how does VC-2 compare to JPEG XS from a quality perspective? The JPEG XS resources I’ve seen say JPEG XS has higher visual quality, but curious what it’s like in practice.
- Almondsetat ・ 3 days ago
  
  JPEG-XS is an almost direct successor to VC-2. They use the same techniques, and if you read JPEG-XS's whitepaper, they explicitly cite VC-2 as an inspiration and a target to surpass. JPEG-XS is an improvement, there is no doubt about that, but unfortunately they decided to patent it for all uses. In both cases, the publicly available software implementations are very few and CPU-based, and the ones that aren't are implemented in hardware inside business AV solutions.

_kb ・ 4 days ago

This is a really nice walkthrough of matching trade offs to acceptable distortions for a known signal type. Even if you’re selecting rather than designing a codec, it’s a great process to follow.

For those interested in the ultra low latency space (where you’re willing to trade a bit of bandwidth to gain quality and minimise latency), VSF have a pretty good wrap up of other common options and what they each optimise for: https://static.vsf.tv/download/technical_recommendations/VSF...

sippeangelo ・ 4 days ago

I know next to nothing about video encoding, but I feel like there should be so much low hanging fruit when it comes to videogame streaming if the encoder just cooperated with the game engine even slightly. Things like motion prediction would be free since most rendering engines already have a dedicated buffer just for that for its own rendering, for example. But there's probably some nasty patent hampering innovation there, so might as well forget it!

torginus ・ 4 days ago

'Motion vectors' in H.264 are a weird bit twiddling/image compression hack and have nothing to do with actual motion vectors.
- In a 3d game, a motion vector is the difference between the position of an object in 3d space from the previous to the current frame
- In H.264, the 'motion vector' is basically saying - copy this rectangular chunk of pixels from some point from some arbitrary previous frame and then encode the difference between the reference pixels and the copy with JPEG-like techniques (DCT et al)
This block copying is why H.264 video devolves into a mess of squares once the bandwidth craps out.
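
The block-copy-plus-difference scheme described above can be sketched in a few lines. This is a simplified illustration, not H.264 itself: real codecs add sub-pixel interpolation, variable block sizes, and a transform and quantization of the residual. The helper names and the tiny test frames are made up for this example.

```python
def predict_block(ref, x, y, mv, size):
    """Copy a size x size block from the reference frame, displaced by mv=(dx, dy)."""
    dx, dy = mv
    return [[ref[y + dy + r][x + dx + c] for c in range(size)] for r in range(size)]


def residual(cur, pred, x, y, size):
    """Difference between the actual block in the current frame and its prediction."""
    return [[cur[y + r][x + c] - pred[r][c] for c in range(size)] for r in range(size)]


def reconstruct(pred, res):
    """Decoder side: prediction plus (ideally small) residual gives the block back."""
    return [[p + e for p, e in zip(prow, erow)] for prow, erow in zip(pred, res)]


# Reference frame: a simple gradient. Current frame: same content shifted right by 1.
ref = [[10 * r + c for c in range(6)] for r in range(4)]
cur = [[ref[r][c - 1] if c >= 1 else 0 for c in range(6)] for r in range(4)]

x, y, size = 2, 1, 2
mv = (-1, 0)  # "this block came from one pixel to the left in the previous frame"
pred = predict_block(ref, x, y, mv, size)
res = residual(cur, pred, x, y, size)
rebuilt = reconstruct(pred, res)
```

With a good vector the residual is all zeros and costs almost nothing to send. When the bitrate collapses, the residual is what gets starved, leaving the raw copied blocks visible.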
- pornel ・ 4 days ago
  
  Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.
  In typical video encoding motion compensation of course isn't derived from real 3D motion vectors, it's merely a heuristic based on optical flow and a bag of tricks, but in principle the actual game's motion vectors could be used to guide video's motion compensation. This is especially true when we're talking about a custom codec, and not reusing the H.264 bitstream format.
  Referencing previous frames doesn't add latency, and limiting motion to just displacement of the previous frame would be computationally relatively simple. You'd need some keyframes or gradual refresh to avoid "datamoshing" look persisting on packet loss.
  However, the challenge is in encoding the motion precisely enough to make it useful. If it's not aligned with sub-pixel precision it may make textures blurrier and make movement look wobbly almost like PS1 games. It's hard to fix that by encoding the diff, because the diff ends up having high frequencies that don't survive compression. Motion compensation also should be encoded with sharp boundaries between objects, as otherwise it causes shimmering around edges.
  
  CyberDildonics ・ 4 days ago
  
  > Motion vectors in video codecs are an equivalent of a 2D projection of 3D motion vectors.
  3D motion vectors always get projected to 2D anyway. They also aren't used for moving blocks of pixels around, they are floating point values that get used along with a depth map to re-rasterize an image with motion blur.
  
  pornel ・ 4 days ago
  
  They are used for moving pixels around when used in Frame Generation. P-frames in video codecs aim to do exactly the same thing.
  Implementation details are quite different, but for reasons unrelated to motion vectors — the video codecs that are established now were designed decades ago, when use of neural networks was in infancy, and the hardware acceleration for NNs was way outside of the budget of HW video decoders.
  
  CyberDildonics ・ 3 days ago
  
  There is a lot to unpack here.
  First, neural networks don't have anything to do with this.
  Second, generating a new frame would be optical flow and it always is 2D, there is no 3D involved because it's from a 2D image not a 3D scene.
  https://en.wikipedia.org/wiki/Optical_flow https://docs.opencv.org/3.4/d4/dee/tutorial_optical_flow.htm...
  Third, optical flow isn't moving blocks of pixels around by an offset then encoding the difference, it is creating a floating point vector for every pixel then re-rasterizing the image into a new one.
  
  pornel ・ 3 days ago
  
  You've previously emphasised use of blocks in video codecs, as if it was some special distinguishing characteristic, but I wanted to explain that's an implementation detail, and novel video codecs could have different approaches to encoding P-frames. They don't have to code a literal 2D vector per macroblock that "moves pixels around". There are already more sophisticated implementations than that. It's an open problem of reusing previous frames' data to predict the next frame (as a base to minimize the residual), and it could be approached in very different ways, including use of neural networks that predict the motion. I mention NNs to emphasise how different motion compensation can be than just copying pixels on a 2D canvas.
  Motion vectors are still motion vectors regardless of how many dimensions they have. You can have per-pixel 3D floating-point motion vectors in a game engine, or you can have 2D-flattened motion vectors in a video codec. They're still vectors, and they still represent motion (or its approximation).
  Optical flow is just one possible technique of getting the motion vectors for coding P-frames. Usually video codecs are fed only pixels, so they have no choice but to deduce the motion from the pixels. However, motion estimated via optical flow can be ambiguous (flat surfaces) or incorrect (repeating patterns), or non-physical (e.g. fade-out of a gradient). Poorly estimated motion can cause visible distortions when the residual isn't transmitted with high-enough quality to cover it up.
  3D motion vectors from a game engine can be projected into 2D to get the exact motion information that can be used for motion compensation/P-frames in video encoding. Games already use it for TAA, so this is going to be pretty accurate and authoritative motion information, and it completely replaces the need to estimate the motion from the 2D pixels. Dense optical flow is a hard problem, and game engines can give the flow field basically for free.
  You've misread what I've said about optical flow earlier. You don't need to give me Wikipedia links, I implement codecs for a living.
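
As a rough illustration of projecting engine motion into 2D, here is a minimal sketch assuming a simple pinhole camera and points already expressed in camera space. The function names, focal length, and image centre are assumptions for this example; a real engine would do this per pixel on the GPU using its own view and projection matrices.

```python
def project(point, focal_px, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to pixel coordinates."""
    x, y, z = point
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def screen_motion(prev_point, cur_point, focal_px, cx, cy):
    """2D motion vector in pixels: where the point is now minus where it was."""
    px, py = project(prev_point, focal_px, cx, cy)
    qx, qy = project(cur_point, focal_px, cx, cy)
    return (qx - px, qy - py)


# Hypothetical 1920x1080 view with a 1000-pixel focal length.
FOCAL, CX, CY = 1000.0, 960.0, 540.0

# The same 3D displacement (0.125 units to the right) at two different depths.
near = screen_motion((0.0, 0.0, 10.0), (0.125, 0.0, 10.0), FOCAL, CX, CY)
far = screen_motion((0.0, 0.0, 20.0), (0.125, 0.0, 20.0), FOCAL, CX, CY)
```

The same 3D displacement yields a screen-space vector half as long at twice the depth, and the result is fractional in general, which is where the sub-pixel precision concern raised earlier in the thread comes in.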
  
  CyberDildonics ・ 3 days ago
  
  The big difference is that if you are recreating an entire image and there isn't going to be any difference information against a reference image you can't move pixels around, you have to get fractional values out of optical flow and move pixels fractional amounts that potentially overlap in some areas and leave gaps in others.
  This means rasterization and making a weighted average of moved pixels as points with a kernel with width and height.
  Optical flow isn't one technique, it's just a name for getting motion vectors in the first place.
  Here is a lecture to help clear it up.
  https://www.cs.princeton.edu/courses/archive/fall19/cos429/s...
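
The "move pixels fractional amounts, then take a weighted average" step can be sketched in one dimension. This is a minimal illustration using a two-tap linear kernel; the function name is made up, and real frame generation works in 2D with wider kernels and explicit hole filling.

```python
import math


def forward_warp_1d(pixels, flow):
    """Splat each pixel to position i + flow[i], sharing it between the two
    nearest output samples, then normalise by the accumulated weight.

    Output samples that received nothing are returned as None (a gap).
    """
    n = len(pixels)
    acc = [0.0] * n
    weight = [0.0] * n
    for i, (value, d) in enumerate(zip(pixels, flow)):
        pos = i + d
        left = math.floor(pos)
        frac = pos - left
        for j, w in ((left, 1.0 - frac), (left + 1, frac)):
            if 0 <= j < n and w > 0.0:
                acc[j] += w * value
                weight[j] += w
    return [a / w if w > 0.0 else None for a, w in zip(acc, weight)]


# Uniform half-pixel motion: neighbouring pixels blend together.
blended = forward_warp_1d([10, 20, 30, 40], [0.5, 0.5, 0.5, 0.5])

# One pixel jumps two samples to the right: it leaves a gap behind
# and overlaps with the pixel already at the destination.
gappy = forward_warp_1d([10, 20, 30, 40], [0, 2, 0, 0])
```

The second call shows both failure modes mentioned above: a hole where nothing landed, and an averaged value where two source pixels collided.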
  
  pornel ・ 3 days ago
  
  I've started this thread by explaining this very problem, so I don't get why you're trying to lecture me on subpel motion and disocclusion.
  What's your point? Your replies seem to be just broadly contrarian and patronizing.
  I've continued this discussion assuming that maybe we talk past each other by using the term "motion vectors" in narrower and broader meanings, or maybe you did not believe that the motion vectors that game engines have can be incredibly useful for video encoding.
  However, you haven't really communicated your point across. I only see that whenever I describe something in a simplified way, you jump to correct me, while failing to realize that I'm intentionally simplifying for brevity and to avoid unnecessary jargon.
  
  CyberDildonics ・ 3 days ago
  
  You said they were the same and then talked about motion vectors from 3D objects and neural networks for an unknown reason.
  I'm saying that moving pixels and taking differences to a reference image is different from re-rasterizing an image with distortion and no correction.
- robterrell ・ 4 days ago
  
  Isn't the use of the H.264 motion vector to save bits when there is a camera pan? A pan is a case where every pixel in the frame will change, but maybe doesn't have to.
  
  superjan ・ 4 days ago
  
  Yes, or when a character moves across the screen. They are quite fine grained. However, when the decoder reads the motion vectors from the bitstream, it is typically not supposed to attach meaning to them: they could point to a patch that is not the same patch in the previous scene, but looks similar enough to serve as a starting point.
ChadNauseam ・ 4 days ago

I think you're right. Suppose the connection to the game streaming service adds two frames of latency, and the player is playing an FPS. One thing game engines could do is provide the game UI and the "3D world view" as separate framebuffers. Then, when moving the mouse on the client, the software could translate the 3D world view instantly for the next two frames that came from the server but are from before the user moved their mouse.
VR games already do something like this, so that when a game runs at below the maximum FPS of the VR headset, it can still respond to your head movements. It's not perfect because there's no parallax and it can't show anything for the region that was previously outside of your field of view, but it still makes a huge difference. (Of course, it's more important for VR because without doing this, any lag spike in a game would instantly induce motion sickness in the player. And if they wanted to, parallax could be faked using a depth map)
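
A minimal sketch of that client-side translation, assuming a pure horizontal camera turn and a small angle, so that a rotation can be approximated by sliding the already-received frame sideways. The function names and the focal length are assumptions for this example; as noted above, there is no parallax and the newly revealed edge has nothing to show.

```python
import math


def yaw_to_pixel_shift(yaw_radians, focal_px):
    """Approximate horizontal image shift, in pixels, for a camera yaw change."""
    return round(focal_px * math.tan(yaw_radians))


def shift_frame(frame, dx, fill=None):
    """Slide every row of the frame by dx pixels; uncovered pixels get `fill`."""
    width = len(frame[0])
    out = []
    for row in frame:
        new_row = [fill] * width
        for x, value in enumerate(row):
            nx = x + dx
            if 0 <= nx < width:
                new_row[nx] = value
        out.append(new_row)
    return out


# The player turned slightly right since the frame was rendered, so the
# scene content moves left on screen (negative dx).
stale_frame = [[1, 2, 3, 4], [5, 6, 7, 8]]
warped = shift_frame(stale_frame, -1)
```

The `None` column on the right edge is the region that was outside the field of view when the server rendered the frame.
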
- rowanG077 ・ 4 days ago
  
  You can do parallax if you use the depth buffer.
WantonQuantum ・ 4 days ago

A simple thing to start with would be akin to Sensor Assisted Video Encoding where phone accelerometers and digital compasses are used to give hints to video encoding: https://ieeexplore.ieee.org/document/5711656
Also, for 2d games a simple sideways scrolling game could give very accurate motion vectors for the background and large foreground linearly moving objects.
I'm surprised at the number of people disagreeing with your idea here. I think HN has a lot of "if I can't see how it can be done then it can't be done" people.
Edit: Also any 2d graphical overlays like HUDs, maps, scores, subtitles, menus, etc could be sent as 2d compressed data, which could enable better compression for that data - for example much sharper pixel perfect encoding for simple shapes.
- derf_ ・ 3 days ago
  
  > I think HN has a lot of "if I can't see how it can be done then it can't be done" people.
  No, HN has, "This has been thought of a thousand times before and it's not actually that good of an idea," people.
  The motion search in a video encoder is highly optimized. Take your side-scroller as an example. If several of your neighboring blocks have the same MV, that is the first candidate your search is going to check, and if the match is good, you will not check any others. The check itself has specialized CPU instructions to accelerate it. If the bulk of the screen really has the same motion, the entire search will take a tiny fraction of the encoding time, even in a low-latency, real-time scenario. Even if you reduce that to zero, you will barely notice.
  On the other end of the spectrum, consider a modern 3D engine. There will be many things not describable by block-based motion of the underlying geometry: shadows, occlusions, reflections, clouds or transparency, shader effects, water or atmospheric effects, etc. Even if you could track the "real" motion through all of that, the best MV to use for compression does not need to match the real motion (which might be very expensive to code, while something "close enough" could be much cheaper, as just one possible reason), it might come from any number of frames (not necessarily the most recent), etc., so you still need to do a search, and it's not obvious the real motion is much better as a starting point than the heuristics an encoder already uses, where they even differ.
  All of that said, some encoder APIs do allow providing motion hints [0], you will find research papers and theses on the topic, and of course, patents. That the technique is not more widespread is not because no one ever tried to make it work.
  [0] https://docs.nvidia.com/video-technologies/video-codec-sdk/1... as the first random result of a simple search.
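
The "check the neighbours' vector first and stop if it is good enough" behaviour can be sketched as follows. This is a simplified illustration with a plain sum-of-absolute-differences cost and an exhaustive fallback; the function names and the threshold are assumptions for this example, and production encoders use far more elaborate candidate lists and search patterns.

```python
def sad(cur, ref, x, y, mv, size):
    """Sum of absolute differences between a block in cur and a displaced block in ref."""
    dx, dy = mv
    return sum(
        abs(cur[y + r][x + c] - ref[y + dy + r][x + dx + c])
        for r in range(size)
        for c in range(size)
    )


def search_block(cur, ref, x, y, size, predicted_mv, search_range, good_enough):
    """Try the predicted vector first; fall back to a full search only if needed.

    Returns (best_mv, best_cost, candidates_checked).
    """
    height, width = len(ref), len(ref[0])

    def in_bounds(mv):
        dx, dy = mv
        return (0 <= x + dx and x + dx + size <= width
                and 0 <= y + dy and y + dy + size <= height)

    checked = 0
    best_mv, best_cost = None, None
    if in_bounds(predicted_mv):
        checked += 1
        best_mv, best_cost = predicted_mv, sad(cur, ref, x, y, predicted_mv, size)
        if best_cost <= good_enough:
            return best_mv, best_cost, checked  # early exit: the hint was fine

    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            mv = (dx, dy)
            if not in_bounds(mv):
                continue
            checked += 1
            cost = sad(cur, ref, x, y, mv, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv, best_cost, checked


# Side-scroller style test: the whole frame moved right by 2 pixels.
ref = [[r * 16 + c * c for c in range(8)] for r in range(8)]
cur = [[ref[r][c - 2] if c >= 2 else 0 for c in range(8)] for r in range(8)]

with_hint = search_block(cur, ref, 4, 2, 2, (-2, 0), 2, good_enough=0)
without_hint = search_block(cur, ref, 4, 2, 2, (0, 0), 2, good_enough=0)
```

Both calls find the same vector; the only difference is how many candidates were evaluated, which is the cost that neighbour prediction, or an externally supplied hint, saves.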
  
  WantonQuantum ・ 3 days ago
  
  > If several of your neighboring blocks have the same MV
  I think we’re mostly agreeing here. Finding the MVs in any block takes time. Time that can be saved by hints about the direction of motion. Sure, once some motion vectors are found then other blocks benefit by initially assuming similar vectors. To speed things up why not give the hints right away if they’re known a priori?
mikepurvis ・ 4 days ago

I’ve wondered about this as well, like most clients should be capable of still doing a bit of compositing. Like if you sent billboard renders of background objects at lower fidelity/frequency than foreground characters, updated hud objects with priority and using codecs that prioritize clarity, etc.
It was always shocking to me that Stadia was literally making their own games in house and somehow the end result was still just a streamed video and the latency gains were supposed to come from edge deployed gpus and a wifi-connected controller.
Then again, maybe they tried some of this stuff and the gains weren't worth it relative to battle-tested video codecs.
toast0 ・ 4 days ago

For 2d sprite games, OMG yes, you could provide some very accurate motion vectors to the encoder. For 3d rendered games, I'm not so sure. The rendering engine has (or could have) motion vectors for the 3d objects, but you'd have to translate them to the 2d world the encoder works in; I don't know if it's reasonable to do that ... or if it would help the encoder enough to justify.
- sudosysgen ・ 4 days ago
  
  Schemes like DLSS already do provide 2D motion vectors, it's not necessarily a crazy ask.
markisus ・ 4 days ago

The ultimate compression is to send just the user inputs and reconstitute the game state on the other end.
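
A toy sketch of that idea, assuming a fully deterministic simulation: both ends run the same `step` function, so only the input stream has to travel. The state, the input names, and the functions are made up for this example, and it presumes the receiving machine can actually run the game.

```python
# Hypothetical input vocabulary: each input nudges a 2D position.
MOVES = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, -1),
    "down": (0, 1),
    "none": (0, 0),
}


def step(state, user_input):
    """Advance the (toy) game state by one tick. Must be deterministic."""
    x, y = state
    dx, dy = MOVES[user_input]
    return (x + dx, y + dy)


def replay(initial_state, inputs):
    """Rebuild the final state from the starting state and the recorded inputs."""
    state = initial_state
    for user_input in inputs:
        state = step(state, user_input)
    return state


recorded_inputs = ["right", "right", "down", "none", "left"]
sender_state = replay((0, 0), recorded_inputs)
receiver_state = replay((0, 0), recorded_inputs)
```

Any non-determinism (unsynchronised random numbers, frame-rate dependent physics) breaks the scheme, because the two ends silently drift apart.
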
- w-ll ・ 4 days ago
  
  The issue is the "reconstitute the game state on the other end" when it comes to at least how I travel.
  I haven't in a while, but I used to use https://parsec.app/ on a cheap Intel Air to do my STO dailies on vacation. It sends inputs, but gets a compressed stream. I'm curious whether there is anything open source that does something similar.
- Zardoz84 ・ 3 days ago
  
  Good old DooM save demos are essentially this.
cma ・ 4 days ago

> Things like motion prediction would be free since most rendering engines already have a dedicated buffer just for that for its own rendering, for example.
Doesn't work for translucency and shader animation. The latter can be made to work if the shader can also calculate motion vectors.
WithinReason ・ 4 days ago

Instead of motion vectors you probably want to send RGBD (+depth) so the client can compute its own motion vectors based on input, depth, and camera parameters. You get instant response to user input this way, but you need to in-paint disocclusions somehow.
dmos62 ・ 4 days ago

Could you say more? My first thought is that CPUs and GPUs have much higher bandwidths and lower latencies than ethernet, so just piping some of that workload to a streaming client wouldn't be feasible. Am I wrong?
IshKebab ・ 4 days ago

I don't think games do normally have a motion vector buffer. I guess they could render one relatively easily, but that's a bit of a chicken and egg problem.
- garaetjjte ・ 4 days ago
  
  They do, one reason is postprocessing effects like motion blur, another is antialiasing like TAA or DLSS upscaling.
  
  shmerl ・ 4 days ago
  
  Many games have it, but I always turn it off. I guess some like its cinematic effect, but I prefer less motion blur, not more.
  
  theshackleford ・ 4 days ago
  
  Modern monitor technology has more than enough blur that adding more is most certainly not my cup of tea. Made worse, ironically, by modern rendering techniques...
  Though my understanding is that it helps hide shakier framerates in console land. Which sounds like it could be a thing...
  
  shmerl ・ 4 days ago
  
  If anything, high refresh rate displays are trying to reduce motion blur. Artificially adding it back sounds weird and counter intuitive.
  
  tjoff ・ 4 days ago
  
  It adds realism.
  Your vision has motion blur. Staring at your screen at a fixed distance with no movement is highly unrealistic and allows you to see crisp 4K images no matter the content. This results in a cartoonish experience because it mimics nothing in real life.
  Now you do have the normal problem that the designers of the game/movie can't know for sure what part of the image you are focusing on (my pet peeve with 3D movies), since that affects where and how you would perceive the blur.
  You also have the problem of overuse, or of using it to mask other issues, or just as an artistic choice.
  But it makes total sense to invest in a high refresh display with quick pixel transitions to reduce blur, and then selectively add motion blur back artificially.
  Turning it off is akin to cranking up the brightness to 400% because otherwise you can't make out details in the dark parts of the game ... that's the point.
  But if you prefer it off then go ahead, games are meant to be enjoyed!
  
  oasisaimlessly ・ 3 days ago
  
  Your eyes do not have built-in motion blur. If they are accurately tracking a moving object, it will not be seen as blurry. Artificially adding motion blur breaks this.
  
  tjoff ・ 3 days ago
  
  Sure they do, the moving object in focus will not have motion blur but the surroundings will. Motion blur is not indiscriminately adding blur everywhere.
  
  theshackleford ・ 2 days ago
  
  > Motion blur is not indiscriminately adding blur everywhere.
  Motion blur in games is inaccurate and exaggerated and isn’t close to presenting any kind of “realism.”
  My surroundings might have blur, but I don’t move my vision in the same way a 3d camera is controlled in game, so in the “same” circumstances I do not see the blur you do when moving a camera in 3d space in a game. My eyes jump from point to point, meaning the image I see is clear and blur free. When I’m tracking a single point, that point remains perfectly clear whilst sure, outside of that the surroundings blur.
  However, motion blur in games literally cannot replicate either of these realities; it just adds a smear on top of a smear on top of a smear.
  So given both are unrealistic, I’d appreciate the one that’s far closer to how I actually see which is the one without yet another layer of blur. Modern displays add blur, modern rendering techniques add more, I don't need EVEN more added on top with in-game blur on top of that.
  
  tjoff ・ a day ago
  
  Yes, and that was exactly my point in my original post...
  With or without, neither is going to be perfect. At least when not even attempting eye-tracking. But there are still many reasons to do it.
  
  IshKebab ・ 4 days ago
  
  Yeah I did almost mention motion blur but do many games use that? I don't play many AAA games TBF so maybe I'm just out of date...
  Take something like Rocket League for example. Definitely doesn't have velocity buffers.
  
  raincole ・ 4 days ago
  
  > Take something like Rocket League for example. Definitely doesn't have velocity buffers.
  How did you reach this conclusion? Rocket League looks like a game that definitely has velocity buffers to me. (Many fast-moving scenarios + motion blur)
  
  IshKebab ・ 4 days ago
  
  It doesn't have motion blur. At least I've never seen any.
  Actually I just checked and it does have a motion blur setting... maybe I just turned it off years ago and forgot or something.
  
  izacus ・ 4 days ago
  
  Yes, most games these days have motion blur and motion vector buffers.
  Yes, even Rocket League has it.
- ACCount36 ・ 4 days ago
  
  Exposing motion vectors is a prerequisite for a lot of AI framegen tech. If you could tap that?
tomaskafka ・ 3 days ago

Also all major GPUs now have machine learning based next frame prediction, it’s hard to imagine this wouldn’t be useful.
- keyringlight ・ 3 days ago
  
  Plus whether there are further benefits available for the FSR/DLSS/XeSS type upscalers in knowing more about the scene. I'm reminded a bit of variable rate shading, where if the renderer analyses the scene for where detail levels will reward spending performance, it could assign blocks (e.g., 1x2, 4x2 pixels etc.) to be shaded once instead of per-pixel, to concentrate effort there. It's not exactly the same thing as the upscalers, but it seems a better foundation for a better output image compared to bluntly dropping the whole rendered resolution by a percentage. However, that's assuming traditional rendering before any ML gets involved, which I think has proven its case in the past 7 years.
  I think the other side to this is the difference between further integration of the engine and scaler/frame generation which would seem to involve a lot of low level tuning (probably per-title), and having a generic solution that uplifts as many titles as possible even if there's "perfect is the enemy of good" left on the table.
d--b ・ 4 days ago

The point of streaming games though is to offload the hard computation to the server.
I mean you could also ship the textures ahead of time so that the compressor could look up if something looks like a distorted texture. You could send the geometry of what's being rendered, that would give a lot of info to the decompressor. You could send the HUD separately. And so on.
But here you want something that's high level and works with any game engine, any hardware. The main issue being latency rather than bandwidth, you really don't want to add calculation cycles.

keketi ・ 4 days ago

Have an LLM transcribe what is happening in the game into a few sentences per frame, transfer the text over network and have another LLM reconstruct the frame from the text. It won't be fast, it's going to be lossy, but compression ratio is insane and it's got all the right buzzwords.

jameshart ・ 4 days ago

Frame 1:
You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here.
- nusl ・ 4 days ago
  
  They did this :P
  https://www.youtube.com/watch?v=ZpCrBBj6AWE
- Eduard ・ 4 days ago
  
  (user input: mouse delta: (-20, -8))
  Frame 2:
  A few blades of grass sway gently in the breeze. The camera begins to drift slightly, as if under player control — a faint ambient sound begins: wind and birds.
- taneq ・ 4 days ago
  
  Ah, this explains why there are clowns under the bed and creepy children staring at me from the forest.
- Y_Y ・ 4 days ago
  
  kill jester
cyclotron3k ・ 4 days ago

Send the descriptions via the blockchain so there's an immutable record
poglet ・ 4 days ago

Maybe one day we'll even reach the point where the game can run locally on the end user's machine.
foota ・ 4 days ago

You've got my attention

raphman ・ 4 days ago

Very cool - That's nearly exactly what I need for a research project.

FWIW, there's also the non-free JPEG-XS standard [1] which also claims very low latency [2] and might be a safer choice for commercial projects, given that there is a patent pool around it.

[1] https://www.jpegxs.com/

[2] https://ds.jpeg.org/whitepapers/jpeg-xs-whitepaper.pdf

jamesfmilne ・ 4 days ago

JPEG-XS is great for low latency, but it uses more bandwidth. We're using it for low-latency image streaming for film/TV post production:
https://www.filmlight.ltd.uk/store/press_releases/filmlight-...
We currently use the IntoPIX CUDA encoder/decoder implementation, and SRT for the low-level transport.
You can definitely achieve end-to-end latencies <16ms over decent networks.
We have customers deploying their machines in data centres and using them in their post-production facilities in the centre of town, usually over a 10GbE link. But I've had others using 1GbE links between countries, running at higher compression ratios.
indolering ・ 4 days ago

A patent pool doesn't make you safer: it's just a patent troll charging you to cross the bridge. They are not offering insurance against more patent trolls blackmailing you after you cross the bridge.
- raphman ・ 4 days ago
  
  While I am personally opposed to software patents, I'd argue that the JPEG XS patent holders [1] are not 'patent trolls' in any meaningful sense of the word.
  While I have no personal experience on that topic, I'd assume that a codec with a patent pool is a safer bet for a commercial project. Key aspects being protected by patents makes it less likely that some random patent troll or competitor extorts you with some nonsense patent. Also, using e.g., JPEG XS instead of e.g., pyrowave also ensures that you won't be extorted by the JPEG XS patent holders.
  One may call this a protection racket - but under the current system, it may make economic sense to pay for a license instead of risking expensive lawsuits.
  [1] https://www.jpegxspool.com/
  
  rcxdude ・ 4 days ago
  
  >Key aspects being protected by patents makes it less likely that some random patent troll or competitor extorts you with some nonsense patent
  Does it? How? Patents can overlap, for example. Unless there's some indemnity or insurance for fighting patent lawsuits as part of the pool, it's a protection only against those patent holders, not other trolls.

Thaxll ・ 4 days ago

There is the creator of VLC that is working on something similar, very cutting edge.

https://streaminglearningcenter.com/codecs/an-interview-with...

Ultra low latency for streaming.

https://www.youtube.com/watch?v=0RvosCplkCc

torginus ・ 4 days ago

Having worked in the space, I'd have to say hardware encoders and H.264 are pretty dang good - NVENC works with very little latency (if you tell it to, and disable the features that increase it, such as multiple frame prediction and B-frames).
The two things that increase latency are more advanced processing algorithms, giving the encoder more stuff to do, and schemes that require waiting multiple frames. If you go disable those, the encoder can pretty much start working on your frame the nanosecond the GPU stops rendering to it, and have it encoded in <10ms.
- Wowfunhappy ・ 4 days ago
  
  > have it encoded in <10ms.
  For context, OP achieved 0.13 ms with his codec.
  
  pjc50 ・ 4 days ago
  
  "0.13 ms on a RX 9070 XT on RADV."
  "interesting data point is that transferring a 4K RGBA8 image over the PCI-e bus is far slower than compressing it on the GPU like this, followed by copying over the compressed payload."
  "200mbit/s at 60 fps"
  It's certainly a very different set of tradeoffs, using a lot more bandwidth.
  
  theshackleford ・ 4 days ago
  
  > It's certainly a very different set of tradeoffs, using a lot more bandwidth.
  Wasn't that the point?
  > These use cases demand very, very low latency. Every millisecond counts here
  > When game streaming, the expectation is that we have a lot of bandwidth available. Streaming locally on a LAN in particular, bandwidth is basically free. Gigabit ethernet is ancient technology and hundreds of megabits over WiFi is no problem either. This shifts priorities a little bit for me at least.
  
  torginus ・ 4 days ago
  
  I don't have the timings right now but you can go significantly below 10ms.
  There's a tradeoff between quality and encoding time - for example, if you want your motion vector reference to go back 4 frames, instead of 2, then the encoder will take longer to run, and you get better quality at no extra bitrate, but more runtime.
  If your key-to-screen latency has an irreducible 50-60ms part of rendering, processing, data transfer, decoding and display, then the extra 10ms is just 15% more latency, but you have to find the correct tradeoff for yourself.
  
  your_challenger ・ 4 days ago
  
  But isn't the OP talking about local network while Jean-Baptiste Kempf is talking about the internet?
- dishsoap ・ 4 days ago
  
  10ms is quite long in this context.
- RobRivera ・ 4 days ago
  
  >10 ms
  Do not shame this dojo.
latchkey ・ 4 days ago

Sadly appears to be unavailable.

Cadwhisker ・ 4 days ago

This CODEC uses the same base algorithm as HTJ2K (High-Throughput JPEG 2000).

If the author is reading this, it would be very interesting to read about the differences between this method and HTJ2K.

Fidelix ・ 4 days ago

Unbelievable... Good job mate.

Can't wait until one day this gets into Moonlight or something like it.

cpeth ・ 4 days ago

Exactly what I was thinking. Wish I had the time and expertise to have a go at adding support for this codec myself. Streaming Clair Obscur over my LAN via Sunshine / Moonlight is exactly my use case and the latency could definitely be better.

freshtake ・ 4 days ago

If you're focused solely on local network streaming, you can throw most of the features of modern codecs out the window. The trade-off is bandwidth, but if the network can support 100 Mbps, you can get remarkably low latency with relatively little processing.

For example, Microsoft's DXT codec lacks most modern features (no entropy coding, motion comp, deblocking, etc.), but delivers roughly 4x to 8x compression and is hardware decodable (saving on decoding and presentation latency).
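
To make the 4x to 8x figure concrete, here is a rough back-of-the-envelope sketch. It assumes the standard block sizes of the DXT/BC family (DXT1/BC1 packs each 4x4 pixel block into 8 bytes, DXT5/BC3 into 16 bytes) and compares them with uncompressed 24-bit RGB and 32-bit RGBA. The 1080p60 stream is only an illustrative choice, and the function names are made up for this example.

```python
# Back-of-the-envelope numbers for fixed-rate block compression (DXT/BC family).
# Assumed block sizes: DXT1/BC1 = 8 bytes, DXT5/BC3 = 16 bytes per 4x4 pixel block.

BLOCK_BYTES = {"DXT1": 8, "DXT5": 16}
PIXELS_PER_BLOCK = 4 * 4


def bits_per_pixel(fmt: str) -> float:
    """Fixed bit cost per pixel for the given block format."""
    return BLOCK_BYTES[fmt] * 8 / PIXELS_PER_BLOCK


def compression_ratio(fmt: str, source_bpp: int) -> float:
    """Ratio relative to an uncompressed source with source_bpp bits per pixel."""
    return source_bpp / bits_per_pixel(fmt)


def stream_mbps(fmt: str, width: int, height: int, fps: int) -> float:
    """Bitrate in Mbit/s if every frame is sent as block-compressed data."""
    return width * height * bits_per_pixel(fmt) * fps / 1e6


if __name__ == "__main__":
    for fmt in BLOCK_BYTES:
        print(fmt, bits_per_pixel(fmt), "bpp,",
              compression_ratio(fmt, 24), "x vs RGB24,",
              compression_ratio(fmt, 32), "x vs RGBA32")
    print("DXT1 1080p60:", round(stream_mbps("DXT1", 1920, 1080, 60)), "Mbit/s")
```

Under these assumptions DXT1 at 1080p60 works out to roughly 500 Mbit/s, so it would fit on gigabit Ethernet but not on a 100 Mbps link without a lower resolution or frame rate.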

Of course, once you've tuned the end to end capture-encode-transmit-decode-display loop to sub 10 ms, you then have to contend with the 30-100 ms of video processing latency introduced by the display :-)

CharlesW ・ 4 days ago

> Given how niche and esoteric this codec is, it’s hard to find any actual competing codecs to compare against.

It'd be interesting to see benchmarks against H.264/AVC (see example "zero‑latency" ffmpeg settings below) and JPEG XS.

  -c:v libx264 -preset ultrafast -tune zerolatency \
  -x264-params "keyint=1:min-keyint=1:scenecut=0:rc-lookahead=0" \
  -bf 0 -b:v 8M -maxrate 8M -bufsize 1M

kookamamie ・ 4 days ago

Not bad. The closest competition would be NDI from NewTek, now Vizrt. It targets similar bitrate and latency ranges.

monster_truck ・ 4 days ago

Looks like NDI without any of the conveniences.

You're doing something wrong if nvenc is any slower, the llhp preset should be all you need.

kevingadd ・ 4 days ago

The sample screenshot of Expedition 33 is really impressive quality considering it appears to be encoding at around 1 bit per pixel and (according to the post) it took a fraction of a millisecond to encode it. This is an order of magnitude faster than typical hardware encoders, AFAIK.
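
For a sense of what "1 bit per pixel" means in bitrate terms, here is a small sketch. The resolutions and the 60 fps frame rate are illustrative assumptions, not figures read off the screenshot, and the helper names are made up.

```python
def bitrate_mbps(width: int, height: int, fps: float, bpp: float) -> float:
    """Bitrate in Mbit/s for a stream coded at bpp bits per pixel."""
    return width * height * fps * bpp / 1e6


def bits_per_pixel(width: int, height: int, fps: float, mbps: float) -> float:
    """Average bits per pixel implied by a given bitrate in Mbit/s."""
    return mbps * 1e6 / (width * height * fps)


if __name__ == "__main__":
    print("1080p60 at 1 bpp:", bitrate_mbps(1920, 1080, 60, 1.0), "Mbit/s")
    print("4K60 at 1 bpp:   ", bitrate_mbps(3840, 2160, 60, 1.0), "Mbit/s")
    # The "200mbit/s at 60 fps" figure quoted elsewhere in the thread,
    # assuming (not confirmed here) that it refers to a 4K stream:
    print("4K60 at 200 Mbit/s:", round(bits_per_pixel(3840, 2160, 60, 200), 3), "bpp")
```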

Very cool work explained well.

superjan ・ 4 days ago

I love this. The widely used standards for video compression are focused on compression efficiency, which is important if you’re Netflix or YouTube, but sometimes latency and low complexity are more important. Even if only to play around and learn how a video codec actually works.

CharlesW ・ 4 days ago

> The widely used standards for video compression are focused on compression efficiency, which is important if you’re Netflix or YouTube, but sometimes latency and low complexity are more important.
That's a misconception. All modern video codecs (i.e. H.264/AVC, H.265/HEVC, AV1) have explicit, first-class tools, profiles, and reference modes aimed at both low- and high-resolution low‑latency and/or low‑complexity use.
AV1: Improving RTC Video Quality at Scale: https://atscaleconference.com/av1-improving-rtc-video-qualit...
- westurner ・ 4 days ago
  
  There are hardware AV1 encoders and decoders.
  Objective metrics and tools for video encoding and source signal quality: netflix/vmaf, easyVmaf, psy-ex/metrics, ffmpeg-quality-metrics.
  FFmpeg settings for low-latency encoding:
  # h264, h265
  -preset ultrafast -tune zerolatency
  # AV1
  -c:v libsvtav1 -preset 8 -svtav1-params tune=0:latency-mode=1 -g 60
  It's possible to follow along with ffmpeg encoding for visual inspection without waiting for the whole job to complete with the tee muxer and ffplay.
  GPU Screen Recorder and the Sunshine server expose some encoder options in GUI form, but parameter optimization is still manual; nothing runs easyVmaf with thumbnails for each encoding parameter set, let alone auto-identifies encoding artifacts.
  Ardour has a "Loudness Analyzer & Normalizer" with profiles for specific streaming services.
  What are good target bitrates for low-latency livestreaming 4k with h264, h265 (HDR), and AV1?
  
  westurner ・ 2 days ago
  
  FFmpeg Explorer is made with ffmpeg.wasm: https://github.com/antiboredom/ffmpeg-explorer .. web: https://ffmpeg.lav.io/

nairoz ・ 4 days ago

It's really cool. I have always wondered if it would be possible to have video encoders designed for specific games, with prior knowledge about which regions are important and should be encoded in more detail. An example would be the center of the screen for the main character.

DecentShoes ・ 4 days ago

Hell, you could do eye tracking and full on foveated rendering.
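
As a purely hypothetical sketch of that idea (nothing here is taken from PyroWave or any codec mentioned in this thread), an encoder could scale its quantizer per block based on distance from a focus point, whether that is the screen center or an eye-tracked gaze position. The function name, the linear falloff and the `max_scale` value are all assumptions.

```python
import math


def quant_scale_map(blocks_x, blocks_y, focus_x, focus_y, max_scale=4.0):
    """Per-block quantizer multipliers: 1.0 at the focus block, rising
    linearly with distance to max_scale at the farthest block."""
    dist = [[math.hypot(bx - focus_x, by - focus_y) for bx in range(blocks_x)]
            for by in range(blocks_y)]
    farthest = max(max(row) for row in dist) or 1.0  # avoid dividing by zero on a 1x1 grid
    return [[1.0 + (max_scale - 1.0) * (d / farthest) for d in row] for row in dist]


if __name__ == "__main__":
    # 9x5 grid of blocks with the focus in the center of the screen.
    for row in quant_scale_map(9, 5, focus_x=4, focus_y=2):
        print(" ".join(f"{s:.1f}" for s in row))
```

A larger multiplier means coarser quantization, so blocks far from the focus point would spend fewer bits.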

mleonhard ・ 4 days ago

https://github.com/Themaister/pyrowave

sitkack ・ 4 days ago

I'd want to know how it compares to https://github.com/phoboslab/qoi

One thing to note when designing a new video codec is to carpet bomb around the idea with research projects to stake claim to any possible feature enhancements.

Anything can have an improvement patent filed against it, no matter the license.

atiedebee ・ a day ago

QOI is an image codec, lossless, and not parallelizable. I don't think they are comparable.
- sitkack ・ 5 hours ago
  
  Both of you are falling into the trap of comparing at the lowest possible level.
  Here is an example of QOI being used in a video codec, https://wide-video.github.io/qov/static/demo.html https://github.com/wide-video/qov
Almondsetat ・ 4 days ago

QOI is lossless

hashtekar ・ 4 days ago

What a great read and such a throwback for me. I worked on video compression techniques using wavelets 30 years ago. Computing power and networking speeds were not what they are now and I had difficulty getting the backing to carry it forward. I’m so happy that this still has such active development and boundaries are still being pushed. Bravo.

kreco ・ 4 days ago

I have nothing interesting to say, but thanks to the author and to whoever shared the article. It was a good read.

crazygringo ・ 4 days ago

Fascinating! But...

> The go-to solution here is GPU accelerated video compression

Isn't the solution usually hardware encoding?

> I think this is an order of magnitude faster than even dedicated hardware codecs on GPUs.

Is there an actual benchmark though?

I would have assumed that built-in hardware encoding would always be faster. Plus, I'd assume your game is already saturating your GPU, so the last thing you want to do is use it for simultaneous video encoding. But I'm not an expert in either of these, so curious to know if/how I'm wrong here? Like if hardware encoders are designed to be real-time, but intentionally trade off latency for higher compression? And is the proposed video encoding really so lightweight it can easily share the GPU without affecting game performance?

averne_ ・ 4 days ago

Hardware GPU encoders refer to dedicated ASIC engines, separate from the main shader cores. So they run in parallel and there is no performance penalty for using both simultaneously, besides increased power consumption.
Generally, you're right that these hardware blocks favor low latency over compression efficiency. One example of this is motion estimation (one of the most expensive operations during encoding). The NVENC engine on NVIDIA GPUs will only use fairly basic detection loops, but can optionally be fed motion hints from an external source. I know that NVIDIA has a CUDA-based motion estimator (called CEA) for this purpose. On recent GPUs there is also the optical flow engine (another separate block) which might be able to do higher quality detection.
- miladyincontrol ・ 4 days ago
  
  I'm pretty sure they aren't dedicated ASIC engines anymore. That's why hacks like nvidia-patch are a thing, where you can scale NVENC usage up to the full GPU's compute rather than the arbitrary limitation NVIDIA adds. The penalty for using them within those limitations tends to be negligible, however.
  And on a similar note, NvFBC helps a ton with latency but it's disabled at the driver level for consumer cards.
  
  theshackleford ・ 4 days ago
  
  > I'm pretty sure they aren't dedicated ASIC engines anymore.
  They are. That patch doesn't do what you think it does.

ChadNauseam ・ 4 days ago

Great article. One thing I've always noticed is that when you get good enough at coding it just turns into math. I hope I can reach that level some day.

maxglute ・ 3 days ago

Infra has really caught up to the point where remotely streaming my desktop to various mobile devices is viable. I kind of wish the Windows Phone OS hadn't failed so hard; it would be nice to toggle mobile mode in a remote session and have years of apps built to support it.

richardw ・ 4 days ago

If the system knew both sides were the same vendor or used the same algorithm, would it be better to stream the scene/instructions rather than the video?

I suppose the issue would be media. Faster to load locally than push it out. Could be semi solved with typical web caching approaches.

Firehawke ・ 4 days ago

I've been mulling over the concept of a codec designed for streaming NES video. Stream the VRAM tiles and the RAM data needed to reconstruct the output locally, which could be done in a shader along with CRT simulation if desired.
Very much a one-trick pony, but probably considerably less bandwidth-intensive than video at even the original resolution (256x240) under nearly any acceptable bitrate.
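
A rough size estimate for that idea, as a sketch only. It assumes the usual NES PPU memory sizes (2 KiB of nametable RAM, 32 bytes of palette RAM, 256 bytes of sprite OAM, 8 KiB of pattern tables), a 256x240 output stored at one byte per pixel, and 60 frames per second. It ignores scroll registers, mapper bank switching and mid-frame raster tricks, all of which a real scheme would have to handle.

```python
# Assumed NES PPU memory sizes, in bytes.
NAMETABLE_RAM = 2 * 1024   # two 1 KiB nametables (tile indices + attributes)
PALETTE_RAM = 32
SPRITE_OAM = 256           # 64 sprites x 4 bytes
PATTERN_TABLES = 8 * 1024  # tile graphics (CHR), often static ROM

WIDTH, HEIGHT, FPS = 256, 240, 60


def state_bytes(include_chr: bool) -> int:
    """Bytes needed to resend the full PPU state for one frame."""
    size = NAMETABLE_RAM + PALETTE_RAM + SPRITE_OAM
    return size + PATTERN_TABLES if include_chr else size


def raw_frame_bytes() -> int:
    """Uncompressed frame, one palette-index byte per pixel."""
    return WIDTH * HEIGHT


def mbps(bytes_per_frame: int) -> float:
    return bytes_per_frame * 8 * FPS / 1e6


if __name__ == "__main__":
    print("raw frame:          ", raw_frame_bytes(), "bytes,", round(mbps(raw_frame_bytes()), 2), "Mbit/s")
    print("state without tiles:", state_bytes(False), "bytes,", round(mbps(state_bytes(False)), 2), "Mbit/s")
    print("state with tiles:   ", state_bytes(True), "bytes,", round(mbps(state_bytes(True)), 2), "Mbit/s")
```

Even resending everything every frame with no delta coding, the state is several times smaller than the raw frame under these assumptions.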
Karliss ・ 4 days ago

Are you suggesting doing the 3D rendering on the client side, which would require a beefy GPU for the client? The whole point of game streaming is that the game can run on the big, noisy, power-hungry computer located somewhere else, but the device receiving the stream only needs minimal compute power to decode video.

10000truths ・ 4 days ago

Are there any solutions to game streaming that build an RPC on top of the DirectX/Vulkan API and data structures? I feel like streaming a serialized form of the command queue over the network would be more efficient than streaming video frames.

babypuncher ・ 4 days ago

What's the point in streaming a video game from one computer to another if the client machine still needs the expensive and power hungry dedicated graphics hardware to display it?
- mschuster91 ・ 4 days ago
  
  You could use that to have a chungus dGPU with a massive amount of VRAM keep all the assets, position data and god knows what else to do the heavy lifting - determine what needs to be drawn and where, deal with physics, 3D audio simulation etc. - and then offload only a (comparatively) small-ish amount of work to the client GPU.
  
  debugnik ・ 4 days ago
  
  But those are the cheap parts compared to the rendering itself. A small-ish amount of work is what we're already sending, the final image, because anything else takes much, much more GPU work.
  
  babypuncher ・ 3 days ago
  
  The client GPU still needs all those assets in its own VRAM in order to render any scene that uses them. You would need to stream all of those assets in real time to the client, and last I checked consumer network interfaces are checks notes slower than 16 PCI Express lanes.
  I'm still unsure what computational resources are being saved for the client here, the actual rasterization is where the bulk of the work is done.
wmf ・ 4 days ago

I don't think this is true when you count texture uploads. Loading 8 GB of textures over the network would take a while.
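
To put "a while" into numbers, here is a quick sketch. The link speeds are nominal figures chosen for illustration (100 Mbit/s and 1 Gbit/s Ethernet, and roughly 252 Gbit/s for a PCIe 4.0 x16 slot), and protocol overhead is ignored, so real transfers would be slower.

```python
def transfer_seconds(size_bytes: float, link_bits_per_s: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_s


# Nominal link speeds in bits per second (assumptions for illustration).
LINKS = {
    "100 Mbit/s Ethernet": 100e6,
    "1 Gbit/s Ethernet": 1e9,
    "PCIe 4.0 x16 (~31.5 GB/s)": 31.5e9 * 8,
}

if __name__ == "__main__":
    textures = 8e9  # 8 GB of textures, as in the comment above
    for name, speed in LINKS.items():
        print(f"{name}: {transfer_seconds(textures, speed):.2f} s")
```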
- 10000truths ・ 4 days ago
  
  Only once, then subsequent references to the texture(s) would be done via descriptor. Most game engines will preload large assets like textures before rendering a scene.
  
  duskwuff ・ 4 days ago
  
  That really depends on the game. Loading every texture in advance isn't practical for all games - many newer "open world" games will stream textures to the GPU as needed based on the player's location and what they're doing.
  Also, modern game textures are a lot of data.
  
  10000truths ・ 4 days ago
  
  True, on-demand loading/unloading of large textures still needs to be handled. Video streaming handles congestion by sacrificing fidelity to reduce bitrate. A similar approach could be taken with textures by downsampling them (or, better yet, streaming them with a compression codec that supports progressive decoding).
sebmaynard ・ 4 days ago

That's exactly what https://polystream.com/ does
dec05eba ・ 3 days ago

That's kinda like how indirect GLX works with X11 (remote OpenGL)

Fokamul ・ 4 days ago

By any chance, does anyone know what streaming services use? Like Boosteroid or GeForce NOW.

Because GeForce NOW has the best streaming quality in the business and Boosteroid has various problems like stuttering etc.

theLiminator ・ 4 days ago

Pretty cool, though I think in practice network latency dominates so much that this kind of optimization is fairly low impact.

I think the main advantage is perhaps better robustness against packet drops.

babypuncher ・ 4 days ago

It could make in-home streaming actually usable for me. I've never been happy with the lag in Steam's streaming or Moonlight, even when both the server and client are on the same switch. That's not a network latency problem, that's an everything else problem.
theshackleford ・ 4 days ago

> though I think in practice network latency dominates so much that this kind of optimization is fairly low impact.
In practice, no.
Network latency is the least problematic part of the stack. I consistently get <3ms. It's largely encode/decode time, which in my setup sits at around 20ms, meaning any work in this area would actually have a HUGE impact.
_kb ・ 4 days ago

Depends on the network. Where this style of codec is common you’re not traversing the internet, so transport latency, including switch forwarding, is normally in the microseconds. The killer is the display device that ends up rendering this. If you’re not careful that can add 10-100ms to glass-to-glass times.
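
A small sketch of how such a latency budget breaks down, using only figures quoted in this thread as example inputs (about 3 ms of network, about 20 ms of encode plus decode, and 10 to 100 ms added by the display). The stage names and the helper are made up for illustration; real pipelines have more stages.

```python
def budget_shares(stages_ms: dict) -> tuple:
    """Return total latency in ms and each stage's share of that total."""
    total = sum(stages_ms.values())
    return total, {name: ms / total for name, ms in stages_ms.items()}


if __name__ == "__main__":
    for display_ms in (10, 100):
        stages = {"network": 3, "encode+decode": 20, "display": display_ms}
        total, shares = budget_shares(stages)
        summary = ", ".join(f"{name} {share:.0%}" for name, share in shares.items())
        print(f"display {display_ms} ms -> total {total} ms: {summary}")
```

With a fast display the codec dominates the budget in this example; with a slow one the display does, which is consistent with both points made above.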

tomaskafka ・ 3 days ago

I love how the final step is “Dopamine is released in target brain”

We can minimize the latency and save on rendering by doing only this step :).

pimlottc ・ 4 days ago

What are existing streaming game services using? Surely there is some good previous work in this area?

toast0 ・ 4 days ago

I doubt any of those are willing to drop 100mbps+ per client. If the clients can even manage it.
- theshackleford ・ 4 days ago
  
  > I doubt any of those are willing to drop 100mbps+ per client.
  There is zero reason to not be willing to do so on a local network.
  > If the clients can even manage it.
  They can. I use significantly more than that already.
  
  detaro ・ 4 days ago
  
  "Streaming game service" clearly does not refer to something in your local network
  
  theshackleford ・ 3 days ago
  
  Plenty of people I know refer to local game streaming setups as services, especially when they’re run as system services or hosted servers (like I do).
  In my world, anything that runs persistently and offers functionality, whether it’s local or the cloud, is a service. I run a local game streaming setup that fits that definition exactly and so yeah, when I hear game streaming service, that’s what I think of.
  That said, I get that others could associate the term more with commercial cloud solutions with some reflection. Just not where my brain goes by default, since I don’t use them and nobody I know does or would as we can host our own “services” for this purpose.
  My bad.
wmf ・ 4 days ago

Probably GPU hardware encoding using low-latency mode. Somebody reported 1 ms latency.

tombert ・ 4 days ago

One of my bucket list things is to some day build a video codec from scratch. I have no delusions of competing with h264 or anything like that, just something that does some basic compression and can play videos in the process.

Maybe I should try for that next weekend.