I'd like to use io_uring, but as long as it bypasses seccomp, it should be disabled whenever seccomp is in use. So I use epoll, and find it annoying when kernel APIs like ublk require io_uring. The places I'd want to use ublk are inside sandboxes that use seccomp. Given that container runtimes, hardened kernels, ChromeOS, etc. disable io_uring, using it means needing an epoll fallback anyway, so you might as well just use epoll and not maintain two async backends for your application.
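For illustration, the "needs an epoll fallback" point can be sketched as a runtime probe. This is a minimal sketch, not anyone's actual code: `io_uring_usable` and `pick_backend` are hypothetical helper names, and the syscall number 425 for `io_uring_setup` is assumed (it holds on x86-64 and arm64).

```python
import ctypes
import os
import sys

# Assumption: io_uring_setup is syscall 425 (true on x86-64 and arm64).
SYS_IO_URING_SETUP = 425
IO_URING_PARAMS_SIZE = 120  # sizeof(struct io_uring_params)


def io_uring_usable() -> bool:
    """Return True if io_uring_setup() succeeds for this process."""
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # The params struct must be zeroed; create_string_buffer zero-fills.
    params = ctypes.create_string_buffer(IO_URING_PARAMS_SIZE)
    fd = libc.syscall(
        ctypes.c_long(SYS_IO_URING_SETUP), ctypes.c_long(1), params
    )
    if fd < 0:
        # Typically ENOSYS or EPERM when seccomp, a sysctl, or the
        # kernel configuration blocks io_uring.
        return False
    os.close(fd)
    return True


def pick_backend() -> str:
    """Hypothetical helper: name the async backend the app should use."""
    return "io_uring" if io_uring_usable() else "epoll"
```

An application that calls `pick_backend()` at startup still has to ship both code paths, which is the maintenance cost being objected to.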
ublk, specifically, is something I'd expect to be primarily used in privileged contexts anyway, because the primary use of the resulting block device is to mount it, which requires privileges for most interesting filesystems. If you want an unprivileged mechanism, you may be interested in the upcoming uring-accelerated FUSE support.
For other uses, uring has a "restriction" mechanism that does part of what you want. See REGISTER_RESTRICTIONS in the documentation. Any process that's setting up its own seccomp restrictions can also set up a uring with restrictions, limiting the opcodes it can use.
That said, that mechanism would benefit from a way to apply such restrictions to a process that isn't doing the setup itself, such as when setting up seccomp restrictions on a container or daemon. For instance, a way to set restrictions on all rings created by child processes, or a way for seccomp to enforce that any uring created has restrictions applied to it.
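As a rough sketch of the restriction mechanism described above: a ring is created disabled, an allowlist of SQE opcodes is registered, and only then is the ring enabled. The constants are copied by hand from `linux/io_uring.h` and the syscall numbers assume x86-64 or arm64; `create_restricted_ring` is a hypothetical helper name, and real code would more likely use liburing from C.

```python
import ctypes
import os
import sys

# Assumptions: syscall numbers for x86-64/arm64; values from linux/io_uring.h.
SYS_IO_URING_SETUP = 425
SYS_IO_URING_REGISTER = 427
IORING_SETUP_R_DISABLED = 1 << 6
IORING_REGISTER_RESTRICTIONS = 11
IORING_REGISTER_ENABLE_RINGS = 12
IORING_RESTRICTION_SQE_OP = 1
IORING_OP_READ = 22
IORING_OP_WRITE = 23


class IoUringRestriction(ctypes.Structure):
    # Mirrors struct io_uring_restriction (16 bytes).
    _fields_ = [
        ("opcode", ctypes.c_uint16),
        ("value", ctypes.c_uint8),  # union: register_op / sqe_op / sqe_flags
        ("resv", ctypes.c_uint8),
        ("resv2", ctypes.c_uint32 * 3),
    ]


class IoUringParams(ctypes.Structure):
    # Only the leading fields of struct io_uring_params are named here.
    _fields_ = [
        ("sq_entries", ctypes.c_uint32),
        ("cq_entries", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("rest", ctypes.c_uint8 * 108),
    ]


def create_restricted_ring(allowed_ops):
    """Return an fd for a ring limited to allowed_ops, or None on failure."""
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    params = IoUringParams(flags=IORING_SETUP_R_DISABLED)
    fd = libc.syscall(ctypes.c_long(SYS_IO_URING_SETUP), ctypes.c_long(8),
                      ctypes.byref(params))
    if fd < 0:
        return None  # io_uring unavailable or blocked
    rules = (IoUringRestriction * len(allowed_ops))()
    for i, op in enumerate(allowed_ops):
        rules[i].opcode = IORING_RESTRICTION_SQE_OP
        rules[i].value = op
    steps = (
        (IORING_REGISTER_RESTRICTIONS, ctypes.byref(rules), len(allowed_ops)),
        (IORING_REGISTER_ENABLE_RINGS, None, 0),
    )
    for reg_op, arg, nr_args in steps:
        ret = libc.syscall(ctypes.c_long(SYS_IO_URING_REGISTER),
                           ctypes.c_long(fd), ctypes.c_long(reg_op),
                           arg, ctypes.c_long(nr_args))
        if ret < 0:
            os.close(fd)
            return None
    return fd
```

Note that this only constrains the ring the process sets up for itself, which is exactly the limitation raised above: nothing here forces a child process to apply the same rules to rings it creates.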
The main problem I have with FUSE is inotify not working. If inotify just worked for FUSE, I'd just use it. Ideally I could just run the software in a mount namespace with a FUSE fs, but I need inotify.
I was mainly trying to use ublk to implement a sort of FUSE-like thing, with the kernel handling the fs and thus providing inotify support.
Interesting, I didn't realize inotify didn't work with FUSE. Is this a flaw in the FUSE interface, or is it just a deficiency in certain FUSE filesystems?
I think the key problem is that mapping from FUSE requests to inotify events requires information that only the FUSE daemon has. For example, let's say you open a file with O_CREAT. Whether this should trigger IN_CREATE depends on whether the file already exists. The kernel doesn't know this, and so couldn't be responsible for generating the IN_CREATE event.
Now, the FUSE daemon could generate the event, but correctly generating events (especially handling edge cases) is difficult.
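To make the O_CREAT example concrete, here is a small sketch of how a daemon backed by a local directory might learn whether an open actually created the file. `open_reporting_creation` is a hypothetical helper, not part of any FUSE library, and it ignores the race where the file is removed between the two `open` calls.

```python
import os


def open_reporting_creation(path):
    """Open path with O_CREAT; return (fd, created).

    created is True only if this call brought the file into existence,
    which is the case where an IN_CREATE-style event would be expected.
    """
    try:
        # O_EXCL makes the open fail if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        return fd, True
    except FileExistsError:
        # The file was already there: a plain open, no creation event.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        return fd, False
```

The caller sees the same successful O_CREAT open either way; only the side doing the lookup can tell the two cases apart, and even this simple version has an edge case to handle.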
I was thinking about cases where a filesystem change event doesn't stem from a system call at all, for example, because some other machine wrote to a remote fileserver the daemon provides access to. Is that a problem?
> you may be interested in the upcoming uring-accelerated FUSE support.
Do you have a reference for this? What is the anticipated timeframe?
https://lore.kernel.org/io-uring/20241209-fuse-uring-for-6-1...
I don't know when it'll be merged, but it seems like it's getting close to ready.
> For instance, a way to set restrictions on all rings created by child processes, or a way for seccomp to enforce that any uring created has restrictions applied to it.
SELinux or your favorite MAC is there to solve this exact problem.
Does this mean you shouldn't use it in containers?
edit: it does seem it is disabled there now: https://github.com/containerd/containerd/pull/9320 (thanks to sibling comment for an adjacent link)
Yeah I had code at one point in my hobby project that used io_uring and it stopped working in docker without overriding security restrictions.
Unfortunately decided it's not worth it.
> find it annoying when kernel APIs like ublk require io_uring
Good. That's a forcing function for making io_uring work in your environment.
> bypasses seccomp
Seccomp sucks.
We shouldn't be enforcing security by filtering system calls, the set of which will grow forever, but instead by describing access control rules on objects, e.g. with SELinux. If your security policy is that your sandbox should be able to read from some file but not write to it, you should do that with real MAC, which applies to all operations, io_uring included. You shouldn't just filter read(2) and write(2) in particular.
We shouldn't hold back evolution in systems interfaces because some people are stuck on bad ways of doing things and won't move.
SELinux is a dx/ux hostile nightmare that we definitely shouldn't be springing on everybody.
Since when can you use a MAC as an unprivileged user on an arbitrary distro?
Parent is referring to https://en.m.wikipedia.org/wiki/Mandatory_access_control as opposed to https://en.m.wikipedia.org/wiki/Medium_access_control
Not necessarily. How do I use SELinux as an unprivileged user on, e.g., Debian?
seccomp is a mitigation. Once you have already been exploited, if further escalation is prevented by seccomp, or ASLR, or NX stack, or ... then you got lucky.
Is there a specific io_uring opcode you would like disabled in your sandboxes? It's not like io_uring is a complete seccomp bypass, just another syscall that provides an alternative way to do many things. I doubt you block "read" or "accept" in docker, for example. You can't execute a sysctl or mount a filesystem using io_uring, which are things that are actually blocked in Docker by default.
edit: on the other hand, a good reason to disable uring in containers is that it's infested with vulnerabilities. It's new, complex, and does a whole lot of things - all of which make serious security bugs there quite common right now.
> infested with vulnerabilities
Current io_uring is not particularly prone to vulnerabilities. The original version of it had a design that often led to them (a kernel thread doing operations on behalf of the process and not always remembering to set the appropriate privileges), but it no longer uses that design, and the current design is much more resilient. Unfortunately, the original design led to a reputation that it's still trying to shake.
> Current io_uring is not particularly prone to vulnerabilities
The tech industry: launch early! Develop in public! Many eyes make all bugs shallow!
Also the tech industry: we will never forgive you for that one segfault you had ten years ago.
Excuse me? io_uring is by far the most often exploited syscall on modern-day Linux. The most often exploited subsystem, even. https://www.phoronix.com/news/Google-Restricting-IO_uring
That's a lot like saying "the syscall interface is the most exploited interface to the kernel". io_uring is an entire syscall interface itself; the right point of comparison would be "every other syscall".
How do the exploits for io_uring compare to the exploits for the rest of the kernel?
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=%22linux%20...
Remember that 10yo crash? Well, I'm going to use a 12yo kernel and complain about it.
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
20 CVEs in 2024. Yes, some of them are not (exploitable) vulnerabilities, probably, because Linux CNA is being difficult. But many of them are, just ctrl+f privilege.
It's not only potentially infested with vulnerabilities. It's also not possible to filter io_uring using seccomp at all. So if you allow io_uring, you allow all that is possible with it.
Out of current ones, at a quick glance: connect, openat, openat2, renameat, mkdirat, and bind. More importantly, I'd like to block any opcode I haven't whitelisted, even when my software runs on future kernels with more opcodes available.
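One way to see what an allowlist leaves out on a given kernel is to ask the kernel which opcodes it supports and subtract the allowlist. The sketch below uses the probe registration for that; the constants are copied by hand from `linux/io_uring.h`, the syscall numbers assume x86-64 or arm64, and `supported_opcodes` and `denied_by_allowlist` are hypothetical helper names.

```python
import ctypes
import os
import sys

# Assumptions: syscall numbers for x86-64/arm64; values from linux/io_uring.h.
SYS_IO_URING_SETUP = 425
SYS_IO_URING_REGISTER = 427
IORING_REGISTER_PROBE = 8
IO_URING_OP_SUPPORTED = 1 << 0
MAX_PROBE_OPS = 256


class ProbeOp(ctypes.Structure):
    # Mirrors struct io_uring_probe_op (8 bytes).
    _fields_ = [
        ("op", ctypes.c_uint8),
        ("resv", ctypes.c_uint8),
        ("flags", ctypes.c_uint16),
        ("resv2", ctypes.c_uint32),
    ]


class Probe(ctypes.Structure):
    # Mirrors struct io_uring_probe with room for MAX_PROBE_OPS entries.
    _fields_ = [
        ("last_op", ctypes.c_uint8),
        ("ops_len", ctypes.c_uint8),
        ("resv", ctypes.c_uint16),
        ("resv2", ctypes.c_uint32 * 3),
        ("ops", ProbeOp * MAX_PROBE_OPS),
    ]


def supported_opcodes():
    """Return the set of opcode numbers the kernel supports, or None."""
    if not sys.platform.startswith("linux"):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    params = ctypes.create_string_buffer(120)  # zeroed io_uring_params
    fd = libc.syscall(ctypes.c_long(SYS_IO_URING_SETUP), ctypes.c_long(1),
                      params)
    if fd < 0:
        return None  # io_uring unavailable or blocked
    try:
        probe = Probe()  # must be all zeroes when passed in
        ret = libc.syscall(ctypes.c_long(SYS_IO_URING_REGISTER),
                           ctypes.c_long(fd),
                           ctypes.c_long(IORING_REGISTER_PROBE),
                           ctypes.byref(probe), ctypes.c_long(MAX_PROBE_OPS))
        if ret < 0:
            return None
        return {probe.ops[i].op for i in range(probe.ops_len)
                if probe.ops[i].flags & IO_URING_OP_SUPPORTED}
    finally:
        os.close(fd)


def denied_by_allowlist(allowlist):
    """Supported opcodes that an allowlist-only policy would not permit."""
    ops = supported_opcodes()
    return None if ops is None else ops - set(allowlist)
```

For example, `denied_by_allowlist({22, 23})` (22 and 23 being the read and write opcodes) lists everything else the running kernel offers. Opcodes added by later kernels simply show up in that set, so an allowlist stays deny-by-default without being updated.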
Now that I think about it, how does io_uring interact with landlock?