| CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
| In the Linux kernel, the following vulnerability has been resolved:
netfilter: bpf: defer hook memory release until rcu readers are done
Yiming Qian reports UaF when concurrent process is dumping hooks via
nfnetlink_hooks:
BUG: KASAN: slab-use-after-free in nfnl_hook_dump_one.isra.0+0xe71/0x10f0
Read of size 8 at addr ffff888003edbf88 by task poc/79
Call Trace:
<TASK>
nfnl_hook_dump_one.isra.0+0xe71/0x10f0
netlink_dump+0x554/0x12b0
nfnl_hook_get+0x176/0x230
[..]
Defer release until after concurrent readers have completed. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix race between freeing data and fs accessing it
AppArmor was putting the reference to i_private data on its end after
removing the original entry from the file system. However the inode
can aand does live beyond that point and it is possible that some of
the fs call back functions will be invoked after the reference has
been put, which results in a race between freeing the data and
accessing it through the fs.
While the rawdata/loaddata is the most likely candidate to fail the
race, as it has the fewest references. If properly crafted it might be
possible to trigger a race for the other types stored in i_private.
Fix this by moving the put of i_private referenced data to the correct
place which is during inode eviction. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix race on rawdata dereference
There is a race condition that leads to a use-after-free situation:
because the rawdata inodes are not refcounted, an attacker can start
open()ing one of the rawdata files, and at the same time remove the
last reference to this rawdata (by removing the corresponding profile,
for example), which frees its struct aa_loaddata; as a result, when
seq_rawdata_open() is reached, i_private is a dangling pointer and
freed memory is accessed.
The rawdata inodes weren't refcounted to avoid a circular refcount and
were supposed to be held by the profile rawdata reference. However
during profile removal there is a window where the vfs and profile
destruction race, resulting in the use after free.
Fix this by moving to a double refcount scheme. Where the profile
refcount on rawdata is used to break the circular dependency. Allowing
for freeing of the rawdata once all inode references to the rawdata
are put. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix differential encoding verification
Differential encoding allows loops to be created if it is abused. To
prevent this the unpack should verify that a diff-encode chain
terminates.
Unfortunately the differential encode verification had two bugs.
1. it conflated states that had gone through check and already been
marked, with states that were currently being checked and marked.
This means that loops in the current chain being verified are treated
as a chain that has already been verified.
2. the order bailout on already checked states compared current chain
check iterators j,k instead of using the outer loop iterator i.
Meaning a step backwards in states in the current chain verification
was being mistaken for moving to an already verified state.
Move to a double mark scheme where already verified states get a
different mark, than the current chain being kept. This enables us
to also drop the backwards verification check that was the cause of
the second error as any already verified state is already marked. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: Fix double free of ns_name in aa_replace_profiles()
if ns_name is NULL after
1071 error = aa_unpack(udata, &lh, &ns_name);
and if ent->ns_name contains an ns_name in
1089 } else if (ent->ns_name) {
then ns_name is assigned the ent->ns_name
1095 ns_name = ent->ns_name;
however ent->ns_name is freed at
1262 aa_load_ent_free(ent);
and then again when freeing ns_name at
1270 kfree(ns_name);
Fix this by NULLing out ent->ns_name after it is transferred to ns_name
") |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix missing bounds check on DEFAULT table in verify_dfa()
The verify_dfa() function only checks DEFAULT_TABLE bounds when the state
is not differentially encoded.
When the verification loop traverses the differential encoding chain,
it reads k = DEFAULT_TABLE[j] and uses k as an array index without
validation. A malformed DFA with DEFAULT_TABLE[j] >= state_count,
therefore, causes both out-of-bounds reads and writes.
[ 57.179855] ==================================================================
[ 57.180549] BUG: KASAN: slab-out-of-bounds in verify_dfa+0x59a/0x660
[ 57.180904] Read of size 4 at addr ffff888100eadec4 by task su/993
[ 57.181554] CPU: 1 UID: 0 PID: 993 Comm: su Not tainted 6.19.0-rc7-next-20260127 #1 PREEMPT(lazy)
[ 57.181558] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 57.181563] Call Trace:
[ 57.181572] <TASK>
[ 57.181577] dump_stack_lvl+0x5e/0x80
[ 57.181596] print_report+0xc8/0x270
[ 57.181605] ? verify_dfa+0x59a/0x660
[ 57.181608] kasan_report+0x118/0x150
[ 57.181620] ? verify_dfa+0x59a/0x660
[ 57.181623] verify_dfa+0x59a/0x660
[ 57.181627] aa_dfa_unpack+0x1610/0x1740
[ 57.181629] ? __kmalloc_cache_noprof+0x1d0/0x470
[ 57.181640] unpack_pdb+0x86d/0x46b0
[ 57.181647] ? srso_alias_return_thunk+0x5/0xfbef5
[ 57.181653] ? srso_alias_return_thunk+0x5/0xfbef5
[ 57.181656] ? aa_unpack_nameX+0x1a8/0x300
[ 57.181659] aa_unpack+0x20b0/0x4c30
[ 57.181662] ? srso_alias_return_thunk+0x5/0xfbef5
[ 57.181664] ? stack_depot_save_flags+0x33/0x700
[ 57.181681] ? kasan_save_track+0x4f/0x80
[ 57.181683] ? kasan_save_track+0x3e/0x80
[ 57.181686] ? __kasan_kmalloc+0x93/0xb0
[ 57.181688] ? __kvmalloc_node_noprof+0x44a/0x780
[ 57.181693] ? aa_simple_write_to_buffer+0x54/0x130
[ 57.181697] ? policy_update+0x154/0x330
[ 57.181704] aa_replace_profiles+0x15a/0x1dd0
[ 57.181707] ? srso_alias_return_thunk+0x5/0xfbef5
[ 57.181710] ? __kvmalloc_node_noprof+0x44a/0x780
[ 57.181712] ? aa_loaddata_alloc+0x77/0x140
[ 57.181715] ? srso_alias_return_thunk+0x5/0xfbef5
[ 57.181717] ? _copy_from_user+0x2a/0x70
[ 57.181730] policy_update+0x17a/0x330
[ 57.181733] profile_replace+0x153/0x1a0
[ 57.181735] ? rw_verify_area+0x93/0x2d0
[ 57.181740] vfs_write+0x235/0xab0
[ 57.181745] ksys_write+0xb0/0x170
[ 57.181748] do_syscall_64+0x8e/0x660
[ 57.181762] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 57.181765] RIP: 0033:0x7f6192792eb2
Remove the MATCH_FLAG_DIFF_ENCODE condition to validate all DEFAULT_TABLE
entries unconditionally. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix side-effect bug in match_char() macro usage
The match_char() macro evaluates its character parameter multiple
times when traversing differential encoding chains. When invoked
with *str++, the string pointer advances on each iteration of the
inner do-while loop, causing the DFA to check different characters
at each iteration and therefore skip input characters.
This results in out-of-bounds reads when the pointer advances past
the input buffer boundary.
[ 94.984676] ==================================================================
[ 94.985301] BUG: KASAN: slab-out-of-bounds in aa_dfa_match+0x5ae/0x760
[ 94.985655] Read of size 1 at addr ffff888100342000 by task file/976
[ 94.986319] CPU: 7 UID: 1000 PID: 976 Comm: file Not tainted 6.19.0-rc7-next-20260127 #1 PREEMPT(lazy)
[ 94.986322] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 94.986329] Call Trace:
[ 94.986341] <TASK>
[ 94.986347] dump_stack_lvl+0x5e/0x80
[ 94.986374] print_report+0xc8/0x270
[ 94.986384] ? aa_dfa_match+0x5ae/0x760
[ 94.986388] kasan_report+0x118/0x150
[ 94.986401] ? aa_dfa_match+0x5ae/0x760
[ 94.986405] aa_dfa_match+0x5ae/0x760
[ 94.986408] __aa_path_perm+0x131/0x400
[ 94.986418] aa_path_perm+0x219/0x2f0
[ 94.986424] apparmor_file_open+0x345/0x570
[ 94.986431] security_file_open+0x5c/0x140
[ 94.986442] do_dentry_open+0x2f6/0x1120
[ 94.986450] vfs_open+0x38/0x2b0
[ 94.986453] ? may_open+0x1e2/0x2b0
[ 94.986466] path_openat+0x231b/0x2b30
[ 94.986469] ? __x64_sys_openat+0xf8/0x130
[ 94.986477] do_file_open+0x19d/0x360
[ 94.986487] do_sys_openat2+0x98/0x100
[ 94.986491] __x64_sys_openat+0xf8/0x130
[ 94.986499] do_syscall_64+0x8e/0x660
[ 94.986515] ? count_memcg_events+0x15f/0x3c0
[ 94.986526] ? srso_alias_return_thunk+0x5/0xfbef5
[ 94.986540] ? handle_mm_fault+0x1639/0x1ef0
[ 94.986551] ? vma_start_read+0xf0/0x320
[ 94.986558] ? srso_alias_return_thunk+0x5/0xfbef5
[ 94.986561] ? srso_alias_return_thunk+0x5/0xfbef5
[ 94.986563] ? fpregs_assert_state_consistent+0x50/0xe0
[ 94.986572] ? srso_alias_return_thunk+0x5/0xfbef5
[ 94.986574] ? arch_exit_to_user_mode_prepare+0x9/0xb0
[ 94.986587] ? srso_alias_return_thunk+0x5/0xfbef5
[ 94.986588] ? irqentry_exit+0x3c/0x590
[ 94.986595] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 94.986597] RIP: 0033:0x7fda4a79c3ea
Fix by extracting the character value before invoking match_char,
ensuring single evaluation per outer loop. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix: limit the number of levels of policy namespaces
Currently the number of policy namespaces is not bounded relying on
the user namespace limit. However policy namespaces aren't strictly
tied to user namespaces and it is possible to create them and nest
them arbitrarily deep which can be used to exhaust system resource.
Hard cap policy namespaces to the same depth as user namespaces. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: replace recursive profile removal with iterative approach
The profile removal code uses recursion when removing nested profiles,
which can lead to kernel stack exhaustion and system crashes.
Reproducer:
$ pf='a'; for ((i=0; i<1024; i++)); do
echo -e "profile $pf { \n }" | apparmor_parser -K -a;
pf="$pf//x";
done
$ echo -n a > /sys/kernel/security/apparmor/.remove
Replace the recursive __aa_profile_list_release() approach with an
iterative approach in __remove_profile(). The function repeatedly
finds and removes leaf profiles until the entire subtree is removed,
maintaining the same removal semantic without recursion. |
| In the Linux kernel, the following vulnerability has been resolved:
apparmor: fix memory leak in verify_header
The function sets `*ns = NULL` on every call, leaking the namespace
string allocated in previous iterations when multiple profiles are
unpacked. This also breaks namespace consistency checking since *ns
is always NULL when the comparison is made.
Remove the incorrect assignment.
The caller (aa_unpack) initializes *ns to NULL once before the loop,
which is sufficient. |
| In the Linux kernel, the following vulnerability has been resolved:
KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE
Adjust KVM's sanity check against overwriting a shadow-present SPTE with a
another SPTE with a different target PFN to only apply to direct MMUs,
i.e. only to MMUs without shadowed gPTEs. While it's impossible for KVM
to overwrite a shadow-present SPTE in response to a guest write, writes
from outside the scope of KVM, e.g. from host userspace, aren't detected
by KVM's write tracking and so can break KVM's shadow paging rules.
------------[ cut here ]------------
pfn != spte_to_pfn(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:3069 at mmu_set_spte+0x1e4/0x440 [kvm], CPU#0: vmx_ept_stale_r/872
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 872 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mmu_set_spte+0x1e4/0x440 [kvm]
Call Trace:
<TASK>
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]--- |
| In the Linux kernel, the following vulnerability has been resolved:
KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE
When installing an emulated MMIO SPTE, do so *after* dropping/zapping the
existing SPTE (if it's shadow-present). While commit a54aa15c6bda3 was
right about it being impossible to convert a shadow-present SPTE to an
MMIO SPTE due to a _guest_ write, it failed to account for writes to guest
memory that are outside the scope of KVM.
E.g. if host userspace modifies a shadowed gPTE to switch from a memslot
to emulted MMIO and then the guest hits a relevant page fault, KVM will
install the MMIO SPTE without first zapping the shadow-present SPTE.
------------[ cut here ]------------
is_shadow_present_pte(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:484 at mark_mmio_spte+0xb2/0xc0 [kvm], CPU#0: vmx_ept_stale_r/4292
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 4292 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mark_mmio_spte+0xb2/0xc0 [kvm]
Call Trace:
<TASK>
mmu_set_spte+0x237/0x440 [kvm]
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x47fa3f
</TASK>
---[ end trace 0000000000000000 ]--- |
| In the Linux kernel, the following vulnerability has been resolved:
rust_binder: call set_notification_done() without proc lock
Consider the following sequence of events on a death listener:
1. The remote process dies and sends a BR_DEAD_BINDER message.
2. The local process invokes the BC_CLEAR_DEATH_NOTIFICATION command.
3. The local process then invokes the BC_DEAD_BINDER_DONE.
Then, the kernel will reply to the BC_DEAD_BINDER_DONE command with a
BR_CLEAR_DEATH_NOTIFICATION_DONE reply using push_work_if_looper().
However, this can result in a deadlock if the current thread is not a
looper. This is because dead_binder_done() still holds the proc lock
during set_notification_done(), which called push_work_if_looper().
Normally, push_work_if_looper() takes the thread lock, which is fine to
take under the proc lock. But if the current thread is not a looper,
then it falls back to delivering the reply to the process work queue,
which involves taking the proc lock. Since the proc lock is already
held, this is a deadlock.
Fix this by releasing the proc lock during set_notification_done(). It
was not intentional that it was held during that function to begin with.
I don't think this ever happens in Android because BC_DEAD_BINDER_DONE
is only invoked in response to BR_DEAD_BINDER messages, and the kernel
always delivers BR_DEAD_BINDER to a looper. So there's no scenario where
Android userspace will call BC_DEAD_BINDER_DONE on a non-looper thread. |
| In the Linux kernel, the following vulnerability has been resolved:
nf_tables: nft_dynset: fix possible stateful expression memleak in error path
If cloning the second stateful expression in the element via GFP_ATOMIC
fails, then the first stateful expression remains in place without being
released.
unreferenced object (percpu) 0x607b97e9cab8 (size 16):
comm "softirq", pid 0, jiffies 4294931867
hex dump (first 16 bytes on cpu 3):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
backtrace (crc 0):
pcpu_alloc_noprof+0x453/0xd80
nft_counter_clone+0x9c/0x190 [nf_tables]
nft_expr_clone+0x8f/0x1b0 [nf_tables]
nft_dynset_new+0x2cb/0x5f0 [nf_tables]
nft_rhash_update+0x236/0x11c0 [nf_tables]
nft_dynset_eval+0x11f/0x670 [nf_tables]
nft_do_chain+0x253/0x1700 [nf_tables]
nft_do_chain_ipv4+0x18d/0x270 [nf_tables]
nf_hook_slow+0xaa/0x1e0
ip_local_deliver+0x209/0x330 |
| In the Linux kernel, the following vulnerability has been resolved:
icmp: fix NULL pointer dereference in icmp_tag_validation()
icmp_tag_validation() unconditionally dereferences the result of
rcu_dereference(inet_protos[proto]) without checking for NULL.
The inet_protos[] array is sparse -- only about 15 of 256 protocol
numbers have registered handlers. When ip_no_pmtu_disc is set to 3
(hardened PMTU mode) and the kernel receives an ICMP Fragmentation
Needed error with a quoted inner IP header containing an unregistered
protocol number, the NULL dereference causes a kernel panic in
softirq context.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:icmp_unreach (net/ipv4/icmp.c:1085 net/ipv4/icmp.c:1143)
Call Trace:
<IRQ>
icmp_rcv (net/ipv4/icmp.c:1527)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:207)
ip_local_deliver_finish (net/ipv4/ip_input.c:242)
ip_local_deliver (net/ipv4/ip_input.c:262)
ip_rcv (net/ipv4/ip_input.c:573)
__netif_receive_skb_one_core (net/core/dev.c:6164)
process_backlog (net/core/dev.c:6628)
handle_softirqs (kernel/softirq.c:561)
</IRQ>
Add a NULL check before accessing icmp_strict_tag_validation. If the
protocol has no registered handler, return false since it cannot
perform strict tag validation. |
| In the Linux kernel, the following vulnerability has been resolved:
nfnetlink_osf: validate individual option lengths in fingerprints
nfnl_osf_add_callback() validates opt_num bounds and string
NUL-termination but does not check individual option length fields.
A zero-length option causes nf_osf_match_one() to enter the option
matching loop even when foptsize sums to zero, which matches packets
with no TCP options where ctx->optp is NULL:
Oops: general protection fault
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
Call Trace:
nf_osf_match (net/netfilter/nfnetlink_osf.c:227)
xt_osf_match_packet (net/netfilter/xt_osf.c:32)
ipt_do_table (net/ipv4/netfilter/ip_tables.c:293)
nf_hook_slow (net/netfilter/core.c:623)
ip_local_deliver (net/ipv4/ip_input.c:262)
ip_rcv (net/ipv4/ip_input.c:573)
Additionally, an MSS option (kind=2) with length < 4 causes
out-of-bounds reads when nf_osf_match_one() unconditionally accesses
optp[2] and optp[3] for MSS value extraction. While RFC 9293
section 3.2 specifies that the MSS option is always exactly 4
bytes (Kind=2, Length=4), the check uses "< 4" rather than
"!= 4" because lengths greater than 4 do not cause memory
safety issues -- the buffer is guaranteed to be at least
foptsize bytes by the ctx->optsize == foptsize check.
Reject fingerprints where any option has zero length, or where an MSS
option has length less than 4, at add time rather than trusting these
values in the packet matching hot path. |
| In the Linux kernel, the following vulnerability has been resolved:
wifi: mac80211: fix NULL deref in mesh_matches_local()
mesh_matches_local() unconditionally dereferences ie->mesh_config to
compare mesh configuration parameters. When called from
mesh_rx_csa_frame(), the parsed action-frame elements may not contain a
Mesh Configuration IE, leaving ie->mesh_config NULL and triggering a
kernel NULL pointer dereference.
The other two callers are already safe:
- ieee80211_mesh_rx_bcn_presp() checks !elems->mesh_config before
calling mesh_matches_local()
- mesh_plink_get_event() is only reached through
mesh_process_plink_frame(), which checks !elems->mesh_config, too
mesh_rx_csa_frame() is the only caller that passes raw parsed elements
to mesh_matches_local() without guarding mesh_config. An adjacent
attacker can exploit this by sending a crafted CSA action frame that
includes a valid Mesh ID IE but omits the Mesh Configuration IE,
crashing the kernel.
The captured crash log:
Oops: general protection fault, probably for non-canonical address ...
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
Workqueue: events_unbound cfg80211_wiphy_work
[...]
Call Trace:
<TASK>
? __pfx_mesh_matches_local (net/mac80211/mesh.c:65)
ieee80211_mesh_rx_queued_mgmt (net/mac80211/mesh.c:1686)
[...]
ieee80211_iface_work (net/mac80211/iface.c:1754 net/mac80211/iface.c:1802)
[...]
cfg80211_wiphy_work (net/wireless/core.c:426)
process_one_work (net/kernel/workqueue.c:3280)
? assign_work (net/kernel/workqueue.c:1219)
worker_thread (net/kernel/workqueue.c:3352)
? __pfx_worker_thread (net/kernel/workqueue.c:3385)
kthread (net/kernel/kthread.c:436)
[...]
ret_from_fork_asm (net/arch/x86/entry/entry_64.S:255)
</TASK>
This patch adds a NULL check for ie->mesh_config at the top of
mesh_matches_local() to return false early when the Mesh Configuration
IE is absent. |
| In the Linux kernel, the following vulnerability has been resolved:
Bluetooth: L2CAP: Fix accepting multiple L2CAP_ECRED_CONN_REQ
Currently the code attempts to accept requests regardless of the
command identifier which may cause multiple requests to be marked
as pending (FLAG_DEFER_SETUP) which can cause more than
L2CAP_ECRED_MAX_CID(5) to be allocated in l2cap_ecred_rsp_defer
causing an overflow.
The spec is quite clear that the same identifier shall not be used on
subsequent requests:
'Within each signaling channel a different Identifier shall be used
for each successive request or indication.'
https://www.bluetooth.com/wp-content/uploads/Files/Specification/HTML/Core-62/out/en/host/logical-link-control-and-adaptation-protocol-specification.html#UUID-32a25a06-4aa4-c6c7-77c5-dcfe3682355d
So this attempts to check if there are any channels pending with the
same identifier and rejects if any are found. |
| In the Linux kernel, the following vulnerability has been resolved:
af_unix: Give up GC if MSG_PEEK intervened.
Igor Ushakov reported that GC purged the receive queue of
an alive socket due to a race with MSG_PEEK with a nice repro.
This is the exact same issue previously fixed by commit
cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
After GC was replaced with the current algorithm, the cited
commit removed the locking dance in unix_peek_fds() and
reintroduced the same issue.
The problem is that MSG_PEEK bumps a file refcount without
interacting with GC.
Consider an SCC containing sk-A and sk-B, where sk-A is
close()d but can be recv()ed via sk-B.
The bad thing happens if sk-A is recv()ed with MSG_PEEK from
sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
for sk-A and sk-B.
GC thread User thread
--------- -----------
unix_vertex_dead(sk-A)
-> true <------.
\
`------ recv(sk-B, MSG_PEEK)
invalidate !! -> sk-A's file refcount : 1 -> 2
close(sk-B)
-> sk-B's file refcount : 2 -> 1
unix_vertex_dead(sk-B)
-> true
Initially, sk-A's file refcount is 1 by the inflight fd in sk-B
recvq. GC thinks sk-A is dead because the file refcount is the
same as the number of its inflight fds.
However, sk-A's file refcount is bumped silently by MSG_PEEK,
which invalidates the previous evaluation.
At this moment, sk-B's file refcount is 2; one by the open fd,
and one by the inflight fd in sk-A. The subsequent close()
releases one refcount by the former.
Finally, GC incorrectly concludes that both sk-A and sk-B are dead.
One option is to restore the locking dance in unix_peek_fds(),
but we can resolve this more elegantly thanks to the new algorithm.
The point is that the issue does not occur without the subsequent
close() and we actually do not need to synchronise MSG_PEEK with
the dead SCC detection.
When the issue occurs, close() and GC touch the same file refcount.
If GC sees the refcount being decremented by close(), it can just
give up garbage-collecting the SCC.
Therefore, we only need to signal the race during MSG_PEEK with
a proper memory barrier to make it visible to the GC.
Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
it defer the SCC to the next run.
This way no locking is needed on the MSG_PEEK side, and we can
avoid imposing a penalty on every MSG_PEEK unnecessarily.
Note that we can retry within unix_scc_dead() if MSG_PEEK is
detected, but we do not do so to avoid hung task splat from
abusive MSG_PEEK calls. |
| In the Linux kernel, the following vulnerability has been resolved:
bridge: cfm: Fix race condition in peer_mep deletion
When a peer MEP is being deleted, cancel_delayed_work_sync() is called
on ccm_rx_dwork before freeing. However, br_cfm_frame_rx() runs in
softirq context under rcu_read_lock (without RTNL) and can re-schedule
ccm_rx_dwork via ccm_rx_timer_start() between cancel_delayed_work_sync()
returning and kfree_rcu() being called.
The following is a simple race scenario:
cpu0 cpu1
mep_delete_implementation()
cancel_delayed_work_sync(ccm_rx_dwork);
br_cfm_frame_rx()
// peer_mep still in hlist
if (peer_mep->ccm_defect)
ccm_rx_timer_start()
queue_delayed_work(ccm_rx_dwork)
hlist_del_rcu(&peer_mep->head);
kfree_rcu(peer_mep, rcu);
ccm_rx_work_expired()
// on freed peer_mep
To prevent this, cancel_delayed_work_sync() is replaced with
disable_delayed_work_sync() in both peer MEP deletion paths, so
that subsequent queue_delayed_work() calls from br_cfm_frame_rx()
are silently rejected.
The cc_peer_disable() helper retains cancel_delayed_work_sync()
because it is also used for the CC enable/disable toggle path where
the work must remain re-schedulable. |