Commit Graph

300534 Commits

Author SHA1 Message Date
Kyle Evans ee9895e10d kern: send parent a SIGCHLD when the debugger has detached
The practical scenario that leads to this is porch(1) spawning some
utility and sending it a SIGSTOP as a debugging aid.  The user then
attaches a debugger and walks through how some specific input is
processed, then detaches to allow the script to continue.  When ptrace
is detached, the process resumes execution but the parent is never
notified and may be stuck in wait(2) for it to continue or terminate.

Other platforms seem to re-suspend the process after the debugger is
detached, but neither behavior seems unreasonable.  Just notifying the
parent that the child has resumed is a relatively low-risk departure
from our current behavior and had apparently been considered in the
past, based on pre-existing comments.

Move p_flag and p_xsig handling into childproc_continued(), as just
sending the SIGCHLD here isn't really useful without P_CONTINUED set
and the other caller already sets these up as well.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D50917
2025-06-19 10:32:04 -05:00
Ed Maste 5110a74afe sys: Correct osreldate descriptions
The kern.osreldate sysctl reports the kernel version, not a release
date.  Also correct a comment about /usr/include/osreldate.h.

Reviewed by:	kp, olce
Event:		Kitchener-Waterloo Hackathon 202506
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50938
2025-06-19 10:52:52 -04:00
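
As a small aside to commit 5110a74afe, the value reported by kern.osreldate is a version number, not a date. The sketch below decodes the major and minor parts, assuming the numbering scheme used by recent FreeBSD versions (major * 100000 + minor * 1000 + a revision counter). The struct and function names are hypothetical.

```c
/* Decoded parts of an osreldate / __FreeBSD_version style value. */
struct osrel {
	int major;
	int minor;
};

/*
 * Split a version number into major and minor, assuming the
 * major * 100000 + minor * 1000 + revision layout.  For example,
 * 1301000 decodes to major 13, minor 1.
 */
struct osrel
decode_osreldate(int osreldate)
{
	struct osrel r;

	r.major = osreldate / 100000;
	r.minor = (osreldate / 1000) % 100;
	return (r);
}
```
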
Maxim Konovalov b78b7fa01f nuageinit.7: language and grammar improvements
Reviewed by:	bapt
2025-06-19 13:14:33 +00:00
Kevin Lo 19d0dd8718 mtw: fix display of the MAC revision
Reviewed by:	adrian
Differential Revision:	https://reviews.freebsd.org/D50542
2025-06-19 13:42:39 +08:00
Bjoern A. Zeeb f51c794cbc net80211: in ieee80211_sta_join() only do_ht if HT is avail
In ieee80211_sta_join() there are currently two ways to set
"do_ht": (1) after checking HT IEs are avail, and (2) after
checking VHT IEs are avail and we are not on 2GHz.

In the latter case no one checks that HT IEs are available and
when we hit ieee80211_ht_updateparams_final() htinfo may be NULL
and we panic.

Avoid this by only checking for VHT if do_ht was set.
No VHT without HT IEs.

While here switch do_ht to be a bool.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
PR:		287625
Fixes:		51172f62a7
Reviewed by:	adrian
Differential Revision: https://reviews.freebsd.org/D50923
2025-06-19 01:23:12 +00:00
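
The control flow described in commit f51c794cbc can be sketched as follows. This is not the net80211 code: the struct, its field names, and decide_ht_vht() are stand-ins invented for illustration. The point is only that VHT is considered when HT has already been selected, so a later step never sees VHT enabled with a NULL HT information element.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the information elements of a scan entry. */
struct scan_ies {
	const unsigned char *htcap;	/* HT capabilities IE, or NULL */
	const unsigned char *htinfo;	/* HT information IE, or NULL */
	const unsigned char *vhtcap;	/* VHT capabilities IE, or NULL */
	const unsigned char *vhtopmode;	/* VHT operation IE, or NULL */
	bool is_2ghz;			/* channel is in the 2GHz band */
};

struct join_decision {
	bool do_ht;
	bool do_vht;
};

struct join_decision
decide_ht_vht(const struct scan_ies *ies)
{
	struct join_decision d = { false, false };

	/* (1) HT requires the HT IEs to be present. */
	if (ies->htcap != NULL && ies->htinfo != NULL)
		d.do_ht = true;

	/* (2) VHT is only considered once HT is established. */
	if (d.do_ht && ies->vhtcap != NULL && ies->vhtopmode != NULL &&
	    !ies->is_2ghz)
		d.do_vht = true;

	return (d);
}
```
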
Mark Johnston 4c6c1dd8f7 vm_page: Fix nofree page accounting
In commit ae10431c98 ("vm_page: Allow PG_NOFREE pages to be freed"), I
changed the v_nofree_count counter to instead count the size of the
nofree queue, on the basis that with the ability to free nofree pages,
the size of the queue is unbounded.

The use of a counter(9) for this purpose is not really correct, as early
initialization of per-CPU counters interferes with precise accounting
that we want here.  Instead, add a global tracker for this purpose,
expose it elsewhere in the sysctl tree, and restore v_free_nofree's
original use as a counter of allocated nofree pages.

Reviewed by:	bnovkov, alc, kib
Reported by:	alc
Fixes:		ae10431c98 ("vm_page: Allow PG_NOFREE pages to be freed")
Differential Revision:	https://reviews.freebsd.org/D50877
2025-06-18 23:48:07 +00:00
Bjoern A. Zeeb f1f71cc717 fwget: pci_intel_video: do not log on no match
We should never "log" a statement on no match for a given device we
do not know about.  We do not control the PCI ID assignments and thus
cannot predict if we would even support such a device.

This also triggers an invalid output in the installer.

Leave it as log_verbose for now.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
PR:		287639
Reviewed by:	manu, emaste
Differential Revision: https://reviews.freebsd.org/D50916
2025-06-18 23:31:13 +00:00
Alan Cox deddede58e arm64 pmap: use the counter(9) KPI for L2 superpages
Use the counter(9) KPI instead of atomics to maintain the L2 superpage
mapping counts.  (A similar change was made to the amd64 pmap in 2021.)
While here, update the SYSCTL descriptions to reflect the possibility
that the base page size is 16KB.
2025-06-18 18:05:58 -05:00
Mateusz Piotrowski c29459f901 tracing.7: Add a single reference point for tracing facilities in FreeBSD
FreeBSD has a fair number of tracing facilities. The new tracing(7)
manual page aims to provide a starting point for users to learn about
what is available.

Reviewed by:	christos, bnovkov, markj, ziaee
Approved by:	christos (mentor), bnovkov (mentor), markj (mentor)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D50854
2025-06-19 00:15:26 +02:00
Sergey A. Osokin 22c7815118 exec(3): add missing execvpe(3) to MLINKS
Reviewed by:	glebius
2025-06-18 17:40:22 -04:00
Warner Losh c329931c02 pass: Make the name of the driver a #define
"pass" is in several places, but should be a #define. Make it one. This
also lets folks with particular needs who copy this driver reduce
diffs.

Sponsored by:		Netflix
2025-06-18 14:30:34 -06:00
Kyle Evans eca5637760 stand: userboot: allow building on !x86
We can still get plenty of use out of a userboot that doesn't know
anything about how to load or boot a kernel; notably, the test harness
in tools/boot can still be used to test lua changes.

Hack out the necessary bits to simply build on other platforms, and add
a small warning with ample time to view the warning on other platforms.
We still won't build userboot by default on these platforms, since the
build product isn't useful for most people.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D41529
2025-06-18 13:42:29 -05:00
Konstantin Belousov 0452f5f7b3 audit: move the wait for the queue length from commit to alloc
AUDIT_SYSCALL_EXIT(), and indirectly audit_commit(), is intended to be
called from an arbitrary top-level context.  This means that any
sleepable locks can be owned by the caller, which makes sleeping in
audit_commit() forbidden.

Since we need to sleep for the record in audit_alloc() anyway, move the
sleep for the queue limit there.  At worst, if audit is suspended or
disabled when we actually reach the commit location, this means that we
lost time uselessly.

PR:	287566
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D50879
2025-06-18 20:57:49 +03:00
Mateusz Piotrowski fa9ac741d0 truss.1: Reference sysdecode(3)
MFC after:	1 week
2025-06-18 19:40:27 +02:00
Olivier Certner 013c58ced6 sched_ule: 32-bit platforms: Fix runq_print() after runq changes
The compiler would report a mismatch between the format and the actual
type of the runqueue status word because the latter is now
unconditionally defined as an 'unsigned long' (which has the "natural"
platform size) and the format expects a 'size_t', which expands to an
'unsigned int' on 32-bit platforms (although they are both of the same
actual size).

This worked before as the C type used depended on the architecture and
was set to 'uint32_t' aka 'unsigned int' on these 32-bit platforms.

Just fix the format (use 'l').  While here, remove outputting '0x' by
hand, instead relying on '#' (only difference is for 0, and is fine).

runq_print() should be moved out of 'sched_ule.c' in a subsequent
commit.

Reported by:    Jenkins
Fixes:          79d8a99ee583 ("runq: Deduce most parameters, remove machine headers")
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
2025-06-18 12:00:13 -04:00
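
The format change in commit 013c58ced6 relies on standard printf(3) behavior: the 'l' length modifier matches 'unsigned long', and the '#' flag prefixes "0x" to non-zero hexadecimal values only. The helper below is a made-up wrapper (not the kernel's runq_print()) that shows the effect, including the one difference the message mentions for a zero value.

```c
#include <stdio.h>

/*
 * Format an 'unsigned long' status word with "%#lx".
 * Non-zero values get a "0x" prefix; zero is printed as plain "0".
 * Returns what snprintf() returns.
 */
int
format_status_word(char *buf, size_t len, unsigned long sw)
{
	return (snprintf(buf, len, "%#lx", sw));
}
```
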
Olivier Certner 63c9b01806 arm64: lib32: Don't try to install removed <machine/runq.h>
Reported by:    Herbert J. Skuhra (herbert gojira.at)
Fixes:          79d8a99ee583 ("runq: Deduce most parameters, remove machine headers")
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
2025-06-18 12:00:08 -04:00
Kyle Evans abdbd85d1b lualoader: adapt builtin brand/logo definitions as well
While these should be moved to the new format, it wasn't my intention
to force them over immediately.  Downstreams may embed their own brands
in drawer.lua, and we shouldn't break them for something like this.

Move adapt_fb_shim() up and use it for preloaded definitions to avoid
forcing the matter for now.  Perhaps in the future we'll start writing
out warnings for those that do need to be adapted.

Reported by:	0x1eef on IRC
2025-06-18 10:21:37 -05:00
Mark Johnston 9d0d55e398 ufshci: Remove an unneeded variable definition
Reported by:	gcc
Fixes:		1349a733cf ("ufshci: Introduce the ufshci(4) driver")
2025-06-18 13:13:08 +00:00
Randall Stewart 359f590b29 Fix a warning in the rack stack.
There is an initialization warning where error may not be set when logging
extended BBlogs. Let's fix this by initializing error to zero so we won't
have a warning.
2025-06-18 08:14:51 -04:00
Robert Wing 690f642fab growfs(8): use gpart(8) instead of bsdlabel(8) in test
bsdlabel(8) is deprecated

Reviewed by:	emaste
Differential Revision:	https://reviews.freebsd.org/D50865
2025-06-17 23:21:20 -08:00
Gleb Smirnoff 46023d54c7 tcp: fixup wording in comment
Submitted by:	Steffen Nurpmeso <steffen sdaoden.eu>
Fixes:		b59753f1d5
2025-06-17 20:47:31 -07:00
Olivier Certner 1d8f8f3e36 ps(1), top(1): Priority: Let 0 be the first timesharing level
Change the origin from PZERO to PUSER.

Doing so allows users to immediately detect if some thread is running
under a high priority (kernel or realtime) or under a low one
(timesharing or idle).

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
2025-06-17 22:09:39 -04:00
Olivier Certner eebc148f25 sched_4bsd: ESTCPULIM(): Allow any value in the timeshare range
The current formula wastes queues and degrades usage estimation
precision, since any increase of ticks that goes over 40 priorities (so,
8 * 40) is clamped to the last of these 40 levels (the nice value is
subsequently added to that number to get the final priority level).

Allow 'ts_estcpu' to grow up to a value corresponding to the greatest
(i.e., lowest) priority of the timeshare range.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45392
2025-06-17 22:09:39 -04:00
Olivier Certner 51a4ae05ab sched_4bsd: Remove RQ_PPQ from ESTCPULIM()'s formula
Subtracting RQ_PPQ from the maximum number of allowed priority values
(the factor to INVERSE_ESTCPU_WEIGHT) has the effect of pessimizing the
number of processes assigned to the last priority bucket.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45392
2025-06-17 22:09:38 -04:00
Olivier Certner a454ff6b04 sched_4bsd: Move ESTCPULIM() after its macro dependencies
No functional change (intended).

Also makes the comment about INVERSE_ESTCPU_WEIGHT() adjacent to its
definition.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45392
2025-06-17 22:09:38 -04:00
Olivier Certner a33225efb4 sched_ule: Sanitize CPU's use and priority computations, and ticks storage
Computation of %CPU in sched_pctcpu() was overly complicated, wrong in
the case of a non-maximal window (10 seconds span; this is always the
case in practice as the window would oscillate between 10 and 11 seconds
for continuously running processes) and performed unshifted for the
first part, essentially losing precision (up to 9% for SCHED_TICK_SECS
being 10), and with some ineffective shift for the second part.
Conserve maximum precision by only shifting by the required amount to
attain FSHIFT before dividing.  Apply classical rounding to nearest
instead of rounding down.

To generally avoid wraparound problems with tick fields in 'struct
td_sched' (as already happened once in sched_pctcpu_update()), make them
all unsigned, and ensure 'ticks' is always converted to some 'u_int'.
While here, fix SCHED_AFFINITY().

Rewrite sched_pctcpu_update() while keeping the existing formulas:
- Fix the hole in the cliff case that in theory 'ts_ticks' can become
  greater than the window size if a running thread has not been
  accounted for too long (today cannot happen because of sched_clock()).
- Make the decay ratio explicit and configurable (SCHED_CPU_DECAY_NUMER,
  SCHED_CPU_DECAY_DENOM).  Set it to the current value (10/11),
  currently producing a 95% attenuation after about 32s.  This eases
  experimenting with changing it.  Apply the ratio on shifted ticks for
  better precision, independently of the chosen value for
  SCHED_TICK_MAX/SCHED_TICK_SECS.
- Remove redundant SCHED_TICK_TARG.  Compute SCHED_TICK_MAX from
  SCHED_TICK_SECS, the latter now really specifying the maximum size of
  the %CPU estimation window.
- Ensure it is immune to varying 'hz' (which today can't happen), so
  that after computation SCHED_TICK_RUN(ts) is mathematically guaranteed
  lower than SCHED_TICK_LENGTH(ts).
- Thoroughly explain the current formula, and mention its main drawback
  (it is completely dependent on the frequency of calls to
  sched_pctcpu_update(), which currently manifests itself for sleeping
  threads).

Rework sched_priority():
- Ensure 'p_nice' is read only once, to be immune to a concurrent
  change.
- Clearly show that the computed priority is the sum of 3 components.
  Make them all positive by shifting the starting priority and shifting
  the nice value in SCHED_PRI_NICE().
- Compute the priority offset deriving from the %CPU with rounding to
  nearest.
- Much more informative KASSERT() output with details regarding the
  priority computation.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D46567
2025-06-17 22:09:38 -04:00
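
Two of the points in commit a33225efb4 can be checked with a short, self-contained sketch. It is not the ULE code: pct_fixed() and decay_steps_to_5pct() are hypothetical helpers, FSHIFT is assumed to be 11 as in the kernel's fixed-point load representation, and the decay is assumed to be applied once per second. The first helper shifts before dividing and rounds to nearest. The second counts how many applications of a decay ratio are needed to fall to 5% or less, which for 10/11 gives the "about 32s" figure.

```c
#define	FSHIFT	11	/* fractional bits; assumed to match the kernel's FSHIFT */

/*
 * Compute run/len as a fixed-point fraction with FSHIFT fractional
 * bits, shifting before dividing and rounding to nearest.
 * 'len' must be non-zero.
 */
unsigned int
pct_fixed(unsigned int run, unsigned int len)
{
	unsigned long long num = (unsigned long long)run << FSHIFT;

	return ((unsigned int)((num + len / 2) / len));
}

/*
 * Count how many times the ratio numer/denom (with numer < denom)
 * must be applied for a value to drop to 5% or less of its start.
 */
int
decay_steps_to_5pct(unsigned int numer, unsigned int denom)
{
	double v = 1.0;
	int n = 0;

	while (v > 0.05) {
		v = v * numer / denom;
		n++;
	}
	return (n);
}
```

With truncation instead of rounding, pct_fixed(1, 3) would yield 682 rather than 683. The difference is small per call, but rounding to nearest avoids a systematic downward bias.
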
Olivier Certner 6792f3411f sched_ule: Recover previous nice and anti-starvation behaviors
Justification for this change is to avoid disturbing ULE's behavior too
much at this time.  We however acknowledge that the effect of "nice"
values is extremely weak and will most probably change it going forward.

Tuning allows us to mostly recover ULE's behavior prior to the switch to
a single 256-queue runqueue and the increase of the timesharing priority
levels' range.

After this change, in a series of tests involving two long-running
processes with varying nice values competing for the same CPU, we
observe that the used CPU time ratios of the highest-priority process
change by at most 1.15% and on average by 0.46% (absolute differences).
In relative differences, they change by at most 2% and on average by
0.78%.

In order to preserve these ratios, as the number of priority levels
allotted to timesharing has been raised from 136 to 168 (and the subsets
of them dedicated to either interactive or batch threads scaled
accordingly), we keep the ratio of levels reserved to handle nice values
to those reserved for CPU usage by applying a factor of 5/4 (which is
close to 168/136).

Time-based advance of the timesharing circular queue's head is ULE's
main fairness and anti-starvation mechanism.  The higher number of
queues subject to the timesharing scheduling policy is now compensated
by allowing a greater increment of the head offset per tick.  Because
there are now 109 queue levels dedicated to the timesharing scheduling
policy (in contrast with the 168 levels allotted to timesharing levels,
which include the former but also those dedicated to threads considered
interactive) whereas there previously were 64 ones (priorities spread
into a single, separate runqueue), we advance the circular queue's head
7/4 faster (a ratio close to 109/64).

While here, take into account 'cnt' as the number of ticks when
advancing the circular queue's head.  This fix depends on the other code
changes enabling incrementation by more than one.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D46566
2025-06-17 22:09:37 -04:00
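
For reference, the two scaling factors quoted in commit 6792f3411f can be compared with the exact ratios they approximate. This is plain arithmetic on the numbers given in the message:

```latex
\[
\frac{168}{136} \approx 1.235 \quad\text{vs.}\quad \frac{5}{4} = 1.25,
\qquad
\frac{109}{64} \approx 1.703 \quad\text{vs.}\quad \frac{7}{4} = 1.75
\]
```

Both chosen factors are slightly larger than the exact ratios (by roughly 1.2% and 2.7% respectively).
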
Olivier Certner dee257c28d sched: Internal priority ranges: Reduce kernel, increase timeshare
Now that a difference of 1 in priority level is significant, we can
shrink the priority range reserved for kernel threads.

Only four distinct levels are necessary for the bottom half (3 base
levels and arguably an additional one for demoted interrupt threads that
run for full time slices so that they finally don't compete with other
ones).  To leave room for other possible uses, we settle on 8 levels.

Given the symbolic constants for the top half, 10 levels are currently
necessary.  We settle on 16 levels.

This allows us to enlarge the timesharing range, which covers both ULE's
interactive and batch ranges, to 168 distinct levels from fewer than 64
for ULE (as of before the changes to make it use a single runqueue
and have 256 distinct levels per runqueue) and 34 for 4BSD.

While here, note that the realtime range is required to have at least 32
priority levels since:
- POSIX mandates at least 32 distinct levels for the SCHED_RR/SCHED_FIFO
  scheduling policies.
- We directly map contiguous priority levels ('sched_priority') of these
  scheduling policies to distinct, contiguous internal priority levels.
Conversely, having at least 32 priority levels is enough to guarantee
compliance to the POSIX requirement mentioned above because different
internal priority levels are treated differently since commit "runq:
Switch to 256 levels".

While here, list explicit change restrictions for the realtime and idle
range.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45391
2025-06-17 22:09:37 -04:00
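
The POSIX requirement mentioned in commit dee257c28d can be observed from userland with the standard sched_get_priority_min() and sched_get_priority_max() calls. The helper name below is made up. Note that it only reports the size of the advertised range; whether the scheduler actually treats each value differently is what the runqueue changes in this series address.

```c
#include <sched.h>

/*
 * Return the number of priority values the system advertises for
 * 'policy' (e.g. SCHED_FIFO or SCHED_RR), or -1 on error.
 */
int
rt_priority_levels(int policy)
{
	int lo = sched_get_priority_min(policy);
	int hi = sched_get_priority_max(policy);

	if (lo == -1 || hi == -1)
		return (-1);
	return (hi - lo + 1);
}
```
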
Olivier Certner d710acecc0 runq: Add copyright
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
2025-06-17 22:09:37 -04:00
Olivier Certner 055b5b5f85 runq: Restrict <sys/runq.h> to kernel only
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
2025-06-17 22:09:36 -04:00
Olivier Certner a2d1c3bc2b epoch_test: Assign different priorities using offset 1
Replace the hardcoded 4 (old RQ_PPQ) by 1 (new RQ_PPQ), as all priority
levels are now treated differently.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
2025-06-17 22:09:36 -04:00
Olivier Certner b2a9ee2a72 runq: Remove userland references to RQ_PPQ in rtprio contexts
Concerns only a single test (ptrace_test.c).

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
2025-06-17 22:09:36 -04:00
Olivier Certner e3a4b989d7 runq: Bump __FreeBSD_version after switching to 256 levels
Corresponding to changing RQ_PPQ to 1.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
2025-06-17 22:09:29 -04:00
Olivier Certner af8de65ef2 runq: Switch to 256 levels
This increases the number of levels from 64 to 256, which coincides with
the distinct internal priority values (priority is currently encoded in
a 'u_char', whose range is entirely used).

With this change, we become POSIX-compliant for SCHED_FIFO/SCHED_RR in
that we really provide 32 distinct priority levels for these policies.
Previously, threads in the same "priority group", with priority groups
defined as the threads in consecutive spans of 4 priority levels
starting with level 0 up to 31 (so there are 8 groups), could not
preempt or be preempted by each other even if they were assigned
different priority levels.

See also commit "sched_ule: Use a single runqueue per CPU" for all the
drawbacks that this change also removes.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
2025-06-17 22:08:03 -04:00
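
The "priority group" effect described in commit af8de65ef2 follows from mapping a priority to a queue index by integer division. A minimal sketch, with a hypothetical helper name, counts how many distinct queues a priority range lands in for a given number of priorities per queue:

```c
/*
 * Count the distinct queue indices used by priorities in
 * [first, last] when each queue holds 'ppq' consecutive priorities
 * (queue index = priority / ppq).
 */
int
distinct_queues(int first, int last, int ppq)
{
	int count = 0, prev = -1;

	for (int pri = first; pri <= last; pri++) {
		int idx = pri / ppq;

		if (idx != prev) {
			count++;
			prev = idx;
		}
	}
	return (count);
}
```

With 4 priorities per queue, levels 0 to 31 collapse into 8 queues, matching the 8 groups mentioned above. With 1 priority per queue, all 32 levels stay distinct.
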
Olivier Certner fd141584cf zfs: spa: ZIO_TASKQ_ISSUE: Use symbolic priority
This allows changing the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
2025-06-17 22:08:02 -04:00
Olivier Certner 8ecc419180 Internal scheduling priorities: Always use symbolic ones
Replace priorities specified by a base priority and some hardcoded
offset value by symbolic constants.  Hardcoded offsets prevent changing
the difference between priorities without changing their relative
ordering, and are generally a dangerous practice since the resulting
priority may inadvertently belong to a different selection policy's
range.

Since RQ_PPQ is 4, differences of less than 4 are insignificant, so just
remove them.  These small differences have not been changed for years,
so it is likely they have no real meaning (besides having no practical
effect).  One can still consult the changes history to recover them if
ever needed.

No functional change (intended).

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
2025-06-17 22:08:02 -04:00
Olivier Certner baecdea10e sched_ule: Use a single runqueue per CPU
Previously, ULE would use 3 separate runqueues per CPU to store threads,
one for each of its selection policies, which are realtime, timesharing
and idle.  They would be examined in this order, and the first thread
found would be the one selected.

This choice indeed appears as the easiest evolution from the single
runqueue used by sched_4bsd (4BSD): It allows sharing most of the same
runqueue code, which currently defines 64 levels per runqueue, while
multiplying the number of levels (by 3).  However, it has several
important drawbacks:

1. The number of levels is the same for each selection policy.  64 is
unnecessarily large for the idle policy (only 32 distinct levels would
be necessary, given the 32 levels of our RTP_PRIO_IDLE and their future
aliases in the to-be-introduced SCHED_IDLE POSIX scheduling policy) and
unnecessarily restrictive both for the realtime policy (which should
include 32 distinct levels for PRI_REALTIME, given our implementation of
SCHED_RR/SCHED_FIFO, leaving at most 32 levels for ULE's interactive
processes where the current implementation provisions 48 (perhaps taking
into account the spreading problem, see next point)) and the timesharing
one (88 distinct levels currently provisioned).

2. A runqueue has only 64 distinct levels, and maps priorities in the
range [0;255] to a queue index by just performing a division by 4.
Priorities mapped to the same level are treated exactly the same from
a scheduling perspective, which is generally both unexpected and
incorrect.  ULE's code tries to compensate for this aliasing in the
timesharing selection policy, by spreading the 88 levels into 256,
knowing the latter amounts in the end to only 64 distinct ones.  This
scaling is unfortunately not performed for the other policies, breaking
the expectations mentioned in the previous point about distinct priority
levels.

With this change, only a single runqueue is now used to store all
threads, regardless of the scheduling policy ULE applies to them (going
back to what 4BSD has always been doing).  ULE's 3 selection policies
are assigned non-overlapping ranges of levels, and helper functions have
been created to select or steal a thread in these distinct ranges,
preserving the "circular" queue mechanism for the timesharing selection
policy that (tries to) prevent starvation in the face of permanent
dynamic priority adjustments.

This change allows choosing an arbitrary distribution of runqueue
levels between selection policies.  It is a prerequisite to the increase
to 256 levels per runqueue, which will allow us to dispense with all the
drawbacks listed above.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45389
2025-06-17 22:08:01 -04:00
Olivier Certner fdf31d2747 sched_ule: runq_steal_from(): Suppress first thread special case
This special case was introduced as early as commit "ULE 3.0"
(ae7a6b38d5, r171482, from July 2007).  It caused runq_steal_from() to
ignore the highest-priority thread while stealing.

Its functionality was changed in commit "Rework CPU load balancing in
SCHED_ULE" (36acfc6507, r232207, from February 2012), where the intent
was to keep track of that first thread and return it if no other one was
stealable, instead of returning NULL (no steal).  Some bug prevented it
from working in loaded cases (more than one thread, and all threads but
the first one not stealable), which was subsequently fixed in commit
"sched_ule(4): Fix interactive threads stealing." (bd84094a51, from
September 2021).

All the reasons for this mechanism we could second-guess were dubious at
best.  Jeff Roberson, ULE's main author, says in the differential
revision that "The point was to move threads that are least likely to
benefit from affinity because they are unlikely to run soon enough to
take advantage of it.", to which we responded: "(snip) This may improve
affinity in some cases, but at the same time we don't really know when
the next thread on the queue is to run. Not stealing in this case also
amounts to slightly violating the expected execution ordering and
fairness.".

As this twist doesn't seem to bring any performance improvement in
general, let's just remove it.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45388
2025-06-17 22:08:01 -04:00
Olivier Certner f4be333bc5 sched_ule: Re-implement stealing on top of runq common-code
Stop using internal knowledge of runqueues.  Remove duplicate
boilerplate parts.

Concretely, runq_steal() and runq_steal_from() are now implemented on
top of runq_findq().

Besides considerably simplifying the code, this change also brings an
algorithmic improvement since, previously, set bits in the runqueue's
status words were found by testing each bit individually in a loop
instead of using ffsl()/bsfl() (except for the first set bit per status
word).

This change also makes it more apparent that runq_steal_from() treats
the first thread with highest priority specifically (which runq_steal()
does not).

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45388
2025-06-17 22:08:01 -04:00
Olivier Certner 9c3f4682bb runq: New runq_findq(), common low-level search implementation
That new runq_findq(), based on the implementation of the former
runq_findq_range(), is intended to become the foundation and unique
low-level implementation for all searches in a runqueue.  In addition to
a range of queues' indices, it takes a predicate function, allowing it to:
- Possibly skip a non-empty queue with higher priority (numerically
  lower index) on some criteria.  This is not yet used but will be in
  a subsequent commit revising ULE's stealing machinery.
- Choose a specific thread in the queue, not necessarily the first.
- Return whatever information is deemed necessary.

It helps to remove duplicated boilerplate code, including redundant
assertions, and generally makes things much clearer.  These effects will
be even greater in a subsequent commit modifying ULE to use it.

runq_first_thread_range() replaces the old runq_findq_range() (returns
the first thread of the highest priority queue in the requested range),
and runq_first_thread() the old runq_findq() (same, but considering all
queues).

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:08:00 -04:00
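
The idea of a range search driven by a predicate, as described in commit 9c3f4682bb, can be sketched with a toy model. Everything here is hypothetical: the toy runqueue stores only a thread count per queue, and toy_findq(), toy_pred_t, and the two predicates are invented names, not the kernel's runq_findq() interface. The sketch shows how a predicate lets the caller skip a non-empty queue with higher priority.

```c
#include <stdbool.h>
#include <stddef.h>

#define	NQUEUES	256

/* Toy runqueue: a thread count per queue instead of thread lists. */
struct toy_runq {
	int nthreads[NQUEUES];
};

/* Predicate: return true to accept the queue at 'idx'. */
typedef bool toy_pred_t(int idx, int nthreads, void *data);

/*
 * Return the lowest index in [lo, hi] (bounds included) whose queue
 * is non-empty and accepted by 'pred', or -1 if there is none.
 */
int
toy_findq(const struct toy_runq *rq, int lo, int hi, toy_pred_t *pred,
    void *data)
{
	for (int idx = lo; idx <= hi; idx++) {
		if (rq->nthreads[idx] == 0)
			continue;
		if (pred(idx, rq->nthreads[idx], data))
			return (idx);
	}
	return (-1);
}

/* Accept the first non-empty queue. */
bool
accept_any(int idx, int nthreads, void *data)
{
	(void)idx; (void)nthreads; (void)data;
	return (true);
}

/* Accept only queues holding more than one thread. */
bool
accept_if_multiple(int idx, int nthreads, void *data)
{
	(void)idx; (void)data;
	return (nthreads > 1);
}
```
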
Olivier Certner a31193172c runq: New function runq_is_queue_empty(); Use it in ULE
Indicates if some particular queue of the runqueue is empty.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:08:00 -04:00
Olivier Certner 757bab06fb runq: Tidy up and rename runq_setbit() and runq_clrbit()
Factorize common sub-expressions in a separate helper (runq_sw_apply())
for better readability.

Rename these functions so that the names refer to the use cases rather
than the implementations.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:08:00 -04:00
Olivier Certner de78657a3a runq: runq_check(): Re-implement on top of runq_findq()
Remove one more loop and duplicated code, with the benefit of less
instruction cache pollution at the expense of a few cycles more for the
function calls and computing 'idx' (however, this gives a better
diagnostic message).

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:59 -04:00
Olivier Certner 439dc920f2 runq: Revamp runq_find*(), new runq_findq_range()
Rename existing functions to use the simpler prefix 'runq_findq' instead
of 'runq_findbit' (that they work on top of bit runs is an
implementation detail).

Add runq_findq_range(), which takes a range of indices to operate on
(bounds included).  This is in preparation for changing ULE to use
a single runqueue, since it needs to treat the timesharing range
differently.

Rename runq_findbit_from() to runq_findq_circular(), which is more
descriptive.

To reduce code duplication, have runq_findq() and runq_findq_circular()
leverage runq_findq_range() internally.  For the latter, this also
brings a small algorithmic improvement, since previously the second pass
(from queue 0) would cover the whole runqueue if it was completely
empty, scanning again empty queues after the start index.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:59 -04:00
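
The circular search built on a range search, as described in commit 439dc920f2, can be sketched as below. The types and function names are hypothetical, and for brevity the range search tests one bit at a time, whereas the real code works on whole status words. The second pass covers only the queues before the start index, so queues at or after the start are never scanned twice.

```c
#include <limits.h>

#define	NQ	256
#define	WBITS	((int)(sizeof(unsigned long) * CHAR_BIT))
#define	NWORDS	((NQ + WBITS - 1) / WBITS)

/* Toy status bitmap: bit set means the queue is non-empty. */
struct toy_status {
	unsigned long sw[NWORDS];
};

void
toy_set(struct toy_status *s, int idx)
{
	s->sw[idx / WBITS] |= 1UL << (idx % WBITS);
}

/* First non-empty queue index in [lo, hi] (bounds included), or -1. */
int
toy_find_range(const struct toy_status *s, int lo, int hi)
{
	for (int idx = lo; idx <= hi; idx++) {
		if ((s->sw[idx / WBITS] & (1UL << (idx % WBITS))) != 0)
			return (idx);
	}
	return (-1);
}

/*
 * Circular search: look in [start, NQ - 1] first, then wrap around
 * and look in [0, start - 1] only.
 */
int
toy_find_circular(const struct toy_status *s, int start)
{
	int idx = toy_find_range(s, start, NQ - 1);

	if (idx != -1 || start == 0)
		return (idx);
	return (toy_find_range(s, 0, start - 1));
}
```
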
Olivier Certner 200fc93dac runq: Re-order functions more logically
No code change in moved functions.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:59 -04:00
Olivier Certner 7e2502e3de runq: More macros; Better and more consistent naming
Most existing macros have ambiguous names regarding which index they
operate on (queue, word, bit?), so have been renamed to improve clarity.
Use the 'RQSW_' prefix for all macros related to status words, and
change the status word type name accordingly.

Rename RQB_FFS() to RQSW_BSF() to remove confusion about the return
value (ffs*() return bit indices starting at 1, or 0 if the input is 0,
whereas BSF on x86 returns 0-based indices, which is what the current
code assumes).  While here, add a check (under INVARIANTS) that
RQSW_BSF() isn't called with 0 as an argument.
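
The difference between the two conventions can be sketched as follows.
This is a portable illustration only; my_ffs() and my_bsf() are
hypothetical helpers, not the kernel macros, and the assert() merely
stands in for the INVARIANTS check.

```c
#include <assert.h>

/* ffs()-style: 1-based index of the lowest set bit, 0 if no bit is set. */
static int
my_ffs(unsigned long w)
{
	int i;

	if (w == 0)
		return (0);
	for (i = 1; (w & 1) == 0; i++)
		w >>= 1;
	return (i);
}

/* BSF-style: 0-based index of the lowest set bit; input must be non-zero. */
static int
my_bsf(unsigned long w)
{
	assert(w != 0);
	return (my_ffs(w) - 1);
}
```

For the word 0x8, my_ffs() yields 4 whereas my_bsf() yields 3; only the
latter can be used directly as a bit index.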

Also, rename 'rqb_bits_t' to the more concise 'rqsw_t', 'struct rqbits'
to 'struct rq_status', its 'rqb_bits' field to 'rq_sw' (it designates an
array of words, not bits), and the type 'rqhead' to 'rq_queue'.

Add macros computing a queue index from a status word index and a bit in
order to factorize code.  If the precise index of the bit is known,
callers can use RQSW_TO_QUEUE_IDX() to get the corresponding queue
index, whereas if they want the one corresponding to the first
(least-significant) set bit in a given status word (corresponding to
the non-empty queue with the lowest index in the status word), they can use
RQSW_FIRST_QUEUE_IDX() instead.

Add RQSW_BIT_IDX(), which computes the corresponding bit's index in the
corresponding status word.  This allows more code factorization (even if
most uses will be eliminated in a later commit) and makes what is
computed clearer.
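
The index arithmetic these macros factor out might look like the
following sketch.  It assumes 64-bit status words, and every name in it
(sw_t, SW_BPW, the SW_* and QUEUE_TO_* macros, first_queue_idx()) is
hypothetical; the actual definitions in <sys/runq.h> may differ.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sw_t;			/* hypothetical status word type */
#define SW_BPW			64	/* bits per status word (assumed) */

/* Status word holding the bit for queue 'q', and that bit's index in it. */
#define QUEUE_TO_WORD_IDX(q)	((q) / SW_BPW)
#define QUEUE_TO_BIT_IDX(q)	((q) % SW_BPW)

/* Queue index corresponding to bit 'b' of status word number 'w'. */
#define SW_TO_QUEUE_IDX(w, b)	((w) * SW_BPW + (b))

/*
 * Queue index of the lowest set bit in status word number 'word_idx',
 * i.e., the non-empty queue with the lowest index tracked by that word.
 */
static int
first_queue_idx(int word_idx, sw_t word)
{
	int bit = 0;

	assert(word != 0);
	while (((word >> bit) & 1) == 0)
		bit++;
	return (SW_TO_QUEUE_IDX(word_idx, bit));
}
```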

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:58 -04:00
Olivier Certner 57540a0666 runq: Clarity and style pass
In runq_choose() and runq_choose_fuzz(), replace an unnecessary 'while'
with an 'if', and separate assignment and test of 'idx' into two lines.

Add missing parentheses to one 'sizeof' operator.

Remove superfluous brackets for one-line "then" and "else" branches (to
match style elsewhere in the file).

Declare loop indices in their 'for'.

Test for non-empty bit sets with an explicit '!= 0'.

Move TABs in some prototypes of <sys/runq.h> (a TAB should not split the
return type specifier, but instead separate the type specifier from the
function declarator).

No functional change intended.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:58 -04:00
Olivier Certner a11926f2a5 runq: API tidy up: 'pri' => 'idx', 'idx' as int, remove runq_remove_idx()
Make sure that external and internal users are aware that the runqueue
API always expects queue indices, and not priority levels.  Name
arithmetic arguments in 'runq.h' for better immediate reference.

Use plain integers to pass indices instead of 'u_char' (using the latter
probably doesn't bring any gain, and an 'int' makes the API agnostic to
a number of queues greater than 256).  Add a static assertion that
RQ_NQS can't be strictly greater than 256 as long as the 'td_rqindex'
thread field is of type 'u_char'.

Add a new macro CHECK_IDX() that checks that an index is non-negative
and below RQ_NQS, and use it in all low-level functions (and "public"
ones when they don't need to call the former).

While here, remove runq_remove_idx(), as it knows a bit too much of
ULE's internals, in particular by treating the whole runqueue as
round-robin, which we are going to change.  Instead, have runq_remove()
return whether the queue from which the thread was removed is now empty,
and leverage this information in tdq_runq_rem() (sched_ule(4)).

While here, re-implement runq_add() on top of runq_add_idx() to remove
its duplicated code (all lines except one).  Introduce the new
RQ_PRI_TO_IDX() macro to convert a priority to a queue index, and use it
in runq_add() (many more uses will be introduced in later commits).
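
The index check and the priority-to-index conversion can be sketched as
below.  The constants NQS and PPQ and the names idx_valid() and
PRI_TO_IDX() are hypothetical, and dividing by the number of priorities
per queue is an assumption about how the conversion works, not something
this message states.

```c
#include <assert.h>
#include <stdbool.h>

#define NQS	64	/* hypothetical number of queues */
#define PPQ	4	/* hypothetical number of priority levels per queue */

/* Assumed conversion: consecutive groups of PPQ priorities share a queue. */
#define PRI_TO_IDX(pri)	((pri) / PPQ)

/* True if 'idx' is a usable queue index: non-negative and below NQS. */
static bool
idx_valid(int idx)
{
	return (idx >= 0 && idx < NQS);
}

/* Convert a priority to a queue index, checking the result. */
static int
pri_to_checked_idx(int pri)
{
	int idx = PRI_TO_IDX(pri);

	assert(idx_valid(idx));
	return (idx);
}
```

Taking the index as an 'int' rather than a 'u_char' is what lets such
a check also catch negative values.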

While here, rename runq_check() to runq_not_empty() and have it return
a boolean instead of an 'int'.  As a consequence, do the same for
sched_runnable() (and while here, fix a small style violation in
sched_4bsd(4)'s version).

While here, simplify sched_runnable().

While here, make <sys/sched.h> standalone include-wise.

No functional change intended.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:57 -04:00
Olivier Certner 28b54827f5 runq: Hide function prototypes under _KERNEL
And some structure definitions as well.

This header really is not supposed to be included by userland, so it
should just error in this case.  However, there is one remaining use for it in
a test: Getting the value of RQ_PPQ to ensure a big enough priority
level difference in order to guarantee that a realtime thread preempts
another.  This use will soon be obsoleted by guaranteeing that
a realtime thread always preempts another one with lower priority, even
if the priority level is very close.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:57 -04:00
Olivier Certner c21c24adde runq: More selective includes of <sys/runq.h> to reduce pollution
<sys/proc.h> doesn't need <sys/runq.h>.  Remove this include and add it
back for kernel files that relied on the pollution.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
2025-06-17 22:07:57 -04:00