The practical scenario that leads to this is porch(1) spawning some
utility and sending it a SIGSTOP as a debugging aid. The user then
attaches a debugger and walks through how some specific input is
processed, then detaches to allow the script to continue. When ptrace
is detached, the process resumes execution but the parent is never
notified and may be stuck in wait(2) for it to continue or terminate.
Other platforms seem to re-suspend the process after the debugger is
detached, but neither behavior seems unreasonable. Just notifying the
parent that the child has resumed is a relatively low-risk departure
from our current behavior and had apparently been considered in the
past, based on pre-existing comments.
Move p_flag and p_xsig handling into childproc_continued(), as just
sending the SIGCHLD here isn't really useful without P_CONTINUED set
and the other caller already sets these up as well.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D50917
The kern.osreldate sysctl reports the kernel version, not a release
date. Also correct a comment about /usr/include/osreldate.h.
Reviewed by: kp, olce
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50938
In ieee80211_sta_join() there are currently two ways to set
"do_ht": (1) after checking HT IEs are avail, and (2) after
checking VHT IEs are avail and we are not on 2GHz.
In the latter case no one checks that HT IEs are available and
when we hit ieee80211_ht_updateparams_final() htinfo may be NULL
and we panic.
Avoid this by only checking for VHT if do_ht was set.
No VHT without HT IEs.
While here switch do_ht to be a bool.
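The resulting decision can be sketched as follows. This is a simplified
model, not the net80211 code: the structures and toy_pick_modes() are
hypothetical stand-ins, and only the ordering of the checks is meant to
match the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the IEs found in a scan entry. */
struct toy_scan_ies {
	const void *htcap;	/* HT capabilities IE, or NULL */
	const void *htinfo;	/* HT information IE, or NULL */
	const void *vhtcap;	/* VHT capabilities IE, or NULL */
	const void *vhtopmode;	/* VHT operation IE, or NULL */
};

struct toy_join_modes {
	bool do_ht;
	bool do_vht;
};

static struct toy_join_modes
toy_pick_modes(const struct toy_scan_ies *ies, bool is_2ghz)
{
	struct toy_join_modes m = { false, false };

	if (ies->htcap != NULL && ies->htinfo != NULL)
		m.do_ht = true;
	/* VHT is only considered once HT is known to be usable. */
	if (m.do_ht && ies->vhtcap != NULL && ies->vhtopmode != NULL &&
	    !is_2ghz)
		m.do_vht = true;
	return (m);
}
```

With this ordering, a peer advertising VHT IEs but no HT IEs ends up
with neither mode enabled, so the HT code is never entered with a NULL
htinfo.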
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
PR: 287625
Fixes: 51172f62a7
Reviewed by: adrian
Differential Revision: https://reviews.freebsd.org/D50923
In commit ae10431c98 ("vm_page: Allow PG_NOFREE pages to be freed"), I
changed the v_nofree_count counter to instead count the size of the
nofree queue, on the basis that with the ability to free nofree pages,
the size of the queue is unbounded.
The use of a counter(9) for this purpose is not really correct, as early
initialization of per-CPU counters interferes with precise accounting
that we want here. Instead, add a global tracker for this purpose,
expose it elsewhere in the sysctl tree, and restore v_nofree_count's
original use as a counter of allocated nofree pages.
Reviewed by: bnovkov, alc, kib
Reported by: alc
Fixes: ae10431c98 ("vm_page: Allow PG_NOFREE pages to be freed")
Differential Revision: https://reviews.freebsd.org/D50877
We should never "log" a statement on no match for a given device we
do not know about. We do not control the PCI ID assignments and thus
cannot predict if we would even support such a device.
This also triggers an invalid output in the installer.
Leave it as log_verbose for now.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
PR: 287639
Reviewed by: manu, emaste
Differential Revision: https://reviews.freebsd.org/D50916
Use the counter(9) KPI instead of atomics to maintain the L2 superpage
mapping counts. (A similar change was made to the amd64 pmap in 2021.)
While here, update the SYSCTL descriptions to reflect the possibility
that the base page size is 16KB.
FreeBSD has a fair number of tracing facilities. The new tracing(7)
manual page aims to provide a starting point for users to learn about
what is available.
Reviewed by: christos, bnovkov, markj, ziaee
Approved by: christos (mentor), bnovkov (mentor), markj (mentor)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D50854
"pass" is in several places, but should be a #define. Make it one. This
also lets folks with particular needs who copy this driver reduce their
diffs.
Sponsored by: Netflix
We can still get plenty of use out of a userboot that doesn't know
anything about how to load or boot a kernel; notably, the test harness
in tools/boot can still be used to test lua changes.
Hack out the necessary bits to simply build on other platforms, and add
a small warning with ample time to view the warning on other platforms.
We still won't build userboot by default on these platforms, since the
build product isn't useful for most people.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D41529
AUDIT_SYSCALL_EXIT(), and indirectly audit_commit(), are intended to be
called from arbitrary top-level context. This means that any sleepable
locks can be owned by the caller, which makes sleeping in
audit_commit() forbidden.
Since we need to sleep for the record in audit_alloc() anyway, move the
sleep for the queue limit there. At worst, if audit is suspended or
disabled by the time we actually reach the commit location, this means
that we lost time uselessly.
PR: 287566
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D50879
The compiler would report a mismatch between the format and the actual
type of the runqueue status word because the latter is now
unconditionally defined as an 'unsigned long' (which has the "natural"
platform size) and the format expects a 'size_t', which expands to an
'unsigned int' on 32-bit platforms (although they are both of the same
actual size).
This worked before as the C type used depended on the architecture and
was set to 'uint32_t' aka 'unsigned int' on these 32-bit platforms.
Just fix the format (use 'l'). While here, remove outputting '0x' by
hand, instead relying on '#' (only difference is for 0, and is fine).
runq_print() should be moved out of 'sched_ule.c' in a subsequent
commit.
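For reference, the effect of the '#' flag with the 'l' length modifier
can be checked in userland. format_status_word() is a hypothetical
helper written for this note, not the code in runq_print().

```c
#include <stdio.h>

/* Hypothetical helper: format a status word with the corrected format. */
static int
format_status_word(char *buf, size_t len, unsigned long word)
{
	/* 'l' matches 'unsigned long'; '#' adds "0x" except for zero. */
	return (snprintf(buf, len, "%#lx", word));
}
```

A word of 0x2a is rendered as "0x2a", while a zero word is rendered as
"0" rather than "0x0", which is the only difference mentioned above.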
Reported by: Jenkins
Fixes: 79d8a99ee583 ("runq: Deduce most parameters, remove machine headers")
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
While these should be moved to the new format, it wasn't my intention
to force them over immediately. Downstreams may embed their own brands
in drawer.lua, and we shouldn't break them for something like this.
Move adapt_fb_shim() up and use it for preloaded definitions to avoid
forcing the matter for now. Perhaps in the future we'll start writing
out warnings for those that do need to be adapted.
Reported by: 0x1eef on IRC
There is an initialization warning where error may not be set when logging
extended BBlogs. Let's fix this by initializing error to zero so that we
won't have a warning.
Change the origin from PZERO to PUSER.
Doing so allows users to immediately detect if some thread is running
under a high priority (kernel or realtime) or under a low one
(timesharing or idle).
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
The current formula wastes queues and degrades usage estimation
precision, since any increase of ticks that goes over 40 priorities (so,
8 * 40) is clamped to the last of these 40 levels (the nice value is
subsequently added to that number to get the final priority level).
Allow 'ts_estcpu' to grow up to a value corresponding to the greatest
(i.e., lowest) priority of the timeshare range.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45392
Subtracting RQ_PPQ from the maximum number of allowed priority values
(the factor to INVERSE_ESTCPU_WEIGHT) has the effect of pessimizing the
number of processes assigned to the last priority bucket.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45392
No functional change (intended).
Also makes the comment about INVERSE_ESTCPU_WEIGHT() adjacent to its
definition.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45392
Computation of %CPU in sched_pctcpu() was overly complicated, wrong in
the case of a non-maximal window (10 seconds span; this is always the
case in practice as the window would oscillate between 10 and 11 seconds
for continuously running processes) and performed unshifted for the
first part, essentially losing precision (up to 9% for SCHED_TICK_SECS
being 10), and with some ineffective shift for the second part.
Conserve maximum precision by only shifting by the required amount to
attain FSHIFT before dividing. Apply classical rounding to nearest
instead of rounding down.
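The shift-then-divide idea with rounding to nearest can be sketched in
isolation. This is a simplified model and not the code in
sched_pctcpu(): the TOY_FSHIFT value of 11 and the function name are
assumptions, and the inputs here are plain tick counts rather than
already-shifted fields.

```c
#include <stdint.h>

#define TOY_FSHIFT	11			/* assumed fixed-point shift */
#define TOY_FSCALE	(1u << TOY_FSHIFT)

/*
 * Fraction run/len as a fixed-point value scaled by TOY_FSCALE, rounded to
 * nearest.  'len' must be non-zero and 'run' must not exceed 'len'.
 */
static uint32_t
toy_pctcpu(uint32_t run, uint32_t len)
{
	/* Shift first, in 64 bits, so that no precision is lost. */
	uint64_t scaled = (uint64_t)run << TOY_FSHIFT;

	/* Adding len / 2 before dividing rounds to nearest. */
	return ((uint32_t)((scaled + len / 2) / len));
}
```

For example, 1 tick out of 3 gives 683 (2048 / 3 is about 682.67),
where rounding down would give 682.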
To generally avoid wraparound problems with tick fields in 'struct
td_sched' (as already happened once in sched_pctcpu_update()), make them
all unsigned, and ensure 'ticks' is always converted to some 'u_int'.
While here, fix SCHED_AFFINITY().
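The reason unsigned tick fields avoid wraparound problems is that
unsigned subtraction is well defined modulo the type's range. A minimal
sketch, with a made-up helper name:

```c
/* Ticks elapsed from 'then' to 'now', correct across one wraparound. */
static unsigned int
toy_ticks_elapsed(unsigned int now, unsigned int then)
{
	/* Unsigned subtraction is defined modulo UINT_MAX + 1. */
	return (now - then);
}
```

If 'then' was sampled 5 ticks before the counter wrapped and 'now' is 5,
the result is 10, whereas the same subtraction on signed values would
rely on overflow behavior that C leaves undefined.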
Rewrite sched_pctcpu_update() while keeping the existing formulas:
- Fix the hole in the cliff case that in theory 'ts_ticks' can become
greater than the window size if a running thread has not been
accounted for too long (today cannot happen because of sched_clock()).
- Make the decay ratio explicit and configurable (SCHED_CPU_DECAY_NUMER,
SCHED_CPU_DECAY_DENOM). Set it to the current value (10/11),
currently producing a 95% attenuation after about 32s. This eases
experimenting with changing it. Apply the ratio on shifted ticks for
better precision, independently of the chosen value for
SCHED_TICK_MAX/SCHED_TICK_SECS.
- Remove redundant SCHED_TICK_TARG. Compute SCHED_TICK_MAX from
SCHED_TICK_SECS, the latter now really specifying the maximum size of
the %CPU estimation window.
- Ensure it is immune to varying 'hz' (which today can't happen), so
that after computation SCHED_TICK_RUN(ts) is mathematically guaranteed
lower than SCHED_TICK_LENGTH(ts).
- Thoroughly explain the current formula, and mention its main drawback
(it is completely dependent on the frequency of calls to
sched_pctcpu_update(), which currently manifests itself for sleeping
threads).
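The 32s figure quoted above can be reproduced with a short calculation,
under the assumption (made for this sketch only) that the 10/11 ratio
is applied once per second. The macro and function names are
hypothetical.

```c
#define TOY_DECAY_NUMER	10
#define TOY_DECAY_DENOM	11

/* Number of decay steps needed to fall to at most 'target' of the start. */
static int
toy_decay_steps(double target)
{
	double remaining = 1.0;
	int steps = 0;

	while (remaining > target) {
		remaining = remaining * TOY_DECAY_NUMER / TOY_DECAY_DENOM;
		steps++;
	}
	return (steps);
}
```

Asking for a 95% attenuation, that is toy_decay_steps(0.05), takes 32
steps, since (10/11)^31 is still about 0.052.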
Rework sched_priority():
- Ensure 'p_nice' is read only once, to be immune to a concurrent
change.
- Clearly show that the computed priority is the sum of 3 components.
Make them all positive by shifting the starting priority and shifting
the nice value in SCHED_PRI_NICE().
- Compute the priority offset deriving from the %CPU with rounding to
nearest.
- Much more informative KASSERT() output with details regarding the
priority computation.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D46567
Justification for this change is to avoid disturbing ULE's behavior too
much at this time. We however acknowledge that the effect of "nice"
values is extremely weak and will most probably change it going forward.
This tuning mostly recovers ULE's behavior prior to the switch to
a single 256-queue runqueue and the increase of the timesharing priority
levels' range.
After this change, in a series of tests involving two long-running
processes with varying nice values competing for the same CPU, we
observe that the used CPU time ratios of the highest-priority process
change by at most 1.15% and on average by 0.46% (absolute differences).
In relative differences, they change by at most 2% and on average by
0.78%.
In order to preserve these ratios, as the number of priority levels
allotted to timesharing has been raised from 136 to 168 (and the subsets
of them dedicated to either interactive or batch threads scaled
accordingly), we keep the ratio of levels reserved to handle nice values
to those reserved for CPU usage by applying a factor of 5/4 (which is
close to 168/136).
Time-based advance of the timesharing circular queue's head is ULE's
main fairness and anti-starvation mechanism. The higher number of
queues subject to the timesharing scheduling policy is now compensated
by allowing a greater increment of the head offset per tick. Because
there are now 109 queue levels dedicated to the timesharing scheduling
policy (in contrast with the 168 levels allotted to timesharing levels,
which include the former but also those dedicated to threads considered
interactive) whereas there previously were 64 ones (priorities spread
into a single, separate runqueue), we advance the circular queue's head
7/4 faster (a ratio close to 109/64).
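How close the two rounded factors are to the exact ratios can be checked
with integer arithmetic. The helper below is only an illustration and
its name is made up.

```c
/* Ratio num/den in thousandths, truncated. */
static unsigned int
toy_ratio_permille(unsigned int num, unsigned int den)
{
	return (num * 1000u / den);
}
```

168/136 comes out at 1235 thousandths against 1250 for 5/4, and 109/64
at 1703 thousandths against 1750 for 7/4.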
While here, take into account 'cnt' as the number of ticks when
advancing the circular queue's head. This fix depends on the other code
changes enabling incrementation by more than one.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D46566
Now that a difference of 1 in priority level is significant, we can
shrink the priority range reserved for kernel threads.
Only four distinct levels are necessary for the bottom half (3 base
levels and arguably an additional one for demoted interrupt threads that
run for full time slices so that they finally don't compete with other
ones). To leave room for other possible uses, we settle on 8 levels.
Given the symbolic constants for the top half, 10 levels are currently
necessary. We settle on 16 levels.
This allows enlarging the timesharing range, which covers both ULE's
interactive and batch ranges, to 168 distinct levels from less than 64
ones for ULE (as of before the changes to make it use a single runqueue
and have 256 distinct levels per runqueue) and 34 ones for 4BSD.
While here, note that the realtime range is required to have at least 32
priority levels since:
- POSIX mandates at least 32 distinct levels for the SCHED_RR/SCHED_FIFO
scheduling policies.
- We directly map contiguous priority levels ('sched_priority') of these
scheduling policies to distinct, contiguous internal priority levels.
Conversely, having at least 32 priority levels is enough to guarantee
compliance to the POSIX requirement mentioned above because different
internal priority levels are treated differently since commit "runq:
Switch to 256 levels".
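From userland, the advertised size of the range can be queried with the
standard POSIX interfaces. Note that this only reports how many levels
are offered, not whether the scheduler really treats them as distinct,
which is what the internal mapping guarantees. toy_policy_levels() is a
name made up for this sketch.

```c
#include <sched.h>

/* Number of priority levels the system reports for 'policy', or -1. */
static int
toy_policy_levels(int policy)
{
	int lo = sched_get_priority_min(policy);
	int hi = sched_get_priority_max(policy);

	if (lo == -1 || hi == -1)
		return (-1);
	return (hi - lo + 1);
}
```

A compliant system is expected to report at least 32 levels for both
SCHED_FIFO and SCHED_RR.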
While here, list explicit change restrictions for the realtime and idle
range.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45391
Replace the hardcoded 4 (old RQ_PPQ) by 1 (new RQ_PPQ), as all priority
levels are now treated differently.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Concerns only a single test (ptrace_test.c).
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45390
This increases the number of levels from 64 to 256, which coincides with
the distinct internal priority values (priority is currently encoded in
a 'u_char', whose range is entirely used).
With this change, we become POSIX-compliant for SCHED_FIFO/SCHED_RR in
that we really provide 32 distinct priority levels for these policies.
Previously, threads in the same "priority group", with priority groups
defined as the threads in consecutive spans of 4 priority levels
starting with level 0 up to 31 (so there are 8 groups), could not
preempt or be preempted by each other even if they were assigned
different priority levels.
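The aliasing inside a priority group follows directly from the integer
division used to pick a queue. The sketch below models it with levels
relative to the start of the range; the names are hypothetical and the
divisors are the old and new numbers of priorities per queue.

```c
#define OLD_RQ_PPQ	4	/* priorities per queue before the change */
#define NEW_RQ_PPQ	1	/* priorities per queue after the change */

static int
toy_pri_to_idx(int pri, int ppq)
{
	return (pri / ppq);
}

/* True if two priorities land in the same queue and cannot preempt. */
static int
toy_same_queue(int pri_a, int pri_b, int ppq)
{
	return (toy_pri_to_idx(pri_a, ppq) == toy_pri_to_idx(pri_b, ppq));
}
```

With 4 priorities per queue, levels 8 and 11 share queue 2 and are
indistinguishable; with 1 priority per queue, they map to different
queues.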
See also commit "sched_ule: Use a single runqueue per CPU" for all the
drawbacks that this change also removes.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45390
This allows changing the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45390
Replace priorities specified by a base priority and some hardcoded
offset value by symbolic constants. Hardcoded offsets prevent changing
the difference between priorities without changing their relative
ordering, and are generally a dangerous practice since the resulting
priority may inadvertently belong to a different selection policy's
range.
Since RQ_PPQ is 4, differences of less than 4 are insignificant, so just
remove them. These small differences have not been changed for years,
so it is likely they have no real meaning (besides having no practical
effect). One can still consult the changes history to recover them if
ever needed.
No functional change (intended).
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45390
Previously, ULE would use 3 separate runqueues per CPU to store threads,
one for each of its selection policies, which are realtime, timesharing
and idle. They would be examined in this order, and the first thread
found would be the one selected.
This choice indeed appears as the easiest evolution from the single
runqueue used by sched_4bsd (4BSD): It allows sharing most of the same
runqueue code, which currently defines 64 levels per runqueue, while
multiplying the number of levels (by 3). However, it has several
important drawbacks:
1. The number of levels is the same for each selection policy. 64 is
unnecessarily large for the idle policy (only 32 distinct levels would
be necessary, given the 32 levels of our RTP_PRIO_IDLE and their future
aliases in the to-be-introduced SCHED_IDLE POSIX scheduling policy) and
unnecessarily restrictive both for the realtime policy (which should
include 32 distinct levels for PRI_REALTIME, given our implementation of
SCHED_RR/SCHED_FIFO, leaving at most 32 levels for ULE's interactive
processes where the current implementation provisions 48 (perhaps taking
into account the spreading problem, see next point)) and the timesharing
one (88 distinct levels currently provisioned).
2. A runqueue has only 64 distinct levels, and maps priorities in the
range [0;255] to a queue index by just performing a division by 4.
Priorities mapped to the same level are treated exactly the same from
a scheduling perspective, which is generally both unexpected and
incorrect. ULE's code tries to compensate for this aliasing in the
timesharing selection policy, by spreading the 88 levels into 256,
knowing the latter amount in the end to only 64 distinct ones. This
scaling is unfortunately not performed for the other policies, breaking
the expectations mentioned in the previous point about distinct priority
levels.
With this change, only a single runqueue is now used to store all
threads, regardless of the scheduling policy ULE applies to them (going
back to what 4BSD has always been doing). ULE's 3 selection policies
are assigned non-overlapping ranges of levels, and helper functions have
been created to select or steal a thread in these distinct ranges,
preserving the "circular" queue mechanism for the timesharing selection
policy that (tries to) prevent starvation in the face of permanent
dynamic priority adjustments.
This change allows choosing an arbitrary distribution of runqueue
levels among selection policies. It is a prerequisite to the increase
to 256 levels per runqueue, which will eliminate all the drawbacks
listed above.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45389
This special case was introduced as early as commit "ULE 3.0"
(ae7a6b38d5, r171482, from July 2007). It caused runq_steal_from() to
ignore the highest-priority thread while stealing.
Its functionality was changed in commit "Rework CPU load balancing in
SCHED_ULE" (36acfc6507, r232207, from February 2012), where the intent
was to keep track of that first thread and return it if no other one was
stealable, instead of returning NULL (no steal). Some bug prevented it
from working in loaded cases (more than one thread, and all threads but
the first one not stealable), which was subsequently fixed in commit
"sched_ule(4): Fix interactive threads stealing." (bd84094a51, from
September 2021).
All the reasons for this mechanism we could second-guess were dubious at
best. Jeff Roberson, ULE's main author, says in the differential
revision that "The point was to move threads that are least likely to
benefit from affinity because they are unlikely to run soon enough to
take advantage of it.", to which we responded: "(snip) This may improve
affinity in some cases, but at the same time we don't really know when
the next thread on the queue is to run. Not stealing in this case also
amounts to slightly violating the expected execution ordering and
fairness.".
As this twist doesn't seem to bring any performance improvement in
general, let's just remove it.
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45388
Stop using internal knowledge of runqueues. Remove duplicate
boilerplate parts.
Concretely, runq_steal() and runq_steal_from() are now implemented on
top of runq_findq().
Besides considerably simplifying the code, this change also brings an
algorithmic improvement since, previously, set bits in the runqueue's
status words were found by testing each bit individually in a loop
instead of using ffsl()/bsfl() (except for the first set bit per status
word).
This change also makes it more apparent that runq_steal_from() treats
the first thread with highest priority specifically (which runq_steal()
does not).
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45388
That new runq_findq(), based on the implementation of the former
runq_findq_range(), is intended to become the foundation and unique
low-level implementation for all searches in a runqueue. In addition to
a range of queues' indices, it takes a predicate function, which can:
- Possibly skip a non-empty queue with higher priority (numerically
lower index) on some criteria. This is not yet used but will be in
a subsequent commit revising ULE's stealing machinery.
- Choose a specific thread in the queue, not necessarily the first.
- Return whatever information is deemed necessary.
It helps to remove duplicated boilerplate code, including redundant
assertions, and generally makes things much clearer. These effects will
be even greater in a subsequent commit modifying ULE to use it.
runq_first_thread_range() replaces the old runq_findq_range() (returns
the first thread of the highest priority queue in the requested range),
and runq_first_thread() the old runq_findq() (same, but considering all
queues).
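The shape of a predicate-driven search can be sketched as follows. This
is a toy model and not the kernel interface: the types, names and
signature are all assumptions, and emptiness is tested with a simple
length field where the real code relies on the status words.

```c
#include <stdbool.h>
#include <stddef.h>

#define PQ_NQS	8

/* Hypothetical queue: a list of thread ids in queue order. */
struct pq_queue {
	const int *tids;
	size_t n;
};

/* Predicate: return true to stop the search at this queue. */
typedef bool pq_pred_t(int idx, const struct pq_queue *q, void *data);

/* Scan queues lvl_min..lvl_max (inclusive); return index found or -1. */
static int
pq_findq(const struct pq_queue *qs, int lvl_min, int lvl_max,
    pq_pred_t *pred, void *data)
{
	for (int idx = lvl_min; idx <= lvl_max; idx++) {
		if (qs[idx].n == 0)
			continue;	/* empty queues are never offered */
		if (pred(idx, &qs[idx], data))
			return (idx);
	}
	return (-1);
}

/* Example predicate: pick the first thread with an even id. */
static bool
pq_pick_even(int idx, const struct pq_queue *q, void *data)
{
	(void)idx;
	for (size_t i = 0; i < q->n; i++) {
		if (q->tids[i] % 2 == 0) {
			*(int *)data = q->tids[i];
			return (true);
		}
	}
	return (false);
}
```

The example predicate exercises the three abilities listed above: it
rejects a non-empty queue holding only odd ids, picks a thread that is
not necessarily first in its queue, and reports the chosen id through
the opaque 'data' pointer.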
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
Indicates if some particular queue of the runqueue is empty.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
Factorize common sub-expressions in a separate helper (runq_sw_apply())
for better readability.
Rename these functions so that the names refer to the use cases rather
than the implementations.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
Remove one more loop and duplicated code, with the benefit of less
instruction cache pollution at the expense of a few cycles more for the
function calls and computing 'idx' (however, this gives a better
diagnostic message).
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
Rename existing functions to use the simpler prefix 'runq_findq' instead
of 'runq_findbit' (that they work on top of bit runs is an
implementation detail).
Add runq_findq_range(), which takes a range of indices to operate on
(bounds included). This is in preparation for changing ULE to use
a single runqueue, since it needs to treat the timesharing range
differently.
Rename runq_findbit_from() to runq_findq_circular(), which is more
descriptive.
To reduce code duplication, have runq_findq() and runq_findq_circular()
leverage runq_findq_range() internally. For the latter, this also
brings a small algorithmic improvement, since previously the second pass
(from queue 0) would cover the whole runqueue if it was completely
empty, scanning again empty queues after the start index.
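The layering of the circular search on top of the range search can be
sketched like this. It is a toy model with made-up names; for brevity
it tests bits one at a time, where the real code uses bit-scan
operations on whole status words.

```c
#include <limits.h>

#define CR_NQS		256
#define CR_BPW		((int)(sizeof(unsigned long) * CHAR_BIT))
#define CR_NW		(CR_NQS / CR_BPW)

struct cr_runq {
	unsigned long sw[CR_NW];	/* bit set: queue is non-empty */
};

static void
cr_set_queue(struct cr_runq *rq, int idx)
{
	rq->sw[idx / CR_BPW] |= 1ul << (idx % CR_BPW);
}

/* First non-empty queue in [lvl_min, lvl_max] (inclusive), or -1. */
static int
cr_findq_range(const struct cr_runq *rq, int lvl_min, int lvl_max)
{
	for (int idx = lvl_min; idx <= lvl_max; idx++)
		if ((rq->sw[idx / CR_BPW] & (1ul << (idx % CR_BPW))) != 0)
			return (idx);
	return (-1);
}

/* Search from 'start' to the end, then wrap to [0, start - 1] only. */
static int
cr_findq_circular(const struct cr_runq *rq, int start)
{
	int idx = cr_findq_range(rq, start, CR_NQS - 1);

	if (idx == -1 && start > 0)
		idx = cr_findq_range(rq, 0, start - 1);
	return (idx);
}
```

Bounding the second pass at 'start - 1' is what avoids rescanning the
queues already found empty in the first pass.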
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
No code change in moved functions.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
Most existing macros have ambiguous names regarding which index they
operate on (queue, word, bit?), so have been renamed to improve clarity.
Use the 'RQSW_' prefix for all macros related to status words, and
change the status word type name accordingly.
Rename RQB_FFS() to RQSW_BSF() to remove confusion about the return
value (ffs*() return bit indices starting at 1, or 0 if the input is 0,
whereas BSF on x86 returns 0-based indices, which is what the current
code assumes). While here, add a check (under INVARIANTS) that
RQSW_BSF() isn't called with 0 as an argument.
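The off-by-one between the two conventions is easy to see with the
POSIX ffs() function. toy_bsf() below is a hypothetical userland
analogue of a 0-based bit scan, with assert() standing in for the
INVARIANTS check.

```c
#include <assert.h>
#include <strings.h>

/* 0-based index of the least-significant set bit; 'word' must be non-zero. */
static int
toy_bsf(int word)
{
	assert(word != 0);	/* a zero input has no set bit to report */
	return (ffs(word) - 1);	/* ffs() is 1-based and returns 0 for 0 */
}
```

For the value 0x8, ffs() returns 4 while the 0-based index is 3.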
Also, rename 'rqb_bits_t' to the more concise 'rqsw_t', 'struct rqbits'
to 'struct rq_status', its 'rqb_bits' field to 'rq_sw' (it designates an
array of words, not bits), and the type 'rqhead' to 'rq_queue'.
Add macros computing a queue index from a status word index and a bit in
order to factorize code. If the precise index of the bit is known,
callers can use RQSW_TO_QUEUE_IDX() to get the corresponding queue
index, whereas if they want the one corresponding to the first
(least-significant) set bit in a given status word (corresponding to
the non-empty queue with the lowest index in the status word), they can use
RQSW_FIRST_QUEUE_IDX() instead.
Add RQSW_BIT_IDX(), which computes the corresponding bit's index in the
corresponding status word. This allows more code factorization (even if
most uses will be eliminated in a later commit) and makes what is
computed clearer.
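The index arithmetic behind such macros can be sketched as below. The
TOY_ definitions are assumptions written for this note and may differ
from the actual ones in <sys/runq.h>; only the relationship between
queue index, status word index and bit index is the point.

```c
#include <limits.h>

typedef unsigned long toy_rqsw_t;

#define TOY_RQSW_BPW	((int)(sizeof(toy_rqsw_t) * CHAR_BIT))
/* Status word holding the bit of queue 'idx'. */
#define TOY_RQSW_IDX(idx)	((idx) / TOY_RQSW_BPW)
/* Position of that bit inside its status word. */
#define TOY_RQSW_BIT_IDX(idx)	((idx) % TOY_RQSW_BPW)
/* Queue index from a status word index and a bit position. */
#define TOY_RQSW_TO_QUEUE_IDX(word_idx, bit_idx)			\
	((word_idx) * TOY_RQSW_BPW + (bit_idx))

/* Queue index of the lowest set bit of 'sw', which must be non-zero. */
static int
toy_first_queue_idx(int word_idx, toy_rqsw_t sw)
{
	int bit = 0;

	while ((sw & 1) == 0) {
		sw >>= 1;
		bit++;
	}
	return (TOY_RQSW_TO_QUEUE_IDX(word_idx, bit));
}
```

Splitting a queue index into word and bit indices and recombining them
yields the original index, whatever the width of the status word.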
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
In runq_choose() and runq_choose_fuzz(), replace an unnecessary 'while'
with an 'if', and separate assignment and test of 'idx' into two lines.
Add missing parentheses to one 'sizeof' operator.
Remove superfluous brackets for one-line "then" and "else" branches (to
match style elsewhere in the file).
Declare loop indices in their 'for'.
Test for non-empty bit sets with an explicit '!= 0'.
Move TABs in some prototypes of <sys/runq.h> (should not split the
return type specifier, but instead separate the type specifier from the
function declarator).
No functional change intended.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
Make sure that external and internal users are aware that the runqueue
API always expects queue indices, and not priority levels. Name
arithmetic arguments in 'runq.h' for better immediate reference.
Use plain integers to pass indices instead of 'u_char' (using the latter
probably doesn't bring any gain, and an 'int' makes the API agnostic to
a number of queues greater than 256). Add a static assertion that
RQ_NQS can't be strictly greater than 256 as long as the 'td_rqindex'
thread field is of type 'u_char'.
Add a new macro CHECK_IDX() that checks that an index is non-negative
and below RQ_NQS, and use it in all low-level functions (and "public"
ones when they don't need to call the former).
While here, remove runq_remove_idx(), as it knows a bit too much of
ULE's internals, in particular by treating the whole runqueue as
round-robin, which we are going to change. Instead, have runq_remove()
return whether the queue from which the thread was removed is now empty,
and leverage this information in tdq_runq_rem() (sched_ule(4)).
While here, re-implement runq_add() on top of runq_add_idx() to remove
its duplicated code (all lines except one). Introduce the new
RQ_PRI_TO_IDX() macro to convert a priority to a queue index, and use it
in runq_add() (many more uses will be introduced in later commits).
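A userland approximation of the index checks and the priority-to-index
conversion might look like the following. The TOY_ names and values
(64 queues, 4 priorities per queue, as at the time of this change) are
assumptions for this sketch, and assert() stands in for the kernel's
assertion machinery.

```c
#include <assert.h>

#define TOY_RQ_NQS	64
#define TOY_RQ_PPQ	4

/* The index must still fit in an 'unsigned char' thread field. */
_Static_assert(TOY_RQ_NQS <= 256, "queue index would not fit in u_char");

#define TOY_CHECK_IDX(idx)	assert((idx) >= 0 && (idx) < TOY_RQ_NQS)
#define TOY_PRI_TO_IDX(pri)	((pri) / TOY_RQ_PPQ)

/* Convert a priority to a queue index and validate the result. */
static int
toy_checked_idx(int pri)
{
	int idx = TOY_PRI_TO_IDX(pri);

	TOY_CHECK_IDX(idx);
	return (idx);
}
```

Keeping the conversion in one macro is what lets callers pass queue
indices everywhere else without knowing how priorities map to them.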
While here, rename runq_check() to runq_not_empty() and have it return
a boolean instead of an 'int', and likewise for sched_runnable() as a
consequence (and while here, fix a small style violation in
sched_4bsd(4)'s version).
While here, simplify sched_runnable().
While here, make <sys/sched.h> standalone include-wise.
No functional change intended.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
And some structure definitions as well.
This header really is not supposed to be included by userland, so should
just error in this case. However, there is one remaining use for it in
a test: Getting the value of RQ_PPQ to ensure a big enough priority
level difference in order to guarantee that a realtime thread preempts
another. This use will soon be obsoleted by guaranteeing that
a realtime thread always preempts another one with lower priority, even
if the priority level is very close.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387
<sys/proc.h> doesn't need <sys/runq.h>. Remove this include and add it
back for kernel files that relied on the pollution.
Reviewed by: kib
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45387