The inpcb flag INP_DROPPED served two purposes.
It was used by TCP and subsystems running on top of TCP as a flag that
marks a connection that is now in TCPS_CLOSED, but was in some other state
before (not a new-born connection). Create a new TCP flag TF_DISCONNECTED
for this purpose.
The in_pcbdrop() was a TCP's version of in_pcbdisconnect() that also sets
INP_DROPPED. Use in_pcbdisconnect() instead.
Second purpose of INP_DROPPED was a negative lookup mask in
inp_smr_lock(), as SMR-protected lookup may see inpcbs that had been
removed from the hash. We already have had INP_INHASHLIST that marks
inpcb that is in hash. Convert it into INP_UNCONNECTED with the opposite
meaning. This allows to combine it with INP_FREED for the negative lookup
mask.
The Chelsio/ToE and kTLS changes are done with some style refactoring,
like moving inp/tp assignments up and using macros for that. However, no
deep thinking was taken to check if those checks are really needed, it
could be that some are not.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D56186
The iteration over all pcbs is possible without the global list. The
newborn inpcbs are put on a global list of unconnected inpcbs, then after
connect(2) or bind(2) they move to respective hash slot list.
This adds a bit of complexity to inp_next(), but the storage scheme is
actually simplified.
One potential problem before this change was that a couple of pcbs fall
into the same hash slot and are linked A->B there, but they also sit next
to each other in the global list, linked as B->A. This can deadlock of
course. The problem was never observed in the wild, but I was able to
instrument it with lots of effort: just few pcbs in the system, hash size
reduced down to 2 and a lot of repetitive calls into two kinds of
iterators.
However the main motivation is not the above problem, but make a step
towards splitting the big hash lock into per-slot locks.
Differential Revision: https://reviews.freebsd.org/D55967
With the SMR locking of inpcbs the use of this lock reduced down to the
global list and generation number. It was used only on an inpcb creation
and destruction. Use the inpcbinfo hash lock for this purpose.
Reviewed by: pouria, rrs, markj
Differential Revision: https://reviews.freebsd.org/D55966
While here remove ipi_lbgrouphashmask, as it is always has the same value
as ipi_porthashmask.
Differential Revision: https://reviews.freebsd.org/D56174
There are some files that don't include mutex.h and rwlock.h, but use
inpcb locking macros. With VIMAGE the net/vnet.h pulls half of the
possible kernel includes, masking the problem. The in_pcb.h also used to
mask the problem, so restore that.
Fixes: 041e9eb1ae
Pull up all user-visible stuff to the top of the file and isolate the
rest under _KERNEL. The user visible parts are:
- struct in_conninfo
- struct xinpcb
- defines for inp_flags bits, that are shared between xinpcb and inpcb
PR: 293493
These were hacks since FreeBSD 12 that provided some transition period for
utilities to migrate from reading kernel memory via kvm(3) to sysctl(3)
based APIs. The transition period is over.
This is much more compact. Thanks to markj@ for suggesting the change.
Reviewed by: markj
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D53507
Add /i option to the ddb commands show tcpcb and show all tcpcbs,
which enables the printing of the t_inpcb.
Reviewed by: markj
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D53497
The helper that derefs and rlocks the provided inp. Returns false if inp
is still usable.
Reviewed by: glebius, markj
Sponsored by: Nvidia networking
Differential revision: https://reviews.freebsd.org/D51143
This structure originates from the pre-FreeBSD times when system RAM was
measured in single digits of MB and Internet speeds were measured in Kb.
At first level the database hashes the port value only to calculate index
into array of pointers to lazily allocated headers that hold lists of
inpcbs with the same local port. This design apparently was made to
preserve kernel memory.
In the modern kernel size of the first level of the hash is derived from
maxsockets, which is derived from maxfiles, which in its turn is derived
from amount of physical memory. Then the size of the hash is capped by
IPPORT_MAX, cause it doesn't make any sense to have hash table larger then
the set of possible values. In practice this cap works even on my laptop.
I haven't done precise calculation or experiments, but my guess is that
any system with > 8 Gb of RAM will be autotuned to IPPORT_MAX sized hash.
Apparently, this hash is a degenerate one: it never has more than one
entries in any slot. You can check this with kgdb:
set $i = 0
while ($i <= tcbinfo->ipi_porthashmask)
set $p = tcbinfo->ipi_porthashbase[$i].clh_first
set $c = 0
while ($p != 0)
set $c = $c + 1
set $p = $p->phd_hash.cle_next
end
if ($c > 1)
printf "Slot %u count %u", $i, $c
end
set $i = $i + 1
end
Retiring the two level hash we remove a lot of complexity at the cost of
only one comparison 'inp->inp_lport != lport' in the lookup cycle, which
is going to be always false on most machines anyway. This comparison
definitely shall be cheaper than extra pointer traversal.
Another positive change to be singled out is that now we no longer need to
allocate memory in non-sleepable context in in_pcbinshash(), so a
potential ENOMEM on connect(2) is removed.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49151
The separation had been done back in 5200e00e72 for the purposes of
removing a true temporary connect of an unconnected UDP socket that does
sendto(2) in 90162a4e87. Now, with 69c05f4287 in place, the
separation is no longer needed. There should be no functional change.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49142
There are several functions that keep database locked and do address
and port selection before a caller commits the changes to the inpcb.
Mark the inpcb argument with a good documenting const.
It's only needed for in_pcb.c and in6_pcb.c, so can go to the private
header.
No functional change intended.
Reported by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Allow protocol layers to look up an inpcb belonging to a particular FIB.
This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be
used is obtained from the specificed mbuf or ifnet.
No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48662
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB
number. When this flag is specified, duplicate bindings are permitted,
so long as each FIB contains at most one inpcb bound to the same
address/port. If an inpcb is bound with this flag, it'll have the
INP_BOUNDFIB flag set.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48661
Suppose a thread is adds a socket to an existing TCP lbgroup that is
actively accepting connections. It has to do the following operations:
1. set SO_REUSEPORT_LB on the socket
2. bind() the socket to the shared address/port
3. call listen()
Step 2 makes the inpcb visible to incoming connection requests.
However, at this point the inpcb cannot accept new connections. If
in_pcblookup() matches it, the remote end will see ECONNREFUSED even
when other listening sockets are present in the lbgroup. This means
that dynamically adding inpcbs to an lbgroup (e.g., by starting up new
workers) can trigger spurious connection failures for no good reason.
(A similar problem exists when removing inpcbs from an lbgroup, but that
is harder to fix and is not addressed by this patch; see the review for
a bit more commentary.)
Fix this by augmenting each lbgroup with a linked list of inpcbs that
are pending a listen() call. When adding an inpcb to an lbgroup, keep
the inpcb on this list if listen() hasn't been called, so it is not yet
visible to the lookup path. Then, add a new in_pcblisten() routine which
makes the inpcb visible within the lbgroup now that it's safe to let it
handle new connections.
Add a regression test which verifies that we don't get spurious
connection errors while adding sockets to an LB group.
Reviewed by: glebius
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48544
Before a protocol specific control block started to embed inpcb in self
(see 0aa120d52f, e68b379244, 483fe96511) this pointer used to point
at it.
Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>. The
exp-run detected a minimal use of the field in ports:
* sysutils/lsof - patched upstream
* net-mgmt/netdata - patch accepted upstream
* emulators/qemu-user-static - upstream master branch seems not using
the field anymore
We can keep the field around for some time, but eventually it may be
reused for something else.
PR: 277659 (exp-run)
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D44491
These KPIs were added in 9d29c635da and through 15 years had zero use.
They slightly remind what IfAPI does for struct ifnet. But IfAPI does
that for the sake of large collection of NIC drivers not being aware of
struct ifnet. For the inpcb it is unclear what could be a large
collection of externally written kernel modules that need extensively use
inpcb and not be aware of its internals at the same time. This isolation
of a structure knowledge requires a lot of work, and just throwing in a
few KPIs isn't helpful.
Reviewed by: kib, bz, markj
Differential Revision: https://reviews.freebsd.org/D44310
First, merge in_pcbdetach() with in_pcbfree(). The comment for
in_pcbdetach() was no longer correct. Then, make sure we remove
the inpcb from the hash before we commit any destructive actions
on it. There are couple functions that rely on the hash lock
skipping SMR + inpcb lock to lookup an inpcb. Although there are
no known functions that similarly rely on the global inpcb list
lock, also do list removal before destructive actions.
PR: 273890
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D43122
Just like it was done for accept(2) in cfb1e92912, use same approach
for two simplier syscalls that return socket addresses. Although,
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls trimming code size.
Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694
Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.
Sponsored by: Netflix
Since f71cb9f748 socket stays connnected with inpcb through latter's
lifetime and there is no reason to complicate things and copy these
flags.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D41198
These flags are TCP specific. While here, make also several LRO
internal functions to pass tcpcb pointer instead of inpcb one.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39698
This makes inpcb lighter and allows future cache line optimizations
of tcpcb. The reason why HPTS originally used inpcb is the compressed
TIME-WAIT state (see 0d7445193a), that used to free a tcpcb, while the
associated connection is still on the HPTS ring.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39697
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d892129 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp interators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
It has no effect, and an exp-run revealed that it is not in use.
PR: 261398 (exp-run)
Reviewed by: mjg, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38822
This option was added in commit 0a100a6f1e but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.
The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree. Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient. So, let's remove it.
PR: 261398 (exp-run)
Reviewed by: glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38574
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_bind method and from there on go down the call
stack with family specific argument.
Reviewed by: zlei, melifaro, markj
Differential Revision: https://reviews.freebsd.org/D38601
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_connect method and from there on go down the call
stack with family specific argument.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38356
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process. It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual
listening sockets. One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
The suppresion was added in 5f311da2cc with no explanation in the
commit message of the exact problem that was fixed. In the BSDCan
2006 talk [1], slides 12 to 14, we can find that it seems that there
was some problem with the TIME_WAIT state not properly being handled
on the remote side (also FreeBSD!), and this switching off the
suppression had hidden the problem. The rationale of the change was
that other stacks may also be buggy wrt the TIME_WAIT.
I did not find the actual problem in TIME_WAIT that the suppression
has hidden, neither a commit that would fix it. However, since that
time we started to handle SYNs with RFC5961 instead of RFC793, see
3220a2121c. We also now have the tcp-testsuite [2], that has full
coverage of all possible scenarios of receiving SYN in TIME_WAIT.
This effectively reverts 5f311da2cc
and 6ee79c59d2.
[1] https://www.bsdcan.org/2006/papers/ImprovingTCPIP.pdf
[2] https://github.com/freebsd-net/tcp-testsuite
Reviewed by: rscheff
Discussed with: rscheff, rrs, tuexen
Differential revision: https://reviews.freebsd.org/D37042
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
This interface allows to set a socket option on a TCP endpoint,
which is specified by its inp_gencnt. This interface will be
used in an upcoming command line tool tcpsso.
Reviewed by: glebius, rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D34138
Provide structure inpcbstorage, that holds zones and lock names for
a protocol. Initialize it with global protocol init using macro
INPCBSTORAGE_DEFINE(). Then, at VNET protocol init supply it as
the main argument to the in_pcbinfo_init(). Each VNET pcbinfo uses
its private hash, but they all use same zone to allocate and SMR
section to synchronize.
Note: there is kern.ipc.maxsockets sysctl, which controls UMA limit
on the socket zone, which was always global. Historically same
maxsockets value is applied also to every PCB zone. Important fact:
you can't create a pcb without a socket! A pcb may outlive its socket,
however. Given that there are multiple protocols, and only one socket
zone, the per pcb zone limits seem to have little value. Under very
special conditions it may trigger a little bit earlier than socket zone
limit, but in most setups the socket zone limit will be triggered
earlier. When VIMAGE was added to the kernel PCB zones became per-VNET.
This magnified existing disbalance further: now we have multiple pcb
zones in multiple vnets limited to maxsockets, but every pcb requires a
socket allocated from the global zone also limited by maxsockets.
IMHO, this per pcb zone limit doesn't bring any value, so this patch
drops it. If anybody explains value of this limit, it can be restored
very easy - just 2 lines change to in_pcbstorage_init().
Differential revision: https://reviews.freebsd.org/D33542
The intent is to provide more entropy than can be provided
by just the 32-bits of the IPv6 address which overlaps with
6to4 tunnels. This is needed to mitigate potential algorithmic
complexity attacks from attackers who can control large
numbers of IPv6 addresses.
Together with: gallatin
Reviewed by: dwmalone, rscheff
Differential revision: https://reviews.freebsd.org/D33254