Talk:Sysctl

net.ipv4.tcp_rfc1337

From kernel doc:

tcp_rfc1337 - BOOLEAN
	If set, the TCP stack behaves conforming to RFC1337. If unset,
	we are not conforming to RFC, but prevent TCP TIME_WAIT
	assassination.
	Default: 0

So, isn't 0 the safe value? Our wiki says otherwise. -- Lahwaacz (talk) 08:56, 17 September 2013 (UTC)

With the setting at 0, the system 'assassinates' a socket in TIME_WAIT prematurely upon receiving a RST. While this might sound like a good idea (it frees up the socket sooner), it opens the door to TCP sequence problems/SYN replay. Those problems are described in RFC 1337, and setting the value to 1 is one way to deal with them (letting TIME_WAIT sockets idle out even if a reset is received, so that the sequence number cannot be reused in the meantime). The wiki is correct in my view. The kernel doc is wrong here - "prevent" should read "enable". --Indigo (talk) 21:12, 17 September 2013 (UTC)
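
For anyone who wants to apply or test the behaviour discussed above, a minimal sketch of a persistent drop-in plus a runtime check (the file name below is arbitrary; the parameter itself is the one documented above):

 # /etc/sysctl.d/90-tcp-rfc1337.conf
 # Keep sockets in TIME_WAIT for the full period even if a RST arrives (RFC 1337 behaviour)
 net.ipv4.tcp_rfc1337 = 1

The current value can be inspected with sysctl net.ipv4.tcp_rfc1337, and the drop-in can be applied without a reboot via sysctl --system.
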
Since this discussion is still open: an interesting attack on the kernel's implementation of the related RFC 5961 was published yesterday as CVE-2016-5696. I have not looked into it enough to form an opinion on whether leaving this setting at the default 0 or setting it to 1 makes any difference there, but it is exactly the kind of sequencing attack I was referring to three years back. --Indigo (talk) 08:38, 11 August 2016 (UTC)
Any news about this? Has anyone already performed more research and analysis on it? Timofonic (talk) 17:51, 31 July 2017 (UTC)

From file "net/ipv4/tcp_minisocks.c" in kernel:

 
		if (th->rst) {
			/* This is TIME_WAIT assassination, in two flavors.
			 * Oh well... nobody has a sufficient solution to this
			 * protocol bug yet.
			 */
			if (sysctl_tcp_rfc1337 == 0) {
kill:
				inet_twsk_deschedule_put(tw);
				return TCP_TW_SUCCESS;
			}
		} else {
			inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
		}

From man page tcp(7):

tcp_rfc1337 (Boolean; default: disabled; since Linux 2.2)

Enable TCP behavior conformant with RFC 1337. When disabled, if a RST is received in TIME_WAIT state, we close the socket immediately without waiting for the end of the TIME_WAIT period.

From Google kernel security settings

# Implement RFC 1337 fix
net.ipv4.tcp_rfc1337 = 1

—This unsigned comment is by HacKurx (talk) 11:38, 10 May 2020‎. Please sign your posts with ~~~~!

It is no longer there at that link. Seems like Google dropped the option at some point?
Hanabishi (talk) 12:19, 31 March 2023 (UTC)
For reference: they apparently dropped it as a requirement sometime after Feb '21; I did not find a reference for the specific reason (QUIC could be one). --Indigo (talk) 07:47, 2 April 2023 (UTC)
QUIC uses UDP though, so it is obviously not affected by TCP options.
Hanabishi (talk) 08:42, 2 April 2023 (UTC)
@Hanabishi: Yes, UDP to replace TCP in their backbone (see the second link above). I was just speculating about why their cloud images no longer adjust the default. --Indigo (talk) 18:27, 25 November 2024 (UTC)

Virtual memory

The official documentation states that these two variables "Contain[s], as a percentage of total available memory that contains free pages and reclaimable pages,..." and that "The total available memory is not equal to total system memory.". However, the comment underneath talks about them as if they were a percentage of system memory, which makes it quite confusing, e.g. I have 6GiB of system memory but only 1-2GiB available.

Also the defaults seem to have changed, I have dirty_ratio=50 and dirty_background_ratio=20.

-- DoctorJellyface (talk) 08:27, 8 August 2016 (UTC)

Yes, I agree. When I changed the section a little with [1], I left the comment in. The reason was that while the current form is a simplification, expanding it to first show the difference between system memory and available memory and only then calculate the percentages would make it cumbersome/complicated to follow. If you have an idea how to do it, please go ahead. --Indigo (talk) 09:07, 8 August 2016 (UTC)
I would like this to be explained as an "introduction" to both concepts to avoid misconfiguration. I think I somewhat understand it, but I have some questions about it (available memory after booting, before or after systemd? available memory while using the system? etc.). Even though there may be documents explaining it, an introduction would make the article friendlier to read. Of course, there can be links to more specific documents for those who want to know more. Timofonic (talk) 18:14, 31 July 2017 (UTC)
The problem is that the kernel docs don't explain what "available memory" really means. Assuming that it changes similarly to what free shows, taking the system memory instead is still useful to prepare for the worst case. -- Lahwaacz (talk) 09:11, 8 August 2016 (UTC)
Yes, worst case also because "available" should grow disproportionately: some slices, like system memory reserved for the BIOS or GPU, will not change regardless of the total installed RAM. I've had my go at it with [2]. --Indigo (talk) 07:54, 9 August 2016 (UTC)
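
A side note for readers following this thread: the ambiguity about "available memory" can be sidestepped entirely by using the byte-based counterparts of these sysctls, which set absolute thresholds (writing one of them zeroes the corresponding ratio setting, and vice versa). A minimal sketch with an arbitrary file name and purely illustrative values, not a recommendation:

 # /etc/sysctl.d/40-dirty-bytes.conf (illustrative values only)
 # Absolute thresholds avoid guessing what percentage of "available" memory means
 vm.dirty_background_bytes = 67108864
 vm.dirty_bytes = 268435456
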
I'm still not sure about certain parameters: default examples are provided, but not all of them explain why those numbers are used or how they can be calculated for different use cases. I may be wrong, but this article should explain each concept more comprehensively and pedagogically than the Linux kernel documentation (which I assume is aimed more at developers), covering sensible "default" values and how to tune them depending on system usage. From my limited perspective, each parameter should be considered for different types of systems: desktop (average use; low latency for interactive work; low latency for interactive work combined with intensive software such as compiling with GCC/LLVM, tons of windows/tabs in a web browser, big apps in interpreted/bytecode languages such as Python/Java/Mono, etc.) and server (including servers with some interactivity, e.g. providing HTPC features). I have no more ideas, just lots of questions, sorry. I hope someone with more knowledge is able to discuss this and provide some better-explained information at least. Thanks for your efforts, the Arch community is a great place to be. Timofonic (talk) 18:14, 31 July 2017 (UTC)

added vfs_cache_pressure parameter

let me know if it's OK --nTia89 (talk) 18:15, 26 December 2016 (UTC)

Fine with me. Cheers. --Indigo (talk) 15:05, 27 December 2016 (UTC)
That's okay, thanks a lot for adding it. But why did you choose 60 as the value? What is the logic behind it? Should it be changed depending on how the system is used? And if so, how can one be sure to change it correctly? Timofonic (talk) 18:18, 31 July 2017 (UTC)
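
For anyone experimenting while this question is open, the value can be inspected and tried out at runtime before persisting anything (100 is the kernel default; 60 is just the value discussed above, not a recommendation):

 # read the current value
 sysctl vm.vfs_cache_pressure
 # try a different value temporarily (reverts on reboot)
 sysctl -w vm.vfs_cache_pressure=60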


Troubleshooting: Small periodic system freezes

This is something that happens occasionally on my system, especially when a considerable number of tabs is open across different windows (50-70 tabs in 4-5 windows, for example).

- Dirty bytes: why use the 4M value? Is there an explanation for it? Can it be fine-tuned? What does it mean?
- Change kernel.io_delay_type: there is a list of different types, but zero explanation of them. What does each one mean? How does it change the behaviour of the system? How can I find the best one for my system?

Sorry for asking too much, I'm trying to understand certain concepts that are still difficult for me. I'm sorry if there are already good sources on these topics; I was unable to locate them. Thanks for your patience.

Timofonic (talk) 18:27, 31 July 2017 (UTC)
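
For context on the 4M value asked about above: expressed as the byte-based sysctls, 4 MiB corresponds to the following sketch. I am assuming the article's figure applies to vm.dirty_background_bytes/vm.dirty_bytes, so check the current article section for the exact parameters it sets:

 # 4 MiB expressed in bytes (4 * 1024 * 1024)
 vm.dirty_background_bytes = 4194304
 vm.dirty_bytes = 4194304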

About the "io_delay_type": it apparently has something to do with hardware I/O accesses rather than with general kernel tuning.
* LWN: x86: provide a DMI based port 0x80 I/O delay override
* https://elixir.bootlin.com/linux/v5.5-rc4/source/arch/x86/kernel/io_delay.c
Gima (talk) 15:00, 30 December 2019 (UTC)

Does the removal of SYSCTL_SYSCALL affect this page?

See https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.5-Kills-SYSCTL-SYSCALL Francoism (talk) 10:15, 28 January 2020 (UTC)

I think it does not. Quoting https://cateee.net/lkddb/web-lkddb/SYSCTL_SYSCALL.html
sys_sysctl uses binary paths that have been found challenging to properly maintain and use. The interface in /proc/sys using paths with ascii names is now the primary path to this information.
Where are binary paths mentioned on this page? CONFIG_SYSCTL=y on the default Arch 5.5.10-arch1 kernel, and the related configuration, are still there.
Regid (talk) 01:43, 25 March 2020 (UTC)
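
A quick way to confirm that the ascii /proc/sys interface this page relies on is unaffected, assuming the running kernel exposes its configuration via /proc/config.gz (CONFIG_IKCONFIG_PROC):

 # the removed interface is the binary sys_sysctl syscall, not /proc/sys
 zgrep -E 'CONFIG_SYSCTL=|CONFIG_SYSCTL_SYSCALL' /proc/config.gz
 sysctl kernel.ostype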

net.ipv4.ip_local_port_range

net.ipv4.ip_local_port_range = "30000 65535" can improve a VPN application's performance (1).

—This unsigned comment is by Yukiko05 (talk) 08:34, 15 May 2020. Please sign your posts with ~~~~!

This should be added to the page along with a reference link. -- Lahwaacz (talk) 08:36, 15 May 2020 (UTC)
  1. I couldn't see where (1) mentions net.ipv4.ip_local_port_range.
  2. Isn't the default range "32768 60999", while the suggested range here is "30000 65535"? Is the difference that noticeable?
Regid (talk) 23:40, 4 April 2021 (UTC)
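
For reference while this is evaluated, the range can be inspected and widened at runtime; the "30000 65535" range below is simply the one proposed above, not a recommendation:

 # current range (the Linux default is 32768 60999)
 sysctl net.ipv4.ip_local_port_range
 # widen it temporarily for testing
 sysctl -w net.ipv4.ip_local_port_range="30000 65535"
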
I have no idea what Yukiko05 meant; they should explain the improvement properly. -- Lahwaacz (talk) 20:32, 5 April 2021 (UTC)

After some experiments I found that the connect() system call takes longer when half of the ports are used (see the measurements made with a simple python script below). This could mean that increasing the port range can improve performance when connecting thousands of clients to a single server.

People familiar with C can try to analyze the __inet_hash_connect function algorithm (I am too dumb for this).

However, I still do not know how this can improve the VPN performance.

ip_local_port_range: 1024    60999  (59976 ports)
Socket #29978: port 14270, connect duration 0.036ms
Socket #29979: port 14273, connect duration 5.776ms

ip_local_port_range: 16384   60999  (44616 ports)
Socket #22308: port 25010, connect duration 0.031ms
Socket #22309: port 25013, connect duration 4.164ms

ip_local_port_range: 32768   60999  (28232 ports, Linux default)
Socket #14116: port 39486, connect duration 0.031ms
Socket #14117: port 39489, connect duration 2.559ms

ip_local_port_range: 58000   60999  (3000 ports)
Socket #1500: port 59824, connect duration 0.028ms
Socket #1501: port 59827, connect duration 0.267ms

andreymal (talk) 16:32, 20 April 2022 (UTC)

Apparently, the page linked by the original poster in this discussion changed [3].

The Shadowsocks documentation does mention that the sweet spot for running a server is the range 10000 65000. To some extent this might impact VPN performance, but considering there are a few other options recommended by the Shadowsocks docs, I could not say with certainty to what extent this alone would improve the metrics. Still, the fact remains that this differs from the settings currently recommended in the article, which I assume the original poster was referring to when speaking of inherent VPN improvements. Since there is no mention of VPN use in the current revision of the article's section, this is only an observation.

Do note that the Shadowsocks docs recommend a few other kernel parameters that conflict with the parameters currently explained (and implicitly adopted by average users) in this article. E.g. the still-open discussion on TCP Fast Open, where the article recommends the same option as Shadowsocks, contrary to the talk page discussion (though this is probably a case for adding warnings); this might be worth mentioning in the noted expansion. Then there are parameters like net.core.rmem_max, where this talk page disputes the factual accuracy of the content. Taking the Shadowsocks docs' advice, there would be a conflict between (1) the options currently explained in the article, (2) those that might be changed according to the talk page note, and (3) those recommended by the Shadowsocks docs (all three offering different recommended values). For this reason, it may be wise to first resolve the relevant discussions about kernel parameters that are not yet fully clear for the purposes of the article's section, before adding any changes related to this discussion.

There is also the option of taking Shadowsocks out of the equation entirely and only mentioning it in the sections covering the parameters that collide with the settings listed in its documentation, ideally as a note for users interested in VPN use. For this, I believe it would be best to at least mention the options and which path a user should take for some general use cases, like everyday desktop use or server use; but I am a new user, and this might not be the best approach. Such a note should probably link to an article external to the wiki, to avoid lengthy explanations of how the parameters interact with each other and the consequences of mixing and matching values tuned for the user's own system with those related to Shadowsocks. I don't know my way around the kernel docs and thus cannot link to any pages that would be useful for this purpose.

Finally, I think it would be best if VPN information were linked to some network configuration page instead, where this content could be elaborated further; ideally the Shadowsocks page. This presents another issue, though, as the current link to kernel parameter optimisations [4][5] points to a GitHub repository wiki that has not been updated in almost a decade. The fact that the link posted by the original poster of this discussion worked back in 2021 but no longer did only a few years later (2024 as of writing) suggests that the Shadowsocks documentation efforts have evolved in both infrastructure and content. Thus, changing the link to the official Shadowsocks docs, and specifically to the relevant article (the same one this discussion revolves around), would also be necessary on Shadowsocks' ArchWiki page. For clarity's sake, after comparing the contents of the old GitHub wiki and the current official docs, I can say that the latter lack some of the extra analysis of the old guide. There are also some extra instructions in the old guide which may or may not still be required. I personally have no use for Shadowsocks, so I have refrained from testing. For easier access to the pages discussed here, the relevant links are provided for anyone who might want them:

Back to the topic of kernel parameters, there is another concern to raise about the counterproductiveness of including instructions for this particular VPN use and simply redirecting to the Shadowsocks docs. Consider a user who has already configured other sections of this article with the listed kernel parameters or their own take on them: linking straight to the Shadowsocks docs would leave them (if lacking the required knowledge) unsure which parameters to keep as the wiki shows (or as they set themselves), and which to configure the way Shadowsocks recommends. These vary greatly in some cases, so a lot of care should be taken when adding a warning, whether when redirecting to the Shadowsocks article or in the relevant section of the Shadowsocks page.

Adam6 (talk) 23:48, 25 November 2024 (UTC)

TCP Fast Open

At the time of this writing, TCP Fast Open is mentioned at sysctl#Enable TCP Fast Open. I would like to draw the reader's attention to https://squeeze.isobar.com/2019/04/11/the-sad-story-of-tcp-fast-open/ . Regid (talk) 14:45, 4 April 2021 (UTC)

Your link now fails with a 502, so I looked it up in the archive; an interesting read indeed, but I cannot apply it to the expansion template. --Indigo (talk) 19:01, 25 November 2024 (UTC)
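
For anyone weighing that article against the wiki's recommendation: the current state can be checked and reverted at runtime without editing any drop-in (1 is the kernel default, client only; 3 enables it for both outgoing and incoming connections):

 sysctl net.ipv4.tcp_fastopen
 # fall back to the kernel default
 sysctl -w net.ipv4.tcp_fastopen=1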

BBR with qdisc Cake

At the time of this writing, BBR is configured with CAKE in Sysctl#Enable BBR. However, the BBR source still says BBR must be configured with fq with pacing: https://github.com/torvalds/linux/blob/ab75170520d4964f3acf8bb1f91d34cbc650688e/net/ipv4/tcp_bbr.c#L55-L57

And if I understand the current implementation correctly, CAKE does not have an internal flow pacing feature, which would hurt BBR performance. Gnattu (talk) 05:45, 5 January 2025 (UTC)
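
For comparison while this is sorted out, the combination described in the quoted source comment would look like the following sketch (the file name is arbitrary; the article currently pairs BBR with cake instead, so this is not a recommendation):

 # /etc/sysctl.d/98-bbr-fq.conf (sketch)
 net.core.default_qdisc = fq
 net.ipv4.tcp_congestion_control = bbr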