Talk:Sysctl

From ArchWiki

net.ipv4.tcp_rfc1337

From kernel doc:

tcp_rfc1337 - BOOLEAN
	If set, the TCP stack behaves conforming to RFC1337. If unset,
	we are not conforming to RFC, but prevent TCP TIME_WAIT
	assassination.
	Default: 0

So, isn't 0 the safe value? Our wiki says otherwise. -- Lahwaacz (talk) 08:56, 17 September 2013 (UTC)

With the setting at 0, the system 'assassinates' a socket in TIME_WAIT prematurely upon receiving a RST. While this might sound like a good idea (it frees up the socket sooner), it opens the door to TCP sequence problems and SYN replay. Those problems are described in RFC 1337, and setting the value to 1 is one way to deal with them (letting TIME_WAIT sockets idle out even if a reset is received, so that the sequence number cannot be reused in the meantime). The wiki is correct in my view. The kernel doc is wrong here: "prevent" should read "enable". --Indigo (talk) 21:12, 17 September 2013 (UTC)
Since this discussion is still open: an interesting attack on the kernel's implementation of the related RFC 5961 was published yesterday as CVE-2016-5696. I have not looked into it enough to form an opinion on whether leaving this setting at the default 0 or changing it to 1 makes any difference to that attack, but it is exactly the kind of sequencing attack I was referring to three years back. --Indigo (talk) 08:38, 11 August 2016 (UTC)
Any news about this? Has anyone done more research and analysis on it? Timofonic (talk) 17:51, 31 July 2017 (UTC)

From file "net/ipv4/tcp_minisocks.c" in kernel:

 
		if (th->rst) {
			/* This is TIME_WAIT assassination, in two flavors.
			 * Oh well... nobody has a sufficient solution to this
			 * protocol bug yet.
			 */
			if (sysctl_tcp_rfc1337 == 0) {
kill:
				inet_twsk_deschedule_put(tw);
				return TCP_TW_SUCCESS;
			}
		} else {
			inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);

From the tcp(7) man page:

tcp_rfc1337 (Boolean; default: disabled; since Linux 2.2)

Enable TCP behavior conformant with RFC 1337. When disabled, if a RST is received in TIME_WAIT state, we close the socket immediately without waiting for the end of the TIME_WAIT period.

From Google kernel security settings

# Implement RFC 1337 fix
net.ipv4.tcp_rfc1337 = 1

—This unsigned comment is by HacKurx (talk) 11:38, 10 May 2020‎. Please sign your posts with ~~~~!
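
To make the quoted kernel branch easier to follow, here is a minimal Python sketch of the decision it encodes. It is a toy model based only on the excerpt above, not kernel code: the function names `time_wait_action` and `read_rfc1337` are made up for illustration, and the /proc/sys path is assumed from the usual naming convention for net.ipv4.tcp_rfc1337.

```python
from pathlib import Path

# Assumed location of the setting, following the /proc/sys naming convention.
RFC1337_PATH = Path("/proc/sys/net/ipv4/tcp_rfc1337")


def time_wait_action(rst_received: bool, rfc1337: int) -> str:
    """Toy model of the branch quoted from tcp_minisocks.c (hypothetical helper)."""
    if rst_received:
        if rfc1337 == 0:
            # Default: the RST removes the TIME_WAIT socket immediately.
            return "kill"
        # rfc1337 enabled: the RST is ignored and the timer keeps running.
        return "keep"
    # No RST in the quoted branch: the TIME_WAIT timer is restarted.
    return "reschedule"


def read_rfc1337(path: Path = RFC1337_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

Calling `time_wait_action(True, read_rfc1337() or 0)` on a given machine shows which of the two behaviours discussed above applies to an incoming RST; an unreadable setting is treated as the default 0 here, which is itself an assumption.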

Virtual memory

The official documentation states that these two variables (dirty_ratio and dirty_background_ratio) "Contain[s], as a percentage of total available memory that contains free pages and reclaimable pages,..." and that "The total available memory is not equal to total system memory.". However, the comment underneath talks about them as if they were a percentage of system memory, which is quite confusing: for example, I have 6GiB of system memory but only 1-2GiB available.

Also, the defaults seem to have changed: I have dirty_ratio=50 and dirty_background_ratio=20.

-- DoctorJellyface (talk) 08:27, 8 August 2016 (UTC)

Yes, I agree. When I changed the section a little with [1], I left the comment. The reason was that, while the comment simplifies things in its current form, expanding it to first show the difference between system memory and available memory and only then calculate the percentages would make it cumbersome to follow. If you have an idea how to do it, please go ahead. --Indigo (talk) 09:07, 8 August 2016 (UTC)
I would like this to be explained as an "introduction" to both concepts to avoid misconfiguration. I think I somewhat understand it, but I have some open questions (available memory after booting, before or after systemd starts? available memory while using the system? etc.). Although there may be documents explaining it, covering it here would make the page friendlier to read. Of course, there can be links to more specific documents for further reading. Timofonic (talk) 18:14, 31 July 2017 (UTC)
The problem is that the kernel docs don't explain what "available memory" really means. Assuming that it changes similarly to what free shows, taking the system memory instead is still useful to prepare for the worst case. -- Lahwaacz (talk) 09:11, 8 August 2016 (UTC)
Yes, and the worst case also matters because "available" memory should grow disproportionately with installed RAM: some slices, such as system memory reserved for the BIOS or GPU, do not change regardless of the total installed RAM. I've had a go at it with [2]. --Indigo (talk) 07:54, 9 August 2016 (UTC)
I'm still unsure about certain parameters. Default examples are provided, but not all of them explain why those numbers are used or how they could be calculated for different use cases. I may be wrong, but this article should explain each concept more comprehensively and pedagogically than the Linux kernel documentation (which I assume is aimed more at developers), covering the best "default" value for each parameter and how to tune it depending on system usage. From my limited perspective, I would like each parameter to be discussed with different types of systems in mind: desktop (average use; low latency for interactive operations; low latency for interactive operations alongside intensive software, such as compiling with GCC/LLVM, many windows/tabs in a web browser, or big applications running on an interpreter or bytecode runtime like Python/Java/Mono) and server (including servers with some interactivity, e.g. providing HTPC features). I have no more ideas, just lots of questions, sorry. I hope someone with more knowledge can discuss this and provide at least some more detailed explanation. Thanks for your efforts; the Arch community is a great place to be. Timofonic (talk) 18:14, 31 July 2017 (UTC)
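
To put numbers on the "system memory" versus "available memory" distinction discussed in this thread, here is a minimal Python sketch of the arithmetic. It uses the figures from the first comment (6 GiB of system memory, ratios of 50 and 20) and assumes 1.5 GiB as the midpoint of the reported "1-2 GiB available"; the helper name `dirty_limit_bytes` is made up for illustration, and the kernel's own notion of available memory may differ from whatever figure is plugged in.

```python
GIB = 1024 ** 3


def dirty_limit_bytes(base_bytes: int, ratio_percent: int) -> int:
    """Bytes of dirty page cache allowed for a given memory base and ratio."""
    return base_bytes * ratio_percent // 100


total = 6 * GIB            # system memory from the comment above
available = 3 * GIB // 2   # assumption: midpoint of the reported 1-2 GiB

for name, ratio in (("dirty_background_ratio", 20), ("dirty_ratio", 50)):
    worst_case = dirty_limit_bytes(total, ratio)
    realistic = dirty_limit_bytes(available, ratio)
    print(f"{name}={ratio}: {worst_case / GIB:.2f} GiB of total, "
          f"{realistic / GIB:.2f} GiB of available")
```

With these inputs, dirty_ratio=50 works out to 3 GiB when computed against system memory but only 0.75 GiB against the assumed available memory, which is the gap between the worst-case estimate and the documented behaviour.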

added vfs_cache_pressure parameter

let me know if it's OK --nTia89 (talk) 18:15, 26 December 2016 (UTC)

Fine with me. Cheers. --Indigo (talk) 15:05, 27 December 2016 (UTC)
That's okay, thanks a lot for adding it. But why did you choose 60 as the value? What is the logic behind it? Should it be changed depending on how the system is used? And if so, how can one be sure of changing it correctly? Timofonic (talk) 18:18, 31 July 2017 (UTC)


Troubleshooting: Small periodic system freezes

This is something that happens occasionally on my system, especially with a considerable number of tabs open in different windows (50-70 tabs across 4-5 windows, for example).

- Dirty bytes: why use the 4M value? Is there an explanation for it? Can it be fine-tuned? What does it mean?
- Change kernel.io_delay_type: there is a list of the different types, but no explanation of them. What does each one mean? How can it change the behaviour of the system? How do I find the best one for my system?

Sorry for asking too much, I'm trying to understand certain concepts that are still difficult for me. I'm sorry if there are already good sources about them; I was unable to locate any. Thanks for your patience.

Timofonic (talk) 18:27, 31 July 2017 (UTC)

About "io_delay_type": it apparently has to do with how hardware accesses are delayed, rather than with general kernel behaviour.
* LWN: x86: provide a DMI based port 0x80 I/O delay override
* https://elixir.bootlin.com/linux/v5.5-rc4/source/arch/x86/kernel/io_delay.c
Gima (talk) 15:00, 30 December 2019 (UTC)
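
For readers who only want the short version, here is a small Python lookup of the four values as I read them from the io_delay.c source linked above. Treat the mapping as my interpretation of that file rather than official documentation; the helper name `describe_io_delay_type` is made up for illustration.

```python
# Mapping as read from arch/x86/kernel/io_delay.c (v5.5); an interpretation,
# not an authoritative reference.
IO_DELAY_TYPES = {
    0: "write to I/O port 0x80 (the traditional delay port)",
    1: "write to I/O port 0xed (alternative for machines where 0x80 misbehaves)",
    2: "udelay(): a short timed busy-wait instead of a port write",
    3: "no delay at all",
}


def describe_io_delay_type(value: int) -> str:
    """Return a one-line description for a kernel.io_delay_type value."""
    return IO_DELAY_TYPES.get(value, "unknown value")
```

For example, `describe_io_delay_type(1)` describes the 0xed variant that the LWN article's DMI-based override is about.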

Does the removal of SYSCTL_SYSCALL affect this page?

See https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.5-Kills-SYSCTL-SYSCALL Francoism (talk) 10:15, 28 January 2020 (UTC)

I think it does not. Quoting https://cateee.net/lkddb/web-lkddb/SYSCTL_SYSCALL.html
sys_sysctl uses binary paths that have been found challenging to properly maintain and use. The interface in /proc/sys using paths with ascii names is now the primary path to this information.
Where are binary paths mentioned on this page? CONFIG_SYSCTL=y and the related configuration are still present in the default Arch 5.5.10-arch1 kernel.
Regid (talk) 01:43, 25 March 2020 (UTC)
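
To illustrate the "paths with ascii names" interface that the quote refers to, here is a minimal Python sketch that maps a dotted sysctl name to its file under /proc/sys and reads it. The helper names are made up for illustration, and the mapping is a simplification: it assumes no component of the name contains a dot itself (an interface name such as eth0.100 would break it).

```python
from pathlib import Path

PROC_SYS = Path("/proc/sys")


def sysctl_to_proc_path(name: str) -> Path:
    """Map a dotted sysctl name to its file under /proc/sys.

    Simplification: every dot is treated as a path separator.
    """
    return PROC_SYS.joinpath(*name.split("."))


def read_sysctl(name: str):
    """Return the value as text, or None if the key does not exist or is unreadable."""
    try:
        return sysctl_to_proc_path(name).read_text().strip()
    except OSError:
        return None
```

For example, `read_sysctl("kernel.ostype")` reads /proc/sys/kernel/ostype; none of this goes through the removed sys_sysctl system call.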

net.ipv4.ip_local_port_range

Setting net.ipv4.ip_local_port_range = "30000 65535" can improve a VPN application's performance (1).

—This unsigned comment is by Yukiko05 (talk) 08:34, 15 May 2020. Please sign your posts with ~~~~!

This should be added to the page along with a reference link. -- Lahwaacz (talk) 08:36, 15 May 2020 (UTC)
  1. I couldn't see where 1 mentions net.ipv4.ip_local_port_range.
  2. Isn't the default range "32768 60999"? And isn't the range suggested here "30000 65535"? Is the difference really that noticeable?
Regid (talk) 23:40, 4 April 2021 (UTC)
I have no idea what Yukiko05 meant, they should explain the improvement properly. -- Lahwaacz (talk) 20:32, 5 April 2021 (UTC)
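
To put numbers on the two ranges being compared above, here is a minimal Python sketch that counts the ports in an ip_local_port_range value. The helper name `port_range_size` is made up for illustration; it only does arithmetic and says nothing about any performance effect.

```python
def port_range_size(value: str) -> int:
    """Count the ports in an ip_local_port_range value such as '32768 60999'."""
    low, high = (int(part) for part in value.split())
    if not 0 < low <= high <= 65535:
        raise ValueError(f"invalid port range: {value!r}")
    return high - low + 1  # both ends are inclusive


default_size = port_range_size("32768 60999")    # 28232 ports
suggested_size = port_range_size("30000 65535")  # 35536 ports
print(suggested_size - default_size)             # 7304 additional ports
```

So the suggested range offers roughly a quarter more ephemeral ports than the default.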

After some experiments I found that the connect() system call takes longer once half of the ports in the range are in use (see the measurements below, made with a simple Python script). This could mean that increasing the port range can improve performance when connecting thousands of clients to a single server.

People familiar with C can try to analyze the __inet_hash_connect function algorithm (I am too dumb for this).

However, I still do not know how this can improve the VPN performance.

ip_local_port_range: 1024    60999  (59976 ports)
Socket #29978: port 14270, connect duration 0.036ms
Socket #29979: port 14273, connect duration 5.776ms

ip_local_port_range: 16384   60999  (44616 ports)
Socket #22308: port 25010, connect duration 0.031ms
Socket #22309: port 25013, connect duration 4.164ms

ip_local_port_range: 32768   60999  (28232 ports, Linux default)
Socket #14116: port 39486, connect duration 0.031ms
Socket #14117: port 39489, connect duration 2.559ms

ip_local_port_range: 58000   60999  (3000 ports)
Socket #1500: port 59824, connect duration 0.028ms
Socket #1501: port 59827, connect duration 0.267ms

andreymal (talk) 16:32, 20 April 2022 (UTC)
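
The script behind the measurements above is not included in this thread. As a rough idea of how such a measurement could be made, here is a minimal Python sketch that opens many client connections to a local listener and times each connect() call. It is not andreymal's code: the function name `measure_connects`, the use of the loopback address, and the single-threaded accept loop are all assumptions made for illustration.

```python
import socket
import time


def measure_connects(count, host="127.0.0.1"):
    """Open `count` connections to a local listener and time each connect().

    Returns a list of (local_port, duration_ms) tuples.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the kernel choose a free port
    server.listen(128)
    target = server.getsockname()

    held = []     # keep every socket open so its local port stays in use
    results = []
    try:
        for _ in range(count):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            held.append(client)
            start = time.perf_counter()
            client.connect(target)  # the kernel picks the local port here
            duration_ms = (time.perf_counter() - start) * 1000.0
            accepted, _peer = server.accept()  # drain the accept queue
            held.append(accepted)
            results.append((client.getsockname()[1], duration_ms))
    finally:
        for sock in held:
            sock.close()
        server.close()
    return results
```

Each connection holds two file descriptors, so reproducing counts in the tens of thousands would require raising the open-file limit first. Printing each result as `Socket #N: port P, connect duration Xms` gives output in the same shape as the listings above.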

TCP Fast Open

At the time of writing, TCP Fast Open is mentioned at sysctl#Enable TCP Fast Open. I would like to draw the reader's attention to https://squeeze.isobar.com/2019/04/11/the-sad-story-of-tcp-fast-open/ . Regid (talk) 14:45, 4 April 2021 (UTC)