Problem :
When launching a JVM, I get the message: "Cannot create GC thread. Out of system resources"
- Enough memory
- Enough swap
- Enough ulimit
- Enough threads-max
- Enough CPU
- Even extended the PID limit...
Important (it matters at the end): Debian version = 10.11
Solution :
After hours of googling, I found the usual suspects, but none of those solutions worked and none matched the numbers I had (checked with the commands below):
- number of open files < ulimit -n
- maximum processes/tasks < ulimit -u
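For the record, the corresponding checks are standard commands:
ulimit -n                          # max open files
ulimit -u                          # max user processes
cat /proc/sys/kernel/threads-max   # system-wide thread limit
cat /proc/sys/kernel/pid_max       # PID limit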
But in one thread, I found something that worked: UserTasksMax.
I'm running systemd, and I have around 10805 tasks running for my user.
And from : https://manpages.debian.org/stretch/systemd/logind.conf.5.en.html
UserTasksMax=
Sets the maximum number of OS tasks each user may run concurrently. This controls the
TasksMax= setting of the per-user slice unit, see
systemd.resource-control(5) for details. If assigned the special value "infinity", no tasks limit is applied. Defaults to 33%, which equals 10813 with the kernel's defaults on the host, but might be smaller in OS containers.
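To see the limit actually applied to your user slice (standard systemctl usage; UID 1000 is an example):
systemctl show user-1000.slice -p TasksMax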
For my suspect PID (the one with a lot of files):
- cat /proc/21890/status | grep Thread => 1 thread
- ls /proc/21890/task | wc -l
- confirmed by the usual command: ps -eLf | grep calrisk | wc -l
All told, I had around 10805 threads/tasks running for this user's JVMs, very close to the 10813 limit.
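The fix is to raise (or disable) the limit. On systemd versions that still have the option, it goes in logind.conf (the value below is an example; "infinity" disables the limit entirely):

# /etc/systemd/logind.conf
[Login]
UserTasksMax=16384

systemctl restart systemd-logind   # may interrupt user sessions, pick a quiet window

On newer systemd releases UserTasksMax was removed from logind.conf; if I read the release notes correctly, the equivalent is a TasksMax= drop-in on the user slice:

# /etc/systemd/system/user-.slice.d/limits.conf
[Slice]
TasksMax=16384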
Complete guide :
https://www.journaldufreenaute.fr/nombre-maximal-de-threads-par-processus-sous-linux/
The parameter is not present in every man page version, and the default may have grown up to 12288 in later releases.
To be checked!
Problem :
I manage a dedicated server at OVH and upgraded my Debian from jessie to buster. The upgrade went quite well (or so it seemed...) and I rebooted.
The server came back unreachable; fortunately OVH rescue mode allowed me to log in.
I checked the error logs and first lost myself in RAID error messages, but it was simpler than that.
Solution :
I checked the /etc/network/interfaces file: it was OK.
I checked the log files, cleaned up, rebooted, checked again: still OK, except that the network was unreachable for named.
I finally remembered that Debian switched to systemd in recent releases, so I tried to create the systemd networking files by hand: too complicated, it did not work.
In rescue mode you access your files through a mount point, so the usual commands like systemctl do not work.
The solution was to chroot into a shell:
- mkdir /mnt/md2
- mount /dev/md2 /mnt/md2
- chroot /mnt/md2 bash
- systemctl enable networking
And it worked...
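For reference, if systemctl complains inside the chroot, a fuller sequence bind-mounts the kernel filesystems first (standard chroot practice; I did not need it here):
mkdir -p /mnt/md2
mount /dev/md2 /mnt/md2
mount --bind /dev /mnt/md2/dev
mount --bind /proc /mnt/md2/proc
mount --bind /sys /mnt/md2/sys
chroot /mnt/md2 bash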
Now I have to check all the other services to be sure that everything is working...
Beginning with:
sudo apt-get update
sudo apt-get clean
sudo apt-get autoremove
sudo apt-get update && sudo apt-get upgrade
sudo dpkg --configure -a
Problem :
I used robertdebock/ansible-role-tomcat to install a Tomcat instance with Ansible. It worked well until I deployed an application on it; then the java process hung at 100% system CPU.
Starting Tomcat by hand as the tomcat user, without systemd, worked correctly.
Solution :
I suspected:
- SELinux
- Linux limits
- slow VM I/O
But after a while I ran strace from startup (see the sketch below):
- by modifying the systemd configuration
- by modifying the catalina.sh configuration
All I got was a simple futex wait...
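The systemd variant looked roughly like this (unit name and paths are examples, not the exact role output):
# drop-in override, e.g. via systemctl edit tomcat.service
[Service]
ExecStart=
ExecStart=/usr/bin/strace -f -o /tmp/tomcat.strace /opt/tomcat/bin/catalina.sh run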
And then I read the manual; it is as simple as:
strace -f -e trace=all -p <PID>
No need to trace from startup, and by default not everything is traced...
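To cut the noise down to file-related syscalls only (standard strace filtering, which is what exposed the loop below):
strace -f -e trace=file -p <PID>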
After that it was easy: the process was reading recursively:
/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/8156...
Just fixing the working_directory in the Ansible role, and everything works.
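In plain systemd terms, the fix boils down to giving the service a sane working directory; a minimal sketch (the path is an example):
[Service]
WorkingDirectory=/opt/tomcat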
Issue reported here.
Problem :
We were testing a Spring Boot application in AWS with an ELB in front.
After a while of load-testing, the application hung:
- HTTP 504 error code from the JMeter client
- HTTP 502 if we raised the ELB timeout
- once logged on the server:
- telnet localhost 8080 was OK
- sending GET / on this socket got no response
- plenty of CLOSE_WAIT sockets (see the ss check after this list)
- wget was also hanging (as expected)
- the connection was established during the wget hang
- nothing in the logs
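A quick way to count the CLOSE_WAIT pile-up (standard ss usage):
ss -tan state close-wait | wc -l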
Solution :
I initially thought about Tomcat's keepAlive timeout and connection pool, but:
- Spring Boot copies the connectionTimeout parameter to keepAliveTimeout
- new sockets were still accepted and established
- the CLOSE_WAIT sockets weren't shut down even after an hour
Running the test many times, I finally saw a classic "Too many open files" in the log. That is also why I could not see any more logs during the hang.
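You can watch how close the process is to its limit with standard /proc inspection:
ls /proc/<PID>/fd | wc -l              # current open file descriptors
grep 'open files' /proc/<PID>/limits   # the limit actually applied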
So we changed the nproc and nofile values in /etc/security/limits.conf.
And tadaaa! Nothing changed in:
cat /proc/<PID>/limits
Thanks to blogs all over the world like this one:
- the service is started by systemd, and limits.conf only applies to PAM sessions
- to override resource limits with systemd:
[Service]
...
LimitNOFILE=500000
LimitNPROC=500000
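To apply it (the unit name is a placeholder for your service):
systemctl edit myapp.service       # paste the [Service] override above
systemctl daemon-reload
systemctl restart myapp.service
cat /proc/$(pidof java)/limits     # verify the new limits are in effect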
Last but not least, the Tomcat NIO socket queue alone is around 10000, plus other files and other processes... choose your limit wisely.