I manage a dedicated server at OVH and upgraded my Debian from jessie to buster. The upgrade went quite well (it seemed...) and I tried to restart.
The server came back from the reboot unreachable; fortunately, OVH rescue mode allowed me to log in.
I checked the error logs and at first lost myself in RAID error messages, but it was simpler than that.
I checked the /etc/network/interfaces file: it was OK.
I checked the log files: clean. Rebooted, checked again: still OK, except that the network was unreachable for named.
I finally remembered that Debian switched to systemd in recent releases, so I tried to create the systemd networking files manually: too complicated, it was not working.
In rescue mode, you access your files through a mount point, so usual commands such as systemctl do not work directly.
The solution was to chroot into a shell:
- mkdir /mnt/md2
- mount /dev/md2 /mnt/md2
- chroot /mnt/md2 bash
- systemctl enable networking
And it worked...
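If systemctl complains inside the chroot, bind-mounting the virtual filesystems before entering it usually helps (a sketch, to run from the rescue shell):

mount --bind /dev /mnt/md2/dev
mount --bind /proc /mnt/md2/proc
mount --bind /sys /mnt/md2/sys

Note that systemctl enable itself works fine in a chroot: it only creates symlinks under /etc/systemd/system and does not need to talk to a running systemd.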
Now I have to check all the other systems to be sure that everything is working...
Beginning with:
sudo apt-get update
sudo apt-get clean
sudo apt-get autoremove
sudo apt-get update && sudo apt-get upgrade
sudo dpkg --configure -a
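To spot what the upgrade may have broken, two standard commands give a quick overview:

# list systemd units that failed to start
systemctl --failed
# list packages left in a half-configured state
dpkg --audit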
I used robertdebock/ansible-role-tomcat to install a Tomcat instance with Ansible. It worked well until I deployed an application on it; then the java process hung at 100% system CPU.
Starting Tomcat as the tomcat user, without systemd, worked correctly.
I suspected:
- Linux limits
- slow VM I/O
But after a while, I ran strace from startup:
- by modifying the systemd configuration
- by modifying the catalina.sh configuration
All I got was a simple futex wait...
And then I read the manual; it was as simple as:
strace -f -e trace=all -p <PID>
No need to trace from startup, and by default, not everything is traced...
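For this kind of problem, attaching to the running process with the file-related syscall filter is enough to see what it is reading (<PID> is the java process):

# -f follows threads, %file selects all file-related syscalls
strace -f -e trace=%file -p <PID>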
After that, the diagnosis was easy: the process was recursively reading its working directory.
Just fixing the working_directory in the Ansible role, and everything worked.
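For the record, the equivalent fix at the systemd level would be a WorkingDirectory override; a sketch, where tomcat.service and /opt/tomcat are illustrative names for my setup:

# /etc/systemd/system/tomcat.service.d/override.conf
[Service]
WorkingDirectory=/opt/tomcat

followed by a systemctl daemon-reload and a service restart.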
Issue reported here.
We were testing a Spring Boot application in AWS with an ELB in front.
After a while of load testing, the application was hanging:
- HTTP 504 error codes on the JMeter client
- HTTP 502 if we raised the ELB timeout
- once logged in on the server:
- telnet localhost 8080 was OK
- sending GET / on this socket got no response
- plenty of CLOSE_WAIT sockets
- wget was also hanging (as expected)
- the connection was established while wget was hanging
- nothing in the logs
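A few standard commands to reproduce these observations, assuming a single java process on the box:

# count sockets stuck in CLOSE_WAIT
ss -tan state close-wait | wc -l
# file descriptors currently held by the java process
ls /proc/$(pidof java)/fd | wc -l
# the limit that applies to it
grep 'open files' /proc/$(pidof java)/limits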
I initially thought about Tomcat's keep-alive timeout and thread pool, but:
- Spring Boot copies the connectionTimeout parameter to keepAliveTimeout
- new sockets were accepted and established
- the CLOSE_WAIT sockets weren't shut down even after an hour
Running the test many times, I finally saw a classic "Too many open files" in the log. That's also why I could not see more logs during the hang.
So we changed nproc and nofile in /etc/security/limits.conf.
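The kind of change we made; the tomcat user and the values are just examples:

# /etc/security/limits.conf
tomcat soft nofile 65536
tomcat hard nofile 65536
tomcat soft nproc 4096
tomcat hard nproc 4096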
And tadaaa! Nothing changed.
Thanks to blogs all over the world like this one:
- the service is started by systemd, which does not read /etc/security/limits.conf
- to override resource limits with systemd, you need a unit-level override (a minimal sketch follows this list)
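A minimal sketch of such an override, again with tomcat.service as an illustrative unit name (LimitNOFILE and LimitNPROC are the systemd counterparts of nofile and nproc):

mkdir -p /etc/systemd/system/tomcat.service.d
cat > /etc/systemd/system/tomcat.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=65536
LimitNPROC=4096
EOF
systemctl daemon-reload
systemctl restart tomcat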
Last but not least: the Tomcat NIO socket queue alone is around 10000, plus other files, plus other processes... choose your limit wisely.