HOAB

History of a bug

Upgrade debian et lost network

Rédigé par gorki Aucun commentaire

Problem :

I manage a dedicated server in OVH and I upgrade my debian from jessie to buster. Upgrade works quite well (it seems...) and I try to restart.

Server reboot fails as unreachable, fortunately OVH rescue mode allows me to login.

I check error log and first lost myself in RAID error message, but it was more simple than that.

Solution :

I check the /etc/network/interfaces file, it was OK

I check the logs files, clean, reboot, check again, still OK except that network was unreachable for named.

I finally remember that Debian switch to systemD in latest version so I tried to create system networking file manually : too complicate, it was not working.

In rescue mode, you can access your files as a mounted point so usual commands as systemctl does not work.

The solution was to chroot a shell :

  1. mkdir /mnt/md2
  2. mount /dev/md2 /mnt/md2
  3. chroot /mnt/md2 bash
  4. systemctl enable networking

And it works...

Now I have to check all other system to be sure that everything is working...

Begining with :

sudo apt-get update

sudo apt-get clean

sudo apt-get autoremove

sudo apt-get update && sudo apt-get upgrade

sudo dpkg --configure -a

 

SystemD and tomcat hang on startup

Rédigé par gorki Aucun commentaire

Problem :

I used robertdebock/ansible-role-tomcat to install a Tomcat instance using Ansible. Works well until I deploy an application on it. Then java process hangs with 100% system CPU.

Starting with tomcat users without system work correctly.

Solution :

I suspected :

  • SELinux
  • Linux limits
  • VM slow I/O

But after a while I ran strace :

  • by modifying systemd configuration
  • by modifying catalina.sh configuration

All I have was a simple FUTEX wait...

And then I read the manual, as simple as :

strace -f -e trace=all -p <PID>

No need to trace from startup and by default, not all is traced...

After that, easy way, the process was reading recursively :

/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/81569/cwd/proc/self/task/8156...

Just fixing the working_directory in the ansible role, and all is working.

Issue reported here.

 

Tomcat, NIO, Hanging et CLOSE_WAIT

Rédigé par gorki Aucun commentaire

Problem :

We are testing a springboot application in AWS with ELB in front.

After a while of load-testing, the application was hanging :

  • HTTP 504 error code from Jmeter client
  • HTTP 502 if we raise ELB timeout
  • Once logged on the server :
    • telnet localhost 8080 was OK
    • sending GET / on this socket was not responding
    • plenty of CLOSE_WAIT socket
    • wget was also hanging (normal)
    • connection was established during wget hang
    • nothing in the log

 

Solution :

 

I initially think about the keepAlive timeout and pool of tomcat but

  1. SpringBoot copy the connectionTimeout parameter to keepAliveTimeout
  2. new socket is accepted and established
  3. CLOSE_WAIT wasn't shutdown after hour

Doing the test many times, I finally so a classical "Too many open files" in the log. That's why I could not see more log during the hang.

So we change the nproc and nofile in /etc/security/limits.conf

And taadaaaa ! Nothing change in :

cat /proc/<$PID>/limits

Thanks to blogs over the world like this one :

  • the service is start with systemd
  • to override ressources limits with systemd :
[Service]
...
LimitNOFILE=500000
LimitNPROC=500000

At last but not least, the value of Tomcat NIO socket queue is around 10000 + other files + other process... choose wisely your limit

Fil RSS des articles de ce mot clé