HOAB

History of a bug

Tomcat, NIO, Hanging and CLOSE_WAIT

Written by gorki

Problem :

We were testing a Spring Boot application in AWS with an ELB in front.

After a while of load testing, the application was hanging :

  • HTTP 504 error code from the JMeter client
  • HTTP 502 if we raised the ELB timeout
  • Once logged in on the server :
    • telnet localhost 8080 was OK
    • sending GET / on this socket got no response
    • plenty of sockets in CLOSE_WAIT
    • wget was also hanging (as expected)
    • the connection was established while wget was hanging
    • nothing in the log
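When diagnosing this kind of hang, it helps to count the stuck sockets directly. A minimal sketch with `ss` (add a port filter such as `'( sport = :8080 )'` to match the application port; `netstat -tan | grep CLOSE_WAIT` works too on older boxes):

```shell
# Count sockets currently stuck in CLOSE_WAIT on this host
ss -tan state close-wait | tail -n +2 | wc -l
```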

 

Solution :

 

I initially thought about Tomcat's keepAlive timeout and connection pool, but :

  1. Spring Boot copies the connectionTimeout parameter to keepAliveTimeout
  2. new sockets were accepted and established
  3. the CLOSE_WAIT sockets were still there after an hour

Repeating the test, I finally saw a classic "Too many open files" in the log. That's why I could not see any more logs during the hang.

So we changed nproc and nofile in /etc/security/limits.conf

And taadaaaa! Nothing changed in :

cat /proc/$PID/limits

Thanks to blogs all over the world like this one :

  • the service is started with systemd
  • to override resource limits with systemd :
[Service]
...
LimitNOFILE=500000
LimitNPROC=500000

Last but not least, the Tomcat NIO socket queue alone is around 10000 file descriptors, plus other files and other processes... choose your limit wisely.
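After raising the limits, verify them on the live process rather than trusting the config files. A sketch, checking the current shell as a stand-in for the service's main PID (for a real service, substitute the PID from `systemctl show -p MainPID --value <service>`; the drop-in path and service name would be something like /etc/systemd/system/myapp.service.d/limits.conf):

```shell
# Max open files / max processes actually applied to a running process;
# $$ (the current shell) stands in for the service's MainPID.
grep -E 'Max (open files|processes)' /proc/$$/limits
```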

SSH remote connection in cron is refused

Written by gorki

Problem :

I was creating a simple cron job to connect from remote-server-1 to remote-server-2.

Testing the job with a direct call or with run-parts was OK :

# direct call to my script
/home/admin/myscript.sh

# or with run-parts
run-parts -v --test /etc/cron.hourly

But when called from cron I got : Permission denied (publickey).

Solution :

First, I tried to reproduce the problem in the cron environment with a command line (extracted from there).

I finally reproduced the problem.

So I added the -vvv option to my ssh connection to get more details : still not enough clues, permission was refused.

Then I decided to compare with my ssh connection from the bash command line :

remote-server-1@myuser > ssh -vvv remote-server-2

What a surprise :

- it used my personal key to connect to remote-server-2 instead of remote-server-1's key!

- my personal key is deployed on both remote-server-1 and remote-server-2

So when I ran the connection interactively it worked because it used my personal key, but when run from the cron environment it used remote-server-1's key, and that one was not declared on remote-server-2.

SSH will use your connection key in priority when trying to connect to another server...
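One way to make cron behave like the interactive session is to pin the key explicitly in the job. A sketch, assuming a hypothetical key path; `ssh -G` prints the configuration ssh would resolve, without opening a connection, so you can check it from any shell:

```shell
# Force a specific key and ignore any agent/forwarded keys (path and
# host below are examples, not from the original post):
#   ssh -o IdentitiesOnly=yes -i /home/admin/.ssh/cron_key remote-server-2
# Inspect the resolved configuration without connecting:
ssh -G -o IdentitiesOnly=yes example.com | grep -i '^identitiesonly'
# prints: identitiesonly yes
```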


PEER_DNS=no on Debian, or how to prevent a specific DHCP interface from updating the DNS

Written by gorki

Problem :

On Debian, stop some interfaces from updating resolv.conf (DNS) when we have multiple DHCP network interfaces.

Solution :

A first link : Never update resolv.conf with DHCP client

But we don't want to never update it; we want to update it only sometimes...

On the Red Hat family it's simple (see the previous link) : PEERDNS=NO on the right interfaces

On the Debian family... let's use the hook as suggested :

Create a hook to prevent updates of the /etc/resolv.conf file

You need to create the file /etc/dhcp3/dhclient-enter-hooks.d/nodnsupdate under Debian / Ubuntu Linux:
# vi /etc/dhcp3/dhclient-enter-hooks.d/nodnsupdate
Append the following code:

#!/bin/sh
make_resolv_conf() {
    :
}

OK, but this hook prevents ALL interfaces from updating resolv.conf. The idea :

  1. in the hook, test the interface name
  2. if it is an authorized one, call the original make_resolv_conf
  3. otherwise do nothing

In bash it's not easy to have multiple functions with the same name, but thanks StackOverflow! :

#!/bin/bash


# copies the function named $1 to the name $2
copy_function() {
    declare -F "$1" > /dev/null || return 1
    eval "$(echo "${2}()"; declare -f "${1}" | tail -n +2)"
}

# Import the original make_resolv_conf
# Normally useless, hooks are called after the make_resolv_conf declaration
# . /sbin/dhclient-script

copy_function make_resolv_conf original_make_resolv_conf

make_resolv_conf() {
        if [ "${interface}" = "authorizedInterface" ] ; then
                original_make_resolv_conf
        fi
}

Update :

The previous solution does not work... declare is not known to sh/dash, and the script is run by sh/dash, so the function copy is not possible.

Ideas :

  • copy the body of make_resolv_conf into this file as original_make_resolv_conf : it works, but it's ugly because security patches to the original would not be picked up
  • use 2 hooks : one on enter that saves resolv.conf, one on exit that restores resolv.conf if ${interface} is not authorized
  • try to extract make_resolv_conf from /sbin/dhclient-script : not so easy...

The best solution is the two hooks. It's a pity :) I liked copy_function :) :

# vi /etc/dhcp3/dhclient-enter-hooks.d/selectdns-enter

#!/bin/sh

cp /etc/resolv.conf /tmp/resolv.conf.${interface}

# vi /etc/dhcp3/dhclient-exit-hooks.d/selectdns-exit

#!/bin/sh

if [ "${interface}" != "authorizedInterface" ] ; then
       echo "${interface} not authorized"
       cp /tmp/resolv.conf.${interface} /etc/resolv.conf
fi
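The save/restore flow of the two hooks can be exercised in isolation. A minimal sketch with scratch files standing in for /etc/resolv.conf (the interface name is an example; in real life ${interface} is set by dhclient-script before the hooks run):

```shell
interface=eth0
RESOLV=$(mktemp)                               # stands in for /etc/resolv.conf
echo 'nameserver 192.0.2.1' > "$RESOLV"
cp "$RESOLV" /tmp/resolv.conf.${interface}     # enter hook: save
echo 'nameserver 203.0.113.9' > "$RESOLV"      # dhclient overwrites it
if [ "${interface}" != "authorizedInterface" ] ; then
    cp /tmp/resolv.conf.${interface} "$RESOLV" # exit hook: restore
fi
cat "$RESOLV"                                  # back to the saved nameserver
```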

 

Bash and SSH completion with Include directive

Written by gorki

Problem :

I use bash as my shell and usually autocompletion (with bash-completion) works well.

Until I created some Include files...

Solution :

Not a final one for everyone, but a quick workaround is :

  1. put your Include files in a directory, for example : ~/.ssh/config.d
  2. add those config files to the bash_completion configuration.
sudo vi /usr/share/bash-completion/bash_completion

# Add your directory to the config file list

for i in /etc/ssh/ssh_config ~/.ssh/config ~/.ssh/config.d/* ~/.ssh2/config; do
    [[ -r $i ]] && config+=( "$i" )
done
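For reference, this is the layout the workaround assumes. A sketch with example host names; the demo writes under a scratch directory instead of the real ~/.ssh :

```shell
SSHDIR=$(mktemp -d)                 # stands in for ~/.ssh
mkdir -p "$SSHDIR/config.d"
cat > "$SSHDIR/config.d/example" <<'EOF'
Host example-host
    HostName example.com
    User admin
EOF
printf 'Include config.d/*\n' > "$SSHDIR/config"
cat "$SSHDIR/config" "$SSHDIR/config.d/example"
```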

That's all folks.

Cannot process volume group "xxx" on boot

Written by gorki

Problem :

I'm playing with VMs at the moment (copying, cloning, etc...) and while tuning things I amused myself renaming volume groups... including the volume group hosting the "/" partition :)

And after that it no longer boots... curiously...

failed to connect to lvmetad

volume group "xxx" not found

Cannot process volume group "xxx" on boot

We end up at the initramfs prompt, which points to the probable cause : boot args, cat /proc/cmdline

And indeed, in this case, the boot image points to the old name of the volume group. Nothing surprising.

Still, I struggled a bit for something super simple to fix.

Solution :

I looked for rescue disks, including supergrub2-iso, which did not help me (I could not get that one to boot).

In short, the very simple solution (because I have grub2!), at boot time :

  • boot a first time until you reach the initramfs prompt
  • list the logical volumes : ls /dev/mapper
  • note the name of the logical volume (the one containing -root in general)
  • reboot the VM
  • in the grub menu, press e to edit the selected entry
  • find the line containing /dev/mapper/<old logical volume name>
  • fix the name
  • ctrl-x

It should boot.

To fix it completely :

  1. Fix the /etc/fstab file if it still references the old logical volume
  2. Check the initramfs disk :
update-initramfs -u

# If warnings are displayed, check the configuration in : 
cd /etc/initramfs-tools
# especially :
/etc/initramfs-tools/conf.d/resume
  3. Then ask grub to update itself :
update-grub
grub-install /dev/sda
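The permanent fix above can be sketched end to end. The VG names and sed pattern are examples, and the demo edits a scratch copy rather than the real /etc/fstab :

```shell
# Example: the volume group oldvg was renamed to newvg; fix fstab references.
FSTAB=$(mktemp)                     # stands in for /etc/fstab
echo '/dev/mapper/oldvg-root / ext4 errors=remount-ro 0 1' > "$FSTAB"
sed -i 's/oldvg/newvg/g' "$FSTAB"
cat "$FSTAB"
# then, on the real system (as root):
#   update-initramfs -u
#   update-grub
#   grub-install /dev/sda
```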

 
