package: x2gobroker-agent version: 0.0.4.0-0~1038~ubuntu16.04.1 priority: bug
I don't have a "0" value anymore since latest fixes so the loadchecker process don't crash anymore but still, there is something strange. Here is a fragment of my loadchecker logs from this morning. Just to give you the context, I have 22 servers which are all automaticaly started at 6 AM (wake on lan) and they are absolutely the same (blade servers with same CPU, memory amount, bios version, ...). I checked our monitoring to see if users were correctly distributed over the farm and at 7:30AM, I had about 7 or 8 users connected but 4 of them were on tce-server-21 where I should have had 1 user on 8 servers. Here is the loadchecker log fragment :
2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809; numCPU:16; typeCPU:2400;
DISCLAIMER: This e-mail is private and confidential and may contain proprietary or legally privileged information. It is for the intended recipient only. If you have received this email in error, please notify the author by replying to it and then destroy it. If you are not the intended recipient you must not use, disclose, distribute, copy, print or rely on this e-mail or any attachment. Thank you
Hi Walid,
On Mo 13 Aug 2018 12:34:00 CEST, Walid MOGHRABI wrote:
package: x2gobroker-agent version: 0.0.4.0-0~1038~ubuntu16.04.1 priority: bug
I don't have a "0" value anymore since latest fixes so the
loadchecker process don't crash anymore but still, there is
something strange. Here is a fragment of my loadchecker logs from this morning. Just to give you the context, I have 22 servers which are all
automaticaly started at 6 AM (wake on lan) and they are absolutely
the same (blade servers with same CPU, memory amount, bios version,
...). I checked our monitoring to see if users were correctly distributed
over the farm and at 7:30AM, I had about 7 or 8 users connected but
4 of them were on tce-server-21 where I should have had 1 user on 8
servers.
Have you seen this issues more often? Does it hop from one server to
another or occur on more than one server at a time?
Here is the loadchecker log fragment :
root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;'
/var/log/x2gobroker/loadchecker.log ... 2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810;
numCPU:16; typeCPU:2400;2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812;
numCPU:16; typeCPU:2400;2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;
numCPU:16; typeCPU:2400;2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;
numCPU:16; typeCPU:2400;2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;
numCPU:16; typeCPU:2400;2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;
numCPU:16; typeCPU:2400;2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;
numCPU:16; typeCPU:2400;2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;
numCPU:16; typeCPU:2400;
The log message "Broker agent answered:" comes directly from X2Go
Broker Agent. It is basically its raw output.
This means, that the flaw must be in x2gobroker-agent.pl on the remote
X2Go Server. Or that the loadchecker stops querying the broker agent
and re-uses old data.
Looking at x2gobroker-agent.pl: If we focus on the loadavgXX for now,
we come to the conclusion, that the load was really "0" or it was
negative (both gives us a loadavgXX value of "1". The value should
normally be greater (system load of 1.0 brings a loadavgXX of 100).
Looking at x2gobroker.agent.py: As the values always change slightly,
we can't say that Python provides us the same return result string all
the time. The query to the broker agent must have happened.
We need to do more debugging if this issue reoccurs:
run '/usr/lib/x2go/x2gobroker-agent foo checkload' on the
affected X2Go Server
and see if the reported values match with what the load checker sees.
check if it is reoccuring on the same X2Go Server
if /usr/lib/x2go/x2gobroker-agent returns a load of zero, look at /proc/loadavg
and /proc/sys/vm/min_free_kbytes, /proc/meminfo /proc/cpuinfo
... and report all back here...
As you can see, there is only 1 server with a loadavgXX = 1 (which
means that in fact, we got a zero value from the broker agent). This is not normal, at 7:34, there were 4 users already connected to
this server and most of my other servers were empty.
Restarting x2gobroker-loadchecker service fixed the issue.
Considering the above analysis that the issue must come from
x2gobroker-agent.pl, a restart of the loadchecker can in theory not
solve such an issue.
Can you see the x2gobroker-agent.pl process appear and disappear in
the process list on the remote X2Go Server? Or does it stay open, even
zombied?
I think there is a problem in retrieving this informations ... even
memAvail seem strange on this server to me ... with 4 connected
users, it should have been lower than that.
Hmmm... Ok... Maybe the wrong server got tested? Two identical IPs on
the subnet?
I also think the number of connected users should be taken into
account when calculating the load factor (maybe this is already the
case, not sure about that).
Yes, we take the number of sessions into account. But that is not
provided by the broker agent, but is available in the X2Go Server
database and queried from there.
DAS-NETZWERKTEAM mike gabriel, herweg 7, 24357 fleckeby mobile: +49 (1520) 1976 148 landline: +49 (4354) 8390 139
GnuPG Fingerprint: 9BFB AEE8 6C0A A5FF BF22 0782 9AF4 6B30 2577 1B31 mail: mike.gabriel@das-netzwerkteam.de, http://das-netzwerkteam.de
Control: close -1
On Do 13 Sep 2018 15:19:19 CEST, Mike Gabriel wrote:
Hi Walid,
On Mo 13 Aug 2018 12:34:00 CEST, Walid MOGHRABI wrote:
package: x2gobroker-agent version: 0.0.4.0-0~1038~ubuntu16.04.1 priority: bug
I don't have a "0" value anymore since latest fixes so the
loadchecker process don't crash anymore but still, there is
something strange. Here is a fragment of my loadchecker logs from this morning. Just to give you the context, I have 22 servers which are all
automaticaly started at 6 AM (wake on lan) and they are absolutely
the same (blade servers with same CPU, memory amount, bios version,
...). I checked our monitoring to see if users were correctly distributed
over the farm and at 7:30AM, I had about 7 or 8 users connected but
4 of them were on tce-server-21 where I should have had 1 user on 8
servers.Have you seen this issues more often? Does it hop from one server to
another or occur on more than one server at a time?Here is the loadchecker log fragment :
root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;'
/var/log/x2gobroker/loadchecker.log ... 2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810;
numCPU:16; typeCPU:2400;2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812;
numCPU:16; typeCPU:2400;2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;
numCPU:16; typeCPU:2400;2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;
numCPU:16; typeCPU:2400;2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;
numCPU:16; typeCPU:2400;2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;
numCPU:16; typeCPU:2400;2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;
numCPU:16; typeCPU:2400;2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent
command on remote host tce-server-21 (10.50.0.221): sh -c
'/usr/lib/x2go/x2gobroker-agent foo checkload' 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent
answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;
numCPU:16; typeCPU:2400;The log message "Broker agent answered:" comes directly from X2Go
Broker Agent. It is basically its raw output.This means, that the flaw must be in x2gobroker-agent.pl on the
remote X2Go Server. Or that the loadchecker stops querying the
broker agent and re-uses old data.Looking at x2gobroker-agent.pl: If we focus on the loadavgXX for
now, we come to the conclusion, that the load was really "0" or it
was negative (both gives us a loadavgXX value of "1". The value
should normally be greater (system load of 1.0 brings a loadavgXX of
100).Looking at x2gobroker.agent.py: As the values always change
slightly, we can't say that Python provides us the same return
result string all the time. The query to the broker agent must have
happened.We need to do more debugging if this issue reoccurs:
run '/usr/lib/x2go/x2gobroker-agent foo checkload' on the
affected X2Go Server and see if the reported values match with what the load checker sees.check if it is reoccuring on the same X2Go Server
if /usr/lib/x2go/x2gobroker-agent returns a load of zero, look at /proc/loadavg
and /proc/sys/vm/min_free_kbytes, /proc/meminfo /proc/cpuinfo
... and report all back here...
As you can see, there is only 1 server with a loadavgXX = 1 (which
means that in fact, we got a zero value from the broker agent). This is not normal, at 7:34, there were 4 users already connected
to this server and most of my other servers were empty.Restarting x2gobroker-loadchecker service fixed the issue.
Considering the above analysis that the issue must come from
x2gobroker-agent.pl, a restart of the loadchecker can in theory not
solve such an issue.Can you see the x2gobroker-agent.pl process appear and disappear in
the process list on the remote X2Go Server? Or does it stay open,
even zombied?I think there is a problem in retrieving this informations ... even
memAvail seem strange on this server to me ... with 4 connected
users, it should have been lower than that.Hmmm... Ok... Maybe the wrong server got tested? Two identical IPs
on the subnet?I also think the number of connected users should be taken into
account when calculating the load factor (maybe this is already the
case, not sure about that).Yes, we take the number of sessions into account. But that is not
provided by the broker agent, but is available in the X2Go Server
database and queried from there.Mike
Request from Walid on IRC. Not an issue anymore.
Thus, closing...
DAS-NETZWERKTEAM mike gabriel, herweg 7, 24357 fleckeby mobile: +49 (1520) 1976 148 landline: +49 (4354) 8390 139
GnuPG Fingerprint: 9BFB AEE8 6C0A A5FF BF22 0782 9AF4 6B30 2577 1B31 mail: mike.gabriel@das-netzwerkteam.de, http://das-netzwerkteam.de
Processing control commands:
close -1 Bug #1313 [x2gobroker-agent] there is still a problem in getting a correct value for loadavgXX with loadchecker Marked Bug as done
-- 1313: http://bugs.x2go.org/cgi-bin/bugreport.cgi?bug=1313 X2Go Bug Tracking System Contact owner@bugs.x2go.org with problems