I am in the same situation as the original poster.

torque-6.1.2 installation problem: how do I bring nodes out of the down state?

qterm -t quick

pbs_server

pbsnodes -a

After running the commands above, the compute nodes still show state = down.

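The firewall is disabled, the configuration looks correct, I can ssh between the nodes, and the services are running on every node, but the problem persists.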
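A working ssh login only shows that port 22 is reachable. Torque uses its own ports: 15002 and 15003 for pbs_mom (as reported by qnodes in this post) and, by default, 15001 for pbs_server. Below is a minimal sketch for probing those ports. check_port is a hypothetical helper name, and it assumes that bash (with /dev/tcp support) and the coreutils timeout command are installed.

```shell
# Hypothetical helper: print "open" if a TCP connection to host:port succeeds
# within 3 seconds, otherwise "closed". Relies on bash's /dev/tcp redirection.
check_port() {
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Example use from the head node (pbs_mom ports on a compute node):
#   check_port calnode02 15002
#   check_port calnode02 15003
# Example use from a compute node (default pbs_server port, an assumption):
#   check_port calserver 15001
```

If a port reports "closed" while the daemon is running, the cause is usually a filter between the hosts or the daemon listening on a different address, but the sketch alone cannot tell those apart.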

Head node:

[root@calserver calserver]# for i in pbs_server pbs_sched pbs_mom trqauthd; do service $i start; done

Starting pbs_server (via systemctl):                       [  OK  ]

Starting pbs_sched (via systemctl):                        [  OK  ]

Starting pbs_mom (via systemctl):                          [  OK  ]

Starting trqauthd (via systemctl):                         [  OK  ]

[root@calserver calserver]#  ps -ef | grep pbs

root       1160      1  0 01:18 ?        00:00:00 /usr/local/torque/sbin/pbs_server -F -d /var/spool/torque

root       3566      1  0 01:20 ?        00:00:00 /usr/local/torque/sbin/pbs_sched -d /var/spool/torque

root       3593      1  0 01:20 ?        00:00:00 /usr/local/torque/sbin/pbs_mom -F -d /var/spool/torque

root       3659   3428  0 01:21 pts/0    00:00:00 grep --color=auto pbs

[root@calserver calserver]# qnodes

calserver

state = free

power_state = Running

np = 16

ntype = cluster

status = opsys=linux,uname=Linux calserver 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64,sessions=1593 2113 2237 2247 2501 3135 3185 3240,nsessions=8,nusers=2,idletime=256,totmem=5960692kb,availmem=4875732kb,physmem=3863544kb,ncpus=16,loadave=0.18,gres=,netload=89393,state=free,varattr= ,cpuclock=Fixed,macaddr=00:0c:29:a0:9b:d2,version=6.1.2,rectime=1540660913,jobs=

mom_service_port = 15002

mom_manager_port = 15003

calnode02

state = down

power_state = Running

np = 4

ntype = cluster

mom_service_port = 15002

mom_manager_port = 15003

calnode03

state = down

power_state = Running

np = 12

ntype = cluster

mom_service_port = 15002

mom_manager_port = 15003
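To pull just the down nodes out of a long listing like the one above, a small filter can help. This is a sketch only: down_nodes is a hypothetical helper that reads qnodes or pbsnodes -a output on stdin, and it assumes node names appear on a line of their own with attributes in "name = value" form, as in this post.

```shell
# Hypothetical helper: print the names of nodes whose "state" attribute
# contains "down". Reads pbsnodes/qnodes style output from stdin.
down_nodes() {
  awk '
    # A single-word line with no "=" is treated as a node name.
    NF == 1 && $0 !~ /=/ { node = $1; next }
    # Match the "state" attribute only, not "power_state".
    /^[[:space:]]*state = / && $3 ~ /down/ { print node }
  '
}

# Example use on the head node:
#   pbsnodes -a | down_nodes
```

For the listing in this post the filter would be expected to report calnode02 and calnode03. pbsnodes -l gives a similar summary without any extra scripting.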

pbs_server log on the head node:

[root@calserver calserver]# systemctl status pbs_server.service -l

● pbs_server.service - TORQUE pbs_server daemon

Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)

Active: active (running) since Sun 2018-10-28 01:18:08 CST; 35min ago

Main PID: 1160 (pbs_server)

Tasks: 12

Memory: 1.6M

CGroup: /system.slice/pbs_server.service

└─1160 /usr/local/torque/sbin/pbs_server -F -d /var/spool/torque

Oct 28 01:18:08 calserver systemd[1]: Starting TORQUE pbs_server daemon...

Oct 28 01:18:08 calserver PBS_Server[1160]: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = -2] [addr = 127.0.0.1:15003]

Oct 28 01:18:08 calserver PBS_Server[1160]: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy to host calserver:15003

Oct 28 01:18:08 calserver PBS_Server[1160]: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.10.102:15003]

Oct 28 01:18:08 calserver PBS_Server[1160]: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy to host calnode02:15003

Oct 28 01:18:08 calserver PBS_Server[1160]: LOG_ERROR::tcp_connect_sockaddr, Failed when trying to open tcp connection - connect() failed [rc = 15096] [addr = 192.168.10.103:15003]

Oct 28 01:18:08 calserver PBS_Server[1160]: LOG_ERROR::sendHierarchyToNode, Could not send mom hierarchy to host calnode03:15003

Oct 28 01:28:09 calserver pbs_server[1160]: Assertion failed, bad pointer in link: file "req_select.c", line 401

Oct 28 01:38:09 calserver pbs_server[1160]: Assertion failed, bad pointer in link: file "req_select.c", line 401

Oct 28 01:48:09 calserver pbs_server[1160]: Assertion failed, bad pointer in link: file "req_select.c", line 401
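Two things stand out in this log. The connection for calserver goes to 127.0.0.1, which suggests the head node's own hostname resolves to the loopback address. All three failures are also stamped 01:18:08, before the pbs_mom daemons shown in this post had started (01:18:50 on calnode02, 01:20 on the head node), so they may partly be a start-order effect. The sketch below collects the distinct endpoints the server failed to reach. failed_endpoints is a hypothetical helper, and the pattern is taken only from the log lines quoted here.

```shell
# Hypothetical helper: read pbs_server log lines on stdin and print each
# distinct "ip:port" that appears in an "[addr = ip:port]" field.
failed_endpoints() {
  grep -o 'addr = [0-9.]*:[0-9]*' | sed 's/^addr = //' | sort -u
}

# Example use on the head node:
#   journalctl -u pbs_server.service | failed_endpoints
```

Running it again after restarting pbs_server, once every pbs_mom is up, would show whether the failures were only a matter of timing.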

Compute node:

[root@calnode02 ~]# systemctl status pbs_mom.service -l

● pbs_mom.service - TORQUE pbs_mom daemon

Loaded: loaded (/usr/lib/systemd/system/pbs_mom.service; enabled; vendor preset: disabled)

Active: active (running) since Sun 2018-10-28 01:18:50 CST; 10min ago

Main PID: 1041 (pbs_mom)

Tasks: 11

Memory: 101.8M

CGroup: /system.slice/pbs_mom.service

└─1041 /usr/local/torque/sbin/pbs_mom -F -d /var/spool/torque

Oct 28 01:29:05 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update

Oct 28 01:29:05 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 154 MOM status update intervals

Oct 28 01:29:09 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update

Oct 28 01:29:09 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 155 MOM status update intervals

Oct 28 01:29:14 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update

Oct 28 01:29:14 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 156 MOM status update intervals

Oct 28 01:29:18 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update

Oct 28 01:29:18 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 157 MOM status update intervals

Oct 28 01:29:22 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update

Oct 28 01:29:22 calnode02 pbs_mom[1041]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 158 MOM status update intervals
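Since pbs_mom reports that it cannot contact any server, it is worth checking which server name it is actually using and what that name resolves to on the compute node. The sketch below assumes the Torque home directory is /var/spool/torque (taken from the -d option in the process listing) and that the usual server_name and mom_priv/config files are in use. mom_servers is a hypothetical helper name.

```shell
# Hypothetical helper: print the host given on each $pbsserver line
# of a pbs_mom config file passed as the first argument.
mom_servers() {
  awk '$1 == "$pbsserver" { print $2 }' "$1"
}

# Example use on a compute node (paths are assumptions, see above):
#   cat /var/spool/torque/server_name
#   mom_servers /var/spool/torque/mom_priv/config
#   getent hosts calserver
```

If the name printed here differs from the head node's hostname, or getent resolves it to 127.0.0.1 or to an unexpected address, that would be consistent with the repeated "Could not contact any of the servers" messages.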

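Installation guides I followed:

Installing MS7 and Torque on CentOS 6.5, i.e. building an MS compute cluster (original post) - First Principles - MS - 小*虫 forum  http://muchong.com/t-9836836-1-authorid-1192095

Installing multi-node Torque on CentOS 7 - u012460749's blog - CSDN blog  https://blog.csdn.net/u012460749/article/details/78583026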
