Convert almost all blocking system calls to non-blocking to avoid
hanging the server when a mom dies. This is based on the now-classic
CPlant fault tolerance patch, but heavily modified from that original.

---

 pbs-2.3.12-pw/README.cplant               |  171 ++++++++++++++++++++++++++++++
 pbs-2.3.12-pw/src/include/pbs_config.h.in |    5 
 pbs-2.3.12-pw/src/include/pbs_nodes.h     |    4 
 pbs-2.3.12-pw/src/lib/Libifl/nonblock.c   |   45 +++++++
 pbs-2.3.12-pw/src/lib/Libnet/net_client.c |   50 +++++++-
 pbs-2.3.12-pw/src/lib/Libpbs/Makefile.in  |    2 
 pbs-2.3.12-pw/src/mom_rcp/rcp.c           |    1 
 pbs-2.3.12-pw/src/resmom/mom_inter.c      |   17 ++
 pbs-2.3.12-pw/src/server/node_func.c      |   54 +++++++++
 pbs-2.3.12-pw/src/server/node_manager.c   |   15 +-
 pbs-2.3.12-pw/src/server/run_sched.c      |    8 +
 pbs-2.3.12-pw/src/server/svr_connect.c    |    9 +
 12 files changed, 362 insertions(+), 19 deletions(-)

diff -puN /dev/null README.cplant
--- /dev/null	2003-09-15 09:02:32.000000000 -0400
+++ pbs-2.3.12-pw/README.cplant	2004-04-17 10:07:45.000000000 -0400
@@ -0,0 +1,171 @@
+Original README.cplant, not all of this applies. --pw 26 Dec 01
+
+
+Cplant Fault Recovery Patch for PBS
+September, 2000
+Lee Ann Fisk, lafisk@sandia.gov
+
+What this is:
+------------
+The patch file that creates PBS for Cplant is quite large. The
+cplantFRpatch file in this directory is a subset of that patch
+file. It contains fault recovery code only. It would be
+applicable to non-Cplant sites as well as Cplant sites.
+
+How to patch PBS:
+-----------------
+The patch file was built from Open PBS v 2.2, patch level 8. It
+will probably patch later revisions without trouble. If not, the
+patches are simple and you should be able to read the patch file
+and patch your code manually.
+
+We build PBS on Linux/Alpha machines. We have thousands, running
+everything from Red Hat 5.1 to Red Hat 6.1. The patches just use
+libc functions and will most likely build and run with the desired
+result on other systems as well.
+
+To patch the PBS source, cd to the top of your PBS source tree
+(where "src" and "doc" and "configure" are) and (assuming the
+patch file is here too) :
+
+    patch -N -p1 -l < cplantFRpatch
+
+(I'm using "patch" version 2.5, Larry Wall, Free Software Foundation.)
+
+The new code is ifdef'd out. You need to define CPLANT_SERVICE_NODE
+and CPLANT_NONBLOCKING_CONNECTIONS to get the patches included when
+you compile. The two problems solved by these two enhancements are
+described below.
+
+Problem 1:
+----------
+
+The first problem is that every scheduling cycle, the server sends a
+list of MOMs to the scheduler (we use the FIFO scheduler). The scheduler
+tries to contact each MOM to get resource information so it can make
+an intelligent scheduling decision. If the MOM or the MOM's node
+is no longer talking, the scheduler hangs for three minutes (or
+whatever number of seconds it's "-a" argument specified) and then
+takes an alarm and exits.
+
+The patches ifdef'd with CPLANT_SERVICE_NODE make it far less likely
+that the server will hand the MOM a bad node. It can still happen,
+but the window between when the server tests the state of the MOM
+node and when the server hands the scheduler a list of MOMs is greatly
+reduced.
+
+This message I sent to the PBS users list explains the details:
+
+- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+From lafisk Thu Apr 27 09:13:43 2000
+Subject: Re: [PBS-USERS] machine crash cause PBS to cease op
+To: tim.leight@evsx.com (Timothy S. Leight)
+Date: Thu, 27 Apr 2000 09:13:43 -0600 (MDT)
+Cc: berend@growthnetworks.com (Berend Ozceri),
+        hender@pbspro.com (Bob Henderson), pbs-users@pbspro.com ('PBS Users')
+In-Reply-To: <3908474B.88DEF804@evsx.com> from "Timothy S. Leight" at Apr 27, 2000 01:57
+:31 PM
+X-Mailer: ELM [version 2.5 PL2]
+Content-Length: 2693
+Status: OR
+
+I greatly reduced the likelihood of the scheduler getting a
+bad node from the server with these three changes to ping_nodes()
+in server/node_manager.c. (The server normally pings nodes every
+5 minutes, and only if they are in an unknown state or some other
+routine in the server marked them as needing a ping. And it
+doesn't ping nodes it believes are running a job.)
+
+Remove this code:
+
+    if (np->nd_state & (INUSE_JOB|INUSE_JOBSHARE)) {
+        if (!(np->nd_state & INUSE_NEEDS_HELLO_PING))
+            continue;
+    }
+
+It causes the server to skip nodes that are running a job.
+
+Replace the NEEDS_HELLO check like this:
+
+#ifdef CPLANT_SERVICE_NODE
+    /*
+    ** In our environment, nodes are down until proven otherwise
+    */
+    com = IS_HELLO;
+    np->nd_state |= INUSE_DOWN;
+#else
+    if (np->nd_state & INUSE_NEEDS_HELLO_PING)
+        com = IS_HELLO;
+#endif
+
+The IS_HELLO requires an acknowledgement from the node and the state
+of the node is set to DOWN until we hear from it.
+
+And set the ping interval to your taste. We are pinging all nodes
+every 2 minutes:
+
+#ifdef CPLANT_SERVICE_NODE
+    /*
+    ** Let's try a ping every 2 minutes.
+    */
+    i = 120;
+#else
+    i = 300; /* relaxed ping rate for normal run */
+#endif
+
+There is still a window of time where a node can crash after the
+server has pinged it and before the scheduler is invoked. But this
+rarely happens now. I haven't seen a scheduler toolong alarm in
+quite a while now.
+
+Also, ping nodes uses datagram sockets so it doesn't hang like
+the connections made for qstat.
+- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
+
+
+Problem 2:
+----------
+The second problem is that the server hangs if it tries to contact
+a MOM (or a scheduler) on a dead node. The solution implemented here
+is to use non-blocking sockets and timeout with an error.
+
+This code is ifdef'd CPLANT_NONBLOCKING_SOCKETS.
+
+These are the affected files:
+
+lib/Libnet/net_client.c - In client_to_svr() open non-blocking sockets,
+    wait 5 seconds for the connection, and return PBS_NET_RC_RETRY
+    if connection times out.
+
+include/pbs_config.h.in - Redefine read() and write() to check EAGAIN.
+    pbs_config.h is conveniently included in every file.
+
+server/node_func.c - New function bad_node_warning() writes a
+    warning to server's log file if MOM or scheduler can't be reached. It
+    writes no more than once per hour per node. It also uses set_task
+    to schedule a trip to the ping_nodes function. ping_nodes will
+    discover the node is down and set the appropriate status fields for
+    the node. For this to work you need CPLANT_SERVICE_NODE defined
+    (that's Problem 1) so that ping_nodes will be sure to ping the node.
+
+    New function addr_ok() tests if a node is down or OK.
+
+pbs_nodes.h - Add a field in struct pbsnode that notes at what time
+    the last warning was written to the log file that the node is down.
+
+server/run_sched.c - In contact_sched(), test addr_ok() before contacting
+    the scheduler. Return EHOSTDOWN if it's not OK. If it is
+    OK, and if connection to scheduler fails, call bad_node_warning().
+
+server/svr_connect.c - In svr_connect(), test addr_ok() before contacting
+    a MOM. Return EHOSTDOWN if !addr_ok(). If socket connection to MOM
+    fails, call bad_node_warning().
+
+That's all it takes.
+=============================================================================
+Lee Ann Fisk                                   Phone: 505-844-2059
+Scalable Computing Systems Department (9223)   FAX: 505-845-7442
+Sandia National Labs, Mail Stop 1110           Email: lafisk@mp.sandia.gov
+Albuquerque, NM 87185-1110                     http://www.cs.sandia.gov/cplant
+=============================================================================
+
diff -puN src/include/pbs_config.h.in~fault-tolerance src/include/pbs_config.h.in
--- pbs-2.3.12/src/include/pbs_config.h.in~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/include/pbs_config.h.in	2004-04-17 10:05:55.000000000 -0400
@@ -233,4 +233,9 @@
 /* Define if you have the socket library (-lsocket). */
 #undef HAVE_LIBSOCKET
 
+#ifndef NEED_BLOCKING_CONNECTIONS
+#define write(a,b,c) write_nonblocking_socket(a,b,c)
+#define read(a,b,c) read_nonblocking_socket(a,b,c)
+#endif
+
 #endif /* _PBS_CONFIG_H_ */
diff -puN src/include/pbs_nodes.h~fault-tolerance src/include/pbs_nodes.h
--- pbs-2.3.12/src/include/pbs_nodes.h~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/include/pbs_nodes.h	2004-04-17 10:05:55.000000000 -0400
@@ -132,6 +132,7 @@ struct pbsnode {
 	unsigned short nd_state;
 	unsigned short nd_ntype; /* node type */
 	short nd_order; /* order of user's request */
+	time_t nd_warnbad;
 };
 
 
@@ -204,6 +205,9 @@ extern struct tree_t *streams;
 
 extern int update_nodes_file A_(());
 
+extern void bad_node_warning(pbs_net_t addr);
+extern int addr_ok(pbs_net_t addr);
+
 #ifdef BATCH_REQUEST_H
 extern void initialize_pbssubn A_((struct pbsnode *, struct pbssubn*, struct prop*));
 extern void effective_node_delete A_((struct pbsnode *));
diff -puN /dev/null src/lib/Libifl/nonblock.c
--- /dev/null	2003-09-15 09:02:32.000000000 -0400
+++ pbs-2.3.12-pw/src/lib/Libifl/nonblock.c	2004-04-17 10:09:30.000000000 -0400
@@ -0,0 +1,45 @@
+/*
+ * Defns of nonblocking read,write.
+ * Headers redefine read/write to name these instead, before inclusion
+ * of stdio.h, so system declaration is used.
+ */
+#include
+#include
+
+/*
+ * Assumes full-block read/write. No accounting for partial blocks,
+ * this would have had to be handled by the main pbs code anyway.
+ */
+ssize_t
+write_nonblocking_socket(int fd, const void *buf, ssize_t count)
+{
+    ssize_t i;
+
+    for (;;) {
+        i = write(fd, buf, count);
+        if (i >= 0) return i;
+        if (errno != EAGAIN) return i;
+    }
+}
+
+ssize_t
+read_nonblocking_socket(int fd, void *buf, ssize_t count)
+{
+    ssize_t i;
+
+    for (;;) {
+        i = read(fd, buf, count);
+        if (i >= 0) return i;
+        if (errno != EAGAIN) return i;
+    }
+}
+
+/*
+ * Call the real read, for things that want to block.
+ */
+ssize_t
+read_blocking_socket(int fd, void *buf, ssize_t count)
+{
+    return read(fd, buf, count);
+}
+
diff -puN src/lib/Libnet/net_client.c~fault-tolerance src/lib/Libnet/net_client.c
--- pbs-2.3.12/src/lib/Libnet/net_client.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/lib/Libnet/net_client.c	2004-04-17 10:10:07.000000000 -0400
@@ -89,6 +89,41 @@
 
 static char ident[] = "@(#) $RCSfile: fault-tolerance.patch,v $ $Revision: 1.1 $";
 
+#include
+#include
+#include
+
+/*
+** wait for connect to complete. We use non-blocking sockets,
+** so have to wait for completion this way.
+*/
+static int
+await_connect(int timeout, int sockd)
+{
+    fd_set fs;
+    int n, val, len, rc;
+    struct timeval tv;
+
+    tv.tv_sec = (__time_t) timeout;
+    tv.tv_usec = 0;
+
+    FD_ZERO(&fs);
+    FD_SET(sockd, &fs);
+
+    if ((n = select(FD_SETSIZE, 0, &fs, 0, &tv)) != 1)
+        return -1;
+
+    len = sizeof(int);
+    rc = getsockopt(sockd, SOL_SOCKET, SO_ERROR, &val, &len);
+
+    if ((rc==0) && (val==0))
+        return 0;
+    else {
+        errno=val;
+        return -1;
+    }
+}
+
 /*
  * client_to_svr - connect to a server
  *
@@ -104,7 +139,6 @@ static char ident[] = "@(#) $RCSfile: ne
  * hosts with the same port. Let the caller keep the addresses arround
  * rather than look it up each time.
  */
-
 int client_to_svr(hostaddr, port, local_port)
 	pbs_net_t hostaddr; /* Internet addr of host */
 	unsigned int port; /* port to which to connect */
@@ -114,6 +148,7 @@ int client_to_svr(hostaddr, port, local_
 	struct sockaddr_in remote;
 	int sock;
 	unsigned short tryport;
+	int flags;
 
 	local.sin_family = AF_INET;
 	local.sin_addr.s_addr = 0;
@@ -128,6 +163,10 @@ int client_to_svr(hostaddr, port, local_
 		return (PBS_NET_RC_RETRY);
 	}
 
+	flags = fcntl(sock, F_GETFL);
+	flags |= O_NONBLOCK;
+	fcntl(sock, F_SETFL, flags);
+
 	/* If local privilege port requested, bind to one */
 	/* Must be root privileged to do this */
 
@@ -153,11 +192,14 @@ int client_to_svr(hostaddr, port, local_
 	remote.sin_addr.s_addr = htonl(hostaddr);
 	remote.sin_port = htons((unsigned short)port);
 	remote.sin_family = AF_INET;
-	if (connect(sock, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
+	if (connect(sock, (struct sockaddr *)&remote, sizeof(remote)) < 0)
 		switch (errno) {
 		case EINTR:
 		case EADDRINUSE:
 		case ETIMEDOUT:
+		case EINPROGRESS:
+			if (await_connect(5, sock) == 0)
+				break;
 		case ECONNREFUSED:
 			close(sock);
 			return (PBS_NET_RC_RETRY);
@@ -166,7 +208,5 @@ int client_to_svr(hostaddr, port, local_
 			return (PBS_NET_RC_FATAL);
 		}
 
-	} else {
-		return (sock);
-	}
+	return (sock);
 }
diff -puN src/lib/Libpbs/Makefile.in~fault-tolerance src/lib/Libpbs/Makefile.in
--- pbs-2.3.12/src/lib/Libpbs/Makefile.in~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/lib/Libpbs/Makefile.in	2004-04-17 10:05:55.000000000 -0400
@@ -111,7 +111,7 @@ OBJS1 = PBS_data.o PBS_attr.o get_svr
 	enc_Manage.o enc_MsgJob.o enc_MoveJob.o enc_QueueJob.o enc_Reg.o \
 	enc_ReqExt.o enc_ReqHdr.o enc_RunJob.o enc_Shut.o enc_Sig.o \
 	enc_Status.o enc_Track.o enc_attrl.o enc_attropl.o enc_reply.o \
-	enc_svrattrl.o tcp_dis.o tm.o rpp.o
+	enc_svrattrl.o tcp_dis.o tm.o rpp.o nonblock.o
 
 DISOBJS = PBSD_jcred.o PBSD_manager.o PBSD_status.o pbsD_alterjo.o \
 	pbsD_asyrun.o pbsD_connect.o pbsD_deljob.o pbsD_holdjob.o \
diff -puN src/mom_rcp/rcp.c~fault-tolerance src/mom_rcp/rcp.c
--- pbs-2.3.12/src/mom_rcp/rcp.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/mom_rcp/rcp.c	2004-04-17 10:05:55.000000000 -0400
@@ -41,6 +41,7 @@ static char copyright[] =
 static char sccsid[] = "@(#)rcp.c	8.2 (Berkeley) 4/2/94";
 #endif /* not lint */
 
+#define NEED_BLOCKING_CONNECTIONS
 #include <pbs_config.h> /* the master config generated by configure */
 
 #include
diff -puN src/resmom/mom_inter.c~fault-tolerance src/resmom/mom_inter.c
--- pbs-2.3.12/src/resmom/mom_inter.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/resmom/mom_inter.c	2004-04-17 10:05:55.000000000 -0400
@@ -101,6 +101,7 @@
 #include "portability.h"
 #include "pbs_ifl.h"
 #include "server_limits.h"
+#include "net_connect.h"
 
 static char ident[] = "@(#) $RCSfile: fault-tolerance.patch,v $ $Revision: 1.1 $";
 
@@ -243,7 +244,6 @@ int setwinsize(pty)
 	return (0);
 }
 
-
 /*
  * reader process - reads from the remote socket, and writes
  * to the master pty
@@ -252,12 +252,13 @@ int mom_reader(s, ptc)
 	int s;
 	int ptc;
 {
+	extern ssize_t read_blocking_socket(int fd, void *buf, ssize_t count);
 	char buf[1024];
 	int c;
 
 	/* read from the socket, and write to ptc */
 	while (mom_reader_go) {
-		c = read(s, buf, sizeof(buf));
+		c = read_blocking_socket(s, buf, sizeof(buf));
 		if (c > 0) {
 			int wc;
 			char *p = buf;
@@ -336,8 +337,18 @@ int conn_qsub(hostname, port)
 	long port;
 {
 	pbs_net_t hostaddr;
+	int s;
 
 	if ((hostaddr = get_hostaddr(hostname)) == (pbs_net_t)0)
 		return (-1);
-	return (client_to_svr(hostaddr, (unsigned int)port, 0));
+	s = client_to_svr(hostaddr, (unsigned int)port, 0);
+
+	/* this one should be blocking */
+	if (s >= 0) {
+		int flags = fcntl(s, F_GETFL);
+		flags &= ~O_NONBLOCK;
+		fcntl(s, F_SETFL, flags);
+	}
+
+	return s;
 }
diff -puN src/server/node_func.c~fault-tolerance src/server/node_func.c
--- pbs-2.3.12/src/server/node_func.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/server/node_func.c	2004-04-17 10:10:30.000000000 -0400
@@ -155,7 +155,60 @@ extern char *path_nodestate;
  * create_pbs_node - create basic node structure for adding a node
  */
 
+#include
+extern void ping_nodes(struct work_task *ptask);
 
+void
+bad_node_warning(pbs_net_t addr)
+{
+    int i;
+    time_t now, last;
+
+    for (i=0; i<svr_totnodes; i++) {
+        if (pbsndlist[i]->nd_addrs[0] == addr) {
+            now = time(0);
+            last = pbsndlist[i]->nd_warnbad;
+
+            if (last && (now - last < 3600))
+                return;
+
+            /*
+            ** once per hour, log a warning that we can't reach the node, and
+            ** ping_nodes to check and reset the node's state.
+            */
+
+            sprintf(log_buffer, "!!! unable to contact node %s !!!",
+                pbsndlist[i]->nd_name);
+            log_event(PBSEVENT_ADMIN, PBS_EVENTCLASS_SERVER, "WARNING",
+                log_buffer);
+
+            (void) set_task(WORK_Timed, now+5, ping_nodes, NULL);
+            pbsndlist[i]->nd_warnbad = now;
+            break;
+        }
+    }
+}
+
+/*
+ * Returns 1 if node is OK, 0 if node is down. Since this function is
+ * also called for inter-server communication for, e.g. qmove, a node
+ * is presumed good if it is not in the pbsndlist list.
+ */
+int
+addr_ok(pbs_net_t addr)
+{
+    int i, status = 1;
+
+    if (pbsndlist)
+        for (i=0; i<svr_totnodes; i++)
+            if (pbsndlist[i]->nd_addrs[0] == addr
+              && (pbsndlist[i]->nd_state & (INUSE_DOWN|INUSE_OFFLINE|INUSE_DELETED|INUSE_UNKNOWN))) {
+                status = 0;
+                break;
+            }
+
+    return status;
+}
 
 /*
  * find_nodehbyname() - find a node host by its name
@@ -407,6 +460,7 @@ static void initialize_pbsnode (pnode, p
 	pnode->nd_first = init_prop(pnode->nd_name);
 	pnode->nd_last = pnode->nd_first;
 	pnode->nd_nprops = 0;
+	pnode->nd_warnbad = 0;
 }
 
 /*
diff -puN src/server/node_manager.c~fault-tolerance src/server/node_manager.c
--- pbs-2.3.12/src/server/node_manager.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/server/node_manager.c	2004-04-17 10:05:55.000000000 -0400
@@ -354,11 +354,6 @@ ping_nodes(ptask)
 		if (np->nd_state & (INUSE_DELETED|INUSE_OFFLINE))
 			continue;
 
-		if (np->nd_state & (INUSE_JOB|INUSE_JOBSHARE)) {
-			if (!(np->nd_state & INUSE_NEEDS_HELLO_PING))
-				continue;
-		}
-
 		if (np->nd_stream < 0) {
 			np->nd_stream = rpp_open(np->nd_name, pbs_rm_port);
 			np->nd_state |= INUSE_DOWN;
@@ -377,8 +372,12 @@ ping_nodes(ptask)
 
 		com = IS_NULL;
 		DBPRT(("%s: ping %s\n", id, np->nd_name))
-		if (np->nd_state & INUSE_NEEDS_HELLO_PING)
-			com = IS_HELLO;
+
+		/*
+		 * In our environment, nodes are down until proven otherwise
+		 */
+		com = IS_HELLO;
+		np->nd_state |= INUSE_DOWN;
 
 		ret = is_compose(np->nd_stream, com);
 		if (ret == DIS_SUCCESS) {
@@ -403,7 +402,7 @@ ping_nodes(ptask)
 		if (server_init_type == RECOV_HOT)
 			i = 15; /* rapid ping rate while hot restart */
 		else
-			i = 300; /* relaxed ping rate for normal run */
+			i = 120;
 		(void)set_task(WORK_Timed, time_now+i, ping_nodes, NULL);
 	}
 }
diff -puN src/server/run_sched.c~fault-tolerance src/server/run_sched.c
--- pbs-2.3.12/src/server/run_sched.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/server/run_sched.c	2004-04-17 10:05:55.000000000 -0400
@@ -121,10 +121,16 @@ static int contact_sched(cmd)
 	char *myid = "contact_sched";
 
 	/* connect to the Scheduler */
-
+#if 0 /* don't check if scheduler runs on same node as server */
+	if (!addr_ok(pbs_scheduler_addr)) {
+		pbs_errno = EHOSTDOWN;
+		return -1;
+	}
+#endif
 	sock = client_to_svr(pbs_scheduler_addr, pbs_scheduler_port, 1);
 
 	if (sock < 0) {
+		bad_node_warning(pbs_scheduler_addr);
 		log_err(errno, myid, msg_sched_nocall);
 		return (-1);
 	}
diff -puN src/server/svr_connect.c~fault-tolerance src/server/svr_connect.c
--- pbs-2.3.12/src/server/svr_connect.c~fault-tolerance	2004-04-17 10:05:55.000000000 -0400
+++ pbs-2.3.12-pw/src/server/svr_connect.c	2004-04-17 10:05:55.000000000 -0400
@@ -101,6 +101,7 @@
 
 #include <pbs_config.h> /* the master config generated by configure */
 
+#include
 #include
 #include "libpbs.h"
 #include "server_limits.h"
@@ -136,8 +137,13 @@ int svr_connect(hostaddr, port, func, cn
 
 	/* obtain the connection to the other server */
 
+	if (!addr_ok(hostaddr)) {
+		pbs_errno = EHOSTDOWN;
+		return (PBS_NET_RC_RETRY);
+	}
 	sock = client_to_svr(hostaddr, port, 1);
 	if (sock < 0) {
+		bad_node_warning(hostaddr);
 		pbs_errno = errno;
 		return (sock); /* PBS_NET_RC_RETRY or PBS_NET_RC_FATAL */
 	}
@@ -183,7 +189,8 @@ void svr_disconnect(handle)
 	if ( (encode_DIS_ReqHdr(sock, PBS_BATCH_Disconnect, pbs_current_user) == 0) && (DIS_tcp_wflush(sock) == 0) ) {
 		/* wait for other server to close connection */
 		while (1) {
-			if (read(sock, &x, 1) < 1)
+			/* don't call the non-blocking function */
+			if (__read(sock, &x, 1) < 1)
 				break;
 		}
 	}
_