Solaris 8 brandz getppid hack
Sun has always said that any application written for Solaris 6 will be forwards compatible with any later version. It’s a big claim, but as long as the application adheres to the API, it’s possible.
Occasionally you bump into things that don’t adhere to the API (quite a lot of applications, really). This is usually caused by developers thinking that they should do things “this way”, because Sun does it oh so wrong all the time. Consequently, you end up with legacy applications that can’t run on Solaris 10.
Enter Solaris 8 brandz zones. Being able to run legacy applications in a Solaris 8 brandz zone generally gets you around most of the issues in the remaining apps that do stupid things. However, there are a couple of quirks with a zone that can introduce odd behaviour in an application.
Progress, for example.
The issue
An old legacy Progress database was sitting on an old legacy server which was so EOL that it would make even Bill cringe… or laugh… Staying on Solaris 6 was not an option, but running it on anything else (even Solaris 9) would be adventurous to say the least. Can I say legacy again?
The solution
So first try a brandz Solaris 8. Configured, installed, and booted a brand spanking new zone. Copy the data over. Attempt a start of the Progress database.
10:03:21 BROKER 0: Multi-user session begin. (333)
10:03:24 BROKER 0: Begin Physical Redo Phase at 13312 . (5326)
10:03:30 BROKER 0: Physical Redo Phase Completed at blk 13743 off 2457 upd 2636. (7161)
10:03:31 BROKER 0: Started for mfglive using TCP, pid 20533. (5644)
10:04:02 BROKER 0: SYSTEM ERROR: Unable to kill parent process, errno= 4. (1680)
10:04:02 BROKER 0: SYSTEM ERROR: The broker is exiting unexpectedly, beginning Abnormal Shutdown. (5292)
10:04:02 BROKER 0: drexit: Initiating Abnormal Shutdown
10:04:02 BROKER 0: ** Save file named core for analysis by Progress Software Corporation. (439)
10:04:02 BROKER 0: Begin ABNORMAL shutdown code 2 (2249)
10:04:04 BROKER : Removed shared memory with segment_id: 41
10:04:04 BROKER : Multi-user session end. (334)
Nada. Mmmmm. Seems to start up OK, but then crashes in a screaming heap a short while later.
Checking shared libraries seemed to be OK. Nothing missing or odd.
root % ldd -r /mfgpro/dlc91c/bin/_mprosrv
/usr/lib/secure/s8_preload.so.1
libsocket.so.1 => /usr/lib/libsocket.so.1
libnsl.so.1 => /usr/lib/libnsl.so.1
libintl.so.1 => /usr/lib/libintl.so.1
libdl.so.1 => /usr/lib/libdl.so.1
libm.so.1 => /usr/lib/libm.so.1
libthread.so.1 => /usr/lib/libthread.so.1
libc.so.1 => /usr/lib/libc.so.1
libmp.so.2 => /usr/lib/libmp.so.2
/usr/platform/SUNW,Sun-Fire-V490/lib/libc_psr.so.1
So I thought I’d truss (the old friend) a few processes to see what was going on, chucking the below in various scripts and comparing the working server to the non-working server. (This produces VERY NOISY output, but it’s great for diagnosing what’s going on.)
truss -feal -vall -xall -rall -wall -sall -mall -o /tmp/logfile exec_process
If you look at the appendices to this post, I’ve added the relevant sections of the truss. (Note that truss reports getppid() as getpid(), with the parent PID in square brackets.) You may notice an interesting thing. The working server roughly follows this process (in the snippet):
– Bind to 0.0.0.0 port 3500.
– fstat64() filehandles.
– Yadda, yadda, write output to STDOUT.
– getpid() — find parent process, (24012).
– kill –TERM 24012
– wait()
– Parent — ALRM signal
– Parent — die.
– Child — Context switch.
– getpid() — find parent process, (1). OK!
Now this is a little odd. It would seem that some developer thought that it’d be a damn good idea to re-invent the old “I’m going to fork a child process in the background” API calls, and chuck in their own.
They are checking the parent process, killing it, and rechecking to see if it’s now 1 (which is sysvinit); then it progresses on.
Now, the non-working server.
– Bind to 0.0.0.0 port 3500.
– fstat64() filehandles.
– Yadda, yadda, write output to STDOUT.
– getpid() — find parent process, (21267).
– kill –TERM 21267
– wait()
– Parent — ALRM signal <-IMPORTANT
– Parent — die.
– getpid() — find parent process, (1417). Not 1!
– kill –TERM 1417
– wait()
– Parent — ALRM signal <-IMPORTANT
– Parent — no die.
Ad nauseam.….
A quick process check:
UID PID PPID C STIME TTY TIME CMD
root 2260 1417 0 May 27 ? 0:09 /etc/init
root 1417 1417 0 May 27 ? 0:00 zsched
root 2960 1417 0 May 27 ? 0:00 /usr/lib/picl/picld
Hey! Of course you’re not gonna be able to kill zsched! You see, in a Solaris zone the ‘root’ process is called ‘zsched’, and its PID will never ever be 1 at all, and neither will init’s for that matter. So once the real parent dies, the orphaned broker sees zsched (1417 here) as its parent instead of 1. Not only that, you’re never going to be able to kill the zsched process from within the zone, as it’s protected within the global zone memory space.
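To make the failure mode concrete, here is a minimal sketch of the loop that the truss output suggests. It is a reconstruction and not Progress’s actual code: the names wait_until_orphaned, fake_host and simulate_startup are hypothetical, max_attempts is an arbitrary cap, and the process table is faked so the logic can be run anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callbacks, so the loop can be exercised without real processes. */
typedef pid_t (*ppid_fn)(void *state);
typedef void (*kill_fn)(void *state, pid_t target);

/*
 * Ask for the parent PID; while it is not 1, signal the parent and look again.
 * Returns the number of kill attempts made, or -1 if the parent never became 1
 * within max_attempts.
 */
int wait_until_orphaned(ppid_fn get_parent, kill_fn kill_parent,
                        void *state, int max_attempts)
{
    int attempts = 0;
    pid_t parent;

    assert(get_parent != NULL && kill_parent != NULL);
    while ((parent = get_parent(state)) != (pid_t) 1) {
        if (attempts == max_attempts) {
            return -1;
        }
        kill_parent(state, parent);
        attempts++;
    }
    return attempts;
}

/* Fake process table: who our parent is, and who adopts orphans. */
struct fake_host {
    pid_t parent;
    pid_t reaper;   /* 1 on a plain host, the zsched PID inside a zone */
};

static pid_t fake_getppid(void *state)
{
    return ((struct fake_host *) state)->parent;
}

static void fake_kill(void *state, pid_t target)
{
    struct fake_host *host = state;

    /* Killing the launcher orphans us; killing the reaper has no effect. */
    if (target == host->parent && target != host->reaper) {
        host->parent = host->reaper;
    }
}

/* Run the loop against a fake host with the given launcher and reaper PIDs. */
int simulate_startup(pid_t launcher, pid_t reaper, int max_attempts)
{
    struct fake_host host;

    host.parent = launcher;
    host.reaper = reaper;
    return wait_until_orphaned(fake_getppid, fake_kill, &host, max_attempts);
}
```

With a reaper of 1, simulate_startup(24012, 1, 30) finishes after a single kill. With zsched as the reaper, simulate_startup(21267, 1417, 30) can never see a parent of 1 and gives up, which matches the behaviour in the zone.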
Well, it’s too late to kick the developer up the bum. So time to hack.
The hack
Preloading shared libs has been around since I was a Mars Bar in my Dad’s top pocket (well, almost). So it should be a simple case of replacing any getppid() API call with my own. Simple enough, done it before.
See the appendix for the getpidhack.c file that I used. This is a shared lib that replaces both the getppid() and getpid() API calls (the latter for good measure, but not really needed in my case). It will return the PID 1 when _getppid() returns the same PID as the environment variable HACKPID.
Note that I call _getppid(), which allows me to fetch the REAL parent PID so I can return it when they don’t match.
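Because LD_PRELOAD is exported, every program started from that shell loads the library, including ones launched without HACKPID set. The sketch below shows one way the lookup could be hardened; parse_hackpid, remap_ppid and hackpid_examples are hypothetical names and are not part of the getpidhack.c I actually used.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Parse a PID from the text of an environment variable.
 * Returns -1 when the text is missing, empty, not a plain number, or not
 * positive, so the caller can fall back to the real parent PID.
 */
pid_t parse_hackpid(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0') {
        return (pid_t) -1;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0) {
        return (pid_t) -1;
    }
    return (pid_t) value;
}

/* Decide what getppid() should report: 1 if the real parent is zsched. */
pid_t remap_ppid(pid_t real_ppid, pid_t hack_pid)
{
    if (hack_pid > 0 && real_ppid == hack_pid) {
        return (pid_t) 1;
    }
    return real_ppid;
}

/* Usage examples, written as assertions. */
void hackpid_examples(void)
{
    assert(parse_hackpid("1417") == 1417);     /* a normal zsched PID */
    assert(parse_hackpid(NULL) == -1);         /* HACKPID not set */
    assert(remap_ppid(1417, 1417) == 1);       /* parent is zsched: report init */
    assert(remap_ppid(21267, 1417) == 21267);  /* any other parent: unchanged */
}
```

In the preload library, getppid() would then reduce to something like remap_ppid(_getppid(), parse_hackpid(getenv("HACKPID"))).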
So, to use it. First build it:
gcc -fPIC -c getpidhack.c
gld getpidhack.o -L/usr/lib -lc -G -assert pure-text -o /usr/local/lib/libgetpidhack.so
Set an environment variable called HACKPID to the PID of the currently running zsched process:
HACKPID=`ps -e | awk '/zsched/{print$1}'`
export HACKPID
Then set your LD_PRELOAD variable to point to the shared lib:
LD_PRELOAD="/usr/local/lib/libgetpidhack.so"
export LD_PRELOAD
After this the Progress database just started. Woot! Under a brandz zone, the DB can now run on hardware that it would never in its life have been able to run on otherwise. Another EOL server to chuck out the computer room window to see what happens.
BTW, if you are having issues with this, you can run a DTrace script to see whether you’re calling the real getppid() or the getpidhack version. Create the following script, and run it with one argument (the PID of your zsched). Every time you hit the getppid() API call it’ll register; this means that you are not replacing the API call and your shared lib preload isn’t working.
#!/usr/sbin/dtrace -s
syscall::getpid:return
/ppid == $1/
{
printf("%d %d\n", ppid, pid);
}
Appendix 1 — getpidhack.c
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

// The real libc implementations, reached through their underscore names.
extern pid_t _getppid(void);
extern pid_t _getpid(void);

// Read the zsched PID from HACKPID. Returns -1 if it is not set, so that
// programs started without HACKPID don't crash inside atol().
static pid_t gethackpid(void)
{
    const char *value = getenv("HACKPID");

    if (value == NULL)
    {
        return((pid_t) -1);
    }
    return((pid_t) atol(value));
}

pid_t getppid(void)
{
    // realppid - the real PID of the parent process.
    pid_t realppid = _getppid();

    // If the real parent is zsched, then we want to reset it to 1.
    if (realppid == gethackpid())
    {
        return((pid_t) 1);
    }
    // Else return whatever PID we were given.
    return(realppid);
}

pid_t getpid(void)
{
    // realpid - the real PID of this process.
    pid_t realpid = _getpid();

    // If the realpid equals the zsched PID, then we want to reset it to 1.
    if (realpid == gethackpid())
    {
        return((pid_t) 1);
    }
    // Else return whatever PID we were given.
    return(realpid);
}
Appendix 2 — The working server
24013/1: setsockopt(76, 65535, 4, 0xEFFFF7FC, 4) = 0
24013/1: setsockopt(76, 65535, 8, 0xEFFFF7FC, 4) = 0
24013/1: bind(76, 0x0015DCA8, 16) = 0
24013/1: name = 0.0.0.0/3500
24013/1: listen(76, 10) = 0
24013/1: getpid() = 24013 [24012]
24013/1: lseek(3, 457164, 0) = 457164
24013/1: read(3, 0xEFFFF66C, 81) = 81
24013/1: % L S t a r t e d f o r % s u s i n g % s , p i d %
24013/1: l . ( 5 6 4 4 )\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
24013/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
24013/1: time() = 1252542477
24013/1: write(9, 0xEFFFF154, 69) = 69
24013/1: 1 0 : 2 7 : 5 7 B R O K E R 0 : S t a r t e d f o r
24013/1: m f g l i v e u s i n g T C P , p i d 2 4 0 1 3 . ( 5
24013/1: 6 4 4 )\n
24013/1: fstat64(0, 0xEFFFF058) = 0
24013/1: d=0x01540000 i=17139 m=0020620 l=1 u=0 g=7 rdev=0x00600005
24013/1: at = Sep 10 10:27:46 EST 2009 [ 1252542466 ]
24013/1: mt = Sep 10 10:27:56 EST 2009 [ 1252542476 ]
24013/1: ct = Sep 10 09:39:31 EST 2009 [ 1252539571 ]
24013/1: bsz=8192 blks=0 fs=ufs
24013/1: fstat64(1, 0xEFFFF058) = 0
24013/1: d=0x01540000 i=17139 m=0020620 l=1 u=0 g=7 rdev=0x00600005
24013/1: at = Sep 10 10:27:46 EST 2009 [ 1252542466 ]
24013/1: mt = Sep 10 10:27:56 EST 2009 [ 1252542476 ]
24013/1: ct = Sep 10 09:39:31 EST 2009 [ 1252539571 ]
24013/1: bsz=8192 blks=0 fs=ufs
24013/1: fstat64(2, 0xEFFFF058) = 0
24013/1: d=0x01540000 i=17139 m=0020620 l=1 u=0 g=7 rdev=0x00600005
24013/1: at = Sep 10 10:27:46 EST 2009 [ 1252542466 ]
24013/1: mt = Sep 10 10:27:56 EST 2009 [ 1252542476 ]
24013/1: ct = Sep 10 09:39:31 EST 2009 [ 1252539571 ]
24013/1: bsz=8192 blks=0 fs=ufs
24013/1: write(1, 0xEFFFF154, 69) = 69
24013/1: 1 0 : 2 7 : 5 7 B R O K E R 0 : S t a r t e d f o r
24013/1: m f g l i v e u s i n g T C P , p i d 2 4 0 1 3 . ( 5
24013/1: 6 4 4 )\n
24013/1: pwrite64(7, 0xEFFFF908, 4, 0) = 4
24013/1: 0xEFFFF908: "\0\0\002"
24013/1: close(7) = 0
24013/1: getpid() = 24013 [24012]
24013/1: getpid() = 24013 [24012]
24013/1: kill(24012, 0x0000000F) = 0
24012/2: signotifywait() = 15
24013/1: lwp_alarm(1) = 0
24012/2: Incurred fault #11, FLTPAGE %pc = 0xEF5CDC80 addr = 0xEF670F48
24012/2: Incurred fault #11, FLTPAGE %pc = 0xEF64A3C8 addr = 0xEF66FA68
24012/2: lwp_sigredirect(1, 0x0000000F) = 0
24012/1: Received signal #15, SIGTERM, in wait() [caught]
24012/1: siginfo: SIGTERM pid=24013 uid=0
24012/1: wait() Err#4 EINTR
24012/1: sigprocmask(3, 0xEF667DF8, 0x00000000) = 0
24012/1: set = 0 0 0 0
24012/1: Incurred fault #11, FLTPAGE %pc = 0x0004593C addr = 0x001310F4
24012/1: Incurred fault #11, FLTPAGE %pc = 0x000459A8 addr = 0x00153AEC
24012/1: _exit(0)
24013/1: Received signal #14, SIGALRM, in pause() [caught]
24013/1: pause() Err#4 EINTR
24013/1: Incurred fault #11, FLTPAGE %pc = 0xEF5CDCE8 addr = 0xEF5CDCE8
24013/1: sigprocmask(3, 0xEF667DF8, 0x00000000) = 0
24013/1: set = 0 0 0 0
24013/1: context(1, 0xEFFFF240)
24013/1: getpid() = 24013 [1]
24013/1: close(0) = 0
24013/1: open(0xEFFFF920, 0) = 0
24013/1: 0xEFFFF920: "/qaddb/eblive/mfglive.lg"
24013/1: close(1) = 0
24013/1: open(0xEFFFF920, 0) = 1
24013/1: 0xEFFFF920: "/qaddb/eblive/mfglive.lg"
24013/1: close(2) = 0
24013/1: open(0xEFFFF920, 0) = 2
24013/1: 0xEFFFF920: "/qaddb/eblive/mfglive.lg"
24013/1: lseek(3, 342954, 0) = 342954
24013/1: read(3, 0xEFFFF570, 81) = 81
24013/1: % L P R O G R E S S V e r s i o n % s o n % s . ( 4 2
24013/1: 3 4 )\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
24013/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
24013/1: time() = 1252542478
24013/1: write(9, 0xEFFFF05C, 61) = 61
24013/1: 1 0 : 2 7 : 5 8 B R O K E R 0 : P R O G R E S S V e r
24013/1: s i o n 9 . 1 C o n S O L A R I S . ( 4 2 3 4 )\n
24013/1: fstat64(0, 0xEFFFEF60) = 0
24013/1: d=0x01540002 i=68332 m=0100644 l=1 u=0 g=1 sz=9719288
24013/1: at = Sep 10 03:01:28 EST 2009 [ 1252515688 ]
24013/1: mt = Sep 10 10:27:58 EST 2009 [ 1252542478 ]
24013/1: ct = Sep 10 10:27:58 EST 2009 [ 1252542478 ]
24013/1: bsz=8192 blks=19008 fs=ufs
24013/1: fstat64(1, 0xEFFFEF60) = 0
24013/1: d=0x01540002 i=68332 m=0100644 l=1 u=0 g=1 sz=9719288
24013/1: at = Sep 10 03:01:28 EST 2009 [ 1252515688 ]
24013/1: mt = Sep 10 10:27:58 EST 2009 [ 1252542478 ]
24013/1: ct = Sep 10 10:27:58 EST 2009 [ 1252542478 ]
24013/1: bsz=8192 blks=19008 fs=ufs
24013/1: fstat64(2, 0xEFFFEF60) = 0
24013/1: d=0x01540002 i=68332 m=0100644 l=1 u=0 g=1 sz=9719288
24013/1: at = Sep 10 03:01:28 EST 2009 [ 1252515688 ]
24013/1: mt = Sep 10 10:27:58 EST 2009 [ 1252542478 ]
24013/1: ct = Sep 10 10:27:58 EST 2009 [ 1252542478 ]
24013/1: bsz=8192 blks=19008 fs=ufs
24013/1: write(1, 0xEFFFF05C, 61) Err#9 EBADF
24013/1: 1 0 : 2 7 : 5 8 B R O K E R 0 : P R O G R E S S V e r
24013/1: s i o n 9 . 1 C o n S O L A R I S . ( 4 2 3 4 )\n
24013/1: lseek(3, 346761, 0) = 346761
24013/1: read(3, 0xEFFFF570, 81) = 81
24013/1: % L S e r v e r s t a r t e d b y % s o n % s . ( 4
24013/1: 2 8 1 )\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
24013/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
24013/1: time() = 1252542478
24013/1: write(9, 0xEFFFF05C, 65) = 65
24013/1: 1 0 : 2 7 : 5 8 B R O K E R 0 : S e r v e r s t a r t
24013/1: e d b y r o o t o n / d e v / p t s / 5 . ( 4 2 8 1)
24013/1: \n
Appendix 3 — The non-working server
21268/1: setsockopt(76, 65535, 4, 0xFFBFF9AC, 4, 1) = 0
21268/1: setsockopt(76, 65535, 8, 0xFFBFF9AC, 4, 1) = 0
21268/1: bind(76, 0x0015DCA8, 16, 3) = 0
21268/1: AF_INET name = 0.0.0.0 port = 3500
21268/1: listen(76, 10, 1) = 0
21268/1: getpid() = 21268 [21267]
21268/1: lseek(3, 457164, 0) = 457164
21268/1: read(3, 0xFFBFF81C, 81) = 81
21268/1: % L S t a r t e d f o r % s u s i n g % s , p i d %
21268/1: l . ( 5 6 4 4 )\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
21268/1: \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
21268/1: time() = 1252542489
21268/1: write(9, 0xFFBFF304, 69) = 69
21268/1: 1 0 : 2 8 : 0 9 B R O K E R 0 : S t a r t e d f o r
21268/1: m f g l i v e u s i n g T C P , p i d 2 1 2 6 8 . ( 5
21268/1: 6 4 4 )\n
21268/1: fstat64(0, 0xFFBFF208) = 0
21268/1: d=0x05980009 i=4426329 m=0020620 l=1 u=0 g=7 rdev=0x00600011
21268/1: at = Sep 10 10:27:57 EST 2009 [ 1252542477 ]
21268/1: mt = Sep 10 10:28:09 EST 2009 [ 1252542489 ]
21268/1: ct = Sep 10 09:44:02 EST 2009 [ 1252539842 ]
21268/1: bsz=8192 blks=0 fs=lofs
21268/1: fstat64(1, 0xFFBFF208) = 0
21268/1: d=0x05980009 i=4426329 m=0020620 l=1 u=0 g=7 rdev=0x00600011
21268/1: at = Sep 10 10:27:57 EST 2009 [ 1252542477 ]
21268/1: mt = Sep 10 10:28:09 EST 2009 [ 1252542489 ]
21268/1: ct = Sep 10 09:44:02 EST 2009 [ 1252539842 ]
21268/1: bsz=8192 blks=0 fs=lofs
21268/1: fstat64(2, 0xFFBFF208) = 0
21268/1: d=0x05980009 i=4426329 m=0020620 l=1 u=0 g=7 rdev=0x00600011
21268/1: at = Sep 10 10:27:57 EST 2009 [ 1252542477 ]
21268/1: mt = Sep 10 10:28:09 EST 2009 [ 1252542489 ]
21268/1: ct = Sep 10 09:44:02 EST 2009 [ 1252539842 ]
21268/1: bsz=8192 blks=0 fs=lofs
21268/1: write(1, 0xFFBFF304, 69) = 69
21268/1: 1 0 : 2 8 : 0 9 B R O K E R 0 : S t a r t e d f o r
21268/1: m f g l i v e u s i n g T C P , p i d 2 1 2 6 8 . ( 5
21268/1: 6 4 4 )\n
21268/1: pwrite64(7, 0xFFBFFAB8, 4, 0) = 4
21268/1: 0xFFBFFAB8: "\0\0\002"
21268/1: close(7) = 0
21268/1: getpid() = 21268 [21267]
21268/1: kill(21267, 0x0000000F) = 0
21267/1: Received signal #15, SIGTERM, in wait() [caught]
21267/1: siginfo: SIG#0
21268/1: Incurred fault #11, FLTPAGE %pc = 0x000C3048 addr = 0x58900137234
21267/1: wait() Err#4 EINTR
21268/1: alarm(1) = 0
21267/1: lwp_sigtimedwait(0xFFBFF570, 0xFFBFF498, 0x00000010) = 0
21267/1: sigmask = 0 0 0 0
21267/1: siginfo: SIG#0
21267/1: lwp_sigtimedwait(0xFFBFF488, 0xFFBFF570, 0x00000010) = 0
21267/1: sigmask = 0 0 0 0
21267/1: siginfo: SIG#0
21267/1: lwp_sigtimedwait(0xFFBFF484, 0xFFBFF2B8, 0x00000010) = 0
21267/1: sigmask = 0x00004000 0 0 0
21267/1: siginfo: SIG#16384
21267/1: lwp_sigtimedwait(0xFFBFF2A8, 0xFFBFF330, 0x00000010) = 0
21267/1: sigmask = 0x00004000 0 0 0
21267/1: siginfo: SIG#16384
21267/1: sigprocmask(3, 0xFFBFF330, 0x00000000) = 0
21267/1: set = 0x00004000 0 0 0
21267/1: Incurred fault #11, FLTPAGE %pc = 0x0004593C addr = 0x589001310F4
21267/1: Incurred fault #11, FLTPAGE %pc = 0x000459A8 addr = 0x58900153AEC
21267/1: _exit(0)
21268/1: Received signal #14, SIGALRM, in pause() [caught]
21268/1: siginfo: SIG#0
21268/1: pause() Err#4 EINTR
21268/1: lwp_sigtimedwait(0xFFBFF698, 0xFFBFF5C0, 0x00000010) = 0
21268/1: sigmask = 0 0 0 0
21268/1: siginfo: SIG#0
21268/1: lwp_sigtimedwait(0xFFBFF5B0, 0xFFBFF698, 0x00000010) = 0
21268/1: sigmask = 0 0 0 0
21268/1: siginfo: SIG#0
21268/1: lwp_sigtimedwait(0xFFBFF5AC, 0xFFBFF3E0, 0x00000010) = 0
21268/1: sigmask = 0x00002000 0 0 0
21268/1: siginfo: SIG#8192
21268/1: lwp_sigtimedwait(0xFFBFF3D0, 0xFFBFF458, 0x00000010) = 0
21268/1: sigmask = 0x00002000 0 0 0
21268/1: siginfo: SIG#8192
21268/1: sigprocmask(3, 0xFFBFF458, 0x00000000) = 0
21268/1: set = 0x00002000 0 0 0
21268/1: lwp_sigtimedwait(0xFF1D7C18, 0xFFBFF1B0, 0x00000010) = 0
21268/1: sigmask = 0xFFBFFEFF 0x00001FFF 0 0
21268/1: siginfo: SIG#-4194561
21268/1: lwp_sigtimedwait(0xFFBFF1A0, 0xFFBFF228, 0x00000010) = 0
21268/1: sigmask = 0xFFBFFEFF 0x0000FF1F 0 0
21268/1: siginfo: SIG#-4194561
21268/1: sigprocmask(3, 0xFFBFF228, 0xFFBFF238) = 0
21268/1: set = 0xFFBFFEFF 0x0000FF1F 0 0
21268/1: oset = 0x00002000 0 0 0
21268/1: lwp_sigtimedwait(0xFFBFF238, 0xFFBFF1B0, 0x00000010) = 0
21268/1: sigmask = 0x00002000 0 0 0
21268/1: siginfo: SIG#8192
21268/1: lwp_sigtimedwait(0xFFBFF1A0, 0xFFBFF360, 0x00000010) = 0
21268/1: sigmask = 0x00002000 0 0 0
21268/1: siginfo: SIG#8192
21268/1: lwp_park(1, 1, 1) = 0
21268/1: lwp_sigtimedwait(0xFFBFF370, 0xFFBFF088, 0x000001C0) = 0
21268/1: sigmask = 0x0000082F 0 0 0
21268/1: siginfo: SIG#2095
21268/1: lwp_sigtimedwait(0xFFBFF090, 0xFFBFEFF8, 0x00000010) = 0
21268/1: sigmask = 0 0 0 0
21268/1: siginfo: SIG#0
21268/1: lwp_sigtimedwait(0xFFBFEFE8, 0xFFBFF090, 0x00000010) = 0
21268/1: sigmask = 0 0 0 0
21268/1: siginfo: SIG#0
21268/1: context(1, 0xFFBFF088)
21268/1: getpid() = 21268 [1417]
21268/1: kill(1417, 0x0000000F) = 0
21268/1: alarm(1) = 0
21268/1: Received signal #14, SIGALRM, in pause() [caught]
21268/1: siginfo: SIG#0
21268/1: pause() Err#4 EINTR
