Well isn’t this special. I’ve been battling a sporadic error with jumpstarting some brand spanking new Sun equipment. You see, every other jump or so, one to three systems would get to the point where they loaded the Solaris kernel ok but then halted with an “ERROR: No disks found”.
Poking these systems and telling them to jump a second time (“boot net - install”) almost always resulted in a successful jump. This is with about 14 v210’s and 7 v440’s in an environment where they will need to be jumped on a regular basis; perhaps 2-3 times a week.
Getting reliable jumps was critical and maddeningly just out of my grasp.
Nothing I tried seemed to work: reseating disks, swapping disks around, or using Solaris 9 for the network boot kernel instead of Solaris 10 (the actual jump data was Solaris 10 FLARs). I eventually had to break out the Expect book and code up a babysitter script.
The babysitter gets kicked off every time a series of jumps starts and essentially connects to all of the ALOMs in sequence and checks the console history, then sleeps for a minute and checks them all again. It looks for several different strings and acts accordingly. I even managed to discover a few cases where a kernel panic could occur during jump.
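For anyone wanting to build something similar, here is a rough Python sketch of that polling loop. It is not my Expect script. The only string taken from the real consoles is “ERROR: No disks found”; the "panic" substring, the action names, and the `fetch_history` and `act` callables (which would wrap whatever you use to talk to the ALOMs) are all made up for illustration.

```python
import time
from typing import Callable, Iterable

# "ERROR: No disks found" is the message quoted in this post.
# "panic" is an assumed substring for spotting a kernel panic.
NO_DISKS = "ERROR: No disks found"
PANIC = "panic"


def classify_console(history: str) -> str:
    """Map console history text to an action name (names are hypothetical)."""
    if NO_DISKS in history:
        return "rejump"
    if PANIC in history.lower():
        return "flag_panic"
    return "wait"


def babysit(
    hosts: Iterable[str],
    fetch_history: Callable[[str], str],  # hypothetical: returns console history for a host
    act: Callable[[str, str], None],      # hypothetical: performs the named action on a host
    passes: int,
    delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Check every host in sequence, sleep, and repeat for a fixed number of passes."""
    host_list = list(hosts)
    for n in range(passes):
        for host in host_list:
            action = classify_console(fetch_history(host))
            if action != "wait":
                act(host, action)
        if n < passes - 1:
            sleep(delay)
```

One thing this sketch does not handle: a console history that still shows the old error after a re-jump will trigger the same action again on the next pass, so a real script needs to remember what it has already dealt with.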
A day or two after getting this script functional and watching it perform well enough for a short-term hack (I still wanted to find the root cause but had no time to pursue it further), I found out from the onsite Sun engineer that there’s a run of bad disks from Fujitsu, including approximately 54 disks in the batch of servers that I had been fighting with.
Doh!
For systems that are covered under a maintenance contract (and perhaps warranty, but I don’t care about details like that.. just get this fixed damnit!) Sun is doing a proactive replacement of Fujitsu disks of type MAT3073N with a serial number range of 0450Bxxxxx through 0532Bxxxxx. It’s not like the disks are outright failing, but rather there are indications of a problem with that production run where early failure or similar could happen. Hence the term ‘proactive replacement’.
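If you have a pile of serial numbers to check, something like the following Python sketch could do the sorting. It assumes the serial is four digits, a literal B, then five more characters (my reading of the “xxxxx” above), and that the range is decided by comparing the four-digit prefix numerically. Both are assumptions on my part, so confirm with Sun before pulling any disks.

```python
import re

# Assumed layout: four digits, a literal "B", then five alphanumeric characters.
SERIAL_RE = re.compile(r"(\d{4})B[0-9A-Z]{5}")


def in_replacement_range(serial: str) -> bool:
    """Return True if the serial's prefix falls in 0450 through 0532 (inclusive)."""
    m = SERIAL_RE.fullmatch(serial.strip().upper())
    if m is None:
        return False
    return 450 <= int(m.group(1)) <= 532
```

Anything that does not match the assumed layout is reported as out of range, so eyeball the rejects rather than trusting them blindly.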
I hope this helps someone avoid spinning their wheels needlessly if they stumble upon this article via Google whilst looking for a way to make their new servers jumpstart reliably.