Thursday, February 3, 2011

Upgrading Ubuntu remotely: How to minimize the risk of losing the server?

Background: I am forced to remotely upgrade a server from Ubuntu 8.04 LTS to 10.04 LTS due to an incompatibility issue with the RAID controller.

The internet connection to the server is somewhat stable and seldom drops. Despite that, I am concerned about losing the SSH connection while doing the upgrade, leaving the server in an unreachable state. I am also worried about the server not being able to boot after the upgrade, in which case I would be unable to tell what the problem is.

Action plan: What I am looking for is advice on how to minimize the risk of losing the server; I am aware that what I am doing is very risky. This is my current action plan:

1) Backup everything that matters, locally and externally.

2) Temporarily disable boot-time fsck disk checks. (I would have no clue what is going on if a disk check took a long time to finish.) This is done in /etc/fstab by changing the very last field (the fsck pass number) from 1 to 0:

UUID=5b1ff964-7608-44fd-a38d-7e43ad6b4c11 /               ext3    relatime,errors=remount-ro 0       0
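A minimal sketch of that change, shown here on a sample line rather than the live file (on the server you would edit /etc/fstab itself, ideally after backing it up with something like `cp /etc/fstab /etc/fstab.bak`):

```shell
# Sample root filesystem entry with the fsck pass number (last field) still set to 1
line='UUID=5b1ff964-7608-44fd-a38d-7e43ad6b4c11 / ext3 relatime,errors=remount-ro 0 1'

# Set the sixth field (fsck pass) to 0 so boot-time checks are skipped
echo "$line" | awk '{ $6 = 0; print }'
# → UUID=5b1ff964-7608-44fd-a38d-7e43ad6b4c11 / ext3 relatime,errors=remount-ro 0 0
```

Note that awk normalizes the whitespace between fields; for the real file an editor or a careful sed expression preserves the original alignment.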

3) Start all upgrade processes inside screen so that they can be resumed if I lose the connection, e.g.:

sudo screen apt-get upgrade
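To make reattaching easier after a dropped connection, the session can be given a name (the name "upgrade" here is arbitrary):

```shell
# Start the upgrade inside a named screen session so it survives a dropped SSH connection
sudo screen -S upgrade apt-get upgrade

# After reconnecting over SSH, reattach to the still-running session
sudo screen -r upgrade
```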

Questions:

  • Does my proposed action plan seem reasonable?
  • Is disabling the boot-time disk check a bad idea?
  • What else could be done to decrease the risk of losing the server?

Update: Almost all answers suggested setting up DRAC/IPMI, which I have now done. This feels like a really great achievement that will surely make the risk much smaller, as I can follow the entire power cycle over KVM/console redirection. For future reference, this is what I have done:

1) Installed ipmitool to set up the IP address, gateway, etc. for IPMI v2.0:

sudo ipmitool lan set 1 ipaddr 192.168.1.99 
sudo ipmitool lan set 1 defgw ipaddr 192.168.1.1
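To verify that the settings took effect, the LAN configuration can be printed back (channel 1, as above):

```shell
# Show the current IPMI LAN settings for channel 1 (IP, gateway, MAC, etc.)
sudo ipmitool lan print 1
```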

2) Installed freeipmi to change the NIC selection mode to shared (I have only one network interface connected to the network):

sudo ipmi-oem dell set-nic-selection shared 

3) Used DRAC's HTTPS interface at https://192.168.1.99 to launch the console redirection viewer. This allows me to follow the entire boot sequence as well as configure the BIOS, RAID controllers, etc. Awesome.


Update 2: Done. All went like a charm; the job took less than 30 minutes. I ended up not turning off the disk check, as the redirected console gave me the freedom to interrupt it whenever I wanted, but I let it run to the end.

Thank you guys, your wisdom is invaluable!

  • If hardware does not break, there isn't anything you can't do with a serial console, so that's the way to go:

    • get some remote access to serial console (IPMI serial over lan if the system has >=IPMI-2.0, or a null modem serial cable connected to another system where you'll run minicom)
    • configure grub and linux to use the serial console
    • redirect the system BIOS interface on serial if it is possible (many server systems are able to do that)
    • reboot the system and check that you can use the BIOS, GRUB, see dmesg, see the init scripts, and log in, all over the serial console
    • run the upgrade
    • cross your fingers

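    For GRUB legacy as shipped with Ubuntu 8.04, the GRUB and kernel steps above can be sketched in /boot/grub/menu.lst roughly like this (the port ttyS0 and speed 115200 are assumptions; match them to the BIOS/IPMI serial settings):

```shell
# /boot/grub/menu.lst (GRUB legacy) -- put the GRUB menu itself on the serial port
serial --unit=0 --speed=115200
terminal --timeout=10 serial console

# On each kernel line, send console output to both the local screen and ttyS0:
# kernel /boot/vmlinuz-... root=UUID=... ro console=tty0 console=ttyS0,115200n8
```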
    Also, install the new system on another disk or partition if at all possible, so you can test the new system before erasing the old one. I usually do that on two-disk systems: I take one disk out of the mirror, create a new (degraded) mirror with the freed disk, install there, and if everything is OK I destroy the old mirror, hot-add the 'old' disk to the new mirror, and let it rebuild.
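    Assuming Linux software RAID, the mirror-split procedure might look like the sketch below (device names sda1/sdb1 and arrays md0/md1 are hypothetical):

```shell
# Take one disk out of the existing mirror
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# Create a new, degraded mirror on the freed disk and install the new system there
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 missing

# Once the new system is verified, retire the old array and
# hot-add the old disk so the new mirror rebuilds
mdadm --stop /dev/md0
mdadm /dev/md1 --add /dev/sda1
```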

    EDIT: I read it's a Dell R710; AFAIK that should have IPMI 2.0. Configure it by running ipmitool locally on the system, and test the serial-over-LAN feature using `ipmitool sol activate` from another system. Bang! You have your serial console. Dells are also able to redirect the BIOS to the serial console (which IPMI will in turn redirect over serial-over-LAN). You should do that anyway, to get access to the system if anything goes really bad. I manage a couple of old Dell PE1425s using null modem cables with BIOS, GRUB, and system serial consoles, and a couple of Dell R300s the same way but using IPMI serial over LAN in place of the actual serial cable.
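    A sketch of testing serial-over-LAN from another machine with ipmitool (the IP and user here are examples; the lanplus interface is required for SOL on IPMI 2.0):

```shell
# Open the serial-over-LAN console of the remote BMC
ipmitool -I lanplus -H 192.168.1.99 -U root sol activate
# (type ~. to exit the SOL session)
```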

    Avada Kedavra : Very interesting indeed! I don't know jack about these things though, so I need to check out how much of it I could actually use. Really good pointer though, I'll follow up on this as well as the iDRAC6 interface!
    Avada Kedavra : Hello Luke. I find this lead extremely useful, but I don't seem to be able to set the DRAC's "NIC selection mode" to "shared" using ipmitool (IP settings etc. are changeable). Since the server is far away from here, I will not be able to reboot it to set this parameter, and I am not able to connect to the server as long as this parameter is set to the default "dedicated". Do you have any suggestions on how to get around this? (I have a local spare server to test everything on.)
    Avada Kedavra : Yeey, found a way to do this: install freeipmi-tools and `sudo ipmi-oem dell set-nic-selection shared`
    Avada Kedavra : OMG, this is amazing! I feel enlightened! DRAC/IPMI/console redirection is set up and I can follow the entire reboot via KVM. Thank you so much!
    Luke404 : It's OK to feel enlightened and all, but please do remember that what you're doing is NOT something to do on prod systems, and the idea of losing control of the remote system and/or having to wipe it for a clean install (which could be done remotely via serial console if you have a boot server on the LAN) is always a possible outcome. The console will just let you be in control, so you'll be able to resolve *simple* boot problems and/or know without doubt if/when you get totally screwed up.
    From Luke404
  • Personally, depending on how important this server is to you (your business, etc.), I'd get my hands on a similar system, try reproducing the environment, and then upgrade it via SSH right in the room (or somewhere physically accessible to you) so you can test your procedure. If you can upgrade that without losing your configuration/connection, you stand a pretty good chance of being able to upgrade the remote server.

    This won't be 100% exact, but it should at least eliminate errors caused by software upgrades, software configuration, alterations, and the like, as long as you make the test system as closely configured to your remote server as possible.

    EDIT: Another solution is to create a second server as a failover first. This way, if the server dies, you still have a backup for customers/users until the primary server comes back up. This should alleviate some of the butterflies you're experiencing from having one server so far away. Again, this may be overkill in many circumstances; how much you're willing to spend on making sure this server is available in the event of total failure depends on how important it is to your company and the impact downtime would have.

    Avada Kedavra : I have already upgraded 5 local servers, of which 3 were upgraded over SSH. All upgrades ran smoothly. But this server is located in Australia and I am sitting in Sweden, so I need to be extra careful this time. And unfortunately this server is indeed important to our business :/
    Bart Silverstrim : The remote console access hardware is still a big help, but I'm reminded of the Apollo 13 movie, when NASA needed a solution to help the astronauts and began testing on the ground with simulators that replicated the remote environment. It sounds like you have some reason to believe this should work as you've been doing it, and are nervous about what could go wrong... and there is always a chance of a wrench in the works, no matter how hard you try to account for all possibilities.
    Bart Silverstrim : For all you know, a drive may fail on reboot, or a driver may change that can kill your volume access. I'm editing a second thought too in the answer...
    Avada Kedavra : Yeah, I do believe this will work out to my liking; I am just looking for things to make it less prone to go wrong. I also want to try to pull this off without bringing in extra hardware or external help to rewire, etc. Rigging a failover solution feels a bit like overkill. I appreciate your thoughts though!
  • I think that Out-of-Band Management (I'm most familiar with HP's iLO), or even IP KVM would be your best bet.

    As Bart mentioned, Testing is invaluable if you have the resources (read: a spare similar box or fellow cluster member).

    Finally, (or first, actually) Backups. Tested Backups. Backups you can be proud of...

    From gWaldo
