With the announcement that Ksplice is now offering its paid service for Red Hat, Ubuntu (LTS), CentOS, and a few other Linux distros, allowing no-reboot updates to the Linux kernel, a lot of people have been wondering whether there is a real market for it. Of course, if you really care about uptime, you have already built your environment to tolerate the failure or reboot of a single system (because there’s another one to take its place, not because downtime is tolerable). That works well for large environments (though they could still benefit), but things get a bit more interesting in small ones.
In smaller, lower-budget systems, this product could be a life-saver. When admins are crunched for time and services aren’t redundant (as is the case when purchasing single dedicated servers from commodity hosting services, which really amount to a massive number of small environments), reboots are a pain and can create extra maintenance windows. Paying $4/month to save an admin from scheduling downtime in the middle of the night is completely worth it.
What Ksplice doesn’t provide (though some people seem to think it does) is any sort of high availability. Just because kernel updates can be applied without a reboot doesn’t mean that Ksplice prevents unplanned downtime. It can, however, eliminate planned downtime for kernel maintenance, which is a welcome change for any sysadmin.
In the time that I spend over on the Zimbra forums, it seems that the biggest thing that causes new installations to fall flat on their faces and frustrate potential new users is name resolution. Most issues can be narrowed down to:
Incorrectly set hostname
During the installation of some distros, it’s easy to get the hostname messed up. Just running a ‘hostname -f’ before the installation to verify the name has been set can easily catch any issues that may come up later.
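As a rough illustration of that pre-install check, here is a minimal Python sketch. The helper names are hypothetical and not part of Zimbra or any installer; the rule it applies (at least two dot-separated labels, and not a localhost alias) is only an assumption about what a sane fully qualified name looks like.

```python
import socket


def looks_fully_qualified(name: str) -> bool:
    """Return True if name has at least two non-empty labels and is not a localhost alias."""
    labels = name.strip().rstrip(".").split(".")
    if len(labels) < 2 or any(not label for label in labels):
        return False
    # A name such as localhost.localdomain usually means the hostname was never set.
    return labels[0].lower() != "localhost"


def check_local_hostname() -> bool:
    """Look up this machine's fully qualified name and report whether it looks sane."""
    fqdn = socket.getfqdn()
    ok = looks_fully_qualified(fqdn)
    print(f"{fqdn}: {'looks fine' if ok else 'fix this before installing'}")
    return ok
```

Calling check_local_hostname() on the server before starting the installer gives roughly the same information as eyeballing the output of ‘hostname -f’.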
Incorrect /etc/hosts file
Zimbra is very sensitive to the values in your /etc/hosts file, so making sure that it is set correctly is very important for a successful installation. If your hostname is, say, mail.thesysadminlog.com and the IP is 10.10.10.10, then your hosts file should look like:
127.0.0.1 localhost.localdomain localhost
10.10.10.10 mail.thesysadminlog.com mail
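If you would rather have a script look over the file, the following Python sketch checks hosts-file text against the layout above. The function name is hypothetical, and the rules it enforces (the name must not sit on a loopback address, must map to the expected IP, and should be listed first on its line) are assumptions drawn from the example rather than an official Zimbra check.

```python
def hosts_entry_problems(hosts_text: str, fqdn: str, expected_ip: str) -> list:
    """Return a list of human-readable problems found for fqdn in hosts_text."""
    problems = []
    found = False
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if fqdn not in names:
            continue
        found = True
        if ip.startswith("127.") or ip == "::1":
            problems.append(f"{fqdn} is mapped to loopback address {ip}")
        elif ip != expected_ip:
            problems.append(f"{fqdn} is mapped to {ip}, expected {expected_ip}")
        elif names[0] != fqdn:
            problems.append(f"{fqdn} should be the first name on its line")
    if not found:
        problems.append(f"no entry for {fqdn}")
    return problems
```

Reading /etc/hosts into a string and passing it in along with the server’s name and real IP should return an empty list when the file matches the example.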
Missing, misconfigured, or misunderstood DNS
Another key component to a working Zimbra system is proper configuration of DNS for the server. There are a number of scenarios surrounding DNS configuration with Zimbra and plenty of help in the wiki and forums, but in general you need:
- An A record in your DNS server pointed to the new Zimbra server’s real IP address. It doesn’t matter if this is a private, non-routable IP or a public IP. It simply needs to be an A record for the Zimbra server’s hostname that points to the real IP address of the network interface of the Zimbra server.
- An MX record for the domain you want to use pointed at the A record you created. While this isn’t entirely necessary, it does make the install go a bit smoother.
- Configure the Zimbra server to use the DNS server you just set up and no other. Of course, you can replicate the zone to one or more additional name servers and add those to your /etc/resolv.conf file, but don’t add name servers that don’t carry these records.
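The last point is easy to verify mechanically. This Python sketch lists any nameserver entries in resolv.conf text that are not in a set you consider authoritative for the Zimbra records. The function name and the 10.10.10.2 address in the usage note are made up for illustration.

```python
def unexpected_nameservers(resolv_conf_text: str, allowed: set) -> list:
    """Return nameserver addresses in resolv_conf_text that are not in allowed."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        # resolv.conf comments start with '#' or ';'
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return [server for server in servers if server not in allowed]
```

For example, if your internal DNS server were 10.10.10.2, passing the contents of /etc/resolv.conf with allowed={"10.10.10.2"} would return any stray public resolvers that were left in the file.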
Hopefully I’ve helped mitigate some of the pitfalls you may find yourself falling into during your Zimbra installation. For a more inclusive how-to, you can also check out my full Zimbra how-to. While it was written against Zimbra 5.0, the article should apply against 6.0 (I’ll be going through it soon to check for missing prerequisites
Greetings. I would like to make my debut on TheSysAdminLog by discussing a fundamental necessity of any IT department. While not as glamorous as virtualization, disaster recovery, or Windows/Linux debates, this concept is equally important to those of us who must support a large number of systems on a daily basis…
Documentation.
While most of us acknowledge the importance of documentation, we all hate doing it. It’s a cumbersome and inconvenient chore. Why spend time diagramming, taking screenshots, and writing intricate details about the setup of a particular system, when we could be lab-testing the features of a new upgrade or stringing CAT5 cable around the office of a vacationing co-worker? (guilty)
However, documentation can be a life-saver in a number of situations. Firstly, your co-workers will one day appreciate it. I am a member of a team of about a half dozen people. While each of us holds an over-arching knowledge of all the systems we support, we also each have our own areas of specialty. Documentation can be a lifeline for your co-workers in the event that you are unavailable. It is bad enough being called in the middle of the night because of a critical system crash. It’s even worse being unable to reach the person who set up the system, then being left to feel your way through the dark to get the system back online.
Also, your own documentation can provide much-needed help to yourself. Whether it be short-term memory loss, or simply the span of a couple of years since the initial setup, we often forget the intricate details of a system. The only problem with stable systems is that we deal with them so little that we tend to forget exactly how they work. In the midst of a crisis, we should not be taking the time to re-learn or re-familiarize ourselves with a particular setup. It’s game time! Open up that PDF and get to work!
Next, keep your successors in mind. This may be the weakest motivation for writing documentation, since day to day system administration at a former employer no longer seems important. However, once you leave your company, someone else needs to take care of your responsibilities. If you leave behind a flaming pile of chaos, this won’t reflect well on you. If nothing else, summon some compassion and empathy for your IT descendants. They’re people too!
Undoubtedly, many system admins avoid documenting in an attempt to ensure job security for themselves. However, what does it say about one’s confidence in their job performance when they hold their knowledge hostage to keep their job? It’s also important to realize that securing a current position in this manner could also secure a lack of any promotion in the future.
So, save yourself and your co-workers a bit of headache, and document your work. If you don’t, then please be sure to look both ways before crossing the street. Your fellow employees will appreciate it.
It seems like every six months we hear another sad tale of a huge amount of data lost by an online service. The story usually revolves around IT management forgetting that replication is not backup. Last week the story hit the web about Microsoft/Danger’s disaster with their online service for the Sidekick. For anyone who didn’t read about the issues, the Sidekick is a mobile phone from T-Mobile that doesn’t store data locally, but pulls it from (what is now known as) a cloud service. The device and service are offered by Danger, which is now owned by Microsoft. The service went offline about a week and a half ago, and last week T-Mobile, Microsoft, and Danger let customers know that their data is most likely gone for good.
The announcement told customers that the data loss was due to a “server failure”, though some have speculated that it was due to some sort of botched SAN maintenance. The story is sad enough since loads of people lost data including contacts, calendars, to-do lists, and photos; it’s even sadder once you factor in that the whole thing was easily preventable. In this case, Microsoft apparently didn’t have any sort of backups.
Interestingly enough, a lot of sysadmins seem to forget some basic rules of keeping valuable data around. After disasters like this, it’s important to examine our own operations with this cardinal rule in mind: replication is by no stretch of the imagination a backup. This is especially true when one tries to depend upon RAID to protect against data loss. A RAID mirror is effective against hard drive failures, but if you delete a file, it is deleted from both drives. The same applies when replicating data across a network, whether through a straight file copy or a clustered file system. No matter the method, keep in mind that you still need old copies for those accidental deletes.
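To make the distinction concrete, here is a toy Python model. It is only a sketch: MirroredStore and take_backup are invented names, and the dictionaries stand in for real disks or replicated volumes. A mirror applies every change, including a delete, to both copies, while a backup is a point-in-time copy that later changes can’t touch.

```python
import copy


class MirroredStore:
    """Toy stand-in for a RAID mirror or replicated volume."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, name, data):
        # Every write lands on both copies.
        self.primary[name] = data
        self.replica[name] = data

    def delete(self, name):
        # So does every delete, accidental or not.
        self.primary.pop(name, None)
        self.replica.pop(name, None)


def take_backup(store):
    """Return an independent point-in-time copy of the store's contents."""
    return copy.deepcopy(store.primary)


def simulate_accidental_delete():
    """Return (file still on replica?, file still in backup?) after a delete."""
    store = MirroredStore()
    store.write("contacts.db", b"address book")
    backup = take_backup(store)
    store.delete("contacts.db")
    return ("contacts.db" in store.replica, "contacts.db" in backup)
```

After the delete, the file is gone from the replica but still present in the backup, which is exactly the case replication alone doesn’t cover.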
I can’t think of a better way to kick this blog into gear than a good old-fashioned hair-on-fire story. One day last week I was rolling out some updates to a beta system for our internal users and ended up learning a rough lesson in VMware snapshots.
Before rolling out the change, I took a snapshot as usual. After doing the upgrade, I put the software into debug mode to make sure that everything was running smoothly. Once I was satisfied, I went back to the Virtual Infrastructure Client to remove the snapshot, just in time to watch the snapshot fill up the disk. Apparently, the debug logs had been written so fast that the snapshot’s record of changes had used up all the free space. Dang, not quite what I had in mind.
Well, VMware shut down the guest for me, I removed the snapshot, and rebooted the guest. I was able to free up some space to get the guest booted back up and get out of trouble... for the moment. Somewhere along the line, however, the guest still had a snapshot attached to it without it showing up in the Snapshot Manager. After just a few hours, I went back to make another snapshot and noticed that the free space on the guest’s datastore had shrunk drastically.
After doing a bit of searching, I found a few others with this problem on VMware Fusion and VMware ESX. Apparently my guest was stuck in some awkward not-snapshotted-but-still-snapshotted state. There was a vmdk file for the disk as well as one with the same name with -000001 appended to the end which was growing, indicating that it was still under a snapshot yet nothing showed up in the GUI.
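If you want to check for this state without relying on the GUI, a small Python sketch like the one below can flag leftover snapshot disks in a directory listing. It assumes only the naming pattern described above (the base disk name followed by a dash and a six-digit counter). The function name and sample filenames are hypothetical, and any other files VMware keeps alongside the disk are simply ignored.

```python
import re

# Matches names such as "beta-000001.vmdk" and captures the base name "beta".
SNAPSHOT_DISK = re.compile(r"^(?P<base>.+)-(?P<seq>\d{6})\.vmdk$")


def find_snapshot_disks(filenames):
    """Map each base disk name to the snapshot-style disk files found for it."""
    found = {}
    for name in filenames:
        match = SNAPSHOT_DISK.match(name)
        if match:
            base_disk = match.group("base") + ".vmdk"
            found.setdefault(base_disk, []).append(name)
    return found
```

Feeding it the output of os.listdir() for the guest’s directory and getting a non-empty result while the Snapshot Manager shows nothing would suggest the guest is in the same stuck state.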
In the end, I had to perform the following steps to get out of this weird state:
1) shut down the guest
2) create a new snapshot
3) “Delete All” snapshots from the Snapshot Manager
4) start guest
Not cool, especially when it happens during production hours. Fortunately for me, this happened to an internal beta system, but still not cool by any stretch of the imagination.
In the end, I learned two things:
1) apparently VMware snapshots aren’t overly robust under high I/O loads, as someone else in the VMware forums mentioned as well
2) never, ever fill up a vmfs volume (well, I re-learned this one)
UPDATE: I’ve since learned that in most cases, Virtual Center is the one timing out here, not the actual snapshot deletion process. If you have a little patience and wait for a while, eventually the snapshots delete themselves. At least, that seemed to work for me most of the time.
Hello again, world. As an author of TheSysAdminLog, I would like to take a moment to introduce our new blog and give you an idea of what we’re all about.
Who we are
We are a group of systems administrators who manage various systems and work in various industries. We obsess over technology and business, and we enjoy conversations about how the two interact. We also enjoy just plain ol’ technology as well.
Our focus
TheSysAdminLog is focused on keeping systems administrators supplied with news and tips to keep them competitive in the ever-changing world of IT. Just as systems administrators tend to support a wide range of hardware and software, we have no formal focus outside of things that we have encountered or are noticing as trends for other sysadmins. Our posts may include anything from getting the most out of your hardware to service monitoring to what the latest software or hardware releases mean for your company.
We’re just getting started, so check in often, subscribe to the RSS feed, and follow us on Twitter!