Entries tagged as alerting

Friday, July 25. 2008
Bring on the presents: It's SysAdmin Appreciation Day!
I'm eagerly awaiting large amounts of presents from all those demanding users out there who think sleep is a value-added extra for sysadmins, or that weekends, public holidays and holidays in general are something that doesn't apply to us.
Today is System Administrator Appreciation Day. So before you ring me up today to fix your urgent problem, be sure to have that present sent over first. Any calls raised without the appropriate gift attached will be ignored today. Geek presents earn bonus points. Caffeine and chocolate are also welcome.

Wednesday, May 28. 2008
UPS monitoring under Linux
I recently made reference to the fact I purchased some UPSes.
For our desktop machines I purchased 2 x Powerware 5110 1500VA units. We also purchased 1 x Powerware 5125 2200VA (15A socket) to run our 19" racks. (This will also include some networking gear, an 8-port KVM with 15" LCD monitor, and a reasonable server and disk arrays.)
I make it a habit when sizing a UPS to ensure I don't load it too high. Whilst I can run many things on each UPS, I would prefer a relatively low load, so that if required I can hold the box up long enough to shut everything down safely. There is nothing worse than hard crashing a server because the UPS sat at 70-80% load and couldn't hold up for the 5-10 minutes the machines needed to shut down. This is particularly true of a busy database server. I generally try to run a UPS load below 50%, and ideally between 10-30% if I can. I would rather have a few minutes up my sleeve than spend time worrying whether I will be able to recover from backup.
The software that comes with the Powerware 5110 runs under Windows. (They do provide 'Linux' software but it's pretty shoddy and a PITA to try and get working.) LanSafe does work well under Windows. Pauline runs Windows as her default OS (mainly due to her need for MYOB and the ATO's ECI Client Software, both only native to Windows; there is also the occasional business website she requires that is IE only - much to our disgust...). We will probably look at moving her over to a Linux desktop in the future and running Windows in either a dual boot or a virtual machine. In fact, we'll probably do that sooner rather than later, for a number of reasons.
I wanted something that was easy to use under my Linux desktop (Ubuntu in this case). That's where NUT comes in. NUT supports a wide range of UPS makes, models and connectivity options. Feature-wise it is quite impressive. Reading through the documentation for NUT, it's quite clear that you can extend the notifications it generates, from wall messages and syslog alerts to emails. You can even get an on-screen display (OSD) if the UPS state changes. I'll probably make use of the SMS daemon I wrote about previously to send me an SMS.
Apparently Gnome Power Manager is supposed to pick up a UPS when one is attached (it is just a battery after all). It seems it doesn't in this instance. There has been a bit of discussion in recent times about getting an independent system together that relies on D-BUS/HAL so that other window managers can also hook into it, a lot like what we're seeing now in Network Manager (love or hate it!). (See: BetterPowerManager and the Power Management Specification.)
For the record, you can also use check_ups from Nagios to actively monitor a UPS that is being managed by NUT.
Those using an APC UPS should look at apcupsd (a new stable version was just released on the 20th of May 2008). There are also native Windows versions available.

Saturday, May 24. 2008
Nagios 2-way alerting via SMS - Part 3
This is a 3-part posting that covers how you can set up 2-way Nagios alerting via SMS.
The series is broken down as such:
Review
The SMS message Nagios generates when there is a problem.
Whilst this is great, it's not of much value if we can't do anything with these SMS messages. In addition, if no-one is sitting in front of the actual Nagios web console to acknowledge the issue, then Nagios will continue sending the SMS messages and will quickly fill up your phone's message memory. Not ideal!
The message is also quite straightforward to read. A service problem message contains:
Here is a Host problem message for comparison:
Host 'MyServer' is DOWN
I: CRITICAL - Host Unreachable (10.0.0.74)
T: 2008-05-22 17:52:33
R:
Sending a response back to Nagios via SMS.
All you need to do is reply to the message (and include the original text, a feature most mobile phones on the market provide). Whilst not required, you can append a simple message to your reply. This will be used in the acknowledgment response and added as a comment to the service. Comments don't need to be long, but they do help your other team members or interested parties know what is happening with the issue.
Be careful what you write! The response you send will also be used in a follow-up SMS to everyone, notifying them that someone is looking at the issue. You don't want to send something that might not be appreciated by the other recipients.
Acknowledging SMS messages
Acknowledging messages from Nagios is quite simple now that we have our SMS daemon set up (see Part 2 for details). As a result, any incoming SMS messages get stored in our MySQL database. We just need a process now to read them from the database, interpret them and then update Nagios appropriately.
Continue reading "Nagios 2-way alerting via SMS - Part 3"

Friday, May 23. 2008
Nagios 2-way alerting via SMS - Part 2
This is a 3-part posting that covers how you can set up 2-way Nagios alerting via SMS.
The series is broken down as such:
Sending SMS messages from Nagios
Introduction
To send messages from Nagios, I'm going to assume you already have a working Nagios environment. 2-way SMS messaging refers to the fact that you can reply to an SMS message and action is taken based on your response to the sender. In this instance, Nagios will send an SMS (a Nagios alert), and you can reply to the SMS (a Nagios alert acknowledgment).
This part will cover sending out the SMS from Nagios once a host or service problem occurs. Part 3 will deal with how to send the reply and process it within Nagios. Keep reading!
Due to the large size of this posting (it contains step-by-step instructions on setting up the SMS gateway), you may find your feed reader only contains the post up to here. If that is the case, continue reading the post here.
Continue reading "Nagios 2-way alerting via SMS - Part 2"

Monday, May 19. 2008
Nagios 2-way alerting via SMS - Part 1
This is a 3-part posting that covers how you can set up 2-way Nagios alerting via SMS.
The series is broken down as such:
For those who wish to set up simple, inexpensive monitoring, you will find that it is simpler than you might first think. Feel free to grab the code snippets provided over the series and make use of them in your own environment. The Nagios SMS alert system has been running here in production for approximately 6 months and works quite well. The Nagios system manages a range of services/hosts and checks approximately 1,500 items (by no means large); it has, however, cut down the amount of time I spend looking at Nagios alert screens.
Background and Requirements
Background
I like to dabble in web development and have done so for around 10 years. As a result I host in commercial data centers in order to ensure maximum uptime and good responsiveness. To offset the cost, I now provide hosting, email and web-development services to a number of clients. Hosting remotely means one cannot easily look at the diagnostic LEDs, or even the console, to determine when a problem has occurred. (Whilst I can make use of an IP KVM, it is normally a PITA and best avoided!) Ideally I was after a monitoring system that alerted me ahead of a problem so that I could deal with it before it became a bigger issue.
Nagios was chosen as it provided all the features we required, had support for a large number of the items we wished to monitor, and was simple to extend with custom service checks written in Perl. Nagios comes with a swag of documentation, making it easy to write simple checks or extend them as we saw fit. As I've coded a fair amount in Perl over the years for various system administration tasks, Nagios was seen as a good fit. Being an open-source project was an added bonus: it was free to use, and I had access to the source code to better understand how the system worked.
I have used Nagios for years now, and historically have relied on web/email alerts. However, being human means I can't always be in front of my computer 24x7.
This worked the majority of the time, though as luck would have it we ended up with a system failure that occurred early one mid-week morning. We already had a working Nagios monitoring environment, so we wanted to make use of this system going forward.
Requirements
I wrote down our requirements for a paging system. I wanted to ensure that whatever solution was put in place could grow with us, that we weren't locked into it for years and, even worse, that we didn't embark upon a solution that would ultimately restrict any future requirements we might have. There were several ways we could be alerted:
We ruled out a telephone, as a recorded message wouldn't provide us any real details of what the issue was, unless we actually started recording exactly what the issue was. It was beginning to look too complicated for a simple requirement. I really thought a Voice-IVR was overkill for what we required. (Whilst it may have indeed been fun to hack/configure!)
A pager was then looked into. Whilst pagers work well in ensuring a message is delivered, the drawback we saw was that it meant carrying around another device. (We already carried a mobile phone and the prospect of carrying a pager as well didn't appeal.) Also, pagers traditionally send one-way messages, and we wanted a system where we could acknowledge an alert.
We settled on delivering alerts via SMS. Whilst SMS doesn't have the guaranteed delivery of a pager, it did open up the possibility of being able to respond to an alert. It had the added bonus that we already carried our mobiles close by 24x7, so there was no need to carry an additional device. Using an SMS to deliver a text-based message also ensured we could place useful information in the message to assist in diagnosing the problem.
Having determined our delivery system (SMS), it was time to find technical solutions that would fit in with the existing Nagios monitoring platform. Nagios by default has the ability to alert via various means. Pagers, SMS and emails are handled quite easily within the standard configuration that ships with the software, which would make the integration of SMS quite easy. However, by default it was only a 1-way system. There was no provision to handle a 2-way message (an alert message followed by an acknowledgment message). A 2-way SMS alert system was preferred as it could be set up to stop alerting us continually via SMS once the alert had been acknowledged.
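To make the acknowledgment half of a 2-way system a little more concrete, here is a minimal Python sketch of how an acknowledgment could be handed to Nagios as an external command. It is not the code used in this series (that is covered in Part 3); the function name build_ack_command and its parameters are hypothetical. The ACKNOWLEDGE_HOST_PROBLEM and ACKNOWLEDGE_SVC_PROBLEM commands and their field order come from the Nagios external command interface, but the default flag values below are assumptions you should check against the documentation for your Nagios version.

```python
import time


def build_ack_command(host, author, comment, service=None,
                      sticky=2, notify=1, persistent=1, now=None):
    """Build one Nagios external command line acknowledging a problem.

    Hypothetical helper for illustration. The flag defaults are assumptions;
    verify their meaning for your Nagios version.
    """
    timestamp = int(time.time() if now is None else now)
    # Semicolons separate fields and a newline ends a command,
    # so remove both from the free-text comment taken from the SMS reply.
    clean = " ".join(comment.replace(";", ",").split())
    flags = f"{sticky};{notify};{persistent}"
    if service is None:
        # No service given: acknowledge a host problem.
        return f"[{timestamp}] ACKNOWLEDGE_HOST_PROBLEM;{host};{flags};{author};{clean}"
    return f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};{flags};{author};{clean}"


if __name__ == "__main__":
    print(build_ack_command("MyServer", "oncall", "Looking into it"))
```

The resulting line would then be written, followed by a newline, to the Nagios command file (the path set by command_file in nagios.cfg), which is where Nagios reads external commands from.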
A 2-way system also had the added benefit that, since the message was sent to other parties as well, it could ensure that they were aware the problem was being investigated and so avoid duplication of work.
We also wanted to run the SMS alert system as close as possible to the Nagios monitoring host. This would ensure that if we had a physical network or IP link down, we could still get messages out. This ruled out using a 3rd party SMS service.
Coming Up
In the next part of the series, I'll describe the components that make up our 2-way SMS solution using OSS software and some of the scripts we hacked together to 'glue' the solution together. It's been running without a hitch now for about 6 months. We also use it as a Web2SMS gateway locally, and it is open enough that I can utilize it via the CLI to send out alerts. Looking back, the solution was quite simple and met all our requirements.

Tuesday, April 25. 2006
Little to no sleep...
Well the page at 1:35am this morning turned out to be a marathon.
One of our production machines crashed... a result of a failed memory board. Tried bringing the sucker up until around 4:00am, without much success - it would come up briefly and then crash again. Couldn't do much more by 4:00am and so it was off to get some sleep.
Woke at 7:00am to the mobile going off, and have been on the phone/computer since 8:30am. It's 1:30pm now... so I've managed around 3 hours sleep.
Mobile is still going off, and I will probably have another 1-2 hours before this is finally wrapped up (business has to ensure the machine/applications are all working). I think it's time to crash on the couch in front of the TV for an hour..

Oncall - two weeks running
I recently posted about the craziness at work. In particular the on-call support required.
I'm commencing my 2nd consecutive week of on-call. Basically on-call means:
You really don't get much sleep, and what sleep you do get isn't quality sleep as you're listening out for the mobile all night. You average between 10-30 hours in call-outs for the week... but it's the constant 'stuck at home' that annoys me... your life is on hold whilst you are on call. The money is decent -- though I think I would prefer the sleep!
So if you're wondering who keeps yellowpages.com.au, whitepages.com.au, whereis.com, tradingpost.com.au, citysearch.com.au and sensis.com.au going in the wee-hours --- that would be moi at present!
Ahh the joys of the mushroom administrator! So my Anzac Day will be spent not far from my work mobile, new work notebook and wireless access card.
And on that note ..... at 1:35am there goes my mobile paging me!

Friday, March 10. 2006
Craziness
Work is insane.... there is no other word that describes it.
Currently job loads are probably up 300-400% from 'normal operations'. Though no additional resources have been brought in.... it's just "Get the job done". That type of management can survive for a short period, but can never be sustained. People are pulling 40+ hours at home ON TOP of their usual 40-60 hour week in the office.
I've told them I'm not doing on-call... the management thought they could pull a swifty and pay me half of what the fulltime employees get, with no penalty rates for after-hours, weekend and public holiday work. Looks like that's one that has back-fired in their face. I've flatly refused to do any on-call/overtime until something reasonable is put on the table. Unfortunately it means the FTEs have to carry the load and they are already at breaking point... so I am expecting it will hit the fan big time next week....
For a 24x7 operation it's about time they cough up the $$$s and run separate shifts... you can't expect people who work all day to then turn around and work all night and then return to work the next day! I don't know what they are thinking...
So for the interim... I'm doing the usual 40 hour week... which suits me fine. I can finally catch up on my sleep and remember what bed is!

Sunday, March 5. 2006
Oncall coming to an end... for the time being.
I'm just about finished my week of on-call support.
So far I've still got tonight to go and all the way through until 8:30am on Monday, and I've clocked up ~28 hours (most in the wee-hours and weekend work) -- that is on top of my normal 40 hours! Don't expect much from me this week... I'm likely to be getting a lot of sleep!

Tuesday, February 28. 2006
Arrggg.... on-call support
Computers can be wonderful things.... you can accomplish in a matter of seconds tasks that would take you days, weeks, or months if done manually.
They can also be a right pain in the @#$&! Last night was my first night on call at work. Some of the crazy servers kept going up and down like yoyos.... which meant that about every 10 minutes I got a page -- all night. I feel like crap and even look it.... I'm sure Pauline didn't get much sleep either.
Dragging myself into work was an effort... making it through the entire week will be a miracle! I'm really hoping last night was the exception and not the norm.
Current mood: sleepy