Why was the site down today?
CougTek
12-16-2005, 12:19 AM
I've been unable to access SF for most of the day. I'm sure it's the CIA, alarmed by Flagreen, that shut it down because of my (and Merc's) anti-Bush comments.
Damn conservative rednecks!
Handruin
12-16-2005, 12:25 AM
Man I'm steaming right now. The support staff was non-existent ALL day. I've opened several tickets and heard nothing until roughly 30 minutes ago. Apparently some problem with MySQL happened tonight requiring a patch to the OS?!? WTF.
So, in short, I'm sorry for the downtime. I was not able to do anything other than change the front page to indicate my displeasure with their response time.
CityK
12-18-2005, 11:46 AM
The site seems the snappiest it's ever been for me right now (very responsive) ... did the support monkeys actually make any changes?
Handruin
12-19-2005, 12:45 AM
They rebooted. ;) It had been 128 days since the last reboot, and they also applied some patches to the system (so they say).
Handruin
12-19-2005, 09:47 AM
You know what, the site does seem a lot snappier. Anyone else notice this?
You know what, the site does seem a lot snappier. Anyone else notice this?
Yes.
sechs
12-19-2005, 05:04 PM
No.
Actually, it's quite a bit slower for me now.
Handruin
12-20-2005, 12:56 AM
I guess it was short-lived. It still seems faster than it was. Earlier today it was flying for me.
CityK
12-20-2005, 11:30 PM
It's still pretty good for me, albeit not as peppy as the other day ...
Explorer
12-22-2005, 01:47 PM
This was what it looked like on Black Thursday (15/Dec/2005)...
Warning: mysql_connect(): Can't create a new thread (errno 11). If you are not out of available memory, you can consult the manual for a possible OS-dependent bug in /home/handruin/public_html/forum/db/mysql4.php on line 48
Warning: mysql_error(): supplied argument is not a valid MySQL-Link resource in /home/handruin/public_html/forum/db/mysql4.php on line 330
Warning: mysql_errno(): supplied argument is not a valid MySQL-Link resource in /home/handruin/public_html/forum/db/mysql4.php on line 331
phpBB : Critical Error
Could not connect to the database
CougTek
02-27-2006, 06:54 PM
I've been unable to access SF during the past few minutes, but then poof, it came back as if nothing had happened. I even took the time to write Doug a message (which I now think I should have held off on).
Has anyone else experienced the same?
P5-133XL
02-27-2006, 06:59 PM
Not here
Handruin
02-27-2006, 08:53 PM
I wasn't around during that time, but maybe another client on the server created a new hosting account. That causes apache to restart. It's nothing new with this server, but it will cause a moment of downtime. Either way, thanks for letting me know.
LunarMist
06-21-2006, 07:27 PM
What wild events occurred last night? The site was full of errors again.
Tannin
06-21-2006, 07:36 PM
Yeah, there was some weirdness. I couldn't access the top-level pages (the forum index, e.g.) but could get the individual pages. Top-level pages gave me the dreaded white page of death. But it's fine now.
Handruin
06-22-2006, 07:03 AM
Sorry guys, I don't honestly know. I was out all night and this morning so this is the first I'm hearing about it. Seems to be fine now, perhaps our server had a bad night and apache needed a restart. Thanks for letting me know.
P5-133XL
06-23-2006, 06:24 AM
The site was down again this evening.
Tannin
02-23-2007, 07:41 AM
Is it time to consider better hosting? Over the last couple of months, Storage Forum has been down more often than I can count. It's never off for long at a time, typically 5 or 15 minutes, but it's starting to bug me, and it must be driving you mad, Doug.
OK, I'm here in Australia, but I'm comparing with my own little stable of sites, and they are much more reliable. If you are looking for a better hosting company, you might do worse than try the California-based firm I've been using for the last three years or so.
ddrueding
02-23-2007, 11:19 AM
Tannin, we've been talking about this over here (http://www.storageforum.net/forum/showthread.php?t=5851) if you want to catch up. I know it's not a logical place for it, but these things happen.
CougTek
05-10-2007, 11:15 AM
The web site was inaccessible for me for the previous 15-20 minutes at least.
Mercutio
05-10-2007, 11:29 AM
Over an hour by my count. Came back just as I started to bitch at Handy about it. :)
timwhit
05-10-2007, 12:31 PM
I know we have talked about changing providers in the past, but maybe it is time to think about it again.
How would something like this (http://www.lunarpages.com/reseller-hosting/) compare to the reseller account you have now?
timwhit
05-10-2007, 12:35 PM
I also don't know how much work it would be to switch providers. If it is 20+ hours of work, then it is a pretty big pain, but if it is 2-3 hours, then I don't see the drawback. If the new provider turns out to be crappy, then try someone else. Just make sure not to sign a 1 year contract with the new host.
I'm sure the regulars would be willing to donate to the cause.
Handruin
05-10-2007, 01:23 PM
I guess that means bitching works. :)
Handruin
05-10-2007, 01:28 PM
My major concern with switching is having access to SSH to allow me to run my backup scripts (along with restore). Do you know if Lunarpages provides that feature?
timwhit
05-10-2007, 02:49 PM
My major concern with switching is having access to SSH to allow me to run my backup scripts (along with restore). Do you know if Lunarpages provides that feature?
It looks like only their dedicated servers (http://www.lunarpages.com/dedicated-hosting/) offer SSH.
I don't know if any of those are doable, pricing seems pretty steep.
Handruin
05-10-2007, 02:56 PM
That's usually the catch I face when looking for a new place. I'd either have to find a hosting service that allows SSH access or get a dedicated box. I'm not ready to swing $100-150/month just yet. We don't have that much traffic.
timwhit
05-10-2007, 03:50 PM
Is there an alternative method you could use for doing the nightly backup?
timwhit
05-10-2007, 03:58 PM
That's usually the catch I face when looking for a new place. I'd either have to find a hosting service that allows SSH access or get a dedicated box. I'm not ready to swing $100-150/month just yet. We don't have that much traffic.
Host Gator (http://hostgator.com/resellers.shtml) or BlueHost (http://www.bluehost.com/tell_me_more.html) offer nice packages with SSH.
Handruin
05-10-2007, 04:00 PM
I don't know of any (I'm open to suggestions). The database is close to 200MB, which I know isn't enormous by any means, but it does rule out most of the ways I can think of. This is why I use a bash script to run mysqldump for me and then compress it which brings the size down greatly.
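For anyone curious what the dump-then-compress approach described above looks like, here is a minimal sketch in Python rather than bash. It is not the actual backup script; the function names are hypothetical, and it assumes `mysqldump` is on the PATH and picks up its credentials from `~/.my.cnf`. The dump is streamed straight into gzip, so the full 200MB never has to sit in memory or on disk uncompressed.

```python
import gzip
import shutil
import subprocess


def compress_stream(src, dest_path):
    """Copy a binary stream into a gzip file, chunk by chunk."""
    with gzip.open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out)


def backup_database(db_name, dest_path):
    """Run mysqldump and gzip its output as it is produced.

    Assumption: mysqldump is installed and reads credentials
    from ~/.my.cnf, so no password appears on the command line.
    """
    proc = subprocess.Popen(["mysqldump", db_name], stdout=subprocess.PIPE)
    compress_stream(proc.stdout, dest_path)
    proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError("mysqldump exited with status %d" % proc.returncode)
```

Something like `backup_database("forum", "forum.sql.gz")` would be the nightly call, where the database name is only a placeholder. SQL dumps are very repetitive text, which is why compression shrinks them so much.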
Mercutio
05-10-2007, 04:01 PM
LunarPages pisses me off.
I know there are people who hate Dreamhost with a passion but honestly, I e-mail Dreamhost about a problem and they help. I call Lunarpages for something and the answer I've gotten every time is some variation on "Well, what do you expect on a shared hosting plan?"
timwhit
05-10-2007, 04:23 PM
Dreamhost (http://dreamhost.com/hosting.html) does offer SSH (http://www.dreamhost.com/hosting-features.html), but I am unsure if they support reseller accounts.
Mercutio
05-10-2007, 04:32 PM
Dreamhost, at least my server, has issues with oversold capacity.
We have tons of disk space, and awesome bandwidth, but my server has 1107 accounts listed in /etc/passwd and every time I log in and look, load averages are way high (4.5 - 6 is common), making shell access unresponsive and a little frustrating.
Dreamhost more or less does not limit disk usage; most anything I'd want as far as software is already installed. Dreamhost DOES put quotas on CPU time. Tons of php and MySQL might stretch what's allowed, were we to move there.
I'm more than willing to give Handy a sub-account off my hosting, if he wants to try it out.
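A rough sketch of the kind of check described above (counting accounts in `/etc/passwd` and looking at load averages), in case anyone wants to size up a shared box themselves. The function names are made up for illustration, `os.getloadavg()` is Unix only, and dividing load by core count is just a rule of thumb: a value well above 1.0 roughly means more runnable work than the CPUs can handle.

```python
import os


def count_accounts(passwd_text):
    """Count entries in passwd-format text, skipping blanks and comments."""
    lines = (line.strip() for line in passwd_text.splitlines())
    return sum(1 for line in lines if line and not line.startswith("#"))


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of cores (at least 1)."""
    return load_avg / max(cores, 1)


def report(passwd_path="/etc/passwd"):
    """Summarize account count and 1-minute load per core (Unix only)."""
    with open(passwd_path) as f:
        accounts = count_accounts(f.read())
    one_min, _five_min, _fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    return "accounts: %d, 1-min load per core: %.2f" % (
        accounts, load_per_core(one_min, cores))
```

Note that `/etc/passwd` also lists system accounts, so the count overstates the number of hosting customers somewhat.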
LOST6200
05-30-2007, 09:28 PM
The site was conking out today too. :( Fortunately I used a clipboard.
CougTek
05-31-2007, 10:27 AM
I too noticed that the site was erratic yesterday in the late afternoon / early evening.
Stereodude
06-05-2007, 07:48 PM
I noticed a few hiccups today...
sechs
06-06-2007, 11:26 PM
Ditto
ddrueding
06-18-2007, 01:12 AM
Considering that there are no new posts, I can assume the site was down for everyone?
P5-133XL
06-18-2007, 01:59 AM
It was down for me all day and when I did get on (just now), it was very slow to load ...
It was down for me all day and when I did get on (just now), it was very slow to load ...
Ditto
Bozo :joker:
Handruin
06-19-2007, 10:24 PM
Yes, the site was down due to a failed hard drive on the system. They replaced the drive and reloaded the OS. The data drive was fine apparently.
ddrueding
06-19-2007, 10:29 PM
Sweet. Thanks for the update.
Handruin
06-19-2007, 10:33 PM
I kind of assume that the file system errors the drive had several days back were an indication of problems to come. They back up data daily (as do I on my own) but you never know when the system might go down. I still feel like SF is a bit poky in performance.
Yes, the site was down due to a failed hard drive on the system. They replaced the drive and reloaded the OS. The data drive was fine apparently.
No RAID setup???
Bozo :joker:
Handruin
06-20-2007, 10:07 AM
No, the server we're on has never had RAID 1.
Also, the HTTP server failed this morning and I had to ask them to restart it (they didn't say why it died). I've also put in a ticket to be transferred off this server and onto one of their new ones. The new systems are Core 2 Duos and actually do have RAID 1 configured. I'll let you know what happens when I hear back from them.
Site's been very, very close to unusable for me. Just waited an hour or so before I could even load a page.
Must be a lot of failing disk drives ...
Handruin
06-20-2007, 10:44 AM
The site is inconsistently slow today. I've been noticing double posts from members between yesterday and today. I'll open another ticket; this is ridiculous.
timwhit
06-20-2007, 10:53 AM
Maybe it's time to switch to Dreamhost, or some other provider? I would be willing to kick in money to host the site and I bet a lot of other regulars would also be willing. But, it is still Doug that has to do the work of moving everything, so it is ultimately up to him.
Handruin
06-20-2007, 11:05 AM
I'm home sick today so I have a bunch of time to kill. Let me search around and see what I can find. I need a place that can provide us with SSH access which seems to be problematic.
CougTek
06-20-2007, 03:35 PM
Go Daddy! ;-)
P5-133XL
06-20-2007, 03:54 PM
The site continues to run poorly when replying, thereby creating duplicate posts because the poster doesn't believe it went through and tries again.
Handruin
06-20-2007, 04:32 PM
Go Daddy! ;-)
I'll be avoiding them. During many of my searches today, many separate sites said to avoid them as a hosting service.
Handruin
06-20-2007, 04:42 PM
The site continues to run poorly when replying, thereby creating duplicate posts because the poster doesn't believe it went through and tries again.
I know, I'm seeing it. I've complained twice so far today and they're telling me the page loads fine. Just a few minutes ago apache died again. I really am looking for a new host. There are literally hundreds to read through and EVERYONE has an opinion.
LunarMist
06-23-2007, 10:42 PM
The site was down for quite some time again. :(
ddrueding
08-03-2007, 03:08 AM
Pulling a late night to move the server? That is dedication ;)
What's the status?
Handruin
08-03-2007, 07:47 AM
I want to bash some heads. Sorry about the downtime. :bibber:
Apparently they moved everyone else last night and since we moved a few weeks ago, they copied the DNS information again and overwrote ours. They sent us back to the old server for a short period of time; that's why you saw the old message that we were moving servers. Nothing was lost, but they screwed up.
I finally closed on my house and I'm looking for a new place...hopefully I'll be able to get us into a new situation sooner rather than later.
ddrueding
08-03-2007, 11:30 AM
Dude, you've got plenty of stuff going on. Don't sweat it.
LunarMist
02-10-2008, 09:30 AM
Was the site down for several hours yesterday?
Handruin
02-10-2008, 09:35 AM
Sorry, I don't know. I didn't visit the site yesterday.
Stereodude
02-10-2008, 08:26 PM
Was the site down for several hours yesterday?
I think it was. I couldn't get to it either.
ddrueding
02-10-2008, 09:32 PM
I tried once and it was down. An hour later it was back. Dunno.
LunarMist
03-19-2008, 12:18 AM
It is acting up again today.
For the last two days the site has been agonizingly slow to load.
Anybody else having any problems?
Also, the smiley faces keep changing in the window to the right of the reply window, each time I open the reply page. Strange.
Bozo :joker:
LunarMist
03-22-2008, 08:53 AM
Yes. It is a continual struggle to load pages.
Fushigi
03-22-2008, 09:39 AM
Very slow.
Handruin
03-22-2008, 10:51 AM
I opened a ticket a couple hours ago and this was the response I got:
We are currently running the initial backup on the new server, which has to copy all the data. After this is done the backup system will only copy files that change. Unfortunately this initial backup takes some time, about another 12 hours.
Please kindly bear with us.
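The "initial full copy, then only changed files" scheme in that response can be sketched roughly as below. This is only an illustration of the general idea, not the host's actual backup system; the function names are hypothetical and change detection here is a simple size and modification-time comparison.

```python
import os
import shutil


def needs_copy(src, dst):
    """True if dst is missing, differs in size, or is older than src."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) > int(d.st_mtime)


def incremental_backup(src_dir, dst_dir):
    """Copy only new or changed files; return the relative paths copied."""
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.normpath(os.path.join(dst_dir, rel_root, name))
            if needs_copy(src, dst):
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copy2 preserves the modification time
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    return copied
```

The first run against an empty destination copies everything, which is the slow part; later runs only touch files that changed.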
ddrueding
03-22-2008, 02:01 PM
Wow. 50 hours is one hell of a backup.
Handruin
03-22-2008, 07:16 PM
I guess they're thorough.
CougTek
08-10-2011, 08:11 PM
Read thread title, apply today. I'm starting to wish Doug doesn't like Utah too much ;-)
Handruin
08-10-2011, 08:30 PM
I like Utah a lot. I'll be sad when I have to leave. I've been traveling through Vegas, Arizona, and now I'm in Utah working my way back to Vegas. We've put 1600 miles on the rental so far. It has had a rough ride with us taking it onto several questionable dirt roads. :-)
Anyway, the site was down due to a failure at the power plant that provides power to our host, which is in Colo4 in Texas. I guess they didn't have generators or other forms of backup. Lots of KnownHost customers were down with us during this time.
LunarMist
08-10-2011, 08:37 PM
Heh. Now you see why I go there almost every year. I did 2400 miles in 8 days, practically killing ourselves this year. I try to avoid the mid-summer crowds though.
Handruin
08-10-2011, 08:46 PM
I did not know you visited Utah every year. I've been here once before back in 2006 and saw only a little of Monument Valley.
This time I'm happy to say that I've made use of the seasonal national park pass and it's covered the cost ($80). We've visited so many parks throughout southern Utah. I'm currently on my way to Zion national park.
Also, crowded you say? I can see some humor in that considering how few people are in this area. :-D. It has been crowded with wild animals.
LunarMist
08-10-2011, 09:02 PM
I did not know you visited Utah every year. I've been here once before back in 2006 and saw only a little of Monument Valley.
This time I'm happy to say that I've made use of the seasonal national park pass and it's covered the cost ($80). We've visited so many parks throughout southern Utah. I'm currently on my way to Zion national park.
Also, crowded you say? I can see some humor in that considering how few people are in this area. :-D. It has been crowded with wild animals.
Well I don't mean the whole state of Utah and neighboring states, but some of the parks can be relatively crowded compared to May and September for example. It's not like Yellowstone or places like that in peak season. Of course the places that require a 4x4 see less traffic and some of the national monuments and BLM land have very few visitors.
Mercutio
08-10-2011, 09:27 PM
What the hell kind of datacenter doesn't have backup generators?
Handruin
08-10-2011, 10:08 PM
What the hell kind of datacenter doesn't have backup generators?
You and 40+ other people I read in the hosting company's support forums asked the same thing. They didn't get feedback as to why there was no APC or generators that supported the place during the outage.
sdbardwick
08-10-2011, 10:18 PM
Especially considering their front page touts their power arrangements:
Colo4’s true 2N power infrastructure ensures redundancy and reliability for customers with the most demanding power requirements.
11.1MW of utility power
Spare 3.75 MW transformer on site
250 watts/sq ft
Four (4) autonomous N+1 power plants delivering true A & B power supply
Six (6) backup diesel generators on standby
Generators tested bi-weekly and routinely run at full load
Multiple voltages available including 110v, 208v, and three phase 208v
On-site staff constantly inspect and monitor power equipment
If commercial power fails, automatic transfer switches (ATS) immediately activate generators, monitor electric feeds and transfer stabilized loads to generators
Handruin
08-10-2011, 11:53 PM
If anyone is interested, here is the sequence of events that occurred throughout the day causing the downtime.
https://accounts.colo4.com/status/
LunarMist
08-12-2011, 07:29 PM
So the ATS are not easily replaceable. :doh:
Handruin
08-14-2011, 08:06 PM
Here is a followup from the event that occurred:
Reason for Outage Follow-up (8/10/11)
Dear Colo4 Customers,
Thank you for your patience and understanding with our equipment failure this week. We apologize for the disruption to your business and the stress and frustration that you experienced. As promised, we have compiled this Reason for Outage report as part of our after-action assessment.
What Happened: On Wednesday, August 10, 2011 at 11:01AM CDT, the Colo4 facility at 3000 Irving Boulevard experienced an equipment failure with one of the automatic transfer switches (ATS) at service entrance #2, which supports some of our long-term customers. The ATS device was damaged and did not allow either commercial or generator power automatically -- or through bypass mode. Thus, to restore the power connection, a temporary replacement ATS was required to be put into service.
Colo4’s standard redundant power offering has commercial power backed up by diesel generator and UPS. Each of our six ATSs reports to its own generator and service entrance. The five other ATSs and service entrances at the facility were unaffected.
The ATS failure at service entrance #2 affected customers who had single circuit connectivity (one power supply). For customers who had redundant circuits (or A/B dual power supplies), they access two ATS switches, so the B circuit automatically handled the load. (A few customers with A/B power experienced initial downtime due to a separate switch that was connected to two PDUs and the same service entrance. Power was quickly restored.)
Response Actions: As soon as this incident occurred we worked to mobilize the proper professionals in our facility and extended team. Our on-site electrical contractors and technical team worked quickly with the general contractors and UPS contractors to assess the situation and determine the fastest course of action to bring customers back online.
As part of our protocol, we first conducted a thorough check of the affected ATS as well as the supporting PDU, UPS, transformer, generator, service entrance, HVAC, and electrical. It was determined that all other equipment was functioning properly and that the failure was limited to the ATS device. This step was important for us to ensure that the problem did not affect other equipment or replicate at other service entrances.
It was further determined that the ATS would need extensive repairs and that the best scenario for our customers would be to install a temporary ATS. As the ATS changeover involved high-voltage power, it was important that we moved cautiously and deliberately to ensure the safety of our employees, contractors and customers in the building as well as our customers’ equipment. Safely bringing the new unit online was our top priority.
After the temporary ATS was installed and tested, the team brought up the HVAC, UPS and PDU individually to ensure that there was no damage to those devices. Then, the team restored power to customer equipment. Power was restored as of 6:31PM CDT.
The UPSs were placed in bypass mode on the diesel generator to allow the batteries to fully charge. The transition from diesel generator to commercial power occurred at 9:00PM CDT with no customer impact.
Colo4 technicians worked with customers to help bring any equipment online that did not come back on with the power restore or to help reset devices where breakers tripped during the power restoration. This process continued throughout the evening.
Assessment: As part of our after-action assessment, the Colo4 management team has debriefed with all on-site technical team and electrical contractors as well as the equipment manufacturer, UPS contractors and general contractors to provide assessments on the ATS failure. While an ATS failure is rare, it is even rarer for an ATS to fail and not allow it to go into bypass mode.
While the ATS could be repaired, we made the decision to order a new replacement ATS. This is certainly a more expensive option, but it is the option that provides the best solution for the long-term stability for our customers.
Lessons Learned: Thankfully we’ve experienced few issues during our 11 years in business though any issue is one too many. As part of our after-action review, we have made additional improvements to our existing emergency/disaster recovery plans.
Our technical team, HVAC, electrical and general contractors brought exceptionally fast, sophisticated thinking and action to get our customers back in business as quickly as possible. The complexity of working with power of that size and scale at any time, but especially under pressure, shows the level of merit, knowledge and resolve that these individuals have. Thank you to the technical team and all our contractors for a job well done to safely restore power for our customers.
As part of the debrief, all Colo4 network gear in both facilities was checked to ensure all equipment was on redundant power, and all is connected properly.
Unfortunately, we weren’t well prepared on the customer service side. Our customers were stressed and needed more frequent updates from us along the way. We very much wanted to provide you with an ETA earlier. Due to the extent and complexity of the failure, we were unable to provide a proper ETA quickly and did not want to send out false information or set the wrong expectation.
For any future scenarios, we plan to provide process updates along the way even if we are unable to provide an exact ETA at that moment. We hope that this step will provide insight into the assessment period efforts that are occurring.
We will continue to send direct emails to affected customers and post website status updates. As the website received heavy hits during the incident, we are upgrading the website server to better handle requests. Based on our web server stats for the past year, the server had excellent capacity, but in this case, we experienced a heavier load from our customers and our customers’ customers. We will move some equipment to secondary offsite locations.
We’ve also set up a Twitter account @colo4 to post future updates and more timely responses. As you may have noticed, we began using Twitter during that afternoon.
Next Steps: Once we receive and test the new ATS, we will schedule a maintenance window to replace the equipment. We will provide at least three days advance notice and timelines to minimize any disruption.
Thank you again for your patience and understanding. We take our relationships very seriously and realize that you rely on us to keep your business online. We’re sorry that our equipment failure caused challenges this week.
Please let us know if you have any questions or need assistance.
Sincerely,
Paul Estes, CEO
Paul VanMeter, CTO
ddrueding
08-14-2011, 08:23 PM
The switch to bypass on ours is mechanical (generator/commercial is electronic and automatic). Obviously several orders of magnitude smaller, but it struck me as the most reliable way to do it.
LunarMist
08-14-2011, 09:10 PM
I don't see a CAPA there. :(
ddrueding
08-14-2011, 09:37 PM
I don't see a CAPA there. :(
Corrective and preventive action?
Howell
08-14-2011, 10:56 PM
David, what are you bypassing then?
ddrueding
08-15-2011, 12:15 AM
We have a fairly large line-interactive UPS that lives between the generator/commercial power and our equipment. Typically it just runs the server room and a single outlet in each office dedicated to the computers, but it is capable of running the whole building (except the elevator, provided everyone turns off their damned space heaters) for at least 60 minutes. I've managed to reduce the server load so much through virtualization that the runtime on just the servers is colossal, even though the generator is programmed to start after just 30 seconds of downtime.
Howell
08-15-2011, 10:50 PM
So the switch will switch from the one outlet configuration to the whole building configuration?
ddrueding
08-15-2011, 10:52 PM
Honestly I'm not sure if it does everything in the building, but I know it does the lights.