Strongspace's 10-Day Crash Highlights Web Storage Risks

For the last 10 days, Sausalito, CA-based online document and storage hosting company Joyent has struggled to get its secure online document collaboration service, Strongspace, back online.

During the outage, the company's clients had no access to their hosted documents--leaving some IT pros to wonder whether the features of such online collaboration services are worth the risks.

A Bad Week Gets Worse
It all started Saturday, Jan. 12, when the company's two online repositories--BingoDisk, for data storage, and Strongspace--went down owing to problems with the Sun Fire X4500 server they ran on.

Joyent CEO David Young started a Jan. 16 post announcing the crash by joking that it "was not the week to stop drinking."

"We got bit by a massive ZFS bug. That's the long and the short of it," he wrote in a follow-up post. "The good news is we can unravel the corruption. The bad news, given the fact that Strongspace and BingoDisk ran on a Thumper (aka SunFire X4500) (48 500GiB drives), was that we have to use other Thumpers to stage the uncoding of the ZFS mess. Moving so much data around to decode the ZFS corruption has taken time."

And it did. While BingoDisk went back online Friday, problems with Strongspace continued. On Sunday, the company sent an e-mail to customers stating that the service was back up, only to send another e-mail today saying that it was back down. Several "up" and "down" notifications on the company's Web site and blogs followed, with estimates for the service being restored "late Monday afternoon."

The service was down around 3 p.m. PST Monday, then appeared to be back up a short time later. A post made late Monday said that the servers were back up and being watched "closely."

"I'll put it back into production for a period of 24 hours and we'll watch it closely," a company tech stated in the latest post. "The hope will be that ... Tuesday night we can do a clean shutdown and be beyond this silliness. Fingers crossed."

Joyent did not respond to our request for verification of the service's current status and comment on the situation by press time.

Too Late?
While Joyent has repeatedly assured customers that no data has been lost--it praised ZFS highly for its work in helping fix the problem--the amount of time the service was out, combined with the up-and-down nature of the restoration, appears to have shaken some customers.

"While I appreciate the hard work you guys are doing to get everything back online, I'm starting to find it unacceptable that we've unable to use the service for over a week," wrote one user on the company's blog. "We rely on the service for client data transfers that are critical to our business. When our non-technical clients ask why there seems to be no redundancy built into a service that is likely used by many for business critical purposes, I find myself with no explanation."

"What a mess. Please, just fix it or simply admit that you cannot. This has been going on for ONE WEEK," wrote another.

Another blog poster wrote that while he's relatively happy with how Joyent has communicated about the problems, "the false starts are unfortunate...in a situation like this, given everything that's happened, I would expect them to fully test things (and double check them) before claiming that things are back to normal."

An IT executive and Joyent customer, who asked not to be named, told us that the whole experience is "just unacceptable," leaving him to question his company's future use of online document hosting--whether from Joyent or others.

"How could they allow this to happen?" he questioned. "This is really bad--you can't even access your account, let alone the documents."

While his company is not hosting any current projects on the service, it did a few months ago, and he said if this outage had happened then, "We'd be dead in the water."

"Obviously, it changes my initial view that this was a very secure, high-availability solution for remote storage," he stated, adding that the situation is a game-changer for any IT professional looking at this or similar solutions: "[IT professionals are] going to have to ... treat [hosted services] much like you would internal storage, and have a backup and recovery plan, because it's clear these types of vendors face internally the same IT issues [as] internal IT departments."

Young posted that Joyent customers will receive "generous" compensation for the outages; details have yet to be released.

In the meantime, customers located near Joyent's office may want to cruise by: The company has a standing offer to give any customer who drops in a Joyent T-shirt plus "a fine whiskey, a glass of Pernod, or for you lightweights ... a good old bottle of water."

Considering what's happened during the last week, it shouldn't be too hard to convince Joyent to give you a double--or to find someone to join you.
