I'd like to preface this by saying these comments, opinions, and views are my own and in no way represent my employer, coworkers, or anyone else.
[Edit: One of the downsides of running a beta version of Ghost is that there aren't any comments. If you would like to discuss, please head over here]
The Hazards of Proclaiming 'Fanatical Support'
A number of people have asked for a write-up on my troubles with Rackspace over the last month or so. I had originally written up a tirade during the height of the outage, but ultimately a cooler head prevailed and I never published it (of course, that didn't stop my Twitter rants). Now I've taken my time and written what I believe to be a fair account of the outages, Rackspace's delay in response, and ultimately the resolution.
Through this process, I've spent time reflecting on what exactly went wrong (hint: a technology issue was not the problem) as well as how a company soured a great relationship with a public advocate. Finally, I provide some recommendations on how Rackspace can improve their support and communication systems.
I work with a small company that builds realtime sound and networked communication systems for military and commercial aviation training (like this flight simulator).
In January of 2013, our Labs team started building a product evaluation platform for prospective customers utilizing public computing infrastructure. Essentially, users can register for an account and then launch a self-contained virtual machine running our product.
We built this evaluation platform to be vendor agnostic - AWS, Rackspace, Joyent, Digital Ocean - it doesn't really matter to us. However, since launching the service in March of 2013, we've come to love and rely on Rackspace's feature set (ahem, OpenStack), their support team, and most importantly, their Developer Relations Group.
Of course, since this evaluation platform is hosted in the public cloud, we are at the mercy of our datacenter providers. This means that we keep a close eye on API response times, Twitter/Blog/Status updates, IRC channels and more.
Most of the time we know when things are about to hit the fan and can respond accordingly. Regardless, I expect to be notified if there is an outage (no matter how small).
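To give a sense of what "responding accordingly" looks like on our side, here is a minimal sketch of the kind of stuck-build check our alerting runs. The thresholds, function name, and structure here are my own illustration, not part of any vendor API:

```python
from datetime import datetime, timedelta

# Builds on performance VMs normally finish in under 3 minutes;
# anything below 100% for longer than this window triggers an alert.
# These thresholds are illustrative, tuned to our own workload.
EXPECTED_BUILD = timedelta(minutes=3)
ALERT_AFTER = timedelta(minutes=10)

def build_is_stuck(started_at: datetime, progress: int, now: datetime) -> bool:
    """Flag a server build that has sat below 100% far longer than normal."""
    if progress >= 100:
        return False
    return now - started_at > ALERT_AFTER

# Example: an instance stuck at 10% for 20 minutes should raise an alert.
t0 = datetime(2014, 1, 30, 8, 0)
assert build_is_stuck(t0, 10, t0 + timedelta(minutes=20))
assert not build_is_stuck(t0, 100, t0 + timedelta(minutes=20))
```

In practice the `progress` and `started_at` values come from polling the provider's server-status API; the check itself is deliberately dumb so it never misses a hung build.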
Additionally, when we report an issue, I expect that our vendor will attempt to resolve the problem as soon as possible and maintain open lines of communication with us through the troubleshooting and fixing stages. This still holds true even though we are a small customer.
In my opinion, that's just plain good old fashioned customer service.
January 30th, 2014
I awoke to a priority email from our alert system. A VIP potential customer had just signed up for an account and was attempting to create their evaluation system in the IAD datacenter. Unfortunately, the instance they had created was stuck at 10% completion for 20+ minutes (this normally takes less than 3 minutes with Rackspace's performance VMs).
Our customer, clearly frustrated, attempted to create two additional instances, one in IAD and one in ORD. Both of these instances also appeared to be stuck. We reached out to the customer and notified him that our datacenter was having issues and that we would contact him as soon as the system returned to normal.
We contacted Rackspace support while creating several additional VMs to see if the issue persisted.
Here is the ticket based on my conversation with Rackspace Support text-chat - 8:37 AM PST:
Greeting! It was a pleasure chatting with you! The purpose of this ticket is because you reported not being able to create Cloud Servers via the API in the IAD datacenter for 20-30 minutes and you would like to know why. Unfortunately at the time you came into chat you were not able to provide any error messages that may have been returned. I'm going to have a higher-level technician look into this and advise.
Cheers, Pete M
Our other test VMs all were successfully created, so we reached out to our customer and resumed their evaluation.
Besides the initial contact with Rackspace Support, I heard absolutely nothing for the remainder of the day. I pinged some of the Rackspace Developer Relations Group folks that I work with. Ruben O. helped me out and looked into the support ticket. An engineer had privately commented on the ticket six hours earlier, but didn't make it visible. Ruben published the comment so I could see the content:
This server it looks like the image just didn't complete the downloading of the image. It disconnected during that process. This image it was not as clear as to why it failed. - Jim C.
So... it seems the engineer had no idea why the instance failed to come up and no one looked any further into the issue. Furthermore, no one bothered to reach out to the customer with any additional information.
Hi Ruben, Thanks for passing on Jim's comment. From what I understand you have no idea why this image failed (or, more specifically, why it wasn't authorized). That's not very comforting to me. Has anyone else run into this problem? Why did this create fail but subsequent ones succeeded? How are you going to proactively keep a watch out for this error again?
This was at the end of the business day so I didn't expect to hear more until the next day.
January 31st, 2014
The next morning, we preemptively tested creating instances in several datacenters. To our surprise, we encountered similar issues. I posted to the same support ticket:
So this 'Bad Luck' has struck us again. We had a server build this morning take 78 minutes - ca73b105-ecce-4ba8-be41-3926da4b7776. I can't confirm, but I suspect it came up in 'Error'. We have another instance that has been stuck at 10% for over 10 minutes [Was stuck for over 50+ minutes]. Both are from the same image that up until yesterday or so had no problems. This is unacceptable for VMs to not come up, let alone take 78 minutes for them to hit 'ERROR'.
I waited patiently for over 30 minutes for a second instance to come up before getting on the phone to Rackspace Support. I spoke with A.J. G. who did everything he could -- he could see the failure, but wasn't sure why. He raised the issue with their infrastructure team and noted on the ticket that I would like a call back when they have more information.
Hello Ross, Thanks for call! You are still having issues spinning up servers on ORD and IAD. When you called you were spinning up a server called admin-testinsta-insta in IAD, however you had a build failure prior to that as well. The Two servers mentioned in the updates above point to XXXX that errored in ORD, and YYYY in IAD that seemed to have completed after 30 min, but the logs are inconclusive. The main problem is that you have customers needing these servers to test your appllication, however since they are not building it is causing lots of frustration on their end. Normally you see 3 minute build times, but you need a solution for the long build times and error states. Im sending this to our infrastructure admins to investigate this issue, and you would like a call back at XXX-YYY-ZZZZ when we have more information for you. In the mean time, if you have any questions or need anything else, please let us know! Thanks, A.J. G
Again, I waited by the phone for someone from Rackspace to contact me.
February 2nd, 2014
Three days later, someone finally posted a comment on the issue:
Hey Ross, The failure associated with your server in ORD is due to the issues we are seeing in Cloud Files as mentioned here: https://status.rackspace.com/ The failure associated with the IAD server is no longer on record as it was deleted. However, if you feel the error is consistent and has occurred more than once in the last week we can look in to any issues on the back end. However, we would need another failure to analyze. If you'd like to build another IAD server from the image and it fails we can get a closer look to see what might have caused it. Thank you for your patience and please let us know if you have additional questions or concerns. Sincerely, Sean
At this point, I should note that the Cloud Files issue Sean referred to was going to be an ongoing problem for the entire month of February. From my point of view, Rackspace was telling us that we would have intermittent failures for 30+ days in two datacenters. Furthermore, the fact that IAD had failed for us two days in a row clearly didn't seem to be a high priority for Rackspace.
Hi Sean, I tried creating a few instances in IAD and they all came up pretty quickly. However, this has been happening for the second week now across datacenters, consistently enough for us to no longer trust the Rackspace platform. Just this morning, I created an instance in ORD, 'rossk-testerc' at 10:10 AM PST. It's now 10:55 AM PST and the build is still stuck at 10%. The two previous ones (rossk-testera &b) came up just fine.
[Note: rossk-testerc managed to finally finish booting after 1 hour 59 minutes.]
February 5th, 2014
After another three days without a response from Rackspace support, I updated the ticket:
Feb 5, 2014
Hello Rackspace, It's been almost 3 days now since I've heard an official update from Rackspace. Could some please let me know what's going on?
and then initiated another multi-tweet Twitter rant about the lack of response.
As usual, Rackspace folks were quick to respond to public complaints -- I had Jesse N., Chris L., Liz J., Ruben O., and Ken P., all jump up and get involved.
Not long after, Elaine (who is our account manager), updated the ticket:
Hi Ross, Any build failure within ORD is likely due to Cloud Files issues mentioned on our status page. The one error "514f2c7a-e963-4a11-8a97-c8f59f34b9c8" failed because Cloud Files couldn't deliver the image successfully. It is currently estimated that this will be corrected by the end of this month. If you encounter this in our IAD DC please feel free to call us immediately so that we can investigate on the back end. I do apologize the recent experiences, however as noted we are performing a maintenance to mitigate the issues. Elaine L.
I appreciated her response, but I was still very frustrated with the overall poor response. I let loose with a cheeky reply:
We have ultimately decided to no longer provide our users the ability to create instances in ORD or DFW. The Cloud Files is a blocking issue for us that in our minds is unacceptable.
Rackspace's 'Fanatical Support' is far from it. It has been like pulling teeth to get a response back from you on this issue. Consistently I had to raise hell on Twitter and with your developer relations group to get any sort of response. If I wanted to have a cheaper cloud provider with bad support I would have gone with AWS. I pay more with Rackspace for the high-quality support. If we don't get that then there's no reason for us to stay.
At this point our account managers Elaine and Tremain scheduled a call with me later in February to discuss this issue as well as some other feedback I had provided on their Role Based Authentication and Private Cloud offerings.
Ken P. of the Developer Relations Group reached out to me. I have had the pleasure of working with Ken on some open source projects, so our friendship extends beyond that of vendor/customer. Ken was very apologetic for the issues that we had faced. He couldn't go into details, but he said that Rackspace was taking the communication breakdown and technical issues very seriously. I considered this a good sign, but reminded him that the technical issues were still unresolved and that I had yet to have an official response from Rackspace.
I had a really great call with Elaine, Tremain, James C., and some higher-ups on the RBAC and Infrastructure teams. It was at this time that I was officially notified that Rackspace had discovered an issue in their hypervisor, which had caused our problems in the ORD datacenter. Earlier that day (?) they had posted a notice about it to the status page.
Elaine explained that this bug was discovered and resolved largely due to my persistence. I was thanked for my help. (Yay?) I received an official apology for the problems that we had faced and was also encouraged to directly contact my account team (Elaine & Tremain) should we have any additional problems. While it's great to now have access to their 'marquee' support team, I know all too well that most customers probably deal with spotty support all the time.
Of course, there still was no explanation for the IAD problems. We hadn't seen an issue since February 2nd, so we chalked it up to being 'unlucky' (yeah, I don't believe that for a second).
After the call I spent some time reflecting on the whole experience. Elaine and others on the call had done a remarkable job turning around my opinion of the company. I couldn't help but feel that Rackspace excels at hiring intelligent, caring people and putting them directly in front of customers. Having an outstanding outreach team will do wonders for any company.
But it was painfully obvious that Rackspace had been enduring growing pains. Communication between different groups was clearly lacking. Front-line support engineers have to fight pretty hard up the chain to get any sort of response from the infrastructure teams. Additionally, there doesn't seem to be any continuity between support shifts. If your ticket isn't resolved during a shift, don't expect a response in a timely fashion (and don't get me started on the support folks who will do whatever they can to close a ticket ASAP).
The Developer Relations Group is clearly an important team that brings in business and tries hard to support advanced customers but fundamentally they are external to the core Rackspace organization. As such, they also have to really shake the tree to get help. If anything, I think they should have additional sway so they can fight on behalf of their API 'customers' like us.
I also feel that there is a notable lack of communication between the operations folks and customers. Rackspace does have their status page, but I've largely found it to be lagging significantly behind 'now'.
I understand that from a business point of view, no cloud vendor wants to have the appearance of consistently having 'outages'. But as a customer who is programmatically utilizing their services, I need to know ASAP when something might be going pear-shaped. I want to know that some percentage of VMs in a certain datacenter are taking longer than usual to boot. Empower your customers with knowledge so they can better serve their customers. THAT is fanatical support.
In a similar vein, it is frustrating that there is no simple programmatic way of consuming their status page. Before I complained 8+ months ago, they didn't even have an RSS feed for the page. A half-decent engineer could put together a prototype JSON REST API for the page in a half-day. Give them the rest of the week to productionize the module and you've added real value for customers using your APIs.
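To show how little work that prototype would be, here is a rough sketch that flattens an RSS-style status feed into JSON a script can poll. The sample feed, incident text, and field names are all invented for illustration; a real implementation would fetch the provider's actual feed:

```python
import json
import xml.etree.ElementTree as ET

# Stand-in payload shaped like a provider status RSS feed.
# The incidents and wording here are made up for this example.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item>
    <title>Cloud Files degraded performance - ORD</title>
    <pubDate>Sun, 02 Feb 2014 18:00:00 GMT</pubDate>
    <description>Intermittent image download failures.</description>
  </item>
  <item>
    <title>All systems normal - IAD</title>
    <pubDate>Wed, 05 Feb 2014 09:00:00 GMT</pubDate>
    <description>No known issues.</description>
  </item>
</channel></rss>"""

def rss_to_json(rss_text: str) -> str:
    """Flatten RSS <item> entries into a JSON list that scripts can consume."""
    root = ET.fromstring(rss_text)
    items = [
        {
            "title": item.findtext("title"),
            "published": item.findtext("pubDate"),
            "summary": item.findtext("description"),
        }
        for item in root.iter("item")
    ]
    return json.dumps(items, indent=2)

events = json.loads(rss_to_json(SAMPLE_RSS))
assert events[0]["title"].startswith("Cloud Files")
```

That's the whole prototype: parse the feed, emit JSON, put it behind an endpoint. The rest of the week really would be spent on caching and uptime, not on the transformation itself.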
The most disheartening thing is that over a 30-day period Rackspace lost a public advocate. Before these events I was a vocal supporter of Rackspace. I championed Rackspace improvements and features on Twitter. I recommended their services to companies and startups I interacted with. I highlighted my use of their services in presentations for the BayNode meetup I help organize.
Now, I view Rackspace as just another cloud vendor. Their 'fanatical' support is simply a marketing term, not a reality. Most importantly, I'm sad to have been let down by a company with great people but poor organizational communication.
While I was finishing up this post, we encountered more issues with VMs taking a long time to boot in IAD. After speaking with support, it seems this is a 'known issue', but not something publicly documented. I can't even begin to explain how frustrating it is to hear that phrase.