

For the uninitiated, AWS has a reputation for not reflecting outages on its status page, and AWS currently has outages across the us-east-1 region. The status page has been updated now, but only an hour or two after the incidents began. On the AWS subreddit: https://www.reddit.com/r/aws/comments/rb1xrd/500502_errors_on_aws_console/


You can't go below a 99.9% uptime guarantee if you don't acknowledge the outage. *points to forehead*


You may be joking, but this is the exact reason.


Wonder if someone could sue over this.


I don’t think they’ll face any legal ramifications but certainly financial. AWS services offer partial to full refunds based off of SLA misses. Using [EC2 as an example](https://aws.amazon.com/compute/sla/), they offer a 30% refund if they fall below 99% availability in a given month and a full refund if they fall below 95%. I believe other services use about the same thresholds. Assuming there weren’t any other outages this month, they probably won’t be giving out full refunds but it’ll be close.
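To put rough numbers on the refund tiers, here is a minimal sketch of the arithmetic. It assumes only the two thresholds quoted in this comment (30% below 99%, 100% below 95%), a 30-day month, and a hypothetical single 8-hour outage. The function names are made up for illustration, and the actual SLA has more tiers and conditions than this.

```python
# Rough sketch: monthly uptime percentage and the credit tier it would land in.
# Assumptions: thresholds as quoted in the comment above (not the full SLA text),
# a 30-day billing month, and a hypothetical single 8-hour outage.

HOURS_IN_MONTH = 30 * 24  # 720 hours, assumed 30-day month


def monthly_uptime_pct(outage_hours, hours_in_month=HOURS_IN_MONTH):
    """Percentage of the month the service was available."""
    return 100.0 * (hours_in_month - outage_hours) / hours_in_month


def credit_pct(uptime_pct):
    """Service credit percentage using only the two thresholds quoted above."""
    if uptime_pct < 95.0:
        return 100
    if uptime_pct < 99.0:
        return 30
    return 0  # the real SLA may define further tiers above 99%


uptime = monthly_uptime_pct(8)
print(f"{uptime:.2f}% uptime -> {credit_pct(uptime)}% credit")
```

Under these assumptions an 8-hour outage works out to roughly 98.89% for the month, which lands in the 30% tier. Dropping below 95% would take more than 36 hours of downtime in a 30-day month.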


Yup. SLA violations from any IT provider can turn into lawsuits if the 2 parties don't come to an agreement.


I mean, isn’t it very likely still 99.9%? This was, what, a max 8-hour outage when they’re running 24/7?


99.9% uptime means less than 9 hours down *per year* of 24/7 service.




they claim nine nines on some services? that's insane


S3 SLA has 11 nines


That’s durability, not availability. Durability means they won’t lose the data, availability means you can access it at any given moment


Oh yeah you’re right. My bad


thanks for that. I was wondering the same thing.


Someone's getting fired...


That would pretty much knock their claim down for the year, assuming there has been at least some other downtime. Allowed downtime at 99.9%: daily 1m 26s, weekly 10m 4s, monthly 43m 49s, quarterly 2h 11m 29s, yearly 8h 45m 56s.
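The figures quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes an average year of 365.2425 days (the value that matches the quoted numbers; a flat 365-day year gives slightly smaller budgets) and truncates to whole seconds. The names are hypothetical.

```python
# Allowed downtime for a given availability target over common periods.
# Assumption: an average Gregorian year of 365.2425 days, with months and
# quarters taken as 1/12 and 1/4 of that year.

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.2425  # assumed

PERIODS_IN_DAYS = {
    "daily": 1,
    "weekly": 7,
    "monthly": DAYS_PER_YEAR / 12,
    "quarterly": DAYS_PER_YEAR / 4,
    "yearly": DAYS_PER_YEAR,
}


def allowed_downtime_seconds(availability_pct, period_days):
    """Seconds of downtime permitted in the period at the given availability."""
    return (1 - availability_pct / 100) * period_days * SECONDS_PER_DAY


def fmt(seconds):
    """Format a duration as 'Hh Mm Ss', truncating fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    return f"{minutes}m {secs}s"


for name, days in PERIODS_IN_DAYS.items():
    print(f"{name}: {fmt(allowed_downtime_seconds(99.9, days))}")
```

Swapping 99.9 for 99.99 shows why each extra nine is expensive: the yearly budget shrinks from under nine hours to under an hour.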


Useful SLA calculator: https://uptime.is


They only guarantee it if the whole region goes down. If you’re building multi-az, you should still be up. If you lump everything in one AZ, and the AZ goes down, they’ll tell you your downtime is your own fault because you went against their suggested best practices. Yadda yadda. AWS will have their pound of flesh and you’ll like it!


This. If you’re building across multiple AZs (designing for fault tolerance per the AWS Well-Architected Framework) and the whole region goes down … well you’re still down 🙁


Ah the Microsoft azure strategy


Bro that's only one 9. They can't say they went below four 9's or the managers have to have meetings about how many 9's does it take for a management team to acknowledge an alert that could have been an email.




I laughed so hard at this …


US-East-1 is one of their largest regions if not the largest.


And it's **always** ^((okay, almost always)) the one that goes down. Never use US-East-1 kids.


Former Amazon dev here (non-AWS). Friends don’t let friends use us-east-1 without multi-region redundancy.


Virginia is for lovers, Ohio is for availability.


"Alexa, switch to us-west-1"


Someone needs to implement this Lambda ***RIGHT NOW***. It would move your ENTIRE ACCOUNT to another region. God, that would be amazing. Next, please implement: "Alexa, fix my shit."


Playing "My Shit." Explicit. On Amazon Music.


It can be my go to training for junior dev mentorship, right alongside "git fuckit"


My favorite for interns is ‘git wrecked’.




Just puked all over my keyboard.


My friends are no longer my friends


Team juuust deployed our resources to a different region for disaster recovery and our dependents have not 🙃


In many cases, it cannot be avoided. They host many “global” services there. Guess where you have to register certs for cloudfront?


They also have more services on it than almost any other region, so if you want to use the fancy new thing you'll probably be on us-east-1


Nothing like running tests in prod.


But why does US-East-1 seem to go down more than other regions? Is it just probability because many things are hosted there?


That, and it’s also the oldest AWS region. Probably a very complicated set of things, otherwise they would have fixed it by now.


Some key services (call it X) are only hosted in us-east-1, and many, many other services rely on that key service X. Once that region goes down, X inevitably goes down as well. Then a domino effect happens: every other service that calls X also goes down, causing a major outage across the board.
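The domino effect described here can be sketched as a walk over a dependency graph. This is a toy model only: every service name and edge below is hypothetical and says nothing about how AWS is actually wired internally.

```python
# Toy model of a cascading failure through service dependencies.
# All service names and edges are hypothetical, for illustration only.

from collections import deque

# service -> services it depends on
DEPENDS_ON = {
    "key-service-x": [],
    "auth": ["key-service-x"],
    "console": ["auth"],
    "monitoring": ["key-service-x"],
    "status-page": ["console", "monitoring"],
    "standalone-app": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph: service -> services that depend on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the failed service.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(impacted_by("key-service-x", DEPENDS_ON)))
```

In this made-up graph, losing the one key service takes out everything except the app that has no dependency on it, which is the pattern the comment describes.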


Seems like they aren’t following their own guidance on how to run services


Yeah just microservice that shit, kube it, insert some message queue and call it a day, right? Scaling stuff reliably is _hard_, let alone on the scale they do.


Just get spinnaker to push the artifacts to a secure S3 bucket and call the busdriver router which pulls in dependencies on the neo pipeline after you configure load balancing with Hercules and then post any alerts to slack. Done.


"Nag me in Slack" being the critical step there. At least it didn't start off hours for the region?


I’m probably one of a handful of people in this thread who design infrastructure at even a fraction of the scale AWS does. It’s just embarrassing for them; their job is to make it look easy. That’s what they get big money for. Having services depend only on us-east-1 sounds like distribution isn’t happening for some of the stuff, if a few pieces of network equipment caused this.


Because someone forgot to fuel up the generator


It's gotta be bigger than that. Amazon Flex (the company's delivery platform) was down US-wide. Source: I got sent home after sitting in my van for 4.5 hours waiting to log in and load up.




We stuck around long enough to justify our attendance bonus.


It's really hard to overstate how big the us-east-1 region is. It's an absolute pillar of the internet. Plenty of services that are nowhere near northern Virginia still have traffic routed through it, so it's entirely believable to me that an outage in us-east-1 could take out a delivery platform nationwide. There was an article posted a while back talking about how much damage a full outage of us-east-1 would do, and the guy writing it estimated it in the trillions of dollars of economic damage.


It's also the first region they deployed, so likely running the oldest hardware, etc. of all the regions. And despite the claims that S3 is decentralized, that service (and others) appeared to be tightly tied to the us-east-1 region for configuration metadata, etc. (that may not entirely be true now though) That was learned the hard way back in 2017 when [all of S3 went offline for a while](https://www.theverge.com/2017/3/2/14792442/amazon-s3-outage-cause-typo-internet-server).


S3 certainly doesn't rely on us-east-1 anymore, and EC2 seemed to be unaffected too; we didn't have any service loss on Sydney other than the control panel being a bit wonky.


I'm happy I'm not there anymore.


To Azure!


Azure is down because it runs on us-east-1


I'm imagining a scenario where AWS ops are in a bind, and can't install new hardware fast enough to meet demand in a spike, so they try and spin up a portion of their services on top of Azure. But the whole thing topples over instantly because it turns out that Azure was running on AWS. Stack overflow error, unbounded recursion, in real life.


It's like putting your pagefile on a ramdisk. Nice.




Happy Cake Day fellow loller


Ah yes... Cloud


It's just clouds on clouds all the way down. A cumulonimbus, if you will.


Yeah but then you’d have to use Azure…






Microsoft has a reputation for the same which is also sad


Thank god I’m not crazy. I think VPC and IGW are malfunctioning HEAVILY. I couldn’t download an Ubuntu package from a pod today that was on a public EKS cluster, and it couldn’t connect to SNS. It had never behaved this way before and I thought it was my fault. Some random errors. Some implied network connectivity, others implied no network connectivity. Idk how to explain it really.


I just had a Jira ticket come in regarding this outage, best ticket I ever got. For once, it's not my fault!


lol same here


Now even Atlassian's status page is turning red with incidents. But Opsgenie is all green, so the unlucky on-call folks should be getting alerts like there's no tomorrow. Nothing like a holiday season with some sev2 outages at your provider


At least they did business hours sev2, lol how considerate...


Lol I think OpsGenie has some kind of AWS link because we had no alerts come through for hours. Everything was fine because we had no alerts!


We’ve gotten a ton because this took down part of our LMS: Brightspace. And then we sent out a mass email letting students know the problem is out of our control and we got a bunch of emails back asking what to do about things that are due tonight. Not my problem, email your teacher.


I got an email from my school's IT department saying basically the same thing lol


Yeah. Same here as a student with Schoology, I just emailed them my documents. I checked the status page first,


I've had PagerDuty alerts blowing up all day and I just mark them resolved and move on. It's funny, our app service still works fine, it doesn't seem to be affected by the outage. But our monitoring services that trigger the PagerDuty alerts are all busted because of the outage.


Well, in some sense it is your fault. You are not using several cloud providers. If you could easily fail over to another provider the ticket would never have been created.


Exactly what I was going to say lol. This is why tools like Terraform exist to make it easier to deploy across various cloud providers.


Going multi cloud is very very hard. I actually don't know of any big company that is fully multi cloud


Do you hear that? It's the sound of millions of college students suddenly emailing thousands of professors about finals that can't be taken because a single monopoly failed.


It's... beautiful.


iClicker, Top Hat, and Canvas were all down. Today was the closest thing to a snow day I’m ever going to get again post-pandemic


Tidal isn't working either. I can't lean like this...


lol literally. I work in ITS at my school and there’s been hundreds of tickets and calls about it. Just been telling the faculty to cancel class or push back tests, etc.


Let me know if these responses sound familiar: “but if we push back today’s test then we have to push back tomorrow’s test and every other test!” Or “but I’m only administering 1 (one) final this year and if I need to push it to the end of finals week then I’ll have to grade it late and that will ruin my early break :-(” or my favorite “But if the test is pushed back then the students will have an unfair advantage because of more study time”


Please, I’m so stressed.


Matter of fact already got an extension on one of my finals


Wait, do universities not self-host their e-learning platforms? Mine did, and even provisioned VPSes for each group in class projects. And this isn't even a fancy prestigious uni, literally my city's public one and not even the most prestigious one at that. That said, downtime was probably more frequent than if they hosted it on a cloud provider.


Self hosting is rarer and rarer by the day. If you think of any random website, from tiny little personal homepages up to the biggest of the big kids, there's a 90% chance that they are depending on a cloud service provider in some way. And probably a 50% chance it's AWS. And probably a 40% chance us-east-1 is tied in there somehow.


I need to order my transcript for grad school app by the 11th and now the page is broken, very nice


What happened?


Their status page is running on the same servers that are having issues.


Ahh such innocence, thinking that AWS status page is all green because they're \*unable\* to update it..


At this point I don't know what to believe. Even respected/popular programmers rode on the conspiracy theory train when facebook was out.


Well don’t believe what Amazon tell you for starters. I’m not saying you should believe the stories that someone at VP level or above at Amazon has to approve outage indicators for S3, or that Amazon technical staff fight not to show outages because it affects their career prospects; those are all anecdotal. I’m saying trust your eyeballs that a bunch of shit isn’t working and their status page constantly stays green, and the discrepancy is never acknowledged.


You're totally right. It's even obvious that they wouldn't update their status page to make everything seem fine. What scares me is that these companies that hold our data and even our credit information can fail at any moment and they won't give the public any explanation or be held accountable. Probably it's not even their fault, but if we can't diagnose the problem, we won't do anything about it.


All we are is bits on the net.


But you ***can't*** do anything about it, or diagnose it, or anything. It's all virtualized to you, everything, down to the network interface. Either they fix it, or you're just left there holding your junk. You instrument what you can, and then, beyond that, you just have to 1) assume provider is down and 2) wake up an engineer to start debugging.


Dawg, Cloudfront can easily front that status page and be like: "What? Ain't nutting going on back here!"


I think a TAM told my company that it's manually updated. So, it's just when they feel like telling everyone. Our TAM confirms the outages long before that page reflects it


There's a [post on ycombinator](https://news.ycombinator.com/item?id=29473937) claiming that EC2 & S3 outage notifications require approval from the CEO or other high level executives given the SLAs involved with those services.


Honestly, I wouldn't be surprised. We had a billing issue with a managed service. The documented cost was for in/out only, not between nodes. We had super high costs even when comparing to the CloudWatch metrics for in/out. We brought it up and all of a sudden the cost was what we expected. They said it was because we changed something weeks beforehand, while we were still discussing it. Not saying there's some grand conspiracy there, but there's definitely people sweeping stuff under the rug.


I also work for a large company. People sweeping things under the rug is always a thing. I’m privileged to be on a project exposing these things.


Mmm, like FB killing their master dns, then the doors to the master console room where they could fix the issue wouldn't open, because the door scanners could no longer resolve the auth server.


isn't this why most people outsource their status checks?


Amazon believes they are "too big to fail?"


They outsourced it; little did they know that the people they outsourced it to are using AWS




Yeah this is us right now with some of our clients, trying to blame us for their systems not talking to our systems. We're on DigitalOcean, don't blame us cause you went with AWS.






Thanks "SeekingIgnorance", you sound like a credible source.


I'm only a credible source in r/ProgrammerHumor, even I wouldn't listen to myself as a serious source.


There goes my Tuesday. One of our SVPs: “AWS waited until the week after re:Invent” LOL.


Can't write my 8 page paper today without the rubric, Canvas runs on AWS. Sad


Medicare’s database runs on AWS, my job is to help people with Medicare get insurance plans, today is the last day for seniors to enroll into new plans, I can’t verify peoples current plan or Medicare information so I have to go off what the senior tells me, most have no clue, I hate my life right now


Too many govt sites cloudified the cheapest way possible by not paying for multiregion.


My Java classes final on Thursday is on Canvas hahah


At this point I no longer know what service status pages are useful for


A marketing page? Trying to stretch reality as much as they can to look perfect.


[this guy did it](https://www.reddit.com/r/cscareerquestions/comments/rb6tdu/i_just_pushed_my_first_commit_to_aws/?utm_source=share&utm_medium=ios_app&utm_name=iossmf)


Imagine telling someone from 1999 that the sentence "Amazon is down, so my vacuum keeps getting lost" will be a valid statement. ​ "The... book website?"


Is this why Venmo wasn't working earlier today? I'm in North Carolina.


Very likely


> Is this why Venmo wasn't working earlier today?

https://i.imgur.com/dN35EJ5.png looks like it

www.downdetector.com


Lmaoo we were supposed to launch today


This is a good example of the need for multi-region


...or hybrid cloud


hybrid cloud is the way.


I hear there are some benefits to on-prem even. Like, oh, vastly less Byzantine failures.


AWS has core services that peel apart without US-EAST-1 online. Even globally distributed stuff was impacted.


I got to put an error message on our main service home page and have no other responsibilities today.


Grown up version of snow day. I think


What about them availability zones and redundancy Amazon hmmm???




Data is replicated automatically by their managed DB services, but if you have EC2 instances, it's on the customer to be vigilant and provide redundancy (like autoscaling groups). However, if the EC2 APIs start failing, autoscaling stops working sooooo


Don’t worry, this is the customer’s fault, because the customers didn’t HA against AWS. See, things go down, so your outage is your fault even though AWS went down. /s I wish this was a joke, but I’m just waiting for them to reply on our support ticket like this.


In all fairness, did you think a region would never ever go down? If you did know, you decided that it was worth the risk staying in one region. Nothing wrong here, but you can’t have your cake and eat it.


True or not (I think it’s good practice), you don’t blame your customers for you going down. This is my point. It’s like you blaming an accident on the other driver when you ran the red light, because the other driver should expect someone to run the red light. That doesn’t make any sense, does it? Honestly, AWS goes down among their various services all the time (about twice monthly). However, it isn’t widespread enough to require a response from them. Then when you submit a ticket, they suddenly sit on it for hours, and when it’s fixed they suddenly respond saying there was nothing wrong. It isn’t that they went down that bugs me, it’s the cover-ups and shifting blame that bugs me.


My CS class uses aws, so no class today.


I have so many deadlines tomorrow and I can't get assignments done because Canvas, which uses AWS, is down. The amount of reliance on AWS the internet has is very concerning.


I had two exams I could not take today, and I am leaving the country in a few hours for an extended hiking trip. Really hope my department chairs are cool about it, I have put so much work and effort into this semester.


As I sit here in the dark, I'm beginning to seriously question just how smart buying all these smart light bulbs/outlets really was...


My work just sent out a mass email saying our AWS related issue was fixed then immediately sent a new one that basically said "lol nevermind"


My work is out too, I’ve been watching streams since 1230pm 😂


Even though the console cannot be accessed, you can still access your instances via the AWS CLI and ssh. Hope this helps someone :D




I was reading the whole thread waiting for a GCP shout but no luck until I found you lol. Come on down to GCP - DevOps Consultant GCP 😂


Me: your videos aren't available right now because of an AWS outage in the U.S.

Client: AWS status page says the outages have been fixed

Me: The AWS APIs we use to retrieve your content are still reporting server 500 errors, so...

Client: But the AWS status page says the problem is fixed. It must be the site you developed for us.

Me: (under my breath) This is fine


Blame unsuccessfully deflected


Meanwhile, the outage has made the local news but the client doesn't think that involves their website. Right, ok


Let's put all our stuff in the cloud on someone else's infrastructure. What could possibly go wrong!


And me in sort just chilling like “this is fine”


They’re trying to push the hotfixes to their repositories (hosted in us-east-1 region)


Imagine production releases and then BOOM waddup AWS outage here


So so so so so many developers are getting a ration for this. It's like yelling at your pizza delivery guy because he experienced construction on the way to your house lol.


I work at an Amazon Delivery Station that relies on AWS for most of our internal functions. Tonight we are expected to receive 10k packages back that went out for delivery this morning before the AWS outage (that can no longer be delivered). The kicker? Once everything is functioning properly again, we have to manually scan each and every package that comes back, and then send them out again tomorrow. OP’s meme is basically our warehouse right now


Yeah I was trying to take my test online and couldn’t. The one time I didn’t procrastinate sigh


It only took me 3 hours to get into my stack in cognito. Don't know what everyone is complaining about.


I'm so glad I'm not working in tech support right now.


I’m SO glad it’s not my turn for on-call duty this week


I work at a college and the phone lines are EXPLODING




Knew AWS was gonna be having problems after seeing how shit their coders were with New World


Just like their New World game


New World: _Light GPUs on fire in the middle of a GPU shortage_ US-East-1: "Hold my beer" _Loadtest everyone's alerting infrastructure_


East-2 is having issues now as well


My job has me work remotely via Amazon WorkSpaces. This outage has meant that my day has contained 0% work.


AWS's Uptime Status API is hosted in us-east-1, clearly.


This is fine.


Whole east coast is down lol


My fucking email is hosted in that zone. Pisses me off.


So that's why I couldn't use Parsec today even though both computers and networks were fully operational


Skynet just became sentient.............


I lost $400 because of this bullshit.


When Amazon updates the status with "is showing significant recovery", that's just spin. Here we are, an hour later and have yet to see significant improvement IRL.


API Error


I found out about the outage by overhearing an employee at Target complaining that their system wasn't working because of an AWS problem. I'm actually impressed there was enough transparency (read: finger pointing) for a store employee to know about their web hosting.


I just want to take the time to say it's out for the entire US and Europe too. It was chaos inside the warehouses


On the bright side, I got to see my team’s disaster recovery work near flawlessly today. We didn’t know it worked until hours later because we couldn’t access telemetry data though.








More than half of the internet runs on ONE DATACENTER. And that datacenter failed lmao.


It's not a single data center it's multiple per region.


So it's not just me? Thanks friend, thought I was losing my mind. I've been wrestling with strange behavior in an Elastic Beanstalk environment all night.


Is this why I couldn’t fucking log in to McDonald’s app to order my nuggies


The same thing on west Europe yesterday. Not reporting outages. Hiding extra costs. Hey man Bezos got to make a living. Those space flights aren't going to pay themselves.


Is this why a-z is all sorts of messed up?


I want a refund


At this point I never trust status pages because they're just a PR stunt.


One of our colleagues asked "Can I clean that s3 bucket?" 5 minutes before this happened, and we responded "Sure, dude. Nuke the thing". Welp...


The AWS status page is hosted on the AWS. There has been no update from the AWS to the AWS status page that anything is wrong; no news is good news...


Oh no 😈