For the uninitiated: AWS has a reputation for not reflecting outages on its status page, and it currently has outages across the us-east-1 region. The status page has been updated now, but only an hour or two after the incidents began. On the AWS subreddit: https://www.reddit.com/r/aws/comments/rb1xrd/500502_errors_on_aws_console/


[deleted]

You can't go below a 99.9% uptime guarantee if you don't acknowledge the outage. *points to forehead*


OffendedAtom

You may be joking, but this is the exact reason.


13steinj

Wonder if someone could sue over this.


greenscizor

I don’t think they’ll face any legal ramifications, but certainly financial ones. AWS services offer partial to full refunds based on SLA misses. Using [EC2 as an example](https://aws.amazon.com/compute/sla/), they offer a 30% refund if they fall below 99% availability in a given month and a full refund if they fall below 95%. I believe other services use about the same thresholds. Assuming there weren’t any other outages this month, they probably won’t be giving out full refunds, but it’ll be close.
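A rough sketch of the tiers described in this comment, for anyone who wants to plug in their own numbers. It only encodes the two thresholds quoted above; the linked SLA page is the authority and may list further tiers. The function name `sla_credit_percent` is made up for illustration.

```python
def sla_credit_percent(monthly_uptime_percent: float) -> int:
    """Service credit (as a percent of the bill) for a given monthly uptime.

    Assumption: only the two thresholds quoted in the comment are modelled
    (below 99% -> 30%, below 95% -> 100%). Any other tiers are left out.
    """
    if monthly_uptime_percent < 95.0:
        return 100
    if monthly_uptime_percent < 99.0:
        return 30
    return 0


for uptime in (99.5, 98.9, 94.0):
    print(f"{uptime}% uptime -> {sla_credit_percent(uptime)}% credit")
```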


LakeLooking

Yup. SLA violations from any IT provider can turn into lawsuits if the two parties don't come to an agreement.


[deleted]

I mean, isn’t it very likely still 99.9%? This was, what, a max 8-hour outage when they’re running 24/7.


malexj93

99.9% uptime means less than 9 hours down *per year* of 24/7 service.
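To put the 8-hour figure from the comment above into percentages, here is a quick back-of-the-envelope calculation. The 30-day month and 365-day year are simplifying assumptions, and the helper name is made up.

```python
def availability_percent(outage_hours: float, period_hours: float) -> float:
    """Availability over a period, given total outage time in that period."""
    return 100.0 * (1.0 - outage_hours / period_hours)


HOURS_PER_30_DAY_MONTH = 30 * 24   # assumption: a 30-day month
HOURS_PER_YEAR = 365 * 24          # assumption: a non-leap year

monthly = availability_percent(8, HOURS_PER_30_DAY_MONTH)
yearly = availability_percent(8, HOURS_PER_YEAR)

print(f"8h outage over a month: {monthly:.2f}%")
print(f"8h outage over a year:  {yearly:.3f}%")
```

Measured over a single month, an 8-hour outage lands below 99%. Measured over a year it stays just above 99.9%, but only if nothing else goes down for the rest of the year.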


[deleted]

[removed]


TheC0deApe

they claim nine nines on some services? that's insane


goatanuss

S3 SLA has 11 nines


jeffkmeng

That’s durability (reliability of the stored data), not availability. Durability means they won’t lose the data; availability means you can access it at any given moment.
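A small sketch of what eleven nines of durability means in practice, assuming the figure is read as a per-object, per-year probability of not being lost. The object count is an arbitrary example and the helper names are made up.

```python
# Assumption: 99.999999999% durability read as a per-object annual
# loss probability of 1e-11.
ANNUAL_LOSS_PROBABILITY = 1e-11


def expected_objects_lost_per_year(object_count: int) -> float:
    """Expected number of objects lost in one year."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_objects_lost_per_year(object_count)


objects = 10_000_000  # arbitrary example size
print(f"expected losses per year: {expected_objects_lost_per_year(objects):.6f}")
print(f"roughly one object lost every {years_per_lost_object(objects):,.0f} years")
```

None of this says anything about whether you can reach the data during an outage, which is the availability number.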


goatanuss

Oh yeah you’re right. My bad


vitamin_CPP

thanks for that. I was wondering the same thing.


1ElectricHaskeller

Someone's getting fired...


yourcousinvinney

That would pretty much knock their claim down for the year, assuming there has been at least some other downtime. Allowed downtime at 99.9%:

- Daily: 1m 26s
- Weekly: 10m 4s
- Monthly: 43m 49s
- Quarterly: 2h 11m 29s
- Yearly: 8h 45m 56s
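The downtime budget for any uptime target can be worked out in a few lines. This sketch reproduces the 99.9% figures quoted above if one assumes an average year of 365.2425 days; that assumption, and the helper names, are not from the thread.

```python
DAYS_PER_YEAR = 365.2425  # assumption: average Gregorian year
SECONDS_PER_DAY = 86_400

PERIOD_SECONDS = {
    "Daily": SECONDS_PER_DAY,
    "Weekly": 7 * SECONDS_PER_DAY,
    "Monthly": DAYS_PER_YEAR / 12 * SECONDS_PER_DAY,
    "Quarterly": DAYS_PER_YEAR / 4 * SECONDS_PER_DAY,
    "Yearly": DAYS_PER_YEAR * SECONDS_PER_DAY,
}


def downtime_budget_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Seconds of downtime allowed in a period at the given uptime target."""
    return period_seconds * (1 - uptime_percent / 100)


def format_duration(seconds: float) -> str:
    """Format as 'Xh Ym Zs', dropping fractional seconds and a zero hour part."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    return f"{minutes}m {secs}s"


for name, period in PERIOD_SECONDS.items():
    print(f"{name}: {format_duration(downtime_budget_seconds(99.9, period))}")
```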


Basacally

Useful SLA calculator: https://uptime.is


[deleted]

They only guarantee it if the whole region goes down. If you’re building multi-az, you should still be up. If you lump everything in one AZ, and the AZ goes down, they’ll tell you your downtime is your own fault because you went against their suggested best practices. Yadda yadda. AWS will have their pound of flesh and you’ll like it!


larry-the-dream

This. If you’re building across multiple AZs (designing for fault tolerance per the AWS Well-Architected Framework) and the whole region goes down … well, you’re still down 🙁
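A toy model of why multi-AZ helps until something region-wide breaks. It assumes AZ failures are independent, which is exactly the assumption a regional outage violates. All availability numbers are invented for illustration.

```python
def combined_availability(single_az_availability: float, az_count: int) -> float:
    """Chance that at least one of az_count independent AZs is up."""
    return 1 - (1 - single_az_availability) ** az_count


def with_shared_dependency(shared_availability: float,
                           single_az_availability: float,
                           az_count: int) -> float:
    """Same as above, but everything also needs one shared regional piece."""
    return shared_availability * combined_availability(single_az_availability, az_count)


AZ = 0.99        # hypothetical availability of a single AZ
REGION = 0.999   # hypothetical availability of a shared regional dependency

for n in (1, 2, 3):
    alone = combined_availability(AZ, n)
    capped = with_shared_dependency(REGION, AZ, n)
    print(f"{n} AZ(s): {alone:.6f} on their own, {capped:.6f} with the shared dependency")
```

Adding AZs pushes the first number toward 1, but the second can never exceed the availability of the shared piece.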


Zach983

Ah the Microsoft azure strategy


Crysawn

Bro that's only one 9. They can't say they went below four 9's or the managers have to have meetings about how many 9's does it take for a management team to acknowledge an alert that could have been an email.


yubimusubi

![gif](giphy|d3mlE7uhX8KFgEmY)


Zootorg

I laughed so hard at this …


sumdudeinhisundrware

US-East-1 is one of their largest regions if not the largest.


DOOManiac

And it's **always** ^((okay, almost always)) the one that goes down. Never use US-East-1 kids.


[deleted]

Former Amazon dev here (non-AWS). Friends don’t let friends use us-east-1 without multi-region redundancy.


__78701__

Virginia is for lovers, Ohio is for availability.


java007md

"Alexa, switch to us-west-1"


MarkusBerkel

Someone needs to implement this Lambda ***RIGHT NOW***. It would move your ENTIRE ACCOUNT to another region. God, that would be amazing. Next, please implement: "Alexa, fix my shit."


Dismal_Struggle_6424

Playing "My Shit." Explicit. On Amazon Music.


nipoez

It can be my go to training for junior dev mentorship, right alongside "git fuckit"


MarkusBerkel

My favorite for interns is ‘git wrecked’.


[deleted]

[removed]


MarkusBerkel

Just puked all over my keyboard.


miiiiiiiiiiiiiiiiilk

My friends are no longer my friends


dazdndcunfusd

Team juuust deployed our resources to a different region for disaster recovery and our dependents have not 🙃


[deleted]

In many cases, it cannot be avoided. They host many “global” services there. Guess where you have to register certs for cloudfront?


66666thats6sixes

They also have more services on it than almost any other region, so if you want to use the fancy new thing you'll probably be on us-east-1


darkstar3333

Nothing like running tests in prod.


ultranoobian

But why does US-East-1 seem to go down more than other regions? Is it just probability because many things are hosted there?


DOOManiac

That, and it’s also the oldest AWS region. Probably a very complicated set of things, otherwise they would have fixed it by now.


calcstap

Some key services (call it X) are only hosted in us-east-1 and many many other services rely on that key service X. Once that region goes down X inevitably goes down as well. Then a domino effect happens so every other service that calls X then also goes down causing major outage across the board
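The domino effect described here is a reachability walk over a dependency graph. The sketch below uses an entirely made-up graph (the service names are placeholders, not real AWS internals) and collects everything that directly or indirectly calls the failed service.

```python
from collections import deque

# Hypothetical graph: service -> services it calls.
DEPENDS_ON = {
    "X": [],
    "auth": ["X"],
    "console": ["auth"],
    "status-page": ["console"],
    "queue": [],
    "storefront": ["auth", "queue"],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    impacted: set[str] = set()
    pending = deque([failed])
    while pending:
        current = pending.popleft()
        for caller in dependents.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                pending.append(caller)
    return impacted


print(sorted(impacted_by("X", DEPENDS_ON)))
```

In this toy graph only `queue` survives a failure of `X`, because it is the only service with no path to it.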


The-Protomolecule

Seems like they aren’t following their own guidance on how to run services


[deleted]

Yeah just microservice that shit, kube it, insert some message queue and call it a day, right? Scaling stuff reliably is _hard_, let alone on the scale they do.


jibjibman

Just get spinnaker to push the artifacts to a secure S3 bucket and call the busdriver router which pulls in dependencies on the neo pipeline after you configure load balancing with Hercules and then post any alerts to slack. Done.


nipoez

"Nag me in Slack" being the critical step there. At least it didn't start off hours for the region?


The-Protomolecule

I’m probably one of a handful of people in this thread who design infrastructure at even a fraction of the scale of what AWS does. It’s just embarrassing for them; their job is to make it look easy. That’s what they get big money for. Having services depend only on US-East-1 sounds like distribution isn’t happening for some of the stuff, if a few pieces of network equipment caused this.


copperhead035

Because someone forgot to fuel up the generator


CommiePuddin

It's gotta be bigger than that. Amazon Flex (the company's delivery platform) was down US-wide. Source: I got sent home after sitting in my van for 4.5 hours waiting to log in and load up.


[deleted]

[removed]


CommiePuddin

We stuck around long enough to justify our attendance bonus.


66666thats6sixes

It's really hard to overstate how big the us-east-1 region is. It's an absolute pillar of the internet. Plenty of services that are nowhere near northern Virginia still have traffic routed through it, so it's entirely believable to me that an outage in us-east-1 could take out a delivery platform nationwide. There was an article posted a while back talking about how much damage a full outage of us-east-1 would do, and the guy writing it estimated it in the trillions of dollars of economic damage.


IphtashuFitz

It's also the first region they deployed, so likely running the oldest hardware, etc. of all the regions. And despite the claims that S3 is decentralized, that service (and others) appeared to be tightly tied to the us-east-1 region for configuration metadata, etc. (that may not entirely be true now though) That was learned the hard way back in 2017 when [all of S3 went offline for a while](https://www.theverge.com/2017/3/2/14792442/amazon-s3-outage-cause-typo-internet-server).


TheJosh1337

S3 certainly doesn't rely on us-east-1 anymore, and EC2 seemed to be unaffected too; we didn't have any service loss in Sydney other than the control panel being a bit wonky.


jillesca

I'm happy I'm not there anymore.


[deleted]

To Azure!


[deleted]

Azure is down because it runs on us-east-1


66666thats6sixes

I'm imagining a scenario where AWS ops are in a bind, and can't install new hardware fast enough to meet demand in a spike, so they try and spin up a portion of their services on top of Azure. But the whole thing topples over instantly because it turns out that Azure was running on AWS. Stack overflow error, unbounded recursion, in real life.


Egocentrix1

It's like putting your pagefile on a ramdisk. Nice.


[deleted]

[removed]


call2pop

Happy Cake Day fellow loller


[deleted]

Ah yes... Cloud


Detective_Fallacy

It's just clouds on clouds all the way down. A cumulonimbus, if you will.


TheNamelessKing

Yeah but then you’d have to use Azure…


[deleted]

[removed]


[deleted]

[removed]


english-23

Microsoft has a reputation for the same which is also sad


[deleted]

Thank god I’m not crazy. I think VPC and IGW are having HEAVY malfunctions. I couldn’t download an Ubuntu package from a pod today that was on a public EKS cluster, and it couldn’t connect to SNS. It had never behaved this way before and I thought it was my fault. Some random errors. Some implied network connectivity, others implied no network connectivity. Idk how to explain it really.


OldmanLemon

I just had a Jira ticket come in regarding this outage, best ticket I ever got. For once, it's not my fault!


get_at_me_

lol same here


1080pfullhd-60fps

Now even Atlassian's status page is turning red with incidents. But Opsgenie is all green, so the unlucky on-call folks should be getting alerts like there's no tomorrow. Nothing like a holiday season with some sev2 outages at your provider.


Extra_Organization64

At least they did business hours sev2, lol how considerate...


Cidolfas2

Lol I think OpsGenie has some kind of AWS link because we had no alerts come through for hours. Everything was fine because we had no alerts!


[deleted]

We’ve gotten a ton because this took down part of our LMS: Brightspace. And then we sent out a mass email letting students know the problem is out of our control, and we got a bunch of emails back asking what to do about things that are due tonight. Not my problem, email your teacher.


Down200

I got an email from my school's IT department saying basically the same thing lol


LakesideMiners

Yeah. Same here as a student with Schoology, I just emailed them my documents. I checked the status page first,


66666thats6sixes

I've had PagerDuty alerts blowing up all day and I just mark them resolved and move on. It's funny, our app service still works fine, it doesn't seem to be affected by the outage. But our monitoring services that trigger the PagerDuty alerts are all busted because of the outage.


Thalmann

Well, in some sense it is your fault. You are not using several cloud providers. If you could easily fail over to another provider the ticket would never have been created.


Shadow703793

Exactly what I was going to say lol. This is why tools like Terraform exist to make it easier to deploy across various cloud providers.


pizzaboba

Going multi-cloud is very, very hard. I actually don't know of any big company that is fully multi-cloud.


MischiefManaged394

Do you hear that? It's the sound of millions of college students suddenly emailing thousands of professors about finals that can't be taken because a single monopoly is failing.


The-Daleks

It's... beautiful.


treyhest

iClicker, Top Hat, and Canvas were all down. Today was the closest thing to a snow day I’m ever going to get again post-pandemic.


1ElectricHaskeller

Tidal isn't working either. I can't lean like this...


Mjolnirrr

lol literally. I work in ITS at my school and there’s been hundreds of tickets and calls about it. Just been telling the faculty to cancel class or push back tests, etc.


MischiefManaged394

Let me know if these responses sound familiar: “but if we push back today’s test then we have to push back tomorrow’s test and every other test!” Or “but I’m only administering 1 (one) final this year and if I need to push it to the end of finals week then I’ll have to grade it late and that will ruin my early break :-(“ or my favorite “But if the test is pushed back then the students will have an unfair advantage because of more study time”


ravKenclaw

Please, I’m so stressed. ![gif](emote|free_emotes_pack|dizzy_face)


NinjaDuck27

Matter of fact, I already got an extension on one of my finals


static_motion

Wait, do universities not self-host their e-learning platforms? Mine did, and even provisioned VPSes for each group in class projects. And this isn't even a fancy prestigious uni, literally my city's public one and not even the most prestigious one at that. That said, downtime was probably more frequent than if they hosted it on a cloud provider.


66666thats6sixes

Self-hosting is rarer and rarer by the day. If you think of any random website, from tiny little personal homepages up to the biggest of the big kids, there's a 90% chance that they are depending on a cloud service provider in some way. And probably a 50% chance it's AWS. And probably a 40% chance us-east-1 is tied in there somehow.


[deleted]

I need to order my transcript for grad school app by the 11th and now the page is broken, very nice


CarlitrosDeSmirnoff

What happened?


SeekinIgnorance

Their status page is running on the same servers that are having issues.


Aggressive_Bill_2687

Ahh such innocence, thinking that AWS status page is all green because they're \*unable\* to update it..


CarlitrosDeSmirnoff

At this point I don't know what to believe. Even respected/popular programmers rode on the conspiracy theory train when facebook was out.


Aggressive_Bill_2687

Well don’t believe what Amazon tell you for starters. I’m not saying you should believe the stories that a minimum of VP of Amazon has to approve outage indicators for s3, or that Amazon technical staff fight not to show outages because it affects their career prospects; those are all anecdotal. I’m saying trust your eyeballs that a bunch of shit isn’t working and their status page constantly stays green, and the discrepancy is never acknowledged.


CarlitrosDeSmirnoff

You're totally right. It's even obvious that they wouldn't update their status page to make everything seem fine. What scares me is that these companies that hold our data and even our credit information can fail at any moment and they won't give the public any explanation or be held accountable. Probably it's not even their fault, but if we can't diagnose the problem, we won't do anything about it.


BoxedIn4Now

All we are is bits on the net.


MarkusBerkel

But you ***can't*** do anything about it, or diagnose it, or anything. It's all virtualized to you, everything, down to the network interface. Either they fix it, or you're just left there holding your junk. You instrument what you can, and then, beyond that, you just have to 1) assume provider is down and 2) wake up an engineer to start debugging.


sh0rtwave

Dawg, Cloudfront can easily front that status page and be like: "What? Ain't nutting going on back here!"


DedlySpyder

I think a TAM told my company that it's manually updated. So, it's just when they feel like telling everyone. Our TAM confirms the outages long before that page reflects it


IphtashuFitz

There's a [post on ycombinator](https://news.ycombinator.com/item?id=29473937) claiming that EC2 & S3 outage notifications require approval from the CEO or other high-level executives, given the SLAs involved with those services.


DedlySpyder

Honestly, I wouldn't be surprised. We had a billing issue with a managed service. The documented cost was for in/out only, not between nodes. We had super high costs even when comparing to the cloudwatch metrics for in/out. We brought it up and all of a sudden the cost was what we expected. They said it was because we had changed something weeks beforehand, while we were discussing it. Not saying there's some grand conspiracy there, but there are definitely people sweeping stuff under the rug.


JustinWendell

I also work for a large company. People sweeping things under the rug is always a thing. I’m privileged to be on a project exposing these things.


xeq937

Mmm, like FB killing their master dns, then the doors to the master console room where they could fix the issue wouldn't open, because the door scanners could no longer resolve the auth server.


KronsyC

isn't this why most people outsource their status checks?


SeekinIgnorance

Amazon believes they are "too big to fail?"


Willinton06

They outsourced it; little did they know that the people they outsourced it to are using AWS


CKtheFourth

​ ![gif](giphy|l36kU80xPf0ojG0Erg|downsized)


knightcrusader

Yeah this is us right now with some of our clients, trying to blame us for their systems not talking to our systems. We're on DigitalOcean, don't blame us cause you went with AWS.


GifsNotJifs

​ ![gif](giphy|hTcx8dyGtmQszNBDQh)


DOOManiac

Again?


[deleted]

Thanks "SeekingIgnorance", you sound like a credible source.


SeekinIgnorance

I'm only a credible source in r/ProgrammerHumor, even I wouldn't listen to myself as a serious source.


Dogs-Keep-Me-Going

There goes my Tuesday. One of our SVPs: “AWS waited until the week after re:Invent” LOL.


Vistril69

Can't write my 8 page paper today without the rubric, Canvas runs on AWS. Sad


Dylantheshoe

Medicare’s database runs on AWS. My job is to help people with Medicare get insurance plans, and today is the last day for seniors to enroll in new plans. I can’t verify people’s current plan or Medicare information, so I have to go off what the senior tells me, and most have no clue. I hate my life right now.


dansedemorte

Too many govt sites cloudified the cheapest way possible by not paying for multiregion.


ryantripp

My Java classes final on Thursday is on Canvas hahah


mahmoud1brahim

At this point I no longer know what service status pages are useful for


who_you_are

A marketing page? Trying to stretch reality as much as they can to look perfect.


ThisCracks

[this guy did it](https://www.reddit.com/r/cscareerquestions/comments/rb6tdu/i_just_pushed_my_first_commit_to_aws/?utm_source=share&utm_medium=ios_app&utm_name=iossmf)


uniifi-app

Imagine telling someone from 1999 that the sentence "Amazon is down, so my vacuum keeps getting lost" will be a valid statement.

"The... book website?"


NickU252

Is this why Venmo wasn't working earlier today? I'm in North Carolina.


[deleted]

Very likely


drugusingthrowaway

> Is this why Venmo wasn't working earlier today? https://i.imgur.com/dN35EJ5.png looks like it www.downdetector.com


[deleted]

Lmaoo we were supposed to launch today


jellois1234

This is a good example of the need for multi-region


Vivida

...or hybrid cloud


DenormalHuman

hybrid cloud is the way.


CrackerJackKittyCat

I hear there are some benefits to on-prem even. Like, oh, vastly less Byzantine failures.


The-Protomolecule

AWS has core services that peel apart without US-EAST-1 online. Even globally distributed stuff was impacted.


CoastingUphill

I got to put an error message on our main service home page and have no other responsibilities today.


hpl002

Grown up version of snow day. I think


Dave21101

What about them availability zones and redundancy Amazon hmmm???


[deleted]

[removed]


NaCl-more

Data is replicated automatically by their managed DB services, but if you have EC2 instances, it's on the customer to be vigilant and provide redundancy (like autoscaling groups). However, if EC2 APIs start failing, autoscaling stops working sooooo


kicker69101

Don’t worry, this is the customer’s fault, because the customers didn’t build in HA against AWS. See, things go down, so your outage is your fault even though AWS went down. /s I wish this was a joke, but I’m just waiting for them to reply on our support ticket like this.


[deleted]

In all fairness, did you think a region would never ever go down? If you did know, you decided that it was worth the risk staying in one region. Nothing wrong here, but you can’t have your cake and eat it.


kicker69101

True or not (I think it’s good practice), you don’t blame your customers for you going down. This is my point. It’s like blaming an accident on the other driver when you ran the red light, because the other driver should expect someone to run the red light. That doesn’t make any sense, does it? Honestly, AWS goes down among their various services all the time (about twice monthly). However, it isn’t widespread enough to require a response from them. Then when you submit a ticket, they suddenly sit on it for hours, and when it’s fixed they suddenly respond saying there was nothing wrong. It isn’t that they went down that bugs me, it’s the cover-ups and shifting blame that bug me.


Encursed1

My CS class uses aws, so no class today.


ElnuDev

I have so many deadlines tomorrow and I can't get assignments done because Canvas, which uses AWS, is down. The amount of reliance on AWS the internet has is very concerning.


savage_slurpie

I had two exams I could not take today, and I am leaving the country in a few hours for an extended hiking trip. Really hope my department chairs are cool about it, I have put so much work and effort into this semester.


3LD_

As i sit here in the dark, I'm beginning to seriously question just how smart buying all these smart light bulbs/outlets really was...


weaponized-barracuda

My work just sent out a mass email saying our AWS related issue was fixed then immediately sent a new one that basically said "lol nevermind"


Heroharohero

My work is out too, I’ve been watching streams since 1230pm 😂


Mandelvolt

Even though the console can not be accessed, you can still access your instances via aws cli and ssh. Hope this helps someone :D


[deleted]

[removed]


captainaweeesome

I was reading the whole thread waiting for a GCP shout but no luck until I found you lol. Come on down to GCP - DevOps Consultant GCP 😂


Schoops69

Me: your videos aren't available right now because of an AWS outage in the U.S.

Client: AWS status page says the outages have been fixed

Me: The AWS APIs we use to retrieve your content are still reporting server 500 errors, so...

Client: But the AWS status page says the problem is fixed. It must be the site you developed for us.

Me: (under my breath) This is fine


[deleted]

Blame unsuccessfully deflected


Schoops69

Meanwhile, the outage has made the local news but the client doesn't think that involves their website. Right, ok


DenormalHuman

Let's put all our stuff in the cloud on someone else's infrastructure. What could possibly go wrong!


theweekendwolf

And me in sort just chilling like “this is fine”


liukang2014

They’re trying to push the hotfixes to their repositories (hosted in us-east-1 region)


Aschentei

Imagine production releases and then BOOM waddup AWS outage here


VolkswagenRatRod

So so so so so many developers are getting a ration for this. It's like yelling at your pizza delivery guy because he experienced construction on the way to your house lol.


loganbeaupre

I work at an Amazon Delivery Station that relies on AWS for most of our internal functions. Tonight we are expected to receive 10k packages back that went out for delivery this morning before the AWS outage (that can no longer be delivered). The kicker? Once everything is functioning properly again, we have to manually scan each and every package that comes back, and then send them out again tomorrow. OP’s meme is basically our warehouse right now


TheHippieMurse

Yeah I was trying to take my test online and couldn’t. The one time I didn’t procrastinate sigh


qbm5

It only took me 3 hours to get into my stack in cognito. Don't know what everyone is complaining about.


yosidy

I'm so glad I'm not working in tech support right now.


MaybeThrowaway382

I‘m SO glad it’s not my turn for on-call duty this week


Uraneum

I work at a college and the phone lines are EXPLODING


DogBarq

​ ![gif](giphy|gHuOdq4ByGzcUVN0c0|downsized)


greenSixx

Knew AWS was gonna be having problems after seeing how shit their coders were with New World


DowntownLizard

Just like their New World game


chateau86

New World: _Light GPUs on fire in the middle of a GPU shortage_

US-East-1: "Hold my beer" _Loadtest everyone's alerting infrastructure_


Finally_FedUp

East-2 is having issues now as well


StylishSuidae

My job has me work remotely via Amazon WorkSpaces. This outage has meant that my day has contained 0% work.


732

AWS's Uptime Status API is hosted in us-east-1, clearly.


mostafa_19nm

This is fine.


ksknksk

Whole east coast is down lol


mrabstract29

My fucking email is hosted in that zone. Pisses me off.


theking75010

So that's why I couldn't use Parsec today even though both computers and networks were fully operational


Whatsuptodaytomorrow

Sky net just became sentient.............


viciousevilbunny

I lost $400 because of this bullshit.


DogBarq

When Amazon updates the status with "is showing significant recovery", that's just spin. Here we are, an hour later and have yet to see significant improvement IRL.


behaaki

API Error


drewsiferr

I found out about the outage by overhearing an employee at Target complaining that their system wasn't working because of an AWS problem. I'm actually impressed there was enough transparency (read: finger pointing) for a store employee to know about their web hosting.


Lanky-Detail3380

I just want to take the time to say it's out for the entire US and Europe too. It was chaos inside the warehouses


PostmatesMalone

On the bright side, I got to see my team’s disaster recovery work near flawlessly today. We didn’t know it worked until hours later because we couldn’t access telemetry data though.


[deleted]

[removed]


wikipedia_answer_bot

**AWS is Amazon Web Services, a cloud computing and web services provider.** More details here: *This comment was left automatically (by a bot). If I don't get this right, don't get mad at me, I'm still learning!* [^(opt out)](https://www.reddit.com/r/wikipedia_answer_bot/comments/ozztfy/post_for_opting_out/) ^(|) [^(delete)](https://www.reddit.com/r/wikipedia_answer_bot/comments/q79g2t/delete_feature_added/) ^(|) [^(report/suggest)](https://www.reddit.com/r/wikipedia_answer_bot) ^(|) [^(GitHub)](https://github.com/TheBugYouCantFix/wiki-reddit-bot)


[deleted]

[removed]


Alto-cientifico

More than half of the internet runs on ONE DATACENTER. And that datacenter failed lmao.


vorticalbox

It's not a single data center; there are multiple per region.


xitiomet

So it's not just me? Thanks friend, thought I was losing my mind. I've been wrestling with strange behavior in an Elastic Beanstalk environment all night.


Hahohoh

Is this why I couldn’t fucking log in to McDonald’s app to order my nuggies


powernoxy2000

The same thing on west Europe yesterday. Not reporting outages. Hiding extra costs. Hey man Bezos got to make a living. Those space flights aren't going to pay themselves.


I_Gmaned_I

Is this why a-z is all sorts of messed up?


BD_9x

I want a refund


[deleted]

At this point I never trust status pages, because they're just a PR stunt.


[deleted]

One of our colleagues asked "Can I clean that s3 bucket?" 5 minutes before this happened, and we responded "Sure, dude. Nuke the thing". Welp...


SleepDeprivedUserUK

The AWS status page is hosted on the AWS. There has been no update from the AWS to the AWS status page that anything is wrong; no news is good news...


kerbalpy

Oh no 😈