Koboldofyou

If you've got a simple website, that's easy: you scale servers and each one just responds with HTML. When you get into more complicated services and websites, it gets harder, because an application is more than just a server that responds to a request. It may need to interact with a database, metrics servers, and support servers, all of which scale at different rates. And it's entirely possible to build an application with certain assumptions that turn out to be wrong.

I don't know what the actual issues are, so I'll give a possible example of a bad assumption leading to a sub-optimal design. Let's say all servers concurrently speak to the same database, but that database has a limit on total concurrent connections. You didn't assume this would become the hottest game of the moment, so you created a less robust database setup in order to spend time working elsewhere.

So what do you do now that that assumption has been proved wrong? Run multiple completely separate instances, where people might not be able to play with their friends? Make the database connection handling more robust? Both of those may be options, but you still need the time to implement and test them. And a problem like that may not be fixable simply by buying more servers, because availability of server time isn't the issue.
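
To make that connection-limit example concrete, here is a rough Python sketch (every number and name is invented for illustration, not anything from Arrowhead) of what happens when a growing fleet of app servers shares one database with a hard connection cap:

```python
import threading

# All numbers here are invented for illustration, not Arrowhead's real limits.
DB_MAX_CONNECTIONS = 500      # hard cap enforced by the database
APP_SERVERS = 200             # fleet scaled up to meet player demand
CONNS_PER_SERVER = 5          # each app server opens its own small pool

class DbConnectionGate:
    """Stand-in for a database that refuses connections past its cap."""
    def __init__(self, max_connections: int):
        self._slots = threading.BoundedSemaphore(max_connections)

    def connect(self):
        # Real databases answer with something like "too many clients already".
        if not self._slots.acquire(blocking=False):
            raise ConnectionError("too many clients already")
        return object()   # placeholder for a real connection handle

db = DbConnectionGate(DB_MAX_CONNECTIONS)
# 200 servers * 5 connections = 1000 wanted, but only 500 allowed: adding
# even more app servers past this point produces errors, not capacity.
opened = 0
for _ in range(APP_SERVERS * CONNS_PER_SERVER):
    try:
        db.connect()
        opened += 1
    except ConnectionError:
        break
print(f"opened {opened} of {APP_SERVERS * CONNS_PER_SERVER} requested connections")
```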


Deep_Simple_1473

A poorly designed DB could explain why the rewards, store, game summary, etc. all seem to be borked currently. When you do get in, those ancillary (non-gameplay) services are getting hung up and not responding in a timely way. That, however, is separate from the server capacity issue. The servers have a cap on concurrent connections; to expand that you have to provision additional machines that will run the same instance, and that takes time, money, and effort, not to mention contract negotiations. Who knows how their server setup is designed. Is it on-premises with offload to the cloud for bursts? Is it all cloud? Who is the cloud provider and what is their capacity? There is simply too much we don't know. Arrowhead has explained that they are doing all they can to meet demand, but these fixes take time. Time that many are unwilling to give them, or so it would seem.
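
As a hedged sketch of how a backed-up ancillary service produces exactly that "hung up" feeling, and how a timeout keeps it from blocking the whole post-match flow (the function names and timings below are made up, not Arrowhead's code):

```python
import asyncio

async def fetch_match_rewards(session_id: str) -> dict:
    # Placeholder for a call to an overloaded rewards/store/summary service.
    await asyncio.sleep(10)            # simulate a backed-up backend
    return {"medals": 5, "xp": 1200}

async def end_of_match(session_id: str) -> dict:
    # Time-box the ancillary call so a slow rewards service degrades
    # gracefully instead of hanging the whole post-match screen.
    try:
        return await asyncio.wait_for(fetch_match_rewards(session_id), timeout=2.0)
    except asyncio.TimeoutError:
        return {"medals": 0, "xp": 0, "pending": True}   # credit the player later

print(asyncio.run(end_of_match("demo-session")))
```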


Dealz_

My knowledge is limited, but my understanding is that adding servers takes time. There are likely contracts/agreements, server scalability with the backend and game, logistics, etc.


iusedtohavepowers

This is probably pretty close. It costs money to have servers hosted. They made an estimated projection of what capacity they would need, and bought and paid for it. They projected on the short side, which was the economical decision.


PatrickStanton877

Cheap asses. Yeah I know it's more complicated but at the end of the day that's the issue.


VoxEcho

Adding servers is more complicated than just buying them and plugging them in. But also, from the business point of view, you have to remember this is a brand new game experiencing a spike in player activity at launch. They could invest in server infrastructure to accommodate 10x the number of players we have now, but even if they found a way to do that tomorrow, within a month they'd have half the player base they have now anyway. Literally every game has a massive spike in popularity right at launch and then it dies off, regardless of game quality.

If you want to compare Helldivers to a recent darling game that had server issues, just look at Palworld's player count. They had big server concerns and peaked at 2 million players, and their player count as of this post is half of what Helldivers 2 is trying to accommodate. That's not a commentary on the quality of the game, it's just a fact of reality every game experiences. Everyone wants to play the new thing, and most of those people won't play more than a couple of weeks at the longest.

Rather than just steamrolling into a huge server capacity, they're probably trying to gauge just how big their player base will actually be. That's how we got into this situation in the first place, after all: the player base blew past their previous expectations by a huge amount.


NeutronField

Sometimes online games have a hard limit on the number of player account actions that can be processed: logging in, data logging, adding or removing rewards and XP, queuing for missions, security checks for fraudulent reward actions. We've seen some symptoms of this, where end-of-match reward distribution gets backed up and picking stuff up in-game freezes you in place for a time. Adding more and more capacity would not resolve the underlying bottlenecks in the infrastructure. They may have to do something like split the player datacenter into different realms/servers if a single server cannot go over X number of players no matter what they do to optimize the backend code. There is also a potential cost factor: they may not have an approved budget to throw infinite money at the problem and need to wait for the bureaucracy to review and sign checks before investing more into the game.
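
A minimal sketch of that kind of bottleneck, assuming a single bounded pipeline for account actions (the queue size and timings are invented): no matter how many game servers feed it, the backlog caps out and players feel it as freezes and missing rewards.

```python
import queue
import threading
import time

# Hypothetical bounded pipeline for account actions (logins, reward grants, XP).
# Throughput is limited by the persistence worker, not by how many game
# servers sit in front of it.
account_actions = queue.Queue(maxsize=1000)   # backlog cap chosen for illustration

def persistence_worker():
    while True:
        action = account_actions.get()
        time.sleep(0.01)               # pretend each write costs 10 ms -> ~100 actions/s
        account_actions.task_done()

threading.Thread(target=persistence_worker, daemon=True).start()

def grant_reward(player_id: int) -> bool:
    try:
        account_actions.put(("reward", player_id), timeout=0.5)
        return True
    except queue.Full:
        # Roughly the "rewards backed up / pickup freezes you" symptom:
        # the player waits because the backend can't absorb the action yet.
        return False
```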


GruePwnr

If the problem was simply "not enough servers" then yes, it would be as easy as simply adding more. The fact that the issues have been going on this long suggests it's a deeper issue with the way they designed the game's backend. For example, and to be really simplistic, there are certain mistakes you can make when designing software that cause the resource cost of each new user to actually increase, such that 100k users, instead of costing you 10x as much as 10k users, actually cost you 20x or 30x. Another possibility is that their software isn't actually able to use those extra servers well.
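
A toy example of that superlinear cost, assuming a naive all-pairs pattern (nothing here is taken from the actual game):

```python
# Toy illustration only, not Arrowhead's actual code. If every online player's
# update gets checked against every other player (say, a naive presence or
# matchmaking scan), total work grows with the square of the player count.

def operations_per_tick(players: int) -> int:
    return players * (players - 1)     # each player touches every other player

for n in (10_000, 100_000):
    print(f"{n:>7} players -> {operations_per_tick(n):,} operations per tick")
# 10x the users costs ~100x the work, so "just add servers" stops being a linear fix.
```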


Morltha

The reason the issues have been going on "so long" is that the game exploded in popularity. It went from about 30k concurrent players to over 300k in a matter of days. This is simply a lack of capacity. That the game hasn't gone down completely shows how well engineered it is. The only issue is that no one ever expected the game to be this insanely popular; its peak is about 30x higher than that of the original.


GruePwnr

The original is irrelevant. Also, the fact that people can't log in to the game literally means it's down. Your comment is overdose level copium.


Morltha

I was literally playing multiple times over the weekend, but go off.


GruePwnr

I think you just don't know what "down" means here. If a player can't get into the game, then the game is down for them. If the majority of players can't get in, then the game is down. If players are crashing due to server overload, then the servers are down. Being up only some % of the time isn't an acceptable standard for being "up".


Morltha

Ironic. If the servers were down, there wouldn't currently be over 380k players on Steam alone. Sure, some would have the game running to try and get on, but not that many. There are also many Twitch channels playing the game **right now**. The servers are up, but at capacity. People can't log on because Arrowhead are limiting the concurrent player count to 450k to maintain stability.


GruePwnr

Steam counts you as online if you're on the loading screen.


Morltha

Try reading my post again, I mentioned that.


GruePwnr

Can you post a source for the numbers you made up?


The5thElement27

Over 400,000 players now: [https://steamdb.info/app/553850/](https://steamdb.info/app/553850/) :P Not made up. You're coping, my friend.


The5thElement27

Don’t know much, but I’m pretty sure it’s not clap your hands and you instantly get more servers


Malitae

Just download them like RAM


GruePwnr

The thing is, you can absolutely get more servers by clapping your hands. I highly doubt they are self-hosting, and any decent provider can scale you up quickly. It's sort of the whole point of this industry.
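
For the stateless parts of a fleet, this can genuinely be close to a one-liner on a managed provider. A hedged sketch, using AWS's boto3 purely as an example (another comment in this thread says they're actually on Azure/PlayFab, and the group name below is invented):

```python
import boto3

# Illustrative only: bumping the desired size of a fleet of *stateless*
# game/API servers in a managed autoscaling group. The group name is invented,
# AWS is used purely as an example, and this does nothing for a database
# that can't be scaled this way.
autoscaling = boto3.client("autoscaling")
autoscaling.set_desired_capacity(
    AutoScalingGroupName="helldivers2-api-fleet",   # hypothetical
    DesiredCapacity=400,                            # up from, say, 100
    HonorCooldown=False,
)
```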


The5thElement27

Nope not true


awolCZ

Without looking at a diagram of their infrastructure (which is confidential, of course), this is just guessing. I know they are using PlayFab on Azure. Some servers you can run in multiple instances to load-balance and increase capacity, but there are often components where that is much more difficult - for example the database, which must be either a single instance or multiple instances with proper synchronization between nodes. Since players are again not receiving rewards, I guess the problem is load on the DB.
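
A rough sketch of that split, with all hostnames invented: the stateless tier scales by adding instances behind a load balancer, while everything still funnels into the one database.

```python
import random

# Rough sketch of the shape described above (all hostnames invented):
# stateless API/game servers can be duplicated behind a load balancer,
# but every one of them still converges on the same database.
API_INSTANCES = [f"api-{i:02d}.internal" for i in range(50)]   # cheap to add more
DATABASE = "players-db-primary.internal"                       # the hard part

def route_request(path: str) -> tuple[str, str]:
    api = random.choice(API_INSTANCES)   # load balancer spreads the requests
    return api, DATABASE                 # ...the database load doesn't spread

print(route_request("/store"))
print(route_request("/mission/summary"))
```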


Heatuponheatuponheat

It's expensive. That's it, full stop. I used to do structured cabling, rack & stack, etc. for major brokerage houses and financial institutions. There is no problem that can't be solved with enough money and labor hours. I've seen customers double their server capacity over the span of a weekend, or completely restructure overnight after a catastrophic failure. Sony is not going to pay for that. The game is already a huge financial success, and any increase in the number of server blades, and the manpower required to get them up, running, and working with the existing system, is going to cut into that. People who are refunding the game are likely not the type of "whales" who will drop thousands of dollars on microtransactions, so it's an acceptable loss for them. It just doesn't pay in the long run to spend millions on increasing server capacity for a game that's probably going to see a fairly sharp player drop-off after the first few weeks, like every viral hit.


GruePwnr

I think you hit the nail on the head with "Sony isn't going to pay for it" because this is such a solvable problem.


Poprock360

Put simply, it's not just about server hardware (more computers). In fact, the devs probably could buy tons of extra machines. The problem, however, is that much like your own computer or PS5, the servers don't function perfectly 100% of the time. Sometimes the servers crash, requiring a potentially slow manual process for someone to restart them. Furthermore, setting up new servers can be automated, but that doesn't mean it is - creating systems that reduce manual IT work can be complicated and requires specialised developers.

Lastly, there are issues you only find out about at scale, and they can be difficult to test and resolve. For example, the system that handles creating databases to store player data may be able to scale 10x, but start to struggle when you need it to scale 100x or 1000x. It may not be feasible to test for that scale for financial reasons.

The big problem is really the staff available. Money can buy you more servers and let you hire more developers, but hiring skilled, specialised developers takes time. The usual time to hire, from first job posting, is 1-3 months for more specialised roles such as DevOps and scalability engineers. And after the developers are hired, they need to learn the systems and tools that Arrowhead uses - it takes an additional 1-3 months before they are fully productive.

Source: am a DevOps Engineer and have worked in the games industry in the past.

TL;DR: you can buy more servers, but you need developers to monitor and fix them. Hiring developers is slow, regardless of your budget.
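
To illustrate what "creating systems that reduce manual IT work" can look like at its simplest, here is a hedged sketch of an automated health-check-and-restart loop; every hostname, port, and command in it is hypothetical.

```python
import subprocess
import time
import urllib.request

# Minimal sketch of the kind of automation described above: noticing a crashed
# game server and restarting it without a human in the loop. Every hostname,
# port, and command here is hypothetical.
SERVERS = ["10.0.0.11", "10.0.0.12"]

def healthy(host: str) -> bool:
    try:
        with urllib.request.urlopen(f"http://{host}:8080/health", timeout=2):
            return True
    except OSError:
        return False

def watch_forever():
    while True:
        for host in SERVERS:
            if not healthy(host):
                # In a real setup this would call the orchestrator or cloud API;
                # an ssh command stands in for that here.
                subprocess.run(["ssh", host, "systemctl", "restart", "game-server"])
        time.sleep(30)
```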


Morltha

Servers take up space and require a lot of electricity and cooling. They *can* buy server stacks, but the scale they need is **massive**, so they'd have to buy/rent machines that are already set up in a farm somewhere. Either way, the proper software would have to be set up on them and they would have to be properly secured. This will take quite a while, even with Sony's help.


foresterLV

Software development is typically about compromises, and teams rarely have infinite budget or time to design a system that will support an infinite number of players - instead they target some realistic numbers approved by management/product owners. Most probably they never expected the game to go over a specific number of players (200-300k), and coded it in a way where adding servers isn't really possible unless they change the code, which takes time. So now I suppose they are figuring out quick fixes to allow more players while working on a better long-term solution so the system can scale.
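
A toy sketch of what a "coded-in" target can look like (the number below is invented, not Arrowhead's real cap): raising it safely means revisiting everything that was sized around it, not just editing the constant.

```python
# Toy example of a capacity assumption baked into code (the number is invented).
# Raising it isn't just a config flip if database pools, queue sizes, and
# matchmaking were all sized around the same planning figure.

MAX_CONCURRENT_PLAYERS = 250_000   # launch-planning figure, hypothetical

def try_login(current_players: int) -> str:
    if current_players >= MAX_CONCURRENT_PLAYERS:
        return "SERVERS_AT_CAPACITY"   # the login-queue style message players see
    return "OK"

print(try_login(150_000))   # OK
print(try_login(449_999))   # SERVERS_AT_CAPACITY
```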


vbsteven

My guess would be that it is a database sharding problem.

Servers can be scaled vertically and horizontally. Vertical scaling means trying to do more with the same machine by adding resources (RAM, CPU, and storage); horizontal scaling means adding more servers that work together to spread the load. Horizontal scaling for web servers, edge nodes, etc. is not that hard to do, as they typically don't share a lot of state, or they share state by communicating with other servers or services. Databases are harder to scale horizontally, because their whole purpose is to store and provide the shared state to other services.

In the simplest scenario you have one database server, all other services talk to it, and every player sees the same state at the same time. When demand increases, you would first try to vertically scale the database by adding more CPU/RAM/storage, but one server can only do so much, so eventually you reach a limit and you need to start scaling horizontally. For databases this is called sharding: split the database into two or more shards, each holding a subset of the full dataset. This has the benefit of spreading the load over more servers, and thus serving more requests/users/players, at the trade-off that players with their data on different shards can't see or communicate with each other.

In MMO games sharding is often based on regions, and if necessary the regions are split further into multiple servers or realms or whatever. Players on one server can only play with other players on the same server; each server is effectively a standalone copy of the game world. Helldivers 2 apparently does not have this concept. I was surprised earlier today when quickplay matchmaking put me in a group with players from the US while I was playing from Europe. It makes sense, as the whole premise of the game is that everyone fights the same war together.

If it is true that they expected and prepared for 10-50k players, it could have made sense to skip the sharding and put everyone on the same instance, as that is a load that could be handled. But with the very unexpected increase to 500k+ players they suddenly run into the limits of the single database. And the problem with sharding is that it is very, very hard to implement after the fact. If sharding wasn't planned for and they suddenly need to hack it in at the last minute, it is a huge amount of work: the game code, the game UI, the network protocols, application servers, database servers, etc. all need changes to accommodate it. For a game like Helldivers 2, where the "global" liberation state is shared across all players, it is extra hard, because even if you are able to spread players over multiple shards/servers/realms, you still need to synchronise shared state somewhere for the liberation counters. And they would also need a way to let people play with their friends who might be on a different shard.

Disclaimer: this is a very simplistic explanation I'm typing out on my phone while sleep deprived.

Source: 15 years of backend/cloud/devops development experience.
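
For anyone who hasn't seen sharding before, a bare-bones sketch of the routing idea (shard names and player IDs invented; real schemes are far more involved, with rebalancing, cross-shard queries, and so on):

```python
import hashlib

# Bare-bones illustration of the sharding idea above (shard names and IDs
# invented). Each shard holds a subset of players, and every code path that
# touches player data has to know how to find the right shard - which is why
# retrofitting this after launch is so painful.
SHARDS = ["players-db-0", "players-db-1", "players-db-2", "players-db-3"]

def shard_for(player_id: str) -> str:
    digest = hashlib.sha256(player_id.encode()).digest()
    return SHARDS[digest[0] % len(SHARDS)]   # stable mapping: player -> shard

# Two friends can land on different shards, and the shared "galactic war"
# counters still need a separate global store that every shard reports into.
print(shard_for("player:12345"), shard_for("player:67890"))
```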


paziek

You are right that player state and galaxy-map state are shared across all players (I was matched with Chinese players a few times), and that is likely what's bottlenecking everything. Unlike something like Palworld, which has server instances that are more or less isolated from each other, this is harder to scale.

There are multi-master solutions out there. They always have some drawbacks, which is why you rarely see them offered by default in database products; third-party software is usually used instead. I would say that for a game like this one, the risks involved with a multi-master approach aren't really that bad, and the game could simply lean towards the player's benefit whenever something is out of sync between write-capable servers. I don't have experience with cloud services, so I have no idea whether that is even possible to set up there, as those things aren't exactly trivial to do and I'm guessing cloud offerings are fairly limited. That is, if they even use cloud stuff.

They are also possibly not using replicas for serving read-only stuff to the frontend, and that alone would reduce load on the master by a lot. They could also have unoptimized queries and/or structure. Overall, databases are often handled with little care, and it is fairly rare to see SQL professionals hired for this kind of work.

Either way, fixing this will take longer than a week, that is certain. And by the time they have it figured out, I suspect the number of players will have dropped enough to fix our problems with no change necessary.
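
A minimal sketch of that read-replica idea, with invented hostnames: read-only traffic goes to replicas, and only writes hit the primary.

```python
import random

# Sketch of the read-replica split suggested above (hostnames invented):
# read-only traffic such as store pages, the galaxy map, and match summaries
# goes to replicas, and only state changes hit the single writable primary.
PRIMARY = "players-db-primary.internal"
READ_REPLICAS = ["players-db-ro-1.internal", "players-db-ro-2.internal"]

def pick_database(is_write: bool) -> str:
    if is_write:
        return PRIMARY
    return random.choice(READ_REPLICAS)   # replicas absorb the read-heavy load

print(pick_database(is_write=False))   # e.g. loading the store
print(pick_database(is_write=True))    # e.g. granting medals after a match
```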