Måns Andersen

Client Migration and Load Balancing

Details

5 weeks at 50%
Written in C++ using Winsock

Goals

A server solution that is:

Robust: it shouldn't fall over if something unexpected happens.
Capable of migrating clients to share and balance execution load.
Scalable to a theoretically infinite size.
Dynamically able to scale while running.
Adaptable: capable of being dropped into, or replacing, most client-server systems without much fuss.

Intent:

The intent of this particular project is to experiment with and simulate heavily loaded servers trying to balance out their load.

The point is not just to share the load, but to do it in a scalable and robust manner.

Plan:

My initial thoughts were that aiming for an easily scalable solution makes a lot of things a lot harder.

To make it scalable, some form of routing will have to happen, likely at the very start of execution.

I want to be able to dynamically add or remove computational power to simulate a server that is never allowed to go down.

This computational power would likely best take the form of server instances spun up & down to fit demand/maintenance.

So to start I'll need to make a client, a game server (host) and a lobby server (router).

The Lobby Server

As with the execution of this not-yet-functioning program, I wanted to start with the routing. However, as I quickly realized, you can't really redirect nothing, so to begin with I needed something of substance to route to. Let's fix that title, shall we? (press the button).

There we go. First time's the charm, if you've got the power to change the past, that is.

I obviously couldn't start with the routing as I'd have nothing to build it towards, so I elected to start with a simple client and server. Luckily I already had a bit of a framework from a couple of school assignments, so I started by stripping out any random hard-coded chat messages, usernames and localhost routing from these and got to work.

With all of the junk quickly removed it started looking clean, and I started to get some hope for this project.

Using this newfound hope I decided to completely blow up the issue of address parsing and spent a good day or two finding and contemplating a domain to use. After realizing all the good ones were already taken I settled on this one and spent another day or so trying to get DNS to cooperate. I ended up being able to find the public name of my router, but not to connect to it.

  
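memcpy(aAddressTarget, &result->ai_addr, result->ai_addrlen); // before: copies from the address of the pointer itself
memcpy(aAddressTarget, result->ai_addr, result->ai_addrlen);  // after: copies the sockaddr the pointer refers to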

If you've programmed in C/C++ you know the implications of that tiny little ampersand. If you haven't, well, you should feel very lucky right about now. Let's just say the function had some really weird behaviours when it tried to use memory addresses as IP addresses. After fixing that, everything ran smoothly.

Now, with this one hiccup behind me, I had a robust way to translate any web address into a usable IP address.

The rest of setting up the client and server was pretty uneventful in comparison: they start up, connect and send a few handshakes back and forth.

I did try to make sure that nothing ever crashed, and after knocking out a few dents I'd like to say I succeeded. If anything ever broke, the connection would be dropped/invalidated and connecting would start from the top. If the client reconnected fast enough it'd even just re-inherit its old profile and continue as if nothing had happened.
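The reconnect behaviour described above can be sketched roughly like this. It is a minimal illustration, not the project's actual code: `ParkedProfiles`, `Profile`, the client id and the grace period are all hypothetical names and parameters, and time is passed in explicitly so the logic is easy to follow.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical per-client state; the real profile contents are not shown in this write-up.
struct Profile {
	std::string name;
	int score = 0;
};

// Keeps the profiles of dropped clients around for a grace period so that
// a quick reconnect can pick up where it left off.
class ParkedProfiles {
public:
	explicit ParkedProfiles(std::chrono::milliseconds aGracePeriod) : myGracePeriod(aGracePeriod) {}

	// Called when a connection is dropped or invalidated.
	void Park(std::uint64_t aClientId, Profile aProfile, Clock::time_point aNow) {
		myParked[aClientId] = Entry{ std::move(aProfile), aNow };
	}

	// Called when a client connects. Returns the old profile if the client
	// came back within the grace period, otherwise nothing.
	std::optional<Profile> Reclaim(std::uint64_t aClientId, Clock::time_point aNow) {
		auto it = myParked.find(aClientId);
		if (it == myParked.end()) {
			return std::nullopt;
		}
		Entry entry = std::move(it->second);
		myParked.erase(it);
		if (aNow - entry.parkedAt > myGracePeriod) {
			return std::nullopt; // too slow, the client starts from the top
		}
		return std::move(entry.profile);
	}

private:
	struct Entry {
		Profile profile;
		Clock::time_point parkedAt;
	};
	std::chrono::milliseconds myGracePeriod;
	std::unordered_map<std::uint64_t, Entry> myParked;
};
```

A server built this way would call `Park` whenever a connection breaks and `Reclaim` during the handshake, falling back to a fresh profile when `Reclaim` returns nothing.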

  
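#include <winsock2.h>
#include <ws2tcpip.h>
#include <cstring>
#include <functional>
#include <string>

// aPrinter(message, isError): this signature is inferred from how the printer is called below.
bool TranslateAddress(const std::string& aAddress, sockaddr* aAddressTarget, bool aAllowFailure, std::function<void(const std::string&, bool)> aPrinter) {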
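	// A literal IPv4 address is written straight into the target; anything else is resolved through DNS below.
	sockaddr_in* target = reinterpret_cast<sockaddr_in*>(aAddressTarget);
	if (inet_pton(AF_INET, aAddress.c_str(), &target->sin_addr) == 1) {
		target->sin_family = AF_INET;
	}
	else {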
		struct addrinfo* ptr = NULL;
		struct addrinfo hints;
		memset(&hints, 0, sizeof(hints));
		hints.ai_family = AF_INET;
		hints.ai_socktype = SOCK_DGRAM;
		hints.ai_protocol = IPPROTO_UDP;
		bool addressFound = false;
		// Keep retrying until an address is found; when aAllowFailure is set, the early returns below end the loop instead.
		while (!addressFound) {
			int error = getaddrinfo(aAddress.c_str(), NULL, &hints, &ptr);
			if (error) {
				if (aAllowFailure) {
					return false;
				}
			}
			else {
				struct addrinfo* result = ptr;
				while (result) {
					using namespace std::string_literals;
					char host[NI_MAXHOST];
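					// getnameinfo returns 0 on success; no service name is requested, so its buffer is NULL with length 0
					if (getnameinfo(result->ai_addr, static_cast<socklen_t>(result->ai_addrlen), host, NI_MAXHOST, NULL, 0, 0) != 0) {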
						aPrinter("wsaerror: " + std::to_string(WSAGetLastError()),true);
						if (aAllowFailure) {
							return false;
						}
					}
					else {
						char address[INET6_ADDRSTRLEN];
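						// inet_ntop wants the in_addr inside the sockaddr, not the sockaddr itself (the hints restrict results to AF_INET)
						if (inet_ntop(AF_INET, &reinterpret_cast<sockaddr_in*>(result->ai_addr)->sin_addr, address, INET6_ADDRSTRLEN) != NULL) {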
							aPrinter("Address: "s + address + std::string(16 - strlen(address), ' ') + host,false);
							memcpy(aAddressTarget, result->ai_addr, result->ai_addrlen);
							addressFound = true;
							break;
						}
						else {
							aPrinter("could not translate address",true);
							if (aAllowFailure) {
								return false;
							}
						}
					}
					result = result->ai_next;
				}
			}
			if (ptr) {
				freeaddrinfo(ptr);
				ptr = NULL; // avoids freeing the same list twice if the next getaddrinfo call fails
			}
		}
	}
	aPrinter("Address parsing done",false);
	return true;
}

The Lobby Server

Once a functioning client and host were running I could once again turn my eyes to the lobby server.

To get a quick start I took the current game server, copied it over and tore out most of what made it a game server.

I launched the lobby server and connected a client to it, and wouldn't you know, since it was running the same code it worked.

The next logical step was to make the host a client...

Wait, that's not logical at all, but bear with me. Since I wanted the system as a whole to be expandable while running, I needed to be able to hook up and remove servers on the fly. Now that sounds an awful lot like clients being able to connect and disconnect, so that's where that line of thinking is coming from.

So once I had ripped out the client part of the client and put it in the host as well, I quickly realized that, because of the way I had gone about it, the lobby server had no idea what was connecting to it. It could be a client or a host, so some form of identification step was necessary. So one was added: upon connecting to a server, the client or host identifies what service it's providing to or requesting from the server.
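As a rough illustration of such an identification step, here is a minimal sketch of a first packet that states the role of whoever is connecting. The role values, the `IdentifyPacket` struct, the version field and the three-byte wire layout are all assumptions made for the example; the project's real packet format is not shown here.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical role values; the real identifiers are not shown in this write-up.
enum class PeerRole : std::uint8_t {
	Client = 1, // wants to be routed to a game server
	Host = 2,   // offers game server capacity
};

// First packet sent after connecting.
struct IdentifyPacket {
	PeerRole role;
	std::uint16_t protocolVersion;
};

// Assumed wire layout: 1 byte role, then 2 bytes version in network byte order.
std::vector<std::uint8_t> Encode(const IdentifyPacket& aPacket) {
	return {
		static_cast<std::uint8_t>(aPacket.role),
		static_cast<std::uint8_t>(aPacket.protocolVersion >> 8),
		static_cast<std::uint8_t>(aPacket.protocolVersion & 0xFF),
	};
}

// Returns nothing for malformed packets, so the server can drop the
// connection instead of guessing what is on the other end.
std::optional<IdentifyPacket> Decode(const std::vector<std::uint8_t>& aBytes) {
	if (aBytes.size() != 3) {
		return std::nullopt;
	}
	if (aBytes[0] != static_cast<std::uint8_t>(PeerRole::Client) &&
		aBytes[0] != static_cast<std::uint8_t>(PeerRole::Host)) {
		return std::nullopt;
	}
	IdentifyPacket packet;
	packet.role = static_cast<PeerRole>(aBytes[0]);
	packet.protocolVersion = static_cast<std::uint16_t>((aBytes[1] << 8) | aBytes[2]);
	return packet;
}
```

Rejecting unknown roles up front fits the robustness goal: a connection that cannot say what it is gets dropped before any other state is created for it.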

The next step

With both a client and a host connected to the lobby and identified, the next step would be to connect them to each other.

Simple enough: send the address of the host to the client and have it reconnect to the host.

If only it had been that easy. Well, it was... I was the one that made it hard. Since I wanted to simulate something realistic, I wanted to make sure the client connected to the optimal host. So the question arose: what is the optimal host?

The criteria i settled on were:

Can the server handle my load, or will I drag the tick rate below some arbitrary threshold and ruin the experience?
What's the latency of my actions, i.e. how long does a round trip (ping) take?

So now I needed to measure. I started by trying to figure out the load on each server to see if any of them needn't be considered at all. This was simply solved by having the hosts self-report to the lobby, which also served as an excellent way to filter out dead servers.

Ping time was a bit harder, however, as my network interface was only built for one-to-one communication.

Since I only needed to send one packet, I didn't want to start up a whole socket just to test the ping, so I had to rewrite the interface a bit to allow what I called custom-addressed packets. Once every relevant function had its arguments changed to reflect the existence of these packets, everything went swimmingly. If I had had more time on this project I would have wanted to rework the interface to better allow multiple connections.

Now measuring the ping of any server was easy: create a ping packet, hook up a callback and send it off to a custom address. If it takes too long, assume the server went down and discard it.
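A minimal sketch of how the lobby side of this self-reporting could work is shown below. The `HostRegistry` class, the timeout and the idea of expressing load as a fraction of the tick budget are assumptions for the sake of the example, not the project's actual design.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Lobby-side bookkeeping of what each host last reported.
class HostRegistry {
public:
	explicit HostRegistry(std::chrono::milliseconds aTimeout) : myTimeout(aTimeout) {}

	// Called whenever a host's periodic load report arrives.
	// aLoad is assumed to be the fraction of the tick budget in use (0.0 to 1.0).
	void Report(const std::string& aAddress, double aLoad, Clock::time_point aNow) {
		myHosts[aAddress] = Status{ aLoad, aNow };
	}

	// Hosts that reported recently and are at or below the load limit.
	// Hosts that have gone quiet are forgotten along the way.
	std::vector<std::string> Available(double aMaxLoad, Clock::time_point aNow) {
		std::vector<std::string> result;
		for (auto it = myHosts.begin(); it != myHosts.end();) {
			if (aNow - it->second.lastReport > myTimeout) {
				it = myHosts.erase(it); // no report for too long, assume it is dead
				continue;
			}
			if (it->second.load <= aMaxLoad) {
				result.push_back(it->first);
			}
			++it;
		}
		return result;
	}

private:
	struct Status {
		double load = 0.0;
		Clock::time_point lastReport;
	};
	std::chrono::milliseconds myTimeout;
	std::map<std::string, Status> myHosts;
};
```

The same report doubles as a heartbeat, which is what makes dead servers drop out without any extra protocol.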
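The ping bookkeeping described above might look something like the following sketch. `PingTracker` and its methods are hypothetical names; the actual sending of the custom-addressed packet is left out, and time is passed in by the caller so that only the callback and timeout logic is shown.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

class PingTracker {
public:
	// Receives the round-trip time, or nothing if the ping timed out.
	using Callback = std::function<void(std::optional<std::chrono::milliseconds>)>;

	explicit PingTracker(std::chrono::milliseconds aTimeout) : myTimeout(aTimeout) {}

	// Registers a ping and returns the id to put in the outgoing packet.
	std::uint32_t Start(Callback aCallback, Clock::time_point aNow) {
		std::uint32_t id = myNextId++;
		myPending.emplace(id, Pending{ std::move(aCallback), aNow });
		return id;
	}

	// Called when a reply carrying aId comes back. Unknown or expired ids are ignored.
	void OnReply(std::uint32_t aId, Clock::time_point aNow) {
		auto it = myPending.find(aId);
		if (it == myPending.end()) {
			return;
		}
		Pending pending = std::move(it->second);
		myPending.erase(it);
		pending.callback(std::chrono::duration_cast<std::chrono::milliseconds>(aNow - pending.sentAt));
	}

	// Called regularly; gives up on pings that have been out for too long.
	void Expire(Clock::time_point aNow) {
		std::vector<Callback> expired;
		for (auto it = myPending.begin(); it != myPending.end();) {
			if (aNow - it->second.sentAt > myTimeout) {
				expired.push_back(std::move(it->second.callback));
				it = myPending.erase(it);
			}
			else {
				++it;
			}
		}
		// Callbacks run after the loop so they are free to start new pings.
		for (Callback& callback : expired) {
			callback(std::nullopt);
		}
	}

private:
	struct Pending {
		Callback callback;
		Clock::time_point sentAt;
	};
	std::chrono::milliseconds myTimeout;
	std::uint32_t myNextId = 1;
	std::unordered_map<std::uint32_t, Pending> myPending;
};
```

A client would call `Start` once per candidate host, write the returned id into the ping packet, and call `Expire` from its update loop so that silent hosts are reported as timed out rather than left hanging.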

Once all servers have been evaluated by the client, it sends the results back to the lobby server and waits for a decision.

Had this been a real game, this is where the player would have been able to interfere and connect to a specific server of their choosing, if they wanted to play with friends etc.

Once the lobby server gets the results for every possible server (or too much time has passed), it figures out which of them would be best suited based on the current load of, and round-trip time to, each of them. It then sends off the result to the client, who quickly disconnects and connects to the chosen server instead.

I played around a bit with the thought of making this whole process recursive, i.e. repeat until a host is found, and having more than one lobby server in a large tree structure. Sadly, due to time constraints, I didn't have the chance to add this functionality.
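How load and round-trip time are weighed against each other is not spelled out here, so the following is only one plausible sketch: hosts over a load limit or without a ping result are skipped, the lowest round-trip time wins, and load breaks ties. `Candidate`, `ChooseHost` and the load scale are assumptions for the example.

```cpp
#include <chrono>
#include <optional>
#include <string>
#include <vector>

struct Candidate {
	std::string address;
	double load;                                  // assumed: fraction of the tick budget in use, as self-reported
	std::optional<std::chrono::milliseconds> rtt; // empty if the client's ping timed out
};

// Picks the reachable host with the lowest round-trip time among those that
// still have capacity. Returns nothing if no host qualifies.
std::optional<std::string> ChooseHost(const std::vector<Candidate>& aCandidates, double aMaxLoad) {
	const Candidate* best = nullptr;
	for (const Candidate& candidate : aCandidates) {
		if (!candidate.rtt || candidate.load > aMaxLoad) {
			continue; // unreachable or too busy to take another client
		}
		if (!best || *candidate.rtt < *best->rtt ||
			(*candidate.rtt == *best->rtt && candidate.load < best->load)) {
			best = &candidate;
		}
	}
	if (!best) {
		return std::nullopt;
	}
	return best->address;
}
```

Treating load as a hard filter and latency as the ranking keeps the two criteria from the list above separate; a weighted score would be an equally reasonable alternative.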

Migration

Now I had a pretty well-functioning setup going, where both clients and hosts could connect and disconnect while everything was running. Because the lobby server is ignored once a connection to something else has been established, in a production environment the lobby server could be temporarily shut down for maintenance, and the only thing that would break is that new players would not be able to join until it's back up again.

This is all fine and relatively effective for what it's doing, but if you scroll back up to the top you'll see that I labeled it client migration and load balancing when I started. I didn't do that just for fun. Now that we've covered how it's load balancing (to some extent), let's get to why and how migration works.

Let's paint a scenario. Imagine a game, something like RuneScape. It's a Friday afternoon and the servers are under pretty much no load, as most players are at work or in school. The players that are online are not very computationally heavy, doing mostly busywork after a day of playing. Once school and work end, however, server load skyrockets as everyone logs on.

The servers balance the load as best they can as people log in. But when all these new people join, some of the players that were already online get excited and start playing more actively, increasing their load. This puts more strain on some servers than others, as they were only balanced at the time of login. Later in the evening most players are done with their daily fix and log off, but some stay behind and play for longer, further unbalancing the load on the servers.

So what do you do to keep the load from drifting out of balance over time? Well, you redistribute the load. At least that is what I did.

To begin with, we have to identify when the load is imbalanced. Once a server is nearing the limit of what it can handle while keeping up its tick rate, it signals the lobby server, which checks whether any other server in the network is under capacity. If there is one, the lobby responds to the host with the address of the most suitable new host. The original host then begins migrating a user by first neatly serializing any relevant data, then sending that data to the new host.

But I couldn't just send the data from host to host. Remember, both servers and clients connect in the same way. I had to reuse the identification part of the connection to establish that this new connection was in fact not a client but rather a server sending a client.

After sending all of the data, the old host signals the new host and the package is deserialized. Now this is where my quest for robustness paid off. Remember how, if the client reconnected fast enough, it'd re-inherit its profile? So to get a client to join a host with an already established profile, all I really had to do was pretend that it was reconnecting after having disconnected.

Now all I had to do was prepare the profile on the new host as if the client had disconnected, and send the client the go-ahead to reconnect to the new host.
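The serializing step could look something like the sketch below, with the receiving host running the matching `Deserialize`. The `Profile` fields and the byte layout (fixed-width big-endian integers followed by a length-prefixed name) are made up for the example; what actually counts as relevant data depends on the game.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical profile contents.
struct Profile {
	std::uint64_t clientId = 0;
	std::string name;
	std::int32_t score = 0;
};

using Bytes = std::vector<std::uint8_t>;

// Appends an integer in big-endian order, aSize bytes wide.
void PutInt(Bytes& aOut, std::uint64_t aValue, std::size_t aSize) {
	for (std::size_t i = 0; i < aSize; ++i) {
		aOut.push_back(static_cast<std::uint8_t>(aValue >> (8 * (aSize - 1 - i))));
	}
}

// Reads an aSize byte big-endian integer and advances aPos. Fails on truncated input.
bool GetInt(const Bytes& aIn, std::size_t& aPos, std::size_t aSize, std::uint64_t& aValue) {
	if (aSize > aIn.size() - aPos) {
		return false;
	}
	aValue = 0;
	for (std::size_t i = 0; i < aSize; ++i) {
		aValue = (aValue << 8) | aIn[aPos++];
	}
	return true;
}

// Layout: 8 bytes id, 4 bytes score, 4 bytes name length, then the name itself.
Bytes Serialize(const Profile& aProfile) {
	Bytes out;
	PutInt(out, aProfile.clientId, 8);
	PutInt(out, static_cast<std::uint32_t>(aProfile.score), 4);
	PutInt(out, aProfile.name.size(), 4);
	out.insert(out.end(), aProfile.name.begin(), aProfile.name.end());
	return out;
}

// Returns nothing if the data is truncated or has trailing bytes.
std::optional<Profile> Deserialize(const Bytes& aIn) {
	std::size_t pos = 0;
	std::uint64_t id = 0, score = 0, length = 0;
	if (!GetInt(aIn, pos, 8, id) || !GetInt(aIn, pos, 4, score) || !GetInt(aIn, pos, 4, length)) {
		return std::nullopt;
	}
	if (length != aIn.size() - pos) {
		return std::nullopt;
	}
	Profile profile;
	profile.clientId = id;
	profile.score = static_cast<std::int32_t>(static_cast<std::uint32_t>(score));
	profile.name.assign(aIn.begin() + pos, aIn.end());
	return profile;
}
```

Validating the length on the receiving side means a half-delivered migration is rejected as a whole rather than producing a corrupted profile on the new host.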

Takeaways

Since I didn't want the project to be bound to a specific game, I made sure to keep everything as abstract as possible code-wise, and I found it harder and harder to work in a vacuum without something tangible to see and interact with. If I did this project again I would start by getting something to see, likely some kind of player that could be moved, and build from that rather than working with only the infrastructure. At certain points it would have been nice to be able to see whether everything was working as intended or merely seemed like it.
If I had had more time on my hands I would have liked to make the lobby capable of recursion, i.e. a lobby of lobbies, as that could help cut down some marginal load, mostly on the lobby, since it wouldn't have to know of every host there is in the world.
I had originally wanted to hook the servers up to something like Grafana, but with my small and ever-changing setup it didn't end up being worth it, so I made my own little monitoring program to figure out the status of everything.
I really like these kinds of challenges as there is a very clear-cut goal to reach: it needs to work and it needs to do this, that and this, which makes it easy to figure out when you're done.

https://github.com/Fiskmans/ClientMigration