Maintenance due to server migration.

Scheduled Maintenance Report for SukebeZone+

Postmortem

The migration is finally over, and we learned several things from this "trip", including mistakes that we hope not to repeat in the future.

  • Initially we estimated that the migration would take 3 hours. It stretched to almost 12 hours, and we didn't even succeed the first time.
  • Migrating the user files (photos, invoices, sessions, etc.) was a simple process that took about an hour and a half.
  • Migrating the database... made us realize that we were not prepared.

So what went wrong?

  • The database is a SQL file of approximately 2.5 GB.
  • The heaviest tables are the ones that store information about events, photos and the overwatch program.
  • We discovered, too late, that the volume we were using for the new server had the same speed as a traditional (ultra slow) hard disk 🤦.
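
A quick write test before the migration would have exposed the slow volume. The following is a minimal Python sketch of such a check, not the test we actually ran: the function name `write_throughput_mb_s` and its default sizes are hypothetical, and it measures only sequential writes (including the final flush to disk), not random access.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory: str, total_mb: int = 64, chunk_mb: int = 4) -> float:
    """Write a temporary file in `directory` and return the rate in MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # random data, generated before timing starts
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            written = 0
            while written < total_mb:
                f.write(chunk)
                written += chunk_mb
            f.flush()
            os.fsync(f.fileno())  # force the data onto the volume, not just the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the temporary file
    return written / elapsed


if __name__ == "__main__":
    # Point this at a directory on the volume to be checked (here: the system temp dir).
    print(f"{write_throughput_mb_s(tempfile.gettempdir()):.1f} MB/s")
```

Running it against a directory on each candidate volume gives a rough number to compare before any data is moved.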

Here is a summary (without the times we cursed the database and the server) of what happened:

  • After a basic check and successfully migrating other services to the new server, we decided it was time to migrate SukebeZone+. What could go wrong? We activated the maintenance at 1am on January 24th and started with the migration tasks.
  • The first big task was to import the (freshly exported) database to the new server. This process ran for approximately 40 minutes before it seemed to get stuck at 100%, so we interrupted it. We did not notice any missing data.
  • We decided it was a good idea (🤷‍♂️) to do tests in the staging environment. Unfortunately, there seemed to be problems in the code that prevented it from being accessed correctly, so we spent about 30 minutes fixing these bugs.
  • Once the bugs were fixed, we found several errors related to the database. When we investigated, we realized that operations such as the creation of relationships, indexes and AUTO_INCREMENT had not been executed (we had interrupted the import before it reached them).
  • This was a surprise. At first we thought that the database had not been exported completely, so we checked the SQL file and saw that these operations were at the bottom.
  • That explained why it seemed to freeze at the end during import, so we decided to re-import the database but this time waiting for it to finish completely.
  • It took almost 2 hours to finish... but under the pressure of finishing the migration tasks we didn't think much of it.
  • We performed the tests in the staging environment and everything seemed to work correctly. We quickly started to prepare the production environment.
  • As in the staging environment, we re-imported the database and let it finish completely.
  • We lifted the maintenance and... disaster. The same database-related errors appeared in production.
  • That did not make any sense, so we checked and indeed, the final operations had not been executed, even though the re-import had finished cleanly.
  • We did not want to wait another 2 hours for a re-import, so we decided to isolate the final operations and run them one by one.
  • At first it seemed to work, until it was the turn of the heavier tables. There, the creation of indexes and relations took up to 40 minutes per operation.
  • Such long run times were very strange. We had never experienced anything like this on past servers, but the pressure was still on, so we decided to continue even with those long waiting times.
  • No matter how long we waited, the process was silently interrupted: no errors, but no result either.
  • After several hours of investigating, we realized that these operations work on a copy of the table. Because the table was large, this copy was created on the storage volume instead of in RAM.
  • We finally realized that the storage volume was VERY VERY slow, which caused these operations to take so long and then fail. So we decided to migrate the entire volume to a new high speed volume.
  • But of course, if migrating a database was time consuming, migrating a whole volume was even more time consuming...
  • When the process was over, despair set in. The volume was still slow, sometimes even slower than the previous one. How was this possible if the volume was categorized as "high speed gen 2"?
  • We ran speed tests over and over again, migrating data from one volume to another. Nothing made sense until we read the documentation from our server vendor. The speed of the "high speed gen 2" volumes depended on the amount of GB used, and because our volume only used 17 GB, the speed was slow.
  • By this point we had already spent more than 12 hours performing migration tasks, waiting for hours and hours. We decided to end the migration as a failure, restore the functionality of the previous server and try again tomorrow.
  • By the 26th we had done everything from scratch. New server, new fast volume, everything should be fine. We first migrated the other services (as before), and that went well. Now it was SukebeZone+'s turn.
  • We performed the platform migration and transferred the user files, and the database was imported in its entirety in about 30 minutes! At last!!
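
The final operations mentioned above (relationships, indexes, AUTO_INCREMENT) sat at the bottom of the SQL file, so an interrupted import can look complete while they are still missing. Below is a minimal Python sketch that lists those statements so they can be reviewed or run one at a time. It assumes the dump expresses them as `ALTER TABLE` statements that end with `;` at the end of a line; this report does not show the file's contents, so treat that as an assumption. The function name `alter_statements` is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the first line of an ALTER TABLE statement (case-insensitive).
ALTER_START = re.compile(r"^ALTER\s+TABLE\b", re.IGNORECASE)


def alter_statements(lines: Iterable[str]) -> List[str]:
    """Return every ALTER TABLE statement in a SQL dump, one string per statement."""
    statements: List[str] = []
    current: List[str] = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if not current and not ALTER_START.match(stripped):
            continue  # not inside an ALTER TABLE statement, skip this line
        current.append(stripped)
        if stripped.endswith(";"):
            # Statement is complete: join its lines into a single string.
            statements.append(" ".join(current))
            current = []
    return statements


if __name__ == "__main__":
    sample = [
        "INSERT INTO `events` VALUES (1);",
        "ALTER TABLE `events`",
        "  ADD PRIMARY KEY (`id`);",
    ]
    for statement in alter_statements(sample):
        print(statement)
```

Reading the dump line by line keeps memory use flat even for a multi-gigabyte file, for example `alter_statements(open("dump.sql", encoding="utf-8"))` with a hypothetical file name.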

With this we learned a valuable lesson: ALWAYS, ALWAYS use fast volumes/disks.

Now that all our services are on the new server and infrastructure, we can continue to maintain them and create new ones.

Posted Jan 26, 2024 - 22:12 CST

Completed

The scheduled maintenance has been completed.
Posted Jan 26, 2024 - 22:11 CST

Update

The photos are now loading correctly. Photos may load more slowly for the next few hours until they are cached.

The migration seems to have been successful. We will keep monitoring over the next hour.
Posted Jan 26, 2024 - 21:41 CST

Update

We are investigating a problem with loading photos and results.
Posted Jan 26, 2024 - 21:28 CST

Verifying

The database and user files have finished transferring. We will be testing in production and monitoring performance.
Posted Jan 26, 2024 - 21:20 CST

Update

The migration of user files (photos, invoices, sessions, etc.) and the database has started.
Posted Jan 26, 2024 - 20:39 CST

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Jan 26, 2024 - 20:31 CST

Scheduled

After several tests and checks, we are ready to retry the server migration of the platform.

This time we are confident that the migration will take 3 hours or less. During this time the platform will not be accessible.
Posted Jan 26, 2024 - 20:13 CST