My migration war story

ninja · 2 November 2022 12:29

Hi all,

Sharing my migration story of Nudie.Social. There were a few unexpected speed bumps along the way, I hope someone will find the following info useful when they migrate their instances from Pleroma to Akkoma.

Kudos to FloatingGhost for a very polished fork, and the many user-centric new features. 110% worth the effort to migrate!

TL;DR - pg_restore sometimes give you headaches; Double check your database schema and indices after migration, especially if running a recent Pleroma develop branch.

Disclaimer: YMMV. I’m sharing my own migration experience of a single instance. Most of the issues I had during migration have nothing to do with Akkoma code. 100% of this post is about lessons I learned during migration, and to help other instance admins.

Some background: My instance was running a recent (09/2022) version of Pleroma develop branch, from source. I was migrating the instance to Akkoma as well as to a new VM in a single move. New box will run from source as well, not OTP.

Friendly reminder: Always have a working backup of both the source tree, which includes the static and uploads directories, as well as a full dump of the Postgres database. I recommend using the plain SQL dump, not the custom postgres dump (more on this later). Make sure the Pleroma server is shut down before creating the dump for 100% consistent data.

Migration starts

Pulled latest Akkoma develop branch, deleted _build directory and made sure the code is 100% unmodified Akkoma code - no merge, no rebase. Then, ran mix pleroma.instance gen command to get the latest generated_config.exs file and setup_db.sql. I manually inspected the generated config and merge any new options to the existing prod.secret.exs. p.s. I was running in-database config so not much to merge. Checked setup_db.sql and made sure all Postgres extentions were installed, which they were. Those were the easy parts.

Speed bump #1: pg_restore woes

As mentione above, I was migrating the instance to a new VM with newer OS (Debian bullseye), so I needed to perform a PostgreSQL upgrade from 10.x to 13.x. I decided to use pg_dump > pg_restore to create the database brand new. Immediately, I ran into issues:

1. Lots of errors during restore

This turns out to be a charset/collate/encoding issue. On the new VM, databases were created with default SQL_ASCII charset, which caused certain data rows to error on restore. Re-created database with ENCODING = UTF8 solved the issue. Further reading: https://stackoverflow.com/a/9963618 My original db ran on Postgres Docker image, which probably defaults to UTF8.

2. Restore stuck at certain CREATE INDEX statement

I don’t fully understand the root cause, but one particular CREATE INDEX statement was stuck with 100% CPU usage during restore. I wasted more than an hour waiting for it to complete. Tweaking Postgres config to give more resources to the maintenance workers had no effect.

Offending index name: activities_visibility_index.

I had two choices: 1) Skip the bad CREATE INDEX statement, or 2) Abandon the pg_restore route, and transfer the /var/lib/postgresql directory content to the new host, and run pg_upgrade.

At the end, I re-ran pg_dump using plain SQL format, not custom format, and commented out the CREATE INDEX activities_visibility_index statement. With the custom format, there’s no way to edit the SQL statements, hence plain SQL dump format is the way to go.

Luckily, no other SQL statement caused me grief. I was able to fully restore the database to new VM with zero error. Next, I ran the CREATE INDEX activities_visibility_index statement manually and, lo and behold, took 3-5 minutes to complete, LOL.

Lesson learned: Don’t waste too much time waiting for pg_restore to complete. UTF8 encoding is great.

Speed bump #2: Frozen, everything

At this point, I’ve got a pristine database and a pristine source tree, mix ecto.migrate was done, I was hoping to launch the newly migrated instance and call it a day! Not so fast, as I had a huge problem immediately after site start: Everything slowed to a crawl, all CPU cores maxed out 100%, I/O maxed out! Timelines didn’t load AT ALL. Pandemonium.

First thing I did to diagnose, was to make sure PGtune config was applied correctly - It was. Then I did some frantic Googling and enabled slow query logs. There were thousands of slow queries around timeline related queries. They all timed out after 15 seconds.

I shut down mix phx.server and ran EXPLAIN ANALYZE on the slow queries - Each one of them took 10 seconds or so to complete, using full table scans. Then, as a test, I fired up the old Pleroma source tree, connecting to the new database - The instance ran silky smooth and generated almost no load to the box. This effectively ruled out potential db corruption or config issue. At this stage, the issue was most likely around database indexes (I hoped).

To solve this, I created a new, empty Postgres database, switched back to the Akkoma source tree, generated another brand new config using mix pleroma.instance gen, and ran ecto.migrate on top of the empty database, to generate a pristine database schema. Then, used pg_dump to grab the schema of both databases and looked at the diff between them.

There were at least 3 significant differences between the Pleroma/Akkoma schemas. The last one was the most interesting: CREATE INDEX activitie_create_objects_index. In the Pleroma version, the index was created on top of a function; OTOH, the Akkoma version was a COALESCE().

p.s. Looks like this issue was experienced by at least one other instance admin: Making sure you're not a bot!

After dropping and recreating the 3 indices / discrepencies with the pristine Akkoma version, I restarted the instance and - big sigh of relief - Job done!

Lesson learned: Slow query log is awesome, alway have backups (app code as well as database)

Thanks for reading,
Ninja

ninja · 2 November 2022 12:37

Bonus speed bump: Frontends not updating

After the successful migration, I noticed pleroma-FE looked more or less than same as the old one, even after running mix pleroma.frontend install pleroma-fe a few times. Odd, wasn’t it?

Turned out the in-database config was storing old download URLs to the various FEs.

Problem was easily fixed by running mix pleroma.config delete pleroma frontends to remove the in-database config section, then restart the server to let the default URLs re-populate themselves, then run ``mix pleroma.frontend install` again to download the latest and greatest FEs (Yay!)

Ninja

FloatingGhost · 4 November 2022 04:44

interesting, thanks for the detailed account!

seems that function index thing was added reasonably recently → Add function to calculate associated object id (!3692) · Merge requests · Pleroma / pleroma · GitLab

wonder why that particular thing is causing an issue - i suppose if one is migrating from a newer pleroma instance, they might conflict

so you say the fix was to delete the index and recreate as per akkoma? interesting, i didn’t consider what effect newer pleroma migrations might have

i’ll have to think of how to mitigate against that..

ninja · 4 November 2022 06:12

The fork has diverged quite a bit now Yup !3692 is the offending merge request

Pleroma has some new tables too, not sure what they’re for:

oban_beats
user_frontend_setting_profiles