Hi all,
Sharing my migration story of Nudie.Social. There were a few unexpected speed bumps along the way; I hope this info will be useful to others migrating their instances from Pleroma to Akkoma.
Kudos to FloatingGhost for a very polished fork, and the many user-centric new features. 110% worth the effort to migrate!
TL;DR - pg_restore can give you headaches; double-check your database schema and indices after migration, especially if you were running a recent Pleroma `develop` branch.
Disclaimer: YMMV. I’m sharing my own migration experience of a single instance. Most of the issues I had during migration have nothing to do with Akkoma code. 100% of this post is about lessons I learned during migration, and to help other instance admins.
Some background: My instance was running a recent (09/2022) version of the Pleroma `develop` branch, from source. I was migrating the instance to Akkoma and to a new VM in a single move. The new box runs from source as well, not OTP.
Friendly reminder: Always have a working backup of both the source tree, which includes the static and uploads directories, and a full dump of the Postgres database. I recommend using the plain SQL dump, not the custom-format dump (more on this later). Make sure the Pleroma server is shut down before creating the dump to guarantee consistent data.
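The reminder above can be sketched as a small script. The service name, database name, and paths below are assumptions for illustration; adjust them for your own instance. It is written out to a file for review rather than run blindly:

```shell
cat > backup_pleroma.sh <<'EOF'
#!/bin/sh
set -eu
systemctl stop pleroma                                  # stop writes first, for a consistent dump
sudo -u postgres pg_dump --format=plain pleroma \
    > /backup/pleroma.sql                               # plain SQL stays editable later
tar czf /backup/pleroma-files.tar.gz \
    -C /opt/pleroma static uploads                      # static + uploads directories
EOF
chmod +x backup_pleroma.sh
```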
Migration starts
Pulled the latest Akkoma `develop` branch, deleted the `_build` directory, and made sure the code was 100% unmodified Akkoma code - no merge, no rebase. Then ran the `mix pleroma.instance gen` command to get the latest `generated_config.exs` and `setup_db.sql` files. I manually inspected the generated config and merged any new options into the existing `prod.secret.exs`. (p.s. I was running in-database config, so there was not much to merge.) Checked `setup_db.sql` and made sure all Postgres extensions were installed, which they were. Those were the easy parts.
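The "easy parts" above, sketched as one script. The install path, branch, and the location of `setup_db.sql` are assumptions; the `mix` task is interactive, so this is saved to a file for review rather than executed here:

```shell
cat > refresh_akkoma.sh <<'EOF'
#!/bin/sh
set -eu
cd /opt/akkoma
git fetch origin
git checkout develop && git reset --hard origin/develop  # unmodified upstream tree
rm -rf _build                                            # force a clean rebuild
mix pleroma.instance gen        # regenerates generated_config.exs and setup_db.sql
grep 'CREATE EXTENSION' config/setup_db.sql              # list required extensions
EOF
```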
Speed bump #1: pg_restore woes
As mentioned above, I was migrating the instance to a new VM with a newer OS (Debian bullseye), so I needed to perform a PostgreSQL upgrade from 10.x to 13.x. I decided to use `pg_dump` / `pg_restore` to create the database brand new. Immediately, I ran into issues:
1. Lots of errors during restore
This turned out to be a charset/collation/encoding issue. On the new VM, databases were created with the default `SQL_ASCII` charset, which caused certain data rows to error on restore. Re-creating the database with `ENCODING = UTF8` solved the issue. Further reading: UTF8 Postgresql Create Database Like MySQL (including character set, encoding, and lc_type) - Stack Overflow. My original db ran on the Postgres Docker image, which probably defaults to `UTF8`.
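The fix, sketched as a SQL file. Database and owner names are assumptions; `TEMPLATE template0` is needed when the requested encoding or locale differs from `template1`'s:

```shell
cat > recreate_db.sql <<'EOF'
DROP DATABASE IF EXISTS pleroma;
CREATE DATABASE pleroma
    OWNER pleroma
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C'
    TEMPLATE template0;   -- template0 avoids inheriting template1's encoding
EOF
# then: sudo -u postgres psql -f recreate_db.sql
```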
2. Restore stuck at certain CREATE INDEX statement
I don’t fully understand the root cause, but one particular `CREATE INDEX` statement was stuck at 100% CPU usage during the restore. I wasted more than an hour waiting for it to complete; tweaking the Postgres config to give more resources to the maintenance workers had no effect. Offending index name: `activities_visibility_index`.
I had two choices: 1) skip the bad `CREATE INDEX` statement, or 2) abandon the `pg_restore` route, transfer the `/var/lib/postgresql` directory content to the new host, and run `pg_upgrade`.
In the end, I re-ran `pg_dump` using the plain SQL format, not the custom format, and commented out the `CREATE INDEX activities_visibility_index` statement. With the custom format there’s no way to edit the SQL statements, hence the plain SQL dump format is the way to go.
Luckily, no other SQL statement caused me grief, and I was able to fully restore the database to the new VM with zero errors. Next, I ran the `CREATE INDEX activities_visibility_index` statement manually and, lo and behold, it took only 3-5 minutes to complete, LOL.
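A toy demonstration of the workaround: comment out the offending `CREATE INDEX` line in a plain-format dump. The two-line dump below is a stand-in, not the real Pleroma schema:

```shell
cat > dump.sql <<'EOF'
CREATE TABLE activities (id bigint PRIMARY KEY);
CREATE INDEX activities_visibility_index ON activities (id);
EOF
# prefix the matching line with a SQL comment marker; '&' is the matched text
sed -i 's/^CREATE INDEX activities_visibility_index/-- &/' dump.sql
cat dump.sql
```

After the restore finishes, run the commented-out statement manually in `psql`, as described above.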
Lesson learned: Don’t waste too much time waiting for pg_restore to complete. UTF8 encoding is great.
Speed bump #2: Frozen, everything
At this point I had a pristine database and a pristine source tree, `mix ecto.migrate` was done, and I was hoping to launch the newly migrated instance and call it a day! Not so fast: immediately after the site started, I had a huge problem. Everything slowed to a crawl, all CPU cores maxed out at 100%, I/O maxed out! Timelines didn’t load AT ALL. Pandemonium.
The first thing I did to diagnose was make sure the PGTune config was applied correctly - it was. Then I did some frantic Googling and enabled the slow query log. There were thousands of slow, timeline-related queries, all timing out after 15 seconds.
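Enabling the slow query log can be sketched as below. The 1000 ms threshold is my own pick, not from the post, and the setting is appended to a local copy of `postgresql.conf` here to keep the example self-contained:

```shell
touch postgresql.conf            # stand-in for the real config file
cat >> postgresql.conf <<'EOF'
log_min_duration_statement = 1000   # log every statement slower than 1s
EOF
```

After editing the real config, reload with `systemctl reload postgresql` or `SELECT pg_reload_conf();` from `psql`.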
I shut down `mix phx.server` and ran `EXPLAIN ANALYZE` on the slow queries - each one of them took 10 seconds or so to complete, using full table scans. Then, as a test, I fired up the old Pleroma source tree, connecting to the new database - the instance ran silky smooth and generated almost no load on the box. This effectively ruled out potential db corruption or config issues. At this stage, the issue was most likely around database indexes (I hoped).
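The `EXPLAIN ANALYZE` step, sketched out. The query body below is a placeholder; paste a real query copied from the slow query log in its place:

```shell
cat > explain.sql <<'EOF'
EXPLAIN ANALYZE
-- placeholder: replace with one of the logged timeline queries
SELECT 1;
EOF
# then: sudo -u postgres psql -d pleroma -f explain.sql
```

A `Seq Scan` node in the output is the full table scan described above; a healthy timeline query should be hitting an `Index Scan` instead.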
To solve this, I created a new, empty Postgres database, switched back to the Akkoma source tree, generated another brand-new config using `mix pleroma.instance gen`, and ran `ecto.migrate` on top of the empty database to generate a pristine database schema. Then I used `pg_dump` to grab the schema of both databases and looked at the `diff` between them.
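A toy version of that schema comparison. In real life each file would come from `pg_dump --schema-only <dbname>`; the two stand-in files below just simulate a Pleroma/Akkoma index discrepancy:

```shell
cat > schema_pleroma.sql <<'EOF'
CREATE INDEX idx_example ON activities (some_function(data));
EOF
cat > schema_akkoma.sql <<'EOF'
CREATE INDEX idx_example ON activities (COALESCE(data, other));
EOF
diff -u schema_pleroma.sql schema_akkoma.sql || true   # diff exits 1 when files differ
```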
There were at least 3 significant differences between the Pleroma and Akkoma schemas. The last one was the most interesting: `CREATE INDEX activitie_create_objects_index`. In the Pleroma version, the index was created on top of a function; OTOH, the Akkoma version used a `COALESCE()` expression.
p.s. Looks like this issue was experienced by at least one other instance admin: #215 - A lot of errors when migrating from Pleroma - akkoma - Akkoma Development
After dropping and recreating the 3 divergent indices to match the pristine Akkoma version, I restarted the instance and - big sigh of relief - job done!
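The final fix, sketched as a SQL file. The index name below is a placeholder; take the real names and `CREATE INDEX` definitions from the pristine Akkoma schema dump:

```shell
cat > fix_indexes.sql <<'EOF'
DROP INDEX IF EXISTS idx_example;
-- paste the matching CREATE INDEX statement from the Akkoma schema here
EOF
# then: sudo -u postgres psql -d pleroma -f fix_indexes.sql
```

On a live instance, `CREATE INDEX CONCURRENTLY` avoids holding a write lock during the rebuild (note that it cannot run inside a transaction block).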
Lesson learned: The slow query log is awesome; always have backups (app code as well as database).
Thanks for reading,
Ninja