Oban_jobs keeps old jobs

ilja · 18 May 2024 11:53

I noticed when posting that I often got an error, but somewhat later the post was made. Sometimes it wouldn’t arrive to everyone. I also noticed that checking the “preview” would load a long time and throw the same error, but after trying a couple of times, it worked and posting also worked. Today I decided to look somewhat deeper into it.

I had already seen that a lot of IO to the disk happened when trying this, so I expected some heavy query running. I checked what queries were running using

-- Currently running queries
-- https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW
select now() - query_start, psa.query, psa.* FROM pg_catalog.pg_stat_activity psa 
where psa.datname = 'pleroma_ynh' -- my DB name
and psa.application_name not like 'DBeaver%' -- ignore queries from the client I use 
and psa.query not like 'LISTEN%'
order by query_start
;

One query I saw was on oban_jobs.

SELECT o0."id", o0."state", o0."queue", o0."worker", o0."args", o0."meta", o0."tags", o0."errors", o0."attempt", o0."attempted_by", o0."max_attempts", o0."priority", o0."attempted_at", o0."cancelled_at", o0."completed_at", o0."discarded_at", o0."inserted_at", o0."scheduled_at"
FROM "oban_jobs" AS o0
WHERE (o0."state" = 'scheduled')
AND (o0."queue" = 'activity_expiration')
AND (o0."args"->>'activity_id' = '$1')
;

According to the statistics there’s about 7.000.000 records in there!

SELECT reltuples::bigint AS estimate FROM pg_class where relname='oban_jobs';
-- 7856348|

Most are completed or discarded.

select state, count(*) FROM oban_jobs
group by state;
/*
state    |count  |
---------+-------+
available|    137|
scheduled|  13623|
executing|    457|
retryable| 410436|
completed|5633742|
discarded|1798991|
cancelled|      1|
*/

So I wondered if these can’t be removed and if we shouldn’t do this by default. I quickly found Oban.Plugins.Pruner — Oban v2.15.2 (I’m on Akkoma 3.12.0 and in mix.exs I see that Oban is version 2.15.2).

There I see there’s a setting to activate Oban.Plugins.Pruner, but checking config.ex, it’s enabled akkoma/config/config.exs at develop - AkkomaGang/akkoma - Akkoma Development. And I’m not overriding it in neither prod.exs nor prod.secret.exs. I also don’t see a key in the database configuration with oban (select * from config).

So afaict, it should prune ‘completed’, ‘cancelled’ and ‘discarded’ jobs, but it seems it hasn’t for my instance since 2023-04-15, a bit over a year ago.

select min(completed_at) FROM oban_jobs
where state in ('completed','cancelled','discarded');
/*
min                    |
-----------------------+
2023-04-15 15:17:52.591|
*/

For now, I’ll try deleting them manually and see if this helps me. But the question does remain why it doesn’t prune, and also if other people maybe also have this problem?
select state, count(*) FROM oban_jobs group by state; should give no or barely any results for ‘completed’, ‘cancelled’ or ‘discarded’.

ilja · 18 May 2024 14:44

The ‘completed’, ‘cancelled’ and ‘discarded’ are deleted. At first glance, the problem is still there, but it seems better? (I had to do the preview thing multiple times before it worked faster, now it seems that 2 or 3 times was enough). But I still have 424.997 records. A bunch are scheduled in the past, some have state “executing” but have this since 2022… I’ll try to delete everything with a last attempt before the beginning of the year and see what happens (yolo I guess, but I do have a backup in case it’s required) delete FROM oban_jobs where attempted_at < DATE('2024-01-01');.

I gotta be honest though, I wouldn’t be surprised if the hardware simply can’t keep up any more with the total DB size and this is in part what causes issues.

[1] For those interested; Here’s what I did now for deleting the things the Oban prune job normally does automatically (after stopping Akkoma):

First make a backup table just in case

-- CREATE TABLE oban_jobs_pre_202405181356 AS 
TABLE oban_jobs;
/*
Updated Rows  7857599
Query CREATE TABLE oban_jobs_pre_202405181356 AS 
  TABLE oban_jobs
Start time  Sat May 18 13:57:05 CEST 2024
Finish time Sat May 18 14:05:21 CEST 2024
*/

Then delete the jobs who should be deleted by Oban prune

delete from oban_jobs
where state in ('completed','cancelled','discarded');
/*
Updated Rows  7432923
Query delete from oban_jobs
  where state in ('completed','cancelled','discarded')
Start time  Sat May 18 14:09:42 CEST 2024
Finish time Sat May 18 14:19:14 CEST 2024
*/

select state, count(*) FROM oban_jobs
group by state;
/*
state    |count |
---------+------+
available|   137|
scheduled| 13623|
executing|   459|
retryable|410457|
*/

Reindex bc the index still returned the previous number of rows.
mix pleroma.database vacuum analyze

SELECT reltuples::bigint AS estimate FROM pg_class where relname='oban_jobs';
-- 425770

And then restart Akkoma.

norm · 18 May 2024 15:02

I checked my database and it seems like it’s working fine on my instance.

SELECT reltuples::bigint AS estimate FROM pg_class where relname='oban_jobs';
/*
estimate 
----------
     3645
(1 row)
*/

select state, count(*) FROM oban_jobs
group by state;
/*
   state   | count 
-----------+-------
 available |    93
 scheduled |  2130
 executing |   493
 retryable |    85
 completed |    99
 discarded |    50
(6 rows)
*/

I’m on 3.13.1.

ilja · 20 May 2024 12:39

I did some more on this. My instance is running, so good. But the Oban pruner still doesn’t seem to work, and I still don’t understand why (yet?), so let me share what I’ve learned so far in case it can be useful to someone else.

Oban has different queues, they are defined in the config.exs. The ones who are defined in config, should run. With this query you can find them in the database with how many pending jobs there are

select queue, count(*) from oban_jobs
where state not in ('completed','cancelled','discarded')
group by queue;

I had jobs who should’ve run ages ago. You can force a queue by going into the iex where Akkoma is running. To get this shell for from source, you can start using iex -S mix phx.server in the Akkoma folder (I think OTP has a way to get it on a running instance, but I don’t know it). Then in that iex shell, to run everything in e.g. the activity_expiration queue, Oban.drain_queue(queue: :activity_expiration, with_scheduled: true, with_limit: 1, with_recursion: true) (see Oban — Oban v2.15.2). I also increased the timeout for some jobs, check config.exs for the settings.

After draining the queues I wanted to drain, I cleared the table with sql truncate oban_jobs and restarted.

The Oban.Plugins.Pruner should run every 30 seconds and remove ‘completed’, ‘cancelled’ and ‘discarded’ older than 60 seconds (see Oban.Plugins.Pruner — Oban v2.15.2), but it doesn’t seem like it’s actually working on my instance bc the completed jobs just keep increasing. I’m not sure why, but the docs do mention that the jobs are best effort and old jobs may not be pruned immediately (see Oban — Oban v2.15.2 ).

There was the idea that it’s maybe a timeout on the database causing problems, and I did see a “connection timed out” error, so I increased the timeout for now from 15s to 60s (based on disconnected: ** (DBConnection.ConnectionError) client #PID<0.368.0> timed out because it checked out the connection for longer than 15000ms · Issue #1658 · elixir-ecto/ecto · GitHub). But this doesn’t seem to help for the pruning.

config :pleroma, Pleroma.Repo,
  timeout: 60_000

That’s it. The instance is currently running, it’s just that this doesn’t seem OK and I want to share what I’ve got in case its useful for others. The weekend is over, so for now I’ll leave it be and we’ll see if I come back to it.

FloatingGhost · 20 May 2024 15:31

my only other theory for why the pruner might not be running would be maybe you’ve got some override somewhere (configdb?) that is somehow changing the plugins setting

not sure if you can do that via adminfe but maybe if it wasn’t enabled before the setting got moved to configdb and sorta lives there?

Oneric · 20 May 2024 16:16

Looking at the plugin source nothing stands out in particular, other than a check for cluster leadership which if failed prevents any pruning. Is your DB clustered?

If it persists, copying upstream’s code into a custom module, adding some debug logs and using it instead of upstream’s pruner might give more insight. If nothing else it should reveal whether it runs at all

ilja · 22 June 2024 17:06

It’s working again! (Also, I may be spamming your instance with month-old fetch-requests)

I was able to make some time to work on my instance again today. I decided to first update and then maybe check this. After the update, I restarted my instance and saw the following lines in the error logs (not sure why I didn’t see/notice it last time)

[error] The `oban_peers` table is undefined and leadership is disabled.
Run migrations up to v11 to restore peer leadership. In the meantime, distributed plugins
(e.g. Cron, Pruner) will not run on any nodes.

This is interesting, because this can explain the symptoms I was seeing. Apparently some table is missing, and it needs to get there through a migration. I checked the database and the table was indeed missing. The migration, however, is part of Akkoma (20220718102634_upgrade_oban_to_v11.exs) and mix ecto.migrate showed all migrations up. This migration ends with calling Oban migration with version 11 (which is correctly done as it should be), which should correspond to https://github.com/sorentwo/oban/blob/main/lib/oban/migrations/postgres/v11.ex who indeed creates this missing table.

I eventually reran the migration by copying the file, renaming the module name and run mix ecto.migrate again. Now the table is there and after restarting I immediately saw four-month old posts coming in. Querying the oban_jobs table also shows a lot of activity happening, indicating it’s working now.

I have no idea what happened that the migration supposedly ran, while clearly something went wrong. I also can’t really be sure this is the only one. But at least it’s working again. And since no one else seems to have the same problem, it doesn’t seem like a general issue, so I’ll leave it at that.

Thank you for everyone who helped thinking along!