We noticed Payload was struggling to serve images from a media collection: it would take 1-2 minutes for an image to load. The AWS metrics for the Fargate cluster suggest there could be a memory leak somewhere. Restarting the tasks in Fargate resolved the issue.
Versions:
- payload@1.6.6
- @payloadcms/plugin-cloud-storage@1.0.12
- @aws-sdk/client-s3@3.266.1
- @aws-sdk/lib-storage@3.267.0
Interesting - we have never come across this ourselves but it's definitely something we need to look into
is this the only time you've seen it?
and can you trace it back to one action? or be able to reproduce it? like, did someone upload a large file?
This is the only time we have seen it, yeah, and we can't identify anything in particular that caused it. This is from our production instance, and actually no one can access the admin dashboard directly; everything is edited in our staging environment, then the db gets promoted to production through mongodump and mongorestore, and s3 objects get cloned from the staging to the prod buckets.
we've seen this again, and nothing happened on the Payload instances for ~2 days prior
The start of the slope increase is at ~9:30PM on Sunday, and the last action we took that affected the instances was Friday ~5PM, when we promoted our staging db to prod using mongodump/mongorestore.
Everything else works fine, and this seems to only impact the loading of images from our media collection
Hmmm, this is very interesting to me. Can you try and create a reproduction locally? We will get on this immediately but we'll likely need more in terms of reproduction to be able to assist
are you using any type of afterRead hooks or something that would run in production?
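(For context, a collection-level afterRead hook in Payload looks roughly like the sketch below; the collection, field, and logging here are placeholders, not anything from your config:)

import { CollectionConfig } from "payload/types";

// Hypothetical example of a collection-level afterRead hook.
// Hooks like this run on every read of the collection, so a slow or
// leaky one could degrade fetches in production without anything else changing.
const Example: CollectionConfig = {
  slug: "example",
  fields: [{ name: "title", type: "text" }],
  hooks: {
    afterRead: [
      async ({ doc, req }) => {
        // e.g. logging, resolving extra data, rewriting URLs, etc.
        req.payload.logger.info(`read ${doc.id}`);
        return doc;
      },
    ],
  },
};

export default Example;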
Not for media
what about for anything else?
we have some big projects in production on digitalocean droplets, but we have never seen this before
Not for our currently deployed instances; we're developing something that uses one but it's not on prod or staging yet
The really strange thing is that it affects nothing but the retrieval of media from the cms. Data returns are just as fast, but trying to fetch the media slows down dramatically
ok, well, at least we can narrow it down to media
I just found that this package (which we rely on) may be causing a memory leak
i released a fix in 1.6.28
can you try that version to see if this solves your issue?
and followup question for you: are you using the useTempFiles: true option?
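(For reference, that option sits on the root Payload config and is passed through to express-fileupload; a rough sketch, with the temp dir purely as an example:)

import { buildConfig } from "payload/config";

export default buildConfig({
  // ...collections, plugins, etc.
  // These options are handed to express-fileupload. With useTempFiles,
  // incoming uploads are streamed to disk instead of being buffered in memory.
  upload: {
    useTempFiles: true,
    tempFileDir: "/tmp/payload-uploads", // example path only
  },
});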
Awesome news, thanks for looking into it! We have a ticket for upgrades next week so will report back then. There is another issue slowing our upgrade down, though: it appears that some version of payload/plugin-cloud-storage newer than what we currently use no longer handles spaces in media filenames.
no we aren't using useTempFiles: true
Our media collection is very simple:
import { CollectionConfig } from "payload/types";

const Media: CollectionConfig = {
  slug: "media",
  access: {
    read: () => true,
  },
  admin: {
    useAsTitle: "alt",
  },
  fields: [
    {
      name: "alt",
      type: "text",
      required: true,
    },
  ],
  upload: {
    staticURL: "/media",
    staticDir: "media",
    adminThumbnail: "thumbnail",
  },
};

export default Media;
And the plugin config:
import { buildConfig } from "payload/config";
import { fromContainerMetadata } from "@aws-sdk/credential-providers";
import { cloudStorage } from "@payloadcms/plugin-cloud-storage";
import { s3Adapter as payloadS3Adapter } from "@payloadcms/plugin-cloud-storage/s3";

const s3Adapter = payloadS3Adapter({
  config: {
    credentialDefaultProvider: fromContainerMetadata,
  },
  bucket: process.env.PAYLOAD_CMS_S3_BUCKET,
});

export default buildConfig({
  ...
  plugins: [
    cloudStorage({
      collections: {
        // Media is the collection defined above
        [Media.slug]: {
          adapter: process.env.PAYLOAD_CMS_S3_BUCKET ? s3Adapter : null,
        },
      },
    }),
  ],
});
Hmm, unfortunately it seems we're still having this issue
Versions:
"dependencies": {
"@aws-sdk/client-s3": "^3.305.0",
"@aws-sdk/credential-providers": "^3.303.0",
"@aws-sdk/lib-storage": "^3.305.0",
"@payloadcms/plugin-cloud-storage": "^1.0.14",
"express": "^4.18.2",
"payload": "^1.6.30"
},
this still happens and it really is quite bizarre. It is only the image loading that degrades, and media is the only thing that actually "hits" the running Payload instances' API. For everything else, we use a local instance of payload inside our own API container to connect directly to the database for fetching data (roughly the pattern sketched after the list below). This also seems to be exclusive to our production environment. The differences between our envs:
- Staging
  - 1 Fargate container instance
  - 0.25 vCPU
  - 0.5 GB memory
- Prod
  - minimum 2, maximum 20 Fargate container instances
  - 1 vCPU
  - 4 GB memory
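Roughly, that local-instance pattern looks like this (simplified sketch; the env var names and collection are placeholders for whatever our API actually uses):

import payload from "payload";

// Initialize Payload's Local API without an HTTP server and read
// straight from the database. Reads like this never hit the running
// CMS containers, which is why only media fetches go through them.
const run = async () => {
  await payload.init({
    secret: process.env.PAYLOAD_SECRET || "",
    mongoURL: process.env.MONGODB_URI || "",
    local: true, // Local API only, no Express server
  });

  const pages = await payload.find({ collection: "pages", limit: 10 });
  console.log(pages.totalDocs);
};

run();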
You may already know this - so feel free to ignore, but there are ways to trigger heap snapshots for node.js in production. You can then bring them down for analysis in the standalone version of dev tools. Here's an excellent presentation from Matteo and Kent....
https://www.youtube.com/watch?v=vkys6Wk-jYk
https://kentcdodds.com/blog/fixing-a-memory-leak-in-a-production-node-js-app
Also here..
https://nodejs.org/en/docs/guides/diagnostics/memory/using-heap-snapshot
In Chrome...
I think what Kent did was create a custom route that he could 'hit' that would trigger heap snapshots and then he downloaded them and the two of them went through the complete analysis process (you can compare snapshots as well - a baseline, against the increased memory version)
Again - ignore all of this if I'm 'preaching to the choir' ;-)
Also again - I'm sure you know this - but where the snapshot gets written will depend on your Fargate instance config. We use EFS.
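(A minimal sketch of that kind of snapshot route, assuming Express and Node's built-in v8 module; the route path, output directory, and lack of auth are placeholders only:)

import { Router } from "express";
import { writeHeapSnapshot } from "v8";
import path from "path";

// Hit this route to capture a heap snapshot, then pull the file down and
// load it in the standalone DevTools for comparison against a baseline.
// Point SNAPSHOT_DIR at a persistent mount (e.g. EFS) on Fargate.
const SNAPSHOT_DIR = process.env.SNAPSHOT_DIR || "/tmp";

const router = Router();

router.get("/debug/heap-snapshot", (req, res) => {
  // Protect this route in a real deployment; writing a snapshot pauses the process.
  const file = path.join(SNAPSHOT_DIR, `payload-${Date.now()}.heapsnapshot`);
  writeHeapSnapshot(file);
  res.json({ written: file });
});

export default router;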
Awesome! I didn't actually know about the heap stuff, will look into how we can incorporate this so we can investigate the issue
Thank you
Matteo is a member of the Node.js Technical Steering Committee. He really knows his stuff.
Good luck!