
Apparent memory leak

itsjxck
2 years ago

We noticed Payload was struggling to serve images from a media collection. It would take 1-2 minutes for an image to load. Looking at the AWS metrics for the Fargate cluster indicates there could be a memory leak somewhere. Restarting the tasks in Fargate resolved the issue.



Versions:

- payload@1.6.6
- @payloadcms/plugin-cloud-storage@1.0.12
- @aws-sdk/client-s3@3.266.1
- @aws-sdk/lib-storage@3.267.0

  • jmikrut
    2 years ago

    Interesting - we have never come across this ourselves but it's definitely something we need to look into



    is this the only time you've seen it?



    and can you trace it back to one action? or be able to reproduce it? like, did someone upload a large file?

  • itsjxck
    2 years ago

    This is the only time we have seen it, yeah, and we can't identify anything in particular that caused it. This is from our production instance, and actually no one can access the admin dashboard directly; everything is edited in our staging environment, then the db gets promoted to production through mongodump and mongorestore, and the S3 objects get cloned from the staging to prod buckets



    @364124941832159242

    we've seen this again, and nothing happened on the Payload instances for ~2 days prior



    The start of the slope increase is at ~9:30PM on Sunday, and the last action we took that affected the instances was Friday ~5PM, when we promoted our staging db to prod using mongodump/mongorestore

    Everything else works fine, and this seems to only impact the loading of images from our media collection

  • jmikrut
    2 years ago

    Hmmm, this is very interesting to me. Can you try and create a reproduction locally? We will get on this immediately but we'll likely need more in terms of reproduction to be able to assist



    are you using any type of afterRead hooks or something that would run in production?
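
For context, a collection-level afterRead hook in Payload is declared on the collection config. A minimal sketch of what such a hook looks like (the collection and field names here are illustrative, not from this project):

    import { CollectionConfig } from "payload/types";
    
    const Posts: CollectionConfig = {
      slug: "posts",
      fields: [{ name: "title", type: "text" }],
      hooks: {
        // Runs every time a document from this collection is read, before it is returned
        afterRead: [
          async ({ doc }) => {
            // transform or enrich the document here
            return doc;
          },
        ],
      },
    };
    
    export default Posts;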

  • itsjxck
    2 years ago

    Not for media

  • jmikrut
    2 years ago

    what about for anything else?



    we have some big projects in production on digitalocean droplets, but we have never seen this before

  • itsjxck
    2 years ago

    Not for our currently deployed instances; we're developing something that uses one but it's not on prod or staging yet



    The really strange thing is that it affects nothing but the retrieval of media from the cms. Data returns are just as fast, but trying to fetch the media slows down dramatically

  • jmikrut
    2 years ago

    ok, well, at least we can narrow it down to media



    https://github.com/nodeca/probe-image-size/issues/78

    I just found that this package (which we rely on) may be causing a memory leak



    i released a fix in 1.6.28



    can you try that version to see if this solves your issue?



    and follow-up question for you: are you using the useTempFiles: true option?
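
For context, useTempFiles is an express-fileupload option; Payload's base upload config is passed through to express-fileupload, so enabling it would look roughly like this (a sketch; the tempFileDir path is illustrative):

    import { buildConfig } from "payload/config";
    
    export default buildConfig({
      // Options here are forwarded to express-fileupload
      upload: {
        useTempFiles: true, // stream uploads to temp files on disk instead of buffering in memory
        tempFileDir: "/tmp", // illustrative; depends on the container filesystem
      },
      collections: [],
    });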

  • itsjxck
    2 years ago

    Awesome news, thanks for looking into it! We have a ticket for upgrades next week, so we'll report back then, but there is another issue slowing down our upgrade process: it appears that versions of payload/payload-cloud-storage newer than what we currently use no longer handle spaces in media filenames



    @364124941832159242

    no, we aren't using useTempFiles: true

    Our media collection is very simple:


    import { CollectionConfig } from "payload/types";
    
    const Media: CollectionConfig = {
      slug: "media",
      access: {
        read: () => true,
      },
      admin: {
        useAsTitle: "alt",
      },
      fields: [
        {
          name: "alt",
          type: "text",
          required: true,
        },
      ],
      upload: {
        staticURL: "/media",
        staticDir: "media",
        adminThumbnail: "thumbnail",
      },
    };
    
    export default Media;


    And the plugin config:


    import { fromContainerMetadata } from "@aws-sdk/credential-providers";
    import { cloudStorage } from "@payloadcms/plugin-cloud-storage";
    import { s3Adapter as payloadS3Adapter } from "@payloadcms/plugin-cloud-storage/s3";
    import { buildConfig } from "payload/config"; // needed for the buildConfig call below
    import Media from "./Media"; // illustrative path to the collection shown above
    
    const s3Adapter = payloadS3Adapter({
      config: {
        credentialDefaultProvider: fromContainerMetadata,
      },
      bucket: process.env.PAYLOAD_CMS_S3_BUCKET,
    });
    
    export default buildConfig({
      ...
      plugins: [
        cloudStorage({
          collections: {
            [Media.slug]: {
              adapter: process.env.PAYLOAD_CMS_S3_BUCKET ? s3Adapter : null,
            },
          },
        }),
      ],
    });


    Hmm, unfortunately it seems we're still having this issue



    Versions:


    "dependencies": {
        "@aws-sdk/client-s3": "^3.305.0",
        "@aws-sdk/credential-providers": "^3.303.0",
        "@aws-sdk/lib-storage": "^3.305.0",
        "@payloadcms/plugin-cloud-storage": "^1.0.14",
        "express": "^4.18.2",
        "payload": "^1.6.30"
      },


    @364124941832159242

    this still happens and it really is quite bizarre. It is only the image loading that degrades, and media is the only thing that actually "hits" the running Payload instances' API. For everything else, we use a local instance of payload inside our own API container to connect directly to the database for fetching data. This also seems to be exclusive to our production environment. The differences between our envs:


    - Staging
      - 1 Fargate container instance
      - 0.25 cpu units
      - 0.5 mem units
    - Prod
      - minimum 2, maximum 20 Fargate container instances
      - 1 cpu unit
      - 4 mem units
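
For reference, the "local instance of payload" pattern described above usually means initializing Payload's Local API inside another service and querying the database directly, roughly like this (a sketch; the collection and environment variable names are illustrative):

    import payload from "payload";
    
    // Initialize Payload in local mode: no HTTP server, direct database access only
    await payload.init({
      secret: process.env.PAYLOAD_SECRET,
      mongoURL: process.env.MONGODB_URI,
      local: true,
    });
    
    // Fetch data without going through the running CMS instances' REST API
    const pages = await payload.find({
      collection: "pages",
      limit: 10,
    });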

  • 58bits
    2 years ago

    You may already know this - so feel free to ignore, but there are ways to trigger heap snapshots for node.js in production. You can then bring them down for analysis in the standalone version of dev tools. Here's an excellent presentation from Matteo and Kent....

    https://www.youtube.com/watch?v=vkys6Wk-jYk

    https://kentcdodds.com/blog/fixing-a-memory-leak-in-a-production-node-js-app

    Also here..

    https://nodejs.org/en/docs/guides/diagnostics/memory/using-heap-snapshot

    In Chrome... (screenshot omitted)


    @93699784942034944

    @364124941832159242

    I think what Kent did was create a custom route that he could 'hit' to trigger heap snapshots, then he downloaded them and the two of them went through the complete analysis process (you can compare snapshots as well: a baseline against the increased-memory version)



    Again - ignore all of this if I'm 'preaching to the choir' ;-)



    Also again - I'm sure you know this - but where the snapshot gets written will depend on your Fargate instance config. We use EFS.
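
A minimal sketch of the snapshot-on-demand approach described above, assuming an Express server (the route path is illustrative, and you would want auth in front of it, since heap snapshots can contain sensitive data):

    import express from "express";
    import v8 from "v8";
    
    const app = express();
    
    // Hitting this route writes a heap snapshot to disk and sends it back for
    // offline analysis in Chrome DevTools. On Fargate the write path needs to be
    // somewhere you can actually retrieve it from (e.g. an EFS mount, as mentioned above).
    app.get("/debug/heap-snapshot", (req, res) => {
      // writeHeapSnapshot is synchronous and blocks the event loop while it runs
      const file = v8.writeHeapSnapshot(`/tmp/${Date.now()}.heapsnapshot`);
      res.download(file);
    });
    
    app.listen(3000);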

  • itsjxck
    2 years ago

    Awesome! I didn't actually know about the heap stuff; I'll look into how we can incorporate this so we can investigate the issue



    Thank you

  • 58bits
    2 years ago

    Matteo is a member of the Node.js Technical Steering Committee. He really knows his stuff.



    Good luck!
