Friday, April 15, 2022

[SOLVED] PyTorch model takes too long to load the first time on a new machine

Issue

I have a manual scaling setup on EC2 where I create instances from an AMI that already runs my code at boot (using systemd). I'm facing a fundamental problem: on the main instance (the one I use to create the AMI), the Python code takes 8 seconds to be ready after the instance boots. This includes importing libraries, loading the models' state dicts, and so on. On instances launched from the AMI, however, the code takes 5+ minutes to start the first time, and loading the state dicts from disk into GPU memory is especially slow. After that first run, the code takes about as long to load as it does on the main instance.

The AMI keeps the same __pycache__ folders as the main instance, so I didn't expect it to take that long, since I assumed the AMI includes everything. So, my question is: is there any other caching that makes CUDA / Python faster that I'm not taking into account? I'm only keeping the __pycache__/ folders, and I don't know what else I could do to make sure the first start-up doesn't take this long. This is my main structure:

# Import libraries
import torch
import numpy as np

# Import personal models (takes 1 minute)
from model1 import model1
from model2 import model2

# Instantiate both models
model1_object = model1()
model2_object = model2()

# Load state dicts (takes 3+ minutes the first time on a new instance, seconds afterwards)
# Note: the models are a bit heavy
model1_object.load_state_dict(torch.load("model1.pth"))
model2_object.load_state_dict(torch.load("model2.pth"))
 

Note: I'm using g4dn.xlarge instances, for both the main instance and for newer ones in AWS.


Solution

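This was caused by the high latency of EBS volumes restored from snapshots. When a volume is created from a snapshot (which is what happens when an instance is launched from an AMI), its blocks are loaded lazily from the snapshot, so the first read of each block is extremely slow. That explains why the models take so long to load on a freshly created instance in my example, and why later loads are fast.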
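
One rough way to check whether disk reads are the bottleneck is to time a plain sequential read of the checkpoint files on a fresh instance, with PyTorch out of the picture. The following is a minimal sketch under stated assumptions: the helper name measure_read and the 1 MiB chunk size are illustrative choices, and the file names are the ones from the question.

```python
import time
from pathlib import Path


def measure_read(path, chunk_size=1024 * 1024):
    """Read a file sequentially; return (bytes_read, seconds, MiB_per_second).

    measure_read is a hypothetical helper for illustration only.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so each read() goes to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


if __name__ == "__main__":
    # File names taken from the question; adjust to your own checkpoints.
    for name in ("model1.pth", "model2.pth"):
        if Path(name).exists():
            size, seconds, rate = measure_read(name)
            print(f"{name}: {size} bytes in {seconds:.2f} s ({rate:.1f} MiB/s)")
```

If the measured throughput on a fresh instance is far below what the same files reach on the main instance, slow first-touch reads from the volume are a likely explanation. Note that a second run on the same instance will usually be served from the operating system's page cache, so only the first run is informative.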

Check the initialization section of this article: https://cloudonaut.io/ebs-snapshot-pitfalls/

The only way I've found to get full performance from an instance as soon as it is created is to enable Fast Snapshot Restore, which costs around $500 a month: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-fast-snapshot-restore.html

If you have time to spare, you can wait until the volume reaches its maximum performance, or warm the volume up beforehand by reading its blocks: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html
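
If only a handful of files matter at start-up (the model checkpoints, for instance), a cheaper variant of warming is to read just those files once before the service needs them. This is a hedged sketch, not the procedure from the AWS page: warm_file and warm_tree are hypothetical helper names, and the worker count and chunk size are arbitrary assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def warm_file(path, chunk_size=1024 * 1024):
    """Read a file to the end and discard the data; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def warm_tree(root, workers=8):
    """Read every regular file under root in parallel; return total bytes read.

    Hypothetical helper: threads are used because the work is I/O bound.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files so each file is read once.
            if os.path.isfile(full) and not os.path.islink(full):
                paths.append(full)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(warm_file, paths))
```

For example, calling warm_tree on the directory that holds model1.pth and model2.pth from a boot-time script would pull their blocks in before the Python service starts. This only touches the blocks that back those files; the AWS page linked above covers initializing the whole volume at the block-device level.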



Answered By - Marcelo Diaz
Answer Checked By - Candace Johnson (WPSolving Volunteer)