I am trying to run Llama 2 on an EC2 instance with a T4 GPU and I'm encountering errors. The GPU has about 14 GB of usable memory. What is the best way to deploy Llama 2 on AWS without running into memory constraints and issues?
If you are running a quantized version of Llama 2, it should fit on that EC2 instance. However, if you are trying to run the 16-bit or 32-bit weights, you will run out of memory: even the smallest 7B model needs roughly 14 GB for the weights alone at 16-bit precision, before accounting for the KV cache and activations.
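A quick back-of-the-envelope check makes this concrete. This is a rough weights-only estimate (the helper function is my own, and it ignores KV cache and activation memory):

```python
def weights_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough memory needed just to hold the model weights, in GB."""
    bytes_per_param = bits_per_param / 8
    return n_params_billion * bytes_per_param  # billions of params * bytes each = GB

# Llama 2 7B at different precisions:
print(weights_memory_gb(7, 32))  # 28.0 GB -> far too big for a 14 GB T4
print(weights_memory_gb(7, 16))  # 14.0 GB -> already fills the whole GPU
print(weights_memory_gb(7, 4))   #  3.5 GB -> 4-bit quantization fits comfortably
```

So a 4-bit quantized 7B model leaves plenty of headroom on a T4, while fp16 weights alone consume everything the GPU has.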