
Optimizing LLM Deployment with Llama 3.2 and Denvr Cloud
Deploying large language models (LLMs) for real-time applications, such as conversational AI, has become a key component for businesses aiming to automate interactions and enhance customer experiences. However, running these models on edge devices or low-capacity compute environments presents a challenge. In this post, we’ll explore how Llama 3.2—a refined and optimized version of its predecessors—bridges this gap, allowing powerful AI chatbots to operate on both high-end and constrained hardware setups. We’ll also show why using Denvr Cloud’s NVIDIA A100 GPUs offers an efficient solution for training and scaling these models, reducing costs while maximizing performance.
What is Llama 3.2 and Why Does it Matter?
Evolution from Llama 3.1 to Llama 3.2
Llama 3.2 is a newer iteration designed to be more lightweight and efficient compared to its predecessor, Llama 3.1. While Llama 3.1 laid the groundwork for high-quality language generation with strong contextual understanding, it was often computationally heavy, making it less practical for edge deployments. Llama 3.2 addresses this challenge, offering the performance of a robust language model while being optimized for edge and constrained environments.
Key Improvements in Llama 3.2 Over Llama 3.1:
Reduced Model Size: Llama 3.2 has been optimized to have a smaller footprint, making it easier to deploy on devices with limited memory (e.g., 4GB to 8GB RAM).
Faster Inference: By refining the internal architecture, Llama 3.2 performs inference faster, reducing the latency between a user’s query and the model’s response. This is crucial for real-time applications.
Lower Power Consumption: With efficiency improvements, Llama 3.2 consumes less power per inference, making it suitable for battery-operated or power-constrained devices like Raspberry Pi or Jetson Nano.
Enhanced Edge Compatibility: The model is optimized to run not just on high-end GPUs, but also on devices with minimal hardware acceleration (e.g., NVIDIA Jetson series, ARM CPUs).
The Business Case for Llama 3.2 at the Edge
Deploying AI models at the edge has distinct advantages, especially in scenarios where latency, data privacy, and connectivity are critical. Traditional LLMs like GPT-3 or even Llama 3.1 require substantial hardware resources, making them impractical for on-device deployments. Llama 3.2, however, can lead to more responsive applications, cost reductions, and greater control over data by providing:
Efficient Memory Utilization: It fits within the RAM constraints of most modern edge devices without significant performance degradation.
Real-Time Interactivity: Reduced latency ensures the model can deliver responses quickly, making it suitable for interactive applications.
Cost Efficiency: The ability to deploy on low-cost devices means reduced reliance on expensive cloud infrastructure, which can result in significant cost savings.
In essence, Llama 3.2 enables developers to run sophisticated AI applications outside traditional data centers, opening the door for real-time, on-premise solutions in industrial IoT, robotics, smart home devices, and more. This results in better customer experiences and lower operational costs.
Why Deploy Llama 3.2 on Denvr Cloud?
While Llama 3.2 is designed for edge use, training and initial deployment still require substantial compute power. This is where Denvr Cloud comes into play. Denvr Cloud offers specialized infrastructure for AI and ML workloads, leveraging high-performance NVIDIA GPUs such as the A100 and H100 series. This makes it the ideal solution for businesses looking to deploy, test, and fine-tune AI applications seamlessly, all without the complexities of managing on-premise infrastructure.
Why use NVIDIA A100 GPUs on Denvr Cloud?
For this project, we have opted to use NVIDIA A100 40GB cards. Here’s why:
Cost Efficiency: The A100 GPUs offer a more cost-effective solution while still delivering top-tier performance. When you don’t need the absolute latest technology, A100 GPUs provide a compelling balance between cost and compute capability.
Memory Capacity: With 40GB of VRAM, the A100s can handle large batch sizes and datasets, making them ideal for training LLMs like Llama 3.2 without running into memory limitations.
Compatibility: The A100 has a well-established software stack and support ecosystem, ensuring that models and libraries are highly optimized for this hardware.
Why Not Use H100 GPUs?
While the H100 GPUs are known for their high throughput and advanced features, such as enhanced tensor cores for FP8 precision, these benefits are most impactful when working with larger models or specialized tasks (e.g., training a 100B-parameter model). For Llama 3.2, which is designed to be a compact and efficient model, businesses can achieve their goals without overspending on compute resources they don’t need.
A Hybrid Deployment: Cloud for Training, Edge for Inference
Businesses looking to optimize both performance and cost should consider a hybrid approach. Use Denvr Cloud’s powerful NVIDIA A100 GPUs for training and fine-tuning Llama 3.2, then deploy the model on edge devices for real-time local inference:
Cloud for Training and Model Tuning: Use Denvr Cloud’s high-capacity GPUs to train and fine-tune models. The A100s can easily accommodate Llama 3.2’s training requirements, making cloud-based training efficient and fast.
Edge for Inference: Once the model is trained, deploy it on edge devices like the Raspberry Pi or NVIDIA Jetson Nano for real-time, local inference, minimizing latency and dependency on cloud resources.
This hybrid setup offers the best of both worlds: scalable training in the cloud and efficient deployment at the edge.
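One practical benefit of this split is that the inference client barely changes between environments. As an illustrative sketch (assuming the model is served over an HTTP API such as Ollama’s, which is set up later in this post), the same Python call works whether the host is a cloud VM or an edge device:

# Illustrative only: the same client targets cloud or edge by swapping the host.
import requests

def ask(host: str, prompt: str) -> str:
    """Send a single prompt to an Ollama server listening on `host`."""
    resp = requests.post(
        f"http://{host}:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Point at the cloud VM during development, then at the edge device in production.
print(ask("<cloud_vm_ip>", "Summarize edge AI in one sentence."))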
Deploying the Streamlit Chatbot on Denvr Cloud
In this example, we’ve chosen Streamlit to deploy the Llama 3.2 chatbot because of its simplicity and flexibility in building and sharing data apps with minimal coding overhead. This approach allows your team to quickly deploy the model and focus on enhancing customer interactions, while taking advantage of a cloud infrastructure designed specifically for AI workloads.
Sign up for Denvr Cloud
Launch the VM (single A100 40GB card)

Select the Pre-installed package option

Connect to the VM (replace the IP below with your VM’s public address)
ssh ubuntu@130.250.171.43
Update APT repos and install pip
sudo apt update -y
sudo apt install python3-pip -y
Set Up VM and Repo
Set up a virtual environment (if required)
sudo apt install python3-virtualenv -y
virtualenv venv
source venv/bin/activate
Clone the repo
git clone https://github.com/denvrdata/examples.git
cd examples/simple-streamlit-chatbot
pip3 install -r requirements.txt
Install Ollama and serve the Llama 3.2 model
ubuntu@llama3-chatbot:~/$ sudo docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
6414378b6477: Pull complete
565ec4b441fe: Pull complete
f159bfc8d800: Pull complete
f9aa9bba7fad: Pull complete
450c78a4a605: Pull complete
d7e3ad6c5000: Pull complete
9901a35ab7cb: Pull complete
e961bfcbe854: Pull complete
Digest: sha256:e458178cf2c114a22e1fe954dd9a92c785d1be686578a6c073a60cf259875470
Status: Downloaded newer image for ollama/ollama:latest
d959caa8411f4edfc9ec3b783a2b7b1de5a5e9ac6113e61148596745d31a4325
ubuntu@llama3-chatbot:~/$ sudo docker exec -it ollama ollama run llama3.2
pulling manifest
pulling dde5aa3fc5ff... 100% ██████████████████▏ 2.0 GB
pulling 966de95ca8a6... 100% ██████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ██████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ██████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ██████████████████▏ 96 B
pulling 34bb5ab01051... 100% ██████████████████▏ 561 B
verifying sha256 digest
writing manifest
success
>>> How are you Llama
I'm doing well, thank you for asking! I'm a large language model, so I don't have feelings or emotions like humans do, but I'm always happy to chat with you and help with any questions or topics you'd like to discuss.
I've been trained on a massive dataset of text from the internet and can generate human-like responses to a wide range of questions and topics. I'm constantly learning and improving my abilities, so please bear with me if I make any mistakes or don't quite understand what you're asking.
What about you? How's your day going? Is there anything on your mind that you'd like to talk about? I'm all ears (or rather, all text)!
>>> Send a message (/? for help)
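Before wiring up the front end, it is worth confirming that the container answers API requests from outside. Here is a minimal smoke test against Ollama’s /api/generate endpoint (a sketch assuming Python 3 and the requests package are installed on the VM):

# Smoke-test the Ollama server running inside the container.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Say hello in five words.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # the model's reply as plain text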
Set Up Environment Variables
Configure the Ollama server IP address using an environment variable. This ensures that the application knows where to send the requests.
For Windows (PowerShell):
$env:OLLAMA_IP = "<your_ollama_ip>"
For Linux / Mac:
export OLLAMA_IP=<your_ollama_ip>
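To see how the app consumes this variable, here is a minimal sketch of a Streamlit chat front end in the spirit of Chatbot.py (illustrative only; the actual implementation lives in the repository). It reads OLLAMA_IP from the environment and forwards the running conversation to Ollama’s /api/chat endpoint:

# Minimal Streamlit chat front end (sketch; not the repository's Chatbot.py).
import os

import requests
import streamlit as st

OLLAMA_IP = os.environ.get("OLLAMA_IP", "localhost")

st.title("Llama 3.2 Chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    st.chat_message(msg["role"]).write(msg["content"])

if prompt := st.chat_input("Ask me anything"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.chat_message("user").write(prompt)

    # Send the full history to the Ollama backend.
    resp = requests.post(
        f"http://{OLLAMA_IP}:11434/api/chat",
        json={"model": "llama3.2", "messages": st.session_state.messages, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["message"]["content"]

    st.session_state.messages.append({"role": "assistant", "content": answer})
    st.chat_message("assistant").write(answer)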
Run the Streamlit App
Once the environment is set up, run the Streamlit app using the following command:
streamlit run Chatbot.py
Open the Application in Your Browser
After starting the app, Streamlit will output a local URL, usually something like:
Local URL: http://localhost:8501
Because the app runs on the cloud VM, use the Network URL or External URL that Streamlit also prints (or forward port 8501 over SSH) to open it from your own machine.
Troubleshooting
Environment Variable Not Set: If the application is not connecting to the Ollama backend, ensure that the OLLAMA_IP variable is set correctly and that the server is reachable.
CORS Issues: If running the backend on a different machine, make sure CORS settings allow connections from the Streamlit frontend.
Port Issues: Ensure that port 11434 (or the port you specified) is open and accessible.
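A quick way to test reachability is to list the models the server has installed via Ollama’s /api/tags endpoint (a minimal sketch assuming the requests package):

# Reachability check for the Ollama backend.
import os

import requests

ip = os.environ.get("OLLAMA_IP", "localhost")
resp = requests.get(f"http://{ip}:11434/api/tags", timeout=5)
print(resp.status_code)  # expect 200
print([m["name"] for m in resp.json().get("models", [])])  # e.g. ['llama3.2:latest']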
Key Takeaways for Llama 3.2 on Denvr Cloud
Deploying Llama 3.2 on edge devices allows businesses to deliver real-time, on-premise AI solutions, reducing reliance on cloud services. This approach improves efficiency and reduces costs, while maintaining high responsiveness for customer-facing applications. By leveraging Denvr Cloud’s scalable infrastructure, businesses can streamline model training in the cloud and deploy the resulting models at the edge, creating a balanced hybrid environment.
For developers and teams aiming to optimize their AI deployments, Denvr Cloud provides the tools and infrastructure to simplify the process while managing costs effectively.
Get started with Llama 3.2 on Denvr Cloud today.
Follow the steps outlined above or visit our GitHub repository at https://github.com/denvrdata/examples.git to explore deployment options.