Meet with us at Nvidia GTC in San Jose, CA (March 17 - 21) Click here to book

How to Host a Retrieval Augmented Generation (RAG) Pipeline for LLMs on Denvr Cloud: A Step-by-Step Guide
8
529
0
Over the past few years, tools like ChatGPT and Copilot have made LLMs a part of our daily lives, appearing in everything from email clients to code editors.
Despite their reach, LLMs still have notable limitations. As probabilistic models trained on finite datasets, they exhibit the following weaknesses:
Overly General Responses - Vague or non-specific answers due to broad training data
Outdated Knowledge - Training data is static, representing past information.
Lack of Verifiable Sources - Responses lack citations, making it difficult to verify accuracy.
Hallucinations - Occasionally generate plausible-sounding but incorrect or fabricated information.
One mitigation strategy is Retrieval Augmented Generation (RAG) pipelines which pairs LLMs with a search component and prompt engineering. The diagram below covers one of the more common workflows, incorporating an embedding model and a vector database. Conceptually, you just need a:
Document Store - Searchable with the input prompt
Query Augmentation Step - Incorporate that relevant data/context into the original prompt you’re sending to the LLM.
In our diagram, we are:
Populating a vector database by:
Parsing the raw documents (e.g., doc, pdf, html)
Sending the raw text through a small embedding model
Inserting the embedded text into the database
Running our original prompt through the same embedding model and using the result to query the database for relevant stored context.
Augmenting the prompt with the most relevant context chunks and sending that new prompt to the LLM.
A nice feature of this approach is that we can experiment with different independent search and prompt augmentation strategies. Similarly, you can also experiment with different LLMs, embedding models and reference datasets.
Accounts
NVIDIA (Required)
Since we're using an NVIDIA NIM container for our backend / foundation LLM, we'll need an NVIDIA account. We'll assume you don't already have access to NVIDIA enterprise and would like to sign up for a free developer account.

Once you've signed up, make sure to wait for an email from account@nvidia.com with the subject "NVIDIA Account Created". You'll need to verify your email and accept the terms before any personal access tokens you generate will be usable.
Denvr (Optional)
Technically, you just need access to a Linux box with the NVIDIA GPU(s), drivers and container runtime. That being said, if you're gonna use a cloud provider anyway, why not Denvr? :)
You can register for an account with Denvr directly from our console. For a free trial, please contact our sales team.
Provisioning
NGC Keys
To pull down an NVIDIA NIM container, you'll need to log into your developer account and navigate to the NGC setup page.

When you hit "Generate Personal Key", you'll need to give your key a name and expiration. At a minimum, you should authorize access to the "NGC Catalog".
You'll also want to save the token `nvapi-****` somewhere safe, like in a password manager.
Denvr VM
After logging into Denvr's console, navigate to the 'Virtual Machines' tab using the left navigation menu and hit "Create Virtual Machine".


We'll start by naming our VM and selecting the instance type we want from either the on-demand or reserved pools.

Next, we'll select the OS and decide whether we want the NVIDIA drivers and Docker container runtime environment preinstalled (recommended).
We'll also specify any NFS shares (personal or shared) to mount in the VM. Finally, we'll provide our SSH public key for access to the VM.

Hit "Launch Instance" and wait for the machine to come "ONLINE".

Docker
Now that we have our NGC key and Denvr VM, we'll SSH into our machine.
> ssh ubuntu@<public_ip>
We'll clone this demo repo from this machine and run the config.sh script.
> git clone -b rf/yara https://github.com/denvrdata/denvrdemos.git
> cd denvrdemos/yara
> bash config.sh
Enter your NGC API Key (nvapi-****): nvapi-***************************************************************
Writing key to .config/ngc-api-key
Writing key to docker environment variable in .config/nim.env
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 15 100 15 0 0 354 0 --:--:-- --:--:-- --:--:-- 357
Writing .config/caddy/Caddyfile
198.145.127.121.nip.io {
reverse_proxy webui:8080
}
HTML docs already found. Skipping download.
Logging into nvcr.io
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credential-stores
Login Succeeded
Pulling down docker images
[+] Pulling 49/49
✔ webui Pulled 78.4s
...
✔ caddy Pulled 1.6s
...
✔ nim Pulled 168.6s
...
real 2m49.063s
user 0m0.044s
sys 0m0.329s
Starting docker services
[+] Running 4/4
✔ Network yara_default Created 0.1s
✔ Container yara-nim-1 Healthy 101.8s
✔ Container yara-webui-1 Healthy 109.8s
✔ Container yara-caddy-1 Started 110.0s
real 1m58.309s
user 0m0.024s
sys 0m0.024s
Configuration complete. Open 198.145.127.121.nip.io in your browser.
NOTE - We've already provided a copy of Denvr's public docs used in the webinar in `data/webui/docs`, but feel free to remove these and add your own.
The command used to download the .html files is provided below for reference.
cd data/webui/docs
wget -q https://docs.denvrdata.com/docs/sitemap.xml --output-document - | grep -E -o "https://docs\.denvrdata\.com[^<]+" | wget -q -E -i - --wait 0
Open WebUI should be able to parse standard file formats like .txt, .html and .pdf files.
Configuration
During our webinar, we walked you through our preconfigured Open WebUI container. In this section, we'll show you how to configure an Open WebUI RAG pipeline for yourself. Feel free to play with the system prompts, RAG Templates or reference documents as we work through this section.
Authentication


If you haven't uncommented the line WEBUI_AUTH=False in the .config/webui.env file, you'll be prompted to create the initial admin account. For this example, we'll stick to simple email/password authentication.
You can use the "Admin Panel" to add your friends and coworkers to your server.


From the same Admin Panel, navigate to the 'settings' tab and hit "Connections".

We'll see that our OpenAI API endpoint is pointed to http://nim:8000/v1. The password doesn't matter, but Open WebUI requires it.
Documents
As previously mentioned, we've already provided our public Denvr docs at /data/docs inside the container. Before our AI assistant can reference these documents, we must tell Open WebUI to scan them and store the embeddings in a vector database. Thankfully, Open WebUI already comes with a default embedding model and a vector database. From the 'settings' tab shown earlier, navigate to the "Documents" page.

On this page, we'll just hit the "Scan" button and replace the 'RAG Template' with the following:
**Generate Response to User Query**
**Step 1: Parse Context Information**
Extract and utilize relevant knowledge from the provided context within `<context></context>` XML tags.
**Step 2: Analyze User Query**
Carefully read and comprehend the user's query, pinpointing the key concepts, entities, and intent behind the question.
**Step 3: Determine Response**
If the answer to the user's query can be directly inferred from the context information, provide a concise and accurate response in the same language as the user's query.
**Step 4: Handle Uncertainty**
If you don't know the answer, simply state that you don't know. If the answer is not clear, ask the user for clarification to ensure an accurate response.
**Step 5: Respond in User's Language**
Maintain consistency by ensuring the response is in the same language as the user's query.
**Step 6: Provide Response**
Generate a clear, concise, and informative response to the user's query, adhering to the guidelines outlined above.
User Query: [query]
<context>
[context]
</context>
This template is almost identical to the one used here. We just added the following guard.
If you don't know the answer, simply state that you don't know
Hit the "Save" button at the bottom and navigate to the the 'Workspace' seen in the left sidebar.
YARA Model
From the 'Workspace', you should see the base model configuration:

From here, select "Create a model"

Fill out the following fields:
Name: YARA
Model ID: yara
Base Model (From): meta/llama3-8b-instruct
Description: Yet Another RAG Assistant
Model Params - System Prompt
You are a friendly AI assistant for onboarding new Denvr Dataworks employees. Use a conversational tone and provide helpful and informative responses, utilizing external knowledge when possible.
Under knowledge hit "Select Documents" and select 'COLLECTION - All Documents'
Then hit "Save & Create" at the bottom.
Usage
Let's play around with some prompts on both the base Llama 3 model and our YARA configuration.
We'll start by selecting the base Llama model from the "Workspace" window and give it an easy question.
What is a Large Language Model?

As expected, the base Llama 3 model gives us reasonable output.
What if we ask it a question about Denvr Dataworks network bandwidth?
What network bandwidth does Denvr Dataworks offer?

Unfortunately, the base Llama model knows nothing about Denvr Dataworks.
In this case, the LLM just gave us a generic response that isn't grounded in any particular documentation on our site.
What happens if we start another chat with our YARA model and ask it this same question?

As we can see, the response from the YARA model utilizes specific values from our documentation. For example, it correctly identifies the different network speeds like inter-cluster communication, InfiniBand, and shared Internet access by region. The values are also clearly pulled from the documents we provided. However, there are still some issues as this is a generative model. For example, it would be more helpful if it differentiated the network speeds offered in each region (e.g., MSC1 vs HOU1)
To ensure that wasn't a fluke, what if we ask the YARA model about Denvr's instance types?
What instance types does Denvr Dataworks offer?

Again, we get accurate responses, though more detail on which region each instance type is offered in might be helpful.
Feel free to play around with different prompts and settings to see how it changes the output.
Conclusion
In this guide, we reviewed what RAG pipelines are and how they can help you build low-cost and personalized chat tools.
We also stepped through deploying a simple RAG application on Denvr Cloud using the following:
NVIDIA NIM - for our inference backend
Open WebUI - for our chat interface and RAG pipeline
Caddy + nim.io - for automatic HTTPS encryption and a domain name
We also included instructions for adding your own documents.
Links
Denvr:
NVIDIA:
Caddy: