How to Host a Retrieval Augmented Generation (RAG) Pipeline for LLMs on Denvr Cloud: A Step-by-Step Guide

Sep 10, 2024

7 min read

616

Over the past few years, tools like ChatGPT and Copilot have made LLMs a part of our daily lives, appearing in everything from email clients to code editors.

Despite their reach, LLMs still have notable limitations. As probabilistic models trained on finite datasets, they exhibit the following weaknesses:

Overly General Responses - Vague or non-specific answers due to broad training data
Outdated Knowledge - Training data is static, representing past information.
Lack of Verifiable Sources - Responses lack citations, making it difficult to verify accuracy.
Hallucinations - Occasionally generate plausible-sounding but incorrect or fabricated information.

One mitigation strategy is Retrieval Augmented Generation (RAG) pipelines which pairs LLMs with a search component and prompt engineering. The diagram below covers one of the more common workflows, incorporating an embedding model and a vector database. Conceptually, you just need a:

Document Store - Searchable with the input prompt
Query Augmentation Step - Incorporate that relevant data/context into the original prompt you’re sending to the LLM.

Hosting RAG pipeline on Denvr Cloud, LLM retrieval augmentation, Cloud computing for LLMs, RAG architecture on Denvr, AI pipeline cloud deployment, Optimizing RAG for LLM

In our diagram, we are:

Populating a vector database by:
1. Parsing the raw documents (e.g., doc, pdf, html)
2. Sending the raw text through a small embedding model
3. Inserting the embedded text into the database
Running our original prompt through the same embedding model and using the result to query the database for relevant stored context.
Augmenting the prompt with the most relevant context chunks and sending that new prompt to the LLM.

A nice feature of this approach is that we can experiment with different independent search and prompt augmentation strategies. Similarly, you can also experiment with different LLMs, embedding models and reference datasets.

How to Host a Retrieval Augmented Generation (RAG) Pipeline for LLMs on Denvr Cloud: A Step-by-Step Guide

Accounts

NVIDIA (Required)

Since we're using an NVIDIA NIM container for our backend / foundation LLM, we'll need an NVIDIA account. We'll assume you don't already have access to NVIDIA enterprise and would like to sign up for a free developer account.

Once you've signed up, make sure to wait for an email from account@nvidia.com with the subject "NVIDIA Account Created". You'll need to verify your email and accept the terms before any personal access tokens you generate will be usable.

Denvr (Optional)

Technically, you just need access to a Linux box with the NVIDIA GPU(s), drivers and container runtime. That being said, if you're gonna use a cloud provider anyway, why not Denvr? :)

You can register for an account with Denvr directly from our console. For a free trial, please contact our sales team.

Provisioning

NGC Keys

To pull down an NVIDIA NIM container, you'll need to log into your developer account and navigate to the NGC setup page.

When you hit "Generate Personal Key", you'll need to give your key a name and expiration. At a minimum, you should authorize access to the "NGC Catalog".

You'll also want to save the token `nvapi-****` somewhere safe, like in a password manager.

Denvr VM

After logging into Denvr's console, navigate to the 'Virtual Machines' tab using the left navigation menu and hit "Create Virtual Machine".

We'll start by naming our VM and selecting the instance type we want from either the on-demand or reserved pools.

Next, we'll select the OS and decide whether we want the NVIDIA drivers and Docker container runtime environment preinstalled (recommended).

We'll also specify any NFS shares (personal or shared) to mount in the VM. Finally, we'll provide our SSH public key for access to the VM.

Hit "Launch Instance" and wait for the machine to come "ONLINE".

Docker

Now that we have our NGC key and Denvr VM, we'll SSH into our machine.

> ssh ubuntu@<public_ip>

We'll clone this demo repo from this machine and run the config.sh script.

> git clone -b rf/yara https://github.com/denvrdata/denvrdemos.git

> cd denvrdemos/yara

> bash config.sh
Enter your NGC API Key (nvapi-****): nvapi-***************************************************************
Writing key to .config/ngc-api-key
Writing key to docker environment variable in .config/nim.env
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    15  100    15    0     0    354      0 --:--:-- --:--:-- --:--:--   357
Writing .config/caddy/Caddyfile
198.145.127.121.nip.io {
    reverse_proxy webui:8080
}
HTML docs already found. Skipping download.
Logging into nvcr.io
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credential-stores

Login Succeeded
Pulling down docker images
[+] Pulling 49/49
 ✔ webui Pulled                                                                                                                78.4s
   ...
 ✔ caddy Pulled                                                                                                                 1.6s
   ...
 ✔ nim Pulled                                                                                                                 168.6s
   ...

real    2m49.063s
user    0m0.044s
sys     0m0.329s
Starting docker services
[+] Running 4/4
 ✔ Network yara_default    Created                                                                                              0.1s
 ✔ Container yara-nim-1    Healthy                                                                                            101.8s
 ✔ Container yara-webui-1  Healthy                                                                                            109.8s
 ✔ Container yara-caddy-1  Started                                                                                            110.0s

real    1m58.309s
user    0m0.024s
sys     0m0.024s
Configuration complete. Open 198.145.127.121.nip.io in your browser.

NOTE - We've already provided a copy of Denvr's public docs used in the webinar in `data/webui/docs`, but feel free to remove these and add your own.

The command used to download the .html files is provided below for reference.

cd data/webui/docs
wget -q https://docs.denvrdata.com/docs/sitemap.xml --output-document - | grep -E -o "https://docs\.denvrdata\.com[^<]+" | wget -q -E -i - --wait 0

Open WebUI should be able to parse standard file formats like .txt, .html and .pdf files.

Configuration

During our webinar, we walked you through our preconfigured Open WebUI container. In this section, we'll show you how to configure an Open WebUI RAG pipeline for yourself. Feel free to play with the system prompts, RAG Templates or reference documents as we work through this section.

Authentication

If you haven't uncommented the line WEBUI_AUTH=False in the .config/webui.env file, you'll be prompted to create the initial admin account. For this example, we'll stick to simple email/password authentication.

You can use the "Admin Panel" to add your friends and coworkers to your server.

From the same Admin Panel, navigate to the 'settings' tab and hit "Connections".

We'll see that our OpenAI API endpoint is pointed to http://nim:8000/v1. The password doesn't matter, but Open WebUI requires it.

Documents

As previously mentioned, we've already provided our public Denvr docs at /data/docs inside the container. Before our AI assistant can reference these documents, we must tell Open WebUI to scan them and store the embeddings in a vector database. Thankfully, Open WebUI already comes with a default embedding model and a vector database. From the 'settings' tab shown earlier, navigate to the "Documents" page.

On this page, we'll just hit the "Scan" button and replace the 'RAG Template' with the following:

**Generate Response to User Query**
**Step 1: Parse Context Information**
Extract and utilize relevant knowledge from the provided context within `<context></context>` XML tags.
**Step 2: Analyze User Query**
Carefully read and comprehend the user's query, pinpointing the key concepts, entities, and intent behind the question.
**Step 3: Determine Response**
If the answer to the user's query can be directly inferred from the context information, provide a concise and accurate response in the same language as the user's query.
**Step 4: Handle Uncertainty**
If you don't know the answer, simply state that you don't know. If the answer is not clear, ask the user for clarification to ensure an accurate response.
**Step 5: Respond in User's Language**
Maintain consistency by ensuring the response is in the same language as the user's query.
**Step 6: Provide Response**
Generate a clear, concise, and informative response to the user's query, adhering to the guidelines outlined above.
User Query: [query]
<context>
    [context]
</context>

This template is almost identical to the one used here. We just added the following guard.

If you don't know the answer, simply state that you don't know

Hit the "Save" button at the bottom and navigate to the the 'Workspace' seen in the left sidebar.

YARA Model

From the 'Workspace', you should see the base model configuration:

From here, select "Create a model"

Fill out the following fields:

Name: YARA
Model ID: yara
Base Model (From): meta/llama3-8b-instruct
Description: Yet Another RAG Assistant

Model Params - System Prompt

You are a friendly AI assistant for onboarding new Denvr Dataworks employees. Use a conversational tone and provide helpful and informative responses, utilizing external knowledge when possible.

Under knowledge hit "Select Documents" and select 'COLLECTION - All Documents'

Then hit "Save & Create" at the bottom.

Usage

Let's play around with some prompts on both the base Llama 3 model and our YARA configuration.

We'll start by selecting the base Llama model from the "Workspace" window and give it an easy question.

What is a Large Language Model?

As expected, the base Llama 3 model gives us reasonable output.

What if we ask it a question about Denvr Dataworks network bandwidth?

What network bandwidth does Denvr Dataworks offer?

Unfortunately, the base Llama model knows nothing about Denvr Dataworks.

In this case, the LLM just gave us a generic response that isn't grounded in any particular documentation on our site.

What happens if we start another chat with our YARA model and ask it this same question?

As we can see, the response from the YARA model utilizes specific values from our documentation. For example, it correctly identifies the different network speeds like inter-cluster communication, InfiniBand, and shared Internet access by region. The values are also clearly pulled from the documents we provided. However, there are still some issues as this is a generative model. For example, it would be more helpful if it differentiated the network speeds offered in each region (e.g., MSC1 vs HOU1)

To ensure that wasn't a fluke, what if we ask the YARA model about Denvr's instance types?

What instance types does Denvr Dataworks offer?

Again, we get accurate responses, though more detail on which region each instance type is offered in might be helpful.

Feel free to play around with different prompts and settings to see how it changes the output.

Conclusion

In this guide, we reviewed what RAG pipelines are and how they can help you build low-cost and personalized chat tools.

We also stepped through deploying a simple RAG application on Denvr Cloud using the following:

NVIDIA NIM - for our inference backend
Open WebUI - for our chat interface and RAG pipeline
Caddy + nim.io - for automatic HTTPS encryption and a domain name

We also included instructions for adding your own documents.

Links

Denvr:

NVIDIA:

Caddy:

Posts

Rory Finnegan