Document Q&A

This guide provides step-by-step instructions to build a Document Q&A app using Data Sources.

This document assumes that you know how to create a Dust app and that you have already set up OpenAI as a provider on your account. If you haven't done so, please refer to our Quickstart guide for step-by-step instructions.

In this guide we will use a Data Source to store and index a set of documents and make their content available to a large language model using semantic search in order to answer questions about them.

The main limitation of LLMs when it comes to answering questions about long documents is that the documents cannot fit in the LLM context size. Data sources allow us to overcome this limitation by splitting documents into smaller chunks and indexing these chunks separately.

Given a question, we can then use semantic search to find the chunks most relevant to the question and feed them to an LLM to generate an answer.
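The retrieval step can be sketched in a few lines of JavaScript. This is an illustrative toy only: the embeddings below are hard-coded 2-dimensional vectors, whereas a real Data Source computes them with a model such as text-embedding-ada-002.

```javascript
// Toy semantic search: rank pre-embedded chunks by cosine similarity
// to a query embedding and keep the top_k most relevant ones.
const cosine = (a, b) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

const topK = (queryEmbedding, chunks, k) =>
  chunks
    .map((c) => ({ ...c, score: cosine(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);

// Two toy chunks; the first is clearly closer to the query embedding.
const chunks = [
  { text: "Sea levels are rising.", embedding: [0.9, 0.1] },
  { text: "Unrelated content.", embedding: [0.1, 0.9] },
];
const results = topK([1, 0], chunks, 1);
```

The same idea scales to thousands of chunks: only the `top_k` highest-scoring ones are passed to the model.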

Creating your first Data Source

We will use a Data Source to automatically chunk, embed and index 4 IPCC reports: AR6-WG1, AR6-WG2, AR6-WG3 and AR6-SYR.

You can create your first Data Source from your Dust home page by selecting the Data Sources tab and clicking on the New DataSource button. You'll need to provide a name and description, as well as select OpenAI's embedding model text-embedding-ada-002. Finally, it is important to pick a max_chunk_size value. We decided to use max_chunk_size=256 (in tokens) for this example: since we plan to use gpt-3.5-turbo, which has a context size of 4096 tokens, this lets us insert 12 chunks in-context while leaving 1024 tokens for generating an answer.

Increasing the max_chunk_size would provide more context per chunk but would also reduce the number of chunks the model can attend to in-context. Reducing it would allow the model to consider more "sources" of information but would reduce the amount of information captured by each chunk. 256 tokens appears to be a good trade-off here, but the right value will depend on your use case and the model you use (as an example, gpt-4 has 8192 tokens of context, so we could use a max_chunk_size of 512).
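The token budget behind these numbers can be checked with a quick computation, using the context sizes stated above:

```javascript
// Context budget for gpt-3.5-turbo (4096-token context): top_k chunks of
// max_chunk_size tokens each, with the remainder left for the instruction,
// the question and the generated answer.
const contextSize = 4096;
const maxChunkSize = 256;
const topK = 12;

const chunkBudget = topK * maxChunkSize;   // tokens spent on retrieved chunks
const remaining = contextSize - chunkBudget; // tokens left over
```

With gpt-4's 8192-token context, doubling max_chunk_size to 512 keeps the same 12-chunk budget while still leaving room for the answer.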

Dust Data Source creation panel

The Dust Data Source creation panel. The name should be short and memorable, lowercase without spaces.

Inserting documents

Once the Data Source is created, you can click on Upload Document to insert documents from the interface. Note that it is also possible to insert Documents by API.

When inserting a Document, a unique document identifier needs to be provided. Free-form tags (on which semantic search can be filtered) can also be provided. Here, the text content is filled by uploading a local ipcc_ar6_wg1.txt file, obtained by extracting the text of the associated IPCC PDF. Insertion acts as an upsert: if a document already exists for the provided identifier, it is updated.
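As an illustration, an upsert by API might look like the sketch below. The endpoint path and payload field names here are indicative only (check the Dust API reference for the exact contract), and `DUST_API_KEY` is a placeholder for your own credentials:

```javascript
// Sketch of a document upsert by API. Re-posting with the same documentId
// acts as an upsert and updates the existing document.
// NOTE: endpoint path and payload fields are illustrative assumptions.
const documentId = "ipcc_ar6_wg1";
const payload = {
  text: "Full text extracted from the IPCC AR6 WG1 PDF...",
  tags: ["ar6", "wg1"],
};
const url = `https://dust.tt/api/v1/data_sources/ipcc-ar6/documents/${documentId}`;

// Uncomment to actually perform the request:
// fetch(url, {
//   method: "POST",
//   headers: {
//     Authorization: `Bearer ${process.env.DUST_API_KEY}`,
//     "Content-Type": "application/json",
//   },
//   body: JSON.stringify(payload),
// });
```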

Dust Data Source document insertion

The Dust Data Source document insertion interface. Tags are optional and arbitrary. The text content was filled here using the Upload button.

Once the 4 documents are inserted, you should be able to see them in the Data Source interface. You can review the public ipcc-ar6 Data Source as an example.

Dust Data Source document list

The Dust Data Source documents list. You can inspect and edit each of the documents from this interface. The size of each document and the resulting number of chunks that were created, embedded and indexed are also reported.

Now that your Data Source is ready, you can move on to the next section to create a Q&A app that will use it.

Creating a Q&A app

In this section we'll provide step-by-step instructions to create a Q&A app that will use the Data Source to answer user-provided questions.

input block

From your home screen create a new app by clicking the Apps tab and then the New App button.

Assuming you've already covered our Quickstart guide, add an input block and associate it with a new dataset with a few questions that might be relevant to the Data Source:

Dust dev dataset to associate to input block

An example dataset with questions relative to the Data Source we plan to use. Each question is a different input against which we will design our app.

data_source block

Once the input block points to your new dataset you can add a data_source block and select your Data Source. If you want to use the publicly available Data Source ipcc-ar6, edit the user name to spolu and then select ipcc-ar6. You should set top_k to 12, meaning that we'll consume 3072 tokens of context to present chunks to the model.

The query can be set directly to the question from the input using templating with the following value:

{{INPUT.question}}
Dust `data_source` block

The Dust data_source block pointing to the ipcc-ar6 Data Source. The query is set to the input value question field using templating.

At this point you can run your app and introspect the outputs of the data_source block.

chat block

We will use gpt-3.5-turbo to process the chunks and generate an answer to the user.

We use the following instruction in the context of ipcc-ar6:

You are a helpful assistant. For each user question, you'll receive 12 additional system
messages with chunks of text retrieved by semantic search (based on the question) from IPCC
AR6 technical reports SYR, WG1, WG2, and WG3. You should reply to the user with a concise
but precise answer that is only based on the chunks of text that were presented to you.

And the following code to construct the set of messages presented to the model:

_fun = (env) => {
  // Start with the user question, then append one system message per
  // retrieved chunk, prefixed with the identifier of its source document.
  let messages = [{ role: "user", content: env.state.INPUT.question }];
  env.state.IPCC_AR6.forEach((d) => {
    d.chunks.forEach((c) => {
      messages.push({
        role: "system",
        content: `From: ${d.document_id}\n\n${c.text}`,
      });
    });
  });
  return messages;
};
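To check what this function produces, you can call it with a mock `env` (the question, document and chunk below are made up for illustration):

```javascript
// The code block above, called with a mock env to inspect the shape of the
// messages it builds: one user message, then one system message per chunk.
const _fun = (env) => {
  let messages = [{ role: "user", content: env.state.INPUT.question }];
  env.state.IPCC_AR6.forEach((d) => {
    d.chunks.forEach((c) => {
      messages.push({
        role: "system",
        content: `From: ${d.document_id}\n\n${c.text}`,
      });
    });
  });
  return messages;
};

// Mock env mirroring the app state: one input question and one retrieved
// document with a single chunk (made-up content).
const env = {
  state: {
    INPUT: { question: "Is sea level rising?" },
    IPCC_AR6: [
      { document_id: "ipcc_ar6_wg1", chunks: [{ text: "Sea level rose by ..." }] },
    ],
  },
};
const messages = _fun(env);
```

In the real app, `env.state.IPCC_AR6` holds the 12 chunks returned by the data_source block, so the model receives 13 messages in total.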

final code block

To return only the content generated by the model you can add a final code block with the following code:

_fun = (env) => {
  return env.state.MODEL.message.content;
};

You can now run your app and inspect the outputs of the last code block.

Using the app

You can use the Use tab to submit new questions to your app and get generated answers directly.

Dust app use interface

The app Use interface. You can submit new questions and get generated answers directly.