Data Sources
An overview of Dust's Data Sources. You'll learn how to create Data Sources, upload documents and leverage them to build more context-aware apps.
Dust's Data Sources provide a fully managed semantic search solution. A Data Source is a managed store of documents on which semantic searches can be performed. Documents can be ingested by API or uploaded manually from the Dust interface.
Once uploaded, documents are automatically chunked, embedded and indexed. Searches can be performed against the documents of a Data Source using the data_source block which, given a query, automatically embeds it and performs a vector search operation to retrieve the most semantically relevant documents and associated chunks.
Semantic search, generally based on embedding models, is important to large language model apps because of the limited context size of models. It enables the retrieval of chunks of valuable information that fit in context to perform a particular task. This is called retrieval-augmented generation.
Data Sources enable apps to perform semantic searches over large collections of documents, retrieving only the chunks of information that are most relevant to the app's task, without having to manage the complexity of setting up a vector search database, chunking documents, embedding chunks, and finally running vector search queries.
Data Source creation
Creating a Data Source consists of providing a name, a description, an embedding model to use, as well as max_chunk_size, the targeted number of tokens for each chunk. The embedding model must be provided at Data Source creation and cannot be changed later (since the model used to embed document chunks and the one used to embed search queries must be consistent).
Parameters
- name (string)
  The name of the new Data Source. It should be short and can only contain lowercase alphanumeric characters as well as -.
- description (optional string)
  A description of the content of the Data Source.
- visibility (string)
  One of public or private. Only you can edit your own Data Sources. If public, other users can view the Data Source and query it.
- embedded (model)
  The model to use for the Data Source. This is set at creation and cannot be changed later (as changing it would require re-embedding the entire Data Source). It is the model used to embed document chunks when they are ingested, as well as queries when searches are performed (the same model must be used for both).
- max_chunk_size (integer)
  The number of tokens to use when chunking documents before embedding the resulting chunks. Note that chunks at the end of documents may be shorter than the provided value.
Data sources are created empty. Their description can be edited from the Settings panel. They cannot be renamed but they can be deleted from the same panel.
Choosing a chunk size
Choosing a chunk size depends on the usage you will have of the retrieved chunks.
If the goal is to present these chunks to a model for retrieval-augmented generation, you'll want a chunk size in tokens that is not too high: imagine a model with a ~4k-token context window where you want to keep at least 1k tokens for generating the completion. That leaves ~3k tokens of context for retrieval. If you want some diversity in the retrieved information, you'll want maybe a half-dozen to a dozen chunks returned, meaning a chunk size between 256 and 512 tokens.
Conversely, if you plan to process the retrieved documents with subsequent calls to models (for example using recursive summarization) then you want to pick a higher chunk size such that the semantic retrieval step covers as much information as possible.
These decisions really depend on the quality and speed constraints you have to design your app and we advise that you experiment with various approaches to find the one that suits your use case the best.
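As a back-of-the-envelope illustration of the trade-off above, here is a minimal sketch in Python (the 4k context window and token budgets are the illustrative numbers from the example, not Dust defaults):

# Illustrative chunk size calculation (numbers are assumptions matching the
# example above, not Dust defaults).
CONTEXT_WINDOW = 4096      # total context size of the model, in tokens
COMPLETION_BUDGET = 1024   # tokens reserved for generating the completion
RETRIEVAL_BUDGET = CONTEXT_WINDOW - COMPLETION_BUDGET  # ~3k tokens left

for num_chunks in [6, 12]:
    chunk_size = RETRIEVAL_BUDGET // num_chunks
    print(f"{num_chunks} chunks -> max_chunk_size ~ {chunk_size} tokens")
# 6 chunks  -> max_chunk_size ~ 512 tokens
# 12 chunks -> max_chunk_size ~ 256 tokens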
Document insertion
Once a Data Source is created, documents can be inserted from the Dust interface or by API. When a document is inserted, the following happens automatically:
- Chunking: The document is pre-processed to remove repeated whitespace (which helps semantic search) and chunked using max_chunk_size tokens per chunk.
- Embedding: Each chunk is embedded (in parallel, with retries) using the embedding model configured on the Data Source.
- Indexing: Each resulting embedding vector is inserted in a vector search database along with metadata about the document and the original chunk text.
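All of this is managed server-side by Dust, but conceptually the ingestion pipeline looks like the following sketch (the word-based tokenization, the embed function and the vector_db store are purely illustrative assumptions, not Dust's actual implementation):

# Conceptual sketch of the managed ingestion pipeline (illustrative only;
# this is not Dust's actual implementation).
import re

def ingest(document_id, text, max_chunk_size, embed, vector_db):
    # Chunking: collapse repeated whitespace, then split into chunks of at
    # most max_chunk_size tokens (tokenization is approximated by words here).
    clean = re.sub(r"\s+", " ", text).strip()
    tokens = clean.split(" ")
    chunks = [
        " ".join(tokens[i : i + max_chunk_size])
        for i in range(0, len(tokens), max_chunk_size)
    ]
    # Embedding + indexing: embed each chunk and insert the vector along with
    # metadata about the document and the original chunk text.
    for i, chunk in enumerate(chunks):
        vector = embed(chunk)  # same embedding model as used for queries
        vector_db.insert(
            vector,
            {"document_id": document_id, "chunk": chunk, "offset": i},
        )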
The following parameters are accepted when inserting a document:
Parameters
- document_id (string)
  A unique ID for the document. The semantics of the insertion is really an upsert: inserting with a document_id that does not exist will create that document; otherwise it will replace the previous version of the document (removing the previous chunks from the vector search database and replacing them with the updated document's).
- text (string)
  The text content of the document.
- timestamp (optional integer)
  User-specified timestamp (epoch in ms) for the document. Can be used to filter documents by timestamp when querying the Data Source. If not specified, defaults to the time of insertion.
- tags (optional []string)
  User-specified list of string tags. Can be used to filter results by tags when querying the Data Source. See the data_source block for more details. If not specified, defaults to the empty list.
- source_url (optional string)
  User-specified source URL for the document.
See the Documents API reference to learn how to insert documents by API. Data sources need to be created from the Dust interface.
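As a sketch, a single insertion call in Python could look like the following. The endpoint format matches the upload script below; passing the optional parameters in the JSON body alongside text is an assumption, so refer to the Documents API reference for the authoritative format.

import time
import requests

# Assumed values; see the Documents API reference for the authoritative format.
DUST_API_KEY = ""          # your Dust API key
DUST_WORKSPACE_ID = ""     # found in the workspace URL
DUST_DATA_SOURCE_ID = ""   # the Data Source to insert into

url = (
    f"https://dust.tt/api/v1/w/{DUST_WORKSPACE_ID}"
    f"/data_sources/{DUST_DATA_SOURCE_ID}/documents/my-document-id"
)
r = requests.post(
    url,
    headers={"Authorization": f"Bearer {DUST_API_KEY}"},
    json={
        "text": "Full text content of the document...",
        # Optional parameters (assumed to be accepted in the JSON body):
        "timestamp": int(time.time() * 1000),  # epoch in ms
        "tags": ["guide", "internal"],
        "source_url": "https://example.com/original-document",
    },
)
r.raise_for_status()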
Uploading directories of files to your Data Source
We also have a script you can use to upload a directory's contents to your Data Source. Copy the following code into upload.py, then fill in the values for your Dust API key, the workspace ID you are in, and the Data Source you want to upload to. Then run python upload.py <dir>, where <dir> is the directory from which you want to upload the documents.
#!/usr/bin/env python
import pathlib
import sys

import pdftotext
import requests

# The three following variables need to be set with your own Dust API key (can be found in the
# interface), Workspace ID (can be found in the workspace URL) and Data Source ID:
DUST_API_KEY = ""
DUST_WORKSPACE_ID = ""
DUST_DATA_SOURCE_ID = ""

ENDPOINT = f"https://dust.tt/api/v1/w/{DUST_WORKSPACE_ID}/data_sources/{DUST_DATA_SOURCE_ID}/documents/"

def upload(text, file):
    # The file name (without its extension) is used as the document_id.
    url = ENDPOINT + file.stem
    r = requests.post(
        url,
        headers={"Authorization": "Bearer " + DUST_API_KEY},
        json={"text": text},
    )
    return r

directory = sys.argv[1]

# Iterate through all files, extracting text from PDFs and reading .txt/.md files directly.
for file in pathlib.Path(directory).rglob("*"):
    if file.is_file():
        if file.suffix == ".pdf":
            resp = upload("\n\n".join(pdftotext.PDF(file.open("rb"))), file)
        elif file.suffix in [".txt", ".md"]:
            resp = upload(file.read_text(), file)
        else:
            continue
        if resp.status_code == 200:
            print("Uploaded", file)
        else:
            print("Error uploading", file)
            print(resp.text)
Example usage, once values are filled in: python upload.py test_dir/.
Document deletion
When deleting a document, all associated chunks are automatically removed from the vector search database of the Data Source. Documents can be deleted from the Dust interface or by API.
Data Sources can be deleted from the Dust interface. When a Data Source is deleted, all associated data (all documents and associated chunks) is deleted from our systems.
See the Documents API reference to learn how to delete documents by API.
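As a sketch, a deletion call might mirror the insertion endpoint (using a DELETE request on the documents endpoint is an assumption here; refer to the Documents API reference for the authoritative call):

import requests

# Assumed values and endpoint, mirroring the insertion endpoint above; the
# DELETE method is an assumption, see the Documents API reference.
DUST_API_KEY = ""
DUST_WORKSPACE_ID = ""
DUST_DATA_SOURCE_ID = ""

url = (
    f"https://dust.tt/api/v1/w/{DUST_WORKSPACE_ID}"
    f"/data_sources/{DUST_DATA_SOURCE_ID}/documents/my-document-id"
)
r = requests.delete(url, headers={"Authorization": f"Bearer {DUST_API_KEY}"})
r.raise_for_status()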
Querying a Data Source
Querying a Data Source is done using the data_source block. The data_source block returns a list of Document objects. Each document may include one or more Chunks (the chunks returned by the semantic search are aggregated per document).
When a query is run the following happens automatically:
- Embedding: The query is embedded using the embedding model set on the Data Source.
- Search: A vector search query is run against the embedding vectors of the Data Source's documents' chunks.
- Union: The documents of the most relevant chunks are retrieved and each chunk is associated with its original document object.
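To make this concrete, here is a minimal sketch of turning the returned documents into model context for retrieval-augmented generation (the document_id, chunks and text field names are assumptions about the Document object's shape, not its exact schema):

# Illustrative post-processing of data_source block results (field names
# are assumptions, not the exact shape of the Document object).
def render_context(documents):
    parts = []
    for doc in documents:
        # Each document aggregates the chunks retrieved by the semantic search.
        for chunk in doc["chunks"]:
            parts.append(f"[{doc['document_id']}] {chunk['text']}")
    return "\n\n".join(parts)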