Data Sources

An overview of Dust's Data Sources. You'll learn how to create Data Sources, upload documents and leverage them to build more context-aware apps.

Dust's Data Sources provide a fully managed semantic search solution. A Data Source is a managed store of documents on which semantic searches can be performed. Documents can be ingested by API or uploaded manually from the Dust interface.

Once uploaded, documents are automatically chunked, embedded and indexed. Searches can be performed against the documents of a Data Source using the data_source block which, given a query, automatically embeds it and performs a vector search operation to retrieve the most semantically relevant documents and associated chunks.

Data Sources enable apps to perform semantic searches over large collections of documents and retrieve only the chunks of information most relevant to the app's task. You do not have to set up a vector search database, chunk documents, embed chunks or run vector search queries yourself.

Data Source creation

Creating a Data Source consists of providing a name, an optional description and a visibility setting.

Parameters

  • Name
    name
    Type
    string
    Description

    The name of the new Data Source. It should be short and can only contain lowercase alphanumerical characters as well as -.

  • Name
    description
    Type
    optional string
    Description

    A description of the content of the Data Source.

  • Name
    visibility
    Type
    string
    Description

    One of public or private. Only you can edit your own Data Sources. If public, other users can view the Data Source and query it.

Data Sources are created empty. Their description can be edited from the Settings panel. They cannot be renamed, but they can be deleted from the same panel.
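
The naming and visibility rules above can be checked before entering values in the interface. The following is a minimal sketch, assuming the naming rule means one or more characters from a-z, 0-9 and -; the helper names are hypothetical and not part of Dust.

```python
import re

# Assumed reading of the naming rule: one or more lowercase letters, digits or "-".
# The documentation gives no explicit length limit, so none is enforced here.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")
VISIBILITIES = {"public", "private"}


def is_valid_data_source_name(name: str) -> bool:
    """Return True if name only contains lowercase alphanumerics and '-'."""
    return NAME_PATTERN.fullmatch(name) is not None


def check_settings(name: str, visibility: str) -> list:
    """Return a list of human-readable problems (empty if the settings look fine)."""
    problems = []
    if not is_valid_data_source_name(name):
        problems.append(f"invalid name: {name!r}")
    if visibility not in VISIBILITIES:
        problems.append(f"invalid visibility: {visibility!r}")
    return problems
```

For example, check_settings("my-docs", "private") returns an empty list, while a name containing uppercase letters or spaces is reported as invalid.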

Document insertion

Once a Data Source is created, documents can be inserted from the Dust interface or by API. When a document is inserted, the following happens automatically:

  • Chunking: The document is pre-processed to collapse repeated whitespace (which helps semantic search) and split into chunks of at most max_chunk_size tokens each.
  • Embedding: Each chunk is embedded (in parallel, with retries) using the embedding model configured on the Data Source.
  • Indexing: Each resulting embedding vector is inserted into a vector search database along with metadata about the document and the original chunk text.
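
The three steps above can be illustrated with a toy pipeline. This is only a sketch of the idea, not Dust's implementation: whitespace-separated words stand in for model tokens, toy_embed is a hypothetical hash-based stand-in for a real embedding model, and a Python list stands in for the vector search database.

```python
import hashlib
import re

MAX_CHUNK_SIZE = 8  # hypothetical value for max_chunk_size (tokens per chunk)
EMBEDDING_DIM = 4   # toy dimension; real embedding models use far more


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def chunk(text, max_chunk_size=MAX_CHUNK_SIZE):
    """Split text into chunks of at most max_chunk_size tokens.

    Assumption: whitespace-separated words stand in for model tokens.
    """
    tokens = normalize_whitespace(text).split()
    return [
        " ".join(tokens[i:i + max_chunk_size])
        for i in range(0, len(tokens), max_chunk_size)
    ]


def toy_embed(text):
    """Deterministic stand-in for an embedding model (not semantically meaningful)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:EMBEDDING_DIM]]


def ingest(index, document_id, text):
    """Chunk, embed and index a document; index is a plain list of entries."""
    for i, chunk_text in enumerate(chunk(text)):
        index.append({
            "document_id": document_id,
            "chunk_index": i,
            "text": chunk_text,
            "vector": toy_embed(chunk_text),
        })
    return index
```

Each index entry keeps the chunk text and the document ID next to the vector, mirroring the metadata mentioned in the indexing step.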

The following parameters are accepted when inserting a document:

Parameters

  • Name
    document_id
    Type
    string
    Description

    A unique ID for the document. Insertion has upsert semantics: inserting with a document_id that does not exist creates the document; otherwise it replaces the previous version of the document (removing the previous chunks from the vector search database and replacing them with the updated document's chunks).

  • Name
    text
    Type
    string
    Description

    The text content of the document.

  • Name
    timestamp
    Type
    optional integer
    Description

    User-specified timestamp (epoch in ms) for the document. Can be used to filter documents by timestamp when querying the Data Source. If not specified, defaults to the time of insertion.

  • Name
    tags
    Type
    optional []string
    Description

    User-specified list of string tags. Can be used to filter the results by tags when querying the Data Source. See the data_source block for more details. If not specified, defaults to the empty list.

  • Name
    source_url
    Type
    optional string
    Description

    User-specified source URL for the document.
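
The upsert behavior described for document_id can be pictured with a small in-memory model. This is an illustrative sketch only: the index is a plain list of (document_id, chunk) pairs and upsert_document is a hypothetical helper, not a Dust function.

```python
def upsert_document(index, document_id, chunks):
    """Return a new index where document_id's chunks are replaced by chunks.

    index is a list of (document_id, chunk_text) pairs. Any chunk from a
    previous version of the document is dropped before the new ones are added.
    """
    kept = [entry for entry in index if entry[0] != document_id]
    kept.extend((document_id, chunk_text) for chunk_text in chunks)
    return kept


index = upsert_document([], "report", ["v1 part 1", "v1 part 2"])
index = upsert_document(index, "notes", ["notes part 1"])
# Re-inserting "report" removes both v1 chunks and adds the v2 chunk.
index = upsert_document(index, "report", ["v2 part 1"])
```

After the last call, the index contains the single chunk of "notes" followed by the single v2 chunk of "report".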

See the Documents API reference to learn how to insert documents by API. Data Sources must be created from the Dust interface.
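
As a rough illustration of the parameters above, the sketch below assembles a request body for a document insertion. It assumes the optional parameters are sent as JSON keys with the same names as in the parameter list (only "text" is confirmed by the upload script further down, where document_id is part of the URL); build_document_payload and to_epoch_ms are hypothetical helpers. Check the Documents API reference before relying on it.

```python
from datetime import datetime, timezone


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to an epoch timestamp in milliseconds."""
    return int(dt.timestamp() * 1000)


def build_document_payload(text, timestamp=None, tags=None, source_url=None):
    """Build the JSON body for a document insertion.

    Assumption: optional fields use the parameter names from the list above.
    Fields left as None are omitted so that server-side defaults apply.
    """
    payload = {"text": text}
    if timestamp is not None:
        payload["timestamp"] = to_epoch_ms(timestamp)
    if tags is not None:
        payload["tags"] = list(tags)
    if source_url is not None:
        payload["source_url"] = source_url
    return payload


payload = build_document_payload(
    "Quarterly report text",
    timestamp=datetime(2023, 1, 1, tzinfo=timezone.utc),
    tags=["reports", "2023"],
    source_url="https://example.com/reports/q1",
)
```

Omitting timestamp and tags leaves the defaults described above in effect (time of insertion and the empty list).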

Uploading directories of files to your Data Source

We also have a script you can use to upload a directory's contents to your Data Source. Copy the following code into upload.py, then fill in your Dust API key, the ID of the workspace you are in and the ID of the Data Source you want to upload to. Then run python upload.py <dir>, where <dir> is the directory whose documents you want to upload.

#!/usr/bin/env python
import pathlib
import sys

import pdftotext
import requests

# The three following variables need to be set with your own Dust API key (can be found in the interface),
# Workspace ID (can be found in the workspace URL) and Data Source ID:
DUST_API_KEY = ""
DUST_WORKSPACE_ID = ""
DUST_DATA_SOURCE_ID = ""
ENDPOINT = f"https://dust.tt/api/v1/w/{DUST_WORKSPACE_ID}/data_sources/{DUST_DATA_SOURCE_ID}/documents/"


def upload(text, file):
    # The file name without its extension is used as the document_id, so
    # uploading a file with the same name again replaces the previous version.
    url = ENDPOINT + file.stem
    return requests.post(
        url,
        headers={"Authorization": "Bearer " + DUST_API_KEY},
        json={"text": text},
    )


def extract_pdf_text(file):
    # pdftotext.PDF yields one string per page; pages are joined with blank lines.
    with file.open("rb") as f:
        return "\n\n".join(pdftotext.PDF(f))


if len(sys.argv) != 2:
    sys.exit("Usage: python upload.py <dir>")
directory = sys.argv[1]

# Walk the directory recursively; upload .txt and .md files as-is and .pdf files as extracted text.
for file in pathlib.Path(directory).rglob("*"):
    if not file.is_file():
        continue
    if file.suffix == ".pdf":
        resp = upload(extract_pdf_text(file), file)
    elif file.suffix in [".txt", ".md"]:
        resp = upload(file.read_text(), file)
    else:
        continue
    if resp.status_code == 200:
        print("Uploaded", file)
    else:
        print("Error uploading", file)
        print(resp.text)

Example usage, once values are filled in: python upload.py test_dir/.

Document deletion

When deleting a document, all associated chunks are automatically removed from the vector search database of the Data Source. Documents can be deleted from the Dust interface or by API.

See the Documents API reference to learn how to delete documents by API.
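
As a hedged illustration, the sketch below prepares a deletion request without sending it. It assumes that the document URL used by the upload script above also accepts the HTTP DELETE method; this is an assumption, so confirm it against the Documents API reference. build_delete_request is a hypothetical helper.

```python
import urllib.request
from urllib.parse import quote


def build_delete_request(api_key, workspace_id, data_source_id, document_id):
    """Prepare (but do not send) a DELETE request for one document.

    Assumption: the endpoint mirrors the URL used by the upload script.
    """
    url = (
        f"https://dust.tt/api/v1/w/{workspace_id}/data_sources/"
        f"{data_source_id}/documents/{quote(document_id, safe='')}"
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": "Bearer " + api_key},
    )


request = build_delete_request("my-api-key", "my-workspace", "my-docs", "old report")
```

Passing the prepared request to urllib.request.urlopen would send it; that step is left out here so the sketch has no side effects.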

Querying a Data Source

Querying a Data Source is done using the data_source block. The data_source block returns a list of Document objects. Each document may include one or more Chunks (the chunks returned by the semantic search are aggregated per document).

When a query is run the following happens automatically:

  • Embedding: The query is embedded using the embedding model set on the Data Source.
  • Search: A vector search query is run against the embedding vectors of the Data Source's documents' chunks.
  • Union: The documents of the most relevant chunks are retrieved, and each chunk is attached to its original Document object.
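
The query steps above can be sketched in a few lines. This is a toy model rather than the data_source block itself: the embed argument is a hypothetical stand-in for the Data Source's embedding model, cosine similarity is assumed as the relevance measure, and the index is a plain list of entries with document_id, text and vector keys.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query, embed, top_k=3):
    """Embed the query, rank chunks by similarity, and group the top chunks per document."""
    query_vector = embed(query)                      # Embedding
    ranked = sorted(                                 # Search
        index,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )[:top_k]
    documents = defaultdict(list)                    # Union
    for entry in ranked:
        documents[entry["document_id"]].append(entry["text"])
    return dict(documents)


toy_index = [
    {"document_id": "a", "text": "a1", "vector": [1.0, 0.0]},
    {"document_id": "a", "text": "a2", "vector": [0.9, 0.1]},
    {"document_id": "b", "text": "b1", "vector": [0.0, 1.0]},
]
results = search(toy_index, "any query", embed=lambda q: [1.0, 0.0], top_k=2)
```

Here both top-ranked chunks belong to document "a", so the result contains a single document with two chunks, which matches the per-document aggregation described earlier.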