Documents

On this page we'll dive into the Data Sources endpoints you can use to manage documents programmatically. We'll look at how to insert, retrieve, list, delete and search documents in a Data Source.

Authentication

All requests to the Dust API must be authenticated using an Authorization header. The value of this header must be the string Bearer followed by a space and your API key. You can find your API key in your account's API keys panel.
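As a minimal sketch of the header described above, the Python helper below builds it from a key. The function name `auth_headers` is a hypothetical helper, not part of the Dust API; the header name `Authorization` follows the curl examples on this page.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Return the HTTP headers that authenticate a request."""
    # The value is the literal string "Bearer", one space, then the API key.
    return {"Authorization": f"Bearer {api_key}"}


# "sk-..." is the placeholder used on this page; load real keys from the
# environment or a secret store rather than hard-coding them.
headers = auth_headers("sk-...")
```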

The Chunk model

The Chunk model represents a chunk from a document. See the Data Sources overview to better understand how documents are chunked as part of a data source to enable semantic search.

Properties

  • Name
    hash
    Type
    string
    Description

    A hash of the chunk text as well as parent document information.

  • Name
    text
    Type
    string
    Description

    The text of the chunk as it was embedded.

  • Name
    offset
    Type
    integer
    Description

    The offset of the chunk in the parent document.

  • Name
    score
    Type
    float
    Description

    The similarity score of the chunk as returned by the semantic search.

  • Name
    vector
    Type
    []float
    Description

    The embedding vector associated with the chunk.

The Document model

The Document model represents a Data Source document.

Properties

  • Name
    document_id
    Type
    string
    Description

    The document ID as specified at insertion.

  • Name
    created
    Type
    integer
    Description

    Epoch in ms at which the document was inserted.

  • Name
    timestamp
    Type
    integer
    Description

    User specified timestamp (epoch in ms) for the document. Can be used to filter documents when querying the Data Source based on their timestamp. If not specified, defaults to the value of created.

  • Name
    tags
    Type
    []string
    Description

    User specified list of string tags. Can be used to filter the results by tags when querying the Data Source. See the data_source block for more details. If not specified, defaults to the empty list.

  • Name
    source_url
    Type
    string
    Description

    User specified URL for the document.

  • Name
    text_size
    Type
    integer
    Description

    The size in bytes of the document's text.

  • Name
    chunk_count
    Type
    integer
    Description

    The number of chunks that were generated from the document's original text for embedding.

  • Name
    chunks
    Type
    []Chunk
    Description

    The document's chunks. When searching, only includes relevant chunks. When creating a document, includes all the chunks that were generated.

  • Name
    text
    Type
    string
    Description

    The document's full text. When searching, only present if full_text is true. Always set when retrieving a document by API.
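For client code, the two models above could be mirrored as typed records. The sketch below is one possible mapping in Python, not an official client: the class layout, the `parse_document` helper and the `created_at` property are illustrative assumptions, while the field names and defaults come from the property lists above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class Chunk:
    hash: str
    text: str
    offset: int
    score: Optional[float] = None         # set in search results
    vector: Optional[list[float]] = None  # the chunk's embedding, when returned


@dataclass
class Document:
    document_id: str
    created: int    # epoch in ms
    timestamp: int  # epoch in ms, defaults to `created`
    tags: list[str] = field(default_factory=list)
    source_url: Optional[str] = None
    text_size: int = 0
    chunk_count: int = 0
    chunks: list[Chunk] = field(default_factory=list)
    text: Optional[str] = None

    @property
    def created_at(self) -> datetime:
        # Epoch values are in milliseconds, so divide by 1000 for seconds.
        return datetime.fromtimestamp(self.created / 1000, tz=timezone.utc)


def parse_document(raw: dict[str, Any]) -> Document:
    """Build a Document from the decoded JSON of a "document" object."""
    chunks = [
        Chunk(c["hash"], c["text"], c["offset"], c.get("score"), c.get("vector"))
        for c in raw.get("chunks") or []
    ]
    return Document(
        document_id=raw["document_id"],
        created=raw["created"],
        timestamp=raw.get("timestamp", raw["created"]),
        tags=raw.get("tags") or [],
        source_url=raw.get("source_url"),
        text_size=raw.get("text_size", 0),
        chunk_count=raw.get("chunk_count", len(chunks)),
        chunks=chunks,
        text=raw.get("text"),
    )
```

Fields that the API may add beyond those documented here (such as `data_source_id` in the example responses) are simply ignored by this sketch.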


POST /v1/w/:workspace_id/data_sources/:data_source_name/documents/:document_id

Create a Document

This endpoint enables you to insert a new document into a Data Source. The semantics of this endpoint are those of an upsert: if the document_id does not exist, the document is created; otherwise it is replaced (meaning you always have to supply a document_id). You can only insert documents into Data Sources you own.

URL attributes

  • Name
    workspace_id
    Type
    string
    Description

    The ID of the Data Source's workspace (can be found in the Data Source's URL)

  • Name
    data_source_name
    Type
    string
    Description

    The name of the Data Source you want to insert a document to.

  • Name
    document_id
    Type
    string
    Description

    The ID of the document you want to insert or replace (upsert). This can be any string; make sure to URL-encode it with encodeURIComponent or similar.
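To illustrate the encoding advice above, here is a small Python sketch that assembles the endpoint URL and percent-encodes each path segment. The helper name `document_url` and the sample ID are hypothetical; the base URL is taken from the curl examples on this page. Python's `quote` with `safe=""` behaves similarly to encodeURIComponent, though the two differ on a few punctuation characters.

```python
from urllib.parse import quote

API_BASE = "https://dust.tt/api"  # as used in the curl examples on this page


def document_url(workspace_id: str, data_source_name: str, document_id: str) -> str:
    """Build the document endpoint URL with each path segment percent-encoded."""
    # safe="" makes quote() escape "/" too, so an ID containing slashes
    # stays a single path segment.
    ws, ds, doc = (quote(s, safe="") for s in (workspace_id, data_source_name, document_id))
    return f"{API_BASE}/v1/w/{ws}/data_sources/{ds}/documents/{doc}"


print(document_url("3e26b0e764", "foo", "reports/2023 Q1.pdf"))
# https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/reports%2F2023%20Q1.pdf
```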

JSON body attributes

Attributes are passed as a JSON object in the request body.

  • Name
    text
    Type
    string
    Description

    The text content of the document to upsert.

  • Name
    source_url
    Type
    string
    Description

    The source URL for the document to upsert.

  • Name
    light_document_output
    Type
    boolean
    Description

    If true, a lightweight version of the document will be returned in the response (excluding the text, chunks and vectors). Defaults to false.

Optional JSON body attributes

  • Name
    timestamp
    Type
    integer
    Description

    A user-specified timestamp for the document. If not specified, defaults to the current time.

  • Name
    tags
    Type
    []string
    Description

    A list of user-specified tags to associate with the document.

  • Name
    source_url
    Type
    string
    Description

    A user-specified URL to associate with the document.

Request

POST
/v1/w/:workspace_id/data_sources/:data_source_name/documents/:document_id
curl https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/top-secret-document \
  -H "Authorization: Bearer sk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Top secret content..."
  }'
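
The same request could be prepared from Python with only the standard library. This is a sketch under assumptions rather than an official client: `build_upsert_request` is a hypothetical helper, and the request is only constructed, not sent. The body keys are the attributes documented above.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote


def build_upsert_request(
    api_key: str,
    workspace_id: str,
    data_source_name: str,
    document_id: str,
    text: str,
    tags: Optional[list[str]] = None,
    timestamp: Optional[int] = None,
    source_url: Optional[str] = None,
    light_document_output: bool = False,
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that upserts a document."""
    url = (
        "https://dust.tt/api/v1/w/"
        + quote(workspace_id, safe="")
        + "/data_sources/" + quote(data_source_name, safe="")
        + "/documents/" + quote(document_id, safe="")
    )
    body: dict = {"text": text}
    # Only include optional attributes the caller actually provided,
    # so the documented server-side defaults apply otherwise.
    if tags is not None:
        body["tags"] = tags
    if timestamp is not None:
        body["timestamp"] = timestamp
    if source_url is not None:
        body["source_url"] = source_url
    if light_document_output:
        body["light_document_output"] = True
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_upsert_request(
    "sk-...", "3e26b0e764", "foo", "top-secret-document", "Top secret content..."
)
# Passing `req` to urllib.request.urlopen would perform the actual call.
```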

Response

{
  "document": {
    "data_source_id": "foo",
    "created": 1679447275024,
    "document_id": "top-secret-document",
    "timestamp": 1679447275024,
    "tags": [],
    "source_url": null,
    "hash": "1eebbe66ac93c...47548fcd",
    "text_size": 21,
    "chunk_count": 1,
    "chunks": [{
      "text": "Top secret content...",
      "hash": "db3c24dfa326c...6bd4e1ce",
      "offset": 0,
      "vector": [ 0.0027032278, ... ],
      "score": null
    }],
    "text": null
  },
  "data_source": {
    "created": 1679447230117,
    "data_source_id": "foo",
    "config": {
      "provider_id": "openai",
      "model_id": "text-embedding-ada-002",
      "extras": null,
      "splitter_id": "base_v0",
      "max_chunk_size": 256,
      "use_cache": false
    }
  }
}

GET /v1/w/:workspace_id/data_sources/:data_source_name/documents/:document_id

Retrieve a Document

This endpoint enables you to retrieve a document by ID.

URL attributes

  • Name
    workspace_id
    Type
    string
    Description

    The ID of the Data Source's workspace (can be found in the Data Source's URL)

  • Name
    data_source_name
    Type
    string
    Description

    The name of the Data Source you want to retrieve a document from.

  • Name
    document_id
    Type
    string
    Description

    The ID of the document you want to retrieve. Make sure to URL-encode it with encodeURIComponent or similar.

Request

GET
/v1/w/:workspace_id/data_sources/:data_source_name/documents/:document_id
curl https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/top-secret-document \
  -H "Authorization: Bearer sk-..."

Response

{
  "document": {
    "data_source_id": "foo",
    "created": 1679447275024,
    "document_id": "top-secret-document",
    "timestamp": 1679447275024,
    "tags": [],
    "source_url": null,
    "hash": "1eebbe66ac93c...47548fcd",
    "text_size": 21,
    "chunk_count": 1,
    "chunks": [],
    "text": "Top secret content..."
  }
}

DELETE /v1/w/:workspace_id/data_sources/:data_source_name/documents/:document_id

Delete a Document

This endpoint enables you to delete a document by ID. All data related to the document will be deleted (and the associated chunks removed from the Data Source vector search database).

URL attributes

  • Name
    workspace_id
    Type
    string
    Description

    The ID of the Data Source's workspace (can be found in the Data Source's URL)

  • Name
    data_source_name
    Type
    string
    Description

    The name of the Data Source you want to delete a document from.

  • Name
    document_id
    Type
    string
    Description

    The ID of the document you want to delete. Make sure to URL-encode it with encodeURIComponent or similar.

Request

DELETE
/v1/w/:workspace_id/data_sources/:data_source_name/documents/:document_id
curl -XDELETE https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/top-secret-document \
  -H "Authorization: Bearer sk-..."

Response

{
  "document": {
    "document_id": "top-secret-document"
  }
}

GET /v1/w/:workspace_id/data_sources/:data_source_name/documents

List Documents

This endpoint enables you to list the documents of a Data Source.

URL attributes

  • Name
    workspace_id
    Type
    string
    Description

    The ID of the Data Source's workspace (can be found in the Data Source's URL)

  • Name
    data_source_name
    Type
    string
    Description

    The name of the Data Source whose documents you want to list.

Query parameters

Query attributes are passed as GET parameters.

  • Name
    offset
    Type
    integer
    Description

    The offset to use to retrieve the documents from the Data Source, for paging.

  • Name
    limit
    Type
    integer
    Description

    The maximum number of documents to retrieve from the Data Source, for paging.

Request

GET
/v1/w/:workspace_id/data_sources/:data_source_name/documents
curl "https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents?offset=0&limit=10" \
  -H "Authorization: Bearer sk-..."

Response

{
  "documents": [
    {
      "data_source_id": "foo",
      "created": 1679447719555,
      "document_id": "acme-report",
      "timestamp": 1679447719555,
      "tags": [],
      "source_url": null,
      "hash": "1651c5e63b6d6...2ae3acd0",
      "text_size": 13,
      "chunk_count": 1,
      "chunks": []
    }, {
      "data_source_id": "foo",
      "created": 1679447275024,
      "document_id": "top-secret-document",
      "timestamp": 1679447275024,
      "tags": [],
      "source_url": null,
      "hash": "1eebbe66ac93c...47548fcd",
      "text_size": 21,
      "chunk_count": 1,
      "chunks": []
    }],
  "total": 2
}
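
The offset and limit parameters can be combined into a paging loop. The sketch below assumes that `total` in the response is the total number of documents in the Data Source (the example above is consistent with this but does not state it), and that `fetch_page` is a caller-supplied function, hypothetical here, that performs the GET request for a given offset and limit and returns the decoded JSON.

```python
from typing import Any, Callable, Iterator

Page = dict[str, Any]


def iter_documents(
    fetch_page: Callable[[int, int], Page], limit: int = 10
) -> Iterator[dict[str, Any]]:
    """Yield every document by requesting successive pages."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        documents = page["documents"]
        yield from documents
        offset += len(documents)
        # Stop on an empty page or once `total` documents have been seen.
        if not documents or offset >= page["total"]:
            break


# In-memory stand-in for the HTTP call, for demonstration only.
_fake_store = [{"document_id": f"doc-{i}"} for i in range(25)]


def fake_fetch(offset: int, limit: int) -> Page:
    return {"documents": _fake_store[offset:offset + limit], "total": len(_fake_store)}


ids = [d["document_id"] for d in iter_documents(fake_fetch, limit=10)]
```

Injecting `fetch_page` keeps the paging logic independent of any particular HTTP library and easy to exercise without network access.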

GET /v1/w/:workspace_id/data_sources/:data_source_name/search

Search Documents

This endpoint enables you to perform a semantic search on your Data Source's documents.

URL attributes

  • Name
    workspace_id
    Type
    string
    Description

    The ID of the Data Source's workspace (can be found in the Data Source's URL)

  • Name
    data_source_name
    Type
    string
    Description

    The name of the Data Source you want to perform your search against.

Query parameters

  • Name
    query
    Type
    string
    Description

    The search query text.

  • Name
    top_k
    Type
    integer
    Description

    The maximum number of search elements to return.

  • Name
    full_text
    Type
    boolean
    Description

    Whether or not to return the full text associated with the matching chunks' documents.

Optional query parameters

  • Name
    timestamp_lt
    Type
    number
    Description

    Only return documents whose creation timestamp is earlier than this value.

  • Name
    timestamp_gt
    Type
    number
    Description

    Only return documents whose creation timestamp is later than this value.

  • Name
    tags_in
    Type
    array[string]
    Description

    Only return documents that have at least one of these tags attached. You can pass tags_in multiple times in the search query.

  • Name
    tags_not
    Type
    array[string]
    Description

    Exclude documents tagged with any of these tags from the search results. You can pass tags_not multiple times in the search query.
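
Because tags_in and tags_not may appear several times, the query string is easiest to build from a list of key/value pairs. The following Python sketch shows one way to do that; `search_query_string` is a hypothetical helper, and the `true`/`false` spelling for full_text follows the curl example on this page.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def search_query_string(
    query: str,
    top_k: int,
    full_text: bool = False,
    tags_in: Sequence[str] = (),
    tags_not: Sequence[str] = (),
    timestamp_gt: Optional[int] = None,
    timestamp_lt: Optional[int] = None,
) -> str:
    """Encode search parameters, repeating tags_in/tags_not once per tag."""
    params: list[tuple[str, object]] = [
        ("query", query),
        ("top_k", top_k),
        ("full_text", "true" if full_text else "false"),
    ]
    # A list of pairs (rather than a dict) lets the same key appear repeatedly.
    params += [("tags_in", tag) for tag in tags_in]
    params += [("tags_not", tag) for tag in tags_not]
    if timestamp_gt is not None:
        params.append(("timestamp_gt", timestamp_gt))
    if timestamp_lt is not None:
        params.append(("timestamp_lt", timestamp_lt))
    return urlencode(params)


print(search_query_string("secret", 10, full_text=True,
                          tags_in=["foo", "bar"], tags_not=["baz", "zab"]))
# query=secret&top_k=10&full_text=true&tags_in=foo&tags_in=bar&tags_not=baz&tags_not=zab
```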

Request

GET
/v1/w/:workspace_id/data_sources/:data_source_name/search
curl "https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/search?query=secret&top_k=10&full_text=true&tags_in=foo&tags_in=bar&tags_not=baz&tags_not=zab" \
  -H "Authorization: Bearer sk-..."

Response

{
  "error": null,
  "response": {
    "documents": [
      {
        "data_source_id": "foo",
        "created": 1680101106398,
        "document_id": "5",
        "timestamp": 1680101106398,
        "tags": [],
        "source_url": null,
        "hash": "9933f078502b2...e8b849da",
        "text_size": 17,
        "chunk_count": 1,
        "chunks": [
          {
            "text": "Top secret document...",
            "hash": "39e28bdc52f7b...20638967",
            "offset": 0,
            "vector": null,
            "score": 0.7754930853843689
          }
        ],
        "text": "Full text of our Top secret document..."
      }
    ]
  }
}
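
A response shaped like the one above can be flattened into a ranked list of chunks. This Python sketch assumes that a non-null `error` field signals a failed search and that a higher score means a closer match; neither is stated explicitly on this page. The helper name `top_chunks` is hypothetical.

```python
from typing import Any


def top_chunks(payload: dict[str, Any]) -> list[tuple[float, str, str]]:
    """Return (score, document_id, chunk text) tuples, best match first."""
    if payload.get("error") is not None:
        raise RuntimeError(f"search failed: {payload['error']}")
    hits = []
    for doc in payload["response"]["documents"]:
        for chunk in doc["chunks"]:
            hits.append((chunk["score"], doc["document_id"], chunk["text"]))
    # Sort on the score only, highest first (assumed to mean most similar).
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


sample = {
    "error": None,
    "response": {
        "documents": [
            {
                "document_id": "5",
                "chunks": [{"text": "Top secret document...", "score": 0.7754930853843689}],
            }
        ]
    },
}
best_score, best_doc, best_text = top_chunks(sample)[0]
```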