Documents
On this page, we'll dive into the Documents endpoints you can use to manage the documents of a Data Source programmatically. We'll look at how to insert, retrieve, delete, list and search documents in a Data Source.
Authentication
All requests to the Dust API must be authenticated using an Authorization header. The value of this header must be the string Bearer followed by a space and your API key. You can find your API key in your account's API keys panel.
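As a minimal Python sketch of the header format described above (the helper name auth_headers is hypothetical and not part of the Dust API; the header name matches the one used in the request examples on this page):

```python
def auth_headers(api_key: str) -> dict:
    """Return the header that authenticates a Dust API request."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    # The value is the literal string "Bearer", a single space, then the API key.
    return {"Authorization": f"Bearer {api_key}"}


# "sk-..." is the elided placeholder key used throughout this page.
headers = auth_headers("sk-...")
```

The returned dictionary can be passed as the headers of any HTTP client request.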
The Chunk model
The Chunk model represents a chunk from a document. See the Data Sources overview to better understand how documents are chunked as part of a data source to enable semantic search.
Properties
- Name
hash
- Type
- string
- Description
A hash of the chunk text as well as parent document information.
- Name
text
- Type
- string
- Description
The text of the chunk as it was embedded.
- Name
offset
- Type
- integer
- Description
The offset of the chunk in the parent document.
- Name
score
- Type
- float
- Description
The similarity score of the chunk as returned by the semantic search.
- Name
vector
- Type
- []float
- Description
The embedding vector associated with the chunk.
The Document model
The Document model represents a Data Source document.
Properties
- Name
document_id
- Type
- string
- Description
The document ID as specified at insertion.
- Name
created
- Type
- integer
- Description
Epoch in ms at which the document was inserted.
- Name
timestamp
- Type
- integer
- Description
User-specified timestamp (epoch in ms) for the document. Can be used to filter documents by timestamp when querying the Data Source. If not specified, defaults to the value of created.
- Name
tags
- Type
- []string
- Description
User-specified list of string tags. Can be used to filter the results by tags when querying the Data Source. See the
data_source
block for more details. If not specified, defaults to the empty list.
- Name
source_url
- Type
- string
- Description
User-specified URL for the document.
- Name
text_size
- Type
- integer
- Description
The size in bytes of the document's text.
- Name
chunk_count
- Type
- integer
- Description
The number of chunks that were generated from the document's original text for embedding.
- Name
chunks
- Type
- []Chunk
- Description
The document's chunks. When searching, only includes relevant chunks. When creating a document, includes all the chunks that were generated.
- Name
text
- Type
- string
- Description
The document's full text. When searching, only present if
full_text
is true. Always set when retrieving a document by API.
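The created and timestamp properties are expressed as epoch milliseconds. The following Python sketch shows one way to produce such a value from a timezone-aware datetime; the helper name to_epoch_ms is hypothetical and only serves the illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Integer division of timedeltas avoids floating-point rounding errors.
    return (moment - EPOCH) // timedelta(milliseconds=1)


# One second after the epoch is 1000 ms.
one_second = to_epoch_ms(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc))
```

A value computed this way can be supplied as the timestamp of a document at insertion.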
Create a Document
This endpoint enables you to insert a new document into a Data Source. The semantics of this endpoint are those of an upsert: if the document_id does not exist, the document gets created; otherwise it gets replaced (meaning you always have to supply a document_id). You can only insert documents into the Data Sources you own.
URL attributes
- Name
workspace_id
- Type
- string
- Description
The ID of the Data Source's workspace (can be found in the Data Source's URL)
- Name
data_source_name
- Type
- string
- Description
The name of the Data Source you want to insert a document to.
- Name
document_id
- Type
- string
- Description
The ID of the document you want to insert or replace (upsert). This can be any string; make sure to URL-encode it with
encodeURIComponent
or similar.
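Because the document ID is part of the URL path, it has to be percent-encoded. The sketch below uses Python's urllib.parse.quote as the "or similar" option; it is close to, but not byte-for-byte identical with, encodeURIComponent. The helper name document_url is hypothetical, and the base URL is taken from the request examples on this page.

```python
from urllib.parse import quote

BASE_URL = "https://dust.tt/api/v1"


def document_url(workspace_id: str, data_source_name: str, document_id: str) -> str:
    """Build the URL of a single document, percent-encoding each path segment."""
    # safe="" also encodes "/", so an ID containing slashes stays one path segment.
    workspace = quote(workspace_id, safe="")
    data_source = quote(data_source_name, safe="")
    document = quote(document_id, safe="")
    return f"{BASE_URL}/w/{workspace}/data_sources/{data_source}/documents/{document}"


url = document_url("3e26b0e764", "foo", "reports/2023 Q1")
# The ID segment becomes "reports%2F2023%20Q1".
```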
JSON body attributes
Attributes are passed as a JSON object in the request body.
- Name
text
- Type
- string
- Description
The text content of the document to upsert.
- Name
source_url
- Type
- string
- Description
The source URL for the document to upsert.
Optional JSON body attributes
- Name
timestamp
- Type
- integer
- Description
A user-specified timestamp for the document. If not specified, defaults to the current time.
- Name
tags
- Type
- []string
- Description
A list of user-specified tags to associate with the document.
- Name
source_url
- Type
- string
- Description
A user-specified URL to associate with the document.
Request
curl https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/top-secret-document \
-H "Authorization: Bearer sk-..." \
-H "Content-Type: application/json" \
-d '{
"text": "Top secret content..."
}'
Response
{
"document": {
"data_source_id": "foo",
"created": 1679447275024,
"document_id": "top-secret-document",
"timestamp": 1679447275024,
"tags": [],
"source_url": null,
"hash": "1eebbe66ac93c...47548fcd",
"text_size": 21,
"chunk_count": 1,
"chunks": [{
"text": "Top secret content...",
"hash": "db3c24dfa326c...6bd4e1ce",
"offset": 0,
"vector": [ 0.0027032278, ... ],
"score": null
}],
"text": null
},
"data_source": {
"created": 1679447230117,
"data_source_id": "foo",
"config": {
"provider_id": "openai",
"model_id": "text-embedding-ada-002",
"extras": null,
"splitter_id": "base_v0",
"max_chunk_size": 256,
"use_cache": false
}
}
}
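The same upsert can be prepared from Python with the standard library only. This is a sketch under assumptions, not an official client: the function name build_upsert_request is hypothetical, and the POST method is inferred from the curl example (curl sends POST when -d is given). The request is built but not sent.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://dust.tt/api/v1"


def build_upsert_request(
    api_key: str,
    workspace_id: str,
    data_source_name: str,
    document_id: str,
    text: str,
    timestamp: Optional[int] = None,
    tags: Optional[list] = None,
    source_url: Optional[str] = None,
) -> Request:
    """Build (but do not send) the upsert request for one document."""
    url = (
        f"{BASE_URL}/w/{quote(workspace_id, safe='')}"
        f"/data_sources/{quote(data_source_name, safe='')}"
        f"/documents/{quote(document_id, safe='')}"
    )
    body = {"text": text}
    # Optional attributes are only included when provided,
    # so the documented defaults apply otherwise.
    if timestamp is not None:
        body["timestamp"] = timestamp
    if tags is not None:
        body["tags"] = tags
    if source_url is not None:
        body["source_url"] = source_url
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",  # inferred from the curl example above
    )


request = build_upsert_request(
    "sk-...", "3e26b0e764", "foo", "top-secret-document", "Top secret content..."
)
```

Passing the request to urllib.request.urlopen would send it; the JSON response then carries the upserted document under the document key, as in the response shown above.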
Retrieve a Document
This endpoint enables you to retrieve a document by ID.
URL attributes
- Name
workspace_id
- Type
- string
- Description
The ID of the Data Source's workspace (can be found in the Data Source's URL)
- Name
data_source_name
- Type
- string
- Description
The name of the Data Source you want to retrieve a document from.
- Name
document_id
- Type
- string
- Description
The ID of the document you want to retrieve. This can be any string; make sure to URL-encode it with
encodeURIComponent
or similar.
Request
curl https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/top-secret-document \
-H "Authorization: Bearer sk-..."
Response
{
"document": {
"data_source_id": "foo",
"created": 1679447275024,
"document_id": "top-secret-document",
"timestamp": 1679447275024,
"tags": [],
"source_url": null,
"hash": "1eebbe66ac93c...47548fcd",
"text_size": 21,
"chunk_count": 1,
"chunks": [],
"text": "Top secret content..."
}
}
Delete a Document
This endpoint enables you to delete a document by ID. All data related to the document will be deleted, and the associated chunks will be removed from the Data Source's vector search database.
URL attributes
- Name
workspace_id
- Type
- string
- Description
The ID of the Data Source's workspace (can be found in the Data Source's URL)
- Name
data_source_name
- Type
- string
- Description
The name of the Data Source you want to delete a document from.
- Name
document_id
- Type
- string
- Description
The ID of the document you want to delete. This can be any string; make sure to URL-encode it with
encodeURIComponent
or similar.
Request
curl -XDELETE https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents/top-secret-document \
-H "Authorization: Bearer sk-..."
Response
{
"document": {
"document_id": "top-secret-document"
}
}
List Documents
This endpoint enables you to list the documents of a Data Source.
URL attributes
- Name
workspace_id
- Type
- string
- Description
The ID of the Data Source's workspace (can be found in the Data Source's URL)
- Name
data_source_name
- Type
- string
- Description
The name of the Data Source you want to list documents from.
Query parameters
Query attributes are passed as GET parameters.
- Name
offset
- Type
- integer
- Description
The offset to use to retrieve the documents from the Data Source, for paging.
- Name
limit
- Type
- integer
- Description
The maximum number of documents to retrieve from the Data Source, for paging.
Request
curl "https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/documents?offset=0&limit=10" \
-H "Authorization: Bearer sk-..."
Response
{
"documents": [
{
"data_source_id": "foo",
"created": 1679447719555,
"document_id": "acme-report",
"timestamp": 1679447719555,
"tags": [],
"source_url": null,
"hash": "1651c5e63b6d6...2ae3acd0",
"text_size": 13,
"chunk_count": 1,
"chunks": []
}, {
"data_source_id": "foo",
"created": 1679447275024,
"document_id": "top-secret-document",
"timestamp": 1679447275024,
"tags": [],
"source_url": null,
"hash": "1eebbe66ac93c...47548fcd",
"text_size": 21,
"chunk_count": 1,
"chunks": []
}],
"total": 2
}
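The offset and limit parameters can be combined to page through a whole Data Source. The sketch below assumes that total in the response is the number of documents in the Data Source, which the example response suggests but this page does not state. The names iter_documents and fetch_page are hypothetical; fetch_page stands for any function that performs the GET request above and returns the parsed JSON.

```python
from typing import Callable, Iterator


def iter_documents(
    fetch_page: Callable[[int, int], dict], limit: int = 10
) -> Iterator[dict]:
    """Yield every document by requesting successive pages of size `limit`."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        documents = page["documents"]
        yield from documents
        offset += len(documents)
        # Stop on an empty page or once the assumed total has been reached.
        if not documents or offset >= page["total"]:
            break


# Stand-in for the real HTTP call, serving five fake documents.
FAKE_DOCUMENTS = [{"document_id": f"doc-{i}"} for i in range(5)]


def fake_fetch_page(offset: int, limit: int) -> dict:
    return {
        "documents": FAKE_DOCUMENTS[offset:offset + limit],
        "total": len(FAKE_DOCUMENTS),
    }


all_ids = [doc["document_id"] for doc in iter_documents(fake_fetch_page, limit=2)]
```

Injecting fetch_page keeps the paging logic independent of the HTTP client and easy to test.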
Search Documents
This endpoint enables you to perform a semantic search on your Data Source's documents.
URL attributes
- Name
workspace_id
- Type
- string
- Description
The ID of the Data Source's workspace (can be found in the Data Source's URL)
- Name
data_source_name
- Type
- string
- Description
The name of the Data Source you want to perform your search against.
Query parameters
- Name
query
- Type
- string
- Description
The search query text.
- Name
top_k
- Type
- integer
- Description
The maximum number of search elements to return.
- Name
full_text
- Type
- boolean
- Description
Whether or not to return the full text associated with the matching chunks' documents.
- Name
timestamp_lt
- Type
- number
- Description
Filter documents whose creation timestamp is earlier than this value.
- Name
timestamp_gt
- Type
- number
- Description
Filter documents whose creation timestamp is later than this value.
- Name
tags_in
- Type
- array[string]
- Description
Filter documents that have one of these tags attached. You can specify multiple
tags_in
parameters in the search query.
- Name
tags_not
- Type
- array[string]
- Description
Exclude documents tagged with one of these tags from the search results. You can specify multiple
tags_not
parameters in the search query.
Optional Query parameters
Request
curl "https://dust.tt/api/v1/w/3e26b0e764/data_sources/foo/search?query=secret&top_k=10&full_text=true&tags_in=foo&tags_in=bar&tags_not=baz&tags_not=zab" \
-H "Authorization: Bearer sk-..."
Response
{
"error": null,
"response": {
"documents": [
{
"data_source_id": "foo",
"created": 1680101106398,
"document_id": "5",
"timestamp": 1680101106398,
"tags": [],
"source_url": null,
"hash": "9933f078502b2...e8b849da",
"text_size": 17,
"chunk_count": 1,
"chunks": [
{
"text": "Top secret document...",
"hash": "39e28bdc52f7b...20638967",
"offset": 0,
"vector": null,
"score": 0.7754930853843689
}
],
"text": "Full text of our Top secret document..."
}
]
}
}
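The search request passes tags_in and tags_not several times in the query string. A Python sketch of building such a URL with the standard library follows; the function name search_url is hypothetical, and the lowercase true/false encoding of full_text is inferred from the curl example above.

```python
from typing import Sequence
from urllib.parse import quote, urlencode

BASE_URL = "https://dust.tt/api/v1"


def search_url(
    workspace_id: str,
    data_source_name: str,
    query: str,
    top_k: int,
    full_text: bool = False,
    tags_in: Sequence[str] = (),
    tags_not: Sequence[str] = (),
) -> str:
    """Build the search URL, repeating tags_in / tags_not once per tag."""
    params = [
        ("query", query),
        ("top_k", str(top_k)),
        ("full_text", "true" if full_text else "false"),
    ]
    # A list of pairs lets the same parameter name appear several times.
    params += [("tags_in", tag) for tag in tags_in]
    params += [("tags_not", tag) for tag in tags_not]
    base = (
        f"{BASE_URL}/w/{quote(workspace_id, safe='')}"
        f"/data_sources/{quote(data_source_name, safe='')}/search"
    )
    return f"{base}?{urlencode(params)}"


url = search_url(
    "3e26b0e764", "foo", "secret", 10,
    full_text=True, tags_in=["foo", "bar"], tags_not=["baz", "zab"],
)
```

With these arguments the function reproduces the URL of the curl request above.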