Core Blocks

An overview of Dust's core blocks. You'll learn about their functionality and usage.

Dust apps are composed of Blocks which are executed sequentially. Each block produces outputs and can refer to the outputs of previously executed blocks. We invite you to review the sections on Inputs and Execution to better understand the execution logic of a Dust app.

For each block we describe its specification parameters and configuration parameters. Please refer to the Blocks section for more details on the difference between the two.


Input block

The input block is the entry point to a Dust app and receives the arguments on which the app is executed, in the form of JSON objects. During the design phase of the app it is associated with a Dataset of examples used to iterate on the app. When deployed by API, the input block receives the arguments passed to the app by API.

Refer to the Inputs section for more details on how the input block forks the execution of the app in independent execution streams. When run by API, the dataset to which the input block is associated is ignored and replaced by the arguments passed by API.
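
For illustration, a dataset associated with the input block could contain JSON objects such as the following (the field name is hypothetical); each object spawns its own independent execution stream:

{"question": "What is the capital of France?"}
{"question": "What is the capital of Italy?"}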

Specification

  • Name
    dataset
    Type
    dataset
    Description

    The dataset of examples to be executed by the app during the design phase. Ignored and replaced by the arguments passed when run by API.


Data block

The data block outputs the dataset it is associated with. Unlike the input block, it does not fork the execution of the app and simply returns the entire dataset as an array of JSON objects.

The data block is generally used in conjunction with an llm block to output a dataset of few-shot examples to prompt models.
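
For example, a data block named EXAMPLES associated with the following dataset (hypothetical content) would output an array of these two objects, accessible from subsequent blocks as env.state.EXAMPLES or referenced in templated prompts as EXAMPLES:

{"english": "Hello", "french": "Bonjour"}
{"english": "Thank you", "french": "Merci"}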

Specification

  • Name
    dataset
    Type
    dataset
    Description

    The dataset of examples to be returned by the block as an array.


Code block

The code block executes the Javascript code provided by the user. The code must define a function _fun which takes an env variable as input. _fun is executed when the block is run by the app. code blocks are generally used to glue other blocks together by post-processing the outputs of previously executed blocks.
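
As a minimal sketch, assuming a previously executed block named EXAMPLES whose output is an array, a code block could look like this:

_fun = (env) => {
  // env.state.EXAMPLES holds the output of the previously executed EXAMPLES block.
  // The returned object becomes this block's output.
  return { count: env.state.EXAMPLES.length };
}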

Specification

  • Name
    code
    Type
    javascript
    Description

    The code to be executed. Exposed as a function named _fun taking an env variable as argument.

Properties of the env variable

  • Name
    state
    Type
    object
    Description

    An object whose fields are the names of previously executed blocks and whose values are the outputs of the associated blocks. The output of a previously executed EXAMPLE block can be accessed as env.state.EXAMPLE.

  • Name
    input
    Type
    object
    Description

    An object with fields input and index. input is the input object of the current execution stream, or null if no input block has been executed yet. index is the index of the current input object in the dataset associated with the input block.

  • Name
    map
    Type
    optional object
    Description

    An optional object set only in the context of a block being executed as part of a map reduce pair. If set, contains a field name set to the name of the current map block and a field iteration which is the index of the element being executed as part of the map reduce.

  • Name
    config
    Type
    object
    Description

    The configuration with which the app is currently run. An object whose keys are block names and values are the associated configuration values.


LLM block

The llm block provides a standardized interface to large language models from multiple providers (OpenAI, Cohere, AI21, ...). It also provides automatic caching mechanisms and a rich templating language (based on Tera, similar to Jinja2) to construct prompts.

It is used to issue completion requests to large language models as part of an app execution.

Specification

  • Name
    prompt
    Type
    string
    Description

    The prompt to generate completions. The prompt can be templated using the Tera templating language (see below for more details on templating).

  • Name
    temperature
    Type
    float
    Description

    The temperature to use when sampling from the model. A higher temperature (e.g. 1.0) results in more diverse completions, while a lower temperature (e.g. 0.0) results in more conservative completions.

  • Name
    max_tokens
    Type
    integer
    Description

    The maximum number of tokens to generate from the model. A token can very broadly be seen as a word's worth of content. The model may decide to stop generation before reaching max_tokens. For OpenAI models, we support -1 as a special value, specifying that we want the model to generate as many tokens as possible given its context size. Using -1 with a model with a context size of 2048 and a prompt consuming 1024 tokens is equivalent to specifying a max_tokens of 1024.

  • Name
    stop
    Type
    []string
    Description

    An array of strings that should interrupt completion when sampled from the model.

  • Name
    frequency_penalty
    Type
    float
    Description

    Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

  • Name
    presence_penalty
    Type
    float
    Description

    Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

  • Name
    top_p
    Type
    float
    Description

    An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

  • Name
    top_logprobs
    Type
    integer
    Description

    Include the log probabilities of the top_logprobs most likely tokens, as well as the chosen token. For example, if top_logprobs is 5, the API will return a list of the 5 most likely tokens' probabilities. Useful for classification tasks.

Configuration

  • Name
    provider_id
    Type
    string
    Description

    The model provider to use. One of openai, cohere, ai21.

  • Name
    model_id
    Type
    string
    Description

    The model to use from the provider specified by provider_id.

  • Name
    temperature
    Type
    float
    Description

    An override for the temperature to use when sampling from the model. See the specification section for more details.

  • Name
    use_cache
    Type
    bool
    Description

    Whether to rely on the automated caching mechanism of the llm block. If set to true and a previous request was made with the same specification and configuration parameters, then the cached response is returned.

  • Name
    use_stream
    Type
    bool
    Description

    Only applies when running an app by API. Whether to stream tokens as they are emitted by the model. Currently only supported when provider_id is set to openai.

Prompt templating

The prompt field of the llm block can be templated using the Tera templating language. This is particularly useful to construct few-shot prompts for models. Here's an example of a templated model prompt:

{% for e in EXAMPLES %}
EN: {{e.english}}
FR: {{e.french}}
{% endfor %}
EN: {{INPUT.english}}
FR:

In this example, for each object in the output of the EXAMPLES block (which is an array of JSON objects), the template creates a line with the English sentence and its French translation. It then adds a final line with the english field of the INPUT block output. A fully functional example app relying on this prompt can be found here.
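
For instance, assuming the EXAMPLES block outputs [{"english": "Hello", "french": "Bonjour"}] and the current input is {"english": "Good morning"}, the rendered prompt would be (up to whitespace):

EN: Hello
FR: Bonjour
EN: Good morning
FR: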

The Tera templating language supports a variety of constructs including loops and variable replacement. Please refer to the Tera documentation for more details.

Support for chat-based models (chatGPT)

We also support chat-based models such as OpenAI's gpt-3.5-turbo (Chat API) and gpt-4. When these models are used as part of the llm block, max_tokens and top_logprobs are ignored, and the content of the templated prompt is passed as the initial message with the user role. The block response is the content of the assistant role message returned by the model. Note that logprobs and tokens are not available with these models.


Chat block

The chat block provides a standardized interface to chat-based large language models such as OpenAI's gpt-3.5-turbo (Chat API) and gpt-4. It is similar in many respects to an llm block: it has templated instructions (see LLM block above) that serve as the initial prompt, and it exposes a messages code block to output previous messages (generally passed to the app as input).

It is used to build chat-based experiences where the model is exposed with previous interactions and generates a new message in response.

Specification

  • Name
    instructions
    Type
    string
    Description

    The instructions are passed to the model as the initial system role message (see OpenAI's Chat guide). The instructions can be templated using the Tera templating language (see above for more details on templating).

  • Name
    messages_code
    Type
    javascript
    Description

    A Javascript function _fun which takes a single argument env (see the Code block documentation for details) and returns an array (possibly empty) of messages (you generally want at least one user role message). Messages should be objects with two fields, role and content, both with string values. Possible values for role are user, assistant and system (see OpenAI's Chat guide). A minimal sketch is shown after this list.

  • Name
    temperature
    Type
    float
    Description

    The temperature to use when sampling from the model. A higher temperature (e.g. 1.0) results in more diverse completions, while a lower temperature (e.g. 0.0) results in more conservative completions.

  • Name
    stop
    Type
    []string
    Description

    An array of strings that should interrupt completion when sampled from the model.

  • Name
    max_tokens
    Type
    integer
    Description

    The maximum number of tokens to generate from the model for the next message. A token can very broadly be seen as a word's worth of content. The model may decide to stop generation before reaching max_tokens.

  • Name
    frequency_penalty
    Type
    float
    Description

    Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

  • Name
    presence_penalty
    Type
    float
    Description

    Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

  • Name
    top_p
    Type
    float
    Description

    An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

  • Name
    functions_code
    Type
    javascript
    Description

    A Javascript function _fun which takes a single argument env (see the Code block documentation for details) and returns an array (possibly empty) of function specification objects. Only available for selected OpenAI models. See OpenAI's function calling guide for more details.
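
As a minimal sketch of a messages_code function, assuming the app's input block is named INPUT and passes a messages array (a hypothetical field, used here for illustration):

_fun = (env) => {
  // Return the messages passed to the app as input, or a default user message.
  // The `messages` field on the input block's output is an assumption.
  const messages = env.state.INPUT.messages;
  return messages && messages.length > 0
    ? messages
    : [{ role: "user", content: "Hello!" }];
}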

Configuration

  • Name
    provider_id
    Type
    string
    Description

    The model provider to use. One of openai, cohere, ai21.

  • Name
    model_id
    Type
    string
    Description

    The model to use from the provider specified by provider_id.

  • Name
    temperature
    Type
    float
    Description

    An override for the temperature to use when sampling from the model. See the specification section for more details.

  • Name
    use_cache
    Type
    bool
    Description

    Whether to rely on the automated caching mechanism of the chat block. If set to true and a previous request was made with the same specification and configuration parameters, then the cached response is returned.

  • Name
    use_stream
    Type
    bool
    Description

    Only applies when running an app by API. Whether to stream tokens as they are emitted by the model. Currently only supported when provider_id is set to openai.

  • Name
    function_call
    Type
    string
    Description

    If functions_code returns a list of function specifications, function_call lets you influence how the model decides whether to use a function or not. Possible values are auto, none, or one of your functions' names (forcing the call of that function). See OpenAI's function calling guide for more details. Only available for selected OpenAI models. Unlike OpenAI's API, we take the function name directly instead of an object.


Map Reduce blocks

The map and reduce blocks are used to execute the set of blocks in between them on each element of an array, in parallel. The map block takes a block name as its from specification argument. If the output of the block referred to is an array, the execution stream will fork on each element of the array. If the output is an object, you can use the repeat specification argument to map on the same element repeat times.

The reduce block does not take any argument and has the same name as its associated map block. After executing a reduce block, each output of the blocks in between the map and reduce blocks will be collected in an array accessible from any subsequent block.

Assume we have a MAPREDUCE map block whose from specification argument points to a block whose output is an array of length 4. Assume it is followed by a DUMMY code block and an associated MAPREDUCE reduce block, and finally a FINAL code block. Also assume that the output of the DUMMY block is a simple { "foo": "bar" } object. The DUMMY code block will be executed 4 times in parallel, and as we execute the FINAL block, env.state.DUMMY will contain an array of 4 { "foo": "bar" } objects.
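
To make this concrete, here is a minimal sketch of what the FINAL code block from the example above could look like:

_fun = (env) => {
  // After the reduce, env.state.DUMMY is an array of 4 { "foo": "bar" } objects.
  return { count: env.state.DUMMY.length, first: env.state.DUMMY[0].foo };
}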

Specification

  • Name
    from
    Type
    string
    Description

    The name of the block whose output to map on. If the output is an array, the execution stream will fork on each element of the array. If the output is an object, the repeat specification argument must be specified.

  • Name
    repeat
    Type
    integer
    Description

    Only valid if the output of the from block is an object. The number of times to repeat the execution of the blocks in between the map and reduce blocks on the same object.


While End blocks

The while and end blocks are used to execute the set of blocks in between them sequentially until a termination condition is met. The while block takes a condition argument expecting code to be executed at each iteration of the loop. The code must return a boolean value. If the value is true the loop continues, otherwise it stops.

The end block does not take any argument and has the same name as its associated while block. After executing the end block, each output of the blocks in between the while and end blocks will be collected in an array accessible from any subsequent block (see the Map Reduce blocks documentation for more details on this behavior).
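
As a minimal sketch of a condition function, assuming a code block named ITER inside the loop whose output carries a boolean done field (hypothetical names, for illustration only):

_fun = (env) => {
  // Continue looping until the ITER block reports it is done.
  // The guard handles the first evaluation, before ITER has run.
  return !(env.state.ITER && env.state.ITER.done);
}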

Specification

  • Name
    condition
    Type
    string
    Description

    A Javascript function _fun which takes a single argument env (see the Code block documentation for details) and returns a boolean value. If the value is true the loop continues, otherwise it stops.

  • Name
    max_iterations
    Type
    integer
    Description

    The maximum number of iterations to execute the blocks in between the while and end blocks. Must be set and has a maximum value of 32.


DataSource block

The data_source block provides an interface to query a Data Source and return chunks that are semantically similar to the query provided. When executed, the query is embedded using the embedding model set on the Data Source and the resulting embedding vector is used to perform semantic search on the Data Source's chunks.

The output of the data_source block is a list of Document objects. Each document may include one or more Chunks (only those that were returned from the search). The documents are sorted in decreasing order of the maximum score of their retrieved chunks. Please refer to the Data Sources overview for more details about how documents are chunked to enable semantic search.
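
As a minimal sketch, a subsequent code block could assemble the retrieved chunks into a single context string (assuming a data_source block named DATASOURCE, and that each Document object exposes its retrieved Chunks under a chunks field with a text property; these field names are assumptions):

_fun = (env) => {
  // Concatenate the text of all retrieved chunks, document by document.
  return {
    context: env.state.DATASOURCE
      .map((d) => d.chunks.map((c) => c.text).join("\n"))
      .join("\n\n"),
  };
}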

Specification

  • Name
    query
    Type
    string
    Description

    The query to embed and use to perform semantic search against the data source. The query can be templated using the Tera templating language (see the LLM block documentation for more details on templating).

  • Name
    full_text
    Type
    boolean
    Description

    Whether to return the full text of the retrieved documents (in addition to each chunk text). If true, each returned Document object will have a text property containing the full text of the document.

Configuration

  • Name
    top_k
    Type
    integer
    Description

    The number of chunks to return. The resulting number of document objects may be smaller if multiple chunks from the same document are among the top_k retrieved chunks.

  • Name
    data_sources
    Type
    []{workspace_id, data_source_id}
    Description

    An array of objects representing the Data Sources to query. Note that the Dust interface currently only supports adding one Data Source to a block, but you can pass many by API. The objects must have the following properties: workspace_id, the id of the workspace that owns the Data Source, and data_source_id, the name of the Data Source.
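
    For example (the values below are placeholders):

    [
      {"workspace_id": "WORKSPACE_ID", "data_source_id": "DATA_SOURCE_NAME"}
    ]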

  • Name
    filter
    Type
    optional {tags: {in, not}, timestamp: {gt, lt}}
    Description

    An object representing the filters to apply to the query. The tags field is an object with properties in and not, both arrays of strings (the tags to filter on). The documents' tags must match at least one of the tags specified in in and none of the ones specified in not. The timestamp field is an object with properties gt and lt. The query will only return chunks from documents whose timestamp is greater than gt and lower than lt. The timestamp values are represented as epoch in ms. The filter configuration is optional, and each of the fields in tags or timestamp is also optional. Example filter:

    {
      "tags": {"in": ["tag1", "tag2"], "not": null},
      "timestamp": {"gt": 1675215950729, "lt": 1680012404017}
    }
    

    The tags.in and tags.not values can be templated using the Tera templating language (see the LLM block documentation for more details on templating).