Include data

With Dust, you can ask your assistants to search through all of your selected Data Sources and pick the most relevant documents to answer a question. This is the “Search” tool (cf. Understanding Retrieval Augmented Generation (RAG) in Dust).

But sometimes, the context your assistant needs to answer your query is an exhaustive set of the most recent data you have somewhere. Enter "Most Recent Data": this tool lets your assistant take into account only the most recent documents available in the selected data sources.


Here's how it works:


Takes only most recent data into account…what does this mean?

When this method is activated, the assistant retrieves the latest documents from your data source in reverse chronological order, until the assistant's context window is filled.

Let’s take an example:

Imagine you are creating an assistant whose data source is a Slack channel. This is how the assistant will gather information to generate answers to your questions:

  1. It looks at the Slack channel and picks the latest message sent.
  2. Then it picks the message sent just before, then the one before that, and so on, without filtering anything out.
  3. It continues until it reaches the limit of what it can process at once (the method’s context window, explained below).
  4. Finally, it generates a response to the question you asked, taking into account only the messages picked in the previous steps.
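To make these steps concrete, here is a minimal Python sketch of the same idea. It is not Dust's actual implementation: `Message`, `count_tokens` and `pick_most_recent` are hypothetical names, the budget of 10 tokens is invented for the example, and counting whitespace-separated words is only a rough stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    sent_at: datetime
    text: str


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word (real tokenizers differ).
    return len(text.split())


def pick_most_recent(messages: list[Message], budget: int) -> list[Message]:
    """Pick messages newest-first, without filtering, until the budget is full."""
    picked = []
    used = 0
    for message in sorted(messages, key=lambda m: m.sent_at, reverse=True):
        cost = count_tokens(message.text)
        if used + cost > budget:
            break  # the window is full: older messages are never seen
        picked.append(message)
        used += cost
    return picked


messages = [
    Message(datetime(2024, 6, 17), "Shipped the new onboarding flow"),
    Message(datetime(2024, 6, 18), "Fixed a bug in search"),
    Message(datetime(2024, 6, 19), "Released table export for assistants"),
]
selected = pick_most_recent(messages, budget=10)
print([m.text for m in selected])
# ['Released table export for assistants', 'Fixed a bug in search']
```

Each message counts as 5 tokens here, so only the two newest fit in the budget. The oldest message is left out purely because of its age, whatever question is asked.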
💡 **What is a context window?**

It refers to the maximum amount of text or information that a language model (like GPT, Claude, Mistral or Gemini) can consider at one time when generating a response.

For example, if a model has a context window of 2048 tokens, it can only "remember" and use the last 2048 tokens (pieces of text, such as words, punctuation marks, or spaces) provided in the conversation. Anything beyond that limit won't influence the model's responses because it simply can't "see" it.

Side note: to smooth the product experience, Dust sets its own specific window for “most recent data”, so it does not exactly match the context window of the assistant’s underlying model.
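As a rough illustration of what "can't see it" means, the sketch below keeps only the last tokens that fit in a window. The function name `visible_tokens` is hypothetical, the window of 3 is chosen only to keep the example short, and splitting on spaces is a simplification of how real models tokenize text.

```python
def visible_tokens(tokens: list[str], window: int) -> list[str]:
    """Return the part of a conversation a model with this window can still use."""
    if window <= 0:
        return []
    # Only the most recent `window` tokens remain; everything earlier is dropped.
    return tokens[-window:]


# Assumption: one token per word, for readability.
conversation = "the launch moved from Monday to Thursday".split()
print(visible_tokens(conversation, window=3))
# ['Monday', 'to', 'Thursday']
```

With this tiny window, the model would no longer know what moved from Monday to Thursday, because the word "launch" falls outside the window.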

When should I use the ‘most recent data’ option?

This method is particularly useful in scenarios where the most current information is crucial: news monitoring, project updates, or any area where exhaustiveness and recent developments matter more than older information.

Example again:

At Dust, we use Most Recent Data to track recently shipped features.

  • Every time something is shipped in the product, the responsible team member sends a message in #shipped, explaining what is new and why we shipped it.
  • Then we use our @WeeklyShipped assistant to create a synthesized table and update the whole team with a short, crisp snapshot every week.

Limitations

  • This method considers the most recent documents first, but it does not rank content by relevance to your query: it takes ALL recent data, without any filter. As a result, a document that is highly relevant to your question can be left out simply because it is older than the documents that filled the context window.
  • The context window can be limiting: depending on the model used and the volume of data you have, you might not be able to retrieve data covering a long period of time.
    When that happens, a “warning” box indicates the date beyond which data cannot be retrieved.
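A back-of-the-envelope calculation can help you anticipate this limit. The sketch below is only an estimate under stated assumptions: the function `days_covered` is hypothetical, and the window size (32,000 tokens), message volumes and average message length are invented numbers, not values used by Dust.

```python
def days_covered(window_tokens: int, messages_per_day: float, tokens_per_message: float) -> float:
    """Estimate how many days of history fit in a window of a given size."""
    tokens_per_day = messages_per_day * tokens_per_message
    return window_tokens / tokens_per_day


# Assumed figures for illustration only.
quiet_channel = days_covered(32_000, messages_per_day=40, tokens_per_message=50)
busy_channel = days_covered(32_000, messages_per_day=400, tokens_per_message=50)
print(quiet_channel)  # 16.0
print(busy_channel)   # 1.6
```

With the same window, the quieter channel reaches back about two weeks, while the channel with ten times the volume covers less than two days.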

When you create an assistant, try different options to check which one suits you best!