Extract data
Extract data from unstructured sources in a structured manner and feed it to LLMs
Overview
The Extract Data action enables assistants to extract structured information from a large window of data coming from your data sources over a specified time frame.
A large language model is run on up to 500k tokens (roughly the length of a 1,000-page book) and emits structured pieces of information following a schema of your design.
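As a rough back-of-the-envelope check, a 500k-token budget compared to a 1,000-page book implies about 500 tokens per page. The sketch below uses that figure to estimate whether a batch of documents would fit in a single run. The helper names and the 4-characters-per-token heuristic are assumptions made for illustration and are not part of the product; actual token counts depend on the model's tokenizer.

```python
# Rough sizing sketch. MAX_TOKENS and PAGES_EQUIVALENT come from the text;
# CHARS_PER_TOKEN is a common rule of thumb for English text (assumption).
MAX_TOKENS = 500_000
PAGES_EQUIVALENT = 1_000
CHARS_PER_TOKEN = 4


def tokens_per_page() -> float:
    """Tokens per page implied by the '500k tokens ~ 1,000 pages' comparison."""
    return MAX_TOKENS / PAGES_EQUIVALENT


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounding up (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_budget(documents: list[str]) -> bool:
    """True if the estimated total stays within the 500k-token budget."""
    return sum(estimate_tokens(d) for d in documents) <= MAX_TOKENS
```

Treat the result as an order-of-magnitude estimate only.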
This can be used to:
- Generate daily or weekly highlights from Slack channels or mailing lists.
- Extract quantitative data from a stream of unstructured interactions, such as customer support tickets.
This action complements the “Search” and “Most Recent Data” actions:
- The “Search” action performs semantic search. It does not guarantee that the assistant processes all of the data, and it cannot extract information in a structured manner according to a custom schema definition.
- The “Most Recent Data” action is limited by the context size of the model. It cannot process large amounts of data from data sources and is generally meant for low volumes of recent information.
The output of the Extract Data action is a list of objects following the schema definition you configured, which is passed to the assistant to answer your question.
Configuring the Extract Data action consists of the following steps:
- Configure the data sources you want to process
- Select a time frame
- Define or generate a schema to extract
Configure the Extract Data action
Configure the Data Sources to Process
This is very similar to the “Search” action. Select the data source you want to process, typically a Slack channel, a folder you feed by API, or Intercom tickets.
Select a Time Frame
After defining the action's data sources, provide the time frame to use.
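To make the effect of a time frame concrete, the sketch below keeps only the documents whose timestamp falls within a trailing window (for example, the last 7 days). The `Document` type and the `within_time_frame` helper are hypothetical names used for illustration; the product applies the time frame for you, and its internals are not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Document:
    """Hypothetical stand-in for an item coming from a data source."""
    source: str
    timestamp: datetime
    text: str


def within_time_frame(
    docs: list[Document], now: datetime, window: timedelta
) -> list[Document]:
    """Return the documents dated between `now - window` and `now`, inclusive."""
    start = now - window
    return [d for d in docs if start <= d.timestamp <= now]
```

A weekly digest would then call `within_time_frame(docs, now, timedelta(days=7))`.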
Define or Generate a Schema to Extract
Finally, define the schema used by the large language model to extract pieces of information from the data sources and time frame you defined. You can:
- Define the schema manually by using “Add property”.
- Get your schema generated automatically by an assistant by clicking on “Generate”. The generation is based on your assistant’s instructions. The generated schema can be edited.
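As an illustration of what a schema might capture, the sketch below models one as a list of named, typed properties and checks whether an extracted record conforms to it. The property names are guesses inspired by the sales highlights example later on this page; the `Property` class, the `conforms` helper, and the schema representation itself are assumptions and do not reflect the product's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """Hypothetical schema property: a name, an expected Python type, a description."""
    name: str
    py_type: type
    description: str


# Hypothetical schema for highlights extracted from sales emails.
SALES_SCHEMA = [
    Property("company", str, "Company the interaction is about, or 'N/A'"),
    Property("contact", str, "Internal point of contact"),
    Property("highlight", str, "One-sentence highlight"),
    Property("issues", list, "Issues raised, if any"),
    Property("requirements", list, "Requirements raised, if any"),
]


def conforms(record: dict, schema: list[Property]) -> bool:
    """True if the record has exactly the schema's keys with matching types."""
    expected = {p.name: p.py_type for p in schema}
    if set(record) != set(expected):
        return False
    return all(isinstance(record[name], t) for name, t in expected.items())
```

Keeping properties small and precisely described tends to make extracted objects easier to aggregate afterwards.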
Using an Assistant based on Extract Data
Using an assistant based on Extract Data is similar to using any other assistant. Because these assistants are generally thoroughly prompted, you often only need to call them.
Given the intensive work involved with the Extract Data action, expect their execution to take more time than usual.
The output presented to the assistant to generate the answer can be introspected. As an example, the instructions of the assistant above (with the schema visible in the screenshots of the previous section) are the following:
Context: Dust (the company that owns that list) is a platform to build custom and safe assistants on top of their users' company data (Notion, Slack, Gdrive, ....). Assistants can be easily packaged by users on top of their company data and interacted with through a conversational interface.
Highlights have been extracted from emails on our internal mailing list "sales-archive" which is used to cc internally emails related to our sales efforts.
Based on these highlights, generate a structured summary of all the interactions. Avoid broad statements or general facts. Be extremely concise and brief: this content is targeted at internal employees of Dust who want to save time and get the information rapidly without reading all the emails.
If there are multiple interactions for the same company, merge them together. If an interaction has company "N/A", use "Dust Internal" as the company name.
### `sales-archive`
{{ FOREACH company }}
#### {{Company Name}} {{pipeline_qualification}}
- Contact: {{Dust point of contact}}
- **Highlight**: {{1 sentence highlight}}
{{IF there are issues}}
- **Issues**: (if any)
  - {{ list of issues}}
{{ENDIF}}
{{IF there are requirements}}
- **Requirements**: (if any)
  - {{ list of requirements}}
{{ENDIF}}
{{ ENDFOR company }}
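To show the shape of the answer these instructions ask for, here is a minimal sketch that merges extracted highlights by company, maps "N/A" to "Dust Internal", and renders the template. In practice the assistant's language model performs this step from the prompt; the code is only an illustration. The record field names are assumptions, and joining highlights with a space is a simplification of the one-sentence summary the model would write.

```python
def merge_by_company(highlights: list[dict]) -> dict:
    """Group extracted highlights by company, preserving first-seen order."""
    merged: dict = {}
    for h in highlights:
        company = "Dust Internal" if h["company"] == "N/A" else h["company"]
        entry = merged.setdefault(company, {
            "pipeline_qualification": h["pipeline_qualification"],
            "contact": h["contact"],
            "highlights": [],
            "issues": [],
            "requirements": [],
        })
        entry["highlights"].append(h["highlight"])
        entry["issues"].extend(h.get("issues", []))
        entry["requirements"].extend(h.get("requirements", []))
    return merged


def render(merged: dict) -> str:
    """Render the merged entries following the template's structure."""
    lines = ["### `sales-archive`"]
    for company, e in merged.items():
        lines.append(f"#### {company} {e['pipeline_qualification']}")
        lines.append(f"- Contact: {e['contact']}")
        lines.append(f"- **Highlight**: {' '.join(e['highlights'])}")
        # Optional sections are emitted only when they have content.
        for title, key in (("Issues", "issues"), ("Requirements", "requirements")):
            if e[key]:
                lines.append(f"- **{title}**:")
                lines.extend(f"  - {item}" for item in e[key])
    return "\n".join(lines)
```

Calling `render(merge_by_company(records))` on a list of extracted records yields one section per company, with the issues and requirements blocks omitted when empty.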