Websites
Browse website content on Dust and add it as a data source
Dust can crawl a public website and make its content accessible from your data sources, meaning you can build agents based on this content.
Setting up the Connection
With an admin or builder role, you can set up new website connections under “Build” > “Websites”.
To fetch the pages of a website, Dust follows the links present inside the page provided in the URL field. It does not guess any pages; instead, it “navigates” by reading the content of the provided page and the pages it links to (as long as they are on the same domain).
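Dust's actual crawler is not public, but the same-domain rule described above can be sketched as follows. This is a hypothetical illustration using only the Python standard library: it collects links from a page and keeps only those on the same domain as the starting URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href links from an HTML page, keeping only same-domain ones."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        # Keep the link only if it lives on the same domain as the start URL.
        if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
            self.links.append(absolute)

page = '<a href="/articles">Articles</a> <a href="https://other.com/x">External</a>'
collector = LinkCollector("http://myfakewebsite.com")
collector.feed(page)
print(collector.links)  # ['http://myfakewebsite.com/articles']
```

The external link to other.com is dropped, which is why pages hosted elsewhere never enter the crawl.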

Dust modal to set a new website connection.
Let’s say we have a website structured as follows:
http://myfakewebsite.com
http://myfakewebsite.com/articles
http://myfakewebsite.com/articles/article1
http://myfakewebsite.com/articles/article2
http://myfakewebsite.com/jobs
http://myfakewebsite.com/jobs/engineering
http://myfakewebsite.com/jobs/design
http://myfakewebsite.com/product
http://myfakewebsite.com/about
Crawling strategy: All links vs children pages
If you’d like the whole website crawled, set the URL to http://myfakewebsite.com
and select “Follow all links within the domain”.
If you’d like only the articles crawled, set the URL to http://myfakewebsite.com/articles
and select “Only child pages of the provided URL”, which ensures the crawler fetches only the pages that contain http://myfakewebsite.com/articles
in their URL.
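In effect, the “Only child pages” strategy is a prefix filter on page URLs. A minimal sketch, assuming the filter is a simple string-prefix check (a hypothetical simplification of what Dust actually does):

```python
def is_child_page(candidate_url: str, root_url: str) -> bool:
    """A page is a child page if its URL starts with the provided URL."""
    return candidate_url.startswith(root_url)

pages = [
    "http://myfakewebsite.com/articles",
    "http://myfakewebsite.com/articles/article1",
    "http://myfakewebsite.com/jobs/engineering",
]
kept = [p for p in pages if is_child_page(p, "http://myfakewebsite.com/articles")]
print(kept)  # the two /articles pages; /jobs/engineering is filtered out
```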
Indexing a single page
If you want only the Engineering page indexed and no other page, set the URL to http://myfakewebsite.com/jobs/engineering
and set “Page Limit” to 1.
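The effect of the page limit can be sketched as a crawl loop that stops once it has fetched the allowed number of pages. This is a hypothetical simplification, not Dust's implementation:

```python
def crawl_with_limit(start_url: str, page_limit: int) -> list[str]:
    """Fetch pages breadth-first, stopping once page_limit pages are fetched."""
    fetched: list[str] = []
    to_visit = [start_url]
    while to_visit and len(fetched) < page_limit:
        fetched.append(to_visit.pop(0))
        # ...link discovery would append more URLs to to_visit here...
    return fetched

# With a limit of 1, only the provided URL itself is indexed.
print(crawl_with_limit("http://myfakewebsite.com/jobs/engineering", 1))
```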
Advanced setting: “Depth of Search”
This setting controls how many links the crawler is allowed to follow from the provided URL to reach a given page. For example, with a depth of 2, a page is fetched only if it is reachable within two clicks from the starting URL.
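The depth limit can be pictured as a breadth-first traversal that stops following links past a given distance from the start URL. Below is a minimal sketch with a hypothetical, hard-coded link graph mirroring the example site above; Dust's real crawler fetches live pages instead.

```python
from collections import deque

# Hypothetical link graph: which pages each page links to.
links = {
    "http://myfakewebsite.com": [
        "http://myfakewebsite.com/articles",
        "http://myfakewebsite.com/jobs",
    ],
    "http://myfakewebsite.com/articles": ["http://myfakewebsite.com/articles/article1"],
    "http://myfakewebsite.com/jobs": ["http://myfakewebsite.com/jobs/engineering"],
}

def crawl(start: str, max_depth: int) -> set[str]:
    """Breadth-first crawl that stops following links past max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't follow links from pages at the depth limit
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# With a depth of 1, only pages one click away from the start are fetched,
# so /articles/article1 and /jobs/engineering are excluded.
print(sorted(crawl("http://myfakewebsite.com", 1)))
```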
PDFs
If your PDFs are hosted under the URL you are crawling, they will be included.
Google Docs
If you enter a Google Docs URL directly, it will be included. Note that if Google Docs are linked from a website, there is a good chance they live on a different domain, so they will not be included.
Limitations
URL must be Public
If a login is required to access the website, Dust will not be able to access its content.
Blocked websites
Some websites block crawling. Here is a non-exhaustive list:
- reddit.com
- linkedin.com
- instagram.com
- x.com
- tiktok.com