Websites
Browse website content in Dust and add it as a data source
Dust can crawl a public website and make its content available in your data sources, so you can build assistants based on that content.
Setting up the Connection
With an admin or builder role, you can set up new website connections under “Build” > “Websites”.
To fetch the pages of a website, Dust follows the links found on the page provided in the URL field. It does not guess page addresses. Instead, it “navigates” by reading the content of the provided page and then the pages it links to, as long as they are on the same domain.
For example, suppose a website is structured as follows:
http://myfakewebsite.com
http://myfakewebsite.com/articles
http://myfakewebsite.com/articles/article1
http://myfakewebsite.com/articles/article2
http://myfakewebsite.com/jobs
http://myfakewebsite.com/jobs/engineering
http://myfakewebsite.com/jobs/design
http://myfakewebsite.com/product
http://myfakewebsite.com/about
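The link-following behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Dust's actual crawler: the names `LinkCollector` and `same_domain_links` are hypothetical, and the sketch assumes that "same domain" means an identical host in the URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return the absolute links in `html` that stay on page_url's host.

    Assumption: "same domain" is modeled as an exact host match.
    """
    collector = LinkCollector()
    collector.feed(html)
    domain = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == domain and absolute not in links:
            links.append(absolute)
    return links
```

Given the HTML of the home page, a function like this would return the links to follow next, while links to other domains are dropped. Pages that are never linked from a crawled page are never discovered.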
Crawling strategy: All links vs. child pages
To crawl the whole website, set the URL to http://myfakewebsite.com and select “Follow all links within the domain”.
To crawl only the articles, set the URL to http://myfakewebsite.com/articles and select “Only child pages of the provided URL”. The crawler then fetches only the pages that contain http://myfakewebsite.com/articles in their URL.
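As a rough illustration of how the two strategies differ, the sketch below filters the example site's URLs. The function `in_scope` and the mode names `"all_links"` and `"child_pages"` are hypothetical, and the child-page rule is modeled as a simple prefix check on the URL, which is an assumption about how the setting behaves.

```python
from urllib.parse import urlparse

# The example site from this page.
SITE = [
    "http://myfakewebsite.com",
    "http://myfakewebsite.com/articles",
    "http://myfakewebsite.com/articles/article1",
    "http://myfakewebsite.com/articles/article2",
    "http://myfakewebsite.com/jobs",
    "http://myfakewebsite.com/jobs/engineering",
    "http://myfakewebsite.com/jobs/design",
    "http://myfakewebsite.com/product",
    "http://myfakewebsite.com/about",
]


def in_scope(url, start_url, mode):
    """Decide whether `url` may be crawled under the given strategy."""
    if mode == "all_links":
        # Any page on the same host as the start URL.
        return urlparse(url).netloc == urlparse(start_url).netloc
    if mode == "child_pages":
        # Assumption: a child page is one whose URL begins with the start URL.
        return url.startswith(start_url)
    raise ValueError(f"unknown mode: {mode}")


articles_only = [
    u for u in SITE
    if in_scope(u, "http://myfakewebsite.com/articles", "child_pages")
]
whole_site = [
    u for u in SITE
    if in_scope(u, "http://myfakewebsite.com", "all_links")
]
```

Here `articles_only` holds the three URLs under /articles, while `whole_site` holds all nine pages.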
Indexing a single page
To index only the Engineering page and no other page, set the URL to http://myfakewebsite.com/jobs/engineering and set “Page Limit” to 1.
Advanced setting: “Depth of Search”
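This setting answers the question “How many links do I allow the crawler to follow to find a given page?”. Pages that can only be reached by following more links than that from the provided URL are not crawled.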
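One way to picture the depth setting is a breadth-first walk that stops following links after a fixed number of hops. The sketch below is a simplified model, not Dust's implementation: the `LINKS` graph is a made-up link structure for the example site, `crawl` is a hypothetical function, and it assumes the start page counts as depth 0. It also accepts an optional page limit, mirroring the “Page Limit” setting.

```python
from collections import deque

# Hypothetical link structure: which pages each page links to.
LINKS = {
    "http://myfakewebsite.com": [
        "http://myfakewebsite.com/articles",
        "http://myfakewebsite.com/jobs",
        "http://myfakewebsite.com/product",
        "http://myfakewebsite.com/about",
    ],
    "http://myfakewebsite.com/articles": [
        "http://myfakewebsite.com/articles/article1",
        "http://myfakewebsite.com/articles/article2",
    ],
    "http://myfakewebsite.com/jobs": [
        "http://myfakewebsite.com/jobs/engineering",
        "http://myfakewebsite.com/jobs/design",
    ],
}


def crawl(start_url, max_depth, page_limit=None):
    """Breadth-first crawl following at most `max_depth` links from the start."""
    visited = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the maximum depth
        for link in LINKS.get(url, []):
            if page_limit is not None and len(visited) >= page_limit:
                return visited  # page limit reached
            if link not in visited:
                visited.append(link)
                queue.append((link, depth + 1))
    return visited
```

In this model, `crawl("http://myfakewebsite.com", 1)` reaches the home page and the four pages it links to, but not the individual articles or job pages, which need a depth of 2. Passing `page_limit=1` returns only the start page, whatever the depth.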
Limitations
URL must be Public
If a login is required to access the website, Dust will not be able to access its content.
Content must be visible to Dust (Server-side rendering)
Currently, Dust only gathers content that is rendered server-side. If you submit a website and see that Dust has indexed only the first page and that it seems empty, this usually means the pages are built on the fly in your browser with JavaScript (client-side rendering).
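To get a rough idea of whether a page is rendered server-side, you can look at the text present in its raw HTML, without running any JavaScript. The sketch below is one possible check, not a description of how Dust processes pages. `TextExtractor` and `visible_text` are hypothetical names, and the sample HTML strings are invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from HTML, ignoring <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader without JavaScript would find in `html`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented examples: a server-rendered page and a client-rendered shell.
server_rendered = "<html><body><h1>Jobs</h1><p>We are hiring.</p></body></html>"
client_rendered = (
    '<html><body><div id="root"></div>'
    "<script>render()</script></body></html>"
)
```

With these samples, `visible_text(server_rendered)` yields the page text, while `visible_text(client_rendered)` is an empty string: the content only appears once a browser runs the script, which is the situation described above.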