Websites

Crawl website content with Dust and add it as a data source

Dust can crawl a public website and make its content available as a data source, so you can build assistants on top of that content.

Setting up the Connection

With an admin or builder role, you can set up new website connections on “Build” > “Websites”.

To fetch the pages of a website, Dust follows the links found in the page at the URL you provide. It does not guess page addresses; it navigates by reading that page and then the pages it links to, as long as they are on the same domain.
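As an illustration of link-based discovery, here is a minimal sketch of how the links on a page could be collected and restricted to the same domain. It is not Dust's implementation; the names `LinkCollector` and `same_domain_links` are hypothetical, and it uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return the absolute links in `html` that stay on the domain of `page_url`."""
    collector = LinkCollector()
    collector.feed(html)
    domain = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links such as "/jobs" against the page URL.
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == domain and absolute not in links:
            links.append(absolute)
    return links
```

A page that is never linked from the starting page, or from any page reachable from it, would not be found by this kind of navigation.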

Dust modal to set a new website connection.

Suppose the website is structured as follows:

http://myfakewebsite.com

  1. http://myfakewebsite.com/articles
    1. http://myfakewebsite.com/articles/article1
    2. http://myfakewebsite.com/articles/article2
  2. http://myfakewebsite.com/jobs
    1. http://myfakewebsite.com/jobs/engineering
    2. http://myfakewebsite.com/jobs/design
  3. http://myfakewebsite.com/product
  4. http://myfakewebsite.com/about

Crawling strategy: All links vs child pages

To crawl the whole website, set the URL to http://myfakewebsite.com and select “Follow all links within the domain”.

To crawl only the articles, set the URL to http://myfakewebsite.com/articles and select “Only child pages of the provided URL”. The crawler then fetches only the pages that contain http://myfakewebsite.com/articles in their URL.
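The difference between the two strategies can be sketched as a URL filter. This is only an illustration under stated assumptions: the function name `should_crawl` and the strategy labels `"all_links"` and `"child_pages"` are hypothetical, and "contains the provided URL" is interpreted here as a prefix check.

```python
from urllib.parse import urlparse


def should_crawl(candidate_url, start_url, strategy):
    """Decide whether a discovered URL is in scope for the chosen strategy."""
    if strategy == "all_links":
        # Any page on the same domain as the starting URL is in scope.
        return urlparse(candidate_url).netloc == urlparse(start_url).netloc
    if strategy == "child_pages":
        # Assumption: a child page is one whose URL starts with the starting URL.
        return candidate_url.startswith(start_url)
    raise ValueError(f"unknown strategy: {strategy}")
```

With the example site, starting from http://myfakewebsite.com/articles in child-pages mode keeps http://myfakewebsite.com/articles/article1 and rejects http://myfakewebsite.com/jobs.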

Indexing a single page

To index only the Engineering page and no other page, set the URL to http://myfakewebsite.com/jobs/engineering and set “Page Limit” to 1.

Advanced setting: “Depth of Search”

This setting answers the question “How many links may the crawler follow to reach a given page?”.
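To make the idea of depth concrete, here is a small sketch of a depth-limited, breadth-first crawl over an in-memory link map. It rests on several assumptions: the link structure in `SITE_LINKS` is invented for the example site (the homepage links to the four sections, and the sections link to their sub-pages), the starting URL counts as depth 0, and `crawl`, `max_depth` and `page_limit` are hypothetical names rather than Dust's actual parameters.

```python
from collections import deque

# Hypothetical link map: each page lists the pages it links to.
SITE_LINKS = {
    "http://myfakewebsite.com": [
        "http://myfakewebsite.com/articles",
        "http://myfakewebsite.com/jobs",
        "http://myfakewebsite.com/product",
        "http://myfakewebsite.com/about",
    ],
    "http://myfakewebsite.com/articles": [
        "http://myfakewebsite.com/articles/article1",
        "http://myfakewebsite.com/articles/article2",
    ],
    "http://myfakewebsite.com/jobs": [
        "http://myfakewebsite.com/jobs/engineering",
        "http://myfakewebsite.com/jobs/design",
    ],
}


def crawl(start_url, links, max_depth, page_limit=None):
    """Return the pages reachable from `start_url` within `max_depth` links."""
    visited = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not follow links out of pages at the maximum depth.
        for linked in links.get(url, []):
            if page_limit is not None and len(visited) >= page_limit:
                return visited
            if linked not in visited:
                visited.append(linked)
                queue.append((linked, depth + 1))
    return visited
```

Under these assumptions, a depth of 1 from the homepage reaches the four section pages but not the individual articles or job pages, while a depth of 2 reaches all nine pages.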

Limitations

URL must be Public

If a login is required to access the website, Dust will not be able to access its content.

Content must be visible to Dust (Server-side rendering)

Currently, Dust only gathers content that is rendered server-side. If you submit a website and Dust indexes only the first page, and that page appears empty, the pages are most likely built on the fly in your browser with JavaScript (client-side rendering).
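One rough way to check a page yourself is to look at the raw HTML the server returns, before any JavaScript runs, and see whether it contains readable text. The sketch below is a heuristic and not Dust's own check; `VisibleTextExtractor` and `visible_text` are hypothetical names, and the list of skipped tags is an assumption.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that appears outside script, style and similar tags."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the readable text found in raw, unrendered HTML."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)
```

If `visible_text` returns an empty or near-empty string for the HTML you download from a page (for instance with `urllib.request.urlopen`), the content is probably rendered client-side and a crawler that does not execute JavaScript would see little or nothing.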

