Training Sources

How to add URLs, PDFs, text, and sitemaps to train your AI agent.

Simple Agent uses Retrieval-Augmented Generation (RAG): the agent doesn't memorize answers. Instead, it retrieves the most relevant passages from your content at query time and uses them to respond. This means that updating a source changes the agent's answers as soon as the source is re-indexed.

Source types

URL (site or page)

Paste any public URL. The crawler extracts the page text and respects robots.txt. Examples:

https://yoursite.com/faq
https://docs.yoursite.com
https://yoursite.com/privacy-policy
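
To illustrate what respecting robots.txt means in practice, here is a minimal sketch using Python's standard urllib.robotparser. The user agent name "ExampleCrawler" and the robots.txt content are hypothetical; the actual user agent sent by the Simple Agent crawler is not stated in this document.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network call is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for illustration only
ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

print(is_allowed(ROBOTS, "ExampleCrawler", "https://yoursite.com/faq"))
print(is_allowed(ROBOTS, "ExampleCrawler", "https://yoursite.com/admin/users"))
```

If a page you expect to be indexed is missing, a check like this against your own robots.txt is a quick way to rule out a Disallow rule.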

How the crawl works:

Downloads the page with JavaScript rendering (headless Chromium)
Extracts main text (ignores navigation, footer, ads)
Splits into chunks of ~800 tokens with 100-token overlap
Generates embeddings and stores them in pgvector
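
The chunking step above can be sketched as a sliding window. This is an illustration under stated assumptions, not the product's implementation: the function name chunk_tokens is hypothetical, and a plain list of strings stands in for the output of whatever tokenizer Simple Agent actually uses.

```python
def chunk_tokens(tokens, chunk_size=800, overlap=100):
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts 700 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use tokenizer output here
tokens = [f"tok{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.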

Crawl depth:

Single URL: only that page
Full domain: up to 500 pages (configurable)
XML sitemap: all listed URLs
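
A full-domain crawl with a page cap can be pictured as a bounded traversal of the site's link graph. The sketch below assumes a breadth-first order and a same-host filter, neither of which is confirmed by this document; crawl_order and the LINKS dictionary (which stands in for fetching pages and extracting links) are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url, links, max_pages=500):
    """Return same-host URLs reachable from start_url, breadth-first, capped at max_pages."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            # Stay on the starting host and never queue a URL twice
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical link graph for illustration only
LINKS = {
    "https://yoursite.com/": [
        "https://yoursite.com/faq",
        "https://yoursite.com/pricing",
        "https://other.example/",
    ],
    "https://yoursite.com/faq": [
        "https://yoursite.com/",
        "https://yoursite.com/privacy-policy",
    ],
}

print(crawl_order("https://yoursite.com/", LINKS, max_pages=3))
```

With a cap of 3, the privacy-policy page is discovered but never visited, which is the same effect a page limit has on a large site.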

Automatic update: URLs are re-crawled every 7 days. You can trigger a manual re-crawl from the source list.

PDF

Drag the file or click to upload. Supports:

PDFs with selectable text (direct extraction, faster)
Scanned PDFs (automatic OCR via Tesseract)
Password-protected PDFs (enter the password at upload time)

Limits:

PDF: maximum 4 MB per file (DOCX: 20 MB; TXT: 2 MB)
Up to 10 files per upload
Maximum 1,000 pages per PDF (larger documents are truncated with a warning)
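
If you prepare uploads with a script, the limits above can be checked before sending anything. This is a sketch only: UploadFile and check_upload are hypothetical names, 1 MB is assumed to mean 1024 × 1024 bytes, and rejecting other file extensions is an assumption rather than documented behavior.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the documented limits use binary megabytes
MAX_BYTES = {"pdf": 4 * MB, "docx": 20 * MB, "txt": 2 * MB}
MAX_FILES_PER_UPLOAD = 10
MAX_PDF_PAGES = 1000


@dataclass
class UploadFile:
    name: str
    size_bytes: int
    pages: int = 0  # only meaningful for PDFs


def check_upload(files):
    """Return a list of human-readable problems; an empty list means the batch fits the limits."""
    problems = []
    if len(files) > MAX_FILES_PER_UPLOAD:
        problems.append(f"too many files: {len(files)} > {MAX_FILES_PER_UPLOAD}")
    for f in files:
        ext = f.name.rsplit(".", 1)[-1].lower()
        limit = MAX_BYTES.get(ext)
        if limit is None:
            problems.append(f"{f.name}: unsupported file type")
        elif f.size_bytes > limit:
            problems.append(f"{f.name}: exceeds {limit // MB} MB limit")
        if ext == "pdf" and f.pages > MAX_PDF_PAGES:
            problems.append(f"{f.name}: only the first {MAX_PDF_PAGES} pages will be indexed")
    return problems


batch = [UploadFile("guide.pdf", 5 * MB, pages=1200), UploadFile("faq.txt", 1 * MB)]
for problem in check_upload(batch):
    print(problem)
```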

Tables and lists: Extracted as structured text — the agent can answer questions about tabular data with good accuracy.

Direct text

Paste any content into the text box. Useful for:

FAQs that aren't on any website
Support scripts
Confidential internal policies (not indexed on the web)
Structured data like pricing and tables

There is no size limit for direct text.

XML sitemap

Paste the sitemap URL:

https://yoursite.com/sitemap.xml
https://yoursite.com/sitemap-index.xml

Simple Agent reads all listed URLs and crawls each one. Nested sitemaps (sitemap index) are resolved automatically.
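
Resolving a sitemap index amounts to following each nested sitemap until only page URLs remain. The sketch below shows the idea with Python's standard XML parser and the namespace defined by the sitemaps.org protocol. The collect_urls function and the fetch callback are hypothetical, and the sample XML is served from a dictionary instead of the network.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url, fetch, seen=None):
    """Return all page URLs under sitemap_url, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return []  # guard against sitemaps that reference each other
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch, seen))
        return urls
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


# Hypothetical sitemap documents for illustration only
DOCS = {
    "https://yoursite.com/sitemap-index.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://yoursite.com/sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://yoursite.com/sitemap.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://yoursite.com/faq</loc></url>"
        "<url><loc>https://yoursite.com/privacy-policy</loc></url>"
        "</urlset>"
    ),
}

print(collect_urls("https://yoursite.com/sitemap-index.xml", DOCS.__getitem__))
```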

Manage sources

In the agent panel → Training tab:

Action	Description
Add source	URL, PDF, text, or sitemap
Re-index	Forces immediate re-crawl/reprocessing
Delete	Removes the source and all associated embeddings
View chunks	Inspect how the content was split

How much content to add?

More content is not always better. RAG retrieves the top-K most relevant passages — irrelevant content increases the risk of noise in responses.
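
To see why off-topic content matters, here is a toy version of top-K retrieval. It assumes cosine similarity as the ranking measure, which this document does not confirm, and uses made-up 3-dimensional vectors in place of real embeddings; cosine and top_k are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=3):
    """Return the text of the k chunks whose vectors are most similar to query_vec."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy data: (chunk text, made-up embedding)
CHUNKS = [
    ("refund policy passage", [0.9, 0.1, 0.0]),
    ("unrelated passage", [0.0, 0.1, 0.9]),
    ("how to request a refund passage", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stands in for the embedding of a refund question

print(top_k(query, CHUNKS, k=2))
```

Only K passages reach the model, so every loosely related chunk in the index is a candidate to displace a passage that actually answers the question.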

Best practices:

Start with your most visited site pages
Add your full FAQ as a text source
Avoid adding long terms of service and legal documents (rarely relevant for support)
If your site has more pages than the full-domain crawl limit (500 by default), use a sitemap to ensure full coverage

Content the agent does not learn from

Content behind a login — the crawler does not authenticate
Content inside images without alt text — no OCR on HTML images
Videos and audio — automatic transcription is not available in this version
Google Docs / Notion — use the public export link or PDF

Test your training

After adding a source, go to the Playground and test with real questions:

"What is your refund policy?"
"Do you offer support on weekends?"
"How much does the business plan cost?"

If the answer doesn't include the expected information, open View chunks in the Training tab to confirm that the content was indexed and split as you expected.