Supported source types
| Type | Format | Max size | Notes |
|---|---|---|---|
| PDF | Standard PDF (text-extractable) | 50 MB | Scanned PDFs without an OCR layer yield only whatever text the original file already carries. |
| DOCX | Microsoft Word | 50 MB | Tables and footnotes are extracted; embedded images are not analyzed. |
| Plain text | .txt, .md, UTF-8 encoded | 50 MB | Anything that decodes as UTF-8 without binary control bytes. |
| Web URL | HTTPS web page | n/a (text only) | We fetch the page server-side, strip script/style/nav, and index the visible text. |
| YouTube | youtube.com or youtu.be URL | n/a (transcript only) | We pull the auto-generated or human-uploaded transcript. If the video has captions disabled, ingestion fails with a clear error. |
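The rules in the table can be checked client-side before anything is sent. The following is a minimal sketch, not the service's actual validation: the function names, the exact set of YouTube host aliases, the reading of "50 MB" as 50 × 1024 × 1024 bytes, and the choice to allow tab, newline, and carriage return as non-binary control characters are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumption: "50 MB" is interpreted as binary megabytes.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024

# Assumption: common aliases of the two hosts named in the table.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_url(url: str) -> str:
    """Return 'youtube' or 'web' for a pasted URL, or raise ValueError."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host in YOUTUBE_HOSTS:
        return "youtube"
    if parts.scheme != "https" or not host:
        raise ValueError("web sources must be HTTPS URLs")
    return "web"


def looks_like_plain_text(data: bytes) -> bool:
    """True if data decodes as UTF-8 and contains no binary control bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Assumption: tab, newline, and carriage return are acceptable.
    return not any(ord(ch) < 32 and ch not in "\t\n\r" for ch in text)


def check_upload(name: str, data: bytes) -> str:
    """Return 'pdf', 'docx', or 'text' for an uploadable file, or raise ValueError."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 50 MB limit")
    suffix = Path(name).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in {".txt", ".md"}:
        if not looks_like_plain_text(data):
            raise ValueError("file is not clean UTF-8 text")
        return "text"
    raise ValueError(f"unsupported file type: {suffix or name}")
```

A check like this only saves a round trip; the server remains the authority on what is accepted.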
How to add one
From a vault page, click Add source. Pick the type, then upload a file or paste a URL. The source appears in the list with status=processing; once ingestion finishes (about 30 seconds for a typical paper), the status flips to ready and the source can be cited in chat.
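If you are scripting against the service rather than using the UI, the processing-to-ready transition can be handled with a simple polling loop. This is a sketch under stated assumptions: `get_status` is a hypothetical callable you supply (for example, a function that reads the source's status field from whatever API you have access to), and the timeout and interval defaults are illustrative. Only the `processing` and `ready` status values come from the description above.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],  # hypothetical: returns the source's current status
    timeout_s: float = 120.0,       # assumption: generous margin over ~30 s
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is no longer 'processing' and return the final status.

    Raises TimeoutError if the source is still processing after timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status != "processing":
            # Return whatever the service reports; the caller decides
            # whether it is 'ready' or some failure state.
            return status
        if clock() >= deadline:
            raise TimeoutError("source is still processing")
        sleep(interval_s)
```

The function returns any non-processing status rather than assuming success, so callers should still compare the result against `ready`.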
What happens during ingestion
- The file is uploaded to Supabase Storage under your account's namespace.
- We extract text locally (pdfplumber / python-docx / BeautifulSoup / youtube-transcript-api).
- We upload the original file (or the transcript as .txt) to your vault's Gemini File Search store.
- Gemini indexes the file. Page numbers are preserved for PDFs.
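To make the text-extraction step concrete for web pages, here is a minimal sketch of stripping script, style, and nav content and keeping the visible text. It deliberately uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without dependencies; it illustrates the idea and is not the service's extraction code. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped, per the Web URL row in the table.
SKIPPED_TAGS = {"script", "style", "nav"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not nested inside a skipped tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real extractor would also need to cope with malformed markup and hidden elements, which this sketch does not attempt.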
Bulk upload
To add up to 50 files at a time, use the bulk PDF endpoint. See bulk upload.
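If you have more than 50 files, they need to be split into several requests. A minimal sketch of the batching, assuming nothing about the endpoint itself beyond the 50-file limit (the helper name and constant are hypothetical):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

BULK_LIMIT = 50  # per-request file limit stated above


def batches(items: Sequence[T], size: int = BULK_LIMIT) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example: 120 files become three requests of 50, 50, and 20 files.
paths = [f"paper_{i}.pdf" for i in range(120)]
batch_sizes = [len(batch) for batch in batches(paths)]
```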
ScholarFlow auto-ingestion (Pro)
Don't want to upload manually? Subscribe to a topic and the daily ScholarFlow cron will fetch new open-access papers from Europe PMC and ingest them automatically. See ScholarFlow.
What we don't support yet
- EPUB / MOBI: extract the text yourself and upload it as .txt for now
- Scanned PDFs without an OCR layer: run them through an OCR tool first
- Audio / video files (other than YouTube transcripts): transcribe them first and upload the transcript as .txt
- Cloud-drive direct sync: paste URLs, or download the files and upload them
Source quotas
Free: 100 sources total across all vaults. Pro: 2000. Hitting the cap returns HTTP 402 upgrade_required with a clear message. See quotas.
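A client can avoid hitting the cap mid-upload by checking headroom first and treating a 402 as a signal to stop. The sketch below assumes only the limits and status code stated above; the plan keys, function names, and exception class are hypothetical, and the shape of the error response body is not assumed at all.

```python
# Limits from this section; the lowercase plan keys are an assumption.
SOURCE_LIMITS = {"free": 100, "pro": 2000}


class UpgradeRequired(Exception):
    """Raised when the service answers with HTTP 402 (upgrade_required)."""


def remaining_sources(plan: str, used: int) -> int:
    """Number of sources that can still be added across all vaults."""
    return max(SOURCE_LIMITS[plan] - used, 0)


def raise_for_quota(status_code: int) -> None:
    """Raise UpgradeRequired for a 402 response; do nothing otherwise."""
    if status_code == 402:
        raise UpgradeRequired("source quota reached; upgrade required")
```

For example, `remaining_sources("free", 97)` leaves room for three more sources, so a batch of five should be trimmed or deferred before it is sent.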