Adding sources — PDFs, URLs, YouTube

Five source types are supported: PDF, DOCX, plain text, web URL, and YouTube video. Uploaded files can be up to 50 MB each.

Supported source types

Type | Format | Max size | Notes
PDF | Standard PDF (text-extractable) | 50 MB | For scanned PDFs without an OCR layer, only the text the original PDF already carries is indexed.
DOCX | Microsoft Word | 50 MB | Tables and footnotes are extracted; embedded images are not analyzed.
Plain text | .txt, .md, UTF-8 encoded | 50 MB | Anything that decodes as UTF-8 without binary control bytes.
Web URL | HTTPS web page | n/a (text only) | We fetch the page server-side, strip script/style/nav, and index the visible text.
YouTube | youtube.com or youtu.be URL | n/a (transcript only) | We pull the auto-generated or human-uploaded transcript. If the video has captions disabled, ingestion fails with a clear error.
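
If you want to pre-check sources before adding them, the rules in the table can be approximated on the client. The following is a minimal sketch, not the service's actual validation: the function names, the exact set of YouTube hostnames, the reading of "50 MB" as 50 × 1024 × 1024 bytes, and the treatment of tab, newline, and carriage return as allowed control characters are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumption: "50 MB" is interpreted as binary megabytes.
MAX_FILE_BYTES = 50 * 1024 * 1024

# File extensions named in the table, mapped to hypothetical type labels.
FILE_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "text"}

# Assumption: these hostnames cover "youtube.com or youtu.be" URLs.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_source(name_or_url: str) -> str:
    """Return 'pdf', 'docx', 'text', 'youtube', or 'url'; raise ValueError otherwise."""
    parsed = urlparse(name_or_url)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        if host in YOUTUBE_HOSTS:
            return "youtube"
        if parsed.scheme != "https":
            raise ValueError("web sources must be HTTPS pages")
        return "url"
    lowered = name_or_url.lower()
    for ext, kind in FILE_TYPES.items():
        if lowered.endswith(ext):
            return kind
    raise ValueError(f"unsupported source: {name_or_url}")


def looks_like_plain_text(data: bytes) -> bool:
    """True if data decodes as UTF-8 and has no control bytes besides tab, LF, CR."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not any(ord(ch) < 32 and ch not in "\t\n\r" for ch in text)


def check_size(num_bytes: int) -> None:
    """Raise ValueError if a file is larger than the per-file limit."""
    if num_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 50 MB limit")
```

A check like this only catches obvious mismatches early; the server remains the authority on what is accepted.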

How to add one

From a vault page, click Add source, pick the type, then upload the file or paste the URL. The source appears in the list with status=processing; once ingestion finishes (about 30 seconds for a typical paper), it flips to ready and can be cited in chat.
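
If you are scripting against the service rather than using the UI, the processing-to-ready transition can be handled with a simple polling loop. This is a sketch under assumptions: `get_status` is a hypothetical callable you supply (for example, a wrapper around whatever status lookup you have), and the timeout and interval defaults are illustrative. Only the `processing` and `ready` status values come from the description above; any other value is treated as an error here.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_status() until it returns 'ready' or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status == "ready":
            return
        if status != "processing":
            # Assumption: anything other than processing/ready means ingestion failed.
            raise RuntimeError(f"unexpected source status: {status}")
        if waited >= timeout_s:
            raise TimeoutError("source still processing after timeout")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.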

What happens during ingestion

  1. The file is uploaded to Supabase Storage under your account's namespace.
  2. We extract text locally (pdfplumber / python-docx / BeautifulSoup / youtube-transcript-api).
  3. We upload the original file (or the transcript as .txt) to your vault's Gemini File Search store.
  4. Gemini indexes the file. Page numbers are preserved for PDFs.
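
To make the text-extraction step more concrete for web pages, here is a minimal sketch of stripping script, style, and nav content and keeping the visible text. The pipeline above names BeautifulSoup for this; the sketch deliberately swaps in Python's standard-library `html.parser` so it runs without dependencies. The class and function names are hypothetical, and the real extractor may treat whitespace and other elements differently.

```python
from html.parser import HTMLParser

# Elements whose contents are dropped, as described for web URL sources.
SKIPPED_TAGS = {"script", "style", "nav"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside script, style, or nav elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `visible_text("<nav>Home</nav><p>Hello <b>world</b></p>")` yields the paragraph text and drops the navigation label.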

Bulk upload

To add up to 50 files at a time, use the bulk PDF endpoint. See bulk upload.
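
If you have more than 50 files, one way to stay within the per-request limit is to split the list into batches before calling the bulk endpoint. The sketch below only does the splitting; the helper name is hypothetical and the request itself is left out because the endpoint's details are covered in the bulk upload article.

```python
from typing import Iterator, List, Sequence

BULK_LIMIT = 50  # per-request file cap stated above


def batches(paths: Sequence[str], size: int = BULK_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of paths, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])
```

With 120 files this produces three batches of 50, 50, and 20, each of which can be sent as its own bulk request.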

ScholarFlow auto-ingestion (Pro)

Don't want to upload manually? Subscribe to a topic and the daily ScholarFlow cron will fetch new open-access papers from Europe PMC and ingest them automatically. See ScholarFlow.
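
For a rough idea of what a daily open-access lookup against Europe PMC can look like, here is a sketch that builds a search URL for the public Europe PMC REST search endpoint. It is an illustration only: how ScholarFlow actually queries Europe PMC is not described here, and the function name, the way the topic is quoted, the one-day date filter, and the page size are all assumptions.

```python
from datetime import date
from urllib.parse import urlencode

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_daily_query(topic: str, day: date, page_size: int = 25) -> str:
    """Build a search URL for open-access records first published on `day`."""
    # Assumption: topic is matched as a quoted phrase; the real cron may differ.
    query = f'"{topic}" AND OPEN_ACCESS:y AND FIRST_PDATE:{day.isoformat()}'
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"
```

The function only constructs the URL; fetching it and ingesting the results is what the subscription automates for you.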

What we don't support yet

  • EPUB / MOBI — extract text yourself and upload as .txt for now
  • Scanned PDFs without an OCR layer — run them through an OCR tool first
  • Audio / video files (other than YouTube transcripts) — same: transcribe first
  • Cloud-drive direct sync — paste URLs or download + upload

Source quotas

Free: 100 sources total across all vaults. Pro: 2000. Hitting the cap returns HTTP 402 upgrade_required with a clear message. See quotas.
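
A client can track the quota and react to the 402 response along these lines. This is a sketch, not the documented response format: the limits and the `upgrade_required` code come from the text above, while the JSON field names `error` and `message`, the exception class, and the function names are assumptions.

```python
# Totals across all vaults, per plan, as stated above.
SOURCE_LIMITS = {"free": 100, "pro": 2000}


class UpgradeRequired(Exception):
    """Raised when the API signals that the source quota is exhausted."""


def remaining_sources(plan: str, used: int) -> int:
    """Return how many more sources the plan allows (never negative)."""
    return max(SOURCE_LIMITS[plan] - used, 0)


def raise_for_quota(status_code: int, body: dict) -> None:
    """Raise UpgradeRequired for an HTTP 402 upgrade_required response."""
    # Assumption: the error code arrives in a JSON field named "error".
    if status_code == 402 and body.get("error") == "upgrade_required":
        raise UpgradeRequired(body.get("message", "source quota reached"))
```

Checking `remaining_sources` before a large upload avoids hitting the cap partway through.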
