Documentation

PDF Text Extractor reads the embedded text layer of a PDF locally in the browser and can fall back to OCR for scanned or low-text pages. Files are not uploaded to a server, which makes the tool suitable for contracts, papers, invoices, scanned archives, and office documents.

Features

  • Upload one PDF and view file size, page count, and file name.
  • Extract the text layer only, fall back to OCR automatically, or force OCR on every page.
  • Export the result as a TXT file or an HTML report.
  • Choose full page markers, simple page markers, or no page separators.
  • Compress whitespace, rejoin English words hyphenated across line breaks, and convert full-width letters and numbers to half-width.
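
The three cleanup options in the last item can be pictured with the following minimal sketch. It is not the tool's actual implementation: the option names, the function names, and the exact rules (for example, which whitespace is collapsed) are assumptions made for illustration.

```typescript
// Hypothetical option names; the tool's real settings may be named differently.
interface CleanupOptions {
  compressWhitespace: boolean;
  fixHyphenBreaks: boolean;
  convertFullWidth: boolean;
}

// Full-width digits and Latin letters (U+FF10-FF19, U+FF21-FF3A, U+FF41-FF5A)
// sit exactly 0xFEE0 above their ASCII counterparts.
function toHalfWidth(text: string): string {
  return text.replace(/[\uFF10-\uFF19\uFF21-\uFF3A\uFF41-\uFF5A]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0xfee0),
  );
}

// Rejoin a word split by a hyphen at the end of a line: "extrac-\ntion" -> "extraction".
// Assumption: the next line must start with a lowercase letter. This simple rule
// also merges genuine compounds such as "well-\nknown", so review the output.
function fixHyphenBreaks(text: string): string {
  return text.replace(/([A-Za-z])-\r?\n\s*([a-z])/g, "$1$2");
}

// Collapse runs of spaces, tabs, and ideographic spaces, strip spaces around
// line breaks, and reduce three or more line breaks to one blank line.
function compressWhitespace(text: string): string {
  return text
    .replace(/[ \t\u3000]+/g, " ")
    .replace(/ ?\r?\n ?/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Apply the enabled steps. Full-width conversion runs first so that converted
// letters can take part in hyphen repair; whitespace compression runs last
// because hyphen repair needs the original line breaks.
function cleanText(text: string, options: CleanupOptions): string {
  let result = text;
  if (options.convertFullWidth) result = toHalfWidth(result);
  if (options.fixHyphenBreaks) result = fixHyphenBreaks(result);
  if (options.compressWhitespace) result = compressWhitespace(result);
  return result;
}
```

With all three options enabled, an input such as "extrac-", a line break, then "tion  of   ＰＤＦ" becomes "extraction of PDF". With all options disabled, the text is returned unchanged.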

OCR Notes

Automatic OCR runs only on pages whose visible text count is below the threshold. Force OCR renders and recognizes every page, which is useful for scanned PDFs but takes longer. Recognition supports Chinese, English, and mixed Chinese-English text.
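
The per-page decision described above can be sketched as follows. This is an illustration only: the mode names, the `shouldRunOcr` function, and the definition of "visible text count" as the number of non-whitespace characters are assumptions, and the threshold value in the usage note is an example, not the tool's default.

```typescript
// Hypothetical names for the three extraction modes.
type OcrMode = "text-only" | "auto" | "force";

// Assumption: "visible text" means non-whitespace characters in the text layer.
// Array.from counts code points, so characters outside the BMP count once.
function visibleCharCount(pageText: string): number {
  return Array.from(pageText.replace(/\s/g, "")).length;
}

// Decide whether a single page should be rendered and sent to OCR.
function shouldRunOcr(mode: OcrMode, pageText: string, threshold: number): boolean {
  if (mode === "force") {
    return true; // every page is recognized, regardless of its text layer
  }
  if (mode === "text-only") {
    return false; // OCR is never used
  }
  // "auto": fall back to OCR only when the text layer is nearly empty
  return visibleCharCount(pageText) < threshold;
}
```

For example, with a threshold of 20, a scanned page whose text layer is empty is sent to OCR in automatic mode, while a page with a full paragraph of embedded text is not.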

Use Cases

Use it for contract review, research material cleanup, invoice archiving, scanned document conversion, PDF content search, and office document workflows.

Notes

Encrypted PDFs must be decrypted before extraction. OCR accuracy depends on scan clarity, page rotation, image resolution, and font quality, so review the output of important documents manually.