Converting a scanned document into a compressed, searchable PDF with redactions
Created on 2022-08-27T06:52:40-05:00
- Batch convert PDF to TIFFs
- Perform corrections and redactions on the TIFFs
- Convert the bundle of TIFFs back into a PDF
- Run the optimizers: Ghostscript (gs) first, then OCRmyPDF, which also adds the searchable text layer
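The first three steps can be sketched as shell functions. This is a minimal sketch and not the article's method: Ghostscript's tiff24nc device at 300 dpi, ImageMagick's mogrify for painting redaction boxes, and img2pdf for reassembly are all assumed tool choices, and the function names and the page-%04d.tif naming are hypothetical. With DRY_RUN=1 each function prints its command instead of running it.

```shell
# Hypothetical helpers; the tool choices are assumptions, not the article's method.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# pdf_to_tiffs INPUT.pdf DIR
# Render one 24-bit colour TIFF per page at 300 dpi into DIR.
pdf_to_tiffs() {
    run mkdir -p "$2"
    run gs -sDEVICE=tiff24nc -r300 -dNOPAUSE -dQUIET -dBATCH \
        -sOutputFile="$2/page-%04d.tif" "$1"
}

# redact_box PAGE.tif X0 Y0 X1 Y1
# Paint an opaque black rectangle over the pixels, editing the TIFF in place.
redact_box() {
    run mogrify -fill black -draw "rectangle $2,$3 $4,$5" "$1"
}

# tiffs_to_pdf DIR OUTPUT.pdf
# Reassemble the pages; zero-padded names keep the glob in page order.
tiffs_to_pdf() {
    run img2pdf "$1"/page-*.tif -o "$2"
}
```

A possible session would be `pdf_to_tiffs scan.pdf pages`, then `redact_box pages/page-0001.tif 100 200 500 260` for each region to hide, then `tiffs_to_pdf pages redacted.pdf`. Redacting on the raster pages overwrites the pixels themselves, so nothing remains underneath the box once the PDF is rebuilt.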
$ infile=scan.pdf
$ tmpfile=$(mktemp)
$ outfile=searchable-scan.pdf
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$tmpfile" "$infile"
$ ocrmypdf -l eng --deskew "$tmpfile" "$outfile"
$ rm "$tmpfile"
The order of compression matters. The article's author found that running the gs optimization before OCRmyPDF shaved the file from 1.5 MB to 1 MB. Running only OCRmyPDF took the scanner's raw output from 7.9 MB to 2.7 MB.
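To check the effect of ordering on your own scans, it may help to print the byte size of each intermediate file. The helper below is a small sketch; the name report_sizes and the file names in the usage note are hypothetical.

```shell
# report_sizes FILE...
# Print "<bytes> <name>" for each file, one per line.
report_sizes() {
    for f in "$@"; do
        # wc -c pads its output with spaces on some systems; strip them.
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}
```

For example, `report_sizes scan.pdf gs-only.pdf searchable-scan.pdf` would list the raw scan, the Ghostscript output, and the final OCRmyPDF output side by side.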