Coder Perfect

What is the best way to search the contents of numerous pdf files?

Problem

What is the best way to search the contents of PDF files in a directory or subdirectory? I’m looking for command-line utilities. Grep does not appear to be able to search PDF files.

Asked by Jestin Joy

Solution #1

There’s also pdfgrep, which does exactly what it says on the tin.

pdfgrep -R 'a pattern to search recursively from path' /some/path

I’ve only used it for basic searches, and it’s worked perfectly.

(Packages are available for Debian, Ubuntu, and Fedora.)

pdfgrep has supported recursive search since version 1.3.0. That version has been available in Ubuntu since 12.10 (Quantal).

Answered by Graeme

Solution #2

A utility called pdftotext should be included in your distribution:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

The “-” argument is required so that pdftotext writes to stdout rather than to a file. The --with-filename and --label= options make grep include the file name in its output. The optional --color flag tells grep to highlight matches in color on the terminal.

(In Ubuntu, pdftotext is provided by the xpdf-utils or poppler-utils package, depending on the release.)

This technique, which combines pdftotext and grep, has an advantage over pdfgrep if you want to use GNU grep features that pdfgrep doesn’t support. Note that pdfgrep 1.3.x does support the -C option for printing lines of context.
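As a rough sketch, the same pdftotext-plus-grep idea can be wrapped in a small shell function. The name search_pdfs is hypothetical and not part of either tool; the sketch assumes pdftotext and a grep that supports --label (such as GNU grep) are installed, and it does not handle file names that contain newlines:

```shell
# search_pdfs PATTERN [DIR]: hypothetical helper, not part of poppler or grep.
# Assumes pdftotext and a grep with --label (e.g. GNU grep) are on PATH.
# File names containing newlines are not handled by this simple loop.
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' | (
        status=1    # stays 1 until some PDF matches, mirroring grep
        while IFS= read -r pdf; do
            # "-" makes pdftotext write the extracted text to stdout;
            # --label makes grep print the PDF name instead of "(standard input)".
            if pdftotext "$pdf" - 2>/dev/null |
                grep --with-filename --label="$pdf" -e "$pattern"; then
                status=0
            fi
        done
        exit "$status"
    )
}
```

Calling search_pdfs 'your pattern' /path prints one line per match, prefixed with the PDF’s path, and returns status 0 if at least one PDF matched. Any other GNU grep option can be added to the grep call inside the loop.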

Answered by sjr

Solution #3

Recoll is a superb full-text GUI search program for Unix/Linux that supports a variety of formats, including PDF. It can also pass the exact page number of a hit and the search term to the document viewer, so you can jump straight to the result from the GUI.

Recoll also has a useful command-line interface as well as a web-browser interface.

Answered by Glutanimate

Solution #4

The following is possible with my current version of pdfgrep (1.3.0):

pdfgrep -HiR 'pattern' /path

Run pdfgrep --help to see the other options.

It runs smoothly on Ubuntu.
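To spell out what the combined flags do, here is a minimal sketch of a wrapper function. The name pdf_search is hypothetical and not part of pdfgrep; the sketch adds -n for page numbers and assumes pdfgrep 1.3.0 or later is installed:

```shell
# pdf_search PATTERN [DIR]: hypothetical convenience wrapper around pdfgrep.
# Assumes pdfgrep >= 1.3.0 (needed for -R) is on PATH.
pdf_search() {
    # -H  print the file name for each match
    # -i  ignore case
    # -n  print the page number of each match
    # -R  search directories recursively (also follows symbolic links)
    pdfgrep -H -i -n -R "$1" "${2:-.}"
}
```

For example, pdf_search 'pattern' /path searches every PDF under /path, and pdf_search 'pattern' alone searches the current directory.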

Answered by arkhi

Solution #5

Ripgrep-all is a utility that is based on ripgrep.

It can handle documents other than PDFs, such as Office documents and videos, and the author claims that it is faster than pdfgrep.

The first command below searches all files recursively under the current directory; the second restricts the search to PDF files:

rga 'pattern' .
rga --type pdf 'pattern' .

Answered by oschoudhury

Post is based on https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files