Convert PDF files to image files with pdf2image

In 2018, I was using Python at work and needed to convert PDF files to image files for a project, so I did some research on the available libraries.

I first tried ImageMagick but gave it up because the converted images sometimes contained a bit of noise, which I would have had to clean up with extra processing in a later step.

The next library I tried was pdf2image, which converted the PDF files more cleanly, so I decided to use it. Incidentally, pdf2image is a Python module that wraps pdftoppm and pdftocairo, two poppler utilities that convert PDF files to image files.

Installing poppler

To use pdf2image, you must install poppler.

Ubuntu

sudo apt-get install poppler-utils

macOS

brew install poppler

Windows

Download the latest package from http://blog.alivate.com.au/poppler-windows/
Extract the package
Move the extracted directory to the desired place on your system
Add the bin/ directory to your PATH
Test that all went well by opening cmd and making sure that you can call pdftoppm -h
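The same check can be done from Python on any of the platforms above. This is a minimal sketch that uses only the standard library; the helper name find_poppler_tools is my own and is not part of pdf2image.

```python
import shutil


def find_poppler_tools(names=("pdftoppm", "pdftocairo")):
    """Map each poppler utility name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND, check your PATH'}")
```

If either tool is reported as not found, pdf2image will most likely fail too, unless you point it at the binaries with the poppler_path parameter.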

How to use

pdf2image provides two functions for converting PDF files to image files: convert_from_path, which takes a file path, and convert_from_bytes, which takes the content of the PDF as bytes.
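If your code receives PDFs both as paths and as raw bytes, a small dispatcher can pick the right function. The sketch below is only an illustration: pick_converter_name and convert_pdf are hypothetical helpers of mine, and pdf2image is imported lazily so the type check can be used on its own.

```python
from pathlib import Path


def pick_converter_name(source):
    """Return the name of the pdf2image function that matches the input type."""
    if isinstance(source, bytes):
        return "convert_from_bytes"
    if isinstance(source, (str, Path)):
        return "convert_from_path"
    raise TypeError(f"unsupported PDF source: {type(source).__name__}")


def convert_pdf(source, **kwargs):
    """Convert a PDF given either as a path or as raw bytes (hypothetical wrapper)."""
    import pdf2image  # imported here so pick_converter_name works without it

    converter = getattr(pdf2image, pick_converter_name(source))
    return converter(source, **kwargs)
```

Any keyword argument accepted by the chosen function, such as dpi or fmt, can be passed through convert_pdf unchanged.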

number	parameter	default	description
1	pdf_path	(required)	Path to the PDF file. Can be a string or a pathlib.Path object. Used by convert_from_path only.
2	pdf_bytes	(required)	Bytes of the PDF file. Used by convert_from_bytes only.
3	dpi	200	Dots per inch, which can be seen as the relative resolution of the output images. Higher is better, but anything above 300 is usually not discernible to the naked eye. Keep in mind that this is directly related to the output image size when using file formats without compression (like PPM).
4	output_folder	None	Output directory for the generated files. It should be seen more as a "working directory" than an output folder. The converted images are written there to save system memory.
5	first_page	None	First page that will be converted. `first_page=2` will skip page 1.
6	last_page	None	Last page that will be converted. `last_page=2` will skip all pages after page 2.
7	fmt	"ppm"	File format of the output images. Supported values are ppm, jpeg, png and tiff.
8	jpegopt	None	Configuration for the JPEG output format, so it is only used with `fmt='jpeg'`. A dict with the following keys. quality: the JPEG quality value, an integer between 0 and 100. progressive: `True` for progressive JPEG output, `False` for non-progressive. optimize: `True` to compute optimal Huffman coding tables, which creates smaller files but makes an extra pass over the data; `False` to use the default Huffman tables.
9	thread_count	1	Number of threads to use when converting the PDF. Limited to the actual number of pages.
10	userpw	None	Password for the PDF if it is password-protected.
11	use_cropbox	False	Uses the PDF cropbox instead of the default mediabox. This is a rather obscure feature that should be set to `True` when the module does not seem to work with your data.
12	strict	False	Raises PDFSyntaxError when the PDF is partially malformed. Most PDFs are partially malformed, so this parameter should be kept at `False` unless standards compliance is paramount to your use case.
13	transparent	False	Instead of returning a white background, makes the PDF background transparent. Only compatible with file formats that support transparency.
14	single_file	False	Converts only the first page of the PDF and does not append an index to the output file name.
15	output_file	uuid_generator()	Output filename, normally a string, but can take a string generator.
16	poppler_path	None	Path to the poppler directory containing libraries and executable files.
17	grayscale	False	Returns grayscale images.
18	size	None	Size of the output images. Using `None` for one of the dimensions resizes while preserving the aspect ratio. Examples of valid sizes: `size=400` fits the image to a 400×400 box, preserving aspect ratio; `size=(400, None)` makes the image 400 pixels wide, preserving aspect ratio; `size=(500, 500)` resizes the image to 500×500 pixels, not preserving aspect ratio. This behavior is derived directly from the `-scale-to`, `-scale-to-x`, and `-scale-to-y` parameters of poppler.
19	paths_only	False	Returns a list of image paths rather than preloaded images.
20	hide_annotations	False	Hides link bounding boxes and other PDF annotations. This is only implemented in pdftoppm at the moment, so it cannot be combined with pdftocairo flags.
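I did not test jpegopt myself, but its description is detailed enough to sketch how the dict could be built. make_jpegopt is a hypothetical helper of mine that only enforces the value ranges stated in the table; the default quality of 75 is an arbitrary choice, not a pdf2image default.

```python
def make_jpegopt(quality=75, progressive=False, optimize=False):
    """Build a jpegopt dict, rejecting values outside the documented ranges."""
    if isinstance(quality, bool) or not isinstance(quality, int) or not 0 <= quality <= 100:
        raise ValueError("quality must be an integer between 0 and 100")
    if not isinstance(progressive, bool) or not isinstance(optimize, bool):
        raise ValueError("progressive and optimize must be True or False")
    return {"quality": quality, "progressive": progressive, "optimize": optimize}


# Keyword arguments that could be passed as convert_from_path(pdf_path, **options).
# jpegopt is only used together with fmt="jpeg".
options = {"fmt": "jpeg", "jpegopt": make_jpegopt(quality=85, optimize=True)}
print(options)
```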

I tested some of these parameters with convert_from_path in pdf2image version 1.16.0. The numbers in the headings below match the parameter numbers in the table above.

First, import the necessary modules.

$ from pdf2image import convert_from_path, convert_from_bytes
$ import tempfile

3. dpi

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)

$ images[0].info['dpi']
(300, 300)

The images were converted at 300 dpi.
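The relationship between dpi and pixel size is simple arithmetic: PDF page dimensions are measured in points (1/72 inch), so pixels are roughly points / 72 × dpi. The sketch below assumes a US Letter page (612 × 792 points), which is consistent with the 1700 × 2200 pixel images that the default 200 dpi produces later in this article. The helper name is hypothetical, and poppler's own rounding may differ by a pixel.

```python
def expected_pixels(width_pt, height_pt, dpi=200):
    """Approximate rendered size in pixels for a page measured in PDF points."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


letter = (612, 792)  # US Letter in points (8.5 x 11 inches)
print(expected_pixels(*letter))           # (1700, 2200)
print(expected_pixels(*letter, dpi=300))  # (2550, 3300)
```

Going from 200 to 300 dpi multiplies the pixel count by 2.25, which matters when the output format is uncompressed.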

4. output_folder

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path)

$ images[0].filename
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpqldzldx2/46f65646-b9fe-409d-80ee-8f593e620d87-01.jpg'

The pages were written to the temporary directory as image files and returned as PIL images backed by those files.
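One thing to watch: tempfile.TemporaryDirectory() deletes the directory and everything in it as soon as the with block exits, so anything you want to keep should be saved elsewhere while still inside the block. A minimal sketch, assuming each item behaves like a PIL image with a save(path) method; save_pages and its file naming scheme are hypothetical.

```python
from pathlib import Path


def save_pages(images, out_dir, prefix="page"):
    """Save each image as <prefix>-<page number>.jpg in out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{prefix}-{number:02d}.jpg"
        image.save(target)  # PIL infers the JPEG format from the file extension
        saved.append(target)
    return saved


# Intended usage (call save_pages before the temporary directory is removed):
# with tempfile.TemporaryDirectory() as path:
#     images = convert_from_path(pdf_path, fmt='jpeg', output_folder=path)
#     save_pages(images, 'converted', prefix='AAPL')
```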

5. first_page

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)

$ len(images)
28

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', first_page=5)

$ len(images)
24

Pages 5-28 were converted (pages 1-4 were skipped).

6. last_page

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)

$ len(images)
28

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', last_page=5)

$ len(images)
5

Pages 1-5 were converted (pages 6-28 were skipped).
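Both parameters are 1-based and inclusive, so the number of returned images can be predicted before converting. The helper below is a hypothetical model of that arithmetic that reproduces the counts observed above; the clamping of out-of-range values is my assumption.

```python
def pages_converted(total_pages, first_page=None, last_page=None):
    """Number of pages in a 1-based, inclusive first_page/last_page range."""
    first = 1 if first_page is None else max(first_page, 1)
    last = total_pages if last_page is None else min(last_page, total_pages)
    return max(last - first + 1, 0)


print(pages_converted(28, first_page=5))                # 24
print(pages_converted(28, last_page=5))                 # 5
print(pages_converted(28, first_page=5, last_page=10))  # 6
```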

13. transparent

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', transparent=True)

$ images[0].mode
'RGB'

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='png', transparent=True)

$ images[0].mode
'RGBA'

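With fmt='png', the images were converted with a transparent background (mode RGBA). JPEG does not support transparency, so with fmt='jpeg' the option had no effect and the mode stayed RGB.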

14. single_file

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', single_file=True)

$ len(images)
1

Only the first page was converted to an image file.

15. output_file

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path, output_file='AAPL_')
    
$ images[0].filename
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpn1w7m6ev/AAPL_0001-01.jpg'

The image files were created with the "AAPL_" prefix.

17. grayscale

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', grayscale=True)

$ images[0].mode
'RGB'

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', grayscale=True)

$ images[0].mode
'L'

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='tiff', grayscale=True)

$ images[0].mode
'L'

The images were converted to grayscale (mode L) when the format was ppm or tiff, but with fmt='jpeg' the mode was still RGB.
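The jpeg call above still returned mode 'RGB'. If you need grayscale pages in that case, one workaround is to convert the returned images with Pillow afterwards. This is my own sketch, not a pdf2image feature; it relies only on the mode attribute and the convert() method of PIL images.

```python
def ensure_grayscale(images):
    """Return the images, converting any that are not already mode 'L' to 8-bit grayscale."""
    return [img if img.mode == "L" else img.convert("L") for img in images]


# Intended usage:
# images = convert_from_path(pdf_path, fmt='jpeg', grayscale=True)
# images = ensure_grayscale(images)
```

Images that are already in mode 'L' are returned unchanged, so the helper is safe to apply to ppm and tiff results as well.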

18. size

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg')

$ images[0].size
(1700, 2200)

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', size=(500, None))

$ images[0].size
(500, 648)

The image was resized to 500 pixels wide while maintaining the aspect ratio.

19. paths_only

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path)

$ images[0]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1700x2200>

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path, paths_only=True)
    
$ images[0]
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpt3_1y3z_/639ff5f6-e666-4bde-820a-33c94f237e05-01.jpg'

When output_folder is passed together with paths_only=True, the function returns the paths of the generated files without loading the images into memory.

Reference

pdf2image 1.16.0
pdf2image Overview
pdf2image.py