Convert PDF files to image files with pdf2image

Back in 2018, I was using Python at work and needed to convert PDF files to image files for a project, so I looked into the available libraries.

I first tried ImageMagick but gave it up because the converted images sometimes contained a bit of noise, which I would have had to clean up in a later processing step.

Next I tried pdf2image, which produced cleaner images, so I decided to use it. Incidentally, pdf2image is a Python module that wraps pdftoppm and pdftocairo, the poppler utilities that convert PDF files to image files.

Installing poppler

To use pdf2image, you must install poppler.

Ubuntu
sudo apt-get install poppler-utils
macOS
brew install poppler
Windows
  1. Download the latest package from http://blog.alivate.com.au/poppler-windows/
  2. Extract the package
  3. Move the extracted directory to the desired place on your system
  4. Add the bin/ directory to your PATH
  5. Test that all went well by opening cmd and making sure that you can call pdftoppm -h
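The check in step 5 can also be done from Python on any platform. The following is a minimal sketch that only assumes the poppler command-line tools are named pdftoppm and pdftocairo, as mentioned above; the helper name find_poppler_tools is made up for this example.

```python
import shutil


def find_poppler_tools(names=("pdftoppm", "pdftocairo")):
    """Map each tool name to its full path, or to None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per tool so a missing binary is easy to spot.
    for tool, location in find_poppler_tools().items():
        status = location if location else "NOT FOUND (check your PATH)"
        print(f"{tool}: {status}")
```

If either tool is reported as not found, pdf2image will most likely fail as well, unless poppler_path is passed explicitly.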

How to use

pdf2image provides two functions for converting PDF files to image files: convert_from_path and convert_from_bytes.

| number | parameter | default | description |
|---|---|---|---|
| 1 | pdf_path | (required) | Path to the PDF file. Can be a string or a pathlib.Path object. Used by convert_from_path. |
| 2 | pdf_bytes | (required) | Bytes of the PDF file. Used by convert_from_bytes. |
| 3 | dpi | 200 | Dots per inch; can be seen as the relative resolution of the output images. Higher is better, but anything above 300 is usually not discernible to the naked eye. Keep in mind that this directly determines the output image size when using file formats without compression (like PPM). |
| 4 | output_folder | None | Output directory for the generated files; should be seen more as a "working directory" than an output folder. The converted images are written there to save system memory. |
| 5 | first_page | None | First page that will be converted. first_page=2 will skip page 1. |
| 6 | last_page | None | Last page that will be converted. last_page=2 will skip all pages after page 2. |
| 7 | fmt | "ppm" | File format of the output images. Supported values are ppm, jpeg, png, and tiff. |
| 8 | jpegopt | None | Configuration for the JPEG output format; only used with fmt='jpeg'. quality: the JPEG quality value, an integer between 0 and 100. progressive: True for progressive JPEG output, False for non-progressive. optimize: True computes optimal Huffman coding tables, which creates smaller files but makes an extra pass over the data; False uses the default Huffman tables. |
| 9 | thread_count | 1 | Number of threads to use when converting the PDF. Limited to the actual number of pages. |
| 10 | userpw | None | Password for the PDF if it is password-protected. |
| 11 | use_cropbox | False | Uses the PDF cropbox instead of the default mediabox. This is a rather obscure feature that should be set to True when the module does not seem to work with your data. |
| 12 | strict | False | Raises PDFSyntaxError when the PDF is partially malformed. Most PDFs are partially malformed, so this parameter should be kept False unless standard compliance is paramount to your use case. |
| 13 | transparent | False | Makes the PDF background transparent instead of returning a white background. Only compatible with file formats that support transparency. |
| 14 | single_file | False | Converts only the first page of the PDF and does not append an index to the output file name. |
| 15 | output_file | uuid_generator() | Output filename, normally a string, but can also be a string generator. |
| 16 | poppler_path | None | Path to the poppler directory containing libraries and executable files. |
| 17 | grayscale | False | Returns grayscale images. |
| 18 | size | None | Size of the output images; using None for either dimension resizes while preserving the aspect ratio. Examples: size=400 fits the image to a 400×400 box, preserving aspect ratio; size=(400, None) makes the image 400 pixels wide, preserving aspect ratio; size=(500, 500) resizes the image to 500×500 pixels, not preserving aspect ratio. This behavior is derived directly from the -scale-to, -scale-to-x, and -scale-to-y parameters. |
| 19 | paths_only | False | Returns a list of image paths rather than preloaded images. |
| 20 | hide_annotations | False | Hides link bounding boxes and other PDF annotations. This is only implemented in pdftoppm at the moment, so it cannot be combined with pdftocairo flags. |
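The jpegopt parameter is not exercised in the tests below, so here is a minimal sketch of how its dictionary might be built. The helper name make_jpegopt and its default values are assumptions made for this example; only the three keys and their value ranges come from the parameter description above.

```python
def make_jpegopt(quality=75, progressive=False, optimize=False):
    """Build a jpegopt dict, validating the documented value ranges.

    The defaults here are illustrative choices, not pdf2image's own.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quality, bool) or not isinstance(quality, int) or not 0 <= quality <= 100:
        raise ValueError("quality must be an integer between 0 and 100")
    if not isinstance(progressive, bool) or not isinstance(optimize, bool):
        raise ValueError("progressive and optimize must be True or False")
    return {"quality": quality, "progressive": progressive, "optimize": optimize}


# Intended use (requires pdf2image and poppler; 'example.pdf' is a placeholder):
# from pdf2image import convert_from_path
# images = convert_from_path("example.pdf", fmt="jpeg",
#                            jpegopt=make_jpegopt(quality=90, optimize=True))
```

Validating the values up front gives a clearer error than letting a bad value reach the underlying poppler command.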

Below I test some of these parameters with the convert_from_path function in pdf2image 1.16.0.

First, import the necessary modules.

$ from pdf2image import convert_from_path, convert_from_bytes
$ import tempfile
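convert_from_bytes is imported above but not exercised in the tests that follow. As a minimal sketch of how it might be used, the helper below reads a PDF from disk and hands the raw bytes to a converter. The helper name pdf_bytes_to_images and its converter argument are inventions for this example; when converter is omitted it falls back to pdf2image.convert_from_bytes.

```python
from pathlib import Path


def pdf_bytes_to_images(pdf_path, converter=None, **options):
    """Read a PDF into memory and convert it from bytes.

    converter is injectable so the function can be tried without poppler;
    when omitted, pdf2image.convert_from_bytes is used.
    """
    if converter is None:
        # Imported lazily so this module loads even without pdf2image installed.
        from pdf2image import convert_from_bytes
        converter = convert_from_bytes
    data = Path(pdf_path).read_bytes()
    return converter(data, **options)


# Example call (requires pdf2image and poppler):
# images = pdf_bytes_to_images('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)
```

This form is useful when the PDF arrives as bytes in the first place, for example from an upload or a database, rather than as a file on disk.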

3. dpi

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)

$ images[0].info['dpi']
(300, 300)

The PDF was converted to 300 dpi images.
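The relationship between dpi and pixel size is simple arithmetic: pixels = inches × dpi. The sketch below assumes a US Letter page (8.5 × 11 inches), which is consistent with the 1700×2200 size reported at the default 200 dpi later in this post; the function name pixel_size is hypothetical.

```python
def pixel_size(width_in, height_in, dpi):
    """Approximate pixel dimensions of a page rendered at the given dpi."""
    return (round(width_in * dpi), round(height_in * dpi))


print(pixel_size(8.5, 11, 200))  # (1700, 2200)
print(pixel_size(8.5, 11, 300))  # (2550, 3300)
```

Going from 200 to 300 dpi therefore multiplies the pixel count by 2.25, which matters for uncompressed formats such as PPM.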

4. output_folder

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path)

$ images[0].filename
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpqldzldx2/46f65646-b9fe-409d-80ee-8f593e620d87-01.jpg'

The images were written to temporary files in the given folder instead of being held only in memory.

5. first_page

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)

$ len(images)
28

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', first_page=5)

$ len(images)
24

Only pages 5-28 were converted (pages 1-4 were skipped).

6. last_page

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)

$ len(images)
28

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', last_page=5)

$ len(images)
5

Only pages 1-5 were converted (pages 6-28 were skipped).
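The page counts observed in the first_page and last_page tests follow from treating both parameters as inclusive, 1-based bounds. The sketch below reproduces that arithmetic; the function name converted_page_count is hypothetical, and clamping out-of-range values to the document bounds is an assumption made for this example rather than something verified here.

```python
def converted_page_count(total_pages, first_page=None, last_page=None):
    """Number of pages in the inclusive, 1-based range [first_page, last_page]."""
    # Assumption: missing or out-of-range bounds are clamped to the document.
    first = 1 if first_page is None else max(first_page, 1)
    last = total_pages if last_page is None else min(last_page, total_pages)
    return max(last - first + 1, 0)


print(converted_page_count(28, first_page=5))                # 24
print(converted_page_count(28, last_page=5))                 # 5
print(converted_page_count(28, first_page=5, last_page=10))  # 6
```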

13. transparent

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', transparent=True)

$ images[0].mode
'RGB'

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='png', transparent=True)

$ images[0].mode
'RGBA'

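With fmt='png' the image has an alpha channel (mode 'RGBA'), so the background is transparent. JPEG does not support transparency, so with fmt='jpeg' the mode stays 'RGB'.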

14. single_file

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', single_file=True)

$ len(images)
1

Only the first page was converted.

15. output_file

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path, output_file='AAPL_')
    
$ images[0].filename
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpn1w7m6ev/AAPL_0001-01.jpg'

The image files were written with the “AAPL_” prefix.
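The generated name is more than just the prefix: a four-digit counter and a zero-padded page number are appended. The sketch below is inferred from the filename observed above, not taken from pdf2image's source, so treat the pattern as an assumption. In particular, the meaning of the four-digit counter (assumed here to be a per-thread index) and the page-number padding (assumed to match the number of digits in the page count) are guesses, and the function name is hypothetical.

```python
def expected_output_name(prefix, page, total_pages, ext="jpg", thread_index=1):
    """Guess the output file name for one page, based on the observed pattern.

    Assumed pattern: <prefix><4-digit counter>-<zero-padded page>.<ext>
    """
    page_digits = len(str(total_pages))  # e.g. 28 pages -> 2 digits
    return f"{prefix}{thread_index:04d}-{page:0{page_digits}d}.{ext}"


print(expected_output_name("AAPL_", page=1, total_pages=28))  # AAPL_0001-01.jpg
```

If exact file names matter, it is safer to read them back from the returned images (or from paths_only=True) than to predict them.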

17. grayscale

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', grayscale=True)

$ images[0].mode
'RGB'

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', grayscale=True)

$ images[0].mode
'L'

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='tiff', grayscale=True)

$ images[0].mode
'L'

Grayscale images (mode 'L') were produced when the format was ppm (the default) or tiff; with fmt='jpeg' the mode stayed 'RGB'.

18. size

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg')

$ images[0].size
(1700, 2200)

$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', size=(500, None))

$ images[0].size
(500, 648)

The image was resized to 500 pixels wide while maintaining the aspect ratio.
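The resulting height can be checked by hand: 2200 × 500 / 1700 ≈ 647.06. The sketch below rounds that value up, which is an assumption chosen because it reproduces the observed 648 (rounding to nearest would give 647); the function name scaled_height is hypothetical.

```python
import math


def scaled_height(orig_width, orig_height, target_width):
    """Height after scaling to target_width with the aspect ratio preserved.

    Rounding up is an assumption that matches the observed output above.
    """
    return math.ceil(orig_height * target_width / orig_width)


print(scaled_height(1700, 2200, 500))  # 648
```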

19. paths_only

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path)

$ images[0]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1700x2200>

$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path, paths_only=True)
    
$ images[0]
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpt3_1y3z_/639ff5f6-e666-4bde-820a-33c94f237e05-01.jpg'

When output_folder is also passed, paths_only=True returns the paths of the generated files without loading them into memory.
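In the tests above the output folder is a TemporaryDirectory, so the files are deleted when the with block exits and the returned paths no longer point at anything. If the images need to outlive the conversion, one option is to write them to a persistent directory. The following is a minimal sketch under that assumption; the helper name pdf_to_image_paths and its converter argument are inventions for this example, and by default it falls back to pdf2image.convert_from_path.

```python
from pathlib import Path


def pdf_to_image_paths(pdf_path, out_dir, converter=None, **options):
    """Convert a PDF into image files under out_dir and return their paths."""
    if converter is None:
        # Imported lazily so this module loads even without pdf2image installed.
        from pdf2image import convert_from_path
        converter = convert_from_path
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    return converter(str(pdf_path), output_folder=str(out_dir),
                     paths_only=True, **options)


# Example call (requires pdf2image and poppler; 'pages' is a placeholder folder):
# paths = pdf_to_image_paths('/tmp/files/Downloads/AAPL_10-Q.pdf', 'pages', fmt='jpeg')
```

Because only paths are returned, memory use stays low even for long documents, and each image can be opened one at a time later.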

Reference

pdf2image 1.16.0
pdf2image Overview
pdf2image.py
