In 2018 I was using Python at work and needed to convert PDF files to image files for a project, so I researched the available libraries.
I first tried ImageMagick but gave up on it because the converted images sometimes contained a bit of noise, which I would have had to clean up with extra processing in a later step.
The next one I tried was pdf2image, which produced cleaner images, so I decided to use it. Incidentally, pdf2image is a Python module that wraps pdftoppm and pdftocairo, command-line utilities that convert PDF files to image files.
Installing poppler
To use pdf2image, you must install poppler.
Ubuntu
sudo apt-get install poppler-utils
MacOS
brew install poppler
Windows
- Download the latest package from http://blog.alivate.com.au/poppler-windows/
- Extract the package
- Move the extracted directory to the desired place on your system
- Add the bin/ directory to your PATH
- Test that all went well by opening cmd and making sure that you can call pdftoppm -h
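The same check can also be done from Python before calling pdf2image. The following is a minimal sketch that uses only the standard library; find_poppler_tools is a hypothetical helper name, not part of pdf2image.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdftocairo")):
    """Map each poppler command-line tool to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Print where each tool was found, or a hint if it is missing.
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND (check your PATH)'}")
```

If a tool is reported as missing even though poppler is installed, the poppler_path parameter described below may be an alternative to changing PATH.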
How to use
pdf2image provides two functions for converting PDF files to image files: convert_from_path and convert_from_bytes. Their parameters are listed in the following table.
number | parameter | default | description |
1 | pdf_path | None | Path to the PDF file. Can be a string or a pathlib.Path object. |
2 | pdf_bytes | None | Bytes of the PDF file. |
3 | dpi | 200 | Dots per inch. This can be seen as the relative resolution of the output images; higher is better, but anything above 300 is usually not discernible to the naked eye. Keep in mind that this directly affects the size of the output images when using file formats without compression (like PPM). |
4 | output_folder | None | Output directory for the generated files, should be seen more as a “working directory” than an output folder. The converted images will be written there to save system memory. |
5 | first_page | None | First page that will be converted. first_page=2 will skip page 1. |
6 | last_page | None | Last page that will be converted. last_page=2 will skip all pages after page 2. |
7 | fmt | “ppm” | File format of the output images. Supported values are ppm, jpeg, png and tiff. |
8 | jpegopt | None | Configuration for the JPEG output format; only used with fmt='jpeg'. quality: the JPEG quality value, an integer between 0 and 100. progressive: True for progressive JPEG output, False for non-progressive. optimize: True to compute optimal Huffman coding tables, which creates smaller files but makes an extra pass over the data; False to use the default Huffman tables. |
9 | thread_count | 1 | Number of threads to use when converting the PDF. Limited to the actual number of pages. |
10 | userpw | None | Password for the PDF if it is password-protected. |
11 | use_cropbox | False | Uses the PDF cropbox instead of the default mediabox. This is a rather obscure feature that should be set to True when the module does not seem to work with your data. |
12 | strict | False | Raises PDFSyntaxError when the PDF is partially malformed. Most PDFs are partially malformed, so this parameter should be left as False unless standards compliance is paramount to your use case. |
13 | transparent | False | Instead of returning a white background, make the PDF background transparent. Only compatible with file formats that support transparency. |
14 | single_file | False | Converts only the first page of the PDF and does not append an index to the output file name. |
15 | output_file | uuid_generator() | Output filename, normally string, but can take a string generator. |
16 | poppler_path | None | Path to the poppler directory containing libraries and executable files. |
17 | grayscale | False | Returns grayscale images. |
18 | size | None | Size of the output images; using None for either dimension resizes while preserving the aspect ratio. Examples of valid sizes: size=400 fits the image to a 400×400 box, preserving aspect ratio; size=(400, None) makes the image 400 pixels wide, preserving aspect ratio; size=(500, 500) resizes the image to 500×500 pixels, not preserving aspect ratio. This behavior is derived directly from the -scale-to, -scale-to-x, and -scale-to-y parameters. |
19 | paths_only | False | Returns a list of image paths rather than preloaded images. |
20 | hide_annotations | False | Hide link bounding boxes and other PDF annotations. This is only implemented in pdftoppm at the moment so it cannot be combined with pdftocairo flags. |
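Both functions take the keyword parameters listed in the table; only the first argument differs (a path or the raw bytes). As a rough sketch, the hypothetical helper pdf_to_images below (not part of pdf2image) picks the function based on the type of its input. The path and the option values in the usage comment are placeholders.

```python
from pathlib import Path


def pdf_to_images(source, **kwargs):
    """Convert a PDF given as a path (str / Path) or as raw bytes.

    Hypothetical convenience wrapper: keyword arguments are passed through
    unchanged to pdf2image.
    """
    # Imported lazily so this module can be loaded without pdf2image installed.
    from pdf2image import convert_from_bytes, convert_from_path

    if isinstance(source, (bytes, bytearray)):
        return convert_from_bytes(bytes(source), **kwargs)
    if isinstance(source, (str, Path)):
        return convert_from_path(source, **kwargs)
    raise TypeError(f"unsupported PDF source type: {type(source).__name__}")


# Example call (placeholder path and option values):
# images = pdf_to_images(
#     "/tmp/files/Downloads/AAPL_10-Q.pdf",
#     dpi=200,
#     fmt="jpeg",
#     jpegopt={"quality": 85, "progressive": True, "optimize": True},
# )
```

Passing bytes is convenient when the PDF comes from an upload or a network response and never touches the disk.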
I tested some of these parameters using the convert_from_path function in version 1.16.0. The numbered headings below correspond to the numbers in the table.
First, import the necessary modules.
$ from pdf2image import convert_from_path, convert_from_bytes
$ import tempfile
3. dpi
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)
$ images[0].info['dpi']
(300, 300)
The pages were converted to 300 dpi images.
4. output_folder
$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path)
$ images[0].filename
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpqldzldx2/46f65646-b9fe-409d-80ee-8f593e620d87-01.jpg'
The pages were written to the temporary directory as image files, and the returned images refer to those files.
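Because a TemporaryDirectory is removed when its with block exits, pages that should be kept have to be saved elsewhere while the block is still open. A minimal sketch, assuming the returned objects are Pillow images (which provide a save method); save_pages, the output directory, and the file-name pattern are illustrative choices, not part of pdf2image.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page", ext="jpg"):
    """Save each page image as <out_dir>/<stem>-<NN>.<ext> and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{stem}-{number:02d}.{ext}"
        image.save(target)  # Pillow infers the file format from the extension
        saved.append(target)
    return saved


# Intended use, inside the with block so the temporary files still exist
# (the destination directory is a placeholder):
# with tempfile.TemporaryDirectory() as path:
#     images = convert_from_path(pdf_path, fmt="jpeg", output_folder=path)
#     save_pages(images, "/tmp/files/pages", stem="AAPL_10-Q")
```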
5. first_page
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)
$ len(images)
28
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', first_page=5)
$ len(images)
24
Pages 5-28 were converted (pages 1-4 were skipped).
6. last_page
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', dpi=300)
$ len(images)
28
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', last_page=5)
$ len(images)
5
Pages 1-5 were converted (pages 6-28 were skipped).
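first_page and last_page can be combined to convert a long document in batches, so that only a few pages are held in memory at once. The sketch below only computes the page ranges; page_batches is a hypothetical helper, and the batch size of 10 is an arbitrary choice. The page count of 28 is the one reported for this document.

```python
def page_batches(total_pages, batch_size):
    """Yield inclusive, 1-based (first_page, last_page) pairs covering all pages."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


print(list(page_batches(28, 10)))
# [(1, 10), (11, 20), (21, 28)]

# Each pair can then be passed on, for example:
# for first, last in page_batches(28, 10):
#     images = convert_from_path(pdf_path, first_page=first, last_page=last)
```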
13. transparent
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', transparent=True)
$ images[0].mode
'RGB'
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='png', transparent=True)
$ images[0].mode
'RGBA'
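With fmt='png' the images were converted with a transparent background (mode RGBA). With fmt='jpeg' the mode stayed RGB, because JPEG does not support transparency.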
14. single_file
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', single_file=True)
$ len(images)
1
Only the first page was converted to an image.
15. output_file
$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path, output_file='AAPL_')
$ images[0].filename
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpn1w7m6ev/AAPL_0001-01.jpg'
The image files were written with the “AAPL_” prefix.
17. grayscale
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', grayscale=True)
$ images[0].mode
'RGB'
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', grayscale=True)
$ images[0].mode
'L'
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='tiff', grayscale=True)
$ images[0].mode
'L'
The images were converted to grayscale (mode L) when the format was ppm or tiff, but stayed RGB when the format was jpeg.
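If grayscale JPEG output is needed, one possible workaround is to convert the returned images afterwards with Pillow's Image.convert method. This is a sketch of that alternative, not something pdf2image does itself; to_grayscale is a hypothetical helper name.

```python
def to_grayscale(images):
    """Return the images in mode 'L', converting those that are not already grayscale."""
    return [image if image.mode == "L" else image.convert("L") for image in images]


# Intended use:
# images = convert_from_path(pdf_path, fmt="jpeg")
# gray = to_grayscale(images)
# gray[0].mode  # expected to be 'L' after the conversion
```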
18. size
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg')
$ images[0].size
(1700, 2200)
$ images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', size=(500, None))
$ images[0].size
(500, 648)
The image was resized to a width of 500 pixels while maintaining the aspect ratio.
19. paths_only
$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path)
$ images[0]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1700x2200>
$ with tempfile.TemporaryDirectory() as path:
      images = convert_from_path('/tmp/files/Downloads/AAPL_10-Q.pdf', fmt='jpeg', output_folder=path, paths_only=True)
$ images[0]
'/var/folders/zc/df1_19fs0jlgw3_nmlcdys800000gn/T/tmpt3_1y3z_/639ff5f6-e666-4bde-820a-33c94f237e05-01.jpg'
When output_folder was passed together with paths_only=True, the paths of the temporary files were returned without loading the images into memory.
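With paths_only=True there are no image objects to save, so keeping the output means copying the files themselves before the temporary directory is cleaned up. A minimal sketch using only the standard library; keep_files and the destination directory are illustrative assumptions.

```python
import shutil
from pathlib import Path


def keep_files(paths, dest_dir):
    """Copy the files at the given paths into dest_dir and return the new paths."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created copy.
    return [Path(shutil.copy2(path, dest_dir)) for path in paths]


# Intended use, inside the with block (the destination is a placeholder):
# with tempfile.TemporaryDirectory() as path:
#     paths = convert_from_path(pdf_path, fmt="jpeg", output_folder=path, paths_only=True)
#     kept = keep_files(paths, "/tmp/files/pages")
```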