How to Convert PDF to CSV with python
Step by step to convert your scanned PDF bank statement to Excel format using Python
Oct 26, 2023 | Read time 9 minutes📗 Table of contents
Introduction
A common data analysis problem that I get is taking data from PDF documents and converting it to CSV.
Now this can be easy if you are dealing with one or two PDF, but for my case I had about a dozen or so.
(Quite painful!)
Now when something that I would have to do manually more than 5 times, I would try to automate it.
In this blog post, I will show how to extract PDF data and convert it to CSV using Python. We will be using the PdfDodo API to do so.
Pdfdodo provides you the ability to extract data from PDF using custom AI and OCR algorithms.
Step 1: Get Pdfdodo API Key
PdfDodo uses a API keys to allow you to access the API. You can get your API key by going to the “Settings” section.
After retrieving your API key, you can pass the token to the API in the HTTP Authorization Header using X-API-KEY
.
X-API-KEY: <apikey>
Step 2: Upload a PDF file
The next step after we have successfully grabbed our API key, we can upload our document and then call the analyze endpoint to get the data.
Make sure to add X-API-KEY
in the Authorization Header.
To upload our PDF file, we can do this in Python as follows.
pip install requests
After we have installed the requests
library, use the following code to call the Upload file endpoint.
import requests
url = "https://app.pdfdodo.com/api/uploadfile"
headers = {
"Content-Type": "multipart/form-data",
"x-api-key": "YOUR_API_KEY"
}
files = {
'file': ('file.pdf', open('/path/to/your/file.pdf', 'rb'))
}
response = requests.post(url, headers=headers, files=files)
The response will be a JSON result. For example:
{
"documentId": "123e4567-e89b-12d3-a456-426614174000",
"documentName": "your-test-file.pdf"
}
Keep note of the documentId
.
Step 3: Call the analyze endpoint
The last step is to do a GET
request to the analyze endpoint, /api/analyze
, to get the document data.
For example, lets say our document id is 123e4567-e89b-12d3-a456-426614174000
, we can do the following Python code to analyze and get the document data:
import requests
# API endpoint
url = "https://app.pdfdodo.com/api/analyze?documentid=123e4567-e89b-12d3-a456-426614174000"
# Headers
headers = {
"Content-Type": "application/json",
"x-api-key": "YOUR_API_KEY_HERE"
}
# Making the GET request
response = requests.get(url, headers=headers)
# Printing the response
print(response.text)
This will give you a JSON response of the data in your PDF:
{
"tables": [
{
"title": "The title of your table",
"tableId": "a0f229f6-ade7-4637-bbff-e31fbbd64ec5",
"pageNumber": 1,
"rows": [
[
"0",
"Money In ",
"$1,880.00 "
],
[
"1",
"Money Out ",
"$150.25 "
],
[
"2",
"You Received ",
"$1,729.75 "
]
],
"left": 0.6953286,
"top": 0.2655254,
"width": 0.2580709,
"height": 0.11655326,
"mergedTables": []
}
]
}
Convert the response to CSV
After you have the response, the next task is to save it as CSV.
We will need the json
module and csv
module.
An example of the code is below.
import json
import csv
# Your JSON response from the API call
# ...
# ...
# ...
# Parse the JSON string
data = json.loads(response.text)
# Extract rows from table 1
rows = data["tables"][0]["rows"]
# Write to CSV
with open("output.csv", "w", newline="") as csvfile:
writer = csv.writer(csvfile)
writer.writerows(rows)
The above code takes the rows from table 1 (data["tables"][0]["rows"]
) and then use the csv
library to write out the rows.
After executing the above code, you will get an output.csv
file in the same directory as the script.
Alternative: Using Tabula library (⭐⭐⭐ / 5)
If you do not want to use a web API and prefer open source, the best library would be Tabula.
Now since we are in Python we need to use tabula-py
. Since Tabula was written in Java, tabula-py
is just wrapper.
tabula-py is a simple Python wrapper of tabula-java, which can read table of PDF. You can read tables from PDF and convert them into pandas’ DataFrame. tabula-py also converts a PDF file into CSV/TSV/JSON file.`
To get started, we need to install the Python tabula-py library with the following command:
pip install tabula-py
Now we also need to make sure that Java is installed. Additionally check that you can use java command on your terminal (eg setting the PATH
).
import tabula
# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')
# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("<url to pdf>")
# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
Review
I found the results of Tabula to be ok. The biggest issue is that it does not work with scanned PDFs.
There are some issues I found:
- For PDFs with vertical headers or unconventional structures, Tabula may miss content or provide no output.
- When tables have unclear lines, spanned cells, or are oriented counter-clockwise, Tabula can misalign data or be unreliable.
- For PDFs with two tables on one page the results seem random.
Alternative 2: Using PdfPlumber library (⭐⭐⭐⭐ / 5)
PdfPlumber is another open source library that you can convert PDF to CSV. Unlike Tabula where the csv export function is built-in, we need to import the csv
module.
We can get it from here: https://pypi.org/project/pdfplumber/
To get started, install with pip
pip install pdfplumber
We then can use the following code:
import pdfplumber
import csv
def extract_tables_from_pdf_to_csv(pdf_path, csv_path):
# Open the PDF file
with pdfplumber.open(pdf_path) as pdf:
# This will accumulate all table rows from all pages
all_rows = []
# Iterate over all the pages in the PDF
for page in pdf.pages:
# Extract tables from the current page
tables = page.extract_tables()
# Append rows from each table to all_rows
for table in tables:
all_rows.extend(table)
# Write the accumulated rows to a CSV file
with open(csv_path, 'w', newline='') as csv_file:
writer = csv.writer(csv_file)
writer.writerows(all_rows)
# Usage
pdf_file_path = 'path_to_input.pdf'
csv_file_path = 'path_to_output.csv'
extract_tables_from_pdf_to_csv(pdf_file_path, csv_file_path)
Results
I found this to be a bit better than Tabula, but not by a big margin. Still it does not work with scanned PDFs.
Additionally, there are more dependencies we have to manage. Pdfplumber internally uses pdf2image, which requires poppler-utils. Ensure you have it installed in your system.
I found PdfPlumber does not do a great job in the following scenarios:
- The table contains vertical text, leading to messed up header text.
- The table is rotated, resulting in unusable output.
- Two similar tables are present on one page, causing unusable output.
- Tables have two or more columns with no values.
- Depending on the PDF document, you will need to adjust the “extraction” settings - theres more than a dozen of them - so I had to spend a bit of time tweaking these to get the result that I am after.
Final thoughts
Overall, if you want to automate PDF to CSV conversion, then use the PDFDodo API with Python using the request
and csv
modules.
It handles scanned and text PDFs and the results are pretty accurate.
As an alternative, we can use open source libraries like tabula-py or pdfplumber. The problem is that they require a bit of setup time and tweeking to get good results. Also they do not work with scanned PDFs!
👋 About the Author
G'day! I am Ken Ito a software engineer based in Australia. I have worked as a software engineer for more than 10 years ranging from roles such as tech lead, data analyst to individual contributor. I have experience in creating code (python, javascript) to extract data from PDFs for ETL and financial statement analysis for accounting purposes.
During those years, I have worked for Government departments, companies like NTT (Nippon Telegraph and Telephone), Bupa and not-for-profit organizations.
Extract Tables from Pdf using Python
Oct 24, 2023Guide to get tables from PDFs using Python
3 Ways to Convert Pdf to Excel Without Losing Formatting
Sep 21, 2023Step by Step to convert your scanned PDFs to Excel without losing formatting.
Top 3 ways to Download Monzo Bank Statement as CSV
Oct 12, 2023Reviews of the top 3 ways to Download Monzo Bank Statement as CSV