Pdf to Excel with C#

Step by step to convert your PDF to Excel in C#

Nov 8, 2023 | Read time 9 minutes

Introduction

A recent project that I was involved in had a task to extract data from a PDF document using C#.

After we have extracted the table data, we can then put it into a Excel sheet of place it in a database table.

In this post, I will go over how to export data from a PDF and then save it to Excel.

We will be using PdfDodo’s API to do this.

Step 1: Get Pdfdodo API Key

PdfDodo uses a API keys to allow you to access the API. You can get your API key by going to the “Settings” section.

After retrieving your API key, you can pass the token to the API in the HTTP Authorization Header using X-API-KEY.

X-API-KEY: <apikey>

Step 2: Upload a PDF file

The next step after we have successfully grabbed our API key, we can upload our document and then call the analyze endpoint to get the data.

Make sure to add X-API-KEY in the Authorization Header.

To upload our PDF file, we can do this in C# as follows and call the api/uploadfile endpoint.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using System.IO;

class Program
{
    static async Task Main()
    {
        var apiKey = "YOUR_API_KEY"; // Replace with your actual API key
        var filePath = "/path/to/your/file.pdf"; // Replace with the path to your PDF file
        var url = "https://app.pdfdodo.com/api/uploadfile"; // The API endpoint

        using (var httpClient = new HttpClient())
        {
            using (var form = new MultipartFormDataContent())
            {
                httpClient.DefaultRequestHeaders.Add("x-api-key", apiKey);

                // Assuming the file is on the disk, and we can open a read stream
                using (var fs = File.OpenRead(filePath))
                {
                    using (var streamContent = new StreamContent(fs))
                    {
                        using (var fileContent = new ByteArrayContent(await streamContent.ReadAsByteArrayAsync()))
                        {
                            fileContent.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("multipart/form-data");

                            // "file" is the name of the parameter for the file upload
                            form.Add(fileContent, "file", Path.GetFileName(filePath));

                            // Post the form to the URL
                            HttpResponseMessage response = await httpClient.PostAsync(url, form);

                            if (response.IsSuccessStatusCode)
                            {
                                Console.WriteLine("File uploaded successfully!");
                            }
                            else
                            {
                                Console.WriteLine("File upload failed!");
                            }

                            // Read the response content and output it to the console
                            string responseContent = await response.Content.ReadAsStringAsync();
                            Console.WriteLine(responseContent);
                        }
                    }
                }
            }
        }
    }
}

Keep note of the documentId.

Step 3: Call the analyze endpoint

The next step is to do a GET request to the analyze endpoint, /api/analyze, to get the document data.

For example, lets say our document id is 123e4567-e89b-12d3-a456-426614174000, we can do the following Python code to analyze and get the document data:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // Replace YOUR_API_KEY_HERE with your actual API key
        string apiKey = "YOUR_API_KEY_HERE";
        string documentId = "123e4567-e89b-12d3-a456-426614174000";
        string baseUri = "https://app.pdfdodo.com/api/analyze";
        string uri = $"{baseUri}?documentid={documentId}/results";

        using (var httpClient = new HttpClient())
        {
            // Add the API key to the request headers
            httpClient.DefaultRequestHeaders.Add("x-api-key", apiKey);
            httpClient.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));

            // Send the GET request
            var response = await httpClient.GetAsync(uri);

            // Ensure we get a success status code
            response.EnsureSuccessStatusCode();

            // Read the response content
            var responseContent = await response.Content.ReadAsStringAsync();
            Console.WriteLine(responseContent);
        }
    }
}

This will give you a JSON response of the data in your PDF:

{
  "tables": [
    {
      "title": "The title of your table",
      "tableId": "a0f229f6-ade7-4637-bbff-e31fbbd64ec5",
      "pageNumber": 1,
      "rows": [
        [
          "0",
          "Money In ",
          "$1,880.00 "
        ],
        [
          "1",
          "Money Out ",
          "$150.25 "
        ],
        [
          "2",
          "You Received ",
          "$1,729.75 "
        ]
      ],
      "left": 0.6953286,
      "top": 0.2655254,
      "width": 0.2580709,
      "height": 0.11655326,
      "mergedTables": []
    }    
  ]
}

We can then use this JSON response to save it to a database.

Step 4: Save JSON as Excel

The last step is to take the above JSON response and save it as a excel spreadsheet.

In my case, I had to save it as a excel file. We can use Newtonsoft.Json to parse the JSON and the “Microsoft Excel Object Library” to create Excel files.

To work with the Microsoft Excel Object Library, also known as Microsoft.Office.Interop.Excel, you’ll need to add a reference to the library in your project

using Microsoft.Office.Interop.Excel;
using Newtonsoft.Json.Linq;
using _Excel = Microsoft.Office.Interop.Excel;

public void JsonToExcel(string jsonString)
{
    // Parse the JSON string
    JObject jsonObject = JObject.Parse(jsonString);

    // Start Excel and get Application object.
    var excelApp = new _Excel.Application();
    excelApp.Visible = true; // You can set this to false if you don't want the Excel window to be visible

    // Create a new, empty workbook and add a worksheet.
    _Excel.Workbook workbook = excelApp.Workbooks.Add(Type.Missing);
    _Excel.Worksheet worksheet = (_Excel.Worksheet)workbook.Worksheets[1];

    // Get the table rows from JSON
    JArray rows = (JArray)jsonObject["tables"][0]["rows"];

    // Iterate over each row and column to write the data to the sheet
    for (int i = 0; i < rows.Count; i++)
    {
        JArray row = (JArray)rows[i];
        for (int j = 0; j < row.Count; j++)
        {
            // Excel is 1-indexed
            worksheet.Cells[i + 1, j + 1] = row[j];
        }
    }

    // Save the workbook and Quit Excel
    workbook.SaveAs(@"C:\path\to\your\spreadsheet.xlsx");
    workbook.Close();
    excelApp.Quit();
}

Now this option is free and does not require external libraries to work with Excel.

The main limitation with the above is that you will need Office (Excel, Word, etc) to be installed on the machine that is running the above code.

If you do not want the interop dependency (installing Office on your machine), another option is the OpenXML SDK library:

https://github.com/dotnet/Open-XML-SDK

This just exists as a DLL reference and you can install it through Nuget. https://www.nuget.org/packages/DocumentFormat.OpenXml

using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;
using Newtonsoft.Json;

class Program
{
    static void Main(string[] args)
    {
        // Your JSON string
        string jsonString = @"{
          'tables': [
            {
              'title': 'The title of your table',
              'tableId': 'a0f229f6-ade7-4637-bbff-e31fbbd64ec5',
              'pageNumber': 1,
              'rows': [
                ['0', 'Money In', '$1,880.00'],
                ['1', 'Money Out', '$150.25'],
                ['2', 'You Received', '$1,729.75']
              ]
            }    
          ]
        }";

        // Deserialize the JSON string into a C# object
        var response = JsonConvert.DeserializeObject<JsonResponse>(jsonString);

        // Create a new Excel file
        string filePath = "output.xlsx";
        using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Create(filePath, SpreadsheetDocumentType.Workbook))
        {
            // Add a WorkbookPart to the document.
            WorkbookPart workbookpart = spreadsheetDocument.AddWorkbookPart();
            workbookpart.Workbook = new Workbook();

            // Add a WorksheetPart to the WorkbookPart.
            WorksheetPart worksheetPart = workbookpart.AddNewPart<WorksheetPart>();
            worksheetPart.Worksheet = new Worksheet(new SheetData());

            // Add Sheets to the Workbook.
            Sheets sheets = spreadsheetDocument.WorkbookPart.Workbook.AppendChild(new Sheets());

            // Append a new worksheet and associate it with the workbook.
            Sheet sheet = new Sheet()
            {
                Id = spreadsheetDocument.WorkbookPart.GetIdOfPart(worksheetPart),
                SheetId = 1,
                Name = response.Tables[0].Title
            };
            sheets.Append(sheet);

            SheetData sheetData = worksheetPart.Worksheet.GetFirstChild<SheetData>();

            // Add header row to the Excel file
            Row headerRow = new Row();
            headerRow.Append(
                new Cell() { CellValue = new CellValue("Id"), DataType = CellValues.String },
                new Cell() { CellValue = new CellValue("Description"), DataType = CellValues.String },
                new Cell() { CellValue = new CellValue("Amount"), DataType = CellValues.String }
            );
            sheetData.AppendChild(headerRow);

            // Add the data rows to the Excel file
            foreach (var rowData in response.Tables[0].Rows)
            {
                Row newRow = new Row();
                foreach (string cellValue in rowData)
                {
                    newRow.Append(new Cell() { CellValue = new CellValue(cellValue), DataType = CellValues.String });
                }
                sheetData.AppendChild(newRow);
            }

            workbookpart.Workbook.Save();
        }

        Console.WriteLine("Excel file created successfully!");
    }
}

// Define classes to match the JSON structure
public class JsonResponse
{
    public List<Table> Tables { get; set; }
}

public class Table
{
    public string Title { get; set; }
    public List<List<string>> Rows { get; set; }
}

Final thoughts

Overall to convert PDF to Excel, we can do the following steps:

  • Extract the data from the PDF using PdfDodo API
  • Parse the JSON response using Newtonsoft.Json library
  • Create the Excel file using Microsoft Excel Object Library.

If you do not want to have interop dependencies, then consider using the Open XML SDK.

👋 About the Author

G'day! I am Ken Ito a software engineer based in Australia. I have worked as a software engineer for more than 10 years ranging from roles such as tech lead, data analyst to individual contributor. I have experience in creating code (python, javascript) to extract data from PDFs for ETL and financial statement analysis for accounting purposes.

During those years, I have worked for Government departments, companies like NTT (Nippon Telegraph and Telephone), Bupa and not-for-profit organizations.

Follow along on Twitter and LinkedIn

See also

Step by Step to convert your scanned PDFs to Excel without losing formatting.

3 Ways to Convert Pdf to Excel Without Losing Formatting

Sep 21, 2023

Step by Step to convert your scanned PDFs to Excel without losing formatting.

Step by step to convert your scanned PDF bank statement to Excel format using Python

How to Convert PDF to CSV with python

Oct 26, 2023

Step by step to convert your scanned PDF bank statement to Excel format using Python

Guide to get tables from PDFs using Python

Extract Tables from Pdf using Python

Oct 24, 2023

Guide to get tables from PDFs using Python