Extract Table Data from PDF Documents in C#

Quick Start Guide
What You Will Need

Visual Studio

.NET Core Console App

NuGet Packages:

Controls Referenced

Document Solutions for PDF - A C# .NET PDF document API library allowing developers to programmatically create and manipulate PDF documents at scale.

Document Solutions for Excel, .NET Edition - A high-speed C# .NET Excel spreadsheet API library, allows you to programmatically create, edit, import, and export Excel with no dependencies on MS Excel.

Tutorial Concept Learn how to use a .NET PDF API to access tables in PDF files and extract tabular data for export to CSV files or other formats, such as XLSX, as needed.

In today's data-driven world, seamlessly extracting structured table data from PDF documents has become a crucial task for developers. With Document Solutions for PDF (DsPdf, previously GcPdf), you can effortlessly unlock the hidden treasures of information buried within those PDFs programmatically using C#.

Consider the popularity of PDFs, one of the most commonly used document formats, and the vast amount of data they can contain within their tables. Companies and organizations have been using PDF documents to review financial analyses, stock trends, contact details, and beyond. Now, picture situations like the examination of quarterly reports over multiple years, where the accumulation of data takes center stage.

Getting data from these reports may initially seem easy (copy/paste). Still, because of the structure of PDF files, a simple copy and paste rarely captures a table's worth of data without significant manipulation and modification.

Couple this with the possibility of copying and pasting from many other documents, and it is a recipe for a very long day (or even a week or more, depending on the data required!). Handling this type of requirement efficiently requires a tool that can automate the process, and the C# .NET DsPdf API Library is the perfect tool for the job!

This article is for developers who want to decrease the time it takes to collect data and improve the accuracy of the data-gathering process. The examples will help developers gain an understanding of the DsPdf tool in order to access Table(s) in PDF files and extract tabular data for export to CSV files or other formats, such as XLSX, as needed.

Try it for yourself! Download Document Solutions for PDF Today!

Important Information About Tables in PDF Document

Tables, much like the PDF file format itself, are a nearly ubiquitous means of data presentation. Nevertheless, it's essential to understand that a PDF document inherently lacks the concept of tables; in other words, the tables you see within a PDF are purely visual elements.

These PDF 'tables' differ from what we commonly encounter in applications like MS Excel or MS Word. Instead, they are constructed through a combination of operators responsible for rendering text and graphics in specific locations, resembling a tabular structure.

This means that the traditional notions of rows, columns, and cells are foreign to a PDF file, with no underlying code components to facilitate the identification of these elements. So, let's delve into how DsPdf's C# API library can help us achieve this task!
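To make this concrete, here is a minimal sketch of what "recovering a table" from positioned text involves. It is not DsPdf's algorithm; the fragment list, the `GroupIntoRows` helper, and the 2-point tolerance are all hypothetical, chosen only to show that rows have to be inferred from coordinates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for what a PDF page really holds: text runs with positions,
// not rows or cells. Coordinates are in points, with Y measured from the top of the page.
var fragments = new List<(float X, float Y, string Text)>
{
    (300f, 100f, "Qty"), (50f, 100f, "Item"),
    (50f, 120f, "Widget"), (300f, 121f, "4"),
    (300f, 140f, "7"), (50f, 140f, "Gadget"),
};

// Put fragments whose Y lies within `tolerance` points of a row's first fragment
// into the same row, then order each row from left to right.
List<List<string>> GroupIntoRows(
    IEnumerable<(float X, float Y, string Text)> items, float tolerance)
{
    var rows = new List<List<(float X, float Y, string Text)>>();
    foreach (var item in items.OrderBy(f => f.Y))
    {
        var current = rows.LastOrDefault();
        if (current != null && Math.Abs(item.Y - current[0].Y) <= tolerance)
            current.Add(item);
        else
            rows.Add(new List<(float X, float Y, string Text)> { item });
    }
    return rows.Select(r => r.OrderBy(f => f.X).Select(f => f.Text).ToList()).ToList();
}

var table = GroupIntoRows(fragments, tolerance: 2f);
foreach (var row in table)
    Console.WriteLine(string.Join(" | ", row));
```

Even this toy version needs a tolerance to decide when two text runs share a row, which is the kind of tuning a real table recognizer has to expose.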

How to Extract Table Data from PDF Documents Programmatically Using C#

  1. Create a .NET Core Console Application with DsPdf Included
  2. Load the Sample PDF that Contains a Data Table
  3. Define Table Recognition Parameters
  4. Get the Table Data
  5. Save Extracted PDF Table Data to Another File Type (CSV)
  6. Bonus: Format the Exported PDF Table Data in an Excel (XLSX) File

Be sure to download the sample application and try the detailed implementation of the use case scenario and code snippets described in this blog piece.

Create a .NET Core Console Application with DsPdf Included

Create a .NET Core Console application, right-click 'Dependencies,' and select 'Manage NuGet Packages'. Under the 'Browse' tab, search for 'GrapeCity.Documents.Pdf' and click 'Install'.

.NET PDF API NuGet Package
'Document Solutions' was previously known as 'GrapeCity Documents'; the older product name currently remains on our NuGet packages.

While installing, you will receive a 'License Acceptance' dialog. Click 'I Accept' to continue.

.NET PDF Library NuGet License Acceptance

In the Program file, import the following namespaces:

using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition;

Load the Sample PDF that Contains a Data Table

Create a new GcPdfDocument instance, then invoke its Load method to load the PDF document that contains the data table to be parsed.

using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    // Initialize GcPdf
    var pdfDoc= new GcPdfDocument();
    // Load a PDF document
    pdfDoc.Load(fs);
}

In this example, we will use this PDF:

Original PDF

Define Table Recognition Parameters

Create a RectangleF that defines the approximate table bounds in the PDF document. The values are given in inches and multiplied by the DPI constant (72) to convert them to the document's units.

const float DPI = 72;
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    // Initialize GcPdf
    var pdfDoc= new GcPdfDocument();
    // Load a PDF document
    pdfDoc.Load(fs);
    
    // The approx table bounds:
    var tableBounds = new RectangleF(0, 2.5f * DPI, 8.5f * DPI, 3.75f * DPI);
}
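The inch-to-unit arithmetic in the bounds above can be wrapped in a small helper. This is only a sketch: `FromInches` is a hypothetical name, not part of DsPdf, and the comments assume the origin is at the top-left of the page.

```csharp
using System;
using System.Drawing;

const float PointsPerInch = 72f; // same value as the DPI constant above

// Hypothetical helper (not part of DsPdf): build a RectangleF from values in inches.
// RectangleF takes (x, y, width, height), so the last two arguments are sizes, not corners.
RectangleF FromInches(float x, float y, float width, float height) =>
    new RectangleF(x * PointsPerInch, y * PointsPerInch,
                   width * PointsPerInch, height * PointsPerInch);

// Same region as tableBounds: starts 2.5" down the page, 8.5" wide, 3.75" tall.
var bounds = FromInches(0f, 2.5f, 8.5f, 3.75f);
Console.WriteLine($"X={bounds.X}, Y={bounds.Y}, Width={bounds.Width}, Height={bounds.Height}");
```

Because the third and fourth arguments are a width and a height, the region ends at `bounds.Bottom`, which is 6.25 inches (450 units) down the page, not 3.75 inches.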

To help table recognition within the defined bounds, we use the TableExtractOptions class, which lets us fine-tune recognition to account for idiosyncrasies of the table's formatting, such as column width, row height, and the distance between rows or columns.

// TableExtractOptions allows fine-tuning table recognition to account for
// specifics of the table formatting:
var tableExtrctOpt = new TableExtractOptions();
var GetMinimumDistanceBetweenRows = tableExtrctOpt.GetMinimumDistanceBetweenRows;

// In this particular case, we slightly increase the minimum distance between rows
// to make sure cells with wrapped text are not mistaken for two cells:
tableExtrctOpt.GetMinimumDistanceBetweenRows = (list) =>
{
    var res = GetMinimumDistanceBetweenRows(list);
    return res * 1.2f;
};
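The snippet above relies on a general C# pattern: save the current delegate in a local variable, then assign a new delegate that calls the saved one. The sketch below shows the pattern in isolation. The `Func<IReadOnlyList<float>, float>` type and the "smallest gap" default are stand-ins chosen for illustration; the article does not show the real delegate type or its default logic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a library-provided default (an assumption, not DsPdf's logic):
// return the smallest gap in the list.
Func<IReadOnlyList<float>, float> getMinGap = gaps => gaps.Min();

// Capture the current delegate BEFORE reassigning, so the wrapper calls the
// original rather than itself (which would recurse forever).
var defaultGetMinGap = getMinGap;
getMinGap = gaps => defaultGetMinGap(gaps) * 1.2f;

// The wrapped delegate now returns the default result scaled up by 20%.
var result = getMinGap(new List<float> { 10f, 12f, 15f });
Console.WriteLine(result);
```

Scaling the default instead of hard-coding a distance keeps the adjustment relative to whatever the default computes for each table.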

Get the PDF’s Table Data

Create a list to hold table data from the PDF pages.

// CSV: list to keep table data from all pages:
var data = new List<List<string>>();

Invoke the GetTable method with the table bounds defined in the previous step to make DsPdf search for a table inside the specified rectangle.

for (int i = 0; i < pdfDoc.Pages.Count; ++i)
{
  // Get the table at the specified bounds:
  var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
}

Access each cell in the table using the ITable.GetCell(rowIndex, colIndex) method. Use the Rows.Count and Cols.Count properties to loop through the extracted table cells.

for (int i = 0; i < pdfDoc.Pages.Count; ++i)
{
  // Get the table at the specified bounds:
  var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
  if (itable == null)
    continue;

  // CSV: start at row 1 so the header row is ignored:
  for (int row = 1; row < itable.Rows.Count; ++row)
  {
    var rowData = new List<string>();
    for (int col = 0; col < itable.Cols.Count; ++col)
    {
      // GetCell may return null; write such cells as empty fields:
      var cell = itable.GetCell(row, col);
      rowData.Add(cell == null ? "" : $"\"{cell.Text}\"");
    }
    data.Add(rowData);
  }
}
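The loop above wraps each cell's text in double quotes but does not handle quotes that appear inside the text itself. A minimal sketch of one way to handle that, following the usual CSV convention (RFC 4180) of doubling embedded quotes, is shown below. `CsvField` and `CsvLine` are hypothetical helper names, not part of DsPdf.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: quote a field and double any embedded quotes.
// A null value is written as an empty quoted field.
string CsvField(string text) => "\"" + (text ?? "").Replace("\"", "\"\"") + "\"";

// Join one row of cell texts into a single CSV line.
string CsvLine(IEnumerable<string> cells) => string.Join(",", cells.Select(CsvField));

// A quote, a missing value, and a comma are the usual trouble cases.
var line = CsvLine(new string[] { "Widget 3\" pipe", null, "a,b" });
Console.WriteLine(line); // "Widget 3"" pipe","","a,b"
```

Note that this variant quotes empty fields as well, whereas the loop above writes them as bare empty strings; CSV readers generally treat the two the same.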

Save Extracted PDF Table Data to Another File Type (CSV)

For this step, we must first add a reference to the 'System.Text.Encoding.CodePages' NuGet package.
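The article does not say what the package is used for. A common reason to reference it on .NET Core is to make legacy code-page encodings available, as in the sketch below. Windows-1252 is used purely as an example and is an assumption, not something the sample is known to require.

```csharp
using System;
using System.Text;

// Register the provider so Encoding.GetEncoding can resolve code-page encodings
// that .NET Core does not expose by default.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Example only: look up Windows-1252 by its code page number.
var encoding = Encoding.GetEncoding(1252);
Console.WriteLine(encoding.CodePage);
```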

Then, to save the extracted PDF Table data from the variable in the previous step, we will use the File class and invoke its AppendAllLines method.

// Assumed output path; the original snippet does not show how fileName is defined:
var fileName = "ExtractedData.csv";

// Write one comma-separated line per extracted row.
// Note that AppendAllLines adds to the file if it already exists:
File.AppendAllLines(fileName, data.Select(row => string.Join(",", row)));

The data will now be available in a CSV file:

Original PDF

Extracted PDF Table Data in CSV File

Bonus: Format the Exported PDF Table Data in an Excel (XLSX) File

Although the data is now available in a format that can be easily read and manipulated, it is stored as raw, unformatted values in the CSV file. To better utilize the data and make analysis more accessible, use Document Solutions for Excel (DsExcel, previously GcExcel), .NET Edition, and C# to load the CSV file into an Excel (XLSX) file and apply styling and formatting to the extracted data.

Check out our .NET Excel API library. Download a Trial Today!

To use DsExcel, add the NuGet package 'GrapeCity.Documents.Excel' to the project and add its namespace.

using GrapeCity.Documents.Excel;

Initialize a DsExcel workbook instance and load the CSV file using the Open method.

var workbook = new GrapeCity.Documents.Excel.Workbook();
workbook.Open(fileName, OpenFileFormat.Csv);

Get the range of the extracted data, then wrap the cell text, auto-size the columns, align the content, and apply a conditional color scale.

IWorksheet worksheet = workbook.Worksheets[0];
IRange range = worksheet.Range["A2:E10"];

// wrapping cell content
range.WrapText = true;

// styling column names 
worksheet.Range["A1"].EntireRow.Font.Bold = true;

// auto-sizing range
worksheet.Range["A1:E10"].AutoFit();

// aligning cell content
worksheet.Range["A1:E10"].HorizontalAlignment = HorizontalAlignment.Center;
worksheet.Range["A1:E10"].VerticalAlignment = VerticalAlignment.Center;

// applying conditional format on UnitPrice
IColorScale twoColorScaleRule = worksheet.Range["E2:E10"].FormatConditions.AddColorScale(ColorScaleType.TwoColorScale);

twoColorScaleRule.ColorScaleCriteria[0].Type = ConditionValueTypes.LowestValue;
twoColorScaleRule.ColorScaleCriteria[0].FormatColor.Color = Color.FromArgb(255, 229, 229);

twoColorScaleRule.ColorScaleCriteria[1].Type = ConditionValueTypes.HighestValue;
twoColorScaleRule.ColorScaleCriteria[1].FormatColor.Color = Color.FromArgb(255, 20, 20);

Thread.Sleep(1000);                
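The ranges above are hard-coded to A1:E10, which fits this sample's table. If the number of extracted rows or columns varies, the address can be computed instead. The sketch below uses hypothetical helpers (`ColumnLetters`, `RangeAddress`) that only build the address string; they are not part of DsExcel.

```csharp
using System;

// Convert a 1-based column number to Excel letters (1 -> A, 26 -> Z, 27 -> AA).
string ColumnLetters(int column)
{
    var letters = "";
    while (column > 0)
    {
        column--;                                   // shift to 0-based for the modulo
        letters = (char)('A' + column % 26) + letters;
        column /= 26;
    }
    return letters;
}

// Build an A1-style address that starts in column A at firstRow and covers
// rowCount rows and columnCount columns.
string RangeAddress(int firstRow, int rowCount, int columnCount) =>
    $"A{firstRow}:{ColumnLetters(columnCount)}{firstRow + rowCount - 1}";

Console.WriteLine(RangeAddress(firstRow: 1, rowCount: 10, columnCount: 5)); // A1:E10
Console.WriteLine(RangeAddress(firstRow: 2, rowCount: 9, columnCount: 5));  // A2:E10
```

The resulting string can be used wherever the literal "A1:E10" appears, for example with the row and column counts taken from the extracted data list.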

Lastly, save the workbook as an Excel file using the Save method:

workbook.Save("ExtractedData_Formatted.xlsx");

As you have seen, developers can use C# and DsPdf to programmatically extract PDF table data to another file (like a CSV). DsExcel can then convert that data into a styled and formatted Excel XLSX file for easy data analysis:

Original PDF

Extracted PDF Table Data in CSV File

Formatted Excel XLSX File

Document Solutions .NET PDF API Library

This article only scratches the surface of the full capabilities of Document Solutions for PDF. Review our documentation to see the many available features and our demos to see the features in action with downloadable sample projects. To learn more about Document Solutions for PDF and the latest new features, check out our releases page.
