Extract Table Data from PDF Documents in C#

The v5 release of GrapeCity Documents for PDF (GcPdf) continues to add new features, including smarter PDF parsing that recognizes tables and extracts table data from PDF files through the GcPdf C# .NET API.

With the GcPdf C# .NET library, programmatically extracting or parsing data from tables is a snap!  Check out what these new and updated features let you do:

  • Automate a scanning/parsing process for PDF documents that requires searching and/or extracting tabular data
  • Create a new document, either text or CSV, containing the extracted content
  • Separate the content into different documents
  • Convert data to different formats for analysis
  • Convert data to SVG for easy reporting and formatting

What to expect from this article:

  • How to automate searching and parsing within a PDF file (programmatically, with no user interaction)
  • How to extract table data from PDF documents
  • How to save data into an Excel-formatted spreadsheet for easy viewing and analysis
  • A couple of use cases showing how these tools are applied in real-world applications

Let's get started!

Because PDF is one of the most common formats for exchanging documents, think about the amount of data that can be stored in the tables of one or more PDF documents: financial analyses, stock trends, and contact information, to name just a few.  Now, think about something like a quarterly report, and imagine needing to analyze years' worth of these reports.

Getting data out of these reports may seem easy at first (copy and paste), but because of the structure of PDF files, a simple copy and paste rarely yields a table's worth of data without significant manipulation and cleanup.

Couple this with the possibility of having to copy and paste from many other documents, and it is a recipe for a very long day (or even a week or more, depending on the data required!).  Handling this type of requirement efficiently requires a tool that automates the process, and the GcPdf C# .NET API library is the perfect tool for the job!

This article is for developers who want to decrease the time it takes to collect data and improve the accuracy of the data gathering process. The examples will help developers understand how to use GcPdf to access tables in PDF files and extract tabular data for export to CSV files or other formats as needed.

Reading Table Data From a PDF With GcPdf and C#

Tables, much like the PDF format itself, are a nearly ubiquitous way to present data. However, the PDF format itself has no notion of tables; that is, the tables that appear in a PDF are purely visual components.

They are not tables in the sense we are all used to in programs like Excel or Word.  Rather, they are represented by a combination of operators that draw text and graphics in specific locations resembling a table. 

Therefore, a PDF file has no concept of rows, columns, and cells, nor does it contain any underlying structure that lets developers find rows, columns, and cells in the PDF.  So let's find out how GcPdf's C# API library helps get the job done!

Extracting Table Data from PDF Documents Programmatically With C#

The GcPdf C# .NET API gives developers the tools they need to extract data from tables within a PDF document. Given a table's bounds on a page as input, GcPdf parses that area as a table and returns tabular data (in the form of rows, columns, cells, and textual content).  Take a look at the API below to get started:

API

// Gets or sets the type of algorithm to be used for PDF content recognition when building page text maps (see Page.GetTextMap()).
//
// This property affects the behavior of methods such as GetText(), FindText() and other APIs that rely on text maps.
public RecognitionAlgorithm GcPdfDocument.RecognitionAlgorithm {get; set;}
 
// Defines possible algorithms that can be used to recognize the logical structure of a PDF when building text maps.
public enum RecognitionAlgorithm {
    // Advanced algorithm that employs various heuristics and strategies
    // to try to correctly recognize the logical document structure when building text maps.
    //
    // Please note that because the details of this algorithm may change from version to version,
    // the text maps for a specific PDF may also change when GcPdf is updated.
    Advanced,

    // Algorithm that primarily relies on the physical structure of a PDF when building text maps.
    //
    // Results yielded by this algorithm are consistent with how Acrobat Reader handles text when searching, selecting etc.
    AcrobatLike,
}
 
// Finds and parses a table in the specified area.
ITable Page.GetTable(RectangleF bounds, TableExtractOptions options = null, float dpiX = 72, float dpiY = 72, bool ignoreErrors = true);

Use Cases

A medium-sized grocery store chain needs a detailed analysis of its invoices, retrospectively for the past ten years.  However, the company changed CPA firms at least twice in this time period and only has the invoices available in PDF format on their servers.  Luckily, the invoices from each of the CPA firms represented were all consistent in their formatting and display. 

The store's IT team tried extracting the data themselves but found the process far too time-consuming for the project to be viable.  The inconsistencies introduced by copying and pasting data from tables in the PDFs made the job very difficult.

However, your team has a solution that can help them automate this process by using C# .NET and GcPDF tools, extracting data inside one or more PDF documents so the client can free up time and resources to analyze the data for the project, and provide the budgeting numbers for the next year.

Export Extracted Data From One PDF into Another PDF

This example demonstrates the use of the Page.GetTable method to extract tabular data from tables within the PDF. This sample page is part of the updated demos: Extract data from a table in a sample invoice PDF.

With GcPdf, the extracted invoice data can be written to another PDF file.

Programmatically Export Extracted Table Data to Different Format with C#

GcPdf can extract table data from a PDF in a form that is easy to export to another format such as CSV, plain text, or Excel.

Once the data is extracted using the GcPdf methods and properties, the System.Text.Encoding and System.IO.File classes are used to export the extracted data to a different file format with just a few lines of code.

Let's run through an example that extracts the table data from the invoice PDF above and saves it into a CSV file for further analysis, using GcPdf along with the System.Text.Encoding.CodePages assembly:

Steps:

1. Create a .NET Core Console application, right-click 'Dependencies,' and select 'Manage NuGet Packages'

    • Under the 'Browse' tab search for 'GrapeCity.Documents.Pdf' and click 'Install'
    • While installing, you will receive a 'License Acceptance' dialog; click 'I Accept' to continue

2. In the Program file, import the following namespaces:

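using System.Collections.Generic; // List<T>
using System.Drawing;             // RectangleF, Color
using System.IO;                  // File, Path
using System.Linq;
using System.Text;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition;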

3. Create a GcPdfDocument instance, then invoke its Load method to load the PDF document to be parsed.

using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    var pdfDoc = new GcPdfDocument();
    pdfDoc.Load(fs);
    // pdfDoc is only in scope inside this using block, so the code that
    // works with the document in the following steps belongs here.
}

4. Create an instance of the RectangleF class that defines the table bounds in the PDF document.

const float DPI = 72;
var tableBounds = new RectangleF(0, 2.5f * DPI, 8.5f * DPI, 3.75f * DPI);
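The bounds above are written as inch measurements multiplied by 72, matching the default dpiX and dpiY values of GetTable. As a minimal sketch of that conversion, the helper below builds the same rectangle from inches. The FromInches helper and the TableBoundsSketch class are hypothetical names introduced here for illustration; they are not part of GcPdf.

```csharp
using System;
using System.Drawing;

static class TableBoundsSketch
{
    // Assumption: bounds are expressed at 72 units per inch,
    // the default dpiX/dpiY of Page.GetTable.
    const float UnitsPerInch = 72f;

    // Hypothetical helper: converts inch measurements into a RectangleF
    // expressed in 72-per-inch units.
    public static RectangleF FromInches(float x, float y, float width, float height) =>
        new RectangleF(
            x * UnitsPerInch,
            y * UnitsPerInch,
            width * UnitsPerInch,
            height * UnitsPerInch);

    static void Main()
    {
        // Same area as in step 4: full 8.5 inch page width,
        // starting 2.5 inches down, 3.75 inches tall.
        RectangleF bounds = FromInches(0f, 2.5f, 8.5f, 3.75f);
        Console.WriteLine($"X={bounds.X} Y={bounds.Y} W={bounds.Width} H={bounds.Height}");
    }
}
```

Writing the bounds in inches first makes it easier to measure the table area in a PDF viewer and adjust the rectangle when an invoice layout differs.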

5. To fine-tune table recognition within the defined bounds, use the TableExtractOptions class, which lets us account for idiosyncrasies of table formatting.

var tableExtrctOpt = new TableExtractOptions();
// Keep the default row-distance calculation so it can be reused below.
var defaultMinRowDistance = tableExtrctOpt.GetMinimumDistanceBetweenRows;
// Increase the minimum distance between rows by 20% over the default.
tableExtrctOpt.GetMinimumDistanceBetweenRows = (list) =>
{
    var res = defaultMinRowDistance(list);
    return res * 1.2f;
};

6. Create a list to hold table data from PDF pages.

var data = new List<List<string>>();

Then invoke the GetTable method with the table bounds (defined in step 4) to make GcPdf search for a table inside the specified rectangle.

using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    var pdfDoc = new GcPdfDocument();
    pdfDoc.Load(fs);
    for (int i = 0; i < pdfDoc.Pages.Count; ++i)
    {
        var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
    }
}

7. Access each cell in the table using the ITable.GetCell(rowIndex, colIndex) method. Use the Rows.Count and Cols.Count properties to loop through the extracted table cells.

using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
    var pdfDoc = new GcPdfDocument();
    pdfDoc.Load(fs);
    for (int i = 0; i < pdfDoc.Pages.Count; ++i)
    {
        var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
        if (itable != null)
        {
            // Start at row 1: row 0 (treated here as the header row) is skipped.
            for (int row = 1; row < itable.Rows.Count; ++row)
            {
                data.Add(new List<string>());
                for (int col = 0; col < itable.Cols.Count; ++col)
                {
                    var cell = itable.GetCell(row, col);
                    // A missing cell becomes an empty field; otherwise the text is wrapped in quotes.
                    data.Last().Add(cell == null ? "" : $"\"{cell.Text}\"");
                }
            }
        }
    }
}
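The loop above wraps each cell's text in double quotes, which protects commas inside a cell, but a double quote inside the text itself would still break the CSV line. A minimal sketch of one way to handle this, doubling embedded quotes in the style of RFC 4180, is shown below. The CsvSketch class and its Quote and ToCsvLine helpers are hypothetical names used only for illustration and are not part of GcPdf.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CsvSketch
{
    // Hypothetical helper: wraps a field in quotes and doubles any
    // embedded quotes; null is treated as an empty field.
    public static string Quote(string text) =>
        "\"" + (text ?? string.Empty).Replace("\"", "\"\"") + "\"";

    // Joins already-extracted cell texts into one CSV line.
    public static string ToCsvLine(IEnumerable<string> cells) =>
        string.Join(",", cells.Select(Quote));

    static void Main()
    {
        var row = new List<string> { "Widget, large", "3 \"special\" units", null };
        // Prints: "Widget, large","3 ""special"" units",""
        Console.WriteLine(ToCsvLine(row));
    }
}
```

Note that with this approach an empty cell becomes a pair of quotes rather than an empty string, which matters if rows are later filtered by checking for empty strings. One option is to collect the raw cell text first and apply the quoting only when the lines are written.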

 

8. Add a reference to the 'System.Text.Encoding.CodePages' NuGet package.

9. To save the extracted data from the variable in the previous step, use the File class and invoke its AppendAllLines method.

var fileName = "ExtractedData.csv";
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // needed to encode non-ASCII chars in data
File.Delete(fileName);
File.AppendAllLines(
    fileName,
    data.Where(l_ => l_.Any(s_ => !string.IsNullOrEmpty(s_))).Select(d_ => string.Join(',', d_)),
    Encoding.GetEncoding(1252));
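The RegisterProvider call matters because code page encodings such as 1252 (Windows-1252) are not available by default on .NET Core. The sketch below, which uses only standard .NET classes, shows what the encoding does with a few non-ASCII characters that commonly appear in invoices. The EncodingSketch class, its Encode1252 helper, and the sample string are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text;

static class EncodingSketch
{
    // Hypothetical helper: encodes text with Windows-1252 after
    // registering the code pages provider.
    public static byte[] Encode1252(string text)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetBytes(text);
    }

    static void Main()
    {
        string sample = "Caf\u00E9 12,50 \u20AC"; // "Café 12,50 €"
        byte[] bytes = Encode1252(sample);

        // Every character in this sample maps to a single byte in Windows-1252.
        Console.WriteLine($"{sample.Length} chars, {bytes.Length} bytes");

        // Decoding with the same encoding restores the original text.
        string roundTrip = Encoding.GetEncoding(1252).GetString(bytes);
        Console.WriteLine(roundTrip == sample);
    }
}
```

If the program that will read the CSV expects UTF-8 instead, passing Encoding.UTF8 as the last argument of File.AppendAllLines is an alternative worth considering. Which encoding is right depends on the consumer of the file.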

Formatting Extracted Data Using GcExcel

Although the data is now available in a format that can be easily read and manipulated, it is still raw CSV.  To make the data more useful and analysis easier, use GrapeCity Documents for Excel (GcExcel) and C# to load the CSV file and save it in an Excel format.

To use GcExcel, add the NuGet package 'GrapeCity.Documents.Excel' to the project and add its namespace:

using GrapeCity.Documents.Excel;

The following code formats the extracted data by wrapping cell content, auto-sizing columns, aligning cells, and applying a conditional color scale.

var workbook = new GrapeCity.Documents.Excel.Workbook();
workbook.Open(fileName, OpenFileFormat.Csv);
 
IWorksheet worksheet = workbook.Worksheets[0];
IRange range = worksheet.Range["A2:E10"];
 
// wrapping cell content
range.WrapText = true;
 
// styling column names
worksheet.Range["A1"].EntireRow.Font.Bold = true;
 
// auto-sizing range
worksheet.Range["A1:E10"].AutoFit();
 
// aligning cell content
worksheet.Range["A1:E10"].HorizontalAlignment = HorizontalAlignment.Center;
worksheet.Range["A1:E10"].VerticalAlignment = VerticalAlignment.Center;
 
// applying conditional format on UnitPrice
IColorScale twoColorScaleRule = worksheet.Range["E2:E10"].FormatConditions.AddColorScale(ColorScaleType.TwoColorScale);
twoColorScaleRule.ColorScaleCriteria[0].Type = ConditionValueTypes.LowestValue;
twoColorScaleRule.ColorScaleCriteria[0].FormatColor.Color = Color.FromArgb(255, 229, 229);
 
twoColorScaleRule.ColorScaleCriteria[1].Type = ConditionValueTypes.HighestValue;
twoColorScaleRule.ColorScaleCriteria[1].FormatColor.Color = Color.FromArgb(255, 20, 20);
 
workbook.Save("ExtractedData_Formatted.xlsx");

Be sure to download the sample application and try the detailed implementation of the use case scenario and code snippets described in this post. If you have any suggestions, feel free to leave a comment below.

Happy coding!
