The v4 release of GrapeCity Documents for PDF (GcPdf) adds an important new feature: smart PDF parsing that recognizes tables and extracts their data through the GcPdf C# .NET API.

With the GcPdf C# .NET parsing functionality, you can automate PDF scanning and parsing and quickly extract information from tables without manual work on each document. For example, you can:

  • Automate a scanning/parsing process for PDF documents that requires searching and/or extracting tabular data
  • Create a new document, either text or CSV, containing the extracted content
  • Separate the content into different documents
  • Convert data to different formats for analysis

Because PDF is one of the most common formats for exchanging documents, let's consider a document with several sets of data that need analysis. We need to extract this data into a different format, such as Excel. At first glance the task seems easy: just copy and paste the needed data. However, copying and pasting does not always work as expected, because of the formatting and complexity of the document or the number of documents involved.

If, for example, we have a few thousand PDFs to examine for the needed data, the same task becomes cumbersome and tedious.

Handling this type of requirement efficiently calls for a tool that automates the process, and GcPdf for C# .NET is the perfect tool for the job! This article is for developers who want to reduce the time spent collecting data and improve the accuracy of the data they gather.

The examples will help developers understand how to use GcPdf to access tables in PDF files and extract tabular data for export to CSV files or other formats as needed.

Reading Tables With GcPdf

Tables, like the PDF format itself, are a nearly ubiquitous way to present data. However, a PDF itself does not have any notion of tables; that is, the tables residing in a PDF are purely visual components.

Internally they are represented by a combination of operators that draw text and graphics. Let's learn more about how to extract table content and data by using C# .NET and the GcPdf API (and we'll throw in some GcExcel for good measure)!

Extract Tables from PDF Documents

The GcPdf API for C# .NET gives developers the tools they need to extract data from tables within a PDF document. Using an API that accepts a table's bounds on a page as input, GcPdf parses that area as a table and returns the tabular data (in the form of rows, columns, cells, and their textual content).

API

// Gets or sets the type of algorithm to be used for PDF content recognition when building page text maps (see Page.GetTextMap()).
//
// This property affects the behavior of methods such as GetText(), FindText() and other APIs that rely on text maps.
public RecognitionAlgorithm GcPdfDocument.RecognitionAlgorithm {get; set;}

// Defines possible algorithms that can be used to recognize the logical structure of a PDF when building text maps.
public enum RecognitionAlgorithm
{
    // Advanced algorithm that employs various heuristics and strategies
    // to try to correctly recognize the logical document structure when building text maps.
    //
    // Please note that because the details of this algorithm may change from version to version,
    // the text maps for a specific PDF may also change when GcPdf is updated.
    Advanced,

    // Algorithm that primarily relies on the physical structure of a PDF when building text maps.
    //
    // Results yielded by this algorithm are consistent with how Acrobat Reader handles text when searching, selecting etc.
    AcrobatLike,
}

// Finds and parses a table in the specified area.
ITable Page.GetTable(RectangleF bounds, TableExtractOptions options = null, float dpiX = 72, float dpiY = 72, bool ignoreErrors = true);
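
The bounds passed to GetTable are a RectangleF (x, y, width, height). The sample later in this article builds them by multiplying inch measurements by a DPI of 72, which matches the default dpiX and dpiY values. Below is a minimal sketch of that conversion using only the .NET base library; `TableBounds.FromInches` is a hypothetical helper introduced here for illustration, not part of the GcPdf API.

```csharp
using System;
using System.Drawing;

public static class TableBounds
{
    // Hypothetical helper (not a GcPdf API): converts a rectangle measured in
    // inches into units at the given resolution. At 72 DPI one unit is 1/72 inch.
    public static RectangleF FromInches(float x, float y, float width, float height, float dpi = 72f)
        => new RectangleF(x * dpi, y * dpi, width * dpi, height * dpi);
}
```

For example, `TableBounds.FromInches(0f, 2.5f, 8.5f, 3.75f)` yields the same rectangle as `new RectangleF(0, 2.5f * 72, 8.5f * 72, 3.75f * 72)`: a full-width strip of a US Letter page that starts 2.5 inches from the page origin and is 3.75 inches tall.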

Use Cases

A medium-sized grocery store recently hired your team to solve a problem. They would like a detailed retrospective analysis of their invoices for the last ten years. However, they have changed CPA firms at least twice during this period and only have the invoices available in PDF format on their servers.

They tried to extract the data themselves, but found it very time-consuming and at times inaccurate because of inconsistencies in copying and pasting from tables in the PDFs. Your team can help them automate this process with C# .NET and GcPdf by extracting the data from one or more PDF documents, so the client can free up time and resources to analyze the data for the project and provide the budgeting numbers for the next year.

Export Extracted Data From One PDF into Another PDF

This example demonstrates the use of the Page.GetTable method to extract tabular data from tables within the PDF. This sample page is part of the updated demos: Extract data from a table in a sample invoice PDF.

Example of PDF with tables to extract data using C#.NET and GcPdf tools

Original document containing tabular data

Use GcPdf to extract the invoice/data to another PDF file, like this:

Modified PDF with new table data using C# .NET and GcPdf Tool

Export Extracted Tabular Data to Different Format

GcPdf can extract table data from a PDF in a form that can be exported to another format such as CSV, plain text, or Excel. Once the data is extracted using the GcPdf methods and properties, the System.Text.Encoding and System.IO.File classes are used to write it to a different file format with just a few lines of code.

Now, let's run through an example that extracts the table data from the invoice PDF above and saves it to a CSV file for further analysis.

This can be accomplished by using GcPdf along with the System.Text.Encoding.CodePages package, following these steps:

Steps:

  1. Create a .NET Core Console application, right-click 'Dependencies,' and select 'Manage NuGet Packages'
    • Under the 'Browse' tab search for 'GrapeCity.Documents.Pdf' and click 'Install'
    • While installing, you will receive a 'License Acceptance' dialog; click 'I Accept' to continue
  2. In the Program file, import the following namespaces:
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition;
  3. Create an instance of the GcPdfDocument class, then invoke its Load method to load the PDF document to be parsed. The variable pdfDoc is only in scope inside the using block, so the code from the following steps that works with pdfDoc belongs inside it.
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
{
  var pdfDoc = new GcPdfDocument();
  pdfDoc.Load(fs);
  // Code from the following steps that uses pdfDoc goes here.
}
  4. Create a RectangleF that defines the table bounds on the page.
const float DPI = 72;
// x, y, width, height: inch measurements multiplied by the DPI
var tableBounds = new RectangleF(0, 2.5f * DPI, 8.5f * DPI, 3.75f * DPI);
  5. To help table recognition within the defined bounds, use the TableExtractOptions class, which lets you fine-tune recognition to account for idiosyncrasies of the table's formatting. Here, the minimum distance between rows is increased by 20%.
var tableExtrctOpt = new TableExtractOptions();
// Keep a reference to the default callback before replacing it,
// so the new callback calls the original rather than itself.
var defaultMinRowDistance = tableExtrctOpt.GetMinimumDistanceBetweenRows;
tableExtrctOpt.GetMinimumDistanceBetweenRows = (list) =>
{
    // Scale the default minimum row distance by 1.2.
    return defaultMinRowDistance(list) * 1.2f;
};
  6. Create a list to hold the table data extracted from the PDF pages.

    var data = new List<List<string>>();
    
  7. Invoke the GetTable method with the table bounds (the tableBounds rectangle defined earlier) to make GcPdf search for a table inside the specified rectangle on each page. This loop runs inside the using block in which pdfDoc was loaded.

    for (int i = 0; i < pdfDoc.Pages.Count; ++i)
    {
        // The result is checked for null in the next step before it is used.
        var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
    }
    
  8. Access each cell in the table using the ITable.GetCell(rowIndex, colIndex) method. Use the Rows.Count and Cols.Count properties to loop through the extracted table cells. As in the previous step, this loop runs inside the using block in which pdfDoc was loaded.
for (int i = 0; i < pdfDoc.Pages.Count; ++i)
{
    var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
    if (itable == null)
        continue; // nothing to extract on this page

    // Row 0 is skipped (it is treated here as the header row).
    for (int row = 1; row < itable.Rows.Count; ++row)
    {
        var line = new List<string>();
        for (int col = 0; col < itable.Cols.Count; ++col)
        {
            var cell = itable.GetCell(row, col);
            // A missing cell becomes an empty field; other cells are wrapped in quotes.
            line.Add(cell == null ? "" : $"\"{cell.Text}\"");
        }
        data.Add(line);
    }
}
  9. Add the 'System.Text.Encoding.CodePages' NuGet package to the project.

  10. To save the data extracted in the previous steps, use the File class and invoke its AppendAllLines method.

    var fileName = "ExtractedData.csv";
    // Needed so that Encoding.GetEncoding(1252) can encode non-ASCII chars in the data.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    File.Delete(fileName); // start from an empty file
    File.AppendAllLines(
        fileName,
        // Skip rows with no content, then join each row's cells with commas.
        data.Where(l_ => l_.Any(s_ => !string.IsNullOrEmpty(s_))).Select(d_ => string.Join(',', d_)),
        Encoding.GetEncoding(1252));
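
The steps above wrap each cell's text in double quotes but do not escape quotes that occur inside the text, which can break the CSV if an item description contains one. Below is a minimal sketch of the row-building logic that doubles embedded quotes (the usual RFC 4180 convention) and drops rows with no content. It uses only the .NET base library; `CsvLines.Quote` and `CsvLines.Build` are hypothetical helper names introduced for illustration, and they take the raw cell text (null for a missing cell) rather than pre-quoted strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvLines
{
    // Wraps a field in double quotes and doubles any embedded quotes.
    // A null cell becomes an empty, unquoted field.
    public static string Quote(string field)
        => field == null ? "" : "\"" + field.Replace("\"", "\"\"") + "\"";

    // Drops rows in which every cell is null or empty,
    // then joins the quoted cells of each remaining row with commas.
    public static List<string> Build(IEnumerable<IReadOnlyList<string>> rows)
        => rows.Where(r => r.Any(c => !string.IsNullOrEmpty(c)))
               .Select(r => string.Join(",", r.Select(c => Quote(c))))
               .ToList();
}
```

The list returned by `CsvLines.Build` could be passed to File.AppendAllLines in place of the inline LINQ expression, provided the loop collects `cell.Text` (or null) instead of the quoted string.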
    

Formatting Extracted Data Using GcExcel

CSV Exported Image from GcPdf Tool using C# .NET

The image above shows the extracted tabular data in a CSV file. The extracted content is in a raw format, which is not very user-friendly or conducive to data analysis. To make the extracted data more usable, you can use GrapeCity Documents for Excel (GcExcel) to load the CSV file and format it using C# .NET.

To use GcExcel, add the NuGet package 'GrapeCity.Documents.Excel' to the project and add its namespace. System.Drawing is also needed for the Color type used in the formatting code:

using System.Drawing;
using GrapeCity.Documents.Excel;

The following code formats the extracted data by wrapping the cell content, auto-sizing the columns, aligning the cells, and applying a conditional two-color scale.

var workbook = new GrapeCity.Documents.Excel.Workbook();
workbook.Open(fileName, OpenFileFormat.Csv);

IWorksheet worksheet = workbook.Worksheets[0];
IRange range = worksheet.Range["A2:E10"];

// wrapping cell content
range.WrapText = true;

// styling column names
worksheet.Range["A1"].EntireRow.Font.Bold = true;

// auto-sizing range
worksheet.Range["A1:E10"].AutoFit();

// aligning cell content
worksheet.Range["A1:E10"].HorizontalAlignment = HorizontalAlignment.Center;
worksheet.Range["A1:E10"].VerticalAlignment = VerticalAlignment.Center;

// applying conditional format on UnitPrice
IColorScale twoColorScaleRule = worksheet.Range["E2:E10"].FormatConditions.AddColorScale(ColorScaleType.TwoColorScale);

twoColorScaleRule.ColorScaleCriteria[0].Type = ConditionValueTypes.LowestValue;
twoColorScaleRule.ColorScaleCriteria[0].FormatColor.Color = Color.FromArgb(255, 229, 229);

twoColorScaleRule.ColorScaleCriteria[1].Type = ConditionValueTypes.HighestValue;
twoColorScaleRule.ColorScaleCriteria[1].FormatColor.Color = Color.FromArgb(255, 20, 20);

workbook.Save("ExtractedData_Formatted.xlsx");

Excel File created with GcPdf and GcExcel using C# .NET
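
The formatting code hard-codes ranges such as "A1:E10", which fits only a sheet with five columns and ten rows. Below is a minimal sketch of deriving the A1-style address from the size of the extracted data instead. `RangeAddress` is a hypothetical helper introduced for illustration, not a GcExcel API; it only builds the address string.

```csharp
using System;

public static class RangeAddress
{
    // Converts a 1-based column number to its A1-style letters (1 -> A, 26 -> Z, 27 -> AA).
    public static string ColumnName(int column)
    {
        if (column < 1) throw new ArgumentOutOfRangeException(nameof(column));
        var name = "";
        while (column > 0)
        {
            int rem = (column - 1) % 26;
            name = (char)('A' + rem) + name;
            column = (column - 1) / 26;
        }
        return name;
    }

    // Builds an address covering columns 1..columnCount and rows firstRow..lastRow.
    public static string For(int columnCount, int firstRow, int lastRow)
        => $"A{firstRow}:{ColumnName(columnCount)}{lastRow}";
}
```

For instance, `worksheet.Range[RangeAddress.For(5, 1, rowCount)]` would replace `worksheet.Range["A1:E10"]`, where `rowCount` is assumed to be the number of lines written to the CSV file.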

Be sure to download the sample application and try the detailed implementation of the use case and the code snippets described in this post. If you have any suggestions, feel free to leave a comment below.
