
How to Programmatically Extract Data from Tagged PDF Documents in C#

A tagged PDF document is an accessible PDF that screen readers and other assistive technology can read easily. It carries a hidden structure, built from a set of PDF tags, that describes the document content in a form a screen reader or other text-to-speech software can recognize. Here is a list of standard tags used in PDF documents.

Document Solutions for PDF (DsPdf), previously GrapeCity Documents for PDF (GcPdf), allows users to create tagged PDF documents and to extract data from them based on their structural tags. This blog focuses on the second of these: programmatically extracting data from a tagged PDF document.

How to Create a Tagged PDF Document

Let's start by understanding how a tagged PDF document is created. As described above, GcPdf allows you to create tagged PDF files. You can refer to the following blog, which provides a detailed explanation of creating tagged PDF documents using the GcPdf API. Alternatively, many PDF files include tags added from Adobe or an authoring application, such as Adobe FrameMaker®, Adobe InDesign, or Microsoft Word.

Use C# to Extract Data from a Tagged PDF Document

The steps below describe how to programmatically extract data from a tagged PDF document. Specifically, they extract tables using the Table structural element and then add the extracted tables to a new PDF document. For this blog's use case, we will extract the table from the following Invoice document:

Invoice

Initialize Source Document and Destination Document

Create an instance of the GcPdfDocument class to load the source tagged PDF document, an Invoice. Also, create another instance of the GcPdfDocument class to hold the tables extracted from the source document; this new document is saved later.

The code below initializes both these instances and declares a few text formats that will be used to create the resulting document.

private TextFormat _tf, _tfHdr, _tfPgHdr;
private float _margin = 72;

// Set up some text formats:
_tf = new TextFormat()
{
   Font = Font.FromFile(Path.Combine("Resources", "Fonts", "segoeui.ttf")),
   FontSize = 9,
   ForeColor = Color.Black
};
_tfHdr = new TextFormat(_tf)
{
   Font = Font.FromFile(Path.Combine("Resources", "Fonts", "segoeuib.ttf")),
   FontSize = 11,
   ForeColor = Color.DarkBlue
};
_tfPgHdr = new TextFormat(_tf)
{
   FontSize = 11,
   ForeColor = Color.Gray
};

// The resulting PDF:
GcPdfDocument doc = new GcPdfDocument();
using (var s = File.OpenRead(Path.Combine("Resources", "PDFs", "InvoiceDemo.pdf")))
{
   //Instance to load source tagged PDF document
   var source = new GcPdfDocument();
   source.Load(s);
   PrintAllTables(doc, source);
}

Fetch Tagged PDF's Logical Structure

Fetch the logical structure of the document using the GetLogicalStructure method of the GcPdfDocument class. This method returns the hidden tag structure of the complete document. We then take the root element of the logical structure to access the document's children and find the tables among them using the Table structural element.

The code below fetches the document's logical structure and root element to create a list of tables. It invokes the user-defined PrintTable method to save the extracted tables along with the page index on which each table was found in a list of type List<(TextLayout, Page)>. After creating the list of extracted tables, this code invokes the user-defined GroupRenderTables method to group the tables as per their page index and adds these tables to the destination PDF document.

private void PrintAllTables(GcPdfDocument doc, GcPdfDocument source)
{
   // Get the LogicalStructure and top parent element:
   LogicalStructure ls = source.GetLogicalStructure();
   if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
   {
      // No structure tags found:
      Common.Util.AddNote("No structure tags were found in the source document.", doc.Pages.Add());
      return;
   }

   // The root element:
   Element root = ls.Elements[0];

   // Find and print all tables:
   var tables = new List<(TextLayout, Page)>();
   root.Children.FindAll(e_ => e_.StructElement.Type == "Table").ForEach(t_ => tables.Add(PrintTable(t_)));

   //Group and render tables
   GroupRenderTables(tables, doc);
}

Create a List of Extracted Tables

The code below defines the PrintTable method, which writes the contents of a table into a text layout and returns that layout together with the source page on which the table was found. The returned pairs make up the collection of extracted tables.

private (TextLayout, Page) PrintTable(Element e)
{
   if (e.Type != "Table")
       throw new Exception($"Unexpected: element type must be 'Table' but it is '{e.Type}'.");

   List<List<IList<ITextParagraph>>> table = new List<List<IList<ITextParagraph>>>();
   int maxCols = 0;
   // Select all child elements with type TR - table rows:
   void SelectRows(IList<Element> elements)
   {
      foreach (Element ec in elements)
      {
         if (ec.HasChildren)
         {
            if (ec.StructElement.Type == "TR")
            {
               var cells = ec.Children.FindAll((e_) => e_.StructElement.Type == "TD" || e_.StructElement.Type == "TH").ToArray();
               maxCols = Math.Max(maxCols, cells.Length);
               List<IList<ITextParagraph>> tableCells = new List<IList<ITextParagraph>>();
               foreach (var cell in cells)
                   tableCells.Add(cell.GetParagraphs());
               table.Add(tableCells);
            }
            else
               SelectRows(ec.Children);
         }
      }
   }
   SelectRows(e.Children);

   // show table
   var sourcePage = FindPage(e.StructElement);
   if (sourcePage == null)
       throw new Exception("Unexpected: could not find the default page for the table.");

   var tl = new TextLayout(72);

   // Add table data to the text layout:
   tl.Append($"\nTable on page {sourcePage.Index + 1} of the source document has {maxCols} column(s) and {table.Count} row(s).\nData by row:\n", _tfHdr);
   tl.AppendParagraphBreak();
   int irow = 0;
   foreach (var row in table)
   {
      int icol = 0;
      foreach (var cell in row)
      {
         foreach (var para in cell)
         {
            tl.Append(para.GetText());
         }
         // Separate cells with a tab; the line break for the row is added after this loop:
         if (icol < row.Count - 1)
             tl.Append("\t");
         ++icol;
      }
      ++irow;
      tl.AppendLine();
   }
   return (tl, sourcePage);
}

private Page FindPage(StructElement se)
{
   if (se.DefaultPage != null)
       return se.DefaultPage;
   if (se.HasChildren)
       foreach (var child in se.Children)
       {
          var p = FindPage(child);
          if (p != null)
              return p;
       }
   return null;
}
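
The row-collection logic in PrintTable does not depend on the PDF library, so it can be tried in isolation. The following is a minimal sketch under stated assumptions: Node is a hypothetical stand-in for the library's Element type, and TableRows.CollectRows and RowsDemo are illustrative names that are not part of the GcPdf API. It shows the same idea as SelectRows, descending through grouping tags until a TR is reached and then joining that row's TH/TD cell text with tabs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a structure element: a tag type, optional text, and children.
public class Node
{
    public string Type { get; }
    public string Text { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(string type, string text = "", params Node[] children)
    {
        Type = type;
        Text = text;
        Children.AddRange(children);
    }
}

public static class TableRows
{
    // Walks the subtree below 'table' and returns one tab-separated line per TR element.
    public static List<string> CollectRows(Node table)
    {
        var rows = new List<string>();
        Visit(table.Children, rows);
        return rows;
    }

    private static void Visit(IEnumerable<Node> nodes, List<string> rows)
    {
        foreach (var node in nodes)
        {
            if (node.Type == "TR")
            {
                // Keep only header and data cells, in document order.
                var cells = node.Children
                    .Where(c => c.Type == "TD" || c.Type == "TH")
                    .Select(c => c.Text);
                rows.Add(string.Join("\t", cells));
            }
            else
            {
                // Rows may sit inside grouping tags such as THead or TBody.
                Visit(node.Children, rows);
            }
        }
    }
}

public static class RowsDemo
{
    public static void Main()
    {
        var table = new Node("Table", "",
            new Node("THead", "",
                new Node("TR", "", new Node("TH", "Item"), new Node("TH", "Price"))),
            new Node("TBody", "",
                new Node("TR", "", new Node("TD", "Widget"), new Node("TD", "9.99"))));

        foreach (var line in TableRows.CollectRows(table))
            Console.WriteLine(line);
    }
}
```

Unlike SelectRows, this sketch also emits an empty line for a TR that has no cells; whether that is desirable depends on how empty rows should appear in the output.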

Group and Add Extracted Tables to the Destination Document

The code below defines the GroupRenderTables method to group all the tables based on the page index they are extracted from and adds all tables from a specific page in the source document on one page in the resulting document.

The tables are added to the destination PDF by invoking the DrawTextLayout method, which renders the TextLayout objects generated in the previous step onto the new GcPdfDocument instance. In this example, all tables found on a given page of the source document are printed on one new page in the destination document.

  • Note: After the page of extracted tables, the resulting document also includes the original source page on which those tables appear. This page is added only as a reference to make the example's output easier to follow; it is not a required step.

private void GroupRenderTables(List<(TextLayout, Page)> tables, GcPdfDocument doc)
{
   // Group tables by the page they were found on:
   var tablesByPage = tables.GroupBy(t_ => t_.Item2.Index);
   // For each page, print all tables found on that page,
   // followed by the original page for reference:
   foreach (var tbp in tablesByPage)
   {
      // The page that will contain the extracted table data:
      var pgTables = doc.NewPage();
      // The page that will contain the source page for reference:
      var pgSrc = doc.NewPage();
      // Print the original page:
      tbp.First().Item2.Draw(pgSrc.Graphics, pgSrc.Bounds);
      // Add a page header:
      pgSrc.Graphics.DrawString($"Page {tbp.First().Item2.Index + 1} of the source PDF",
         _tfPgHdr, new RectangleF(0, 0, pgSrc.Size.Width, _margin), TextAlignment.Center, ParagraphAlignment.Center, false);
      // Available height and starting vertical offset for the extracted table data:
      float maxHeight = pgTables.Size.Height - _margin * 2;
      float y = _margin;
      // Print all table data. For simplicity's sake, we assume that all table data will fit on a single page:
      foreach (var t in tbp)
      {
         t.Item1.MaxHeight = maxHeight;
         t.Item1.MaxWidth = pgTables.Size.Width - _margin * 2;
         pgTables.Graphics.DrawTextLayout(t.Item1, new PointF(_margin, y));
         maxHeight -= t.Item1.ContentHeight + _margin;
         y += t.Item1.ContentHeight + _margin;
      }
   }
}
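
The grouping and vertical stacking in GroupRenderTables can be checked without rendering anything. The sketch below is an illustration only: LayoutPlan.Plan and LayoutDemo are hypothetical names, and each table is reduced to a (height, page index) pair in place of a real TextLayout and Page. It groups the pairs by source page index and computes the y offset at which each block would be drawn, advancing by the block height plus one margin, as the loop above does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LayoutPlan
{
    // For each source page index, returns the y offsets at which that page's blocks would be drawn.
    public static Dictionary<int, List<float>> Plan(
        IEnumerable<(float Height, int PageIndex)> blocks, float margin)
    {
        var result = new Dictionary<int, List<float>>();
        foreach (var group in blocks.GroupBy(b => b.PageIndex))
        {
            // Every destination page starts one margin below the top edge.
            float y = margin;
            var offsets = new List<float>();
            foreach (var block in group)
            {
                offsets.Add(y);
                // The next block starts below this one, separated by one margin.
                y += block.Height + margin;
            }
            result[group.Key] = offsets;
        }
        return result;
    }
}

public static class LayoutDemo
{
    public static void Main()
    {
        // Two blocks from source page index 0 and one from page index 2 (made-up heights).
        var blocks = new[] { (100f, 0), (50f, 0), (80f, 2) };
        foreach (var entry in LayoutPlan.Plan(blocks, 72f))
            Console.WriteLine($"Source page {entry.Key + 1}: {string.Join(", ", entry.Value)}");
    }
}
```

The sketch does not check whether the blocks fit on the page. The original code makes the same single-page assumption and only reduces MaxHeight as it goes.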

Save the Resulting PDF Document

Save the newly generated PDF document with the extracted tables.

// Save the PDF ('stream' is a writable Stream supplied by the calling code):
doc.Save(stream);

Open the PDF to view the resulting document, showcasing the extracted tables added to it:

ResultingDoc

You can download this sample here. A similar example of extracting tables, as described above, can be seen in action here. Other examples include creating an outline using tags and highlighting paragraphs that depict extracting data from the PDF document based on other structural tags. The documentation topic for this feature can help you understand the feature in detail.


Manpreet Kaur - Senior Software Engineer