A tagged PDF document is an accessible PDF that screen readers and other assistive technology can read easily. It contains a hidden structure that represents the document content in a form recognizable by a screen reader or other text-to-speech software. This hidden structure is created by a set of PDF tags. Here is a list of standard tags used in PDF documents.
GrapeCity Documents for PDF (GcPdf) allows users to create such tagged PDF documents as well as extract data from a tagged PDF document based on its structural tags. This blog focuses on the second aspect of the feature: programmatically extracting data from a tagged PDF document.
Let's start by understanding how a tagged PDF document is created. As described above, GcPdf allows you to create tagged PDF files. You can refer to the following blog, which provides a detailed explanation of creating tagged PDF documents using the GcPdf API. Alternatively, many PDF files already include tags added by Adobe software or an authoring application, such as Adobe FrameMaker®, Adobe InDesign, or Microsoft Word.
The steps below describe how to programmatically extract data from a tagged PDF document, specifically extracting tables from the tagged PDF document using the Table structural element and later adding these extracted tables to create a new PDF document. As a blog use case, we will be extracting the table from the following Invoice document:
Create an instance of the GcPdfDocument class to load the source tagged PDF document, an invoice. Also, create another instance of the GcPdfDocument class to hold the tables extracted from the source document; this new document is saved at the end.
The code below initializes both these instances and declares a few text formats that will be used to create the resulting document.
// Class-level fields shared by the methods below:
private TextFormat _tf, _tfHdr, _tfPgHdr;
// Page margin in points (72 points = 1 inch):
private float _margin = 72;

// The statements below belong in the method that builds the PDF.
// Set up some text formats:
_tf = new TextFormat()
{
    Font = Font.FromFile(Path.Combine("Resources", "Fonts", "segoeui.ttf")),
    FontSize = 9,
    ForeColor = Color.Black
};
_tfHdr = new TextFormat(_tf)
{
    Font = Font.FromFile(Path.Combine("Resources", "Fonts", "segoeuib.ttf")),
    FontSize = 11,
    ForeColor = Color.DarkBlue
};
_tfPgHdr = new TextFormat(_tf)
{
    FontSize = 11,
    ForeColor = Color.Gray
};
// The resulting PDF:
GcPdfDocument doc = new GcPdfDocument();
using (var s = File.OpenRead(Path.Combine("Resources", "PDFs", "InvoiceDemo.pdf")))
{
    // Instance to load the source tagged PDF document:
    var source = new GcPdfDocument();
    source.Load(s);
    PrintAllTables(doc, source);
}
Fetch the logical structure of the document using the GetLogicalStructure method of the GcPdfDocument class. This method returns the hidden tag structure of the complete document. We need the root element of the logical structure to access the document's children and find the tables among them using the Table structural element.
The code below fetches the document's logical structure and root element to create a list of tables. It invokes the user-defined PrintTable method for each table and stores the result, together with the source page on which the table was found, in a list of type List<(TextLayout, Page)>. After creating the list of extracted tables, the code invokes the user-defined GroupRenderTables method, which groups the tables by page index and adds them to the destination PDF document.
private void PrintAllTables(GcPdfDocument doc, GcPdfDocument source)
{
    // Get the LogicalStructure and top parent element:
    LogicalStructure ls = source.GetLogicalStructure();
    if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
    {
        // No structure tags found:
        Common.Util.AddNote("No structure tags were found in the source document.", doc.Pages.Add());
        return;
    }
    // The root element:
    Element root = ls.Elements[0];
    // Find and print all tables:
    var tables = new List<(TextLayout, Page)>();
    root.Children.FindAll(e_ => e_.StructElement.Type == "Table").ForEach(t_ => tables.Add(PrintTable(t_)));
    // Group and render tables:
    GroupRenderTables(tables, doc);
}
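The search above appears to inspect only the direct children of the root element, so a table nested deeper in the tag tree (for example, inside a section element) might be missed. The following is a minimal, library-independent sketch of a depth-first search that would also reach nested elements. TagNode and TagSearch are hypothetical stand-ins introduced here for illustration; they are not GcPdf types.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a structure element; not a GcPdf type.
public sealed class TagNode
{
    public string Type { get; }
    public List<TagNode> Children { get; } = new List<TagNode>();

    public TagNode(string type, params TagNode[] children)
    {
        Type = type;
        Children.AddRange(children);
    }
}

public static class TagSearch
{
    // Depth-first search that collects every node of the requested type.
    // A matching node is not searched further, so a table nested inside
    // another table is reported only through its outermost table.
    public static List<TagNode> FindAll(TagNode root, string type)
    {
        var found = new List<TagNode>();
        Visit(root, type, found);
        return found;
    }

    private static void Visit(TagNode node, string type, List<TagNode> found)
    {
        if (node.Type == type)
        {
            found.Add(node);
            return;
        }
        foreach (var child in node.Children)
            Visit(child, type, found);
    }
}
```

Calling TagSearch.FindAll(root, "Table") on such a tree returns the matching nodes in document order. The same idea could be applied to the real element tree by recursing over each element's Children collection.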
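The code below defines the PrintTable method, which collects the rows and cells of a table element, writes the cell text to a text layout, and returns that layout together with the source page on which the table was found. These results form the collection of extracted tables. The FindPage helper locates the source page of a structure element.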
private (TextLayout, Page) PrintTable(Element e)
{
    if (e.Type != "Table")
        throw new Exception($"Unexpected: element type must be 'Table' but it is '{e.Type}'.");
    List<List<IList<ITextParagraph>>> table = new List<List<IList<ITextParagraph>>>();
    int maxCols = 0;
    // Select all child elements with type TR - table rows:
    void SelectRows(IList<Element> elements)
    {
        foreach (Element ec in elements)
        {
            if (ec.HasChildren)
            {
                if (ec.StructElement.Type == "TR")
                {
                    // Data cells (TD) and header cells (TH) of this row:
                    var cells = ec.Children.FindAll((e_) => e_.StructElement.Type == "TD" || e_.StructElement.Type == "TH").ToArray();
                    maxCols = Math.Max(maxCols, cells.Length);
                    List<IList<ITextParagraph>> tableCells = new List<IList<ITextParagraph>>();
                    foreach (var cell in cells)
                        tableCells.Add(cell.GetParagraphs());
                    table.Add(tableCells);
                }
                else
                    // Rows may be nested in grouping elements, so recurse:
                    SelectRows(ec.Children);
            }
        }
    }
    SelectRows(e.Children);
    // Find the source page that contains the table:
    var sourcePage = FindPage(e.StructElement);
    if (sourcePage == null)
        throw new Exception("Unexpected: could not find the default page for the table.");
    var tl = new TextLayout(72);
    // Add table data to the text layout:
    tl.Append($"\nTable on page {sourcePage.Index + 1} of the source document has {maxCols} column(s) and {table.Count} row(s).\nData by row:\n", _tfHdr);
    tl.AppendParagraphBreak();
    foreach (var row in table)
    {
        int icol = 0;
        foreach (var cell in row)
        {
            foreach (var para in cell)
            {
                tl.Append(para.GetText());
            }
            // Separate cells with a tab; the line break that ends
            // the row is appended after the loop:
            if (icol < row.Count - 1)
                tl.Append("\t");
            ++icol;
        }
        tl.AppendLine();
    }
    return (tl, sourcePage);
}

// Returns the default page of the element or, failing that,
// the first default page found among its descendants:
private Page FindPage(StructElement se)
{
    if (se.DefaultPage != null)
        return se.DefaultPage;
    if (se.HasChildren)
        foreach (var child in se.Children)
        {
            var p = FindPage(child);
            if (p != null)
                return p;
        }
    return null;
}
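The row loop in PrintTable separates cells with tabs and ends each row with a line break, and maxCols records the widest row. As a library-independent sketch of that formatting, the hypothetical TableText helper below (an illustration, not part of GcPdf) does the same with plain strings, which can be handy for checking the extracted data in a unit test or exporting it as tab-separated text.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of GcPdf.
public static class TableText
{
    // Joins the cells of each row with tabs and puts each row on its own line.
    public static string Format(IEnumerable<IReadOnlyList<string>> rows)
    {
        return string.Join("\n", rows.Select(r => string.Join("\t", r)));
    }

    // The column count of a ragged table is the length of its widest row.
    public static int MaxColumns(IEnumerable<IReadOnlyList<string>> rows)
    {
        return rows.Select(r => r.Count).DefaultIfEmpty(0).Max();
    }
}
```

Because the separator is placed between cells rather than after each one, the last cell of a row is not followed by a trailing tab.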
The code below defines the GroupRenderTables method, which groups the tables by the index of the source page they were extracted from and places all tables from one source page on a single page in the resulting document.
The tables are added to the destination PDF by invoking the DrawTextLayout method, which renders the TextLayout objects generated in the previous step on the new GcPdfDocument instance. In this example, the tables from each page of the source document are printed on a new page in the destination document, followed by a copy of the source page for reference.
private void GroupRenderTables(List<(TextLayout, Page)> tables, GcPdfDocument doc)
{
    // Group tables by the page they were found on:
    var tablesByPage = tables.GroupBy(t_ => t_.Item2.Index);
    // For each page, print all tables found on that page,
    // followed by the original page for reference:
    foreach (var tbp in tablesByPage)
    {
        // The page that will contain the extracted table data:
        var pgTables = doc.NewPage();
        // The page that will contain the source page for reference:
        var pgSrc = doc.NewPage();
        // Print the original page:
        tbp.First().Item2.Draw(pgSrc.Graphics, pgSrc.Bounds);
        // Add a page header:
        pgSrc.Graphics.DrawString($"Page {tbp.First().Item2.Index + 1} of the source PDF",
            _tfPgHdr, new RectangleF(0, 0, pgSrc.Size.Width, _margin), TextAlignment.Center, ParagraphAlignment.Center, false);
        // Vertical space available for table data, and the current vertical offset:
        float maxHeight = pgTables.Size.Height - _margin * 2;
        float y = _margin;
        // Print all table data. For simplicity's sake, we assume that all table data will fit on a single page:
        foreach (var t in tbp)
        {
            t.Item1.MaxHeight = maxHeight;
            t.Item1.MaxWidth = pgTables.Size.Width - _margin * 2;
            pgTables.Graphics.DrawTextLayout(t.Item1, new PointF(_margin, y));
            maxHeight -= t.Item1.ContentHeight + _margin;
            y += t.Item1.ContentHeight + _margin;
        }
    }
}
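GroupRenderTables assumes that all tables from one source page fit on a single destination page. The sketch below reproduces the same bookkeeping without GcPdf, so the grouping and stacking logic can be examined on its own: it groups blocks by source page, stacks each group from the top margin downward with one margin of spacing, and flags any block that would run past the bottom margin. The LayoutPlan class, its Stack method, and the use of plain heights in place of text layouts are assumptions made for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of GcPdf.
public static class LayoutPlan
{
    // Groups block heights by source page and stacks each group vertically.
    // Returns, for every block, its page index, top offset, and whether it
    // ends above the bottom margin.
    public static List<(int PageIndex, float Y, bool Fits)> Stack(
        IEnumerable<(float Height, int PageIndex)> blocks,
        float pageHeight, float margin)
    {
        var result = new List<(int PageIndex, float Y, bool Fits)>();
        foreach (var group in blocks.GroupBy(b => b.PageIndex))
        {
            float y = margin;                    // start below the top margin
            float bottom = pageHeight - margin;  // keep the bottom margin free
            foreach (var block in group)
            {
                result.Add((group.Key, y, y + block.Height <= bottom));
                y += block.Height + margin;      // one margin between blocks
            }
        }
        return result;
    }
}
```

A block reported with Fits set to false marks the point where a real implementation would need to start a new destination page or split the content, which the sample above deliberately leaves out.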
Save the newly generated PDF document with the extracted tables.
// Save the PDF ('stream' is the output stream supplied by the calling code):
doc.Save(stream);
Open the PDF to view the resulting document, showcasing the extracted tables added to it:
You can download this sample here. A similar example of extracting tables as described above can be seen in action here. Other examples, such as creating an outline using tags and highlighting paragraphs, show how to extract data from a PDF document based on other structural tags. The documentation topic for this feature can help you understand the feature in detail.