Document Solutions for PDF
    Parse PDF Documents

    DsPdf allows you to parse PDF documents by recognizing their logical text and document structure. Content elements such as plain text, tables, paragraphs, and elements in tagged PDF documents can be extracted using the DsPdf API, as explained below.

    Extract Text

    To extract text from a PDF:

    1. Load a PDF document using the Load method of the GcPdfDocument class.
    2. Extract the text from the last page of the PDF using the GetText method of the Page class.
    3. Add the extracted text to another PDF document using the Graphics.DrawString method.
    4. Save the new document using the Save method of the GcPdfDocument class.
      C#
      // Keep the source stream open for as long as the loaded document is in use
      using (FileStream fs = new FileStream("DsPdf.pdf", FileMode.Open, FileAccess.Read))
      {
          GcPdfDocument doc = new GcPdfDocument();
          doc.Load(fs);

          // Extract the text present on the last page
          string text = doc.Pages.Last.GetText();

          // Add the extracted text to a new PDF, one inch (72 points) from the top-left corner
          GcPdfDocument doc1 = new GcPdfDocument();
          PointF textPt = new PointF(72, 72);
          doc1.NewPage().Graphics.DrawString(text, new TextFormat()
                  { FontName = "ARIAL", FontItalic = true }, textPt);

          doc1.Save("NewDocument.pdf");
      }

      Console.WriteLine("Press any key to exit");
      Console.ReadKey();
      

    Similarly, you can extract all the text from a document by using the GetText method of the GcPdfDocument class.
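
    As a minimal sketch of extracting the text of a whole document, the snippet below reuses the sample file name from the example above. It assumes that GcPdfDocument lives in the GrapeCity.Documents.Pdf namespace and that GetText can be called without arguments on the document; check both against the API reference for your DsPdf version.

    ```csharp
    using System;
    using System.IO;
    using GrapeCity.Documents.Pdf; // assumed namespace of GcPdfDocument

    class ExtractAllText
    {
        static void Main()
        {
            // "DsPdf.pdf" is the sample file name used in the example above
            using (var fs = new FileStream("DsPdf.pdf", FileMode.Open, FileAccess.Read))
            {
                var doc = new GcPdfDocument();
                doc.Load(fs);

                // Assumption: the parameterless call returns the text of all pages
                string allText = doc.GetText();
                Console.WriteLine(allText);
            }
        }
    }
    ```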

    Extract Text using ITextMap

    DsPdf provides the ITextMap interface, which represents the text map of a page in a DsPdf document. It helps you find the geometric positions of the text lines on a page and extract the text at a specific position.

    The text map for a specific page in the document can be retrieved using the GetTextMap method of the Page class, which returns an object of type ITextMap. ITextMap provides four overloads of the GetFragment method, which retrieve a text range and the text within that range. The text range is represented by the TextMapFragment class, and each line of text in the range is represented by the TextLineFragment class.

    The example code below uses the GetFragment(out TextMapFragment range, out string text) overload to retrieve the geometric positions of all the text lines on a page, and the GetFragment(MapPos startPos, MapPos endPos, out TextMapFragment range, out string text) overload to retrieve the text at a specific position on the page.

    C#
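    // Open an arbitrary PDF, load it into a temp document and use the map to find some texts.
    // 'tl' is assumed to be a TextLayout created beforehand (for example, via Graphics.CreateTextLayout()):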
    using (var fs = new FileStream("Test.pdf", FileMode.Open, FileAccess.Read))
    {
        var doc1 = new GcPdfDocument();
        doc1.Load(fs);
        var tmap = doc1.Pages[0].GetTextMap();
    
        // We retrieve the text at a specific (known to us) geometric location on the page:
        float tx0 = 2.1f, ty0 = 3.37f, tx1 = 3.1f, ty1 = 3.5f;
        HitTestInfo htiFrom = tmap.HitTest(tx0 * 72, ty0 * 72);
        HitTestInfo htiTo = tmap.HitTest(tx1 * 72, ty1 * 72);
        tmap.GetFragment(htiFrom.Pos, htiTo.Pos, out TextMapFragment range1, out string text1);
        tl.AppendLine($"Looked for text inside rectangle x={tx0:F2}\", y={ty0:F2}\", " +
            $"width={tx1 - tx0:F2}\", height={ty1 - ty0:F2}\", found:");
        tl.AppendLine(text1);
        tl.AppendLine();
    
        // Get all text fragments and their locations on the page:
        tl.AppendLine("List of all texts found on the page");
        tmap.GetFragment(out TextMapFragment range, out string text);
        foreach (TextLineFragment tlf in range)
        {
            var coords = tmap.GetCoords(tlf);
            tl.Append($"Text at ({coords.B.X / 72:F2}\",{coords.B.Y / 72:F2}\"):\t");
            tl.AppendLine(tmap.GetText(tlf));
        }
        // Print the results:
        tl.PerformLayout(true);
    }
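
    The sample above multiplies inch values by 72 before hit-testing and divides coordinates by 72 before printing, which suggests that page coordinates are expressed in points (72 per inch). A minimal sketch of that conversion follows; the PdfUnits class and its method names are hypothetical helpers for illustration and are not part of the DsPdf API.

    ```csharp
    using System;

    // Hypothetical helper, not part of DsPdf: converts between inches and PDF points
    static class PdfUnits
    {
        public const float PointsPerInch = 72f;

        public static float InchesToPoints(float inches) => inches * PointsPerInch;

        public static float PointsToInches(float points) => points / PointsPerInch;
    }

    class Program
    {
        static void Main()
        {
            // Top-left corner of the search rectangle from the sample, given in inches
            float tx0 = 2.1f, ty0 = 3.37f;
            Console.WriteLine($"x = {PdfUnits.InchesToPoints(tx0):F2} pt");
            Console.WriteLine($"y = {PdfUnits.InchesToPoints(ty0):F2} pt");

            // And back again: 144 points is two inches
            Console.WriteLine($"{PdfUnits.PointsToInches(144f):F2} in");
        }
    }
    ```

    Keeping the factor in one named constant avoids mixing up the x and y values when several coordinates are converted by hand.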
    

    Extract Text Paragraphs

    DsPdf allows you to extract text paragraphs from a PDF document by using the Paragraphs property of the ITextMap interface. It returns a collection of ITextParagraph objects associated with the text map.

    PDF documents sometimes contain repeated text (for example, the same text drawn more than once with an overlap to make it appear bold). DsPdf extracts such text without returning the redundant lines. Tables with multi-line text in cells are also correctly recognized as text paragraphs.

    The example code below shows how to extract all text paragraphs of a PDF document:

    C#
    GcPdfDocument doc = new GcPdfDocument();
    var page = doc.NewPage();
    var tl = page.Graphics.CreateTextLayout();
    tl.MaxWidth = doc.PageSize.Width;
    tl.MaxHeight = doc.PageSize.Height;
    
    //Text split options for widow/orphan control
    TextSplitOptions to = new TextSplitOptions(tl)
    {
        MinLinesInFirstParagraph = 2,
        MinLinesInLastParagraph = 2,
    };
    
    //Open a PDF, load it into a temp document and get all page texts
    using (var fs=new FileStream("Wetlands.pdf", FileMode.Open, FileAccess.Read))
    {
        var doc1 = new GcPdfDocument();
        doc1.Load(fs);
    
        for (int i = 0; i < doc1.Pages.Count; ++i)
        {
            tl.AppendLine(string.Format("Paragraphs from page {0} of the original PDF:", i + 1));
    
            var pg = doc1.Pages[i];
            var pars = pg.GetTextMap().Paragraphs;
            foreach (var par in pars)
            {
                tl.AppendLine(par.GetText());
            }
        }
    
        tl.PerformLayout(true);
        while (true)
        {
            //'rest' will accept the text that did not fit
            var splitResult = tl.Split(to, out TextLayout rest);
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
            if (splitResult != SplitResult.Split)
                break;
            tl = rest;
            doc.NewPage();
        }
        //Append the original document for reference
        doc.MergeWithDocument(doc1, new MergeDocumentOptions());
    }
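    // Save the document; 'stream' is the output stream supplied by the enclosing method,
    // which returns the number of pages written
    doc.Save(stream);
    return doc.Pages.Count;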
    

    Extract Data from Tables

    DsPdf allows you to extract data from tables in PDF documents. The GetTable method of the Page class extracts data from the area specified as a table. The method takes the table area as a parameter, parses that area, and returns the rows, columns, and cells along with their textual content. You can also pass TableExtractOptions as a parameter to specify table formatting options such as column width, row height, and the distance between rows or columns.

    The example code below shows how to extract data from a table in a PDF document:

    C#
    const float DPI = 72;
    const float margin = 36;
    var doc = new GcPdfDocument();
    var tf = new TextFormat()
    {
        Font = Font.FromFile("segoeui.ttf"),
        FontSize = 9,
        ForeColor = Color.Black
    };
    
    var tfRed = new TextFormat(tf) { ForeColor = Color.Red };
    // Dispose the source stream once the extraction is finished
    using (var fs = File.OpenRead("zugferd-invoice.pdf"))
    {
        // The approx table bounds:
        var tableBounds = new RectangleF(0, 3 * DPI, 8.5f * DPI, 3.75f * DPI);
    
        var page = doc.NewPage();
        page.Landscape = true;
        var g = page.Graphics;
    
        var tl = g.CreateTextLayout();
        tl.MaxWidth = page.Bounds.Width;
        tl.MaxHeight = page.Bounds.Height;
        tl.MarginAll = margin;
        tl.DefaultTabStops = 150;
        tl.LineSpacingScaleFactor = 1.2f;
    
        var docSrc = new GcPdfDocument();
        docSrc.Load(fs);
    
        var itable = docSrc.Pages[0].GetTable(tableBounds);
    
        if (itable == null)
        {
            tl.AppendLine($"No table was found at the specified coordinates.", tfRed);
        }
        else
        {
            tl.Append($"\nThe table has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tf);
            tl.AppendParagraphBreak();
            for (int row = 0; row < itable.Rows.Count; ++row)
            {
                var tfmt = tf; // header and data rows use the same text format
                for (int col = 0; col < itable.Cols.Count; ++col)
                {
                    var cell = itable.GetCell(row, col);
                    if (col > 0)
                        tl.Append("\t", tfmt);
                    if (cell == null)
                        tl.Append("<no cell>", tfRed);
                    else
                        tl.Append(cell.Text, tfmt);
                }
                tl.AppendLine();
            }
        }
        TextSplitOptions to = new TextSplitOptions(tl) { RestMarginTop = margin, MinLinesInFirstParagraph = 2, MinLinesInLastParagraph = 2 };
        tl.PerformLayout(true);
        while (true)
        {
            var splitResult = tl.Split(to, out TextLayout rest);
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
            if (splitResult != SplitResult.Split)
                break;
            tl = rest;
            doc.NewPage().Landscape = true;
        }
        // Append the original document for reference
        doc.MergeWithDocument(docSrc);
        // 'stream' is the output stream supplied by the enclosing method
        doc.Save(stream);
    }
    
    Note: The font files used in the above sample can be downloaded from the Get Table Data demo.

    Extract Content from Tagged PDF

    DsPdf can recognize the logical structure of the source document from which a PDF document was generated. This structure recognition is then used to extract content elements from tagged PDF documents.

    Following the PDF specification, DsPdf exposes the logical structure through the LogicalStructure class. It represents the parsed logical structure of a PDF document, built from the tags in the PDF structure tree. The StructElement property of the Element class can be used to get the element type, such as TR for table rows, H for headings, and P for paragraphs.

    The example code below shows how to extract headings, tables and TOC elements from a tagged PDF document:

    C#
    static void ShowTable(Element e)
    {
        List<List<IList<ITextParagraph>>> table = new List<List<IList<ITextParagraph>>>();
        
        // select all nested rows, elements with type TR
        void SelectRows(IList<Element> elements)
        {
            foreach (Element ec in elements)
            {
                if (ec.HasChildren)
                {
                    if (ec.StructElement.Type == "TR")
                    {
                        var cells = ec.Children.FindAll((e_) => e_.StructElement.Type == "TD").ToArray();
                        List<IList<ITextParagraph>> tableCells = new List<IList<ITextParagraph>>();
                        foreach (var cell in cells)
                            tableCells.Add(cell.GetParagraphs());
                        table.Add(tableCells);
                    }
                    else
                        SelectRows(ec.Children);
                }
            }
        }
        SelectRows(e.Children);
    
        // show table (Max throws on an empty sequence, so skip tables without rows)
        if (table.Count == 0)
            return;
        int colCount = table.Max((r_) => r_.Count);
        Console.WriteLine();
        Console.WriteLine();
        Console.WriteLine($"Table: {table.Count}x{colCount}");
        Console.WriteLine($"------");
        foreach (var r in table)
        {
            foreach (var c in r)
            {
                var s = c == null || c.Count <= 0 ? string.Empty : c[0].GetText();
                Console.Write(s);
                Console.Write("\t");
            }
            Console.WriteLine();
        }
    }
    
    static void Main(string[] args)
    {
        
        GcPdfDocument doc = new GcPdfDocument();
    
        using (var s = new FileStream("C1Olap QuickStart.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            doc.Load(s);
    
            // get the LogicalStructure and top parent element
            LogicalStructure ls = doc.GetLogicalStructure();
            Element root = ls.Elements[0];
    
            // select all headings
            Console.WriteLine("TOC:");
            Console.WriteLine("----");
            // iterate over elements and select all heading elements
            foreach (Element e in root.Children)
            {
                string type = e.StructElement.Type;
                if (string.IsNullOrEmpty(type) || !type.StartsWith("H"))
                    continue;
                int headingLevel;
                if (!int.TryParse(type.Substring(1), out headingLevel))
                    continue;
                // get the element text
                string text = e.GetText();
                if (string.IsNullOrEmpty(text))
                    text = "H" + headingLevel.ToString();
                text = new string(' ', (headingLevel - 1) * 2) + text;
                Console.WriteLine(text);
                
            }
    
            // select all tables
            var tables = root.Children.FindAll((e_) => e_.StructElement.Type == "Table").ToArray();
            foreach (var t in tables)
            {
                ShowTable(t);
            }
        }
    }
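
    The heading test inside the TOC loop above (a type that starts with "H" followed by a number) can be factored into a small helper that is easy to check on its own. The sketch below is plain C# with no DsPdf dependency; the HeadingTags class and its method names are hypothetical, and the extra check that the level is positive is an addition that the sample above does not make.

    ```csharp
    using System;

    // Hypothetical helper, not part of DsPdf: interprets structure types such as "H1" or "H3"
    static class HeadingTags
    {
        // Returns true and the heading level for "H<number>"; false for "H", "P", "Table", null, etc.
        public static bool TryGetLevel(string type, out int level)
        {
            level = 0;
            if (string.IsNullOrEmpty(type) || !type.StartsWith("H", StringComparison.Ordinal))
                return false;
            // Added guard: reject zero or negative levels so the indent width is never negative
            return int.TryParse(type.Substring(1), out level) && level > 0;
        }

        // Indents the text by two spaces per nesting level, as the TOC loop does
        public static string Indent(string text, int level) =>
            new string(' ', (level - 1) * 2) + text;
    }

    class Program
    {
        static void Main()
        {
            foreach (var tag in new[] { "H1", "H2", "P", "Table", "H" })
            {
                // Only "H1" and "H2" pass the test; the others are skipped
                if (HeadingTags.TryGetLevel(tag, out int level))
                    Console.WriteLine(HeadingTags.Indent(tag, level));
            }
        }
    }
    ```

    In the TOC loop, the helper would be called with e.StructElement.Type, and the returned level would drive the indentation of the text obtained from e.GetText().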
    

    The example code below shows how to extract all paragraphs from a PDF document and save them to a Word document:
    C#
    // Restore a Word document from a PDF
    GcPdfDocument doc = new GcPdfDocument();
    using (var s = new FileStream("CharacterFormatting.pdf", FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        doc.Load(s);
    
        // get the LogicalStructure and top parent element
        LogicalStructure ls = doc.GetLogicalStructure();
        Element root = ls.Elements[0];
    
        GcWordDocument wdoc = new GcWordDocument();
    
        // iterate over elements and select all paragraphs
        foreach (Element e in root.Children)
        {
            if (e.StructElement.Type != "P")
                continue;
            var tps = e.GetParagraphs();
            if (tps == null)
                continue;
    
            foreach (var tp in tps)
            {
                // build a Word paragraph from a ITextParagraph
                Paragraph p = wdoc.Body.Paragraphs.Add();
                foreach (var tr in tp.Runs)
                {
                    var range = p.GetRange();
                    var run = range.Runs.Add(tr.GetText());
                    run.Font.Size = tr.Attrs.FontSize;
                    if (tr.Attrs.NonstrokeColor.HasValue)
                        run.Font.Color.RGB = tr.Attrs.NonstrokeColor.Value;
    
                    tr.Attrs.Font.GetFontAttributes(out string fontFamily,
                        out FontWeight? fontWeight,
                        out FontStretch? fontStretch,
                        out bool? fontItalic);
                    if (!string.IsNullOrEmpty(fontFamily))
                        run.Font.Name = fontFamily;
                    if (fontWeight.HasValue)
                        run.Font.Bold = fontWeight.Value >= FontWeight.Bold;
                    if (fontItalic.HasValue)
                        run.Font.Italic = fontItalic.Value;
                }
            }
        }
        wdoc.Save("CharacterFormatting.docx");
    }
    

    See Tagged PDF to learn how to create tagged PDF files using DsPdf.