GrapeCity Documents for PDF v3.2 adds significant enhancements for extracting text from PDF documents. The extraction logic is improved and now handles special cases such as text rendered multiple times to create bold or shadow effects, so that such text appears only once in the output instead of being repeated. The FindText method now finds text that spans more than one line; each FoundPosition it returns exposes a Bounds property holding an array of Quadrilateral structures. A new property, ITextMap.Paragraphs, returns a collection of ITextParagraph objects associated with the ITextMap.

Extract Paragraphs Using ITextMap.Paragraphs

This example reads an existing multi-page PDF document and shows how to use ITextMap.Paragraphs to extract the paragraphs from each page. The complete example and code are included in the updated sample explorer for GrapeCity Documents for PDF.

Figure 1 Original document Wetlands.pdf

The code extracts the text paragraphs on each page and renders them in a new PDF document, alternating the background color of each paragraph for clarity:

Figure 2 Extract Paragraphs from a PDF Sample

First, the code creates a new PDF document where the text paragraphs will be rendered and adds a note explaining the sample at the top of the first page:

const int margin = 36;  
Color c1 = Color.PaleGreen;  
Color c2 = Color.PaleGoldenrod;  

GcPdfDocument doc = new GcPdfDocument();  
var page = doc.NewPage();  

var rc = Common.Util.AddNote(  
   "Here we load an existing PDF (Wetlands) into a temporary GcPdfDocument, " +  
   "and iterate over the pages of that document, printing all paragraphs found on the page. " +  
   "We alternate the background color for the paragraphs so that the bounds between paragraphs are more clear. " +  
   "The original PDF is appended to the generated document for reference.",  
   new RectangleF(margin, margin, page.Size.Width - margin * 2, 0));  

// Text format for captions:
var tf = new TextFormat()
{
   Font = Font.FromFile(Path.Combine("Resources", "Fonts", "yumin.ttf")),
   FontSize = 14,
   ForeColor = Color.Blue
};
// Text format for the paragraphs:
var tfpar = new TextFormat()
{
   Font = StandardFonts.Times,
   FontSize = 12,
   BackColor = c1,
};
// Text layout to render the text:
var tl = page.Graphics.CreateTextLayout();
tl.MaxWidth = doc.PageSize.Width;
tl.MaxHeight = doc.PageSize.Height;
tl.MarginAll = rc.Left;
tl.MarginTop = rc.Bottom + 36;
// Text split options for widow/orphan control:
TextSplitOptions to = new TextSplitOptions(tl)
{
   MinLinesInFirstParagraph = 2,
   MinLinesInLastParagraph = 2,
   RestMarginTop = rc.Left,
};

Code Walkthrough: The code creates a new GcPdfDocument, doc, and adds a page to it with the NewPage method. It then places a note explaining the sample at the top of the first page using the helper function AddNote. Next, two separate TextFormat objects are created to format the captions and the paragraphs, and a TextLayout object is created and given the page size and margins. Finally, a TextSplitOptions object is created to control pagination, including widow and orphan lines.

With the new ITextMap.Paragraphs property, the code required to extract the paragraphs is short:

// Open an arbitrary PDF, load it into a temp document and get all page texts:
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "Wetlands.pdf")))
{
   var doc1 = new GcPdfDocument();
   doc1.Load(fs);

   for (int i = 0; i < doc1.Pages.Count; ++i)
   {
       tl.AppendLine(string.Format("Paragraphs from page {0} of the original PDF:", i + 1), tf);

       var pg = doc1.Pages[i];
       var pars = pg.GetTextMap().Paragraphs;
       foreach (var par in pars)
       {
           tl.AppendLine(par.GetText(), tfpar);
           // Alternate the background color for the next paragraph:
           tfpar.BackColor = tfpar.BackColor == c1 ? c2 : c1;
       }
   }

   tl.PerformLayout(true);
   while (true)
   {
      // 'rest' will accept the text that did not fit:
      var splitResult = tl.Split(to, out TextLayout rest);
      doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
      if (splitResult != SplitResult.Split)
         break;
      // Continue with the remaining text on a new page:
      tl = rest;
      doc.NewPage();
   }
   // Append the original document for reference:
   doc.MergeWithDocument(doc1, new MergeDocumentOptions());
}
// Done:
doc.Save(stream);

Code Walkthrough: First, the code opens Wetlands.pdf, loads it into a temporary document, and uses the new ITextMap.Paragraphs API to get the text paragraphs on each page, appending them to the TextLayout of the new document. After each paragraph is appended, the BackColor of the paragraph TextFormat (tfpar) is switched between the two colors so that adjacent paragraphs are easy to tell apart. TextLayout.PerformLayout and TextLayout.Split then paginate the text across the pages of the new document, and GcPdfDocument.MergeWithDocument appends the original document for reference. The final result is saved using GcPdfDocument.Save.
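When only the text is needed, the same paragraph API can be used without any rendering. The following is a minimal sketch, not part of the sample explorer: the class name ParagraphDump and the method Dump are hypothetical, and it assumes the GrapeCity.Documents.Pdf package is referenced. It uses only the calls shown in the sample above (Load, Pages, GetTextMap, Paragraphs, GetText) and writes each paragraph to a TextWriter.

```csharp
using System;
using System.IO;
using GrapeCity.Documents.Pdf;

// Hypothetical helper for illustration; not part of GcPdf or its samples.
public static class ParagraphDump
{
    // Writes every paragraph of every page to 'output', one blank line
    // after each paragraph, and returns the number of paragraphs written.
    public static int Dump(string pdfPath, TextWriter output)
    {
        int count = 0;
        // Keep the stream open for as long as the loaded document is used.
        using (var fs = File.OpenRead(pdfPath))
        {
            var doc = new GcPdfDocument();
            doc.Load(fs);

            for (int i = 0; i < doc.Pages.Count; ++i)
            {
                output.WriteLine($"--- Page {i + 1} ---");
                foreach (var par in doc.Pages[i].GetTextMap().Paragraphs)
                {
                    output.WriteLine(par.GetText());
                    output.WriteLine();
                    ++count;
                }
            }
        }
        return count;
    }
}
```

A call such as ParagraphDump.Dump(Path.Combine("Resources", "PDFs", "Wetlands.pdf"), Console.Out) would print the paragraphs page by page; passing a StringWriter instead collects them into a string.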

Enhanced FindText Across Multiple Lines

Figure 3 Using FindText across lines and paragraphs

The FindText method now supports finding text that spans multiple lines within a paragraph or crosses paragraph boundaries. To illustrate this, code similar to the FindText demo sample is added to the example above; it searches for two longer text strings that span multiple lines and paragraphs. Here is the code, added immediately above the call to doc.Save(stream):

// Example using FindText to find text spanning multiple lines:  
var findIt = doc.FindText(new FindTextParams("Hundreds, if not thousands, of invertebrates that form the food of birds also rely on water for most, if not all, phases of their existence.", true, false), OutputRange.All);  
foreach (var find in findIt)  
   foreach (var ql in find.Bounds)  
      doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, Color.FromArgb(100, Color.OrangeRed));  
var findIt2 = doc.FindText(new FindTextParams("To lose any more of these vital areas is almost unthinkable. Wetlands enhance and protect water quality in lakes and streams where additional species spend their time and from which we draw our water.", true, false), OutputRange.All);  
foreach (var find in findIt2)  
   foreach (var ql in find.Bounds)  
      doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, Color.FromArgb(100, Color.OrangeRed));  
// Done:
doc.Save(stream);

Code Walkthrough: The FindText method is used to find two longer text strings: the first spans multiple lines, and the second spans multiple paragraphs. The FoundPosition.Bounds property returns an array of Quadrilateral structures, one for each successive line or paragraph that the found text occupies. The code uses GcGraphics.FillPolygon to highlight the found text by filling each of these areas with a semi-transparent orange-red color.
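The two searches above differ only in the search string, so the find-and-highlight loop can be factored into one helper. This is a sketch under assumptions rather than code from the demo: the class Highlighter and the method HighlightAll are hypothetical names, the using directives reflect where these types are believed to live in the v3 samples and may need adjusting, and the FindTextParams arguments simply repeat the values used in the sample above.

```csharp
using System.Drawing;
using GrapeCity.Documents.Common;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.TextMap;

// Hypothetical helper for illustration; not part of GcPdf or its samples.
public static class Highlighter
{
    // Fills the bounds of every occurrence of each phrase with 'color'
    // and returns the total number of occurrences found.
    public static int HighlightAll(GcPdfDocument doc, Color color, params string[] phrases)
    {
        int matches = 0;
        foreach (var phrase in phrases)
        {
            // Same FindTextParams arguments as in the sample above.
            var found = doc.FindText(new FindTextParams(phrase, true, false), OutputRange.All);
            foreach (var find in found)
            {
                // One quadrilateral per line (or paragraph) the match occupies.
                foreach (var ql in find.Bounds)
                    doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, color);
                ++matches;
            }
        }
        return matches;
    }
}
```

With this helper, both searches collapse into a single call that passes Color.FromArgb(100, Color.OrangeRed) and the two search strings; the returned count makes it easy to check whether a phrase was found at all.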
