PDFs have become an exceedingly popular way to share and view data because their content is more difficult to modify. This allows companies to share data with ease and peace of mind. When many PDFs are shared, you often need to search several documents to find a particular piece of text.

When working with several PDFs, it becomes vital to use a PDF API that can programmatically find text in different scenarios, such as searching a particular page or range of pages, or finding text with specified search parameters. You may even want to programmatically search for and hide certain confidential data. Using GrapeCity Documents for PDF, you can accomplish all of these text search and redaction scenarios.

GrapeCity Documents for PDF (GcPDF), our PDF API library, enables developers to conduct different types of text searches to find, highlight, or redact text within a PDF document. This blog covers the steps required to conduct these text searches. Continue reading to discover them all!

  1. Find and Highlight Text in a PDF Document
  2. Find Text on a Specific Page of a PDF
  3. Find Text in a Specific Range of PDF Pages
  4. Find Text with Search Options in PDF
  5. Find and Highlight Transformed Text
  6. Find Data Based on Structure Tags
  7. Find and Redact Text

Try GrapeCity Documents for PDF

Download the latest version of GrapeCity Documents for PDF

Download Now!

1. Find and Highlight Text in a PDF Document

Use GcPDF’s FindText method to perform a text search in a PDF. Highlight each item found by drawing on the page's Graphics object, using the bounds of the found text and a System.Drawing color.

For example, use the following code to find the word "drive" in a PDF and then highlight the found word.

C#:

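// Namespaces below are assumed from the GcPDF samples; adjust them to your GcPDF version
// using System.Drawing;
// using System.IO;
// using GrapeCity.Documents.Common;
// using GrapeCity.Documents.Pdf;
// using GrapeCity.Documents.Pdf.TextMap;

// Create an object of the GcPdfDocument class
var doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("PDF Test.pdf"),
    FileMode.Open, FileAccess.Read))
{
    // Load an existing PDF
    doc.Load(fs);
    // Use the FindText method to search for "drive" (whole word, case-insensitive)
    var findsDrive = doc.FindText(new FindTextParams("drive", true, false), OutputRange.All);

    // Highlight all found occurrences of "drive" in semi-transparent orange red
    foreach (var find in findsDrive)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(
            find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));

    // Save the highlighted document (the output file name is only an example)
    doc.Save("PDF Find Test.pdf");
}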

See GcPDF’s demo Find and highlight all occurrences of a string as another resource.

2. Find Text on a Specific Page of a PDF

In some cases, users may only want to search for text on one page instead of the entire PDF document. To do this, get the text map of the page by its index and perform the text search only on that page’s text map.

For example, the code below will do the following:

  1. Create a new instance of the FindTextParams class
  2. Get the text map of a page by its index
  3. Perform a text search within the text map using the FindText method

C#:

            // Create a new instance of the PDF document
            GcPdfDocument doc = new GcPdfDocument();
            using (var fs = new FileStream(Path.Combine("PDF Test.pdf"),
                FileMode.Open, FileAccess.Read))
            {
                // Load the existing PDF
                doc.Load(fs);
                // 1. Create a new instance of FindTextParams to search for "the"
                var ftp = new FindTextParams("the", false, false);
                // 2. Get the text map of a page by its index; indexes start at 0, so this is page 2
                var tm = doc.Pages[1].GetTextMap();
                if (tm != null)
                {
                    // 3. Search within the text map using FindText and highlight each match in orange
                    tm.FindText(ftp, (p_) => {
                        doc.Pages[1].Graphics.FillPolygon(p_.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
                    });
                }
                doc.Save("PDF Find Test.pdf");
            }

This code will only highlight the text found on the second page of the PDF.

See our GcPDF demo Get text from a specific position in a PDF for more information on using the text map of a PDF.

3. Find Text in a Specific Range of PDF Pages

To conduct a text search within a range of pages in a PDF document, use GcPDF’s OutputRange Class to define the range of pages with the FromPage and ToPage properties.

For example, the following code will search pages 3 and 4 of the PDF for the word "the", then highlight it:

C#:

            GcPdfDocument doc = new GcPdfDocument();

            using (var fs = new FileStream(Path.Combine("PDF Test.pdf"),
                FileMode.Open, FileAccess.Read))
            {
                // Load an existing document from the file stream
                doc.Load(fs);
                // Create a new FindTextParams instance (whole word, case-insensitive)
                var ftp = new FindTextParams("the", true, false);
                // Define the page range with the from and to page numbers
                OutputRange pageRange = new OutputRange(3, 4);
                // Find all matches within the page range
                var findsTextThe = doc.FindText(ftp, pageRange);

                // Highlight each match in semi-transparent orange red
                foreach (var find in findsTextThe)
                    doc.Pages[find.PageIndex].Graphics.FillPolygon(
                        find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));

                doc.Save("PDF Find Test.pdf");
            }

Results: Terms are found and highlighted only on pages 3 and 4 of the PDF document.
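
One detail worth keeping straight here: in the code above, OutputRange is given page numbers (3 and 4), while each result's PageIndex is used to index doc.Pages, which starts at 0. The following is a minimal sketch, using only standard .NET, of turning the page indexes of the finds into a hit count keyed by page number. PageHitsSketch and CountHitsPerPage are hypothetical names introduced for illustration; they are not part of GcPDF.

```csharp
using System.Collections.Generic;

public static class PageHitsSketch
{
    // Converts 0-based page indexes (one entry per found item) into a
    // dictionary of 1-based page number -> number of hits on that page.
    public static SortedDictionary<int, int> CountHitsPerPage(IEnumerable<int> pageIndexes)
    {
        var counts = new SortedDictionary<int, int>();
        foreach (int index in pageIndexes)
        {
            int pageNumber = index + 1; // index 0 is page 1
            counts.TryGetValue(pageNumber, out int current); // current is 0 if absent
            counts[pageNumber] = current + 1;
        }
        return counts;
    }
}
```

With the results from the example above, it could be called as PageHitsSketch.CountHitsPerPage(findsTextThe.Select(f => f.PageIndex)), which additionally needs using System.Linq.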

4. Find Text With Search Options in PDF

When conducting a text search, users can specify the search options through GcPDF’s FindTextParams constructor. Its wholeWord and matchCase parameters indicate whether the search should match whole words only, be case-sensitive, or both.

// Find all occurrences of 'Motorcycle' using a whole-word, case-sensitive search:

var findsMotorcycle = doc.FindText(new FindTextParams("Motorcycle", true, true), OutputRange.All);
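
To make the effect of the two flags concrete, here is a small sketch that mimics them on a plain string with System.Text.RegularExpressions. It is a stand-in for illustration only: it does not call GcPDF, and the assumption that a "whole word" is delimited by the regex word boundary \b may not match GcPDF's own rules exactly. SearchOptionsSketch and CountMatches are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class SearchOptionsSketch
{
    // Counts occurrences of 'term' in 'text' under the two search options.
    public static int CountMatches(string text, string term, bool wholeWord, bool matchCase)
    {
        // Escape the term so it is matched literally
        string pattern = Regex.Escape(term);
        if (wholeWord)
            pattern = @"\b" + pattern + @"\b"; // require word boundaries on both sides

        RegexOptions options = RegexOptions.CultureInvariant;
        if (!matchCase)
            options |= RegexOptions.IgnoreCase;

        return Regex.Matches(text, pattern, options).Count;
    }
}
```

For the sample text "Motorcycle motorcycles MOTORCYCLE motorcycle" and the term "Motorcycle", this sketch counts 1 match with both flags on, 3 with whole word only, 1 with match case only, and 4 with both off.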

5. Find and Highlight Transformed Text

PDFs often contain graphically transformed text, such as text drawn on top of an existing PDF using page graphics. This is typical when adding a logo or watermark to a PDF. GcPDF can search within graphically transformed text and highlight the found items.

To accomplish this, use GcPDF's FindText method to search for the text you want.

Then, loop through each page that contains the searched text and create a content stream for the page using GcPDF's PageContentStream. With this stream, get a graphics object using GetGraphics, and use it to apply the highlighting to the bounds of the found text.

For example, the following code searches a PDF document for graphically transformed text that acts as a watermark, and highlights the found text in blue.

C#:

            // Requires: using System.Linq; (for Select, Distinct, and Where)
            GcPdfDocument doc = new GcPdfDocument();
            using (var fs = new FileStream(Path.Combine("Resources", "PDFs", "logo pdf.pdf"), FileMode.Open, FileAccess.Read))
            {
                // Load an existing document from the file stream
                doc.Load(fs);
                // Find all occurrences of 'LOGO' using a case-sensitive search
                var finds = doc.FindText(
                    new FindTextParams("LOGO", false, true),
                    OutputRange.All);

                // Highlight all finds: first, collect the pages where the text was found
                var pgIndices = finds.Select(f_ => f_.PageIndex).Distinct();
                // Loop through the pages with found text
                foreach (int pgIdx in pgIndices)
                {
                    var page = doc.Pages[pgIdx];
                    // Insert a new content stream at the start of the page's content streams
                    PageContentStream pcs = page.ContentStreams.Insert(0);
                    // Get a graphics object that draws on that content stream
                    var g = pcs.GetGraphics(page);
                    foreach (var find in finds.Where(f_ => f_.PageIndex == pgIdx))
                    {
                        foreach (var ql in find.Bounds)
                        {
                            // Fill and outline each polygon that bounds the found text
                            g.FillPolygon(ql, Color.CadetBlue);
                            g.DrawPolygon(ql, Color.Blue);
                        }
                    }
                }
                // Done: save the result (the output file name is only an example)
                doc.Save("PDF Find Test.pdf");
            }

See our GcPDF demo Find and highlight all occurrences of a string in a graphically transformed text for another example of how to search graphically transformed text.

6. Find Data Based on Structure Tags

Searching for text on the basis of structure tags is another way to narrow a text search. For example, if the text being searched for is a header (i.e. H1, H2, H3), you can use GcPDF's API to get the structure of the document and then specify which tag you need. In this case, let's say "H1"; then search for the text within the returned tagged items. Here are the steps to accomplish this:

  1. Get the structure of the PDF
  2. Search the page root for a specific structure tag
  3. Loop through all found H1 tags for specific text

Note: You must include the GrapeCity.Documents.Pdf.Recognition.Structure namespace.

Step 1: Get the Structure of the PDF

Using GcPDF's GetLogicalStructure method, get the logical structure of the PDF document. Then, hold a reference to the root of this structure in an Element.

            GcPdfDocument doc = new GcPdfDocument();
            using (var fs = new FileStream(Path.Combine("Resources", "PDFs", "tags.pdf"), FileMode.Open, FileAccess.Read))
            {
                doc.Load(fs);
                // 1. Get the LogicalStructure of the doc
                LogicalStructure ls = doc.GetLogicalStructure();
                if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
                {
                    // No structure tags found
                    // (Common.Util.AddNote appears to be a helper from the GcPDF samples, not part of the library)
                    Common.Util.AddNote("No structure tags were found in the source document.", doc.Pages.Add());
                    return;
                }
                // Element holds a reference to the root of the logical structure
                Element root = ls.Elements[0];

Step 2: Search the Page Root for a Specific Structure Tag

Use the FindAll method to search the root element's children for a specific structure tag. In this example, I am searching for all of the “H1” header tags and storing the results in a variable.

                // Continuing inside the using block from Step 1:
                // 2. Find all the H1 tags
                var find = root.Children.FindAll(e_ => e_.StructElement.Type == "H1");

Step 3: Loop Through All Found H1 Tags for Specific Text

Loop through the found H1 tags and get the text of each item using the GetText method.

We then check whether the text of the H1 header matches the text we are trying to locate. If it does, get the location of the header using the GetCoords method and highlight the area.

                // Continuing inside the using block from Steps 1 and 2:
                // 3. Loop through all found H1 tags for specific text
                foreach (Element e in find)
                {
                    var color = Color.FromArgb(64, Color.Magenta);
                    if (e.HasContentItems)
                    {
                        // Get the header's text
                        var text = e.GetText();
                        // Search for the title with the text "Quickstart"
                        if (text == "Quickstart")
                        {
                            foreach (var i in e.ContentItems)
                            {
                                if (i is ContentItem ci)
                                {
                                    var p = ci.GetParagraph();
                                    if (p != null)
                                    {
                                        // Outline the area of the found H1 using its coordinates
                                        ci.Page.Graphics.DrawPolygon(p.GetCoords(), color, 1, null);
                                    }
                                }
                            }
                        }
                    }
                }
                // Save the result (the output file name is only an example)
                doc.Save("PDF Find Test.pdf");
            } // End of the using block opened in Step 1
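
The loop above compares the header text with ==, which only succeeds on an exact match. If a header that merely includes the search text should count as well, a more tolerant comparison could look like the following sketch. It uses only standard .NET string methods and does not call GcPDF; HeadingMatchSketch and HeadingContains are hypothetical names introduced for illustration.

```csharp
using System;

public static class HeadingMatchSketch
{
    // Returns true when 'headingText' contains 'query', ignoring case.
    // Null or empty inputs never match.
    public static bool HeadingContains(string headingText, string query)
    {
        if (string.IsNullOrEmpty(headingText) || string.IsNullOrEmpty(query))
            return false;
        return headingText.IndexOf(query, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

In the loop, the condition text == "Quickstart" could then be replaced with HeadingMatchSketch.HeadingContains(text, "Quickstart").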

For more information and examples on reading a PDF's structure tags, check out our Read Structure Tags demos.

7. Find and Redact Text

Some PDF documents contain information that needs to be found and redacted. Using GcPDF, users can search a PDF’s text map to find text and mark it for redaction using an instance of the RedactAnnotation class.

Then, redact the marked areas using the Redact method.

                // Requires: using System.Text.RegularExpressions; (for Regex)
                // Loop through pages, marking anything that looks like a short date for redaction:
                foreach (var page in doc.Pages)
                {
                    var tmap = page.GetTextMap();
                    foreach (ITextLine tline in tmap)
                    {
                        if (Regex.Match(tline.Text.Trim(), @"\d+[/-]\w+[/-]\d").Success)
                        {
                            // Mark the whole text line; setting Page associates the annotation with that page
                            var redact = new RedactAnnotation()
                            {
                                Rect = tline.GetCoords().ToRect(),
                                MarkBorderColor = Color.Red,
                                MarkFillColor = Color.Yellow,
                                Page = page
                            };
                        }
                    }
                }
                // Apply the redacts once, after all pages have been marked:
                doc.Redact();
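
Because the redaction marks the rectangle of the whole text line, it helps to know exactly which lines the pattern accepts. The sketch below applies the same regular expression from the loop above to plain strings, using only System.Text.RegularExpressions; ShortDateSketch and LooksLikeShortDate are hypothetical names, and the sample strings are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class ShortDateSketch
{
    // Same pattern as in the redaction loop: digits, a '/' or '-',
    // word characters, another '/' or '-', then at least one digit.
    private static readonly Regex ShortDate = new Regex(@"\d+[/-]\w+[/-]\d");

    // True when the pattern occurs anywhere in the (trimmed) line.
    public static bool LooksLikeShortDate(string lineText)
    {
        return ShortDate.IsMatch(lineText.Trim());
    }
}
```

Under this pattern, "12/31/2020", "1-Jan-20", and "Due 3/4/21 at noon" all match, while "Invoice total: 42" and "2020-01" do not. The pattern is not anchored, so a line that merely contains a date is marked in full, and a string such as "Part 10-A-7" also matches even though it is not a date. Tighten the expression if that is too broad for your documents.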

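Results (two screenshots): "Find and redact areas" shows the text marked for redaction, and "Apply redact and erase data" shows the document after the redactions are applied.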

See GcPDF’s Find and Redact Text demo and our Apply Redact demo.

This blog showcased seven ways to use GcPDF to find text within a PDF. Try GcPDF, our .NET PDF library, yourself today, and let the GrapeCity Documents team know if there are any other find text scenarios that you would like to explore!
