PDFs have become an exceedingly popular way to share and view data because their content is difficult to modify, which lets companies share data with ease and peace of mind. When PDFs are shared, several documents may need to be searched to find a piece of text.
When working with several PDFs, it becomes vital to use a PDF API that can programmatically find text in different scenarios, such as searching a particular page or range of pages, or finding text with specified search parameters. You may even want to programmatically search for and hide certain confidential data. With GrapeCity's Documents for PDF, you can handle all of these text search and redaction scenarios.
GrapeCity’s Documents for PDF (GcPDF), our PDF API library, enables developers to conduct different types of text searches to find, highlight, or redact text within a PDF document. This blog covers the steps required to conduct these text searches. Continue reading to discover them all!
Use GcPDF’s FindText method to perform a text search in a PDF. Highlight each item found by filling the bounds of the found text through the Graphics property of the page it appears on.
For example, use the following code to find the word "drive" in a PDF and then highlight the found word.
C#:
// Create an object of GcPdfDocument class.
var doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("PDF Test.pdf"),
    FileMode.Open, FileAccess.Read))
{
    // Load an existing PDF
    doc.Load(fs);
    // Use the FindText method to search for "drive" (whole-word, case-insensitive match)
    var findsDrive = doc.FindText(new FindTextParams("drive", true, false), OutputRange.All);
    // Highlight every match with semi-transparent orange red
    foreach (var find in findsDrive)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(
            find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
    // Save the result; 'stream' is an output Stream supplied by the calling code
    doc.Save(stream);
}
See GcPDF’s demo Find and highlight all occurrences of a string as another resource.
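In some cases, users may only want to search for text on one page instead of the entire PDF document. To do this, get the text map of the page by its index and perform a text search only on that page’s text map.
For example, the code below performs the following steps, numbered in its comments: create a FindTextParams instance, get the text map of the page, and search within that text map, highlighting each match.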
C#:
// Create new instance of PDF document
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("PDF Test.pdf"),
    FileMode.Open, FileAccess.Read))
{
    // Load existing PDF
    doc.Load(fs);
    // 1. Create a new instance of FindTextParams to search for "the"
    var ftp = new FindTextParams("the", false, false);
    // 2. Get the text map of a page by its index; the index starts at 0, so this is page 2
    var tm = doc.Pages[1].GetTextMap();
    if (tm != null)
    {
        // 3. Search within the text map using the FindText method and highlight each match in orange
        tm.FindText(ftp, (p_) =>
        {
            doc.Pages[1].Graphics.FillPolygon(p_.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
        });
    }
    doc.Save("PDF Find Test.pdf");
}
This code will only highlight the text found on the second page of the PDF.
See our GcPDF demo Get text from a specific position in a PDF for more information on using the text map of a PDF.
To conduct a text search within a range of pages in a PDF document, use GcPDF’s OutputRange Class to define the range of pages with the FromPage and ToPage properties.
For example, the following code will search pages 3 and 4 of the PDF for the word "the", then highlight it:
C#:
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("PDF Test.pdf"),
    FileMode.Open, FileAccess.Read))
{
    // Load an existing document from file stream
    doc.Load(fs);
    // Create a new FindTextParams instance
    var ftp = new FindTextParams("the", true, false);
    // Define the from and to pages of the range
    OutputRange pageRange = new OutputRange(3, 4);
    // Find all text using case-insensitive whole-word search within the page range
    var findsTextThe = doc.FindText(ftp, pageRange);
    foreach (var find in findsTextThe)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(
            find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
    doc.Save("PDF Find Test.pdf");
}
Results: Searching and highlighting of found terms happen only on pages 3 and 4 of the PDF document.
When conducting a text search, users can specify the find text parameters using GcPDF’s FindTextParams constructor. The FindTextParams constructor has the parameters wholeWord and matchCase, which allow the user to indicate whether the search should match whole words, be case sensitive, or both.
// Find all occurrences of 'Motorcycle' using a whole-word, case-sensitive search:
var findsMotorcycle = doc.FindText(new FindTextParams("Motorcycle", true, true), OutputRange.All);
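To make the two flags more concrete, the sketch below mimics them with plain .NET regular expressions. It does not call GcPDF: `CountMatches` is a hypothetical helper written only for this illustration, and GcPDF's own matching rules may differ in edge cases such as punctuation or hyphenated words.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchFlagsSketch
{
    // Hypothetical helper (not part of GcPDF): counts how often 'term' occurs
    // in 'text' under the given whole-word and case-sensitivity settings.
    public static int CountMatches(string text, string term, bool wholeWord, bool matchCase)
    {
        // Escape the term so it is treated as literal text, not as a pattern
        string pattern = Regex.Escape(term);
        // \b anchors the match to word boundaries on both sides
        if (wholeWord)
            pattern = @"\b" + pattern + @"\b";
        var options = matchCase ? RegexOptions.None : RegexOptions.IgnoreCase;
        return Regex.Matches(text, pattern, options).Count;
    }

    public static void Main()
    {
        const string text = "Motorcycle riders love Motorcycles and a motorcycle.";
        Console.WriteLine(CountMatches(text, "Motorcycle", false, false)); // 3
        Console.WriteLine(CountMatches(text, "Motorcycle", true, false));  // 2
        Console.WriteLine(CountMatches(text, "Motorcycle", false, true));  // 2
        Console.WriteLine(CountMatches(text, "Motorcycle", true, true));   // 1
    }
}
```

Turning on both flags gives the narrowest search, which is why the snippet above with both set to true would skip plurals and lowercase spellings.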
PDFs often contain graphically transformed text, that is, text drawn on top of an existing PDF using page graphics. This is typical when adding a logo or watermark to a PDF. GcPDF supports searching for text within graphically transformed text and highlighting the found items.
To accomplish this, use GcPDF's FindText method to search for the wanted text.
Then, loop through each page that contains the searched text and insert a content stream into the page using GcPDF's PageContentStream. With this stream, get a graphics object for the page using GetGraphics, and use it to apply the highlighting to the bounds of the found text.
For example, the following code searches a PDF document for the graphically transformed text that acts as a watermark and highlights the found text in blue.
C#:
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("Resources", "PDFs", "logo pdf.pdf"), FileMode.Open, FileAccess.Read))
{
    // Load an existing document from file stream
    doc.Load(fs);
    // Find all text items 'LOGO', using case-sensitive search:
    var finds = doc.FindText(
        new FindTextParams("LOGO", false, true),
        OutputRange.All);
    // Highlight all finds: first, find all pages where the text was found
    var pgIndices = finds.Select(f_ => f_.PageIndex).Distinct();
    // Loop through pages with found text
    foreach (int pgIdx in pgIndices)
    {
        var page = doc.Pages[pgIdx];
        // Insert a new content stream at index 0 of the page's content streams
        PageContentStream pcs = page.ContentStreams.Insert(0);
        // Get a graphics object that draws into that content stream
        var g = pcs.GetGraphics(page);
        foreach (var find in finds.Where(f_ => f_.PageIndex == pgIdx))
        {
            foreach (var ql in find.Bounds)
            {
                // Fill and outline each polygon to highlight the found text
                g.FillPolygon(ql, Color.CadetBlue);
                g.DrawPolygon(ql, Color.Blue);
            }
        }
    }
    // Done; 'stream' is an output Stream supplied by the calling code
    doc.Save(stream);
}
See our GcPDF demo Find and highlight all occurrences of a string in a graphically transformed text for another example of how to search graphically transformed text.
Searching for text on the basis of structure tags is another way to narrow a text search. For example, if the text being searched for is a header (i.e. H1, H2, H3), you can use GcPDF's API to get the structure of the document and then specify which tag you need. In this case, let's say "H1"; then search for the text within the returned tagged items. Here are the steps to accomplish this:
Note: You must include the GrapeCity.Documents.Pdf.Recognition.Structure namespace.
Use GcPDF's GetLogicalStructure method to get the logical structure of the PDF document. Then, hold a reference to this structure in an element.
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("Resources", "PDFs", "tags.pdf"), FileMode.Open, FileAccess.Read))
{
    doc.Load(fs);
    // 1. Get the LogicalStructure of the doc
    LogicalStructure ls = doc.GetLogicalStructure();
    if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
    {
        // No structure tags found:
        Common.Util.AddNote("No structure tags were found in the source document.", doc.Pages.Add());
        return;
    }
    // Element holds a reference of the logical structure
    Element root = ls.Elements[0];
Use the FindAll method to search the children of the root element for a specific structure tag. This example finds all of the “H1” header tags and stores them in a collection.
    // 2. Find all the H1 tags
    var find = root.Children.FindAll(e_ => e_.StructElement.Type == "H1");
Loop through the found H1 tags and get the text of each item using the GetText method.
Then check whether the text of each H1 header matches the text we are trying to locate. If it does, get the location of the header using the GetCoords method and highlight the area.
GcPdfDocument doc = new GcPdfDocument();
using (var fs = new FileStream(Path.Combine("Resources", "PDFs", "tags.pdf"), FileMode.Open, FileAccess.Read))
{
    doc.Load(fs);
    // 1. Get the LogicalStructure of the doc
    LogicalStructure ls = doc.GetLogicalStructure();
    if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
    {
        // No structure tags found:
        Common.Util.AddNote("No structure tags were found in the source document.", doc.Pages.Add());
        return;
    }
    // Element holds a reference of the logical structure
    Element root = ls.Elements[0];
    // 2. Find all the H1 tags
    var find = root.Children.FindAll(e_ => e_.StructElement.Type == "H1");
    // 3. Loop through all found H1 tags, looking for specific text
    var color = Color.FromArgb(64, Color.Magenta);
    foreach (Element e in find)
    {
        // Skip tags without content, and headers whose text is not "Quickstart"
        if (!e.HasContentItems || e.GetText() != "Quickstart")
            continue;
        foreach (var i in e.ContentItems)
        {
            if (i is ContentItem ci)
            {
                var p = ci.GetParagraph();
                if (p != null)
                {
                    // Outline the found H1 using the coordinates of its paragraph
                    ci.Page.Graphics.DrawPolygon(p.GetCoords(), color, 1, null);
                }
            }
        }
    }
    // Save the result; 'stream' is an output Stream supplied by the calling code
    doc.Save(stream);
}
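The sample above looks only at the children of the root element and compares the header text for exact equality. If headers in your documents are nested more deeply, or carry extra text around the search term, a recursive walk with a substring check may be a better fit. The sketch below shows that idea on a stand-in tree: `TagNode` and `FindByTag` are hypothetical names invented for this illustration and are not GcPDF types, so treat it as a pattern to adapt rather than as library code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a structure element; not a GcPDF type.
public sealed class TagNode
{
    public string Type { get; }
    public string Text { get; }
    public List<TagNode> Children { get; } = new List<TagNode>();

    public TagNode(string type, string text = "")
    {
        Type = type;
        Text = text;
    }
}

public static class TagSearchSketch
{
    // Depth-first walk that yields every node of the given tag type whose
    // text contains 'term', ignoring case.
    public static IEnumerable<TagNode> FindByTag(TagNode root, string tagType, string term)
    {
        var stack = new Stack<TagNode>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            if (node.Type == tagType &&
                node.Text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                yield return node;
            // Push children in reverse so they are visited in document order
            for (int i = node.Children.Count - 1; i >= 0; i--)
                stack.Push(node.Children[i]);
        }
    }

    public static void Main()
    {
        var root = new TagNode("Document");
        root.Children.Add(new TagNode("H1", "Quickstart"));
        var section = new TagNode("Sect");
        section.Children.Add(new TagNode("H1", "Quickstart for advanced users"));
        section.Children.Add(new TagNode("P", "Quickstart text in a paragraph"));
        root.Children.Add(section);

        // Prints the two H1 texts, including the one nested inside the section
        foreach (var hit in FindByTag(root, "H1", "quickstart"))
            Console.WriteLine(hit.Text);
    }
}
```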
For more information and examples on reading a PDF's structure tags, check out our Read Structure Tags demos.
Some PDF documents contain information that needs to be found and redacted. Using GcPDF, users can search a PDF’s text map to find text and mark it for redaction using an instance of the RedactAnnotation class.
Redact the marked areas using the Redact method.
// Loop through pages, marking anything that looks like a short date for redaction:
foreach (var page in doc.Pages)
{
    var tmap = page.GetTextMap();
    foreach (ITextLine tline in tmap)
    {
        if (Regex.Match(tline.Text.Trim(), @"\d+[/-]\w+[/-]\d").Success)
        {
            // Setting Page attaches the redact annotation to that page
            var redact = new RedactAnnotation()
            {
                Rect = tline.GetCoords().ToRect(),
                MarkBorderColor = Color.Red,
                MarkFillColor = Color.Yellow,
                Page = page
            };
        }
    }
}
// Apply the redacts once, after all pages have been marked:
doc.Redact();
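Because the rectangle marked for redaction is that of the whole text line, it is worth checking which lines the short-date pattern actually flags before applying the redacts. The sketch below runs the same regular expression against a few sample strings using plain .NET only. `LooksLikeShortDate` is a hypothetical helper name, and the sample lines are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShortDateSketch
{
    // Same pattern as in the redaction loop: digits, a '/' or '-', one or
    // more word characters, another '/' or '-', then at least one digit.
    public const string ShortDatePattern = @"\d+[/-]\w+[/-]\d";

    // Hypothetical helper: true if the line would be marked for redaction
    public static bool LooksLikeShortDate(string line) =>
        Regex.Match(line.Trim(), ShortDatePattern).Success;

    public static void Main()
    {
        string[] samples =
        {
            "Invoice date: 12/31/2020", // True
            "Due 5-Jan-21",             // True
            "Total: 1,234.50",          // False
            "Ref A-12-B",               // False
            "Version 1-2-3",            // True, although it is not a date
        };
        foreach (var s in samples)
            Console.WriteLine($"{s} -> {LooksLikeShortDate(s)}");
    }
}
```

The last sample shows that the pattern is deliberately loose, so a stricter expression may be worth considering if your documents contain version numbers or similar dash-separated values.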
Results: the first step finds and marks the areas to redact; the second step applies the redacts and erases the data.
See GcPDF’s Find and Redact text demo and our Apply Redact demo.
This blog showcased seven ways to use GcPDF to find text within a PDF. Try GcPDF, our .NET PDF library, yourself today, and let the GrapeCity Documents team know if there are any other find text scenarios that you would like to explore!