PDF files present data in a form that is independent of application software, hardware, and operating systems. You will often need to access the data stored inside a PDF document, and the usual way to get it is to copy and paste it by hand. With the GcPdf API, you can instead define a region on a page and let code extract the text that falls inside it.

This helps us reduce dependency on manual data entry.

Reasons to Extract Text from a PDF

We often need to save a PDF file as a Word document in order to edit it. To make this task easier, GrapeCity Documents for PDF (GcPdf) allows you to extract data from a PDF and save it in a Word document. This transforms the extracted text into a readable and editable form.

When extracting text with the GcPdf API, you specify what to extract through the text map (ITextMap) of a page. The GetFragment method fetches a fragment of text from a particular location or from a whole page, and the GetText method returns the characters of a fragment.

Note: Two overloads of the GetFragment method exist: one extracts text from a whole page of a PDF, and the other fetches text from a particular location on a page.

Extract text from a specific location within the PDF

  1. Load an existing PDF into a GcPdfDocument instance:

    GcPdfDocument doc = new GcPdfDocument();
    FileStream fs = new FileStream(Path.Combine("Resources","Document.pdf"),FileMode.Open,FileAccess.Read);
    doc.Load(fs);
    
  2. Fetch the ITextMap instance for a page:

    var tmap = doc.Pages[0].GetTextMap();

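  3. Specify the two points that bound the location to extract text from, and fetch the text fragment between them:

    // Coordinates are given in inches; multiplying by 72 converts them to points.
    // tx1 is an illustrative value: set all four coordinates to match your document.
    float tx0 = 2.1f, ty0 = 3.37f, tx1 = 3.1f, ty1 = 3.5f;
    HitTestInfo htiFrom = tmap.HitTest(tx0 * 72, ty0 * 72);
    HitTestInfo htiTo = tmap.HitTest(tx1 * 72, ty1 * 72);
    tmap.GetFragment(htiFrom.Pos, htiTo.Pos, out TextMapFragment fragment1, out string text1);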
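
The factor of 72 in the step above is the number of points per inch, so the coordinates are effectively written in inches and converted to points before hit testing. A minimal sketch of that conversion as a named helper follows. The names `PdfUnits` and `InchesToPoints` are hypothetical and not part of the GcPdf API.

```csharp
using System;

// Hypothetical helper for illustration; not part of the GcPdf API.
public static class PdfUnits
{
    // There are 72 points in one inch.
    public const float PointsPerInch = 72f;

    // Converts a measurement in inches to points.
    public static float InchesToPoints(float inches)
    {
        return inches * PointsPerInch;
    }
}
```

With such a helper, a call like `tmap.HitTest(tx0 * 72, ty0 * 72)` could be written as `tmap.HitTest(PdfUnits.InchesToPoints(tx0), PdfUnits.InchesToPoints(ty0))`, which makes the unit conversion explicit.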


Extract entire page's text from PDF

  1. Load an existing PDF into a GcPdfDocument instance:

    GcPdfDocument doc = new GcPdfDocument();
    FileStream fs = new FileStream(Path.Combine("Resources","Document.pdf"),FileMode.Open,FileAccess.Read);
    doc.Load(fs);

  2. Fetch the ITextMap for the second page:

    var tmap_page2 = doc.Pages[1].GetTextMap();

  3. Get all text fragments and their locations on the page:

    tmap_page2.GetFragment(out TextMapFragment fragment2, out string text);
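
Once the page text is available as a single string, it is often convenient to break it into lines before processing it further. The sketch below does this with plain .NET string handling. It assumes the extracted string separates lines with newline characters, which you should verify for your document. `ExtractedText` and `ToLines` are hypothetical names, not part of GcPdf.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the GcPdf API.
public static class ExtractedText
{
    // Splits extracted text into trimmed, non-empty lines.
    // Assumes lines are separated by "\r\n", "\n", or "\r".
    public static List<string> ToLines(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return new List<string>();
        }

        return text
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToList();
    }
}
```

Calling `ExtractedText.ToLines(text)` on the string returned by GetFragment yields one entry per non-empty line, which can then be handled individually.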


Add Extracted Text to a Word Document

The extracted text can be written to a Word document using the GcWord API, which provides an object model for creating and formatting Word documents in code.

The extracted text can be added to a paragraph in the Word document. Once it is formatted, the Word document can be saved.

This is performed as follows:

  1. Create a new GcWordDocument:

    GcWordDocument wordDocument = new GcWordDocument();

  2. Add paragraphs to the first section of the Word document to hold the text fetched from the PDF:

    ParagraphCollection paragraphs = wordDocument.Body.Sections.First.GetRange().Paragraphs;
    // Heading paragraph, formatted with the style "s".
    paragraphs.Add("Text from the GcPdf Document", s);
    // One paragraph per extracted line of text.
    foreach (TextLineFragment tlf in fragment1)
    {
        paragraphs.Add(tmap.GetText(tlf));
    }

where “s” is a Style class instance.

  3. Save the final Word document:

    wordDocument.Save("ExtractedData.docx");

That’s it. You have now added extracted text from GcPdf to a GcWord document.

Here's your generated Word document:

[Image: the generated Word document]

Try this tutorial yourself!

Download the sample

Try a GcPdf free trial for 30 days
