Parse data from semi-structured text files
The C1TextParser library can parse semi-structured text sources such as HTML and plain-text files.
Output formatted as JSON or objects
The extraction result can be returned as JSON or as an instance of a custom class.
Three different types of extractors
The C1TextParser library supports three different extractors: Starts-After-Continues-Until, HTML, and Template-based.
Regular expression-based matching
Extraction can be driven by regular-expression matches, by the text that follows a matched word or phrase, or by a defined script.
Extract and Parse Text from PDF and Word Documents
Use C1TextParser along with other ComponentOne components to extract text from more file types. Use C1Word or C1PdfDocumentSource to extract the text from Microsoft Word and PDF files, then pass that text to C1TextParser for parsing.
The Starts-After-Continues-Until extractor is the simplest and the easiest to use.
- This extractor is designed to extract relevant text from a plain-text source.
- To use it, you must define two parameters: where the text starts and where it ends (or continues until).
- Essentially, it extracts all the text contained between the occurrences of two regular expressions.
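The idea behind this extractor can be illustrated with a minimal Python sketch built on the standard `re` module. It is not the C1TextParser API: the function name `extract_between`, its parameters, and the sample text are assumptions made up for illustration. It collects every span that follows a match of one regular expression and precedes the next match of another.

```python
import re


def extract_between(text, starts_after, continues_until):
    """Return each span found after a `starts_after` match and
    before the next `continues_until` match (hypothetical helper)."""
    start_re = re.compile(starts_after)
    end_re = re.compile(continues_until)
    results = []
    pos = 0
    while True:
        start = start_re.search(text, pos)
        if start is None:
            break
        end = end_re.search(text, start.end())
        if end is None:
            break
        # Keep only the text strictly between the two matches.
        results.append(text[start.end():end.start()].strip())
        # Always move forward, even if both patterns match empty strings.
        pos = max(end.end(), pos + 1)
    return results


# Made-up sample input for illustration.
sample = "Order ID: 48213 | Total: $19.99\nOrder ID: 48214 | Total: $5.00\n"
order_ids = extract_between(sample, r"Order ID:", r"\|")
```

Here `order_ids` holds the two order numbers as strings. Text outside the two markers, such as the totals, is left out of the result.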
The HTML Extractor is designed to help automate the process of extracting specific data from emails and other HTML-structured files. Automated emails, such as travel itineraries, tickets, and e-commerce receipts, typically follow a repeated structure that C1TextParser can parse even if not every email follows exactly the same HTML structure. The HTML Extractor is similar to the Template-based extractor; however, it is specialized for complex HTML documents and tolerates unexpected characters within the markup.
The Template-based extractor is the most generic, as it allows users to parse data structures that follow a declarative XML template. Because the template can be provided as a separate file, users can supply the template and the source to parse independently. The plain-text source can contain many instances of the defined structure. All text that does not match the template specification is simply ignored.
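The behavior described here (a structure defined once, matched many times, with everything else ignored) can be sketched in Python. This is an illustration under stated assumptions, not the library's method: a regular expression with named groups stands in for the declarative XML template, and the `Reservation` class, the `TEMPLATE` pattern, `parse_all`, and the sample text are all hypothetical. The sketch also shows the two output shapes mentioned earlier, custom objects and JSON.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class Reservation:
    """Hypothetical custom class that receives the extracted fields."""
    passenger: str
    flight: str


# Stand-in for a template: each named group is one field of the structure.
TEMPLATE = re.compile(
    r"Passenger:\s*(?P<passenger>.+?)\s*\n"
    r"Flight:\s*(?P<flight>[A-Z]{2}\d+)"
)


def parse_all(text):
    """Build one Reservation per match; unmatched text is skipped."""
    return [Reservation(**m.groupdict()) for m in TEMPLATE.finditer(text)]


# Made-up source containing two instances of the structure plus noise.
source = (
    "Thank you for booking with us.\n"
    "Passenger: Ana Ruiz\n"
    "Flight: QF128\n"
    "-- advertisement --\n"
    "Passenger: Li Wei\n"
    "Flight: BA7\n"
)

records = parse_all(source)
as_json = json.dumps([asdict(r) for r in records])
```

The greeting and advertisement lines do not match the pattern, so they never reach `records`. Keeping the pattern separate from the parsing function mirrors the idea of supplying the template and the source independently.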
Extract Important Information from Emails
Emails are a very common source of data for certain segments of a company (such as sales and marketing), and data extraction is often done manually. A parser can be useful any time you receive emails that share a similar, repeated structure. C1TextParser enables you to easily extract, store, and track this repeated type of data from emails. Once extracted, the data can be stored (to build a table of relevant records) or passed on to another destination. Examples of emails that can be easily parsed include:
- Invoices and order forms
- Leads from an email submission form
- Customer support and requests
- Ticket and travel reservation confirmations
Process Large Numbers of Resumes for Digital Analysis
Resumes are often formatted in a predictable manner, which makes it easy for a machine to read them and parse out important information. If a company has to deal with hundreds of resumes that would take too much time for humans to process, a text parsing service that analyzes the resumes first can help narrow the field or provide quick stats on the candidate pool by parsing out key requirements.