
I am pulling data from a printable PDF using iTextSharp. This is the text that I have extracted:

Borrower: Guarantor:
{{0_SH}} By: {{1_SH}} (seal)
By: (seal)
Print Name:
Print Name:
Phillip Moore Phillip Moore
Date: {{1_DH}}
2/23/2022
Title: Owner
Date: {{0_DH}}
2/23/2022
12 of 12 (LOC 2020) Borrower Initials {{0_IH}}

And I have written this regex routine:

string pattern = @"Print\sName:\s(?'guarantor1'[a-zA-Z|\s|-|-|'|,|.|&|\d]+)\n";
Regex rgx = new Regex(pattern, RegexOptions.Singleline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    return guarantor1.Trim();
}

But the value captured for guarantor1 is Phillip Moore Phillip Moore. I need just the first part, Phillip Moore. Any ideas on how to parse this correctly? There could also be a middle name or initial.

Craig
  • It seems the name is printed twice because the Borrower Name = Guarantor Name. Have you tried reading the data from the fields? Do you have a sample PDF? The following may (or may not) be helpful: https://stackoverflow.com/questions/69353784/itext7-pdf-to-blob-c-sharp/69364767#69364767 and https://stackoverflow.com/questions/68941615/is-itext7-available-in-vb-net-or-only-c-sharp/68946796#68946796 . They're for iText7, but I believe that iTextSharp has something similar. – Tu deschizi eu inchid Mar 22 '22 at 20:11

1 Answer


You could match the last occurrence of Print Name: (the one not directly followed by another Print Name) and then match as few of the allowed characters as possible, stopping as soon as a lookahead with a backreference finds a space and the same text again at the end of the line.

Note that \s can also match a newline.

\bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1$)

See a regex demo and a C# demo.

If the name should also match when it is not doubled, the whitespace and the backreference to group 1 can be made optional.

\bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?$)

See another Regex demo.

Example code

string pattern = @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1\r?$)";
Regex rgx = new Regex(pattern, RegexOptions.Multiline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    Console.WriteLine(guarantor1.Trim());
}

Output

Phillip Moore
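
As a rough sketch, the optional-backreference variant can be exercised the same way against both a doubled and a single name. The class name GuarantorNameDemo, the helper ExtractGuarantor, and the two shortened sample strings are made up for illustration and are not from the original PDF text; \r? is added before $ on the assumption that the extracted text might use Windows line endings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuarantorNameDemo
{
    // Optional-backreference variant of the pattern. The \r? parts are an
    // assumption so that \r\n line endings are tolerated as well as \n.
    private static readonly Regex NamePattern = new Regex(
        @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?\r?$)",
        RegexOptions.Multiline);

    // Hypothetical helper: returns the captured name, or null when nothing matches.
    public static string ExtractGuarantor(string fullText)
    {
        Match match = NamePattern.Match(fullText);
        return match.Success ? match.Groups["guarantor1"].Value.Trim() : null;
    }

    public static void Main()
    {
        // Shortened, illustrative inputs (not the full extracted page).
        string doubled = "Print Name:\nPrint Name:\nPhillip Moore Phillip Moore\nDate: {{1_DH}}";
        string single = "Print Name:\nPhillip A. Moore\nDate: {{1_DH}}";

        Console.WriteLine(ExtractGuarantor(doubled)); // Phillip Moore
        Console.WriteLine(ExtractGuarantor(single));  // Phillip A. Moore
    }
}
```

The lazy quantifier stops at the first position where the lookahead succeeds, so the doubled name is cut after its first copy, while a single name runs to the end of its line.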
The fourth bird