Updated 2022-03-22 1

I am pulling data from a printable PDF using iTextSharp. This is the text that I have extracted:

Borrower: Guarantor:
{{0_SH}} By: {{1_SH}} (seal)
By: (seal)
Print Name:
Print Name:
Phillip Moore Phillip Moore
Date: {{1_DH}}
2/23/2022
Title: Owner
Date: {{0_DH}}
2/23/2022
12 of 12 (LOC 2020) Borrower Initials {{0_IH}}

And I have written this regex routine:

string pattern = @"Print\sName:\s(?'guarantor1'[a-zA-Z|\s|-|-|'|,|.|&|\d]+)\n";
Regex rgx = new Regex(pattern, RegexOptions.Singleline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    return guarantor1.Trim();
}

But the value the regex extracts for guarantor1 is Phillip Moore Phillip Moore. I need just the first part, Phillip Moore. Any ideas on how to parse this correctly? There could also be a middle name or initial.

🔴 No definitive solution yet
📌 Solution 1

You could match the last occurrence of Print Name: and then match as few of the allowed characters as possible, until a backreference finds the same text repeated once more at the end of the line.

Note that \s can also match a newline.
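
To see that point in isolation, here is a minimal sketch. The two-line sample string is a made-up fragment for illustration, not the full extracted text, and the class name is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class WhitespaceDemo
{
    static void Main()
    {
        // Hypothetical sample: the label and the name sit on separate lines.
        string text = "Print Name:\nPhillip Moore";

        // \s matches the line feed between the label and the name.
        Console.WriteLine(Regex.IsMatch(text, @"Name:\sPhillip")); // True

        // A literal space does not match the line feed.
        Console.WriteLine(Regex.IsMatch(text, @"Name: Phillip"));  // False
    }
}
```

This is why a character class containing \s can run across line breaks when its quantifier is greedy, and why the patterns here use a lazy quantifier and an explicit \n after the label.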

\bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1$)

See a regex demo and a C# demo.

If a name that is not repeated should also match, make the whitespace character and the backreference to group 1 optional.

\bPrint\sName:\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?$)

See another Regex demo.
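
As a rough check of the optional variant, the sketch below runs it against a doubled name, a single name, and a doubled name with a middle initial. The class name, the ExtractName helper, and the sample strings are hypothetical. The \r? parts are an added assumption to tolerate Windows line endings, mirroring the example code.

```csharp
using System;
using System.Text.RegularExpressions;

class GuarantorNameDemo
{
    // Optional-backreference pattern, with \r? added (an assumption) for CRLF input.
    const string Pattern =
        @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?=(?:\s\1)?\r?$)";

    // Hypothetical helper: returns the captured name, or "" when nothing matches.
    public static string ExtractName(string fullText)
    {
        // Multiline makes $ match at the end of each line, not only at the end of the text.
        Match m = Regex.Match(fullText, Pattern, RegexOptions.Multiline);
        return m.Success ? m.Groups["guarantor1"].Value.Trim() : "";
    }

    static void Main()
    {
        // Doubled name, as in the extracted PDF text.
        Console.WriteLine(ExtractName(
            "Print Name:\nPrint Name:\nPhillip Moore Phillip Moore\nDate: {{1_DH}}"));
        // prints "Phillip Moore"

        // Single name: the optional group is skipped.
        Console.WriteLine(ExtractName("Print Name:\nPhillip Moore\nDate: 2/23/2022"));
        // prints "Phillip Moore"

        // Doubled name with a middle initial.
        Console.WriteLine(ExtractName("Print Name:\nPhillip J. Moore Phillip J. Moore\nTitle: Owner"));
        // prints "Phillip J. Moore"
    }
}
```

The lazy quantifier does the work here: the group grows one character at a time and stops at the first point where either the line ends or the same text follows once more before the line ends.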

Example code

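// \r? tolerates Windows line endings; \1 refers back to the guarantor1 group.
string pattern = @"\bPrint\sName:\r?\n(?!Print\sName)(?'guarantor1'[a-zA-Z\s',.&\d\--]+?)(?= \1\r?$)";
// Multiline makes $ match at the end of each line instead of only at the end of the text.
Regex rgx = new Regex(pattern, RegexOptions.Multiline);
MatchCollection matches = rgx.Matches(fullText);
if (matches.Count > 0)
{
    string guarantor1 = matches[0].Groups["guarantor1"].Value;
    Console.WriteLine(guarantor1.Trim());
}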

Output

Phillip Moore