Updated 2022-05-21

I'd like to match every block of text between <PDF> and </PDF> inside a string:

import re

lines = """
hello
<PDF>
bla1
</PDF>
test
<PDF>
bla2
</PDF>
"""

matches = re.findall(r"<PDF>.*</PDF>", lines, re.DOTALL)
print(matches)

Output:

['<PDF>\nbla1\n</PDF>\ntest\n<PDF>\nbla2\n</PDF>']

Expected Output:

['<PDF>\nbla1\n</PDF>', '<PDF>\nbla2\n</PDF>']

What's going wrong here? How can I ensure that no text between </PDF> and <PDF> gets matched?

🟢 Solution

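* is greedy, so it tries to match as much as possible. With re.DOTALL, . also matches newlines, so .* runs from the first <PDF> all the way to the last </PDF> and swallows everything in between.

Use the non-greedy *? instead. From the Python documentation for the re module: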

Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

matches = re.findall(r"<PDF>.*?</PDF>", lines, re.DOTALL)
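To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names greedy, lazy and inner are only illustrative, and the capture-group variant at the end is an optional extra for when only the text inside the tags is wanted.

```python
import re

lines = """
hello
<PDF>
bla1
</PDF>
test
<PDF>
bla2
</PDF>
"""

# Greedy: .* extends to the LAST </PDF>, giving one big match.
greedy = re.findall(r"<PDF>.*</PDF>", lines, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </PDF> after each <PDF>.
lazy = re.findall(r"<PDF>.*?</PDF>", lines, re.DOTALL)

# With a capture group, findall returns only the captured text;
# strip() drops the newlines around each block.
inner = [m.strip() for m in re.findall(r"<PDF>(.*?)</PDF>", lines, re.DOTALL)]

print(len(greedy))  # 1
print(lazy)         # ['<PDF>\nbla1\n</PDF>', '<PDF>\nbla2\n</PDF>']
print(inner)        # ['bla1', 'bla2']
```

Note that a regex like this assumes the <PDF> blocks are not nested; nested tags would need a real parser.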