I'm not very familiar with how regex works. I'm working on a project (not fully set up yet; I'm starting with the PDF indexing code, using a test PDF) that analyzes a mark scheme PDF and then does something with the useful data.

The issue is that when I apply my regex to the text extracted from the PDF, it returns nothing. I'm trying to go through each row that begins with 1-2 digits (Question column) followed by A-D (Answer column), using re.compile(r'\d{1} [A-D]') in the following code:

import re
import requests
import pdfplumber
import pandas as pd


def download_file(url):
    local_filename = url.split('/')[-1]
    
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
        
    return local_filename



ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)

with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()


#print(text)

new_vend_re = re.compile(r'\d{1} [A-D]')

for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)

When I run the code, nothing is printed. Printing text itself, however, shows the whole page.

Here is the PDF I'm trying to work with: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf

🔴 No definitive solution yet
📌 Solution 1

re.match only returns a match object when the pattern matches starting at the first character of the string; otherwise it returns None.

Try using re.search instead. You should also make your pattern more flexible as mentioned in the answer by js-on.

new_vend_re = re.compile(r'\d{1}\s+[A-D]')

for line in text.split('\n'):
    if new_vend_re.search(line):
        print(line)

From the Python docs:

Pattern.match(string[, pos[, endpos]])
If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
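To make the difference concrete, here is a minimal sketch that runs both methods over a few lines abridged from the extracted text shown in this thread. The names sample_lines and matched_text are only for illustration and are not part of the original code.

```python
import re

# A few lines abridged from the extracted page text (illustrative sample only).
sample_lines = [
    '9700/12  Cambridge International AS/A Level',
    'Question  Answer  Marks ',
    '1  A  1',
    '10  C  1',
]

pattern = re.compile(r'\d\s+[A-D]')


def matched_text(result):
    """Return the matched substring, or None if there was no match."""
    return result.group(0) if result else None


# match() only tries the pattern at the very start of each line.
match_results = [matched_text(pattern.match(line)) for line in sample_lines]

# search() scans the whole line and returns the first place the pattern fits.
search_results = [matched_text(pattern.search(line)) for line in sample_lines]

print(match_results)   # [None, None, '1  A', None]
print(search_results)  # ['2  C', None, '1  A', '0  C']
```

Note that search() also picks up '2  C' inside the header line '9700/12  Cambridge ...', so if only answer rows are wanted, keeping the match at the start of the line and allowing two digits in the pattern may be the safer choice.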

📌 Solution 2

Your pattern matches exactly one space between the question number and the answer letter, but if you look at the output of text, there is more than one space between them.

'9700/12  Cambridge International AS/A Level – Mark Scheme  March 2019\nPUBLISHED \n \nQuestion  Answer  Marks \n1  A  1\n2  C  1\n3  C  1\n4  A  1\n5  A  1\n6  C  1\n7  A  1\n8  D  1\n9  A  1\n10  C  1\n11  B  1\n12  D  1\n13  B  1\n...

Change your regex to the following to match not only one, but one or more spaces:

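new_vend_re = re.compile(r'\d{1}\s+[A-D]')

See the answer by alexpdev for the difference between new_vend_re.match() and new_vend_re.search(). If you use this pattern in your code, which calls match(), you will get the following output (two-digit question numbers are skipped because the pattern allows only one digit before the whitespace):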

1  A  1
2  C  1
3  C  1
4  A  1
5  A  1
6  C  1
7  A  1
8  D  1
9  A  1

(You can also see here that there are always two spaces instead of one.)
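Since the question mentions question numbers of one or two digits, one possible extension is to allow \d{1,2} and capture the two columns. This is only a sketch under assumptions: sample_text is abridged from the output above, and row_re and parse_answers are hypothetical names, not part of the original code.

```python
import re

# Abridged from the extracted text shown above (illustrative sample only).
sample_text = (
    '9700/12  Cambridge International AS/A Level  March 2019\n'
    'PUBLISHED \n \nQuestion  Answer  Marks \n'
    '1  A  1\n2  C  1\n9  A  1\n10  C  1\n11  B  1\n'
)

# One or two digits, one or more whitespace characters, then a single
# answer letter A-D that ends at a word boundary.
row_re = re.compile(r'(\d{1,2})\s+([A-D])\b')


def parse_answers(text):
    """Collect (question number, answer letter) pairs from the page text."""
    rows = []
    for line in text.split('\n'):
        # match() anchors at the start of the line, so the header line
        # '9700/12  Cambridge ...' is not picked up by accident.
        m = row_re.match(line)
        if m:
            rows.append((int(m.group(1)), m.group(2)))
    return rows


answers = parse_answers(sample_text)
print(answers)  # [(1, 'A'), (2, 'C'), (9, 'A'), (10, 'C'), (11, 'B')]
```

The resulting list of tuples could then be passed to pandas, for example pd.DataFrame(answers, columns=['Question', 'Answer']), if a table is what you are after.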