Beautiful Soup and regular expressions

I use Beautiful Soup to select a specific tag and its contents. The contents are HTML links, and I want to extract the text of these links.

The problem is that the text consists of different numbers that follow a specific pattern. I am only interested in certain numbers, for example "61993J0417" and "61991CJ0316", and I need the regular expression to match when the number has "J" or "CJ" in the middle.

I used this code for this:

soup.find_all(text=re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}'))

The soup variable holds the contents of a particular tag. This code works in 9 out of 10 cases. However, when I run the script on one of my source files, it also matches numbers such as "51987PC0716".

I do not understand why, so I am turning to you for help.

2 answers

You have not indicated what the | applies to; by default it applies to the whole regex, which means you asked for either

[6][1-2][0-9]{3}[J]

(this is the same as 6[12][0-9]{3}J) or

[CJ][0-9]{4}

(note that [CJ] means "either C or J", which is why the C0716 in 51987PC0716 matches). Use parentheses to indicate what the alternatives are:

^6[12][0-9]{3}(J|CJ)[0-9]{4}$

which is better written as

^6[12][0-9]{3}C?J[0-9]{4}$
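
To see the difference concretely, here is a small sketch that uses only Python's re module. The three sample strings are the ones from the question; the variable names are made up for illustration.

```python
import re

# Pattern from the question: the | splits the whole regex into two alternatives,
# "6[12][0-9]{3}J" or "[CJ][0-9]{4}".
original = re.compile(r'[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}')

# Corrected pattern: optional C, mandatory J, between the two digit groups.
fixed = re.compile(r'^6[12][0-9]{3}C?J[0-9]{4}$')

samples = ['61993J0417', '61991CJ0316', '51987PC0716']

# search() looks for a match anywhere in the string.
original_hits = [s for s in samples if original.search(s)]
fixed_hits = [s for s in samples if fixed.search(s)]

print(original_hits)  # all three, because "C0716" satisfies [CJ][0-9]{4}
print(fixed_hits)     # only the two wanted numbers
```

The unwanted match comes entirely from the second alternative, so no change to the first half of the original pattern would have fixed it.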

IIUC, you always have a "J" inside your string. So make it mandatory and make the "C" optional using a question mark. Something like:

re.compile('6[1-2][0-9]{3}C?J[0-9]{4}')

I have not tested this, but it should get you started.
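
One practical difference between the two answers is the ^ and $ anchors. As far as I know, Beautiful Soup applies a compiled pattern with search(), so the anchored version should only match when the link text is exactly the number, while the unanchored version also matches when the number sits inside longer text. A rough sketch with plain re; the sample strings with extra text around the number are invented for illustration:

```python
import re

anchored = re.compile(r'^6[12][0-9]{3}C?J[0-9]{4}$')
unanchored = re.compile(r'6[1-2][0-9]{3}C?J[0-9]{4}')

# Hypothetical link texts: bare number, padded with spaces, embedded, unwanted.
texts = ['61993J0417', ' 61991CJ0316 ', 'Case 61993J0417 (judgment)', '51987PC0716']

anchored_hits = [t for t in texts if anchored.search(t)]
unanchored_hits = [t for t in texts if unanchored.search(t)]

# Pull out just the number from each text that contains one.
numbers = []
for t in texts:
    m = unanchored.search(t)
    if m:
        numbers.append(m.group(0))

print(anchored_hits)
print(unanchored_hits)
print(numbers)
```

If the link text may carry surrounding whitespace, either strip it before matching or use the unanchored pattern and take the matched group.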
