How to create a regular expression for this text

Here's the input:

7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 100012. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 100012. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.

And here is the expected result:

[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 100012. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 100012. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]

I tried this:

re.findall(r'(?=\s(\d+)\s(1\..*?)\s\d+\s1\.)', txt, re.DOTALL)

But, of course, this is not the right decision - the regular expression should match (\d+) 1., but not PRub. 1 1..
What to do to make it work?

+5
source share
1 answer

Like this:

In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'

In [2]: import re

In [3]: re.findall('(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]: 
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
 '2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
 '3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
 '4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']

For the exact conclusion, I would do something like:

In [4]: ns = re.findall('(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)

In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]: 
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
 ('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
 ('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
 ('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]

This may be the best way to do this, but my python foo is not as good as my regexp foo.

Regexplanation:

(?<=\s) # Use positive look-behind to match a leading space but don't include it
\d      # match digit    
.*?     # Match everything up till the next record (lazy)
        # The following positive look-behinds is the key. It matches the start of
        # each new record i.e
        # 2 1. S
        # 3 1. S
        # 4 1. Q
        # 0 1.$ 
        # look-arounds match but don't seek past.  
(?=\s\d\s\d[.](?=$|\s[A-Z]))
(?=     # positive look-ahead 1
\s      # space
\d      # digit
\s      # space
\d      # digit
[.]     # period
(?=     # postive look-ahead 2 
$       # end of string
|       # OR
\s[A-Z] # space followed by uppercase letter
)       # close look-ahead 1
)       # close look-ahead 2
+4
source

All Articles