Like this:
In [1]: s='7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO 2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 ZŁ 12. NO PRub. 1 1. 1000 XX 2. NO 4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO 0 1.'
In [2]: import re
In [3]: re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
Out[3]:
['1 1. STR1 STR2 3. 12345 4. 0876 9. NO',
'2 1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
'3 1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO',
'4 1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO']
To get the exact output you asked for, I would do something like:
In [4]: ns = re.findall(r'(?<=\s)\d.*?(?=\s\d\s\d[.](?=$|\s[A-Z]))',s)
In [5]: [tuple(f.split(' ',1)) for f in ns]
Out[5]:
[('1', '1. STR1 STR2 3. 12345 4. 0876 9. NO'),
('2', '1. STR 2. STRT STR 3. 9909090 5. YES 6. NO 7. YES 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
('3', '1. STRT 2. STRT 3. 63110300291 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 Z\xc5\x81 12. NO PRub. 1 1. 1000 XX 2. NO'),
('4', '1. QWERET 2. IOSTR9 3. 76012509879 5. YES 6. NO 7. NO 8. NO 9. YES 10. 5000 XX 11. 1000 XX 12. NO PRub. 1 1. 1000 XX 2. NO')]
This may not be the best way to do this; my Python-fu is not as good as my regexp-fu.
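One way to skip the split step, sketched here for Python 3 and checked only against a shortened sample: put capturing groups around the record number and the rest of the record, so that `re.findall` returns the tuples directly. The shortened string `s` and the name `RECORD` are made up for illustration, and the pattern still assumes single-digit record numbers, as in the sample.

```python
import re

# Shortened sample in the same layout as the string above (illustrative only).
s = ('7. Data 1 1. STR1 STR2 3. 12345 4. 0876 9. NO '
     '2 1. STR 2. STRT STR 3. 9909090 12. NO PRub. 1 1. 1000 XX 2. NO '
     '0 1.')

# Same pattern as above, with two capturing groups added:
# group 1 is the record number, group 2 is everything up to the next record.
RECORD = re.compile(r'(?<=\s)(\d)\s(.*?)(?=\s\d\s\d[.](?=$|\s[A-Z]))')

# With more than one group in the pattern, findall returns a list of tuples.
records = RECORD.findall(s)
for number, fields in records:
    print(number, '->', fields)
```

Because the groups do the splitting, there is no second pass over the matches; the trade-off is a slightly busier pattern.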
Regexplanation:
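(?<=\s)      preceded by whitespace (lookbehind, not part of the match)
\d           the record number, a single digit
.*?          as few characters as possible (lazy), up to...
(?=\s\d\s\d[.](?=$|\s[A-Z]))      ...the start of the next record, which breaks down as:
(?=          lookahead: what follows must match, but is not consumed
  \s         whitespace
  \d         the next record number
  \s         whitespace
  \d         the first field number of that record
  [.]        a literal dot
  (?=        nested lookahead: right after the dot comes either
    $        the end of the string
    |        or
    \s[A-Z]  whitespace and an uppercase letter (so "1 1. 1000" after "PRub." does not start a new record)
  )
)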
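The same breakdown can live inside the code by using `re.VERBOSE`, which lets the pattern span several lines with comments. This is only a sketch for Python 3; the short test string and the names `RECORD` and `matches` are invented for the example.

```python
import re

# The pattern from above, spread out with re.VERBOSE so each piece can
# carry its own comment. Literal whitespace inside the pattern is ignored
# in this mode, which is fine here because the pattern only uses \s.
RECORD = re.compile(r'''
    (?<=\s)          # must be preceded by whitespace
    \d               # record number (a single digit)
    .*?              # lazily take as little as possible
    (?=              # stop where the next record starts:
        \s\d         #   whitespace + next record number
        \s\d[.]      #   whitespace + first field number + dot
        (?=          #   and right after the dot comes either
            $        #     the end of the string
            |        #     or
            \s[A-Z]  #     whitespace + an uppercase letter
        )
    )
''', re.VERBOSE)

# Made-up miniature input with the same layout as the real string.
s = '7. Data 1 1. AB 2. CD PRub. 1 1. 1000 XX 2 1. EF 2. NO 0 1.'
matches = RECORD.findall(s)
print(matches)
```

Note that the " 1 1." after "PRub." stays inside the first record, because it is followed by a digit rather than an uppercase letter.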