You say you have already extracted the <img> tag and are working with it as a standalone string. That makes the job easier, but there are still plenty of complications to deal with. For example, how would you handle this tag?
<img foosrc="whatever" barclass=noclass src =
folder/img.jpg class ='ho hum' ></img>
Here you have:
- more than one space following the tag name
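- attribute names that merely end in src and class (foosrc, barclass)
- whitespace, including a line break, around the = signs
- attribute values that are double-quoted, single-quoted, or not quoted at all
- no final /, because the tag is written HTML-style rather than XML-style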
A regex that allows for each of these variations can still pull out the src and class values. Here is a demonstration:
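import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttributeDemo
{
  public static void main(String[] args)
  {
    String[] tags = { "<img src = \"the source\" class=class01 />",
                      "<img class=class02 src=folder/img02.jpg />",
                      "<img class= \"class03\" / >",
                      "<img foosrc=\"whatever\" barclass=noclass" +
                      " class='class04' src =\nfolder/img04.jpg></img>" };

    // group 1 = attribute name; groups 2, 3, 4 = double-quoted,
    // single-quoted, or unquoted value (only one of them matches)
    String regex =
        "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))";
    Pattern p = Pattern.compile(regex);

    int n = 1;
    for (String tag : tags)
    {
      System.out.printf("%ntag %d: %s%n", n++, tag);
      Matcher m = p.matcher(tag);
      while (m.find())
      {
        // start(k) is -1 when group k did not take part in the match
        System.out.printf("%8s: %s%n", m.group(1),
            m.start(2) != -1 ? m.group(2) :
            m.start(3) != -1 ? m.group(3) :
            m.group(4));
      }
    }
  }
}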
Output:
tag 1: <img src = "the source" class=class01 />
     src: the source
   class: class01

tag 2: <img class=class02 src=folder/img02.jpg />
   class: class02
     src: folder/img02.jpg

tag 3: <img class= "class03" / >
   class: class03

tag 4: <img foosrc="whatever" barclass=noclass class='class04' src =
folder/img04.jpg></img>
   class: class04
     src: folder/img04.jpg
Here is the regex in a more readable, free-spacing form:
(?ix)
\s+
(src|class)
\s*=\s*
(?:
    "([^"]+)"               # double-quoted (group 2)
  | '([^']+)'               # single-quoted (group 3)
  | (\S+?)(?=\s|/?\s*>)     # or not quoted (group 4)
)
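The free-spacing form can also be used directly in Java rather than only as documentation. The following is a minimal sketch, assuming the tag has already been isolated as in the question; the class name ImgAttributes and the helper extract are hypothetical names chosen for this illustration. It passes Pattern.COMMENTS and Pattern.CASE_INSENSITIVE as flags instead of the inline (?ix), and ends each line of the pattern with \n so that every # comment stops at the end of its own line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class ImgAttributes
{
  // Same pattern as above in free-spacing form. In COMMENTS mode
  // whitespace is ignored and # starts a comment that runs to end of line.
  private static final Pattern ATTR = Pattern.compile(
        "\\s+                        # whitespace before the attribute name\n"
      + "(src|class)                 # attribute name (group 1)\n"
      + "\\s*=\\s*                   # equals sign with optional whitespace\n"
      + "(?:\n"
      + "    \"([^\"]+)\"            # double-quoted (group 2)\n"
      + "  | '([^']+)'               # single-quoted (group 3)\n"
      + "  | (\\S+?)(?=\\s|/?\\s*>)  # or not quoted (group 4)\n"
      + ")",
      Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

  // Returns the src and class values of one tag, in order of appearance.
  // If an attribute appears twice, the later value replaces the earlier one.
  public static Map<String, String> extract(String tag)
  {
    Map<String, String> attrs = new LinkedHashMap<>();
    Matcher m = ATTR.matcher(tag);
    while (m.find())
    {
      // Only one of groups 2-4 takes part in any given match.
      String value = m.group(2) != null ? m.group(2)
                   : m.group(3) != null ? m.group(3)
                   : m.group(4);
      attrs.put(m.group(1).toLowerCase(), value);
    }
    return attrs;
  }

  public static void main(String[] args)
  {
    String tag = "<img foosrc=\"whatever\" barclass=noclass src =\n"
               + "folder/img.jpg class ='ho hum' ></img>";
    System.out.println(extract(tag));
  }
}
```

Run against the awkward tag from the top of this answer, main prints {src=folder/img.jpg, class=ho hum}. Like the original regex, this sketch does not guard against src= or class= appearing inside another attribute's quoted value.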