How to get attributes and values from a badly formatted string in Java

I need to get attributes and values from several lines, such as:

<img src = "the source" class=class01 />
<img class=class02 src=folder/img.jpg />
<img class= "class01" / >

Spaces and slashes are allowed in values. Some values are enclosed in quotation marks, but not all, and some equals signs have spaces around them.

I'm new to this, so the code is dirty and probably not flawless.

My attempt:

//remove unnecessary spacing and "<img" and "/>"
str = str.replaceAll("/ >", "/>");
str = str.substring(4, str.length() - 2).trim();
str = str.replaceAll(" =", "=");
str = str.replaceAll("= ", "=");

//remove quotes
str = str.replaceAll("\"", "");

//creating a matcher and compiling the regex pattern is omitted, because I know how to do that using matcher.group();
regexSrc = "src=(.*?)($| class=)";
String srcString = matcherSrc.group(1);

regexClass = "class=(.*?)($| src=)";
String classString = matcherClass.group(1);

System.out.println("the source is: " + srcString);
System.out.println("the class is: " + classString);

Any suggestions on the best way to do this would be appreciated.

4 answers

You say that you have already extracted the <img> tag and are working on it as a separate string. That makes the job easier, but it can still be tricky. For example, how would you handle this tag?

<img  foosrc="whatever" barclass=noclass src =
folder/img.jpg class   ='ho hum' ></img>

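Here you have:

  • more than one space following the tag name
  • attribute names that merely end in src and class (foosrc, barclass)
  • a line break after an =
  • several spaces before an =
  • a value enclosed in single quotes
  • no final /, as in HTML rather than XML.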
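
To make the first two complications concrete, here is a minimal sketch (the class and method names are made up for illustration) that runs a naive pattern and a stricter one against the tag above. It only shows why the attribute name has to be preceded by whitespace and why \s* is needed around the equals sign; it is not a full solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveMatchDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstValue(String regex, String tag) {
        Matcher m = Pattern.compile(regex).matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<img  foosrc=\"whatever\" barclass=noclass src =\n"
                   + "folder/img.jpg class   ='ho hum' ></img>";

        // Naive: any "src=" counts, so the tail of "foosrc" is picked up.
        System.out.println(firstValue("src=(\\S+)", tag));             // "whatever"
        // Stricter: whitespace must precede the name, and any whitespace
        // (including a line break) is allowed around "=".
        System.out.println(firstValue("\\ssrc\\s*=\\s*(\\S+)", tag));  // folder/img.jpg
    }
}
```

The naive pattern returns the value of foosrc, quotes included, while the stricter one finds the real src value even though a line break follows the equals sign.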

Here is a regex-based approach that copes with all of the above:

String[] tags = { "<img src = \"the source\" class=class01 />",
                  "<img class=class02 src=folder/img02.jpg />",
                  "<img class= \"class03\" / >", 
                  "<img  foosrc=\"whatever\" barclass=noclass" +
                  "    class='class04' src =\nfolder/img04.jpg></img>" };

String regex = 
  "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))";
Pattern p = Pattern.compile(regex);
int n = 1;
for (String tag : tags)
{
  System.out.printf("%ntag %d: %s%n", n++, tag);
  Matcher m = p.matcher(tag);
  while (m.find())
  {
    System.out.printf("%8s: %s%n", m.group(1),
        m.start(2) != -1 ? m.group(2) :
        m.start(3) != -1 ? m.group(3) :
        m.group(4));
  }
}

Output:

tag 1: <img src = "the source" class=class01 />
     src: the source
   class: class01

tag 2: <img class=class02 src=folder/img02.jpg />
   class: class02
     src: folder/img02.jpg

tag 3: <img class= "class03" / >
   class: class03

tag 4: <img  foosrc="whatever" barclass=noclass    class='class04' src =
folder/img04.jpg></img>
   class: class04
     src: folder/img04.jpg

The regex, broken down in free-spacing mode:

(?ix)   # ignore-case and free-spacing modes
\s+           # leading \s+ ensures we match the whole name
(src|class)   # the attribute name is stored in group1
\s*=\s*       # \s* = any number of any whitespace
(?:           # the attribute value, which may be...
   "([^"]+)"              # double-quoted (group 2)
 | '([^']+)'              # single-quoted (group 3)
 | (\S+?)(?=\s|/?\s*>)    # or not quoted (group 4)
)
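
If the values are needed for further processing rather than printing, the same regex can be wrapped in a small helper. The sketch below is only one possible arrangement: the class name ImgAttributes and the method parse are hypothetical, and it assumes that lower-casing the attribute name and keeping the last value for a repeated attribute are acceptable.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgAttributes {
    // Same regex as above; group 2, 3 or 4 holds the value depending on quoting.
    private static final Pattern ATTR = Pattern.compile(
        "(?i)\\s+(src|class)\\s*=\\s*(?:\"([^\"]+)\"|'([^']+)'|(\\S+?)(?=\\s|/?\\s*>))");

    // Hypothetical helper: maps the lower-cased attribute name to its value.
    static Map<String, String> parse(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            // group(n) is null when that alternative did not take part in the match
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attrs.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attrs;
    }

    public static void main(String[] args) {
        // prints {src=the source, class=class01}
        System.out.println(parse("<img src = \"the source\" class=class01 />"));
        // prints {class=ho hum, src=folder/img.jpg}
        System.out.println(parse("<img class= 'ho hum' SRC=folder/img.jpg>"));
    }
}
```

A LinkedHashMap keeps the attributes in the order they appear in the tag, which makes the result easy to compare with the output listed earlier.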

Use an HTML parser such as JTidy, which is designed to cope with badly formed HTML.


As Stephen C replied, it may not be safe to use a regex for this, and it could cause you problems.

But here is something that can do what you need, at least for this example:

 ([a-z]+) *= *"?((?:(?! [a-z]+ *=|/? *>|").)+)

See it on Rubular.

You may need to test it against more inputs, and it may need some adjustments.

Here it is in Java code:

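// input holds the tag text, e.g. "<img src = \"the source\" class=class01 />"
Pattern p = Pattern.compile("([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);
Matcher m = p.matcher(input);
while (m.find()) {
    String key = m.group(1);
    // an unquoted value just before "/>" is captured with a trailing space, so trim it
    String value = m.group(2).trim();
    System.out.printf("%s:%s%n", key, value);
}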
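
To follow up on the advice to test against more inputs, here is a small, hedged harness that runs the pattern over the three lines from the question plus one single-quoted case. The class name GenericAttrCheck and the helper parse are invented for this sketch, and trimming the captured value is an added assumption, since an unquoted value just before "/>" is otherwise captured with a trailing space.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenericAttrCheck {
    // The pattern from this answer, unchanged.
    private static final Pattern ATTR = Pattern.compile(
        "([a-z]+) *= *\"?((?:(?! [a-z]+ *=|/? *>|\").)+)", Pattern.DOTALL);

    // Hypothetical helper: collects every name/value pair the pattern finds.
    static Map<String, String> parse(String input) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(input);
        while (m.find()) {
            // trim() drops the space captured before a closing "/>"
            attrs.put(m.group(1), m.group(2).trim());
        }
        return attrs;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "<img src = \"the source\" class=class01 />",
            "<img class=class02 src=folder/img.jpg />",
            "<img class= \"class01\" / >",
            "<img class   ='ho hum' >"   // single quotes are not handled by the pattern
        };
        for (String input : inputs) {
            System.out.println(parse(input));
        }
    }
}
```

The first three inputs give the expected names and values. The last one shows a limitation: the single quotes stay in the value ('ho hum'). Upper-case attribute names are also not matched, because the pattern only accepts [a-z].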