Extract HTML tags using Java

I would like to extract the various HTML tags from the source code of a webpage. Is there a method in Java to do this, or an HTML parser I can use?

I want to highlight all HTML tags.

5 answers

Check out the CyberNeko HTML Parser.


You can use regular expressions. If your HTML is valid XML, you can use an XML parser.
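As a rough illustration of the regular-expression route, here is a minimal sketch. The class name RegexTagExtractor and the pattern are my own assumptions, not something from a library. The pattern assumes no '>' appears inside attribute values, comments, or scripts, so treat it as a quick way to list tags rather than a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library named in this thread.
final class RegexTagExtractor {
    // Matches opening, closing, and self-closing tags such as <div class="a">, </div>, <br/>.
    // Assumption: no '>' inside attribute values, comments, or script bodies.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Returns every tag found in the input, in order of appearance.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<html><body><p class=\"x\">Hi<br/></p></body></html>";
        for (String tag : extractTags(html)) {
            System.out.println(tag);
        }
    }
}
```

Because the pattern requires a letter right after < or </, plain text such as "a < b" is not mistaken for a tag. Comments and doctype declarations are skipped for the same reason.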


Java has an XML parser with a DOM API similar to the one in JavaScript:

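// Needs: javax.xml.parsers.*, org.w3c.dom.Document, org.xml.sax.InputSource, java.io.StringReader
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// Assumes html is a String holding the markup; parse(String) alone would treat it as a URI
Document doc = builder.parse(new InputSource(new StringReader(html)));
doc.getElementById("someId");       // null unless the attribute is declared as type ID (e.g. by a DTD)
doc.getElementsByTagName("div");    // all <div> elements, in document order
doc.getChildNodes();                // direct children of the document node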

(This works only if the HTML is well-formed XML.)

http://download.oracle.com/javase/1.5.0/docs/api/org/w3c/dom/Document.html
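To list every tag with this approach, one option is to ask the DOM for all elements using the "*" wildcard. The sketch below assumes the input is well-formed XHTML held in a String. The class and method names (DomTagLister, listTagNames) are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class DomTagLister {
    // Returns the name of every element in document order.
    // Assumption: the input is well-formed XML; anything else is rejected.
    static List<String> listTagNames(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList elements = doc.getElementsByTagName("*"); // "*" matches all elements
            List<String> names = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                names.add(elements.item(i).getNodeName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(listTagNames("<html><body><p>Hi<br/></p></body></html>"));
    }
}
```

Typical real-world HTML (unclosed tags, unquoted attributes) will make the XML parser throw, which is where a lenient HTML parser becomes necessary.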

The CyberNeko parser is also a good option if you need more.


You can write your own utility method for extracting tags.

Scan for < and the closing /> or > to find each full tag, and write these tags to another file.
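A hand-rolled scanner along those lines might look like the sketch below. The names ManualTagScanner, scanTags, and writeTags are hypothetical. It treats everything from a < to the next > as one tag, so it also picks up comments and doctype declarations, and it is confused by a bare < in text or a > inside an attribute value.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical utility for illustration only.
final class ManualTagScanner {
    // Collects every substring that starts at '<' and ends at the next '>'.
    static List<String> scanTags(String html) {
        List<String> tags = new ArrayList<>();
        int open = html.indexOf('<');
        while (open >= 0) {
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated tag at end of input: stop scanning
            }
            tags.add(html.substring(open, close + 1));
            open = html.indexOf('<', close + 1);
        }
        return tags;
    }

    // Writes one tag per line to the target file (UTF-8 by default).
    static void writeTags(List<String> tags, Path target) throws IOException {
        Files.write(target, tags);
    }

    public static void main(String[] args) throws IOException {
        List<String> tags = scanTags("<p>a</p><br/>");
        Path out = Files.createTempFile("tags", ".txt");
        writeTags(tags, out);
        System.out.println(tags + " written to " + out);
    }
}
```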


I used HTMLParser in one project and was very pleased with it.

Edit: if you check the sample page, the parser sample does pretty much what you ask for.
