Am I only checking URL links with this Java code?

I have a method that takes a URL and finds all the links on that page. However, I am worried that it picks up more than just links: when I check whether the links work, some of them look odd. For example, if I check the links on www.google.com, I get 6 broken links that don't return an HTTP status code but instead fail with a "no protocol" error. I would not expect Google to have any broken links on its main page. An example of one of the broken links is: /preferences?hl=en. I don't see where this link is on the Google main page. Am I checking only links, or is it possible that I am retrieving something that is not a link?

Here is the method that collects the links from a URL:

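import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static List<String> getLinks(String uriStr) {
    List<String> result = new ArrayList<>();
    try {
        System.out.println("in the getlinks try");
        // Create a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        try (Reader rd = new InputStreamReader(conn.getInputStream())) {
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                AttributeSet s = it.getAttributes();
                String link = (String) s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the href exactly as written in the page (it may be relative)
                    System.out.println(link);
                    result.add(link);
                }
                it.next();
            }
        }
    } catch (Exception e) {
        // Covers URISyntaxException, IOException and BadLocationException
        e.printStackTrace();
    }
    return result;
}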
1 answer

There is nothing wrong with the link you are returning.

Looking at your code, you retrieve the attribute href, which in the case of your example refers to the element:

<a  class=gbmt href="/preferences?hl=en">Search settings</a>

(You can see this link if you click on "Settings" at the bottom right; a list with several links should appear.)

As you can see, the href attribute contains only /preferences?hl=en, which makes it a relative reference. That is also why you get the "no protocol" error: the string has no scheme such as http:// in front of it. The full URL is obtained by resolving the href against the address of the page you are currently on. In this case:

http://www.google.com/preferences?hl=en

You just need to tweak your code so that, when the href is relative, it is resolved against the URL of the page before you check it.
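One way to do that resolution is with java.net.URI.resolve, which handles both relative and already-absolute references. The sketch below is only an illustration, not part of the original code: the class name LinkResolver and the helper toAbsolute are hypothetical, and it assumes you pass in the same page URL you gave to getLinks along with the list it returned.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class LinkResolver {

    // Hypothetical helper: resolves each href against the URL of the page
    // it was found on. Absolute hrefs come back unchanged.
    static List<String> toAbsolute(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            try {
                result.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // The href is not a syntactically valid URI reference; skip it.
                System.err.println("Skipping unparsable href: " + href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("/preferences?hl=en", "http://www.example.com/about");
        for (String link : toAbsolute("http://www.google.com/", hrefs)) {
            System.out.println(link);
        }
    }
}
```

With the page URL http://www.google.com/, the href /preferences?hl=en resolves to http://www.google.com/preferences?hl=en, which has a protocol and can then be checked like any other link.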
