Why does crawling a website take forever?

Question

public class Parser {

    public static void main(String[] args) {
        Parser p = new Parser();
        p.matchString();
    }

    parserObject courseObject = new parserObject();
    ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
    ArrayList<String> courseNames = new ArrayList<String>();
    String theWebPage = " ";

    {
        try {
            URL theUrl = new URL("http://ocw.mit.edu/courses/");
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(theUrl.openStream()));
            String str = null;

            while((str = reader.readLine()) != null) {
                theWebPage = theWebPage + " " + str;
            }
            reader.close();

        } catch (MalformedURLException e) {
            // do nothing
        } catch (IOException e) {
            // do nothing
        }
    }

    public void matchString() {
        // this is my regex that I am using to compare strings on input page
        String matchRegex = "#\\w+(-\\w+)+";

        Pattern p = Pattern.compile(matchRegex);
        Matcher m = p.matcher(theWebPage);

        int i = 0;
        while (!m.hitEnd()) {
            try {
                System.out.println(m.group());
                courseNames.add(i, m.group());
                i++;
            } catch (IllegalStateException e) {
                // do nothing
            }
        }
    }
}

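I am writing a crawler for MIT OpenCourseWare. The program reads the courses page through a BufferedReader and then uses Pattern and Matcher to pull out the course names, but it seems to run forever.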

+5

java regex web-crawler

anonuser0428 Aug 11 '12 at 1:54

3

What is parserObject?

Also, what is this code doing between main() and matchString()?

parserObject courseObject = new parserObject();
ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
ArrayList<String> courseNames = new ArrayList<String>();
String theWebPage = " ";

{
    try {
        URL theUrl = new URL("http://ocw.mit.edu/courses/");
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(theUrl.openStream()));
        String str = null;

        while ((str = reader.readLine()) != null) {
            theWebPage = theWebPage + " " + str;
        }
        reader.close();

    } catch (MalformedURLException e) {

    } catch (IOException e) {

    }
}

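That block is in neither main() nor matchString(). It is an instance initializer, so it runs whenever a Parser object is constructed.

It would be a static initializer if it were preceded by static {. I would move this code into main, or into a method called from main, and deal with MalformedURLException and IOException there instead of ignoring them.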

+2

HeatfanJohn Aug 11 '12 at 2:05

You can, of course, solve this with the limited JDK 1.0 API and run into the problem that Stuart Marks helped you solve in his excellent answer.

Or you can just use a popular, de facto standard library such as Apache Commons IO and read the page into a String without any trouble:

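// using this...
import org.apache.commons.io.IOUtils;
import java.nio.charset.StandardCharsets;

// run this...
try (InputStream is = new URL("http://ocw.mit.edu/courses/").openStream()) {
    // Pass the charset explicitly; the one-argument overload is deprecated.
    // UTF-8 is an assumption about the page's encoding.
    theWebPage = IOUtils.toString(is, StandardCharsets.UTF_8);
}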

+1

Lukas Eder Feb 18 '15 at 10:04


Stuart Marks · Accepted Answer · 2012-08-11T03:54:26+0000

while ((str = reader.readLine()) != null)
    theWebPage = theWebPage + " " +str;

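theWebPage is a String, and Strings are immutable. Each pass through this loop therefore creates a new String holding a copy of everything read so far, plus a space and the line just read. That is an enormous amount of unnecessary copying, and it is why the program is so slow.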

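The page has about 55,000 lines and is about 3.25 MB in size. Copying everything read so far for every line adds up to roughly 1.5 billion line copies (1/2 of 55,000 squared). On my machine (2.6 GHz Core2Duo, 1 ...) it took 15 ...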
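As a quick check of the arithmetic behind the "1/2 of 55,000 squared" figure, here is a minimal sketch. It assumes the cost is counted in whole lines, so that reading line k builds a new String of k lines. The class and method names are made up for illustration.

```java
class CopyCost {
    // Total lines written into new Strings when reading line k
    // forces a fresh String that holds all k lines read so far.
    static long totalLineCopies(long lines) {
        return lines * (lines + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(totalLineCopies(55_000)); // prints 1512527500
    }
}
```

The result grows with the square of the line count, so doubling the page size roughly quadruples the work.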

To fix this, make theWebPage a StringBuilder and change the line inside the loop to:

    theWebPage.append(" ").append(str);

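After the loop you can convert theWebPage to a String with toString() if you need one. With this change, each line is appended once, so the cost grows linearly with the size of the page rather than quadratically.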
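For reference, a complete version of the StringBuilder loop might look like the sketch below. The class name PageReader and the method readAll are invented for this example, and a StringReader stands in for the network stream so it runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class PageReader {
    // Reads every line from the source and joins the lines with single spaces,
    // appending to one StringBuilder instead of rebuilding a String per line.
    static String readAll(Reader source) {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(' ').append(line);
            }
        } catch (IOException e) {
            // Surface the failure instead of silently ignoring it.
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    public static void main(String[] args) {
        // Stand-in input; in the crawler this would wrap theUrl.openStream().
        System.out.println(readAll(new StringReader("first\nsecond\nthird")));
    }
}
```

In the crawler, the same method could be handed new InputStreamReader(theUrl.openStream()) in place of the StringReader.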

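Also, your code has a block in { } that sits inside the class but outside any method. (This is an instance initializer.) It runs when the object is constructed. It is legal, but it is unusual enough to confuse readers. I would turn it into a named method.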
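One possible way to turn the initializer block into a named method is sketched below. The names CourseParser, load and matchCourseNames are invented for this example, and UTF-8 is an assumed page encoding. The matching loop also calls find() before group(), which goes beyond what is discussed above: Matcher.group() throws IllegalStateException until a match has been attempted, so a loop that never calls find() cannot make progress.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CourseParser {
    private final String page;

    CourseParser(String page) {
        this.page = page;
    }

    // Named replacement for the bare { } block: loading happens only when asked for.
    static CourseParser load(URL url) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(' ').append(line);
            }
        } catch (IOException e) {
            // Report the failure rather than swallowing it.
            throw new UncheckedIOException(e);
        }
        return new CourseParser(sb.toString());
    }

    // Collects every match of the question's regex; find() advances the matcher.
    List<String> matchCourseNames() {
        Matcher m = Pattern.compile("#\\w+(-\\w+)+").matcher(page);
        List<String> names = new ArrayList<>();
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        CourseParser parser = load(new URL("http://ocw.mit.edu/courses/"));
        parser.matchCourseNames().forEach(System.out::println);
    }
}
```

Keeping the page text as a constructor argument also means the matching code can be tried on a short literal string without touching the network.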
