Status: adding static to the method, variable queue and synchronization (crawler.class) solved the problem. thanks everyone!
http://pastie.org/8724549#41-42,46,49,100-101,188-189,191
selected method / block synchronized.
This block / method should be accessed by one method at a particular time..
It should be like this: the first thread goes to the method, updates the size, the rest see this size. updated. the update should have been done only by the first thread. not others
- why is it even running. it is controlled by all 11 threads.
- it starts without waiting for the end of the previous thread. "
queue loaded, new size ------------" create / add items
package crawler;
import crawler.Main;
import static crawler.Main.basicDAO;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Crawler implements Runnable, InterfaceForConstants {
public static final String patternString = "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})";
public ConcurrentLinkedQueue<Link> queue = new ConcurrentLinkedQueue<Link>();
private volatile String url;
private void crawl(String url) {
synchronized (Crawler.class){
System.out.println("queue size "+queue.size());
if(queue.size() < (totalSizeOfEmailBucket / 3)){
updateQueue();
}
System.out.println("This is inside of sync block. ----------- queue size "+queue.size());
}
System.out.println("This is at the end of sync block. ----------- queue size "+queue.size());
BufferedReader bf = null;
try {
url = queue.poll().getLink();
URL target = new URL(url);
bf = new BufferedReader(
new InputStreamReader(target.openStream())
);
StringBuilder html = new StringBuilder();
String inputLine;
while ((inputLine = bf.readLine()) != null) {
html.append(inputLine);
}
List emailList = new ArrayList( getEmailList(html.toString()) );
System.out.println("Just worked on --------- "+ url);
Main.processedLinksCount++;
if(emailList.size()>0){
putEmailsInDB(emailList);
}
} catch (IOException ex) {
Logging.logError(ex.toString());
basicDAO.deleteLink(url);
} catch (Exception ex) {
Logging.logError(ex.toString());
basicDAO.deleteLink(url);
}finally{
if(bf !=null){
try {
bf.close();
} catch (IOException ex) {
Logging.logError(ex.toString());
}
}
crawl(null);
}
}
public synchronized void updateQueue() {
Queue<Link> tempQueue = new PriorityQueue<Link>();
tempQueue = getNonProcessedLinkFromDB() ;
queue.addAll(tempQueue);
BasicDAO.markLinkAsProcesed(tempQueue);
System.out.println("queue loaded, new size ------------------------------------ "+queue.size());
}
private List getLinkList(String html, String url) {
Document doc = Jsoup.parse(html);
Elements bodies = doc.select("body");
List linkList = new ArrayList();
for(Element body : bodies ){
Elements aTags = body.getElementsByTag("a");
for (Element a: aTags){
String link = a.attr("href");
if ( !(link.startsWith("#"))
&&
!(link.contains("()"))
&&
!(link.endsWith(".jpg"))
&&
!(link.endsWith(".jpeg"))
&&
!(link.endsWith(".png"))
&&
!(link.endsWith(".gif")) ){
if( link.startsWith("/") ){
link = url+link;
}
linkList.add(link);
}
}
}
return linkList;
}
private List getEmailList(String html) {
Pattern p = Pattern.compile(patternString);
Matcher m = p.matcher(html);
List emailList = new ArrayList();
while(m.find()){
emailList.add(m.group());
Main.nonUniqueEmailsCount++;
}
return emailList;
}
private Queue<Link> getNonProcessedLinkFromDB() {
return ( basicDAO.getNonProcessedLink() );
}
private void putEmailsInDB(List emailList) {
basicDAO.insertEmail(emailList);
}
private void putLinksInDB(List linkList) {
basicDAO.insertLinks(linkList);
}
@Override
public void run() {
if(url != null){
crawl(url);
}else{
}
}
public Crawler(String url){
this.url = url;
}
public Crawler(){
this.url = null;
}
}
way to start threads: not optimistic. I know. there is no executor service or pool, but the following code is valid:
for (int i = 0; i < 11; i++) {
try {
Thread thread = new Thread(new Crawler("https://www.google.com.pk/?gws_rd=cr&ei=-q8vUqqNDIny4QTLlYCwAQ#q=pakistan"));
System.out.println("resume with saved link true");
thread.start();
System.out.println("thread stared");
threadList.add(thread);
System.out.println("thread added to arraylist");
} catch (Exception ex) {
new Logging().logError(ex.toString());
}
}
debugs:
for 11 threads , its says in logs:
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
queue size 0
This is at the end of sync block.
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 0
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
Just worked on
queue size 999
Just worked on
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
queue size 999
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
Just worked on
queue size 999
queue loaded, new size
This is inside of sync block.
This is at the end of sync block.
8692 characters / 254 lines
Advertising from Carbon:
Advertisement Braintree: 2.9% and 30¢ per transaction. No minimums, no monthly fees.