Crawling with Java

January 2009: Back in 2008 I had the opportunity to develop a crawler in Java. Even though it is very basic, I think you may find it easier to understand than much of the code you can find around. I hope you find it useful.

I developed the crawler system entirely by myself. The crawler can fetch webpages following two different schemes: depth-first or breadth-first. The depth-first scheme tries to follow each chain of links to its conclusion before another path is tried. The breadth-first scheme, in contrast, tries to finish with all the links in one page before processing the next page in the queue.
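Just to illustrate the idea (this is a minimal sketch, not the downloadable code), the only real difference between the two schemes is the order in which pending links leave the list of links waiting to be fetched:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: the frontier is the list of links still waiting to be fetched.
public class FrontierDemo {
    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.addLast("page-1.html");
        frontier.addLast("page-2.html");
        frontier.addLast("page-3.html");

        boolean breadthFirst = true; // the "b" scheme; set to false for "d"

        while (!frontier.isEmpty()) {
            // Breadth-first takes the oldest pending link (queue behaviour),
            // depth-first takes the newest one (stack behaviour).
            String next = breadthFirst ? frontier.pollFirst() : frontier.pollLast();
            System.out.println("fetching " + next);
        }
    }
}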

I personally prefer the breadth-first scheme because it requires fewer resources: you only need to open each web page once and you don't need to keep web pages in memory. I also found it easier to implement than the depth-first scheme. The first version of this crawler implemented only that scheme.

During the crawling process, links are separated according to whether they are internal (from the same host) or external (outside the initial host). The crawler processes all the internal links first and then the external links.
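As a rough sketch of that separation (the class and method names here are only illustrative, not the ones used in the downloadable files), a link can be classified by comparing its host with the host of the starting page:

import java.net.URL;

// Hedged sketch of the internal/external split described above.
public class HostCheck {
    public static boolean isInternal(URL start, URL candidate) {
        // A link is "internal" when it points to the same host as the start page.
        return start.getHost().equalsIgnoreCase(candidate.getHost());
    }

    public static void main(String[] args) throws Exception {
        URL start = new URL("http://www.alepe.com/index.php");
        System.out.println(isInternal(start, new URL("http://www.alepe.com/about.html"))); // true
        System.out.println(isInternal(start, new URL("http://www.example.com/page.html"))); // false
    }
}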

Only *.html and *.htm links are followed, for simplicity. You can change the regular expression to include more extensions or to exclude only certain ones instead.
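For example, a filter along these lines could be used (the exact regular expression in the downloadable files may differ):

import java.util.regex.Pattern;

// Illustrative sketch of the extension filter.
public class ExtensionFilter {
    // Accept only links that end in .html or .htm; extend the pattern to follow more extensions.
    private static final Pattern PAGE = Pattern.compile(".*\\.html?$", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        System.out.println(PAGE.matcher("http://www.alepe.com/index.html").matches()); // true
        System.out.println(PAGE.matcher("http://www.alepe.com/logo.gif").matches());   // false
    }
}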

This crawler has an option to limit the number of pages to be fetched. Only pages that were successfully fetched and that contain at least one link count toward that limit.
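A possible way to enforce that limit is sketched below; fetchAndParse() is only a placeholder for the real fetching code, not part of the downloadable files:

import java.net.URL;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

// Hedged sketch of the page limit described above.
public class LimitDemo {
    static List<URL> fetchAndParse(URL page) {
        // Placeholder: the real crawler would download the page and extract its links.
        return Collections.emptyList();
    }

    public static void main(String[] args) throws Exception {
        Queue<URL> frontier = new ArrayDeque<>();
        frontier.add(new URL("http://www.alepe.com/index.php"));

        int limit = 10, counted = 0;
        while (!frontier.isEmpty() && counted < limit) {
            List<URL> found = fetchAndParse(frontier.poll());
            // Only pages that were fetched and contain at least one link count toward the limit.
            if (found != null && !found.isEmpty()) {
                counted++;
                frontier.addAll(found);
            }
        }
        System.out.println("Pages counted: " + counted);
    }
}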

To test my crawler, it can be started in any of the following ways (a sketch of the argument handling follows the examples):
1) java Search
(the system will ask for the parameters interactively)
2) java Search http://www.alepe.com/index.php 10 b
(10 is the limit of pages to fetch, and "b" or "d" selects the scheme)
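The following sketch shows how those two start-up modes could be handled; the real Search.java may parse its arguments differently:

import java.util.Scanner;

// Hedged sketch of the two start-up modes shown in the examples above.
public class SearchArgsDemo {
    public static void main(String[] args) {
        String url;
        int limit;
        String scheme;

        if (args.length >= 3) {
            // Example 2: java Search http://www.alepe.com/index.php 10 b
            url = args[0];
            limit = Integer.parseInt(args[1]);
            scheme = args[2]; // "b" = breadth-first, "d" = depth-first
        } else {
            // Example 1: java Search  (ask the user interactively)
            Scanner in = new Scanner(System.in);
            System.out.print("Start URL: ");
            url = in.nextLine().trim();
            System.out.print("Page limit: ");
            limit = Integer.parseInt(in.nextLine().trim());
            System.out.print("Scheme (b/d): ");
            scheme = in.nextLine().trim();
        }

        System.out.println("Crawling " + url + " (limit " + limit + ", scheme " + scheme + ")");
    }
}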

Download:

Version 1 (breadth-first scheme only)

* BasicCrawler.java

"BasicCrawler" Class is a implementation of a very basic web crawler. In general terms, it read an URL from an imput string, fetch the webpage, parse the content for links based in Breathd first scheme, retreiving all the links in subsequent pages.

Version 2 (final basic version, 3 files):

1) Search.java

"Search" Class is a implementation of a basic web crawler. In general terms, it read an URL from an imput string, fetch the webpage, parse the content for links and depending of two schemes (Breathd first and Depth first) it retreive all the links in subsequent pages.

2) Link.java

"Link" is a custom object to store links. It stores the URL object, the ID of the Parent Link and the status (read, unread) of the link.

3) LinksMgr.java

"LinksMgr" manage the list of links according to different schemes. This is where the fetching strategy takes place.

If you want to learn more about crawlers, I recommend reading Crawling the Web with Java (McGraw-Hill/Osborne).