Practical Distributed Web Crawler: June 2008

Sunday, June 8, 2008

Practical Distributed Web Crawler?

The main problem this project addresses is the very high resource requirement of successful web crawling. Most web crawlers in use today rely on server farms to meet their needs, which puts the field out of reach for ordinary developers. My goal is to reduce the resources needed for web crawling by using a distributed system.

The distributed system will handle both the web crawling and the data processing, while a single database server stores the data. The project will also provide search by page details and by image tags, to offer a better image search.

Home Page : http://www.221bot.com

Friday, June 6, 2008

Project Objectives

The final goal of this project is to create a web crawler that can work in practical environments. To reach that level, the project will look in depth at the usability and flexibility of the product.

The final product should be able to attract users to the system and to distribute a client among them. The web crawler will be capable of collecting information under given keywords, so that clients can customize the search patterns according to their needs.
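As a rough illustration of collecting pages under given keywords, the sketch below checks which keywords occur in the visible text of a fetched page. It is only a minimal sketch in Python: the names TextExtractor and matched_keywords are hypothetical, the project does not specify its matching rules, and this version handles single-word keywords only.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def matched_keywords(html, keywords):
    """Return the keywords (lower-cased) that occur as whole words in the page text.

    Assumption: keywords are single words; phrase matching is not handled here.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = set(re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower()))
    return {k.lower() for k in keywords if k.lower() in words}
```

A client could keep a page only when matched_keywords returns a non-empty set, and send the matched keywords along with the page address for storage.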

This will give the common user a web site where he/she can search for web addresses or images under given keywords. For image search, users can choose whether to search according to image tags or page content.
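The two image search modes described above could look roughly like the following. This is a sketch under assumptions: "image tags" is taken to mean the alt text of img elements, matching is a simple substring test, and the names PageImageParser, index_page and search_images are hypothetical rather than part of the project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageImageParser(HTMLParser):
    """Collects page text and (image URL, alt text) pairs from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.text_parts = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            src = attr.get("src")
            if src:
                alt = (attr.get("alt") or "").strip()
                # Resolve relative image paths against the page address.
                self.images.append((urljoin(self.page_url, src), alt))

    def handle_data(self, data):
        self.text_parts.append(data)


def index_page(page_url, html):
    """Build one searchable record for a crawled page."""
    parser = PageImageParser(page_url)
    parser.feed(html)
    parser.close()
    return {"text": " ".join(parser.text_parts).lower(), "images": parser.images}


def search_images(records, keyword, by="tags"):
    """Return image URLs matching keyword by alt text ("tags") or page text ("content")."""
    keyword = keyword.lower()
    hits = []
    for record in records:
        for url, alt in record["images"]:
            haystack = alt.lower() if by == "tags" else record["text"]
            if keyword in haystack:
                hits.append(url)
    return hits
```

Searching by tags returns only images whose own description matches, while searching by content returns every image on a page whose text matches, which is broader but less precise.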

Other developers will be able to download the editable version of the client, so they can modify it and distribute a client that searches the specific area they need to focus on.

The final deliverables are as follows:

· An online web server application to distribute the workload to the clients.

· An online web site to publicize the project and to distribute the web crawler.

· An Oracle 10g database server to collect the processed information.

· A client that will do the crawling and the processing.
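One possible way for the server application to split the workload among clients is sketched below. The posts do not say how the distribution will be done, so this is an assumption: URLs are grouped by host name and each host is hashed to one client, so that a single site is always crawled by the same client. The names assign_client and build_batches and the client identifiers are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_client(url, client_ids):
    """Pick a client for a URL by hashing its host name (assumed scheme)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return client_ids[int.from_bytes(digest[:8], "big") % len(client_ids)]


def build_batches(urls, client_ids):
    """Group URLs into one batch per client, dropping exact duplicates."""
    batches = defaultdict(list)
    seen = set()
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        batches[assign_client(url, client_ids)].append(url)
    return dict(batches)


if __name__ == "__main__":
    clients = ["client-a", "client-b", "client-c"]
    todo = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
        "http://example.com/a",  # duplicate, ignored
    ]
    for client, batch in build_batches(todo, clients).items():
        print(client, batch)
```

A plain modulo over the client list reshuffles most hosts whenever a client joins or leaves, so a real deployment with volunteer clients would probably need something more stable, such as consistent hashing or a server-side queue that clients pull from.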

Wednesday, June 4, 2008

Brief description of the development plan

The development methodology I have selected to complete this project is Extreme Programming. Extreme Programming is a methodology that encourages the developer to start from the simplest form of the product.

This keeps the development flexible to future developments and extra functionality. Extreme Programming favors simple designs, common metaphors, collaboration of users and programmers, frequent verbal communication, and feedback. Since user involvement is a must in this project, this will also make it flexible to user requirements so that it can change accordingly.

The project will go through 3 major stages of development. Each stage expands on the previous stage and adds functionality to it.

Areas developed under the 1st stage:

The web server application to distribute information and collect user support and comments
The database to handle client information over the web
Basic web crawler functions

Areas developed under the 2nd stage:

Web crawler and the database
Client software
Web server application to do searching

Areas developed under the 3rd stage:

Web server application to distribute the workload
Web server application to distribute the client and collect client data
Customizable web crawler