Sunday, June 8, 2008

Practical Distributed Web Crawler?

The main problem this project addresses is the very large amount of resources required for successful web crawling. Most web crawlers in use today rely on server farms to meet their needs, which puts the field out of reach for ordinary developers. My goal is to reduce the resources needed for web crawling by using a distributed system.

The distributed system will handle both the web crawling and the data processing, with a single database server storing the data. The project will also provide search over page details and image tags, to offer a better image search.
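As a rough illustration of the processing a client might do on each crawled page, the sketch below pulls out the page title, the outgoing links, and the image tags. It uses only the Python standard library; the names PageExtractor and process_page are made up for this example, and the actual client is planned to run on the JRE, so this shows the idea rather than the project's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects the title, outgoing links and image tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self.images = []  # (absolute image URL, alt text) pairs
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            src = urljoin(self.base_url, attrs["src"])
            self.images.append((src, attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def process_page(url, html):
    """Hypothetical client-side step: turn raw HTML into a small record."""
    parser = PageExtractor(url)
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
    }
```

Only the resulting record would need to travel to the database server, which keeps the heavy work (downloading and parsing) on the clients.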

Home Page : http://www.221bot.com

Friday, June 6, 2008

Project Objectives

The final goal of this project is to create a web crawler that works in practical environments. To reach that level, the project will look in depth at the usability and flexibility of the product.

The final product should be able to attract users to the system and distribute a client among them. The web crawler will be capable of collecting information under given keywords, so that clients can customize the search patterns to their needs.

This will give the common user a web site where he/she can search for web addresses or images under given keywords. For image search, users may choose whether to search by image tags or by page content.
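The two image search modes could look something like the sketch below. The record layout (image_url, tags, page_text) and the function name search_images are assumptions for illustration only; in the real system the lookup would be a query against the database rather than a scan over a list.

```python
def search_images(records, keyword, mode="tags"):
    """Return the URLs of images matching keyword.

    records: list of dicts with 'image_url', 'tags' and 'page_text'
             (a hypothetical layout, not the project's schema).
    mode:    'tags' matches against the image's own tag text,
             'content' matches against the text of the page it was found on.
    """
    if mode not in ("tags", "content"):
        raise ValueError("mode must be 'tags' or 'content'")
    field = "tags" if mode == "tags" else "page_text"
    needle = keyword.lower()
    # Whole-word, case-insensitive match on the chosen field.
    return [r["image_url"] for r in records
            if needle in r[field].lower().split()]


records = [
    {"image_url": "http://example.com/cat.jpg",
     "tags": "a sleeping cat",
     "page_text": "Photos from our holiday"},
    {"image_url": "http://example.com/img01.jpg",
     "tags": "",
     "page_text": "Everything about the domestic cat"},
]
by_tags = search_images(records, "cat", mode="tags")
by_content = search_images(records, "cat", mode="content")
```

Searching by page content can find images that have no useful tag text at all, which is the case for the second record above.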

Other developers will be able to download an editable version of the client, so they can modify it and distribute a client that searches the specific area they want to focus on.

The final deliverables are as follows:

· An online web server application to distribute the workload to the clients.

· An online web site to do the publicity and to distribute the web crawler.

· An Oracle 10G database server to collect the processed information.

· A client that will do the crawling and the processing.
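To make the first deliverable more concrete, here is a minimal in-memory sketch of how a server could hand out crawling work to clients. The class and method names (WorkloadDistributor, request_work, report_results, drop_client) and the batch size are assumptions for this example; the planned server is a web application, so this only shows the bookkeeping, not the transport.

```python
from collections import deque


class WorkloadDistributor:
    """Hands out batches of URLs to clients and tracks what is outstanding."""

    def __init__(self, seed_urls, batch_size=2):
        unique = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
        self.pending = deque(unique)   # URLs waiting to be crawled
        self.seen = set(unique)        # every URL ever queued
        self.assigned = {}             # client id -> batch it is working on
        self.batch_size = batch_size

    def request_work(self, client_id):
        # A client that asks again before reporting gets the same batch back.
        if client_id in self.assigned:
            return list(self.assigned[client_id])
        batch = []
        while self.pending and len(batch) < self.batch_size:
            batch.append(self.pending.popleft())
        if batch:
            self.assigned[client_id] = batch
        return batch

    def report_results(self, client_id, discovered_urls):
        # The batch is finished; queue any links not seen before.
        self.assigned.pop(client_id, None)
        for url in discovered_urls:
            if url not in self.seen:
                self.seen.add(url)
                self.pending.append(url)

    def drop_client(self, client_id):
        # A volunteer went offline: put its batch back at the front.
        batch = self.assigned.pop(client_id, [])
        self.pending.extendleft(reversed(batch))
```

Because the clients are volunteers' machines that can disappear at any time, the server keeps each batch on record until the client reports back, and requeues it otherwise.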

Wednesday, June 4, 2008

Brief description of the development plan

The development methodology I have selected for this project is Extreme Programming. Extreme Programming is a methodology that encourages the developer to start from the simplest form of the product.

This keeps the development flexible enough to accommodate future changes and extra functionality. Extreme Programming also favors simple designs, common metaphors, collaboration between users and programmers, frequent verbal communication, and feedback. Since user involvement is a must in this project, this will also keep it responsive to user requirements and able to change accordingly.

The project will pass through three major stages of development. Each stage expands on the previous one and adds functionality to it.

Areas developed under the 1st stage:

  • The web server application to distribute information and collect user support and comments
  • The Database to handle client information over the web
  • Basic web crawler functions

Areas developed under the 2nd stage:

  • Web crawler and the database
  • Client software
  • Web server application to do searching

Areas developed under the 3rd stage:

  • Web server application to distribute the workload
  • Web server application to distribute the client and collect client data
  • Customizable web crawler

Thursday, May 8, 2008

Brief description of the resources used

Database Server:

Hardware:

· Processor – 2 GHz

· RAM – 2 GB

· Broadband internet connection of 1 Mbps with static IP

Software:

· Oracle 10G

Web Server:

Hardware:

· Shared web hosting with 1 GB storage

· Static IP

· 1 TB bandwidth

Software:

· PHP 5 with MySQL 4.1

User (Client):

Hardware:

· Processor – 1 GHz with 256 MB RAM

· Internet connection of any kind

Software:

· Windows XP SP2 operating system with JRE 1.4 or higher

Development stage software:

· NetBeans 6.0

· Zend Studio

· Oracle 10G

· Visual Studio 6

Human resources will be used in the testing phase to identify errors and to make sure that the system works in any working environment. Voluntary participants are very important for this project, since it is a distributed system.