Sunday, June 8, 2008

Practical Distributed Web Crawler ?

The main problem this project addresses is the very high level of resources required for successful web crawling. Most web crawlers in use today rely on server farms to meet their needs, which puts the area out of reach for ordinary developers. My goal is to reduce the resources needed for web crawling by using a distributed system.

The distributed system will do both the web crawling and the data processing, with a single database server storing the data. The project will also provide search based on page details and image tags, to give a better image search.
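To make the split concrete, here is a minimal Python sketch of that layout, not the project's actual code. Everything in it is an assumption for illustration: the names `PageParser`, `process_page` and `crawl`, the `images` table, and the use of an in-process loop and SQLite in place of real remote workers and a real database server. The `fetch` argument is a stand-in for whatever downloads a page.

```python
import sqlite3
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects outgoing links and image alt text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.image_tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("alt"):
            self.image_tags.append(attrs["alt"])


def process_page(url, html):
    """Work meant for a volunteer machine: parse a page, return a small record."""
    parser = PageParser()
    parser.feed(html)
    return {"url": url, "links": parser.links, "image_tags": parser.image_tags}


def crawl(seed, fetch, db, limit=100):
    """Coordinator: hands out URLs and stores results in the single database."""
    db.execute("CREATE TABLE IF NOT EXISTS images (page TEXT, tag TEXT)")
    frontier, seen, crawled = deque([seed]), {seed}, 0
    while frontier and crawled < limit:
        url = frontier.popleft()
        # In a real deployment this call would run on a remote worker.
        record = process_page(url, fetch(url))
        crawled += 1
        db.executemany(
            "INSERT INTO images VALUES (?, ?)",
            [(url, tag) for tag in record["image_tags"]],
        )
        for link in record["links"]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    db.commit()
    return crawled
```

The point of the design is that only the small parsed record travels back to the database server, so the heavy work (downloading and parsing) stays on the distributed machines.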

Home Page : http://www.221bot.com

9 comments:

Ranhiru said...

Hey machan damn good idea!! Well this might be the next generation of easy and cost effective web crawling! Gud luck 2 u!!!

SAN said...

Thx a lot machchang;
Well let's try to do it in this Generation :D

Deane said...

Cool, cool, go Samitha. Great idea. Let's see how you actually do this. How about using some portion of the searcher's resources?

SAN said...

Thx deane !!!
And thx for ur new idea also !!!

Well that's a really gud idea bro, but I'm not sure how many people would like to donate for a simple search. And different browsers might be a problem also, becoz each browser will handle it in a different way.

But there will be a system to count the amount of data donated by a participant using the crawler application.

And a number of points will be added to his account accordingly. He may use them to track a site's changes. This will be very useful for web admins and forum lovers. This is still in the test state. If I do not manage to do this for the FYP on time, I will do it someday. I think it will be a new idea for a web crawler.

After all it will be a true LIVE SEARCH !!!

Yohan Liyanage said...

This seems challenging. Good Luck on your project machan. Well, once done, you could be on the way to be the next Larry Page. XD

SAN said...

Thx Yohan!

Hey I'm no "Larry Page" :P bro, just doing ma FYP.

but since you are interested there will be a function to keep track of web pages.

This can be used on web sites that do not have RSS feeds. It can also be used to customize the information u need to read. With this option u will not have to visit the web site again and again.

For example, take Defence.lk.
You can use this system (once done :P) to scan the site at a given interval for keywords that u provide.
How cool if u can get a mail or SMS (only selected providers) whenever defence.lk announces the word BOMB on their site :)
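
A rough Python sketch of that keyword check, just to show the idea. The function names `find_keywords` and `new_alerts` are made up for illustration, and the real system may work quite differently. It only alerts on a keyword the first time it shows up, so repeated scans don't spam the user.

```python
import re


def find_keywords(page_text, keywords):
    """Return the watched keywords that occur as whole words (case-insensitive)."""
    words = set(re.findall(r"\w+", page_text.lower()))
    return {k for k in keywords if k.lower() in words}


def new_alerts(page_text, keywords, already_alerted):
    """Keywords found in this scan that have not been reported before."""
    fresh = find_keywords(page_text, keywords) - already_alerted
    already_alerted.update(fresh)  # remember them for the next scan
    return fresh
```

Each scheduled scan would call `new_alerts` with the freshly downloaded page text, and anything it returns would go out by mail or SMS.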

So I think now ur wondering how to find the bandwidth needed for such a massive operation. It's simple: since this is a distributed system, you will have to GIVE before u can TAKE.

Points will be allocated for ur donation of bandwidth to this system's crawling and web monitoring. Then u can scan web sites for a given time and at a given frequency, filtering by keywords, and get alerted via e-mail or SMS.

In simple terms, if u give 100 u get 100 :D
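
The give-before-take rule could be sketched in Python like this. The `PointsLedger` class and its one-point-per-unit rate are assumptions for illustration only, nothing here is final.

```python
class PointsLedger:
    """Tracks points per user: earned by donating, spent on monitoring scans."""

    def __init__(self):
        self.balances = {}

    def donate(self, user, units):
        """Credit one point for every unit of bandwidth donated."""
        self.balances[user] = self.balances.get(user, 0) + units

    def spend(self, user, units):
        """Deduct points for a scan; refuse if the user has not given enough."""
        if self.balances.get(user, 0) < units:
            return False
        self.balances[user] -= units
        return True
```

So a user who donates 100 units can spend up to 100 points on scans, and any request beyond that is refused until he donates more.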

Genesis said...

A challenging project in many aspects.
Good work Bro!

Genesis said...

BTW when is this due?
How's the help from Dr.M?

SAN said...

Thx Genesis !!! :)

well the help is fine, he agreed to provide the necessary resources like FTP and SSH to the corporate lab.

I still haven't got them, but hope they will provide them in the near future :P.