Something in My Mind: Distributed Web Crawler

Wednesday, August 1, 2007

Distributed Web Crawler - Grub

http://www.grub.org/

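This thing is pretty clever. Much like SETI@home, anyone whose computer sits idle can download the Grub client and install it at home. The client then helps "crawl" pages on the web, does some indexing work on them, and sends the results back to Grub's database for storage. In theory, this is the ultimate way to pool everyone's effort and index the entire Internet.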
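To make the crawl, index, send-back loop a bit more concrete, here is a minimal Python sketch of what such a volunteer client might do. It is not Grub's actual client or protocol: the names `index_page` and `run_client`, the record format, and the injected `fetch` and `submit` callables are all assumptions made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def index_page(url, html):
    """Turn one fetched page into a small index record (hypothetical format)."""
    parser = _PageParser(url)
    parser.feed(html)
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    return {"url": url, "terms": dict(Counter(words)), "links": parser.links}


def run_client(urls, fetch, submit):
    """Crawl the assigned URLs, index each one locally, and hand it back.

    fetch(url) returns the page's HTML and submit(record) delivers the
    result; both are injected, so this sketch makes no claim about how a
    real client talks to the network or to the central database.
    """
    for url in urls:
        submit(index_page(url, fetch(url)))


# Usage with an in-memory "web" instead of real HTTP requests.
fake_web = {
    "http://example.com/": (
        "<p>Grub crawls the web, the whole web. <a href='/about'>About</a></p>"
        "<script>var x = 1;</script>"
    )
}
results = []
run_client(list(fake_web), fake_web.get, results.append)
```

The point of the design is that the expensive parts (downloading and parsing) happen on the volunteer's machine, and only the compact record travels back to the central database.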

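There was a piece of news today: Jimmy Wales, the founder of Wikia, has bought Grub and plans to use it to take on powerful search engines like Google and Yahoo, relying on the database behind Grub.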

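But doesn't that mean everyone gathers the data for them while they pocket all the money? I've written a web crawler myself, and the thorniest problems were not enough bandwidth or not enough machines... Grub's trick makes those problems melt away. Of course, plenty of people question why they get to make money off data that everyone else crawled. Grub's answer is that open source doesn't mean free of charge, and that you can have your own site crawled first to boost its exposure, and so on. They do also provide an interface that lets anyone use the database... but the database still sits on their side.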
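As a rough illustration of how spreading the work over volunteers eases the bandwidth and machine shortage, here is a small Python sketch of a coordinator splitting a URL list among clients. This is an assumption about how such a system could work, not a description of Grub's scheduler: the function `assign_urls`, the client names, and the choice to hash by hostname are all hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_urls(urls, clients):
    """Split URLs among volunteer clients, keeping each host on one client."""
    work = {client: [] for client in clients}
    for url in urls:
        host = urlsplit(url).hostname or ""
        # Hash the hostname so every URL of a site maps to the same client.
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        client = clients[int.from_bytes(digest[:8], "big") % len(clients)]
        work[client].append(url)
    return work


# Usage: three hypothetical volunteers share four URLs.
urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/",
    "http://c.example/x",
]
work = assign_urls(urls, ["alice", "bob", "carol"])
```

Keeping a whole host on one client means that client alone can pace its requests to that site, and each volunteer only spends bandwidth on its own share of the list.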

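Personally, I think Grub's way of making money is fine. Don't the famous big sites also make their money from everyone posting photos and writing blogs? This is the Web 2.0 era, after all... Content providers have shifted from the websites themselves to the users, and for anyone who wants to make money it comes down to who can attract the most people to contribute content. As for me, I'm more interested in that interface for indexing web pages...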

1 comment:

adAma said...: Sorry to bother you, but perhaps the blogger (Something in My Mind) knows:

Where can I get evaluation tools to separate a local database (16,939 files, ~300 MB of pure text, located in 3 directories that are too flat) into an appropriate structure? Thanks!

I just tried Teleport Pro for this task, but I first have to pick out the first index page by guessing, then manually check how many pages and how much data each link (more than 1,000 in total) points to, etc.

I appreciate your help in advance (email reply preferred).; January 30, 2008, 8:42 AM
