A list of crawler software & alternatives to Scrapy

Saturday, September 29, 2012
photo credit: Ian Sane

What is Scrapy? Let's look at the official definition:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

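Wow, so it can crawl websites and extract structured data from their pages. It is 100% Python, runs on Linux, Windows, Mac, and BSD, and comes with very thorough documentation... hmm, that sounds pretty good.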
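To make "extract structured data from their pages" a bit more concrete, here is a minimal sketch of the idea using only Python's standard library. This is not Scrapy and not how Scrapy does it internally; the `LinkExtractor` class, the `extract_links` helper, and the sample HTML with its example.com URLs are all made up for illustration. It turns the links on a page into a list of records, with relative URLs resolved against a base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collect a record for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished records: {"text": ..., "url": ...}
        self._href = None    # absolute URL of the <a> tag currently open
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"text": text, "url": self._href})
            self._href = None


def extract_links(html, base_url):
    """Parse an HTML string and return its links as a list of dicts."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page, so the sketch runs without touching the network.
SAMPLE_HTML = """
<ul>
  <li><a href="/docs/">Documentation</a></li>
  <li><a href="http://example.org/blog">Blog</a></li>
</ul>
"""

if __name__ == "__main__":
    for record in extract_links(SAMPLE_HTML, "http://example.com/"):
        print(record)
```

A real crawler would also fetch pages, follow the extracted links, and respect robots.txt and rate limits; that plumbing is the part a framework like Scrapy is meant to take off your hands.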

Still, I wanted to know what alternatives are out there, and that's when a voice chimed in:

If you're looking for a python based crawler, Scrapy is probably your best bet.
─ Eric Wu

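So the point is that Scrapy is already very good, right? Either way, Eric Wu is a really kind soul: he left a very useful list of crawlers on Quora, covering crawler software written in all sorts of languages.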

Java
    Nutch => http://nutch.apache.org/
    Heritrix => https://webarchive.jira.com/wiki/display/Heritrix/Heritrix...
    WebSPHINX => http://www.cs.cmu.edu/~rcm/websphinx/

Python
    Scrapy => http://scrapy.org/
    Scrape.py => http://zesty.ca/scrape/
    HarvestMan => http://harvestmanontheweb.com/
    mechanize (ported from the Perl version) => http://wwwsearch.sourceforge.net/mechanize/

Ruby
    scRUBYt => https://github.com/scrubber/scrubyt
    Anemone => http://anemone.rubyforge.org/

Ruby: not really crawlers, but can be used like one
    hpricot => http://hpricot.com/
    Nokogiri => http://nokogiri.org/

PHP
    Snoopy => http://sourceforge.net/projects/snoopy/
    PHPCrawl => http://phpcrawl.cuab.de/

Erlang
    eBot => https://github.com/matteoredaelli/ebot


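This list could go on forever, which is not something I'd be happy to see XD. Python is a rather beautiful language to me, so I prefer to try Python-based software first. Which crawlers have you used? If you have one to recommend, feel free to let me know :)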


Hi, I'm Seyna. Your comments are welcome :)


© 2010 取火之路, Design by DzigNine