取火之路: October 2012

photo credit: molechaser

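The Scrapy shell offers many interactive ways to inspect a page's structure and to check whether the extraction rules you designed actually work, which makes it a very handy feature. But how do you get Chinese characters to show up in the shell? Evaluating a value at the prompt only echoes its escaped representation, so the answer is simple: use print (perhaps together with encode).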

// Start the scrapy shell
scrapy shell

// Use the Yahoo Movies chart page as an example

>>> fetch("http://tw.movie.yahoo.com/chart.html")

2012-10-04 17:36:09+0800 [default] INFO: Spider opened
2012-10-04 17:36:09+0800 [default] DEBUG: Crawled (200) <GET http://tw.movie.yahoo.com/chart.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html lang="zh-tw"><head><title>\u53f0\u5317\u7968\u623f\u699c - '>
[s]   item       {}
[s]   request    <GET http://tw.movie.yahoo.com/chart.html>
[s]   response   <200 http://tw.movie.yahoo.com/chart.html>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x1a72f90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser


// Select the first option element and extract its text
>>> desc = hxs.select('//option/text()').extract()[0]

// Raw unicode representation
>>> desc
u'\u96fb\u5f71\u6642\u523b'

// Display the Chinese text correctly
>>> print desc
電影時刻

>>> print desc.encode('utf-8')
電影時刻
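The two displays in the session above differ because the interactive prompt echoes the representation of a value, while print writes the text itself. The following is only a minimal sketch of that difference, and it assumes Python 3 rather than the Python 2 shell used in the session. In Python 3 the prompt no longer escapes non-ASCII characters, so the built-in `ascii()` is used here to reproduce the escaped form. The variable names `escaped` and `encoded` are illustrative and not part of Scrapy.

```python
# Sketch under assumptions: Python 3, not the Python 2 shell from the session above.
desc = '\u96fb\u5f71\u6642\u523b'  # the same four characters extracted above

# ascii() gives the escaped form, similar to what the Python 2 prompt echoed.
escaped = ascii(desc)
print(escaped)

# print writes the characters themselves, so the text is readable.
print(desc)

# encode turns the text into UTF-8 bytes; decoding them restores the same text.
encoded = desc.encode('utf-8')
print(len(desc), len(encoded))
print(encoded.decode('utf-8') == desc)
```

Whether an explicit encode is needed depends on the terminal: when Python cannot tell which encoding the terminal uses, passing already-encoded UTF-8 bytes is one way around it in Python 2.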
