[Python] install scrapy

A while ago I was writing a web crawler, and the tool I used was Python's Scrapy!

Scrapy is very lightweight and easy to pick up, but for me the real obstacle was installing it!

Luckily all of my machines got reformatted recently, which indirectly gave me plenty of practice doing the manual installation over and over.

So here is a record of how to install Scrapy, in case senility kicks in and I forget next time.

According to the documentation on the Scrapy website, installing Scrapy requires the following ingredients:

(1) Python 2.7 or above

(2) the pip and setuptools packages

(3) lxml

(4) OpenSSL

The environment I'm working in is Red Hat 4.4.7-11.

P.S. How do you check? Run: cat /proc/version
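
If you'd rather check from inside Python, here is a quick sketch using only the standard library (my own convenience check, not something the Scrapy docs ask for):

# quick environment check: OS/kernel string plus the Python version currently in use
import platform
import sys

print(platform.platform())   # e.g. Linux-...-x86_64-with-redhat-...
print(sys.version)           # whichever interpreter `python` currently points to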

Once you know which ingredients Scrapy needs, the manual says to just run:

pip install Scrapy

and the installation is done. Of course, things are rarely that simple.

On my Linux box the default python is 2.6, so the first step is to install 2.7+ and then "re-point" the python command to it.

Step 1: Install Python 2.7.8

wget http://python.org/ftp/python/2.7.8/Python-2.7.8.tar.xz
tar xf Python-2.7.8.tar.xz
cd Python-2.7.8
./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"

make

make install
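
Before re-pointing anything, I like to confirm the freshly built interpreter is really a UCS-4 build; a small sanity check of my own (not part of the official instructions):

# run with: /usr/local/bin/python2.7 check_build.py
import sys

print(sys.version_info[:3])  # expect (2, 7, 8)
print(sys.maxunicode)        # 1114111 means --enable-unicode=ucs4 took effect; 65535 would mean UCS-2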

Step 2: Re-point python so it no longer defaults to 2.6

mv /usr/bin/python /usr/bin/python.bak
ln -s /usr/local/bin/python2.7 /usr/bin/python
rm /usr/bin/python.bak
sed -i 's/python/python2.6/' /usr/bin/yum
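
That last sed pins yum's shebang back to python2.6, because yum on this distribution only works with the stock 2.6 interpreter. To confirm the re-pointing of python itself worked, another quick sanity check of mine:

# run as plain `python check_link.py` -- the 2.7 build should now answer
import os
import sys

print(os.path.realpath(sys.executable))  # expect /usr/local/bin/python2.7
print(sys.version_info[:2])              # expect (2, 7)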

Step 3: Install pip

wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
python2.7 ez_setup.py
easy_install-2.7 pip

Step 4: Install Scrapy

pip install Scrapy

Then I ran scrapy shell <url-to-crawl>, and... failed!!!

It spits out this monstrosity:

Traceback (most recent call last):
File "/usr/local/bin/pip", line 9, in <module>
load_entry_point('pip==1.5.6', 'console_scripts', 'pip')()
File "/usr/local/lib/python2.7/site-packages/pip-1.5.6-py2.7.egg/pip/__init__.py", line 185, in main
return command.main(cmd_args)
File "/usr/local/lib/python2.7/site-packages/pip-1.5.6-py2.7.egg/pip/basecommand.py", line 161, in main
text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 81: ordinal not in range(128)

That UnicodeDecodeError is just pip choking on non-ASCII text while writing out its own log, most likely after a dependency failed to build. So it's back to doing things by the book and installing everything step by step Q__Q

yum install libevent-devel python-devel
pip install uwsgi
pip install twisted
pip install w3lib
yum install python-devel libxml2-devel libxslt-devel
yum install pyOpenSSL
yum install libffi-devel

pip install lxml
pip install cssselect
pip install pyOpenSSL
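
Before retrying scrapy shell, it's worth confirming that the key dependencies now import cleanly under the 2.7 interpreter; a minimal sketch (the module names below are the ones these packages normally install, adjust if your versions differ):

# verify that Scrapy's dependencies can be imported
import lxml.etree
import OpenSSL
import twisted
import w3lib
import cssselect

print(lxml.etree.LXML_VERSION)   # lxml version tuple
print(OpenSSL.__version__)       # pyOpenSSL version
print(twisted.version.short())   # Twisted version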

When scrapy shell finally produces a log like the one below, you can relax:

[root@cdh4-dn2 data]# scrapy shell "https://hotel.reservation.jp/superhotel/eng/reservation/3.asp?ne_hotel=049&ne_reserv_d=3&ne_reserv_ym=2007/11/01"
/usr/local/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
verifyHostname, VerificationError = _selectVerifyImplementation()
2014-11-03 09:27:19+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-11-03 09:27:19+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-11-03 09:27:19+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-11-03 09:27:19+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-03 09:27:19+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-03 09:27:19+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-03 09:27:19+0800 [scrapy] INFO: Enabled item pipelines:
2014-11-03 09:27:19+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-11-03 09:27:19+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-11-03 09:27:19+0800 [default] INFO: Spider opened
2014-11-03 09:27:19+0800 [default] DEBUG: Crawled (200) <GET https://hotel.reservation.jp/superhotel/eng/reservation/3.asp?ne_hotel=049&ne_reserv_d=3&ne_reserv_ym=2007/11/01> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f74ea7c2b90>
[s] item {}
[s] request <GET https://hotel.reservation.jp/superhotel/eng/reservation/3.asp?ne_hotel=049&ne_reserv_d=3&ne_reserv_ym=2007/11/01>
[s] response <200 https://hotel.reservation.jp/superhotel/eng/reservation/3.asp?ne_hotel=049&ne_reserv_d=3&ne_reserv_ym=2007/11/01>
[s] settings <scrapy.settings.Settings object at 0x7f74fa0ec390>
[s] spider <Spider ‘default’ at 0x7f74e9dce990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

In [1]:
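
Once the shell works, the actual crawl can be moved into a spider. Below is a minimal sketch of what that looks like; the spider name and the title XPath are made up for illustration, and the API follows the Scrapy 0.24-era docs as far as I can tell, so treat it as a starting point rather than a finished crawler:

# hotel_spider.py -- run with: scrapy runspider hotel_spider.py
import scrapy
from scrapy.selector import Selector


class HotelSpider(scrapy.Spider):
    name = "hotel"  # hypothetical spider name
    start_urls = [
        "https://hotel.reservation.jp/superhotel/eng/reservation/3.asp"
        "?ne_hotel=049&ne_reserv_d=3&ne_reserv_ym=2007/11/01",
    ]

    def parse(self, response):
        # smoke test: pull the page title out of the response
        title = Selector(response).xpath("//title/text()").extract()
        self.log("page title: %s" % title)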

References

http://scrapy.org/

http://doc.scrapy.org/en/master/


