pyspider Installation Guide
1. Quick Start
If you are using Ubuntu, install the following dependencies first:
apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml
If you need the JavaScript rendering engine, install PhantomJS as described on the PhantomJS website and add the executable to your system PATH so it can be invoked.
Note: pyspider has many problems when run on Windows, and the author has not done any Windows compatibility testing; unless you are very good at troubleshooting, a Windows system is not recommended.
Once the above is installed, run the following command to install pyspider:
pip install pyspider
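After installation, a minimal project script looks roughly like the sketch below; it mirrors the template the web console generates for a new project, and the example.com URL and handler names are placeholders only.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed URL -- replace with the site you actually want to crawl
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # whatever is returned here is stored as the result of the task
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }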
2. Installation
CentOS is a common server operating system; this example uses a minimal installation of CentOS 7.2.
Update yum packages
#yum update
Install the wget command
#yum install wget
First, install the EPEL extra repository
#yum -y install epel-release
A domestic (China) yum mirror:
http://mirrors.163.com/.help/CentOS7-Base-163.repo
Install compilers and build tools
#yum install gcc gcc-c++
Install the dependency libraries (the system's built-in Python version is 2.7.5)
#yum install python-pip python-devel python-distribute libxml2 libxml2-devel python-lxml libxslt libxslt-devel openssl openssl-devel
Upgrade pip
#pip install --upgrade pip
Install pyspider
1. Online installation
#pip install pyspider
2. Local installation from a downloaded package
It is recommended to create the /www and /data directories first, so that project and database files can later be kept in their respective directories.
Download the pyspider-master package from the Git repository into the /www directory.
#cd pyspider-master
#python setup.py install
After installation the reported version is pyspider-0.3.7.
Configure the firewall
Open port 5000 (pyspider listens on port 5000 by default; if you start it on another port, change the value accordingly) and reload the firewall. If the firewall is disabled, skip the next two steps.
firewall-cmd --zone=public --add-port=5000/tcp --permanent
firewall-cmd --reload
Access the web console
Once installation is complete, simply run pyspider to start it with the default configuration, then open http://localhost:5000 to reach the web console.
MySQL database installation
The system replaces MySQL with mariadb-server by default, so the commands are:
#yum install mariadb mariadb-server
yum resolves and installs the required dependency packages automatically.
After installation, start mariadb:
#systemctl start mariadb
Set it to start at boot:
#systemctl enable mariadb
Run MySQL's default security script to set the MySQL root password and related options:
#mysql_secure_installation
When finished, test the database service with the following command:
#mysql -uroot -p
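As an illustration of how the freshly installed database can be used from a project, the sketch below overrides the on_result hook to write results into MariaDB. This is only a sketch, not part of the official setup: it assumes the pymysql package is installed (pip install pymysql) and that a database named spider with a result(url, title) table already exists; adjust names and credentials to your environment.

import pymysql
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_result(self, result):
        # on_result is called for every value returned by a callback
        if not result:
            return
        # connection details below are placeholders for this sketch
        conn = pymysql.connect(host='127.0.0.1', user='root',
                               password='your_password', db='spider',
                               charset='utf8mb4')
        try:
            with conn.cursor() as cursor:
                cursor.execute('INSERT INTO result (url, title) VALUES (%s, %s)',
                               (result.get('url'), result.get('title')))
            conn.commit()
        finally:
            conn.close()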
3. API Reference
self.crawl(url, **kwargs)
self.crawl is the most important interface in pyspider; it tells pyspider which URLs should be crawled.
Parameters:
url
The URL or list of URLs to be crawled.
callback
Specifies which method will process the fetched content; the content is passed in as the response argument. default: __call__. Called like this:
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
self.crawl also accepts the following optional parameters.
age
Specifies the validity period of the task; within this period the same page will not be re-crawled. default: -1 (never expires, i.e. crawl only once)
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
Explanation: every task whose callback is index_page is valid for 10 days; if the same task is seen again within 10 days it will be ignored (unless a force-crawl parameter is set).
priority
Specifies the priority of the task; the higher the value, the earlier it is executed. default: 0.
def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page, priority=1)
If these two tasks are put into the queue at the same time, the page 233.html is crawled first.
You can use this parameter to perform a breadth-first crawl and reduce the number of tasks in the queue (which may otherwise cost more memory).
exetime
The scheduled execution time of the task, as a unix timestamp. default: 0 (immediately)
import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time() + 30 * 60)
The page will be crawled 30 minutes later.
retries
Number of times to retry a task after it fails. default: 3
itag
A marker value for the task that is compared at crawl time; if the value has changed, the page is re-crawled regardless of whether the validity period has expired. It is mostly used to detect whether content has been modified or to force a re-crawl. default: None.
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())
In this example, the value of the page's update-time element is used as the itag to decide whether the content has been updated.
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }
Changing the project-wide itag parameter forces every task to be re-executed (you still need to click the run button to start the tasks).
auto_recrawl
When enabled, the task will be re-crawled every age interval. default: False
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
The page will be re-crawled every 5 hours.
method
The HTTP method to use for the request. default: GET
params
A dictionary of parameters to append to the URL query string, e.g.:
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
Explanation: these two calls produce the same task.
data
The body to attach to the request; if it is a dictionary, it is form-encoded before being attached.
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
files
Dictionary of {field: {filename: 'content'}} files to upload as multipart form data.
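A minimal sketch of a multipart upload, assuming http://httpbin.org/post as a test endpoint (the field name and file content are illustrative only):

def on_start(self):
    # upload a small in-memory file as multipart form data
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               files={'report': {'report.csv': 'a,b\n1,2\n'}})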
headers
Custom request headers (dictionary).
cookies
Custom cookies for the request (dictionary).
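For example (the header and cookie values below are illustrative, not required by pyspider):

def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               headers={'User-Agent': 'pyspider-demo/1.0',
                        'Referer': 'http://www.example.org/'},
               cookies={'session': 'abc123'})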
connect_timeout
Timeout, in seconds, for establishing the connection. default: 20.
timeout
Maximum number of seconds to wait for the response content. default: 120.
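Both values are plain keyword arguments of self.crawl; a small sketch with tightened limits (the numbers and URL are illustrative):

def on_start(self):
    # fail fast: 10 s to establish the connection, 60 s for the whole response
    self.crawl('http://www.example.org/slow-page', callback=self.callback,
               connect_timeout=10, timeout=60)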
allow_redirects
Whether to follow 30x redirects automatically. default: True.
validate_cert
Whether to validate the certificate when fetching HTTPS URLs. default: True.
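For example, to fetch a host with a self-signed certificate without following redirects (a sketch; the URL is a placeholder):

def on_start(self):
    self.crawl('https://self-signed.example.org/', callback=self.callback,
               validate_cert=False, allow_redirects=False)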
proxy
Proxy server to use, in the form username:password@hostname:port; currently only HTTP proxies are supported.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
Setting the proxy parameter in Handler.crawl_config applies to the whole project; every task in the project will be fetched through the proxy.
etag
Use the HTTP ETag mechanism to skip processing when the page content has not changed. default: True
last_modified
Use the HTTP Last-Modified header to skip processing when the page content has not changed. default: True
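Both checks can be switched off when a site reports unchanged headers even though the body changes; a small sketch (the URL is a placeholder):

def on_start(self):
    # ignore ETag / Last-Modified so the page is always processed
    self.crawl('http://www.example.org/feed', callback=self.callback,
               etag=False, last_modified=False)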
fetch_type
Sets whether to enable the JavaScript rendering engine for the fetch. default: None
js_script
JavaScript code to run before or after the page is loaded; it should be wrapped in a function, e.g. function() { document.write("binux"); }.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0, document.body.scrollHeight);
                   return 123;
               }
               ''')
The script scrolls the page to the bottom. The value returned by the function can be retrieved via Response.js_script_result.
js_run_at
Run the JavaScript specified via js_script at document-start or document-end. default: document-end
js_viewport_width/js_viewport_height
Set the size of the viewport used by the JavaScript fetcher when laying out the page.
load_images
Load images when the JavaScript fetcher is enabled. default: False
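A sketch combining the JavaScript fetcher options above (the viewport size and URL are illustrative):

def on_start(self):
    self.crawl('http://www.example.org/gallery', callback=self.callback,
               fetch_type='js', load_images=True,
               js_viewport_width=1280, js_viewport_height=1024)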
save
Pass an object along with the task; in the callback you can retrieve it via response.save.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
123 will be returned in the callback.
taskid
A unique taskid used to distinguish different tasks. By default the taskid is the md5 hash of the URL. You can also override def get_taskid(self, task) to customize how the task id is generated, e.g.:
import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))
In this example the task id is based on more than the URL: different data parameters also produce different task ids.
force_update
Force-update the task parameters even if the task is in ACTIVE status.
cancel
Cancel a task; it should be used together with force_update to cancel an active task. To cancel an auto_recrawl task, you should also set auto_recrawl=False.
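For example, to stop a task that was previously scheduled with auto_recrawl (a sketch; the URL is a placeholder):

def on_start(self):
    # re-submit the same URL with cancel + force_update to stop the recrawl loop
    self.crawl('http://www.example.org/', callback=self.callback,
               auto_recrawl=False, force_update=True, cancel=True)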
cURL command
self.crawl(curl_command)
cURL is a command-line tool for making HTTP requests. A cURL command is easy to get from the Chrome DevTools Network panel: right-click the request and choose "Copy as cURL".
You can pass a cURL command as the first argument of self.crawl; pyspider will parse the command and make the HTTP request just like curl does.
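A sketch of passing a copied cURL command directly (the command below is illustrative):

def on_start(self):
    self.crawl("curl 'http://httpbin.org/post' -H 'User-Agent: Mozilla/5.0' "
               "--data 'a=123&b=c'", callback=self.callback)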
@config(**kwargs)
Default parameters for self.crawl when the decorated method is used as the callback. For example:
@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}
The age of list-1.html is 15 minutes, while the age of product-233 is 10 days. Because the callback of product-233 is detail_page, it is a detail page and therefore shares the config of detail_page.
Handler.crawl_config = {}
Default self.crawl parameters for the whole project. The scheduler parameters in crawl_config (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are merged when the task is created; the fetcher and processor parameters are merged when the task is executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }
    ...