Installing Scrapy on macOS, a basic introduction to the Scrapy engine, and simple practice with the creation commands

Author: admin  Category: Scrapy  Published: 2019-07-20 00:15

Introduction to Scrapy

Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used to crawl web sites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.

Scrapy is an application framework designed for crawling web sites and breaking pages down to obtain data.

After a few years of working as a programmer, a crawler is one of the easier applications to pick up: the framework, the principles, and the learning curve are not particularly demanding. Using it well and appropriately, however, is what really tests you. There are plenty of shady crawler operations out there, and plenty of services that benefit from crawling. Looked at technically, it mainly comes down to fetching links and parsing HTML; looked at as an application, you crawl here and there and collect useful information for your own purposes. Played well, it can open the door to a new world...

[Figure: Scrapy architecture diagram]

The components are, in order: the engine, the scheduler, the downloader, the spiders, the item pipeline, the downloader middlewares, and the spider middlewares.

  • Scrapy Engine: handles the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
  • Scheduler: receives the requests sent over by the engine, arranges and enqueues them according to a given policy, and hands them back when the engine asks for them.
  • Downloader: downloads every request sent by the Scrapy Engine and returns the resulting responses to the engine, which passes them on to the Spider for processing.
  • Spider: processes all responses, analysing and extracting data from them, filling the fields an item needs, and submitting any URLs that should be followed back to the engine so they re-enter the Scheduler.
  • Item Pipeline: processes the items produced by the Spider, handling analysis, filtering, storage, and similar post-processing (a minimal sketch follows this list).
  • Downloader Middlewares: components you can customize to extend the download functionality.
  • Spider Middlewares: components that extend and hook into the communication between the engine and the Spider, e.g. responses entering the Spider and requests leaving it.
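
To make the Item Pipeline component a bit more concrete, below is a minimal, hypothetical pipeline sketch; the class name ZhizhuPipeline, the title field, and the items.jl output file are illustrative assumptions, not part of the generated project.

import json

from scrapy.exceptions import DropItem


class ZhizhuPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields: drop incomplete items,
        # write the rest as one JSON object per line.
        if not item.get('title'):
            raise DropItem('missing title')
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

A pipeline like this only runs after it has been enabled in the project's ITEM_PIPELINES setting.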

scrapy command information

For installation, see: Installing Python 3.7.3 on macOS and installing the Scrapy crawler framework.

deathearth:eclipse-workspace chenhailong$ scrapy

Scrapy 1.6.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Creating a project

deathearth:eclipse-workspace chenhailong$ scrapy startproject Zhizhu

New Scrapy project 'Zhizhu', using template directory '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/chenhailong/eclipse-workspace/Zhizhu

You can start your first spider with:
    cd Zhizhu
    scrapy genspider example example.com

Directory structure after Scrapy creates a project (shown here with a generic mySpider project name)

mySpider/
    scrapy.cfg              // project configuration file
    mySpider/               // the project's Python module; your code lives here
        __init__.py         // marks the directory as a package
        items.py            // the project's item definitions (structured fields)
        pipelines.py        // the project's pipeline file
        settings.py         // the project's settings file
        spiders/            // directory for the project's spider code
            __init__.py     // package marker
            ...             // the individual spiders
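
As an illustration of what goes into items.py, here is a minimal sketch of a structured item; the class and field names are placeholders only:

import scrapy


class ZhizhuItem(scrapy.Item):
    # Structured fields the spider will fill in (placeholder names).
    title = scrapy.Field()
    url = scrapy.Field()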

Project example

Creating a spider file with scrapy

Generate the deathearth spider file under the spiders folder and specify the crawl scope:

deathearth:spiders chenhailong$ scrapy genspider deathearth "deathearth.com"

Created spider 'deathearth' using template 'basic' in module:
  Zhizhu.spiders.deathearth

Generating the spider this way adds default boilerplate code; of course, you can also create the file yourself and write it by hand.
The file contents are as follows:

# -*- coding: utf-8 -*-
import scrapy

# Try visiting a single link
class DeathearthSpider(scrapy.Spider):
    name = 'deathearth'       # unique name of the spider; each spider needs a different name
    allowed_domains = ['deathearth.com']  # domain scope of the crawl; only URLs under this domain are followed
    start_urls = ['http://deathearth.com/']     # list of start URLs; crawling begins here and child URLs are generated from them

    # parse() is called after each URL is downloaded; it mainly parses the page,
    # extracts data, and generates the next level of URLs
    def parse(self, response):
        print("A reachable link:", response.url)

Fetching and saving a single page

class DeathearthSpider(scrapy.Spider):
    name = 'deathearth'                       # unique name of the spider; each spider needs a different name
    allowed_domains = ['deathearth.com']      # domain scope of the crawl; only URLs under this domain are followed
    start_urls = ['http://www.deathearth.com/676.html']  # list of start URLs; crawling begins here and child URLs are generated from them

    # parse() is called after each URL is downloaded; it mainly parses the page,
    # extracts data, and generates the next level of URLs
    def parse(self, response):
        filename = "676.html"
        open(filename, 'w').write(response.body)

Run a spider written in a standalone Python file:

scrapy runspider xxx.py
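
For reference, a self-contained spider that could be run this way might look roughly like the sketch below; the file name, spider name, and URL are illustrative:

# standalone_spider.py -- run with: scrapy runspider standalone_spider.py
import scrapy


class StandaloneSpider(scrapy.Spider):
    name = 'standalone'
    start_urls = ['http://deathearth.com/']

    def parse(self, response):
        # Yield a simple dict item built from the page title.
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}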

Test-run the spider (the command needs to be executed from inside the project; here it is run from the spiders directory):

scrapy crawl deathearth

2019-04-10 09:48:58 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Zhizhu)
2019-04-10 09:48:58 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Darwin-14.5.0-x86_64-i386-64bit
2019-04-10 09:48:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Zhizhu', 'NEWSPIDER_MODULE': 'Zhizhu.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Zhizhu.spiders']}
2019-04-10 09:48:58 [scrapy.extensions.telnet] INFO: Telnet Password: dfe561b22e1ba5d4
2019-04-10 09:48:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-04-10 09:48:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-10 09:48:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-10 09:48:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-10 09:48:58 [scrapy.core.engine] INFO: Spider opened
2019-04-10 09:48:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-10 09:48:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-10 09:48:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.deathearth.com/robots.txt> (referer: None)
2019-04-10 09:48:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.deathearth.com/676.html> (referer: None)
2019-04-10 09:48:59 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.deathearth.com/676.html> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/chenhailong/eclipse-workspace/Zhizhu/Zhizhu/spiders/deathearth.py", line 14, in parse
    open(filename,'w').write(response.body)
TypeError: write() argument must be str, not bytes
2019-04-10 09:48:59 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-10 09:48:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 452,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 9690,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 10, 1, 48, 59, 428261),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'memusage/max': 45395968,
 'memusage/startup': 45395968,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2019, 4, 10, 1, 48, 58, 672194)}
2019-04-10 09:48:59 [scrapy.core.engine] INFO: Spider closed (finished)

An error occurs

TypeError: write() argument must be str, not bytes

This is because the Python on my Mac has already been upgraded to version 3.7.

Python 3's open() takes an encoding parameter and defaults to text mode, so read and write on the file handle expect str instances containing Unicode characters and reject bytes instances containing binary data; response.body is bytes.
Change open(filename, 'w') to open(filename, 'wb') and run again, and it works.
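
The corrected parse() then looks roughly like this (using a with block is optional but closes the file handle cleanly):

    def parse(self, response):
        filename = "676.html"
        # response.body is bytes, so the file must be opened in binary mode.
        with open(filename, 'wb') as f:
            f.write(response.body)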

scrapy crawl deathearth

2019-04-10 23:11:36 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Zhizhu)
2019-04-10 23:11:36 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Darwin-14.5.0-x86_64-i386-64bit
2019-04-10 23:11:36 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Zhizhu', 'NEWSPIDER_MODULE': 'Zhizhu.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Zhizhu.spiders']}
2019-04-10 23:11:37 [scrapy.extensions.telnet] INFO: Telnet Password: 61aa2c023915e499
2019-04-10 23:11:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-04-10 23:11:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-10 23:11:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-10 23:11:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-10 23:11:37 [scrapy.core.engine] INFO: Spider opened
2019-04-10 23:11:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-10 23:11:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-04-10 23:11:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.deathearth.com/robots.txt> (referer: None)
2019-04-10 23:11:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.deathearth.com/676.html> (referer: None)
2019-04-10 23:11:37 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-10 23:11:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 452,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 9767,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 10, 15, 11, 37, 837350),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'memusage/max': 45273088,
 'memusage/startup': 45273088,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 10, 15, 11, 37, 123152)}
2019-04-10 23:11:37 [scrapy.core.engine] INFO: Spider closed (finished)

And the downloaded 676.html file now exists in the current spider project directory, ready to be used for parsing the HTML content.

The (referer: None) in the log means the request was not referred from another link.
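
As a preview of parsing that downloaded content, a parse() that extracts data with XPath instead of saving the raw page might look like the sketch below; the selector path is an assumption about the page, purely for illustration:

    def parse(self, response):
        # Extract the page title with an XPath selector and yield it as an item.
        title = response.xpath('//title/text()').get()
        yield {'url': response.url, 'title': title}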

For more, see the simple examples of parsing page content with XPath in Scrapy, saving results as JSON files, crawling list pages, and so on.

