scrapy与selenium结合爬取数据(爬取动态网站)的示例代码

scrapy框架只能爬取静态网站。如需爬取动态网站,需要结合着selenium进行js的渲染,才能获取到动态加载的数据。

如何通过selenium请求url,而不再通过下载器Downloader去请求这个url?

方法:在request对象通过中间件的时候,在中间件内部开始使用selenium去请求url,并且会得到url对应的源码,然后再将   源 代码通过response对象返回,直接交给process_response()进行处理,再交给引擎。过程中相当于后续中间件的process_request()以及Downloader都跳过了。

相关的配置:

1、scrapy环境中安装selenium:pip install selenium

scrapy与selenium结合爬取数据(爬取动态网站)的示例代码

2、确保python环境中有phantomJS(无头浏览器)

scrapy与selenium结合爬取数据(爬取动态网站)的示例代码

对于selenium的主要操作是下载中间件部分如下图:

scrapy与selenium结合爬取数据(爬取动态网站)的示例代码

scrapy与selenium结合爬取数据(爬取动态网站)的示例代码

代码如下

middlewares.py代码:

注意:自定义下载中间件,采用selenium的方式!!

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from scrapy.http import HtmlResponse, Response
import time

class TaobaospiderSpiderMiddleware(object):
 # Not all methods need to be defined. If a method is not defined,
 # scrapy acts as if the spider middleware does not modify the
 # passed objects.

 @classmethod
 def from_crawler(cls, crawler):
  # This method is used by Scrapy to create your spiders.
  s = cls()
  crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
  return s

 def process_spider_input(self, response, spider):
  # Called for each response that goes through the spider
  # middleware and into the spider.

  # Should return None or raise an exception.
  return None

 def process_spider_output(self, response, result, spider):
  # Called with the results returned from the Spider, after
  # it has processed the response.

  # Must return an iterable of Request, dict or Item objects.
  for i in result:
   yield i

 def process_spider_exception(self, response, exception, spider):
  # Called when a spider or process_spider_input() method
  # (from other spider middleware) raises an exception.

  # Should return either None or an iterable of Response, dict
  # or Item objects.
  pass

 def process_start_requests(self, start_requests, spider):
  # Called with the start requests of the spider, and works
  # similarly to the process_spider_output() method, except
  # that it doesn't have a response associated.

  # Must return>
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from selenium.webdriver import FirefoxOptions


class TaobaoSpider(scrapy.Spider):
 """
 scrapy框架只能爬取静态网站。如需爬取动态网站,需要结合着selenium进行js的渲染,才能获取到动态加载的数据。

 如何通过selenium请求url,而不再通过下载器Downloader去请求这个url?
 方法:在request对象通过中间件的时候,在中间件内部开始使用selenium去请求url,并且会得到url对应的源码,然后再将源代码通过response对象返回,直接交给process_response()进行处理,再交给引擎。过程中相当于后续中间件的process_request()以及Downloader都跳过了。

 """
 name = 'taobao'
 allowed_domains = ['taobao.com']
 start_urls = ['https://s.taobao.com/search?q=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170306']
 
 def __init__(self):
  # 在初始化淘宝对象时,创建driver
  super(TaobaoSpider, self).__init__(name='taobao')
  option = FirefoxOptions()
  option.headless = True
  self.driver = webdriver.Firefox(options=option)

 def parse(self, response):
  """
  提取列表页的商品标题和价格
  :param response:
  :return:
  """
  info_divs = response.xpath('//div[@class="info-cont"]')
  print(len(info_divs))
  for div in info_divs:
   title = div.xpath('.//a[@class="product-title"]/@title').extract_first('')
   price = div.xpath('.//span[contains(@class, "g_price")]/strong/text()').extract_first('')
   print(title, price)

scrapy与selenium结合爬取数据(爬取动态网站)的示例代码

扫一扫手机访问