Crawling CNKI Data with the Scrapy Framework

Posted on 2021-03-25, updated on 2024-03-22, filed under Backend

Steps

Create a Scrapy project
Define the Item to extract
Write a spider that crawls the site and extracts Items
Write an Item Pipeline to store the extracted Items (i.e. the data)

Install Scrapy

pip install scrapy

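Create a project

Create a project: scrapy startproject xxx
Enter the project: cd xxx  # change into the project directory
Create a spider: scrapy genspider xxx (spider name) xxx.com (domain to crawl)
Export results to a file: scrapy crawl xxx -o xxx.json (the file extension selects the output format)
Run a spider: scrapy crawl xxx
List all spiders: scrapy list
Show configuration values: scrapy settings [options]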
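The commands above can also be driven from a Python script. The sketch below only builds the argument lists and the directory each command should run in; the project name `spider` is inferred from the pipeline path used later in this post, while `cnki.json` and the helper names `build_steps` and `run_steps` are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_steps(project, spider, domain, output):
    """Return (argv, working directory) pairs in the order they should run."""
    project_dir = Path(project)
    return [
        # startproject runs in the current directory and creates <project>/
        (["scrapy", "startproject", project], Path(".")),
        # genspider and crawl must run inside the project directory
        (["scrapy", "genspider", spider, domain], project_dir),
        (["scrapy", "crawl", spider, "-o", output], project_dir),
    ]


def run_steps(steps):
    """Run each command, stopping at the first one that fails."""
    for argv, cwd in steps:
        subprocess.run(argv, cwd=cwd, check=True)


# Hypothetical names: project "spider", spider "cnki", output "cnki.json"
steps = build_steps("spider", "cnki", "cnki.com.cn", "cnki.json")
for argv, cwd in steps:
    print(cwd, " ".join(argv))
```

Calling `run_steps(steps)` would actually execute the commands, which requires Scrapy to be installed and on the PATH.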

Directory structure

Global configuration (settings)

# Commented out by default. This setting matters a lot: without it the crawler is easily identified as a bot. A simple 'Mozilla/5.0' is enough
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
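# Whether to obey robots.txt. The generated settings.py sets this to True; it is set to False here because otherwise many pages cannot be crawled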
ROBOTSTXT_OBEY = False
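# Maximum number of concurrent requests, i.e. how many requests Scrapy's downloader may have in flight at once (Scrapy's built-in default is 16)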
CONCURRENT_REQUESTS = 32
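# Download delay in seconds; controls how often the crawler hits the site. Tune it per project, neither too fast nor too slow. The value here is 3, i.e. pause 3 seconds after each request (Scrapy's built-in default is 0); 1 second is a good trade-off, and a fraction of a second is fine when there are many files to fetch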
DOWNLOAD_DELAY = 3
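# Whether to enable cookies. Scrapy enables them by default and this line turns them off; when enabled, cookies set during the crawl are kept and sent with later requests, which is very handy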
COOKIES_ENABLED = False
# Default request headers. The USER_AGENT above is itself sent as a request header; adjust these headers to suit what you are crawling
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
# Item pipelines. 300 is the order value: pipelines with lower values run first
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300
}
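To make the ordering rule concrete, here is a minimal sketch that sorts an ITEM_PIPELINES-style dict the way the comment describes, lowest value first. Only `spider.pipelines.SpiderPipeline` comes from this post; the other two pipeline paths and the helper `pipeline_run_order` are hypothetical and exist only to show the ordering.

```python
# Hypothetical settings: only SpiderPipeline appears in this post
ITEM_PIPELINES = {
    "spider.pipelines.SpiderPipeline": 300,
    "spider.pipelines.DedupPipeline": 100,  # hypothetical
    "spider.pipelines.MySQLPipeline": 800,  # hypothetical
}


def pipeline_run_order(pipelines):
    """Return pipeline paths sorted by their order value, lowest first."""
    return [path for path, order in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_run_order(ITEM_PIPELINES):
    print(path)
```

Each item passes through the pipelines in this sequence, so a filtering step would normally get a lower number than a storage step.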

Define the data structure (items)

import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Declare one field for each piece of data you want to extract
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Source
    source = scrapy.Field()
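As an alternative to `scrapy.Item`, Scrapy 2.2 and later also accept dataclass objects as items. The sketch below declares the same three fields with the standard library only; the class name `ArticleItem` and the sample values are assumptions made for illustration, not part of the original project.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ArticleItem:
    """Same fields as SpiderItem, declared as a dataclass (hypothetical name)."""
    title: str
    author: Optional[str] = None
    source: Optional[str] = None


# Sample values, not real crawl results
item = ArticleItem(title="Sample title", author="Sample author", source="Sample journal")
print(asdict(item))

# Like scrapy.Item, a dataclass rejects fields that were never declared
try:
    ArticleItem(title="Sample title", year=2021)
except TypeError as exc:
    print("rejected:", exc)
```

Either style works with pipelines and exporters; the dataclass form adds type hints and default values.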

Main spider program

Using MySQLdb to work with the database

pip install mysqlclient

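If the installation fails, install offline instead. PyPI: https://pypi.org/project/mysqlclient/#files

Pick the file that matches your Python version, download the .whl file, and install it offline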

pip install mysqlclient-2.0.3-cp37-cp37m-win_amd64.whl

Example: crawl article titles, authors, affiliations, and sources from CNKI and save the results to a MySQL database

import MySQLdb
import scrapy


class CnkiSpider(scrapy.Spider):
    name = 'cnki'  # spider name
    allowed_domains = ['cnki.com.cn']  # domains the spider is allowed to crawl
    start_urls = ['http://search.cnki.com.cn/Search/ListResult']

    def start_requests(self):
        # POST the search form once per result page (pages 2 through 9);
        # "计算机" ("computer") is the search keyword
        for num in range(2, 10):
            form_data = {"searchType": "MulityTermsSearch", "ParamIsNullOrEmpty": "false", "Islegal": "false",
                         "Content": "计算机", "Page": str(num)}
            for url in self.start_urls:
                yield scrapy.FormRequest(url=url, formdata=form_data, method='POST', callback=self.parse)

    def insert_all(self, db, cursor, sql, values):
        # Insert each value with a parameterized query so quotes in the data cannot break the SQL
        for value in values:
            try:
                # Execute the SQL statement
                cursor.execute(sql, (value,))
                # Commit the change to the database
                db.commit()
            except MySQLdb.Error:
                # Roll back if the insert fails
                db.rollback()

    def parse(self, response):
        # Open the database connection
        db = MySQLdb.connect(host="localhost", user="root", passwd="123456", db="cnki", charset='utf8', port=3306)
        # Get a cursor for executing statements
        cursor = db.cursor()
        try:
            title = response.xpath('//div[@class="list-item"]/p[@class="tit clearfix"]//a[1]/@title').extract()
            self.insert_all(db, cursor, "INSERT INTO article(title) VALUES (%s)", title)
            author = response.xpath('//div[@class="list-item"]/p[@class="source"]/span[1]/@title').extract()
            # Keep only the first author of each semicolon-separated list
            self.insert_all(db, cursor, "INSERT INTO author(`name`) VALUES (%s)",
                            [item.split(";")[0] for item in author])
            affiliated = response.xpath('//div[@class="list-item"]/p[@class="source"]/span[3]/@title').extract()
            self.insert_all(db, cursor, "INSERT INTO affiliated(`name`) VALUES (%s)", affiliated)
            source = response.xpath('//div[@class="list-item"]/p[@class="source"]/a[1]/span/@title').extract()
            self.insert_all(db, cursor, "INSERT INTO source(`name`) VALUES (%s)", source)
        finally:
            # Close the database connection
            db.close()
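Titles scraped from the web can contain quotes, so it helps to pass values as query parameters rather than formatting them into the SQL string. The sketch below shows the pattern with the standard library's `sqlite3` standing in for MySQL so that it runs without a database server. The table schema, the sample titles, and the helper `save_titles` are assumptions; with MySQLdb the placeholder is `%s` instead of `?`, as in `cursor.execute("INSERT INTO article(title) VALUES (%s)", (title,))`.

```python
import sqlite3


def save_titles(conn, titles):
    """Insert each title with a bound parameter; return how many rows were stored."""
    saved = 0
    for title in titles:
        try:
            # The driver escapes the value, so quotes in the title are harmless
            conn.execute("INSERT INTO article(title) VALUES (?)", (title,))
            conn.commit()
            saved += 1
        except sqlite3.Error:
            # Undo the failed insert and continue with the next title
            conn.rollback()
    return saved


# In-memory database with an assumed schema for the article table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

# Sample titles, one containing single quotes
titles = ["Computer Networks", "A Study of 'Smart' Campus Systems"]
saved = save_titles(conn, titles)
stored = [row[0] for row in conn.execute("SELECT title FROM article ORDER BY id")]
print(saved, stored)
conn.close()
```

Catching the driver's own error class rather than using a bare `except` also keeps keyboard interrupts and programming mistakes from being silently swallowed.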

Data processing (pipelines)

Example: exporting the crawled data to a CSV file, which can be opened in Excel

from scrapy.exporters import CsvItemExporter


class SpiderPipeline:
    # Write each Item to the output file
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    # Create the file, initialize the exporter, and call start_exporting() so it can accept Items
    def open_spider(self, spider):
        # CsvItemExporter expects a file opened in binary mode.
        # The leading "/" places the file in the filesystem root (the drive root on Windows)
        self.file = open("/cnki_data.csv", "wb")
        # To export only the title, pass fields_to_export=["title"] instead
        self.exporter = CsvItemExporter(self.file,
                                        fields_to_export=["title", "author", "source"])
        self.exporter.start_exporting()

    # Finish exporting and close the file
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
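A pipeline only receives what the spider yields, and the spider shown earlier writes to MySQL directly without yielding anything. One possible way to feed the pipeline is to pair up the extracted lists and yield one dict per article, since Scrapy accepts plain dicts as items. The helper `build_rows` and the sample values below are assumptions for illustration; the sketch also assumes the three lists line up one-to-one, which may not hold if a result is missing a field.

```python
def build_rows(titles, authors, sources):
    """Pair up the parallel lists extracted from one result page."""
    rows = []
    # zip stops at the shortest list, so incomplete trailing entries are dropped
    for title, author, source in zip(titles, authors, sources):
        rows.append({
            "title": title.strip(),
            # Keep only the first author, as the MySQL example does
            "author": author.split(";")[0].strip(),
            "source": source.strip(),
        })
    return rows


# Sample values standing in for the lists returned by response.xpath(...).extract()
titles = ["Sample title A", "Sample title B "]
authors = ["Author One;Author Two", "Author Three"]
sources = ["Journal X", "Journal Y"]

for row in build_rows(titles, authors, sources):
    print(row)

# Inside CnkiSpider.parse the rows could be handed to the pipeline with:
#     for row in build_rows(title, author, source):
#         yield row
```

Selecting each `list-item` node first and extracting its fields relative to that node would be more robust than zipping page-wide lists, at the cost of slightly more XPath.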