
Crawler

00 Preface

I've wanted to learn web crawling for a long time but never found the time (honestly, I was just lazy 😂). Now that my school happens to offer this course, I'm using it to learn and have some fun.

The post will be extended each Wednesday during the Python class.

Updating…

01 Environment Setup

1.1 Create the crawler environment

conda create -n crawler python=3.8

1.2 Install the requests library

conda install requests

1.3 Select the crawler environment in PyCharm

02 Data Scraping

2.1 What is a crawler

A web crawler (also called a web spider 🕷) is a script or program that browses pages on the World Wide Web and extracts information from them according to certain rules.

2.2 Data scraping in practice

2.2.1 Sending a request

import requests  # import the requests library

url = 'https://blog.bxhong.cn'  # the address we want to fetch
data = requests.get(url)  # send a GET request to url; the server's response is returned as data
print(data.text)  # print the response body

Output (partial):

<!DOCTYPE html><html lang="zh-CN" data-theme="light"><head><meta charset="UTF-8"><meta name="baidu-site-verification" content="niDCmPzZfD"><meta name="360-site-verification" content="08942686cb345cddddcd464b351cb704"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width,initial-scale=1"><title>Ash's blog</title><meta name="description" content="一切都会好的,城南的花都开了"><meta name="author" content="安河桥北以北"><meta name="copyright" content="安河桥北以北"><meta name="format-detection" content="telephone=no"><link rel="shortcut icon" href="/img/icon_32x32.png"><meta http-equiv="Cache-Control" content="no-transform"><meta http-equiv="Cache-Control" content="no-siteapp"><link rel="preconnect" href="//cdn.jsdelivr.net"/><link rel="preconnect" href="https://hm.baidu.com"/><link rel="preconnect" href="http://ta.qq.com"/><link rel="preconnect" href="https://fonts.googleapis.com" crossorigin="crossorigin"/><link rel="preconnect" href="//busuanzi.ibruce.info"/><meta name="twitter:card" content="summary"><meta name="twitter:title" content="Ash's blog"><meta name="twitter:description" content="一切都会好的,城南的花都开了"><meta name="twitter:image" content="https://gitee.com/bao_xian_hong/Album/raw/master/image/6.jpg"><meta property="og:type" content="website"><meta property="og:title" content="Ash's blog"><meta property="og:url" content="http://chenai007.github.io/"><meta property="og:site_name" content="Ash's blog"><meta property="og:description" content="一切都会好的,城南的花都开了"><meta property="og:image" content="https://gitee.com/bao_xian_hong/Album/raw/master/image/6.jpg"><script src="https://cdn.jsdelivr.net/npm/js-cookie/dist/js.cookie.min.js"></script><script>var autoChangeMode = 'false'
var t = Cookies.get("theme")
if (autoChangeMode == '1'){
var isDarkMode = window.matchMedia("(prefers-color-scheme: dark)").matches
var isLightMode = window.matchMedia("(prefers-color-scheme: light)").matches
var isNotSpecified = window.matchMedia("(prefers-color-scheme: no-preference)").matches
var hasNoSupport = !isDarkMode && !isLightMode && !isNotSpecified
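
`data.text` is only part of what a response carries; there is also a status code, headers, and an encoding. As a sketch of what a GET exchange looks like end to end, here is a self-contained example using only the standard library, with a throwaway local server standing in for the real site (the server, the `<title>` body, and all names below are illustrative):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    """A throwaway local server standing in for a real website."""
    def do_GET(self):
        body = b"<title>Ash's blog</title>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
with urlopen(url) as resp:       # same idea as requests.get(url)
    status = resp.status         # HTTP status code, e.g. 200
    body = resp.read().decode()  # response body, like data.text

server.shutdown()
print(status, body)
```

With requests, the equivalents are `data.status_code` and `data.text`; checking the status code before using the body is a good habit.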

2.2.2 Spoofing the User-Agent (UA)

First, without any UA spoofing, let's run a request like the one before:

import requests

url = 'http://httpbin.org/get'
data = requests.get(url)
print(data.text)

Output:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.24.0",
"X-Amzn-Trace-Id": "Root=1-5f6aaa30-c4ead8d3ca66c04f00ab7825"
},
"origin": "120.82.246.73",
"url": "http://httpbin.org/get"
}

We can see that the UA we sent was "User-Agent": "python-requests/2.24.0", which immediately identifies the request as a script. To disguise it, pass a browser-style User-Agent in the headers:
import requests

url = 'http://httpbin.org/get'
# header field names in headers are case-insensitive
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
data = requests.get(url, headers=headers)
print(data.text)

Output:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
"X-Amzn-Trace-Id": "Root=1-5f6aabcc-b6410e5738c939b735f2e584"
},
"origin": "120.82.246.73",
"url": "http://httpbin.org/get"
}

How do you find a real UA string?
Right-click the page, choose Inspect, open the Network tab, and pick any request (refresh the page if the list is empty); then click Headers and scroll down to find the User-Agent field.
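
Once you've collected a few UA strings this way, you can keep them in a list and pick one at random for each request. A minimal sketch (the strings and names below are illustrative examples, copied from real browsers):

```python
import random

# Illustrative UA strings copied from real browsers (refresh them over time)
USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.2 Safari/605.1.15",
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

Passing `headers=random_headers()` to `requests.get` makes repeated requests look less uniform than sending the same UA every time.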

2.2.3 Parsing web pages
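
(This section hasn't been covered in class yet. In the meantime, here is a minimal sketch of what parsing means: pulling the `<title>` out of an HTML string like the one we fetched above, using only the standard library's `html.parser`. In practice, libraries such as BeautifulSoup or lxml are the usual choices.)

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = "<html><head><title>Ash's blog</title></head><body></body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Ash's blog
```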

Author: 安河桥北以北
Link: http://chenai007.github.io/2020/09/16/python/crawler%E7%88%AC%E8%99%AB/
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit Ash's blog when reposting.
