1. The idea behind building an IP proxy pool:
When writing a crawler, having your IP banned for requesting too frequently is hard to avoid, and a single local IP is not enough for large-scale crawling. If you would rather not pay for commercial proxies, building your own IP proxy pool is well worth it. The approach is as follows:
Figure 1
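Figure 1 sketches the overall workflow, which the rest of this post walks through step by step. As a rough outline in code (the stub function names below are hypothetical, just to summarize the three stages):

# Outline of the proxy-pool workflow; each stub stands in for one of the scripts shown below
def crawl_free_proxies():
    """Step 1: scrape candidate ip:port pairs from a free-proxy site into ip.txt."""

def validate_proxies():
    """Step 2: keep only proxies that answer a test request within 2 s, saved to quality_ip.txt."""

def use_proxy_pool():
    """Step 3: read quality_ip.txt and pick a random proxy, switching proxies on failures."""

if __name__ == '__main__':
    crawl_free_proxies()
    validate_proxies()
    use_proxy_pool()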
2. Steps to build the IP proxy pool:
- Crawl proxy IPs: find a proxy-IP website and collect its free proxies; the code is as follows:
# _*_ coding:UTF-8 _*_
# Author: Jason Zhang
# Created: 2020/12/29 17:58
# File: 爬取代理IP.PY
# IDE: PyCharm
import requests
import lxml.html

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url_list = ['http://www.xicidaili.com/nn/%d' % i for i in range(1, 10)]
ip_list = []
for url in url_list:
    r = requests.get(url, headers=headers)
    etree = lxml.html.fromstring(r.text)
    rows = etree.xpath('//tr[@class="odd"]')
    for row in rows:
        # the first two cells of each row hold the proxy's IP and port
        cells = row.xpath('./td/text()')
        ip_list.append(cells[0] + ':' + cells[1])

# save the candidate proxies as a comma-separated list
f = open('ip.txt', 'wb')
f.write(','.join(ip_list).encode('utf-8'))
f.close()
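As a quick sanity check (not part of the original script), ip.txt can be read back to see how many candidate proxies were collected; it is simply a comma-separated list of ip:port strings:

# quick check of the crawl output; assumes ip.txt was written by the script above
ips = open('ip.txt').read().split(',')
print('collected %d candidate proxies' % len(ips))
print(ips[:5])  # peek at the first few ip:port entries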
- Verify the proxy IPs:
Check each proxy's availability and speed by making a real request through it: read the crawled addresses back from ip.txt, use each one to request a site's home page, keep only the IPs that respond within 2 seconds, and save those to quality_ip.txt. The code is as follows:
# _*_ coding:UTF-8 _*_
# Author: 关中老玉米
# Created: 2020/12/29 18:27
# File: 验证代理IP.PY
# IDE: PyCharm
import requests

ip_list = open('ip.txt').read().split(',')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
qurl = 'https://www.baidu.com'  # use Baidu to test whether the proxy can reach the web
quality_ip = []
for ip in ip_list:
    # timeout is set to 2 s; a proxy that times out or raises an error is treated as unusable
    try:
        r = requests.get(qurl, proxies={'http': 'http://' + ip}, headers=headers, timeout=2)
        if r.status_code == 200:
            quality_ip.append(ip)
    except requests.exceptions.RequestException:
        continue

f = open('quality_ip.txt', 'wb')
f.write(','.join(quality_ip).encode('utf-8'))
f.close()
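The script above treats any proxy that does not answer within the 2-second timeout as unusable. If you also want to check the measured response time explicitly, requests exposes it on the response object; a minimal helper along those lines (my own sketch with a hypothetical function name, not from the original post):

import requests

def response_time_ok(proxy, test_url='https://www.baidu.com', limit=2.0):
    """Return True if the proxy answers test_url successfully within `limit` seconds."""
    try:
        r = requests.get(test_url, proxies={'http': 'http://' + proxy}, timeout=limit)
        # r.elapsed measures the time from sending the request to receiving the response headers
        return r.status_code == 200 and r.elapsed.total_seconds() <= limit
    except requests.exceptions.RequestException:
        return False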
- Use the proxy IPs:
Once the pool is built, there are two ways to use the proxies.
# _*_ coding:UTF-8 _*_
# Author: Jason Zhang
# Created: 2020/12/31 18:03
# File: 使用代理IP.PY
# IDE: PyCharm

# (1) Use a random proxy from the pool:
import random
import requests

ip_list = open('quality_ip.txt').read().split(',')
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'http://*********'
r = requests.get(url, proxies={'http': 'http://' + random.choice(ip_list)}, headers=headers)

# (2) Free proxies are short-lived and often fail during later crawls, so switch to
# another IP whenever a request fails (response code is not 200):
ip_list = open('quality_ip.txt').read().split(',')
# url_list is assumed to hold the target URLs to crawl (not shown in the original snippet)
for ip in ip_list:
    for i in range(len(url_list)):
        r = requests.get(url_list[i], proxies={'http': 'http://' + ip}, headers=headers)
        if r.status_code != 200:
            break  # this proxy failed; move on to the next one in the pool
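The two patterns above can also be combined into one helper that picks a random proxy and falls back to a different one when a request fails. The sketch below is my own combination of the two snippets, with a hypothetical function name, not code from the original post:

import random
import requests

def get_with_proxy_pool(url, ip_list, headers=None, max_tries=5):
    """Fetch url through a random proxy from the pool, retrying with a different
    proxy on an exception or a non-200 response. Returns the response or None."""
    tried = []
    for _ in range(min(max_tries, len(ip_list))):
        # choose among proxies that have not been tried yet
        ip = random.choice([p for p in ip_list if p not in tried])
        tried.append(ip)
        try:
            r = requests.get(url, proxies={'http': 'http://' + ip}, headers=headers, timeout=5)
            if r.status_code == 200:
                return r
        except requests.exceptions.RequestException:
            continue
    return None

# example usage:
#   ip_list = open('quality_ip.txt').read().split(',')
#   r = get_with_proxy_pool('http://example.com', ip_list)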