Preparation
First, import the libraries we will need.

    from bs4 import BeautifulSoup
    import re
    import urllib.request, urllib.error

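If a library is missing, run pip install followed by the package name on the command line. Of the three imports, only bs4 is third-party (pip install beautifulsoup4); re and urllib ship with Python.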
Code walkthrough
1. Fetching the web page
Store the URL of the page to scrape. We use Yinchuan as the example.

    baseUrl = "http://www.nmc.cn/publish/forecast/ANX/yinchuan.html"

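Pass the URL to the function below. It fetches the page source and returns it as a string (an empty string if the request fails).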

    def askURL(url):
        # Send a browser-like User-Agent so the request is not rejected as a crawler
        head = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
        }
        request = urllib.request.Request(url, headers=head)
        html = ""
        try:
            response = urllib.request.urlopen(request)
            html = response.read().decode("utf-8")
        except urllib.error.URLError as e:
            # HTTP errors carry a status code; connection errors carry only a reason
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        return html

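The two hasattr checks in askURL exist because urllib raises two related exception types: HTTPError (a subclass of URLError) has both a code and a reason, while a plain URLError has only a reason. The sketch below builds both exceptions by hand, without any network access, to show which attributes each one has. The helper name describe_error and the example error values are assumptions made for illustration.

```python
import io
import urllib.error
from email.message import Message

def describe_error(e):
    """Collect the same attributes that askURL prints, as one string."""
    parts = []
    if hasattr(e, "code"):
        parts.append(f"code={e.code}")
    if hasattr(e, "reason"):
        parts.append(f"reason={e.reason}")
    return ", ".join(parts)

# Hand-built exceptions (illustrative values, no request is sent)
http_err = urllib.error.HTTPError(
    "http://example.invalid/", 404, "Not Found", Message(), io.BytesIO()
)
url_err = urllib.error.URLError("connection refused")

print(describe_error(http_err))
print(describe_error(url_err))
```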
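To obtain the User-Agent used in askURL, open the page you want to scrape in a browser, press F12, and select the Network tab. Then press F5 to refresh the page so that the list of requests appears.
Click a request in the list, find User-Agent among its request headers, and copy the value into the code.
Without this header, which makes the request look like it comes from a browser, the site will often identify the client as a crawler and refuse access.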
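To see what the header changes, the following minimal sketch builds two Request objects, one with and one without a User-Agent, and inspects them without sending anything. When no User-Agent is set on the request, urllib supplies its own default (Python-urllib/x.y) at send time, which is easy for a site to recognize. The variable names plain and disguised are illustrative.

```python
import urllib.request

URL = "http://www.nmc.cn/publish/forecast/ANX/yinchuan.html"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
)

# Building a Request does not open a connection
plain = urllib.request.Request(URL)
disguised = urllib.request.Request(URL, headers={"User-Agent": USER_AGENT})

# Request stores header names capitalized as "User-agent"
print(plain.has_header("User-agent"))
print(disguised.get_header("User-agent"))
```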
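2. Parsing the data
The weather data we want is concentrated in one block of the page.
The corresponding HTML tag is <div class="weatherWrap">, so parsing starts by locating the content under this tag with the BeautifulSoup library imported earlier.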

    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all('div', class_="weatherWrap"):
        item = str(item)  # each matching block is processed as a string

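As a quick offline check of how find_all selects blocks by class, here is a small sketch run against a hand-written HTML snippet. The snippet is an assumption made for illustration and is much simpler than the real page; it requires the beautifulsoup4 package.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (not copied from the real site)
sample_html = """
<div id="forecast">
  <div class="weatherWrap"><div class="date">06/01</div></div>
  <div class="weatherWrap"><div class="date">06/02</div></div>
  <div class="other">ignored</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with trailing underscore) filters on the HTML class attribute
blocks = soup.find_all("div", class_="weatherWrap")
dates = [b.find("div", class_="date").get_text(strip=True) for b in blocks]

print(len(blocks), dates)
```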
Next, extract the individual fields we need. Regular expressions are the usual tool for this.

    findDate = re.compile(r'<div class="date"> (.*?) <br/>')
    findWeek = re.compile(r'<br/>(.*?) </div><div class="weathericon">')
    findImage = re.compile(r'<div class="weathericon"><img src="(.*?)"/></div>')
    findweather = re.compile(r'<div class="desc"> (.*?) </div><div class="windd">')
    findWindDir = re.compile(r'<div class="windd"> (.*?) </div><div class="winds">')
    findWind = re.compile(r'<div class="windd"> .*? </div><div class="winds"> (.*?) </div>')
    findTemperature = re.compile(r'<div class="tmp tmp_lte_.*?"> (.*?) </div>')

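Every pattern above captures with the lazy group (.*?) rather than the greedy (.*). The sketch below illustrates why that matters when one string holds several days: the lazy form stops at the first closing marker, while the greedy form runs to the last one. The input string is a hand-written assumption shaped to fit the date pattern, not real page source.

```python
import re

# Two hypothetical date blocks on a single line
text = (
    '<div class="date"> 06/01 <br/>Monday </div>'
    '<div class="date"> 06/02 <br/>Tuesday </div>'
)

lazy = re.compile(r'<div class="date"> (.*?) <br/>')
greedy = re.compile(r'<div class="date"> (.*) <br/>')

lazy_matches = lazy.findall(text)      # one short capture per block
greedy_matches = greedy.findall(text)  # a single capture spanning both blocks

print(lazy_matches)
print(greedy_matches)
```

Note that the patterns also depend on the exact spaces around each value, so a small change in the site's markup can make them stop matching.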
After extraction, we collect the data in a list and return it.

    def getData(baseurl):
        datalist = []
        html = askURL(baseurl)
        soup = BeautifulSoup(html, "html.parser")
        # Each weatherWrap block holds the forecast for one day
        for item in soup.find_all('div', class_="weatherWrap"):
            data = []
            item = str(item)
            # Date and weekday: keep only the first match
            date = re.findall(findDate, item)[0]
            data.append(date)

            week = re.findall(findWeek, item)[0]
            data.append(week)

            # Remaining fields: keep the full list of matches
            image = re.findall(findImage, item)
            data.append(image)

            weather = re.findall(findweather, item)
            data.append(weather)

            windDir = re.findall(findWindDir, item)
            data.append(windDir)

            wind = re.findall(findWind, item)
            data.append(wind)

            temperature = re.findall(findTemperature, item)
            data.append(temperature)

            datalist.append(data)
        return datalist

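One caveat about getData: re.findall returns an empty list when nothing matches, so indexing with [0] raises IndexError if the page layout changes. A possible guard, sketched here with a hypothetical helper named first_match, returns a default value instead. The two input strings are made up for illustration.

```python
import re

findDate = re.compile(r'<div class="date"> (.*?) <br/>')

def first_match(pattern, text, default=""):
    """Return the first captured value, or default when nothing matches."""
    matches = pattern.findall(text)
    return matches[0] if matches else default

# Hypothetical inputs: one matches the pattern, one does not
good = '<div class="date"> 06/01 <br/>Monday </div>'
changed = '<span class="date">06/01</span>'

found = first_match(findDate, good)
missing = first_match(findDate, changed, default="n/a")

print(found, missing)
```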
At this point the data we wanted has been scraped. Write it to a file to verify that it is correct.
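The original does not show how the file is written, so the following is only one possible approach: a minimal sketch that writes the list to CSV using the standard library. The function name save_data, the column names, the output file name, and the sample row are all assumptions, and the sample row is not real forecast data.

```python
import csv
import os
import tempfile

HEADER = ["date", "week", "image", "weather", "wind_dir", "wind", "temperature"]

def save_data(datalist, path):
    """Write one CSV row per forecast day."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        for row in datalist:
            # Fields produced by re.findall are lists; join them into plain text
            writer.writerow(
                [", ".join(v) if isinstance(v, list) else v for v in row]
            )

# Hypothetical sample in the shape getData returns
sample = [["06/01", "Monday", ["day.png"], ["Sunny"], ["North"], ["Light"], ["28"]]]

out_path = os.path.join(tempfile.gettempdir(), "weather_check.csv")
save_data(sample, out_path)

# Read the file back to confirm what was written
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

With the real scraper, the call would be save_data(getData(baseUrl), out_path), after which the file can be opened in a spreadsheet or text editor for inspection.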