Scraping EF Daily English Resources with Python

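As programmers, we all know how important English is, so I won't belabor the point. Today I'd like to share an example of using Python to fetch English learning resources. The target site is EF (English First). (This is for learning purposes only; please do not use it commercially. If it infringes on your rights, contact me and I will delete this article.)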

Website: http://center.ef.com.cn/blog/
Requirements:

Python 3.x
urllib (standard library)
requests
bs4
json (standard library)

[h1]Target Site Analysis[/h1]

Root URL: http://center.ef.com.cn/blog/lesson
1: http://center.ef.com.cn/blog/lesson?lesson_id=457&view=video
2: http://center.ef.com.cn/blog/lesson?lesson_id=458&view=video
3: http://center.ef.com.cn/blog/lesson?lesson_id=459&view=video
...
32: http://center.ef.com.cn/blog/lesson?lesson_id=488&view=video
MP4 root URL: https://www.englishtown.cn
MP3 root URL: https://cns.ef-cdn.com/_snds/wiktionary/sentences/
The endpoint that actually returns the lesson data: https://www.englishtown.cn/community/dailylesson/lessonhandler.ashx?operate=preloaddata&teachculturecode=en&ss=EE&v=4&lesson_id=457&transculturecode=zh-cn
Lesson name: https://www.englishtown.cn/community/dailylesson/lessonhandler.ashx?operate=getlessonbyid&v=4&lesson_id=459&transculturecode=zh-cn
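
The two endpoints above share a base address and differ only in their query parameters, so both can be derived from a lesson ID. Below is a minimal sketch, assuming the parameters shown in the URLs above are all that the server needs; `build_lesson_urls` and `HANDLER` are hypothetical names used for illustration and do not appear in the original script.

```python
from urllib.parse import urlencode

# Base address shared by both endpoints listed above.
HANDLER = "https://www.englishtown.cn/community/dailylesson/lessonhandler.ashx"


def build_lesson_urls(lesson_id):
    """Return (data_url, name_url) for one lesson, mirroring the URLs listed above."""
    data_query = urlencode({
        "operate": "preloaddata",
        "teachculturecode": "en",
        "ss": "EE",
        "v": 4,
        "lesson_id": lesson_id,
        "transculturecode": "zh-cn",
    })
    name_query = urlencode({
        "operate": "getlessonbyid",
        "v": 4,
        "lesson_id": lesson_id,
        "transculturecode": "zh-cn",
    })
    return HANDLER + "?" + data_query, HANDLER + "?" + name_query


data_url, name_url = build_lesson_urls(457)
print(data_url)
print(name_url)
```

Building the query with `urlencode` keeps the parameters in one place, which makes it easier to change a single value later than editing a long string by hand.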

[h1]Implementation[/h1]

from bs4 import BeautifulSoup
import urllib.request
import requests
import json
import os
import time


class EF():
    ''' Main class for the whole EF project '''

    def __init__(self):
        self.baseUrl = "https://www.englishtown.cn/community/dailylesson/lessonhandler.ashx?operate=preloaddata&teachculturecode=en&ss=EE&v=4&"
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
        self.name_url = "https://www.englishtown.cn/community/dailylesson/lessonhandler.ashx?operate=getlessonbyid&v=4&"

    def getRequsetContent(self, url):
        ''' Fetch a URL and return the decoded body, or None if the request fails '''
        try:
            req = urllib.request.Request(url, headers=self.header)
            response = urllib.request.urlopen(req, timeout=10)
        except Exception:
            print("Failed to load page")
        else:
            return response.read().decode('UTF-8')

    def spyder(self, url, name_url):
        html = self.getRequsetContent(url)
        html4name = self.getRequsetContent(name_url)
        if html is None or html4name is None:
            return  # skip this lesson if either request failed
        data_dict = json.loads(html)
        data_dict_name = json.loads(html4name)
        lesson = data_dict_name['Lesson']['LessonNameWithPerfix']
        # Characters not allowed in file names; '<', '>' and '?' are included
        # because Windows rejects them as well
        not_allow = ['/', '\\', ':', '*', '?', "'", '"', '<', '>', '|', '？', '\r', '\n']
        lesson_name = lesson.split("-")[1].strip()
        for char in not_allow:
            lesson_name = lesson_name.replace(char, '_')
        # Create the lesson folder
        if not os.path.exists(lesson_name):
            os.mkdir(lesson_name)
        slides = data_dict['Slides'][0]
        localizedSlides = slides['LocalizedSlides']
        en = localizedSlides['en']
        mediaSource = en['MediaSource']
        # MP4 URL
        mp4_url = "https://www.englishtown.cn" + mediaSource
        dialogue = en['Dialogue']
        sentences = dialogue['Sentences']
        en_list = []   # English sentences
        cn_list = []   # Chinese translations
        mp3_list = []  # MP3 file names
        for sentence in sentences:
            text = sentence['Sentence']['Text']
            mp3 = sentence['Sentence']['SentenceAudioSrc']
            trans = sentence['Trans']['zh-cn']['Text']
            en_list.append(text)
            mp3_list.append(mp3)
            cn_list.append(trans)
        for en_text, cn_text, mp3 in zip(en_list, cn_list, mp3_list):
            print("English:{}, Chinese:{}, MP3:{}".format(en_text, cn_text, mp3))
            # Record the sentence pair and the local MP3 file name
            with open(os.path.join(lesson_name, "sentences.txt"), 'a', encoding='utf-8') as f:
                f.write("English:{}, Chinese:{}, MP3:{}".format(en_text, cn_text, cn_text + ".mp3"))
                f.write("\n")
            mp3_url = "https://cns.ef-cdn.com/_snds/wiktionary/sentences/" + mp3
            self.DL(mp3_url, lesson_name, cn_text)
            time.sleep(0.5)
        self.DL(mp4_url, lesson_name, "")

    def DL(self, url, fd_name, mp3_name):
        ''' Download url into folder fd_name; an empty mp3_name means the lesson MP4 '''
        res = requests.get(url, headers=self.header, timeout=10)
        if mp3_name == "":
            fn = fd_name + ".mp4"
        else:
            fn = mp3_name + ".mp3"
        with open(os.path.join(fd_name, fn), 'wb') as f:
            f.write(res.content)


if __name__ == "__main__":
    ef = EF()
    for i in range(457, 489):
        url = ef.baseUrl + "lesson_id={}&transculturecode=zh-cn".format(i)
        name_url = ef.name_url + "lesson_id={}&transculturecode=zh-cn".format(i)
        ef.spyder(url, name_url)
        time.sleep(1)

Notes

The URL for the lesson name differs from the URL for the sentences, so each lesson requires two requests.

When saving files, make sure the file name contains no special characters.
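
One way to handle this is to strip the problem characters in a single pass with a regular expression. The sketch below is an illustration rather than the original approach: `safe_filename` is a hypothetical helper, and the character set assumes Windows naming rules (the strictest of the common platforms).

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, replacement="_"):
    """Replace characters that are invalid in file names and trim the result."""
    cleaned = _ILLEGAL.sub(replacement, name)
    # Windows also rejects names that end with a dot or a space.
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "untitled"


print(safe_filename("What time is it?"))
print(safe_filename('A/B: "test"\n'))
```

The same helper could be applied both to the lesson folder name and to the translated sentence used as the MP3 file name, since either may contain punctuation.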

The hard part of this project is parsing the JSON, which is nested fairly deeply, but overall it is still easy to follow.
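
To make the nesting easier to see, the lookup path can be isolated in one small function. This is a sketch only: the key names are taken from the script above, `extract_sentences` is a hypothetical helper, and the `sample` dictionary is made-up data that merely imitates the shape of the response.

```python
def extract_sentences(data):
    """Return (mp4_path, rows) where rows is a list of (english, chinese, mp3_name)."""
    # Slides -> first slide -> LocalizedSlides -> en holds everything we need.
    en_slide = data["Slides"][0]["LocalizedSlides"]["en"]
    rows = []
    for item in en_slide["Dialogue"]["Sentences"]:
        rows.append((
            item["Sentence"]["Text"],              # English text
            item["Trans"]["zh-cn"]["Text"],        # Chinese translation
            item["Sentence"]["SentenceAudioSrc"],  # MP3 file name
        ))
    return en_slide["MediaSource"], rows


# Made-up sample that only imitates the structure of the real response.
sample = {
    "Slides": [{
        "LocalizedSlides": {
            "en": {
                "MediaSource": "/media/sample.mp4",
                "Dialogue": {
                    "Sentences": [{
                        "Sentence": {"Text": "Hello.", "SentenceAudioSrc": "hello.mp3"},
                        "Trans": {"zh-cn": {"Text": "你好。"}},
                    }],
                },
            },
        },
    }],
}

mp4_path, rows = extract_sentences(sample)
print(mp4_path, rows)
```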

Room for improvement: this project does not use threads, so downloads are slow. Feel free to modify it yourself.
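
One possible way to add threads is a small pool from the standard library's `concurrent.futures`. The sketch below is an assumption about how that could look, not part of the original project: `download_all` is a hypothetical helper, and `fake_download` stands in for a real download function so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, download, max_workers=4):
    """Run download(url, path) for every (url, path) job in a thread pool.

    Returns (done, failed): the paths that succeeded, and (path, error) pairs.
    """
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its target path so results can be reported.
        futures = {pool.submit(download, url, path): path for url, path in jobs}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                done.append(path)
            except Exception as exc:
                failed.append((path, exc))
    return done, failed


def fake_download(url, path):
    """Stand-in for a real download; fails for URLs ending in 'bad'."""
    if url.endswith("bad"):
        raise IOError("simulated failure")


jobs = [
    ("https://example.invalid/a", "a.mp3"),
    ("https://example.invalid/bad", "b.mp3"),
]
done, failed = download_all(jobs, fake_download)
print(sorted(done), [path for path, _ in failed])
```

In the real script, `download` could wrap `requests.get` and write `res.content` to `path`. Keeping `max_workers` small is a reasonable courtesy to the server.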

The URLs could also be improved; here I only scraped the content under level 4.
