Python Web Scraping: Crawling Fund Data from Tiantian Fund (天天基金网)

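North said, "Human society could develop only after it had invented the method of invention." That method is scientific thinking. Investing likewise succeeds only when you choose the right method and take the right path. I have recently been building a FOF (fund of funds) portfolio, and analyzing fund data is the first step. Comments and suggestions are welcome at the end of the post. Zorro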

Getting started

Step 1: Create the storage table

A local PostgreSQL database is used here. Every column is stored as text, exactly as the site returns it:
create table if not exists funds.tt_web_fund_list (
    code            text,
    name            text,
    unitnetworth    text,
    unitnet_day     text,
    dayofgrowth     text,
    recent1week     text,
    recent1month    text,
    recent3month    text,
    recent6month    text,
    recent1year     text,
    recent2year     text,
    recent3year     text,
    fromthisyear    text,
    frombuild       text,
    servicecharge   text,
    upenoughamount  text
);

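Step 2: Create the crawler project

Open a system command line (Windows PowerShell works), change to the directory where you keep your Python crawler projects, and create a new Scrapy project: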

scrapy startproject funds

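Open the project folder in PyCharm. Scrapy has generated scrapy.cfg plus a funds package containing items.py, middlewares.py, pipelines.py, settings.py and a spiders directory.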

Step 3: Analyze the site's API

The target is the equity-oriented fund (偏股型基金) page on Tiantian Fund.

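In Chrome, press F12 to open the developer tools and click Network. Go through the URL list on the left and check each entry's Response on the right until you find the request that returns the fund data. (The data is already structured, so it can be fetched directly from the API that the page calls via Ajax. If it were not, you would have to extract it yourself with XPath or regular expressions.)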

Click Headers; the Request URL is the address the crawler needs to request. The API address I found is: https://fundapi.eastmoney.com/fundtradenew.aspx?ft=pg&sc=1n&st=desc&pi=1&pn=100&cp=&ct=&cd=&ms=&fr=&plevel=&fst=&ftype=&fr1=&fl=0&isab=1
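The query string of this URL carries the paging and sorting options. Here is a small sketch that splits it into named parameters with the standard library. Reading pi as the page index and pn as the page size is my assumption, based on the pageIndex and pageNum fields that come back in the response, not on any documentation. The helper name query_params is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

API_URL = (
    'https://fundapi.eastmoney.com/fundtradenew.aspx'
    '?ft=pg&sc=1n&st=desc&pi=1&pn=100&cp=&ct=&cd=&ms=&fr='
    '&plevel=&fst=&ftype=&fr1=&fl=0&isab=1'
)


def query_params(url):
    """Return the query string of url as a {name: value} dict."""
    # keep_blank_values=True keeps the empty parameters such as cp= and ct=
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Each name occurs once in this URL, so take the first value
    return {name: values[0] for name, values in parsed.items()}


params = query_params(API_URL)
# Assumed meaning: pi = page index, pn = rows per page, st = sort direction
print(params['pi'], params['pn'], params['st'])
```

Seeing the parameters by name makes it easier to decide which ones to change later, for example when requesting further pages.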

Step 4: Write the crawler scripts
4.1 Define the item

import scrapy


class FundsItem(scrapy.Item):
    # One field per value scraped for each fund
    code = scrapy.Field()            # fund code
    name = scrapy.Field()            # fund name
    unitNetWorth = scrapy.Field()    # net asset value per unit
    unitnet_day = scrapy.Field()     # date of the net asset value
    dayOfGrowth = scrapy.Field()     # daily growth rate
    recent1Week = scrapy.Field()     # last week
    recent1Month = scrapy.Field()    # last month
    recent3Month = scrapy.Field()    # last three months
    recent6Month = scrapy.Field()    # last six months
    recent1Year = scrapy.Field()     # last year
    recent2Year = scrapy.Field()     # last two years
    recent3Year = scrapy.Field()     # last three years
    fromThisYear = scrapy.Field()    # year to date
    fromBuild = scrapy.Field()       # since inception
    serviceCharge = scrapy.Field()   # service charge
    upEnoughAmount = scrapy.Field()  # minimum purchase amount
4.2 Write the spider

import json

import scrapy

from funds.items import FundsItem


class FundsSpider(scrapy.Spider):
    name = 'fundsList'  # unique name that identifies the spider; used when running the crawl
    # Allowed domains. The API is served from fundapi.eastmoney.com,
    # so allow the whole eastmoney.com domain.
    allowed_domains = ['eastmoney.com']

    # Initial URL. When a crawl starts from start_urls, each response is passed to
    # parse(self, response). Note: this page gives access to the data of all funds.
    # start_urls = ['http://fundact.eastmoney.com/banner/pg.html#ln']

    # custom_settings overrides settings for this spider only, while settings.py is
    # global; this is handy when one Scrapy project contains several spiders.
    # custom_settings = {}

    # The spider's initial requests come from start_requests(). By default it reads
    # start_urls and builds Requests with parse as the callback; once it is
    # overridden, no Requests are generated from start_urls.
    def start_requests(self):
        url = 'https://fundapi.eastmoney.com/fundtradenew.aspx?ft=pg&sc=1n&st=desc&pi=1&pn=100&cp=&ct=&cd=&ms=&fr=&plevel=&fst=&ftype=&fr1=&fl=0&isab=1'
        yield scrapy.Request(url, callback=self.parse_funds_list)

    def parse_funds_list(self, response):
        datas = response.body.decode('UTF-8')
        # Extract the JSON part: from the first '{' up to the first '}'
        datas = datas[datas.find('{'):datas.find('}') + 1]
        # Wrap the bare field names in double quotes so that json.loads accepts the text
        for key in ('datas', 'allRecords', 'pageIndex', 'pageNum', 'allPages'):
            datas = datas.replace(key, '"%s"' % key)
        jsonBody = json.loads(datas)
        # Each entry of "datas" is one fund, with its values separated by '|'
        for data in jsonBody['datas']:
            fundsArray = data.split('|')
            fundsItem = FundsItem()
            fundsItem['code'] = fundsArray[0]
            fundsItem['name'] = fundsArray[1]
            fundsItem['unitnet_day'] = fundsArray[3]
            fundsItem['unitNetWorth'] = fundsArray[4]
            fundsItem['dayOfGrowth'] = fundsArray[5]
            fundsItem['recent1Week'] = fundsArray[6]
            fundsItem['recent1Month'] = fundsArray[7]
            fundsItem['recent3Month'] = fundsArray[8]
            fundsItem['recent6Month'] = fundsArray[9]
            fundsItem['recent1Year'] = fundsArray[10]
            fundsItem['recent2Year'] = fundsArray[11]
            fundsItem['recent3Year'] = fundsArray[12]
            fundsItem['fromThisYear'] = fundsArray[13]
            fundsItem['fromBuild'] = fundsArray[14]
            fundsItem['serviceCharge'] = fundsArray[18]
            fundsItem['upEnoughAmount'] = fundsArray[24]
            yield fundsItem
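The parsing step can be tried without Scrapy or a network connection. The sketch below runs it on a made-up sample shaped like the response the spider handles: a JavaScript-style object with unquoted keys and '|'-separated rows. The sample text, its "var rankData =" prefix and the function name parse_rank_data are all assumptions for illustration, not captured output. Instead of a plain string replace, this version quotes a key only when it directly follows '{' or ',', which is a slightly stricter variant of the same idea.

```python
import json
import re

# Hypothetical sample; the values are invented and only mimic the assumed layout
SAMPLE = (
    'var rankData = {datas:['
    '"000001|Demo Fund A|DEMO|2020-01-02|1.2345|0.12",'
    '"000002|Demo Fund B|DEMO|2020-01-02|2.0000|-0.34"'
    '],allRecords:2,pageIndex:1,pageNum:100,allPages:1};'
)

KEYS = ('datas', 'allRecords', 'pageIndex', 'pageNum', 'allPages')


def parse_rank_data(text):
    """Turn the JavaScript-style payload into a Python dict."""
    # Keep only the object literal, from the first '{' to the last '}'
    body = text[text.find('{'):text.rfind('}') + 1]
    # Quote a key only where it follows '{' or ',' and precedes ':'
    pattern = r'([{,])\s*(' + '|'.join(KEYS) + r')\s*:'
    body = re.sub(pattern, r'\1"\2":', body)
    return json.loads(body)


parsed = parse_rank_data(SAMPLE)
for row in parsed['datas']:
    fields = row.split('|')
    # Same positions as in the spider: 0 = code, 1 = name, 3 = date, 4 = unit net worth
    print(fields[0], fields[1], fields[3], fields[4])
```

Running the parser on a saved copy of a real response is a quick way to check the field positions before starting a crawl.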
4.3 Configure settings.py
4.3.1 Set ITEM_PIPELINES, which enables the pipeline that stores the scraped data

ITEM_PIPELINES = {
    'funds.pipelines.FundsPipeline': 300,
}

Step 5: Write pipelines.py

Store the scraped data in the local PostgreSQL database:
from sqlalchemy import create_engine


class FundsPipeline(object):
    # Item fields in insert order. Unquoted column names are case-insensitive in
    # PostgreSQL, so the same names serve as column names.
    FIELDS = (
        'code', 'name', 'unitNetWorth', 'unitnet_day', 'dayOfGrowth',
        'recent1Week', 'recent1Month', 'recent3Month', 'recent6Month',
        'recent1Year', 'recent2Year', 'recent3Year', 'fromThisYear',
        'fromBuild', 'serviceCharge', 'upEnoughAmount',
    )

    def open_spider(self, spider):
        # Open one connection for the whole crawl instead of one per item
        self.engine = create_engine(r'postgresql://postgres:zorro@localhost/postgres', echo=False)
        self.connection = self.engine.raw_connection()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        spider.logger.debug('Saving fund %s', item['code'])
        # Insert into the table created in Step 1 (schema funds)
        sql = 'INSERT INTO funds.tt_web_fund_list (%s) VALUES (%s)' % (
            ', '.join(self.FIELDS),
            ', '.join(['%s'] * len(self.FIELDS)),
        )
        values = [item[field] for field in self.FIELDS]
        with self.connection.cursor() as cursor:
            # Values are passed as parameters, so quotes in fund names cannot break the SQL
            cursor.execute(sql, values)
        self.connection.commit()  # commit the insert
        return item
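Fund names are free text, so it is safer to pass values to the database as query parameters than to splice them into the SQL string. The sketch below shows the difference. It uses the standard library's sqlite3 as a stand-in for PostgreSQL, purely so that it runs without a database server, and the row is made up. Note that sqlite3 uses ? placeholders where psycopg2 uses %s.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tt_web_fund_list (code TEXT, name TEXT)')

# Made-up row whose name contains a single quote
row = ('000001', "O'Neil Demo Fund")

# 1) String formatting: the quote inside the name ends the SQL literal early
try:
    conn.execute("INSERT INTO tt_web_fund_list VALUES ('%s', '%s')" % row)
    formatted_ok = True
except sqlite3.OperationalError:
    formatted_ok = False

# 2) Query parameters: the driver handles quoting, so the row is stored unchanged
conn.execute('INSERT INTO tt_web_fund_list VALUES (?, ?)', row)
conn.commit()

stored = conn.execute('SELECT code, name FROM tt_web_fund_list').fetchall()
print(formatted_ok, stored)
```

The same pattern applies in the pipeline: keep the SQL text fixed and hand the item's values to cursor.execute as its second argument.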

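Step 6: Start the crawler in PyCharm

Step 7: Query the table in the database

Check whether the crawl succeeded. The query shows that the data has been stored in the PostgreSQL database, so the project works.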

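Summary

Early this year I studied Python's Scrapy library carefully. Because the technical documentation is entirely in English, the reading did not go smoothly, and I studied on and off. Needing to build a FOF portfolio recently, I picked it up again.

The above is a simple crawler that fetches only the first page of the equity-oriented fund list on Tiantian Fund. Fetching the data of all funds requires crawling the remaining pages recursively, which I will keep working on. With complete fund data, combined with sound fund-selection principles and market analysis, building an effective FOF portfolio is an exciting prospect.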
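One possible way to reach the remaining pages is to read allPages from the first response and request the same API again with a different page number. The sketch below only builds the URLs. That the pi parameter selects the page is my assumption from the URL and the pageIndex field, and the function names are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

API_URL = (
    'https://fundapi.eastmoney.com/fundtradenew.aspx'
    '?ft=pg&sc=1n&st=desc&pi=1&pn=100&cp=&ct=&cd=&ms=&fr='
    '&plevel=&fst=&ftype=&fr1=&fl=0&isab=1'
)


def page_url(base_url, page):
    """Return base_url with the pi parameter (assumed page index) set to page."""
    parts = urlsplit(base_url)
    # keep_blank_values=True preserves the empty parameters such as cp=
    query = [
        (name, str(page) if name == 'pi' else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def remaining_page_urls(base_url, all_pages):
    """URLs for pages 2..all_pages; page 1 is the initial request."""
    return [page_url(base_url, page) for page in range(2, all_pages + 1)]


# Example with a hypothetical total of 3 pages
for url in remaining_page_urls(API_URL, 3):
    print(url)
```

In the spider, each of these URLs could be yielded as a scrapy.Request from the callback that handles the first page, with a callback that parses the rows in the same way.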

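Be someone who
loves life and loves singing,
loves music and loves art,
loves sports and loves investing,
loves reading and loves writing,
loves their country and loves progress:
a person full of passion.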
