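[Crawler] Scraping all of JD.com with Python (products, shops, categories, comments)
Most articles online about scraping JD.com data either no longer work or retrieve only part of the data. This article describes how to scrape data from the whole JD.com site, including product, shop, comment, and category information.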
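1. Environment
OS: Windows 10
Python: 3.5
Scrapy: 1.3.2
pymongo: 3.2
IDE: PyCharm
Setting up the environment is not covered here; search online for installation guides.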
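2. Database schema
- Product categories
JD.com has roughly 1,183 categories, not counting virtual products (phone top-ups, lottery tickets, travel tickets, and so on). The full list is at:
https://www.jd.com/allSort.aspx
The crawl starts from this URL. Some entries on this page are channel pages rather than final categories, meaning they contain many sub-categories of their own, so they need special handling before every category can be collected. The method is described later in this article.
name #category name
url #category URL
_id #category id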
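- Products
url #product URL
_id #product id
category #product category
reallyPrice #current price
originalPrice #original price
description #product description
shopId #shop id
venderId #vender id
commentCount #total number of reviews
goodComment #number of positive reviews
generalComment #number of neutral reviews
poolComment #number of negative reviews
favourableDesc1 #promotion description 1
favourableDesc2 #promotion description 2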
- Comments
_id #comment id
productId #product id
guid
content #comment text
creationTime #comment time
isTop
referenceId
referenceName
referenceType
referenceTypeId
firstCategory
secondCategory
thirdCategory
replyCount #number of replies
score #score
status
title
usefulVoteCount #number of "useful" votes
uselessVoteCount #number of "not useful" votes
userImage
userImageUrl
userLevelId
userProvince
viewCount
orderId #order id
isReplyGrade
nickname #commenter's display name
userClient
mergeOrderStatus
discussionId
productColor
productSize
imageCount #number of images in the comment
integral
userImgFlag
anonymousFlag
userLevelName
plusAvailable
recommend
userLevelColor
userClientShow
isMobile #whether the comment was posted from a mobile client
days
afterDays #number of follow-up comments
- Shops
A shop with an alias usually has two URLs. For example, the 宝梦 (Baomeng) flagship store:
url1: http://mall.jd.com/index-596056.html
url2: https://baomeng.jd.com/
_id #shop name
name #shop name
url1 #shop URL 1
url2 #shop URL 2
shopId #shop id
venderId #vender id
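- Comment summary
_id
goodRateShow #positive-review rate
poorRateShow #negative-review rate
poorCountStr #negative-review count, as a string
averageScore #average score
generalCountStr #neutral-review count, as a string
showCount
showCountStr
goodCount #number of positive reviews
generalRate #neutral-review rate
generalCount #number of neutral reviews
skuId
goodCountStr #positive-review count, as a string
poorRate #negative-review rate
afterCount #number of follow-up reviews
goodRateStyle
poorCount
skuIds
poorRateStyle
generalRateStyle
commentCountStr
commentCount
productId #product id
afterCountStr
goodRate
generalRateShow
jwotestProduct
maxPage
score
soType
imageListCount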
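The article does not show how the items are written to MongoDB. Since every collection above is keyed by `_id`, one plausible approach is an upsert, so that re-crawling a product or category updates the existing document instead of raising a duplicate-key error. The following is a minimal sketch under that assumption; `build_upsert` and `save_item` are hypothetical helper names, and `collection` is assumed to be a pymongo `Collection` (anything with an `update_one(filter, update, upsert=...)` method will do).

```python
def build_upsert(item):
    """Split an item dict into a (filter, update) pair for an upsert by `_id`."""
    doc = dict(item)
    _id = doc.pop('_id')  # `_id` belongs in the filter, not in $set
    return {'_id': _id}, {'$set': doc}


def save_item(collection, item):
    """Insert the item, or update the stored document with the same `_id`.

    `collection` is assumed to behave like a pymongo Collection.
    """
    flt, update = build_upsert(item)
    collection.update_one(flt, update, upsert=True)
```

In a Scrapy project this would typically be called from an item pipeline's `process_item`, with one collection per item type.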
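3. Crawling details
- Crawling categories
The code is as follows:
    import re

    from scrapy import Request
    from scrapy.selector import Selector

    # CategoriesItem comes from the project's items module, and key_word is a
    # module-level list of the host prefixes to follow; neither is shown here.

    def parse_category(self, response):
        """Parse a category page."""
        selector = Selector(response)
        try:
            texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
            for text in texts:
                # The original pattern lost its HTML when the page was extracted.
                # This is a reconstruction that captures (href, link text),
                # which is how item[0] and item[1] are used below.
                items = re.findall(r'<a href="(.*?)" target="_blank">(.*?)</a>', text)
                for item in items:
                    if item[0].split('.')[0][2:] in key_word:
                        if item[0].split('.')[0][2:] != 'list':
                            # Channel page: parse it again for sub-categories.
                            yield Request(url='https:' + item[0], callback=self.parse_category)
                        else:
                            # Final category: store it, then crawl its product list.
                            categoriesItem = CategoriesItem()
                            categoriesItem['name'] = item[1]
                            categoriesItem['url'] = 'https:' + item[0]
                            categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                            yield categoriesItem
                            yield Request(url='https:' + item[0], callback=self.parse_list)
        except Exception as e:
            print('error:', e)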
As mentioned earlier, some categories contain many sub-categories. URLs of that kind have to go through category parsing again:
    if item[0].split('.')[0][2:] != 'list':
        yield Request(url='https:' + item[0], callback=self.parse_category)
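The decision above depends only on the first host label of a protocol-relative link, so it can be isolated and checked without running a crawl. A minimal sketch follows; `KEY_WORD` is a hypothetical stand-in for the article's `key_word` list (its real contents are not shown), and the URLs in the docstring and usage are illustrative only.

```python
# Hypothetical stand-in for the article's `key_word` list.
KEY_WORD = ('list', 'channel')


def host_label(href):
    """Return the first host label of a protocol-relative link.

    For example '//list.jd.com/list.html?cat=1,2,3' gives 'list'.
    """
    return href.split('.')[0][2:]


def classify(href):
    """Classify a link as 'leaf', 'channel', or 'skip'."""
    label = host_label(href)
    if label not in KEY_WORD:
        return 'skip'     # not a category page the crawler follows
    if label == 'list':
        return 'leaf'     # final category: store it and crawl its products
    return 'channel'      # channel page: parse again for sub-categories
```

With these assumptions, `classify('//list.jd.com/list.html?cat=1,2,3')` returns `'leaf'`, while a link whose host starts with `channel` is sent back through category parsing.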
- Crawling products
Requesting each category URL returns a product list. From it, find each product's URL and follow it to the detail page to scrape the product details:
    def parse_list(self, response):
        """Get each product's URL and the next-page URL."""
        meta = dict()
        meta['category'] = response.url.split('=')[1].split('&')[0]

        selector = Selector(response)
        texts = selector.xpath('//*[@id="plist"]/ul/li/div/div[@class="p-img"]/a').extract()
        for text in texts:
            items = re.findall(r'<a target="_blank" href="(.*?)">', text)
            yield Request(url='https:' + items[0], callback=self.parse_product, meta=meta)
        # Next-page handling is not included in this excerpt.
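Splitting the URL on `=` and `&` works only while the category parameter is the first one in the query string. As a hedged alternative, the standard library's `urllib.parse` can read the parameter by name. The sketch below assumes the parameter is called `cat`, which the article does not state explicitly; `category_from_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def category_from_url(url):
    """Return the value of the `cat` query parameter, or None if it is absent.

    The parameter name `cat` is an assumption about JD list URLs.
    """
    query = parse_qs(urlsplit(url).query)
    values = query.get('cat')
    return values[0] if values else None
```

Unlike the split-based version, this keeps working if another parameter such as `page` appears before `cat`.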
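Most of a product's basic information can be read from the detail page, but some of it, such as the price and promotions, is loaded dynamically and has to be requested separately.
Start with the price. The URL format is:
https://p.3.cn/prices/mgets?skuIds=J_(product_id)
The part in parentheses at the end of the URL is the product id, which has to be filled in for each product. The code is as follows:
    import requests

    # price_url is the URL above up to and including "J_"; it is defined elsewhere.
    response = requests.get(url=price_url + product_id)
    price_json = response.json()
    productsItem['reallyPrice'] = price_json[0]['p']
    productsItem['originalPrice'] = price_json[0]['m']
The response is JSON, which is easy to parse.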
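To make the shape of that response concrete, here is a small sketch that builds the price URL and reads the two fields used above. It does no network access; `price_url_for` and `parse_prices` are hypothetical helper names, and returning `(None, None)` for an empty response is an added assumption rather than the article's behavior.

```python
# Prefix taken from the URL format given in the article.
PRICE_URL = 'https://p.3.cn/prices/mgets?skuIds=J_'


def price_url_for(product_id):
    """Build the price URL for one product id."""
    return PRICE_URL + str(product_id)


def parse_prices(price_json):
    """Return (reallyPrice, originalPrice) from the decoded price response.

    Expects a list whose first element has the keys 'p' (current price)
    and 'm' (original price), as used in the article's code.
    """
    if not price_json:
        return None, None  # assumption: treat an empty list as "no price"
    first = price_json[0]
    return first['p'], first['m']
```

For example, a decoded response of `[{'p': '99.00', 'm': '129.00'}]` (an invented payload) yields `('99.00', '129.00')`.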
Next, promotions. There are two kinds: coupons and spend-and-save descriptions. Both need to be scraped, and both are loaded dynamically. The code is as follows:
    # Promotions. favourable_url is a URL template defined elsewhere.
    res_url = favourable_url % (product_id, shop_id, vender_id, category.replace(',', '%2c'))
    print(res_url)
    response = requests.get(res_url)
    fav_data = response.json()
    if fav_data['skuCoupon']:
        desc1 = []
        for item in fav_data['skuCoupon']:
            start_time = item['beginTime']
            end_time = item['endTime']
            time_dec = item['timeDesc']
            fav_price = item['quota']
            fav_count = item['discount']
            fav_time = item['addDays']
            # "Valid from %s to %s, spend %s and save %s"
            desc1.append(u'有效期%s至%s,满%s减%s' % (start_time, end_time, fav_price, fav_count))
        productsItem['favourableDesc1'] = ';'.join(desc1)

    if fav_data['prom'] and fav_data['prom']['pickOneTag']:
        desc2 = []
        for item in fav_data['prom']['pickOneTag']:
            desc2.append(item['content'])
        # Stored in favourableDesc2; the original wrote to favourableDesc1 here,
        # which would overwrite the coupon description.
        productsItem['favourableDesc2'] = ';'.join(desc2)
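The two description strings can also be built by small pure functions, which makes missing keys easier to handle than direct indexing. This is a sketch, not the author's code: the function names are hypothetical, the description text is written in English for readability, and treating a missing `skuCoupon` or `prom` key as "no promotion" is an assumption.

```python
def describe_coupons(fav_data):
    """Join one description per coupon with ';' (empty string if none)."""
    coupons = fav_data.get('skuCoupon') or []
    return ';'.join(
        'valid %s to %s, spend %s save %s'
        % (c['beginTime'], c['endTime'], c['quota'], c['discount'])
        for c in coupons
    )


def describe_promotions(fav_data):
    """Join the content of every pickOneTag entry with ';' (empty if none)."""
    prom = fav_data.get('prom') or {}
    tags = prom.get('pickOneTag') or []
    return ';'.join(tag['content'] for tag in tags)
```

The results map onto `favourableDesc1` and `favourableDesc2` respectively.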
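- Crawling shop information
The shop id and vender id can be found directly on every product detail page:
    # The first pattern matches a quoted shopId, the second an unquoted one.
    ids = re.findall(r"venderId:(.*?),\s.*?shopId:'(.*?)'", response.text)
    if not ids:
        ids = re.findall(r"venderId:(.*?),\s.*?shopId:(.*?),", response.text)
    vender_id = ids[0][0]
    shop_id = ids[0][1]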
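The snippet above raises `IndexError` when neither pattern matches. A minimal sketch of the same two-pattern lookup that returns `None` instead is shown below; `extract_ids` is a hypothetical helper name, and the sample page fragments in the usage note are invented to illustrate the two shapes the patterns target.

```python
import re

# Same two patterns as in the article: quoted shopId first, then unquoted.
_QUOTED = re.compile(r"venderId:(.*?),\s.*?shopId:'(.*?)'")
_BARE = re.compile(r"venderId:(.*?),\s.*?shopId:(.*?),")


def extract_ids(page_text):
    """Return (vender_id, shop_id), or None if the page has neither form."""
    match = _QUOTED.search(page_text) or _BARE.search(page_text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```

For instance, a fragment such as `venderId:12345,` followed on the next line by `shopId:67890,` falls through to the second pattern.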
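The shop name is harder to extract: there are several page layouts, and the shop title sits in a different place in each. For JD self-operated products, where none of these layouts matches, the code falls back to the name 京东自营 (JD self-operated). The code is as follows:
    try:
        name = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li/a//text()').extract()[0]
    except IndexError:
        try:
            name = response.xpath('//div[@class="name"]/a//text()').extract()[0].strip()
        except IndexError:
            try:
                name = response.xpath('//div[@class="shopName"]/strong/span/a//text()').extract()[0].strip()
            except IndexError:
                try:
                    name = response.xpath('//div[@class="seller-infor"]/a//text()').extract()[0].strip()
                except IndexError:
                    name = u'京东自营'  # "JD self-operated"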
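The nested `try` chain can be flattened into a loop over candidate XPath expressions. The sketch below is one way to do that, not the author's code; `first_text` is a hypothetical helper, and `extract` is any callable that maps an XPath string to a list of strings, so the logic can be exercised without Scrapy.

```python
# The four XPath expressions from the article, in the order they are tried.
SHOP_NAME_XPATHS = [
    '//ul[@class="parameter2 p-parameter-list"]/li/a//text()',
    '//div[@class="name"]/a//text()',
    '//div[@class="shopName"]/strong/span/a//text()',
    '//div[@class="seller-infor"]/a//text()',
]


def first_text(extract, xpaths, default):
    """Return the first stripped result of extract(xpath), else default."""
    for xpath in xpaths:
        results = extract(xpath)
        if results:
            return results[0].strip()
    return default
```

Inside a Scrapy callback this could be called as `first_text(lambda xp: response.xpath(xp).extract(), SHOP_NAME_XPATHS, u'京东自营')`. Adding support for a new page layout then means adding one entry to the list.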
- Crawling comments
Comments are also loaded dynamically, and the response is JSON as well. The URL format is:
https://club.jd.com/comment/productPageComments.action?productId=(product_id)&score=0&sortType=5&page=%s&pageSize=10
Only the product id is required.
The code for fetching comments is as follows:
    # The "def" line is missing from the original; the name below is assumed.
    def parse_comments(self, response):
        """Parse the comments of a product."""
        try:
            data = json.loads(response.text)
        except Exception as e:
            print('get comment failed:', e)
            return

        product_id = response.meta['product_id']

        commentSummaryItem = CommentSummaryItem()
        commentSummary = data.get('productCommentSummary')
        commentSummaryItem['goodRateShow'] = commentSummary.get('goodRateShow')
        commentSummaryItem['poorRateShow'] = commentSummary.get('poorRateShow')
        commentSummaryItem['poorCountStr'] = commentSummary.get('poorCountStr')
        commentSummaryItem['averageScore'] = commentSummary.get('averageScore')
        commentSummaryItem['generalCountStr'] = commentSummary.get('generalCountStr')
        commentSummaryItem['showCount'] = commentSummary.get('showCount')
        commentSummaryItem['showCountStr'] = commentSummary.get('showCountStr')
        commentSummaryItem['goodCount'] = commentSummary.get('goodCount')
        commentSummaryItem['generalRate'] = commentSummary.get('generalRate')
        commentSummaryItem['generalCount'] = commentSummary.get('generalCount')
        commentSummaryItem['skuId'] = commentSummary.get('skuId')
        commentSummaryItem['goodCountStr'] = commentSummary.get('goodCountStr')
        commentSummaryItem['poorRate'] = commentSummary.get('poorRate')
        commentSummaryItem['afterCount'] = commentSummary.get('afterCount')
        commentSummaryItem['goodRateStyle'] = commentSummary.get('goodRateStyle')
        commentSummaryItem['poorCount'] = commentSummary.get('poorCount')
        commentSummaryItem['skuIds'] = commentSummary.get('skuIds')
        commentSummaryItem['poorRateStyle'] = commentSummary.get('poorRateStyle')
        commentSummaryItem['generalRateStyle'] = commentSummary.get('generalRateStyle')
        commentSummaryItem['commentCountStr'] = commentSummary.get('commentCountStr')
        commentSummaryItem['commentCount'] = commentSummary.get('commentCount')
        commentSummaryItem['productId'] = commentSummary.get('productId')  # same as the ProductsItem id
        commentSummaryItem['_id'] = commentSummary.get('productId')
        commentSummaryItem['afterCountStr'] = commentSummary.get('afterCountStr')
        commentSummaryItem['goodRate'] = commentSummary.get('goodRate')
        commentSummaryItem['generalRateShow'] = commentSummary.get('generalRateShow')
        commentSummaryItem['jwotestProduct'] = data.get('jwotestProduct')
        commentSummaryItem['maxPage'] = data.get('maxPage')
        commentSummaryItem['score'] = data.get('score')
        commentSummaryItem['soType'] = data.get('soType')
        commentSummaryItem['imageListCount'] = data.get('imageListCount')
        yield commentSummaryItem

        for hotComment in data['hotCommentTagStatistics']:
            hotCommentTagItem = HotCommentTagItem()
            hotCommentTagItem['_id'] = hotComment.get('id')
            hotCommentTagItem['name'] = hotComment.get('name')
            hotCommentTagItem['status'] = hotComment.get('status')
            hotCommentTagItem['rid'] = hotComment.get('rid')
            hotCommentTagItem['productId'] = hotComment.get('productId')
            hotCommentTagItem['count'] = hotComment.get('count')
            hotCommentTagItem['created'] = hotComment.get('created')
            hotCommentTagItem['modified'] = hotComment.get('modified')
            hotCommentTagItem['type'] = hotComment.get('type')
            hotCommentTagItem['canBeFiltered'] = hotComment.get('canBeFiltered')
            yield hotCommentTagItem

        for comment_item in data['comments']:
            comment = CommentItem()
            comment['_id'] = comment_item.get('id')
            comment['productId'] = product_id
            comment['guid'] = comment_item.get('guid')
            comment['content'] = comment_item.get('content')
            comment['creationTime'] = comment_item.get('creationTime')
            comment['isTop'] = comment_item.get('isTop')
            comment['referenceId'] = comment_item.get('referenceId')
            comment['referenceName'] = comment_item.get('referenceName')
            comment['referenceType'] = comment_item.get('referenceType')
            comment['referenceTypeId'] = comment_item.get('referenceTypeId')
            comment['firstCategory'] = comment_item.get('firstCategory')
            comment['secondCategory'] = comment_item.get('secondCategory')
            comment['thirdCategory'] = comment_item.get('thirdCategory')
            comment['replyCount'] = comment_item.get('replyCount')
            comment['score'] = comment_item.get('score')
            comment['status'] = comment_item.get('status')
            comment['title'] = comment_item.get('title')
            comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')
            comment['uselessVoteCount'] = comment_item.get('uselessVoteCount')
            comment['userImage'] = 'http://' + comment_item.get('userImage')
            comment['userImageUrl'] = 'http://' + comment_item.get('userImageUrl')
            comment['userLevelId'] = comment_item.get('userLevelId')
            comment['userProvince'] = comment_item.get('userProvince')
            comment['viewCount'] = comment_item.get('viewCount')
            comment['orderId'] = comment_item.get('orderId')
            comment['isReplyGrade'] = comment_item.get('isReplyGrade')
            comment['nickname'] = comment_item.get('nickname')
            comment['userClient'] = comment_item.get('userClient')
            comment['mergeOrderStatus'] = comment_item.get('mergeOrderStatus')
            comment['discussionId'] = comment_item.get('discussionId')
            comment['productColor'] = comment_item.get('productColor')
            comment['productSize'] = comment_item.get('productSize')
            comment['imageCount'] = comment_item.get('imageCount')
            comment['integral'] = comment_item.get('integral')
            comment['userImgFlag'] = comment_item.get('userImgFlag')
            comment['anonymousFlag'] = comment_item.get('anonymousFlag')
            comment['userLevelName'] = comment_item.get('userLevelName')
            comment['plusAvailable'] = comment_item.get('plusAvailable')
            comment['recommend'] = comment_item.get('recommend')
            comment['userLevelColor'] = comment_item.get('userLevelColor')
            comment['userClientShow'] = comment_item.get('userClientShow')
            comment['isMobile'] = comment_item.get('isMobile')
            comment['days'] = comment_item.get('days')
            comment['afterDays'] = comment_item.get('afterDays')
            yield comment

            if 'images' in comment_item:
                for image in comment_item['images']:
                    commentImageItem = CommentImageItem()
                    commentImageItem['_id'] = image.get('id')
                    commentImageItem['associateId'] = image.get('associateId')  # same as the CommentItem discussionId
                    commentImageItem['productId'] = image.get('productId')  # not the ProductsItem id; this value is 0
                    commentImageItem['imgUrl'] = 'http:' + image.get('imgUrl')
                    commentImageItem['available'] = image.get('available')
                    commentImageItem['pin'] = image.get('pin')
                    commentImageItem['dealt'] = image.get('dealt')
                    commentImageItem['imgTitle'] = image.get('imgTitle')
                    commentImageItem['isMain'] = image.get('isMain')
                    yield commentImageItem

        # next page
        for i in range(1, int(data['maxPage'])):
            url = comment_url % (product_id, str(i))
            meta = dict()
            meta['product_id'] = product_id
            yield Request(url=url, callback=self.parse_comments2, meta=meta)
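The pagination at the end of that function can be sketched on its own. The template below follows the comment URL format given in the article, with two `%s` placeholders to match the `comment_url % (product_id, str(i))` call; that the first request fetched page 0, which is why the loop starts at 1, is an assumption. `remaining_comment_urls` is a hypothetical helper name.

```python
# URL format from the article, with placeholders for product id and page.
COMMENT_URL = ('https://club.jd.com/comment/productPageComments.action'
               '?productId=%s&score=0&sortType=5&page=%s&pageSize=10')


def remaining_comment_urls(product_id, max_page):
    """Return the URLs for pages 1 .. max_page - 1.

    Mirrors range(1, int(data['maxPage'])) in the article's code;
    page 0 is assumed to have been fetched by the first request.
    """
    return [COMMENT_URL % (product_id, page)
            for page in range(1, int(max_page))]
```

Each returned URL would be scheduled as a `Request` carrying `product_id` in its `meta`, as in the loop above.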
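- Crawl process

The core code has been posted in this article. It is rather messy, so discussion and suggestions are welcome.
For more analysis, and for scraping Taobao data, see:
http://cloud.yisurvey.com:9081//html/37be8794-b79e-4511-9d0a-81f082bac606.html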