Python Hands-on Project: Scraping Product Data

Target: 58.com (58同城)

Task1

SubTask1 58 second-hand market category pages

Aim

(Because of the site redesign, only Zhuanzhuan (转转) listings can be scraped.)

Results

Non-promoted listings from the Guangzhou 58 fitness-equipment pages (298 entries)

Code

  • Originally I planned to scrape every slot and then strip out the promoted entries, but the scraped results turned out to contain no promoted listings, so I just scraped them directly.
  • To make it a bit more relatable, I swapped in a different URL to scrape.
  • Note: when scraping images, use the lazy-loaded attribute (lazy_src in the code; see the small sketch after this list).
  • Clean up the text with strip().
  • Scraped about 300 entries and added the listing URL to each row.
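
A minimal sketch of that lazy-image gotcha: on the listing page the real image URL sits in the lazy-loaded attribute, so falling back between attributes is safer (attribute names other than lazy_src are assumptions):

# Sketch: prefer the lazy-loaded attribute, fall back to src if it is absent.
# 'lazy_src' matches the script below; 'src' as a fallback is an assumption.
def image_url(img_tag):
    return img_tag.get('lazy_src') or img_tag.get('src')
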
#-*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import csv

urlBase = 'http://cs.58.com/pbdn/0/'
urlJS = ['http://gz.58.com/jianshenqixie/0/pn{}/'.format(str(i)) for i in range(0, 10)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Cookie': '...'  # too long to paste here; removed before posting
}

# CSS selectors for the fields on the category listing page
titleSel = '#infolist > div.infocon > table > tbody > tr > td.t > a'
priceSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.pricebiao > span'
oripriceSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.priceyuan'
describtionSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.desc'
placeSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.fl'
imgSel = '#infolist > div.infocon > table > tbody > tr > td.img > a > img'

tuiGuang = '#jingzhun > tbody > tr.jztr.last > td.jzxztd'  # selector for the promoted (ad) rows

goodSel = '#infolist > div.infocon > table > tbody > tr'


info = ['title', 'price', 'place', 'describtion', 'Url', 'img', 'OriPrice']


def GetInfo(url, writer):
    time.sleep(3)
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, "lxml")

    titles = soup.select(titleSel)
    imgs = soup.select(imgSel)
    prices = soup.select(priceSel)
    describtions = soup.select(describtionSel)
    oriprices = soup.select(oripriceSel)
    places = soup.select(placeSel)

    for title, price, oriprice, describtion, place, img in zip(titles, prices, oriprices, describtions, places, imgs):
        writer.writerow({
            'title': title.get_text().strip(),
            'price': price.get_text().strip(),
            'place': place.get_text().strip().replace('\n', ''),  # clear blanks and '\n'
            'describtion': describtion.get_text().strip(),
            'img': img.get('lazy_src'),  # careful with images: the real URL is in lazy_src
            'OriPrice': oriprice.get_text().strip(),
            'Url': title.get('href')
        })
    return 'success get url: ' + url


def main():
    testFile = open('test.csv', 'w')
    writer = csv.writer(testFile)
    writer.writerow(info)  # write the header row
    writer = csv.DictWriter(testFile, info)  # then switch to a DictWriter for the data rows
    for ps in urlJS:
        print(GetInfo(ps, writer))
    testFile.close()


if __name__ == '__main__':
    main()

SubTask2 58 second-hand item detail pages

This will presumably take a long time to run... I'll finish it before dinner and leave it running.

The result was...

When I came back from dinner I found...

...that I had given one of the functions an extra parameter.

Results

Same story with the page redesign: the links are the ones collected in the previous subtask, so I just used those URLs to scrape one level deeper :)

Scraping the view count seems easy, but I could not get the item condition or the posting time.

PostTimes = [PostTimes[1]]  # after adding this line the description fields could be scraped.. I hadn't expected two identical tags there QAQ
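
A quick pre-flight check would have caught this earlier: count how many elements each selector matches on one sample detail page before kicking off the long crawl. A minimal sketch (my own addition), reusing the commented-out test URL and the PostTime selector from the code below:

import requests
from bs4 import BeautifulSoup

# sample detail URL taken from the test URL in the script below
sample_url = 'http://zhuanzhuan.58.com/detail/761852078245855236z.shtml?fullCate=5%2C46%2C542&fullLocal=3&from=pc'
PostTimeSel = 'body > div.content > div > div.box_left > div > div > div > p'

soup = BeautifulSoup(requests.get(sample_url).text, 'lxml')
print(len(soup.select(PostTimeSel)))  # anything other than 1 deserves a closer look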

Code

#-*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import csv

urlJS = ['http://gz.58.com/jianshenqixie/0/pn{}/'.format(str(i)) for i in range(0, 10)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Cookie': '...'
}

info = ['kind', 'title', 'PostTime', 'price', 'New', 'Place']

titleSel = '#infolist > div.infocon > table > tbody > tr > td.t > a'

kindSel = '#nav > div > span > a'
IntitleSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.box_left_top > h1'
PostTimeSel = 'body > div.content > div > div.box_left > div > div > div > p'
priceSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.price_li > span > i'
NewSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.box_left_top > p > span.look_time'
PlaceSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.palce_li > span > i'


urls = []

def GetInfo(url, writer):
    time.sleep(3)
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, "lxml")
    # the select calls below were generated with a small helper script (they are all repetitive)
    titles = soup.select(IntitleSel)
    prices = soup.select(priceSel)
    News = soup.select(NewSel)
    Places = soup.select(PlaceSel)
    kinds = soup.select(kindSel)
    PostTimes = soup.select(PostTimeSel)
    PostTimes = [PostTimes[1]]  # the selector matches two identical tags; keep only the second one

    for kind, title, PostTime, price, New, Place in zip(kinds, titles, PostTimes, prices, News, Places):
        writer.writerow(dict(kind=kind.get_text(), title=title.get_text(), PostTime=PostTime.get_text(),
                             price=price.get_text(), New=New.get_text(), Place=Place.get_text()))
    return 'successful get url: ' + url


def GetUrl(url):
    time.sleep(3)
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, "lxml")

    titles = soup.select(titleSel)
    for href in titles:
        urls.append(href.get('href'))
    return 'successful get base url: ' + url


def main():
    testFile = open('test1.csv', 'w')
    writer = csv.writer(testFile)
    writer.writerow(info)  # header row
    writer = csv.DictWriter(testFile, info)  # DictWriter for the data rows
    for ps in urlJS:
        print(GetUrl(ps))
    for pages in urls:
        GetInfo(pages, writer)  # should have wrapped this in print()..
    #testURL = 'http://zhuanzhuan.58.com/detail/761852078245855236z.shtml?fullCate=5%2C46%2C542&fullLocal=3&from=pc'
    #GetInfo(testURL, writer)
    testFile.close()


if __name__ == '__main__':
    main()

Review

The first task was basically a small review of:

  • requests for making the HTTP requests

  • BeautifulSoup for parsing the pages

  • soup.select(titleSel) to get elements

  • kind.get_text() && kind.get('src')

  • csv.writer(testFile) && writer.writerow(info) && writer = csv.DictWriter(testFile, info) (a tidier variant is sketched after this list)

  • generating the highly repetitive extraction code with a small script of my own

  • time.sleep

  • headers

  • -*- coding: utf8 -*-
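
On the csv point above: a minimal sketch (my own variant, not the original script) that uses DictWriter.writeheader() instead of mixing a plain writer and a DictWriter:

import csv

info = ['title', 'price', 'place']

with open('test.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.DictWriter(f, fieldnames=info)
    writer.writeheader()  # replaces the separate csv.writer(...).writerow(info) step
    writer.writerow({'title': 'demo', 'price': '100', 'place': 'Guangzhou'})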

Task2 Scraping JS view counts

58.com has been redesigned...

It seems the view count can now be scraped directly...

(Honestly, the site looks a lot better after the redesign.)

So I experimented on my own blog instead...

I turned back on the view counter that I had furiously disabled back when a domain change reset all the counts to zero... (shameless, I know)

inspect > sources > busuanzi > …

And then it turned out the Busuanzi counter can't be scraped this way???

#-*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import csv

api = 'http://busaunzi.ibruce.info/busuanzi?jsonpCallback=BusuanziCallback_835222807248'
js = requests.get(api)  # the stray 'lxml' argument has been dropped; requests.get takes no parser argument
print(js.text)          # print the response body rather than the Response object
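
If the endpoint does respond, the body is JSONP (a JS callback wrapping a JSON object) rather than HTML, so BeautifulSoup won't help. A minimal sketch that strips the wrapper with a regex, under the assumption that the payload looks like BusuanziCallback_xxx({...}) and reusing the api URL above (Busuanzi may also check the Referer header to decide which site's counts to return):

import json
import re
import requests

api = 'http://busaunzi.ibruce.info/busuanzi?jsonpCallback=BusuanziCallback_835222807248'
resp = requests.get(api, headers={'Referer': 'http://example.com/'})  # Referer requirement is an assumption

# expected shape (assumption): BusuanziCallback_xxx({"site_pv": ..., "page_pv": ..., "site_uv": ...})
match = re.search(r'\((\{.*\})\)', resp.text)
if match:
    print(json.loads(match.group(1)))
else:
    print('unexpected response:', resp.text[:200])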

Video notes

  • Because of how 58.com pages are laid out, some information can be scraped straight from the page title: soup.title.text
  • url.split('/')[-1].strip('x.shtml') to get the listing id from the URL
  • ... if (...) else None for fields that may be missing (a small sketch follows this list)
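
A minimal sketch of those three notes together, on a made-up old-style 58 detail URL (the URL and the exact id format are assumptions):

import requests
from bs4 import BeautifulSoup

url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'  # hypothetical example URL

# last path segment is '24604629984324x.shtml'; strip('x.shtml') trims the trailing characters, leaving the id
item_id = url.split('/')[-1].strip('x.shtml')
print(item_id)

# some fields can be read straight from <title>; the conditional expression guards a missing tag
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.title.text if soup.title else None)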

ReadMores

Requests

BeautifulSoup

output

  • The prettify() method formats the Beautiful Soup document tree and outputs it as Unicode, with each XML/HTML tag on its own line.

  • The get_text() method returns all the text inside a tag, including the text of its descendant tags, as a single Unicode string:

    • soup.get_text("|", strip=True): the first argument is the separator, the second strips the surrounding whitespace
  • Lesser-known methods

    soup.title
    # <title>The Dormouse's story</title>

    Accessing a tag as an attribute only returns the first tag with that name:
    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    soup.title.name
    # u'title'

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    append()
    Tag.append() adds content to a tag, just like Python's list .append() method:
    soup = BeautifulSoup("<a>Foo</a>")
    soup.a.append("Bar")
    soup
    # <html><head></head><body><a>FooBar</a></body></html>
    soup.a.contents
    # [u'Foo', u'Bar']

  • Main methods


Pass a string to the .select() method to find tags using CSS selector syntax:
soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]


When searching by name or attribute, the value you pass in can be a string, a regular expression, a list, or True.

soup.find_all('a')  # the name argument finds every tag with that name; string objects are automatically skipped
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")  # a regular expression or a function can also be passed
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

soup.find_all("a", limit=2)  # returns only the first two results
Regular expressions

If you pass a regular expression as an argument, Beautiful Soup matches content using the expression's match() method. The example below finds every tag whose name starts with "b", which means both the <body> and <b> tags should be found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

The code below finds every tag whose name contains "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

Functions

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]

HTML5, XPath, HTTP

http1.3(download)

Summary

  • Always check whether a selector matches two identical tags before scraping at scale (scrape a few items first and generate a sample output file to check the format).
  • If an element has a uniquely identifying tag or style, you can select it directly, e.g. soup.select('time').
  • In a CSS selector, # stands for an id and . stands for a class (see the small sketch after this list).
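
A minimal illustration of the id/class selector syntax on a small hand-written snippet (the HTML below is made up for the example):

from bs4 import BeautifulSoup

# hypothetical markup, loosely shaped like a 58 listing row
html = '''
<div id="infolist">
  <table><tr>
    <td class="t"><a href="http://example.com/item1">Item 1</a></td>
    <td class="t"><a href="http://example.com/item2">Item 2</a></td>
  </tr></table>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

# '#infolist' selects by id, '.t' selects by class; combining them scopes the search
for a in soup.select('#infolist .t a'):
    print(a.get_text(), a.get('href'))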