Python Hands-on Project: Scraping Product Data

Target: 58.com (58同城)

Task1

SubTask1 58 second-hand market category pages

Aim

(Because of the site redesign, only Zhuanzhuan (转转) listings can be scraped.)

Results

Non-promoted listings from the Guangzhou 58 fitness-equipment pages (298 entries)

Code

  • Originally I planned to scrape every slot and then strip out the promoted entries, but the scraped results turned out to contain no promoted listings, so I just scraped them directly.
  • To make it a bit more relatable, I swapped in a different URL to scrape.
  • Note: when scraping images, use the lazy-loaded attribute (lazy_src in the code; see the small sketch after this list).
  • Clean up the text with strip().
  • Scraped about 300 entries and added the listing URL to each row.
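
A minimal sketch of that lazy-image gotcha: on the listing page the real image URL sits in the lazy-loaded attribute, so falling back between attributes is safer (attribute names other than lazy_src are assumptions):

# Sketch: prefer the lazy-loaded attribute, fall back to src if it is absent.
# 'lazy_src' matches the script below; 'src' as a fallback is an assumption.
def image_url(img_tag):
    return img_tag.get('lazy_src') or img_tag.get('src')
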
#-*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import csv

urlBase = 'http://cs.58.com/pbdn/0/'
urlJS = ['http://gz.58.com/jianshenqixie/0/pn{}/'.format(str(i)) for i in range(0, 10)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Cookie': '...'  # too long to paste here; removed before posting
}

# CSS selectors for the fields on the category listing page
titleSel = '#infolist > div.infocon > table > tbody > tr > td.t > a'
priceSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.pricebiao > span'
oripriceSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.priceyuan'
describtionSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.desc'
placeSel = '#infolist > div.infocon > table > tbody > tr > td.t > span.fl'
imgSel = '#infolist > div.infocon > table > tbody > tr > td.img > a > img'

tuiGuang = '#jingzhun > tbody > tr.jztr.last > td.jzxztd'  # selector for the promoted (ad) rows

goodSel = '#infolist > div.infocon > table > tbody > tr'


info = ['title', 'price', 'place', 'describtion', 'Url', 'img', 'OriPrice']


def GetInfo(url, writer):
    time.sleep(3)
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, "lxml")

    titles = soup.select(titleSel)
    imgs = soup.select(imgSel)
    prices = soup.select(priceSel)
    describtions = soup.select(describtionSel)
    oriprices = soup.select(oripriceSel)
    places = soup.select(placeSel)

    for title, price, oriprice, describtion, place, img in zip(titles, prices, oriprices, describtions, places, imgs):
        writer.writerow({
            'title': title.get_text().strip(),
            'price': price.get_text().strip(),
            'place': place.get_text().strip().replace('\n', ''),  # clear blanks and '\n'
            'describtion': describtion.get_text().strip(),
            'img': img.get('lazy_src'),  # careful with images: the real URL is in lazy_src
            'OriPrice': oriprice.get_text().strip(),
            'Url': title.get('href')
        })
    return 'success get url: ' + url


def main():
    testFile = open('test.csv', 'w')
    writer = csv.writer(testFile)
    writer.writerow(info)  # write the header row
    writer = csv.DictWriter(testFile, info)  # then switch to a DictWriter for the data rows
    for ps in urlJS:
        print(GetInfo(ps, writer))
    testFile.close()


if __name__ == '__main__':
    main()

SubTask2 58 second-hand item detail pages

This will presumably take a long time to run... I'll finish it before dinner and leave it running.

The result was...

When I came back from dinner I found...

...that I had given one of the functions an extra parameter.

Results

Same story with the page redesign: the links are the ones collected in the previous subtask, so I just used those URLs to scrape one level deeper :)

Scraping the view count seems easy, but I could not get the item condition or the posting time.

PostTimes = [PostTimes[1]]  # after adding this line the description fields could be scraped.. I hadn't expected two identical tags there QAQ
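
A quick pre-flight check would have caught this earlier: count how many elements each selector matches on one sample detail page before kicking off the long crawl. A minimal sketch (my own addition), reusing the commented-out test URL and the PostTime selector from the code below:

import requests
from bs4 import BeautifulSoup

# sample detail URL taken from the test URL in the script below
sample_url = 'http://zhuanzhuan.58.com/detail/761852078245855236z.shtml?fullCate=5%2C46%2C542&fullLocal=3&from=pc'
PostTimeSel = 'body > div.content > div > div.box_left > div > div > div > p'

soup = BeautifulSoup(requests.get(sample_url).text, 'lxml')
print(len(soup.select(PostTimeSel)))  # anything other than 1 deserves a closer look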

Code

#-*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import csv

urlJS = ['http://gz.58.com/jianshenqixie/0/pn{}/'.format(str(i)) for i in range(0, 10)]

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    'Cookie': '...'
}

info = ['kind', 'title', 'PostTime', 'price', 'New', 'Place']

titleSel = '#infolist > div.infocon > table > tbody > tr > td.t > a'

kindSel = '#nav > div > span > a'
IntitleSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.box_left_top > h1'
PostTimeSel = 'body > div.content > div > div.box_left > div > div > div > p'
priceSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.price_li > span > i'
NewSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.box_left_top > p > span.look_time'
PlaceSel = 'body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.palce_li > span > i'


urls = []

def GetInfo(url, writer):
    time.sleep(3)
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, "lxml")
    # the select calls below were generated with a small helper script (they are all repetitive)
    titles = soup.select(IntitleSel)
    prices = soup.select(priceSel)
    News = soup.select(NewSel)
    Places = soup.select(PlaceSel)
    kinds = soup.select(kindSel)
    PostTimes = soup.select(PostTimeSel)
    PostTimes = [PostTimes[1]]  # the selector matches two identical tags; keep only the second one

    for kind, title, PostTime, price, New, Place in zip(kinds, titles, PostTimes, prices, News, Places):
        writer.writerow(dict(kind=kind.get_text(), title=title.get_text(), PostTime=PostTime.get_text(),
                             price=price.get_text(), New=New.get_text(), Place=Place.get_text()))
    return 'successful get url: ' + url


def GetUrl(url):
    time.sleep(3)
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, "lxml")

    titles = soup.select(titleSel)
    for href in titles:
        urls.append(href.get('href'))
    return 'successful get base url: ' + url


def main():
    testFile = open('test1.csv', 'w')
    writer = csv.writer(testFile)
    writer.writerow(info)  # header row
    writer = csv.DictWriter(testFile, info)  # DictWriter for the data rows
    for ps in urlJS:
        print(GetUrl(ps))
    for pages in urls:
        GetInfo(pages, writer)  # should have wrapped this in print()..
    #testURL = 'http://zhuanzhuan.58.com/detail/761852078245855236z.shtml?fullCate=5%2C46%2C542&fullLocal=3&from=pc'
    #GetInfo(testURL, writer)
    testFile.close()


if __name__ == '__main__':
    main()

Review

The first task was basically a small review of:

  • requests for making the HTTP requests

  • BeautifulSoup for parsing the pages

  • soup.select(titleSel) to get elements

  • kind.get_text() && kind.get('src')

  • csv.writer(testFile) && writer.writerow(info) && writer = csv.DictWriter(testFile, info) (a tidier variant is sketched after this list)

  • generating the highly repetitive extraction code with a small script of my own

  • time.sleep

  • headers

  • -*- coding: utf8 -*-
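
On the csv point above: a minimal sketch (my own variant, not the original script) that uses DictWriter.writeheader() instead of mixing a plain writer and a DictWriter:

import csv

info = ['title', 'price', 'place']

with open('test.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.DictWriter(f, fieldnames=info)
    writer.writeheader()  # replaces the separate csv.writer(...).writerow(info) step
    writer.writerow({'title': 'demo', 'price': '100', 'place': 'Guangzhou'})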

Task2 Scraping JS view counts

58.com has been redesigned...

It seems the view count can now be scraped directly...

(Honestly, the site looks a lot better after the redesign.)

So I experimented on my own blog instead...

I turned back on the view counter that I had furiously disabled back when a domain change reset all the counts to zero... (shameless, I know)

inspect > sources > busuanzi > …

And then it turned out the Busuanzi counter can't be scraped this way???

#-*- coding: utf8 -*-
from bs4 import BeautifulSoup
import requests
import time
import csv

api = 'http://busaunzi.ibruce.info/busuanzi?jsonpCallback=BusuanziCallback_835222807248'
js = requests.get(api)  # the stray 'lxml' argument has been dropped; requests.get takes no parser argument
print(js.text)          # print the response body rather than the Response object
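
If the endpoint does respond, the body is JSONP (a JS callback wrapping a JSON object) rather than HTML, so BeautifulSoup won't help. A minimal sketch that strips the wrapper with a regex, under the assumption that the payload looks like BusuanziCallback_xxx({...}) and reusing the api URL above (Busuanzi may also check the Referer header to decide which site's counts to return):

import json
import re
import requests

api = 'http://busaunzi.ibruce.info/busuanzi?jsonpCallback=BusuanziCallback_835222807248'
resp = requests.get(api, headers={'Referer': 'http://example.com/'})  # Referer requirement is an assumption

# expected shape (assumption): BusuanziCallback_xxx({"site_pv": ..., "page_pv": ..., "site_uv": ...})
match = re.search(r'\((\{.*\})\)', resp.text)
if match:
    print(json.loads(match.group(1)))
else:
    print('unexpected response:', resp.text[:200])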

Video notes

  • Because of how 58.com pages are laid out, some information can be scraped straight from the page title: soup.title.text
  • url.split('/')[-1].strip('x.shtml') to get the listing id from the URL
  • ... if (...) else None for fields that may be missing (a small sketch follows this list)
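
A minimal sketch of those three notes together, on a made-up old-style 58 detail URL (the URL and the exact id format are assumptions):

import requests
from bs4 import BeautifulSoup

url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'  # hypothetical example URL

# last path segment is '24604629984324x.shtml'; strip('x.shtml') trims the trailing characters, leaving the id
item_id = url.split('/')[-1].strip('x.shtml')
print(item_id)

# some fields can be read straight from <title>; the conditional expression guards a missing tag
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.title.text if soup.title else None)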

ReadMores

Requests

BeautifulSoup

output

  • The prettify() method formats the Beautiful Soup document tree and outputs it as Unicode, with each XML/HTML tag on its own line.

  • The get_text() method returns all the text inside a tag, including the text of its descendant tags, as a single Unicode string:

    • soup.get_text("|", strip=True): the first argument is the separator, the second strips the surrounding whitespace
  • Lesser-known methods

    soup.title
    # <title>The Dormouse's story</title>

    Accessing a tag as an attribute only returns the first tag with that name:
    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    soup.title.name
    # u'title'

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    append()
    Tag.append() adds content to a tag, just like Python's list .append() method:
    soup = BeautifulSoup("<a>Foo</a>")
    soup.a.append("Bar")
    soup
    # <html><head></head><body><a>FooBar</a></body></html>
    soup.a.contents
    # [u'Foo', u'Bar']

  • Main methods


Pass a string to the .select() method to find tags using CSS selector syntax:
soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]


When searching by name or attribute, the value you pass in can be a string, a regular expression, a list, or True.

soup.find_all('a')  # the name argument finds every tag with that name; string objects are automatically skipped
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")  # a regular expression or a function can also be passed
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

soup.find_all("a", limit=2)  # returns only the first two results
Regular expressions

If you pass a regular expression as an argument, Beautiful Soup matches content using the expression's match() method. The example below finds every tag whose name starts with "b", which means both the <body> and <b> tags should be found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

The code below finds every tag whose name contains "t":

for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

Functions

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]

HTML5, XPath, HTTP

http1.3(download)

Summary

  • Always check whether a selector matches two identical tags before scraping at scale (scrape a few items first and generate a sample output file to check the format).
  • If an element has a uniquely identifying tag or style, you can select it directly, e.g. soup.select('time').
  • In a CSS selector, # stands for an id and . stands for a class (see the small sketch after this list).
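
A minimal illustration of the id/class selector syntax on a small hand-written snippet (the HTML below is made up for the example):

from bs4 import BeautifulSoup

# hypothetical markup, loosely shaped like a 58 listing row
html = '''
<div id="infolist">
  <table><tr>
    <td class="t"><a href="http://example.com/item1">Item 1</a></td>
    <td class="t"><a href="http://example.com/item2">Item 2</a></td>
  </tr></table>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

# '#infolist' selects by id, '.t' selects by class; combining them scopes the search
for a in soup.select('#infolist .t a'):
    print(a.get_text(), a.get('href'))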