2-2 Python Project: BS4 Web Crawling and Morphological Analysis

C.L.O.W.N 2021. 7. 17. 18:40


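Steps in the second project

1. Choose a project topic

2. Planning, data collection, and preprocessing

3. Data storage (notes on pandas rows and columns)

4. Visualization and automation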

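Crawling every community site myself would have taken more time and effort than I had, so I used a site called Issuelink (이슈링크), which already crawls a wide range of communities. It appears to use a bot to collect the posts.

On closer inspection, however, the site's "Today's Issue Tags Top 5" and "Community Best Keywords" did not reflect what most people were actually interested in. As far as I could tell, a keyword's rank rises according to how often it is mentioned across the communities.

So I worked on this Python project while thinking about how to find the keywords and issues that a large number of people really care about.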


# Record the current time
from datetime import datetime

now = datetime.now()
nowDatetime = now.strftime('%Y-%m-%d %H:%M')
print(nowDatetime)


#### Fetch the issue keywords ####
import requests
from bs4 import BeautifulSoup

url = 'https://www.issuelink.co.kr/community/listview/all/3/adj/_self/blank/blank/blank'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
html_rank = requests.get(url, headers=headers).text
soup_rank = BeautifulSoup(html_rank, 'lxml')
keywords = soup_rank.select('div.ibox.float-e-margins > div > table > tbody > tr > td > a')


# Collect the text of every matched keyword link
key_list = []
for k in keywords:
    keyword = k.text
    key_list.append(keyword)

# TODO: build a baseline list as of 7 o'clock: take only the hour and minute from datetime and compare them with an if statement

key_list.pop(0)  # discard the first matched entry
print(key_list)
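
The TODO in the code above (build a baseline list at 7 o'clock by comparing the hour and minute taken from datetime) could be sketched roughly as follows. The function name `is_baseline_time`, the variable `baseline_list`, and the placeholder keywords are assumptions for illustration; the post itself does not show this step.

```python
from datetime import datetime


def is_baseline_time(now, hour=7, minute=0):
    """Return True when `now` falls on the baseline hour and minute."""
    return (now.hour, now.minute) == (hour, minute)


# Hypothetical usage: snapshot the keyword list only at the baseline time
key_list = ['keyword A', 'keyword B']   # placeholder for the crawled list
baseline_list = None
now = datetime(2021, 7, 17, 7, 0)       # fixed time so the example is repeatable
if is_baseline_time(now):
    baseline_list = list(key_list)      # copy, so later crawls do not change it
print(baseline_list)
```

In a real run `now` would come from `datetime.now()`, and the check would only succeed if the script happens to run during that exact minute, so a scheduler would probably be needed as well.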

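This was only a guess at first, but when I checked, keywords that are mentioned often do sit near the top of the ranking even when their total view count is not high. The next step is to carry out this check in code.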


### Actual view counts for each issue ###

import requests
from bs4 import BeautifulSoup

sum_list = []
search_list = key_list

# Named sum_hits rather than sum so the built-in sum() is not shadowed
def sum_hits(search):

    url = f'https://www.issuelink.co.kr/community/listview/read/3/adj/_self/blank/{search}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    hits = soup.select('span.hit')

    # Add up the view counts of every post on the page ('1,234' -> 1234)
    total = 0

    for hit in hits:
        total += int(hit.text.replace(',', ''))

    sum_list.append(total)
    print('*', end='')  # progress marker

for search in search_list:
    sum_hits(search)

# Pair each keyword with its total view count
sum_search = dict(zip(key_list, sum_list))

# Sort by view count, highest first
#a = sorted(sum_search.items(), key=lambda x:x[1], reverse = True)

print()
print(sum_search)
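
The commented-out sorting line can be tried on its own with made-up numbers. The keywords and view counts below are placeholders in the same shape as `sum_search`, not crawled data.

```python
# Placeholder data in the same shape as sum_search: {keyword: total views}
sum_search = {'keyword A': 1200, 'keyword B': 98000, 'keyword C': 45000}

# sorted() on dict.items() returns a list of (keyword, views) tuples
ranked = sorted(sum_search.items(), key=lambda x: x[1], reverse=True)

# Keep only the keyword names of the two most-viewed entries
top_keywords = [keyword for keyword, views in ranked[:2]]
print(ranked)
print(top_keywords)
```

Because the result is a list of tuples rather than a dict, each keyword is read with an index such as `ranked[0][0]`.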

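I checked how many views each trending keyword gets across the communities when it is searched. The problem is that people have their own leanings, and some strongly dislike particular sites. Showing the most-viewed community post for each keyword therefore seemed awkward, so I decided to show Naver news articles instead.

Naver news has its own catch, though. At the time, Brave Girls (브레이브걸스) became an issue because their older male fans did not understand the phrase '로션은 알테니 스킵' (roughly, "you already know lotion, so skip"). But Brave Girls were popular in their own right, so searching the news for Brave Girls alone could return a different story instead of this one. To extract a sub-keyword to search together with the main keyword, I ran a morphological analysis.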
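
The sub-keyword idea can be illustrated with a minimal sketch: count the words that appear across the post titles for a keyword, skip the keyword itself, and take the most frequent remaining word. This sketch uses plain whitespace splitting as a stand-in for real noun extraction, the function name `pick_sub_keyword` is hypothetical, and the titles are made up for the example.

```python
from collections import Counter


def pick_sub_keyword(keyword, titles):
    """Return the most frequent token other than the keyword, or None."""
    counts = Counter()
    for title in titles:
        # Stand-in for noun extraction: plain whitespace tokens
        counts.update(token for token in title.split() if token != keyword)
    most_common = counts.most_common(1)
    return most_common[0][0] if most_common else None


# Made-up titles for illustration only
titles = [
    '브레이브걸스 로션 스킵',
    '브레이브걸스 로션 이야기',
    '브레이브걸스 컴백',
]
print(pick_sub_keyword('브레이브걸스', titles))
```

Searching the news for the keyword plus this sub-keyword narrows the results toward the specific issue rather than the group in general.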


#### Extract post titles for each of the top 30 keywords ####

import os

import requests
from bs4 import BeautifulSoup

# Create the output directory
path = "./subject"
if not os.path.isdir(path):
    os.mkdir(path)

keyword_list = []

def subject(search):
    url = f'https://www.issuelink.co.kr/community/listview/all/3/adj/_self/blank/{search}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    sub = soup.select('span.title')

    keyword_list.clear()

    for i in sub:
        # Keep only the text before the first ' [' in each title
        split_string = i.get_text().split(' [', 1)
        substring = split_string[0]
        keyword_list.append(substring)

    # Write one title per line; writelines() alone adds no line breaks,
    # which would glue the last word of one title to the first of the next
    with open(f'./subject/{search}.txt', 'w', encoding='utf-8') as file:
        file.write('\n'.join(keyword_list))

    print('**', end="")  # progress marker


for search in search_list:
    subject(search)

print()
print('Done')
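
The `split(' [', 1)` call in `subject` keeps only the part of a title that comes before the first ` [`. Assuming the titles on the page end with a bracketed suffix such as a comment count (an assumption about the page layout, not something the post states), the effect looks like this. The helper name `strip_bracket_suffix` and the sample titles are hypothetical.

```python
def strip_bracket_suffix(title):
    """Return the title text before the first ' [' (unchanged if absent)."""
    return title.split(' [', 1)[0]


print(strip_bracket_suffix('Some post title [12]'))
print(strip_bracket_suffix('No suffix here'))
```

Passing `1` as the second argument limits the split to the first occurrence, so the title is cut at the first ` [` even if more brackets follow.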

#### Count the nouns in each keyword's titles ####

""" Morphological analyzer
    Extract nouns and count their frequencies
    python [module name] [text file.txt] [result file.txt]
"""

from collections import Counter

from konlpy.tag import Okt  # Okt replaces the deprecated Twitter class (KoNLPy 0.5.0+)


def get_tags(text, ntags=50):
    spliter = Okt()
    nouns = spliter.nouns(text)
    count = Counter(nouns)
    return_list = []
    for n, c in count.most_common(ntags):
        temp = {'tag': n, 'count': c}
        return_list.append(temp)
    return return_list


def main(search):
    noun_count = 50  # number of top nouns to keep
    # Read the titles saved for this keyword
    with open(f'./subject/{search}.txt', 'r', encoding='utf-8') as open_text_file:
        text = open_text_file.read()
    tags = get_tags(text, noun_count)
    # Write one "noun count" line per noun to {search}-count.txt
    with open(f"./subject/{search}-count.txt", 'w', encoding='utf-8') as open_output_file:
        for tag in tags:
            noun = tag['tag']
            count = tag['count']
            open_output_file.write('{} {}\n'.format(noun, count))

for search in search_list:
    main(search)

print('Done')
 

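At the time I did not know how to do morphological analysis properly, so I relied on code I found through a Google search. Also, because I considered the sub-keywords more important than the titles themselves, I saved the titles to files only temporarily and deleted them after some time had passed.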
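
The post does not show how the temporary title files were deleted. One possible approach, sketched here under that caveat, is to remove files whose modification time is older than a chosen age. The function name `remove_old_files` and its parameters are hypothetical.

```python
import os
import time


def remove_old_files(folder, max_age_seconds, now=None):
    """Delete regular files in `folder` older than `max_age_seconds`.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Compare the file's last-modified time against the age limit
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

For example, `remove_old_files('./subject', 24 * 60 * 60)` would clear title files that are more than a day old.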


sub_key = {}

for i in sum_search:

    with open(f'C:/Workspace/project2_final/output/temp/{i}-count.txt', 'r', encoding='utf-8') as file:
        # Take the second line of the count file, i.e. the second most frequent noun
        data = str(file.readlines()[1])

    split_string = data.split(' ', 1)
    substring = split_string[0]           # drop the frequency count
    #print(substring)

    sub_key[i] = substring


print(sub_key)

# Naver news search using keyword + sub-keyword
import requests
from bs4 import BeautifulSoup

art_lists = []

def search(key, b):

    art_list = [b]

    #url = f'https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query={key}'
    url = f'https://search.naver.com/search.naver?where=news&sm=tab_jum&query={key}'
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')

    # Keep the links of the first two matches; slicing avoids an IndexError
    # when the page returns fewer than two
    for link in soup.select('div.info_group > a:nth-of-type(2)')[:2]:
        art_list.append(link.attrs["href"])

    art_lists.append(art_list)

# Sort by value (total views), highest first
a = sorted(sum_search.items(), key=lambda x: x[1], reverse=True)


for i in range(min(5, len(a))):   # top 5 keywords
    b = a[i][0]     # sorted() returns a list of (keyword, views) tuples, so read the keyword with [i][0]
    #print(b)

    with open(f'C:/Workspace/project2_final/output/temp/{b}-count.txt', 'r', encoding='utf-8') as file:
        data = str(file.readlines()[1])

    split_string = data.split(' ', 1)
    substring = split_string[0]           # drop the frequency count
    #print(substring)

    key = b + " " + substring
    search(key, b)

print(art_lists)
print(sum_search.items())
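
The search phrase contains Korean text and a space, so it has to be percent-encoded somewhere before the request is sent. As an optional alternative to formatting the phrase directly into the URL, the sketch below builds the same Naver news URL with the standard library's `urlencode`. The helper name `build_news_url` and the sample phrase are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_news_url(key):
    """Build the Naver news search URL with the query percent-encoded."""
    params = {'where': 'news', 'sm': 'tab_jum', 'query': key}
    return 'https://search.naver.com/search.naver?' + urlencode(params)


url = build_news_url('브레이브걸스 로션')
print(url)
# Decoding the query string gives back the original search phrase
print(parse_qs(urlsplit(url).query)['query'][0])
```

Passing the same dictionary as `params=` to `requests.get` has a similar effect without building the string by hand.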
 