Posts tagged 'Python project'


3-4 Python Team Project: Evaluating the CNN, AlexNet, and VGG-16 Models

C.L.O.W.N 2021. 8. 9. 16:26


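Third project outline

1. Classifying the plant disease data files

2. CNN modeling

3. Visualization

4. Evaluating the CNN, AlexNet, and VGG-16 models

This is the model evaluation and performance testing stage, done before preparing the project presentation.

As an aside, the project name "Greenlight" comes from the "green light" that was used so often on the TV show 마녀사냥 (Witch Hunt), where a light switches on. We added a bit of wordplay: the name combines "right", for correctly telling whether a plant is sick or not, with "light", for shedding light on it.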

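AlexNet, VGG-16, and GoogLeNet (2012 to 2014) and ResNet and InceptionV3 (after 2014) are models commonly used for image recognition. Models like these have been compared every year at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). We heard in class that the accuracy gap between entries has narrowed so much that a difference of 0.01% can decide the ranking.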

AlexNet

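In 2012, AlexNet outperformed all earlier competitors by a wide margin, cutting the top-5 error rate from 26% to 15.3%. It has 8 layers in total: 5 convolutional layers followed by 3 fully connected layers. It uses 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, and ReLU activations.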
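To make the "top-5 error" figure concrete, here is a minimal sketch of how such a metric can be computed. The function name `top_k_error` and the toy scores are made up for illustration; this is not the official ILSVRC evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by score, highest first; keep the first k
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy scores for 2 samples over 6 classes (hypothetical numbers)
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
]
labels = [2, 5]  # sample 1 is a hit, sample 2 (lowest score) is a miss

print(top_k_error(scores, labels, k=5))  # 0.5
```

A prediction counts as correct as long as the true class appears anywhere in the five best guesses, which is why top-5 error is always less than or equal to top-1 error.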

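Input size: 256x256x3 images (the paper trains on 224x224 crops of them); the model below is scaled down to 64x64x3 inputs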


import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Dropout, Flatten, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


def lrn(x):
    # Local response normalization, wrapped in a Lambda layer so it works in the functional API
    return Lambda(lambda t: tf.nn.local_response_normalization(
        t, depth_radius=4, bias=1.0, alpha=0.001 / 9.0, beta=0.75))(x)


def Alexnet_model():
    # Scaled-down AlexNet for 64x64 RGB inputs and 20 classes
    inputs = Input(shape=(64, 64, 3))

    conv1 = Conv2D(filters=48, kernel_size=(10, 10), strides=1, padding="valid", activation='relu')(inputs)
    pool1 = MaxPooling2D(pool_size=(3, 3), strides=2, padding="valid")(conv1)
    nor1 = lrn(pool1)

    # groups=2 mimics the two-GPU split of the original AlexNet
    conv2 = Conv2D(filters=128, kernel_size=(3, 3), strides=1, padding="same", groups=2, activation='relu')(nor1)
    pool2 = MaxPooling2D(pool_size=(3, 3), strides=2, padding="valid")(conv2)
    nor2 = lrn(pool2)

    conv3 = Conv2D(filters=192, kernel_size=(3, 3), strides=1, padding="same", activation='relu')(nor2)
    conv4 = Conv2D(filters=192, kernel_size=(3, 3), strides=1, padding="same", activation='relu')(conv3)
    conv5 = Conv2D(filters=128, kernel_size=(3, 3), strides=1, padding="same", activation='relu')(conv4)
    pool3 = MaxPooling2D(pool_size=(3, 3), strides=2, padding="valid")(conv5)
    drop1 = Dropout(0.5)(pool3)
    nor3 = lrn(drop1)

    flat = Flatten()(nor3)
    dense1 = Dense(units=2048, activation='relu')(flat)
    dense2 = Dense(units=1024, activation='relu')(dense1)
    logits = Dense(units=20, activation='softmax')(dense2)   # class probabilities

    return Model(inputs=inputs, outputs=logits)


# `otm` is not defined in the post; Adam with default settings is an assumption
otm = Adam()

model = Alexnet_model()
model.summary()
# 20-way softmax with one-hot labels calls for categorical, not binary, cross-entropy
model.compile(loss='categorical_crossentropy', optimizer=otm, metrics=["accuracy"])

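GoogLeNet

The winner of ILSVRC 2014 was Google's GoogLeNet (also called Inception V1). It achieved a top-5 error rate of 6.67%, which is said to be very close to human-level performance. The network is 22 layers deep, or 27 layers when the pooling layers are counted.

Results for our models, which we adapted to our image size, when trained with EarlyStopping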
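The post does not say which EarlyStopping settings were used, so the following is only a sketch of the patience rule that this kind of callback applies to the validation loss. The function name `early_stopping_epoch`, the `patience` value, and the loss history are all hypothetical; it illustrates the idea and is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 2
history = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59]
print(early_stopping_epoch(history, patience=3))  # 5
```

With a patience of 3, training stops at epoch 5 and never sees the better loss at epoch 6, which is the usual trade-off between training time and the chance of a late improvement.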

VGG-16

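The runner-up of ILSVRC 2014 is known in the community as VGGNet. VGG-16 has 16 weight layers (13 convolutional and 3 fully connected) arranged in a very uniform architecture. It is similar to AlexNet but uses only 3x3 convolutions, with many more filters. It is a popular choice in the community for extracting features from images. Its drawbacks are that training takes a long time and the network weights are large.

Image size: 224x224 (the model below uses 64x64 inputs)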
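Because our model keeps the five pooling stages of VGG-16 but feeds it 64x64 images, it helps to check how small the feature map gets before the dense layers. The helper below is a small sketch written for this purpose (the name `flattened_features` is made up); it assumes 2x2 max pooling with stride 2 and "valid" padding, as in the code that follows.

```python
def flattened_features(input_size, num_pools, channels):
    """Return (spatial size, flattened feature count) after repeated 2x2/stride-2 pooling."""
    size = input_size
    for _ in range(num_pools):
        size //= 2  # each 2x2 pooling with stride 2 halves the width and height
    return size, size * size * channels


print(flattened_features(64, 5, 128))   # (2, 512): our 64x64 variant with 128 final filters
print(flattened_features(224, 5, 512))  # (7, 25088): standard VGG-16 input and filter count
```

The 64x64 variant ends with only a 2x2 map, so the first dense layer sees 512 features instead of the 25,088 of the standard network, which is a large part of why it trains so much faster.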


from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, BatchNormalization,
                                     Dropout, Flatten, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


def conv_block(x, filters, n_convs):
    # n_convs 3x3 convolutions, then 2x2 max pooling and batch normalization
    for _ in range(n_convs):
        x = Conv2D(filters=filters, kernel_size=(3, 3), strides=1, padding="same", activation='relu')(x)
    x = MaxPooling2D(pool_size=(2, 2), strides=2, padding="valid")(x)
    return BatchNormalization()(x)


def vgg_16():
    # VGG-16 layout (2-2-3-3-3 conv blocks) with fewer filters, for 64x64 RGB inputs
    inputs = Input(shape=(64, 64, 3))
    x = conv_block(inputs, 32, 2)    # 64 -> 32
    x = conv_block(x, 64, 2)         # 32 -> 16
    x = conv_block(x, 128, 3)        # 16 -> 8
    x = conv_block(x, 128, 3)        # 8 -> 4
    x = conv_block(x, 128, 3)        # 4 -> 2
    x = Dropout(0.5)(x)

    x = Flatten()(x)
    x = Dense(units=2048, activation='relu')(x)
    x = Dense(units=1024, activation='relu')(x)

    logits = Dense(units=20, activation='softmax')(x)   # class probabilities
    return Model(inputs=inputs, outputs=logits)


# `otm` is not defined in the post; Adam with default settings is an assumption
otm = Adam()

model = vgg_16()
model.compile(loss='categorical_crossentropy', optimizer=otm, metrics=["accuracy"])
model.summary()

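Model performance evaluation

We struggled for a while trying to collect external data ourselves, then found that someone had kindly uploaded a suitable dataset to GitHub, which we gratefully used.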

https://github.com/spMohanty/PlantVillage-Dataset/tree/master/raw/color



##########################
# Evaluate 50 randomly chosen test images per class
##########################
import os
import random

import cv2
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import img_to_array

# `combined_labels` is the list of 20 class names built in the earlier steps of this project.
TEST_DIR = "./tests"   # renamed from TEST_DATA to tests
NUM_CLASSES = 20
SAMPLES_PER_CLASS = 50
# Must match the scaling used at training time. The original code divides by 225.0,
# which is probably a typo for 255.0; if you change it, change the training code too.
PIXEL_SCALE = 225.0

class_ids = [str(i) for i in range(NUM_CLASSES)]
label_to_id = dict(zip(combined_labels, class_ids))
id_to_label = dict(zip(class_ids, combined_labels))


def rename_folders(path, mapping):
    """Rename each sub-folder of `path` according to `mapping`; return how many were renamed."""
    renamed = 0
    for fname in os.listdir(path):
        newname = mapping.get(fname)
        if newname is not None and newname not in os.listdir(path):
            os.rename(os.path.join(path, fname), os.path.join(path, newname))
            renamed += 1
    return renamed


########################################
# Rename the label folders to class ids
if rename_folders(TEST_DIR, label_to_id):
    print("Renamed folders to class ids")
else:
    print("Already renamed")

########################################
### Optional ###
# To restore the labeled names later:
# rename_folders(TEST_DIR, id_to_label)

########################################
model = load_model('model1.h5')     # load your own model


def convert_image_to_array(image_path):
    """Read an image and resize it to 64x64; return None if it cannot be read."""
    image = cv2.imread(image_path)
    if image is None:
        return None
    return img_to_array(cv2.resize(image, dsize=(64, 64)))


def predict_class(image_path):
    """Return the predicted class id as a string, or None for an unreadable image."""
    image_array = convert_image_to_array(image_path)
    if image_array is None:
        return None
    np_image = np.expand_dims(image_array.astype(np.float32) / PIXEL_SCALE, 0)
    # predict_classes() is gone from recent Keras versions; take the argmax of the softmax output
    return str(np.argmax(model.predict(np_image), axis=-1)[0])


for class_id in class_ids:
    class_dir = os.path.join(TEST_DIR, class_id)
    files = os.listdir(class_dir)
    correct = 0

    for _ in range(SAMPLES_PER_CLASS):
        # random.choice avoids the off-by-one IndexError of random.randint(0, len(files))
        image_path = os.path.join(class_dir, random.choice(files))
        if predict_class(image_path) == class_id:
            correct += 1

    print(f'###### Accuracy for test data {id_to_label.get(class_id)} #######')
    print('accuracy: {:0.5f}'.format(correct / SAMPLES_PER_CLASS))

print('Test complete')

###### Accuracy for test data Corn_(maize)_Common_rust_ #######
accuracy: 1.00000
###### Accuracy for test data Corn_(maize)_healthy #######
accuracy: 1.00000
###### Accuracy for test data Grape_Black_rot #######
accuracy: 0.96000
###### Accuracy for test data Grape_Esca_(Black_Measles) #######
accuracy: 0.94000
###### Accuracy for test data Grape_Leaf_blight_(lsariopsis_Leaf_Spot) #######
accuracy: 0.98000
###### Accuracy for test data Orange_Haunglongbing_(Citrus_greening) #######
accuracy: 0.98000
###### Accuracy for test data Pepper,_bell_Bacterial_spot #######
accuracy: 0.94000
###### Accuracy for test data Pepper,_bell_healthy #######
accuracy: 0.98000
###### Accuracy for test data Potato_Early_blight #######
accuracy: 1.00000
###### Accuracy for test data Potato_Late_blight #######
accuracy: 0.98000
###### Accuracy for test data Soybean_healthy #######
accuracy: 0.96000
###### Accuracy for test data Squash_Powdery_mildew #######
accuracy: 0.94000
###### Accuracy for test data Tomato_Bacterial_spot #######
accuracy: 0.90000
###### Accuracy for test data Tomato_Early_blight #######
accuracy: 0.90000
###### Accuracy for test data Tomato_Late_blight #######
accuracy: 0.84000
###### Accuracy for test data Tomato_Septoria_leaf_spot #######
accuracy: 0.96000
###### Accuracy for test data Tomato_Spider_mites_Two-spotted_spider_mite #######
accuracy: 0.96000
###### Accuracy for test data Tomato_Target_Spot #######
accuracy: 0.96000
###### Accuracy for test data Tomato_Tomato_Yellow_Leaf_Curl_Virus #######
accuracy: 0.98000
###### Accuracy for test data Tomato_healthy #######
accuracy: 0.98000
Test complete
----------------------------------------------------------------------------------------------------
Test accuracy: 0.957
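The overall figure can be reproduced from the per-class numbers above. The snippet below is a small check written for this write-up, not part of the original test script; the list simply copies the 20 accuracies in the order they were printed.

```python
# Per-class accuracies copied from the log above, in printed order
per_class_accuracy = [
    1.00, 1.00, 0.96, 0.94, 0.98, 0.98, 0.94, 0.98, 1.00, 0.98,
    0.96, 0.94, 0.90, 0.90, 0.84, 0.96, 0.96, 0.96, 0.98, 0.98,
]

# Every class is sampled 50 times, so the plain mean equals the accuracy over all 1,000 samples
overall = sum(per_class_accuracy) / len(per_class_accuracy)
print(f"{overall:.3f}")          # 0.957
print(min(per_class_accuracy))   # 0.84 (Tomato_Late_blight, the weakest class)
```

Since each class contributes the same number of samples, there is no difference here between the macro average and the overall accuracy; with unequal sample counts the two would diverge.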

This is the graph a teammate drew from the accuracy and loss values.

Thanks for all the hard work.


2-2 Python Project: BS4 Web Crawling, Morphological Analysis

C.L.O.W.N 2021. 7. 17. 18:40


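Second project outline

1. Choosing the project topic

2. Planning, data collection, and preprocessing

3. Storing the data (notes on pandas rows and columns)

4. Visualization and automation

Crawling every community site ourselves would have taken more time and effort than we had, so we used a site called IssueLink (이슈링크), which already crawls many communities. It appears to use bots to scrape the posts.

On inspection, however, the "Today's issue tags Top 5" and the communities' best keywords were not things that a large number of people were actually interested in. As far as we could tell, a keyword rises in the ranking according to how often it is mentioned in the communities.

So we carried out this Python project while thinking about how to find the keywords and issues that many people really care about.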


# Record the current time
from datetime import datetime

now = datetime.now()
nowDatetime = now.strftime('%Y-%m-%d %H:%M')
print(nowDatetime)


#### Fetch the issue keywords ####
import requests
from bs4 import BeautifulSoup

url = 'https://www.issuelink.co.kr/community/listview/all/3/adj/_self/blank/blank/blank'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
html_rank = requests.get(url, headers=headers).text
soup_rank = BeautifulSoup(html_rank, 'lxml')
keywords = soup_rank.select('div.ibox.float-e-margins > div > table > tbody > tr > td > a')

# Keep only the link text of each ranked keyword
key_list = [k.text for k in keywords]

###### TODO: build a baseline list at 7 o'clock / take only the hour and minute from datetime and compare with if

# Discard the first matched link
key_list.pop(0)
print(key_list)

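That was only a guess at first, but when we checked, keywords did rank near the top when they were mentioned often, even if their total view count was not high. The next step does this check in code.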


### Actual view counts per issue ###

import requests
from bs4 import BeautifulSoup

sum_list = []
search_list = key_list


def total_hits(search):
    # Renamed from sum() so that the built-in sum is not shadowed
    url = f'https://www.issuelink.co.kr/community/listview/read/3/adj/_self/blank/{search}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    hits = soup.select('span.hit')

    # Add up the view counts of every post listed for this keyword
    total = 0
    for hit in hits:
        total += int(hit.text.replace(',', ''))

    sum_list.append(total)
    print('*', end='')   # progress marker


for search in search_list:
    total_hits(search)

# Pair each keyword with its total view count
sum_search = dict(zip(key_list, sum_list))

# Sort by view count
#a = sorted(sum_search.items(), key=lambda x:x[1], reverse = True)

print()
print(sum_search)

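We checked how many views each issue keyword gets in each community when searched. The problem is that people have their own leanings, and some strongly dislike being shown particular sites. Showing the most-viewed community post therefore felt awkward, so we decided to show Naver News articles instead.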
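Ranking keywords by total views rather than by mention count comes down to sorting the keyword-to-views dictionary by value. A minimal sketch, with a hypothetical helper `rank_by_views` and made-up numbers standing in for the crawled data:

```python
def rank_by_views(view_totals, top_n=5):
    """Return the top_n (keyword, total_views) pairs, highest view count first."""
    return sorted(view_totals.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical totals; the real values come from the crawled view counts
sample = {"keyword_a": 1200, "keyword_b": 56000, "keyword_c": 8300}
print(rank_by_views(sample, top_n=2))  # [('keyword_b', 56000), ('keyword_c', 8300)]
```

Note that `sorted` on `dict.items()` returns a list of tuples, so the result is indexed as `ranked[i][0]` for the keyword and `ranked[i][1]` for its views.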

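Naver News has its own problem, though. For example, Brave Girls became an issue at the time because their older male fans did not understand the phrase "로션은 알테니 스킵" (roughly, "you already know lotion, so skip"). But Brave Girls were popular in their own right, so searching the news for "Brave Girls" alone could return stories unrelated to this one. To extract a sub-keyword to search along with the main keyword, we ran a morphological analysis.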
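The idea of a sub-keyword can be sketched without a morphological analyzer: given the nouns already extracted from the post titles, pick the most frequent one that is not the main keyword itself. The function `pick_sub_keyword` and the sample nouns are hypothetical; the actual project takes the second line of a frequency file instead, which gives the same result only when the main keyword is the most frequent noun.

```python
from collections import Counter


def pick_sub_keyword(nouns, main_keyword):
    """Return the most frequent noun other than main_keyword, or None if there is none."""
    counts = Counter(n for n in nouns if n != main_keyword)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical nouns extracted from post titles about one issue
nouns = ["BraveGirls", "lotion", "skip", "BraveGirls", "lotion", "fans"]
print(pick_sub_keyword(nouns, "BraveGirls"))  # lotion
```

Searching for "BraveGirls lotion" instead of "BraveGirls" alone narrows the news results to the specific story.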


#### Extract post titles for each of the top 30 keywords ####

import os

import requests
from bs4 import BeautifulSoup

# Create the output directory
path = "./subject"
os.makedirs(path, exist_ok=True)

keyword_list = []


def subject(search):
    url = f'https://www.issuelink.co.kr/community/listview/all/3/adj/_self/blank/{search}'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    sub = soup.select('span.title')

    keyword_list.clear()

    for i in sub:
        # Keep only the part of the title before the " [" suffix
        split_string = i.get_text().split(' [', 1)
        substring = split_string[0]
        keyword_list.append(substring)

    # One title per line; writelines() alone would run the titles together
    with open(f'./subject/{search}.txt', 'w', encoding='utf-8') as file:
        file.write('\n'.join(keyword_list))

    print('**', end="")   # progress marker


for search in search_list:
    subject(search)

print()
print('Done')

#### Count nouns per keyword ####

""" Morphological analyzer
    Extracts nouns and counts their frequency
"""

from collections import Counter

from konlpy.tag import Okt   # the Twitter tagger was renamed to Okt in KoNLPy 0.5.0

spliter = Okt()   # create the tagger once; starting it is slow


def get_tags(text, ntags=50):
    # Return the ntags most frequent nouns as [{'tag': noun, 'count': n}, ...]
    nouns = spliter.nouns(text)
    count = Counter(nouns)
    return_list = []
    for n, c in count.most_common(ntags):
        temp = {'tag': n, 'count': c}
        return_list.append(temp)
    return return_list


def main(search):
    noun_count = 50
    # Read the titles saved for this keyword
    with open(f'./subject/{search}.txt', 'r', encoding='utf-8') as open_text_file:
        text = open_text_file.read()
    tags = get_tags(text, noun_count)
    # Write one "noun count" line per noun to <keyword>-count.txt
    with open(f"./subject/{search}-count.txt", 'w', encoding='utf-8') as open_output_file:
        for tag in tags:
            noun = tag['tag']
            count = tag['count']
            open_output_file.write('{} {}\n'.format(noun, count))


for search in search_list:
    main(search)

print('Done')
 

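I did not know how to do morphological analysis properly at the time, so I looked it up on Google. Since I considered the sub-keywords more important than the titles, the titles are saved to a file only temporarily and deleted after some time has passed.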
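The post does not show how the temporary title files are deleted, so the following is only one possible sketch. It assumes the title files are the `.txt` files in a folder, that the `-count.txt` frequency files should be kept, and that "after some time" means a file age threshold; the function name `remove_stale_files` and its parameters are hypothetical.

```python
import os
import time


def remove_stale_files(directory, max_age_seconds, suffix=".txt"):
    """Delete title files older than max_age_seconds; return the sorted names removed."""
    now = time.time()
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        is_title_file = name.endswith(suffix) and not name.endswith("-count.txt")
        # Age is measured from the file's last modification time
        if is_title_file and os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```

For example, `remove_stale_files("./subject", 24 * 60 * 60)` would drop title files more than a day old while leaving the frequency files in place.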


sub_key = {}

for i in sum_search:
    # Path kept as in the original post; note that the previous step wrote its files to ./subject/
    with open(f'C:/Workspace/project2_final/output/temp/{i}-count.txt', 'r', encoding='utf-8') as file:
        data = str(file.readlines()[1])   # second line = second most frequent noun

    split_string = data.split(' ', 1)
    substring = split_string[0]           # drop the frequency
    #print(substring)

    sub_key[i] = substring


print(sub_key)

# Naver news search with keyword + sub-keyword
import requests
from bs4 import BeautifulSoup

art_lists = []


def search(key, b):
    # Collect up to two article links for the query `key`, labelled with keyword `b`
    art_list = [b]

    #url = f'https://search.naver.com/search.naver?where=nexearch&sm=top_hty&fbm=1&ie=utf8&query={key}'
    url = f'https://search.naver.com/search.naver?where=news&sm=tab_jum&query={key}'
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')

    links = soup.select('div.info_group > a:nth-of-type(2)')
    for link in links[:2]:   # slicing avoids an IndexError when fewer than two results exist
        art_list.append(link.attrs["href"])

    art_lists.append(art_list)


# Sort by value (total views), descending; the first five are the top keywords
a = sorted(sum_search.items(), key=lambda x: x[1], reverse=True)

for b, _ in a[:5]:   # sorting turns the dict into a list of (keyword, views) tuples
    with open(f'C:/Workspace/project2_final/output/temp/{b}-count.txt', 'r', encoding='utf-8') as file:
        data = str(file.readlines()[1])   # second line = second most frequent noun

    split_string = data.split(' ', 1)
    substring = split_string[0]           # drop the frequency

    key = b + " " + substring
    search(key, b)

print(art_lists)
print(sum_search.items())
 

2-1 Python Team Project: Planning an Issue Predictor to Replace Naver's Real-Time Search Ranking

C.L.O.W.N 2021. 7. 16. 19:37


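Second project outline

1. Choosing the project topic

2. Planning, data collection, and preprocessing

3. Storing the data (notes on pandas rows and columns)

4. Visualization and automation

Topic: a summary of the day's issues and things worth seeing

Schedule: 4/29 to 5/7 (it was practically a May project, and I am only writing it up now...)

Plan: a service that delivers gossip together with expert opinions at the time of day when you wrap up your day

[Naver discontinues its real-time search ranking]

Naver's real-time search ranking was discontinued abruptly. The stated reason was that advertising keywords kept appearing in the ranking, which I find absurd. Naver could have built a separate panel for product keywords or analyzed the traffic to suppress them, but instead it took away a convenience for users. Frankly, many people probably used Naver just to see the ranking, so I wonder how much impact this had.

(Naver had earlier changed its home screen from one that showed news and the real-time ranking to one that seems to imitate Google, and I wonder how it made up for that enormous loss. Naver blogs were taken over by viral marketing long ago, and influencers buy followers with money, so I also wonder how it will win back trust. But that is a digression, so I will move on.)

[How an issue forms]

When something you had disappears, you miss it. So I analyzed how an issue forms: what people are interested in and what events are happening.

1. An event first blows up on social media or in a community.

2. It gets shared from community to community, and reporters who frequent those communities start writing articles about it.

(Sometimes the article itself becomes the talking point.)

3. People who read the articles get curious about the details and search for them.

4. The topic climbs the real-time search ranking, and people who have not seen the articles or the story click on it there.

=> Given this, the idea is to predict issues by identifying what is becoming an issue in the communities.

Having learned Python task automation, this time I want to build a system that runs automatically and sends emails automatically. I originally planned to build a web service with Django, but it was taking longer than expected, so I will try that next time.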
