Posts tagged 'crawling'

Crawling

(6) Office automation: automatically capturing a specific part of a page after a web search (screenshots, zooming in and out)

C.L.O.W.N 2022. 1. 24. 19:30


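Contents (in progress)

(1) Python Excel office automation you'll be 100% happy with. What if it doesn't work at your company?

(2) Getting past Excel protection in one go

(3) Cutting time 50x: implementing everyday Excel functions (VLOOKUP, INDEX/MATCH, etc.)

(4) Applying everyday Excel functions

(5) Handling useful everyday Excel functions and features in Python

(6) How long will you keep doing it by hand? Crawling plus automatic captures and screenshots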

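It's been a while since I last posted.. and this is the final post in the office automation series. This is as far as I had planned for now; if something else turns out to be needed, I'll probably end up building that too.. Crawling with bs4 or Selenium comes together quickly, but automatic captures and screenshots were something I was trying for the first time. Most of what turns up in a search is about full-screen captures, and there was little explanation of capturing a specific element or region.

Many approaches were convoluted: work out the coordinates, height, and width, take a full-screen screenshot, then cut out the part you want. There is also an easy way to capture a specific element in Chrome itself. It felt like discovering a whole new world, but it isn't automation, so I passed on it.. Still, here is a quick introduction.

Say you want only the part inside the red box on the GSMARENA site. First open the developer tools with F12. Then, in Elements, find the class or id that contains that part and click it. After that, press Ctrl + Shift + P.

That opens the developer command menu; type the abbreviation cnod. "Capture node screenshot" comes up, and clicking it captures right away. The result is clean and the quality isn't bad either. But what good is that.. if it isn't automatic, you have to repeat this every time.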

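from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.gsmarena.com/samsung_galaxy_s21_fe_5g-10954.php')  # page to capture (the URL used later in this post)

element = driver.find_element(By.ID, 'specs-list')

# Option 1: let Selenium write the PNG file directly
element.screenshot('screenshot.png')

# Option 2: take the PNG bytes and write the file yourself
element_png = element.screenshot_as_png
with open('screenshot.png', 'wb') as file:
    file.write(element_png)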


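Fortunately, Selenium can capture just a specific element. There is screenshot, which saves straight to a file, and screenshot_as_png, which returns the PNG data for you to save; in practice there is hardly any difference. When things go this smoothly, development isn't any fun...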
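For comparison, the coordinate-based approach mentioned earlier boils down to turning an element's position and size into a crop box for a full-page screenshot. The sketch below is only an illustration: crop_box is a hypothetical helper, the input dictionaries merely mimic the shape of Selenium's element.location and element.size, and device_pixel_ratio is an assumed parameter for screens where the screenshot has more pixels than the CSS layout.

```python
def crop_box(location, size, device_pixel_ratio=1.0):
    """Return (left, top, right, bottom) in screenshot pixels for one element.

    location and size are dicts shaped like Selenium's element.location
    ({'x', 'y'}) and element.size ({'width', 'height'}), in CSS pixels.
    """
    left = round(location["x"] * device_pixel_ratio)
    top = round(location["y"] * device_pixel_ratio)
    right = round((location["x"] + size["width"]) * device_pixel_ratio)
    bottom = round((location["y"] + size["height"]) * device_pixel_ratio)
    return left, top, right, bottom


# Hypothetical numbers, not measured from a real page
box = crop_box({"x": 10, "y": 200}, {"width": 364, "height": 500}, device_pixel_ratio=2)
print(box)
```

The resulting tuple has the (left, upper, right, lower) order that Pillow's Image.crop expects. It only lines up with the screenshot when the page has not been scrolled, which is one reason element.screenshot is the simpler route.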

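Problems to solve

- If the target region wasn't fully on screen, it came out cut off.

- Depending on the Chrome options, the capture sometimes failed.

- It wouldn't run on the second monitor.

- When the page was zoomed out, it didn't capture.

To get the capture cut off as little as possible, Selenium had to run on the external monitor rather than the laptop screen, and at first it wouldn't. The advice was to move the window position and then maximize, and after trying several approaches it worked. As for headless, it captured fine on some pages but not on others, so I left it out.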

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
chrome_options = Options()
 
# chrome_options.headless = True  # capture results differ from site to site
# chrome_options.add_argument("--kiosk")  # same effect as pressing F11
chrome_options.add_argument("--window-position=2000,0")  # open on the second monitor
driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window() 

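It captured on the second monitor!! But to keep the page from being cut off, I had to shrink the page before capturing. People suggest all sorts of ways to zoom out in Selenium, and they didn't work well. Some said to change it in the browser's default settings, others to zoom out, but changing the value seemed to affect the CSS as well.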

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Assumes `driver` is the Chrome driver created in the previous snippet.
# driver.execute_script("document.body.style.zoom='50%'")  # did not work for captures
driver.execute_script("document.body.style.transform = 'scale(0.50)'")


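Using document.body.style.zoom doesn't work; use document.body.style.transform instead. Ideally the element you want fits on one screen, but when it doesn't, this is what to use. That seemed to mostly solve the problem, but the images I wanted lived in different elements, so I ended up merging images as well.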
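Rather than hard-coding 0.50, the scale could be derived from how tall the element is compared with the window. This is a rough sketch under assumptions: fit_scale and scale_script are hypothetical helpers, the heights would have to be read from the page yourself, and min_scale is an arbitrary floor so text stays legible.

```python
import math


def fit_scale(element_height, viewport_height, min_scale=0.25):
    """Largest scale (two decimals, never above 1.0) at which the element fits."""
    if element_height <= 0:
        raise ValueError("element_height must be positive")
    scale = min(1.0, viewport_height / element_height)
    scale = math.floor(scale * 100) / 100  # round down so the element still fits
    return max(min_scale, scale)


def scale_script(scale):
    """Build the same JavaScript statement used above for a given scale."""
    return f"document.body.style.transform = 'scale({scale:.2f})'"


print(scale_script(fit_scale(element_height=2000, viewport_height=1000)))
```

The returned string can be passed to driver.execute_script in place of the fixed statement.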

The process below crawls basic information from gsmarena, then captures and merges the images.

from bs4 import BeautifulSoup
import requests
import re

# Currency conversion
def exchange_rate(today_rate, price, r):  # r: number of decimal places to round to
    p_price = round(today_rate * int(price), r)
    return str(p_price)

# Month name -> month number
def month_string_to_number(string):
    m = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
         'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
    s = string.strip()[:3].lower()
    try:
        return m[s]
    except KeyError:
        return 'check manually'

# Phone name, release date, and price
def crawl_gsmarena(url, today_rate1, today_rate2):
    # today_rate1 is applied to EUR prices, today_rate2 to INR prices
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")

    tds = soup.find_all('td', attrs={'class': 'nfo'})

    try:
        phone_name = soup.select('h1.specs-phone-name-title')[0].text
    except IndexError:
        return 0  # the page has no phone title

    # Defaults, so the return below never hits an undefined name
    s_year = s_mon = phone_price = exchange = ''

    for td in tds:

        if 'status' in str(td):
            # <td class="nfo" data-spec="status">Available. Released 2021, August 27</td>
            try:
                text = td.text
                s_year = re.findall(r'\d+', text)[0]
                s_mon = text.split(',')[1]
                s_mon = re.sub('[^a-zA-Z]', '', s_mon)
                s_mon = month_string_to_number(s_mon)
            except IndexError:
                s_year = 'coming_soon'
                s_mon = 'coming_soon'

        elif 'price' in str(td):
            text = td.text

            if '$' in text:
                phone_price = text.split('$')[1].split()[0]
                exchange = ''

            elif '€' in text:
                p = text.split('€')[1]
                p = p.split()[0].replace(',', '').split('.')[0]
                phone_price = exchange_rate(today_rate1, p, 2)  # convert with the EUR rate
                exchange = today_rate1

            elif '₹' in text:
                p = text.split('₹')[1]
                p = p.split()[0].replace(',', '').split('.')[0]
                phone_price = exchange_rate(today_rate2, p, 3)  # convert with the INR rate
                exchange = today_rate2

            elif 'EUR' in text:
                p = text.replace(',', '')
                p = re.findall(r'\d+', p)[0]
                phone_price = exchange_rate(today_rate1, p, 2)
                exchange = today_rate1

            else:
                phone_price = text
                exchange = ''

    return phone_name, s_year, s_mon, phone_price, exchange

This fetches the phone name, release year and month, the price, and so on.
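The parsing can be tried without touching the network. The sketch below is a simplified, regex-based take on the same idea and not the code above: parse_release and parse_price are hypothetical helpers, and the sample strings only assume the "Available. Released 2021, August 27" shape shown in the code comment.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def parse_release(status):
    """Return (year, month) from a status string, or (None, None) if absent."""
    match = re.search(r"(\d{4}),\s*([A-Za-z]+)", status)
    if not match:
        return None, None
    year = int(match.group(1))
    month = MONTHS.get(match.group(2)[:3].lower())  # None for unknown words
    return year, month


def parse_price(text):
    """Return (currency_symbol, amount) for the first $, € or ₹ price, else None."""
    match = re.search(r"([$€₹])\s*([\d,]+(?:\.\d+)?)", text)
    if not match:
        return None
    return match.group(1), float(match.group(2).replace(",", ""))


print(parse_release("Available. Released 2021, August 27"))
print(parse_price("₹ 49,999"))
```

Treating the comma as a thousands separator is an assumption; prices written with a decimal comma would need separate handling.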

import glob
import os
import shutil
import time

from PIL import Image
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Merge the captures vertically
def merge_img(name):
    files = sorted(glob.glob('crawling_temp/*.png'))  # a.png on top, then b.png

    # Width fixed at 364: the capture comes out twice as wide, so half has to go
    full_width, full_height = 364, 0

    for f in files:
        with Image.open(f) as image:
            _, height = image.size
        # full_width = max(full_width, width)
        full_height += height

    canvas = Image.new('RGB', (full_width, full_height), 'white')
    output_height = 0

    for f in files:
        with Image.open(f) as image:
            _, height = image.size
            canvas.paste(image, (0, output_height))
            output_height += height

    os.makedirs('crawling', exist_ok=True)
    canvas.save(f'crawling/{name}.png')

    shutil.rmtree('crawling_temp')

# Halve the size of the first capture
def resize_img():
    with Image.open('crawling_temp/a.png') as im:
        out = im.resize(tuple(int(0.5 * s) for s in im.size))
    out.save('crawling_temp/a.png')

chrome_options = Options()
chrome_options.add_argument('--profile-directory=Default')
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--disable-plugins-discovery")
chrome_options.add_argument("--window-position=2000,0")  # open on the second monitor

driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()

urls = ['https://www.gsmarena.com/samsung_galaxy_s21_fe_5g-10954.php']  # list of URLs to capture
names = ['samsung galaxy s21']  # list of search keywords, used as file names

for url, name in zip(urls, names):
    driver.get(url)

    element = driver.find_element(By.CLASS_NAME, "article-info")

    os.makedirs('crawling_temp', exist_ok=True)

    element.screenshot('crawling_temp/a.png')
    resize_img()

    driver.execute_script("document.body.style.transform = 'scale(0.50)'")  # zoom out

    element2 = driver.find_element(By.ID, 'specs-list')
    element2.screenshot('crawling_temp/b.png')

    name = name.replace('/', '')  # '/' is not allowed in a file name
    merge_img(name)

    time.sleep(2)  # going too fast gets you flagged as a bot

# If you get an "already in use" error, close VS Code and run it again
print('done')
    

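That covers image merging and resizing. Zooming out sometimes made the capture come out wrong, so I took one shot at the original size and the other zoomed out, then resized and merged them, which made for a convoluted process. I hope your own crawling and auto-screenshot code goes without a hitch (T_T)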
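The arithmetic behind the merge is easy to check without Pillow or a browser. In this sketch, stack_layout and half_size are hypothetical helpers and the sizes are made-up numbers; it only shows how the canvas size and paste offsets follow from the individual image sizes.

```python
def stack_layout(sizes, fixed_width=None):
    """Plan a vertical stack of images.

    sizes is a list of (width, height) tuples. Returns the canvas size and
    the y offset at which each image would be pasted.
    """
    width = fixed_width if fixed_width is not None else max(w for w, _ in sizes)
    offsets = []
    y = 0
    for _, height in sizes:
        offsets.append(y)
        y += height
    return (width, y), offsets


def half_size(size):
    """Size of an image after shrinking both sides by half."""
    return tuple(int(0.5 * s) for s in size)


first = half_size((728, 240))  # capture taken at the original zoom
second = (364, 800)            # capture taken after scaling the page to 0.5
print(stack_layout([first, second], fixed_width=364))
```

Fixing the width mirrors the hard-coded 364 above; leaving fixed_width unset falls back to the widest image instead.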


PTKOREA internship: how I passed the interview, plus a rundown of intern life (marketing, data)

C.L.O.W.N 2022. 1. 18. 22:30


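It really was lightning fast. I never expected to hear I'd passed and start work this quickly. Word that I'd passed the document screening came within two days, and when they asked whether I could interview three days later, I went, since an interview is an opportunity too. I had a decent feeling during the interview, but you never know with interviews until the result is out, so I waited. The day after the interview they called to say I'd passed, and I started getting ready right away.

It had been a while since my last interview, so I looked up likely questions and rewatched YouTube videos. In my case, though, none of that was necessary... I'll get to that later; since some of you came here curious about the interview questions, here is what I put together.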

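Marketing, data, and planning roles?

1. Self-introduction
2. Motivation for applying
3. What you know about Pengtai Korea
4. What an advertising agency does
5. Why you should be hired for this position, or your strengths
6. Strengths and weaknesses of your personality
7. Your level with Excel and basic document work
8. How you relieve stress
9. What you think about overtime at an advertising agency
10. What kind of colleagues and managers you want to work with

The content has changed a lot recently, so be sure to prepare for questions about your cover letter and team projects~

*Questions are mostly based on your cover letter, so prepare accordingly.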

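I practiced hard with Interview King Lee-hyung's one-minute self-introduction video, and then I didn't even get to introduce myself. The interviewer (the team lead) said straight out that they had really wanted to interview me because of what I wrote in my cover letter... It's a little embarrassing to say, but it's connected to what I've written on this blog.

So they asked how I had managed to get those results. I had tried a huge range of things in that area and experienced them firsthand, so the answers came easily. They asked whether I was fine with repetitive work and said, "It's a six-month position, so you won't quit partway, will you?", at which point I thought, "Oh, is this a yes..?" They also told me that what I had been doing amounted to hands-on performance marketing. It's a data role, and I figured I was picked because I brought some insight; I was very curious what I'd be doing once I joined.

If they had asked about Excel or basic document skills... I wanted to say I'd like to try automating them with Python, but they never asked.

So the whole process wrapped up in a week and I was hired. I've now been at work for a few weeks.. Right now I'm in the middle of a handover, and I'm told the work piles up in May and June, so there isn't much to do yet. I was told the job is handling data and repeatedly pulling data from GA, but I haven't really seen that yet.

Contrary to what I expected, though, Pengtai seems to involve a lot of manual work. There's even talk that this is why they hire so many interns.. Being a Cheil Worldwide subsidiary, there is also simply that much work. (The company has since been renamed PTKOREA.)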

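Life at the company

The company is relaxed. It has a flat culture where people call each other by English names. The dress code is truly free, too. I went in semi-formal at first, and there was no need. On Blind, people say you can even wear shorts in summer.. There are plenty of snacks, and with flexible working hours it's easy to come and go.

If you ask what I'm doing now: by nature I dislike manual, inefficient work, so I'm trying to automate my tasks. There really are a lot of restrictions, lol, but I'm sticking with it. I'm rebuilding Excel functions like MATCH and VLOOKUP in Python. The laptop isn't great, and running the Excel formulas seems to take nearly all day, so I'm writing the code in Python (over 95% done). On top of that, I'm also attempting crawling and automatic captures... It makes me think a lot about how to store the data and what rules to classify it by.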


YouTube crawling (3) all-in-one: channel titles, comments, view counts, and even subtitles

C.L.O.W.N 2021. 9. 29. 16:34


I'm running a crawl right now, so I'm writing this while I have the time. Crawling is one thing, but how to clean this data is the bigger worry. This post builds on my previous ones, so adapt it to your own purposes!

YouTube crawling (1): automatic page translation with Selenium, no translation API needed! (keyboard and mouse input)

0goodmorning.tistory.com

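Features

- Get the links in a given YouTube channel's video list (channel name, subscriber count)

- Title, view count, date, like count, dislike count, comment count

- Comment crawling (with a translation option)

- Extraction of auto-translated subtitles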

import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()  # create the Chrome options object
user_agent = "Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36 "
options.add_argument('user-agent=' + user_agent)
options.add_argument('headless')  # headless mode
options.add_argument("window-size=1920,1080")  # window size (full screen)
options.add_argument("disable-gpu")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--mute-audio")  # mute
options.add_argument('--blink-settings=imagesEnabled=false')  # do not load images in the browser
options.add_argument('incognito')  # run the browser in incognito mode
options.add_argument("--start-maximized")

# Translation prefs. Each call below replaces the previous "prefs" value,
# so keep only the variant you want (as written, #3 is the one that takes effect).

#1
prefs = {
  "translate_whitelists": {"en":"ko"},
  "translate":{"enabled":"true"}
}
options.add_experimental_option("prefs", prefs)

#2
prefs = {
  "translate_whitelists": {"your native language":"ko"},
  "translate":{"enabled":"True"}
}
options.add_experimental_option("prefs", prefs)

#3
options.add_experimental_option('prefs', {'intl.accept_languages': 'ko,ko_kr'})

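This is the basic Selenium webdriver setup. The prefs are only needed for translating English pages, so you can turn them off. And if you're curious to watch it run the first time, turn headless off by commenting out the line: # options.add_argument('headless').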

import os
import re
import pandas as pd
import winsound  # Windows only
from selenium.webdriver.common.by import By

ytb = pd.read_csv('youtube_link.csv')
ytb_link = ytb.link.to_list()

for channel_url in ytb_link:

    # chromedriver must be discoverable (on PATH, or via Selenium Manager in Selenium 4.6+)
    driver = webdriver.Chrome(options=options)
    driver.get(channel_url)

    # Scroll down
    time.sleep(1.5)
    endkey = 4  # 90-120 links; each extra key press adds about 30
    while endkey:
        driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
        time.sleep(0.3)
        endkey -= 1

    channel_name = driver.find_element(By.XPATH, '//*[@id="text-container"]').text
    subscribe = driver.find_element(By.CSS_SELECTOR, '#subscriber-count').text
    # Strip characters that cause errors in folder names
    channel_name = re.sub(r'[=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\(\)\[\]\<\>`\'…《\》]', '', channel_name)
    # print(channel_name, subscribe)

    # Hand the page over to bs4
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    video_list0 = soup.find('div', {'id': 'contents'})
    video_list2 = video_list0.find_all('ytd-grid-video-renderer', {'class': 'style-scope ytd-grid-renderer'})

    base_url = 'http://www.youtube.com'
    video_url = []

    # Collect each video's address into video_url
    for video in video_list2:
        url = base_url + video.find('a', {'id': 'thumbnail'})['href']
        video_url.append(url)

    driver.quit()

    if subscribe:
        channel = channel_name + ' - ' + subscribe
    else:
        channel = channel_name

    directory = f'data/{channel}/subtitle'
    os.makedirs(directory, exist_ok=True)

    print(channel, len(video_url))

    ytb_info(video_url, channel)  # defined in the blocks further down
    print()
    winsound.PlaySound('sound.wav', winsound.SND_FILENAME)

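ytb_link: put the channels you want to collect into a list. I made a CSV file with a column named 'link'.

channel: folders are named after the channel, so characters that cause errors in folder names are stripped out beforehand. The subtitle subfolder is created at the same time so that subtitle files have somewhere to go.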
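As a smaller, testable take on that clean-up step, the sketch below removes only the characters Windows rejects in file and folder names. safe_name is a hypothetical helper, and its character set is deliberately narrower than the long pattern used in the crawler, which also strips many symbols that are merely inconvenient.

```python
import re

# Characters Windows does not allow in file or folder names, plus control characters
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(name, replacement=""):
    """Return a version of name that is usable as a folder name."""
    cleaned = _INVALID.sub(replacement, name).strip(" .")  # no trailing dot or space
    return cleaned or "untitled"


print(safe_name("AC/DC: Live?"))
```

Falling back to "untitled" is an arbitrary choice for names that end up empty; skipping the channel would be just as reasonable.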

# A Windows PlaySound call lets you know each time a channel finishes. Turn it off if you find it noisy.

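import time

# Assumes `driver` is already on the channel page.
last_page_height = driver.execute_script("return document.documentElement.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    time.sleep(0.5)
    new_page_height = driver.execute_script("return document.documentElement.scrollHeight")

    # Stop once scrolling no longer makes the page any taller
    if new_page_height == last_page_height:
        break
    last_page_height = new_page_height
    time.sleep(0.75)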

endkey: sets how many links to collect from the channel. The current setting collects 90 to 120. With time.sleep(2) it crawls up to 180. Each increment of endkey adds about 30 more. If you'd rather say "whatever" and crawl every link, use the code above.
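The "scroll until the height stops changing" loop can be separated from the browser so it is easy to test and to cap. This is a sketch under assumptions: scroll_until_stable, get_height, scroll, and max_rounds are hypothetical names, and the cap is an addition that the original loop does not have.

```python
import time


def scroll_until_stable(get_height, scroll, pause=0.5, max_rounds=200):
    """Scroll repeatedly until the page height stops growing.

    get_height and scroll are callables supplied by the caller.
    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll()
        time.sleep(pause)  # give the page time to load more items
        new_height = get_height()
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


# With Selenium, the two callables could wrap the same scripts used above:
#   get_height = lambda: driver.execute_script("return document.documentElement.scrollHeight")
#   scroll = lambda: driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
```

A cap such as max_rounds keeps a very long channel from scrolling indefinitely; 200 is an arbitrary default.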

# When you only want the basic info
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def ytb_info2(video_url, channel):
    print(f'{channel}', ' crawl started')
    driver = webdriver.Chrome(options=options)  # `options` comes from the setup block above

    # Lists that hold the data
    date_list = []
    title_list = []
    view_list = []
    like_list = []
    dislike_list = []
    comment_list = []

    # Crawl each video of the channel
    for start_url in video_url:
        print(start_url, end=' / ')
        driver.get(start_url)
        driver.implicitly_wait(1.5)

        body = driver.find_element(By.TAG_NAME, 'body')

        # Page down so the comment count is not empty
        num_of_pagedowns = 2
        while num_of_pagedowns:
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.5)
            num_of_pagedowns -= 1
            time.sleep(0.5)

        # Elements to crawl
        try:
            info = driver.find_element(By.CSS_SELECTOR, '.style-scope ytd-video-primary-info-renderer').text.split('\n')

            if '인기 급상승 동영상' in info[0]:  # "trending" badge line
                info.pop(0)
            elif '#' in info[0].split(' ')[0]:  # hashtag line
                info.pop(0)

            title = info[0]
            # Drop the '조회수 ' (views) prefix and split at '회' into count and date
            divide = info[1].replace('조회수 ', '').replace(',', '').split('회')
            view = divide[0]
            date = divide[1].replace(' ', '')
            like = info[2]
            dislike = info[3]

            driver.implicitly_wait(1)

            try:
                comment = driver.find_element(By.CSS_SELECTOR, '#count > yt-formatted-string > span:nth-child(2)').text.replace(',', '')
            except Exception:
                comment = 'no comments'

            # Append to the lists
            title_list.append(title)
            view_list.append(view)
            date_list.append(date)
            like_list.append(like)
            dislike_list.append(dislike)
            comment_list.append(comment)

            # Save what has been crawled so far
            new_data = {'date': date_list, 'title': title_list, 'view': view_list, 'comment': comment_list, 'like': like_list, 'dislike': dislike_list}
            df = pd.DataFrame(new_data)
            df.to_csv(f'data/{channel}/{channel}.csv', encoding='utf-8-sig')
        except Exception:
            continue  # skip videos whose info block could not be parsed

        # For checking
        print(title, view, date, like, dislike, comment)

    driver.quit()

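If you don't need subtitles or comments

This crawls only the title, date, view count, like count, dislike count, and comment count. There isn't much information, so Selenium alone is enough. Compared with handing the page source to bs4, the difference is small.

# print(title, view, date, like, dislike, comment): comment this line out if you don't need to check what is being collected.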
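The view-count line is the most fragile part of that parsing, so it may help to isolate it. The sketch below assumes the Korean layout implied by the replace and split calls above, something like "조회수 1,234회 2021. 9. 29."; parse_view_line is a hypothetical helper and the sample strings are made up.

```python
import re


def parse_view_line(line):
    """Return (views, date) from a '조회수 N회 <date>' line, or None if it does not match."""
    match = re.match(r"조회수\s*([\d,]+)회\s*(.*)", line.strip())
    if not match:
        return None
    views = int(match.group(1).replace(",", ""))
    date = match.group(2).replace(" ", "")
    return views, date


print(parse_view_line("조회수 1,234,567회 2021. 9. 29."))
```

Returning None for unexpected layouts (abbreviated counts, a different interface language) makes it possible to log and skip a video instead of relying on a broad try/except.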

If you also need comments and subtitles, read on below.

import pandas as pd
from konlpy.tag import Kkma
from pykospacing import Spacing
from youtube_transcript_api import YouTubeTranscriptApi

def ytb_subtitle(start_url, title):
    # Note: the output path uses the module-level `channel` set by the channel loop
    try:
        code = start_url.split('=')[1]
        srt = YouTubeTranscriptApi.get_transcript(f"{code}", languages=['ko'])  # Korean; a list of dicts

        text = ' '.join(entry['text'] for entry in srt)
        text_ = text.replace(' ', '')

        # Sentence splitting (kss works just as well)
        kkma = Kkma()
        text_sentences = kkma.sentences(text_)

        # Sentence-final endings
        period_endings = ('죠', '다', '요')
        command_endings = ('시오',)
        question_endings = ('습니까', '십니까', '됩니까', '옵니까', '뭡니까')

        df = pd.read_csv('not_verb.csv', encoding='utf-8')
        not_verb = df.stop.to_list()

        # Split into words
        text_all = ' '.join(text_sentences).split(' ')

        for n, word in enumerate(text_all):
            if len(word) <= 1:  # nothing more to do for single characters
                continue
            if word.endswith(question_endings):  # question
                text_all[n] += '?'
            elif word.endswith(command_endings):  # command
                text_all[n] += '!'
            elif word.endswith(period_endings) and word not in not_verb:  # statement, minus excluded words
                text_all[n] += '.'

        spacing = Spacing()
        text_all_in_one = ' '.join(text_all)

        text_split = spacing(text_all_in_one.replace(' ', '')).split('.')
        text2one = [t.lstrip() for t in text_split]

        w = '. '.join(text2one)

        with open(f'data/{channel}/subtitle/{title}.txt', 'w', encoding='utf-8') as f:
            f.write(w)
        print('O')
    except Exception:
        print('X')

YouTube crawling (2): super simple YouTube subtitle download and extraction (including sentence splitting)

0goodmorning.tistory.com

For downloading and extracting YouTube subtitles, please see the previous post. For the not_verb.csv file, add words that end in '다' or '요' but are nouns or adjectives rather than verbs, under a column named stop.
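What the stop column does is easiest to see on a handful of words, without KoNLPy or the spacing model. In this sketch punctuate is a hypothetical helper, the ending lists are copied from the code above, and the sample words and the not_verb entry are made up.

```python
QUESTION_ENDINGS = ("습니까", "십니까", "됩니까", "옵니까", "뭡니까")
COMMAND_ENDINGS = ("시오",)
PERIOD_ENDINGS = ("죠", "다", "요")


def punctuate(words, not_verb=()):
    """Append '?', '!' or '.' to words that look like sentence endings."""
    skip = set(not_verb)
    result = []
    for word in words:
        if len(word) <= 1:
            result.append(word)  # single characters are left alone
        elif word.endswith(QUESTION_ENDINGS):
            result.append(word + "?")
        elif word.endswith(COMMAND_ENDINGS):
            result.append(word + "!")
        elif word.endswith(PERIOD_ENDINGS) and word not in skip:
            result.append(word + ".")
        else:
            result.append(word)
    return result


# '바다' (sea) ends in '다' but is a noun, so it belongs in the stop column
print(punctuate(["안녕하세요", "바다", "갑니다", "됩니까", "가시오"], not_verb=["바다"]))
```

Without the not_verb entry, '바다' would get a period and the sentence would be split in the wrong place.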

# Without English translation
import re
import time
import winsound as sd  # Windows only

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def beepsound():
    fr = 2000    # frequency, range: 37 ~ 32767
    du = 1000    # duration: 1000 ms == 1 second
    sd.Beep(fr, du)  # winsound.Beep(frequency, duration)

def ytb_info(video_url, channel):
    print(f'{channel}', ' crawl started')
    driver = webdriver.Chrome(options=options)  # `options` comes from the setup block above

    count = 1  # the ad only needs to be closed once

    # Lists that hold the data
    date_list = []
    title_list = []
    view_list = []
    like_list = []
    dislike_list = []
    comment_list = []

    try:
        # Crawl each video of the channel
        for start_url in video_url:
            print(start_url, end=' / ')
            driver.get(start_url)
            driver.implicitly_wait(1.5)

            body = driver.find_element(By.TAG_NAME, 'body')

            # Page down so the comment count is not empty
            num_of_pagedowns = 1
            while num_of_pagedowns:
                body.send_keys(Keys.PAGE_DOWN)
                time.sleep(0.5)
                num_of_pagedowns -= 1
                driver.implicitly_wait(1)

            # Elements to crawl
            try:
                info = driver.find_element(By.CSS_SELECTOR, '.style-scope ytd-video-primary-info-renderer').text.split('\n')

                if '인기 급상승 동영상' in info[0]:  # "trending" badge line
                    info.pop(0)
                elif '#' in info[0].split(' ')[0]:  # hashtag line
                    info.pop(0)

                title = info[0]
                # Drop the '조회수 ' (views) prefix and split at '회' into count and date
                divide = info[1].replace('조회수 ', '').replace(',', '').split('회')
                view = divide[0]
                date = divide[1].replace(' ', '')
                like = info[2]
                dislike = info[3]

                try:
                    comment = driver.find_element(By.CSS_SELECTOR, '#count > yt-formatted-string > span:nth-child(2)').text.replace(',', '')
                except Exception:
                    comment = 'no comments'

                # Append to the lists
                title_list.append(title)
                view_list.append(view)
                date_list.append(date)
                like_list.append(like)
                dislike_list.append(dislike)
                comment_list.append(comment)

                # Save what has been crawled so far
                new_data = {'date': date_list, 'title': title_list, 'view': view_list, 'comment': comment_list, 'like': like_list, 'dislike': dislike_list}
                df = pd.DataFrame(new_data)
                df.to_csv(f'data/{channel}/-{channel}.csv', encoding='utf-8-sig')

            except Exception:
                continue  # skip videos whose info block could not be parsed

            # print(title, view, date, like, dislike, comment)

            num_of_pagedowns = 1
            while num_of_pagedowns:
                body.send_keys(Keys.PAGE_DOWN)
                time.sleep(0.5)
                num_of_pagedowns -= 1

            # Scroll to the bottom until no more comments load
            last_page_height = driver.execute_script("return document.documentElement.scrollHeight")

            while True:
                driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
                # driver.implicitly_wait(2)  # causes errors
                time.sleep(0.5)
                new_page_height = driver.execute_script("return document.documentElement.scrollHeight")

                if new_page_height == last_page_height:
                    break
                last_page_height = new_page_height
                # driver.implicitly_wait(1)
                time.sleep(0.75)

            time.sleep(0.5)

            # Crawl the comments with bs4
            html = driver.page_source
            soup = BeautifulSoup(html, 'lxml')

            users = soup.select("div#header-author > h3 > #author-text > span")
            comments = soup.select("yt-formatted-string#content-text")

            user_list = []
            review_list = []

            for user_tag, comment_tag in zip(users, comments):
                str_tmp = str(user_tag.text)
                str_tmp = str_tmp.replace('\n', '')
                str_tmp = str_tmp.replace('\t', '')
                str_tmp = str_tmp.replace('              ', '')
                str_tmp = str_tmp.replace('            ', '')
                user_list.append(str_tmp)

                str_tmp = str(comment_tag.text)
                str_tmp = str_tmp.replace('\n', '')
                str_tmp = str_tmp.replace('\t', '')
                str_tmp = str_tmp.replace('            ', '')

                review_list.append(str_tmp)

            # Save the comments
            pd_data = {"ID": user_list, "Comment": review_list}
            youtube_pd = pd.DataFrame(pd_data)

            # Strip characters that cause errors in file names
            title = re.sub(r'[-=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\(\)\[\]\<\>`\'…《\》]', '', title)
            youtube_pd.to_csv(f"data/{channel}/{title}.csv", encoding='utf-8-sig')
            print('ㅁ', end='')

            # Extract the subtitles
            ytb_subtitle(start_url, title)

            # Close the ad
            if count:
                # time.sleep(1)
                try:
                    driver.implicitly_wait(0.5)
                    driver.find_element(By.CSS_SELECTOR, "#main > div > ytd-button-renderer").click()
                    count -= 1
                except Exception:
                    pass

    except Exception:
        pass  # stop on an unexpected error; files written so far are kept
    finally:
        driver.quit()
        beepsound()

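Basic info / comments / subtitles

What is added beneath the basic-info crawl: after scrolling down, the HTML page_source is handed to bs4 to crawl the comments. Since there is a lot of data, I recommend bs4, which is lighter and faster than Selenium.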
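The clean-up applied to each nickname and comment can be written more compactly. This is a sketch and not a drop-in replacement: clean_text and pair_comments are hypothetical helpers, and collapsing every run of whitespace into one space differs slightly from the chained replace calls, which delete newlines outright.

```python
def clean_text(raw):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(raw.split())


def pair_comments(users, comments):
    """Pair nicknames with comments, stopping at the shorter list."""
    return [(clean_text(u), clean_text(c)) for u, c in zip(users, comments)]


# Made-up strings, shaped like the .text of the selected tags
rows = pair_comments(["\n   alice\n  ", "bob"], ["great\n video", "nice", "orphan"])
print(rows)
```

Pairing with zip avoids an index error when the page yields a different number of nicknames and comments.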

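Crawling all the comments and grabbing the subtitles took about 33 seconds per video. I expect that varies with your computer and internet connection. I made it beep each time a channel finishes. Turn it off if you don't need it!

*Caution*

YouTube comments are sorted by top comments by default, so the further down a comment is, the more likely it got few likes or little attention. I don't need every comment, so I set the timings to crawl as fast as possible while still collecting comment data. When there are few comments it crawls all of them, but as the number grows it only gets around 60 to 90%.

If you need every comment, set time.sleep to one second or more. With driver.implicitly_wait, the page sometimes scrolled down without the comments loading, which is why I used time.sleep.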

# With English translation
import re
import time

import pandas as pd
import pyautogui
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def ytb_info(video_url, channel):
    print(f'{channel}', ' crawl started')
    driver = webdriver.Chrome(options=options)  # `options` from the setup block, with headless turned off
    df = pd.DataFrame()

    count = 1  # the ad only needs to be closed once

    # Crawl each video of the channel
    for start_url in video_url:
        print(start_url, end='/ ')
        driver.implicitly_wait(1)
        driver.get(start_url)

        # Translate the page: open the context menu and choose the translate entry
        pyautogui.hotkey('shift', 'F10')
        pyautogui.press('down', presses=7)
        pyautogui.press('enter')

        body = driver.find_element(By.TAG_NAME, 'body')

        # Page down so the comment count is not empty
        num_of_pagedowns = 1
        while num_of_pagedowns:
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(.75)
            num_of_pagedowns -= 1
            driver.implicitly_wait(1)

        # Elements to crawl
        info = driver.find_element(By.CSS_SELECTOR, '.style-scope ytd-video-primary-info-renderer').text.split('\n')

        if '인기 급상승 동영상' in info[0]:  # "trending" badge line
            info.pop(0)
        elif '#' in info[0].split(' ')[0]:  # hashtag line
            info.pop(0)

        title = info[0]
        # Drop the '조회수 ' (views) prefix and split at '회' into count and date
        divide = info[1].replace('조회수 ', '').replace(',', '').split('회')
        view = divide[0]
        date = divide[1].replace(' ', '')
        like = info[2]
        dislike = info[3]

        try:
            comment = driver.find_element(By.CSS_SELECTOR, '#count > yt-formatted-string > span:nth-child(2)').text.replace(',', '')
        except Exception:
            comment = 'no comments'

        # Save the crawled info (DataFrame.append was removed in pandas 2.0, so concat is used)
        new_data = {'date': date, 'title': title, 'view': view, 'comment': comment, 'like': like, 'dislike': dislike}
        df = pd.concat([df, pd.DataFrame([new_data])], ignore_index=True)
        df.to_csv(f'data/{channel}/{channel}.csv', encoding='utf-8-sig')
        # print(title, view, date, like, dislike, comment)

        # Scroll to the bottom until no more comments load
        last_page_height = driver.execute_script("return document.documentElement.scrollHeight")

        while True:
            driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
            time.sleep(1)
            new_page_height = driver.execute_script("return document.documentElement.scrollHeight")

            if new_page_height == last_page_height:
                break
            last_page_height = new_page_height
            time.sleep(1)

        time.sleep(0.5)

        # Crawl the comments through Selenium (translated text is not in page_source)
        review_list = []
        user_list = []
        reviews = driver.find_elements(By.CSS_SELECTOR, '#content-text')
        users = driver.find_elements(By.CSS_SELECTOR, 'h3.ytd-comment-renderer a span')
        for user_el, review_el in zip(users, reviews):
            review_list.append(review_el.text.replace('\n', ' '))
            user_list.append(user_el.text)

        # Save the comments
        pd_data = {"ID": user_list, "Comment": review_list}
        youtube_pd = pd.DataFrame(pd_data)

        # Strip characters that cause errors in file names
        title = re.sub(r'[-=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\(\)\[\]\<\>`\'…《\》]', '', title)
        youtube_pd.to_csv(f"data/{channel}/{title}.csv", encoding='utf-8-sig')
        print('ㅁ', end='')

        # Extract the subtitles
        ytb_subtitle(start_url, title)

        # Close the ad
        if count:
            # time.sleep(1)
            try:
                driver.implicitly_wait(0.5)
                driver.find_element(By.CSS_SELECTOR, "#main > div > ytd-button-renderer").click()
                count -= 1
            except Exception:
                pass

    driver.quit()

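Translating foreign-language pages

Downsides: it can't run headless. You can't use the mouse while it runs. It takes a reeeally, reeeally long time. I don't think you need to go this far, but I'm leaving it here for anyone who might.

The biggest problem is that the translated text doesn't carry over to bs4. Every comment and nickname has to be collected through Selenium, which is why it takes so long.

What I plan to do with all this data is still a secret.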
