Faster Crawling with Multithreading

While collecting data, I looked for ways to speed things up and decided to try crawling with multithreading. In practice it cut the working time considerably, and I will show this with a simple example.

1. What Is Multithreading?

Multithreading means that two or more threads (units of work) run concurrently inside a single process. This lets a program use CPU resources more efficiently and improves its responsiveness or throughput.
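To make the idea concrete, here is a minimal sketch that compares running five waits one after another with running them in five threads. The `fake_download` function is a hypothetical stand-in that only sleeps; it is not part of the crawler in this post. For a crawler, most of the gain comes from overlapping the time spent waiting on network responses, which is what the sleep imitates.

```python
import threading
import time


def fake_download(seconds):
    # Hypothetical stand-in for a network request: the thread just waits.
    time.sleep(seconds)


def run_sequential(count, seconds):
    # Run the waits one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(count):
        fake_download(seconds)
    return time.perf_counter() - start


def run_threaded(count, seconds):
    # Start one thread per wait, then join them all.
    start = time.perf_counter()
    workers = [
        threading.Thread(target=fake_download, args=(seconds,))
        for _ in range(count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return time.perf_counter() - start


sequential_time = run_sequential(5, 0.1)
threaded_time = run_threaded(5, 0.1)
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The sequential run takes about the sum of the waits, while the threaded run takes about as long as a single wait, because the threads sleep at the same time.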

2. Required Libraries

We will use time, requests, bs4, and threading. Here is a brief description of each.

1) time - time utilities

Provides time-related functions such as measuring elapsed time and sleep.

2) requests - HTTP request library

Makes it easy to send GET/POST requests to a web page.
The response can be used directly as text, json, content, and so on.

3) BeautifulSoup (bs4) - HTML parsing library

Parses HTML or XML documents into a structure so that the elements you want are easy to extract.
Often used together with requests.

4) threading - multithreading library

Used when you want to handle several tasks at the same time (especially I/O work).
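One detail worth knowing about `threading.Thread` is that the return value of the target function is discarded. The sketch below shows one common way to hand results back: each worker writes into a shared dictionary guarded by a `threading.Lock`. The `fetch_length` worker and the `pages` data are hypothetical and only illustrate the pattern.

```python
import threading

results = {}
results_lock = threading.Lock()


def fetch_length(name, text):
    # Hypothetical worker: pretend `text` is a downloaded page body.
    length = len(text)
    # Guard the shared dictionary so only one thread writes at a time.
    with results_lock:
        results[name] = length


# Made-up page bodies standing in for real responses.
pages = {"a": "<html>one</html>", "b": "<html>three</html>"}

workers = [
    threading.Thread(target=fetch_length, args=(name, text))
    for name, text in pages.items()
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()  # wait until every worker has stored its result

print(results)
```

After all threads are joined, `results` holds one entry per page, regardless of the order in which the threads finished.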

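3. Comparing Speed by Parsing GeekNews Articles

Let's crawl the article contents of GeekNews, the Korean counterpart of Hacker News. We will measure the run time with the time library and compare the two programs.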

1) Single-thread crawling

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_topic_contents(url):
    # Download one article page and print its body element.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    contents = soup.select('#topic_contents')
    print(contents)


def get_topics(url):
    # Collect the article links from the front page.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.select('.topicdesc')

    for title in titles:
        link = title.find('a')
        if link is None:  # skip entries without a link
            continue
        new_url = urljoin(url, link.get('href'))
        # Fetch sequentially: each call blocks until the page is done.
        get_topic_contents(new_url)


start = time.perf_counter()

url = 'https://news.hada.io/'

get_topics(url)

end = time.perf_counter()
print(f"All tasks done! {end-start:.4f}s")
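The article URL is built from the front-page URL and each link's `href`. As a small aside, the sketch below compares plain string concatenation with `urllib.parse.urljoin` from the standard library. The two `href` values are hypothetical; the actual attribute format on the site may differ.

```python
from urllib.parse import urljoin

base_url = "https://news.hada.io/"

# Hypothetical href values; the real attribute format on the site may differ.
relative_href = "topic?id=1"
rooted_href = "/topic?id=1"

# Plain concatenation only works when the href has no leading slash.
concat_relative = base_url + relative_href
concat_rooted = base_url + rooted_href  # produces a double slash

# urljoin resolves both forms against the base URL.
joined_relative = urljoin(base_url, relative_href)
joined_rooted = urljoin(base_url, rooted_href)

print(concat_rooted)
print(joined_relative)
print(joined_rooted)
```

Both `urljoin` calls resolve to the same address, so the crawler does not need to care whether the site writes its links with or without a leading slash.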

2) Multi-thread crawling

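import threading
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Every started thread is kept here so it can be joined later.
threads = []


def get_topic_contents(url):
    # Download one article page and print its body element.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    contents = soup.select('#topic_contents')
    print(contents)


def start_threads(url):
    # Start a worker thread for one article without waiting for it.
    t = threading.Thread(target=get_topic_contents, args=(url,))
    t.start()
    threads.append(t)


def finish_threads(threads):
    # Block until every worker thread has finished.
    for t in threads:
        t.join()


def get_topics(url):
    # Collect the article links from the front page.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.select('.topicdesc')

    for title in titles:
        link = title.find('a')
        if link is None:  # skip entries without a link
            continue
        new_url = urljoin(url, link.get('href'))
        start_threads(new_url)

    finish_threads(threads)


start = time.perf_counter()

url = 'https://news.hada.io/'

get_topics(url)

end = time.perf_counter()
print(f"All multithreaded tasks done! {end-start:.4f}s")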
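The crawler above starts one thread per article link. If the number of links grows, it may be worth capping how many requests run at once. The sketch below uses a different technique from the one in this post, `concurrent.futures.ThreadPoolExecutor` from the standard library, with a hypothetical `fetch_stub` that sleeps instead of calling `requests.get`. The `example.com` URLs and the `max_workers=5` limit are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_stub(url):
    # Hypothetical stand-in for get_topic_contents: waits instead of
    # sending a real HTTP request, then returns a fake page body.
    time.sleep(0.05)
    return f"contents of {url}"


# Made-up URLs used only for this sketch.
urls = [f"https://example.com/topic/{i}" for i in range(10)]

start = time.perf_counter()
# At most 5 fetches run at the same time; the rest wait in a queue.
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in the same order as the input URLs.
    pages = list(pool.map(fetch_stub, urls))
elapsed = time.perf_counter() - start

print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Leaving the `with` block waits for all submitted work to finish, so no separate join step is needed, and the worker return values are available directly instead of only being printed.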

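4. Conclusion

In my run, the multithreaded version was roughly 6 to 7 times faster, which shows that multithreading is a good way to speed up crawling. This program only covers a single page, but if you need to crawl something like 30 websites periodically, it is worth trying to save time.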
