[Python] Breaking Image CAPTCHAs with Tesseract

JaehyoJJAng, June 5, 2023

▶︎ Installing Tesseract

Tesseract must be installed before it can be used.

I will be working on a Mac, so the installation is described with macOS in mind.

On Windows or Linux, see https://tesseract-ocr.github.io/tessdoc/Installation.html for the installation instructions.

‣ Homebrew

To install Tesseract, run this command:

brew install tesseract

‣ Ubuntu

sudo apt-get install -y tesseract-ocr
sudo apt-get install -y libtesseract-dev
sudo apt-get install -y libgl1-mesa-glx

Note for Ubuntu users: If apt is unable to find the package, try adding a universe entry to the sources.list file as shown below.

$ sudo vi /etc/apt/sources.list

Copy the first line, "deb http://archive.ubuntu.com/ubuntu bionic main", and paste it on the next line, changing "main" to "universe" as shown below.
If you are using a different release of Ubuntu, replace bionic with the respective release name.

deb http://archive.ubuntu.com/ubuntu bionic universe
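
After installing, it can be worth confirming that the tesseract binary is actually reachable before moving on. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are made up for this example and are not part of Tesseract or pytesseract.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Run `<binary> --version` and return the first line of its output."""
    result = subprocess.run(
        [binary_path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; check the installation steps.")
    else:
        print(path)
        print(tesseract_version(path))
```

If `find_tesseract()` returns None, the package is either not installed or its directory is not on PATH.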

▶︎ Saving the CAPTCHA image

In my case, I am writing code that automates a particular task with Playwright, so I had it open the URL that contains the CAPTCHA image and save that image locally.

{% include codeHeader.html name="main.py" %}

from playwright.sync_api import sync_playwright

def save_image(url, img_tag_selector, file_path):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        
        # Select the image tag and save a screenshot of that element.
        img_element = page.query_selector(img_tag_selector)
        if img_element:
            img_element.screenshot(path=file_path)
        else:
            print(f"Could not find an image element with selector '{img_tag_selector}' on the page.")
        
        browser.close()

# Set the page URL, the CSS selector of the image tag, and the path to save the image to
url = "https://example.com"
img_tag_selector = "img[src$='.jpg']"  # example CSS selector for the image tag
file_path = "code.png"

# Take a screenshot of the image and save it
save_image(url, img_tag_selector, file_path)
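
Since `save_image` only prints a message when the selector matches nothing, the OCR step could end up running on a missing or stale file. One way to guard against that is to check the saved file before using it. The sketch below assumes the screenshot was written as a PNG (Playwright infers the screenshot format from the file extension, so `code.png` should be one); `is_png_file` is a hypothetical helper, not a Playwright function.

```python
from pathlib import Path

# Every PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png_file(file_path):
    """Return True if file_path exists and starts with the PNG signature."""
    path = Path(file_path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE
```

For example, `is_png_file("code.png")` can be called right after `save_image(...)` to decide whether to continue with the OCR step.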


▶︎ Extracting text from the image

Let's extract the string from the image with Tesseract.

If pytesseract and cv2 are not installed, run the command below to install them.

pip install pytesseract opencv-python

{% include codeHeader.html name="tesseract.py" %}

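import cv2
import pytesseract

# tesseract_cmd must point to the Tesseract executable, not to the image file.
# If `tesseract` is already on PATH, this line is not needed.
# pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"  # example path; adjust for your system

# Load the saved image
image = cv2.imread('code.png')

# Extract the text in the image using pytesseract
text = pytesseract.image_to_string(image)
print(text)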
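
CAPTCHA images tend to be noisy, so a plain `image_to_string` call may misread them. A common approach is to binarize the image first, tell Tesseract that the image is a single line of text, and strip anything unexpected from the result. The following is only a sketch under those assumptions: `build_tesseract_config`, `clean_captcha_text`, and `read_captcha` are hypothetical helper names, and whether `tessedit_char_whitelist` takes effect depends on the Tesseract version and engine in use.

```python
import re


def build_tesseract_config(psm=7, whitelist=None):
    """Build a Tesseract config string.

    --psm 7 treats the image as a single line of text.
    whitelist, if given, restricts recognition to those characters.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def clean_captcha_text(raw_text):
    """Keep only ASCII letters and digits from the OCR output."""
    return re.sub(r"[^A-Za-z0-9]", "", raw_text)


def read_captcha(image_path, whitelist=None):
    """Grayscale and Otsu-threshold the image, then run Tesseract on it."""
    # Imported here so the helpers above work without OpenCV installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    config = build_tesseract_config(whitelist=whitelist)
    raw_text = pytesseract.image_to_string(binary, config=config)
    return clean_captcha_text(raw_text)
```

For a digits-only CAPTCHA, this could be called as `read_captcha("code.png", whitelist="0123456789")`. Whether thresholding helps depends on the particular CAPTCHA, so it is worth comparing the result against the unprocessed image.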
