BeautifulSoup example: Python parsing HTML code

Crawlers always getting their IP blocked? Try this combo!

Brothers, you've run into this, right? You write a crawler script in Python, and two minutes after it starts running the target site hits you with a 403 error. Don't smash the keyboard just yet; today I'll teach you to use BeautifulSoup + proxy IP, a golden pair, to break the deadlock.

A real case: last month a guy doing e-commerce price comparison was scraping a shopping platform with a plain script, and his IP got blacklisted after only half an hour of running. After switching to ipipgo's rotating proxy plan, combined with the parsing techniques we're about to cover, he now stably captures tens of thousands of product records every day.

Hands-on building of anti-blocking environment

First install these two essential libraries (preferably inside a virtual environment):

pip install beautifulsoup4 requests

Here comes the key point! Crawling without a proxy is like going online naked; a proxy IP is a bulletproof vest for your crawler. Using ipipgo's service as an example, here is how to configure it:

proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}

Remember to replace the authentication details with your own account. ipipgo's dedicated proxies assign a separate port to each channel, so don't mix them up.
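To keep those credentials out of your source code, you can assemble the proxies dict at runtime from environment variables. A minimal sketch; the gateway address, port, and variable names below are illustrative placeholders, not official ipipgo settings:

```python
import os

def build_proxies(gateway="gateway.ipipgo.com", port=9020):
    """Assemble a requests-style proxies dict from environment
    variables so credentials stay out of the script itself.
    (Gateway, port, and variable names are placeholders --
    substitute your own account details.)"""
    user = os.environ.get("IPIPGO_USER", "username")
    pwd = os.environ.get("IPIPGO_PASS", "password")
    url = f"http://{user}:{pwd}@{gateway}:{port}"
    return {"http": url, "https": url}

proxies = build_proxies()
```

You can then pass `proxies=proxies` straight into `requests.get()` as shown above.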

Four Steps to Web Parsing

A real-world parse of a news site (details anonymized):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a browser
response = requests.get('https://example.com/news',
                        proxies=proxies, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Grab titles with a specific class
titles = soup.find_all('h3', class_='news-title')
for title in titles:
    print(title.get_text().strip())

Pitfall guide: the three easiest places to stumble here are: 1) forgetting the request headers, so you get recognized as a crawler; 2) poor-quality proxy IPs causing requests to fail; 3) page-structure changes breaking the selectors. The first two can be solved with ipipgo's quality proxies plus a standard request-header template.
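For the third pitfall (selector failure), a defensive extraction helper degrades gracefully instead of silently returning nothing. A sketch with made-up selectors and sample HTML; adjust both to the real page:

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    """Try the expected selector first, then fall back to a looser
    one, so a minor page redesign doesn't break the crawler outright.
    (The selectors here are illustrative.)"""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all("h3", class_="news-title")
    if not titles:
        # Fallback: grab every <h3> and let downstream filtering decide
        titles = soup.find_all("h3")
    return [t.get_text().strip() for t in titles]

# The site renamed the class from "news-title" to "headline" --
# the fallback still recovers the text.
sample = '<h3 class="headline"> Breaking </h3><h3 class="headline">Update</h3>'
print(extract_titles(sample))
```

Logging when the fallback fires is also worth adding, so you notice the page changed before the data quietly degrades.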

How do you crack dynamically rendered content?

When you hit JavaScript-rendered pages, BeautifulSoup alone can't do the job. Don't panic; here are the go-to solutions:

Scenario | Recommended tool | ipipgo configuration suggestion
Simple dynamic loading | requests-html library | Use long-lasting static IPs
Complex interactive pages | Selenium automation | Pair with browser-fingerprint protection

Focusing on the Selenium solution, remember to add the proxy in the driver configuration:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://gateway.ipipgo.com:9020')
driver = webdriver.Chrome(options=options)

Frequently Asked Questions First Aid Kit

Q: Why am I still getting blocked even though I'm using a proxy?
A: Check three things: 1) whether the proxy is actually in effect; 2) whether the request frequency is too high; 3) whether you triggered the site's anti-scraping rules. ipipgo's pay-per-volume plan is recommended, which automatically rotates high-anonymity IPs.
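For the frequency problem, the simplest fix is a randomized delay between requests. A minimal sketch; `fetch` stands in for whatever request function you use (for example a `requests.get` wrapper routed through your proxy):

```python
import random
import time

def polite_get(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Fetch each URL with a randomized pause in between, so the
    traffic pattern looks less machine-like than a fixed interval."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(random.uniform(min_delay, max_delay))
    return results
```

Tune the delay bounds to the target site; for large jobs, combine this with proxy rotation rather than relying on delays alone.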

Q: What should I do if the response comes back as garbled text?
A: Specify the encoding when initializing BeautifulSoup:
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
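To see why this works, a self-contained sketch: pass the raw bytes (`response.content`, not `response.text`) together with `from_encoding`, so BeautifulSoup decodes them itself instead of trusting a possibly wrong guess:

```python
from bs4 import BeautifulSoup

# Simulated raw response body whose charset header was missing.
raw = "<p>价格: ¥99</p>".encode("utf-8")

# Giving BeautifulSoup the bytes plus the correct encoding avoids
# the mojibake you can get from a mis-decoded response.text.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.get_text())
```

If you don't know the encoding in advance, `response.apparent_encoding` from requests can supply a reasonable guess to feed into `from_encoding`.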

Q: How do I choose an ipipgo proxy plan?
A: Beginners can start with the trial plan ($5/day) and move to the enterprise custom plan once the business stabilizes. Special reminder: for large-scale collection, be sure to choose a dedicated IP pool; shared IPs easily interfere with one another.

One final point: the heart of web parsing is stable page acquisition + accurate data extraction. Using ipipgo's proxy service is like bolting a turbocharger onto your crawler: it keeps your IP from being blocked and boosts collection efficiency. If you have specific questions, head to the ipipgo official website for technical support; their technical staff genuinely reply within seconds.

Our products are supported only in overseas network environments (except for the TikTok dedicated line). Any activity users carry out with IPIPGO does not represent the will or views of IPIPGO, and IPIPGO assumes no legal liability.
