
I. Why use BeautifulSoup to find proxy IPs?
Anyone who has done data collection for a while knows that many sites tuck proxy IPs away inside the HTML structure. This is where find_all comes in: it works like a metal detector that digs proxy IPs out of the nooks and crannies of a web page. For example, some websites put IP addresses in a div with the class "proxy-list", so find_all('div', class_='proxy-list') will locate every such container, and the addresses can then be read from the tags inside it.
from bs4 import BeautifulSoup
html_doc = """
<div class="proxy-list">
<span>192.168.1.1:8080</span>
<span>10.0.0.1:8888</span>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ip_list = [tag.text for tag in soup.find_all('span')]
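The snippet above grabs every span on the page, which can pull in unrelated text on a busier layout. Here is a minimal sketch that first narrows the search to the "proxy-list" div and then keeps only strings that look like ip:port. The sample HTML, the extract_proxies helper and the IP_PORT pattern are made up for illustration, not taken from any particular site.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page: one proxy list plus an unrelated span.
html_doc = """
<span>not a proxy</span>
<div class="proxy-list">
  <span>192.168.1.1:8080</span>
  <span>10.0.0.1:8888</span>
</div>
"""

# Loose ip:port pattern (it does not check that each octet is <= 255).
IP_PORT = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}$")


def extract_proxies(html):
    """Collect ip:port strings found inside div.proxy-list containers."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for box in soup.find_all("div", class_="proxy-list"):
        for span in box.find_all("span"):
            text = span.get_text(strip=True)
            if IP_PORT.match(text):
                found.append(text)
    return found


ip_list = extract_proxies(html_doc)
```

Scoping the inner find_all to each container keeps the stray span outside the list from ending up in the results.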
II. Hands-on: extracting proxy IPs in different formats
Some sites are sneaky enough to hide the IP and the port separately, so the two pieces have to be combined. For example, a page may store them as data-ip and data-port attributes on each li element. In that case the extraction code looks like this:
proxies = []
# Match only <li> tags that carry a data-ip attribute
for li in soup.find_all('li', attrs={"data-ip": True}):
    ip = li['data-ip']
    port = li['data-port']
    proxies.append(f"{ip}:{port}")
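To try the loop end to end, here is a self-contained sketch. The markup is a hypothetical example of the split IP/port layout, and the extra check for a missing data-port is my own addition so that one incomplete entry does not raise a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: IP and port stored in separate data-* attributes.
html_doc = """
<ul>
  <li data-ip="192.168.1.1" data-port="8080">node A</li>
  <li data-ip="10.0.0.1" data-port="8888">node B</li>
  <li data-ip="10.0.0.2">port missing</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
proxies = []
for li in soup.find_all("li", attrs={"data-ip": True}):
    port = li.get("data-port")  # None when the attribute is absent
    if port is None:
        continue  # skip incomplete entries instead of crashing
    proxies.append(f"{li['data-ip']}:{port}")
```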
III. Three tricks for getting past anti-scraping measures
1. Disguise: use ipipgo's dynamic residential IPs and rotate the request headers
2. Timing: sleep for a random 1-3 seconds after each find_all pass
3. Distributed collection: harvest through several ipipgo API nodes at the same time
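The first two tricks can be sketched with the standard library alone. The User-Agent strings and the helper names next_headers and polite_pause are placeholders of my own. The sleep parameter exists only so the pause can be swapped out when testing.

```python
import random
import time

# Hypothetical User-Agent strings; swap in whatever your project uses.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def next_headers():
    """Pick a random User-Agent for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_pause(low=1.0, high=3.0, sleep=time.sleep):
    """Sleep for a random interval between low and high seconds; return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

A typical loop would call next_headers() before each request and polite_pause() after each page has been parsed.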
IV. Handling special scenarios
If you run into paginated data, don't panic: grab the page-number links and then go through them one by one:
import requests

page_links = [a['href'] for a in soup.find_all('a', class_='page-link')]
for link in page_links:
    # Remember to switch to ipipgo's proxy here
    response = requests.get(link, proxies={"http": "ipipgo.com:8000"})
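Page links are often relative and often repeated (a "next" button pointing at the same page as a number), so it can help to normalize them before requesting anything. A minimal sketch, assuming a hypothetical base URL and pagination markup. page_urls is an illustrative helper, not part of BeautifulSoup.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def page_urls(html, base_url):
    """Return absolute, de-duplicated page URLs in document order."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # href=True skips anchors that have no href attribute at all
    for a in soup.find_all("a", class_="page-link", href=True):
        url = urljoin(base_url, a["href"])
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical pagination markup and base URL.
sample = """
<a class="page-link" href="/list?page=1">1</a>
<a class="page-link" href="/list?page=2">2</a>
<a class="page-link" href="/list?page=2">next</a>
<a class="page-link">disabled</a>
"""
links = page_urls(sample, "https://example.com/list")
```

Each URL in links can then be fetched through the proxy in the same way as in the loop above.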
V. Q&A: clearing up frequently asked questions
Q: Why does find_all always return an empty list?
A: Most likely the site loads its content dynamically, so the data is not in the initial HTML. Pairing ipipgo's SOCKS5 proxy with Selenium is the way to go.
Q: What should I do if I extract duplicate IPs?
A: De-duplicate with a Python set, or just use the real-time de-duplication API that ipipgo provides.
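A plain set drops duplicates but also loses the original order. If order matters, dict.fromkeys does the same job while keeping the first occurrence of each entry. A small sketch with made-up addresses (dedupe is an illustrative helper):

```python
def dedupe(proxies):
    """Drop duplicates while keeping first-seen order."""
    # strip() so that stray whitespace does not hide a duplicate
    return list(dict.fromkeys(p.strip() for p in proxies))


raw = ["10.0.0.1:8888", "192.168.1.1:8080", "10.0.0.1:8888 "]
unique = dedupe(raw)
```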
Q: What if I need to verify that a proxy is still valid?
A: ipipgo's packages come with a built-in liveness check, which saves you from having to write your own validation scripts.
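If you do want a quick check of your own, here is a rough sketch using only the standard library. Note the assumption: it only tests whether a TCP connection to host:port can be opened, which is weaker than confirming that the proxy actually forwards traffic. is_reachable is an illustrative helper, not an ipipgo API.

```python
import socket


def is_reachable(proxy, timeout=3.0):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = proxy.rpartition(":")
    try:
        port_num = int(port)
    except ValueError:
        return False
    if not host or not 0 < port_num < 65536:
        return False
    try:
        with socket.create_connection((host, port_num), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable host, and similar failures
        return False
```

A stricter check would send a real request through the proxy and compare the response, at the cost of one extra round trip per proxy.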
VI. Why do veterans choose ipipgo?
1. A dedicated IP survival rate of 99.2%, a big step up from the competition.
2. Hourly billing is supported, so no money is wasted on short-term jobs.
3. Ready-made BeautifulSoup parsing templates are provided, so you can be up and running in seconds.
Lastly, to be honest, looking for proxy IPs is like panning for gold: however good the tool is, you still need a reliable place to dig. I've used five or six service providers, and ipipgo's IP pool is the most up-to-date of them. In particular, the intelligent routing feature automatically matches you to the fastest node, which is far less hassle than switching manually. Recently I have been doing e-commerce data collection, and after switching to ipipgo the collection speed doubled and, crucially, nothing has been blocked. It really is rock solid.

