
Getting your IP blocked while crawling XML data? Try this trick
Anyone who writes crawlers knows the biggest headache when scraping XML data is having your IP blocked by the target site. Just last week my colleague Lao Zhang got burned by exactly this: the weather data collection script he wrote ran for less than 3 hours before the server's IP was blacklisted. This is when it's time to bring out our **proxy IP method**!
import requests
from xml.etree import ElementTree

# Route both HTTP and HTTPS traffic through the ipipgo gateway
# (replace username/password with your own ipipgo credentials)
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}

response = requests.get('http://data.example.com/weather.xml', proxies=proxies)
xml_data = ElementTree.fromstring(response.content)
Look at the proxy settings in the code: here we use ipipgo's **dynamic residential proxies**. Their IP pool is refreshed with 200,000+ new addresses every day, which makes it more than ten times as stable as public proxies. Remember to replace username and password with the credentials you registered on the ipipgo website.
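By the way, to avoid hardcoding the account into every script, one simple option is to read the credentials from environment variables. A minimal sketch (the variable names IPIPGO_USER and IPIPGO_PASS are just my own convention, not anything ipipgo requires):

import os
import requests

# Suggested variable names only; set them however your deployment manages secrets
user = os.environ['IPIPGO_USER']
password = os.environ['IPIPGO_PASS']

proxy_url = f'http://{user}:{password}@gateway.ipipgo.com:9020'
proxies = {'http': proxy_url, 'https': proxy_url}

response = requests.get('http://data.example.com/weather.xml', proxies=proxies, timeout=10)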
Hitting a CAPTCHA while parsing XML? Use proxy rotation
Many sites bury anti-crawler traps in their XML interfaces. Typical situations:
| Symptom | Traditional fix | Proxy-based fix |
|---|---|---|
| CAPTCHA pops up mid-parse | Manual handling stalls the job | Switch IP automatically and keep going |
| A specific tag fails to load | Retrying over and over wastes time | Fetch in parallel from IPs in multiple regions |
With ipipgo's **intelligent rotation mode**, the API also lets you specify city-level targeting: when grabbing region-specific XML data, you can select an exit node in the corresponding region directly.
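Exactly how you pin a city-level exit node depends on ipipgo's gateway format (check their console for the real parameters), but the scraping side can stay simple: keep a mapping from region to gateway endpoint and pick the one that matches the data you're after. The endpoints and ports below are placeholders, not documented ipipgo values:

import requests
from xml.etree import ElementTree

# Placeholder mapping: substitute the real region-specific endpoints
# or credential tags from your ipipgo console
REGION_GATEWAYS = {
    'guangdong': 'http://user:pass@gateway.ipipgo.com:9030',
    'zhejiang':  'http://user:pass@gateway.ipipgo.com:9031',
}

def fetch_regional_xml(url, region):
    gateway = REGION_GATEWAYS[region]
    proxies = {'http': gateway, 'https': gateway}
    resp = requests.get(url, proxies=proxies, timeout=10)
    return ElementTree.fromstring(resp.content)

# Example: pull Guangdong-specific data through a Guangdong exit node
tree = fetch_regional_xml('http://data.example.com/weather.xml?region=gd', 'guangdong')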
Practical case: capturing logistics information through proxy IPs
I recently helped an e-commerce company build a logistics tracking system; the core code looks like this:
import requests
import xmltodict
from itertools import cycle

# Pool of ipipgo gateway endpoints to rotate through
ip_pool = [
    'gateway.ipipgo.com:9020',
    'gateway.ipipgo.com:9021',
    'gateway.ipipgo.com:9022',
]
proxy_cycler = cycle(ip_pool)

def fetch_logistics(tracking_num, retries=3):
    # Take the next proxy from the pool for every request
    current_proxy = next(proxy_cycler)
    proxies = {'https': f'http://user:pass@{current_proxy}'}
    try:
        response = requests.get(f'https://logistics.com/api?num={tracking_num}',
                                proxies=proxies, timeout=8)
        return xmltodict.parse(response.text)
    except Exception as e:
        print(f"IP {current_proxy} request failed ({e}), switching automatically")
        if retries <= 0:
            raise
        return fetch_logistics(tracking_num, retries - 1)
This setup works with ipipgo's **long-lasting static proxies**: a single IP can be held for more than 24 hours, which makes it especially suitable for XML interfaces that need to maintain a session, such as government data platforms with cookie authentication.
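If you do need to hold a session on one of those cookie-authenticated endpoints, the idea is to bind a requests.Session to a single long-lived proxy so the cookies and the exit IP stay consistent. A minimal sketch (the login URL and form fields are made-up placeholders, not a real platform's API):

import requests
import xmltodict

# Hypothetical endpoints for illustration only
LOGIN_URL = 'https://gov-data.example.com/login'
DATA_URL = 'https://gov-data.example.com/export.xml'

# One long-lived static proxy so the session's cookies stay tied to a single exit IP
static_proxy = {'https': 'http://user:pass@gateway.ipipgo.com:9020'}

session = requests.Session()
session.proxies.update(static_proxy)

# Authenticate once; the session keeps the cookies for later requests
session.post(LOGIN_URL, data={'account': 'demo', 'password': 'demo'}, timeout=10)

# Subsequent XML fetches reuse the same cookies and the same exit IP
resp = session.get(DATA_URL, timeout=10)
records = xmltodict.parse(resp.text)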
Common newbie pitfalls: Q&A
Q: Why does the proxy IP keep timing out when I use it?
A: 80% of the time you're on a free proxy. ipipgo's commercial-grade proxies come with an **automatic reconnection mechanism** by default and will intelligently switch lines when the network fluctuates.
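Whatever the gateway does, it doesn't hurt to wrap your own calls in a small client-side retry with backoff as a safety net. A sketch using nothing beyond the requests library:

import time
import requests

def get_with_retries(url, proxies, attempts=3, timeout=8):
    # Simple client-side safety net: retry with a growing delay between attempts
    for i in range(attempts):
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException:
            if i == attempts - 1:
                raise
            time.sleep(2 ** i)  # wait 1s, then 2s, ... before the next try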
Q: Why do I keep getting incomplete data when parsing XML?
A: The IP may be too slow, causing the transfer to be cut off. Switch the proxy type to the **high-speed channel** in the ipipgo console; in practice download speeds can improve by up to 3x.
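You can also catch truncated responses on your side before they pollute your data: a payload that got cut off almost never parses as well-formed XML, so the parse itself doubles as an integrity check. A rough sketch:

import requests
from xml.etree import ElementTree

def fetch_complete_xml(url, proxies, attempts=3):
    for _ in range(attempts):
        resp = requests.get(url, proxies=proxies, timeout=15)
        try:
            # A truncated download fails to parse, so retry on ParseError
            return ElementTree.fromstring(resp.content)
        except ElementTree.ParseError:
            continue  # switch IP / try again
    raise RuntimeError(f'Could not get a complete XML document from {url}')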
Q: What if I need to process multiple XML files at the same time?
A: Use their **multi-threading package**, and swap the standard library for lxml to parse more efficiently.
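I can't speak for the exact shape of that package, but a plain concurrent.futures thread pool plus lxml already gets you a long way for I/O-bound XML fetching (the tracking numbers below are samples):

from concurrent.futures import ThreadPoolExecutor
import requests
from lxml import etree

proxies = {'https': 'http://user:pass@gateway.ipipgo.com:9020'}
urls = [f'https://logistics.com/api?num={n}' for n in ('SF100', 'SF101', 'SF102')]

def fetch_and_parse(url):
    resp = requests.get(url, proxies=proxies, timeout=8)
    return etree.fromstring(resp.content)  # lxml is noticeably faster than the standard library

# A handful of worker threads is usually enough; the work is network-bound, not CPU-bound
with ThreadPoolExecutor(max_workers=5) as pool:
    trees = list(pool.map(fetch_and_parse, urls))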
Lastly, a word of caution: don't judge a proxy service by price alone. ipipgo's **two-way encrypted transmission** combined with **request-header masquerading** can dodge about 90% of anti-crawling detection. I once forgot to turn these features on and had 20 IPs blocked within 10 minutes. Lesson learned the hard way!
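Whatever your provider does at the gateway, it's also worth dressing up your own request headers. A small sketch that rotates a few common desktop User-Agent strings on each request:

import random
import requests

# A few realistic desktop User-Agent strings to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

proxies = {'https': 'http://user:pass@gateway.ipipgo.com:9020'}
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    'Accept': 'application/xml,text/xml;q=0.9,*/*;q=0.8',
}

resp = requests.get('https://logistics.com/api?num=SF100', proxies=proxies, headers=headers, timeout=8)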

