
A Hands-On Guide to Choosing a Format: What Is the Real Difference Between XML and JSON?
Anyone who has done data collection for a while has run into XML and JSON, that pair of old rivals. When you crawl data through proxy IPs in particular, the two behave very differently. Let's use proxy IP collection as the running example. XML is the chatterbox: every piece of data has to be wrapped in its own layer of "clothing", for example:
<proxy>
<ip>1.2.3.4</ip>
<port>8080</port>
<type>https</type>
</proxy>
JSON, by contrast, is a straight shooter and skips the chatter:
{
"ip": "1.2.3.4",
"port": 8080,
"type": "https"
}
See the difference? When collecting data through proxy IPs, the JSON format can save at least 30% of traffic. For collection tasks that switch IPs frequently, that makes it a real fuel saver.
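The size gap is easy to check for yourself. Below is a minimal sketch that serializes the sample record from this article both ways and compares the byte counts. It assumes compact output with no indentation in either format; the helper names `to_xml_bytes` and `to_json_bytes` are made up for illustration, and the exact saving will vary with field names and whitespace.

```python
import json
import xml.etree.ElementTree as ET

# The sample record used in this article.
record = {"ip": "1.2.3.4", "port": 8080, "type": "https"}


def to_xml_bytes(rec):
    """Serialize one record as a compact <proxy> element."""
    root = ET.Element("proxy")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    # ET.tostring returns bytes and adds no XML declaration by default.
    return ET.tostring(root)


def to_json_bytes(rec):
    """Serialize one record as compact JSON (no spaces after separators)."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")


xml_size = len(to_xml_bytes(record))
json_size = len(to_json_bytes(record))
saving = 1 - json_size / xml_size

print(f"XML: {xml_size} bytes, JSON: {json_size} bytes, saving: {saving:.0%}")
```

Every XML field pays for its name twice (opening and closing tag), which is where most of the extra bytes come from.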
Proxy Collection in Action: Format Choice Matters
ipipgo customers have tested this by collecting the same 1,000 proxy IP records in each format:
- XML: 8.2 seconds on average
- JSON: 5.1 seconds on average
Why such a big gap? It comes down to packet size. A proxy IP service already adds its own response time, and if the data format drags things down further, collection efficiency drops off a cliff. A quick plug here: ipipgo's interface supports both output formats by default, and switching formats only takes changing one parameter.
For example:
import requests
requests.get("https://api.ipipgo.com/get", params={"format": "json"})
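To sanity-check the packet-size explanation without touching any live service, here is a rough local benchmark. It builds 1,000 synthetic proxy records (the addresses are invented), serializes them in both formats, and times parsing with the standard library. It measures only payload size and parse cost on your machine, so it will not reproduce the 8.2 s and 5.1 s figures, which also include network and proxy latency.

```python
import json
import timeit
import xml.etree.ElementTree as ET

N = 1000  # same record count as the test described above

# Synthetic proxy records; the addresses are made up for the benchmark.
records = [
    {"ip": f"10.0.{i // 256}.{i % 256}", "port": 8080, "type": "https"}
    for i in range(N)
]

json_payload = json.dumps(records, separators=(",", ":"))
xml_payload = "<proxies>" + "".join(
    f"<proxy><ip>{r['ip']}</ip><port>{r['port']}</port>"
    f"<type>{r['type']}</type></proxy>"
    for r in records
) + "</proxies>"


def parse_json(text):
    """Parse the JSON payload into a list of dicts."""
    return json.loads(text)


def parse_xml(text):
    """Parse the XML payload into the same list-of-dicts shape."""
    root = ET.fromstring(text)
    return [
        {
            "ip": p.findtext("ip"),
            "port": int(p.findtext("port")),
            "type": p.findtext("type"),
        }
        for p in root.iter("proxy")
    ]


# Both parsers must recover identical data before timing means anything.
assert parse_json(json_payload) == parse_xml(xml_payload) == records

json_time = timeit.timeit(lambda: parse_json(json_payload), number=50)
xml_time = timeit.timeit(lambda: parse_xml(xml_payload), number=50)

print(f"payload bytes  JSON: {len(json_payload)}  XML: {len(xml_payload)}")
print(f"parse x50      JSON: {json_time:.3f}s  XML: {xml_time:.3f}s")
```

Note that the XML path also has to convert the port back to an integer by hand, since XML carries everything as text.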
Pitfall Guide: Details That Will Trip You Up
Ever seen someone parse proxy IPs from XML and fall straight into a hole? These are the most outrageous cases I have run into:
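1. Wrong tag case (XML tag names are case-sensitive, so an opening and a closing tag that differ only in case do not match)
2. Attribute values left unquoted (XML requires the quotes, so a strict parser rejects the document)
3. Forgetting to handle CDATA sections and comments (commented-out entries get collected as real data)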
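The three pitfalls above can be reproduced in a few lines. This is a small sketch using Python's standard `xml.etree.ElementTree`; the sample documents and the address 9.9.9.9 are invented for the demonstration, and the regex stands in for the kind of quick-and-dirty scraping that causes the third problem.

```python
import re
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# 1. Tag names are case-sensitive: <IP> is not closed by </ip>.
case_mismatch_ok = parses("<proxy><IP>1.2.3.4</ip></proxy>")

# 2. Attribute values must be quoted.
unquoted_ok = parses("<proxy ip=1.2.3.4 />")
quoted_ok = parses('<proxy ip="1.2.3.4" />')

# 3. Comments and CDATA: a regex scrape picks up a commented-out entry and
#    the raw CDATA wrapper, while a real parser skips the comment and
#    unwraps the CDATA section.
doc = (
    "<proxies>"
    "<!-- <ip>9.9.9.9</ip> retired -->"
    "<ip><![CDATA[1.2.3.4]]></ip>"
    "</proxies>"
)
regex_ips = re.findall(r"<ip>(.*?)</ip>", doc)
parsed_ips = [el.text for el in ET.fromstring(doc).iter("ip")]

print("case mismatch parses:", case_mismatch_ok)
print("unquoted attribute parses:", unquoted_ok)
print("quoted attribute parses:", quoted_ok)
print("regex scrape:", regex_ips)
print("real parser:", parsed_ips)
```

The takeaway is that XML needs a real parser and well-formed input; string matching on tags is where most of these accidents start.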
JSON, on the other hand, has none of these headaches. It is especially handy for proxy IP data that carries geolocation information, like ipipgo's, because the nested structure is easy to handle:
{
"node": {
"ip": "1.2.3.4",
"location": {
"city": "shanghai",
"carrier": "Telecom"
}
}
}
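Reading that nested structure takes very little code. The sketch below parses the sample above with the standard `json` module; `get_path` is a hypothetical helper added here so that a missing field returns a default instead of raising `KeyError`, and the field name "zip" is only there to show the fallback.

```python
import json

payload = """
{
  "node": {
    "ip": "1.2.3.4",
    "location": {
      "city": "shanghai",
      "carrier": "Telecom"
    }
  }
}
"""

data = json.loads(payload)

# Direct access works when the structure is guaranteed.
city = data["node"]["location"]["city"]


def get_path(obj, *keys, default=None):
    """Walk nested dicts, returning default if any key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


carrier = get_path(data, "node", "location", "carrier")
postcode = get_path(data, "node", "location", "zip", default="n/a")

print(city, carrier, postcode)
```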
Q&A
Q: Why is JSON always recommended?
A: Here is a rough analogy: XML is like a courier parcel wrapped in ten layers of bubble wrap, while JSON ships the item with no packaging at all. For collection tasks that switch proxy IPs frequently, the traffic you save is enough to crawl a few more websites.
Q: What should I watch out for when collecting through proxy IPs?
A: Remember three things: 1) choose a provider that supports automatic IP switching (such as ipipgo's polling interface); 2) set the timeout to no more than 3 seconds; 3) switch IPs as soon as you hit a CAPTCHA.
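Those three rules fit into one small loop. The following is a sketch only: `collect`, `looks_like_captcha`, and `fake_fetch` are hypothetical names, the CAPTCHA check is a naive keyword match that real sites will need replaced, and the fetch function is passed in by the caller so the example runs without any network access.

```python
from itertools import cycle

TIMEOUT_SECONDS = 3.0  # rule 2: never wait longer than 3 seconds


def looks_like_captcha(body):
    """Hypothetical check; real sites need their own detection rule."""
    return "captcha" in body.lower()


def collect(url, proxies, fetch, max_attempts=5):
    """Try proxies in turn until one returns a usable page.

    fetch(url, proxy, timeout) is supplied by the caller. It must return
    the response body as a string, or raise OSError (TimeoutError is a
    subclass of OSError) when the proxy is slow or unreachable.
    """
    if not proxies:
        return None, None
    rotation = cycle(proxies)  # rule 1: automatic switching
    for _ in range(max_attempts):
        proxy = next(rotation)
        try:
            body = fetch(url, proxy, TIMEOUT_SECONDS)
        except OSError:
            continue  # slow or dead proxy: move on to the next one
        if looks_like_captcha(body):
            continue  # rule 3: CAPTCHA seen, switch IP immediately
        return proxy, body
    return None, None


# Stand-in fetcher so the sketch runs offline; the proxies are invented.
def fake_fetch(url, proxy, timeout):
    if proxy == "1.2.3.4:8080":
        raise TimeoutError("proxy too slow")
    if proxy == "5.6.7.8:8080":
        return "<html>Please solve the CAPTCHA</html>"
    return "<html>product list</html>"


used_proxy, page = collect(
    "https://example.com/items",
    ["1.2.3.4:8080", "5.6.7.8:8080", "9.10.11.12:8080"],
    fake_fetch,
)
print(used_proxy, page)
```

In real use, the fetcher could wrap `requests.get(url, proxies=..., timeout=timeout)` and return `.text`; that wiring is left out here to keep the example self-contained.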
Q: What are the exclusive advantages of ipipgo?
A: Three points, honestly: ① street-level location targeting when selecting proxy IPs; ② response times kept within 200 ms; ③ 20% of the IP pool refreshed automatically every day, which works very well against blocking.
Final Recommendations
To wrap up, here is a no-frills comparison table:
Processing speed: JSON wins √
Fault tolerance: XML is slightly stronger ×
Extensibility: tie ≈
Traffic consumption: JSON saves 30%+ √
If proxy IP collection is your main use case, you can pick JSON with your eyes closed. If you use ipipgo, it is also worth turning on their intelligent format conversion, which automatically adapts to the target site's parsing needs. In testing, this feature improved the collection success rate by 20%.
A real case: an e-commerce customer collected through proxy IPs using the XML format and triggered the CAPTCHA 300+ times every hour. After switching to the JSON format plus ipipgo dynamic residential proxies, that number dropped straight to single digits. Is that gap convincing enough?

