What to do when a crawler encounters a Content-Type error?
Recently, a friend who works in e-commerce complained to me that the crawler he had written kept getting blocked by the target website. I asked him to send me the code and, sure enough, the requests did not even set a Content-Type header! It's like knocking on someone's door without saying whether you are delivering a package or reading the water meter; of course the doorman won't let you in.
Many beginners overlook this header, but Content-Type is effectively your request's ID card. Especially when collecting data through proxy IPs, the server relies on this field to decide how to interpret the request body. For crawlers, the two most common types are:
- application/x-www-form-urlencoded: the standard type for form submissions
- application/json: expected by API endpoints that accept JSON
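To make the difference concrete, here is a minimal Python sketch that encodes the same payload both ways using only the standard library. The helper names `encode_form` and `encode_json` and the sample payload are illustrative assumptions, not part of any particular crawler framework.

```python
import json
from urllib.parse import urlencode


def encode_form(payload):
    """Return (headers, body) for an application/x-www-form-urlencoded request."""
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return headers, urlencode(payload).encode("utf-8")


def encode_json(payload):
    """Return (headers, body) for an application/json request."""
    headers = {"Content-Type": "application/json"}
    return headers, json.dumps(payload).encode("utf-8")


# Hypothetical sample payload, encoded once per content type.
payload = {"keyword": "cellphone", "page": 1}
form_headers, form_body = encode_form(payload)
json_headers, json_body = encode_json(payload)

print(form_body)  # b'keyword=cellphone&page=1'
print(json_body)  # b'{"keyword": "cellphone", "page": 1}'
```

The point of returning the headers together with the body is that the declared type and the actual encoding can never drift apart.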
Hands-on: putting curl behind a proxy
Assuming you now want to access a certain API through ipipgo's proxy server, the correct curl command should look like this:
curl -x http://username:password@proxy.ipipgo.cc:8080 \
  -H "Content-Type: application/json" \
  -d '{"keyword": "cellphone"}' \
  https://api.example.com/search
There are a few key points to note here:
- Replace the username in the proxy address with the account you registered with ipipgo.
- For the password, a temporary dynamic key is recommended (it can be generated in the ipipgo backend).
- The -x option specifies the proxy server; its long form is --proxy (two dashes), so don't write it as -proxy.
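If the crawler itself is written in Python, roughly the same request can be built with the standard library. This is a sketch under assumptions: the proxy address and API URL are the placeholders from the curl command above, and the function names `build_search_request` and `build_proxy_opener` are made up for illustration.

```python
import json
import urllib.request

# Placeholder values copied from the curl example; substitute real credentials.
PROXY_URL = "http://username:password@proxy.ipipgo.cc:8080"
API_URL = "https://api.example.com/search"


def build_search_request(keyword):
    """Build a POST request whose body matches its declared Content-Type."""
    body = json.dumps({"keyword": keyword}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_proxy_opener(proxy_url):
    """Route both http and https traffic through the proxy, like curl -x."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


request = build_search_request("cellphone")
opener = build_proxy_opener(PROXY_URL)

# Sending is left commented out because the endpoint above is only a placeholder:
# with opener.open(request, timeout=60) as response:
#     print(response.status, response.read()[:200])
```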
The many faces of Content-Type
Different scenarios call for different types. Here is a quick reference:
Scenario | Content-Type value |
---|---|
Common Form Submission | application/x-www-form-urlencoded |
Uploading files | multipart/form-data |
Calling the REST API | application/json |
Sending XML data | application/xml |
When using ipipgo's rotating proxies, remember to send the correct Content-Type with every request. Their smart routing automatically picks the best node, but nothing can save you if the request headers are misconfigured.
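One way to keep the header correct on every request is to check it against the table before sending. The sketch below is an assumption-laden illustration: the short scenario keys and the function `check_content_type` are invented here, and the list only covers the four types from the table, so a real crawler would extend it with whatever other types it legitimately sends.

```python
import difflib

# Scenario-to-type mapping taken from the table above; the keys are illustrative.
CONTENT_TYPES = {
    "form": "application/x-www-form-urlencoded",
    "upload": "multipart/form-data",
    "rest": "application/json",
    "xml": "application/xml",
}


def check_content_type(value):
    """Return the normalized media type, or raise with a hint on a likely typo."""
    # Ignore parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = value.split(";", 1)[0].strip().lower()
    known = list(CONTENT_TYPES.values())
    if media_type in known:
        return media_type
    close = difflib.get_close_matches(media_type, known, n=1)
    hint = f" (did you mean {close[0]!r}?)" if close else ""
    raise ValueError(f"unrecognized Content-Type {value!r}{hint}")
```

Calling `check_content_type("applicetion/json")` raises a `ValueError` that suggests `application/json`, which catches a misspelled header before the server ever sees it.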
Common pitfalls: Q&A
Q: I set the Content-Type but still get a 415 error. Why?
A: Most of the time it is because the format of the data actually sent does not match the declared type. For example, the header says application/json, but the body is encoded as form data.
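A mismatch like this can be caught locally before the request goes out. The following is a best-effort sketch, assuming the body is text and only the two most common types need checking; `body_matches` is a hypothetical helper name.

```python
import json
from urllib.parse import parse_qsl


def body_matches(content_type, body):
    """Best-effort check that a text body parses as its declared media type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        try:
            json.loads(body)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            return False
        return True
    if media_type == "application/x-www-form-urlencoded":
        try:
            # strict_parsing makes malformed key=value pairs raise ValueError
            parse_qsl(body, keep_blank_values=True, strict_parsing=True)
        except ValueError:
            return False
        return True
    # Other types are not inspected in this sketch.
    return True
```

With this check, a header of application/json paired with the body `keyword=cellphone` is reported as a mismatch, which is exactly the situation that tends to produce a 415.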
Q: What should I do if my proxy IP often times out?
A: In this case, consider switching to ipipgo's enterprise package; their long-lived connection proxies support an automatic retry mechanism. Also remember to add timeout options to curl:
--connect-timeout 30   # give up if the connection is not established within 30 seconds
-m 60                  # abort the whole request after 60 seconds (long form: --max-time)
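On the crawler side, a timeout can be combined with a simple client-side retry. This is only a sketch: `fetch_with_retry` is a hypothetical helper, the linear backoff is an arbitrary choice, and note that the `timeout` argument of urllib is a single socket-level timeout rather than the separate connection and total limits that curl offers.

```python
import time
import urllib.error


def fetch_with_retry(send, attempts=3, delay=1.0):
    """Call send() up to `attempts` times, retrying on network-level failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError:
            # The server answered; retrying will not fix a 4xx such as 415.
            raise
        except OSError as exc:  # covers URLError and socket timeouts
            last_error = exc
            if attempt < attempts:
                time.sleep(delay * attempt)  # linear backoff between tries
    raise last_error


# Hypothetical usage, given an opener and a request built elsewhere:
# response = fetch_with_retry(lambda: opener.open(request, timeout=60))
```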
Q: What if I need to use more than one Content-Type at the same time?
A: That situation basically does not exist; a request can only declare one content type. If the data is mixed, send it as multipart/form-data, where the body is split into parts and each part can carry its own Content-Type.
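To show what such a multipart body looks like, here is a hand-rolled sketch that sends a JSON part and a file part in one request. The function `build_multipart` and the part names "meta" and "file" are assumptions chosen for illustration; a real target site defines its own field names.

```python
import json
import uuid


def build_multipart(fields, file_name, file_bytes, file_type):
    """Return (headers, body) for a multipart/form-data request.

    `fields` is sent as one JSON part; the file is sent as a second part.
    """
    boundary = uuid.uuid4().hex
    lines = [
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="meta"',
        b"Content-Type: application/json",
        b"",
        json.dumps(fields).encode("utf-8"),
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"'.encode(),
        f"Content-Type: {file_type}".encode(),
        b"",
        file_bytes,
        f"--{boundary}--".encode(),  # closing delimiter
        b"",
    ]
    body = b"\r\n".join(lines)
    # The request itself still declares exactly one Content-Type.
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return headers, body
```

The outer header stays multipart/form-data, while the per-part headers describe the JSON and the file separately.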
A few words from the heart
Working in tech is sometimes like stir-frying: get the heat or the seasoning slightly wrong and the flavor changes. Last week I helped a customer debug a collection system that simply could not get any data, and the cause turned out to be a Content-Type misspelled as applicetion/json (an "e" where the second "a" belongs). So while ipipgo's proxies can solve the IP problem, basic settings like these still need careful checking.
Finally, if you have enterprise-level needs, we recommend going straight to ipipgo's custom protocol proxy service. Their technical support can help debug request header parameters, which saves a lot of the heartache of figuring it out yourself. New users should remember to claim the 3 GB of trial traffic on registration; it is enough for testing.