personal blog of a mid-mo developer
If you use PHP and have to interact with another server, you’ve more than likely done it with cURL.
cURL is a command line tool capable of a wide array of server-to-server interactions. Their website lists the full range of abilities, but the ones I use most often have to do with scraping content or form posting.
If you are doing form posting and using the POST method, the code is pretty straightforward. Here is some sample code I whipped up to initiate a post to cullenbreedlove.com/submit.php with the fields name1 set to value1, name2 set to value2, and name3 set to value3. You’ll notice I’m passing an array for the postfields. You can actually pass a string like name1=value1&name2=value2&name3=value3 instead, but it isn’t nearly as clean. One difference to be aware of: with an array cURL sends the body as multipart/form-data, while a string is sent as application/x-www-form-urlencoded.
$url = "http://www.cullenbreedlove.com/submit.php";
$post_fields = array();
$post_fields['name1'] = "value1";
$post_fields['name2'] = "value2";
$post_fields['name3'] = "value3";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // turns OFF certificate verification so self-signed certificates are accepted (if they were too cheap to buy one); leave it on for anything sensitive
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'); // give a legit looking useragent
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
$data = curl_exec($ch);
curl_close($ch);
The data returned from the website after posting is stored in the $data variable, and you can do whatever you want with it. This is good for verifying your post went through without errors.
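If you’d rather go the string route mentioned above, here is a minimal sketch of how I’d build that string and check for transport errors. The helper names build_post_body() and post_form() are my own inventions for illustration, not part of cURL, and the error handling is just one reasonable way to do it.

```php
<?php
// Hypothetical helper (not from the code above): turn the field array
// into a URL-encoded string suitable for CURLOPT_POSTFIELDS.
function build_post_body(array $fields)
{
    // http_build_query() urlencodes each key and value and joins them with &.
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: POST $fields to $url and return the response body,
// or false when the transfer itself failed.
function post_form($url, array $fields)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, build_post_body($fields));
    $data = curl_exec($ch);
    if ($data === false) {
        // curl_error() describes transport problems (DNS, timeouts, TLS...).
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $data;
}

$post_fields = array('name1' => 'value1', 'name2' => 'value2', 'name3' => 'value3');
echo build_post_body($post_fields), "\n"; // name1=value1&name2=value2&name3=value3
```

A false return from curl_exec() only tells you the request never completed. A page that loads fine but says “invalid login” still comes back as a normal string, so you’ll want to look at $data itself too.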
If you need to make a GET request, simply take out the post lines and put your variables in the URL. Should you run into problems posting, more than likely it is because your data isn’t sanitized (URL-encoded), they are issuing a redirect and you aren’t following it, or they are returning the data compressed. Setting a couple more curl options will clue you in:
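For the GET case, here is a small sketch of putting your variables in the URL without hand-encoding them. build_get_url() is a made-up helper name, and the parameter values are just examples.

```php
<?php
// Hypothetical helper: append query parameters to a base URL.
function build_get_url($base_url, array $params)
{
    if (!$params) {
        return $base_url;
    }
    // Use & if the URL already carries a query string, ? otherwise.
    $glue = (strpos($base_url, '?') === false) ? '?' : '&';
    // http_build_query() takes care of urlencoding the values.
    return $base_url . $glue . http_build_query($params, '', '&');
}

echo build_get_url(
    'http://www.cullenbreedlove.com/submit.php',
    array('name1' => 'value 1', 'name2' => 'a&b')
), "\n";
// http://www.cullenbreedlove.com/submit.php?name1=value+1&name2=a%26b
```

Pass the result to CURLOPT_URL and leave out the CURLOPT_POST and CURLOPT_POSTFIELDS lines.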
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // set this to follow header redirects
curl_setopt($ch, CURLOPT_HEADER, true); // set this to see returned header information
curl_setopt($ch, CURLOPT_ENCODING, "gzip, deflate"); // set this to accept compressed responses and have curl decompress them for you
If the returned headers contain “Location: ” somewhere, it is the redirect. If it is returning an error 400 Bad Request, something is wrong in the URL (are you running your values through urlencode()?). If it is returning a lot of unreadable characters, it is more than likely the encoding.
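The three checks above can be sketched as a little function you feed the raw output of curl_exec() when CURLOPT_HEADER is on. diagnose_response() is a hypothetical name, the messages are mine, and the compressed-body check only recognizes gzip (by its two leading magic bytes), so treat it as a starting point rather than a complete diagnosis.

```php
<?php
// Hypothetical helper: inspect a raw response (headers + body, as returned
// when CURLOPT_HEADER is on) and suggest what to look at next.
function diagnose_response($raw)
{
    // The headers end at the first blank line.
    $parts = preg_split("/\r?\n\r?\n/", $raw, 2);
    $headers = $parts[0];
    $body = isset($parts[1]) ? $parts[1] : '';

    if (preg_match('/^Location:\s*(\S+)/mi', $headers, $m)) {
        return 'redirect to ' . $m[1] . ' (set CURLOPT_FOLLOWLOCATION)';
    }
    if (preg_match('#^HTTP/\S+\s+400\b#', $headers)) {
        return 'bad request (check that values are urlencode()d)';
    }
    // Gzip streams start with the bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        return 'compressed body (set CURLOPT_ENCODING)';
    }
    return 'ok';
}

echo diagnose_response("HTTP/1.1 302 Found\r\nLocation: /login.php\r\n\r\n"), "\n";
// redirect to /login.php (set CURLOPT_FOLLOWLOCATION)
```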
With the knowledge above you can now build a content scraper, post forms, or interact with another website’s API.
What are you going to do, though, if the content you want to scrape or the form you want to post requires a login first? Luckily there is an app for that, or rather a curlopt.
What you’ll need to make this happen is some detective work and a directory outside of your www directory (or htaccess’d to deny all) with write access. First create your directory (I’m assuming you are using some variation of unix) and chmod it for write access. Then come up with the name and location of the file for your cookies.
$cookie_file = tempnam("cookies/", "CURLCOOKIE");
The tempnam function takes two parameters: the first sets the directory where the file is created, the second sets the prefix for the filename. The second one isn’t entirely important, but I tend to set it to CURLCOOKIE. The function then generates a unique filename in that directory and creates an empty file the script can write to. Note that tempnam does not remove the file when the script finishes, so unlink() it yourself once you are done with the session.
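Here is a rough sketch of creating the cookie file and cleaning it up afterwards. I’m using the system temp directory as a stand-in for the protected cookies/ directory, and remove_cookie_file() is a helper name I made up for this example.

```php
<?php
// Assumption for this sketch: the system temp directory stands in for the
// protected cookies/ directory described above.
$cookie_dir = sys_get_temp_dir();

// tempnam() creates an empty, uniquely named file and returns its path.
$cookie_file = tempnam($cookie_dir, 'CURLCOOKIE');
if ($cookie_file === false) {
    die('Could not create a cookie file in ' . $cookie_dir);
}

// Hypothetical helper: delete the cookie file once we are done with it.
function remove_cookie_file($path)
{
    return is_file($path) && unlink($path);
}

// tempnam() does not clean up after itself, so remove the file at shutdown.
register_shutdown_function('remove_cookie_file', $cookie_file);
```

If you want to reuse a login across several script runs, skip the shutdown hook and keep the file around instead.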
Now that we have somewhere to put our cookies we need to tell curl where to put the cookies that we’ll be receiving from the web server we are connecting to.
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
That may look redundant, but both are needed: CURLOPT_COOKIEJAR is the file curl writes cookies to when the handle is closed, and CURLOPT_COOKIEFILE is the file it reads cookies from (setting it also switches on curl’s cookie handling).
After this is done you’ll need to do your curl_init() once and then not close the handle until you are finished. You can make multiple requests with curl_exec() on that handle, and it’ll basically be maintaining your state.
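Put together, the flow might look something like the sketch below. open_session() and session_request() are hypothetical helpers, and the URLs and field names in the usage comment are placeholders you’d replace with whatever your detective work turns up.

```php
<?php
// Hypothetical helper: create one handle that keeps cookies between requests.
function open_session($cookie_file)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);  // cookies are saved here on close
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // cookies are loaded from here
    return $ch;
}

// Hypothetical helper: POST when $post_fields is given, GET otherwise.
function session_request($ch, $url, $post_fields = null)
{
    curl_setopt($ch, CURLOPT_URL, $url);
    if ($post_fields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_fields, '', '&'));
    } else {
        curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to GET after a POST
    }
    return curl_exec($ch);
}

// Usage (placeholder URLs and field names):
// $ch = open_session($cookie_file);
// session_request($ch, 'http://example.com/login.php', array('user' => 'me', 'pass' => 'secret'));
// $page = session_request($ch, 'http://example.com/members.php');
// curl_close($ch);
```

The CURLOPT_HTTPGET line matters because options stick to the handle: without it, the request after your login would go out as another POST.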
If you are having difficulties logging into a website, you may have to hit the page a first time, parse the code looking for “__VIEWSTATE” and “__EVENTVALIDATION”, and then pass them in your post fields when submitting your form. This is almost a certainty with IIS servers running ASP.NET.
I’ve gone ahead and included a function that seems to work pretty well for getting the values of __VIEWSTATE and __EVENTVALIDATION.
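function item_value($copy_data, $search_item){
    // jump past the field name (e.g. __VIEWSTATE) in the page source
    $pos = strpos($copy_data, $search_item);
    if ($pos === false) return ''; // field isn't on this page
    $copy_data = substr($copy_data, $pos + strlen($search_item));
    // jump past the value=" that follows it
    $search_item = 'value="';
    $pos = strpos($copy_data, $search_item);
    if ($pos === false) return '';
    $copy_data = substr($copy_data, $pos + strlen($search_item));
    // keep everything up to the closing quote
    $copy_data = substr($copy_data, 0, strpos($copy_data, '"'));
    return $copy_data;
}
//usage
$post_fields['__VIEWSTATE'] = item_value($data, '__VIEWSTATE');
$post_fields['__EVENTVALIDATION'] = item_value($data, '__EVENTVALIDATION');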
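If the string slicing above ever trips over a page’s markup, a different approach (swapping in PHP’s DOM extension, which is not what the function above uses) is to collect every hidden input in one go. This is only a sketch: hidden_fields() is a name I made up, the sample HTML and its values are invented, and it assumes the page writes type="hidden" in lowercase.

```php
<?php
// Hypothetical helper: return every hidden <input> in $html as name => value.
function hidden_fields($html)
{
    $fields = array();
    // Real-world HTML is rarely valid, so keep libxml's warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    if ($html !== '' && $doc->loadHTML($html)) {
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
            $fields[$input->getAttribute('name')] = $input->getAttribute('value');
        }
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $fields;
}

// Invented sample markup, roughly the shape an ASP.NET page emits.
$sample = '<form><input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIz" />'
        . '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL" />'
        . '<input type="text" name="user" value="" /></form>';
print_r(hidden_fields($sample));
```

You can then merge the result into your own fields, for example with array_merge(hidden_fields($data), $post_fields), so any hidden value the form expects gets posted back.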
Hopefully this will help you to curl with the best of them.
CB
© 2012 CullenBreedlove.com