Subtitle: “It can’t be that hard”…FLW.
A friend linked me a DeviantArt gallery this evening. After several click-throughs, and resizing images to better view the comic, I thought how nice it would be to have all the images downloaded on my computer. I mean, how hard could it be?
RoboBrowser saved the day though. It made it really easy to auth with DeviantArt, then download the page while logged in, and then parse that. The entire code to start the session and log in is as follows:
    browser = RoboBrowser(parser='lxml')
    browser.open(args.url)
    form = browser.get_forms()  # returns a list; pick the login form from it
    form['username'].value = args.username
    form['password'].value = password
    browser.submit_form(form)
    browser.open(args.url)
I used argparse for the URL and username, and the getpass module for the password, which is where args.url, args.username and password come from. browser.get_forms() returns a list of every form on the page, so it took a bit of fiddling to find the right one, but just looking at the items in the list made it obvious which form to use. From then on it was down to finding the right URLs. DeviantArt conveniently lists the full-sized image URLs on the gallery view page under the data-super-full-img attribute of the thumbnail a tags, so this bit of BeautifulSoup code easily extracted all of the URLs I wanted:
    urls_to_download = []
    potentials = browser.find_all("a", class_="thumb ismature")
    for tag in potentials:
        try:
            urls_to_download.append(tag.attrs['data-super-full-img'])
        except KeyError:
            # thumbnail without a full-size URL; skip it
            pass
That first gets all the thumbnail tags, then keeps only those that have a data-super-full-img attribute. There may be a more efficient way to do it, and optimizing it could be a fun endeavor, but this worked. I used the urls_to_download list because the gallery spans multiple pages, so there’s another round of URL extraction on the next gallery page. The galleries I was downloading only ever had two pages, so just running this code again for the next page was enough, but one could easily iterate the offset if needed.
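For galleries with more than two pages, iterating the offset might look roughly like the sketch below. It only builds the page URLs. The query parameter name (offset) and the page size (24) are assumptions, not something confirmed here, and gallery_page_urls is a hypothetical helper; check the gallery's real "next page" link before relying on either value.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def gallery_page_urls(gallery_url, pages, page_size=24, param="offset"):
    """Return one URL per gallery page by stepping an offset query parameter.

    `page_size` and `param` are assumptions; adjust them to match the site.
    """
    parts = urlsplit(gallery_url)
    # Keep any query parameters already on the URL, minus an existing offset.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    urls = []
    for page in range(pages):
        page_query = query + [(param, str(page * page_size))]
        urls.append(urlunsplit(parts._replace(query=urlencode(page_query))))
    return urls
```

Each returned URL could then be opened with browser.open() and run through the same extraction loop, appending to the one urls_to_download list.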
Then all that’s left is downloading the actual images. This StackOverflow page describes the specifics of using Requests for image downloading nicely. I used the image names in the URLs prepended with an index for the filenames to make them sort nicely in a folder.
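As a rough sketch of that last step, something like the following would stream each image to disk with Requests and name the files by position. The function names, the zero-padding width, and the assumption that the image URLs can be fetched without the logged-in session are all mine, not taken from the original script.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def indexed_filename(index, url, width=3):
    """Build a name like '001_page-one.jpg' from a position and an image URL."""
    # Use only the path part of the URL, so query strings never leak into names.
    name = posixpath.basename(unquote(urlsplit(url).path)) or "image"
    # Zero-padding keeps 2 before 10 when a folder sorts names alphabetically.
    return f"{index:0{width}d}_{name}"


def download_all(urls, folder="."):
    """Stream each URL to disk; assumes no extra cookies or headers are needed."""
    import requests  # third-party; imported here so indexed_filename works without it

    os.makedirs(folder, exist_ok=True)
    for index, url in enumerate(urls, start=1):
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()
        path = os.path.join(folder, indexed_filename(index, url))
        with open(path, "wb") as handle:
            # Write in chunks so large images are never held fully in memory.
            for chunk in response.iter_content(chunk_size=8192):
                handle.write(chunk)
```

Calling download_all(urls_to_download, "comic") would then leave the files in gallery order inside a comic folder.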
This made a nice evening project. I didn’t really need the images downloaded locally, and could have easily right-clicked and saved them, but once the challenge presented itself to me, I really wanted to do it just to see if I could. It was fun. Plus it felt really good to actually see the URLs pop up once I finally got the extracting just right, since that felt like such a schlep with so many pitfalls.
Here’s the whole downloader script:
I went so far down the rabbit hole of analyzing the POST request headers, trying to add my own cookies, and examining the payload data being sent. I’m sure there’s some way to do it in Requests, but interfacing with websites brushes up against the edge of my knowledge. The Developer Tools in Chrome feel both super useful, and therefore appealing, and like a road littered with UX potholes. I love that I can see all the requests, but damn if it wouldn’t have been helpful to know that I need to check Preserve log to have the initial POST request saved. I love that I can view all the cookies, and delete individual items from them (super useful in clearing the age gate state), but I wish I could modify the cookies and see what happens. I love being able to modify the CSS of a page and see the results in real time, so I want something similar to that with cookies.
This is the first time I’ve used getpass. It seems like such a straightforward, helpful module. I’m not sure how I’ve made it this far without it.
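For anyone who hasn't used the two together, here is a minimal sketch of how argparse and getpass might be combined to produce args.url, args.username and password. Making both arguments positional, and the function names themselves, are assumptions of mine rather than how the original script is laid out.

```python
import argparse
import getpass


def parse_args(argv=None):
    """Parse the gallery URL and username (both positional in this sketch)."""
    parser = argparse.ArgumentParser(description="Download a gallery's images.")
    parser.add_argument("url", help="gallery URL to download from")
    parser.add_argument("username", help="account name used to log in")
    return parser.parse_args(argv)


def read_credentials(argv=None):
    """Return the parsed arguments plus a password typed at the prompt."""
    args = parse_args(argv)
    # getpass prompts on the terminal without echoing what is typed,
    # so the password never lands in shell history or on screen.
    password = getpass.getpass("Password: ")
    return args, password
```

Keeping the password out of the command-line arguments is the main point: anything passed as an argument tends to end up in shell history.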
RoboBrowser is built on Requests, so although the documentation isn’t super explicit about it, one can just use the Requests-style params argument to add parameters to the request. I looked around for a bit to see if I could do that, then, not finding anything spelling it out, just tried it and it worked. I like that about RoboBrowser: even though the documentation seems a bit lacking, it makes intuitive sense.