(The original is here: http://www.bl.uk/onlinegallery/ttp/alice/accessible/introduction.html and what follows is my attempt at mirroring it.)
I worked out a quick shell loop to grab all the pages of the original Alice in Wonderland. It’s a quick zsh/wget loop, though I’m sure there’s a quicker way of doing it with wget alone. It relies on some zsh-specific syntax; bash would need some reformatting, so if you willingly choose bash, consider it a fun exercise:
for i in {2..90} ; do ; let j=$i+1 ; let a=$i%2 ; if [ 0 = $a ] ; then ; if [ ! -e pages${i}and${j}full.jpg ] ; then ; wget http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/pages${i}and${j}full.jpg; fi ; fi ; done
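For anyone who would rather skip the exercise, here is a rough sketch of the same loop in plain POSIX-style shell, which should run under bash, zsh or sh alike. The function names spread_names and fetch_spreads are made up for this example; only the base URL and the file naming come from the loop above.

```shell
# Base URL taken from the post; the function names are illustrative.
base="http://www.bl.uk/onlinegallery/ttp/alice/accessible/images"

# Print one file name per two-page spread:
# pages2and3full.jpg, pages4and5full.jpg, ... pages90and91full.jpg
spread_names() {
  i=2
  while [ "$i" -le 90 ]; do
    printf 'pages%sand%sfull.jpg\n' "$i" "$((i + 1))"
    i=$((i + 2))
  done
}

# Download every spread that is not already in the current directory.
fetch_spreads() {
  spread_names | while read -r name; do
    [ -e "$name" ] || wget "$base/$name"
  done
}
```

Running spread_names on its own is a cheap dry run: it only prints the names, so you can eyeball them before calling fetch_spreads. Stepping by two also removes the need for the modulo test.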
Unfortunately, they seem to have changed their naming convention since last night, so it needed a few modifications:
for i in {1..91} ; do ; let a=$i%2 ; if [ 1 = $a ] ; then ; if [ ! -e page${i}full.jpg ] ; then ; wget http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/page${i}full.jpg; sleep 5 ; fi ; fi ; done
It’s shorter because the pages now only use odd numbers and no longer reference two pages at a time, so the extra variable isn’t needed and the file names are shorter.
The ‘sleep’ command near the end backs off from the server so you don’t hit it too hard and get yourself auto-banned. This is important whenever you pull files with a quick loop instead of a full-fledged crawler (which should be honoring robots.txt).
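The skip-if-present and back-off pattern can be pulled out into a small helper. This is only a sketch: polite_fetch and the FETCH override are names invented here, not anything wget provides. Setting FETCH=echo turns it into a dry run.

```shell
# polite_fetch DELAY URL...
# Downloads each URL whose file is not already present, pausing DELAY
# seconds after every real download. FETCH is a hypothetical override
# (defaults to wget) so the loop can be dry-run with FETCH=echo.
polite_fetch() {
  delay=$1
  shift
  for url in "$@"; do
    file=${url##*/}        # file name = everything after the last slash
    if [ -e "$file" ]; then
      continue             # already mirrored: no request, no sleep
    fi
    ${FETCH:-wget} "$url"
    sleep "$delay"
  done
}
```

Something like `polite_fetch 5 http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/page1full.jpg` would then fetch one page and wait five seconds. wget also has its own --wait option for pausing between retrievals when you hand it several URLs at once.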
Ack! Just checked again, and it looks like they have changed the layout again.
for i in {1..91} ; do ; if [ ! -e page${i}full.jpg ] ; then ; wget http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/page${i}full.jpg; sleep 5 ; fi ; done
Much simpler. Actually, curl can do this in one line:
curl 'http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/page[1-91]full.jpg' -o 'page#1.jpg' # USE SINGLE QUOTES!
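The quotes matter because curl, not the shell, is meant to expand the [1-91] range; unquoted, zsh would treat the brackets as a glob pattern of its own. As a rough preview of what that one-liner should fetch and where it should save each file, here is a small sketch (preview_curl_range is a made-up name, and this only prints, it downloads nothing):

```shell
# Print the "URL -> output file" pairs the curl range above should
# produce: [1-91] counts from 1 to 91, and #1 in the -o template is
# replaced by the current value of the first range.
preview_curl_range() {
  base="http://www.bl.uk/onlinegallery/ttp/alice/accessible/images"
  n=1
  while [ "$n" -le 91 ]; do
    printf '%s/page%sfull.jpg -> page%s.jpg\n' "$base" "$n" "$n"
    n=$((n + 1))
  done
}
```

Note that the -o template drops the "full" part, so the curl version saves page1.jpg where the wget loops save page1full.jpg.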
Now that the data is in a regular format, you can just have your own damn mirror of the files.
Also, there are two more pages you should get:
wget http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/dedicationfull.jpg
wget http://www.bl.uk/onlinegallery/ttp/alice/accessible/images/coverfull.jpg # this one isn’t even accessible from the live site!
Anyway, that’s enough of the over-verbose Linux commands.
Laters