It seems like the only way to get approval for a bot account is if it adds to or edits information already on Wikimedia. If you try to download any images without a bot account, using some of the API libraries out there, you get error messages instead of the images. It seems like they block anyone not coming in from a browser. Does anyone else have experience with this? Am I missing something here?
5 Answers
Having just done this myself I feel I should share:
http://www.mediawiki.org/wiki/API:Allimages
This API document states that you can query the images: with aiprop=url, you are given the URL of the image you are looking for.
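As a rough illustration (not taken from the linked page itself), a query using that module might look like the sketch below; the endpoint URL, the prefix, and the User-Agent string are assumptions for the example:

    # Minimal sketch: list image URLs via list=allimages with aiprop=url.
    # Endpoint, prefix and User-Agent are illustrative assumptions.
    import json
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "allimages",
        "aiprefix": "Sunflower",   # hypothetical prefix, adjust to taste
        "aiprop": "url",
        "ailimit": "10",
        "format": "json",
    }
    req = urllib.request.Request(
        API + "?" + urllib.parse.urlencode(params),
        headers={"User-Agent": "MyImageFetcher/0.1 (you@example.com)"},  # identify your tool
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for img in data["query"]["allimages"]:
        print(img["name"], img["url"])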
Thanks! This one: http://www.mediawiki.org/wiki/API:Categorymembers is also useful. – Hypercube Jun 26 '11 at 18:32
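For reference, a query against that Categorymembers module could look like the following sketch; the category name and User-Agent are placeholders, not anything from the comment:

    # Sketch: list the File: pages in a category via list=categorymembers.
    # The category name and User-Agent are placeholders.
    import json
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Sunflowers",  # hypothetical category
        "cmtype": "file",                  # only File: pages
        "cmlimit": "50",
        "format": "json",
    }
    req = urllib.request.Request(
        API + "?" + urllib.parse.urlencode(params),
        headers={"User-Agent": "MyImageFetcher/0.1 (you@example.com)"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for member in data["query"]["categorymembers"]:
        print(member["title"])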
Try explaining exactly what you want to do, what you've tried, and what error message you got. You're not being very clear...
What libraries have you tried? If you're not aggressive, there are no restrictions on downloading WM content. I've never heard of any restrictions. Some User-Agents are banned from editing to avoid stupid spamming, but really, I've never heard of downloading restrictions.
If you are trying to scrape a massive number of images by downloading them through Commons, you're doing it wrong (tm). If you are trying to get a few images, anywhere from 10 to 200, you should be able to write a decent tool in a few lines of code, provided that you respect the throttling requirement: when the API tells you to slow down and you don't, sysadmins are likely to kick you out.
If you need a complete image dump (we're talking about a few TBs), try asking on wikitech-l. We had torrents available when there were fewer images; now it's more complicated, but still doable.
About bot accounts: how deep have you looked into the system? You need a bot account for fast, unsupervised edits. Bot privileges also open up a few facilities such as increased query sizes. But remember: a bot account is simply an augmented user account. Have you tried running anything with a classical account?
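To make the "decent tool in a few lines" idea concrete, here is one possible sketch of a polite downloader run from an ordinary account: it sends a descriptive User-Agent and pauses between requests. The header value and the delay are assumptions, not official requirements:

    # Sketch of a small, polite downloader: ordinary account, descriptive
    # User-Agent, and a pause between requests to avoid hammering the servers.
    import time
    import urllib.request

    HEADERS = {"User-Agent": "PlantSiteFetcher/0.1 (https://example.org; you@example.com)"}

    def fetch(url, path, delay=1.0):
        """Download one file, then sleep before the next request."""
        req = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
            out.write(resp.read())
        time.sleep(delay)

    image_urls = []  # fill in with URLs returned by an API query (e.g. allimages above)
    for url in image_urls:
        fetch(url, url.rsplit("/", 1)[-1])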
Thanks, this is helpful. I have a site about plants and I'd like to include some photos from Wikimedia Commons. I ran a query against http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php to get a list of images in a particular category and then ran another query against http://toolserver.org/~magnus/commonsapi.php to get the metadata about each image. I then used urllib.urlretrieve in a Python script to get the actual image. Though I just tried it again and it works, and so does wget. Hmm, I may have had a bug in the formation of the URL. – tomvon Sep 24 '09 at 13:32
I'm not looking for a complete dump, just a few pics. I'd also like to create a WordPress plugin that lets you search Wikimedia Commons and add images to your site more easily (with proper attribution). Do you know where there's info about the throttling limits? I've done some pretty extensive reading at Commons but don't remember seeing anything about limits. I certainly want to respect the Terms of Use. – tomvon Sep 24 '09 at 13:34
See http://www.mediawiki.org/wiki/Manual:Maxlag_parameter for throttling. Note that it's a recommendation, so if you have never actually seen a "maxlag" error or blocked/autoblocked/ratelimited error codes, you probably have never been throttled or blocked. – Nicolas Dumazet Sep 25 '09 at 17:21
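One possible way to honor that recommendation in code is sketched below; the retry count and sleep time are arbitrary choices, and the error shape follows the manual page linked above:

    # Sketch: send maxlag with every API request and back off when the
    # server reports that it is lagging. Values here are arbitrary.
    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"
    HEADERS = {"User-Agent": "MyImageFetcher/0.1 (you@example.com)"}

    def api_query(params, retries=5):
        params = dict(params, format="json", maxlag="5")  # ask to be refused if lag > 5s
        for _ in range(retries):
            req = urllib.request.Request(
                API + "?" + urllib.parse.urlencode(params), headers=HEADERS
            )
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            if data.get("error", {}).get("code") == "maxlag":
                time.sleep(5)  # server is lagging: wait, then retry
                continue
            return data
        raise RuntimeError("gave up after repeated maxlag errors")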
If you need between ten and one million files, using Magnus Manske's tools to recurse categories is a good choice. http://tools.wmflabs.org/magnustools/can_i_haz_files.html produces a list of UNIX commands which you can then just run locally.
An alternative, whose interface is in German only but easy enough, is https://tools.wmflabs.org/wikilovesdownloads/
Note that there used to be an issue with using LWP: it's not ideological, it's practical; agents can create a massive load on already stretched servers. There are sensible strategies that agent users can follow to reduce the load - ask on www.mediawiki.org, or at en:Village pump - Technical.
Didn't really find the answer I'm looking for, but this page is interesting: http://www.makeuseof.com/tag/4-free-tools-for-taking-wikipedia-offline/
Especially #4, but it seems the page is down. Is the project dead?