TEDTalks Downloader 0.1 released

TEDTalks Downloader, which downloads all videos from TEDTalks, is now available:

tedtalks_downloader.sh

To use it, you will need a POSIX shell and wget or curl. If you are using Ubuntu or Mac OS X, you already meet the requirements. If you are using Windows, you will need to get a POSIX shell such as MSYS and wget for Windows.

To run TEDTalks Downloader, make sure the script is executable (chmod u+x tedtalks_downloader.sh should do it) and then run ./tedtalks_downloader.sh, preferably in an empty directory. The script will then download the TEDTalks feed, create a videos directory, and start downloading videos to it.

If TEDTalks Downloader is interrupted while running, you can run it again from the same place and it will automatically start where it left off. Also, if there are more TEDTalks available, running TEDTalks Downloader again will download the new talks without re-downloading all the other talks.

This tool was made in response to a request on my TEDTalks download script and MythTV metadata article. You can find more information about getting TEDTalks data in MythTV from there.

If you have any comments or questions about TEDTalks Downloader, please let me know by posting a comment to this article or contacting me directly.

20 Responses to “TEDTalks Downloader 0.1 released”


  • Thanks VERY much for this!

  • Thank you very much. As an avid TED Talks watcher this is great…any updates to this? For example, I would like it to check if the video is already there and if so then skip to the next video. This way I could run the script over and over again and it would simply update my list instead of downloading EVERY single video.

    I assume this would be a simple edit in the script but it is beyond my humble powers!

  • As it is written, the script will not re-download all the talks. It is supposed to re-download none of them (if you have all the talks). However, because the TEDTalks feed reports the file sizes of some of the talks incorrectly, it will re-download those ones (because the file size reported in the feed does not match the file size of the downloaded one). As of a few weeks ago, 48 of the 293 videos’ sizes were misreported. I have contacted the TED people about this but they haven’t gotten back to me about it. I recommend contacting them yourself to let them know that others would like the problem fixed, too. You can do this at http://www.ted.com/index.php/contact .

    In the meantime, you can change the script so that it won’t re-download the talks, but then it won’t be able to recover from a partially downloaded file (i.e., if the script was canceled while one of the talks was downloading, it will think that one is finished and won’t re-download it). If that’s fine with you (for example, if you always let the script run to completion), then make the following change to the script:

    1. Remove line 58.
    2. Change line 57 to be the following:

    if [ ! -f "${FILENAME}" ]; then

    This will change the script so it won’t re-download a talk if the file exists already, with the side-effect that it will not resume partially downloaded talks.
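
    To make the trade-off concrete, here is a minimal sketch of the two checks side by side. The function names and arguments are hypothetical and are not taken from tedtalks_downloader.sh; they only illustrate the idea of comparing against the feed's reported size versus testing for existence alone.

```shell
# Hypothetical helpers; names and arguments are illustrative only and
# do not come from tedtalks_downloader.sh.

# Size-based check: a download is needed when the file is missing or when
# its size differs from the size reported in the feed.
# Returns 0 (true) when a download is needed.
needs_download_by_size() {
    file=$1
    expected_size=$2
    [ -f "$file" ] || return 0
    # tr strips the padding some wc implementations add around the number
    actual_size=$(wc -c < "$file" | tr -d ' ')
    [ "$actual_size" -ne "$expected_size" ]
}

# Existence-only check: a download is needed only when the file is missing.
# A partially downloaded file counts as finished under this check.
needs_download_by_existence() {
    [ ! -f "$1" ]
}
```

    With the size-based check, a truncated file is fetched again, but so is any talk whose size the feed misreports. The existence-only check avoids the second problem at the cost of the first.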

    Let me know how this works for you. Thanks for your suggestions.

  • That worked great for me! Thanks a lot.

    In terms of suggestions/updates…

    I’d suggest adding a log file to note which new videos had been downloaded. It’s useful since if there are 300+ videos it’s impossible to figure out what you haven’t already seen.

    I’d also suggest a cron job, however, until the video size thing has been sorted out this may be dangerous as the computer may be turned off while running the job and so there may be numerous videos that are broken.

  • Also I’d like to point out that there are 404 ted talks currently online (http://www.ted.com/index.php/talks/atoz). I am assuming that the TED rss feed is to blame for not showing all the talks available. Can this be rectified in any way?

  • 404 seems a bit high, but it’s possible there are that many. In my old scripts I went through all the possible talk identifiers to download all the videos. As an example, the most recent talk identifier as of this writing is 487. You can view this talk by going to http://www.ted.com/talks/view/id/487 .

    You could try the following code to download as many pages as possible:


    for i in $(seq 1 487); do
        wget -o "${i}.log" -O - "http://www.ted.com/talks/view/id/${i}" > "${i}.html"
    done

    Once you’ve run that, type ls -1 *.html | wc -l to find out how many actual pages (and thus talks) there are.

    I’m hesitant to change my script to use that method instead of using the feed because using the feed is much cleaner. But if you can show me that the feed is missing a lot of videos, I might be persuaded to change it.

  • Abhiroop Basu wrote:

    I’d suggest adding a log file to note which new videos had been downloaded. It’s useful since if there are 300+ videos it’s impossible to figure out what you haven’t already seen.

    I would expect people to redirect the script’s output to a file (e.g., ./tedtalks_downloader.sh > output.log) and then search it for some meaningful identifier (e.g., grep '^Saving to' output.log). This would give you what you’re looking for.

    You could also add that to the script if you want. To do this, add the following after line 60:

    echo ${FILENAME} >> output.log
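
    If a file name should appear in the log only once across repeated runs, a variant along these lines might work. This is a sketch only: log_download_once is a hypothetical helper, not part of the script, and it assumes one file name per line in the log.

```shell
# Hypothetical helper: append a file name to a log only if that exact
# name is not already listed. The function name is illustrative.
log_download_once() {
    name=$1
    log=$2
    # Make sure the log exists so grep does not complain on the first run
    touch "$log"
    # -q: quiet, -x: match the whole line, -F: treat the name as a fixed string
    if ! grep -qxF -e "$name" "$log"; then
        printf '%s\n' "$name" >> "$log"
    fi
}
```

    It could be called as log_download_once "${FILENAME}" output.log in place of the echo line.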

    I’d also suggest a cron job, however, until the video size thing has been sorted out this may be dangerous as the computer may be turned off while running the job and so there may be numerous videos that are broken.

    The right way to solve this is for TED to fix the feed. Have you contacted them about this yet?

  • I ran the code that you suggested and did indeed come up with 487 possible HTML pages. However, some of the pages seem to be broken (when I open them they don’t show anything, and the log file lists a “404 error”), so this method is clearly not ideal. It does, however, seem to list the missing videos I highlighted.

    The output log works great; however, it keeps showing the same name, Sir Ken Robinson, even though it only downloaded this talk the first time (not really an issue, though).

    I will send off an e-mail to TED.

    Thanks again for this great script, hopefully the kinks can be worked out!

  • Firstly, the following talk was just uploaded onto TED (http://www.ted.com/index.php/talks/adam_savage_s_obsessions.html), however, it doesn’t seem to want to download. Not sure why. (OK just while writing this I tried the script again and it downloaded this talk, so I guess there is some delay).

    Secondly, following is a list of talks that are not in my current TED “videos” folder, although there are 304 videos in total. (NB: this is just a sample; according to the TED talks website there are a total of 405 videos, and the script only downloads 304.)

    http://www.ted.com/index.php/talks/nellie_mckay_sings_clonie_1.html
    http://www.ted.com/index.php/talks/rokia_traore_sings_m_bifo.html (doesn’t seem available to download)
    http://www.ted.com/index.php/talks/chris_anderson_shares_his_vision_for_ted.html (the other Chris Anderson’s talk is there though)
    http://www.ted.com/index.php/talks/vusi_mahlasela_s_encore_at_tedglobal2007.html
    http://www.ted.com/index.php/talks/vusi_mahlasela_sings_thula_mama.html

    So, as you can see, some ‘talks’ are clearly missing. Now I say ‘talks’ in inverted commas because none of the above seem to be traditional talks. There are musical pieces, and a talk by the curator of TED. However, I don’t really have the time to go through all 405 videos! So, I can imagine that there are more talks missing.

    Now it is conceivable that some of them can’t be downloaded, however, there are many that can be downloaded but aren’t as a part of the script. I have a feeling that the feed isn’t completely up-to-date (a problem on the part of TED).

  • The feed TED provides is probably not updated immediately with new talks; it is delayed a bit. But as you saw, the delay is not too long.

    I can’t really do anything about the missing talks unless I change my script to try every talk number instead of using the feed. This is a possible solution, but it is much more complicated and much more likely to stop working in the future since TED sometimes changes how their site looks and I’d have to update my script to reflect that.

    I went through all the 489 talk IDs, trying to download each one, and found that 83 were not valid IDs, meaning that there are 406 valid talk pages. As of today, the feed contains 302 talks. So there are definitely some talk pages missing from there.
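
    A rough way to reproduce a count like that is sketched below. It assumes the pages were saved as 1.html, 2.html, and so on in a single directory, and that a request for an invalid ID leaves an empty file behind; count_valid_pages is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: count the non-empty .html files in a directory.
# Assumes an invalid talk ID leaves an empty file, so only non-empty
# files are treated as valid talk pages.
count_valid_pages() {
    dir=$1
    count=0
    for page in "$dir"/*.html; do
        # -s is true only for files that exist and are not empty
        if [ -s "$page" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

    Running count_valid_pages . in the download directory prints the number of non-empty pages; subtracting that from the highest ID tried gives the number of invalid IDs.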

    Based on the samples you gave, it seems that most of the ones the feed doesn’t pick up are musical in nature or more administrative (Chris’ sharing of the vision for TED). Personally, I don’t find those as high-priority as the actual talks, so I don’t see a big need to switch away from the feed.

    The best solution is to ask the TED people to add all the talks you’re interested in to the feed. From http://blog.ted.com/ I see you can e-mail them at contact@ted.com . I recommend contacting them there with your suggestions. You could also suggest that they add a separate feed for non-talk content (like music). It would not be hard to make TEDTalks Downloader download from 2 feeds instead of one.

    Out of curiosity, which operating system are you using TEDTalks Downloader on?

  • I’ll contact TED about that. I doubt they’ll change it any time soon, but it isn’t that big a priority. I still get the talks that I really want to watch. I’m on Ubuntu 8.10. I’m looking into writing a GUI version of this. Have you used TED’s Miro player? It’s quite bloated but useful for exactly this purpose (although even that is limited to 304 talks).

  • Hi there — this is Emily from TED.com. A couple things to know:
    + we don’t selectively withhold talks from the video feed. Chris’ talk, for instance, should be there, as should the musical performances. So the talks that are missing are an interesting data set. What do they have in common that causes them to fail? We’re looking into this, but we’re a small nonprofit with some big projects bearing down on us … including a rollout in May that I can’t even begin to tell you how excited I am about.
    + as you noticed, the number identifiers in the URLs are not an unbroken sequence and many go unused — we’re up to ID# 502 as of Friday, which means that about 100 IDs are unused.
    And here’s a spreadsheet, updated once a week or so, with all the TEDTalks and their URLs, including ID:
    http://spreadsheets.google.com/pub?key=pjGlYH-8AK8ffDa6o2bYlXg
    Hope this is helpful. And thank you so much for your energy and dedication to sharing TEDTalks!

  • Hi Emily,

    I have sent you a more detailed e-mail to contact@ted.com. Hopefully you’ll have a minute to look it over.

    Best

  • Amazing Amazing script man, you’ve no idea how thankful I am..

    And thanks to Emily for the page with all the talks and the links, really helpful to check out what all talks you’ve missed…

    ossguy, thanks again :).. Great work!

  • Hey,
    The spreadsheet Emily provided has all the talks without duplication and all the links!! it is slower than the feed ( updated once a week or so ) but it is much more comprehensive!

    Is there any way to use the spreadsheet’s RSS Feed as the source in your script? The feed does not have the direct links to the .zip files though but there should be a way to find it out via the links to the pages…

  • Stupid question:
    Does this downloader still work with the SQL script? I am using MythTV and having the metadata available is very useful…

  • Sriram wrote:

    Is there any way to use the spreadsheet’s RSS Feed as the source in your script? The feed does not have the direct links to the .zip files though but there should be a way to find it out via the links to the pages…

    It’s theoretically possible to do that, but I don’t plan to implement it myself. In my original scripts, which I used to generate the content at http://ossguy.com/?p=26 , I tried that method. However, TED has changed the format of the pages since then so my scripts no longer work. This is likely to happen again. Because of this, and the further issues that I would have to do an extra download for each video and also unpack the zipped MP4 file, I won’t be using that method. If the TED people add a column that links directly to the MP4 file, then I’d consider using the spreadsheet. Otherwise, it’s just not feasible.

    That said, if someone else wants to implement it, I don’t mind hosting their changes on my web site. I’m just not interested in doing the work myself for the reasons above.

    I appreciate your feedback on my script and I’m glad you’ve found it useful.

  • Carlos wrote:

    Does this downloader still work with the SQL script? I am using MythTV and having the metadata available is very useful…

    The downloader should give you all the movies that are present in the MythTV SQL script at http://ossguy.com/?p=26 . However, the SQL script only has 186 of the 400 or so TED videos because I haven’t updated it since I created it back in February of last year and many videos have been added since then. Also, as I mentioned in my last comment, the TED pages have changed since then so my scripts don’t work anymore, meaning I would have to do extra work to update the SQL script with metadata for the new TED videos.

    The best way to update the SQL script would be to grab the metadata from the TEDTalks feed ( http://feeds.feedburner.com/tedtalks_video ) and convert that into the appropriate format for inputting into MythTV. Unfortunately, I don’t have a MythTV box set up anymore, so this is pretty low on my priority list. It shouldn’t be too hard, though. If you’d like to make the script yourself, go ahead. I can provide pointers if needed.
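
    As a starting point, here is a minimal sketch of the first step, pulling the video URLs out of the feed. It assumes the feed carries a conventional RSS enclosure element with a url attribute for each talk, which may not match the real feed exactly; extract_enclosure_urls is a hypothetical name, and converting the result into MythTV's format is left out.

```shell
# Hypothetical helper: read RSS XML on standard input and print one
# enclosure URL per line. Assumes each item has an
# <enclosure ... url="..."> element; the real feed layout may differ.
extract_enclosure_urls() {
    # grep -o prints only the matched part, ending at the url attribute;
    # -o is not in POSIX but is available in GNU and BSD grep
    grep -o '<enclosure[^>]* url="[^"]*"' |
        sed 's/.* url="\([^"]*\)"$/\1/'
}
```

    For example, curl -s http://feeds.feedburner.com/tedtalks_video | extract_enclosure_urls would list the URLs if the feed is laid out as assumed. A real XML parser would be more robust than grep and sed for anything beyond a quick look.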

  • A few days ago I released the first stable version of metaTED – a tool that makes it easy to download all of the TED talks. It does so by creating over 8 metalinks of TED talks varying in both the quality levels and possible talk groupings by directory.

    Download TED talks now, it’s easy and multiplatform.

    The project is hosted on bitbucket, where you can get the code and report bugs.

  • Actually I’m a little bit confused about how to use this downloader..
    Can you explain in detail how to use it?
    thanks..
