TEDTalks download script and MythTV metadata

I have been watching TEDTalks off and on since a friend of mine introduced them to me a couple months ago. They are videos of presentations done at TED (Technology, Entertainment, Design), an annual conference that “brings together the world’s most fascinating thinkers and doers”. I would highly recommend browsing through them if you have a minute; there is some really good food for thought (and action) in there. All of the TED videos are licensed under a Creative Commons BY-NC-ND (Attribution-Noncommercial-No Derivative Works) license, which allows them to be freely redistributed as long as they are not modified.

To make them more accessible to me, I downloaded all the TED videos and put them on a computer running MythTV. Read on for details on how I did it and links to scripts that will automate the process for you if you have a MythTV setup or if you just want to download all the TED videos.

Here are the files you need in order to download the TED videos and add the metadata to MythTV (the last one can be omitted if you’re just downloading the videos and not using MythTV):

ted_urls
ted_download.sh
ted.sql

To download the videos, put ted_urls and ted_download.sh in an empty directory. Then run the following in that directory (you may need to run chmod +x ted_download.sh first):

$ ./ted_download.sh

This will download all 186 TED videos that were available as of February 3, 2008, extract them (most of them are packaged in ZIP files), and delete the original ZIP files, leaving you with 186 MP4 files. The download is 10.07 GiB so it might take a while. If you wish to keep the ZIP files, then remove the rm *.zip line from ted_download.sh.
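
The steps just described (fetch each URL, unpack any ZIP archives, delete them) could look roughly like the sketch below. This is an illustration only, not the contents of ted_download.sh. The function name download_all is made up, and the sketch assumes that wget and unzip are installed and that ted_urls holds one URL per line.

```shell
# Download every URL listed in a file, then unpack and delete any ZIP archives.
# Hypothetical helper; the real ted_download.sh may work differently.
download_all() {
    url_file=$1

    while read -r url; do
        [ -n "$url" ] || continue       # skip blank lines
        wget "$url"                     # saves into the current directory
    done < "$url_file"

    for archive in *.zip; do
        [ -e "$archive" ] || continue   # the glob matched nothing
        # Delete an archive only if it unpacked cleanly.
        unzip -o "$archive" && rm "$archive"
    done
}

# Example: download_all ted_urls
```

Unlike a plain rm *.zip, this version keeps any archive that failed to extract, so a partial download is easier to spot.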

If you don’t have MythTV, then you’re done. You can go ahead and watch the MP4 files. If you do have MythTV, then you will probably want to add the metadata for the TED videos so you can see the title, description, length, and year when browsing them in MythTV.

First of all, put the MP4 files somewhere in MythTV’s video directory. Next, load MythTV’s Video Manager so that it adds the new videos to the database. You can close the Video Manager once it finishes scanning; it has already updated the database.

Then, from the directory where you downloaded ted.sql, type the following commands:

$ mysql -u root -p
mysql> use mythconverg;
mysql> source ted.sql;
mysql> quit;

The commands above may vary based on how your version of MythTV is configured, but they will most likely work as written. After executing the third line, you should receive a bunch of lines of the following form:

Query OK, 1 row affected (0.03 sec)
Rows matched: 1  Changed: 1  Warnings: 0

This means all is well. You should now be able to see the title, description, length, and year when you browse the TED videos in MythTV.

Notes

The information in this section is not required for downloading the TED files or setting up MythTV, but may be useful for those who are interested in expanding on or contributing to the scripts and metadata.

The above instructions were tested on Mythbuntu 7.10. If you use the above instructions to successfully set up the videos on a different distribution, please add a comment to this post indicating which distribution you used so I know where it works.

If you follow the instructions but notice that some videos do not include descriptions when you browse them in MythTV, it may be because the filenames for the videos hosted by TED have changed. Please report this by posting a comment.

To reduce the load on TED’s servers, consider downloading the videos using BitTorrent instead of running the ted_download.sh script. There appear to be several sites hosting the videos; I don’t have personal experience with any of them. You will probably find that there are videos (especially newer ones) that are not hosted via BitTorrent. With a little work, you can modify the ted_urls file to pick up the missing videos.

Since MythTV uses minutes to describe video lengths, I had to decide how to round the length. I decided to always round up, which means that a 10 minute 3 second video will show up as 11 minutes long. So you are guaranteed that the video will not be longer than the number of minutes in the description.
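
The round-up rule can be written as a small shell function. This is a minimal sketch, assuming the length is given in seconds (possibly with a fractional part) and that awk is available; the name ceil_minutes is made up for illustration.

```shell
# Round a length in seconds (possibly fractional) up to whole minutes.
ceil_minutes() {
    awk -v s="$1" 'BEGIN {
        m = int(s / 60)          # whole minutes, truncated
        if (m * 60 < s) m++      # any leftover seconds add one more minute
        print m
    }'
}

# Example: ceil_minutes 603 prints 11 (a 10 minute 3 second video)
```

A video that is exactly 600 seconds long stays at 10 minutes; only a remainder pushes it up to the next minute.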

If you are interested in other metadata not present in the SQL script, such as the video number or length in seconds, download the following file:

ted_raw_data

The file contains data in the following format:

[video_number]=[year]=[length_in_minutes]=[length_in_seconds]=[MP4_filename]=[title]=[description]

[video_number]: a number used by TED to identify the talk; replace N by [video_number] in “http://www.ted.com/index.php/talks/view/id/N” to get its description page
[year]: the year the presentation occurred
[length_in_minutes]: the length of the video as reported by mplayer rounded up to the nearest minute
[length_in_seconds]: the length of the video as reported by mplayer to the nearest hundredth of a second
[MP4_filename]: the filename of the video
[title]: the title of the video as shown on the video’s TED web page
[description]: the description of the video as shown on the video’s TED web page

Note that none of the fields in ted_raw_data contain the “=” symbol so the above format unambiguously delimits the fields. You can print specific fields of the output by running a command like cut -d= -f2 ted_raw_data (to print the second field).
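
When more than one field is needed at a time, a read loop that splits on “=” may be more convenient than cut. The sketch below assumes the seven-field format described above; the function name print_talks is made up, and the URL pattern is the one given for [video_number].

```shell
# Print "year<TAB>title<TAB>description page URL" for every line of a
# ted_raw_data-style file. Hypothetical helper for illustration.
print_talks() {
    while IFS='=' read -r number year minutes seconds filename title description; do
        printf '%s\t%s\t%s\n' "$year" "$title" \
            "http://www.ted.com/index.php/talks/view/id/$number"
    done < "$1"
}

# Example: print_talks ted_raw_data | sort
```

Piping the result through sort groups the talks by year, since the year is the first column.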

The [description] field sometimes contains extra information at the end of the field that isn’t related to the description such as “Download this talk in high resolution (480p) >>”. This is a result of doing a simple automated copy of the description on TED’s web pages without going through the descriptions to fix them. If you go to the effort of fixing them, please send me an updated raw data file. In practice, though, you will find that most descriptions are too long for MythTV’s description field so you won’t even notice the extra data.

Although [video_number] is in the range 1 to 212 in ted_raw_data, not all values of [video_number] in that range correspond to a video since there are only 186 videos. This is useful to know if you are writing a script that uses ted_raw_data to operate on the number of each video.
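
To see which numbers in the range are unused, something like the following sketch should work. It assumes that [video_number] is written without leading zeros; the function name missing_numbers is made up for illustration.

```shell
# List the numbers in 1..max that never appear as [video_number] in the file.
missing_numbers() {
    awk -F= -v max="$2" '
        { seen[$1] = 1 }                     # remember every number present
        END {
            for (n = 1; n <= max; n++)
                if (!(n in seen)) print n    # report the gaps
        }
    ' "$1"
}

# Example: missing_numbers ted_raw_data 212
```

If every video appears exactly once in ted_raw_data, the example should list 26 numbers (212 minus 186).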

The SQL script does not set a cover file for the videos so none of them will show up with an image when browsing them in MythTV. I could not see an easy way to extract an image from each video file (or from the TED web site for the video) so I left that out. I could have added a single cover file with the TED logo for each file, but I thought that didn’t add much useful information. If you create a set of cover files to use for the TED videos, please let me know by posting a comment or contacting me so I can add it to the SQL script and update this page.

I produced very little, if any, of the SQL script and ted_raw_data file by hand. Most of this work was done by shell scripts that I wrote. If there is enough interest, I can clean up these scripts and release them so people can update the data more easily as more videos are added to the TEDTalks site.
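
As a rough illustration of how SQL could be generated from ted_raw_data (this is not the script used to build ted.sql), the sketch below turns each line into an UPDATE statement. It assumes the metadata lives in a videometadata table with title, plot, year, length, and filename columns, and that filename stores a full path; check your own MythTV schema before relying on any of that. The function name raw_to_sql is made up.

```shell
# Emit one UPDATE statement per line of a ted_raw_data-style file.
# Table and column names are assumptions about the MythTV schema.
raw_to_sql() {
    awk -F= '
        # Double any single quotes so the value is safe inside an SQL string.
        # (\047 is the octal escape for a single quote.)
        function esc(s) { gsub("\047", "\047\047", s); return s }
        {
            printf "UPDATE videometadata SET title=\047%s\047, plot=\047%s\047, year=%d, length=%d WHERE filename LIKE \047%%/%s\047;\n", esc($6), esc($7), $2, $3, esc($5)
        }
    ' "$1"
}

# Example: raw_to_sql ted_raw_data > my_ted.sql
```

The sketch only escapes single quotes; backslashes in a title or description would need extra handling, and the LIKE pattern treats “_” in a filename as a single-character wildcard, which is harmless here but worth knowing.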

If you would like to see a new TED video added to the SQL script, please send me a properly-formatted line for ted_raw_data and a URL for ted_urls. I will add these as I have time.

If you know of some sort of MythTV metadata repository that would accept the TED video metadata described on this page, please let me know. It would be much better to have the metadata handled by a central repository that is designed for easy additions than handled by me manually updating files and re-posting them to this page.

20 Responses to “TEDTalks download script and MythTV metadata”

Michael Biggs
February 4, 2008 at 7:37 pm

Heh, I wonder how they feel about you posting code to scrape all their videos. That might be against their ToS, despite the CC licensing

Good job with this, btw.

ossguy
February 4, 2008 at 7:58 pm

I don’t think scraping is against their Terms of Use (http://www.ted.com/termsofuse). To be honest, I didn’t read those until you posted, but fortunately I don’t seem to have violated anything. In fact, I think I’m pretty well in line with #9: “Please help spread TED.”

I do encourage people to use BitTorrent to get the videos if they can, but it is buried in the Notes section so perhaps not many people will see it. If there are people from TED out there that are concerned by the amount of bandwidth used by scraping videos from their site, please let me know and I’ll do what I can to help alleviate that.

Ebrahim Bandegan
February 23, 2008 at 2:29 am

I need the scripts of the speeches and lectures in order to improve my listening. Would you please help me?
Are there any scripts available?
How can i get them?

ossguy
February 24, 2008 at 9:35 am

Some of the TED talks have transcripts, but not all of them. You can get a list of talks with transcripts by using the following Google search:

http://www.google.com/search?q=site:blog.ted.com transcript

Hopefully you find these helpful. Let me know if you have any questions about those transcripts.

ossguy
February 24, 2008 at 9:37 am

That link should be:

http://www.google.com/search?q=site:blog.ted.com transcript

ossguy
February 24, 2008 at 9:39 am

Let’s try that one more time:

http://www.google.com/search?q=site:blog.ted.com transcript

ossguy
February 24, 2008 at 9:40 am

Ok, I give up. Go to Google and type the following into the search box:

site:blog.ted.com transcript

This will give you a list of TED videos that include transcripts. Let me know if you have any questions.

Charlie
June 12, 2008 at 6:10 am

How would I do this if I have a mac computer, any body know?

ossguy
June 18, 2008 at 2:44 pm

Charlie wrote:

How would I do this if I have a mac computer, any body know?

If you want to get MythTV set up so that you can view the descriptions of the TEDTalks, use the MythTV on OS X wiki page (assuming you are using OS X).

If you just want to download the TEDTalks, you will need to modify the ted_download.sh script. Open up the script in your preferred text editor and change “wget” on line 10 to “curl -OL”. Then run the script as described above.

When I checked a few days ago, some of the links were not working so you won’t get all 186 videos, but you should still get a good portion of them by running the script. Let me know if this download method works for you.

Anna
December 28, 2008 at 6:40 pm

Could you please post the script that crawled ted.com to find all the download links? Thanks…

ossguy
December 31, 2008 at 7:15 am

Unfortunately the TED site has changed so the script I used no longer works. I will update the script and post it within the next week or two.

Ramesh
January 19, 2009 at 9:55 am

Hey Ossguy, Any Update on the url list?

Thanks…

ossguy
January 23, 2009 at 1:42 am

Not yet; haven’t had a chance to update my scripts. I see that there are multiple people requesting this, so I’ll likely get around to it sooner rather than later. Thanks for reminding me.

ossguy
January 29, 2009 at 12:27 am

I have created TEDTalks Downloader, which you can find in my TEDTalks Downloader 0.1 released article. That should provide everything you need to download all the TEDTalks. Please post any questions you have about it in a comment on that article.

Daniel
February 5, 2009 at 11:09 pm

Am I missing something?
http://feeds.feedburner.com/tedtalks_video

I use this with mythnettv
http://www.stillhq.com/mythtv/mythnettv/

Daniel
February 7, 2009 at 5:19 am

Even Better
http://feeds.feedburner.com/TedtalksHD

Anna Graham
February 13, 2009 at 6:04 pm

Cool. I substituted the URL for downloads in the script to the HD one that Daniel posted. Works great! Thanks!

Petar Marić
May 22, 2009 at 12:03 pm

A few days ago I released the first stable version of metaTED – a tool that makes it easy to download all of the TED talks. It does so by creating over 8 metalinks of TED talks varying in both the quality levels and possible talk groupings by directory.

Download TED talks now, it’s easy and multiplatform.

The project is hosted on bitbucket, where you can get the code and report bugs.

phazer
May 30, 2009 at 1:47 am

instead of breaking this up the way you do, a better way is to use a shellscript that pipes stuff through curl, grep, sed, and awk to determine the urls you need, get descriptive names for the files, and download. If you do it this way, I recommend going for the zipped files, to reduce network load.

EstibiaFliela
September 18, 2011 at 8:46 pm

Hi!
Very interesting name by the forum ossguy.com

It was specially registered at a forum to tell to you thanks for council. How I can thank you?