TEDTalks download script and MythTV metadata

I have been watching TEDTalks off and on since a friend introduced me to them a couple of months ago. They are videos of presentations given at TED (Technology, Entertainment, Design), an annual conference that “brings together the world’s most fascinating thinkers and doers”. I highly recommend browsing through them if you have a minute; there is some really good food for thought (and action) in there. All of the TED videos are licensed under a Creative Commons BY-NC-ND (Attribution-Noncommercial-No Derivative Works) license, which allows them to be freely redistributed for noncommercial purposes, with attribution, as long as they are not modified.

To make them more accessible to me, I downloaded all the TED videos and put them on a computer running MythTV. Read on for details on how I did it and links to scripts that will automate the process for you if you have a MythTV setup or if you just want to download all the TED videos.

Here are the files you need in order to download the TED videos and add the metadata to MythTV (the last one can be omitted if you’re just downloading the videos and not using MythTV):

ted_urls
ted_download.sh
ted.sql

To download the videos, put ted_urls and ted_download.sh in an empty directory. Then run the following in that directory (you may need to run chmod +x ted_download.sh first):

$ ./ted_download.sh

This will download all 186 TED videos that were available as of February 3, 2008, extract them (most of them are packaged in ZIP files), and delete the original ZIP files, leaving you with 186 MP4 files. The download is 10.07 GiB so it might take a while. If you wish to keep the ZIP files, then remove the rm *.zip line from ted_download.sh.
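For readers curious about what a script like this involves, here is a minimal sketch of the download, extract, and delete steps described above. It is not the contents of ted_download.sh. It assumes that ted_urls holds one URL per line and that wget and unzip are installed; the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch only; the real ted_download.sh may work differently.

# Download every URL listed in the given file (assumed: one URL per line).
download_all() {
    # -i reads URLs from a file; -c resumes partially downloaded files.
    wget -c -i "$1"
}

# Unpack each ZIP archive in the current directory and delete it
# only if extraction succeeded.
extract_and_clean() {
    for archive in *.zip; do
        [ -e "$archive" ] || continue   # the pattern matched nothing
        unzip -o "$archive" && rm -- "$archive"
    done
}

# Typical use, run from the directory that contains ted_urls:
#   download_all ted_urls && extract_and_clean
```

Deleting each archive only after a successful unzip means a corrupt download is left in place for inspection rather than silently lost.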

If you don’t have MythTV, then you’re done. You can go ahead and watch the MP4 files. If you do have MythTV, then you will probably want to add the metadata for the TED videos so you can see the title, description, length, and year when browsing them in MythTV.

First of all, put the MP4 files somewhere in MythTV’s video directory. Next, load MythTV’s Video Manager so that it adds the new videos to the database. You can close the Video Manager once it finishes scanning; it has already updated the database.

Then, from the directory where you downloaded ted.sql, type the following commands:

$ mysql -u root -p
mysql> use mythconverg;
mysql> source ted.sql;
mysql> quit;

The commands above may vary depending on how your MythTV installation is configured, but they will most likely work as written. After executing the source command, you should see many lines of the following form:

Query OK, 1 row affected (0.03 sec)
Rows matched: 1 Changed: 1 Warnings: 0

This means all is well. You should now be able to see the title, description, length, and year when you browse the TED videos in MythTV.

Notes

The information in this section is not required for downloading the TED files or setting up MythTV, but may be useful for those who are interested in expanding on or contributing to the scripts and metadata.

The above instructions were tested on Mythbuntu 7.10. If you use the above instructions to successfully set up the videos on a different distribution, please add a comment to this post indicating which distribution you used so I know where it works.

If you follow the instructions but notice that some videos do not include descriptions when you browse them in MythTV, it may be because the filenames for the videos hosted by TED have changed. Please report this by posting a comment.

To reduce the load on TED’s servers, consider downloading the videos using BitTorrent instead of running the ted_download.sh script. There appear to be several sites hosting the videos; I don’t have personal experience with any of them. You will probably find that there are videos (especially newer ones) that are not hosted via BitTorrent. With a little work, you can modify the ted_urls file to pick up the missing videos.

Since MythTV uses minutes to describe video lengths, I had to decide how to round the length. I decided to always round up, which means that a 10 minute 3 second video will show up as 11 minutes long. So you are guaranteed that the video will not be longer than the number of minutes in the description.
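The round-up rule can be sketched in a few lines of shell. This is an illustration of the arithmetic, not the script used to build the metadata; the function name ceil_minutes is made up here.

```shell
# Round a length in seconds (whole or fractional) up to whole minutes.
ceil_minutes() {
    awk -v s="$1" 'BEGIN {
        m = int(s / 60)          # whole minutes, rounded down
        if (m * 60 < s) m++      # any leftover seconds add one minute
        print m
    }'
}

ceil_minutes 603   # 10 minutes 3 seconds, prints 11
ceil_minutes 600   # exactly 10 minutes, prints 10
```

awk is used instead of shell arithmetic because the lengths in ted_raw_data have a fractional part, which $(( )) cannot handle.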

If you are interested in other metadata not present in the SQL script, such as the video number or length in seconds, download the following file:

ted_raw_data

The file contains data in the following format:

[video_number]=[year]=[length_in_minutes]=[length_in_seconds]=[MP4_filename]=[title]=[description]

  • [video_number]: a number used by TED to identify the talk; replace N by [video_number] in “http://www.ted.com/index.php/talks/view/id/N” to get its description page
  • [year]: the year the presentation occurred
  • [length_in_minutes]: the length of the video as reported by mplayer, rounded up to the nearest minute
  • [length_in_seconds]: the length of the video as reported by mplayer, to the nearest hundredth of a second
  • [MP4_filename]: the filename of the video
  • [title]: the title of the video as shown on the video’s TED web page
  • [description]: the description of the video as shown on the video’s TED web page

Note that none of the fields in ted_raw_data contain the “=” symbol so the above format unambiguously delimits the fields. You can print specific fields of the output by running a command like cut -d= -f2 ted_raw_data (to print the second field).
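When more than one field is needed at once, awk with “=” as the field separator is a convenient alternative to cut. The sketch below pairs each talk’s description page URL (built from the pattern given above) with its title; the function name talk_urls is made up for illustration.

```shell
# Print "URL<TAB>title" for every line of a file in the ted_raw_data format.
# Field 1 is [video_number] and field 6 is [title].
talk_urls() {
    awk -F= '{
        printf "http://www.ted.com/index.php/talks/view/id/%s\t%s\n", $1, $6
    }' "$1"
}

# Typical use:
#   talk_urls ted_raw_data
```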

The [description] field sometimes contains extra information at the end of the field that isn’t related to the description such as “Download this talk in high resolution (480p) >>”. This is a result of doing a simple automated copy of the description on TED’s web pages without going through the descriptions to fix them. If you go to the effort of fixing them, please send me an updated raw data file. In practice, though, you will find that most descriptions are too long for MythTV’s description field so you won’t even notice the extra data.

Although [video_number] is in the range 1 to 212 in ted_raw_data, not all values of [video_number] in that range correspond to a video since there are only 186 videos. This is useful to know if you are writing a script that uses ted_raw_data to operate on the number of each video.
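To see which numbers in the range have no video, something like the following sketch should work. The function name missing_numbers is made up, and it assumes that the first field of every line is a plain integer.

```shell
# List every number from 1 to MAX that does not appear as a [video_number].
#   $1: file in the ted_raw_data format
#   $2: highest video number to check
missing_numbers() {
    awk -F= -v max="$2" '
        { seen[$1 + 0] = 1 }     # remember each video number that occurs
        END {
            for (n = 1; n <= max; n++)
                if (!(n in seen)) print n
        }
    ' "$1"
}

# Typical use:
#   missing_numbers ted_raw_data 212
```

A script that loops over video numbers can use this list to skip the gaps instead of failing on them.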

The SQL script does not set a cover file for the videos, so none of them will show up with an image when browsing them in MythTV. I could not see an easy way to extract an image from each video file (or from the TED web site for the video), so I left that out. I could have added a single cover file with the TED logo for each file, but I thought that didn’t add much useful information. If you create a set of cover files to use for the TED videos, please let me know by posting a comment or contacting me so I can add it to the SQL script and update this page.

I produced very little, if any, of the SQL script and ted_raw_data file by hand. Most of this work was done by shell scripts that I wrote. If there is enough interest, I can clean up these scripts and release them so people can update the data more easily as more videos are added to the TEDTalks site.
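As a rough idea of how SQL could be generated from ted_raw_data, here is a sketch. It is not one of the scripts mentioned above and may not match ted.sql. The table name videometadata and the columns title, plot, year, length, and filename are assumptions about MythTV’s video schema, so check them against your own database first. The function name raw_to_sql is made up.

```shell
# Emit one UPDATE statement per line of a file in the ted_raw_data format.
# ASSUMPTION: MythTV stores video metadata in a table named videometadata
# with columns title, plot, year, length, and filename (the full path).
raw_to_sql() {
    awk -F= -v q="'" '
        # Make a value safe inside a single-quoted SQL string:
        # double every backslash, then double every single quote.
        function esc(s) { gsub(/\\/, "&&", s); gsub(q, q q, s); return s }
        {
            print "UPDATE videometadata SET title=" q esc($6) q \
                  ", plot=" q esc($7) q \
                  ", year=" ($2 + 0) \
                  ", length=" ($3 + 0) \
                  " WHERE SUBSTRING_INDEX(filename, " q "/" q ", -1)=" \
                  q esc($5) q ";"
        }
    ' "$1"
}

# Typical use:
#   raw_to_sql ted_raw_data > my_ted.sql
```

Matching on the last path component with SUBSTRING_INDEX means the statements should not depend on where the MP4 files sit inside the video directory.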

If you would like to see a new TED video added to the SQL script, please send me a properly-formatted line for ted_raw_data and a URL for ted_urls. I will add these as I have time.

If you know of some sort of MythTV metadata repository that would accept the TED video metadata described on this page, please let me know. It would be much better to have the metadata handled by a central repository that is designed for easy additions than handled by me manually updating files and re-posting them to this page.
