Using Pandoc to Convert html Files

A few days ago, I wrote about batch converting video files using ffmpeg. A few days later, I faced a similar problem of needing to convert a directory of .html files. “Need” is perhaps too strong of a word. I was experimenting with how to save pages from a PBworks wiki.

PBworks allows the user to download a .zip file of all of the pages from a wiki.[1] My downloaded backup contained 44 .html files, many of which were nested in subfolders. Instead of figuring out how to loop through the subfolders recursively, I used a find command, which searches subfolders by default. In my script below, the output of find is fed to the loop using command substitution. The converted files are saved to the same subdirectory as the originals, keeping .html in the filename and adding .md as the file extension.

I tried out two tools to do the text conversion. First, I tried html2text, which worked great. Out of curiosity, I also tried using Pandoc. I ended up preferring how Pandoc formatted the final Markdown text. However, one feature of html2text I liked was the --ignore-links option, since most of the links were relative to the PBworks domain and would be broken when used offline. In the end, I thought it might be useful to see where the original links pointed, so I skipped the --ignore-links option.

Here is the script I created:

 1  #!/bin/bash
 2  
 3  # Usage: html2md /path/to/directory
 4  
 5  # Set $IFS so that filenames with spaces don't break the loop
 6  SAVEIFS=$IFS
 7  IFS=$(echo -en "\n\b")
 8  
 9  # Loop through every .html file under the path(s) provided as arguments
10  for x in $(find "$@" -name '*.html')
11  do
12      pandoc -f html -t markdown -o "$x.md" "$x"
13  done
14  
15  # Restore original $IFS
16  IFS=$SAVEIFS

Lines 6 and 7 are necessary so that the script will work with filenames that contain spaces. The trick, as suggested in a Linux forum, is to set the internal field separator ($IFS) so that it no longer splits on spaces.[2]
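An alternative that avoids touching $IFS at all is to let find hand each file to the converter itself. The following is only a sketch of that idea, not the script I actually ran: html2md_tree is a made-up function name, and the pandoc options are simply copied from the script above.

```shell
#!/bin/bash

# html2md_tree is a hypothetical name used for this sketch.
# It converts every .html file found under the given directories,
# writing each result next to its source as <name>.html.md.
html2md_tree() {
    # find passes each path to the inner shell as a single argument ($1),
    # so filenames with spaces never go through word splitting
    # and $IFS can be left alone.
    find "$@" -type f -name '*.html' -exec sh -c '
        pandoc -f html -t markdown -o "$1.md" "$1"
    ' _ {} \;
}
```

Calling html2md_tree /path/to/directory should then behave like the script above, at the cost of starting one extra sh process per file.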


  1. For a paid account, PBworks allows the user to download all pages, past revisions and files, but I was using a free account.

  2. A discussion at Stack Overflow suggests a similar fix using IFS=$'\n', but I found I still needed \b at the end for my script to work.
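A likely reason the \b matters in line 7 of the script is that command substitution strips trailing newlines, so a newline on its own never reaches $IFS. The short experiment below is a sketch that illustrates this. The variable names are made up for the example, and printf stands in for echo -en.

```shell
#!/bin/bash

# Command substitution strips trailing newlines, so a lone "\n" is lost.
only_newline=$(printf '\n')

# With \b after it, the newline is no longer trailing and survives.
with_backspace=$(printf '\n\b')

# ANSI-C quoting (a bash feature) assigns the newline directly,
# with no command substitution involved.
direct=$'\n'

# ${#var} is the length of the value in characters.
echo "newline only:        ${#only_newline}"
echo "newline + backspace: ${#with_backspace}"
echo "direct assignment:   ${#direct}"
```

Run under bash, this prints lengths of 0, 2 and 1. That suggests the IFS=$'\n' form works without \b only when it is assigned directly rather than through $(echo ...).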

Using ffmpeg to Convert Video Files

Recently, my wife had students in her library create Photostory projects. This wasn’t her first choice of applications for a student project, but the Mac lab was in use for testing. Photostory outputs .wmv files, but my wife wanted to be able to merge the files using iMovie so that teachers could cue up one movie on their classroom presentation stations, which are Macs.

My wife thought she would need to use a service such as Zamzar to convert the files from .wmv into a format that iMovie could import, which seemed like a tedious, impractical task. I thought that perhaps ffmpeg, a command line tool, could help.

I found a Stack Exchange article that suggested this syntax to convert .wmv to .mp4:

ffmpeg -i input.wmv -c:v libx264 -crf 23 -c:a libfaac -q:a 100 output.mp4  

I then created the following script to automate the process:

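#!/bin/bash

# Usage: wmv-convert file1.wmv [file2.wmv ...]

# Quote "$@" and "$x" so that filenames with spaces survive
for x in "$@"
do
    ffmpeg -i "$x" -c:v libx264 -crf 23 -c:a libfaac -q:a 100 "${x}.mp4"
done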

Thus, running wmv-convert * in a directory of .wmv files loops through all of them, converting each one to .mp4 and keeping the original filename with .mp4 appended (so movie.wmv becomes movie.wmv.mp4).
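The script above hands every file it is given to ffmpeg and leaves .wmv in the output name. A slightly more careful variant might look like the sketch below. wmv_to_mp4 is a made-up function name, and the encoder options are copied unchanged from the command above, so whether libfaac is available depends on the ffmpeg build.

```shell
#!/bin/bash

# wmv_to_mp4 is a hypothetical name used for this sketch.
wmv_to_mp4() {
    for x in "$@"
    do
        # Skip anything that is not a .wmv file, so passing * is safe
        case "$x" in
            *.wmv) ;;
            *) continue ;;
        esac

        # ${x%.wmv} removes the trailing .wmv, so talk.wmv becomes talk.mp4
        ffmpeg -i "$x" -c:v libx264 -crf 23 -c:a libfaac -q:a 100 "${x%.wmv}.mp4"
    done
}
```

With this version, wmv_to_mp4 * ignores any non-.wmv files in the directory instead of passing them to ffmpeg.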

Each file took several minutes to convert, but I was able to run the loop during dinner. Then my wife was able to merge the files using iMovie later that evening.