Using Pandoc to Convert HTML Files

A few days ago, I wrote about batch converting video files using ffmpeg. A few days later, I faced a similar problem of needing to convert a directory of .html files. “Need” is perhaps too strong a word; I was experimenting with how to save pages from a PBworks wiki.

PBworks allows the user to download a .zip file of all of the pages from a wiki.[1] My downloaded backup contained 44 .html files, many of which were nested in subfolders. Instead of figuring out how to loop recursively through the subfolders, I used a find command, which searches subfolders by default. In my script below, the find command is inserted using command substitution. The converted files are saved to the original subdirectory, keeping .html in the filename but adding .md as the file extension.
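To make the find behaviour concrete, here is a small sketch, separate from my actual backup. It builds a throwaway directory tree (the folder and file names are made up for illustration), shows that a single find call reaches the nested files, and shows how the output name is formed by appending .md to the original path.

```shell
#!/bin/bash
# Build a throwaway tree; the directory and file names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/wiki/sub folder"
touch "$demo/wiki/FrontPage.html" \
      "$demo/wiki/sub folder/Notes.html" \
      "$demo/wiki/readme.txt"

# find descends into subfolders on its own; -name filters by pattern,
# so readme.txt is left out.
found=$(find "$demo" -name '*.html' | sort)
echo "$found"

# The output name keeps .html and appends .md, so the converted file
# would land next to the original.
first=$(echo "$found" | head -n 1)
out="$first.md"
echo "$out"

# Clean up the throwaway tree.
rm -rf "$demo"
```

No conversion happens here; the sketch only lists what the loop in the real script would iterate over and what each output file would be called.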

I tried out two tools to do the text conversion. First, I tried html2text, which worked great. Out of curiosity, I also tried Pandoc, and I ended up preferring how it formatted the final Markdown text. One feature of html2text I did like was the --ignore-links option, since most of the links were relative to the PBworks domain and would be broken when used offline. In the end, I decided it might be useful to see where the original links pointed, so I skipped the --ignore-links option.

Here is the script I created:

 1  #!/bin/bash
 2  
 3  # Usage: html2md /path/to/directory
 4  
 5  # Set $IFS so that filenames with spaces don't break the loop
 6  SAVEIFS=$IFS
 7  IFS=$(echo -en "\n\b")
 8  
 9  # Loop through path provided as argument
10  for x in $(find "$@" -name '*.html')
11  do
12      pandoc -f html -t markdown -o "$x.md" "$x"
13  done
14  
15  # Restore original $IFS
16  IFS=$SAVEIFS

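Line 7 is necessary so that the script will work with filenames that contain spaces. The trick, as suggested in a Linux forum, is to set the internal field separator (IFS) so that the loop splits the output of find at line breaks rather than at spaces.[2] Lines 6 and 16 save and restore the original value.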
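The effect of changing IFS can be seen in isolation with a short sketch. The two filenames and the count_words helper are made up for illustration, and printf stands in for echo -en purely for portability; that substitution is mine, not part of the original script.

```shell
#!/bin/bash
# Two hypothetical paths, one per line; the first contains a space.
list=$(printf 'my notes.html\nindex.html')

# Helper (hypothetical): report how many words an unquoted expansion yields.
count_words() { echo $#; }

# Default IFS (space, tab, newline): the space splits the first name.
default_count=$(count_words $list)

SAVEIFS=$IFS
IFS=$(printf '\n\b')        # newline plus backspace, as in the script
newline_count=$(count_words $list)

# Command substitution strips trailing newlines, so a bare newline
# would come back as an empty string.
bare=$(printf '\n')
IFS=$SAVEIFS

echo "default IFS: $default_count words"
echo "newline IFS: $newline_count words"
echo "length of bare newline substitution: ${#bare}"
```

With the default separator the list splits into three words; with the newline-based separator it splits into two, one per filename. The last line prints a length of 0, which is one plausible reason the trailing \b matters when IFS is set through command substitution.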


  1. For a paid account, PBworks allows the user to download all pages, past revisions and files, but I was using a free account.

  2. A discussion at Stack Overflow suggests a similar fix using IFS=$'\n', but I found I still needed \b at the end for my script to work.