html2markdown

Name

html2markdown - Convert HTML back into Markdown

Synopsis

html2markdown (reads from stdin)
html2markdown input-file
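
The two invocation forms above amount to "read the named file if one is given, otherwise read standard input". A minimal sketch of that choice follows; it is written for current Python 3 and is not taken from the html2markdown source. The function name read_html and the UTF-8 encoding are assumptions made for illustration.

```python
import sys


def read_html(argv, stdin=sys.stdin):
    """Return the HTML to convert.

    argv follows the sys.argv convention: argv[0] is the program name and
    argv[1], if present, is the input file. With no file argument the text
    is read from stdin. (Hypothetical helper, not html2markdown's own code.)
    """
    if len(argv) > 1:
        # Assumption: input files are UTF-8 encoded.
        with open(argv[1], encoding="utf-8") as handle:
            return handle.read()
    return stdin.read()
```

A script built this way would call read_html(sys.argv) once at startup and hand the result to the converter.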

Description

Markdown is a markup language for text that stays easy to read as plain text, and Markdown is also the name of the program that converts that markup into HTML. Conversely, html2markdown converts HTML generated by Markdown (the program) back into Markdown markup.

html2markdown is not meant as a means to convert any old HTML into Markdown. I just sent a complete HTML document into html2markdown and it promptly died (which I might fix one day). html2markdown is really only equipped to understand the subset of HTML that Markdown outputs. However, I've aimed to make it very good at properly converting that subset of HTML back to Markdown (better than any tool available at the time I wrote html2markdown).

I use this tool to write my web log entries. I write the entry in Markdown using Emacs. I've then got a bit of Emacs Lisp that lets me convert the buffer to HTML with a couple of keystrokes, and then Emacs posts it to my web log via XML-RPC. Inevitably I find something I want to change or correct in the entry, so I just hit another couple of keystrokes to convert it back into Markdown, make my changes, and start all over again. In this way I don't have to worry about what web log software I'm using and what markup it supports; my entries are always stored in HTML, but editable as Markdown, independent of the software that powers my web log.

Example

Say you've got sample.txt which contains:

# An example

This is a paragraph.  This is only a paragraph.  If this were a real
book, it would be way too short.  No one would ever give me an advance
on something this useless.

## Some lists

* Bulleted list

    * Indented bulleted list

      > Random block quote.  Some random text, I guess I'll just keep
      > typing, la la la la.

(I have to put some text here because Markdown.pl will try and make
the numbered list a continuation of the bulleted list above.  I've
found a few things that I suspect are Markdown.pl bugs just writing
this example.)

1. Numbered lists
2. [Links](http://www.wikipedia.org/) can be where you least
   expect them, as can word wrapping.
3. Unf unf unf.

## More text

And of course we support things like `monospace`, _italics_, and
**bold**.

And if we were to do cat sample.txt | Markdown.pl | html2markdown.py, we'd get something like:

# An example #

This is a paragraph.  This is only a paragraph.  If this were a real
book, it would be way too short.  No one would ever give me an advance
on something this useless.

## Some lists ##

* Bulleted list

    * Indented bulleted list

      > Random block quote.  Some random text, I guess I'll just keep
      > typing, la la la la.

(I have to put some text here because Markdown.pl will try and make
the numbered list a continuation of the bulleted list above.  I've
found a few things that I suspect are Markdown.pl bugs just writing
this example.)

1. Numbered lists
2. [Links](http://www.wikipedia.org/) can be where you least
   expect them, as can word wrapping.
3. Unf unf unf.

## More text ##

And of course we support things like `monospace`, _italics_, and
**bold**.

The astute reader might notice that the output is nearly identical to the input. While it doesn't make for a very exciting example, it does show how the software works: what you put into Markdown.pl is (mostly) what you get out of html2markdown.

If you're trying to figure out how html2markdown knew, say, which of the two heading styles I used as input, or how it knew that I used an inline link rather than the footnote-style (reference) link, then by now you've probably guessed: it doesn't know. html2markdown always outputs the above styles, because they're the styles I use when I'm writing Markdown. The word wrapping is also my style, though that is configurable with a variable in html2markdown.py. In the future I may allow the user to configure which variety of each markup is emitted.
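
To make the fixed output styles concrete, here is a small sketch of what "always emit one style" can look like: headings with closing hashes, inline links, and paragraphs wrapped to a configurable width. It is an illustration in current Python 3, not the code in html2markdown.py. The names WRAP_WIDTH, heading, inline_link and paragraph are hypothetical, and the width of 70 is an assumed default.

```python
import textwrap

# Hypothetical setting; the real variable in html2markdown.py may have a
# different name and default value.
WRAP_WIDTH = 70


def heading(text, level):
    """Emit a heading with closing hashes, e.g. '## Some lists ##'."""
    hashes = "#" * level
    return f"{hashes} {text} {hashes}"


def inline_link(label, url):
    """Emit an inline link, never the reference style."""
    return f"[{label}]({url})"


def paragraph(text, width=WRAP_WIDTH):
    """Re-wrap a paragraph so that no line exceeds the given width."""
    return textwrap.fill(text, width=width)
```

Because the HTML carries no record of which Markdown variant produced it, picking one style per construct like this is the simplest way to get stable output.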

Requirements

html2markdown is written in Python, and that is its only requirement. You probably need at least Python 2.3 to use it. I believe most of my development was done under Python 2.4.

License

html2markdown is licensed under the GPL.

Download

http://www.codefu.org/html2markdown/

Future plans

html2markdown is good enough for me right now, but I may be interested in making changes (especially bug fixes) if someone other than me actually wants to use it.

One particular area of improvement is error reporting. html2markdown is prone to spitting out a relatively inscrutable exception if you feed it some bad HTML, for example. Having it provide something like a line number where the error occurred would be helpful.
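
As a sketch of what line-numbered error reporting could look like, the following uses the standard library's html.parser in current Python 3 and records the position of tags that do not balance. It is not part of html2markdown, which targets much older Python versions; the names VOID_TAGS, TagBalanceChecker and check_html are hypothetical, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (assumed, incomplete list).
VOID_TAGS = {"br", "hr", "img"}


class TagBalanceChecker(HTMLParser):
    """Collect messages about mismatched tags, with their positions."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line, column)
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            # getpos() returns (line, column); line is 1-based, column 0-based.
            self.open_tags.append((tag, *self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # covers both <br> and <br />
        line, col = self.getpos()
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"line {line}, column {col}: unexpected </{tag}>")


def check_html(html):
    """Return a list of human-readable problems found in the HTML text."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for tag, line, col in checker.open_tags:
        checker.errors.append(f"line {line}, column {col}: <{tag}> is never closed")
    return checker.errors
```

Running a check like this before conversion would let the tool print a position instead of a bare traceback when the input is malformed.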