It has been my experience that most projects i
start are a direct result of need or a dissatisfaction with existing
products. The problem i have is that i often stop needing the product
before i finish creating it. A programmer like me is constantly building
tools to help him get his work done, but consequently hardly ever achieves
anything he originally set out to do.
A simple plan
When i first started working on this web
site, i found the management of multiple HTML pages surprisingly time-consuming. I had
no sophisticated tools for the job, so i decided to start the even
more time-consuming task of creating my own.
I always liked the simplicity of Microsoft's ActiveX DHTML control
(as featured in Outlook Express), so i decided to start working on a
basic tool which would give me simple WYSIWYG as well as direct source
editing. All i needed to do was wrap the DHTML object and voila! I had a
very basic stand-alone HTML editor. But then, i really needed to have
source editing (raw HTML) as well, so i changed from a very
simple single window interface to a more complex tabbed
interface. The source editing component (with syntax highlighting) was
provided by the editing control i developed for my text editor. Also i
added a browser preview tab, since the tabbed interface made it trivial to
implement additional views.
So now i had something very close to what you get when you edit an
email in Outlook Express. Except mine was not for email, it was for
working with plain HTML files. This is pretty much what i had set out to
achieve. I added some extra buttons for table support (which you don't get
in Outlook Express).
But then, it was still only good for editing a
single page, and i was trying to manage a web site! It was obviously time to make it
a little more advanced, so i added some file & site management. Added
a TreeView control to it for the purpose. At first the tree view only showed files
that could be reached from the currently open page. Seemed like a good idea
at the time, but turned out to be more confusing than anything. So i changed
it to a more obvious directory structure, starting at the site's
root.
Of course a tree is not that useful by itself, so i also added
template based pages (you're reading one right now) and intelligent link
tracking and updating, so that renaming one file in your site will not
result in broken links, and all pages use relative links, to
make the web site as portable as possible.
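The link tracking described above could be sketched roughly like this. This is not the editor's actual code: the function names, the regular expression and the example paths are all made up for illustration, and it only handles plain relative href/src values (no fragments, query strings or absolute URLs).

```python
import posixpath
import re

# Hypothetical pattern: plain relative URLs only (no scheme, fragment or query).
HREF = re.compile(r'(href|src)="([^"#?:]+)"')

def relative_link(from_page, to_file):
    """Relative URL from one site file to another (both paths relative to the site root)."""
    return posixpath.relpath(to_file, posixpath.dirname(from_page) or ".")

def retarget_links(page_path, html, old_path, new_path):
    """Rewrite links in the page at page_path that pointed at old_path."""
    base = posixpath.dirname(page_path)

    def fix(match):
        attr, url = match.group(1), match.group(2)
        # Resolve the link against the page's own directory.
        target = posixpath.normpath(posixpath.join(base, url))
        if target == old_path:
            url = relative_link(page_path, new_path)
        return f'{attr}="{url}"'

    return HREF.sub(fix, html)
```

Running retarget_links over every page in the site after a rename keeps all links relative, which is what makes the site portable.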
But it wasn't enough... i was still tied to the MS DHTML control. I
wanted more. And so, i began work on my custom HTML editor...
Spinning out of control
ActiveX controls... puh-leeze! While there is
certainly some sense in packaging code into these executable chunks, i
can't see the point of objects which take twice as much effort to
interface with as they would if implemented statically!
See, i got
frustrated trying to use the table functionality of MS's ActiveX control. Basically
the control is lacking in some really useful
things, like the ability for a parent application to control selection and cursor positioning,
or to insert HTML directly. For example i can't find a way to
create a simple <BR> without copying it to the
clipboard and using the inbuilt paste functionality. Doing it this way
clobbers the clipboard.
So i started looking at other implementations and wondering how much
work it would be to write an HTML editor from scratch, more out of
curiosity than necessity. First step would be a viewer
(renderer), with editing functionality to be added later. But then,
only an idiot with too much time on their hands would bother
reinventing that wheel.
But then, this idiot likes mucking around with fonts and
antialiasing, so i says to myself, i says "why not?", i says. At least i
could try some simple structured document parsing to see if i could use it
in the Book Reader, which has been sadly languishing. (Book Reader is
really a nice app, if i could just get around to adding an interface to
it!)
Tried out some basic
XML structured document rendering, supporting little more than
paragraphs and indents. To the right is an early
detail. Images are represented by yellow circles with blue edges. While
it looks nothing like the same document in a browser, it looks cute enough
to inspire me to keep working on it. Added support for basic font control etcetera, and liked the
rapidity of the improvements. It is worth noting that this is the most exciting part
of reinventing the wheel, because you're reenacting some great idea that
someone else had a long time ago, and even though you're not the first
one to do it, you can still feel the excitement of creating something new,
which works. I start to recall the earliest web pages i ever saw, and the
speed with which new and mostly useful formatting options (and a wider
understanding of existing ones) appeared. Tables were a big step i think,
though not in HTML 2.0 itself: they were specified separately in RFC 1942 and became part of HTML 3.2.
Did i say tables were a big step? I should also mention that they are
a nightmare to implement! It's all very nice to have a structured document
when you can see a document as a tree... but trying to visualize a table
as a tree? At its simplest, a TABLE contains one or more rows
(<TR>), and each row contains one or more cells (<TD>). The
problem here is that the width of cells in row B generally relates to the
width of cells in row A. So when a viewer/editor renders the table, it has
to take this table containing rows containing cells and turn it back into
a grid. Basically there's a lot of room for error in this procedure.
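A minimal sketch of that tree-to-grid step, not the renderer's real code. It assumes a cell's minimum width is simply its longest word at a fixed, made-up character width, where a real renderer would measure fonts, images and padding. Each column ends up as wide as the largest minimum width found in it, and a declared width only counts if it is not smaller than that.

```python
CHAR_WIDTH = 8  # assumed pixels per character, standing in for real font metrics

def min_cell_width(text):
    """Narrowest a cell can be: the width of its longest unbreakable word."""
    words = text.split()
    return max((len(word) for word in words), default=0) * CHAR_WIDTH

def column_widths(rows, declared=None):
    """rows is the table as a tree: a list of rows, each a list of cell texts.

    declared optionally maps a column index to a requested width in pixels.
    """
    n_cols = max((len(row) for row in rows), default=0)
    widths = [0] * n_cols
    for row in rows:  # every row has to be visited before any width is final
        for col, cell in enumerate(row):
            widths[col] = max(widths[col], min_cell_width(cell))
    for col, wanted in (declared or {}).items():
        # A declared width smaller than the content's minimum is ignored.
        widths[col] = max(widths[col], wanted)
    return widths
```

Only once column_widths has seen the whole table can any single row be laid out, which is where the cost comes from.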
Actually, the "dynamic sizing" of HTML tables is
what makes them so powerful in document formatting. They can cope with
vastly different font sizes and other display settings from browser to
browser and user to user.
And did i mention that tables
were a nightmare to implement?
Let's say you have a table with 1000 rows, each with 5 columns. In order
to work out how wide column 1 is, you would think there would just be some
table header info which says "column 1 is 50 pixels wide" or something
like that, wouldn't you? Well what you actually have to do is process
the entire table, calculating the minimum width of the first cell in each of
the 1000 rows. Even if the header does give an absolute width, that width
will be ignored if it is smaller than the maximum of the minimum
widths. The minimum width of a cell is calculated from its
contents (text, images, other tables etc) and the formatting options
used (font-size, padding etc). The contents of each cell can be seen as a
sub-document in itself, so processing 1000 of them just to work out the width of
a single column is not a trivial procedure!
Once i had tables working, i got some basic pages displaying fine, but others looked
a lot like crap. This was generally for one of two reasons: 1) my renderer
was still very buggy, and lacked support for all but the most basic formatting
options; 2) the average HTML page is technically godawful when approached
from an XML frame of mind. I had foolishly believed the idea
that HTML was a proper implementation [or application] of XML. Hah! An XML
document would not allow syntax like: <bold>Some text <italic>
Some text </bold> Some text </italic>. This is a
case of overlapping elements. Also XML defines the syntax for an empty element
to be <br/>, whereas this is virtually never used within HTML.
It seems that such syntax (and much worse) is very common within HTML
documents, and from a programmer's point of view, this lack of conformance is
as annoying as a calculator without a minus key.
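To see how strict XML is about this, here is a small check using the XML parser from Python's standard library. The is_well_formed helper and the wrapping root element are just illustrative choices, not part of the viewer.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """True if the fragment parses as XML once wrapped in a single root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

overlapping = "<bold>Some text <italic>Some text </bold> Some text </italic>"
nested = "<bold>Some text <italic>Some text </italic></bold>"

print(is_well_formed(overlapping))             # overlapping elements are rejected
print(is_well_formed(nested))                  # properly nested elements are fine
print(is_well_formed("line one<br>line two"))  # an unclosed <br> is rejected
print(is_well_formed("line one<br/>line two")) # the XML empty-element form is fine
```

An XML parser gives up at the first such error, while an HTML viewer is expected to carry on and display something sensible.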
There are many people who have a problem with
the general cruddiness of HTML regarding XML conformance. This concern has
led to a new proposal called XHTML, which is essentially a
modified HTML fully conforming to the XML standard. By definition this means that
any XHTML document produced will automatically be a valid
XML document. Yay!
After i had got over my annoyance at the vast difference between an ideal HTML and the one that people actually use,
i set to work mutilating my nice simple XML parser to support the horrible
inconsistencies and legacy kludges. I am desperately trying to avoid writing
a dedicated HTML parser, but the fact that much of the dekludging
is specific to the element types means i may be resisting the inevitable
(FONT, EM, STRONG tags affect only font appearance, while TABLE & DIV tags
are structural, and deeply affect the layout and navigation of
the document. BR, HR and IMG tags never contain sub elements. A
pure XML parser makes no such distinction between different tags).
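As a rough illustration of element-specific dekludging (not my parser, just a sketch built on Python's standard html.parser module), a tree builder can treat BR, HR and IMG as elements that never stay open, and recover from overlapping tags by closing everything back to the matching open tag. The TreeBuilder, parse and outline names and the recovery rule are assumptions made for this example.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img"}  # elements that never contain sub elements

class TreeBuilder(HTMLParser):
    """Builds a simple tree of dicts from forgiving HTML input."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "children": []}
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:  # void elements are closed as soon as they open
            self.stack.append(node)

    def handle_endtag(self, tag):
        open_tags = [node["tag"] for node in self.stack]
        if tag in VOID or tag not in open_tags[1:]:
            return  # stray close tags are ignored
        # Close everything back to the matching open tag, which also
        # closes any elements left open inside it.
        index = len(open_tags) - 1 - open_tags[::-1].index(tag)
        del self.stack[index:]

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data)

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def outline(node):
    """Compact (tag, children) view of the tree, handy for inspection."""
    if isinstance(node, str):
        return node
    return (node["tag"], [outline(child) for child in node["children"]])

print(outline(parse("<DIV>one<BR>two<EM>three</DIV>")))
```

Here the unclosed EM is closed along with its DIV, and the BR never swallows the text that follows it.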
One major formatting option i had ignored up
until this point was CSS (Cascading Style Sheets)... the "new" way of styling HTML documents. I downloaded the
CSS1 spec (recommended in 1996, revised in 1999) and was dismayed to
notice that its formatting model was incompatible with mine (specifically in
its interpretations of borders, padding and margins, and their
apparent applicability to virtually ALL element types). Uh-oh. This is
the least fun [and most embarrassing] stage of reinventing the wheel... the
realization that you've stupidly wandered off down the wrong path, and that
you never needed to go there in the first place, because the
right direction was clearly marked all along. Oh well, it's always a
learning experience.
And so i began implementing CSS1 formatting, adopting its elegantly simple box-element
rendering paradigm. A shame to see things get temporarily broken, but it's
for the best in the long run, so there you go. Still not sure how best to
cope with malformed HTML. Lots of info about this online, so hopefully
i'll find some useful references.
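For reference, the horizontal arithmetic of the CSS1 box model can be sketched as below. The Box class and content_width_for are hypothetical names for illustration, and the sketch assumes the same padding, border and margin on every side, which CSS1 does not require.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """One CSS1-style box; each edge value applies to all sides (an assumption)."""
    content_width: int
    padding: int = 0
    border: int = 0
    margin: int = 0

    def border_box_width(self):
        """Width out to the outer edge of the border."""
        return self.content_width + 2 * (self.padding + self.border)

    def outer_width(self):
        """Full horizontal space the box occupies, margins included."""
        return self.border_box_width() + 2 * self.margin

def content_width_for(available, padding=0, border=0, margin=0):
    """Content width left over when a box must fit inside `available` pixels."""
    return max(0, available - 2 * (padding + border + margin))

box = Box(content_width=200, padding=10, border=2, margin=15)
print(box.border_box_width(), box.outer_width())
```

The same nesting of content, padding, border and margin applies to virtually every element type, which is why one box routine can serve the whole renderer.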
Not finished
And so, i now have a mostly functional CSS based HTML
viewer, using my proprietary font renderer. Some rendering glitches, no
editing functionality yet. Was it worth it? Well, there is still the potential
to use at least some of the functionality in Book Reader, and it did
help to mature my font rendering technology. Also i now have a
fantastically good understanding of HTML+CSS. Overall, my feeling is that
even when you do something that doesn't need to be done, you will still
learn a lot, and as long as no one is going to get cross at you for
wasting development time, you can still come out ahead.
Want to know what happens next? Maybe you'll find out
in the Captain's Log.
Sample images
Thursday, January 2, 2003