Jujusoft

The evolution of HTML Editor

It has been my experience that most projects i start are a direct result of need or a dissatisfaction with existing products. The problem i have is that i often stop needing the product before i finish creating it. A programmer like me is constantly building tools to help him get his work done, but consequently hardly ever achieves anything he originally set out to do.

A simple plan

 
When i first started working on this web site, i found the management of multiple HTML pages surprisingly time consuming. I had no sophisticated tools for the job, so i decided to start the even more time consuming task of creating my own.
 
I always liked the simplicity of Microsoft's ActiveX DHTML control (as featured in Outlook Express), so i decided to start working on a basic tool which would give me simple WYSIWYG as well as direct source editing. All i needed to do was wrap the DHTML object and voila! I had a very basic stand-alone HTML editor. But then, i really needed to have source editing (raw HTML) as well, so i changed from a very simple single-window interface to a more complex tabbed interface. The source editing component (with syntax highlighting) was provided by the editing control i developed for my text editor. I also added a browser preview tab, since the tabbed interface made it trivial to implement additional views.
 
So now i had something very close to what you get when you edit an email in Outlook Express. Except mine was not for email, it was for working with plain HTML files. This is pretty much what i had set out to achieve. I added some extra buttons for table support (which you don't get in Outlook Express).
 
But then, it was still only good for editing a single page, and i was trying to manage a web site! It was obviously time to make it a little more advanced, so i added some file & site management. Added a TreeView control to it for the purpose. At first the tree view only showed files that could be reached from the currently open page. Seemed like a good idea at the time, but turned out to be more confusing than anything. So i changed it to a more obvious directory structure, starting at the site's root.
 
Of course a tree is not that useful by itself, so i also added template-based pages (you're reading one right now) and intelligent link tracking and updating, so that renaming one file in your site will not result in broken links. All pages use relative links, to make the web site as portable as possible.
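To make the link tracking idea concrete, here is a minimal sketch in Python. It is not the editor's actual code: the function names (relative_link, links_after_rename) and the idea of storing each page's links as site-root-relative paths are assumptions made for illustration. It shows how relative links can be regenerated for every page when one file is renamed.

```python
import posixpath

def relative_link(from_page, to_page):
    """Return the relative URL that reaches to_page from from_page.

    Both arguments are site-root-relative paths such as 'docs/intro.html'.
    """
    start = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_page, start)

def links_after_rename(pages, old, new):
    """Recompute every page's relative hrefs after renaming old to new.

    pages maps a page path to the list of site-root-relative targets
    that page links to (a hypothetical storage format).
    """
    updated = {}
    for page, targets in pages.items():
        page_path = new if page == old else page
        fixed_targets = [new if t == old else t for t in targets]
        updated[page_path] = [relative_link(page_path, t) for t in fixed_targets]
    return updated

site = {
    "index.html": ["docs/a.html"],
    "docs/a.html": ["index.html"],
}
print(links_after_rename(site, "docs/a.html", "guide/a.html"))
```

Because the links are stored relative to the site root and only turned into relative hrefs at write time, a rename touches one value per link and the site stays portable.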
 
But it wasn't enough... i was still tied to the MS DHTML control. I wanted more. And so, i began work on my custom HTML editor...
 

Spinning out of control

ActiveX controls... puh-leeze! While there is certainly some sense in packaging code into these executable chunks, i can't see the point of objects which take twice as much effort to interface with as they would if implemented statically!
See, i got frustrated trying to use the table functionality of MS's ActiveX control. Basically the control is lacking in some really useful things, like the ability for a parent application to control selection and cursor positioning, or to insert HTML directly. For example i can't find a way to create a simple <BR> without copying it to the clipboard and using the inbuilt paste functionality. Doing it this way clobbers the clipboard.
 
So i started looking at other implementations and wondering how much work it would be to write an HTML editor from scratch, more out of curiosity than necessity. First step would be a viewer (renderer), with editing functionality to be added later. But then, only an idiot with too much time on their hands would bother reinventing that wheel.
 
But then, this idiot likes mucking around with fonts and antialiasing, so i says to myself, i says "why not?", i says. At least i could try some simple structured document parsing to see if i could use it in the Book Reader, which has been sadly languishing. (Book Reader is really a nice app, if i could just get around to adding an interface to it!)
 
Tried out some basic XML structured document rendering, supporting little more than paragraphs and indents. To the right is an early detail. Images are represented by yellow circles with blue edges. While it looks nothing like the same document in a browser, it looks cute enough to inspire me to keep working on it.
 
Added support for basic font control etcetera, and liked the rapidity of the improvements. It is worth noting that this is the most exciting part of reinventing the wheel, because you're reenacting some great idea that someone else had a long time ago, and even though you're not the first one to do it, you can still feel the excitement of creating something new, which works. I start to recall the earliest web pages i ever saw, and the speed with which new and mostly useful formatting options (and a wider understanding of existing ones) appeared. Tables were a big step i think. They were not part of HTML 2.0, and only became part of the standard with HTML 3.2.
 
Did i say tables were a big step? I should also mention that they are a nightmare to implement! It's all very nice to have a structured document when you can see a document as a tree... but trying to visualize a table as a tree? At its simplest, a TABLE contains one or more rows (<TR>), and each row contains one or more cells (<TD>). The problem here is that the width of the cells in row B generally depends on the width of the cells in row A. So when a viewer/editor renders the table, it has to take this table containing rows containing cells and turn it back into a grid. Basically there's a lot of room for error in this procedure.
 
Actually, the "dynamic sizing" of HTML tables is what makes them so powerful in document formatting. They can cope with vastly different font sizes and other display settings from browser to browser and user to user.
And did i mention that tables were a nightmare to implement? Let's say you have a table with 1000 rows, each with 5 columns. In order to work out how wide column 1 is, you would think there would be just this table header info which says "column 1 is 50 pixels wide" or something like that, wouldn't you? Well, what you actually have to do is process the entire table, calculating the minimum width of the first cell in each of the 1000 rows. Even if the header does give an absolute width, that width will be ignored if it is smaller than the maximum of the minimum widths. The minimum width of a cell is calculated from its contents (text, images, other tables etc) and the formatting options used (font-size, padding etc). The contents of each cell can be seen as a sub-document in itself, so processing 1000 of them just to work out the width of a single column is not a trivial procedure!
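The width rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name column_widths is hypothetical, each cell's minimum width is taken as already measured (in a real renderer that means laying out the cell's sub-document), and spanning cells are ignored.

```python
def column_widths(rows, declared=None):
    """Resolve column widths for a table.

    rows:     list of rows, each a list of minimum cell widths in pixels
              (assumed to be measured already from the cell contents).
    declared: optional dict mapping a column index to the width given
              in the table markup.
    """
    declared = declared or {}
    n_cols = max((len(row) for row in rows), default=0)
    widths = [0] * n_cols

    # Every row has to be visited: a column is at least as wide as the
    # widest minimum width found among its cells.
    for row in rows:
        for col, min_width in enumerate(row):
            widths[col] = max(widths[col], min_width)

    # A declared width only wins if it is not smaller than that maximum.
    return [max(w, declared.get(col, 0)) for col, w in enumerate(widths)]

rows = [[40, 10], [65, 20], [30]]
print(column_widths(rows, declared={0: 50, 1: 100}))
```

Here column 0 declares 50 pixels but one of its cells needs 65, so the declaration is overridden, while column 1's declared 100 pixels is honoured because no cell needs more.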
 
Once i had tables working, i got some basic pages displaying fine, but others looked a lot like crap. This was generally for 1 of 2 reasons: 1) My renderer was still very buggy, and lacking support for all but the most basic formatting options; 2) The average HTML page is technically godawful, when approached from an XML frame of mind. I had foolishly believed the idea that HTML was a proper implementation [or application] of XML. Hah! An XML document would not allow syntax like: <bold>Some text <italic> Some text </bold> Some text </italic>. This is a case of overlapping elements. Also XML defines the syntax for an empty element to be <br/>, whereas this is virtually never used within HTML. It seems that such syntax (and much worse) is very common within HTML documents, and from a programmer's point of view, this lack of conformance is as annoying as a calculator without a minus key.
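Both complaints are easy to reproduce with any strict XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than the parser discussed in this article, and the helper name is_well_formed is just for illustration. It wraps a fragment in a root element and reports whether the parser accepts it.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

samples = [
    "<bold>Some text <italic>Some text</bold> Some text</italic>",  # overlapping
    "<bold>Some text <italic>Some text</italic></bold>",            # nested
    "line one<br>line two",                                         # HTML-style empty element
    "line one<br/>line two",                                        # XML-style empty element
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

The overlapping elements and the bare <br> are both rejected, which is exactly why a pure XML parser chokes on so many real-world pages.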
 
There are many people who have a problem with the general cruddiness of HTML regarding XML conformance. This concern has led to a new proposal called XHTML, which is essentially a modified HTML fully conforming to the XML standard. By definition this means that any XHTML document produced will automatically be a valid XML document. Yay!
After i had got over my annoyance at the vast difference between an ideal HTML and the one that people actually use, i set to work mutilating my nice simple XML parser to support the horrible inconsistencies and legacy kludges. I am desperately trying to avoid writing a dedicated HTML parser, but the fact that much of the dekludging is specific to the element types means i may be resisting the inevitable (FONT, EM, STRONG tags affect only font appearance, while TABLE & DIV tags are structural, and deeply affect the layout and navigation of the document. BR, HR and IMG tags never contain sub-elements. A pure XML parser makes no such distinction between different tags).
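As a rough illustration of why the dekludging ends up tied to element types, here is a small tree builder on top of Python's standard html.parser tokenizer. Everything in it is an assumption made for the example, not the parser described here: the Node and LenientTreeBuilder names, the short VOID list, and the recovery rule (an end tag closes everything up to the nearest matching open element, and an end tag with no match is ignored). Real browsers use more elaborate recovery.

```python
from html.parser import HTMLParser

# Elements that never contain sub-elements (short list, for illustration only).
VOID = {"br", "hr", "img"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node instances and plain text strings

class LenientTreeBuilder(HTMLParser):
    """Builds a tree from tag soup using one simple recovery rule."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:  # void elements are never descended into
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # a stray </br> or the end half of <br/> closes nothing
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent  # implicitly closes anything left open inside
        # otherwise: no matching open element, so the end tag is ignored

    def handle_data(self, data):
        self.current.children.append(data)

def dump(node):
    """Convert the tree into nested (tag, children) tuples for inspection."""
    return (node.tag, [c if isinstance(c, str) else dump(c) for c in node.children])

builder = LenientTreeBuilder()
builder.feed("<b>one <i>two</b> three</i><br>four")
builder.close()
print(dump(builder.root))
```

The overlapping <i> is closed early when </b> arrives, the leftover </i> is dropped, and <br> becomes a childless node without needing XML's <br/> syntax. None of that is possible without knowing which tag names are special.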
 
One major formatting option i had ignored up until this point was CSS (Cascading Style Sheets)... the "new" way of styling HTML documents. I downloaded the CSS1 spec (recommended in 1996, revised in 1999) and was dismayed to notice that its formatting model was incompatible with mine (specifically in its interpretations of borders, padding and margins, and their apparent applicability to virtually ALL element types). Uh-oh. This is the least fun [and most embarrassing] stage of reinventing the wheel... the realization that you've stupidly wandered off down the wrong path, and that you never needed to go there in the first place, because the right direction was clearly marked all along. Oh well, it's always a learning experience.
 
And so i began implementing CSS1 formatting, adopting its elegantly simple box-element rendering paradigm. A shame to see things get temporarily broken, but it's for the best in the long run, so there you go. Still not sure how best to cope with malformed HTML. Lots of info about this online, so hopefully i'll find some useful references.
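The CSS1 box idea is simple enough to show in a few lines of Python. This is a sketch only: the Edges and Box classes are hypothetical names and not the renderer's data structures. In CSS1 the width property describes the content area, and padding, border and margin are layered outside it in that order.

```python
from dataclasses import dataclass, field

@dataclass
class Edges:
    """Sizes in pixels for the four sides of one layer of the box."""
    top: float = 0
    right: float = 0
    bottom: float = 0
    left: float = 0

@dataclass
class Box:
    """A CSS1-style box: content, then padding, border and margin around it."""
    content_width: float
    content_height: float
    padding: Edges = field(default_factory=Edges)
    border: Edges = field(default_factory=Edges)
    margin: Edges = field(default_factory=Edges)

    def border_box_width(self):
        # Width of the visible box: content plus padding plus border.
        return (self.content_width
                + self.padding.left + self.padding.right
                + self.border.left + self.border.right)

    def border_box_height(self):
        return (self.content_height
                + self.padding.top + self.padding.bottom
                + self.border.top + self.border.bottom)

    def margin_box_width(self):
        # Total horizontal space the element occupies, margins included.
        return self.border_box_width() + self.margin.left + self.margin.right

box = Box(200, 50,
          padding=Edges(5, 10, 5, 10),
          border=Edges(1, 1, 1, 1),
          margin=Edges(0, 20, 0, 20))
print(box.border_box_width(), box.border_box_height(), box.margin_box_width())
```

Since every element type gets the same four layers, one structure like this can serve paragraphs, table cells and images alike, which is what makes the model attractive for a renderer.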
 
 

Not finished

 
And so, i now have a mostly functional CSS based HTML viewer, using my proprietary font renderer. Some rendering glitches, no editing functionality yet. Was it worth it? Well, there is still the potential to use at least some of the functionality in Book Reader, and it did help to mature my font rendering technology. Also i now have a fantastically good understanding of HTML+CSS. Overall, my feeling is that even when you do something that doesn't need to be done, you will still learn a lot, and as long as no one is going to get cross at you for wasting development time, you can still come out ahead.
 
Want to know what happens next? Maybe you'll find out in the Captain's Log.
 


Sample images

 IE Sample, to compare with...

 Jujusoft Sample

 Jujusoft detail with image filtering

Thursday, January 2, 2003