Lagoon User Guide

Overview

Lagoon is an XML-based framework for web site maintenance.

Lagoon does not require support for any dynamic content technology, such as Servlets, CGI, ASP, SSI, PHP or JSP, on the web server. It is therefore very useful for sites on cheap web hosting services that give limited (or no) access to dynamic content features. However, Lagoon is useful for larger sites too, and can be used together with other dynamic content technologies.

Lagoon is also useful for building HTML-based documentation bundles, for viewing without a web server.

Background and philosophy

The web server: Static and dynamic content

The basic functionality of a web server is to send the content of a regular file stored on disk as the response to a request; this is called static content. The alternative is to start a process that generates the response for each request; this is called dynamic content. The use of dynamic content can be divided into several categories as follows.

Pseudo-dynamic
where the requested document is generated by composing information from several files and/or from a database. The information in the files and in the database is static, i.e. updated in a controlled manner and not very frequently. This is called pseudo-dynamic since the produced document is a function of static information only. In principle, this use of dynamic content is not necessary since the updates could have been done in static content. The main reason for using this approach is easier maintenance and updates. Technologies such as SSI, ASP, PHP and JSP are used for this.
Real-time data
where the generated document depends on some information that is updated frequently and outside the control of the web server. Technologies such as CGI and Servlets (sometimes also ASP, PHP and JSP) are used for this.
User interaction
where the generated document depends on parameters in the request and/or state information from previous requests. Technologies such as CGI and Servlets (sometimes also ASP, PHP and JSP) are used for this.

It is also possible to have combinations (such as both real-time data and user interaction).

Where Lagoon fits in

Lagoon produces all your pseudo-dynamic content off-line, and sends the result to the web server as static files. This can give better performance, since processing doesn't have to be done at each request. This also gives you convenient pseudo-dynamic content on a web server without explicit support for it.

Lagoon does not handle user interaction; you have to use the conventional technologies for this. However, while using ASP, JSP (or whatever) for the few pages with user interaction, you can still use Lagoon for the rest of the site.

Lagoon can in some cases be used for real-time data, but this requires a more complicated setup than usual. See the Advanced User Guide.

In addition, Lagoon keeps track of all content for your web site, including static files (HTML, images, etc.) and any files for user interaction (ASP or JSP pages, CGI scripts, or whatever). Lagoon automatically detects if any source file is updated and regenerates the dependent content as necessary (and only when necessary). Lagoon can be seen as a Make tool for web sites: you have the "source code" on your computer, and Lagoon performs "compilation" as necessary and stores the "object code" directly on the web server (with FTP or SSH if the web server is remote). This is especially useful if you have a large web site and update it over a slow dial-up modem connection; if you make changes to only a few pages, only those pages are actually transmitted to the web server.

User requirements

To make use of Lagoon, you need to know XML (including namespaces). Many usage patterns require knowledge of XSLT too. See http://www.w3.org/xml/ for more information about XML, XSLT and related technologies.

You don't need to know the Java programming language to make use of the basic features of Lagoon.

System requirements

Installation

  1. Make sure that your JRE is properly installed and working.
  2. If you plan to use Batik and/or FOP, add the Batik and/or FOP distribution JAR files to your Java CLASSPATH, or install them as standard extensions in your JRE.
  3. Add the bin directory (which contains scripts for Windows, OS/2 and UNIX-based systems) to your PATH and set an environment variable LAGOON_HOME to the directory where you installed Lagoon. Alternatively, you can add the .jar files in the Lagoon distribution to your Java CLASSPATH.

The sitemap

Lagoon is based on a sitemap, which is a file describing the structure of the website. The sitemap is in XML format; see the schema. A sitemap may have a name which uniquely identifies it on the system.

The sitemap has entries for files; each <file> entry describes how that file should be created. The target is the request URL for the web server, and must be specified as a pseudo-absolute URL. The source URL must be specified as an absolute or pseudo-absolute URL. If source is omitted, it is set to the same string as target (including any wildcard). The pipeline must end with a byte producer. As a shortcut, an empty file element is equivalent to a file element with a single read child.

The sitemap may also contain <part> entries, which define a partial document, to be included by another document. A part has a name, which is not a URL. The source URL must be specified as an absolute or pseudo-absolute URL. The pipeline must end with an XML producer. Wildcards may not be used.

The sitemap may also contain <output> entries, which define the end of a pipeline for use by several targets (and especially by the island feature). An <output> entry has a name, which is not a URL. The pipeline must end with a byte producer, and start with an XML consumer.

The sitemap may also contain <delete> entries, which are used to delete the target.

The sitemap may also contain <property> entries, which are used to define sitemap wide parameters.

URLs

In the sitemap, URLs are used to point out files and other resources. See RFC 1738 for more information about URLs. There are three kinds of URLs:

Absolute URLs are handled by the java.net.URLConnection class in Java. It has a pluggable API and it's possible to install custom handlers for any scheme (see the Java documentation for information about this). At least http, https and ftp are included in Java2 1.4. However, three schemes are handled specially by Lagoon: file, res and part.

The file scheme is used to point out a file anywhere in the local file system. Lagoon handles this by itself because the support for it in several versions of Java is broken. These URLs must be written as file: followed by the native absolute file path, e.g. file:/usr/local/file.txt on a UNIX system or file:C:\Documents\file.txt on a Windows or OS/2 system. (Note that this is not according to the definition of file URLs in RFC 1738.)

The res scheme is used to point out resources included in the Lagoon distribution. These resources are loaded from the Java CLASSPATH; you can add your own resources by adding them to the Java CLASSPATH. See the Resource Guide for information about the included resources. E.g. res:/style/imageindex.xsl

The part scheme is used to point out a <part> defined in the sitemap. E.g. part:thePart.

Dependency checking of sources works for relative, pseudo-absolute and absolute URLs with the file or part scheme. However, it does not work for absolute URLs with other schemes; if any such URL is used as a source, it causes rebuilding each time. Absolute URLs with the res scheme never cause rebuilding.
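As a rough illustration of the dependency rule above, the following sitemap sketch contrasts a source read through the file scheme with one read through the http scheme. The paths, the host name and the target names are hypothetical, and both source documents are assumed to be XHTML so that they can be passed straight to the HTML formatter.

```xml
<!-- Source read through the file scheme: dependency checking works,
     so the target is rebuilt only when the source file is updated. -->
<file target="/notes.html" source="file:/usr/local/notes.xml">
  <format type="html">
    <source/>
  </format>
</file>

<!-- Source read through the http scheme: dependency checking does not
     work for this scheme, so the target is rebuilt on every run. -->
<file target="/news.html" source="http://www.example.com/news.xml">
  <format type="html">
    <source/>
  </format>
</file>
```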

Producers

A file is created with a pipeline of connected producers. A producer is a component which produces a stream of bytes (implemented using a Java OutputStream) or a stream of XML data (implemented using SAX2 events). A producer may additionally take a stream of bytes or XML data as input (if it does not, it is a source producer). A pipeline is a chain of connected producers.

Lagoon has defined six types of producers:

Format
which inputs XML data and outputs bytes. Will typically be used as the last step in the pipeline.
Transform
which inputs XML data and outputs XML data. Can be used to perform XSLT transformations.
Source
which is a source producer outputting XML data. Will typically be an XML parser that reads a source file.
Read
which is a source producer outputting bytes. Will typically be used to copy a source file unchanged (useful for e.g. images).
Parse
which inputs bytes and outputs XML data.
Process
which inputs bytes and outputs bytes.

Parameters can be passed to a producer, which is useful for e.g. giving the name of the stylesheet to an XSLT processor. Parameters are given as attributes on the producer element in the sitemap. Any character content of a source producer element is taken as a parameter with the name "name". Each file entry may have a main source, which can be read by the source producer (however, a source producer may instead obtain the data from some other source).
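Judging from the sitemap examples in this guide, a pipeline is written inside-out: the source producer is the innermost element and the final byte producer is the outermost. The annotated sketch below shows that reading for a three-step pipeline; the file names are hypothetical.

```xml
<file target="/books/guide.html" source="/books/guide.xml">
  <!-- Step 3 (outermost): a format producer, XML in and bytes out. -->
  <format type="html">
    <!-- Step 2: a transform producer, XML in and XML out. The
         stylesheet attribute is passed to the producer as a parameter. -->
    <transform type="xslt" stylesheet="/style/book_html.xsl">
      <!-- Step 1 (innermost): a source producer parses the main
           source of the file entry, here /books/guide.xml. -->
      <source/>
    </transform>
  </format>
</file>
```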

When Lagoon is started, the sitemap is parsed and a pipeline is set up for each file entry. Lagoon is now ready to build the website, which can be done several times since the pipelines are reusable.

The website building is performed by processing each file entry in the sitemap. A file entry is only processed if necessary, i.e. if any source data has been updated since the last time it was processed. It's up to each producer to implement this dependency checking.

Wildcards

You can specify similar treatment of several files by using a wildcard in the source filename. Lagoon will enumerate all source files matching the given wildcard pattern, and process all of them in sequence. The target filename must also contain a wildcard, which is instantiated with the same string that matched the source pattern. For example, if the source pattern is *.xml and the target is *.html, then the source file book.xml will generate book.html. You cannot use wildcards if the source is an absolute URL.
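A small sketch of two wildcard entries, combining the rules above with the sitemap defaults (an omitted source is set to the target string, and an empty file element is read as a single read child). The directory and stylesheet names are hypothetical.

```xml
<!-- Copy every PNG under /img unchanged. The source is omitted, so it
     defaults to the target string, wildcard included. -->
<file target="/img/*.png"/>

<!-- Each /articles/NAME.xml produces /articles/NAME.html, where NAME
     is the string matched by the wildcard in the source pattern. -->
<file target="/articles/*.html" source="/articles/*.xml">
  <format type="html">
    <transform type="xslt" stylesheet="/style/article.xsl">
      <source/>
    </transform>
  </format>
</file>
```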

Splitting

A producer may have a special feature called split, which works as a transform but can generate several output files from one single source. The pipeline after a split producer is executed once for each part the split producer generates (the split producer has to generate a filename for each part as well).

Islands

A producer may have a special feature called island, which works as a transform but can generate several output files from one single source. Parts of the source document are redirected to other output pipelines, using <output> entries in the sitemap (the island producer has to generate a filename for each redirected part as well).

The standard producers

You can also write your own producers; see the Advanced User Guide.

If nothing else is stated, a producer signals the need for rebuilding when the main source has been updated (for source producers), or asks the next upstream producer (for other producers).

<source>

Parse the main source as XML.

<source type="dir">

The main source must be a directory (and the source URL must end with '/'), and may not be an absolute URL with a scheme other than file or res. The files and subdirectories in this directory are listed and provided as XML in the following format:

<dirlist>
    <directory filename="somedir" url="/thisdir/somedir" 
        timestamp="987198810097" date="2001-04-13" time="23:53:30"/>
    <file filename="somefile.txt" url="/thisdir/somefile.txt" 
        timestamp="987197358000" date="2001-04-13" time="23:29:18" 
        size="445"/>
</dirlist>

The url attribute contains the same pseudo-absolute URL that would be used to refer to this file in the sitemap. The timestamp attribute contains the number of milliseconds since 1970, as a decimal number.

There is one optional parameter, "pattern", which gives a wildcard pattern to select which files and subdirectories to include. If omitted, all files and subdirectories are included.

This producer signals the need for rebuild when the timestamp on the directory is updated. However, since many operating systems don't update that timestamp reliably, it will also check if any file is added, removed or renamed. It does not check if the content of any file in the directory is updated.
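To show how the directory listing might be consumed, here is a minimal XSLT 1.0 stylesheet sketch that turns a <dirlist> document into an XHTML index page. It assumes that the dirlist elements are in no namespace, as in the sample above; the page title is a placeholder.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

  <!-- Build one XHTML page from the whole directory listing. -->
  <xsl:template match="/dirlist">
    <html>
      <head><title>Directory index</title></head>
      <body>
        <ul>
          <!-- One list item per file element, sorted by filename;
               directory elements are skipped. -->
          <xsl:for-each select="file">
            <xsl:sort select="@filename"/>
            <li>
              <a href="{@url}"><xsl:value-of select="@filename"/></a>
              (updated <xsl:value-of select="@date"/>)
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Note that the url attribute refers to the listed source file. If the link should point at a generated page with another extension, the stylesheet has to rewrite the value itself.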

<read>

Read the main source as a byte stream.

<transform type="xslt">

Applies an XSLT stylesheet to the XML stream.

The mandatory parameter "stylesheet" specifies the location of the stylesheet, as a URL. If this URL is relative, it is resolved relative to the source file. Any relative URL imported or included from the stylesheet is resolved relative to the stylesheet.

Any relative URL referred to by the document() function is resolved relative to the source file. part: URLs may be used in the document() function.
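As a sketch of a part: URL in the document() function, the stylesheet below copies an XHTML page unchanged and inserts the content of a part at the top of the body. It assumes that a <part> named "header" is defined in the sitemap and that the source page is in the XHTML namespace.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">

  <!-- Identity template: copy everything unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Insert the root element of the part named "header" as the
       first child of the body, before the original body content. -->
  <xsl:template match="h:body">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:copy-of select="document('part:header')/*"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```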

This producer will check if the stylesheet, or any file imported or included from it, has been updated, and in that case signal the need for rebuild and also recompile the stylesheet. This producer will also check if any file referred to using the document() function has been updated, and in that case signal the need for rebuild. However, since the document() function can take an expression as argument, this may not always work properly. To remedy this problem, there is a parameter "always"; if it is set to any non-empty string, this producer will signal the need for rebuilding (but not recompile the stylesheet) each time.

Any other parameters are passed as parameters to the stylesheet (to be used by top-level <xsl:param> elements in XSLT).
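A sketch of parameter passing, under the assumption that a stylesheet /style/book_html.xsl declares a top-level parameter called "edition" (both the parameter name and its values are made up for illustration). The two fragments belong to different files: the first to the sitemap, the second to the stylesheet.

```xml
<!-- Sitemap: "stylesheet" and "always" are used by the producer
     itself; "edition" is not, so it is handed on to the stylesheet. -->
<file target="/books/*.html" source="/books/*.xml">
  <format type="html">
    <transform type="xslt" stylesheet="/style/book_html.xsl"
        always="yes" edition="draft">
      <source/>
    </transform>
  </format>
</file>

<!-- Stylesheet: a top-level parameter with the same name. The default
     value 'final' applies when the sitemap does not set it. -->
<xsl:param name="edition" select="'final'"/>
```

Setting "always" trades build time for safety: it is only worth it when the stylesheet calls document() with a computed argument.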

Any xsl:output elements in the stylesheet have no effect. Specify formatting properties with a format producer instead.

<transform type="split">

Implements the splitting feature.

The XML stream from the upstream producer is scanned for a specific element, and each occurrence of that element generates one output file. The XML data outside this element is ignored. The main output will be an empty dummy file (however, it is needed for dependency checking).

This producer takes three mandatory parameters. "namespace" and "element" specify the element to split on.

"outputname" specifies how to construct the filename for each part. It contains a filename template with attribute names surrounded by square brackets ([]), which are replaced with the value of that attribute on the split element. To include a literal bracket in the filename, use a double bracket.

For example, a sitemap fragment like this:

<transform type="split" namespace="" element="thepart" outputname="[name].xml">

with an input fragment like this:

<thepart name="first">

will result in the file first.xml being created.

The output name must be a relative URL, and is relative to the main target file. It must not be pseudo-absolute.
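A fuller sketch of the same example, assuming that the split producer is nested between a source and a format producer like any other transform, so that the format producer is the pipeline run for each part. The paths and the <collection> wrapper element are hypothetical.

```xml
<!-- Sitemap entry. The main target /parts/all.xml becomes the empty
     dummy file; part names are resolved relative to it. -->
<file target="/parts/all.xml" source="/parts/all.xml">
  <format type="xml">
    <transform type="split" namespace="" element="thepart"
        outputname="[name].xml">
      <source/>
    </transform>
  </format>
</file>

<!-- Input document /parts/all.xml in the source directory. Content
     outside the thepart elements is ignored by the split producer. -->
<collection>
  <thepart name="first"><para>One</para></thepart>
  <thepart name="second"><para>Two</para></thepart>
</collection>
```

Under these assumptions, one build would be expected to produce /parts/first.xml and /parts/second.xml next to the dummy file.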

<transform type="island">

Implements the island feature.

The XML stream from the upstream producer is scanned for elements with specific XML namespaces, and each occurrence of such an element generates one output file. The XML data outside those elements is passed through unchanged. This is typically used to process embedded SVG or MathML in XHTML documents.

For each XML namespace you want to extract, you need to specify the three parameters "namespacen", "outputn" and "outputExtn", where n is a number starting from 0. "output" specifies which <output> entry in the sitemap to use for this namespace. "outputExt" specifies the file extension to give to the generated file (including '.').

Filenames for the extracted parts will be the name of the main file + "_image" + a number + the extension given.

<transform type="lssi">

Performs LSSI processing.

Any relative URL included from the LSSI page is searched for relative to the source file.

part: URLs may be used when including. Included parts need not be well-formed.

This producer will check all files (and other resources) the LSSI page depends on, and signal the need for rebuilding if any of them are updated (or always signal the need for rebuilding if any resource cannot be checked, e.g. an absolute URL).

<transform type="lsp">

Executes an LSP page.

Any parameters to this producer are used as parameters to the LSP page.

Any relative URL imported from the LSP page is searched for relative to the source file.

part: URLs may be used when importing. Imported parts must be well-formed (i.e. have a single root element), which is not necessarily the case if they are generated by LSP or XSLT.

This producer always signals the need for rebuilding.

This producer cannot be used together with wildcards.

You need to include the LSP jar files (lsprt.jar and lspc.jar) in your CLASSPATH.

<format type="xml">

Formats into well-formed XML.

An "encoding" parameter can be used to specify the character encoding to use; the default is UTF-8.

An "indent" parameter can be used to specify whether to use indenting for pretty-printing the result. Default is no indenting.

The "doctype-public" and "doctype-system" parameters can be used to specify a specific DTD to use. Default is to not use any DTD.

An "omit-xml-declaration" parameter can be used to specify that the output should not contain any XML declaration. Default is to include an XML declaration.

<format type="html">

Formats into classical HTML 4.01.

An "encoding" parameter can be used to specify the character encoding to use; the default is iso-8859-1.

An "indent" parameter can be used to specify whether to use indenting for pretty-printing the result. Default is no indenting.

An "html" parameter can be used to specify which HTML DTD to use. It can take the values "transitional", "strict" or "frameset". Default is "transitional".

The "doctype-public" and "doctype-system" parameters can be used to specify a specific DTD to use. If specified, they will override the setting of the "html" parameter.
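The following sketch shows the HTML formatter parameters in use. The target names are hypothetical, and the sources are assumed to be XHTML documents. The second entry gives the public and system identifiers of the HTML 4.01 Frameset DTD explicitly, which overrides any "html" setting.

```xml
<file target="/index.html">
  <!-- Strict DTD, and UTF-8 instead of the default iso-8859-1. -->
  <format type="html" html="strict" encoding="UTF-8">
    <source/>
  </format>
</file>

<file target="/frames.html">
  <!-- Explicit doctype parameters take precedence over "html". -->
  <format type="html"
      doctype-public="-//W3C//DTD HTML 4.01 Frameset//EN"
      doctype-system="http://www.w3.org/TR/html4/frameset.dtd">
    <source/>
  </format>
</file>
```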

<format type="xhtml">

Formats into XHTML (well-formed XML which also can be consumed by most non-XML aware HTML browsers).

An "encoding" parameter can be used to specify the character encoding to use; the default is UTF-8.

An "indent" parameter can be used to specify whether to use indenting for pretty-printing the result. Default is no indenting.

An "html" parameter can be used to specify which HTML DTD to use. It can take the values "transitional", "strict" or "frameset". Default is "transitional".

The "doctype-public" and "doctype-system" parameters can be used to specify a specific DTD to use. If specified, they will override the setting of the "html" parameter.

An "omit-xml-declaration" parameter can be used to specify that the output should not contain any XML declaration. Default is to include an XML declaration if necessary.

<format type="text">

Formats into plain text.

An "encoding" parameter can be used to specify the character encoding to use; the default is iso-8859-1.

<format type="fo">

Formats XSL-FO into PDF using Apache FOP.

<format type="svg">

Formats SVG into a bitmap image using Apache Batik.

A mandatory parameter "format" specifies the format of the image; it may be "jpeg", "png" or "tiff".

For JPEG images, an additional parameter "quality" may be used to specify the compression rate. It is a floating point number between 0 and 1, where a larger number means better quality but less compression (larger file). The default quality is 0.8.

<parse>

Parses the byte stream from the upstream producer as XML.

The use of this producer is not recommended. In most cases you can use a <source> instead. Using this producer may affect performance, and can in some situations cause deadlocks.

Common usage patterns

Copy the content of any file, useful for e.g. images:

<file target="/john.jpeg" source="/img/john.jpeg">
  <read/>
</file>

Same as above:

<file target="/john.jpeg" source="/img/john.jpeg"/>

Format an HTML page. Note that the source HTML file must be XHTML (well-formed XML and all HTML elements in the XHTML namespace):

<file target="/index.html">
  <format type="html">
    <source/>
  </format>
</file>

Transform a bunch of XML files with an XSLT stylesheet into HTML:

<file target="/books/*.html" source="/books/*.xml">
  <format type="html">
    <transform type="xslt" stylesheet="/style/book_html.xsl">
      <source/>
    </transform>
  </format>
</file>

Transform the same XML files with another XSLT stylesheet into PDF:

<file target="/books/*.pdf" source="/books/*.xml">
  <format type="fo">
    <transform type="xslt" stylesheet="/style/book_print.xsl">
      <source/>
    </transform>
  </format>
</file>

Use XSLT to generate an index over all books:

<file target="/books/index.html" source="/books">
  <format type="html">
    <transform type="xslt" stylesheet="/books/index.xsl">
      <source type="dir" pattern="*.xml"/>
    </transform>
  </format>
</file>

Generate an HTML page using LSSI:

<file target="/coolpage.html">
  <format type="html">
    <transform type="lssi">
      <source/>
    </transform>
  </format>
</file>

Define a partial page to be included by another page:

<part name="header" source="/header.lsp">
  <transform type="lsp" menu="yes">
    <source/>
  </transform>
</part>

Render a JPEG image from SVG:

<file target="/picture.jpeg" source="/picture.svg">
  <format type="svg" format="jpeg" quality="0.5">
    <source/>
  </format>
</file>

Build PNG images for SVG and MathML islands in XHTML. (Requires that you have an XSLT stylesheet to transform MathML to SVG; no such stylesheet is included in Lagoon. The SVG part works fine though.):

<output name="svgOutput">
  <format type="svg" format="png"/>
</output>

<output name="mathmlOutput">
  <format type="svg" format="png">
    <transform type="xslt" stylesheet="/mathml2svg.xsl"/>
  </format>
</output>

<file target="/island.html">
  <format type="html">
    <transform type="island" 
        namespace1="http://www.w3.org/2000/svg" 
        output1="svgOutput" outputext1=".png"
        namespace2="http://www.w3.org/1998/Math/MathML" 
        output2="mathmlOutput" outputext2=".png">
      <source/>
    </transform>
  </format>
</file>

Delete an old file:

<delete target="/oldstuff/outdated.html"/>

Running Lagoon

Lagoon is invoked by the application class nu.staldal.lagoon.LagoonCLI. The syntax is one of:

lagoon property_file how_to_run
lagoon sitemap_file how_to_run

The property_file and sitemap_file are specified using a platform-dependent path (e.g. use '\' as path separator in Windows), not as a URL. The path may be absolute or relative (to the current working directory). If the filename ends with ".xml" or ".sitemap", it will be taken as a sitemap file, otherwise it will be taken as a property file.

The how_to_run argument specifies what Lagoon should do after initialization. "build" causes it to perform a normal build (perform dependency checking and rebuild the necessary files) once and then exit. "force" causes it to perform a force build (override dependency checking and unconditionally rebuild every file) once and then exit. An integer n will cause it to perform a normal build every nth second, forever (until terminated). Leaving this argument out causes it to go into an interactive mode and wait for you to enter a command (type the command and press [ENTER]): 'b' will cause a normal build, 'f' will cause a force build and 'q' will cause it to quit.

The property file

The property file specifies the sitemap file, the source directory, the target and the password to access the target (if necessary). The file is a standard Java property file, i.e. a text file with one keyword-value pair on each line, separated by ':'; lines beginning with '#' are ignored.

The sitemap file and source directory are specified using platform-dependent paths (e.g. use '\' as path separator in Windows), not as URLs. They may be absolute or relative (to the current working directory). Please note that the Java property file format requires you to escape '\' with '\\'.

If no property file is used (the sitemap file is specified directly on the command line), sourceDir and targetURL will both be set to the current directory. Note: this requires a careful setup of the sitemap, since the default behaviour is to use the same path as source and target (which obviously won't work if sourceDir and targetURL are the same).

Sample property file:

# Lagoon properties

sitemapFile: C:\\joe_files\\webbsite\\sitemap.xml
sourceDir: C:\\joe_files\\webbsite\\src
targetURL: ftp://joe@ftp.acme.com/public_html/
password: secret

The target specification

Lagoon can store generated files in a local directory, or on a remote server using FTP or SSH. You can also write your own FileStorage to use some other protocol; see the Advanced User Guide.

To use a local directory, just specify a platform-dependent path. The directory will be created if it doesn't exist.

To use FTP, specify an absolute URL in the form ftp://login@host/path/. The path is relative to your home directory on the remote machine (to start from the root, use a double slash, as in ftp://joe@foo.bar.com//abs/path/). FTP requires you to specify the password in the property file. Note that FTP sends everything, including your password, in clear text over the network. If security is important, use SSH instead.

To use SSH, specify an absolute URL in the form ssh://login@host/path/. The path is relative to your home directory on the remote machine (to start from the root, use a double slash, as in ssh://joe@foo.bar.com//abs/path/). You need to have a public key properly set up before using this (you should be able to log in without entering any password); do not specify the password in the property file. SSH access requires a UNIX-style shell with the commands "mkdir -p", "rm -f" and "cat" available on the remote server.

The working directory

Lagoon will create a working directory that is used to store cached data and dependency information. This directory is named ".lagoon" and is created in the user's home directory (as pointed out by the Java system property "user.home"; you can change this by modifying the lagoon script to pass "-Duser.home=/some/other/dir" on the java command line).

It is safe to remove the working directory when Lagoon is not running; it will be recreated the next time Lagoon is run. If Lagoon suddenly starts giving unexpected behavior, removing the working directory might remedy the problem. However, removing the working directory may cause unnecessary rebuilds the next time Lagoon is run, especially if you use an FTP or SSH target.

Using Lagoon from within Apache Ant

Lagoon comes with an Ant task. Define the Lagoon Ant task in the Ant build file like this:

<taskdef name="lagoon" classname="nu.staldal.lagoon.LagoonAntTask">
  <classpath>
    <pathelement location="locationOfLagoonJars/lagoon.jar" />
    <pathelement location="locationOfLagoonJars/xmlutil.jar" />
  </classpath>
</taskdef>  

and use one of the following syntaxes:

<lagoon propertyFile="propertyFile"/>
    
<lagoon sitemapFile="sitemapFile"
           sourceDir="sourceDir"
           targetURL="targetURL"
           password="password" />

The password attribute can be omitted if not needed. Use the optional attribute force to override dependency checking.
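Put together, a minimal Ant build file might look like the sketch below. The project layout (lib/lagoon, src, build/site), the property name lagoon.dir and the target names are hypothetical. The value "true" for the force attribute is an assumption, since this guide does not state which values it accepts, and how relative paths are resolved by the task is not stated either.

```xml
<project name="website" default="site" basedir=".">

  <!-- Hypothetical property naming the directory that holds the
       Lagoon JAR files. -->
  <property name="lagoon.dir" location="lib/lagoon"/>

  <taskdef name="lagoon" classname="nu.staldal.lagoon.LagoonAntTask">
    <classpath>
      <pathelement location="${lagoon.dir}/lagoon.jar"/>
      <pathelement location="${lagoon.dir}/xmlutil.jar"/>
    </classpath>
  </taskdef>

  <!-- Normal build with a local target directory; no password needed. -->
  <target name="site">
    <lagoon sitemapFile="sitemap.xml" sourceDir="src"
            targetURL="build/site"/>
  </target>

  <!-- Forced build: dependency checking is overridden.
       The attribute value "true" is an assumption. -->
  <target name="site-force">
    <lagoon sitemapFile="sitemap.xml" sourceDir="src"
            targetURL="build/site" force="true"/>
  </target>

</project>
```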

Lagoon GUI

Lagoon comes with a simple GUI which you can use instead of the command line tool. The application class is nu.staldal.lagoon.LagoonGUI. The syntax is:

lagoongui [property_file]

Alternatively, you can simply execute lagoon.jar (java -jar lagoon.jar), but that requires you to install the required libraries (Batik and/or FOP) as standard extensions in your JRE.

Adapting your web site for Lagoon

To make use of the features of Lagoon, you have to ensure that your HTML source files are in XHTML format (well-formed XML and all HTML elements in the XHTML namespace). You might find the tool HTML Tidy useful for this.

Lagoon comes with a tool which checks an XML file for well-formedness, and reports any errors. The syntax for this tool is:

xmlcheck [-v] xml_file

xml_file is the file to check, which can be specified with a platform-dependent path or a URL. Use the -v option to also check for validity (not necessary for Lagoon). If the XML file is OK, nothing will be printed, otherwise error messages will be printed.

To help with the process of adapting an existing website, Lagoon comes with a tool which creates a sitemap from an existing directory structure. The syntax for this tool is:

buildsitemap source_dir sitemap_file

The source_dir directory will be recursively processed, and all files will be added to the sitemap, which is written to the file sitemap_file. BuildSitemap tries to be a bit clever by treating files differently based on their extensions. However, don't expect the generated sitemap to work directly; you will probably have to make manual adjustments. Making use of LSSI, XSLT transformations and other features requires modifications of the sitemap.