Identifies the table of contents file. If specified with no packages, the TOC file is created from cached mirroring data. If no TOC is specified, the default index.html is created automatically when mirroring finishes.
Identifies the package(s) to mirror.
Restricts mirroring to documents not previously retrieved. (Useful if you add a new package or base URI and don't want to wait for all of the existing mirrored documents to be updated.)
The WebMirror configuration file.
The WebMirror command creates a local copy (or mirror) of one or more web documents. It recursively copies linked documents, patching internal references so that linking will function properly in the local mirror. Links to documents not copied into the mirror are made absolute so that they will function correctly as well.
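The link patching described above can be illustrated with a minimal Python sketch. It is not WebMirror's own code (WebMirror is written in Perl). The function names `local_path` and `patch_link`, the host-then-path layout, and the `index.html` fallback are assumptions made for illustration. Fragments and query strings are ignored for brevity.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(uri):
    """Hypothetical layout: <host>/<path>, relative to the mirror root."""
    parts = urlsplit(uri)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed name for directory-style URIs
    return parts.netloc + path


def patch_link(doc_uri, href, mirrored):
    """Return the href to write into the local copy of doc_uri."""
    # Resolve the reference against the document it came from.
    target = urljoin(doc_uri, href)
    if target not in mirrored:
        # Not copied into the mirror: keep it working by making it absolute.
        return target
    # Copied into the mirror: point at the local copy with a relative path.
    start = posixpath.dirname(local_path(doc_uri))
    return posixpath.relpath(local_path(target), start)


mirrored = {
    "http://www.w3.org/TR/REC-xml",
    "http://www.w3.org/TR/REC-xml-names",
    "http://www.w3.org/StyleSheets/TR/base.css",
}
doc = "http://www.w3.org/TR/REC-xml"
print(patch_link(doc, "REC-xml-names", mirrored))
print(patch_link(doc, "/StyleSheets/TR/base.css", mirrored))
print(patch_link(doc, "../Consortium/", mirrored))
```

With this layout, references to mirrored documents become relative paths inside the mirror, and references to anything else are written out as full URIs.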
Most configuration of WebMirror is accomplished in mirror.xml. The mirror.xml document passed to WebMirror describes “packages”, collections of URLs, and how to mirror them.
The mirror.xml file should obey the schema described by mirror.dtd, although WebMirror does not validate it. An example configuration file is shown in Figure 1, “A Sample mirror.xml Configuration File”.
<mirror root="C:\Mirror" recheck="3">
  <depth local="2" global="1"/>
  <depth content-type="text/html" local="2" global="2"/>
  <depth content-type="text/css" local="3" global="2"/>
  <authorization uri="http://www.w3.org/"
                 user="SomeUser" password="xxx"/>
  <package name="xml" title="XML Core" recheck="1">
    <depth content-type="text/css" local="3" global="3"/>
    <base uri="http://www.w3.org/TR/REC-xml"/>
    <base uri="http://www.w3.org/TR/REC-xml-names"/>
  </package>
</mirror>
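Because WebMirror does not validate the configuration, it can be useful to inspect a mirror.xml file before a long run. The following Python sketch, which is not part of the WebMirror distribution, lists each package and its base URIs using the standard library. The function name `list_packages` is hypothetical, and the sample is a trimmed copy of the figure.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample configuration (raw string keeps the backslash).
SAMPLE = r"""<mirror root="C:\Mirror" recheck="3">
  <depth local="2" global="1"/>
  <package name="xml" title="XML Core" recheck="1">
    <base uri="http://www.w3.org/TR/REC-xml"/>
    <base uri="http://www.w3.org/TR/REC-xml-names"/>
  </package>
</mirror>"""


def list_packages(xml_text):
    """Return (root directory, {package name: [base URIs]})."""
    mirror = ET.fromstring(xml_text)
    packages = {}
    for package in mirror.findall("package"):
        packages[package.get("name")] = [
            base.get("uri") for base in package.findall("base")
        ]
    return mirror.get("root"), packages


root_dir, packages = list_packages(SAMPLE)
print(root_dir)
for name, uris in packages.items():
    print(name, uris)
```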
The mirror element is the document element for a configuration file.
The root attribute is required and must identify the name of a directory on the local system.
The recheck attribute controls how often WebMirror checks for updates. If a resource was last checked within “recheck” days, it is assumed to be up-to-date.
During processing, WebMirror creates directories under the root directory (one for each host from which documents are mirrored) and a uridata.xml document that describes the URIs processed.
The uridata.xml document is maintained by WebMirror and does not need to be edited by hand. It obeys the schema described by uridata.dtd.
The depth element controls how deeply the mirroring process goes.
If content-type is specified, this depth element only applies to documents with a matching content type. If not specified, this element specifies the default depth for all content types not explicitly matched by another depth.
The local depth applies to documents on the same server as the base URI. For example, a depth of 3 instructs WebMirror to retrieve the base URI, all of the documents (on the same server) that are direct children of the base URI, and all of the documents (on the same server) that are grand-children of the base URI. Great grand-children are not retrieved.
The global depth applies to documents on servers other than the server of the base URI. For example, a depth of 2 instructs WebMirror to retrieve the base URI and all of the documents (on other servers) that are direct children of the base URI. Grand-children of the base URI (on other servers) are not retrieved.
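The two depth limits can be modelled with a small breadth-first traversal. The Python sketch below is an illustration only, not WebMirror's implementation. It assumes that the base URI is at level 1, that a document's level is its link distance from the base URI, and that "same server" means the same host name. The `links` dictionary is a hypothetical stand-in for fetching and parsing documents.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(base, links, local_depth, global_depth):
    """Return the set of URIs that would be retrieved starting at base."""
    base_host = urlsplit(base).netloc
    seen = {base}
    queue = deque([(base, 1)])  # the base URI counts as level 1
    while queue:
        uri, level = queue.popleft()
        for child in links.get(uri, []):
            if child in seen:
                continue
            # Same-host documents use the local limit, others the global one.
            same_host = urlsplit(child).netloc == base_host
            limit = local_depth if same_host else global_depth
            if level + 1 <= limit:
                seen.add(child)
                queue.append((child, level + 1))
    return seen


links = {
    "http://a.example/": ["http://a.example/c1", "http://b.example/x"],
    "http://a.example/c1": ["http://a.example/g1", "http://b.example/y"],
    "http://a.example/g1": ["http://a.example/gg1"],
}
result = crawl("http://a.example/", links, local_depth=3, global_depth=2)
print(sorted(result))
```

With a local depth of 3 and a global depth of 2, the same-server grand-child `g1` is retrieved, while the off-server grand-child `y` and the great grand-child `gg1` are not.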
A depth specified inside a package overrides values specified in depth elements that are children of mirror.
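One way to read the override rules is as a lookup table keyed by content type. The Python sketch below is an interpretation, not WebMirror's code. It assumes that a package-level depth replaces the mirror-level depth with the same content-type (or the same default), and that a content-type match is preferred over the default. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CONFIG = r"""<mirror root="C:\Mirror" recheck="3">
  <depth local="2" global="1"/>
  <depth content-type="text/html" local="2" global="2"/>
  <depth content-type="text/css" local="3" global="2"/>
  <package name="xml" title="XML Core" recheck="1">
    <depth content-type="text/css" local="3" global="3"/>
  </package>
</mirror>"""


def depth_table(element):
    """Map content-type (None for the default) to (local, global)."""
    table = {}
    for depth in element.findall("depth"):  # direct children only
        key = depth.get("content-type")
        table[key] = (int(depth.get("local")), int(depth.get("global")))
    return table


def effective_depth(mirror, package, content_type):
    """Resolve (local, global) for one content type within a package."""
    merged = depth_table(mirror)
    merged.update(depth_table(package))  # package entries win (assumption)
    return merged.get(content_type, merged.get(None))


mirror = ET.fromstring(CONFIG)
package = mirror.find("package")
print(effective_depth(mirror, package, "text/css"))   # (3, 3)
print(effective_depth(mirror, package, "text/html"))  # (2, 2)
print(effective_depth(mirror, package, "image/png"))  # (2, 1)
```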
The authorization element is used to retrieve documents that are password protected (using the basic authentication scheme). This has nothing to do with secure transmissions over the network.
If the web server requests authorization to retrieve a document, WebMirror searches the authorization elements for the first one, if any, whose uri matches the beginning of the URI of the document being requested. If it finds one, it attempts to retrieve the document using the user and password settings specified.
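The prefix match and the basic authentication header can be sketched in Python as follows. This is an illustration, not WebMirror's code; the function names and the tuple representation of authorization elements are assumptions. The header format is the standard one for HTTP basic authentication: the credentials are Base64-encoded, not encrypted, which is why this scheme has nothing to do with secure transmission.

```python
import base64


def find_credentials(authorizations, uri):
    """Return (user, password) from the first entry whose URI prefix matches."""
    for prefix, user, password in authorizations:
        if uri.startswith(prefix):
            return user, password
    return None


def basic_auth_header(user, password):
    """Build the value of an HTTP Authorization header (basic scheme)."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# One entry per authorization element: (uri, user, password).
authorizations = [("http://www.w3.org/", "SomeUser", "xxx")]

creds = find_credentials(authorizations, "http://www.w3.org/Member/doc.html")
if creds is not None:
    print(basic_auth_header(*creds))
```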
A package identifies a base URI (or a set of base URIs) to retrieve.
The name and title attributes identify the package.
The recheck attribute controls how often WebMirror checks for updates. If a resource was last checked within “recheck” days, it is assumed to be up-to-date. The recheck value on a package overrides the default recheck interval specified on mirror.
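The recheck rule amounts to a date comparison. The Python sketch below is illustrative only and is not taken from WebMirror. The function names are hypothetical, and treating "within" as strictly less than the interval is an assumption.

```python
from datetime import datetime, timedelta


def effective_recheck(mirror_recheck, package_recheck=None):
    """A recheck value on the package overrides the default on mirror."""
    return mirror_recheck if package_recheck is None else package_recheck


def is_up_to_date(last_checked, recheck_days, now):
    """True if the resource was checked less than recheck_days ago."""
    return now - last_checked < timedelta(days=recheck_days)


now = datetime(2000, 6, 10)
last_checked = datetime(2000, 6, 8)  # checked two days earlier

# Default interval of 3 days: still considered current.
print(is_up_to_date(last_checked, effective_recheck(3), now))      # True
# Package interval of 1 day: due for another check.
print(is_up_to_date(last_checked, effective_recheck(3, 1), now))   # False
```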
WebMirror version 1.0 requires a recent Perl 5 and the following modules, available from CPAN: XML::DOM (which requires XML::Parser), URI, and LWP::UserAgent.
At this point, I haven't arranged a proper Makefile.PL for WebMirror. Simply unpack the distribution somewhere and run the tools with Perl.
Copyright © 2000 Norman Walsh
Ironically, assertion of copyright is done to make it easier to distribute this application. (At least one organization, Software in the Public Interest, requires an explicit copyright statement in order to redistribute the software.)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Except as contained in this notice, the names of individuals credited with contribution to this software shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from the individuals in question.
Any application derived from this Software that is publically distributed will be identified with a different name and the version strings in any derived Software will be changed so that no possibility of confusion between the derived package and this Software will exist.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL NORMAN WALSH OR ANY OTHER CONTRIBUTOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
This program is maintained by Norman Walsh, <ndw@nwalsh.com>.
The best way to reach Norm is by email. You will find additional contact information at http://nwalsh.com/~ndw/.