source code

NetarchiveSuite
NetarchiveSuite is the complete software package developed and used by netarchive.dk. It is released as open source under the LGPL. You can read more and download the software here.


The source code below was developed by Netarkivet.dk and is available for free download under the GNU General Public License (GPL). It is no longer maintained, but feel free to ask questions.

Java ARC utilities (dk.netarkivet.ArcUtils)

This is a collection of Java classes for manipulating files in the ARC format. ARC files are created mainly by Heritrix, but can also be produced with HTTrack using a simple plugin. ARC is a flat ASCII format designed to be robust and self-descriptive.
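
To make the "flat ASCII" description concrete, here is a minimal sketch of parsing the header line that precedes each entry in a version-1 ARC file (URL, IP address, 14-digit archive date, content type, payload length, separated by single spaces). The class name ArcRecordHeader is hypothetical and is not part of ArcUtils. The sketch also assumes that no field contains a space, which real-world files do not always guarantee.

```java
/** Hypothetical holder for one ARC version-1 URL-record header line (not an ArcUtils class). */
final class ArcRecordHeader {
    final String url;
    final String ipAddress;
    final String archiveDate;  // 14-digit timestamp, YYYYMMDDhhmmss
    final String contentType;
    final long length;         // number of payload bytes following the header line

    ArcRecordHeader(String url, String ipAddress, String archiveDate,
                    String contentType, long length) {
        this.url = url;
        this.ipAddress = ipAddress;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    /** Parses one header line; throws IllegalArgumentException if it is malformed. */
    static ArcRecordHeader parse(String line) {
        // Assumption: fields are separated by single spaces and contain none themselves.
        String[] f = line.trim().split(" ");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + f.length);
        }
        try {
            return new ArcRecordHeader(f[0], f[1], f[2], f[3], Long.parseLong(f[4]));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("length is not a number: " + f[4], e);
        }
    }

    public static void main(String[] args) {
        ArcRecordHeader h = parse(
            "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202");
        System.out.println(h.url + " archived at " + h.archiveDate + ", " + h.length + " bytes");
    }
}
```

Because the header states the payload length, a reader can skip from one entry to the next without interpreting the payload itself.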

The tarball contains the following classes:

ARCFileOutput.java
A class to create new ARC files.
ARCInputStream.java
An InputStream that reads single entries from ARC files.
BinSearch.java
A simple implementation of a binary search command-line tool that can find entries in a .cdx file.
ExtractCDX.java
A tool to extract .cdx files from ARC or .dat files. Note that it may have problems with compressed ARC files.
GetPage.java
A command-line tool that gets single entries out of ARC files.
GetPage2.java
A version of GetPage.java that emulates the Alexa tool av_getpage.
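
The kind of lookup BinSearch performs can be illustrated with a small sketch. This is not the tool's actual code. It assumes the index lines are sorted lexicographically by their first space-separated field (taken here to be the URL) and searches an in-memory list rather than seeking in a file. The class name CdxLookupSketch and the line layout used in the example are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a binary search over sorted index lines (not the BinSearch class). */
final class CdxLookupSketch {

    /** Returns the first line whose key equals the given key, or null if there is none. */
    static String find(List<String> sortedLines, String key) {
        int lo = 0;
        int hi = sortedLines.size();           // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keyOf(sortedLines.get(mid)).compareTo(key) < 0) {
                lo = mid + 1;                  // key, if present, lies to the right of mid
            } else {
                hi = mid;                      // mid may be the first match; keep it in range
            }
        }
        if (lo < sortedLines.size() && keyOf(sortedLines.get(lo)).equals(key)) {
            return sortedLines.get(lo);
        }
        return null;
    }

    /** Assumption: the lookup key is everything before the first space. */
    static String keyOf(String line) {
        int sp = line.indexOf(' ');
        return sp < 0 ? line : line.substring(0, sp);
    }

    public static void main(String[] args) {
        // Invented line layout: URL, timestamp, ARC file name, offset.
        List<String> index = Arrays.asList(
            "http://a.example/ 20030101000000 a.arc 100",
            "http://b.example/ 20030101000000 a.arc 900",
            "http://b.example/ 20040101000000 b.arc 50",
            "http://c.example/ 20030101000000 a.arc 1500");
        System.out.println(find(index, "http://b.example/"));
    }
}
```

Narrowing to the leftmost match means that, when a URL was archived several times, the earliest entry in sort order is returned and later captures follow directly after it.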

The only documentation available so far is the embedded JavaDoc comments.

Download the newest version:
JavaArcUtils version 0.3


ProxyViewer (dk.netarkivet.proxyviewer)

This is a Java application for browsing ARC format files. It operates as a standard web proxy, but instead of accessing the live web, it serves the archived files to any proxy-enabled browser. This gives a convincing illusion of browsing the web as it looked at the time of archiving.
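
Since the viewer behaves as a standard web proxy, any HTTP client that supports proxies can be pointed at it, not only a browser. The sketch below shows this with the standard java.net classes. The host localhost and port 8090 are placeholders rather than ProxyViewer defaults, and the class name ArchiveProxyClientSketch is hypothetical.

```java
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

/** Hypothetical client that sends requests through an archive proxy. */
final class ArchiveProxyClientSketch {

    /** Describes an HTTP proxy at the given address without resolving the host name yet. */
    static Proxy archiveProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a proxy is listening on localhost:8090 (placeholder values).
        Proxy proxy = archiveProxy("localhost", 8090);

        // The URL is written as on the live web; the proxy decides what is served.
        URL url = new URL("http://example.org/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

A browser is configured the same way in principle: set its HTTP proxy to the host and port where the viewer is running.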

ProxyViewer requires Java version 1.4 (compile with '-source 1.4') and the following extra packages:

  • dk.netarkivet.ArcUtils
  • org.mortbay.jetty
  • javax.servlet

The tarball contains installation instructions, a diagrammatic overview, and the following (JavaDoc-enabled) classes:

ARCArchiveAccess.java
Interface to the ArcUtils package.
ArchivesInstruction.java
Server commands for setting up the archive to access.
CDXEntry.java
'struct' class for entries in a CDX file.
DefaultInstruction.java
Server command for when nothing else matches.
GetMetaDataInstruction.java
Server command to get a history of metadata.
HttpHandler.java
Interface to the Jetty proxy server.
Instruction.java
Abstract superclass for server commands.
MissingURLLogger.java
Class that logs requests for URLs not found in the archive, so that the archive can be improved through browsing.
ProxyLauncher.java
Application main class, handles command line arguments.
ProxyServer.java
Core class, ties the rest together.
Response.java
Wrapper around Jetty responses; collects response info (response code, headers, body).
ServerInstruction.java
Server commands to set server behaviour.
SessionData.java
Contains information about a session.
SessionHandler.java
Handles switching of sessions.
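
The list above mentions an abstract superclass for server commands and a default command used when nothing else matches. One common way to arrange such classes is a dispatcher that tries each registered command in turn and falls back to a default. The sketch below illustrates that general pattern only. It is not taken from the ProxyViewer source, and every name in it (CommandSketch, DispatcherSketch, matches, execute) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class for commands; not the Instruction class from the tarball. */
abstract class CommandSketch {
    /** Returns true if this command wants to handle the request. */
    abstract boolean matches(String request);

    /** Handles the request and returns a textual result. */
    abstract String execute(String request);
}

/** Tries registered commands in order and uses a fallback when none matches. */
final class DispatcherSketch {
    private final List<CommandSketch> commands = new ArrayList<CommandSketch>();
    private final CommandSketch fallback;

    DispatcherSketch(CommandSketch fallback) {
        this.fallback = fallback;
    }

    void register(CommandSketch command) {
        commands.add(command);
    }

    String dispatch(String request) {
        for (CommandSketch command : commands) {
            if (command.matches(request)) {
                return command.execute(request);
            }
        }
        // No registered command claimed the request; the fallback always handles it.
        return fallback.execute(request);
    }
}
```

Keeping the fallback separate from the registered list guarantees that every request gets some response, regardless of the order in which commands are registered.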

Download the current version:
ProxyViewer version 0.1