source code
NetarchiveSuite
NetarchiveSuite is the complete software package developed and used by netarchive.dk. It is released as open source under the LGPL. You can read much more and download the software here.
The following source code has been developed by Netarkivet.dk and is available for free download under the GNU Public License.
The code distributed here is no longer maintained, but feel free to ask questions.
JavaArcUtils is a collection of Java classes for manipulating ARC format files. ARC files are created mainly by Heritrix, but can also be made with HTTrack using a simple plugin. ARC is a flat ASCII format designed to be robust and self-descriptive.
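To illustrate the flat layout, here is a minimal sketch of parsing the one-line header that precedes each record in an ARC version 1 file: five space-separated fields (URL, IP address, 14-digit archive date, content type, body length in bytes). The class name ArcHeaderSketch is hypothetical and is not part of the tarball. The sketch assumes the URL itself contains no spaces.

```java
/** Hypothetical holder for the fields of an ARC version 1 record header line. */
final class ArcHeaderSketch {
    final String url;
    final String ip;
    final String archiveDate;  // 14-digit timestamp, YYYYMMDDhhmmss
    final String contentType;
    final long length;         // number of bytes in the record body

    private ArcHeaderSketch(String url, String ip, String archiveDate,
                            String contentType, long length) {
        this.url = url;
        this.ip = ip;
        this.archiveDate = archiveDate;
        this.contentType = contentType;
        this.length = length;
    }

    /** Parses a header line; assumes exactly five space-separated fields. */
    static ArcHeaderSketch parse(String line) {
        String[] fields = line.trim().split(" ");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 fields, found " + fields.length);
        }
        return new ArcHeaderSketch(fields[0], fields[1], fields[2],
                                   fields[3], Long.parseLong(fields[4]));
    }
}
```

The length field tells a reader how many bytes to consume before the next header, which is what lets a tool jump from record to record without interpreting the content.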
The tarball contains the following classes:
- ARCFileOutput.java
- A class to create new ARC files.
- ARCInputStream.java
- An InputStream that reads single entries from ARC files.
- BinSearch.java
- A simple implementation of a binary search command-line tool that can find entries in a .cdx file.
- ExtractCDX.java
- A tool to extract .cdx files from ARC or .dat files. Note that it may have problems with compressed ARC files.
- GetPage.java
- A command-line tool that gets single entries out of ARC files.
- GetPage2.java
- A version of GetPage.java that emulates the Alexa tool av_getpage.
The only documentation available so far is the embedded JavaDoc comments.
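The idea behind a binary search over a .cdx index can be sketched as follows. This is not the code of BinSearch.java. It is a simplified in-memory version that assumes each index line starts with the URL as its first space-separated field and that the lines are sorted by that field in String.compareTo order. The names CdxSearchSketch, keyOf and findFirst are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch of a lower-bound binary search over sorted CDX lines. */
final class CdxSearchSketch {

    /** Returns the first space-separated field of a line (assumed to be the URL). */
    static String keyOf(String line) {
        int space = line.indexOf(' ');
        return space < 0 ? line : line.substring(0, space);
    }

    /**
     * Returns the index of the first line whose key equals url,
     * or -1 if no such line exists. The list must be sorted by key.
     */
    static int findFirst(List<String> sortedLines, String url) {
        int lo = 0;
        int hi = sortedLines.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keyOf(sortedLines.get(mid)).compareTo(url) < 0) {
                lo = mid + 1;   // key at mid sorts before url: search the upper half
            } else {
                hi = mid;       // key at mid is url or later: keep mid in range
            }
        }
        boolean found = lo < sortedLines.size()
                && keyOf(sortedLines.get(lo)).equals(url);
        return found ? lo : -1;
    }
}
```

Returning the first match matters because an archive may hold several captures of the same URL. The remaining captures, if any, follow directly after the returned index.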
Download the newest version:
JavaArcUtils version 0.3
ProxyViewer is a Java application for browsing ARC format files. It operates as a standard web proxy, but instead of accessing the real web, it lets the user browse the archived files using any proxy-enabled browser. This gives a convincing illusion of browsing the web as it looked when it was archived.
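Because the application behaves as a standard HTTP proxy, any HTTP client that supports proxies should be able to use it, not only a browser. The following sketch shows a Java client routed through such a proxy using the standard java.net classes. The host and port passed in are assumptions (use whatever address the proxy was started on), and the class name ProxyClientSketch is hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

/** Hypothetical client that sends its HTTP requests through an archive proxy. */
final class ProxyClientSketch {

    /** Builds an HTTP proxy descriptor for the given host and port. */
    static Proxy archiveProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    /** Requests the URL through the proxy and returns the HTTP status code. */
    static int fetchStatus(String url, Proxy proxy) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

For example, fetchStatus("http://www.example.org/", archiveProxy("localhost", 8090)) would ask a proxy assumed to be listening on local port 8090 for the page, rather than contacting the live site.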
ProxyViewer requires Java version 1.4 (compile with '-source 1.4') and the following extra packages:
- dk.netarkivet.ArcUtils
- org.mortbay.jetty
- javax.servlet
The tarball contains installation instructions, a diagrammatic overview, and the following (JavaDoc-enabled) classes:
- ARCArchiveAccess.java
- Interface to the ArcUtils package.
- ArchivesInstruction.java
- Server commands for setting up the archive to access.
- CDXEntry.java
- 'struct' class for entries in a CDX file.
- DefaultInstruction.java
- Server command for when nothing else matches.
- GetMetaDataInstruction.java
- Server command to get a history of metadata.
- HttpHandler.java
- Interface to the Jetty proxy server.
- Instruction.java
- Abstract superclass for server commands.
- MissingURLLogger.java
- Class that logs requests for URLs not found in the archive, so that the archive can be improved based on what users browse.
- ProxyLauncher.java
- Application main class, handles command line arguments.
- ProxyServer.java
- Core class, ties the rest together.
- Response.java
- Wrapper around Jetty responses; collects response info (response code, headers, body).
- ServerInstruction.java
- Server commands to set server behaviour.
- SessionData.java
- Contains information about a session.
- SessionHandler.java
- Handles switching of sessions.
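The class list suggests a command-dispatch design: an abstract superclass for server commands, several concrete commands, and a default command for when nothing else matches. How the actual classes are wired together is documented only in the tarball, so the following is a generic sketch of that pattern and not ProxyViewer's implementation. All names (InstructionSketch, DispatchSketch, matches, execute) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstract superclass for server commands. */
abstract class InstructionSketch {
    /** Returns true if this command should handle the request. */
    abstract boolean matches(String request);

    /** Handles the request and returns a textual result. */
    abstract String execute(String request);
}

/** Hypothetical dispatcher: first matching command wins, else the fallback. */
final class DispatchSketch {
    private final List<InstructionSketch> instructions = new ArrayList<>();
    private final InstructionSketch fallback;

    DispatchSketch(InstructionSketch fallback) {
        this.fallback = fallback;
    }

    /** Registers a command; commands are tried in registration order. */
    void register(InstructionSketch instruction) {
        instructions.add(instruction);
    }

    String dispatch(String request) {
        for (InstructionSketch instruction : instructions) {
            if (instruction.matches(request)) {
                return instruction.execute(request);
            }
        }
        // Nothing matched: hand the request to the default command.
        return fallback.execute(request);
    }
}
```

Keeping the default command outside the list means a request always gets a response, regardless of the order in which the other commands are registered.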