FAQ

1. Is internet archiving covered by Danish statutory law?

2. Who administers the law?

3. Who has access to the archive?

4. Why can the general public not access the archive?

5. How many resources do you spend on web archiving?

6. What are your strategies for harvesting the internet?

7. Do you use a list for bulk harvesting? How do you generate the list?

8. Do you respect robots.txt?

9. What major problems have you encountered when harvesting the internet?

10. How often do you harvest the internet?

11. Do you harvest foreign-based websites with Danish contents?

12. Do you harvest websites with contents not related to Denmark but with a “.dk” domain name?


1. Is internet archiving covered by Danish statutory law?

Yes. “Act no. 1439 of December 22, 2004 on Legal Deposit of Published Material” came into force on July 1, 2005. Part 3 of the act concerns “Materials published in electronic communications networks” and allows legal deposit institutions to harvest websites within the Top Level Domain “.dk” as well as websites on other domains aimed at a Danish audience. Read the English translation of the law.


2. Who administers the law?

The Royal Danish Library administers the law and has established “Netarkivet” (Netarchive.dk) to carry out the task of web archiving. Netarchive is governed by a Steering Committee of eight members representing expertise in digital preservation, IT, legal deposit and national collection building. Netarchive has a daily manager who reports to the Steering Committee.


3. Who has access to the archive?

Access to the archive is limited to researchers who hold a Ph.D. or are doctoral candidates. However, website owners can get access to their own material under certain circumstances.


4. Why can the general public not access the archive?

The Danish Data Protection Agency has decided that the entire archive is covered by the Act on Processing of Personal Data (based on the EU directive on personal data). Since the collected material inevitably contains sensitive personal data, the collection cannot be opened to the general public, even though Netarchive.dk only collects published data. The challenge is how to identify sensitive personal data and keep it from public view while permitting access to the rest of the archive. This problem is still unsolved.


5. How many resources do you spend on web archiving?

The staff of Netarchive.dk amounts to 4.5 full-time equivalents. This time is shared by 20 employees: IT engineers, computer scientists and web curators.


6. What are your strategies for harvesting the internet?

In order to capture the Danish part of the internet three types of strategies are used:

  • A general “snapshot” harvest of the entire “.dk” domain is done four times a year
  • Selective but frequent harvests of dynamic sites
  • Harvests of selected websites in connection with specific events.

Snapshot harvesting (bulk harvesting):

Bulk harvesting is done to get a complete picture of the Top Level Domain “.dk”. A harvest is started by loading a list of domains to be harvested, supplied by the administrator of the Top Level Domain “.dk”. A list of non-.dk URLs with content aimed at a Danish audience is added to the list.
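As an illustration, here is a minimal Python sketch of how such a seed list could be assembled. The file names and the exact formats are hypothetical; they stand in for the registrar-supplied domain list and the curated list of non-.dk URLs.

  # Minimal sketch (not Netarchive's actual tooling): merge the registrar-supplied
  # ".dk" domain list with a curated list of non-.dk URLs into one seed list for
  # a bulk harvest. File names are hypothetical.

  def load_lines(path):
      """Read one entry per line, skipping blanks and comments."""
      with open(path, encoding="utf-8") as f:
          return [line.strip() for line in f if line.strip() and not line.startswith("#")]

  dk_domains = load_lines("dk-domains.txt")       # e.g. "example.dk"
  non_dk_urls = load_lines("non-dk-seeds.txt")    # e.g. "http://example.org/danish/"

  # Domains become default seed URLs; curated non-.dk URLs are added as-is.
  seeds = ["http://" + d + "/" for d in dk_domains] + non_dk_urls

  with open("seeds.txt", "w", encoding="utf-8") as out:
      out.write("\n".join(sorted(set(seeds))) + "\n")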

Selective harvesting:

Selective harvesting is done in order to collect websites that are updated frequently and whose contents would thus be missed by the bulk harvests. These include:

  1. News sites (national and regional media)
  2. Characteristic dynamic and heavily used websites representing the civic society, the commercial sector and public authorities, and
  3. Experimental and/or unique sites, documenting new ways of using the internet (e.g. net art).

Event harvesting:

Event harvesting is done in order to collect webpages from new websites which are dedicated to a specific event and which are expected to disappear when the event is over.

An event qualifies for event harvesting if it:

  1. Creates a debate among the population and is expected to be of importance to Danish history or to have an impact on the development of Danish society,
  2. Causes the appearance of new websites devoted to the event, and
  3. Is dealt with extensively on existing websites.

7. Do you use a list for bulk harvesting? How do you generate the list?

According to the law, Legal Deposit of Published Material, the person/legal body administering internet domains that are specifically assigned to Denmark must upon demand deliver a copy of the list of these domains to the legal deposit institutions as well as information about the registrars. As a result of this we regularly receive a list of all “.dk” domains from DK Hostmaster, who administers Top Level Domain “.dk”.


8. Do you respect robots.txt?

No, we do not. When we collect the Danish part of the internet we ignore the so-called robots.txt directives. Studies from 2003-2004 showed that many of the truly important websites (e.g. news media, political parties) had very stringent robots.txt directives. If we followed these directives, very little or nothing at all would be archived from those websites. Therefore, ignoring robots.txt is explicitly mentioned in the commentary to the law as being necessary in order to collect all relevant material.
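To illustrate why obeying robots.txt would be fatal to coverage, here is a small Python sketch using the standard robotparser module. The site and the rules are hypothetical, but a blanket “Disallow: /” is typical of the stringent directives mentioned above; ignoring robots.txt simply means not performing this check before fetching a URL.

  # Illustrative sketch only: how a stringent robots.txt would exclude almost all
  # content if it were honoured. The site and user-agent names are hypothetical.
  from urllib.robotparser import RobotFileParser

  rules = [
      "User-agent: *",
      "Disallow: /",   # many important sites blocked all crawlers like this
  ]

  rp = RobotFileParser()
  rp.parse(rules)

  url = "http://news.example.dk/articles/front-page.html"
  print(rp.can_fetch("legal-deposit-crawler", url))   # False: nothing could be archived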


9. What major problems have you encountered when harvesting the internet?

Crawler traps

As our harvester crawls the internet by following links, it inevitably falls into crawler traps such as calendars, which have an unending number of links. We discover many of these traps and mark them either at the domain on which they were found or at a global level, enabling the harvester to avoid them on other websites containing the same trap.
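The sketch below shows one way such exclusion rules can be expressed, assuming regex-based filters registered either globally or per domain. The patterns and domain names are illustrative only, not Netarchive's actual configuration.

  # A minimal sketch of trap filtering: URLs matching a global pattern are skipped
  # everywhere, while domain-level patterns only apply to the domain they were found on.
  import re
  from urllib.parse import urlparse

  GLOBAL_TRAPS = [
      re.compile(r"/calendar/\d{4}/\d{2}/"),          # endless calendar pages
  ]
  DOMAIN_TRAPS = {
      "forum.example.dk": [re.compile(r"[?&]sort=")], # infinitely re-sorted listings
  }

  def is_trap(url):
      host = urlparse(url).hostname or ""
      patterns = GLOBAL_TRAPS + DOMAIN_TRAPS.get(host, [])
      return any(p.search(url) for p in patterns)

  print(is_trap("http://any.example.dk/calendar/2099/01/"))    # True, skipped everywhere
  print(is_trap("http://forum.example.dk/threads?sort=date"))  # True, skipped on that domain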

Log-in sites

Currently, we only harvest websites protected by log-in in connection with selective harvests. For bulk harvests, we still need tools to harvest websites with log-in automatically. Our research has revealed that about 16% of websites protected by log-in contain material subject to legal deposit.

Overloading web servers

Small private web servers cannot handle the speed of the crawler when we do a snapshot crawl. Fortunately, this problem is decreasing.

Complaints that we do not obey robots.txt and that we use JavaScript extraction

Our test harvests have shown that we do not harvest all relevant contents if we obey robots.txt or avoid using aggressive JavaScript extraction.
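The following sketch illustrates what aggressive JavaScript extraction means in practice: scanning script text for URL-like strings that an HTML-only link extractor would miss. The regular expression and the example script are illustrative; the crawler's actual extraction is more elaborate.

  # Rough sketch of speculative link extraction from JavaScript. The script text
  # and regex are hypothetical examples.
  import re

  script = """
  var menu = {home: '/index.html', archive: "/arkiv/2009/"};
  loadSection("http://media.example.dk/player/clip.xml");
  """

  URL_LIKE = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")

  candidates = URL_LIKE.findall(script)
  print(candidates)
  # ['/index.html', '/arkiv/2009/', 'http://media.example.dk/player/clip.xml']
  # Many candidates are false positives, which is one reason site owners complain.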

Harvesting rich media

Harvesting video, audio or other rich media is often difficult. Unfortunately, this problem has been increasing in the last few years.


10. How often do you harvest the internet?

Snapshot / Bulk harvests:

We carry out snapshot harvests four times a year. Some large websites, though, are only harvested twice a year (for technical reasons).

Selective harvests:

We typically harvest a given website between six times a day and once a month, depending on the update frequency of the individual website.
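As a simple illustration, the sketch below models a per-website schedule covering that range. The site names and intervals are hypothetical examples, not our actual configuration.

  # Sketch of per-site harvest scheduling, assuming a simple interval table.
  from datetime import datetime, timedelta

  SCHEDULE = {
      "news.example.dk":     timedelta(hours=4),   # roughly six harvests per day
      "ministry.example.dk": timedelta(days=7),
      "netart.example.dk":   timedelta(days=30),   # roughly once a month
  }

  def next_harvest(site, last_run):
      """Return the next scheduled harvest time for a site."""
      return last_run + SCHEDULE[site]

  print(next_harvest("news.example.dk", datetime(2009, 11, 1, 8, 0)))
  # 2009-11-01 12:00:00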

Event harvests:

The average annual number of event harvests is three. Some types of events such as parliamentary elections are predictable; others (e.g. the swine flu epidemic) are not.


11. Do you harvest foreign-based websites with Danish contents?

Yes. We have identified some 45,000 websites outside the “.dk” domain which are subject to the legal deposit law because they are aimed at a Danish audience or because the contents and/or the website owner is Danish.


12. Do you harvest websites with contents not related to Denmark but with a “.dk” domain name?

Yes. All websites with a “.dk” domain name that are subject to legal deposit are harvested, unless we discover that a given website is neither Danish nor aimed at a Danish audience.