FAQ

Answers to frequently asked questions

1. Do you have a Statutory Act in place for Internet archiving?
2. Who is administering the act?
3. Are there any plans for the act to be revised?
4. Who has access to the archives?
5. Are there any difficulties in granting broader access to the archives?
6. How many resources do you spend on web archiving?
7. What are your strategies for harvesting the internet?
8. Do you use an inventory list for Whole National Domain crawl? How do you derive the list?
9. Do you respect robots.txt?
10. What major problems have you encountered when harvesting the internet?
11. How often do you harvest?
12. Do you crawl foreign-based websites with local content?
13. Do you crawl foreign content residing in local domain?

1. Do you have a Statutory Act in place for Internet archiving?

Yes. “Act no. 1439 of December 22, 2004 on Legal Deposit of Published Material” entered into force on July 1st, 2005. Part 3 of the act concerns “Materials published in electronic communications networks” and allows legal deposit institutions to harvest websites within Top Level Domain .dk and websites on other domains aimed at a Danish audience. An English version of the law is available at http://www.kb.dk/en/kb/service/pligtaflevering-ISSN/lov.html. An amendment, postponing a revision of the act to 2011, was passed on February 20th, 2008.

2. Who is administering the act?

The task of administering the act is shared by the Royal Library and the State and University Library. The two libraries have established a virtual institution, “Netarkivet.dk” (Netarchive.dk), to perform the task of web archiving. Netarkivet.dk is governed by a Steering Committee of 6 members (3 from each library) representing expertise in digital preservation, IT, legal deposit and national collection building. Netarkivet.dk has a daily manager who reports to the Steering Committee.

3. Are there any plans for the act to be revised?

The act and the accompanying circular are up for revision during the parliamentary session 2010-2011. Revisions will focus on providing broader access to the archives. Regulations concerning scope and collection policies are considered flexible enough to accommodate technological development in the foreseeable future.

4. Who has access to the archives?

Access is limited to researchers who hold a Ph.D. or are doctoral candidates.

5. Are there any difficulties in granting broader access to the archives?

Yes. The Danish Data Protection Agency has decided that the entire archive is covered by the Act on Processing of Personal Data (based on the EU directive on personal data). As the collected data will include sensitive data, the collection as a whole cannot be opened to the general public (even though Netarchive.dk only collects published data). The challenge is to identify sensitive personal data and prevent the general public from seeing it while permitting access to the other data. So far (December 2010) no easy solution has been found.

6. How many resources do you spend on web archiving?

Netarchive.dk's staffing amounts to 4.5 full-time staff years, shared among 20 employees: IT engineers, computer scientists, librarians and library assistants.

7. What are your strategies for harvesting the internet?

Three strategies are employed to capture the Internet: a general snapshot of the entire <.dk> domain four times a year, selective but more frequent harvests of dynamic sites, and irregular harvests of selected websites in connection with events.

Bulk (cross-sectional/snapshot) harvesting: Bulk harvesting is done to get a complete picture of the Top Level Domain .dk. A harvest begins by loading a list of domains to be harvested, supplied by the Administrator of Top Level Domain .dk. A list of URLs on other domains aimed at a Danish audience is then added to this list.
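The way a bulk harvest list is assembled, as described above, can be sketched roughly as follows. This is only an illustration and not Netarchive.dk's actual tooling: the function name `build_seed_list`, the choice of `http://` as the start scheme and the sample inputs are all hypothetical.

```python
def build_seed_list(dk_domains, extra_urls):
    """Combine registry domains and extra URLs into one deduplicated seed list."""
    seeds = []
    seen = set()

    # Registry entries are assumed to be bare domain names; turn each into a start URL.
    for domain in dk_domains:
        name = domain.strip().lower()
        if not name:
            continue
        url = f"http://{name}/"
        if url not in seen:
            seen.add(url)
            seeds.append(url)

    # Extra entries are assumed to be full URLs on other domains
    # that are aimed at a Danish audience.
    for url in extra_urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            seeds.append(url)

    return seeds


# Hypothetical sample data; a real registry list holds over a million domains.
seeds = build_seed_list(
    ["example.dk", "Example.dk", "another-example.dk"],
    ["http://example.com/danish/"],
)
print(seeds)
```

Duplicates are dropped on first sight so that a domain listed twice, or in different letter case, is only harvested once.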
Selective harvesting: Selective harvesting is done to gather web pages that are frequently updated and would be missed by the bulk harvests, such as (1) news sites (national and regional media), (2) “typical” dynamic and heavily used sites representing civil society, the commercial sector and public authorities, and (3) experimental and/or unique sites documenting new ways of using the web (e.g. net art).

Event harvesting: Event harvesting is done to collect web pages from new sites that are dedicated to a single event and are expected to disappear when the event is over. An event is defined as something that (1) creates a debate among the population and is expected to be of importance to Danish history or have an impact on the development of Danish society, (2) causes the appearance of new websites devoted to the event, and (3) is dealt with extensively on existing websites.

8. Do you use an inventory list for Whole National Domain crawl? How do you derive the list?

According to the law, the person or legal body administering Internet domains that are specifically assigned to Denmark must, upon demand, deliver an electronic copy of the list of these domains to the legal deposit institution, together with information about the registrars. This means that we regularly receive a list of all .dk domains (Fall 2010: 1.3 million) from DK Hostmaster, who administers Top Level Domain .dk.

9. Do you respect robots.txt?

No. When collecting the Danish part of the Internet we ignore the so-called robots.txt directives. Studies from 2003-2004 showed that many of the truly important websites (e.g. news media, political parties) had very stringent robots.txt directives. Had these directives been followed, very little or nothing at all would have been archived from those sites. Ignoring robots.txt is therefore explicitly mentioned in the commentary to the law as being necessary to collect all relevant material.

10. What major problems have you encountered when harvesting the internet?

Crawler traps: As our harvester crawls the web by following links, it invariably falls into crawler traps such as calendars, which have an unending number of links. Many of these traps are discovered and marked, either for the domain on which they were found or at a global level, which enables the harvester to avoid them on other websites containing the same trap.

Log-in sites: Currently, we harvest sites protected by log-ins only during selective harvests. For bulk harvests, we still need tools to harvest log-in sites automatically. Our research has shown that about 16% of sites with log-in contain material subject to legal deposit.

Load on network: Small private web servers cannot handle the speed of our snapshot crawl. This problem is [...] As mentioned, our test harvests have shown that we otherwise do not get enough harvested.

Harvests of video, audio and other rich media often do not succeed. This problem has been increasing in the last few years.

11. How often do you harvest?

Bulk harvests: 4 times a year. Some large websites, though, are only collected twice (for technical reasons).

Selective harvests: Depends on the rate of update and the presence of archival functions at the site. Currently, the frequency ranges from 6 times daily to once a month.

Event harvests: Depends on the rate of update and the presence of archival functions at the targeted websites. The period of harvest is determined for each event.

12. Do you crawl foreign-based websites with local content?

Yes. We have identified some 45,000 websites outside <.dk> which are subject to legal deposit because they are aimed at a Danish audience, or because the content and/or the website owner is Danish.

13. Do you crawl foreign content residing in local domain?

Yes. All websites on <.dk> that are subject to legal deposit are collected, unless we discover that a site is neither Danish nor aimed at a Danish audience.
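As an illustration of the crawler-trap handling mentioned under question 10, where a trap can be marked either for a single domain or globally, a minimal sketch might look like the following. It is not the harvester's actual configuration: the names `GLOBAL_TRAPS`, `DOMAIN_TRAPS`, `is_trap` and `filter_links`, as well as the regular expressions themselves, are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical trap patterns; real ones would be recorded as traps are discovered.
# Global patterns apply to every website.
GLOBAL_TRAPS = [re.compile(r"/calendar/\d{4}/\d{2}")]

# Per-domain patterns apply only to the domain on which the trap was found.
DOMAIN_TRAPS = {
    "example.dk": [re.compile(r"[?&]month=\d+")],
}


def is_trap(url):
    """Return True if the URL matches a global or a per-domain trap pattern."""
    host = (urlsplit(url).hostname or "").lower()
    patterns = GLOBAL_TRAPS + DOMAIN_TRAPS.get(host, [])
    return any(pattern.search(url) for pattern in patterns)


def filter_links(urls):
    """Keep only the links that are not recognised as crawler traps."""
    return [url for url in urls if not is_trap(url)]


links = [
    "http://example.dk/",
    "http://example.dk/events?month=13",
    "http://other-example.dk/calendar/2010/12",
    "http://other-example.dk/events?month=13",
]
print(filter_links(links))
```

In this sketch the `month=` pattern only blocks links on `example.dk`, while the calendar pattern blocks matching links on any site, mirroring the distinction between domain-level and global marking.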