News from the Archive

March – May 2017

First broad crawl 2017
We ran our first domain crawl of the year from March 6 to March 26, with a limit of 10 MB per domain.
Ministries, administrative bodies and “ultra big sites” (e.g. the Danish Broadcasting Corporation’s website dr.dk) are crawled separately. This lets us check that we capture everything we want to capture.

Collecting Instagram and Twitter
We have revised our collection scope for Twitter and Instagram. We now crawl a large number of Danish Twitter profiles twice a month; the profiles of political players are collected daily. Furthermore, we collect a representative number of Instagram profiles and hashtags that reflect politics, daily life and sports in Denmark.

Further tests of BCWeb
We are testing the functionality of the French National Library’s nomination tool. Some modifications are needed to adapt the tool to our use.

NAS workshop in Vienna
At the end of April, some curators and developers from Netarchive participated in a fruitful workshop organized by the NetarchiveSuite (NAS) community. NAS is an open source tool. All partners in the community discussed and planned the further development of the tool.

January/February 2017

Social Media strategy
We analyzed a large number of social media platforms, focusing on their content. Based on the results, we will decide which social media platforms we want to collect and to what extent.

Use of BCWeb
The National Library of France, one of our partner institutions, has developed a nomination and curation tool (BCWeb). We want to find out whether this tool would be useful for us, e.g. for getting external help with the nomination of content to collect.

September/October 2016

Broad crawl
The third broad crawl of 2016 finished on 30 September. The same applies to the crawls of ministries and ultra big sites.

Event crawls
A roadmap for event crawls of parliamentary and local elections is almost in place. This will let us identify candidates rapidly and press the start button as soon as an election is called.

Selective crawls

Our new crawl strategy is in place. Find a description (in Danish) of the new organization of the selective crawls here.

We gave our social media crawls a makeover. We have revised the list of Twitter profiles and hashtags and started to crawl Danish Instagram profiles.

August 2016

Broad crawl
The third broad crawl of 2016 will be launched in week 36/37. The crawl limit per domain will be max. 100 MB. There will be special crawls for ministries and government bodies, and for ultra big sites (e.g. dr.dk).

Event crawl
The event collection for the Olympics in Rio 2016 will continue until the end of the Paralympics 2016.

Selective crawls
We are working on the configuration of the regional/local news media crawls.
We have crawled about 60 Danish Facebook profiles with Archive-It. We made a special crawl of Prime Minister Lars Løkke’s Facebook profile on 30 August 2016, the day he published his 2025 plan.

July 2016

Selective crawls
Following our new collection strategy – extension of the selective crawls and smaller broad crawls – we now collect all national Danish news media selectively.
We are investigating all local news media in order to decide the frequency and depth of future crawls.
Our harvester Heritrix 3 is not able to archive Facebook profiles. Archive-It, the commercial part of the Internet Archive, uses an API to Heritrix for crawling Facebook. We will collect about 100 representative open Facebook profiles with Archive-It; at the moment we are selecting the profiles.

June 2016

Broad crawl
The second broad crawl of 2016 (with a limit of 100 MB per domain) finished on June 28. We harvested 11.255.368.320.635 bytes (about 11.3 TB) in 242.114.319 objects.
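To put these figures in perspective, here is a small sketch in Python that converts the harvested bytes to terabytes and estimates the average object size. It is only an illustration and not part of our harvesting tools; the function and variable names are made up for this example, and the figures are written with dots as thousands separators.

```python
def parse_grouped(text: str) -> int:
    """Parse an integer written with dots as thousands separators."""
    return int(text.replace(".", ""))


# Figures reported for the second broad crawl of 2016.
bytes_harvested = parse_grouped("11.255.368.320.635")
objects_harvested = parse_grouped("242.114.319")

terabytes = bytes_harvested / 10**12   # decimal terabytes (TB)
tebibytes = bytes_harvested / 2**40    # binary tebibytes (TiB)
avg_object_bytes = bytes_harvested / objects_harvested

print(f"{terabytes:.2f} TB ({tebibytes:.2f} TiB)")
print(f"average object size: {avg_object_bytes / 1000:.1f} kB")
```

The average is only a rough indicator, since a single crawl mixes small HTML pages with much larger images and documents.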

Event crawl
We started an event collection for the Olympics in Rio 2016 on July 24. We also participate in the IIPC Olympics collection.

Selective crawls
As part of our new collection strategy we have started working with university repositories, educational and law portals.

Policies and strategies
Our dissemination policy and strategy are getting their final polish.

Collaboration agreement
A revised collaboration agreement between SB and KB on Netarchive has been signed by the directors of both institutions.

Compressed Archive
We have finalized a recommendation on the compression of the WARC files in Netarchive.

May 2016

Broad crawl
We started the second broad crawl of 2016 with a limit of 100 MB per domain.

Event crawls
We stopped the refugee crisis crawl. We did a smaller event crawl for the “Eurovision Song Contest”, where we focused on the Danish participants’ presence on Twitter and on thematic news sections. We are preparing for a crawl of the Olympics in Rio.

Selective crawls
We started the implementation of our revised collection strategy. We have almost established the new selective crawls of national news sites.

Potential collaboration project
The Parliamentary Library gives in-house access to historical (archived) versions of the political parties’ websites, but it is not quite satisfied with its solution. Netarchive and the Parliamentary Library are looking at potential future cooperation on this subject.

April 2016

Tools upgrade
We have moved our production site to NetarchiveSuite 5 and Heritrix 3.

Broad crawl
We will start the second broad crawl of 2016 as soon as NAS 5 and Heritrix 3 are running “smoothly”.

Event crawls
The event crawl on the refugee crisis is still ongoing. As it is a supplement to our selective news media and social media crawls, it is a very small event crawl.
We are preparing for a new event crawl on the European Capital of Culture project “Aarhus 2017” and are looking at different scenarios for this event crawl.

Selective crawls
We are still unable to harvest anything from Facebook.

Collection strategy
We are revising our collection strategy: there will be fewer broad crawls and more selective crawls. At the moment we are looking at the selective news media crawls. Given our resources, we need a more streamlined approach for the extended number of domains to be crawled.

Ad hoc crawls
The social platform arto.com will be closed down on June 1. We were offered a private crawl of the entire site (no WARC files, but likely WARC compatible). We decided to decline the offer and to do a last crawl of the entire site on our own.

Corpora from the archive
We are working on a business model (legal and financial issues) for delivering corpora from Netarchive to research institutions. Our first customer will be the University of Southern Denmark.

March 2016

Broad Crawl
The first broad crawl of 2016 finished on Feb. 29.

Event Crawl
The event crawl on the refugee crisis is ongoing. It is a supplement to our selective news media and social media crawls.

Selective Crawls
We are still blocked by Facebook.

Curator seminar
All curators met for 1 1/2 days in Aarhus. We prepared for the NetarchiveSuite (NAS) Heritrix 3 (H3) test and started a discussion on a new strategy for the broad crawls.

Tools upgrade
We made some comparative tests: NAS 5 H3 versus NAS 4 H1

Research
We are participating in an application for a research grant: “Real time analysis and visualization of news streams”. Netarchive will participate in the project by extracting files (Twitter) from the archive, on the condition that the project pays for this work.

February 2016

Broad crawl
Our first broad crawl is proceeding as planned.

Selective crawls
After 9 months we succeeded in getting through the paywall of one of the biggest Danish newspaper sites, politiken.dk (by means of IP validation for the harvester).

Facebook.com seems to have blocked our harvester; at the moment we do not capture anything from Facebook.

Advisory Board Meeting
We had a fruitful meeting with our advisory board: the members gave us feedback on what they thought was important cultural heritage on the Internet.

Problem solving
We have established a Jira backlog for handling problems with the selective crawls.

January 2016

Event Harvests
We stopped our event harvest on the Danish opt-out referendum.

The refugee event harvest is still ongoing. Recently we focused again on foreign media reactions to the development of the crisis in Denmark, mainly in English, German, French and the Scandinavian languages. This is not a deliberate choice, but a consequence of our language skills. So, if you find articles in any other languages, please send us a link.

Statistics
We are preparing our annual statistics to match the ISO standard.

Collection strategy
We are analyzing our collection strategy and will probably adjust it in light of upcoming new harvest methods (Heritrix 3), the budget situation and better options for giving broader access to the archive.

Broad crawl
We have started our first broad crawl for 2016; we expect it to be finished by the beginning of March.

Full text search
Our full text index passed 10.000.000.000 objects. We estimate that the index will be up to date within 125 days with the existing hardware setup.

Image Search
Web colleagues at Statsbiblioteket are experimenting with “exact digital match” within image search: when uploading an image you will find all matches in the archive.

November/December 2015

Broad crawl

The first step of our fourth broad crawl for 2015 started on 13 November. We have set filters to avoid too many false 200 or 404 response codes and login pages. According to our estimate, we will collect about 20 million URLs, or about 1 TB, in this step.

Software migration

The preparations for the migration to H3/NAS 5 are ongoing: we will reduce the number of templates and filters significantly.

Event crawls:

  1. The Danish EU justice opt-out referendum: our goal is to document how political parties, prominent politicians, relevant organizations and NGOs, and a sample of Danish citizens argue on the subject. Furthermore, we collect comments and articles from foreign media.
  2. The European refugee crisis from the Danish point of view: this event crawl is a supplement to our selective crawls, as the media’s discussion of the subject is covered by our selective crawls.

We have started identifying and registering our so-called special collections: collections older than Netarchive as well as (ongoing) separate collections with content we are unable to capture with Heritrix (e.g. YouTube videos). We have created a template/sheet for describing these collections.

October 2015

Collection policy and strategy

Our collection policy and strategy are now published at http://netarkivet.dk/om-netarkivet/ (in Danish).

Broad crawl

We finished our third broad crawl:

Name | Start time | Stop time | Bytes harvested | Objects harvested
2015-3-10GB | 01.09.2015 | 22.10.2015 | 37.762.965.450.120 | 624.564.140
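
As a rough illustration of the scale of this crawl, the sketch below (Python, not part of our harvesting tools; all names are made up for the example) derives the crawl duration and the average amount of data harvested per day from the figures above. It assumes that the start and stop times are whole dates in day.month.year format, since no clock times are given.

```python
from datetime import datetime

# Figures from the crawl summary above (dates as day.month.year).
start = datetime.strptime("01.09.2015", "%d.%m.%Y")
stop = datetime.strptime("22.10.2015", "%d.%m.%Y")
bytes_harvested = int("37.762.965.450.120".replace(".", ""))

# Assumption: the crawl ran from the start date to the stop date,
# counted in whole days because only dates are reported.
duration_days = (stop - start).days
bytes_per_day = bytes_harvested / duration_days

print(f"duration: {duration_days} days")
print(f"average: {bytes_per_day / 10**9:.0f} GB per day")
```

The daily average hides large variation: a broad crawl typically harvests much more in its first days than towards the end.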

Cultural heritage Cluster:

On October 19 the Danish e-Infrastructure Cooperation officially opened the Cultural Heritage Cluster. Netarchivet’s users now have the possibility to use a supercomputer on the data in our web archive (big data). Subsequently, Netarchivet has been invited to participate, together with internet researchers, in a research project with the working title “Humanities – Through the Lenses of Big Data // The Voices of the Humanities in a Digital World/Universe”.

Heritrix 3:

We are working on the migration to H3; at the same time we want to reduce our number of harvest templates.

Wayback full text search:

Full text search in Wayback is now officially open. We have published the manuals on netarchivet.dk and informed all current users.

RESAW:

We are participating in an application for funding from Horizon2020, in particular in the work packages Research, Access and interoperability, and Legal issues.

Event harvests:

We are still running an event crawl on the refugee crisis.

We are starting an event crawl on the Danish opt-out on Justice and Home Affairs.

Furthermore, we will run a one-week event crawl next week. That week is called “media week”: just a “normal” week in which Statsbiblioteket collects the programs of the non-commercial radio and TV stations. Netarchivet collects their web pages.

September 2015

Broad crawl: Last week we finished the first step of our third broad crawl (limit 10 MB).

Event crawls:

  • We finished the event crawl on the parliamentary elections at the end of June, but continued harvesting social media profiles connected to this event crawl.
  • Last week we started a new event crawl on the European refugee crisis, that is to say mainly social media activities connected to the Danish handling of refugees.
  • We are preparing an event crawl on the referendum on the Danish opt-out on Justice and Home Affairs.

Fulltext search:

Colleagues from the IT department at SB have experimented with image search. Some of the curators at KB/SB have seen a demonstration: it looks quite exciting.

Harvest problems:

We had problems with both our broad and selective crawls in July. For about one week we harvested almost nothing.

Access policy and strategy:

We started working on this issue.

Newspaper paywalls:

We have problems with harvesting content behind paywalls. We need IP validation access, because we do not have enough technical resources to implement other solutions. We are now focusing on one question about paywalls:

Can Netarchive offer to pay for possible expenses for the implementation of IP validation?

Curator seminar:

Held in September. Topics: How do we want Netarchive to be in 5 years, 10 years? What is realistic?

July 2015

Broad crawl

We started our second broad crawl 2015 with a limit of 10 GB per domain.

Event crawl

We are about to finish our event harvest of the parliamentary elections.

Troubleshooting

At the beginning of the month we ran into problems with harvesting dr.dk, the website of the Danish public service broadcasting corporation: they had changed their web site.

Harvesting some of the content probably made our system crash, and we are not able to collect content from dr.dk’s HTTPS pages. Upgrading to Java 7 might solve these problems.

Preparation of Full Text Search

We are still working on a best practice for hiding sensitive personal content in our archive. This is required before we can open our full text search to our users.