March – May 2017
First broad crawl 2017
We ran our first domain crawl of the year from March 6 to March 26, with a limit of 10 MB per domain.
Ministries and administrative bodies, as well as “ultra big sites” (e.g. the Danish Broadcasting Corporation’s website dr.dk), are crawled separately. This lets us verify that we capture everything we intend to capture.
Collecting Instagram and Twitter
We have revised our collection scope for Twitter and Instagram. We now crawl a large number of Danish Twitter profiles twice a month; the profiles of political players are collected daily. Furthermore, we collect a representative number of Instagram profiles and hashtags that reflect politics, daily life and sports in Denmark.
Further tests of BCWeb
We are testing the functionality of the French National Library’s nomination tool. We need some modifications in order to optimize the tool for our use.
NAS workshop in Vienna
At the end of April, some curators and developers from Netarchive participated in a fruitful workshop organized by the NetarchiveSuite (NAS) community. NAS is an open source tool. All partners in the community discussed and planned the further development of the tool.
Social Media strategy
We analyzed a large number of social media platforms, focusing on their content. Based on the results, we will decide which social media platforms we want to collect and to what extent.
Use of BCWeb
The National Library of France, one of our partner institutions, has developed a nomination curation tool (BCWeb). We want to find out whether this tool would be useful for us, e.g. for getting external help with the nomination of content to collect.
The third broad crawl 2016 finished on 30 September. The same applies to the crawls of ministries and ultra big sites.
A roadmap for event crawls of parliamentary and local elections is almost in place. Thus we will be able to identify candidates rapidly and press the start button as soon as the call for election is out.
Our new crawl strategy is in place. Find a description (in Danish) of the new organization of the selective crawls here.
We gave our social media crawls a makeover. We have revised the list of Twitter profiles and hashtags and started to crawl Danish Instagram profiles.
The third broad crawl 2016 will be launched in week 36/37. The crawl limit per domain will be max. 100 MB. There will be special crawls for ministries and government bodies, and for ultra big sites (e.g. dr.dk).
The event collection for the Olympics in Rio 2016 will go on until the end of the Paralympics 2016.
We are working on the configuration of the regional/local news media crawls.
We have crawled about 60 Danish Facebook profiles with Archive-It. We made a special crawl of Prime Minister Lars Løkke’s Facebook profile on August 30, 2016, the day he published his 2025 plan.
Following our new collection strategy – extension of the selective crawls and smaller broad crawls – we now collect all national Danish news media selectively.
We are investigating all local news media in order to decide the frequency and depth of future crawls.
Our harvester Heritrix 3 is not able to archive Facebook profiles. Archive-It, the commercial part of the Internet Archive, uses an API to Heritrix for crawling Facebook. We will collect about 100 representative open Facebook profiles with Archive-It; at the moment we are selecting the profiles.
The second broad crawl 2016 (with a limit of 100 MB per domain) finished on June 28. We harvested 11,255,368,320,635 bytes / 242,114,319 objects.
We started an event collection for the Olympics in Rio 2016 on July 24. We also participate in the IIPC Olympics collection.
As part of our new collection strategy we have started working with university repositories, educational and law portals.
Policies and strategies
Our dissemination policy and strategy are getting their final polish.
A revised collaboration agreement between SB and KB on Netarchive has been signed by the directors of both institutions.
We have finalized a recommendation on the compression of the WARC files in Netarchive.
We started the second broad crawl 2016 with a limit of 100 MB from each domain to be crawled.
We stopped the refugee crisis crawl. We did a smaller event crawl for the “Eurovision Song Contest”, where we focused on the Danish participants’ presence on Twitter and on thematic news sections. We are preparing for a crawl of the Olympics in Rio.
We started the implementation of our revised collection strategy. We have almost established the new selective crawls of national news sites.
Potential collaboration project
The Parliamentary Library gives in-house access to historical (archived) versions of the political parties’ websites. They are not quite satisfied with their solution. Netarchive and the Parliamentary Library are looking at potential future cooperation on this subject.
We will start the second broad crawl 2016 as soon as NAS 5 and Heritrix 3 are running “smoothly”.
The event crawl on the refugee crisis is still ongoing. As it is a supplement to our selective news media and social media crawls, it is a very small event crawl.
We are preparing for a new event crawl on the European Capital of Culture project “Aarhus 2017”; we are looking at different scenarios for this event crawl.
We are still unable to harvest anything from Facebook.
We are revising our collection strategy: there will be fewer broad crawls and more selective crawls. At the moment we are looking at the selective news media crawls. Given our resources, we need a more streamlined approach for the extended number of domains to be crawled.
Ad hoc crawls
The social platform arto.com will be closed down on June 1. We were offered a private crawl of the entire site (no WARC files, but likely WARC compatible). We decided to decline and to do a last crawl of the entire site on our own.
Corpora from the archive
We are working on a business model (legal and financial issues) for providing corpora from Netarchive to research institutions. Our first customer will be the University of Southern Denmark.
The first broad crawl 2016 finished on Feb. 29.
The event crawl on the refugee crisis is ongoing: It is a supplement to our selective news media and social media crawls.
We are still blocked by Facebook.
All curators met for one and a half days in Aarhus. We prepared for the NetarchiveSuite (NAS) Heritrix 3 (H3) test and started a discussion on a new strategy for the broad crawls.
We made some comparative tests: NAS 5 H3 versus NAS 4 H1.
We are participating in an application for a research grant: “Real time analysis and visualization of news streams”. Netarchive will take part in the project by extracting files (Twitter) from the archive, on the condition that the project pays for this work.
Our first broad crawl is proceeding as planned.
After nine months we succeeded in getting past the paywall of one of the biggest Danish newspaper sites, politiken.dk (IP validation for the harvester).
Facebook.com seems to have blocked our harvester; at the moment we do not capture anything from Facebook.
Advisory Board Meeting
We had a fruitful meeting with our advisory board: the members gave us feedback on what they thought was important cultural heritage on the Internet.
We have established a Jira backlog for handling problems with the selective crawls.
We stopped our event harvest on the Danish opt-out referendum.
The refugee event harvest is still ongoing. Recently we focused again on foreign media reactions to the development of the crisis in Denmark, mainly in English, German, French and the Scandinavian languages. This is not a deliberate choice, but a consequence of our language skills. So, if you find articles in any other languages, please send us a link.
We are preparing our annual statistics to match the ISO standard.
We are analyzing our collection strategy and will probably adjust it in light of upcoming harvesting methods (Heritrix 3), the budget situation and better options for giving broader access to the archive.
We have started our first broad crawl for 2016; we expect it to be finished by the beginning of March.
Full text search
Our full text index has passed 10,000,000,000 objects. We estimate that the index will be up to date within 125 days with the existing hardware setup.
Web colleagues at Statsbiblioteket are experimenting with “exact digital match” within image search: when uploading an image you will find all matches in the archive.
The first step of our fourth broad crawl for 2015 started on 13 November. We have set filters to avoid too many false 200 or 404 response codes and login pages. We estimate that we will collect about 20 million URLs, or about 1 TB, in this step.
The preparations for the migration to H3/NAS 5 are ongoing: we will reduce the number of templates and filters significantly.
- The Danish EU justice opt-out referendum: our goal is to document how political parties, prominent politicians, relevant organizations and NGOs, and a sample of Danish citizens argue on the subject. Furthermore, we collect comments and articles from foreign media.
- The European refugee crisis from the Danish point of view: this event crawl is a supplement to our selective crawls, as the media’s discussion of the subject is already covered by our selective crawls.
We started identifying and registering our so-called special collections: We have special collections older than Netarchive as well as (ongoing) separate collections with content we are unable to capture with Heritrix (e.g. YouTube videos). We created a template/sheet for the description of these collections.
Collection policy and strategy
are now published on http://netarkivet.dk/om-netarkivet/ (in Danish)
We finished our third broad crawl.
|Name||Start time||Stop time||Bytes harvested||Objects harvested|
Cultural heritage Cluster:
On October 19, the Danish e-Infrastructure Cooperation officially opened the Cultural Heritage Cluster. Netarchivet’s users can now use a supercomputer on the data in our web archive (big data). Subsequently, Netarchivet has been invited to participate, together with internet researchers, in a research project with the working title: Humanities – Through the Lenses of Dig Data//The Voices of the Humanities in a Digital world/universe.
We are working on the migration to H3; at the same time we want to reduce the number of our harvest templates.
Wayback full text search:
Is officially open now. We published the manuals on netarchivet.dk and informed all current users.
We are participating in a funding application to Horizon 2020, especially in the work packages Research, Access and interoperability, and Legal issues.
We are still running an event crawl on the refugee crisis.
We are starting an event crawl on the Danish opt-out on Justice and Home affairs.
Furthermore, we will have a one-week event crawl next week. That week is called “media week”: just a “normal” week in which Statsbiblioteket collects the programs of the non-commercial radio and TV stations. Netarchivet collects their web pages.
Broad crawl: Last week we finished the first step of our third broad crawl (limit 10 MB).
- We finished the event crawl on the parliamentary elections at the end of June, but continued harvesting social media profiles connected to this event crawl.
- Last week we started a new event crawl on the European refugee crisis, that is to say mainly social media activities connected to the Danish handling of refugees.
- We are preparing an event crawl on the referendum on the Danish opt-out on Justice and Home affairs.
Colleagues from the IT department at SB have experimented with image search. Some of the curators at KB/SB have seen a demonstration: it looks quite exciting.
We had problems with both our broad and selective crawls in July. We harvested almost nothing for about one week.
Access policy and strategy:
We started working on this issue.
We have problems with harvesting content behind paywalls. We need IP validation access, because we do not have enough technical resources to implement other solutions. We are now focusing on paywalls:
Can Netarchive offer to pay for possible expenses for the implementation of IP validation?
held in September: How do we want Netarchive to be in 5 years, 10 years? What is realistic?
We started our second broad crawl 2015 with a limit of 10 GB per domain.
We are about to finish our event harvest of the parliamentary elections.
At the beginning of the month we ran into problems with harvesting dr.dk, the website of the Danish public service broadcasting corporation: they had changed their website.
Harvesting some of the content probably made our system crash, and we are not able to collect content from dr.dk’s HTTPS pages. Upgrading to Java 7 might solve these problems.
Preparation of Full Text Search
We are still working on a best practice for hiding sensitive personal content in our archive. This is required before we can open full text search to our users.