'Web harvest' to begin

The British Library is aiming to "harvest" the entire UK web domain to document current events and record the country's burgeoning collection of online cultural and intellectual works.

Billions of web pages, blogs and e-books will now be amassed alongside the books, magazines and newspapers that have been stored for several centuries.

The library could eventually collect copies of every public Tweet or Facebook page in the British web domain.

Lucie Burgess, who is leading the project at the British Library, said the unprecedented operation would provide a complete snapshot of life in the 21st century, which increasingly plays out online.

If you want a picture of what life is like today in the UK you have to look at the web.

We have already lost a lot of material, particularly around events such as the 7/7 London bombings or the 2008 financial crisis.

That material has fallen into the digital black hole of the 21st century because we haven't been able to capture it.

Most of that material has already been lost or taken down.

The social media reaction has gone.

– Lucie Burgess, project leader

The operation to "capture the digital universe" will begin with an automatic "web harvest" of an initial 4.8 million websites - or one billion web pages - from the UK domain, she said.

This will begin tomorrow and is expected to take three months.

It will then take another two months to process the data.

Until now, the British Library could preserve only a relatively small number of websites.

The 2003 Legal Deposit Library Act paved the way for the information to be stored, but copyright laws forced the library to seek permission each time it wanted to collect web content.

Under the new regulations - which extend to the Bodleian Library in Oxford, Cambridge University Library, the National Library of Scotland, the National Library of Wales and Trinity College Library in Dublin - it has the right to receive a copy of every UK electronic publication.

The British Library, which has invested £3 million in the project during the past two years, plans to collect the material by conducting an "annual trawl" of the UK web domain.

It will "harvest" information from another 200 sites - such as online newspapers or journals - on a more regular basis.

Access to the material, including archived websites, will be offered in reading rooms at each of the legal deposit libraries.