Interview: How the British Library aims to archive a billion web pages

The British Library's Digital Universe project will harvest more than a billion web pages - saving them for future generations.

Historians of the future will delve into today's blogs and websites in the same way we now leaf through Shakespeare or Chaucer.

Even Tweets will become a vital part of our history, the British Library claims.

That's the verdict of Lucie Burgess, the British Library's Head of Content Strategy, who is working on a project to digitally archive the entire .uk web domain as well as e-journals, e-books, electronic newspapers and iPad magazines.

The Digital Universe harvest got underway on Saturday, aiming to copy and collate one billion web pages in its first year. These will come from 4.8 million websites, creating a mammoth digital representation of the virtual pages Brits turn every day.

It will eventually also include tweets and other public-facing social media. The Library's web crawler will search through any UK domain, for example .gov.uk, .org.uk, .ac.uk and, of course, .co.uk.

Burgess said: "The web is a really essential part of our culture and our heritage. It is the story of our times. We need to be able to collect websites, tweets and blogs just as we have with printed material for over 300 years. It is really important it is preserved.

"We want to be able to capture it for our great grandchildren and their great grandchildren so they have this wonderful resource to understand what life was like."




And she believes the weight of online-only material could be just as important as some of the first-edition and more abstract printed works held at the British Library, such as Emily Bronte's Wuthering Heights, Charles Darwin's Origin of Species and a Mickey Mouse annual from 1930.

Burgess added: "They will be as important in the future for researchers. If you want to understand how events today panned out, whether the 2010 election, 2008 financial crisis or reactions to the London bombings, you have to look to the web, to blogs and to social media as well as the pictures people have uploaded on their phones and tweeted, assuming they are publicly available.

"They give us a really unique view of our social history, language, culture, what we wore and how we felt. Material in the past which people have felt wasn't important has proved fascinating over time."

Burgess likens the virtual material due to be harvested to the Library's vast collection of newspapers, which document everything from advertising through the ages to how people lived and worked. But she explained that the contrast in the amount being collected through the web crawl is stark.

She said: "It has taken 300 years to collect 750 million pages of newspapers in 56,000 titles since the first printed newspaper in 1665. In a single year, we will get one billion pages through the web crawl.

"Digital content changes incredibly rapidly. Websites themselves get taken down. The average life of a webpage is just 75 days, which is why we talk about this digital black hole of the 21st century because so much of this fascinating and important material is already gone."

The Digital Universe project has been made possible by a change to the law known as Legal Deposit, which obliges publishers and distributors in the UK to send one copy of each of their publications to be preserved for posterity. In 2003 the centuries-old law was updated to include websites.

Alongside the British Library, the scheme is being undertaken with the National Library of Scotland, the National Library of Wales, Bodleian Libraries, Cambridge University Library and Trinity College Dublin.

Four copies will be held securely around the UK, with first access to the new digital archive available from January 2014.

It is expected to generate 100 Terabytes of data - or 100,000 Gigabytes - in the first year including embedded audio and video. Its reach does not extend to the likes of YouTube. Within a decade, this amount will expand to a Petabyte of information, or one thousand Terabytes.

To coincide with the launch, curators and other experts from all the participating libraries have drawn up a list of 100 websites they believe will give future generations a good guide to how life was in 2013.

They include big established names such as catalogue shopping chain Argos, the new digital-only version of the Beano comic, Amazon and Mumsnet. But the list also includes smaller, quirkier sites such as Neverseconds, a primary school pupil's blog about school meals; a site referencing the history of the Unst Bus Shelter on a remote Scottish island; and a range of special interest groups such as The Dracula Society.

The information will not be printed out in any way but instead stored for access via computers at the British Library in London, its site in Boston Spa, West Yorkshire, and also at the national libraries of Scotland and Wales. An online search facility will allow anyone to check if what they are looking for is available before visiting.

Burgess added: "This has been an enormous technological challenge. Nobody has ever archived the entire UK webspace before. It is a big technical and processing challenge."

But the announcement has again raised questions about how much consideration the wider public give to their 'digital footprint' - the nickname for the information they place into the virtual world on social networks, for example.

Anthony Mayfield, author of the book Me & My Web Shadow, believes the Digital Universe project can act as a good reminder for us all.

He said: "People are beginning to become more aware and thoughtful of what they post online. The whole world might not be listening at a particular moment but they are making everything available for the whole world to see. The British Library project is a good time to remind ourselves that this is how the web works. British Library or not, you're always on the record online."

It comes after a 17-year-old was this weekend forced to apologise for tweets she had written between the ages of 14 and 16 - before she was selected for the role of Britain's first Youth Police and Crime Commissioner.

Mayfield hopes that time and a better understanding of digital life will make the nation more forgiving of what people have said online in the past. But he believes there could already be young people watching what they say in case they become the next Prime Minister.

He added: "The average person should take note and be aware of their digital footprint. They should log out of Facebook and all their social networks and then go and look for themselves and see what they find. Then they can ask: is that something a future employer, bank manager or partner might find objectionable?

"Once it is out there, the only way to head it off or to have an insurance policy against people saying bad things about you is to manage your own web presence so people find your version of you before they find someone else's or a foolish episode from the past representing you.

"You can now expand the 'always on the record' advice to 'you are always on the historical record'. Your digital footprint will be there forever. Your grandchildren and great grandchildren could see it, if they can be bothered to look for it."