Google’s Digital Black Hole

So Google, or at least its Vice President, Vint Cerf, has now had a flash of inspiration!

It has realised that information preservation is important.

The Huffington Post reported:

Cerf told the American Association for the Advancement of Science that “if we want people in the future to be able to recreate what we are doing now, we are going to have to build the concept of preservation into the internet”.

Cerf says that a “digital vellum” must be developed which can maintain the state of hardware and software, as well as raw data, so that the web as it appears today can be experienced in decades to come.

“When you think about the quantity of documentation from our daily lives that is captured in digital form, like our interactions by email, people’s tweets, and all of the world wide web, it’s clear that we stand to lose an awful lot of our history,” he said, according to the Guardian.

This is an old story that pops up from time to time – as when the BBC's 1986 digital re-make of the Domesday Book (the BBC Domesday Project) turned out to be unreadable after 15 years.

People have for some time been discussing the probable gap in the historical record as society makes its messy 50-year transition from a paper-based world to a truly digital one (I think we are perhaps halfway through this transition, which varies in speed according to industry and culture).

Information scientists, and those at the sharp end of delivering strategic platforms to industry, have known about the preservation issue for as long as there has been IT. Strategies and standards for addressing the preservation of information are now well developed – at the physical, content and intellectual preservation levels.

As Jeff James, the Chief Executive of the UK’s National Archives, pointed out on BBC Radio 4’s Today programme (14th February 2015 – in response to this Google story), institutions such as his have a number of successful strategies for ensuring the national archives, at least, will be preserved in perpetuity.

Google naively sold the vision that content does not need the disciplines of information science – curation, preservation, indexing, etc. – because Big Search and Big Data supposedly make the old skills irrelevant. On the contrary: they are more relevant today than ever, though of course they need to be reframed in terms of digital standards and strategies.

There will be a hiatus in the historical records of businesses and institutions because they fired all those people like records managers, imagining that a word processor and a fileshare made such disciplines unnecessary, and never implemented enterprise content management (ECM) … not because we didn’t know how to standardize or migrate formats.

Those of us who work on information management in industries as diverse as healthcare, engineering and pharmaceuticals have, if anything, been on the case for even longer than our national archives.

The issue of aging storage devices has long been solved, because at least for those using ECM systems, content is silently moved to newer storage devices periodically, without even the need for human agency.
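
As an illustration (not any particular vendor's implementation), that silent migration amounts to copying each stored object to the new device and then verifying its 'fixity' – proving the copy is bit-perfect – before the old device is retired. A minimal sketch, with hypothetical store paths:

```python
import hashlib
import shutil
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 'fixity' value used to prove a copy is bit-perfect."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def migrate(old_store: Path, new_store: Path) -> None:
    """Copy every object to the newer storage device, verifying each copy."""
    new_store.mkdir(parents=True, exist_ok=True)
    for obj in old_store.iterdir():
        target = new_store / obj.name
        shutil.copy2(obj, target)
        if checksum(obj) != checksum(target):
            raise IOError(f"fixity check failed for {obj.name}")
```

A real ECM system layers scheduling, audit trails and error recovery on top, but the core obligation – move the bits, prove nothing changed – is just this.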

The issue of document formats is the next easiest to address (at least for those artefacts that are commonly regarded as documents). These may have been created in many forms:

  • Take a clinical safety report for a new drug, written using Microsoft Word 97. Since Microsoft upgrades its software at least once every two years, there is an obvious worry that the report might not be readable in 20 years’ time. Moreover, this report might form part of a huge dossier submitted to the regulatory authorities to gain approval for the drug, and it will need to remain accessible for some decades after the drug has been taken off the market – say, at least 50 years from the date of its creation. So we need to preserve the context, not just the content.
  • Alternatively, imagine we have created a design for a bridge using 3D computer-aided design (CAD) software, and then rendered it as 2D drawings (elevations and cross-sections) that engineers will use to build the bridge. These are printed on large-format sheets that the engineers and construction workers can take on site, and are also stored digitally in some format.

These are non-trivial problems, for a raft of reasons of which file format is the least difficult.

At the content level, we need to ask questions such as: “Do we ever need to revise or reuse the content?” and “What level of fidelity do viewers of the content require?”:

  • For the drug safety report, we want to make it non-revisable and non-deniable, so we could create a PDF/A (a flavour of PDF designed for archiving, which is non-revisable) and, for belts and braces, also store a rendition in a facsimile format (using the TIFF standard), because the information is quite flat. In this way we have covered all bases, and can be confident the document will still be readable in 100 or even 1,000 years. The fact that TIFF is such a basic format – just rows of coloured dots – is a weakness when it comes to reuse (e.g. editing) but a strength for long-term ‘readability’. A visitor from Alpha Centauri would have no problem understanding it, and so viewing it.
  • What about the engineering drawings? If we come back in 75 years to do a major re-work on the bridge, due to the failure of some components, would we need to get hold of the original revisable 3D CAD files? Yes we would! So how do we solve this problem? One strategy is to ensure there are industry standards for specialised formats like CAD (there are), and that these are ‘forward compatible’ (i.e. new versions of the software can read old content). Where this is problematic, we need periodically to refresh old content, bringing it forward into newer formats; this is one of the strategies that national archives use.
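
The multiple-renditions strategy above can be made concrete with a purely illustrative sketch (the class and format names are invented for the example): each archived document keeps its revisable master plus zero or more archival renditions, and we can ask whether at least one long-term-safe rendition is held.

```python
# Formats regarded here as safe for long-term archiving (illustrative list).
ARCHIVAL_FORMATS = {"pdf/a", "tiff"}

class ArchivedDocument:
    """A document held with its revisable master plus archival renditions."""

    def __init__(self, doc_id, master_format):
        self.doc_id = doc_id
        self.master_format = master_format   # e.g. "docx" or "3d-cad"
        self.renditions = {}                 # format name -> storage location

    def add_rendition(self, fmt, location):
        self.renditions[fmt.lower()] = location

    def is_preserved(self):
        """True once at least one archival-quality rendition is held."""
        return bool(ARCHIVAL_FORMATS & set(self.renditions))
```

On this model a Word master alone is not yet ‘preserved’; adding a PDF/A or TIFF rendition makes it so, while the revisable master stays available for future re-work.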

In the document world, Goldfarb and others at IBM created ‘Generalized Mark-up Language’ (GML) in the late 1960s, partly to get around the problem of divergent formats, but mainly to enable high-fidelity sharing of data/content between people and systems.

This evolved into SGML (the Standard Generalized Markup Language), and over time into XML (the eXtensible Markup Language). There are two main benefits I wish to stress here:

  • Firstly, a separation of content from presentation, meaning that the same content can be rendered for different viewing contexts or devices (routine today, as anyone who views the same content on different screens and devices can observe).
  • Secondly, we can create different ‘dialects’ of XML specialised for different industries or applications, in the media, healthcare, finance, etc.
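
The first benefit – separating content from presentation – can be sketched in a few lines: one piece of marked-up content, two renderings. The element names below are invented for illustration, not taken from any real industry dialect:

```python
import xml.etree.ElementTree as ET

# One piece of content, marked up for meaning rather than appearance.
CONTENT = """\
<safety-report>
  <title>Adverse Event Summary</title>
  <finding severity="minor">Mild headache reported in 2% of subjects.</finding>
</safety-report>
"""

def render_html(xml_text):
    """Render the content for a web browser."""
    root = ET.fromstring(xml_text)
    items = "".join(f"<li>{f.text}</li>" for f in root.findall("finding"))
    return f"<h1>{root.findtext('title')}</h1><ul>{items}</ul>"

def render_plain(xml_text):
    """Render the same content for a plain-text terminal or printout."""
    root = ET.fromstring(xml_text)
    lines = [root.findtext("title").upper()]
    lines += [f"- {f.text}" for f in root.findall("finding")]
    return "\n".join(lines)
```

The content is written once, against its meaning; each rendering decision (headings, bullets, capitalisation) lives in the renderer, so a new device or house style needs a new renderer, not new content.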

So, for example, NewsML allows many news agencies to send news to an aggregator such as Reuters, whose systems can automatically process those documents because the syntax and semantics have been standardized. This covers not only the content but also the indexing information (in modern parlance, the meta-data) used to characterize or contextualise the content – which ensures high-fidelity routing/targeting of the information to Reuters’ clients.
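
That metadata-driven routing can be sketched generically (the subject codes and client names here are invented): match each item’s standardized subject codes against clients’ standing subscriptions.

```python
def route(item_subjects, subscriptions):
    """Return clients whose standing interests overlap the item's subjects.

    item_subjects: set of standardized subject codes from the item's metadata.
    subscriptions: mapping of client name -> set of subject codes wanted.
    """
    return sorted(client for client, wanted in subscriptions.items()
                  if wanted & item_subjects)
```

For instance, `route({"pharma", "health"}, {"acme": {"pharma"}, "globex": {"sport"}})` returns `["acme"]`. The point is that this only works because both sides agreed on the same subject vocabulary in advance – exactly what a standardized dialect provides.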

For more complex situations, like that of a drug dossier, containing say 20,000 files, the whole structure can be defined using an XML ‘schema’ to standardize the structure and its meaning: it specifies what is required in terms of context, metadata and the content itself.
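
Real schema languages (XML Schema, DTDs) express such rules declaratively; purely as an illustration, the same kind of conformance check can be sketched procedurally, with invented element names standing in for a real submission standard:

```python
import xml.etree.ElementTree as ET

# Metadata elements our (invented) submission standard requires.
REQUIRED_METADATA = {"drug-name", "study-id", "date-created"}

def validate_submission(xml_text):
    """Return a list of problems; an empty list means the document conforms."""
    errors = []
    root = ET.fromstring(xml_text)
    meta = root.find("metadata")
    if meta is None:
        errors.append("missing <metadata> section")
    else:
        present = {child.tag for child in meta}
        for tag in sorted(REQUIRED_METADATA - present):
            errors.append(f"missing metadata element <{tag}>")
    if root.find("content") is None:
        errors.append("missing <content> section")
    return errors
```

A regulator receiving 20,000 files can run exactly this kind of check mechanically against the published schema, which is why the standard – not any one software stack – is what gets certified.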

Interestingly, HTML – for which Sir Tim Berners-Lee and the World Wide Web are famous – is a kind of dumbed-down SGML/XML: great for creating simple web pages, but it loses the two main benefits mentioned above (contextual rendering and industry dialects).

Of course, some technologists would prefer to ignore the practicalities of information management, such as the need to think about document standards. Instead they propose a magic bullet which is to preserve software and hardware environments. By virtualizing the whole stack of software and hardware we preserve everything needed to read the old content. While virtualized systems have a big place in modern IT (because they enable fast deployment of complex systems), they do not obviate the need to tackle the underlying information management standards.

A regulator in healthcare is not going to certify a software stack instead of a document standard (and by ‘document’, I mean the whole machinery of meta-data, context and content formats).

Google may be in danger of looking for a technological magic bullet where none exists. Meanwhile, back in the real world of industry, the rest of us are finding solutions today to all aspects of the information preservation and fidelity issue.

So what if Google did offer a ‘Digital White Hole’ (instead of a Black one), to provide improved access to archived information? What would that mean?

I ask this because Google are on a mission to monetize both our content and our internet personas and behaviour. Unlike the National Archives, they have no public duty to do what is best for our content, only what is best for them. And they already have a hegemony in search!

Do I really want ‘search’ to extend to the custodianship of content? A new hegemony?! I don’t think so.

When I make a search, Google puts at the top the results of those who have paid to be there, not those most appropriate to my context. If I wanted a drug safety report from 30 years ago, would I expect Google to be the best custodian of such content in the future? At present, there are no signs that this would be even a remotely realistic outcome.

The risk we as citizens or businesses might face is that we become beholden to large service providers like Google to gain access to our archived content, with no statutory safeguards.

If this used some magic virtualized ‘digital vellum’, we might find it difficult or impossible to take our content away and move it to another provider.

The issues of long-term preservation and fidelity of information/content are too important to be left to commercial interests alone (but of course they will need to play a role).

This is why the standards-based approach, coupled with pragmatic strategies like those illustrated earlier, is what I would recommend. No magic – just hard work, experience, collaboration and persistence.

These are what work today in many industries and businesses, albeit not universally implemented. If we face a ‘Black Hole’, that is all the more reason to scale up what we know works.

I would not fly on planes today without a standardized dossier called the Aircraft Maintenance Manual (AMM), linked to the actual maintenance applied to each plane in service, which is a global industry standard, independent of any commercial interest. Planes are like flying paragons of information management!

There is today, particularly in domestic use but also in a surprising number of businesses, far too much content that is unmanaged, unindexed and uncared for, stored on file shares in proprietary formats, not using the well established methods and tools (such as ECM) that would ensure digital and intellectual preservation.

Google’s rather belated realization that the world is not as simple as they would have had us believe is a great first step (for them) – but is it for us? We shall see.

Rather than imagining they have discovered something new, they might be interested in learning how others like Jeff James of the UK’s National Archives, and those of us at the coal face of information management, have been addressing these issues for many years.

Welcome Google, to the world of business critical content, and long-term preservation.

You’ve certainly taken your time!
