Mr. Metadata’s Musings on Legacy Migration, e-Discovery and ECM 2.0

I had written earlier about what makes for an Intelligent Content Solution: you need Intelligent Content Design to supplement the Intelligent Content Framework.

Recently, we also started talking to some of our large customers about legacy migration approaches.  The fact is that despite the phenomenal success of MOSS, over 80% of unstructured corporate content still resides within file systems today.  Often, it is just one big mess, and it leads to a tremendous loss of corporate knowledge while creating a huge litigation liability.  There is simply no one-size-fits-all approach to solving this problem, and applying Search technologies alone does not solve it.  There are a few companies I have worked with over the years that specialize in this field, such as Delve Information Group and the Gimmal Group.  I worked with a large Pharma customer a few years ago who had a team of people working on a project just to sort out the ownership of legacy content on their file systems.  It took them two years just to create a database that stored information on who owned which document.  But this information could not be turned into actionable intelligence for file systems migration.

When migrating legacy content, the following considerations need to be taken into account:

1.) Who is the owner of the legacy content?  If the person is no longer with the company, can the information be deleted, or archived?  A good tool to help with this Information Governance issue for file systems is Varonis.  This could be a key aid in migrating content from file systems to MOSS, and in maintaining the ownership governance.  Just relying on a migration tool like Metalogix or MetaVis only solves a fraction of the problem.

2.) Another area of consideration is the multiple redundant copies of legacy files.  According to Cohasset Associates, each content artifact has up to 18 identical copies scattered all over the place.  A key question when trying to manage the Corporate Truth is which document is the original, and which are copies thereof?  This issue alone lends itself to several approaches to e-Discovery and ‘document forensics’.  Many search engines that are combined with hashing capability can be adapted to find duplicate documents, and the server will store the date the document was uploaded.  So that is a start.  But that is not sufficient.  The best solution by far that I have seen for this problem is the new NextPage Information Tracking Platform.  Their ‘digital threading’ technology is exactly what is needed, and I consider it revolutionary.

There is also the issue of document forensics.  This is the key consideration: if I take a document, modify just one word in it and save it, does this make the new document completely unique (a hashing-only approach would create an entirely new hash for the document), or is this a ‘closely related’ document?  The so-called vectoring capabilities of FAST ESP can help with this problem of ‘near duplicate’ content.  There is also a tool from Equivio that can be used.  This leads to some interesting possibilities when used in combination with NextPage.  The ability to track parent-child relationships of related content is extremely important for e-Discovery, and it is also a key element of the emerging area of document forensics.

There are some excellent SaaS or on-premise tools available to support the e-Discovery process, such as Digital Reef or Stratify.  Recently, we have engaged in a project where we brought together Navigant Consulting, Digital Reef and NextPage to deliver a comprehensive and integrated solution to e-Discovery.  I am very excited about the capabilities that these partners offer together.  Some of my colleagues have also been working closely with WorkProducts, who have a very interesting approach called Evidence Lifecycle Management (ELM).  There are also some interesting packaged legacy cleanup and migration tools available from Vamosa and Active Navigation.  Both vendors are emerging leaders in the Enterprise Information Governance space.
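
To make the difference between exact duplicates and 'near duplicates' concrete, here is a minimal Python sketch.  It is not how FAST ESP, Equivio or NextPage work internally; it only illustrates the general idea under simple assumptions.  Exact copies are grouped by a SHA-256 hash of their text, while closely related documents are scored by the Jaccard similarity of their three-word shingles.  The function names and the sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    """Fingerprint a document; any change at all yields a different hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_exact_duplicates(docs: dict) -> list:
    """Return lists of document names whose text is byte-for-byte identical."""
    groups = defaultdict(list)
    for name, text in docs.items():
        groups[content_hash(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping runs of k lowercase words."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Similarity between 0.0 (unrelated) and 1.0 (same shingle set)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample content: b.txt is a copy of a.txt, c.txt changes one word.
docs = {
    "a.txt": "The quarterly report is final and approved by legal",
    "b.txt": "The quarterly report is final and approved by legal",
    "c.txt": "The quarterly report is final and approved by finance",
}

exact_groups = group_exact_duplicates(docs)
similarity = jaccard(docs["a.txt"], docs["c.txt"])
print(exact_groups)
print(similarity)
```

The one-word edit in c.txt gives it a completely different hash, so a hashing-only approach treats it as a new document, while the shingle comparison still reports a high similarity.  What threshold counts as 'closely related' is a judgment call that depends on the content.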

There are also several partners that take an archiving approach to e-Discovery.  However, in most cases, I prefer the federated Information Tracking approach that NextPage offers, because it is simply not realistic to archive all enterprise content: how about the content located on Desktops, Thumb Drives, etc.?  And these archiving solutions also lack the specialized capabilities that most customers need, so a solution like Digital Reef is still needed on top.

3.) Suppose we are able to perform all this cleanup and preparation work prior to being ready to move legacy content to MOSS.  Now we are still confronted with the issue of metadata.  Given that file systems record little beyond names, timestamps and permissions, and have no concept of business metadata, the process of metadata enrichment is extremely important.  There is some basic metadata that most search engines can extract from within documents, such as date, author name etc.  However, this information needs to be associated with the document as metadata, so it is more readily available.  I also have a strong belief that content without context is incomplete: there is an excellent article that I read a few years ago that describes the problem.  Legacy content needs to be metadata enriched before migration.  FAST ESP is an ideal tool for this metadata enrichment process, and for automatically building taxonomies.  We are also working on the new Microsoft Semantic Engine, for which a demo is available to watch on-demand.
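
As a rough illustration of the very first step of enrichment, the Python sketch below walks a folder and records the little that the file system itself knows about each file, so it can later be attached to the document as explicit metadata.  This is only an assumption-laden starting point: it does not extract authors or dates from inside documents, it does not build taxonomies, and it has nothing to do with the FAST ESP or MOSS APIs.  The function names and record fields are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def basic_metadata(path) -> dict:
    """Collect the attributes the file system already holds for one file."""
    p = Path(path)
    st = p.stat()
    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": st.st_size,
        # Last-modified time, normalized to UTC in ISO 8601 form.
        "modified_utc": datetime.fromtimestamp(
            st.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def inventory(root) -> list:
    """Build one metadata record per file found anywhere under root."""
    return [
        basic_metadata(p)
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    ]
```

Calling inventory on a share path returns a list of plain dictionaries, which could be written to a database or a CSV file and then enriched further with business metadata before migration.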

4.) The final step is moving the content to MOSS, and applying all this metadata to make it useful and findable.  This is where the new MetaPoint Server by SchemaLogic comes into play.  FAST ESP and MetaPoint as an integrated solution working with MOSS are a key part of solving this problem space.  I have recently started thinking about what an integration of MetaPoint and NextPage would look like, delivered as a Service via the new Microsoft Azure Services Platform.  The possibilities are truly exciting!

A recent update to the above is that SchemaLogic and Vamosa have formed a technology partnership around Enterprise Content Governance.  I think this is exactly the kind of solution that companies need to address their needs around content quality, to support legacy migration, e-Discovery and Information Governance.

So now after all this discussion about Legacy Migration and metadata enrichment, let’s get back to the Intelligent Content Framework.  How does it all belong together?  The simple truth is that with Intelligent Content Design to begin with, there will be no need for legacy migration going forward.  I had already talked about Intelligent Content solutions needing the Intelligent Content Framework in combination with an Intelligent Content Design approach.  It occurred to me that if we add FAST ESP to the mix, we have now also introduced semantics, and the concept of the Semantic Web into the world of Enterprise Content Management.  This is why I am calling it ECM 2.0 – it is completely analogous to Web 2.0.  This is really exciting to me, all the more so because the tools to make ECM 2.0 happen are available here and now – and all built seamlessly to work with MOSS and Office 2007.  And of course, for the ultimate in legacy migration, we can set up services to pull in legacy content, analyze and ‘X-ray’ it and enrich it with metadata, break it into re-usable topics, and pull it all into the Intelligent Content Framework.

I do not mean to trivialize the effort required to get us there.  But we can get there – and we will!  Fact of the matter is that current approaches to ECM in Big Pharma are broken – the ‘Digital Scriptorium’ model of manually creating content on the Desktop, managing it in an ECM system, and cutting and pasting with no control of source and target is no longer viable.  It actually never was viable, but there was nothing better available for a long time, and companies could afford to throw money and bodies at the problem.  Those days are gone, and they are not coming back.  Flexible and innovative business models and approaches are the only alternative!

Update June 8, 2011: It was just announced that Microsoft is acquiring Prodiance Corporation, a leader in Enterprise Risk Management.  This is definitely very exciting news, and a huge step in the area of Compliance and support for e-Discovery.  We also recently released a very relevant article on TechNet: Microsoft IT Uses File Classification Infrastructure to Help Secure Personally Identifiable Information.  We are definitely ramping up the Compliance capabilities of our stack!

