We are often asked at the start of a new project, “can you pull our content from the old database directly?” If the data is well structured, like recipes, news articles, user account records, or events, then yes: we can migrate the content directly. As often as not, though, this isn’t desirable. Most organizations’ websites last about 5 to 7 years between major renovations. Across those years, we see changes in editorial vision, in organizational initiatives and programs, and in scope. One editor may have handled things in one fashion for some years, and another, responding to organizational changes, completely differently. We often find abandoned initiatives, unused pages, eddies of old projects, and sections that everyone had forgotten about. We also often find content of one type (exhibitions, events, news) implemented in the CMS as another type, typically editorial pages, which lack appropriate metadata, structure, or business rules. Editors struggle to maintain the weight of archives by hand, and the whole system is on the brink of collapse.
For these reasons, we think a CMS content migration is the critical opportunity to inventory, assess, and pave the way towards a new site architecture focused on customer needs. To do this, we recommend a six-step content migration plan:
- Understand what’s valuable with content efficiency reports
- Inventory what you have with a website crawl
- Get to your core content through a content inventory audit
- Sort your content inventory by content types
- Create a rough information architecture, with content identifiers
- Assess, score, and assign
Let’s dive in!
Step 1: understand what’s valuable with content efficiency reports
A content efficiency report sets aside vanity metrics (“how many people viewed my site today?”) for usable, content-specific, per-page metrics. At a minimum, we can learn:
- What pages do people first arrive at?
- What attracts the most readers?
- Where do readers spend the most time?
A more sophisticated content efficiency report ties the metrics to site goals and conversion. In goal-driven content efficiency metrics, we can learn:
- What pages provide the greatest conversions?
- What plays an important role in conversions?
- What content provides the most value to the organization?
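If your analytics tool can export per-page data, the questions above can be roughed out in a few lines of Python. This is only a sketch: the CSV layout and the column names (`page`, `entrances`, `pageviews`, `total_seconds`, `conversions`) are made up for illustration, and a real export will need its own parsing.

```python
import csv
import io

# Hypothetical export: one row per page, with invented column names and numbers.
SAMPLE_CSV = """page,entrances,pageviews,total_seconds,conversions
/,120,400,8000,4
/about,10,90,5400,1
/events/gala,60,150,4500,12
"""

def load_rows(text):
    """Parse the CSV text into a list of dicts with numeric metric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        pageviews = int(row["pageviews"])
        rows.append({
            "page": row["page"],
            "entrances": int(row["entrances"]),
            "pageviews": pageviews,
            # Average time on page, in seconds per view.
            "avg_seconds": int(row["total_seconds"]) / pageviews,
            "conversions": int(row["conversions"]),
        })
    return rows

def top_pages(rows, metric, n=3):
    """Return page paths ranked by the given metric, highest first."""
    ranked = sorted(rows, key=lambda r: r[metric], reverse=True)
    return [r["page"] for r in ranked[:n]]

rows = load_rows(SAMPLE_CSV)
print(top_pages(rows, "entrances"))    # where people first arrive
print(top_pages(rows, "avg_seconds"))  # where readers spend the most time
print(top_pages(rows, "conversions"))  # what converts
```

Ranking by one metric at a time keeps the report readable; combining metrics into a single score is a judgment call best made against your own site goals.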
We’ve created a free basic Google Analytics Content Efficiency Report that you can install easily and use as a base for your own, goal-driven report. Ideally, collect a few months of site analytics data on how the current content performs, but even a few weeks is useful.
Step 2: inventory what you have with a website crawl
Use website crawling software to click through all your site links automatically and generate an inventory of your site URLs. Crawling software is helpful, but it isn’t perfect: it can struggle with content that is locked behind a paywall or login interface; hidden URLs might not be found; and the value of the resulting inventory is highly dependent on how the site was built to begin with. For example, if your site uses human-readable URLs that are properly structured (“/about”, “/about/history”, etc.), then your site crawl puts you a step closer to understanding your information architecture, too. If your site generates unique page titles, you’ll have even more valuable information to work with. But if your site uses unreadable URLs, or if everything is generated (“index?gid=7549”), then you’ll need to do more work. You may need to pull content from within your CMS or through CMS exports.
We love Integrity, available on macOS. It’s simple, clear, and just plain works. Bonus points for checking external site links and broken images.
Oddly, outside this app, software gets complex very quickly. This Quora discussion highlights open source options. Kais Hassan wrote up a great survey of his experiences using the Python-based Scrapy. I’ve also assessed web-based software like DYNO Mapper and Slickplan. Your mileage may vary (and I’d love to hear what tools you find effective).
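If you’d rather script the crawl yourself, here is a minimal sketch using only the Python standard library. It is not how any of the tools above work; it simply follows same-host links breadth-first and returns a sorted URL list. The `fetch` helper, the `fetcher` parameter, and the `limit` default are names and values I’ve invented for illustration, and the sketch ignores robots.txt, rate limiting, and JavaScript-rendered links, all of which a real crawl should account for.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    """Download a page as text (hypothetical default; swap in your own)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def crawl(start_url, fetcher=fetch, limit=500):
    """Breadth-first crawl of same-host links; returns a sorted URL inventory."""
    host = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue:
        url = queue.pop(0)
        try:
            html = fetcher(url)
        except OSError:
            continue  # unreachable page: keep the URL, skip its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc != host or absolute in seen:
                continue
            if len(seen) < limit:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(seen)
```

Calling `crawl("https://example.org/")` would return the inventory as a list you can write out to a spreadsheet. Passing your own `fetcher` makes it easy to add a login cookie, or to test against canned pages without touching the network.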
Step 3: get to your core content through a content inventory audit
The outcome of your crawl will be a spreadsheet of every public-facing URL of your site. We want only the core, simple content objects of your website: the article itself, for example, but not the category list views or other pages in which a highlight of the article might appear. For an event calendar, we want the event itself, and not the month listing, the upcoming event listing, or the archive, all of which are generated.
In this step, remove all auto-generated collections from the spreadsheet: indexes, dated archives, tagged groupings, category archives, etc. Also remove all paginated URLs. Anything the CMS creates for you should be deleted from the inventory.
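On a large inventory, a first pass at this cleanup can be scripted. The sketch below flags URLs that look auto-generated. Every pattern in it is an assumption about how a CMS might build its URLs, not a rule, so check them against your own site before trusting the result.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical patterns; adjust to how your own CMS builds its URLs.
GENERATED_PATH_PATTERNS = [
    re.compile(r"/(tag|tags|category|categories|archive|archives)(/|$)"),
    re.compile(r"/page/\d+/?$"),        # paginated listings such as /news/page/2
    re.compile(r"/\d{4}(/\d{2})?/?$"),  # dated archives such as /2019/04/
]
PAGINATION_PARAMS = {"page", "paged"}   # assumed query parameters for paging

def is_generated(url):
    """Return True if the URL looks like a CMS-generated listing."""
    parts = urlparse(url)
    if any(p.search(parts.path) for p in GENERATED_PATH_PATTERNS):
        return True
    # A pagination parameter in the query string also marks a listing.
    return bool(PAGINATION_PARAMS.intersection(parse_qs(parts.query)))

def core_content(urls):
    """Keep only the URLs that do not look auto-generated."""
    return [u for u in urls if not is_generated(u)]
```

I’d treat the output as a list of candidates for deletion rather than deleting blindly: a hand-written page that happens to live at “/archive” would be caught by the first pattern.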
Step 4: sort your content inventory by content types
Add a column to your spreadsheet, “Content type.” Assess each row, and sort into broad buckets. Exactly what counts as a content type is a slippery notion. My colleague Hillary Marsh has led some excellent talks on this, and her article on content types is enlightening.
For our purposes, content types are typically things like user profiles, events, programs, exhibitions, biographies, articles, essays, news, evergreen pages, artworks, etc. Use what makes sense for your situation. You may find you’ll need to add a “secondary content type” column to capture subtypes, like articles: FAQ, articles: information sheets, etc.
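If your URLs are well structured, you can pre-fill the “Content type” column with a guess before assessing each row by hand. The sketch below assumes a mapping from URL path prefix to content type; the prefixes and type names are invented for illustration. Since content is often implemented in the CMS as the wrong type, treat these guesses as a starting point and not as the assessment itself.

```python
# Hypothetical mapping from URL path prefix to content type.
PREFIX_TO_TYPE = {
    "/events/": "event",
    "/news/": "news",
    "/exhibitions/": "exhibition",
    "/staff/": "biography",
}

def guess_content_type(path, default="evergreen page"):
    """Guess a content type from the URL path; fall back to a default bucket."""
    for prefix, content_type in PREFIX_TO_TYPE.items():
        if path.startswith(prefix):
            return content_type
    return default

def add_content_type_column(rows):
    """Return copies of the inventory rows with a 'Content type' key added."""
    return [dict(row, **{"Content type": guess_content_type(row["path"])})
            for row in rows]

inventory = [{"path": "/events/gala-2019"}, {"path": "/about/history"}]
for row in add_content_type_column(inventory):
    print(row["path"], "->", row["Content type"])
```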
Step 5: create a rough information architecture, with content identifiers
The goal of step 5 is to establish a nomenclature, agreed upon by the team, for talking about the content. To do so, it’s useful to establish a first take on the site information architecture. Establish your major content groupings and the sections within them. This can be inspired by your navigation, but it does not need to directly reflect it. You can group other navigation interfaces (like footer, utility, or promotional navigation) arbitrarily. The key goal is to create a content reference ID for each item. We use a three-point nomenclature. That way, the entire team can ask, “how is 2.6.12 going?” and everyone’s looking at the same thing.
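Numbering by hand works for small sites; for larger inventories, the IDs can be generated. The sketch below assumes the three points mean group, section within that group, and item within that section, numbered in the order the rows appear. That reading, and the sample groupings, are my assumptions for illustration.

```python
def assign_reference_ids(inventory):
    """Map each URL to a 'group.section.item' ID.

    inventory: list of (group, section, url) tuples, in the order
    you want them numbered.
    """
    group_numbers = {}    # group name -> group number
    section_numbers = {}  # (group, section) -> section number within the group
    item_counts = {}      # (group, section) -> items numbered so far
    ids = {}
    for group, section, url in inventory:
        g = group_numbers.setdefault(group, len(group_numbers) + 1)
        key = (group, section)
        if key not in section_numbers:
            # Next free section number inside this group.
            siblings = sum(1 for k in section_numbers if k[0] == group)
            section_numbers[key] = siblings + 1
        item_counts[key] = item_counts.get(key, 0) + 1
        ids[url] = f"{g}.{section_numbers[key]}.{item_counts[key]}"
    return ids

sample = [
    ("About", "History", "/about/history"),
    ("About", "History", "/about/history/founders"),
    ("About", "Staff", "/about/staff"),
    ("Visit", "Hours", "/visit/hours"),
]
for url, ref_id in assign_reference_ids(sample).items():
    print(ref_id, url)
```

Once IDs are handed out, avoid regenerating them after rows are reordered or removed, or the team’s shorthand will point at different pages than it did last week.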
Step 6: assess, score, and assign
The not-too-surprising news is that content published online has a limited shelf life. Content ROT is a common method of assessment: find and eliminate that which is Redundant, Outdated, or Trivial. Smashing Magazine has a nice article on this. In addition, I like to consider whether the content is legally required. You might have other attributes that you’d like to weight by; feel free to use the template as a starting point for your own assessment plan.
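To make the scoring concrete, here is one way the ROT flags and the legal requirement could be turned into a recommendation. This is a sketch, not the template’s logic: the field names, the three outcomes (“keep”, “revise”, “remove”), and the thresholds are all assumptions you should replace with your own rules.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    ref_id: str                     # content reference ID, e.g. "2.6.12"
    redundant: bool = False
    outdated: bool = False
    trivial: bool = False
    legally_required: bool = False

def recommend(a):
    """Suggest an action from the ROT flags (assumed thresholds)."""
    rot_count = sum([a.redundant, a.outdated, a.trivial])
    if a.legally_required:
        # Legally required content stays, but ROT means it needs work.
        return "keep" if rot_count == 0 else "revise"
    if rot_count == 0:
        return "keep"
    if rot_count == 1:
        return "revise"
    return "remove"

print(recommend(Assessment("2.6.12", outdated=True)))
print(recommend(Assessment("4.1.3", redundant=True, trivial=True)))
```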
Finally, establish who owns the content, what needs to happen to it, and when it needs to happen by. You might need research, image production, copywriting, or editing, but your inventory document can now become the master view of what content will be migrated to your site. You can even add a link to your files if that’s helpful to your team.
Go forth and migrate
I hope this website content migration strategy provides a useful framework and interesting tools for you. As you can see, a well-run content migration strategy is less about the technical challenge, and more about the editorial position you’ll take going forward to your new CMS. I’d be very interested to hear about your own experiences, workflows, and tools. Feel free to contact me if you have questions or comments.