Peter Morville's bi-weekly column on the evolving definition of information architecture

Little Blue Folders

The Web is big. A billion pages big, according to a recent study by Inktomi and the NEC Research Institute. It's the ultimate testing ground for information retrieval technologies.

If your search engine can automatically bring order to this overwhelming global mess of stuff, just think what it can do for a single web site or intranet. No more agonizing over the design of topical hierarchies. No more worrying about how you'll afford your growing staff of information architects. Just sit back and let the software work its magic.

According to Northern Light, there's no need to wait. Their television and magazine ads claim they've developed "a search engine that delivers the World Wide Web prioritized, categorized and organized into neat little blue folders." And guess what, you can license this technology for your web site or intranet.

Autonomy takes the auto-classification hyperbole a step further, claiming that their Portal-in-a-Box provides "an out-of-the-box solution that enables online publishers and corporations to easily create and automatically maintain an easy-to-navigate customized portal site, removing the need for manual labor in the process of categorizing, tagging, (and) hypertext linking large amounts of information."

The marketing folks at these companies obviously went to the P.T. Barnum Sucker-Born-Every-Minute School of Business. They are overselling these automated classification products in a way that may pump up sales in the short term but will inevitably lead to a major backlash as their customers learn the hard way that software alone can't solve their portal problems.

Don't Believe the Hype

Information retrieval is inherently messy. Authors struggle to convey complex concepts by stringing together words and phrases into documents. Users try to articulate their information needs with a keyword or two. Attempts to connect the right users with the right content are frustrated by the ambiguity of language and organization and the subjective nature of relevance.

If you take a mess and stuff it into a bunch of little blue folders, you still have a mess. Remember the room-cleaning strategies of our childhood? Take the dirty clothes, candy wrappers, banana peels, and assorted pets and toys and shove them all under the bed. Declare your room "neat and tidy" and head out to play.

Those of us with observant parents soon learned this to be a play-now, pay-later strategy. People who fall for the auto-magical claims of these search engine vendors are sure to learn the same lesson.

Think Outside the Folder

Perhaps the biggest problem with these automated approaches to classification is that they're completely content-centric. They focus solely on organizing the stuff inside the folders, ignoring the broader information ecology.

A successful information architecture finds points of intersection.

If you don't understand the goals and strategy of the business, how can you organize the content to further those business objectives? If you don't understand the information needs, behaviors, and vocabularies of your users, how can you organize content in a way that works for them?

The automated approach also ignores the valuable human interactions that occur during the process of information architecture design. As you're working to structure and organize a site, you inevitably ask hard questions about business strategy and content. This process drives towards a better definition of site goals and content policies.

Embrace the Genius of the AND

In Built to Last, James Collins and Jerry Porras explain that successful companies make it a habit to "avoid the tyranny of the OR" and "embrace the genius of the AND."

The key to success in designing information architecture solutions for really large web sites and intranets is to intelligently combine manual AND automated approaches.

It's silly for Autonomy to pretend their software eliminates "the need for manual labor in the process." It would be equally silly to ignore the valuable role automated classification software can play.

Humans (preferably experienced information architects) are needed to develop an overall information architecture strategy, to create key classification schemes, and to train and test the performance of the software. Automated classification software is needed to classify documents in large, dynamic collections for which manual tagging of documents is impractical.
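
To make that division of labor concrete, here is a minimal sketch in Python. It is not how Northern Light, Autonomy, or any other vendor works. The category names, term lists, and review threshold are all hypothetical, and simple keyword matching stands in for whatever technique a real product uses. The point is only the shape of the workflow: people write the scheme and check the results, the software handles the volume, and anything it can't place confidently goes back to a person.

```python
from collections import Counter
import re

# Hypothetical classification scheme written by a person:
# each category maps to terms that suggest it.
SCHEME = {
    "billing": {"invoice", "payment", "refund", "charge"},
    "shipping": {"delivery", "tracking", "package", "courier"},
}

# Assumed cutoff: fewer matching terms than this and a person should look.
REVIEW_THRESHOLD = 2


def classify(text, scheme=SCHEME, threshold=REVIEW_THRESHOLD):
    """Return (category, score); category is None when a person should review."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: sum(words[t] for t in terms) for cat, terms in scheme.items()}
    best = max(scores, key=scores.get)
    ranked = sorted(scores.values(), reverse=True)
    # Weak or tied results are routed to manual review instead of being guessed.
    if scores[best] < threshold or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None, scores[best]
    return best, scores[best]


def accuracy(labeled_docs, scheme=SCHEME):
    """Test the software against documents a person has already tagged."""
    results = [(classify(text, scheme)[0], label) for text, label in labeled_docs]
    decided = [(guess, label) for guess, label in results if guess is not None]
    if not decided:
        return 0.0
    return sum(guess == label for guess, label in decided) / len(decided)
```

Calling classify("Your refund for the invoice payment was issued") files the document under "billing", while classify("Hello there") returns None as the category, which is the signal to hand the document to a human indexer.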

There are many gray areas in between that must be navigated based upon the context of a particular web site or intranet. Let's examine just a few of the dimensions that should be considered when trying to decide how to best mix manual and automated approaches.

Dimensions for Consideration

The Importance of Performance
If you're designing a mission-critical intranet application to support 6,000 call center operators who must interact with the system to answer questions for customers who are waiting on the phone, the value of high-quality manual classification may outweigh the cost-savings of an automated classification solution.

However, if you're publishing a free, online database of the 20,000 best jokes about over-hyped technologies, automated classification may be the best way to provide topical access.

Volume & Dynamism
People are slow and expensive. Yahoo has roughly 1.5 million links. Google claims an index of more than 1 billion URLs. Some sites are too big for manual indexing (or at least for manual indexing of all content). Some sites, particularly those featuring daily news reports, change too quickly for people to keep up with them. Before you plan a manual classification strategy, you need to carefully consider volume, growth, and dynamism.
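
A back-of-the-envelope calculation shows why. The sketch below is in Python, and the throughput figures are assumptions chosen purely for illustration (10 documents per indexer per hour, 2,000 working hours per year), not measurements from Yahoo, Google, or anyone else. Swap in your own numbers.

```python
def indexer_years(documents, docs_per_hour, hours_per_year=2000):
    """Person-years needed to classify `documents` once by hand."""
    return documents / (docs_per_hour * hours_per_year)


# Hypothetical throughput for one human indexer; replace with your own estimate.
ASSUMED_DOCS_PER_HOUR = 10

for name, size in [("1.5 million links", 1_500_000),
                   ("1 billion URLs", 1_000_000_000)]:
    years = indexer_years(size, ASSUMED_DOCS_PER_HOUR)
    print(f"{name}: {years:,.0f} person-years")
```

Under those assumptions, a collection of 1.5 million items takes 75 person-years to classify once, and a billion items takes 50,000, before accounting for any growth or change in the content.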

Content Diversity
Automated classification solutions work best on large collections of full-text documents. They don't work as well on collections that vary in document type and granularity, such as a web site that includes long technical articles, short abstracts, and PowerPoint presentations.

Automated classification really breaks down on multimedia collections. If you're trying to provide access to images, software applications, databases, or other non-textual digital objects, human application of metadata is still the best solution.

End Notes

Did this article mess up your little blue folders?

Please send your rants and raves to Peter Morville.
