the linked, linear, serendipitous Web

I’m taking a course on Web archiving for the second half of this winter term at U of M, and from the very beginning our major project has got my brain going on theoretical issues and implications of technology and our offline assumptions as they impact our approach to the Web.

Here’s the thing about the Web. (And let’s distinguish it from the “Internet.” I am only talking about the Web.) Perhaps the most wonderful, inspiring, and revolutionary aspect of hypertext and hyperlinks is their difference from print, and from scanned book images or e-books treated as paper books. I am talking about text that means something to the computer (in a sense, in that it’s manipulable), rather than the image of words on a page, which is also how I’d describe print media.

How are hypermedia different? Two words: linked, and linear.
Technology dictates to an extent how we can approach, interact with, interpret, and give meaning to a text or other communication. Books are quite hierarchical: they have tables of contents, they are organized into large units that have smaller units inside (a series, a book, a chapter, a subsection, a paragraph). With a paper book or e-book, we can flip through it as our whims or interest dictate, so I can’t argue for extreme linearity. But with many books, the assumption – and thus the way the communication is crafted – is that the reader will go in order of hierarchy, reading deeper into chapters before looking at the next. Aside from Choose Your Own Adventure, I can’t think of many (or any) examples of books that encourage or even force the reader into a linear path rather than a hierarchy.

Going with the example of the Choose Your Own Adventure books, let’s think about the idea of a lack of emphasized hierarchy. I’m not saying that Web sites don’t have hierarchy; in fact, the majority do. The vast majority of things on the Web are crafted, organized, and communicated as though we are using computers as screens for traditional print media. Sure, hyperlinks are becoming much more common, for example within some news sites (although sadly they mostly manifest as randomly linked “keywords” that go to ads). But the hierarchical impulse seems to be even more easily implemented on the Web – with all of its potential for flatness and interlinked-ness – than in print!

What I want to encourage is an approach to thinking about the Web that doesn’t take hierarchical, mostly non-interlinked sites as the representation of what the nature of the Web is, and of what can be done with hypertext and hyperlinks. Rather, let’s think about the paths of users of all kinds: starting here, going there, then coming back here, then going somewhere else, all by following links that interconnect these sites of communication and interaction. Let’s set those pages all on a plane together, make the links manifest, and think about this non-hierarchical plane of communication and meaning and the workings of users’ brains: making connections. Going back and forth without hierarchy. Linearity in an extreme sense: not linear in terms of going from beginning to end with no flipping, but linear in the sense of lacking hierarchy. Going back and forth along lines, going by whim and by instinct.

And here in these paths we find serendipitous meaning.

Now what is my issue with Web archiving?

Basically, it’s that it treats sites as isolated, hierarchical entities. The software that we use for our class, Archive-It, takes seeds from domains, subdomains, directories, pages, and RSS feeds. (And others, but I will work with these here.) It crawls to the boundaries of a site: unless we specify otherwise, if we start with a seed on one domain, the crawler won’t follow links from its pages out to pages on other domains. We have to supply new seeds for extra domains. The very technology itself forces a representation of the Web through the lens of site paths and organization, directory-style, hierarchical in the extreme. It disallows representation of the revolutionary nature of the interlinked Web and the organic paths of users through this landscape.
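To make that scoping behavior concrete, here’s a minimal sketch of a seed-scoped crawl. This is not Archive-It’s actual implementation, just an illustration of the boundary logic described above; `fetch_links` is a hypothetical stand-in for whatever extracts links from a fetched page.

```python
from collections import deque
from urllib.parse import urlparse

def crawl_in_scope(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl that, like a seed-scoped archiver,
    only follows links whose host matches one of the seed hosts."""
    allowed_hosts = {urlparse(s).netloc for s in seeds}
    seen, queue, archived = set(seeds), deque(seeds), []
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        archived.append(url)
        for link in fetch_links(url):
            # Out-of-scope links are simply dropped: the crawler
            # never leaves the seed domains unless we add new seeds.
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return archived
```

Links pointing off-domain vanish from the capture entirely, which is exactly the loss of interlinked-ness the post is complaining about.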

I wonder if what limits our imagination here is technology: have we not developed technology that interacts with the Web in a non-hierarchical way, or is it more difficult to implement? Or is it a way of thinking that precludes capturing dynamic, unpredictable paths across, rather than up and down, a landscape of communication and reference?

What I would really find interesting and much more experimentally informative would be to crawl along the paths of potential users, perhaps with semi-random paths weighted by probability, to keep the crawl more or less on a concept or topic or question. Where does it go? We won’t know until we do it. The results may surprise us. The results may create new meaning that we were not aware of. The results are an illustration of the serendipity of exploration and discovery when our expectations of outcome are kept to a minimum. And that has value in doing something really new with a technology whose potential we are barely beginning to tap with our creativity and imaginations.
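The semi-random, probability-weighted path above could be sketched as a topical random walk. This is a toy, assuming we can supply a `relevance` scoring function (how on-topic a link looks) and a `fetch_links` function; both names are hypothetical, and real relevance scoring is the hard part.

```python
import random

def serendipity_crawl(start, fetch_links, relevance, steps=50, seed=None):
    """Random walk across links, weighted by topical relevance,
    so the crawl drifts along a topic rather than a site hierarchy."""
    rng = random.Random(seed)
    path, current = [start], start
    for _ in range(steps):
        links = fetch_links(current)
        if not links:
            break  # dead end: this particular path simply stops
        # More relevant links are more likely, but any link can win,
        # which is where the serendipity comes from.
        weights = [max(relevance(link), 1e-6) for link in links]
        current = rng.choices(links, weights=weights, k=1)[0]
        path.append(current)
    return path
```

Unlike the scoped crawl, nothing here cares what domain a link belongs to: the archive becomes a record of a plausible user’s wandering rather than of a site’s directory tree.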

2 thoughts on “the linked, linear, serendipitous Web”

  1. I would initially assume that not following the sort of hierarchy you describe in archiving would leave you in one of two difficult situations.
    One option would be to simply follow every link encountered and archive the linked domain and its links and so on. The issue here is that you are likely to end up archiving the entire web, a major storage space problem.
    Alternatively (and more along your lines of thinking, if I understand your post correctly), you could create an algorithm that simulates an end user’s pattern of searching. The issue here is that users don’t just follow links on a site. They see something of interest and search it out on their own, leading to a separate and “unconnected” source of information that has its own links and potentially spurs further searches. Programs and algorithms, as you are no doubt aware, must have rules to follow or they simply fail to function. It is the nature of computers to follow commands, not to think for themselves, hence the difficulty in developing true AI.
    A third possibility would be to track ACTUAL users and archive the sites and pages they visit. I can see this being a fairly interesting pursuit, since every user is going to yield different results based on their own browsing habits and interests. Of course, the issue there is privacy vs. accuracy. If you tell the users they’re being tracked, they might change their browsing habits; if you don’t tell them… well, let’s not go there.

  2. The crawling could be made more reliable with the addition of meta-info added to links. Maybe this is already possible; beats me. If there were keywords associated with links, then the crawl could be guided toward links that share common keywords. People would be encouraged to add the tags, because otherwise their links wouldn’t be followed. You could name these Molly-tags and become famous. Unless someone has already thought of this. In that case, never mind.
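For what it’s worth, the keyword-tagged-link idea in this comment could be sketched very simply. This assumes a made-up convention where each link carries an author-supplied set of keywords (no such standard exists; the data structure here is purely illustrative).

```python
def keyword_guided_links(links, crawl_keywords):
    """Keep only links whose author-supplied tags overlap the
    crawl's keywords -- the 'Molly-tags' idea from the comment.
    `links` maps each URL to its (hypothetical) keyword set."""
    return [url for url, tags in links.items()
            if tags & crawl_keywords]
```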
