# Redundancy Metric

I am wondering if anyone has ever calculated the entropy of the web.

I was looking into paying my credit card bill online. The credit card company requires the routing number from your bank. They had, of course, a little diagram of ``your check'' with the routing number, your account number, and the check number circled in the machine-readable numbers along the bottom. I thought to myself: every place I've ever seen that asks for the routing number of a check has the same sort of diagram.

Later, I was comparing various web photo-album sites. I was looking for one particular feature from them, so I was looking through all of their FAQs. Every single site's FAQ contained all of the following questions....

• What do you mean ``a JPG''?
• What do you mean ``resolution''?
• What does ``dpi'' mean?
• What do you mean ``upload''?
• How do I pick which files to upload?
• Why/How do you crop pictures?
• Should I use ``JPG'' or ``TIFF''?

So, I was wondering how redundant the web was in general. I was trying to compare it to other things. There are N textbooks on any given subject. So, I would say that there is probably a great deal of redundancy in printed material, too. Also, I am constantly reminded of how redundant the daily news is from day to day. And, how redundant are movie and book plots and characters?

I know that there are some metrics for English ([1][2]) that put it at around 1.22 bits of information per character. But most of them are based on just how badly garbled some text can be and still be understood as English. I'm looking for something even more overarching... something that considers the whole corpus of the web and determines what proportion of it could go away and still have the web contain all of the information that it does now (even if it's harder to find said information).
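For a sense of what a per-character figure even means, here is a minimal sketch of the crudest version: Shannon entropy computed from single-character frequencies alone. The function name `char_entropy` is my own invention for illustration. Because it ignores context entirely (it has no idea that ``q'' is usually followed by ``u''), it will come out well above the context-aware estimates quoted above, so treat it as a loose upper bound and not a real measurement.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Bits per character, using only single-character frequencies.

    This is a zero-order estimate: it knows nothing about which
    characters tend to follow which, so it overstates the information
    content of real English.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Shannon entropy: -sum(p * log2(p)) over the observed characters.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    sample = "I am wondering if anyone has ever calculated the entropy of the web."
    print(f"{char_entropy(sample):.2f} bits per character (zero-order)")
```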

I was trying to think about how to measure it. There's so much freedom in writing that it would probably be pretty tough. It would probably also need some comparison to other media to give it context. An interesting aside would be to compare how redundant today's newscast is with respect to yesterday's vs. how redundant today's newscast is with one that happened more than a year ago.
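One crude way to put a number on it, assuming you're willing to let a general-purpose compressor stand in for ``information content'': compress the text and see how much it shrinks. The sketch below uses Python's standard `zlib` module; the function names (`compressed_size`, `redundancy`, `marginal_cost`) are made up for this example. The `marginal_cost` function is aimed at the newscast question: how many extra bytes does today's transcript cost once yesterday's is already stored?

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def redundancy(text: str) -> float:
    """Fraction of the raw bytes that compression manages to remove.

    Very short inputs can come out negative, because the compressor's
    fixed overhead outweighs any savings.
    """
    raw = len(text.encode("utf-8"))
    if raw == 0:
        return 0.0
    return 1.0 - compressed_size(text) / raw


def marginal_cost(known: str, new: str) -> int:
    """Extra compressed bytes needed for `new` when `known` comes first."""
    return compressed_size(known + new) - compressed_size(known)


if __name__ == "__main__":
    # Hypothetical stand-ins for two newscast transcripts.
    yesterday = "Markets were mixed today as investors awaited the report. "
    today = "Markets were mixed today as investors digested the report. "
    print("cost of today alone:          ", compressed_size(today))
    print("cost of today after yesterday:", marginal_cost(yesterday, today))
```

This only catches literal repetition, not two FAQs that say the same thing in different words, and zlib only looks back about 32 KB, so repeats that sit far apart in a large corpus go unnoticed. Anything web-scale would need a different tool, but the shape of the measurement is the same.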

Anyway, back to the redundancy of the web. I think there may be a market for a sort of ``faq''-hosting service. Businesses starting up wouldn't have to create nearly as much content to get going if they could just reference a shared FAQ in such a way that the answer pages came out in their company theme.

Consider starting another web photo-album/photo-printing site.... If you wanted to start a new one right now, you'd have to get together a huge FAQ that is incredibly similar to the FAQs of all of your competitors. It would save you a great deal of time if you could just link into a meta-FAQ.

The meta-FAQ would have to have provisions for you to adapt the look and feel of the pages so that they fit in with the rest of your company's web site. But, that should be very easy to accomplish. In fact, it shouldn't be hard at all to tweak any portion of the page or href's within the page into something more suited to your company.

For the most part, the stuff of FAQs isn't really company-specific. The content is by-and-large totally company-neutral. The presentation in terms of colors and fonts and banners and side-bars and such is very company-specific, but the content really isn't. Only small portions of the content would require any tweaking at all.
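To make the split between neutral content and company-specific presentation concrete, here is a minimal sketch using Python's standard `string.Template`. Everything in it is hypothetical: the `META_FAQ` entry, the theme keys (`company`, `min_dpi`, `font`, `color`), and the `render_faq` function are invented for illustration and are not the architecture I have in mind, just one way the idea could look.

```python
from html import escape
from string import Template

# Shared, company-neutral FAQ entries (hypothetical sample content).
# $placeholders mark the small portions a company would fill in.
META_FAQ = {
    "what-is-dpi": {
        "question": 'What does "dpi" mean?',
        "answer": "Dots per inch: how many printed dots fit in one inch. "
                  "$company recommends at least $min_dpi dpi for prints.",
    },
}

# Company-specific presentation lives in the page template and the theme.
PAGE = Template(
    '<div style="font-family: $font; color: $color">\n'
    "<h2>$question</h2>\n"
    "<p>$answer</p>\n"
    "</div>"
)


def render_faq(entry_id, theme, overrides=None):
    """Render one shared FAQ entry in a company's look and feel.

    `overrides` lets a company replace the question or answer of an
    individual entry. Theme values are assumed to be trusted.
    """
    entry = dict(META_FAQ[entry_id])
    entry.update((overrides or {}).get(entry_id, {}))
    # Fill in company details; unknown placeholders are left as-is.
    answer = Template(entry["answer"]).safe_substitute(theme)
    return PAGE.substitute(
        font=theme["font"],
        color=theme["color"],
        question=escape(entry["question"]),
        answer=escape(answer),
    )


if __name__ == "__main__":
    theme = {"company": "ExamplePix", "min_dpi": "300",
             "font": "Georgia", "color": "#333"}
    print(render_faq("what-is-dpi", theme))
```

A new photo site would supply only the theme and the handful of overrides it actually cares about; the bulk of the text stays shared.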

I have a fairly decent strawman architecture for such a system in my head, but I'm not really into typing it all at the moment. If you're a venture capitalist and want to hear more..., let me know.... 8^)

