Redundancy Metric

I am wondering if anyone has ever calculated the entropy of the web.

I was looking into paying my credit card bill online. The credit card company requires the routing number from your bank. They had, of course, a little diagram of ``your check'' with the routing number, your account number, and the check number circled in the machine-readable numbers along the bottom. I thought to myself, every place that I've ever seen that asks for the routing number of a check has the same sort of diagram.

Later, I was comparing various web photo-album sites. I was looking for one particular feature from them, so I was looking through all of their FAQs. Every single site's FAQ contained all of the following questions....

  • What do you mean ``a JPG''?
  • What do you mean ``resolution''?
  • What does ``dpi'' mean?
  • What do you mean ``upload''?
  • How do I pick which files to upload?
  • Can I still use your service if I don't have a digital camera?
  • Why/How do you crop pictures?
  • Should I use ``JPG'' or ``TIFF''?

So, I was wondering how redundant the web was in general. I was trying to compare it to other things. There are N textbooks on any given subject. So, I would say that there is probably a great deal of redundancy in printed material, too. Also, I am constantly reminded of how redundant the daily news is from day to day. And, how redundant are movie and book plots and characters?

I know that there are some metrics for English ([1][2]) that fix it at around 1.22 bits of information per character and such. But most of them are based on just how badly garbled some text can be and still be understood as English. I'm looking for something even more overarching... something that considers the whole corpus of the web and determines what proportion of it could go away and still leave the web containing all of the information that it does now (even if it's harder to find said information).
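For a sense of what a per-character number even looks like, here is a minimal sketch in Python. It is not how the cited metrics were derived. The helper name `char_entropy_bits` is made up for illustration, and it treats every character as independent of its neighbors, so it ignores exactly the context that makes English so predictable. On real text it should come out well above a figure like the one quoted above.

```python
import math
from collections import Counter

def char_entropy_bits(text: str) -> float:
    """Bits per character, assuming each character is drawn independently.

    This is a zeroth-order estimate (hypothetical helper, for illustration):
    it only looks at how often each character occurs, not at what tends to
    follow what.
    """
    counts = Counter(text)
    total = len(text)
    # Shannon entropy: -sum(p * log2(p)) over the observed characters.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

if __name__ == "__main__":
    sample = "I am wondering if anyone has ever calculated the entropy of the web."
    print(f"{char_entropy_bits(sample):.2f} bits per character")
```

The gap between this number and a context-aware estimate is itself a rough picture of how much of the text is predictable from what came before.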

I was trying to think about how to measure it. There's so much freedom in writing that it would probably be pretty tough. It would probably also need some comparison to other media to give it context. An interesting aside would be to compare how redundant today's newscast is with respect to yesterday's vs. how redundant it is with respect to one from more than a year ago.
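One crude way to put a number on the newscast comparison is to let a general-purpose compressor stand in for "what is already known." The sketch below is only that, a sketch, assuming Python's standard `zlib` module; the function names `compressed_size` and `redundancy_given` are made up for illustration. It asks how many extra compressed bytes a new text costs once the compressor has already seen an old one. It only catches literal repetition, not paraphrase, and zlib only looks back about 32 KB, so it is suitable for short transcripts at best.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

def redundancy_given(new: str, old: str) -> float:
    """Rough fraction of `new` that is redundant once `old` has been seen.

    Hypothetical helper: compares the cost of compressing `new` on its own
    with the extra cost of appending it to `old`. Near 1.0 means `new` adds
    almost nothing; near 0.0 means `old` did not help at all.
    """
    new_b, old_b = new.encode("utf-8"), old.encode("utf-8")
    alone = compressed_size(new_b)
    extra = compressed_size(old_b + new_b) - compressed_size(old_b)
    return 1.0 - extra / alone

if __name__ == "__main__":
    # Made-up stand-ins for newscast transcripts.
    today = "The city council voted on the new budget after a long debate about road repairs and school funding."
    yesterday = today  # worst case: a verbatim repeat
    last_year = "Quarterly rainfall totals exceeded every forecast, flooding low fields near the river by Thursday."
    print("vs. yesterday:", round(redundancy_given(today, yesterday), 2))
    print("vs. last year:", round(redundancy_given(today, last_year), 2))
```

Running the same function over real transcripts a day apart and a year apart would give the two numbers to compare, with all the caveats above.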
