Tuesday, March 6, 2012

Using all the data


USING ALL THE DATA (Feb 2009)


The other day I was reading about how the Google search engine works (in the original research paper). The key thing that Page and Brin did was to rank each page found by seeing how many other pages on the net have links to it. It’s a lot more complex than that, but the basic idea is simple. The thinking behind the ranking is rather like ranking research papers by how many later papers cite them.
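To make the idea concrete, here is a minimal sketch in Python of that ranking idea. The link graph is a made-up toy and the real algorithm adds many refinements, but it shows rank flowing along links:

```python
# A minimal sketch of the PageRank idea: a page's rank comes from the ranks
# of the pages that link to it. The toy link graph below is invented for
# illustration; the production algorithm is far more elaborate.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:              # a dead-end page shares its rank with everyone
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["blog"],
}
print(pagerank(toy_web))
```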

For some reason this set me thinking about how we are now using data in a different way than in the past. In a simple world, we only have access to fairly immediate data. Without any recording or transmitting devices, we can access our sense data and our memory, and use those inputs to make our choices and decisions. Of course, other people also act as recording and transmitting sources, because they can see things and remember stuff and tell us about it. In fact, that’s one of the ways we learn, along with reading books, which are another source of recorded information, and which came on the scene quite a while ago.

Then along came radio and TV to extend our sources of information, because they allowed long distance access to other people and recordings, as did the telephone. So now our ‘information network’ has grown enormously. We can see and hear things from around the world, both in the present and the recorded past. We can contact a network of friends and colleagues instantly, and amalgamate data from all these sources.

The Internet, email, instant messaging, social networking and cell phones have made these networks easier to set up and manage, so that we can now investigate the world from multiple viewpoints instead of just one.

But it’s not just that we have access to all this data; it’s also that the hugely increased processing power and storage capacity now available make it possible to look at data in new ways. The Google method is an example of how we can utilise large data sets by traversing them very rapidly to extract more detailed and in-depth information than we could get ‘at first sight’.

This kind of thing has been going on for a while. Consider holograms. Whereas a standard photograph records only the magnitude of the light waves (OK, photons) arriving at the camera, a hologram records both magnitude and phase. It extracts more information from the data source by deeper processing and recording.
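A toy illustration of the difference (invented numbers, not real optics code): a ‘photograph’ that keeps only the magnitude cannot tell two waves with different phase apart, while a ‘hologram’ that records interference with a reference wave can:

```python
# Toy demonstration: a photograph records only |wave|^2, while a hologram
# records the interference of the object wave with a reference wave, and
# that interference pattern depends on phase as well as magnitude.
import numpy as np

x = np.linspace(0, 1, 8)
reference = np.exp(2j * np.pi * 3 * x)          # plane reference wave

def photo(obj):
    return np.abs(obj) ** 2                     # magnitude only

def hologram(obj):
    return np.abs(obj + reference) ** 2         # interference: magnitude and phase

obj_a = np.exp(1j * 0.0) * np.ones_like(x)
obj_b = np.exp(1j * np.pi / 2) * np.ones_like(x)   # same magnitude, shifted phase

print(np.allclose(photo(obj_a), photo(obj_b)))        # True: the photo can't tell them apart
print(np.allclose(hologram(obj_a), hologram(obj_b)))  # False: the hologram can
```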

Now consider WiFi. The latest generation of devices uses multiple signal paths to extract better signal quality and speed from a communication link than a single signal path can give. All radio communication gets scattered along multiple paths. This can be a nuisance because it causes confusion at the receiving end (it is what causes ‘ghosting’ on TV pictures). Now that scattering is being put to use, by throwing vast amounts of processing power at resolving the multiple signal paths into one ‘composite’ signal. The point is that you can squeeze more information out of the signal by these tricks, just as Google squeezed more information out of web pages by looking at multiple link paths across the net.
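One concrete version of this trick is maximal-ratio combining, a standard multi-antenna technique (the choice of technique here is mine, as an illustration): weight each noisy copy of the signal by the conjugate of its path gain, sum them, and the combined signal makes fewer errors than any single path on its own:

```python
# Sketch of maximal-ratio combining over several noisy copies of one signal.
# All the numbers (path gains, noise level, symbol count) are invented.
import numpy as np

rng = np.random.default_rng(0)
symbols = rng.choice([-1.0, 1.0], size=1000)          # transmitted bits as +/-1

paths = 4
gains = (rng.normal(size=paths) + 1j * rng.normal(size=paths)) / np.sqrt(2)
noise = 0.7 * (rng.normal(size=(paths, symbols.size))
               + 1j * rng.normal(size=(paths, symbols.size)))
received = gains[:, None] * symbols + noise           # one noisy copy per path

single_path = np.sign((np.conj(gains[0]) * received[0]).real)
combined = np.sign((np.conj(gains)[:, None] * received).sum(axis=0).real)

print("errors, single path:       ", np.count_nonzero(single_path != symbols))
print("errors, all paths combined:", np.count_nonzero(combined != symbols))
```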

(Added in Jan 2010.) It has just been announced that scientists in France have reconstructed an image transmitted through an opaque object by changing the shape of the transmitted laser beam and extracting data from each shaped waveform. Same kind of trick, only using sequential transmissions to increase the amount of data. This sort of idea was used in the ‘sampling scopes’ I encountered when I first started work: a kind of oscilloscope that looked at a repeated waveform over an interval of time to build up a picture (made of dots) of the waveform.
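Here is a rough sketch of that equivalent-time sampling idea, with invented numbers: one sample per repetition, nudged a little later each time, slowly builds up a dot picture of a waveform far too fast to capture in a single pass:

```python
# Toy equivalent-time sampling: sample a fast repetitive waveform once per
# repetition, advancing the sample point slightly each time, so the dots
# accumulate into a picture of the whole waveform.
import math

def fast_waveform(t):
    return math.sin(2 * math.pi * 1e9 * t)            # 1 GHz repetitive signal

period = 1e-9
step = period / 50                                    # offset advances 1/50 period per repetition
picture = []
for repetition in range(50):
    sample_time = repetition * period + repetition * step
    picture.append(fast_waveform(sample_time))        # one dot per repetition

print(["%+.2f" % v for v in picture[:10]])
```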

These tricks have been used for a while in cell phones, and in the allied area of phased array radars. All of them use the fact that you can get a better ‘picture’ of what is going on by looking at multiple data paths rather than just one.
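For the phased-array case, a toy sketch of the steering idea (my illustration, with made-up array parameters): phase-shift each element’s signal before summing, and the array responds most strongly in one chosen direction, another case of combining many paths into one better picture:

```python
# Toy phased-array steering: weighting each element by a compensating phase
# makes the summed array output peak in the chosen direction.
import numpy as np

elements = 8
spacing = 0.5                                  # element spacing in wavelengths
angles = np.radians(np.arange(-90, 91, 15))    # candidate arrival directions
steer_to = np.radians(30)                      # direction we want to listen in

positions = np.arange(elements) * spacing
weights = np.exp(-2j * np.pi * positions * np.sin(steer_to))

for angle in angles:
    arriving = np.exp(2j * np.pi * positions * np.sin(angle))
    gain = abs(np.dot(weights, arriving)) / elements
    print(f"{np.degrees(angle):+5.0f} deg  gain {gain:.2f}")
```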

Come to think of it, this is also kind of like what a journalist does (or should do) when he checks his sources. He is getting a better picture by using a variety of sources. We also practise something like this when we network among our friends and contacts. We don’t rely on just one opinion; we squeeze out more about what we want to know by asking different people. And I guess this has been going on for millennia – gossiping is a similar kind of activity, though perhaps somewhat less reliable.

In Malcolm Gladwell’s book The Tipping Point, and in Philip Ball’s Critical Mass, there are details of the ‘six degrees’ idea: how anyone can be connected to anyone else by five or six intermediate people. There is also a description of the ‘Kevin Bacon’ index: how many co-starring links it takes to get from a given actor to a film with our Kevin in it. People with large amounts of time on their hands have extended this to produce a large database of such indexes for all (film) actors. Again, this has only been made possible by scanning and processing large online data sets. New ideas have been brought out by processing these data networks in ways that had not previously been considered.
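Computing such an index is essentially a breadth-first search over the co-starring graph. A small sketch, using a made-up toy graph rather than real filmography data:

```python
# Computing a 'Kevin Bacon' number by breadth-first search over a
# co-starring graph. The graph below is invented for illustration.
from collections import deque

costars = {
    "Kevin Bacon": ["Actor A", "Actor B"],
    "Actor A": ["Kevin Bacon", "Actor C"],
    "Actor B": ["Kevin Bacon"],
    "Actor C": ["Actor A", "Actor D"],
    "Actor D": ["Actor C"],
}

def bacon_number(actor, graph, root="Kevin Bacon"):
    """Number of co-starring links between `actor` and `root`, or None."""
    seen = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in seen:
                seen[neighbour] = seen[current] + 1
                queue.append(neighbour)
    return seen.get(actor)

print(bacon_number("Actor D", costars))   # 3
```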

Satnav and route-finding systems also work by scanning large data networks, as indeed do my old friends, circuit board routing systems – see my essay on ‘Backtracking’. In fact, ‘rip-up’ routers use an idea bearing some similarity to Google’s page ranking, only here a route can be ranked ‘down’ by the number of other routes it conflicts with. The point is that both methods build up information about how entities interrelate by deep scanning.
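A sketch of the kind of network scan a satnav does, here using Dijkstra’s shortest-path algorithm over an invented toy road graph (rip-up routers explore their routing grids in a broadly similar spirit, though with different costs and data structures):

```python
# Dijkstra's algorithm over a toy road network; distances are made up.
import heapq

roads = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 1, "D": 5},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 5, "C": 8},
}

def shortest_route(graph, start, goal):
    queue = [(0, start, [start])]      # (cost so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, distance in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + distance, neighbour, path + [neighbour]))
    return None

print(shortest_route(roads, "A", "D"))   # (8, ['A', 'C', 'B', 'D'])
```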

This is beginning to sound like some ideas in basic physics – multi-body gravity problems, for example, or even Feynman diagrams. In order to analyse what is going on, you have to examine the relationship of each element with every other element. Which is kind of what Google is doing.
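A toy version of that all-pairs pattern, with made-up masses and positions: summing the gravitational pull of every body on every other body:

```python
# All-pairs gravitational forces in an invented three-body system.
# The point is the every-element-against-every-other structure, not the physics.
import itertools

G = 6.674e-11
bodies = {                      # name: (mass in kg, position in metres)
    "one":   (5.0e24, (0.0, 0.0)),
    "two":   (7.0e22, (4.0e8, 0.0)),
    "three": (2.0e23, (0.0, 6.0e8)),
}

forces = {name: [0.0, 0.0] for name in bodies}
for (na, (ma, pa)), (nb, (mb, pb)) in itertools.combinations(bodies.items(), 2):
    dx, dy = pb[0] - pa[0], pb[1] - pa[1]
    r2 = dx * dx + dy * dy
    r = r2 ** 0.5
    f = G * ma * mb / r2
    forces[na][0] += f * dx / r; forces[na][1] += f * dy / r   # pull of b on a
    forces[nb][0] -= f * dx / r; forces[nb][1] -= f * dy / r   # equal and opposite

for name, (fx, fy) in forces.items():
    print(f"{name}: ({fx:.3e}, {fy:.3e}) N")
```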
