USING ALL THE DATA FEB 2009
The other day I was reading about how the
Google search engine works (in the original research paper). The key thing that
Page and Brin did was to rank each page found by seeing how many other pages on
the net have links to it. It’s a lot more complex than that, but the basic idea
is simple. The thinking behind the ranking is rather like ranking research
papers by how many later papers cite them.
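As a rough illustration, here is a bare-bones sketch of the ranking idea in Python. The link graph is invented, and this leaves out almost everything the real Google does beyond the basic ‘votes from other pages’ calculation; 0.85 is the damping factor used in the original paper.

```python
import numpy as np

# Tiny invented link graph: each page lists the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = sorted(links)
index = {p: i for i, p in enumerate(pages)}
n = len(pages)

# Column-stochastic matrix: M[i, j] is the chance of following a link from page j to page i.
M = np.zeros((n, n))
for page, outlinks in links.items():
    for target in outlinks:
        M[index[target], index[page]] = 1.0 / len(outlinks)

# Power iteration: keep passing 'votes' around until the ranking settles.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = 0.85 * M @ rank + 0.15 / n

print(dict(zip(pages, rank.round(3))))   # C, with the most incoming links, ranks highest
```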
For some reason this set me thinking about
how we are now using data in a different way than in the past. In a simple
world, we only have access to fairly immediate data. Without any recording or
transmitting devices, we can access our sense data and our memory, and use
those inputs to make our choices and decisions. Of course, other people also
act as recording and transmitting sources, because they can see things and
remember stuff and tell us about it. In fact, that’s one of the ways we learn,
along with reading books, which are another source of recorded information, and
which came on the scene quite a while ago.
Then along came radio and TV to extend our
sources of information, because they allowed long distance access to other
people and recordings, as did the telephone. So now our ‘information network’
has grown enormously. We can see and hear things from around the world, both in
the present and the recorded past. We can contact a network of friends and
colleagues instantly, and amalgamate data from all these sources.
The Internet, email, instant messaging,
social networking and cell phones have made these networks easier to set up and
manage, so that we can now investigate the world from multiple viewpoints
instead of just one.
But it’s not just that we have access to all
this data; it’s also that the hugely increased processing power and storage
capacity now available make it possible to look at data in new ways. The Google
method is an example of how we can utilise large data sets by traversing them
very rapidly in order to extract more detailed and in-depth information than we
can ‘at first sight’.
This kind of thing has been going on for a
while. Consider holograms. Whereas a standard photograph records only the
magnitude of the light waves (OK, photons) arriving at the camera, a hologram
captures both magnitude and phase. It extracts more information from the data
source by deeper processing and recording.
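A little numerical sketch makes the difference concrete. The one-dimensional ‘waves’ below are made up purely for illustration: the photograph keeps only the magnitude, while the interference pattern a hologram records lets the phase be recovered by some deeper processing.

```python
import numpy as np

# An invented 1-D 'object wave' with both magnitude and phase.
x = np.linspace(0.0, 1.0, 1024, endpoint=False)
obj = (1.0 + 0.3 * np.sin(2 * np.pi * 5 * x)) * np.exp(2j * np.pi * 3 * x)

photo = np.abs(obj) ** 2                 # a photograph: magnitude only, phase discarded

ref = np.exp(2j * np.pi * 200 * x)       # tilted 'off-axis' reference beam
hologram = np.abs(obj + ref) ** 2        # intensity pattern the holographic plate records

# Deeper processing: demodulate against the reference and low-pass filter,
# which isolates the cross term carrying the object wave's magnitude AND phase.
spectrum = np.fft.fft(hologram * ref)
freqs = np.fft.fftfreq(x.size, d=x[1] - x[0])
spectrum[np.abs(freqs) > 50] = 0.0
recovered = np.fft.ifft(spectrum)

print(np.allclose(recovered, obj))       # True: the full complex wave comes back
```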
Now consider WiFi. The latest generation of
devices uses multiple signal paths to extract better signal quality and speed
from a communication link than a single signal path can provide. All radio
communication gets scattered along multiple paths. This can be a nuisance
because it causes confusion at the receiving end (it is what causes ‘ghosting’ on
TV pictures). Now it is being exploited by throwing vast amounts of processing
power at resolving the multiple signal paths into one ‘composite’ signal. The point
is that you can squeeze more information out of the signal by these tricks,
just as Google squeezes more information out of web pages by looking at multiple
link paths across the net.
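Here is a toy sketch of that combining trick. The path gains and noise levels are invented, and real WiFi processing is far more elaborate, but the point survives: several noisy copies of the same signal, weighted and added together, give fewer errors than any single copy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented BPSK symbol stream (+1/-1) and three made-up propagation paths,
# each with its own complex gain and its own noise.
symbols = rng.choice([-1.0, 1.0], size=10_000)
gains = np.array([0.9 * np.exp(0.3j), 0.5 * np.exp(-1.1j), 0.3 * np.exp(2.0j)])
noise = 0.4 * (rng.normal(size=(3, symbols.size)) + 1j * rng.normal(size=(3, symbols.size)))
received = gains[:, None] * symbols + noise          # one noisy copy per path

def error_rate(estimate):
    return np.mean(np.sign(estimate.real) != symbols)

# One path on its own versus all paths combined, each weighted by its
# (assumed known) channel gain -- maximal-ratio combining.
single = received[0] * np.conj(gains[0])
combined = np.sum(received * np.conj(gains)[:, None], axis=0)

print("single path :", error_rate(single))
print("all paths   :", error_rate(combined))         # noticeably lower
```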
(Added in Jan 2010.) It has just been
announced that scientists in France
have reconstructed an image transmitted through an opaque object by changing
the shape of the transmitted laser beam and extracting data from each shaped
waveform. Same kind of trick, only using sequential transmissions to increase
the amount of data. This sort of idea was used in the ‘sampling scopes’ I worked
on when I first started work – a kind of oscilloscope that sampled repeated waveforms
over an interval of time to build up a picture (made of dots) of the waveform.
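The sampling-scope trick is easy to sketch. The waveform and timing numbers below are made up: the scope takes just one sample per repetition, nudging the sample delay a little each time, and the dots build up a full picture of a waveform far faster than the sampler itself.

```python
import numpy as np

period = 1e-9                                         # a made-up 1 ns repetitive signal

def fast_signal(t):
    return np.sin(2 * np.pi * t / period) + 0.3 * np.sin(6 * np.pi * t / period)

n_points = 200
delays = np.arange(n_points) * (period / n_points)    # delay stepped within one period
trigger_times = np.arange(n_points) * 50 * period     # one dot every 50th repetition

samples = fast_signal(trigger_times + delays)         # one slow sample per repetition

# Because the signal repeats exactly, the slowly gathered dots trace out one
# full period of the fast waveform.
print(np.allclose(samples, fast_signal(delays)))      # True
```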
These tricks have been used for a while in
cell phones, and in the allied area of phased array radars. All of them use the
fact that you can get a better ‘picture’ of what is going on by looking at multiple
data paths rather than just one.
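A phased array shows the same principle in a few lines. The element count, spacing and angles below are arbitrary: adding the element signals with chosen phase shifts makes the array ‘listen’ strongly in one direction and weakly in others.

```python
import numpy as np

n_elements = 16
spacing = 0.5                                   # element spacing in wavelengths
steer_to = np.deg2rad(30.0)                     # direction we want the array to favour

elements = np.arange(n_elements)
weights = np.exp(-2j * np.pi * spacing * elements * np.sin(steer_to))

def array_gain(angle):
    # Response of the weighted array to a plane wave arriving from 'angle'.
    arriving = np.exp(2j * np.pi * spacing * elements * np.sin(angle))
    return abs(np.sum(weights * arriving)) / n_elements

print(round(array_gain(np.deg2rad(30.0)), 3))   # ~1.0: full gain in the steered direction
print(round(array_gain(np.deg2rad(-40.0)), 3))  # small: other directions suppressed
```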
Come to think of it, this is also kind of
like what a journalist does (or should do) when he checks his sources. He is
getting a better picture by using a variety of sources. We also practise
something like this when we network among our friends and contacts. We don’t
rely on just one opinion; we squeeze out more about what we want to know by
asking different people. And I guess this has been going on for millennia –
gossiping is a similar kind of activity, though perhaps somewhat less reliable.
In Malcolm Gladwell’s book The Tipping
Point, and in Philip Ball’s Critical Mass, there are details of the
‘six degrees’ idea: how anyone can be connected to anyone else through five or six
intermediate people. There is also a description of the ‘Kevin Bacon’ index – how
many co-actor links connect a given actor to a film with our Kevin in it. People with
large amounts of time on their hands have extended this to produce a large
database of indexes for all (film) actors. Again, this has only been made
possible by scanning and processing large online data sets. New ideas
have emerged from processing these data networks in ways that had not
previously been considered.
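Computing such an index is just a shortest-path search over the network of who has appeared with whom. A toy version, with an invented cast list, looks like this:

```python
from collections import deque

# Invented co-appearance graph; a real one would come from scanning large
# online film data sets.
costars = {
    "Kevin Bacon": {"Actor A", "Actor B"},
    "Actor A": {"Kevin Bacon", "Actor C"},
    "Actor B": {"Kevin Bacon"},
    "Actor C": {"Actor A", "Actor D"},
    "Actor D": {"Actor C"},
}

def bacon_number(actor, graph, target="Kevin Bacon"):
    # Breadth-first search: shortest chain of co-appearances linking actor to target.
    seen, queue = {actor}, deque([(actor, 0)])
    while queue:
        name, steps = queue.popleft()
        if name == target:
            return steps
        for costar in graph.get(name, ()):
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, steps + 1))
    return None                                 # not connected at all

print(bacon_number("Actor D", costars))         # 3: D -> C -> A -> Kevin Bacon
```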
Satnav and route-finding systems also work
by scanning large data networks, as indeed do my old friends, circuit board
routing systems – see my essay on ‘Backtracking’. In fact, ‘rip-up’ routers use
an idea bearing some similarity to Google’s page ranking, only here a route can
be ranked ‘down’ by the number of other routes it conflicts with. The point
is that both methods build up information about how entities interrelate by
deep scanning.
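As a very loose sketch of that ‘rank a route down by its conflicts’ idea (the nets and board cells below are made up, and real rip-up routers do a great deal more):

```python
# Each route is just the set of board cells it occupies.
routes = {
    "net1": {(0, 0), (0, 1), (0, 2)},
    "net2": {(0, 1), (1, 1), (2, 1)},
    "net3": {(2, 0), (2, 1), (2, 2)},
    "net4": {(3, 0), (3, 1)},
}

def conflict_count(name):
    # How many other routes share at least one cell with this one.
    cells = routes[name]
    return sum(1 for other, other_cells in routes.items()
               if other != name and cells & other_cells)

# The route with the most conflicts is the first candidate to rip up and re-route.
ranking = sorted(routes, key=conflict_count, reverse=True)
print([(name, conflict_count(name)) for name in ranking])   # net2 clashes with net1 and net3
```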
This is beginning to sound like some ideas
in basic physics – multi-body gravity problems, for example, or even Feynman
diagrams. In order to analyse what is going on, you have to examine the
relationships between each element and all the other elements. Which is kind of
what Google is doing.
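For instance, the brute-force way to work out the forces in a small gravity problem is to visit every pair of bodies – exactly the ‘each element against all the others’ pattern. The masses, positions and units below are arbitrary (G is taken as 1).

```python
import numpy as np

rng = np.random.default_rng(1)
positions = rng.normal(size=(5, 3))             # five bodies at arbitrary positions
masses = rng.uniform(1.0, 2.0, size=5)

forces = np.zeros_like(positions)
for i in range(len(masses)):
    for j in range(len(masses)):
        if i == j:
            continue
        r = positions[j] - positions[i]
        # Contribution of body j to the force on body i (G = 1).
        forces[i] += masses[i] * masses[j] * r / np.linalg.norm(r) ** 3

print(forces.round(3))                          # net force on each body from all the others
```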