Saturday, March 1, 2008

Computational Stuff, A Worked Example

Suppose you are listening to a sound, say a bird song, and you think, “I’d like to hear that song with an echo.” Good luck doing that in the target domain, the leafy glade in which you are enjoying the ambiance. You could try to wheel up some kind of sound reflector (if by some miracle there is one handy) but the bird would probably fly off while you do it. Even if it stays around it may not sing for you.

What you’ll need to do is map the bird’s song into some representation domain, say a pattern of magnetization in a film of oxide on a cassette tape. (Of course, doing a mapping of this kind is what we call "recording".) Now, with two cassette players (one to play the original song, and one to play the song again, starting a little later, at reduced volume) you can get your effect. But there’s a good deal of work (you have to copy the tape), and some dexterity to get the right delay, involved.

If you map the bird’s song into computational stuff, things are much easier. You can automate the production of the echo, for the bird’s song or any other sound. That is, given any sound, represented computationally, you can get the sound with the echo just by asking for it. No more work than that.

Representing Sounds using Sequences of Numbers

You hear a sound when certain cells in your ears detect rapid changes in air pressure, with the pressure going up, going down, going up, going down and so on, very rapidly. How big the swing in pressure is determines the loudness of the sound. How rapidly the pressure swings up and down determines the pitch or frequency of the sound. The more rapidly the swings happen, the higher the pitch. Human ears are sensitive to swings that go up and down between about 20 times per second and about 20,000 times a second.

If you measure air pressure, very rapidly and accurately, you’ll get a collection of numbers. If you have some way of working out when each measurement was made, you have captured a representation of the sound: you know when the pressure goes up, and how high, when it goes down, and how far, and so on. If you had some way of taking these numbers, and creating air pressures that match them, with the right timing, you could reproduce the sound, at least roughly. The roughness comes in because you don’t really know what is supposed to happen to the air pressure in between your measurements. But if our measurements are close enough together we can hope that the roughness won’t matter.

In theory, you could do all of this by keeping track separately of the time associated with each air pressure measurement. In practice a simpler scheme is used. If we know how many measurements per second we are making, evenly spaced in time, we can work out the time associated with each measurement, without having to store the time for each measurement.

If we want to represent a sound, then, we need a big collection of numbers (44,100 numbers for one second of sound), which are the pressure measurements. We also need to keep track of how often we made the measurements, 44,100 times a second in our example. We need to know this number to play back the sound: if recreate the air pressure assuming the measurements were collected more or less often, the sound we get won’t approximate the original, because the timing of the air pressure changes will be off.

The terminology that is used in talking about this stuff is different from what we’ve been using. Rather than “air pressure measurement” we’ll say sample from now on, and the number of measurements we collect per second when recording is called the sample rate. Using this terminology, we can represent a sound by a sequence of samples, also keeping track of the sample rate that was used when they were recorded, and that should be used to play them back.

Doing Work on Sounds

For our representation to be useful we have to be able to work on it that corresponds to work we might want to do on sounds. Here are some example operations.

Making a sound louder. You hear that bird song, and you want it louder. You could try to make the bird angry, but that isn’t very likely to work. Instead, you map the song to a sequence of samples, and you multiply each sample by some factor greater than one. If you now play back the new samples, you’ll get a louder version of the original song. Here’s a picture of the representation system at work here:


Making a sound fainter. If you multiply each sample by a number less than one, and play back, the sound will be softer (because the swings in pressure aren’t as big.)

Hearing one sound and then another. You hear one bird song, and then, after a while, another. You decide you’d like to hear them together, one right after the other. You map each song to a sequence of samples, and then you stick the two sequences together to get one long sequence. When you play back the long sequence you’ll hear what you want.

Delaying a sound. If you have a sequence representing some silence, and a sequence representing the sound you want to delay, you can stick the sequences together, with the silence first. When you play this sequence back, you’ll first hear the silence, and then, after a delay, the sound.

Hearing two sounds together. You hear one bird song, and then another. You think they might harmonize, so you wish the birds would sing together. Good luck trying to get them to do that! Using our representation, this is easy. Get the sequences of samples for each song, and then make a new sequence by adding together to first samples in each of the songs, then the second two, and so on. It may not be clear that this will work. But if you think about the physics of the original sounds, you may be able to see that the individual air pressures add up in just this way.

Making an echo. Suppose we have our favorite bird song, and we want to hear it with an echo. What happens in an echo is that the original song hits our ears, but it also goes off and hits some reflector, like a cliff, and bounces back to our ears from there. So what is coming to our ears at any moment consists of whatever we are hearing of the original song, plus the song that is bouncing back to us. Since it takes time for the song to get to the cliff and back to us, what we hear in the echo is delayed. Also, since the bounce off the cliff produces some scattering, sound that doesn’t get back to our ears, the delayed, bounced version of the song is softer that the original.

Here’s how we can do all this, step by step.

1. Record the original song, getting a sequence of samples. Let’s call this sequence originalSong.

2. Make a softer version of the original, by multiplying each sample by (say) .5. Call this sequence softerSong

3. Record .25 seconds of silence, getting a sequence called quarterSecondOfSilence.

4. Stick the sequence quarterSecondOfSilence onto the front of softerSong. Call the resulting sequence delayedSofterSong.

5. Add together the samples in originalSong and delayedSofterSong. If we play back the resulting sequence, we’ll hear the original song with an echo.

Representing Sounds using Computational Stuff

In theory you could do all of this with pencil and paper, if you had some way to read the sample numbers, and to get the samples from your paper into some kind of player. But at 44,100 samples in a second of bird song, that’s a lot of paper and pencil. You want to get a machine to do the work.

To do this, we are going to represent the sequences of numbers we are using to represent the sounds, using computational stuff. Notice the three levels here: sounds are what we are interested in, and we use sequences of numbers to represent them. But then to be able to automate work on the sequences of numbers, we represent them using computational stuff.

We’ll do this using the Python programming language.

We are going to used things called lists in our Python programs to represent our sequences of samples.

To do the above work, the things we need to do are: get a list of samples representing the original bird song, multiply each sample in a list by a factor, get a list of samples representing .25 sec of silence, stick two lists together, and produce a new list by adding the samples in two lists together. We’ll also need to play back a list of samples. Here’s how to do these things in Python.

Basics of lists. A list is a sequence of things, numbers in our case, written like this: [1,22,47,29]. You can give a list a name, using an assignment statement, like this:

eggplant=[1,22,47,29]

Now we can refer to any of the elements of the list eggplant by using a number as an index. The indices start with 0, so eggplant[0] is 1, eggplant[1] is 22, eggplant[2] is 47, and eggplant[3] is 29. Indices are also often called subscripts, from mathematical notation).

Get a list of samples representing the original bird song. Rather than make a field recording, we’ll get our samples from a WAV file. Because computational representations of sounds are so handy, machinery has been created for recording sounds and storing them as collections of information on computer systems. Collections of information that can be filed away on the computer for future use are called files. A WAV file is a file that contains samples of a sound, stored in a particular way.

Here is a program that gets a list of samples from a WAV file named "birdsong.wav", and prints the first 10 of them:

originalSong=readwave("birdsong.wav") for i in range(10): print originalSong[i]

Multiply each sample in a list by a factor. This statement is all we need:

softerSong=[.5*e for e in originalSong]

This statement builds a new list, each of whose elements is made from an element of originalSong by multiplying it by .5. You can read it by thinking of the first thing in the brackets, .5*e, as a kind of pattern that shows how to make an element of the new list from an element of the old one. We know that e is an element of the list originalSong, because we wrote “for e in originalSong”. The square brackets tell us that we are making a list.

Get a list of samples representing .25 sec of silence. We could look for a WAV file that contains a recording of .25 sec of silence, but it will be easier just to make our own list of samples. This code will do it:

quarterSecondOfSilence=int(.25*44100)*[0]

This takes a little explaining. The number .25*44100 is the number of samples we need for a quarter second of sound, since we need 44,100 samples for each second of sound. The “int” with parens around it makes this into a whole number, or integer (you and I know that that product is already a whole number, but because .25 isn’t a whole number Python worries that the product might not be.) Then comes something funny looking: we are “multiplying” the list [0] by that number. When you “multiply” a list in Python by a number, Python sticks together that number of copies of the original list. So we are getting a list of .25*44100 copies of the list with just 0 in it. The result is a sequence of as many samples as we need to get .25 sec of sound, and each sample is 0.

Stick two lists together. This one is really easy. In Python, if I have two lists, I can just “add” them together. For example, [1,2,3]+[11,12] is [1,2,3,11,12]. So to produce a delayed version of softerSong we can just write

delayedSofterSong=quarterSecondOfSilence+softerSong

Produce a new list by adding the samples in two lists together. We’ve just seen that Python won’t do this if we just “add” the two lists with +... it will concatenate them. So this is a little more involved. Here is Python code that will do the trick.

songWithEcho=[originalSong[i]+delayedSofterSong[i] for i in range(len(originalSong))]

The expression originalSong[i]+delayedSofterSong[i] is the pattern for elements of the new list. The values of i used in the pattern come from the list range(len(originalSong)). The function range produces the list [0,1,...] with as many elements as the number you specify. Finally, len(originalSong) is the number of elements in the list, originalSong. This means that the values of i will step up from 0 until we have used as many values as there are things in originalSong, so in the pattern, originalSong[i] will get to be all of the elements in the list originalSong, which is what we want. Note that delayedSofterSong[i] picks out the corresponding elements of that list.

Play back a list of samples. We now hope we have a version of the song with the echo added, but we can’t hear it. Unfortunately we don’t have a way to convert a list of samples directly to sound from Python. What we can do, though, is put our samples into a WAV file, and then play the WAV file on our computer. Let's assume that the statement makewave("modifiedsong.wav",44100,songWithEcho) will put the samples in the list songWithEcho into a WAV file named "modifiedsong.wav". (Note: functions readwave and makewave aren't part of standard Python, but they can be implemented.)

Pulling these pieces together, here is a complete program for adding an echo to a sound, represented by a WAV file:

originalSong=readwave("birdsong.wav")
softerSong=[.5*e for e in originalSong]

quarterSecondOfSilence=int(.25*44100)*[0]
delayedSofterSong=quarterSecondOfSilence+softerSong songWithEcho=[originalSong[i]+delayedSofterSong[i] for i in range(len(originalSong))] makewave("modifiedsong.wav",44100,songWithEcho)

Below is a tabular summary of the methods we’ve discussed. The table distinguishes three domains, the sounds, the sequences of samples, and the lists (the computational stuff).



What is Implementation?

As was mentioned earlier, the key difference between computer science and mathematics is that mappings used in computing are (or could be) implemented, whereas mappings in mathematics need not be. But what does it mean to implement a mapping?

An implementation of a mapping is a thing with two ends, an input end and an output end. In the case of computer implementation, the thing will be a technical device of some kind. For example, an adder is a device whose input end can be set to correspond to a pair of numbers, say 2 and 3. After a little delay, the adder sets its output end to correspond to the sum of the two numbers on the input end, 5 in this case. Any computer will have one or more adders in it.

Similarly, to draw a bar chart we will want a device of some kind whose input end can be set to represent a length, and whose output end will produce a colored bar. In practice, you won’t find a bar-drawer as a separate device inside a computer. Rather, simpler devices are hooked together so as to act like a bar-drawer when needed.

The program above, when installed on an appropriately configured computer, implements some of the key stages of the production of a song with an echo. When an WAV file is presented as input, after some delay an appropriate WAV file appears as output. The steps connecting the original bird song in the leafy glade to a WAV file (recording), and connecting the output WAV file to a sound (playback) are not implemented by the program. If we were to examine their implementation, in practice, we would find that recording is probably implemented by a person operating a piece of special equipment, while playback is implemented by another program on a computer. The fact that mappings are often implemented by people leads into the topic of human-centered computing, to be taken up later.

Input and output forms.

We'll often need to refer to the states of the input and output ends of implemented mappings. Because implementations are very diverse, we'll want a generic term: we'll describe these states as input and output forms. For example, a bar graph displayed by a program may act as an output form the the program, and as an input form for a person viewing it.

Aside: Communication and Storage as Implemented Mappings

Many of the mappings one thinks of are like addition, in that the output is different from the input. But there are two very useful families of implementations of the identity mapping, the mapping whose output is the same as its input. Because the ends of a physical device can be separated in space, we can create a communication system, whose input end is in New York and whose output end is in Singapore. Here it is an advantage that the output is identical to the input.

Similarly, the ends of a device can be separated in time. A device whose input end is in September, 2006, but whose output end is in September, 2016, can be a storage system. Again, we are happy if the output is the same as the input.

Reprise: Advantages of computational representations.

The example we've discussed here provides concrete illustrations of the characteristic advantages of computational representations, representations using "computational stuff", over other forms. A sound represented as a sequence of samples can be stored in very little space, and accurately retrieved at a later time. It can also be communicated, that is, made available in another place, very rapidly and cheaply. These advantages give us the iPod, and the ability to acquire all kinds of material to play on it at very low cost. Finally, we can automate all kinds of transformations on sounds, incomparably more easily than can be done in other representation domains, such as magnetic tape, let alone in the target domain, the leafy glade.

No comments: