VOICE MORPHING

Introduction

Voice morphing is the gradual transition of one speech signal into another. Like image morphing, it aims to preserve the characteristics shared by the starting and final signals while generating a smooth transition between them. In image morphing, the in-between images all show one face smoothly changing its shape and texture until it turns into the target face. A speech morph should possess the same feature: one speech signal should smoothly change into another, keeping the shared characteristics of the starting and ending signals while smoothly changing the other properties.

There are many applications that may benefit from this sort of technology. For example, a text-to-speech (TTS) system with integrated voice morphing can produce many different voices. In cases where the speaker identity plays a key role, such as dubbing movies and TV shows, high-quality voice morphing technology would be very valuable, allowing the appropriate voice to be generated without the original actors being present.

The major properties of concern in a speech signal are its pitch and envelope information. These two reside in a convolved form in the speech signal, so an efficient method for extracting each of them is necessary. We have adopted an uncomplicated approach, cepstral analysis, for this purpose: the pitch and formant information in each signal is extracted using the cepstral approach. The processing needed to obtain the morphed speech signal includes cross-fading of the envelope information, Dynamic Time Warping (DTW) to match the major signal features (pitch), and signal re-estimation to convert the morphed speech signal back into an acoustic waveform.
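The reason cepstral analysis can separate two convolved components can be written out in a few lines. The symbols are assumptions introduced here for illustration and do not come from the original text: s[n] is a speech frame, e[n] the excitation (carrying the pitch) and h[n] the vocal tract impulse response (carrying the envelope).

```latex
\begin{align}
s[n] &= e[n] * h[n] \\
S(\omega) &= E(\omega)\,H(\omega) \\
\log|S(\omega)| &= \log|E(\omega)| + \log|H(\omega)| \\
c[n] &= \mathcal{F}^{-1}\{\log|S(\omega)|\}
\end{align}
```

The logarithm turns the product into a sum and the inverse transform is linear, so the cepstrum c[n] is the sum of an envelope part, concentrated at low values of n, and a pitch part, which appears as a peak near the pitch period.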

An Introspection of the Morphing Process

Voice morphing is both challenging and interesting. Speech morphing can be achieved by transforming the signal's representation from the acoustic waveform, which is obtained by sampling the analog signal and is familiar to many people, into another representation. To prepare the signal for the transformation, it is split into a number of 'frames', which are sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the frequency domain) describes the average energy present in each frequency band.
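As a rough illustration of the frequency-domain view of one frame, the sketch below computes the magnitude of the discrete Fourier transform directly from its definition. The function name magnitude_spectrum is hypothetical, and a practical system would use a fast Fourier transform rather than this direct summation.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the DFT of one frame, for bins 0 .. N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # Correlate the frame with a complex sinusoid of k cycles per frame.
        bin_value = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))
        spectrum.append(abs(bin_value))
    return spectrum


# A frame holding exactly 4 cycles of a sine wave has its energy in bin 4.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

Each entry of spectrum tells how strongly the corresponding frequency band is present in the frame.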

Further analysis enables two pieces of information to be obtained: the pitch information and the overall envelope of the sound. A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches were simply cross-faded, it is highly likely that two separate sounds would be heard. This occurs because the signal would have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout. The pitch information of each sound is compared to provide the best match between the two signals' pitches. To make this match, the signals are stretched and compressed so that important sections of each signal match in time. The interpolation of the two sounds can then be performed, which creates the intermediate sounds in the morph. The final stage is to convert the frames back into a normal waveform. The conversion of the sound to a representation in which the pitch and spectral envelope can be separated loses some information, so this information has to be re-estimated for the morphed sound. This process yields an acoustic waveform, which can then be stored or listened to.

Figure 1: Schematic block diagram of the speech morphing process

Morphing Process: A Comprehensive Analysis

Voice morphing involves a number of fundamental signal processing methods, including sampling, the discrete Fourier transform and its inverse, and cepstral analysis. The main processes can be categorized as follows.

I. Preprocessing or representation conversion: This involves processes like signal acquisition in discrete form and windowing.

II. Cepstral analysis or Pitch and Envelope analysis: This process will extract the pitch and formant information in the speech signal.

III. Morphing, which includes warping and interpolation.

IV. Signal re-estimation.

Figure 2: Block diagram of the simplified speech morphing process
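Step II can be sketched as follows. This is a minimal illustration that assumes the real cepstrum (the inverse DFT of the log magnitude spectrum) and a simple cutoff that assigns low cepstral indices to the envelope and the rest to the pitch information. The function names, the cutoff parameter and the floor value are hypothetical and are not taken from the original system.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude spectrum of one frame."""
    # The floor avoids log(0) for bins with no energy.
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [c.real for c in idft(log_magnitude)]


def split_cepstrum(cepstrum, cutoff):
    """Low indices (and their mirror images) go to the envelope part,
    everything else to the pitch part."""
    n = len(cepstrum)
    envelope = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    pitch = [c - e for c, e in zip(cepstrum, envelope)]
    return envelope, pitch
```

Because the split is a plain partition of the cepstral coefficients, adding the two parts back together recovers the original cepstrum, which is what allows the envelope and pitch to be processed separately and recombined later.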

Acoustics of speech production

Acoustic phonetics is the field that studies the acoustic properties of speech and how these are related to the human speech production system. The most important acoustic properties of the vocal tract are its resonances, which originate in a manner similar to those of a wind instrument, i.e. as resonating standing waves in an air tube. The human vocal tract is not a uniform tube, but in vowels there is still roughly one formant per kilohertz. Speech production can be viewed as a filtering operation in which a sound source excites a vocal tract filter. The source may be periodic, as in voiced speech, or noisy. Voiced speech has a spectrum consisting of harmonics of the fundamental frequency of the vocal cord vibration; this frequency is the physical aspect of the speech signal corresponding to the perceived pitch. Thus pitch refers to the fundamental frequency of the vocal cord vibrations or the resulting periodicity in the speech signal. This frequency can be determined either from the periodicity in the time domain or from the regularly spaced harmonics in the frequency domain.
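The time-domain route to the fundamental frequency can be illustrated with a short autocorrelation sketch. This is not the cepstral method adopted in this work; it only shows how periodicity in the waveform reveals the pitch. The function name and the search limits f_min and f_max are assumptions chosen for illustration.

```python
import math


def estimate_pitch_hz(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Return the frequency whose period gives the strongest autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # A periodic signal is most similar to itself when shifted
        # by exactly one period.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# 50 ms of a 200 Hz tone sampled at 8 kHz (period = 40 samples).
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
pitch = estimate_pitch_hz(tone, 8000)
```

Real speech is noisier than a pure tone, so a practical estimator would normally add a voicing decision and some smoothing across frames.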

Preprocessing

This section introduces the major concepts associated with processing a speech signal and transforming it into the new representation required to effect the morph. This process takes place for each of the signals involved in the morph.

1. Signal Acquisition

Before any processing can begin, the sound signal created by some real-world process has to be captured by the computer. This is called sampling.

When a natural process, such as a musical instrument, produces sound, the resulting signal is analog (continuous-time) because it is defined along a continuum of times. A discrete-time signal is represented by a sequence of numbers: the signal is only defined at discrete times. A digital signal is a special instance of a discrete-time signal in which both time and amplitude are discrete. Each discrete value of the signal is termed a sample.
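The two kinds of discreteness can be shown in a few lines. The sketch below stands in for an analog-to-digital converter: it evaluates a continuous function at discrete times and rounds each value to a signed integer. The function name, the 16-bit word length and the [-1, 1] input range are assumptions for illustration, not properties of the hardware used here.

```python
import math


def sample_and_quantize(analog, duration_s, sample_rate, bits=16):
    """Sample analog(t) at sample_rate and round to signed integers."""
    max_level = 2 ** (bits - 1) - 1
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        value = analog(n / sample_rate)       # discrete time: t = n / fs
        value = max(-1.0, min(1.0, value))    # clip to the converter range
        samples.append(int(round(value * max_level)))  # discrete amplitude
    return samples


# One millisecond of a 1 kHz tone at 8 kHz gives 8 integer samples.
samples = sample_and_quantize(lambda t: math.sin(2 * math.pi * 1000 * t),
                              0.001, 8000)
```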

Figure 3: Signal acquisition

The input speech signals are captured with a microphone. The analog speech signal is converted into discrete form by the onboard TLC320AD535 codec and stored in the processor memory. This completes the signal acquisition phase.

2. Windowing

A DFT (discrete Fourier transform) can only deal with a finite amount of information. Therefore, a long signal must be split up into a number of segments, called frames. Speech signals are constantly changing, so the aim is to make each frame short enough for the segment to be almost stationary and yet long enough to resolve consecutive pitch harmonics. The length of such frames therefore tends to be roughly 25 to 75 milliseconds. The windowing function splits the signal into time-weighted frames.
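A minimal sketch of splitting a signal into time-weighted frames is given below. The text does not say which window was used, so the Hamming window here is an assumption, as are the function names, the 32 ms frame length and the 16 ms hop.

```python
import math


def hamming(n):
    """Hamming window of n points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(signal, sample_rate, frame_ms=32.0, hop_ms=16.0):
    """Cut the signal into overlapping frames and weight each by the window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# One second of a constant signal at 8 kHz: 256-sample frames every 128 samples.
frames = windowed_frames([1.0] * 8000, 8000)
```

The window tapers each frame towards its ends, which reduces the spectral leakage that an abrupt cut would cause in the DFT.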

Morphing

1. Matching and Warping: Background theory

Both signals have a number of 'time-varying properties'. To create an effective morph, it is necessary to match one or more of these properties of each signal to those of the other signal in some way. The property of concern here is the pitch of the signal, although other properties such as the amplitude could be used, and it will have a number of features. It is almost certain that matching features do not occur at exactly the same point in each signal. Therefore, each feature must be moved to some point between its position in the first sound and its position in the second sound. In other words, to smoothly morph the pitch information, the pitch present in each signal needs to be matched and then the amplitude at each frequency cross-faded. Consider the simple case of two signals, each with two features occurring in different positions, as shown in the figure below.

Figure 4: The match path between two signals with differently located features

The match path shows the amount of movement (or warping) required in order to align corresponding features in time. Such a match path is obtained by Dynamic Time Warping (DTW).

2. Dynamic Time Warping

Speaker recognition and speech recognition are two important applications of speech processing. These applications are essentially pattern recognition problems, and pattern recognition is a large field in itself. Some Automatic Speech Recognition (ASR) systems employ time normalization, the process by which time-varying features within words are brought into line. Using Dynamic Time Warping (DTW), one can find the best match between the features of the two sounds, in this case their pitch. To create a successful morph, major features that occur at generally the same time in each signal ought to remain fixed, and intermediate features should be moved or interpolated. DTW enables a match path to be created, which shows how each element in one signal corresponds to each element in the second signal.
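The match path can be illustrated with the textbook form of DTW applied to two pitch tracks. The sketch assumes an absolute-difference local cost and the three standard step directions; the original implementation may have used different costs or path constraints, and the function name is hypothetical.

```python
def dtw_match_path(a, b):
    """Return (total cost, match path) aligning sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # advance both
                                     cost[i - 1][j],      # advance a only
                                     cost[i][j - 1])      # advance b only
    # Walk back from the end along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Two pitch tracks (in Hz) with the same features at different times.
total, path = dtw_match_path([100, 100, 120, 150, 150], [100, 120, 120, 150])
```

Each pair (i, j) in path says that frame i of the first signal corresponds to frame j of the second. Repeated indices show where one signal has to be stretched to line up with the other.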

3. Morphing Stage

This section details how the morphing process is carried out. The overall aim is to make a smooth transition from signal 1 to signal 2. This is partially accomplished by the match path provided by the DTW. At this stage, it must be decided exactly what form the morph will take. The implementation chosen performs the morph over the duration of the longer signal; in other words, the final morphed speech signal has the duration of the longer signal. However, one problem remains: the interpolated pitch of each morph slice. If no interpolation were to occur, the result would be equivalent to a warped cross-fade, which would still be likely to result in a sound with two pitches. Therefore, a pitch in between those of the first and second signals must be created.

To illustrate the morph process, these two cepstral slices shall be used.

There are three stages:

1. Combination of the envelope information;

2. Combination of the pitch information residual - the pitch information excluding the pitch peak;

3. Combination of the pitch peak information.

Combination of the envelope information

In digital audio production, a cross-fade is an edit that makes a smooth transition between two audio signals. The best morphs are obtained when the envelope information is simply cross-faded. Each envelope must be transformed back into the frequency domain (involving an inverse logarithm) before the cross-fade is performed. Once the envelopes have been cross-faded according to the weighting, the morphed envelope is transformed back into the cepstral domain. This new cepstral slice forms the basis of the completed morph slice.
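The round trip described above (cepstrum to magnitude spectrum, cross-fade, and back to the cepstrum) can be sketched as follows. The function names and the convention that a weight of 0 selects signal 1 and a weight of 1 selects signal 2 are assumptions. The cepstral slices are assumed to be symmetric, as real cepstra are.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def crossfade_envelopes(env_cep_1, env_cep_2, weight):
    """Cross-fade two envelope cepstra in the linear magnitude domain."""
    # Cepstrum -> log magnitude (DFT) -> magnitude (inverse logarithm).
    mag_1 = [math.exp(v.real) for v in dft(env_cep_1)]
    mag_2 = [math.exp(v.real) for v in dft(env_cep_2)]
    mixed = [(1.0 - weight) * m1 + weight * m2 for m1, m2 in zip(mag_1, mag_2)]
    # Magnitude -> log magnitude -> cepstrum (inverse DFT).
    return idft_real([math.log(m) for m in mixed])
```

Performing the cross-fade on magnitudes rather than on the cepstral values themselves is what the inverse logarithm step is for: averaging cepstra would average log magnitudes, which is a geometric rather than an arithmetic blend of the two envelopes.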

Combination of the pitch information residual

The morphed residual is produced by combining the two pitch residuals in a similar way to the envelope information: no further matching is performed. Each residual is simply transformed back into the frequency domain and cross-faded.

Combination of the pitch peak information

As stated above, a satisfying morph must have just one pitch. This means that the morph slice must have a pitch peak that has characteristics of both signal 1 and signal 2. Therefore, an 'artificial' peak needs to be generated to satisfy this requirement. The positions of the signal 1 and signal 2 pitch peaks are stored.
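One plausible way to build such a peak is sketched below. It assumes that both the position and the height of the peak are linearly interpolated between the two stored peaks according to the morph weight. The text does not give the interpolation rule, so this rule, the function names and the search range are all assumptions.

```python
def find_pitch_peak(pitch_cepstrum, q_min, q_max):
    """Return (position, height) of the largest value in q_min .. q_max."""
    q = max(range(q_min, q_max + 1), key=lambda i: pitch_cepstrum[i])
    return q, pitch_cepstrum[q]


def artificial_pitch_peak(peak_1, peak_2, weight, length):
    """Cepstral slice holding a single peak between those of the two signals."""
    (q1, a1), (q2, a2) = peak_1, peak_2
    # Interpolate the peak position (pitch period) and its height.
    q = int(round((1.0 - weight) * q1 + weight * q2))
    height = (1.0 - weight) * a1 + weight * a2
    morph_slice = [0.0] * length
    morph_slice[q] = height
    # A real cepstrum is symmetric, so mirror the peak.
    morph_slice[(length - q) % length] = height
    return morph_slice


# Halfway between a peak at index 40 and a peak at index 80.
halfway = artificial_pitch_peak((40, 0.3), (80, 0.5), 0.5, 256)
```

Since the slice contains exactly one peak (and its mirror image), the morphed frame carries a single pitch rather than the two pitches a plain cross-fade would produce.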

Signal re-estimation

This is a vital part of the system and the time expended on it was well spent. As described above, because the signals are transformed into the cepstral domain, a magnitude function is used. This results in a loss of phase information in the representation of the data. Therefore, an algorithm is required that estimates a signal whose magnitude DFT is close to the processed magnitude DFT.
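A well-known way to estimate a signal from magnitude-only frames is the iterative scheme of Griffin and Lim, sketched below. Whether the original system used exactly this scheme is an assumption, as are the function names and the Hamming window. The sketch also assumes that the frames tile the signal exactly.

```python
import cmath
import math


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def stft(signal, window, hop):
    n = len(window)
    return [dft([signal[s + i] * window[i] for i in range(n)])
            for s in range(0, len(signal) - n + 1, hop)]


def istft(spectra, window, hop, length):
    """Least-squares overlap-add of the inverse-transformed frames."""
    num, den = [0.0] * length, [0.0] * length
    for m, spectrum in enumerate(spectra):
        frame = idft_real(spectrum)
        for i, w in enumerate(window):
            num[m * hop + i] += w * frame[i]
            den[m * hop + i] += w * w
    return [a / b if b > 0 else 0.0 for a, b in zip(num, den)]


def reestimate(target_mag, window, hop, length, iterations=20):
    """Iteratively find a signal whose frame magnitudes approach target_mag."""
    # Start from zero phase.
    signal = istft([[complex(m) for m in f] for f in target_mag],
                   window, hop, length)
    errors = []
    for _ in range(iterations):
        spectra = stft(signal, window, hop)
        errors.append(sum((abs(v) - m) ** 2
                          for f, t in zip(spectra, target_mag)
                          for v, m in zip(f, t)))
        # Keep the current phase estimate, impose the target magnitude.
        constrained = [[m * v / abs(v) if abs(v) > 1e-12 else complex(m)
                        for v, m in zip(f, t)]
                       for f, t in zip(spectra, target_mag)]
        signal = istft(constrained, window, hop, length)
    return signal, errors
```

The errors list records the squared distance between the magnitudes of the current estimate and the target before each update. With this least-squares overlap-add, that distance does not increase from one iteration to the next.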