Apr 27, 2014

As stated in my previous post, I had some issues when trying to train my network using sparse-coded speech frames as input/output. The network was getting stuck at the same point after processing a single batch, and processing additional batches did nothing training-wise (performance, weights, etc., were all stuck). I tried different initialization ranges for the weight matrices, a different learning rate, and different network architectures (unit types, different numbers of units and layers), but the results were still the same.

In the same post, I talked about sparsity not being enforced at the output. Since the whole dataset was sparse-coded and we were trying to predict the vector of sparse coefficients for the next frame, all outputs were expected to be as sparse as the inputs. This post describes what I did to try to mitigate this issue. I used stochastic neurons that enforce sparsity in the output layer of a network with the same architecture as the previous network. This brings in the problem of propagating gradients through stochastic units, but fortunately David Warde-Farley pointed me to a paper which presents some alternative solutions [Bengio2013].

The units I used at the output layer are similar to the ones called "Stochastic times Smooth" (STS) in [Bengio2013]. The idea is that the output of a neuron is equal to the product of a stochastic part (sampled from a binomial distribution with probability \(\sqrt{p_i}\)) and a smooth part (for example, a sigmoid or a linear function). The value of \(p_i\) itself comes from a non-linear computation based on the unit's input. The stochastic part serves as a "gater" which blocks the activation with a probability determined by the activation of the "gater path" of the unit. A sparsity constraint is imposed by combining the KL-divergence criterion for the sigmoids in the "gater path" of the unit with added noise (as explained in sections 1 and 2 of Appendix A of [Bengio2013]).
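The forward pass of such a unit can be sketched in a few lines. This is only an illustration of the description above, not the Pylearn2 implementation I used: the function and argument names are made up, the gater path is reduced to a single sigmoid, and the sparsity penalty and gradient estimator are left out.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sts_forward(gater_inputs, smooth_values, rng):
    """Sample the outputs of a layer of 'stochastic times smooth' style units.

    gater_inputs:  pre-activations of the gater path, one per unit
                   (hypothetical name; a single sigmoid stands in for the
                   2-layer gater path).
    smooth_values: outputs of the smooth path, one per unit (hypothetical name).
    rng:           a random.Random instance.
    """
    outputs = []
    for a, s in zip(gater_inputs, smooth_values):
        p = sigmoid(a)                # p_i: non-linear function of the input
        gate_prob = math.sqrt(p)      # probability that the gate is open
        b = 1.0 if rng.random() < gate_prob else 0.0  # stochastic part
        outputs.append(b * s)         # closed gate gives an exactly-zero output
    return outputs


rng = random.Random(0)
print(sts_forward([-4.0, 0.0, 4.0], [0.3, -1.2, 0.8], rng))
```

Because a closed gate produces an exact zero, the fraction of non-zero outputs is controlled by the gate probabilities, which is what the sparsity constraint acts on.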

I found an implementation of these units by one of the authors (Nicholas Léonard) on GitHub, and updated it to the current Pylearn2 interface. His implementation supports "hybrid" STS units, which are semi-stochastic (i.e., part of the output is deterministic), but I did not use them. These units have a 2-layer non-linear gater path (I used two sigmoidal layers) and a linear path for the output.

Now, for the results: sadly, switching to these units made no significant difference at first. I ran some tests using the same training/testing/validation sets from Vincent's TIMIT dataset, but filtered so that only sentences from male speakers remained. The network was still stuck at the same objective value after the first epoch. So, I decided to take a look at an alternative hypothesis: maybe less sparsity or a different sparse coding configuration could help?

After seeing that tinkering with the hyperparameters and switching the output layer to an STS layer did not solve my problem, I decided to play a bit with the inputs/outputs I was using. My sparse coding scheme was producing a very sparse representation, as each frame of 160 samples was represented by up to 16 sparse coding coefficients. Since the dictionary had 950 different atoms, this means that at most approximately 1.7% of the coefficients were non-zero, both in the input and in the output. Note that while the quality of the reconstruction is acceptable (as you can hear in the samples below), this gives us no idea of whether it is a good representation phone-wise (i.e., whether there is any relationship between the chosen atoms and specific phones). Given the poor results I had with the previous representation, I suspect the representation I chose could be one of the culprits for the bad performance.
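As a quick check of that figure, assuming every frame uses its full budget of 16 coefficients:

\[ \frac{16}{950} \approx 0.0168 = 1.68\% \]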

To test the effect of a different sparse representation, I switched the dictionary used by my sparse coding scheme to an undercomplete one (i.e., a dictionary where the number of atoms is smaller than the length of each atom), using longer frames and increasing the overlap from 50% to 87.5%. Another motivation for this change was realizing that my previous dictionary still had a big problem: with atoms of length 160, it could not hold full atoms for frequencies lower than 1000 Hz, as the envelope of lower-frequency atoms decays more slowly than that of high-frequency ones. To have atoms covering the desired frequency range (150 to 8000 Hz), I would need to increase the atom length. However, given the tradeoff between atom length and number of atoms, an overcomplete dictionary would make the sparse coefficient vector even longer than the one I had (950 coefficients). I ran some experiments with an undercomplete dictionary with longer frames (1600 samples, equivalent to 100 ms at a 16 kHz sampling rate) and more frequency values for the gammatones (64 instead of 50), and felt that the reconstruction quality decreased a bit. However, increasing the overlap between frames led to a decent reconstruction quality. You can listen to the original sample and the three reconstructed samples below:

Original:

Reconstructed with "wrong" overcomplete dictionary (length 160, overlap 50%, atoms not correctly limited at frame borders):

Reconstructed with undercomplete dictionary (length 1600, overlap 50%):

Reconstructed with undercomplete dictionary (length 1600, overlap 87.5%):

The final dictionary I used had an atom length of 1600 and 1536 atoms.
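To make the framing numbers above concrete, here is a small sketch of the arithmetic. The helper names are made up for illustration; only the frame lengths, overlaps, atom counts, and the 16 kHz sampling rate come from the text.

```python
SAMPLE_RATE = 16000  # Hz, as stated above


def hop_size(frame_length, overlap):
    """Number of samples between the starts of consecutive frames."""
    return round(frame_length * (1.0 - overlap))


def frame_duration_ms(frame_length, sample_rate=SAMPLE_RATE):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_length / sample_rate


def is_undercomplete(n_atoms, atom_length):
    """A dictionary is undercomplete when it has fewer atoms than samples per atom."""
    return n_atoms < atom_length


# (label, frame length, overlap, number of atoms)
configs = [
    ("previous dictionary", 160, 0.5, 950),
    ("final dictionary", 1600, 0.875, 1536),
]

for label, length, overlap, n_atoms in configs:
    hop = hop_size(length, overlap)
    print(
        f"{label}: {frame_duration_ms(length):.0f} ms frames, "
        f"hop of {hop} samples, {SAMPLE_RATE // hop} frames/s, "
        f"undercomplete={is_undercomplete(n_atoms, length)}"
    )
```

With 87.5% overlap, consecutive 1600-sample frames start only 200 samples apart, so the network has to predict 80 coefficient vectors per second of audio.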

Unfortunately, even after replacing the previous dataset with the one generated with the undercomplete dictionary, the behavior of my network during training was still the same... with the added problem of NaNs eventually appearing after it had been stuck for around 30 iterations (with this YAML file, which uses only a handful of sentences from the validation set to save processing time). I am not sure what is causing this, and searching the pylearn-users mailing list, I saw that David Krueger also had a similar issue recently. In my case, using the `nan_guard` as suggested by Ian in this thread showed that the error happened during the computation of a weight matrix inside the STS layer, namely the second weight matrix used in the gater path of the unit (the second layer of the non-linear part). The error log is very low-level, as you can see here, which makes sense since it comes from inside an optimized Theano graph. Altering the sparsity target in the output layer seems to avoid the NaNs, but the results are still the same. Additionally, the predicted coefficients are very far from the sparsity target I set for the output layer: 10% of the coefficients should be non-zero, but I am getting something close to 50% instead (approximately 750 non-zero coefficients per frame).
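The two checks mentioned above (how many predicted coefficients are non-zero, and whether any value is NaN) are easy to run on a predicted frame outside the training loop. The sketch below is a plain-Python stand-in with made-up helper names and a synthetic frame; it is not Pylearn2's `nan_guard`, which works inside the Theano graph.

```python
import math


def nonzero_fraction(coeffs, tol=0.0):
    """Fraction of entries whose magnitude exceeds tol.

    NaN entries compare as False here, so they are not counted as non-zero;
    check for them separately with has_nan.
    """
    return sum(1 for c in coeffs if abs(c) > tol) / len(coeffs)


def has_nan(values):
    """True if any entry is NaN."""
    return any(math.isnan(v) for v in values)


# Synthetic frame with 1536 coefficients, 750 of them non-zero,
# mimicking the counts reported above (the values themselves are arbitrary).
frame = [0.1] * 750 + [0.0] * (1536 - 750)

target = 0.10
observed = nonzero_fraction(frame)
print(f"target {target:.0%}, observed {observed:.1%}, NaNs present: {has_nan(frame)}")
```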

Here is an example of an audio file generated with the trained network, using the previous frame and the previous, current, and next phone as input:

(Yes, it does not sound like speech at all.)

None of my experiments produced results suggesting that a sparse representation based on gammatones is useful for speech synthesis. Architectures using time-domain audio samples as the input/output had much more exciting results. Looking forward, I would like to experiment with my other representation, based on the gammatonegram, together with convolutional neural nets. As the gammatonegram for consecutive frames can be seen as an image (where each time-frequency cell is a "pixel"), the usual 2D CNNs could be tested.

Even though our course is over, my PhD research topic is speech processing, so I believe I'll continue playing with deep learning in the future. I am interested in investigating applications of deep learning to speech enhancement systems. There is a very recent paper (to appear in this year's ICASSP) where the authors used a DNN to learn spectral masks to reduce reverberation in speech signals [Han2014]. It would be interesting to see if a similar idea could be used not only for reverberation, but for speech enhancement in different kinds of environments. Other ideas will hopefully pop up along the way!

[Bengio2013] Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” arXiv:1308.3432 [cs], Aug. 2013.

[Han2014] K. Han, Y. Wang, and D. Wang, “Learning spectral mapping for speech dereverberation,” to appear in the Proceedings of the IEEE ICASSP 2014, 2014. Available at http://www.cse.ohio-state.edu/~dwang/papers/HWW.icassp14.pdf