seaandsailor - ift6266/2014-04-27T16:00:00-04:00Using stochastic neurons for conditional computation2014-04-27T16:00:00-04:002014-04-27T16:00:00-04:00jfsantostag:None,2014-04-27:/stochastic_neurons_conditional.html<p>As stated in my previous <a class="reference external" href="/gammatone.html">post</a>, I had some issues when trying to train
my network while using sparse-coded speech frames as input/output. The
network was getting stuck at the same point after processing a single
batch, and processing additional batches changed nothing training-wise
(performance, weights, etc., were all stuck). I tried different
initialization ranges for the weight matrices, a different learning
rate, different network architectures (unit types, different number of
units and layers) but the results were still the same.</p>
<p>In the same post, I talked about sparsity not being enforced at the
output. Since the whole dataset was sparse-coded and we were trying to
predict the vector of sparse coefficients for the next frame, all
outputs were expected to be as sparse as the inputs. This post
describes what I did to try to mitigate this issue. I used stochastic
neurons that enforce sparsity in the output layer of a network with
the same architecture as the previous network. This brings in the
problem of propagating gradients through stochastic units, but
fortunately David Warde-Farley pointed me to a paper which presents
some alternative solutions <a class="citation-reference" href="#bengio2013" id="id1">[Bengio2013]</a>.</p>
<p>The units I used at the output layer are similar to the ones called
"Stochastic times Smooth" (STS) in <a class="citation-reference" href="#bengio2013" id="id2">[Bengio2013]</a>. The idea is that the
output of a neuron is equal to the product of a stochastic part
(sampled from a binomial distribution with probability
<span class="math">\(\sqrt{p_i}\)</span>) and a smooth part (for example, a sigmoid or a
linear function). The value of <span class="math">\(p_i\)</span> itself comes from a
non-linear computation based on the unit's input. The stochastic
part serves as a "gater" which prevents the activation with a
probability determined by the activation at the "gater path" of the
unit. A sparsity constraint is imposed by combining a
KL-divergence criterion on the sigmoids in the "gater path" of the
unit with added noise (as explained in sections 1 and 2 of Appendix A
of the cited paper).</p>
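<p>To make the gating idea concrete, here is a minimal NumPy sketch of the forward pass of an STS-like unit. It is not the implementation I used: the function name <tt class="docutils literal">sts_forward</tt> is hypothetical, the gater path is reduced to a single sigmoid, the smooth part is assumed to be linear, and nothing here deals with propagating gradients through the sampling step.</p>

```python
import numpy as np

def sts_forward(gater_input, smooth_input, rng):
    """Forward pass of a 'Stochastic times Smooth'-style unit (sketch only)."""
    # Gater path: a sigmoid maps the gater pre-activation to p_i in (0, 1).
    p = 1.0 / (1.0 + np.exp(-gater_input))
    # Stochastic part: binary gate that is 1 with probability sqrt(p_i).
    gate = (rng.random(p.shape) < np.sqrt(p)).astype(float)
    # Smooth part: assumed linear here, so it is just the input itself.
    smooth = smooth_input
    return gate * smooth, gate

rng = np.random.default_rng(0)
gater_input = np.full((1000, 8), -4.0)      # strongly negative -> small p_i
smooth_input = rng.normal(size=(1000, 8))
out, gate = sts_forward(gater_input, smooth_input, rng)
print("fraction of active outputs:", gate.mean())
```

<p>With a strongly negative gater pre-activation most outputs are exactly zero, which is the kind of sparsity the output layer is supposed to enforce.</p>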
<!-- add image showing how the STS unit is structured -->
<p>I found an implementation for these units by one of the authors
(Nicholas Léonard) at <a class="reference external" href="https://github.com/nicholas-leonard/delicious">GitHub</a>, and updated it to the current Pylearn2
interface. His implementation had support for "hybrid" STS units which
are semi-stochastic (i.e., part of the output is deterministic), but I
did not use it. These units have a 2-layer non-linear gater path (I
used two sigmoidal layers) and a linear path for the output.</p>
<p>Now, for the results: sadly, switching to these units made no
significant difference at first. I did some tests using the same
training/testing/validation sets from Vincent's TIMIT dataset but
filtering it such that there were only sentences from male
speakers. The network was still stuck at the same objective value
after the first epoch. So, I decided to look at an alternative
hypothesis: maybe less sparsity or a different sparse coding
configuration could help?</p>
<div class="section" id="using-an-undercomplete-gammatone-dictionary">
<h2>Using an undercomplete gammatone dictionary</h2>
<p>After seeing that tinkering with the hyperparameters and
switching the output layer to an STS layer did not solve my problem, I
decided to play a bit with the inputs/outputs I was using. My sparse
coding scheme was resulting in a very sparse representation, as each
frame with 160 samples was being represented by up to 16 sparse coding
coefficients. Since the dictionary had 950 different atoms, this means
that at most approximately 1.7% of the coefficients were non-zero, both
in the input and in the output. Note that while the quality of the
reconstruction is acceptable (as you can hear in the samples below),
this tells us nothing about whether it is a good representation
phone-wise (i.e., whether there is any relationship between the chosen atoms
and specific phones). Given the poor results I had with the previous
representation, I suspect the representation I chose could be one of the
culprits for the bad performance.</p>
<p>To test the effect of a different sparse representation, I decided to
switch the dictionary used by my sparse coding scheme to an
undercomplete dictionary (i.e., a dictionary where the number of atoms
is smaller than the length of each atom), using longer frames and
increasing the overlap to 87.5% instead of 50%. Another thing that
motivated this change was realizing my previous dictionary still had a
big problem: a dictionary with atoms of length 160 was not able to
hold full atoms for frequencies lower than 1000 Hz, as the envelope
of lower-frequency atoms decays more slowly than that of high-frequency
ones. To be able to have atoms for the desired frequency range (150 to
8000 Hz), I would need to increase the atom length. However, the
tradeoff between atom length and number of atoms in an overcomplete
dictionary would make the sparse coefficient vector even longer than
the one I had (950 coefficients). I ran some experiments with an
undercomplete dictionary with longer frames (1600 samples, equivalent
to 100 ms at a 16 kHz sampling rate) and more frequency values for the
gammatones (64, instead of 50), and felt that the reconstruction
quality decreased a bit. However, increasing the overlap between
frames led to a decent reconstruction quality. You can listen to the
original sample and the three reconstructed samples below:</p>
<p> Original: <br>
<audio controls="controls" >
<source src="files/original.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> Reconstructed with "wrong" overcomplete dictionary (length 160, overlap 50%, atoms not correctly limited at frame borders): <br>
<audio controls="controls" >
<source src="files/reconst_160_80.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> Reconstructed with undercomplete dictionary (length 1600, overlap 50%): <br>
<audio controls="controls" >
<source src="files/reconst_1600_800.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> Reconstructed with undercomplete dictionary (length 1600, overlap 87.5%): <br>
<audio controls="controls" >
<source src="files/reconst_1600_1400.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p><p>The final dictionary I used had an atom length of 1600 and 1536 atoms.</p>
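<p>As a quick sanity check on the numbers quoted in this section, the small sketch below computes the frame duration, hop size and overlap in samples for the three configurations, plus the fraction of non-zero coefficients in the earlier overcomplete setup. The helper name <tt class="docutils literal">frame_layout</tt> is made up for this illustration.</p>

```python
def frame_layout(frame_len, overlap, fs=16000):
    """Return (duration in ms, hop in samples, overlap in samples)."""
    hop = int(round(frame_len * (1.0 - overlap)))
    return 1000.0 * frame_len / fs, hop, frame_len - hop

for frame_len, overlap in [(160, 0.5), (1600, 0.5), (1600, 0.875)]:
    duration_ms, hop, overlap_samples = frame_layout(frame_len, overlap)
    print(frame_len, overlap, duration_ms, hop, overlap_samples)

# Earlier overcomplete setup: up to 16 non-zero coefficients out of 950 atoms.
nonzero_fraction = 16 / 950
print("non-zero fraction: %.4f" % nonzero_fraction)
```

<p>With 87.5% overlap, a 1600-sample frame advances by only 200 samples, so each sample is covered by eight frames instead of two.</p>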
</div>
<div class="section" id="trouble-with-a-capital-t-as-in-import-theano-tensor-as-t">
<h2>Trouble with a capital T (as in <tt class="docutils literal">import theano.tensor as T</tt>)</h2>
<p>Unfortunately, even after replacing the previous dataset with the one
generated with the undercomplete dictionary, the behavior of my
network during training was still the same... with the added problem
of NaNs eventually appearing after being stuck for around 30
iterations (with <a class="reference external" href="https://github.com/jfsantos/ift6266h14/blob/master/experiments/mlp_sparse/sp1600_conditional.yaml">this</a> YAML file, which uses only a handful of sentences
from the validation set to save processing time). I am not sure
what is causing this, but searching the <a class="reference external" href="https://groups.google.com/forum/#!topic/pylearn-users/yr-i_RzY9a0">pylearn-users</a> mailing
list, I saw that David Krueger also had a <a class="reference external" href="http://dskspeechsynthesis.wordpress.com/2014/04/25/no-more-nans/">similar issue</a>
recently. In my case, using the <tt class="docutils literal">nan_guard</tt> as suggested by Ian in
this thread showed that the error happened during the computation of a
weight matrix inside the STS layer, namely the second weight matrix
used in the gater path of the unit (second layer of the non-linear
part). The error log is very low level as you can see <a class="reference external" href="files/0.err">here</a>, which
makes sense since it comes from inside an optimized Theano
graph. Altering the sparsity target in the output layer seems to avoid
this, but the results are still the same. Additionally, the predicted
coefficients are very far from the sparsity target I have set for the
output layer: 10% of the coefficients should be non-zero, but I am
getting something close to 50% instead (approximately 750 non-zero
coefficients per frame).</p>
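<p>A simple way to quantify both symptoms (NaNs and outputs that are much denser than the target) is to inspect the predicted coefficient matrix directly. The sketch below is only an illustration in plain NumPy, independent of Pylearn2 and Theano; <tt class="docutils literal">output_diagnostics</tt> is a hypothetical helper and the example array is synthetic, built to mimic roughly 750 non-zero coefficients out of 1536.</p>

```python
import numpy as np

def output_diagnostics(coeffs, tol=0.0):
    """Summarize a (n_frames, n_coeffs) array of predicted coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    # Entries with magnitude above tol count as non-zero (NaN counts as zero).
    nonzero = np.abs(coeffs) > tol
    return {
        "has_nan": bool(np.isnan(coeffs).any()),
        "mean_nonzero_per_frame": float(nonzero.sum(axis=1).mean()),
        "nonzero_fraction": float(nonzero.mean()),
    }

# Synthetic stand-in for network outputs: 4 frames, 750 active coefficients each.
pred = np.zeros((4, 1536))
pred[:, :750] = 1.0
report = output_diagnostics(pred)
print(report)
```

<p>For real network outputs, a small positive <tt class="docutils literal">tol</tt> is probably more informative than an exact comparison with zero.</p>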
<p>Here is an example of an audio file generated with the trained
network, using the previous frame and the previous, current, and next
phone as input:</p>
<p> <audio controls="controls" >
<source src="files/test_sparse.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p><p>(Yes, it does not sound like speech at all.)</p>
</div>
<div class="section" id="ideas-for-the-future">
<h2>Ideas for the future?</h2>
<p>None of my experiments led to results that would convince one that
using a sparse representation based on gammatones would be a useful
thing for speech synthesis. Architectures using time-domain audio
samples as the input/output had much more exciting results. Looking
forward, I would like to experiment with my other representation based
on the gammatonegram and convolutional neural nets. As the
gammatonegram for consecutive frames can be seen as an image (where
each time-frequency cell is a "pixel"), the usual 2D CNNs could be
tested.</p>
<p>Even though our course is over, my PhD research topic is speech
processing, so I believe I'll continue playing with deep learning in
the future. I am interested in investigating applications of deep
learning to speech enhancement systems. There is a very recent paper
(to appear in this year's ICASSP) where the authors used a DNN to
learn spectral masks to reduce reverberation in speech signals
<a class="citation-reference" href="#han2014" id="id3">[Han2014]</a>. It would be interesting to see if a similar idea could be
used not only for reverberation, but also for speech enhancement in
different kinds of environments. Other ideas will hopefully pop up
along the way!</p>
<table class="docutils citation" frame="void" id="bengio2013" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[Bengio2013]</td><td><em>(<a class="fn-backref" href="#id1">1</a>, <a class="fn-backref" href="#id2">2</a>)</em> Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” arXiv:1308.3432 [cs], Aug. 2013.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="han2014" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[Han2014]</a></td><td>K. Han, Y. Wang and D. Wang, “Learning spectral mapping for speech dereverberation”, To appear in the Proceedings of the IEEE ICASSP 2014, 2014. Available at <a class="reference external" href="http://www.cse.ohio-state.edu/~dwang/papers/HWW.icassp14.pdf">http://www.cse.ohio-state.edu/~dwang/papers/HWW.icassp14.pdf</a>.</td></tr>
</tbody>
</table>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Using an auditory-inspired representation for speech2014-03-31T14:00:00-04:002014-03-31T14:00:00-04:00jfsantostag:None,2014-03-31:/gammatone.html<p>I <a class="reference external" href="http://www.seaandsailor.com/dict_learning.html">previously</a>
described an approach to representing speech signals by decomposing
them to an arbitrary dictionary (using a sparse coding algorithm such
as Orthogonal Matching Pursuit). In that post, I showed that learning
a representation from the data by using a dictionary learning method
could be useful. However, there were some problems with that
approach. First, the dictionary atoms were not localized in time: the
atoms I learned from the data were waveforms spreading throughout the
entire frame. This behavior has led to issues when reconstructing the
signal, as nothing guarantees the last sample in the <span class="math">\(k^{th}\)</span>
frame will be close to the first sample in the <span class="math">\((k+1)^{th}\)</span>
frame. The second issue was related to not using overlapped windows to
split/resynthesize the signal. This is one of the main reasons that
made the signals I generated previously so noisy.</p>
<p>In order to solve these problems, I added two updates to my previous
code:</p>
<ol class="arabic simple">
<li>Dropped the dictionary I learned from the data and switched to a
gammatone dictionary.</li>
<li>Generated the audio frames using Hamming windows with 50% overlap.</li>
</ol>
<p>In the next sections, I will give a brief description of and motivation
for each of these updates. I will also show why they didn't work as
well as I expected and how they inspired another architecture.</p>
<div class="section" id="gammatone-functions-and-gammatone-based-dictionary">
<h2>Gammatone functions and gammatone-based dictionary</h2>
<p>Gammatone filters are a popular way of modeling the auditory
processing at the cochlea. Basically, the cochlea is interpreted as a
filterbank whose impulse responses follow the equation below (the
product of a <em>gamma</em>-distribution envelope and a cosine, i.e., a <em>pure tone</em>):</p>
<div class="math">
\begin{equation*}
g(t) = at^{n-1}e^{-2\pi b t}\cos(2 \pi ft + \phi)
\end{equation*}
</div>
<p>In this equation, <span class="math">\(a\)</span> is the amplitude, <span class="math">\(b\)</span> corresponds to the filter's bandwidth,
<span class="math">\(n\)</span> is the filter order, <span class="math">\(f\)</span> is the center frequency, and
<span class="math">\(\phi\)</span> is the phase of the carrier. The parameters <span class="math">\(b\)</span> and <span class="math">\(n\)</span> can
be fixed for the entire filterbank, while the center frequencies
<span class="math">\(f\)</span> are usually defined according to the cochlea's critical
frequencies. One way of computing these frequencies is by using the
Equivalent Rectangular Bandwidth (ERB), which gives an approximation
to the bandwidths of the human auditory filters:</p>
<div class="math">
\begin{equation*}
ERB_j = \frac{f_j}{Q_{ear}} + B_{min}
\end{equation*}
</div>
<p>Here, <span class="math">\(Q_{ear} = 9.26449\)</span> and <span class="math">\(B_{min} = 24.7\)</span> are
constants corresponding to the Q factor and minimum bandwidth of human
auditory filters.</p>
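<p>One common way to turn this equation into a set of center frequencies is to space them uniformly on the scale obtained by integrating the inverse of the ERB, which is logarithmic in the frequency plus a constant offset. The sketch below follows that construction; the function name, the choice of including both endpoints, and the use of 50 frequencies between 150 and 8000 Hz as the example are assumptions made for illustration, and my library may compute them slightly differently.</p>

```python
import numpy as np

Q_EAR = 9.26449   # Q factor of the human auditory filters
B_MIN = 24.7      # minimum bandwidth in Hz

def erb(f):
    """Equivalent Rectangular Bandwidth at frequency f (Hz)."""
    return f / Q_EAR + B_MIN

def erb_spaced_frequencies(f_low, f_high, n_freqs):
    """Center frequencies equally spaced on the ERB-rate scale (sketch)."""
    # Integrating 1 / erb(f) gives Q_EAR * log(f + Q_EAR * B_MIN) + const,
    # so equal ERB-rate steps are equal steps in log(f + offset).
    offset = Q_EAR * B_MIN
    log_points = np.linspace(np.log(f_low + offset),
                             np.log(f_high + offset), n_freqs)
    return np.exp(log_points) - offset

fc = erb_spaced_frequencies(150.0, 8000.0, 50)
print(fc[:3], fc[-3:])
print("bandwidth at lowest/highest frequency:", erb(fc[0]), erb(fc[-1]))
```

<p>The resulting frequencies are dense at the low end and sparse at the high end, mirroring the way the auditory filter bandwidths grow with frequency.</p>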
<p>Gammatone functions are used in auditory modelling because they match
the resonance of different regions in the cochlea. As shown in
<a class="citation-reference" href="#smith2006" id="id1">[Smith2006]</a>, human speech can be sparsely represented by gammatone
atoms. <a class="citation-reference" href="#strahl2008" id="id2">[Strahl2008]</a> later showed that a sparse gammatone model can
be optimized for English speech, even though the optimized model does
not match the human auditory filters anymore.</p>
<p>A gammatone dictionary can be built similarly to a Gabor dictionary,
as gammatones are localized both in time and frequency. First, we have
to choose a set of frequencies; usually, you pick the number of
frequencies you want and the range, and use the ERB equation to find
equally spaced frequencies in the ERB space (these would be the so-called
critical frequencies). Then, we select the resolution of our
atoms (which has to be less than or equal to the frame length in our
application) and time-shift the atoms inside the frame by a
specified amount. The following Python code does that for a single center frequency (and also
normalizes the dictionary at the end):</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">gammatone_matrix</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="n">fc</span><span class="p">,</span> <span class="n">resolution</span><span class="p">,</span> <span class="n">step</span><span class="p">):</span>
    <span class="sd">"""Dictionary of gammatone functions"""</span>
    <span class="c1"># Time shifts of the atoms inside the frame, `step` samples apart</span>
    <span class="n">centers</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">resolution</span> <span class="o">-</span> <span class="n">step</span><span class="p">,</span> <span class="n">step</span><span class="p">)</span>
    <span class="n">D</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">empty</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">centers</span><span class="p">),</span> <span class="n">resolution</span><span class="p">))</span>
    <span class="c1"># One row per shifted atom; gammatone_function is defined in the library linked below</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">center</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">centers</span><span class="p">):</span>
        <span class="n">D</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">gammatone_function</span><span class="p">(</span><span class="n">resolution</span><span class="p">,</span> <span class="n">fc</span><span class="p">,</span> <span class="n">center</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="n">b</span><span class="p">)</span>
    <span class="c1"># Normalize every atom (row) to unit Euclidean norm</span>
    <span class="n">D</span> <span class="o">/=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">D</span> <span class="o">**</span> <span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">D</span>
</pre>
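<p>The snippet above relies on a <tt class="docutils literal">gammatone_function</tt> helper that is not shown in this post. The self-contained sketch below is only a guess at what such a helper could look like, written directly from the gammatone equation: the names <tt class="docutils literal">gammatone_sketch</tt> and <tt class="docutils literal">gammatone_dictionary_sketch</tt>, the default order of 4, the 100 Hz bandwidth, the 16 kHz sampling rate, and the treatment of the shift as the atom's onset are all assumptions. The example frequencies are linearly spaced placeholders rather than ERB-spaced ones.</p>

```python
import numpy as np

def gammatone_sketch(resolution, fc, center, fs=16000.0, n=4, b=100.0, phi=0.0):
    """Gammatone atom of `resolution` samples whose onset is at sample `center`."""
    t = (np.arange(resolution) - center) / fs
    active = t >= 0                      # the atom is zero before its onset
    t_pos = np.where(active, t, 0.0)
    g = (t_pos ** (n - 1) * np.exp(-2 * np.pi * b * t_pos)
         * np.cos(2 * np.pi * fc * t_pos + phi))
    return g * active

def gammatone_dictionary_sketch(freqs, resolution, step):
    """Stack time-shifted atoms for every frequency and normalize each row."""
    shifts = np.arange(0, resolution - step, step)
    D = np.array([gammatone_sketch(resolution, fc, shift)
                  for fc in freqs for shift in shifts])
    D /= np.sqrt(np.sum(D ** 2, axis=1))[:, np.newaxis]
    return D

freqs = np.linspace(150.0, 8000.0, 50)   # placeholder spacing for the example
D = gammatone_dictionary_sketch(freqs, resolution=160, step=8)
print(D.shape)
```

<p>With 50 frequencies, atoms of length 160 and shifts of 8 samples, there are 19 shifts per frequency, which gives the 950 atoms used in the experiment described later in this post.</p>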
<p>For illustration, see below 5 time-shifted versions of the same
gammatone (note that in the actual dictionary, we probably want the
time-shifted atoms to overlap a bit more than in this figure).</p>
<img alt="" class="align-center" src="images/gammatones.png" style="width: 500px;" />
<p>See my gammatone sparse coding library <a class="reference external" href="https://github.com/jfsantos/ift6266h14/blob/master/sparse_coding/sparse_coding_gammatone.py">here</a>, and an updated version
of my IPython notebook for sparse coding <a class="reference external" href="https://github.com/jfsantos/ift6266h14/blob/master/sparse_coding/Sparse%20coding%20with%20a%20multiscale%20Gammatone%20dictionary.ipynb">there</a> for more details. The
test code in the library reads a wave file, segments it into frames of 2048 samples
with 50% overlap, windows each frame with a Hanning window (see next
section for details) and decomposes each frame using gammatone
atoms. The reconstruction in this example uses 200 non-zero
coefficients per frame and the dictionary has 3150 atoms. This amounts
to a compression of more than 10 times, but the reconstruction does
not sound as bad as the ones we've seen previously.</p>
</div>
<div class="section" id="overlapping-windows">
<h2>Overlapping windows</h2>
<p>A window function is a function that has non-zero values only inside a
given interval. The most classic example is the rectangular
window:</p>
<div class="math">
\begin{equation*}
w_{rect}[n] = \begin{cases} 1, \mbox{if } n_0 \leq n \leq n_f \\
0, \mbox{ otherwise} \end{cases}
\end{equation*}
</div>
<p>However, a problem with the rectangular window is that it does nothing
to smooth the signal at the window borders. If we are processing a
signal on a per-frame basis and then reconstructing it by
concatenating the processed frames, nothing guarantees continuity when
we join the processed frames. These abrupt changes introduce broadband
noise bursts in our signal, which is something that we probably do not
want!</p>
<p>A way to mitigate this problem is to do <a class="reference external" href="https://ccrma.stanford.edu/~jos/parshl/Overlap_Add_Synthesis.html">overlap-add</a>
synthesis. Instead of shifting a full frame at a time and using
rectangular windows, we overlap frames by a certain amount (25%, 50%,
and 75% are commonly used values) and multiply each frame by a smooth
window. We use window functions in such a way that the overlapped
windows always sum to a constant (unity, after appropriate scaling). The figure below shows two <a class="reference external" href="https://ccrma.stanford.edu/~jos/sasp/Hamming_Window.html">Hamming
windows</a> with an overlap of 50% (blue and green curves), and the sum
of both windows (red curve). If we keep overlapping windows like this,
overlap-add is an identity operation (i.e., we do not change the final
result as long as we do not process the frames). Of course, in our
case we are processing the frames, but overlap-add will help a bit in
mitigating the abrupt changes between frames as we are now summing the
values in overlapping frames to reconstruct our output instead of just
connecting two non-overlapping frames.</p>
<img alt="" class="align-center" src="images/hamming_windows.png" style="width: 600px;" />
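<p>The identity property is easy to verify numerically. The sketch below uses a periodic Hann window, for which two windows at 50% overlap sum exactly to one; a periodic Hamming window behaves the same way except that the sum is the constant 1.08. The function name and the random test signal are only for illustration.</p>

```python
import numpy as np

def overlap_add_identity(x, frame_len):
    """Window frames with 50% overlap and add them back without processing."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)   # periodic Hann
    y = np.zeros_like(x)
    for start in range(0, len(x) - frame_len + 1, hop):
        y[start:start + frame_len] += window * x[start:start + frame_len]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=1600)
y = overlap_add_identity(x, frame_len=160)
# The first and last half-frames are covered by a single window only,
# so the comparison is restricted to the interior of the signal.
print(np.allclose(y[80:-80], x[80:-80]))
```

<p>Only the first and last half frames differ from the input, because they are covered by a single window.</p>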
</div>
<div class="section" id="experiment-with-sparse-coding-using-gammatone-atoms">
<h2>Experiment with sparse coding using gammatone atoms</h2>
<p>Based on the ideas described above, I generated a sparse-coded version
of the TIMIT dataset using my gammatone sparse coding library. I used
gammatones with 50 different cosine frequencies between 150 and 8000
Hz, timeshifts of 8 samples, and frames of length 160 with 50%
overlap. For each frame, a sparse representation using 16 non-zero
coefficients was extracted by using Orthogonal Matching Pursuit with a
sparsity constraint.</p>
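<p>For readers unfamiliar with Orthogonal Matching Pursuit, the following is a minimal NumPy sketch of the greedy algorithm with a fixed number of non-zero coefficients. It is not the code I used to encode TIMIT; the function name <tt class="docutils literal">omp</tt> and the small random dictionary in the example are placeholders.</p>

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy OMP sketch. D holds unit-norm atoms as rows (n_atoms, frame_len)."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(D.shape[0])
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        correlations = D @ residual
        if support:
            correlations[support] = 0.0       # never select an atom twice
        support.append(int(np.argmax(np.abs(correlations))))
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[support].T, x, rcond=None)
        residual = x - D[support].T @ sol
        if np.linalg.norm(residual) < 1e-12:
            break
    coeffs[support] = sol
    return coeffs

rng = np.random.default_rng(0)
D = rng.normal(size=(60, 20))
D /= np.linalg.norm(D, axis=1)[:, np.newaxis]
x = 2.0 * D[3] - 1.5 * D[17] + 0.5 * D[42]
c = omp(D, x, n_nonzero=3)
print(np.count_nonzero(c), np.linalg.norm(x - D.T @ c))
```

<p>In the experiment above, the same idea is applied to every frame with a budget of 16 coefficients.</p>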
<p>This data and the one-hot encoded information about the previous,
current, and next phone were used to train an MLP with the following
characteristics:</p>
<ul class="simple">
<li>Two rectified linear hidden layers (2150 and 950 units, respectively);</li>
<li>Linear output layer with 950 units (one for each sparse coding coefficient);</li>
<li>Training: batch gradient descent (batch size of 512 samples), with squared error objective;</li>
<li>Termination criteria: 10 epochs with objective decrease lower than
<span class="math">\(10^{-6}\)</span> or 200 epochs.</li>
</ul>
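<p>The termination criterion in the last item can be read in more than one way. The sketch below implements one plausible reading (stop after 10 consecutive epochs that each improve the objective by less than the threshold, or after 200 epochs); it is a plain-Python illustration with a made-up function name, not the criterion classes provided by Pylearn2.</p>

```python
def should_stop(objectives, min_decrease=1e-6, patience=10, max_epochs=200):
    """Decide whether to stop, given the objective value after each epoch."""
    if len(objectives) >= max_epochs:
        return True
    if len(objectives) <= patience:
        return False
    # Look at the last `patience` epoch-to-epoch decreases.
    recent = objectives[-(patience + 1):]
    decreases = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(d < min_decrease for d in decreases)

# A completely flat objective triggers the criterion as early as possible.
print(should_stop([1.0] * 11))
# A steadily decreasing objective does not.
print(should_stop([10.0 - i for i in range(20)]))
```

<p>Under this reading, an objective that never moves stops training as early as the criterion allows, so an early stop by itself says nothing about the quality of the solution.</p>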
<p>However, something strange happened when I tried to train this
network: it converged after 10 epochs! Of course this would be too
good to be true, which means something terrible happened instead. In
my case, the training, testing, and validation objectives did not
change at all across training iterations. I still do not know exactly
what happened, but I suspect the large number of zeros in the input
and target values made the majority of the gradients equal to zero,
and without gradients none of the weights will change. Maybe a
different kind of initialization could solve this issue, but there are
other problems as well. Namely, this network does nothing to enforce
sparsity at the output, and in the end the output coefficients will
have a distribution that is very different from the target
coefficients (which are zero most of the time). Prof. Bengio suggested
that I could try making the output distribution the product of a
Bernoulli distribution and a Gaussian distribution: the first one
would say if that coefficient should be zero or not, and the latter
would give its value. However, he noted that this is just an arbitrary
statistical model which probably does not correspond to the real
behavior of the coefficients, and we would probably be better off
trying to estimate this distribution too (maybe with an RBM).</p>
<p>While trying to solve these issues, I had an idea for another
architecture that could be easier to implement...</p>
</div>
<div class="section" id="splitting-signal-into-spectral-envelope-and-phase">
<h2>Splitting signal into spectral envelope and phase</h2>
<p>As <a class="reference external" href="http://ift6266speechsynthesisjt.wordpress.com/2014/03/19/randomized-phases-preserves-speech-content-and-identity/">Jessica</a> pointed out in her blog, most of the relevant
information in a speech signal is encoded in its envelope. Because of
that, we are less sensitive to phase distortions than to envelope
distortions. As we have already discussed in class, speech envelope
variations are also slower than phase variations. Some speech coding
models (such as LPC) take these facts into account by encoding the
envelope and the phase separately (usually using a simpler model
for the phase than for the envelope).</p>
<p>It was also brought to my attention that a recent paper <a class="citation-reference" href="#han2014" id="id3">[Han2014]</a> to
be presented at this year's ICASSP uses gammatone filterbank features
to find spectral masks to use in speech dereverberation. The advantage
of using gammatone filterbanks instead of a simple STFT is that with
the gammatone filterbank, we are able to fine-tune spectral resolution
at lower frequency bands (the most important band for speech
content). While speech dereverberation is a totally different topic,
the feature space used in that paper is still relevant. They are
looking for spectral masks to filter an existing signal rather than
synthesizing one, so they can discard the phase completely. For our project,
we cannot do that, but we could work with a slightly different
approach.</p>
<p>We have one network that is trained on spectral envelopes, using a
similar approach to that of the paper. This network is trained using
the gammatonegram, which consists of the total gammatone band energy
in all channels of our filterbank per frame. The figure below depicts
how this is done:</p>
<img alt="" class="align-center" src="images/filterbank.png" style="width: 700px;" />
<p>Here, <span class="math">\(y_i[n], i = 1, \dots, 64\)</span> are the frame energies (sum of
squared samples) for each gammatone channel (I'm using 64 channels
here as this is what was used in <a class="citation-reference" href="#han2014" id="id4">[Han2014]</a> and can be a good starting
point). As inputs of this network, we would use the gammatonegram of a
number of previous frames, one-hot encoded phones for these frames
(and possibly some of the next frames), and the output would be the
gammatonegram of the next frame. This is not enough to resynthesize a
speech signal as we don't have the phase, but that could be solved by
training a separate model for phases, either for an overall phase or a
per-channel phase. Resynthesis is done according to the following
signal flow diagram:</p>
<img alt="" class="align-center" src="images/synthesis.png" style="width: 500px;" />
<p>Here, <span class="math">\(p_i[n], i=1, \dots, 64\)</span> are vectors representing the
phase of each channel and <span class="math">\(g_i[n], i=1, \dots, 64\)</span> are the
amplitudes for each gammatone channel (which could be either the
<span class="math">\(y_i[n]\)</span> values computed before for each frame or a smoothed
version of them).</p>
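<p>The frame energies that make up the gammatonegram are straightforward to compute once the filterbank outputs are available. The sketch below assumes the gammatone filterbank has already been applied and its outputs are stored in an array with one row per channel; the function name, the frame length and the hop size are illustrative choices, and the all-ones input is just a stand-in for real filterbank outputs.</p>

```python
import numpy as np

def frame_energies(channels, frame_len, hop):
    """Sum of squared samples per frame, for each filterbank channel.

    channels: array of shape (n_channels, n_samples)
    returns:  array of shape (n_channels, n_frames)
    """
    n_channels, n_samples = channels.shape
    n_frames = 1 + (n_samples - frame_len) // hop
    energies = np.empty((n_channels, n_frames))
    for k in range(n_frames):
        frame = channels[:, k * hop:k * hop + frame_len]
        energies[:, k] = np.sum(frame ** 2, axis=1)
    return energies

# Stand-in for the outputs of a 64-channel filterbank.
channels = np.ones((64, 1600))
gammatonegram = frame_energies(channels, frame_len=320, hop=160)
print(gammatonegram.shape)
```

<p>Each column of the result is one frame of the gammatonegram, i.e., one candidate input or target vector for the envelope network.</p>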
<p>For the network architecture, I am planning on using an MLP (possibly
with unsupervised pretraining) for the spectral envelopes. For the
phase components, I will initially try RBMs using previous phase
samples, phone codes, and speaker characteristics (pitch, gender,
etc.) as input. I expect to be able to use simpler models for the
phase (or at least be able to control this model's complexity, as I
believe there should be a tradeoff between speech quality and the
accuracy of the phase models). I have already extracted the gammatone
features from the whole database and will report results for the
spectral envelope model in my next post.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils citation" frame="void" id="smith2006" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[Smith2006]</a></td><td>E. C. Smith and M. S. Lewicki, “Efficient auditory coding,” Nature, vol. 439, no. 7079, pp. 978–982, 2006.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="strahl2008" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[Strahl2008]</a></td><td>S. Strahl and A. Mertins, “Sparse gammatone signal model optimized for English speech does not match the human auditory filters,” Brain research, vol. 1220, pp. 224–233, 2008.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="han2014" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[Han2014]</td><td><em>(<a class="fn-backref" href="#id3">1</a>, <a class="fn-backref" href="#id4">2</a>)</em> K. Han, Y. Wang and D. Wang, “Learning spectral mapping for speech dereverberation”, To appear in the Proceedings of the IEEE ICASSP 2014, 2014. Available at <a class="reference external" href="http://www.cse.ohio-state.edu/~dwang/papers/HWW.icassp14.pdf">http://www.cse.ohio-state.edu/~dwang/papers/HWW.icassp14.pdf</a>.</td></tr>
</tbody>
</table>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Experiments with a 2 layer MLP incorporating phone information2014-03-19T21:00:00-04:002014-03-19T21:00:00-04:00jfsantostag:None,2014-03-19:/exp_mlp.html<p>During the spring break I decided to run some experiments with MLPs as
generative models, using both acoustic samples and phone codes as input.
The experiment's objective is two-fold: I wanted to investigate if using
information from surrounding frames improves the models (when compared
to what our colleagues have found), and I also wanted to have a baseline
to compare to the models based on sparse coding that I have been working
on. In the experiments by
<a class="reference external" href="http://ift6266hjb.wordpress.com/2014/02/10/speech-synthesis-project-description-and-first-attempt-at-a-regression-mlp/">Hubert</a>,
<a class="reference external" href="http://jpraymond.wordpress.com/2014/02/27/results-with-a-one-hidden-layer-neural-net/">Jean-Phillipe</a>,
and <a class="reference external" href="http://twuilliam.wordpress.com/2014/02/27/quick-experiment-breaking-the-sin-in-one-line/">William</a>, only the acoustic samples were used as
input.
<a class="reference external" href="http://amjadmahayri.wordpress.com/2014/02/27/frame-prediction-given-phoneme-window/">Amjad</a>
has already done some tests incorporating phone information, but it
seems he is using only the current phone.</p>
<p>In my experiments, I updated
<a class="reference external" href="http://vdumoulin.github.io/articles/timit-part-5">Vincent's</a> dataset
implementation in order to make it provide the phones corresponding to
the current, previous, and next frame. The code can be found in my
<a class="reference external" href="https://github.com/jfsantos/research">fork</a>. Previously I was using
Python code to set up pylearn2 experiments, but I decided to switch to
YAML for these experiments as I didn't need to do anything fancy. The
YAML for this experiment can be found
<a class="reference external" href="https://github.com/jfsantos/ift6266h14/blob/master/experiments/mlp_acoustic/ac160_ph3_rl2_malespkr.yaml">here</a> and the serialized model is <a class="reference external" href="https://github.com/jfsantos/ift6266h14/blob/master/experiments/mlp_acoustic/ac160_ph3_rl2_malespkr.pkl">here</a>.
The dataset was configured as follows:</p>
<ul class="simple">
<li>Frame length: 160 samples</li>
<li>Frame overlap: 0 (not ideal, but it can be seen as a subsampling of
the complete dataset)</li>
<li>Frames per example: 1</li>
<li>Number of predicted samples: 1</li>
<li>Phone information: one-hot encoded phone code for the previous,
current, and next frame</li>
</ul>
<p>With this configuration, each example is a vector with
<span class="math">\(160 + 3*62 = 346\)</span> values. The MLP was set up and trained as
follows:</p>
<ul class="simple">
<li>Two rectified linear hidden layers (the first with 500 and the second
with 100 units)</li>
<li>Linear output layer with a single unit (a single sample is predicted
for each input)</li>
<li>Training algorithm: SGD with a fixed learning rate of 0.01, running for
a maximum of 200 epochs (the alternative convergence condition was
10 iterations with improvement lower than <span class="math">\(10^{-10}\)</span>). The batch
size was 512 examples.</li>
</ul>
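<p>The input layout and network shape described above can be sketched in plain NumPy. This is only an illustration, not the pylearn2 YAML used for the experiment: the helper names (<tt class="docutils literal">one_hot</tt>, <tt class="docutils literal">build_example</tt>, <tt class="docutils literal">init_mlp</tt>, <tt class="docutils literal">forward</tt>), the phone indices, and the weight initialization are all assumptions made for the example.</p>

```python
import numpy as np

N_SAMPLES = 160   # acoustic samples per frame (from the dataset configuration)
N_PHONES = 62     # size of each one-hot phone code, as in 160 + 3*62

def one_hot(index, size=N_PHONES):
    """Return a one-hot vector with a 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def build_example(frame, prev_phone, cur_phone, next_phone):
    """Concatenate one frame with the previous/current/next phone codes."""
    return np.concatenate([frame, one_hot(prev_phone),
                           one_hot(cur_phone), one_hot(next_phone)])

def init_mlp(rng, sizes=(346, 500, 100, 1), scale=0.01):
    """Random weights for 346 -> 500 -> 100 -> 1 (illustrative initialization)."""
    return [(rng.normal(0.0, scale, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Two rectified linear hidden layers followed by a linear output unit."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
x = build_example(rng.normal(size=N_SAMPLES),
                  prev_phone=3, cur_phone=3, next_phone=17)
params = init_mlp(rng)
y = forward(params, x)   # predicted next sample, shape (1,)
```

<p>Each example carries exactly three active phone entries after the 160 acoustic samples, which is where the 346 input values come from.</p>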
<p>Total training time for this experiment was approximately 1.15 hours,
running on a CPU (Intel Core i7-2600, 8 GB RAM, with Theano running over
MKL and using 4 cores simultaneously). As mentioned before, I considered
10 iterations without improvement as the convergence condition, which
was met at iteration 159. The errors found after convergence (on the
normalized data, i.e., centered and divided by the standard deviation)
were the following:</p>
<ul class="simple">
<li>Training error: 0.02284</li>
<li>Test error: 0.03309</li>
<li>Validation error: 0.05482</li>
</ul>
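<p>The stopping rule used for this run (at most 200 epochs, or 10 iterations without a meaningful improvement) can be sketched as below. This is not pylearn2's own termination criterion code; the function name <tt class="docutils literal">should_stop</tt> and its exact comparison are assumptions, and the threshold is taken to be <span class="math">\(10^{-10}\)</span>.</p>

```python
def should_stop(errors, patience=10, min_improvement=1e-10, max_epochs=200):
    """Decide whether to stop, given the monitored error after each epoch.

    Stops when `max_epochs` is reached, or when none of the last `patience`
    epochs improved on the best earlier error by more than `min_improvement`.
    """
    if len(errors) >= max_epochs:
        return True
    if len(errors) <= patience:
        return False
    best_before = min(errors[:-patience])
    best_recent = min(errors[-patience:])
    return best_before - best_recent <= min_improvement

history = [1.0 / (epoch + 1) for epoch in range(20)]   # still improving
print(should_stop(history))                  # False
print(should_stop(history + [0.05] * 10))    # True
```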
<p>A plot showing the evolution of the errors over epochs can be seen below.</p>
<img alt="" src="images/exp_mlp_1.png" />
<p>To evaluate the trained network as a synthesizer, I took a sequence of
phone codes straight out of a sentence in the validation set and used it
as input to the MLP. As I did not have a previous frame, the initial
input is a frame containing only zeros. I played with a multiplicative
factor on the noise added by the Gaussian sampling, as <a class="reference external" href="http://davidtob.wordpress.com/2014/03/14/generating-one-phone-from-one-timit-speaker/">David
did</a>,
because using the test error directly gave me only bursts, as can be seen
below. The following multiplicative factors were tested:
<tt class="docutils literal">[0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]</tt>. As high noise levels
corrupt the signal too much, I filtered the outputs down to approximately
the telephone bandwidth (300-4000 Hz) with a <span class="math">\(4^{th}\)</span> order
Butterworth bandpass filter, just to reduce the overall effect of noisy
sampling. For low noise multipliers, all I got was a short burst, after
which the output stays at zero. However, increasing the noise level to
five times the mean square test error seems to give some more
structure: even though it has almost nothing to do with whatever should
have been synthesized, it does sound like multiple speakers babbling. The respective audio files and plots (acoustic waveform + spectrogram) can be found below:</p>
<p> 0.01: <br>
<audio controls="controls" >
<source src="files/y_noise_0.01.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 0.05: <br>
<audio controls="controls" >
<source src="files/y_noise_0.05.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 0.1: <br>
<audio controls="controls" >
<source src="files/y_noise_0.1.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 0.5: <br>
<audio controls="controls" >
<source src="files/y_noise_0.5.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 1.0: <br>
<audio controls="controls" >
<source src="files/y_noise_1.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 2.0: <br>
<audio controls="controls" >
<source src="files/y_noise_2.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 5.0: <br>
<audio controls="controls" >
<source src="files/y_noise_5.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> 10.0: <br>
<audio controls="controls" >
<source src="files/y_noise_10.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p><img alt="" src="images/exp_mlp_2.png" />
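<p>The generation procedure described above (start from an all-zero frame, predict one sample, add scaled Gaussian noise, feed the result back in) can be sketched as follows. It is a rough illustration under several assumptions: <tt class="docutils literal">predict_fn</tt> is a hypothetical stand-in for the trained MLP, the noise variance is taken to be the multiplier times the mean squared test error, and one phone triple is supplied per generated sample. The bandpass filtering step is left out.</p>

```python
import numpy as np

def synthesize(predict_fn, phone_codes, mse, noise_factor,
               frame_len=160, n_phones=62, seed=0):
    """Generate one sample per entry of `phone_codes`, feeding outputs back in.

    predict_fn   : hypothetical callable mapping a 346-value input to a scalar
    phone_codes  : sequence of (previous, current, next) phone indices
    mse          : mean squared test error of the trained model
    noise_factor : multiplier applied to `mse` to get the sampling variance
    """
    rng = np.random.default_rng(seed)
    frame = np.zeros(frame_len)            # no previous frame: start from zeros
    out = np.empty(len(phone_codes))
    for t, phones in enumerate(phone_codes):
        codes = np.zeros(3 * n_phones)
        for slot, phone in enumerate(phones):
            codes[slot * n_phones + phone] = 1.0
        mean = predict_fn(np.concatenate([frame, codes]))
        out[t] = rng.normal(mean, np.sqrt(noise_factor * mse))
        frame = np.append(frame[1:], out[t])   # slide the window by one sample
    return out

def toy_model(x):
    """Toy stand-in for the trained MLP: scaled mean of the last 10 samples."""
    return 0.9 * x[150:160].mean()

y = synthesize(toy_model, [(0, 1, 2)] * 1000, mse=0.03309, noise_factor=5.0)
silent = synthesize(toy_model, [(0, 1, 2)] * 50, mse=0.03309, noise_factor=0.0)
```

<p>With this toy predictor and no noise at all, the output never leaves zero, because a zero input frame predicts zero and nothing perturbs it. The noise term is what keeps the feedback loop moving.</p>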
<p>One interesting thing: I followed the same procedure to generate an output
during training, sometime around the <span class="math">\(130^{th}\)</span> iteration. The
output generated at that stage sounded much nicer than what I got
after the training finished, but unfortunately the pickled model was
overwritten because of the way I set up my YAML file, which overwrites
the old model every time an iteration improves the objective. The only
thing I kept was the output:</p>
<audio controls="controls" >
<source src="files/malespkr_rl2_not_converged.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio><div class="section" id="next-steps">
<h2>Next steps</h2>
<p>Moving forward, I will write in my next post about the (not so
successful) tests I did using sparse coding coefficients instead of
acoustic samples as inputs. I will also comment on some ideas for
incorporating more advanced models in my experiments.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Dictionary Learning and Sparse Coding for Speech Signals2014-02-25T17:00:00-05:002014-02-25T17:00:00-05:00jfsantostag:None,2014-02-25:/dictlearning.html<p>Sparse signal approximations are the basis for a variety of signal
processing techniques. Such approximations are usually employed with
the objective of having a signal representation that is more
meaningful, malleable, and robust to noise than the ones obtained by
standard transform methods <a class="citation-reference" href="#sturm2009" id="id1">[Sturm2009]</a>. The so-called dictionary
based methods (DBM) decompose a signal into a linear combination of
waveforms through an approximation technique such as Matching Pursuit
(MP) <a class="citation-reference" href="#mallat1993" id="id2">[Mallat1993]</a>, Orthogonal Matching Pursuit (OMP) <a class="citation-reference" href="#pati1993" id="id3">[Pati1993]</a>, or
basis pursuit <a class="citation-reference" href="#chen2001" id="id4">[Chen2001]</a>. The collection of waveforms that can be
selected for the linear combination is called a dictionary. This
dictionary is usually overcomplete, either because it is formed by
merging complete dictionaries or because the waveforms are chosen
arbitrarily (and we have more waveforms than the length of the signal
we want to represent).</p>
<div class="section" id="sparse-approximation-problem-formulations">
<h2>Sparse approximation problem formulations</h2>
<p>The sparse coding problem is usually formulated either as a
sparsity-constrained problem or as an error-constrained problem. The
formulations are as follows:</p>
<p>Sparsity-constrained:
<span class="math">\(\underline{\hat{\gamma}} = \underset{\underline{\gamma}}{arg\,min}\|\underline{x} - D \underline{\gamma}\|_2^2 \quad\text{s.t.}\quad \|\underline{\gamma}\|_0 \leq K\)</span></p>
<p>Error-constrained:
<span class="math">\(\underline{\hat{\gamma}} = \underset{\underline{\gamma}}{arg\,min}\|\underline{\gamma}\|_0 \quad\text{s.t.}\quad \|\underline{x} - D \underline{\gamma}\|_2^2 \leq \epsilon\)</span></p>
<p>In the first one, the idea is that we want to represent the signal by
a linear combination of up to K known waveforms. In the second
formulation, we want the squared error of the representation to be
below a certain threshold. Both formulations are useful, depending on
the problem you are trying to solve: the first one will lead to more
compact representations, while with the second one you can avoid
higher representation errors.</p>
<p>The second formulation is also useful for applications which need
denoising: consider you have a corrupted version of your signal, and
also that you know (more or less) the signal-to-noise ratio (SNR). If
the noise is very different from the signal you are interested in and
your dictionary is optimized to represent these signals only, it may
be the case that noise is not well represented by the waveforms in
your dictionary. So, you could use the second formulation, setting
<span class="math">\(\epsilon\)</span> as the estimated noise level, and expect that a good
part of the noise component is not going to be represented in the
sparse approximation.</p>
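<p>Both formulations can be approximated greedily with Orthogonal Matching Pursuit. The sketch below is a minimal NumPy version written for illustration, not the scikit-learn implementation: it follows the notation of the formulas above (atoms are the <em>columns</em> of <span class="math">\(D\)</span>), stops either after <span class="math">\(K\)</span> atoms or once the squared residual drops to <span class="math">\(\epsilon\)</span>, and includes a hypothetical helper, <tt class="docutils literal">noise_energy</tt>, that turns an SNR estimate into a value for <span class="math">\(\epsilon\)</span> under the assumption that signal and noise are uncorrelated.</p>

```python
import numpy as np

def omp(D, x, K=None, eps=None):
    """Greedy sparse approximation of x with the columns (atoms) of D.

    Stops after K atoms (sparsity-constrained) or once the squared residual
    norm is at most eps (error-constrained); at least one must be given.
    """
    if K is None and eps is None:
        raise ValueError("give K, eps, or both")
    n_atoms = D.shape[1]
    max_atoms = min(K if K is not None else n_atoms, D.shape[0], n_atoms)
    tol = 0.0 if eps is None else eps
    gamma = np.zeros(n_atoms)
    support = []
    coefs = np.zeros(0)
    residual = np.array(x, dtype=float)
    while len(support) < max_atoms and residual @ residual > tol:
        # Pick the unused atom most correlated with the current residual.
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = 0.0
        support.append(int(np.argmax(scores)))
        # Re-fit all selected coefficients by least squares
        # (this is the "orthogonal" step that plain MP lacks).
        coefs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coefs
    if support:
        gamma[support] = coefs
    return gamma

def noise_energy(x, snr_db):
    """Noise energy implied by an SNR estimate (uncorrelated signal and noise)."""
    return (x @ x) / (1.0 + 10.0 ** (snr_db / 10.0))

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = D[:, [3, 50, 200]] @ np.array([1.5, -2.0, 0.7])
gamma_k = omp(D, x, K=5)                 # at most 5 non-zero coefficients
gamma_e = omp(D, x, eps=1e-6)            # as many atoms as the bound requires
```

<p>For denoising, one would pass something like <tt class="docutils literal">eps=noise_energy(x, snr_db)</tt>, so that the pursuit stops once the unexplained energy is comparable to the estimated noise.</p>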
</div>
<div class="section" id="dictionary-learning">
<h2>Dictionary learning</h2>
<p>Reconstructing a speech signal based on a learned set of segments is not
a new idea. It is done in a well-known technique called vector
quantization (VQ). In VQ, the signal is reconstructed by using only a
single atom (or <em>codeword</em>, in VQ jargon) per signal
frame. The dictionary (or <em>codebook</em>) is usually designed by a
nearest-neighbor method, which aims to find the codebook that
reconstructs a signal from the codewords with the smallest
distances to the original signal frames, minimizing the residual.
K-means is a codebook learning algorithm for VQ that solves this problem
by dividing the training samples into <span class="math">\(K\)</span> clusters of the nearest
neighbors of each of the <span class="math">\(K\)</span> items in the initial codebook. The
codebook is then updated by finding the centroid of each of the
<span class="math">\(K\)</span> clusters. These steps are run iteratively until the algorithm
converges to a local minimum.</p>
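<p>The two alternating K-means steps described above (assign every frame to its nearest codeword, then move each codeword to the centroid of its cluster) can be sketched in a few lines of NumPy. The function names and the random initialization from training frames are choices made for this example, not part of any particular VQ toolkit.</p>

```python
import numpy as np

def nearest_codeword(frames, codebook):
    """Index of the closest codeword (squared Euclidean distance) per frame."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_codebook(frames, K, n_iter=50, seed=0):
    """Learn a K-entry VQ codebook from `frames` (one training frame per row)."""
    rng = np.random.default_rng(seed)
    # Initial codebook: K distinct training frames picked at random.
    picks = rng.choice(len(frames), size=K, replace=False)
    codebook = frames[picks].astype(float)
    for _ in range(n_iter):
        labels = nearest_codeword(frames, codebook)      # assignment step
        new_codebook = codebook.copy()
        for k in range(K):                               # update step
            members = frames[labels == k]
            if len(members) > 0:
                new_codebook[k] = members.mean(axis=0)
        if np.allclose(new_codebook, codebook):
            break                  # codebook is stable: a local minimum
        codebook = new_codebook
    return codebook

def vq_reconstruct(frames, codebook):
    """Replace every frame by its single nearest codeword."""
    return codebook[nearest_codeword(frames, codebook)]

rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0.0, 0.1, (50, 4)),
                    rng.normal(5.0, 0.1, (50, 4))])
codebook = kmeans_codebook(frames, K=2)
approx = vq_reconstruct(frames, codebook)
```

<p>Note that each frame is represented by exactly one codeword, which is the restriction that sparse coding relaxes by allowing several atoms per frame.</p>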
<p>For sparse coding, we want to use multiple atoms to reconstruct the
signal. In the snippet below, we generate a dictionary with 1024
waveforms by using the dictionary learning functions available in
<a class="reference external" href="http://scikit-learn.org">scikit-learn</a>, which are based on the method of <a class="citation-reference" href="#mairal2009" id="id5">[Mairal2009]</a>. The
training data consists of two minutes of audio from the TIMIT
database; sentences were randomly chosen and then split into frames of
256 samples each.</p>
<pre class="code python literal-block">
<span class="c1"># Build the dictionary</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">MiniBatchDictionaryLearning</span>
<span class="n">dico</span> <span class="o">=</span> <span class="n">MiniBatchDictionaryLearning</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_iter</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">D</span> <span class="o">=</span> <span class="n">dico</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span><span class="o">.</span><span class="n">components_</span>
</pre>
<img alt="" class="align-center" src="images/dictlearning_5_1.png" style="width: 720px;" />
<p>If we take a look into some of the learned waveforms in the figure
above, we'll see that we have both low-frequency, quasiperiodic
signals (which are probably matching vowels) and signals with more
high-frequency components that look a bit noisy (probably representing
stops/fricatives).</p>
</div>
<div class="section" id="reconstructing-speech-segments-using-sparse-coding-with-the-learned-dictionary">
<h2>Reconstructing speech segments using sparse coding with the learned dictionary</h2>
<p>Now that we have a dictionary which (supposedly) is good for
representing speech signals, let's use Orthogonal Matching Pursuit
(OMP) to reconstruct a speech segment based on a linear combination of
dictionary entries. Let's get 10 seconds of audio from TIMIT (from a
segment of the set that was not in the training set) and reconstruct
it using a sparse approximation. We use the sparsity-based constraint
form, as we are more interested in representing speech in a sparse
way:</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">SparseCoder</span>
<span class="c1"># Get sample speech segment to reconstruct (10 s, split into 256-sample frames)</span>
<span class="n">test_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">fs</span><span class="o">*</span><span class="mi">200</span><span class="p">:</span><span class="n">fs</span><span class="o">*</span><span class="mi">210</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">fs</span><span class="o">*</span><span class="mi">10</span><span class="o">//</span><span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span>
<span class="c1"># Encode it frame-by-frame using a linear combination of 20</span>
<span class="c1"># atoms per frame (sparsity-constrained OMP)</span>
<span class="n">coder</span> <span class="o">=</span> <span class="n">SparseCoder</span><span class="p">(</span><span class="n">dictionary</span><span class="o">=</span><span class="n">D</span><span class="p">,</span> <span class="n">transform_n_nonzero_coefs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
                    <span class="n">transform_alpha</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">transform_algorithm</span><span class="o">=</span><span class="s2">"omp"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">coder</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span>
<span class="c1"># Rebuild each frame as a weighted sum of dictionary atoms (rows of D)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">test_data</span><span class="o">.</span><span class="n">size</span><span class="p">)</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
    <span class="n">out</span><span class="p">[</span><span class="n">n</span><span class="o">*</span><span class="mi">256</span><span class="p">:(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">256</span><span class="p">]</span> <span class="o">=</span> <span class="n">numpy</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="n">n</span><span class="p">],</span> <span class="n">D</span><span class="p">)</span>
</pre>
<p>Here are the results: you can listen below to the original file and the reconstructed one.</p>
<p> Original: <br>
<audio controls="controls" >
<source src="files/orig.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio> </p>
<p> Reconstructed with 20 atoms/frame:<br>
<audio controls="controls" >
<source src="files/reconst.ogg" type="audio/ogg" />
Your browser does not support the audio element.
</audio></p><p>These figures show the original signal, the reconstructed one, and the squared error:</p>
<img alt="" class="align-center" src="images/dictlearning_10_1.png" />
<p>While the reconstruction error is low most of the time, considering
we are using only 20 non-zero values per frame to represent the
signal, as opposed to using 256 samples, we can clearly hear the
reconstruction-related artifacts. However, that may be OK if all we
want with the learned dictionary is to have a sparser representation
for speech that will be used later in our synthesizer.</p>
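<p>One way to put a number on "low most of the time" is to compute the squared error per frame and an overall signal-to-noise ratio. The helpers below are a small sketch with made-up names; they are not taken from the notebook, and the sine wave is only a stand-in for real speech.</p>

```python
import numpy as np

def reconstruction_snr_db(original, reconstructed):
    """Ratio of signal energy to squared-error energy, in decibels."""
    error = original - reconstructed
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(error ** 2))

def framewise_squared_error(original, reconstructed, frame_len=256):
    """Sum of squared errors inside each frame of `frame_len` samples."""
    n_frames = len(original) // frame_len
    used = n_frames * frame_len
    error = (original[:used] - reconstructed[:used]) ** 2
    return error.reshape(n_frames, frame_len).sum(axis=1)

# Stand-in signal: a reconstruction that is off by 10% in amplitude.
t = np.arange(2560) / 16000.0
original = np.sin(2 * np.pi * 220.0 * t)
reconstructed = 0.9 * original
snr = reconstruction_snr_db(original, reconstructed)
per_frame = framewise_squared_error(original, reconstructed)
```

<p>Frames where the per-frame error spikes are the likely places to hear artifacts, even when the overall SNR looks acceptable.</p>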
</div>
<div class="section" id="relationship-with-our-project-and-next-steps">
<h2>Relationship with our project and next steps</h2>
<p>I started working on some experiments comparing the performance of a
sample predictor to two other predictors: one based on LPC
coefficients and the other on a sparse representation of speech. As we
discussed in class, speech has some parameters that change quickly
(source/excitation signal), while others change slowly
(articulation-related). In the first experiments Prof. Bengio
suggested, we were working on an MLP-based generative model for
samples without any consideration for phones. His second suggestion
was to design a generative model for the next sample conditioned on
the previous, current, and next phone.</p>
<p>I started developing generative models based on MLPs for the three
representations above, using one-hot encoded phones and the relative
position in time of the current phone as inputs. For the model based
on LPCs, I am planning to have a separate generative model for the
excitation signal, which is going to work pretty much like the
next-sample predictor we worked on previously; this model could also
be based on the previous, current, and next phone, previous samples,
and things such as pitch/speaker gender. Unfortunately, due to a <a class="reference external" href="https://groups.google.com/forum/#!topic/pylearn-users/EZ3H8xP7gN8">bug</a>
in pylearn2 I have not been able to get them working yet. <a class="reference external" href="http://vdumoulin.github.io/">Vincent</a> said
there's already a <a class="reference external" href="https://github.com/lisa-lab/pylearn2/pull/512">pull request</a> which solves this
issue, and it seems it will be fixed soon.</p>
<p>Last note: you can view the IPython notebook containing all the code used to generate the dictionary and the plots <a class="reference external" href="http://nbviewer.ipython.org/urls/seaandsailor.com/files/dictlearning.ipynb">here</a>, or <a class="reference external" href="files/dictlearning.ipynb">download</a> it and run it interactively on your computer.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils citation" frame="void" id="sturm2009" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[Sturm2009]</a></td><td>B. L. Sturm, C. Roads, A. McLeran, and J. J. Shynk, “Analysis, Visualization, and Transformation of Audio Signals Using Dictionary-based Methods,” Journal of New Music Research, vol. 38, no. 4, pp. 325–341, 2009.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="mallat1993" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[Mallat1993]</a></td><td>S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, Dec. 1993.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="pati1993" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[Pati1993]</a></td><td>Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on, 1993, pp. 40–44.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="chen2001" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[Chen2001]</a></td><td>S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM journal on scientific computing, vol. 20, no. 1, pp. 33–61, 1998.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="mairal2009" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id5">[Mairal2009]</a></td><td>J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning for sparse coding,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 689–696.</td></tr>
</tbody>
</table>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Speech signal representations2014-02-01T18:00:00-05:002014-02-01T18:00:00-05:00jfsantostag:None,2014-02-01:/initial_representation.html<p>One of the objectives of our project is to learn a useful
representation from speech signals that can be used to synthesize new
(arbitrary) sentences. There are many different ways of representing
speech signals; those representations are usually tailored to specific
applications. In speech recognition, for example, we want to minimize
the variability from different speakers while keeping sufficient
information to discriminate different phonemes. In speech coding,
however, we usually want to keep information that is associated with
the speaker's identity as well as reduce the amount of data to be
stored/transmitted.</p>
<p>Our dataset was initially distributed as frame MFCCs (input) and
one-hot encoded phonemes (labels). While this representation is
usually enough for speech recognition, I believe it is not enough for
learning a useful representation for synthesis (as briefly mentioned
by Laurent Dinh in his <a class="reference external" href="http://deeprandommumbling.wordpress.com/2014/01/29/listening-to-a-vector">post</a>). The reason is that MFCCs are a
destructive/lossy representation of a speech signal: fundamental
frequency information is completely lost, as is the instantaneous
phase. MFCCs more or less represent the energy in
different frequency channels that are considered important for human
speech (following the Mel scale <a class="citation-reference" href="#stevens2005" id="id1">[Stevens2005]</a>).</p>
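<p>To give an idea of what "following the Mel scale" means for the
placement of those frequency channels, here is a minimal sketch. It
assumes the commonly used analytic approximation
<span class="math">\(m = 2595 \log_{10}(1 + f/700)\)</span>, which is not
necessarily the exact variant used to compute the MFCCs in our dataset,
and the function names are only illustrative.</p>

```python
import math


def hz_to_mel(f_hz):
    # Common analytic approximation of the Mel scale (an assumption here,
    # not necessarily the one used for the dataset's MFCCs).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_channels, f_min, f_max):
    """Center frequencies (Hz) of n_channels channels equally spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_channels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_channels)]


# Ten channels between 0 Hz and 8 kHz (the Nyquist frequency at 16 kHz).
centers = mel_center_frequencies(10, 0.0, 8000.0)
print([round(c) for c in centers])
```

<p>The channels are evenly spaced in mel, so in Hz they are packed
closely at low frequencies and spread out at high frequencies.</p>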
<p>In this post, I will present some alternative speech signal
representations that may be more suitable for speech synthesis. Even
though one of our objectives is to learn a representation, we need to
understand a little bit about what has been developed by the speech
processing community, as it can serve as an inspiration.</p>
<div class="section" id="acoustic-samples-time-domain">
<h2>Acoustic samples (time domain)</h2>
<p>Using raw acoustic samples from overlapping frames is the simplest
approach. A discrete signal <span class="math">\(x[n]\)</span> is simply a sequence of (real
or integer) numbers corresponding to the signal samples (sampled
uniformly at an arbitrary sampling rate). The usual sampling rate for
speech recognition applications is 16 kHz, while the sampling rate
used for "telephone speech" coding is 8 kHz. This is essentially the
information we find in a PCM-encoded WAV file.</p>
<div class="figure align-center">
<img alt="" src="images/timedomain.png" />
<p class="caption"><em>Time-domain speech signal sampled at 16 kHz.</em></p>
</div>
<div class="figure align-center">
<img alt="" src="images/timedomain_zoom.png" />
<p class="caption"><em>Zoom of a 200-sample segment of the above signal.</em></p>
</div>
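<p>As a rough illustration of what "overlapping frames" means in
practice, the sketch below splits a sequence of samples into fixed-length
frames. The 20 ms frame length and the 50% overlap are assumptions chosen
for the example, and <tt class="docutils literal">frame_signal</tt> is a hypothetical helper, not
a function from the course code.</p>

```python
def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of frame_len samples, hop samples apart."""
    if not (frame_len > 0 and hop > 0):
        raise ValueError("frame_len and hop must be positive")
    # Only full frames are kept; trailing samples that do not fill
    # a whole frame are dropped.
    return [x[start:start + frame_len]
            for start in range(0, len(x) - frame_len + 1, hop)]


fs = 16000                    # sampling rate in Hz
frame_len = fs * 20 // 1000   # 20 ms frames (assumed for this example)
hop = frame_len // 2          # 50% overlap (assumed for this example)
x = list(range(fs))           # one second of dummy "samples"
frames = frame_signal(x, frame_len, hop)
print(len(frames), "frames of", frame_len, "samples each")
```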
</div>
<div class="section" id="short-time-fourier-transform">
<h2>Short-time Fourier Transform</h2>
<p>Another possible representation is to use Short-time Fourier Transform
(STFT) coefficients from overlapping frames. This is essentially the
same as using raw acoustic samples in the sense that there is no
information loss, but the representation in the frequency domain is
usually clearer for humans because we can associate the content in
different frequency bands with different phonemes. The STFT of a
discrete signal <span class="math">\(x[n]\)</span> is given by:</p>
<div class="math">
\begin{equation*}
\mathrm{STFT}\{x[n]\}(m,\omega) = X(m,\omega) = \sum_{n=-\infty}^{\infty} x[n] w[n-m] e^{-j \omega n}
\end{equation*}
</div>
<p>where <span class="math">\(m\)</span> and <span class="math">\(\omega\)</span> are the time (frame) and frequency indexes,
<span class="math">\(n\)</span> is the sample index, and <span class="math">\(w[n]\)</span> is the windowing function. A spectrogram is the
magnitude-squared version of this equation (i.e., without phase
information).</p>
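<p>The definition above can be evaluated directly for a finite window.
The sketch below is only meant to mirror the equation term by term: it
assumes a window of <span class="math">\(N\)</span> samples that is zero outside
<span class="math">\(0 \le n - m \le N - 1\)</span>, evaluates
<span class="math">\(\omega\)</span> at the <span class="math">\(N\)</span> DFT frequencies
<span class="math">\(2 \pi k / N\)</span>, and uses a rectangular window in the
example. It costs <span class="math">\(N^2\)</span> operations per frame, so real code
would use an FFT instead.</p>

```python
import cmath
import math


def stft(x, window, hop):
    """X[i][k] = sum_n x[n] * w[n - m] * exp(-j * omega_k * n), with m = i * hop."""
    N = len(window)
    frames = []
    for m in range(0, len(x) - N + 1, hop):
        frame = []
        for k in range(N):
            omega = 2.0 * math.pi * k / N
            frame.append(sum(x[n] * window[n - m] * cmath.exp(-1j * omega * n)
                             for n in range(m, m + N)))
        frames.append(frame)
    return frames


def spectrogram(X):
    """Magnitude-squared STFT: the phase is discarded."""
    return [[abs(c) ** 2 for c in frame] for frame in X]


# Test signal: a cosine that falls exactly on DFT bin 4 of a 32-sample window.
N = 32
x = [math.cos(2.0 * math.pi * 4 * n / N) for n in range(64)]
S = spectrogram(stft(x, [1.0] * N, hop=16))
peak_bins = [max(range(N // 2), key=lambda k: frame[k]) for frame in S]
print(peak_bins)
```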
<p>Spectrograms can be computed using windows of different lengths. This is
related to the <a class="reference external" href="http://en.wikipedia.org/wiki/Short-time_Fourier_transform#Resolution_issues">Gabor (or Heisenberg-Gabor)</a> limit: we cannot
simultaneously localize a signal in both time and frequency domains
with a high degree of certainty. Therefore, we usually have to use
different window lengths depending on what we want to analyze: long
windows give better frequency resolution but worse time resolution,
while the opposite holds for short windows. A possible compromise is
to choose a single window length that has sufficient resolution for
the target application.</p>
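<p>A bit of arithmetic makes the trade-off concrete: a window of
<span class="math">\(N\)</span> samples at sampling rate
<span class="math">\(f_s\)</span> spans <span class="math">\(N / f_s\)</span> seconds, and
the DFT bins of that frame are <span class="math">\(f_s / N\)</span> Hz apart. The
sketch below uses the bin spacing as a rough proxy for frequency
resolution (the actual resolution also depends on the window shape), and
the window lengths are arbitrary examples.</p>

```python
def window_resolution(fs, window_ms):
    """Return (samples per window, spacing in Hz between adjacent DFT bins)."""
    n_samples = round(fs * window_ms / 1000)
    return n_samples, fs / n_samples


for window_ms in (5, 20, 50):
    n_samples, bin_hz = window_resolution(16000, window_ms)
    print(f"{window_ms:2d} ms window: {n_samples:3d} samples, "
          f"bins {bin_hz:.0f} Hz apart")
```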
<div class="figure align-center">
<img alt="" src="images/specgram.png" />
<p class="caption"><em>Spectrogram (using a 20 ms rectangular window) of the speech signal above.</em></p>
</div>
</div>
<div class="section" id="linear-predictive-coding">
<h2>Linear Predictive Coding</h2>
<p>Another option is to represent each frame by its Linear Predictive Coding (LPC) <a class="citation-reference" href="#o1988linear" id="id2">[o1988linear]</a> coefficients plus a residual
(basically the excitation information). LPC is based on the source-filter
model of speech production, which assumes a speech signal is produced
by filtering a series of pulses (and occasionally noise bursts). The LPC
coefficients are related to the position of the articulators in the
mouth (e.g., tongue, palate, lips), while the pitch/noise information
is related to how the vocal tract is excited. This is usually
represented as an auto-regressive (AR) model of order <span class="math">\(p\)</span>:</p>
<div class="math">
\begin{equation*}
x[n] = \sum_{k=1}^{p} a_k x[n-k] + e[n]
\end{equation*}
</div>
<p>where <span class="math">\(a_k\)</span> are the model's coefficients and <span class="math">\(e[n]\)</span> is the prediction residual (the excitation). LPCs are computed for each speech frame with a least-squares method that minimizes the energy of this residual:</p>
<div class="math">
\begin{equation*}
\arg\min_{a_k} \sum_{n=-\infty}^{\infty} \left( x[n] - \sum_{k=1}^{p} a_k x[n-k] \right)^2
\end{equation*}
</div>
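<p>One standard way to solve this least-squares problem for a finite
frame is the autocorrelation method: the frame is treated as zero outside
its boundaries, and the resulting normal equations are solved with the
Levinson-Durbin recursion. The sketch below assumes that method (it is
not necessarily what scikits.talkbox or the notebook do internally), and
the function names are illustrative.</p>

```python
def autocorr(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag], treating x as zero outside the frame."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def lpc(x, p):
    """Return (a, err) where a[k - 1] = a_k in x[n] ~ sum_k a_k * x[n - k]
    and err is the energy of the remaining prediction error."""
    r = autocorr(x, p)
    if r[0] == 0.0:
        raise ValueError("frame has zero energy")
    a = []        # coefficients a_1 .. a_i of the current model order i
    err = r[0]    # prediction error energy of the current model order
    for i in range(1, p + 1):
        # Reflection coefficient for going from order i - 1 to order i.
        acc = r[i] - sum(a[k] * r[i - 1 - k] for k in range(i - 1))
        kappa = acc / err
        # Levinson-Durbin update of the lower-order coefficients.
        a = [a[k] - kappa * a[i - 2 - k] for k in range(i - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
    return a, err


# Synthetic check: impulse response of x[n] = 1.3 x[n-1] - 0.4 x[n-2] + e[n],
# with e[n] a single unit pulse at n = 0.
x = []
for n in range(200):
    sample = 1.0 if n == 0 else 0.0
    if n >= 1:
        sample += 1.3 * x[n - 1]
    if n >= 2:
        sample -= 0.4 * x[n - 2]
    x.append(sample)

a, err = lpc(x, 2)
print(a, err)
```

<p>On this synthetic signal the recovered coefficients should be very
close to the ones used to generate it; on real speech frames the fit is
only approximate, and what is left over is the residual.</p>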
<p>Because of its error criterion, LPC also has trouble representing the
phase of acoustic signals (by minimizing the squared error, we are modeling the
spectral magnitude of the signal, and not the phase). For this reason,
LPC speech may sound artificial when resynthesized. More robust
methods are used nowadays, such as code-excited linear prediction
(CELP) <a class="citation-reference" href="#valin2006speex" id="id3">[valin2006speex]</a>. These methods use, for example,
psychoacoustics-inspired techniques to shape the coding noise toward
frequency regions where the human auditory system is more tolerant. In
CELP, the residual is not transmitted directly, but represented as
entries in two codebooks.</p>
</div>
<div class="section" id="wavelets">
<h2>Wavelets</h2>
<p>The main purpose of a wavelet transform is to decompose arbitrary
signals into localized contributions that can be labelled by a scale
(or resolution) parameter <a class="citation-reference" href="#mallat1989theory" id="id4">[mallat1989theory]</a>. The representation
achieved through the wavelet transform can be seen as hierarchical: at
a coarse resolution, we have an idea of “context”, while at the highest
resolution we can see more details. This is achieved by decomposing
the original signal using a set of functions well-localized both in
time and frequency (the so-called wavelets).</p>
<p>Discrete wavelet transforms are implemented as a cascade of digital
filters with transfer functions derived from a discrete "mother
wavelet". The figure below shows an example. Check also the <a class="reference internal" href="#notebook">notebook</a>
for an example of wavelet decomposition of the audio signal shown
above.</p>
<div class="figure align-center">
<img alt="" src="images/Wavelets_-_Filter_Bank.png" />
<p class="caption"><em>Filter bank used by a discrete wavelet transform with 3 levels of decomposition (image from</em> <a class="reference external" href="http://en.wikipedia.org/wiki/File:Wavelets_-_Filter_Bank.png">Wikimedia Commons</a><em>)</em>.</p>
</div>
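<p>The cascade structure can be sketched with the simplest possible
mother wavelet. The example below assumes the Haar wavelet, whose
low-pass and high-pass filters are just a scaled sum and difference of
neighbouring samples; each stage filters the current approximation and
downsamples by 2, and only the low-pass branch is decomposed further. It
is a toy implementation for illustration, not a replacement for
PyWavelets.</p>

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def haar_step(x):
    """One filter-bank stage: low-pass and high-pass outputs, downsampled by 2."""
    approx = [SCALE * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    detail = [SCALE * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_dwt(x, levels):
    """Return (coarsest approximation, [details from finest to coarsest])."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2 ** levels")
    approx, details = list(x), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_idwt(approx, details):
    """Invert haar_dwt by undoing the stages from coarsest to finest."""
    x = list(approx)
    for detail in reversed(details):
        merged = []
        for a_i, d_i in zip(x, detail):
            merged.append(SCALE * (a_i + d_i))
            merged.append(SCALE * (a_i - d_i))
        x = merged
    return x


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_dwt(signal, levels=3)
restored = haar_idwt(approx, details)
print([len(d) for d in details], len(approx))
```

<p>The detail coefficients at each level hold what was removed when
moving to the next coarser resolution, so the original signal can be
rebuilt exactly from the coarsest approximation plus all the details.</p>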
</div>
<div class="section" id="sparse-coding-and-dictionary-based-methods">
<h2>Sparse coding and dictionary-based methods</h2>
<p>Sparse signal approximations are the basis for a variety of signal
processing techniques. Such approximations are usually employed with
the objective of having a signal representation that is more
meaningful, malleable, and robust to noise than the ones obtained by
standard transform methods <a class="citation-reference" href="#sturm" id="id5">[Sturm]</a>. The so-called
dictionary-based methods (DBM) decompose a signal into a linear
combination of waveforms through an approximation technique such as
Matching Pursuit (MP) <a class="citation-reference" href="#mallat1993" id="id6">[Mallat1993]</a> or Orthogonal Matching Pursuit
(OMP) <a class="citation-reference" href="#pati1993" id="id7">[Pati1993]</a>. The collection of waveforms that can be
selected for the linear combination is called a dictionary. This
dictionary is usually overcomplete, either because it is formed by
merging complete dictionaries or because the functions are chosen
arbitrarily.</p>
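<p>To make the idea of a greedy decomposition over a dictionary more
concrete, here is a minimal sketch of plain Matching Pursuit. It assumes
a dictionary given as a list of unit-norm atoms; the tiny 4-sample
dictionary (a standard basis merged with two Haar-like atoms) is made up
for the example. OMP differs in that it re-fits all selected coefficients
by least squares after each selection, which is not shown here.</p>

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy MP over unit-norm atoms. Returns ({atom index: coefficient}, residual)."""
    residual = list(signal)
    coeffs = {}
    for _ in range(n_atoms):
        # Pick the atom most correlated with what is still unexplained.
        corr = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(corr[i]))
        if corr[best] == 0.0:
            break
        coeffs[best] = coeffs.get(best, 0.0) + corr[best]
        # Subtract that atom's contribution from the residual.
        residual = [r - corr[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return coeffs, residual


# Overcomplete dictionary for 4-sample signals: 6 unit-norm atoms.
dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
]
signal = [1.5, 1.5, 3.5, 1.5]
coeffs, residual = matching_pursuit(signal, dictionary, n_atoms=2)
print(coeffs, residual)
```

<p>Because the atoms are not orthogonal, the two selected atoms do not
have to explain the signal exactly; the residual holds what is left after
the greedy selection.</p>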
<p>I will talk more about sparse coding and dictionary-based methods
later, since sparse coding is one of the methods we'll see in the
course.</p>
</div>
<div class="section" id="ipython-notebook">
<span id="notebook"></span><h2>IPython notebook</h2>
<p>An IPython notebook with examples for all the representations
described here (except sparse coding) is available on my <a class="reference external" href="https://github.com/jfsantos/ift6266h14">GitHub
repo</a>. You will need to install the packages <a class="reference external" href="http://www.pybytes.com/pywavelets/">PyWavelets</a> and
<a class="reference external" href="https://github.com/cournape/talkbox">scikits.talkbox</a> (both are available on PyPI) to run it. If you just want to take a look without interacting with the code, you can view it <a class="reference external" href="http://nbviewer.ipython.org/github/jfsantos/ift6266h14/blob/master/notebooks/Speech%20representation%20examples.ipynb">here</a>.</p>
</div>
<div class="section" id="references">
<h2>References</h2>
<table class="docutils citation" frame="void" id="stevens2005" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[Stevens2005]</a></td><td>S. S. Stevens, J. Volkmann, and E. B. Newman, “A Scale for the Measurement of the Psychological Magnitude Pitch,” The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="o1988linear" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[o1988linear]</a></td><td>D. O’Shaughnessy, “Linear predictive coding,” IEEE Potentials, vol. 7, no. 1, pp. 29–32, Feb. 1988.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="valin2006speex" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[valin2006speex]</a></td><td>J.-M. Valin, “Speex: a free codec for free speech,” in Australian National Linux Conference, Dunedin, New Zealand, 2006.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="mallat1989theory" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[mallat1989theory]</a></td><td>S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674–693, 1989.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="sturm" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id5">[Sturm]</a></td><td>B. L. Sturm, C. Roads, A. McLeran, and J. J. Shynk, “Analysis, Visualization, and Transformation of Audio Signals Using Dictionary-based Methods,” Journal of New Music Research, vol. 38, no. 4, pp. 325–341, 2009.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="mallat1993" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id6">[Mallat1993]</a></td><td>S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, Dec. 1993.</td></tr>
</tbody>
</table>
<table class="docutils citation" frame="void" id="pati1993" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id7">[Pati1993]</a></td><td>Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993, pp. 40–44.</td></tr>
</tbody>
</table>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Personal research journal: deep learning for speech synthesis2014-01-25T15:00:00-05:002014-01-25T15:00:00-05:00jfsantostag:None,2014-01-25:/intro_ift6266.html<p>This is the introduction to a series of reports on my experiments on
deep learning methods for speech synthesis. These experiments are part
of my coursework for Dr. Yoshua Bengio's <a class="reference external" href="http://ift6266h14.wordpress.com">Representation Learning</a>
course at Université de Montréal. All the related code is going to be
posted at a <a class="reference external" href="https://github.com/jfsantos/ift6266h14">GitHub repository</a> as well.</p>
<p>Please visit the <a class="reference external" href="/tag/ift6266.html">ift6266 tag page</a> for a list of all the posts
related to this project.</p>