Predictive Information in Gaussian Processes with Application to Music Analysis

Date: 28/08/2013
OAI: oai:www.see.asso.fr:2552:5119
DOI: 10.23723/2552/5119
Publisher: SEE
Format: application/pdf
Licence: Creative Commons: none (all rights reserved)


Predictive Information in Gaussian Processes with Application to Music Analysis
Samer Abdallah (1) and Mark Plumbley (2)
(1) Department of Computer Science, UCL
(2) Centre for Digital Music, Queen Mary, University of London
August 28, 2013

Outline

Process information measures for Gaussian processes
Banishing infinities with observation noise
Dynamic information measures
Predictive information and Bayesian surprise
Bayesian estimation of autoregressive processes
Applications to audio analysis
Summary and conclusions

Information theory in sequences

Consider an observer receiving elements of a random sequence $(\ldots, X_{-1}, X_0, X_1, X_2, \ldots)$, so that at any time $t$ there is a 'present' $X_t$, an observed past $\overleftarrow{X}_t$, and an unobserved future $\overrightarrow{X}_t$. E.g., at time $t = 3$ the past is $\overleftarrow{X}_3 = (\ldots, X_1, X_2)$, the present is $X_3$, and the future is $\overrightarrow{X}_3 = (X_4, X_5, \ldots)$. Consider how the observer's belief state evolves when, having observed up to $X_2$, it learns the value of $X_3$.

Motivation: surprise and information in sequences

Music, language, and other temporal things unfold in time (forwards). The slide reveals the following sentences word by word, so that each new word is more or less predictable from what has come before:

[someone] lives in the United States of America.
Khaled lives in the United Arab Emirates.
Brad lives in the United States of America.
Oscar lives in a wheely-bin (English for trash can).

Global measures for stationary processes

For a stationary process with present $X_0$, infinite past $\ldots, X_{-1}$ and infinite future $X_1, \ldots$:

entropy rate: $h_X = H(X_t \mid \overleftarrow{X}_t)$
multi-information rate: $\rho_X = I(\overleftarrow{X}_t; X_t) = H(X_t) - h_X$
erasure entropy rate: $r_X = H(X_t \mid \overleftarrow{X}_t, \overrightarrow{X}_t)$
predictive information rate: $b_X = I(X_t; \overrightarrow{X}_t \mid \overleftarrow{X}_t) = h_X - r_X$

[I-diagram: $X_0$ against the infinite past and infinite future, with $H(X_0)$ split into $\rho_X$ and $h_X = r_X + b_X$.]

See James et al.'s Anatomy of a Bit [JEC11] for discussion of the remaining unlabelled atom, which is related to the excess entropy.

Discrete-time Gaussian processes

The information-theoretic quantities used above have analogues for continuous-valued random variables. For stationary Gaussian processes, we can obtain results in terms of the power spectral density $S(\omega)$. Standard methods give

$H(X_t) = \tfrac{1}{2}\Big(\log 2\pi e + \log \tfrac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\,\mathrm{d}\omega\Big)$,
$h_X = \tfrac{1}{2}\Big(\log 2\pi e + \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \log S(\omega)\,\mathrm{d}\omega\Big)$,
$\rho_X = \tfrac{1}{2}\Big(\log \tfrac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\,\mathrm{d}\omega - \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \log S(\omega)\,\mathrm{d}\omega\Big)$.

The entropy rate is also known as the Kolmogorov-Sinai entropy. The multi-information rate is Dubnov's 'information rate' [Dub04] and is a function of the spectral flatness measure [GM74].
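To make the spectral formulas above concrete, here is a small numerical sketch (not from the talk) that evaluates $H(X_t)$, $h_X$ and $\rho_X$ in nats from a sampled power spectral density by trapezoidal integration; the AR(1) example spectrum is an assumption chosen purely for illustration.

```python
import numpy as np

def spectral_measures(S, omega):
    """H(X_t), h_X and rho_X (in nats) from samples S of the PSD on a grid omega."""
    mean_S = np.trapz(S, omega) / (2 * np.pi)             # (1/2pi) int S(w) dw
    mean_logS = np.trapz(np.log(S), omega) / (2 * np.pi)  # (1/2pi) int log S(w) dw
    H = 0.5 * (np.log(2 * np.pi * np.e) + np.log(mean_S))
    h = 0.5 * (np.log(2 * np.pi * np.e) + mean_logS)
    return H, h, H - h                                    # rho_X = H(X_t) - h_X

# Illustration: PSD of an AR(1) process X_t = U_t - a*X_{t-1} with sigma^2 = 1
a, omega = 0.9, np.linspace(-np.pi, np.pi, 4001)
S = 1.0 / np.abs(1 + a * np.exp(-1j * omega)) ** 2
print(spectral_measures(S, omega))
```

For this AR(1) example the closed form $\rho_X = -\tfrac{1}{2}\log(1 - a^2)$ (which follows from the AR(1)/MA(1) results given below) is about 0.83 nats, which the numerical integral should reproduce.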
Erasure entropy and predictive information rates

Verdú and Weissman [VW06] give the erasure entropy rate of a Gaussian process:

$r_X = \tfrac{1}{2}\Big(\log 2\pi e - \log \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \tfrac{1}{S(\omega)}\,\mathrm{d}\omega\Big)$.

From this we get the PIR:

$b_X = \tfrac{1}{2}\Big(\tfrac{1}{2\pi}\int_{-\pi}^{\pi} \log S(\omega)\,\mathrm{d}\omega + \log \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \tfrac{1}{S(\omega)}\,\mathrm{d}\omega\Big)$.

Autoregressive processes

An AR(N) process is defined as

$X_t = U_t - \sum_{k=1}^{N} a_k X_{t-k}, \quad U_t \sim \mathcal{N}(0, \sigma^2)$,

where the innovations $U_t$ are Gaussian white noise with variance $\sigma^2$. The $a_k$ are the autoregressive or prediction coefficients, effectively an IIR filter. If the filter is stable, the process is stationary. We obtain

$h_X = \tfrac{1}{2}\log(2\pi e \sigma^2)$ and $b_X = \tfrac{1}{2}\log\big(1 + \sum_{k=1}^{N} a_k^2\big)$.

The multi-information rate $\rho_X$ does not have a simple general expression in terms of the prediction coefficients.

Autoregressive example

[Figure: $\rho_X$ versus $b_X$ (bits) for an AR(1) process with parameter $a_1$, marked at the values $0$, $\pm 0.50$, $\pm 0.90$, $\pm 0.99$.]

As $a_1 \to -1$ the process becomes a Gaussian random walk with tiny steps.
• The multi-information rate diverges.
• The PIR has no local maximum.

Moving average processes

An MA(N) process is defined as $(X_t)_{t\in\mathbb{Z}}$ such that

$X_t = \sum_{k=0}^{N} b_k U_{t-k}, \quad U_t \sim \mathcal{N}(0, \sigma^2)$,

where the $U_t$ are white noise and the $b_k$ are like FIR filter coefficients. We make the standard assumptions (without loss of generality) that $b_0 = 1$ and that the process is invertible (no zeros outside the unit circle). We get

$h_X = \tfrac{1}{2}\log(2\pi e \sigma^2)$ and $\rho_X = \tfrac{1}{2}\log\big(1 + \sum_{k=1}^{N} b_k^2\big)$.

The PIR $b_X$ does not have a simple general expression in terms of the coefficients.

Moving average example

For MA(1), we get $\rho_X = \tfrac{1}{2}\log(1 + b_1^2)$ and $b_X = -\tfrac{1}{2}\log(1 - b_1^2)$: the same as for AR(1) but with $b_1$ instead of $a_1$, and with $\rho_X$ and $b_X$ swapped. Divergence of the PIR: $b_X \to \infty$ as $b_1 \to \pm 1$. But $b_1 = 1$ seems like a perfectly sensible process ($X_t$ is the sum of the last two innovations). What is going on?

Consider $X_0$, $X_1$, $X_2$ given an infinite sequence of previous observations $\overleftarrow{X}_t = \overleftarrow{x}_t$, and look at an I-diagram describing the situation. N.B. $I(X_0; X_1, X_2) > I(X_0; X_1)$ even though $I(X_0; X_2) = 0$, and the three-way co-information is negative. This is 'synergy', which also occurs in XOR/parity systems, which maximise binding information. It is as if $X_1$ 'unlocks' information in $X_0$ about $X_2$.

[I-diagram over $X_0$, $X_1$, $X_2$ with atom values 2.05, 1.55, 1.25, 0.79, 0.50, 0.29 and −0.29.]

This continues down the chain: $I(X_0; X_{1:m})$ diverges as $m \to \infty$.

PIR/multi-information duality

$b_X = \tfrac{1}{2}\Big(\log \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \tfrac{1}{S(\omega)}\,\mathrm{d}\omega - \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \log \tfrac{1}{S(\omega)}\,\mathrm{d}\omega\Big)$: simple for AR(N); can diverge for MA(N).

$\rho_X = \tfrac{1}{2}\Big(\log \tfrac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\,\mathrm{d}\omega - \tfrac{1}{2\pi}\int_{-\pi}^{\pi} \log S(\omega)\,\mathrm{d}\omega\Big)$: simple for MA(N); can diverge for AR(N).

Dualities: $S(\omega) \leftrightarrow 1/S(\omega)$, MA(N) $\leftrightarrow$ AR(N), $\rho_X \leftrightarrow b_X$.
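The first-order closed forms above make the duality and the divergences easy to check numerically. The sketch below (not from the talk) tabulates $(\rho_X, b_X)$ in bits for AR(1) and MA(1) processes as the coefficient approaches 1; the set of coefficient values is arbitrary.

```python
import numpy as np

bits = lambda nats: nats / np.log(2)

def ar1_measures(a):
    """AR(1): rho_X = -0.5*log(1 - a^2) (diverges as |a| -> 1), b_X = 0.5*log(1 + a^2)."""
    return bits(-0.5 * np.log(1 - a**2)), bits(0.5 * np.log(1 + a**2))

def ma1_measures(b):
    """MA(1): rho_X = 0.5*log(1 + b^2), b_X = -0.5*log(1 - b^2) (diverges as |b| -> 1)."""
    return bits(0.5 * np.log(1 + b**2)), bits(-0.5 * np.log(1 - b**2))

for c in [0.5, 0.9, 0.99, 0.999]:
    print(f"c={c}: AR(1) (rho, b) = {ar1_measures(c)}, MA(1) (rho, b) = {ma1_measures(c)}")
```

Swapping the roles of the two measures between the two model classes is exactly the duality stated on the slide.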
Adding observation noise

The infinities are troublesome and point to a problem with the notion of infinitely precise observation of continuous-valued variables. They can be avoided by adding observation noise:

$Y_t = X_t + V_t, \quad V_t \sim \mathcal{N}(0, \sigma_v^2)$.

[Scatter plots: $\rho$ versus $b$ (bits) for AR(8) systems generated by randomly sampling poles of the frequency response, without observation noise (left) and with observation noise at −20 dB (right).]

Noisy AR(1) process examples

[Figure: 1000-sample realisations of three noisy AR(1) processes: (a) ψ = 0.030 with h = 2.046, ρ = 0.001, b = 0.001; (b) ψ = 0.940 with h = 0.593, ρ = 1.454, b = 0.408; (c) ψ = 0.999 with h = −0.963, ρ = 3.010, b = 0.142.]

Linear state space models

All processes so far are subsumed by the general state-space representation, which also covers vector-valued processes ($X_t \in \mathbb{R}^N$ and $Y_t \in \mathbb{R}^M$):

$X_t = A X_{t-1} + U_t, \quad U_t \sim \mathcal{N}(0, \Sigma_u)$,
$Y_t = C X_t + V_t, \quad V_t \sim \mathcal{N}(0, \Sigma_v)$.

The answer is not too horrible and fits on one slide if we shrink it a bit:

$\Lambda_0 = \mathrm{dlyap}(A, \Sigma_u)$,  $h_X = H_Y(\Lambda_f)$,
$\Lambda_f = \mathrm{dare}(A^\top, C^\top, \Sigma_u, \Sigma_v)$,  $\rho_X = H_Y(\Lambda_0) - H_Y(\Lambda_f)$,
$K_b = \mathrm{dare}(A, B, Q, I) - Q$,  $b_X = H_Y(\Lambda_f) - H_Y(\Lambda_{fb})$,
$\Lambda_{fb} = (\Lambda_f^{-1} + K_b)^{-1}$,

where $H_Y(\Lambda) = \tfrac{1}{2}\log\big((2\pi e)^M \det(C \Lambda C^\top + \Sigma_v)\big)$ and $Q = C^\top \Sigma_v^{-1} C$.
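The dlyap/dare recipe above maps onto SciPy's discrete Lyapunov and Riccati solvers. The sketch below (not from the talk) computes the entropy rate and multi-information rate of the observed process; it omits the $b$ term, whose backward Riccati equation uses arguments the slide does not spell out, and the scalar test system is an arbitrary assumption.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

def gauss_entropy(cov):
    """Differential entropy (nats) of a zero-mean Gaussian with covariance cov."""
    M = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** M * np.linalg.det(cov))

def info_rates(A, C, Su, Sv):
    """Entropy rate and multi-information rate (nats) of Y_t = C X_t + V_t,
    X_t = A X_{t-1} + U_t, following the slide's dlyap/dare recipe."""
    H = lambda L: gauss_entropy(C @ L @ C.T + Sv)        # H_Y(Lambda)
    L0 = solve_discrete_lyapunov(A, Su)                  # stationary state covariance
    Lf = solve_discrete_are(A.T, C.T, Su, Sv)            # one-step prediction covariance
    return H(Lf), H(L0) - H(Lf)                          # h, rho

# Scalar AR(1)-plus-noise example (parameter values chosen arbitrarily)
A, C = np.array([[0.9]]), np.array([[1.0]])
Su, Sv = np.array([[1.0]]), np.array([[0.01]])
h, rho = info_rates(A, C, Su, Sv)
print(h / np.log(2), rho / np.log(2))                    # in bits
```

With `solve_discrete_are(A.T, C.T, Su, Sv)` the solver's Riccati equation reduces to the Kalman prediction-covariance equation, which is how the slide's $\Lambda_f = \mathrm{dare}(A^\top, C^\top, \Sigma_u, \Sigma_v)$ is read here.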
Surprise and expected surprise

Given the actually observed values up to time $t$, we can compute dynamic information measures. First, the surprisingness:

$h_X^x(t) \stackrel{\mathrm{def}}{=} -\log p(x_t \mid \overleftarrow{x}_t) = h_X + \pi e^{1 - 2 h_X} (x_t - \hat{x}_t)^2 - \tfrac{1}{2}$,

where $\hat{x}_t = E(X_t \mid \overleftarrow{X}_t = \overleftarrow{x}_t)$. Hence the surprise is basically the squared innovation $(x_t - \hat{x}_t)^2$, and the innovations are white noise, so:
• The expected surprise is always $h_X$: no temporal structure.
• There is no temporal structure in the surprises themselves either (they behave like independent $\chi^2$ variables).

[Figure: trace of $(h_X^x(t) - h_X + \tfrac{1}{2}\,\mathrm{nat})$ in bits over 2000 samples.]

IPI and expected IPI

Then, the instantaneous predictive information (IPI) in the observation $X_t = x_t$ about the unobserved future $\overrightarrow{X}_t$, given the previous observations $\overleftarrow{X}_t = \overleftarrow{x}_t$, is

$b_X^x(t) \stackrel{\mathrm{def}}{=} I(X_t{=}x_t \to \overrightarrow{X}_t \mid \overleftarrow{X}_t{=}\overleftarrow{x}_t) = (1 - e^{-2 b_X})\,[h_X^x(t) - h_X] + b_X$.

The deviation from $b_X$ is proportional to the deviation of the surprise from the entropy rate. The coefficient is 0 for $b_X = 0$ and tends to 1 as $b_X \to \infty$.
• The expected IPI is always $b_X$: no temporal structure.
• There is no temporal structure in the IPI either.
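The two formulas above are a direct recipe for tracing surprisingness and IPI along an observed signal. Here is a short sketch (not from the talk) for a noise-free AR(1) process, where the one-step prediction is $\hat{x}_t = -a_1 x_{t-1}$ and $h_X$, $b_X$ take the closed forms given earlier; the parameter values are arbitrary.

```python
import numpy as np

def dynamic_measures(x, xhat, h, b):
    """Surprisingness h^x(t) and IPI b^x(t), in nats, from the slide formulas."""
    surprise = h + np.pi * np.exp(1 - 2 * h) * (x - xhat) ** 2 - 0.5
    ipi = (1 - np.exp(-2 * b)) * (surprise - h) + b
    return surprise, ipi

# AR(1): X_t = U_t - a1 * X_{t-1}, so the one-step predictor is -a1 * x_{t-1}
rng = np.random.default_rng(0)
a1, sigma, T = 0.9, 1.0, 1000
x = np.zeros(T)
for t in range(1, T):
    x[t] = rng.normal(0, sigma) - a1 * x[t - 1]
h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
b = 0.5 * np.log(1 + a1**2)
surprise, ipi = dynamic_measures(x[1:], -a1 * x[:-1], h, b)
print(surprise.mean(), h, ipi.mean(), b)   # sample means should be close to h and b
```

The printed averages illustrate the point made on the slides: the expected surprise and expected IPI are flat at $h_X$ and $b_X$, with no temporal structure.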
Conclusion: perceptual flatness of Gaussian processes

• Differences in the global information measures $h_X$, $\rho_X$ and $b_X$ between different Gaussian processes are consistent with differences in perceptual qualities; e.g. in audio, $\rho_X$ relates to spectral flatness and tonal qualities.
• The lack of dynamic structure suggests a lack of 'eventness' once a particular GP is in progress: no sense of things happening at certain times. As audio signals, GPs range from 'shhhh' or 'sssss' to 'oooo' or 'eeeee'; they sound homogeneous.

To get more interesting information dynamics, we consider what happens if we allow the GP parameters to vary over time. Hence we look at Bayesian learning of AR(N) process parameters.

Predictive information in parametric models

Consider a parametric model where the observations $X_t$ are i.i.d. once the parameters are fixed, but the parameters are actually unknown and must be estimated, so the $X_t$ are marginally dependent (an exchangeable random sequence, as in de Finetti's theorem).

[Graphical model: $\Theta$ with arrows to $X_1, X_2, X_3, \ldots$]

The observer's belief state at time $t$ includes a probability distribution over the parameters, $p(\Theta{=}\theta \mid \overleftarrow{X}_t = \overleftarrow{x}_t)$. Each observation causes a revision of the belief state and hence supplies information $I(X_t{=}x_t \to \Theta \mid \overleftarrow{X}_t = \overleftarrow{x}_t)$ about $\Theta$. In previous work we called this the 'model information rate'. (It is the same as Haussler and Opper's [HO95] instantaneous information gain (IIG) and Itti and Baldi's [IB05] Bayesian surprise.)

IPI equals IIG (in some cases)

Mild assumptions yield a relation between the IIG and the IPI (everything conditioned on $K_t$, the knowledge at time $t$):
1. $X_t \perp \overrightarrow{X}_t \mid \Theta$: the observations are i.i.d. given $\Theta$;
2. $\Theta \perp X_t \mid \overrightarrow{X}_t$: the assumption that $X_t$ adds no new information about $\Theta$ given the infinitely long sequence $\overrightarrow{X}_t = X_{t+1:\infty}$. N.B. this is often satisfied, but not by Mandelbrot's so-called 'wild' random processes [MT07].

[I-diagram over $X_t$, $\overrightarrow{X}_t$ and $\Theta$, with the atoms removed by assumptions 1 and 2 set to zero and the remaining shared atom labelled $B_t$.]

Hence $I(X_t; \Theta) = I(X_t; \overrightarrow{X}_t) = B_t$. A careful extension of assumption 2 yields

$I(X_t{=}x_t \to \Theta) = I(X_t{=}x_t \to \overrightarrow{X}_t) = b_X^x(t)$.

IPI equals IIG plus Θ-IPI (in other cases)

We can drop assumption 1 and still get

$b_X^x(t) = I(X_t{=}x_t \to \Theta \mid K_t) + I(X_t{=}x_t \to \overrightarrow{X}_t \mid \Theta, K_t)$.

The first term is just the IIG or Bayesian surprise: the KL divergence between the Bayesian prior and posterior over $\Theta$. The second term is the IPI in the model when $\Theta$ is known (easy), but averaged (hard) over the unknown value of $\Theta$ according to the posterior:

$I(X_t{=}x_t \to \overrightarrow{X}_t \mid \Theta, K_t) = \int I(X_t{=}x_t \to \overrightarrow{X}_t \mid \Theta{=}\theta, K_t)\, p(\theta \mid K_{t+1})\,\mathrm{d}\theta$.

The result can be applied to previous work on Markov chains and to GPs. It means that Bayesian surprise is a bona fide component of the IPI.

Interpretation and caveats

A couple of things need to be borne in mind when using these results:
• Assumption 2 needs to be checked: it might not be satisfied for models featuring heavy-tailed or power-law distributions.
• Assumption 2 says 'any information we learn about the parameters will eventually show up in the data', which is only valid if the true parameter is constant and the future is infinite, so...
• ...if the parameter is slowly changing, there is effectively only a finite future for the parameter value's effects to be felt. Hence the quality of the approximation needs to be checked.

We hope to address the last point by considering the information geometry of learning, to judge what fraction of the information learned about the parameters will be reflected in a finite block of future observations.

Bayesian AR(N) estimation

We follow Giovannelli et al.'s approach based on Kitagawa and Gersch's 'spectral smoothness prior' [GDH96, KG85]. Consider an AR(N) process with parameters $\sigma^2$ and $a \equiv (a_1, \ldots, a_N)$. The conjugate prior is Gaussian for $a$ and inverse-Gamma for $\sigma^2$. Kitagawa's prior sets $a \sim \mathcal{N}(0, \sigma^2 R_a)$, where

$R_a^{-1} = \lambda\, \mathrm{diag}(1^{2p}, 2^{2p}, \ldots, N^{2p})$.

Here $p$ relates to the smoothness of the whitening filter, and $\lambda$ controls the overall strength of the regularisation (a sketch of this prior construction follows below).
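As noted above, a minimal sketch (not from the talk) of the smoothness prior's precision matrix; the values of N, p and λ in the example call are arbitrary.

```python
import numpy as np

def smoothness_prior_precision(N, p, lam):
    """Kitagawa-Gersch style prior precision for AR coefficients:
    R_a^{-1} = lam * diag(1^{2p}, 2^{2p}, ..., N^{2p}),
    so that a | sigma^2 ~ N(0, sigma^2 * R_a) penalises rough whitening filters."""
    k = np.arange(1, N + 1, dtype=float)
    return lam * np.diag(k ** (2 * p))

print(smoothness_prior_precision(N=4, p=1, lam=0.1))
```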
Online Bayesian AR(N) estimation

We adapted the algorithm to work sequentially, updating the posterior over $\sigma^2$ and $a$ after each observation. The posterior is Normal-Inverse-Gamma; parametrically, at time $t$:

$\sigma^2 \sim \mathrm{IG}(\alpha^{(t)}, \beta^{(t)})$, $\quad a \mid \sigma^2 \sim \mathcal{N}(\mu_a^{(t)}, \sigma^2 R_a^{(t)})$.

It is easy to compute the Bayesian surprise as the KL divergence

$D\big(\alpha^{(t)}, \beta^{(t)}, \mu_a^{(t)}, R_a^{(t)} \,\big\|\, \alpha^{(t-1)}, \beta^{(t-1)}, \mu_a^{(t-1)}, R_a^{(t-1)}\big)$,

and it is easy to compute expectations and so estimate $b_X$ and $h_X$. A final modification: a decay factor on the sufficient statistics implements 'forgetting'.
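For concreteness, here is a hedged sketch of one standard sequential Normal-Inverse-Gamma update with exponential forgetting, written for the regression form $x_t = w^\top z_t + u_t$ with $w = -a$ and $z_t = (x_{t-1}, \ldots, x_{t-N})$; the talk's exact update and decay scheme may differ in detail, and the decay factor gamma is an assumption.

```python
import numpy as np

def nig_update(mu, P, alpha, beta, z, x, gamma=1.0):
    """One conjugate update for w | s2 ~ N(mu, s2 * inv(P)), s2 ~ IG(alpha, beta),
    given regressor z (the past N samples) and a new observation x.
    gamma < 1 exponentially decays the sufficient statistics ('forgetting')."""
    P0, alpha0, beta0 = gamma * P, gamma * alpha, gamma * beta
    Pn = P0 + np.outer(z, z)                              # updated precision (up to s2)
    mun = np.linalg.solve(Pn, P0 @ mu + z * x)            # updated mean
    alphan = alpha0 + 0.5
    betan = beta0 + 0.5 * (x**2 + mu @ P0 @ mu - mun @ Pn @ mun)
    return mun, Pn, alphan, betan

# Running the filter along a signal `xs` with model order N (illustrative only):
# mu, P = np.zeros(N), prior_precision          # e.g. the smoothness prior precision
# alpha, beta = 1.0, 1.0
# for t in range(N, len(xs)):
#     z = xs[t-1::-1][:N]                       # (x_{t-1}, ..., x_{t-N})
#     mu, P, alpha, beta = nig_update(mu, P, alpha, beta, z, xs[t], gamma=0.99)
```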
Steve Reich's Drumming · Material and methods

We took an audio recording of Steve Reich's Drumming (about 57 minutes). The pre-processing is fairly elaborate, modelled on Dubnov, McAdams and Reynolds's system for analysing audio in terms of the multi-information rate [DMR06]:

Audio frames → Mel-scale log power spectrum → smooth additive normalisation → PCA (30 components) → channelwise additive normalisation.

The PCA basis was computed offline using lots of audio, and functions like Mel-cepstral analysis. The result is 30 pseudo-independent zero-mean channels to analyse as GPs (Drumming contains lots of nice periodicities).
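A rough sketch (not the authors' code) of such a pre-processing chain using librosa and scikit-learn; the frame sizes, number of mel bands, smoothing window, and fitting the PCA on the same recording are all assumptions made for illustration (the talk fitted the PCA basis offline on a larger corpus).

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def preprocess(path, n_mels=40, n_components=30, smooth_frames=200):
    """Mel-scale log power spectrum -> smooth additive normalisation ->
    PCA -> channelwise additive normalisation; returns frames x channels."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=512, n_mels=n_mels)
    logS = np.log(S + 1e-10).T                            # frames x mel channels
    # 'Smooth additive normalisation': subtract a slowly varying running mean
    kernel = np.ones(smooth_frames) / smooth_frames
    trend = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'),
                                0, logS)
    Z = PCA(n_components=n_components).fit_transform(logS - trend)
    return Z - Z.mean(axis=0)                             # channelwise normalisation

# channels = preprocess('drumming.wav')   # hypothetical filename
```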
Drumming · Results

[Figure: traces over the roughly 57 minutes of Drumming (time in minutes, values in bits) of the marginal entropy, entropy rate, multi-information rate, predictive information rate, Bayesian surprise and squared prediction error.]

Angel of Death · Background

This experiment is modelled on that of Dubnov, McAdams and Reynolds [DMR06], which involved the analysis of a specially composed piece, The Angel of Death by Roger Reynolds. The piece can be performed in two ways: the 'S-D' form and the 'D-S' form (see the original paper for details). During a live performance, audience members were asked to make continuous ratings on the dimensions of 'emotional force' and 'familiarity'. These were then correlated with various measures derived from the audio, including the multi-information (though this was computed block-wise rather than using online estimation). Correlating the computed MIR with the emotional force ratings averaged across subjects, correlations of 0.63 and 0.46 were obtained for the S-D and D-S forms respectively.

Angel of Death · Methods

We repeated the analysis, but using the online estimation method, and computed correlations between all the information measures and each subject's ratings individually. For each pair, we computed lagged correlations over a range of lags (−5 s to +5 s) and picked the maximal absolute correlation (a sketch of this step follows below). We also computed correlations with the ratings averaged across subjects (shown as Subject 0 on the results slides).

The results suggest that, overall, the best correlation with emotional force is obtained by the marginal entropy (0.53 for S-D and 0.52 for D-S with the mean subject responses; 0.63 and 0.55 for the best subject), not by the multi-information rate. Further investigation is required. Note also that different measures seem to fit the behaviour of different subjects: some subjects' responses are modelled best by the multi-information rate or the predictive information rate.
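A minimal sketch (not the authors' code) of the lag-scanning correlation step described above; the common sampling rate fs for the measure and rating time series, and the sign convention for the lag, are assumptions.

```python
import numpy as np

def best_lagged_correlation(measure, rating, fs, max_lag_s=5.0):
    """Scan lags in [-max_lag_s, +max_lag_s] seconds and return the lag (in
    samples) and the Pearson correlation with the largest absolute value."""
    max_lag = int(max_lag_s * fs)
    best_lag, best_r = 0, 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = measure[lag:], rating[:len(rating) - lag]
        else:
            a, b = measure[:lag], rating[-lag:]
        n = min(len(a), len(b))
        r = np.corrcoef(a[:n], b[:n])[0, 1]
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r
```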
Angel of Death · Results (S-D form)

[Figure: per-subject lagged correlations (Subject 0 = mean over subjects) of the emotional force and familiarity ratings with Bayesian surprise, entropy rate, predictive information rate, marginal entropy, surprise and multi-information rate.]

Angel of Death · Results (D-S form)

[Figure: the same per-subject correlations for the D-S form.]

Summary

• We obtained process information measures for any Gaussian process.
• There is a duality between $b_X$ and $\rho_X$, via inverse spectra.
• We obtained dynamic information measures, which lack temporal structure.
• Bayesian surprise in adaptive parametric models is a component of the PIR.
• The methods were applied to real-time analysis of minimalist music.

Bibliography

[DMR06] Shlomo Dubnov, Stephen McAdams, and Roger Reynolds. Structural and affective aspects of music from statistical audio signal analysis. Journal of the American Society for Information Science and Technology, 57(11):1526–1536, 2006.

[Dub04] S. Dubnov. Generalization of spectral flatness measure for non-Gaussian linear processes. IEEE Signal Processing Letters, 11(8):698–701, 2004.

[GDH96] J.-F. Giovannelli, Guy Demoment, and Alain Herment. A Bayesian method for long AR spectral estimation: a comparative study. IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control, 43(2):220–233, 1996.

[GM74] A. Gray, Jr. and J. Markel. A spectral-flatness measure for studying the autocorrelation method of linear prediction of speech analysis. IEEE Transactions on Acoustics, Speech and Signal Processing, 22(3):207–217, 1974.

[HO95] David Haussler and Manfred Opper. General bounds on the mutual information between a parameter and n conditionally independent observations. In Proceedings of the Seventh Annual ACM Workshop on Computational Learning Theory (COLT '95), pages 402–411, New York, NY, USA, 1995. ACM.

[IB05] Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems (NIPS 2005), volume 19, pages 547–554, Cambridge, MA, 2005. MIT Press.

[JEC11] Ryan G. James, Christopher J. Ellison, and James P. Crutchfield. Anatomy of a bit: information in a time series observation. Chaos, 21(3):037109, 2011.

[KG85] Genshiro Kitagawa and Will Gersch. A smoothness priors time-varying AR coefficient modeling of nonstationary covariance time series. IEEE Transactions on Automatic Control, 30(1):48–56, 1985.

[MT07] Benoit Mandelbrot and Nassim Nicholas Taleb. Mild vs. wild randomness: focusing on risks that matter. In Frank Diebold, Neil Doherty, and Richard Herring, editors, The Known, the Unknown and the Unknowable in Financial Institutions. Princeton University Press, 2007.

[VW06] S. Verdú and T. Weissman. Erasure entropy. In IEEE International Symposium on Information Theory (ISIT 2006), pages 98–102, 2006.