GSI2015
About
LIX Colloquium 2015 conferences

Provide an overview of the most recent state of the art

Exchange mathematical information/knowledge/expertise in the area

Identify research areas/applications for future collaboration

Identify academic & industry labs expertise for further collaboration
This conference will be an interdisciplinary event and will unify skills from Geometry, Probability and Information Theory. The conference proceedings are published in Springer's Lecture Notes in Computer Science (LNCS) series.
Authors will be invited to submit a paper to the Special Issue "Differential Geometrical Theory of Statistics" of the ENTROPY journal, an international and interdisciplinary open-access journal of entropy and information studies published monthly online by MDPI.
Provisional Topics of Special Sessions:

Manifold/Topology Learning

Riemannian Geometry in Manifold Learning

Optimal Transport theory and applications in Imagery/Statistics

Shape Space & Diffeomorphic mappings

Geometry of distributed optimization

Random Geometry/Homology

Hessian Information Geometry

Topology and Information

Information Geometry Optimization

Divergence Geometry

Optimization on Manifold

Lie Groups and Geometric Mechanics/Thermodynamics

Quantum Information Geometry

Infinite Dimensional Shape spaces

Geometry on Graphs

Bayesian and Information geometry for inverse problems

Geometry of Time Series and Linear Dynamical Systems

Geometric structure of Audio Processing

Lie groups in Structural Biology

Computational Information Geometry
Committees
Secretary
 Valérie Alidor  SEE, France https://www.see.asso.fr
Webmaster
 Jean Vieille  SyntropicFactory http://www.syntropicfactory.com
Program chairs
 Frédéric Barbaresco  Thales, France http://www.thalesgroup.com
 Frank Nielsen  Ecole Polytechnique, France http://www.lix.polytechnique.fr/~nielsen/
Scientific committee
 Pierre-Antoine Absil  University of Louvain, Belgium http://sites.uclouvain.be/absil/
 Bijan Afsari  Johns Hopkins University, USA http://www.cis.jhu.edu/~bijan/
 Stéphanie Allassonnière  Ecole Polytechnique, France https://sites.google.com/site/stephanieallassonniere/home
 Shun-ichi Amari  RIKEN, Japan http://www.brain.riken.jp/labs/mns/amari/homeE.html
 Jesus Angulo  Mines ParisTech, France http://cmm.ensmp.fr/~angulo/
 Jean-Philippe Anker  Université d'Orléans, France http://www.univorleans.fr/mapmo/membres/anker/
 Sylvain Arguillère  Johns Hopkins University, USA http://www.cis.jhu.edu/~arguille/
 Marc Arnaudon  Université de Bordeaux, France http://www.math.ubordeaux1.fr/~marnaudo/
 Dena Asta  Carnegie Mellon University, USA http://www.stat.cmu.edu/~dasta/
 Michael Aupetit  Qatar Computing Research Institute, Qatar http://michael.aupetit.free.fr/
 Roger Balian  Academy of Sciences, France https://en.wikipedia.org/wiki/Roger_Balian
 Barbara Trivellato  Politecnico di Torino, Italy http://calvino.polito.it/~trivellato/
 Frédéric Barbaresco  Thales, France http://www.thalesgroup.com
 Michèle Basseville  IRISA, France http://people.irisa.fr/Michele.Basseville/
 Pierre Baudot  Max Planck Institute for Mathematics in the Sciences http://www.mis.mpg.de/jjost/members/pierrebaudot.html
 Martin Bauer  University of Vienna, Austria http://mat.univie.ac.at/~bauerm/Home_Page_of_Martin_Bauer/Home.html
 Roman Belavkin  Middlesex University, UK http://www.eis.mdx.ac.uk/staffpages/rvb/
 Daniel Bennequin  Paris Diderot University http://webusers.imjprg.fr/~daniel.bennequin/
 Jérémy Bensadon  LRI, France https://www.lri.fr/~bensadon/
 Jean-François Bercher  ESIEE, France http://perso.esiee.fr/~bercherj/
 Yannick Berthoumieu  IMS Université de Bordeaux, France https://sites.google.com/site/berthoumieuims/
 Jérémie Bigot  Université de Bordeaux, France https://sites.google.com/site/webpagejbigot/
 Michael Blum  IMAG, France http://membrestimc.imag.fr/Michael.Blum/
 Lionel Bombrun  IMS, Université de Bordeaux, France https://www.imsbordeaux.fr/fr/annuaire/4158bombrunlionel
 Silvère Bonnabel  Mines ParisTech http://www.silverebonnabel.com/
 Ugo Boscain  Ecole polytechnique, France http://www.cmapx.polytechnique.fr/~boscain/
 Nicolas Boumal  Inria & ENS Paris, France http://perso.uclouvain.be/nicolas.boumal/
 Charles Bouveyron  University Paris Descartes, France http://w3.mi.parisdescartes.fr/~cbouveyr/
 Michel Boyom  Université de Montpellier, France http://www.i3m.univmontp2.fr/
 Michel Broniatowski  University of Pierre and Marie Curie, France http://www.lsta.upmc.fr/Broniatowski/
 Martins Bruveris  Brunel University London, UK http://www.brunel.ac.uk/~mastmmb/
 Olivier Cappé  Telecom Paris, France http://perso.telecomparistech.fr/~cappe/
 Charles Cavalcante  Federal University of Ceará, Brazil http://www.ppgeti.ufc.br/charles/
 Antonin Chambolle  Ecole Polytechnique, France http://www.cmap.polytechnique.fr/~antonin/
 Frédéric Chazal  INRIA, France http://geometrica.saclay.inria.fr/team/Fred.Chazal/
 Emmanuel Chevallier  Mines ParisTech, France http://cmm.ensmp.fr/~chevallier/
 Sylvain Chevallier  IUT de Vélizy, France https://sites.google.com/site/sylvchev/
 Arshia Cont  Ircam, France http://repmus.ircam.fr/arshiacont
 Benjamin Couéraud  LAREMA Université d'Angers, France
 Philippe Cuvillier  Ircam, France http://repmus.ircam.fr/cuvillier
 Laurent Decreusefond  Telecom ParisTech, France http://www.infres.enst.fr/~decreuse/
 Alexis Decurninge  Huawei Technologies, Paris, France http://www.huawei.com/en/
 Michel Deza  Ecole Normale Supérieure Paris, CNRS, France http://www.liga.ens.fr/~deza/
 Stanley Durrleman  INRIA, France https://who.rocq.inria.fr/Stanley.Durrleman/index.html
 Patrizio Frosini  Università di Bologna, Italy http://www.dm.unibo.it/~frosini/
 Alfred Galichon  New York University, USA http://alfredgalichon.com/
 Jean-Paul Gauthier  University of Toulon, France http://www.lsis.org/gauthierjp/
 Alexis Glaunès  Mines ParisTech, France http://www.mi.parisdescartes.fr/~glaunes/
 Pierre-Yves Gousenbourger  Ecole Polytechnique de Louvain, Belgium http://www.uclouvain.be/pierreyves.gousenbourger
 Piotr Graczyk  University of Angers, France math.univangers.fr
 Peter Grunwald  CWI, Amsterdam, The Netherlands http://homepages.cwi.nl/~pdg/
 Nikolaus Hansen  INRIA, France www.lri.fr
 K V Harsha  Indian Institute of Space Science & Technology, India http://www.iist.ac.in/departments/
 Susan Holmes  Stanford University, USA http://statweb.stanford.edu/~susan/
 Wen Huang  University of Louvain, Belgium
 Stephan Huckemann  Institut für Mathematische Stochastik, Göttingen, Germany http://www.stochastik.math.unigoettingen.de/index.php?id=huckemann
 Shiro Ikeda  ISM, Japan http://www.ism.ac.jp/~shiro/
 Alexander Ivanov  Lomonosov Moscow State University, Russia  Imperial College, UK http://www.imperial.ac.uk/people/a.ivanov
 Jérémie Jakubowicz  Institut Mines Telecom, France http://wwwpublic.itsudparis.eu/~jakubowi/
 Martin Kleinsteuber  Technische Universität München, Germany http://www.professoren.tum.de/en/kleinsteubermartin/
 Ryszard Kostecki  Perimeter Institute for Theoretical Physics, Canada http://www.fuw.edu.pl/~kostecki/
 Hong Van Le  Mathematical Institute of ASCR, Czech Republic http://users.math.cas.cz/~hvle/
 Nicolas Le Bihan  Université de Grenoble, CNRS, France  University of Melbourne, Australia http://www.gipsalab.grenobleinp.fr/~nicolas.lebihan/
 Christian Léonard  Ecole Polytechnique, France http://www.cmap.polytechnique.fr/~leonard/
 Hervé Lombaert  INRIA, France http://step.polymtl.ca/~rv101/
 Jean-Michel Loubes  Toulouse University, France http://perso.math.univtoulouse.fr/loubes/
 Luigi Malagò  Shinshu University, Japan http://malago.di.unimi.it/
 Jonathan Manton  The University of Melbourne http://people.eng.unimelb.edu.au/jmanton/
 Matilde Marcolli  Caltech, USA http://www.its.caltech.edu/~matilde/
 Jean-François Marcotorchino  Thales, France https://www.thalesgroup.com/
 Charles-Michel Marle  Université Pierre et Marie Curie, France http://charlesmichel.marle.pagespersoorange.fr/
 Juliette Mattioli  THALES, France https://www.thalesgroup.com/en
 Bertrand Maury  Université Paris Sud, France http://www.math.upsud.fr/~maury/
 Quentin Mérigot  Université Paris-Dauphine / CNRS, France http://quentin.mrgt.fr/
 Fernand Meyer  Mines ParisTech, France
 Klas Modin  Chalmers University of Technology, Göteborg, Sweden https://klasmodin.wordpress.com/
 Ali Mohammad-Djafari  Supelec, CNRS, France http://djafari.free.fr/
 Guido Montufar  Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany http://personalhomepages.mis.mpg.de/montufar/
 Subrahamanian Moosath  Indian Institute of Space Science and Technology, India http://www.iist.ac.in
 Eric Moulines  Telecom ParisTech, France http://perso.telecomparistech.fr/~moulines/
 Jan Naudts  Universiteit Antwerpen, Belgium https://www.uantwerpen.be/en/staff/jannaudts/mywebsite/
 Frank Nielsen  Ecole Polytechnique, France http://www.lix.polytechnique.fr/~nielsen/
 Richard Nock  Université des Antilles et de la Guyane, France  NICTA, Australia http://www.univag.fr/rnock/index.html
 Yann Ollivier  Université Paris Sud, France http://www.yannollivier.org/
 Jean-Philippe Ovarlez  ONERA & SONDRA Lab, France http://www.jeanphilippeovarlez.com
 Bruno Pelletier  University of Rennes, France http://pelletierb.perso.math.cnrs.fr/
 Xavier Pennec  INRIA, France http://wwwsop.inria.fr/members/Xavier.Pennec/
 Michel Petitjean  Université Paris Diderot, CNRS, France http://petitjeanmichel.free.fr/itoweb.petitjean.html
 Gabriel Peyre  Université Paris Dauphine, CNRS, France http://gpeyre.github.io/
 Giovanni Pistone  Collegio Carlo Alberto, Moncalieri, Italy http://www.giannidiorestino.it/
 Julien Rabin  Université de Caen, France https://sites.google.com/site/rabinjulien/
 Tudor Ratiu  Ecole Polytechnique Federale de Lausanne, Switzerland http://cag.epfl.ch/page39504en.html
 Johannes Rauh  Leibniz Universität Hannover, Germany http://www2.iag.unihannover.de/~jrauh/index.php
 Olivier Rioul  Telecom ParisTech, France http://perso.telecomparistech.fr/~rioul/
 Salem Said  Université de Bordeaux, France https://www.imsbordeaux.fr/fr/annuaire/4069saidsalem
 Alessandro Sarti  Ecole des hautes études en sciences sociales, France http://cams.ehess.fr/document.php?id=1194
 Géry de Saxcé  Université des Sciences et des Technologies de Lille, France http://www.univlille1.fr/
 Olivier Schwander  Ecole Polytechnique, France http://www.lix.polytechnique.fr/~schwander/en/
 Rodolphe Sepulchre  Cambridge University, Department of Engineering, UK http://wwwcontrol.eng.cam.ac.uk/Main/RodolpheSepulchre
 Hichem Snoussi  Université de Technologie de Troyes, France http://h.snoussi.free.fr/
 Anuj Srivastava  Florida State University, USA http://stat.fsu.edu/~anuj/
 Udo von Toussaint  Max-Planck-Institut für Plasmaphysik, Garching, Germany http://home.rzg.mpg.de/~udt/
 Emmanuel Trélat  UPMC, France https://www.ljll.math.upmc.fr/trelat/
 Alain Trouvé  ENS Cachan, France http://atrouve.perso.math.cnrs.fr/
 Corinne Vachier  Université Paris Est Créteil, France www.upec.fr
 Claude Vallée  Poitiers University, France http://www.univpoitiers.fr/
 Geert Verdoolaege  Ghent University, Belgium http://www.ugent.be/ea/appliedphysics/en/research/fusion/personal_pages.htm/verdoolaege.htm
 Jean-Philippe Vert  Mines ParisTech, France http://cbio.ensmp.fr/~jvert/
 François-Xavier Vialard  Ceremade, Paris, France https://www.ceremade.dauphine.fr/~vialard/
 Rui Vigelis  Universidade Federal do Ceará, Brazil
 Stephan Weis  Unicamp, Brazil http://www.stephanweis.info
 Laurent Younes  Johns Hopkins University, USA www.cis.jhu.edu
 Jun Zhang  University of Michigan, Ann Arbor, USA http://www.lsa.umich.edu/psych/junz/
Sponsors and Organizers
Links
Documents
Opening Session (chaired by Frédéric Barbaresco)
Geometric Science of Information SEE/SMAI GSI’15 Conference, LIX Colloquium 2015
Frédéric BARBARESCO* & Frank NIELSEN**, GSI’15 General Chairmen
(*) President of the SEE ISIC Club (Ingénierie des Systèmes d'Information de Communications)
(**) LIX Department, Ecole Polytechnique
Société de l'électricité, de l'électronique et des technologies de l'information et de la communication

Flashback GSI’13, Ecole des Mines de Paris: Hirohiko Shima, Jean-Louis Koszul, Shun-ichi Amari

SEE at a glance
• Meeting place for science, industry and society
• An officially recognised non-profit organisation
• About 2000 members and 5000 individuals involved
• Large participation from industry (~50%)
• 19 «Clubs techniques» and 12 «Groupes régionaux»
• Organizes conferences and seminars
• Initiates/attracts international conferences in France
• Institutional French member of IFAC and IFIP
• Awards (Glavieux/Brillouin Prize, Général Ferrié Prize, Néel Prize, Jerphagnon Prize, Blanc-Lapierre Prize, Thévenin Prize), grades and medals (Blondel, Ampère)
• Publishes 3 periodical publications (REE, …) & 3 monographs each year
• Web: http://www.see.asso.fr and LinkedIn SEE group
• SEE Presidents: Louis de Broglie, Paul Langevin, …

1883-2015: From SIE & SFE to SEE: 132 years of science
• 1881: Exposition Internationale d’Electricité
• 1883: SIE, Société Internationale des Electriciens
• 1886: SFE, Société Française des Electriciens
• 2013: SEE, 17 rue de l'Amiral Hamelin, 75783 Paris Cedex 16
Louis de Broglie, Paul Langevin

GSI’15 Sponsors

GSI Logo: Adelard of Bath
• He left England toward the end of the 11th century for Tours in France.
• Adelard taught for a time at Laon, leaving Laon for travel no later than 1109.
• After Laon, he travelled to Southern Italy and Sicily no later than 1116.
• Adelard also travelled extensively throughout the "lands of the Crusades": Greece, West Asia, Sicily, Spain, and potentially Palestine.
• The frontispiece of an Adelard of Bath Latin translation of Euclid's Elements, c. 1309-1316; the oldest surviving Latin translation of the Elements is a 12th-century translation by Adelard from an Arabic version.
• Adelard of Bath was the first to translate Euclid's Elements into Latin.
• Adelard of Bath introduced the word «Algorismus» into Latin after his translation of Al-Khwarizmi.

SMAI/SEE GSI’15
• More than 150 attendees from 15 different countries
• 85 scientific presentations over 3 days
• 3 keynote speakers:
 • Matilde MARCOLLI (Caltech): "From Geometry and Physics to Computational Linguistics"
 • Tudor RATIU (EPFL): "Symmetry methods in geometric mechanics"
 • Marc ARNAUDON (Bordeaux University): "Stochastic Euler-Poincaré reduction"
• 1 short course, chaired by Roger BALIAN:
 • Dominique SPEHNER (Grenoble University): "Geometry on the set of quantum states and quantum correlations"
• 1 guest speaker:
 • Charles-Michel MARLE (UPMC): "Actions of Lie groups and Lie algebras on symplectic and Poisson manifolds. Application to Hamiltonian systems"
• Social events:
 • Welcome cocktail at Ecole Polytechnique
 • Dinner in the Versailles Palace Gardens

GSI’15 Topics
• GSI’15 federates skills from Geometry, Probability and Information Theory:
• Dimension reduction on Riemannian manifolds
• Optimal Transport and applications in Imagery/Statistics
• Shape Space & Diffeomorphic mappings
• Random Geometry/Homology
• Hessian Information Geometry
• Topological forms and Information
• Information Geometry Optimization
• Information Geometry in Image Analysis
• Divergence Geometry
• Optimization on Manifold
• Lie Groups and Geometric Mechanics/Thermodynamics
• Computational Information Geometry
• Lie Groups: Novel Statistical and Computational Frontiers
• Geometry of Time Series and Linear Dynamical Systems
• Bayesian and Information Geometry for Inverse Problems
• Probability Density Estimation

GSI’15 Program

GSI’15 Proceedings
• Published by SPRINGER in «Lecture Notes in Computer Science», LNCS vol. 9389 (800 pages), ISBN 9783319250397
• http://www.springer.com/us/book/9783319250397

GSI’15 Special Issue
• Authors will be invited to submit a paper to the Special Issue "Differential Geometrical Theory of Statistics" of the ENTROPY journal, an international and interdisciplinary open-access journal of entropy and information studies published monthly online by MDPI
• http://www.mdpi.com/journal/entropy/special_issues/entropystatistics
• A book could be edited by MDPI

Ecole Polytechnique
• Special thanks to the «LIX» Department
• A product of the French Revolution and the Age of Enlightenment, École Polytechnique has a rich history that spans over 220 years.
https://www.polytechnique.edu/en/history
Henri Poincaré – X1873
Paris-Saclay University in Top 8 World Innovation Hubs
http://www.technologyreview.com/news/517626/infographictheworldstechnologyhubs/

A new Grammar of Information
"Mathematics is the art of giving the same name to different things" – Henri Poincaré
GROUP EVERYWHERE: Elie Cartan, Henri Poincaré
METRIC EVERYWHERE: Maurice Fréchet, Misha Gromov

Elie Cartan: Group Everywhere (Henri Poincaré's review of Cartan's works):
"The problems addressed by Elie Cartan are among the most important, most abstract and most general in mathematics; group theory is, so to speak, the whole of mathematics, stripped of its material and reduced to pure form. This extreme level of abstraction has probably made my presentation a little dry; to appreciate each of the results, I would have had to virtually give it back the material of which it had been stripped; but this restitution can be made in a thousand different ways; and this unique form, which can be clothed in a host of various garments, is the common link between mathematical theories that one is often surprised to find so close." – H. Poincaré

Maurice Fréchet: Metric Everywhere
• Maurice Fréchet made major contributions to the topology of point sets and introduced the entire concept of metric spaces.
• His dissertation opened the entire field of functionals on metric spaces and introduced the notion of compactness.
• He extended probability to metric spaces in 1948 (Annales de l'IHP): "Les éléments aléatoires de nature quelconque dans un espace distancié", an extension of probability/statistics to abstract/metric spaces.

GSI’15 & Geometric Mechanics
• The master of geometry of the last century, Elie Cartan, was the son of Joseph Cartan, the village blacksmith.
• Elie recalled that his childhood had passed under "blows of the anvil, which started every morning from dawn".
• We can easily imagine the child Elie Cartan watching his father Joseph "coding curvature" on metal between hammer and anvil, insidiously seeding his mind with germinal intuitions of fundamental geometric concepts.
• The word "forge" (late 14th century, "a smithy") comes from Old French forge "forge, smithy" (12th century), earlier faverge, from Latin fabrica "workshop, smith's shop", from faber (genitive fabri) "workman in hard materials, smith".
HAMMER = The Coder, ANVIL = Curvature Libraries (Bigorne, Bicorne)
Venus at the Forge of Vulcan, Le Nain Brothers, Musée Saint-Denis, Reims

From Homo Sapiens to Homo Faber
"Intelligence is the faculty of manufacturing artificial objects, especially tools to make tools, and of indefinitely varying the manufacture." – Henri Bergson
Into the Flaming Forge of Vulcan, Diego Velázquez, Museo Nacional del Prado
Geometric Thermodynamics & Statistical Physics

Enjoy all «Geometries» (Dinner at the Versailles Palace Gardens)
• Restaurant of the GSI’15 Gala Dinner
• André Le Nôtre, landscape geometer of Versailles, the apex of "Le Jardin à la française"
• Louis XIV, patron of science: the Royal Academy of Sciences was established in 1666
• On 1 September 1715, 300 years ago, Louis XIV passed away at the age of 77, having reigned for 72 years

Keynote Speakers
Prof. Matilde MARCOLLI (Caltech, USA): From Geometry and Physics to Computational Linguistics
Abstract: I will show how techniques from geometry (algebraic geometry and topology) and physics (statistical physics) can be applied to linguistics, in order to provide a computational approach to questions of syntactic structure and language evolution, within the context of Chomsky's Principles and Parameters framework.
Biography:
• Laurea in Physics, University of Milano, 1993
• Master of Science, Mathematics, University of Chicago, 1994
• PhD, Mathematics, University of Chicago, 1997
• Moore Instructor, Massachusetts Institute of Technology, 1997-2000
• Associate Professor (C3), Max Planck Institute for Mathematics, 2000-2008
• Professor, California Institute of Technology, 2008-present
• Distinguished Visiting Research Chair, Perimeter Institute for Theoretical Physics, 2013-present
Talk chaired by Daniel Bennequin

Keynote Speakers
Prof. Marc ARNAUDON (Bordeaux University, France): Stochastic Euler-Poincaré reduction
Abstract: We will prove an Euler-Poincaré reduction theorem for stochastic processes taking values in a Lie group, which is a generalization of the Lagrangian version of reduction and its associated variational principles. We will also show examples of its application to the rigid body and to the group of diffeomorphisms, which includes the Navier-Stokes equation on a bounded domain and the Camassa-Holm equation.
Biography: Marc Arnaudon was born in France in 1965. He graduated from Ecole Normale Supérieure de Paris, France, in 1991. He received the PhD degree in mathematics and the Habilitation à diriger des Recherches degree from Strasbourg University, France, in January 1994 and January 1998 respectively. After postdoctoral research and teaching at Strasbourg, he took up in September 1999 a full professor position in the Department of Mathematics at Poitiers University, France, where he was head of the Probability Research Group. In January 2013 he left Poitiers and joined the Department of Mathematics of Bordeaux University, France, where he is a full professor in mathematics.
Talk chaired by Frank Nielsen

Keynote Speakers
Prof. Tudor RATIU (EPFL, Switzerland): Symmetry methods in geometric mechanics
Abstract: The goal of these lectures is to show the influence of symmetry in various aspects of theoretical mechanics.
Canonical actions of Lie groups on Poisson manifolds often give rise to conservation laws, encoded in modern language by the concept of momentum maps. Reduction methods lead to a deeper understanding of the dynamics of mechanical systems. Basic results in singular Hamiltonian reduction will be presented. The Lagrangian version of reduction and its associated variational principles will also be discussed. The understanding of symmetric bifurcation phenomena for Hamiltonian systems is based on these reduction techniques. Time permitting, discrete versions of these geometric methods will also be discussed in the context of examples from elasticity.
Biography:
• BA in Mathematics, University of Timisoara, Romania, 1973
• MA in Applied Mathematics, University of Timisoara, Romania, 1974
• PhD in Mathematics, University of California, Berkeley, 1980
• T.H. Hildebrandt Research Assistant Professor, University of Michigan, Ann Arbor, USA, 1980-1983
• Associate Professor of Mathematics, University of Arizona, Tucson, USA, 1983-1988
• Professor of Mathematics, University of California, Santa Cruz, USA, 1988-2001
• Chaired Professor of Mathematics, Ecole Polytechnique Federale de Lausanne, Switzerland, 1998-present
• Professor of Mathematics, Skolkovo Institute of Science and Technology, Moscow, Russia, 2014-present
Talk chaired by Xavier Pennec

Short Course
Prof. Dominique SPEHNER (Grenoble University): Geometry on the set of quantum states and quantum correlations
Abstract: I will show that the set of states of a quantum system with a finite-dimensional Hilbert space can be equipped with various Riemannian distances having nice properties from a quantum information viewpoint, namely they are contractive under all physically allowed operations on the system. The corresponding metrics are quantum analogs of the Fisher metric and have been classified by D. Petz. Two distances are particularly relevant physically: the Bogoliubov-Kubo-Mori distance studied by R. Balian, Y. Alhassid and H. Reinhardt, and the Bures distance studied by A. Uhlmann and by S.L. Braunstein and C.M. Caves. The latter gives the quantum Fisher information, which plays an important role in quantum metrology. A way to measure the amount of quantum correlations (entanglement or quantum discord) in bipartite systems (that is, systems composed of two parties) with the help of these distances will also be discussed.
Biography:
• Diplôme d'Études Approfondies (DEA) in Theoretical Physics, École Normale Supérieure de Lyon, 1994
• Civil Service (Service National de la Coopération), Technion Institute of Technology, Haifa, Israel, 1995-1996
• PhD in Theoretical Physics, Université Paul Sabatier, Toulouse, France, 1996-2000
• Postdoctoral fellow, Pontificia Universidad Católica, Santiago, Chile, 2000-2001
• Research Associate, University of Duisburg-Essen, Germany, 2001-2005
• Maître de Conférences, Université Joseph Fourier, Grenoble, France, 2005-present
• Habilitation à diriger des Recherches (HDR), Université Grenoble Alpes, 2015
• Member of the Institut Fourier (since 2005) and of the Laboratoire de Physique et Modélisation des Milieux Condensés (since 2013), Université Grenoble Alpes, France
Talk chaired by Roger Balian

Guest Speakers
Prof. Charles-Michel MARLE (UPMC, France): Actions of Lie groups and Lie algebras on symplectic and Poisson manifolds. Application to Hamiltonian systems
Abstract: I will present some tools of symplectic and Poisson geometry in view of their applications in geometric mechanics and mathematical physics. Lie group and Lie algebra actions on symplectic and Poisson manifolds, momentum maps and their equivariance properties, and first integrals associated to symmetries of Hamiltonian systems will be discussed. Reduction methods taking advantage of symmetries will be discussed.
Biography: Charles-Michel Marle was born in 1934. He studied at Ecole Polytechnique (1953-1955), Ecole Nationale Supérieure des Mines de Paris (1957-1958) and Ecole Nationale Supérieure du Pétrole et des Moteurs (1957-1958). He obtained a doctor's degree in Mathematics at the University of Paris in 1968. From 1959 to 1969 he worked as a research engineer at the Institut Français du Pétrole. He joined the Université de Besançon as Associate Professor in 1969, and the Université Pierre et Marie Curie, first as Associate Professor (1975) and then as full Professor (1981). His research works were first about fluid flows through porous media, then about differential geometry, Hamiltonian systems and applications in mechanics and mathematical physics.
Talk chaired by Frédéric Barbaresco
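For reference, the Bures distance named in Dominique Spehner's short-course abstract above has a standard closed form in terms of the Uhlmann fidelity (this is the textbook definition, not a formula taken from the talk):

```latex
% Uhlmann fidelity between density matrices \rho and \sigma:
F(\rho,\sigma) = \left( \operatorname{Tr}\sqrt{\sqrt{\rho}\,\sigma\,\sqrt{\rho}} \right)^{2},
% and the Bures distance derived from it:
\qquad
d_B(\rho,\sigma) = \sqrt{2\left(1 - \sqrt{F(\rho,\sigma)}\right)}.
```

For commuting states the fidelity reduces to the classical Bhattacharyya overlap, which is how this distance generalizes the Fisher-metric geometry of probability distributions.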
Keynote speech by Matilde Marcolli (chaired by Daniel Bennequin)
From Geometry and Physics to Computational Linguistics
Matilde Marcolli
Geometric Science of Information, Paris, October 2015
Matilde Marcolli, Geometry, Physics, Linguistics

A Mathematical Physicist's adventures in Linguistics. Based on:
1 Alexander Port, Iulia Gheorghita, Daniel Guth, John M. Clark, Crystal Liang, Shival Dasu, Matilde Marcolli, Persistent Topology of Syntax, arXiv:1507.05134
2 Karthik Siva, Jim Tao, Matilde Marcolli, Spin Glass Models of Syntax and Language Evolution, arXiv:1508.00504
3 Jeong Joon Park, Ronnel Boettcher, Andrew Zhao, Alex Mun, Kevin Yuh, Vibhor Kumar, Matilde Marcolli, Prevalence and recoverability of syntactic parameters in sparse distributed memories, arXiv:1510.06342
4 Sharjeel Aziz, Vy-Luan Huynh, David Warrick, Matilde Marcolli, Syntactic Phylogenetic Trees, in preparation ...coming soon to an arXiv near you

What is Linguistics?
• Linguistics is the scientific study of language
 - What is Language? (langage, lenguaje, ...)
 - What is a Language? (langue, lengua, ...)
 Similar to "What is Life?" or "What is an organism?" in biology
• natural language as opposed to artificial (formal, programming, ...) languages
• The point of view we will focus on: Language is a kind of Structure
 - It can be approached mathematically and computationally, like many other kinds of structures
 - The main purpose of mathematics is the understanding of structures

• How are different languages related? What does it mean that they come in families?
• How do languages evolve in time? Phylogenetics, Historical Linguistics, Etymology
• How does the process of language acquisition work? (Neuroscience)
• Semiotic viewpoint (mathematical theory of communication)
• Discrete versus Continuum (probabilistic methods versus discrete structures)
• Descriptive or Predictive?
To be predictive, a science needs good mathematical models.

A language exists at many different levels of structure. An analogy: physics looks very different at different scales:
• General Relativity and Cosmology (~10^10 m)
• Classical Physics (~1 m)
• Quantum Physics (~10^-10 m)
• Quantum Gravity (10^-35 m)
Despite dreams of a Unified Theory, we deal with different mathematical models for different levels of structure.

Similarly, we view language at different "scales":
• units of sound (phonology)
• words (morphology)
• sentences (syntax)
• global meaning (semantics)
We expect to be dealing with different mathematical structures and different models at these various levels. Main level I will focus on: Syntax

The linguistics view of syntax kind of looks like this... (Alexander Calder, Mobile, 1960)

Modern Syntactic Theory:
• grammaticality: judgement on whether a sentence is well formed (grammatical) in a given language; i-language gives people the capacity to decide on grammaticality
• generative grammar: produce a set of rules that correctly predict grammaticality of sentences
• universal grammar: the ability to learn grammar is built into the human brain, e.g. properties like the distinction between nouns and verbs are universal ... is universal grammar a falsifiable theory?
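The "generative grammar" idea above, a finite rule set that predicts which sentences are grammatical, can be made concrete with a toy sketch. The three-rule grammar, the example sentences and the function name `grammatical` below are invented for illustration (standard CYK chart parsing, not anything specific from the talk):

```python
from itertools import product

# Toy context-free grammar in Chomsky normal form (hypothetical example).
UNARY = {                       # terminal rules: word -> heads deriving it
    "she": {"NP"}, "apples": {"NP"}, "eats": {"V"},
}
BINARY = {                      # binary rules: (B, C) -> heads A with A -> B C
    ("NP", "VP"): {"S"}, ("V", "NP"): {"VP"},
}

def grammatical(words):
    """CYK chart parsing: True iff the start symbol S derives the word list."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] = set of nonterminals deriving the span words[i..j]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i] = set(UNARY.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                       # split point
                for B, C in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= BINARY.get((B, C), set())
    return "S" in chart[0][n - 1]

print(grammatical("she eats apples".split()))   # True: derivable from S
print(grammatical("eats she apples".split()))   # False: no derivation exists
```

The recognizer is the "descriptive vs. predictive" point in miniature: the rule set makes a sharp, testable claim about every string, in cubic time in sentence length.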
Principles and Parameters (Government and Binding) (Chomsky, 1981)
• principles: general rules of grammar
• parameters: binary variables (on/off switches) that distinguish languages in terms of syntactic structures
• Example of parameter: head-directionality (head-initial versus head-final). English is head-initial, Japanese is head-final. (VP = verb phrase, TP = tense phrase, DP = determiner phrase)

...but not always so clear-cut: German can use both structures: auf seine Kinder stolze Vater (head-final) or er ist stolz auf seine Kinder (head-initial). (AP = adjective phrase, PP = prepositional phrase)
• Corpora-based statistical analysis of head-directionality (Haitao Liu, 2010): a continuum between head-initial and head-final

Examples of Parameters: head-directionality, subject-side, pro-drop, null-subject
Problems:
• Interdependencies between parameters
• Diachronic changes of parameters in language evolution

Dependent parameters
• null-subject parameter: can drop the subject. Example: among Latin languages, Italian and Spanish have null-subject (+), French does not (-): it rains, piove, llueve, il pleut
• pro-drop parameter: can drop pronouns in sentences
• Pro-drop controls null-subject
How many independent parameters? Geometry of the space of syntactic parameters?
Persistent Topology of Syntax
• Alexander Port, Iulia Gheorghita, Daniel Guth, John M. Clark, Crystal Liang, Shival Dasu, Matilde Marcolli, Persistent Topology of Syntax, arXiv:1507.05134
Databases of Syntactic Parameters of World Languages:
1 Syntactic Structures of World Languages (SSWL) http://sswl.railsplayground.net/
2 TerraLing http://www.terraling.com/
3 World Atlas of Language Structures (WALS) http://wals.info/

Persistent Topology of Data Sets: how data cluster around topological shapes at different scales

Vietoris-Rips complexes
• set X = {x_α} of points in Euclidean space E^N, with distance d(x, y) = ‖x − y‖ = (Σ_{j=1}^N (x_j − y_j)²)^{1/2}
• the Vietoris-Rips complex R(X, ε) of scale ε over a field K: R_n(X, ε) is the K-vector space spanned by all unordered (n+1)-tuples of points {x_{α_0}, x_{α_1}, ..., x_{α_n}} in X where all pairs have distances d(x_{α_i}, x_{α_j}) ≤ ε
• inclusion maps R(X, ε_1) ↪ R(X, ε_2) for ε_1 < ε_2 induce maps in homology by functoriality: H_n(X, ε_1) → H_n(X, ε_2)
• barcode diagrams: births and deaths of persistent generators

Persistent Topology of Syntactic Parameters
• Data: 252 languages from SSWL with 115 parameters
• considering all world languages together gives too much noise in the persistent topology: subdivide by language families
• Principal Component Analysis: reduce dimensionality of data
• compute the Vietoris-Rips complex and barcode diagrams
• Persistent H0: clustering of data in components – language subfamilies
• Persistent H1: clustering of data along closed curves (circles) – linguistic meaning?
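The persistent-H0 part of the pipeline above (connected components of the Vietoris-Rips complex as the scale ε grows) needs only the 1-skeleton, so it can be sketched with a union-find over pairwise distances. The five binary "parameter vectors" below are hypothetical toy data, not SSWL entries:

```python
def hamming(u, v):
    """Distance between two binary syntactic-parameter vectors."""
    return sum(a != b for a, b in zip(u, v))

def h0_components(points, eps):
    """Number of connected components of the Vietoris-Rips 1-skeleton at
    scale eps (the rank of H0 at that scale), via union-find."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if hamming(points[i], points[j]) <= eps:    # edge {i, j} enters
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# Hypothetical 4-parameter vectors for 5 "languages" (illustration only).
langs = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
# Sweeping eps traces the H0 barcode: components merge as the scale grows.
for eps in range(5):
    print(eps, h0_components(langs, eps))
```

The inclusions R(X, ε_1) ↪ R(X, ε_2) appear here as the fact that edges only accumulate as ε grows, so the component count is non-increasing; long-lived components are the "language subfamilies" of the slide.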
Sources of persistent H_1
• a "Hopf bifurcation" type phenomenon
• two different branches of a tree closing up in a loop
– two different types of phenomena of historical linguistic development within a language family

Persistent topology of Indo-European languages
• two persistent generators of H_0 (Indo-Iranian, European)
• one persistent generator of H_1

Persistent topology of Niger–Congo languages
• three persistent components of H_0 (Mande, Atlantic-Congo, Kordofanian)
• no persistent H_1

The origin of the persistent H_1 of the Indo-European languages? Naive guess: the Anglo-Norman bridge... but that bridge is lexical, not syntactic.
Answer: no, it is not the Anglo-Norman bridge! (Persistent topology of the Germanic + Latin languages.)
Answer: it is all because of Ancient Greek! (Persistent topology with the Hellenic (and Indo-Iranic) branch removed.)

Syntactic parameters as dynamical variables
• Example: word order: SOV, SVO, VSO, VOS, OVS, OSV – a very uneven distribution across world languages
• Word order distribution: a neuroscience explanation?
– D. Kemmerer, The cross-linguistic prevalence of SOV and SVO word orders reflects the sequential and hierarchical representation of action in Broca's area, Language and Linguistics Compass, 6 (2012) N.1, 50–66.
• Internal reasons for the diachronic switch?
– F. Antinucci, A. Duranti, L. Gebert, Relative clause structure, relative clause perception, and the change from SOV to SVO, Cognition, Vol. 7 (1979) N.2, 145–176.

Changes over time in word order
• Ancient Greek: switched from Homeric to Classical
– A. Taylor, The change from SOV to SVO in Ancient Greek, Language Variation and Change, 6 (1994) 1–37.
• Sanskrit: different word orders allowed, but the prevalent one in Vedic Sanskrit is SOV (switched at least twice under the influence of Dravidian languages)
– F.J. Staal, Word Order in Sanskrit and Universal Grammar, Springer, 1967.
• English: switched from Old English (transitional between SOV and SVO) to Middle English (SVO)
– J. McLaughlin, Old English Syntax: a handbook, Walter de Gruyter, 1983.
Syntactic parameters are dynamical in language evolution.

Spin Glass Models of Syntax
• Karthik Siva, Jim Tao, Matilde Marcolli, Spin Glass Models of Syntax and Language Evolution, arXiv:1508.00504
– focus on linguistic change caused by language interactions
– think of syntactic parameters as spin variables
– spin interactions tend to align (ferromagnet)
– strength of interaction proportional to bilingualism (MediaLab data)
– role of the temperature parameter: probabilistic interpretation of parameters
– not all parameters are independent: entailment relations
– Metropolis–Hastings algorithm: simulate the evolution

The Ising model of spin systems on a graph G
• configurations of spins s : V(G) → {±1}
• magnetic field B and correlation strength J: Hamiltonian
  H(s) = −J ∑_{e ∈ E(G): ∂(e) = {v,v′}} s_v s_{v′} − B ∑_{v ∈ V(G)} s_v
• the first term measures the degree of alignment of nearby spins
• the second term measures the alignment of the spins with the direction of the magnetic field

Equilibrium probability distribution
• partition function: Z_G(β) = ∑_{s : V(G) → {±1}} exp(−β H(s))
• probability distribution on the configuration space, the Gibbs measure:
  P_{G,β}(s) = e^{−β H(s)} / Z_G(β)
• low-energy states weigh most
• at low temperature (large β) the ground state dominates; at higher temperature (small β) higher-energy states also contribute

Average spin magnetization
  M_G(β) = (1/#V(G)) ∑_{s : V(G) → {±1}} ∑_{v ∈ V(G)} s_v P(s)
• free energy F_G(β, B) = −log Z_G(β, B), with
  M_G(β) = −(1/#V(G)) (1/β) (∂F_G(β, B)/∂B)|_{B=0}

Ising model on a 2-dimensional lattice
• ∃ a critical temperature T = T_c at which a phase transition occurs
• for T > T_c the equilibrium state has m(T) = 0 (computed with respect to the equilibrium Gibbs measure P_{G,β}): demagnetization, on average as many up as down spins
• for T < T_c one has m(T) > 0: spontaneous magnetization

Syntactic parameters and Ising/Potts models
• characterize a set of n = 2^N languages L_i by binary strings of N syntactic parameters (Ising model)
• or by ternary strings (Potts model), taking value ±1 for parameters that are set and 0 for parameters that are not defined in a given language
• a system of n interacting languages = a graph G with n = #V(G)
• languages L_i = vertices of the graph (e.g. a language that occupies a certain geographic area)
• languages that interact with each other = edges E(G) (geographical proximity, or a high volume of exchange for other reasons)

Graph of language interaction (detail) from the Global Language Network of the MIT MediaLab, with interaction strengths J_e on the edges based on numbers of book translations (or of Wikipedia edits).

• with only one syntactic parameter, one would have an Ising model on the graph G: configurations s : V(G) → {±1} set the parameter at all the locations on the graph
• variable interaction energies along the edges (some pairs of languages interact more than others):
  H(s) = −∑_{e ∈ E(G): ∂(e) = {v,v′}} ∑_{i=1}^N J_e s_{v,i} s_{v′,i}
• with N parameters, the configurations are s = (s_1, ..., s_N) : V(G) → {±1}^N
• if all N parameters are independent, this is like having N non-interacting copies of an Ising model on the same graph G (or N independent choices of an initial state in an Ising model on G)

Metropolis–Hastings
• detailed balance condition P(s) P(s → s′) = P(s′) P(s′ → s) for the probabilities of transitioning between states (Markov process)
• transition probabilities P(s → s′) = π_A(s → s′) · π(s → s′), with π(s → s′) the conditional probability of proposing state s′ given state s, and π_A(s → s′) the conditional probability of accepting it
• Metropolis–Hastings choice of the acceptance distribution (Gibbs):
  π_A(s → s′) = 1 if H(s′) − H(s) ≤ 0, and exp(−β(H(s′) − H(s))) if H(s′) − H(s) > 0,
  which satisfies detailed balance
• selection probabilities π(s → s′): single-spin-flip dynamics
• ergodicity of the Markov process ⇒ unique stationary distribution

Example: single-parameter dynamics, the Subject–Verb parameter. Initial configuration: most languages in SSWL have +1 for Subject–Verb; interaction energies from the MediaLab data. Equilibrium: at low temperature all spins aligned to +1; at high temperature, fluctuations. Temperature: fluctuations in bilingual users between different structures ("code-switching" in linguistics).

Entailment relations among parameters
• Example: {p_1, p_2} = {Strong Deixis, Strong Anaphoricity}, for the languages {ℓ_1, ℓ_2, ℓ_3, ℓ_4} = {English, Welsh, Russian, Bulgarian}:
        p_1   p_2
  ℓ_1   +1    +1
  ℓ_2   −1     0
  ℓ_3   +1    +1
  ℓ_4   +1    −1

Modeling entailment
• variables: S_{ℓ,p_1} = exp(πi X_{ℓ,p_1}) ∈ {±1}, S_{ℓ,p_2} ∈ {±1, 0}, and Y_{ℓ,p_2} = |S_{ℓ,p_2}| ∈ {0, 1}
• Hamiltonian H = H_E + H_V, with
  H_E = H_{p_1} + H_{p_2} = −∑_{ℓ,ℓ′ ∈ languages} J_{ℓℓ′} (δ_{S_{ℓ,p_1}, S_{ℓ′,p_1}} + δ_{S_{ℓ,p_2}, S_{ℓ′,p_2}})
  H_V = ∑_ℓ H_{V,ℓ} = ∑_ℓ J_ℓ δ_{X_{ℓ,p_1}, Y_{ℓ,p_2}}, with J_ℓ > 0 (antiferromagnetic)
• two parameters: the temperature, as before, and the coupling energy of the entailment
• if one freezes p_1 and evolves p_2: a Potts model with an external magnetic field

Acceptance probabilities:
  π_A(s → s ± 1 (mod 3)) = 1 if ΔH ≤ 0, and exp(−β ΔH) if ΔH > 0, where
  ΔH := min{H(s + 1 (mod 3)), H(s − 1 (mod 3))} − H(s)

Equilibrium configurations (p_1, p_2) at high/low temperature (HT/LT) and high/low entailment energy (HE/LE):
        HT/HE      HT/LE      LT/HE      LT/LE
  ℓ_1   (+1, 0)    (+1, −1)   (+1, +1)   (+1, −1)
  ℓ_2   (+1, −1)   (−1, −1)   (+1, +1)   (+1, −1)
  ℓ_3   (−1, 0)    (−1, +1)   (+1, +1)   (−1, 0)
  ℓ_4   (+1, +1)   (−1, −1)   (+1, +1)   (−1, 0)

Figure: average value of the spin p_1 (left) and p_2 (right) in the low entailment-energy case.

Syntactic Parameters in Kanerva Networks
• Jeong Joon Park, Ronnel Boettcher, Andrew Zhao, Alex Mun, Kevin Yuh, Vibhor Kumar, Matilde Marcolli, Prevalence and recoverability of syntactic parameters in sparse distributed memories, arXiv:1510.06342
– two issues addressed: the relative prevalence of different syntactic parameters, and their "degree of recoverability" (as a sign of underlying relations between parameters)
– if the information about one parameter is corrupted in the data of a group of languages, can it be recovered from the data of the other parameters?
– answer: different parameters have different degrees of recoverability
– used 21 parameters and 165 languages from the SSWL database

Kanerva networks (sparse distributed memories)
• P. Kanerva, Sparse Distributed Memory, MIT Press, 1988.
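The Metropolis–Hastings single-spin-flip dynamics described earlier can be sketched as follows. The tiny ring graph and couplings below are illustrative placeholders, not the MediaLab interaction data:

```python
# Hedged sketch: single-spin-flip Metropolis-Hastings for an Ising model
# on an interaction graph, H(s) = -sum_e J_e s_v s_v'.
import math
import random

def metropolis_ising(n, edges, beta, steps, seed=0):
    """edges: dict {(u, v): J_uv}. Returns final spins and magnetization."""
    rng = random.Random(seed)
    s = [1] * n                                   # start fully aligned
    nbrs = {v: [] for v in range(n)}
    for (u, v), J in edges.items():
        nbrs[u].append((v, J))
        nbrs[v].append((u, J))
    for _ in range(steps):
        v = rng.randrange(n)                      # propose flipping one spin
        dH = 2 * s[v] * sum(J * s[w] for w, J in nbrs[v])
        # accept with probability min(1, exp(-beta * dH)): detailed balance
        if dH <= 0 or rng.random() < math.exp(-beta * dH):
            s[v] = -s[v]
    return s, sum(s) / n

# low temperature (large beta) on a ferromagnetic 4-cycle: spins stay aligned
spins, m = metropolis_ising(4, {(0, 1): 1.0, (1, 2): 1.0,
                                (2, 3): 1.0, (3, 0): 1.0},
                            beta=5.0, steps=2000)
```

Raising the temperature (small `beta`) makes rejected flips rare and the magnetization fluctuate, which is the regime interpreted above as code-switching.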
• the field F_2 = {0, 1} and the vector space F_2^N, with N large
• a uniform random sample of 2^k hard locations, with 2^k ≪ 2^N
• compute the median Hamming distance between hard locations
• access sphere: the Hamming sphere of radius slightly larger than the median value
• writing a datum X ∈ F_2^N to the network: each hard location in the access sphere of X has its i-th counter (initialized at zero) incremented or decremented according to the i-th entry of X
• reading at a location: the i-th entry is determined by the majority rule over the i-th entries of all the data stored at hard locations within the access sphere
Kanerva networks are good at reconstructing corrupted data.

Procedure
• the 165 data points (languages) are stored in a Kanerva network in F_2^21 (choice of 21 parameters)
• corrupting one parameter at a time: analyze its recoverability
• the language bit string with a single corrupted bit is used as a read location, and the resulting bit string is compared to the original one (Hamming distance)
• the resulting average Hamming distance is used as a score of recoverability (lowest = most easily recoverable parameter)

Parameters and frequencies:
01 Subject–Verb (0.64957267), 02 Verb–Subject (0.31623933), 03 Verb–Object (0.61538464), 04 Object–Verb (0.32478634), 05 Subject–Verb–Object (0.56837606), 06 Subject–Object–Verb (0.30769232), 07 Verb–Subject–Object (0.1923077), 08 Verb–Object–Subject (0.15811966), 09 Object–Subject–Verb (0.12393162), 10 Object–Verb–Subject (0.10683761), 11 Adposition–Noun-Phrase (0.58974361), 12 Noun-Phrase–Adposition (0.2905983), 13 Adjective–Noun (0.41025642), 14 Noun–Adjective (0.52564102), 15 Numeral–Noun (0.48290598), 16 Noun–Numeral (0.38034189), 17 Demonstrative–Noun (0.47435898), 18 Noun–Demonstrative (0.38461539), 19 Possessor–Noun (0.38034189), 20 Noun–Possessor (0.49145299), A01 Attributive-Adjective-Agreement (0.46581197)

Overall effect related to the relative prevalence of a parameter.
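The write/read procedure of a sparse distributed memory can be sketched as follows. The sizes `N`, `n_hard` and `radius` are toy choices for illustration, not the settings used in the paper:

```python
# Hedged sketch of a Kanerva-style sparse distributed memory over F_2^N:
# random hard locations, write by incrementing/decrementing counters inside
# the access sphere, read by majority rule over the same sphere.
import random

N, n_hard, radius = 16, 200, 6          # toy sizes (illustrative only)
rng = random.Random(1)
hard = [[rng.randint(0, 1) for _ in range(N)] for _ in range(n_hard)]
counters = [[0] * N for _ in range(n_hard)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def write(x):
    for loc, cnt in zip(hard, counters):
        if hamming(loc, x) <= radius:           # inside the access sphere
            for i, b in enumerate(x):
                cnt[i] += 1 if b else -1        # +1 for bit 1, -1 for bit 0

def read(x):
    totals = [0] * N
    for loc, cnt in zip(hard, counters):
        if hamming(loc, x) <= radius:
            for i in range(N):
                totals[i] += cnt[i]
    return [1 if t > 0 else 0 for t in totals]  # majority rule

datum = [rng.randint(0, 1) for _ in range(N)]
write(datum)
corrupted = datum[:]
corrupted[0] ^= 1                               # flip one "parameter"
recovered = read(corrupted)                     # majority vote repairs it
```

Reading at the corrupted address still lands mostly inside the access sphere of the stored datum, which is why a single flipped bit is typically repaired.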
More refined effect after normalizing for prevalence (syntactic dependencies).
• Overall effect relating recoverability in a Kanerva network to the prevalence of a parameter among languages (it depends only on the frequencies: the same effect is seen in random data with assigned frequencies)
• Additional effects (deviating from the random case) which detect possible dependencies among syntactic parameters: increased recoverability beyond the frequency-based effect
• Possible neuroscience implications? Kanerva networks as models of human memory (parameter prevalence linked to neuroscience models)
• More refined data if divided by language families?

Phylogenetic Linguistics (WORK IN PROGRESS)
• Constructing family trees for languages (sometimes, possibly, graphs with loops)
• Main information about subgrouping: shared innovations, i.e. specific changes, relative to the other languages in the family, that only happen in a certain subset of the languages
• Example: among Mayan languages, the Huastecan branch is characterized by initial w becoming voiceless before a vowel, ts becoming t, q becoming k, ...; the Quichean branch by the velar nasal becoming a velar fricative, ´c becoming ˇc (prepalatal affricate to palato-alveolar), ...
Known result by traditional historical-linguistics methods: the Mayan language tree.

Computational Methods for Phylogenetic Linguistics
• Peter Foster, Colin Renfrew, Phylogenetic methods and the prehistory of languages, McDonald Institute Monographs, 2006
• Several computational methods for constructing phylogenetic trees are available from mathematical and computational biology
• Phylogeny programs: http://evolution.genetics.washington.edu/phylip/software.html
• Standardized lexical databases: Swadesh lists (100 words, or 207 words)

• Use the Swadesh lists of the languages in a given family to look for cognates:
– without additional etymological information (keeping false positives)
– with additional etymological information (removing false positives)
• Two further choices about loan words: remove them, or keep them
• Keeping loan words produces graphs that are not trees
• Without loan words the method should produce trees, but small loops still appear, due to ambiguities (different possible trees matching the same data) ... more precisely, due to the coding of the lexical data

Coding of lexical data
• After compiling lists of cognate words for pairs of languages within a given family (with/without etymological information and loan words)
• Produce a binary string S(L_1, L_2) = (s_1, ..., s_N) for each pair of languages L_1, L_2, with entry 0 or 1 at the i-th word of the lexical list of N words according to whether or not cognates for that meaning exist in the two languages (important to pay attention to synonyms)
• lexical Hamming distance between two languages: d(L_1, L_2) = #{i ∈ {1, ..., N} : s_i = 1}, counting the words in the list that do not have cognates in L_1 and L_2

Distance-matrix methods of phylogenetic inference
• after producing a measure of "genetic distance", the Hamming metric d_H(L_a, L_b)
• hierarchical data clustering: collecting objects into clusters according to their distance
• simplest method of tree construction: neighbor joining
(1) create a (leaf) vertex for each index a (ranging over the languages in the given family)
(2) given the distance matrix D = (D_ab) of distances between each pair, D_ab = d_H(L_a, L_b), construct a new matrix Q = (Q_ab) with
  Q_ab = (n − 2) D_ab − ∑_{k=1}^n D_ak − ∑_{k=1}^n D_bk;
this matrix Q decides the first pairs of vertices to join
(3) identify the entries Q_ab with the lowest values: join each such pair (a, b) of leaf vertices to a newly created vertex v_ab
(4) set the distances to the new vertex by
  d(a, v_ab) = (1/2) D_ab + (1/(2(n − 2))) (∑_{k=1}^n D_ak − ∑_{k=1}^n D_bk)
  d(b, v_ab) = D_ab − d(a, v_ab)
  d(k, v_ab) = (1/2)(D_ak + D_bk − D_ab)
(5) remove a and b, keep v_ab and all the remaining vertices with the new distances, compute the new Q matrix, and repeat until the tree is completed

Example of a neighbor-joining lexical linguistic phylogenetic tree: see the Delmestri–Cristianini paper.
• N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol. Vol. 4 (1987) N.4, 406–425.
• R. Mihaescu, D. Levy, L. Pachter, Why neighbor-joining works, arXiv:cs/0602041v3
• A. Delmestri, N. Cristianini, Linguistic Phylogenetic Inference by PAM-like Matrices, Journal of Quantitative Linguistics, Vol. 19 (2012) N.2, 95–120.
• F. Petroni, M. Serva, Language distance and tree reconstruction, J. Stat. Mech. (2008) P08012

Syntactic Phylogenetic Trees (instead of lexical)
• instead of coding lexical data based on cognate words, use the binary variables of the syntactic parameters
• Hamming distance between the binary strings of parameter values
• it was recently shown that one gets an accurate reconstruction of the phylogenetic tree of the Indo-European languages from syntactic parameters alone
• G. Longobardi, C. Guardiano, G. Silvestri, A. Boattini, A. Ceolin, Towards a syntactic phylogeny of modern Indo-European languages, Journal of Historical Linguistics 3 (2013) N.1, 122–152.
• G. Longobardi, C. Guardiano, Evidence for syntax as a signal of historical relatedness, Lingua 119 (2009) 1679–1706.

Work in Progress
• Sharjeel Aziz, Vy-Luan Huynh, David Warrick, Matilde Marcolli, Syntactic Phylogenetic Trees, in preparation ... coming soon to an arXiv near you
– Assembled a phylogenetic tree of world languages using the SSWL database of syntactic parameters
– Ongoing comparison with specific historical-linguistic reconstructions of phylogenetic trees
– Comparison with computational-linguistic reconstructions based on lexical data (Swadesh lists) and on phonetic analysis
– not all linguistic families have their syntactic parameters mapped with the same level of completeness ... different levels of accuracy in the reconstruction
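The neighbor-joining steps (2)–(4) above can be sketched directly from the formulas. The 4×4 distance matrix below is an illustrative example, not linguistic data:

```python
# Sketch of one neighbor-joining pass: build the Q-matrix from a distance
# matrix, pick the pair minimizing Q, and compute branch lengths to the
# new internal vertex v_ab (formulas as in steps (2) and (4) above).

def q_matrix(D):
    n = len(D)
    row = [sum(D[a]) for a in range(n)]
    return [[(n - 2) * D[a][b] - row[a] - row[b] if a != b else 0.0
             for b in range(n)] for a in range(n)]

def best_pair(D):
    Q = q_matrix(D)
    n = len(D)
    return min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda ab: Q[ab[0]][ab[1]])

# illustrative distance matrix between four hypothetical languages
D = [[0, 5, 9, 9],
     [5, 0, 10, 10],
     [9, 10, 0, 8],
     [9, 10, 8, 0]]
a, b = best_pair(D)
n = len(D)
row = [sum(D[i]) for i in range(n)]
d_a = 0.5 * D[a][b] + (row[a] - row[b]) / (2 * (n - 2))  # d(a, v_ab)
d_b = D[a][b] - d_a                                      # d(b, v_ab)
```

Iterating this join, with the updated distances d(k, v_ab) of step (4), yields the full unrooted tree of step (5).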
Random Geometry/Homology (chaired by Laurent Decreusefond/Frédéric Chazal)
Let m be a random tessellation in R^d, d ≥ 1, observed in the window W_ρ = ρ^{1/d}[0, 1]^d, ρ > 0, and let f be a geometrical characteristic. We investigate the asymptotic behaviour of the maximum of f(C) over all cells C ∈ m with nucleus in W_ρ as ρ goes to infinity. When the normalized maximum converges, we show that its asymptotic distribution depends on the so-called extremal index. Two examples of extremal indices are provided, for Poisson–Voronoi and Poisson–Delaunay tessellations.

The extremal index for a random tessellation
Nicolas Chenavier, Université du Littoral Côte d'Opale
October 28, 2015

Plan: 1. Random tessellations 2. Main problem 3. Extremal index

Random tessellations
Definition: a (convex) random tessellation m in R^d is a partition of the Euclidean space into random polytopes (called cells). We will only consider the particular cases where m is a Poisson–Voronoi or a Poisson–Delaunay tessellation.

Poisson–Voronoi tessellation
• X: a Poisson point process in R^d
• ∀x ∈ X, C_X(x) := {y ∈ R^d : ‖y − x‖ ≤ ‖y − x′‖ ∀x′ ∈ X} (the Voronoi cell with nucleus x)
• m_PVT := {C_X(x), x ∈ X}: the Poisson–Voronoi tessellation
• ∀C_X(x) ∈ m_PVT, we let z(C_X(x)) := x
Figure: a Poisson–Voronoi tessellation.

Poisson–Delaunay tessellation
• X: a Poisson point process in R^d
• ∀x, x′ ∈ X, x and x′ define an edge if C_X(x) ∩ C_X(x′) ≠ ∅
• m_PDT: the Poisson–Delaunay tessellation
• ∀C ∈ m_PDT, we let z(C) be the circumcenter of C
Figure: a Poisson–Delaunay tessellation.

Typical cell
Definition: let m be a stationary random tessellation.
The typical cell of m is a random polytope C in R^d whose distribution is given as follows: for each bounded translation-invariant function g : {polytopes} → R, we have
  E[g(C)] := (1/N(B)) E[ ∑_{C ∈ m, z(C) ∈ B} g(C) ],
where:
• B ⊂ R^d is any Borel subset with finite and non-empty volume;
• N(B) is the mean number of cells with nucleus in B.

Main problem
Framework:
• m = m_PVT or m_PDT;
• W_ρ := [0, ρ]^d, with ρ > 0;
• g : {polytopes} → R, a geometrical characteristic.
Aim: the asymptotic behaviour, as ρ → ∞, of
  M_{g,ρ} = max_{C ∈ m, z(C) ∈ W_ρ} g(C)?
Figure: the Voronoi cell maximizing the area in the square.

Objective and applications
Objective: find a_{g,ρ} > 0 and b_{g,ρ} ∈ R such that P(M_{g,ρ} ≤ a_{g,ρ} t + b_{g,ρ}) converges, as ρ → ∞, for each t ∈ R.
Applications:
• regularity of the tessellation;
• discrimination of point processes and tessellations;
• Poisson–Voronoi approximation.
Figure: Poisson–Voronoi approximation.

Asymptotics under a local correlation condition
Notation: let v_ρ := a_{g,ρ} t + b_{g,ρ} be a threshold such that
  ρ^d · P(g(C) > v_ρ) → τ as ρ → ∞, for some τ := τ(t) ≥ 0.
Local Correlation Condition (LCC):
  (ρ^d / (log ρ)^d) · E[ ∑_{(C_1,C_2) ∈ m²_{≠}, z(C_1), z(C_2) ∈ [0, log ρ]^d} 1_{g(C_1) > v_ρ, g(C_2) > v_ρ} ] → 0 as ρ → ∞.
Theorem: under (LCC), we have P(M_{g,ρ} ≤ v_ρ) → e^{−τ} as ρ → ∞.
Extremal index
Definition of the extremal index
Proposition: assume that, for all τ ≥ 0, there exists a threshold v_ρ^{(τ)} depending on ρ such that ρ^d · P(g(C) > v_ρ^{(τ)}) → τ as ρ → ∞. Then there exists θ ∈ [0, 1] such that, for all τ ≥ 0,
  lim_{ρ→∞} P(M_{g,ρ} ≤ v_ρ^{(τ)}) = e^{−θτ},
provided that the limit exists.
Definition: following Leadbetter, we say that θ ∈ [0, 1] is the extremal index if, for each τ ≥ 0, we have
  ρ^d · P(g(C) > v_ρ^{(τ)}) → τ and lim_{ρ→∞} P(M_{g,ρ} ≤ v_ρ^{(τ)}) = e^{−θτ}.

Example 1
Framework:
• m := m_PVT, the Poisson–Voronoi tessellation;
• g(C) := r(C), the inradius of a cell C := C_X(x) with x ∈ X, i.e. r(C) := r(C_X(x)) := max{r ∈ R_+ : B(x, r) ⊂ C_X(x)};
• r_{min,PVT}(ρ) := min_{x ∈ X ∩ W_ρ} r(C_X(x)).
Extremal index: θ = 1/2 for every d ≥ 1.
Figure: a typical Poisson–Voronoi cell with a small inradius.

Example 2
Framework:
• m := m_PDT, the Poisson–Delaunay tessellation;
• g(C) := R(C), the circumradius of a cell C, i.e. R(C) := min{r ∈ R_+ : B(x, r) ⊃ C};
• R_{max,PDT}(ρ) := max_{C ∈ m_PDT : z(C) ∈ W_ρ} R(C).
Extremal index: θ = 1, 1/2, 35/128 for d = 1, 2, 3 respectively.
Figure: a typical Poisson–Delaunay cell with a large circumradius.

Work in progress
Joint work with C. Robert (ISFA, Lyon 1):
• a new characterization of the extremal index (not based on the classical block and run estimators appearing in classical Extreme Value Theory);
• simulation and estimation of the extremal index and of the cluster-size distribution (for Poisson–Voronoi and Poisson–Delaunay tessellations).
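In the simplest case d = 1, the objects of Example 1 are easy to simulate: the Voronoi cell of a nucleus is the interval between the midpoints to its two neighbours, so its inradius is half the smaller gap. The following is a toy illustration of the quantities r(C_X(x)) and r_{min,PVT}(ρ), not a computation of the extremal index:

```python
# Hedged sketch: inradii of the cells of a 1D Poisson-Voronoi tessellation
# whose nuclei lie in the window [0, rho], and their minimum.
import random

def voronoi_inradii_1d(rho, intensity=1.0, seed=0):
    rng = random.Random(seed)
    pts, x = [], -10.0                   # pad the window to avoid edge effects
    while x < rho + 10.0:
        x += rng.expovariate(intensity)  # Poisson process: i.i.d. exp. gaps
        pts.append(x)
    # cell of nucleus m is [(l+m)/2, (m+r)/2]; inradius = half the smaller gap
    return [0.5 * min(m - l, r - m)
            for l, m, r in zip(pts, pts[1:], pts[2:])
            if 0.0 <= m <= rho]

radii = voronoi_inradii_1d(rho=1000.0)
r_min = min(radii)            # the quantity r_min,PVT(rho) of Example 1
```

Small inradii occur in pairs of adjacent cells sharing a short gap, which is the clustering behind an extremal index below 1.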
A model of two-type (or two-color) interacting random balls is introduced. Each colored random set is a union of random balls, and the interaction relies on the volume of the intersection between the two random sets. This model is motivated by the detection and quantification of colocalization between two proteins. Simulation and inference are discussed. Since not all individual balls can be identified (e.g. a ball may contain another one), standard methods of inference such as likelihood or pseudo-likelihood are not available, and we apply the Takacs–Fiksel method with a specific choice of test functions.

A two-color interacting random balls model for colocalization analysis of proteins
Frédéric Lavancier, Laboratoire de Mathématiques Jean Leray, Nantes; INRIA Rennes, Serpico team
Joint work with C. Kervrann (INRIA Rennes, Serpico team). GSI'15, 28–30 October 2015.

Introduction: some data
Vesicular trafficking analysis and colocalization quantification by TIRF microscopy (1 px = 100 nanometers) [SERPICO team, INRIA]: Langerin proteins (left) and Rab11 GTPase proteins (right). Is there colocalization? ⇔ Are there spatial dependencies between the two types of proteins?

Image preprocessing: after segmentation, then after a Gaussian-weights thresholding.

The problem of colocalization can be described as follows. We observe two binary images in a domain Ω:
• first image (green): a realization of a random set Γ1 ∩ Ω;
• second image (red): a realization of a random set Γ2 ∩ Ω.
→ Are there dependencies between Γ1 and Γ2?
→ If so, can we quantify/model this dependency?

Outline: 1. A testing procedure 2. A model for colocalization 3. Estimation problem

Testing procedure
Let o ∈ R^d be a generic point, and let p1 = P(o ∈ Γ1), p2 = P(o ∈ Γ2), p12 = P(o ∈ Γ1 ∩ Γ2). If Γ1 and Γ2 are independent, then p12 = p1 p2.
A natural measure of departure from independence is p̂12 − p̂1 p̂2, where
  p̂1 = |Ω|^{−1} ∑_{x∈Ω} 1_{Γ1}(x),  p̂2 = |Ω|^{−1} ∑_{x∈Ω} 1_{Γ2}(x),  p̂12 = |Ω|^{−1} ∑_{x∈Ω} 1_{Γ1∩Γ2}(x).

Assume Γ1 and Γ2 are m-dependent stationary random sets. If Γ1 is independent of Γ2, then, as Ω tends to infinity,
  T := |Ω| (p̂12 − p̂1 p̂2) / ( ∑_{x∈Ω} ∑_{y∈Ω} Ĉ1(x − y) Ĉ2(x − y) )^{1/2} → N(0, 1),
where Ĉ1 and Ĉ2 are the empirical covariance functions of Γ1 ∩ Ω and Γ2 ∩ Ω respectively. Hence, to test the null hypothesis of independence between Γ1 and Γ2:
  p-value = 2(1 − Φ(|T|)),
where Φ is the c.d.f. of the standard normal distribution.

Some simulations (Γ1 and Γ2 are unions of random balls):
• independent case (each color ∼ Poisson): number of p-values < 0.05 over 100 realizations: 4;
• dependent case (see the model below): number of p-values < 0.05 over 100 realizations: 100;
• independent case, larger radii: number of p-values < 0.05 over 100 realizations: 5;
• dependent case, larger radii and "small" dependence: number of p-values < 0.05 over 100 realizations: 97.

Real data: depending on the preprocessing, T = 9.9 or T = 17, with p-value = 0 in both cases.

A model for colocalization
We view each set Γ1 and Γ2 as a union of random balls, and we model the superposition of the two images, i.e. Γ1 ∪ Γ2.
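The departure measure p̂12 − p̂1 p̂2 can be sketched on toy binary images; the covariance normalization of the full statistic T is omitted here:

```python
# Sketch of the testing idea: estimate p1, p2, p12 from two equal-size
# binary images and measure the departure p12_hat - p1_hat * p2_hat.
# (The denominator of T, built from empirical covariances, is omitted.)

def coloc_departure(img1, img2):
    """img1, img2: 2D lists of 0/1 (pixel belongs to Gamma_i or not)."""
    n = sum(len(row) for row in img1)
    p1 = sum(map(sum, img1)) / n
    p2 = sum(map(sum, img2)) / n
    p12 = sum(a * b for r1, r2 in zip(img1, img2)
              for a, b in zip(r1, r2)) / n
    return p12 - p1 * p2

# perfectly overlapping sets: strong positive departure
g1 = [[1, 1, 0, 0],
      [1, 1, 0, 0]]
d_pos = coloc_departure(g1, g1)   # 0.5 - 0.5*0.5 = 0.25
# disjoint sets: negative departure
g2 = [[0, 0, 1, 1],
      [0, 0, 1, 1]]
d_neg = coloc_departure(g1, g2)   # 0 - 0.25 = -0.25
```

A value near zero is consistent with independence; the normalization by the covariance sums is what turns this raw departure into an asymptotically standard normal statistic.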
The reference model is a two-type (two-color) Boolean model with equiprobable marks, where the radii follow some distribution µ on [R_min, R_max].
Notation:
• (ξ, R)_i: the ball centered at ξ with radius R and color i ∈ {1, 2}, viewed as a marked point (marked by R and i);
• x_i: the collection of all marked points with color i, so that Γ_i = ∪_{(ξ,R)_i ∈ x_i} B(ξ, R);
• x = x_1 ∪ x_2: the collection of all marked points.
Example: three realizations of the reference process.

The model
We consider a density on any bounded domain Ω with respect to the reference model,
  f(x) ∝ z1^{n1} z2^{n2} e^{θ |Γ1 ∩ Γ2|},
where n1 is the number of green balls and n2 the number of red balls. This density depends on 3 parameters:
• z1: rules the mean number of green balls;
• z2: rules the mean number of red balls;
• θ: interaction parameter; if θ > 0: attraction (colocalization) between Γ1 and Γ2; if θ = 0: back to the reference model, up to the intensities (independence between Γ1 and Γ2).

Simulation: realizations can be generated by a standard birth–death Metropolis–Hastings algorithm.

Estimation problem
Aim: assume that the law µ of the radii is known.
Given a realization of Γ1 ∪ Γ2 on Ω, estimate z1, z2 and θ in
  f(x) = (1 / c(z1, z2, θ)) z1^{n1} z2^{n2} e^{θ |Γ1 ∩ Γ2|},
where c(z1, z2, θ) is the normalizing constant.
Issue: the numbers of balls n1 and n2 are not observed, so likelihood- or pseudo-likelihood-based inference is not feasible.

An equilibrium equation
Consider, for any non-negative function h,
  C(z1, z2, θ; h) = S(h) − z1 I1(θ; h) − z2 I2(θ; h),
where
  S(h) = ∑_{(ξ,R) ∈ x, ξ ∈ Ω} h((ξ, R), x \ (ξ, R))
and, for i = 1, 2,
  I_i(θ; h) = ∫_{R_min}^{R_max} ∫_Ω h((ξ, R)_i, x) (λ((ξ, R)_i, x) / (2 z_i)) dξ µ(dR),
with λ the conditional intensity. Denoting by z1*, z2* and θ* the true unknown values of the parameters, we know from the Georgii–Nguyen–Zessin equation that, for any h,
  E(C(z1*, z2*, θ*; h)) = 0.

Takacs–Fiksel estimator
Given K test functions (h_k)_{1≤k≤K}, the Takacs–Fiksel estimator is defined by
  (ẑ1, ẑ2, θ̂) := argmin_{z1,z2,θ} ∑_{k=1}^K C(z1, z2, θ; h_k)².   (1)
Consistency and asymptotic normality are studied in Coeurjolly et al. 2012.
To be able to compute (1), we must find test functions h_k such that S(h_k) is computable. How many? At least K = 3, because there are three parameters to estimate.

A first possibility:

h1((ξ, R)_i, x) = Length(S(ξ, R) ∩ (Γ1)^c) 1{i=1},

where S(ξ, R) is the sphere {y : ‖y − ξ‖ = R}. What about S(h1) = Σ_{(ξ,R) ∈ x, ξ ∈ Ω} h1((ξ, R), x \ (ξ, R))? Each term is the length of the part of the sphere S(ξ, R) left uncovered by the other green balls, so summing over all green balls gives S(h1) = P(Γ1), the perimeter of Γ1. Hence the Takacs-Fiksel contrast function C(z1, z2, θ; h1) is computable.

Similarly, let h2((ξ, R)_i, x) = Length(S(ξ, R) ∩ (Γ2)^c) 1{i=2}; then S(h2) = P(Γ2).
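In the plane, the quantity Length(S(ξ, R) ∩ Γ^c) appearing in these test functions can be estimated by sampling points on the circle and checking coverage. A minimal Monte Carlo sketch (function name illustrative):

```python
import numpy as np

def uncovered_length(centre, radius, balls, n_points=2000):
    """Monte Carlo estimate of Length(S(xi, R) ∩ Gamma^c): sample points
    on the circle S(centre, radius) and measure the fraction covered by
    none of the balls, given as (x, y, r) triples."""
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    pts = np.asarray(centre, float) + radius * np.column_stack([np.cos(t), np.sin(t)])
    covered = np.zeros(n_points, dtype=bool)
    for bx, by, br in balls:
        covered |= (pts[:, 0] - bx) ** 2 + (pts[:, 1] - by) ** 2 <= br ** 2
    # Uncovered fraction times the circumference of the circle.
    return 2.0 * np.pi * radius * float(np.mean(~covered))
```

Summing this quantity over all balls of one colour gives the Monte Carlo counterpart of S(h1) = P(Γ1).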
Finally, let h3((ξ, R)_i, x) = Length(S(ξ, R) ∩ (Γ1 ∪ Γ2)^c); then S(h3) = P(Γ1 ∪ Γ2).

Simulations with test functions h1, h2 and h3 over 100 realizations. [Figure: histograms of the estimates of θ for θ = 0.2 (small radii) and θ = 0.05 (large radii).]

Real data. We assume the law of the radii is uniform on [Rmin, Rmax]; each image is embedded in [0, 250] × [0, 280]. With Rmin = 0.5 and Rmax = 2.5 we obtain θ̂ = 0.45; with Rmin = 0.5 and Rmax = 10 we obtain θ̂ = 0.03.

Conclusion. The testing procedure detects colocalization between two binary images, is easy and fast to implement, and does not depend too much on the image preprocessing. The model for colocalization relies on geometric features (the area of intersection), can be fitted by the Takacs-Fiksel method, and allows one to compare the degree of colocalization θ between two pairs of images when the laws of the radii are similar.
The characteristic independence property of Poisson point processes gives an intuitive way to explain why a sequence of point processes becoming less and less repulsive can converge to a Poisson point process. The aim of this paper is to show this convergence for sequences built by superposing, thinning or rescaling determinantal processes. We use Papangelou intensities and Stein’s method to prove this result with a topology based on total variation distance.

I. Generalities on point processes — II. Kantorovich-Rubinstein distance — III. Applications

2nd conference on Geometric Science of Information
Aurélien Vasseur (Télécom ParisTech), Asymptotics of some Point Processes Transformations
École Polytechnique, Paris-Saclay, October 28, 2015

Motivation: mobile network in Paris. [Figure: on the left, positions of all base stations in Paris; on the right, locations of the base stations for one frequency band.]

Table of contents.
I. Generalities on point processes: correlation function, Papangelou intensity and repulsiveness; determinantal point processes.
II. Kantorovich-Rubinstein distance: convergence defined by d_KR; d_KR(PPP, Φ) ≤ "nice" upper bound.
III. Applications to transformations of point processes: superposition, thinning, rescaling.

Framework. Y is a locally compact metric space; µ is a diffuse and locally finite reference measure on Y; N_Y is the space of configurations on Y and N̂_Y the space of finite configurations on Y.

Correlation function and Papangelou intensity. The correlation function ρ of a point process Φ satisfies

E[Σ_{α ∈ N̂_Y, α ⊂ Φ} f(α)] = Σ_{k=0}^{+∞} (1/k!) ∫_{Y^k} f · ρ({x1, ..., xk}) µ(dx1) ... µ(dxk);

ρ(α) ≈ probability of finding a point at each point of α. The Papangelou intensity c of a point process Φ satisfies

E[Σ_{x ∈ Φ} f(x, Φ \ {x})] = ∫_Y E[c(x, Φ) f(x, Φ)] µ(dx);

c(x, ξ) ≈ conditional probability of finding a point at x given the configuration ξ.

Properties. The intensity measure is A ∈ F_Y ↦ ∫_A ρ({x}) µ(dx), and ρ({x}) = E[c(x, Φ)]. If Φ is finite, then

P(|Φ| = 1) = ∫_Y c(x, ∅) µ(dx) · P(|Φ| = 0).

Poisson point process. For Φ a PPP with intensity M(dy) = m(y) dy, the correlation function is ρ(α) = Π_{x ∈ α} m(x) and the Papangelou intensity is c(x, ξ) = m(x).

Repulsive point process. Definition: a point process is repulsive if φ ⊂ ξ ⟹ c(x, ξ) ≤ c(x, φ), and weakly repulsive if c(x, ξ) ≤ c(x, ∅).

Determinantal point process. Definition: a determinantal point process DPP(K, µ) has correlation functions ρ({x1, ..., xk}) = det(K(x_i, x_j), 1 ≤ i, j ≤ k).

Proposition. The Papangelou intensity of DPP(K, µ) is

c(x0, {x1, ..., xk}) = det(J(x_i, x_j), 0 ≤ i, j ≤ k) / det(J(x_i, x_j), 1 ≤ i, j ≤ k),

where J = (I − K)^{-1} K.
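On a finite ground set, where K is a matrix with spectrum in [0, 1), this determinant ratio can be evaluated directly. A minimal numpy sketch (the kernel matrix is an illustrative example, not from the talk):

```python
import numpy as np

def papangelou_dpp(J, new, conf):
    """Papangelou intensity c(x0, {x1..xk}) of a DPP on a finite ground
    set, given J = (I - K)^{-1} K: the determinant of J restricted to
    {x0} ∪ conf divided by its restriction to conf. `new` is the index
    of x0, `conf` a list of indices of the conditioning points."""
    top = np.ix_([new] + list(conf), [new] + list(conf))
    num = np.linalg.det(J[top])
    den = np.linalg.det(J[np.ix_(list(conf), list(conf))]) if conf else 1.0
    return num / den

# Symmetric kernel with eigenvalues in (0, 1), so J is well defined:
K = np.array([[0.5, 0.2, 0.0],
              [0.2, 0.5, 0.1],
              [0.0, 0.1, 0.4]])
J = np.linalg.inv(np.eye(3) - K) @ K
```

By construction c(x0, ∅) equals the diagonal entry J[x0, x0], and enlarging the conditioning configuration can only decrease the intensity, which is exactly the repulsiveness property defined above.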
Ginibre point process. Definition: the Ginibre point process on B(0, R) has kernel

K(x, y) = (1/π) e^{−(|x|² + |y|²)/2} e^{x ȳ} 1{x ∈ B(0,R)} 1{y ∈ B(0,R)},

and the β-Ginibre point process on B(0, R) has kernel

K_β(x, y) = (1/π) e^{−(|x|² + |y|²)/(2β)} e^{x ȳ / β} 1{x ∈ B(0,R)} 1{y ∈ B(0,R)}.

[Figure: realizations of β-Ginibre point processes.]

Kantorovich-Rubinstein distance. Total variation distance:

d_TV(ν1, ν2) := sup_{A ∈ F_Y, ν1(A), ν2(A) < ∞} |ν1(A) − ν2(A)|.

F : N_Y → R is 1-Lipschitz (F ∈ Lip1) if |F(φ1) − F(φ2)| ≤ d_TV(φ1, φ2) for all φ1, φ2 ∈ N_Y. The Kantorovich-Rubinstein distance is

d_KR(P1, P2) = sup_{F ∈ Lip1} |∫_{N_Y} F(φ) P1(dφ) − ∫_{N_Y} F(φ) P2(dφ)|.

Convergence in Kantorovich-Rubinstein distance implies, strictly, convergence in law.

Upper bound theorem. Theorem (L. Decreusefond, A. Vasseur). Let Φ be a finite point process on Y and ζ_M a PPP with finite control measure M(dy) = m(y) µ(dy). Then

d_KR(P_Φ, P_{ζ_M}) ≤ ∫_Y ∫_{N_Y} |m(y) − c(y, φ)| P_Φ(dφ) µ(dy).

Superposition of weakly repulsive point processes. Let Φ_{n,1}, ..., Φ_{n,n} be n independent, finite and weakly repulsive point processes on Y; set Φ_n := ∪_{i=1}^n Φ_{n,i} and R_n := ∫_Y |Σ_{i=1}^n ρ_{n,i}(x) − m(x)| µ(dx), and let ζ_M be a PPP with control measure M(dx) = m(x) µ(dx).

Proposition (LD, AV).

d_KR(P_{Φ_n}, P_{ζ_M}) ≤ R_n + max_{1≤i≤n} ∫_Y ρ_{n,i}(x) µ(dx).

Consequence. Corollary (LD, AV). Let f be a pdf on [0, 1] such that f(0+) := lim_{x→0+} f(x) ∈ R, let Λ be a compact subset of R+, let X1, ..., Xn be i.i.d. with pdf f_n = (1/n) f(·/n), and let Φ_n = {X1, ..., Xn} ∩ Λ. Then

d_KR(Φ_n, ζ) ≤ ∫_Λ |f(x/n) − f(0+)| dx + (1/n) ∫_Λ f(x/n) dx,

where ζ is the PPP(f(0+)) restricted to Λ.

β-Ginibre point processes. Proposition (LD, AV). Let Φ_n be the β_n-Ginibre process restricted to a compact set Λ and ζ the PPP with intensity 1/π on Λ. Then d_KR(P_{Φ_n}, P_ζ) ≤ C β_n.

Kallenberg's theorem. Theorem (O. Kallenberg). Let Φ_n be a finite point process on Y, let p_n : Y → [0, 1) converge uniformly to 0, let Φ'_n be the p_n-thinning of Φ_n, and let γ_M be a Cox process directed by M. Then (p_n Φ_n) → M in law if and only if (Φ'_n) → γ_M in law.

Polish distance. Let (f_n) be a sequence in the space of real continuous functions with compact support generating F_Y, and define

d*(ν1, ν2) = Σ_{n≥1} 2^{−n} Ψ(|ν1(f_n) − ν2(f_n)|), with Ψ(x) = x / (1 + x),

and let d*_KR be the Kantorovich-Rubinstein distance associated with the distance d*.

Thinned point processes. Proposition (LD, AV). Let Φ_n be a finite point process on Y, p_n : Y → [0, 1), Φ'_n the p_n-thinning of Φ_n, and γ_M a Cox process. Then

d*_KR(P_{Φ'_n}, P_{γ_M}) ≤ 2 E[Σ_{x ∈ Φ_n} p_n²(x)] + d*_KR(P_M, P_{p_n Φ_n}).

References.
L. Decreusefond and A. Vasseur, Asymptotics of superposition of point processes, 2015.
H.-O. Georgii and H. J. Yoo, Conditional intensity and Gibbsianness of determinantal point processes, J. Statist. Phys. 118, 2004.
J.-S. Gomez, A. Vasseur, A. Vergne, L. Decreusefond, P. Martins, and W. Chen, A case study on regularity in cellular network deployment, IEEE Wireless Communications Letters, 2015.
A. F. Karr, Point processes and their statistical inference, Ann. Probab. 15 (1987), no. 3, 1226-1227.

Thank you for your attention. Questions?
Random polytopes have constituted some of the central objects of stochastic geometry for more than 150 years. They are in general generated as convex hulls of a random set of points in Euclidean space. The study of such models requires ingredients coming from both convex geometry and probability theory. In the last decades, the study has focused on their asymptotic properties, in particular expectation and variance estimates. In several joint works with Tomasz Schreiber and J. E. Yukich, we have investigated the scaling limit of several models (uniform model in the unit ball, uniform model in a smooth convex body, Gaussian model) and have deduced from it limiting variances for several geometric characteristics, including the number of k-dimensional faces and the volume. In this paper, we survey the most recent advances on these questions and emphasize the particular cases of random polytopes in the unit ball and Gaussian polytopes.

Asymptotic properties of random polytopes
Pierre Calka
2nd conference on Geometric Science of Information
École Polytechnique, Paris-Saclay, 28 October 2015

Outline: Random polytopes: an overview — Main results: variance asymptotics — Sketch of proof: Gaussian case.
Joint work with Joseph Yukich (Lehigh University, USA) & Tomasz Schreiber (Toruń University, Poland).

Random polytopes: an overview.

Uniform polytopes, binomial model. K := convex body of R^d; (X_k, k ∈ N*) := independent points uniformly distributed in K; K_n := Conv(X1, ..., X_n), n ≥ 1. [Figures: K50, K100, K500 for K a ball and for K a square.]

Uniform polytopes, Poissonian model. K := convex body of R^d; P_λ, λ > 0 := Poisson point process of intensity measure λ dx; K_λ := Conv(P_λ ∩ K). [Figure: K500 for K a ball and for K a square.]

Gaussian polytopes. Binomial model: Φ_d(x) := (2π)^{−d/2} e^{−‖x‖²/2}, x ∈ R^d, d ≥ 2; (X_k, k ∈ N*) independent with density Φ_d; K_n := Conv(X1, ..., X_n). Poissonian model: P_λ, λ > 0 := Poisson point process of intensity measure λ Φ_d(x) dx; K_λ := Conv(P_λ). [Figures: K50, K100, K500; the Gaussian polytope is asymptotically spherical.]

Asymptotic spherical shape of the Gaussian polytope. Geffroy (1961): d_H(K_n, B(0, √(2 log n))) → 0 a.s. as n → ∞. [Figure: K50000.]

Expectation asymptotics. Considered functionals: f_k(·) := number of k-dimensional faces, 0 ≤ k ≤ d; Vol(·) := volume.

Efron's relation (B. Efron, 1965): E f_0(K_n) = n (1 − E Vol(K_{n−1}) / Vol(K)).

Uniform polytope, K smooth: E[f_k(K_λ)] ∼ c_{d,k} (∫_{∂K} κ_s^{1/(d+1)} ds) λ^{(d−1)/(d+1)} as λ → ∞, where κ_s is the Gaussian curvature of ∂K.
Uniform polytope, K polytope: E[f_k(K_λ)] ∼ c'_{d,k} F(K) log^{d−1}(λ), where F(K) is the number of flags of K.
Gaussian polytope: E[f_k(K_λ)] ∼ c''_{d,k} log^{(d−1)/2}(λ).
A. Rényi & R. Sulanke (1963), H. Raynaud (1970), R. Schneider & J. Wieacker (1978), F. Affentranger & R. Schneider (1992).

Main results: variance asymptotics.

Uniform model, K smooth. K := convex body of R^d with volume 1 and C³ boundary; κ := Gaussian curvature of ∂K. Then

lim_{λ→∞} λ^{−(d−1)/(d+1)} Var[f_k(K_λ)] = c_{k,d} ∫_{∂K} κ(z)^{1/(d+1)} dz,
lim_{λ→∞} λ^{(d+3)/(d+1)} Var[Vol(K_λ)] = c'_d ∫_{∂K} κ(z)^{1/(d+1)} dz

(c_{k,d}, c'_d explicit positive constants). M. Reitzner (2005): Var[f_k(K_λ)] = Θ(λ^{(d−1)/(d+1)}).

Uniform model, K polytope. K := simple polytope of R^d with volume 1, i.e. each vertex of K is included in exactly d facets. Then

lim_{λ→∞} log^{−(d−1)}(λ) Var[f_k(K_λ)] = c_{d,k} f_0(K),
lim_{λ→∞} λ² log^{−(d−1)}(λ) Var[Vol(K_λ)] = c'_{d,k} f_0(K)

(c_{d,k}, c'_{d,k} explicit positive constants). I. Bárány & M. Reitzner (2010): Var[f_k(K_λ)] = Θ(log^{d−1}(λ)).

Gaussian model.

lim_{λ→∞} log^{−(d−1)/2}(λ) Var[f_k(K_λ)] = c_{k,d},
lim_{λ→∞} log^{−k+(d+3)/2}(λ) Var[Vol(K_λ)] = c'_{k,d},
E[Vol(K_λ)] / Vol(B(0, √(2 log λ))) = 1 − d log(log λ) / (4 log λ) + O(1/log λ) as λ → ∞

(c_{k,d}, c'_{k,d} explicit positive constants). D. Hug & M. Reitzner (2005), I. Bárány & V. Vu (2007): Var[f_k(K_λ)] = Θ(log^{(d−1)/2}(λ)).

Sketch of proof: Gaussian case.

1. Decomposition: E[f_k(K_λ)] = E[Σ_{x ∈ P_λ} ξ(x, P_λ)], where the score is ξ(x, P_λ) := (1/(k+1)) · #{k-faces containing x} if x is extreme, and 0 otherwise.

2. Mecke-Slivnyak formula: E[f_k(K_λ)] = λ ∫ E[ξ(x, P_λ ∪ {x})] Φ_d(x) dx.

3. Limit of the expectation of one score.

Calculation of the variance of f_k(K_λ):

Var[f_k(K_λ)] = E[Σ_{x ∈ P_λ} ξ²(x, P_λ)] + E[Σ_{x ≠ y ∈ P_λ} ξ(x, P_λ) ξ(y, P_λ)] − (E[f_k(K_λ)])²
= λ ∫ E[ξ²(x, P_λ ∪ {x})] Φ_d(x) dx
  + λ² ∫∫ (E[ξ(x, P_λ ∪ {x, y}) ξ(y, P_λ ∪ {x, y})] − E[ξ(x, P_λ ∪ {x})] E[ξ(y, P_λ ∪ {y})]) Φ_d(x) Φ_d(y) dx dy
= λ ∫ E[ξ²(x, P_λ ∪ {x})] Φ_d(x) dx + λ² ∫∫ "Cov"(ξ(x, P_λ ∪ {x}), ξ(y, P_λ ∪ {y})) Φ_d(x) Φ_d(y) dx dy.

Scaling transform. Question: what are the limits of E[ξ(x, P_λ)] and "Cov"(ξ(x, P_λ), ξ(y, P_λ))? Answer: define limit scores in a new space.

Critical radius: R_λ := √(2 log λ − log(2 (2π)^d log λ)).
Scaling transform: T_λ : R^d \ {0} → R^{d−1} × R,

x ↦ ( R_λ exp_{d−1}^{−1}(x/‖x‖), R_λ² (1 − ‖x‖/R_λ) ),

where exp_{d−1} : R^{d−1} ≃ T_{u0} S^{d−1} → S^{d−1} is the exponential map at u0 ∈ S^{d−1}.
Image of a score: ξ^{(λ)}(T_λ(x), T_λ(P_λ)) := ξ(x, P_λ).
Convergence of P_λ: T_λ(P_λ) →(D) P, where P is a Poisson point process in R^{d−1} × R with intensity measure e^h dv dh.

Action of the scaling transform. Let Π↑ := {(v, h) ∈ R^{d−1} × R : h ≥ ‖v‖²/2} and Π↓ := {(v, h) ∈ R^{d−1} × R : h ≤ −‖v‖²/2}. Correspondences: half-space ↔ translate of Π↓; sphere containing O ↔ translate of ∂Π↑; convexity ↔ parabolic convexity; extreme point ↔ (x + Π↑) not fully covered; k-face of K_λ ↔ parabolic k-face; R_λ Vol ↔ Vol.

Limiting picture. Ψ := ∪_{x ∈ P} (x + Π↑) [figure, in red: image of the balls of diameter [0, x] where x is extreme]; Φ := ∪_{x ∈ R^{d−1}×R : (x+Π↓) ∩ P = ∅} (x + Π↓) [figure, in green: image of the boundary of the convex hull K_λ].

Thank you for your attention!
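Efron's relation stated above is exact in expectation and easy to check numerically for the binomial model. A Monte Carlo sketch in dimension 2, with K the unit square so that Vol(K) = 1 (sample sizes and seed are arbitrary choices; requires scipy):

```python
import numpy as np
from scipy.spatial import ConvexHull

# Monte Carlo check of Efron's relation
#   E f0(K_n) = n (1 - E Vol(K_{n-1}) / Vol(K))
# for K the unit square, n = 20.
rng = np.random.default_rng(0)
n, trials = 20, 3000

f0 = np.mean([len(ConvexHull(rng.uniform(size=(n, 2))).vertices)
              for _ in range(trials)])
vol = np.mean([ConvexHull(rng.uniform(size=(n - 1, 2))).volume  # 2D: area
               for _ in range(trials)])

lhs, rhs = f0, n * (1.0 - vol)   # the two sides of Efron's relation
```

The two Monte Carlo estimates agree up to sampling noise, even at this small n, because the relation is an identity and not an asymptotic statement.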
Asymmetric information distances are used to define asymmetric norms and quasi-metrics on the statistical manifold and its dual space of random variables. The quasi-metric topology generated by the Kullback-Leibler (KL) divergence is considered as the main example, and some of its topological properties are investigated.

Asymmetric Topologies on Statistical Manifolds
Roman V. Belavkin
School of Science and Technology, Middlesex University, London NW4 4BT, UK
GSI2015, October 28, 2015

Outline: Sources and Consequences of Asymmetry — Method: Symmetric Sandwich — Results.

Sources and Consequences of Asymmetry

Asymmetric information distances. The Kullback-Leibler divergence D[p, q] = E_p{ln(p/q)} is additive over products,

D[p1 ⊗ p2, q1 ⊗ q2] = D[p1, q1] + D[p2, q2],

reflecting the isomorphism ln : (R+, ×) → (R, +). It is asymmetric in two senses:

D[p, q] ≠ D[q, p]  and  D[q + (p − q), q] ≠ D[q − (p − q), q].

It induces the asymmetric norm

‖p − q‖ = inf{α^{−1} > 0 : D[q + α(p − q), q] ≤ 1},

with the dual representation sup_x {E_{p−q}{x} : E_q{e^x − 1 − x} ≤ 1}.

Functional analysis in asymmetric spaces.

Theorem (e.g. Theorem 1.5 in Fletcher and Lindgren (1982)). Every topological space with a countable base is quasi-pseudometrizable.

An asymmetric seminormed space can be T0, but not T1 (and hence not Hausdorff T2). Dual quasi-metrics ρ(x, y) and ρ^{−1}(x, y) = ρ(y, x) induce two different topologies. There are 7 notions of Cauchy sequences: left (right) Cauchy, left (right) K-Cauchy, weakly left (right) K-Cauchy, Cauchy; this gives 14 notions of completeness (with respect to ρ or ρ^{−1}). Compactness is related to outer precompactness or precompactness, which are strictly weaker properties than total boundedness. An asymmetric seminormed space may fail to be a topological vector space, because y → αy can be discontinuous (Borodin, 2001).
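The two properties of the KL divergence stated above (additivity over products and asymmetry) are easy to verify numerically for finite distributions; a small sketch:

```python
import numpy as np

def kl(p, q):
    """KL divergence D[p, q] = E_p[ln(p/q)] for strictly positive
    finite distributions given as arrays summing to 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.8, 0.2])
q = np.array([0.5, 0.5])
d_pq, d_qp = kl(p, q), kl(q, p)              # asymmetry: the two differ
d_prod = kl(np.kron(p, p), np.kron(q, q))    # additivity: equals 2 D[p, q]
```

Here np.kron of two probability vectors is the product distribution p ⊗ p, so d_prod recovers the additivity identity exactly.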
Practically all other results have to be reconsidered (e.g. the Baire category theorem, Alaoglu-Bourbaki, etc.) (Cobzas, 2013).

Random variables as the source of asymmetry. Consider the polar set M° := {x : ⟨x, y⟩ ≤ 1 for all y ∈ M}, the Minkowski functional µ_{M°}(x) = inf{α > 0 : x/α ∈ M°}, and the support function s_M(x) = sup{⟨x, y⟩ : y ∈ M}. Here we take

M = {u : D[(1 + u)z, z] ≤ 1}, with D[(1 + u)z, z] = ⟨(1 + u) ln(1 + u) − u, z⟩,

and the polar set is

M° = {x : D*[x, 0] ≤ 1}, with D*[x, 0] = ⟨e^x − 1 − x, z⟩.

Examples. Example (St. Petersburg lottery): x = 2^n, q = 2^{−n}, n ∈ N.
Then E_q{x} = Σ_{n=1}^{∞} 2^n · 2^{−n} → ∞, yet E_p{x} < ∞ for every biased p = 2^{−(1+α)n}, α > 0. Moreover 2^n ∉ dom E_q{e^x} while −2^n ∈ dom E_q{e^x}, so 0 ∉ Int(dom E_q{e^x}).

Example (Error minimization): minimize x = (1/2)‖a − b‖_2² subject to D_KL[w, q ⊗ p] ≤ λ, a, b ∈ R^n. Then E_w{x} < ∞ is minimized at w ∝ e^{−βx} q ⊗ p, while the maximization of x has no solution: (1/2)‖a − b‖_2² ∉ dom E_{q⊗p}{e^x} but −(1/2)‖a − b‖_2² ∈ dom E_{q⊗p}{e^x}, so 0 ∉ Int(dom E_{q⊗p}{e^x}).
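The divergence of E_q{x} and the finiteness of E_p{x} under a biased law can be seen numerically from truncated sums (the truncation level and the value of α are arbitrary choices):

```python
import numpy as np

n = np.arange(1, 60)
x = 2.0 ** n                     # payoffs 2^n
q = 2.0 ** (-n)                  # fair lottery: each term x*q equals 1
partial_Eq = np.cumsum(x * q)    # partial sums 1, 2, 3, ... diverge linearly

alpha = 0.5
p = 2.0 ** (-(1 + alpha) * n)
p /= p.sum()                     # biased lottery, normalised
Ep = float(np.sum(x * p))        # converges: terms decay like 2^(-n/2)
```

The fair lottery's truncated expectation grows by exactly one per term, while the biased expectation stabilises at a finite value.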
Eq{x} = ∞ n=1(2n/2n) → ∞ Ep{x} < ∞ for all biased p = 2−(1+α)n, α > 0. 2n /∈ dom Eq{ex}, −2n ∈ dom Eq{ex} 0 /∈ Int(dom Eq{ex}) Example (Error minimization) Minimize x = 1 2 a − b 2 2 subject to DKL[w, q ⊗ p] ≤ λ, a, b ∈ Rn. Ew{x} < ∞ minimized at w ∝ e−βxq ⊗ p. Maximization of x has no solution. 1 2 a − b 2 2 /∈ dom Eq⊗p{ex}, −1 2 a − b 2 2 ∈ dom Eq⊗p{ex} Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 7 / 16 Sources and Consequences of Asymmetry Examples Example (St. Peterbourgh lottery) x = 2n, q = 2−n, n ∈ N. Eq{x} = ∞ n=1(2n/2n) → ∞ Ep{x} < ∞ for all biased p = 2−(1+α)n, α > 0. 2n /∈ dom Eq{ex}, −2n ∈ dom Eq{ex} 0 /∈ Int(dom Eq{ex}) Example (Error minimization) Minimize x = 1 2 a − b 2 2 subject to DKL[w, q ⊗ p] ≤ λ, a, b ∈ Rn. Ew{x} < ∞ minimized at w ∝ e−βxq ⊗ p. Maximization of x has no solution. 1 2 a − b 2 2 /∈ dom Eq⊗p{ex}, −1 2 a − b 2 2 ∈ dom Eq⊗p{ex} 0 /∈ Int(dom Eq⊗p{ex}) Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 7 / 16 Method: Symmetric Sandwich Sources and Consequences of Asymmetry Method: Symmetric Sandwich Results Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 8 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] s[−A ∩ A] = s(−A)co ∧ sA = inf{sA(z) + sA(z − y) : z ∈ Y } Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] s[−A ∩ A] = s(−A)co ∧ sA = inf{sA(z) + sA(z − y) : z ∈ Y } s[−A 
∪ A] = s(−A) ∨ sA Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] s[−A ∩ A] = s(−A)co ∧ sA = inf{sA(z) + sA(z − y) : z ∈ Y } s[−A ∪ A] = s(−A) ∨ sA Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] s[−A ∩ A] = s(−A)co ∧ sA = inf{sA(z) + sA(z − y) : z ∈ Y } s[−A ∪ A] = s(−A) ∨ sA Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] s[−A ∩ A] = s(−A)co ∧ sA = inf{sA(z) + sA(z − y) : z ∈ Y } s[−A ∪ A] = s(−A) ∨ sA µM◦ ≤ µ(−M◦ ) ∨ µM◦ µ(−M)co ∧ µM ≤ µM Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Method: Symmetric Sandwich s[−A ∩ A] ≤ sA ≤ s[−A ∪ A] µco [−A◦ ∪ A◦] ≤ µA◦ ≤ µ[−A◦ ∩ A◦] s[−A ∩ A] = s(−A)co ∧ sA = inf{sA(z) + sA(z − y) : z ∈ Y } s[−A ∪ A] = s(−A) ∨ sA µ(−M◦ )co ∧ µM◦ ≤ µM◦ µM ≤ µ(−M) ∨ µM Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 9 / 16 Method: Symmetric Sandwich Lower and upper Luxemburg (Orlicz) norms −2 −1 0 1 2 ϕ∗ (x) = ex − 1 − x −2 −1 0 1 2 ϕ(u) = (1 + u) ln(1 + u) − u Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 10 / 16 Method: Symmetric Sandwich Lower and upper Luxemburg (Orlicz) norms −2 −1 0 1 2 ϕ∗ (x) = ex − 1 − x ϕ∗ +(x) = ϕ∗ (x) /∈ ∆2 −2 −1 0 1 2 ϕ(u) = (1 + u) ln(1 + u) − u ϕ+(u) = ϕ(u) ∈ ∆2 Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 10 / 16 Method: Symmetric Sandwich Lower and upper Luxemburg (Orlicz) norms −2 −1 0 1 2 ϕ∗ (x) = ex − 1 − x ϕ∗ +(x) = ϕ∗ (x) /∈ ∆2 ϕ∗ −(x) = ϕ∗ (−x) ∈ ∆2 −2 −1 0 1 2 ϕ(u) = (1 + u) ln(1 + u) − u ϕ+(u) = ϕ(u) 
∈ ∆2 ϕ−(u) = ϕ(−u) /∈ ∆2 Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 10 / 16 Method: Symmetric Sandwich Lower and upper Luxemburg (Orlicz) norms −2 −1 0 1 2 ϕ∗ (x) = ex − 1 − x ϕ∗ +(x) = ϕ∗ (x) /∈ ∆2 ϕ∗ −(x) = ϕ∗ (−x) ∈ ∆2 x∗ ϕ = µ{x : ϕ∗ (x), z ≤ 1} −2 −1 0 1 2 ϕ(u) = (1 + u) ln(1 + u) − u ϕ+(u) = ϕ(u) ∈ ∆2 ϕ−(u) = ϕ(−u) /∈ ∆2 uϕ = µ{u : ϕ(u), z ≤ 1} Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 10 / 16 Method: Symmetric Sandwich Lower and upper Luxemburg (Orlicz) norms −2 −1 0 1 2 ϕ∗ (x) = ex − 1 − x ϕ∗ +(x) = ϕ∗ (x) /∈ ∆2 ϕ∗ −(x) = ϕ∗ (−x) ∈ ∆2 x∗ ϕ = µ{x : ϕ∗ (x), z ≤ 1} −2 −1 0 1 2 ϕ(u) = (1 + u) ln(1 + u) − u ϕ+(u) = ϕ(u) ∈ ∆2 ϕ−(u) = ϕ(−u) /∈ ∆2 uϕ = µ{u : ϕ(u), z ≤ 1} Proposition · ∗ ϕ+, · ∗ ϕ− are Luxemburg norms and x ∗ ϕ− ≤ x∗ ϕ ≤ x ∗ ϕ+ · ϕ+, · ϕ− are Luxemburg norms and u ϕ+ ≤ uϕ ≤ u ϕ− Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 10 / 16 Method: Symmetric Sandwich Lower and upper Luxemburg (Orlicz) norms −2 −1 0 1 2 ϕ∗ (x) = ex − 1 − x ϕ∗ +(x) = ϕ∗ (x) /∈ ∆2 ϕ∗ −(x) = ϕ∗ (−x) ∈ ∆2 x∗ ϕ = µ{x : ϕ∗ (x), z ≤ 1} −2 −1 0 1 2 ϕ(u) = (1 + u) ln(1 + u) − u ϕ+(u) = ϕ(u) ∈ ∆2 ϕ−(u) = ϕ(−u) /∈ ∆2 uϕ = µ{u : ϕ(u), z ≤ 1} Proposition · ∗ ϕ+, · ∗ ϕ− are Luxemburg norms and x ∗ ϕ− ≤ x∗ ϕ ≤ x ∗ ϕ+ · ϕ+, · ϕ− are Luxemburg norms and u ϕ+ ≤ uϕ ≤ u ϕ− Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 10 / 16 Results Sources and Consequences of Asymmetry Method: Symmetric Sandwich Results Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 11 / 16 Results KL Induces Hausdorﬀ (T2) Asymmetric Topology Theorem (Y, · ϕ) (resp. (X, · ∗ ϕ)) is Hausdorﬀ. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 12 / 16 Results KL Induces Hausdorﬀ (T2) Asymmetric Topology Theorem (Y, · ϕ) (resp. (X, · ∗ ϕ)) is Hausdorﬀ. Proof. u ϕ+ ≤ uϕ (resp. x ϕ− ≤ xϕ) implies (Y, · ϕ) (resp. 
(X, · ∗ ϕ)) is ﬁner than normed space (Y, · ϕ+) (resp. (X, · ∗ ϕ−)). Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 12 / 16 Results Separable Subspaces Theorem (Y, · ϕ+) (resp. (X, · ∗ ϕ−)) is a separable Orlicz subspace of (Y, · ϕ) (resp. (X, · ∗ ϕ)). Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 13 / 16 Results Separable Subspaces Theorem (Y, · ϕ+) (resp. (X, · ∗ ϕ−)) is a separable Orlicz subspace of (Y, · ϕ) (resp. (X, · ∗ ϕ)). Proof. ϕ+(u) = (1 + u) ln(1 + u) − u ∈ ∆2 (resp. ϕ∗ −(x) = e−x − 1 + x ∈ ∆2). Note that ϕ− /∈ ∆2 and ϕ∗ + /∈ ∆2. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 13 / 16 Results Completeness Theorem (Y, · ϕ) (resp. (X, · ∗ ϕ)) is 1 BiComplete: ρsCauchy yn ρs → y. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 14 / 16 Results Completeness Theorem (Y, · ϕ) (resp. (X, · ∗ ϕ)) is 1 BiComplete: ρsCauchy yn ρs → y. 2 ρsequentially complete: ρsCauchy yn ρ → y. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 14 / 16 Results Completeness Theorem (Y, · ϕ) (resp. (X, · ∗ ϕ)) is 1 BiComplete: ρsCauchy yn ρs → y. 2 ρsequentially complete: ρsCauchy yn ρ → y. 3 Right Ksequentially complete: right KCauchy yn ρ → y. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 14 / 16 Results Completeness Theorem (Y, · ϕ) (resp. (X, · ∗ ϕ)) is 1 BiComplete: ρsCauchy yn ρs → y. 2 ρsequentially complete: ρsCauchy yn ρ → y. 3 Right Ksequentially complete: right KCauchy yn ρ → y. Proof. ρs(y, z) = z − yϕ ∨ y − zϕ ≤ y − z ϕ−, where (Y, · ϕ−) is Banach. Then use theorems of Reilly et al. (1982) and Chen et al. (2007). Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 14 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. 
Banach spaces), and therefore many properties have to be reexamined. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. Banach spaces), and therefore many properties have to be reexamined. We have proved that topologies induced by the KLdivergence are: Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. Banach spaces), and therefore many properties have to be reexamined. We have proved that topologies induced by the KLdivergence are: Hausdorﬀ. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. Banach spaces), and therefore many properties have to be reexamined. We have proved that topologies induced by the KLdivergence are: Hausdorﬀ. Bicomplete, ρsequentially complete and right Ksequentially complete. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. Banach spaces), and therefore many properties have to be reexamined. We have proved that topologies induced by the KLdivergence are: Hausdorﬀ. Bicomplete, ρsequentially complete and right Ksequentially complete. Contain a separable Orlicz subspace. 
Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. Banach spaces), and therefore many properties have to be reexamined. We have proved that topologies induced by the KLdivergence are: Hausdorﬀ. Bicomplete, ρsequentially complete and right Ksequentially complete. Contain a separable Orlicz subspace. Total boundedness, compactness? Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 Results Summary and Further Questions Topologies induced by asymmetric information divergences may not have the same properties as their symmetrized counterparts (e.g. Banach spaces), and therefore many properties have to be reexamined. We have proved that topologies induced by the KLdivergence are: Hausdorﬀ. Bicomplete, ρsequentially complete and right Ksequentially complete. Contain a separable Orlicz subspace. Total boundedness, compactness? Other asymmetric information distances (e.g. Renyi divergence). Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 15 / 16 References Sources and Consequences of Asymmetry Method: Symmetric Sandwich Results Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 16 / 16 Results Borodin, P. A. (2001). The BanachMazur theorem for spaces with asymmetric norm. Mathematical Notes, 69(3–4), 298–305. Chen, S.A., Li, W., Zou, D., & Chen, S.B. (2007, Aug). Fixed point theorems in quasimetric spaces. In Machine learning and cybernetics, 2007 international conference on (Vol. 5, p. 24992504). IEEE. Cobzas, S. (2013). Functional analysis in asymmetric normed spaces. Birkh¨auser. Fletcher, P., & Lindgren, W. F. (1982). Quasiuniform spaces (Vol. 77). New York: Marcel Dekker. Reilly, I. L., Subrahmanyam, P. V., & Vamanamurthy, M. K. (1982). Cauchy sequences in quasipseudometric spaces. 
Monatshefte f¨ur Mathematik, 93, 127–140. Roman Belavkin (Middlesex University) Asymmetric Topologies October 28, 2015 16 / 16
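The asymmetry at the heart of these results can be illustrated numerically. Below is a minimal Python sketch of a Luxemburg-type gauge built from the Orlicz function $\varphi(u) = (1+u)\ln(1+u) - u$ above: the finite weight vector `z`, the bisection scheme and the test point `u` are all illustrative choices, and the precise relation of this finite-dimensional gauge to the norms in the Proposition is an assumption of the sketch, not a claim from the slides.

```python
import math

def phi(u):
    """Orlicz function phi(u) = (1 + u) ln(1 + u) - u, defined for u > -1,
    extended by phi(-1) = 1 (continuity) and +infinity below -1."""
    if u <= -1.0:
        return math.inf if u < -1.0 else 1.0
    return (1.0 + u) * math.log(1.0 + u) - u

def luxemburg_gauge(u, z):
    """Gauge inf{alpha > 0 : sum_i z_i phi(u_i / alpha) <= 1} for a vector u
    and nonnegative weights z, computed by bisection (feasibility is
    monotone in alpha because phi decreases towards its minimum at 0)."""
    def ok(alpha):
        return sum(zi * phi(ui / alpha) for ui, zi in zip(u, z)) <= 1.0
    lo, hi = 1e-9, 1.0
    while not ok(hi):       # grow until feasible
        hi *= 2.0
    for _ in range(100):    # bisect down to the boundary
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

u = [2.0, -0.5]
z = [0.5, 0.5]
forward = luxemburg_gauge(u, z)
backward = luxemburg_gauge([-ui for ui in u], z)

# phi is not an even function, so the gauge is asymmetric: ||u|| != ||-u||.
assert abs(forward - backward) > 1e-6
```

Because $\varphi$ blows up much faster on the negative side (it is $+\infty$ below $-1$), the gauge of $-u$ here is pinned at exactly $2$ while the gauge of $u$ is below $1$, a toy instance of the asymmetry the talk attributes to random variables.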
Computational Information Geometry (chaired by Frank Nielsen, Paul Marriott)
We introduce a new approach to goodness-of-fit testing in the high-dimensional, sparse extended multinomial context. The paper takes a computational information-geometric approach, extending classical higher-order asymptotic theory. We show why the Wald statistic (equivalently, the Pearson $\chi^2$ and score statistics) is unworkable in this context, while the deviance has a simple, accurate and tractable sampling distribution even for moderate sample sizes. Issues of uniformity of asymptotic approximations across model space are discussed. A variety of important applications and extensions are noted.

Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling
R. Sabolová¹, P. Marriott², G. Van Bever¹ and F. Critchley¹
¹ The Open University (EPSRC grant EP/L010429/1), United Kingdom; ² University of Waterloo, Canada
GSI 2015, October 28th 2015

Key points
In CIG, the multinomial model $\Delta^k = \{(\pi_0, \ldots, \pi_k) : \pi_i \ge 0,\ \sum_i \pi_i = 1\}$ provides a universal model.
1. Goodness-of-fit testing in large sparse extended multinomial contexts.
2. The Cressie–Read power divergence λ-family (equivalent to Amari's α-family): asymptotic properties of two test statistics, Pearson's χ² and the deviance, and a simulation study for other statistics within the power divergence family.
3. k-asymptotics instead of N-asymptotics.

Outline
1 Introduction; 2 Pearson's χ² versus the deviance; 3 Other test statistics from the power divergence family; 4 Summary.

Big data
Statistical Theory and Methods for Complex, High-Dimensional Data programme, Isaac Newton Institute (2008): "... the practical environment has changed dramatically over the last twenty years, with the spectacular evolution of computing facilities and the emergence of applications in which the number of experimental units is relatively small but the underlying dimension is massive. ... Areas of application include image analysis, microarray analysis, finance, document classification, astronomy and atmospheric science."

Continuous data: high dimensional low sample size (HDLSS) data. Discrete data: databases, image analysis. Sparsity ($N \ll k$) changes everything!

Image analysis example. [Figure: a binary image with $m_1 = 10$, $m_2 = 10$.] The dimension of the state space is $k = 2^{m_1 m_2} - 1$.

Sparsity changes everything
S. Fienberg and A. Rinaldo (2012), Maximum Likelihood Estimation in Log-Linear Models: "Despite the widespread usage of these [log-linear] models, the applicability and statistical properties of log-linear models under sparse settings are still very poorly understood. As a result, even though high-dimensional sparse contingency tables constitute a type of data that is common in practice, their analysis remains exceptionally difficult."

Extended multinomial distribution
Let $n = (n_i) \sim \operatorname{Mult}(N, (\pi_i))$, $i = 0, 1, \ldots, k$, where each $\pi_i \ge 0$. Goodness-of-fit test $H_0 : \pi = \pi^*$. Pearson's χ² (Wald, score) statistic:
$$W := \sum_{i=0}^{k} \frac{(\pi_i^* - n_i/N)^2}{\pi_i^*} \equiv \frac{1}{N^2} \sum_{i=0}^{k} \frac{n_i^2}{\pi_i^*} - 1.$$
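The instability of W near the boundary of the simplex is easy to reproduce. A short Python sketch of the Wald statistic above (NumPy-based; the exponentially decreasing null $\pi_i \propto e^{-0.05\, i}$ is an illustrative choice echoing the figures, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def wald_statistic(n, pi_star, N):
    """Pearson chi-squared / Wald statistic: W = (1/N^2) sum_i n_i^2 / pi_i* - 1."""
    return np.sum(n**2 / pi_star) / N**2 - 1.0

N, k = 50, 200
# Exponentially decreasing null probabilities, normalised to sum to one.
pi = np.exp(-0.05 * np.arange(k + 1))
pi /= pi.sum()

W = np.array([wald_statistic(n, pi, N) for n in rng.multinomial(N, pi, size=1000)])

# E(W) = k/N = 4 under this null, but near the boundary the sampling
# distribution is wildly unstable: a single count landing in a
# tiny-probability cell inflates W enormously.
print(W.mean(), W.max())
```

Rerunning with smaller minimum cell probabilities makes the largest values of W grow without bound, matching the theorem below on $\operatorname{var}(W)$ as $\pi_{\min} \to 0$.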
Rule of thumb (for the accuracy of the $\chi^2_k$ asymptotic approximation): $N\pi_i \ge 5$.

Performance of Pearson's χ² test on the boundary: example
[Figure: (a) null distribution (exponentially decreasing $\pi_i$, by rank of cell probability); (b) a sample of the Wald statistic; $N = 50$, $k = 200$.]

Performance of Pearson's χ² test on the boundary: theory
Theorem. For $k > 1$ and $N \ge 6$, the first three moments of W are
$$E(W) = \frac{k}{N}, \qquad \operatorname{var}(W) = \frac{\pi_{(-1)} - (k+1)^2 + 2k(N-1)}{N^3},$$
and $E[\{W - E(W)\}^3]$ is given by
$$\frac{\pi_{(-2)} - (k+1)^3 - (3k + 25 - 22N)\left[\pi_{(-1)} - (k+1)^2\right] + g(k, N)}{N^5},$$
where $g(k, N) = 4(N-1)k(k + 2N - 5) > 0$ and $\pi_{(a)} := \sum_i \pi_i^a$. In particular, for fixed k and N, as $\pi_{\min} \to 0$, $\operatorname{var}(W) \to \infty$ and $\gamma(W) \to +\infty$, where $\gamma(W) := E[\{W - E(W)\}^3]/\{\operatorname{var}(W)\}^{3/2}$.

The deviance statistic
Define the deviance D via
$$D/2 = \sum_{\{0 \le i \le k : n_i > 0\}} n_i \left\{\log(n_i/N) - \log \pi_i\right\} = \sum_{\{0 \le i \le k : n_i > 0\}} n_i \log(n_i/\mu_i),$$
where $\mu_i := E(n_i) = N\pi_i$.

Distribution of the deviance
Let $\{n_i^*,\ i = 0, \ldots, k\}$ be mutually independent with $n_i^* \sim \operatorname{Po}(\mu_i)$; then $N^* := \sum_{i=0}^{k} n_i^* \sim \operatorname{Po}(N)$ and $n_i = (n_i^* \mid N^* = N) \sim \operatorname{Mult}(N, \pi_i)$. Define
$$S^* := \begin{pmatrix} N^* \\ D^*/2 \end{pmatrix} = \sum_{i=0}^{k} \begin{pmatrix} n_i^* \\ n_i^* \log(n_i^*/\mu_i) \end{pmatrix},$$
and define ν, τ and ρ via
$$E(S^*) = \begin{pmatrix} N \\ \nu \end{pmatrix}, \quad \nu = \sum_{i=0}^{k} E\left(n_i^* \log\{n_i^*/\mu_i\}\right); \qquad \operatorname{cov}(S^*) = \begin{pmatrix} N & \rho\tau\sqrt{N} \\ \rho\tau\sqrt{N} & \tau^2 \end{pmatrix}, \quad \rho\tau\sqrt{N} = \sum_{i=0}^{k} C_i, \quad \tau^2 = \sum_{i=0}^{k} V_i,$$
where $C_i := \operatorname{Cov}(n_i^*, n_i^* \log(n_i^*/\mu_i))$ and $V_i := \operatorname{Var}(n_i^* \log(n_i^*/\mu_i))$. Then, under equicontinuity,
$$D/2 \xrightarrow[k \to \infty]{D} N_1\left(\nu,\ \tau^2(1 - \rho^2)\right).$$

Uniformity near the boundary
[Figure: stability of sampling distributions, N = 50, k = 200, exponentially decreasing $\pi_i$: (a) null distribution; (b) sample of the Wald statistic; (c) sample of the deviance statistic.]

Asymptotic approximations
The normal approximation can be improved: a χ² approximation with a correction for skewness, and symmetrised deviance statistics.
[Figure: quality of the k-asymptotic approximations near the boundary: QQ-plots of deviance quantiles against normal quantiles, chi-squared quantiles, and normal quantiles for the symmetrised deviance.]

Uniformity and higher moments
Does the k-asymptotic approximation hold uniformly across the simplex? Rewrite the deviance as
$$D^*/2 = \sum_{\{0 \le i \le k : n_i^* > 0\}} n_i^* \log(n_i^*/\mu_i) = \Gamma^* + \Delta^*,$$
where $\Gamma^* := \sum_{i=0}^{k} \alpha_i n_i^*$, $\Delta^* := \sum_{\{0 \le i \le k : n_i^* > 1\}} n_i^* \log n_i^* \ge 0$ and $\alpha_i := -\log \mu_i$. How well is the moment generating function of the standardised $\Gamma^*$ approximated by that of a standard normal?
$$M_\gamma(t) = \exp\left(-\frac{E(\Gamma^*)\, t}{\sqrt{\operatorname{Var}(\Gamma^*)}}\right) \exp\left(\sum_{i=0}^{k} \sum_{h=1}^{\infty} \frac{(-1)^h}{h!}\, \mu_i (\log \mu_i)^h \left(\frac{t}{\sqrt{\operatorname{Var}(\Gamma^*)}}\right)^{h}\right)$$
Maximise the skewness term $\sum_{i=0}^{k} \mu_i (\log \mu_i)^3$ for fixed $E(\Gamma^*) = -\sum_{i=0}^{k} \mu_i \log \mu_i$ and $\operatorname{Var}(\Gamma^*) = \sum_{i=0}^{k} \mu_i (\log \mu_i)^2$. The solution is a distribution with three distinct values for $\mu_i$.
[Figure: worst case solution for normality of $\Gamma^*$: (a) null distribution; (b) sample of the Wald statistic; (c) sample of the deviance statistic.]

Uniformity and discreteness
Worst case for asymptotic normality: where, and why?
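The deviance defined earlier can be simulated like any multinomial statistic. A short Python sketch (NumPy-based; the exponentially decaying null is again an illustrative choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def deviance(n, pi_star, N):
    """Deviance D, with D/2 = sum over {i : n_i > 0} of n_i log(n_i / mu_i),
    where mu_i = N * pi_i*."""
    mu = N * pi_star
    pos = n > 0
    return 2.0 * np.sum(n[pos] * np.log(n[pos] / mu[pos]))

N, k = 50, 200
pi = np.exp(-0.05 * np.arange(k + 1))  # illustrative exponentially decaying null
pi /= pi.sum()

D = np.array([deviance(n, pi, N) for n in rng.multinomial(N, pi, size=1000)])

# In contrast to the Wald statistic, the deviance stays on a stable scale
# near the boundary, consistent with its k-asymptotic normal approximation
# N(nu, tau^2 (1 - rho^2)) quoted above.
print(D.mean(), D.std())
```

Since $D = 2N \cdot \mathrm{KL}(n/N \,\|\, \pi^*)$, every simulated value is nonnegative, and the sample stays in a narrow band rather than exploding as W does.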
For Pearson's $\chi^2$ the worst case is near the boundary ("unstable"); for the deviance it is near the centre, where discreteness matters:
$$D^*/2 = \sum_{\{0 \le i \le k : n_i^* > 0\}} n_i^* (\log n_i^* - \log \mu_i) = \Gamma^* + \Delta^*.$$
For the distribution of any discrete random variable to be well approximated by a continuous one, it must have a large number of support points, close together.

[Figure: behaviour at the centre of the simplex, N = 30, k = 200: (a) null distribution; (b) sample of the deviance statistic; (c) QQ-plot of the standardised deviance.]
[Figure: behaviour at the centre of the simplex, N = 60, k = 200: (a) null distribution; (b) sample of the deviance statistic; (c) QQ-plot of the standardised deviance.]

Other test statistics from the power divergence family

Comparison of the performance of different test statistics in the power divergence family as we approach the boundary (exponentially decreasing values of $\pi$):
$$2N I^{\lambda}(n/N, \pi^*) = \frac{2}{\lambda(\lambda + 1)} \sum_{i=1}^{k} n_i \left[\left(\frac{n_i}{N \pi_i^*}\right)^{\lambda} - 1\right], \qquad \alpha = 1 + 2\lambda:$$
α = 3: Pearson's χ² statistic; α = 7/3: Cressie–Read recommendation; α = 1: deviance; α = 0: Hellinger statistic; α = −1: Kullback MDI; α = −3: Neyman χ².

[Figures: histograms of the statistics under the exponentially decreasing null for α = 3 (Pearson's χ²), α = 7/3 (Cressie–Read), α = 1 (deviance), α = 0 (Hellinger), α = −1 (Kullback MDI) and α = −3 (Neyman χ²).]

Summary: key points
1. Goodness-of-fit testing in large sparse extended multinomial contexts.
2. k-asymptotics instead of N-asymptotics.
3. The Cressie–Read power divergence λ-family: asymptotic properties of two test statistics, Pearson's χ² and the deviance, and a simulation study for other statistics within the power divergence family.
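The power divergence family above is straightforward to implement. A hedged Python sketch (the $\lambda = 0$ and $\lambda = -1$ members are taken as the usual continuity limits of the general formula, and the small uniform-null check at the end is purely illustrative):

```python
import numpy as np

def power_divergence(n, pi_star, N, lam):
    """Cressie-Read statistic 2N I^lambda(n/N, pi*); alpha = 1 + 2*lambda.
    lam = 1 gives Pearson's chi-squared, lam = 0 the deviance (limit),
    lam = -1/2 Hellinger, lam = -1 Kullback MDI (limit), lam = -2 Neyman."""
    mu = N * np.asarray(pi_star, dtype=float)
    n = np.asarray(n, dtype=float)
    pos = n > 0
    if lam == 0:      # deviance limit
        return 2.0 * np.sum(n[pos] * np.log(n[pos] / mu[pos]))
    if lam == -1:     # Kullback MDI limit; infinite if any cell is empty
        return np.inf if not pos.all() else 2.0 * np.sum(mu * np.log(mu / n))
    return 2.0 / (lam * (lam + 1)) * np.sum(n[pos] * ((n[pos] / mu[pos])**lam - 1.0))

# Illustrative check on a small uniform null: lam = 1 recovers the Pearson
# statistic sum n_i^2/mu_i - N, and lam = 0 the deviance.
rng = np.random.default_rng(2)
pi = np.full(5, 0.2)
n = rng.multinomial(40, pi)
pearson = power_divergence(n, pi, 40, lam=1.0)
dev = power_divergence(n, pi, 40, lam=0.0)
assert abs(pearson - (np.sum(n**2 / (40 * pi)) - 40)) < 1e-9
```

Note how the MDI member ($\alpha = -1$) is already infinite whenever a cell is empty, one concrete way sparsity breaks statistics at the negative end of the family.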
References
A. Agresti (2002). Categorical Data Analysis. Wiley: Hoboken, NJ.
K. Anaya-Izquierdo, F. Critchley and P. Marriott (2014). When are first order asymptotics adequate? A diagnostic. Stat, 3: 17–22.
K. Anaya-Izquierdo, F. Critchley, P. Marriott and P. Vos (2013). Computational information geometry: foundations. Proceedings of GSI 2013, LNCS.
F. Critchley and P. Marriott (2014). Computational information geometry in statistics: theory and practice. Entropy, 16: 2454–2471.
S. E. Fienberg and A. Rinaldo (2012). Maximum likelihood estimation in log-linear models. Annals of Statistics, 40: 996–1023.
L. Holst (1972). Asymptotic normality and efficiency for certain goodness-of-fit tests. Biometrika, 59: 137–145.
C. Morris (1975). Central limit theorems for multinomial sums. Annals of Statistics, 3: 165–188.
Local mixture models give an inferentially tractable but still flexible alternative to general mixture models. Their parameter space naturally includes boundaries, near which the behaviour of the likelihood is non-standard. This paper shows how convex and differential geometry help in characterising these boundaries. In particular, the geometry of polytopes and of ruled and developable surfaces is exploited to develop efficient inferential algorithms.

Computing Boundaries in Local Mixture Models
Vahed Maroufy and Paul Marriott, Department of Statistics and Actuarial Science, University of Waterloo. GSI 2015, Paris, October 28.

Outline
1 Influence of boundaries on parameter inference; 2 Local mixture models (LMM); 3 Parameter space and boundaries: hard boundaries and soft boundaries; 4 Computing the boundaries for LMMs; 5 Summary and future direction.

Boundary influence
When a boundary exists, either the MLE does not exist (so one finds the extended MLE), or the MLE exists but does not satisfy the regular properties. Examples: the binomial distribution, logistic regression, contingency tables, log-linear and graphical models; see Geyer (2009), Rinaldo et al. (2009), Anaya-Izquierdo et al. (2013). Computing the boundary is a hard problem, Fukuda (2004). Many mathematical results exist in the literature: polytope approximation, Böröczky and Fodor (2008), Barvinok (2013); smooth surface approximation, Batyrev (1992), Ghomi (2001, 2004).

Local mixture models
Definition (Marriott, 2002):
$$g(x; \mu, \lambda) = f(x; \mu) + \sum_{j=2}^{k} \lambda_j f^{(j)}(x; \mu), \qquad \lambda \in \Lambda_\mu \subset \mathbb{R}^{k-1}.$$
Properties (Anaya-Izquierdo and Marriott, 2007):
g is identifiable in all parameters, and the parametrization (µ, λ) is orthogonal at λ = 0.
The log-likelihood of g is a concave function of λ at a fixed µ₀.
$\Lambda_\mu$ is convex.
LMMs approximate continuous mixture models $\int_M f(x, \mu)\, dQ(\mu)$ when the mixing is "small".
The family of LMMs is richer than the family of mixtures.

Example and motivation: LMM of the normal, $f(x; \mu) = \phi(x; \mu, \sigma^2)$ ($\sigma^2$ known).
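For the normal base model the µ-derivatives $f^{(j)}(x; \mu)$ are the density times probabilists' Hermite polynomials in $x - \mu$, so the k = 4 LMM can be evaluated in a few lines. A Python sketch (σ = 1 is assumed for simplicity, and the evaluation points and λ values are illustrative choices, not from the paper):

```python
import math

def phi(x, mu):
    """Normal density phi(x; mu, 1)."""
    return math.exp(-0.5 * (x - mu)**2) / math.sqrt(2.0 * math.pi)

def lmm_normal(x, mu, lam):
    """k = 4 normal LMM with sigma = 1:
    g(x; mu, lambda) = phi * (1 + lam2*He2(y) + lam3*He3(y) + lam4*He4(y)),
    y = x - mu; He_j * phi is the j-th mu-derivative of the density."""
    lam2, lam3, lam4 = lam
    y = x - mu
    he2 = y * y - 1.0
    he3 = y**3 - 3.0 * y
    he4 = y**4 - 6.0 * y * y + 3.0
    return phi(x, mu) * (1.0 + lam2 * he2 + lam3 * he3 + lam4 * he4)

# At lambda = 0 the LMM reduces to the base density.
assert lmm_normal(0.3, 0.0, (0.0, 0.0, 0.0)) == phi(0.3, 0.0)

# Each Hermite term integrates to zero against phi, so g integrates to one
# for every lambda; positivity, however, holds only for lambda in Lambda_mu.
total = sum(lmm_normal(-10 + 0.01 * i, 0.0, (0.1, 0.02, 0.01))
            for i in range(2001)) * 0.01
assert abs(total - 1.0) < 1e-3
```

The three Hermite factors $y^2 - 1$, $y^3 - 3y$ and $y^4 - 6y^2 + 3$ are exactly the polynomials that reappear below in the hard-boundary constraint for the normal model.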
$$g(x; \mu, \lambda) = \phi(x; \mu, \sigma^2)\left(1 + \sum_{j=2}^{k} \lambda_j p_j(x)\right), \qquad \lambda \in \Lambda_\mu,$$
with $p_j(x)$ a polynomial of degree j.

Why do we care about λ and $\Lambda_\mu$? The parameters are interpretable:
$$\mu_g^{(2)} = \sigma^2 + 2\lambda_2, \qquad \mu_g^{(3)} = 6\lambda_3, \qquad \mu_g^{(4)} = \mu_\phi^{(4)} + 12\sigma^2\lambda_2 + 24\lambda_4, \tag{1}$$
and λ represents the mixing distribution Q, via its moments, in $\int_M f(x, \mu)\, dQ(\mu)$.

The costs of all these good properties and this flexibility are:
Hard boundary ⟹ positivity (the boundary of $\Lambda_\mu$).
Soft boundary ⟹ mixture behaviour.
We compute them here for two models, the Poisson and the normal, with k = 4 fixed.

Boundaries
Hard boundary:
$$\Lambda_\mu = \left\{\lambda \;:\; 1 + \sum_{j=2}^{k} \lambda_j q_j(x; \mu) \ge 0,\ \forall x \in S\right\};$$
$\Lambda_\mu$ is an intersection of half-spaces, hence convex, and the hard boundary is constructed by a set of (hyper)planes.

Soft boundary
Definition. For a density function f(x; µ) with k finite moments, let
$$M_k(f) := (E_f(X), E_f(X^2), \ldots, E_f(X^k)),$$
and for compact M define $C = \operatorname{convhull}\{M_k(f) : \mu \in M\}$. The boundary of C is called the soft boundary.
Computing the hard boundary
Poisson model:
Λ_μ = {λ : A_2(x)λ_2 + A_3(x)λ_3 + A_4(x)λ_4 + 1 ≥ 0, for all x ∈ Z_+}.
Figure: Left: slice through λ_2 = −0.1; Right: slice through λ_3 = 0.3.
Theorem. For an LMM of a Poisson distribution, for each μ, the space Λ_μ can be arbitrarily well approximated, as measured by volume for example, by a finite polytope.

Normal model: let y = (x − μ)/σ²; then
Λ_μ = {λ : (y² − 1)λ_2 + (y³ − 3y)λ_3 + (y⁴ − 6y² + 3)λ_4 + 1 ≥ 0, for all y ∈ R}.
We need more geometric tools to compute this boundary.

Ruled and developable surfaces
Definition.
Ruled surface: Γ(x, γ) = α(x) + γ β(x), x ∈ I ⊂ R.
Developable surface: β(x), α′(x) and β′(x) are coplanar for all x ∈ I.
Definition. The family of planes A = {λ ∈ R³ : a(x)·λ + d(x) = 0, x ∈ R}, each determined by an x ∈ R, is called a one-parameter infinite family of planes. Each element of the set {λ ∈ R³ : a(x)·λ + d(x) = 0, a′(x)·λ + d′(x) = 0, x ∈ R} is called a characteristic line of the surface at x, and their union is called the envelope of the family.
A characteristic line is the intersection of two consecutive planes.
The envelope is a developable surface.

Boundaries for the normal LMM
Hard boundary of the normal LMM:
{(y² − 1)λ_2 + (y³ − 3y)λ_3 + (y⁴ − 6y² + 3)λ_4 + 1 = 0, for all y ∈ R}.
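For the normal model, membership in Λ_μ can be tested numerically by checking the quartic over a grid; a minimal sketch (illustrative only, with a hypothetical helper name):

```python
def in_hard_region(lam2, lam3, lam4, y_max=50.0, steps=100000):
    # Membership test for the normal-LMM hard region: the quartic
    # 1 + lam2*(y^2-1) + lam3*(y^3-3y) + lam4*(y^4-6y^2+3) must be
    # nonnegative for every real y.
    if lam4 < 0:
        return False  # leading term lam4*y^4 drives q to -infinity
    h = 2 * y_max / steps
    for i in range(steps + 1):
        y = -y_max + i * h
        q = (1 + lam2 * (y*y - 1) + lam3 * (y**3 - 3*y)
               + lam4 * (y**4 - 6*y*y + 3))
        if q < 0:
            return False
    return True
```

This is only a grid check, not an exact envelope computation; the developable-surface machinery above is what describes the boundary itself.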
Figure: Left: the hard boundary for the normal LMM (shaded) as a subset of a self-intersecting ruled surface (unshaded), in (λ_2, λ_3, λ_4)-space; Right: slice through λ_4 = 0.2.

Soft boundary for the normal LMM
Recall M_k(f) := (E_f(X), E_f(X²), ..., E_f(X^k)). For visualization purposes let k = 3 (μ ∈ M, σ fixed):
M_3(f) = (μ, μ² + σ², μ³ + 3μσ²),
M_3(g) = (μ, μ² + σ² + 2λ_2, μ³ + 3μσ² + 6μλ_2 + 6λ_3).
Figure: Left: the 3D curve φ(μ); Middle: the bounding ruled surface γ_a(μ, u); Right: the convex subspace restricted to the soft boundary.

Ruled surface parametrization. There are two boundary surfaces, each constructed by a curve and a set of lines attached to it:
γ_a(μ, u) = φ(μ) + u L_a(μ),  γ_b(μ, u) = φ(μ) + u L_b(μ),
where, for M = [a, b] and φ(μ) = M_3(f),
L_a(μ): lines between φ(a) and φ(μ);
L_b(μ): lines between φ(μ) and φ(b).

Summary
Understanding these boundaries is important if we want to exploit the nice statistical properties of LMMs.
The boundaries described in this paper have both discrete aspects and smooth aspects.
The two examples discussed represent the structure for almost all exponential family models.
It is an interesting problem to design optimization algorithms on these boundaries for finding boundary maximizers of the likelihood.

References
Anaya-Izquierdo, K., Critchley, F., and Marriott, P. (2013). When are first order asymptotics adequate? A diagnostic. Stat, 3(1):17–22.
Anaya-Izquierdo, K. and Marriott, P. (2007). Local mixture models of exponential families. Bernoulli, 13:623–640.
Barvinok, A. (2013). Thrifty approximations of convex bodies by polytopes. International Mathematics Research Notices, rnt078.
Batyrev, V. V. (1992). Toric varieties and smooth convex approximations of a polytope.
RIMS Kokyuroku, 776:20.
Boroczky, K. and Fodor, F. (2008). Approximating 3-dimensional convex bodies by polytopes with a restricted number of edges. Contributions to Algebra and Geometry, 49(1):177–193.
Fukuda, K. (2004). From the zonotope construction to the Minkowski addition of convex polytopes. Journal of Symbolic Computation, 38(4):1261–1272.
Geyer, C. J. (2009). Likelihood inference in exponential families and directions of recession. Electronic Journal of Statistics, 3:259–289.
Ghomi, M. (2001). Strictly convex submanifolds and hypersurfaces of positive curvature. Journal of Differential Geometry, 57(2):239–271.
Ghomi, M. (2004). Optimal smoothing for convex polytopes. Bulletin of the London Mathematical Society, 36(4):483–492.
Marriott, P. (2002). On the local geometry of mixture models. Biometrika, 89:77–93.
Rinaldo, A., Fienberg, S. E., and Zhou, Y. (2009). On the geometry of discrete exponential families with application to exponential random graph models. Electronic Journal of Statistics, 3:446–484.

END. Thank You!
We generalize the O(dn/ε²)-time (1 + ε)-approximation algorithm for the smallest enclosing Euclidean ball [2,10] to point sets in hyperbolic geometry of arbitrary dimension. We guarantee an O(1/ε²) convergence time by using a closed-form formula to compute the geodesic α-midpoint between any two points. These results allow us to apply hyperbolic k-center clustering to statistical location-scale families and to multivariate spherical normal distributions, using their Fisher information matrix as the underlying Riemannian hyperbolic metric.

Approximating Covering and Minimum Enclosing Balls in Hyperbolic Geometry
Frank Nielsen (École Polytechnique) and Gaëtan Hadjeres (Sony Computer Science Laboratories, Inc.)
Conference on Geometric Science of Information
© 2015 Frank Nielsen, Gaëtan Hadjeres

The minimum enclosing ball problem
Finding the minimum enclosing ball (or 1-center) of a finite point set P = {p_1, ..., p_n} in a metric space (X, d_X(·,·)) consists in finding c ∈ X such that
c = argmin_{c′ ∈ X} max_{p ∈ P} d_X(c′, p).
Figure: a finite point set P and its minimum enclosing ball MEB(P).

The approximate minimum enclosing ball problem
In the Euclidean setting this problem is well defined (the center c* and radius R* of the MEB are unique) but computationally intractable in high dimensions. We fix an ε > 0 and focus on the approximate minimum enclosing ball problem of finding an approximation c ∈ X of MEB(P) such that
d_X(c, p) ≤ (1 + ε) R* for all p ∈ P.

Prior work
An approximate solution in the Euclidean case is given by Badoiu and Clarkson's algorithm [Badoiu and Clarkson, 2008]:
Initialize the center c_1 ∈ P.
Repeat 1/ε² times the update c_{i+1} = c_i + (f_i − c_i)/(i + 1), where f_i ∈ P is the farthest point from c_i.
How do we deal with point sets whose underlying geometry is not Euclidean?

This algorithm has been generalized to dually flat manifolds [Nock and Nielsen, 2005] and to Riemannian manifolds [Arnaudon and Nielsen, 2013]. Applying these results to hyperbolic geometry gives the existence and uniqueness of MEB(P), but gives no explicit bound on the number of iterations and assumes that we are able to precisely cut geodesics.
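The Euclidean Badoiu-Clarkson recursion above can be sketched in a few lines; this is an illustration of the update rule, not the authors' implementation.

```python
import math

def euclidean_meb(points, eps):
    # Badoiu-Clarkson: start at any input point, then repeatedly step a
    # 1/(i+1) fraction of the way toward the current farthest point.
    c = list(points[0])
    for i in range(1, int(math.ceil(1.0 / eps**2)) + 1):
        f = max(points, key=lambda p: sum((a - b)**2 for a, b in zip(c, p)))
        c = [a + (b - a) / (i + 1) for a, b in zip(c, f)]
    return c

# P has MEB center (0, 0) and radius R* = 1 (two antipodal points).
P = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
c = euclidean_meb(P, 0.1)
print(max(math.dist(c, p) for p in P))  # <= (1 + 0.1) * R*
```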
Our contribution
We analyze the case of point sets whose underlying geometry is hyperbolic. Using a closed-form formula to compute geodesic α-midpoints, we obtain:
an intrinsic (1 + ε)-approximation algorithm for the approximate minimum enclosing ball problem;
an O(1/ε²) convergence-time guarantee;
a one-class clustering algorithm for specific subfamilies of normal distributions, using their Fisher information metric.

The Poincaré ball model of d-dimensional hyperbolic geometry
The Poincaré ball model (B^d, ρ(·,·)) consists of the open unit ball B^d = {x ∈ R^d : ‖x‖ < 1} together with the hyperbolic distance
ρ(p, q) = arcosh(1 + 2‖p − q‖² / ((1 − ‖p‖²)(1 − ‖q‖²))), for all p, q ∈ B^d.
This distance induces a Riemannian structure on the metric space (B^d, ρ).

Geodesics in the Poincaré ball model
Shortest paths between two points (geodesics) are exactly:
straight (Euclidean) lines passing through the origin;
circle arcs orthogonal to the unit sphere.
Figure: "straight" lines in the Poincaré ball model.

Circles in the Poincaré ball model
Circles in the Poincaré ball model look like Euclidean circles, but with a different center.
Figure: difference between the Euclidean MEB (in blue) and the hyperbolic MEB (in red) for the set of blue points in the hyperbolic Poincaré disk (in black). The red cross is the hyperbolic center of the red circle, while the pink one is its Euclidean center.
Translations in the Poincaré ball model
T_p(x) = [(1 − ‖p‖²) x + (‖x‖² + 2⟨x, p⟩ + 1) p] / (‖p‖²‖x‖² + 2⟨x, p⟩ + 1).
Figure: tiling of the hyperbolic plane by squares.

Closed-form formula for computing α-midpoints
A point m is the α-midpoint p #_α q of two points p, q, for α ∈ [0, 1], if:
m belongs to the geodesic joining the two points p, q;
m verifies ρ(p, m) = α ρ(p, q).
For the special case p = (0, ..., 0), q = (x_q, 0, ..., 0), we have p #_α q := (x_α, 0, ..., 0) with
x_α = (c_{α,q} − 1)/(c_{α,q} + 1),  where c_{α,q} := e^{α ρ(p,q)} = ((1 + x_q)/(1 − x_q))^α.
Noting that p #_α q = T_p(T_{−p}(p) #_α T_{−p}(q)) for all p, q ∈ B^d, we obtain:
a closed-form formula for computing p #_α q;
a way to compute p #_α q in linear time O(d);
transformations that are exact.

(1 + ε)-approximation of a hyperbolic enclosing ball of fixed radius
For a fixed radius r > R*, we can find c ∈ B^d such that ρ(c, p) ≤ (1 + ε) r for all p ∈ P with:

Algorithm 1: (1 + ε)-approximation of EHB(P, r)
1: c_0 := p_1
2: t := 0
3: while there exists p ∈ P such that p ∉ B(c_t, (1 + ε) r) do
4:   let p ∈ P be such a point
5:   α := (ρ(c_t, p) − r) / ρ(c_t, p)
6:   c_{t+1} := c_t #_α p
7:   t := t + 1
8: end while
9: return c_t

Idea of the proof. By the hyperbolic law of cosines,
cosh(ρ_t) ≥ cosh(h) cosh(ρ_{t+1}),  hence  cosh(ρ_1) ≥ cosh(h)^T ≥ cosh(εr)^T.
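The distance, translation, α-midpoint, and Algorithm 1 combine into a short numerical sketch; this is an illustration of the formulas above, not the authors' code, and it assumes r ≥ R* so the loop terminates.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rho(p, q):
    # Hyperbolic distance in the Poincare ball model
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.acosh(1 + 2 * d2 / ((1 - dot(p, p)) * (1 - dot(q, q))))

def translate(p, x):
    # Hyperbolic translation T_p, an isometry with T_p(0) = p
    x2, p2, xp = dot(x, x), dot(p, p), dot(x, p)
    den = p2 * x2 + 2 * xp + 1
    return [((1 - p2) * xi + (x2 + 2 * xp + 1) * pi) / den
            for xi, pi in zip(x, p)]

def alpha_midpoint(p, q, alpha):
    # p #_alpha q: move p to the origin, apply the closed-form axis
    # formula along the (now straight) geodesic, translate back.
    qp = translate([-a for a in p], q)
    xq = math.sqrt(dot(qp, qp))
    if xq == 0:
        return list(p)
    c = ((1 + xq) / (1 - xq)) ** alpha
    xa = (c - 1) / (c + 1)
    return translate(p, [xa / xq * a for a in qp])

def ehb(points, r, eps):
    # Algorithm 1: center of a hyperbolic enclosing ball of radius
    # (1 + eps) * r, assuming r >= R*.
    c = list(points[0])
    while True:
        far = max(points, key=lambda p: rho(c, p))
        d = rho(c, far)
        if d <= (1 + eps) * r:
            return c
        c = alpha_midpoint(c, far, (d - r) / d)
```

A quick sanity check: since T_p is an isometry, ρ(p, p #_α q) must equal α ρ(p, q) exactly.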
Figure: update of c_t, showing c_{t+1}, the true center c*, the farthest point p_t, the distances ρ_t, ρ_{t+1}, r, and the gained height h > εr.

The EHB(P, r) algorithm is an O(1/ε²)-time algorithm which returns the center of a hyperbolic enclosing ball with radius (1 + ε)r in fewer than 4/ε² iterations. The error with respect to the true MEHB center c* verifies
ρ(c, c*) ≤ arcosh(cosh((1 + ε)r) / cosh(R*)).

(1 + ε + ε²/4)-approximation of MEHB(P)
In fact, as R* is unknown in general, the EHB algorithm returns for any r:
a (1 + ε)-approximation of EHB(P) if r ≥ R*;
the fact that r < R*, if the result obtained after more than 4/ε² iterations is not good enough.
This suggests implementing a dichotomic search on r in order to compute an approximation of the minimal hyperbolic enclosing ball. We obtain a (1 + ε + ε²/4)-approximation of MEHB(P) in O((N/ε²) log(1/ε)) iterations.
Algorithm 2: (1 + ε)-approximation of MEHB(P)
1: c := p_1
2: r_max := ρ(c, P); r_min := r_max/2; t_max := +∞
3: r := r_max
4: repeat
5:   c_temp := Alg1(P, r, ε/2), interrupted if t > t_max in Alg1
6:   if the call to Alg1 was interrupted then
7:     r_min := r
8:   else
9:     r_max := r; c := c_temp
10:  end if
11:  dr := (r_max − r_min)/2; r := r_min + dr; t_max := [log cosh((1 + ε/2)r) − log cosh(r_min)] / log cosh(εr/2)
12: until 2 dr < ε r_min / 2
13: return c

Experimental results
The number of iterations does not depend on d.
Figure: number of α-midpoint calculations as a function of ε, in logarithmic scale, for different values of d.
The running time is approximately O(dn/ε²) (a vertical translation in logarithmic scale).
Figure: execution time as a function of ε, in logarithmic scale, for different values of d.

Applications
Hyperbolic geometry arises when considering certain subfamilies of multivariate normal distributions. For instance, the following subfamilies:
N(μ, σ²I_n) of n-variate normal distributions with scalar covariance matrix (I_n is the n × n identity matrix);
N(μ, diag(σ_1², ..., σ_n²)) of n-variate normal distributions with diagonal covariance matrix;
N(μ_0, Σ) of d-variate normal distributions with fixed mean μ_0 and arbitrary positive definite covariance matrix Σ;
are statistical manifolds whose Fisher information metric is hyperbolic.

In particular, our results apply to the two-dimensional location-scale subfamily.
Figure: MEHB(D) of probability density functions (left) in the (μ, σ) upper half-plane (right), with P = {A, B, C}.
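For the univariate location-scale family, the Fisher metric ds² = (dμ² + 2dσ²)/σ² is, up to the rescaling μ → μ/√2, the Poincaré upper half-plane metric, which gives a closed-form Fisher-Rao distance. The sketch below is an illustration; the √2 rescaling convention is our assumption, not spelled out on the slide.

```python
import math

def fisher_rao_normal(mu1, s1, mu2, s2):
    # Map (mu, sigma) -> (mu / sqrt(2), sigma) into the hyperbolic upper
    # half-plane, use the half-plane distance, and rescale by sqrt(2).
    x1, x2 = mu1 / math.sqrt(2), mu2 / math.sqrt(2)
    a = ((x1 - x2) ** 2 + (s1 - s2) ** 2) / (2 * s1 * s2)
    return math.sqrt(2) * math.acosh(1 + a)

# Same mean: the distance reduces to sqrt(2) * |log(s2 / s1)|.
print(fisher_rao_normal(0.0, 1.0, 0.0, math.e))  # ~ sqrt(2)
```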
Openings
Plugging the EHB and MEHB algorithms in to compute cluster centers in the approximation algorithm by [Gonzalez, 1985], we obtain approximation algorithms for covering in hyperbolic spaces, i.e., for the k-center problem, in O((kNd/ε²) log(1/ε)).

Algorithm 3: Gonzalez farthest-first traversal approximation algorithm
1: C_1 := P, i = 0
2: while i ≤ k do
3:   for all j ≤ i, compute c_j := MEB(C_j)
4:   for all j ≤ i, set f_j := argmax_{p ∈ P} ρ(p, c_j)
5:   find f ∈ {f_j} whose distance to its cluster center is maximal
6:   create a cluster C_i containing f
7:   add to C_i all points whose distance to f is smaller than the distance to their cluster center
8:   increment i
9: end while
10: return {C_i}_i

Coresets in hyperbolic geometry
The computation of the minimum enclosing hyperbolic ball does not necessarily involve all points p ∈ P:
the MEHB obtained by the algorithm is an ε-coreset;
a difference with the Euclidean setting, where coresets are of size at most 1/ε [Badoiu and Clarkson, 2008].

Thank you!

Bibliography
Arnaudon, M. and Nielsen, F. (2013). On approximating the Riemannian 1-center. Computational Geometry, 46(1):93–104.
Badoiu, M. and Clarkson, K. L. (2008). Optimal core-sets for balls. Computational Geometry, 40(1):14–22.
Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306.
Nock, R. and Nielsen, F. (2005). Fitting the smallest enclosing Bregman ball. In Machine Learning: ECML 2005, pages 649–656. Springer.
Brain-Computer Interfaces (BCI) based on electroencephalography (EEG) rely on multichannel brain signal processing. Most of the state-of-the-art approaches deal with covariance matrices, and indeed Riemannian geometry has provided a substantial framework for developing new algorithms. Most notably, a straightforward algorithm such as Minimum Distance to Mean yields competitive results when applied with a Riemannian distance. This applicative contribution aims at assessing the impact of several distances on a real EEG dataset, as the invariances embedded in those distances have an influence on the classification accuracy. Euclidean and Riemannian distances and means are compared both in terms of quality of results and of computational load.

From Euclidean to Riemannian Means: Information Geometry for SSVEP Classification
Emmanuel K. Kalunga, Sylvain Chevallier, Quentin Barthélemy et al.
F'SATI, Tshwane University of Technology (South Africa); LISV, Université de Versailles Saint-Quentin (France); Mensia Technologies (France)
sylvain.chevallier@uvsq.fr, 28 October 2015

Outline: Brain-Computer Interfaces; Spatial covariance matrices for BCI; Experimental assessment of distances

Cerebral interfaces
Context: rehabilitation and disability compensation, calling for out-of-the-lab solutions open to a wider population.
Problem: intra-subject variabilities call for online methods and adaptive algorithms; inter-subject variabilities call for good generalization and fast convergence.
Opportunities: a new generation of BCI (Congedo & Barachant); growing interest in the EEG community; a large community and available datasets; challenging situations and problems.

Interaction based on brain activity
Brain-Computer Interfaces (BCI) enable non-muscular communication, with medical applications and possible applications for a wider population.
Recording at what scale?
Neuron: LFP. Neuronal group: ECoG, SEEG. Whole brain: EEG, MEG, fMRI, PET.

Interaction loop
The BCI loop: 1. acquisition; 2. preprocessing; 3. translation; 4. user feedback.

S. Chevallier, 28/10/2015, GSI
Electroencephalography
Most BCI rely on EEG, which is efficient at capturing brain waves: lightweight systems, low cost, mature technologies, high temporal resolution, no trepanation.

Origins of EEG
Local field potentials: electric potential difference between dendrite and soma.
Maxwell's equations under the quasi-static approximation; volume conduction effect.
Sensitive to the conductivity of brain and skull, and to tissue anisotropies.

Experimental paradigms
Different brain signals for BCI:
motor imagery: (de)synchronization in the premotor cortex;
evoked responses: low-amplitude potentials induced by a stimulus.
Steady-State Visually Evoked Potentials (SSVEP): 8 electrodes in the occipital region, with stimulation LEDs at 13 Hz, 17 Hz and 21 Hz. Neural synchronization with the visual stimulation; no learning required, as it is based on visual attention; strong induced activation.

BCI challenges
Limitations:
data scarcity: a few sources are non-linearly mixed on all electrodes;
individual variabilities: effect of mental fatigue;
inter-session variabilities: electrode impedances, localization of electrodes;
inter-individual variabilities: state-of-the-art approaches fail for 20% of subjects.
Desired properties:
online systems that continuously adapt to the user's variations;
no calibration phase, which imposes a non-negligible cognitive load and raises fatigue;
generic model classifiers and transfer learning: use data from one subject to enhance the results for another.
Spatial covariance matrices
Common approach: spatial filtering. It is efficient on clean datasets, but specific to each user and session (requiring user calibration), and its two-step training with feature selection carries an overfitting risk and the curse of dimensionality.
Working with covariance matrices instead offers good generalization across subjects, fast convergence, existing online algorithms, and efficient implementations.

Covariance matrices for EEG
An EEG trial is X ∈ R^{C×N}, with C electrodes and N time samples, and we assume X ~ N(0, Σ).
Covariance matrices Σ belong to M_C = {Σ ∈ R^{C×C} : Σ = Σᵀ and xᵀΣx > 0 for all x ∈ R^C \ 0}.
The mean of the set {Σ_i}_{i=1,...,I} is Σ̄ = argmin_{Σ ∈ M_C} Σ_{i=1}^{I} d_m(Σ_i, Σ).
Each EEG class is represented by its mean, and classification is based on those means. How do we obtain a robust and efficient algorithm? (Congedo, 2013)

Minimum distance to Riemannian mean (MDM)
A simple and robust classifier:
compute the center Σ_E^{(k)} of each of the K classes;
assign a given unlabelled Σ̂ to the closest class, k* = argmin_k d(Σ̂, Σ_E^{(k)}).
Figure: trajectories on the tangent space at the mean of all trials Σ̄_μ, for the resting, 13 Hz, 17 Hz and 21 Hz classes.

Riemannian potato
Removing outliers and artifacts: reject any Σ_i that lies too far from the mean of all trials Σ̄_μ, i.e., whenever
z(d_i) = (d_i − μ)/σ > z_th,
where d_i = d(Σ_i, Σ̄_μ), and μ and σ are the mean and standard deviation of the distances {d_i}_{i=1}^{I}.
Figure: raw matrices vs. Riemannian potato filtering.
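The MDM rule can be sketched for 2×2 SPD matrices, where the matrix logarithm has a closed form via the eigendecomposition. This is an illustration using the log-Euclidean distance d(Σ_1, Σ_2) = ‖log Σ_1 − log Σ_2‖_F, not the authors' implementation; the class labels and sample matrices are made up.

```python
import math

def logm2(S):
    # Matrix logarithm of a 2x2 SPD matrix via its eigendecomposition.
    a, b, c = S[0][0], S[0][1], S[1][1]
    disc = math.sqrt(((a - c) / 2) ** 2 + b * b)
    l1, l2 = (a + c) / 2 + disc, (a + c) / 2 - disc
    if abs(b) < 1e-15:  # already diagonal
        return [[math.log(a), 0.0], [0.0, math.log(c)]]
    v = [l1 - c, b]                 # eigenvector for l1
    n = math.hypot(v[0], v[1])
    u1, u2 = v[0] / n, v[1] / n     # (u1,u2) and (-u2,u1) are orthonormal
    g1, g2 = math.log(l1), math.log(l2)
    return [[g1*u1*u1 + g2*u2*u2, (g1 - g2)*u1*u2],
            [(g1 - g2)*u1*u2, g1*u2*u2 + g2*u1*u1]]

def d_le(S1, S2):
    # Log-Euclidean distance: Frobenius norm of the difference of logs.
    L1, L2 = logm2(S1), logm2(S2)
    return math.sqrt(sum((L1[i][j] - L2[i][j]) ** 2
                         for i in range(2) for j in range(2)))

def mdm_fit(classes):
    # Log-Euclidean class centers, kept in the log domain.
    means = {}
    for label, mats in classes.items():
        logs = [logm2(S) for S in mats]
        means[label] = [[sum(L[i][j] for L in logs) / len(logs)
                         for j in range(2)] for i in range(2)]
    return means

def mdm_predict(S, means):
    L = logm2(S)
    def dist2(M):
        return sum((L[i][j] - M[i][j]) ** 2
                   for i in range(2) for j in range(2))
    return min(means, key=lambda k: dist2(means[k]))
```

Real pipelines work on C×C matrices with full eigendecompositions (e.g. via a numerical library), but the nearest-class-center logic is exactly the same.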
Covariance matrices for EEG-based BCI
Riemannian approaches in BCI achieve state-of-the-art results, performing like spatial filtering or sensor-space methods, and rely on simpler algorithms that are less error-prone and computationally efficient.
What are the reasons for this success?
Invariances embedded in Riemannian distances: invariance to rescaling, normalization and whitening, and invariance to electrode permutation or positioning.
Equivalence to working in an optimal source space: spatial filters are sensitive to outliers and user-specific, and the "sensors or sources" question vanishes.
So: what are the most desirable invariances for EEG?

Considered distances and divergences
Euclidean: d_E(Σ_1, Σ_2) = ‖Σ_1 − Σ_2‖_F.
Log-Euclidean: d_LE(Σ_1, Σ_2) = ‖log(Σ_1) − log(Σ_2)‖_F (V. Arsigny et al., 2006, 2007).
Affine-invariant: d_AI(Σ_1, Σ_2) = ‖log(Σ_1^{−1} Σ_2)‖_F (T. Fletcher & S. Joshi, 2004; M. Moakher, 2005).
α-divergence, for −1 < α < 1 (Z. Chebbi & M. Moakher, 2012):
d_α^D(Σ_1, Σ_2) = (4/(1 − α²)) log [ det((1−α)/2 Σ_1 + (1+α)/2 Σ_2) / (det(Σ_1)^{(1−α)/2} det(Σ_2)^{(1+α)/2}) ].
Bhattacharyya (Z. Chebbi & M. Moakher, 2012):
d_B(Σ_1, Σ_2) = ( log [ det(½(Σ_1 + Σ_2)) / (det(Σ_1) det(Σ_2))^{1/2} ] )^{1/2}.

Experimental results
Euclidean distances yield the lowest results, usually attributed to the invariance under inversion that is not guaranteed; they also display the swelling effect.
Riemannian approaches outperform state-of-the-art methods (CCA+SVM).
The α-divergence shows the best performance,
but requires a costly optimization to find the best α value.
Bhattacharyya has the lowest computational cost and a good accuracy.
Figure: classification accuracy (%) and CPU time (s) as functions of the α value, for α ∈ [−1, 1].

Conclusion
Working with covariance matrices in BCI achieves very good results; simple algorithms such as MDM and the Riemannian potato work well; there is a need for robust and online methods.
Interesting applications for information geometry: many freely available datasets, several competitions, many open-source toolboxes for manipulating EEG.
Several open questions remain:
handling electrode misplacements and other artifacts;
missing data and covariance matrices of lower rank;
inter- and intra-individual variabilities.

Thank you!

Interaction loop (backup slide)
The BCI loop: 1. acquisition; 2. preprocessing; 3. translation; 4. user feedback. The first systems appeared in the early 1970s.
Group Theoretical Study on Geodesics for the Elliptical Models
Hiroto Inoue (Kyushu University, Japan)
October 28, 2015, GSI2015, École Polytechnique, Paris-Saclay, France

Overview
1. Eriksen's construction of geodesics on the normal model; problem
2. Reconsideration of Eriksen's argument; embedding N_n → Sym⁺_{n+1}(R)
3. Geodesic equation on the elliptical model
4. Future work

Eriksen's construction of geodesics on the normal model
Let Sym⁺_n(R) be the set of n-dimensional positive-definite matrices. The normal model N_n = (M, ds²) is the Riemannian manifold defined by
M = {(μ, Σ) ∈ R^n × Sym⁺_n(R)},  ds² = (dμ)ᵀ Σ^{−1} (dμ) + ½ tr((Σ^{−1} dΣ)²).
The geodesic equation on N_n is
μ̈ − Σ̇ Σ^{−1} μ̇ = 0,  Σ̈ + μ̇ μ̇ᵀ − Σ̇ Σ^{−1} Σ̇ = 0.   (1)
The solution of this geodesic equation was obtained by Eriksen.

Theorem ([Eriksen 1987]). For any x ∈ R^n and B ∈ Sym_n(R), define the matrix exponential
Λ(t) = ( Δ, δ, Φ ; ᵗδ, ∗, ᵗγ ; ᵗΦ, γ, Γ ) := exp(−tA),  A = ( B, x, 0 ; ᵗx, 0, −ᵗx ; 0, −x, −B ) ∈ Mat_{2n+1}.   (2)
Then the curve (μ(t), Σ(t)) := (−Δ^{−1}δ, Δ^{−1}) is the geodesic on N_n satisfying the initial conditions (μ(0), Σ(0)) = (0, I_n), (μ̇(0), Σ̇(0)) = (x, B).
(Proof: one checks from the definition that (μ(t), Σ(t)) satisfies the geodesic equation.)

Problem
1. Explain Eriksen's theorem, clarifying the relation between the normal model and symmetric spaces.
2. Extend Eriksen's theorem to the elliptical model.

Reconsideration of Eriksen's argument
Note that the positive-definite symmetric matrices Sym⁺_{n+1}(R) form a symmetric space via
G/K ≅ Sym⁺_{n+1}(R),  gK ↦ g ᵗg,  where G = GL_{n+1}(R), K = O(n+1).
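Eriksen's construction is easy to check numerically in the univariate case n = 1, where A is the 3×3 matrix in (2) with scalars B and x. The sketch below is an illustration only, using a plain Taylor-series matrix exponential; it recovers (μ(t), Σ(t)) = (−Δ^{−1}δ, Δ^{−1}) and the initial velocities can be checked by finite differences.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(M, terms=60):
    # Taylor series exp(M) = sum_k M^k / k!; adequate for these small
    # matrices with moderate norm.
    n = len(M)
    out = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in out]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, M)]
        out = [[out[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return out

def eriksen_geodesic(x, B, t):
    # n = 1: Delta and delta are the (0,0) and (0,1) entries of exp(-tA).
    A = [[B, x, 0.0], [x, 0.0, -x], [0.0, -x, -B]]
    L = expm([[-t * a for a in row] for row in A])
    Delta, delta = L[0][0], L[0][1]
    return -delta / Delta, 1.0 / Delta   # (mu(t), Sigma(t))

mu0, S0 = eriksen_geodesic(0.5, 0.3, 0.0)  # initial condition: mu = 0, Sigma = 1
```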
This space G/K has the G-invariant Riemannian metric ds² = ½ tr((S^{−1} dS)²).

Embedding N_n → Sym⁺_{n+1}(R)
Put the affine subgroup
G_A := { ( P, μ ; 0, 1 ) : P ∈ GL_n(R), μ ∈ R^n } ⊂ GL_{n+1}(R),
and define a Riemannian submanifold as the orbit G_A · I_{n+1} = {g ᵗg : g ∈ G_A} ⊂ Sym⁺_{n+1}(R).

Theorem (cf. [Calvo, Oller 2001]). We have the isometry
N_n → G_A · I_{n+1} ⊂ Sym⁺_{n+1}(R),  (Σ, μ) ↦ ( Σ + μᵗμ, μ ; ᵗμ, 1 ).   (3)

By using this embedding, we get a simpler expression of the metric and the geodesic equation on N_n ≅ G_A · I_{n+1}:
coordinates: (Σ, μ) ↦ S = ( Σ + μᵗμ, μ ; ᵗμ, 1 );
metric: ds² = (dμ)ᵀ Σ^{−1}(dμ) + ½ tr((Σ^{−1}dΣ)²) ⇔ ds² = ½ tr((S^{−1}dS)²);
geodesic equation: (1) ⇔ (I_n, 0)(S^{−1}Ṡ) = (B, x).

We can interpret Eriksen's argument as follows. The differential equation Λ^{−1}Λ̇ = −A, with A as in (2), is solved by Λ(t) = e^{−tA}; since JAJ = −A with J = antidiag(I_n, 1, I_n), the solution satisfies JΛJ = Λ^{−1} and stays in Sym⁺_{2n+1}(R). Projecting to the upper-left (n+1)×(n+1) block and inverting,
S := ( Δ, δ ; ᵗδ, ∗ )^{−1} ∈ N_n ≅ G_A · I_{n+1} ⊂ Sym⁺_{n+1}(R)
solves the geodesic equation (I_n, 0)(S^{−1}Ṡ) = (B, x). This projection step is the essential point.
Geodesic equation on the elliptical model
Definition. Define a Riemannian manifold E_n(α) = (M, ds²) by
M = {(μ, Σ) ∈ R^n × Sym⁺_n(R)},
ds² = (dμ)ᵀ Σ^{−1}(dμ) + ½ tr((Σ^{−1}dΣ)²) + ½ d_α (tr(Σ^{−1}dΣ))²,   (4)
where d_α = (n+1)α² + 2α, α ∈ C. Then E_n(0) = N_n.
The geodesic equation on E_n(α) is
μ̈ − Σ̇Σ^{−1}μ̇ = 0,  Σ̈ + μ̇μ̇ᵀ − Σ̇Σ^{−1}Σ̇ − (d_α/(n d_α + 1)) μ̇ᵀΣ^{−1}μ̇ Σ = 0.   (5)
This is equivalent to the geodesic equation on the elliptical model.

The manifold E_n(α) is also embedded into the positive-definite symmetric matrices Sym⁺_{n+1}(R) (cf. [Calvo, Oller 2001]), via some affine group G_A(α), and we have a simpler expression of the geodesic equation on E_n(α) ≅ G_A(α) · I_{n+1}:
coordinates: (Σ, μ) ↦ S = |Σ|^α ( Σ + μᵗμ, μ ; ᵗμ, 1 ), where |A| denotes det A;
metric: (4) ⇔ ds² = ½ tr((S^{−1}dS)²);
geodesic equation: (5) ⇔ (I_n, 0)(S^{−1}Ṡ) = (C, x) − α (log|S|)˙ (I_n, 0).

But, in general, we have not yet constructed a submanifold N ⊂ Sym⁺_{2n+1}(R) whose projection is E_n(α): the diagram above (differential equation Λ^{−1}Λ̇ = −A, solution Λ(t), projection to Sym⁺_{n+1}(R)) has no known analogue, and the geodesic equation on the elliptical model has not been solved.

Future work
1. Extend Eriksen's theorem to elliptical models (ongoing).
2. Find an Eriksen-type theorem for general symmetric spaces G/K. Sketch of the problem: for a projection p: G̃/K̃ → G/K, find a geodesic submanifold N ⊂ G̃/K̃ such that p restricted to N maps all geodesics to geodesics.

References
Calvo, M. and Oller, J. M. (2002). A distance between elliptical distributions based in an embedding into the Siegel group. J. Comput. Appl. Math., 145:319–334.
Eriksen, P. S. (1987). Geodesics connected with the Fisher metric on the multivariate normal manifold. In Proceedings of the GST Workshop, Lancaster, pp. 225–229.
We introduce a class of paths, or one-parameter models, connecting two arbitrary probability density functions (pdf's). The class is derived by employing the Kolmogorov-Nagumo average between the two pdf's. There is a variety of such path connectedness on the space of pdf's, since the Kolmogorov-Nagumo average is applicable for any convex and strictly increasing function. Information-geometric insight is provided for understanding the probabilistic properties of the statistical methods associated with the path connectedness. The one-parameter model is extended to a multidimensional model, on which statistical inference is characterized by sufficient statistics.

Path connectedness on a space of probability density functions
Osamu Komori (University of Fukui, Japan) and Shinto Eguchi (The Institute of Statistical Mathematics, Japan)
École Polytechnique, Paris-Saclay (France), October 28, 2015
Komori, O. (University of Fukui), GSI2015

Contents
1. Kolmogorov-Nagumo (KN) average
2. Parallel displacement A_t^{(φ)} characterizing the φ-path
3. U-divergence and its associated geodesic

Setting
Terminology: X is the data space, P is a probability measure on X, and F_P is the space of probability density functions associated with P. We consider a path connecting f and g, where f, g ∈ F_P, and investigate its properties from the viewpoint of information geometry.

Kolmogorov-Nagumo (KN) average
Let φ: (0, ∞) → R be a monotonically increasing, concave, continuous function. Then for f and g in F_P, the Kolmogorov-Nagumo (KN) average is
φ^{−1}((1 − t)φ(f(x)) + tφ(g(x))), for 0 ≤ t ≤ 1.
Remark 1: φ^{−1} is monotonically increasing, convex and continuous on (0, ∞).

φ-path
Based on the KN average, we consider the φ-path connecting f and g in F_P:
f_t(x, φ) = φ^{−1}((1 − t)φ(f(x)) + tφ(g(x)) − κ_t),
where κ_t ≤ 0 is a normalizing factor, with equality if t = 0 or t = 1.

Existence of κ_t
Theorem 1. There uniquely exists κ_t such that
∫_X φ^{−1}((1 − t)φ(f(x)) + tφ(g(x)) − κ_t) dP(x) = 1.
Proof. From the convexity of φ^{−1}, we have
0 ≤ ∫ φ^{−1}((1 − t)φ(f(x)) + tφ(g(x))) dP(x) ≤ ∫ {(1 − t)f(x) + t g(x)} dP(x) ≤ 1.
We also observe that lim_{c→∞} φ^{−1}(c) = +∞, since φ^{−1} is monotonically increasing. Hence the continuity of φ^{−1} leads to the existence of a κ_t satisfying the equation above.

Figure: illustration of the φ-path.
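For the φ₀-path with φ₀ = log (the exponential geodesic of Example 1 below), κ_t has a closed form between two unit-variance Gaussians: with f = N(0,1) and g = N(m,1), a standard Gaussian integral gives κ_t = −t(1−t)m²/2 ≤ 0. A small numerical check (illustrative only, with dP taken as Lebesgue measure):

```python
import math

def phi0_kappa(m, t, a=-12.0, b=13.0, n=50000):
    # kappa_t = log ∫ f(x)^(1-t) g(x)^t dx for f = N(0,1), g = N(m,1),
    # computed with a midpoint Riemann sum.
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        lf = -x * x / 2 - math.log(math.sqrt(2 * math.pi))
        lg = -(x - m) ** 2 / 2 - math.log(math.sqrt(2 * math.pi))
        total += math.exp((1 - t) * lf + t * lg)
    return math.log(total * h)

print(phi0_kappa(1.0, 0.5))  # ~ -0.125 = -t*(1-t)*m^2/2
```

The sign κ_t ≤ 0 is exactly the normalization deficit predicted by the convexity argument in Theorem 1.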
Examples of the $\varphi$-path (Example 1).
1. $\varphi_0(x) = \log x$. The $\varphi_0$-path is $f_t(x,\varphi_0) = \exp\bigl((1-t)\log f(x) + t\log g(x) - \kappa_t\bigr)$, where $\kappa_t = \log \int \exp\bigl((1-t)\log f(x) + t\log g(x)\bigr)\,dP(x)$.
2. $\varphi_\eta(x) = \log(x+\eta)$ with $\eta \ge 0$. The $\varphi_\eta$-path is $f_t(x,\varphi_\eta) = \exp\bigl[(1-t)\log\{f(x)+\eta\} + t\log\{g(x)+\eta\} - \kappa_t\bigr] - \eta$, where $\kappa_t = \log\Bigl[\int \exp\bigl[(1-t)\log\{f(x)+\eta\} + t\log\{g(x)+\eta\}\bigr]\,dP(x) \big/ (1+\eta)\Bigr]$.
3. $\varphi_\beta(x) = (x^\beta - 1)/\beta$ with $\beta \le 1$. The $\varphi_\beta$-path is $f_t(x,\varphi_\beta) = \{(1-t)f(x)^\beta + t\,g(x)^\beta - \kappa_t\}^{1/\beta}$, where $\kappa_t$ does not have an explicit form.

Extended expectation. For a function $a(x): X \to \mathbb{R}$, we consider
$$E^{(\varphi)}_f\{a(X)\} = \frac{\int_X \frac{1}{\varphi'(f(x))}\, a(x)\,dP(x)}{\int_X \frac{1}{\varphi'(f(x))}\,dP(x)},$$
where $\varphi: (0,\infty) \to \mathbb{R}$ is a generator function.
Remark 2: if $\varphi(t) = \log t$, then $E^{(\varphi)}$ reduces to the usual expectation.

Properties of the extended expectation:
1 $E^{(\varphi)}_f(c) = c$ for any constant $c$.
2 $E^{(\varphi)}_f\{c\,a(X)\} = c\,E^{(\varphi)}_f\{a(X)\}$ for any constant $c$.
3 $E^{(\varphi)}_f\{a(X) + b(X)\} = E^{(\varphi)}_f\{a(X)\} + E^{(\varphi)}_f\{b(X)\}$.
4 $E^{(\varphi)}_f\{a(X)^2\} \ge 0$, with equality if and only if $a(x) = 0$ for $P$-almost every $x \in X$.
Remark 3: if we define $f^{(\varphi)}(x) = \frac{1/\varphi'(f(x))}{\int_X 1/\varphi'(f(x))\,dP(x)}$, then $E^{(\varphi)}_f\{a(X)\} = E_{f^{(\varphi)}}\{a(X)\}$.

Tangent space of $\mathcal{F}_P$. Let $H_f$ be a Hilbert space with inner product $\langle a,b\rangle_f = E^{(\varphi)}_f\{a(X)b(X)\}$, and define the tangent space $T_f = \{a \in H_f : \langle a,1\rangle_f = 0\}$. For a statistical model $M = \{f_\theta(x)\}_{\theta\in\Theta}$ we have $E^{(\varphi)}_{f_\theta}\{\partial_i \varphi(f_\theta(X))\} = 0$ for all $\theta \in \Theta$, where $\partial_i = \partial/\partial\theta_i$ and $\theta = (\theta_i)_{i=1,\dots,p}$.
Further,
$$E^{(\varphi)}_{f_\theta}\{\partial_i\partial_j \varphi(f_\theta(X))\} = E^{(\varphi)}_{f_\theta}\Bigl\{\frac{\varphi''(f_\theta(X))}{\varphi'(f_\theta(X))^2}\,\partial_i\varphi(f_\theta(X))\,\partial_j\varphi(f_\theta(X))\Bigr\}.$$

Parallel displacement $A^{(\varphi)}_t$. Define $A^{(\varphi)}_t(x) \in T_{f_t}$ as the solution of the differential equation
$$\dot A^{(\varphi)}_t(x) - E^{(\varphi)}_{f_t}\Bigl\{A^{(\varphi)}_t \dot f_t \frac{\varphi''(f_t)}{\varphi'(f_t)}\Bigr\} = 0,$$
where $f_t$ is a path connecting $f$ and $g$ such that $f_0 = f$ and $f_1 = g$, and $\dot A^{(\varphi)}_t(x)$ is the derivative of $A^{(\varphi)}_t(x)$ with respect to $t$.
Theorem 2. The geodesic curve $\{f_t\}_{0\le t\le 1}$ induced by the parallel displacement $A^{(\varphi)}_t$ is the $\varphi$-path.

U-divergence. Assume $U(s)$ is a convex, increasing function of a scalar $s$ and let $\xi(t) = \arg\max_s\{st - U(s)\}$. Then the U-divergence is
$$D_U(f,g) = \int \{U(\xi(g)) - f\,\xi(g)\}\,dP - \int \{U(\xi(f)) - f\,\xi(f)\}\,dP.$$
In fact, the U-divergence is the difference of the cross entropy $C_U(f,g)$ and the diagonal entropy $C_U(f,f)$, where $C_U(f,g) = \int \{U(\xi(g)) - f\,\xi(g)\}\,dP$.

Connections based on the U-divergence. For a finite-dimensional manifold $M = \{f_\theta(x) : \theta \in \Theta\}$ and vector fields $X$ and $Y$ on $M$, the Riemannian metric is $G^{(U)}(X,Y)(f) = \int (Xf)\,(Y\xi(f))\,dP$ for $f \in M$, and the linear connections $\nabla^{(U)}$ and $\nabla^{*(U)}$ are given by
$$G^{(U)}(\nabla^{(U)}_X Y, Z)(f) = \int (XYf)\,(Z\xi(f))\,dP \quad\text{and}\quad G^{(U)}(\nabla^{*(U)}_X Y, Z)(f) = \int (Zf)\,(XY\xi(f))\,dP.$$
See Eguchi (1992) for details.

Equivalence between the $\nabla^*$-geodesic and the $\xi$-path. Let $\nabla^{(U)}$ and $\nabla^{*(U)}$ be the linear connections associated with the U-divergence $D_U$, and let $C^{(\varphi)} = \{f_t(x,\varphi) : 0 \le t \le 1\}$ be the $\varphi$-path connecting $f$ and $g$ in $\mathcal{F}_P$.
Then we have
Theorem 3. A $\nabla^{(U)}$-geodesic curve connecting $f$ and $g$ is equal to $C^{(\mathrm{id})}$, where $\mathrm{id}$ denotes the identity function; a $\nabla^{*(U)}$-geodesic curve connecting $f$ and $g$ is equal to $C^{(\xi)}$, where $\xi(t) = \arg\max_s\{st - U(s)\}$.

Summary
1 We considered the $\varphi$-path based on the Kolmogorov-Nagumo average.
2 The relation between the U-divergence and the $\varphi$-path was investigated ($\varphi$ corresponds to $\xi$).
3 The idea of the $\varphi$-path can be applied to probability density estimation as well as classification problems.
4 A divergence associated with the $\varphi$-path can be considered; a special case would be the Bhattacharyya divergence.
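As a quick numerical check of Remarks 2 and 3 from the extended-expectation slides, on a finite sample space (P taken as the counting measure) $E^{(\varphi)}_f$ is just a weighted mean with weights proportional to $1/\varphi'(f(x))$; with φ = log the weights reduce to f itself. The function name `ext_expectation` and the toy values are ours, for illustration only.

```python
import numpy as np

def ext_expectation(f, a, phi_prime):
    """E^(phi)_f{a(X)}: weighted mean of a(x) with weights 1/phi'(f(x)),
    normalised as in Remark 3."""
    w = 1.0 / phi_prime(f)
    return (w * a).sum() / w.sum()

f = np.array([0.5, 0.3, 0.2])          # a density on a 3-point sample space
a = np.array([1.0, 2.0, 4.0])

# phi(t) = log t  =>  1/phi'(f) = f, recovering the usual E_f[a] (Remark 2)
print(ext_expectation(f, a, lambda x: 1 / x))   # 0.5*1 + 0.3*2 + 0.2*4 = 1.9
```

Property 1 ($E^{(\varphi)}_f(c) = c$) holds for any generator, since the weights are normalised.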
Computational Information Geometry in mixture modelling
Germain Van Bever¹, R. Sabolová¹, F. Critchley¹ and P. Marriott²
¹ The Open University (EPSRC grant EP/L010429/1), United Kingdom; ² University of Waterloo, Canada
GSI'15, 28-30 October 2015, Paris

Outline
1 Computational Information Geometry: Information Geometry; CIG
2 Mixture modelling: Introduction; Lindsay's convex geometry; (C)IG for mixture distributions

Generalities. The use of geometry in statistics has given birth to many different approaches. Traditionally, information geometry refers to the application of differential geometry to statistical theory and practice. The main ingredients of IG in exponential families (Amari, 1985) are
1 the manifold of parameters $M$,
2 the Riemannian (Fisher information) metric $g$, and
3 the pair of affine connections $\{-1, +1\}$ (mixture and exponential connections).
These allow one to define notions of curvature, dimension reduction, information loss and invariant higher-order expansions. Two affine structures (maps on $M$) are used simultaneously:
$-1$: mixture affine geometry on probability measures: $\lambda f(x) + (1-\lambda) g(x)$;
$+1$: exponential affine geometry on probability measures: $C(\lambda)\, f(x)^{\lambda}\, g(x)^{1-\lambda}$.
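The two affine combinations above can be computed directly for discrete densities; a minimal sketch with our own toy vectors (not from the talk):

```python
import numpy as np

# -1 (mixture) and +1 (exponential) geodesics between two discrete densities
f = np.array([0.7, 0.2, 0.1])
g = np.array([0.1, 0.3, 0.6])
lam = 0.5

mix = lam * f + (1 - lam) * g          # -1 geodesic: already a density
e = f**lam * g**(1 - lam)
exp_geo = e / e.sum()                  # +1 geodesic: C(lam) = 1/e.sum()
print(mix, exp_geo)                    # two different midpoints
```

The mixture midpoint averages mass directly, while the exponential midpoint is a renormalised geometric mean; the two generally differ, which is why both connections are needed.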
Computational Information Geometry. This talk is about Computational Information Geometry (CIG; Critchley and Marriott, 2014).
1 In CIG, the multinomial model provides, modulo discretization, a universal model. It therefore moves from manifold-based systems to simplex-based geometries and allows for different supports in the extended simplex.
2 It provides a unifying framework for different geometries.
3 Tractability of the geometry allows for efficient algorithms in a computational framework.
It is inherently finite and discrete; the impact of discretization is studied. A working model will be a subset of the simplex.

Multinomial distributions. $X \sim \mathrm{Mult}(\pi_0,\ldots,\pi_k)$, $\pi = (\pi_0,\ldots,\pi_k) \in \mathrm{int}(\Delta^k)$, with
$$\Delta^k := \Bigl\{\pi : \pi_i \ge 0,\ \sum_{i=0}^k \pi_i = 1\Bigr\}.$$
In this case $\pi_{(0)} = (\pi_1,\ldots,\pi_k)$ is the mean parameter, while $\eta = \log(\pi_{(0)}/\pi_0)$ is the natural parameter. Studying limits gives extended exponential families on the closed simplex (Csiszár and Matúš, 2005). [Figure: mixed geodesics in the $-1$-space ($\pi$-coordinates) and in the $+1$-space ($\eta$-coordinates).]

Restricting to multinomial families. For regular exponential families with compact support, the cost of discretization on the components of information geometry is bounded. The same holds for the MLE and the log-likelihood function. The log-likelihood $\ell(x,\pi) = \sum_{i=0}^k n_i \log \pi_i$ is (i) strictly concave (in the $-1$-representation) on the observed face (counts $n_i > 0$), (ii) strictly decreasing in the normal direction towards the unobserved face ($n_i = 0$), and, otherwise, (iii) constant.
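The two parametrisations of the multinomial above are easy to move between in code. This sketch (function names are our own) checks that the mean-parameter and natural-parameter charts are mutually inverse on the open simplex:

```python
import numpy as np

def to_natural(pi):
    """eta = log(pi_(0) / pi_0): natural parameter, pi_(0) = (pi_1, ..., pi_k)."""
    return np.log(pi[1:] / pi[0])

def to_mean(eta):
    """Inverse chart: softmax of (0, eta) returns pi in int(Delta^k)."""
    z = np.concatenate(([0.0], eta))
    p = np.exp(z - z.max())            # subtract max for numerical stability
    return p / p.sum()

pi = np.array([0.5, 0.3, 0.2])
print(np.allclose(to_mean(to_natural(pi)), pi))   # True
```

The natural parameter blows up as any $\pi_i \to 0$, which is exactly where the extended (closed-simplex) families of Csiszár and Matúš enter.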
Considering an infinite-dimensional simplex allows one to remove the compactness assumption (Critchley and Marriott, 2014).

Binomial subfamilies. A (discrete) example: binomial distributions as a subfamily of multinomial distributions. Let $X \sim \mathrm{Bin}(k,p)$. Then $X$ can be seen as an element of a subfamily of $M = \{X : X \sim \mathrm{Mult}(\pi_0,\ldots,\pi_k)\}$, with $\pi_i(p) = \binom{k}{i} p^i (1-p)^{k-i}$.
Figure: Left: embedded binomial ($k = 2$) in the 2-simplex. Right: embedded binomial ($k = 3$) in the 3-simplex.

Mixture distributions. The generic mixture distribution is $f(x;Q) = \int f(x;\theta)\,dQ(\theta)$, that is, a mixture of (regular) parametric distributions. Regularity: same support $S$, absolutely continuous with respect to a measure $\nu$. Mixture distributions arise naturally in many statistical problems, including overdispersed models, random-effects ANOVA, random-coefficient regression and measurement-error models, graphical models, and many more.

Hard mixture problems. Inference in the class of mixture distributions generates well-known difficulties. Identifiability issues: without imposing constraints on the mixing distribution $Q$, there may exist $Q_1$ and $Q_2$ such that $f(x;Q_1) = \int f(x;\theta)\,dQ_1(\theta) = \int f(x;\theta)\,dQ_2(\theta) = f(x;Q_2)$.
By-products: parametrisation issues; multimodal likelihood functions. Boundary problems; by-product: singularities in the likelihood function.

NPMLE. Finite mixtures are essential to the geometry. Lindsay argues that nonparametric estimation of $Q$ is necessary. Also:
Theorem. The log-likelihood $\ell(Q) = \sum_{s=1}^n \log L_s(Q) = \sum_{s=1}^n \log\bigl(\int f(x_s;\theta)\,dQ(\theta)\bigr)$ has a unique maximum over the space of all distribution functions $Q$. Furthermore, the maximiser $\hat Q$ is a discrete distribution with no more than $D$ distinct points of support, where $D$ is the number of distinct points in $(x_1,\ldots,x_n)$.
The likelihood on the space of mixtures is therefore defined on the convex hull of the image of $\theta \mapsto (L_1(\theta),\ldots,L_D(\theta))$. Finding the NPMLE amounts to maximizing a concave function over this convex set.

Limits to convex geometry. Knowing the shape of the likelihood on the whole simplex (and not only on the observed face) gives extra insight. Convex geometry correctly captures the $-1$-geometry of the simplex but NOT the $0$- and $+1$-geometries (for example, the Fisher information requires knowing the full sample space). Understanding the (C)IG of mixtures in the simplex will therefore provide extra tools (and algorithms) in mixture modelling. In this talk, we mention results on
1 the $(-1)$-dimensionality of exponential families in the simplex;
2 convex polytope approximation algorithms: information geometry can give efficient approximations of high-dimensional convex hulls by polytopes.
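Lindsay's theorem says the NPMLE $\hat Q$ is discrete, which suggests a simple numerical approximation: fix a fine grid of candidate support points and maximise the log-likelihood over the mixing weights, e.g. by EM. The sketch below does this for a binomial mixture; the data, the grid, and the name `npmle_em` are our illustrative assumptions, not from the talk.

```python
import numpy as np
from math import comb

def npmle_em(x, k, grid, iters=500):
    """EM over mixing weights for a Bin(k, p) mixture with support fixed on
    `grid` -- a discretised approximation to Lindsay's NPMLE (the exact
    maximiser is discrete with at most D support points)."""
    L = np.array([[comb(k, xi) * p**xi * (1 - p)**(k - xi) for p in grid]
                  for xi in x])                      # n x m likelihood matrix
    w = np.full(len(grid), 1.0 / len(grid))
    for _ in range(iters):
        post = L * w
        post /= post.sum(axis=1, keepdims=True)      # E-step: responsibilities
        w = post.mean(axis=0)                        # M-step: reweight
    return w

x = np.array([0, 1, 1, 2, 7, 8, 8])                  # overdispersed counts, k = 8
grid = np.linspace(0.01, 0.99, 50)
w = npmle_em(x, 8, grid)
# the fitted weights come out bimodal (mass near p ~ 0.13 and p ~ 0.96),
# which no single binomial can reproduce
```

For fixed support the log-likelihood is concave in the weights, so EM converges to the global maximum over this grid, mirroring the concave-maximisation picture above.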
Local mixture models (IG). Parametric vs nonparametric dilemma: geometric analysis allows low-dimensional approximation in local setups.
Theorem (Marriott, 2002). If $f(x;\theta)$ is an $n$-dimensional exponential family satisfying regularity conditions and $Q_\lambda(\theta)$ is a local mixing around $\theta_0$, then $f(x;Q_\lambda) = \int f(x;\theta)\,dQ_\lambda(\theta)$ has the expansion
$$f(x;Q_\lambda) - f(x;\theta_0) - \sum_{i=1}^n \lambda_i \frac{\partial}{\partial\theta_i} f(x;\theta_0) - \sum_{i,j=1}^n \lambda_{ij} \frac{\partial^2}{\partial\theta_i\partial\theta_j} f(x;\theta_0) = O(\|\lambda\|^3).$$
This is equivalent to $f(x;Q_\lambda) \in T^2 M_{\theta_0} + O(\|\lambda\|^3)$. If the density $f(x;\theta)$ and all its derivatives are bounded, then the approximation is uniform in $x$.

Dimensionality in CIG. It is therefore possible to approximate mixture distributions with low-dimensional families. In contrast, the $(-1)$-representation of any generic exponential family on the simplex always has full dimension. The following result is even more general.
Theorem (VB et al.). The $-1$-convex hull of an open subset of an exponential subfamily of $M$ with tangent dimension $k - d$ has dimension at least $k - d$.
Corollary (Critchley and Marriott, 2014). The $-1$-convex hull of an open subset of a generic one-dimensional subfamily of $M$ is of full dimension.
The tangent dimension is the maximal number of distinct components of any $(+1)$ tangent vector to the exponential family. Generic ↔ tangent dimension $= k$, i.e. the tangent vector has distinct components.

Example: mixture of binomials. As mentioned, IG gives efficient approximation by polytopes; one maximises a concave function on (convex) polytopes. Example: toxicological data (Kupper and Haseman, 1978):
'simple one-parameter binomial [...] models generally provide poor fits to this type of binary data'.

Approximation in CIG. Define the norm $\|\pi\|_{\pi_0} = \sum_{i=1}^k \pi_i^2/\pi_{i,0}$ (preferred point metric; Critchley et al., 1993). Let $\pi(\theta)$ be an exponential family and $\cup S_i$ a polytope surface. Define the distance function $d(\pi(\theta), \pi_0) := \inf_{\pi \in \cup S_i} \|\pi(\theta) - \pi\|_{\pi_0}$.
Theorem (Anaya-Izquierdo et al.). Let $\cup S_i$ be such that $d(\pi(\theta)) \le \varepsilon$ for all $\theta$. Then
$$\ell(\hat\pi^{\mathrm{NPMLE}}) - \ell(\hat\pi) \le N\,\|\hat\pi^{G} - \hat\pi^{\mathrm{NPMLE}}\|_{\hat\pi}\,\varepsilon + o(\varepsilon),$$
where $(\hat\pi^{G})_i = n_i/N$ and $\hat\pi$ is the NPMLE on $\cup S_i$.

Summary. The high-dimensional (extended) multinomial space is used as a proxy for the 'space of all models'. This computational approach encompasses Amari's information geometry and Lindsay's convex geometry, while having a tractable and mostly explicit geometry, which allows for a computational theory.
Future work: a converse of the dimensionality result ($-1$ to $+1$); long-term aim: implementing geometric theories within an R package.

References:
Amari, S.-I. (1985), Differential-geometrical methods in statistics, Springer-Verlag.
Anaya-Izquierdo, K., Critchley, F., Marriott, P. and Vos, P. (2012), Computational information geometry: theory and practice, arXiv report 1209.1988v1.
Critchley, F., Marriott, P. and Salmon, M. (1993), Preferred point geometry and statistical manifolds, The Annals of Statistics, 21(3), 1197-1224.
Critchley, F. and Marriott, P.
(2014), Computational information geometry in statistics: theory and practice, Entropy, 16, 2454-2471.
Csiszár, I. and Matúš, F. (2005), Closures of exponential families, The Annals of Probability, 33(2), 582-600.
Kupper, L.L. and Haseman, J.K. (1978), The use of a correlated binomial model for the analysis of certain toxicological experiments, Biometrics, 34(1), 69-76.
Marriott, P. (2002), On the local geometry of mixture models, Biometrika, 89(1), 77-93.
Bayesian and Information Geometry for Inverse Problems (chaired by Ali Mohammad-Djafari, Olivier Schwander)
We review the manifold projection method for stochastic nonlinear filtering in a more general setting than in our previous paper presented at Geometric Science of Information 2013. We again use a Hilbert-space structure on a space of probability densities to project the infinite-dimensional stochastic partial differential equation for the optimal filter onto a finite-dimensional exponential or mixture family, respectively, under two different metrics: the Hellinger distance and the direct L2 metric. This reduces the problem to finite-dimensional stochastic differential equations. In this paper we summarize a previous equivalence result between assumed density filters (ADF) and Hellinger/exponential projection filters, and introduce a new equivalence between Galerkin-method-based filters and direct-metric/mixture projection filters. This result allows us to give a rigorous geometric interpretation to ADF and Galerkin filters. We also discuss the different finite-dimensional filters obtained when projecting the stochastic partial differential equation for either the normalized (Kushner-Stratonovich) or a specific unnormalized (Zakai) density of the optimal filter.

Stochastic PDE projection on manifolds: Assumed-Density and Galerkin Filters
GSI 2015, Oct 28, 2015, Paris
Damiano Brigo, Dept. of Mathematics, Imperial College London, www.damianobrigo.it
Joint work with John Armstrong, Dept. of Mathematics, King's College London
Full paper to appear in MCSS; see also arXiv.org.

Spaces of probability densities. Consider a parametric family of probability densities
$$S = \{p(\cdot,\theta),\ \theta \in \Theta \subset \mathbb{R}^m\}, \qquad S^{1/2} = \{\sqrt{p(\cdot,\theta)},\ \theta \in \Theta \subset \mathbb{R}^m\}.$$
If $S$ (or $S^{1/2}$) is a subset of a function space having an $L^2$ structure (hence an inner product, norm and metric), we may ask whether $p(\cdot,\theta) \mapsto \theta \in \mathbb{R}^m$ (respectively $\sqrt{p(\cdot,\theta)} \mapsto \theta$) is a chart of an $m$-dimensional manifold $S$ ($S^{1/2}$). The topology and differential structure in the chart come from the $L^2$ structure, with two possibilities:
$S$: $d_2(p_1,p_2) = \|p_1 - p_2\|$ (direct $L^2$ distance), $p_{1,2} \in L^2$;
$S^{1/2}$: $d_H(\sqrt{p_1},\sqrt{p_2}) = \|\sqrt{p_1} - \sqrt{p_2}\|$ (Hellinger distance), $p_{1,2} \in L^1$,
where $\|\cdot\|$ is the norm of the Hilbert space $L^2$.

Tangent vectors, metrics and projection. If $\varphi : \theta \mapsto p(\cdot,\theta)$ (resp. $\theta \mapsto \sqrt{p(\cdot,\theta)}$) is the inverse of a chart, then $\{\partial\varphi(\cdot,\theta)/\partial\theta_1, \ldots, \partial\varphi(\cdot,\theta)/\partial\theta_m\}$ are linearly independent $L^2(\lambda)$ vectors spanning the tangent space at $\theta$. The inner product of two basis elements is defined by the $L^2$ structure:
$$\Bigl\langle \frac{\partial p(\cdot,\theta)}{\partial\theta_i}, \frac{\partial p(\cdot,\theta)}{\partial\theta_j} \Bigr\rangle = \int \frac{\partial p(x,\theta)}{\partial\theta_i}\frac{\partial p(x,\theta)}{\partial\theta_j}\,dx = \gamma_{ij}(\theta),$$
$$\Bigl\langle \frac{\partial\sqrt{p}}{\partial\theta_i}, \frac{\partial\sqrt{p}}{\partial\theta_j} \Bigr\rangle = \frac{1}{4}\int \frac{1}{p(x,\theta)} \frac{\partial p(x,\theta)}{\partial\theta_i}\frac{\partial p(x,\theta)}{\partial\theta_j}\,dx = \frac{1}{4}\, g_{ij}(\theta).$$
$\gamma(\theta)$: direct $L^2$ matrix ($d_2$); $g(\theta)$: the famous Fisher-Rao matrix ($d_H$). The $d_2$ orthogonal projection is
$$\Pi^{\gamma}_{\theta}[v] = \sum_{i=1}^m \Bigl[\sum_{j=1}^m \gamma^{ij}(\theta)\,\Bigl\langle v, \frac{\partial p(\cdot,\theta)}{\partial\theta_j}\Bigr\rangle\Bigr] \frac{\partial p(\cdot,\theta)}{\partial\theta_i}$$
(the $d_H$ projection is analogous, inserting square roots and replacing $\gamma$ with $g$).

The nonlinear filtering problem for diffusion signals:
$$dX_t = f_t(X_t)\,dt + \sigma_t(X_t)\,dW_t, \quad X_0 \quad \text{(signal)};$$
$$dY_t = b_t(X_t)\,dt + dV_t, \quad Y_0 = 0 \quad \text{(noisy observation)}. \tag{1}$$
These are Itô SDEs. We use both Itô and Stratonovich (Str) SDEs.
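The two matrices γ(θ) and g(θ) can be checked numerically on a toy family. For the Gaussian location family p(x,θ) = N(θ,1), the Fisher-Rao matrix is g(θ) = 1 while the direct-L² matrix is γ(θ) = 1/(4√π) ≈ 0.141; the grid, the finite-difference step, and the family are our illustrative choices, not from the talk.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 40001)
dx = x[1] - x[0]

def p(theta):
    """Gaussian location family N(theta, 1) on the grid."""
    return np.exp(-0.5 * (x - theta)**2) / np.sqrt(2 * np.pi)

theta, h = 0.3, 1e-5
dp = (p(theta + h) - p(theta - h)) / (2 * h)    # d p(., theta)/d theta

g = np.sum(dp * dp / p(theta)) * dx             # Fisher-Rao metric g(theta)
gamma = np.sum(dp * dp) * dx                    # direct L2 metric gamma(theta)
print(round(g, 3), round(gamma, 3))             # 1.0 and 0.141
```

For this family g(θ) is the variance of the score, i.e. exactly 1, independent of θ; γ(θ) is smaller and metric-dependent, which is the quantitative difference the two projections Π^g and Π^γ inherit.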
Str SDEs are necessary to deal with manifolds, since second-order Itô terms are not clear in terms of manifolds [16], although we are working on a direct projection of Itô equations with good optimality properties (John Armstrong).

The nonlinear filtering problem consists in finding the conditional probability distribution $\pi_t$ of the state $X_t$ given the observations up to time $t$, i.e. $\pi_t(dx) := P[X_t \in dx \mid \mathcal{Y}_t]$, where $\mathcal{Y}_t := \sigma(Y_s,\ 0 \le s \le t)$. Assume $\pi_t$ has a density $p_t$; then $p_t$ satisfies the Stratonovich SPDE
$$dp_t = \mathcal{L}^*_t p_t\,dt - \tfrac{1}{2}\, p_t\,\bigl[|b_t|^2 - E_{p_t}\{|b_t|^2\}\bigr]\,dt + \sum_{k=1}^d p_t\,\bigl[b^k_t - E_{p_t}\{b^k_t\}\bigr] \circ dY^k_t,$$
with the forward operator
$$\mathcal{L}^*_t \phi = -\sum_{i=1}^n \frac{\partial}{\partial x_i}\bigl[f^i_t \phi\bigr] + \frac{1}{2}\sum_{i,j=1}^n \frac{\partial^2}{\partial x_i \partial x_j}\bigl[a^{ij}_t \phi\bigr].$$
This is an infinite-dimensional SPDE. Solutions for even toy systems like the cubic sensor ($f = 0$, $\sigma = 1$, $b = x^3$) do not belong to any finite-dimensional family $p(\cdot,\theta)$ [19]. We need finite-dimensional approximations. We can project the SPDE according to either the direct $L^2$ metric ($\gamma(\theta)$) or, by deriving the analogous equation for $\sqrt{p_t}$, according to the Hellinger metric ($g(\theta)$).

Projection transforms the SPDE into a finite-dimensional SDE for $\theta$ via the chain rule (hence Stratonovich calculus): $dp(\cdot,\theta_t) = \sum_{j=1}^m \frac{\partial p(\cdot,\theta)}{\partial\theta_j} \circ d\theta_j(t)$. With Itô calculus we would have terms $\frac{\partial^2 p(\cdot,\theta)}{\partial\theta_i\,\partial\theta_j}\,d\langle\theta_i,\theta_j\rangle$, which are not tangent vectors.

Projection filter in the metrics $d_2$ ($L^2$) and $d_H$ (Fisher). The projected equation in the $d_2$ metric with $\Pi^\gamma$ is
$$d\theta^i_t = \sum_{j=1}^m \gamma^{ij}(\theta_t)\Bigl[\int \mathcal{L}^*_t p(x,\theta_t)\,\frac{\partial p(x,\theta_t)}{\partial\theta_j}\,dx - \frac{1}{2}\int |b_t(x)|^2\,\frac{\partial p}{\partial\theta_j}\,dx\Bigr] dt + \sum_{k=1}^d \Bigl[\sum_{j=1}^m \gamma^{ij}(\theta_t) \int b^k_t(x)\,\frac{\partial p(x,\theta_t)}{\partial\theta_j}\,dx\Bigr] \circ dY^k_t, \quad \theta^i_0.$$
Instead, using the Hellinger distance and the Fisher metric with projection $\Pi^g$:
$$d\theta^i_t = \sum_{j=1}^m g^{ij}(\theta_t)\Bigl[\int \frac{\mathcal{L}^*_t p(x,\theta_t)}{p(x,\theta_t)}\,\frac{\partial p(x,\theta_t)}{\partial\theta_j}\,dx - \frac{1}{2}\int |b_t(x)|^2\,\frac{\partial p}{\partial\theta_j}\,dx\Bigr] dt + \sum_{k=1}^d \Bigl[\sum_{j=1}^m g^{ij}(\theta_t) \int b^k_t(x)\,\frac{\partial p(x,\theta_t)}{\partial\theta_j}\,dx\Bigr] \circ dY^k_t, \quad \theta^i_0.$$

Choosing the family/manifold: exponential. In past literature and in several papers in Bernoulli, IEEE Transactions on Automatic Control, etc., B.
Hanzon and Le Gland developed a theory for the projection filter using the Fisher metric $g$ and exponential families $p(x,\theta) := \exp[\theta^T c(x) - \psi(\theta)]$. This is a good combination:
- The tangent space has a simple structure: square roots do not complicate matters, thanks to the exponential structure.
- The Fisher matrix has a simple structure: $\partial^2_{\theta_i\theta_j} \psi(\theta) = g_{ij}(\theta)$.
- The projection $\Pi^g$ has a simple structure for exponential families.
- A special exponential family with the observation function $b$ among the exponents $c(x)$ makes the filter correction step (projection of the $dY$ term) exact.
- One can define both a local and a global filtering error through $d_H$.
- Alternative coordinates are the expectation parameters, $\eta = E_\theta[c] = \partial_\theta \psi(\theta)$.
- The projection filter in $\eta$ coincides with a classical approximate filter: the assumed density filter (based on generalized "moment matching").

Mixture families. However, exponential families do not couple as well with the metric $\gamma(\theta)$.
Is there some important family for which the metric $\gamma(\theta)$ is preferable to the classical Fisher metric $g(\theta)$, in that the metric, the tangent space and the filter equations are simpler? The answer is affirmative: it is the mixture family.

We define a simple mixture family as follows. Given $m+1$ fixed square-integrable probability densities $q = [q_1, q_2, \ldots, q_{m+1}]^T$, define
$$\hat\theta(\theta) := [\theta_1, \theta_2, \ldots, \theta_m,\ 1 - \theta_1 - \theta_2 - \cdots - \theta_m]^T \quad \text{for all } \theta \in \mathbb{R}^m,$$
writing $\hat\theta$ instead of $\hat\theta(\theta)$. The mixture family (a simplex) is
$$S^M(q) = \{\hat\theta(\theta)^T q : \theta_i \ge 0 \text{ for all } i,\ \theta_1 + \cdots + \theta_m < 1\}.$$

If we consider the $L^2$ / $\gamma(\theta)$ distance, the metric $\gamma(\theta)$ itself and the related projection become very simple. Indeed,
$$\frac{\partial p(\cdot, \theta)}{\partial \theta_i} = q_i - q_{m+1}, \qquad \gamma_{ij}(\theta) = \int (q_i(x) - q_{m+1}(x))\,(q_j(x) - q_{m+1}(x))\, dx,$$
with no inline numerical integration needed. The $L^2$ metric does not depend on the specific point $\theta$ of the manifold. The same holds for the tangent space at $p(\cdot, \theta)$, which is given by
$$\mathrm{span}\{q_1 - q_{m+1},\ q_2 - q_{m+1},\ \ldots,\ q_m - q_{m+1}\}.$$
Also the $L^2$ projection becomes particularly simple.

Mixture Projection Filter

Armstrong and Brigo (MCSS 2016 [3]) show that the mixture family combined with the metric $\gamma(\theta)$ leads to a projection filter that is the same as approximate filtering via Galerkin methods [5]. See the full paper for the details. Summing up:

                                 Exponential family                    Basic mixture family
  Hellinger d_H / Fisher g(θ):   Good (≈ ADF, local moment matching)   Nothing special
  Direct L2 d_2 / matrix γ(θ):   Nothing special                       Good (≈ Galerkin)

However, despite the simplicity above, the mixture family has an important drawback: for all $\theta$, the filter mean is constrained,
$$\min_i\ \mathrm{mean}(q_i)\ \le\ \mathrm{mean}\big(p(\cdot, \theta)\big)\ \le\ \max_i\ \mathrm{mean}(q_i).$$
As a consequence, we are going to enrich our family to a mixture where some of the parameters are also in the core densities $q$.
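Because the tangent vectors $q_i - q_{m+1}$ are fixed, the matrix $\gamma$ can be precomputed once by quadrature. The sketch below uses three unit-variance Gaussians as an illustrative (assumed) choice of core densities, not one taken from the talk:

```python
import numpy as np

# Mixture family p(., theta) = sum_i theta_i q_i + (1 - sum_i theta_i) q_{m+1}.
# gamma_ij = ∫ (q_i - q_{m+1})(q_j - q_{m+1}) dx is independent of theta.

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def normal(mu, v):
    return np.exp(-0.5 * (x - mu) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

q = [normal(-2.0, 1.0), normal(0.0, 1.0), normal(2.0, 1.0)]  # m + 1 = 3 core densities
m = len(q) - 1

# Riemann-sum quadrature for each entry of the (constant) metric matrix.
gamma = np.array([[np.sum((q[i] - q[m]) * (q[j] - q[m])) * dx
                   for j in range(m)] for i in range(m)])
```

The resulting $m \times m$ matrix is symmetric and positive definite, and the same matrix serves for every point of the simplex — the "no inline numerical integration" advantage mentioned above.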
Specifically, we consider a mixture of Gaussian densities, with the means and variances of each component not fixed. For example, for a mixture of two Gaussians we have 5 parameters:
$$\theta\, p_{N(\mu_1, v_1)}(x) + (1 - \theta)\, p_{N(\mu_2, v_2)}(x), \qquad \text{parameters } \theta, \mu_1, v_1, \mu_2, v_2.$$
We are now going to illustrate the Gaussian mixture projection filter (GMPF) in a fundamental example.

The quadratic sensor

Consider the quadratic sensor
$$dX_t = \sigma\, dW_t, \qquad dY_t = X_t^2\, dt + \sigma\, dV_t.$$
The measurements tell us nothing about the sign of $X$.
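The signal/observation pair can be simulated with a straightforward Euler–Maruyama discretization; the step size, horizon and $\sigma$ below are illustrative choices, not values from the talk:

```python
import numpy as np

def simulate_quadratic_sensor(sigma=1.0, dt=0.001, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dX = sigma dW, dY = X^2 dt + sigma dV."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    y = np.empty(n_steps + 1)
    x[0], y[0] = 0.0, 0.0
    for n in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt))
        dV = rng.normal(0.0, np.sqrt(dt))
        x[n + 1] = x[n] + sigma * dW                    # signal: driftless diffusion
        y[n + 1] = y[n] + x[n] ** 2 * dt + sigma * dV   # observation: quadratic sensor
    return x, y

x, y = simulate_quadratic_sensor()
```

Since the sensor observes $X^2$, the observation paths generated from $X$ and from $-X$ are statistically identical, which is exactly why the conditional density becomes bimodal.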
Once it seems likely that the state has moved past the origin, the distribution will become nearly symmetrical: we expect a bimodal distribution. We compare four filters:
- the Gaussian mixture $\theta\, p_{N(\mu_1,v_1)}(x) + (1-\theta)\, p_{N(\mu_2,v_2)}(x)$ (red),
- the exponential family $e^{\theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 - \psi(\theta)}$ (pink),
- the EKF (Gaussian, blue),
- the exact filter (green; finite-difference method, grid of 1000 state points and 5000 time points).

[Figures: "Simulation for the Quadratic Sensor" — filter densities at times 0 through 10; curves: Projection, Exact, Extended Kalman, Exponential; state $X \in [-8, 8]$, density values in $[0, 1]$.]

Comparing local approximation errors ($L^2$ residuals $\varepsilon_t$):
$$\varepsilon_t^2 = \int \big(p_{\mathrm{exact},t}(x) - p_{\mathrm{approx},t}(x)\big)^2\, dx,$$
with three possible choices of $p_{\mathrm{approx},t}(x)$: the Gaussian mixture $\theta\, p_{N(\mu_1,v_1)}(x) + (1-\theta)\, p_{N(\mu_2,v_2)}(x)$ (red), the exponential family $e^{\theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 - \psi(\theta)}$ (blue), and the EKF (Gaussian, green).

[Figure: "L2 residuals for the quadratic sensor" — residuals over time 0–10; curves: Projection Residual (L2 norm), Extended Kalman Residual (L2 norm), Hellinger Projection Residual (L2 norm).]

Comparing local approximation errors (Prokhorov residuals $\varepsilon_t$):
$$\varepsilon_t = \inf\{\epsilon > 0 : F_{\mathrm{exact},t}(x - \epsilon) - \epsilon \le F_{\mathrm{approx},t}(x) \le F_{\mathrm{exact},t}(x + \epsilon) + \epsilon \ \ \forall x\},$$
with $F$ the cumulative distribution functions of the $p$'s. The Lévy–Prokhorov metric works well with singular densities, such as particle approximations, for which the $L^2$ metric is not ideal. We compare the Gaussian mixture $\theta\, p_{N(\mu_1,v_1)}(x) + (1-\theta)\, p_{N(\mu_2,v_2)}(x)$ (red) vs the exponential family $e^{\theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 - \psi(\theta)}$ (green) vs the best three particles (blue).
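Both residuals can be evaluated numerically for densities sampled on a grid. The sketch below is illustrative, not the implementation behind the plots: two unit-variance Gaussians stand in for the exact and approximate filter densities, and the Lévy–Prokhorov radius is found by bisection:

```python
import numpy as np

def l2_residual(p_exact, p_approx, x):
    """eps_t = sqrt( ∫ (p_exact - p_approx)^2 dx ), Riemann sum on the grid."""
    dx = x[1] - x[0]
    return float(np.sqrt(np.sum((p_exact - p_approx) ** 2) * dx))

def levy_residual(F_exact, F_approx, x, tol=1e-4):
    """Levy-Prokhorov residual between two CDFs on a grid, by bisection on eps."""
    def ok(eps):
        lo = np.interp(x - eps, x, F_exact, left=0.0, right=1.0) - eps
        hi = np.interp(x + eps, x, F_exact, left=0.0, right=1.0) + eps
        return bool(np.all((lo <= F_approx + 1e-12) & (F_approx <= hi + 1e-12)))
    a, b = 0.0, 1.0  # eps = 1 always satisfies the sandwich constraints
    while b - a > tol:
        m = 0.5 * (a + b)
        if ok(m):
            b = m
        else:
            a = m
    return b

x = np.linspace(-8.0, 8.0, 2001)
p1 = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)        # stand-in for p_exact: N(0, 1)
p2 = np.exp(-0.5 * (x - 1) ** 2) / np.sqrt(2 * np.pi)  # stand-in for p_approx: N(1, 1)
F1 = np.cumsum(p1) * (x[1] - x[0])
F2 = np.cumsum(p2) * (x[1] - x[0])
```

Identical densities give a residual of (numerically) zero under both metrics, while the shifted Gaussian produces a strictly positive residual, illustrating the role both quantities play as error measures.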
[Figure: "Lévy residuals for the quadratic sensor" — Prokhorov residuals over time 0–10; curves: Prokhorov Residual (L2NM), Prokhorov Residual (HE), Best possible residual (3 Deltas).]

Cubic sensors

[Figure: residuals over time 0–10; curves: Projection Residual (L2 norm), Extended Kalman Residual (L2 norm), Hellinger Projection Residual (L2 norm).]

- Qualitatively similar results hold, up to a stopping time.
- As one approaches the boundary, $\gamma_{ij}$ becomes singular.
- The solution is to dynamically change the parameterization, and even the dimension of the manifold.

Conclusions

- Approximate finite-dimensional filtering by rigorous projection on a chosen manifold of densities.
- The projection uses an overarching $L^2$ structure.
- Two different metrics: the direct $L^2$ metric, and the Hellinger/Fisher metric ($L^2$ on square roots of densities).
- Fisher works well with exponential families: multimodality; exact correction step; simplicity of implementation; equivalence with assumed density filters ("moment matching").
- The direct $L^2$ metric works well with mixture families: even simpler filter equations, with no inline numerical integration; the basic version is equivalent to Galerkin methods; it is also suited to multimodality (quadratic sensor tests, $L^2$ global error); it is comparable with particle methods.
- Further investigation: convergence; more on optimality?
- Optimality: introducing new projections (forthcoming work of J. Armstrong).

Thanks

With thanks to the organizing committee. Thank you for your attention. Questions and comments welcome.
References

[1] J. Aggrawal. Sur l'information de Fisher. In: Théories de l'Information (J. Kampé de Fériet, ed.), Springer-Verlag, Berlin–New York, 1974, pp. 111–117.
[2] S. Amari. Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, Springer-Verlag, Berlin, 1985.
[3] J. Armstrong and D. Brigo. Nonlinear filtering via stochastic PDE projection on mixture manifolds in L2 direct metric. Mathematics of Control, Signals and Systems, 2016, accepted.
[4] R. Beard, J. Kenney, J. Gunther, J. Lawton, and W. Stirling (1999). Nonlinear projection filter based on Galerkin approximation. AIAA Journal of Guidance, Control and Dynamics, 22(2): 258–266.
[5] R. Beard and J. Gunther (1997). Galerkin approximations of the Kushner equation in nonlinear estimation. Working paper, Brigham Young University.
[6] O. E. Barndorff-Nielsen (1978). Information and Exponential Families. John Wiley and Sons, New York.
[7] D. Brigo. Diffusion processes, manifolds of exponential densities, and nonlinear filtering. In: O. E. Barndorff-Nielsen and E. B. Vedel Jensen (eds.), Geometry in Present Day Science, World Scientific, 1999.
[8] D. Brigo. On SDEs with marginal laws evolving in finite-dimensional exponential families. Statistics & Probability Letters, 2000, Vol. 49, pp. 127–134.
[9] D. Brigo (2011). The direct L2 geometric structure on a manifold of probability densities with applications to filtering. Available on arXiv.org and damianobrigo.it.
[10] D. Brigo, B. Hanzon, and F. Le Gland. A differential geometric approach to nonlinear filtering: the projection filter. IEEE Transactions on Automatic Control, 1998, Vol. 43, pp. 247–252.
[11] D. Brigo, B. Hanzon, and F. Le Gland. Approximate nonlinear filtering by projection on exponential manifolds of densities. Bernoulli, 1999, Vol. 5, pp. 495–534.
[12] D. Brigo. Filtering by Projection on the Manifold of Exponential Densities. PhD thesis, Free University of Amsterdam, 1996.
[13] D. Brigo and G. Pistone (1996). Projecting the Fokker–Planck equation onto a finite-dimensional exponential family. Available at arXiv.org.
[14] D. Crisan and B. Rozovskii (eds.) (2011). The Oxford Handbook of Nonlinear Filtering. Oxford University Press.
[15] M. H. A. Davis and S. I. Marcus. An introduction to nonlinear filtering. In: M. Hazewinkel and J. C. Willems (eds.), Stochastic Systems: The Mathematics of Filtering and Identification and Applications, Reidel, Dordrecht, 1981, pp. 53–75.
[16] D. Elworthy (1982). Stochastic Differential Equations on Manifolds. LMS Lecture Notes.
[17] B. Hanzon. A differential-geometric approach to approximate nonlinear filtering. In: C. T. J. Dodson (ed.), Geometrization of Statistical Theory, pp. 219–223, ULMD Publications, University of Lancaster, 1987.
[18] B. Hanzon. Identifiability, Recursive Identification and Spaces of Linear Dynamical Systems. CWI Tracts 63 and 64, CWI, Amsterdam, 1989.
[19] M. Hazewinkel, S. I. Marcus, and H. J. Sussmann. Nonexistence of finite-dimensional filters for conditional statistics of the cubic sensor problem. Systems and Control Letters 3 (1983), 331–340.
[20] J. Jacod and A. N. Shiryaev. Limit Theorems for Stochastic Processes. Grundlehren der Mathematischen Wissenschaften, vol. 288, Springer-Verlag, Berlin, 1987.
[21] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[22] M. Fujisaki, G. Kallianpur, and H. Kunita (1972). Stochastic differential equations for the non linear filtering problem. Osaka Journal of Mathematics, 9(1), 19–40.
[23] J. Kenney and W. Stirling. Nonlinear filtering of convex sets of probability distributions. Presented at the 1st International Symposium on Imprecise Probabilities and Their Applications, Ghent, Belgium, 29 June – 2 July 1999.
[24] R. Z. Khasminskii (1980). Stochastic Stability of Differential Equations. Alphen aan den Rijn.
[25] R. S. Liptser and A. N. Shiryayev. Statistics of Random Processes I: General Theory. Springer-Verlag, Berlin, 1978.
[26] M. Murray and J. Rice. Differential Geometry and Statistics. Monographs on Statistics and Applied Probability 48, Chapman and Hall, 1993.
[27] D. Ocone and E. Pardoux. A Lie algebraic criterion for non-existence of finite-dimensionally computable filters. Lecture Notes in Mathematics 1390, 197–204, Springer-Verlag, 1989.
[28] G. Pistone and C. Sempi (1995). An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. The Annals of Statistics, 23(5), 1995.
Clustering, classification and pattern recognition in a set of data are among the most important tasks in statistical research and in many applications. In this paper, we propose to carry out these tasks using a mixture of Student-t distributions to model the data, via a hierarchical graphical model and the Bayesian framework. The main advantages of this model are that it accounts for the uncertainties of variances and covariances, and that we can use Variational Bayesian Approximation (VBA) methods to obtain fast algorithms able to handle large data sets.

Variational Bayesian Approximation method for Classification and Clustering with a mixture of Student-t model

Ali Mohammad-Djafari
Laboratoire des Signaux et Systèmes (L2S), UMR 8506 CNRS–CentraleSupélec–Univ. Paris-Sud, CentraleSupélec, 91192 Gif-sur-Yvette, France
http://lss.centralesupelec.fr — Email: djafari@lss.supelec.fr — http://djafari.free.fr — http://publicationslist.org/djafari

A. Mohammad-Djafari, VBA for Classification and Clustering, GSI2015, October 28-30, 2015, Polytechnique, France.

Contents
1. Mixture models
2. Different problems related to classification and clustering: training; supervised classification; semi-supervised classification; clustering or unsupervised classification
3. Mixture of Student-t
4. Variational Bayesian Approximation
5. VBA for the mixture of Student-t
6. Conclusion

Mixture models

The general mixture model is
$$p(x|a, \Theta, K) = \sum_{k=1}^K a_k\, p_k(x|\theta_k), \qquad 0 < a_k < 1.$$
Same family: $p_k(x|\theta_k) = p(x|\theta_k)$ for all $k$. Gaussian case: $p(x|\theta_k) = \mathcal{N}(x|\mu_k, \Sigma_k)$ with $\theta_k = (\mu_k, \Sigma_k)$.
The data are $X = \{x_n, n = 1, \ldots, N\}$, where each element $x_n$ can be in one of the classes $c_n$. With $a_k = p(c_n = k)$, $a = \{a_k, k = 1, \ldots, K\}$ and $\Theta = \{\theta_k, k = 1, \ldots, K\}$:
$$p(X, c|a, \Theta) = \prod_{n=1}^N p(x_n, c_n|a, \Theta).$$

Different problems
- Training: given a set of (training) data $X$ and classes $c$, estimate the parameters $a$ and $\Theta$.
- Supervised classification: given a sample $x_m$ and the parameters $K$, $a$ and $\Theta$, determine its class $k^* = \arg\max_k\ p(c_m = k|x_m, a, \Theta, K)$.
- Semi-supervised classification (proportions are not known): given a sample $x_m$ and the parameters $K$ and $\Theta$, determine its class $k^* = \arg\max_k\ p(c_m = k|x_m, \Theta, K)$.
- Clustering or unsupervised classification (number of classes $K$ is not known): given a set of data $X$, determine $K$ and $c$.
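For a concrete (hypothetical) one-dimensional Gaussian mixture, the supervised rule above amounts to normalizing the joint $a_k\, p(x_m|\theta_k)$ over $k$; a minimal sketch:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def classify(x, a, mus, variances):
    """p(c = k | x, a, Theta) = a_k N(x | mu_k, v_k) / sum_l a_l N(x | mu_l, v_l)."""
    joint = np.array([a_k * gaussian_pdf(x, m, v)
                      for a_k, m, v in zip(a, mus, variances)])
    post = joint / joint.sum()
    return post, int(np.argmax(post))

# Two illustrative classes centred at 0 and 3; the sample x = 2.5 is closer to class 1.
post, k_star = classify(x=2.5, a=[0.5, 0.5], mus=[0.0, 3.0], variances=[1.0, 1.0])
```

The posterior class probabilities sum to one by construction, and the MAP class is simply the one with the largest normalized joint.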
Training

Given a set of (training) data $X$ and classes $c$, estimate the parameters $a$ and $\Theta$.
- Maximum Likelihood (ML): $(\hat a, \hat\Theta) = \arg\max_{(a, \Theta)}\ p(X, c|a, \Theta, K)$.
- Bayesian: assign priors $p(a|K)$ and $p(\Theta|K) = \prod_{k=1}^K p(\theta_k)$ and write the joint posterior law
$$p(a, \Theta|X, c, K) = \frac{p(X, c|a, \Theta, K)\, p(a|K)\, p(\Theta|K)}{p(X, c|K)},$$
where $p(X, c|K) = \iint p(X, c|a, \Theta, K)\, p(a|K)\, p(\Theta|K)\, da\, d\Theta$. Infer $a$ and $\Theta$ either as the Maximum A Posteriori (MAP) estimate or as the Posterior Mean (PM).

Supervised classification

Given a sample $x_m$ and the parameters $K$, $a$ and $\Theta$, determine
$$p(c_m = k|x_m, a, \Theta, K) = \frac{p(x_m, c_m = k|a, \Theta, K)}{p(x_m|a, \Theta, K)},$$
where $p(x_m, c_m = k|a, \Theta, K) = a_k\, p(x_m|\theta_k)$ and $p(x_m|a, \Theta, K) = \sum_{k=1}^K a_k\, p(x_m|\theta_k)$. Best class: $k^* = \arg\max_k\ p(c_m = k|x_m, a, \Theta, K)$.

Semi-supervised classification

Given a sample $x_m$ and the parameters $K$ and $\Theta$ (not the proportions $a$), determine the probabilities
$$p(c_m = k|x_m, \Theta, K) = \frac{p(x_m, c_m = k|\Theta, K)}{p(x_m|\Theta, K)},$$
where $p(x_m, c_m = k|\Theta, K) = \int p(x_m, c_m = k|a, \Theta, K)\, p(a|K)\, da$ and $p(x_m|\Theta, K) = \sum_{k=1}^K p(x_m, c_m = k|\Theta, K)$. Best class, for example the MAP solution: $k^* = \arg\max_k\ p(c_m = k|x_m, \Theta, K)$.

Clustering or unsupervised classification

Given a set of data $X$, determine $K$ and $c$. Determination of the number of classes:
$$p(K = L|X) = \frac{p(X|K = L)\, p(K = L)}{p(X)}, \qquad p(X) = \sum_{L=1}^{L_0} p(K = L)\, p(X|K = L),$$
where $L_0$ is the a priori maximum number of classes and
$$p(X|K = L) = \iint \prod_n \sum_{k=1}^L a_k\, p(x_n, c_n = k|\theta_k)\, p(a|K)\, p(\Theta|K)\, da\, d\Theta.$$
When $K$ and $c$ are determined, we can also determine the characteristics $a$ and $\Theta$ of those classes.
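With the Dirichlet prior on the proportions that appears later in this talk, Bayesian training of $a$ from labelled data has a closed-form conjugate update; a minimal sketch with illustrative counts (not data from the paper):

```python
import numpy as np

def dirichlet_posterior(counts, k0=1.0):
    """Dirichlet prior D(a | k0*1) + observed class counts n_k
    -> posterior D(a | k0 + n_k), with posterior mean (k0 + n_k) / sum."""
    alpha_post = k0 + np.asarray(counts, dtype=float)
    post_mean = alpha_post / alpha_post.sum()  # PM estimate of the proportions a
    return alpha_post, post_mean

# Illustrative: 40 labelled samples split over K = 3 classes.
alpha, a_pm = dirichlet_posterior(counts=[30, 10, 0], k0=1.0)
```

Note how the prior pseudo-count $k_0$ keeps the posterior mean of the empty class strictly positive, which the plain ML estimate would set to zero.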
Mixture of Student-t model

Student-t and its Infinite Gaussian Scaled Model (IGSM):
$$\mathcal{T}(x|\nu, \mu, \Sigma) = \int_0^\infty \mathcal{N}(x|\mu, z^{-1}\Sigma)\, \mathcal{G}\!\left(z\,\middle|\,\frac{\nu}{2}, \frac{\nu}{2}\right) dz,$$
where
$$\mathcal{N}(x|\mu, \Sigma) = |2\pi\Sigma|^{-\frac12} \exp\!\left[-\tfrac12 (x - \mu)^T \Sigma^{-1} (x - \mu)\right] = |2\pi\Sigma|^{-\frac12} \exp\!\left[-\tfrac12 \mathrm{Tr}\!\left[(x - \mu)(x - \mu)^T \Sigma^{-1}\right]\right]$$
and $\mathcal{G}(z|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, z^{\alpha - 1} \exp[-\beta z]$.

Mixture of Student-t:
$$p(x|\{\nu_k, a_k, \mu_k, \Sigma_k,\ k = 1, \ldots, K\}, K) = \sum_{k=1}^K a_k\, \mathcal{T}(x|\nu_k, \mu_k, \Sigma_k).$$

Introducing $z_{nk}$, $z_k = \{z_{nk}, n = 1, \ldots, N\}$, $Z = \{z_{nk}\}$, $c = \{c_n, n = 1, \ldots, N\}$, $\theta_k = \{\nu_k, a_k, \mu_k, \Sigma_k\}$, $\Theta = \{\theta_k, k = 1, \ldots, K\}$, and assigning the priors $p(\Theta) = \prod_k p(\theta_k)$, we can write:
$$p(X, c, Z, \Theta|K) = \prod_n \prod_k \left[a_k\, \mathcal{N}(x_n|\mu_k, z_{nk}^{-1}\Sigma_k)\, \mathcal{G}\!\left(z_{nk}\,\middle|\,\frac{\nu_k}{2}, \frac{\nu_k}{2}\right)\right] p(\theta_k).$$
Joint posterior law:
$$p(c, Z, \Theta|X, K) = \frac{p(X, c, Z, \Theta|K)}{p(X|K)}.$$
The main task now is to propose approximations to it, in such a way that we can use it easily in all the above-mentioned classification and clustering tasks.

Variational Bayesian Approximation (VBA)

Main idea: propose a computationally easy approximation $q(c, Z, \Theta)$ of $p(c, Z, \Theta|X, K)$. Criterion: $\mathrm{KL}(q : p)$. Interestingly, by noting that $p(c, Z, \Theta|X, K) = p(X, c, Z, \Theta|K)/p(X|K)$, we have
$$\mathrm{KL}(q : p) = -\mathcal{F}(q) + \ln p(X|K),$$
where
$$\mathcal{F}(q) = \left\langle \ln \frac{p(X, c, Z, \Theta|K)}{q(c, Z, \Theta)} \right\rangle_q$$
is called the free energy of $q$, and we have the following properties:
- Maximizing $\mathcal{F}(q)$ and minimizing $\mathrm{KL}(q : p)$ are equivalent, and $\mathcal{F}(q)$ gives a lower bound on the log-evidence of the model: $\mathcal{F}(q) \le \ln p(X|K)$.
- When the optimum $q^*$ is obtained, $\mathcal{F}(q^*)$ can be used as a criterion for model selection.
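The scale-mixture representation can be verified numerically in the univariate case, by quadrature over $z$ and comparison with the closed-form Student-t density. This is an independent sanity check, not part of the talk:

```python
import numpy as np
from math import gamma as Gamma, pi, sqrt

def student_t_pdf(x, nu, mu=0.0, sigma2=1.0):
    """Closed-form univariate Student-t density (nu d.o.f., location mu, scale sigma2)."""
    c = Gamma((nu + 1.0) / 2.0) / (Gamma(nu / 2.0) * sqrt(nu * pi * sigma2))
    return c * (1.0 + (x - mu) ** 2 / (nu * sigma2)) ** (-(nu + 1.0) / 2.0)

def student_t_via_igsm(x, nu, mu=0.0, sigma2=1.0, z_max=60.0, n=200000):
    """Same density via quadrature of  ∫ N(x | mu, sigma2/z) G(z | nu/2, nu/2) dz."""
    z = np.linspace(1e-8, z_max, n)
    dz = z[1] - z[0]
    normal = np.sqrt(z / (2.0 * pi * sigma2)) * np.exp(-0.5 * z * (x - mu) ** 2 / sigma2)
    a = b = nu / 2.0
    gam = b ** a / Gamma(a) * z ** (a - 1.0) * np.exp(-b * z)
    return float(np.sum(normal * gam) * dz)
```

The quadrature and the closed form agree to high accuracy, confirming that augmenting each observation with a Gamma-distributed precision scale $z_{nk}$ reproduces the heavy-tailed Student-t marginal.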
VBA: choosing the good families

Using $\mathrm{KL}(q : p)$ has the very interesting property that using $q$ to compute the means gives the same values as if we had used $p$ (conservation of the means). Unfortunately, this is not the case for variances or other moments. If $p$ is in the exponential family then, choosing appropriate conjugate priors, the structure of $q$ will be the same, and we can obtain appropriately fast optimization algorithms.

[Figure: graphical representation of the hierarchical model — hyperparameters $\xi_0$, $(\gamma_0, \Sigma_0)$, $(\mu_0, \eta_0)$ and $k_0$ at the top; parameters $\alpha_k$, $\beta_k$, $\Sigma_k$, $\mu_k$, $a$; latent variables $z_{nk}$; data $x_n$ at the bottom.]

VBA for the mixture of Student-t

In our case, noting that
$$p(X, c, Z, \Theta|K) = \prod_n \prod_k p(x_n, c_n, z_{nk}|a_k, \mu_k, \Sigma_k, \nu_k) \prod_k [p(\alpha_k)\, p(\beta_k)\, p(\mu_k|\Sigma_k)\, p(\Sigma_k)]$$
with
$$p(x_n, c_n, z_{nk}|a_k, \mu_k, \Sigma_k, \nu_k) = \mathcal{N}(x_n|\mu_k, z_{nk}^{-1}\Sigma_k)\, \mathcal{G}(z_{nk}|\alpha_k, \beta_k)$$
is separable, on one side in $[c, Z]$ and on the other side in the components of $\Theta$, we propose to use $q(c, Z, \Theta) = q(c, Z)\, q(\Theta)$.

With this decomposition, the expression of the Kullback–Leibler divergence becomes
$$\mathrm{KL}\big(q_1(c, Z)\, q_2(\Theta) : p(c, Z, \Theta|X, K)\big) = \sum_c \iint q_1(c, Z)\, q_2(\Theta) \ln \frac{q_1(c, Z)\, q_2(\Theta)}{p(c, Z, \Theta|X, K)}\, d\Theta\, dZ,$$
and the expression of the free energy becomes
$$\mathcal{F}\big(q_1(c, Z)\, q_2(\Theta)\big) = \sum_c \iint q_1(c, Z)\, q_2(\Theta) \ln \frac{p(X, c, Z|\Theta, K)\, p(\Theta|K)}{q_1(c, Z)\, q_2(\Theta)}\, d\Theta\, dZ.$$
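The relation $\mathrm{KL}(q : p) = -\mathcal{F}(q) + \ln p(X|K)$, and the fact that $\mathcal{F}$ is maximized exactly at the true posterior, can be checked on a toy discrete latent-variable model (all numbers below are illustrative):

```python
import numpy as np

# Toy discrete model: latent h with 3 states, one observed x; joint[h] = p(h, x).
joint = np.array([0.10, 0.25, 0.15])
evidence = joint.sum()           # p(x)
posterior = joint / evidence     # p(h | x)

def free_energy(q):
    """F(q) = sum_h q(h) ln[ p(h, x) / q(h) ], the variational lower bound."""
    return float(np.sum(q * np.log(joint / q)))

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.3, 0.4, 0.3])    # an arbitrary approximate posterior
# Identity: KL(q : posterior) = -F(q) + ln p(x), hence F(q) <= ln p(x),
# with equality exactly when q is the true posterior.
```

The same bookkeeping underlies the mixture model: maximizing $\mathcal{F}$ over a factorized family is a tractable surrogate for minimizing the intractable KL divergence to the posterior.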
Proposed VBA for the mixture of Student-t priors model

Using a generalized Student-t, obtained by replacing $\mathcal{G}(z_{nk}|\frac{\nu_k}{2}, \frac{\nu_k}{2})$ by $\mathcal{G}(z_{nk}|\alpha_k, \beta_k)$, it is easier to propose conjugate priors for $\alpha_k, \beta_k$ than for $\nu_k$:
$$p(x_n, c_n = k, z_{nk}|a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k, K) = a_k\, \mathcal{N}(x_n|\mu_k, z_{nk}^{-1}\Sigma_k)\, \mathcal{G}(z_{nk}|\alpha_k, \beta_k).$$
In the following, noting $\Theta = \{(a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k), k = 1, \ldots, K\}$, we propose to use the factorized prior laws
$$p(\Theta) = p(a) \prod_k [p(\alpha_k)\, p(\beta_k)\, p(\mu_k|\Sigma_k)\, p(\Sigma_k)]$$
with the following components:
- $p(a) = \mathcal{D}(a|k_0)$, with $k_0 = [k_0, \ldots, k_0] = k_0 \mathbf{1}$
- $p(\alpha_k) = \mathcal{E}(\alpha_k|\zeta_0) = \mathcal{G}(\alpha_k|1, \zeta_0)$
- $p(\beta_k) = \mathcal{E}(\beta_k|\zeta_0) = \mathcal{G}(\beta_k|1, \zeta_0)$
- $p(\mu_k|\Sigma_k) = \mathcal{N}(\mu_k|\mu_0 \mathbf{1}, \eta_0^{-1}\Sigma_k)$
- $p(\Sigma_k) = \mathcal{IW}(\Sigma_k|\gamma_0, \gamma_0\Sigma_0)$

Here $\mathcal{D}(a|k) = \frac{\Gamma(\sum_l k_l)}{\prod_l \Gamma(k_l)} \prod_l a_l^{k_l - 1}$ is the Dirichlet pdf, $\mathcal{E}(t|\zeta_0) = \zeta_0 \exp[-\zeta_0 t]$ is the exponential pdf, $\mathcal{G}(t|a, b) = \frac{b^a}{\Gamma(a)}\, t^{a-1} \exp[-bt]$ is the Gamma pdf, and
$$\mathcal{IW}(\Sigma|\gamma, \gamma\Delta) = \frac{\left|\frac12 \gamma\Delta\right|^{\gamma/2} \exp\!\left[-\frac12 \mathrm{Tr}\!\left(\gamma\Delta\, \Sigma^{-1}\right)\right]}{\Gamma_D(\gamma/2)\, |\Sigma|^{(\gamma + D + 1)/2}}$$
is the inverse Wishart pdf. With these prior laws and the likelihood, the joint posterior law is
$$p(c, Z, \Theta|X) = \frac{p(X, c, Z, \Theta)}{p(X)}.$$

Expressions of $q$:
$$q(c, Z, \Theta) = q(c, Z)\, q(\Theta) = \prod_n \prod_k [q(c_n = k|z_{nk})\, q(z_{nk})] \prod_k [q(\alpha_k)\, q(\beta_k)\, q(\mu_k|\Sigma_k)\, q(\Sigma_k)]\, q(a)$$
with:
- $q(a) = \mathcal{D}(a|\tilde k)$, $\tilde k = [\tilde k_1, \ldots, \tilde k_K]$
- $q(\alpha_k) = \mathcal{G}(\alpha_k|\tilde\zeta_k, \tilde\eta_k)$
- $q(\beta_k) = \mathcal{G}(\beta_k|\tilde\zeta_k, \tilde\eta_k)$
- $q(\mu_k|\Sigma_k) = \mathcal{N}(\mu_k|\tilde\mu, \tilde\eta^{-1}\Sigma_k)$
- $q(\Sigma_k) = \mathcal{IW}(\Sigma_k|\tilde\gamma, \tilde\gamma\tilde\Sigma)$

With these choices, we have
$$\mathcal{F}(q(c, Z, \Theta)) = \langle \ln p(X, c, Z, \Theta|K) \rangle_{q(c, Z, \Theta)} = \sum_k \sum_n F_{1kn} + \sum_k F_{2k},$$
$$F_{1kn} = \langle \ln p(x_n, c_n, z_{nk}, \theta_k) \rangle_{q(c_n = k|z_{nk})\, q(z_{nk})}, \qquad F_{2k} = \langle \ln p(x_n, c_n, z_{nk}, \theta_k) \rangle_{q(\theta_k)}.$$
MohammadDjafari, VBA for Classiﬁcation and Clustering..., GSI2015, October 2830, 2015, Polytechnique, France 18/20 VBA Algorithm step Expressions of the updating expressions of the tilded parameters are obtained by following three steps: E step: Optimizing F with respect to q(c, Z) when keeping q(Θ) ﬁxed, we obtain the expression of q(cn = kznk) = ˜ak, q(znk) = G(znkαk, βk). M step: Optimizing F with respect to q(Θ) when keeping q(c, Z) ﬁxed, we obtain the expression of q(a) = D(a˜k), ˜k = [˜k1, · · · , ˜kK ], q(αk) = G(αk˜ζk, ˜ηk), q(βk) = G(βk˜ζk, ˜ηk), q(µkΣk) = N(µkµ, ˜η−1Σk), and q(Σk) = IW(Σk˜γ, ˜γ ˜Σ), which gives the updating algorithm for the corresponding tilded parameters. F evaluation: After each E step and M step, we can also evaluate the expression of F(q) which can be used for stopping rule of the iterative algorithm. Final value of F(q) for each value of K, noted Fk, can be used as a criterion for model selection, i.e.; the determination of the number of clusters. A. MohammadDjafari, VBA for Classiﬁcation and Clustering..., GSI2015, October 2830, 2015, Polytechnique, France 19/20 Conclusions Clustering and classiﬁcation of a set of data are between the most important tasks in statistical researches for many applications such as data mining in biology. Mixture models and in particular Mixture of Gaussians are classical models for these tasks. We proposed to use a mixture of generalised Studentt distribution model for the data via a hierarchical graphical model. To obtain fast algorithms and be able to handle large data sets, we used conjugate priors everywhere it was possible. The proposed algorithm has been used for clustering, classiﬁcation and discriminant analysis of some biological data (Cancer research related), but in this paper, we only presented the main algorithm. A. MohammadDjafari, VBA for Classiﬁcation and Clustering..., GSI2015, October 2830, 2015, Polytechnique, France 20/20
The textile plot proposed by Kumasaka and Shibata (2008) is a method for data visualization. The method transforms a data matrix in order to draw a parallel coordinate plot. In this paper, we investigate a set of matrices induced by the textile plot, which we call the textile set, from a geometrical viewpoint. It is shown that the textile set is written as the union of two differentiable manifolds if data matrices are restricted to be fullrank.

What is textile plot? Textile set Main result Other results Summary Geometric Properties of textile plot Tomonari SEI and Ushio TANAKA University of Tokyo and Osaka Prefecture University at ´Ecole Polytechnique, Oct 28, 2015 1 / 23 What is textile plot? Textile set Main result Other results Summary Introduction The textile plot proposed by Kumasaka and Shibata (2008) is a method for data visualization. The method transforms a data matrix into another matrix, Rn×p X → Y ∈ Rn×p , in order to draw a parallel coordinate plot. The parallel coordinate plot is a standard 2dimensional graphical tool for visualizing multivariate data at a glance. In this talk, we investigate a set of matrices induced by the textile plot, which we call the textile set, from a diﬀerential geometrical point of view. It is shown that the textile set is written as the union of two diﬀerentiable manifolds if data matrices are “generic”. 2 / 23 What is textile plot? Textile set Main result Other results Summary Introduction The textile plot proposed by Kumasaka and Shibata (2008) is a method for data visualization. The method transforms a data matrix into another matrix, Rn×p X → Y ∈ Rn×p , in order to draw a parallel coordinate plot. The parallel coordinate plot is a standard 2dimensional graphical tool for visualizing multivariate data at a glance. In this talk, we investigate a set of matrices induced by the textile plot, which we call the textile set, from a diﬀerential geometrical point of view. It is shown that the textile set is written as the union of two diﬀerentiable manifolds if data matrices are “generic”. 2 / 23 What is textile plot? Textile set Main result Other results Summary 1 What is textile plot? 2 Textile set 3 Main result 4 Other results 5 Summary 3 / 23 What is textile plot? Textile set Main result Other results Summary Textile plot Example (Kumasaka and Shibata, 2008) Textile plot for the iris data. 
(150 cases, 5 attributes) Each variate is transformed by a locationscale transformation. Categorical data is quantiﬁed. Missing data is admitted. Order of axes can be maintained. Specie s Sepal.Length Sepal.W id th Petal.Length Petal.W id th setosa versicolor virginica 4.3 7.9 2 4.4 1 6.9 0.1 2.5 4 / 23 What is textile plot? Textile set Main result Other results Summary Textile plot Example (Kumasaka and Shibata, 2008) Textile plot for the iris data. (150 cases, 5 attributes) Each variate is transformed by a locationscale transformation. Categorical data is quantiﬁed. Missing data is admitted. Order of axes can be maintained. Specie s Sepal.Length Sepal.W id th Petal.Length Petal.W id th setosa versicolor virginica 4.3 7.9 2 4.4 1 6.9 0.1 2.5 4 / 23 What is textile plot? Textile set Main result Other results Summary Textile plot Let us recall the method of the textile plot. For simplicity, we assume no categorical variate and no missing value. Let X = (x1, . . . , xp) ∈ Rn×p be the data matrix. Without loss of generality, assume the sample mean and sample variance of each xj are 0 and 1, respectively. The data is transformed into Y = (y1, . . . , yp), where yj = aj + bj xj , aj , bj ∈ R, j = 1, . . . , p. The coeﬃcients aj and bj are determined by the following procedure. 5 / 23 What is textile plot? Textile set Main result Other results Summary Textile plot Let us recall the method of the textile plot. For simplicity, we assume no categorical variate and no missing value. Let X = (x1, . . . , xp) ∈ Rn×p be the data matrix. Without loss of generality, assume the sample mean and sample variance of each xj are 0 and 1, respectively. The data is transformed into Y = (y1, . . . , yp), where yj = aj + bj xj , aj , bj ∈ R, j = 1, . . . , p. The coeﬃcients aj and bj are determined by the following procedure. 5 / 23 What is textile plot? Textile set Main result Other results Summary Textile plot Let us recall the method of the textile plot. 
For simplicity, we assume no categorical variate and no missing value. Let X = (x1, . . . , xp) ∈ Rn×p be the data matrix. Without loss of generality, assume the sample mean and sample variance of each xj are 0 and 1, respectively. The data is transformed into Y = (y1, . . . , yp), where yj = aj + bj xj , aj , bj ∈ R, j = 1, . . . , p. The coeﬃcients aj and bj are determined by the following procedure. 5 / 23 What is textile plot? Textile set Main result Other results Summary Textile plot Coeﬃcients a = (aj ) and b = (bj ) are the solution of the following minimization problem: Minimize a,b n∑ t=1 p∑ j=1 (ytj − ¯yt·)2 subject to yj = aj + bj xj , p∑ j=1 yj 2 = 1. Intuition: as horizontal as possible. Solution: a = 0 and b is the eigenvector corresponding to the maximum eigenvalue of the covariance matrix of X. yt1 yt2 yt3 yt4 yt5 yt. 6 / 23 What is textile plot? Textile set Main result Other results Summary Example (n = 100, p = 4) X ∈ R100×4. Each row ∼ N(0, Σ), Σ = 1 −0.6 0.5 0.1 −0.6 1 −0.6 −0.2 0.5 −0.6 1 0.0 0.1 −0.2 0.0 1 . −2.71 2.98 −3.93 3.27 −2.72 2.43 −2.58 2.23 −2.71 2.98 −3.93 3.27 −2.72 2.43 −2.58 2.23 (a) raw data X (b) textile plot Y 7 / 23 What is textile plot? Textile set Main result Other results Summary Our motivation The textile plot transforms the data matrix X into Y. Denote the map by Y = τ(X). What is the image τ(Rn×p)? We can show that Y ∈ τ(Rn×p) satisﬁes two conditions: ∃λ ≥ 0, ∀i = 1, . . . , p, p∑ j=1 yi yj = λ yi 2 and p∑ j=1 yj 2 = 1. This motivates the following deﬁnition of the textile set. 8 / 23 What is textile plot? Textile set Main result Other results Summary Our motivation The textile plot transforms the data matrix X into Y. Denote the map by Y = τ(X). What is the image τ(Rn×p)? We can show that Y ∈ τ(Rn×p) satisﬁes two conditions: ∃λ ≥ 0, ∀i = 1, . . . , p, p∑ j=1 yi yj = λ yi 2 and p∑ j=1 yj 2 = 1. This motivates the following deﬁnition of the textile set. 8 / 23 What is textile plot? 
Textile set Main result Other results Summary Our motivation The textile plot transforms the data matrix X into Y. Denote the map by Y = τ(X). What is the image τ(Rn×p)? We can show that Y ∈ τ(Rn×p) satisﬁes two conditions: ∃λ ≥ 0, ∀i = 1, . . . , p, p∑ j=1 yi yj = λ yi 2 and p∑ j=1 yj 2 = 1. This motivates the following deﬁnition of the textile set. 8 / 23 What is textile plot? Textile set Main result Other results Summary Textile set Deﬁnition The textile set is deﬁned by Tn,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 , ∑ j yj 2 = 1 }, The unnormalized textile set is deﬁned by Un,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 }. We are interested in mathematical properties of Tn,p and Un,p. Bad news: statistical implication such is a future work. Let us begin with small p case. 9 / 23 What is textile plot? Textile set Main result Other results Summary Textile set Deﬁnition The textile set is deﬁned by Tn,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 , ∑ j yj 2 = 1 }, The unnormalized textile set is deﬁned by Un,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 }. We are interested in mathematical properties of Tn,p and Un,p. Bad news: statistical implication such is a future work. Let us begin with small p case. 9 / 23 What is textile plot? Textile set Main result Other results Summary Textile set Deﬁnition The textile set is deﬁned by Tn,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 , ∑ j yj 2 = 1 }, The unnormalized textile set is deﬁned by Un,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 }. We are interested in mathematical properties of Tn,p and Un,p. Bad news: statistical implication such is a future work. Let us begin with small p case. 9 / 23 What is textile plot? Textile set Main result Other results Summary Textile set Deﬁnition The textile set is deﬁned by Tn,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 , ∑ j yj 2 = 1 }, The unnormalized textile set is deﬁned by Un,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 }. 
We are interested in mathematical properties of Tn,p and Un,p. Bad news: statistical implication such is a future work. Let us begin with small p case. 9 / 23 What is textile plot? Textile set Main result Other results Summary Textile set Deﬁnition The textile set is deﬁned by Tn,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 , ∑ j yj 2 = 1 }, The unnormalized textile set is deﬁned by Un,p = { Y ∈ Rn×p  ∃λ ≥ 0, ∀i, ∑ j yi yj = λ yi 2 }. We are interested in mathematical properties of Tn,p and Un,p. Bad news: statistical implication such is a future work. Let us begin with small p case. 9 / 23 What is textile plot? Textile set Main result Other results Summary Tn,p with small p Lemma (p = 1) Tn,1 = Sn−1, the unit sphere. Lemma (p = 2) Tn,2 = A ∪ B, where A = {(y1, y2)  y1 = y2 = 1/ √ 2}, B = {(y1, y2)  y1 − y2 = y1 + y2 = 1}, each of which is diﬀeomorphic to Sn−1 × Sn−1. Their intersection A ∩ B is diﬀeomorphic to the Stiefel manifold Vn,2. → See next slide for n = p = 2 case. 10 / 23 What is textile plot? Textile set Main result Other results Summary Tn,p with small p Lemma (p = 1) Tn,1 = Sn−1, the unit sphere. Lemma (p = 2) Tn,2 = A ∪ B, where A = {(y1, y2)  y1 = y2 = 1/ √ 2}, B = {(y1, y2)  y1 − y2 = y1 + y2 = 1}, each of which is diﬀeomorphic to Sn−1 × Sn−1. Their intersection A ∩ B is diﬀeomorphic to the Stiefel manifold Vn,2. → See next slide for n = p = 2 case. 10 / 23 What is textile plot? Textile set Main result Other results Summary Example (n = p = 2) T2,2 ⊂ R4 is the union of two tori, glued along O(2). θ φ ξ η T2,2 = { 1 √ 2 ( cos θ cos φ sin θ sin φ )} ∪ { 1 2 ( cos ξ + cos η cos ξ − cos η sin ξ + sin η sin ξ − sin η )} 11 / 23 What is textile plot? Textile set Main result Other results Summary For general dimension p To state our main result, we deﬁne two concepts: noncompact Stiefel manifold and canonical form. Deﬁnition (e.g. Absil et al. (2008)) Let n ≥ p. Denote by V ∗ the set of all column fullrank matrices: V ∗ := { Y ∈ Rn×p  rank(Y) = p }. 
V ∗ is called the noncompact Stiefel manifold. Note that dim(V ∗) = np and V ∗ = Rn×p. The orthogonal group O(n) acts on V ∗. By the GramSchmidt orthonormalization, the quotient space V ∗/O(n) is identiﬁed with uppertriangular matrices with positive diagonals. → see next slide. 12 / 23 What is textile plot? Textile set Main result Other results Summary For general dimension p To state our main result, we deﬁne two concepts: noncompact Stiefel manifold and canonical form. Deﬁnition (e.g. Absil et al. (2008)) Let n ≥ p. Denote by V ∗ the set of all column fullrank matrices: V ∗ := { Y ∈ Rn×p  rank(Y) = p }. V ∗ is called the noncompact Stiefel manifold. Note that dim(V ∗) = np and V ∗ = Rn×p. The orthogonal group O(n) acts on V ∗. By the GramSchmidt orthonormalization, the quotient space V ∗/O(n) is identiﬁed with uppertriangular matrices with positive diagonals. → see next slide. 12 / 23 What is textile plot? Textile set Main result Other results Summary For general dimension p To state our main result, we deﬁne two concepts: noncompact Stiefel manifold and canonical form. Deﬁnition (e.g. Absil et al. (2008)) Let n ≥ p. Denote by V ∗ the set of all column fullrank matrices: V ∗ := { Y ∈ Rn×p  rank(Y) = p }. V ∗ is called the noncompact Stiefel manifold. Note that dim(V ∗) = np and V ∗ = Rn×p. The orthogonal group O(n) acts on V ∗. By the GramSchmidt orthonormalization, the quotient space V ∗/O(n) is identiﬁed with uppertriangular matrices with positive diagonals. → see next slide. 12 / 23 What is textile plot? Textile set Main result Other results Summary Noncompact Stiefel manifold and canonical form Deﬁnition (Canonical form) Let us denote by V ∗∗ the set of all matrices written as y11 · · · y1p 0 ... ... ... ... ypp 0 · · · 0 ... ... 0 · · · 0 , yii > 0, 1 ≤ i ≤ p. We call it a canonical form. Note that V ∗∗ ⊂ V ∗ and V ∗/O(n) V ∗∗. 13 / 23 What is textile plot? 
Textile set Main result Other results Summary Noncompact Stiefel manifold and canonical form Deﬁnition (Canonical form) Let us denote by V ∗∗ the set of all matrices written as y11 · · · y1p 0 ... ... ... ... ypp 0 · · · 0 ... ... 0 · · · 0 , yii > 0, 1 ≤ i ≤ p. We call it a canonical form. Note that V ∗∗ ⊂ V ∗ and V ∗/O(n) V ∗∗. 13 / 23 What is textile plot? Textile set Main result Other results Summary Restriction of unnormalized textile set V ∗: noncompact Stiefel manifold, V ∗∗: set of canonical forms. Deﬁnition Denote the restriction of Un,p to V ∗ and V ∗∗ by U∗ n,p = Un,p ∩ V ∗ , U∗∗ n,p = Un,p ∩ V ∗∗ , respectively. The group O(n) acts on U∗ n,p. The quotient space U∗ n,p/O(n) is identiﬁed with U∗∗ n,p. So it is essential to study U∗∗ n,p. 14 / 23 What is textile plot? Textile set Main result Other results Summary Restriction of unnormalized textile set V ∗: noncompact Stiefel manifold, V ∗∗: set of canonical forms. Deﬁnition Denote the restriction of Un,p to V ∗ and V ∗∗ by U∗ n,p = Un,p ∩ V ∗ , U∗∗ n,p = Un,p ∩ V ∗∗ , respectively. The group O(n) acts on U∗ n,p. The quotient space U∗ n,p/O(n) is identiﬁed with U∗∗ n,p. So it is essential to study U∗∗ n,p. 14 / 23 What is textile plot? Textile set Main result Other results Summary Restriction of unnormalized textile set V ∗: noncompact Stiefel manifold, V ∗∗: set of canonical forms. Deﬁnition Denote the restriction of Un,p to V ∗ and V ∗∗ by U∗ n,p = Un,p ∩ V ∗ , U∗∗ n,p = Un,p ∩ V ∗∗ , respectively. The group O(n) acts on U∗ n,p. The quotient space U∗ n,p/O(n) is identiﬁed with U∗∗ n,p. So it is essential to study U∗∗ n,p. 14 / 23 What is textile plot? Textile set Main result Other results Summary U∗∗ n,p for small p Let us check examples. Example (n = p = 1) U∗∗ 1,1 = {(1)}. Example (n = p = 2) Let Y = ( y11 y12 0 y22 ) with y11, y22 > 0. Then U∗∗ 2,2 = {y12 = 0} ∪ {y2 11 = y2 12 + y2 22}, union of a plane and a cone. 15 / 23 What is textile plot? 
Textile set Main result Other results Summary U∗∗ n,p for small p Let us check examples. Example (n = p = 1) U∗∗ 1,1 = {(1)}. Example (n = p = 2) Let Y = ( y11 y12 0 y22 ) with y11, y22 > 0. Then U∗∗ 2,2 = {y12 = 0} ∪ {y2 11 = y2 12 + y2 22}, union of a plane and a cone. 15 / 23 What is textile plot? Textile set Main result Other results Summary Main theorem The diﬀerential geometrical property of U∗∗ n,p is given as follows: Theorem Let n ≥ p ≥ 3. Then we have the following decomposition U∗∗ n,p = M1 ∪ M2, where each Mi is a diﬀerentiable manifold, the dimensions of which are given by dim M1 = p(p + 1) 2 − (p − 1), dim M2 = p(p + 1) 2 − p, respectively. M2 is connected while M1 may not. 16 / 23 What is textile plot? Textile set Main result Other results Summary Example U∗∗ 3,3 is the union of 4dim and 3dim manifolds. We look at a cross section with y11 = y22 = 1: y12 y13 y33 Union of a surface and a vertical line. 17 / 23 What is textile plot? Textile set Main result Other results Summary Corollary Let n ≥ p ≥ 3. Then we have U∗ n,p = π−1 (M1) ∪ π−1 (M2), where π denotes the map of GramSchmidt orthonormalization. The dimensions are dim π−1 (M1) = np − (p − 1), dim π−1 (M2) = np − p. 18 / 23 What is textile plot? Textile set Main result Other results Summary Other results We state other results. First we have n = 1 case. Lemma If n = 1, then the textile set T1,p is the union of a (p − 2)dimensional manifold and 2(2p − 1) isolated points. Example U∗∗ 1,3 consists of a circle and 14 points: U∗∗ 1,3 = (S2 ∩ {y1 + y2 + y3 = 1}) ∪ {±( 1√ 3 , 1√ 3 , 1√ 3 ), ±( 1√ 2 , 1√ 2 , 0), ±( 1√ 2 , 0, 1√ 2 ), ±(0, 1√ 2 , 1√ 2 ), ± (1, 0, 0), ±(0, 1, 0), ±(0, 0, 1)} . 19 / 23 What is textile plot? Textile set Main result Other results Summary Other results We state other results. First we have n = 1 case. Lemma If n = 1, then the textile set T1,p is the union of a (p − 2)dimensional manifold and 2(2p − 1) isolated points. 
Example U∗∗ 1,3 consists of a circle and 14 points: U∗∗ 1,3 = (S2 ∩ {y1 + y2 + y3 = 1}) ∪ {±( 1√ 3 , 1√ 3 , 1√ 3 ), ±( 1√ 2 , 1√ 2 , 0), ±( 1√ 2 , 0, 1√ 2 ), ±(0, 1√ 2 , 1√ 2 ), ± (1, 0, 0), ±(0, 1, 0), ±(0, 0, 1)} . 19 / 23 What is textile plot? Textile set Main result Other results Summary Diﬀerential geometrical characterization of fλ −1 (O) Fix λ ≥ 0 arbitrarily. We deﬁne the map fλ : Rn×p → Rp+1 by fλ(y1, . . . , yp) := ∑ j y1 yj − λ y1 2 ... ∑ j yp yj − λ yp 2 ∑ j yj 2 − 1 . Lemma We have a classiﬁcation of Tn,p, namely Tn,p = λ≥0 fλ −1 (O) = 0≤λ≤n fλ −1 (O). 20 / 23 What is textile plot? Textile set Main result Other results Summary Diﬀerential geometrical characterization of fλ −1 (O) Fix λ ≥ 0 arbitrarily. We deﬁne the map fλ : Rn×p → Rp+1 by fλ(y1, . . . , yp) := ∑ j y1 yj − λ y1 2 ... ∑ j yp yj − λ yp 2 ∑ j yj 2 − 1 . Lemma We have a classiﬁcation of Tn,p, namely Tn,p = λ≥0 fλ −1 (O) = 0≤λ≤n fλ −1 (O). 20 / 23 What is textile plot? Textile set Main result Other results Summary Diﬀerential geometrical characterization of fλ −1 (O) Lastly, we state a characterization of fλ −1 (O) from the viewpoint of diﬀerential geometry. Theorem Let λ ≥ 0. fλ −1 (O) is a regular submanifold of Rn×p with codimension p + 1 whenever λ > 0, y11yjj − y1j yj1 = 0, j = 2, . . . , p, ∃ ∈ { 2, . . . , p }; p∑ j=2 yij + yi (1 − 2λ) = 0, i = 1, . . . , n. 21 / 23 What is textile plot? Textile set Main result Other results Summary Present and future study Summary: We deﬁned the textile set Tn,p and ﬁnd its geometric properties. Present and future study: . 1 Characterize the classiﬁcation fλ −1 (O) with induced Riemannian metric from Rnp by (global) Riemannian geometry: geodesic, curvature etc. . 2 Investigate diﬀerential geometrical and topological properties of Tn,p and fλ −1 (O), including its group action. 3 Can one ﬁnd statistical implication such as sample distribution theory? Merci beaucoup! 22 / 23 What is textile plot? 
Textile set Main result Other results Summary Present and future study Summary: We deﬁned the textile set Tn,p and ﬁnd its geometric properties. Present and future study: . 1 Characterize the classiﬁcation fλ −1 (O) with induced Riemannian metric from Rnp by (global) Riemannian geometry: geodesic, curvature etc. . 2 Investigate diﬀerential geometrical and topological properties of Tn,p and fλ −1 (O), including its group action. 3 Can one ﬁnd statistical implication such as sample distribution theory? Merci beaucoup! 22 / 23 What is textile plot? Textile set Main result Other results Summary Present and future study Summary: We deﬁned the textile set Tn,p and ﬁnd its geometric properties. Present and future study: . 1 Characterize the classiﬁcation fλ −1 (O) with induced Riemannian metric from Rnp by (global) Riemannian geometry: geodesic, curvature etc. . 2 Investigate diﬀerential geometrical and topological properties of Tn,p and fλ −1 (O), including its group action. 3 Can one ﬁnd statistical implication such as sample distribution theory? Merci beaucoup! 22 / 23 What is textile plot? Textile set Main result Other results Summary References . 1 Absil, P.A., Mahony, R., and Sepulchre, R. (2008), Optimization Algorithms on Matrix Manifolds, Princeton University Press. . 2 Honda, K. and Nakano, J. (2007), 3 dimensional parallel coordinate plot, Proceedings of the Institute of Statistical Mathematics, 55, 69–83. . 3 Inselberg, A. (2009), Parallel Coordinates: VISUAL Multidimensional Geometry and its Applications, Springer. 4 Kumasaka, N. and Shibata, R. (2008), Highdimensional data visualisation: The textile plot, Computational Statistics and Data Analysis, 52, 3616–3644. 23 / 23
In anomalous statistical physics, deformed algebraic structures are important objects. Heavily tailed probability distributions, such as Student’s tdistributions, are characterized by deformed algebras. In addition, deformed algebras cause deformations of expectations and independences of random variables. Hence, a generalization of independence for multivariate Student’s tdistribution is studied in this paper. Even if two random variables which follow to univariate Student’s tdistributions are independent, the joint probability distribution of these two distributions is not a bivariate Student’s tdistribution. It is shown that a bivariate Student’s tdistribution is obtained from two univariate Student’s tdistributions under qdeformed independence.

A generalization of independence and multivariate Student’s tdistributions MATSUZOE Hiroshi Nagoya Institute of Technology joint works with SAKAMOTO Monta (Efrei, Paris) 1 Deformed exponential family 2 Nonadditive diﬀerentials and expectation functionals 3 Geometry of deformed exponential families 4 Generalization of independence 5 qindependence and Student’s tdistributions 6 Appendix Notions of expectations, independence are determined from the choice of statistical models. Introduction: Geometry and statistics • Geometry for the sample space • Geometry for the parameter space • Wasserstein geometry • Optimal transport theory • A pdf is regarded as a distribution of mass • Information geometry • Convexity of entropy and free energy • Duality of estimating function
Hessian Information Geometry (chaired by ShunIchi Amari, Michel Boyom)
We define a metric and a family of αconnections in statistical manifolds, based on ϕdivergence, which emerges in the framework of ϕfamilies of probability distributions. This metric and αconnections generalize the Fisher information metric and Amari’s αconnections. We also investigate the parallel transport associated with the αconnection for α = 1.

Curvature properties for statistical structures are studied. The study deals with the curvature tensor of statistical connections and their duals as well as the Ricci tensor of the connections, Laplacians and the curvature operator. Two concepts of sectional curvature are introduced. The meaning of the notions is illustrated by presenting few exemplary theorems.

We show that Hessian manifolds of dimensions 4 and above must have vanishing Pontryagin forms. This gives a topological obstruction to the existence of Hessian metrics. We find an additional explicit curvature identity for Hessian 4manifolds. By contrast, we show that all analytic Riemannian 2manifolds are Hessian.

Based on the theory of compact normal leftsymmetric algebra (clan), we realize every homogeneous cone as a set of positive definite real symmetric matrices, where homogeneous Hessian metrics as well as a transitive group action on the cone are described efficiently.

In this article, we derive an inequality satisfied by the squared norm of the imbedding curvature tensor of Multiply CRwarped product statistical submanifolds N of holomorphic statistical space forms M. Furthermore, we prove that under certain geometric conditions, N and M become Einstein.

Topological forms and Information (chaired by Daniel Bennequin, Pierre Baudot)
In this lecture we will present joint work with Ryan Thorngren on thermodynamic semirings and entropy operads, with Nicolas Tedeschi on Birkhoff factorization in thermodynamic semirings, ongoing work with Marcus Bintz on tropicalization of Feynman graph hypersurfaces and Potts model hypersurfaces, and their thermodynamic deformations, and ongoing work by the author on applications of thermodynamic semirings to models of morphology and syntax in Computational Linguistics.

Information Algebras and their Applications Matilde Marcolli Geometric Science of Information, Paris, October 2015 Matilde Marcolli Information Algebras Based on: M. Marcolli, R. Thorngren, Thermodynamic semirings, J. Noncommut. Geom. 8 (2014), no. 2, 337–392 M. Marcolli, N. Tedeschi, Entropy algebras and Birkhoﬀ factorization, J. Geom. Phys. 97 (2015) 243–265 Matilde Marcolli Information Algebras MinPlus Algebra (Tropical Semiring) minplus (or tropical) semiring T = R ∪ {∞} • operations ⊕ and x ⊕ y = min{x, y} with identity ∞ x y = x + y with identity 0 • operations ⊕ and satisfy: associativity commutativity left/right identity distributivity of product over sum ⊕ Matilde Marcolli Information Algebras Thermodynamic semirings Tβ,S = (R ∪ {∞}, ⊕β,S , ) • deformation of the tropical addition ⊕β,S x ⊕β,S y = min p {px + (1 − p)y − 1 β S(p)} β thermodynamic inverse temperature parameter S(p) = S(p, 1 − p) binary information measure, p ∈ [0, 1] • for β → ∞ (zero temperature) recovers unperturbed idempotent addition ⊕ • multiplication = + is undeformed • for S = Shannon entropy considered ﬁrst in relation to F1geometry in A. Connes, C. Consani, From monoids to hyperstructures: in search of an absolute arithmetic, arXiv:1006.4810 Matilde Marcolli Information Algebras Khinchin axioms Sh(p) = −C(p log p + (1 − p) log(1 − p)) • Axiomatic characterization of Shannon entropy S(p) = Sh(p) 1 symmetry S(p) = S(1 − p) 2 minima S(0) = S(1) = 0 3 extensivity S(pq) + (1 − pq)S(p(1 − q)/(1 − pq)) = S(p) + pS(q) • correspond to algebraic properties of semiring Tβ,S 1 commutativity of ⊕β,S 2 left and right identity for ⊕β,S 3 associativity of ⊕β,S ⇒ Tβ,S commutative, unital, associative iﬀ S(p) = Sh(p) Matilde Marcolli Information Algebras Khinchin axioms nary form Given S as above, deﬁne Sn : ∆n−1 → R 0 by Sn(p1, . . . , pn) = 1 j n−1 (1 − 1 i
We show that the entropy function–and hence the finite 1logarithm–behaves a lot like certain derivations. We recall its cohomological interpretation as a 2cocycle and also deduce 2ncocycles for any n. Finally, we give some identities for finite multiple polylogarithms together with number theoretic applications.

Finite polylogarithms, their multiple analogues and the Shannon entropy Geometric Sciences of Information 2015 Session “Topological Forms and Information” École Polytechnique (France), 28 October 2015 Philippe ElbazVincent(Université Grenoble Alpes) & Herbert Gangl (Durham University) Content of this talk Information theory, Entropy and Polylogarithms (review of past works), Algebraic interpretation of the entropy function, Cohomological interpretation of formal entropy functions, Finite multiple polylogarithms, applications and open problems. 2 / 13 Information theory, Entropy and Polylogarithms (1/4) The Shannon entropy can be characterised in the framework of information theory, assuming that the propagation of information follows a Markovian model (Shannon, 1948). If H is the Shannon entropy, it fulﬁlls the equation, often called the Fundamental Equation of Information Theory (FEITH) H(x) + (1 − x)H y 1 − x − H(y) − (1 − y)H x 1 − y = 0 . (FEITH) It is known (Aczel and Dhombres, 1989), that if g is a real function locally integrable on ]0, 1[ and if, moreover, g fulﬁlls FEITH, then there exists c ∈ R such that g = cH (we can also restrict the hypothesis to Lebesgue measurable). 3 / 13 Information theory, Entropy and Polylogarithms (2/4) It turns out that FEITH can be derived, in a precise formal sense (ElbazVincent and Gangl, 2002), from the 5term equation of the classical (or padic) dilogarithm. Cathelineau (1996) found that an appropriate derivative of the Bloch–Wigner dilogarithm coincides with the classical entropy function, and that the ﬁve term relation satisﬁed by the former implies the four term relation of the latter. More precisely, we deﬁne Lim(z) = ∞ n=1 zn nm , z < 1, the mlogarithm. We set D2(z) = i Im Li2(z) + log(1 − z) log z , Then D2 satisﬁes the following 5term equation D2 (a) − D2 (b) + D2 b a − D2 1 − b 1 − a + D2 1 − b−1 1 − a−1 = 0, whenever such an expression makes sense. 
The relation is the famous ﬁve term equation for the dilogarithm (ﬁrst stated by Abel). 4 / 13 Information theory, Entropy and Polylogarithms (3/4) It can be shown formarly (see Cathelineau, ElbazVincent and Gangl) that FEITH is an inﬁnitesimal version of this 5term equation. Kontsevich (1995) discovered that the truncated ﬁnite logarithm over a ﬁnite ﬁeld Fp, with p prime, deﬁned by £1(x) = p−1 k=1 xk k , satisﬁes FEITH. In our previous work, we showed how one can expand this relationship for “higher analogues" in order to produce and prove similar functional identities for ﬁnite polylogarithms from those for classical polylogarithms (using mod p reduction of padic polylogarithms and their inﬁnitesimal version). It was also shown that functional equations for ﬁnite polylogarithms often hold even as polynomial identities over ﬁnite ﬁelds. 5 / 13 Information theory, Entropy and Polylogarithms (4/4) Entropy and FEITH arise from the inﬁnitesimal picture (for both archimedean and nonarchimedean structure) and their ﬁnite analogs associated to the dilogarithm. Does their exist higher analogue of the Shannon entropy associated to mlogarithms ? It could be connected to the higher degrees of the information cohomology space of Baudot and Bennequin (Entropy 2015). 6 / 13 Algebraic interpretation of the entropy function (1/2) Let R be a (commutative) ring and let D be a map from R to R. We will say that D is a unitary derivation over R if the following axioms hold : 1 “Leibniz’s rule” : for all x, y ∈ R, we have D(xy) = xD(y) + yD(x). 2 “Additivity on partitions of unity” : for all x ∈ R, we have D(x) + D(1 − x) = 0. We will denote by Deru(R) the set of unitary derivations over R. We will say that a map f : R → R is an abstract symmetric information function of degree 1 if the two following conditions hold : for all x, y ∈ R such that x, y, 1 − x, 1 − y ∈ R×, the functional equation FEITH holds and for all x ∈ R, we have f (x) = f (1 − x). 
Denote by IF1(R) the set of abstract symmetric information functions of degree 1 over R. Then IF1(R) is an Rmodule. Let Leib(R) be the set of Leibniz functions over R (i.e. which fulﬁll the “Leibniz rule”). 7 / 13 Algebraic interpretation of the entropy function (2/2) Proposition : We have a morphism of Rmodules h : Leib(R) → IF1(R), deﬁned by h(ϕ) = ϕ + ϕ ◦ τ, with τ(x) = 1 − x. Furthermore, Ker(h) = Deru(R). Hence, if h is onto, abstract information function are naturally associated to formal derivations. Nevertheless, h can be also 0. Indeed, if R = Fq, is a ﬁnite ﬁeld, then Leib(Fq) = 0, but IF1(Fq) = 0 (it is generated by £1). 8 / 13 Cohomological interpretation of formal entropy functions The following results are classical in origin (Cathelineau, 1988 and Kontsevich, 1995) Proposition : Let F be a ﬁnite prime ﬁeld and H : F → F a function which fulﬁlls the following conditions : H(x) = H(1 − x), the functional equation (FEITH) holds for H and H(0) = 0. Then the function ϕ : F × F → F deﬁned by ϕ(x, y) = (x + y)H( x x+y ) if x + y = 0 and 0 otherwise, is a nontrivial 2cocycle. sketch of proof : Suppose that ϕ is a 2coboundary. Then, there exists a map Q : F → F, such that ϕ(x, y) = Q(x + y) − Q(x) − Q(y). The function ψλ(x) = Q(λx) − λQ(x) is an additive morphism F → F, hence entirely determined by ψλ(1). The map ψλ(1) fulﬁlls the Leibniz chain rule on F× . We deduce from it that ϕ = 0 (which is not possible, so it is not a coboundary !) We deduce that £1 is unique (up to a constant). In the real or complex we use other type of cohomological arguments (see also the relationship with Baudot and Bennequin, 2015). 9 / 13 Finite multiple polylogarithms (1/3) While classical polylogarithms play an important role in the theory of mixed Tate motives over a ﬁeld, it turns out that it is often preferable to also consider the larger class of multiple polylogarithms (cf. Goncharov’s work). In a similar way it is useful to investigate their ﬁnite analogues. 
We are mainly concerned with finite double polylogarithms, which are given as functions Z/p × Z/p → Z/p by £a,b(x, y) = Σ_{0<k<l<p} x^k y^l / (k^a l^b). Let n > 0 be divisible by 3, and put ω = n/3 − 1. Then Σ_{j=0}^{ω} (ω choose j)(2ω choose j) £_{n−(j+1),j+1}([a, b] − [1/a, a/b] − (a^p/b^p)[b, 1/(ab)] + b^p[a/b, 1/b]) = 0. Question: what is the interpretation, in terms of information theory, of the multiple polylogs? 12 / 13 Finite polylogarithms and Fermat's last theorem Several classical criteria used by Kummer, Mirimanoff and Wieferich to prove certain cases of Fermat's Last Theorem can be rephrased in terms of functional equations and evaluations of finite (multiple) polylogarithms. For example, Mirimanoff was led to the study of the (nowadays so-called) Mirimanoff polynomials (cf. Ribenboim's book on FLT) ϕ_j(T) = Σ_{k=1}^{p−1} k^{j−1} T^k, which are nothing else but finite polylogarithms. The Mirimanoff congruences (op. cit.) can be reformulated as follows: for any solution (x, y, z) of x^p + y^p + z^p = 0 in pairwise coprime integers not divisible by p (i.e. a Fermat triple), and for t = −x/y, we have £1(t) = 0 and £_j(t) £_{p−j}(t) = 0 (j = 2, . . . , (p − 1)/2). One can prove these congruences using an identity expressing £_{p−j−1,j+1}(1, T) in terms of £_n(T). 13 / 13
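The identification of Mirimanoff polynomials with finite polylogarithms is a one-line consequence of k^(p−1) = 1 in F_p×, namely k^(j−1) = 1/k^(p−j); a quick check (my illustration):

```python
# Check that the Mirimanoff polynomial phi_j(T) = sum_{k=1}^{p-1} k^(j-1) T^k
# coincides with the finite polylogarithm £_{p-j}(T) = sum_{k=1}^{p-1} T^k / k^(p-j)
# over F_p, since k^(p-1) = 1 gives k^(j-1) = 1 / k^(p-j).

def mirimanoff(j, t, p):
    """phi_j(t) = sum_{k=1}^{p-1} k^(j-1) t^k over F_p."""
    return sum(pow(k, j - 1, p) * pow(t, k, p) for k in range(1, p)) % p

def finite_polylog(n, t, p):
    """£_n(t) = sum_{k=1}^{p-1} t^k / k^n over F_p."""
    return sum(pow(t, k, p) * pow(pow(k, p - 2, p), n, p) for k in range(1, p)) % p
```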
We present a dictionary between arithmetic geometry of toric varieties and convex analysis. This correspondence allows for effective computations of arithmetic invariants of these varieties. In particular, combined with a closed formula for the integration of a class of functions over polytopes, it gives a number of new values for the height (arithmetic analog of the degree) of toric varieties, with respect to interesting metrics arising from polytopes. In some cases these heights are interpreted as the average entropy of a family of random processes.

”GSI’15”, École Polytechnique, October 28, 2015. Heights of toric varieties, entropy and integration over polytopes. José Ignacio Burgos Gil, Patrice Philippon & Martín Sombra. Patrice Philippon, IMJ-PRG UMR 7586 CNRS.

1 Toric varieties Toric varieties form a remarkable class of algebraic varieties, endowed with an action of a torus having one Zariski dense open orbit. Toric divisors are those invariant by the action of the torus. Together with their toric divisors, they can be described in terms of combinatorial objects such as lattice fans, support functions or lattice polytopes. (Figure: fan of the projective plane, with support function values (u1, u2) ↦ 0, −u1, −u2 on its three cones.)

2 Each cone corresponds to an affine toric variety and the fan encodes how they glue together. If the fan is complete then the toric variety is proper. The support function determines a toric divisor D on each affine toric chart. By duality, the stability set of the support function is a polytope ∆, which may be empty but which is of dimension n as soon as D is nef, which is equivalent to the support function being concave. One fundamental result is: if D is a toric nef divisor then deg_D(X) = n! vol_n(∆).

3 Heights A height measures the complexity of objects over the field of rational numbers, say. For a/b ∈ Q× and d = gcd(a, b): h(a/b) = log max(|a|/d, |b|/d) = Σ_v log max(|a|_v, |b|_v), thanks to the product formula Π_v |d|_v = 1 for any d ∈ Q×, where v runs over all the (normalised) absolute values on Q (usual and p-adic).

4 For points of a projective space x = (x0 : . . . : xN) ∈ P^N(Q): h(x) = Σ_v log ‖x‖_v = −Σ_v log ‖s(x)‖_v, where ‖·‖_v is a norm on Q^{N+1} compatible with the absolute value |·|_v on Q (usual or p-adic). Metrics on O_{P^N}(1): ‖s(x)‖_v = |s(x)|_v / ‖x‖_v.
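The height of a rational number and the product formula can be made concrete in code; the following sketch (standard number theory, not specific to these slides) computes p-adic absolute values, checks the product formula, and verifies that the sum over places does not depend on the chosen representative of a/b.

```python
from math import gcd, log

def padic_abs(n, p):
    """|n|_p = p^(-v_p(n)) for a nonzero integer n."""
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return float(p) ** (-v)

def prime_factors(n):
    """Set of prime divisors of a nonzero integer, by trial division."""
    n, out, d = abs(n), set(), 2
    while d * d <= n:
        while n % d == 0:
            out.add(d)
            n //= d
        d += 1
    if n > 1:
        out.add(n)
    return out

def product_over_places(d):
    """|d| * prod_p |d|_p; the product formula says this equals 1."""
    prod = float(abs(d))
    for p in prime_factors(d):
        prod *= padic_abs(d, p)
    return prod

def height(a, b):
    """h(a/b) = sum over places of log max(|a|_v, |b|_v)."""
    total = log(max(abs(a), abs(b)))                  # archimedean place
    for p in prime_factors(a) | prime_factors(b):     # finite places
        total += log(max(padic_abs(a, p), padic_abs(b, p)))
    return total
```

For example h(6/4) and h(3/2) agree, both equal to log 3, because the place at 2 exactly compensates the non-reduced numerator and denominator.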
5 On an abstract variety equipped with a divisor (X, D), defined over Q, the suitable arithmetic setting amounts to a collection of metrics on the space of rational sections of the divisor, compatible with the absolute values on Q (the collection is in bijection with the set of absolute values on Q). We denote by D̄ the resulting metrised divisor. Arithmetic intersection theory allows one to define the height of X relative to D̄, analogously to the degree deg_D(X): h_D̄(X) = Σ_v h_v(X), where the local heights h_v are defined through an arithmetic analogue of Bézout's formula. Local heights depend on the choice of auxiliary sections, but the global height does not.

6 Metrics on toric varieties On toric divisors, a metric is said to be toric if it is invariant by the action of the compact subtorus of the principal orbit. There exists a bijection between toric metrics and continuous functions on the fan whose difference with the support function is bounded. The metric is semipositive iff the corresponding function is concave. By Legendre duality, the semipositive toric metrics are also in bijection with the continuous, concave functions on the polytope associated to the toric divisor, dubbed roof functions.

7 The roof function is the concave envelope of the graph of the function s ↦ −log ‖s‖_{v,sup}, for s running over the toric sections of the divisor and its multiples. (Figure: roof functions, for v = 2, v = ∞ and the other places, of the pullback of the canonical metric of P² to P¹ by t ↦ (1/t : 1/2 : t).) The support function itself corresponds to the so-called canonical metric. Its roof function is the zero function on the polytope.

8 Heights on toric varieties Let (X, D) be a toric variety with a toric divisor (over Q), equipped with a collection of toric metrics (a toric metrised divisor). The (local) roof functions attached to the toric metrised divisor sum up to the so-called global roof function: ϑ := Σ_v ϑ_v. We have the analogue of the formula seen for the degree: h_D̄(X) = (n + 1)! ∫_∆ ϑ.
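As a concrete instance of the formula h_D̄(X) = (n + 1)! ∫_∆ ϑ, here is a numeric sketch (my choice of example, using the n = 1 case of the Fubini-Study roof function quoted later in the deck): on P¹ the polytope is [0, 1], ϑ(x) = −(1/2)(x log x + (1 − x) log(1 − x)), and the integral comes out as 1/4, so the height is 2! · 1/4 = 1/2; the canonical metric (ϑ = 0) gives height 0.

```python
from math import log

def roof_fubini_study(x):
    """n = 1 Fubini-Study roof function on the polytope [0, 1]."""
    def xlogx(t):
        return 0.0 if t <= 0.0 else t * log(t)
    return -0.5 * (xlogx(x) + xlogx(1.0 - x))

def height_P1(roof, steps=200000):
    """h = (n+1)! * integral of the roof over [0, 1], n = 1, by midpoint rule."""
    h = 1.0 / steps
    integral = sum(roof((i + 0.5) * h) for i in range(steps)) * h
    return 2.0 * integral          # (n + 1)! = 2
```

The computed value approaches 1/2, the familiar Fubini-Study height of the projective line.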
9 Metrics from polytopes Let ℓ_F(x) = ⟨x, u_F⟩ + ℓ_F(0) be the linear forms defining a polytope Γ ⊂ R^n, with F running over its facets and u_F normalised by ‖u_F‖ = vol_{n−1}(F) / (n vol_n(Γ)). Let ∆ ⊂ Γ be another polytope; the restriction to ∆ of ϑ := −(1/c) Σ_F ℓ_F log(ℓ_F) is the roof function of some (archimedean) metric on the toric variety X and divisor D defined by ∆, hence a metrised divisor D̄. Example: the roof function of the Fubini-Study metric on P^n is −(1/2)(x0 log(x0) + . . . + xn log(xn)), where x0 = 1 − x1 − . . . − xn (dual to −(1/2) log(1 + Σ_{i=1}^n e^{−2u_i})).

10 Height as average entropy Let x ∈ Γ and let βx be the (discrete) random variable that maps y ∈ Γ to the face F of Γ such that y ∈ Cone(x, F): P(βx = F) = dist(x, F) vol_{n−1}(F) / (n vol_n(Γ)). (Figure: polytopes ∆ ⊂ Γ, a point x and a facet F.)

11 The entropy E(βx) = −Σ_F P(βx = F) log(P(βx = F)) satisfies (1/vol_n(∆)) ∫_∆ E(βx) dvol_n(x) = (c/(n + 1)) · h_D̄(X)/deg_D(X).

12 Integration over polytopes An aggregate of ∆ in a direction u ∈ R^n is the union of all the faces of ∆ contained in {x ∈ R^n : ⟨x, u⟩ = λ} for some λ ∈ R. Definition – Let V be an aggregate in the direction of u ∈ R^n; we set recursively: if u = 0, then C_n(∆, 0, V) = vol_n(V) and C_k(∆, 0, V) = 0 for k ≠ n; if u ≠ 0, then C_k(∆, u, V) = −Σ_F (⟨u_F, u⟩/‖u‖²) C_k(F, π_F(u), V ∩ F), where the sum is over the facets F of ∆. This recursive formula implies that C_k(∆, u, V) = 0 for all k > dim(V).

13 Proposition [2, Prop. 6.1.4] – Let ∆ ⊂ R^n be a polytope of dimension n and u ∈ R^n. Then, for any f ∈ C^n(R), ∫_∆ f^(n)(⟨x, u⟩) dvol_n(x) = Σ_{V ∈ ∆(u)} Σ_{k=0}^{dim(V)} C_k(∆, u, V) f^(k)(⟨V, u⟩). The coefficients C_k(∆, u, V) are determined by this identity. Example: if ∆ = Conv(ν0, . . . , νn) = ∩_{i=0}^n {x : ⟨x, u_i⟩ ≥ λ_i} is a simplex and u ∈ R^n \ {0}, then C_0(∆, u, ν0) equals n! vol_n(∆) / Π_{i=1}^n ⟨ν0 − ν_i, u⟩ = ε det(u1, . . . , un)^{n−1} / Π_{i=1}^n det(u1, . . . , u_{i−1}, u, u_{i+1}, . . . , un), with ε the sign of (−1)^n det(u1, . . . , un).

14 References [1] G. Everest & T. Ward, Heights of Polynomials and Entropy in Algebraic Dynamics, Universitext, Springer Verlag (1999). 15 [2] J.I. Burgos Gil, P. Philippon & M. Sombra, Arithmetic geometry of toric varieties. Metrics, measures and heights, Astérisque 360, Soc. Math. France, 2014. [3] J.I. Burgos Gil, A. Moriwaki, P. Philippon & M. Sombra, Arithmetic positivity on toric varieties, J. Algebraic Geom., 2016, to appear, eprint arXiv:1210.7692v3. [4] J.I. Burgos Gil, P. Philippon & M. Sombra, Successive minima of toric height functions, Ann. Inst. Fourier, Grenoble, 2015, to appear, eprint arXiv:1403.4048v2. Ouf! 16
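To make the average-entropy picture tangible, here is a small numeric sketch on a hypothetical example of my own (the unit square, taking ∆ = Γ, n = 2, unit edge lengths): the face probabilities P(βx = F) = dist(x, F) vol_{n−1}(F)/(n vol_n(Γ)) sum to 1, and the average of E(βx) over the square can be computed by quadrature; by a direct calculation it equals 1/2 + log 2.

```python
from math import log

def face_probs(x, y):
    """P(beta_x = F) for the four edges of the unit square:
    dist(x, F) * len(F) / (n * vol), with n = 2, vol = 1, len = 1."""
    return [x / 2.0, (1.0 - x) / 2.0, y / 2.0, (1.0 - y) / 2.0]

def entropy(ps):
    """Shannon entropy -sum p log p of a probability vector."""
    return -sum(p * log(p) for p in ps if p > 0.0)

def average_entropy(steps=400):
    """Midpoint-rule average of E(beta_x) over the unit square."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            total += entropy(face_probs((i + 0.5) * h, (j + 0.5) * h))
    return total * h * h
```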
In this paper we propose a method to characterize and estimate the variations of a random convex set Ξ0 in terms of shape, size and direction. The mean n-variogram γ(n)Ξ0 : (u1, ⋯, un) ↦ E[νd(Ξ0 ∩ (Ξ0 − u1) ⋯ ∩ (Ξ0 − un))] of a random convex set Ξ0 on ℝd reveals information on the nth-order structure of Ξ0. In particular, we will show that by considering the mean n-variograms of the dilated random sets Ξ0 ⊕ rK by a homothetic convex family rK, r > 0, it is possible to estimate some characteristics of the nth-order structure of Ξ0. A judicious choice of K provides relevant measures of Ξ0. Fortunately, the germ-grain model is stable under convex dilations; furthermore, the mean n-variogram of the primary grain is estimable in several types of stationary germ-grain models through the so-called n-point probability function. Here we focus on the Boolean model: in the planar case we show how to estimate the nth-order structure of the random vector composed of the mixed volumes t(A(Ξ0), W(Ξ0, K)) of the primary grain, and we describe a procedure to do so from a realization of the Boolean model in a bounded window. We prove that this knowledge for all convex bodies K is sufficient to fully characterize the so-called difference body of the grain, Ξ0 ⊕ ˘Ξ0. We then discuss the choice of the element K. By choosing a ball, the mixed volumes coincide with the Minkowski functionals of Ξ0, and we therefore obtain the moments of the random vector composed of the area and perimeter, t(A(Ξ0), U(Ξ0)). By choosing a segment oriented by θ, we obtain estimates of the moments of the random vector composed of the area and the Féret diameter in the direction θ, t(A(Ξ0), HΞ0(θ)). Finally, we evaluate the performance of the method on a Boolean model with rectangular grains for the estimation of the second-order moments of the random vectors t(A(Ξ0), U(Ξ0)) and t(A(Ξ0), HΞ0(θ)).
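For a deterministic grain the mean variogram reduces to the geometric covariogram g(u) = area(Ξ0 ∩ (Ξ0 − u)); a sketch with a hypothetical a × b rectangular grain (my example), checking numerically that the covariogram integrates to A(Ξ0)², the simplest instance of the moment identities exploited by the method:

```python
def rect_covariogram(u1, u2, a, b):
    """Geometric covariogram of an a x b rectangle:
    area(R intersect (R - u)) = (a - |u1|)_+ * (b - |u2|)_+."""
    dx, dy = a - abs(u1), b - abs(u2)
    return dx * dy if dx > 0.0 and dy > 0.0 else 0.0

def integral_covariogram(a, b, steps=400):
    """Midpoint-rule integral of the covariogram over [-a, a] x [-b, b];
    it should equal (a*b)^2 = A(Xi_0)^2."""
    hx, hy = 2.0 * a / steps, 2.0 * b / steps
    total = 0.0
    for i in range(steps):
        u1 = -a + (i + 0.5) * hx
        for j in range(steps):
            u2 = -b + (j + 0.5) * hy
            total += rect_covariogram(u1, u2, a, b)
    return total * hx * hy
```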

Characterization and Estimation of the Variations of a Random Convex Set by its Mean n-Variogram: Application to the Boolean Model. S. Rahmani, J.-C. Pinoli & J. Debayle, École Nationale Supérieure des Mines de Saint-Étienne, France. SPIN, PROPICE / LGF, UMR CNRS 5307. SR (ENSMSE / LGFPMDM) GSI 2015 28/10/2015.

Section 1: Geometric Stochastic Modeling and objectives. Stochastic materials: material modelling, material characterization. Germ-grain model [Matheron 1967]. Definition: Ξ = ∪_{xi ∈ Φ} (xi + Ξi) (1), where the Ξi are i.i.d. and Φ is a point process. Law of Φ ⇔ spatial distribution; law of Ξ0 ⇔ granulometry. Boolean model ⇒ Φ is a Poisson point process of intensity λ.

Objectives and state of the art. Geometrical characterization of Ξ0 from measurements in a bounded window Ξ ∩ M, with no assumption on Ξ0's shape; describing Ξ0. State of the art: Miles formulae [Miles 1967]; tangent points method [Molchanov 1995]; minimum contrast method [Dupac & Digle 1980] ⇒ mean geometric parameters λ, E[A(Ξ0)], E[U(Ξ0)]; formula for the distribution for a model of disks [Emery 2012].

Characterization and description of the grain. For homothetic grains: disk of radius r: E[r] = E[U(Ξ0)]/(2π) and E[r²] = E[A(Ξ0)]/π; square of side x: E[x] = E[U(Ξ0)]/4 and E[x²] = E[A(Ξ0)] ⇒ parametric distribution of the homothetic factor! For non-homothetic grains (rectangle, ellipse...): the same means for area and perimeter (Minkowski densities) ⇒ insufficient to fully characterize Ξ0! What about the variations of these geometrical characteristics?
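The homothetic-grain identities above are just linearity of expectation; a short Monte-Carlo sketch (hypothetical radius distribution, my choice) confirming E[r] = E[U(Ξ0)]/(2π) and E[r²] = E[A(Ξ0)]/π for a disk of random radius:

```python
import random
from math import pi

def disk_moment_check(n=10000, seed=0):
    """For a disk of random radius r, U = 2*pi*r and A = pi*r^2, so the
    moments of r are recovered from E[U] and E[A] up to float error only."""
    rng = random.Random(seed)
    radii = [rng.uniform(1.0, 5.0) for _ in range(n)]
    EU = sum(2.0 * pi * r for r in radii) / n
    EA = sum(pi * r * r for r in radii) / n
    Er = sum(radii) / n
    Er2 = sum(r * r for r in radii) / n
    return abs(EU / (2.0 * pi) - Er), abs(EA / pi - Er2)
```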
Section 2: Theoretical aspects. From the covariance of Ξ to the variations of Ξ0. Covariance: CΞ(u) = P(x ∈ Ξ ∩ (Ξ + u)). Mean covariogram: γ̄Ξ0(u) = E[A(Ξ0 ∩ (Ξ0 + u))]. Relationship: γ̄Ξ0(u) = (1/λ) log(1 + (CΞ(u) − pΞ²)/(1 − pΞ)²) (2). In addition: ∫_{R²} γ̄Ξ0(u) du = E[A(Ξ0)²].

Stability by convex dilations. (a) Grain Ξ0, intensity λ; (b) grain Ξ0 ⊕ K, intensity λ, where X ⊕ Y = {x + y | x ∈ X, y ∈ Y} ⇒ the Boolean model is stable under convex dilations.

The proposed method. Consequently, for all r ≥ 0 we can estimate: ζ0,K(r) = E[A(Ξ0 ⊕ rK)²] = ∫_{R²} γ̄Ξ0⊕rK(u) du. Steiner's formula (mixed volumes): A(Ξ0 ⊕ rK) = A(Ξ0) + 2rW(Ξ0, K) + r²A(K). The polynomial ζ0,K: ζ0,K(r) = E[A0²] + 4rE[A0 W(Ξ0, K)] + r²(4E[W(Ξ0, K)²] + 2A(K)E[A0]) + 4r³A(K)E[W(Ξ0, K)] + r⁴A(K)² ⇒ estimation of E[A0²], E[A0 W(Ξ0, K)] and E[W(Ξ0, K)²].

Generalization to nth-order moments. The mean n-variogram: for n ≥ 2, γ(n)Ξ0(u1, · · · , un−1) = E[A(∩_{i=1}^{n−1}(Ξ0 − ui) ∩ Ξ0)]. Relation n-variogram → n-point probability function (see proceedings). Of course ∫_{R²} · · · ∫_{R²} γ(n)Ξ0(u1, · · · , un−1) du1 · · · dun−1 = E[A(Ξ0)^n]. Then the development of E[A(Ξ0 ⊕ K)^n] by Steiner's formula gives, for all convex K, the nth-order moments of (A0, W(Ξ0, K)).

The interpretation of the mixed area. Definition: for Ξ0 and K convex, W(Ξ0, K) = ½(A(Ξ0 ⊕ K) − A(Ξ0) − A(K)). For the unit ball: W(Ξ0, B) = U(Ξ0)/2, half the perimeter. For a segment: W(Ξ0, Sθ) = HΞ0(θ), the Féret diameter. For a polygon: W(Ξ0, Σ_{i=1}^N αi Sθi) = Σ_{i=1}^N αi HΞ0(θi) ⇒ for all N and (θ1, · · · , θN), all moments of (HΞ0(θ1), · · · , HΞ0(θN)) ⇒ characterization of the random process HΞ0.

The Féret diameter random process HΞ0. A trajectory of HΞ0 is the support function of the realization of Ξ0 ⊕ ˘Ξ0; the process HΞ0 describes and characterizes Ξ0 ⊕ ˘Ξ0. NB: Ξ0 isotropic ⇔ HΞ0 strongly stationary. (Figure: a convex grain Ξ0 and its Féret diameter HΞ0(θ) plotted against the orientation θ ∈ [0, 2π].)

Section 3: Practical aspects. The simplest cases. Estimation of 1st- and 2nd-order moments: E[A(Ξ0)] and E[W(Ξ0, K)]; E[A(Ξ0)²], E[A(Ξ0)W(Ξ0, K)] and E[W(Ξ0, K)²]. Disk: E[A(Ξ0)²], E[A(Ξ0)U(Ξ0)], E[U(Ξ0)²]. Segment: E[A(Ξ0)²], E[HΞ0(θ)²], E[A(Ξ0)HΞ0(θ)]. Parallelogram: E[A(Ξ0)²], E[HΞ0(θ)²], E[A(Ξ0)HΞ0(θ)], and the additional quantity of interest E[HΞ0(θ1)HΞ0(θ2)].

Procedure: realizations Ξ(ω) ∩ M; dilations by r1, · · · , rn K: (Ξ(ω) ⊕ riK) ∩ (M ⊖ riK); covariances CΞ⊕riK; mean covariograms γ̄Ξ0⊕riK; integration: E[A(Ξ0 ⊕ riK)²] = ∫ γ̄Ξ0⊕riK(u) du; polynomial fitting: E[A(Ξ0)²], E[A(Ξ0)W(Ξ0, K)], E[W(Ξ0, K)²].

Statistical aspects. The following estimator of the n-point probability function is unbiased and strongly consistent as M → ∞: Ĉ(n)Ξ,M(x1, · · · , xn) = A((Ξ ∩ M) ⊖ {0, x1 − xn, · · · , xn−1 − xn}) / A(M ⊖ {0, x1 − xn, · · · , xn−1 − xn}). This yields a consistent, but not necessarily unbiased, estimator of the n-variogram and thus of the moments of (A(Ξ0), W(Ξ0, K)) ⇒ small bias for M bigger than Ξ0 (checked by simulation).

Section 4: Test by simulation. Experiments: several realizations of the following Boolean model: a ∼ N(40, 10), b ∼ N(30, 10); M: 500 × 500; λ = 100/(500 × 500); r = 0, 1, · · · , 10. Results. (Figure: relative error (%) of the estimated second-order moments E[Hθ(Ξ0)²], E[A(Ξ0)Hθ(Ξ0)], E[A(Ξ0)²] versus the number of realizations for dilation with a segment, and of E[U(Ξ0)²], E[A(Ξ0)U(Ξ0)], E[A(Ξ0)²] for dilation with a disk.)

Conclusions and prospects. Conclusions: theoretical estimator of the nth-order moments of the process HΞ0; practical estimation of 1st- and 2nd-order moments of t(A(Ξ0), U(Ξ0)) and t(A(Ξ0), HΞ0(θ)) ⇒ characterization of a random particle depending on 2 parameters: rectangle, ellipse... Prospects: describing complex random convex sets by first- and second-order characteristics of the process HΞ0 (e.g. Gaussian process); quantifying the anisotropy of the grain; bias correction. Thanks for listening!
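The ζ0,K polynomial at the heart of the procedure above is just the second moment of the Steiner expansion; a sketch (simulated rectangular grains dilated by a disk, all distributional choices mine) checking that the direct computation of E[A(Ξ0 ⊕ rK)²] matches the moment expansion used for the polynomial fitting:

```python
import random
from math import pi

def sample_grains(n=2000, seed=1):
    """Hypothetical rectangular grains: A = a*b, and for K the unit disk
    W(X, K) = U(X)/2 = a + b."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        a, b = rng.uniform(1.0, 4.0), rng.uniform(1.0, 3.0)
        out.append((a * b, a + b))
    return out

def zeta_direct(samples, r, AK=pi):
    """E[(A + 2 r W + r^2 A(K))^2] averaged over the samples."""
    return sum((A + 2.0 * r * W + r * r * AK) ** 2 for A, W in samples) / len(samples)

def zeta_expanded(samples, r, AK=pi):
    """Same quantity via the moment expansion of zeta_{0,K}(r)."""
    n = len(samples)
    EA  = sum(A for A, W in samples) / n
    EW  = sum(W for A, W in samples) / n
    EA2 = sum(A * A for A, W in samples) / n
    EAW = sum(A * W for A, W in samples) / n
    EW2 = sum(W * W for A, W in samples) / n
    return (EA2 + 4.0 * r * EAW + r * r * (4.0 * EW2 + 2.0 * AK * EA)
            + 4.0 * r ** 3 * AK * EW + r ** 4 * AK * AK)
```

Sampling ζ at five or more values of r and fitting the degree-4 polynomial then recovers E[A0²], E[A0 W] and E[W²], exactly as in the procedure.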
Short course (chaired by Roger Balian)
Keynote speech: Marc Arnaudon (chaired by Frank Nielsen)
We will prove an Euler-Poincaré reduction theorem for stochastic processes taking values in a Lie group, which is a generalization of the Lagrangian version of reduction and its associated variational principles. We will also show examples of its application to the rigid body and to the group of diffeomorphisms, which includes the Navier-Stokes equation on a bounded domain and the Camassa-Holm equation.

Stochastic Euler-Poincaré reduction. Marc Arnaudon, Université de Bordeaux, France. GSI, École Polytechnique, 29 October 2015.

References: Arnaudon, Marc; Chen, Xin; Cruzeiro, Ana Bela; Stochastic Euler-Poincaré reduction. J. Math. Phys. 55 (2014), no. 8, 17 pp. — Chen, Xin; Cruzeiro, Ana Bela; Ratiu, Tudor S.; Constrained and stochastic variational principles for dissipative equations with advected quantities. arXiv:1506.05024.

Outline: 1. Deterministic framework: Euler-Poincaré equations; diffeomorphism group on a compact Riemannian manifold; volume preserving diffeomorphism group; Lagrangian paths; characterization of the geodesics on (G^s_V, ⟨·,·⟩_0); Euler-Poincaré equation on G^s_V. 2. Stochastic framework: semimartingales in a Lie group; stochastic Euler-Poincaré reduction; group of volume preserving diffeomorphisms; Navier-Stokes and Camassa-Holm equations.

Let M be a Riemannian manifold and L : TM × [0, T] → R a Lagrangian on M. Let q ∈ C^1_{a,b}([0, T]; M) := {q ∈ C^1([0, T], M), q(0) = a, q(T) = b}. The action functional C : C^1_{a,b}([0, T]; M) → R is defined by C(q(·)) := ∫_0^T L(q(t), q̇(t), t) dt. The critical points of C satisfy the Euler-Lagrange equation d/dt(∂L/∂q̇) − ∂L/∂q = 0.
Suppose that the configuration space M = G is a Lie group and L : TG → R is a left invariant Lagrangian: ℓ(ξ) := L(e, ξ) = L(g, g·ξ) for all ξ ∈ T_eG, g ∈ G (here and in the sequel, g·ξ = T_eL_g ξ). The action functional C : C^1_{a,b}([0, T]; G) → R is defined by C(g(·)) := ∫_0^T L(g(t), ġ(t)) dt = ∫_0^T ℓ(ξ(t)) dt, where ξ(t) := g(t)^{-1}·ġ(t). [J.E. Marsden, T. Ratiu 1994] [J.E. Marsden, J. Scheurle 1993]: g(·) is a critical point of C if and only if it satisfies the Euler-Poincaré equation on T*_eG: d/dt(dℓ/dξ) − ad*_{ξ(t)}(dℓ/dξ) = 0, where ad*_ξ : T*_eG → T*_eG is the dual action of ad_ξ : T_eG → T_eG: ⟨ad*_ξ η, θ⟩ = ⟨η, ad_ξ θ⟩, η ∈ T*_eG, θ ∈ T_eG.

We will be interested in variations ξ(·) satisfying ξ̇(t) = ν̇(t) + ad_{ξ(t)} ν(t) for some ν ∈ C^1([0, T], T_eG), which is equivalent to the variation of g(·) with the perturbation g_ε(t) = g(t) e^{ε,ν}(t), where e^{ε,ν}(t) is the unique solution of the following ODE on G: d/dt e^{ε,ν}(t) = ε e^{ε,ν}(t)·ν̇(t), e^{ε,ν}(0) = e.
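For G = SO(3) the Euler-Poincaré equation above specialises to Euler's rigid-body equations Π̇ = Π × Ω with Ω = I⁻¹Π (the abstract mentions this application); a numeric sketch with an arbitrary inertia tensor of my choosing, checking the conserved Casimir ‖Π‖² and kinetic energy ½⟨Π, Ω⟩ along the flow:

```python
def rigid_body_flow(Pi, inertia, dt=1e-3, steps=5000):
    """RK4 integration of Pi' = Pi x Omega, Omega_i = Pi_i / I_i
    (Euler-Poincare / Lie-Poisson equations on so(3)*, diagonal inertia)."""
    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])
    def f(P):
        return cross(P, tuple(P[i] / inertia[i] for i in range(3)))
    for _ in range(steps):
        k1 = f(Pi)
        k2 = f(tuple(Pi[i] + 0.5 * dt * k1[i] for i in range(3)))
        k3 = f(tuple(Pi[i] + 0.5 * dt * k2[i] for i in range(3)))
        k4 = f(tuple(Pi[i] + dt * k3[i] for i in range(3)))
        Pi = tuple(Pi[i] + dt * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) / 6.0
                   for i in range(3))
    return Pi

def casimir(Pi):
    """|Pi|^2, conserved since Pi' is orthogonal to Pi."""
    return sum(p * p for p in Pi)

def energy(Pi, inertia):
    """Kinetic energy (1/2) <Pi, Omega>, conserved along the flow."""
    return 0.5 * sum(Pi[i] ** 2 / inertia[i] for i in range(3))
```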
Let M be an n-dimensional compact Riemannian manifold. We define G^s := {g : M → M a bijection, g, g^{-1} ∈ H^s(M, M)}, where H^s(M, M) denotes the manifold of Sobolev maps of class s > 1 + n/2 from M to itself. If s > 1 + n/2 then G^s is a C^∞ Hilbert manifold. G^s is a group under composition between maps; right translation is smooth, while left translation and inversion are only continuous. G^s is also a topological group (but not an infinite dimensional Lie group).

The tangent space T_ηG^s at an arbitrary η ∈ G^s is T_ηG^s = {U : M → TM of class H^s, U(m) ∈ T_{η(m)}M}. The Riemannian structure on M induces the weak L^2, or hydrodynamic, metric ⟨·,·⟩_0 on G^s given by ⟨U, V⟩^0_η := ∫_M ⟨U_η(m), V_η(m)⟩_m dμ_g(m) for any η ∈ G^s, U, V ∈ T_ηG^s. Here U_η := U ∘ η^{-1} ∈ T_eG^s and μ_g denotes the Riemannian volume associated with (M, g). Obviously, ⟨·,·⟩_0 is a right invariant metric on G^s.
Deterministic framework Stochastic framework EulerPoincaré equations Diffeomorphism group on a compact Riemannian manifold Volume preserving diffeomorphism group Lagrangian paths Characterization of the geodesics on Gs V , ·, · 0 EulerPoincaré equation on Gs V The tangent space TηGs at arbitrary η ∈ Gs is TηGs = U : M → TM of class Hs , U(m) ∈ Tη(m)M . The Riemannian structure on M induces the weak L2, or hydrodynamic, metric ·, · 0 on Gs given by U, V 0 η := M Uη(m), Vη(m) m dµg(m), for any η ∈ Gs, U, V ∈ TηGs. Here Uη := U ◦ η−1 ∈ TeGs and µg denotes the Riemannian volume asociated with (M, g). Obviously, ·, · 0 is a right invariant metric on Gs. Marc Arnaudon Stochastic EulerPoincaré reduction. Deterministic framework Stochastic framework EulerPoincaré equations Diffeomorphism group on a compact Riemannian manifold Volume preserving diffeomorphism group Lagrangian paths Characterization of the geodesics on Gs V , ·, · 0 EulerPoincaré equation on Gs V The tangent space TηGs at arbitrary η ∈ Gs is TηGs = U : M → TM of class Hs , U(m) ∈ Tη(m)M . The Riemannian structure on M induces the weak L2, or hydrodynamic, metric ·, · 0 on Gs given by U, V 0 η := M Uη(m), Vη(m) m dµg(m), for any η ∈ Gs, U, V ∈ TηGs. Here Uη := U ◦ η−1 ∈ TeGs and µg denotes the Riemannian volume asociated with (M, g). Obviously, ·, · 0 is a right invariant metric on Gs. Marc Arnaudon Stochastic EulerPoincaré reduction. Deterministic framework Stochastic framework EulerPoincaré equations Diffeomorphism group on a compact Riemannian manifold Volume preserving diffeomorphism group Lagrangian paths Characterization of the geodesics on Gs V , ·, · 0 EulerPoincaré equation on Gs V Let be the LeviCivita connection associated with the Riemannian manifold (M, g). We deﬁne a right invariant connection 0 on Gs by 0 ˜X ˜Y (η) := ∂ ∂t t=0 ˜Y(ηt ) ◦ η−1 t ◦ η + Xη Yη ◦ η, where ˜X, ˜Y ∈ L (Gs), Xη := ˜X ◦ η−1, Yη := ˜Y ◦ η−1 ∈ L s(M), and η is a C1 curve in Gs such that η0 = η and d dt t=0 ηt = ˜X(η). 
Here L (Gs) denotes the set of smooth vector ﬁelds on Gs. 0 is the LeviCivita connection associated to Gs, ·, · 0 . Marc Arnaudon Stochastic EulerPoincaré reduction. Deterministic framework Stochastic framework EulerPoincaré equations Diffeomorphism group on a compact Riemannian manifold Volume preserving diffeomorphism group Lagrangian paths Characterization of the geodesics on Gs V , ·, · 0 EulerPoincaré equation on Gs V Let be the LeviCivita connection associated with the Riemannian manifold (M, g). We deﬁne a right invariant connection 0 on Gs by 0 ˜X ˜Y (η) := ∂ ∂t t=0 ˜Y(ηt ) ◦ η−1 t ◦ η + Xη Yη ◦ η, where ˜X, ˜Y ∈ L (Gs), Xη := ˜X ◦ η−1, Yη := ˜Y ◦ η−1 ∈ L s(M), and η is a C1 curve in Gs such that η0 = η and d dt t=0 ηt = ˜X(η). Here L (Gs) denotes the set of smooth vector ﬁelds on Gs. 0 is the LeviCivita connection associated to Gs, ·, · 0 . Marc Arnaudon Stochastic EulerPoincaré reduction. Deterministic framework Stochastic framework EulerPoincaré equations Diffeomorphism group on a compact Riemannian manifold Volume preserving diffeomorphism group Lagrangian paths Characterization of the geodesics on Gs V , ·, · 0 EulerPoincaré equation on Gs V For s > 1 + n 2 , let Gs V := g, g ∈ Gs , g is volume preserving . Gs V is still a topological group. The tangent space TeGs V is G s V = TeGs V = U, U ∈ TeGs , div(U) = 0 . The L2metric ·, · 0 and its LeviCivita connection 0,V are deﬁned on Gs V by orthogonal projection. More precisely the Levi Civita connection on Gs V is given by 0,V X Y = Pe( 0 X Y) with Pe the orthogonal projection on G s V : Hs (TM) = G s V ⊕ dHs+1 (M). Marc Arnaudon Stochastic EulerPoincaré reduction. Deterministic framework Stochastic framework EulerPoincaré equations Diffeomorphism group on a compact Riemannian manifold Volume preserving diffeomorphism group Lagrangian paths Characterization of the geodesics on Gs V , ·, · 0 EulerPoincaré equation on Gs V For s > 1 + n 2 , let Gs V := g, g ∈ Gs , g is volume preserving . 
$G^s_V$ is still a topological group. Its tangent space at the identity is
$$\mathcal{G}^s_V = T_e G^s_V = \{U \in T_e G^s : \operatorname{div}(U) = 0\}.$$
The $L^2$ metric $\langle\cdot,\cdot\rangle^0$ and its Levi-Civita connection $\nabla^{0,V}$ are defined on $G^s_V$ by orthogonal projection. More precisely, the Levi-Civita connection on $G^s_V$ is given by $\nabla^{0,V}_X Y = P_e(\nabla^0_X Y)$, with $P_e$ the orthogonal projection onto $\mathcal{G}^s_V$ in the decomposition $H^s(TM) = \mathcal{G}^s_V \oplus dH^{s+1}(M)$.
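The condition $\operatorname{div}(U) = 0$ is exactly what makes the generated flow volume preserving. A minimal numeric sketch (not from the talk; the planar field $u(x,y) = (-y, x)$ and the RK4 integrator are my own choices) checks that the flow of a divergence-free field preserves area:

```python
import numpy as np

def u(p):
    """Velocity field u(x, y) = (-y, x): div u = 0, so its flow is area preserving."""
    x, y = p
    return np.array([-y, x])

def rk4_flow(p, t, n=200):
    """Integrate dp/dt = u(p) from time 0 to t with n RK4 steps."""
    h = t / n
    for _ in range(n):
        k1 = u(p)
        k2 = u(p + 0.5 * h * k1)
        k3 = u(p + 0.5 * h * k2)
        k4 = u(p + h * k3)
        p = p + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
    return p

def polygon_area(pts):
    """Shoelace formula for a simple polygon given by its vertices."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

square = np.array([[1.0, 1.0], [1.2, 1.0], [1.2, 1.2], [1.0, 1.2]])
image = np.array([rk4_flow(p, t=1.0) for p in square])
print(polygon_area(square), polygon_area(image))  # both close to 0.04
```

The transported square keeps its area to within the integrator's error, illustrating why divergence-free velocity fields generate paths in $G^s_V$.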
Lagrangian paths. Consider the ODE on $M$
$$\frac{d}{dt}\, g_t(x) = u(t, g_t(x)), \qquad g_0(x) = x.$$
Here $u(t, \cdot) \in T_e G^s$ for every $t > 0$. For every fixed $t > 0$, $g_t(\cdot) \in G^s(M)$, so $g \in C^1([0, T], G^s)$. If $\operatorname{div}(u(t)) = 0$ for every $t$, then $g \in C^1([0, T], G^s_V)$.

Characterization of the geodesics on $(G^s_V, \langle\cdot,\cdot\rangle^0)$ [V.I. Arnold 1966] [D.G. Ebin, J.E.
Marsden 1970]: a Lagrangian path $g \in C^2([0, T], G^s_V)$ satisfying the equation above is a geodesic on $(G^s_V, \langle\cdot,\cdot\rangle^{0,V})$ (i.e. $\nabla^{0,V}_{\dot g(t)}\dot g(t) = 0$) if and only if the velocity field $u$ satisfies the Euler equation for incompressible inviscid fluids
$$(E)\qquad \frac{\partial u}{\partial t} = -\nabla_u u - \nabla p, \qquad \operatorname{div} u = 0.$$
Notice that the pressure term $\nabla p$ corresponds to the use of $\nabla^0$ instead of $\nabla^{0,V}$: the first system rewrites as
$$\frac{\partial u}{\partial t} = -\nabla^{0,V}_u u, \qquad \operatorname{div} u = 0.$$

If we take $\ell : T_e G^s_V \to \mathbb{R}$ as $\ell(X) := \langle X, X\rangle$, $X \in T_e G^s_V$, and define the action functional $C : C^1_{e,e}([0, T], G^s_V) \to \mathbb{R}$ by
$$C(g(\cdot)) := \int_0^T \ell\big(\dot g(t) \circ g(t)^{-1}\big)\, dt,$$
then a Lagrangian path $g \in C^2([0, T], G^s_V)$, integral path of $u$, is a critical point of $C$ if and only if $u$ satisfies the Euler equation (E). [J.E. Marsden, T. Ratiu 1994] [J.E. Marsden, J. Scheurle 1993]
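As a concrete instance of the Euler equation (E), one can check numerically that the classical Taylor-Green velocity field on the 2-torus is a steady solution: its convective term $\nabla_u u$ is a gradient, hence absorbed by the pressure. This sketch (my own example, not from the talk) verifies by finite differences that the curl of $\nabla_u u$ vanishes:

```python
import numpy as np

def u(x, y):
    """Taylor-Green field on the 2-torus; div u = 0."""
    return np.array([np.sin(x) * np.cos(y), -np.cos(x) * np.sin(y)])

def conv(x, y, h=1e-5):
    """Convective term (u . grad) u by central differences."""
    ux = (u(x + h, y) - u(x - h, y)) / (2 * h)
    uy = (u(x, y + h) - u(x, y - h)) / (2 * h)
    u0 = u(x, y)
    return u0[0] * ux + u0[1] * uy

def curl_conv(x, y, h=1e-4):
    """Scalar curl of the convective term; zero iff it is a gradient."""
    dx = (conv(x + h, y)[1] - conv(x - h, y)[1]) / (2 * h)
    dy = (conv(x, y + h)[0] - conv(x, y - h)[0]) / (2 * h)
    return dx - dy

rng = np.random.default_rng(0)
pts = rng.uniform(0, 2 * np.pi, size=(20, 2))
err = max(abs(curl_conv(x, y)) for x, y in pts)
print(err)  # tiny: the convective term is a gradient, so u is a steady Euler flow
```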
[S. Shkoller 1998] If we take $\ell : T_e G^s_V \to \mathbb{R}$ as the $H^1$ metric
$$\ell(X) := \int_M \langle X, X\rangle_m \, d\mu_g(m) + \alpha^2 \int_M \langle \nabla X, \nabla X\rangle_m \, d\mu_g(m), \qquad X \in T_e G^s_V,$$
and define the action functional $C : C^1_{e,e}([0, T], G^s_V) \to \mathbb{R}$ in the same way as before, then a Lagrangian path $g \in C^2([0, T], G^s_V)$, integral path of $u$, is a critical point of $C$ if and only if $u$ satisfies the Camassa-Holm equation
$$\frac{\partial \nu}{\partial t} + \nabla_u \nu + \alpha^2 (\nabla u)^* \cdot \Delta \nu = -\nabla p, \qquad \nu = (1 + \alpha^2 \Delta) u, \qquad \operatorname{div}(u) = 0.$$

Stochastic framework: semimartingales in a Lie group; stochastic Euler-Poincaré reduction; group of volume-preserving diffeomorphisms; Navier-Stokes and Camassa-Holm equations.

Aim: to establish a stochastic Euler-Poincaré reduction theorem in a general Lie group.
To apply it to volume-preserving diffeomorphisms of a compact symmetric space. For the Euler equation, the stochastic term will correspond to introducing viscosity.

Semimartingales in a Lie group. An $\mathbb{R}^n$-valued semimartingale $\xi_t$ has a decomposition $\xi_t(\omega) = N_t(\omega) + A_t(\omega)$, where $(N_t)$ is a local martingale and $(A_t)$ has finite variation. If $(N_t)$ is a martingale, then $E[N_t \mid \mathcal{F}_s] = N_s$, $t \ge s$. We are interested in semimartingales which furthermore satisfy $A_t(\omega) = \int_0^t a_s(\omega)\, ds$. Defining
$$\frac{D\xi_t}{dt} := \lim_{\varepsilon \to 0} E\Big[\frac{\xi_{t+\varepsilon} - \xi_t}{\varepsilon} \,\Big|\, \mathcal{F}_t\Big],$$
we have $\dfrac{D\xi_t}{dt} = a_t$.
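The derivative $D\xi_t/dt$ can be illustrated by simulation. In this sketch (my own; the constant drift $a = 0.7$ and all step counts are arbitrary choices), the semimartingale is $\xi_t = W_t + \int_0^t a\, ds$, and a finite-difference expectation recovers the drift:

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_steps, T = 20_000, 100, 1.0
dt = T / n_steps
a = 0.7  # deterministic drift

# xi_t = N_t + int_0^t a ds, with N_t a standard Brownian motion
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
xi = np.cumsum(dW + a * dt, axis=1)

# D xi_t / dt = lim E[(xi_{t+eps} - xi_t) / eps | F_t]; the drift is
# deterministic here, so a plain Monte Carlo average estimates it.
k, m = 40, 60          # window [k dt, m dt], so eps = (m - k) dt = 0.2
eps = (m - k) * dt
drift_est = np.mean(xi[:, m] - xi[:, k]) / eps
print(drift_est)  # close to 0.7
```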
Itô formula:
$$f(\xi_t) = f(\xi_0) + \int_0^t \langle df(\xi_s), dN_s\rangle + \int_0^t \langle df(\xi_s), dA_s\rangle + \frac12 \int_0^t \operatorname{Hess} f\,(d\xi_s \otimes d\xi_s).$$
From this we see that $\xi_t$ is a local martingale if and only if, for all $f \in C^2(\mathbb{R}^n)$,
$$f(\xi_t) - f(\xi_0) - \frac12 \int_0^t \operatorname{Hess} f\,(d\xi_s \otimes d\xi_s)$$
is a real-valued local martingale. This property becomes a definition for manifold-valued martingales.
Definition. Let $a_t \in T_{\xi_t}M$ be an adapted process. If for all $f \in C^2(M)$
$$f(\xi_t) - f(\xi_0) - \int_0^t \langle df(\xi_s), a_s\rangle\, ds - \frac12 \int_0^t \operatorname{Hess} f\,(d\xi_s \otimes d\xi_s)$$
is a real-valued local martingale, then $\dfrac{D\xi_t}{dt} = a_t$.

Let $G$ be a Lie group with right-invariant metric $\langle\cdot,\cdot\rangle$ and right-invariant connection $\nabla$. Let $\mathcal{G} := T_e G$ be the Lie algebra of $G$. Consider a countable family $H_i$, $i \ge 1$, of elements of $\mathcal{G}$, and $u \in C^1([0, T], \mathcal{G})$. Consider the Stratonovich equation
$$dg_t = \Big(\sum_{i \ge 1} \big(H_i \circ dW^i_t - \tfrac12 \nabla_{H_i} H_i\, dt\big) + u(t)\, dt\Big) \cdot g_t, \qquad g_0 = e,$$
where the $(W^i_t)$ are independent real-valued Brownian motions. The Itô formula writes
$$f(g_t) = f(g_0) + \sum_{i \ge 1} \int_0^t \langle df(g_s), H_i\rangle\, dW^i_s + \int_0^t \langle df(g_s), u(s)\, g_s\rangle\, ds + \frac12 \sum_{i \ge 1} \int_0^t \operatorname{Hess} f\,(H_i(g_s), H_i(g_s))\, ds.$$
This implies that $\dfrac{Dg_t}{dt} = u(t)\, g_t$.

Particular case: if $(H_i)$ is an orthonormal basis, $\nabla_{H_i} H_i = 0$, $\nabla$ is the Levi-Civita connection associated to the metric, and $u \equiv 0$, then $g_t$ is a Brownian motion in $G$.
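For an abelian toy example one can take $G = U(1)$ with Lie algebra $i\mathbb{R}$ and a single noise direction $H_1 = i$; then $\nabla_{H_1} H_1 = 0$ and a step-by-step group exponential solves the Stratonovich equation while staying on the group. A sketch (the drift value and step count are my own choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
n_steps, T, u_drift = 1000, 1.0, 0.3
dt = T / n_steps

# G = U(1) = {z in C : |z| = 1}, Lie algebra iR, single noise direction H_1 = i.
# In this abelian case nabla_{H_1} H_1 = 0, and composing group exponentials
# solves dg = (H_1 o dW + u dt) . g while keeping g on the group.
g = 1.0 + 0.0j
W = 0.0
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt))
    W += dW
    g *= np.exp(1j * (dW + u_drift * dt))

print(abs(g))                              # stays on U(1) up to roundoff
print(g, np.exp(1j * (W + u_drift * T)))   # matches the closed-form solution
```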
On the space $\mathcal{S}(G)$ of $G$-valued semimartingales, define
$$J(\xi) = \frac12\, E\Big[\int_0^T \Big\|\frac{D\xi}{dt}\Big\|^2 dt\Big].$$
Perturbation: for $v \in C^1([0, T], \mathcal{G})$ satisfying $v(0) = v(T) = 0$ and $\varepsilon > 0$, let $e_{\varepsilon,v}(\cdot) \in C^1([0, T], G)$ be the flow generated by $\varepsilon v$:
$$\frac{d}{dt}\, e_{\varepsilon,v}(t) = \varepsilon\, \dot v(t) \cdot e_{\varepsilon,v}(t), \qquad e_{\varepsilon,v}(0) = e.$$
Definition. We say that $g \in \mathcal{S}(G)$ is a critical point of $J$ if for all $v \in C^1([0, T], \mathcal{G})$ satisfying $v(0) = v(T) = 0$,
$$\frac{dJ}{d\varepsilon}\Big|_{\varepsilon=0}\, \big(g_{\varepsilon,v}\big) = 0, \qquad \text{where } g_{\varepsilon,v}(t) = e_{\varepsilon,v}(t)\, g(t).$$
Theorem. $g$ is a critical point of $J$ if and only if
$$\frac{du(t)}{dt} = -\operatorname{ad}^*_{\tilde u(t)} u(t) - K(u(t)),$$
with $\tilde u(t) = u(t) - \frac12 \sum_{i \ge 1} \nabla_{H_i} H_i$, where $\langle \operatorname{ad}^*_u v, w\rangle = \langle v, \operatorname{ad}_u w\rangle$ and $K : \mathcal{G} \to \mathcal{G}$ satisfies
$$\langle K(u), v\rangle = -\Big\langle u,\ \frac12 \sum_{i \ge 1} \big(\nabla_{\operatorname{ad}_v H_i} H_i + \nabla_{H_i}(\operatorname{ad}_v(H_i))\big)\Big\rangle.$$
Remark 1: if for all $i \ge 1$, $\nabla H_i = 0$, or $\nabla_u v = 0$ for all $u, v \in \mathcal{G}$, then $K(u) = 0$ and we get the standard Euler-Poincaré equation.

Proposition. If $\nabla_{H_i} H_i = 0$ for all $i \ge 1$, then
$$K(u) = -\frac12 \sum_{i \ge 1} \big(\nabla_{H_i} \nabla_{H_i} u + R(u, H_i) H_i\big).$$
In particular, if $(H_i)$ is an orthonormal basis of $\mathcal{G}$, then $K(u) = -\frac12 \square u = -\frac12 \Delta u + \frac12 \operatorname{Ric} u$, with $\square$ the Hodge Laplacian.

Group of volume-preserving diffeomorphisms. Let $G^s_V = \{g : M \to M \text{ volume-preserving bijection such that } g, g^{-1} \in H^s\}$. Assume $s > 1 + \frac{\dim M}{2}$; then $G^s_V$ is a $C^\infty$ smooth manifold.
Its Lie algebra is $\mathcal{G}^s_V = T_e G^s_V = \{X \in H^s(M, TM) : \pi(X) = e,\ \operatorname{div}(X) = 0\}$. Notice that $\pi(X) = e$ means that $X$ is a vector field on $M$: $X(x) \in T_x M$.

On $\mathcal{G}^s_V$ consider the two scalar products
$$\langle X, Y\rangle^0 = \int_M \langle X(x), Y(x)\rangle\, dx \qquad \text{and} \qquad \langle X, Y\rangle^1 = \int_M \langle X(x), Y(x)\rangle\, dx + \int_M \langle \nabla X(x), \nabla Y(x)\rangle\, dx.$$
The Levi-Civita connection on $G^s_V$ is given by $\nabla^{0,V}_X Y = P_e(\nabla^0_X Y)$, with $\nabla^0$ the Levi-Civita connection of $\langle\cdot,\cdot\rangle^0$ on $G^s$ and $P_e$ the orthogonal projection onto $\mathcal{G}^s_V$ in the decomposition $H^s(TM) = \mathcal{G}^s_V \oplus dH^{s+1}(M)$. One can find $(H_i)_{i \ge 1}$ such that for all $i \ge 1$, $\nabla_{H_i} H_i = 0$, $\operatorname{div}(H_i) = 0$, and
$$\sum_{i \ge 1} H_i^2 f = \nu \Delta f, \qquad f \in C^2(M).$$
Corollary. (1) $g$ is a critical point of $J_{\langle\cdot,\cdot\rangle^0}$ if and only if $u$ solves the Navier-Stokes equation
$$\frac{\partial u}{\partial t} = -\nabla_u u + \frac{\nu}{2} \Delta u - \nabla p, \qquad \operatorname{div} u = 0.$$
(2) Assume $M = \mathbb{T}^2$, the 2-dimensional torus. Then $g$ is a critical point of $J_{\langle\cdot,\cdot\rangle^1}$ if and only if $u$ solves the Camassa-Holm equation
$$\frac{\partial u}{\partial t} = -\nabla_u v - \sum_{j=1}^2 v_j \nabla u_j + \frac{\nu}{2} \Delta v - \nabla p, \qquad v = u - \Delta u, \qquad \operatorname{div} u = 0.$$
For the proof, use the Itô formula and compute, in the different situations, $\operatorname{ad}^*_v(u)$ and $K(u)$.
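The Taylor-Green vortex gives a closed-form check of the Navier-Stokes part of the corollary (written here with the classical viscosity coefficient $\nu$ rather than the talk's $\nu/2$ scaling; the field, pressure, and step sizes are my own choices): $u(t) = e^{-2\nu t}(\sin x \cos y, -\cos x \sin y)$ with $p = \frac14 e^{-4\nu t}(\cos 2x + \cos 2y)$ solves $\partial_t u + \nabla_u u + \nabla p - \nu \Delta u = 0$, which the sketch verifies by finite differences:

```python
import numpy as np

nu = 0.1  # viscosity (arbitrary)

def u(t, x, y):
    f = np.exp(-2 * nu * t)
    return np.array([f * np.sin(x) * np.cos(y), -f * np.cos(x) * np.sin(y)])

def p(t, x, y):
    return 0.25 * np.exp(-4 * nu * t) * (np.cos(2 * x) + np.cos(2 * y))

def ns_residual(t, x, y, h=1e-3):
    """du/dt + (u . grad) u + grad p - nu * laplace(u), all by finite differences."""
    du_dt = (u(t + h, x, y) - u(t - h, x, y)) / (2 * h)
    ux = (u(t, x + h, y) - u(t, x - h, y)) / (2 * h)
    uy = (u(t, x, y + h) - u(t, x, y - h)) / (2 * h)
    u0 = u(t, x, y)
    convective = u0[0] * ux + u0[1] * uy
    lap = (u(t, x + h, y) + u(t, x - h, y) + u(t, x, y + h) + u(t, x, y - h)
           - 4 * u0) / h**2
    grad_p = np.array([(p(t, x + h, y) - p(t, x - h, y)) / (2 * h),
                       (p(t, x, y + h) - p(t, x, y - h)) / (2 * h)])
    return du_dt + convective + grad_p - nu * lap

rng = np.random.default_rng(3)
pts = rng.uniform(0, 2 * np.pi, size=(10, 2))
err = max(np.max(np.abs(ns_residual(0.5, x, y))) for x, y in pts)
print(err)  # small: only finite-difference truncation error remains
```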
Information Geometry Optimization (chaired by Giovanni Pistone, Yann Ollivier)
When observing data $x_1, \ldots, x_t$ modelled by a probabilistic distribution $p_\theta(x)$, the maximum likelihood (ML) estimator $\theta^{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^t \ln p_\theta(x_i)$ cannot, in general, safely be used to predict $x_{t+1}$. For instance, for a Bernoulli process, if only “tails” have been observed so far, the probability of “heads” is estimated to 0. (Thus for the standard log-loss scoring rule, this results in infinite loss the first time “heads” appears.)
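A few lines make the failure concrete, next to Laplace's rule of succession which the talk develops (the sequence of ten tails is my own toy input):

```python
import math

def ml_prob_heads(xs):
    """Maximum-likelihood estimate of P(heads) after observations xs (1 = heads)."""
    return sum(xs) / len(xs)

def laplace_prob_heads(xs):
    """Laplace's rule of succession: add one fictitious head and one tail."""
    return (sum(xs) + 1) / (len(xs) + 2)

tails = [0] * 10                   # ten tails observed, no heads yet
p_ml = ml_prob_heads(tails)        # 0.0: ML gives heads probability zero
p_lap = laplace_prob_heads(tails)  # 1/12
loss_ml = math.inf if p_ml == 0 else -math.log(p_ml)
loss_lap = -math.log(p_lap)
print(p_ml, p_lap)        # 0.0 0.0833...
print(loss_ml, loss_lap)  # inf 2.4849...: ML pays infinite log-loss on a head
```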

Laplace’s Rule of Succession in Information Geometry. Yann Ollivier (CNRS & Paris-Saclay University, France)

Sequential prediction problem: given observations $x_1, \ldots, x_t$, build a probabilistic model $p_{t+1}$ for $x_{t+1}$, iteratively. Example: given that $w$ women and $m$ men entered this room, what is the probability that the next person who enters is a woman/man?

A common performance criterion for prediction is the cumulated log-loss
$$L_T := -\sum_{t=0}^{T-1} \log p_{t+1}(x_{t+1} \mid x_{1 \ldots t}),$$
to be minimized. This corresponds to compression cost, and is also equal to square loss for Gaussian models.

Maximum likelihood strategy: fix a parametric model p
A divergence function defines a Riemannian metric G and dually coupled affine connections (∇, ∇∗) with respect to it in a manifold M. When M is dually flat, a canonical divergence is known, which is uniquely determined from {G, ∇, ∇∗}. We search for a standard divergence for a general non-flat M. It is introduced by the magnitude of the inverse exponential map, where the α = 1/3 connection plays a fundamental role. The standard divergence is different from the canonical divergence.

GSI 2015, Paris. Standard Divergence in Manifold of Dual Affine Connections. Shun-ichi Amari (RIKEN Brain Science Institute), Nihat Ay (Max-Planck Institute for Mathematics in the Sciences)

Divergence and metric: for nearby points, $D[p : p + dp] = \frac12 \sum_{i,j} g_{ij}\, d\theta^i d\theta^j + O(|d\theta|^3)$, with $G = (g_{ij})$ a positive-definite Riemannian metric. A divergence also induces dual affine connections $(\nabla, \nabla^*)$.

Dual geometry: $X\langle Y, Z\rangle = \langle \nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle$; the mean connection $\nabla^0 = \frac12(\nabla + \nabla^*)$ is the Levi-Civita connection of $g$, and the difference tensor $T_{ijk} = \Gamma^*_{ijk} - \Gamma_{ijk}$ is totally symmetric. When $M$ is dually flat, the canonical (Bregman) divergence is available.

Exponential map divergence: with the geodesic exponential map of a connection and $X(p \to q) := \exp_p^{-1}(q)$, define $D[p : q] := \frac12 \|X(p \to q)\|^2$. Theorem 1: the exponential map divergence induces a geometry on $M$. Standard divergence: $D_{\mathrm{stan}}[p : q] := \frac12 \|X_{1/3}(p \to q)\|^2$, built from the $\alpha = 1/3$ connection. Theorem 2: the exponential map divergence recovers the original geometry. Remark: in the dually flat case $D_{\mathrm{stan}} \ne D_{\mathrm{can}}$, and $D^*_{\mathrm{stan}}[p : q] = D_{\mathrm{stan}}[q : p]$.

Divergence and projection: projection theorem: $\hat q = \arg\min_{q \in S} D[p : q]$, characterized by the gradient of $D[p : \cdot]$ being orthogonal to $S$ at $\hat q$.
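The relation between a divergence and its induced metric can be checked numerically: expanding the Kullback-Leibler divergence of a Bernoulli model to second order recovers the Fisher metric $g(p) = 1/(p(1-p))$. A sketch (KL and the Bernoulli model are standard; the sample point $p = 0.3$ is my own choice):

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p = 0.3
fisher = 1.0 / (p * (1 - p))  # Fisher information of the Bernoulli model at p

# D[p : p + d] = (1/2) g(p) d^2 + O(d^3), so the ratio below tends to 1
for d in (1e-2, 1e-3):
    ratio = kl_bernoulli(p, p + d) / (0.5 * fisher * d * d)
    print(ratio)
```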
The statistical structure on a manifold M is predicated upon a special kind of coupling between the Riemannian metric g and a torsion-free affine connection ∇ on the tangent bundle TM, such that ∇g is totally symmetric, forming, by definition, a “Codazzi pair” {∇, g}. In this paper, we first investigate various transformations of affine connections, including additive translation (by an arbitrary (1,2)-tensor K), multiplicative perturbation (through an arbitrary invertible operator L on TM), and conjugation (through a non-degenerate two-form h). We then study the Codazzi coupling of ∇ with h and its coupling with L, and the link between these two couplings. We introduce, as special cases of K-translations, various transformations that generalize traditional projective and dual-projective transformations, and study their commutativity with L-perturbation and h-conjugation transformations. Our derivations allow affine connections to carry torsion, and we investigate conditions under which torsion is preserved by the various transformations mentioned above. Our systematic approach establishes a general setting for the study of Information Geometry based on transformations and coupling relations of affine connections; in particular, we provide a generalization of the conformal-projective transformation.

Transformations and Coupling Relations for Affine Connections. James Tao (Harvard University, Cambridge MA), Jun Zhang (University of Michigan, Ann Arbor MI). Oct 29, 2015.

Outline:
1. Transformation of affine connections (with torsion): h-conjugation, by a two-form h; gauge transform, by an operator L; additive translation, by a (1,2)-tensor K.
2. Commutative relations and “commutativity prisms”: keeping track of torsion as one goes through the transformations.
3. Transformations that preserve Codazzi coupling (g, ∇): more general than the “conformal-projective transformation”?
Statistical manifold and Codazzi coupling. On a differentiable manifold M, one independently prescribes: (1) a pseudo-Riemannian metric g; (2) an affine connection ∇. The pair (g, ∇) is said to be Codazzi-coupled if
$$(\nabla_Z g)(X, Y) = (\nabla_X g)(Z, Y).$$
This notion is a generalization of Levi-Civita coupling (i.e., parallelism of g with respect to ∇). It can be shown that (∇, g) is Codazzi-coupled if and only if ∇ and ∇* have the same torsion.
Conjugate connection

$g$-conjugation of a connection: given any $(g, \nabla)$, the conjugate connection $\nabla^*$ can be defined by
$Z\,g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^*_Z Y)$.
It can be verified that (i) $\nabla^*$ is indeed a connection and (ii) the conjugation action on $\nabla$ is involutive: $(\nabla^*)^* = \nabla$.
Defining a connection by conjugacy with a nondegenerate two-form $h$ can be done unambiguously only when $h$ is symmetric or skew-symmetric; otherwise the "left conjugate" and the "right conjugate", in reference to the slots of $h(\cdot, \cdot)$, will not be the same.

Gauge transformation of a connection

Let $L$ denote a $TM$ isomorphism. The gauge transformation of $\nabla$ by $L$, denoted $L(\nabla)$, is defined for vector fields $X, Y$ by
$(L(\nabla))_X Y = L^{-1}(\nabla_X (LY))$.
The pair $(L, \nabla)$ is said to be Codazzi-coupled if $(\nabla_X L)Y = (\nabla_Y L)X$, where $(\nabla_X L)Y \equiv \nabla_X(LY) - L(\nabla_X Y)$.

Proposition (Schwenk-Schellschmidt and Simon, 2009). Let $\nabla$ be an affine connection, and $L$ a tangent bundle isomorphism. Then the following are equivalent:
1. $(\nabla, L)$ is Codazzi-coupled.
2. $\nabla$ and $L(\nabla)$ have equal torsions.
3. $(L(\nabla), L^{-1})$ is Codazzi-coupled.

Linking $g$-conjugation with the $L$-gauge transform

We proved the following characterization theorem for $g$-conjugation of a connection in terms of any $L$.

Characterization Theorem. Let $\nabla$ be a connection and $\nabla^*$ its conjugate connection w.r.t. a metric $g$. Denote $\omega(X, Y) = g(LX, Y)$ for an arbitrary $TM$ isomorphism $L$. Then $\nabla\omega = 0$ if and only if $L(\nabla^*) = \nabla$. Explicitly written:
$\nabla^*_Z X = \nabla_Z X + L(\nabla_Z L^{-1})X$.
The proof uses the identity (for any invertible operator $L$):
$C_h(X, Y, Z) = C_g(L(X), Y, Z) + g((\nabla_Z L)X, Y)$,
where $C(X, Y, Z) \equiv (\nabla_Z g)(X, Y)$ and $h(X, Y) \equiv g(L(X), Y)$.

Translation of a connection by a $K$-tensor

Translation by a (1,2)-tensor: $\nabla_X Y \to \nabla_X Y + K(X, Y)$. It is torsion-preserving iff $K$ is symmetric: $K(X, Y) = K(Y, X)$.

Examples of $K$-translations:
(i) $P_\vee(\tau): \nabla_X Y \mapsto \nabla_X Y + \tau(X)Y$, the $P_\vee$-transformation;
(ii) $P(\tau): \nabla_X Y \mapsto \nabla_X Y + \tau(Y)X$, the $P$-transformation;
(iii) $\mathrm{Proj}(\tau): \nabla_X Y \mapsto \nabla_X Y + \tau(Y)X + \tau(X)Y$, called the projective transformation, always torsion-preserving;
(iv) $D(h, V): \nabla_X Y \mapsto \nabla_X Y - h(Y, X)V$, called the "dual-projective transformation", torsion-preserving when $h$ is symmetric.
Here $\tau$ is an arbitrary one-form, $h$ is a nondegenerate two-form, and $X, Y, V$ are all vector fields.
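To see why example (iii) is always torsion-preserving while (i) and (ii) alone need not be, one can check the symmetry of the corresponding $K$ directly (a routine verification, added here for the reader):

```latex
% K for Proj(\tau) is symmetric in X and Y, hence torsion-preserving:
K(X,Y) = \tau(Y)X + \tau(X)Y = \tau(X)Y + \tau(Y)X = K(Y,X).
% By contrast, K for P(\tau) alone, K(X,Y) = \tau(Y)X, has skew part
K(X,Y) - K(Y,X) = \tau(Y)X - \tau(X)Y,
% which shifts the torsion tensor by exactly this amount whenever \tau \neq 0.
```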
Interactions of $h$-conjugation, $L$-gauge, and $K$-translation

Let $g$, $L$, $\tau$ be as above. Let $g_L$ denote $g(L\cdot, \cdot)$, let $\Gamma_L$ denote the $L$-gauge transformation, let $C(g)$ denote conjugation w.r.t. $g$, and let $\bar\tau$ be the vector field such that $g(X, \bar\tau) = \tau(X)$.
[Commutativity prisms (diagrams): one relates $P(\tau)$, $C(g)$, $\Gamma_L$, $C(g_L)$, $D(g, \bar\tau)$ and $D(g_L, L^{-1}(\bar\tau))$; a second relates $P_\vee(\tau)$, $C(g)$, $\Gamma_L$, $C(g_L)$ and $P_\vee(-\tau)$.]

Conformal-projective transformation (CPT)

The conformal-projective transformation (CPT) is defined (Kurose, 2002), for any smooth functions $\psi$ and $\phi$, by
$g(X, Y) \mapsto e^{\psi+\phi}\, g(X, Y)$
$\nabla_X Y \mapsto \nabla_X Y - g(X, Y)\,\mathrm{grad}_g \psi + X(\phi)Y + Y(\phi)X$

CPT includes, as special cases:
- the projective transformation of $\nabla$;
- the conformal transformation of $g$ and its Levi-Civita connection;
- the dual-projective transformation of $\nabla^*$, given $(g, \nabla)$;
- the Codazzi transform of $g$ and $\nabla$;
- the $\alpha$-conformal transformation of $g$ and $\nabla$.

It is known that CPT preserves the Codazzi coupling of $(g, \nabla)$. We wonder whether it can be further generalized while preserving the Codazzi structure.
$CP(V, W, L)$ preserving the Codazzi structure

Generalized conformal-projective transformation $CP(V, W, L)$: let $V$ and $W$ be vector fields, and $L$ an invertible operator.
$CP(V, W, L)$ consists of an $L$-perturbation of the metric $g$ along with a torsion-preserving transformation $D(g, W) \circ \mathrm{Proj}(\tilde V)$ of the connection $\nabla$, where $\tilde V$ is the one-form given by $\tilde V(X) := g(V, X)$ for any vector field $X$.

Proposition (assuming $\dim M \ge 4$). $CP(V, W, L)$ preserves Codazzi pairs $\{\nabla, g\}$ if and only if $L = e^f$ for some smooth function $f$, and $V + W = \mathrm{grad}_g f$.

Take $\tilde V$ to be an arbitrary one-form, not necessarily closed, and $\tilde W := df - \tilde V$ for some fixed smooth function $f$. CPT results when $f = \phi + \psi$, in which case $df = d\phi + d\psi$ is a natural decomposition.

Recent development (Teng Fei and Jun Zhang)

Let $L$ be $J$ (almost complex structure) or $K$ (almost paracomplex structure): $J^2 = -\mathrm{id}$; $K^2 = \mathrm{id}$.
A compatible triple $(g, \omega, L)$ satisfies:
1. $g(LX, Y) + g(X, LY) = 0$;
2. $\omega(LX, Y) = \omega(X, LY)$;
3. $\omega(X, Y) = g(LX, Y)$.

A manifold $M$ is called:
1. symplectic if there exists a symplectic (skew-symmetric and nondegenerate) form $\omega$ that is closed: $d\omega = 0$;
2. Fedosov if (i) $M$ is symplectic and (ii) there exists a torsion-free connection $\nabla$ parallel to $\omega$: $\nabla\omega = 0$;
3. (para-)Kähler if (i) $M$ is symplectic and (ii) there exists an integrable $L$ compatible with $\omega$: $\omega(X, LY) = \omega(LX, Y)$.

Codazzi structure and (para-)Kähler structure

Main Theorem. Let $\nabla$ be a torsion-free connection on $M$, and let $L$ denote either a $J$ (almost complex) or a $K$ (almost paracomplex) operator on $TM$. Then, among the following three statements, any two imply the third:
1. $\nabla$ is Codazzi-coupled with $g$;
2. $\nabla$ is Codazzi-coupled with $L$;
3. $\nabla\omega = 0$.
As a result, $M$ becomes a Kähler or para-Kähler manifold. In other words, Codazzi coupling of $(\nabla, L)$ turns a statistical manifold or a Fedosov manifold into a (para-)Kähler manifold, which is then both statistical and symplectic.
Thank you for your attention!

References:
- Tao, J. and Zhang, J. (2015). Transformation and coupling relations for affine connections. Proceedings of GSI 2015. Springer.
- Fei, T. and Zhang, J. (in preparation). Interaction of Codazzi structure and (para-)Kähler structure.
This paper addresses the problem of online learning of finite statistical mixtures of exponential families. We give a short review of the Expectation-Maximization (EM) algorithm and its online extensions. Building on these extensions and on the description of the k-Maximum Likelihood Estimator (k-MLE), we propose three online extensions of the latter. To illustrate them, we consider the case of mixtures of Wishart distributions, giving details and providing some experiments.

Online k-MLE for mixture modelling with exponential families
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information 2015, Oct 28-30, 2015, Ecole Polytechnique, Paris-Saclay

Application context
We are interested in building a system (a model) which evolves when new data is available: $x_1, x_2, \ldots, x_N, \ldots$
- The time needed for processing a new observation must be constant w.r.t. the number of observations.
- The memory required by the system is bounded.
- Denote by $\pi$ the unknown distribution of $X$.

Outline of this talk
1. Online learning of exponential families
2. Online learning of mixtures of exponential families: introduction, EM, k-MLE; recursive EM, online EM; stochastic approximations of k-MLE; experiments
3. Conclusions

Reminder: (regular) exponential family
Firstly, $\pi$ will be approximated by a member of a (regular) exponential family (EF):
$\mathrm{EF} = \{ f(x; \theta) = \exp\{\langle s(x), \theta\rangle + k(x) - F(\theta)\} \mid \theta \in \Theta \}$
Terminology:
- $\lambda$: source parameters
- $\theta$: natural parameters
- $\eta$: expectation parameters
- $s(x)$: sufficient statistic
- $k(x)$: auxiliary carrier measure
- $F(\theta)$: the log-normalizer, differentiable and strictly convex; $\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < \infty\}$ is an open convex set
Almost all common distributions are EF members (the uniform and Cauchy distributions are exceptions).

Reminder: maximum likelihood estimate (MLE)
The MLE for a general p.d.f.:
$\hat\theta^{(N)} = \operatorname*{argmax}_\theta \prod_{i=1}^N f(x_i; \theta) = \operatorname*{argmin}_\theta\; -\frac{1}{N}\sum_{i=1}^N \log f(x_i; \theta)$
assuming a sample $\chi = \{x_1, x_2, \ldots, x_N\}$ of i.i.d. observations.
The MLE for an EF:
$\hat\theta^{(N)} = \operatorname*{argmin}_\theta\; -\left\langle \frac{1}{N}\sum_i s(x_i), \theta\right\rangle - \mathrm{cst}(\chi) + F(\theta)$
which is exactly solved in $H$, the space of expectation parameters:
$\hat\eta^{(N)} = \nabla F(\hat\theta^{(N)}) = \frac{1}{N}\sum_i s(x_i) \quad\Longleftrightarrow\quad \hat\theta^{(N)} = (\nabla F)^{-1}\!\left(\frac{1}{N}\sum_i s(x_i)\right)$

Exact online MLE for an exponential family
A recursive formulation is easily obtained.

Algorithm 1: Exact online MLE for an EF
Input: a sequence $S$ of observations; the functions $s$ and $(\nabla F)^{-1}$ for some EF
Output: a sequence of MLEs over all observations seen so far
$\hat\eta^{(0)} = 0$; $N = 1$;
for $x_N \in S$ do
  $\hat\eta^{(N)} = \hat\eta^{(N-1)} + N^{-1}(s(x_N) - \hat\eta^{(N-1)})$;
  yield $\hat\eta^{(N)}$, or yield $(\nabla F)^{-1}(\hat\eta^{(N)})$;
  $N = N + 1$;

Analytical expressions of $(\nabla F)^{-1}$ exist for most EFs (but not all).

Case of the multivariate normal distribution (MVN)
Probability density function of the MVN:
$N(x; \mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\left(-\tfrac12 (x-\mu)^T \Sigma^{-1}(x-\mu)\right)$
One possible decomposition:
$N(x; \theta_1, \theta_2) = \exp\{\langle\theta_1, x\rangle + \langle\theta_2, -xx^T\rangle_F - \tfrac14\, \theta_1^T\theta_2^{-1}\theta_1 - \tfrac{d}{2}\log(\pi) + \tfrac12 \log|\theta_2|\}$
$\Longrightarrow s(x) = (x, -xx^T)$, and
$(\nabla F)^{-1}(\eta_1, \eta_2) = \left((-\eta_1\eta_1^T - \eta_2)^{-1}\eta_1,\; \tfrac12(-\eta_1\eta_1^T - \eta_2)^{-1}\right)$

Case of the Wishart distribution: see details in the paper.

Finite (parametric) mixture models
Now $\pi$ will be approximated by a finite (parametric) mixture $f(\cdot; \theta)$ indexed by $\theta$:
$\pi(x) \approx f(x; \theta) = \sum_{j=1}^K w_j f_j(x; \theta_j), \quad 0 \le w_j \le 1, \quad \sum_{j=1}^K w_j = 1$
where the $w_j$ are the mixing proportions and the $f_j$ are the component distributions. When all $f_j$'s are EFs, it is called a mixture of EFs (MEF).
[Figure: the mixture density $0.1\,N(x; 0, 1) + 0.6\,N(x; 4, 2^2) + 0.3\,N(x; -2, 0.5^2)$, together with the unknown true distribution $f^*$ and the component density functions $f_j$.]

Incompleteness in mixture models
The incomplete, observable data $\chi = \{x_1, \ldots, x_N\}$ is obtained deterministically from the complete, unobservable data $\chi_c = \{y_1 = (x_1, z_1), \ldots$
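As an illustration (our sketch, not from the talk), Algorithm 1 specializes to the univariate normal, whose sufficient statistic is $s(x) = (x, x^2)$ and whose $(\nabla F)^{-1}$ recovers $(\mu, \sigma^2) = (\eta_1, \eta_2 - \eta_1^2)$ from the expectation parameters:

```python
import random

def online_mle_gaussian(stream):
    """Exact online MLE (Algorithm 1) for the univariate normal EF.
    The expectation parameter eta = (E[x], E[x^2]) is updated recursively:
        eta_N = eta_{N-1} + (s(x_N) - eta_{N-1}) / N
    and (mu, sigma^2) is recovered from eta at every step."""
    eta1 = eta2 = 0.0
    for n, x in enumerate(stream, start=1):
        eta1 += (x - eta1) / n
        eta2 += (x * x - eta2) / n
        yield eta1, eta2 - eta1 * eta1   # (mu_hat, sigma2_hat)

random.seed(0)
data = [random.gauss(2.0, 3.0) for _ in range(20000)]
for mu_hat, var_hat in online_mle_gaussian(data):
    pass  # keep only the final estimate
print(mu_hat, var_hat)
```

Because the recursion is exact, the final estimate equals the batch MLE computed on all the data at once.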
$, y_N = (x_N, z_N)\}$, with $Z_i \sim \mathrm{cat}_K(w)$ and $X_i \mid Z_i = j \sim f_j(\cdot; \theta_j)$.

For a MEF, the joint density $p(x, z; \theta)$ is an EF:
$\log p(x, z; \theta) = \sum_{j=1}^K [z = j]\,\{\log(w_j) + \langle\theta_j, s_j(x)\rangle + k_j(x) - F_j(\theta_j)\}$
$= \left\langle \big([z=j],\; [z=j]\,s_j(x)\big)_j,\; \big(\log w_j - F_j(\theta_j),\; \theta_j\big)_j \right\rangle + k(x, z)$

Expectation-Maximization (EM) [1]
The EM algorithm iteratively maximizes $Q(\theta; \hat\theta^{(t)}, \chi)$.

Algorithm 2: EM algorithm
Input: $\hat\theta^{(0)}$, initial parameters of the model; $\chi^{(N)} = \{x_1, \ldots, x_N\}$
Output: a (local) maximizer $\hat\theta^{(t^*)}$ of $\log f(\chi; \theta)$
$t \leftarrow 0$;
repeat
  Compute $Q(\theta; \hat\theta^{(t)}, \chi) := E_{\hat\theta^{(t)}}[\log p(\chi_c; \theta) \mid \chi]$ ; // E-step
  Choose $\hat\theta^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta; \hat\theta^{(t)}, \chi)$ ; // M-step
  $t \leftarrow t + 1$;
until convergence of the complete log-likelihood;

EM for a MEF
For a mixture, the E-step is always explicit:
$\hat z^{(t)}_{i,j} = \hat w^{(t)}_j f(x_i; \hat\theta^{(t)}_j) \Big/ \sum_{j'} \hat w^{(t)}_{j'} f(x_i; \hat\theta^{(t)}_{j'})$
For a MEF, the M-step then reduces to:
$\hat w^{(t+1)}_j = \sum_{i=1}^N \hat z^{(t)}_{i,j} \big/ N$
$\hat\eta^{(t+1)}_j = \nabla F(\hat\theta^{(t+1)}_j) = \frac{\sum_i \hat z^{(t)}_{i,j}\, s_j(x_i)}{\sum_i \hat z^{(t)}_{i,j}}$ (weighted average of the sufficient statistics)

k-Maximum Likelihood Estimator (k-MLE) [2]
The k-MLE introduces a geometric split $\chi = \bigsqcup_{j=1}^K \hat\chi^{(t)}_j$ to accelerate EM:
$\tilde z^{(t)}_{i,j} = [\operatorname*{argmax}_{j'} \hat w^{(t)}_{j'} f(x_i; \hat\theta^{(t)}_{j'}) = j]$
Equivalently, it amounts to maximizing $Q$ over the partition $Z$ [3]. For a MEF, the M-step of the k-MLE then reduces to:
$\hat w^{(t+1)}_j = |\hat\chi^{(t)}_j| \big/ N$
$\hat\eta^{(t+1)}_j = \nabla F(\hat\theta^{(t+1)}_j) = \frac{\sum_{x_i \in \hat\chi^{(t)}_j} s_j(x_i)}{|\hat\chi^{(t)}_j|}$ (cluster-wise unweighted average of the sufficient statistics)

Online learning of mixtures
Consider now the online setting $x_1, x_2, \ldots, x_N, \ldots$. Denote by $\hat\theta^{(N)}$ or $\hat\eta^{(N)}$ the parameter estimate after processing $N$ observations, and by $\hat\theta^{(0)}$ or $\hat\eta^{(0)}$ their initial values.
Remark: for a fixed-size dataset $\chi$, one may apply multiple passes (with shuffling) over $\chi$. The increase in the likelihood function is no longer guaranteed after an iteration.
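The explicit E-step and the weighted-average M-step above can be sketched for a two-component univariate Gaussian mixture (our toy sketch; the talk's MEF formulation is more general). Each batch iteration is guaranteed not to decrease the log-likelihood:

```python
import math, random

def em_gmm2(xs, iters=30):
    """Batch EM for a 2-component univariate Gaussian mixture.
    The M-step is the weighted average of the sufficient statistics
    (x, x^2), as in the MEF formulation of the slides."""
    mu, var, w = [min(xs), max(xs)], [1.0, 1.0], [0.5, 0.5]
    lls = []
    for _ in range(iters):
        # E-step: responsibilities z[i][j], plus current log-likelihood
        z, ll = [], 0.0
        for x in xs:
            pj = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                  / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = pj[0] + pj[1]
            ll += math.log(s)
            z.append([p / s for p in pj])
        lls.append(ll)
        # M-step: weighted averages of the sufficient statistics
        for j in range(2):
            nj = sum(zi[j] for zi in z)
            w[j] = nj / len(xs)
            mu[j] = sum(zi[j] * x for zi, x in zip(z, xs)) / nj
            var[j] = sum(zi[j] * (x - mu[j]) ** 2
                         for zi, x in zip(z, xs)) / nj + 1e-9
    return w, mu, var, lls

random.seed(1)
xs = ([random.gauss(0, 1) for _ in range(300)]
      + [random.gauss(5, 1) for _ in range(300)])
w, mu, var, lls = em_gmm2(xs)
```

The recorded sequence `lls` is non-decreasing, which is exactly the monotonicity property that the online variants below give up.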
Stochastic approximations of EM (1)
There are two main approaches to online EM-like estimation.

Stochastic M-step: Recursive EM (1984) [5]
$\hat\theta^{(N)} = \hat\theta^{(N-1)} + \{N\,I_c(\hat\theta^{(N-1)})\}^{-1}\, \nabla_\theta \log f(x_N; \hat\theta^{(N-1)})$
where $I_c$ is the Fisher information matrix for the complete data:
$I_c(\hat\theta^{(N-1)}) = -E_{\hat\theta^{(N-1)}}\!\left[\frac{\partial^2 \log p(x, z; \theta)}{\partial\theta\,\partial\theta^T}\right]$
A justification for this formula comes from Fisher's identity:
$\nabla \log f(x; \theta) = E_\theta[\nabla \log p(x, z; \theta) \mid x]$
One can recognize a second-order stochastic gradient ascent, which requires updating and inverting $I_c$ after each iteration.

Stochastic approximations of EM (2)
Stochastic E-step: Online EM (2009) [7]
$\hat Q^{(N)}(\theta) = \hat Q^{(N-1)}(\theta) + \alpha^{(N)}\left(E_{\hat\theta^{(N-1)}}[\log p(x_N, z_N; \theta) \mid x_N] - \hat Q^{(N-1)}(\theta)\right)$
In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics of the complete data:
$\hat z_{N,j} = E_{\hat\theta^{(N-1)}}[z_{N,j} \mid x_N]$
$\begin{pmatrix}\hat S^{(N)}_{w_j} \\ \hat S^{(N)}_{\theta_j}\end{pmatrix} = \begin{pmatrix}\hat S^{(N-1)}_{w_j} \\ \hat S^{(N-1)}_{\theta_j}\end{pmatrix} + \alpha^{(N)}\left(\begin{pmatrix}\hat z_{N,j} \\ \hat z_{N,j}\, s_j(x_N)\end{pmatrix} - \begin{pmatrix}\hat S^{(N-1)}_{w_j} \\ \hat S^{(N-1)}_{\theta_j}\end{pmatrix}\right)$
The M-step is unchanged:
$\hat w^{(N)}_j = \hat S^{(N)}_{w_j}$, and $\hat\theta^{(N)}_j = (\nabla F_j)^{-1}(\hat\eta^{(N)}_{\theta_j})$ with $\hat\eta^{(N)}_{\theta_j} = \hat S^{(N)}_{\theta_j} \big/ \hat S^{(N)}_{w_j}$.

Stochastic approximations of EM (3)
Some properties:
- The initial values $\hat S^{(0)}$ may be used for introducing a "prior": $\hat S^{(0)}_{w_j} = w_j$, $\hat S^{(0)}_{\theta_j} = w_j\, \eta^{(0)}_j$.
- Parameter constraints are automatically respected.
- No matrix to invert!
- A policy for $\alpha^{(N)}$ has to be chosen (see [7]).
- Consistent, and asymptotically equivalent to the recursive EM.

Stochastic approximations of k-MLE (1)
In order to keep the previous advantages of online EM in an online k-MLE, our only choice concerns the way to assign $x_N$ to a cluster.

Strategy 1: maximize the likelihood of the complete data $(x_N, z_N)$:
$\tilde z_{N,j} = [\operatorname*{argmax}_{j'} \hat w^{(N-1)}_{j'} f(x_N; \hat\theta^{(N-1)}_{j'}) = j]$
Equivalent to online CEM, and similar to MacQueen's iterative k-means.

Stochastic approximations of k-MLE (2)
Strategy 2: maximize the likelihood of the complete data $(x_N, z_N)$ after the M-step:
$\tilde z_{N,j} = [\operatorname*{argmax}_{j'} \hat w^{(N)}_{j'} f(x_N; \hat\theta^{(N)}_{j'}) = j]$
Similar to Hartigan's method for k-means.
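The online EM recursion on the sufficient statistics can be sketched as follows (our construction for a two-component univariate Gaussian mixture; the step-size policy $\alpha^{(N)} = (20+N)^{-0.6}$ is our assumption, within the admissible exponent range of [7]):

```python
import math, random

def online_em_gmm2(stream, mu, var, w, alpha=lambda n: (20.0 + n) ** -0.6):
    """Online EM (stochastic E-step): the running averages of the
    complete-data sufficient statistics (z_j, z_j*x, z_j*x^2) are moved
    toward their conditional expectation at the new point; the M-step
    then re-reads (w, mu, var) from the statistics, unchanged."""
    S_w = list(w)                                   # "prior" initial values
    S_s1 = [w[j] * mu[j] for j in range(2)]          # E[z_j * x]
    S_s2 = [w[j] * (var[j] + mu[j] ** 2) for j in range(2)]  # E[z_j * x^2]
    for n, x in enumerate(stream, start=1):
        a = alpha(n)
        # conditional expectation of z given x (explicit E-step)
        pj = [w[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
              / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
        s = pj[0] + pj[1]
        z = [p / s for p in pj]
        for j in range(2):
            S_w[j] += a * (z[j] - S_w[j])
            S_s1[j] += a * (z[j] * x - S_s1[j])
            S_s2[j] += a * (z[j] * x * x - S_s2[j])
            # M-step: parameters recovered from the statistics
            w[j] = S_w[j]
            mu[j] = S_s1[j] / S_w[j]
            var[j] = S_s2[j] / S_w[j] - mu[j] ** 2 + 1e-9
    return w, mu, var

random.seed(2)
stream = [random.gauss(0, 1) if random.random() < 0.5 else random.gauss(6, 1)
          for _ in range(20000)]
w, mu, var = online_em_gmm2(stream, mu=[-1.0, 7.0], var=[1.0, 1.0], w=[0.5, 0.5])
```

Note that the mixing weights stay normalized automatically, and no matrix inversion is needed, as the slide lists among the properties.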
Additional cost: all possible M-steps must be precomputed for the stochastic E-step.

Stochastic approximations of k-MLE (3)
Strategy 3: draw $\tilde z_N$ from the categorical distribution
$\tilde z_N \sim \mathrm{Cat}_K\left(\{p_j \propto \hat w^{(N-1)}_j f_j(x_N; \hat\theta^{(N-1)}_j)\}_j\right)$
Similar to the sampling in stochastic EM [3]. The motivation is to try to break the inconsistency of k-MLE.
For strategies 1 and 3, the M-step reduces to the update of the parameters of a single component.

Experiments
True distribution: $\pi = 0.5\,N(0, 1) + 0.5\,N(\mu_2, \sigma_2^2)$, with different values of $\mu_2, \sigma_2$ for more or less overlap between the components. A small subset of observations was taken for initialization (k-MLE++ / k-MLE). A video illustrates the inconsistency of online k-MLE.

Experiments on Wishart distributions: see the paper.

Conclusions and future work
On consistency:
- EM and online EM are consistent.
- k-MLE and online k-MLE (strategies 1 and 2) are inconsistent (due to the Bayes error in maximizing the classification likelihood).
- Online stochastic k-MLE (strategy 3): consistency is an open question.
So, when components overlap, online EM > k-MLE > online k-MLE for parameter learning. We need to study how the dimension influences the inconsistency/convergence rate for online k-MLE. The convergence rate is lower for online methods (sublinear convergence of SGD).
Time for an update vs. sample size: online k-MLE (1, 3) < online EM < online k-MLE (2) << k-MLE. Online EM appears to be the best compromise!

References I
- Dempster, A.P., Laird, N.M. and Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
- Nielsen, F.: On learning statistical mixtures maximizing the complete likelihood. Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), AIP Conference Proceedings, 1641, pp. 238-245, 2014.
- Celeux, G. and Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14(3), pp.
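The assignment rules of strategies 1 and 3 can be sketched directly (our illustration for univariate Gaussian components; we read the $\mathrm{Cat}_K$ weights of strategy 3 as proportional to $\hat w_j f_j(x_N; \hat\theta_j)$, which matches the stochastic-EM sampling the slide refers to):

```python
import math, random

def loglik(x, w, mu, var, j):
    # complete-data log-likelihood log(w_j * f_j(x)) for a 1-d Gaussian f_j
    return (math.log(w[j]) - (x - mu[j]) ** 2 / (2 * var[j])
            - 0.5 * math.log(2 * math.pi * var[j]))

def assign_strategy1(x, w, mu, var):
    """Strategy 1: hard assignment maximizing w_j f_j(x) (online CEM,
    MacQueen-style)."""
    return max(range(len(w)), key=lambda j: loglik(x, w, mu, var, j))

def assign_strategy3(x, w, mu, var, rng):
    """Strategy 3: draw the label from the posterior over components,
    i.e. Cat_K with weights proportional to w_j f_j(x)."""
    p = [math.exp(loglik(x, w, mu, var, j)) for j in range(len(w))]
    u = rng.random() * sum(p)
    acc = 0.0
    for j, pj in enumerate(p):
        acc += pj
        if u <= acc:
            return j
    return len(p) - 1
```

For a point equidistant from two identical components, strategy 1 always picks one side deterministically, while strategy 3 picks each side about half the time, which is the randomness meant to break the inconsistency.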
315-332, 1992.

References II
- Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Statistics and Computing, 17(3), pp. 209-218, 2007.
- Titterington, D.M.: Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), pp. 257-267, 1984.
- Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation, 10(2), pp. 251-276, 1998.
- Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society, Series B (Methodological), 71(3), pp. 593-613, 2009.

References III
- Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M.I., editor, Learning in Graphical Models, pp. 355-368. MIT Press, Cambridge, 1999.
- Bottou, L.: Online algorithms and stochastic approximations. In Online Learning and Neural Networks, Saad, D., Ed., Cambridge University Press, 1998.
We discuss the optimization of the stochastic relaxation of a real-valued function, i.e., we introduce a new search space given by a statistical model and we optimize the expected value of the original function with respect to a distribution in the model. From the point of view of Information Geometry, statistical models are Riemannian manifolds of distributions endowed with the Fisher information metric, thus the stochastic relaxation can be seen as a continuous optimization problem defined over a differentiable manifold. In this paper we explore the second-order geometry of the exponential family, with applications to the multivariate Gaussian distributions, to generalize second-order optimization methods. Besides the Riemannian Hessian, we introduce the exponential and the mixture Hessians, which come from the dually flat structure of an exponential family. This allows us to obtain different Taylor formulæ according to the choice of the Hessian and of the geodesic used, and thus different approaches to the design of second-order methods, such as the Newton method.

GSI2015, 2nd conference on Geometric Science of Information, 28-30 Oct 2015, Ecole Polytechnique, Paris-Saclay

Second-order Optimization over the Multivariate Gaussian Distribution
Luigi Malagò¹, Giovanni Pistone²
¹ Shinshu University, JP, and INRIA Saclay, FR
² de Castro Statistics, Collegio Carlo Alberto, Moncalieri, IT

Introduction
- This is the presentation by Giovanni of the paper with the same title in the Proceedings.
- Unfortunately, Giovanni is the least qualified of the two authors to present this specific application of Information Geometry, his field of expertise being nonparametric Information Geometry and its applications in Probability and Statistical Physics. Luigi is currently working in Japan and could not make it.
- Among the two of us, Luigi is responsible for the idea of using gradient methods and, later, Newton methods in black-box optimization. Our collaboration started with the preparation of the FOGA 2011 paper:
  L. Malagò, M. Matteucci, and G. Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA '11, pages 230-242, New York, NY, USA, 2011. ACM.

Summary
1. Geometry of the exponential family
2. Second-order optimization: the Newton method
3. Applications to the Gaussian distribution
4. Discussion and future work

- A short introduction to Taylor formulæ on Gaussian exponential families is provided. The binary case has been previously discussed in:
  L. Malagò and G. Pistone. Combinatorial optimization with information geometry: Newton method. Entropy, 16:4260-4289, 2014.
- Riemannian Newton methods are discussed in a session of this conference; cf.
  P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
With a foreword by Paul Van Dooren.
- The focus of this short presentation is on a specific framework for Information Geometry that we call the statistical bundle.

Hilbert vs tangent vs statistical bundle
- S. Amari. Dual connections on the Hilbert bundles of statistical models. In Geometrization of Statistical Theory (Lancaster, 1987), pages 123-151, Lancaster, 1987. ULDM Publ.
- R.E. Kass and P.W. Vos. Geometrical Foundations of Asymptotic Inference. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 1997.

Statistical bundle: Gaussian case
- $H_\alpha(x)$, $x \in \mathbb{R}^m$, are the Hermite polynomials of order 1 and 2. E.g., for $m = 3$: $H_{010}(x) = x_2$, $H_{011}(x) = x_2 x_3$, $H_{020}(x) = x_2^2 - 1$.
- The Gaussian model with sufficient statistics $B = \{X_1, \ldots, X_n\} \subset \{H_\alpha \mid |\alpha| = 1, 2\}$ is
  $N = \left\{ p(x; \theta) = \exp\left(\textstyle\sum_{j=1}^n \theta_j X_j - \psi(\theta)\right) \right\}$
- The fibers are $V_p = \mathrm{Span}\left(X_j - E_p[X_j] \mid j = 1, \ldots, n\right)$.
- The statistical bundle is $SN = \{(p, U) \mid p \in N,\ U \in V_p\}$.
- Each $U \in V_p$, $p \in N$, is a polynomial of degree up to 2, and $t \mapsto E_q\left[e^{tU}\right]$ is finite around 0 for every $q \in N$.
- Every polynomial $X$ belongs to $\bigcap_{q \in N} L^2(q)$.

Parallel transports
Definition:
- e-transport: ${}^e\mathrm{U}^q_p : V_p \ni U \mapsto U - E_q[U] \in V_q$.
- m-transport: for each $U \in V_p$ and $V \in V_q$, $\langle U, {}^m\mathrm{U}^p_q V\rangle_p = \langle {}^e\mathrm{U}^q_p U, V\rangle_q$.
Properties:
- ${}^e\mathrm{U}^r_q\, {}^e\mathrm{U}^q_p = {}^e\mathrm{U}^r_p$
- ${}^m\mathrm{U}^r_q\, {}^m\mathrm{U}^q_p = {}^m\mathrm{U}^r_p$
- $\langle {}^e\mathrm{U}^q_p U, {}^m\mathrm{U}^q_p V\rangle_q = \langle U, V\rangle_p$
- If $\frac{q}{p} V \in L^2(p)$, then ${}^m\mathrm{U}^p_q V$ is its orthogonal projection onto $V_p$.

Parallel transports in coordinates I
We define on the statistical bundle $SN$ a system of moving frames.
1. The exponential frame of the fiber $S_pN = V_p$ is the vector basis $B_p = \{X_j - E_p[X_j] \mid j = 1, \ldots, n\}$.
2. Each element $U \in V_p$ is uniquely written as $U = \sum_{j=1}^n \alpha_j(U)(X_j - E_p[X_j]) = \alpha(U)^T(X - E_p[X])$.
3. The expression of the scalar product in the exponential frame is the Fisher information matrix:
   ${}^e I_{ij}(p) = \langle X_i - E_p[X_i],\, X_j - E_p[X_j]\rangle_p = \mathrm{Cov}_p(X_i, X_j) = \frac{\partial^2 \psi(\theta)}{\partial\theta_i\,\partial\theta_j}$
4. $U \mapsto \alpha(U) = {}^e I(p)^{-1}\,\mathrm{Cov}_p(X, U)$
Parallel transports in coordinates II
5. The mixture frame of the fiber $S_pN = V_p$ is ${}^e I(p)^{-1} B_p = \left\{\sum_{i=1}^n {}^e I^{ij}(p)(X_i - E_p[X_i]) \mid j = 1, \ldots, n\right\}$.
6. Each element $V \in V_p$ is uniquely written as $V = \sum_{j=1}^n \beta_j(V)\sum_{i=1}^n {}^e I^{ij}(p)(X_i - E_p[X_i]) = \beta(V)^T\,{}^e I(p)^{-1}(X - E_p[X])$.
7. The coordinates in the mixture basis are given in matrix form by $V \mapsto \beta(V) = \mathrm{Cov}_p(X, V)$.
8. The matrix ${}^m I(p) = {}^e I(p)^{-1}$ is the matrix expression of the metric in the mixture frame: $\alpha(U) = {}^m I(p)\beta(U)$ and $\beta(U) = {}^e I(p)\alpha(U)$.

Parallel transports in the moving frames
- The e-transport acts on the exponential coordinates as the identity: $\alpha({}^e\mathrm{U}^q_p U) = \alpha(U)$; equivalently, ${}^e I(q)^{-1}\mathrm{Cov}_q(X, U) = {}^e I(p)^{-1}\mathrm{Cov}_p(X, U)$.
- The m-transport acts on the mixture coordinates as the identity: $\beta({}^m\mathrm{U}^q_p V) = \beta(V)$.

REMARK. A section, or vector field, of the statistical bundle is a mapping $F: N \ni p \mapsto F(p) \in V_p$. As there are two distinguished charts on the model (exponential $p \mapsto \theta(p)$ and mixture $p \mapsto \eta(p) = \nabla\psi(\theta(p))$) and two distinguished frames on each fiber, there are in general four distinguished expressions of each section.

Score and statistical gradient
Definition. Let $t \mapsto p(t)$ be a curve in the model $N$ and $f: N \to \mathbb{R}$.
- The score of the curve $t \mapsto p(t)$ is the curve $t \mapsto (p(t), Dp(t)) \in SN$ in the statistical bundle such that, for all $X \in \mathrm{Span}(1, X_1, \ldots, X_n)$,
  $\frac{d}{dt} E_{p(t)}[X] = \langle X - E_{p(t)}[X],\, Dp(t)\rangle_{p(t)}$
- Usually, $Dp(t) = \frac{\dot p(t)}{p(t)} = \frac{d}{dt}\log p(t)$.
- The statistical gradient of $f: N \to \mathbb{R}$ is the section $p \mapsto (p, \mathrm{grad}\,f(p)) \in SN$ of the statistical bundle such that, for each regular curve $t \mapsto p(t)$,
  $\frac{d}{dt} f(p(t)) = \langle \mathrm{grad}\,f(p(t)),\, Dp(t)\rangle_{p(t)}$

Score and statistical gradient in coordinates
- If the regular curve $t \mapsto p(t)$ is expressed in the exponential coordinates by $t \mapsto \theta(t)$, the score $t \mapsto Dp(t)$ is expressed in the exponential frame by $t \mapsto \dot\theta(t)$, that is, $Dp(t) = \sum_{j=1}^n \dot\theta_j(t)\left(X_j - \frac{\partial}{\partial\theta_j}\psi(\theta(t))\right)$.
- If the regular curve is expressed in the mixture coordinates by $t \mapsto \eta(t) = \nabla\psi(\theta(t))$, the score is expressed in the mixture frame as $t \mapsto \dot\eta(t)$.
- Let $X$ be a random variable which belongs to all $L^2(p)$, $p \in N$, and let $f(p) = E_p[X]$. Then $p \mapsto \mathrm{grad}\,f(p)$ exists and equals the orthogonal projection of $X$ onto $V_p$, namely
  $\mathrm{grad}\left(p \mapsto E_p[X]\right) = \left({}^e I(p)^{-1}\,\mathrm{Cov}_p(\boldsymbol X, X)\right)^T(\boldsymbol X - E_p[\boldsymbol X]), \quad \boldsymbol X = (X_1, \ldots, X_n)$.
- The expressions of $\mathrm{grad}\,f$ are of interest in optimization.

Taylor formula in the statistical bundle
- For a curve $t \mapsto p(t) \in N$ connecting $p = p(0)$ to $q = p(1)$ and a function $f: N \to \mathbb{R}$, the Taylor formula is
  $f(q) = f(p) + \left.\frac{d}{dt} f(p(t))\right|_{t=0} + \frac12 \left.\frac{d^2}{dt^2} f(p(t))\right|_{t=0} + R_2(f, p, q)$
- The first derivative is computed with the statistical gradient and the score:
  $f(q) = f(p) + \langle \mathrm{grad}\,f(p(0)), Dp(0)\rangle_p + \frac12 \left.\frac{d}{dt}\langle \mathrm{grad}\,f(p(t)), Dp(t)\rangle_{p(t)}\right|_{t=0} + R_2(f, p, q)$

Acceleration and Hessian
The second-order term can be written by transporting everything back to $p(0)$:
$\left.\frac{d}{dt}\langle \mathrm{grad}\,f(p(t)), Dp(t)\rangle_{p(t)}\right|_{t=0} = \left.\frac{d}{dt}\left\langle {}^e\mathrm{U}^{p(0)}_{p(t)}\,\mathrm{grad}\,f(p(t)),\; {}^m\mathrm{U}^{p(0)}_{p(t)}\,Dp(t)\right\rangle_{p(0)}\right|_{t=0}$
$= \left.\frac{d}{dt}\left\langle {}^m\mathrm{U}^{p(0)}_{p(t)}\,\mathrm{grad}\,f(p(t)),\; {}^e\mathrm{U}^{p(0)}_{p(t)}\,Dp(t)\right\rangle_{p(0)}\right|_{t=0} = \left.\frac{d}{dt}\, E_{p(0)}\!\left[\frac{p(t)}{p(0)}\,\mathrm{grad}\,f(p(t))\,Dp(t)\right]\right|_{t=0}$

Accelerations
- The velocity of a curve $t \mapsto p(t) \in N$ is $t \mapsto (p(t), Dp(t)) = \left(p(t), \frac{d}{dt}\log p(t)\right) \in SN$.
- The exponential acceleration is ${}^e D^2 p(t) = \left.\frac{d}{ds}\, {}^e\mathrm{U}^{p(t)}_{p(s)}\, Dp(s)\right|_{s=t}$.
- The mixture acceleration is ${}^m D^2 p(t) = \left.\frac{d}{ds}\, {}^m\mathrm{U}^{p(t)}_{p(s)}\, Dp(s)\right|_{s=t}$.
- The Riemannian acceleration is ${}^0 D^2 p(t) = \frac12\left({}^e D^2 p(t) + {}^m D^2 p(t)\right)$.

Covariant derivatives I
Let $p \mapsto (p, F(p))$ and $p \mapsto (p, G(p))$ be sections of $SN$, with expressions in the moving frames
$F(p) = \sum_{j=1}^n \alpha_j(p)(X_j - E_p[X_j]) = \sum_{j=1}^n \beta_j(p)\sum_{i=1}^n {}^e I^{ij}(p)(X_i - E_p[X_i])$,
$G(p) = \sum_{j=1}^n \gamma_j(p)(X_j - E_p[X_j]) = \sum_{j=1}^n \delta_j(p)\sum_{i=1}^n {}^e I^{ij}(p)(X_i - E_p[X_i])$.
Covariant derivatives II
• The exponential covariant derivative is the vector field $p \mapsto (p, {}^e D_G F(p))$, where
$${}^e D_G F(p) = \sum_{j=1}^n \big\langle \operatorname{grad} \alpha_j(p), G(p)\big\rangle_p (X_j - E_p[X_j]) = \sum_{j=1}^n \sum_{i=1}^n \gamma_i(p)\,\partial_i \alpha_j(p)\,(X_j - E_p[X_j]).$$
• The mixture covariant derivative is the vector field $p \mapsto (p, {}^m D_G F(p))$, where
$${}^m D_G F(p) = \sum_{j=1}^n \big\langle \operatorname{grad} \beta_j(p), G(p)\big\rangle_p \sum_{i=1}^n {}^e I^{ij}(p)(X_i - E_p[X_i]) = \sum_{j=1}^n \sum_{k=1}^n \gamma_k(p)\big\langle \operatorname{grad} \beta_j(p), X_k - E_p[X_k]\big\rangle_p \sum_{i=1}^n {}^e I^{ij}(p)(X_i - E_p[X_i]).$$
Covariant derivatives III
• The Riemannian covariant derivative is the vector field $p \mapsto (p, {}^0 D_G F(p))$ with ${}^0 D_G F = \frac12\big({}^e D_G F + {}^m D_G F\big)$.
Hessians
• Let $f\colon N \to \mathbb{R}$ be a mapping with gradient $p \mapsto (p, \operatorname{grad} f(p))$. Let $p \mapsto (p, G(p))$ be a vector field (section) of $SN$.
• The exponential Hessian of $f$ is the vector field $p \mapsto (p, {}^e\mathrm{Hess}_G f(p))$, with ${}^e\mathrm{Hess}_G f(p) = {}^e D_G \operatorname{grad} f(p)$.
• The mixture Hessian of $f$ is the vector field $p \mapsto (p, {}^m\mathrm{Hess}_G f(p))$, with ${}^m\mathrm{Hess}_G f(p) = {}^m D_G \operatorname{grad} f(p)$.
• The Riemannian Hessian of $f$ is the vector field $p \mapsto (p, {}^0\mathrm{Hess}_G f(p))$, with ${}^0\mathrm{Hess}_G f(p) = {}^0 D_G \operatorname{grad} f(p)$.
Taylor's formulæ I
1. $t \mapsto p(t)$ is the mixture geodesic connecting $p = p(0)$ to $q = p(1)$:
$$f(q) = f(p) + \langle \operatorname{grad} f(p), Dp(0)\rangle_p + \tfrac12 \langle {}^e\mathrm{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p + R_2^+(p, q),$$
$$R_2^+(p, q) = \int_0^1 (1-t)\,\langle {}^e\mathrm{Hess}_{Dp(t)} f(p(t)), Dp(t)\rangle_{p(t)}\,dt - \tfrac12 \langle {}^e\mathrm{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p.$$
Taylor's formulæ II
2. $t \mapsto p(t)$ is the exponential geodesic connecting $p = p(0)$ to $q = p(1)$:
$$f(q) = f(p) + \langle \operatorname{grad} f(p), Dp(0)\rangle_p + \tfrac12 \langle {}^m\mathrm{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p + R_2^-(p, q),$$
$$R_2^-(p, q) = \int_0^1 (1-t)\,\langle {}^m\mathrm{Hess}_{Dp(t)} f(p(t)), Dp(t)\rangle_{p(t)}\,dt - \tfrac12 \langle {}^m\mathrm{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p.$$
Taylor's formulæ III
3. $t \mapsto p(t)$ is the Riemannian geodesic connecting $p = p(0)$ to $q = p(1)$:
$$f(q) = f(p) + \langle \operatorname{grad} f(p), Dp(0)\rangle_p + \tfrac12 \langle {}^0\mathrm{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p + R_2^0(p, q),$$
where
$$R_2^0(p, q) = \int_0^1 (1-t)\,\langle {}^0\mathrm{Hess}_{Dp(t)} f(p(t)), Dp(t)\rangle_{p(t)}\,dt - \tfrac12 \langle {}^0\mathrm{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p.$$
Newton step
• Let $t \mapsto p(t)$ be the exponential geodesic starting at $p = p(0)$ with $Dp(0) = U$.
• Assume $U$ is a critical point of
$$V_{p(0)} \ni U \mapsto f(p) + \langle \operatorname{grad} f(p(0)), U\rangle_{p(0)} + \tfrac12 \langle {}^m\mathrm{Hess}_U f(p), U\rangle_{p(0)},$$
that is, $\operatorname{grad} f(p(0)) + {}^m\mathrm{Hess}_U f(p) = 0$.
• If $q = p(1)$, then
$$f(q) = f(p) - \tfrac12 \langle {}^0\mathrm{Hess}_U f(p), U\rangle_p + R_2^0(p, q).$$
Conclusion and work in progress
• Comparisons between the Riemannian Newton method (e.g., Absil et al.) and the statistical bundle setup are being performed.
• In particular, the use of alternative Hessians is of special interest.
We prove the equivalence of two online learning algorithms: mirror descent and natural gradient descent. Both are generalizations of online gradient descent to the case where the parameter of interest lies on a non-Euclidean manifold. Natural gradient descent selects the steepest descent direction along a Riemannian manifold by multiplying the standard gradient by the inverse of the metric tensor. Mirror descent induces non-Euclidean structure by solving iterative optimization problems using different proximity functions. In this paper, we prove that mirror descent induced by a Bregman divergence proximity function is equivalent to the natural gradient descent algorithm on the Riemannian manifold in the dual coordinate system. We use techniques from convex analysis and connections between Riemannian manifolds, Bregman divergences and convexity to prove this result. This equivalence between natural gradient descent and mirror descent implies that (1) mirror descent is the steepest descent direction along the Riemannian manifold corresponding to the choice of Bregman divergence, and (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound.
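A quick numerical illustration of this equivalence (our sketch, not taken from the paper), for a one-dimensional Poisson-type family: with $G(\theta) = e^\theta$, the dual coordinate is $\mu = G'(\theta) = e^\theta$ and the conjugate is $H(\mu) = \mu\log\mu - \mu$, so the metric in dual coordinates is $H''(\mu) = 1/\mu$. The mirror descent trajectory in $\theta$, mapped through $\mu = e^\theta$, coincides with the natural gradient trajectory in $\mu$.

```python
import math

# Cost: f(theta) = exp(theta) - y*theta, the negative log-likelihood of one
# observation y under the Poisson-type model (our illustrative choice).
y, alpha = 3.0, 0.1

def fprime(theta):                       # df/dtheta
    return math.exp(theta) - y

def mirror_step(theta):
    # Mirror descent with Bregman proximity B_G:
    #   grad G(theta_next) = grad G(theta) - alpha * grad f(theta)
    return math.log(math.exp(theta) - alpha * fprime(theta))

def natural_step(mu):
    # Natural gradient in dual coordinates: mu_next = mu - alpha * H''(mu)^{-1} * df/dmu
    df_dmu = fprime(math.log(mu)) / mu   # chain rule: dtheta/dmu = 1/mu
    return mu - alpha * mu * df_dmu      # H''(mu)^{-1} = mu

theta, mu = 0.5, math.exp(0.5)
for _ in range(20):
    theta, mu = mirror_step(theta), natural_step(mu)
print(abs(math.exp(theta) - mu))         # ~0: the two trajectories coincide
```

Both updates reduce to $\mu_{t+1} = \mu_t - \alpha f'(\theta_t)$, which is exactly the claimed equivalence in this simple case.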

Information geometry of mirror descent
Geometric Science of Information
Anthea Monod, Department of Statistical Science, Duke University; Information Initiative at Duke
G. Raskutti (UW Madison) and S. Mukherjee (Duke)
29 Oct 2015
Anthea Monod (Duke) — Information geometry of mirror descent — 29 Oct 2015

Optimization of large-scale problems
Optimization of a function $f(\theta)$ where $\theta \in \mathbb{R}^p$.
• $O(\sqrt{p})$: convergence rate of standard subgradient descent — a problem in modern optimization, e.g. machine learning.
• Mirror descent [A. Nemirovski, 1979; A. Beck & M. Teboulle, 2003]: $O(\log p)$ convergence rate. A widely used tool in optimization and machine learning.

Differential geometry in statistics
(1) Cramér-Rao lower bound (Rao 1945): the lower bound on the variance of an estimator is a function of curvature. Sometimes called the Cramér-Rao-Fréchet-Darmois lower bound.
(2) Invariant (non-informative) priors (Jeffreys 1946): an uninformative prior distribution for a parameter space is based on a differential form.
(3) Information geometry (Amari 1985): differential geometry of probability distributions.

Stochastic gradient descent
Given a convex differentiable cost function $f\colon \Theta \to \mathbb{R}$, generate a sequence of parameters $\{\theta_t\}_{t=1}^\infty$ which incur a loss $f(\theta_t)$ and minimize the regret at time $T$, $\sum_{t=1}^T f(\theta_t)$. One solution:
$$\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t),$$
where $(\alpha_t)_{t=0}^\infty$ denotes a sequence of step sizes.

Natural gradient
For certain cost functions (log-likelihoods of exponential family models) the set of parameters $\Theta$ is supported on a $p$-dimensional Riemannian manifold $(M, H)$. Typically the metric tensor $H = (h_{jk})$ is determined by the Fisher information matrix
$$(I(\theta))_{ij} = E_{\mathrm{Data}}\Big[\frac{\partial}{\partial\theta_i} f(x;\theta)\,\frac{\partial}{\partial\theta_j} f(x;\theta)\,\Big|\,\theta\Big], \qquad i, j = 1, \dots, p.$$
Natural gradient
Given a cost function $f\colon M \to \mathbb{R}$ on the Riemannian manifold, the natural gradient descent step is
$$\theta_{t+1} = \theta_t - \alpha_t H^{-1}(\theta_t)\,\nabla f(\theta_t),$$
where $H^{-1}$ is the inverse of the Riemannian metric. The natural gradient algorithm steps in the direction of steepest descent along the Riemannian manifold $(M, H)$. It requires a matrix inversion.

Mirror descent
Gradient descent can be written
$$\theta_{t+1} = \arg\min_{\theta\in\Theta} \Big\{ \langle\theta, \nabla f(\theta_t)\rangle + \frac{1}{2\alpha_t}\|\theta - \theta_t\|_2^2 \Big\}.$$
For a (strictly) convex proximity function $\Psi\colon \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^+$, mirror descent is
$$\theta_{t+1} = \arg\min_{\theta\in\Theta} \Big\{ \langle\theta, \nabla f(\theta_t)\rangle + \frac{1}{\alpha_t}\Psi(\theta, \theta_t) \Big\}.$$

Bregman divergence
Let $G\colon \Theta \to \mathbb{R}$ be a strictly convex twice-differentiable function; the Bregman divergence is
$$B_G(\theta, \theta') = G(\theta) - G(\theta') - \langle \nabla G(\theta'), \theta - \theta'\rangle.$$

Bregman divergences for the exponential family
• $N(\theta, I_{p\times p})$: $G(\theta) = \frac12\|\theta\|_2^2$, $\quad B_G(\theta, \theta') = \frac12\|\theta - \theta'\|_2^2$.
• $\mathrm{Poi}(e^\theta)$: $G(\theta) = \exp(\theta)$, $\quad B_G(\theta, \theta') = \exp(\theta) - \exp(\theta') - \langle \exp(\theta'), \theta - \theta'\rangle$.
• $\mathrm{Be}\big(\frac{1}{1+e^{-\theta}}\big)$: $G(\theta) = \log(1 + e^\theta)$, $\quad B_G(\theta, \theta') = \log\frac{1+e^\theta}{1+e^{\theta'}} - \big\langle \frac{e^{\theta'}}{1+e^{\theta'}}, \theta - \theta'\big\rangle$.

Mirror descent
Mirror descent using the Bregman divergence as the proximity function:
$$\theta_{t+1} = \arg\min_\theta \Big\{ \langle\theta, \nabla f(\theta_t)\rangle + \frac{1}{\alpha_t} B_G(\theta, \theta_t) \Big\}.$$

Convex duals
The convex conjugate function of $G$ is defined to be
$$H(\mu) := \sup_{\theta\in\Theta} \{\langle\theta, \mu\rangle - G(\theta)\}.$$
Let $\mu = g(\theta) = \nabla G(\theta) \in \Phi$ be the extremal point of the dual. The dual Bregman divergence $B_H\colon \Phi\times\Phi \to \mathbb{R}^+$ is
$$B_H(\mu, \mu') = H(\mu) - H(\mu') - \langle \nabla H(\mu'), \mu - \mu'\rangle.$$
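The scalar rows of the table of Bregman divergences can be checked mechanically from the definition; a minimal sketch (our addition):

```python
import math

def bregman(G, gradG, t1, t2):
    # B_G(theta, theta') = G(theta) - G(theta') - <grad G(theta'), theta - theta'>
    return G(t1) - G(t2) - gradG(t2) * (t1 - t2)

# Generators and their gradients for the three exponential-family rows (scalar case):
gauss = (lambda t: 0.5 * t * t,             lambda t: t)                         # N(theta, 1)
poiss = (lambda t: math.exp(t),             lambda t: math.exp(t))               # Poi(e^theta)
berno = (lambda t: math.log1p(math.exp(t)), lambda t: 1.0 / (1.0 + math.exp(-t)))  # Bernoulli

for G, gradG in (gauss, poiss, berno):
    assert bregman(G, gradG, 0.7, 0.7) == 0.0   # B_G(theta, theta) = 0
    assert bregman(G, gradG, 0.7, -0.2) > 0.0   # strict convexity => positivity
# The Gaussian row reduces to the squared Euclidean distance:
print(bregman(*gauss, 1.5, 0.5))                # 0.5 == 0.5 * (1.5 - 0.5)^2
```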
Dual Bregman divergences for the exponential family
• $G(\theta) = \frac12\|\theta\|_2^2$: $\quad H(\mu) = \frac12\|\mu\|_2^2$, $\quad B_H(\mu, \mu') = \frac12\|\mu - \mu'\|_2^2$.
• $G(\theta) = \exp(\theta)$: $\quad H(\mu) = \langle\mu, \log\mu\rangle - \mu$, $\quad B_H(\mu, \mu') = \mu\log\frac{\mu}{\mu'} - \mu + \mu'$.
• $G(\theta) = \log(1 + e^\theta)$: $\quad H(\mu) = \mu\log\mu + (1-\mu)\log(1-\mu)$, $\quad B_H(\mu, \mu') = \mu\log\frac{\mu}{\mu'} + (1-\mu)\log\frac{1-\mu}{1-\mu'}$.

Manifolds in primal and dual coordinates
• $B_G(\cdot,\cdot)$ induces a Riemannian manifold $(\Theta, \nabla^2 G)$ in the primal coordinates.
• Let $\Phi$ be the image of $\Theta$ under the continuous map $g = \nabla G$.
• $B_H\colon \Phi\times\Phi\to\mathbb{R}^+$ induces the same Riemannian manifold $(\Phi, \nabla^2 H)$ under the dual coordinates $\Phi$.

Equivalence
Theorem (Raskutti, Mukherjee). The mirror descent step with Bregman divergence defined by $G$ applied to a function $f$ in the space $\Theta$ is equivalent to the natural gradient step along the Riemannian manifold $(\Phi, \nabla^2 H)$ in dual coordinates.

Consequences
Exponential family with density $p(y \mid \theta) = h(y)\exp(\langle\theta, y\rangle - G(\theta))$. Consider the following mirror descent step given $y_t$:
$$\theta_{t+1} = \arg\min_\theta \Big\{ \big\langle \theta, \nabla_\theta B_G(\theta, h(y_t))\big|_{\theta=\theta_t}\big\rangle + \frac{1}{\alpha_t} B_G(\theta, \theta_t) \Big\}.$$
In dual coordinates one would minimize $f_t(\mu; y_t) = -\log p(y_t \mid \mu) = B_H(y_t, \mu)$. The natural gradient step is
$$\mu_{t+1} = \mu_t - \alpha_t\,[\nabla^2 H(\mu_t)]^{-1}\,\nabla B_H(y_t, \mu_t) = \mu_t - \alpha_t(\mu_t - y_t);$$
the curvature of the loss $B_H(y_t, \mu)$ matches the metric tensor $\nabla^2 H(\mu)$.

Statistical efficiency
Given independent samples $Y_T = (y_1, \dots, y_T)$, a sequence of unbiased estimators $\mu_T$ is Fisher efficient if
$$\lim_{T\to\infty} E_{Y_T}\big[(\mu_T - \mu)(\mu_T - \mu)^\top\big] \to \frac{1}{T}\nabla^2 H,$$
where $\nabla^2 H$ is the inverse of the Fisher information matrix.
Theorem (Raskutti, Mukherjee). The mirror descent step applied to the log loss above with step sizes $\alpha_t = 1/t$ asymptotically achieves the Cramér-Rao lower bound.
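The step $\mu_{t+1} = \mu_t - \alpha_t(\mu_t - y_t)$ with $\alpha_t = 1/t$ is exactly the running sample mean, the classical Fisher-efficient estimator — a small sketch (our addition) illustrating the efficiency claim:

```python
# Natural-gradient / mirror-descent step in expectation (dual) coordinates:
#   mu_{t+1} = mu_t - alpha_t * (mu_t - y_t),  with step sizes alpha_t = 1/t.
# Unrolling the recursion shows mu_t is the running average of y_1, ..., y_t.
ys = [2.0, 4.0, 1.0, 5.0, 3.0]
mu = ys[0]                                # mu_1 = y_1
for t, y in enumerate(ys[1:], start=2):
    mu -= (1.0 / t) * (mu - y)
print(mu)  # ~3.0, the sample mean of ys
```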
Challenges
(1) Information geometry on mixtures of manifolds.
(2) Proximity functions for functions over the Grassmannian.
(3) EM algorithms for mixtures.

Acknowledgements
Funding: Center for Systems Biology at Duke; NSF DMS and CCF; DARPA; AFOSR; NIH.
Geometry of Time Series and Linear Dynamical Systems (chaired by Bijan Afsari, Arshia Cont)
We present in this paper a novel non-parametric approach useful for clustering independent identically distributed stochastic processes. We introduce a preprocessing step consisting in mapping multivariate independent and identically distributed samples from random variables to a generic non-parametric representation which factorizes the dependency and the marginal distributions apart without losing any information. An associated metric is defined in which the balance between the dependency and distribution information of the random variables is controlled by a single parameter. This mixing parameter can be learned, or set by a practitioner; such use is illustrated on the case of clustering financial time series. Experiments, implementation and results obtained on public financial time series are available online on the web portal http://www.datagrapple.com .

Clustering Random Walk Time Series
GSI 2015 — Geometric Science of Information
Gautier Marti, Frank Nielsen, Philippe Very, Philippe Donnat
29 October 2015
Gautier Marti, Frank Nielsen — Clustering Random Walk Time Series

Outline: 1 Introduction; 2 Geometry of Random Walk Time Series; 3 The Hierarchical Block Model; 4 Conclusion

Context (data from www.datagrapple.com)

What is a clustering program?
Definition. Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in different groups.
Example of a clustering program: we aim at finding $k$ groups by positioning $k$ group centers $\{c_1, \dots, c_k\}$ such that the data points $\{x_1, \dots, x_n\}$ minimize
$$\min_{c_1, \dots, c_k} \sum_{i=1}^n \min_{j=1}^k d(x_i, c_j)^2.$$
But what is the distance $d$ between two random walk time series?

What are clusters of Random Walk Time Series?
French banks and building materials CDS over 2006-2015.

Geometry of RW TS ≡ Geometry of Random Variables
i.i.d. observations:
$$X_1: X_1^1, X_1^2, \dots, X_1^T; \quad X_2: X_2^1, X_2^2, \dots, X_2^T; \quad \dots; \quad X_N: X_N^1, X_N^2, \dots, X_N^T.$$
Which distances $d(X_i, X_j)$ between dependent random variables?

Pitfalls of a basic distance
Let $(X, Y)$ be a bivariate Gaussian vector, with $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, and whose correlation is $\rho(X, Y) \in [-1, 1]$:
$$E[(X - Y)^2] = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2 + 2\sigma_X\sigma_Y(1 - \rho(X, Y)).$$
Now, consider the following values for the correlation:
• $\rho(X, Y) = 0$, so $E[(X - Y)^2] = (\mu_X - \mu_Y)^2 + \sigma_X^2 + \sigma_Y^2$. Assume $\mu_X = \mu_Y$ and $\sigma_X = \sigma_Y$.
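The identity for $E[(X-Y)^2]$ above can be verified algebraically via $E[(X-Y)^2] = \mathrm{Var}(X-Y) + (E[X-Y])^2$; a quick mechanical check (our addition):

```python
# Both sides of the identity
#   E[(X-Y)^2] = (muX-muY)^2 + (sigX-sigY)^2 + 2*sigX*sigY*(1 - rho)
def lhs(muX, muY, sigX, sigY, rho):
    # E[(X-Y)^2] = Var(X-Y) + (E[X-Y])^2 for correlated Gaussians
    var = sigX**2 + sigY**2 - 2 * rho * sigX * sigY
    return var + (muX - muY)**2

def rhs(muX, muY, sigX, sigY, rho):
    return (muX - muY)**2 + (sigX - sigY)**2 + 2 * sigX * sigY * (1 - rho)

for rho in (-1.0, 0.0, 0.5, 1.0):
    assert abs(lhs(1.0, 2.0, 3.0, 4.0, rho) - rhs(1.0, 2.0, 3.0, 4.0, rho)) < 1e-12
# The pitfall: equal means and variances but rho = 0 gives 2*sigma^2, not 0
print(rhs(0.0, 0.0, 10.0, 10.0, 0.0))  # 200.0
```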
For $\sigma_X = \sigma_Y \gg 1$, we obtain $E[(X - Y)^2] \gg 1$, instead of the distance 0 expected from comparing two equal Gaussians.
• $\rho(X, Y) = 1$, so $E[(X - Y)^2] = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2$.

[Figure: probability density functions of the Gaussian pairs N(−5, 1)/N(5, 1), N(−5, 3)/N(5, 3) and N(−5, 10)/N(5, 10); the green, red and blue Gaussian pairs are equidistant using the L2 geometry on the parameter space (µ, σ).]

Sklar's Theorem
Theorem (Sklar's Theorem, 1959). For any random vector $X = (X_1, \dots, X_N)$ having continuous marginal cdfs $P_i$, $1 \le i \le N$, its joint cumulative distribution $P$ is uniquely expressed as
$$P(X_1, \dots, X_N) = C(P_1(X_1), \dots, P_N(X_N)),$$
where $C$, the multivariate distribution of uniform marginals, is known as the copula of $X$.

The Copula Transform
Definition (The Copula Transform). Let $X = (X_1, \dots, X_N)$ be a random vector with continuous marginal cumulative distribution functions (cdfs) $P_i$, $1 \le i \le N$. The random vector
$$U = (U_1, \dots, U_N) := P(X) = (P_1(X_1), \dots, P_N(X_N))$$
is known as the copula transform.
$U_i$, $1 \le i \le N$, are uniformly distributed on $[0, 1]$ (the probability integral transform): for $P_i$ the cdf of $X_i$, we have
$$x = P_i(P_i^{-1}(x)) = \Pr(X_i \le P_i^{-1}(x)) = \Pr(P_i(X_i) \le x),$$
thus $P_i(X_i) \sim \mathcal{U}[0, 1]$.

[Figure: the copula transform is invariant to strictly increasing transformations: for X ~ U[0,1] and Y = ln(X), ρ ≈ 0.84, whereas for P_X(X) and P_Y(Y), ρ = 1.]

Deheuvels' Empirical Copula Transform
Let $(X_1^t, \dots, X_N^t)$, $1 \le t \le T$, be $T$ observations from a random vector $(X_1, \dots, X_N)$ with continuous margins. Since one cannot directly obtain the corresponding copula observations $(U_1^t, \dots, U_N^t) = (P_1(X_1^t), \dots, P_N(X_N^t))$, $t = 1, \dots, T$, without knowing a priori $(P_1, \dots, P_N)$, one can instead:
Definition (The Empirical Copula Transform). Estimate the $N$ empirical margins $P_i^T(x) = \frac{1}{T}\sum_{t=1}^T \mathbf{1}(X_i^t \le x)$, $1 \le i \le N$, to obtain the $T$ empirical observations $(\tilde U_1^t, \dots, \tilde U_N^t) = (P_1^T(X_1^t), \dots, P_N^T(X_N^t))$. Equivalently, since $\tilde U_i^t = R_i^t/T$, $R_i^t$ being the rank of observation $X_i^t$, the empirical copula transform can be considered as the normalized rank transform.
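The empirical copula transform above is just a normalized rank transform; a pure-Python sketch (our addition), equivalent in the no-ties case to `scipy.stats.rankdata(x) / len(x)`:

```python
def empirical_copula_transform(x):
    # Normalized rank transform: U^t = R^t / T (Deheuvels' empirical copula transform)
    order = sorted(range(len(x)), key=lambda i: x[i])   # indices sorted by value
    ranks = [0] * len(x)
    for r, i in enumerate(order, start=1):
        ranks[i] = r                                    # R_i = rank of observation i
    return [r / len(x) for r in ranks]                  # U_i = R_i / T

print(empirical_copula_transform([10.0, -3.0, 7.0, 100.0]))  # [0.75, 0.25, 0.5, 1.0]
```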
In practice: `x_transform = rankdata(x)/len(x)`.

Generic Non-Parametric Distance
$$d_\theta^2(X_i, X_j) = \theta\, 3E\big[|P_i(X_i) - P_j(X_j)|^2\big] + (1-\theta)\,\frac12 \int_{\mathbb{R}} \Big(\sqrt{\tfrac{dP_i}{d\lambda}} - \sqrt{\tfrac{dP_j}{d\lambda}}\Big)^2 d\lambda$$
(i) $0 \le d_\theta \le 1$; (ii) for $0 < \theta < 1$, $d_\theta$ is a metric; (iii) $d_\theta$ is invariant under diffeomorphism.

$$d_0^2:\; \frac12 \int_{\mathbb{R}} \Big(\sqrt{\tfrac{dP_i}{d\lambda}} - \sqrt{\tfrac{dP_j}{d\lambda}}\Big)^2 d\lambda = \mathrm{Hellinger}^2$$
$$d_1^2:\; 3E\big[|P_i(X_i) - P_j(X_j)|^2\big] = \frac{1 - \rho_S}{2} = 2 - 6\int_0^1\!\!\int_0^1 C(u, v)\,du\,dv$$
Remark: if $f(x, \theta) = c_\Phi(u_1, \dots, u_N; \Sigma)\prod_{i=1}^N f_i(x_i; \nu_i)$, then $ds^2 = ds^2_{\mathrm{GaussCopula}} + \sum_{i=1}^N ds^2_{\mathrm{margins}}$.

The Hierarchical Block Model
A model of nested partitions. The nested partitions defined by the model can be seen on the distance matrix, for a proper distance and the right permutation of the data points. In practice, one observes and works with a distance matrix which is identical to it up to a permutation of the data.

Results: Data from the Hierarchical Block Model (Adjusted Rand Index)

Algo.  Distance       Distrib       Correl        Correl+Distrib
HCAL   (1-ρ)/2        0.00 ±0.01    0.99 ±0.01    0.56 ±0.01
       E[(X-Y)²]      0.00 ±0.00    0.09 ±0.12    0.55 ±0.05
       GPR θ=0        0.34 ±0.01    0.01 ±0.01    0.06 ±0.02
       GPR θ=1        0.00 ±0.01    0.99 ±0.01    0.56 ±0.01
       GPR θ=.5       0.34 ±0.01    0.59 ±0.12    0.57 ±0.01
       GNPR θ=0       1             0.00 ±0.00    0.17 ±0.00
       GNPR θ=1       0.00 ±0.00    1             0.57 ±0.00
       GNPR θ=.5      0.99 ±0.01    0.25 ±0.20    0.95 ±0.08
AP     (1-ρ)/2        0.00 ±0.00    0.99 ±0.07    0.48 ±0.02
       E[(X-Y)²]      0.14 ±0.03    0.94 ±0.02    0.59 ±0.00
       GPR θ=0        0.25 ±0.08    0.01 ±0.01    0.05 ±0.02
       GPR θ=1        0.00 ±0.01    0.99 ±0.01    0.48 ±0.02
       GPR θ=.5       0.06 ±0.00    0.80 ±0.10    0.52 ±0.02
       GNPR θ=0       1             0.00 ±0.00    0.18 ±0.01
       GNPR θ=1       0.00 ±0.01    1             0.59 ±0.00
       GNPR θ=.5      0.39 ±0.02    0.39 ±0.11    1

Results: Application to Credit Default Swap Time Series
Distance matrices computed on CDS time series exhibit a hierarchical block structure [Marti, Very, Donnat, Nielsen, IEEE ICMLA 2015]. Clusters are unstable with the L2 distance, and stable with the proposed distance.

Consistency
Definition (Consistency of a clustering algorithm). A clustering algorithm $A$ is consistent with respect to the Hierarchical Block Model defining a set of nested partitions $P$ if the probability that the algorithm $A$ recovers all the partitions in $P$ converges to 1 when $T \to \infty$.
Definition (Space-conserving algorithm). A space-conserving algorithm does not distort the space, i.e. the distance $D_{ij}$ between two clusters $C_i$ and $C_j$ is such that
$$D_{ij} \in \Big[\min_{x\in C_i,\, y\in C_j} d(x, y),\; \max_{x\in C_i,\, y\in C_j} d(x, y)\Big].$$
Consistency
Theorem (Consistency of space-conserving algorithms; Andler, Marti, Nielsen, Donnat, 2015). Space-conserving algorithms (e.g., Single, Average, Complete Linkage) are consistent with respect to the Hierarchical Block Model.
[Figure: recovered partitions for T = 100, T = 1000, T = 10000.]

Discussion and questions?
Avenues for research:
• distances on (copula, margins)
• clustering using multivariate dependence information
• clustering using multiwise dependence information
See also: Optimal Copula Transport for Clustering Multivariate Time Series, Marti, Nielsen, Donnat, 2015.
Operational viewpoint on consensus
• inspired by quantum consensus — the objective covers some more linear algorithms
• limit on accelerating consensus algorithms, with information-theoretic links
Alain Sarlette, INRIA/QUANTIC & Ghent University/SYSTeMS
(the announced talk; in press at IEEE Trans. Automatic Control)

"Symmetrization": L. Mazzarella, F. Ticozzi, A.S., arXiv:1311.3364 and arXiv:1303.4077

Consensus: reaching agreement $x_1 = x_2 = \dots = x_N$ is the basis for many distributed computing tasks.
Classical consensus algorithm:
$$x_k(t+1) = x_k(t) + \alpha(t)\,\big(x_j(t) - x_k(t)\big).$$
Very flexible and robust convergence: as long as the network integrated over some finite $T$ forms a connected graph and $\alpha(t) \in [\alpha_m, \alpha_M] \subset (0, 1)$.
Convergence proof idea: shrinking convex hull — the highest value can only decrease, the lowest can only increase.

Defining consensus in tensor product space? How to define consensus w.r.t. correlations, entanglement, ...? How to write a consensus algorithm?
Standard consensus: the system states $x_k$ are directly accessible for computation, and can be linearly combined, copied, communicated...
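The classical consensus recurrence and its shrinking-convex-hull behavior can be sketched as follows (the path graph, weights and step size are our illustrative choices):

```python
# Classical consensus on a path graph of 4 agents: each agent moves a fraction
# alpha toward the average of its neighbors. Values stay inside the convex hull
# of the current values, which shrinks until all agents agree.
alpha = 0.3
x = [0.0, 1.0, 4.0, 9.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for _ in range(200):
    x = [xk + alpha * (sum(x[j] for j in neighbors[k]) / len(neighbors[k]) - xk)
         for k, xk in enumerate(x)]
spread = max(x) - min(x)
print(spread < 1e-6)  # True: shrinking convex hull -> agreement
```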
Quantum consensus: the whole quantum state / probability distribution cannot be measured ⇒ we must physically exchange "things". Our initial goal: bringing consensus into the quantum regime.

Consensus viewed as partial swapping
Pairwise consensus interaction between agents $(j, k)$: a mixture of two unitary operations, stay in place and swap $j$ with $k$. Such a mixture can be easily implemented physically in quantum systems or, for that matter, in other information structures.

Consensus operation as discrete group action
A (finite) group $G$ acts linearly, via $a(g, x)$, on a vector space $X$ of objects "of interest".
Target: symmetrization — reach a state $x \in X$ where $a(g, x) = x$ for all $g \in G$.
Property: the projection on the symmetrization set can be written as the uniform average $x \mapsto \frac{1}{|G|}\sum_{g\in G} a(g, x)$.
Dynamics: $x(t+1) = \sum_{g} s_g(t)\, a(g, x(t))$, with the $s_g(t)$ defining a convex combination over $G$ at each $t$. Usually $s_g(t) \ne 0$ only for $g$ belonging to a very restricted subset of $G$.

Lift from actions to group
• The state $x(t)$ at any time can be written as a convex combination $x(t) = \sum_g p_g(t)\, a(g, x(0))$.
• The dynamics can then be lifted to the vector $p(t)$, with $p$ independent of $x(0)$.
• Starting point $p_g(0) = \delta(g, e)$; the target $p_g^\ast = 1/|G|$ for all $g$ yields consensus on group weights. Possibly large number of nodes, e.g. $|G| = N!$ for the permutation group.
• The exact values of $s_h(t)$, and even the selected interactions at each time step, need not be exactly controlled ⇒ strong robustness.
Convergence to $p^\ast$ holds under conditions analogous to classical consensus. Proof: possible by analogy with classical consensus; alternatively, use the entropy of $p(t)$ as a strict Lyapunov function.
Examples:
• $G$ = permutations, acting on classical state values or on classical/quantum probability distributions, leads to randomized consensus (standard consensus).
• $G$ = cyclic group leads to a random Fourier transform (use?).
• $G$ = decoupling group: links to quantum Dynamical Decoupling.
• $G$ = operational gates: gives uniform random gate generation.

The announced talk covers: consensus with antagonistic interactions; consensus towards a leader value; gradient descent and coordinate descent; various applications. Non-trivial weight assignment & convergence result; solves previously not covered cases, to distinguish $\{x_k\} = 0$ from $\{x_k\} = \{x_j\}$.

Consensus with antagonistic interactions
$G$ = permutation matrices with arbitrary sign ±1 on each entry. Weights $s_g$: Birkhoff decomposition on the $a_{jk}$ as for standard consensus; then swap the weights to the non-positive permutation if $a_{jk} < 0$. Non-trivial weight assignment (iterative procedure, see paper). Operator conclusions about which components of $x$ converge to zero (slightly more general than standard convergence to $x = 0$).

Consensus towards a leader value
$G$ = permutation matrices with arbitrary sign ±1 on each entry; also other algorithms with $(a_{jk})$ substochastic.

Gradient & coordinate descent
Search for the minimum of $f(x)$ by computing gradient steps; assume (sorry) $f(x) = x^\top A x$. In the eigenbasis of $A$ this becomes a (if stable) substochastic iteration.
Not a big insight... Extension: $k$ cycles through the coordinates. The weights follow from reflection matrices around non-orthogonal directions; they sum to 1 but may be negative ⇒ study coordinate descent convergence via a symmetric but possibly negative transition matrix: clear tools exist for this, e.g. in consensus.

Limit on accelerating consensus algorithms with information-theoretic links (arXiv:1412.0402)

Add one memory, no more [Muthukrishnan et al., 98]
Properly using one memory, $x(t-1)$ together with $x(t)$, allows convergence quadratically faster. What about more memories?
Our result: if the graph eigenvalues can be any values in $[a, b]$ with $a, b$ known, then more memories do not improve the worst consensus eigenvalue (proof: not very information-theoretic, see arXiv:1412.0402).

Interesting links:
• Optimization: Nesterov method not further improvable by $m(t-2), \dots$?
• Robust control: design a plant to be stable under feedback $u = k\,y$, $k$ in an interval; a node improves by taking the direct feedback to itself into account.
• Communication theory / networks: if the network is poorly known, there is no benefit in accounting for longer loops.
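A numerical sketch of the one-memory speed-up (our construction: the cycle graph, the weights, the heavy-ball form of the memory term and the value of beta are our illustrative choices, in the spirit of [Muthukrishnan et al., 98]):

```python
import numpy as np

# Plain consensus:        x(t+1) = W x(t)
# One-memory (heavy-ball): x(t+1) = (1 + beta) * W x(t) - beta * x(t-1)
n = 10
W = np.zeros((n, n))
for k in range(n):                           # lazy uniform weights on a cycle graph
    W[k, k] = 0.5
    W[k, (k - 1) % n] = W[k, (k + 1) % n] = 0.25

lam2 = 0.5 + 0.5 * np.cos(2 * np.pi / n)     # largest non-unit eigenvalue of W
beta = (lam2 / (1 + np.sqrt(1 - lam2**2)))**2

def spread(v):
    return float(np.max(v) - np.min(v))      # disagreement measure

x0 = np.arange(n, dtype=float)
x_plain = x0.copy()
x_prev, x_cur = x0.copy(), x0.copy()
for _ in range(60):
    x_plain = W @ x_plain
    x_prev, x_cur = x_cur, (1 + beta) * (W @ x_cur) - beta * x_prev
print(spread(x_cur) < spread(x_plain))       # True: the memory term converges much faster
```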
Scaled Bregman distances (SBD) have turned out to be useful tools for simultaneous estimation and goodness-of-fit testing in parametric models of random data (streams, clouds). We show how SBD can additionally be used for model preselection (structure detection), i.e. for finding appropriate candidates of model (sub)classes in order to support a desired decision under uncertainty. For this, we exemplarily concentrate on the context of nonlinear recursive models with additional exogenous inputs; as special cases we include nonlinear regressions, linear autoregressive models (e.g. AR, ARIMA, SARIMA time series), and nonlinear autoregressive models with exogenous inputs (NARX). In particular, we outline a corresponding information-geometric 3D computer-graphical selection procedure. Some sample-size asymptotics are given as well.
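As a pointer to what a scaled Bregman distance computes, here is a minimal discrete sketch in the Stummer-Vajda form $B_\phi(P, Q \mid M) = \sum_y m(y)\big[\phi(\tfrac{p}{m}) - \phi(\tfrac{q}{m}) - \phi'(\tfrac{q}{m})(\tfrac{p}{m} - \tfrac{q}{m})\big]$ (our illustration; the paper's setting is more general):

```python
import math

def scaled_bregman(p, q, m, phi, dphi):
    # B_phi(P, Q | M): Bregman-type pointwise terms, scaled by the measure M
    return sum(my * (phi(py / my) - phi(qy / my) - dphi(qy / my) * (py / my - qy / my))
               for py, qy, my in zip(p, q, m))

phi  = lambda t: t * math.log(t)     # Kullback-Leibler-type generator
dphi = lambda t: math.log(t) + 1.0   # its derivative

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
kl = sum(py * math.log(py / qy) for py, qy in zip(p, q))
# Scaling by M = Q recovers the ordinary phi-divergence (here KL(P || Q)):
print(abs(scaled_bregman(p, q, q, phi, dphi) - kl))  # ~0
```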

New model search for nonlinear recursive models, regressions and autoregressions Wolfgang Stummer and AnnaLena Kißlinger FAU University of ErlangenNürnberg Talk at GSI 2015, Palaiseau, 29/10/2015 Outline Outline • introduce a new method for model search (model preselection, structure detection) in data streams/clouds: key technical tool: densitybased probability distances/divergences with “scaling” • gives much ﬂexibility for interdisciplinary situationbased applications (also with cost functions, utility, etc.) • goalspeciﬁc handling of outliers and inliers (dampening, ampliﬁcation) not directly covered today • give new general parameterfree asymptotic distributions for involved dataderived distances/divergences • outline a corresponding informationgeometric 3D computergraphical selection procedure 29/10/2015  Wolfgang Stummer and AnnaLena Kißlinger  GSI 2015  3 WHY distances between (non)probability measures (1) • “distances” D(P, Q) between two (non)probability measures P, Q play a prominent role in modern statistical inferences: • parameter estimation, • testing for goodnessofﬁt resp. homogenity resp. independence, • clustering, • changepoint detection, • Bayesian decision procedures as well as for other research ﬁelds such as • information theory, • signal processing including image and speech processing, • pattern recognition, • feature extraction, • machine learning, • econometrics, and • statistical physics. 29/10/2015  Wolfgang Stummer and AnnaLena Kißlinger  GSI 2015  4 WHY distances between (non)probability measures (2) • suppose we want to describe the proximity/distance/closeness/similarity D(P, Q) of two (non)probability distributions P and Q • either two “theoretical” distributions e.g. P = N(µ1, σ2 1), Q = N(µ2, σ2 2) • or two (empirical) distributions representing data (e.g. derived from frequencies, histograms, . . . ) • or one of each −→ today • P, Q may live on Rd , or on “spaces of functions with appropriate properties”: e.g. 
potential future scenarios of a time series, or a continuous-time stochastic process (e.g. functional data).
• Exemplary statistical uses of distances D(P, Q) follow.

WHY distances between probability measures (3)
Application 1: [Figure: a plane representing all probability distributions (on R, R^d, a path space, ...), equipped with a "distance" D(P, Q).]
Take e.g. P := P_N^orig := P_N^emp := (1/N) Σ_{i=1}^N δ_{X_i}[·], the empirical distribution of an i.i.d. sample X_1, ..., X_N of size N from Q_{θ_true}; it puts equal "weight" 1/N on each data point. Then θ̂ is the minimum-distance estimator (e.g. θ̂ = MLE when D(P_N^emp, Q_θ) is the Kullback-Leibler divergence). However, D(P_N^emp, Q_θ̂) may still be large ("bad goodness of fit"), which motivates a test.

Time Series and Nonlinear Regressions (1)
In time series, the data (describing random variables) ..., X_1, X_2, ... are non-i.i.d.; e.g. in the autoregressive model AR(2) of order 2:
    X_{m+1} − ψ_1 · X_m − ψ_2 · X_{m−1} = ε_{m+1},  m ≥ k,
where (ε_{m+1})_{m≥k} is a family of independent and identically distributed (i.i.d.) random variables on some space Y having parametric distribution Q_θ (θ ∈ Θ).
Compact notation: take the parameter vector £ := (2, ψ_1, ψ_2), the backshift operator B defined by B X_m := X_{m−1}, the 2-polynomial ψ_1 · B + ψ_2 · B², and the identity operator 1 given by 1 X_m := X_m. The left-hand side then becomes
    F_£(X_{m+1}, X_m, X_{m−1}, ..., X_k) = (1 − Σ_{j=1}^{2} ψ_j B^j) X_{m+1}.
As data-derived distribution we take the empirical distribution of the left-hand side,
    P_{N,£}^orig[·] := P[·; X_{k−1}, ..., X_{k+N}; £] := (1/N) Σ_{i=1}^N δ_{F_£(X_{k+i}, X_{k+i−1}, ..., X_k)}[·],
with histogram-based probability mass function (relative frequencies)
    p_N^£(y) = #{i ∈ {1, ..., N} : F_£(X_{k+i}, ..., X_k) = y} / N
             = #{i : X_{k+i} − ψ_1 · X_{k+i−1} − ψ_2 · X_{k+i−2} = y} / N.

Time Series and Nonlinear Regressions (2)
Two issues: which time series models X_i, and which distances D(·, ·)?

Time Series and Nonlinear Regressions (3)
More generally: nonlinear autorecursions in the sense of
    F_{£_{m+1}}(m+1, X_{m+1}, X_m, X_{m−1}, ..., X_k, Z_{k−}, a_{m+1}, a_m, a_{m−1}, ..., a_k) = ε_{m+1},  m ≥ k,
• where (F_{£_{m+1}})_{m≥k} is a sequence of nonlinear functions parametrized by £_{m+1} ∈ Γ,
• (ε_{m+1})_{m≥k} are i.i.d. with parametric distribution Q_θ (θ ∈ Θ),
• (a_m)_{m≥k} are independent variables which are non-stochastic (deterministic) today,
• the "backlog input" Z_{k−} denotes the additional input on X and a before k, to get the recursion started.
Today, we assume k = −∞, E_{Q_θ}[ε_{m+1}] = 0, and that the initial data X_k as well as the backlog input Z_{k−} are deterministic.
Special case:
    X_{m+1} = g(f_{£_{m+1}}(m+1, X_m, X_{m−1}, ..., X_k, Z_{k−}, a_{m+1}, a_m, a_{m−1}, ..., a_k), ε_{m+1})
for some appropriate functions f_{£_{m+1}} and g, e.g. g(u, v) := u + v or g(u, v) := u · v; the (ε_{m+1})_{m≥k} can be interpreted as "randomness-driving innovations (noise)".

Time Series and Nonlinear Regressions (4)
Our general context covers in particular:
• NARX models = nonlinear autoregressive models with exogenous input: the above special case with constant parameter vector £_{m+1} ≡ £ and additive g.
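The data-derived distribution of the recursion residuals is straightforward to compute in the AR(2) special case above. A minimal sketch, assuming a plain NumPy array of observations and a known parameter pair (the function name `ar2_residual_pmf` is ours, not from the talk):

```python
import numpy as np

def ar2_residual_pmf(x, psi1, psi2):
    """Relative frequencies of the AR(2) residuals
    F_(X_{m+1}, X_m, X_{m-1}) = X_{m+1} - psi1*X_m - psi2*X_{m-1},
    i.e. the pmf p_N of the data-derived distribution P_{N}^orig."""
    eps = x[2:] - psi1 * x[1:-1] - psi2 * x[:-2]
    values, counts = np.unique(eps, return_counts=True)
    return dict(zip(values.tolist(), (counts / eps.size).tolist()))

# deterministic toy sequence X_{m+1} = X_m + 1: with psi1 = 1, psi2 = 0
# every residual equals 1, so the empirical pmf is a point mass at 1
pmf = ar2_residual_pmf(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), psi1=1.0, psi2=0.0)
```

For real data one would feed the residual pmf, together with a candidate pmf q_θ, into the scaled Bregman divergence discussed next.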
Especially:
• nonlinear regressions with deterministic independent variables: the only involved X is X_{m+1};
• AR(r) = linear autoregressive models (time series) of order r ∈ N (recall the above example with r = 2);
• ARIMA(r, d, 0) = linear autoregressive integrated models (time series) of order r ∈ N_0 and d ∈ N_0;
• SARIMA(r, d, 0)(R, D, 0)_s = linear seasonal autoregressive integrated models (time series) with order d ∈ N_0 of non-seasonal differencing, order r ∈ N_0 of the non-seasonal AR part, length s ∈ N_0 of a season, order D ∈ N_0 of seasonal differencing, and order R ∈ N_0 of the seasonal AR part.

Divergences / similarity measures (1)
• So far: motivations for WHY to measure the proximity/distance/closeness/similarity D(P, Q); here P = P_{N,£}^orig[·] (the empirical distribution of the i.i.d. noises) and Q = Q_θ (a candidate for the true distribution of the i.i.d. noises).
• Now: HOW to measure, i.e. which "distance" D(P, Q) to use?
• Prominent examples for D(P, Q): relative entropy (Kullback-Leibler information discrimination, for which the minimum-distance estimator MDE = MLE), Hellinger distance, Pearson's chi-square divergence, Csiszár's f-divergences, ... All will be covered by our much more general context.
• DESIRE: a toolbox {D_{φ,M}(P, Q) : φ ∈ Φ, M ∈ M} which is far-reaching and flexible (reflected by different choices of the "generator" φ and the scaling measure M), and which should also cover robustness issues.

Divergences / similarity measures (2)
• From now on: probability distributions P, Q on (X, A), and a non-probability distribution/(σ-)finite measure M on (X, A); we assume that all three of them have densities w.r.t. a σ-finite measure λ,
    p(x) = dP/dλ(x),  q(x) = dQ/dλ(x),  m(x) = dM/dλ(x)  for λ-almost all x ∈ X
(for today: mostly X ⊂ R).
• Furthermore, we take a "divergence (distance) generating function" φ : (0, ∞) → R which (for today) is twice differentiable and strictly convex; without loss of generality we also assume φ(1) = 0; the limit φ(0) := lim_{t↓0} φ(t) always exists (but may be ∞).

Scaled Bregman Divergences (1)
Definition (Stummer 2007, extended in Stummer & Vajda 2012, IEEE Trans. Inf. Theory). The Bregman divergence (distance) of probability distributions P, Q scaled by the (σ-)finite measure M on (X, A) is defined by
    B_φ(P, Q | M) := ∫_X m(x) [ φ(p(x)/m(x)) − φ(q(x)/m(x)) − φ′(q(x)/m(x)) · (p(x)/m(x) − q(x)/m(x)) ] dλ(x).
• If X = {x_1, x_2, ..., x_s}, where s may be infinite, and λ is a counting measure, then p(·), q(·), m(·) are classical probability mass functions ("counting densities"):
    B_φ(P, Q | M) = Σ_{i=1}^s m(x_i) [ φ(p(x_i)/m(x_i)) − φ(q(x_i)/m(x_i)) − φ′(q(x_i)/m(x_i)) · (p(x_i)/m(x_i) − q(x_i)/m(x_i)) ].
  E.g. φ(t) = (t − 1)² gives B_φ(P, Q | M) = Σ_{i=1}^s (p(x_i) − q(x_i))² / m(x_i), the weighted Pearson χ².
Example: P := P_N^emp := (1/N) Σ_{i=1}^N δ_{ε_i}[·], the empirical distribution of an i.i.d. sample of size N from Q_{θ_true}; the corresponding pmf is the relative frequency p(x) := p_N^emp(x) := (1/N) · #{j ∈ {1, ..., N} : ε_j = x}. Q := Q_θ, where the "hypothetical candidate distribution" Q_θ has pmf q(x) := q_θ(x). M := W(P_N^emp, Q_θ) with pmf m(x) = w(p_N^emp(x), q_θ(x)) > 0 for some function w(·, ·).

Discrete case with φ(t) = φ_α(t) and m(x) = w_β(p(x), q(x))
[3D presentation; exemplary goal: the surface should be ≈ 0 for all α, β.]

B_φ(P, Q | M) with composite scalings M = W(P, Q) (1)
• From now on: M = W(P, Q), i.e. m(x) = w(p(x), q(x)) for some function w(·, ·).
• w(u, v) = 1 → unscaled/classical Bregman distance (discrete case: Pardo/Vajda 1997, 2003); e.g. for the generator φ_1(t) = t log t + 1 − t one gets the Kullback-Leibler divergence (MLE); e.g.
for the power functions φ_α(t) := (t^α − 1 + α − α·t) / (α(α−1)), α ≠ 0, 1, one obtains the density power divergences of Basu et al. (1998) and Basu et al. (2013/14/15).
• New example (Kißlinger/Stummer 2015c): scaling by weighted r-th power means,
    w_{β,r}(u, v) := (β · u^r + (1 − β) · v^r)^{1/r},  β ∈ [0, 1], r ∈ R\{0}.
• E.g. r = 1: arithmetic-mean scaling (mixture scaling);
  subcase β = 0: w_{0,1}(u, v) = v → all Csiszár φ-divergences/disparities; for φ_2(t) one gets Pearson's chi-square divergence;
  subcase β = 1 and φ_2(t) → Neyman's chi-square divergence;
  subcase β ∈ [0, 1] and φ_2(t) → blended weight chi-square divergence (Lindsay 1994);
  subcase β ∈ [0, 1] and φ_α(t) → Stummer/Vajda (2012), Kißlinger/Stummer (2013, 2015a).
• E.g. r = 1/2: w_{β,1/2}(u, v) = (β·√u + (1 − β)·√v)²;
  subcase β ∈ [0, 1] and φ_2(t) → blended weight Hellinger distance (Lindsay 1994, Basu/Lindsay 1994).
• E.g. r → 0: geometric-mean scaling w_{β,0}(u, v) = u^β · v^{1−β} (Kißlinger/Stummer 2015b).

Some scale connectors w(u, v) (for any generator φ) (1)
[Figure panels: (a) w_{0,1}(u, v) = v (Csiszár divergences); (b) w_{0.45,1}(u, v) = 0.45·u + 0.55·v; (c) w_{0.45,0.5}(u, v) = (0.45·√u + 0.55·√v)²; (d) w_{0.45,0}(u, v) = u^0.45 · v^0.55.]

Scale connectors w(u, v) which are NOT r-th power means
[Figure panels: (e) WEXPM: w_{0.45,f̃_6}(u, v) = (1/6) · log(0.45·e^{6u} + 0.55·e^{6v}); (g) w^med_{0.45}(u, v) = med{min{u, v}, 0.45, max{u, v}}; (j) a smooth adjustment w^smooth_adj(u, v) with h_in = −0.5, h_out = 0.3, δ = 10^{−7}, etc.; (k) parameter description for w_adj(u, v).]

Robustness
Obtaining robustness against outliers and inliers (i.e. high unusualness in the data, surprising observations), as well as the (asymptotic) efficiency of our procedure, is a question of a good choice of the scale connector w(·, ·) → another long paper, Kißlinger and Stummer
(2015b), and another talk. We end up with a new transparent, far-reaching 3D computer-graphical "geometric" method; the scale connector is also called a density-pair adjustment function, and choosing it is vaguely a similar task to choosing a good copula in (inter)dependence-modelling frameworks.

Universal model search UMSPD (1)
Recall the two issues: which time series model X_i, and which distance D(·, ·). Now: model search in detail. Basic idea (for finite discrete distributions): under the correct ("true") model ((F_{£0_{m+1}})_{m≥k}, Q_{θ0}), the sequence
    (F_{£0_{k+i}}(k+i, X_{k+i}, X_{k+i−1}, ..., X_k, Z_{k−}, a_{k+i}, ..., a_k))_{i=1,...,N}
behaves like a size-N sample from an i.i.d. sequence with distribution Q_{θ0}, i.e.
    P_N^{£0}[·] := (1/N) Σ_{i=1}^N δ_{F_{£0_{k+i}}(k+i, X_{k+i}, X_{k+i−1}, ..., X_k, Z_{k−}, a_{k+i}, ..., a_k)}[·] → Q_{θ0}[·] as N → ∞,
and thus D_{α,β}(P_N^{£0}, Q_{θ0}) → 0 as N → ∞, for a very broad family
    D := {D_{α,β}(·, ·) : α ∈ [α_min, α_max], β ∈ [β_min, β_max]}
of distances, where we use the SBDs
    D_{α,β}(P_N^{£0}, Q_θ) := B_{φ_α}(P_N^{£0}, Q_{θ0} | W_β(P_N^emp, Q_{θ0}))
for an α-family of generators φ_α(·) (today: the above power functions) and a β-family of scale connectors W_β(·, ·) (today: geometric-mean scaling w_{β,0}(u, v) = u^β · v^{1−β}).

Universal model search UMSPD (2)
We introduce the universal model search by probability distance (UMSPD):
1. Choose (F_{£_{m+1}})_{m≥k} from a principal parametric-function-family class.
2. Choose some prefixed class of parametric candidate distributions {Q_θ : θ ∈ Θ}.
3. Find a parameter sequence £ := (£_{m+1})_{m≥k} (often constant) and a θ ∈ Θ such that D_{α,β}(P_N^£, Q_θ) ≈ 0 for large enough sample size N and all (α, β) ∈ [α_min, α_max] × [β_min, β_max].
4.
Preselect the model ((F_{£_{m+1}})_{m≥k}, Q_θ) if the "3D score surface" (the "mountains")
    S := {(α, β, D_{α,β}(P_N^£, Q_θ)) : α ∈ [α_min, α_max], β ∈ [β_min, β_max]}
is smaller than some appropriately chosen threshold T (namely, a chi-square quantile, see below).

Universal model search UMSPD (3)
Graphical implementation: plot the 3D preselection-score surface S.

Universal model search UMSPD (4)
Advantage of UMSPD: after the preselection process, one can continue to work with the same D_{α,β}(·, ·) in order to perform, among all preselected candidate models, a statistically sound inference in terms of simultaneous exact parameter estimation and goodness-of-fit. One issue remains to be discussed for UMSPD: the choice of the threshold T.

Universal model search UMSPD (5)
We exemplarily show how to quantify the above-mentioned preselection criterion ("the 3D surface S should be smaller than a threshold T") by some sound asymptotic analysis, for the above special choices φ_α(·) and w_β(·, ·). The cornerstone is the following limit theorem.
Theorem. Let Q_{θ0} be a finite discrete distribution with c := |Y| ≥ 2 possible outcomes and strictly positive densities q_{θ0}(y) > 0 for all y ∈ Y. Then for each α > 0, α ≠ 1, and each β ∈ [0, 1), the random scaled Bregman power distance
    2N · B_{φ_α}(P_N^{£0}, Q_{θ0} | (P_N^{£0})^β · Q_{θ0}^{1−β}) =: 2N · B(α, β; £0, θ0; N)
is asymptotically chi-squared distributed, in the sense that
    2N · B(α, β; £0, θ0; N) → χ²_{c−1} in law, as N → ∞.
In terms of the corresponding χ²_{c−1} quantiles, one can derive the threshold T which the 3D preselection-score surface S has to (partially) exceed in order to believe, with an appropriate level of confidence, that the investigated model ((F_{£_{m+1}})_{m≥k}, Q_θ) is not good enough to be preselected.
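To make the theorem concrete, here is a small numerical sketch of the scaled Bregman power distance with geometric-mean scaling, together with the resulting chi-square preselection check. This is our own illustration: the helper names and toy pmfs are not from the talk, and the 0.95 quantile of χ² with 2 degrees of freedom (about 5.991) is hard-coded.

```python
import numpy as np

def phi_alpha(t, a):
    # power generator phi_alpha(t) = (t^a - 1 + a - a*t) / (a*(a-1)), a != 0, 1
    return (t**a - 1 + a - a * t) / (a * (a - 1))

def dphi_alpha(t, a):
    # derivative phi'_alpha(t) = (t^(a-1) - 1) / (a - 1)
    return (t**(a - 1) - 1) / (a - 1)

def scaled_bregman_power(p, q, a, b):
    """B_{phi_a}(P, Q | M) with geometric-mean scaling m = p^b * q^(1-b),
    for strictly positive pmfs p, q given as numpy arrays."""
    m = p**b * q**(1 - b)
    u, v = p / m, q / m
    return float(np.sum(m * (phi_alpha(u, a) - phi_alpha(v, a)
                             - dphi_alpha(v, a) * (u - v))))

# hypothetical example: c = 3 outcomes, N = 1000 observations
N = 1000
CHI2_95_DF2 = 5.991                   # ~0.95 quantile of chi-square, c - 1 = 2 df
q_theta = np.array([0.2, 0.3, 0.5])   # candidate model pmf
p_emp   = np.array([0.21, 0.29, 0.5]) # empirical pmf of the residuals
B = scaled_bregman_power(p_emp, q_theta, a=2.0, b=0.5)
preselect = 2 * N * B <= CHI2_95_DF2  # theorem: 2N*B is asymptotically chi2_{c-1}
```

Note that for a = 2 the generator reduces to (t − 1)²/2, so B equals the weighted Pearson quantity Σ (p − q)²/(2m), which the test below uses as a cross-check.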
Further topics
• One can use scaled Bregman divergences for robust statistical inference, with "completely general asymptotic results" for other choices of φ(·) and w(·, ·) (Kißlinger & Stummer 2015b).
• One can use scaled Bregman divergences for change detection in data streams (Kißlinger & Stummer 2015c).
• There are explicit formulae for B_{φ_α}(P_{θ1}, P_{θ2} | P_{θ0}), where P_{θ1}, P_{θ2}, P_{θ0} stem from the same arbitrary exponential family, cf. Stummer & Vajda (2012), Kißlinger & Stummer (2013); this includes stochastic processes (Lévy processes).
• We can do Bayesian decision making with important processes: non-stationary stochastic differential equations, e.g. non-stationary branching processes (Kammerer & Stummer 2010) and inhomogeneous binomial diffusion approximations (Stummer & Lao 2012).

Summary
• introduced a new method for model search (model preselection, structure detection) in data streams/clouds; key technical tool: density-based probability distances/divergences with "scaling"
• gives much flexibility for interdisciplinary, situation-based applications (also with cost functions, utility, etc.)
• gave a new parameter-free asymptotic distribution result for the involved data-derived distances/divergences
• outlined a corresponding information-geometric 3D computer-graphical selection procedure

References
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B 28, 131-140 (1966)
Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549-559 (1998)
Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. CRC Press, Boca Raton (2011)
Billings, S.A.: Nonlinear System Identification. Wiley, Chichester (2013)
Csiszár, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. A 8, 85-108 (1963)
Kißlinger, A.-L., Stummer, W.: Some decision procedures based on scaled Bregman distance surfaces. In: F. Nielsen and F. Barbaresco (eds.): GSI 2013, LNCS 8085, pp. 479-486. Springer, Berlin (2013)
Kißlinger, A.-L., Stummer, W.: New model search for nonlinear recursive models, regressions and autoregressions. In: F. Nielsen and F. Barbaresco (eds.): GSI 2015, LNCS 9389. Springer, Berlin (2015a)
Kißlinger, A.-L., Stummer, W.: Robust statistical engineering by means of scaled Bregman divergences. Preprint (2015b)
Kißlinger, A.-L., Stummer, W.: A new information-geometric method of change detection. Preprint (2015c)
Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Leipzig (1987)
Nock, R., Piro, P., Nielsen, F., Ali, W.B.H., Barlaud, M.: Boosting k-NN for categorization of natural scenes. Int. J. Comput. Vis. 100, 294-314 (2012)
Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall, Boca Raton (2006)
Pardo, M.C., Vajda, I.: On asymptotic properties of information-theoretic divergences. IEEE Transactions on Information Theory 49(7), 1860-1868 (2003)
Read, T.R.C., Cressie, N.A.C.: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York (1988)
Stummer, W.: Some Bregman distances between financial diffusion processes. Proc. Appl. Math. Mech. 7(1), 1050503-1050504 (2007)
Stummer, W., Vajda, I.: On Bregman distances and divergences of probability measures. IEEE Transactions on Information Theory 58(3), 1277-1288 (2012)
In the context of sensor networks, gossip algorithms are a popular, well-established technique for achieving consensus when sensor data are encoded in linear spaces. Gossip algorithms also have several extensions to nonlinear data spaces. Most of these extensions deal with Riemannian manifolds and use Riemannian gradient descent. This paper, instead, studies gossip in a broader CAT(κ) metric setting, encompassing, but not restricted to, several interesting cases of Riemannian manifolds. As it turns out, convergence can be guaranteed as soon as the data lie in a small enough ball of a mere CAT(κ) metric space. We also study convergence speed in this setting and establish linear rates of convergence.

Gossip in CAT(κ) metric spaces
Anass Bellachehab and Jérémie Jakubowicz, Télécom SudParis, Institut Mines-Télécom & CNRS UMR 5157
GSI 2015, Palaiseau, October 28

Problem
We consider a network of N agents such that:
• The network is represented by a connected, undirected graph G = (V, E), where V = {1, ..., N} stands for the set of agents and E denotes the set of available communication links between agents.
• At any given time t, an agent v stores data represented as an element x_v(t) of a data space M.
• X_t = (x_1(t), ..., x_N(t)) is the tuple of data values of the whole network at instant t.

Problem (cont'd)
Each agent has its own Poisson clock that ticks with a common intensity λ (the clocks are identically made), independently of the other clocks. When an agent's clock ticks, the agent is able to perform some computations and wake up some neighboring agents. The goal is to take the system from an initial state X(0) to a consensus state, meaning a state of the form X_∞ = (x_∞, ..., x_∞) with x_∞ ∈ M.
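For the Euclidean data space M = R, the classical random pairwise midpoint dynamics recalled in the next slides can be simulated in a few lines. This is a toy sketch of ours, with a uniformly random edge choice per discrete step standing in for the Poisson clocks:

```python
import random

def pairwise_gossip(values, edges, steps, seed=0):
    """Random pairwise midpoint gossip on a graph: at each tick a uniformly
    random edge (v, w) is drawn and both endpoints move to their midpoint."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(steps):
        v, w = rng.choice(edges)
        x[v] = x[w] = 0.5 * (x[v] + x[w])
    return x

# 4-cycle; the midpoint update preserves the sum, so consensus is the mean (0 here)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = pairwise_gossip([-1.0, -1.0, 1.0, 1.0], edges, steps=2000)
```

The sum-preservation property is exactly what fails in a general metric space, which is why the talk only claims synchronization (consensus at some value), not averaging.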
Random Pairwise Gossip (Xiao & Boyd '04)
[Animation frames: an example network with initial values x_0 = (−1, −1, 1, −1, −2, 1, 1, 2); at each step an awakened pair of neighbors replaces both of its values by their midpoint.]
Update rule:
    x_n = (I − (1/2)(δ_{i_n} − δ_{j_n})(δ_{i_n} − δ_{j_n})^T) x_{n−1}

A natural extension in a metric setting
[Animation frames: the same midpoint update on a curved data space.]

Outline
1. Motivation
2. State of the art
3. CAT(κ) spaces
4. Previous result for κ = 0
5. Why the κ > 0 case is more complex
6. Our result

Motivation
In its Euclidean setting, Random Pairwise Midpoint cannot address several useful types of data:
• sphere positions (sphere)
• line orientations (projective space)
• solid orientations (rotations)
• subspaces (Grassmannians)
• phylogenetic trees (metric space)
• Cayley graphs (metric space)
• reconfigurable systems (metric space)

State of the art
• Consensus optimization on manifolds: [Sarlette-Sepulchre '08], [Tron et al. '12], [Bonnabel '13]
• Synchronization on the circle: [Sarlette et al. '08]
• Synchronization on SO(3): [Tron et al. '12]
• Our previous work: distributed pairwise gossip on CAT(0) spaces
Caveat: in this work, we deal with the problem of synchronization, i.e.
attaining a consensus, whatever its value; contrary to the Euclidean case, where it is known that random pairwise midpoints converge to x̄_0.

CAT(κ) spaces
Model spaces. Consider a model surface M_κ with constant sectional curvature κ:
• κ < 0 corresponds to a hyperbolic space
• κ = 0 corresponds to a Euclidean space
• κ > 0 corresponds to a sphere
Geodesics. Assume M is a metric space equipped with metric d. A map γ : [0, l] → M such that
    d(γ(t), γ(t′)) = |t − t′| for all 0 ≤ t, t′ ≤ l
is called a geodesic in M; a = γ(0) and b = γ(l) are its endpoints. If there exists one and only one geodesic linking a to b, it is denoted [a, b].

CAT(κ) spaces (cont'd)
Triangles. A triple of geodesics γ, γ′, γ″ with respective endpoints a, b and c is called a triangle and is denoted (γ, γ′, γ″), or (a, b, c) when there is no ambiguity.
Comparison triangles. When κ ≤ 0, given a triangle (γ, γ′, γ″), there always exists a triangle (a_κ, b_κ, c_κ) in M_κ such that d(a, b) = d(a_κ, b_κ), d(b, c) = d(b_κ, c_κ) and d(c, a) = d(c_κ, a_κ), with a = γ(0), b = γ′(0) and c = γ″(0).

CAT(κ) spaces (cont'd)
CAT(κ) inequality. A triangle (γ, γ′, γ″) in a metric space M satisfies the CAT(κ) inequality if for any x ∈ [a, b] and y ∈ [a, c] one has
    d(x, y) ≤ d(x_κ, y_κ),
where x_κ ∈ [a_κ, b_κ] is such that d(a_κ, x_κ) = d(a, x), and y_κ ∈ [a_κ, c_κ] is such that d(a_κ, y_κ) = d(a, y).
A metric space is said to be CAT(κ) if every pair of points can be joined by a geodesic and every triangle with perimeter less than 2D_κ = 2π/√κ satisfies the CAT(κ) inequality.

Formal setting
Assumptions:
1. Time is discrete: t = 0, 1, ...
2. At each time, each agent holds a "value" x_{t,v} in a CAT(κ) metric space M.
3.
At each time t, an agent V_t randomly wakes up and wakes up a neighbor W_t, according to the probability distribution
    P[{V_t, W_t} = {v, w}] = P_{v,w} > 0 if v ∼ w, and 0 otherwise.
Algorithm description:
    x_{t,v} = Midpoint(x_{t−1,V_t}, x_{t−1,W_t}) if v ∈ {V_t, W_t}, and x_{t,v} = x_{t−1,v} otherwise.

Previous result (κ = 0)
The algorithm is sound, because geodesics exist and are unique in CAT(0) spaces.
Convergence: the algorithm converges to a consensus with probability 1, whatever the initial state x_0.
Rate of convergence: convergence occurs at a linear rate. Define σ²(x) = Σ_{v∼w} d²(x_v, x_w); then there exists a constant L < 0 such that
    E σ²(X_k) ≤ C_0 exp(L k).

What changes for κ > 0 (the case of the sphere)
[Animation frames illustrating why the spherical case is more delicate.]

Our result
Provided the diameter of the initial set of values is less than D_κ/2:
The algorithm is sound, because geodesics exist and are unique under this restriction.
Convergence: the algorithm converges to a consensus with probability 1.
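On the unit sphere (κ = 1), the Midpoint operation used by the algorithm has a simple closed form for non-antipodal points: normalize the chord midpoint back onto the sphere. A minimal sketch of ours (the function name is not from the talk):

```python
import math

def sphere_midpoint(p, q):
    """Geodesic midpoint of two unit vectors p, q on the sphere,
    assuming p != -q, so that the geodesic joining them is unique."""
    s = [a + b for a, b in zip(p, q)]
    n = math.sqrt(sum(c * c for c in s))  # n > 0 precisely when p != -q
    return tuple(c / n for c in s)

m = sphere_midpoint((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))  # midpoint on the equator
```

The failure of this formula at antipodal pairs is exactly why the result requires the initial values to lie in a small enough ball: inside such a ball, midpoints stay well defined and unique.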
Rate of convergence: convergence occurs at a linear rate. Define σ²_κ(x) = Σ_{v∼w} χ_κ(d(x_v, x_w)) with χ_κ(x) = 1 − cos(√κ · x); then there exists a constant L ∈ (−1, 0) such that
    E σ²_κ(X_k) ≤ C_0 exp(L k).

[Figures: a configuration before an iteration, after the iteration, and the net balance of the update.]

Sketch of proof (net balance)
Look at the increments:
    N (σ²_κ(X_t) − σ²_κ(X_{t−1})) = −χ_κ(d(X_{V_t}(t−1), X_{W_t}(t−1))) + Σ_{u ∈ V, u ≠ V_t, u ≠ W_t} T_κ(V_t, W_t, u),
with
    T_κ(V_t, W_t, u) = 2 χ_κ(d(X_u(t), M_t)) − χ_κ(d(X_u(t), X_{V_t}(t−1))) − χ_κ(d(X_u(t), X_{W_t}(t−1))),
using the inequality
    χ_κ(d(Midpoint(p, q), r)) ≤ (χ_κ(d(p, r)) + χ_κ(d(q, r))) / 2.

Sketch of proof (two propositions)
We can prove a first proposition:
    E[σ²_κ(X_{k+1}) − σ²_κ(X_k)] ≤ −(1/N) E Δ_κ(X_k),
with
    Δ_κ(x) = (1/(2N)) Σ_{v∼w, {v,w}∈E} P_{v,w} χ_κ(d(x_v, x_w)).
Using graph connectedness we prove a second proposition: assume G = (V, E) is an undirected connected graph; then there exists a constant C_G ≥ 1, depending on the graph only, such that
    (1/2) Δ_κ(x) ≤ σ²_κ(x) ≤ C_G Δ_κ(x) for all x ∈ M^N.

Sketch of proof (cont'd)
Lemma: assume (a_n) is a sequence of nonnegative numbers such that a_{n+1} − a_n ≤ −β a_n with β ∈ (0, 1). Then a_n ≤ a_0 exp(−β n) for all n ≥ 0.
Combined with the two propositions, this gives the desired result: E σ²_κ(X_k) ≤ C_0 exp(L k).

Simulation results
[Figures: simulations on the sphere and on rotations.]

Summary
We have proved that, when the data belong to a complete CAT(κ) metric space, and provided the initial values are close enough, the same algorithm makes sense and also converges linearly. We have checked that our results are consistent with simulations.
Optimal Transport (chaired by Jean-François Marcotorchino, Alfred Galichon)
In this paper we relate the Equilibrium Assignment Problem (EAP), which underlies several economic models, to a system of nonlinear equations that we call the "nonlinear Bernstein-Schrödinger system", which is well known in the linear case but whose nonlinear extension does not seem to have been studied. We apply this connection to derive an existence result for the EAP, and an efficient computational method.

TOPICS IN EQUILIBRIUM TRANSPORTATION
Alfred Galichon (NYU and Sciences Po)
GSI, École polytechnique, October 29, 2015

This talk
This talk is based on the following two papers:
• AG, Scott Kominers and Simon Weber (2015a). Costly Concessions: An Empirical Framework for Matching with Imperfectly Transferable Utility.
• AG, Scott Kominers and Simon Weber (2015b). The Nonlinear Bernstein-Schrödinger Equation in Economics, GSI proceedings.

Agenda
1. Economic motivation
2. The mathematical problem
3. Computation
4. Estimation

Section 1: Economic motivation

Motivation: a model of the labour market
Consider a very simple model of the labour market. Assume that a population of workers is characterized by their type x ∈ X, where X = R^d for simplicity. There is a distribution P over the workers, which is assumed to sum to one. A population of firms is characterized by their types y ∈ Y (say Y = R^d), and their distribution Q. It is assumed that there is the same total mass of workers and firms, so Q sums to one. Each worker must work for one firm; each firm must hire one worker. Let π(x, y) be the probability of observing a matched (x, y) pair. π should have marginals P and Q, which is denoted π ∈ M(P, Q).
Optimality
In the simplest case, the utility of a worker x working for a firm y at wage w(x, y) is
    α(x, y) + w(x, y),
while the corresponding profit of firm y is
    γ(x, y) − w(x, y).
In this case, the total surplus generated by a pair (x, y) is
    α(x, y) + w + γ(x, y) − w = α(x, y) + γ(x, y) =: Φ(x, y),
which does not depend on w (no transfer frictions). A central planner may thus like to choose the assignment π ∈ M(P, Q) so as to solve
    max_{π ∈ M(P,Q)} ∫ Φ(x, y) dπ(x, y).
But why would this be the equilibrium solution?

Equilibrium
The equilibrium assignment is determined by an important quantity: the wages. Let w(x, y) be the wage of employee x working for a firm of type y. Let the indirect surpluses of worker x and firm y be respectively
    u(x) = max_y {α(x, y) + w(x, y)},
    v(y) = max_x {γ(x, y) − w(x, y)},
so that (π, w) is an equilibrium when
    u(x) ≥ α(x, y) + w(x, y), with equality if (x, y) ∈ Supp(π),
    v(y) ≥ γ(x, y) − w(x, y), with equality if (x, y) ∈ Supp(π).
By summation, u(x) + v(y) ≥ Φ(x, y), with equality if (x, y) ∈ Supp(π).

The Monge-Kantorovich theorem of optimal transportation
One can show that the equilibrium outcome (π, u, v) is such that π is a solution of the primal Monge-Kantorovich optimal transportation problem
    max_{π ∈ M(P,Q)} ∫ Φ(x, y) dπ(x, y),
and (u, v) is a solution of the dual OT problem
    min_{u,v} ∫ u(x) dP(x) + ∫ v(y) dQ(y)  s.t.  u(x) + v(y) ≥ Φ(x, y).
Feasibility plus complementary slackness yield the desired equilibrium conditions:
    π ∈ M(P, Q),  u(x) + v(y) ≥ Φ(x, y),  (x, y) ∈ Supp(π) ⟹ u(x) + v(y) = Φ(x, y).
("Second welfare theorem", "invisible hand", etc.)

Equilibrium vs. optimality
Is equilibrium always the solution to an optimization problem? It is not.
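In the discrete uniform case, the primal Monge-Kantorovich problem above admits an optimal plan supported on a permutation (by the Birkhoff-von Neumann theorem), so tiny instances can simply be brute-forced. A toy illustration of ours:

```python
from itertools import permutations

def optimal_assignment(Phi):
    """Maximize the total surplus sum_i Phi[i][s(i)] over permutations s:
    with n workers and n firms of mass 1/n each, some optimal transport
    plan is supported on a permutation matrix."""
    n = len(Phi)
    best = max(permutations(range(n)),
               key=lambda s: sum(Phi[i][s[i]] for i in range(n)))
    value = sum(Phi[i][best[i]] for i in range(n)) / n  # masses are 1/n
    return best, value

# two workers, two firms: matching worker i with firm i is optimal here
Phi = [[1.0, 0.0],
       [0.0, 1.0]]
match, value = optimal_assignment(Phi)
```

Brute force is only for intuition; the computational method actually proposed in the talk is the entropic regularization discussed in Section 3.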
This is why this talk is about "Equilibrium Transportation", which contains, but is strictly more general than, "Optimal Transportation".

Imperfectly transferable utility
Consider the same setting as above, but instead of assuming that workers' and firms' payoffs are linear in surplus, assume
    u(x) = max_y {U_xy(w(x, y))},
    v(y) = max_x {V_xy(w(x, y))},
where U_xy(w) is nondecreasing and continuous, and V_xy(w) is nonincreasing and continuous. Motivation: taxes, decreasing marginal returns, risk aversion, etc. Of course, the optimal transportation case is recovered when U_xy(w) = α_xy + w and V_xy(w) = γ_xy − w.

Imperfectly transferable utility (cont'd)
For (u, v) ∈ R², let
    Ψ_xy(u, v) = min {t ∈ R : ∃w, u − t ≤ U_xy(w) and v − t ≤ V_xy(w)},
so that Ψ is nondecreasing in both variables, and (u, v) = (U_xy(w), V_xy(w)) for some w if and only if Ψ_xy(u, v) = 0. The optimal transportation case is recovered when Ψ_xy(u, v) = (u + v − Φ_xy)/2. As before, (π, w) is an equilibrium when
    u(x) ≥ U_xy(w(x, y)), with equality if (x, y) ∈ Supp(π),
    v(y) ≥ V_xy(w(x, y)), with equality if (x, y) ∈ Supp(π).
We have therefore that (π, u, v) is an equilibrium when Ψ_xy(u(x), v(y)) ≥ 0, with equality if (x, y) ∈ Supp(π).

Section 2: The mathematical problem

Equilibrium transportation: definition
We have therefore that (π, u, v) is an equilibrium outcome when
    π ∈ M(P, Q),  Ψ_xy(u(x), v(y)) ≥ 0,  (x, y) ∈ Supp(π) ⟹ Ψ_xy(u(x), v(y)) = 0.
Problem: existence of an equilibrium outcome?
This paper: yes, in the discrete case (X and Y finite), via entropic regularization.

Remark 1: link with Galois connections
As soon as Ψ_xy is strictly increasing in both variables, Ψ_xy(u, v) = 0 expresses as u = G_xy(v) and v = G_xy^{-1}(u), where the generating functions G_xy and G_xy^{-1} are decreasing and continuous. In this case, the relations
    u(x) = max_{y ∈ Y} G_xy(v(y))  and  v(y) = max_{x ∈ X} G_xy^{-1}(u(x))
generalize Legendre-Fenchel conjugacy. This pair of relations forms a Galois connection; see Singer (1997) and Noeldeke and Samuelson (2015).

Remark 2: Trudinger's local theory of prescribed Jacobians
Assuming everything is smooth, and letting f_P and f_Q be the densities of P and Q, we have under some conditions that the equilibrium transportation plan is given by y = T(x), where mass balance yields
    det DT(x) = f(x) / g(T(x)),
and optimality yields
    ∂_x G_{xT(x)}^{-1}(u(x)) + ∂_u G_{xT(x)}^{-1}(u(x)) ∇u(x) = 0,
which thus inverts into T(x) = e(x, u(x), ∇u(x)). Trudinger (2014) studies Monge-Ampère equations of the form
    det De(·, u, ∇u) = f / g(e(·, u, ∇u))
(more general than optimal transport, where there is no dependence on u).

The discrete case
Our work (GKW 2015a and b) focuses on the discrete case, when P and Q have finite support. Call p_x and q_y the mass of x ∈ X and y ∈ Y respectively. In the discrete case, the problem boils down to looking for (π, u, v) such that
    π_xy ≥ 0,  Σ_y π_xy = p_x,  Σ_x π_xy = q_y,
    Ψ_xy(u_x, v_y) ≥ 0,
    π_xy > 0 ⟹ Ψ_xy(u_x, v_y) = 0.

Section 3: Computation

Entropic regularization
Take a temperature parameter T > 0 and look for π of the form
    π_xy = exp(−Ψ_xy(u_x, v_y) / T).
Note that when T → 0, the limit of Ψ_xy(u_x, v_y) is nonnegative, and the limit of π_xy Ψ_xy(u_x, v_y) is zero.
THE NONLINEAR BERNSTEIN-SCHRÖDINGER EQUATION
If π_xy = exp(−Ψ_xy(u_x, v_y)/T), the condition π ∈ M(P, Q) boils down to a set of nonlinear equations in (u, v):
  ∑_{y∈Y} exp(−Ψ_xy(u_x, v_y)/T) = p_x,
  ∑_{x∈X} exp(−Ψ_xy(u_x, v_y)/T) = q_y,
which we call the nonlinear Bernstein-Schrödinger equation. In the optimal transportation case, this becomes the classical BS equation
  ∑_{y∈Y} exp((Φ_xy − u_x − v_y)/2T) = p_x,
  ∑_{x∈X} exp((Φ_xy − u_x − v_y)/2T) = q_y.

ALGORITHM
Note that F_x : u_x ↦ ∑_{y∈Y} exp(−Ψ_xy(u_x, v_y)/T) is a decreasing and continuous function. Mild conditions on Ψ therefore ensure the existence of u_x such that F_x(u_x) = p_x. Our algorithm is thus a nonlinear Jacobi algorithm:
- Make an initial guess v⁰_y.
- Determine the u^{k+1}_x that fit the p_x margins, based on the v^k_y.
- Update the v^{k+1}_y to fit the q_y margins, based on the u^{k+1}_x.
- Repeat until v^{k+1} is close enough to v^k.
One can prove that if v⁰_y is high enough, then the v^k_y decrease to a fixed point. Convergence is very fast in practice.

Section 4: STATISTICAL ESTIMATION

MAXIMUM LIKELIHOOD ESTIMATION
In practice, one observes π̂_xy and would like to estimate Ψ. Assume that Ψ belongs to a parametric family Ψ^θ, so that π^θ_xy = exp(−Ψ^θ_xy(u^θ_x, v^θ_y)) ∈ M(P, Q). The log-likelihood l(θ) associated to the observation π̂_xy is
  l(θ) = ∑_{xy} π̂_xy log π^θ_xy = −∑_{xy} π̂_xy Ψ^θ_xy(u^θ_x, v^θ_y),
and thus the maximum likelihood procedure consists in
  min_θ ∑_{xy} π̂_xy Ψ^θ_xy(u^θ_x, v^θ_y).
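The nonlinear Jacobi iteration above can be sketched in the optimal-transportation case Ψ_xy(u, v) = (u + v − Φ_xy)/2, where each margin-fitting step has a closed form (a log-sum-exp update); the surplus matrix, margins, and temperature below are illustrative choices:

```python
import numpy as np

# Nonlinear Jacobi iteration for the Bernstein-Schrodinger equation,
# specialized to Psi_xy(u,v) = (u + v - Phi_xy)/2 (illustrative data).
rng = np.random.default_rng(0)
n, m, T = 5, 6, 0.1
Phi = rng.normal(size=(n, m))        # surplus matrix Phi_xy
p = np.full(n, 1.0 / n)              # margins p_x
q = np.full(m, 1.0 / m)              # margins q_y

u, v = np.zeros(n), np.zeros(m)
for _ in range(5000):
    # fit the p_x margins given v, then the q_y margins given the new u
    u = 2*T*(np.log(np.exp((Phi - v[None, :])/(2*T)).sum(axis=1)) - np.log(p))
    v_new = 2*T*(np.log(np.exp((Phi - u[:, None])/(2*T)).sum(axis=0)) - np.log(q))
    if np.max(np.abs(v_new - v)) < 1e-12:
        v = v_new
        break
    v = v_new

pi = np.exp((Phi - u[:, None] - v[None, :]) / (2*T))
print(np.allclose(pi.sum(axis=1), p), np.allclose(pi.sum(axis=0), q))
```

Each half-step solves F_x(u_x) = p_x (resp. the q_y equations) exactly, which is why this coincides with the familiar Sinkhorn/IPFP scaling scheme in the TU case.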
This note presents a short review of the Schrödinger problem and of the first steps that might lead to interesting consequences in terms of geometry. We stress the analogies between this entropy minimization problem and the renowned optimal transport problem, in search of a theory of lower-bounded curvature for metric spaces, including discrete graphs.

Some geometric aspects of the Schrödinger problem
Christian Léonard, Université Paris Ouest. GSI'15, École Polytechnique, October 28-30, 2015.

Interpolations in P(X)
X: a Riemannian manifold (the state space); P(X): the set of all probability measures on X. Given μ0, μ1 ∈ P(X), we want to interpolate between μ0 and μ1.

The standard affine interpolation between μ0 and μ1 is
  μ_t^aff := (1 − t)μ0 + tμ1 ∈ P(X), 0 ≤ t ≤ 1.
[Slides show snapshots of μ_t^aff at t = 0, 0.25, 0.5, 0.75, 1.]
Affine interpolations require mass transference with infinite speed — a denial of the geometry of X. We need interpolations built upon trans-portation, not tele-portation.

We seek interpolations of this type. [Slides show a density bump travelling across X at t = 0, 0.25, 0.5, 0.75, 1.]

Displacement interpolation
Transport μ0 onto μ1 by a map y = T(x), and move each particle along the geodesic γ_t^xy from x to y.

Curvature
Geodesics and curvature are intimately linked: several geodesics give information on the curvature. For two unit-speed geodesics leaving p with angle θ,
  δ(t) = √(2(1 − cos θ)) · t · ( 1 − σ_p(S) (cos²(θ/2)/6) t² + O(t⁴) ),
where σ_p(S) is the sectional curvature at p.

Respect the geometry
We have already used geodesics. How should we choose y = T(x) so that the interpolations encrypt curvature as well as possible, with no shock? Perform optimal transport.
Monge's problem:
  ∫_X d²(x, T(x)) μ0(dx) → min; T : T#μ0 = μ1,
where d is the Riemannian distance.

Lazy gas experiment
[Slides: snapshots at t = 0, 0 < t < 1, t = 1, under positive and under negative curvature.]

Curvature and displacement interpolations
Relative entropy: H(p|r) := ∫ log(dp/dr) dp, for probability measures p, r.
Convexity of the entropy along displacement interpolations:
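The "teleportation" defect of affine interpolation can be seen numerically for two well-separated 1-D Gaussians: the affine mixture at t = 0.5 is bimodal (mass appears at the destination before any mass has travelled), while for equal-variance Gaussians the displacement interpolant is again a Gaussian whose mean moves. The densities and the evaluation point are illustrative choices, not from the slides:

```python
import numpy as np

# Affine vs displacement interpolation between N(0,1) and N(5,1) at t = 0.5.
x = np.linspace(-5, 10, 1501)

def gauss(x, m, s=1.0):
    return np.exp(-(x - m)**2 / (2*s**2)) / (s*np.sqrt(2*np.pi))

mu0, mu1 = gauss(x, 0.0), gauss(x, 5.0)
t = 0.5
affine = (1 - t)*mu0 + t*mu1   # mass appears at y before any mass moved there
displacement = gauss(x, (1 - t)*0.0 + t*5.0)  # the mean itself travels

def n_modes(f):
    """Count strict interior local maxima on the grid."""
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

print(n_modes(affine), n_modes(displacement))  # 2 1
```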
The following assertions are equivalent:
  (i) Ric ≥ K;
  (ii) along any displacement interpolation [μ0, μ1]^disp = (μ_t)_{0≤t≤1}, d²/dt² H(μ_t|vol) ≥ K W₂²(μ0, μ1),
(von Renesse-Sturm (2004)), where W₂ is the Wasserstein distance. This is the starting point of the Lott-Sturm-Villani theory.

Schrödinger's thought experiment
Consider a huge collection of noninteracting identical Brownian particles. If the density profile of the system at time t = 0 is approximately μ0 ∈ P(ℝ³), you expect it to evolve along the heat flow:
  ν_t = ν0 e^{tΔ/2}, 0 ≤ t ≤ 1, ν0 = μ0,
where Δ is the Laplace operator. Suppose that you observe the density profile of the system at time t = 1 to be approximately μ1 ∈ P(ℝ³), with μ1 different from the expected ν1. The probability of this rare event is ≃ exp(−C N_Avogadro).

Schrödinger's question (1931)
Conditionally on this very rare event, what is the most likely path (μ_t)_{0≤t≤1} ∈ P(ℝ³)^[0,1] of the evolving profile of the particle system?

Schrödinger problem
X: a compact Riemannian manifold; Ω := {paths} ⊂ X^[0,1]; P ∈ P(Ω), with time marginals (P_t)_{0≤t≤1} ∈ P(X)^[0,1]; R ∈ P(Ω): the Wiener measure (Brownian motion).
  (S): H(P|R) → min; P ∈ P(Ω) : P0 = μ0, P1 = μ1,
where μ0, μ1 ∈ P(X) are the prescribed initial and final profiles.
Definition (R-entropic interpolation): [μ0, μ1]^R := (P_t)_{0≤t≤1}, with P the unique solution of (S). It is the answer to Schrödinger's question.

Lazy gas experiments
At zero temperature (Monge): displacement interpolations, optimal transport. At positive temperature (Schrödinger): entropic interpolations, minimal entropy. [Slides: snapshots at t = 0, 0 < t < 1, t = 1 in negative curvature, at zero and at positive temperature.]

Slowing down
To decrease the temperature, slow down the particles of the heat bath. Slowed-down reference measures:
(W_t)_{t≥0}: Brownian motion on the Riemannian manifold X; R: the law of (W_t)_{0≤t≤1}; R^k: the law of (W_{t/k})_{0≤t≤1}, with k → ∞.
[Slides: for k = 1 the bridge R^{xy} fluctuates around the geodesic γ^{xy} between t = 0 and t = 1; for k = 10 the bridge R^{k,xy} concentrates; for k = ∞ only the geodesic γ^{xy} remains.]

N → ∞, k = 1: the whole particle system performs a rare event to travel from μ0 to μ1 — cooperative behavior; Gibbs conditioning principle (thermodynamical limit N → ∞).
N = 1, k → ∞: each individual particle faces a harder task and must travel along an approximate geodesic — individual behavior; large deviation principle (slowing-down limit k → ∞).
Slowing-down principle: the slowed-down sequence (R^k)_{k≥1} encodes some geometry. As N → ∞ and k → ∞, these two behaviors superpose.

Results
Results 1: displacement interpolations feel curvature; entropic interpolations also feel curvature.
Results 2: entropic interpolations converge to displacement interpolations (Γ-convergence); entropic interpolations regularize displacement interpolations.
Results 3: the same kind of results hold in other settings:
  1. discrete graphs — random walk;
  2. Finsler manifolds — jump process in a manifold (work in progress);
  3. interpolations with varying mass — branching process (work in progress).
Results 4: Schrödinger's problem is the analogue of Hamilton's least action principle. It allows for dynamical theories of diffusion processes and of random walks on graphs, and for a stochastic Newton equation in which acceleration is related to curvature.

References
Schrödinger (1931); Villani (the big yellow book on optimal transport); Zambrini (stochastic deformation of classical mechanics in the diffusion setting); Conforti and L. (preprint); Mikami (PTRF '04); L. (JFA '12, AoP '15).

Thank you for your attention.
This article leans on some previous results already presented in [10], based on Fréchet's works, Wilson's entropy, and Minimal Trade models in connection with the MKP transportation problem (MKP stands for Monge-Kantorovich Problem). Using the duality between "independence" and "indetermination" structures, established in this former paper, we are in a position to derive a novel approach to copula design, suitable and efficient for anomaly detection in IT systems analysis.

Optimal Transport, Independence versus Indetermination Duality: Impact on a New Copula Design
Benoit Huyot, Yves Mabiala — Thales Communications and Security, 29 October 2015.

Outline
1. Cybersecurity problem overview: current intrusion detection systems; anomaly-based IDS; IDS as a classification problem.
2. Properties of the copula function: history of copula theory; Sklar's theorem and Fréchet's bounds; regularity properties of the copula function.
3. Copula theory used in anomaly detection applications: classification; AUC within the copula paradigm; experimental results.

Current Intrusion Detection Systems
Rule-based approaches: suitable to detect previously known patterns; rules are easily understandable; easy addition of new rules. But: unable to detect unknown patterns.

Anomaly-based IDS
Anomaly-based approaches: suitable to detect unknown patterns. But: time-consuming to update the model; alerts are difficult to understand through existing tools; too many false alerts. Our approach is an attempt to overcome these problems.

Anomaly detection as a classification problem
Y is a binary random variable, where Y = 0 if the event is abnormal and Y = 1 otherwise.
p0 is the a priori attack probability, defined by p0 = P(Y ≤ 0). X represents the different characteristics of the network event. If X is a p-dimensional random vector, its cumulative distribution function is denoted F(x) = P(X1 ≤ x1, ..., Xp ≤ xp).

IDS as a classification problem
The scoring function is defined as P(Y = 0 | X = x). By definition,
  P(Y = 0 | X = x) = P(Y = 0, X = x) / P(X = x).
Anomalies are identified through the classical Bayes rule model. Empirical estimation is difficult due to the "curse of dimensionality"; joint probabilities are therefore computed using copula theory to ease computations.

Introduction to copula theory
Originated by M. Fréchet in 1951: Fréchet, M. (1951), "Sur les tableaux de corrélations dont les marges sont données", Annales de l'Université de Lyon, Section A, no. 14, 53-77. A. Sklar provided a breakthrough in 1959: Sklar, A. (1959), "Fonctions de répartition à n dimensions et leurs marges", Publ. Inst. Statist. Univ. Paris 8: 229-231.

Main results on the copula function
Theorem (Sklar's theorem). Given two continuous random variables X and Y in L¹, with cumulative distribution functions F and G,
there exists a unique function C, called the copula, such that
  P(X ≤ x, Y ≤ y) = C(F(x), G(y)).

Theorem (Fréchet-Hoeffding bounds). Given a copula function C, for all (u, v) ∈ [0, 1]² we have the Fréchet bounds:
  max(u + v − 1, 0) ≤ C(u, v) ≤ min(u, v).

Regularity properties of the copula function
2-increasing property (Monge's conditions). Partitioning the rectangle [u1, v1] × [u2, v2] into areas A, B, C, D with
  D = C(u1, u2), B + D = C(u1, v2), D + C = C(v1, u2), A + B + C + D = C(v1, v2),
we get A = C(v1, v2) − C(u1, v2) − C(v1, u2) + C(u1, u2) and A ≥ 0. Hence for all (u1, v1) with 0 ≤ u1 ≤ v1 ≤ 1 and all (u2, v2) with 0 ≤ u2 ≤ v2 ≤ 1:
  C(v1, v2) − C(u1, v2) − C(v1, u2) + C(u1, u2) ≥ 0.

The copula is a Hölderian function. With B + C + E = C(u2, v2) − C(u1, v1), A + C + E = C(u2, 1) − C(u1, 1), B + C + D = C(v2, 1) − C(v1, 1), and B + C + E ≤ (B + C + D) + (A + C + E), we obtain a 1-Hölder (Lipschitz) condition for the copula C: for all (u1, v1, u2, v2) ∈ [0, 1]⁴,
  |C(u2, v2) − C(u1, v1)| ≤ |u2 − u1| + |v2 − v1|.

Copula theory used in anomaly detection
Only infrequent events can have a score greater than 1/2: looking for attacks amounts to looking for rare events. The Fréchet bounds give
  P(Y = 0 | X) ≤ min(P(X), P(Y = 0)) / P(X),
and we get: P(Y = 0 | X) ≥ 1/2 ⟹ P(X) ≤ 2 P(Y = 0).

Lower bound for anomaly detection
The "lower tail dependence" is defined as
  λ_L = lim_{v→0} C(v, v)/v,
and one can show the limit bound λ_L ≤ lim_{v→0} C(u, v)/v.
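The Fréchet-Hoeffding bounds and the 1-Lipschitz property can be checked numerically for a concrete copula; the independence copula C(u, v) = uv below is an illustrative choice, not one used in the talk:

```python
import numpy as np

# Check the Frechet-Hoeffding bounds and the 1-Lipschitz property
# for the independence copula C(u,v) = u*v (illustrative choice).
def C(u, v):
    return u * v

u = np.linspace(0, 1, 101)
U, V = np.meshgrid(u, u)
W = np.maximum(U + V - 1, 0)   # lower Frechet bound
M = np.minimum(U, V)           # upper Frechet bound
vals = C(U, V)
print(np.all(W <= vals + 1e-12) and np.all(vals <= M + 1e-12))  # True

# 1-Lipschitz: |C(u2,v2) - C(u1,v1)| <= |u2-u1| + |v2-v1|
rng = np.random.default_rng(1)
p1, p2 = rng.random((2, 1000, 2))
lhs = np.abs(C(p2[:, 0], p2[:, 1]) - C(p1[:, 0], p1[:, 1]))
rhs = np.abs(p2[:, 0] - p1[:, 0]) + np.abs(p2[:, 1] - p1[:, 1])
print(np.all(lhs <= rhs + 1e-12))  # True
```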
Variation of the score function
We want to study the variation of v ↦ C(u, v)/v on [0, 2p0]:
  (1/v²) [ v ∂C/∂v(u, v) − C(u, v) ] ≤ 0
  ⟺ ∂C/∂v(u, v) ≤ C(u, v)/v  (link to convexity)
  ⟺ v ∂/∂v log C(u, v) ≤ 1  (link to Fisher's information).

ROC curve and AUC
Sensitivity (true positive rate): P_D(s) = C(p0, s)/p0.
1 − Specificity (anti-specificity, false positive rate): P_F(s) = (s/(1 − p0))(1 − C(p0, s)/s).
Then
  AUC = 1/(2p0(1 − p0)) [ 1 − p0² − ∫₀¹ (C(p0, s) − 1)² ds ].
In the case of a bivariate random vector X, we get
  AUC = K1(p0) − K2(p0) ∫₀¹∫₀¹ (C2(s1, s2) − 1)² (∂²C2/∂s1∂s2)(s1, s2) ds1 ds2.

Optimal transport problem
In the Monge-Kantorovich problem we want to minimize
  min_h ∫₀^A ∫₀^B ( h(x, y) − 1/(AB) )² dx dy
under the constraints:
  (1) ∫₀^A ∫₀^B h(x, y) dx dy = 1; (2) ∫₀^A h(x, y) dx = g(y); (3) ∫₀^B h(x, y) dy = f(x).
The solution is
  h*(x, y) = f(x)/B + g(y)/A − 1/(AB),
and the cumulative distribution function associated to the solution is
  H*(x, y) = y F(x)/B + x G(y)/A − xy/(AB).

Algorithm principle: [diagram omitted].

Experimental results
Quantile levels used for the copula benchmark:

Quantile level | 10⁻⁴ | 5·10⁻⁴ | 10⁻³ | 5·10⁻³ | 10⁻²
Optimal Transport copula — detection rate | 18.64% | 73.86% | 74.32% | 74.82% | 75.09%
Optimal Transport copula — false alarm rate | 23.15% | 2.32% | 4.38% | 3.72% | 4.71%
Clayton copula — detection rate | 0.0% | 0.0% | 19.28% | 71.73% | 79.86%
Clayton copula — false alarm rate | 0.0% | 0.0% | 0.63% | 36.76% | 34.20%
Fréchet upper bound copula — detection rate | 30.35% | 31.39% | 32.73% | 36.93% | 79.11%
Fréchet upper bound copula — false alarm rate | 41.26% | 38.68% | 31.89% | 27.48% | 27.95%

Thanks for your attention!

Appendix: link to Fisher's information
We use the equation
  (v/C(u, v)) ∂C/∂v(u, v) = v ∂/∂v log C(u, v).
This quantity is the statistical score, and its variance gives the Fisher information.

Appendix: sensitivity
Sensitivity represents how many events are correctly assigned to anomalies: P(Ŷ = 0 | Y = 0), with Ŷ = 0 when F(X) ≤ s for a given threshold s, i.e. when X ∈ F⁻¹([0, s]). Then
  P(Ŷ = 0 | Y = 0) = P(Y = 0, Ŷ = 0)/P(Y = 0) = P(Y = 0, X ≤ F_X⁻¹(s))/P(Y = 0) = C(p0, s)/p0.

Appendix: specificity / anti-specificity
Anti-specificity represents how many misclassifications the algorithm produces. Specificity: P(Ŷ = 1 | Y = 1), with Ŷ = 1 when F(X) ≥ s, i.e. X ∈ F⁻¹([s, 1]). Using the survival copula,
  1 − P(Ŷ = 1 | Y = 1) = P(Ŷ = 0 | Y > 0) = [P(Ŷ = 0)/P(Y > 0)] P(Y > 0 | Ŷ = 0) = (s/(1 − p0))(1 − C(p0, s)/s).

Appendix: area under the ROC curve (AUC)
  AUC = ∫₀¹ P_D(P_F) dP_F.
Using integration by substitution, AUC = ∫₀¹ P_D(s) (∂P_F/∂s)(s) ds, with P_D and P_F as above, so that
  AUC = 1/(p0(1 − p0)) ∫₀¹ [ C(p0, s) − C(p0, s)² − s C(p0, s) C′(p0, s) ] ds.

Appendix: AUC simplification
An integration by parts gives
  A3 = −[ s C²(p0, s)/2 ]₀¹ + (1/2) ∫₀¹ C(p0, s)² ds = −p0²/2 + (1/2) ∫₀¹ C(p0, s)² ds,
hence
  AUC = 1/(p0(1 − p0)) ∫₀¹ [ C(p0, s) − C(p0, s)²/2 ] ds − p0/(2(1 − p0)).
Using X − X²/2 = −(X² − 2X + 1)/2 + 1/2, it comes:
  AUC = 1/(2p0(1 − p0)) [ 1 − p0² − ∫₀¹ (C(p0, s) − 1)² ds ].

Appendix: AUC in the bivariate case
Using the Fréchet-Hoeffding upper bound and the lower tail dependence we get
  ∫₀¹ (λ_L s − 1)² ds ≤ ∫₀¹ (C(p0, s) − 1)² ds ≤ ∫₀¹ (min(p0, s) − 1)² ds.
It comes:
  K + λ_L² ∫₀¹ (s − 1)² ds ≤ ∫₀¹ (C(p0, s) − 1)² ds ≤ ∫₀¹ (s − 1)² ds.
If X is a bivariate random vector:
  ∫₀¹ (s − 1)² ds = ∫₀¹∫₀¹ (C2(s1, s2) − 1)² (∂²C2/∂s1∂s2)(s1, s2) ds1 ds2.
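The minimal-trade solution H*(x, y) = yF(x)/B + xG(y)/A − xy/(AB) can be evaluated numerically and checked for copula-style properties. A sketch on A = B = 1; the margins F, G are illustrative choices, picked so that the corresponding density f(x) + g(y) − 1 is nonnegative on the square (as required for H* to be 2-increasing):

```python
import numpy as np

# Minimal-trade / indetermination solution H*(x,y) = yF(x)/B + xG(y)/A - xy/(AB)
# on A = B = 1, with illustrative margins F, G (densities f = 0.5 + x,
# g = 1.5 - y, so that f(x) + g(y) - 1 >= 0 on [0,1]^2).
A = B = 1.0
F = lambda x: 0.5*x + 0.5*x**2
G = lambda y: 1.5*y - 0.5*y**2

def H(x, y):
    return y*F(x)/B + x*G(y)/A - x*y/(A*B)

xs = np.linspace(0, 1, 201)
# margins: H*(x, 1) = F(x) and H*(1, y) = G(y)
print(np.allclose(H(xs, 1.0), F(xs)), np.allclose(H(1.0, xs), G(xs)))

# 2-increasing (Monge) condition on every grid cell
X1, Y1 = np.meshgrid(xs[:-1], xs[:-1], indexing="ij")
X2, Y2 = np.meshgrid(xs[1:], xs[1:], indexing="ij")
rect = H(X2, Y2) - H(X1, Y2) - H(X2, Y1) + H(X1, Y1)
print(np.all(rect >= -1e-12))
```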
Optimal Mass Transport over Bridges
Michele Pavon, Department of Mathematics, University of Padova, Italy. GSI'15, Paris, October 29, 2015. Joint work with Yongxin Chen and Tryphon Georgiou, Department of Electrical and Computer Engineering, University of Minnesota.

[Figure: a Venetian Schrödinger bridge.]

Dynamic version of OMT
"Fluid-dynamic" version of OMT (Benamou and Brenier (2000)):
  inf_{(ρ,v)} ∫_{ℝⁿ} ∫₀¹ (1/2)‖v(x,t)‖² ρ(x,t) dt dx,  (1a)
  ∂ρ/∂t + ∇·(vρ) = 0,  (1b)
  ρ(x,0) = ρ0(x), ρ(y,1) = ρ1(y).  (1c)

Proposition 1. Let ρ*(x,t), with t ∈ [0,1] and x ∈ ℝⁿ, satisfy
  ∂ρ*/∂t + ∇·(∇ψ ρ*) = 0, ρ*(x,0) = ρ0(x),
where ψ is the (viscosity) solution of the Hamilton-Jacobi equation
  ∂ψ/∂t + (1/2)‖∇ψ‖² = 0
for some boundary condition ψ(x,1) = ψ1(x). If ρ*(x,1) = ρ1(x), then the pair (ρ*, v*) with v*(x,t) = ∇ψ(x,t) is optimal for (1).

Schrödinger's Bridges
• Cloud of N independent Brownian particles;
• empirical distributions ρ0(x)dx and ρ1(y)dy at t = 0 and t = 1, respectively;
• ρ0 and ρ1 not compatible with the transition mechanism
  ρ1(y) = ∫ p(t0, x, t1, y) ρ0(x) dx,
where p(s, y, t, x) = [2π(t − s)]^{−n/2} exp( −‖x − y‖²/(2(t − s)) ), s < t.
The particles have been transported in an unlikely way (N large). Schrödinger (1931): Of the many unlikely ways in which this could have happened, which one is the most likely?

Schrödinger's Bridges (cont'd)
Schrödinger: the solution (the bridge from ρ0 to ρ1 over Brownian motion) has at each time a density ρ that factors as ρ(x,t) = φ(x,t) φ̂(x,t), where φ and φ̂ solve Schrödinger's system
  φ(x,t) = ∫ p(t, x, 1, y) φ(y,1) dy, with φ(x,0) φ̂(x,0) = ρ0(x),
  φ̂(x,t) = ∫ p(0, y, t, x) φ̂(y,0) dy, with φ(x,1) φ̂(x,1) = ρ1(x).
Föllmer 1988: this is a problem of large deviations of the empirical distribution on path space, connected through Sanov's theorem to a maximum entropy problem. Existence and uniqueness for Schrödinger's system: Fortet 1940, Beurling 1960, Jamison 1974/75, Föllmer 1988.
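A discretized Schrödinger system can be solved by the classical alternating (Fortet-type, Sinkhorn-like) iteration: propagate φ̂ forward with the heat kernel, couple to ρ1, propagate φ backward, couple to ρ0. The grid, diffusivity, and Gaussian marginals below are illustrative choices, not taken from the slides:

```python
import numpy as np

# Discretized Schrodinger system: find phi, phihat with
# phi(.,0)*phihat(.,0) = rho0 and phi(.,1)*phihat(.,1) = rho1,
# where propagation uses the heat kernel p (illustrative setup).
x = np.linspace(-4, 4, 201)
dx = x[1] - x[0]
eps = 0.25                                     # kernel variance over [0, 1]
P = np.exp(-(x[:, None] - x[None, :])**2 / (2*eps)) / np.sqrt(2*np.pi*eps)

def density(mean, std):
    d = np.exp(-(x - mean)**2 / (2*std**2))
    return d / (d.sum() * dx)

rho0, rho1 = density(-1.5, 0.5), density(1.5, 0.5)

phihat0 = np.ones_like(x)                      # initial guess
for _ in range(2000):
    phihat1 = (P.T @ phihat0) * dx             # propagate phihat to t = 1
    phi1 = rho1 / phihat1                      # enforce phi1 * phihat1 = rho1
    phi0 = (P @ phi1) * dx                     # propagate phi back to t = 0
    phihat0 = rho0 / phi0                      # enforce phi0 * phihat0 = rho0

# the induced coupling has (approximately) the prescribed marginal densities
pi = phihat0[:, None] * P * phi1[None, :] * dx
print(np.allclose(pi.sum(axis=1), rho0), np.allclose(pi.sum(axis=0), rho1))
```

This is the same contraction-mapping scheme that the Georgiou-Pavon reference below analyzes via the Hilbert metric.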
Schrödinger's Bridges as a control problem
The maximum entropy formulation of the Schrödinger bridge problem (SBP) with "prior" P is:
  minimize H(Q, P) = E_Q[log dQ/dP] over Q ∈ D(ρ0, ρ1),
where D(ρ0, ρ1) is the family of distributions on Ω := C([0,1], ℝⁿ) that are equivalent to the stationary Wiener measure W = ∫ W_x dx and have the prescribed marginals ρ0, ρ1. Thanks to Girsanov's theorem, it can be turned into a stochastic control problem (Blaquière, Dai Pra, M.P.-Wakolbinger, Filliger-Hongler-Streit, ...) with fluid-dynamic counterpart (P = W):
  inf_{(ρ,v)} ∫_{ℝⁿ} ∫₀¹ (1/2)‖v(x,t)‖² ρ(x,t) dt dx,
  ∂ρ/∂t + ∇·(vρ) − (ε/2)Δρ = 0,
  ρ(x,0) = ρ0(x), ρ(y,1) = ρ1(y).

Alternative time-symmetric fluid-dynamic formulation of SBP
When the prior is the stationary Wiener measure W with variance ε:
  inf_{(ρ,v)} ∫_{ℝⁿ} ∫₀¹ [ (1/2)‖v(x,t)‖² + (ε²/8)‖∇ log ρ(x,t)‖² ] ρ(x,t) dt dx,
  ∂ρ/∂t + ∇·(vρ) = 0, ρ(0,x) = ρ0(x), ρ(1,y) = ρ1(y).
With respect to the Benamou-Brenier problem, there is an extra term given by a Fisher information functional integrated over time. This answers at once a question posed by Eric Carlen in 2006 while investigating connections between OMT and Nelson's stochastic mechanics.

Schrödinger's Bridges and OMT
- Can we use SBP as a regular approximation of OMT? Yes: Mikami 2004, Mikami-Thieullen 2006, 2008, Léonard 2012.
- Is this useful to compute the solution to OMT? Problems:
  1. The solution to the control formulation of SBP is not given in implementable form.
  2. The control formulation of SBP exists only for nondegenerate diffusions with control and noise entering through the same channel (this excludes most engineering applications).
  3. No steady-state theory.
  4. No OMT problem with a nontrivial prior.

Gauss-Markov processes
Problem 1. Find a control u minimizing
  J(u) := E[ ∫₀¹ u(t)·u(t) dt ],
among those which achieve the transfer
  dX_t = A(t)X_t dt + B(t)u(t) dt + B1(t) dW_t, X0 ∼ N(0, Σ0), X1 ∼ N(0, Σ1).
Engineering applications: swarms of robots, shaping the bulk magnetization distribution in NMR spectroscopy and imaging, industrial process control, ...
If the pair (A, B) is controllable (for constant A and B, this amounts to [B, AB, ..., A^{n−1}B] having full row rank), the problem is always feasible (a highly nontrivial fact: the control may be "handicapped" with respect to the effect of the noise).

Gauss-Markov processes (cont'd)
Problem 2. Find u = −Kx which minimizes J_power(u) := E{u·u} and such that
  dx(t) = (A − BK)x(t) dt + B1 dw(t)
has
  ρ(x) = (2π)^{−n/2} det(Σ)^{−1/2} exp( −(1/2) x'Σ^{−1}x )
as invariant probability density. The problem may not have a solution (not all values of Σ can be maintained by state feedback). Previous contributions: Beghi (1996, 1997), Grigoriadis-Skelton (1997), Brockett (2007, 2012), Vladimirov-Petersen (2010, 2015).

Gauss-Markov processes (cont'd)
Sufficient conditions for optimality in terms of a system of two matrix Riccati equations (Lyapunov equations if B = B1) in the finite-horizon case:
  Π̇ = −A'Π − ΠA + ΠBB'Π,
  Ḣ = −A'H − HA − HBB'H + (Π + H)(BB' − B1B1')(Π + H),
with boundary conditions Σ0^{−1} = Π(0) + H(0) and Σ_T^{−1} = Π(T) + H(T).

Gauss-Markov processes (cont'd)
In the stationary case, the conditions are algebraic:
  rank [ AΣ + ΣA' + B1B1'  B ; B'  0 ] = rank [ 0  B ; B'  0 ].
Optimal controls may be computed via semidefinite programming in both cases.
- Y. Chen, T. T. Georgiou and M. Pavon, Optimal steering of a linear stochastic system to a final probability distribution, Part I, Aug. 2014, arXiv:1408.2222v1, IEEE Trans. Aut. Control, to appear.
- Y. Chen, T. T. Georgiou and M. Pavon, Optimal steering of a linear stochastic system to a final probability distribution, Part II, Oct. 2014, arXiv:1410.3447v1, IEEE Trans. Aut. Control, to appear.

Cooling
Two problems:
• efficient asymptotic steering of a system of stochastic oscillators to a desired steady state ρ̄;
• efficient steering of the system from an initial condition ρ0 to ρ̄ at finite time t = 1.
In both cases we obtain the solution for a general system of nonlinear stochastic oscillators by extending Schrödinger bridge theory.
- Y. Chen, T. T. Georgiou and M. Pavon, Fast cooling for a system of stochastic oscillators, arXiv:1411.1323v2, J. Math. Phys., Nov. 2015.

OMT with "prior"
  inf_{(ρ,v)} ∫_{ℝⁿ} ∫₀¹ (1/2)‖v(x,t) − v_p(x,t)‖² ρ(x,t) dt dx,  (2a)
  ∂ρ/∂t + ∇·(vρ) = 0,  (2b)
  ρ(x,0) = ρ0(x), ρ(y,1) = ρ1(y).  (2c)

Proposition 2. Let ρ*(x,t), with t ∈ [0,1] and x ∈ ℝⁿ, satisfy
  ∂ρ*/∂t + ∇·[(v_p + ∇ψ)ρ*] = 0, ρ*(x,0) = ρ0(x),
where ψ is the (viscosity) solution of the Hamilton-Jacobi equation
  ∂ψ/∂t + v_p·∇ψ + (1/2)‖∇ψ‖² = 0
for the boundary condition ψ(x,1) = ψ1(x). If ρ*(x,1) = ρ1(x), then the pair (ρ*, v*) with v*(x,t) = v_p(x,t) + ∇ψ(x,t) is optimal for (2).

OMT with "prior" (cont'd)
The problem is still in the classical OMT framework
  inf_{π ∈ Π(μ,ν)} ∫_{ℝⁿ×ℝⁿ} c(x, y) dπ(x, y),
with
  c(x, y) = inf_{x(·) ∈ X_{xy}} ∫₀¹ L(t, x(t), ẋ(t)) dt, L(t, x, ẋ) = ‖ẋ − v_p(x, t)‖².
Many results in OMT hold only for c(x, y) = c(x − y) strictly convex, originating from a Lagrangian L(t, x, ẋ) = c(ẋ). We are also interested in
  inf_{(ρ,u)} ∫_{ℝⁿ} ∫₀¹ (1/2)‖u(x,t)‖² ρ(x,t) dt dx,  (3a)
  ∂ρ/∂t + ∇·((v_p(x,t) + B(t)u(x,t))ρ) = 0,  (3b)
  ρ(x,0) = ρ0(x), ρ(y,1) = ρ1(y).  (3c)

OMT and SBP
Mikami (2004) and Léonard (2012) show, when the prior is W, that as the diffusion coefficient tends to zero, OMT is the Γ-limit of SBP; hence infima converge and minimizers converge. In the Gaussian case, we show directly the convergence of the solution of the Hamilton-Jacobi-Bellman equation to the solution of the Hamilton-Jacobi equation, also in the case with prior.
- Y. Chen, T. T. Georgiou and M. Pavon, On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint, Dec. 2014, arXiv:1412.4430v1, J. Opt. Th. Appl., DOI: 10.1007/s10957-015-0803-z.
- Y. Chen, T. T. Georgiou and M. Pavon, Optimal transport over a linear dynamical system, Feb. 2015, arXiv:1502.01265v1.
OMT and SBP: Example
The Smoluchowski model for highly overdamped planar Brownian motion in a force field is, in a strong sense, the high-friction limit of the full Ornstein-Uhlenbeck model in phase space:
  dX_t = −∇V(X_t) dt + √ε dW_t, −∇V(x) = Ax, A = [−3, 0; 0, −3],
  m0 = (5, 5)', Σ0 = I, m1 = (−5, −5)', Σ1 = I.
[Figures: the transparent tubes represent the "3σ region" (x − m_t)'Σ_t^{−1}(x − m_t) ≤ 9, for interpolations based on the Schrödinger bridge with ε = 9, ε = 4, ε = 0.01, and on optimal transport with prior.]

OMT and SBP in general
How can we effectively compute the solution of the SBP in the general non-Gaussian case? In T. T. Georgiou and M. Pavon, Positive contraction mappings for classical and quantum Schrödinger systems, May 2014, arXiv:1405.6650v2, J. Math. Phys., 56, 033301, March 2015: efficient iterative techniques to solve the Schrödinger system for Markov chains and for Kraus maps of statistical quantum mechanics, based on the Garrett Birkhoff (1957)-Bushell (1973) theorem.

Application to general OMT
- Y. Chen, T. Georgiou and M. Pavon, Entropic and displacement interpolation: a computational approach using the Hilbert metric, June 2015, arXiv:1506.04255v1, submitted for publication.
Application to the interpolation of 2-D images to obtain a 3-D model. [Figures: MRI slices at two different points, t = 0 and t = 1, and interpolations at t = 0.2, 0.4, 0.6, 0.8 with ε = 0.01.]

THANK YOU FOR YOUR ATTENTION!

References
• T. T. Georgiou and M. Pavon, Positive contraction mappings for classical and quantum Schrödinger systems, arXiv:1405.6650v2, J. Math. Phys., 56, 033301, March 2015.
• Y. Chen and T. T. Georgiou, Stochastic bridges of linear systems, arXiv:1407.3421, IEEE Trans. Aut. Control, to appear.
• Y. Chen, T. T. Georgiou and M. Pavon, Optimal steering of a linear stochastic system to a final probability distribution, Aug. 2014, arXiv:1408.2222v1, IEEE Trans. Aut. Control, to appear.
• Y. Chen, T. Georgiou and M. Pavon, Optimal steering of inertial particles diffusing anisotropically with losses, arXiv:1410.1605v1, Oct. 7, 2014, American Control Conf. 2015.
• Y. Chen, T. T. Georgiou and M. Pavon, Optimal steering of a linear stochastic system to a final probability distribution, Part II, Oct. 2014, arXiv:1410.3447v1, IEEE Trans. Aut. Control, to appear.
• Y. Chen, T. T. Georgiou and M. Pavon, Fast cooling for a system of stochastic oscillators, Nov. 2014, arXiv:1411.1323v2, J. Math. Phys., Nov. 2015.
• Y. Chen, T. T. Georgiou and M. Pavon, On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint, Dec. 2014, arXiv:1412.4430v1, JOTA, to appear.
• Y. Chen, T. T. Georgiou and M. Pavon, Optimal transport over a linear dynamical system, Feb. 2015, arXiv:1502.01265v1.

References (cont'd)
• Y. Chen, T. T. Georgiou and M. Pavon, Optimal mass transport over bridges, arXiv:1503.00215v1, Feb. 28, 2015, GSI'15 Conf.
• Y. Chen, T. Georgiou and M. Pavon, Steering state statistics with output feedback, arXiv:1504.00874v1, April 3, 2015, Proc. CDC 2015 (to appear).
• Y. Chen, T. T. Georgiou and M. Pavon, Optimal control of the state statistics for a linear stochastic system, arXiv:1503.04885v1, March 17, 2015, Proc. CDC 2015 (to appear).
• Y. Chen, T. T. Georgiou and M. Pavon, Entropic and displacement interpolation: a computational approach using the Hilbert metric, June 2015, arXiv:1506.04255v1, submitted for publication.

Markovian prior P with Nelson's current velocity field v_P(x, t)
  Minimize_{(ρ,v)} ∫_{t0}^{t1} ∫_{ℝᴺ} [ (1/(2σ²))‖v(x,t) − v_P(x,t)‖² + (σ²/8)‖∇ log (ρ/ρ^P)(x,t)‖² ] ρ(x,t) dx dt,
  ∂ρ/∂t + ∇·(vρ) = 0, ρ_{t0} = ρ0, ρ_{t1} = ρ1.
Comparison with OMT with prior: there is here an extra term in the action functional, which has the form of a relative Fisher information of ρ_t with respect to the prior one-time density ρ^P_t (a Dirichlet form), integrated over time.
Decomposition of relative entropy:

  H(Q, P) = E_Q[ log dQ/dP ]
          = E_Q[ log (dµ_Q/dµ_P)(x(0), x(1)) ] + E_Q[ log (dQ^{x(0),x(1)}/dP^{x(0),x(1)})(x) ]
          = ∫ log (dµ_Q/dµ_P) dµ_Q + ∫ [ ∫ log (dQ_x^y/dP_x^y) dQ_x^y ] µ_Q(dx, dy),

where Q_x^y and P_x^y denote the bridges of Q and P pinned at x(0) = x and x(1) = y. Thus, the problem reduces to minimizing ∫ log(dµ_Q/dµ_P) dµ_Q subject to the (linear) constraints
  µ_Q(dx × ℝⁿ) = ρ₀(x) dx,  µ_Q(ℝⁿ × dy) = ρ₁(y) dy.
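In the discrete setting, the static problem just described (minimize the relative entropy of the endpoint coupling subject to the two marginal constraints) is solved by a diagonal-scaling iteration of the kind studied in the Georgiou–Pavon paper cited above, which is a contraction in Hilbert's metric. A minimal sketch on a finite state space, assuming the prior coupling is given as a positive matrix K; the kernel and marginals below are illustrative, not taken from the talk:

```python
import numpy as np

def schroedinger_bridge(K, rho0, rho1, n_iter=500):
    """Static Schroedinger bridge on a finite state space.

    K[i, j] > 0 is the prior coupling of (x(0), x(1)).  The entropy
    minimizer has the diagonal-scaling form
        Q[i, j] = phi0[i] * K[i, j] * phi1[j],
    and the scalings are found by iterated proportional fitting
    (Fortet / Sinkhorn iteration)."""
    phi0 = np.ones(K.shape[0])
    phi1 = np.ones(K.shape[1])
    for _ in range(n_iter):
        phi0 = rho0 / (K @ phi1)       # match the initial marginal
        phi1 = rho1 / (K.T @ phi0)     # match the final marginal
    return phi0[:, None] * K * phi1[None, :]

# Illustrative prior: lazy random walk on 3 states with uniform initial law.
K = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]]) / 3.0
rho0 = np.array([0.5, 0.3, 0.2])
rho1 = np.array([0.2, 0.3, 0.5])
Q = schroedinger_bridge(K, rho0, rho1)
print(Q.sum(axis=1))   # -> rho0
print(Q.sum(axis=0))   # -> rho1
```

For strictly positive K the iteration converges geometrically, which is exactly the Birkhoff–Bushell contraction argument invoked above.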
Information Geometry in Image Analysis (chaired by Yannick Berthoumieu, Geert Verdoolaege)
This paper introduces new prior distributions on the zero-mean multivariate Gaussian model, with the aim of applying them to the classification of populations of covariance matrices. These new prior distributions are entirely based on the Riemannian geometry of the multivariate Gaussian model. More precisely, the proposed Riemannian Gaussian distribution has two parameters, the centre of mass Ȳ and the dispersion parameter σ. Its density with respect to the Riemannian volume is proportional to exp(−d²(Y; Ȳ)/2σ²), where d²(Y; Ȳ) is the square of Rao's Riemannian distance. We derive its maximum likelihood estimators and propose an experiment on the VisTex database for the classification of texture images.
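For concreteness, Rao's Riemannian distance on the zero-mean multivariate Gaussian model (equivalently, the affine-invariant distance between SPD covariance matrices) can be computed from the generalized eigenvalues of a pair of matrices. A small sketch; the matrices A and B are illustrative:

```python
import numpy as np
from scipy.linalg import eigvalsh

def rao_distance(Y1, Y2):
    """Affine-invariant Rao distance between SPD matrices:
    d(Y1, Y2)^2 = sum_i (log lambda_i)^2, where the lambda_i are the
    eigenvalues of Y1^{-1} Y2 (generalized eigenvalues of (Y2, Y1))."""
    lam = eigvalsh(Y2, Y1)           # all positive for SPD inputs
    return np.sqrt(np.sum(np.log(lam) ** 2))

A = np.array([[2.0, 0.0], [0.0, 2.0]])
B = np.eye(2)
print(rao_distance(A, B))   # sqrt(2) * log(2), approx. 0.980
```

The distance is symmetric and invariant under congruence Y ↦ MᵀYM, which is why the resulting Riemannian Gaussian density depends on Y only through d(Y; Ȳ).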

Geometric Science of Information 2015
Non-supervised classification in the space of SPD matrices
Salem Said, Lionel Bombrun, Yannick Berthoumieu
Laboratoire IMS, CNRS UMR 5218, Université de Bordeaux
29 October 2015

Context of our work
- Our project: statistical learning in the space of SPD matrices
- Our team: 3 members of the IMS laboratory + 2 postdocs (Hatem Hajri, Paolo Zanini)
- Target applications: remote sensing, radar signal processing, neuroscience (BCI)
- Our partners: IMB (Marc Arnaudon + PhD student), Gipsa-lab, École des Mines
- Our recent work: http://arxiv.org/abs/1507.01760, Riemannian Gaussian distributions on the space of SPD matrices (in review, IEEE IT)
- Some of our problems: given a population of SPD matrices (any size or structure), non-supervised learning of its class structure; semi-parametric learning of its density
- Please look up our paper on arXiv :)

Geometric tools
- Statistical manifold: Θ = SPD, Toeplitz, Block-Toeplitz, etc., matrices
- Hessian or Fisher metric: ds²(θ) = Hess Φ(dθ, dθ), with Φ the model entropy; Θ becomes a Riemannian homogeneous space of negative curvature!
- Example: 2 × 2 correlation ("baby Toeplitz"):
  Θ = { [[1, θ], [θ*, 1]] : |θ| < 1 },  Φ(θ) = −log[1 − |θ|²]  ⇒  ds²(θ) = |dθ|² / [1 − |θ|²]²  (Poincaré disc model)
- Why do we use this? Suitable mathematical properties; relation to entropy or "information"; often leads to excellent performance

First place in the IEEE BCI challenge

Contribution I: Introduction of Riemannian Gaussian distributions
- A statistical model of a class/cluster [Pennec 2006]:
  p(θ | θ̄, σ) = Z⁻¹(σ) exp( −d²(θ, θ̄) / 2σ² )
  where d(θ, θ̄) is the Riemannian distance and the expression of Z(σ) was unknown in the literature.
- Computing Z(σ):
  Z(σ) = ∫_Θ exp( −d²(θ, θ̄) / 2σ² ) dv(θ),
  d²(θ, θ̄) = tr[ (log(θ⁻¹θ̄))² ],
  dv(θ) = det(θ)^{−(m+1)/2} ∏_{i≤j} dθ_ij
We present a new texture discrimination method for textured color images in the wavelet domain. In each wavelet subband, the correlation between the color bands is modeled by a multivariate generalized Gaussian distribution with fixed shape parameter (Gaussian, Laplacian). On the corresponding Riemannian manifold, the shape of texture clusters is characterized by means of principal geodesic analysis, specifically by the principal geodesic along which the cluster exhibits its largest variance. The similarity of a texture to a class is then defined in terms of the Rao geodesic distance on the manifold from the texture's distribution to its projection onto the principal geodesic of that class. This similarity measure is used in a classification scheme, referred to as principal geodesic classification (PGC). It is shown to perform significantly better than several other classifiers.

FACULTY OF ENGINEERING AND ARCHITECTURE
Color Texture Discrimination using the Principal Geodesic Distance on a Multivariate Generalized Gaussian Manifold
Geert Verdoolaege¹,² and Aqsa Shabbir¹,³
¹Department of Applied Physics, Ghent University, Ghent, Belgium
²Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium
³Max-Planck-Institut für Plasmaphysik, D-85748 Garching, Germany
Geometric Science of Information, Paris, October 28–30, 2015

Overview
1 Color texture
2 Geometry of wavelet distributions
3 Principal geodesic classification
4 Classification experiments
5 Conclusions

Color texture
- VisTex database: 128 × 128 subimages extracted from RGB images, from 40 classes (textures)
- CUReT database: 200 × 200 RGB images from 61 classes, with varying illumination and viewpoint
- Texture modeling: structure at various scales; stochasticity; correlations between colors, neighboring pixels, etc. ⇒ multivariate wavelet distributions

Geometry of wavelet distributions
- Univariate: generalized Gaussian distribution (zero mean):
  p(x | α, β) = [ β / (2αΓ(1/β)) ] exp( −|x/α|^β )
- m-variate: multivariate generalized Gaussian (MGGD, zero mean):
  p(x | Σ, β) = [ Γ(m/2) / ( π^{m/2} Γ(m/2β) 2^{m/2β} ) ] · β |Σ|^{−1/2} exp( −½ (xᵀΣ⁻¹x)^β )
- Shape parameter: β = 1: Gaussian; β = 1/2: Laplace (heavy tails)

MGGD geometry: coordinate system
- (Σ₁, β₁) → (Σ₂, β₂): find K such that KᵀΣ₁K = I_m and KᵀΣ₂K ≡ Φ₂ ≡ diag(λ₁², …, λ_m²), with λᵢ² the eigenvalues of Σ₁⁻¹Σ₂
- In fact, for all Σ(t), t ∈ [0, 1]: KᵀΣ(t)K ≡ Φ(t) ≡ diag(λ₁²(t), …, λ_m²(t)), with λᵢ²(t) the eigenvalues of Σ₁⁻¹Σ(t); set rᵢ(t) ≡ ln[λᵢ²(t)]
- M. Berkane et al., J. Multivar. Anal., 63, 35–46, 1997
- G. Verdoolaege and P. Scheunders, J. Math. Imaging Vis., 43, 180–193, 2012

MGGD geometry: Fisher information metric
  g_ββ(β) = (1/β²) { 1 + (m/2β)² Ψ₁(m/2β) + (m/β)[ ln 2 + Ψ(m/2β) ] + (m/2β)[ (ln 2)² + Ψ(1 + m/2β) ln 4 + Ψ(1 + m/2β)² + Ψ₁(1 + m/2β) ] }
  g_βi(β) = −(1/2β) [ 1 + ln 2 + Ψ(1 + m/2β) ]
  g_ii(β) = 3b_h − 1/4,  with b_h ≡ (1/4)(m + 2β)/(m + 2)
  g_ij(β) = b_h − 1/4,  i ≠ j

MGGD geometry: geodesics and exponential map
- Geodesic equations for fixed β: rᵢ(t) = t ln(λᵢ²), i.e. the geodesics are linear in the coordinates rᵢ
- Geodesic distance:
  GD(Σ₁, Σ₂) = [ (3b_h − 1/4) Σᵢ rᵢ² + 2 (b_h − 1/4) Σ_{i<j} rᵢ rⱼ ]^{1/2},  rᵢ = ln(λᵢ²)
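For a fixed shape parameter β, the geodesic distance above can be evaluated numerically from the eigenvalues of Σ₁⁻¹Σ₂. A sketch, assuming the coefficient b_h = (m + 2β)/(4(m + 2)) and the cross-term structure as read from the slides (the garbled source makes this a reconstruction, not a verified formula); for β = 1 it reduces to the classical Rao distance between zero-mean Gaussians:

```python
import numpy as np
from scipy.linalg import eigvalsh

def mggd_geodesic_distance(S1, S2, beta):
    """Geodesic distance between zero-mean MGGDs with the same shape beta,
    in the coordinates r_i = ln(lambda_i^2), where the lambda_i^2 are the
    eigenvalues of S1^{-1} S2.  For beta = 1 (Gaussian), b_h = 1/4 and the
    formula reduces to sqrt(1/2 * sum_i r_i^2)."""
    m = S1.shape[0]
    r = np.log(eigvalsh(S2, S1))     # r_i = ln(lambda_i^2)
    bh = (m + 2.0 * beta) / (4.0 * (m + 2.0))
    cross = sum(r[i] * r[j]
                for i in range(len(r)) for j in range(i + 1, len(r)))
    return np.sqrt((3 * bh - 0.25) * np.sum(r ** 2)
                   + 2 * (bh - 0.25) * cross)

# Gaussian sanity check: S2 = e^2 * I gives r_i = 2, distance 2.0.
S1, S2 = np.eye(2), np.diag([np.e ** 2, np.e ** 2])
print(mggd_geodesic_distance(S1, S2, beta=1.0))   # -> 2.0
```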
Practical estimation of mixture models may be problematic when a large number of observations is involved: in such cases, online versions of Expectation-Maximization may be preferred, since they avoid storing all the observations before running the algorithm. We introduce a new online method well suited to cases where the number of observations is large and many mixture models must be learned from different sets of points. Inspired by dictionary methods, our algorithm begins with a training step which builds a dictionary of components. The next step, which can be done online, amounts to populating the weights of these components as each observation arrives. The dictionary of components shows its full benefit when many mixtures must be learned with the same dictionary, maximizing the return on investment of the training step. We evaluate the proposed method on an artificial dataset built from random Gaussian mixture models.

Bag-of-components: an online algorithm for batch learning of mixture models
Olivier Schwander (Université Pierre et Marie Curie, Paris, France)
Frank Nielsen (École polytechnique, Palaiseau, France)
October 29, 2015

Exponential families: definition
  p(x; λ) = p_F(x; θ) = exp( ⟨t(x), θ⟩ − F(θ) + k(x) )
- λ: source parameter; t(x): sufficient statistic; θ: natural parameter; F(θ): log-normalizer; k(x): carrier measure
- F is a strictly convex and differentiable function; ⟨·, ·⟩ is a scalar product

Multiple parameterizations: dual parameter spaces
- Legendre transform: (F, Θ) ↔ (F*, H); θ ∈ Θ natural parameters, η ∈ H expectation parameters, with θ = ∇F*(η) and η = ∇F(θ)
- Source parameters (not unique): λ₁ ∈ Λ₁, λ₂ ∈ Λ₂, …, λ_n ∈ Λ_n; multiple source parameterizations, two canonical parameterizations

Bregman divergences: definition and properties
  B_F(x ‖ y) = F(x) − F(y) − ⟨x − y, ∇F(y)⟩
- F is a strictly convex and differentiable function
- No symmetry!
- Contains a lot of common divergences: squared Euclidean, Mahalanobis, Kullback–Leibler, Itakura–Saito…
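The definition of B_F translates into code directly. The sketch below (function names are ours, not the paper's) checks two of the instances just listed: the squared Euclidean distance from F(x) = ‖x‖², and a Kullback–Leibler-type divergence from the negative Shannon entropy:

```python
import numpy as np

def bregman_divergence(F, gradF, x, y):
    """B_F(x : y) = F(x) - F(y) - <x - y, grad F(y)>."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return F(x) - F(y) - np.dot(x - y, gradF(y))

# F(x) = ||x||^2 recovers the squared Euclidean distance.
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2 * x
x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(bregman_divergence(sq, sq_grad, x, y))   # -> 5.0, i.e. ||x - y||^2

# F(x) = sum_i x_i log x_i (negative Shannon entropy) yields generalized
# Kullback-Leibler; on probability vectors it is KL(p || q) >= 0.
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman_divergence(negent, negent_grad, p, q))
```

Note the asymmetry: swapping the arguments changes the value, which is why the left- and right-sided centroids below are distinct objects.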
Bregman centroids
- Left-sided centroid: min_c Σᵢ ωᵢ B_F(c ‖ xᵢ); right-sided centroid: min_c Σᵢ ωᵢ B_F(xᵢ ‖ c)
- Closed form: c_L = ∇F*( Σᵢ ωᵢ ∇F(xᵢ) ),  c_R = Σᵢ ωᵢ xᵢ

Link with exponential families [Banerjee 2005]
- Bijection with exponential families: log p_F(x | θ) = −B_{F*}( t(x) ‖ η ) + F*( t(x) ) + k(x)
- Kullback–Leibler between members of the same exponential family:
  KL( p_F(x; θ₁), p_F(x; θ₂) ) = B_F( θ₂ ‖ θ₁ ) = B_{F*}( η₁ ‖ η₂ )
- Kullback–Leibler centroids: in closed form through the Bregman divergence

Maximum likelihood estimator: a Bregman centroid
  η̂ = argmax_η Σᵢ log p_F(xᵢ, η)
    = argmin_η Σᵢ [ B_{F*}( t(xᵢ) ‖ η ) − F*( t(xᵢ) ) − k(xᵢ) ]   (the last two terms do not depend on η)
    = argmin_η Σᵢ B_{F*}( t(xᵢ) ‖ η ) = (1/N) Σᵢ t(xᵢ)
and θ̂ = ∇F*(η̂).

Mixtures of exponential families
  m(x; ω, θ) = Σ_{1≤i≤k} ωᵢ p_F(x; θᵢ)
- Fixed: family of the components p_F; number of components k (model selection techniques to choose it)
- Parameters: weights (with Σᵢ ωᵢ = 1) and component parameters θᵢ
- Learning a mixture. Input: observations x₁, …, x_N; output: the ωᵢ and θᵢ

Bregman soft clustering: EM for exponential families [Banerjee 2005]
- E-step: p(i, j) = ωⱼ p_F(xᵢ, θⱼ) / m(xᵢ)
- M-step:
  ηⱼ = argmax_η Σᵢ p(i, j) log p_F(xᵢ, θⱼ)
     = argmin_η Σᵢ p(i, j) B_{F*}( t(xᵢ) ‖ η )   (the terms −F*(t(xᵢ)) − k(xᵢ) do not depend on η)
     = Σᵢ [ p(i, j) / Σᵤ p(u, j) ] t(xᵢ)

Joint estimation of mixture models
- Exploit shared information between multiple point sets, to improve quality and to improve speed
- Inspiration: dictionary methods, transfer learning
- Efficient algorithms for building and for comparing

Co-mixtures: sharing the components of all the mixtures
  m₁(x | ω⁽¹⁾, η) = Σ_{i=1}^k ωᵢ⁽¹⁾ p_F(x | ηᵢ), …, m_S(x | ω⁽ˢ⁾, η) = Σ_{i=1}^k ωᵢ⁽ˢ⁾ p_F(x | ηᵢ)
- Same η₁, …, η_k everywhere; different weights ω⁽ˡ⁾

Co-Expectation-Maximization: maximize the mean of the likelihoods on the mixtures
- E-step: a posterior matrix for each dataset: p⁽ˡ⁾(i, j) = ωⱼ⁽ˡ⁾ p_F(xᵢ⁽ˡ⁾, θⱼ) / m(xᵢ⁽ˡ⁾ | ω⁽ˡ⁾, η)
- M-step: maximization on each dataset: ηⱼ⁽ˡ⁾ = Σᵢ [ p⁽ˡ⁾(i, j) / Σᵤ p⁽ˡ⁾(u, j) ] t(xᵢ⁽ˡ⁾); aggregation: ηⱼ = (1/S) Σ_{l=1}^S ηⱼ⁽ˡ⁾

Variational approximation of Kullback–Leibler [Hershey & Olsen 2007]
  KL_variational(m₁, m₂) = Σ_{i=1}^K ωᵢ⁽¹⁾ log [ Σⱼ ωⱼ⁽¹⁾ e^{−KL(p_F(·; θᵢ) ‖ p_F(·; θⱼ))} / Σⱼ ωⱼ⁽²⁾ e^{−KL(p_F(·; θᵢ) ‖ p_F(·; θⱼ))} ]
- With shared parameters: precompute D_ij = e^{−KL(p_F(· | ηᵢ) ‖ p_F(· | ηⱼ))}
- Fast version: KL_var(m₁ ‖ m₂) = Σᵢ ωᵢ⁽¹⁾ log [ Σⱼ ωⱼ⁽¹⁾ D_ij / Σⱼ ωⱼ⁽²⁾ D_ij ]

Co-segmentation
- Segmentation from 5D RGBxy mixtures [figures: original, EM, co-EM]

Transfer learning
- Increase the quality of one particular mixture of interest: first image with only 1% of the points, two other images with the full set of points; not enough points for EM alone

Bag of components: training step
- Co-mix on some training set; keep the parameters; costly but offline: D = {θ₁, …, θ_K}

Online learning of mixtures
- For a new point set, for each observation arriving:
  argmax_{θ∈D} p_F(xⱼ | θ)  or  argmin_{θ∈D} B_{F*}( t(xⱼ) ‖ θ )

Nearest-neighbor search
- Naive version: linear search, O(number of samples × number of components), the same order of magnitude as one step of EM
- Improvements: computational Bregman geometry to speed up the search: Bregman ball trees, hierarchical clustering, approximate nearest neighbor

Image segmentation
- Segmentation on a random subset of the pixels (100%, 10%, 1%), EM vs. BoC [figures]

Computation times
- Training vs. EM vs. BoC at 100%, 10%, 1% [figure]

Summary
- Co-mix: mixtures with shared components; compact description of a lot of mixtures; fast KL approximations; dictionary-like methods
- Bag of components: online method; predictable time (no iteration); works with only a few points; fast
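The online step above, assigning each arriving observation to its highest-likelihood dictionary component and updating the weights, can be sketched as follows, here with a toy dictionary of univariate Gaussian components (the dictionary and data are illustrative, not the paper's experiments):

```python
import numpy as np
from scipy.stats import norm

def boc_weights(points, dictionary):
    """Online 'bag of components' step: each arriving observation is
    assigned to the highest-likelihood component of a fixed dictionary;
    the mixture weights are the normalized assignment counts."""
    counts = np.zeros(len(dictionary))
    for x in points:                    # can be processed one point at a time
        scores = [norm.pdf(x, mu, sigma) for (mu, sigma) in dictionary]
        counts[int(np.argmax(scores))] += 1
    return counts / counts.sum()

dictionary = [(-5.0, 1.0), (0.0, 1.0), (5.0, 1.0)]   # assumed trained offline
rng = np.random.default_rng(0)
points = np.concatenate([rng.normal(-5, 1, 300), rng.normal(5, 1, 700)])
w = boc_weights(points, dictionary)
print(w)   # roughly [0.3, 0.0, 0.7]
```

The per-point cost is one likelihood evaluation per component, which is what makes the running time predictable (no EM iterations) once the dictionary is built.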
Stochastic watershed is an image segmentation technique, based on mathematical morphology, which produces a probability density function of image contours. Estimated probabilities depend mainly on local distances between pixels. This paper introduces a variant of stochastic watershed where the probabilities of contours are computed from a Gaussian model of image regions. In this framework, the basic ingredient is the distance between pairs of regions, hence a distance between normal distributions. Several alternative statistical distances for normal distributions are therefore compared, namely the Bhattacharyya distance, the Hellinger metric and the Wasserstein metric.
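All three candidate distances between normal distributions have closed forms. A sketch for multivariate Gaussians, using the standard textbook formulas rather than code from the paper:

```python
import numpy as np
from scipy.linalg import sqrtm

def bhattacharyya(m1, S1, m2, S2):
    """Bhattacharyya distance between N(m1, S1) and N(m2, S2)."""
    S = 0.5 * (S1 + S2)
    dm = m1 - m2
    return (0.125 * dm @ np.linalg.solve(S, dm)
            + 0.5 * np.log(np.linalg.det(S)
                           / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

def hellinger(m1, S1, m2, S2):
    """Hellinger distance, via the Bhattacharyya coefficient BC = exp(-DB)."""
    return np.sqrt(1.0 - np.exp(-bhattacharyya(m1, S1, m2, S2)))

def wasserstein2(m1, S1, m2, S2):
    """2-Wasserstein distance between Gaussians (Bures metric on covariances)."""
    root = sqrtm(sqrtm(S2) @ S1 @ sqrtm(S2)).real
    return np.sqrt(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * root))

mA, SA = np.zeros(2), np.eye(2)
mB, SB = np.array([3.0, 0.0]), np.eye(2)
print(wasserstein2(mA, SA, mB, SB))   # -> 3.0 (pure mean shift)
```

Unlike the Bhattacharyya distance, the Hellinger and Wasserstein quantities are true metrics, which matters when they are plugged into a region-merging criterion.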

A technique of spatial-spectral quantization of hyperspectral images is introduced. A quantized hyperspectral image is thus summarized by K spectra which represent the spatial and spectral structures of the image. The proposed technique is based on α-connected components on a region adjacency graph. The main ingredient is a dissimilarity metric. In order to choose the metric that best fits the hyperspectral data manifold, a comparison of different probabilistic dissimilarity measures is carried out.

Optimal Transport and applications in Imagery/Statistics (chaired by Bertrand Maury, Jérémie Bigot)
Optimal transport (OT) is a major statistical tool to measure similarity between features or to match and average features. However, OT requires some relaxation and regularization to be robust to outliers. With relaxed methods, as one feature can be matched to several ones, significant interpolations between different features arise. This is not an issue for comparison purposes, but it involves strong and unwanted smoothing for transfer applications. We thus introduce a new regularized method, based on a non-convex formulation, that minimizes transport dispersion by enforcing a one-to-one matching of features. The interest of the approach is demonstrated for color transfer purposes.

Adaptive color transfer with relaxed optimal transport
Julien Rabin¹, Sira Ferradans² and Nicolas Papadakis³
¹GREYC, University of Caen; ²Data group, ENS; ³CNRS, Institut de Mathématiques de Bordeaux
Conference on Geometric Science of Information

Optimal transport on histograms
- Monge–Kantorovich (MK) discrete mass transportation problem: map µ₀ onto µ₁ while minimizing the total transport cost
- The two histograms must have the same mass
- The optimal transport cost is called the Wasserstein distance (Earth Mover's Distance); the optimal transport map is the application mapping µ₀ onto µ₁

Applications in image processing and computer vision
Optimal transport as a framework to define statistical-based tools; applications to many imaging and computer vision problems:
• Robust dissimilarity measure (optimal transport cost): image retrieval [Rubner et al., 2000], [Pele and Werman, 2009]; SIFT matching [Pele and Werman, 2008], [Rabin et al., 2009]; 3D shape recognition, feature detection [Tomasi]; object segmentation [Ni et al., 2009], [Swoboda and Schnorr, 2013]
• Tool for matching/interpolation (optimal transport map): non-rigid shape matching, image registration [Angenent et al., 2004]; texture synthesis and mixing [Ferradans et al., 2013]; histogram specification and averaging [Delon, 2004]; color transfer [Pitié et al., 2007], [Rabin et al., 2011b]
Not to mention other applications (physics, economy, etc.).

Color transfer
[Figures: target image (µ), source image (ν), optimal transport of µ onto ν, target image after color transfer]
Limitations: mass conservation artifacts; irregularity of the optimal transport map

Outline
Part I. Computation of optimal transport between histograms
Part II. Optimal transport relaxation and regularization; application to color transfer

Part I: Wasserstein distance between histograms

Formulation for clouds of points
Definition (L2 Wasserstein distance). Given two clouds of points X, Y ⊂ ℝ^{d×N} of N elements in ℝ^d with equal masses 1/N, the quadratic Wasserstein distance is defined as
  W₂(X, Y)² = min_{σ∈Σ_N} (1/N) Σ_{i=1}^N ‖Xᵢ − Y_{σ(i)}‖²   (1)
where Σ_N is the set of all permutations of N elements. ⇔ Optimal assignment problem; can be computed using standard sorting algorithms when d = 1.

Exact solution in the unidimensional case (d = 1) for histograms
Histograms may be seen as clouds of points with non-uniform masses, so that µ(x) = Σ_{i=1}^M mᵢ δ_{Xᵢ}(x), with Σᵢ mᵢ = 1 and mᵢ ≥ 0 for all i. Computing the Lp Wasserstein distance for one-dimensional histograms is still simple for p ≥ 1. The optimal transport cost writes [Villani, 2003]
  W_p(µ, ν) = ‖H_µ⁻¹ − H_ν⁻¹‖_p = ( ∫₀¹ |H_µ⁻¹(t) − H_ν⁻¹(t)|^p dt )^{1/p}
where H_µ(t) = ∫_{−∞}^t dµ = Σ_{Xᵢ≤t} mᵢ is the cumulative distribution function of µ and H_µ⁻¹(t) = inf{ s : H_µ(s) ≥ t } its pseudo-inverse. Time complexity: O(N) operations if the bins are already sorted. This cannot be extended to higher dimensions, as the cumulative function H_µ : x ∈ ℝ^d ↦ H_µ(x) ∈ ℝ is not invertible for d > 1.

Exact solution in the general case (d > 1)
Transport cost between normalized histograms µ and ν, where µ = Σ_{i=1}^M mᵢ δ_{Xᵢ}, ν = Σ_{j=1}^N nⱼ δ_{Yⱼ}, mᵢ, nⱼ ≥ 0 and Σᵢ mᵢ = Σⱼ nⱼ = 1 (mᵢ, nⱼ are the masses at locations Xᵢ, Yⱼ). It can be recast as a linear programming problem (linear cost + linear constraints):
  W₂(µ, ν)² = min_{P∈P(µ,ν)} ⟨P, C⟩ = Σ_{i,j} P_{i,j} ‖Xᵢ − Yⱼ‖² = min_{A·p=b} pᵀc
• C is the fixed cost assignment matrix between histogram bins: C_{i,j} = Σ_{k=1}^d (Xᵢᵏ − Yⱼᵏ)²
• P(µ, ν) is the set of non-negative matrices P with marginals µ and ν, i.e.
  P(µ, ν) = { P ∈ ℝ^{M×N} : P_{i,j} ≥ 0, Σ_{i,j} P_{i,j} = 1, Σⱼ P_{i,j} = mᵢ, Σᵢ P_{i,j} = nⱼ }

Illustration in the unidimensional case (d = 1) for histograms
Two histograms µ = {1/3, 2/3} and ν = {1/3, 1/6, 1/2}. Example: µᵢ is the production at plant i and νⱼ is the storage capacity of storehouse j. The matrix C defines the transport cost from i to j: C₁₁ = 2², C₁₂ = 1², C₁₃ = 5², C₂₁ = 6², C₂₂ = 5², C₂₃ = 1². P_{ij} is the mass that is transported from µᵢ to νⱼ. For the independent coupling (P₁₁ = 1/9, P₁₂ = 1/18, P₁₃ = 1/6, P₂₁ = 2/9, P₂₂ = 1/9, P₂₃ = 1/3), the transport cost is Σᵢⱼ P_{ij} C_{ij} ≈ 15.8. For the optimal plan (P₁₁ = 1/3, P₂₂ = 1/6, P₂₃ = 1/2, all other entries 0), the transport cost is W(µ, ν) = Σᵢⱼ P_{ij} C_{ij} = 6.

Optimal transport solution illustration in 1D
[Figures: histograms µ and ν (on a uniform grid Ω) and the optimal flow P]
Remark: masses can be split by transport.

Optimal transport solution with linear programming methods
The discrete mass transportation problem for histograms can be solved with standard linear programming algorithms (simplex, interior point methods). Dedicated algorithms are more efficient for the optimal assignment problem (e.g. Hungarian and auction algorithms, in O(N³)). Computation can be (slightly) accelerated when using costs other than L2 (e.g. L1 [Ling and Okada, 2007], truncated L1 [Pele and Werman, 2008]).
Advantages: the complexity does not depend on the feature dimension d.
Limitation: intractable for signal processing applications where N ≫ 10³ (considering time complexity and memory limitations).

Part II: Relaxation and regularization

Problem statement
Histogram specification exhibits strong limitations of optimal transport when dealing with image processing:
• Color artifacts due to the exact specification (histograms can have very different shapes)
• Irregularities: the transport map is not consistent in the color domain, and it does not take spatial information into account (cf. histogram equalization + filtering)
Proposed solution: relax the mass conservation constraint; promote regular transport flows (color consistency); include spatial information (spatial consistency).

Constraint relaxation
Idea 1: relaxation of the mass conservation constraints [Ferradans et al., 2013]. We consider the transport cost between normalized histograms µ and ν, with µ(x) = Σ_{i=1}^M mᵢ δ_{Xᵢ}(x), Σᵢ mᵢ = 1, mᵢ ≥ 0. Relaxed formulation:
  P* ∈ argmin_{P∈P_κ(µ,ν)} ⟨P, C⟩ = Σ_{1≤i≤N, 1≤j≤M} P_{i,j} C_{i,j}
• with C_{i,j} = ‖Xᵢ − Yⱼ‖², where Xᵢ ∈ Ω ⊂ ℝ^d is the bin centroid of µ for index i;
• with new (linear) constraints:
  P_κ(µ, ν) = { P_{i,j} ≥ 0, Σ_{i,j} P_{i,j} = 1, Σⱼ P_{i,j} = mᵢ, κnⱼ ≤ Σᵢ P_{i,j} ≤ Knⱼ }
where the capacity parameters satisfy κ ≤ 1 ≤ K: hard to tune.

Proposed relaxed histogram matching
Idea 2: use the capacity variables as unknowns:
  {P*, κ*} ∈ argmin_{P∈P_κ(µ,ν), κ∈ℝ^N, κ≥0, ⟨κ,n⟩=1} ⟨P, C⟩ + ρ‖κ − 1‖₁
where P_κ(µ, ν) = { P_{i,j} ≥ 0, Σ_{i,j} P_{i,j} = 1, Σⱼ P_{i,j} = mᵢ, Σᵢ P_{i,j} = κⱼnⱼ } ⇒ still a linear program.

Illustration of relaxed transport
[Figures: optimal transport vs. relaxed optimal transport]

Relaxed color transfer: comparison with raw OT
[Figures: target, raw OT, relaxed OT, source; no color or spatial regularization]

Proposed relaxed and regularized histogram matching
Idea 3: add a regularization prior:
  {P*, κ*} ∈ argmin_{P∈P_κ(µ,ν), κ∈ℝ^N, κ≥0, ⟨κ,n⟩=1} ⟨P, C⟩ + ρ‖κ − 1‖₁ + λR(P),
where P_κ(µ, ν) = { P_{i,j} ≥ 0, Σ_{i,j} P_{i,j} = 1, Σⱼ P_{i,j} = mᵢ, Σᵢ P_{i,j} = κⱼnⱼ } and R(P) models some regularity priors ⇒ still a linear program.

Regularity of the transport map
• Global regularization: defining the regularity of the flow matrix is an NP-hard problem
• Average transport map: instead, we use the posterior mean to estimate a one-to-one transfer function T(Xᵢ) between µ and ν.
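The d > 1 linear-programming formulation of Part I can be reproduced directly with an off-the-shelf LP solver. The sketch below solves the plant/storehouse example from the slides (µ = {1/3, 2/3}, ν = {1/3, 1/6, 1/2}) and recovers the optimal cost of 6:

```python
import numpy as np
from scipy.optimize import linprog

# Worked example from the slides: masses mu, nu and squared ground costs C.
mu = np.array([1/3, 2/3])
nu = np.array([1/3, 1/6, 1/2])
C = np.array([[2**2, 1**2, 5**2],
              [6**2, 5**2, 1**2]], dtype=float)

M, N = C.shape
# Equality constraints on the flattened plan p = P.ravel():
# row sums of P equal mu, column sums equal nu.
A_eq = np.zeros((M + N, M * N))
for i in range(M):
    A_eq[i, i * N:(i + 1) * N] = 1.0     # sum_j P[i, j] = mu[i]
for j in range(N):
    A_eq[M + j, j::N] = 1.0              # sum_i P[i, j] = nu[j]
b_eq = np.concatenate([mu, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P = res.x.reshape(M, N)
print(res.fun)   # optimal transport cost: 6.0
print(P)         # optimal plan: P[0,0]=1/3, P[1,1]=1/6, P[1,2]=1/2
```

The relaxed variants above keep the same linear structure; only the marginal constraints change (capacities κⱼnⱼ instead of nⱼ, plus the ‖κ − 1‖₁ penalty), so the same kind of solver applies.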
We introduce the generalized Pareto distributions as a statistical model to describe thresholded edge-magnitude image filter results. Compared to the more common Weibull or generalized extreme value distributions, these distributions have at least two important advantages: the use of a high threshold value ensures that only the most important edge points enter the statistical analysis, and the estimation is computationally more efficient since a much smaller number of data points has to be processed. The generalized Pareto distributions with a common threshold of zero form a two-dimensional Riemannian manifold with the metric given by the Fisher information matrix. We compute the Fisher matrix for shape parameters greater than −0.5 and show that the determinant of its inverse is the product of a polynomial in the shape parameter and the squared scale parameter. We apply this result by using the determinant as a sharpness function in an autofocus algorithm. We test the method on a large database of microscopy images with given ground-truth focus results. We find that for the vast majority of the focus sequences the results are in the correct focal range. Cases where the algorithm fails are specimens with too few objects and sequences where contributions from different layers result in a multimodal sharpness curve. Using the geometry of the manifold of generalized Pareto distributions, more efficient autofocus algorithms can be constructed, but these optimizations are not included here.
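The peaks-over-threshold step described in the abstract, keeping only filter outputs above a high threshold and fitting a generalized Pareto distribution with threshold location fixed at zero, can be sketched with scipy.stats.genpareto. The synthetic data below stand in for edge-filter outputs, and the paper's sharpness polynomial itself is not reproduced here:

```python
import numpy as np
from scipy.stats import genpareto

# Synthetic heavy-tailed "edge magnitudes"; exceedances of a Lomax/Pareto-II
# sample over a high threshold follow a GPD with shape xi = 1/a (here 1/3).
rng = np.random.default_rng(1)
magnitudes = rng.pareto(3.0, 100_000)

u = np.quantile(magnitudes, 0.95)              # high threshold
exceedances = magnitudes[magnitudes > u] - u   # shifted to start at 0

# Fit the GPD to the exceedances with location fixed at zero.
shape, loc, scale = genpareto.fit(exceedances, floc=0)
print(shape, scale)   # fitted GPD shape (xi) and scale (sigma)
```

Only the 5% of points above the threshold are processed, which is the computational advantage the abstract refers to; the fitted (shape, scale) pair is then what the sharpness function is evaluated on.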

Generalized Pareto Distributions, Image Statistics and Autofocusing in Automated Microscopy. Reiner Lenz. Microscopy: 34 slices, changing focus along the optical axis. Focal sequence (first, next, and final 4x16 images); total focus. Observations: • Autofocus is easy • It is independent of image content (what is in the image) • It is independent of the imaging method (how the image is produced) • It is fast ('real-time') • It is local (which part of the image is in focus) • It is obviously useful in applications (microscopy, cameras, ...) • It is useful in understanding low-level vision processes • It illustrates the relation between scene statistics and vision. Processing pipeline / techniques: filtering → thresholding → critical points → group representations → extreme value statistics → information geometry. Filtering: representations of dihedral groups. Most images are defined on square grids. The symmetry group of the square grid is the dihedral group D(4), consisting of 8 elements: 4 rotations and 4 rotation+reflection elements. For a 5x5 array, choose six filter pairs, resulting in a 6x2 vector at each pixel, e.g. Fx = ((−1, 1), (−1, 1)), Fy = ((−1, −1), (1, 1)).
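The filter-threshold-fit part of the pipeline can be sketched as follows. This is a minimal single-filter-pair illustration, not the authors' full D(4) filter bank or their Fisher-determinant sharpness function; the 90% quantile threshold is an assumption for the sketch.

```python
import numpy as np
from scipy.stats import genpareto
from scipy.signal import convolve2d

# One 2x2 derivative filter pair (the full method uses six pairs derived
# from representations of D(4) on 5x5 neighbourhoods).
FX = np.array([[-1.0, 1.0], [-1.0, 1.0]])
FY = np.array([[-1.0, -1.0], [1.0, 1.0]])

def gpd_edge_fit(image, quantile=0.9):
    """Threshold edge magnitudes at a high quantile and fit a generalized
    Pareto distribution to the excesses (common threshold zero).
    Returns the fitted (shape, scale) parameters."""
    gx = convolve2d(image, FX, mode="valid")
    gy = convolve2d(image, FY, mode="valid")
    mag = np.hypot(gx, gy).ravel()
    u = np.quantile(mag, quantile)      # high threshold
    excess = mag[mag > u] - u           # exceedances, threshold shifted to 0
    shape, loc, scale = genpareto.fit(excess, floc=0)
    return shape, scale
```

A sharpness score could then be built from the fitted parameters, per the determinant result stated in the abstract.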
We study barycenters in the Wasserstein space Pp(E) of a locally compact geodesic space (E, d). In this framework, we define the barycenter of a measure ℙ on Pp(E) as its Fréchet mean. The paper establishes its existence and states its consistency with respect to ℙ. We thus extend previous results on ℝ^d, with conditions on ℙ or on the sequence converging to ℙ for consistency.

Barycenter in Wasserstein spaces: existence and consistency. Thibaut Le Gouic and Jean-Michel Loubes*. Institut de Mathématiques de Marseille, École Centrale Marseille; Institut de Mathématiques de Toulouse*. October 29th 2015.

Barycenter. The barycenter of a set {x_i}_{1≤i≤J} of J points of ℝ^d endowed with weights (λ_i)_{1≤i≤J} is defined as Σ_{1≤i≤J} λ_i x_i. It is characterized as the minimizer of x ↦ Σ_{1≤i≤J} λ_i ‖x − x_i‖². Replace (ℝ^d, ‖·‖) by a metric space (E, d), and minimize x ↦ Σ_{1≤i≤J} λ_i d(x, x_i)².

Likewise, given a random variable/vector of law µ on ℝ^d, its expectation EX is characterized as the minimizer of x ↦ E‖X − x‖². → extension to a metric space (it summarizes the information, staying in a geodesic space).

Definition (p-barycenter). Given a probability measure µ on a geodesic space (E, d), the set arg min_{x∈E} ∫ d(x, y)^p dµ(y) is called the set of p-barycenters of µ. Existence?

Outline: 1 Geodesic space; 2 Wasserstein space; 3 Applications.

Definition (Geodesic space). A complete metric space (E, d) is said to be geodesic if for all x, y ∈ E, there exists z ∈ E such that ½ d(x, y) = d(x, z) = d(z, y).
This includes many spaces (normed vector spaces, compact manifolds, ...).

Proposition (Existence). The p-barycenter of any probability measure with finite moments of order p on a locally compact geodesic space exists. It is not unique in general, e.g. on the sphere. On a non-positively curved space, the barycenter is unique and 1-Lipschitz on the 2-Wasserstein space.

Definition (Wasserstein metric). Let µ and ν be two probability measures on a metric space (E, d) and p ≥ 1. The p-Wasserstein distance between µ and ν is defined as W_p^p(µ, ν) = inf_{π∈Γ(µ,ν)} ∫ d_E(x, y)^p dπ(x, y), where Γ(µ, ν) is the set of all probability measures on E × E with marginals µ and ν. It is defined for any measure whose moments of order p are finite, i.e. E d(X, x₀)^p < ∞ (denote this set P_p(E)). It is a metric on P_p(E); (P_p(E), W_p) is called the Wasserstein space. The topology of this metric is that of weak convergence together with convergence of moments of order p.

The Wasserstein space of a complete geodesic space is a complete geodesic space.
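On the real line the infimum in the Wasserstein definition is attained by the monotone coupling, which pairs sorted samples. A minimal sketch for empirical measures with equal sample sizes and uniform weights (this special case, not the general metric-space definition):

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """p-Wasserstein distance between two empirical measures on R with
    uniform weights and equal sample sizes. In 1D the optimal coupling
    pairs the sorted samples (monotone rearrangement), so
    W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p."""
    x = np.sort(np.asarray(x, float))
    y = np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "sketch assumes equal sample sizes"
    return (np.mean(np.abs(x - y) ** p)) ** (1.0 / p)
```

For example, shifting a two-point sample by 1 gives distance exactly 1 for every p.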
(P_p(E), W_p) is locally compact ⇔ (E, d) is compact. (E, d) embeds isometrically into (P_p(E), W_p). Existence of the barycenter on (P_p(E), W_p)?

Definition (Measurable barycenter application). Let (E, d) be a geodesic space. (E, d) is said to admit measurable barycenter applications if for any J ≥ 1 and any weights (λ_j)_{1≤j≤J}, there exists a measurable application T from E^J to E such that for all (x₁, ..., x_J) ∈ E^J, min_{x∈E} Σ_{j=1}^J λ_j d(x, x_j)^p = Σ_{j=1}^J λ_j d(T(x₁, ..., x_J), x_j)^p. Locally compact geodesic spaces admit measurable barycenter applications.

Theorem (Existence of barycenter). Let (E, d) be a geodesic space that admits measurable barycenter applications. Then any probability measure ℙ on (P_p(E), W_p) has a barycenter.

The barycenter is not unique, e.g.
: E = ℝ² with ℙ = ½δ_{µ₁} + ½δ_{µ₂}, µ₁ = ½δ_{(−1,−1)} + ½δ_{(1,1)} and µ₂ = ½δ_{(1,−1)} + ½δ_{(−1,1)}. Consistency of the barycenter?

3 steps for existence: 1 Multimarginal problem; 2 Weak consistency; 3 Approximation by finitely supported measures.

Definition (Push-forward). Given a measure ν on E and a measurable application T : E → (F, F), the push-forward of ν by T is given by T#ν(A) = ν(T⁻¹(A)) for all A ∈ F. Probabilist version: if X is a r.v. on (Ω, A, ℙ), then ℙ_X = X#ℙ.

Theorem (Barycenter and multimarginal problem [Agueh and Carlier, 2011]). Let (E, d) be a complete separable geodesic space, p ≥ 1 and J ∈ ℕ*. Given (µ_i)_{1≤i≤J} ∈ P_p(E)^J and weights (λ_i)_{1≤i≤J}, there exists a measure γ ∈ Γ(µ₁, ..., µ_J) minimizing γ̂ ↦ ∫ inf_{x∈E} Σ_{1≤i≤J} λ_i d(x_i, x)^p dγ̂(x₁, ..., x_J). If (E, d) admits a measurable barycenter application T : E^J → E, then the measure ν = T#γ is a barycenter of (µ_i)_{1≤i≤J}. If T is unique, ν is of the form ν = T#γ.

Theorem (Weak consistency of the barycenter). Let (E, d) be a geodesic space that admits measurable barycenters. Take (ℙ_j)_{j≥1} ⊂ P_p(E) converging to ℙ ∈ P_p(E). Take any barycenter µ_j of ℙ_j. Then the sequence (µ_j)_{j≥1} is (weakly) tight and any limit point is a barycenter of ℙ.

Proposition (Approximation by finitely supported measures). For any measure ℙ on P_p(E) there exists a sequence of finitely supported measures (ℙ_j)_{j≥1} ⊂ P_p(E) such that W_p(ℙ_j, ℙ) → 0 as j → ∞.

3 steps for existence: 1 Multimarginal problem → existence of a barycenter for ℙ finitely supported.
2 Weak consistency → existence of a barycenter for probabilities that can be approximated by measures with barycenters. 3 Approximation by finitely supported measures → any probability can be approximated by finitely supported probability measures.

Theorem (Consistency of the barycenter). Let (E, d) be a geodesic space that admits measurable barycenters. Take (ℙ_j)_{j≥1} ⊂ P_p(E) converging to ℙ ∈ P_p(E). Take any barycenter µ_j of ℙ_j. Then the sequence (µ_j)_{j≥1} is totally bounded in (P_p(E), W_p) and any limit point is a barycenter of ℙ. This implies continuity of the barycenter when barycenters are unique; there is no rate of convergence (a barycenter Lipschitz on (E, d) is Lipschitz on P_p(E)); and it implies compactness of the set of barycenters.

Outline: 1 Geodesic space; 2 Wasserstein space; 3 Applications.

Statistical application: improvement of measure accuracy. Take (µ_i^n)_{1≤i≤J} → µ_i as n → ∞ and weights (λ_i)_{1≤i≤J}. Set µ_B^n the barycenter of (µ_i^n)_{1≤i≤J}. Then, as n → ∞, µ_B^n → µ_B.
Texture mixing [Rabin et al., 2011].

Statistical application: growing number of measures. Take (µ_n)_{n≥1} such that (1/n) Σ_{i=1}^n µ_i → ℙ. Set µ_B^n the barycenter of (1/n) Σ_{i=1}^n δ_{µ_i}. Then, as n → ∞, µ_B^n → µ_B. Average of template deformations [Bigot and Klein, 2012], [Agulló-Antolín et al., 2015].

References:
Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924.
Agulló-Antolín, M., Cuesta-Albertos, J. A., Lescornel, H., and Loubes, J.-M. (2015). A parametric registration model for warped distributions with Wasserstein's distance. J. Multivariate Anal., 135:117–130.
Bigot, J. and Klein, T. (2012). Consistent estimation of a population barycenter in the Wasserstein space. ArXiv e-prints.
Rabin, J., Peyré, G., Delon, J., and Bernot, M. (2011). Wasserstein Barycenter and its Application to Texture Mixing. SSVM'11, pages 435–446.
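In the 1D case used in applications such as texture mixing, the 2-Wasserstein barycenter has a closed form: its quantile function is the weighted average of the quantile functions. A minimal sketch for empirical measures with equal sample sizes (this 1D special case only, not the general construction of the talk):

```python
import numpy as np

def wasserstein_barycenter_1d(samples, weights=None):
    """2-Wasserstein barycenter of empirical measures on R with equal
    sample sizes: average the quantile functions, i.e. average the
    sorted samples (with the given weights)."""
    samples = [np.sort(np.asarray(s, float)) for s in samples]
    n = len(samples)
    if weights is None:
        weights = np.full(n, 1.0 / n)
    return sum(w * s for w, s in zip(weights, samples))
```

For two two-point measures supported on {0, 2} and {2, 4}, the barycenter is supported on {1, 3}, halfway between the sorted atoms.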
Univariate L-moments are expressed as projections of the quantile function onto an orthogonal basis of univariate polynomials. We present multivariate versions of L-moments expressed as collections of orthogonal projections of a multivariate quantile function on a basis of multivariate polynomials. We propose to consider quantile functions defined as transports from the uniform distribution on [0, 1]^d onto the distribution of interest and present some properties of the subsequent L-moments. The properties of estimated L-moments are illustrated for heavy-tailed distributions.

Multivariate L-Moments Based on Transports. Alexis Decurninge, Huawei Technologies. Geometric Science of Information, October 29th, 2015.

Outline: 1 L-moments (definition of L-moments); 2 Quantiles and multivariate L-moments (definitions and properties; Rosenblatt quantiles and L-moments; monotone quantiles and L-moments; estimation of L-moments; numerical applications).

Definition of L-moments. If X1, ..., Xr are real random variables with common cumulative distribution function F, λ_r = (1/r) Σ_{k=0}^{r−1} (−1)^k C(r−1, k) E[X_{r−k:r}], with X_{1:r} ≤ X_{2:r} ≤ ... ≤ X_{r:r} the order statistics. λ1 = E[X]: localization; λ2 = ½ E[X_{2:2} − X_{1:2}]: dispersion; τ3 = λ3/λ2 with λ3 = (1/3) E[X_{3:3} − 2X_{2:3} + X_{1:3}]: asymmetry; τ4 = λ4/λ2 with λ4 = (1/4) E[X_{4:4} − 3X_{3:4} + 3X_{2:4} − X_{1:4}]: kurtosis. The L-moments exist if ∫ |x| dF(x) < ∞.

Characterization of L-moments. L-moments are projections of the quantile function on an orthogonal basis: λ_r = ∫_0^1 F^{−1}(t) L_r(t) dt, where F^{−1} is the generalized inverse of F, F^{−1}(t) = inf{x ∈ ℝ : F(x) ≥ t}, and L_r is the shifted Legendre polynomial (an orthogonal basis of L²([0, 1])): L_r(t) = Σ_{k=0}^r (−1)^k C(r, k)² t^{r−k} (1 − t)^k. L-moments completely characterize a distribution: F^{−1}(t) = Σ_{r=1}^∞ (2r + 1) λ_r L_r(t).

Definition of L-moments (discrete distributions). For a multinomial distribution with support x1 ≤ x2 ≤ ... ≤ xn and weights π1, ..., πn (Σ_{i=1}^n πi = 1): λ_r = Σ_{i=1}^n w_i^{(r)} x_i = Σ_{i=1}^n [K_r(Σ_{a=1}^i π_a) − K_r(Σ_{a=1}^{i−1} π_a)] x_i, with K_r the respective primitive of L_r: K_r′ = L_r.

Empirical L-moments. U-statistics: the mean over all subsequences of size r drawn without replacement, λ̂_r = (1/C(n, r)) Σ_{1≤i₁<···}
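Sample L-moments are usually computed not by enumerating subsequences but via probability-weighted moments, which give the same unbiased estimators in O(n log n). A minimal sketch (the b_r/l_r relations below are the standard univariate ones, not the multivariate transport-based construction of the talk):

```python
import numpy as np
from math import comb

def sample_l_moments(data, nmom=4):
    """Unbiased sample L-moments l_1..l_4 via probability-weighted
    moments b_r = (1/n) * sum_i [C(i-1, r)/C(n-1, r)] * x_(i),
    then l_1 = b_0, l_2 = 2 b_1 - b_0, etc."""
    x = np.sort(np.asarray(data, float))
    n = len(x)
    b = [np.mean(x)]
    for r in range(1, nmom):
        w = np.array([comb(i, r) / comb(n - 1, r) for i in range(n)])
        b.append(np.mean(w * x))
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2, l3, l4
```

For the sample {1, 2, 3, 4}: l_1 = 2.5, l_2 = 5/6 (half the mean pairwise range), and l_3 = 0 by symmetry.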
Probability Density Estimation (chaired by Jesús Angulo, S. Said)
The two main techniques of probability density estimation on symmetric spaces are reviewed in the hyperbolic case. For computational reasons we chose to focus on kernel density estimation, and we provide the expression of the Pelletier estimator on the hyperbolic space. The method is applied to density estimation of reflection coefficients derived from radar observations.

Probability density estimation on the hyperbolic space applied to radar processing. October 28, 2015. Emmanuel Chevallier^a, Frédéric Barbaresco^b, Jesús Angulo^a. a: CMM-Centre de Morphologie Mathématique, MINES ParisTech, France; b: Thales Air Systems, Surface Radar Domain, Technical Directorate, Advanced Developments Department, 91470 Limours, France. emmanuel.chevallier@mines-paristech.fr

Outline: three techniques of nonparametric probability density estimation (histograms, kernels, orthogonal series); the hyperbolic space of dimension 2; histograms, kernels and orthogonal series in the hyperbolic space; density estimation of radar data in the Poincaré disk.

Histograms: partition the space into a set of bins and count the number of samples per bin.

Kernels: a kernel is placed over each sample; the density is evaluated by summing the kernels.

Orthogonal series: the true density f is studied through the estimation of the scalar products between f and an orthonormal basis of real functions. Let f be the true density, ⟨f, g⟩ = ∫ f g dµ, and let {e_i} be an orthogonal Hilbert basis of real functions: f = Σ_{i=−∞}^{∞} ⟨f, e_i⟩ e_i. Since ⟨f_I, e_i⟩ = ∫ f_I e_i dµ = E(e_i(I)) ≈ (1/n) Σ_{j=1}^n e_i(I(p_j)), we can estimate f by f ≈ Σ_{i=−N}^{N} [(1/n) Σ_{j=1}^n e_i(I(p_j))] e_i = f̂.

Homogeneity and isotropy considerations: non-homogeneous bins, non-isotropic bins. In the absence of a prior on f, the estimation should be as homogeneous and isotropic as possible → this constrains the choice of bins, kernels or orthogonal basis.

Remark on homogeneity and isotropy. Figure: random variable X on the circle.
When the underlying space is not homogeneous and not isotropic, the density estimation cannot treat every point and direction in an equivalent way.

The 2-dimensional hyperbolic space and the Poincaré disk: the only space of constant negative sectional curvature. The Poincaré disk is a model of hyperbolic geometry, with metric ds²_D = 4 (dx² + dy²) / (1 − x² − y²)². It is homogeneous and isotropic.

Density estimation in the hyperbolic space: histograms. A good tiling is homogeneous and isotropic. There are many polygonal tilings, but there are no homothetic transformations for all λ ∈ ℝ. Problem: it is not always possible to scale the tiling to the studied density.

Density estimation in the hyperbolic space: orthogonal series. Standard choice of basis: eigenfunctions of the Laplacian operator ∆. In ℝ^n, (e_i) is the Fourier basis → characteristic-function density estimator. For f : [a, b] → ℝ, f = Σ_{i=−∞}^{∞} ⟨f, e_i⟩ e_i; for f : ℝ → ℝ, f = ∫_{−∞}^{∞} ⟨f, e_ω⟩ e_ω dω. Compact case: estimation of a sum; non-compact case: estimation of an integral.

On the Poincaré disk D, the solutions of ∆f = λf are known for f : D → ℝ but not for f : D′ → ℝ with D′ ⊂ D compact. Computational problem: the estimation involves an integral, even for functions with bounded support.

Kernel density estimation on Riemannian manifolds. K : ℝ₊ → ℝ₊ such that: i) ∫_{ℝ^d} K(‖x‖) dx = 1; ii) ∫_{ℝ^d} x K(‖x‖) dx = 0; iii) K(x) = 0 for x > 1; sup K(x) = K(0).
Euclidean kernel estimator: f̂_k = (1/k) Σ_i (1/r^d) K(‖x − x_i‖ / r). Riemannian case: K(‖x − x_i‖ / r) → K(d(x, x_i) / r).

Figure: volume change θ_{x_i} induced by the exponential map. θ_x is the volume change of exp_x : (T_x M, Lebesgue) → (M, vol). Kernel density estimator proposed by Pelletier: f̂_k = (1/k) Σ_i (1/r^d) (1/θ_{x_i}(x)) K(d(x, x_i) / r).

θ_x in the hyperbolic space: θ_x can easily be computed in hyperbolic geometry. Polar coordinates at p ∈ D: if the geodesic of angle α and length r leads to q, then (r, α) ↔ q. In polar coordinates, ds² = dr² + sinh(r)² dα², thus dvol_polar = sinh(r) dr dα and θ_p((r, α)) = sinh(r)/r.

Density estimation in the hyperbolic space: kernels. Kernel density estimator: f̂_k = (1/k) Σ_i (1/r^d) [d(x, x_i) / sinh(d(x, x_i))] K(d(x, x_i) / r). Formulation as a convolution ↔ (via the Fourier-Helgason transform) orthogonal series. Reasonable computational cost.

Radar data: a succession of input vectors z = (z_0, ..., z_{n−1}) ∈ ℂ^n. Is z background or target? Assumptions: z is a centered Gaussian process; centered → defined by its covariance R_n = E[ZZ*], the Toeplitz matrix with first row (r_0, r_1, ..., r_{n−1}) (additional stationarity assumption), which is also an SPD matrix: R_n ∈ T_n.

Autoregressive model of order k: ẑ_l = −Σ_{j=1}^k a_j^{(k)} z_{l−j}; the k-th reflection coefficient is µ_k = a_k^{(k)}. Diffeomorphism ϕ : T_n → ℝ*₊ × D^{n−1}, R_n ↦ (P_0, µ_1, ..., µ_{n−1}).

Geometry on T_n: via ϕ : T_n → ℝ*₊ × D^{n−1}, the metric on T_n is the product metric on ℝ*₊ × D^{n−1}. Multiple acquisitions of an identical background: what is the distribution of the µ_k?
A potential use is the identification of non-background objects.

Application of density estimation to radar data. Figure: estimated densities of µ1, µ2, µ3 (first row: ground, N = 0.007, 1.61, 14.86; second row: rain, N = 0.18, 2.13, 4.81).

Conclusion. Density estimation on the hyperbolic space is not a fundamentally difficult problem; the easiest solution is kernels. Future work: computation of the volume change in kernels for Riemannian manifolds; deepening the application to radar signals. Thank you for your attention.
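The Pelletier estimator on the hyperbolic plane can be sketched directly from the formulas above, using the closed-form Poincaré-disk distance and the volume change θ(d) = sinh(d)/d. This is a minimal illustration (Epanechnikov kernel, complex numbers as disk coordinates are choices made for the sketch):

```python
import numpy as np

def hyperbolic_distance(z, w):
    """Geodesic distance on the Poincare disk (|z|, |w| < 1),
    d(z, w) = arccosh(1 + 2|z-w|^2 / ((1-|z|^2)(1-|w|^2)))."""
    z, w = complex(z), complex(w)
    num = 2 * abs(z - w) ** 2
    den = (1 - abs(z) ** 2) * (1 - abs(w) ** 2)
    return np.arccosh(1 + num / den)

def epanechnikov(u):
    """Compactly supported kernel, zero outside [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def pelletier_kde(x, samples, r):
    """Pelletier estimator on the hyperbolic plane (d = 2):
    f_hat(x) = (1/k) sum_i (1/r^2) (1/theta(d_i)) K(d_i / r),
    with d_i = d(x, x_i) and volume change theta(d) = sinh(d)/d."""
    k = len(samples)
    total = 0.0
    for xi in samples:
        d = hyperbolic_distance(x, xi)
        theta = np.sinh(d) / d if d > 0 else 1.0  # theta -> 1 as d -> 0
        total += epanechnikov(d / r) / theta
    return total / (k * r ** 2)
```

A sanity check: the distance from the origin to the point 0.5 on the real axis is ln 3, and at a sample point the estimate reduces to K(0)/r².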
We address here the problem of perceptual colour histograms. The Riemannian structure of perceptual distances is measured through standard sets of ellipses, such as the MacAdam ellipses. We propose an approach based on local Euclidean approximations that takes into account the Riemannian structure of perceptual distances without introducing computational complexity during the construction of the histogram.

Color Histograms using the perceptual metric. October 28, 2015. Emmanuel Chevallier^a, Ivar Farup^b, Jesús Angulo^a. a: CMM-Centre de Morphologie Mathématique, MINES ParisTech, France; b: Gjøvik University College, Norway. emmanuel.chevallier@mines-paristech.fr

Plan of the presentation: formalization of the notion of image histogram; perceptual metric and MacAdam ellipses; density estimation in the space of colors.

Image histogram, formalization: I : Ω → V, p ↦ I(p). Ω: support space of pixels (a rectangle/parallelepiped); V: the value space. (Ω, µ_Ω), (V, µ_V): µ_Ω and µ_V are induced by the chosen geometries on Ω and V. Transport of µ_Ω on V: I*(µ_Ω). Image histogram: estimation of f = dI*(µ_Ω)/dµ_V.

Pixels p ∈ Ω are uniformly distributed with respect to µ_Ω, so {I(p), p a pixel} is a set of independent draws of the "random variable" I. Estimating f = dI*(µ_Ω)/dµ_V from {I(p), p a pixel} is then a standard problem of probability density estimation.

Perceptual color histograms: I : Ω → (M = colors, g_perceptual), p ↦ I(p). Assumption: the perceptual distance between colors is induced by a Riemannian metric. The manifold of colors was one of the first examples of a Riemannian manifold, suggested by Riemann himself.

MacAdam ellipses: just-noticeable differences. Chromaticity diagram (constant luminance). The ellipses are elementary unit balls → a local L² metric.

Lab space: the Euclidean metric of the Lab parametrization is supposed to be more perceptual than other parametrizations. Figure: MacAdam ellipses in the ab plane. However, the ellipses are clearly not balls.

Modification of the density estimator. Density is a local notion.
There is no need to know long geodesics: small distances → local approximation by a Euclidean metric. Notations: d_R: perceptual metric; ‖·‖_Lab: canonical Euclidean metric of Lab; ‖·‖_c: Euclidean metric on Lab induced by the ellipse at c. For small distances around c, ‖·‖_c is "better" than ‖·‖_Lab.

Modification of the density estimator. Standard kernel estimator: f̂(x) = (1/k) Σ_{p_i ∈ pixels} (1/r²) K(‖x − I(p_i)‖_Lab / r). Possible modification: K(‖x − I(p_i)‖_Lab / r) → K(‖x − I(p_i)‖_{I(p_i)} / r), where ‖·‖_{I(p_i)} is the Euclidean distance defined by the interpolated ellipse at I(p_i).

Generally, at a color c: lim_{x→c} ‖x − c‖_c / d_R(x, c) = 1, while lim_{x→c} ‖x − c‖_Lab / d_R(x, c) ≠ 1 in general. Thus there exists A > 0 such that for all R > 0 there is x ∈ B_Lab(c, R) with A < |‖x − c‖_Lab / d_R(x, c) − 1|, while there exists R_c = R_{c,A} such that for all x ∈ B_Lab(c, R_c), |‖x − c‖_c / d_R(x, c) − 1| < A. Hence sup_{B_Lab(c,R_c)} |‖x − c‖_c / d_R(x, c) − 1| < A < sup_{B_Lab(c,R_c)} |‖x − c‖_Lab / d_R(x, c) − 1|.

When the scaling factor r is small enough, i.e. r ≤ R_c and B_c(c, r) ⊂ B_Lab(c, R_c): for x ∈ B_Lab(c, R_c), K(‖x − c‖_c / r) is better than K(‖x − c‖_Lab / r); for x ∉ B_Lab(c, R_c), K(‖x − c‖_c / r) = K(‖x − c‖_Lab / r) = 0.

Interpolation of a set of local metrics: a deep question... What is a good interpolation? Interpolating a function means minimizing variation with respect to a metric. Interpolating a metric? There is no intrinsic method: it depends on a choice of parametrization.
This is the subject of the next study.

Barycentric interpolation in the Lab space. Volume change. Figure: (a) color photograph; (b) zoom of the density change adapted to the colours present in the photograph.

Experimental results. Figure: the canonical Euclidean metric of the ab projective plane in (a); the canonical metric followed by a division by the local density of the perceptual metric in (b); and the modified kernel formula in (c).

Conclusion: a simple observation which improves the consistency of the histogram without requiring additional computational cost. Future work will focus on the interpolation of the ellipses, and on the construction of the geodesics and their applications. Thank you for your attention.
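The modified kernel above amounts to replacing the Lab norm by the quadratic form of the interpolated ellipse at each sample colour. A minimal sketch (the matrix M_c per colour and the Epanechnikov kernel are assumptions of the sketch; the paper's ellipse interpolation itself is not reproduced):

```python
import numpy as np

def local_metric_kernel(x, c, M_c, r):
    """Kernel evaluated with the local Euclidean metric at colour c:
    ||x - c||_c = sqrt((x - c)^T M_c (x - c)), where M_c is the
    positive-definite matrix of the interpolated ellipse at c.
    M_c = identity recovers the plain Lab kernel."""
    diff = np.asarray(x, float) - np.asarray(c, float)
    u = np.sqrt(diff @ M_c @ diff) / r
    return 0.75 * (1 - u ** 2) if u <= 1 else 0.0  # Epanechnikov

def perceptual_histogram(x, pixel_colours, metrics, r):
    """Kernel density at colour x, one local metric per sample colour,
    following f_hat(x) = (1/k) sum_i (1/r^2) K(||x - I(p_i)||_{I(p_i)} / r)."""
    vals = [local_metric_kernel(x, c, M, r)
            for c, M in zip(pixel_colours, metrics)]
    return sum(vals) / (len(vals) * r ** 2)
```

Note that only one quadratic form per pixel colour is needed, so the histogram costs the same as the standard kernel estimator.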
Air traffic management (ATM) aims at providing companies with safe and ideally optimal aircraft trajectory planning. Air traffic controllers act on flight paths in such a way that no pair of aircraft come closer than the regulatory separation norm. With the increase of traffic, it is expected that the system will reach its limits in the near future: a paradigm change in ATM is planned with the introduction of trajectory-based operations. This paper investigates a means of producing realistic air routes from the output of an automated trajectory design tool. For that purpose, an entropy associated with a system of curves is defined and a means of iteratively minimizing it is presented. The network produced is suitable for use in a semi-automated ATM system with a human in the loop.

Entropy minimizing curves: application to automated flight path design. S. Puechmorel, ENAC, 29th October 2015.

Problem statement: flight path planning. • Traffic is expected to double by 2050. • In future systems, trajectories will be negotiated and optimized well before the flights start. • But humans will be in the loop: generated flight plans must comply with operational constraints.

Multi-agent systems: • a promising approach to address the planning problem; • but it does not end up with a human-friendly traffic! • Idea: start with the proposed solution and rebuild a route network from it.

A curve optimization problem: an entropy criterion. • Route networks are currently made of straight segments connecting beacons. • They may be viewed as a maximally concentrated spatial density distribution. • Minimizing the entropy of such a density will intuitively yield a flight path system close to what is expected.

Problem modeling: density associated with a curve system. • A classical measure: counting the number of aircraft in each bin of a spatial grid and averaging over time. • It suffers from a severe flaw: aircraft with low velocity will over-contribute. • This may be corrected by enforcing invariance under reparametrization of the curves, • combined with a nonparametric kernel estimate, to yield:

d̃ : x ↦ [Σ_{i=1}^N ∫₀¹ K(‖x − γ_i(t)‖) ‖γ_i′(t)‖ dt] / [Σ_{i=1}^N ∫_Ω ∫₀¹ K(‖x − γ_i(t)‖) ‖γ_i′(t)‖ dt dx]    (1)

Problem modeling II: the entropy criterion. • The kernel K is normalized over the domain Ω so as to have unit integral. • The density is directly related to the lengths l_i, i = 1...N, of the curves γ_i, i = 1...N:

d̃ : x ↦ [Σ_{i=1}^N ∫₀¹ K(‖x − γ_i(t)‖) ‖γ_i′(t)‖ dt] / [Σ_{i=1}^N l_i]    (2)

• The associated entropy is E(γ_1, ..., γ_N) = −∫_Ω d̃(x) log d̃(x) dx    (3)

Optimal curve displacement field: entropy variation. • d̃ has integral 1 over the domain Ω. • It implies that

−(∂/∂γ_j) E(γ_1, ..., γ_N)(ε) = ∫_Ω (∂d̃(x)/∂γ_j)(ε) log d̃(x) dx    (4)

where ε is an admissible variation of the curve γ_j.
• The denominator in the expression of d̃ has derivative:

∫₀¹ ⟨γ_j′(t)/‖γ_j′(t)‖, ε′(t)⟩ dt = −∫₀¹ ⟨(γ_j′(t)/‖γ_j′(t)‖)′_N, ε(t)⟩ dt    (5)

where (·)_N denotes the component normal to the curve.

Optimal curve displacement field: entropy variation. • The numerator of d̃ has derivative:

∫₀¹ ⟨((γ_j(t) − x)/‖γ_j(t) − x‖)_N, ε(t)⟩ K′(‖γ_j(t) − x‖) ‖γ_j′(t)‖ dt    (6)
− ∫₀¹ ⟨(γ_j′(t)/‖γ_j′(t)‖)′_N, ε(t)⟩ K(‖γ_j(t) − x‖) dt    (7)

Optimal curve displacement field II: normal move. • The final expression yields a displacement field normal to the curve:

[∫_Ω ((γ_j(t) − x)/‖γ_j(t) − x‖)_N K′(‖γ_j(t) − x‖) log d̃(x) dx] ‖γ_j′(t)‖    (8)
− [∫_Ω K(‖γ_j(t) − x‖) log d̃(x) dx] (γ_j′(t)/‖γ_j′(t)‖)′_N    (9)
+ [∫_Ω d̃(x) log d̃(x) dx] (γ_j′(t)/‖γ_j′(t)‖)′_N / Σ_{i=1}^N l_i    (10)

Implementation: a gradient algorithm. • The move is based on a tangent vector in the tangent space to Imm([0, 1], ℝ³)/Diff⁺([0, 1]). • It is not directly implementable on a computer. • A simple, landmark-based approach with evenly spaced points was used. • A compactly supported kernel (Epanechnikov) was selected: it allows the computation of the density d̃ on GPUs as a texture operation, which is very fast.

An output from the multi-agent system: integration in the complete system. Route building from initially conflicting trajectories. Figure: initial flight plans and final ones.

Conclusion and future work: an integrated algorithm. • The entropy minimizer is now a part of the overall route design system. • Only a simple post-processing step is necessary to output a usable airways network. • The complete algorithm is being ported to GPU. Future work: take the headings into account. • The behavior is not completely satisfactory when routes are converging in opposite directions. • An improved version will make use of the entropy of a distribution on a Lie group (publication in progress).
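The landmark-based discretization of the density (1)-(2) and the entropy (3) can be sketched as follows: segment lengths stand in for ‖γ′‖ dt, an Epanechnikov kernel is used as in the talk, and the entropy is taken of the normalized grid density (the grid, bandwidth, and normalization convention are assumptions of the sketch, not the production implementation):

```python
import numpy as np

def curve_density_entropy(curves, grid, h):
    """Reparametrization-invariant density of a curve system on a grid of
    points, and the discrete entropy E = -sum d * log d of the normalized
    density. Each curve is an (m, 2) array of ordered landmarks; segment
    lengths play the role of ||gamma'(t)|| dt."""
    dens = np.zeros(len(grid))
    for c in curves:
        mids = 0.5 * (c[1:] + c[:-1])                     # segment midpoints
        seg = np.linalg.norm(np.diff(c, axis=0), axis=1)  # |gamma'| dt
        for p, ds in zip(mids, seg):
            u = np.linalg.norm(grid - p, axis=1) / h
            dens += np.where(u <= 1, 0.75 * (1 - u ** 2), 0.0) * ds
    total = dens.sum()
    if total > 0:
        dens /= total                                     # normalize to mass 1
    nz = dens[dens > 0]
    return dens, -np.sum(nz * np.log(nz))
```

For a single straight route and a symmetric two-point grid, the mass splits evenly and the entropy is log 2, its maximum; concentrating the curves onto fewer grid cells drives the entropy down, which is the minimization objective.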
We introduce a novel kernel density estimator for a large class of symmetric spaces and prove a minimax rate of convergence as fast as the minimax rate on Euclidean space. The rate is proven without any compactness assumptions on the space or Hölder-class assumptions on the densities. A main tool used in proving the convergence rate is the Helgason-Fourier transform, a generalization of the Fourier transform for semisimple Lie groups modulo maximal compact subgroups. This paper obtains a simplified formula in the special case when the symmetric space is the 2-dimensional hyperboloid.

Kernel Density Estimation on Symmetric Spaces. Dena Marie Asta, Department of Statistics, Ohio State University. Supported by NSF grant DMS-1418124 and an NSF Graduate Research Fellowship under grant DGE-1252522.

Geometric methods for statistical analysis. Classical statistics assumes data is unrestricted on Euclidean space (e.g. X̄ = (1/n) Σ_{i=1}^n X_i, var[X] = E[X²] − E[X]²). Exploiting the geometry of the data leads to faster and more accurate tools: implicit geometry in non-Euclidean data, explicit geometry in networks.

Motivation: non-Euclidean data. Normal distributions (hyperboloid); directional headings (sphere); diffusion tensor imaging, material stress, gravitational lensing (3x3 symmetric positive definite matrices).

Nonparametric methods for non-Euclidean data. Classical nonparametric estimators (kernel density estimator, kernel regression, conditional density estimator) assume Euclidean structure. Sometimes the given data has other geometric structure to exploit.

Motivation: non-Euclidean distances. Euclidean distances are often not the right notion of distance between data points. The distance between directional headings (on the sphere) should be the shortest path-length. For normal distributions (the hyperboloid, with coordinates mean and standard deviation), an isometric representation of the hyperboloid is the Poincaré half-plane; each point in either model represents a normal distribution, and the distance is the Fisher distance, which is similar to the KL divergence.
Motivation: Non-Euclidean Distances
- Euclidean distance is not the right distance, so Euclidean volume is not the right volume.
- We want to minimize the risk for density estimation on a (Riemannian) manifold M: E_f ∫_M (f − f̂_n)² dμ, where f̂_n is the estimator based on n samples, f the true density, and μ the manifold volume measure based on the intrinsic distance.

Existing Estimators
- Euclidean KDE: f̂^h_{(X_1,...,X_n)}(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h), with optimal rate of convergence O(n^{−2s/(2s+d)}) (s = smoothness parameter, d = dimension).
- For a general manifold M, the subtraction x − X_i and the division by h are undefined.

Exploiting Geometry: Symmetries
- Symmetries = geometry; symmetries make the smoothing of data (convolution by a kernel) tractable.
- Translations in Euclidean space are specific examples of symmetries; other spaces call for other symmetries.

Exploiting Symmetries to Convolve
- Kernel density estimation is about convolving a kernel with the data: f̂^h_{(X_1,...,X_n)} = K_h * empirical(X_1, ..., X_n), where (g * f)(x) = ∫_{R^n} g(t) f(x − t) dt, with g a density on the space of translations of R^n and f a density on R^n.
- Identifying each t with the translation T_t(w) = t + w and interpreting g as a density on the space of T_t's gives (g * f)(x) = ∫_{R^n} g(T_t) f(T_t^{−1}(x)) dt.
- More general spaces, depending on their geometry, require symmetries other than translations. When X is a symmetric space, i.e. a space having a suitable space G of symmetries, the convolution becomes (g * f)(x) = ∫_G g(T) f(T^{−1}(x)) dT, with g a density on G and f a density on X.

G-Kernel Density Estimator: General Form
- f̂^h_{(X_1,...,X_n)} = K_h * empirical(X_1, ..., X_n), where K_h is a density on the group of symmetries G, empirical(X_1, ..., X_n) is the "empirical density" on the symmetric space X built from the sample observations, and h, C are bandwidth and cutoff parameters.
- We can use harmonic analysis on symmetric spaces to define and analyze this estimator (Asta, D., 2014).
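The Euclidean special case above (kernel smoothing as convolution of a kernel with the empirical measure) can be sketched directly. A minimal illustration with a Gaussian kernel; the names are ours, not the paper's:

```python
import numpy as np

def euclidean_kde(samples, grid, h):
    """Euclidean KDE as a convolution with the empirical measure:
    f_h(x) = (1/(n h)) * sum_i K((x - X_i) / h), K the standard normal density."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    diffs = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * h)

rng = np.random.default_rng(0)
X = rng.normal(size=500)              # samples from a standard normal
grid = np.linspace(-4, 4, 81)
fhat = euclidean_kde(X, grid, h=0.4)  # smoothed density estimate on the grid
```

On a symmetric space, the subtraction `x - X_i` above is exactly what fails, which is why the generalized estimator replaces it with the action of the symmetry group G.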
f̂^{h,C}_{(X_1,...,X_n)} = K_h * empirical(X_1, ..., X_n)

Harmonic Analysis on Symmetric Spaces (Terras, A., 1985)
- The ordinary Fourier transform is an isometry F : L²(R) ⇄ L²(R) : F⁻¹.
- The Helgason-Fourier transform: for a symmetric space X, an isometry H : L²(X) ⇄ L²(frequency space) : H⁻¹, where the frequency space depends on the geometry of X. The (Helgason-)Fourier transform sends convolutions to products.

Generalization: G-Kernel Density Estimator (Asta, D., 2014)
Assumptions on the kernel and the true density:
- The true density is sufficiently smooth (in a Sobolev ball).
- The kernel transforms nicely with the space of data.
- The kernel is sufficiently smooth.

THEOREM (Asta, D., 2014): The G-KDE achieves the same minimax rate on symmetric spaces as the ordinary KDE achieves on R^d.
Optimal rate of convergence O(n^{−2s/(2s+d)}) (s = Sobolev smoothness parameter, d = dimension), attained by f̂^{h,C}_{(X_1,...,X_n)} = H⁻¹[H[empirical(X_1,...,X_n)] · H[K_h] · I_C].

Kernels on Symmetries
- Symmetric positive definite n×n matrices SPD_n: kernels are densities on the space G = GL_n of n×n invertible matrices. Each GL_n matrix M determines an isometry (distance-preserving function) M : SPD_n ≅ SPD_n, M(X) = Mᵀ X M.
- Hyperboloid H²: kernels are densities on the space G = SL_2 of 2×2 invertible matrices having determinant 1. Each SL_2 matrix M determines an isometry M : H² ≅ H², M(x) = (M₁₁ x + M₁₂)/(M₂₁ x + M₂₂).
- Example of a kernel K (hyperbolic version of the Gaussian): the solution to the heat equation on SL_2, with H[K_h](s, kθ) ∝ e^{−h² s̄² − h s̄}. Samples from K (points in SL_2) can be represented in H² = SL_2/SO_2.

Recap: G-KDE (Asta, D., 2014)
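The hyperboloid example becomes concrete in its Poincaré half-plane model, where the hyperbolic (Fisher) distance has a closed form. Below is a minimal sketch of that distance plus a naive distance-based kernel smoother; the smoother is only an illustrative stand-in, not the Helgason-Fourier-based G-KDE of the talk, and all names are ours:

```python
import numpy as np

def poincare_distance(p, q):
    """Hyperbolic distance on the Poincare half-plane {(x, y) : y > 0}:
    d(p, q) = arccosh(1 + ((x2-x1)^2 + (y2-y1)^2) / (2 y1 y2))."""
    (x1, y1), (x2, y2) = p, q
    return np.arccosh(1 + ((x2 - x1)**2 + (y2 - y1)**2) / (2 * y1 * y2))

def hyperbolic_smoother(samples, point, h):
    """Naive KDE-style smoother on H^2: average a Gaussian of the
    hyperbolic distance (unnormalized stand-in for the heat kernel)."""
    d = np.array([poincare_distance(point, s) for s in samples])
    return np.mean(np.exp(-d**2 / (2 * h**2)))

samples = [(0.0, 1.0), (0.5, 1.5), (-0.3, 0.8)]
val = hyperbolic_smoother(samples, (0.0, 1.0), h=1.0)
```

Each half-plane point (x, y) corresponds to a normal distribution with mean x and standard deviation y, so this distance is the natural one between normal distributions.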
Exploiting the geometric structure of the data type:
- Tractable data smoothing = convolving a kernel on a space of symmetries.
- Harmonic analysis on symmetric spaces allows us to prove the minimax rate.
- Symmetric spaces are general enough to include normal distributions, diffusion tensor imaging, material stress, gravitational lensing, and directional headings.
Keynote speech: Tudor Ratiu (chaired by Xavier Pennec)
The goal of these lectures is to show the influence of symmetry in various aspects of theoretical mechanics. Canonical actions of Lie groups on Poisson manifolds often give rise to conservation laws, encoded in modern language by the concept of momentum maps. Reduction methods lead to a deeper understanding of the dynamics of mechanical systems. Basic results in singular Hamiltonian reduction will be presented. The Lagrangian version of reduction and its associated variational principles will also be discussed. The understanding of symmetric bifurcation phenomena for Hamiltonian systems is based on these reduction techniques. Time permitting, discrete versions of these geometric methods will also be discussed in the context of examples from elasticity.
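A concrete instance of the conservation laws encoded by momentum maps: angular momentum q × p is the momentum map of the rotation-group action on phase space, and it is conserved along any central-force Hamiltonian flow. A minimal numerical sketch (our own illustration, not from the lectures; the symplectic Euler step happens to preserve this bilinear invariant exactly, up to round-off):

```python
import numpy as np

def symplectic_euler(q, p, force, h, n_steps):
    """Symplectic Euler for dq/dt = p, dp/dt = force(q), unit mass."""
    for _ in range(n_steps):
        p = p + h * force(q)   # momentum update first (symplectic order)
        q = q + h * p
    return q, p

def central_force(q):
    """Kepler-type attraction toward the origin: rotationally symmetric,
    so the SO(3) momentum map (angular momentum) is conserved."""
    return -q / np.linalg.norm(q) ** 3

q0 = np.array([1.0, 0.0, 0.0])
p0 = np.array([0.0, 1.0, 0.2])
L0 = np.cross(q0, p0)                  # momentum map of the SO(3) action
q1, p1 = symplectic_euler(q0, p0, central_force, h=1e-3, n_steps=5000)
L1 = np.cross(q1, p1)                  # conserved along the discrete flow
```

Because the force is parallel to q at each step, q × p is unchanged by each symplectic Euler step, which mirrors the continuous-time conservation law.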

Dimension reduction on Riemannian manifolds (chaired by Xavier Pennec, Alain Trouvé)
This paper presents derivations of evolution equations for the family of paths that, in the Diffusion PCA framework, are used for approximating the data likelihood. The paths, formally interpreted as most probable paths, generalize geodesics by extremizing an energy functional on the space of differentiable curves on a manifold with connection. We discuss how the paths arise as projections of geodesics for a (non-bracket-generating) sub-Riemannian metric on the frame bundle. Evolution equations in coordinates for both the metric and cometric formulations of the sub-Riemannian geometry are derived. We furthermore show how rank-deficient metrics can be mixed with an underlying Riemannian metric, and we use the construction to show how the evolution equations can be implemented on finite-dimensional LDDMM landmark manifolds.

Faculty of Science, University of Copenhagen
Anisotropic Distributions on Manifolds, Diffusion PCA, and Evolution Equations
GSI 2015, Paris, France
Stefan Sommer (sommer@diku.dk), Department of Computer Science, University of Copenhagen, October 29, 2015

Statistics on Manifolds
- Fréchet mean: argmin_{x ∈ M} (1/N) Σ_{i=1}^N d(x, y_i)²
- PGA (Fletcher et al., '04); GPCA (Huckemann et al., '10); HCA (Sommer, '13); PNS (Jung et al., '12); BS (Pennec, '15)

Infinitesimally Defined Distributions; MLE
- Aim: construct a family N_M(μ, Σ) of anisotropic Gaussian-like distributions; fit by MLE/MAP.
- In R^n, Gaussian distributions are transition distributions of diffusion processes dX_t = dW_t.
- On (M, g), Brownian motion is the transition distribution of a stochastic process (Eells-Elworthy-Malliavin construction), equivalently the solution to the heat diffusion equation ∂p(t,x)/∂t = (1/2) Δp(t,x).
- Infinitesimal characterization dX_t vs. global density p_t(x; y) ∝ e^{−||x−y||²}.

MLE of Diffusion Processes
- The Eells-Elworthy-Malliavin construction gives a map Diff : FM → Dens(M).
- Diff(FM) = N_M ⊂ Dens(M): the set of (normalized) transition densities from FM diffusions.
- For γ = Diff(x, X_α) = p_γ γ_0, the log-likelihood is ln L(x, X_α) = ln L(γ) = Σ_{i=1}^N ln p_γ(y_i).
- Estimated template: argmax_{(x, X_α) ∈ FM} ln L(x, X_α), the MLE of the data y_i under the assumption y ∼ γ ∈ N_M.
- Diffusion PCA (Sommer '14): argmax ln L(x, X_α + εI), generalizing Probabilistic PCA (Tipping, Bishop, '99; Zhang, Fletcher '13).

Most Probable Paths to Samples
- Euclidean: the density p_t(x; y) ∝ e^{−(x−y)ᵀ Σ⁻¹ (x−y)} is the transition density of a diffusion process with stationary generator, and the straight line x − y is the most probable path from y to x.
- Manifolds: which distributions correspond to anisotropic Gaussian distributions N(x, Σ)? What is the most probable path from y to x?
Anisotropic Diffusions and Holonomy
- Driftless diffusion SDE in R^n with stationary generator: dX_t = σ dW_t, σ ∈ M_{n×d}; diffusion field σ, infinitesimal generator σσᵀ.
- Curvature: on a manifold, a stationary field/generator cannot be defined due to holonomy.

Stochastic Development: Eells-Elworthy-Malliavin Construction
- X_t: R^n-valued Brownian motion (driving process).
- U_t: FM-valued (subelliptic) diffusion.
- Y_t: M-valued stochastic process (target process).

The Frame Bundle
- The manifold together with frames (bases) for the tangent spaces T_p M: F(M) consists of pairs u = (x, X_α), x ∈ M, X_α a frame for T_x M.
- Curves in the horizontal part of F(M) correspond to curves in M together with parallel transport of frames.

Driving Process, FM-Valued Process and Target Process
- H_i, i = 1, ..., n horizontal vector fields on F(M): H_i(u) = π_*⁻¹(u_i).
- SDE in R^n (driving): dX_t = Id_n dB_t, X_0 = 0.
- SDE in FM: dU_t = H_i(U_t) ∘ dX_t^i, U_0 = (x_0, X_α), X_α ∈ GL(R^n, T_{x_0}M).
- Process on M (target): Y_t = π_{FM}(U_t).

Estimated Templates: MLE template.

Most Probable Paths
- In R^n, straight lines are most probable for stationary diffusion processes.
- Onsager-Machlup functional for a curve σ_t on M: L(σ_t) = −(1/2) ||σ'(t)||²_g + (1/12) R(σ(t)).
- MPPs can be considered for the target process, or for the driving process (where R = 0).

Definition (MPPs for Driving Process): Let X_t be the driving process for the diffusion Y_t and x ∈ M, i.e. Y_t = π(φ(X_t)). Then σ is a most probable path for the driving process if it satisfies σ = argmin_{c ∈ H(R^d), φ(c)(1) = x} ∫_0^1 −L(c_t) dt.

Proposition: Let Y_α be a frame for T_y M, and let Y_t = π(φ_{(y,Y_α)}(X_t)), i.e. Y_t is the development of X_t starting at (y, Y_α). Then MPPs for the driving process X_t map to geodesics of a lifted sub-Riemannian metric on FM: ⟨w, w̃⟩_{FM} = ⟨X_α⁻¹ π_* w, X_α⁻¹ π_* w̃⟩_{R^n}.
- In the isotropic case, MPPs for the driving process map to geodesics.
- If −ln L(x, X_α) ≈ c + (1/N) Σ_{i=1}^N p(MPP(x, y_i)), then Fréchet mean ≈ MLE in the isotropic case.

MPPs on S²: increasing anisotropy, (a) cov. diag(1, 1), (b) cov. diag(2, .5), (c) cov. diag(4, .25).

Sub-Riemannian Geometry on FM
- X_α : R^n → T_x M gives the inner product ⟨v, w⟩_{X_α} = ⟨X_α⁻¹ v, X_α⁻¹ w⟩_{R^n}.
- Optimal control problem with nonholonomic constraints: x_t = argmin_{c_t, c_0 = x, c_1 = y} ∫_0^1 ||ċ_t||²_{X_{α,t}} dt.
- Let ⟨ṽ, w̃⟩_{HFM} = ⟨X_{α,t}⁻¹ π_*(ṽ), X_{α,t}⁻¹ π_*(w̃)⟩_{R^n} on H_{(x_t, X_{α,t})}FM.
This defines a sub-Riemannian metric G on TFM and the equivalent problem (x_t, X_{α,t}) = argmin_{(c_t, C_{α,t}), c_0 = x, c_1 = y} ∫_0^1 ||(ċ_t, Ċ_{α,t})||²_{HFM} dt, with horizontality constraint (ċ_t, Ċ_{α,t}) ∈ H_{(c_t, C_{α,t})}FM.

MPP Evolution Equations
- Sub-Riemannian Hamilton-Jacobi equations: ẏ_t^k = G^{kj}(y_t) ξ_{t,j}, ξ̇_{t,k} = −(1/2) (∂G^{pq}/∂y^k) ξ_{t,p} ξ_{t,q}.
- In coordinates (x^i) for M, X^i_α for X_α, and W encoding the inner product W^{kl} = δ^{αβ} X^k_α X^l_β:
  ẋ^i = W^{ij} ξ_j − W^{ih} Γ_h^{jβ} ξ_{jβ},
  Ẋ^i_α = −Γ_h^{iα} W^{hj} ξ_j + Γ_k^{iα} W^{kh} Γ_h^{jβ} ξ_{jβ},
  ξ̇_i = W^{hl} Γ_{l,i}^{kδ} ξ_h ξ_{kδ} − (1/2) (Γ_{k,i}^{hγ} W^{kh} Γ_h^{kδ} + Γ_k^{hγ} W^{kh} Γ_{h,i}^{kδ}) ξ_{hγ} ξ_{kδ},
  ξ̇_{iα} = Γ_{k,iα}^{hγ} W^{kh} Γ_h^{kδ} ξ_{hγ} ξ_{kδ} − (W^{hl}_{,iα} Γ_l^{kδ} + W^{hl} Γ_{l,iα}^{kδ}) ξ_h ξ_{kδ} − (1/2) W^{hk}_{,iα} ξ_h ξ_k + Γ_k^{hγ} W^{kh}_{,iα} Γ_h^{kδ} ξ_{hγ} ξ_{kδ}.

Landmark LDDMM
- Christoffel symbols (Micchelli et al. '08): Γ^k_{ij} = (1/2) g_{ir} (g^{kl} g^{rs}_{,l} − g^{sl} g^{rk}_{,l} − g^{rl} g^{ks}_{,l}) g_{sj}.
- Mix of transported frame and cometric: F^d M, the bundle of rank-d linear maps R^d → T_x M; for ξ, ξ̃ ∈ T*F^d M, the cometric g_{F^d M} + λ g_R is ⟨ξ, ξ̃⟩ = δ^{αβ} (ξ π_*⁻¹ X_α)(ξ̃ π_*⁻¹ X_β) + λ ⟨ξ, ξ̃⟩_{g_R}.
- The whole frame need not be transported.

LDDMM Landmark MPPs: + horizontal variance, isotropic, + vertical variance.

Statistical Manifold: Geometry of Γ
- Densities Dens(M) = {γ ∈ Ω^n(M) : ∫_M γ = 1, γ > 0}.
- Fisher-Rao metric: G^{FR}_γ(α, β) = ∫_M (α/γ)(β/γ) γ.
- Γ is a finite-dimensional subset of Dens(M) via Diff : FM → Dens(M), naturally defined on the bundle of symmetric positive T^0_2 tensors.

Summary
- Infinitesimal definition of anisotropic normal distributions N_M(μ, Σ) on M.
- Diffusion map Diff : FM → Dens(M) from the Eells-Elworthy-Malliavin construction / stochastic development.
- MLE of template / covariance (in FM).
- MPPs for driving processes generalize geodesics, being sub-Riemannian geodesics.

1. Sommer: Diffusion Processes and PCA on Manifolds, Oberwolfach extended abstract (Asymptotic Statistics on Stratified Spaces), 2014.
2. Sommer: Anisotropic Distributions on Manifolds: Template Estimation and Most Probable Paths, Information Processing in Medical Imaging (IPMI) 2015.
3. Sommer: Evolution Equations with Anisotropic Distributions and Diffusion PCA, Geometric Science of Information (GSI) 2015.
4. Svane, Sommer: Similarities, SDEs, and Most Probable Paths, SIMBAD15 extended abstract.
5. Sommer, Svane: Holonomy, Curvature, and Anisotropic Diffusions, MOTR15 extended abstract.
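The driving/target process construction above can be imitated crudely in embedded coordinates. A minimal sketch of Brownian motion on S² by tangential Gaussian steps followed by projection — an extrinsic stand-in for illustration, not the frame-bundle Eells-Elworthy-Malliavin construction itself (names are ours):

```python
import numpy as np

def brownian_on_sphere(x0, n_steps, dt, rng):
    """Approximate Brownian motion on S^2: at each step, draw a Gaussian
    increment, project it onto the tangent plane T_x S^2, take the step,
    and renormalize back onto the sphere."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        step = rng.normal(scale=np.sqrt(dt), size=3)   # driving increment in R^3
        step -= np.dot(step, x) * x                    # project onto T_x S^2
        x = x + step
        x /= np.linalg.norm(x)                         # retract to the sphere
        path.append(x.copy())
    return np.array(path)

rng = np.random.default_rng(1)
path = brownian_on_sphere([0.0, 0.0, 1.0], n_steps=1000, dt=1e-3, rng=rng)
```

An anisotropic variant would replace the isotropic Gaussian increment by one with covariance Σ expressed in a transported frame, which is exactly the role the frame bundle plays in the talk.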
This paper addresses the generalization of Principal Component Analysis (PCA) to Riemannian manifolds. Current methods like Principal Geodesic Analysis (PGA) and Geodesic PCA (GPCA) minimize the distance to a "geodesic subspace". This makes it possible to build sequences of nested subspaces consistent with a forward component analysis approach. However, these methods cannot be adapted to a backward analysis, and they are not symmetric in the parametrization of the subspaces. We propose in this paper a new and more general type of family of subspaces in manifolds: barycentric subspaces are implicitly defined as the locus of points which are weighted means of k + 1 reference points. Depending on the generalization of the mean that we use, we obtain the Fréchet/Karcher barycentric subspaces (FBS/KBS) or the affine span (with exponential barycenter). This definition restores the full symmetry between all parameters of the subspaces, in contrast to geodesic subspaces, which intrinsically privilege one point. We show that this definition locally defines a submanifold of dimension k and that it generalizes geodesic subspaces in a certain sense. Like PGA, barycentric subspaces allow the construction of a forward nested sequence of subspaces which contains the Fréchet mean. However, the definition also allows the construction of a backward nested sequence which may not contain the mean. As this definition relies on points and does not explicitly refer to tangent vectors, it can be extended to non-Riemannian geodesic spaces. For instance, principal subspaces may naturally span several strata in stratified spaces, which is not the case with more classical generalizations of PCA.

Barycentric Subspaces and Affine Spans in Manifolds, GSI, 30-10-2015. Xavier Pennec, Asclepios team, INRIA Sophia-Antipolis – Méditerranée, France, and Côte d'Azur University (UCA).

Statistical Analysis of Geometric Features
- Computational anatomy deals with noisy geometric measures: tensors and covariance matrices, curves and tracts, surfaces and shapes, images, deformations. Data live on non-Euclidean manifolds.

Manifold Dimension Reduction
- When the embedding structure is already a manifold (e.g. Riemannian): not manifold learning (LLE, Isomap, ...) but submanifold learning. Low-dimensional subspace approximation?
- Examples: manifold of cerebral ventricles (Etyngier, Keriven, Segonne 2007); manifold of brain images (S. Gerber et al., Medical Image Analysis, 2009).

Outline: PCA in manifolds (tPCA / PGA / GPCA / HCA); affine span and barycentric subspaces; conclusion.

Bases of Algorithms in Riemannian Manifolds
- Reformulate algorithms with Exp_x and Log_x; vector → bipoint (no more equivalence classes).
- Exponential map (normal coordinate system): Exp_x = geodesic shooting parameterized by the initial tangent vector; Log_x = development of the manifold in the tangent space along geodesics; geodesics = straight lines with the Euclidean distance.
- Local to global domain: star-shaped, limited by the cut locus; covers the whole manifold if geodesically complete.

Statistical tools: moments; the Fréchet / Karcher mean minimizes the variance.
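The Fréchet/Karcher mean above — and the weighted means that define barycentric subspaces — can be computed by the standard fixed-point iteration with Exp and Log. A minimal sketch on S², with illustrative names of our own:

```python
import numpy as np

def sphere_log(x, y):
    """Log map on S^2: tangent vector at x pointing toward y."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    v = y - c * x
    nv = np.linalg.norm(v)
    return np.zeros(3) if nv < 1e-12 else np.arccos(c) * v / nv

def sphere_exp(x, v):
    """Exp map on S^2: geodesic from x with initial velocity v."""
    t = np.linalg.norm(v)
    return x if t < 1e-12 else np.cos(t) * x + np.sin(t) * v / t

def frechet_mean(points, weights, iters=100):
    """Weighted Frechet mean on S^2: repeatedly shoot along the
    weighted average of the Log-map vectors until it vanishes."""
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    x = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        g = sum(wi * sphere_log(x, p) for wi, p in zip(w, points))
        x = sphere_exp(x, g)
    return x

# four points symmetric about the z-axis, slightly above the equator:
pts = [np.array(p, float) for p in
       [(1, 0, 0.2), (0, 1, 0.2), (-1, 0, 0.2), (0, -1, 0.2)]]
pts = [p / np.linalg.norm(p) for p in pts]
mu = frechet_mean(pts, [1, 1, 1, 1])   # by symmetry, the north pole
```

With equal weights this is the ordinary Fréchet mean; with k + 1 reference points and varying weights, the set of such weighted means traces out a barycentric subspace in the sense of the abstract above.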
We present a novel method that adaptively deforms a polysphere (a product of spheres) into a single high-dimensional sphere, which then allows for principal nested spheres (PNS) analysis. Applying our method to skeletal representations of simulated bodies, as well as to data from real human hippocampi, yields promising results with a view toward dimension reduction. Specifically, in comparison to composite PNS (CPNS), our method of principal nested deformed spheres (PNDS) captures the essential modes of variation by lower-dimensional representations.

Dimension Reduction on Polyspheres with Application to Skeletal Representations
Joint work with Stephan Huckemann and Sungkyu Jung
Benjamin Eltzner, University of Göttingen
Conference on Geometric Science of Information, 2015-10-30

Dimension Reduction on Manifolds
- PCA relies on linearity. Tangent space approaches ignore geometry and periodic topology; intrinsic approaches rely on manifold geometry.
- Two classes:
  - Forward methods: submanifold dimension d = 1, 2, 3, ...; needs "good" geodesics and a construction scheme.
  - Backward methods: d = D − 1, D − 2, D − 3, ...; needs a rich (parametric) set of submanifolds.

Polysphere Dimension Reduction
- Almost all geodesics of P^D = S^{d_1}_{r_1} × ··· × S^{d_K}_{r_K} are dense in (S¹)^K.
- Low symmetry: isom(P^D) = SO(d_1 + 1) × ··· × SO(d_K + 1), so there is no generic rich set of submanifolds.

Deformation for Unit Spheres
- Dimension reduction methods exist for spheres: GPCA [1], HPCA [2], PNS [3]. Recursively deform the polysphere to a sphere, f : P^D → S^D.
- Squared line elements of two unit spheres: ds₁² = Σ_{k=1}^{d_1} (Π_{j=1}^{k−1} sin² φ_{1,j}) dφ²_{1,k} and ds₂² = Σ_{k=1}^{d_2} (Π_{j=1}^{k−1} sin² φ_{2,j}) dφ²_{2,k}. Deformation: ds² = ds₂² + (Π_{j=1}^{d_2} sin² φ_{2,j}) ds₁².
- Degrees of freedom: rotation and ordering of the spheres.

Fixing Degrees of Freedom
- Rotation: embed S^{d_i}_{r_i} into R^{d_i+1}, determine the Fréchet mean μ̂_i, and use a rotation along a geodesic to move it to the positive x_{i,d_i+1} direction (north pole).
- Ordering: compute the data spread s_i = Σ_{n=1}^N d²(ψ_{i,n}, μ̂_i) and choose the permutation p such that s_{p⁻¹(1)} is maximal and s_{p⁻¹(K)} is minimal. This minimizes the distortion due to the factors sin² φ_j, i.e. the deviation from polysphere geometry.

Mapping Data Points
- Embedding S^{d_i}_1 ⊂ R^{d_i+1} we get y_j = x_{2,j} for 1 ≤ j ≤ d_2 and y_{d_2+k} = x_{2,d_2+1} x_{1,k} for 1 ≤ k ≤ d_1 + 1.
- For different radii, rescale x_{1,j} → x̃_{1,j} = R_1 x_{1,j} for 1 ≤ j ≤ d_1 + 1 and x_{i,j} → x̃_{i,j} = R_i x_{i,j} for i > 1, 1 ≤ j ≤ d_i, and use x̃ in the definition of the y coordinates. This yields an ellipsoid { x ∈ R^{d_2+d_1+1} : Σ_{k=1}^{d_2} R_2⁻² x²_{2,k} + Σ_{k=1}^{d_1+1} R_1⁻² (x_{2,d_2+1} x_{1,k})² = 1 }. As a final step, normalize all y vectors to the length R := (Π_{j=1}^K R_j)^{1/K}.

[1] S. Huckemann and H. Ziezold. Advances in Applied Probability 2.38 (2006), pp. 299–319.
[2] S. Sommer. Geometric Science of Information. Vol. 8085, Lecture Notes in Computer Science, 2013, pp. 76–83.
[3] S. Jung, I. L. Dryden, and J. S. Marron. Biometrika 99.3 (2012), pp. 551–568.
Illustration for Different Radii
1. Map from the blue polysphere to the green ellipsoid.
2. Map to the red sphere.

A Brief Review of Principal Nested Spheres (PNS)
- PNS determines a sequence S^K ⊃ S^{K−1} ⊃ ··· ⊃ S² ⊃ S¹ ⊃ {μ}: recursively fit a small subsphere S^{d−1} ⊂ S^d minimizing the sum of squared geodesic projection distances.
- At every projection, save the signed projection distance (residuals).
- The parameter space dimension for S^{d−1} ⊂ S^d is p = d + 1, compared to linear PCA where for R^{d−1} ⊂ R^d it is p = d.

Skeletal Representation (s-rep) Parameter Space
- An s-rep consists of (1) a two-dimensional mesh of m × n skeletal points and (2) spokes from the mesh points to the surface. (Image from: J. Schulz et al. Journal of Computational and Graphical Statistics 24.2 (2015), p. 539.)
- Parameters: size of the centered mesh, spoke lengths, normalized mesh points, spoke directions: Q = R₊ × R₊^K × S^{3mn−1} × (S²)^K. Polysphere deformation on S^{3mn−1} × (S²)^K yields Q = S^{5mn+2m+2n−5}.

Dimension Reduction for Real S-reps
- PNDS: deform the polysphere to a sphere and apply PNS. CPNS: PNS on the spheres individually and linear PCA on the joint residuals.
- Figure: PNDS vs. CPNS: residual variances for s-reps of 51 hippocampi [5].
[5] S. M. Pizer et al. Ed. by M. Breuß, Bruckstein, and Maragos. Springer, Berlin, 2013, pp. 93–115.

Dimension Reduction for Simulated S-reps
- Figure: PNDS vs. CPNS for simulated twisted ellipsoids: scatter plots of residual signed distances for the first three components. Component variances: PNDS 92.02%, 5.95%, 0.64%; CPNS 62.73%, 32.10%, 2.17%.

Reflection on Parameter Space Dimension
- Figure: simulated twisted ellipsoid data projected to the second component (a small two-sphere) in PNDS, with the first component (a small circle) inside.
- Parameter space dimensions: PNS on S^D: p = D(D + 3)/2 − 1; PCA on R^D: p = D(D + 1)/2.

Conclusion
- We propose a deformation procedure mapping data on a polysphere to a sphere; the construction aims at minimizing geometric distortion.
- We achieve lower-dimensional representations than CPNS; the success of our method is rooted in the higher parameter space dimension.
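The two-sphere case of the mapping-data-points formula can be checked directly: for unit spheres, the vector y built from y_j = x_{2,j} and y_{d_2+k} = x_{2,d_2+1} x_{1,k} again has unit norm, since Σ_j x²_{2,j} + x²_{2,d_2+1} Σ_k x²_{1,k} = 1. A minimal sketch (names are ours):

```python
import numpy as np

def deform_two_spheres(x1, x2):
    """Map a point of S^{d1} x S^{d2} (unit spheres, embedded coordinates)
    to a point of S^{d1+d2} in R^{d2+d1+1}, following the slide's formula:
    y_j = x2_j for j <= d2, and y_{d2+k} = x2_{d2+1} * x1_k."""
    x1 = np.asarray(x1, dtype=float)   # point on S^{d1}, in R^{d1+1}
    x2 = np.asarray(x2, dtype=float)   # point on S^{d2}, in R^{d2+1}
    return np.concatenate([x2[:-1], x2[-1] * x1])

# a point on S^2 x S^1 maps to a point on S^3 in R^4
x1 = np.array([0.0, 0.6, 0.8])   # on S^2
x2 = np.array([0.6, 0.8])        # on S^1
y = deform_two_spheres(x1, x2)
```

Applying this map recursively over the K factors of the polysphere (after the rotation and ordering steps) yields the single sphere on which PNS is then run.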
This paper studies the affine-invariant Riemannian distance on the Riemann–Hilbert manifold of positive definite operators on a separable Hilbert space. This is the generalization of the Riemannian manifold of symmetric positive definite matrices to the infinite-dimensional setting. In particular, in the case of covariance operators in a Reproducing Kernel Hilbert Space (RKHS), we provide a closed-form solution, expressed via the corresponding Gram matrices.

Affine-invariant Riemannian distance between infinite-dimensional covariance operators
Hà Quang Minh, Istituto Italiano di Tecnologia, Italy

Outline:
1. Review of the finite-dimensional setting: the affine-invariant Riemannian metric on the manifold of symmetric positive definite matrices.
2. Infinite-dimensional generalization: the Riemann-Hilbert manifold of positive definite unitized Hilbert-Schmidt operators.
3. Affine-invariant Riemannian distance between Reproducing Kernel Hilbert Space (RKHS) covariance operators.

Positive definite matrices. Sym^{++}(n) = symmetric positive definite n × n matrices. They have been studied extensively mathematically and have numerous practical applications:
Brain imaging (Arsigny et al. 2005, Dryden et al. 2009, Qiu et al. 2015).
Computer vision: object detection (Tuzel et al. 2008, Tosato et al. 2013), image retrieval (Cherian et al. 2013), visual recognition (Jayasumana et al. 2015).
Radar signal processing (Barbaresco 2013, Formont et al. 2013).
Machine learning: kernel learning (Kulis et al. 2009).

Differentiable manifold viewpoint. The tangent space is T_P(Sym^{++}(n)) ≅ Sym(n), the vector space of symmetric matrices. The affine-invariant Riemannian metric on T_P(Sym^{++}(n)) is
⟨A, B⟩_P = ⟨P^{−1/2} A P^{−1/2}, P^{−1/2} B P^{−1/2}⟩_F = tr[P^{−1} A P^{−1} B],
with the Frobenius inner product ⟨A, B⟩_F = tr(A^T B).

Affine invariance:
⟨C A C^T, C B C^T⟩_{C P C^T} = ⟨A, B⟩_P for any matrix C ∈ GL(n).
(Siegel 1943, Mostow 1955, Pennec et al. 2006, Bhatia 2007, Moakher and Zéraï 2011, Bini and Iannazzo 2013.)

The manifold is geodesically complete, with non-positive curvature. The geodesic joining P, Q ∈ Sym^{++}(n) is
γ_PQ(t) = P^{1/2} (P^{−1/2} Q P^{−1/2})^t P^{1/2}.
The exponential map Exp_P : T_P(Sym^{++}(n)) → Sym^{++}(n),
Exp_P(V) = P^{1/2} exp(P^{−1/2} V P^{−1/2}) P^{1/2},
is defined on all of T_P(Sym^{++}(n)).

Riemannian distance:
d_{aiE}(A, B) = ‖log(A^{−1/2} B A^{−1/2})‖_F,
where log(A) is the principal logarithm of A.
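The geodesic and exponential-map formulas above translate directly into code. A small SciPy sketch of the finite-dimensional case (our own helper names):

```python
import numpy as np
from scipy.linalg import sqrtm, expm, fractional_matrix_power


def spd_geodesic(P, Q, t):
    """Point at time t on the geodesic from P to Q in Sym++(n):
    gamma(t) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}."""
    P_sqrt = sqrtm(P)
    P_inv_sqrt = np.linalg.inv(P_sqrt)
    M = P_inv_sqrt @ Q @ P_inv_sqrt
    return P_sqrt @ fractional_matrix_power(M, t) @ P_sqrt


def spd_exp(P, V):
    """Exponential map Exp_P(V) = P^{1/2} exp(P^{-1/2} V P^{-1/2}) P^{1/2}."""
    P_sqrt = sqrtm(P)
    P_inv_sqrt = np.linalg.inv(P_sqrt)
    return P_sqrt @ expm(P_inv_sqrt @ V @ P_inv_sqrt) @ P_sqrt
```

The geodesic interpolates between P (at t = 0) and Q (at t = 1); for commuting matrices the midpoint reduces to the geometric mean.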
We develop a generic framework to build large deformations from a combination of base modules. These modules constitute a dynamical dictionary with which to describe transformations. The method, built on a coherent sub-Riemannian framework, defines a metric on modular deformations and characterizes optimal deformations as geodesics for this metric. We present a generic way to build local affine transformations as deformation modules, and show examples.

A sub-Riemannian modular approach for diffeomorphic deformations
GSI 2015. Barbara Gris; advisors: Alain Trouvé (CMLA) and Stanley Durrleman (ICM). gris@cmla.ens-cachan.fr. October 30, 2015.

Outline: 1. Introduction. 2. Deformation modules: definition and first examples; modular large deformations; combining deformation modules. 3. Numerical results.

Introduction. "Is it possible to mechanize human intuitive understanding of biological pictures that typically exhibit a lot of variability but also possess characteristic structure?" (Ulf Grenander, Hands: A Pattern Theoretic Study of Biological Shapes, 1991.) Data exhibit structure; we seek a corresponding structure in the deformations, which are generated by the flow equation
φ̇_t = v_t ∘ φ_t, φ_{t=0} = Id,
through the choice of the type of vector fields v_t.

Previous works on locally affine deformations. Polyaffine transformations [C. Seiler, X. Pennec, and M. Reyes. Capturing the multiscale anatomical shape variability with polyaffine transformation trees. Medical Image Analysis, 2012]:
v(x) = Σ_i w_i(x) A_i(x).
Here the deformation structure does not evolve with the flow.

Previous works on shape spaces [S. Arguillère. Géométrie sous-riemannienne en dimension infinie et applications à l'analyse mathématique des formes (Sub-Riemannian geometry in infinite dimension and applications to the mathematical analysis of shapes). PhD thesis, 2014]: the deformation structure is imposed by the shapes and the action of the vector fields. Related previous works:
LDDMM [M. I. Miller, L. Younes, and A. Trouvé. Diffeomorphometry and geodesic positioning systems for human anatomy, 2014].
Higher-order momentum [S. Sommer, M. Nielsen, F. Lauze, and X. Pennec. Higher-order momentum distributions and locally affine LDDMM registration. SIAM Journal on Imaging Sciences, 2013].
Sparse LDDMM [S. Durrleman, M. Prastawa, G. Gerig, and S. Joshi. Optimal data-driven sparse parameterization of diffeomorphisms for population analysis. In Information Processing in Medical Imaging, pages 123–134. Springer, 2011].
In these works the deformation structure evolves with the flow, but there is no control on the deformation structure.

Previous works with constraints: diffeons [L. Younes. Constrained diffeomorphic shape evolution. Foundations of Computational Mathematics, 2012].

Our model: deformation modules. Purpose: incorporate constraints in the deformation model, and merge different constraints into a complex one. A deformation module contains a space of shapes and can generate vector fields that are of a particular type (this gives the deformation structure) and depend on the state of the shape (so the deformation structure evolves with the flow).

Definition and first examples. A deformation module is a tuple M = (O, H, V, ζ, ξ, c), where O is a shape space (in the sense of S. Arguillère) and there exists C > 0 such that for all (o, h) ∈ O × H:
‖ζ(o, h)‖²_V ≤ C c(o, h).
Examples of generated vector fields: a local translation of scale σ; a local scaling of scale σ (built from control points z_1, z_2, z_3 and directions d_1, d_2, d_3); a local rotation of scale σ; a local translation of scale σ with fixed direction.

Modular large deformations. The studied trajectories are t ↦ (o_t, h_t) ∈ O × H such that
ȯ_t = ξ_{o_t}(v_t), where v_t = ζ_{o_t}(h_t) ∈ ζ_{o_t}(H).
Solutions of φ̇^v_t = v_t ∘ φ^v_t, φ^v_{t=0} = Id, exist; φ^v is called a modular large deformation.

Combining deformation modules. Features: if c^i_{o^i}(h^i) = ‖ζ^i_{o^i}(h^i)‖²_{V^i}, then the combined cost is
c_o(h) = Σ_i ‖ζ^i_{o^i}(h^i)‖²_{V^i},
while the combined vector field is Σ_i ζ^i_{o^i}(h^i) ∈ V. The geometrical descriptors are transported by the global vector field, and the mathematical framework is coherent: any modules can be combined.

Matching problem. Minimize
∫₀¹ c_{o_t}(h_t) dt + g(φ^v_{t=1} · f_source, f_target), with v = ζ_o(h),
where g is a data-attachment term such as the varifold distance [N. Charon and A. Trouvé. The varifold representation of non-oriented shapes for diffeomorphic registration, 2013].

Conclusion. We have presented a coherent mathematical framework to build modular large deformations, and showed how it allows one to easily incorporate constraints in a deformation model and to merge different constraints into a global one. "Is it possible to mechanize human intuitive understanding of biological pictures that typically exhibit a lot of variability but also possess characteristic structure?" (Ulf Grenander, 1991.) Thank you for your attention!
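As a rough illustration of the local-translation module and the flow equation φ̇_t = v_t ∘ φ_t, one can sum Gaussian-weighted translations and integrate the flow with an Euler scheme. This is a toy sketch under our own simplifying assumptions (Gaussian kernel, constant controls h_i, naive time discretization), not the paper's implementation:

```python
import numpy as np


def local_translation_field(x, centers, vectors, sigma):
    """Vector field generated by local-translation modules:
    v(x) = sum_i exp(-|x - z_i|^2 / (2 sigma^2)) h_i."""
    # x: (n, d) evaluation points; centers: (m, d); vectors: (m, d)
    diff = x[:, None, :] - centers[None, :, :]               # (n, m, d)
    w = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))  # (n, m)
    return w @ vectors                                        # (n, d)


def integrate_flow(x0, centers, vectors, sigma, steps=20):
    """Euler discretization of phi_t' = v_t o phi_t, phi_0 = Id.
    The geometrical descriptors (the centers z_i) are transported
    by the same vector field, as in the modular framework."""
    x, z = x0.copy(), centers.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * local_translation_field(x, z, vectors, sigma)
        z = z + dt * local_translation_field(z, z, vectors, sigma)
    return x
```

A point sitting on a module's center is carried along the full translation, while points far from all centers barely move.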
Optimization on Manifold (chaired by Pierre-Antoine Absil, Rodolphe Sepulchre)
The Riemannian trust-region algorithm (RTR) is designed to optimize differentiable cost functions on Riemannian manifolds. It proceeds by iteratively optimizing local models of the cost function. When these models are exact up to second order, RTR boasts a quadratic convergence rate to critical points. In practice, building such models requires computing the Riemannian Hessian, which may be challenging. A simple idea to alleviate this difficulty is to approximate the Hessian using finite differences of the gradient. Unfortunately, this is a nonlinear approximation, which breaks the known convergence results for RTR. We propose RTR-FD: a modification of RTR which retains global convergence when the Hessian is approximated using finite differences. Importantly, RTR-FD reduces gracefully to RTR if a linear approximation is used. This algorithm is available in the Manopt toolbox.
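The finite-difference idea can be sketched on the unit sphere: perturb the point along a retraction, re-evaluate the Riemannian gradient, and difference. This is a toy illustration with our own helper names; the actual RTR-FD algorithm (and its trust-region safeguards) lives in the Manopt toolbox:

```python
import numpy as np


def sphere_proj(x, v):
    """Project v onto the tangent space of the unit sphere at x."""
    return v - np.dot(x, v) * x


def hess_fd(x, v, rgrad, retraction, t=1e-5):
    """Finite-difference approximation of the Hessian-vector product:
    Hess f(x)[v] ~ (grad f(R_x(t v)) - grad f(x)) / t,
    with the difference projected back onto the tangent space at x."""
    y = retraction(x, t * v)
    return sphere_proj(x, (rgrad(y) - rgrad(x)) / t)
```

For the Rayleigh quotient f(x) = x^T A x, whose Riemannian gradient is the tangent projection of 2 A x, this recovers the known Hessian action at an eigenvector to finite-difference accuracy.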

Ditch the Hessian Hassle with Riemannian Trust Regions
Nicolas Boumal, Inria & ENS Paris
Geometric Science of Information, GSI 2015, Oct. 30, 2015, Paris

The goal is to optimize a smooth function on a smooth manifold. The trust-region method is like Newton's method with a safeguard.