> feedbackdroid wrote:
>
>> So, maybe the contradiction is in melding it in the same conceptual
>> idea with Occam's Razor.
>
> Occam's Razor is a heuristic. It says that less is more, in a sense. It
> is not a priori certain that the simplest explanation for a finite set
> of facts will turn out to be the correct explanation in a more general
> context.
Not "certain", but, given some plausible assumptions, a priori more "probable".
Lester is, I'm guessing, reacting to quotes I posted yesterday on CAP (or maybe it's just a coincidence that his post came one day after this):
"We have reached an understanding of the connections between learning and complexity as unified by the idea of predictive information ... The results provide a conclusive answer to the long standing problem of how to characterize the complexity of time series, and serve to unify ideas from different areas of physics and computer science. In particular we can classify data streams by their complexity, and if there is something to be learned from the data stream then this classification corresponds to measures for the complexity of the model that can be learned."
The comments below, where I have included them, are Nemenman's.
Ming Li, John Tromp, and Paul Vitanyi. Sharpening Occam's razor. In Proc. 8th COCOON, LNCS 2387, pages 411-419, Berlin, 2002. Springer-Verlag. Abstract: We provide a new representation-independent formulation of Occam's razor theorem, based on Kolmogorov complexity. This new formulation allows us to: (i) Obtain better sample complexity than both length-based and VC-based versions of Occam's razor theorem, in many applications. (ii) Achieve a sharper reverse of Occam's razor theorem than previous work. Specifically, we weaken the assumptions made in an earlier publication, and extend the reverse to superpolynomial running times.
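Kolmogorov complexity itself is uncomputable, but compressed length gives a crude, computable upper-bound proxy for it, which is enough to see the basic idea behind complexity-based razors. A minimal sketch (the strings and compression level are illustrative choices, not anything from the paper):

```python
import random
import zlib

def compressed_len(s: bytes) -> int:
    """Length of zlib-compressed s: a computable upper-bound proxy
    for Kolmogorov complexity (which is itself uncomputable)."""
    return len(zlib.compress(s, 9))

# A highly regular string has a short description ...
simple = b"ab" * 500

# ... while (pseudo)random bytes are essentially incompressible.
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(1000))

print(compressed_len(simple), compressed_len(noisy))
```

The regular string compresses to a small fraction of its length, while the random one does not shrink at all, mirroring the intuition that "simple" means "short program".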
I Nemenman and W Bialek. Occam factors and model-independent Bayesian learning of continuous distributions. Phys. Rev. E, 65(2):026137, 2002. Preliminary version: I. Nemenman and W. Bialek, Learning Continuous Distributions: Simulations With Field Theoretic Priors, in T. Leen, T. Dietterich, and V. Tresp, eds., Adv. Neural Inf. Proc. Syst. 13, pp. 287-293, MIT Press, 2001. Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the data and the phase space factors arising from the integration over the model space determine the free parameter of the theory ("smoothness scale") self-consistently. This persists even for distributions that are atypical in the prior and is a step towards a model-independent theory for learning continuous distributions. Finally, we point out that a wrong parameterization of a model family may sometimes be advantageous for small data sets.
I Nemenman. Information Theory and Learning: A Physical Approach. PhD thesis, Princeton University, Department of Physics, 2000. Abstract: We try to establish a unified information theoretic approach to learning and to explore some of its applications. First, we define predictive information as the mutual information between the past and the future of a time series, discuss its behavior as a function of the length of the series, and explain how other quantities of interest studied previously in learning theory - as well as in dynamical systems and statistical mechanics - emerge from this universally definable concept. We then prove that predictive information provides the unique measure for the complexity of dynamics underlying the time series and show that there are classes of models characterized by power-law growth of the predictive information that are qualitatively more complex than any of the systems that have been investigated before. Further, we investigate numerically the learning of a nonparametric probability density, which is an example of a problem with power-law complexity, and show that the proper Bayesian formulation of this problem provides for the "Occam" factors that punish overly complex models and thus allow one to learn not only a solution within a specific model class, but also the class itself using the data only and with very few a priori assumptions. We study a possible information theoretic method that regularizes the learning of an undersampled discrete variable, and show that learning in such a setup goes through stages of very different complexities. Finally, we discuss how all of these ideas may be useful in various problems in physics, statistics, and, most importantly, biology.
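The central quantity here, predictive information, is just the mutual information between past and future. For a stationary first-order Markov chain (where, given the present, past and future are independent) it reduces to the one-step mutual information I(X_t; X_{t+1}), which makes it easy to estimate. A minimal sketch, with an illustrative two-state chain, comparing the plug-in estimate to the analytic value:

```python
import random
from collections import Counter
from math import log2

# Illustrative two-state Markov chain: stay with prob. 0.9, switch with 0.1.
STAY = 0.9
random.seed(0)

def simulate(n):
    x, out = 0, []
    for _ in range(n):
        out.append(x)
        x = x if random.random() < STAY else 1 - x
    return out

xs = simulate(200_000)
pairs = Counter(zip(xs, xs[1:]))
singles = Counter(xs)
n_pairs = len(xs) - 1

# Plug-in estimate of I(X_t; X_{t+1}); for a first-order chain this
# already equals the (finite) predictive information I(past; future).
mi = 0.0
for (a, b), c in pairs.items():
    p_ab = c / n_pairs
    p_a = singles[a] / len(xs)
    p_b = singles[b] / len(xs)
    mi += p_ab * log2(p_ab / (p_a * p_b))

# Analytic value: H(X_{t+1}) - H(X_{t+1} | X_t) = 1 - H_b(STAY) bits,
# since the symmetric chain has a uniform stationary distribution.
hb = -(STAY * log2(STAY) + (1 - STAY) * log2(1 - STAY))
print(f"estimated {mi:.3f} bits, analytic {1 - hb:.3f} bits")
```

For this chain the predictive information saturates at a constant; the "qualitatively more complex" classes in the thesis are those where it keeps growing (logarithmically or as a power law) with the length of the observed series.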
Jonathan Oppenheim, Michal Horodecki, and Ryszard Horodecki. Are there phase transitions in information space?. Phys. Rev. Lett., 90:010404, 2003. Abstract: The interplay between two basic quantities -- quantum communication and information -- is investigated. Quantum communication is an important resource for quantum states shared by two parties and is directly related to entanglement. Recently, the amount of local information that can be drawn from a state has been shown to be closely related to the non-local properties of the state. Here we consider both formation and extraction processes, and analyze informational resources as a function of quantum communication. The resulting diagrams in information space allow us to observe phase-like transitions when correlations become classical.
Liam Paninski. Design of experiments via information theory. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the staircase method widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the psychometric function underlying the output responses.
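The adaptive strategy in Paninski's paper is easy to sketch in the staircase-like setting he mentions: maintain a posterior over the unknown parameter, and on each trial pick the stimulus that maximizes the mutual information between the parameter and the (binary) response. The setup below (logistic psychometric function, grid prior, true threshold value) is entirely illustrative, not the paper's construction:

```python
import math
import random

# Hypothetical task: estimate the threshold t of a logistic psychometric
# function p(yes | stimulus x) = 1 / (1 + exp(-(x - t))).
random.seed(1)
grid = [i / 10 for i in range(-50, 51)]   # candidate thresholds
prior = [1 / len(grid)] * len(grid)       # uniform prior over t
true_t = 1.3                              # unknown to the experimenter

def p_yes(x, t):
    return 1 / (1 + math.exp(-(x - t)))

def entropy(p):
    return sum(-q * math.log2(q) for q in p if q > 0)

def expected_info_gain(x, prior):
    # I(t; response | stimulus x) = H(response) - E_t[H(response | t)]
    py = sum(w * p_yes(x, t) for w, t in zip(prior, grid))
    h_marg = entropy([py, 1 - py])
    h_cond = sum(w * entropy([p_yes(x, t), 1 - p_yes(x, t)])
                 for w, t in zip(prior, grid))
    return h_marg - h_cond

for _ in range(30):
    # Adaptive choice: stimulus with maximal expected information gain.
    x = max(grid, key=lambda x: expected_info_gain(x, prior))
    yes = random.random() < p_yes(x, true_t)   # simulated observer
    post = [w * (p_yes(x, t) if yes else 1 - p_yes(x, t))
            for w, t in zip(prior, grid)]
    z = sum(post)
    prior = [w / z for w in post]              # Bayesian update

est = max(zip(prior, grid))[1]   # MAP estimate of the threshold
print(f"true t = {true_t}, estimated t = {est}")
```

The nonadaptive alternative would fix the stimulus schedule in advance; the paper's theorem quantifies how much the adaptive rule gains asymptotically.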
(This next is off topic here, but see Nemenman's comment after the abstract)
Eliano Pessa and Giuseppe Vitiello. Quantum noise, entanglement and chaos in the quantum field theory of mind/brain states. Abstract: We review the dissipative quantum model of brain and present recent developments related with the role of entanglement, quantum noise and chaos in the model.
Comments: Either I just don't understand a thing, or WTF?
David H. Wolpert. On the Bayesian "Occam factors" argument for Occam's razor. In T. Petsche et al., editors, Computational Learning and Natural Learning Systems, volume 3. MIT Press, 1995. Abstract: This paper discusses some of the problematic aspects of the Bayesian first-principles "proof" of Occam's razor which involves Occam factors. Although it is true that the posterior for a model is reduced due to Occam factors if that model is capable of expressing many functions, the phenomenon need not have anything to do with Occam's razor. This paper shows this by i) performing reductio ad absurdum on the argument that the Occam factors effect implies Occam's razor; ii) presenting an alternative Bayesian approach which explicitly does not result in Occam's razor; and finally iii) disentangling the underlying problem with viewing the Occam factors argument as a proof or "automatic embodiment" of Occam's razor.
Comments: This is the first of the "no free lunch" papers by Wolpert. See also [Wolpert and Macready, 1997]. The paper's main argument boils down to the fact that Bayesian model selection assigns equal a priori probabilities to model families, not to individual models. This is true. However, I believe in a simple world (otherwise Wolfram wins), thus these arguments are basically useless. On top of this, the last part of the paper, which refers to "integration over all possible model families", is by now outdated -- this has been done by [Bialek et al., 1996] and others.
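The Occam-factors effect under debate here can be shown in a few lines of exact arithmetic. This is the standard textbook illustration (fair coin vs. coin of unknown bias with a uniform prior), not Wolpert's construction: the flexible model spreads its prior over many possible data sets, so its marginal likelihood is automatically penalized when the data look unremarkable.

```python
from math import comb

def evidence_fair(n, k):
    # H0: the coin is fair; no free parameters.
    return 0.5 ** n

def evidence_biased(n, k):
    # H1: unknown bias theta with uniform prior. Marginal likelihood:
    # integral of theta^k (1-theta)^(n-k) dtheta = 1 / ((n+1) * C(n, k)).
    return 1 / ((n + 1) * comb(n, k))

for k in (10, 18):
    e0, e1 = evidence_fair(20, k), evidence_biased(20, k)
    print(f"k={k}/20 heads: P(D|fair)={e0:.3g}, P(D|biased)={e1:.3g}")
```

With 10 heads in 20 flips the rigid fair-coin model wins; with 18 heads the flexible model wins despite its Occam penalty. Wolpert's point is that this mechanism, by itself, need not track "simplicity" in any representation-independent sense.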
David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82, April 1997. Abstract: A framework is developed to explore the connection between effective optimization algorithms and the problems they are solving. A number of "no free lunch" (NFL) theorems are presented that establish that for any algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. These theorems result in a geometric interpretation of what it means for an algorithm to be well suited to an optimization problem. Applications of the NFL theorems to information theoretic aspects of optimization and benchmark measures of performance are also presented. Other issues addressed are time-varying optimization problems and a priori "head-to-head" minimax distinctions between optimization algorithms, distinctions that can obtain despite the NFL theorems' enforcing of a type of uniformity over all algorithms.
Comments: The paper proves various NFL theorems for optimization. It is important to realize, however, that many of us believe that the world is indeed simple in some sense. Thus we are willing to tolerate bad performance on complex problems that do not occur in the real world, if we perform well on the cases that are common. Universal learning people would also agree with this, since only very few problems in the world are computable, have low Kolmogorov complexity, and are, therefore, of high probability a priori.
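The NFL averaging argument can be verified exhaustively on a toy problem. The sketch below (domain of 4 points, binary objective values, two deterministic search orders chosen purely for illustration) averages best-so-far performance over ALL 16 possible objective functions; both algorithms come out exactly equal, as the theorems demand:

```python
from itertools import product

domain = range(4)
orders = {"left_to_right": [0, 1, 2, 3],
          "right_to_left": [3, 2, 1, 0]}

def best_after(order, f, m):
    # Best objective value seen after the first m distinct evaluations.
    return max(f[x] for x in order[:m])

# Average performance over all functions f: domain -> {0, 1}.
for m in (1, 2, 3):
    avgs = {name: sum(best_after(order, f, m)
                      for f in product([0, 1], repeat=4)) / 16
            for name, order in orders.items()}
    print(m, avgs)
```

Any real-world advantage of one order over the other therefore has to come from the function class not being uniform, which is exactly the "simple world" assumption in the comment above.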