Topic: Occam's Sharpest Razor
Replies: 46   Last Post: Feb 16, 2007 5:23 PM

 Michael Olea Posts: 57 Registered: 12/13/04
Re: Occam's Sharpest Razor
Posted: Feb 13, 2007 8:48 PM

Bob Kolker wrote:

> feedbackdroid wrote:
>>
>> So, maybe the contradiction is in melding it in the same conceptual
>> idea with Occam's Razor.

>
> Occam's Razor is a heuristic. It says that less is more, in a sense. It
> is not a priori certain that the simplest explanation for a finite set
> of facts will turn out to be the correct explanation in a more general
> context.

Not "certain", but, given some plausible assumptions, a priori more
"probable".

Lester is, I'm guessing, reacting to quotes I posted yesterday on CAP (or
maybe it's just a coincidence that his post came one day after this):

===========================================================================

"We have reached an understanding of the connections between learning and
complexity as unified by the idea of predictive information ... The
results provide a conclusive answer to the long standing problem of how to
characterize the complexity of time series, and serve to unify ideas from
different areas of physics and computer science. In particular we can
classify data streams by their complexity, and if there is something to be
learned from the data stream then this classification corresponds to
measures for the complexity of the model that can be learned."

The comments below, where I have included them, are Nemenman's.

Ming Li, John Tromp, and Paul Vitanyi. Sharpening Occam's razor. In Proc.
8th COCOON, LNCS 2387, pages 411-419, Berlin, 2002. Springer-Verlag.
Abstract: We provide a new representation-independent formulation of Occam's
razor theorem, based on Kolmogorov complexity. This new formulation allows
us to: (i) Obtain better sample complexity than both length-based and
VC-based versions of Occam's razor theorem, in many applications. (ii)
Achieve a sharper reverse of Occam's razor theorem than previous work.
Specifically, we weaken the assumptions made in an earlier publication, and
extend the reverse to superpolynomial running times.

I Nemenman and W Bialek. Occam factors and model-independent Bayesian
learning of continuous distributions. Phys. Rev. E, 65(2):026137, 2002.
Preliminary version: I. Nemenman and W. Bialek, Learning
Continuous Distributions: Simulations With Field Theoretic Priors, in T.
Leen, T. Dietterich, and V. Tresp, eds., Adv. Neural Inf. Proc.
Syst. 13, pp. 287-293, MIT Press, 2001.
Abstract: Learning of a smooth but nonparametric probability density can be
regularized using methods of Quantum Field Theory. We implement a field
theoretic prior numerically, test its efficacy, and show that the data and
the phase space factors arising from the integration over the model space
determine the free parameter of the theory ("smoothness scale")
self-consistently. This persists even for distributions that are atypical
in the prior and is a step towards a model-independent theory for learning
continuous distributions. Finally, we point out that a wrong
parameterization of a model family may sometimes be advantageous for small
data sets.

I Nemenman. Information Theory and Learning: A Physical Approach. PhD
thesis, Princeton University, Department of Physics, 2000.
Abstract: We try to establish a unified information theoretic approach to
learning and to explore some of its applications. First, we define
predictive information as the mutual information between the past and the
future of a time series, discuss its behavior as a function of the length
of the series, and explain how other quantities of interest studied
previously in learning theory - as well as in dynamical systems and
statistical mechanics - emerge from this universally definable concept. We
then prove that predictive information provides the unique measure for the
complexity of dynamics underlying the time series and show that there are
classes of models characterized by power-law growth of the predictive
information that are qualitatively more complex than any of the systems
that have been investigated before. Further, we investigate numerically the
learning of a nonparametric probability density, which is an example of a
problem with power-law complexity, and show that the proper Bayesian
formulation of this problem provides for the `Occam' factors that punish
overly complex models and thus allow one to learn not only a solution
within a specific model class, but also the class itself using the data
only and with very few a priori assumptions. We study a possible
information theoretic method that regularizes the learning of an
undersampled discrete variable, and show that learning in such a setup goes
through stages of very different complexities. Finally, we discuss how all
of these ideas may be useful in various problems in physics, statistics,
and, most importantly, biology.
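For the central definition in that thesis -- predictive information as the mutual information between past and future -- here is a minimal numerical sketch of my own (a one-step estimate for a binary Markov chain, not code from the thesis):

```python
# Estimate I(past; future) one step ahead for a sticky two-state Markov
# chain, from empirical pair counts. For p_stay = 0.9 and a symmetric
# stationary distribution, the true value is 1 - H2(0.9) ~ 0.531 bits.
import math
import random

random.seed(0)
p_stay = 0.9  # probability of repeating the previous symbol
x, seq = 0, []
for _ in range(200_000):
    seq.append(x)
    x = x if random.random() < p_stay else 1 - x

# Empirical joint distribution of (current symbol, next symbol).
joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
for a, b in zip(seq, seq[1:]):
    joint[(a, b)] += 1
n = sum(joint.values())
pj = {k: v / n for k, v in joint.items()}
px = {a: pj[(a, 0)] + pj[(a, 1)] for a in (0, 1)}
py = {b: pj[(0, b)] + pj[(1, b)] for b in (0, 1)}

# Mutual information in bits between adjacent symbols.
mi = sum(p * math.log2(p / (px[a] * py[b]))
         for (a, b), p in pj.items() if p > 0)
print(round(mi, 3))
```

A finite-state chain like this has predictive information that saturates at a constant; the qualitatively complex classes the thesis describes are the ones where it keeps growing as a power law in the length of the past.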

Jonathan Oppenheim, Michal Horodecki, and Ryszard Horodecki. Are there phase
transitions in information space? Phys. Rev. Lett., 90:010404, 2003.
Abstract: The interplay between two basic quantities -- quantum
communication and information -- is investigated. Quantum communication is
an important resource for quantum states shared by two parties and is
directly related to entanglement. Recently, the amount of local information
that can be drawn from a state has been shown to be closely related to the
non-local properties of the state. Here we consider both formation and
extraction processes, and analyze informational resources as a function of
quantum communication. The resulting diagrams in information space allow us
to observe phase-like transitions when correlations become classical.

Liam Paninski. Design of experiments via information theory. In Sebastian
Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural
Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
Abstract: We discuss an idea for collecting data in a relatively efficient
manner. Our point of view is Bayesian and information-theoretic: on any
given trial, we want to adaptively choose the input in such a way that the
mutual information between the (unknown) state of the system and the
(stochastic) output is maximal, given any prior information (including data
collected on any previous trials). We prove a theorem that quantifies the
effectiveness of this strategy and give a few illustrative examples
comparing the performance of this adaptive technique to that of the more
usual nonadaptive experimental design. For example, we are able to
explicitly calculate the asymptotic relative efficiency of the staircase
method widely employed in psychophysics research, and to demonstrate the
dependence of this efficiency on the form of the psychometric function
underlying the output responses.
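The greedy version of Paninski's idea fits in a few lines. This is my own toy rendition (the model, grid, and names are of my choosing, not from the paper): maintain a posterior over a 1-D threshold, and pick the stimulus whose binary response carries the most mutual information about it.

```python
# Greedy information-maximizing stimulus selection for a toy
# threshold-detection experiment with a sigmoidal psychometric function.
import math

def bernoulli_entropy(p):
    """Entropy in bits of a coin with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

thetas = [i / 10 for i in range(11)]         # candidate thresholds
posterior = [1 / len(thetas)] * len(thetas)  # uniform prior over thresholds
stimuli = [i / 10 for i in range(11)]        # candidate stimulus intensities

def p_yes(theta, x, slope=10.0):
    """P(response = 1 | threshold theta, stimulus x): a logistic curve."""
    return 1.0 / (1.0 + math.exp(-slope * (x - theta)))

def info_gain(x):
    """Mutual information between theta and the response at stimulus x."""
    p_marg = sum(w * p_yes(t, x) for w, t in zip(posterior, thetas))
    h_marg = bernoulli_entropy(p_marg)
    h_cond = sum(w * bernoulli_entropy(p_yes(t, x))
                 for w, t in zip(posterior, thetas))
    return h_marg - h_cond

best_x = max(stimuli, key=info_gain)
print(best_x)  # under this symmetric uniform prior, the central stimulus
```

After observing each response one would Bayes-update `posterior` and repeat; that adaptive loop is what the paper compares against fixed, nonadaptive designs such as the psychophysical staircase.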

(This next is off topic here, but see Nemenman's comment after the abstract)

Eliano Pessa and Giuseppe Vitiello. Quantum noise, entanglement and chaos in
the quantum field theory of mind/brain states.
Abstract: We review the dissipative quantum model of brain and present
recent developments related with the role of entanglement, quantum noise
and chaos in the model.

Comments: Either I just don't understand a thing, or WTF?

David H. Wolpert. On the Bayesian "Occam factors" argument for Occam's
razor. In T. Petsche et al., editors, Computational Learning and Natural
Learning Systems, volume 3. MIT Press, 1995.
Abstract: This paper discusses some of the problematic aspects of the
Bayesian first-principles "proof" of Occam's razor which involves Occam
factors. Although it is true that the posterior for a model is reduced due
to Occam factors if that model is capable of expressing many functions, the
phenomenon need not have anything to do with Occam's razor. This paper
shows this by i) performing reductio ad absurdum on the argument that the
Occam factors effect implies Occam's razor; ii) presenting an alternative
Bayesian approach which explicitly does not result in Occam's razor; and
finally iii) disentangling the underlying problem with viewing the Occam
factors argument as a proof or "automatic embodiment" of Occam's razor.

[Wolpert and Macready, 1997]. The paper's main argument may be boiled down
to the fact that Bayesian model selection assigns equal a priori
probabilities to model families, not individual models. This is true.
However, I believe in a simple world (otherwise Wolfram wins), so these
arguments are basically useless. On top of this, the last part of the paper,
which refers to "integration over all possible model families", is by now
outdated -- this has been done by [Bialek et al., 1996] and others.

David H. Wolpert and William G. Macready. No free lunch theorems for
optimization. IEEE Transactions on Evolutionary Computation, 1(1):67-82,
April 1997.
Abstract: A framework is developed to explore the connection between
effective optimization algorithms and the problems they are solving. A
number of "no free lunch" (NFL) theorems are presented that establish that
for any algorithm, any elevated performance over one class of problems is
exactly paid for in performance over another class. These theorems result
in a geometric interpretation of what it means for an algorithm to be well
suited to an optimization problem. Applications of the NFL theorems to
information theoretic aspects of optimization and benchmark measures of
performance are also presented. Other issues addressed include time-varying
optimization problems and a priori "head-to-head" minimax distinctions
between optimization algorithms, distinctions that can obtain despite the
NFL theorems' enforcing of a type of uniformity over all algorithms.

Comments: The paper proves various NFL theorems for optimization. It is
important to realize, however, that many of us believe that the world is
indeed simple in some sense. Thus we are willing to tolerate bad
performance on complex problems that do not occur in the real world, if we
perform well on the cases that are common. Universal learning people would
also agree with this, since only very few problems in the world are
computable and of low Kolmogorov complexity, and these, therefore, are of
high probability a priori.
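The NFL claim can even be checked by brute force on a tiny instance. The following is my own construction (for the special case of fixed, non-repeating probe orders, a simple instance of the algorithms the theorems cover): averaged over *every* objective function on a finite domain, two different search orders perform identically.

```python
# Brute-force NFL check: average the best value found in 2 probes over
# ALL functions f: {0,1,2,3} -> {0,1,2}, for two different fixed,
# non-repeating search orders. The averages come out identical.
from itertools import product

domain = [0, 1, 2, 3]

def best_found(order, f, budget=2):
    """Best objective value seen when probing points in a fixed order."""
    return max(f[x] for x in order[:budget])

orders = ([0, 1, 2, 3], [3, 1, 0, 2])  # two different deterministic searches

avgs = []
for order in orders:
    total = 0
    for values in product(range(3), repeat=len(domain)):
        f = dict(zip(domain, values))  # one of the 81 possible functions
        total += best_found(order, f)
    avgs.append(total / 3 ** len(domain))

print(avgs)  # both averages equal 13/9: no free lunch on this ensemble
```

Nemenman's comment above is then the escape hatch: the uniform average over all functions is exactly the assumption a believer in a simple world rejects.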

http://www.menem.com/ilya/digital_library/full.html

Another good site with both applied and theoretical papers on complexity is
Jim Crutchfield's:

http://cse.ucdavis.edu/~cmg/compmech/pubs.htm

Another good site is the Santa Fe Institute, which focuses on complexity
science:

http://www.santafe.edu/

-- Michael