$Id: article.pl,v 1.2 1999/07/14 07:06:42 ken Exp $

In most programming tasks, we humans carefully state the problem, carefully code an algorithm to solve the problem, and turn the computer loose to run the algorithm. Then we hope the correct solution pops out, and if it doesn't we send an irate bug report to the people who developed our programming language.

Genetic algorithms (GAs) offer a different approach; we still carefully state the problem, but we let the computer find a suitable algorithm as well as applying that algorithm to the problem to produce a solution. Generating algorithms programmatically usually means working with code as data, which has traditionally left it more in the realm of LISP and related languages. But Perl can also generate and evaluate code on the fly, so it is also capable of handling generalized GAs.

The core principle behind a GA is that of evolution. You start with a set
of random *organisms*, each of which is a program. You then run each of these programs and
determine their *fitness*, which is the degree to which they succeed at the required task. Once the
fitness of each is determined, you jump through some hoops to remove really
bad entries (natural selection), randomly permute some remaining entries
(mutation), and intermingle other entries (hot algorithmic sex).

After repeating this process through several generations, the population will begin to converge on a solution. Often, it is not quite the solution you were expecting, and therein lies lot of the fun in building a GA engine.

The genetic code we will use to describe our organisms has many parallels to genetic code found in real plants & animals. It also has a huge number of divergences. Draw analogies at your own risk.

The GA task we'll set before ourselves is to find algebraic expressions
that can generate a certain predetermined set of data points when applied
to a given set of input data. We'll call these algebraic expressions our
organisms, and we'll represent them as simple binary syntax trees composed
of *functions* and *terminals*. Each function has two branches representing its arguments, and each
argument can either be another function or a terminal. Terminals are
dead-end leaf nodes and are usually constants or one of the input
parameters in the problem. For instance, in an organism represented by the
algorithm `add(multiply(2,x),5)`

, we have the functions
`add()`

and `multiply()`

, and the terminals 2, x, and 5.

Each problem requires some twiddling by the Genetic Engineer. You need to determine a set of functions and terminals that will be part of your organisms' makeup. You might think of these as roughly equivalent to the various base pairs in DNA, or perhaps to various short sequences of base pairs. For the example we'll show here, we're trying to fit a function to the data:

-1407, -931, -577, -327, -163, -67, -21, -7, -7, -3, 23, 89, 213, 413, 707, 1113, 1649

...which just happens to be the output of 3x^3 + 2x^2 - x - 7 applied to each integer on the interval [-8,8]. To hit this (and we'll assume that we have no idea what the solution should look like) we'll use a function set that includes:

sin(x,y) log(x,y) mul(x,y) add(x,y)

Now we know that the real solution only needs the last two, so the first
two are there just to give the GA something to think about. We picked `sin(x,y)`

because some regions on the graph look like they might be sinusoidal. We
picked `log(x,y)`

just to drive up the computing time and make things interesting.

I know that `sin(x,y)`

and `log(x,y)`

look odd, since `sin()`

and
`log()`

traditionally only take one argument apiece. But by our definition, each
function in the tree has two branches. In the case of unary functions, we
simply throw the second argument away, but it is valuable to keep it in the
data structure as a mutation may change a `sin()`

to an `add()`

and suddenly make the second argument interesting. As the second argument
could be a whole tree of functions on its own, this could get very
interesting indeed.

So given these functions, we build an array of functions as strings. These will get eval'ed later (remember I said we need code as data? Here it is):

my @functions = ( # Format is 'function <pretty version>: <actual code>' 'function add($a,$b): ($a+$b)', 'function mul($a,$b): ($a*$b)', 'function sin($a,$b): sin($a)', 'function log($a,$b): do{my $_a=$a; $_a ? log(abs($_a)) : 0}', );

Notice that our `log()`

function is protected from zero or negative values. Any function that has
sensitive arguments will need to be appropriately protected from them as
there will be no chance to trap the errors outside of the evaluation. In
this case, we need to protect it in a rather obscure way from Perl's
constant-folding optimizer - try running the one-liner `0 ? log(abs(0)) : 0`

to see the problem.

Next we need terminals. For this exercise we have two: x (the parameter that we will vary over the interval of interest, i.e. the input to these algebraic expressions) and a constant between -5 and 5. We specify these as subroutine references:

my @terminals = ( sub { '$x' }, # The input parameter sub { '$x' }, sub { int(rand(20) - 10) }, # Some random number sub { int(rand(20) - 10) }, );

.....which return either the name of the parameter ('x') or a constant number generated at random.

Now, what does this code *look* like? Well, as we said, it's a syntax tree that looks something like so:

....which corresponds to `mul(add(sin(x,7),-5),mul(x,x))`

which in turn reduces to (sin(x)-5)x^2, which is, of course, wrong. Let's
hope evolution can help.

We'll represent each organism as a Perl object. If making little genetic organisms isn't a good opportunity to use object-oriented programming, I don't know what is.

Now, it's one thing to know what a bridge looks like, but designing and building a bridge is something else altogether. I think it's safe to say the same is true of organisms. We'll build ours recursively, with some sanity checks.

First we check against a maximum depth value and plug in only terminals
past this point. This keeps the tree from getting too crazy at the outset
(it will get crazy enough later). If we are inside the maximum depth, we
randomly select a function or terminal (with an arbitrary 2:1 weight
towards terminals). If we selected a function, then we call the organism's `_buildTree`

method again to get two more nodes to use as input to the function. And so
on.

sub _buildTree { my $self = shift; my $depth = shift || 0; my %tree;

$tree{contents} = ($depth > $maxdepth or int rand(3)) ? &{$terminals[rand @terminals]} : $functions[rand @functions];

if ($tree{contents} =~ /^function /) { $tree{'a'} = $self->_buildTree($depth + 1); $tree{'b'} = $self->_buildTree($depth + 1); }

return \%tree; }

All of this builds for us a hash of nodes, each of which has three
components: `$tree{contents}`

, which contains either a terminal value (constant or '$x' in this case) or
a function; `$tree{'a'}`

and
`$tree{'b'}`

are references to other nodes. If the content is a terminal, left and right
pointers are not generated.

Just generating random organisms is not enough. We need to rank them according to their fitness, so we can at least decide which to cull out. We also need to determine fitness so that we know when to stop: unlike real evolution, we are trying to hit a fixed target. Evolution, however, is a feedback mechanism, and is therefore designed to hit a moving target. This means that once we reach a perfect fit for the data, the algorithm will keep trying new things even though the current fit is perfect. This will result in the program oscillating around the correct answer, which doesn't help us any. If you are trying to find an algorithm to hit a moving target, you still need to know the fitness at each generation, though you will probably have to do some statistical work on your results in order to find the mean success rate over the changing target.

We calculate the fitness by averaging the difference between each fixed
data point and the corresponding result of the organisms' function (its
phenotype). Thus fitness is a non-negative number, and a fitness of zero
indicates a perfect organism. To calculate the output of the syntax tree,
we have a function called `fitness`

:

sub fitness { # Determine the fitness of an organism in this crazy world my ($org, @target) = @_; my $sumdiff = 0; foreach (0..$#target) { $sumdiff += abs($org->evaluate({'x'=>$_}) - $target[$_]); } return $sumdiff/@target; }

...which repeatedly calls the Organism's `evaluate`

method at points on the interval we are interested in. Think of the `evaluate`

method as the Organism's whole reason for existance; if the Organism were a
leech, the method would be called `suck_blood`

. If it were a gerbil, the method would be called `freak_out`

. The method simply applies the embedded behavior of the Organism to the
given input data. In this case, we evaluate a single-variable algebraic
expression for a given number.

The `evaluate`

method is just a simple front-end which calls Perl's
`eval`

on the result of the Organism's `expression`

method.

sub evaluate { my $self = shift; my $params = shift; my $x = $params->{'x'}; return eval $self->expression; }

sub expression { # Turn the syntax tree into a Perl expression my $self = shift; my $tree = shift || $self->{tree}; # Check the cache return $self->{expr} if defined $self->{expr}; local $_ = $tree->{contents}; # A shortcut for the current node if ( s/^function [^:]*: (.*)/$1/ ) { # Extract the perl expression s/\$([a-zA-Z]+)/$self->expression($tree->{$1})/ge; } $self->{expr} = $_ if $tree eq $self->{tree}; # A nasty trick return $_; }

Since `expression`

works on a recursive data structure, it's natural that it's a recursive
subroutine. If the current node represents a function, we scan the function
description and pull out the Perl expression that implements that function.
Then we replace any embedded variables ($a or $b) with the Perl expression
their syntax tree represents. If the current node is a terminal, we leave
it alone. Note that we had to go to a little bit of trouble to avoid
touching the `$_a`

variable in the logarithm function.

I happen to particularly like the way the `expression`

method combines its recursion and caching techniques. Building the Perl
code from the syntax tree is a pretty intensive process, and we don't want
to do it over and over again for the same Organism. So we cache our results
as We only want to put the result in the cache when we're done with the
work, i.e. when we exist from the topmost call to this recursive
subroutine. We detect whether that's the case by comparing the node we're
currently working on to the topmost node in $self's tree. If they're the
same, we cache and finish. The trick is to compare these two references as
strings. When a reference is used in a string context, it contains a
representation of the reference's memory address, so if the two references
evaluate to the same string, they're the same reference. Sneaky. Most of
the time this type of caching requires a wrapper function to handle the
caching, and a recursive function to do the real work.

So now that we know how to evaluate the success or failure of each Organism, we need to do something about it.

It won't be enough to just throw out the bottom half of the list and generate a new set of random organisms from scratch. This would be akin to trying to make a watch from a bucket of watch parts by shaking the bucket until a watch spontaneously assembles. Evolution doesn't work that way, and we don't have enough time to wait for pure randomness to cough up the algebra we want.

So what we do is rank the organisms by fitness and perform three operations on the list:

Cull some bad ones off of the bottom of the list

Mutate some percentage of the remainder by randomly changing a node on the syntax tree

Mate some individuals with some others to produce offspring with similar attributes.

Culling is simple: we just unshift the same number of organisms that we are
going to add by mating. In the case of our example, we have set the
parameter `$mates`

to 5, so we remove that many individuals. In point of fact what we *actually* do is mate and `pop`

five times. Same thing.

Mutating is also pretty straightforward: we grab an organism at random and mutate it with its mutate method:

sub mutate { my $self = shift;

my @index = $self->_treeIndex; %{ $index[rand @index] } = %{ $self->_buildTree($maxdepth-1) }; $self->clear_cache; }

This is a little deceptive. What it does is generate a list of the nodes in
the syntax tree with `_treeIndex`

, and then pick a random one and substitute a new randomized branch. This
mutation mechanism is pretty drastic, so we don't do it often. The
likelihood of improving an organism is very small, though it is important
to keep some random element of change happening within the population to
keep it from settling at some local maximum that fits the function well,
but not as well as we want.

The `clear_cache`

method simply clears the cached information we built in the `expression`

method, since it's no longer valid.

The index generator looks like so:

sub _treeIndex { # Generates a list of all the nodes in this tree. These are # references to the stuff in the object itself, so changes to the # elements of this list will change the object. my $self = shift; my $tree = shift || $self->{tree}; my @sofar = @_; # Dump the content nodes into a list if ($tree->{contents} =~ /^function /) { return(@sofar, $tree, $self->_treeIndex($tree->{'a'}, @sofar), $self->_treeIndex($tree->{'b'}, @sofar) ); } else { return(@sofar, $tree); } }

And naturally it is recursive as well. Maybe this article should have been about recursion instead, but we were confident that you already knew all about recursion.

Finally, we take the top *n* organisms and mate them with a random other organism from the list. Each
mating involves taking a tree index of each partner, selecting a random
point in each, and grafting the end of one list on to the beginning of the
other. This is known as a crossover permutation, which is similar to the
genetic scrambling that occurs in sexual reproduction.

The `mate`

method also uses `_treeIndex`

:

sub mate { my ($self, $partner) = @_; my $self_clone = dclone $self; # Get part of a node from $partner and stick it somewhere in $self_clone my @clone_index = $self_clone->_treeIndex; my @partner_index = $partner->_treeIndex; %{ $clone_index[rand @clone_index] } = %{ dclone $partner_index[rand @partner_index] }; $self_clone->clear_cache; return $self_clone; }

This is not so different from the mutation, except that we know that each
half of the new tree previously belonged to some relatively successful
individual, so the chances are higher that the result will also be fit than
if we did a random mutation. Note that we use
`dclone`

from the `Storable`

module to create deep copies of complex structures rather than write our
own cloning routine. This is because we are lazy.

When all of this is plugged into a framework that iterates over generations, tests results, and dumps the results on screen, we discover that there is an awful lot of incomprehensible output. This output needs to be examined by hand to determine what's going on, although I'm sure that someone out there is willing to donate routines to assist in the analysis. One run I did for 1000 generations came up with the following results.

By generation 50 the GA had stumbled on a 12 node equation that simplified to 3x^3, which gave it a mighty good fit over the complete interval, though this is deceptive as the method we use for testing fitness is inordinately happy with fitting the ends where the numbers are huge. A better fitness test would weight the targets evenly so that the fit in the middle would be better. Still, this gave us a substantially complete fit and showed that the GA could establish the order of the target polynomial very rapidly.

By generation 200 the GA had found the second term---the results reduced to 3x^3 + 2x^2. Again, given the fitness calculation it seemed unlikely that it would discover any further refinements of interest as the 'value' of such refinements was so small. This outlines the importance of selecting a fitness function that reflects your needs! Out fitness function basically said that we are interested primarily in matching the broad strokes of the curve without regard for the details, which is what we got.

By termination the GA had started fiddling with sines and natural logs in order to better fit the middle region of the graph. This dead end caused it to actually diverge from the real curve outside of the tested interval, while improving the fit inside -- while it better met our stated criteria, it was actually taking us further afield. Again, it shows how important it is to tell computers what you want rather than make them guess.

x polynomial gen 50 gen 200 gen 1000 -12* -4891 -5184 -4896 -4883.81 -11* -3747 -3993 -3751 -3743.32 -10* -2797 -3000 -2800 -2795.29 - 9* -2023 -2187 -2025 -2023.29 - 8 -1407 -1536 -1408 -1408 - 7 - 931 -1029 - 931 - 932.04 - 6 - 577 - 648 - 576 - 577.52 - 5 - 327 - 375 - 325 - 326.59 - 4 - 163 - 192 - 160 - 161.36 - 3 - 67 - 81 - 63 - 63.95 - 2 - 21 - 24 - 16 - 16.51 - 1 - 7 - 3 - 1 - 1.15 0 - 7 0 0 0 1 - 3 3 5 4.81 2 23 24 32 31.15 3 89 81 99 96.9 4 213 192 224 219.94 5 413 375 425 418.12 6 707 648 720 709.33 7 1113 1029 1127 1111.44 8 1649 1536 1664 1642.32 9* 2333 2187 2349 2319.85 10* 3183 3000 3200 3161.89 11* 4217 3993 4235 4186.33 12* 5453 5184 5472 5411.03

*outside the range used in calculating fitness

Overall, the convergence to our stated criteria was very rapid and effective, as we had very solid fitness levels and the order of the polynomial correctly established by generation 50.

An interesting observation relates to the offset. As the actual interval is [-8,8], but the tested interval is [0,16], any solution must be offset by -8. This is discovered very early in the run by the GA, as evidenced by this otherwise quite unfit solution:

add(-8,mul(mul(add(sin(-8,0),add(add(add(sin(x,x), add(add(add(sin(add(x,-8),x),add(add(x,add(0,x)),-8)),add(x,-8)), -8)),add(add(sin(log(-8,mul(x,x)),x),add(add(add(sin(add(add(0, add(0,x)),-8),x),add(add(x,x),-8)),add(x,-8)),-8)),add(x,-8))), -8)),log(add(x,-9),x)),x))

Note how often we see the function/terminals combination of
`add(x,-8)`

.

This GA engine could easily be extended to do different work. Simply
fitting functions to data is cute, but not by any stretch a limitation. By
altering the function set, the terminal set, and the evaluation process,
the GA can be used to generate behavioural algorithms for robots, control
system solutions, and so on. This is not to say it's going to be easy: your
evaluation routines will become very complex for non-arithmetic projects,
but it's still feasible. Examples from Koza's *Genetic Programming* include:

**Emulate food foraging behaviour in ants**-
Terminal set is MoveRandomly, MoveToNest, PickUp, and DropPheremone.

The function set contains custom routines IfFoodHere, IfCarryingFood, MoveToAdjacentFoodElse, MoveToAdjacentPheremoneElse, and Progn (which executes a list of instructions---I think we can just drop it and eval everything, but I haven't tried. You tell us.)

It's not instantly clear how to do this, but it should be possible. We look forward to hearing the results of your own experiments on the PerlAI mailing list!

**Find the Fourier Series for a given periodic function**-
Terminal set is just a constant from -10.000 to 10.000

The function set contains XSIN, XCOS, +, -, *, and %. XSIN and XCOS are two argument functions taking the coefficient and the harmonic (truncated to an integer).

There are plenty more, including ones that develop neural nets and other sophisticated decision-making or controlling algorithms. We recommend you pick up Koza's book to investigate further.

As research in genetic programming has advanced, some new mechanisms have been considered. One of the more interetsing ones is 'islanding'. This involves running more than one population in isolation from each other, and occasionally migrating some small number of the fittest individuals from one 'island' to another.

When you first run your GA, you will probably notice that the speed of convergence and the quality of the result depends quite heavily on the initial set of organisms and the nature of the first few mutations and matings. This is probably a flaw in our engine somewhere, but if it is fundamental to GAs, then islanding provides a way to occasionally introduce 'ideas' to each system that are both good and novel. In theory this should increase the fitness of each population.

You have probably also noticed that we have built the GA so that it can be distributed---because the algebraic expressions are implemented as strings to be evaluated rather than as references to subroutines, expensive runs can be designed so that the organisms are serialized and shipped to sepearate processors for the fitness runs. We haven't done this yet, so we're looking forward to either your results or your donations of hardware to facilitate the project. A farm of Sparcs would be fine.

Genetic Programming, John R. Koza, MIT Press

The Perl-AI mailing list: perlAI@netizen.com.au

Illinois Genetic Algorithms Lab (IlliGAl) at http://gal4.ge.uiuc.edu/