Talk:Evolution strategies
In general, the text is well done, BUT:
From the beginning and throughout the whole text, the plural of ES/EA should be written as ESs/EAs, which is not yet the case.
The optimization problem should be defined WITH inequality constraints \(g(y)\) in the beginning, later on as a multicriteria problem. ESs may be used in such cases, too.
The canonical form of an ES should be formulated as a \((\mu, \kappa, \lambda, \rho)\)-ES with \(\kappa = \infty\) resembling the so-called (elitist) plus-version and \(\kappa = 1\) resembling the so-called comma-version. \(\kappa\) may take any other value in between as well (i.e., the maximum number of reproduction cycles an ES individual takes part in during the iterations). \(\rho\) is a vector with two elements, one for use with the object variables, the other for the strategy variables, since they are often not taken as equal.
A \((\mu + \lambda)\)-ES should NOT be called an ES without recombination, because this is in contradiction to a vast literature in the field (similar arguments hold for the comma-version).
In connection with asynchronous parallel ES versions one should mention the predator-prey version of an ES.
The \(\sigma\) in the name of a self-adaptive ES should either be used consistently throughout the text or be omitted.
The factor 2 in the formula \(\tau = 1/\sqrt{2n}\) must be justified or omitted.
As far as I know, ESs with a covariance matrix were first proposed by Schwefel as early as 1974 (in English texts, 1981).
Whether a CMA-ES can be classified as nature-inspired and following Darwin's evolution principle is questionable, if not false. It resembles old-fashioned QP algorithms.
A truncation ratio \(\mu/\lambda\) of 1/2 is the death of \(\sigma\)-self-adaptation because it provokes an annealing process \(\sigma \to 0\) in nonlinear fitness landscapes. This ratio must be small enough to enable improvements of the offspring over their ancestors within one reproduction cycle!
The maximum entropy principle for integer (discrete) search spaces leads to geometrically distributed mutations. That should be mentioned, too.
Comments on the first reviewer's statements:
I have tried to incorporate the suggestions of the reviewer. Here are my comments on the reviewer's statements, which have been set as block quotes and italicized.
In general, the text is well done, BUT:
From the beginning and throughout the whole text, the plural of ES/EA should be written as ESs/EAs, which is not yet the case.
The use of "s" to denote the plural in ES and EA has been incorporated whenever it was necessary.
The optimization problem should be defined WITH inequality constraints \(g(y)\) in the beginning, later on as a multicriteria problem. ESs may be used in such cases, too.
Inequality constraints are already implicitly included in the first equation through the use of the set notation $y \in \mathcal{Y}$. This simplifies the notation. Furthermore, it has been mentioned in the text. As to its applicability to multi-objective optimization problems, an additional sentence has been added to the introductory part of the article.
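To make this explicit, the feasible set can be read as encoding the inequality constraints (a short illustrative note; the constraint functions \(g_i\) and their number \(N_c\) are illustrative symbols, not taken from the article):

\( \mathcal{Y} = \{ y \;|\; g_i(y) \le 0, \; i = 1, \ldots, N_c \} \)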
The canonical form of an ES should be formulated as a \((\mu, \kappa, \lambda, \rho)\)-ES with \(\kappa = \infty\) resembling the so-called (elitist) plus-version and \(\kappa = 1\) resembling the so-called comma-version. \(\kappa\) may take any other value in between as well (i.e., the maximum number of reproduction cycles an ES individual takes part in during the iterations). \(\rho\) is a vector with two elements, one for use with the object variables, the other for the strategy variables, since they are often not taken as equal.
While the reviewer is right in principle with his differentiation, these aspects are of a rather special nature. Furthermore, the (mu, kappa, lambda, rho) notation has been used only sporadically compared to the (mu/rho,+lambda) notation, which is widely used in the literature. It is not the task of this article to promote a notation that is not well established.
A \((\mu + \lambda)\)-ES should NOT be called an ES without recombination, because this is in contradiction to a vast literature in the field (similar arguments hold for the comma-version).
I stick to the well-established notation here. BTW, this notation has been confirmed by an expert team of the German Engineering Organization VDI/VDE GMA. The definitions used (they can also be found in an online version at http://www2.staff.fh-vorarlberg.ac.at/~hgb/ea-glossary/ea-terms-engl.html) have been confirmed by the founders of ES, Prof. Rechenberg and Prof. Schwefel. According to these definitions, (mu+lambda)-ESs are strategies without recombination. However, one has to contrast this with (mu+lambda)-selection. Of course, an EA with that type of selection may or may not have recombination. I have added specific comments in the conceptual algorithm definition to clarify the difference.
In connection with asynchronous parallel ES versions one should mention the predator-prey version of an ES.
Done; the Predator-Prey-ES has been added.
The \(\sigma\) in the name of a self-adaptive ES should either be used consistently throughout the text or be omitted.
I don't agree: self-adaptation is not restricted to the mutation strength sigma. That is why sigma is omitted in the conceptual ES algorithm, but used in the Example.
The factor 2 in the formula \(\tau = 1/\sqrt{2n}\) must be justified or omitted.
The choice of the factor can be justified, but this is beyond the scope of the article: As one can show, this factor is optimal for large search space dimensionalities on the sphere model (cf. Meyer-Nieberg and Beyer: "On the Analysis of Self-Adaptive Recombination Strategies: First Results", Proceedings of the CEC'05 Conference, pp. 2341-2348, 2005). BTW, this factor is also proposed (without a deeper justification) in various papers of Schwefel et al.
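For context, \(\tau\) is the learning parameter in the mutation strength update (a minimal sketch, assuming the standard log-normal \(\sigma\)-self-adaptation rule):

\( \sigma' = \sigma \, \exp\!\left(\tau \, \mathcal{N}(0,1)\right), \qquad \tau = \frac{1}{\sqrt{2n}} \)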
As far as I know, ESs with a covariance matrix were first proposed by Schwefel as early as 1974 (in English texts, 1981).
Right, and I did *not* claim anything against that. However, if you refer to the section on CMA-ES - an acronym with a special meaning - Ostermeier et al. are to be mentioned. The work of Schwefel has been acknowledged in the introduction. Schwefel's approach using correlated mutations, proposed in 1974, has not been widely used in the literature or in practice. That is why it is not mentioned explicitly.
Whether a CMA-ES can be classified as nature-inspired and following Darwin's evolution principle is questionable, if not false. It resembles old-fashioned QP algorithms.
This is not the point. It is presented in this chapter simply because of its wide dissemination and its superb performance, beating most (if not all) other evolutionary algorithms published so far on a standard testbed of real-valued parameter optimization tasks.
A truncation ratio \(\mu/\lambda\) of 1/2 is the death of \(\sigma\)-self-adaptation because it provokes an annealing process \(\sigma \to 0\) in nonlinear fitness landscapes. This ratio must be small enough to enable improvements of the offspring over their ancestors within one reproduction cycle!
This touches my own field of research :-) I've chosen my words carefully when writing "typical truncation ratios ... are from 1/7 to 1/2". There are of course extremes, to be mentioned below. However, the reviewer is wrong with his claim that a "truncation ratio of 1/2 is the death of sigma self-adaptation": Considering sufficiently small mutations, most fitness landscapes can be approximated locally by a hyperplane. Due to the symmetry of the Gaussian mutations, on average 50% of the mutations will be improvements, i.e., selecting the 50% best mutants (equivalent to a truncation ratio of 1/2) will ensure evolutionary improvements on average (an empirical check is sketched below, after the list). That is why the argument presented by the reviewer cannot be used to explain his (I guess empirical) observation that the sigma-SA-ES *may* fail for the truncation ratio 1/2. Actually, things are much more involved:
- on the sharp ridge test function, 1/lambda is the best truncation ratio (see Meyer-Nieberg and Beyer: "Mutative Self-Adaptation on the Sharp and Parabolic Ridge", Foundations of Genetic Algorithms FOGA 2007, in print); the same holds for noisy fitness functions with additive skewed noise (e.g., \(\chi_1^2\) noise) when considering the final steady state error (see Arnold and Beyer: "A General Noise Model and Its Effects on Evolution Strategy Performance", IEEE Transactions on Evolutionary Computation, 10(4):380-391, 2006)
- on noisy optimization problems with symmetric noise, the minimal final expected steady state error is obtained for the truncation ratio 1/2 (see Beyer and Arnold: "The Steady State Behavior of \((\mu/\mu_I, \lambda)\)-ES on Ellipsoidal Fitness Models Disturbed by Noise", Proceedings of GECCO'03, pp. 525-536, 2003)
- there are also test functions from the field of robust optimization, the so-called "functions with noise-induced multi-modality", where the smallest final error is obtained for a truncation ratio > 1/2 (see Beyer and Sendhoff: "Evolutionary Algorithms in the Presence of Noise: To Sample or Not to Sample", Proceedings of the First IEEE Symposium on Foundations of Computational Intelligence (FOCI'07), 2007, in print)
The only thing we can say definitely is that the truncation ratio must be greater than zero and less than 1.
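As announced above, the hyperplane argument can be checked empirically. A minimal Matlab sketch, purely illustrative and not part of the article:

n = 10; sigma = 0.3; trials = 100000;    % dimension, mutation strength, no. of trials
y = randn(n,1);                          % an arbitrary parent
F = @(x) sum(x);                         % linear fitness (hyperplane model), to be maximized
improvements = 0;
for k = 1:trials
  offspring = y + sigma*randn(n,1);      % isotropic Gaussian mutation
  improvements = improvements + (F(offspring) > F(y));
end
disp(improvements/trials)                % prints a value close to 0.5, independent of sigma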
The maximum entropy principle for integer (discrete) search spaces leads to geometrically distributed mutations. That should be mentioned, too.
Even though this is a rather special item, I've now added it to the text.
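For the record, a minimal Matlab sketch of such geometrically distributed mutations (illustrative only; it uses the common construction of a symmetric integer mutation as the difference of two geometric random variables):

p = 0.3;                                      % parameter of the geometric distribution
geo = @(m) floor(log(rand(m,1))/log(1-p));    % geometric random numbers 0, 1, 2, ...
lambda = 10;
z = geo(lambda) - geo(lambda);                % symmetric, geometrically distributed integer mutations
disp(z')                                      % these would be added to the integer object parameters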
=====================================
2nd reviewer:
I have 3 main suggestions and some smaller ones: 1. use usual characters as often as possible 2. do not include fitness F in the individuals 3. explicitly discuss survivor selection
1. A paper printout looks awkward because the formulas print blurry. Formulas are OK if really needed, for summation etc., but not in the running text if normal characters can be used too, e.g., for y, F(y), etc.
2. Mutation and recombination concern only inheritable material, the genome. The fitness value of a child is not determined by mutation/recombination, but by evaluating the new individual. Most descriptions of EAs do include an explicit evaluation step, for a good reason. Please adhere to this convention; you will make things more clear this way.
3. I tried to answer the simple question whether survivor selection is deterministic or stochastic in ES. Based on the general scheme, I could not. Only in the mu/1+lambda example did I find an indication, "taking the best mu". The same hint is in the description of the mu/rho+1 example. Alas, it is phrased differently as "the worst individual is removed", so it is not immediately clear that it is the same principle. Yet later, truncation is named, hinting at deterministically choosing the best, but this is not unambiguous; one has to interpret this term.
4. In the 1st section you correctly say mu denotes the number of parents. Please note here that mu is called the population size in all other branches of EC.
5. After formula (I): it represents step 2 only; step 3 is not covered.
6. Formula (II): mu should be rho, shouldn't it? And why is m;lambda the BEST m offspring? This contradicts the general scheme where parents are chosen randomly.
7. breeder --> breeders
8. global optimizer --> global optimum
9. why do you need the term multi-recombination?
10. permutation example: why do you need the index m? And (see 6.) selecting the best as parents is not what the general scheme tells us.
Comments on the second reviewer's statements:
Here are my comments on the reviewer's statements, which have been set as block quotes and italicized.
I have 3 main suggestions and some smaller ones: 1. use usual characters as often as possible 2. do not include fitness F in the individuals 3. explicitly discuss survivor selection 1. A paper printout looks awkward because the formulas print blurry. Formulas are OK if really needed, for summation etc., but not in the running text if normal characters can be used too, e.g., for y, F(y), etc.
I agree that it may look awkward, and I also experimented with your suggestion. But mixing both styles, LaTeX and standard text, produced even worse results. A lot of symbols are not available, and mixing, e.g., a standard-text F(y) with a LaTeX-generated formula in the text looks rather ugly. Also, on some browsers, Greek letters are not available ...
2. Mutation and recombination concern only inheritable material, the genome. The fitness value of a child is not determined by mutation/recombination, but by evaluating the new individual. Most descriptions of EAs do include an explicit evaluation step, for a good reason. Please adhere to this convention; you will make things more clear this way.
I'm not sure whether I've got you right. If you refer, however, to the inclusion of the F(y) value in the individual, this is a consistent step which simplifies the description of the ES algorithms, and it also improves the readability of ES implementations using object-oriented programming languages.
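A minimal Matlab sketch of this design decision (illustrative only, in the style of the article's Matlab code): each individual is a struct that carries its object parameters, its strategy parameter, and its fitness value.

n = 5;
Individual.y     = randn(n,1);             % object parameters (column vector)
Individual.sigma = 1;                      % mutation strength (strategy parameter)
Individual.F     = sum(Individual.y.^2);   % fitness stored with the individual (here: sphere model)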
3. I tried to answer the simple question whether survivor selection is deterministic or stochastic in ES. Based on the general scheme, I could not. Only in the mu/1+lambda example did I find an indication, "taking the best mu". The same hint is in the description of the mu/rho+1 example. Alas, it is phrased differently as "the worst individual is removed", so it is not immediately clear that it is the same principle. Yet later, truncation is named, hinting at deterministically choosing the best, but this is not unambiguous; one has to interpret this term.
I did not expect a problem here, because already in the 4th line of the section "The Canonical ES Versions" I mentioned that selection is done deterministically. Anyway, I have italicized "deterministically" and have added some parenthetical remarks in order to clarify things. Now things should be clear.
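To spell the principle out, a minimal Matlab sketch of deterministic truncation selection (illustrative only): the mu best of the lambda offspring are selected; no random numbers are involved in this step.

mu = 3; lambda = 12;
OffspringF = randn(lambda,1);              % fitness values of the lambda offspring
[~, ranking] = sort(OffspringF);           % ascending fitness ranking (minimization)
ParentIndices = ranking(1:mu);             % deterministically take the mu best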
4. In the 1st section you correctly say mu denotes the number of parents. Please note here that mu is called the population size in all other branches of EC.
Yes, that is one of the differences from other (standard) EAs.
5. After formula (I): it represents step 2 only; step 3 is not covered.
It is covered: (I) represents a very condensed description of the ES algorithm. The angular brackets in (I), which are explained in (II), also contain the selection process, i.e., choosing the mu best offspring individuals.
6. Formula (II): mu should be rho, shouldn't it?
In general you are right, but we consider here the example where rho=mu.
And why is m;lambda the BEST m offspring? This contradicts the general scheme where parents are chosen randomly.
Again, if rho=mu, all mu offspring are used (there is no random selection from the parental pool at all). As to "why is m;lambda the BEST": this is not what is written in the text. There you find that "m;lambda denotes the mth best", i.e., that individual which is at the mth position in the fitness ranking.
7. breeder --> breeders
Corrected.
8. global optimizer --> global optimum
I know this objection very well. But I want to stick to the standard terminology used in the mathematical programming and optimization literature. There, the state y^* that optimizes the function f(y) is referred to as the "optimizer". The optimum is the function value f(y^*).
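In symbols, restating this terminology: the optimizer is the state \( y^{*} = \arg\,\mbox{opt}_{y} \, f(y) \), whereas the optimum is the value \( f^{*} = f(y^{*}) \).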
9. why do you need the term multi-recombination?
This is used to emphasize that we often deal with more than 2 parents in the recombination process (I've now rearranged the parenthetical comment to introduce this term explicitly).
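A minimal Matlab sketch of multi-recombination (illustrative only; shown for intermediate recombination, where the recombinant is the centroid of the rho parents):

n = 5; rho = 4;                            % search space dimension, mixing number
ParentsY = randn(n, rho);                  % the rho parents' object parameters as columns
Recombinant.y = mean(ParentsY, 2);         % intermediate multi-recombination: centroid of all rho parents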
10. permutation example: why do you need the index m? And (see 6.) selecting the best as parents is not what the general scheme tells us.
I'm not sure whether I've got you right. Here we have an example where individuals are chosen randomly from the parental+offspring set (of the last generation). However, due to the ";" notation and the range of m in 1 ... mu, only individuals from the set of the mu best individuals are drawn. This is reflected by the y_{m;mu+lambda} notation.
=====================================
3rd reviewer:
Very good introduction to ES. The author gives a short and complete overview of the topic.
Here are some minor comments and suggestions:
- The \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES does not adapt the optimal step sizes in the static case for \(\mu>1\) (in contrast to cumulative step size adaptation). Since this self-adaptation scheme is very simple, the author should not encourage the reader to use it for \(\mu>1\). Perhaps he can change the title to \((1, \lambda)\)-\(\sigma\)-Self-Adaptation-ES.
- In the Matlab implementation of the \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES:
The following line gives errors in Matlab:
OffspringIndividual.y = Recombinant.y + OffspringIndividual.sigma * randn(1,n); % mutate object parameter
it should be
OffspringIndividual.y = Recombinant.y + OffspringIndividual.sigma * randn(n,1); % mutate object parameter
- The performance of an ES on a specific problem class also depends crucially on the design of the adaptation scheme.
- Not only the variation operators should respect the principle of strong causality. The goal function, the representation (encoding), and the variation operators should all respect the principle of strong causality. For many real-valued optimization problems it is easier to change the representation than the variation operators.
- It should be stated that a simplified form of the CMA-ES is presented.
- Perhaps the author can include some statements about the time complexity of an ES.
- I have corrected the links so that they now lead to the current webpages.
Comments on the third reviewer's statements:
Here are my comments on the reviewer's statements, which have been set as block quotes and italicized.
Very good introduction to ES. The author gives a short and complete overview of the topic. Here are some minor comments and suggestions: - The \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES does not adapt the optimal step sizes in the static case for \(\mu>1\) (in contrast to cumulative step size adaptation). Since this self-adaptation scheme is very simple, the author should not encourage the reader to use it for \(\mu>1\). Perhaps he can change the title to \((1, \lambda)\)-\(\sigma\)-Self-Adaptation-ES.
I never claimed that the \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES adapts optimal step sizes in the (artificial) static case. Actually, I refrained from providing any optimality claims, since optimality always depends on performance definitions and the respective test scenarios. For example, the optimality of the cumulative step size adaptation (CSA) mentioned by the Reviewer does not hold on the noisy sphere test function (cf. Beyer and Arnold: "Qualms Regarding the Optimality of Cumulative Path Length Control in CSA/CMA-Evolution Strategies", J. Evolutionary Computation, 11(1):19-28, 2003). Actually, one can make the \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES provide optimal performance in a given test scenario by fine-tuning the learning parameter \(\tau\); however, one can always find other test scenarios where this \(\tau\) provides bad performance. The same holds for CSA. The no-free-lunch theorem also holds for ES algorithms.
* In the Matlab implementation of the \((\mu/\mu_I, \lambda)\)-\(\sigma\)-Self-Adaptation-ES: The line
OffspringIndividual.y = Recombinant.y + OffspringIndividual.sigma * randn(1,n); % mutate object parameter
gives errors in Matlab; it should be
OffspringIndividual.y = Recombinant.y + OffspringIndividual.sigma * randn(n,1); % mutate object parameter
Thanks! I have corrected this.
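For readers wondering why the orientation matters, a short illustrative restatement of the fix: the object parameter vector is kept as an n-dimensional column vector, so the mutation vector must be a column vector, too.

n = 5;
Recombinant.y = zeros(n,1);                % object parameters as an n-dimensional column vector
OffspringIndividual.sigma = 1;
% randn(n,1) matches the column orientation of Recombinant.y;
% randn(1,n) would cause a dimension mismatch (or an unintended n-by-n result):
OffspringIndividual.y = Recombinant.y + OffspringIndividual.sigma * randn(n,1); % mutate object parameter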
* The performance of an ES on a specific problem class also depends crucially on the design of the adaptation scheme.
Right, I have updated the respective sentence in order to account for the adaptation issue.
* Not only the variation operators should respect the principle of strong causality. The goal function, the representation (encoding), and the variation operators should all respect the principle of strong causality. For many real-valued optimization problems it is easier to change the representation than the variation operators.
This is already covered in the fifth item of the section on "Variations on ESs and Operator Design Principles".
* It should be stated that a simplified form of the CMA-ES is presented.
Done!
* Perhaps the author can include some statements about the time complexity of an ES.
This would be a rather long section to be added. Since this is an introductory article I would rather abstain from presenting this stuff here.
* I have corrected the links so that they now lead to the current webpages.
Ok, I assume that the Reviewer is more up-to-date with that information than I am ...