Characterization of a Class of Sigmoid Functions With Applications to Neural Networks

Abstract: We analyze the behavior of important classes of sigmoid functions, called simple and hyperbolic sigmoids, instances of which are extensively used as node transfer functions in artificial neural networks. We obtain a complete characterization for the inverses of hyperbolic sigmoids using Euler's incomplete beta functions, and describe composition rules that illustrate how such functions may be synthesized from others. We obtain power series expansions of hyperbolic sigmoids, and suggest procedures for obtaining the coefficients of the expansions. For a large class of node functions, we show that the continuous Hopfield net equations can be reduced to Legendre differential equations. Finally, we show that a large class of feedforward networks represent the output function as a Fourier series sine transform evaluated at the hidden layer node inputs, thus extending an earlier result due to Gallant and White.


Introduction
Sigmoid functions, whose graphs are "S-shaped" curves, appear in a great variety of contexts, such as the transfer functions in many neural networks.¹ Their ubiquity is no accident; these curves are among the simplest non-linear curves, striking a graceful balance between linear and non-linear behavior. Figure 1 shows three sigmoidal functions and their inverses: the hyperbolic tangent tanh(·) (graph 'A'), the "logistic" sigmoid 1/(1 + exp(−x)) (graph 'B'), and the "algebraic" sigmoid x/√(1 + x²) (graph 'C'), with inverses tanh⁻¹(y), ln(y/(1 − y)), and y/√(1 − y²), respectively. In a few cases, sigmoid curves can be described by formulae; this rubric includes power series expansions (e.g., the hyperbolic tangent), integral expressions (e.g., the error function), compositions of simpler functions (e.g., the Gudermannian function), inverses of functions definable by formulae (e.g., the "complexified" Langevin function, a sigmoid defined as the inverse of the function 1/x − cot(x)), differential equations, et cetera.
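As a quick concreteness check (ours, not part of the paper), the three sigmoid/inverse pairs of Figure 1 can be verified numerically:

```python
# Numeric check (illustrative, not from the paper): the three sigmoids of
# Figure 1 and their stated inverses really are inverse pairs.
import math

sigmoids = {
    "tanh":      (math.tanh, math.atanh),
    "logistic":  (lambda x: 1.0 / (1.0 + math.exp(-x)),
                  lambda y: math.log(y / (1.0 - y))),
    "algebraic": (lambda x: x / math.sqrt(1.0 + x * x),
                  lambda y: y / math.sqrt(1.0 - y * y)),
}

for name, (f, f_inv) in sigmoids.items():
    for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
        assert abs(f_inv(f(x)) - x) < 1e-9, name
```

Note that the logistic sigmoid maps onto (0, 1) while the other two map onto (−1, 1); this is the translation/scaling difference made precise by identity (1.1) below.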
Although the level of abstraction in many problems is such that one does not need to work with explicit formulae², it is useful to study networks with specific transfer functions, for the following reasons:

1. In determining whether a single layered feedforward net is uniquely determined by its corresponding input-output map, Sussmann's elegant proof of uniqueness specifically used the properties of the tanh(·) function [35]. A later analysis by Sontag obtained the same results with fewer assumptions on the node transfer function, but still requires such functions to be odd and to satisfy certain "independence" properties [1]. With respect to the uniqueness problem, all node transfer functions are not equivalent³.

2. Without tractable analytical forms to work with, many problems relating to sigmoids are resistant to theory. Neural net theory offers many examples. For example, there have been claims in the literature about the advantage (with respect to computability, training times, etc.) of certain sigmoidal transfer functions over others in backpropagation networks [8,17,33]. Some theoretical support comes from considering the first derivatives (if defined) of the various transfer functions proposed; the first derivatives are partially responsible for controlling the step size in the weight adjustment phase of the backpropagation algorithms, which in turn influences the rate of convergence. Explicit expressions for sigmoids are useful in such considerations.

3. The dynamical system describing the continuous Hopfield model raises an intriguing query.
If one assumes a tanh(·) node transfer function, one can show that the Hopfield model is transformable to the Legendre differential equation (see Section 6.1). An important question is whether this relationship is robust with respect to the choice of the transfer function.

4. The recent study of sigmoidal derivatives by Minai and Williams [26] is another case in point; they derived a connection with Eulerian numbers [15, pp. 252-257], but restricted their inquiry to the very specific logistic sigmoid. Any generalization of their results requires a careful look at sigmoids representable by formulae.
5. There are other related issues. For instance, the hyperbolic tangent and logistic sigmoid are essentially equivalent, in that one can be obtained from the other by simple translation and scaling transformations:

1/(1 + exp(−x)) − 1/2 = (1/2) tanh(x/2)    (1.1)

Many sigmoids have power series expansions which alternate in sign. Many have inverses with hypergeometric series expansions. On the other hand, many sigmoids have no such simple forms, or obvious connections with well known sigmoids. It is natural to ask whether these varied analytical expressions for sigmoids have anything in common. It is difficult to answer such questions without a thorough understanding of the analytical expressions for sigmoid functions.

In view of these considerations, this paper undertakes a study of two classes of sigmoids: the simple sigmoids, defined to be odd, asymptotically bounded, completely monotone functions in one variable, and the hyperbolic sigmoids, a proper subset of the simple sigmoids and a natural generalization of the hyperbolic tangent. The class of hyperbolic sigmoids includes a surprising number of well known sigmoids. The regular structure of the simple sigmoids often makes a theory tractable, paving the way for more general analysis.
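Identity (1.1) is elementary but easy to get wrong by a sign or scale factor; a quick numerical check (ours):

```python
# Verification (ours) of identity (1.1): the logistic sigmoid is a
# translated and scaled hyperbolic tangent,
#   1/(1 + e^(-x)) - 1/2 = (1/2) tanh(x/2).
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-10.0, -1.0, 0.0, 0.3, 5.0]:
    assert abs((logistic(x) - 0.5) - 0.5 * math.tanh(x / 2.0)) < 1e-12
```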
The main contributions of the paper are as follows: simple and hyperbolic sigmoids and their inverses are completely characterized in Sections 4 and 5.
Using series inversion techniques, in Section 5 we obtain the series expansions of hyperbolic sigmoids from those of their inverses. These results extend the results of Minai and Williams [26] for the logistic function.
In Section 4, we study the composition of simple sigmoids via differentiation, addition, multiplication, and functional composition. These results also completely specify the relationship between Euler's incomplete Beta function and the parameterized sigmoids. In Section 6.1 we show that the continuous Hopfield equations belong to the class of nonhomogeneous Legendre differential equations if the neural transfer function is a simple sigmoid. In Section 6.2 we establish a connection between Fourier transforms and feedforward nets with one summing output and one hidden layer whose nodes contain simple sigmoidal transfer functions. We do not purport to have discovered a general framework to describe all sigmoids; indeed, such a quest is largely meaningless; nor are we arguing for limiting the notion of sigmoids to the classes considered in this paper. Simple sigmoids are rather special sigmoids, but their regular structure often makes a theory tractable, paving the way for more general analysis.

Preliminaries
Notation: ℝ and ℝ⁺ denote the set of real numbers and the set of positive real numbers, respectively. (a, b) and [a, b] denote the open and closed intervals from a to b. If A is a set, then |A| is the cardinality of A. Given a function f, its domain and range are denoted by Dom(f) and Ran(f), respectively. f^(k) refers to the k-th derivative of f (if it exists). Occasionally, we shall use f′(x) in place of f^(1)(x). If a function f(·) is k times continuously differentiable on a given interval I, then we write f ∈ C^k(I).

Definition 2.1 (Real Analyticity) Let U ⊆ ℝ be an open set. A function f : U → ℝ is said to be real analytic⁴ at x₀ ∈ U if the function may be represented by a convergent power series on some interval of positive radius centered at x₀, i.e., f(x) = Σ_{j=0}^∞ a_j (x − x₀)^j. The function is said to be real analytic on V ⊆ U if it is real analytic at each x₀ ∈ V.

Definition 2.2 (Monotonicity) A function f : ℝ → ℝ is absolutely monotonic in (a, b) if it has non-negative derivatives of all orders there, i.e., f ∈ C^∞((a, b)) and f^(k)(x) ≥ 0 for a < x < b, k = 0, 1, 2, .... Analogously, f is completely monotonic in (a, b) iff f ∈ C^∞((a, b)) and (−1)^k f^(k)(x) ≥ 0 for a < x < b, k = 0, 1, 2, .... A function f : ℝ → ℝ is completely convex in (a, b) iff f ∈ C^∞((a, b)) and, for all non-negative k and x ∈ (a, b), (−1)^k f^(2k)(x) ≥ 0.
A fundamental property of absolutely monotone and completely monotone functions is that they are necessarily real analytic on their domains (S. Bernstein's theorem⁵ [12, pp. 184]). Additionally, if f is absolutely monotone on an interval I ⊆ ℝ, then it is non-negative, non-decreasing, convex, and continuous on I.
Lemma 3.1 A function f ∈ C^∞((0, 1)) is completely monotone in (0, 1) iff it has an alternating power series expansion convergent in (0, 1).

Proof⁸: If f is completely monotone in (0, 1), then the power series expansion of f in (0, 1) has to be alternating (because (−1)^k f^(k) ≥ 0). On the other hand, consider an alternating power series f(x) converging for all 0 < x < 1, and its derivatives:

f(x) = a₀ − a₁x + a₂x² − a₃x³ + ⋯,  aᵢ ≥ 0  (0 < x < 1)    (3.1)

From real analysis we know that each of the series (−1)ⁿ f^(n)(x) has the same convergence properties as Equation (3.1). Also, the sum of a convergent infinite alternating series is always less than or equal to its first term. This fact, along with the above equations, implies that (−1)^k f^(k)(x) ≥ 0, i.e., f(x) is completely monotone on (0, 1).
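The alternating-series criterion can be checked concretely for the hyperbolic tangent. Below (our check, not the paper's) we generate the Maclaurin coefficients of tanh exactly from the differential equation tanh′ = 1 − tanh², and confirm that the odd coefficients alternate in sign, as the lemma requires of tanh(x)/x:

```python
# Sketch (ours): exact Taylor coefficients of tanh from tanh' = 1 - tanh^2,
# i.e. (n+1) a_{n+1} = [n == 0] - sum_{i+j=n} a_i a_j, then a sign check.
from fractions import Fraction

N = 15
a = [Fraction(0)] * (N + 1)   # tanh(x) = sum a[n] x^n
a[1] = Fraction(1)
for n in range(1, N):
    conv = sum(a[i] * a[n - i] for i in range(n + 1))
    a[n + 1] = -conv / (n + 1)

odd = [a[2 * k + 1] for k in range((N - 1) // 2 + 1)]
# the odd coefficients alternate: +, -, +, -, ...
assert all((-1) ** k * odd[k] > 0 for k in range(len(odd)))
assert odd[1] == Fraction(-1, 3) and odd[2] == Fraction(2, 15)
```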
Corollary 3.1 σ(x)/x is a completely convex function in (0, 1) iff σ(√x)/√x is a completely monotone function in (0, 1).
Proof: σ(x)/x is an even function, implying that its power series expansion will consist only of even powers of x, which alternate in sign. From Lemma 3.1, σ(√x)/√x will hence be completely monotone in (0, 1). The same argument suffices for the converse.
If a simple sigmoid is also strictly increasing, then a much stronger statement can be made, as demonstrated by the following proposition.
Remark 3.2 Since a simple sigmoid σ has two horizontal asymptotes, its inverse σ⁻¹ (if it exists) will have two vertical asymptotes (i.e., lim_{y→±1} σ⁻¹(y) → ±∞). It will be seen that, as they have been defined, sigmoids and their inverses are quite similar; both are odd, increasing, univalent, analytical functions. However, the two differ fundamentally in that sigmoids are asymptotically bounded, while their inverses are not.
Simple sigmoids encompass many of the often used sigmoids described by formulae. The hyperbolic tangent and its close relative, the "exponential" or logistic sigmoid, are often used in many neural network theoretical studies and applications. For example, most of the spin-glass models of the Hopfield net use the hyperbolic tangent.⁹ The hyperbolic tangent has, among others, the following properties: 1. It is an odd, strictly increasing analytical function, asymptotically bounded by the lines y = ±1.
2. Its inverse tanh⁻¹(y) has a GH expansion given by yF(1; 1/2; 3/2; y²). 3. The first derivative of tanh⁻¹(y) is given by 1/(1 − y²) = ₁F₀(1; ; y²), i.e., the GH expansion of the first derivative of tanh⁻¹(y) depends on only one numeratorial parameter. It can be shown that many other simple sigmoids, such as Elliot's sigmoid [8], the Gudermannian (Section 4.2), etc., also have inverses with classical GH series representations.¹⁰ The function tanh⁻¹(y)/y satisfies a second order linear homogeneous differential equation with three regular singular points, located at 0, 1 and ∞. A sigmoid with a similar analytical behavior could be expected to have an inverse that is a solution to some second order Fuchsian equation¹¹. Since any second order Fuchsian equation with three singularities can be transformed into the Gauss hypergeometric differential equation, one solution of which is the classical GH series (Klein-Bôcher theorem) [37, pp. 203], it follows that the inverses would have classical series expansions. These considerations motivate the following definition.
2. Let σ⁻¹ : (−1, 1) → ℝ denote the inverse of σ, and (σ⁻¹)′ its first derivative. Then, (a) σ⁻¹(y)/y has a Gauss hypergeometric series expansion in y² with at most three parameters. (b) (σ⁻¹)′(y) has a Gauss hypergeometric series expansion in y² with at most one parameter.

Characterization: Inverse hyperbolic sigmoids
The following result is a complete characterization of the inverses of hyperbolic sigmoids. Proofs are presented in the appendix.

⁹ Stochastic versions of neural nets often start by replacing a set of deterministic state assignment rules by probabilistic ones, obtained from some distribution (usually the Gibbsian distribution; e.g., Boltzmann machines, stochastic Hopfield models, etc.). Computing expected values for the states of the system then leads to the hyperbolic tangent function. See Hertz et al. for a typical example [18, pp. 28]. ¹⁰ The phenomenon is not unduly surprising. A heuristic argument may be given as follows: if the graphs of two functions "look" the same, their respective differential equations are usually members of the same family. ¹¹ Fuchsian equations are linear differential equations each of whose singular points is regular [31, pp. 143-168]. tanh⁻¹(x)/x satisfies such an equation.

Theorem 4.1 (Inverses) Let y = σ(x) be a hyperbolic sigmoid, and let σ⁻¹ : (−1, 1) → ℝ be its inverse. Then, either

σ⁻¹(y) = yF(α; 1/2; 3/2; y²),  α ≥ 1,

or

σ⁻¹(y) = yF(α; −; −; y²) = y / (1 − y²)^α,  α > 0,

where, by F(α; −; −; y²), we mean F(α; β; β; y²) (β ∈ ℝ).
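Both families in the theorem can be sanity-checked numerically (our check, not part of the paper): for α = 1 the first form gives tanh⁻¹, and the second form rests on the elementary identity F(α; β; β; z) = (1 − z)^(−α), independent of β:

```python
# Numerical checks (ours) for the two families of inverse hyperbolic
# sigmoids, via partial sums of the Gauss hypergeometric series.
import math

def hyp2f1(a, b, c, z, terms=300):
    s, t = 0.0, 1.0            # t = (a)_k (b)_k z^k / ((c)_k k!)
    for k in range(terms):
        s += t
        t *= (a + k) * (b + k) * z / ((c + k) * (k + 1))
    return s

# first family, alpha = 1: y F(1; 1/2; 3/2; y^2) = tanh^{-1}(y)
for y in [0.1, 0.5, 0.9]:
    assert abs(y * hyp2f1(1.0, 0.5, 1.5, y * y) - math.atanh(y)) < 1e-7

# second family: F(alpha; beta; beta; z) = (1 - z)^(-alpha), any beta
for alpha in [0.5, 1.0, 2.3]:
    for beta in [0.7, 4.0]:
        assert abs(hyp2f1(alpha, beta, beta, 0.36) - 0.64 ** (-alpha)) < 1e-9
```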
Notation: Each inverse hyperbolic sigmoid is denoted by σ⁻¹ and is characterized by a single parameter α.
Corollary 4.1 The set of hyperbolic sigmoids is a proper subset of the set of simple sigmoids.
A proof of Corollary 4.1 may be given along the following lines. If σ is a hyperbolic sigmoid, then it is simple on the interval (−1, 1): from Theorem 4.1, the series representation of its inverse in (−1, 1) has non-negative coefficients, and this implies σ⁻¹(y)/y is absolutely monotone (Proposition 3.1). Hence σ(x)/x is completely monotone, and therefore σ is simple (Lemma 3.1 and Remark 3.1). The converse is not true: simple sigmoids need not be hyperbolic. The error function erf(·) is simple, but one can use Carlitz's study of the function to show that it does not have an inverse representable by a classical hypergeometric series [4]. It follows that erf(·) is not a hyperbolic sigmoid, and hence the set of hyperbolic sigmoids is a proper subset of the set of simple sigmoids.
For specific values of its parameters, the hypergeometric function often reduces to other well known special functions. The relationship between hyperbolic sigmoids and the incomplete Beta function also makes explicit the relationship between tanh⁻¹(·) and inverse hyperbolic sigmoids of the form yF(α; 1/2; 3/2; y²). The fundamental role played by the hyperbolic tangent is once again evident; here, it relates the two types of hyperbolic sigmoids defined by Equations (4.1) and (4.2).
¹² Equation (4.13), with K = 1, provides an amusing application of Fermat's last theorem; if we accept that for all n > 2 there cannot exist positive integers a, b and c satisfying the identity aⁿ + bⁿ = cⁿ, then we may conclude that the sum of inverse hyperbolic sigmoids with different integral parameters cannot be an inverse hyperbolic sigmoid with an integral parameter.
In the case g(y) = y, we obtain the characterization for inverse hyperbolic sigmoids. Another interesting special case is g(y) = σ⁻¹(y), where σ⁻¹(y) is an inverse hyperbolic sigmoid (since σ⁻¹(y) is an injective, smooth, odd, increasing function, the conditions of the theorem are satisfied). The elementary composition rules presented here allow the generation of an infinite variety of inverse hyperbolic sigmoids¹³. The next section presents some examples.

Examples
Any function of the form y/(1 − y²)^α, where α > 0, is the inverse of a hyperbolic sigmoid. For example, for α = 1/2, the function y/√(1 − y²) is the inverse of the hyperbolic sigmoid x/√(1 + x²). Of all inverse hyperbolic sigmoids of the form yF(α; 1/2; 3/2; y²), the function tanh⁻¹(·) is noteworthy; firstly, it corresponds to the case α = 1; secondly, all inverse hyperbolic sigmoids with integral values of α may be generated from tanh⁻¹(x) by a process of differentiation (Lemma 4.1); and thirdly, its inverse tanh(·) is a function often encountered in neural nets [19]. As was mentioned in the Introduction, the logistic function may be thought of as a translated and scaled version of the hyperbolic tangent.

Characterization: Hyperbolic Sigmoids
It is often desirable and necessary to work with sigmoids themselves, rather than their inverses. In this section, we obtain power series expansions of sigmoids.

Hyperbolic Sigmoids of the First Kind
When an inverse hyperbolic sigmoid is of the form y/(1 − y²)^α, a remarkably explicit form for the coefficients {b_{2l+1}}₀^∞ may be given:

¹³ An intriguing case is Elliot's piecewise rational sigmoid [8], defined as σ(x) = x/(1 + |x|). Although its inverse σ⁻¹(y) = y/(1 − |y|) does not fit in an obvious way into the framework developed in the last few sections, it is fairly simple to relax the conditions placed on g(y) in Theorem 4.2 so as to include this sigmoid as well. ¹⁴ The inverse Gudermannian function finds use in relating circular and hyperbolic functions without the use of complex functions. ¹⁵ In particular, [32, pp. 149-165] and [16, pp. 196-198] are rich sources of such functions and expansions.

The following theorem presents an efficient way to implement this procedure.
Proof: Theorem 5.2 is easily proved by an induction argument on n.
While the procedure implicit in Theorem 5.2 is efficient, it does involve the computation of the derivative of G_n(y). Equation (5.6) is a partial difference equation with variable coefficients; therefore there is little hope of solving it in any generality and obtaining a closed form expression. Even more sophisticated methods, such as Truesdell's generating function technique and Weisner's group theoretic approach (see [25]), do not give any special insight into the nature of the polynomials G_n(y).¹⁶ The next theorem offers a somewhat different approach to the method of repeated derivatives.

Theorem 5.3 (Hyperbolic sigmoids - II B) Let σ(x) = Σ_{n=0}^∞ (b_n/n!) xⁿ be the Maclaurin expansion of a hyperbolic sigmoid whose inverse is of the form yF(α; 1/2; 3/2; y²), valid in some neighborhood of the origin. Then b_{2k} = 0 and b_{2k+1} = C(2k+1, k), where the sequence C(n, k) satisfies:

C(1, 0) = 1;  C(n, k) = 0 for all k ≥ n and for all k < 0;
C(n+1, k) = (2k − n + 1) C(n, k) − 2(αn − k + 1) C(n, k−1),  n ≥ 1    (5.7)

where n and k are natural numbers. Dⁿ(σ(x)), the nth derivative of σ, is given in terms of y = σ(x) by:

Dⁿ(σ(x)) = Σ_{k=0}^{n−1} C(n, k) y^{2k−n+1} (1 − y²)^{αn−k},  for n ≥ 1    (5.8)

Proof: See Appendix I. In general, Equation (5.7) is a partial difference equation with variable coefficients, and the system does not appear to be related to any well known sets of numbers¹⁷. A closed form solution for the numbers C(n, k) appears to be intractable.
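The recursion (5.7) is easy to implement directly. The sketch below (ours; the parameter α appears as reconstructed above) tabulates C(n, k) with exact rational arithmetic and, for α = 1, checks it against the known Maclaurin derivatives of tanh: b₁ = 1, b₃ = −2, b₅ = 16:

```python
# Implementation sketch (ours) of recursion (5.7) for the numbers C(n, k),
# checked for alpha = 1 against the derivatives of tanh at the origin.
from fractions import Fraction

def C_table(n_max, alpha):
    C = {(1, 0): Fraction(1)}
    def get(n, k):
        return C.get((n, k), Fraction(0))   # C(n,k) = 0 for k >= n or k < 0
    for n in range(1, n_max):
        for k in range(n + 1):
            C[(n + 1, k)] = ((2 * k - n + 1) * get(n, k)
                             - 2 * (alpha * n - k + 1) * get(n, k - 1))
    return C

C = C_table(5, Fraction(1))
assert C[(1, 0)] == 1          # tanh'(0)   = 1
assert C[(3, 1)] == -2         # tanh'''(0) = -2   (tanh = x - x^3/3 + ...)
assert C[(5, 2)] == 16         # tanh^(5)(0) = 5! * (2/15) = 16
```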

Applications
In this section, we present two applications. The first shows that if the neural network transfer function is a hyperbolic sigmoid, then the dynamical equations describing the Hopfield neural network [19] can be transformed into a set of non-homogeneous associated Legendre differential equations. Some conclusions regarding the behavior of the Hopfield model as the outputs saturate (i.e., outputs → ±1) can then be drawn.

¹⁶ Equation (5.6) is a differential-difference system of the ascending type; it can then be shown that the polynomials {G_n(y)}_{n=1}^∞ satisfy Truesdell's F-equation. Unfortunately, the resulting generating function for G_n(y) is too complicated for any practical use. ¹⁷ Interestingly, in the case of the logistic sigmoid, these relations happened to be the recursions corresponding to the Eulerian numbers [15, pp. 253-257]; in other words, the coefficients arising in the computation of higher order derivatives of the logistic sigmoid turn out to be the Eulerian numbers.
The second application derives an interesting connection between Fourier transforms and 1-hidden layer feedforward nets (1-HL nets). Subject to an additional minor constraint, we show that the use of 1-HL nets with simple sigmoidal transfer functions for function approximation is tantamount to assuming that the function being approximated is the product of two functions; one the derivative of a bounded non-negative function, and the other satisfying some linear n-th order di erential equation, where n is the number of nodes in the hidden layer.

Continuous Hop eld nets & Legendre Di erential Equations
The continuous Hopfield network model [19] with N neurons is described by the following dynamics:

du_i/dt + g_i u_i = Σ_j T_ij v_j + I_i = −∂E/∂v_i,  for all i ∈ {1, ..., N}

where u_i and v_i are the net input and net output of the ith neuron, respectively, I_i is a constant external excitation, and E is the so-called "energy" of the network, given by:

E = −(1/2) Σ_i Σ_j T_ij v_i v_j − Σ_i I_i v_i

Neglecting the effect of g_i, as is common practice, we obtain the gradient system du_i/dt = −∂E/∂v_i. The fact that the connection between Legendre differential equations and the Hopfield equation holds for such a wide variety of sigmoids, and is not just an accidental consequence of a particular sigmoid, strongly indicates that further exploration is warranted.
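The saturating behavior discussed above is easy to observe in simulation. The sketch below (ours, not the paper's; it assumes the standard quadratic Hopfield energy with the gain terms g_i neglected, a tanh transfer function, small random symmetric weights, and forward-Euler integration) checks that the energy is non-increasing along trajectories:

```python
# Simulation sketch (ours): continuous Hopfield net with tanh transfer,
#   du_i/dt = sum_j T_ij v_j + I_i,   v_i = tanh(u_i),
# and quadratic energy E = -1/2 sum_ij T_ij v_i v_j - sum_i I_i v_i.
# Along the flow, dE/dt = -sum_i tanh'(u_i) (du_i/dt)^2 <= 0.
import math, random

random.seed(0)
N = 6
T = [[0.0] * N for _ in range(N)]       # symmetric, zero diagonal
for i in range(N):
    for j in range(i + 1, N):
        T[i][j] = T[j][i] = random.uniform(-1, 1)
I = [random.uniform(-0.5, 0.5) for _ in range(N)]
u = [random.uniform(-0.1, 0.1) for _ in range(N)]

def energy(v):
    quad = -0.5 * sum(T[i][j] * v[i] * v[j] for i in range(N) for j in range(N))
    return quad - sum(I[i] * v[i] for i in range(N))

dt, energies = 0.01, []
for step in range(2000):
    v = [math.tanh(ui) for ui in u]
    energies.append(energy(v))
    du = [sum(T[i][j] * v[j] for j in range(N)) + I[i] for i in range(N)]
    u = [u[i] + dt * du[i] for i in range(N)]

# E is non-increasing along the (discretized) flow
assert all(energies[s + 1] <= energies[s] + 1e-9 for s in range(len(energies) - 1))
```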

Fourier transforms & Feedforward nets
There have been many different attempts to describe the behavior of feedforward networks, such as the group theoretic analysis of the Perceptron proposed by Minsky and Papert [27], the space partition (via hyperplanes) interpretation discussed by Lippman [22] (and many others), the metric synthesis viewpoint introduced by Pao and Sobajic [29], the statistical interpretation emphasized by White [36], et cetera. In 1988, Gallant and White showed that a 1-HL feedforward net with "monotone cosine" squashing at the hidden layer and a summing output node embeds as a special case a "Fourier network" that yields a Fourier series approximation to a given function as its output [13]. We present a related construction in this section; it is shown that a one hidden layer (1-HL) net with simple sigmoidal convex transfer functions (at the hidden layer) and a single summing output can be thought of as performing trigonometric approximation (regression) [34, Chap. 4]. Specifically, the inverse Fourier transform of the function (to be learned) is approximated as a linear combination of weighted sinusoids.
The result is a consequence of a connection between a class of simple sigmoids and Fourier transforms that facilitates a novel interpretation of 1-HL feedforward nets. Polya's theorem is a starting point [30].

Proposition 6.1 (Polya's theorem) [12]: A real valued and continuous function f(x), defined for all real x and satisfying the properties (i) f(0) = 1, (ii) f(−x) = f(x), (iii) f(x) is convex for x > 0, and (iv) lim_{x→∞} f(x) = 0, is always a characteristic function (Fourier transform) of an absolutely continuous distribution function¹⁸, i.e., f(x) = F(h(t); x) = ∫_{−∞}^∞ e^{ixt} h(t) dt. Furthermore, the density h(t) is an even function, and is continuous everywhere except possibly at t = 0.
The following result connects simple sigmoids with Fourier transforms.

Theorem 6.1 Let σ(x) be a simple sigmoid. If σ(x)/x is a convex function, then it is the Fourier transform of an absolutely continuous distribution function, i.e.,

σ(x)/x = ∫_{−∞}^∞ e^{ixt} h(t) dt    (6.15)

Proof: It suffices to prove that σ(x)/x satisfies the conditions of Polya's theorem. σ(x), being simple, is bounded, and hence lim_{x→∞} σ(x)/x = 0. Also, σ(−x)/(−x) = −σ(x)/(−x) = σ(x)/x. Since σ(x)/x is completely monotone, it follows that lim_{x→0} σ(x)/x = K (some positive constant). There is no loss of generality in assuming K = 1, since one can always scale σ(·) appropriately.
Finally, the convexity of σ(x)/x ensures that all of the conditions of Polya's theorem are satisfied, and the conclusion follows.

Remark 6.1 Polya's theorem is a sufficient but not necessary condition for f(x) to be the Fourier transform of some function h(t). Hence, Theorem 6.1 is also only a sufficient condition for a simple sigmoid to be a Fourier transform. A case in point is the function tanh(x)/x, which is not convex but is still a Fourier transform [28, pp. 42, item #240], i.e.,

tanh(x)/x = F((1/π) log(coth(π|t|/4)); x)    (6.16)

In other words, the conclusions we draw in the next few paragraphs may be valid for some non-convex simple sigmoids as well.

Remark 6.2 In Equation (6.15), h(t) is an even function. Hence the transform is a Fourier cosine transform; the sine component vanishes during the course of the integration.
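Polya's four conditions are easy to test numerically. Elliot's sigmoid x/(1 + |x|) (which, as noted in an earlier footnote, does not itself fit the hyperbolic framework) gives the ratio σ(x)/x = 1/(1 + |x|), and this ratio neatly satisfies all four conditions; a spot-check (ours):

```python
# Numerical spot-check (ours) of Polya's conditions for f(x) = 1/(1+|x|),
# the ratio sigma(x)/x of Elliot's sigmoid x/(1+|x|):
#   f(0) = 1, f even, f -> 0, and f convex for x > 0.
def f(x):
    return 1.0 / (1.0 + abs(x))

assert f(0.0) == 1.0                                   # (i)
assert all(f(x) == f(-x) for x in [0.3, 1.0, 7.5])     # (ii)
assert f(1e6) < 1e-5                                   # (iv)
h = 1e-3
xs = [0.1 + 0.05 * i for i in range(200)]              # grid in (0, 10)
# (iii): non-negative second differences imply convexity on the grid
assert all(f(x - h) - 2.0 * f(x) + f(x + h) >= -1e-10 for x in xs)
```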
Consider a 1-HL net with k input nodes, n hidden layer nodes with convex simple sigmoidal transfer functions σ(·), and one summing output node. Let w_ij denote the weight of the connection between the ith node in the hidden layer and the jth node in the input layer; similarly, let c_i denote the weight of the connection between the ith hidden node and the output node.

¹⁸ Recall that an absolutely continuous function F(x) is a distribution function if it can be written in the form F(x) = ∫_{−∞}^x h(t) dt for some non-negative integrable density h.

Equation (6.19) can be recognized as being analogous to the Heaviside expansion formula in Laplace transform theory²⁰, which allows the reconstruction of a time varying function using information relating to its spectral components. Equation (6.19) suggests that 1-HL nets with convex simple sigmoidal transfer functions can be thought of as implementing a spectral reconstruction of the output, using the weighted inputs u_i to evaluate the associated pole coefficients (residues) of the Heaviside expansion.
In particular, it can be demonstrated that the results of Gallant and White [13] are implied by Equation (6.19). In what follows, we shall use F_s(h; x) and F_c(h; x) to indicate the Fourier sine and cosine transforms of h(t). Since h(t), the continuous density corresponding to σ(x)/x, is an even function (from Polya's theorem), it follows that σ(x) = xF(h(t); x) = xF_c(h(t); x). Using the property of Fourier transforms that xF_c(g(t); x) = F_s(−g′(t); x), we obtain Equation (6.23). Equation (6.23) may be used as a starting point for an analysis identical to that adopted by Gallant and White in their study of 1-HL nets with "cosine squashing" functions [13]. It is then straightforward to show that the weights may be so chosen (hardwired) that the 1-HL net embeds as a special case a Fourier network, which yields a Fourier series approximation to a given function as its output. In this sense, the results of this section extend the study of Gallant and White. More generally, one can draw similar conclusions by considering sigmoids that are the Laplace transforms of some function; for example, tanh(x)/x is the Laplace transform of sgn(sin(πt/2)), where sgn(x) is +1, 0 or −1 depending on whether x is greater than, equal to or less than zero [32, pp. 248].
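The "Fourier network" idea can be illustrated with a toy example (ours; this is only a hardwired caricature of the Gallant-White construction, not the construction itself): a one-hidden-layer "net" whose hidden units apply cos(·) to affine inputs, with output weights hardwired to Fourier coefficients, reproduces a Fourier partial sum of the target function:

```python
# Toy illustration (ours) of a hardwired Fourier network: hidden unit i
# computes cos(w_i x + b_i); with sin(n x) = cos(n x - pi/2) and output
# weights set to the Fourier coefficients of f(x) = x on (-pi, pi),
# the net's output is the N-term Fourier partial sum of f.
import math

N_TERMS = 200
w = [n for n in range(1, N_TERMS + 1)]                      # hidden weights
b = [-math.pi / 2.0] * N_TERMS                              # hidden biases
c = [2.0 * (-1) ** (n + 1) / n for n in range(1, N_TERMS + 1)]  # output weights

def net(x):
    return sum(ci * math.cos(wi * x + bi) for ci, wi, bi in zip(c, w, b))

# the partial sums approximate f(x) = x away from the endpoints +-pi
for x in [-2.0, -0.5, 1.0, 2.0]:
    assert abs(net(x) - x) < 0.05
```

The point of the construction is only that such weights exist; training a net to find them is a separate question.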
An analysis similar to the one described above would lead to a connection with real exponential approximation (rather than trigonometric approximation). Efficient algorithms, such as Prony's, exist for certain restricted forms of the exponential approximation problem [34, pp. 82-101]. Also related are the considerations of Marks and Arabshahi on the multidimensional Fourier transforms of the output of a 1-HL feedforward net; they showed that the transform of the output is the sum of certain scaled Dirac delta functions [24]. Here, we view the sigmoid itself as the Fourier transform of some function; the main advantage of our interpretation is the algorithms it suggests for training 1-HL nets of the type considered in this section. Extensions to multiple layer nets, while not trivial, should not present undue difficulties.
Another potential use of Equation (6.23) is in exploring the "goodness" of the approximation obtained by a 1-HL net with simple sigmoidal transfer functions. In the last 200 years, much has been learned about the errors associated with exponential and trigonometric approximation, and about ways to deal with them; however, consideration of these issues is beyond the scope of this paper.

Conclusion
We have analyzed the behavior of important classes of sigmoid functions, called simple and hyperbolic sigmoids, instances of which are extensively used as node transfer functions in artificial neural network implementations. We have obtained a complete characterization for the inverses of hyperbolic sigmoids using Euler's incomplete beta functions, and have described composition rules that illustrate how such functions may be synthesized from others. We have obtained power series expansions of hyperbolic sigmoids, and suggested procedures for obtaining the coefficients of the expansions. For a large class of node functions, we have shown that the continuous Hopfield net equations can be reduced to Legendre differential equations. Finally, we have shown that a large class of feedforward networks represent the output function as a Fourier series sine transform evaluated at the hidden layer node inputs, thus extending an earlier result due to Gallant and White.

Appendix I

Theorem 4.1: Let y = σ(x) be a hyperbolic sigmoid, and let σ⁻¹ : (−1, 1) → ℝ be its inverse.
Proof: Since σ(·) is hyperbolic, by definition σ⁻¹(x)/x is described by a GH series with at most three parameters. There are then four major possibilities, Cases 1-4 (Equations (7.4)-(7.7)), according to how the parameters are split between numerator and denominator. The following proposition shows why there is no need to consider Cases 1, 3 and 4 as possible forms for inverse hyperbolic sigmoids.
Proposition A [32, pp. 155]: Let pF_q(α₁, ..., α_p; β₁, ..., β_q; z) be a GH series in z with p + q parameters. If none of the numeratorial parameters is a non-positive integer, i.e., for all i, αᵢ ≠ 0, −1, −2, ..., then the convergence behavior of pF_q is as follows:

p < q + 1: pF_q necessarily converges for all finite z.
p = q + 1: convergence of pF_q is limited to −1 < z < 1, and the behavior at the endpoints depends on the parameters αᵢ and βᵢ.
p > q + 1: pF_q necessarily diverges for all nonzero z.    (7.8)

Since lim_{z→±1} σ⁻¹(z) → ±∞, but σ⁻¹ is finite in the interval (−1, 1), it follows that if a GH series is to represent σ⁻¹(·), then it has to converge in the interval (−1, 1), but diverge at z = 1.
This rules out non-positive integral values for the numeratorial parameters; otherwise, the series would terminate and converge for all z ∈ ℝ (and not just in the interval (−1, 1)). Yet, even if the numeratorial parameters do not have non-positive integral values, in three of the above cases the balance of numeratorial to denominatorial parameters is such that the series either again converges for all z (Case 1) or diverges for all z (Cases 3, 4). That leaves just one case to consider, viz. the classical series ₂F₁(α₁, α₂; β₁; z) = F(α, β; γ; z); i.e., we may take σ⁻¹(x) = x F(α, β; γ; x²).
Since σ⁻¹(x)/x has to be a GH series with at most three parameters, some of the parameters are allowed to be "missing". In other words, Case 2 spawns, in turn, the possibilities 2(a)-2(f) (Equations (7.9)-(7.15)). Proposition A can be used once again to weed out all but two of this set, viz. Cases 2(a) and 2(d); the rest lead to inappropriate divergence or convergence behavior in the interval. The following property of GH functions will also be needed. From the definition of hyperbolic sigmoids, σ⁻¹(x)/x is to be representable by a GH function with at most three parameters; we must therefore make the identification β = 1/2 and γ = 3/2. From the symmetry properties of the GH function in its numeratorial parameters, we need not consider the case α = 1/2, β = 3/2 separately. Γ(·) here is Euler's Gamma function. If α < 1, from Proposition C we see that the series converges absolutely at z = x² = 1; it follows that we must require α ≥ 1 in Equation (7.23). The inverse is then a GH series in g(·) with at most one parameter. Then, either

σ⁻¹(y) = g(y) F(α; 1/2; 3/2; (g(y))²) = g(y) Σ_{k=0}^∞ [(α)_k / (2k + 1)] (g(y))^{2k} / k!,  for α ≥ 1    (7.23)

or

σ⁻¹(y) = g(y) F(α; −; −; (g(y))²) = g(y) / (1 − (g(y))²)^α,  for α > 0    (7.24)

provided lim_{y→±1} g′(y)/(1 − y²)^α → ∞, where g′(·) is the first derivative of g(·).

Proof: The proof of Theorem 4.2 is very similar to that of Theorem 4.1. If we start with σ⁻¹(x) = g(x) F(α; 1/2; 3/2; (g(x))²), then we can show that (σ⁻¹)′(x) is the product of g′(x) and a positive factor, where g′(x) is the first derivative of g(x). Since g′(x) > 0 for all x ∈ Dom(g), and α > 0, it follows that (σ⁻¹)′(x) > 0 for all x ∈ Dom(σ⁻¹), i.e., σ⁻¹(x) is a strictly increasing function. The analyticity, continuity and oddness of σ⁻¹(·) follow from the respective properties of the GH function. We assure that lim_{x→±1} σ⁻¹(x) → ±∞ by forcing its derivative (σ⁻¹)′(x) to go to infinity at the endpoints of its interval.
Consider the functional equation u = tφ(u). Suppose f(u) and φ(u) are analytic in some neighborhood of the origin (in the u-plane), with φ(0) = 1. Then there is a neighborhood of the origin (in the t-plane) in which the equation u = tφ(u) has exactly one root for u. Let Σ_{k≥0} a_k t^k be the Maclaurin expansion of f(u(t)) in t, and Σ_{k≥0} c_k t^k be the Maclaurin expansion of the function f′(u)[φ(u)]ⁿ. Then:

a_n = (1/n) c_{n−1}

Here, y corresponds to u, x corresponds to t, and φ(u) = (1 − u²)^α. Take f(u) = u, and the theorem follows from the Lagrange inversion formula.

Theorem 5.3: Let σ(x) = Σ_{n=0}^∞ (b_n/n!) xⁿ be the Maclaurin expansion of a hyperbolic sigmoid, with an inverse of the form yF(α; 1/2; 3/2; y²), valid in some neighborhood of the origin. Then b_{2k} = 0 and b_{2k+1} = C(2k+1, k), where we define the sequence C(n, k) as follows:

C(1, 0) = 1;  C(n, k) = 0 for all k ≥ n and for all k < 0;
C(n+1, k) = (2k − n + 1) C(n, k) − 2(αn − k + 1) C(n, k−1),  n ≥ 1    (7.27)

where n and k are natural numbers. The nth derivatives Dⁿ(σ(x)) of σ are given, in terms of y = σ(x), by:

Dⁿ(σ(x)) = Σ_{k=0}^{n−1} C(n, k) y^{2k−n+1} (1 − y²)^{αn−k}    (7.28)

Proof: This theorem was obtained by a process almost identical to that described in Minai and Williams' work on the derivatives of the logistic sigmoid [26]. We therefore restrict ourselves to an outline.
It is given that y = σ⁻¹(x) = x F(α; 1/2; 3/2; x²), and x = σ(y). It can be shown that

D(x) = (d/dy) σ(y) = 1/(σ⁻¹)′(x) = (1 − x²)^α.

Consider the derivatives of the polynomial f_{k,l}(x) = x^k (1 − x²)^l:

D(f_{k,l}(x)) = (d/dy) f_{k,l}(x) = k x^{k−1} (1 − x²)^{α+l} − 2l x^{k+1} (1 − x²)^{α+l−1}
             = (k) f_{k−1, l+α}(x) + (−2l) f_{k+1, l+α−1}(x)
             = L(f_{k,l}(x)) + R(f_{k,l}(x))    (7.29)

In Equation (7.29) we have split the effect of the operator D ≡ d/dy into the sum of the actions of two operators L and R (Minai and Williams refer to them by other symbols). With respect to the polynomials f_{k,l}, these operators are defined by:

L(A f_{k,l}(x)) = Ak f_{k−1, l+α}(x)    (7.30)
R(A f_{k,l}(x)) = −2lA f_{k+1, l+α−1}(x)    (7.31)

where A is a constant. The main advantage of introducing these operators is that they give a systematic way of visualizing the production of D^{n+1}(x) from Dⁿ(x). L and R may be thought of as being applied to a binary tree of expressions, where each node is some polynomial f_{k,l}(x), and the root is the polynomial f_{0,α} = (1 − x²)^α. The action of L on each node of this tree is to produce a left child, given by Equation (7.30), and that of R is to produce a right child, given by Equation (7.31). L acting upon f_{k,l}(x) does three things: it multiplies it by k (the degree of x), reduces the degree of x by 1, and increases the degree of (1 − x²) by α. On the other hand, R increases the degree of x by 1, increases that of (1 − x²) by (α − 1), and multiplies the operand by −2l, where l is the degree of (1 − x²). Induction arguments in conjunction with the above observations then give:

C(1, 0) = 1;  C(n, k) = 0 for all k ≥ n and for all k < 0;
C(n+1, k) = (2k − n + 1) C(n, k) − 2(αn − k + 1) C(n, k−1),  n ≥ 1    (7.32)

Now, all terms in Dⁿ(x) in which the x term has positive degree will vanish when evaluated at x = 0. For even n, all the nodes have an x term with an odd degree, and hence Dⁿ(x) vanishes identically at x = 0.
For odd n, all terms except the term corresponding to k = (n − 1)/2 vanish at x = 0 (this is the term whose x-degree 2k − n + 1 is zero). Since b_n = Dⁿ(x)|_{x=0}, it follows that b_{2k} = 0 and b_{2k+1} = C(2k+1, k).
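The Lagrange-inversion step above can also be checked concretely (our check, not part of the paper) for the second family with α = 1/2, where everything is available in closed form: the inverse of y/√(1 − y²) is x/√(1 + x²), so the coefficient a_n = (1/n)[u^{n−1}](1 − u²)^{αn} produced by Lagrange inversion must agree with the direct binomial expansion of (1 + x²)^{−1/2}:

```python
# Worked check (ours) of Lagrange inversion with phi(u) = (1 - u^2)^alpha
# for alpha = 1/2: a_{2k+1} = (1/n)[u^{n-1}](1-u^2)^{alpha n}, n = 2k+1,
# must equal the binomial coefficient C(-1/2, k) from (1+x^2)^{-1/2}.
from fractions import Fraction

def binom(a, k):
    # generalized binomial coefficient, a rational
    out = Fraction(1)
    for i in range(k):
        out = out * (a - i) / (i + 1)
    return out

alpha = Fraction(1, 2)
for k in range(6):
    n = 2 * k + 1
    # [u^{n-1}] (1 - u^2)^{alpha n} = (-1)^k C(alpha n, k)
    a_n = Fraction((-1) ** k, n) * binom(alpha * n, k)
    assert a_n == binom(Fraction(-1, 2), k)
```

For example, k = 1 gives a₃ = −(1/3)·C(3/2, 1) = −1/2, matching x/√(1 + x²) = x − x³/2 + 3x⁵/8 − ⋯.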