On Ranking and Selection from Independent Truncated Normal Distributions

This paper develops probability statements and ranking and selection rules for independent truncated normal populations. An application to a broad class of parametric stochastic frontier models is considered, where interest centers on making probability statements concerning unobserved firm-level technical inefficiency. In particular, probabilistic decision rules allow subsets of firms to be deemed relatively efficient or inefficient at prespecified probabilities. An empirical example is provided. © 2004 Elsevier. All rights reserved.


Introduction
Truncated distributions are common in economics, where non-negative random variables characterize data generation processes. On the most fundamental level, price and quantity are assumed to be non-negative. A specific distributional class, the truncated normal distributions, is commonly used. For example, truncated normals are used in the censored and truncated regression models; see Tobin (1958), Amemiya (1974) or Heckman (1976). Recently, Hong and Shum (2002) showed that the vector of drop-out prices observed in an asymmetric ascending auction may be multivariate truncated normal. Additionally, the truncated normal distribution is used to describe technical inefficiency in parametric stochastic frontier models; see Greene (2005). The importance (and relative complexity) of multivariate truncated normals is illustrated by the sizeable Bayesian literature devoted to computer simulation of these random variates; for example, see Geweke (1991).
This paper develops probability statements on independent truncated normal distributions that characterize the relative magnitude of realizations from the distributions. That is, if W = (W1, ..., Wn) is a multivariate truncated normal random variable, the goal is to attach probabilities to statements on the relative magnitudes of the Wj when they are assumed independent. In particular, this paper presents probabilistic ranking and selection rules to determine subsets of the n elements of W that are relatively small or large at prespecified probabilities. The proposed rules are based on a non-standard multivariate distribution derived from differences of independent truncated normals. While the form of the multivariate distribution is non-standard (complicated by the truncation), the probability inequalities are readily calculable. An application to parametric stochastic frontier models is considered; these models yield truncated normal distributions for firm-level technical (in)efficiency, and then attempt to characterize the ranks of realizations from these (in)efficiency distributions. The proposed selection rules accomplish this task by identifying relatively (in)efficient firms at a prespecified probability, and it is argued that the rules are more theoretically justified than current methods for assessing distributional differences in these models.

Characterizations of the distribution
This section presents characterizations of the multivariate truncated normal distribution.

Definition 1. Let W* be an n-dimensional random variable. W* has a non-singular n-variate normal distribution with mean vector μ and (n × n) positive definite covariance matrix Σ, if it has density

$$f_{W^*}(w) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(w-\mu)'\Sigma^{-1}(w-\mu)\right\}, \qquad w \in \mathbb{R}^n.$$

Adopt the standard notation: W* ~ N(μ, Σ). Let W be the truncation of W* below c.

Definition 2. W has an n-variate truncated normal distribution given by

$$f_W(w) = \frac{f_{W^*}(w)}{\int_c^{\infty} f_{W^*}(t)\,dt}, \qquad w \ge c,$$

where $\int_c^{\infty}$ is an n-dimensional Riemann integral from c to ∞. One could envision truncation of only a subset of the elements of W*; this just requires that, for certain j, the truncation point cj goes to −∞ in the limit. Other forms of truncation have been suggested by Tallis (1963, 1965) and Beattie (1962). Define vectors: and with typical elements respectively, and . Then, the characteristic function of W is given by the following result:

Theorem 3. The characteristic function of W is The proof is in Section A.1.

Tallis (1961) derives a similar formula for the moment generating function of W. The derivation hinges on a series of variable transformations that shift W without scaling it. The univariate case was first suggested in a problem posed by Horrace and Hernandez (2001). Of course, the characteristic function generates the moments of W. Tallis (1961) derives these moments using differentiation of the moment generating function when and R is the correlation matrix associated with . Amemiya (1974) adapts the Tallis results for the case where . Weiler (1959) derives them for the case n = 2 using integration of the density f_W. While most of the results of this paper can be derived for arbitrary truncation below c, this paper is concerned with truncation below 0 for each element of W*. Moreover, ranking and selection rules are greatly simplified by orthogonality. Therefore, we will always make the following two assumptions.
Assumption A.0. c = 0. These are widely known results; for example, see Bera and Sharma (1999). A useful monotonicity result is: The proof is in Section A.1. Result (i) of Lemma 8 was indirectly shown by Bera and Sharma (1999). The implication is that if I for fixed for fixed , then for . Therefore, in certain cases, the relative ranks of can be assessed by examining the relative ranks of . This is potentially useful, if the distribution of some estimates of the are normal or asymptotically normal. If so, ranking inference on the estimates of would be standard, while ranking inference on the , using the transformation of the estimates of in Eq. (1), would be non-standard.

Linear transformations
The selection procedures that follow hinge on distributions of differences of truncated normals, so understanding the effects of linear transformations on the truncated normal distribution is useful. First, the family of multivariate truncated normal distributions (Definition 2) is not closed under linear transformations in general. Rescaling and/or summation of elements causes the distribution to lose its truncated normal shape. However, it is closed under relocation. Second, even under Assumption A.1, the family of multivariate truncated normal distributions (Definition 4) is not closed under linear transformations. However, under A.1 it is closed under relocation and (positive) rescaling. (Positive rescaling is only necessary to preserve truncation below the truncation point; negative rescaling produces truncation above the truncation point.) Consequently, the sampling distribution of the sample average from a random sample of a truncated normal population will not be truncated normal. Third, the marginal distributions from multivariate truncated normal distributions will not be truncated normal in general; however, under the independence assumption (A.1) the marginal distributions are truncated normal. The consequence of the preceding is that ranking and selection rules for differences of independent truncated normals will hinge on non-standard distributions and, in particular, not on truncated normal distributions. For example, under A.0 and A.1, the density of the difference is where f_Wj is the marginal density function for Wj given in Definition 4 with n = 1. The partition of the integral on w < 0 and w ≥ 0 is for computational convenience. Notice that , so the density is not truncated at all. Also, when and , the distribution is symmetric about the origin. Consider generalizing Eq. (3) to the (n-1)-dimensional case where k is a control index. Define vector Let be any realization of , then under A.0 and A.1 the distribution of is where F_Wj is the marginal distribution function for Wj given by Definition 5 with n = 1. The upper tail probabilities are The probabilities given in Eqs. (4) and (5) are general (they can be used for any independent, absolutely continuous distribution with no probability mass below zero) and are used in the next section to derive the selection rules. The equations can be used to construct multivariate probability statements on the . Define a and as the solution in dj = d (for all j) to Also, define and as the solution in dj = d (for all j) to Then, and k are one-sided confidence bounds similar to the multiple comparisons with a control (MCC) intervals suggested by Dunnett (1955), although here there is no sample of which to speak. Dunnett made inferential statements for the sampling distribution of population statistics; these statements are for individual realizations from the underlying truncated normal populations. Notice that in general
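The kind of probability that the selection rules rely on can be sketched numerically. The displayed forms of Eqs. (4) and (5) are not reproduced here, so the sketch below uses the textbook identity for independent, absolutely continuous variables, P(Wk is largest) = ∫ f_Wk(w) Π_{j≠k} F_Wj(w) dw (with 1 − F_Wj in place of F_Wj for the smallest), which may differ in form from the paper's expressions. The function names, the Simpson's-rule integration, the upper integration limit, and the illustrative means and variances are all assumptions made for this example.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tn_pdf(w, mu, sigma):
    """Density at w of N(mu, sigma^2) truncated below zero (A.0)."""
    if w < 0.0:
        return 0.0
    return norm_pdf((w - mu) / sigma) / (sigma * norm_cdf(mu / sigma))


def tn_cdf(w, mu, sigma):
    """Distribution function at w of N(mu, sigma^2) truncated below zero."""
    if w <= 0.0:
        return 0.0
    num = norm_cdf((w - mu) / sigma) - norm_cdf(-mu / sigma)
    return num / norm_cdf(mu / sigma)


def prob_extreme(k, mus, sigmas, largest=True, steps=4000):
    """P(W_k is the largest, or smallest, of independent truncated normals).

    Integrates f_k(w) * prod_{j != k} F_j(w) (or 1 - F_j(w)) over [0, upper]
    with Simpson's rule; `steps` must be even.
    """
    # Hypothetical cutoff: ten standard deviations past the largest mean.
    upper = max(max(m, 0.0) + 10.0 * s for m, s in zip(mus, sigmas))
    h = upper / steps

    def integrand(w):
        value = tn_pdf(w, mus[k], sigmas[k])
        for j, (m, s) in enumerate(zip(mus, sigmas)):
            if j != k:
                cdf = tn_cdf(w, m, s)
                value *= cdf if largest else 1.0 - cdf
        return value

    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0


# Illustrative inputs only: three populations with pre-truncation means
# 1, 2, 3 and a common pre-truncation variance of 0.5.
mus = [1.0, 2.0, 3.0]
sigmas = [0.5 ** 0.5] * 3
p_max = [prob_extreme(k, mus, sigmas) for k in range(3)]
p_min = [prob_extreme(k, mus, sigmas, largest=False) for k in range(3)]
print([round(p, 3) for p in p_max], [round(p, 3) for p in p_min])
```

Because exactly one index is the maximum with probability one under continuity, the three entries of `p_max` should sum to one up to integration error, which is a convenient sanity check on the quadrature.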

Selection rules
Suppose that we are interested in the relative ranks of a potential realization from the distribution of W under A.0 and A.1. Let _ be the ranks of the elements of a single potential realization of W. Interest centers on selecting a subset of the indices {1; 2; . . . ; n} that contains the index [n] with a prespecified confidence level and another subset that contains [1] with a prespecified confidence level. Consider the following selection rules Rmax and Rmin: Furthermore, define corresponding subsets Smax and Smin: Notice that monotonicity of and k in d implies equivalent selection rules The continuity of ensures that That is, there can only be one minimum or maximum with positive probability. Therefore, probability statements such as are valid. Of course, when Smax is empty, the probability given in Eq. (7) is just 0; similarly, for Smin. Let us always assume: .
The following is a useful result: Lemma 9. Smax can have no more than one element. Similarly, Smin can have no more than one element.
Proof. Suppose not. If there were more than one index in Smax, then there would be more than one index that satisfies Rmax, so there would be more than one index k where Therefore, under A.2 in Eq. (7). Contradiction. The proof is completed similarly for Smin.
Of course Smax and Smin can be empty, so there are only two states for the subsets: empty set or singleton. Given Lemma 9, Eq. (7) becomes where (n) and (1) For a prespecified confidence level 1 − γ, a correct selection is guaranteed at that level (as long as the inference is defined). Theorem 10 is related to the results of Gupta (1965), but Gupta's results are based on a sample of observations. Rizvi (1971) considers ranking and selection statements for the absolute value of estimates of , but the result of Theorem 10 is markedly different.
The subsets Smax and Smin contain those single indices with high probability of corresponding to the maximal Wj and minimal Wj, respectively. One could consider finding subsets containing indices with low probability of corresponding to the maximal Wj and minimal Wj. Therefore, alternative (but not equivalent) rules are with corresponding subsets S_max and S_min. (Note the notational subtlety: ''max'' corresponds to ''maximum with high probability'', while ''_max'' corresponds to ''maximum with low probability'' or ''not the maximum''.) Again, monotonicity implies
A useful result that relates selection rules and subsets is: Lemma 11. The sets Smax and S_max are non-intersecting. Also, the sets Smin and S_min are non-intersecting.
Proof. Under A.2, monotonicity of implies so that Therefore, if Rmax selects , then R_max will not select k, because violates the selection rule: (and vice versa). The proof is completed similarly for Smin and S_min.
Here we define the probabilities of correct selection as: and , respectively. Then These probabilities are not necessarily bound by the prespecified confidence level, unless Smax and S_max are non-empty.
Theorem 12. If Smax is non-empty, the probability of a correct selection conditional on the selection rule . If Smin is nonempty, the probability of a correct selection conditional on the selection rule R min is .
Proof. When Smax is non-empty, because Lemma 11. Therefore, by Theorem 10. The proof is completed similarly for S_min.
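The bookkeeping behind the subsets can be sketched in a few lines. Since the displayed rules are not reproduced here, the thresholds below are one reading of them, inferred from the worked examples: an index enters the high-probability subset when its probability of being the extreme is at least 1 − γ, and enters the low-probability subset when that probability is at most γ. The function name `select`, the requirement γ < 0.5 (chosen so that at most one index can clear 1 − γ, in the spirit of Lemma 9), and the input probabilities are all hypothetical.

```python
def select(probs, gamma):
    """Split indices into high- and low-probability subsets.

    probs maps an index to P(that index is the extreme, e.g. the maximum);
    gamma is the prespecified error level, assumed here to lie in (0, 0.5).
    Returns (high, low, p_outside_low), where p_outside_low is the
    probability that the extreme index is NOT in the low-probability subset.
    """
    if not 0.0 < gamma < 0.5:
        raise ValueError("gamma is assumed to lie strictly between 0 and 0.5")
    high = sorted(k for k, p in probs.items() if p >= 1.0 - gamma)
    low = sorted(k for k, p in probs.items() if p <= gamma)
    p_outside_low = sum(p for k, p in probs.items() if k not in low)
    return high, low, p_outside_low


# Hypothetical probabilities that indices 1, 2, 3 yield the maximum.
p_max = {1: 0.014, 2: 0.155, 3: 0.831}
high, low, p_outside_low = select(p_max, gamma=0.05)
print(high, low, round(p_outside_low, 3))
```

With these inputs no index reaches 0.95, so the high-probability subset is empty, while index 1 falls below 0.05 and lands in the low-probability subset; the remaining mass is the chance that the maximum lies outside that subset.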
Of course, the event does not necessarily imply the event , so may not be exactly . In fact, the exact value is governed by Eq. (9). When Smax is empty, the probability of a correct selection is not bound from below and is determined by Eq. (9), but this does not preclude a reasonable probability of correct selection. Examples are provided below: Example 13. Suppose that under A.0 and A.1, n = 3; =1; = 2; = 3; Consider and : For so none of the variables have high or low probability of being the maximum or minimum. There is no inference at the prespecified level.
Example 14. Now suppose in the previous example that there is less variability and 0.50. Then Smax is still empty, so none of the variables have high probability of being the maximum. However, S_max = {1}, since . Therefore, one can conclude that index 1 corresponds to the maximum with low probability. By Eq. (9), the probability of a correct selection conditional on R_max is equal to 0.831 + 0.155 = 0.986, which happens to be greater than = 0.831 and 1 − γ = 0.95. However, Theorem 12 is not governing this high confidence level, because Smax is empty. Instead, the high confidence level is strictly an artifact of these particular distributional assumptions. Also, Smin = ∅ and S_min = {3}.
This illustrates the impact of Lemma 9. The distributions of 3 and 4 are equally likely to generate the largest observation, but they are not both in Smax. Now S_max = {1, 2}. Therefore, one can conclude that indices 1 and 2 correspond to the maximum with low probability. By Eq. (9), the probability of a correct selection conditional on R_max is 0.5 + 0.5 = 1.0. Also, Smin = ∅ and S_min = {3, 4}.
Example 17. Suppose = 0.15. For γ = 0.05, Smax = {4}, so index 4 corresponds to the maximum with high probability. The probability of a correct selection conditional on Rmax is equal to which is consistent with Theorem 10. Additionally, S_max = {1, 2, 3} and the probability of a correct selection conditional on R_max is also per Theorem 12. Also, Smin = {1} and S_min = {2, 3, 4}.
It should be noted that all the probabilities in Eqs. (7)-(9) could be estimated by rejection sampling from univariate normal variates. Indeed, all the preceding examples were verified with rejection sampling simulations. However, rejection sampling can be impractical. In fact, the simulations used to verify the examples were only feasible because the probability of rejection was low ( large and positive). When rejection sampling is not feasible, there is a growing body of literature devoted to efficient sampling from troublesome truncated normal distributions. For example, see Geweke (1991). However, even these techniques are subject to potential problems and criticisms. Therefore, if the integration in Eqs. (4) and (5) is calculable, then theoretical implementations of these procedures may be superior to simulation approaches. Also, it is clear from the preceding examples that the subsets are a convenient way of summarizing probabilities. Whether one wishes to report the subsets based on prespecified confidence levels or the actual probabilities ( and ) is a matter of taste.
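A rejection-sampling check of the kind described above might look like the following sketch. It is not the simulation code used for the examples; the function names, the seed, the number of draws, and the illustrative means and variances are assumptions. Each truncated draw is obtained by redrawing from the untruncated normal until the value is non-negative, so the expected number of redraws grows quickly when the pre-truncation mean is far below zero relative to its standard deviation, which is the impracticality the text mentions.

```python
import random


def draw_truncated(rng, mu, sigma):
    """Redraw from N(mu, sigma^2) until the draw is non-negative."""
    while True:
        x = rng.gauss(mu, sigma)
        if x >= 0.0:
            return x


def simulate_extremes(mus, sigmas, draws=20000, seed=0):
    """Monte Carlo estimates of P(W_k is max) and P(W_k is min) for each k."""
    rng = random.Random(seed)
    n = len(mus)
    max_counts = [0] * n
    min_counts = [0] * n
    for _ in range(draws):
        w = [draw_truncated(rng, m, s) for m, s in zip(mus, sigmas)]
        max_counts[w.index(max(w))] += 1
        min_counts[w.index(min(w))] += 1
    p_max = [c / draws for c in max_counts]
    p_min = [c / draws for c in min_counts]
    return p_max, p_min


# Illustrative inputs only: pre-truncation means 1, 2, 3, common variance 0.5.
p_max, p_min = simulate_extremes([1.0, 2.0, 3.0], [0.5 ** 0.5] * 3)
print([round(p, 3) for p in p_max], [round(p, 3) for p in p_min])
```

The simulated frequencies carry Monte Carlo error of order 1/sqrt(draws), so they are better suited to checking a numerical integration than to replacing it.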

Stochastic frontiers
In the literature on productivity and efficiency measurement, a common parametric class of production (cost) function estimators implies conditional distributions for technical inefficiency that are independent normal random variables, truncated below zero. Consider the specification where is productive output of firm j in period t; g is a production function that maps a vector of productive inputs into output through the unknown parameter vector . (The production function g typically satisfies some additional assumptions that are unimportant to the current discussion.) The are random variables representing stochastic shocks to the production process. Let the distribution of be that of an i.i.d. zero-mean normal random variable with variance . Furthermore, let where the are positive random variables representing technical inefficiencies, and is some positive, continuous parameterization of t. In this formulation of the production function, smaller u corresponds to better production (given inputs) and higher efficiency. Let the distribution of uj be the absolute value of an i.i.d. zero-mean normal random variable with variance . Additionally, let the and be independent across j and across t. This parametric ''stochastic frontier'' model has been extensively studied in various forms, originating with Aigner et al. (1977), who restrict the model to ; Jondrow et al. (1982) study a formulation similar to that of Aigner, Lovell, and Schmidt. Battese and Coelli (1988) consider the case where jtb. Kumbhakar (1990) extends this to the case where . Battese and Coelli (1992) consider , while Cuesta (2000) considers . Finally, Greene (2005) relaxes the parametric form of technical efficiency and allows for heterogeneity in g. In all these formulations, the parametric assumptions on vjt and ujt imply that ujt conditional on is a normal variable truncated below zero (this is also the case when ujt is i.i.d. exponential).
For example, in the Battese and Coelli (1988) formulation, the distribution of uj conditional on is the truncation below zero of an variable where where . Additionally, in all the formulations neither realizations nor estimates of realizations of ujt are available; only estimates of the mean and variance of ujt conditional on are available. Continuing the example, Battese and Coelli suggest maximum-likelihood estimates of and (although alternative consistent estimates, like GLS, exist) based on a point estimate of . If it is assumed that the value of the estimate of equals the true value of , then the sampling variability in the estimates of and can be ignored, the conditional distribution of uj is independent truncated normal, and an estimate of firm-level technical efficiency is the mean of the conditional truncated distribution: . Indeed, Battese and Coelli (1988, p. 391) state, ''[w]e obtain the conditional distribution of the firm effect [uj ], given the value of the random variables, . This assumes that the values of the parameter are known''. More recent formulations, based on time-varying technical inefficiency, suggest as an estimate of technical efficiency in period t. However, the common thread in all these parametric formulations is that the sampling variability in the estimates of and is ignored and the mean of the conditional distribution of u serves as a point estimate of technical inefficiency.
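The point estimates discussed above are moments of a normal distribution truncated below zero, and those moments have standard closed forms. The sketch below states them for a generic pre-truncation mean and standard deviation, written `mu_star` and `sigma_star` as placeholder names (the model-specific expressions for these quantities are not reproduced here). It computes E[u] = μ* + σ* φ(μ*/σ*)/Φ(μ*/σ*) and E[exp(−u)] = exp(−μ* + σ*²/2) Φ(μ*/σ* − σ*)/Φ(μ*/σ*); the second is the form usually associated with efficiency measured as exp(−u), and which of the two a given study reports is not asserted here.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def conditional_mean(mu_star, sigma_star):
    """E[u] when u is N(mu_star, sigma_star^2) truncated below zero."""
    r = mu_star / sigma_star
    return mu_star + sigma_star * norm_pdf(r) / norm_cdf(r)


def expected_exp_neg_u(mu_star, sigma_star):
    """E[exp(-u)] for the same truncated normal u."""
    r = mu_star / sigma_star
    scale = math.exp(-mu_star + 0.5 * sigma_star ** 2)
    return scale * norm_cdf(r - sigma_star) / norm_cdf(r)


# Hypothetical inputs: two firms with the same spread but different means.
for mu_star in (0.1, 0.3):
    print(mu_star, conditional_mean(mu_star, 0.2), expected_exp_neg_u(mu_star, 0.2))
```

Both functions summarize a whole distribution in one number, which is why two firms with close conditional means can still differ noticeably in the probability that one realization exceeds the other.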

Ranking the conditional means
Empirical implementations of these parametric models are too many to name here. However, they typically assume that is known and include some sort of ranking of the conditional means, , over j in each period t as a proxy for the ranking of the unobserved random variable, ujt. For example, see Horrace and Schmidt (1996). Unfortunately, is a misleading point estimate for ujt. While smaller may suggest smaller ujt, it may not be the case that ujt is small in any particular sample, even if is small. Therefore, using as a point estimate of u jt may have its limitations. Indeed, a firm j with may be operating with u jt much greater than 0 in any sample, y jt ; x jt .
That being said, Theorem 10 or Theorem 12 offers a better way to draw inferences on technical inefficiency than examining the rankings of the conditional means. That is, use Theorem 10 to define a set Smin that contains the j with the smallest (unobserved) ujt with probability at least 1 − γ. The idea is that these parametric stochastic frontier models only produce distributions for ujt, not ujt itself, and as such the conditional mean can only characterize the distribution of ujt, and not the probability of a realization of ujt of specific magnitude. However, Theorem 10 may be used to characterize the magnitude of the ujt in a probabilistic sense, and this is all that the data can allow. Ultimately, the traditional approach of ranking the conditional means and the new approach suggested here are similar in that both follow from the relative magnitudes of the means of the underlying normal distributions before truncation. However, the difference between the two approaches is that the latter takes into account the variance of the underlying distribution. As such, using Theorem 10 to identify efficiency is theoretically more appealing.

Texas electrical utility application
We examine a formulation of Eq. (10) with time-invariant technical inefficiency, although the selection rules could be applied in the time-varying case on a period-by-period basis. Consider the model of Horrace and Schmidt (1996): Under the assumptions that independent, generalized least squares (GLS) yields consistent estimates v and which imply the conditional distribution of uj. Specifically, and are consistent for and in Eqs. (11) and (12), respectively. Then the usual point estimates of technical efficiency based on Battese and Coelli (1988) are Horrace and Schmidt (1996) calculate the GLS technical efficiency of 10 Texas electric utility plants from a panel of data between 1966 and 1985, where inputs to the production of the logarithm of electricity are capital, labor, and fuel. See Kumbhakar (1996) for a complete explanation of the data. Using a Cobb-Douglas specification, Horrace and Schmidt (1996) estimate the marginal products of capital, labor, and fuel to be 0.5882, -0.0966, and 0.5807, respectively (only capital and fuel are significant at the 95% level). They also estimate = 0.0126. Ranked estimates of TEj and are contained in Table 1. Notice TEj is an increasing function of for fixed by Lemma 8.
Ignoring the sampling variability in and per Battese and Coelli (1988), Theorem 10 selects Smin, a subset of efficient firms (that have small uj) with at least a confidence level of 1 − γ. Assuming Smin is non-empty, Theorem 12 selects S_min, a subset of inefficient firms (that do not have small uj) with at least a confidence level of 1 − γ. Let γ = 0.10. Results for the Texas utilities are contained in Table 1. The results for are in the 4th column of the table. Notice that is a decreasing function of . Based on the results the following conclusions can be drawn. First, Smin = ∅, so there is no inference on the single most efficient firm at the 90% confidence level. One can conclude that firm 5 is efficient with 71% probability and firm 3 is efficient with 29% probability, but these are not very strong inferential statements. Additionally, one might conclude that firms 3 or 5 (or both) are efficient with near certainty (0.71 + 0.29 = 1). Since Smin is empty, there is no guarantee that Theorem 12 will hold, but one can conclude that S_min = {10, 1, 8, 9, 2, 6, 7, 4} and that these firms are not most efficient with near certainty (0.71 + 0.29 = 1). is calculated in column 5 of the table. Clearly, Smax = {4}, so firm 4 is least efficient with at least 90% confidence (in fact we are 90.74% confident). Since Smax is non-empty and S_max = {5, 3, 10, 1, 8, 9, 2, 6, 7}, one can conclude from Theorem 12 that these firms are not least efficient with at least 90% confidence (in fact we are 90.74% confident). A comparison to the inference of Horrace and Schmidt (1996, 2000) is in order. First, in Horrace and Schmidt (1996), their GLS specification calculates the same point estimates for TEj as in Table 1. Their confidence intervals, based on critical points from univariate truncated normals, implicitly assume that technical efficiency is being measured relative to an unknown (out of sample) absolute standard.
For instance, their 90% confidence interval for firm 5 is [0.9721, 0.9994] (with point estimate 0.9982), so firm 5 is not operating on the absolutely efficient frontier with 90% probability. The inference presented here is for relative efficiency: firm 5 is efficient relative to the other firms with 71.17% probability. Of course, at the 90% level the inference determines that firm 5 may not be on the efficient frontier (Smin = ∅), so at least the results of the two different techniques confirm one another. One might conclude that with 90% probability firm 3 or 5 is the most efficient, 4 is least efficient, and the rest are somewhere in between. This is a stronger statement than that of the Horrace and Schmidt (1996) intervals, which can only say that all the firms are absolutely inefficient. Horrace and Schmidt (2000) calculate confidence intervals using a fixed-effect specification and ''multiple comparisons with the best'' techniques, based on differences of normal (non-truncated) variates. Their inference is explicitly for relative differences (similar to the results here) and implies a subset selection criterion for the relatively efficient firm. For instance, in that study firm 5 has a 90% confidence interval of [0.9448, 1.000], and the subset of firms that may be relatively efficient consists of firms 3 and 5. This is similar, but not identical, to the inference here, where at the 90% level no single firm is relatively efficient, but with near certainty firm 3 or 5 (or both) are efficient. Specification and distributional differences aside, the inferential differences are also driven by the fact that in Horrace and Schmidt (2000) efficiency is a time-invariant estimable parameter, while here efficiency is an unobserved error component that (potentially) varies with time. Estimating actual technical efficiency (not just its distribution) enables Horrace and Schmidt to identify a non-empty set, similar to Dunnett (1955).

Conclusions
This paper develops selection rules for performing inference on rankings of firm-level technical efficiency in parametric stochastic frontier models. If we are willing to ignore the sampling variability in estimates of the mean and variance that underlie the truncated distributions that characterize technical inefficiency, then the suggested selection rules are a better gauge of inefficiency than the commonly used rankings of , because the rules take into account the variance of the underlying distributions, while the conditional mean rankings do not. Other attempts at incorporating this variance into efficiency analysis have been made: Horrace and Schmidt (1996) use it to construct marginal confidence intervals for the conditional distribution of u, and Bera and Sharma (1999) use it as a proxy for production risk or uncertainty. However, neither of these innovations is a substitute for the proposed selection rules, because neither exploits the multivariate distribution of the differences to draw inferences about who is technically efficient and who is not. If we are unwilling to ignore the sampling variability, then the conditional distribution of u is not necessarily truncated normal, and the power of the selection rule is suspect. However, so are the usual sample rankings of , the confidence intervals of Horrace and Schmidt (1996), and virtually every application of parametric stochastic frontiers that provides firm-level technical efficiency rankings. Therefore, understanding the nature of this sampling variability should be a high priority in the productivity research agenda. In the context of the selection rule, accommodation of the sampling variability would involve adjusting the power of the rule based on some quantification of the variability, but this problem is left for future research.
Alternatively, the conditional distribution of u could be bootstrapped, and quantiles from the distribution of all differences could then be simulated to perform inference, but this is no substitute for a well-developed distributional theory. Moreover, Bayesian approaches could be adopted that allow for ranking inference while viably controlling for sampling variability. Finally, it is interesting to speculate on theoretical and empirical extensions for the selection rules. Perhaps they could be used for inference on truncated normal population parameters, based on random observations from the truncated populations. This seems reasonable, but the distributional theory may be cumbersome. Also, perhaps the rules could be adapted to allow ranking and selection of various econometric model specifications based on some positive acceptance criteria, such as R-squared or ''sum of squared errors'', insofar as these criteria possess positive distributions. This, however, remains to be seen.