LASSO for Stochastic Frontier Models with Many Efficient Firms

Abstract We apply the adaptive LASSO to select a set of maximally efficient firms in the panel fixed-effect stochastic frontier model. The adaptively weighted L_1 penalty with sign restrictions allows simultaneous selection of a group of maximally efficient firms and estimation of firm-level inefficiency parameters, with a faster rate of convergence than least squares dummy variable estimators. Our estimator possesses the oracle property. We propose a tuning parameter selection criterion and an efficient optimization algorithm based on coordinate descent. We apply the method to estimate a group of efficient police officers who are best at detecting contraband in motor vehicle stops (i.e., search efficiency) in Syracuse, NY.


Introduction
Stochastic frontier (SF) models for panel data typically estimate firm-level efficiency from firm fixed effects and rank them to identify a single firm in the sample as most efficient. That is, SF estimators do not generally identify efficiency ties, while in truth there may be multiple firms in the population tied for most efficient, particularly in competitive markets.
There exist methodologies in the literature to identify multiple efficient firms, but they rely on strong distributional assumptions and use two-step procedures. In the first step, firm-level efficiencies (or equivalent measures) are estimated, and in the second step a separate inference technique or a selection criterion determines membership in a set of the most efficient firms. For example, based on the parametric SF model of Aigner et al. (1977), Horrace and Schmidt (1996), Simar and Wilson (2009), and Wheat et al. (2014) estimate efficiency using Jondrow et al. (1982) and construct univariate prediction intervals to identify multiple firms that are statistically indistinguishable from the most efficient firm in the sample. Horrace (2005) and Flores-Lagunes et al. (2007) extend this to multivariate intervals that account for the multiplicity inherent in the ranked estimates, and Horrace and Schmidt (2000) develop multivariate intervals for the fixed-effect SF model of Schmidt and Sickles (1984). Despite the semiparametric nature of the fixed-effect model, these inference techniques still rely on parametric assumptions on the distribution of estimated efficiencies (i.e., that they are normally distributed or asymptotically so). More recently, Kumbhakar et al. (2013) propose a zero inefficiency stochastic frontier model for cross-sectional data that produces a subset of firms in the sample that are fully efficient. They estimate the probability of a firm falling into the zero inefficiency regime using a latent class model, then use that probability to determine efficient firms. However, this approach still relies on parametric distributional assumptions and a two-step procedure. In this article, we explicitly assume that some unknown fraction of firms in the panel are fully efficient and develop a one-step, semiparametric procedure for identifying a set of efficient firms using the adaptive LASSO (Zou 2006). Specifically, the proposed approach proceeds as least squares dummy variable (LSDV) estimation, but the objective function is augmented with an adaptively weighted L_1 penalty for firm-level inefficiencies. Since the inefficiency parameters are constrained to be nonnegative in the model, we impose sign restrictions. The estimation procedure identifies a subset of firm-level inefficiencies as exactly zero, which is an interesting feature of our model compared to the conventional LASSO, where identification of nonzero coefficients is of primary interest. The LASSO has been applied to various selection problems, but our article is the first to consider its application to stochastic frontier models for identification of a group of efficient firms. We also propose an efficient optimization algorithm based on the coordinate descent method, which significantly reduces computational costs.
Our estimator requires inefficiencies to be time-invariant, so it is best deployed when measures of average sample inefficiency are appropriate or desired. If high-frequency data are available over a short time period (e.g., a year), a time-invariance assumption is arguably reasonable. There are more flexible specifications that allow inefficiency to vary over time while accounting for time-invariant firm heterogeneity (e.g., the "true fixed-effect" SF model of Greene 2005), but they often come at the cost of more parametric assumptions. We discuss these models and the limitations of time-invariant inefficiency in greater detail in the next section.
We analyze the asymptotic properties of the proposed estimator for the case (N, T) → ∞, where N is the number of firms and T is the number of time periods in the sample. We allow for time-series dependence and cross-sectional heteroscedasticity in regression errors and covariates, which is new to the panel SF literature. Also, our approach allows N to be much larger than T under proper moment conditions. Therefore, our estimator is well suited to analyze large and competitive markets where many firms may be on the efficient frontier. We show that the proposed estimator consistently identifies the set of truly efficient firms when the efficiency gap between the efficient and inefficient groups vanishes slowly with the sample size and when regression errors and covariates satisfy proper moment conditions. The LASSO estimator of maximal efficiency exhibits √(δNT) consistency, where δ is the proportion of fully efficient firms in the sample, while the LSDV estimator exhibits only √(T/(log N)²) consistency. Consequently, the LASSO estimator outperforms LSDV in many panels, including short panels. This is borne out in our simulation study. We also propose a tuning parameter selection criterion and prove that selection of the efficient subset of firms remains consistent with the tuning parameter so chosen.
We apply our technique to a panel of police stop/search/arrest data in Syracuse, NY, for the year 2006. We consider a linear probability model of arrest rates conditional on a vehicle search with time-invariant officer fixed effects, while controlling for other features of officer ability and patrol assignment. Our LASSO technique identifies a group of 45 out of 139 officers who are efficient at vehicle searches in 2006 (i.e., the 45 officers who are best at uncovering illegal items leading to a motorist arrest). Policymakers can use this tool to identify a set of efficient officers for performance recognition, for example.
The rest of this article is organized as follows. The next section introduces the model and the adaptive LASSO estimator. Section 3 provides technical assumptions and derives the oracle property of the estimator. Section 4 discusses our optimization algorithm and tuning parameter selection. Sections 5 and 6 provide simulation and empirical application results, and Section 7 concludes. All proofs and additional simulation results are given in the online supplementary material.

LSDV Estimation
We consider the panel SF model with time-invariant technical inefficiency (e.g., Schmidt and Sickles 1984), given as

    y_it = α_0 + x_it′β_0 + v_it − u_0,i    (1)

for i = 1, ..., N and t = 1, ..., T, where y_it is the logarithm of the scalar output of the ith firm in the tth period, α_0 is a common intercept, x_it is the logarithm of a p × 1 input vector, and β_0 is the corresponding p × 1 parameter vector of marginal effects. The regression equation has two error terms: the first, v_it, is two-sided noise with E[v_it | x_it, u_0,i] = 0, and the second, u_0,i, is time-invariant firm-specific inefficiency, which can be arbitrarily correlated with x_it. We suppose no cross-sectional dependence but allow time-series dependence in errors and covariates. Unlike standard fixed-effect panel regression models, we restrict u_0,i ≥ 0 for all i, but we do not impose any distributional assumptions on inefficiency. Time-invariant inefficiency u_0,i is somewhat restrictive, especially when we study panel data over a long period of time. However, if high-frequency data are available and measures of average sample inefficiency are desired, this approach can be employed. There are SF models that specify time-varying inefficiency with some restrictive functional forms (e.g., Ahn et al. 2007), but it is unclear how to apply the LASSO in these cases to model a group of efficient firms.
Another limitation of time-invariant inefficiency is that the model (1) cannot identify marginal effects for time-invariant regressors in x_it, so their marginal effects are absorbed into the inefficiency term. In this case, the interpretation of the firm-specific inefficiency u_0,i can be subtle. There are SF models that include an additional term accounting for such time-invariant firm heterogeneity (e.g., Greene 2005), but these models generally require distributional assumptions on the noise and inefficiency terms. Our interest in this article is to specify a model amenable to semiparametric estimation, so these approaches are not appropriate for our case. If firm heterogeneity is of greater concern, practitioners may use the panel data estimator of Hausman and Taylor (1981), where a set of time-invariant regressors or instruments that are uncorrelated with the fixed effects is used to control for individual heterogeneity. Another way to minimize time-invariant heterogeneity is the within-category comparison proposed by Feng and Horrace (2007), where comparisons of fixed effects are made within groups of relatively homogeneous firms rather than across all firms.
Existing studies estimate (1) using the LSDV method. More precisely, we rewrite (1) as

    y_it = α_0,i + x_it′β_0 + v_it,  where α_0,i = α_0 − u_0,i    (2)

is the firm-specific fixed effect. We can consistently estimate α_0,i (as T → ∞) and β_0 (as N or T → ∞) by standard within estimation, denoting the estimators α̂_i and β̂, respectively, provided x_it does not include any time-invariant variables.
In the LSDV approach, the frontier parameter α_0 is estimated as

    α̂ = max_{1≤i≤N} α̂_i,

which is consistent for α_0 as (N, T) → ∞ under the assumption that the density of u_0,i is nonzero in the neighborhood of zero, so that min_{1≤i≤N} u_0,i → 0 as N → ∞ with probability approaching one (w.p.a.1) and, consequently, max_{1≤i≤N} α_0,i → α_0 as N → ∞ (e.g., Greene 1980; Schmidt and Sickles 1984).
The individual firm inefficiency u_0,i is then consistently estimated as û_i = α̂ − α̂_i. Here, α̂ represents maximal efficiency in the sample, and we interpret û_i as inefficiency relative to the most efficient firm.
In practice, ties among the estimates û_i are very unlikely, so all firms have strictly positive û_i except the single firm estimated as most efficient in the sample. Therefore, the LSDV approach has the limitation that it can identify only one efficient firm, even when there are multiple efficient firms (with u_0,i = 0) in the sample.
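To fix ideas, the following minimal numpy sketch implements the LSDV steps above for a balanced panel; the array layout and function name are our own illustration, not part of the original exposition.

```python
import numpy as np

def lsdv(y, X):
    """LSDV (within) estimation of y_it = alpha_i + x_it' beta + v_it,
    with alpha_i = alpha_0 - u_{0,i}.  y: (N, T) array, X: (N, T, p) array."""
    N, T, p = X.shape
    y_dm = y - y.mean(axis=1, keepdims=True)          # within-demeaned outcome
    X_dm = (X - X.mean(axis=1, keepdims=True)).reshape(-1, p)
    beta = np.linalg.solve(X_dm.T @ X_dm, X_dm.T @ y_dm.reshape(-1))
    alpha_i = (y - X @ beta).mean(axis=1)             # firm-specific intercepts
    alpha = alpha_i.max()                             # frontier estimate (max operator)
    u = alpha - alpha_i                               # inefficiency relative to best firm
    return beta, alpha, alpha_i, u
```

Only the firm attaining the maximum has û_i exactly zero here, which is precisely the single-efficient-firm limitation discussed above.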

Adaptive LASSO Estimation
To overcome the aforementioned limitations of the LSDV approach, we instead estimate (1) using the adaptive least absolute shrinkage and selection operator (adaptive LASSO), from which we can identify multiple efficient firms (i.e., all firms with true u_0,i equal to zero) by shrinking small values of û_i toward zero.
To this end, we first assume the following sparsity condition. We let S = {i : u_0,i = 0} be the index set of efficient firms and |C| denote the cardinality of a set C. The condition (Assumption 1) requires the fraction of efficient firms |S|/N to converge to some constant δ_0 > 0 as N → ∞.
This sparsity assumption implies that |S| firms are efficient in the sample and that the fraction of efficient firms does not vanish as N → ∞, which plays an important role in the asymptotic analysis later. Note that the model (1) becomes the standard fixed-effect SF model when |S| = 1 (and hence δ_0 = 0) and the neoclassical production model when |S| = N (and hence δ_0 = 1). Although we take p = dim(β_0) to be fixed in this article, we could also allow p to increase with N and assume sparsity on β_0, under which the nonzero elements of β_0 can be identified as well. However, this result is already well studied (e.g., Belloni et al. 2016; Caner et al. 2018), so we focus on shrinkage estimation of u_0,i in this article.
Let β̂ be a consistent estimator of β_0 from (2), such as the standard within estimator

    β̂ = (Σ_{i=1}^N Σ_{t=1}^T x̃_it x̃_it′)^{−1} Σ_{i=1}^N Σ_{t=1}^T x̃_it ỹ_it,

where ỹ_it = y_it − ȳ_i with ȳ_i = (1/T) Σ_{s=1}^T y_is, and similarly for x̃_it. After concentrating out β_0 in (1), the adaptive LASSO estimator for θ_0 = (α_0, u_0,1, ..., u_0,N)′ is defined as

    θ̂(λ) = argmin_{α; u_1,...,u_N ≥ 0} Σ_{i=1}^N Σ_{t=1}^T (y_it − x_it′β̂ − α + u_i)² + λ Σ_{i=1}^N π̂_i u_i,    (5)

where λ > 0 is a tuning parameter. The {π̂_i}_{i=1}^N are data-dependent weights obtained from consistent initial estimates of u_0,i. In particular, we let π̂_i = û_i^{−γ} for some γ > 1, where û_i is the LSDV estimator described in the previous section. Unlike the original LASSO of Tibshirani (1996), the adaptive LASSO allows for unequal shrinkage across parameters through the data-dependent weights π̂_i, which yields the oracle property (see Fan and Li 2001; Zou 2006). It should be emphasized, however, that θ̂(λ) in (5) differs from the standard adaptive LASSO estimator of Zou (2006), since we impose sign restrictions on a diverging number of parameters: u_i ≥ 0 for all i.
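As a concrete illustration, the sketch below evaluates the criterion in the reconstructed form of (5) and builds the adaptive weights from the initial LSDV estimates; the exact scaling of the penalty term is an assumption consistent with the coordinate descent updates in Section 4.

```python
import numpy as np

def adaptive_weights(u_lsdv, gamma=2.0):
    """pi_i = u_hat_i^(-gamma) with gamma > 1; firms with u_hat_i = 0 get an
    infinite weight, which forces them onto the frontier."""
    with np.errstate(divide="ignore"):
        return u_lsdv ** (-gamma)

def penalized_sse(alpha, u, y, X, beta_hat, pi, lam):
    """Criterion in (5): squared residuals plus the weighted L1 penalty over the
    nonnegative inefficiencies (a firm with u_i = 0 contributes no penalty)."""
    resid = y - X @ beta_hat - alpha + u[:, None]   # (N, T) residual matrix
    mask = u > 0
    return (resid ** 2).sum() + lam * (pi[mask] * u[mask]).sum()
```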
One important remark on (5) is that we estimate α_0 and (u_0,1, ..., u_0,N) together in one step. This is not feasible in the standard fixed-effect SF model because of perfect multicollinearity between the constant term and the individual dummies. In contrast, one-step estimation is feasible in our case due to the sparsity assumption and the L_1 penalty term, which shrinks some of the individual dummies to zero.
The main goal of this approach is to identify two groups: efficient and inefficient firms. Therefore, it is similar to Bonhomme and Manresa (2015), who also consider a latent group structure problem determined by group-specific fixed effects. However, their methodology relies on minimization of a least squares criterion with respect to all possible groupings, whereas we use the LASSO technique to identify latent groups (efficient firms vs. inefficient firms) under sign restrictions on the fixed effects.
The adaptive LASSO problem in (5) is also related to the latent group structure model of Su et al. (2016) and the fused LASSO of Tibshirani et al. (2005). They penalize pairwise differences among coefficient values to produce group identification. However, our problem differs from theirs because we impose sign restrictions on u_0,i and allow the size of the smaller (near-zero) inefficiencies to shrink to zero at an appropriate rate. In comparison, Su et al. (2016) assume that the group-specific parameters in their model are separated from each other by a fixed distance.

Oracle Properties
The adaptive LASSO allows for unequal shrinkage for each parameter and results in the oracle property. This oracle property extends to our case under (N, T)-asymptotics, where N can grow faster than T when errors and covariates satisfy proper moment conditions.
We assume the following conditions in our asymptotic analysis; in particular, we define η = min_{i ∈ S^c} u_0,i, the smallest nonzero inefficiency. In Assumption 2-(1), we rule out cross-sectional dependence, but allow for time-series dependence and heteroscedasticity in the errors and covariates. In Assumption 2-(1)-(ii) and (iii), we require (x_it, v_it) to be a strong mixing process over t with geometric decay rate and further restrict the moments of ||x_it|| and |v_it| to be finite up to a certain order. The tail restrictions and finite moment conditions allow us to use exponential inequalities for strong mixing processes (e.g., Merlevède et al. 2009) to bound misclassification probabilities and achieve selection consistency. Assumption 2-(2)-(i) holds for general M-estimators such as the within estimator under (N, T) → ∞. Assumptions 2-(2)-(ii), 2-(2)-(iii), and 2-(3) impose rate conditions on N, T, η, and λ to ensure both selection and estimation consistency. Assumption 2-(2)-(ii) allows N to grow much faster than T when q is large (i.e., when the tail probability of the error decays quickly). Therefore, it covers many panel structures, including short panels. Allowing for large N (i.e., a large market) relative to T is useful here, since we consider time-invariant inefficiency and assume many efficient firms. The rate conditions also control the magnitude of the tuning parameter λ, so the LASSO procedure can select the zero coefficients correctly without yielding bias in the nonzero coefficient estimators in the limit. The assumption allows the nonzero inefficiencies to be close to zero (i.e., η can be very small), provided η shrinks slowly enough that the nonzero inefficiencies can be distinguished from the zero coefficients and are not affected by shrinkage estimation.
We first derive the following lemma, which shows that the LSDV estimator of the frontier parameter α_0 summarized in Section 2.1 is consistent. This lemma serves as a technical lemma to prove the theorems in this article and also allows us to compare the convergence rate of α̂, the LSDV estimator, with that of α̂(λ), the LASSO estimator.
Lemma 1. Recall that α̂ = max_{1≤i≤N} α̂_i, where α̂_i is the LSDV estimator of α_0,i in (2). Then, under Assumptions 1, 2-(1), and 2-(2), as (N, T) → ∞,

    α̂ − α_0 = O_p(log N/√T).

This lemma implies that α̂ is estimated from one of the efficient firms in the sample w.p.a.1 (i.e., Pr(α̂ = max_{i∈S} α̂_i) → 1 as (N, T) → ∞). The convergence rate in this lemma is identical to that derived in Park et al. (1998), but their result assumes iid data with exponential moment conditions imposed on errors and covariates. Therefore, our lemma can be seen as a generalization of their result.

Now we turn to the LASSO estimators. Let Ŝ = {i : û_i(λ) = 0}. We first establish selection consistency, which implies that the LASSO consistently identifies the two latent groups of efficient and inefficient firms.

Theorem 1. Suppose Assumptions 1 and 2 hold. Then Pr(Ŝ = S) → 1 as (N, T) → ∞.
We introduce the following assumptions and notations for the limiting distributions of the LASSO estimators.

Assumption 3. (i) There exist positive constants σ_S², σ_{S^c}², and σ_i² for each i ∈ S^c that give the limiting variances of the relevant error averages for the efficient group, the inefficient group, and each inefficient firm, respectively.

Theorem 2. Suppose Assumptions 1, 2, and 3 hold. Then, as (N, T) → ∞,

    √(δNT) (α̂(λ) − α_0) →_d N(0, σ_S²)  and  √T (û_i(λ) − u_0,i) →_d N(0, σ_i²) for each i ∈ S^c.

Theorem 2 says that we can efficiently estimate the frontier and the firm-level inefficiency parameters using the LASSO estimator. Therefore, combined with Theorem 1, it establishes the oracle property of the adaptive LASSO estimators. It is noteworthy that α̂(λ) − α_0 = O_p((δNT)^{−1/2}), a much faster rate than that of the LSDV estimator α̂ in Lemma 1. This is quite intuitive: the LSDV estimator uses only a single best firm's observations, whereas α̂(λ) uses the |Ŝ|·T observations of all firms identified as efficient by the LASSO. As long as δ does not vanish as N → ∞, which is a reasonable assumption for competitive markets, the LASSO estimator is preferred.

Optimization Algorithm
The L_1 penalty term in the LASSO objective function is not differentiable at the origin, so we cannot directly apply standard quadratic optimization algorithms such as Newton-Raphson. Many alternative optimization algorithms have been developed: local quadratic approximation (Fan and Li 2001), least angle regression (Efron et al. 2004), and the coordinate descent algorithm (Friedman et al. 2010), among others.
In this section, we derive an efficient coordinate descent algorithm that accounts for the sign restrictions in our model. This algorithm uses preliminary inefficiency ranking information from the initial LSDV estimation, which allows us to skip a large number of irrelevant optimization steps.
The efficiency of our proposed optimization procedure can be understood as follows. The Karush-Kuhn-Tucker (KKT) conditions for (5) imply that the coordinate descent algorithm reduces to successively updating α̂(λ), û_1(λ), ..., û_N(λ) via the two equations

    û_i(λ) = max{0, α̂(λ) − (1/T) Σ_{t=1}^T (y_it − x_it′β̂) − (λ/2T) π̂_i},    (6)
    α̂(λ) = (1/N) Σ_{i=1}^N [(1/T) Σ_{t=1}^T (y_it − x_it′β̂) + û_i(λ)].

We note that the ordering of α̂(λ) − (1/T) Σ_{t=1}^T (y_it − x_it′β̂) in (6) follows the ordering of û_i, since û_i = α̂ − α̂_i and α̂_i = (1/T) Σ_{t=1}^T (y_it − x_it′β̂), and the shrinkage effect from the penalty term in (6), (λ/2T) π̂_i, is larger for smaller û_i, since π̂_i = û_i^{−γ} with γ > 1. This implies that, for given λ, if û_i ≤ û_j then û_i(λ) ≤ û_j(λ). Therefore, once û_j(λ) shrinks to 0, we can skip the updates for all firms i with û_i ≤ û_j and identify them as efficient firms. This reduces computational costs significantly when N is large. Our proposed algorithm based on this idea is summarized as follows.
For a given λ, the algorithm proceeds through the preliminary LSDV ranking: it checks the KKT condition for the next-best firm based on the sign of its update in (6); once a firm's update hits zero, that firm and all better-ranked firms are classified as efficient, the frontier parameter estimate is recomputed, and the remaining inefficiencies are updated before the results are reported (a code sketch is given below). This coordinate descent algorithm uses the convexity of the objective function and the preliminary inefficiency ranking at the same time, which enables us to find the minimum of the objective function quickly.
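The sketch below implements these two updates with the ranking-based skipping rule; it assumes the reconstructed forms of (5) and (6), and the initialization and convergence check are our own choices.

```python
import numpy as np

def adaptive_lasso_cd(y, X, beta_hat, u_lsdv, lam, gamma=2.0,
                      tol=1e-10, max_iter=1000):
    """Coordinate descent for (5) under u_i >= 0:
       u_i   <- max{0, alpha - m_i - lam*pi_i/(2T)},  m_i = mean_t(y_it - x_it'beta)
       alpha <- mean_i(m_i + u_i).
    Firms are visited from least to most efficient (LSDV ranking); once an
    update hits zero, every better-ranked firm is set to zero without updating."""
    N, T = y.shape
    m = (y - X @ beta_hat).mean(axis=1)            # m_i equals the LSDV alpha_i
    with np.errstate(divide="ignore"):
        pi = u_lsdv ** (-gamma)                    # adaptive weights (inf if u = 0)
    shrink = lam * pi / (2 * T)                    # per-firm soft-threshold level
    order = np.argsort(-u_lsdv)                    # least efficient firms first
    alpha = m.max()                                # initialize at the LSDV frontier
    u = np.maximum(0.0, alpha - m)
    for _ in range(max_iter):
        alpha_old = alpha
        for rank, i in enumerate(order):
            u[i] = max(0.0, alpha - m[i] - shrink[i])
            if u[i] == 0.0:                        # KKT: all remaining firms are
                u[order[rank:]] = 0.0              # efficient too, so skip them
                break
        alpha = (m + u).mean()                     # closed-form alpha update
        if abs(alpha - alpha_old) < tol:
            break
    return alpha, u                                # frontier and inefficiencies
```

The early exit inside the inner loop is the computational payoff: for large N, most coordinates are never updated once the zero boundary is reached.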

Tuning Parameter Selection
The performance of the adaptive LASSO estimator relies on appropriate selection of the tuning parameter λ. Methods based on cross-validation and AIC-type criteria are known to over-select (i.e., yield too many nonzero estimates), which in our context translates into under-selection of the efficient firms. Wang et al. (2007) instead propose tuning parameter choice based on a BIC-type criterion, which is shown to consistently select the correct model when it exists.
We consider a BIC-type criterion for the choice of λ, given by

    BIC(λ) = log σ̂²(λ) + |Ŝ^c(λ)| φ_NT/(NT),    (7)

where φ_NT is a sequence increasing with N or T, Ŝ^c(λ) = {i : û_i(λ) > 0}, and σ̂²(λ) is the mean squared residual from (5) for fixed λ. The following theorem proves that selection consistency still holds with the tuning parameter chosen by (7).

Theorem 3. Suppose Assumptions 1 and 2 hold, and let λ̂ minimize (7). If (φ_NT/T)^{1/2} η^{−1} → 0 and φ_NT/(log N)² → ∞, then Pr(Ŝ(λ̂) = S) → 1 as (N, T) → ∞.
Theorem 3 indicates that when φ_NT grows at an appropriate rate, we can consistently identify the true set of efficient firms using the tuning parameter chosen by (7). In particular, the conditions (φ_NT/T)^{1/2} η^{−1} → 0 and φ_NT/(log N)² → ∞ ensure that the probabilities of under-fitting (i.e., some nonzero inefficiencies estimated as zero) and over-fitting (i.e., some zero inefficiencies estimated as nonzero) vanish asymptotically. The choice of φ_NT is crucial in practice to control these under- and over-fitting probabilities. A φ_NT that satisfies the rate conditions can take the form ν(log N)² c_NT, where ν is a positive constant that gives flexibility in controlling the degree of penalization in the criterion (similar to ERIC by Hui et al. 2015) and c_NT is a diverging sequence whose rate of divergence may be arbitrarily slow. From our simulations, we find that 0.1(log N)² c_NT with c_NT = log log(NT/(N + T)) ≈ min{log log N, log log T} works well for various panel structures; we also experimented with other selection criteria, including ERIC, IC_p1 of Bai and Ng (2002), and LIC_BIC of Lee and Phillips (2015), and found that (7) performs best in this panel SF model. We use this choice for our simulations and empirical application.
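A sketch of the resulting grid search follows, assuming the reconstructed form of (7) and the φ_NT suggested above; it reuses adaptive_lasso_cd from the previous sketch.

```python
import numpy as np

def select_lambda(y, X, beta_hat, u_lsdv, lam_grid):
    """Choose lambda by minimizing the BIC-type criterion (7), reconstructed as
    BIC(lam) = log(sigma2_hat(lam)) + |S_hat^c(lam)| * phi_NT / (NT),
    with phi_NT = 0.1 * (log N)^2 * log(log(NT / (N + T)))."""
    N, T = y.shape
    phi = 0.1 * np.log(N) ** 2 * np.log(np.log(N * T / (N + T)))
    best = (np.inf, None, None, None)
    for lam in lam_grid:
        alpha, u = adaptive_lasso_cd(y, X, beta_hat, u_lsdv, lam)
        resid = y - X @ beta_hat - alpha + u[:, None]
        sigma2 = (resid ** 2).mean()               # sigma2_hat(lam) from (5)
        bic = np.log(sigma2) + (u > 0).sum() * phi / (N * T)
        if bic < best[0]:
            best = (bic, lam, alpha, u)
    return best                                    # (bic, lambda*, alpha_hat, u_hat)

# e.g., the grid used in the simulations: np.linspace(1e-4, 10 * T, 250)
```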

Simulations
In this section, we study the finite sample performance of the LASSO estimator. We consider the model (1) with α_0 = 1, β_0 = (1, 1, 1, 1)′, and x_it ∼ iid N(0, Σ), where the (j_1, j_2)th element of Σ is 0.5^{|j_1 − j_2|} for j_1, j_2 = 1, ..., 4, and δ = 0.3 (i.e., 30% of firms in the sample are fully efficient; additional results for δ ∈ {0.1, 0.9} are in the online supplementary material, where the finite sample performance of the LASSO deteriorates as δ decreases but still notably improves on the LSDV). The two-sided error, v_it, is conditionally heteroscedastic and serially correlated: v_it = 0.25 v_{i,t−1} + ω_it for t = 2, ..., T and v_i1 = ω_i1, where ω_it | x_it ∼ iid N(0, σ_it²) with σ_it = 0.45 if Σ_{j=1}^4 x_itj < 0 and σ_it = 1.45 otherwise; the variances of ω_it were chosen so that the overall variance of v_it is approximately one. In every simulation, each nonzero individual inefficiency u_0,i is independently drawn from the exponential distribution with density (1/σ_u)e^{−u/σ_u} and trimmed below at 0.01 to ensure all draws are strictly positive. We experiment with σ_u ∈ {1, 2, 4}. Note that as σ_u gets smaller, the probability of small inefficiency draws gets larger, making it more difficult for the LASSO to distinguish them from zero; this is particularly difficult when T is small. Figure 1 shows the density of the inefficiency u_0,i for each σ_u value (left panel) and an example of draws from each case (right panel). Inefficiencies clearly have high density near zero when σ_u = 1. For the penalty function, we set γ = 2, and λ is selected by (7) from a grid search over 250 evenly spaced points between 10^{−4} and 10T. (We are free to choose γ as long as it satisfies the rate conditions in Assumption 2-(3). From the asymptotic analysis, choosing a larger γ ensures the LASSO estimates zero coefficients as zero, but also increases the probability of estimating small nonzero coefficients as zero; in applications, γ should be chosen in light of this tradeoff.) We simulate each case 1000 times for combinations of N ∈ {100, 200, 400, 1000} and T ∈ {10, 30, 50, 70}.
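For replication, here is a sketch of one draw from this design; the trimming rule and AR(1) initialization follow the text, while the interface is our own. The commented lines at the end show how the earlier sketches fit together on one replication.

```python
import numpy as np

def simulate_panel(N, T, delta=0.3, sigma_u=1.0, seed=None):
    """One draw from the simulation design: Gaussian covariates with
    Sigma[j1, j2] = 0.5^|j1-j2|, conditionally heteroscedastic AR(1) noise,
    and exponential inefficiencies (trimmed below at 0.01) for the
    (1 - delta) fraction of inefficient firms."""
    rng = np.random.default_rng(seed)
    p, alpha0, beta0 = 4, 1.0, np.ones(4)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=(N, T))
    n_eff = int(delta * N)                               # fully efficient firms
    u0 = np.concatenate([np.zeros(n_eff),
                         np.maximum(0.01, rng.exponential(sigma_u, N - n_eff))])
    sig = np.where(X.sum(axis=2) < 0, 0.45, 1.45)        # sigma_it rule from the text
    omega = rng.normal(0.0, sig)                         # (N, T) innovations
    v = np.empty((N, T))
    v[:, 0] = omega[:, 0]
    for t in range(1, T):
        v[:, t] = 0.25 * v[:, t - 1] + omega[:, t]
    y = alpha0 + X @ beta0 - u0[:, None] + v
    return y, X, u0

# One replication, combining the sketches above:
# y, X, u0 = simulate_panel(N=100, T=30, sigma_u=2.0, seed=0)
# beta_hat, _, _, u_lsdv = lsdv(y, X)
# lam_grid = np.linspace(1e-4, 10 * 30, 250)
# _, lam_star, alpha_hat, u_hat = select_lambda(y, X, beta_hat, u_lsdv, lam_grid)
```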
First, Table 1 reports and compares the results from the adaptive LASSO estimation in (5) and the conventional LSDV approach described in Section 2.1. In particular, we report the root mean squared errors (RMSE) of Û_LASSO = (û_1(λ*), ..., û_N(λ*))′ and Û_LSDV = (û_1, ..., û_N)′; point estimates of α_0 from α̂_LASSO = α̂(λ*) and α̂_LSDV = α̂ (= max_{1≤i≤N} α̂_i); and the sample correlations between the ranking of the nonzero inefficiencies U_{0,S^c} and the rankings of their counterpart estimates Û_{LASSO,S^c} and Û_{LSDV,S^c} for given S.

The LASSO notably outperforms the LSDV in terms of the RMSE of Û in all cases. Note that as N increases, the RMSE of Û_LASSO decreases while that of Û_LSDV increases, leading to a larger disparity between the two methods. When N = 1000, the RMSE of Û_LSDV is almost three times larger than that of Û_LASSO. As the asymptotic analysis implies, this is mainly because of the faster convergence of α̂_LASSO to the true value.
As the means and variances of α̂_LASSO and α̂_LSDV in Table 1 show, the distribution of α̂_LASSO is centered much closer to the true value (α_0 = 1) than that of α̂_LSDV, even when T and σ_u are small, and the bias and variance of α̂_LASSO decrease quickly as N or T increases. In addition, the max operator that α̂_LSDV uses to estimate α_0 tends to pick the most biased individual intercept estimate. Therefore, in the presence of multiple zero-inefficiency firms, the max operator produces a biased estimate of α_0, which in turn leads to bias in the estimated inefficiencies u_0,i.

The LASSO and the LSDV show similar rank correlation results. The LASSO appears to preserve the original ranking better than the LSDV when T and σ_u are small. This is when there is much uncertainty in the inefficiency estimates, and the LASSO improves ranking accuracy by estimating statistically small inefficiencies as zero.
Second, Table 2 presents the selection accuracy of the LASSO estimation. In particular, we report the probability of yielding a zero estimate for i ∈ S, P_S = Pr(i ∈ Ŝ | i ∈ S); the probability of yielding a nonzero estimate for i ∈ S^c, P_{S^c} = Pr(i ∈ Ŝ^c | i ∈ S^c); the estimated proportion of efficient firms δ̂; and the maximum true value of u_0,i among firms whose inefficiency is nonzero but estimated as zero (max_{i ∈ Ŝ ∩ S^c} u_0,i; Max-miss), which represents the degree of misclassification.
Both P_S and P_{S^c} improve as T increases, but as N increases, P_S increases while P_{S^c} decreases. The tradeoff between P_S and P_{S^c} as N increases is related to the form of φ_NT in (7). Theorem 3 implies that φ_NT should grow faster than (log N)², which ensures the LASSO estimates a diverging number of zero coefficients as zero when N increases; at the same time, the smallest inefficiency, η, should be sufficiently large that the probability of estimating small nonzero coefficients as zero does not increase adversely. In our simulations we allow for small inefficiencies, so the tradeoff between P_S and P_{S^c} is apparent as N increases. This is particularly true when σ_u is small. (As with γ, in practice φ_NT should be chosen in light of this tradeoff. However, we find a φ_NT that is optimal for a wide range of N difficult to construct; e.g., our φ_NT appears to grow rather quickly as N increases, leading to an underestimation of α_0 when N = 1000. Optimal choice of φ_NT is left for future research.) However, note that most of the inefficient firms incorrectly estimated as efficient (i.e., those with zero inefficiency estimates) have near-zero inefficiency. The small values of Max-miss in Table 2 imply that only firms with near-zero inefficiency could be incorrectly categorized as fully efficient by the LASSO procedure. More importantly, even when T is small, including (N, T) = (1000, 10), δ̂ is quite close to the true proportion δ = 0.3 as long as σ_u is large. This has an important practical implication: our approach can be used even for short panels, as long as there are not too many near-zero-inefficiency firms. Hence, in practice, information on the variance of u_0,i is important when choosing the proposed LASSO approach. Cai et al. (2021) study nonparametric identification of σ_u in the panel setup, where it is allowed to be conditionally heteroscedastic.

Empirical Application: Police Vehicle Search Efficiency in Syracuse, New York
In this section, we consider selecting a group of the best officers for annual evaluation based on how successfully they carried out vehicle searches throughout a year. The idea is that officers perform a cost-benefit analysis in the decision to search the vehicle of a stopped motorist. The costs to search are the opportunity cost of their time and effort and the potential cost of being targeted for a "wrongful search" when the search fails to uncover illegal items (contraband). The benefit to a successful search (one that uncovers contraband) is the arrest of the motorist. Specifically, we model success rates (i.e., hit rates) conditional on a search of a stopped vehicle, using a linear probability model, and use the officer fixed effects in the model to calculate officer-specific success rates (i.e., search efficiency). We include several police productivity inputs and dummy variables to control for heterogeneity due to differing levels of police experience and the location and time of the vehicle searches. (Defining police productivity by success rate has the limitation that officers with a higher standard for guilt tend to have a higher success rate, since they would only search vehicles with a high probability of carrying contraband. A composite measure that accounts for both the quantity and quality of searches is left for future research.)

We use the high-frequency panel of discretionary vehicle search activity by officer within the year 2006 in the City of Syracuse, NY, previously analyzed by Horrace and Rohlin (2016). Their focus was on estimating vehicle stop rate differentials by race and testing their significance for the entire force, ignoring officer identifiers in the data. We exclude officers whose total number of vehicle searches in the year is less than five. We also exclude stops made in census tracts with fewer than five observations. Our final sample includes 139 field officers and 2863 observations (i.e., searches). Note that our sample is an unbalanced panel in which each officer makes a different number of searches, T_i, in 2006.

The linear probability model is specified as follows:

    Pr(arrest_it = 1 | α_0, x_it, z_i, u_0,i) = α_0 + x_it′β_0,1 + z_i′β_0,2 − u_0,i

for i = 1, ..., 139 and t = 1, ..., T_i, where arrest_it is the binary outcome variable for officer i at time t, which is 1 if the search results in an arrest of the motorist and 0 otherwise. In the data, we identify the exact time (hh:mm:ss) of each stop. The explanatory variables x_it and z_i are time-varying and time-invariant, respectively; x_it is allowed to be correlated with u_0,i, but z_i is assumed to be strictly exogenous. The variable z_i controls for an important source of time-invariant heterogeneity among officers, the level of experience, for which we use Police Experience, a contemporaneous experience level for each officer based on years of employment on the force.
We could instead consider a measure of experience based on the cumulative number of stops made by each officer. However, this measure may be endogenous to the probability that a vehicle search will lead to the discovery of contraband (and an arrest), so we use officer start date as a proxy for experience, which is plausibly exogenous. We estimate β_0,2 from the between equation

    ȳ_i· = α_0 + z_i′β_0,2 + ς_i,  where ȳ_i· = (1/T_i) Σ_{t=1}^{T_i} (arrest_it − x_it′β̂_{1,LSDV})

and ς_i is a regression error that contains u_0,i and the original two-sided error; this regression is valid as long as z_i is exogenous to both.
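A sketch of this between regression for the unbalanced panel follows, assuming per-officer arrays and a design matrix z that already contains the polynomial terms in experience; the names are illustrative.

```python
import numpy as np

def between_estimates(arrest_list, X_list, z, beta1_lsdv):
    """Between regression ybar_i = alpha_0 + z_i' beta_{0,2} + varsigma_i, with
    ybar_i = (1/T_i) * sum_t (arrest_it - x_it' beta1_lsdv) for officer i.
    arrest_list, X_list: length-N lists of (T_i,) and (T_i, p) arrays; z: (N, q)."""
    ybar = np.array([(a - x @ beta1_lsdv).mean()
                     for a, x in zip(arrest_list, X_list)])
    Z = np.column_stack([np.ones(len(ybar)), z])     # add common intercept
    coef, *_ = np.linalg.lstsq(Z, ybar, rcond=None)  # OLS across officers
    return coef[0], coef[1:]                         # alpha_0 hat, beta_{0,2} hat
```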
To capture possible nonlinearity in the relationship between experience and the arrest rate, we include a third-order polynomial in z_i. The time-varying explanatory variables x_it include controls for other dimensions of heterogeneity in search activity: motorist Youth; Dispersion and Scale of stop activity at the officer level; and Census × Shift and Season dummies. Youth is a dummy for drivers under 25. Dispersion and Scale are constructed from monthly stop activity and account for police heterogeneity due to the different types of duties assigned to officers: Dispersion measures the spatial dispersion of each officer's stop activity and Scale measures the intensity with which officers perform duties, both built from ST_ijt, the number of stops in census tract j in the month of t by officer i, across the J census tracts. These variables address potential selectivity and heterogeneity that arise from the way the police chief assigns officers to specific duties in specific parts of the city. For example, officers who tend to make more stops (ceteris paribus) may be assigned to parts of the city where performing many stops is optimal from the perspective of improving arrest rates.

We first present the estimation results for β_0,1 and β_0,2. The estimates of β_0,1 in Table 3 are intuitive. The negative coefficient on Youth implies that the arrest rate for young motorists is on average lower than for older motorists, which implies that officers searched young motorists without success (i.e., without arrests) more frequently than older motorists. This may be interpreted as bias toward young motorists, or it may simply be that the signal of guilt that officers receive from young motorists is noisier. Dispersion is positive but statistically insignificant; the positive estimate may indicate that officers who carry out duties over a larger area obtain additional learning opportunities, which enhance their ability to detect crime. Scale is negative and significant, which may reflect a quality-quantity tradeoff in search activity.
Figure 2 depicts the change in the arrest rate by police experience along with 95% confidence intervals, computed using the delta method. The arrest rate improves as years of employment increase until around the tenth year, and then decreases. Vehicle searches involve prediction tasks regarding the likelihood of arrest, which may improve over time through learning-by-doing, but the inverted U-shaped curve implies that learning in policing is not a constant accumulation and involves degradation after a certain period.

We now turn to our results on police search efficiency. The LASSO estimates 32.4% of officers (45 out of 139) as efficient. The distribution of the inefficiencies is reported in Figure 3, where the light gray histogram represents the distribution of the inefficiencies from the conventional LSDV approach and the darker one represents that from the LASSO, with 32.4% of the mass concentrated at zero. The distribution of the LSDV inefficiencies looks bimodal, with peaks at around 0.2 and 0.6. It appears that the LASSO shrinks the inefficient mass of officers at the first peak toward zero, implying that this mass of officers is equally efficient. After the LASSO application, the density of the inefficiencies resembles the half-normal or exponential distributions typically assumed in parametric SF models.
This single-year analysis can be extended to a multi-year analysis in a straightforward way: we can deploy separate single-year models for each year, allowing inefficiency to vary across years. Therefore, if high-frequency data for multiple years are available, our approach can allow for time-varying inefficiency and identify a group of efficient agents for each year.

Conclusion
We propose adaptive LASSO estimation with sign restrictions to identify a group of efficient firms in the panel stochastic frontier model. The method is particularly useful when the market size is large. We show that it outperforms the conventional LSDV-based approach in many respects. More generally, whenever we have a panel linear regression model with individual fixed effects and the ranking among the fixed effects contains important information, our approach can identify a subset of the best (or worst) effects. Hence, this type of "best and the rest" classification can also be used as an adaptive sample splitting method. The empirical application demonstrates the practical value of the proposed method.

Figure 1. PDFs of inefficiency with different σ_u values and an example of draws from each PDF.

Figure 2. Change in arrest rate by experience.
Table 1. Estimation results. NOTE: Each entry contains the average value of each measure over 1000 replications, with the corresponding standard deviation in parentheses. Rank correlations are computed only among the inefficiencies whose true values are nonzero.

Table 2. Selection accuracy. NOTE: Each entry contains the average value of each measure over 1000 replications, with the corresponding standard deviation in parentheses.

Table 3. LSDV estimates of β_0,1. NOTE: The linear probability model includes dummies for different combinations of census tracts and three work shifts, and dummies for seasons. Standard errors are clustered at the officer level. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% levels, respectively.