Approaches to Evaluating Welfare Reform:
Lessons from Five State Demonstrations

Chapter 2:
Objectives and Methods of the Welfare Reform Waiver Evaluations

This chapter reviews the objectives of the welfare reform waiver evaluations. It then identifies alternative potential designs for the welfare reform impact evaluations and assesses their strengths and limitations.(1)

A. OBJECTIVES OF THE EVALUATIONS

The terms and conditions under which the federal government granted waivers to the states to implement welfare reform demonstration programs included specifications for evaluating the demonstrations. These specifications encompassed the basic design for the evaluations, data collection activities, outcome measures, and types of analyses. From these specifications, we can infer that the federal objectives for the evaluations were to answer the following questions:

- Participation in the AFDC and Food Stamp programs and associated benefit levels

Other outcomes, such as child school attendance and child inoculations, also were specified in the terms and conditions if specific provisions of a demonstration program were designed to influence those outcomes. The impact analysis was intended to generate estimates of the impacts of the demonstration on these outcomes.

· Did the benefits derived from the reforms exceed the costs, as assessed from the perspectives of the program participants, various levels of government, and society as a whole? The cost-benefit analysis was intended to answer this question.

The terms and conditions specified requirements for the impact analysis that supplemented the core objectives noted here. For example, they required impact estimates for subgroups of the AFDC population. At a minimum, this included separate estimates of impacts by a case's AFDC applicant/recipient status in the month of random assignment and the characteristics of age and race. The terms and conditions posed two additional objectives for the impact analysis as feasibility assessments rather than as required analyses: (1) to determine the effects of the demonstration on entry into AFDC (that is, effects on applications, approved applications, and caseloads), and (2) to estimate impacts of discrete components of the overall waiver package. Most of the designs for the evaluations were not conducive to the estimation of either entry effects or the impacts of separate components of a waiver package.

Although states undertook the waiver evaluations because the terms and conditions required them, they also had their own objectives for the evaluations. For example, a state may have had an especially strong interest in one or more outcome measures not emphasized (or not even mentioned) in the terms and conditions. In addition, states seeking to fine-tune their programs may have given great importance to estimating the impacts of discrete components of a waiver package. A state may have been most concerned about obtaining process information concerning program implementation and client experiences and may have had much less interest in the impact evaluation. States may also have had objectives for the evaluations that could best be addressed through analyses other than the four types discussed earlier. For example, some states sought frequent feedback from welfare participants on their perceptions of welfare reform and how well it was meeting their needs. Responding to these objectives in some instances required periodic customer satisfaction surveys or focus group discussions with welfare clients.

A large number of objectives for a welfare reform evaluation can imply that an evaluation will have difficulty addressing all of them well. Efforts to do so may lead to design changes or shifts in resources that result in less reliable estimates of central outcomes such as employment and welfare participation. One example of this in the waiver context is the design and fielding of client surveys on a wide array of topics, but only after it was too late to obtain the contact information at sample intake needed to ensure a high response rate. Such dilution of effort and the resulting reduction in the quality of research can be avoided if all organizations involved work together from the start to set clear priorities. The priorities should reflect the policy importance of outcomes and the accuracy with which they can be measured.

B. ALTERNATIVE DESIGNS FOR IMPACT EVALUATIONS

With only a few exceptions, the terms and conditions for welfare reform demonstrations in the 1990s have required evaluations based on an experimental design. (The most notable exception, the evaluation of Wisconsin's WNW demonstration, is discussed later in this chapter and elsewhere in this report.) To assess the advantages and limitations of an experimental design, it is helpful to identify the key features of this design and several nonexperimental designs:

The first major application of an experimental design in social welfare policy research was to evaluate the negative income tax experiments of the late 1960s and early 1970s (Burtless and Hausman 1978; and Keeley et al. 1978). Since that time, there have been many social welfare policy evaluations based on experimental designs (Greenberg and Shroder 1991). The number and diversity of these evaluations have been increasing in recent years. Using data on several of these evaluations, methodological studies were conducted to determine whether nonexperimental evaluation methods could yield impact estimates similar in sign and magnitude to those generated by experimental methods (LaLonde 1986; Fraker and Maynard 1987; and Heckman and Hotz 1989). The interpretation of the findings from these studies remains controversial (Heckman and Smith 1995). The most common conclusion, however, is that nonexperimental estimators frequently yield results that differ from those of an experimental evaluation, and are therefore biased. Furthermore, the nonexperimental results are sensitive to minor changes in model specification.(2) Thus, experimental estimators are preferred (Burtless 1995; and Friedlander and Robins 1995). DHHS shares this conclusion, as shown by the strong preference it exhibited for experimental evaluations of the welfare reform waiver demonstrations. In special circumstances, however, it approved alternative designs for evaluations of waiver demonstrations.

Despite the methodological strength of an experimental design, the difficulty of implementing such a design sometimes may limit its usefulness. In addition, there may be nontechnical reasons for preferring an alternative design (such as considerations of cost or fairness). The next two subsections consider the advantages and limitations of an experimental design, with particular emphasis on the needs of the impact analysis component of an evaluation. The third and final subsection defines various permutations of a quasi-experimental design and discusses when such a design might be desirable.

1. Advantages of an Experimental Design

The principal advantage of a well-planned and well-executed experimental design is that it ensures that, apart from receipt of the treatment, experimental and control cases are alike on average. The difference in average outcomes between the experimental and control groups is thus an unbiased estimate of the average impact of the program; this property is known as internal validity. Random assignment thus eliminates the need to rely on a multivariate statistical model to control for case characteristics.(3) Consequently, the estimation of impacts in an experimental design is straightforward. The central feature of an experimental design, random assignment, does what a multivariate model attempts to do in a nonexperimental design: it controls for differences in characteristics between cases that receive the reform program and those that do not. Random assignment imposes this control more effectively, however, essentially eliminating any possibility of bias from imperfectly controlling for background characteristics.
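As a minimal sketch of the estimator described above, the following Python fragment computes the difference in mean outcomes between the experimental and control groups, together with a conventional standard error. The function name and the earnings figures are hypothetical and serve only as an illustration; they are not drawn from any of the waiver evaluations.

```python
from math import sqrt
from statistics import mean, variance


def impact_estimate(experimental, control):
    """Return (impact, standard_error) for a simple difference in means.

    Under random assignment, the difference in mean outcomes is an
    unbiased estimate of the average impact of the program.
    """
    impact = mean(experimental) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = sqrt(variance(experimental) / len(experimental)
              + variance(control) / len(control))
    return impact, se


# Hypothetical quarterly earnings (dollars), for illustration only.
experimental_earnings = [1200, 0, 850, 1500, 400, 950]
control_earnings = [900, 0, 700, 1100, 300, 600]

impact, se = impact_estimate(experimental_earnings, control_earnings)
print(f"Estimated impact: {impact:.2f} (standard error {se:.2f})")
```

A multivariate model is not needed to remove bias here; as note (3) indicates, it may still be added to improve the precision of the estimate.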

Another important advantage of an experimental design is that all cases--those receiving the reform program and those not receiving it--coexist in the same site or sites during the same time period. They are therefore exposed to the same economic and other factors that may influence outcome measures independently of the program reforms. This strategy avoids a principal limitation of a quasi-experimental design, which is that cases in one group may be exposed to plant closings, migration, floods, and other economic, social, and natural phenomena that cases in the other group are not exposed to.

2. Limitations of an Experimental Design

The advantages of the experimental design discussed earlier are compelling. When the design is implemented carefully, most policy researchers see these advantages as eclipsing the limitations discussed next. However, there may be particular applications in which one or more of these limitations loom large--perhaps because of a strong policy need for information on a specific type of outcome that an experimental design is not well suited to provide.

An experimental design can be costly and challenging to implement. Program staff members sometimes are reluctant to implement random assignment; substantial training may be necessary to convince them that it is worth doing and doing right. Alternatively, it may be necessary to contract out certain aspects of random assignment. Either approach can be expensive. Program staff members also must be trained to operate the reform and pre-reform programs side by side in the research sites. Both random assignment and the operation of two programs simultaneously require additional managerial resources.

Two limitations are associated with the challenge of successfully implementing an experimental design. First, because an experimental design often is difficult and costly to implement, state administrators generally select only a subset of counties (or other administrative units) to implement random assignment. They may be inclined to choose only those sites that they believe will be successful both in implementing random assignment and in operating the reform and pre-reform programs concurrently. Selection of any small group of sites--particularly those more likely to be successful--means that the research sample of experimental and control cases is unlikely to be representative of the statewide welfare caseload (the broader population of interest). Consequently, findings from experimental evaluations frequently lack external validity, meaning that users of the research cannot generalize from the findings for the research sample to the full (state) population. With alternative designs that are easier to implement, state-level administrators may be more willing to select research counties randomly or to allow all counties to be research counties. Either approach may yield findings with a high degree of external validity.

A second limitation associated with the difficulty of implementing an experimental design is that it may be difficult to maintain pure versions of the reform and pre-reform programs for the experimental and control groups. Participants in the pre-reform program may receive elements of the reform program, or vice versa. For example, program staff could have difficulty keeping the rules of the two programs separate, or participants in one program could be exposed to advertising or news accounts of the other program and mistakenly assume that the rules governing the other program apply to them. Any such mixing of elements from the two programs would tend to bias impact estimates toward showing no impact of the reform program. In addition, cases in the experimental and control groups could be exposed to the other program if they migrate to a nonresearch site that is operating the other program or if they split into two cases or merge with a case that has a different research status.
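The direction of this bias can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not part of the evaluations themselves: the reform has the same impact on every case, a case exposed to the wrong program experiences that program fully, and exposure is unrelated to outcomes. The function name and the numbers are hypothetical.

```python
def observed_difference(true_impact,
                        share_controls_exposed,
                        share_experimentals_unexposed=0.0):
    """Experimental-control difference in mean outcomes when programs mix.

    Assumes (for illustration only) a constant impact per case and that a
    mixed case is treated entirely under the other program's rules.
    """
    # Experimental mean rises by true_impact * (1 - share_experimentals_unexposed);
    # control mean rises by true_impact * share_controls_exposed.
    return true_impact * (1.0 - share_controls_exposed
                          - share_experimentals_unexposed)


# Hypothetical case: a true impact of $100 per quarter, with 10 percent of
# control cases treated under reform rules and 5 percent of experimental
# cases treated under pre-reform rules.
print(observed_difference(100.0, 0.10, 0.05))
```

Under these assumptions, any mixing shrinks the observed difference toward zero, which is why such contamination tends to understate the impact of the reform program.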

Unless specifically designed to do so, an experimental design does not provide a strong basis for estimating the impacts of individual components or sets of components within a package of reforms.(4) To allow estimation of component impacts, a design must include random assignment of cases to multiple experimental groups. The number of such groups increases as the number of program components with impacts to be estimated increases. The number of different programs that must be operated also increases. Few states are willing to take on such an administrative burden. It can be done, however, as shown by the MFIP demonstration, in which a four-group experimental design is being used to estimate the overall impacts of the demonstration as well as the separate impacts of two distinct sets of reforms.

Some welfare reforms may be designed to discourage families from applying for welfare or from entering welfare if they are eligible; others may actually encourage applications (for example, among two-parent families). An experimental design will not support the estimation of such entry effects because they occur prior to application and thus random assignment. Furthermore, although an experimental design will still give unbiased impacts for those who apply for welfare after welfare reform has been implemented, substantial entry effects may imply that these estimates are not applicable to the population that would have applied under the old program. A nonexperimental study of entry effects that examines application behavior over time is vulnerable to differences between reform and pre-reform groups that are not related to the demonstration; however, no practical experimental alternatives are available.(5)

Similarly, if an intervention is designed to have substantial community effects (that is, to change the culture and mores of an entire community), it may be necessary to implement the new program on a saturation basis in selected sites, and this precludes the use of an experimental design. The federal government approved the use of a quasi-experimental design to evaluate Wisconsin's WNW demonstration, largely because this demonstration was designed to have substantial community effects. There was also concern that the program had been designed to reduce caseloads by discouraging entry into cash assistance. The following subsection provides additional information on quasi-experimental designs and the application of such a design in the context of WNW.

3. Quasi-Experimental Designs

In some circumstances, states may wish to pursue quasi-experimental designs for evaluating welfare reform programs. Motivations for pursuing these designs include the following:

The rest of this subsection considers criteria for a strong quasi-experimental design and reviews the limitations of this design, even in the best of circumstances.

A quasi-experimental design uses a comparison group separated from the experimental group in time or space. The comparison group consists of a set of cases that have not been given the opportunity to participate in the reform program. Possible configurations include:

· Pre-Post Design. This design uses as a comparison group a set of cases in the same site as the new program (which could be the entire state), but from a period before the reforms were implemented. The analysis may be conducted at the case level or may use data aggregated by county or other geographic region. The problem with the pre-post design lies in distinguishing the effects of the intervention from the effects of any other factors that change at the same time, such as unemployment rates, demographic characteristics of the low-income population, or changes in related programs. The more periods of pre-program and post-program data that are available, the more potential there is to distinguish the effects of welfare reform from other changes. A major advantage of this type of quasi-experimental design is that it is inexpensive to implement if the data are available. However, it does require that the state maintain longitudinal data on welfare cases on a regular basis.

· Matched Comparison Site Design. The preferred method for implementing a matched comparison site design has two steps. The first step is to choose pairs of sites suitable for implementing the demonstration program, matched as closely as possible in terms of demographic and economic characteristics and characteristics of the program (other than the reforms being tested). The next step is to randomly pick one member of each pair to be a demonstration site and one member to be a comparison site.(6) If, instead, demonstration sites are selected first from those willing to implement the demonstration, and the best matches are then selected from those not willing to implement it, the design is weaker, since the demonstration's success may be correlated with administrators' interest in being a demonstration site. Even with random selection among matched pairs, the small number of sites involved in most demonstrations implies that impact estimates may be biased if there are site differences not captured by the matching criteria, or if events (such as plant closings or openings or natural disasters) occur that lead to major changes at one of the sites in a pair.

· Combination Pre-Post/Matched Comparison Site Design. The strongest quasi-experimental design is a combination of the pre-post and comparison site designs. This involves a comparison site design, with pre-reform samples from both the demonstration and comparison sites. In such a design, the impact of the program is measured as a "difference in differences"--the difference in outcomes before and after welfare reform in the demonstration sites is contrasted with the difference in outcomes over the same time period in the comparison sites. This approach "nets out" differences between the sites that are constant over time, by comparing changes, rather than levels. However, differences between the sites that change over time may still be confounded with the effects of reform. For instance, a plant closing in a comparison site after program implementation may destroy the initial similarity between the two sites in a pair and, thus, lead to biased impact estimates.
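The "difference in differences" calculation in the combination design can be sketched in a few lines of Python. This is an illustrative sketch using unadjusted group means; the function name and the participation rates are hypothetical, and an actual evaluation would typically add control variables and standard errors.

```python
from statistics import mean


def difference_in_differences(demo_pre, demo_post, comp_pre, comp_post):
    """Change in the demonstration sites minus change in the comparison sites."""
    demo_change = mean(demo_post) - mean(demo_pre)
    comp_change = mean(comp_post) - mean(comp_pre)
    # Differences between sites that are constant over time cancel out here.
    return demo_change - comp_change


# Hypothetical monthly welfare participation rates (percent), for
# illustration only.
demo_pre = [52.0, 51.0, 53.0]
demo_post = [44.0, 43.0, 45.0]
comp_pre = [48.0, 47.0, 49.0]
comp_post = [45.0, 44.0, 46.0]

estimate = difference_in_differences(demo_pre, demo_post, comp_pre, comp_post)
print(f"Difference-in-differences estimate: {estimate:.1f} percentage points")
```

In this made-up example the demonstration sites start at a higher level than the comparison sites, but that fixed gap drops out because only changes are compared. A site-specific shock after implementation, such as a plant closing, would not drop out and would be attributed to the reform.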

The WNW evaluation is based on a combined pre-post and matched comparison site design. A "difference in differences" analysis will be used for public assistance-related outcomes, for which data from five years before the implementation of the demonstration are available for all Wisconsin counties. These analyses will be conducted at both an aggregate level (with the county-month as the unit of analysis) and a disaggregate level (with the case-spell as the unit of analysis). Cross-site comparisons will be conducted of outcomes for which no pre-implementation data are available, such as employment.

The WNW demonstration sites were selected before the comparison sites, and they were selected from sites with a particular interest in implementing the WNW model. There are only two demonstration counties; both are small and relatively prosperous. For each demonstration county, MAXIMUS (the evaluation contractor) selected two nonadjacent comparison counties that are similar in characteristics such as urbanicity, population, and caseload size. It will use multivariate statistical models with case-level control variables to attempt to control for remaining differences. It is unlikely, however, that matched comparison counties and statistical models will adequately control for the fact that the demonstration counties were preselected. It may not be possible to separate the effects of the program from the effects of being in a county where program staff and administrators were highly motivated to put clients to work.
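For contrast with the WNW site selection, the preferred two-step procedure for a matched comparison site design (match sites into pairs first, then randomly pick the demonstration site within each pair) can be sketched as follows. The county names, the function name, and the seed are hypothetical, and the matching step itself is assumed to have been done already.

```python
import random


def assign_matched_pairs(pairs, seed=None):
    """Randomly choose one site in each matched pair as the demonstration site."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    assignments = []
    for site_a, site_b in pairs:
        # Each member of a pair has an equal chance of being the demonstration site.
        if rng.random() < 0.5:
            demo, comp = site_a, site_b
        else:
            demo, comp = site_b, site_a
        assignments.append({"demonstration": demo, "comparison": comp})
    return assignments


# Hypothetical pairs of counties, already matched on characteristics such as
# urbanicity, population, and caseload size.
matched_pairs = [("County A", "County B"), ("County C", "County D")]

for row in assign_matched_pairs(matched_pairs, seed=1996):
    print(row)
```

Because the draw is made after matching and does not depend on administrators' willingness to participate, site status under this procedure is not correlated with interest in the demonstration. As note (6) cautions, with only a handful of pairs this still does not amount to a true experiment.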

Notes

(1)Appendix C provides a glossary of evaluation terms used in this report.

(2)The specification of a statistical model refers to (1) the number, type, and measurement of control variables, (2) the measurement of policy variables (for example, participation in the reform program can be measured as a dichotomous yes/no variable or as a continuous months-of-exposure variable), (3) the measurement of outcome variables (for example, current employment can be measured as a dichotomous yes/no variable or as a continuous hours-per-week variable), (4) functional form (that is, whether the relationship between independent and dependent variables is linear or some nonlinear function), and (5) the assumed distribution of the error term. Misspecification of a model along these or other dimensions can result in it generating biased estimates of the impacts of a program reform.

(3)With an experimental design it still may be useful to use a multivariate model to increase the precision of impact estimates.

(4)Nonexperimental methods for estimating the impacts of specific reforms may be employed in an experimental context just as in a nonexperimental context. However, as discussed in Chapter VI, such approaches rarely work well.

(5)An experiment would have to randomly assign all members of the population who could conceivably be at risk of entering the program. Alternatively, as discussed in Chapter VI, an experiment could randomly assign a large number of sites and compare entry rates in experimental and control sites.

(6)Although there is random assignment of sites in this type of design, we do not consider this a truly experimental design, because the sample of sites is typically much too small to rule out the confounding of site-specific factors with the effects of the program.


Home Pages:
Human Services Policy
Assistant Secretary for Planning and Evaluation
U.S. Department of Health and Human Services

Updated 09/24/01