Approaches to Evaluating Welfare Reform:
Lessons from Five State Demonstrations

Chapter III:
Sample Design

The sample design is a critical aspect of the design of the welfare reform waiver evaluations. Sample design includes (1) decisions concerning the overall sample size, (2) allocation of the sample between experimental (or demonstration) cases and control (or comparison) cases, (3) decisions concerning whether to oversample key subgroups (and sample size goals for those groups), and (4) decisions about selecting sites (including the number of sites and the method of selecting them). Key sample design decisions in the welfare reform waiver evaluations usually have been made by state policy and evaluation staff, often under considerable time pressure. Political and administrative considerations have affected decisions concerning the number of sites for evaluation, the specific sites chosen, and the level of resources committed to the evaluation (which limits sample sizes). The federal government has played a disciplining role in sample design by requiring a design that could address federal cost neutrality and by setting minimum standards for sample sizes. Federal staff members also have provided technical review of state designs and advice to the states. Often, evaluation contractors have not been involved in the sample design; they have been involved only after the sample design has been implemented. Two of the five evaluations reviewed here, however, involved evaluators to some extent in the sample design.

This chapter outlines the issues that must be confronted in developing a good sample design, to help those planning future welfare reform evaluations be better informed in making these decisions. The issues we focus on are:

For each of these topics, we outline the key issues that need to be confronted in designing the sample, describe the choices made in the five state evaluations we reviewed, and present recommendations.

A. ADEQUACY OF SAMPLE SIZE

In evaluating welfare reform, it is important to have adequate samples to learn about the effectiveness of the program. The larger the sample, the more precisely the impacts of the program can be estimated. The larger the sample, however, the more costly the implementation of the welfare reform demonstration. In random-assignment evaluations, administering two sets of policies for research is the major cost for most states; at least some administrative costs (such as training staff members to handle both policies and monitoring random assignment) will increase with sample size. Federal officials, in preparing the waiver terms and conditions, have specified minimum sample sizes, with some variation according to the particular needs and objectives of the evaluation. States have been encouraged to exceed the minimum if possible.

1. Issues

Each evaluation must address several issues concerning what is an adequate sample size:

a. Outcomes to Be Measured

In Chapter II, we discussed the importance of narrowing or prioritizing the list of research questions that an evaluation is intended to answer. This is particularly important in sample design, since a sample that is designed to provide precise estimates of one outcome may be very weak for other outcomes. To build a sample that can answer the key research questions, it is important to determine the key outcome (or, at most, a handful of key outcomes) the evaluation is seeking to address, the level of variation in that outcome, and the expected magnitude of the impact on that outcome.

In most welfare reform evaluations, four key outcomes are the focus of the impact analysis: (1) the proportion of cases on cash assistance, (2) the mean benefit per case, (3) the proportion of cases with someone working, and (4) the mean earnings per case. Of these four outcomes, those that policymakers consider particularly important should be the focus of the sample design. If all four are of roughly equal importance (as often happens), the most conservative strategy is to focus on the outcome for which the relevant impact is likely to be hardest to detect (that is, the outcome that requires the largest sample to detect a statistically significant impact). The two factors that determine the ease of detecting an impact for a particular outcome are (1) the variance of the outcome (which affects the variance of the impact estimate), and (2) the likely magnitude of the impact.

Among the four outcomes, earnings is likely to have the largest variance relative to the mean, and thus to require the largest sample size to detect an impact of a certain proportion; therefore, in many cases, samples are most conservatively designed to detect impacts on mean earnings. The likely magnitude of the impact also is important, however. In many past employment-training demonstrations, the proportionate impact on AFDC benefits tended to be smaller than the proportionate impact on earnings (Gueron and Pauly 1991). If a key goal is to be able to detect even a small impact on cash assistance benefit levels, that outcome may be the appropriate focus of the sample design.

A sample well designed for assessing impacts on these key outcomes may be weak for assessing other types of impacts. For example, the terms and conditions have required many states to assess the impacts of welfare reform on Medicaid paid claims. Because Medicaid paid claims vary extensively in the population (as some individuals have very high medical costs, but most have low costs), even large average experimental-control differences in Medicaid claims may not be statistically significant, with a sample designed primarily to estimate impacts on earnings.

Regression adjustment of impact estimates for baseline characteristics reduces the standard error of the impact estimates slightly (and thus, in principle, the sample size needed to detect a certain difference). Of the random-assignment evaluations reviewed here, only the Minnesota MFIP evaluation took into account the role of regression adjustment in determining desired sample size.
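As a rough illustration of how regression adjustment can enter a sample size calculation: if baseline characteristics explain a share R-squared of the variance in the outcome, the variance of the impact estimate falls by approximately the factor (1 - R-squared), and the required sample falls by the same factor. The sketch below assumes this standard approximation; the function name and the R-squared values are illustrative and are not taken from any of the evaluation plans.

```python
from math import sqrt

def adjustment_factors(r_squared):
    """Return (standard-error multiplier, sample-size multiplier).

    Assumes covariates explaining a share r_squared of the outcome variance
    shrink the variance of the impact estimate by (1 - r_squared).
    """
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return sqrt(1 - r_squared), 1 - r_squared

for r2 in (0.05, 0.08, 0.20):  # illustrative values only
    se_mult, n_mult = adjustment_factors(r2)
    print(f"R-squared {r2:.2f}: standard error x {se_mult:.3f}, "
          f"required sample x {n_mult:.2f}")
```

With modest R-squared values such as these, the standard error shrinks by only a few percent, which is consistent with the "slight" reduction described above.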

b. Precision Standard

The needed sample size also depends on the level of precision at which the impact is to be measured. The precision standard for a sample design is determined by three factors: (1) the desired level of statistical significance for the impact estimate, (2) the power of the sample design (the probability of detecting the desired effect), and (3) whether a one-sided or a two-sided hypothesis test is used. A result is referred to as statistically significant if, were the true impact zero, the probability of obtaining an estimate at least as large as the one observed (given its standard error) would be very low--generally 10 percent or less (typical standards are 10 percent, 5 percent, or 1 percent). For a given size impact, the smaller the standard error, the more statistically significant the estimate; larger sample sizes are thus required to detect an effect at the 1 percent level of significance than at the 5 percent level. The power of the design is the probability of detecting an effect, assuming an effect of a given size is present--for example, if the design has 80 percent power to detect a 5 percentage point impact at a 5 percent significance level, then, assuming the true impact of the program is 5 percentage points, the probability that a statistically significant impact will be observed is 80 percent. The larger the sample size, the higher the power of the sample to detect impacts of a given size and significance level.
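The relationship among significance level, power, and sample size can be made concrete with a small calculation. The sketch below uses the normal approximation for a comparison of two equal-sized groups; the function name, the 50 percent base rate (standard deviation of 0.5), and the group size of 1,570 are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power(impact, sd, n_per_group, alpha=0.05, two_sided=True):
    """Approximate power of a two-group comparison of means, equal group sizes.

    Normal approximation; ignores the negligible chance of a significant
    result in the wrong direction.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    standard_error = sd * sqrt(2 / n_per_group)
    return z.cdf(impact / standard_error - critical)

# Hypothetical case: a 5 percentage point impact on a 50 percent employment
# rate (standard deviation 0.5), with 1,570 cases in each research group.
print(round(power(0.05, 0.5, 1570), 3))              # 5 percent significance level
print(round(power(0.05, 0.5, 1570, alpha=0.01), 3))  # 1 percent level: lower power
print(round(power(0.05, 0.5, 3000), 3))              # larger sample: higher power
```

Under these assumptions, the first design has power of about 80 percent; tightening the significance level lowers power, and enlarging the sample raises it.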

Most evaluation research uses two-sided hypothesis tests, under the assumption that it is useful to distinguish effects in the desired or the unintended direction from policies with no effect. Bloom (1995) argued that one-sided tests may be adequate for most evaluations, since the key concern is to distinguish whether a policy had the desired effect or not. The advantage of one-sided tests is that smaller sample sizes are needed than in two-sided tests to achieve a given level of power and statistical significance.
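The size of the saving from a one-sided test can be approximated as follows. This is a sketch under the normal approximation for two equal-sized groups; the function name and the effect size of .10 standard deviations are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Cases needed in each of two equal groups to detect a standardized impact."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

two_sided_n = n_per_group(0.10, two_sided=True)
one_sided_n = n_per_group(0.10, two_sided=False)
print(two_sided_n, one_sided_n, round(one_sided_n / two_sided_n, 2))
```

At a 5 percent significance level and 80 percent power, the one-sided test needs roughly one-fifth fewer cases than the two-sided test, whatever the effect size.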

c. Sample Balance

Dividing the sample into equal numbers of experimental (demonstration) and control (comparison) cases (this is referred to as a "balanced" design) leads to estimates with the highest level of precision, for a given total sample size.(1) However, substantial deviations from this balance may occur with only minor losses in precision (Bloom 1995). States may prefer an unbalanced sample because of a desire to implement the reform program as completely as possible (if the reforms are implemented statewide for all cases except control cases). By having the minimum allowed number of control cases but more experimental cases, states can increase sample precision while keeping the control group as small as possible. Thus, in many evaluations in which the intervention is implemented for everyone except the control group, the sample is designed to include two experimental group members for every control group member. Increasing the ratio of experimentals to controls beyond 2:1, for a fixed total sample size, leads to more substantial loss in precision. Increasing the total sample size by adding additional experimental group members (but keeping the control group sample the same) increases precision only slightly.
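The effect of sample balance on precision follows from the fact that the variance of a simple experimental-control difference is proportional to 1/n_e + 1/n_c. The sketch below assumes equal outcome variances in the two groups; the function name and the sample sizes are hypothetical.

```python
from math import sqrt

def relative_se(n_experimental, n_control):
    """Standard error of the impact estimate in units of the outcome's
    standard deviation (equal variances in both groups assumed)."""
    return sqrt(1 / n_experimental + 1 / n_control)

# Fixed total sample, increasingly unbalanced splits
total = 6000  # hypothetical total sample
balanced = relative_se(total / 2, total / 2)
for ratio in (1, 2, 3, 4):
    n_c = total / (1 + ratio)
    n_e = total - n_c
    print(f"{ratio}:1 split -> standard error x "
          f"{relative_se(n_e, n_c) / balanced:.3f}")

# Control group held at 2,000 while experimental cases are added
for n_e in (2000, 4000, 6000, 8000):
    print(n_e, round(relative_se(n_e, 2000) / relative_se(2000, 2000), 3))
```

Under this approximation, a 2:1 split raises the standard error by about 6 percent relative to a balanced design of the same total size, and the loss grows more quickly beyond that ratio. The second loop shows the diminishing return from adding experimental cases to a fixed control group.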

d. Trade-Offs Between Subgroup Analysis and Full-Sample Analysis

Oversampling of key subgroups allows the evaluation to obtain more precise estimates of program impacts for the subgroups of interest. However, such oversampling (if total sample size is held constant) also reduces the precision of the estimates of impacts on the full sample. This becomes less of a concern if there are enough resources to have larger than minimum sample sizes overall, since the increase in precision from having a larger sample will at least partly balance the loss in precision from stratification.

For example, suppose subgroups are defined as the individual demonstration sites. Samples may be allocated across the sites in three ways:

  1. No Stratification. If the population about which inferences are to be made is the caseload in the research sites only, sampling rates should be the same in all the sites, and the sample sizes in the sites should be proportional to the number of cases in those sites.
  2. Stratification to Increase Precision of Site-Level Impact Estimates. To make inferences about impacts in specific sites as well as the entire group of research sites, sample sizes should be set to balance the precision needs of the two types of estimates. In general, cases in the smaller sites will be oversampled in relation to cases in the larger sites. It still may be desirable to have larger samples in larger sites, however, to increase precision of the overall estimates, as long as the samples in the smaller sites meet a minimum standard for site-level precision.
  3. Stratification to Increase State-Level Representativeness. If the population about which inferences are to be made is the entire state caseload, the sampling process is appropriately conceived of as a two-stage sampling process, in which sites are selected first, then cases within sites. Such a design could, in principle, lead to oversampling of either large or small sites. In this setting, implications for precision are most appropriately evaluated in the context of the state as a whole.

These same three approaches can be applied to determining sample sizes for other subgroups.
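The loss of full-sample precision from oversampling can be gauged with Kish's approximate design effect for unequal weights, deff = n * sum(w^2) / (sum(w))^2. This is a common rule of thumb rather than a method used in the evaluations reviewed; it assumes similar outcome variances across sites, and the caseloads and sample sizes below are hypothetical.

```python
def kish_design_effect(strata):
    """Kish's approximate design effect from unequal weighting.

    strata: list of (sample_size, population_size) pairs, one per site.
    Each sampled case carries weight population_size / sample_size.
    """
    n = sum(sample for sample, _ in strata)
    sum_w = sum(sample * (pop / sample) for sample, pop in strata)
    sum_w2 = sum(sample * (pop / sample) ** 2 for sample, pop in strata)
    return n * sum_w2 / sum_w ** 2

# Hypothetical sites: caseloads of 30,000 and 6,000; total sample of 3,000
proportional = [(2500, 30000), (500, 6000)]  # same sampling rate in both sites
equal_sized = [(1500, 30000), (1500, 6000)]  # small site oversampled

for label, design in (("proportional", proportional), ("equal-sized", equal_sized)):
    deff = kish_design_effect(design)
    n = sum(sample for sample, _ in design)
    print(f"{label}: design effect {deff:.2f}, effective sample {n / deff:.0f}")
```

In this example the equal-sized design gives the small site three times as many cases for site-level estimates, but the pooled estimate behaves as if it came from a noticeably smaller sample.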

e. Nonexperimental Versus Experimental Design Requirements

In general, nonexperimental designs require larger samples than experimental designs for a given outcome measure. For example, suppose a design compares applicants to a welfare reform program in some counties with applicants to the current program in other counties. Suppose also that differences (other than the welfare reform program) between the demonstration and comparison groups could be completely controlled for using measured background characteristics. Even in this case, for a given sample size, the standard error of the regression-adjusted impact estimate would be larger than in an experimental evaluation because of correlations between the welfare reform site indicator and the background variables in the equation. Intuitively, the greater the extent to which variables are correlated (tend to move together), the larger the sample required to "sift out" their separate effects--in this case, to separate the impact of the program from the effects of other characteristics.(2)
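One way to quantify this loss is the variance inflation factor from regression theory: if the background variables explain a share R-squared of the variation in the site (treatment) indicator, the variance of the estimated impact is multiplied by 1 / (1 - R-squared) relative to a design in which the indicator is uncorrelated with those variables, as random assignment ensures in expectation. The sketch below applies that formula; the function name and R-squared values are illustrative assumptions.

```python
from math import sqrt

def variance_inflation(r2_indicator_on_covariates):
    """Multiplier on the impact variance (and on the sample needed for equal
    precision) when covariates explain this share of the site indicator."""
    if not 0 <= r2_indicator_on_covariates < 1:
        raise ValueError("R-squared must be in [0, 1)")
    return 1 / (1 - r2_indicator_on_covariates)

for r2 in (0.0, 0.10, 0.30, 0.50):  # illustrative values only
    vif = variance_inflation(r2)
    print(f"R-squared {r2:.2f}: sample x {vif:.2f}, standard error x {sqrt(vif):.2f}")
```

An R-squared of zero reproduces the experimental case; if county characteristics explained half the variation in the site indicator, twice the sample would be needed for the same precision.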

The difficulty of sorting out program impacts from other factors is magnified if there are unobserved differences between the demonstration and comparison groups ("selection bias"). In the best of circumstances, these differences may be adjusted for using two-equation models.(3) In many such models, the first equation predicts membership in the treatment (demonstration) group (as a function of individual or site characteristics). The second equation estimates the effects of the program using predicted treatment status from the first equation rather than actual treatment status. Such models typically produce very imprecise impact estimates and therefore require much larger sample sizes to detect impacts of a given magnitude (Burghardt et al. 1985).

In a nonexperimental evaluation, however, it may be possible to limit the population of interest to those most likely to be affected by the reforms, so that the impact to be detected is easier to measure. For instance, many of the current waiver evaluations include provisions that affect program eligibility at initial application. States are thus required to randomly assign all AFDC applicants to an experimental or control group. A concern is that the applicant sample includes many applicants who would be denied AFDC benefits under both the new and old versions of the program (and who thus "dilute" estimates of program impacts). A nonexperimental design that compared only approved applicants under the old and new programs would be examining populations with much higher levels of AFDC participation. Thus, assuming the differences between the two groups could be adequately controlled--a big assumption--it would need smaller samples to detect given percentage impacts on participation.

2. State Approaches

Staff members at DHHS typically have specified sample sizes for welfare reform evaluations in the waiver terms and conditions after detailed discussions with the state. These sample size requirements vary according to the state's evaluation objectives and the size of the population being studied. The usual minimum requirements, however, are for the control group to include 2,000 recipient cases and 2,000 approved applicant cases and for the experimental group to be one to two times as large. States may exceed these minimum requirements to improve the precision of their estimates. Usually, sample size requirements do not include specific sample size goals for subgroups. States are required to sample all applicants (not just approved applicants) if the intervention affects eligibility for AFDC, but the sample size requirement is still generally phrased in terms of approved applicants. Thus, the federal requirements generally imply larger sample size requirements when the intervention affects eligibility.(4) Despite this federal guidance, the five evaluations reviewed for this study varied greatly in their planned sample sizes (overall and for key subgroups), as well as in the goals, assumptions, and precision standards used to justify these sample sizes.

a. Planned Sample Sizes

Table III.1 (Planned Sample Sizes in Five State Waiver Evaluations) summarizes planned sample sizes in the five evaluations. We first review these planned samples and the data, assumptions, and precision standards used to justify them; later, we discuss how well actual sampling experience has accorded with plans.

Wisconsin. Wisconsin's WNW (the only nonexperimental evaluation) had a planned sample size of 4,000 cases in the demonstration counties and at least 4,000 in the comparison counties (for the part of the evaluation based on a comparison county design). The sample of 4,000 in the demonstration counties was expected to consist of 1,000 recipient cases (the full caseload in those counties) and 3,000 applicants (all applicants over a seven-year period).

The evaluation plan prepared by MAXIMUS discusses the adequacy of the sample size in the WNW evaluation in terms of Cohen's "effect size" measure, defined as the impact on an outcome divided by the standard deviation of the outcome (Cohen 1977).(5) A table shows the sample size needed to detect various effect sizes for one-sided tests with a .05 significance level, at levels of power ranging from 50 to 99 percent. The text notes that a sample of 4,000 each in the demonstration and comparison groups is more than adequate to detect the smallest shown effect size (.10 or 10 percent of the standard deviation of the outcome) at the highest level of power.(6) It is difficult to assess, however, whether an effect size of .10 is realistic for the outcomes being considered without more information. Furthermore, the evaluation plan does not discuss whether the sample is sufficient for the applicant and recipient samples considered separately. The evaluation plan also mentions possibly increasing the size of the comparison group sample to the full caseload in the comparison counties for outcomes easily measured in administrative data, as one way to add precision to the estimates. The effects of the nonexperimental nature of the evaluation on sample precision are not considered.
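The claim about an effect size of .10 can be checked approximately. The sketch below uses the normal approximation for two equal-sized groups and a one-sided test at the .05 level; it does not reproduce the table in the evaluation plan, whose values may differ slightly, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_one_sided(effect_size, power, alpha=0.05):
    """Cases per group needed to detect a Cohen effect size with a one-sided test."""
    z = NormalDist()
    return ceil(2 * (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) ** 2 / effect_size ** 2)

for pw in (0.50, 0.80, 0.95, 0.99):
    print(f"power {pw:.2f}: {n_per_group_one_sided(0.10, pw)} cases per group")
```

Under this approximation, even 99 percent power requires fewer than 4,000 cases per group, in line with the plan's statement. The calculation says nothing about the added variance from the nonexperimental design.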

California. In California, the required sample size was 15,000 recipient cases (5,000 controls and 10,000 experimental group members). The required sample size for the approved applicant sample was specified as the sample over four years assuming that applicants are sampled using the same sampling rates as used for the recipient cases. The estimated sample of applicants outlined in the sampling plan was 17,280, consisting of 11,520 experimental cases and 5,760 controls.(7) Although we have not found any explicit analysis of precision in the California materials, the large overall sample appears to have been intended to permit subgroup analyses (see Section A.2.b).

Among the five state evaluations, only California planned on unbalanced sample sizes for the two research groups, with two experimental group members for every control group member. Because the demonstration counties had caseloads much larger than twice the control group sample, including additional experimental cases was more feasible than it would have been with smaller sites.(8) The larger experimental group improves the precision of the estimates.

Colorado. In Colorado, the terms and conditions require the following samples: (1) recipients--2,000 experimental and 2,000 control cases, and (2) approved applicants--2,000 experimental and 2,000 control cases. The planned sample sizes described in the evaluation plan are: (1) recipients--2,034 experimental and 2,034 control cases, and (2) approved applicants--3,288 experimental and 3,288 control cases. The planned applicant sample was larger than required because the Colorado staff interpreted the sample size requirements in the terms and conditions as referring to the number of cases active two years after implementation. The Colorado sampling plan analyzes precision in terms of the minimum sample sizes needed for county-level estimates but assumes applicant and recipient cases will be pooled for analysis. It does not make clear the need for county-level precision or the rationale for pooling applicant and recipient cases (pooling is discussed further in Section B). The stated precision standard for the analysis is 95 percent power for a one-tailed test; this precision standard is applied to an assumed reduction in recidivism to welfare from 30 to 15 percent.(9) The power requirement of 95 percent is higher than that typically used in evaluation research (80 percent is more common). In addition, recidivism to welfare is not really an appropriate outcome measure on which to base the power analysis, since it is an outcome that can only be defined for a nonrandom portion of the sample (cases that have already exited AFDC).(10)

Michigan. In the Michigan TSMF evaluation, the planned sample size was 21,952--13,578 recipients and 8,374 applicant cases, evenly divided between experimental group members and control group members. The Abt proposal shows that this total sample is adequate to detect a 5 percent impact on earnings under the following assumptions: mean monthly earnings of $165 for controls, with a standard deviation of $244 (based on "a recent study of welfare recipients") and a precision standard for a one-tailed test of a 5 percent significance level and 80 percent power. This calculation assumes no increases in variance due to stratification and ignores any reductions from regression adjustment of impact estimates. Again, the assumption seems to have been that applicants and recipients would be pooled.
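The Michigan figures can be approximately reproduced from the assumptions quoted above. The sketch below is a reconstruction under the normal approximation with two equal-sized groups, not the calculation in the Abt proposal itself; the function and variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

def total_sample_one_tailed(mean, sd, proportional_impact, alpha=0.05, power=0.80):
    """Total cases (two equal groups) needed to detect a proportional impact on a mean."""
    z = NormalDist()
    impact = proportional_impact * mean  # impact in dollars
    per_group = ceil(2 * ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sd / impact) ** 2)
    return 2 * per_group

# Assumptions stated in the text: control mean $165, standard deviation $244,
# 5 percent impact, one-tailed test, 5 percent significance, 80 percent power.
required_total = total_sample_one_tailed(mean=165, sd=244, proportional_impact=0.05)
planned_total = 21952
print(required_total, planned_total, required_total <= planned_total)
```

Under these assumptions the required total falls slightly below the planned 21,952 cases, consistent with the proposal's conclusion. Because the margin is small, any design effect from stratification, or any analysis of applicants and recipients separately, would leave the sample short of this standard.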

Minnesota. The MFIP demonstration has four experimental groups and multiple strata; this substantially complicates the relevant power calculations (see Tables III.1 and III.2). Table III.2 presents the full design for the Minnesota sample. Probably because of the complex design of the demonstration, the terms and conditions of the MFIP evaluation have an explicit precision standard, unlike those in the other evaluations that we have reviewed. The terms and conditions state that samples must be adequate to detect experimental-control differences in major outcomes equal to 20 percent of the standard deviation of the outcome at a 5 percent significance level with 80 percent power.

TABLE III.2 PLANNED SAMPLE SIZES FOR THE MINNESOTA MFIP EVALUATION, BY SUBGROUP

The MFIP evaluation design report argues that the proposed MFIP sample design can meet this standard in comparisons of any two experimental groups with 2,000 cases each, using the employment rate as the key outcome, a two-tailed test, an assumed mean of 50 percent employed in the control group (which the authors say is consistent with other MDRC studies), and assumed gains from regression adjustment of the impact estimate equivalent to a regression equation with an R-squared equal to .08 (which they also say is consistent with experience).(11) This calculation assumes pooling of applicant and recipient cases, and no increases in variance (often referred to as "design effects") due to stratification of the sample.(12) The two smaller research groups (E2 and C2) are roughly 2,000 cases each, but E2 is stratified by county. The larger groups (E1 and C1) are well above that level, but they were stratified by urban/rural location and (within these groups) into several other subgroups, with different sampling rates for the different subgroups (see the next subsection). The larger samples in groups E1 and C1 (over 6,000 in each) may balance or outweigh any design effects from stratification.
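The minimum detectable effect implied by these assumptions can be approximated as follows. The sketch uses the normal approximation, 2,000 cases per group, a two-tailed test at the 5 percent level, 80 percent power, and an R-squared of .08, as described above; it ignores design effects from stratification, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(n_per_group, r_squared=0.0, alpha=0.05, power=0.80):
    """Smallest detectable impact, in standard deviation units, two-tailed test."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(2 / n_per_group) * sqrt(1 - r_squared)

mde_sd_units = minimum_detectable_effect(2000, r_squared=0.08)
# With 50 percent employed in the control group, the standard deviation is 0.5
mde_percentage_points = 100 * 0.5 * mde_sd_units
print(round(mde_sd_units, 3), round(mde_percentage_points, 1))
```

Under these assumptions the minimum detectable effect is well under the standard of 20 percent of a standard deviation, which leaves room for some loss of precision from stratification or from analyzing subgroups separately.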

b. Subgroup Sample Sizes

Other than stratification of the sample between applicants and recipients (discussed in Section B), the only explicit stratifications of the sample in the five studies examined were by site (or grouping of sites, such as urban versus rural) and by single-parent versus two-parent cases. The motivation behind these stratifications generally was to allow more precise estimates for subgroups; the implications for precision of the estimates for subgroups and overall were not explicitly drawn out.

All of the evaluations (except Wisconsin, which is not really comparable because of its quasi-experimental design) to some extent oversampled cases in smaller sites. In three instances, the motivation was to increase the precision for subgroup estimates; in one instance, it was to increase statewide representativeness:(13)

Two of the evaluations reviewed stratified explicitly by single-parent versus two-parent cases. California set up the sample so that one-third of cases sampled were two-parent cases (AFDC-UP cases), although such cases typically make up less than 15 percent of the caseload. Minnesota also explicitly oversampled two-parent cases (including cases on the state general assistance program and AFDC-UP), relative to their basic sample of single-parent cases in urban areas.(14) Again, no explicit power analyses were offered to justify these sample sizes, but the motivation was clearly to increase the precision of estimates for two-parent cases. This stratification seems sensible, since changes in rules for two-parent families were a major part of the reform packages in these states, and both states had relatively large sample sizes.

None of the evaluators appears to have considered the effects of oversampling of sites or other subgroups on the precision of the estimates for the overall research sample.

c. Planned Versus Actual Samples

The discussion so far has been of planned sample sizes in the five state waiver evaluations reviewed. At this time, it is apparent that actual samples in several of the states are not as large as planned.(15) This problem is discussed further in the next section; here, we note only that not meeting sample goals can seriously reduce the usefulness of an evaluation.

3. Analysis and Recommendations

Our recommendation is that states, in developing their preliminary evaluation sample designs, specify the precision standard that estimates of the key outcomes must meet (rather than minimum sample sizes) and the key outcome measures to which these standards must be applied. In addition, designs should justify the magnitude of the impact they expect to measure and the assumed variance of the outcome measure, which inevitably vary with the nature of the intervention. A reasonable precision standard would be the ability to detect a plausible impact on all applicants or all recipients at a 5 percent significance level with 80 percent power, using a one-tailed test. We do not generally recommend pooling the applicant and recipient samples; in the next section we discuss reasons. In addition, we recommend allowing reductions in sample size due to the increased precision from regression adjustment, particularly if plans for collection of baseline data (discussed further in Chapter V) are also included in the design.(16)

The study's research questions should determine the key outcome on which the sample size is to be based. It may be appropriate, however, for DHHS to recommend "default" assumptions (based on a review of the existing literature) concerning magnitude of impacts, standard errors, and regression reductions in standard errors for the most common outcome variables. States then could elect to use these outcome measures and associated assumptions, or they could propose others, but they should state and justify their assumptions.

The power of the sample design to detect impacts should be addressed for key outcomes, for the full sample and for key subgroups (particularly for subgroups for which there is an explicit stratification). States also may wish to establish an explicit precision standard for subgroups. States should consider design effects resulting from any oversampling of subgroups; federal officials could also suggest default assumptions concerning the likely magnitude of such effects from a review of previous studies.

B. SAMPLING OF APPLICANTS AND RECIPIENTS

All welfare reform evaluations have been required to sample both from the existing caseload (recipients) and from the flow of applicants. If the welfare reform program affects eligibility, the sample of applicants must include those whose applications are denied or withdrawn, but otherwise the sample may be limited to approved applicants. This section considers the issues involved in allocating the sample between applicants and recipients and in determining the length of time over which applicant sampling is to take place.

1. Issues

In making decisions about the relative sample sizes for applicants and recipients and the timing of applicant sampling, states must consider (1) how the data for the two sample groups will be used, (2) the trade-off between the risks inherent in a long intake period and the need to assess how a program matures, and (3) the ability of the state to forecast the flow of applicants.

a. Role of Applicant and Recipient Samples

Often, DHHS recommends that applicants be sampled using the same sampling intervals used in selecting recipient cases, to avoid the need to calculate weights. If the same sampling rate is used for both groups, active cases in the sample at each point in time are representative of all active cases, without weighting. A major motivation for this approach has been to allow states to meet the federal cost neutrality requirements more easily. For cost neutrality, the active cases in the research sample should be representative of the full active caseload in the research counties at each point in time, so that the impact of the program on AFDC program costs can be assessed.(17) (Although cost neutrality is not the subject of this report, it is important to note here how these requirements shape the sample designs that are also used for impact analyses.) If applicants are sampled at a different rate than recipients, it is still possible to achieve representativeness for cost neutrality purposes by appropriate weighting. As long as the sampling rate for applicants remains the same over time, construction of such weights is straightforward.(18) Whether or not weights are used, as soon as sampling of applicants ceases, the sample is no longer representative of the full active caseload. The active cases that remain in the sample increasingly will underrepresent newer cases.
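When applicants and recipients are sampled at different rates, the weights in question are simply the inverse of each group's sampling rate. The sketch below shows the idea for a mean benefit among active sample cases; the sampling rates, benefit amounts, and function name are hypothetical.

```python
def weighted_mean(cases, sampling_rate):
    """Weighted mean of a value across sample cases.

    cases: list of (group, value) pairs for active sample cases.
    sampling_rate: dict mapping group name to its sampling rate.
    Each case is weighted by the inverse of its group's sampling rate.
    """
    weights = [1 / sampling_rate[group] for group, _ in cases]
    weighted_total = sum(w * value for w, (_, value) in zip(weights, cases))
    return weighted_total / sum(weights)

# Hypothetical design: recipients sampled 1 in 10, applicants 1 in 5
rates = {"recipient": 0.10, "applicant": 0.20}
active_cases = [("recipient", 400), ("recipient", 380),
                ("applicant", 300), ("applicant", 320)]

unweighted = sum(value for _, value in active_cases) / len(active_cases)
print(unweighted, round(weighted_mean(active_cases, rates), 2))
```

The unweighted mean overstates the influence of the oversampled applicants; the weighted mean restores each group to its share of the active caseload. If the applicant sampling rate changed over time, the weight would have to vary by intake period as well.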

In the impact analysis, the applicant and recipient samples are each of interest in their own right; in most cases, impact studies analyze data on the two groups separately. Applicant cases experience the new program from the beginning of their spell, and thus are more indicative of long-term effects; they also are representative of the full range of cases that apply for AFDC. Recipient cases give evidence of the short-term effects of welfare reform on the existing caseload.

Because the recipient group represents the stock of cases at a point in time, it is made up of long-term recipients to a much larger extent than the flow of applicant cases. The recipient sample may thus indicate the effects of reform on the most disadvantaged portion of the caseload, particularly if analysis is restricted to long-term recipients.

b. Trade-Offs Between Long and Short Sampling Periods for Applicants

In designing applicant samples, states should consider the following trade-offs between long and short sampling periods:

c. Forecasting Applicant Flows

Another consideration in the design of applicant samples is that states must accurately forecast the influx of applicants eligible for random assignment to achieve sample size goals in the expected time. Overestimating applicant flows puts states at risk of either having an inadequate sample or needing to extend sampling. Factors to be considered in estimating applicant flows are (1) the proportion of applicants who will have already been through random assignment as part of a previous welfare spell, (2) the proportion of applicants who are transfers from another site (and thus, in most demonstrations, ineligible for sampling), and (3) whether the sample frame includes the full sample relevant for random assignment or excludes some relevant cases. In addition, the ability to forecast general caseload trends is always imperfect; for example, the strong national economy has contributed to greatly reduced AFDC caseloads in many states and thus made it harder to meet sample goals in some evaluations. The intervention itself may reduce applications; this is a substantial problem for the analysis (as discussed in Chapter VI), but it also affects sample sizes. Many of the evaluations reviewed did not take these issues into account and therefore have had difficulty meeting applicant sample goals.
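A simple planning calculation shows how sensitive the intake period is to these factors. The sketch below treats previously assigned applicants and transfers as non-overlapping shares of monthly applications; all inputs and the function name are hypothetical.

```python
from math import ceil

def months_to_reach_goal(goal, monthly_applications, share_previously_assigned=0.0,
                         share_transfers=0.0, sampling_rate=1.0):
    """Months of intake needed to reach an applicant sample goal."""
    eligible_share = 1 - share_previously_assigned - share_transfers
    if eligible_share <= 0 or sampling_rate <= 0:
        raise ValueError("no applicants would be eligible for sampling")
    monthly_sample = monthly_applications * eligible_share * sampling_rate
    return ceil(goal / monthly_sample)

# Hypothetical site: goal of 4,000 applicants, 300 applications per month
print(months_to_reach_goal(4000, 300))              # naive forecast
print(months_to_reach_goal(4000, 300, 0.15, 0.05))  # net of repeats and transfers
print(months_to_reach_goal(4000, 240, 0.15, 0.05))  # applications also fall 20 percent
```

In this example, allowing for repeat applicants and transfers lengthens intake from 14 to 17 months, and a 20 percent drop in applications lengthens it further.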

2. State Approaches

Table III.1 shows the planned applicant and recipient sample sizes in four of the five evaluations reviewed (the Minnesota evaluation plan did not provide estimated sample sizes for the two groups). The decisions each state made concerning whether to oversample applicants and the time period for sampling applicants are summarized as follows:

The rest of this section discusses these findings in more detail, by state.

Wisconsin. Because of the small size of the demonstration counties in Wisconsin, the state planned an applicant sample that would be three times the recipient sample. It planned to achieve this sample by sampling for seven years. This lengthy intake period seems unrealistic, given the current rapid pace of policy development; indeed, Wisconsin now is proposing a new program (Wisconsin Works) that would supersede WNW. A key problem with the Wisconsin design is that the two demonstration sites are too small to meet sample goals within a policy-relevant time frame.

As of April 1996, the evaluator for WNW had received from the state a list of cases enrolled in the demonstration in the first nine months, but no data on applicants who did not complete their applications or on cases in the comparison counties. Thus, it is difficult to assess how well the demonstration has been achieving the planned sample goals. Anecdotal evidence, however, suggests that the demonstration is having substantial entry effects, which may reduce the sample sizes well below those expected. One indication is that the caseloads in the two counties declined nearly 20 percent between the time the demonstration was announced and actual implementation, leading to a recipient sample of 818 cases rather than 1,000.(19)

California. California's APDP sample was designed with the goal of sampling at the same rate from the recipient pool and the applicant flow, to continue to have a representative sample of the caseload in the demonstration counties over time for cost neutrality. To support the cost neutrality calculations, California originally planned to continue sampling for four of the five years of the follow-up period. A second set of waivers, implemented 16 months after the first set, extended the duration of the demonstration; at this time, however, state officials do not plan to extend the sampling period.

The original sampling plan estimated that the applicant sample would be slightly larger than the recipient sample (see Table III.1). The actual flow of applicants into the sample has been roughly one-third of what was expected, however, resulting in an applicant sample much smaller than the recipient sample. As of April 1996, 40 months after implementation, only 5,460 applicants had been sampled (1,824 controls and 3,636 experimental cases). Three reasons for the discrepancy are:

  1. Applications have been declining because of the economic recovery and, possibly, because of the multiple cuts in the maximum benefit. Because California is a "fill-the-gap" state, which allows applicants to fill the gap between the AFDC payment and the need standard with other income without losing eligibility, the cuts in the maximum benefit would not make anyone ineligible, but they could have behavioral effects if participation in AFDC becomes less attractive.
  2. The state uses a sample frame to select the sample that does not include all approved applicants. In particular, the sample is selected from the statewide Medicaid system. Very short-term cases (which may never be entered into the Medicaid system) and cases that are entered into the Medicaid system late are not included in the sample frame. The state believes this problem reduces the size of the sample but does not bias its composition (except that very short-term recipients are excluded).
  3. The state did not accurately estimate the number of cases that would be excluded from sampling because they had received AFDC within the recent past.

Colorado. Sample intake in Colorado has proceeded largely according to plan. The state recently stopped intake one month early with an applicant sample of about 6,000 cases, about 600 fewer than the number planned, stating it had more than met the requirement in the terms and conditions for 4,000 applicant cases.

Michigan. In Michigan, the projected applicant sample after two years was 8,374, but the actual sample intake was about 6,600. The shortfall seems to reflect the omission of denied applicants (those denied for both AFDC and SFA) from the sample. Denied applicants were dropped because Michigan's data system does not retain data on them. In addition, the state has argued that the same cases would be denied all benefits under both TSMF and control group rules; the intervention merely affects whether cases are approved for AFDC or SFA.

The intake period for new applicants was extended for two years largely because of the implementation of new waivers that substantially changed the TSMF program. As a result, the applicant sample may approach the recipient sample in size. Most analyses will examine applicants in the first two years and the second two years separately, however.

Minnesota. In the design for the MFIP evaluation, sample goals are not broken down into goals for applicants and recipients in the usual way. Instead, the sample design discusses a "basic" single-parent sample, which is a proportional sample of recipients coming up for redetermination, reapplicants (defined as those on AFDC in the past three years), and new applicants (defined as those not on AFDC in the past three years) (see Table III.2). In addition, there is a plan to oversample 1,800 single-parent new applicants in urban counties, equally divided between the two larger experimental groups (E1 and C1). Again, new applicants are defined as those not on AFDC in the past three years. The report states that this implies a sampling rate in the urban counties of 13 percent for single-parent recipients, but 80 to 86 percent for single-parent applicants. In the rural counties that are part of the MFIP demonstration, all applicants and recipients are subject to random assignment to one of the two core experimental groups.

MDRC staff members report that the intake for reapplicants and new applicants has exceeded expectations; this led to shortening of the intake period for reapplicants and to assigning a greater proportion of single-parent new applicants and two-parent applicants to nonexperimental groups.

3. Analysis and Recommendations

From the perspective of the impact analysis, there is less interest in pooling the applicant and recipient samples than in analyzing each (and particularly applicants) separately. A pooled sample can be selected to give unbiased estimates of impacts on the caseload over the demonstration period. Such impacts, however, are made up of effects on those already on AFDC before welfare reform and of effects on those entering the system only after welfare reform; in general, it is the latter effect (that on applicants) that is of long-term interest. Therefore, we recommend that sample sizes be sufficient to analyze recipients and applicants separately; this typically implies sampling applicants at a higher rate than recipients.

Because an extended sampling period brings large risks along with large administrative costs, we recommend designing the applicant sampling process to reach the target sample over a two- year period, if possible. A shorter sampling period implies a longer follow-up period, less likelihood of major program changes, and some flexibility to extend sampling if goals are not being met.

Finally, applicant sampling rates need to be set carefully to take into account the exclusion of transfers and those who have been through random assignment before. States should use any available historical data on accessions to predict these rates. This is just one example of the usefulness of longitudinal data on accessions and terminations.
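As an illustration of how historical accession data might feed into this step, the sketch below estimates the eligible applicant flow from past records and derives the sampling rate needed to reach a target within a fixed intake period. It is a simplified example under stated assumptions: the record layout (`month`, `is_transfer`, `is_repeat`) and the function name are hypothetical, the data are invented, and the historical flow is assumed to persist unchanged.

```python
def required_sampling_rate(accessions, target_sample, intake_months):
    """Sampling rate needed to reach target_sample within intake_months.

    `accessions` is a list of dicts with hypothetical keys:
      month       -- accession month label
      is_transfer -- case transferred in from another site
      is_repeat   -- case would already have been randomly assigned
    """
    months_observed = len({rec["month"] for rec in accessions})
    eligible = sum(1 for rec in accessions
                   if not rec["is_transfer"] and not rec["is_repeat"])
    eligible_per_month = eligible / months_observed
    rate = target_sample / (eligible_per_month * intake_months)
    if rate > 1:
        raise ValueError("target cannot be reached in this intake period")
    return rate

# Invented history: 3 months, 10 accessions per month, of which
# 2 are transfers and 3 are repeat applicants.
history = [
    {"month": m, "is_transfer": i < 2, "is_repeat": 2 <= i < 5}
    for m in ("1995-01", "1995-02", "1995-03")
    for i in range(10)
]
rate = required_sampling_rate(history, target_sample=60, intake_months=24)
```

In this toy example, half of all accessions are eligible, so a target of 60 cases over 24 months requires sampling half of the eligible flow. A rate above 1 signals that the site cannot meet the goal in the chosen period, which is the situation described for the Wisconsin demonstration counties.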

C. SELECTING SITES

The selection of sites is one aspect of sample design that is rarely addressed formally. Only one of the states reviewed selected sites for its welfare reform waiver evaluations even partially through a formal sampling process. None of the states analyzed the precision of its sample estimates as estimates of the state population as a whole and, thus, none attempted to take into account the increased variance due to clustering of its samples in selected sites rather than in the entire state. The federal government has not required states to do this, mainly because of the considerable political and administrative realities that limit site selection. In addition, the state may have other goals for the demonstration (such as assessing whether a program will work under the most favorable conditions). However, the lack of representativeness (also referred to as external validity) can limit the usefulness of the information from a demonstration. Here, we discuss different possible approaches for selecting sites.

1. Issues

Several approaches are possible to selecting sites for an impact evaluation:

Regardless of how sites are selected, it is always possible later to compare the characteristics of the sites to the state as a whole and to compare the characteristics of the AFDC caseload in the demonstration sites to the state caseload. If the demonstration sites appear reasonably similar to the rest of the state, this makes generalizing net impact results to the state level more plausible. If major differences exist, the evaluator can assess the possible implications of these differences. It may also be possible to reweight the demonstration sample (especially if it is relatively large) to be more representative of the state caseload.
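One simple way to carry out the reweighting mentioned above is post-stratification: each sampled case is weighted by the ratio of its stratum's share in the state caseload to its share in the sample. The sketch below is illustrative only; the strata, shares, and outcomes are invented, and it is not the procedure used by any of the states reviewed.

```python
from collections import Counter

def poststratification_weights(sample_strata, state_shares):
    """Weight each sampled case so stratum shares match the state caseload."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return [state_shares[s] / (counts[s] / n) for s in sample_strata]

def weighted_mean(values, weights):
    """Weighted mean of an outcome across sampled cases."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Invented example: the sample is 80 percent urban, the state caseload
# is 50 percent urban, and the outcome differs by stratum.
strata = ["urban"] * 8 + ["rural"] * 2
outcome = [1.0] * 8 + [0.0] * 2
weights = poststratification_weights(strata, {"urban": 0.5, "rural": 0.5})

unweighted = sum(outcome) / len(outcome)
reweighted = weighted_mean(outcome, weights)
```

Here the unweighted mean is 0.8 and the reweighted mean is 0.5. Reweighting corrects only for the characteristics used to define the strata; sites may still differ from the rest of the state in unmeasured ways.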

2. State Approaches

The five evaluations range from those that were not concerned with external validity in selecting sites to those that have made a serious attempt to achieve a representative sample:

  1. Wisconsin made no claims of representativeness for the two demonstration counties for the WNW evaluation--Pierce and Fond du Lac; instead, it selected these sites because they were very interested in implementing the demonstration and seemed likely to achieve success (Bloom and Butler 1995). The two counties are small, relatively prosperous, and overwhelmingly white (as is most of the state outside Milwaukee). The state's main goal was to test the feasibility of the new approach. The selection of these sites severely limits the ability to assess impacts, however, even within these two sites. In particular, sample sizes are small, and MAXIMUS was not able to select comparison sites with unemployment rates quite as low as in the two demonstration sites.
  2. California has implemented the APDP/WPDP demonstration in four counties that are judgmentally representative of the state and that contain 49 percent of the state's caseload: Los Angeles, San Bernardino, Alameda (Oakland-Berkeley), and San Joaquin. The first process report states: "Planners chose Los Angeles because of its critical importance to the state, San Bernardino because it is adjacent to Los Angeles, Alameda for its ethnic diversity, and San Joaquin to represent the Central Valley and because of its proximity to Alameda" (UC DATA 1994). Los Angeles and Alameda have large urban areas, while the other two are largely rural. They range in population: Los Angeles has a population of 9 million, San Bernardino and Alameda populations of 1.2 to 1.5 million, and San Joaquin a population of 0.5 million. San Joaquin had the highest percentage of the population on AFDC and the highest percentage of two-parent cases. In California, welfare reform outcomes also are being tracked in the rest of the state, and state staff members are working on methods for reweighting the research sample to make it more representative of the state as a whole (for example, in terms of ethnicity).
  3. Colorado chose 5 research counties from among 13 that applied to be in the demonstration, on the basis of their capacity to implement the demonstration and to represent the state's diverse geographic, economic, and demographic conditions.
  4. Michigan's sample of four offices was designed to be judgmentally representative along the dimensions of gender, race, age, earned income status, months since the case opened, and family size. Sampling rates were set so that the share of cases in Wayne County (Detroit) was the same as in the statewide caseload.
  5. Minnesota's MFIP is operating in 7 of the state's 87 counties, of which 3 are urban (including one Twin Cities county) and 4 are rural. The sample was designed to overrepresent rural cases in relation to the statewide caseload but to choose representative counties within the urban and rural groups. For the urban sample, the state wished to include the county containing St. Paul or that containing Minneapolis (Ramsey or Hennepin); it ruled out Ramsey because it was participating in another demonstration. The state also chose one of the two large suburban counties at random, but ended up including both because the second county offered to pay planning costs itself in order to be included. The rural counties (all remaining counties in the state) were originally to be part of a nonexperimental evaluation, and two clusters of counties were chosen randomly to represent rural counties statewide.(20) When the state moved to an experimental design for the rural counties as well, one of the two clusters was chosen for the demonstration.

In summary, the selection of the rural counties in Minnesota is the only example of sites being selected through a formal sampling procedure. All of the other states except Wisconsin chose sites that were, to some extent, judgmentally representative. Most evaluators also analyze site representativeness after sites have been chosen.

3. Analysis and Recommendations

In the negotiations for approval of welfare reform waivers, there has been relatively little emphasis on selecting a representative set of sites. This is in large part because the federal staff recognize the administrative burdens of implementing random assignment and understand that it may not be feasible for all local welfare offices to assume these burdens. However, the lack of requirements concerning site selection has made it possible for states to select a set of sites that are most likely to implement the reforms successfully and to use the results from these sites to generalize to the rest of the state. Full implementation of the reforms statewide may then produce less positive results. It sometimes makes sense to implement new and untested programs in sites that are most likely to be successful, to determine if the approach is feasible under the best of circumstances. However, it is most useful to research and policy development if such motivations are stated explicitly and if evaluators and policymakers are then appropriately cautious in generalizing the results.

In addition, there may be a trade-off between a state's short-run interest in implementing the evaluation in sites where implementation is relatively easy and the state's longer-term interest. For instance, states that pay little attention to the representativeness of demonstration sites initially may become very concerned about this issue if impact estimates run counter to their expectations. DHHS can play a useful role in encouraging states to take a longer-term perspective on evaluation design, one that encompasses both implementation concerns and the potential ramifications of an unrepresentative group of sites.

We recommend that states, in their sampling plans, spell out the criteria used in selecting sites (including whether the goal is approximate representativeness or selecting exemplary sites). We also recommend that states assess how representative the selected sites and their caseloads are of the state and its caseload as a whole. Wherever feasible, we recommend explicit sampling of sites, with some accommodation of administrative concerns (for example, very small sites or sites with special administrative difficulties could be excluded). Finally, we recommend that an analysis of the representativeness of the population in the sites (including both an updated analysis of site characteristics and a comparison of outcomes) be conducted after the fact, as well as during the site selection process. For example, in an intervention that is implemented statewide, but with random assignment in only selected demonstration counties, it may be useful to assess how similar the outcomes for the experimental group in the demonstration counties are to those in the state as a whole.

Notes

(1)The variance of the impact estimate is inversely related to T(1-T), where T is the proportion of the sample in the treatment group. All else equal, this variance is smallest when T = .5.
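The relationship can be written out under simplifying assumptions that are ours, for illustration: the impact is estimated as a simple difference in means between experimental and control cases, the outcome has a common variance \sigma^2 in both groups, and N is the total sample size.

```latex
\operatorname{Var}\left(\bar{Y}_E - \bar{Y}_C\right)
  = \frac{\sigma^2}{N T} + \frac{\sigma^2}{N (1 - T)}
  = \frac{\sigma^2}{N \, T (1 - T)}
```

Because T(1-T) reaches its maximum of .25 at T = .5, any other allocation raises the variance for a given N. With T = 2/3, for example, T(1-T) = 2/9, so the variance is (1/4)/(2/9) = 1.125 times as large as under an equal split.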

(2)In fact, in a comparison site design with two sites, it is impossible to distinguish the effects of the program from the effects of other site-specific factors that do not vary within the site. If a characteristic varies among individuals in a site, that variation can be used to identify its separate effect. Similarly, the larger the number of sites, the more ability there is to sort out other site effects from the effects of the intervention.

(3)Again, such unobserved factors must vary among individuals as well as across sites. Such models also require that a variable exists that predicts participation in the demonstration well (in a comparison site design, this may be equivalent to predicting where people live), but does not otherwise affect the program impact (known technically as an "identifying" variable). Otherwise, if the control variables in both the participation and outcome equations are the same, the predicted value of the participation variable will be either perfectly or highly correlated with the other variables in the outcome equation (since it is a function of those variables). The models also typically make restrictive assumptions about the distribution of the error terms. In referring to the "best of circumstances," we mean that lack of precision remains a concern in these models even when good identifying variables are available and the distributional assumptions are reasonable. Nonexperimental models are discussed further in Chapter VI.

(4)For example, if the federal requirement is for 2,000 approved control group applicants, but only 2 out of 3 applicants are approved, a sample of 3,000 control group applicants may be needed to satisfy the requirement. In an intervention that does not affect eligibility, a similar requirement implies a sample of only 2,000.
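The arithmetic in this note can be expressed as a small helper. This is an illustrative sketch; the function name is hypothetical, and exact fractions are used so that a rate such as 2/3 does not introduce rounding error.

```python
import math
from fractions import Fraction

def applicants_needed(required_approved, approval_rate):
    """Applicants to sample so the expected number approved meets the requirement."""
    return math.ceil(Fraction(required_approved) / Fraction(approval_rate))

# The two cases described in the note.
affects_eligibility = applicants_needed(2000, Fraction(2, 3))
no_effect_on_eligibility = applicants_needed(2000, 1)
```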

(5)The effect size is a way to standardize analysis of statistical power over different types of outcomes measured on different scales. A sample is selected to achieve a certain effect size (for example, to measure an impact equal to 10 percent of the standard deviation of the outcome) with, say, 80 percent power at a 95 percent confidence level (that is, a 5 percent significance level). The same sample size would be needed to reach a given effect size, regardless of the outcome measure.
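A standard normal-approximation formula shows how an effect size translates into a sample size. The sketch below is illustrative and rests on assumptions not stated in this note: a two-sided test, equal-sized experimental and control groups, a simple difference in means with no regression adjustment, and no design effect. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Cases needed in each group to detect a standardized effect.

    Normal approximation for a two-sided test comparing two
    equal-sized groups on a difference in means.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Effect size of 10 percent of a standard deviation, as in the note.
n = n_per_group(0.10)
```

Under these assumptions, detecting an effect of 10 percent of a standard deviation takes roughly 1,570 cases per group. Because the required sample varies with the inverse square of the effect size, halving the detectable effect roughly quadruples the sample.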

(6)MAXIMUS, "Evaluation of the Work Not Welfare Demonstration: Evaluation Plan," pp. III-21-22 and Exhibit III-6.

(7)California Department of Social Services, "APDP Approval Case Sampling Plan," Attachment 1.

(8)Services offered to experimental cases are the same as those offered to most cases; some counties did not even tell caseworkers which cases were in the experimental group.

(9)Colorado Department of Social Services, "Sampling Plan to Implement the Colorado Personal Responsibility and Employment Program," p. 3.

(10)Methods for analysis of recidivism and similar outcomes are discussed in Chapter VI.

(11)MDRC, "Proposed Design and Work Plan for Evaluating the Minnesota Family Investment Program," p. 14.

(12)Technically, the design effect is the ratio of the standard error of an estimate from a complex sample design (for example, a design with oversampling of particular strata) to the standard error of an estimate from a simple random sample of the same size.

(13)Wisconsin sampled the full caseload in both demonstration counties and also may include the full caseload in the comparison counties.

(14)The basic sample in Minnesota is a proportional sample of recipients and applicants--they also oversample new applicants (defined only as applicants who had not been on AFDC for at least three years). The sampling rates in the urban counties were 13 percent for single-parent recipients, 80 to 86 percent for single-parent applicants, and 46 to 53 percent for two-parent applicants and recipients (Knox et al. 1995).

(15)The MFIP evaluation is the exception; it reports an intake of new applicants and reapplicants higher than expected. It is not clear if this indicates an effect of the demonstration or other factors. MDRC responded by cutting the sampling rates or the intake periods for several subgroups.

(16)Bloom (1995) provides a formula for making this adjustment, based on the R-squared expected for the regression equation.

(17)Strictly speaking, active cases in the research sample should be representative of the full active caseload in the state. However, DHHS generally has been willing to assume the sampled sites are representative of the caseload (see Section C).

(18)Weighting schemes become more complicated if sampling rates are changed over time, perhaps because sample intake has been lower than expected, or if weights must also be used for some other purpose (such as adjusting for oversampling of sites or subgroups).

(19)The implications of entry effects for the analysis are discussed in Chapter VI.

(20)The Upjohn Institute, as a consultant to the state, developed a model to select clusters of counties with approximately 1,500 cases each. It used 49 variables to describe each county and selected clusters with the goals of maximizing generalizability to all rural counties in the state and of having pairs of clusters that were well matched. In addition, it restricted the model so that no cluster could contain more than one county that was not interested in participating.



Updated 09/24/01