One of the most reassuring things a research organization can say about the privacy of the people whose health data it is studying is: "We don't know the personal identity of our data-subjects; and we really don't want to know." (60)
This should not necessarily mean that no-one can trace back to the data-subject if scientific reasons require it. But such tracing-back should itself be at least a small project, the difficulty of which should be scaled to suit the situation.
From a privacy-protection perspective, there is a very wide distinction between personally identifiable data and truly anonymized data. But in practice the demarcation between these extremes is not sharp. Attending assiduously to where particular data lie on the spectrum between them, and especially to data that are somewhere in the middle, is a crucial protection strategy.
At present, large amounts of data lie in between: they are not completely anonymized, but they are not readily identified, either. It is routine to decrease identifiability by assigning to data a pseudonym made up of numbers and/or letters. But if, for instance, the overall data category is known (say, epilepsy among men in a certain district) and the data are coded by, say, simply personal initials and birthdate, it may not be difficult to deduce who the data-subject is. The power of computers to perform elaborate, rapid searches, and the pressures for access, mean that merely assigning simple pseudonyms affords little protection.
For data whose identifiability has, up to now, been only lightly obscured, greater efforts must now be made either (a) to remove personally identifying information much more effectively, or to aggregate, and thus anonymize, the data; or (b) to seek the data-subjects' informed consent and hold the data under a suitably protective regimen if identifiability is retained.
For key-coded data (that is, data for which personal identifiers are removed and secreted but which are still potentially traceable via a matching code, held separately), a variety of measures must be taken to mask the identifiability near the source, separate and lock up the identifiers, safeguard the linking codes, and carefully manage linking-back to the data-subject when it is required.
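The separation described here can be sketched in a few lines. This is only an illustration under stated assumptions, not a prescribed procedure: the function names `key_code` and `link_back`, the field names, and the sample records are all hypothetical, and a real system would also need secure storage and an approval process, which the `approved` flag merely stands in for.

```python
import secrets

def key_code(records, identifier_fields):
    """Split records into coded research data and a separately held linking key."""
    coded_data = []    # released to researchers; carries no identifiers
    linking_key = {}   # code -> identifiers; to be locked up apart from the data
    for record in records:
        code = secrets.token_hex(8)  # random code that reveals nothing about the person
        identifiers = {f: record[f] for f in identifier_fields}
        research = {f: v for f, v in record.items() if f not in identifier_fields}
        research["code"] = code
        coded_data.append(research)
        linking_key[code] = identifiers
    return coded_data, linking_key

def link_back(code, linking_key, approved):
    """Trace a code back to the data-subject, but only for an approved request."""
    if not approved:
        raise PermissionError("linking back to the data-subject requires approval")
    return linking_key[code]

# Hypothetical sample records, for illustration only.
records = [
    {"name": "A. Example", "birthdate": "1950-02-03", "diagnosis": "epilepsy"},
    {"name": "B. Sample", "birthdate": "1961-11-30", "diagnosis": "asthma"},
]
coded, key = key_code(records, identifier_fields=("name", "birthdate"))
```

The code is random rather than derived from initials or birthdate, so the coded data alone give no hint of identity; everything then depends on how well the linking key is guarded.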
Institutions should clearly articulate their policies on use or sharing of personally identifiable data. An example of such a policy statement is this guidance by the U.S. Office for Protection from Research Risks, on HIV studies:(61)
Where identifiers are not required by the design of the study, they are not to be recorded. If identifiers are recorded, they should be separated, if possible, from data and stored separately, with linkage restored only when necessary to conduct the research. No lists should be retained identifying those who elected not to participate. Participants should be given a fair, clear explanation of how information about them will be handled. ...
As a general principle, information is not to be disclosed without the subject's consent. The protocol must clearly state who is entitled to see records with identifiers, both within and outside the project.
Much very useful health research is performed on completely anonymized data. If for a particular research project there are no compelling reasons for retaining at least potential identifiability, anonymized data should be used. Though this injunction might sound unnecessary, it is stated here because often, data with identifiers are used just because they happen already to be on hand in identified form.
Data may be non-identifiable if any of the following tactics have been employed:
- Identifiers simply have never been collected.
- Identifiers have been removed ("stripped") effectively.
- Data have been aggregated; that is, within each data sub-element the data have been averaged or grouped into ranges, and only the averages or ranges reported, not revealing the identity of the data-subjects.
- Data have been "micro-aggregated," with small randomly assembled clusters of cases averaged, in effect generating a set of pseudo-cases that represent the real population.(62)
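The last two tactics can be illustrated with a minimal sketch. It assumes numeric fields only; the function names, the ten-year range width, and the cluster size of three are hypothetical choices, and the clusters are assembled at random as described above rather than by any more refined grouping method.

```python
import random
from statistics import mean

def aggregate_into_ranges(ages, width=10):
    """Count cases per age range; only the ranges and counts are reported."""
    counts = {}
    for age in ages:
        low = (age // width) * width
        label = f"{low}-{low + width - 1}"
        counts[label] = counts.get(label, 0) + 1
    return counts

def micro_aggregate(cases, cluster_size=3, seed=None):
    """Average small randomly assembled clusters of cases into pseudo-cases."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    clusters = [shuffled[i:i + cluster_size]
                for i in range(0, len(shuffled), cluster_size)]
    # Fold an undersized final cluster into the one before it, so that
    # (when there is more than one cluster) no pseudo-case rests on
    # fewer than cluster_size real cases.
    if len(clusters) > 1 and len(clusters[-1]) < cluster_size:
        last = clusters.pop()
        clusters[-1].extend(last)
    fields = list(cases[0].keys())
    return [{f: mean(c[f] for c in cluster) for f in fields}
            for cluster in clusters]

# Hypothetical data, for illustration only.
ranges = aggregate_into_ranges([23, 27, 31, 45, 48])
cases = [{"age": a, "stay_days": d}
         for a, d in [(23, 4), (27, 6), (31, 2), (45, 9), (48, 5), (52, 7), (60, 3)]]
pseudo_cases = micro_aggregate(cases, cluster_size=3, seed=1)
```

Seven real cases yield two pseudo-cases here, one averaging three cases and one averaging four. Whether clusters that small are adequate depends on the population and the risks involved.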
The test of whether data actually are non-identifiable is whether a person without prior knowledge of the data or their collection can, from the data and any other available information (such as postal-code charts, or a casually-held key to a code, or a list of the people recruited to the study), deduce the personal identity of data-subjects.
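One narrow version of this test can be sketched as a linkage check: match the released data against a single piece of other available information (here, a hypothetical list of people recruited to the study) and see which records point to exactly one person. The function name, field names, and sample people are invented for illustration. Passing such a check against one list does not establish non-identifiability, since other sources may exist.

```python
from collections import defaultdict

def reidentifiable(released, external, shared_fields):
    """Return (record, name) pairs where a released record matches exactly one person."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person[f] for f in shared_fields)].append(person["name"])
    hits = []
    for record in released:
        matches = index.get(tuple(record[f] for f in shared_fields), [])
        if len(matches) == 1:  # a unique match deduces the data-subject
            hits.append((record, matches[0]))
    return hits

# Hypothetical released data, coded only by initials and birthdate.
released = [
    {"initials": "JS", "birthdate": "1950-02-03", "diagnosis": "epilepsy"},
    {"initials": "MK", "birthdate": "1962-09-14", "diagnosis": "epilepsy"},
]
# Hypothetical recruitment list that an outsider might obtain.
recruits = [
    {"name": "John Smith", "initials": "JS", "birthdate": "1950-02-03"},
    {"name": "Mary King", "initials": "MK", "birthdate": "1962-09-14"},
    {"name": "Mark Kent", "initials": "MK", "birthdate": "1962-09-14"},
]
hits = reidentifiable(released, recruits, shared_fields=("initials", "birthdate"))
```

In this invented example the first record is traced to one person, while the second is shielded only because two recruits happen to share initials and birthdate.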
In an area in which the issue is highly contentious, a "consensus statement" from a workshop on genetic research on stored human tissue samples stated emphatically: (63)
Samples are anonymous if and only if it is impossible under any circumstances to identify the individual source. At present, in settings such as those involving large population groups, it may be possible to ensure anonymity while retaining some information about the individual source, such as ethnic origin, sex, age cohort, or limited clinical data, with the sample. In other settings, such as DNA samples obtained from a small group of individuals at risk for a specific disorder, retention of additional information may compromise anonymity. Samples are not anonymous if it is possible for any person to link the sample with its source. Even if the researcher cannot identify the source of the tissue, the samples are not anonymous if some other individual or institution has the ability.
If data must be transformed before being released for research, whether into irreversibly anonymized or into key-coded form, characteristics that might indirectly lead to identification of the data-subject should be obscured, blurred, or masked. Residential addresses can be translated into regions. Since some postal zones may be sparsely populated or have a distinctive cast of inhabitants, postal-code identifiers might be avoided. Instead of birthdate, perhaps age, or age brackets, can be used. Instead of the exact number of beds in nursing homes, capacity categories can be used. And personal initials are personal.
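These substitutions might look like the following sketch. Everything specific in it is an assumption made for illustration: the postal-code-to-region table, the ten-year age brackets, the bed-count thresholds, and the record layout are hypothetical and would have to be chosen to suit the actual sample and population.

```python
from datetime import date

def age_bracket(birthdate, on, width=10):
    """Replace a birthdate with an age bracket as of the date `on`."""
    had_birthday = (on.month, on.day) >= (birthdate.month, birthdate.day)
    age = on.year - birthdate.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def capacity_category(beds):
    """Replace an exact bed count with a coarse category (thresholds are hypothetical)."""
    if beds < 50:
        return "small"
    if beds < 150:
        return "medium"
    return "large"

def mask(record, region_of, on):
    """Return a masked copy of a record; address, birthdate and bed count are blurred."""
    return {
        "region": region_of[record["postal_code"]],
        "age_bracket": age_bracket(record["birthdate"], on),
        "capacity": capacity_category(record["beds"]),
        "diagnosis": record["diagnosis"],
    }

# Hypothetical lookup table and record, for illustration only.
REGION_OF = {"10115": "North", "80331": "South"}
record = {"postal_code": "10115", "birthdate": date(1950, 2, 3),
          "beds": 120, "diagnosis": "stroke"}
masked = mask(record, REGION_OF, on=date(1997, 7, 23))
```

Note that the masked record is built from a fixed list of fields, so anything not explicitly carried over (initials, for instance) is dropped rather than passed through by default.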
The extent to which any transformations are employed should be scaled to the characteristics of the sample and the population of which it is a subset, the potential risks to the data-subjects, the subjects' expectations, and other factors.
Many technical methods of "disclosure limitation" can be applied to make deductive identification of data-subjects difficult, if not impossible. In population studies, for instance, only relatively small proportions of the populations can be sampled. For surveys, only a randomly selected subset of the responses might be released instead of all of the responses, to obviate guessing, by elimination, who said what. And so on.(64)
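The survey tactic, releasing only a randomly selected subset of responses, is simple to sketch. The function name and the release fraction are hypothetical, and the choice of fraction is a judgment that this sketch does not address.

```python
import random

def release_subset(responses, fraction, seed=None):
    """Release a random subset of responses, so absence from the release reveals nothing certain."""
    rng = random.Random(seed)
    k = round(len(responses) * fraction)
    return rng.sample(responses, k)

# Hypothetical survey responses, for illustration only.
responses = [{"respondent_code": i, "answer": i % 3} for i in range(10)]
released = release_subset(responses, fraction=0.3, seed=7)
```

Because an outsider cannot tell which responses were withheld, reasoning by elimination about who said what becomes unreliable.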
For many purposes researchers must keep the ability to trace back, even if through intermediaries, to the data-subjects. Irreversible anonymization is not necessarily desirable.
There are a number of important reasons why retaining personal identifiability, either openly labelled or via key-coding, may be essential:
One of the clearest examples of the need to retain potential identifiability is the analysis of pharmaceutical and medical-device side-effect risks. As was mentioned above, the U.S. Food and Drug Administration, like all regulatory authorities, properly requires that data-links to the patient record be maintained (usually through the data-subject's physician) so that adverse-drug-event reports, sent in by physicians, the public, or manufacturers, can be verified and scrutinized in clinical detail if necessary.
Because irreversible anonymization often is undesirable on scientific grounds, the procedures and methods of key-coding of various forms are essential techniques. Some of the practices are very technical. Degree of key-coding or "masking" is relative. It is a question of the extent to which personal identifiability is obscured, which is to say, the impedance against "cracking" of the code and matching the data with the data-subjects.
U.S. agencies, such as the National Heart, Lung, and Blood Institute (NHLBI), emphasize that the first step in protecting personally identifiable data is simply to hold the identifiers close to the point of collection. Before transferring data to other researchers, then, the data should be stripped of identifiers and either key-coded or anonymized. When the Institute sends data to pharmaceutical companies from clinical trials on an investigational new drug, it strips off not only the patient and physician name but location, birthdate, and other data that could point back to the data-subject. It takes similar care when it correlates data from several sources, as when it links heart disease data with socioeconomic data.
Simply designating a reliable person within the research organization to be responsible for stripping identifiers, and for formally certifying to the principal investigator and/or an administrator that the resulting set of stripped data is nonidentifiable, can be prudent.
Trusted intermediary organizations, such as public accounting or consulting firms, may be asked to remove identifiers, and perhaps to hold the key linking data with identifiers. For a detailed national analysis of hospital costs based on data provided by the States, the U.S. Agency for Health Care Policy and Research arranged for an intermediary organization to remove identifying information from the patient data, and also information that might identify the hospitals, before the Agency received the data.
In its alcohol-related studies, which may be painfully sensitive for the people studied, the U.S. National Institute on Alcohol Abuse and Alcoholism assigns pseudonym (key-coded) identifiers to all subjects and has the key held securely by an independent third party.
The U.S. National Institute of Child Health and Human Development (NICHD) requires that if researchers wish to perform a secondary study on data originally collected by other investigators under an NICHD grant, they must pay a fee to the original researchers to key-code the identifiers and take other protective steps before transferring the data for the secondary study.
The following example illustrates a rigorous approach to separating identifiers from data but retaining the ability to reconnect them if necessary. In several states of Germany an elaborate system is being tested for population-based cancer registries.(65) A "trusted office" (Vertrauensstelle), directed by a physician, receives cancer case data from doctors and hospitals, classifies the cases as to type of tumor and so on, and, using cryptographic procedures, assigns pseudonyms, separating the case data from the person-identifying data. Then, using a secure system, it transfers the pseudonymized data to a separately located "registration office" (Registerstelle), which stores the data securely. After a short time the "trusted office" destroys its set of the data. Again separately, a master re-identification key is held by a "supervisory office." The "registration office" cannot match identifiers to the cases it stores. If, later, it becomes scientifically necessary to trace back to the patient's physician to obtain more information, with the approval of an ethics committee the supervising office can use its re-identification key to reassociate the case data with the identifying data. The system has been endorsed in the relevant laws. Whether such a system will be widely applicable is not yet clear; but such approaches deserve to be evaluated.
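The division of roles among the three offices can be sketched as follows. This is a deliberately simplified model, not the registries' actual design: it swaps the cryptographic procedures for a random pseudonym plus a lookup table held by the supervisory office, and the class names, method names, and report fields are all hypothetical.

```python
import secrets

class SupervisoryOffice:
    """Holds the re-identification key and nothing else."""
    def __init__(self):
        self._reid_key = {}  # pseudonym -> identifying data

    def deposit(self, pseudonym, identity):
        self._reid_key[pseudonym] = identity

    def reidentify(self, pseudonym, ethics_approval):
        if not ethics_approval:
            raise PermissionError("ethics committee approval is required")
        return self._reid_key[pseudonym]

class RegistrationOffice:
    """Stores pseudonymized case data; cannot match identifiers to cases."""
    def __init__(self):
        self.cases = {}  # pseudonym -> case data

    def store(self, pseudonym, case_data):
        self.cases[pseudonym] = case_data

class TrustedOffice:
    """Receives reports, separates identity from case data, and keeps no copy."""
    def __init__(self, registration, supervisory):
        self._registration = registration
        self._supervisory = supervisory

    def process(self, report):
        pseudonym = secrets.token_hex(16)  # stand-in for a cryptographic pseudonym
        identity = {"patient": report["patient"], "physician": report["physician"]}
        case_data = {"tumor_type": report["tumor_type"]}
        self._supervisory.deposit(pseudonym, identity)
        self._registration.store(pseudonym, case_data)
        return pseudonym  # returned only so this example can look the case up

# Hypothetical report, for illustration only.
supervisory = SupervisoryOffice()
registration = RegistrationOffice()
trusted = TrustedOffice(registration, supervisory)
p = trusted.process({"patient": "A. Example", "physician": "Dr. B. Sample",
                     "tumor_type": "melanoma"})
```

The point of the arrangement survives the simplification: no single office holds both the case data and the means of attaching names to it, and reassociation requires an explicit, approved step.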
A uniform "Federal Policy for the Protection of Human Subjects," often called the "Federal Common Rule," is promulgated by sixteen Federal agencies that conduct, support, or regulate research. It governs such matters as subject rights, informed consent, Institutional Review Boards, disclosure policy, recordkeeping, and a variety of other matters. (66)
The Office for Protection from Research Risks (OPRR), in the National Institutes of Health, serves as a resource and makes certain that the Federal Common Rule is implemented. Research institutions, such as academic medical centers, which wish to perform research on humans under Federal funding or other Federal auspices must negotiate and enter into a formal "Assurance" with OPRR stipulating the overall means by which the institution will protect subjects and designating an officer responsible for being sure the protections are implemented. (67)
Initial questions of any investigatory activity are: Is it "research"? Who are "subjects"? According to the Federal Common Rule, research is "a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge" [italics added] (§_.102(d)). Human subjects are "living individual(s) about whom an investigator (whether professional or student) conducting research obtains (1) data through intervention or interaction with the individual, or (2) identifiable private information" (§_.102(f)). Definitions such as these are not mere exercises. Rather, they determine how particular investigatory activities must be approached, whether they fall under Federal scrutiny, and whether they must be supervised by an Institutional Review Board.
The Federal Common Rule exempts from its IRB and other requirements "research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens... if the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects" (§_.101(b)(4)). This is being reconsidered with respect to genetic analysis of stored tissues, now that genetic mapping techniques have become so revealing.
Beyond the provisions of the Federal Common Rule, many additional regulations in various agencies cover aspects of research involving ionizing radiation, research on high-risk biological agents, alcohol and drug-abuse research, research classified under national security regimens, and other special circumstances.
A universally endorsed ethical precept is that it is permissible to collect and use personally identifiable data, if the data-subject agrees to the conditions of data protection and use. The ideal is prior, informed, freely granted, specific consent. Researchers strive for this to varying degrees, and achieve it to varying degrees. (68),(69)
Informing for consent almost always must include telling the prospective subject "the extent, if any, to which confidentiality of records identifying the subject will be maintained." (70)
For formal clinical trials and much other research, informed consent is routinely sought, Institutional Review Boards supervise the research protocols, and so on. But for many other kinds of research, for a variety of reasons, notice is not routinely given nor explicit consent sought, and indeed consent may be practically impossible to seek. The policy and pragmatic questions are obvious. For example, retrospective studies, such as epidemiological reviews initiated years after the medical events, pose special problems. How should identifiability and consent be dealt with when such reviews are undertaken? And in general, when data are collected, how broad a consent should be sought for future studies that cannot be specifically anticipated? Important issues are the granting of consent for studies in large multipurpose databases, or retrospective secondary research (discussed in Chapter 7). How meaningful and sufficient is omnibus, indefinite consent?
Alas, as the ethicist Ruth Faden has rightly lamented: (71)
As a practical matter, how much moral weight the typical consent to access information can bear is dubious. The catchall phrases in the waivers and disclosure statements read and signed by patients and consumers ("Your records will be kept confidential and not be made available, except for statistical purposes," "except for research purposes," and "except for administrative purposes") are doubtless not very meaningful to most people.
Unless we do a good job of soliciting genuine informed consent or conducting an extraordinarily public education and exchange to provide citizens with an understanding of who now has information and for what purposes, getting consent will not get us off the moral hook.
A real-world example is called for here, just to indicate the complications. From the front line of health social work, Jeanette Davidson and Tim Davidson have brought this sobering message, as relevant for research as for the provision of care: (72)
With managed care systems the reality is often that the name of the individual or organization receiving the disclosure may change without notice, the information to be disclosed may consist of a verbatim account of the client's most sensitive information given to persuade a gatekeeper to continue to authorize services, and the statement about the client's being able to revoke consent at any time is an illusory proposition given the virtual irretrievability of electronic transmissions of data that are stored in various locations.
External ethical oversight provides additional protection for research subjects. Prime examples in the U.S. are the Institutional Review Boards (IRBs) that supervise human-subjects research conducted under Federal jurisdiction, which is very broad. IRBs are carefully constituted boards that conduct independent oversight of research. (73)
The IRB is an administrative body established to protect the rights and welfare of human research subjects recruited to participate in research activities conducted under the auspices of the institution with which it is affiliated. The IRB has the authority to approve, require modifications in, or disapprove all research activities that fall within its jurisdiction as specified by federal regulations and local institutional policy.
In the U.S. a research institution must have in place a properly constituted and functioning IRB to be eligible to receive Federal funding for research on humans. Some institutions pledge all of their research, regardless of the source of funding, to the standards of the Federal Common Rule. All Federal agencies conducting research on humans operate under IRBs; the Centers for Disease Control and Prevention has six IRBs, and each of the seventeen Institutes of the National Institutes of Health has at least one. At present some 3,500 IRBs are in operation in the U.S. (74)
No doubt different IRBs, in practice, deliver differing degrees of supervision (and thereby, protection), depending on their capabilities and how hard they apply themselves. Some Federal programs review IRB performance; others don't. The Food and Drug Administration, in its routine audits, reviews whether data submitted in the regulation of drugs, medical devices, and so on have been gathered and protected in conformance with the pertinent local IRB stipulations on the particular research protocol. Moreover, each year it inspects the work of several hundred IRBs as to adequacy of structure and performance.
"Evaluation of the risk/benefit ratio is the major ethical judgment that IRBs must make in reviewing research protocols," the OPRR Guidebook emphasizes. (75) "Risks to research subjects posed by participation in research should be justified by the anticipated benefits to the subjects or society."
Importantly, the Guidebook states: "A risk is minimal where the probability and magnitude of harm or discomfort anticipated in the proposed research are not greater, in and of themselves, than those ordinarily encountered in daily life or in the performance of routine physical or psychological examinations or tests." (76) Rationales of this kind are often invoked in judgments regarding design of research protocols and access to personal data in databases. They deserve elaboration now, to cope more fully with privacy risks in addition to physical and emotional risks.
For research conducted outside the U.S., the Federal Common Rule allows "Department or Agency heads to determine that the procedures prescribed by the institution afford protections that are at least equivalent" and to allow substitution of the foreign procedures (§_.101(h)).
Some private-sector institutions, such as managed-care organizations, have established IRBs that function similarly. This is becoming even more desirable now as more research is being performed on data from mixed sources, such as pooled or comparative data from private-sector managed-care organizations and government healthcare payors.
Similar criteria and systems of external oversight are operative in most European countries, and elsewhere.
For some kinds of research, especially perhaps for some database research for which highly dependable protections can be assured, a specially constituted national-level IRB might be workable. Some precedent might be seen in the ethics reviews that were conducted by the now disbanded Recombinant-DNA Advisory Committee. In Europe there has been some experience with multi-country IRBs for clinical trials.
There is no doubt that IRBs enhance research-subject protections and provide much public reassurance. They are an integral part of biomedical research. But it is less clear that IRBs have been attending as vigorously to privacy risks as they have to physical and emotional risks. For many IRBs the workload already is heavy. Now they may well have to be asked to become more deeply engaged with the privacy and confidentiality aspects of subject protection than they have been, in database research as well as in direct experimentation, and with genetic privacy. Whether they are able and willing to do so should be assessed.(77)
Last updated 7/23/97.