Data Privacy and AI Progress

To improve AI performance, regulators must loosen restrictions on data sharing.

Every law reflects a cost-benefit analysis made at the time of enactment. With respect to many privacy laws, that analysis is out of date. The costs of collecting, storing, and analyzing data in sensitive contexts, such as health care and educational settings, once vastly outweighed the benefits. In the 1970s, for example, the high cost of collection and storage meant it likely did not make sense to collect any more information than was necessary to address a patient’s immediate needs, let alone to store that information for very long. The risk of that information being stolen or leaked outweighed any likely benefit. The same is not true today.

Advances in artificial intelligence (AI) have changed the math. Sharing is now caring: caring for your future self as well as for the well-being of others. For example, data collected during a patient’s youth may drastically improve diagnoses in adulthood. That data may also train and improve models that result in better treatment for others. Those gains, however, will not occur if people continue to see data as something to hoard rather than to share. Similarly, if outdated laws remain on the books, AI progress will be delayed, costing lives and leaving educational gains unrealized.

Of course, there are still costs to permitting more liberal data collection in sensitive domains. But those costs must be put in the context of a health care system in which medical errors run rampant and an education system in which many kids slip through the cracks. AI will not be able to address those shortcomings absent legal reforms and cultural shifts in how we think about data sharing.

That is because data are the basis on which AI systems learn what counts as a pattern and what counts as noise. The modern literature on machine learning has shown, with striking consistency, that model performance improves as training data grows and that even very large models underperform when they are trained on too few data. Quantity is only part of the story, however. A system trained on narrow or unrepresentative data may appear accurate in familiar settings but fail when it encounters the harder cases that matter most in practice: unusual presentations, minority populations, atypical learning profiles, or circumstances that differ from the environment in which the model was developed.

Rich, diverse, and well-curated datasets make models more capable and more dependable by reducing the odds that a system has merely learned a shortcut that collapses outside the lab. In sensitive domains, then, building AI that is reliable enough to deserve public trust requires data of exactly that kind.

Education law offers a concrete example of a data-limiting regime built on the assumption that restraint equals safety. California’s Student Online Personal Information Protection Act bars operators of K-12 online services from using covered information to amass a profile about a student except in furtherance of K-12 school purposes, forbids selling student information, and restricts disclosure. The statute preserves room for some socially valuable uses. For example, it allows operators to use deidentified information to improve educational products and permits the use of student data for adaptive or customized learning. But the architecture of the law still reflects an older intuition that student data should remain tethered to its immediate instructional purpose rather than contribute to broader systems of learning over time. That instinct is understandable in a world worried about commercialization and misuse. In a world of data-hungry AI tools, however, limitations of this sort can make it harder to build systems that learn from longitudinal and cross-context patterns. These are the very patterns that may be needed to identify effective interventions, detect struggling students earlier, and develop tutors that work well for more than the easiest cases.

Health care law offers the same pattern. The Privacy Rule issued under the Health Insurance Portability and Accountability Act (HIPAA) is built around bounded use, not maximal learning. Covered entities generally may use and disclose protected health information only for specified purposes, and outside of treatment the rule’s “minimum necessary” standard requires reasonable efforts to limit data to what the immediate purpose demands. Research can proceed, but broader reuse often depends on patient authorization, a waiver by an institutional review board or privacy board, or deidentification. That design made sense when the law’s central concern was disclosure. It makes less sense when accurate medical AI depends on large, longitudinal, and clinically diverse datasets. These datasets would let a model learn from rare presentations, delayed complications, and patients whose care spans multiple years and providers.

Excess data hoarding is now an anti-social behavior. Imagine patients who refuse to allow a doctor to use AI to transcribe their appointment. Not only would those patients miss out on the benefits of a more accurate summary of their meeting, but they would also leave other patients with less time with the doctor, who would then have to spend time putting pen to paper rather than sitting down with patients. The same is true in education: when schools wall off the diverse performance and interaction data that tutoring systems use to model student knowledge, they make it harder to build AI tutors that can adapt to different paces, misconceptions, and learning styles rather than merely serving the median student.

None of this is an argument for treating sensitive data carelessly. The case for broader use holds only if it is matched with a much higher bar for cybersecurity and a real respect for patient autonomy. HIPAA’s Security Rule already requires administrative, physical, and technical safeguards for electronic protected health information. Any reform worth enacting should strengthen those protections rather than relax them. Patients should also retain meaningful control: clear notice, genuine choices where feasible, and firm consequences for misuse.

But that is not an argument for reflexive non-sharing. Other sectors already operate on the premise that people may entrust highly sensitive information to regulated institutions when the social gains are large. Consumers do so in finance, where federal law requires covered institutions to maintain comprehensive information-security programs, because secure data flows make modern payments, credit, and fraud detection possible. Aviation does the same through confidential safety data sharing programs that give regulators and operators a fuller picture of systemic risk. Health care should pursue the same balance: strong security, real agency, and broad enough data sharing to build tools that are worth trusting.

The question, then, is not whether sensitive data should remain protected. It is whether the law will continue to protect it in ways that reflect yesterday’s tradeoffs rather than today’s realities. Rules built for an analog world treated restraint as the safest course because the gains from broader use were speculative, and the risks of misuse were obvious. AI has altered that balance. In both health care and education, the refusal to collect, retain, and responsibly share data now carries its own costs: worse diagnoses, weaker interventions, less personalized instruction, and slower improvement for everyone who comes next. The better path is not carelessness with data but a shift from data minimization to data stewardship. An effective policy would pair strong cybersecurity, real respect for individual choice, and serious penalties for abuse with a legal framework that permits socially valuable learning on a meaningful scale. If the law keeps treating sensitive data as something to be locked away, it will safeguard privacy while entrenching mistakes and inequality.