
Beyond the Owl: How "Subliminal Learning" Repeats a Classic Statistical Mistake in Educational AI


By Ethan McGowan

Ethan McGowan is a Professor of Financial Technology and Legal Analytics at the Gordon School of Business, SIAI. Originally from the United Kingdom, he works at the frontier of AI applications in financial regulation and institutional strategy, advising on governance and legal frameworks for next-generation investment vehicles. McGowan plays a key role in SIAI’s expansion into global finance hubs, including oversight of the institute’s initiatives in the Middle East and its emerging hedge fund operations.

‘Subliminal learning’ signals spurious shortcuts, not new pedagogy
Demand negative controls, cross-lineage validation, and robustness-first training
Procure only models passing worst-case and safety gates across schools


A single, stark statistic underscores the urgency of our situation. Between 2015 and 2022, the share of students whose principals reported teacher shortages rose from 29% to 46.7% across the OECD, meaning that almost one in two students now learns in a system struggling to staff classrooms. In this context, we cannot afford to be swayed by research fashions that overpromise and under-replicate. When headlines tout the mysterious inheritance of traits from 'teacher' AI models by 'student' AI models, even when the training data seem unrelated, we must ask whether the mystery reflects a real mechanism or a methodological artifact. What education leaders need is robust AI that performs consistently across schools, not results that collapse outside a controlled lab environment. The policy question is not whether AI can do remarkable things; it is whether the claims can withstand the confounders that have misled social science for decades.

The current spate of "subliminal learning" stories follows an experiment in which a teacher model with some trait—say, a fondness for owls or, more worryingly, misaligned behavior—generates data that appear unrelated to that trait, such as strings of numbers. A student model trained on this data reportedly absorbs the trait anyway. The Scientific American summary presents the finding as an unexpected, even eerie, kind of transfer. But read as methodology rather than magic, the setup echoes a familiar pitfall: spurious association in high-dimensional spaces, amplified by a tight coupling between teacher and student. When the teacher and student share an underlying base model, their internal representations are already aligned; "learning" may be the student rediscovering the latent structure it was predisposed to find. That is not a revelation about education. It is a reminder about research design.

The Shortcut Trap in Distillation

What the headlines call subliminal can be reframed as shortcut learning under distillation, a process in AI where a complex model is simplified into a smaller, more manageable one. Deep networks often seize on proxies that 'work' in the training distribution but fail under shifts; the literature is replete with examples of classifiers that latch onto textures or backgrounds rather than objects, or onto dataset-specific quirks that do not generalize. If a student is trained to mimic a teacher's outputs—even when those outputs are filtered or encoded—the student can inherit shortcuts encoded in the teacher's representational geometry. The Anthropic/Truthful AI team itself reports that the effect occurs when teacher and student share the same base model, a strong hint that we are watching representational leakage rather than semantically independent learning. In other words, the phenomenon may be a laboratory artifact of initialization and distillation, rather than a new principle of cognition.

Spurious regression has a long pedigree: Granger and Newbold showed half a century ago how high R² and plausible coefficients can arise from unrelated time series if you ignore underlying structure. Today's machine-learning equivalent is spurious correlation across groups and environments, which predictably crumbles under distribution shift. A 2024 survey catalogs how often models succeed by exploiting superficial signals, and the community has responded with methods such as group distributionally robust optimization and invariant risk minimization to force models to rely on stable, causally relevant features. The upshot for education is plain. If a model's advantage depends on an idiosyncratic alignment between teacher and student networks, or on hidden artifacts of synthetic data generation, its performance will evaporate when the model is deployed across districts, devices, and demographics. That is not alignment; it is overfitting in disguise.
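As a quick illustration of the Granger–Newbold point, here is a minimal simulation (the seed, series length, and use of statsmodels are arbitrary choices for the sketch) that regresses two independent random walks on each other and then repeats the regression on their differences.

```python
import numpy as np
import statsmodels.api as sm

# Two independent random walks: by construction, neither causes the other.
rng = np.random.default_rng(42)
T = 500
x = np.cumsum(rng.normal(size=T))
y = np.cumsum(rng.normal(size=T))

# Regressing levels on levels typically yields an inflated t-statistic and often
# a sizable R^2 -- the Granger-Newbold spurious-regression pattern.
levels = sm.OLS(y, sm.add_constant(x)).fit()
print(f"levels: R2={levels.rsquared:.2f}, t={levels.tvalues[1]:.1f}")

# Differencing (respecting the underlying structure) makes the 'effect' vanish.
diffs = sm.OLS(np.diff(y), sm.add_constant(np.diff(x))).fit()
print(f"diffs:  R2={diffs.rsquared:.3f}, t={diffs.tvalues[1]:.1f}")
```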

The Education Stakes in 2025


Figure 1: Teacher shortages jumped from 29% to 46.7% of students affected across the OECD, reversing a brief improvement in 2018. In systems under strain, flimsy AI evidence isn’t harmless—it crowds out scarce capacity.

The stakes are not hypothetical. New usage data indicate that faculty already employ large models for core instructional work: in one analysis of 74,000 educator conversations, 57% concerned curriculum development, with research assistance and assessment following behind. This is not a future scenario; it is a live pipeline from model behavior to classroom materials. Inject fragile methods into that pipeline, and the system propagates them at scale. In an environment where teacher time is scarce and digital tools shape daily practice, we must be more—not less—demanding of evidence standards.


Figure 2: Curriculum design dominates faculty AI use, meaning any modeling artifact can propagate straight into lesson materials. Governance should target where the real volume is—not edge-case demos.

Meanwhile, the sector is increasingly turning to synthetic data, both by choice and necessity. Gartner's prediction that more than 60 percent of AI training data would be synthetic by 2024 is becoming a reality, with estimates suggesting that up to one-fifth of training data is already synthetic. Independent analyses forecast that publicly available, high-quality human text may be effectively exhausted for training within a few years—by around 2028—pushing developers toward distillation and model-generated corpora. In this world, teacher–student pipelines are not an edge case; they are the default. Without careful guardrails, we risk unknowingly amplifying shortcuts and transmitting misalignment through datasets that look innocuous to humans but encode the teacher's quirks. This is precisely the failure mode the subliminal-learning experiments dramatize, and it is a risk we cannot afford to ignore.

Equity and governance concerns sharpen the point. OECD's latest monitoring places teacher capacity and system resilience at the forefront of policy priorities; UNESCO's guidance on generative AI urges human-centred, context-aware use with strong safeguards. A research culture that normalizes spurious associations as "surprising" discoveries runs directly against those priorities. For schools already navigating resource shortages, fragile AI methods are not a curiosity—they are an operational risk.

A Methods Standard to Prevent Spurious AI in Schools

If we treat the subliminal-learning story as a cautionary tale rather than a breakthrough, a practical policy agenda follows. First, preregistration with negative controls should become table stakes for educational AI research. When a study claims trait transfer through "unrelated" data, researchers must include pre-specified placebo features—like the proverbial owl preference—and report whether the pipeline also "discovers" them under random relabeling. Psychology's replication crisis taught us how easily flexible analysis can spin significance from noise; requiring negative controls protects education from becoming the next field to relearn that lesson at a high cost. Journals and funders should not accept claims about hidden structure without proof that the pipeline does not also discover structures that do not exist.
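A negative-control requirement can be operationalized with something as simple as a permutation placebo check. The sketch below assumes hypothetical per-run 'trait scores' and a treated/control labeling as stand-ins for whatever transfer metric a study actually reports; if the observed gap is not extreme relative to the placebo distribution, the pipeline is "discovering" structure that is not there.

```python
import numpy as np

rng = np.random.default_rng(0)

def trait_gap(scores, is_treated):
    """Mean trait score of students trained on trait-teacher data minus controls."""
    return scores[is_treated].mean() - scores[~is_treated].mean()

def placebo_check(scores, is_treated, n_perm=10_000):
    """Re-run the same 'discovery' after randomly relabeling which runs were treated."""
    observed = trait_gap(scores, is_treated)
    placebo = np.array([
        trait_gap(scores, rng.permutation(is_treated)) for _ in range(n_perm)
    ])
    p_value = np.mean(np.abs(placebo) >= abs(observed))
    return observed, p_value

# Hypothetical inputs: trait scores for 40 student runs, half trained on data
# from a trait-bearing teacher (is_treated=True), half on neutral control data.
scores = rng.normal(0.0, 1.0, size=40)   # stand-in measurements
is_treated = np.arange(40) < 20
print(placebo_check(scores, is_treated))
```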

Second, environmental diversification should be mandatory for any model intended for classroom use. That includes training and evaluation across multiple school systems, devices, languages, and content standards—and critically, across model lineages. If the effect depends on teacher–student architectural kinship, as the Anthropic/Truthful AI work indicates, then demonstrations must show that the claimed benefits persist when the student is trained on outputs from a different family of models. Otherwise, we are validating an effect that rides on shared initialization, not educational relevance. Regulatory sandboxing can help here: ministries or states can host pooled, privacy-preserving evaluation environments that make it cheap to run the same protocol across districts and model families before procurement.

Third, robustness-first objectives—such as group distributionally robust optimization and invariant risk minimization—should be normalized in ed-AI training regimes. These methods explicitly penalize performance that arises from environment-specific quirks, encouraging models to focus on features that remain stable across contexts. They are not silver bullets; even their proponents note limitations. But unlike hype-driven discovery, they encode into the loss function what policy actually values: performance that survives heterogeneity across schools. Procurement guidelines can require vendors to report group-wise worst-case accuracy and to document the distribution shifts they tested, not just average scores.
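The procurement requirement in the last sentence is easy to specify concretely. This minimal sketch (with made-up district labels and random predictions as placeholders) reports per-group accuracy and the worst-case group figure a vendor could be asked to disclose alongside the average.

```python
import numpy as np

def groupwise_report(y_true, y_pred, groups):
    """Per-group accuracy plus the worst-case group accuracy a contract could require."""
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[g] = float((y_true[mask] == y_pred[mask]).mean())
    return accs, min(accs.values())

# Hypothetical evaluation: a vendor model's predictions across three districts.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=300)
y_pred = rng.integers(0, 2, size=300)
groups = np.repeat(["district_A", "district_B", "district_C"], 100)

per_group, worst_case = groupwise_report(y_true, y_pred, groups)
print(per_group)
print(f"worst-case group accuracy: {worst_case:.2f}")
```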

Fourth, lineage disclosure and compatibility constraints should be included in contracts. If subliminal transfer manifests most strongly when the student shares the teacher's base model, buyers deserve to know both lineages. Districts could, for example, prefer cross-lineage distillation for high-stakes tasks to reduce the risk of hidden trait transmission. Where cross-lineage training is infeasible, vendors should present independent audits demonstrating that model behavior remains stable when the teacher is replaced with a different family. This is not bureaucratic overhead; it is the modern analog of requiring assay independence in medical diagnostics.

Fifth, we should distinguish between safety claims and performance claims in public messaging. The same experiment that transfers an innocuous "owl preference" can also transfer misalignment, which manifests as dangerous instructions—a result widely reported in both the technical and popular press. Education systems should treat those two outcomes as a single governance problem: the risk of trait propagation through model-generated data. It follows that red-team evaluations for safety must run in parallel with achievement-oriented benchmarks, with release gates that can halt deployment if safety degrades under distribution shifts or teacher-swap tests.

Anticipating critiques. One response is that the phenomenon reveals a real, hidden structure: if a student can infer a teacher's trait from numbers, perhaps the trait is genuinely encoded in distributional signatures that human reviewers cannot see. The problem is not the possibility but the proof. Without negative controls, cross-lineage validation, and out-of-distribution tests, we cannot distinguish between a "hidden causal signal" and a "repeatable artifact." Another objection is practical: if students learn better with AI-assisted materials, why dwell on method minutiae? Because the shortcut trap is precisely that: it delivers early gains that wash out when the distribution changes, which it always does in education, across schools, cohorts, and curricula. The best evidence we have—from studies on shortcut learning to surveys of spurious correlations—indicates that models trained on unstable signals tend to falter when the context shifts. That is a poor bargain for systems already stretched to the limit.

From Artifact to Action

A brief method note: Several figures above are policy-relevant precisely because they change the baseline. Teacher shortages approaching one in two students alter the cost of false positives in ed-tech. Faculty usage data showing curriculum development as the number-one AI use means any modeling artifact can propagate directly into lesson content. Projections of a data drought explain why distillation and synthetic pipelines are not optional. These are not scare statistics; they are context variables that should have been central to how we interpret "subliminal learning" from the start.

Where does this leave policy? Schools and ministries should move quickly, but not by chasing eerie effects. Instead, codify a methods standard for educational AI: preregistration with placebo features; cross-environment and cross-lineage training and evaluation; robustness-first objectives; lineage disclosure; and joint safety-and-achievement release gates. Frame procurement around worst-case group performance, not average benchmarks. Tie vendor payments to replication across independent sites. Align all of this with UNESCO's human-centred guidance and the OECD's focus on system capacity. That is how we turn an intriguing lab result into a sturdier, fairer AI infrastructure for classrooms.

In a world where 46.7% of students attend schools reporting teacher shortages, the price of fads is measured in lost learning opportunities. "Subliminal learning" may be a vivid demonstration of how strongly coupled networks echo one another, but it does not license a policy pivot toward fishing for hidden signals. The burden of proof lies with those who claim that seemingly unrelated data carry reliable pedagogical value; the default assumption, borne out by years of research on shortcuts and spuriousness, is that artifacts masquerade as insights. The practical path for education is neither cynicism nor hype. It is governance that treats every surprising correlation as a stress test waiting to be failed—until it passes the controls, crosses lineages, and survives the messy variety of real classrooms. Only then should we scale.


The views expressed in this article are those of the author(s) and do not necessarily reflect the official position of the Swiss Institute of Artificial Intelligence (SIAI) or its affiliates.


References

Anthropic Fellows Program & Truthful AI. (2025). Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv:2507.14805 (v1).
Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant Risk Minimization. arXiv:1907.02893.
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, perspective (preprint on arXiv:2004.07780).
Granger, C. W. J., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2(2), 111–120.
Hasson, E. R. (2025, August 29). Student AIs pick up unexpected traits from teachers through subliminal learning. Scientific American.
OECD. (2024). Education Policy Outlook 2024. OECD Publishing.
OECD. (2024). Education at a Glance 2024. OECD Publishing.
Sagawa, S., Koh, P. W., Hashimoto, T., & Liang, P. (2020). Distributionally robust neural networks for group shifts. ICLR (preprint arXiv:1911.08731).
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
TechMonitor. (2023, August 2). Most AI training data could be synthetic by next year—Gartner.
UNESCO. (2023/2025 update). Guidance for generative AI in education and research. UNESCO.


Break the Loop: Why Search Behaves Like Reinforcement Learning—And How to Fix It


By David O’Neill

David O’Neill is a Professor of Finance and Data Analytics at the Gordon School of Business, SIAI. A Swiss-based researcher, his work explores the intersection of quantitative finance, AI, and educational innovation, particularly in designing executive-level curricula for AI-driven investment strategy. In addition to teaching, he manages the operational and financial oversight of SIAI’s education programs in Europe, contributing to the institute’s broader initiatives in hedge fund research and emerging market financial systems.

Search behaves like reinforcement learning, rewarding confirmation
Narrow queries and clicks shrink exposure at scale
Break the loop with IV-style ranking and teach students to triangulate queries


The number that should concern educators is five trillion—the estimated total of searches conducted each year, roughly fourteen billion daily. At this scale, even small biases in the way we formulate queries and prioritize results can have a significant impact on widespread learning. When a student clicks on something that validates a belief and spends time on a page that resonates with them, the ranking system interprets that engagement as successful and presents more content like it. The cycle tightens: queries become more specific, exposure decreases, and confidence solidifies. This is not merely “filter bubbles”; it’s a form of reinforcement—not just within the model but also in the interaction between humans and platforms that resembles reinforcement learning from human feedback (RLHF). If search functions like reinforcement learning, simply adjusting content-moderation settings or opting for a “neutral feed” will never suffice. The solution must be causal: introduce external variation into the content displayed, gain insights from that feedback, and assess relevance based on the pursuit of truth rather than comfort. In essence: modify the feedback loop.

Search Behaves Like Reinforcement Learning

We reframe the prevailing worry about “echo chambers” from a content-moderation problem to a feedback-design problem. Under this lens, the central mechanism isn’t a malicious algorithm force-feeding partisanship; it’s the way human query formulation and ranking optimization co-produce reward signals that entrench priors. In controlled studies published in 2025, participants asked to learn about topics as mundane as caffeine risks or gas prices generated a nontrivial share of directionally narrow queries, which limited belief updating; importantly, the effect generalized across Google, ChatGPT, and AI-powered Bing. When exposure was randomized to broaden results, opinions and even immediate choices shifted measurably—evidence that the loop is plastic. For education systems that rely on students’ ability to self-inform, this is not a side issue: it is the substrate of modern learning.

The scale magnifies the stakes. If we conservatively take topic-level narrow-query rates in the teens or higher and apply them to Google’s updated volume—about 14 billion searches per day—then hundreds of millions to several billions of daily queries plausibly begin from a narrowed frame. That is not a claim about any one platform’s bias; it is arithmetic on public numbers plus experimentally measured behavior.
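Spelled out, with $V$ the daily search volume and $p$ an assumed narrow-query share (the 10–30% band below is illustrative, bracketing the topic-level rates the experiments report):

$$N_{\text{narrow}} \approx p \times V, \qquad V \approx 1.4 \times 10^{10}\ \text{searches per day}$$
$$p \in [0.10,\ 0.30] \;\Rightarrow\; N_{\text{narrow}} \approx 1.4 \text{ to } 4.2 \text{ billion narrowed queries per day.}$$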


Figure 1: Search still dwarfs chat by ~5–6× in daily volume, so even tiny RL-style feedback biases in search ranking can shape what learners see at population scale.

For classrooms and campuses, the practical implications are sobering. In a typical week, a large share of student-directed information seeking may begin on paths that quietly narrow subsequent exposure, even before recommendation systems introduce their own preferences. And because Google still accounts for roughly 90% of global search, the design choices of a few interfaces effectively set the epistemic defaults for the world’s learners.


Figure 2: Nearly nine in ten global searches flow through a single interface—so small design choices have system-level consequences for how students learn.

From Queries to Rewards

Seen through a learning-systems lens, search today behaves like a contextual bandit with human-in-the-loop reward shaping. We type a prompt, review a ranked slate, click what “looks right,” and linger longer on agreeable content. Those behaviors feed the relevance model with gradients pointing toward “more like this.” Over time, personalization and ranking optimization align the channel with our priors. That logic intensifies when the interface becomes conversational: two 2024 experiments found that LLM-powered search led participants to engage in more biased information querying than conventional search, and an opinionated assistant that subtly echoed users’ views amplified the effect. The architecture encourages iterative prompting—asking, receiving an answer, and refining toward what feels right—mirroring the ask/feedback/refine loop of reinforcement learning from human feedback (RLHF). It’s not that the model “learns beliefs”; the system learns to satisfy a belief-shaped reward function.
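A toy simulation makes the mechanism concrete. The click probabilities, the epsilon-greedy rule, and the two "arms" below are assumptions for the sketch, not a description of any production ranker; the point is only that a click-optimizing loop concentrates exposure on whatever the user already rewards.

```python
import numpy as np

# Arm 0 = belief-consistent results, arm 1 = belief-challenging results.
# The simulated user clicks agreeable content more often, so a click-optimizing
# ranker shifts exposure toward arm 0 over time.
rng = np.random.default_rng(7)
click_prob = np.array([0.6, 0.3])   # user-side 'reward' for each arm
counts = np.ones(2)                 # impressions per arm (smoothed)
clicks = np.ones(2)                 # clicks per arm (smoothed)

shown = []
for t in range(5_000):
    # epsilon-greedy on the estimated click-through rate
    arm = rng.integers(2) if rng.random() < 0.05 else int(np.argmax(clicks / counts))
    counts[arm] += 1
    clicks[arm] += rng.random() < click_prob[arm]
    shown.append(arm)

share_challenging = np.mean(np.array(shown[-1_000:]) == 1)
print(f"belief-challenging share of the last 1,000 impressions: {share_challenging:.1%}")
```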

If the loop were harmless, we might accept it as a usability issue. However, large-scale field experiments on social feeds, although mixed on direct attitudinal change, reveal two critical facts: shifting ranking logic affects what people see, and removing algorithmic curation sharply reduces time spent on the platform. In other words, the feedback lever is real, even if short-run belief shifts are small in some settings. For education policy, the lesson is not that algorithms don’t matter; it’s that interventions must change exposure and preserve perceived utility. Simply toggling to a chronological or “neutral” feed reduces engagement without guaranteeing learning gains. Causal, minimally intrusive interventions that broaden exposure while holding usefulness constant are the right target.

Designing for Counter-Feedback

What breaks a self-reinforcing loop is not lecturing users out of bias, but feeding the model exogenous variation that decouples “what I like” from “what I need to learn.” In econometrics, that is the job of an instrumental variable (IV): a factor that moves the input (exposure) without being driven by the latent confounder (prior belief), letting us estimate the causal effect of more diverse content on downstream outcomes (accuracy, calibration, assignment quality) rather than on clicks alone. Recent recommender-systems research is already moving in this direction, proposing IV-style algorithms and representation learning that utilize exogenous nudges or naturally occurring shocks to correct confounded feedback logs. These methods are not hand-wavy: they formalize the causal graph and use two-stage estimation (or deep IV variants) to reweight training and ranking toward counterfactual relevance—not just observed clicks. In plain terms, IV turns “what users rewarded” into “what would have been rewarded if they had seen a broader slate.”
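In the simplest linear case, the logic can be written compactly, with $Z$ the exogenous instrument, $X$ the exposure it shifts, $U$ the latent prior belief, and $Y$ the downstream learning outcome:

$$X = \pi Z + \gamma U + v \quad \text{(exposure)}, \qquad Y = \beta X + \delta U + \varepsilon \quad \text{(outcome)}$$
$$\hat{\beta}_{IV} = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, X)}, \qquad \text{valid when } \operatorname{Cov}(Z, X) \neq 0 \ \text{(relevance)} \ \text{and } \operatorname{Cov}(Z, U) = \operatorname{Cov}(Z, \varepsilon) = 0 \ \text{(exclusion)}.$$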

How would this look inside a search box used by students? The instrument should be subtle, practical, and independent of a learner’s prior stance. One option is interface-level randomized broadening prompts: for a small share of sessions, the system silently runs a matched “broad” query alongside the user’s term and interleaves a few high-quality, stance-diverse results into the top slate. Another is synonym/antonym flips seeded by corpus statistics rather than user history. Session-time or query-structure randomness (e.g., alternating topic-taxonomy branches) can also serve, provided it is orthogonal to individual priors. The ranking system then employs a two-stage estimation approach: Stage 1 predicts exposure using the instrument, and Stage 2 estimates the causal value of candidate results on learning outcomes (proxied by calibration tasks, fact-check agreement, or assignment rubric performance collected through opt-in), not just CTR. (Method note: instruments must pass standard relevance/exclusion tests; weak-IV diagnostics and sensitivity analyses should be routine.) Early IV-based recommender studies suggest such designs can reduce exposure bias on real-world datasets without harming satisfaction—exactly the trade-off education platforms need. This approach offers a promising path towards a more balanced and diverse learning experience.
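A minimal two-stage least squares sketch of that design, on simulated data (all variable names, coefficients, and the statsmodels dependency are assumptions of the sketch), shows how the randomized instrument recovers the exposure effect that a naive click-style regression gets wrong.

```python
import numpy as np
import statsmodels.api as sm

# z = randomized broadening instrument (1 if the session got interleaved broad results)
# x = realized exposure diversity of the slate the student actually saw
# u = unobserved prior belief, which confounds naive exposure-outcome estimates
# y = downstream learning proxy (e.g., calibration-task score), not CTR
rng = np.random.default_rng(3)
n = 5_000
u = rng.normal(size=n)                          # latent prior / confounder
z = rng.integers(0, 2, size=n).astype(float)    # instrument: randomized by the platform
x = 0.8 * z - 0.6 * u + rng.normal(size=n)      # exposure depends on both
y = 0.5 * x + 1.0 * u + rng.normal(size=n)      # true causal effect of exposure = 0.5

naive = sm.OLS(y, sm.add_constant(x)).fit().params[1]   # biased by u

# Stage 1: predict exposure from the instrument; Stage 2: regress outcome on fitted exposure.
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
two_sls = sm.OLS(y, sm.add_constant(x_hat)).fit().params[1]

print(f"naive OLS estimate: {naive:.2f}  (biased)")
print(f"2SLS estimate:      {two_sls:.2f} (close to the true 0.5)")
```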

What Educators and Policymakers Should Do Now

Universities and school systems do not have to wait for a grand rewrite of search. Three near-term moves are feasible. First, teach query-craft explicitly: pair every research task with a “triangulation rule”—one direct term, one contrary term, one neutral term—graded for breadth. This is a skills intervention that aligns with how biases actually arise. Second, procure search and recommendation tools (for libraries, LMSs, and archives) that document an identification strategy. Vendors should demonstrate how they distinguish between actual relevance and belief-driven clicks, and whether they employ IV-style methods or randomization to learn. Third, adopt prebunking and lightweight transparency: brief, pre-exposure videos about common manipulation tactics have shown measurable improvements in users’ ability to recognize misleading content. Paired with a “search broadly” toggle, they increase resilience without being paternalistic. The point is not to police content, but to change the geometry of exposure so that learning signals reflect truth-finding, not comfort-finding.

Objections deserve straight answers. “Instruments are hard to find” is true; it’s also why IV should be part of a portfolio, not a silver bullet. Interface randomization and taxonomy alternation are plausible instruments because they are under platform control and independent of any one student’s prior belief; weak-instrument risk can be mitigated by rotating multiple instruments and reporting diagnostics. “Isn’t this paternalistic?” Only if the system hides the choice. In the PNAS experiments, broader result sets were rated just as valuable and relevant as standard searches; that suggests we can add breadth without degrading user value. “Won’t this hurt engagement?” Some ranking changes do; however, field studies indicate that the main effect of de-optimizing for engagement is, unsurprisingly, lower time spent—not necessarily worse knowledge outcomes. If our objective is education, not stickiness, we should optimize for calibrated understanding and assignment performance, with engagement a constraint, not the goal.

The loop we opened with—the one that starts from “14 billion a day”—is not inevitable. The same behavioral evidence that documents narrow querying also shows how modest, causal tweaks can broaden exposure without alienating users. In practical terms, this means that the individuals responsible for setting education policy and purchasing education technology must revise their procurement language, syllabi, and platform metrics. Require vendors to disclose how they identify causal relevance separate from belief-shaped clicks. Fund campus pilots that randomize subtle broadening instruments inside library search and measure rubric-based learning gains, not just CTR. Teach students to triangulate queries as a graded habit, not as an afterthought. Search has become the default teacher of last resort; our responsibility is to ensure its reward function serves the purpose of learning. The fix is not louder content moderation or a nostalgia play for “neutral” feeds. It is a precise, testable redesign: instrument the loop, estimate the effect, and rank for understanding.


The views expressed in this article are those of the author(s) and do not necessarily reflect the official position of the Swiss Institute of Artificial Intelligence (SIAI) or its affiliates.


References

AP News. (2023). Google to expand misinformation “prebunking” in Europe.

Guess, A., et al. (2023). How do social media feed algorithms affect attitudes and behaviors? Science.

Leung, E., & Urminsky, O. (2025). The narrow search effect and how broadening search promotes belief updating. Proceedings of the National Academy of Sciences.

Search Engine Land. (2025). Google now sees more than 5 trillion searches per year (≈14 billion/day).

Sharma, N., Liao, Q. V., & Xiao, Z. (2024). Generative Echo Chamber? Effects of LLM-Powered Search Systems on Diverse Information Seeking. In CHI ’24.

Statcounter. (2025). Search engine market share worldwide (July 2024–July 2025).

Wu, A., et al. (2025). Instrumental Variables in Causal Inference and Machine Learning. Communications of the ACM.

Zhang, Y., Huang, Z., & Li, X. (2024–2025). Interaction- or Data-driven Conditional Instrumental Variables for Recommender Systems (IDCIV/IV-RS).


Not Your Therapist: Why AI Companions Need Statistical Guardrails Before They Enter the Classroom


By Ethan McGowan

Student well-being is falling fast
AI chatbots are spreading quickly
Without safeguards, risks will escalate


In the 2023–24 academic year, only 38% of U.S. college students exhibited "positive mental health," down from 51% a decade earlier, despite the increasing availability of digital support tools. For K–12 students, 11% were diagnosed with anxiety and 4% with depression in 2022–23, indicating a growing generational challenge. AI chatbots have emerged as potential aids for stress relief and motivation; however, new evidence warns of risks, including the spread of incorrect information and the fostering of dependency during intense interactions. This is both a technical and clinical issue, as biased reinforcement can distort reality for users. If AI companions are to be implemented on campuses, it is essential to treat this as a statistical design problem that requires proper guardrails.

From clinical caution to feedback risk: reframing the debate

The prevailing concern about "AI therapy" has been framed as a question of empathy and accuracy: Can a chatbot accurately detect a crisis? Will it hallucinate unsafe advice? Those concerns are real, but they miss the engine underneath. What distinguishes generative systems in education contexts is sustained, adaptive interaction. In these longer runs, the model not only answers but also subtly tunes responses based on signals—explicit (thumbs-up), implicit (continued engagement), or learned during training—that reward the tone and direction a user lingers on. Over time, this can bias the conversation toward reinforcing the user's most salient mental model, skewing their picture of reality. For vulnerable students, this is not a neutral drift. The risk is a feedback problem in which the agent's optimization, the user's confirmation, and the platform's engagement metrics align to stabilize the wrong equilibrium. The policy lens should therefore shift from "Can chatbots do empathy?" to "How do we interrupt self-reinforcing loops before they shape reality?"

The mechanics: RLHF, endogeneity, and biased convergence

Modern assistants are trained with Reinforcement Learning from Human Feedback (RLHF): a reward model learns what humans prefer, and the chatbot is then optimized to maximize that reward. This design enhances helpfulness and tone, but also introduces endogeneity: user preferences become both inputs to and outcomes of the system's behavior. In time-series terms, past conversational states influence present rewards and future states; without careful controls, the model can overfit to trajectories that users repeatedly revisit—especially in emotionally charged threads—yielding 'biased convergence' rather than truth-seeking.
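For readers who want the mechanics spelled out, here is a minimal, illustrative reward-model update in the Bradley–Terry style commonly described for RLHF; the feature vectors, learning rate, and linear scoring function are stand-ins, not any vendor's actual training stack.

```python
import numpy as np

# A reward model fit on pairwise preferences: humans label which of two candidate
# replies they prefer, and the model learns to assign the preferred one a higher score.
rng = np.random.default_rng(0)
d, n_pairs, lr = 8, 2_000, 0.1
true_w = rng.normal(size=d)          # stand-in 'human preference' direction
w = np.zeros(d)                      # reward-model parameters

for _ in range(n_pairs):
    a, b = rng.normal(size=d), rng.normal(size=d)          # features of two candidate replies
    preferred_a = rng.random() < 1 / (1 + np.exp(-(a - b) @ true_w))
    x = (a - b) if preferred_a else (b - a)                # preferred minus rejected
    p = 1 / (1 + np.exp(-(w @ x)))                         # P(model agrees with the label)
    w += lr * (1 - p) * x                                  # gradient step on the log-likelihood

# The policy is then optimized to maximize w @ features -- including whatever
# user-approval patterns the labels happened to encode.
print(f"alignment with the labelers' preference direction: {np.corrcoef(w, true_w)[0, 1]:.2f}")
```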

The concept of 'endogeneity' has long occupied econometricians, whose central challenge is to handle correlation between explanatory variables and the error term that drives the outcome. In time series in particular, the earlier state ($t-1$) is often the best predictor of the current state ($t$), and that correlation can deliver strong but spurious explanatory power. Researchers therefore rely on further lagged variables to strip the endogenous component out of the earlier state variable ($t-1$): because each lag is itself a good predictor of the next, they use the $t-2$ variable to purge the dependence in $t-1$ and use the leftover component to explain $t$. The practice is not mathematically complete, but it removes much of the problematic correlation among the $t-k$ (for $k>0$) variables. Once that correlation is removed, explanatory power often looks weaker, but the estimates become far more robust. The method is instrumental variable regression (IVR), and it is widely used by econometricians working with less-controlled social-science data affected by omitted variables, simultaneity, and measurement error.
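Written out in the paragraph's own notation (with $s_t$ standing in for the state at time $t$), the lagged-instrument scheme amounts to a two-stage regression; the relevance and exclusion conditions at the end are the standard IV requirements.

$$\text{Stage 1: } s_{t-1} = \pi\, s_{t-2} + v_{t-1} \;\;\Rightarrow\;\; \hat{s}_{t-1} = \hat{\pi}\, s_{t-2}$$
$$\text{Stage 2: } s_t = \beta\, \hat{s}_{t-1} + \varepsilon_t, \qquad \text{requiring } \operatorname{Cov}(s_{t-2}, s_{t-1}) \neq 0 \ \text{and} \ \operatorname{Cov}(s_{t-2}, \varepsilon_t) = 0.$$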

In plain English, a first-stage correction of this kind strips out the self-reinforcing component before it can propagate into subsequent stages. Because the learning process of RLHF can be amplified by positive human feedback feeding back into training, the situation closely parallels these time-series endogeneity cases.

In deep Q-learning (DQN), experience replay buffers store earlier interactions and sample from them at random, decorrelating the data and stabilizing the learning process. Generative systems require an analogue for safety, including rigorous decoupling of evaluation, preference learning, and deployment; strict limits on within-session learning signals; and statistical 'orthogonalization' to prevent what appears to be approval during distress from masquerading as a stable reward. Orthogonalization is a statistical technique that ensures the learning signals used by the AI are independent of one another, reducing the risk of the AI misinterpreting distress as a positive signal. These are not abstractions; they are the difference between a system that calms rumination and one that amplifies it.
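For concreteness, a replay buffer of this kind is only a few lines; the sketch below is generic and illustrative rather than any cited implementation.

```python
import random
from collections import deque

# Transitions arrive highly correlated in time; training batches are sampled
# uniformly at random, which breaks the temporal correlation that destabilizes learning.
class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling decorrelates consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

# Usage: interactions stream in sequentially, but gradient updates see a shuffled mix.
buf = ReplayBuffer()
for t in range(1_000):
    buf.push(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)
print(len(buf.sample(32)))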

What the numbers actually say

Utilization of mental-health chatbots among U.S. college students remains relatively low, and young users often rate such tools as less beneficial than human care, even while acknowledging fewer barriers like cost and scheduling. Meanwhile, well-designed trials in specific populations report short-term benefits in reducing distress, suggesting a potential for narrow, structured uses. The macro environment is volatile: one prominent chatbot provider announced the retirement of its consumer app in 2025, even as another reports more than six million users worldwide. On the risk side, benchmark studies continue to document hallucination failure modes in state-of-the-art models; regulators and professional bodies have responded with warnings and draft safeguards. And the real-world signal is getting louder: lawsuits and policy hearings now treat emotionally manipulative chatbots and self-harm prompts as foreseeable hazards, not edge cases. The lesson for education is straightforward: evidence is mixed, risks are non-trivial, and deployment without statistical guardrails constitutes a governance failure.


Figure 1: Student mental health has eroded steadily over the past decade, with fewer than 4 in 10 reporting positive well-being in 2024—just as digital mental-health tools proliferated.

From replay buffers to "do-not-learn": translating safety into design

A practical mitigation is to prevent the model from learning, even implicitly, from the most fragile conversations. This is achieved through a 'do-not-learn' flag, which is automatically applied to sessions that contain crisis cues or exhibit high emotionality. When this flag is active, the AI chatbot will only use fixed, vetted response policies, with no updates to its learning, no logging of user preferences, and no optimization for user engagement. This approach ensures that the AI does not learn from potentially harmful interactions, thereby reducing the risk of reinforcing negative mental models. Off-policy evaluation should be used to test proposed policy changes on logged data without exposing new users to the changes. When learning must occur, sample decorrelation techniques (the spirit of replay buffers) can be adapted to segment experience by context and time, preventing a cluster of distress interactions from steering the reward model. Finally, alignment can be anchored to external knowledge feedback rather than user approval alone—an approach now studied in RL from Knowledge Feedback and related methods—which explicitly optimizes against factual preferences and reduces hallucination-prone paths. Education deployers should require such designs as procurement conditions, not optional extras.
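A 'do-not-learn' gate can be expressed as a thin policy wrapper. In the sketch below, the crisis lexicon, the emotion-score threshold, and the learner/fixed-policy interfaces are all hypothetical placeholders for whatever components a deployment actually uses.

```python
from dataclasses import dataclass

CRISIS_CUES = {"self-harm", "suicide", "hopeless"}   # assumed placeholder lexicon

@dataclass
class Turn:
    text: str
    emotion_score: float   # e.g., from a separate, static classifier

def do_not_learn(turn: Turn, threshold: float = 0.8) -> bool:
    """Flag sessions with crisis cues or high emotionality for the fixed-policy path."""
    flagged_words = any(cue in turn.text.lower() for cue in CRISIS_CUES)
    return flagged_words or turn.emotion_score >= threshold

def handle(turn: Turn, learner, fixed_policy):
    # `learner` and `fixed_policy` are assumed interfaces for the deployment's own components.
    if do_not_learn(turn):
        # Fixed, vetted response path: no preference logging, no reward signal,
        # no engagement optimization; escalation resources are surfaced instead.
        return fixed_policy(turn.text)
    reply = learner.respond(turn.text)
    learner.log_for_training(turn.text, reply)   # learning allowed only on unflagged turns
    return reply

print(do_not_learn(Turn("I feel hopeless tonight", emotion_score=0.4)))  # True
```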


Figure 2: Adoption of AI mental-health chatbots remains modest but is rising quickly, suggesting that small risks now may scale dramatically if governance lags behind use.

Measurement that resists endogeneity

Platforms should report metrics that are causally interpretable, not just flattering. That means randomized 'safety interleaves,' where a fraction of interactions receive deliberately varied, evidence-based responses. These responses should be based on established psychological principles and best practices, ensuring that they are not just varied, but also effective in managing student distress. Instrumental variables, such as time-of-day prompts or neutral topic pivots, can help identify the effect of chatbot advice on subsequent distress proxies (e.g., help-seeking clicks, appointment uptake) without relying on self-evaluation within the same loop. Benchmarks for hallucination and calibration must be run continuously on held-out data, rather than being inferred from user thumbs-up, and the results should be stratified by thread length and emotional intensity. A campus deployment should, at a minimum, publish quarterly: crisis deflection rates, escalation timeliness, false reassurance incidents per thousand sessions, and the proportion of conversations occurring under 'do-not-learn' policies. This is not overkill. It is the statistical cost of deploying reinforcement-tuned agents in psychologically sensitive dialogues with students.
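To show the shape of such reporting, here is a minimal simulation of a randomized safety interleave with a stratified difference-in-means estimate; the assignment rate, outcome model, and stratification variable are invented for illustration only.

```python
import numpy as np

# A small fraction of sessions is randomly assigned a vetted, evidence-based
# response variant; the effect on a downstream proxy (help-seeking clicks) is
# estimated by a difference in means, stratified by thread length.
rng = np.random.default_rng(11)
n = 20_000
interleaved = rng.random(n) < 0.05    # random assignment, 5% of sessions
long_thread = rng.random(n) < 0.3     # stratification variable

# Simulated outcome: interleaved sessions click help resources slightly more often.
base = 0.10 + 0.05 * long_thread
help_click = rng.random(n) < (base + 0.03 * interleaved)

def effect(mask):
    treated = help_click[mask & interleaved].mean()
    control = help_click[mask & ~interleaved].mean()
    return treated - control

print(f"short threads: {effect(~long_thread):+.3f}")
print(f"long threads:  {effect(long_thread):+.3f}")
```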

The regulatory context—and why education should aim higher

The EU AI Act requires providers of high-risk AI to mitigate feedback loops where ongoing learning lets biased outputs contaminate future inputs. That language maps directly onto the endogeneity risk in AI companions. Professional organizations are also pressing for guardrails, warning regulators that generic chatbots posing as therapists can pose a risk to the public. However, campuses should not wait for compliance deadlines. Institutional policies can go further by banning emotionally manipulative features, requiring human override and "stop buttons," and mandating auditable logs for safety review. Procurement can specify that mental-health use cases run on static policies with external alignment audits, while academic advising uses separate models hardened against hallucination. In short, treat the AI companion as a safety-critical system in which learning is off by default, autonomy is conservative, and change is measured and auditable.

Anticipating the counterarguments

Proponents will argue that chatbots are often the only scalable option when counseling centers are overwhelmed—and that some studies show meaningful reductions in distress. Both points are valid and still compatible with restraint. Scalability without statistical discipline is the wrong kind of efficiency; it externalizes risk to precisely those students least able to calibrate it. Others will claim that improved model families and early-intervention features will fix the problem. Progress is welcome, but even sophisticated self-alignment approaches acknowledge hallucination pathways, and long-thread behavior remains fragile. Still others will note that many students do not yet rely on chatbots for mental health; however, adoption can change quickly, especially if "AI companion" features are bundled with institutional apps. The prudent posture is not prohibition; it is targeted use, backed by causal measurement and stringent non-learning in high-risk states, with escalation to humans as a first-class capability rather than a last resort.

What should educators and administrators do next?

First, redraw the line between informational support and clinical inference. Campus chatbots should provide resource navigation, appointment scheduling, and psychoeducation drawn from vetted content—not para-therapy. Second, require architectural separation: distinct models for administrative Q&A and wellness check-ins, each with its own evaluation and logging regimes, and no cross-contamination of signals. Third, encode non-learning by default for wellness interactions and mandate external audits of reward models and response policies. Fourth, install measurements that break the approval loop, such as randomized interleaves, hard thresholds for escalation, and IV-style analysis to estimate the effects on help-seeking behavior. Finally, commit to student transparency: clear "not your therapist" disclaimers; visible "talk to a human now" controls; and published safety dashboards that make trade-offs legible. These steps are implementable today. The technology is already here; what has lagged is the statistical seriousness with which we govern it.

Closing the Loop: Putting Safety Before Scale

Ten years ago, the mental health of students was declining without the involvement of AI aids. Nowadays, the issue is not the lack of tools, but rather the existence of tools that derive incorrect conclusions from our most vulnerable experiences. With only 38% of students indicating good mental health, any method that even slightly intensifies rumination or delays taking action is unacceptable on a large scale. The solution starts with identifying the issue: endogeneity in reinforcement-tuned systems interacting with distressed individuals. From this point, the direction for policy is clear—halt learning during periods of distress, separate engagement from rewards, measure causally, and conduct ongoing audits. Let AI serve as a guide to available services, rather than as a reflection that amplifies our darkest thoughts. Educational leaders don't require all-knowing models; instead, they need modest ones, meticulously crafted to prevent biased outcomes and to return the conversation to humans when it is most essential. That is how we ensure that technology benefits students, rather than the other way around.


The views expressed in this article are those of the author(s) and do not necessarily reflect the official position of the Swiss Institute of Artificial Intelligence (SIAI) or its affiliates.


References

American Council on Education. (2025). Key mental health in higher education stats (2023–24). ACE.

American Psychological Association (APA). (2025, March 12). Using generic AI chatbots for mental health support. APA Services.

Bang, Y., et al. (2025, April). HalluLens: LLM hallucination benchmark. Proceedings of ACL.

Centers for Disease Control and Prevention. (2025, June 5). Data and statistics on children's mental health. CDC.

Chaudhry, B. M., et al. (2024). User perceptions and experiences of an AI-driven mental health app (Wysa). Digital Health, 10.

Colasacco, C. J. (2024). A case of artificial intelligence chatbot hallucination. Journal of the Medical Library Association, 112(2).

European Union. (2024). Regulation (EU) 2024/1689: Artificial Intelligence Act. Official Journal of the European Union.

Hugging Face. The Deep Q-Learning algorithm. (Experience replay explainer).

Lambert, N., et al. (2024). Reinforcement Learning from Human Feedback (RLHF). (Open book; foundations and optimization stages).

Li, J., et al. (2025). Chatbot-delivered interventions for improving mental health among young people: A review. Adolescent Research Review.

Liang, Y., et al. (2024). Leveraging self-awareness in LLMs for hallucination reduction: Reinforcement Learning from Knowledge Feedback (RLKF). KnowledgeNLP Workshop.

Rackoff, G. N., et al. (2025). Attitudes and utilization of chatbots for mental health among U.S. college students. JMIR Mental Health, 12.

Stanford HAI. (2025, June 11). Exploring the dangers of AI in mental health care. Stanford Institute for Human-Centered AI.

Wysa. (2025). Wysa: Everyday mental health (company site; "6+ million users").

Wysa Research Team. (2024). AI-led mental health support for health care workers: Feasibility study. JMIR Formative Research, 8, e51858.

Woebot Health. (2025, April 28). Woebot Health is shutting down its app. HLTH Community.
