
Domain-Specific AI Is the Safer Bet for Classrooms and Markets

By Ethan McGowan

Ethan McGowan is a Professor of Financial Technology and Legal Analytics at the Gordon School of Business, SIAI. Originally from the United Kingdom, he works at the frontier of AI applications in financial regulation and institutional strategy, advising on governance and legal frameworks for next-generation investment vehicles. McGowan plays a key role in SIAI’s expansion into global finance hubs, including oversight of the institute’s initiatives in the Middle East and its emerging hedge fund operations.
General AI predicts probabilities, not context-specific safety
Domain-specific AI fits the task and lowers risk in classrooms and markets
Use ISO 42001, NIST RMF, and the EU AI Act, and test on domain benchmarks

Reported AI incidents reached 233 in 2024, a 56% increase over the previous year. The sharp rise reflects real-world failures as generative systems work their way into everyday tools, and the public has noticed. By September 2025, half of U.S. adults said they were more concerned than excited about AI in their daily lives, and only 10% were more excited than concerned. That unease points to a structural problem: general-purpose models are built to predict the next token, not to meet safety standards that differ by context. They operate on probability, not purpose. As usage shifts from chat interfaces to classrooms and financial settings, probability is not an adequate proxy for safety. The answer is domain-specific AI: systems designed for a narrow task, with safety measures sized to that task's stakes.

The case for domain-specific AI

The main issue with the push for greater generality is that it misaligns model learning with safety requirements. Models learn from training data trends, while safety is situational. What is acceptable in an open forum may be harmful in a school feedback tool or a trading assistant. As regulators outline risk tiers, this misalignment becomes both a compliance and ethical concern. The European Union’s AI Act, which took effect on August 1, 2024, categorizes AI used for educational decisions as high-risk, imposing stricter obligations and documentation requirements. General-purpose models also fall under a specific category with unique responsibilities. In short, the law reflects reality: risk depends on usage, not hype. Domain-specific AI acknowledges this reality. It narrows inputs and outputs, aligns evaluations with workflow, and establishes error budgets that correspond to the potential harm involved.

General benchmarks support the same point. When researchers hardened the familiar MMLU exam into MMLU-Pro, top models lost 16 to 33 percentage points of accuracy. Performance that looks strong on today’s leaderboards tends to falter when the test distribution shifts toward realistic scenarios, which is a warning against unscoped deployment. Meanwhile, government labs now publish pre-deployment safety evaluations because jailbreaks still succeed. The UK’s AI Safety Institute has outlined methods for measuring attack success and correctness; its May 2024 update addresses models’ dangerous capabilities bluntly and stresses the need for contextual testing. If failure modes vary by context, safety must be context-dependent too. Domain-specific AI makes that possible.

Figure 1: Domain-specific tuning reduces error rates in responses involving under-represented groups, showing safer alignment than general fine-tuning.

Evidence from incidents, benchmarks, and data drift

The incident curve is steep. The Stanford AI Index 2025 counted 233 incidents in 2024, the highest number on record, and the figure is a system-wide tally rather than an artifact of media attention. At the same time, the proportion of “restricted” tokens in common web data rose from about 5-7% in 2019 to 20-33% by 2023, changing both what models are trained on and how they behave. As the available training data shifts, the chances of a general model misfiring on edge cases grow. Safety cannot be a one-time exercise; it must be a continuous practice tied to specific domains, with ongoing tests that mirror real tasks.

Safety evaluations confirm this trend. Newer assessment frameworks such as HELM-Safety and AIR-Bench 2024 show that measured safety depends heavily on which harms are tested and how prompts are structured. The UK AISI’s methodology likewise measures attack success rates and underscores that risk depends on deployment context, not on model capability alone. The conclusion is clear: a general score does not guarantee safety for a specific classroom workflow, exam proctor, or FX-desk summary bot. Domain-specific AI lets us select relevant benchmarks, limit the input space, and set refusal criteria that reflect the stakes involved.
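To make the idea of domain-scoped refusal criteria concrete, here is a minimal sketch of how a team might score attack success and false-refusal rates on prompts drawn from its own workflow. Everything in it is an assumption for illustration: the marker lists, the Case record, and the generate callable stand in for whatever model interface and harm taxonomy a real deployment defines.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str      # drawn from the real classroom or desk workflow
    is_attack: bool  # True if the prompt tries to elicit disallowed output

# Hypothetical, task-specific markers; a real deployment would derive these
# from its own refusal policy and harm taxonomy, not hard-coded strings.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "outside the scope")
DISALLOWED_MARKERS = ("final grade:", "buy now", "sell now")

def evaluate(generate: Callable[[str], str], cases: list[Case]) -> dict:
    """Attack-success and false-refusal rates on a domain-scoped prompt set."""
    def refused(text: str) -> bool:
        return any(m in text.lower() for m in REFUSAL_MARKERS)

    def leaked(text: str) -> bool:
        return any(m in text.lower() for m in DISALLOWED_MARKERS)

    attacks = [c for c in cases if c.is_attack]
    benign = [c for c in cases if not c.is_attack]
    attack_success = sum(leaked(generate(c.prompt)) for c in attacks) / max(len(attacks), 1)
    false_refusal = sum(refused(generate(c.prompt)) for c in benign) / max(len(benign), 1)
    return {"attack_success_rate": attack_success, "false_refusal_rate": false_refusal}
```

The point of the sketch is that both numbers are computed against the task's own prompt set, so a low attack-success rate on a generic benchmark cannot be mistaken for safety in the deployed workflow.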

Data presents another limitation. As access to high-quality text decreases, developers increasingly rely on synthetic data and smaller, curated datasets. This raises the risk of overfitting and makes distribution drift more likely. Therefore, targeted curation gains importance. A model for history essay feedback should include exemplars of rubrics, standard errors, and grade-level writing—not millions of random web pages. A model for news reading in FX requires calendars, policy documents, and specific press releases, rather than a general assortment of internet text. In both cases, domain-specific AI addresses the data issue by defining what “good coverage” means.
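A simple way to picture this is a curation filter that encodes the coverage definition directly. The sketch below is illustrative only: the field names, grade levels, and document types are hypothetical, chosen to show the pattern for an essay-feedback corpus rather than any specific dataset.

```python
# Keep only documents that match the task's coverage definition
# (rubric-tagged, grade-level, teacher-annotated) instead of sampling
# broadly from web text. Field names are hypothetical.
def in_scope(doc: dict) -> bool:
    return (
        doc.get("type") in {"student_essay", "rubric", "teacher_feedback"}
        and doc.get("grade_level") in {9, 10, 11, 12}
        and doc.get("rubric_id") is not None
    )

def curate(corpus: list[dict]) -> list[dict]:
    kept = [d for d in corpus if in_scope(d)]
    # Coverage check: every rubric criterion should have at least one
    # feedback exemplar in the kept set.
    criteria = {c for d in kept if d["type"] == "rubric" for c in d.get("criteria", [])}
    covered = {c for d in kept if d["type"] == "teacher_feedback" for c in d.get("criteria", [])}
    missing = criteria - covered
    if missing:
        print(f"Warning: no feedback exemplars for criteria: {sorted(missing)}")
    return kept
```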

What domain-specific AI changes in education and finance

In education, the stakes are personal and immediate. AI tools that determine student access or evaluate performance fall into the EU’s high-risk category, which requires rigorous documentation, monitoring, and human oversight. It also calls for choosing the right design for the task. A model that accepts rubric-aligned inputs, generates structured feedback, and cites authoritative sources is easier to audit and to keep within an error budget than a general chat model that can veer off-topic. NIST’s AI Risk Management Framework and its 2024 Generative AI profile provide practical guidance: govern, map, measure, manage—applied to specific use cases. Schools can use these tools to define what “good” means for each need, such as formative feedback, plagiarism checks, or support accommodations, and to decide when AI should not be used at all.
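One way to operationalize this, in the spirit of the RMF’s govern/map/measure/manage loop, is a per-use-case record with an explicit error budget. The sketch below is not prescribed by NIST; the fields, thresholds, and roles are illustrative assumptions showing how a school might encode the decision rule.

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str                    # map: what the system is for
    risk_tier: str               # map: e.g. "high" for access/assessment decisions
    error_budget: float          # measure: max tolerated error on the domain test set
    human_review_required: bool  # govern: oversight rule
    owner: str                   # govern: accountable role

def within_budget(case: UseCase, measured_error: float) -> bool:
    """Manage: block promotion to production if the domain test exceeds the budget."""
    return measured_error <= case.error_budget

feedback_tool = UseCase(
    name="rubric-based essay feedback",
    risk_tier="high",
    error_budget=0.05,           # hypothetical: at most 5% off-rubric feedback
    human_review_required=True,
    owner="head of assessment",
)
assert within_budget(feedback_tool, measured_error=0.03)
```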

In finance, the most effective approaches are narrow and testable. The best results currently come from news-to-signal pipelines rather than general conversational agents. An ECB study reported in June 2025 found that feeding a large model the two pages of commentary in PMI releases materially improved GDP forecasts. Academic and industry research shows similar gains when models extract structured signals from curated news instead of trying to behave like investors. One EMNLP 2024 industry-track study found that fine-tuning LLMs on newsflow produced return signals that beat conventional sentiment scores out of sample. A 2024 EUR/USD study combined LLM-derived text features with market data and cut MAE by 10.7% and RMSE by 9.6% against the best existing baseline. This is domain-specific AI in action: focused inputs, clear targets, and straightforward validation.

The governance layer must align with this technical focus. ISO/IEC 42001:2023, the first AI management system standard, provides organizations with a way to integrate safety into daily operations: defining roles, establishing controls, implementing monitoring, and improving processes. Combining this with the EU AI Act’s risk tiers and NIST’s RMF creates a coherent protocol for schools, ministries, and finance firms. Start small. Measure what truly matters. Prove it. Domain-specific AI is not a retreat from innovation; it is how innovation thrives amid real-world challenges.

Answering the pushback—and governing the shift

Critics may argue that “narrow” models will lag behind general models that keep scaling. But the issue is not raw intelligence; it is fitness for purpose. When researchers harden tests or change prompts, general models struggle: MMLU-Pro’s 16-33 percentage-point decline shows that today’s apparent mastery can collapse under a shifted distribution. General models also remain vulnerable to jailbreaks, which is why safety teams keep publishing defenses against attacks that still work. The UK AI Safety Institute’s methodologies, and the follow-up reports from labs and security firms, point to one need: safety must be evaluated against the specific hazards of each task. Domain-specific AI does this by construction.

Figure 2: User evaluations show that domain-specific dialogue systems achieve higher persuasiveness and competence with no increase in discomfort.

Cost is another concern: building many smaller, focused systems can look duplicative. In practice, effective stacks combine both. Use a general model for drafting or retrieval, then run its outputs through a policy engine that checks rubric compliance, data lineage, and refusal rules. ISO 42001 and the NIST RMF give teams the scaffolding for this, requiring records of what data was used, what tests were run, and how failures are handled. The EU AI Act rewards such design with clearer paths to compliance, particularly for high-risk educational applications and for general models deployed in regulated settings. The lesson from recent evaluations is plain: governance must live where the work happens. That is cheaper than incident response, reputational damage, and post-audit rework.
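The drafting-plus-policy-engine pattern is easy to sketch. The rules and function names below are hypothetical stand-ins; a real deployment would encode its rubric, lineage, and refusal rules from its own governance documents rather than from string checks like these.

```python
from typing import Callable

def policy_check(draft: str, rubric_terms: list[str]) -> tuple[bool, str]:
    """Illustrative checks: rubric coverage plus a refusal rule for grading."""
    text = draft.lower()
    if any(term not in text for term in rubric_terms):
        return False, "draft does not address every rubric criterion"
    if "final grade" in text:
        return False, "grading decisions are reserved for humans"
    return True, "ok"

def guarded_feedback(draft_fn: Callable[[str], str], essay: str,
                     rubric_terms: list[str]) -> str:
    draft = draft_fn(essay)           # general model drafts
    ok, reason = policy_check(draft, rubric_terms)
    if not ok:
        # Withhold or route to a human rather than shipping a non-compliant draft.
        return f"[withheld for review: {reason}]"
    return draft
```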

A final objection is that specialization could limit creativity. Evidence from finance suggests otherwise. The most reliable improvements currently come from models that read specific news to produce precise signals, validated against clear benchmarks. The ECB example showed how a small amount of focused text improved projections, and the EUR/USD study beat its baselines with task-specific features. Industry research indicates that models fine-tuned on newsflow outperform generic sentiment analysis. None of these systems “thinks like a fund manager.” Each does one job well, which makes it easier to see when it falters. In education, the parallel is clear: help teachers deliver rubric-based feedback, surface patterns of misunderstanding, and reserve consequential judgment for humans. That keeps the useful tool and contains the harm.

The evidence is compelling. Incidents rose to 233 in 2024, public trust is fragile, and stricter benchmarks reveal brittle performance. The remedy is not abstract “alignment” with universal values baked into ever-larger models; it is to match capability with context. Domain-specific AI narrows inputs and outputs, uses curated data, and demonstrates effectiveness on relevant tests. It makes governance concrete through ISO 42001 and the NIST RMF, and it fits the EU AI Act’s risk-based framework. Its value is already visible in lesson feedback and news-based macro signals. The call to action is straightforward. Schools and ministries should set domain-specific error budgets, adopt risk frameworks, and choose systems that prove themselves on domain tests before they ever reach students. Financial firms should scope assistants to information extraction and scoring, not decision-making, and hold them to measurable performance standards. We can keep chasing breadth and hoping for safety, or we can build domain-specific AI that meets the standards our classrooms and markets demand.


The views expressed in this article are those of the author(s) and do not necessarily reflect the official position of the Swiss Institute of Artificial Intelligence (SIAI) or its affiliates.


References

Ada Lovelace Institute. (2024). Under the radar? London: Ada Lovelace Institute.
Carriero, A., Clark, T., & Pettenuzzo, D. (2024). Macroeconomic Forecasting with Large Language Models (slides). Boston College.
Ding, H., Zhao, X., Jiang, Z., Abdullah, S. N., & Dewi, D. A. (2024). EUR/USD Exchange Rate Forecasting Based on Information Fusion with LLMs and Deep Learning. arXiv:2408.13214.
European Commission. (2024). EU AI Act enters into force. Brussels.
Guo, T., & Hauptmann, E. (2024). Fine-Tuning Large Language Models for Stock Return Prediction Using Newsflow. In EMNLP Industry Track. ACL Anthology.
ISO. (2023). ISO/IEC 42001:2023—Artificial intelligence management system. Geneva: International Organization for Standardization.
NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). Gaithersburg, MD.
NIST. (2024). AI RMF: Generative AI Profile (NIST AI 600-1). Gaithersburg, MD.
RAND Corporation. (2024). Risk-Based AI Regulation: A Primer on the EU AI Act. Santa Monica, CA.
Reuters. (2025, June 26). ECB economists improve GDP forecasting with ChatGPT.
Stanford CRFM. (2024). HELM-Safety. Stanford University.
Stanford HAI. (2025). AI Index Report 2025—Responsible AI. Stanford University.
UK AI Safety Institute (AISI). (2024). Approach to evaluations & Advanced AI evaluations—May update. London: DSIT & AISI.
Wang, Y., Ma, X., Zhang, G., et al. (2024). MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. NeurIPS Datasets & Benchmarks (Spotlight).
