Artificial Writing and Automated Detection

How effectively do AI detectors identify AI-generated content? A review of a new benchmark study covering multiple metrics, genres, and source models.

AI writing tools are now everywhere: students use them for essays and homework, researchers use them to draft papers, and professionals rely on them for reports and presentations. As AI-generated text becomes increasingly difficult to distinguish from human writing, AI detection tools have become essential for academic integrity, performance assessment, test scoring, and institutional accountability. But with so many detectors on the market, which one actually works best?

Paper reviewed:

Jabarian, Brian and Imas, Alex, Artificial Writing and Automated Detection (August 26, 2025). Available at SSRN: https://ssrn.com/abstract=5407424 or http://dx.doi.org/10.2139/ssrn.5407424

Summary

This research paper benchmarks four AI detectors and finds that Pangram outperforms the others in detecting AI-generated text across passage lengths and source models. The study also draws out the implications for businesses and policymakers.

Introduction

The rapid proliferation of Generative Artificial Intelligence (AI) tools has led to an increasing need for reliable detection mechanisms to distinguish between human-generated and AI-generated text. This is crucial across various domains, from academic integrity to content authenticity. The current paper benchmarks leading AI detection tools and proposes a framework for evaluating their performance.

Background and Context

The advent of Large Language Models (LLMs) has transformed the landscape of written content, with AI-generated text becoming increasingly prevalent. Stakeholders across industries are grappling with the implications of this shift, necessitating effective AI detection solutions. Previous studies have highlighted the challenges in detecting AI-generated text, particularly with the evolution of LLMs and the emergence of "humanizer" tools designed to evade detection.

The current research addresses this need by evaluating three commercial AI detectors—Pangram, OriginalityAI, and GPTZero—and an open-source baseline, RoBERTa. The study uses a corpus of 1,992 human-written passages spanning multiple genres, each matched with AI-generated equivalents from four frontier LLMs. The analysis focuses on four performance metrics: the False Positive Rate (FPR, the share of human text wrongly flagged as AI), the False Negative Rate (FNR, the share of AI text that goes undetected), the Area Under the ROC Curve (AUROC), and ∆-Mean (the gap between a detector's average score on AI text and its average score on human text).
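
For concreteness, here is a minimal sketch of how these metrics can be computed, assuming each detector returns a score in [0, 1] where higher means "more likely AI-generated"; the function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of the paper's evaluation metrics (names are ours).
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(human_scores, ai_scores):
    """Probability that a randomly drawn AI passage outscores a
    randomly drawn human passage (0.5 = chance, 1.0 = perfect)."""
    y_true = np.concatenate([np.zeros(len(human_scores)), np.ones(len(ai_scores))])
    y_score = np.concatenate([human_scores, ai_scores])
    return roc_auc_score(y_true, y_score)

def delta_mean(human_scores, ai_scores):
    """Score separation: mean detector score on AI text minus mean on human text."""
    return float(np.mean(ai_scores) - np.mean(human_scores))

def error_rates(human_scores, ai_scores, threshold):
    """FPR: share of human passages flagged as AI at this threshold.
    FNR: share of AI passages that slip through."""
    human_scores, ai_scores = np.asarray(human_scores), np.asarray(ai_scores)
    fpr = float(np.mean(human_scores >= threshold))
    fnr = float(np.mean(ai_scores < threshold))
    return fpr, fnr
```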

By examining the performance of these detectors across various thresholds, passage lengths, and LLM models, the study provides insights into their strengths and limitations. The introduction of a "policy cap" framework offers a flexible approach to detector evaluation, allowing policymakers to set tolerance levels for FPR or FNR based on their specific objectives.

The remainder of this part details the study's methodology, key results, and initial implications. The next part delves deeper into the practical applications and strategic considerations for businesses and policymakers implementing AI detection solutions.

Main Results

The study evaluates the performance of three commercial AI detectors (Pangram, OriginalityAI, and GPTZero) and one open-source detector (RoBERTa) in distinguishing between human-generated and AI-generated text. The analysis is based on a large corpus of 1,992 passages spanning six genres and four frontier Large Language Models (LLMs).

AUROC and ∆-Mean

The results show that Pangram achieves near-flawless classification across all four source models, with AUROC at or near 1.0000 in most genre-model cells. OriginalityAI's scores are high but below Pangram's, and GPTZero's are lower still. RoBERTa performs at or below chance (0.5) on most categories.

Table 1: Detector Performance by Genre and Model (AUROC)

Model     Genre           GPTZero   Originality   Pangram   RoBERTa
GPT-4.1   amazon review   0.9849    0.9996        0.9998    0.6271
GPT-4.1   blog            0.9975    1.0000        0.9998    0.5701
...       ...             ...       ...           ...       ...

Pangram and GPTZero both achieve strong score separation between AI- and human-generated text, with mean differences (∆-Mean) of approximately 0.805 to 1.0. OriginalityAI achieves smaller separation that depends more on the source LLM.

Threshold Sensitivity

The study assesses each detector's robustness by systematically varying the decision threshold and documenting the corresponding changes in FPR and FNR.
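
A hedged sketch of this exercise, using synthetic scores purely for illustration (the threshold grid mirrors Table 3; nothing here reproduces the paper's data):

```python
# Threshold-sensitivity sweep: record FPR and FNR on a fixed grid of
# cutoffs, as in Table 3. Scores below are synthetic placeholders.
import numpy as np

def sweep(human_scores, ai_scores, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    human_scores, ai_scores = np.asarray(human_scores), np.asarray(ai_scores)
    for t in thresholds:
        fpr = np.mean(human_scores >= t)  # human text wrongly flagged
        fnr = np.mean(ai_scores < t)      # AI text that slips through
        print(f"threshold={t:.1f}  FPR={fpr:.4f}  FNR={fnr:.4f}")

# Synthetic example: a well-separated detector scores human text near 0
# and AI text near 1.
rng = np.random.default_rng(0)
sweep(rng.beta(1, 20, size=1000), rng.beta(20, 1, size=1000))
```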

False Positive Rate

Pangram dominates the other detectors, with an FPR of essentially zero at every threshold of 0.5 and above. OriginalityAI's FPR ranges from 0.003 at the loosest threshold to 0.001 at the tightest.

Table 3: False Positive Rates at Different Detector-Specific Thresholds

Detector         0.1      0.3      0.5      0.7      0.9
GPTZero          0.0071   0.0071   0.0071   0.0071   0.0071
OriginalityAI    0.0027   0.0011   0.0011   0.0011   0.0011
Pangram          0.0010   0.0005   0.0000   0.0000   0.0000
RoBERTa (base)   0.9774   0.9608   0.9503   0.9327   0.8991

False Negative Rate

Pangram's FNR ranges from 0.0045 to 0.038, depending on the threshold and the source LLM. OriginalityAI performs worse than both Pangram and GPTZero, with FNRs as high as 0.300 even at the loosest threshold.

Methodology Insights

The study's methodology is based on a comprehensive corpus of human and AI-generated texts, matched in length and content. The use of four frontier LLMs (GPT-4.1, Claude Opus 4, Claude Sonnet 4, and Gemini 2.0 Flash) ensures that the results are robust across different AI models.

The study's evaluation framework, built on AUROC, ∆-Mean, FPR, and FNR, gives a comprehensive picture of each detector's performance. The use of Youden-optimized thresholds, which maximize the gap between true positive rate and false positive rate, allows an apples-to-apples comparison between detectors whose raw scores are calibrated differently.
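
As a sketch of what Youden optimization involves (assuming score arrays as above; this is our reading of the standard technique, not the paper's code):

```python
# Youden-optimized threshold: the cutoff maximizing J = TPR - FPR,
# so detectors with differently scaled scores can be compared fairly.
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(human_scores, ai_scores):
    y_true = np.concatenate([np.zeros(len(human_scores)), np.ones(len(ai_scores))])
    y_score = np.concatenate([human_scores, ai_scores])
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
```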

Analysis and Interpretation

The results have significant implications for businesses and policymakers seeking to implement AI detection solutions. Pangram's near-flawless performance across various metrics and genres makes it a top choice for applications requiring high accuracy.

The study's findings also highlight the importance of weighing the trade-off between FPR and FNR when selecting a detector. The paper's "policy cap" framework formalizes this choice: a policymaker fixes a tolerance for whichever error is costlier (FPR or FNR) and evaluates detectors subject to that cap.
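
An illustrative reading of that framework in code; the cap value and function name are ours, not the paper's:

```python
# Policy cap, illustrated for an FPR cap: choose the lowest threshold
# whose FPR stays within tolerance, which minimizes the FNR you pay.
import numpy as np

def threshold_under_fpr_cap(human_scores, ai_scores, cap=0.005):
    human_scores, ai_scores = np.asarray(human_scores), np.asarray(ai_scores)
    for t in np.linspace(0.0, 1.0, 1001):
        if np.mean(human_scores >= t) <= cap:        # FPR within the cap
            return t, float(np.mean(ai_scores < t))  # threshold, resulting FNR
    return None, None  # no threshold satisfies the cap
```

A cap on FNR works symmetrically: fix the tolerated miss rate and pick the threshold that minimizes false accusations.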

The results also have strategic implications for companies and managers. For instance, the choice of detector depends on the specific use case and the relative costs of false positives and false negatives. Companies must weigh the costs of implementing a detector against the potential benefits of accurate AI detection.

Practical Business Insights

  1. Detector selection: Businesses should consider the specific requirements of their use case when selecting a detector. For applications requiring high accuracy, Pangram may be the top choice.
  2. Threshold setting: Companies must carefully consider the trade-offs between FPR and FNR when setting thresholds for their chosen detector.
  3. Cost-benefit analysis: Businesses should conduct a cost-benefit analysis to determine the optimal detector and threshold settings for their specific use case; a back-of-the-envelope version is sketched after this list.
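
A hypothetical cost model for that analysis; every number below (error costs, base rate) is an assumed input, not an estimate from the paper:

```python
# Expected per-passage cost of detector errors under assumed costs and
# an assumed share of incoming text that is AI-generated.
def expected_cost(fpr, fnr, cost_fp, cost_fn, ai_share):
    return (1 - ai_share) * fpr * cost_fp + ai_share * fnr * cost_fn

# Example: a false accusation is 10x as costly as a missed AI passage.
print(expected_cost(fpr=0.001, fnr=0.03, cost_fp=10.0, cost_fn=1.0, ai_share=0.2))
# -> 0.014; compare across detectors and thresholds, pick the cheapest.
```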

Strategic Implications

  1. Competitive advantage: Companies that implement effective AI detection solutions can gain a competitive advantage in their respective markets.
  2. Risk management: AI detection can help businesses mitigate the risks associated with AI-generated content, such as misinformation or intellectual property infringement.
  3. Regulatory compliance: Companies must ensure that their AI detection solutions comply with relevant regulations and standards.

Real-World Implementation Considerations

  1. Data quality: The quality of the data used to train and test AI detectors is crucial for ensuring accurate performance.
  2. Detector maintenance: Companies must regularly update and maintain their AI detectors to ensure they remain effective against evolving AI models.
  3. Human oversight: Businesses should implement human oversight and review processes to ensure that AI detection solutions are accurate and fair.

By understanding the strengths and limitations of different AI detectors, businesses and policymakers can make informed decisions about implementing AI detection solutions that meet their specific needs and objectives.

Practical Implications

The findings have significant implications for businesses, policymakers, and organizations seeking to detect AI-generated text. The results show that Pangram is the most effective of the detectors tested, achieving near-perfect accuracy across text genres and lengths. This has direct practical consequences for companies looking to implement AI detection solutions.

Real-World Applications

  1. Content moderation: Companies can use AI detectors like Pangram to identify and flag AI-generated content on their platforms, reducing the risk of misinformation and maintaining the integrity of user-generated content.
  2. Academic integrity: Educational institutions can utilize AI detectors to detect AI-generated text in student submissions, promoting academic honesty and preventing plagiarism.
  3. Customer review authenticity: Businesses can leverage AI detectors to verify the authenticity of customer reviews, ensuring that reviews are genuine and not generated by AI.

Strategic Implications

  1. Detector selection: Companies should carefully evaluate the performance of different AI detectors, considering factors such as accuracy, cost, and robustness to "humanizers."
  2. Threshold setting: Businesses must determine the optimal threshold for their AI detection solution, balancing the need to detect AI-generated text with the risk of false positives.
  3. Ongoing maintenance: Organizations should regularly update and maintain their AI detectors to ensure they remain effective against evolving AI models.

Who Should Care

  1. Businesses: Companies that rely on user-generated content, customer reviews, or other text-based data should be concerned about the authenticity of this content.
  2. Educational institutions: Schools and universities need to ensure academic integrity by detecting AI-generated text in student submissions.
  3. Policymakers: Regulators and policymakers must understand the capabilities and limitations of AI detectors to develop effective policies and guidelines for their use.

Actionable Recommendations

  1. Implement Pangram: Businesses and organizations should consider implementing Pangram as their AI detection solution due to its high accuracy and robustness.
  2. Set policy caps: Companies should establish policy caps to determine the acceptable false positive rate for their AI detection solution, ensuring that the detector is optimized for their specific needs.
  3. Regularly update detectors: Organizations should regularly update and maintain their AI detectors to stay ahead of evolving AI models and "humanizers."
  4. Monitor detector performance: Businesses should continuously monitor the performance of their AI detectors, adjusting thresholds and policies as needed to ensure optimal results.

Conclusion

The study's findings matter to businesses, policymakers, and organizations that need to detect AI-generated text. Policy caps and ongoing maintenance are crucial to keeping an AI detection deployment effective, and as AI models continue to evolve, organizations must stay vigilant and adapt their detection strategies accordingly.

Overall, the study provides valuable insight into the performance of AI detectors and practical guidance for businesses and policymakers. By implementing effective detection solutions and understanding each detector's strengths and limitations, organizations can mitigate the risks associated with AI-generated content and maintain the integrity of their text-based data.