
This Stanford computer science professor went to written exams 2 years ago because of AI. He says his students insisted on it


Stanford Professor Uses AI to Grade Hand‑Written Tests: A New Frontier in Classroom Assessment

In a bold experiment that could reshape the future of higher‑education assessment, Stanford University’s computer‑science professor Jure Leskovec has unveiled a machine‑learning system that automatically grades handwritten exam papers. The initiative, detailed in a recent Fortune article, highlights how advances in optical‑character‑recognition (OCR) and natural‑language‑processing (NLP) can be harnessed to reduce the workload of instructors while preserving the nuance that has traditionally made grading a uniquely human task.


The Problem: Human‑Powered Grading at Scale

Traditional grading in large courses, which often enroll anywhere from a few hundred to over a thousand students, remains one of the most time-intensive responsibilities for faculty. Instructors typically spend dozens of hours per semester marking handwritten test scripts, a process that can delay feedback, increase the potential for grading errors, and cut into teaching time. The COVID-19 pandemic further amplified the need for scalable grading solutions, as many institutions shifted to hybrid or online modalities that produced an influx of both digital and paper-based assessments.

Leskovec’s research team approached the problem from a data-science angle. They argued that, just as machine-learning models have excelled at parsing structured data, they could learn to interpret the far less structured world of handwriting. The goal was to build a system that could produce consistent, reproducible grades while still allowing for human oversight.


The System: Combining OCR, Deep Learning, and Contextual Analysis

The grading engine is built on a multi‑stage pipeline. First, the handwritten exam images are processed by an OCR engine that converts ink strokes into machine‑readable text. Modern OCR tools—such as Google Cloud Vision and Amazon Textract—have made great strides in reading cursive and partially‑deformed characters, but their accuracy can still falter on messy handwriting.
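To make the first stage concrete, here is a minimal sketch of how one scanned exam page might be converted to text with Google Cloud Vision's document-OCR endpoint. This is an illustrative example rather than the team's actual pipeline; the file path and credential setup are assumptions.

```python
# Minimal OCR sketch (stage one), assuming Google Cloud credentials are
# configured via GOOGLE_APPLICATION_CREDENTIALS. The file path is hypothetical.
from google.cloud import vision


def ocr_exam_page(image_path: str) -> str:
    """Convert one scanned, handwritten exam page into machine-readable text."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    # document_text_detection is tuned for dense, document-style text and
    # generally handles handwriting better than plain text_detection.
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


if __name__ == "__main__":
    print(ocr_exam_page("scans/student_0042_page_1.png"))
```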

The second stage involves a fine‑tuned language‑model that evaluates the content of the answer. In the case of short‑answer or essay questions, a GPT‑4‑like model was trained on a dataset of previously graded student responses to learn typical answer structures, common misconceptions, and the distribution of correct vs. incorrect statements. The model uses contextual clues to assign partial credit, recognizing when a student’s answer contains a partially correct idea that deserves acknowledgment.
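The article does not disclose the exact prompting or fine-tuning setup, but a rubric-conditioned scoring call could look roughly like the sketch below, written against the OpenAI Python client. The model name, rubric text, and JSON reply format are all assumptions for illustration.

```python
# Hedged sketch of stage two: asking a language model to score an OCR'd answer
# against a rubric with partial credit. Model name and rubric are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Award 0-10 points. Full credit requires naming the estimator,
its bias-variance trade-off, and one failure mode. Give partial credit when
an answer contains a correct idea expressed imprecisely."""  # placeholder text


def score_answer(question: str, student_answer: str) -> dict:
    """Return {'score': float, 'rationale': str} for one short-answer response."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the article says only "a GPT-4-like model"
        messages=[
            {"role": "system",
             "content": "You are a grading assistant. Reply with JSON: "
                        '{"score": <0-10>, "rationale": "<one sentence>"}'},
            {"role": "user",
             "content": f"Rubric:\n{RUBRIC}\n\nQuestion: {question}\n\n"
                        f"Student answer: {student_answer}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,  # keep grading as reproducible as the API allows
    )
    return json.loads(response.choices[0].message.content)
```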

Finally, an oversight module flags responses that are ambiguous or exhibit unusual patterns—such as extreme length, repetitive phrasing, or potential plagiarism. These flagged papers are routed back to human graders, who can review and adjust the automated assessment.
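The flagging criteria are not spelled out beyond extreme length, repetitive phrasing, and potential plagiarism, but simple cohort statistics along the following lines would suffice to route outliers to a human; the thresholds here are assumptions.

```python
# Sketch of the oversight module: flag answers whose length or repetition is
# unusual for the cohort. The 3-sigma and 0.6 thresholds are illustrative.
from statistics import mean, stdev


def repetition_ratio(text: str) -> float:
    """Fraction of repeated words; values near 1.0 suggest padded or looping text."""
    words = text.lower().split()
    return 1 - len(set(words)) / len(words) if words else 0.0


def flag_for_review(answers: list[str]) -> list[bool]:
    """Mark answers a human grader should re-check before grades are released."""
    lengths = [len(a.split()) for a in answers]
    mu = mean(lengths)
    sigma = stdev(lengths) if len(lengths) > 1 else 0.0
    flags = []
    for answer, n_words in zip(answers, lengths):
        extreme_length = sigma > 0 and abs(n_words - mu) > 3 * sigma
        repetitive = repetition_ratio(answer) > 0.6
        flags.append(extreme_length or repetitive)
    return flags
```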


Early Results: Accuracy, Speed, and Human‑Machine Collaboration

In a pilot run during the spring semester, the system evaluated 3,200 handwritten essays for a flagship data‑analysis course. According to Leskovec, the model achieved an inter‑rater reliability score of 0.87 relative to human grading, comparable to the consistency levels seen in manual grading by experienced faculty. Moreover, the AI was able to complete the grading in under an hour, a task that would have taken a senior instructor nearly a full day.
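The article does not say which agreement statistic produced the 0.87 figure. One common choice for ordinal scores is quadratically weighted Cohen's kappa, sketched below with scikit-learn on hypothetical human/model score pairs.

```python
# Hedged sketch: measuring human-vs-AI grading agreement. The statistic and
# the score data are assumptions; the summary reports only the value 0.87.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-10 scores assigned to the same ten essays.
human_scores = [8, 6, 9, 4, 7, 10, 5, 6, 8, 3]
model_scores = [8, 5, 9, 4, 6, 10, 5, 7, 8, 4]

kappa = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # values near 1.0 indicate strong agreement
```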

Students and faculty reacted positively to the rapid turnaround of grades. In the pilot course, grades were posted 24 hours faster than the traditional average. Students reported feeling more confident in their understanding of the material, citing the prompt feedback loop as a key motivator.

However, Leskovec also acknowledged the limitations. “We’re not claiming that AI will replace humans entirely,” he said. “Rather, we’re aiming for a symbiotic relationship—AI handles the bulk, humans step in for the edge cases.” Indeed, faculty members reported that the system’s flagging mechanism helped them focus on the most nuanced or borderline cases, thereby improving overall grading quality.


Ethical Considerations and Data Privacy

The use of AI for grading raises a host of ethical questions. Privacy concerns loom large, as student handwriting data must be handled securely and anonymized where possible. Leskovec’s team complied with the university’s Institutional Review Board guidelines, ensuring that student data were stored on encrypted servers and used strictly for research and grading purposes.

There is also the risk of bias. If the training dataset is skewed toward certain writing styles or cultural contexts, the AI might systematically under-grade particular groups of students. To mitigate this, the team employed a diverse set of training examples and introduced adversarial testing, feeding the model a wide variety of handwriting styles to gauge performance across demographics.
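The piece gives no detail on those adversarial tests, but a minimal fairness audit could compare the model's average deviation from human grades across handwriting-style groups, as in this sketch; the group labels and data are assumptions.

```python
# Sketch of a simple bias audit: per-group mean deviation of model scores from
# human scores. Group names and records are illustrative assumptions.
from collections import defaultdict


def mean_error_by_group(records: list[dict]) -> dict[str, float]:
    """records: one {'group': str, 'human': float, 'model': float} per exam."""
    errors = defaultdict(list)
    for r in records:
        errors[r["group"]].append(r["model"] - r["human"])
    # A consistently negative mean for one group would signal under-grading.
    return {group: sum(e) / len(e) for group, e in errors.items()}


sample = [
    {"group": "cursive", "human": 8, "model": 7},
    {"group": "cursive", "human": 6, "model": 5},
    {"group": "print",   "human": 7, "model": 7},
    {"group": "print",   "human": 9, "model": 9},
]
print(mean_error_by_group(sample))  # {'cursive': -1.0, 'print': 0.0}
```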


Broader Implications: Toward a New Educational Paradigm

Leskovec’s work sits at the intersection of AI research and educational practice. As institutions grapple with increasing enrollment, a robust AI grading system could free faculty to focus more on teaching, mentorship, and curriculum design. The approach also aligns with a growing trend in the industry: the use of AI tools to streamline administrative tasks, as seen in the adoption of AI‑driven content creation, plagiarism detection, and even automated teaching assistants.

The article notes that the system is currently deployed in a small number of courses but could be expanded to include math-heavy exams, coding assignments, and oral examinations, provided the model is retrained to handle the specifics of each domain.


Looking Ahead: Challenges and Opportunities

While the pilot’s results are promising, scaling the system to larger, multi‑institution contexts will require further research. Key challenges include ensuring robustness across varying paper qualities, ink types, and scanning resolutions, as well as adapting to evolving curricula and assessment styles. Moreover, the team must navigate the complex regulatory landscape around AI in education, including data protection laws such as FERPA in the U.S. and GDPR in Europe.

In an interview, Leskovec emphasized the importance of open collaboration: “We’re inviting other universities to share data and jointly develop more generalized models,” he said. By fostering an ecosystem of shared datasets and best practices, the research community could accelerate the deployment of AI grading while maintaining high standards of fairness and transparency.


Conclusion

Stanford professor Jure Leskovec’s handwritten‑test‑grading prototype demonstrates the potential for AI to alleviate a perennial pain point in academia. While it’s not a wholesale replacement for human judgment, the system offers a compelling proof of concept that machine learning can handle the bulk of routine grading tasks with a high degree of reliability. If scaled thoughtfully, such tools could transform the educational landscape, allowing faculty to devote more time to the human aspects of teaching that no algorithm can replicate.

As higher education continues to evolve, the intersection of AI and assessment will likely become a focal point for policymakers, educators, and technologists alike, promising a future where human judgment and machine insight combine to deliver fairer, faster, and richer learning experiences.


Read the Full Fortune Article at:
[ https://fortune.com/2025/09/07/stanford-professor-jure-leskovec-hand-written-graded-tests-ai/ ]