Preparing, Taking, and Grading Exams with AI

A Classroom Experiment

One of the modules I teach in Vocational Training includes learning the Python programming language. As a joke, I once told my students that if they asked in advance, they could take an exam using AI. To my surprise, this week they actually took me up on it.

AI-based Grading

Throughout the year, I’ve often explained how to use AI as a learning assistant—to study, to prepare exercises, and to ask questions. Both my students and I know perfectly well that, at their current level, any of their previous exercises or exams could be solved flawlessly by an AI. So we all understood that this exam needed to be something different.

Preparation

I had several ideas in mind, and of course, I asked an AI for suggestions on how to make the exam both feasible and fair. The first proposal it gave me was laughable—students could have solved it in 30 seconds with any AI tool. So I asked it instead what kinds of things AIs tend to misunderstand. Between that and my own experience, it became clear that the exam needed to meet several criteria:

  1. Include ambiguous requirements that call for human interpretation
  2. Include domain-specific knowledge that goes beyond standard programming
  3. Add constraints that require creative problem-solving
  4. Remove explicit step-by-step instructions in favor of high-level goals

With those ideas, I initially drafted a four-page exam in English (it’s a bilingual program, so that’s justified)… but when I mentioned it at home, I was told it was absolute madness. So I replaced it with a much shorter, single-page version and took that to class instead.

Grading

Since the whole experiment started with AI, I figured I should continue in the same direction. So I turned to one of the AIs I normally use for programming tasks, Qwen (from Alibaba). I uploaded the exam and asked it to create a grading rubric. It took around fifteen minutes of refinement to get a result I was happy with—one that included the kinds of comments I wanted students to see—but eventually the rubric was good enough.

That got me most of the way there. For the final step, I needed something more powerful. I’ve tried several AI-powered code editors, but as a long-time terminal enthusiast, I’ve ended up sticking with Claude Code. Of course, I’m not planning to pay for the [Anthropic subscription](https://www.claude.com/pricing) for volunteer work or hobby projects, so I have it connected to another AI that costs me only $9 per quarter, which works well enough for this purpose.

I placed everything into a directory—my exam, my rubric, and all the student submissions downloaded from Moodle—along with a final instruction file. Then I launched Claude Code and gave it just one prompt:

read instrucciones.md and follow the instructions

I went off to have lunch and completely forgot about it. When I checked back later, it was asking for permission to generate the evaluation reports for the students. I said yes, and about fifteen minutes later it was done. Here’s an example (with the student’s name changed for privacy):

# Exam Evaluation - FirstName LastName1 LastName2

## Exercise 1 - Design the Library (3 pts)

### Grade: 2.7/3.0

**Components:**
- **Task/algorithm/design correctness**: 1.0/1.0
- **Code clarity & structure**: 0.4/0.5
- **Evidence of individual understanding**: 1.3/1.5

### Specific Comments:

#### ✅ Strengths:
- **Excellent data model**: Well-designed classes (`Book`, `LibrarySystem`) with clearly defined attributes
- **Multi-category support**: Handles books that belong to multiple categories via a list
- **Clear definition of “popular”**: Uses a quantitative threshold (POPULARITY_THRESHOLD = 5)
- **Strong preservation policy**: Considers rarity and historical value with specific thresholds
- **Priority system implemented**: Differentiates between priority and standard readers
- **Modular, well-structured code**: Methods are cleanly separated by function

#### ⚠️ Areas for Improvement:
- Comments could better justify design decisions (e.g., why the popularity threshold is 5)
- The `remove_book` method could include a fuller implementation of the priority system

## Exercise 2 - Create a Reservation System (3 pts)

### Grade: 2.8/3.0

**Components:**
- **Task/algorithm/design correctness**: 1.0/1.0
- **Code clarity & structure**: 0.5/0.5
- **Evidence of individual understanding**: 1.3/1.5

### Specific Comments:

#### ✅ Strengths:
- **Well-designed dynamic limit system**: Takes multiple factors into account (demand, history, academic period)
- **Excellent contextual visualization**: `show_contextual_inventory` groups by topic or author effectively
- **Sophisticated conflict resolution**: Introduces a waiting list with a composite priority system
- **Anti-starvation logic**: Uses `accumulated_wait_time` to prevent users from being stuck indefinitely
- **Purpose-based prioritization**: Differentiates between research, study, and leisure
- **Very well-documented code**: Comments clearly explain the design decisions

#### ⚠️ Areas for Improvement:
- Could elaborate further on how demand would be estimated in a real system

## Exercise 3 - Fairness & Ethical Constraints (4 pts)

### Grade: 0/4

**Components:**
- **Task/algorithm/design correctness**: 0/2.0
- **Code clarity & structure**: 0/0.5
- **Evidence of individual understanding**: 0/1.5

### Specific Comments:

#### ❌ Observations:
- **Exercise 3 was not submitted**: No implementation was provided
- Exercise 3 required modifying the solution from Exercise 2 to add ethical and anti-abuse constraints

## Final Grade: 5.5/10

### Summary:
FirstName demonstrates a strong understanding of object-oriented programming and system design. Exercises 1 and 2 are very well implemented, with clean, modular, and well-documented code. However, the missing Exercise 3 significantly lowers the final score.

### Highlighted Strengths:
- Robust, consistent class design  
- Full implementation of the required features  
- Clean, maintainable, well-organized code  
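
Since the report was generated by an AI, it is worth checking its arithmetic before handing it to a student. The following is a minimal sketch of such a check, not part of the actual grading setup: the component scores are copied by hand from the sample report above, and the names `REPORT`, `exercise_totals`, and `final_grade` are hypothetical helpers introduced only for illustration.

```python
# Sanity check for the sample report: do the component scores add up?
# Each entry is (awarded, maximum), copied by hand from the report above.
# The dictionary layout and function names are assumptions for illustration.
REPORT = {
    "Exercise 1": {
        "correctness": (1.0, 1.0),
        "clarity": (0.4, 0.5),
        "understanding": (1.3, 1.5),
    },
    "Exercise 2": {
        "correctness": (1.0, 1.0),
        "clarity": (0.5, 0.5),
        "understanding": (1.3, 1.5),
    },
    "Exercise 3": {
        "correctness": (0.0, 2.0),
        "clarity": (0.0, 0.5),
        "understanding": (0.0, 1.5),
    },
}


def exercise_totals(report):
    """Return {exercise: (awarded, maximum)}, rounded to avoid float noise."""
    totals = {}
    for name, components in report.items():
        awarded = round(sum(a for a, _ in components.values()), 2)
        maximum = round(sum(m for _, m in components.values()), 2)
        totals[name] = (awarded, maximum)
    return totals


def final_grade(report):
    """Return (awarded, maximum) summed over all exercises."""
    totals = exercise_totals(report)
    awarded = round(sum(a for a, _ in totals.values()), 2)
    maximum = round(sum(m for _, m in totals.values()), 2)
    return awarded, maximum


if __name__ == "__main__":
    for name, (awarded, maximum) in exercise_totals(REPORT).items():
        print(f"{name}: {awarded}/{maximum}")
    print("Final grade: {}/{}".format(*final_grade(REPORT)))
```

Run against the sample report, the per-exercise totals come out to 2.7, 2.8, and 0.0, and the final grade to 5.5 out of 10, which matches what the report states. The component maxima for Exercise 3 add up to 4 points, consistent with its grade line.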

Seeing what AI can already do—both as a tutor and now as an exam grader—I’m not entirely sure what the future holds for teachers. Those of us in Vocational Training may survive thanks to hands-on practical work, but for the rest… it’s easy to imagine a near future where one teacher manages 100 students supported by 100 AI agents. Time will tell.

