128 Code Integrity System

AI plagiarism detection and student submission pattern analysis for large-enrollment programming courses

Status: In Production
Duration: Fall 2025 - Present

Overview

Detecting academic dishonesty in large-enrollment programming courses has traditionally relied on comparing student submissions against one another for structural similarity. I built the 128 Code Integrity System to extend this capability with edit volatility analysis, which examines how each student’s code develops across their full submission history. The system integrates directly with PrairieLearn, retrieves every submission attempt, and provides course staff with the evidence necessary to investigate violations that student-to-student comparison alone would miss.

The Problem

Established tools for detecting plagiarism in programming courses, including MOSS and Harvard’s Compare50, were effective at scale, capable of comparing thousands of submissions while factoring out starter code and common patterns. Student-to-student similarity detection in large-enrollment courses was a solved problem.

These tools assume one student is copying from another; AI code generation, however, introduced a different problem. The student is not copying a classmate but plagiarizing the AI. Compared against the submission pool, their code looks original rather than similar, and overlap-based checkers cannot detect it. We needed to maintain student-to-student comparison while adding the ability to detect code that students did not write.

Detection

Structural Similarity

The structural similarity component reimplements the winnowing fingerprinting approach established by prior plagiarism detection systems, keeping the ability to check student code against other students' code within our own infrastructure.

The system tokenizes source code, computes cryptographic hashes of k-gram sequences (contiguous token sequences of configurable length), and selects representative fingerprints using the winnowing algorithm. Comparing fingerprint overlaps between submission pairs identifies structurally equivalent solutions despite surface-level differences such as variable renaming, comment removal, or statement reordering. Optional identifier normalization replaces all variable and function names with generic placeholders, further reducing the effectiveness of superficial disguise.
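
The pipeline above can be sketched as follows. This is a minimal illustration of k-gram hashing plus winnowing, not the production implementation; the crude whitespace tokenizer, the hash truncation, and the tie-breaking rule are assumptions for the sketch.

```python
import hashlib
import re

def tokenize(source):
    """Crude tokenizer for illustration: split on non-alphanumeric runs."""
    return [t for t in re.split(r"\W+", source) if t]

def kgram_hashes(tokens, k):
    """Hash each contiguous k-gram of tokens with a cryptographic hash."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.sha256(gram.encode()).hexdigest()
        hashes.append(int(digest[:8], 16))  # truncate for compact fingerprints
    return hashes

def winnow(hashes, w):
    """Select the minimum hash in each window of w consecutive hashes,
    breaking ties toward the rightmost position (an assumed convention)."""
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        m = min(window)
        pos = i + max(j for j, h in enumerate(window) if h == m)
        fingerprints.add((m, pos))
    return fingerprints
```

Because winnowing keeps at least one fingerprint per window, a shared region longer than the window is guaranteed to contribute at least one matching fingerprint to both submissions.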

The system also extracts control flow patterns from each submission, reducing code to an ordered sequence of structural elements (loops, conditionals, switch cases) annotated with nesting depth. Comparing these sequences using Levenshtein distance identifies submissions that implement the same algorithm even when variable names, formatting, and surface-level code differ entirely.
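
A simplified version of this control-flow comparison might look like the following. The keyword set, the brace-based depth tracking, and the C-style syntax assumption are illustrative choices, not the system's actual parser.

```python
import re

def control_flow_sequence(source):
    """Reduce code to an ordered list of (keyword, nesting_depth) pairs.
    Simplified sketch: tracks braces for depth, matches C-style keywords."""
    seq = []
    depth = 0
    for tok in re.findall(r"\bfor\b|\bwhile\b|\bif\b|\belse\b|\bswitch\b|\bcase\b|[{}]", source):
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        else:
            seq.append((tok, depth))
    return seq

def levenshtein(a, b):
    """Standard dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]
```

Two submissions whose control-flow sequences have a small Levenshtein distance implement the same structural skeleton, regardless of how their identifiers or formatting differ.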

To limit false positives, the system filters out starter code fingerprints and excludes patterns appearing in more than 75% of submissions. Similarity scores are computed using Jaccard similarity over fingerprint sets, with configurable thresholds determining which matches warrant investigation.
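
The filtering and scoring steps can be sketched in a few lines. The 75% threshold comes from the description above; the function names and data shapes are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two fingerprint sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_common(fingerprint_sets, starter, threshold=0.75):
    """Drop starter-code fingerprints and any fingerprint appearing in
    more than `threshold` of submissions before pairwise scoring."""
    n = len(fingerprint_sets)
    counts = {}
    for fps in fingerprint_sets:
        for f in fps:
            counts[f] = counts.get(f, 0) + 1
    common = {f for f, c in counts.items() if c / n > threshold}
    return [fps - starter - common for fps in fingerprint_sets]
```

Filtering before scoring matters: a fingerprint shared by most of the class inflates every pairwise Jaccard score, so removing it first keeps high scores reserved for genuinely overlapping pairs.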

Edit Volatility

Similarity analysis is blind to dishonesty that does not involve copying. A student who uses an AI tool produces unique code, indistinguishable from honest work when compared against the submission pool. Volatility analysis closes this gap by examining how the code was developed, not what it looks like.

The system retrieves every submission attempt for every student, not just final submissions. For each pair of consecutive attempts, it computes a volatility score that captures how much the code changed and how quickly. The score combines normalized Levenshtein distance with token-level Jaccard dissimilarity, weighted by a logarithmic time factor; a large rewrite over several hours scores lower than the same rewrite in five minutes. A student who writes code incrementally produces low volatility: adding a function, fixing a bug, adjusting logic. A student who replaces most of their code between attempts produces high volatility.
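
A sketch of such a volatility score is below. The component weights, the whitespace token split, and the exact shape of the logarithmic time factor are assumptions; only the ingredients (normalized Levenshtein distance, token-level Jaccard dissimilarity, log-scaled time weighting) come from the description above.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def volatility(prev_code, curr_code, minutes_elapsed, w_edit=0.5, w_token=0.5):
    """Illustrative volatility between consecutive attempts: how much the
    code changed, scaled up when the change happened quickly."""
    dist = edit_distance(prev_code, curr_code)
    norm_edit = dist / max(len(prev_code), len(curr_code), 1)
    prev_toks, curr_toks = set(prev_code.split()), set(curr_code.split())
    union = prev_toks | curr_toks
    token_dissim = 1 - len(prev_toks & curr_toks) / len(union) if union else 0.0
    change = w_edit * norm_edit + w_token * token_dissim
    # Assumed log time factor: the same rewrite in 5 minutes scores
    # higher than over several hours.
    time_factor = 1 / math.log2(2 + minutes_elapsed)
    return change * time_factor
```

Under this shape, an identical resubmission scores zero regardless of timing, while a wholesale rewrite minutes after the previous attempt scores near the maximum.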

The system flags sequences of consecutive high-volatility attempts, a pattern that is difficult to produce through genuine iterative development. Streak detection presents autograder scores, volatility metrics, and side-by-side code comparisons for each attempt. Reviewers can inspect what changed between each submission, giving them the context to judge whether the student wrote the code themselves.
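
Streak detection over a sequence of per-attempt volatility scores reduces to a run-length scan. The threshold and minimum streak length below are placeholder values, not the system's tuned parameters.

```python
def high_volatility_streaks(scores, threshold=0.5, min_len=2):
    """Return (start, end) index pairs for runs of consecutive scores
    above the threshold, keeping only runs of at least min_len."""
    streaks = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                streaks.append((start, i - 1))
            start = None
    if start is not None and len(scores) - start >= min_len:
        streaks.append((start, len(scores) - 1))
    return streaks
```

Each returned range maps back to a span of attempts whose autograder scores and side-by-side diffs the reviewer then inspects.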

Investigation and Reporting

Reviewing Flagged Students

The system surfaces flagged students through two independent channels. For similarity matches, the review interface presents side-by-side code comparison with matched regions highlighted, overlap counts, and similarity scores. For volatility flags, the interface presents each student’s full submission timeline: every attempt, its autograder score, the volatility score between consecutive attempts, and side-by-side code comparisons. For each finding, reviewers decide whether the evidence supports an infraction and warrants referral, or whether the finding should be dismissed.

Report Generation

When a case is ready for referral, the system generates a PDF report compiling the evidence. For similarity cases, reports include side-by-side code comparison between students with matched regions highlighted, overlap counts, and similarity scores; partner identities are anonymized. For volatility cases, reports include side-by-side source code of consecutive submission attempts with volatility scores, autograder scores, and timestamps. These reports are submitted directly to the institutional academic integrity process.

Educational Impact

Volatility-based detection: In its first semester, approximately 10% of enrolled students were flagged for AI usage by the system. Of those flagged, roughly half denied the allegation. About 15% of students who denied chose to appeal, but no findings were overturned.

Investigation efficiency: For similarity, filtering starter code and common patterns before analysis ensures flagged matches reflect genuine overlap, not shared boilerplate. For volatility, the system surfaces only students whose submission patterns are inconsistent with iterative development, rather than requiring staff to manually review every student’s submission history.

Evidentiary standards: When integrity violations are suspected, the system provides objective, quantitative evidence for investigation. Generated reports present students and faculty reviewers with specific similarity scores and volatility metrics rather than subjective assessments of code similarity.

Looking Forward

Similarity detection remains necessary for catching students who copy from each other, but it is no longer sufficient on its own. Edit volatility analysis fills the gap that AI code generation has opened: it detects dishonesty that produces no similarity signal at all. A development process can theoretically be fabricated, but because the system accounts for time between attempts and measures incremental progress, doing so would require sustained effort and deliberate pacing across the assignment. As AI-generated code becomes harder to distinguish from student-written code on its surface, examining how code was produced rather than what it looks like will only become more important.