Milestone 3: Final Implementation, Evaluation, and Poster Session
Percent of Final Grade: 30%
Poster Session: Tuesday, June 2 from 5:00–7:00 PM in the CoDa Sunken Courtyard (Computing and Data Science Building).
Final code and artifacts due: Tuesday, June 2 at 11:59 PM PT (after the poster session). Teaching staff have direct access to your GitHub repository — you do not need to upload a code snapshot. Whatever is on main at the deadline is what we grade.
Every team member must still show ongoing activity in the repo during the final stretch — not just code commits, but also issues, PRD updates, test cases, reviews, or poster content.
Attendance at the poster session is required. Any team member who misses the poster session without a teaching-staff-approved excuse costs the entire team 5% of its Milestone 3 grade, per absent member.
Deliverables:
- Working deployed product — both a user-facing UI and a moderator UI, both live on the deployed URL
- Abuse classifier — at least one ML classifier, at least one LLM-based classifier, and at least one hybrid approach, with evaluation results checked into the repo
- Test sets — one set of allowed/benign content and one set of disallowed/abusive content, checked into the repo along with their sourcing documentation
- Poster — 24”×36”, displayed at the poster session; PDF checked into the repo
- Demo video — ~5-minute walkthrough, playable on a tablet or laptop during the poster session
There is no separate writeup for Milestone 3 — the poster is the writeup.
AI Usage Policy for Milestone 3
Same as Milestone 2: AI coding assistants are encouraged, and we grade to a higher bar because of it. Keep your AI use statement in the README up to date through the final deadline. Text on the poster, and your narration of the poster to judges, must be written and delivered by the team — not AI-generated.
Part 1: Working Abuse Prevention Tool (30%)
Your product must be live on a real URL and end-to-end functional. Both UIs must work.
Product UI (8% of Milestone 3)
What a normal user of the platform would see. Post content, interact with the harmful model (for AI-as-abuser teams), receive messages, submit reports — whatever the core user flow is for the abuse type you picked.
Moderator Interface (12% of Milestone 3 — the single largest grading item)
The trust & safety console. The Moderator Interface is weighted more heavily than any other single item because in real T&S work, the moderation tooling is the product. A perfect classifier with no moderator UI to route decisions through is not a deployable system. Your Moderator Interface must include:
- A flagged-content queue — newly-flagged items appear here; moderators can open each for review
- A decision workflow — moderators can approve, remove, escalate, or otherwise act on flagged items, and the action is reflected downstream in the product
- Per-user or per-content history — a moderator can see prior decisions relevant to the current item (e.g., “this user has been flagged 3 times this week”)
- Basic statistics — daily / weekly counts of flags, decisions, and outcomes so a platform operator can understand how the system is performing
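A minimal sketch of one data model that could back all four requirements, using plain Python dataclasses in place of whatever database you actually use; every field and function name here is illustrative, not required.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Decision:
    moderator_id: str
    action: str                 # "approve" | "remove" | "escalate"
    note: str
    decided_at: datetime

@dataclass
class FlaggedItem:
    item_id: str
    author_id: str
    content: str
    classifier_score: float     # confidence from the detection layer
    flagged_at: datetime
    decision: Optional[Decision] = None   # None while the item is still in the queue

def moderation_queue(items: list[FlaggedItem]) -> list[FlaggedItem]:
    """Flagged-content queue: undecided items, oldest first."""
    return sorted((i for i in items if i.decision is None), key=lambda i: i.flagged_at)

def user_history(items: list[FlaggedItem], author_id: str) -> list[FlaggedItem]:
    """Per-user history: prior decided items for the same author."""
    return [i for i in items if i.author_id == author_id and i.decision is not None]

def daily_stats(items: list[FlaggedItem]) -> dict[str, dict[str, int]]:
    """Basic statistics: daily counts of flags and of each moderator action."""
    stats: dict[str, dict[str, int]] = {}
    for i in items:
        day = i.flagged_at.date().isoformat()
        bucket = stats.setdefault(day, {"flags": 0, "approve": 0, "remove": 0, "escalate": 0})
        bucket["flags"] += 1
        if i.decision is not None:
            bucket[i.decision.action] += 1
    return stats
```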
Judges and teaching staff will actually click through the Moderator Interface during the poster session. Expect scrutiny.
Detection / mitigation in the path (5% of Milestone 3)
Your detection or mitigation layer must be visibly operating on live traffic through the deployed URL:
- Human-abuser teams: the classifier must route flagged content into the Moderator Interface automatically. Demonstrate by posting an example from your unallow set through the product UI and showing it appear in the queue.
- AI-as-abuser teams: the mitigation layer must intercept harmful model outputs in real time. Demonstrate by showing what the user would have received with no mitigation vs. what they receive with mitigation.
“Logs show we classified it” does not satisfy this — judges need to see the guardrail firing in the live UI.
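For AI-as-abuser teams, one way to make the guardrail fire visibly in the live UI is to wrap the model call itself. The sketch below is a minimal illustration under stated assumptions; `generate` and `classify` are placeholders for your actual model and detection layer.

```python
def generate(prompt: str) -> str:
    """Placeholder for your model call (hosted API or local model)."""
    raise NotImplementedError

def classify(text: str) -> tuple[str, float]:
    """Placeholder for your detection layer; returns ("allow" | "unallow", confidence)."""
    raise NotImplementedError

def guarded_generate(prompt: str) -> dict:
    """Intercept harmful outputs before the user sees them, keeping both versions
    so the UI can show 'with mitigation' vs. 'without mitigation' side by side."""
    raw = generate(prompt)
    label, confidence = classify(raw)
    if label == "unallow":
        return {
            "shown_to_user": "This response was withheld by the safety layer.",
            "raw_output": raw,          # surfaced only in the Moderator Interface
            "intercepted": True,
            "confidence": confidence,
        }
    return {"shown_to_user": raw, "raw_output": raw, "intercepted": False, "confidence": confidence}
```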
Feedback loop (when architecturally feasible)
Real trust & safety systems don’t stop at detecting abuse — they learn from moderator decisions and get better over time. Where your architecture supports it, moderator actions should flow back into the detection layer so future classifications improve. A few common patterns:
- LLM-as-classifier: append recent moderator decisions to the classifier’s prompt as few-shot examples; refresh the prompt on a schedule or after every N decisions. Cheap and easy if you already have an LLM in the loop (a minimal sketch of this pattern follows the list).
- Traditional ML with incremental / online learning: maintain a rolling labeled dataset that includes newly-moderated items; retrain on a cadence (nightly, weekly) or with online algorithms (SGD, Vowpal Wabbit).
- Hybrid systems: moderator-overridden edge cases become the next round’s training data for the LLM adjudicator, or expand the “obvious cases” filter based on high-confidence moderator patterns.
- Rule-based supplements: high-confidence moderator decisions can promote deterministic rules (e.g., “always flag this URL pattern”) that short-circuit expensive classification.
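As a concrete illustration of the first pattern, here is a minimal sketch that rebuilds an LLM classification prompt from the most recent moderator decisions; `call_llm` is a placeholder for whichever hosted model you use, and the decision-record fields are assumptions, not a required format.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your hosted-LLM call; returns the model's text reply."""
    raise NotImplementedError

def build_prompt(recent_decisions: list[dict], candidate: str, n_examples: int = 10) -> str:
    """Fold the most recent moderator decisions into the prompt as few-shot examples.
    Each decision dict is assumed to look like {"content": ..., "final_label": "allow"/"unallow"}."""
    examples = recent_decisions[-n_examples:]
    lines = ["You are a content moderator. Label the item ALLOW or UNALLOW.",
             "Recent moderator-reviewed examples:"]
    for ex in examples:
        lines.append(f'- "{ex["content"]}" -> {ex["final_label"].upper()}')
    lines.append(f'Item to classify: "{candidate}"')
    lines.append("Answer with ALLOW or UNALLOW and a one-sentence rationale.")
    return "\n".join(lines)

def classify_with_feedback(recent_decisions: list[dict], candidate: str) -> str:
    """Re-classify using a prompt refreshed with the latest moderator decisions."""
    reply = call_llm(build_prompt(recent_decisions, candidate))
    return "unallow" if "UNALLOW" in reply.upper() else "allow"
```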
Not every team will be able to implement this cleanly. Hosted foundation models you can’t fine-tune, tight deadlines, and architectures that separate detection from storage all make feedback loops harder. If your architecture supports it, you’re expected to build it — the Moderator Interface grading rewards this. If your architecture genuinely does not support it, document why in your README and in the poster’s “Looking Forward” section; teams that justify the gap well are not penalized.
Teams who implement a feedback loop should include a before-and-after evaluation in Part 2: run your test sets through the classifier at time T, apply simulated moderator feedback, and run again at time T+1 to show the delta in F1, precision, and recall.
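A minimal harness for that comparison might look like the sketch below, assuming each classifier is a function from text to "allow"/"unallow" and your test sets are (text, true_label) pairs; how the time-T+1 classifier was produced (prompt refresh, retrain, new rules) is up to your feedback loop.

```python
from sklearn.metrics import precision_recall_fscore_support

def evaluate(classify, labeled_examples):
    """classify: text -> 'allow' | 'unallow'.
    labeled_examples: list of (text, true_label) pairs drawn from both test sets."""
    y_true = [label for _, label in labeled_examples]
    y_pred = [classify(text) for text, _ in labeled_examples]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label="unallow", average="binary")
    return {"precision": precision, "recall": recall, "f1": f1}

def before_after_delta(classify_t0, classify_t1, labeled_examples):
    """Metrics at time T, metrics at time T+1 after simulated moderator feedback
    has been folded in, and the per-metric delta."""
    before = evaluate(classify_t0, labeled_examples)
    after = evaluate(classify_t1, labeled_examples)
    return before, after, {k: after[k] - before[k] for k in before}
```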
End-to-end pipeline (5% of Milestone 3)
A complete cycle — user action → detection → moderator decision → downstream effect — must run on the deployed URL, not just on localhost.
Part 2: Classifier Evaluation (25%)
Milestone 3’s technical heart is a single question: does your detection actually work, and how does it compare to alternatives? You will run real test data through multiple approaches and report results.
Test sets
Check two sets of labeled examples into your repo:
- Allow set: content that is benign and must pass through (false-positive evaluation)
- Unallow set: content that is abusive and must be caught (false-negative evaluation)
You may source test sets from public datasets (cite them), scrape or collect real data (document the method), or construct synthetic examples (document how you generated them). You can use stand-ins for illegal or extremely harmful content — for example, pictures of nude kittens in place of child sexual abuse material. Size guidance: at least 100 examples per set; more is better.
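Any plain, documented format works; one option is a JSONL file per set, with one labeled example per line and a source field so the sourcing documentation travels with the data. The sketch below is illustrative only, with hypothetical paths and field names.

```python
import json
import os

# Illustrative only: one JSON object per line, one file per test set.
allow_examples = [
    {"id": "allow-001", "text": "Selling my old bike, $50, pickup only.",
     "label": "allow", "source": "synthetic"},
]
unallow_examples = [
    {"id": "unallow-001", "text": "<example of content your policy disallows>",
     "label": "unallow", "source": "public dataset (cite it in the README)"},
]

os.makedirs("tests", exist_ok=True)
for path, rows in [("tests/allow_set.jsonl", allow_examples),
                   ("tests/unallow_set.jsonl", unallow_examples)]:
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```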
Required approaches
Run both sets through at least three detection approaches and report results for each:
- A traditional ML classifier — e.g., logistic regression over TF-IDF features, a fine-tuned BERT or RoBERTa, a small text classifier you train, or a computer vision model for image abuse types
- An LLM-as-classifier — prompt a hosted LLM (Claude, GPT-4, Gemini, or an open-weight model) to classify each example as allow/unallow and return a rationale
- A hybrid approach — e.g., traditional classifier filters the obvious cases, LLM adjudicates the uncertain middle; or ensemble voting across both; or LLM used only for appeal/second-look (one hybrid pattern is sketched below)
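As one example of the hybrid pattern, the sketch below assumes a TF-IDF + logistic regression model for the confident cases and defers the uncertain middle to an LLM; `llm_classify` and the 0.2/0.8 thresholds are placeholders to tune on your own data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_cheap_classifier(texts: list[str], labels: list[str]):
    """Traditional ML baseline: TF-IDF features + logistic regression."""
    model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model

def llm_classify(text: str) -> str:
    """Placeholder for the LLM-as-classifier call; returns 'allow' or 'unallow'."""
    raise NotImplementedError

def hybrid_classify(model, text: str, low: float = 0.2, high: float = 0.8) -> str:
    """Cheap model decides confident cases; the LLM adjudicates the uncertain middle."""
    proba_unallow = model.predict_proba([text])[0][list(model.classes_).index("unallow")]
    if proba_unallow >= high:
        return "unallow"
    if proba_unallow <= low:
        return "allow"
    return llm_classify(text)
```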
Required metrics
For each approach, on each test set, report:
- Precision, recall, and F1
- Confusion matrix
- Per-inference cost (estimated USD per 1,000 classifications)
- Latency (median and p99)
A good way to present this is a single table with one row per approach and one column per metric.
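One way to compute the numbers for that table with scikit-learn, assuming you record a per-example latency and know (or estimate) the per-call price for each approach; the field names are illustrative, and the cost figure is simply price per call times 1,000.

```python
import statistics
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

def report(y_true, y_pred, latencies_s, usd_per_call):
    """y_true / y_pred: 'allow' or 'unallow' per example.
    latencies_s: per-example wall-clock seconds for this approach.
    usd_per_call: estimated cost of one classification."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label="unallow", average="binary")
    cm = confusion_matrix(y_true, y_pred, labels=["allow", "unallow"])
    lat = sorted(latencies_s)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": cm,                       # rows = true, cols = predicted
        "cost_per_1k_usd": usd_per_call * 1000,
        "latency_p50_s": statistics.median(lat),
        "latency_p99_s": lat[min(len(lat) - 1, int(0.99 * len(lat)))],
    }
```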
Guidance on what to expect
In our own work and in industry, hybrid systems that combine a cheap traditional classifier with an LLM adjudicator on edge cases often beat pure-LLM systems on F1 while being substantially cheaper to run. We encourage you to test this claim on your own data rather than take our word for it; reporting that the hybrid is worse for your specific abuse type is also a valid and interesting finding. The point of Milestone 3 is to build evidence, not to confirm a preconception.
If your team has implemented a feedback loop (see Part 1), include a before/after evaluation in your results: metrics at time T, the same metrics after simulated moderator feedback has been applied, and a discussion of what changed and why. This is a strong signal of production-realistic engineering and will be rewarded in the Moderator Interface grading.
Part 3: Poster (25%)
A 24”×36” poster, presented in person at the poster session.
Required poster sections
- Problem Description — short overview of the abuse type and victim profile; assume judges have general awareness but not domain expertise
- Policy Language (for human-abuser teams) or Model Safety Spec (for AI-as-abuser teams) — a concise written standard, under 400 words, readable by a normal user, explaining what your platform disallows and why
- Technical Back-end — what you built, how it works, which technologies, why you chose them
- Evaluation — the F1/precision/recall table from Part 2, plus any confusion matrices or plots (ROC curves are optional but welcome)
- Looking Forward — what you would do with more time; what scaling, privacy, or adversarial concerns remain open; what other techniques you considered but did not pursue
Poster logistics
- Size: 24”×36”, no larger. The courtyard has limited space.
- Backing: easels and boards will be available; no rigid backing needed.
- Team number: display your team number in a large font somewhere prominent on the poster — guest judges use it to find the posters they are assigned to.
- Printing: Stanford Hub poster printing takes ~3 business days, so submit by Wednesday, May 27 at the latest. Free tiling services like BlockPoster let you print A4 pages and tape them together. Costs are not covered by the course, but split five ways they are modest.
- Poster design: less text is better. Stanford Undergrad Research has good guidance. Hit the headlines on the poster; save the details for answering judges’ questions.
- PDF of poster must be committed to the repo.
Poster session location and format — CoDa Sunken Courtyard
- Location: CoDa Sunken Courtyard — the outdoor sunken courtyard at the new Computing and Data Science building at Stanford.
- Time: Tuesday, June 2 from 5:00 PM to 7:00 PM. We will end class at 4:00 PM so you can pick up your poster and set up by 4:45 PM.
- Format: open poster session. Guest judges from industry (previous years have included Google, Meta, OpenAI, Anthropic, TikTok, Discord, Match, Uber, Sony, Microsoft, xAI, Adobe, Apple, Reddit, etc) will rotate through posters and ask questions. Teaching staff will also evaluate. Prepare a ~5-minute pitch you can deliver repeatedly; ensure every teammate who is present gets to speak at least once during the pitch.
⚡ No power outlets in the courtyard ⚡
The sunken courtyard is outdoors and does not have accessible power. Plan accordingly:
- Bring a tablet or laptop that can run on battery for the full two-hour session (5:00–7:00 PM) or be prepared to rotate devices.
- You can borrow a tablet from Stanford if you don’t have one.
Part 4: Demo Video (10%)
Record a ~5-minute video that demonstrates all the features of your product and abuse mitigation tool.
- Open with the abuse problem; show the user flow; show the moderator flow; show a classifier catch; close with your evaluation headline.
- You may show either the video or a live demo at the poster session, but only the video will be graded, so make it thorough.
- The video should include a voice-over for grading; at the poster session you can narrate over it live.
Part 5: Software Engineering Practices — Final Sprint (10%)
Milestone 2’s engineering practices continue through Milestone 3. By June 2, the repo should show:
- All five teammates still contributing
- Real PR/review activity through the final sprint, not just a single pre-deadline dump
- Issues closed as work lands; open issues documenting known limitations
- Green CI on `main` at the deadline
- README updated with final deployed URL, final AI use statement, and a pointer to the poster PDF in the repo
Grading Criteria
Part 1 — Working Abuse Prevention Tool (30%)
- Product UI live and usable (8%) — user-facing UX works end-to-end on the deployed URL
- Moderator Interface (12%) — moderation queue, decision workflow (approve / remove / escalate), per-user or per-content history, and basic statistics are all implemented and usable. This is the highest-weighted single item in Milestone 3 because moderation is the operational reality of a real trust & safety function — a classifier that nobody acts on is not a product. Teams whose architecture supports it are expected to build a feedback loop (see narrative) that lets moderator decisions improve future classifications; teams whose architecture does not must document why.
- Detection / mitigation visibly working (5%) — for human-abuser teams, the classifier routes flagged content into the moderator queue; for AI-as-abuser teams, the mitigation layer visibly intercepts harmful model outputs in the user path
- End-to-end pipeline on the deployed URL (5%) — a complete cycle (user action → detection → moderator decision) runs on the live URL
Part 2 — Classifier Evaluation (25%)
- Test sets present, labeled, and documented (5%)
- Three required approaches implemented (10%)
- Metrics reported per approach (precision, recall, F1, cost, latency) (5%)
- Clear analysis of which approach wins where, and why (5%)
Part 3 — Poster (25%)
- Problem description (3%)
- Policy Language or Model Safety Spec (4%)
- Technical back-end (5%)
- Evaluation visuals and table (7%)
- Looking Forward (3%)
- Poster design, team number visible, physical setup (3%)
Part 4 — Demo Video (10%)
- All required segments present; audio clear; under 8 minutes.
Part 5 — Software Engineering Practices (10%)
- All five teammates contributing through the final sprint
- Green CI on `main` at the deadline
- Real review activity; issues tracked
- README current
Questions? Post on Ed. Final alumni evening Zooms will be held the week before the poster session to help teams debug their deployments and rehearse their pitches.
