AI-Native Engineering · Jun 2, 2026 · 10 min read

How to Run OWASP Security Reviews With Claude Code on Every PR

A pentest twice a year tells you what was broken months ago. Wire Claude Code's GitHub Action into your pipeline for everyday code review, then add a second, explicit security step that reads every diff against the OWASP Top 10 and blocks the merge when it finds a hole. Security stops being an event you survive twice a year and turns into something your pipeline checks on every commit.

Oshri Cohen

Chief Product & Technology Officer

Most companies still treat security like a fire drill. Once or twice a year a pentest lands, a report full of findings arrives, and an engineer who has long since moved on to other work gets a ticket about a vulnerability that shipped to production three months ago. Everyone agrees security matters. Nobody can honestly say it's continuous. The audit is a snapshot of how exposed you were last quarter, delivered too late to do anything but clean up.

The reason security lived at the end of the process instead of inside it was never philosophical. It was economic. You cannot put a security engineer on every pull request. There aren't enough of them, they cost a fortune, and nobody good wants to spend their week reading the four-hundredth diff. So review got rationed: saved for the big releases, and the rest of the time you mostly hoped. That constraint just disappeared. An AI reviewer can read every change, against a real security standard, every single time, in the minutes between opening a pull request and merging it. And the standard you'd hold it to already exists, in public, for free.

OWASP is the checklist nobody keeps in the room

OWASP, the Open Worldwide Application Security Project, has spent two decades writing down, openly, exactly how web software gets broken. The OWASP Top 10 is the short list every engineer half-remembers. The Application Security Verification Standard is the long, specific one. The Cheat Sheet Series tells you how to get each control right. Between them you have a mature, industry-agreed catalog of the ways an application fails its users, and you didn't have to write a word of it.

The problem was never that the standard didn't exist. It's that the standard lives in a PDF, and the PDF is not in the room at 5pm on a Friday when someone is merging a fix to get the release out. Knowing the OWASP Top 10 exists has never once stopped an injection bug from shipping. The catalog only does work if something reads every change against it, at the moment the change is made. So that's the whole move: take OWASP out of the PDF and put it where the code actually changes.

The ten categories, as numbered in the 2021 edition of the Top 10:

A01 Broken Access Control: can a user reach data or actions that aren't theirs?
A02 Cryptographic Failures: is sensitive data protected in transit and at rest, with algorithms that aren't a decade out of date?
A03 Injection: SQL, command, and the rest. Is untrusted input ever allowed to become code?
A04 Insecure Design: is the flaw in the shape of the thing, not just a slip in the implementation?
A05 Security Misconfiguration: defaults left on, headers missing, a storage bucket quietly public.
A06 Vulnerable and Outdated Components: does this change pull in a dependency with a known CVE?
A07 Identification and Authentication Failures: weak sessions, guessable resets, login you can walk past.
A08 Software and Data Integrity Failures: unverified updates, untrusted deserialization, a poisoned build pipeline.
A09 Security Logging and Monitoring Failures: if the attack happened, would you ever see it?
A10 Server-Side Request Forgery: can the server be tricked into fetching something it has no business touching?

None of that is exotic. Every engineer has met every item on the list. What no team has ever had is someone with the patience to check all ten against every diff, forever, without getting bored or going home. That's the job that just became automatable.

Step one: let the reviewer review

Claude Code, the tool I build with and the one I'll talk about here, ships an official GitHub Action. Wire it into your repository and it runs inside the CI you already have, triggered when a pull request opens or gets new commits, authenticated with an API key you keep as a repository secret. It reads the diff and the code around it and leaves review comments on the pull request the way a colleague would. Point it at your repo's CLAUDE.md and it picks up your project's conventions and context for free.

That first layer is ordinary engineering review: correctness, clarity, design, the things a strong senior would flag. Run it on every pull request. It's genuinely useful, and on a fast-moving team it catches a great deal before a human ever looks. But here is the trap, and it's the whole reason this post exists: do not assume that a reviewer looking at everything will also catch the security holes. It won't, not reliably, and it fails for the same reason your humans do.

Step two: security gets its own step, with one job

A reviewer asked to judge correctness, style, performance, design, and security all at once will do all of them at the depth of none of them. Attention is the scarce resource, for a model exactly as for a person. The fix is the same one I use everywhere with AI: don't build one overloaded generalist, build a narrow specialist with a single job. I've argued you should treat agents like very stupid employees, and the kindest, most effective thing you can do for a stupid employee is give them exactly one thing to worry about.

So you add a second, explicit step whose only job is security. Its entire mandate is to read the change and ask one question, ten ways: does this introduce, or fail to prevent, any of the OWASP risks? Anthropic publishes a security-review action and a /security-review command built for precisely this, but the principle holds however you wire it. Security review is a separate pass, with its own prompt, its own rubric, and nothing else competing for its attention.

A reviewer asked to check everything checks nothing in particular. Security gets its own step, its own rubric, and one job, the same way you'd never ask your auditor to also write the feature.

How to make the security step actually good

Give it the rubric, not just the diff

A vague instruction to "check for security issues" gets you vague results. Hand it the actual standard. Put the OWASP categories you care about, in plain language, into the security step's instructions, and then add what makes your codebase specific: how you do authorization, where secrets are allowed to live, which data is sensitive, what your input-validation pattern looks like. "Broken access control" is an abstraction until the reviewer knows that in your system every query must be scoped to the current tenant. Encode that, and the review stops being generic boilerplate and starts being about you.
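
As a rough illustration of handing over the rubric instead of a vague instruction, here is a minimal sketch that assembles the security step's instructions from the OWASP categories plus your own rules. The names `ProjectRules` and `build_security_prompt` are hypothetical, and the example rules are placeholders for whatever your codebase actually enforces.

```python
from dataclasses import dataclass

# OWASP Top 10 categories, named as in the list earlier in this post.
OWASP_TOP_10 = {
    "A01": "Broken Access Control",
    "A02": "Cryptographic Failures",
    "A03": "Injection",
    "A04": "Insecure Design",
    "A05": "Security Misconfiguration",
    "A06": "Vulnerable and Outdated Components",
    "A07": "Identification and Authentication Failures",
    "A08": "Software and Data Integrity Failures",
    "A09": "Security Logging and Monitoring Failures",
    "A10": "Server-Side Request Forgery",
}


@dataclass(frozen=True)
class ProjectRules:
    """Hypothetical holder for the codebase-specific half of the rubric."""
    authorization: str
    secrets: str
    sensitive_data: str
    input_validation: str


def build_security_prompt(categories: list[str], rules: ProjectRules) -> str:
    """Assemble the security step's instructions: OWASP categories plus local rules."""
    lines = [
        "Review this pull request for security only.",
        "Read the diff and the surrounding code against these OWASP risks:",
    ]
    # A KeyError here means the rubric names a category this table does not know.
    lines += [f"- {cid} {OWASP_TOP_10[cid]}" for cid in categories]
    lines += [
        "Rules specific to this codebase:",
        f"- Authorization: {rules.authorization}",
        f"- Secrets: {rules.secrets}",
        f"- Sensitive data: {rules.sensitive_data}",
        f"- Input validation: {rules.input_validation}",
        "For every finding, report the file, the line, the OWASP category, "
        "the exposure, and a proposed fix.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example_rules = ProjectRules(
        authorization="Every query must be scoped to the current tenant.",
        secrets="Secrets come from the environment, never from source files.",
        sensitive_data="Email addresses and payment details are sensitive.",
        input_validation="Request bodies are validated before any handler logic.",
    )
    print(build_security_prompt(["A01", "A03", "A07"], example_rules))
```

The resulting text is what you would pass to the security step as its prompt. The point of keeping it in code is that the rubric gets versioned and reviewed like everything else.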

Review the change in its real context

Security bugs are rarely visible inside one isolated hunk. A new endpoint reads fine until you notice that nothing upstream of it checks the caller's role. So the step has to read the diff and the surrounding code, map each risky change to the OWASP category it threatens, and comment on the exact line, naming the category, explaining the exposure, and proposing the fix. A finding pinned to line 240 that says "A01: this query isn't scoped to the current user" is something an engineer acts on in seconds. "Consider security implications" is noise, and people learn to scroll past noise.
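
To make "pinned to the exact line" concrete, here is a small sketch of the shape a finding might take before it is posted as a review comment. The `Finding` fields, `format_comment`, and `is_actionable` are assumptions for illustration, not the output format of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one security finding on a pull request."""
    path: str       # file the finding applies to
    line: int       # 1-based line in that file
    category: str   # OWASP id, e.g. "A01"
    exposure: str   # what an attacker could do
    fix: str        # proposed change


def is_actionable(finding: Finding) -> bool:
    """Keep only findings an engineer can locate and act on."""
    return (
        bool(finding.path)
        and finding.line > 0
        and bool(finding.category)
        and bool(finding.exposure.strip())
        and bool(finding.fix.strip())
    )


def format_comment(finding: Finding) -> str:
    """Render the finding as the body of a line-level review comment."""
    return (
        f"{finding.category}: {finding.exposure}\n"
        f"Proposed fix: {finding.fix}"
    )


if __name__ == "__main__":
    example = Finding(
        path="api/orders.py",
        line=240,
        category="A01",
        exposure="this query isn't scoped to the current user",
        fix="filter by the authenticated user's id before returning rows",
    )
    if is_actionable(example):
        print(f"{example.path}:{example.line}")
        print(format_comment(example))
```

Anything that fails `is_actionable`, such as a finding with no line or no proposed fix, is the "consider security implications" kind of noise and can be dropped before it reaches the pull request.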

Run it on what matters, but run it always

This step has no business firing on a README typo or a CSS tweak. Gate it with path filters to the changes that touch real code, so you aren't paying a review tax where there's nothing to review. But inside that boundary, run it on every qualifying pull request, every single time. Continuity is the entire value proposition. A security review that runs sometimes is just a slower, more expensive pentest with extra steps.
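
The path-filter idea can be sketched in a few lines. This assumes you already have the list of files changed in the pull request; the skip patterns and the function name `needs_security_review` are illustrative, not a recommended set.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for changes that carry nothing to review for security.
# Note that fnmatch's "*" also matches "/" so "*.md" covers nested files too.
SKIP_PATTERNS = ["*.md", "*.css", "*.png", "docs/*"]


def is_skippable(path: str) -> bool:
    """True when a changed file matches one of the skip patterns."""
    return any(fnmatchcase(path, pattern) for pattern in SKIP_PATTERNS)


def needs_security_review(changed_files: list[str]) -> bool:
    """Run the security step when at least one changed file is real code."""
    return any(not is_skippable(path) for path in changed_files)


if __name__ == "__main__":
    print(needs_security_review(["README.md", "site/theme.css"]))
    print(needs_security_review(["README.md", "api/orders.py"]))
```

The filter is deliberately a deny-list: anything not explicitly known to be harmless gets reviewed, so a new file type defaults to being checked rather than skipped.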

Make it a gate, not a suggestion

A review nobody has to act on is decoration, and decoration is worse than nothing here, because it sells you the feeling of safety without the fact of it. Make the security step a required status check. A high-severity finding (injection, broken authentication, a leaked secret) blocks the merge. Full stop. If a human wants to override it, fine, but they do it on the record, with a written reason, so a waiver is a decision someone owns rather than a checkbox someone clicked on the way out the door.
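
One way the gate could be expressed, as a sketch only: high-severity findings block unless a human has waived them with a named approver and a written reason. The severity labels and the `Finding`, `Waiver`, and `gate` names are assumptions, and how findings and waivers get loaded is left out.

```python
import sys
from dataclasses import dataclass

# Assumed severity labels; use whatever scale your security step emits.
BLOCKING_SEVERITIES = {"high", "critical"}


@dataclass(frozen=True)
class Finding:
    id: str
    category: str   # OWASP id, e.g. "A03"
    severity: str   # "low", "medium", "high", or "critical"


@dataclass(frozen=True)
class Waiver:
    finding_id: str
    approver: str
    reason: str


def gate(findings: list[Finding], waivers: list[Waiver]) -> list[Finding]:
    """Return the findings that still block the merge."""
    # A waiver only counts when someone owns it and wrote down why.
    waived = {
        w.finding_id
        for w in waivers
        if w.approver.strip() and w.reason.strip()
    }
    return [
        f for f in findings
        if f.severity in BLOCKING_SEVERITIES and f.id not in waived
    ]


if __name__ == "__main__":
    found = [Finding("f1", "A03", "high"), Finding("f2", "A09", "low")]
    blocking = gate(found, waivers=[])
    for f in blocking:
        print(f"BLOCKED by {f.id} ({f.category}, {f.severity})")
    # A non-zero exit fails the CI job, which fails the required status check.
    sys.exit(1 if blocking else 0)
```

Lower-severity findings never appear in the blocking list, so they ride along as comments, and a waiver with an empty reason is treated as no waiver at all.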

Someone will object that the model is non-deterministic and won't surface the identical finding on every run. True, and it doesn't matter, because you are not asserting on its exact words. You're using it as a tireless first-pass reviewer that escalates what it sees, the same way I argue you should test non-deterministic agents on behavior and gates rather than on string-matching. Set the bar high on the categories that end careers, and let the lower-severity notes ride along as advice.
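
Testing on behavior rather than wording might look like the sketch below: seed a known vulnerability in a fixture, run the security step, and assert only that some finding of the right category lands in the right place. The `Finding` shape and `flags_seeded_bug` are hypothetical, and the two runs are hand-written stand-ins for real reviewer output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    category: str  # OWASP id, e.g. "A03"
    message: str   # free text, expected to vary between runs


def flags_seeded_bug(
    findings: list[Finding],
    path: str,
    category: str,
    first_line: int,
    last_line: int,
) -> bool:
    """True if any finding of the expected category lands on the seeded lines."""
    return any(
        f.path == path
        and f.category == category
        and first_line <= f.line <= last_line
        for f in findings
    )


if __name__ == "__main__":
    # Two runs that word the same problem differently.
    run_a = [Finding("api/search.py", 41, "A03", "User input is concatenated into SQL.")]
    run_b = [Finding("api/search.py", 42, "A03", "Query built from an unescaped parameter.")]
    for run in (run_a, run_b):
        # The check ignores the message text entirely.
        print(flags_seeded_bug(run, "api/search.py", "A03", 40, 44))
```

The assertion holds for both runs even though the messages share no phrasing, which is the property you want from a test of a non-deterministic reviewer.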

Required check, not optional. The security step has to pass to merge, like any other gate that means something.
Hard-block the severe categories. Injection, authentication, access control, and exposed secrets are merge-stoppers, not gentle suggestions.
Waivers are explicit and logged. A human can override a finding, with a written justification that stays in the record.
Re-run on every push. The diff changed, so the review must too; a green check from three commits ago proves nothing about the code in front of you.
Comment on the exact line. A finding nobody can locate is a finding nobody fixes.

It's a layer, not an alibi

Be clear-eyed about what this is. The AI security step does not replace the rest of your security program, and anyone who sells it to you that way is selling you a liability. Keep your dependency scanning and advisory alerts for the known-CVE problem, and your secret scanning alongside them. Keep a real static-analysis tool like CodeQL for the deep stuff. And keep paying for human pentests on the parts of the system where being wrong is catastrophic. What the AI step adds is the thing none of those ever gave you: a reviewer that reads every change for the logic-level OWASP risks (the broken access control, the insecure design, the server-side request forgery) that pattern-based scanners often miss because the flaw depends on what your application is supposed to allow, and reads them continuously instead of quarterly.

And it will be wrong sometimes. It will flag things that are fine and cost you a few minutes, and it will miss things, which means you still own whatever ships. The OWASP Top 10 is a floor, not a ceiling. Clearing it means you avoided the common catastrophes, not that you're secure. A passing security step is not a certificate. It's evidence that the obvious holes aren't there, produced on every single pull request, which is precisely the evidence you never used to have.

Why this is the right time

Strip away the tooling and this is the same pattern I keep coming back to. Security review was rationed because skilled attention was expensive and scarce, so it went to the releases that mattered and skipped everything else. AI makes one specific kind of attention cheap and effectively infinite, so careful standard-driven review, the work that was never worth doing on every change, suddenly pays for itself. You don't slow your engineers down to stay safe. You build the guardrail that lets them keep moving, which is the entire thesis of professional vibe coding: speed and rigor stop being a trade-off the moment the rigor is cheap to enforce.

It also moves your security people up the stack, where they're worth far more. Instead of hand-reviewing the four-hundredth diff, they curate the rubric, sharpen the OWASP checklist against your actual threats, and adjudicate the genuinely hard findings the machine escalates. The boring, infinite, first-pass work goes to the tireless reviewer; the judgment stays with the humans. That isn't security theater with an AI logo stuck on it. It's the first time "we review every change for security" can be a true sentence instead of an aspiration.

Put OWASP in the pull request. Let the first step review the engineering, give the second step one job and the OWASP rubric, gate the merge on what it finds, and keep your other defenses exactly where they are. Do that, and security stops being an event you brace for twice a year. It just runs, on every commit, the way your tests already do.