OpenAI Codex Case Study: How a Fintech Company Generated 2,000 Test Cases and Achieved 90% Code Coverage
The Problem: 35% Code Coverage in a Payment Processing System
A fintech company processing $2 billion in annual payments had a testing crisis. Their core payment processing system — 180,000 lines of Python — had 35% code coverage. The remaining 65% was untested: edge cases in currency conversion, timeout handling in payment gateway integrations, retry logic for failed transactions, and regulatory compliance checks.
The consequences were real:
- 4 production incidents in 6 months involving untested code paths
- One incident caused $340,000 in duplicate charges (a retry bug that was never tested)
- Regulatory audit flagged insufficient test coverage for compliance-critical functions
- Engineering velocity slowed as developers feared modifying untested code
The CTO mandated 90% coverage within 3 months. The engineering team estimated this required approximately 2,000 new test cases — a monumental task for a 12-person backend team already at capacity.
Manual test writing at their historical pace (15 tests per developer per week) would take: 2,000 tests / (12 developers x 15 tests/week) = 11 weeks — assuming the entire team did nothing but write tests. Unacceptable.
The engineering manager proposed using OpenAI Codex to accelerate test generation. The hypothesis: Codex could generate the boilerplate and standard test cases (70% of the work), while engineers focused on complex edge cases and business logic verification (30% of the work).
The Approach: Systematic Test Generation by Module
Phase 1: Categorize the Codebase by Test Priority (Week 1)
The team classified all untested code into priority tiers:
Tier 1 — Critical (must have 95%+ coverage):
- Payment processing pipeline (authorize, capture, refund)
- Currency conversion and rounding
- Fraud detection rules
- Regulatory compliance checks (KYC, AML, PCI)
- Customer data handling (PII encryption, access control)
- Total: 45,000 lines, currently 28% covered

Tier 2 — Important (must have 85%+ coverage):
- Payment gateway integrations (Stripe, Adyen, PayPal)
- Retry and timeout logic
- Webhook handling and idempotency
- Reporting and reconciliation
- Total: 55,000 lines, currently 32% covered

Tier 3 — Standard (must have 75%+ coverage):
- Admin APIs
- Notification system
- Logging and audit trail
- Configuration management
- Total: 80,000 lines, currently 40% covered
Phase 2: Generate Characterization Tests (Week 2-3)
Before writing tests for correctness, the team used Codex to generate characterization tests — tests that document current behavior regardless of whether it is correct.
Codex task for each module:

"Analyze the PaymentProcessor class in src/payments/processor.py. Generate characterization tests that:
1. Call every public method with representative inputs
2. Assert on the actual return values (run the code to determine them)
3. Cover each branch of every conditional statement
4. Test with boundary values for numeric parameters
5. Test with null/None for optional parameters
6. Document any side effects (database writes, API calls, logging)

Use pytest with fixtures from tests/conftest.py. Mock external API calls using the existing mock patterns in tests/mocks/. Goal: document what the code DOES, not what it SHOULD do. These tests become our regression safety net."
Codex generated characterization tests at approximately 50 tests per hour (including review). The team generated 800 characterization tests in 2 weeks.
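To make the idea concrete, here is a minimal sketch of what a generated characterization test looks like. The function `round_fee` is a hypothetical stand-in for a real function in the payment codebase, and the asserted values were obtained by running the code, not from a spec:

```python
# Hypothetical sketch of a Codex-style characterization test.
# `round_fee` is a stand-in; the assertions document observed behavior.

def round_fee(amount_cents: int, rate: float) -> int:
    """Stand-in production function: fee in cents via Python's round()."""
    return round(amount_cents * rate)

def test_round_fee_characterization():
    assert round_fee(1000, 0.025) == 25  # typical case
    assert round_fee(100, 0.025) == 2    # 2.5 rounds DOWN: round() is banker's rounding
    assert round_fee(0, 0.025) == 0      # boundary: zero amount

test_round_fee_characterization()
```

Note that the second assertion documents banker's rounding, which may or may not be the intended behavior for a fee calculation — exactly the kind of fact a characterization test surfaces for later review.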
Phase 3: Generate Correctness Tests (Week 4-7)
With characterization tests providing a safety net, the team shifted to correctness tests — tests that verify the code does what it should.
Codex task for payment processing:

"Generate correctness tests for the authorize_payment function. Business rules:
- Amount must be positive and under $50,000 (single transaction limit)
- Currency must be one of: USD, EUR, GBP, JPY, CAD, AUD
- Card must not be expired
- Card must pass Luhn check
- Duplicate authorization within 60 seconds must be rejected (idempotency)
- If the gateway returns 'declined', the function must:
  1. Log the decline reason
  2. Not charge the card
  3. Return a DeclinedError with the gateway's reason code
- If the gateway times out (>10 seconds), the function must:
  1. Retry once after 2 seconds
  2. If the second attempt also times out, return a TimeoutError
  3. Never retry a non-idempotent operation

Test each rule with:
- Happy path (rule satisfied)
- Boundary values (exactly at limits)
- Violation (rule broken, verify correct error handling)
- Concurrent scenarios where relevant (two authorizations simultaneously)"
The specificity of business rules in the prompt was critical. Without explicit rules, Codex generated generic payment tests. With explicit rules, it generated tests that matched the company’s actual business logic.
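As an illustration of the happy-path/violation pattern applied to one of the rules above, here is a hedged sketch of a correctness test for the Luhn check. `luhn_valid` is a hypothetical stand-in for the card validation used inside `authorize_payment`:

```python
# Hedged sketch: correctness tests for one business rule (Luhn check).
# `luhn_valid` is a hypothetical stand-in, not the company's actual code.

def luhn_valid(card_number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    checksum = 0
    for i, ch in enumerate(reversed(card_number)):
        d = int(ch)
        if i % 2 == 1:       # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def test_luhn_happy_path():
    assert luhn_valid("4242424242424242")      # well-known valid test card

def test_luhn_violation():
    assert not luhn_valid("4242424242424241")  # last digit off by one

test_luhn_happy_path()
test_luhn_violation()
```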
Phase 4: Generate Edge Case and Security Tests (Week 8-10)
Codex task for edge cases:

"Generate edge case tests for the currency conversion module. Known edge cases to test:
1. JPY has no decimal places (round to whole yen)
2. BHD has 3 decimal places (Bahraini dinar)
3. Converting $0.01 USD to JPY (rounding to zero?)
4. Converting $49,999.99 at an exchange rate that pushes the amount over the $50,000 limit
5. Exchange rate updates during a conversion (stale rate)
6. Negative exchange rates (should be impossible but test rejection)
7. Currency conversion when source and target are the same
8. Overflow with very large amounts (10^15 base units)
9. Conversion chains: USD -> EUR -> GBP (floating point accumulation)
10. Historical rate lookup with missing date (what fallback?)

For each edge case: test the expected behavior AND test that the system handles the unexpected case gracefully (no crashes, no data corruption, proper error messages)."
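The minor-unit edge cases above (JPY with 0 decimals, BHD with 3) come down to per-currency quantization. A minimal sketch, assuming ISO 4217 minor-unit exponents and half-up rounding (the actual rounding mode in the company's system is not stated):

```python
from decimal import Decimal, ROUND_HALF_UP

# Hedged sketch of per-currency rounding. MINOR_UNITS follows ISO 4217;
# ROUND_HALF_UP is an assumption, not the case study's documented choice.
MINOR_UNITS = {"USD": 2, "JPY": 0, "BHD": 3}

def convert(amount: Decimal, rate: Decimal, target: str) -> Decimal:
    exp = Decimal(1).scaleb(-MINOR_UNITS[target])  # 0.01 for USD, 1 for JPY
    return (amount * rate).quantize(exp, rounding=ROUND_HALF_UP)

# JPY has no decimal places: round to whole yen
assert convert(Decimal("10.00"), Decimal("151.2345"), "JPY") == Decimal("1512")
# BHD keeps three decimal places
assert convert(Decimal("10.00"), Decimal("0.37695"), "BHD") == Decimal("3.770")
# A sub-unit result rounds to zero — worth an explicit test, not a crash
assert convert(Decimal("0.01"), Decimal("10"), "JPY") == Decimal("0")
```

Using `Decimal` rather than floats also sidesteps edge case 9 (floating-point accumulation across conversion chains), which is why most payment systems avoid binary floats for money.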
Codex task for security tests:

"Generate security tests for the payment API endpoints. Test for:
1. SQL injection in search parameters (order lookup by ID)
2. XSS in webhook payload processing
3. Authentication bypass (missing token, expired token, wrong scope)
4. Authorization escalation (user A accessing user B's payments)
5. Rate limiting enforcement (verify 429 after threshold)
6. PII exposure in error messages (card numbers, SSN in stack traces)
7. IDOR (Insecure Direct Object Reference) on payment endpoints
8. Replay attacks on webhook signatures
9. Integer overflow in amount fields
10. Path traversal in file export endpoints

Use the OWASP testing methodology. Each test should verify both that the attack fails AND that the failure is handled securely (proper logging, no information leakage)."
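The "attack fails AND fails securely" requirement is the part generated tests most often miss. A hedged sketch of an IDOR test (item 7), with `get_payment` and the in-memory store as hypothetical stand-ins for the real endpoint and database:

```python
# Hedged sketch of an IDOR test: user B must not read user A's payment,
# and the rejection must not leak ownership information.

PAYMENTS = {"pay_1": {"owner": "user_a", "amount": 100}}

def get_payment(requesting_user: str, payment_id: str) -> dict:
    payment = PAYMENTS[payment_id]
    if payment["owner"] != requesting_user:
        # Fail securely: same error as "not found", so IDs can't be enumerated
        raise LookupError("payment not found")
    return payment

def test_idor_rejected_securely():
    try:
        get_payment("user_b", "pay_1")
    except LookupError as exc:
        assert "user_a" not in str(exc)  # rejection leaks no owner info
    else:
        raise AssertionError("user_b read user_a's payment")

test_idor_rejected_securely()
```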
Phase 5: Human Review and Refinement (Week 11-12)
Every Codex-generated test was reviewed by a human engineer:
Review checklist per test:
- [ ] Assertion is meaningful (not tautological)
- [ ] Assertion verifies business logic (not implementation detail)
- [ ] Test is independent (no dependency on other test execution order)
- [ ] Mocks are realistic (return data that matches real API responses)
- [ ] Edge case is actually an edge case (not a normal case mislabeled)
- [ ] Security test actually tests the vulnerability (not just making a request)
- [ ] Test name clearly describes what is being tested
- [ ] Test would fail if the bug it guards against were introduced
Review findings:
- 78% of Codex tests passed review without changes
- 14% needed minor adjustments (mock data correction, assertion refinement)
- 5% needed significant rewrite (wrong business logic assumption)
- 3% were deleted (testing implementation details, not behavior)
Results After 12 Weeks
Coverage Achievement
| Module | Before | After | Target | Status |
|---|---|---|---|---|
| Payment processing (Tier 1) | 28% | 94% | 95% | Near target |
| Gateway integrations (Tier 2) | 32% | 88% | 85% | Exceeded |
| Standard modules (Tier 3) | 40% | 82% | 75% | Exceeded |
| Overall | 35% | 88% | 90% | Near target |
The remaining 2% gap was closed in week 13 with focused manual test writing for the most complex untested paths.
Test Quality Metrics
| Metric | Value |
|---|---|
| Total tests generated by Codex | 2,340 |
| Tests passing review without changes | 1,825 (78%) |
| Tests needing minor adjustment | 327 (14%) |
| Tests needing significant rewrite | 117 (5%) |
| Tests deleted after review | 70 (3%) |
| Final test count in suite | 2,097 |
| Mutation testing score | 72% (good) |
| Full suite execution time | 4.2 minutes |
Production Impact
In the 6 months following test suite completion:
- Production incidents from tested code: 0 (down from 4 in the prior 6 months)
- Deployment frequency: increased from weekly to daily (confidence from test coverage)
- Time to fix bugs: decreased 45% (tests pinpoint the failing component)
- Regulatory audit: passed with no findings related to test coverage
Engineering Velocity
| Metric | Before | After | Change |
|---|---|---|---|
| Deploys per week | 1 | 5 | +400% |
| Bugs caught before production | 30% | 78% | +48pp |
| Time to merge a PR | 3.2 days | 1.4 days | -56% |
| Developer confidence (survey 1-5) | 2.8 | 4.3 | +54% |
Cost Analysis
Codex-assisted approach (12 weeks):
- Codex API cost: $2,800
- Engineer time (review + manual tests): 480 hours
- Engineer cost (at $75/hour): $36,000
- Total: $38,800

Manual approach (estimated 20 weeks):
- Engineer time: 2,400 hours (12 engineers x 200 hours)
- Engineer cost: $180,000
- No API cost
- Total: $180,000

Savings: $141,200 (78%). Time savings: 8 weeks (40% faster).
What Went Wrong
Problem 1: Codex Generated Tests That Passed for the Wrong Reason
Several tests were green — but they were testing that a mock returned what the mock was told to return, not that the actual function was correct. The tests would pass even if the production code was completely broken.
Root cause: Codex set up mocks that returned the expected value, then asserted that the function returned that value. The test verified the mock setup, not the function logic.
Fix: The team added a review rule: “Every test must include at least one assertion that would fail if the function’s core logic were removed.” They also ran mutation testing (mutmut) which caught tests that never failed regardless of code changes.
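The failure mode is easiest to see side by side. In this illustrative sketch (all names hypothetical), the "bad" assertion only verifies the mock's canned return value, while the "good" assertion checks the argument the function computed, which fails if the core logic is removed:

```python
from unittest.mock import Mock

def charge_total(gateway, amount_cents: int) -> int:
    """Stand-in: charge the amount plus a 30-cent fixed fee via the gateway."""
    return gateway.charge(amount_cents + 30)

gateway = Mock()
gateway.charge.return_value = 1030

# BAD: verifies the mock setup, not the function. This passes even if
# charge_total forgot the fee or called the gateway with the wrong amount.
assert charge_total(gateway, 1000) == 1030

# GOOD: verifies the value the function computed. Removing the "+ 30"
# fee logic makes this assertion fail.
gateway.charge.assert_called_once_with(1030)
```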
Problem 2: Flaky Tests from Time-Dependent Code
Codex-generated tests for timeout handling used real time delays (sleep(10) in tests). These were flaky in CI — sometimes passing, sometimes failing depending on system load.
Fix: The team provided updated mock patterns: “For time-dependent tests, mock the clock rather than using real delays. Use freezegun or unittest.mock.patch for datetime.now(). Never use time.sleep() in tests.”
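A minimal sketch of the clock-mocking pattern using only the standard library (`unittest.mock.patch`; freezegun works the same way for `datetime`). All names here are hypothetical stand-ins for the team's retry code:

```python
import time
from unittest.mock import patch

def retry_once(operation, delay_seconds: float = 2.0):
    """Stand-in retry helper: try, wait, try once more, then give up."""
    try:
        return operation()
    except TimeoutError:
        time.sleep(delay_seconds)  # patched away in tests
        return operation()

def test_retry_waits_then_succeeds():
    attempts = []
    def flaky():
        attempts.append(1)
        if len(attempts) == 1:
            raise TimeoutError
        return "ok"

    with patch("time.sleep") as fake_sleep:  # no real 2-second delay
        assert retry_once(flaky) == "ok"
    fake_sleep.assert_called_once_with(2.0)  # backoff interval still verified
    assert len(attempts) == 2

test_retry_waits_then_succeeds()
```

The test runs in milliseconds regardless of CI load, yet still verifies the 2-second backoff interval, because the assertion is on the argument passed to `sleep`, not on elapsed wall-clock time.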
Problem 3: Business Logic Assumptions Were Sometimes Wrong
Codex inferred business rules from code comments and variable names, which were sometimes outdated or misleading. A comment saying `# max retry: 3` when the code actually retried 5 times led to incorrect test assertions.
Fix: The team provided a business rules document to Codex for each module, separate from the code. The explicit rules overrode any assumptions from code comments.
Lessons for Engineering Teams
Characterization Tests First, Correctness Tests Second
Starting with characterization tests (what the code does) before correctness tests (what it should do) provides a safety net. If a characterization test fails after you add correctness tests, you know you changed behavior — which may or may not be intentional.
Explicit Business Rules Produce Better Tests
Codex writes better tests when you tell it the rules than when you ask it to infer rules from code. Invest time in documenting business rules for each module before generating tests.
Mutation Testing Validates Test Quality
Code coverage tells you which lines are executed during tests. Mutation testing tells you which lines are actually verified by tests. A test suite with 90% coverage but 50% mutation score is weaker than 80% coverage with 75% mutation score. Use mutation testing to validate Codex-generated tests.
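The coverage-versus-mutation distinction fits in a few lines. In this illustrative example (not from the case study), both tests execute every line of `fee`, so both contribute to coverage, but only the second would kill a mutant that deletes the fixed fee:

```python
def fee(amount_cents: int) -> int:
    return amount_cents // 100 + 30  # 1% fee plus a 30-cent fixed fee

def test_covers_but_verifies_nothing():
    fee(1000)                        # 100% line coverage, zero assertions

def test_kills_the_mutant():
    assert fee(1000) == 40           # fails if either term is mutated away

test_covers_but_verifies_nothing()
test_kills_the_mutant()
```

A tool like mutmut automates this: it applies small mutations (flip an operator, delete a constant) and reports which ones no test catches.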
Review Is Non-Negotiable
The 78% acceptance rate means 22% of Codex tests needed human intervention. For fintech — where incorrect tests create a false sense of security around code that handles money — every test must be reviewed. The 3% deletion rate included tests that, if left in, would have masked bugs.
Frequently Asked Questions
Is 90% code coverage a meaningful target for fintech?
It depends on what is covered. 90% overall coverage with 50% coverage on payment processing is worse than 80% overall with 99% on payment processing. The tier-based approach ensures critical code gets the highest coverage.
Can Codex generate tests for legacy code with no documentation?
Yes, with characterization tests. Codex reads the code, generates tests that document current behavior, and the team then verifies whether that behavior is correct. This is actually Codex’s strongest use case — documenting undocumented code through tests.
How do you maintain 2,000+ tests over time?
The same way you maintain any code: each PR that changes production code must update the corresponding tests. Codex helps here too — “update the tests in test_payment_processor.py to reflect the new retry logic I just added to payment_processor.py.”
Should we use Codex or Claude Code for test generation?
Both work. Codex is better for autonomous, large-scale batch generation (generate 100 tests, review later). Claude Code is better for interactive, exploratory test writing (discuss the module, write tests together). This fintech used Codex for the bulk generation and Claude Code for the complex edge cases that required discussion.
What about test maintenance cost?
The 2,097 tests required approximately 5-8 hours per week of maintenance (fixing broken tests after code changes, adding tests for new features). This is reasonable for a 12-person team and a small price for the production stability improvement.