Agentic AI Engineering: Comparing Local Coding Models – Early 2026

A lot of the work we do as software engineers is not just creating and shipping new applications with rich APIs, events and other distributed-system components. Increasingly it is legacy application modernisation, now that we have realised we can apply the power of large language models to reading code, extracting logic, and documenting or forward-engineering it. Well, at least we hope we do more of this!

With this intent I have built up, over time, a local set of code samples (.NET, Java, web services, PL/SQL, even COBOL / CAGEN applications) that are functional and dense enough to make a good test set for evaluating AI models, coding agents and humans at transforming them into updated services. The set includes PL/SQL stored procedures, Java JEE EJBs and .NET WCF services, and the test requires turning them into API specifications, business-rules documentation and modern implementations, since we want to use “spec-driven development” in an application-modernisation context.

April 2026 – Gemma 4 release

With the April 2026 Gemma 4 release, I was excited to try a new model and set out to compare Gemma 4 with Qwen 3.5. I lack good local hardware, so my tests use Ollama Cloud and the best of each family: while I could compare gemma4:e4b against qwen3.5:9b locally, I chose qwen3.5:397b-cloud (Alibaba’s 397-billion-parameter flagship) versus gemma4:31b-cloud (Google’s 31-billion-parameter model).

Both models are free via Ollama, both support 256K context windows, vision, tool use, and thinking/reasoning. The parameter count gap is 12x. The question: does that translate to meaningfully better output for real modernisation work?

I ran two rounds of tests — 12 tasks total — covering both general coding agent work and application modernisation specifically.

Part 1: General Coding Agent Tests

Six tasks that mirror everyday coding agent work: writing code, fixing bugs, reviewing PRs, refactoring, writing tests, and designing systems.

Test Results

| Test | qwen3.5:397b-cloud | gemma4:31b-cloud | Winner |
|---|---|---|---|
| 1. Code Generation: merge sorted streams (Python, min-heap; task sketched below) | 121.1s · Clean, correct, followed “code only” instruction | 16.9s · Correct, but added examples despite “code only” instruction | gemma4 (speed) / qwen3.5 (instruction following) |
| 2. Bug Fixing: thread-safe rate limiter | 19.4s · Found 6 bugs incl. KeyError, race conditions, stale timestamps | 17.3s · Found 5 bugs, missed KeyError on del | qwen3.5 |
| 3. Security Code Review: SQL-injection-riddled REST API | 52.4s · Found 12 issues incl. IDOR, mass assignment, privilege escalation | 38.3s · Found 8 issues, missed mass assignment & IDOR | qwen3.5 |
| 4. Refactoring: nested callback hell to clean code | 25.4s · Enterprise-style: extracted constants, separate functions, exports | 63.8s · Modern concise JS: arrow fns, optional chaining, ?? | Tie |
| 5. Test Writing: LRU cache Jest tests | 83.7s · 15+ test cases, excellent edge-case coverage | 141.7s · 11 test cases, creative async interleaving test | qwen3.5 |
| 6. System Design: webhook delivery with retries + DLQ | 68.8s · 5 tables, full TypeScript types, HMAC with timing-safe compare | 45.2s · 2 tables, simpler but explained patterns well | qwen3.5 (completeness) / gemma4 (clarity) |

Result: qwen3.5 won 4 out of 6 tests.
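For context on what test 1 asked for, here is a minimal sketch of the merge-sorted-streams task itself, written by me as an illustration rather than taken from either model’s output:

```python
import heapq
from typing import Iterable, Iterator

_SENTINEL = object()

def merge_sorted_streams(*streams: Iterable) -> Iterator:
    """Lazily merge already-sorted streams into one sorted stream via a min-heap."""
    iterators = [iter(s) for s in streams]
    heap = []
    # Seed the heap with the first element of each stream; the stream index
    # breaks ties between equal values and says which iterator to advance.
    for idx, it in enumerate(iterators):
        first = next(it, _SENTINEL)
        if first is not _SENTINEL:
            heapq.heappush(heap, (first, idx))
    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        nxt = next(iterators[idx], _SENTINEL)
        if nxt is not _SENTINEL:
            heapq.heappush(heap, (nxt, idx))

print(list(merge_sorted_streams([1, 4, 7], [2, 5], [3, 6])))  # [1, 2, 3, 4, 5, 6, 7]
```

(The standard library’s heapq.merge does the same job; writing it out by hand is what makes the task a reasonable probe of correctness.)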

Coding Agent Timing

| Metric | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Total time | 370.8s | 323.2s |
| Average per test | 61.8s | 53.9s |
| Fastest test | 19.4s (bug fix) | 16.9s (code gen) |
| Slowest test | 121.1s (code gen) | 141.7s (test writing) |

Result: gemma4 was slightly faster.

Quality Scores — Coding Agent (out of 5)

| Dimension | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Code correctness | 5 / 5 | 5 / 5 |
| Security awareness | 5 / 5 | 4 / 5 |
| Instruction following | 5 / 5 | 4 / 5 |
| Thoroughness | 5 / 5 | 4 / 5 |
| Code style / idioms | 4 / 5 | 5 / 5 |
| Explanation quality | 4 / 5 | 5 / 5 |
| Speed | 3 / 5 | 4 / 5 |
| Overall | 4.4 / 5 | 4.3 / 5 |

Result: qwen3.5 was slightly better.

Part 2: Application Modernisation Tests

These six tests mirror my actual workflow: reading legacy code, extracting rules, generating specs, forward engineering, designing integrations, and writing parity test specifications. I want to apply task- and spec-driven development: ask the agents to break the work down, follow the specs, and then complete the tasks.

The Tests

  1. PL/SQL Business Rules Extraction — Read an insurance claims stored procedure (~130 lines) and extract every business rule with boundary values, edge cases, and modernisation risks
  2. Java JEE Documentation — Document an EJB order fulfillment service for a modernisation handover: data flow, dependencies, error handling, state transitions
  3. OpenAPI 3.1 Spec Generation — Generate a complete API spec from business requirements including JWT auth, rate limiting, RFC 9457 error responses (the error shape is sketched after this list), and pagination
  4. .NET Forward Engineering — Convert a WCF service with ADO.NET and stored procedures to .NET 8 minimal API with EF Core, FluentValidation, and structured logging
  5. Integration Architecture Design — Design the integration layer for decomposing a monolith into 5 microservices with event schemas, API contracts, and compensation strategies
  6. Test Specification — Write a comprehensive parity test spec from extracted business rules to validate the modernised system matches legacy behaviour
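Since RFC 9457 (Problem Details) comes up in both the spec-generation task and the results, here is the error shape the models were asked to produce, as a hand-written illustration with made-up values:

```python
# An RFC 9457 "Problem Details" error body, served as application/problem+json.
# The field names come from the RFC; the values are invented for illustration.
problem_details = {
    "type": "https://example.com/problems/claim-limit-exceeded",  # hypothetical URI
    "title": "Claim limit exceeded",
    "status": 422,
    "detail": "Claim amount 12500 exceeds the assessor approval threshold of 10000.",
    "instance": "/claims/CLM-2031",
}
```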

Test Results

| Test | qwen3.5:397b-cloud | gemma4:31b-cloud | Winner |
|---|---|---|---|
| 1. PL/SQL Rules Extraction: insurance claims procedure | 139.5s · 21 business rules with boundary values, edge cases, risk flags, and 6 modernisation recommendations | 51.6s · 13 rules in clean categories; missed audit-log inconsistencies, race conditions, empty string vs NULL edge case | qwen3.5 |
| 2. Java JEE Documentation: EJB order fulfillment | 27.9s · Full tech doc with Mermaid sequence diagram, state transition table, JNDI resources, error handling matrix | 33.3s · Clean doc with modernisation roadmap (Saga pattern, Outbox, CompletableFuture); caught the JMS dual-write problem | Tie |
| 3. OpenAPI Spec Generation: claims API from requirements | 103.0s · Complete OAS 3.1 with all paths, $ref schemas, rate limit headers, Problem Details (RFC 9457), pagination, examples | 53.0s · Good OAS 3.1 with schemas and security, but fewer paths and missing reusable components | qwen3.5 |
| 4. .NET Forward Engineering: WCF to .NET 8 minimal API | 57.8s · Full solution: Result pattern, EF Core model config, FluentValidation, Serilog, NuGet refs, architectural rationale | 43.9s · Clean solution with records, repository pattern, async throughout, but less complete Program.cs | qwen3.5 |
| 5. Integration Design: monolith to microservices | 48.8s · CloudEvents schemas, Transactional Outbox with code (.NET + Node.js), Saga compensation, CQRS | 28.6s · Clear sync vs async decision matrix, OpenAPI contract for sync calls, Outbox pattern, compensation events | Tie |
| 6. Test Specification: parity testing from business rules | 63.8s · 34 test cases + 7 cross-rule interaction tests + boundary value matrix + rule precedence matrix + negative tests | 36.7s · 33 test cases + BVA section + negative tests, but no cross-rule interaction or precedence analysis | qwen3.5 |

Result: qwen3.5 won 4 out of 6 tests, with 2 ties.

App Modernisation Timing

| Metric | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Total time | 440.8s | 247.1s |
| Average per test | 73.5s | 41.2s |
| Fastest test | 27.9s (Java doc) | 28.6s (integration) |
| Slowest test | 139.5s (PL/SQL rules) | 53.0s (OpenAPI) |

Result: gemma4 averaged 41.2s per test to qwen3.5’s 73.5s, roughly 78% faster across the app modernisation tests.

Quality Scores — App Modernisation (out of 5)

| Dimension | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Legacy code comprehension | 5 / 5 | 4 / 5 |
| Business rules extraction | 5 / 5 | 4 / 5 |
| API spec generation | 5 / 5 | 4 / 5 |
| Forward engineering | 5 / 5 | 4 / 5 |
| Integration design | 5 / 5 | 5 / 5 |
| Test specification | 5 / 5 | 4 / 5 |
| Modernisation insight | 4 / 5 | 5 / 5 |
| Speed | 3 / 5 | 5 / 5 |
| Overall | 4.6 / 5 | 4.3 / 5 |

The Key Findings

Finding 1: Qwen 3.5 is better when applied to modernisation work

On general coding tasks, the two models were neck and neck (4.4 vs 4.3). On application modernisation, the gap widened to 4.6 vs 4.3. Modernisation tasks reward thoroughness, and that’s where qwen3.5’s extra parameters help.

Finding 2: 21 rules vs 13 from the same procedure

This was the most telling result. Given the same ~130-line PL/SQL insurance claims procedure, qwen3.5 extracted 21 distinct business rules while gemma4 found 13.

The 8 rules gemma4 missed weren’t obscure. They included:

  • Boundary conditions (fraud score 70 passes, 71 blocks — is the check > or >=?)
  • A race condition in the claim frequency counter
  • The difference between NULL and empty string in PL/SQL for the rejection notes check
  • A COMMIT inside an exception handler after a ROLLBACK (which only commits the error log)
  • Inconsistent audit logging paths depending on early returns

Every one of those is the kind of thing that becomes a production bug if you miss it during migration.
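This is also why parity tests have to pin exact boundaries. A minimal pytest sketch, assuming a hypothetical is_fraud_blocked port of the fraud-score rule (whether the legacy check was > or >= is precisely what the test nails down):

```python
import pytest

# Hypothetical modernised port of the legacy rule: block when fraud_score > 70.
def is_fraud_blocked(fraud_score: int) -> bool:
    return fraud_score > 70

# 70 must pass and 71 must block, exactly as the legacy procedure behaves;
# if the legacy check was actually >=, the (70, False) case fails and flags the drift.
@pytest.mark.parametrize("score,expected", [(69, False), (70, False), (71, True)])
def test_fraud_score_boundary(score: int, expected: bool) -> None:
    assert is_fraud_blocked(score) == expected
```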

Finding 3: Cross-rule interaction tests matter

Both models wrote roughly the same number of test cases for the parity test specification (34 vs 33). But qwen3.5 added 7 cross-rule interaction tests — scenarios like “what happens when fraud score is exactly 70 AND the claim exceeds the assessor’s threshold AND the customer has 4 claims in 12 months?”

Those combination tests are where legacy migration bugs hide. Individual rules work fine in isolation. It’s the interactions that break.
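A hedged sketch of what such a combination test looks like, assuming a hypothetical assess_claim port with an assumed rule precedence (fraud check first, then claim frequency, then amount threshold); in a real parity suite the expected values come from replaying the same inputs through the legacy procedure:

```python
import pytest

# Hypothetical port of the legacy decision logic with assumed precedence.
def assess_claim(fraud_score: int, amount: float, threshold: float, claims_in_12m: int) -> str:
    if fraud_score > 70:
        return "BLOCKED"
    if claims_in_12m >= 5:
        return "MANUAL_REVIEW"
    if amount > threshold:
        return "REFER_ASSESSOR"
    return "APPROVED"

# Each case fires two or three rules at once; the expected value encodes
# which rule wins, which single-rule tests never check.
@pytest.mark.parametrize("fraud,claims,amount,expected", [
    (70, 4, 15000, "REFER_ASSESSOR"),  # fraud at the boundary passes, amount rule fires
    (71, 4, 15000, "BLOCKED"),         # fraud outranks the amount rule
    (70, 5, 15000, "MANUAL_REVIEW"),   # frequency outranks the amount rule
    (71, 5, 15000, "BLOCKED"),         # fraud outranks everything
])
def test_cross_rule_interactions(fraud, claims, amount, expected):
    assert assess_claim(fraud, amount, threshold=10000, claims_in_12m=claims) == expected
```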

Finding 4: gemma4 had better modernisation instincts

On the Java JEE documentation task, gemma4 caught something qwen3.5 didn’t lead with: the EJB sends a JMS message to the warehouse before the database transaction commits. If the commit fails, the warehouse already has a pick request for an order that doesn’t exist. gemma4 immediately recommended the Transactional Outbox pattern and the Saga pattern for compensation.

gemma4’s modernisation recommendations were consistently more architectural. It thinks in patterns.
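For readers who haven’t met the Transactional Outbox pattern, here is a minimal sketch of the idea, using Python’s sqlite3 for brevity (table setup omitted; the schema and function are my illustration, not either model’s output):

```python
import json
import sqlite3

def place_order(conn: sqlite3.Connection, order_id: str, event_payload: dict) -> None:
    # The order row and the outbox row commit or roll back together,
    # so an event can never exist for an order that was never persisted.
    with conn:  # one atomic transaction
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, body, sent) VALUES (?, ?, ?, 0)",
            (order_id, "OrderPlaced", json.dumps(event_payload)),
        )
    # Nothing is published here. A separate relay process polls unsent outbox
    # rows, publishes them to the broker (JMS, Kafka, ...), then marks them sent.
    # That removes the dual-write window the legacy EJB had.
```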

Model Specifications

| Spec | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Provider | Alibaba (Qwen) | Google (Gemma) |
| Parameters | 397B | ~32B |
| Context window | 256K tokens | 256K tokens |
| Vision | Yes | Yes |
| Tool use | Yes | Yes |
| Thinking/reasoning | Yes | Yes |
| Quantization | BF16 | BF16 |
| Local download | Cloud only | 9.6 GB (also runs locally) |
| Cost | Free (Ollama cloud) | Free (Ollama cloud or local) |

My Setup

I’m using qwen3.5:397b-cloud as my primary coding agent for the work that matters most: rules extraction from legacy code, OpenAPI spec generation, forward engineering, and parity test specifications.

gemma4:31b-cloud is my secondary — I reach for it when I need quick documentation from code, a fast second opinion on integration patterns, or when I’m iterating rapidly and don’t need exhaustive analysis.

Recommendation by Use Case

| Use case | Recommended model |
|---|---|
| Legacy code rules extraction | qwen3.5:397b-cloud |
| OpenAPI / spec generation | qwen3.5:397b-cloud |
| Forward engineering (.NET, Java) | qwen3.5:397b-cloud |
| Parity test specifications | qwen3.5:397b-cloud |
| Security code review | qwen3.5:397b-cloud |
| Quick code documentation | gemma4:31b-cloud |
| Integration pattern selection | Either (both strong) |
| Rapid prototyping / iteration | gemma4:31b-cloud |

How to Run These Tests Yourself

Both models are available free through Ollama. No API keys, no usage limits.

```bash
# Install Ollama (macOS)
brew install ollama

# Run the models
ollama run qwen3.5:397b-cloud
ollama run gemma4:31b-cloud
```

The test scripts and full raw outputs are available on my GitHub. If you’re doing application modernisation work and want to evaluate models against your own legacy codebase, the scripts are designed to be adapted — just swap in your own code samples.
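If you’d rather script the comparison than use the interactive REPL, a minimal harness along these lines works with the official ollama Python package (pip install ollama; the prompt is a placeholder for your own legacy code):

```python
import time
import ollama  # pip install ollama

MODELS = ["qwen3.5:397b-cloud", "gemma4:31b-cloud"]
PROMPT = "Extract every business rule from this PL/SQL procedure:\n..."  # paste your code

for model in MODELS:
    start = time.time()
    # ollama.chat sends a chat request to the local Ollama daemon,
    # which proxies -cloud model tags through to Ollama's hosted runtime.
    response = ollama.chat(model=model, messages=[{"role": "user", "content": PROMPT}])
    print(f"--- {model} ({time.time() - start:.1f}s) ---")
    print(response["message"]["content"])
```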

For anyone doing application modernisation: the model that catches more edge cases in your legacy code is the one that saves you from production incidents later. A missed boundary condition in a PL/SQL procedure becomes a production defect in your shiny new microservice. That’s where I landed, and that’s why I’m going with qwen3.5:397b-cloud as my primary agent.
