Agentic AI Engineering: Comparing Local Coding Models – Early 2026

A lot of the work we do as software engineers is not just creating and shipping new applications with rich APIs, events and other distributed-system components. Increasingly it is legacy application modernisation, now that we have realised we can apply the power of large language models to reading code, extracting logic, and documenting or forward-engineering it. Well, at least we hope we do more of this!

With this intent I have built up, over time, a local set of code samples (.NET, Java, web services, PL/SQL, even COBOL / CAGEN applications) that are functional and dense enough to make a good test set for evaluating AI models, coding agents and humans at transforming them into updated services. The set includes PL/SQL stored procedures, Java JEE EJBs and .NET WCF services, and the test requires turning them into API specifications, business-rules documentation and modern implementations, since we want to use “spec-driven development” in an application-modernisation context.

April 2026 – Gemma 4 release

With the April 2026 Gemma 4 release, I was excited to try a new model and set out to compare Gemma 4 with Qwen 3.5. I lack good local hardware, so my tests use Ollama Cloud and the best of each family: while I could compare gemma4:e4b against qwen3.5:9b locally, I chose qwen3.5:397b-cloud (Alibaba’s 397-billion-parameter flagship) versus gemma4:31b-cloud (Google’s 31-billion-parameter model).

Both models are free via Ollama, both support 256K context windows, vision, tool use, and thinking/reasoning. The parameter count gap is 12x. The question: does that translate to meaningfully better output for real modernisation work?

I ran two rounds of tests — 12 tasks total — covering both general coding agent work and application modernisation specifically.

Part 1: General Coding Agent Tests

Six tasks that mirror everyday coding agent work: writing code, fixing bugs, reviewing PRs, refactoring, writing tests, and designing systems.

Test Results

| Test | qwen3.5:397b-cloud | gemma4:31b-cloud | Winner |
|---|---|---|---|
| 1. Code Generation: merge sorted streams (Python, min-heap; task sketched below) | 121.1s · Clean, correct, followed “code only” instruction | 16.9s · Correct, but added examples despite “code only” instruction | gemma4 (speed) / qwen3.5 (instruction following) |
| 2. Bug Fixing: thread-safe rate limiter | 19.4s · Found 6 bugs incl. KeyError, race conditions, stale timestamps | 17.3s · Found 5 bugs, missed KeyError on del | qwen3.5 |
| 3. Security Code Review: SQL-injection-riddled REST API | 52.4s · Found 12 issues incl. IDOR, mass assignment, privilege escalation | 38.3s · Found 8 issues, missed mass assignment & IDOR | qwen3.5 |
| 4. Refactoring: nested callback hell to clean code | 25.4s · Enterprise-style: extracted constants, separate functions, exports | 63.8s · Modern concise JS: arrow fns, optional chaining, ?? | Tie |
| 5. Test Writing: LRU cache Jest tests | 83.7s · 15+ test cases, excellent edge-case coverage | 141.7s · 11 test cases, creative async interleaving test | qwen3.5 |
| 6. System Design: webhook delivery with retries + DLQ | 68.8s · 5 tables, full TypeScript types, HMAC with timing-safe compare | 45.2s · 2 tables, simpler but explained patterns well | qwen3.5 (completeness) / gemma4 (clarity) |

Result: qwen3.5 won 4 out of 6 tests.
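For context on what test 1 asked for, here is a minimal sketch of the merge-sorted-streams task itself, written by me as an illustration rather than taken from either model’s output:

```python
import heapq
from typing import Iterable, Iterator

_SENTINEL = object()

def merge_sorted_streams(*streams: Iterable) -> Iterator:
    """Lazily merge already-sorted streams into one sorted stream via a min-heap."""
    iterators = [iter(s) for s in streams]
    heap = []
    # Seed the heap with the first element of each stream; the stream index
    # breaks ties between equal values and says which iterator to advance.
    for idx, it in enumerate(iterators):
        first = next(it, _SENTINEL)
        if first is not _SENTINEL:
            heapq.heappush(heap, (first, idx))
    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        nxt = next(iterators[idx], _SENTINEL)
        if nxt is not _SENTINEL:
            heapq.heappush(heap, (nxt, idx))

print(list(merge_sorted_streams([1, 4, 7], [2, 5], [3, 6])))  # [1, 2, 3, 4, 5, 6, 7]
```

(The standard library’s heapq.merge does the same job; writing it out by hand is what makes the task a reasonable probe of correctness.)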

Coding Agent Timing

| Metric | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Total time | 370.8s | 323.2s |
| Average per test | 61.8s | 53.9s |
| Fastest test | 19.4s (bug fix) | 16.9s (code gen) |
| Slowest test | 121.1s (code gen) | 141.7s (test writing) |

Result: gemma4 was slightly faster.

Quality Scores — Coding Agent (out of 5)

| Dimension | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Code correctness | 5 / 5 | 5 / 5 |
| Security awareness | 5 / 5 | 4 / 5 |
| Instruction following | 5 / 5 | 4 / 5 |
| Thoroughness | 5 / 5 | 4 / 5 |
| Code style / idioms | 4 / 5 | 5 / 5 |
| Explanation quality | 4 / 5 | 5 / 5 |
| Speed | 3 / 5 | 4 / 5 |
| Overall | 4.4 / 5 | 4.3 / 5 |

Result: qwen3.5 was slightly better.

Part 2: Application Modernisation Tests

These six tests mirror my actual workflow: reading legacy code, extracting rules, generating specs, forward engineering, designing integrations, and writing parity test specifications. I want to apply task- and spec-driven development: ask the agents to break the work down, follow the specs, and then complete the tasks.

The Tests

  1. PL/SQL Business Rules Extraction — Read an insurance claims stored procedure (~130 lines) and extract every business rule with boundary values, edge cases, and modernisation risks
  2. Java JEE Documentation — Document an EJB order fulfillment service for a modernisation handover: data flow, dependencies, error handling, state transitions
  3. OpenAPI 3.1 Spec Generation — Generate a complete API spec from business requirements including JWT auth, rate limiting, RFC 9457 error responses (the error shape is sketched after this list), and pagination
  4. .NET Forward Engineering — Convert a WCF service with ADO.NET and stored procedures to .NET 8 minimal API with EF Core, FluentValidation, and structured logging
  5. Integration Architecture Design — Design the integration layer for decomposing a monolith into 5 microservices with event schemas, API contracts, and compensation strategies
  6. Test Specification — Write a comprehensive parity test spec from extracted business rules to validate the modernised system matches legacy behaviour
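Since RFC 9457 (Problem Details) comes up in both the spec-generation task and the results, here is the error shape the models were asked to produce, as a hand-written illustration with made-up values:

```python
# An RFC 9457 "Problem Details" error body, served as application/problem+json.
# The field names come from the RFC; the values are invented for illustration.
problem_details = {
    "type": "https://example.com/problems/claim-limit-exceeded",  # hypothetical URI
    "title": "Claim limit exceeded",
    "status": 422,
    "detail": "Claim amount 12500 exceeds the assessor approval threshold of 10000.",
    "instance": "/claims/CLM-2031",
}
```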

Test Results

| Test | qwen3.5:397b-cloud | gemma4:31b-cloud | Winner |
|---|---|---|---|
| 1. PL/SQL Rules Extraction: insurance claims procedure | 139.5s · 21 business rules with boundary values, edge cases, risk flags, and 6 modernisation recommendations | 51.6s · 13 rules in clean categories; missed audit-log inconsistencies, race conditions, empty string vs NULL edge case | qwen3.5 |
| 2. Java JEE Documentation: EJB order fulfillment | 27.9s · Full tech doc with Mermaid sequence diagram, state transition table, JNDI resources, error handling matrix | 33.3s · Clean doc with modernisation roadmap (Saga pattern, Outbox, CompletableFuture); caught the JMS dual-write problem | Tie |
| 3. OpenAPI Spec Generation: claims API from requirements | 103.0s · Complete OAS 3.1 with all paths, $ref schemas, rate limit headers, Problem Details (RFC 9457), pagination, examples | 53.0s · Good OAS 3.1 with schemas and security, but fewer paths and missing reusable components | qwen3.5 |
| 4. .NET Forward Engineering: WCF to .NET 8 minimal API | 57.8s · Full solution: Result pattern, EF Core model config, FluentValidation, Serilog, NuGet refs, architectural rationale | 43.9s · Clean solution with records, repository pattern, async throughout, but less complete Program.cs | qwen3.5 |
| 5. Integration Design: monolith to microservices | 48.8s · CloudEvents schemas, Transactional Outbox with code (.NET + Node.js), Saga compensation, CQRS | 28.6s · Clear sync vs async decision matrix, OpenAPI contract for sync calls, Outbox pattern, compensation events | Tie |
| 6. Test Specification: parity testing from business rules | 63.8s · 34 test cases + 7 cross-rule interaction tests + boundary value matrix + rule precedence matrix + negative tests | 36.7s · 33 test cases + BVA section + negative tests, but no cross-rule interaction or precedence analysis | qwen3.5 |

Result: qwen3.5 won 4 out of 6 tests, with 2 ties.

App Modernisation Timing

| Metric | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Total time | 440.8s | 247.1s |
| Average per test | 73.5s | 41.2s |
| Fastest test | 27.9s (Java doc) | 28.6s (integration) |
| Slowest test | 139.5s (PL/SQL rules) | 53.0s (OpenAPI) |

Result: gemma4 averaged 41.2s per test to qwen3.5’s 73.5s, roughly 78% faster across the app modernisation tests.

Quality Scores — App Modernisation (out of 5)

| Dimension | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Legacy code comprehension | 5 / 5 | 4 / 5 |
| Business rules extraction | 5 / 5 | 4 / 5 |
| API spec generation | 5 / 5 | 4 / 5 |
| Forward engineering | 5 / 5 | 4 / 5 |
| Integration design | 5 / 5 | 5 / 5 |
| Test specification | 5 / 5 | 4 / 5 |
| Modernisation insight | 4 / 5 | 5 / 5 |
| Speed | 3 / 5 | 5 / 5 |
| Overall | 4.6 / 5 | 4.3 / 5 |

The Key Findings

Finding 1: Qwen 3.5 is better when applied to modernisation work

On general coding tasks, the two models were neck and neck (4.4 vs 4.3). On application modernisation, the gap widened to 4.6 vs 4.3. Modernisation tasks reward thoroughness, and that’s where qwen3.5’s extra parameters help.

Finding 2: 21 rules vs 13 from the same procedure

This was the most telling result. Given the same ~130-line PL/SQL insurance claims procedure, qwen3.5 extracted 21 distinct business rules while gemma4 found 13.

The 8 rules gemma4 missed weren’t obscure. They included:

  • Boundary conditions (fraud score 70 passes, 71 blocks — is the check > or >=?)
  • A race condition in the claim frequency counter
  • The difference between NULL and empty string in PL/SQL for the rejection notes check
  • A COMMIT inside an exception handler after a ROLLBACK (which only commits the error log)
  • Inconsistent audit logging paths depending on early returns

Every one of those is the kind of thing that becomes a production bug if you miss it during migration.
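This is also why parity tests have to pin exact boundaries. A minimal pytest sketch, assuming a hypothetical is_fraud_blocked port of the fraud-score rule (whether the legacy check was > or >= is precisely what the test nails down):

```python
import pytest

# Hypothetical modernised port of the legacy rule: block when fraud_score > 70.
def is_fraud_blocked(fraud_score: int) -> bool:
    return fraud_score > 70

# 70 must pass and 71 must block, exactly as the legacy procedure behaves;
# if the legacy check was actually >=, the (70, False) case fails and flags the drift.
@pytest.mark.parametrize("score,expected", [(69, False), (70, False), (71, True)])
def test_fraud_score_boundary(score: int, expected: bool) -> None:
    assert is_fraud_blocked(score) == expected
```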

Finding 3: Cross-rule interaction tests matter

Both models wrote roughly the same number of test cases for the parity test specification (34 vs 33). But qwen3.5 added 7 cross-rule interaction tests — scenarios like “what happens when fraud score is exactly 70 AND the claim exceeds the assessor’s threshold AND the customer has 4 claims in 12 months?”

Those combination tests are where legacy migration bugs hide. Individual rules work fine in isolation. It’s the interactions that break.
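A hedged sketch of what such a combination test looks like, assuming a hypothetical assess_claim port with an assumed rule precedence (fraud check first, then claim frequency, then amount threshold); in a real parity suite the expected values come from replaying the same inputs through the legacy procedure:

```python
import pytest

# Hypothetical port of the legacy decision logic with assumed precedence.
def assess_claim(fraud_score: int, amount: float, threshold: float, claims_in_12m: int) -> str:
    if fraud_score > 70:
        return "BLOCKED"
    if claims_in_12m >= 5:
        return "MANUAL_REVIEW"
    if amount > threshold:
        return "REFER_ASSESSOR"
    return "APPROVED"

# Each case fires two or three rules at once; the expected value encodes
# which rule wins, which single-rule tests never check.
@pytest.mark.parametrize("fraud,claims,amount,expected", [
    (70, 4, 15000, "REFER_ASSESSOR"),  # fraud at the boundary passes, amount rule fires
    (71, 4, 15000, "BLOCKED"),         # fraud outranks the amount rule
    (70, 5, 15000, "MANUAL_REVIEW"),   # frequency outranks the amount rule
    (71, 5, 15000, "BLOCKED"),         # fraud outranks everything
])
def test_cross_rule_interactions(fraud, claims, amount, expected):
    assert assess_claim(fraud, amount, threshold=10000, claims_in_12m=claims) == expected
```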

Finding 4: gemma4 had better modernisation instincts

On the Java JEE documentation task, gemma4 caught something qwen3.5 didn’t lead with: the EJB sends a JMS message to the warehouse before the database transaction commits. If the commit fails, the warehouse already has a pick request for an order that doesn’t exist. gemma4 immediately recommended the Transactional Outbox pattern and the Saga pattern for compensation.

gemma4’s modernisation recommendations were consistently more architectural. It thinks in patterns.
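For readers who haven’t met the Transactional Outbox pattern, here is a minimal sketch of the idea, using Python’s sqlite3 for brevity (table setup omitted; the schema and function are my illustration, not either model’s output):

```python
import json
import sqlite3

def place_order(conn: sqlite3.Connection, order_id: str, event_payload: dict) -> None:
    # The order row and the outbox row commit or roll back together,
    # so an event can never exist for an order that was never persisted.
    with conn:  # one atomic transaction
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, body, sent) VALUES (?, ?, ?, 0)",
            (order_id, "OrderPlaced", json.dumps(event_payload)),
        )
    # Nothing is published here. A separate relay process polls unsent outbox
    # rows, publishes them to the broker (JMS, Kafka, ...), then marks them sent.
    # That removes the dual-write window the legacy EJB had.
```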

Model Specifications

| Spec | qwen3.5:397b-cloud | gemma4:31b-cloud |
|---|---|---|
| Provider | Alibaba (Qwen) | Google (Gemma) |
| Parameters | 397B | ~32B |
| Context window | 256K tokens | 256K tokens |
| Vision | Yes | Yes |
| Tool use | Yes | Yes |
| Thinking/reasoning | Yes | Yes |
| Quantization | BF16 | BF16 |
| Local download | Cloud only | 9.6 GB (also runs locally) |
| Cost | Free (Ollama cloud) | Free (Ollama cloud or local) |

My Setup

I’m using qwen3.5:397b-cloud as my primary coding agent for the work that matters most: rules extraction from legacy code, OpenAPI spec generation, forward engineering, and parity test specifications.

gemma4:31b-cloud is my secondary — I reach for it when I need quick documentation from code, a fast second opinion on integration patterns, or when I’m iterating rapidly and don’t need exhaustive analysis.

Recommendation by Use Case

| Use case | Recommended model |
|---|---|
| Legacy code rules extraction | qwen3.5:397b-cloud |
| OpenAPI / spec generation | qwen3.5:397b-cloud |
| Forward engineering (.NET, Java) | qwen3.5:397b-cloud |
| Parity test specifications | qwen3.5:397b-cloud |
| Security code review | qwen3.5:397b-cloud |
| Quick code documentation | gemma4:31b-cloud |
| Integration pattern selection | Either (both strong) |
| Rapid prototyping / iteration | gemma4:31b-cloud |

How to Run These Tests Yourself

Both models are available free through Ollama. No API keys, no usage limits.

```bash
# Install Ollama (macOS)
brew install ollama

# Run the models
ollama run qwen3.5:397b-cloud
ollama run gemma4:31b-cloud
```

The test scripts and full raw outputs are available on my GitHub. If you’re doing application modernisation work and want to evaluate models against your own legacy codebase, the scripts are designed to be adapted — just swap in your own code samples.
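If you’d rather script the comparison than use the interactive REPL, a minimal harness along these lines works with the official ollama Python package (pip install ollama; the prompt is a placeholder for your own legacy code):

```python
import time
import ollama  # pip install ollama

MODELS = ["qwen3.5:397b-cloud", "gemma4:31b-cloud"]
PROMPT = "Extract every business rule from this PL/SQL procedure:\n..."  # paste your code

for model in MODELS:
    start = time.time()
    # ollama.chat sends a chat request to the local Ollama daemon,
    # which proxies -cloud model tags through to Ollama's hosted runtime.
    response = ollama.chat(model=model, messages=[{"role": "user", "content": PROMPT}])
    print(f"--- {model} ({time.time() - start:.1f}s) ---")
    print(response["message"]["content"])
```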

For anyone doing application modernisation: the model that catches more edge cases in your legacy code is the one that saves you from production incidents later. A missed boundary condition in a PL/SQL procedure becomes a production defect in your shiny new microservice. That’s where I landed, and that’s why I’m going with qwen3.5:397b-cloud as my primary agent.
