Journal of Distributed Software Engineering, Architecture and Design
Agentic AI Engineering: Comparing Local Coding Models – Early 2026
<p class="wp-block-paragraph">A lot of the work we do as software engineers is not just creating and shipping new applications with rich APIs, events and other distributed system components. We are also doing much more <strong>legacy application modernisation</strong>, having realised that we can apply the power of large language models to reading code, extracting logic, and documenting or forward-engineering it. Well, at least we hope we do more of this!</p>
<p class="wp-block-paragraph">With this in mind I have built up, over time, a local set of code samples for .NET, Java, web services, PL/SQL, and even COBOL / CAGEN applications. They are functional and quite dense, which makes them a good sample set for testing the ability of AI models, coding agents, and humans to transform them into updated services. The set includes PL/SQL stored procedures, Java JEE EJBs, and .NET WCF services, and the test requires turning them into API specifications, business rules documentation, and modern implementations, in line with the “spec-driven development” approach to application modernisation.</p>
<h2 class="wp-block-heading">April 2026 – Gemma 4 release </h2>
<p class="wp-block-paragraph">With the April 2026 Gemma 4 release, I was excited to try out a new model and set out to compare Gemma 4 with Qwen 3.5. Since I lack good local hardware, my test uses Ollama Cloud and the largest variants of each model: rather than comparing gemma4:e4b against qwen3.5:9b locally, I chose <strong>qwen3.5:397b-cloud</strong> (Alibaba’s 397-billion-parameter flagship) versus <strong>gemma4:31b-cloud</strong> (Google’s 31-billion-parameter model).</p>
<p class="wp-block-paragraph">Both models are free via Ollama, and both support 256K context windows, vision, tool use, and thinking/reasoning. The parameter-count gap is nearly 13x. The question: does that translate into meaningfully better output for real modernisation work?</p>
<p class="wp-block-paragraph">I ran two rounds of tests — 12 tasks total — covering both general coding agent work and application modernisation specifically.</p>
<h2 class="wp-block-heading">Part 1: General Coding Agent Tests</h2>
<p class="wp-block-paragraph">Six tasks that mirror everyday coding agent work: writing code, fixing bugs, reviewing PRs, refactoring, writing tests, and designing systems.</p>
<h3 class="wp-block-heading">Test Results</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Test</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th><th>Winner</th></tr></thead><tbody><tr><td><strong>1. Code Generation</strong><br>Merge sorted streams (Python, min-heap)</td><td>121.1s · Clean, correct, followed “code only” instruction</td><td>16.9s · Correct, but added examples despite “code only” instruction</td><td>gemma4 (speed) / qwen3.5 (instruction following)</td></tr><tr><td><strong>2. Bug Fixing</strong><br>Thread-safe rate limiter</td><td>19.4s · Found <strong>6 bugs</strong> inc. KeyError, race conditions, stale timestamps</td><td>17.3s · Found <strong>5 bugs</strong>, missed KeyError on <code>del</code></td><td><strong>qwen3.5</strong></td></tr><tr><td><strong>3. Security Code Review</strong><br>SQL injection-riddled REST API</td><td>52.4s · Found <strong>12 issues</strong> inc. IDOR, mass assignment, privilege escalation</td><td>38.3s · Found <strong>8 issues</strong>, missed mass assignment & IDOR</td><td><strong>qwen3.5</strong></td></tr><tr><td><strong>4. Refactoring</strong><br>Nested callback hell to clean code</td><td>25.4s · Enterprise-style: extracted constants, separate functions, exports</td><td>63.8s · Modern concise JS: arrow fns, optional chaining, <code>??</code></td><td>Tie</td></tr><tr><td><strong>5. Test Writing</strong><br>LRU cache Jest tests</td><td>83.7s · <strong>15+ test cases</strong>, excellent edge case coverage</td><td>141.7s · <strong>11 test cases</strong>, creative async interleaving test</td><td><strong>qwen3.5</strong></td></tr><tr><td><strong>6. System Design</strong><br>Webhook delivery with retries + DLQ</td><td>68.8s · <strong>5 tables</strong>, full TypeScript types, HMAC with timing-safe compare</td><td>45.2s · <strong>2 tables</strong>, simpler but explained patterns well</td><td><strong>qwen3.5</strong> (completeness) / gemma4 (clarity)</td></tr></tbody></table></figure>
<p class="is-style-info wp-block-paragraph"><strong>Result: qwen3.5 won 4 out of 6 tests.</strong></p>
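<p class="wp-block-paragraph">The thread-safe rate limiter in Test 2 is representative of the bug class both models were hunting: missing-key errors, check-then-act races, and stale timestamps. The actual test code isn’t reproduced here, but a minimal sketch of a fixed design looks like this (the class name, limits, and window are illustrative):</p>

```python
import threading
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    """Allow at most `limit` calls per `window` seconds, per key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # defaultdict avoids the KeyError class of bug
        self._lock = threading.Lock()    # one lock guards the whole check-then-act

    def allow(self, key: str) -> bool:
        now = time.monotonic()           # monotonic clock is immune to wall-clock jumps
        with self._lock:
            hits = self._hits[key]
            # Evict stale timestamps that have fallen outside the window
            while hits and now - hits[0] > self.window:
                hits.popleft()
            if len(hits) >= self.limit:
                return False
            hits.append(now)
            return True

limiter = SlidingWindowRateLimiter(limit=2, window=60.0)
results = [limiter.allow("client-1") for _ in range(3)]  # third call is rejected
```

<p class="wp-block-paragraph">The design choice worth noting is holding one lock across the evict-check-append sequence: splitting it into separate locked sections reintroduces exactly the race conditions both models flagged.</p>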
<h3 class="wp-block-heading">Coding Agent Timing</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th></tr></thead><tbody><tr><td>Total time</td><td>370.8s</td><td>323.2s</td></tr><tr><td>Average per test</td><td>61.8s</td><td>53.9s</td></tr><tr><td>Fastest test</td><td>19.4s (bug fix)</td><td>16.9s (code gen)</td></tr><tr><td>Slowest test</td><td>121.1s (code gen)</td><td>141.7s (test writing)</td></tr></tbody></table></figure>
<p class="is-style-info wp-block-paragraph"><strong>Result: gemma4 was slightly faster.</strong></p>
<h3 class="wp-block-heading">Quality Scores — Coding Agent (out of 5)</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Dimension</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th></tr></thead><tbody><tr><td>Code correctness</td><td>5 / 5</td><td>5 / 5</td></tr><tr><td>Security awareness</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Instruction following</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Thoroughness</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Code style / idioms</td><td>4 / 5</td><td>5 / 5</td></tr><tr><td>Explanation quality</td><td>4 / 5</td><td>5 / 5</td></tr><tr><td>Speed</td><td>3 / 5</td><td>4 / 5</td></tr><tr><td><strong>Overall</strong></td><td><strong>4.4 / 5</strong></td><td><strong>4.3 / 5</strong></td></tr></tbody></table></figure>
<p class="is-style-info wp-block-paragraph"><strong>Result: qwen3.5 scored slightly better.</strong></p>
<h2 class="wp-block-heading">Part 2: Application Modernisation Tests</h2>
<p class="wp-block-paragraph">These six tests mirror my actual workflow: reading legacy code, extracting rules, generating specs, forward engineering, designing integrations, and writing parity test specifications. I want to apply task- and spec-driven development by asking the agents to break the work down, follow the specs, and then complete the tasks.</p>
<h3 class="wp-block-heading">The Tests</h3>
<ol class="wp-block-list">
<li><strong>PL/SQL Business Rules Extraction</strong> — Read an insurance claims stored procedure (~130 lines) and extract every business rule with boundary values, edge cases, and modernisation risks</li>
<li><strong>Java JEE Documentation</strong> — Document an EJB order fulfillment service for a modernisation handover: data flow, dependencies, error handling, state transitions</li>
<li><strong>OpenAPI 3.1 Spec Generation</strong> — Generate a complete API spec from business requirements including JWT auth, rate limiting, RFC 9457 error responses, and pagination</li>
<li><strong>.NET Forward Engineering</strong> — Convert a WCF service with ADO.NET and stored procedures to .NET 8 minimal API with EF Core, FluentValidation, and structured logging</li>
<li><strong>Integration Architecture Design</strong> — Design the integration layer for decomposing a monolith into 5 microservices with event schemas, API contracts, and compensation strategies</li>
<li><strong>Test Specification</strong> — Write a comprehensive parity test spec from extracted business rules to validate the modernised system matches legacy behaviour</li>
</ol>
<h3 class="wp-block-heading">Test Results</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Test</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th><th>Winner</th></tr></thead><tbody><tr><td><strong>1. PL/SQL Rules Extraction</strong><br>Insurance claims procedure</td><td>139.5s · <strong>21 business rules</strong> with boundary values, edge cases, risk flags, and 6 modernisation recommendations</td><td>51.6s · <strong>13 rules</strong> in clean categories, missed audit log inconsistencies, race conditions, empty string vs NULL edge case</td><td><strong>qwen3.5</strong></td></tr><tr><td><strong>2. Java JEE Documentation</strong><br>EJB order fulfillment</td><td>27.9s · Full tech doc with Mermaid sequence diagram, state transition table, JNDI resources, error handling matrix</td><td>33.3s · Clean doc with modernisation roadmap (Saga pattern, Outbox, CompletableFuture), caught the JMS dual-write problem</td><td>Tie</td></tr><tr><td><strong>3. OpenAPI Spec Generation</strong><br>Claims API from requirements</td><td>103.0s · Complete OAS 3.1 with all paths, $ref schemas, rate limit headers, Problem Details (RFC 9457), pagination, examples</td><td>53.0s · Good OAS 3.1 with schemas and security, but fewer paths and missing reusable components</td><td><strong>qwen3.5</strong></td></tr><tr><td><strong>4. .NET Forward Engineering</strong><br>WCF to .NET 8 minimal API</td><td>57.8s · Full solution: Result pattern, EF Core model config, FluentValidation, Serilog, NuGet refs, architectural rationale</td><td>43.9s · Clean solution with records, repository pattern, async throughout, but less complete Program.cs</td><td><strong>qwen3.5</strong></td></tr><tr><td><strong>5. Integration Design</strong><br>Monolith to microservices</td><td>48.8s · CloudEvents schemas, Transactional Outbox with code (.NET + Node.js), Saga compensation, CQRS</td><td>28.6s · Clear sync vs async decision matrix, OpenAPI contract for sync calls, Outbox pattern, compensation events</td><td>Tie</td></tr><tr><td><strong>6. Test Specification</strong><br>Parity testing from business rules</td><td>63.8s · <strong>34 test cases</strong> + 7 cross-rule interaction tests + boundary value matrix + rule precedence matrix + negative tests</td><td>36.7s · <strong>33 test cases</strong> + BVA section + negative tests, but no cross-rule interaction or precedence analysis</td><td><strong>qwen3.5</strong></td></tr></tbody></table></figure>
<p class="is-style-info wp-block-paragraph"><strong>Result: qwen3.5 won 4 out of 6 tests, with 2 ties.</strong></p>
<h3 class="wp-block-heading">App Modernisation Timing</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Metric</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th></tr></thead><tbody><tr><td>Total time</td><td>440.8s</td><td>247.1s</td></tr><tr><td>Average per test</td><td>73.5s</td><td>41.2s</td></tr><tr><td>Fastest test</td><td>27.9s (Java doc)</td><td>28.6s (integration)</td></tr><tr><td>Slowest test</td><td>139.5s (PL/SQL rules)</td><td>53.0s (OpenAPI)</td></tr></tbody></table></figure>
<p class="is-style-info wp-block-paragraph">gemma4 was roughly <strong>1.8x faster</strong> on average across the app modernisation tests (41.2s vs 73.5s per task).</p>
<h3 class="wp-block-heading">Quality Scores — App Modernisation (out of 5)</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Dimension</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th></tr></thead><tbody><tr><td>Legacy code comprehension</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Business rules extraction</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>API spec generation</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Forward engineering</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Integration design</td><td>5 / 5</td><td>5 / 5</td></tr><tr><td>Test specification</td><td>5 / 5</td><td>4 / 5</td></tr><tr><td>Modernisation insight</td><td>4 / 5</td><td>5 / 5</td></tr><tr><td>Speed</td><td>3 / 5</td><td>5 / 5</td></tr><tr><td><strong>Overall</strong></td><td><strong>4.6 / 5</strong></td><td><strong>4.3 / 5</strong></td></tr></tbody></table></figure>
<h2 class="wp-block-heading">The Key Findings</h2>
<h3 class="wp-block-heading">Finding 1: Qwen 3.5 is better when applied to modernisation work</h3>
<p class="wp-block-paragraph">On general coding tasks, the two models were neck and neck (4.4 vs 4.3). On application modernisation, the gap widened to 4.6 vs 4.3. Modernisation tasks reward thoroughness — and that’s where qwen3.5’s extra parameters help.</p>
<h3 class="wp-block-heading">Finding 2: 21 rules vs 13 from the same procedure</h3>
<p class="wp-block-paragraph">This was the most telling result. Given the same ~130-line PL/SQL insurance claims procedure, qwen3.5 extracted 21 distinct business rules while gemma4 found 13.</p>
<p class="wp-block-paragraph"><strong>The 8 rules gemma4 missed weren’t obscure. They included:</strong></p>
<ul class="wp-block-list">
<li>Boundary conditions (fraud score 70 passes, 71 blocks — is the check <code>></code> or <code>>=</code>?)</li>
<li>A race condition in the claim frequency counter</li>
<li>The difference between NULL and empty string in PL/SQL for the rejection notes check</li>
<li>A COMMIT inside an exception handler after a ROLLBACK (which only commits the error log)</li>
<li>Inconsistent audit logging paths depending on early returns</li>
</ul>
<p class="wp-block-paragraph">Every one of those is the kind of thing that becomes a production bug if you miss it during migration.</p>
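<p class="wp-block-paragraph">The fraud-score boundary above is exactly the kind of rule a parity test must pin down before migration. A sketch of how that single rule becomes executable boundary checks (the threshold and function name are taken from the rule as described, not from either model’s output):</p>

```python
FRAUD_SCORE_LIMIT = 70  # per the extracted rule: 70 passes, 71 blocks

def fraud_check_passes(score: int) -> bool:
    # The observed legacy behaviour implies a strict ">" comparison:
    # block only when score > 70. If the migrated code used ">=" instead,
    # a score of exactly 70 would silently flip outcome in production.
    return not (score > FRAUD_SCORE_LIMIT)

# Boundary-value cases straight from the rule
assert fraud_check_passes(69) is True
assert fraud_check_passes(70) is True   # the boundary itself passes
assert fraud_check_passes(71) is False
```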
<h3 class="wp-block-heading">Finding 3: Cross-rule interaction tests matter</h3>
<p class="wp-block-paragraph">Both models wrote roughly the same number of test cases for the parity test specification (34 vs 33). But qwen3.5 added 7 cross-rule interaction tests — scenarios like “what happens when fraud score is exactly 70 AND the claim exceeds the assessor’s threshold AND the customer has 4 claims in 12 months?”</p>
<p class="wp-block-paragraph">Those combination tests are where legacy migration bugs hide. Individual rules work fine in isolation. It’s the interactions that break.</p>
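<p class="wp-block-paragraph">One cheap way to generate such combination cases is a cross-product of each rule’s boundary values. A sketch — the three rules and their thresholds below are illustrative stand-ins for whatever your extraction step produces:</p>

```python
from itertools import product

# Boundary values pulled from three (illustrative) extracted rules
fraud_scores = [69, 70, 71]                # fraud block threshold at 70
claim_amounts = [49_999, 50_000, 50_001]   # assessor threshold at 50,000
claims_in_12m = [3, 4, 5]                  # claim frequency limit at 4

# Every combination of boundary values becomes one parity test case,
# run against both the legacy procedure and the modernised service.
cases = list(product(fraud_scores, claim_amounts, claims_in_12m))
print(len(cases))  # 27 interaction cases from 9 single-rule boundaries
```

<p class="wp-block-paragraph">Even three rules with three boundary values each yield 27 interaction cases — which is why models that only test rules in isolation leave the riskiest behaviour unverified.</p>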
<h3 class="wp-block-heading">Finding 4: gemma4 had better modernisation instincts</h3>
<p class="wp-block-paragraph">On the Java JEE documentation task, <strong>gemma4 caught something qwen3.5 didn’t lead with: the EJB sends a JMS message to the warehouse <em>before</em> the database transaction commits</strong>. If the commit fails, the warehouse already has a pick request for an order that doesn’t exist. <strong>gemma4 immediately recommended the Transactional Outbox pattern and the Saga pattern for compensation.</strong></p>
<p class="wp-block-paragraph">gemma4’s modernisation recommendations were consistently more architectural. It thinks in patterns.</p>
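<p class="wp-block-paragraph">To make the Transactional Outbox recommendation concrete: the fix is to write the order and its outgoing event in the same database transaction, and let a separate relay publish the event afterwards. A minimal Python/sqlite3 sketch — the table names, event type, and payload shape are made up for illustration, not taken from either model’s output:</p>

```python
import json
import sqlite3

# In-memory stand-in for the service's database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id: int) -> None:
    # The order row and its pick-request event commit in ONE transaction,
    # so the warehouse can never receive an event for an order that was
    # rolled back — the dual-write problem gemma4 flagged disappears.
    with db:  # sqlite3 connection context manager commits or rolls back atomically
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("warehouse.pick_requested", json.dumps({"orderId": order_id})),
        )

place_order(42)
# A relay process would poll unpublished rows and forward them to the broker
pending = db.execute("SELECT event_type FROM outbox WHERE published = 0").fetchall()
```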
<h2 class="wp-block-heading">Model Specifications</h2>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Spec</th><th>qwen3.5:397b-cloud</th><th>gemma4:31b-cloud</th></tr></thead><tbody><tr><td>Provider</td><td>Alibaba (Qwen)</td><td>Google (Gemma)</td></tr><tr><td>Parameters</td><td>397B</td><td>~32B</td></tr><tr><td>Context Window</td><td>256K tokens</td><td>256K tokens</td></tr><tr><td>Vision</td><td>Yes</td><td>Yes</td></tr><tr><td>Tool Use</td><td>Yes</td><td>Yes</td></tr><tr><td>Thinking/Reasoning</td><td>Yes</td><td>Yes</td></tr><tr><td>Quantization</td><td>BF16</td><td>BF16</td></tr><tr><td>Local download</td><td>Cloud only</td><td>9.6 GB (also runs locally)</td></tr><tr><td>Cost</td><td>Free (Ollama cloud)</td><td>Free (Ollama cloud or local)</td></tr></tbody></table></figure>
<h2 class="wp-block-heading">My Setup</h2>
<p class="wp-block-paragraph">I’m using <strong>qwen3.5:397b-cloud as my primary coding agent</strong> for the work that matters most: rules extraction from legacy code, OpenAPI spec generation, forward engineering, and parity test specifications.</p>
<p class="wp-block-paragraph"><strong>gemma4:31b-cloud</strong> is my secondary — I reach for it when I need quick documentation from code, a fast second opinion on integration patterns, or when I’m iterating rapidly and don’t need exhaustive analysis.</p>
<h3 class="wp-block-heading">Recommendation by Use Case</h3>
<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Use Case</th><th>Recommended Model</th></tr></thead><tbody><tr><td>Legacy code rules extraction</td><td>qwen3.5:397b-cloud</td></tr><tr><td>OpenAPI / spec generation</td><td>qwen3.5:397b-cloud</td></tr><tr><td>Forward engineering (.NET, Java)</td><td>qwen3.5:397b-cloud</td></tr><tr><td>Parity test specifications</td><td>qwen3.5:397b-cloud</td></tr><tr><td>Security code review</td><td>qwen3.5:397b-cloud</td></tr><tr><td>Quick code documentation</td><td>gemma4:31b-cloud</td></tr><tr><td>Integration pattern selection</td><td>Either — both strong</td></tr><tr><td>Rapid prototyping / iteration</td><td>gemma4:31b-cloud</td></tr></tbody></table></figure>
<h2 class="wp-block-heading">How to Run These Tests Yourself</h2>
<p class="wp-block-paragraph">Both models are available free through <a href="https://ollama.com" target="_blank" rel="noopener">Ollama</a>. No API keys, no usage limits.</p>
<div class="wp-block-code">
<pre><code># Install Ollama (macOS)
brew install ollama

# Run the models
ollama run qwen3.5:397b-cloud
ollama run gemma4:31b-cloud</code></pre>
</div>
<p class="wp-block-paragraph">The test scripts and full raw outputs are available on my GitHub. If you’re doing application modernisation work and want to evaluate models against your own legacy codebase, the scripts are designed to be adapted — just swap in your own code samples.</p>
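<p class="wp-block-paragraph">The scripts themselves aren’t reproduced here, but a minimal timing harness against Ollama’s local HTTP API (<code>POST /api/generate</code> on port 11434) might look like the sketch below; the prompt text and model choice are placeholders:</p>

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> dict:
    # stream=False makes Ollama return one JSON object with the full response
    return {"model": model, "prompt": prompt, "stream": False}

def run_task(model: str, prompt: str) -> tuple[str, float]:
    """Send one prompt to a local Ollama instance and time the round trip."""
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["response"], time.perf_counter() - start

if __name__ == "__main__":
    text, elapsed = run_task("gemma4:31b-cloud",
                             "Extract the business rules from this PL/SQL procedure: ...")
    print(f"{elapsed:.1f}s\n{text}")
```

<p class="wp-block-paragraph">Swapping the prompt for your own legacy code sample and looping over both model names is enough to reproduce the per-task timings reported above.</p>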
<p class="wp-block-paragraph"><em>For anyone doing application modernisation: the model that catches more edge cases in your legacy code is the one that saves you from production incidents later. A missed boundary condition in a PL/SQL procedure becomes a production defect in your shiny new microservice. That’s where I landed, and that’s why I’m going with qwen3.5:397b-cloud as my primary agent.</em></p>
<p class="wp-block-paragraph"><em>Alok brings experience in engineering and architecting distributed software systems from over 20 years across industry and consulting. His posts focus on Systems Integration, API design, Microservices and Event driven systems, Modern Enterprise Architecture and other related topics.</em></p>