Why I Worked on This
I’ve been using local LLMs for code generation and automation for a while now. The more I used them, the more I realized that subtle edge cases slip through testing when you only use examples from your own day-to-day workflows. I needed a way to systematically test my bash scripts for robustness, especially when parts of them are generated by LLMs.
My Setup
I’m running this on a modest homelab setup with Proxmox hosting a variety of containers. My main tools for this were:
- Ollama for running local LLMs
- Some custom bash scripts I use daily for system maintenance
I already had a simple benchmarking script that could generate bash code from prompts, so I extended that to include adversarial testing.
What Worked
The most effective approach I found was to:
- Run the LLM with a prompt like: “Generate a bash script that handles [X], including edge cases like [Y].”
- Take that generated script and feed it back to the LLM with: “Critique this bash script for robustness against edge cases like [Z].”
- Manually test the suggested improvements
For example, when generating a directory comparison script, the LLM initially missed:
- Symbolic link handling
- Permission-denied cases
- Non-ASCII filenames
But when prompted to critique its own output, it caught most of these. The key was framing the prompt to explicitly ask for adversarial testing scenarios.
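To make those three edge cases concrete, here is a minimal sketch of what a directory comparison that survives them can look like. This is not the LLM-generated script itself; `compare_dirs` and the demo directories are names I made up, and it only reports one direction (entries present in the first directory):

```shell
#!/bin/bash
# Sketch: compare two directories while handling symlinks,
# permission-denied files, and non-ASCII filenames.
set -u

compare_dirs() {
    local a=$1 b=$2 path rel
    # NUL-delimited find output survives non-ASCII and whitespace filenames
    while IFS= read -r -d '' path; do
        rel=${path#"$a"/}
        if [ -L "$a/$rel" ]; then
            # compare symlink targets instead of following them
            [ "$(readlink "$a/$rel")" = "$(readlink "$b/$rel" 2>/dev/null)" ] ||
                printf 'link differs: %s\n' "$rel"
        elif [ ! -e "$b/$rel" ]; then
            printf 'missing in %s: %s\n' "$b" "$rel"
        elif [ ! -r "$a/$rel" ] || [ ! -r "$b/$rel" ]; then
            # report permission-denied instead of silently skipping
            printf 'unreadable: %s\n' "$rel"
        elif ! cmp -s "$a/$rel" "$b/$rel"; then
            printf 'content differs: %s\n' "$rel"
        fi
    done < <(find "$a" \( -type f -o -type l \) -print0)
}

# demo fixtures in a throwaway directory
work=$(mktemp -d)
mkdir -p "$work/dir_a" "$work/dir_b"
printf 'hello\n' > "$work/dir_a/héllo.txt"   # non-ASCII filename
printf 'hello\n' > "$work/dir_b/héllo.txt"
ln -s /tmp "$work/dir_a/link"
ln -s /var "$work/dir_b/link"
compare_dirs "$work/dir_a" "$work/dir_b"
```

The NUL-delimited `find`/`read` loop is the piece the naive generated versions kept getting wrong: word-splitting on whitespace or mangling multibyte names.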
What Didn’t Work
I tried having the LLM generate test cases automatically, but:
- It often missed obscure filesystem edge cases (like files with only read permissions)
- Generated tests were sometimes too verbose or not focused enough
- The LLM would occasionally suggest tests that weren’t actually possible to trigger
The most reliable results came from:
- Generating code
- Critiquing the code
- Manually implementing the most severe critiques
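For the manual-testing step, it helps to keep a small fixture generator around that creates the awkward cases the LLM tended to miss. A rough sketch (the function and filenames are illustrative, not from my actual scripts):

```shell
#!/bin/bash
# Sketch: generate filesystem fixtures for manually testing a bash script
# against the edge cases discussed above.
set -eu

make_fixtures() {
    local dir=$1
    mkdir -p "$dir"
    printf 'data\n' > "$dir/readonly.txt"
    chmod 400 "$dir/readonly.txt"               # read-only file
    printf 'data\n' > "$dir/ünïcode-ファイル.txt"  # non-ASCII filename
    printf 'data\n' > "$dir/with space.txt"     # whitespace in name
    ln -s /nonexistent "$dir/broken-link"       # dangling symlink
}

fixtures=$(mktemp -d)
make_fixtures "$fixtures"
ls -la "$fixtures"
```

Running the script under test against a directory like this takes a few seconds and reliably surfaces the failure modes that generated test cases kept missing.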
Key Takeaways
- LLMs are great for suggesting edge cases but need human validation
- Prompting the LLM to critique its own output improves results
- Real-world testing still beats generated tests for reliability
- The 80/20 rule applies here: focus on the most likely edge cases
I’ve added this adversarial testing to my normal workflow. It’s not perfect, but it’s already caught several issues that would have been painful to debug later.
The Script
Here’s the core of what I’m using:
```bash
#!/bin/bash
set -euo pipefail

MODEL="llm_model"   # substitute your local model name

# Generate initial code (ollama run takes the prompt as a positional argument)
GENERATED_CODE=$(ollama run "$MODEL" "Write a bash script that compares two directories, including edge cases like symbolic links, permission issues, and non-ASCII filenames")

# Critique the code
CRITIQUE=$(ollama run "$MODEL" "Critique this bash script for robustness against edge cases: $GENERATED_CODE")

# Extract suggested improvements (case-insensitive match on common critique phrasing;
# "|| true" keeps set -e from aborting when nothing matches)
IMPROVEMENTS=$(echo "$CRITIQUE" | grep -iE "should handle|consider adding|potential issue" || true)

# Apply improvements (manual step)
echo "$IMPROVEMENTS"
echo "Generated code before improvements:"
echo "$GENERATED_CODE"
```
You’ll need to manually apply the improvements to your script. I’ve found this hybrid approach works best for my workflow.