Why I Worked on This
I’ve been using local LLMs for code generation and automation for a while now. The more I used them, the more I realized that subtle edge cases slip through testing when you only use examples from your own day-to-day workflows. I needed a way to systematically test my bash scripts for robustness, especially when parts of them are generated by LLMs.
My Setup
I’m running this on a modest homelab setup with Proxmox hosting a variety of containers. My main tools for this were:
- Ollama for running local LLMs
- Some custom bash scripts I use daily for system maintenance
I already had a simple benchmarking script that could generate bash code from prompts, so I extended that to include adversarial testing.
What Worked
The most effective approach I found was to:
- Run the LLM with a prompt like: “Generate a bash script that handles [X], including edge cases like [Y].”
- Take that generated script and feed it back to the LLM with: “Critique this bash script for robustness against edge cases like [Z].”
- Manually test the suggested improvements
For example, when generating a directory comparison script, the LLM initially missed:
- Symbolic link handling
- Permission-denied cases
- Non-ASCII filenames
But when prompted to critique its own output, it caught most of these. The key was framing the prompt to explicitly ask for adversarial testing scenarios.
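To make those three edge cases concrete, here is a minimal sketch of what a directory comparison that survives them can look like. This is not the LLM-generated script itself; `compare_dirs` and the demo directories are names I made up, and it only reports one direction (entries present in the first directory):

```shell
#!/bin/bash
# Sketch: compare two directories while handling symlinks,
# permission-denied files, and non-ASCII filenames.
set -u

compare_dirs() {
    local a=$1 b=$2 path rel
    # NUL-delimited find output survives non-ASCII and whitespace filenames
    while IFS= read -r -d '' path; do
        rel=${path#"$a"/}
        if [ -L "$a/$rel" ]; then
            # compare symlink targets instead of following them
            [ "$(readlink "$a/$rel")" = "$(readlink "$b/$rel" 2>/dev/null)" ] ||
                printf 'link differs: %s\n' "$rel"
        elif [ ! -e "$b/$rel" ]; then
            printf 'missing in %s: %s\n' "$b" "$rel"
        elif [ ! -r "$a/$rel" ] || [ ! -r "$b/$rel" ]; then
            # report permission-denied instead of silently skipping
            printf 'unreadable: %s\n' "$rel"
        elif ! cmp -s "$a/$rel" "$b/$rel"; then
            printf 'content differs: %s\n' "$rel"
        fi
    done < <(find "$a" \( -type f -o -type l \) -print0)
}

# demo fixtures in a throwaway directory
work=$(mktemp -d)
mkdir -p "$work/dir_a" "$work/dir_b"
printf 'hello\n' > "$work/dir_a/héllo.txt"   # non-ASCII filename
printf 'hello\n' > "$work/dir_b/héllo.txt"
ln -s /tmp "$work/dir_a/link"
ln -s /var "$work/dir_b/link"
compare_dirs "$work/dir_a" "$work/dir_b"
```

The NUL-delimited `find`/`read` loop is the piece the naive generated versions kept getting wrong: word-splitting on whitespace or mangling multibyte names.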
What Didn’t Work
I tried having the LLM generate test cases automatically, but:
- It often missed obscure filesystem edge cases (like files with only read permissions)
- Generated tests were sometimes too verbose or not focused enough
- The LLM would occasionally suggest tests that weren’t actually possible to trigger
The most reliable results came from:
- Generating code
- Critiquing the code
- Manually implementing the most severe critiques
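For the manual-testing step, it helps to keep a small fixture generator around that creates the awkward cases the LLM tended to miss. A rough sketch (the function and filenames are illustrative, not from my actual scripts):

```shell
#!/bin/bash
# Sketch: generate filesystem fixtures for manually testing a bash script
# against the edge cases discussed above.
set -eu

make_fixtures() {
    local dir=$1
    mkdir -p "$dir"
    printf 'data\n' > "$dir/readonly.txt"
    chmod 400 "$dir/readonly.txt"               # read-only file
    printf 'data\n' > "$dir/ünïcode-ファイル.txt"  # non-ASCII filename
    printf 'data\n' > "$dir/with space.txt"     # whitespace in name
    ln -s /nonexistent "$dir/broken-link"       # dangling symlink
}

fixtures=$(mktemp -d)
make_fixtures "$fixtures"
ls -la "$fixtures"
```

Running the script under test against a directory like this takes a few seconds and reliably surfaces the failure modes that generated test cases kept missing.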
Key Takeaways
- LLMs are great for suggesting edge cases but need human validation
- Prompting the LLM to critique its own output improves results
- Real-world testing still beats generated tests for reliability
- The 80/20 rule applies here: focus on the most likely edge cases
I’ve added this adversarial testing to my normal workflow. It’s not perfect, but it’s already caught several issues that would have been painful to debug later.
The Script
Here’s the core of what I’m using:
```bash
#!/bin/bash
set -euo pipefail

MODEL="llm_model"   # substitute your local model name

# Generate initial code (ollama run takes the prompt as a positional argument)
GENERATED_CODE=$(ollama run "$MODEL" "Write a bash script that compares two directories, including edge cases like symbolic links, permission issues, and non-ASCII filenames")

# Critique the code
CRITIQUE=$(ollama run "$MODEL" "Critique this bash script for robustness against edge cases: $GENERATED_CODE")

# Extract suggested improvements (case-insensitive match on common critique phrasing;
# "|| true" keeps set -e from aborting when nothing matches)
IMPROVEMENTS=$(echo "$CRITIQUE" | grep -iE "should handle|consider adding|potential issue" || true)

# Apply improvements (manual step)
echo "$IMPROVEMENTS"
echo "Generated code before improvements:"
echo "$GENERATED_CODE"
```
You’ll need to manually apply the improvements to your script. I’ve found this hybrid approach works best for my workflow.