Generative AI models that write code, such as GitHub Copilot, ChatGPT, and specialised coding assistants, have transformed how developers work. These tools can generate entire functions, debug complex issues, and even create complete applications from natural language descriptions. However, this powerful capability comes with significant risks: what happens when we actually run the code these AI models produce?

Whether you’re building an AI-powered coding tutor, an automated code review system, or a tool that generates and executes scripts on demand, you face the same critical challenge: how to execute potentially untrusted code safely, efficiently, and at scale.
Recent studies show that over 60% of developers now use AI assistants for code generation, yet fewer than 15% implement proper sandboxing for AI-generated code. This gap represents a significant security vulnerability that organisations must address immediately.
🚨 Critical Insight: A single prompt injection attack could turn your helpful AI coding assistant into a sophisticated malware delivery system.
In this comprehensive guide, we’ll explore the four primary approaches to sandboxing and executing LLM-generated code — local execution, WebAssembly, Docker containers, and microVMs — along with hardened variants of the first two, examining the trade-offs between security, speed, and practicality for different use cases.
The Core Challenge: Why Safe Execution Matters
Before diving into solutions, let’s understand why executing AI-generated code requires special consideration:
- Unpredictable Output: LLMs can generate code with security vulnerabilities, infinite loops, or resource-hungry operations.
- Malicious Potential: Even with good intentions, models might produce code that accesses sensitive data or system resources.
- Resource Abuse: Simple code can consume excessive CPU, memory, or disk space.
- Environmental Differences: Code that works in one environment might fail or behave dangerously in another.
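The “Resource Abuse” point above is easy to underestimate, because nothing about the code needs to look malicious. A purely illustrative one-liner of the kind an LLM might emit while “pre-computing” something:

```python
# Looks like harmless pre-computation, but will exhaust memory on most machines
cache = [list(range(10_000)) for _ in range(10_000_000)]
```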
With these risks in mind, let’s explore the four main approaches to safe execution.
1. Local Execution: The Simple but Risky Approach
Technically, this relies on the host OS’s native process scheduler. When the code is run locally, you’re essentially forking a child process from your main application.
The Mechanism
You aren’t using a sandbox; you are relying on the OS user’s permissions.
- Stdio Piping: You must capture `stdout` (standard output) and `stderr` (errors) to feed the results back to the LLM context window.
- Blocking vs. Async: A naive implementation blocks the main thread while the code runs (a non-blocking sketch follows the code example below).
Code Implementation (Python)
This is a standard implementation using Python’s subprocess module.
```python
import subprocess

def run_local_code(code_snippet):
    try:
        # WRONG WAY: Security Nightmare
        # Running 'python -c' creates a new process on your host
        result = subprocess.run(
            ["python3", "-c", code_snippet],
            capture_output=True,
            text=True,
            timeout=5  # Hard timeout to prevent infinite loops
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "Error: Execution timed out."

# Example usage
llm_code = "import os; print(os.getcwd())"  # Dangerous access!
print(run_local_code(llm_code))
```
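The snippet above blocks the calling thread while the child process runs. If you need the non-blocking behaviour mentioned under “The Mechanism”, a rough sketch using asyncio’s subprocess support might look like this (still unsandboxed, so all the same risks apply):

```python
import asyncio

async def run_local_code_async(code_snippet: str, timeout: float = 5.0) -> str:
    # Spawn the child process without blocking the event loop
    proc = await asyncio.create_subprocess_exec(
        "python3", "-c", code_snippet,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()          # Don't leave orphaned children behind
        await proc.wait()
        return "Error: Execution timed out."
    return stdout.decode() if proc.returncode == 0 else stderr.decode()

# Example usage
# print(asyncio.run(run_local_code_async("print('hello')")))
```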
Advantages
- Blazing fast – No overhead from containerisation or virtualisation.
- Simple implementation – Just a few lines of code.
- Full language features – Access to all libraries and capabilities.
- Zero infrastructure cost – No additional services required.
Technical Risks
- Shell Injection: If you aren’t careful with how arguments are passed, an LLM can execute shell commands.
- Resource Starvation: A simple `while True: pass` generated by an LLM can spike your CPU to 100%, freezing your production server.
- Security Problem: Local execution is essentially running with scissors. Consider this innocent-looking Python code:
```python
import os
os.system('rm -rf /')  # Catastrophic!
```
Even “safer” implementations using restricted execution modes have been repeatedly exploited.
Best Practices
If you must use local execution, consider the following precautions (a sketch applying the first few appears after this list):
- Use subprocess timeouts to prevent infinite loops.
- Implement resource limits (CPU, memory).
- Run as a low-privilege user.
- Employ static analysis tools before execution.
- Never use `eval()` or `exec()` directly on LLM output.
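To make the timeout, resource-limit, and low-privilege points concrete, here is a rough, POSIX-only sketch. The limit values are arbitrary, and `preexec_fn` is unsafe in multithreaded programs, so treat this as illustrative rather than production-ready:

```python
import resource
import subprocess

def limit_resources():
    # Runs in the child process just before exec: cap CPU time and memory
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                 # 5 CPU-seconds
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 ** 2,) * 2)  # 256 MB address space

result = subprocess.run(
    ["python3", "-c", "print('hello')"],
    preexec_fn=limit_resources,  # POSIX-only hook
    capture_output=True,
    text=True,
    timeout=5,                   # wall-clock timeout as a second line of defence
    # user="nobody",             # Python 3.9+: drop privileges (needs permission to switch users)
)
print(result.stdout)
```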
However, completely banning local execution means sacrificing significant performance advantages. It doesn’t have to be a security nightmare if you implement it intelligently. By using a smart validator like smolagents, you can vet the code’s structure before it runs, retaining the raw speed of local processing while mitigating most of the risks.
2. Local Executor with smolagents: Making Local Execution Safer
If spinning up a full Docker container feels like overkill, but running exec() feels suicidal, there is a powerful middle ground: AST-based interpretation.
Libraries like Hugging Face’s smolagents (specifically the LocalPythonExecutor) take a smarter approach. Instead of blindly handing code to the Python interpreter, they parse the code’s Abstract Syntax Tree (AST) first. Think of it as a security guard who reads every line of the script before letting it into the building.
Why it’s a Game Changer:
- Granular “Default-Deny” Imports: Unlike standard Python, you cannot simply `import os`. You must explicitly whitelist every module and even specific submodules (e.g., allowing `numpy` doesn’t automatically allow dangerous sub-packages).
- Infinite Loop Protection: The executor counts elementary operations and stops execution if it exceeds a cap, preventing resource exhaustion (something standard `exec()` can’t easily do).
- Banned Built-ins: Dangerous functions like `eval`, `exec`, and `__import__` are hard-blocked at the syntax level.
Code Example using SmolAgents
Here is how smolagents allows you to run local code while keeping the “keys to the castle” safe:
```python
#!pip install smolagents
from smolagents.local_python_executor import LocalPythonExecutor

executor = LocalPythonExecutor(
    additional_authorized_imports=["numpy", "pandas"],
    max_print_outputs_length=1000
)

llm_generated_code = """
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df.A.sum()
"""

result = executor(llm_generated_code)
print("Result :", result)
# Result : CodeOutput(output=np.int64(6), logs='', is_final_answer=False)
```
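To see the default-deny behaviour in action, feed the executor something it should refuse. A small sketch; the exact exception type and message may differ between smolagents versions:

```python
from smolagents.local_python_executor import LocalPythonExecutor

executor = LocalPythonExecutor(additional_authorized_imports=["numpy", "pandas"])

# Anything outside the whitelist is rejected before a single line runs
malicious_code = """
import os
os.system("rm -rf /")
"""

try:
    executor(malicious_code)
except Exception as err:  # smolagents raises an interpreter error for the unauthorised import
    print("Blocked:", err)
```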
Best For: Lightweight agents, RAG pipelines, and local tools where you need speed and simplicity but can’t afford to be vulnerable.
3. WASM: The Browser-Side Virtual Machine
Local execution is tempting for its speed, but as we discussed earlier, it’s a security minefield. Enter WebAssembly (WASM) and Pyodide, a dynamic duo that brings sandboxed Python right to your browser. But is this the holy grail for handling untrusted LLM code, or just a clever workaround? Let’s break it down.
What Are WASM and Pyodide, Anyway?
WebAssembly (WASM) is like a super-secure virtual machine that runs code in a browser sandbox. It’s a binary format designed for high-performance web apps, compiling languages like C, Rust, or even Python into something browsers can execute safely. No more worrying about code escaping and messing with your files, WASM enforces strict isolation by default.
Pyodide takes this further by porting Python (via Emscripten) to WASM. It’s essentially a full Python interpreter that lives in your browser tab. You can load it via a CDN, feed it LLM-generated code, and run it without touching your host OS. Think of it as Python in a bubble: import libraries, execute scripts, and even interact with JavaScript for UI elements. Setting it up? Just a few lines in an HTML file or a web worker.
```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <title>Pyodide Demo</title>
  <script src="https://cdn.jsdelivr.net/pyodide/v0.24.1/full/pyodide.js"></script>
</head>
<body>
  <button id="run" disabled>Run Python</button>
  <pre id="out">Loading Pyodide...</pre>
  <script>
    const out = document.getElementById("out");
    const btn = document.getElementById("run");
    let pyodide;

    async function init() {
      pyodide = await loadPyodide();
      out.textContent = "Ready";
      btn.disabled = false;
    }

    btn.onclick = async () => {
      try {
        const result = await pyodide.runPythonAsync(`
from datetime import datetime
[f"Square: {i*i}" for i in range(5)], datetime.now()
        `);
        out.textContent = JSON.stringify(result.toJs(), null, 2);
      } catch (e) {
        out.textContent = e.toString();
      }
    };

    init();
  </script>
</body>
</html>
```
# Output :
# Run Python
#[
# [
# "Square: 0",
# "Square: 1",
# "Square: 4",
# "Square: 9",
# "Square: 16"
# ],
# {}
#]
Advantages: Why Bother with This Setup?
- Top-Notch Security: WASM’s sandbox prevents file system access, network calls (unless you explicitly allow them), or OS-level mischief. Perfect for untrusted LLM code: what if that generated script tries to delete your home directory? With Pyodide, it’s trapped.
- Zero Installation Hassle: No need to install Python or dependencies on your machine. Load Pyodide in a browser and you’re good. Great for quick tests or sharing demos; anyone with Chrome or Firefox can run it.
- Portability and Integration: Runs consistently across platforms. Plus, it plays nicely with web tech: visualise data with JavaScript libraries or build interactive tools. Imagine embedding LLM code execution in a web app without servers.
Challenges with WASM and Pyodide
While WASM offers near-native speed and portability, it presents significant hurdles for general-purpose LLM code execution:
- The “Python Problem” (Library Incompatibility): LLMs heavily favor Python data libraries (Pandas, NumPy, Scikit-learn) which rely on C-extensions. Standard WASM cannot run these natively; you must use heavy, specialized runtimes (like Pyodide) that pre-compile the entire Python environment, making `pip install` for new packages impossible or extremely difficult.
- Missing System Capabilities: WASM is not a full OS. LLM code expecting a standard Linux environment (accessing `/tmp`, sockets, or shell commands) will crash because those system interfaces don’t exist or require complex custom mapping (WASI). The sketch after this list shows what that looks like from inside Pyodide.
- Performance Bottlenecks: WASM is single-threaded by default. LLM scripts attempting to use Python’s `multiprocessing` for heavy tasks will fail or underperform compared to containers.
- Cryptic Debugging: Crashes often produce low-level memory errors rather than clear Python exceptions, which makes it hard to feed error logs back to the LLM for self-correction.
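For a feel of what this means in practice, here is a short probe you could run inside Pyodide (for example via `runPythonAsync`). The exact values and error messages are assumptions and vary by Pyodide version:

```python
# Python executed inside the Pyodide sandbox, not on the host
import sys, os, socket

print(sys.platform)            # typically 'emscripten', not 'linux'
print(os.path.exists("/tmp"))  # True, but it is an in-memory virtual FS, not your real /tmp

try:
    socket.create_connection(("example.com", 80), timeout=1)
except OSError as err:
    print("No raw sockets in the browser sandbox:", err)
```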
4. mcp-run-python: Secure and Efficient Python Execution
Running Python code in a WebAssembly (Wasm) environment, particularly with Pyodide, offers exciting possibilities for client-side execution, enhanced security, and simplified deployment. However, it also presents challenges related to environment setup, dependency management, and secure sandboxing. The pydantic/mcp-run-python project addresses these issues head-on, providing a comprehensive server for executing Python code securely and efficiently.
What is mcp-run-python?
mcp-run-python is a server that facilitates the secure and isolated execution of Python code within a sandboxed WebAssembly environment. It achieves this by leveraging Pyodide, a WebAssembly port of CPython, running within Deno, a JavaScript/TypeScript runtime. This architecture allows for powerful Python capabilities to be integrated into diverse applications while maintaining strict isolation from the host operating system.
Code Example: Using the mcp-run-python Library
The library exposes a high-level `code_sandbox` utility that abstracts away the Deno setup, allowing you to run secure, asynchronous Python easily.
```python
from mcp_run_python import code_sandbox
import asyncio

code = """
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
print(df)
df.A.sum()
"""

async def main():
    # Dependencies are installed when the sandbox starts
    async with code_sandbox(dependencies=['pandas']) as sandbox:
        result = await sandbox.eval(code)  # Runs the code and captures output/result
        print("Sandbox result:", result)

asyncio.run(main())

# Output: Sandbox result:
# {
#     'status': 'success',
#     'output': [' A B', '0 1 4', '1 2 5', '2 3 6'],
#     'return_value': 6
# }
```
Deno acts as a “hypervisor” for your Pyodide environment. It provides a server-grade security model that browsers can’t offer (because they run on the client) and that Node.js struggles to offer (because it defaults to full access). It turns the “soft” sandbox of Python-in-WASM into a “hard” sandbox suitable for running untrusted LLM code on your servers.
Advantages
- True Isolation: It runs Python within a WebAssembly sandbox managed by Deno, ensuring code cannot touch your host OS or file system.
- Automated Dependency Management: Unlike raw Pyodide, it handles complex installations (via `micropip`) and automatically pulls in tricky transitive dependencies like `ssl` and `typing_extensions` that often break minimal environments.
- Production-Ready: It supports asynchronous execution (non-blocking) and flexible communication protocols (HTTP or Stdio), making it easier to integrate into real-world apps than a raw script.
- Better Debugging: It strips away the noisy internal stack frames of Pyodide, giving you clean, readable error messages when the LLM’s code fails.
5. Docker-Based Isolation: The Industry Standard
If local execution is the “Wild West” and WASM is a “walled garden,” then Docker is like a high-security prison cell. It is the most common way developers currently handle untrusted code because it uses tools they already know.
How it works
Docker doesn’t create a whole new imaginary computer (like a Virtual Machine). Instead, it creates a Container—a sealed-off section of your existing operating system.
It relies on two clever Linux features to keep things safe:
- Namespaces (The Blinders): This feature hides the rest of the system from the code. The AI’s code thinks it is “Process ID 1” on a lonely computer. It cannot see your database, your web server, or your other files.
- Cgroups (The Leash): This limits how much “stuff” the code can use. You can tell Docker, “This code can only use 10% of the CPU and 512MB of RAM.” If the AI writes an infinite loop that tries to eat all your memory, the Cgroup yanks the leash and kills the process.
Code Example: Running Python in a Docker Container
To do this in Python, you typically use the docker SDK. Here is a script that spins up a container, runs the code, and then destroys the container immediately.
```python
import docker

client = docker.from_env()

def run_in_docker(llm_code):
    container = None
    try:
        # Spin up a container using a lightweight Python image
        container = client.containers.run(
            image="python:3.9-slim",
            command=["python", "-c", llm_code],
            mem_limit="128m",        # SECURITY: Cap memory usage
            cpu_quota=50000,         # SECURITY: Cap CPU usage (50% of one core)
            network_disabled=True,   # SECURITY: No internet access for the code
            detach=True              # Run in background
        )
        # Wait for the result (with a timeout to prevent hanging)
        container.wait(timeout=5)
        # Capture the output
        logs = container.logs().decode("utf-8")
        return logs
    except Exception as e:
        return f"Execution Error: {e}"
    finally:
        # ALWAYS clean up! Otherwise, your disk will fill up with zombie containers.
        try:
            if container is not None:
                container.remove(force=True)
        except Exception:
            pass

# Example Usage
print(run_in_docker("print('Hello from inside the container!')"))
```
Docker is an excellent starting point for internal tools or MVPs (Minimum Viable Products). It allows you to use the full power of Python libraries without the complexity of specialized sandboxes. However, if you are building a public-facing product where security is critical, you may eventually outgrow Docker in favour of MicroVMs.
Advantages
- Unlimited Library Support: Unlike WASM (which struggles with C-extensions) or restricted environments, Docker allows you to pre-install anything. Does your agent need `ffmpeg` for video processing, `tesseract` for OCR, or a specific version of PyTorch? You can bake all of these into your Docker image, giving the LLM a fully capable workstation.
- Precise Resource Guardrails (Cgroups): LLMs are notorious for writing code that accidentally loops forever (`while True: pass`). Docker lets you set hard limits on CPU and RAM usage: if a script tries to consume more than 512MB of RAM, Docker’s control groups will kill the process before it freezes your server, and a wait timeout (as in the example above) catches scripts that run too long.
- Network “Kill Switch”: One of the biggest risks with AI agents is data exfiltration: an agent accidentally sending your API keys to a random server. Docker provides a simple, native flag (`network_disabled=True`) that completely severs the container’s internet connection, ensuring that data can only leave the container via your controlled output channels.
- Ephemeral “Snapshots”: Every time you run code, Docker starts from a clean slate (an “image”). If the AI writes malicious code that deletes system files or messes up the configuration, it doesn’t matter. Once the container is stopped, those changes vanish instantly. The next execution starts fresh, ensuring no “pollution” carries over between users.
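If you stay on Docker for a while, the SDK exposes a few more hardening knobs beyond the ones used above. A sketch of a more locked-down variant; the image name and limit values are placeholders to tune for your workload:

```python
import docker

client = docker.from_env()

def run_hardened(llm_code: str) -> str:
    container = client.containers.run(
        image="python:3.9-slim",
        command=["python", "-c", llm_code],
        user="nobody",                        # don't run as root inside the container
        read_only=True,                       # read-only root filesystem
        network_disabled=True,                # no outbound network
        mem_limit="256m",                     # cap memory
        pids_limit=64,                        # stop fork bombs
        security_opt=["no-new-privileges"],   # block privilege escalation
        detach=True,
    )
    try:
        container.wait(timeout=10)
        return container.logs().decode("utf-8")
    finally:
        container.remove(force=True)

print(run_hardened("print('still works, just locked down')"))
```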
6. MicroVMs: The “Fort Knox” of AI Code Execution
In today’s cloud-native world, we often face a difficult trade-off: the agility of containers versus the security of traditional virtual machines (VMs). What if you didn’t have to choose? Enter the microVM, a revolutionary technology that combines the best of both worlds.
What Exactly is a MicroVM?
A microVM is a purpose-built, ultra-lightweight virtual machine. Think of it as a traditional VM that has been put on a radical diet. It strips away decades of legacy hardware emulation, complex BIOS systems, and unnecessary device drivers, leaving only the bare essentials needed to run a single application or function securely and efficiently.
This technology, pioneered by AWS with Firecracker, combines the impenetrable security of a traditional Virtual Machine with the lightning speed of a container.
How It Works: The “Guest Kernel” Difference
The critical difference between Docker and MicroVMs lies in the Kernel, the core of the operating system that controls hardware.
- Docker (Shared Kernel): All containers share the same kernel as the host machine. If a hacker exploits a bug in the kernel from inside a container, they can “escape” and take over the entire server.
- MicroVM (Isolated Kernel): Each MicroVM has its own tiny, private “Guest Kernel.” If malicious code crashes the kernel, it only destroys that specific MicroVM. Your server (and other users) remain completely untouched.
The Advantages of MicroVMs
- “Hard” Hardware Virtualisation (Security): MicroVMs use KVM (Kernel-based Virtual Machine) to leverage the CPU’s built-in hardware virtualization features. This provides a physical barrier between the code and the host. It is the same security model used by AWS Lambda and Fargate to safely run code from millions of different customers on the same physical hardware.
- Millisecond Startup Times (Speed): In the world of AI chatbots, latency is everything. You cannot ask a user to wait 5 seconds for a Docker container to spin up. MicroVMs launch in milliseconds, providing a seamless, real-time experience that feels like it’s running locally.
- Statefulness (Memory): Unlike standard “Serverless Functions”, which are stateless (they forget everything after they run), platforms built on MicroVMs (like E2B) allow for long-running sessions.
  - Example: A user uploads a CSV file. The AI analyzes it, defines a variable `df = pandas.read_csv(...)`, and then waits. Two minutes later, the user asks “Plot a graph.” The MicroVM is still alive, remembering the variable `df`. This is critical for Data Analysis agents (see the stateful-session sketch after the code example below).
The Tooling: Firecracker and E2B
While Firecracker is the open-source engine powering this technology, it is notoriously difficult to set up. It requires a bare-metal server and complex networking configuration.
For most developers, the solution is to use a managed provider like E2B, which wraps Firecracker in a developer-friendly SDK.
Code Example (Using E2B)
Notice how this looks just like running local Python, but it is actually running inside a dedicated MicroVM in the cloud.
```python
from e2b_code_interpreter import Sandbox
from dotenv import load_dotenv

load_dotenv()  # load the E2B API key from a .env file

def main(code):
    with Sandbox.create() as sandbox:
        execution = sandbox.run_code(code)
        print(execution.text)  # text of the last result, if any
        return execution       # the full Execution object

if __name__ == "__main__":
    code = """
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
print(df.A.sum())
"""
    result = main(code)
    print(result)

# Output :
# Execution(Results: [], Logs: Logs(stdout: ['6\n'], stderr: []), Error: None)
```
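Because the sandbox stays alive between `run_code` calls, the stateful sessions described earlier fall out naturally. A minimal sketch reusing the same API as above; the expected output is an assumption based on the Execution object shown:

```python
from e2b_code_interpreter import Sandbox

with Sandbox.create() as sandbox:
    # Turn 1: the agent loads some data
    sandbox.run_code("import pandas as pd\ndf = pd.DataFrame({'A': [1, 2, 3]})")

    # Turn 2 (later in the conversation): `df` is still in memory in the same microVM
    execution = sandbox.run_code("print(df['A'].sum())")
    print(execution.logs.stdout)  # expected: ['6\n']
```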
The Perfect Fit: When Should You Use MicroVMs?
MicroVMs shine where security and speed are non-negotiable:
- Serverless & Function-as-a-Service (FaaS) Platforms: Isolating customer code at the hardware level while maintaining cold-start performance.
- AI/ML Inference & Agent Runtimes: Safely sandboxing untrusted, LLM-generated code or models.
- High-Density Multi-Tenant SaaS: Offering secure, isolated environments to each customer on shared infrastructure.
- Secure CI/CD Pipelines: Running builds and tests in ephemeral, hardened environments.
The “Environment-as-Code” Alternative: Daytona
While Firecracker and E2B focus on ephemeral code snippets, Daytona offers a broader approach by providing full, standardized development environments as secure sandboxes. It excels at “Environment-as-Code,” allowing you to spin up sub-90ms sandboxes that are not just for executing code, but for managing entire project lifecycles with native Git integration and multi-language support.
Conclusion: The Trade-off Between Isolation and Capability
As we have explored, executing LLM-generated Python code is not merely a technical feature—it is a critical architectural decision that balances execution speed, capability, and security. There is no “one-size-fits-all” solution; the right choice depends entirely on your trust model and the environment in which you operate.
- Local Execution: Native speed but essentially zero isolation. Suitable only for personal, offline prototyping where the environment can be wiped, or when guarded by an AST-level executor like smolagents.
- WASM (WebAssembly): Offloads execution to the client’s browser, or to a server-side runtime such as Deno (as with mcp-run-python). It keeps your server safe but suffers from slower initialisation and limited library support (no raw sockets).
- Docker Containers: The flexible standard. Efficient and easy to deploy, but relies on a shared kernel. Without extreme hardening, they remain vulnerable to “container escape” attacks.
- MicroVMs (e.g., Firecracker): The gold standard. Combines the speed of containers with hardware-level isolation. This is the only responsible choice for multi-tenant, public-facing applications.


