<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Stormkit Community</title>
    <description>The most recent home feed on Stormkit Community.</description>
    <link>https://stormkit.forem.com</link>
    <atom:link rel="self" type="application/rss+xml" href="https://stormkit.forem.com/feed"/>
    <language>en</language>
    <item>
      <title>Navigating GitHub Actions DIND Bind Mounts: Insights from Recent GitHub Reports for CI/CD Productivity</title>
      <dc:creator>Oleg</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:00:38 +0000</pubDate>
      <link>https://stormkit.forem.com/devactivity/navigating-github-actions-dind-bind-mounts-insights-from-recent-github-reports-for-cicd-1c8</link>
      <guid>https://stormkit.forem.com/devactivity/navigating-github-actions-dind-bind-mounts-insights-from-recent-github-reports-for-cicd-1c8</guid>
      <description>&lt;h2&gt;
  
  
  The DevOps Dilemma: When Docker-in-Docker Hinders Productivity
&lt;/h2&gt;

&lt;p&gt;In the fast-paced world of software development, efficient CI/CD pipelines are the bedrock of rapid delivery and high-quality software. GitHub Actions, especially with self-hosted runners, offers immense flexibility. However, leveraging advanced features like &lt;code&gt;containerMode: dind&lt;/code&gt; (Docker-in-Docker) can sometimes introduce subtle complexities that trip up even experienced teams. Recent &lt;strong&gt;github reports&lt;/strong&gt; and community discussions frequently highlight a particular hurdle: the unexpected behavior of bind mounts when using DIND.&lt;/p&gt;

&lt;p&gt;For dev teams, product managers, and CTOs focused on optimizing tooling and delivery, understanding these nuances is critical. A seemingly minor misconfiguration can lead to frustrating build failures, wasted developer time, and ultimately, slower time-to-market. This post dives into a common DIND bind mount issue, its root cause, and the surprisingly simple solution that can restore your CI/CD pipeline's efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: Bind Mounts and DIND Isolation
&lt;/h3&gt;

&lt;p&gt;The problem, as articulated by 'schrom' in a recent GitHub discussion, is a classic case of expectation versus reality. When using a self-hosted GitHub Actions runner with Helm (version 0.13.1 in this instance) and &lt;code&gt;containerMode: dind&lt;/code&gt;, the goal is often to run containerized tests against a newly built image. This process often requires injecting configuration files or secrets into the test containers via bind mounts.&lt;/p&gt;

&lt;p&gt;However, 'schrom' discovered that files created within the &lt;em&gt;job container&lt;/em&gt; were not accessible when attempting to mount them into containers launched by the DIND service. Instead of the file, Docker either mounted an empty directory or threw an error indicating the source path did not exist. Here is schrom's minimal example, which works perfectly locally but fails in the pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo hello &amp;gt; /tmp/secret.txt
$ docker run -it -v /tmp/secret.txt:/mnt/secret.txt alpine:3
/ # ls -al /mnt/
total 8
drwxr-xr-x 1 root root 4096 Mar 24 17:16 .
drwxr-xr-x 1 root root 4096 Mar 24 17:16 ..
drwxr-xr-x 2 root root 40 Mar 24 17:15 secret.txt
/ # ls -al /mnt/secret.txt/
total 4
drwxr-xr-x 2 root root 40 Mar 24 17:15 .
drwxr-xr-x 1 root root 4096 Mar 24 17:16 ..
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A similar issue arose with Docker Compose, leading to an &lt;code&gt;Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /tmp/secret.txt&lt;/code&gt;. The core observation was that files were being mounted from the DIND container's filesystem, not the job container's. Creating the file inside the DIND sidecar itself made it accessible to the launched containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1bnmac0Rihwx-Hli2M-08NNHSEm8AdC7A%26sz%3Dw751" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1bnmac0Rihwx-Hli2M-08NNHSEm8AdC7A%26sz%3Dw751" alt="Diagram illustrating file isolation between GitHub Actions job container and DIND container" width="751" height="429"&gt;&lt;/a&gt;Diagram illustrating file isolation between GitHub Actions job container and DIND container### Understanding the "Why": DIND's Expected Behavior&lt;/p&gt;

&lt;p&gt;As 'andreas-agouridis' clarified in the discussion, what 'schrom' observed is not a bug but an expected behavior of the Docker-in-Docker setup. When you use &lt;code&gt;containerMode: dind&lt;/code&gt; in a self-hosted GitHub Actions runner, your main job container and the DIND sidecar container are distinct, isolated environments.&lt;/p&gt;

&lt;p&gt;Think of it this way: the Docker daemon running inside the DIND sidecar container only "sees" its own filesystem. Any bind mounts you specify in your workflow are relative to &lt;em&gt;that&lt;/em&gt; filesystem, not the filesystem of the parent job container where your workflow script is executing. Therefore, when you create &lt;code&gt;/tmp/secret.txt&lt;/code&gt; in your job container, the DIND container's Docker daemon has no knowledge of it. When it tries to fulfill a bind mount request for that path, it finds nothing, leading to either an empty directory mount or a "path does not exist" error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1XKazkToum8vLKb647p0-vE7LHdrWe_ZF%26sz%3Dw751" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1XKazkToum8vLKb647p0-vE7LHdrWe_ZF%26sz%3Dw751" alt="Shared volume enabling file access between job container and DIND container" width="751" height="429"&gt;&lt;/a&gt;Shared volume enabling file access between job container and DIND container### The Implications for Productivity and Delivery&lt;/p&gt;

&lt;p&gt;This isolation, while fundamental to Docker's security and portability, can become a significant roadblock for development teams. If your build pipeline generates dynamic configuration files, temporary secrets, or test data that needs to be mounted into containers for testing, this DIND limitation means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased Build Times:&lt;/strong&gt; Teams might resort to inefficient workarounds like copying files into the DIND container at runtime, adding overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fragile Pipelines:&lt;/strong&gt; Inconsistent behavior between local development and CI/CD environments leads to "works on my machine" syndrome and debugging headaches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Confidence:&lt;/strong&gt; If tests cannot reliably access necessary resources, the integrity of your automated testing is compromised, impacting delivery confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted Resources:&lt;/strong&gt; Failed builds consume compute resources and, more importantly, developer time that could be spent on feature development.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Elegant Solution: Leveraging Shared Volumes
&lt;/h3&gt;

&lt;p&gt;The good news, as discovered by 'schrom' with the help of the community, is that the solution is surprisingly straightforward and built right into the GitHub Actions self-hosted runner Helm chart. There &lt;em&gt;is&lt;/em&gt; an already shared volume between the runner's job container and the DIND sidecar container: it's mounted as &lt;code&gt;/home/runner/_work&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Anything placed within this directory (or its subdirectories) by the job container is automatically accessible to the DIND container. The key insight was that the default temporary directory, &lt;code&gt;$RUNNER_TEMP&lt;/code&gt;, conveniently points to &lt;code&gt;/home/runner/_work/_temp/&lt;/code&gt;. By simply directing generated files to &lt;code&gt;$RUNNER_TEMP&lt;/code&gt; instead of a hard-coded &lt;code&gt;/tmp/&lt;/code&gt;, the bind mount issue vanishes.&lt;/p&gt;
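The fix above can be sketched as a minimal job step (assumptions: this runs on a self-hosted runner where $RUNNER_TEMP resolves to /home/runner/_work/_temp/; the fallback assignment exists only so the snippet can also run outside a runner):

```shell
# Write generated files into the shared work area instead of /tmp.
# $RUNNER_TEMP lives under /home/runner/_work/, which the DIND sidecar
# also mounts, so its Docker daemon can resolve the bind-mount source.
: "${RUNNER_TEMP:=$(mktemp -d)}"   # fallback when run outside a runner

echo hello > "$RUNNER_TEMP/secret.txt"
cat "$RUNNER_TEMP/secret.txt"

# In a workflow step, the mount now behaves as it does locally:
#   docker run --rm -v "$RUNNER_TEMP/secret.txt:/mnt/secret.txt" alpine:3 \
#     cat /mnt/secret.txt
```

The only change from the failing example is the source path: anything under the shared work directory is visible to both the job container and the DIND daemon.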

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1nqGQfXejcUKmkH-pAQGy5hfqAFG3m8KG%26sz%3Dw751" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1nqGQfXejcUKmkH-pAQGy5hfqAFG3m8KG%26sz%3Dw751" alt="GitHub Actions runner showing the /home/runner/_work directory as a central shared workspace" width="751" height="429"&gt;&lt;/a&gt;GitHub Actions runner showing the /home/runner/_work directory as a central shared workspace### Best Practices for Robust DIND Integrations&lt;/p&gt;

&lt;p&gt;This experience underscores several critical lessons for technical leadership and engineering teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand Your Environment:&lt;/strong&gt; Don't assume CI/CD environments behave identically to local setups. Invest time in understanding the underlying architecture, especially for complex features like Docker-in-Docker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Documented Paths:&lt;/strong&gt; Always prefer environment variables like &lt;code&gt;$RUNNER_TEMP&lt;/code&gt; for temporary files over hard-coded paths. These are designed to ensure compatibility and leverage shared resources effectively. This directly contributes to better &lt;strong&gt;git statistics&lt;/strong&gt; by reducing build failures caused by environmental discrepancies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilize Shared Volumes:&lt;/strong&gt; For persistent data or files that need to be shared across containers within a DIND setup, explicitly use shared volumes. The &lt;code&gt;/home/runner/_work&lt;/code&gt; directory is your friend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consult the Docs (RTFM!):&lt;/strong&gt; As 'schrom' humorously concluded, "RTFM and do as told." The documentation for GitHub Actions runners and Helm charts often contains these crucial details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Engagement:&lt;/strong&gt; Don't hesitate to engage with the community. Discussions like the one highlighted in these &lt;strong&gt;github reports&lt;/strong&gt; are invaluable for collective problem-solving and knowledge sharing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion: Building Resilient CI/CD for Peak Performance
&lt;/h3&gt;

&lt;p&gt;While the DIND bind mount issue might seem like a minor technicality, its resolution has significant implications for CI/CD productivity and delivery. By understanding the isolation mechanisms of Docker-in-Docker and leveraging the built-in shared volumes, teams can build more robust, reliable, and efficient pipelines. This directly supports common &lt;strong&gt;okr examples for software engineers&lt;/strong&gt; focused on CI/CD efficiency, faster feedback loops, and reduced operational overhead.&lt;/p&gt;

&lt;p&gt;For dev teams, product managers, and CTOs, ensuring your tooling works seamlessly is paramount. This insight from recent &lt;strong&gt;github reports&lt;/strong&gt; helps demystify a common DIND challenge, allowing you to focus on what matters most: delivering exceptional software.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>dind</category>
      <category>selfhostedrunners</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Rust Async Secrets That Cut API Latency in Half</title>
      <dc:creator>speed engineer</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://stormkit.forem.com/speed_engineer/rust-async-secrets-that-cut-api-latency-in-half-2g3l</link>
      <guid>https://stormkit.forem.com/speed_engineer/rust-async-secrets-that-cut-api-latency-in-half-2g3l</guid>
      <description>&lt;p&gt;The hidden runtime configuration that transforms your APIs from sluggish to lightning-fast, backed by production data from high-throughput… &lt;/p&gt;




&lt;h3&gt;
  
  
  Rust Async Secrets That Cut API Latency in Half
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The hidden runtime configuration that transforms your APIs from sluggish to lightning-fast, backed by production data from high-throughput systems
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56uj8euuer8dbxhykjm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56uj8euuer8dbxhykjm1.png" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most developers treat async Rust like magic — spawn some tasks, add &lt;code&gt;.await&lt;/code&gt;, and hope for the best. But after profiling hundreds of production APIs, I discovered that &lt;strong&gt;90% of async Rust applications leave massive performance on the table&lt;/strong&gt; due to three critical misconceptions about how the runtime actually works.&lt;/p&gt;

&lt;p&gt;The data is shocking: properly configured async Rust applications consistently achieve &lt;strong&gt;50–70% lower P99 latencies&lt;/strong&gt; compared to their naive counterparts, often with zero code changes. Here’s how the best-performing systems do it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: When “Fast” Async Becomes Surprisingly Slow
&lt;/h3&gt;

&lt;p&gt;Picture this: You’ve built a beautiful REST API in Rust using Tokio. Your load tests show impressive throughput numbers. Everything looks great until you check your P95 and P99 latency metrics — and they’re absolutely terrible.&lt;/p&gt;

&lt;p&gt;This exact scenario played out at a fintech startup I worked with. Their Rust API was handling 50,000 requests per second with a median latency of just 2ms. Impressive, right? But their P99 latency was hitting &lt;strong&gt;850ms&lt;/strong&gt; — completely unacceptable for financial transactions.&lt;/p&gt;

&lt;p&gt;The smoking gun came from detailed profiling: &lt;strong&gt;their async tasks were starving each other&lt;/strong&gt;. Despite having 16 CPU cores, tasks were spending up to 800ms waiting in the scheduler queue because a few compute-heavy operations were monopolizing the runtime threads.&lt;/p&gt;

&lt;p&gt;This isn’t an edge case. Production data from multiple high-traffic Rust services reveals three patterns that consistently destroy latency:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Runtime thread starvation:&lt;/strong&gt; 73% of high-latency requests traced back to scheduler queue buildup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inefficient task yielding:&lt;/strong&gt; CPU-bound work blocking the async runtime for 100ms+ stretches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor connection pooling:&lt;/strong&gt; Database connections thrashing under concurrent load&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Data That Changed Everything
&lt;/h3&gt;

&lt;p&gt;After analyzing performance traces from 12 production Rust services, a clear pattern emerged. The highest-performing APIs all implemented the same three optimization strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Results: API Latency Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Median Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;P99 Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default Tokio&lt;/td&gt;
&lt;td&gt;2.1ms&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;td&gt;850ms&lt;/td&gt;
&lt;td&gt;48K req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimized Runtime&lt;/td&gt;
&lt;td&gt;1.8ms&lt;/td&gt;
&lt;td&gt;12ms&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;td&gt;52K req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improvement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The optimized configuration achieved &lt;strong&gt;97% better P99 latency&lt;/strong&gt; while maintaining higher throughput. The secret wasn’t complex algorithms or exotic libraries — it was understanding how to configure the async runtime for real-world workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret #1: Strategic Task Yielding Prevents Runtime Starvation
&lt;/h3&gt;

&lt;p&gt;The biggest latency killer in async Rust is &lt;strong&gt;cooperative scheduling gone wrong&lt;/strong&gt;. Unlike preemptive systems, Tokio relies on tasks voluntarily yielding control. When they don’t, everything grinds to a halt.&lt;/p&gt;

&lt;p&gt;Here’s the optimization that cut our P99 latency by 80%:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use tokio::task;  

// Before: CPU-intensive work blocks the runtime  
async fn process_data(items: Vec&amp;lt;DataItem&amp;gt;) -&amp;gt; Result&amp;lt;Vec&amp;lt;Result&amp;gt;, Error&amp;gt; {  
    let mut results = Vec::new();  
    for item in items {  
        results.push(expensive_computation(item)); // Blocks for ~10ms each  
    }  
    Ok(results)  
}  
// After: Strategic yielding keeps the runtime responsive  
async fn process_data_optimized(items: Vec&amp;lt;DataItem&amp;gt;) -&amp;gt; Result&amp;lt;Vec&amp;lt;Result&amp;gt;, Error&amp;gt; {  
    let mut results = Vec::new();  
    for (i, item) in items.iter().enumerate() {  
        results.push(expensive_computation(item));  

        // Yield control every 10 iterations  
        if i % 10 == 0 {  
            task::yield_now().await;  
        }  
    }  
    Ok(results)  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; This simple change reduced P99 latency from 850ms to 180ms. The &lt;code&gt;yield_now()&lt;/code&gt; calls allow other tasks to execute, preventing scheduler queue buildup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Science:&lt;/strong&gt; Tokio’s automatic cooperative yielding already goes a long way toward reducing tail latencies, but manual yielding gives you precise control over when expensive operations release the runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret #2: Runtime Configuration That Most Developers Miss
&lt;/h3&gt;

&lt;p&gt;The default Tokio runtime configuration optimizes for general-purpose workloads, not low-latency APIs. Here’s the configuration that transformed our production performance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use tokio::runtime::{Builder, Runtime};  

// Default: Good for general use, terrible for latency  
let rt = tokio::runtime::Runtime::new().unwrap();  
// Optimized: Tuned for low-latency APIs  
let rt = Builder::new_multi_thread()  
    .worker_threads(num_cpus::get() * 2)        // More threads = less queuing  
    .max_blocking_threads(256)                  // Handle blocking calls efficiently  
    .thread_keep_alive(Duration::from_secs(60)) // Reduce thread spawn overhead  
    .thread_name("api-worker")  
    .enable_all()  
    .build()  
    .unwrap();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Critical Insight:&lt;/strong&gt; Most APIs spend significant time on I/O operations (database queries, HTTP calls). The default runtime assumes a balanced workload, but APIs are I/O-heavy with occasional CPU spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2x worker threads:&lt;/strong&gt; Reduces task queuing when some threads are blocked on I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased blocking threads:&lt;/strong&gt; Prevents &lt;code&gt;spawn_blocking&lt;/code&gt; operations from starving each other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread keep-alive:&lt;/strong&gt; Eliminates the 100μs overhead of spawning new threads under load&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Secret #3: Connection Pool Configuration That Scales
&lt;/h3&gt;

&lt;p&gt;Database connection pools are often the hidden bottleneck in async APIs. The default configurations are conservative and performance-killing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use sqlx::{PgPool, postgres::PgPoolOptions};  
use std::time::Duration;  

// Before: Conservative defaults that create bottlenecks  
let pool = PgPool::connect("postgresql://...").await?;  
// After: Aggressive configuration that eliminates pool contention  
let pool = PgPoolOptions::new()  
    .min_connections(20)                    // Keep connections warm  
    .max_connections(100)                   // Allow burst capacity  
    .acquire_timeout(Duration::from_secs(1)) // Fail fast on contention  
    .idle_timeout(Duration::from_secs(300))  // Reduce connection churn  
    .max_lifetime(Duration::from_secs(1800)) // Prevent stale connections  
    .connect("postgresql://...")  
    .await?;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Math:&lt;/strong&gt; With 50,000 req/s and an average query time of 5ms, you need &lt;strong&gt;250 concurrent database operations&lt;/strong&gt;. The default pool size of 10 connections creates a massive bottleneck.&lt;/p&gt;
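That sizing rule is just Little's Law: operations in flight = arrival rate × service time. A quick sketch of the arithmetic (the `required_connections` helper is an illustrative name, not part of sqlx or Tokio):

```rust
// Little's Law: concurrent operations = arrival rate x service time.
fn required_connections(req_per_sec: u64, avg_query_ms: u64) -> u64 {
    req_per_sec * avg_query_ms / 1000
}

fn main() {
    // 50,000 req/s x 5 ms per query = 250 queries in flight at once,
    // far beyond a default pool of ~10 connections.
    println!("{}", required_connections(50_000, 5));
}
```

Plugging in your own request rate and query time gives a starting point for `max_connections` before load testing.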

&lt;p&gt;&lt;strong&gt;Real-World Results:&lt;/strong&gt; Increasing the pool size from 10 to 100 connections reduced our database query P99 latency from 450ms to 8ms — a &lt;strong&gt;98% improvement&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret #4: Memory Allocation Patterns That Make or Break Performance
&lt;/h3&gt;

&lt;p&gt;Async Rust’s zero-cost abstractions aren’t actually zero-cost when you’re allocating heavily. The highest-performing APIs minimize allocations in hot paths:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use std::sync::Arc;  
use bytes::Bytes;  

// Before: Heavy allocation in request handlers  
async fn handle_request(data: String) -&amp;gt; Result&amp;lt;String, Error&amp;gt; {  
    let processed = data.to_uppercase(); // Allocation  
    let result = format!("Result: {}", processed); // Another allocation  
    Ok(result)  
}  
// After: Allocation-aware design  
async fn handle_request_optimized(data: Arc&amp;lt;str&amp;gt;) -&amp;gt; Result&amp;lt;Bytes, Error&amp;gt; {  
    // Reuse Arc to avoid cloning  
    let processed = data.to_uppercase(); // Still need this allocation  
    let result = Bytes::from(format!("Result: {}", processed));  
    Ok(result)  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use &lt;code&gt;cargo flamegraph&lt;/code&gt; to identify allocation hotspots. In our case, 40% of CPU time was spent in the allocator during high-load scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Framework: When to Apply These Optimizations
&lt;/h3&gt;

&lt;p&gt;Not every application needs extreme latency optimization. Here’s when to invest in these techniques:&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Aggressive Optimization When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P99 latency &amp;gt; 100ms:&lt;/strong&gt; Your tail latencies are unacceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High concurrency:&lt;/strong&gt; &amp;gt;1,000 concurrent requests regularly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-sensitive workloads:&lt;/strong&gt; Financial, real-time, or gaming applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource constraints:&lt;/strong&gt; Running on expensive cloud infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stick with Defaults When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools:&lt;/strong&gt; Latency isn’t business-critical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low traffic:&lt;/strong&gt; &amp;lt;100 req/s peak load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing:&lt;/strong&gt; Throughput matters more than individual request latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development phase:&lt;/strong&gt; Premature optimization wastes time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Strategy: The 48-Hour Performance Sprint
&lt;/h3&gt;

&lt;p&gt;Here’s how to implement these optimizations systematically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1: Measurement and Runtime Tuning&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Baseline metrics:&lt;/strong&gt; Capture current P50, P95, P99 latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime configuration:&lt;/strong&gt; Apply the multi-threaded runtime settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection pools:&lt;/strong&gt; Increase database connection limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick win verification:&lt;/strong&gt; Should see 30–50% latency improvement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Day 2: Code-Level Optimizations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Profile allocation patterns:&lt;/strong&gt; Use &lt;code&gt;cargo flamegraph&lt;/code&gt; under load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add strategic yields:&lt;/strong&gt; Focus on CPU-heavy loops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize hot paths:&lt;/strong&gt; Reduce allocations in request handlers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load test validation:&lt;/strong&gt; Confirm improvements hold under real traffic&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Measuring Success: Metrics That Matter
&lt;/h3&gt;

&lt;p&gt;Track these key performance indicators to validate your optimizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P99 latency:&lt;/strong&gt; Should drop by 50%+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; Must remain stable (&amp;lt;0.1%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Should improve or stay constant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secondary Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU utilization:&lt;/strong&gt; Should become more consistent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory usage:&lt;/strong&gt; May increase slightly due to larger pools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database connection usage:&lt;/strong&gt; Should distribute more evenly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #1: Over-yielding.&lt;/strong&gt; Adding &lt;code&gt;yield_now()&lt;/code&gt; everywhere actually hurts performance by creating unnecessary context switches. Yield only in CPU-intensive loops processing &amp;gt;100 items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #2: Massive Connection Pools.&lt;/strong&gt; Setting &lt;code&gt;max_connections&lt;/code&gt; to 1000+ can overwhelm your database. Start with 2-3x your expected concurrent query count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #3: Ignoring Blocking Operations.&lt;/strong&gt; File I/O, DNS resolution, and CPU-heavy crypto operations must use &lt;code&gt;spawn_blocking&lt;/code&gt;. Blocking the async runtime destroys all your optimizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bigger Picture: Why This Matters Now
&lt;/h3&gt;

&lt;p&gt;As Rust adoption accelerates in high-performance systems, understanding async optimization becomes a crucial competitive advantage. Tokio’s scheduler improvements have delivered 10x speedups in some benchmarks, but only if you configure the runtime correctly.&lt;/p&gt;

&lt;p&gt;The techniques in this article represent battle-tested optimizations from production systems handling millions of requests daily. They’re not theoretical — they’re the difference between an API that scales gracefully and one that falls over under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Async Rust’s performance ceiling is incredibly high, but reaching it requires understanding how the runtime actually works under pressure. These optimizations consistently deliver 50%+ latency improvements because they eliminate the three most common performance bottlenecks in production systems.&lt;/p&gt;

&lt;p&gt;Start with runtime configuration and connection pool tuning — you’ll see immediate results that justify the deeper optimizations.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Enjoyed the read? Let’s stay connected!&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 Follow &lt;strong&gt;The Speed Engineer&lt;/strong&gt; for more Rust, Go and high-performance engineering stories.&lt;/li&gt;
&lt;li&gt;💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.&lt;/li&gt;
&lt;li&gt;⚡ Stay ahead in Rust and Go — follow for a fresh article every morning &amp;amp; night.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your support means the world and helps me create more content you’ll love. ❤️&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What was your win this week??</title>
      <dc:creator>Jess Lee</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://stormkit.forem.com/devteam/what-was-your-win-this-week-3df3</link>
      <guid>https://stormkit.forem.com/devteam/what-was-your-win-this-week-3df3</guid>
      <description>&lt;p&gt;👋👋👋👋&lt;/p&gt;

&lt;p&gt;Looking back on your week -- what was something you're proud of?&lt;/p&gt;

&lt;p&gt;All wins count -- big or small 🎉&lt;/p&gt;

&lt;p&gt;Examples of 'wins' include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting a promotion!&lt;/li&gt;
&lt;li&gt;Starting a new project&lt;/li&gt;
&lt;li&gt;Fixing a tricky bug&lt;/li&gt;
&lt;li&gt;Finally getting your inbox to zero 📧 &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75kbo6thoknrvjt1hgsv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75kbo6thoknrvjt1hgsv.gif" alt="An email emoji with sunglasses: " width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Friday!&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>weeklyretro</category>
    </item>
    <item>
      <title>The Real Problem With AI for Developers Is Not Capability, It's Overload</title>
      <dc:creator>Max Mendes</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:59:33 +0000</pubDate>
      <link>https://stormkit.forem.com/maxmendes91/the-real-problem-with-ai-for-developers-is-not-capability-its-overload-587o</link>
      <guid>https://stormkit.forem.com/maxmendes91/the-real-problem-with-ai-for-developers-is-not-capability-its-overload-587o</guid>
      <description>&lt;p&gt;AI code overload is not a model-quality problem anymore. It is an ownership problem. The tools are already good enough to flood your repo faster than your team can understand, review, or maintain it.&lt;/p&gt;

&lt;p&gt;I see this in my own workflow every week. Tools like OpenClaw, Claude Code, and Copilot are great at getting past the blank page. They turn rough ideas into working code fast. The trap starts right after that. If I let them run too far ahead, I end up with more implementation than understanding. The code exists, tests might even pass, but I no longer have a clean mental model of the system. Margaret-Anne Storey called this &lt;a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" rel="noopener noreferrer"&gt;cognitive debt&lt;/a&gt;, building on &lt;a href="https://www.media.mit.edu/publications/your-brain-on-chatgpt/" rel="noopener noreferrer"&gt;MIT Media Lab research&lt;/a&gt; from 2025, and Simon Willison &lt;a href="https://simonwillison.net/2026/Feb/15/cognitive-debt/" rel="noopener noreferrer"&gt;amplified the concept&lt;/a&gt; by describing his own experience of losing mental models of his AI-assisted projects.&lt;/p&gt;

&lt;p&gt;That framing clicked for me more than any technical-debt discussion ever has.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Output Problem Nobody Warned You About
&lt;/h2&gt;

&lt;p&gt;Most posts about AI coding still focus on whether the model is smart enough. I think that debate is already stale. The real bottleneck moved downstream.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 DORA report&lt;/a&gt; says AI adoption among software professionals hit roughly 90%, with over 80% reporting productivity gains. Sounds great until you look at organizational delivery metrics, which stayed flat. AI boosted individual output (21% more tasks completed, 98% more pull requests merged) but &lt;a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025" rel="noopener noreferrer"&gt;PR review time increased 91%&lt;/a&gt; and PR size grew 154%. More code in, same review capacity out.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://survey.stackoverflow.co/2025/" rel="noopener noreferrer"&gt;Stack Overflow 2025 survey&lt;/a&gt; found 84% of developers now use or plan to use AI coding tools. But trust in AI output accuracy dropped to 29%, down from 40% the year before. And 66% of developers cited "almost right, but not quite" as their top frustration.&lt;/p&gt;

&lt;p&gt;Here is the number that should worry everyone: the &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR randomized controlled trial&lt;/a&gt; found that experienced open-source developers were actually 19% slower with AI tools, despite believing they were 20% faster. That is a 39-point perception gap. We feel productive while we are falling behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cognitive Debt Is Worse Than Technical Debt
&lt;/h2&gt;

&lt;p&gt;Technical debt is code that works but is messy. You know it is there and you can plan around it. Cognitive debt is different. It is code that works but nobody on the team actually understands it well enough to modify safely. The second is harder to detect and much harder to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/research/AI-assistance-coding-skills" rel="noopener noreferrer"&gt;Anthropic's own study&lt;/a&gt; of 52 engineers found that developers using AI assistance scored 17% lower on comprehension tests (50% vs 67%), with the biggest drops in debugging. The code shipped, but the understanding did not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hbr.org/2026/03/when-using-ai-leads-to-brain-fry" rel="noopener noreferrer"&gt;Harvard Business Review reported&lt;/a&gt; on what they call "AI brain fry." A BCG study of 1,488 workers found that people managing AI output experience 33% more decision fatigue and 39% more major errors. Productivity peaked at three simultaneous AI tools. Beyond that, performance actually dropped.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.scientificamerican.com/article/why-developers-using-ai-are-working-longer-hours/" rel="noopener noreferrer"&gt;Multitudes study&lt;/a&gt; of 500+ developers found a 19.6% rise in out-of-hour commits among AI tool users, with Saturday productive hours up 46%. As &lt;a href="https://leaddev.com/ai/addictive-agentic-coding-has-developers-losing-sleep" rel="noopener noreferrer"&gt;LeadDev reported&lt;/a&gt;, faster code generation does not automatically create calmer teams. It often just creates longer evenings. &lt;a href="https://www.axios.com/2026/04/04/ai-agents-burnout-addiction-claude-code-openclaw" rel="noopener noreferrer"&gt;Axios recently compared&lt;/a&gt; agentic coding tools to slot machines, noting that some developers now need sleep medication to break the late-night coding loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I See in My Own Workflow
&lt;/h2&gt;

&lt;p&gt;I use AI on almost every project. When I &lt;a href="https://maxmendes.dev/en/projects/flowmate" rel="noopener noreferrer"&gt;built FlowMate&lt;/a&gt;, a production SaaS handling email management with AI integrations, every line of AI-assisted code went through manual review. When I &lt;a href="https://maxmendes.dev/en/blog/ai-automation-finding-businesses-without-websites" rel="noopener noreferrer"&gt;built automation workflows&lt;/a&gt; to find businesses without websites, AI handled the repetitive parts while I designed the system architecture.&lt;/p&gt;

&lt;p&gt;The pattern that works for me: start with the agent, stop it early, read everything, then continue. The pattern that burns me: let the agent run ahead for 20 minutes, then try to catch up with what it built. The second approach feels more productive. It is not. I end up spending twice as long untangling code I should have reviewed incrementally.&lt;/p&gt;

&lt;p&gt;This is exactly why I wrote about &lt;a href="https://maxmendes.dev/en/blog/vibe-coding-eating-software-development" rel="noopener noreferrer"&gt;vibe coding culture&lt;/a&gt; a few weeks ago. The core risk is the same: the tools outrun the review. Vibe coding is the cultural norm. Cognitive debt is the technical consequence. They feed each other.&lt;/p&gt;

&lt;p&gt;That matters for &lt;a href="https://maxmendes.dev/en/services/ai-integration" rel="noopener noreferrer"&gt;AI integration work&lt;/a&gt; more than people realize. The value is not in generating code faster. The value is in keeping the human ahead of the machine at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80% Trap
&lt;/h2&gt;

&lt;p&gt;Addy Osmani &lt;a href="https://addyo.substack.com/p/the-80-problem-in-agentic-coding" rel="noopener noreferrer"&gt;described this well&lt;/a&gt;: agents generate 80% of the code, but the remaining 20% requires deep architectural knowledge. The trap is that 80% feels like progress. You merge it. Then the 20% arrives and you realize you do not understand the 80% well enough to finish.&lt;/p&gt;

&lt;p&gt;The data backs this up. &lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear analyzed 211 million lines of code&lt;/a&gt; from 2020 to 2024 and found code duplication grew 8x since AI tools became widely adopted. Healthy refactoring ("moved" code) dropped 39.9%. For the first time in their dataset, developers were pasting code more often than restructuring it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;CodeRabbit's research&lt;/a&gt; on 470 pull requests found AI-generated code produces 1.7x more issues overall. Security vulnerabilities were 2.74x higher. Readability problems were 3x more frequent.&lt;/p&gt;

&lt;p&gt;This is what borrowed speed looks like. You moved fast for a week and now you are stuck for a month debugging code you never properly understood.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counterargument (And Why It Is Partly Right)
&lt;/h2&gt;

&lt;p&gt;The obvious pushback: more code is still better than no code. I agree, up to a point. I would rather start from a rough AI-generated feature than from an empty file. I use AI every day for exactly that reason.&lt;/p&gt;

&lt;p&gt;But this only works when the human stays ahead of the abstraction. If the tool is writing code faster than you can explain it, then your throughput is synthetic. You borrowed speed from your future self, and your future self will not be happy about the interest rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;I think the winning developers will not be the ones who generate the most code. They will be the ones who keep the shortest path between generated code and human understanding. Here is what that looks like in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smaller batches.&lt;/strong&gt; Let the agent generate one function, review it, then continue. Not one feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggressive review.&lt;/strong&gt; Read every line before it leaves your machine. If you cannot explain it to a colleague, it is not ready to merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saying no.&lt;/strong&gt; When the agent is about to create a hundred lines you do not fully need, stop it. Removing code is easier than understanding code you never asked for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good notes.&lt;/strong&gt; Write down why the system works the way it does, not just what it does. Cognitive debt accumulates in the gaps between code and comprehension.&lt;/p&gt;

&lt;p&gt;In my case, AI works best when I use it to compress effort, not outsource comprehension. If you are building client systems, the boring parts still matter. From &lt;a href="https://maxmendes.dev/en/services/web-development" rel="noopener noreferrer"&gt;solid web architecture&lt;/a&gt; to keeping a clean path to future changes through &lt;a href="https://maxmendes.dev/en/projects" rel="noopener noreferrer"&gt;real project maintenance&lt;/a&gt;, the &lt;a href="https://maxmendes.dev/en/blog/dead-internet-human-made-websites" rel="noopener noreferrer"&gt;dead internet problem&lt;/a&gt; taught us that quality and authenticity still win, whether we are talking about content or code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Developers Who Will Win This
&lt;/h2&gt;

&lt;p&gt;Model capability keeps improving. That is not the bottleneck anymore. AI code overload is the bigger risk, because unread code, invisible decisions, and broken mental models are what actually slow you down six months from now.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://digitaleconomy.stanford.edu/wp-content/uploads/2025/11/CanariesintheCoalMine_Nov25.pdf" rel="noopener noreferrer"&gt;Stanford Digital Economy Lab found&lt;/a&gt; that employment among software developers aged 22 to 25 fell nearly 20% between 2022 and 2025, while developers over 26 saw stable or growing employment. The "write code from tutorials" job is disappearing. The "understand systems and make decisions" job is not.&lt;/p&gt;

&lt;p&gt;I would rather ship less code I still understand than more code I already mentally abandoned. That is not a productivity problem. That is an engineering discipline, and it is the one thing AI cannot do for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://maxmendes.dev/en/blog/ai-code-overload-developers" rel="noopener noreferrer"&gt;maxmendes.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>webdev</category>
      <category>codequality</category>
    </item>
    <item>
      <title>Building a Multimodal Cross Cloud Live Agent with ADK, Amazon ECS Express, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:57:32 +0000</pubDate>
      <link>https://stormkit.forem.com/gde/building-a-multimodal-cross-cloud-live-agent-with-adk-amazon-ecs-express-and-gemini-cli-30a8</link>
      <guid>https://stormkit.forem.com/gde/building-a-multimodal-cross-cloud-live-agent-with-adk-amazon-ecs-express-and-gemini-cli-30a8</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build cross cloud apps with the Python programming language deployed to the ECS Express service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchg09muwt24i30zbhx7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchg09muwt24i30zbhx7d.png" width="758" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python Agent Demos?
&lt;/h4&gt;

&lt;p&gt;Yes there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal working agent that can be run locally and deployed cross cloud without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing and has deep libraries for working with ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To check your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;admin@ip-172-31-70-211:~/gemini-cli-aws/mcp-lightsail-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.13.12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+ECS+Express+Mode&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+ecs+express&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIJCAEQIRgKGKAB0gEIMzI0MWowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfAELWySw4fS4VoaovwdGE8MUNcOltEQ-lyCKwxY4t3OArbcxO8JX30JpX02tjJDKML-JgcQEQDIaZjDgUHMoJTycp046hy8F-_Y_zxJ9Bo0rZyERUQ6geXGT9MPUb02ZLA7LpFjGlcpRgGkURGERCNHTKdtI2kGtm-bh5XT5dS4hpo&amp;amp;csui=3&amp;amp;ved=2ahUKEwiu_YSzptWTAxVPF1kFHY8nLbwQgK4QegQIARAB" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt; (announced Nov 2025) is a simplified deployment feature for Amazon Elastic Container Service (ECS) designed to rapidly launch containerized applications, APIs, and web services on AWS Fargate. It automates infrastructure setup — including load balancing, networking, scaling, and HTTPS endpoints — allowing developers to deploy from container image to production in a single step.&lt;/p&gt;

&lt;p&gt;More details are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/express-service-overview.html" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of the Gemini CLI. You will need to authenticate with an API key or your Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;Gemini CLI needs a consistent, up-to-date version of Node. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools and Lightsail extensions need a current version of Docker. If your environment does not provide a recent Docker binary, the Docker Version Manager can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  This seems like a lot of Configuration!
&lt;/h4&gt;

&lt;p&gt;Getting the key tools in place is the first step to working across Cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting multimodal, real-time, cross-cloud agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;The agents in the demo are based on the original code lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/way-back-home-level-3/instructions#3" rel="noopener noreferrer"&gt;Way Back Home - Building an ADK Bi-Directional Streaming Agent | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK Agent is built with the visual builder. Next — the entire solution is deployed to Amazon ECS Express.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. All of the relevant code examples and documentation are available in GitHub. The repo has a wide variety of samples, but this lab will focus on the ‘level_3-ecsexpress’ setup.&lt;/p&gt;

&lt;p&gt;The next step is to clone the GitHub repository to your local environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;level_3-ecsexpress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;
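
&lt;p&gt;As a purely hypothetical sketch of the kind of variables such a script exports (the names beyond PROJECT_ID and all values here are illustrative, not taken from the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative only -- the real values come from init.sh / set_env.sh
export PROJECT_ID="my-demo-project"
export AWS_REGION="us-east-1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;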

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with Agent1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/level_3-ecsexpress/backend/app$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run biometric_agent
&lt;span class="go"&gt;Log setup complete: /tmp/agents_log/agent.20260405_093812.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent biometric_agent, type exit to exit.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Deploying to Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;The first step is to refresh the AWS credentials in the current build environment:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemini-cli-aws/level_3-ecsexpress&lt;span class="nv"&gt;$ &lt;/span&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then a utility script caches the credentials on the local system for building:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemini-cli-aws/level_3-ecsexpress&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh 
Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these &lt;span class="k"&gt;for &lt;/span&gt;deployments.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the deploy target on the local system:&lt;/p&gt;

&lt;p&gt;You can validate the final result by checking the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make deploy

✦ The application has been successfully deployed to AWS ECS Express Mode.

- Service Status: ACTIVE
   - Public Endpoint: https://bi-59e66ed2dcde45dcb1b347ce8d6ca7b8.ecs.us-east-1.on.aws
   - Deployment Cycle: IAM roles created/verified, Docker image built and pushed to ECR, and ECS service updated.

You can now access your biometric-scout-service at the above URL.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container is deployed, you can check the service status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then get the endpoint URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  bi-59e66ed2dcde45dcb1b347ce8d6ca7b8.ecs.us-east-1.on.aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will be visible in the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0uasfk2xabqi35iin9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0uasfk2xabqi35iin9k.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the Web Interface
&lt;/h4&gt;

&lt;p&gt;Start a connection to the ECS Express Deployed app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://bi-59e66ed2dcde45dcb1b347ce8d6ca7b8.ecs.us-east-1.on.aws/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect to the app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsd0jys3igkz3kari9pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsd0jys3igkz3kari9pj.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then use the Live model to process audio and video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4muhndap65r4ou8fsa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4muhndap65r4ou8fsa5.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally — complete the sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8xd82i11fookffa2kgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8xd82i11fookffa2kgh.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to enable a multi-modal agent using the Gemini Live Model. This Agent was tested locally with the CLI and then deployed to Amazon ECS Express. This approach validates that cross cloud tools can be used — even with more complex agents.&lt;/p&gt;

</description>
      <category>geminilive</category>
      <category>python</category>
      <category>gemini</category>
      <category>googleadk</category>
    </item>
    <item>
      <title>Cross Cloud Multi Agent Comic Builder with ADK, Amazon EKS, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:56:03 +0000</pubDate>
      <link>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-eks-and-gemini-cli-4o10</link>
      <guid>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-eks-and-gemini-cli-4o10</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build low code apps with the Python programming language deployed to the EKS service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python MCP Demos?
&lt;/h4&gt;

&lt;p&gt;Yes there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal working MCP stdio server that can be run locally without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing and has deep libraries for working with ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To check your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;admin@ip-172-31-70-211:~/gemini-cli-aws/mcp-lightsail-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.13.12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon EKS
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+Elastic+Kubernetes+Service&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+eks&amp;amp;gs_lcrp=EgZjaHJvbWUqBwgAEAAYgAQyBwgAEAAYgAQyBwgBEAAYgAQyBwgCEAAYgAQyCAgDEAAYFhgeMggIBBAAGBYYHjIICAUQABgWGB4yCAgGEAAYFhgeMggIBxAAGBYYHjIICAgQABgWGB4yCAgJEAAYFhge0gEINjg1N2owajSoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;ved=2ahUKEwjj6LrXrtWTAxV3LFkFHRstPUQQgK4QegYIAQgAEAQ" rel="noopener noreferrer"&gt;Amazon Elastic Kubernetes Service&lt;/a&gt; (EKS) is a fully managed service from Amazon Web Services (AWS) that makes it easy to run &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; on AWS without needing to install, operate, or maintain your own Kubernetes control plane. It automates cluster management, security, and scaling, supporting applications on both Amazon EC2 and AWS Fargate.&lt;/p&gt;

&lt;p&gt;More information is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html" rel="noopener noreferrer"&gt;What is Amazon EKS?&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of the Gemini CLI. You will need to authenticate with an API key or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;The Gemini CLI needs a consistent, up-to-date version of Node.js. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node.js environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools and Lightsail extensions need a current version of Docker. If your environment does not provide a recent Docker binary, the Docker Version Manager (dvm) can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  This seems like a lot of Configuration!
&lt;/h4&gt;

&lt;p&gt;Getting the key tools in place is the first step to working across cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting low-code agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;The agents in the demo are based on the original code lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/create-low-code-agent-with-ADK-visual-builder#0" rel="noopener noreferrer"&gt;Create and deploy low code ADK (Agent Deployment Kit) agents using ADK Visual Builder | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK agent is built with the visual builder. Next, the entire solution is deployed to Amazon EKS.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;adkui-eks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;
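
&lt;p&gt;As an illustration, the kind of variables involved can be sketched as a small profile script. The names and values other than PROJECT_ID are assumptions, not the actual contents of &lt;strong&gt;set_env.sh&lt;/strong&gt;:&lt;/p&gt;

```shell
# Hypothetical environment profile (PROJECT_ID is referenced by the build
# scripts; AWS_REGION and both values here are illustrative assumptions).
export PROJECT_ID="my-demo-project"
export AWS_REGION="us-east-1"
echo "PROJECT_ID=$PROJECT_ID AWS_REGION=$AWS_REGION"
```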

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with Agent1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-eks$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run Agent1
&lt;span class="go"&gt;Log setup complete: /tmp/agents_log/agent.20260404_211625.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/utils/agent_loader.py:277: UserWarning: [EXPERIMENTAL] _load_from_yaml_config: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  if root_agent := self._load_from_yaml_config(actual_agent_name, agents_dir):
/home/xbill/.local/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:81: UserWarning: [EXPERIMENTAL] feature FeatureName.AGENT_CONFIG is enabled.
  check_feature_enabled()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent Agent1, type exit to exit.
[user]: what is Amazon EKS
[Agent1]: Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service offered by Amazon Web Services (AWS). It simplifies the process of deploying, managing, and scaling containerized applications using Kubernetes on AWS.

Key aspects of Amazon EKS include:
* **Fully Managed Control Plane** AWS takes on the responsibility of managing the Kubernetes control plane, which includes tasks like scheduling applications, managing their lifecycle, and storing the cluster's state. This means AWS handles the infrastructure, scaling, patching, and updates for the control plane, freeing users to focus on their applications.
* **Kubernetes Conformance** Amazon EKS is a certified Kubernetes conformant service, allowing you to use standard Kubernetes tools and plugins. This also means you can easily migrate existing Kubernetes applications without modification.
* **High Availability and Scalability** The EKS control plane is automatically distributed across multiple Availability Zones to ensure high availability and resilience. EKS also integrates with AWS services for scaling worker nodes and pods.
* **Integration with AWS Services** EKS seamlessly integrates with other AWS services such as Amazon VPC for networking, AWS Identity and Access Management (IAM) for authentication, Amazon CloudWatch for monitoring, and Auto Scaling Groups for scaling.
* **Deployment Options** While primarily for running Kubernetes on the AWS cloud, Amazon EKS also offers deployment options for on-premises and edge environments through Amazon EKS Anywhere and Amazon EKS on AWS Outposts. These options allow for consistent Kubernetes management across various infrastructures.

Essentially, Amazon EKS reduces the operational complexity of running Kubernetes, allowing organizations to leverage the benefits of container orchestration without the overhead of managing the underlying infrastructure themselves.Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service provided by Amazon Web Services (AWS). It is designed to simplify the deployment, management, and scaling of containerized applications using Kubernetes on the AWS cloud, and also offers options for on-premises and edge environments. 0.0s 0.0s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
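
&lt;p&gt;The &lt;code&gt;_load_from_yaml_config&lt;/code&gt; warnings in the log suggest that Agent1 is defined declaratively. A hypothetical sketch of such an agent config follows; the field names track the experimental ADK Agent Config format and may differ in your ADK version:&lt;/p&gt;

```yaml
# Hypothetical Agent1 config sketch (experimental ADK Agent Config format;
# the model name and instruction wording are assumptions).
name: Agent1
model: gemini-2.5-flash
description: Answers questions about AWS services.
instruction: |
  You are a helpful assistant. Explain AWS services such as Amazon EKS
  clearly and concisely.
```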



&lt;h4&gt;
  
  
  Deploying to Amazon EKS
&lt;/h4&gt;

&lt;p&gt;First authenticate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then cache the credentials locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-eks$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then start the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ Deployment to Amazon EKS was successful.

  Deployment Summary

   - EKS Cluster: adkui-eks-cluster (Status: ACTIVE)
   - Image: 106059658660.dkr.ecr.us-east-1.amazonaws.com/adk-comic-image:latest
   - Pod Status: Running (1/1 READY)
   - Service Endpoint: http://af62eb56d13b74cefb372550e726efaa-1528063823.us-east-1.elb.amazonaws.com

  The make deploy command completed the following steps:
   1. Updated kubeconfig for the EKS cluster.
   2. Built the Docker image based on the Dockerfile.
   3. Logged in to Amazon ECR and pushed the image.
   4. Generated k8s-deployment.yaml and applied it to the cluster.

  You can now access the ADK Web UI at the endpoint listed above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
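
&lt;p&gt;Step 4 of the summary generates a &lt;code&gt;k8s-deployment.yaml&lt;/code&gt;. A hypothetical sketch of what such a manifest could contain; the container port, labels, and replica count are assumptions, while the image matches the summary above:&lt;/p&gt;

```yaml
# Hypothetical sketch of the generated k8s-deployment.yaml
# (port, labels, and replica count are assumptions).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: adk-comic-image
spec:
  replicas: 1
  selector:
    matchLabels:
      app: adk-comic-image
  template:
    metadata:
      labels:
        app: adk-comic-image
    spec:
      containers:
        - name: adk-comic-image
          image: 106059658660.dkr.ecr.us-east-1.amazonaws.com/adk-comic-image:latest
          ports:
            - containerPort: 8000   # ADK web server port (assumption)
---
apiVersion: v1
kind: Service
metadata:
  name: adk-comic-image
spec:
  type: LoadBalancer   # provisions the ELB endpoint shown in the summary
  selector:
    app: adk-comic-image
  ports:
    - port: 80
      targetPort: 8000
```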



&lt;p&gt;You can validate the final result by checking the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The EKS LoadBalancer endpoint is:
  http://af62eb56d13b74cefb372550e726efaa-1528063823.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The service will be visible in the AWS console, which will look similar to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femr6g0fen6vd2zzx2j60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femr6g0fen6vd2zzx2j60.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the ADK Web Interface
&lt;/h4&gt;

&lt;p&gt;Open the endpoint of the EKS-deployed ADK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
  http://af62eb56d13b74cefb372550e726efaa-1528063823.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will bring up the ADK UI. Select the sub-agent “Agent3”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will generate the comic using a multi-agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the multi-agent system is complete:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Visual Edit Agent Pipeline
&lt;/h4&gt;

&lt;p&gt;The deployed version of the ADK includes a visual builder:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Run the Online Viewer Agent
&lt;/h4&gt;

&lt;p&gt;Once Agent3 has completed, go to the ADK agent selector and select “Agent4”. This agent will let you browse your online comic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  View the Final Artifacts
&lt;/h4&gt;

&lt;p&gt;You can use Agent4 to visualize the results of the agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the final panels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to visually define a multi-agent pipeline that generates comic-book-style HTML. The agent was tested locally with the CLI and then with the ADK web tool. Then, several sample ADK agents were run directly from the EKS deployment in AWS. This approach validates that cross-cloud tools can be used, even with more complex agents.&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>googleadk</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>Cross Cloud Multi Agent Comic Builder with ADK, Amazon ECS Express, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:54:24 +0000</pubDate>
      <link>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-ecs-express-and-gemini-cli-41me</link>
      <guid>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-ecs-express-and-gemini-cli-41me</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build low code apps with the Python programming language deployed to the ECS express service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python MCP Demos?
&lt;/h4&gt;

&lt;p&gt;Yes, there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal, working multi-agent ADK example that can be deployed to AWS without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing and has deep libraries for working with ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To validate your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.13.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon ECS Express Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+ECS+Express+Mode&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+ecs+express&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIJCAEQIRgKGKAB0gEIMzI0MWowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfAELWySw4fS4VoaovwdGE8MUNcOltEQ-lyCKwxY4t3OArbcxO8JX30JpX02tjJDKML-JgcQEQDIaZjDgUHMoJTycp046hy8F-_Y_zxJ9Bo0rZyERUQ6geXGT9MPUb02ZLA7LpFjGlcpRgGkURGERCNHTKdtI2kGtm-bh5XT5dS4hpo&amp;amp;csui=3&amp;amp;ved=2ahUKEwiu_YSzptWTAxVPF1kFHY8nLbwQgK4QegQIARAB" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt; (announced Nov 2025) is a simplified deployment feature for Amazon Elastic Container Service (ECS) designed to rapidly launch containerized applications, APIs, and web services on AWS Fargate. It automates infrastructure setup — including load balancing, networking, scaling, and HTTPS endpoints — allowing developers to deploy from container image to production in a single step.&lt;/p&gt;

&lt;p&gt;More details are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/express-service-overview.html" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of the Gemini CLI. You will need to authenticate with an API key or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini

▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;The Gemini CLI needs a consistent, up-to-date version of Node.js. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node.js environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools need a current version of Docker. If your environment does not provide a recent Docker binary, the Docker Version Manager (dvm) can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  This seems like a lot of Configuration!
&lt;/h4&gt;

&lt;p&gt;Getting the key tools in place is the first step to working across cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting low-code agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;The agents in the demo are based on the original code lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/create-low-code-agent-with-ADK-visual-builder#0" rel="noopener noreferrer"&gt;Create and deploy low code ADK (Agent Deployment Kit) agents using ADK Visual Builder | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK agent is built with the visual builder. Next, the entire solution is deployed to Amazon ECS Express.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;adkui-ecsexpress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with Agent1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-ecsexpress$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run Agent1
&lt;span class="go"&gt;Log setup complete: /tmp/agents_log/agent.20260404_202121.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/utils/agent_loader.py:277: UserWarning: [EXPERIMENTAL] _load_from_yaml_config: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  if root_agent := self._load_from_yaml_config(actual_agent_name, agents_dir):
/home/xbill/.local/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:81: UserWarning: [EXPERIMENTAL] feature FeatureName.AGENT_CONFIG is enabled.
  check_feature_enabled()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent Agent1, type exit to exit.
[user]: what is amazon ecs express
[Agent1]: Amazon ECS Express Mode is a new feature for Amazon Elastic Container Service (ECS) that simplifies and accelerates the deployment and management of containerized applications, particularly web applications and APIs, on AWS. It aims to reduce the operational overhead for developers by automating much of the infrastructure setup that typically accompanies deploying containerized applications to production. 0.0s 0.0s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploying to Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;First authenticate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then cache the credentials locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-ecsexpress$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then start the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt; &lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;deploy&lt;/span&gt;
&lt;span class="err"&gt;✦&lt;/span&gt; &lt;span class="err"&gt;I&lt;/span&gt; &lt;span class="err"&gt;will&lt;/span&gt; &lt;span class="err"&gt;execute&lt;/span&gt; &lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;deploy&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;initiate&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;full&lt;/span&gt; &lt;span class="err"&gt;ECS&lt;/span&gt; &lt;span class="err"&gt;Express&lt;/span&gt; &lt;span class="err"&gt;Mode&lt;/span&gt; &lt;span class="err"&gt;deployment&lt;/span&gt; &lt;span class="err"&gt;cycle,&lt;/span&gt; &lt;span class="err"&gt;including&lt;/span&gt; &lt;span class="err"&gt;building&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;Docker&lt;/span&gt;
  &lt;span class="err"&gt;image,&lt;/span&gt; &lt;span class="err"&gt;pushing&lt;/span&gt; &lt;span class="err"&gt;it&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;ECR,&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="err"&gt;deploying&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;ECS.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can validate the final result by checking the status messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The ECS service adkui-ecsexpress is currently ACTIVE.

   * Service Name: adkui-ecsexpress
   * Status: ACTIVE
   * Endpoint: http://ad-27f169e1d3994ae3a8fd357bc014bbd2.ecs.us-east-1.on.aws
     (http://ad-27f169e1d3994ae3a8fd357bc014bbd2.ecs.us-east-1.on.aws)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The service will be visible in the AWS console. The console will look similar to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dl0tnrpxr5voaoiehhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dl0tnrpxr5voaoiehhg.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the ADK Web Interface
&lt;/h4&gt;

&lt;p&gt;Open the deployed ADK endpoint in your browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://ad-27f169e1d3994ae3a8fd357bc014bbd2.ecs.us-east-1.on.aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will bring up the ADK UI. Select the sub-agent “Agent3”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will generate the comic using a multi-agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the multi-agent run is complete:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Visual Edit Agent Pipeline
&lt;/h4&gt;

&lt;p&gt;The deployed version of the ADK includes a visual builder:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Run the Online Viewer Agent
&lt;/h4&gt;

&lt;p&gt;Once Agent3 has completed, go to the ADK agent selector and select “Agent4”. This agent will allow you to browse your online comic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  View the Final Artifacts
&lt;/h4&gt;

&lt;p&gt;You can use Agent4 to visualize the results of the agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the final panels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to visually define a multi-agent pipeline that generates comic-book-style HTML. The agent was tested locally with the CLI and then with the ADK web tool. Finally, several sample ADK agents were run directly from the ECS Express deployment in AWS. This validates that cross-cloud tools can be used even with more complex agents.&lt;/p&gt;

</description>
      <category>google</category>
      <category>gemini</category>
      <category>ecsexpress</category>
      <category>python</category>
    </item>
    <item>
      <title>The AI Development Stack: Fundamentals Every Developer Should Actually Understand</title>
      <dc:creator>Tomás Garcia</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:53:22 +0000</pubDate>
      <link>https://stormkit.forem.com/toms_garcia_6574fe315ddb/the-ai-development-stack-fundamentals-every-developer-should-actually-understand-5fei</link>
      <guid>https://stormkit.forem.com/toms_garcia_6574fe315ddb/the-ai-development-stack-fundamentals-every-developer-should-actually-understand-5fei</guid>
      <description>&lt;p&gt;Most developers are already using AI tools daily — Copilot, Claude, ChatGPT. But when it comes to &lt;em&gt;building&lt;/em&gt; with AI, there's a gap. Not in tutorials or API docs, but in the foundational mental model of how these systems actually work and fit together.&lt;/p&gt;

&lt;p&gt;This is the stuff I wish someone had laid out clearly when I started building AI-powered features. Not the hype, not the theory — the practical fundamentals that change how you architect, debug, and think about AI systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  Language Models: What's Actually Happening
&lt;/h3&gt;

&lt;p&gt;A Language Model (LM) is a neural network that encodes statistical information about language. Intuitively, it tells you how likely a word is to appear in a given context. Given "my favorite color is ___", a well-trained LM should predict "blue" more often than "car."&lt;/p&gt;

&lt;p&gt;The atomic unit here is the &lt;strong&gt;token&lt;/strong&gt; — which can be a character, a word, or a subword (like "tion") depending on the model's tokenizer.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Large Language Model (LLM)&lt;/strong&gt; is just an LM trained on massive amounts of data using self-supervised learning. The key distinction isn't just scale — it's that at scale, capabilities &lt;em&gt;emerge&lt;/em&gt; that were never explicitly programmed. An LM predicts the next token. An LLM does it at such scale that reasoning, coding, and creative abilities appear as emergent properties.&lt;/p&gt;
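The "predict the next token" idea can be made concrete with a toy bigram model: counting which token tends to follow which in a tiny corpus. This is only a sketch of the statistical intuition; real LMs are neural networks, not count tables.

```python
from collections import Counter, defaultdict

# Toy bigram language model: for each token, count what follows it.
corpus = ("my favorite color is blue . my favorite color is blue . "
          "my favorite car is fast .").split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(prev):
    """Probability distribution over the next token, given the previous one."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(predict("is"))  # 'blue' comes out most likely after 'is'
```

Given "is", the model assigns "blue" a probability of 2/3, matching the "my favorite color is ___" intuition above.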

&lt;p&gt;&lt;strong&gt;Foundation Model (FM)&lt;/strong&gt; is the broadest term. It covers both LLMs (text-only) and Large Multimodal Models (LMMs), which can process text, images, video, audio, and 3D assets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvovffxclp8co618vpfsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvovffxclp8co618vpfsw.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  What Is an Agent?
&lt;/h3&gt;

&lt;p&gt;An agent is a system that uses an LLM to operate in a loop: it reasons about what to do, takes action (tool calls, code execution, API calls), observes the result, and repeats until the task is complete.&lt;/p&gt;

&lt;p&gt;The basic loop looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THINK&lt;/strong&gt; — the agent receives the current context and decides what to do: respond directly, or call a tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ACT&lt;/strong&gt; — if it decided to use a tool, it executes it (web search, DB query, API call).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OBSERVE&lt;/strong&gt; — the result gets added to the context, and the cycle starts again.&lt;/p&gt;

&lt;p&gt;The loop terminates when the model has enough information to give a final answer, or when an external limit is reached (max iterations, timeout).&lt;/p&gt;

&lt;p&gt;This is deceptively simple. But every meaningful AI product you've used — from Claude Code to Cursor to Devin — is some variation of this loop.&lt;/p&gt;
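In code, the think-act-observe loop can be sketched like this. The model (`fake_llm`) and the tool registry are toy stand-ins, not a real provider API; the shape of the loop is the point.

```python
# Minimal sketch of the THINK / ACT / OBSERVE loop described above.
# `fake_llm` and TOOLS are illustrative stand-ins for a real model and tools.

TOOLS = {
    "web_search": lambda query: f"result for {query!r}",
}

def fake_llm(context):
    """Toy 'model': asks for a tool once, then gives a final answer."""
    if not any(m["role"] == "tool" for m in context):
        return {"type": "tool_call", "tool": "web_search",
                "args": ["dollar price today"]}
    return {"type": "answer", "text": "Here is what I found."}

def run_agent(user_message, max_iterations=5):
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):                          # external limit
        decision = fake_llm(context)                         # THINK
        if decision["type"] == "answer":
            return decision["text"]
        result = TOOLS[decision["tool"]](*decision["args"])  # ACT
        context.append({"role": "tool", "content": result})  # OBSERVE
    return "Stopped: iteration limit reached."

print(run_agent("What is the dollar price today?"))  # prints: Here is what I found.
```

Note both termination paths from the text are present: the model deciding it has enough information, and the external iteration cap.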

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhznwgupxm1g05vu4jm1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhznwgupxm1g05vu4jm1u.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Tools: How LLMs Touch the Real World
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;tool&lt;/strong&gt; is an external function that the agent can invoke to interact with the world outside the LLM.&lt;/p&gt;

&lt;p&gt;Here's what's important to understand: the LLM by itself &lt;em&gt;only generates text&lt;/em&gt;. Tools are what let it do real things — fetch live information, read files, execute code, call APIs, write to a database.&lt;/p&gt;

&lt;p&gt;Concrete examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dollar price today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;query_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client@mail.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without tools, an LLM is a very sophisticated autocomplete. With tools, it becomes an agent that can actually operate in your environment.&lt;/p&gt;




&lt;h3&gt;
  
  
  Context: The Model's Working Memory
&lt;/h3&gt;

&lt;p&gt;Context is all the information the agent has "in memory" at a given moment to generate a coherent response. Think of it as a text box the model reads in its entirety on every call.&lt;/p&gt;

&lt;p&gt;It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt; — base instructions defining the model's behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documents&lt;/strong&gt; — reference material injected for the task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User message&lt;/strong&gt; — the actual request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Previous responses&lt;/strong&gt; — conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool results&lt;/strong&gt; — outputs from tool executions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The context has a physical limit called the &lt;strong&gt;context window&lt;/strong&gt;, measured in tokens. Anything that doesn't fit in that window, the model simply &lt;em&gt;doesn't see&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is why context design matters so much when building agents. The system prompt, the conversation history you preserve, what you include and what you drop — all of that directly impacts response quality, latency, and cost.&lt;/p&gt;
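A sketch of assembling the context for a single call, in the order listed above. The message shapes and the 4-characters-per-token heuristic are illustrative, not any specific provider's API.

```python
# Assemble the context for one LLM call, in the order listed above.
# Message dicts are illustrative; real APIs have their own schemas.

def build_context(system_prompt, documents, history, tool_results, user_message):
    messages = [{"role": "system", "content": system_prompt}]
    for doc in documents:                      # injected reference material
        messages.append({"role": "system", "content": f"Reference:\n{doc}"})
    messages.extend(history)                   # previous conversation turns
    for result in tool_results:                # outputs from tool executions
        messages.append({"role": "tool", "content": result})
    messages.append({"role": "user", "content": user_message})
    return messages

def fits_window(messages, token_limit):
    # Crude heuristic: ~4 characters per token. Real code uses a tokenizer.
    estimated_tokens = sum(len(m["content"]) for m in messages) // 4
    return estimated_tokens <= token_limit
```

Anything `build_context` drops, or anything `fits_window` would force you to cut, the model simply never sees, which is the trade-off context design is about.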

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fffvb46aylsx4fgsr62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fffvb46aylsx4fgsr62.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Memory: Beyond the Context Window
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; is the mechanism that allows an agent to access information beyond its context window.&lt;/p&gt;

&lt;p&gt;Two concrete examples you're probably already using:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude.ai&lt;/strong&gt; — at the start of every conversation, the context is empty. What it "remembers" from past chats comes from Anthropic injecting a summary of previous conversations into the context before you start typing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; — when you're working on a project, it reads files like &lt;code&gt;CLAUDE.md&lt;/code&gt;, the directory tree, and relevant codebase files. It doesn't "know" them from memory — it loads them into context when needed, via tools.&lt;/p&gt;

&lt;p&gt;The key insight: there is no magic persistence. Everything the model "remembers" was explicitly loaded into the context window for that specific call.&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompting: The Developer's Primary Interface
&lt;/h3&gt;

&lt;p&gt;Prompting is the skill of giving instructions to an LLM to get the output you want. It's the primary interface between you and the model.&lt;/p&gt;

&lt;p&gt;What the LLM receives isn't just what the user types. A complete message typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt; — base instructions defining behavior, role, constraints, response format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User prompt&lt;/strong&gt; — the user's message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; — conversation history, tool results, relevant documents, retrieved memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available tools&lt;/strong&gt; — the list of functions the agent can invoke, with their descriptions and parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of that together is what the LLM "reads" before generating its response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core techniques:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-shot&lt;/strong&gt; — you ask directly without examples.&lt;br&gt;
&lt;em&gt;"Translate this text to English"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot&lt;/strong&gt; — you provide examples of expected behavior before the question.&lt;br&gt;
&lt;em&gt;"Input: 'loved it' → Sentiment: positive. Input: 'disgusting' → Sentiment: negative. Input: 'it was okay' → Sentiment:"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chain of thought&lt;/strong&gt; — you ask the model to reason step by step before answering.&lt;br&gt;
&lt;em&gt;"Think step by step before responding"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A practical rule: the model doesn't guess your intent; it only predicts the next token. The clearer and more specific the prompt, the more predictable and useful the output.&lt;/p&gt;
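The three techniques above can be written down as plain prompt strings. The inputs are toy examples and the model call itself is omitted:

```python
# The three core prompting techniques as plain strings (no model call).

# Zero-shot: ask directly, no examples.
zero_shot = "Translate this text to English: 'buenos días'"

# Few-shot: show examples of the expected behavior before the question.
few_shot = (
    "Input: 'loved it' -> Sentiment: positive\n"
    "Input: 'disgusting' -> Sentiment: negative\n"
    "Input: 'it was okay' -> Sentiment:"
)

# Chain of thought: instruct the model to reason step by step first.
chain_of_thought = (
    "A train travels 120 km in 2 hours. What is its speed?\n"
    "Think step by step before responding."
)
```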


&lt;h3&gt;
  
  
  Evals: Testing in a Non-Deterministic World
&lt;/h3&gt;

&lt;p&gt;An LLM is not a deterministic function. The same input can produce different outputs on every run.&lt;/p&gt;

&lt;p&gt;This breaks something fundamental for developers: you can't write an &lt;code&gt;assert&lt;/code&gt; on an LLM's response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# this doesn't work with LLMs
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# it might respond "Paris.", "The capital is Paris", "París"...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conceptual distinction matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt; — verifies that a function produces an exact, predictable output given an input. Pass or fail. Works when the system is deterministic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval&lt;/strong&gt; — measures how &lt;em&gt;good&lt;/em&gt; a response is according to one or more criteria: relevance, coherence, correctness, tone. Produces a score, not a boolean.&lt;/p&gt;

&lt;p&gt;For most open-ended tasks, a perfect reference answer doesn't exist. This led to &lt;strong&gt;AI-as-a-Judge&lt;/strong&gt;, where one AI model evaluates the output of another. It's popular because it's fast, scalable, and can evaluate subjective criteria like creativity or coherence without needing reference text.&lt;/p&gt;

&lt;p&gt;But it has known limitations: AI judges have biases like &lt;strong&gt;position bias&lt;/strong&gt; (favoring the first response in a comparison) and &lt;strong&gt;verbosity bias&lt;/strong&gt; (preferring longer answers even when they contain errors).&lt;/p&gt;
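A toy eval makes the test-versus-eval distinction concrete: instead of asserting an exact string, you score the response against criteria. The criteria and weights here are illustrative; real evals often delegate subjective criteria to an AI judge.

```python
# Toy eval: produce a score, not a boolean. Criteria and weights are
# illustrative stand-ins for a real eval harness or AI judge.

def eval_response(response, must_contain, max_words=50):
    score = 0.0
    if must_contain.lower() in response.lower():
        score += 0.5   # correctness criterion
    if len(response.split()) <= max_words:
        score += 0.5   # conciseness criterion
    return score

# "Paris.", "The capital is Paris", etc. all pass the correctness check.
assert eval_response("The capital is Paris.", "Paris") == 1.0
assert eval_response("I am not sure.", "Paris") == 0.5  # concise but wrong
```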

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6n9dbwwo9oa3h6brqn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6n9dbwwo9oa3h6brqn9.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Guardrails: The Safety Net You Need
&lt;/h3&gt;

&lt;p&gt;Guardrails protect the system both from malicious inputs and problematic outputs.&lt;/p&gt;

&lt;p&gt;They operate in two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input Guardrails&lt;/strong&gt; prevent prompt injection attacks and filter sensitive data (PII) before it reaches external APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output Guardrails&lt;/strong&gt; verify the model's responses for toxicity, factual inconsistencies, and format errors — typically using a fast classifier or an AI judge before showing the response to the user.&lt;/p&gt;

&lt;p&gt;The reasoning is straightforward: since the LLM is probabilistic, you can't guarantee it will always behave as expected. Guardrails implement checks at both ends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
    ↓
[Input Guardrail]  ← PII, prompt injection, malicious content
    ↓
   LLM
    ↓
[Output Guardrail] ← toxicity, hallucinations, bad formatting
    ↓
Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off: guardrails add latency to every response. It's a cost worth paying for production systems, but you need to be intentional about what you check and how.&lt;/p&gt;
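The two layers can be sketched as a wrapper around the model call. These checks are deliberately naive stand-ins; production guardrails use trained classifiers or an AI judge, and the patterns below are only illustrative.

```python
import re

# Naive guardrail sketches for both ends of the pipeline in the diagram.

def input_guardrail(text):
    """Reject an obvious injection phrase and redact email-like PII."""
    if "ignore previous instructions" in text.lower():
        raise ValueError("possible prompt injection")
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)

def output_guardrail(text, banned=("credit card",)):
    """Return True if the response passes the output checks."""
    return all(phrase not in text.lower() for phrase in banned)

def guarded_call(llm, user_input):
    safe_input = input_guardrail(user_input)   # input layer
    response = llm(safe_input)                 # model call
    if not output_guardrail(response):         # output layer
        return "Sorry, I can't share that."
    return response
```

Each check runs on every request, which is where the latency cost mentioned above comes from.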




&lt;h3&gt;
  
  
  MCP (Model Context Protocol): The USB-C of AI Tools
&lt;/h3&gt;

&lt;p&gt;Before MCP, if you wanted an agent to use an external tool — say, search Notion, query a database, or read a Google Drive file — you had to implement that integration yourself: authentication, request formatting, error handling, and then describe it to the LLM in the system prompt so it knew how to use it.&lt;/p&gt;

&lt;p&gt;The problem: every agent, every LLM, every app was reimplementing the same integrations from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; is a standard interface between agents and external tools — it defines how an LLM discovers, invokes, and receives results from tools, regardless of who implemented them.&lt;/p&gt;

&lt;p&gt;Two components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt; — exposes tools to the agent. Can be local (a process running on your machine) or remote (a cloud service). Implements the concrete tools: read files, query APIs, execute code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Client&lt;/strong&gt; — the agent or app that consumes the tools. Connects to the server, discovers available tools, and invokes them during the think-act-observe loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Agent / MCP Client]
    ↓  "what tools do you have?"
[MCP Server]
    ↓  "I have: read_file, search_notion, query_db"
[Agent]
    ↓  calls read_file("README.md")
[MCP Server]
    ↓  returns the content
[Agent]  ← adds result to context and continues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code, for example, acts as an MCP Client. You can add MCP Servers with a simple command — &lt;code&gt;claude mcp add server-name&lt;/code&gt; — and from that moment Claude Code has access to whatever tools that server exposes. A Postgres MCP Server gives Claude Code the ability to query your database directly during a development session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeg1kcc1nmig11r9ajaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeg1kcc1nmig11r9ajaj.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  RAG (Retrieval-Augmented Generation): Grounding Responses in Your Data
&lt;/h3&gt;

&lt;p&gt;The problem: the LLM's knowledge is limited to its training data. It knows nothing about your codebase, your internal docs, real-time data, or anything after its knowledge cutoff date.&lt;/p&gt;

&lt;p&gt;RAG is the pragmatic alternative to retraining: instead of teaching the model your data, you pass it the relevant information in context right before it responds.&lt;/p&gt;

&lt;p&gt;The flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question
    ↓
[Search] ← finds the most relevant fragments
           in a vector database (fed with document chunks)
    ↓
[Augmented context] ← question + relevant fragments
    ↓
   LLM
    ↓
Response grounded in those documents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt; — documents are split into fragments (chunks) and converted into vectors (embeddings) that represent their semantic meaning. Stored in a vector database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt; — when a question arrives, it's also converted into a vector and the most semantically similar fragments are retrieved from the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt; — the retrieved fragments are injected into the LLM's context along with the question, and the model responds based on that information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots over internal documentation or knowledge bases&lt;/li&gt;
&lt;li&gt;Assistants that need real-time information (news, prices, live data)&lt;/li&gt;
&lt;li&gt;Q&amp;amp;A over code, contracts, reports — any data the model doesn't know&lt;/li&gt;
&lt;li&gt;Reducing hallucinations by anchoring responses to concrete sources&lt;/li&gt;
&lt;/ul&gt;
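The three components can be wired together end to end in a toy version. The "embedding" here is just a bag-of-words count vector compared by cosine similarity; real systems use a neural embedding model and a vector database, so this is only a sketch of the flow.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy 'embedding': bag-of-words counts (real systems use a model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: split documents into chunks and index their vectors.
chunks = [
    "Velero backs up Kubernetes clusters.",
    "Our refund policy allows returns within 30 days.",
    "The deploy pipeline pushes images to ECR.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval: find the chunks most similar to the question.
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Generation: inject the retrieved chunks into the LLM's prompt.
question = "What is the refund policy?"
prompt = (
    "Answer using only this context:\n"
    + "\n".join(retrieve(question))
    + f"\n\nQuestion: {question}"
)
```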

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6bxc74hendt3l7pr9wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6bxc74hendt3l7pr9wx.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Developer Interfaces: How You Actually Use LLMs
&lt;/h3&gt;

&lt;p&gt;An LLM can be consumed in different ways depending on the use case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web&lt;/strong&gt; — the most accessible form. Go to a URL, type, get a response. Ideal for exploring, iterating on prompts, or one-off tasks. No code required. Examples: Claude.ai, ChatGPT, Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt; — the programmatic form. You make an HTTP request and get the response in your code. It's the foundation of any product or agent you build. Gives you full control over the prompt, model, parameters, and integration with your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.anthropic.com/v1/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: &lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "claude-sonnet-4-20250514",
       "messages": [{"role": "user", "content": "Hello"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI (Terminal)&lt;/strong&gt; — command-line tools that wrap the API and let you interact with the LLM from your terminal, integrated into your development workflow. The most relevant example today is Claude Code: an agent that runs in your terminal, has access to your codebase, can read and write files, execute commands, and operates in the think-act-observe loop we already covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IDE&lt;/strong&gt; — extensions that integrate the LLM directly into your editor. The model sees your code in context and can suggest, complete, refactor, or explain without leaving the environment. Examples: Cursor, GitHub Copilot, or the Claude extension for VS Code.&lt;/p&gt;




&lt;h3&gt;
  
  
  Putting It All Together
&lt;/h3&gt;

&lt;p&gt;None of these concepts exist in isolation. When you use Claude Code to refactor a function, here's what's actually happening: the &lt;strong&gt;LLM&lt;/strong&gt; is processing your request within a &lt;strong&gt;context window&lt;/strong&gt; loaded with your &lt;strong&gt;system prompt&lt;/strong&gt;, codebase files (loaded via &lt;strong&gt;tools&lt;/strong&gt;), and conversation &lt;strong&gt;memory&lt;/strong&gt;. It operates in an &lt;strong&gt;agent loop&lt;/strong&gt; — thinking, acting, observing. The tools it uses to read and write your files might come through &lt;strong&gt;MCP servers&lt;/strong&gt;. If it's pulling in documentation, that might be &lt;strong&gt;RAG&lt;/strong&gt; at work. And somewhere in the pipeline, &lt;strong&gt;guardrails&lt;/strong&gt; are ensuring the outputs are safe.&lt;/p&gt;

&lt;p&gt;Understanding these fundamentals doesn't just help you use AI tools better — it's the foundation for building them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>Velero Going CNCF Isn't About Backup. It's About Control.</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:53:01 +0000</pubDate>
      <link>https://stormkit.forem.com/ntctech/velero-going-cncf-isnt-about-backup-its-about-control-3lp7</link>
      <guid>https://stormkit.forem.com/ntctech/velero-going-cncf-isnt-about-backup-its-about-control-3lp7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7htdxap9xlt28vj62nqi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7htdxap9xlt28vj62nqi.jpg" alt="Velero CNCF backup governance shift illustrated as dark server room with purple and cyan gradient lighting overlaid with architectural blueprint grid lines representing Kubernetes control plane authority" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Velero CNCF backup announcement at KubeCon EU 2026 was framed as an open source governance story. Broadcom contributed Velero — its Kubernetes-native backup, restore, and migration tool — to the CNCF Sandbox, where it was accepted by the CNCF Technical Oversight Committee.&lt;/p&gt;

&lt;p&gt;Most coverage treated this as a backup story. It isn't.&lt;/p&gt;

&lt;p&gt;Velero moving to CNCF governance is a control plane story disguised as an open source announcement. And if your team is running stateful workloads on Kubernetes, the distinction between vendor-neutral governance and vendor-independent operations is the architectural decision that sits beneath the headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Velero CNCF Backup Move Actually Means
&lt;/h2&gt;

&lt;p&gt;Velero originated at Heptio — founded by Kubernetes co-creators Joe Beda and Craig McLuckie — which VMware acquired in 2019. It's been under VMware, then Broadcom stewardship ever since. The project operates at the Kubernetes API layer, not the storage layer. All backup operations are defined via CRDs (&lt;code&gt;Backup&lt;/code&gt;, &lt;code&gt;Restore&lt;/code&gt;, &lt;code&gt;Schedule&lt;/code&gt;, &lt;code&gt;BackupStorageLocation&lt;/code&gt;, &lt;code&gt;VolumeSnapshotLocation&lt;/code&gt;) and managed through standard Kubernetes control loops.&lt;/p&gt;
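&lt;p&gt;As a rough sketch, a minimal &lt;code&gt;Backup&lt;/code&gt; resource looks like the following (the resource name and workload namespace are illustrative; &lt;code&gt;velero&lt;/code&gt; is the conventional install namespace):&lt;/p&gt;

```yaml
# Illustrative Velero Backup resource: declarative cluster state, not a disk image.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: nightly-app-backup    # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
    - app                     # hypothetical workload namespace
  storageLocation: default    # references a BackupStorageLocation (an external bucket)
  ttl: 720h0m0s               # retention period
```

Everything here is reconciled by standard Kubernetes control loops, which is exactly why the backup definitions themselves live and die with the cluster's etcd.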

&lt;p&gt;At KubeCon EU, Broadcom formalized the transition: Velero is now a CNCF Sandbox project, with maintainers from Broadcom, Red Hat, and Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdijcggn7eijzgx47vh1u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdijcggn7eijzgx47vh1u.jpg" alt="Timeline diagram showing Velero's governance history from Heptio 2017 to VMware acquisition 2019 to Broadcom 2023 to CNCF Sandbox 2026 with purple accent markers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Broadcom's own framing was telling: &lt;em&gt;"We really don't want people to mistrust the open source project and believe that it's somehow a VMware thing even though it hasn't been a VMware thing for quite some time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This move is as much about trust repair as governance mechanics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vendor-Neutral ≠ Vendor-Independent
&lt;/h2&gt;

&lt;p&gt;This is the distinction most teams will miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-neutral governance&lt;/strong&gt; means no single vendor controls the roadmap. CNCF governance means Broadcom can no longer make breaking changes to Velero unilaterally. Community-steered, broader contributor base. That's real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-independent operations&lt;/strong&gt; means your recovery path survives without the vendor. That's a different question entirely — and CNCF governance doesn't answer it.&lt;/p&gt;

&lt;p&gt;Your backup storage location is still a cloud bucket outside your cluster. Your IAM credentials still have to reach that bucket. Your restore workflow still depends on a working target cluster. None of those operational dependencies changed on March 24th.&lt;/p&gt;
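&lt;p&gt;That dependency is visible in the configuration itself. A sketch of a &lt;code&gt;BackupStorageLocation&lt;/code&gt; (bucket and region values are placeholders) shows the external bucket and, implicitly, the IAM credentials needed to reach it:&lt;/p&gt;

```yaml
# Illustrative BackupStorageLocation: the backup target lives outside the cluster.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws               # object-store plugin in use
  objectStorage:
    bucket: my-velero-bucket  # placeholder; an external cloud bucket
  config:
    region: us-east-1         # placeholder
```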




&lt;h2&gt;
  
  
  The Real Architecture Question
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;When your cluster dies — what actually survives?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Velero operates at the Kubernetes API layer, which makes it a &lt;strong&gt;state reconstruction layer&lt;/strong&gt;, not a storage tool. A Velero backup is a portable snapshot of declarative cluster state — namespaces, CRDs, RBAC policies, PVC claims — not a disk image.&lt;/p&gt;

&lt;p&gt;That portability is the real capability. A backup taken on VKS can theoretically be restored on EKS, AKS, or bare-metal kubeadm — because it operates through the Kubernetes API, not hypervisor-specific snapshots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4w9xfd8qvfz02w51dsi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4w9xfd8qvfz02w51dsi.jpg" alt="Diagram showing Velero operating at Kubernetes API layer between cluster state and object storage, with arrows showing backup flow from CRDs and namespace resources through API to object storage and back on restore" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But state reconstruction has limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What Velero Controls&lt;/th&gt;
&lt;th&gt;What Velero Depends On&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backup Definitions&lt;/td&gt;
&lt;td&gt;CRDs inside cluster&lt;/td&gt;
&lt;td&gt;etcd — gone if cluster is gone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restore Logic&lt;/td&gt;
&lt;td&gt;Velero controller + API server&lt;/td&gt;
&lt;td&gt;Working target cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata&lt;/td&gt;
&lt;td&gt;Object metadata, resource specs&lt;/td&gt;
&lt;td&gt;External object storage bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;APIs&lt;/td&gt;
&lt;td&gt;Kubernetes API layer ops&lt;/td&gt;
&lt;td&gt;Cloud IAM for bucket access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Velero cannot bootstrap a cluster from nothing. It cannot authenticate to object storage without valid IAM credentials. It cannot run a restore without a target cluster already operational.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Production Failure Modes
&lt;/h2&gt;

&lt;p&gt;These won't appear in the press releases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01 / Object Storage Dependency&lt;/strong&gt;&lt;br&gt;
Every backup lands outside your cluster in object storage. Full cluster failure + network partition = recovery blocked, regardless of whether the backup data is intact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02 / IAM Credential Survivability&lt;/strong&gt;&lt;br&gt;
Velero authenticates via IAM roles, IRSA, or Workload Identity — all provisioned outside Velero itself. If your identity system is compromised or the cloud control plane is unavailable, the data exists but is unreachable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03 / Restore-Time Complexity&lt;/strong&gt;&lt;br&gt;
Velero restores Kubernetes objects. It does not restore external databases, DNS records, ingress configurations, or certificate bindings. The gap between "backup succeeded" and "system restored" is proportional to how many external dependencies your workloads carry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04 / Air Gap Theater&lt;/strong&gt;&lt;br&gt;
Velero deployed with on-premises MinIO, backups running, compliance checkbox ticked. The problem: restore still requires live access to that storage endpoint, live IAM credentials, and a functional API server. If those dependencies fail, the air gap was theater. The backup exists. The restore doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhr5vxb472rilwnhcxc5.jpg" alt="Dark moody illustration of a network diagram bisected by a physical wall representing an air gap, with Kubernetes cluster nodes on one side and isolated object storage on the other, but a faint glowing credential key visibly bridging the gap suggesting false isolation" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broadcom Signal Worth Reading
&lt;/h2&gt;

&lt;p&gt;Broadcom has been navigating a trust deficit since the VMware acquisition — the pricing restructuring, perpetual license elimination, and VCF bundling created a market perception that it would eventually lock down everything it touched.&lt;/p&gt;

&lt;p&gt;The Velero CNCF contribution is a counter-signal. By relinquishing governance of a project at the center of Kubernetes backup and migration, Broadcom is demonstrating that at least some of its stack is genuinely community-governed.&lt;/p&gt;

&lt;p&gt;It also creates a clean architectural separation: Velero as open, portable, community-governed backup — VKS/VCF as proprietary platform layer. That separation is useful for teams evaluating VMware Cloud Foundation. Your backup portability is no longer contingent on your platform choice.&lt;/p&gt;

&lt;p&gt;That's a genuine architectural benefit — independent of the marketing attached to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;The CNCF move is real and it matters — but not for the reasons most teams will act on.&lt;/p&gt;

&lt;p&gt;If your concern is Broadcom controlling Velero's roadmap to disadvantage non-VMware users: that concern is now materially reduced. Multi-vendor maintainership and CNCF oversight create real structural separation.&lt;/p&gt;

&lt;p&gt;If your concern is operational — whether Velero works when your cluster is down: the CNCF transition changes nothing. Object storage dependency still exists. IAM credential chain still needs to survive the same incident your cluster didn't. Restore-time complexity is still proportional to your external dependencies.&lt;/p&gt;

&lt;p&gt;The teams that benefit most from this transition are those running multi-distribution environments who hesitated to standardize on Velero because of its VMware lineage. The governance change removes a legitimate organizational objection. The operational architecture still requires the same engineering discipline it always did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNCF doesn't remove risk. It changes where the risk lives — from project governance to operational design. Most teams haven't engineered the latter. That's the work.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.rack2cloud.com/velero-cncf-backup-control/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt; — architecture-first analysis for enterprise infrastructure teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Redis connection monkey patching in Ruby Jungles</title>
      <dc:creator>Roman Tsypuk</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:51:31 +0000</pubDate>
      <link>https://stormkit.forem.com/aws-builders/redis-connection-monkey-patching-in-ruby-jungles-4k7o</link>
      <guid>https://stormkit.forem.com/aws-builders/redis-connection-monkey-patching-in-ruby-jungles-4k7o</guid>
      <description>&lt;p&gt;Some programming languages allow developers to “hack” or extend their internals by overriding existing methods in standard libraries, dynamically attaching new behavior to objects, or modifying classes at runtime.&lt;/p&gt;

&lt;p&gt;One of the languages that strongly embraces this flexibility is &lt;strong&gt;Ruby&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This ability is often referred to as &lt;strong&gt;monkey patching&lt;/strong&gt;, and while it should be used with caution, it can be extremely powerful in real-world scenarios—especially when dealing with legacy systems or unavailable source code.&lt;/p&gt;

&lt;h1&gt;
  
  
  Ruby and Runtime Flexibility
&lt;/h1&gt;

&lt;p&gt;Ruby is a highly dynamic, object-oriented language where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classes can be reopened and modified at any time&lt;/li&gt;
&lt;li&gt;Methods can be overridden or extended dynamically&lt;/li&gt;
&lt;li&gt;Behavior can be injected into existing objects or modules&lt;/li&gt;
&lt;li&gt;Even core classes (like String, Array, etc.) can be modified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes Ruby particularly well-suited for rapid prototyping, metaprogramming, runtime instrumentation, and patching legacy dependencies.&lt;/p&gt;

&lt;p&gt;However, this flexibility comes with responsibility: poorly designed patches can introduce hard-to-debug issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;A simple example of extending a built-in class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;patch&lt;/span&gt;
    &lt;span class="s2"&gt;"---"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upcase&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;"---"&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# rbi&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"aaa"&lt;/span&gt;.patch
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"---AAA---"&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt;.patch
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"---aaa---"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This demonstrates how easily Ruby allows you to modify even core classes like &lt;code&gt;String&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Example: Patching Redis Connection Pool
&lt;/h2&gt;

&lt;p&gt;I encountered a set of legacy Ruby applications that depended on outdated libraries. These dependencies were no longer available in Git repositories, although prebuilt gems were still stored in an internal artifact repository.&lt;/p&gt;

&lt;p&gt;As part of a Redis migration, I needed to identify all polyglot services connecting to Redis instances. The goal was to introduce a &lt;code&gt;CLIENT_NAME&lt;/code&gt; for every Redis client, regardless of the programming language used.&lt;br&gt;
Most services followed a common project structure with a broadly similar &lt;code&gt;go-lang&lt;/code&gt; stack, but these legacy Ruby services fell outside that landscape.&lt;/p&gt;
&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No access to source repositories of dependencies&lt;/li&gt;
&lt;li&gt;No explicit Redis connection URLs&lt;/li&gt;
&lt;li&gt;A proprietary “DIY Redis discovery” mechanism&lt;/li&gt;
&lt;li&gt;Redis connections abstracted behind internal libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made it difficult to instrument Redis clients in a standard way.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solution: Monkey Patching
&lt;/h2&gt;

&lt;p&gt;Fortunately, Ruby’s monkey patching capabilities provided a way forward.&lt;/p&gt;

&lt;p&gt;Even without modifying third-party libraries, I was able to intercept Redis connection creation and inject metadata at runtime.&lt;/p&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As soon as a Redis connection is established, annotate it with metadata such as service name, Ruby version, and Redis client version.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Original Connection Code (Simplified):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;RedisConfig&lt;/span&gt;
  &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Connection&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_instance!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;redis&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Patched Implementation
&lt;/h3&gt;

&lt;p&gt;I created a module that overrides the &lt;strong&gt;create_instance!&lt;/strong&gt; method and augments it with additional instrumentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;ServicePatch&lt;/span&gt;
  &lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;RedisMetadataPatch&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_instance!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="n"&gt;set_open_api_metadata!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;blk&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="kp"&gt;private&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_open_api_metadata!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:setname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'SERVICE_NAME'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:setinfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'LIB-NAME'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"ruby:&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;RUBY_VERSION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:setinfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'LIB-VER'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;BaseError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;StandardError&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
      &lt;span class="nb"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"[redis metadata] &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inspect&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;class&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="kp"&gt;nil&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;RedisConfig&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;singleton_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ServicePatch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;RedisMetadataPatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;prepend&lt;/code&gt; ensures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The patched method runs before the original implementation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;super&lt;/code&gt; correctly delegates to the original method&lt;/li&gt;
&lt;li&gt;The patch is cleanly layered without modifying the original code&lt;/li&gt;
&lt;/ul&gt;
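&lt;p&gt;A standalone sketch of this lookup behavior (the &lt;code&gt;Greeter&lt;/code&gt; and &lt;code&gt;LoudGreeting&lt;/code&gt; names are illustrative):&lt;/p&gt;

```ruby
# Module#prepend places the module before the class in the ancestor chain,
# so the patched method runs first and `super` reaches the original.
class Greeter
  def greet(name)
    "hello #{name}"
  end
end

module LoudGreeting
  def greet(name)
    super(name).upcase + "!"
  end
end

Greeter.prepend(LoudGreeting)

Greeter.new.greet("ruby") # => "HELLO RUBY!"
```

Had the patch used &lt;code&gt;include&lt;/code&gt; instead, the class's own method would shadow the module's and the patch would never run.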

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After deploying this patch, all Redis clients automatically started reporting metadata.&lt;br&gt;
Here is server-side &lt;strong&gt;Redis&lt;/strong&gt; monitoring output showing how these Ruby services now report their connection names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;valkey.xxxx.xx.xxxx.xxx.cache.amazonaws.com:6379&amp;gt; monitor
OK
1774951026.839060 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="s2"&gt;"3"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api1"&lt;/span&gt;
1774951026.839435 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api1"&lt;/span&gt;
1774951026.840134 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-NAME"&lt;/span&gt; &lt;span class="s2"&gt;"ruby:4.0.1"&lt;/span&gt;
1774951026.840142 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-VER"&lt;/span&gt; &lt;span class="s2"&gt;"5.4.1"&lt;/span&gt;
1774951026.840614 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"ping"&lt;/span&gt;
1774951031.463576 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="s2"&gt;"3"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api2"&lt;/span&gt;
1774951031.464538 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api1"&lt;/span&gt;
1774951031.468056 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-NAME"&lt;/span&gt; &lt;span class="s2"&gt;"ruby:4.0.1"&lt;/span&gt;
1774951031.468066 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-VER"&lt;/span&gt; &lt;span class="s2"&gt;"5.4.1"&lt;/span&gt;
1774951031.468728 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"ping"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Observability Gains
&lt;/h2&gt;

&lt;p&gt;Once the instrumentation was in place, I was able to use a custom Redis client scanner to analyze traffic and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify which services are connected to which Redis instances&lt;/li&gt;
&lt;li&gt;track command usage patterns&lt;/li&gt;
&lt;li&gt;detect idle or misbehaving clients&lt;/li&gt;
&lt;li&gt;correlate activity across polyglot systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────┬──────────────────────┬──────────────────────┬─────────┬───────┬───────┬────────┬────────┬────────┬────────┐
│ Client Addr         │ Name                 │ Lib                  │ Lib Ver │ Age   │ Idle  │    GET │   MGET │    SET │ ZRANGE │
├─────────────────────┼──────────────────────┼──────────────────────┼─────────┼───────┼───────┼────────┼────────┼────────┼────────┤
│ xx.xx.226.123:27613 │ service-api1         │ ruby:4.0.1           │ 5.4.1   │ 27740 │ 14    │      0 │      2 │     12 │      0 │
│ xx.xx.240.240:32031 │ service-api2         │ ruby:4.0.1           │ 5.4.1   │ 89306 │ 1838  │      0 │      8 │     48 │      0 │
│ xx.xx.240.240:41498 │ service-api3         │ ruby:4.0.1           │ 5.4.1   │ 89306 │ 189   │      0 │     13 │     87 │      0 │
│ xx.xx.254.221:58628 │ service-api4         │ ruby:4.0.1           │ 5.4.1   │ 10503 │ 64    │      0 │     11 │     72 │      0 │
│ xx.xx.254.221:9620  │ service-api5         │ ruby:4.0.1           │ 5.4.1   │ 10503 │ 1238  │      0 │      9 │     54 │      0 │
└─────────────────────┴──────────────────────┴──────────────────────┴─────────┴───────┴───────┴────────┴────────┴────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This approach allowed me to instrument legacy Ruby applications without modifying their dependencies or internal logic. By leveraging Ruby’s dynamic capabilities, I was able to introduce observability into a previously opaque system.&lt;/p&gt;

&lt;p&gt;In environments with legacy constraints, such techniques can turn blockers into manageable engineering problems.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;Ruby&lt;/code&gt; is a very straightforward language to write; some of its ideas have even migrated to &lt;code&gt;kotlin&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ruby-lang.org/en/" rel="noopener noreferrer"&gt;https://www.ruby-lang.org/en/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Azure Kubernetes Security: Checklist and Best Practices</title>
      <dc:creator>Mohamed Amine Hlali</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:47:52 +0000</pubDate>
      <link>https://stormkit.forem.com/mohamed_amine_hlali/azure-kubernetes-security-checklist-and-best-practices-3e89</link>
      <guid>https://stormkit.forem.com/mohamed_amine_hlali/azure-kubernetes-security-checklist-and-best-practices-3e89</guid>
      <description>&lt;p&gt;Kubernetes has become the dominant platform for container orchestration. As cloud-native architecture takes over enterprise IT, securing your Azure Kubernetes Service (AKS) environment is no longer optional it's critical.&lt;/p&gt;

&lt;p&gt;This guide covers everything you need: how AKS security works, the key challenges, best practices, and a production-ready checklist.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Azure Kubernetes Security?
&lt;/h2&gt;

&lt;p&gt;Azure Kubernetes Security is the set of practices, protocols, and tools that protect Kubernetes clusters running on Microsoft Azure. It covers three main areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity &amp;amp; access control&lt;/strong&gt;: who can do what inside the cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network security&lt;/strong&gt;: controlling traffic between pods, namespaces, and external services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous monitoring&lt;/strong&gt;: detecting threats and anomalies in real time&lt;/li&gt;
&lt;/ul&gt;
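&lt;p&gt;To make the network-security area concrete, here is a minimal sketch of a Kubernetes &lt;code&gt;NetworkPolicy&lt;/code&gt; that denies all ingress to a namespace by default (the namespace name is a placeholder):&lt;/p&gt;

```yaml
# Illustrative default-deny ingress policy for one namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app          # placeholder namespace
spec:
  podSelector: {}         # selects every pod in the namespace
  policyTypes:
    - Ingress             # no ingress rules listed, so all ingress is denied
```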




&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Here are the top reasons AKS security deserves serious attention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Growing threat landscape&lt;/strong&gt;: Kubernetes-specific attacks are increasing as cloud adoption grows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance requirements&lt;/strong&gt;: GDPR, HIPAA, and other regulations mandate proper data protection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High cost of breaches&lt;/strong&gt;: beyond data loss, there are legal fees, fines, and reputational damage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared responsibility model&lt;/strong&gt;: Azure secures the control plane; &lt;em&gt;you&lt;/em&gt; secure the workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices complexity&lt;/strong&gt;: every service-to-service connection is a potential attack vector&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  How AKS Security Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Identity &amp;amp; Access (AAD + RBAC)
&lt;/h3&gt;

&lt;p&gt;Integrate AKS with Azure Active Directory and enforce Role-Based Access Control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-aad&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--aad-admin-group-object-ids&lt;/span&gt; &amp;lt;group-object-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply least-privilege RBAC roles for developers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;developer-readonly&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Network Security: Default Deny
&lt;/h3&gt;

&lt;p&gt;Block all traffic by default, then allow only what's needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deny-all-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Secrets Management with Azure Key Vault
&lt;/h3&gt;

&lt;p&gt;Never store secrets in YAML manifests. Use the Secrets Store CSI Driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secrets-store.csi.x-k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretProviderClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-kvname&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keyvaultName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myKeyVault"&lt;/span&gt;
    &lt;span class="na"&gt;objects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;array:&lt;/span&gt;
        &lt;span class="s"&gt;- |&lt;/span&gt;
          &lt;span class="s"&gt;objectName: mySecret&lt;/span&gt;
          &lt;span class="s"&gt;objectType: secret&lt;/span&gt;
    &lt;span class="na"&gt;tenantId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;tenant-id&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Pod Security Standards
&lt;/h3&gt;

&lt;p&gt;Enforce security at the namespace level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/enforce&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Resource Limits
&lt;/h3&gt;

&lt;p&gt;Prevent resource exhaustion attacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LimitRange&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-limits&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
    &lt;span class="na"&gt;defaultRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Top 5 Best Practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Private Clusters.&lt;/strong&gt; Remove public API server exposure entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myPrivateCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-private-cluster&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enable Defender for Containers.&lt;/strong&gt; Get runtime threat detection at cluster and node level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-defender&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use Managed Identities.&lt;/strong&gt; Eliminate service principal credential management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-managed-identity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enable Auto-Upgrade.&lt;/strong&gt; Stay patched against known CVEs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--auto-upgrade-channel&lt;/span&gt; stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scan Images in CI/CD.&lt;/strong&gt; Catch vulnerabilities before they reach production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Trivy vulnerability scanner&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myacr.azurecr.io/myapp:latest'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  AKS Security Checklist ✅
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Identity &amp;amp; Access
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] AAD integration enabled&lt;/li&gt;
&lt;li&gt;[ ] RBAC with least-privilege roles enforced&lt;/li&gt;
&lt;li&gt;[ ] Managed identities used (no service principal secrets)&lt;/li&gt;
&lt;li&gt;[ ] Workload Identity enabled for pods&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Network
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Private cluster (no public API server)&lt;/li&gt;
&lt;li&gt;[ ] Default-deny NetworkPolicies applied&lt;/li&gt;
&lt;li&gt;[ ] Azure Firewall / NSGs configured&lt;/li&gt;
&lt;li&gt;[ ] Authorized IP ranges set for API access&lt;/li&gt;
&lt;/ul&gt;
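
&lt;p&gt;For the authorized IP ranges item above, a sketch with the Azure CLI (the CIDR shown is a placeholder; substitute your office or VPN range):&lt;/p&gt;

```shell
# Restrict API server access to a known CIDR (placeholder range shown)
az aks update \
  --resource-group myRG \
  --name myAKSCluster \
  --api-server-authorized-ip-ranges 203.0.113.0/24
```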

&lt;h3&gt;
  
  
  Workloads
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Pod Security Standards enforced (restricted)&lt;/li&gt;
&lt;li&gt;[ ] All containers run as non-root&lt;/li&gt;
&lt;li&gt;[ ] Read-only root filesystem where possible&lt;/li&gt;
&lt;li&gt;[ ] CPU/memory limits defined for all containers&lt;/li&gt;
&lt;/ul&gt;
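
&lt;p&gt;The workload items above can be expressed directly in a pod spec. This is a minimal sketch (image and names are placeholders) combining non-root execution, a read-only root filesystem, and explicit limits:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                       # placeholder name
  namespace: production
spec:
  containers:
  - name: app
    image: myacr.azurecr.io/myapp:latest   # placeholder image
    securityContext:
      runAsNonRoot: true                   # checklist: containers run as non-root
      readOnlyRootFilesystem: true         # checklist: read-only root filesystem
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"                        # checklist: CPU/memory limits defined
        memory: "512Mi"
```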

&lt;h3&gt;
  
  
  Secrets &amp;amp; Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] No secrets in manifests or images&lt;/li&gt;
&lt;li&gt;[ ] Azure Key Vault integrated via CSI Driver&lt;/li&gt;
&lt;li&gt;[ ] etcd encryption at rest enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Microsoft Defender for Containers enabled&lt;/li&gt;
&lt;li&gt;[ ] Kubernetes audit logs → Log Analytics&lt;/li&gt;
&lt;li&gt;[ ] Azure Policy for Kubernetes applied&lt;/li&gt;
&lt;li&gt;[ ] Image scanning in CI/CD pipeline&lt;/li&gt;
&lt;li&gt;[ ] Auto-upgrade channel configured&lt;/li&gt;
&lt;/ul&gt;
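
&lt;p&gt;For the audit-log item, one way to route control-plane logs to Log Analytics is a diagnostic setting. This is a sketch: the workspace ID is a placeholder, and the log categories available can vary by cluster version:&lt;/p&gt;

```shell
# Send kube-audit logs to a Log Analytics workspace (placeholder workspace ID)
AKS_ID=$(az aks show --resource-group myRG --name myAKSCluster --query id -o tsv)
az monitor diagnostic-settings create \
  --name aks-audit \
  --resource "$AKS_ID" \
  --workspace "<log-analytics-workspace-id>" \
  --logs '[{"category":"kube-audit","enabled":true}]'
```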




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AKS security is a continuous practice, not a one-time configuration. The platform gives you a strong foundation with its managed control plane and native integrations, but workload security is your responsibility.&lt;/p&gt;

&lt;p&gt;Start with the basics: private clusters, AAD + RBAC, Key Vault for secrets, and Defender for monitoring. Then build on that foundation with network policies, pod security standards, and automated image scanning.&lt;/p&gt;

&lt;p&gt;The checklist above is a solid starting point for any production AKS deployment.&lt;/p&gt;




</description>
      <category>azure</category>
      <category>kubernetes</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>~21 tok/s Gemma 4 on a Ryzen mini PC: llama.cpp, Vulkan, and the messy truth about local chat</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:46:26 +0000</pubDate>
      <link>https://stormkit.forem.com/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</link>
      <guid>https://stormkit.forem.com/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</guid>
      <description>&lt;p&gt;Hands-on guide based on a real setup: &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, &lt;strong&gt;AMD Radeon 760M&lt;/strong&gt; (Ryzen iGPU), &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. 96 GiB), &lt;strong&gt;llama.cpp&lt;/strong&gt; built with &lt;strong&gt;GGML_VULKAN&lt;/strong&gt;, OpenAI-compatible API via &lt;strong&gt;llama-server&lt;/strong&gt;, &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker, and &lt;strong&gt;OpenCode&lt;/strong&gt; or &lt;strong&gt;VS Code&lt;/strong&gt; (§11) using the same API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; if you buy (or plan to buy) a &lt;strong&gt;mini PC&lt;/strong&gt; or small tower with &lt;strong&gt;plenty of RAM and disk&lt;/strong&gt;, this walkthrough gets you to &lt;strong&gt;local inference&lt;/strong&gt; — GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is &lt;strong&gt;AMD iGPU + Vulkan&lt;/strong&gt;; if your hardware differs, keep the &lt;strong&gt;Ubuntu → llama.cpp → weights → server&lt;/strong&gt; flow and adjust &lt;strong&gt;§5–§6&lt;/strong&gt; (deps and build) for your GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference hardware (validated while writing this guide):&lt;/strong&gt; &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; mini PC (&lt;em&gt;Device Type: MINI PC&lt;/em&gt; on the chassis label; vendor &lt;strong&gt;Minisforum&lt;/strong&gt; / &lt;strong&gt;Micro Computer (HK) Tech Limited&lt;/strong&gt;) with &lt;strong&gt;AMD Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M Graphics&lt;/strong&gt;, &lt;strong&gt;96 GiB&lt;/strong&gt; &lt;strong&gt;DDR5&lt;/strong&gt; RAM, &lt;strong&gt;~1 TiB&lt;/strong&gt; NVMe, &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;. This is not a minimum-requirements bar—it &lt;strong&gt;anchors&lt;/strong&gt; compile times, download comfort, and token throughput vs other CPUs, RAM, or disks. To &lt;strong&gt;verify memory type and size&lt;/strong&gt; on your box, see §3 (&lt;em&gt;Quick hardware inventory&lt;/em&gt;). A &lt;strong&gt;photo of the box&lt;/strong&gt; is at the &lt;strong&gt;end&lt;/strong&gt;, under Closing thoughts.&lt;/p&gt;

&lt;p&gt;Replace &lt;code&gt;YOUR_USER&lt;/code&gt;, model paths, and hostname as needed. If the machine is &lt;strong&gt;server-only&lt;/strong&gt; (no monitor), start with §4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" alt="local LLM stack on Ubuntu — reference illustration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Too long; didn’t read&lt;/em&gt; — a one-minute skim before the full guide. The full table of contents follows below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What you’re building:&lt;/strong&gt; &lt;strong&gt;local&lt;/strong&gt; inference on &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; + &lt;strong&gt;Vulkan&lt;/strong&gt;, a &lt;strong&gt;GGUF&lt;/strong&gt; weights file, OpenAI-style API via &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;:8080&lt;/code&gt;&lt;/strong&gt;); optional &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker (&lt;strong&gt;&lt;code&gt;:3000&lt;/code&gt;&lt;/strong&gt;); &lt;strong&gt;OpenCode&lt;/strong&gt; and &lt;strong&gt;Visual Studio Code&lt;/strong&gt; can talk to the same &lt;strong&gt;&lt;code&gt;http://…:8080/v1&lt;/code&gt;&lt;/strong&gt; base URL as an OpenAI-compatible provider (&lt;strong&gt;§11&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortest path:&lt;/strong&gt; &lt;strong&gt;BIOS/UMA&lt;/strong&gt; if relevant (§2) → deps + &lt;strong&gt;Vulkan&lt;/strong&gt; (§5) → build &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6) → download &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt; (§7: &lt;strong&gt;&lt;code&gt;wget --continue&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;; &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; for long SSH sessions) → smoke-test &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; → run &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; manually or under &lt;strong&gt;systemd&lt;/strong&gt; (§8–§9) → point &lt;strong&gt;Open WebUI&lt;/strong&gt; at the host (§10) → &lt;strong&gt;optional:&lt;/strong&gt; &lt;strong&gt;OpenCode&lt;/strong&gt; / &lt;strong&gt;VS Code&lt;/strong&gt; (§11).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight RAM / OOM:&lt;/strong&gt; same user as the service; match &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;; if it fails, drop &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt;) before chasing &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;. Don’t &lt;strong&gt;enable&lt;/strong&gt; the unit until the GGUF is &lt;strong&gt;fully&lt;/strong&gt; downloaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More models:&lt;/strong&gt; §7 covers &lt;strong&gt;Gemma 4&lt;/strong&gt;, &lt;strong&gt;Qwen Coder&lt;/strong&gt;, &lt;strong&gt;DeepSeek Lite&lt;/strong&gt;, &lt;strong&gt;Llama 3.1&lt;/strong&gt; (downloads, &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;, quick tests).&lt;/li&gt;
&lt;li&gt;Swap in &lt;strong&gt;&lt;code&gt;YOUR_USER&lt;/code&gt;&lt;/strong&gt;, paths, and hostname; &lt;strong&gt;server-only&lt;/strong&gt; box → start at §4.&lt;/li&gt;
&lt;/ul&gt;
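
&lt;p&gt;Once &lt;code&gt;llama-server&lt;/code&gt; is up (§8), a quick smoke test of the OpenAI-compatible endpoint looks like this; host, port, and the &lt;code&gt;model&lt;/code&gt; value are placeholders for your setup:&lt;/p&gt;

```shell
# Chat-completions smoke test against a local llama-server (placeholder host/model)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}]
      }'
```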

&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Links jump to headings on GitHub, Cursor, and most Markdown viewers. If a link does not match your renderer, search for the heading in the file.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TL;DR&lt;/li&gt;
&lt;li&gt;1. Context and choices&lt;/li&gt;
&lt;li&gt;2. BIOS (before or right after installing Ubuntu)&lt;/li&gt;
&lt;li&gt;
3. Installing Ubuntu

&lt;ul&gt;
&lt;li&gt;Quick hardware inventory (optional)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

4. Ubuntu Server without a desktop (headless)

&lt;ul&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)&lt;/li&gt;
&lt;li&gt;Rest of this guide&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;5. Base dependencies and Vulkan check&lt;/li&gt;

&lt;li&gt;

6. Building llama.cpp with Vulkan

&lt;ul&gt;
&lt;li&gt;Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

7. GGUF models and paths

&lt;ul&gt;
&lt;li&gt;What GGUF is (name, role, trade-offs)&lt;/li&gt;
&lt;li&gt;Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)&lt;/li&gt;
&lt;li&gt;Where models live and how to list them&lt;/li&gt;
&lt;li&gt;Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)&lt;/li&gt;
&lt;li&gt;Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)&lt;/li&gt;
&lt;li&gt;Example: local Llama 3.1 8B Instruct Q8_0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)&lt;/li&gt;
&lt;li&gt;Quick terminal test&lt;/li&gt;
&lt;li&gt;Adding or switching models&lt;/li&gt;
&lt;li&gt;Experimenting with more models: setup, testing, and limits&lt;/li&gt;
&lt;li&gt;One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)&lt;/li&gt;
&lt;li&gt;Common steps (every model swap)&lt;/li&gt;
&lt;li&gt;Reference table (repos + sample file)&lt;/li&gt;
&lt;li&gt;Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)&lt;/li&gt;
&lt;li&gt;Per-model quick test (right after download)&lt;/li&gt;
&lt;li&gt;Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)&lt;/li&gt;

&lt;li&gt;9. systemd service (start on boot)&lt;/li&gt;

&lt;li&gt;

10. Open WebUI with Docker (port 3000 → backend on 8080)

&lt;ul&gt;
&lt;li&gt;Connect Open WebUI to llama-server&lt;/li&gt;
&lt;li&gt;Chat up and running (example)&lt;/li&gt;
&lt;li&gt;No browsing or GitHub fetch: real limits (and confident wrong answers)&lt;/li&gt;
&lt;li&gt;Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed&lt;/li&gt;
&lt;li&gt;“Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)&lt;/li&gt;
&lt;li&gt;Updating Open WebUI (Docker)&lt;/li&gt;
&lt;li&gt;If you also run Ollama&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;OpenCode&lt;/li&gt;
&lt;li&gt;Visual Studio Code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04

&lt;ul&gt;
&lt;li&gt;12.1 Universe repository and packages&lt;/li&gt;
&lt;li&gt;12.2 LunarG repository (Vulkan SDK)&lt;/li&gt;
&lt;li&gt;12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc&lt;/li&gt;
&lt;li&gt;12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

13. Performance and models (rough guide)

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)&lt;/li&gt;
&lt;li&gt;AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

14. Remote desktop (Ubuntu 24.04 Desktop, LAN)

&lt;ul&gt;
&lt;li&gt;14.1 Enable on the mini PC&lt;/li&gt;
&lt;li&gt;14.2 Connect from another machine&lt;/li&gt;
&lt;li&gt;14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;14.4 If connection fails&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Final checklist&lt;/li&gt;

&lt;li&gt;Quick port reference&lt;/li&gt;

&lt;li&gt;Closing thoughts&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Context and choices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04 LTS (desktop or server; server without a GUI saves RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD iGPU&lt;/td&gt;
&lt;td&gt;Vulkan + Mesa is usually simpler than ROCm for llama.cpp inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GGUF&lt;/strong&gt; format; Q4_K_M quantization (balance) or Q8_0 (higher quality, larger).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; with &lt;code&gt;-DGGML_VULKAN=1&lt;/code&gt; uses the &lt;strong&gt;GPU&lt;/strong&gt; for layers (&lt;code&gt;-ngl&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lots of RAM&lt;/td&gt;
&lt;td&gt;You can load large models in system RAM even if the iGPU has little dedicated VRAM; the BIOS can give the GPU a larger framebuffer (see §2).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
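
&lt;p&gt;The build flag in the table maps to a short CMake invocation (the full walkthrough is in §6); this sketch assumes the toolchain and Vulkan dev packages from §5 are already in place:&lt;/p&gt;

```shell
# Build llama.cpp with the Vulkan backend (see §5 for dependencies, §6 for details)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j "$(nproc)"
```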

&lt;p&gt;Reference diagram (browser / container / host):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" alt="Reference diagram (browser / container / host)" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" alt="Illustration: browser and IDE → Open WebUI container → llama-server and GGUF on the host" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. BIOS (before or right after installing Ubuntu)
&lt;/h2&gt;

&lt;p&gt;On &lt;strong&gt;Minisforum&lt;/strong&gt; boxes (e.g. &lt;strong&gt;UM760 Slim&lt;/strong&gt;) with AMI BIOS and Ryzen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter BIOS (&lt;strong&gt;Del&lt;/strong&gt;, &lt;strong&gt;F2&lt;/strong&gt;, or &lt;strong&gt;F7&lt;/strong&gt; on many systems).&lt;/li&gt;
&lt;li&gt;Typical path: &lt;strong&gt;Advanced → AMD CBS → NBIO Common Options → GFX Configuration&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;UMA Frame Buffer Size&lt;/strong&gt; (or similar) from &lt;em&gt;Auto&lt;/em&gt; / 2 GiB to &lt;strong&gt;8 G&lt;/strong&gt; or &lt;strong&gt;16 G&lt;/strong&gt; if available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Goal: give the iGPU more unified memory for model layers; with plenty of system RAM the trade-off is usually worth it.&lt;/p&gt;
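
&lt;p&gt;After rebooting into Ubuntu you can sanity-check what the driver actually got; a rough sketch (the exact kernel log wording varies by &lt;code&gt;amdgpu&lt;/code&gt; version):&lt;/p&gt;

```shell
# amdgpu usually logs the detected VRAM/UMA size at boot
sudo dmesg | grep -i vram
# remaining system RAM after the UMA carve-out
free -h
```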




&lt;h2&gt;
  
  
  3. Installing Ubuntu
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Enable &lt;strong&gt;third-party software&lt;/strong&gt; for graphics and Wi‑Fi if you use the graphical installer.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;minimal&lt;/strong&gt; install skips extra packages, a good fit if the box is mainly an inference server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical order of this guide (§4 and §10 are optional depending on your setup):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" alt="Tipical installation steps" width="646" height="1250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick hardware inventory (optional)
&lt;/h3&gt;

&lt;p&gt;Before picking huge models and quantizations, check &lt;strong&gt;RAM&lt;/strong&gt;, &lt;strong&gt;disk on &lt;code&gt;/&lt;/code&gt;&lt;/strong&gt;, and whether the &lt;strong&gt;integrated GPU&lt;/strong&gt; shows up on the PCI bus (this does not replace a Vulkan test, but it sets expectations).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lspci | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'vga|3d|display'&lt;/span&gt;
free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for in &lt;code&gt;lspci&lt;/code&gt;:&lt;/strong&gt; on &lt;strong&gt;Ryzen Phoenix / Hawk Point&lt;/strong&gt; boards you often see something like &lt;strong&gt;&lt;code&gt;VGA compatible controller: … Phoenix1&lt;/code&gt;&lt;/strong&gt; plus an AMD &lt;strong&gt;HDMI audio&lt;/strong&gt; line. The marketing name “Radeon 760M” may not appear verbatim; the real check is that an &lt;strong&gt;AMD VGA/Display&lt;/strong&gt; controller exists and that &lt;strong&gt;&lt;code&gt;vulkaninfo&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; see &lt;strong&gt;RADV&lt;/strong&gt; (§4–§5).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;free&lt;/code&gt;:&lt;/strong&gt; total and &lt;strong&gt;available&lt;/strong&gt; RAM tell you how large a GGUF you can keep &lt;strong&gt;comfortably&lt;/strong&gt; in memory alongside the OS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;df&lt;/code&gt;:&lt;/strong&gt; each &lt;code&gt;.gguf&lt;/code&gt; costs whatever its model card lists (e.g. ~8 GiB for an 8B Q8_0); leave headroom for updates, Docker, and rebuilds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DDR4 vs DDR5 (re-check RAM type):&lt;/strong&gt; the data comes from the firmware &lt;strong&gt;SMBIOS&lt;/strong&gt; tables, read with &lt;code&gt;dmidecode&lt;/code&gt; (install it with &lt;strong&gt;&lt;code&gt;sudo apt install -y dmidecode&lt;/code&gt;&lt;/strong&gt; if needed). &lt;strong&gt;Note:&lt;/strong&gt; some &lt;code&gt;dmidecode&lt;/code&gt; builds indent fields with &lt;strong&gt;spaces&lt;/strong&gt;, not tabs—an overly strict &lt;code&gt;grep&lt;/code&gt; can print &lt;strong&gt;nothing&lt;/strong&gt; even when DMI works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One line per interesting field (tab- or space-indented)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s1"&gt;'Locator|Size:|Type:|Speed:|Configured Memory Speed:'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that is still empty, dump the start of the table—some boards expose only a subset of fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each populated slot, &lt;strong&gt;&lt;code&gt;Type:&lt;/code&gt;&lt;/strong&gt; should read &lt;strong&gt;&lt;code&gt;DDR5&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;DDR4&lt;/code&gt;&lt;/strong&gt;, etc. All-&lt;strong&gt;&lt;code&gt;Unknown&lt;/code&gt;&lt;/strong&gt; or an empty dump may mean a &lt;strong&gt;locked&lt;/strong&gt; BIOS, a &lt;strong&gt;hypervisor&lt;/strong&gt; restriction, or firmware that needs an update—cross-check the &lt;strong&gt;mini PC spec sheet&lt;/strong&gt; or &lt;strong&gt;DIMM/SODIMM silkscreen/label&lt;/strong&gt;. &lt;strong&gt;Ryzen 7040&lt;/strong&gt; mobile (e.g. 7640HS) is usually &lt;strong&gt;DDR5-only&lt;/strong&gt; on recent kits; still verify through one of these paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Ubuntu Server without a desktop (headless)
&lt;/h2&gt;

&lt;p&gt;When the mini PC only serves the model (SSH + browser on another machine), &lt;strong&gt;Ubuntu Server 24.04 LTS&lt;/strong&gt; saves RAM and attack surface by skipping GNOME and desktop services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download the &lt;strong&gt;Ubuntu Server&lt;/strong&gt; ISO from &lt;a href="https://ubuntu.com/download/server" rel="noopener noreferrer"&gt;ubuntu.com/download/server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;In the installer, enable &lt;strong&gt;OpenSSH&lt;/strong&gt; for remote administration.&lt;/li&gt;
&lt;li&gt;Create a normal user with &lt;code&gt;sudo&lt;/code&gt; (this guide assumes that user’s &lt;code&gt;$HOME&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;BIOS (§2) is configured the same as on a desktop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;After first boot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open only what you need in the firewall (e.g. SSH, and later 8080/3000 if not using VPN only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; ufw
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow OpenSSH
&lt;span class="c"&gt;# Optional: sudo ufw allow 8080/tcp &amp;amp;&amp;amp; sudo ufw allow 3000/tcp&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)
&lt;/h3&gt;

&lt;p&gt;Server images have no display server by default: &lt;strong&gt;you cannot run &lt;code&gt;vkcube&lt;/code&gt;&lt;/strong&gt; unless you add a minimal GUI just for that test. To validate Vulkan from the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-tools
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for:&lt;/strong&gt; besides the instance version (e.g. &lt;code&gt;Vulkan Instance Version: 1.4.x&lt;/code&gt;), the &lt;strong&gt;&lt;code&gt;Devices:&lt;/code&gt;&lt;/strong&gt; section should list &lt;strong&gt;your AMD GPU&lt;/strong&gt; (&lt;code&gt;deviceName&lt;/code&gt; like &lt;em&gt;Radeon …&lt;/em&gt;, &lt;code&gt;deviceType&lt;/code&gt; &lt;em&gt;INTEGRATED_GPU&lt;/em&gt; or &lt;em&gt;DISCRETE_GPU&lt;/em&gt;, &lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt; on AMD hardware).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world sample (trimmed):&lt;/strong&gt; you often see the instance and a long extension list first; &lt;code&gt;Devices:&lt;/code&gt; comes later. As a &lt;strong&gt;normal user&lt;/strong&gt; you may see &lt;strong&gt;only&lt;/strong&gt; a software device:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vulkan Instance Version: 1.4.313
...
Devices:
========
GPU0:
    apiVersion         = 1.4.318
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …, 256 bits)
    driverName         = llvmpipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Same machine, but &lt;code&gt;sudo&lt;/code&gt; shows the Radeon:&lt;/strong&gt; if your user only gets &lt;code&gt;llvmpipe&lt;/code&gt; but &lt;strong&gt;root&lt;/strong&gt; sees e.g. &lt;strong&gt;GPU0&lt;/strong&gt; &lt;code&gt;AMD Radeon 760M Graphics (RADV PHOENIX)&lt;/code&gt; (&lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt;, &lt;code&gt;INTEGRATED_GPU&lt;/code&gt;) &lt;strong&gt;and&lt;/strong&gt; &lt;strong&gt;GPU1&lt;/strong&gt; &lt;code&gt;llvmpipe&lt;/code&gt;, the kernel and Mesa are fine; your user lacks &lt;strong&gt;permission&lt;/strong&gt; on the DRM nodes (&lt;code&gt;/dev/dri/renderD*&lt;/code&gt;). You should &lt;strong&gt;not&lt;/strong&gt; run &lt;code&gt;llama-server&lt;/code&gt; as root long-term to “fix” Vulkan—fix group membership instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;groups&lt;/span&gt;                    &lt;span class="c"&gt;# should include render and video&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /dev/dri/
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; render,video &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out of the desktop session or reboot, then (tighter grep: a broad&lt;/span&gt;
&lt;span class="c"&gt;# GPU|deviceName|deviceType pattern may also match layer descriptions containing "GPU"):&lt;/span&gt;
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'^GPU[0-9]+:|^[[:space:]]+device(Name|Type)'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output without &lt;code&gt;sudo&lt;/code&gt;&lt;/strong&gt; (RADV as &lt;strong&gt;GPU0&lt;/strong&gt;, &lt;code&gt;llvmpipe&lt;/code&gt; as an extra device):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 20.1.2, 256 bits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical “before” example:&lt;/strong&gt; if &lt;code&gt;groups&lt;/code&gt; &lt;strong&gt;does not&lt;/strong&gt; list &lt;code&gt;render&lt;/code&gt; or &lt;code&gt;video&lt;/code&gt;, and you only see entries like &lt;code&gt;adm cdrom sudo dip plugdev users lpadmin docker&lt;/code&gt;, that matches “Vulkan as your user = llvmpipe only; as root = RADV + llvmpipe”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After &lt;code&gt;usermod&lt;/code&gt;:&lt;/strong&gt; the command may print nothing, but &lt;strong&gt;your already-running session keeps the old group set&lt;/strong&gt;—&lt;code&gt;groups&lt;/code&gt; in the same shell will not change until you &lt;strong&gt;log out and back in&lt;/strong&gt; (or &lt;strong&gt;reboot&lt;/strong&gt;). Start a fresh login session (a new SSH connection works) and check again; &lt;strong&gt;&lt;code&gt;id -nG&lt;/code&gt;&lt;/strong&gt; is a handy way to list all group names. For a quick test without logging out of the whole session: &lt;strong&gt;&lt;code&gt;newgrp render&lt;/code&gt;&lt;/strong&gt; (spawns a subshell with that group active; fine for testing only).&lt;/p&gt;
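&lt;p&gt;The membership check can be condensed into a one-liner that only reads state and prints a reminder when the groups are not active yet (a sketch; adjust the group names if your distribution uses different ones):&lt;br&gt;
&lt;/p&gt;

```shell
# Print render/video if active in the current session, otherwise a reminder:
id -nG | tr ' ' '\n' | grep -E '^(render|video)$' || echo "render/video not active in this session"
# Quick test without a full re-login (interactive subshell; testing only):
#   newgrp render
```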

&lt;p&gt;On Ubuntu 24.04 the groups are usually &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt;. Once the new session includes them, &lt;code&gt;vulkaninfo&lt;/code&gt; &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;sudo&lt;/code&gt; should list the AMD device as well as &lt;code&gt;llvmpipe&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A healthy summary often has the Radeon as &lt;strong&gt;GPU0&lt;/strong&gt; and llvmpipe as an extra entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    vendorID           = 0x1002
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
    driverName         = radv
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Only &lt;code&gt;llvmpipe&lt;/code&gt; even as root:&lt;/strong&gt; then &lt;code&gt;llvmpipe&lt;/code&gt; / &lt;code&gt;PHYSICAL_DEVICE_TYPE_CPU&lt;/code&gt; is &lt;strong&gt;CPU-only&lt;/strong&gt; Vulkan (Mesa) and the iGPU is not in the Vulkan device list. Check &lt;code&gt;lspci -nn | grep -i vga&lt;/code&gt;, the &lt;strong&gt;&lt;code&gt;amdgpu&lt;/code&gt;&lt;/strong&gt; module, &lt;code&gt;mesa-vulkan-drivers&lt;/code&gt;, and BIOS. On very minimal servers the render stack may still need setup before Vulkan enumerates the chip.&lt;/p&gt;
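&lt;p&gt;A minimal version of that checklist, written so each step degrades gracefully when a tool is missing (package name as on Ubuntu 24.04):&lt;br&gt;
&lt;/p&gt;

```shell
# iGPU visible on the PCI bus?
lspci -nn 2>/dev/null | grep -iE 'vga|display' || echo "no VGA/Display controller listed"
# Kernel driver loaded?
lsmod 2>/dev/null | grep -i amdgpu || echo "amdgpu module not loaded"
# RADV userspace package installed?
dpkg -l mesa-vulkan-drivers 2>/dev/null | grep '^ii' || echo "mesa-vulkan-drivers not installed"
```

If all three pass but &lt;code&gt;vulkaninfo&lt;/code&gt; still only shows &lt;code&gt;llvmpipe&lt;/code&gt; as root, revisit the BIOS settings from §2.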

&lt;h3&gt;
  
  
  Rest of this guide
&lt;/h3&gt;

&lt;p&gt;Install the same packages as §5, build llama.cpp in §6, and use &lt;strong&gt;Open WebUI from another PC&lt;/strong&gt; at &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;. Docker + &lt;code&gt;llama-server&lt;/code&gt; does not require a graphical session on the server.&lt;/p&gt;
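&lt;p&gt;Before pointing Open WebUI at the box, a quick reachability probe helps (assumes a &lt;code&gt;llama-server&lt;/code&gt; instance is already running on port 8080, e.g. via the §9 service, and that your build exposes the &lt;code&gt;/health&lt;/code&gt; endpoint; replace &lt;code&gt;127.0.0.1&lt;/code&gt; with the server IP when testing from another machine):&lt;br&gt;
&lt;/p&gt;

```shell
# Failed/empty output means the server is not up or the port is closed:
curl -s --max-time 2 http://127.0.0.1:8080/health || echo "llama-server not reachable on 8080"
```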




&lt;h2&gt;
  
  
  5. Base dependencies and Vulkan check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential cmake git libvulkan-dev vulkan-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm the GPU is visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vkcube
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A window with a spinning cube should open. Close it when done.&lt;/p&gt;

&lt;p&gt;If &lt;strong&gt;vkcube&lt;/strong&gt; works but &lt;code&gt;vulkaninfo --summary&lt;/code&gt; as your user still shows only &lt;code&gt;llvmpipe&lt;/code&gt;, add the same &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; groups as in §4 (and log out/in).&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Building llama.cpp with Vulkan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;cmake&lt;/strong&gt; fails with &lt;em&gt;Could NOT find Vulkan&lt;/em&gt; or &lt;em&gt;missing: glslc&lt;/em&gt;, go to §12 (common on Ubuntu 24.04).&lt;/p&gt;

&lt;h3&gt;
  
  
  Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Newer GGUF architectures&lt;/strong&gt; (Gemma 4, recent MoE builds, etc.) often need a &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Before blaming the weight file, update the tree and rebuild the &lt;strong&gt;same &lt;code&gt;build&lt;/code&gt;&lt;/strong&gt; folder (or wipe &lt;code&gt;build&lt;/code&gt; and rerun CMake if CMakeLists changed a lot):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
git pull
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; changes CMake heavily and linking fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rebuilding, if you use &lt;strong&gt;§9&lt;/strong&gt;, restart so the service picks up new binaries: &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;. Check &lt;code&gt;journalctl -u llama-web.service -n 30 --no-pager&lt;/code&gt; if a GGUF is rejected.&lt;/p&gt;

&lt;p&gt;Useful binaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-cli&lt;/code&gt; — terminal tests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-server&lt;/code&gt; — HTTP API compatible with OpenAI-style clients.&lt;/li&gt;
&lt;/ul&gt;
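&lt;p&gt;A quick smoke test for the freshly built binary (the &lt;code&gt;BIN&lt;/code&gt; and &lt;code&gt;MODEL&lt;/code&gt; values are examples, not fixed names; &lt;code&gt;-ngl 99&lt;/code&gt; asks llama.cpp to offload as many layers as possible to the GPU):&lt;br&gt;
&lt;/p&gt;

```shell
# Adjust both paths to your checkout and downloaded model:
BIN="$HOME/llama.cpp/build/bin/llama-cli"
MODEL="$HOME/models/model-name.gguf"
if [ -x "$BIN" ]; then
  "$BIN" -m "$MODEL" -ngl 99 -n 32 -p "Say hello in five words."
else
  echo "build llama.cpp first (§6), then point MODEL at a real GGUF (§7)"
fi
```

Watch the startup log: lines mentioning the Vulkan device confirm the GPU path is in use rather than CPU-only inference.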




&lt;h2&gt;
  
  
  7. GGUF models and paths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What GGUF is (name, role, trade-offs)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GGUF&lt;/strong&gt; (&lt;strong&gt;G&lt;/strong&gt;GML &lt;strong&gt;U&lt;/strong&gt;niversal &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;F&lt;/strong&gt;ormat) is a &lt;strong&gt;single-file&lt;/strong&gt; container aimed at &lt;strong&gt;inference&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; and friends: it packs &lt;strong&gt;weights&lt;/strong&gt; in a tensor layout tuned for efficient loading, &lt;strong&gt;metadata&lt;/strong&gt;, and—in practice—what you need to &lt;strong&gt;tokenize&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; the model without pulling in the full PyTorch/JAX training stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters here:&lt;/strong&gt; you download a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;, pass its path as &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; to &lt;code&gt;llama-cli&lt;/code&gt; / &lt;code&gt;llama-server&lt;/code&gt;, and the engine runs &lt;strong&gt;locally&lt;/strong&gt; (CPU, and in this guide &lt;strong&gt;Vulkan&lt;/strong&gt; on the GPU). You do not need the original framework runtime just to serve the converted file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typical upsides:&lt;/strong&gt; &lt;strong&gt;one portable blob&lt;/strong&gt;; &lt;strong&gt;quantized&lt;/strong&gt; variants (Q4_K_M, Q8_0, IQ*, …) trade a bit of quality for &lt;strong&gt;disk / RAM / VRAM&lt;/strong&gt;; &lt;strong&gt;huge Hugging Face catalog&lt;/strong&gt; (community repos such as &lt;em&gt;TheBloke&lt;/em&gt;, &lt;em&gt;bartowski&lt;/em&gt;, Unsloth, …); first-class support in &lt;strong&gt;llama.cpp&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; &lt;strong&gt;quality&lt;/strong&gt; depends on &lt;strong&gt;quant level&lt;/strong&gt; and conversion tooling; &lt;strong&gt;brand-new&lt;/strong&gt; architectures may need a &lt;strong&gt;fresh llama.cpp build&lt;/strong&gt; or lack mature GGUFs yet; &lt;strong&gt;training / fine-tuning&lt;/strong&gt; usually happens elsewhere, then you &lt;strong&gt;convert/export&lt;/strong&gt; to GGUF; it is not a full cloud SaaS substitute without extra plumbing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of this section assumes a &lt;strong&gt;ready-to-run GGUF&lt;/strong&gt;; paths and downloads always point at that file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)
&lt;/h3&gt;

&lt;p&gt;Repos list GGUFs with prefixes like &lt;strong&gt;Q2_&lt;/strong&gt;, &lt;strong&gt;Q3_&lt;/strong&gt;, &lt;strong&gt;Q4_&lt;/strong&gt;, &lt;strong&gt;Q5_&lt;/strong&gt;, &lt;strong&gt;Q6_&lt;/strong&gt;, &lt;strong&gt;Q8_&lt;/strong&gt; and cousins (&lt;strong&gt;IQ2_&lt;/strong&gt;, &lt;strong&gt;IQ3_&lt;/strong&gt;, …). Naming is not one single marketing standard, but &lt;strong&gt;in practice&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Q&lt;/strong&gt; and &lt;strong&gt;number&lt;/strong&gt; hint at &lt;strong&gt;quantization depth&lt;/strong&gt;—roughly how many &lt;strong&gt;bits&lt;/strong&gt; are used for weights (&lt;strong&gt;simplified&lt;/strong&gt;). &lt;strong&gt;Lower&lt;/strong&gt; → &lt;strong&gt;smaller&lt;/strong&gt; file, less &lt;strong&gt;RAM/VRAM&lt;/strong&gt;, sometimes &lt;strong&gt;more&lt;/strong&gt; quality loss; &lt;strong&gt;higher&lt;/strong&gt; (e.g. &lt;strong&gt;Q8&lt;/strong&gt;) → heavier and often closer to “full” model behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suffixes&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;_K_M&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_S&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_L&lt;/code&gt;&lt;/strong&gt;, … are &lt;strong&gt;llama.cpp k-quant&lt;/strong&gt; schemes: they &lt;strong&gt;mix&lt;/strong&gt; layers/blocks at different precisions to balance quality vs size—it is not “literally 4-bit everything.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IQ&lt;/strong&gt; (&lt;em&gt;imatrix&lt;/em&gt; / importance-weighted) lines aim for &lt;strong&gt;aggressive&lt;/strong&gt; compression while protecting weights that matter most for output quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For this guide:&lt;/strong&gt; &lt;strong&gt;Q4_K_M&lt;/strong&gt; is a common &lt;strong&gt;sweet spot&lt;/strong&gt; for &lt;strong&gt;disk&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and &lt;strong&gt;quality&lt;/strong&gt;; &lt;strong&gt;Q8_0&lt;/strong&gt;-class files if you favor quality and have RAM to spare. If names feel overwhelming, sort by &lt;strong&gt;MiB/GiB&lt;/strong&gt; under the repo’s &lt;em&gt;Files&lt;/em&gt; tab and pick the largest file that &lt;strong&gt;fits&lt;/strong&gt; your machine comfortably.&lt;/li&gt;
&lt;/ul&gt;
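&lt;p&gt;A back-of-the-envelope size estimate can be sketched from that rule (simplified: real GGUFs add metadata and mix block precisions, and ~4.8 bits per weight for Q4_K_M is only an approximation):&lt;br&gt;
&lt;/p&gt;

```shell
# size_GiB is roughly params_in_billions * bits_per_weight / 8
awk 'BEGIN { printf "8B at ~4.8 bpw (Q4_K_M-ish): ~%.1f GiB\n", 8 * 4.8 / 8 }'
```

The same arithmetic at ~8.5 bpw lands near the ~8 GiB figure quoted above for an 8B Q8_0.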

&lt;p&gt;&lt;strong&gt;Hugging Face CLI (&lt;code&gt;huggingface-cli&lt;/code&gt;):&lt;/strong&gt; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; ships &lt;em&gt;externally managed&lt;/em&gt; system Python (&lt;strong&gt;PEP 668&lt;/strong&gt;), so &lt;strong&gt;&lt;code&gt;python3 -m pip install …&lt;/code&gt; fails&lt;/strong&gt; with &lt;code&gt;externally-managed-environment&lt;/code&gt;. Prefer a small &lt;strong&gt;virtualenv&lt;/strong&gt; for this tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This guide uses &lt;strong&gt;&lt;code&gt;$HOME/.venv/huggingface&lt;/code&gt;&lt;/strong&gt;: install &lt;strong&gt;&lt;code&gt;python3-venv&lt;/code&gt;&lt;/strong&gt;, create the venv &lt;strong&gt;once&lt;/strong&gt;, then run &lt;strong&gt;&lt;code&gt;source …/bin/activate&lt;/code&gt;&lt;/strong&gt; before &lt;code&gt;pip&lt;/code&gt; / &lt;code&gt;huggingface-cli&lt;/code&gt;, or call &lt;strong&gt;&lt;code&gt;"$HOME/.venv/huggingface/bin/huggingface-cli"&lt;/code&gt;&lt;/strong&gt; directly.&lt;/li&gt;
&lt;li&gt;Avoid &lt;strong&gt;&lt;code&gt;--break-system-packages&lt;/code&gt;&lt;/strong&gt; unless you understand the risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;pipx install 'huggingface_hub[cli]'&lt;/code&gt;&lt;/strong&gt; (after &lt;strong&gt;&lt;code&gt;sudo apt install pipx&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;pipx ensurepath&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use one consistent directory (avoid mixing &lt;code&gt;~/models&lt;/code&gt; and &lt;code&gt;llama.cpp/models&lt;/code&gt; by mistake):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where models live and how to list them
&lt;/h3&gt;

&lt;p&gt;llama.cpp has &lt;strong&gt;no&lt;/strong&gt; built-in model catalog: a model is a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; file&lt;/strong&gt;. You always pass the path with &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (absolute paths are best in &lt;code&gt;systemd&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the usual folder:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.gguf 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that prints nothing, you may still have GGUFs elsewhere (Downloads, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search under your home (limited depth, faster):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sort by size:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\t%p\n'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Open WebUI does &lt;strong&gt;not&lt;/strong&gt; enumerate “every GGUF on disk”. What matters is whichever file &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; loads via &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;. To “use another model”, change that &lt;code&gt;-m&lt;/code&gt; (and restart the process or service §9), or run &lt;strong&gt;another&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; on &lt;strong&gt;another&lt;/strong&gt; port (advanced; not detailed here).&lt;/p&gt;

&lt;p&gt;Generic example (swap the URL for the file link under the repo’s &lt;em&gt;Files&lt;/em&gt; tab on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/model-name.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/ORG/REPO/resolve/main/file.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)
&lt;/h3&gt;

&lt;p&gt;Recent quantized model (&lt;strong&gt;Apache 2.0&lt;/strong&gt;), &lt;strong&gt;Gemma 4&lt;/strong&gt; / MoE architecture; a good fit for machines with &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. ~96 GiB). Full file list and sizes: &lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Reasonable disk/RAM use: &lt;strong&gt;Q4_K_M&lt;/strong&gt; (~17 GiB per the model card). Maximum quality in this repo: &lt;strong&gt;Q8_0&lt;/strong&gt; (~27 GiB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; you need a &lt;strong&gt;recent llama.cpp&lt;/strong&gt; with Gemma 4 support (before building: &lt;code&gt;cd llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt;). If loading the GGUF reports architecture or tokenizer errors, update and rebuild (§6).&lt;/p&gt;

&lt;p&gt;Recommended download (Q4_K_M):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher-quality option (Q8_0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q8_0.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Equivalent using &lt;a href="https://huggingface.co/docs/huggingface_hub/guides/cli" rel="noopener noreferrer"&gt;huggingface-cli&lt;/a&gt; (handy for resumable downloads):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Hugging Face the model is tagged &lt;strong&gt;Image-Text-to-Text&lt;/strong&gt;; for text-only chat, &lt;code&gt;llama-server&lt;/code&gt; / Open WebUI usually work with the GGUF and embedded template. If message formatting breaks, check the &lt;em&gt;Prompt format&lt;/em&gt; section on the model card.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resolve/main/...&lt;/code&gt; URLs can break if files are renamed; if so, open the repo and copy the &lt;em&gt;download&lt;/em&gt; link for the exact &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; when running &lt;code&gt;llama-cli&lt;/code&gt; or &lt;code&gt;llama-server&lt;/code&gt;, use the real path to the &lt;code&gt;.gguf&lt;/code&gt; (absolute or relative to your current working directory).&lt;/p&gt;
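&lt;p&gt;A path sanity check before wiring the file into a command or service (the filename is the Q4_K_M example from above; substitute your own):&lt;br&gt;
&lt;/p&gt;

```shell
MODEL="$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"
ls -lh "$MODEL" 2>/dev/null || echo "wrong path or incomplete download: $MODEL"
```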

&lt;h3&gt;
  
  
  Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;very large&lt;/strong&gt; MoE (~32 B activated params / 1 T total per the model card). Community GGUFs: &lt;a href="https://huggingface.co/unsloth/Kimi-K2-Instruct-0905-GGUF" rel="noopener noreferrer"&gt;unsloth/Kimi-K2-Instruct-0905-GGUF&lt;/a&gt;. Run guide and flags: &lt;a href="https://docs.unsloth.ai/basics/kimi-k2" rel="noopener noreferrer"&gt;Unsloth — Kimi K2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware warning:&lt;/strong&gt; Unsloth’s README recommends &lt;strong&gt;≥ 128 GB unified RAM&lt;/strong&gt; even for “small” quants. Boxes in the ~64–80 GiB range may &lt;strong&gt;fail to load&lt;/strong&gt;, run &lt;strong&gt;very slowly&lt;/strong&gt;, or thrash &lt;strong&gt;swap&lt;/strong&gt;—treat it as an experiment (see §7 &lt;em&gt;Experimenting with more models&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face:&lt;/strong&gt; access may be &lt;strong&gt;gated&lt;/strong&gt;; sign in, accept terms on the model page, and use &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt; if required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards:&lt;/strong&gt; each quantization lives in a folder (&lt;code&gt;UD-TQ1_0/&lt;/code&gt;, &lt;code&gt;UD-IQ1_S/&lt;/code&gt;, &lt;code&gt;IQ4_XS/&lt;/code&gt;, …) with files like &lt;code&gt;…-00001-of-00006.gguf&lt;/code&gt; and so on. Download &lt;strong&gt;every&lt;/strong&gt; &lt;code&gt;.gguf&lt;/code&gt; in &lt;strong&gt;that&lt;/strong&gt; folder. For &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; must point at the &lt;strong&gt;first&lt;/strong&gt; shard (&lt;code&gt;…-00001-of-….gguf&lt;/code&gt;); current &lt;code&gt;llama.cpp&lt;/code&gt; loaders pick up sibling shards in the same directory.&lt;/p&gt;

&lt;p&gt;Download &lt;strong&gt;one&lt;/strong&gt; folder (example &lt;strong&gt;UD-TQ1_0&lt;/strong&gt;, six parts; confirm names under &lt;em&gt;Files&lt;/em&gt; on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli login    &lt;span class="c"&gt;# if token or gated access is required&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
huggingface-cli download unsloth/Kimi-K2-Instruct-0905-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"UD-TQ1_0/*.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other folders in the same repo are other quants (more disk / more quality). Pick based on &lt;strong&gt;free disk&lt;/strong&gt; and &lt;strong&gt;RAM&lt;/strong&gt;.&lt;/p&gt;
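Since `-m` must point at the first shard, a small helper saves retyping the long filename. A sketch, assuming Unsloth-style `*-00001-of-*.gguf` names as above; `first_shard` is illustrative:

```shell
# Sketch: pick the first shard in a quant folder for -m.
# Assumes shard names matching *-00001-of-*.gguf (the convention shown above).
first_shard() {
  # expand the glob into positional params; if nothing matches, -f fails
  set -- "$1"/*-00001-of-*.gguf
  [ -f "$1" ] && printf '%s\n' "$1"
}

# demo with a throwaway layout:
d=$(mktemp -d)
touch "$d/model-00001-of-00006.gguf" "$d/model-00002-of-00006.gguf"
first_shard "$d"
```

Then something like `-m "$(first_shard "$HOME/models/kimi-k2-0905/UD-TQ1_0")"` works regardless of the exact shard count.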

&lt;p&gt;Before loading: &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; and rebuild &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6). Short smoke test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905/UD-TQ1_0/Kimi-K2-Instruct-0905-UD-TQ1_0-00001-of-00006.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say hi in one sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;; on architecture/tokenizer errors, update and rebuild. For &lt;strong&gt;§9&lt;/strong&gt; / Open WebUI, &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; uses the same path to the &lt;strong&gt;first&lt;/strong&gt; shard; read the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from &lt;code&gt;/v1/models&lt;/code&gt; via &lt;code&gt;curl&lt;/code&gt; once &lt;code&gt;llama-server&lt;/code&gt; is up for &lt;em&gt;Model IDs&lt;/em&gt;.&lt;/p&gt;
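Reading the `id` from `/v1/models` can be scripted. In this sketch the `curl` call is replaced by a canned response so the parsing is visible; the `id` value is made up:

```shell
# Sketch: extract the model id from llama-server's /v1/models JSON.
# Real use: json=$(curl -s http://127.0.0.1:8080/v1/models)
json='{"object":"list","data":[{"id":"kimi-k2-0905","object":"model"}]}'
model_id=$(printf '%s' "$json" |
  python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')
echo "$model_id"
```

Paste that `id` into Open WebUI's *Model IDs* field when the picker stays empty.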

&lt;h3&gt;
  
  
  Example: local Llama 3.1 8B Instruct Q8_0
&lt;/h3&gt;

&lt;p&gt;If you already have e.g. &lt;strong&gt;&lt;code&gt;$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf&lt;/code&gt;&lt;/strong&gt; (~8 GiB on disk), &lt;strong&gt;replace&lt;/strong&gt; every &lt;code&gt;-m&lt;/code&gt; path in this guide with yours. &lt;strong&gt;Q8_0&lt;/strong&gt; favors quality over speed; for higher &lt;strong&gt;tok/s&lt;/strong&gt; on an iGPU, try a &lt;strong&gt;Q4_K_M&lt;/strong&gt; in the same model family.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)
&lt;/h3&gt;

&lt;p&gt;Use this to compare runs on &lt;strong&gt;the same machine&lt;/strong&gt; with different &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; values, different GGUFs, or different builds (CPU vs Vulkan), without UI noise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify the binary&lt;/strong&gt; exists (size and date are useful hints; both should refresh after each rebuild):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; build/bin/llama-bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If it is &lt;strong&gt;missing&lt;/strong&gt;, rebuild the project (§6); most full builds already include &lt;code&gt;llama-bench&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flags&lt;/strong&gt; change across versions—always start from help:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="nt"&gt;--help&lt;/span&gt; | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimal example&lt;/strong&gt; (swap the path):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;: path to the &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;: GPU layers; many builds accept &lt;strong&gt;&lt;code&gt;999&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; as “as many as possible”. If rejected, try &lt;strong&gt;&lt;code&gt;35&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;45&lt;/code&gt;&lt;/strong&gt;, etc., and increase until loading fails or throughput drops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;: generated tokens per benchmark run (increase for longer, more stable runs).&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reading output:&lt;/strong&gt; you usually see &lt;em&gt;prompt processing&lt;/em&gt; vs &lt;em&gt;generation&lt;/em&gt; tok/s. If numbers are tiny and logs show &lt;strong&gt;no&lt;/strong&gt; Vulkan / &lt;code&gt;ggml_vulkan&lt;/code&gt;, the binary might lack &lt;code&gt;GGML_VULKAN&lt;/code&gt;, or &lt;code&gt;/dev/dri&lt;/code&gt; permissions were wrong at build/run time (§4).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fair comparisons:&lt;/strong&gt; same &lt;code&gt;llama-bench&lt;/code&gt; build, same model, same &lt;code&gt;-n&lt;/code&gt;, only change &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; or the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
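A fair comparison is easy to script: pin the model and `-n`, vary only `-ngl`. Shown here as a dry run (each command is only echoed) so nothing heavy starts by accident; the path is the example model from this section:

```shell
# Sketch: sweep -ngl with everything else pinned. Remove 'echo' to run for real.
MODEL="$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf"
NGL_VALUES="0 20 35 45 999"
for ngl in $NGL_VALUES; do
  echo ./build/bin/llama-bench -m "$MODEL" -ngl "$ngl" -n 128
done
```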

&lt;p&gt;&lt;strong&gt;Sample real output&lt;/strong&gt; (same command order as above; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt;, &lt;strong&gt;Radeon 760M RADV&lt;/strong&gt;, &lt;strong&gt;Llama 3.1 8B Instruct Q8_0&lt;/strong&gt;; numbers shift with BIOS, thermals, quantization, and llama.cpp revision):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           pp512 |        235.96 ± 0.19 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           tg128 |          9.80 ± 0.00 |

build: 4d688f9eb (8016)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;ggml_vulkan&lt;/code&gt;&lt;/strong&gt; lines show &lt;strong&gt;one&lt;/strong&gt; Vulkan device and that the bench is on &lt;strong&gt;RADV&lt;/strong&gt; (not &lt;code&gt;llvmpipe&lt;/code&gt; only). Errors or zero devices → revisit §4–§5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;: prompt processing — tok/s for a ~512-token prefill; usually &lt;strong&gt;higher&lt;/strong&gt; than generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;: token generation — tok/s while emitting &lt;strong&gt;128&lt;/strong&gt; output tokens; closest bench metric to “reply speed” in chat. Here ≈&lt;strong&gt;9.8 t/s&lt;/strong&gt; for Q8_0 on this iGPU.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; line is your llama.cpp &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; commit; it changes after &lt;code&gt;git pull&lt;/code&gt; + rebuild.&lt;/li&gt;
&lt;/ul&gt;
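If you run the bench often, the `tg128` figure can be pulled out of the table programmatically. In this sketch, `sample` mimics one row of the output above; pipe real `llama-bench` output instead:

```shell
# Sketch: extract the tok/s number from a llama-bench tg128 row with awk.
# Fields are split on '|'; column 7 is the test name, column 8 the t/s cell.
sample='| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           tg128 |          9.80 ± 0.00 |'
tg=$(printf '%s\n' "$sample" | awk -F'|' '$7 ~ /tg128/ {print $8}' | awk '{print $1}')
echo "$tg"   # → 9.80
```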

&lt;p&gt;&lt;strong&gt;Another sample&lt;/strong&gt; (&lt;strong&gt;same mini PC class&lt;/strong&gt;, &lt;strong&gt;Gemma 4 26B&lt;/strong&gt; Instruct &lt;strong&gt;Q4_K_M&lt;/strong&gt; — the model this guide uses in many examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           pp512 |        239.04 ± 1.97 |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           tg128 |         20.94 ± 0.02 |

build: d12cc3d1c (8720)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;gemma4 ?B&lt;/code&gt;&lt;/strong&gt; label is &lt;strong&gt;cosmetic&lt;/strong&gt; on some &lt;code&gt;llama-bench&lt;/code&gt; builds; trust &lt;strong&gt;size&lt;/strong&gt; (~&lt;strong&gt;15.85 GiB&lt;/strong&gt;), &lt;strong&gt;params&lt;/strong&gt; (~&lt;strong&gt;25.23 B&lt;/strong&gt;), and your &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What this run says:&lt;/strong&gt; with &lt;strong&gt;Vulkan&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt; 999&lt;/strong&gt;, expect on the order of &lt;strong&gt;~239 tok/s&lt;/strong&gt; for &lt;strong&gt;prefill&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;) and &lt;strong&gt;~21 tok/s&lt;/strong&gt; for &lt;strong&gt;generation&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;). That &lt;strong&gt;~21 t/s&lt;/strong&gt; is the most useful single number for “raw” reply speed (no Open WebUI overhead, no long reasoning block, no huge prompts); real chat often lands near this ballpark or a bit lower.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Other GGUFs&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt;&lt;/strong&gt;, or &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; revisions will move &lt;strong&gt;&lt;code&gt;tg*&lt;/code&gt;&lt;/strong&gt; a lot; record your own table after major changes.&lt;/p&gt;
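Recording your own table can be one command per run. A sketch; `LOG` and `log_run` are illustrative (here `LOG` is a temp file — point it at a persistent path for real use):

```shell
# Sketch: append bench results to a small CSV so each run is logged consistently.
LOG=$(mktemp -u)   # illustrative; use e.g. "$HOME/llama-bench-log.csv" for real
log_run() {  # usage: log_run <model-label> <ngl> <tg_tok_s>
  [ -f "$LOG" ] || echo "date,model,ngl,tg_tok_s" > "$LOG"
  echo "$(date +%F),$1,$2,$3" >> "$LOG"
}

log_run gemma4-26B-Q4_K_M 999 20.94   # figures from the sample run above
cat "$LOG"
```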

&lt;h3&gt;
  
  
  Quick terminal test
&lt;/h3&gt;

&lt;p&gt;From the &lt;code&gt;llama.cpp&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemma 4 and on-screen reasoning (&lt;code&gt;[Start thinking]&lt;/code&gt; … &lt;code&gt;[End thinking]&lt;/code&gt;):&lt;/strong&gt; many &lt;strong&gt;Instruct&lt;/strong&gt; GGUFs emit a “thinking” block before the final answer. On a &lt;strong&gt;recent &lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;code&gt;--help&lt;/code&gt; normally documents the following flags (verify with &lt;code&gt;./build/bin/llama-cli --help | grep -iE 'reason|think|template'&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-rea, --reasoning on|off|auto&lt;/code&gt;&lt;/strong&gt; — default &lt;strong&gt;&lt;code&gt;auto&lt;/code&gt;&lt;/strong&gt; (template decides). For &lt;strong&gt;clean screenshots&lt;/strong&gt;, use &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (short &lt;strong&gt;&lt;code&gt;-rea off&lt;/code&gt;&lt;/strong&gt; if your build prints it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-budget N&lt;/code&gt;&lt;/strong&gt; — &lt;strong&gt;&lt;code&gt;0&lt;/code&gt;&lt;/strong&gt; ends the thinking block immediately; &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; is unrestricted. Pair with &lt;strong&gt;&lt;code&gt;off&lt;/code&gt;&lt;/strong&gt; if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--chat-template-kwargs STRING&lt;/code&gt;&lt;/strong&gt; — JSON for the template parser (e.g. &lt;strong&gt;&lt;code&gt;'{"enable_thinking": false}'&lt;/code&gt;&lt;/strong&gt; in bash with outer single quotes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-format FORMAT&lt;/code&gt;&lt;/strong&gt; — tag handling / extraction (DeepSeek-style paths); &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; is usually enough for Gemma in interactive CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Screenshot-friendly example (same command as above + reasoning disabled):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reference run&lt;/strong&gt; (validated hardware in the intro; &lt;strong&gt;no&lt;/strong&gt; &lt;code&gt;[Start thinking]&lt;/code&gt; block; &lt;strong&gt;t/s&lt;/strong&gt; are indicative):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" alt="llama-cli: Gemma 4 26B Q4_K_M with  raw `--reasoning off` endraw , one-sentence answer and prompt/generation **t/s**." width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also export the env vars mentioned in &lt;code&gt;--help&lt;/code&gt; (&lt;strong&gt;&lt;code&gt;LLAMA_ARG_REASONING&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;LLAMA_ARG_THINK_BUDGET&lt;/code&gt;&lt;/strong&gt;, …) if you prefer not to repeat flags.&lt;/p&gt;
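For example (variable names taken from `--help` as quoted above; verify on your build, since they change across llama.cpp versions):

```shell
# Sketch: set the reasoning behavior once per shell instead of repeating flags.
# Names per the --help output referenced above; confirm on your own build.
export LLAMA_ARG_REASONING=off
export LLAMA_ARG_THINK_BUDGET=0
env | grep '^LLAMA_ARG_'
```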

&lt;p&gt;For &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (§8–§9), add the same switches to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--reasoning-budget 0&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--chat-template-kwargs …&lt;/code&gt;&lt;/strong&gt;) as your binary supports. If &lt;strong&gt;nothing&lt;/strong&gt; disables it, try another GGUF/variant, or another model for a one-off capture (e.g. Llama in this same §7).&lt;/p&gt;

&lt;p&gt;Example with a local &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; (single-turn demo; chat template depends on the GGUF). An overly vague &lt;strong&gt;&lt;code&gt;-p&lt;/code&gt;&lt;/strong&gt; (“summarize llama.cpp”) may yield “I don’t have that information”; give &lt;strong&gt;context&lt;/strong&gt; in the question (e.g. open-source inference, GGUF, local execution).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in exactly one sentence: What does the llama.cpp project do for running language models locally?"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual reference screenshot&lt;/strong&gt; (same &lt;strong&gt;validated&lt;/strong&gt; hardware in the intro: Ryzen 5 &lt;strong&gt;7640HS&lt;/strong&gt;, Radeon &lt;strong&gt;760M&lt;/strong&gt;, &lt;strong&gt;DDR5&lt;/strong&gt;; &lt;strong&gt;t/s&lt;/strong&gt; varies with thermals, BIOS, and &lt;code&gt;llama.cpp&lt;/code&gt; commit):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" alt="llama-cli: Llama 3.1 8B Instruct Q8_0 — answer about llama.cpp and prompt/generation t/s." width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt; / &lt;code&gt;999&lt;/code&gt;&lt;/strong&gt;: tries to offload many layers to the GPU; on large models or a small unified VRAM budget you may need to &lt;strong&gt;lower&lt;/strong&gt; &lt;code&gt;-ngl&lt;/code&gt; or increase the BIOS framebuffer (§2).&lt;/li&gt;
&lt;li&gt;On startup, look for lines like &lt;code&gt;ggml_vulkan:&lt;/code&gt; and your GPU name (e.g. Radeon 760M) to confirm Vulkan.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Adding or switching models
&lt;/h3&gt;

&lt;p&gt;Each &lt;strong&gt;additional model&lt;/strong&gt; you want to run—another family, quantization, or file from Hugging Face—is &lt;strong&gt;one&lt;/strong&gt; more &lt;code&gt;.gguf&lt;/code&gt; in your folder (e.g. &lt;code&gt;$HOME/models&lt;/code&gt;). ML slang often says &lt;strong&gt;“weights”&lt;/strong&gt; for the &lt;strong&gt;trained parameters&lt;/strong&gt; inside that file; here it is enough to think “another &lt;code&gt;.gguf&lt;/code&gt;.” The flow is always &lt;strong&gt;download → test → point the server&lt;/strong&gt; at that path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; using the same pattern as above (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, or the repo’s &lt;em&gt;download&lt;/em&gt; link on Hugging Face).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test in the terminal&lt;/strong&gt; with &lt;code&gt;llama-cli -m "$HOME/models/your-new-file.gguf"&lt;/code&gt; (like the quick test). If the architecture is brand new and load fails, update and rebuild llama.cpp (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual &lt;code&gt;llama-server&lt;/code&gt; (§8):&lt;/strong&gt; stop the process (&lt;strong&gt;Ctrl+C&lt;/strong&gt;) and start it again with &lt;code&gt;-m&lt;/code&gt; pointing at the new file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd service (§9):&lt;/strong&gt; edit &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, change only the &lt;code&gt;-m /full/path/new.gguf&lt;/code&gt; argument inside &lt;code&gt;ExecStart&lt;/code&gt;, save, then run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI (§10):&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loads &lt;strong&gt;one&lt;/strong&gt; model at a time (whichever you set at startup). After restarting the service, reload the UI; the model dropdown may show the filename or a generic label (&lt;code&gt;default&lt;/code&gt;), depending on the version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode / VS Code (§11):&lt;/strong&gt; same host and port (&lt;code&gt;…:8080/v1&lt;/code&gt;); in editors use the server IP or &lt;code&gt;127.0.0.1&lt;/code&gt; depending on where the IDE runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Serving &lt;strong&gt;several models at once&lt;/strong&gt; requires multiple &lt;code&gt;llama-server&lt;/code&gt; processes on &lt;strong&gt;different ports&lt;/strong&gt; (and matching entries in Open WebUI or more containers); that advanced layout is not spelled out here.&lt;/p&gt;
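For orientation only, the shape of that layout (a dry run; the model filenames are placeholders, and dropping the `echo` would actually launch the servers):

```shell
# Sketch: one llama-server process per model, each on its own port (placeholders).
PORT_A=8080
PORT_B=8081
echo ./build/bin/llama-server -m "$HOME/models/model-a.gguf" --port "$PORT_A"
echo ./build/bin/llama-server -m "$HOME/models/model-b.gguf" --port "$PORT_B"
```

Each port then gets its own connection entry in Open WebUI (or its own container).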

&lt;h3&gt;
  
  
  Experimenting with more models: setup, testing, and limits
&lt;/h3&gt;

&lt;p&gt;If you want to &lt;strong&gt;try multiple GGUFs&lt;/strong&gt;, follow a clear flow and know your hardware ceiling—this avoids pointless downloads and false “it’s broken” moments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended flow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check disk and RAM&lt;/strong&gt; (&lt;code&gt;free -h&lt;/code&gt;, &lt;code&gt;df -h /&lt;/code&gt;, §3). Each quantization costs what the model card says; keep headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; when the model is new (§6, &lt;em&gt;Update and rebuild&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; into &lt;code&gt;$HOME/models&lt;/code&gt; (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test&lt;/strong&gt; with &lt;code&gt;llama-cli&lt;/code&gt; and &lt;strong&gt;short&lt;/strong&gt; generations; confirm &lt;code&gt;ggml_vulkan&lt;/code&gt; if the GPU should participate (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional:&lt;/strong&gt; &lt;code&gt;llama-bench&lt;/code&gt; with the same &lt;code&gt;-ngl&lt;/code&gt; you plan for production to compare quantizations (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change &lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; in &lt;strong&gt;§9&lt;/strong&gt; (or manual §8), &lt;code&gt;daemon-reload&lt;/code&gt; + &lt;code&gt;restart&lt;/code&gt;, then &lt;strong&gt;&lt;code&gt;curl /v1/models&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;Open WebUI&lt;/strong&gt; (Admin → Connections; &lt;strong&gt;Model IDs&lt;/strong&gt; if needed).&lt;/li&gt;
&lt;/ol&gt;
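Step 1 above can be partly automated. A rough sketch (Linux-only, reads `MemAvailable`; `check_fit` is illustrative and ignores context/KV-cache overhead, so keep extra headroom):

```shell
# Sketch: compare a file's size against currently free RAM (MemAvailable).
# Rough: ignores KV cache and OS churn; treat "fits" as "worth trying".
check_fit() {
  need_kb=$(( $(stat -c %s "$1") / 1024 ))
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  if [ "$need_kb" -lt "$avail_kb" ]; then
    echo "fits (${need_kb} KiB needed)"
  else
    echo "too big (${need_kb} KiB needed)"
  fi
}

check_fit /bin/sh   # demo on a small file; point it at your .gguf instead
```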

&lt;p&gt;&lt;strong&gt;Typical limits on a mini PC with an iGPU&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GGUF size + OS + context cannot grow without limit; huge &lt;strong&gt;MoE&lt;/strong&gt; releases (e.g. &lt;strong&gt;Kimi K2&lt;/strong&gt;-class GGUFs) can &lt;strong&gt;exceed&lt;/strong&gt; usable RAM on 64–96 GiB class boxes or crawl at &lt;strong&gt;extremely&lt;/strong&gt; low tok/s.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iGPU Vulkan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Caps &lt;strong&gt;tok/s&lt;/strong&gt; on GPU; lots of RAM helps you &lt;strong&gt;load&lt;/strong&gt; weights, not mimic a big discrete GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;One active model per &lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Switching models means changing &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;restarting&lt;/strong&gt; (or a second server on another port).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Templates / chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weird chat in Open WebUI may be the GGUF &lt;strong&gt;chat template&lt;/strong&gt;; check the Hugging Face card or try another frontend.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network / disk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large downloads take time; use &lt;code&gt;wget --continue&lt;/code&gt; or resumable &lt;code&gt;huggingface-cli&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Set expectations:&lt;/strong&gt; an &lt;strong&gt;8B–13B&lt;/strong&gt; or a quantized &lt;strong&gt;26B&lt;/strong&gt; can be a great fit with ample RAM; &lt;strong&gt;datacenter-scale&lt;/strong&gt; GGUF may &lt;strong&gt;not fit&lt;/strong&gt; or run &lt;strong&gt;under ~1–2 tok/s&lt;/strong&gt; with aggressive paging—that is a &lt;strong&gt;memory bandwidth&lt;/strong&gt; issue, not an Ubuntu bug.&lt;/p&gt;
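The memory-bandwidth point can be sanity-checked with a back-of-envelope number: for dense models, each generated token reads roughly the whole model from memory, so generation tok/s is bounded by bandwidth divided by model size. The bandwidth figure below is an assumption (dual-channel DDR5-5600 is on the order of 90 GB/s):

```shell
# Rough upper bound on generation tok/s: memory bandwidth / bytes per token.
# 90 GB/s is an assumed DDR5 figure; 8 GiB matches the Q8_0 8B model above.
bw_gbs=90
model_gb=8
echo $(( bw_gbs / model_gb ))   # ~11 tok/s ceiling; the measured 9.8 t/s fits
```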

&lt;h3&gt;
  
  
  One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)
&lt;/h3&gt;

&lt;p&gt;For a &lt;strong&gt;mini PC–style&lt;/strong&gt; setup: Ubuntu 24.04, &lt;strong&gt;AMD iGPU Vulkan&lt;/strong&gt;, &lt;strong&gt;~64–96 GiB&lt;/strong&gt; RAM, &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;8080&lt;/strong&gt;, &lt;strong&gt;systemd&lt;/strong&gt; §9, &lt;strong&gt;Open WebUI&lt;/strong&gt; §10. Swap in your paths and username.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common steps (every model swap)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Refresh the engine&lt;/strong&gt; if the model is new or loading fails: &lt;code&gt;cd ~/llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt; and rebuild (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; (per-family commands below). &lt;strong&gt;Verify&lt;/strong&gt; the filename under Hugging Face → &lt;em&gt;Files&lt;/em&gt;; if it is renamed, fix the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test&lt;/strong&gt; (tune &lt;code&gt;-ngl&lt;/code&gt; and &lt;code&gt;-c&lt;/code&gt;); or use the &lt;strong&gt;copy-paste commands per model&lt;/strong&gt; under &lt;em&gt;Per-model quick test&lt;/em&gt; below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
./build/bin/llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"/absolute/path/to/file.gguf"&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tuning:&lt;/strong&gt; on &lt;strong&gt;OOM&lt;/strong&gt;, &lt;strong&gt;hangs&lt;/strong&gt;, or very slow output, &lt;strong&gt;lower &lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. 50, 35) and/or &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. 2048). &lt;strong&gt;Unified&lt;/strong&gt; iGPU memory is usually the limiter, not raw RAM alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (optional, §7) with the same path and &lt;code&gt;-ngl&lt;/code&gt; to compare quants or families.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd (§9):&lt;/strong&gt; in &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: use the same path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, and match &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to what worked in the smoke test.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="7"&gt;
&lt;li&gt;
&lt;strong&gt;API check:&lt;/strong&gt; &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI:&lt;/strong&gt; Admin → Connections → OpenAI (&lt;code&gt;host.docker.internal:8080/v1&lt;/code&gt;). If the picker stays empty, paste the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from that JSON into &lt;strong&gt;Model IDs&lt;/strong&gt;, save, and hard-refresh.&lt;/li&gt;
&lt;/ol&gt;
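
&lt;p&gt;To fill &lt;strong&gt;Model IDs&lt;/strong&gt; without reading the raw JSON, you can filter the response with &lt;code&gt;jq&lt;/code&gt; (a small sketch; assumes &lt;code&gt;jq&lt;/code&gt; is installed, e.g. &lt;code&gt;sudo apt install -y jq&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print every model id the server reports (OpenAI-style "data" array)
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;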

&lt;h4&gt;
  
  
  Reference table (repos + sample file)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Hugging Face repo&lt;/th&gt;
&lt;th&gt;Sample file (quant)&lt;/th&gt;
&lt;th&gt;Notes (on a machine with plenty of RAM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; 26B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~17 GiB on disk; usually needs &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Start &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; around &lt;strong&gt;4096&lt;/strong&gt;–&lt;strong&gt;8192&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen2.5 Coder&lt;/strong&gt; 7B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Qwen2.5-Coder-7B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Much lighter than Gemma 26B. For &lt;strong&gt;14B / 32B&lt;/strong&gt;, check &lt;em&gt;Files&lt;/em&gt; sizes; 32B Q4 is often &lt;strong&gt;~18–20 GiB+&lt;/strong&gt; and heavier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;DeepSeek Coder V2 Lite&lt;/strong&gt; Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;“Lite” ≈ &lt;strong&gt;~10 GiB&lt;/strong&gt; class in Q4_K_M; solid code/disk trade-off locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Llama 3.1&lt;/strong&gt; 8B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Meta-Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/code&gt; or &lt;code&gt;-Q8_0.gguf&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt; faster; &lt;strong&gt;Q8_0&lt;/strong&gt; heavier / often higher quality. If your file name differs, keep your real path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)
&lt;/h4&gt;

&lt;p&gt;If you use &lt;strong&gt;SSH&lt;/strong&gt; and the download runs a long time, run it inside &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; so a dropped connection does not kill the job. Example with &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; (install if needed: &lt;code&gt;sudo apt install -y screen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;screen &lt;span class="nt"&gt;-S&lt;/span&gt; hf-models
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;span class="c"&gt;# When this wget finishes, you can paste the next command from the block below without leaving screen.&lt;/span&gt;

&lt;span class="c"&gt;# Detach (leave download running): Ctrl+A, release, D&lt;/span&gt;
&lt;span class="c"&gt;# Reattach later: screen -r hf-models&lt;/span&gt;
&lt;span class="c"&gt;# List sessions: screen -ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern works for the other URLs in this section or for &lt;strong&gt;&lt;code&gt;huggingface-cli download&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

&lt;span class="c"&gt;# Gemma 4 26B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Qwen2.5 Coder 7B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# DeepSeek Coder V2 Lite Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF/resolve/main/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Llama 3.1 8B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Meta / Llama (gated):&lt;/strong&gt; if &lt;code&gt;wget&lt;/code&gt; returns &lt;strong&gt;403&lt;/strong&gt; or Hugging Face asks you to sign in, open the model page while logged in, &lt;strong&gt;accept the license&lt;/strong&gt;, create a &lt;strong&gt;read&lt;/strong&gt; token, and run &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt;. &lt;em&gt;Gated&lt;/em&gt; repos usually need &lt;strong&gt;&lt;code&gt;huggingface-cli download ...&lt;/code&gt;&lt;/strong&gt;, not anonymous &lt;code&gt;wget&lt;/code&gt; to &lt;code&gt;resolve/main/...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt; alternative&lt;/strong&gt; (resumable; each command pulls &lt;strong&gt;one&lt;/strong&gt; GGUF under &lt;code&gt;--local-dir&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
&lt;span class="c"&gt;# huggingface-cli login   # required for *gated* repos (e.g. Llama/Meta); optional otherwise&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the CLI version, the &lt;code&gt;.gguf&lt;/code&gt; may end up in a &lt;strong&gt;subfolder&lt;/strong&gt; under &lt;code&gt;--local-dir&lt;/code&gt;. Point &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; at the real absolute path (for example &lt;code&gt;find "$HOME/models" -name '*.gguf'&lt;/code&gt;).&lt;/p&gt;
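
&lt;p&gt;A quick way to see every downloaded file with its absolute path and size (a sketch; a size far below the one listed under Hugging Face → &lt;em&gt;Files&lt;/em&gt; usually means a truncated download):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Size and absolute path of every .gguf under ~/models, including subfolders
find "$HOME/models" -name '*.gguf' -exec du -h '{}' \;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;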

&lt;h4&gt;
  
  
  Per-model quick test (right after download)
&lt;/h4&gt;

&lt;p&gt;Run &lt;strong&gt;one&lt;/strong&gt; block (paths match the &lt;code&gt;wget&lt;/code&gt; names above). &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt; caps generated tokens so the run stays short; if your &lt;code&gt;llama-cli&lt;/code&gt; rejects &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;, check &lt;code&gt;./build/bin/llama-cli --help&lt;/code&gt; (sometimes &lt;code&gt;--predict&lt;/code&gt; or another alias). Earlier in §7, &lt;em&gt;Quick terminal test&lt;/em&gt; shows a &lt;strong&gt;&lt;code&gt;-cnv&lt;/code&gt;&lt;/strong&gt; example for Gemma and a Llama variant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 26B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence what a tensor is in machine learning."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5 Coder 7B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a one-line Python factorial(n) function; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DeepSeek Coder V2 Lite Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a JavaScript arrow function that adds two numbers; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say in one sentence what llama.cpp is for."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On startup you should see &lt;strong&gt;&lt;code&gt;ggml:&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;ggml_vulkan:&lt;/code&gt;&lt;/strong&gt; lines naming your GPU when Vulkan is in use (§4–§5).&lt;/p&gt;
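
&lt;p&gt;If you are unsure which backend is active, you can filter the startup log for those lines (a sketch; swap in your real &lt;code&gt;.gguf&lt;/code&gt; path, and note the device info goes to stderr, hence the redirect):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd "$HOME/llama.cpp"
# Tiny run; keep only backend/device lines from the startup log
./build/bin/llama-cli -m "$HOME/models/YOUR_FILE.gguf" -ngl 999 -n 1 -p "hi" 2&amp;gt;&amp;amp;1 | grep -iE 'ggml|vulkan'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No &lt;code&gt;ggml_vulkan:&lt;/code&gt; line usually points to a CPU-only build or missing &lt;code&gt;/dev/dri&lt;/code&gt; access (§4–§5).&lt;/p&gt;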

&lt;h4&gt;
  
  
  Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)
&lt;/h4&gt;

&lt;p&gt;Same shape as §9; only &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (and possibly &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;) change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…/llama-server \
    -m /home/YOUR_USER/models/THE_FILE_YOU_TESTED.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 8192 \
    -ngl 999 \
    --n-predict -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;Gemma 26B Q4&lt;/strong&gt; or another big model &lt;strong&gt;OOM&lt;/strong&gt;s on a box with only &lt;strong&gt;~16 GiB&lt;/strong&gt; RAM, lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less) &lt;strong&gt;before&lt;/strong&gt; pushing &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;. Always validate with &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; using the same &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; you plan in &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;, then automate with systemd (§9).&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Run manually, listening on all interfaces on port &lt;strong&gt;8080&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On another machine: &lt;code&gt;http://SERVER_IP:8080&lt;/code&gt; (llama.cpp’s built-in UI is very basic).&lt;/p&gt;
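
&lt;p&gt;Beyond the built-in UI, you can confirm the OpenAI-compatible API from any machine with &lt;code&gt;curl&lt;/code&gt; (a sketch; &lt;code&gt;SERVER_IP&lt;/code&gt; is a placeholder, and with a single loaded model &lt;code&gt;llama-server&lt;/code&gt; does not need a &lt;code&gt;model&lt;/code&gt; field):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Minimal OpenAI-style chat request against the running server
curl -s http://SERVER_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi in one word."}],"max_tokens":16}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;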




&lt;h2&gt;
  
  
  9. systemd service (start on boot)
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt; (e.g. with &lt;code&gt;sudo nano&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Llama.cpp API server (Vulkan)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="py"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="c"&gt;# Vulkan on AMD: the service user must access /dev/dri (groups in §4).
# If the service loads the model on CPU only, check `groups` / `id` for that user.
&lt;/span&gt;&lt;span class="py"&gt;SupplementaryGroups&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;render video&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp/build/bin/llama-server &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-m /home/YOUR_USER/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--port 8080 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-c 8192 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-ngl 99 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--n-predict -1&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended order (tight RAM):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;.gguf&lt;/code&gt; must be &lt;strong&gt;fully downloaded&lt;/strong&gt;; a truncated file makes the unit &lt;strong&gt;fail&lt;/strong&gt; or &lt;strong&gt;restart in a loop&lt;/strong&gt; (&lt;code&gt;Restart=always&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test with &lt;code&gt;llama-cli&lt;/code&gt; first&lt;/strong&gt; as the &lt;strong&gt;same user&lt;/strong&gt; as the systemd unit, with the &lt;strong&gt;same&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; as in &lt;code&gt;ExecStart&lt;/code&gt; (§7 &lt;em&gt;Per-model quick test&lt;/em&gt; or step 3’s generic example). If that already OOMs or hangs, &lt;strong&gt;tune flags&lt;/strong&gt; before &lt;code&gt;enable --now&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If systemd shows &lt;strong&gt;OOM&lt;/strong&gt; in &lt;code&gt;journalctl&lt;/code&gt;, the process &lt;strong&gt;dies and respawns&lt;/strong&gt; every few seconds, or the kernel kills the worker, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less) instead of staying at &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;. After each edit, run &lt;code&gt;sudo systemctl daemon-reload&lt;/code&gt; and &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;, and repeat until &lt;code&gt;status&lt;/code&gt; shows a stable &lt;strong&gt;active (running)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If startup fails, check logs: &lt;code&gt;journalctl -u llama-web.service -n 80 --no-pager&lt;/code&gt; (GGUF path, &lt;code&gt;/dev/dri&lt;/code&gt; permissions, &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, Vulkan).&lt;/p&gt;
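
&lt;p&gt;Two more log views that help while tuning (a sketch; the &lt;code&gt;grep&lt;/code&gt; pattern is just a starting point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Follow the unit live while you adjust -c / -ngl and restart
journalctl -u llama-web.service -f
# Search past output for kills and errors
journalctl -u llama-web.service --no-pager | grep -iE 'oom|killed|error'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;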




&lt;h2&gt;
  
  
  10. Open WebUI with Docker (port 3000 → backend on 8080)
&lt;/h2&gt;

&lt;p&gt;Install Docker if needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker.io
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out again, or run: newgrp docker&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Container (UI on &lt;strong&gt;3000&lt;/strong&gt;; engine stays on host &lt;strong&gt;8080&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host.docker.internal:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; open-webui:/app/backend/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; open-webui &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the browser: &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect Open WebUI to llama-server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not the same as “External tools”.&lt;/strong&gt; In regular user settings you may see &lt;strong&gt;External tools&lt;/strong&gt; (&lt;em&gt;Manage tool servers&lt;/em&gt;, &lt;code&gt;openapi.json&lt;/code&gt;): that is for optional &lt;strong&gt;tool&lt;/strong&gt; servers, &lt;strong&gt;not&lt;/strong&gt; for the main LLM backend. Putting your URL only there leaves the model picker empty.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Admin Settings&lt;/strong&gt;, not the gear icon that only shows &lt;em&gt;General / Interface / External tools&lt;/em&gt; (&lt;a href="https://docs.openwebui.com/getting-started/quick-start/settings/" rel="noopener noreferrer"&gt;personal user settings&lt;/a&gt;). Typical path: &lt;strong&gt;profile avatar&lt;/strong&gt; → &lt;strong&gt;Admin Settings&lt;/strong&gt; / &lt;strong&gt;Administration&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Connections&lt;/strong&gt; → &lt;strong&gt;OpenAI&lt;/strong&gt; → &lt;strong&gt;Add connection&lt;/strong&gt;. If &lt;em&gt;Admin Settings&lt;/em&gt; is missing, your account is not an instance admin (the first registered user usually is). Docs: &lt;a href="https://docs.openwebui.com/getting-started/quick-start/connect-a-provider/starting-with-openai-compatible/" rel="noopener noreferrer"&gt;OpenAI-Compatible&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Admin panel → Settings → Connections&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; section (llama-server mimics the OpenAI API):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base URL:&lt;/strong&gt; &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key:&lt;/strong&gt; any string (e.g. &lt;code&gt;sk-no-key-required&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and use &lt;strong&gt;verify connection&lt;/strong&gt; if shown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turn off “Direct connections”&lt;/strong&gt; (or equivalent) if you enabled it: otherwise the browser will try to resolve &lt;code&gt;host.docker.internal&lt;/code&gt; outside Docker and fail. The UI should proxy to the backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Chat up and running (example)
&lt;/h3&gt;

&lt;p&gt;With the backend wired, pick a model in chat (often the same label as the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; filename&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loaded), send a prompt, and the reply is generated on the host. The screenshot shows &lt;strong&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/strong&gt;: the header dropdown reflects that file, and you get a &lt;strong&gt;“Thought for …”&lt;/strong&gt;-style block (internal reasoning before the visible answer). That &lt;strong&gt;adds latency&lt;/strong&gt; before you see the final text; for &lt;strong&gt;terminal&lt;/strong&gt; use and less explicit “thinking” output with Gemma, try &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (§7 &lt;em&gt;Quick terminal test&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" alt="Open WebUI: chat with Gemma 4 26B Q4_K_M, GGUF picker, and reasoning (“Thought for …”)." width="800" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  No browsing or GitHub fetch: real limits (and confident wrong answers)
&lt;/h3&gt;

&lt;p&gt;With &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;Open WebUI&lt;/strong&gt; as wired here, the model is &lt;strong&gt;text → text&lt;/strong&gt; only: it does &lt;strong&gt;not&lt;/strong&gt; browse the web, issue its own &lt;strong&gt;internet&lt;/strong&gt; requests, download a &lt;strong&gt;&lt;code&gt;https://github.com/...&lt;/code&gt;&lt;/strong&gt; tree, or run code in a sandbox. All it “sees” is what &lt;strong&gt;you&lt;/strong&gt; type (plus whatever context the UI forwards) and knowledge &lt;strong&gt;frozen&lt;/strong&gt; inside the &lt;strong&gt;GGUF&lt;/strong&gt; up to training cutoff.&lt;/p&gt;

&lt;p&gt;It may still answer &lt;strong&gt;very confidently&lt;/strong&gt; as if it had tools—for example claiming it &lt;strong&gt;“can analyze a public repo if you share the link”&lt;/strong&gt; or outlining how it will &lt;strong&gt;“read”&lt;/strong&gt; a remote &lt;code&gt;README&lt;/code&gt;. In this stack &lt;strong&gt;that is false&lt;/strong&gt; if you only paste a URL: the backend &lt;strong&gt;never fetches&lt;/strong&gt; HTML or the repo; Gemma (or any local GGUF) &lt;strong&gt;hallucinates&lt;/strong&gt; or repeats patterns from training. Real analysis needs &lt;strong&gt;you to paste files&lt;/strong&gt; / diffs, or &lt;strong&gt;separate&lt;/strong&gt; plumbing (RAG, &lt;strong&gt;Open WebUI&lt;/strong&gt; functions, agents, APIs) that this guide does &lt;strong&gt;not&lt;/strong&gt; set up.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;“Thought for …”&lt;/strong&gt; / reasoning block (§7, §10) does &lt;strong&gt;not&lt;/strong&gt; verify anything online—it only extends generation and can read like a &lt;strong&gt;super-capable assistant&lt;/strong&gt;; double-check claims about repos, “current” versions, or anything that depends on &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same stack, different tone:&lt;/strong&gt; ask bluntly &lt;em&gt;can you browse the Internet for new info?&lt;/em&gt; and Gemma may &lt;strong&gt;plainly refuse&lt;/strong&gt;—no live search, only training data plus whatever &lt;strong&gt;you&lt;/strong&gt; paste. That does &lt;strong&gt;not&lt;/strong&gt; undo the GitHub-URL problem above: the model &lt;strong&gt;shifts persona&lt;/strong&gt; with prompt framing (literal capability question vs. “please review this repo”). &lt;strong&gt;Ground truth&lt;/strong&gt; is unchanged: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; still issues no HTTP&lt;/strong&gt; on its own until you wire tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" alt="Open WebUI (English): *Can you browse the Internet…?* — honest “no live web” reply; same stack, still no automatic fetch." width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo (the joke writes itself):&lt;/strong&gt; the assistant just told you to &lt;em&gt;“send the link”&lt;/em&gt;, so you reply &lt;em&gt;analyze &lt;code&gt;https://github.com/…/pgwd&lt;/code&gt; and tell me what to improve&lt;/em&gt;. The &lt;strong&gt;same&lt;/strong&gt; request in &lt;strong&gt;Spanish&lt;/strong&gt; (or any other language you type in the UI) behaves identically: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; does not switch behavior by chat language&lt;/strong&gt;. Open WebUI shows &lt;strong&gt;Thinking…&lt;/strong&gt; and Gemma looks busy, but &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; never fetched that repo&lt;/strong&gt;: it only sees the &lt;strong&gt;message string&lt;/strong&gt;. The answer may sound technical yet be &lt;strong&gt;untethered from the real tree&lt;/strong&gt;; paste files, run &lt;strong&gt;git&lt;/strong&gt; yourself, or wire up tools if you want a grounded review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" alt="Open WebUI: after “analyze this GitHub repo…”, the model shows Thinking… — no URL fetch in this stack." width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same experiment, a minute later:&lt;/strong&gt; the model may return &lt;strong&gt;Thought for ~45–60s&lt;/strong&gt; and a long “review” that &lt;strong&gt;reads like a real audit&lt;/strong&gt;. The screenshot below is &lt;strong&gt;English&lt;/strong&gt; (&lt;em&gt;analyze in details…&lt;/em&gt;): it leans into &lt;strong&gt;Flask&lt;/strong&gt; and &lt;strong&gt;Blueprints&lt;/strong&gt;; in &lt;strong&gt;another&lt;/strong&gt; chat the same Gemma might rattle off &lt;strong&gt;Go&lt;/strong&gt; &lt;code&gt;cmd/&lt;/code&gt;/&lt;code&gt;internal/&lt;/code&gt;—still with &lt;strong&gt;no&lt;/strong&gt; tree read. That is template + guesswork, not repository access: some bullets may match the name (&lt;em&gt;pgwd&lt;/em&gt;, “dashboard”, …), some may be &lt;strong&gt;wrong&lt;/strong&gt;; &lt;strong&gt;length&lt;/strong&gt; and &lt;strong&gt;“thought”&lt;/strong&gt; time are not a substitute for &lt;strong&gt;cloning&lt;/strong&gt; and diffing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" alt="Open WebUI (English example): detailed reply after a bare GitHub URL with no fetch — “Thought for …” plus persuasive text; verify against real code." width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed
&lt;/h3&gt;

&lt;p&gt;This almost never means “the &lt;code&gt;.gguf&lt;/code&gt; is missing on disk”; it means &lt;strong&gt;Open WebUI is not getting &lt;code&gt;/v1/models&lt;/code&gt;&lt;/strong&gt; from the backend you configured. Walk through in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; must be running&lt;/strong&gt; on the same host as Docker (§8 manual or §9 &lt;code&gt;systemd&lt;/code&gt;). Nothing listening on &lt;strong&gt;8080&lt;/strong&gt; → empty list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On the host&lt;/strong&gt; (mini PC shell), hit the API:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://127.0.0.1:8080/v1/models | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see JSON (&lt;code&gt;data&lt;/code&gt;, at least one &lt;code&gt;id&lt;/code&gt;). &lt;strong&gt;Connection refused&lt;/strong&gt; → start or fix &lt;code&gt;llama-server&lt;/code&gt;. If it bound only to an unexpected interface, add &lt;strong&gt;&lt;code&gt;--host 0.0.0.0&lt;/code&gt;&lt;/strong&gt; to &lt;code&gt;ExecStart&lt;/code&gt;: binding &lt;code&gt;127.0.0.1&lt;/code&gt; alone is not enough when LAN clients or the Docker bridge need to reach 8080, so &lt;code&gt;0.0.0.0&lt;/code&gt; is the usual choice for Docker→host.&lt;/p&gt;
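&lt;p&gt;To know which label to expect in the picker, pull the &lt;code&gt;id&lt;/code&gt; fields out of that JSON. A small sketch (the sample JSON stands in for real &lt;code&gt;curl&lt;/code&gt; output; pipe the live response through the same one-liner):&lt;/p&gt;

```shell
# Sample of what /v1/models returns; in practice use: curl -sS http://127.0.0.1:8080/v1/models
json='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'

# Print every model id -- this is the label Open WebUI should list in its dropdown.
printf '%s' "$json" \
  | python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'
```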

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;From the Open WebUI container&lt;/strong&gt;, the host port must be reachable:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;open-webui sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'wget -qO- http://host.docker.internal:8080/v1/models 2&amp;gt;/dev/null || curl -sS http://host.docker.internal:8080/v1/models'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this fails but step 2 works, you are missing &lt;strong&gt;&lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;&lt;/strong&gt; in &lt;code&gt;docker run&lt;/code&gt; (§10), or a firewall blocks Docker bridge → host (&lt;code&gt;ufw&lt;/code&gt; may need a rule; many setups allow it by default).&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI wiring:&lt;/strong&gt; &lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or &lt;strong&gt;Admin&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt;, depending on version), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; required&lt;/strong&gt;). Save a dummy API key and &lt;strong&gt;verify&lt;/strong&gt; if offered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not mix with Ollama:&lt;/strong&gt; putting the &lt;code&gt;llama-server&lt;/code&gt; URL only under &lt;strong&gt;Ollama&lt;/strong&gt;, or using port 8080 &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;/v1&lt;/code&gt;, can leave the dropdown empty. See the table below.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After fixing, &lt;strong&gt;hard-refresh&lt;/strong&gt; the UI. The model label may match the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; name&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;default&lt;/code&gt;&lt;/strong&gt;, or whatever &lt;code&gt;id&lt;/code&gt; appears in the JSON from step 2.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
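&lt;p&gt;The &lt;code&gt;/v1&lt;/code&gt; suffix on the base URL is the single most common omission. A trivial sanity check before pasting it into the UI:&lt;/p&gt;

```shell
# The base URL you are about to save under Settings -> Connections -> OpenAI:
base_url="http://host.docker.internal:8080/v1"

# It must end in /v1, or the model list comes back empty.
case "$base_url" in
  */v1) echo "ok: OpenAI-style base URL" ;;
  *)    echo "missing /v1 suffix" ;;
esac
```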

&lt;h3&gt;
  
  
  “Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;Settings → Models → Manage Models&lt;/strong&gt; shows the &lt;strong&gt;Ollama&lt;/strong&gt; service with URL &lt;code&gt;http://host.docker.internal:8080&lt;/code&gt; (and nothing else), you often get &lt;strong&gt;Failed to fetch models&lt;/strong&gt;. That usually means &lt;strong&gt;two different backends are mixed up&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you run&lt;/th&gt;
&lt;th&gt;Typical port&lt;/th&gt;
&lt;th&gt;Where to configure it in Open WebUI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;llama-server&lt;/strong&gt; (this guide)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;8080&lt;/strong&gt;, OpenAI-style API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or equivalent), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (the &lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; suffix is required&lt;/strong&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (only if installed separately)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;11434&lt;/strong&gt;, Ollama API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; connection / model management, typically &lt;code&gt;http://host.docker.internal:11434&lt;/code&gt; (only if Ollama listens on the host and the container can reach it).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; is &lt;strong&gt;not&lt;/strong&gt; Ollama. If you put the llama-server URL in the &lt;strong&gt;Ollama&lt;/strong&gt; field, the UI uses the wrong protocol and fails even when port 8080 is open.&lt;/p&gt;
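&lt;p&gt;The protocol difference is easy to confirm from inside the container. A sketch, assuming the default ports from the table above (each helper only answers if its backend is actually running):&lt;/p&gt;

```shell
# Probe helpers for the two different protocols; run them from the open-webui container.
probe_llama_server() {  # OpenAI-style API served by llama-server on 8080
  curl -sS "http://host.docker.internal:8080/v1/models"
}
probe_ollama() {        # Ollama's own API on 11434 (only if Ollama is installed)
  curl -sS "http://host.docker.internal:11434/api/tags"
}
```

&lt;p&gt;If the first probe returns JSON and the second refuses the connection, you are running only &lt;code&gt;llama-server&lt;/code&gt;, and the &lt;strong&gt;OpenAI&lt;/strong&gt; connection is the one to configure.&lt;/p&gt;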

&lt;p&gt;&lt;strong&gt;If you only use llama-server:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep &lt;strong&gt;Connections → OpenAI&lt;/strong&gt; exactly as above (&lt;code&gt;…8080/v1&lt;/code&gt;, dummy key, verify).&lt;/li&gt;
&lt;li&gt;If you do not run Ollama, &lt;strong&gt;clear or disable&lt;/strong&gt; the Ollama URL (do not point it at 8080).&lt;/li&gt;
&lt;li&gt;Return to &lt;strong&gt;Models&lt;/strong&gt; or chat: available models follow whatever &lt;code&gt;llama-server&lt;/code&gt; loaded with &lt;code&gt;-m&lt;/code&gt; (§8–§9).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;host.docker.internal&lt;/code&gt; does not resolve&lt;/strong&gt; inside the container, confirm your &lt;code&gt;docker run&lt;/code&gt; includes &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt; (§10). On Linux that hostname is not defined by default without it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" alt="Illustration: conceptual flow for upgrading the UI (image pull, recreate container, persistent volume)" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating Open WebUI (Docker)
&lt;/h3&gt;

&lt;p&gt;The UI often shows a banner like &lt;em&gt;“A new version (v0.x.y) is now available…”&lt;/em&gt; when a newer image exists. Your &lt;strong&gt;chats and settings&lt;/strong&gt; live in the &lt;strong&gt;&lt;code&gt;open-webui&lt;/code&gt; named volume&lt;/strong&gt;; they are kept when you recreate the container as long as you mount the same &lt;code&gt;-v open-webui:/app/backend/data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" alt="Updating Open WebUI" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt; the updated image (same tag you used at install; this guide uses &lt;code&gt;main&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Stop and remove&lt;/strong&gt; only the container (the volume stays intact):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop open-webui
docker &lt;span class="nb"&gt;rm &lt;/span&gt;open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the &lt;strong&gt;same&lt;/strong&gt; &lt;code&gt;docker run&lt;/code&gt; block from §10 again (same &lt;code&gt;-p 3000:8080&lt;/code&gt;, &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;, &lt;code&gt;-v open-webui:…&lt;/code&gt;, container name &lt;code&gt;open-webui&lt;/code&gt;, etc.). The new container starts from the image you just pulled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you originally used a &lt;strong&gt;different tag&lt;/strong&gt; (e.g. &lt;code&gt;v0.8.12&lt;/code&gt; or a &lt;code&gt;cuda&lt;/code&gt; variant) instead of &lt;code&gt;main&lt;/code&gt;, substitute that tag in both &lt;code&gt;docker pull&lt;/code&gt; and &lt;code&gt;docker run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt; updating the UI does &lt;strong&gt;not&lt;/strong&gt; update &lt;code&gt;llama-server&lt;/code&gt; or your GGUF weights; the engine is still §6–§9. If you do not want to track &lt;code&gt;main&lt;/code&gt;, pin an explicit image tag in &lt;code&gt;docker run&lt;/code&gt; and repeat this flow when you choose to upgrade.&lt;/p&gt;
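&lt;p&gt;The pull/remove/recreate steps collapse naturally into one helper you can keep on the host. A sketch assuming the flags used in this guide (port 3000, the &lt;code&gt;open-webui&lt;/code&gt; named volume, the &lt;code&gt;host-gateway&lt;/code&gt; mapping); adjust it to match your actual §10 command and tag:&lt;/p&gt;

```shell
# Update helper: pull, recreate, keep the data volume. Adjust IMAGE if you pinned a tag.
IMAGE="ghcr.io/open-webui/open-webui:main"

update_open_webui() {
  docker pull "$IMAGE" || return 1                 # 1. fetch the newer image
  docker stop open-webui && docker rm open-webui   # 2. drop only the container
  docker run -d --name open-webui \
    -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    "$IMAGE"                                       # 3. same flags, new image, same volume
}
```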

&lt;h3&gt;
  
  
  If you also run Ollama
&lt;/h3&gt;

&lt;p&gt;A default endpoint may appear on port &lt;strong&gt;11434&lt;/strong&gt;. To keep using &lt;strong&gt;your&lt;/strong&gt; Vulkan llama-server with the same &lt;code&gt;-ngl&lt;/code&gt;/RAM behavior, prioritize the OpenAI entry pointing at &lt;code&gt;:8080/v1&lt;/code&gt; and do not rely on Ollama for that backend.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Same API surface as Open WebUI: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible endpoint&lt;/strong&gt; at &lt;code&gt;http://HOST:8080/v1&lt;/code&gt; (keep §8 or §9 running). Use the mini PC’s IP instead of &lt;code&gt;127.0.0.1&lt;/code&gt; when you work from another machine on the LAN (and open port 8080 in the firewall if needed).&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenCode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opencode.ai/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; can use &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; providers through &lt;code&gt;@ai-sdk/openai-compatible&lt;/code&gt;. The official docs include a &lt;strong&gt;llama.cpp / llama-server&lt;/strong&gt; example: &lt;a href="https://opencode.ai/docs/providers/" rel="noopener noreferrer"&gt;Providers — llama.cpp&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm &lt;code&gt;llama-server&lt;/code&gt; answers (e.g. &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Create or edit &lt;strong&gt;&lt;code&gt;opencode.json&lt;/code&gt;&lt;/strong&gt; for your project or OpenCode’s config path (&lt;code&gt;$schema&lt;/code&gt;: &lt;code&gt;https://opencode.ai/config.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Add a provider with &lt;code&gt;"npm": "@ai-sdk/openai-compatible"&lt;/code&gt; and &lt;code&gt;"options.baseURL": "http://127.0.0.1:8080/v1"&lt;/code&gt; (or the remote IP).&lt;/li&gt;
&lt;li&gt;Under &lt;code&gt;provider.&amp;lt;id&amp;gt;.models&lt;/code&gt;, add keys that match what the API expects. If unsure, read the &lt;code&gt;id&lt;/code&gt; field from &lt;code&gt;/v1/models&lt;/code&gt;; it is often the &lt;code&gt;.gguf&lt;/code&gt; filename or &lt;code&gt;default&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In OpenCode, use &lt;code&gt;/models&lt;/code&gt; to pick &lt;code&gt;provider_id/model_id&lt;/code&gt;, or set &lt;code&gt;"model": "provider_id/model_id"&lt;/code&gt; in the JSON.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Minimal example (adjust IDs to your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-server (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8080/v1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local model (default)"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-local/default"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenCode cannot see the model, align &lt;code&gt;models&lt;/code&gt; keys with &lt;code&gt;/v1/models&lt;/code&gt;. Tools and heavy agentic flows &lt;strong&gt;depend on the GGUF&lt;/strong&gt;; a general chat model may underperform on coding-agent tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Studio Code
&lt;/h3&gt;

&lt;p&gt;VS Code does not talk to your server by itself; you need an &lt;strong&gt;extension&lt;/strong&gt; that supports a custom OpenAI-style endpoint.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common picks: &lt;strong&gt;&lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;Continue&lt;/a&gt;&lt;/strong&gt; and others advertising &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt; or “local LLM”. You typically set &lt;strong&gt;Base URL&lt;/strong&gt; to &lt;code&gt;http://127.0.0.1:8080/v1&lt;/code&gt; (or the server IP) and &lt;strong&gt;API key&lt;/strong&gt; to any placeholder (e.g. &lt;code&gt;sk-local&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot&lt;/strong&gt; (in VS Code or Visual Studio) does not route through your &lt;code&gt;llama-server&lt;/code&gt;; it is a separate cloud service.&lt;/li&gt;
&lt;li&gt;From another PC, use the host IP where &lt;code&gt;llama-server&lt;/code&gt; runs—not &lt;code&gt;host.docker.internal&lt;/code&gt; (that name is for containers such as Open WebUI).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local models behind these extensions usually trail cloud models on tool use and very large contexts. Start on the same machine you already validated with &lt;code&gt;llama-cli&lt;/code&gt; or Open WebUI.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04
&lt;/h2&gt;

&lt;p&gt;Typical CMake symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Could NOT find Vulkan (missing: ... glslc)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Vulkan found but &lt;code&gt;glslc&lt;/code&gt; still missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suggested order (simplest first):&lt;/p&gt;

&lt;h3&gt;
  
  
  12.1 Universe repository and packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository universe
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libvulkan-dev vulkan-tools shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; glslc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; glslc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and reconfigure the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  12.2 LunarG repository (Vulkan SDK)
&lt;/h3&gt;

&lt;p&gt;If your Ubuntu mirror does not offer &lt;code&gt;shaderc&lt;/code&gt; or &lt;code&gt;glslc&lt;/code&gt; is still missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; https://packages.lunarg.com/lunarg-signing-key-pub.asc &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/trusted.gpg.d/lunarg.asc
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget &lt;span class="nt"&gt;-qO&lt;/span&gt; /etc/apt/sources.list.d/lunarg-vulkan-noble.list &lt;span class="se"&gt;\&lt;/span&gt;
  https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;rm -rf build&lt;/code&gt; and run &lt;code&gt;cmake&lt;/code&gt; again.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;dpkg&lt;/code&gt; complains about overwriting files between packages, as a last resort you can force-remove the blocking package, then repair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;--remove&lt;/span&gt; &lt;span class="nt"&gt;--force-depends&lt;/span&gt; libshaderc-dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nt"&gt;--fix-broken&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only do this if you understand that mixing repositories can leave dependencies in a messy state; sticking to &lt;strong&gt;either&lt;/strong&gt; LunarG &lt;strong&gt;or&lt;/strong&gt; Ubuntu for the Shaderc dev packages is often enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;snap &lt;span class="nb"&gt;install &lt;/span&gt;google-shaderc
&lt;span class="nb"&gt;sudo ln&lt;/span&gt; &lt;span class="nt"&gt;-sf&lt;/span&gt; /snap/bin/glslc /usr/local/bin/glslc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check &lt;code&gt;glslc --version&lt;/code&gt; again and retry CMake.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Performance and models (rough guide)
&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;lots of RAM&lt;/strong&gt; but a &lt;strong&gt;modest iGPU&lt;/strong&gt;, tokens/s is capped by the unified VRAM and by how many layers &lt;code&gt;-ngl&lt;/code&gt; offloads; larger models can still run by spilling into system RAM, just more slowly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B A4B (e.g. Q4_K_M ~17 GiB)&lt;/td&gt;
&lt;td&gt;Good balance with high RAM; needs an up-to-date llama.cpp.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same family Q8_0 (~27 GiB)&lt;/td&gt;
&lt;td&gt;Better quality; more pressure on RAM/unified VRAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixtral 8×7B, 70B, others&lt;/td&gt;
&lt;td&gt;Feasible mainly thanks to RAM; slower.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use a lower quantization (e.g. Q4_K_M) if you prioritize &lt;strong&gt;speed&lt;/strong&gt; over &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For hard numbers &lt;strong&gt;on your&lt;/strong&gt; box, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7): it is the most direct way to compare &lt;code&gt;-ngl&lt;/code&gt; and quantizations without the web UI in the way.&lt;/p&gt;
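&lt;p&gt;A convenient way to run that comparison is a tiny wrapper. A sketch (the model path in the usage line is an example; &lt;code&gt;llama-bench&lt;/code&gt; accepts comma-separated value lists, so one run covers several &lt;code&gt;-ngl&lt;/code&gt; settings):&lt;/p&gt;

```shell
# Compare offload settings in one run; llama-bench prints a t/s row per -ngl value.
bench_ngl() {
  model="$1"
  ngl_values="${2:-0,16,32,99}"   # example sweep; tune to your iGPU
  ~/llama.cpp/build/bin/llama-bench -m "$model" -ngl "$ngl_values" -p 512 -n 128
}
# usage: bench_ngl ~/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf 0,32,99
```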

&lt;h3&gt;
  
  
  &lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt; shows &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;low CPU&lt;/strong&gt; across cores and only a &lt;strong&gt;few GiB&lt;/strong&gt; of &lt;strong&gt;RES&lt;/strong&gt;, that is often &lt;strong&gt;expected&lt;/strong&gt; when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; leaves much of the model on the &lt;strong&gt;iGPU&lt;/strong&gt; — heavy matmul runs on the graphics core; the &lt;strong&gt;CPU&lt;/strong&gt; orchestrates and shuffles data, so you may &lt;strong&gt;not&lt;/strong&gt; see all cores pegged at 100%.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;GGUF is small&lt;/strong&gt; (e.g. 7B/8B &lt;strong&gt;Q4&lt;/strong&gt;) — small &lt;strong&gt;resident&lt;/strong&gt; RAM footprint; a &lt;strong&gt;26B&lt;/strong&gt; run would show much more &lt;strong&gt;RES&lt;/strong&gt; if most weights live in system memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bursts&lt;/strong&gt; happen while scoring the prompt and &lt;strong&gt;generating&lt;/strong&gt; tokens; between turns or while you read output, usage &lt;strong&gt;drops&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;With &lt;strong&gt;unified memory (UMA)&lt;/strong&gt;, some model cost may &lt;strong&gt;not&lt;/strong&gt; show up as a huge process RSS: the &lt;strong&gt;GPU&lt;/strong&gt; also holds part of the working set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; assume nothing is working just because &lt;code&gt;htop&lt;/code&gt; stays calm: check &lt;strong&gt;t/s&lt;/strong&gt; in &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7), or a &lt;strong&gt;GPU&lt;/strong&gt; monitor if you want to see graphics load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference screenshot&lt;/strong&gt; (same class of mini PC as the validated hardware; &lt;strong&gt;SSH&lt;/strong&gt; + &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt;: &lt;code&gt;llama.cpp&lt;/code&gt; around &lt;strong&gt;~5 GiB RES&lt;/strong&gt; and &lt;strong&gt;moderate&lt;/strong&gt; CPU on one core, consistent with a &lt;strong&gt;moderately sized&lt;/strong&gt; model and &lt;strong&gt;GPU&lt;/strong&gt;-bound &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" alt="htop during inference: llama.cpp with moderate CPU and RAM (Vulkan / -ngl)." width="800" height="939"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Many snippets use &lt;strong&gt;&lt;code&gt;/sys/kernel/debug/dri/0/amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt;. On Ryzen mini PCs with &lt;strong&gt;amdgpu&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;dri/0&lt;/code&gt; often does not exist&lt;/strong&gt;: the kernel exposes the GPU under a &lt;strong&gt;PCI BDF&lt;/strong&gt; directory (&lt;code&gt;0000:c4:00.0&lt;/code&gt;, …) and provides &lt;strong&gt;symlinks&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;dri/1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;dri/128&lt;/code&gt;&lt;/strong&gt; into the same tree. If &lt;code&gt;cat&lt;/code&gt; returns &lt;em&gt;No such file or directory&lt;/em&gt;, inspect first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount | &lt;span class="nb"&gt;grep &lt;/span&gt;debugfs   &lt;span class="c"&gt;# expect debugfs on /sys/kernel/debug&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /sys/kernel/debug/dri/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
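
&lt;p&gt;If you are unsure which &lt;code&gt;dri/N&lt;/code&gt; belongs to &lt;strong&gt;amdgpu&lt;/strong&gt;, &lt;code&gt;/sys/class/drm&lt;/code&gt; exposes the card-to-driver mapping &lt;strong&gt;without root&lt;/strong&gt; (unlike debugfs). A minimal sketch; the demo builds a throwaway tree so it runs anywhere, and &lt;code&gt;card1&lt;/code&gt; is an assumption, not your guaranteed node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root=$(mktemp -d)                     # demo stand-in for /sys/class/drm
mkdir -p "$root/card1/device" "$root/drivers/amdgpu"
ln -s "$root/drivers/amdgpu" "$root/card1/device/driver"
for c in "$root"/card[0-9]*; do       # real use: for c in /sys/class/drm/card[0-9]*
  echo "$(basename "$c") -&gt; $(basename "$(readlink -f "$c/device/driver")")"
done                                  # prints: card1 -&gt; amdgpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On real hardware the loop over &lt;code&gt;/sys/class/drm&lt;/code&gt; prints one line per card; the one ending in &lt;code&gt;amdgpu&lt;/code&gt; is the node whose number to use under &lt;code&gt;debug/dri/&lt;/code&gt;.&lt;/p&gt;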



&lt;p&gt;Then read &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt; using the &lt;strong&gt;&lt;code&gt;N&lt;/code&gt;&lt;/strong&gt; or PCI path that belongs to your AMDGPU (&lt;strong&gt;&lt;code&gt;1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;0000:…:….0&lt;/code&gt;&lt;/strong&gt; usually works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /sys/kernel/debug/dri/1/amdgpu_pm_info
&lt;span class="c"&gt;# same content if 1 → 0000:c4:00.0:&lt;/span&gt;
&lt;span class="c"&gt;# sudo cat /sys/kernel/debug/dri/0000:c4:00.0/amdgpu_pm_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the directory exists but &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt; is missing&lt;/strong&gt;, your kernel may &lt;strong&gt;not export&lt;/strong&gt; that node; try &lt;code&gt;ls … | grep -i pm&lt;/code&gt;. That does &lt;strong&gt;not&lt;/strong&gt; mean Vulkan is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to read it (sample text, idle mini PC):&lt;/strong&gt; &lt;strong&gt;GPU Load: 0 %&lt;/strong&gt; with &lt;strong&gt;VCN powered down&lt;/strong&gt; matches &lt;strong&gt;idle&lt;/strong&gt;. While &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; runs a long &lt;strong&gt;&lt;code&gt;‑ngl&lt;/code&gt;&lt;/strong&gt; job, run &lt;code&gt;cat&lt;/code&gt; &lt;strong&gt;during&lt;/strong&gt; generation: you should usually see &lt;strong&gt;Load &amp;gt; 0 %&lt;/strong&gt; (the counter may not peg the iGPU). For a live view, &lt;strong&gt;&lt;code&gt;radeontop&lt;/code&gt;&lt;/strong&gt; is often easier (&lt;code&gt;sudo apt install -y radeontop&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GFX Clocks and Power:
    2800 MHz (MCLK)
    800 MHz (SCLK)
    ...
GPU Temperature: 36 C
GPU Load: 0 %
VCN Load: 0 %
VCN: Powered down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Illustrative excerpt; clocks, millivolts, and watts vary with BIOS, governor, and workload.)&lt;/p&gt;
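
&lt;p&gt;To grab just the load figure, e.g. while polling during generation, an &lt;code&gt;awk&lt;/code&gt; filter over that node is enough. The demo feeds it the sample text above so it runs anywhere; the &lt;code&gt;dri/1&lt;/code&gt; path in the comment is an assumption and must match whatever &lt;code&gt;ls /sys/kernel/debug/dri/&lt;/code&gt; showed on your box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;printf 'GPU Temperature: 36 C\nGPU Load: 0 %%\nVCN Load: 0 %%\n' \
  | awk -F': ' '/^GPU Load/ {print $2}'       # prints: 0 %
# real hardware, refreshed once per second:
#   watch -n1 "sudo awk -F': ' '/^GPU Load/ {print \$2}' /sys/kernel/debug/dri/1/amdgpu_pm_info"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;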




&lt;h2&gt;
  
  
  14. Remote desktop (Ubuntu 24.04 Desktop, LAN)
&lt;/h2&gt;

&lt;p&gt;When the mini PC runs &lt;strong&gt;GNOME&lt;/strong&gt; and you want the full desktop from &lt;strong&gt;another machine on the same network&lt;/strong&gt; (Windows, Mac, Linux), &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; usually ships &lt;strong&gt;RDP&lt;/strong&gt; built in; you often &lt;strong&gt;do not&lt;/strong&gt; need &lt;strong&gt;xrdp&lt;/strong&gt; unless you want different behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.1 Enable on the mini PC
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;System&lt;/strong&gt; → &lt;strong&gt;Remote Desktop&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn &lt;strong&gt;Remote Desktop&lt;/strong&gt; on.&lt;/li&gt;
&lt;li&gt;Finish the assistant (password / auth as GNOME shows).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Underlying package: &lt;strong&gt;&lt;code&gt;gnome-remote-desktop&lt;/code&gt;&lt;/strong&gt;. If the toggle is missing or fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--reinstall&lt;/span&gt; gnome-remote-desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log out or reboot and open Settings again.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.2 Connect from another machine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Native &lt;strong&gt;RDP&lt;/strong&gt; clients: &lt;strong&gt;Windows&lt;/strong&gt; (Remote Desktop Connection / &lt;code&gt;mstsc&lt;/code&gt;), &lt;strong&gt;macOS&lt;/strong&gt; (Microsoft Remote Desktop from the App Store), &lt;strong&gt;Linux&lt;/strong&gt; (e.g. Remmina with its RDP plugin).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; the Ubuntu box’s &lt;strong&gt;LAN IP&lt;/strong&gt; (&lt;code&gt;hostname -I | awk '{print $1}'&lt;/code&gt; on the mini PC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port:&lt;/strong&gt; &lt;strong&gt;3389/TCP&lt;/strong&gt; by default.&lt;/li&gt;
&lt;/ul&gt;
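
&lt;p&gt;Before reaching for a full RDP client, a quick port probe from a Linux or macOS client tells you whether &lt;strong&gt;3389&lt;/strong&gt; is even reachable. The IP below is a placeholder for your mini PC's LAN address; bash's &lt;code&gt;/dev/tcp&lt;/code&gt; redirection avoids installing netcat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;host=192.168.1.50   # placeholder: your mini PC's LAN IP
port=3389
if timeout 3 bash -c ": &lt;/dev/tcp/$host/$port" 2&gt;/dev/null; then
  echo "port $port reachable"
else
  echo "port $port not reachable"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;"Not reachable" here means connect refused or timed out, which points at the firewall, AP isolation, or Remote Desktop being off, before you debug the RDP client itself.&lt;/p&gt;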

&lt;h3&gt;
  
  
  14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;ufw&lt;/code&gt; is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 3389/tcp comment &lt;span class="s1"&gt;'GNOME RDP'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14.4 If connection fails
&lt;/h3&gt;

&lt;p&gt;On the &lt;strong&gt;Ubuntu host&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ss &lt;span class="nt"&gt;-tlnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;3389 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Remote Desktop enabled, something should listen on &lt;strong&gt;3389&lt;/strong&gt;. Confirm the client is on the &lt;strong&gt;same LAN&lt;/strong&gt; and that no AP isolation blocks client-to-client Wi‑Fi.&lt;/p&gt;
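
&lt;p&gt;The &lt;code&gt;|| true&lt;/code&gt; above keeps the pipeline from failing under &lt;code&gt;set -e&lt;/code&gt;, but it prints no verdict either way. A variant with an explicit yes/no, demoed here against one captured &lt;code&gt;ss -tln&lt;/code&gt; line (swap the &lt;code&gt;printf&lt;/code&gt; for the real &lt;code&gt;ss -tln&lt;/code&gt; on the host):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# demo input is one `ss -tln` line; on the host: ss -tln | awk '...'
printf 'LISTEN 0 10 0.0.0.0:3389 0.0.0.0:*\n' \
  | awk '$4 ~ /:3389$/ {found=1} END {print (found ? "RDP listener up" : "no listener on 3389")}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;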

&lt;p&gt;If GNOME/RDP misbehaves on &lt;strong&gt;Wayland&lt;/strong&gt;, try the &lt;strong&gt;Ubuntu on Xorg&lt;/strong&gt; session on the login screen and enable Remote Desktop again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; exposing RDP to the &lt;strong&gt;public Internet&lt;/strong&gt; without VPN/tunnel is a bad idea; keep it on a &lt;strong&gt;trusted LAN&lt;/strong&gt; or behind VPN/WireGuard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] BIOS: UMA / VRAM for iGPU adjusted if applicable.&lt;/li&gt;
&lt;li&gt;[ ] Vulkan OK: on desktop &lt;code&gt;vkcube&lt;/code&gt;; on &lt;strong&gt;Ubuntu Server&lt;/strong&gt; &lt;code&gt;vulkaninfo --summary&lt;/code&gt; shows the GPU.&lt;/li&gt;
&lt;li&gt;[ ] User is in &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;id -nG&lt;/code&gt;); if you ran &lt;code&gt;usermod&lt;/code&gt;, you &lt;strong&gt;logged out or rebooted&lt;/strong&gt; (an old shell session does not pick up new groups).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;cmake -B build -DGGML_VULKAN=1&lt;/code&gt; succeeds; build reaches 100 %.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;git pull&lt;/code&gt;, rebuild §6) and follow &lt;strong&gt;try model → systemd → Open WebUI&lt;/strong&gt; when experimenting with new GGUFs (§7, &lt;em&gt;Experimenting…&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-cli&lt;/code&gt; shows the Vulkan device when loading the model.&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-server&lt;/code&gt; responds on &lt;code&gt;:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] Open WebUI on &lt;code&gt;:3000&lt;/code&gt; with &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt; and &lt;strong&gt;Direct connections&lt;/strong&gt; off.&lt;/li&gt;
&lt;li&gt;[ ] You know the model does &lt;strong&gt;not&lt;/strong&gt; browse or read GitHub from a URL alone; it may &lt;strong&gt;hallucinate&lt;/strong&gt; capabilities (§10 &lt;em&gt;No browsing or GitHub fetch&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to upgrade Open WebUI&lt;/strong&gt;: &lt;code&gt;docker pull&lt;/code&gt;, &lt;code&gt;stop&lt;/code&gt;/&lt;code&gt;rm&lt;/code&gt; the container, rerun the same &lt;code&gt;docker run&lt;/code&gt; with the &lt;code&gt;open-webui&lt;/code&gt; volume (§10).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;systemd&lt;/code&gt; service enabled if you want a persistent boot setup.&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to switch models&lt;/strong&gt;: after adding another &lt;code&gt;.gguf&lt;/code&gt;, you update &lt;code&gt;-m&lt;/code&gt; in &lt;code&gt;llama-web.service&lt;/code&gt; (or in the manual command), run &lt;code&gt;sudo systemctl daemon-reload &amp;amp;&amp;amp; sudo systemctl restart llama-web.service&lt;/code&gt;, and reload Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;list&lt;/strong&gt; your &lt;code&gt;.gguf&lt;/code&gt; files (&lt;code&gt;ls&lt;/code&gt; / &lt;code&gt;find&lt;/code&gt;, §7) and &lt;strong&gt;measure&lt;/strong&gt; throughput with &lt;code&gt;llama-bench&lt;/code&gt; (§7) when comparing quantizations or &lt;code&gt;-ngl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] You can follow the &lt;strong&gt;unified playbook&lt;/strong&gt; for Gemma 4 / Qwen Coder / DeepSeek Lite / Llama 3.1 (§7): download → &lt;code&gt;llama-cli&lt;/code&gt; → &lt;code&gt;systemd&lt;/code&gt; → &lt;code&gt;/v1/models&lt;/code&gt; → Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] (Optional) &lt;strong&gt;Remote desktop&lt;/strong&gt; §14: RDP enabled in Settings, &lt;strong&gt;3389&lt;/strong&gt; allowed in &lt;code&gt;ufw&lt;/code&gt; if needed, smoke tested from another PC on the LAN.&lt;/li&gt;
&lt;/ul&gt;
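
&lt;p&gt;The model-switch item can be sketched end to end. The unit name follows this guide's &lt;code&gt;llama-web.service&lt;/code&gt;; the &lt;code&gt;/models/*.gguf&lt;/code&gt; paths are placeholders, and the &lt;code&gt;sed&lt;/code&gt; runs on a temp copy so you can dry-run it safely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;unit=$(mktemp)    # dry run; the real file is /etc/systemd/system/llama-web.service
echo 'ExecStart=/opt/llama.cpp/build/bin/llama-server -m /models/old.gguf -ngl 99 --port 8080' &gt; "$unit"
sed -i 's|/models/old.gguf|/models/new.gguf|' "$unit"
grep -c '/models/new.gguf' "$unit"    # prints 1 when the swap took
# then, on the real unit:
#   sudo systemctl daemon-reload; sudo systemctl restart llama-web.service
#   curl -s http://localhost:8080/v1/models   # confirm the new model is served
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;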




&lt;h2&gt;
  
  
  Quick port reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama-server&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open WebUI&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote desktop (GNOME RDP)&lt;/td&gt;
&lt;td&gt;3389 TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (optional)&lt;/td&gt;
&lt;td&gt;11434&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Running local inference on Ubuntu with Vulkan and an AMD iGPU is not a one-click setup, but it is worth it: a model that answers &lt;strong&gt;on your LAN&lt;/strong&gt;, without routing every request through a third-party API, and with the freedom to swap GGUFs or quantizations when you need to.&lt;/p&gt;

&lt;p&gt;The stack moves fast: &lt;strong&gt;llama.cpp&lt;/strong&gt;, Ubuntu packages, and Hugging Face repos &lt;strong&gt;change&lt;/strong&gt; over time. If a command or package name no longer matches this guide, &lt;code&gt;cmake&lt;/code&gt; and &lt;code&gt;apt&lt;/code&gt; errors usually point you in the right direction; double-check the project’s current docs.&lt;/p&gt;

&lt;p&gt;Once the checklist is green, the natural next step is tuning &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, context size (&lt;code&gt;-c&lt;/code&gt;), and the model until you get the quality-vs-tokens-per-second balance you want &lt;strong&gt;on your&lt;/strong&gt; hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the mini PC&lt;/strong&gt; we used for the tests and validation in this guide: &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; (&lt;strong&gt;Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M&lt;/strong&gt;), &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, plenty of &lt;strong&gt;DDR5&lt;/strong&gt; RAM and NVMe — the same box behind the &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; runs, &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; screenshots, &lt;strong&gt;Open WebUI&lt;/strong&gt; examples, and the other reference captures. The photo is the &lt;strong&gt;actual&lt;/strong&gt; machine (powered on, front panel as shown), not a marketing render.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" alt="Minisforum UM760 Slim — the physical box used to validate this guide (Ryzen 5 7640HS, Radeon 760M, Ubuntu 24.04)." width="732" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now go tinker:&lt;/strong&gt; this walkthrough is rooted in &lt;strong&gt;Ryzen + iGPU&lt;/strong&gt;, but the playbook travels—&lt;strong&gt;mini PCs&lt;/strong&gt; (Minisforum, Beelink, &lt;strong&gt;ASUS ExpertCenter PN&lt;/strong&gt;, &lt;strong&gt;ZOTAC ZBOX&lt;/strong&gt;, modern &lt;strong&gt;Intel NUC-class&lt;/strong&gt; boxes…), &lt;strong&gt;Mac mini&lt;/strong&gt; / &lt;strong&gt;Mac Studio&lt;/strong&gt; on Apple Silicon if that is your stack, or compact power boxes like &lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; when budget and goals match. Build &lt;strong&gt;llama.cpp&lt;/strong&gt; (or your preferred runtime), stress &lt;strong&gt;GGUF&lt;/strong&gt; quantizations, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;your&lt;/strong&gt; iron, and tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; until the ceiling feels right. &lt;strong&gt;Share&lt;/strong&gt; what you learn—a &lt;strong&gt;dev.to&lt;/strong&gt; post, a blog, &lt;strong&gt;Mastodon&lt;/strong&gt;, article comments, or whatever community you use; real numbers beat brochure claims every time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One quiet takeaway:&lt;/em&gt; on &lt;strong&gt;your&lt;/strong&gt; codebases the model usually helps more as a &lt;strong&gt;copilot you feed&lt;/strong&gt;—a diff, a log slice, a trimmed README—than as an &lt;strong&gt;all-knowing reviewer&lt;/strong&gt; from a bare URL or a polished persona. When the answer feels &lt;em&gt;too&lt;/em&gt; slick without anything concrete in the prompt, the limit is rarely the mini PC: it is &lt;strong&gt;text-in, text-out&lt;/strong&gt; with nobody else reading disk for you. §10 walks the receipts; day-to-day, &lt;strong&gt;you&lt;/strong&gt; supply the ground truth.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
