<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Stormkit Community</title>
    <description>The most recent home feed on Stormkit Community.</description>
    <link>https://stormkit.forem.com</link>
    <atom:link rel="self" type="application/rss+xml" href="https://stormkit.forem.com/feed"/>
    <language>en</language>
    <item>
      <title>Navigating GitHub Actions DIND Bind Mounts: Insights from Recent GitHub Reports for CI/CD Productivity</title>
      <dc:creator>Oleg</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:00:38 +0000</pubDate>
      <link>https://stormkit.forem.com/devactivity/navigating-github-actions-dind-bind-mounts-insights-from-recent-github-reports-for-cicd-1c8</link>
      <guid>https://stormkit.forem.com/devactivity/navigating-github-actions-dind-bind-mounts-insights-from-recent-github-reports-for-cicd-1c8</guid>
      <description>&lt;h2&gt;
  
  
  The DevOps Dilemma: When Docker-in-Docker Hinders Productivity
&lt;/h2&gt;

&lt;p&gt;In the fast-paced world of software development, efficient CI/CD pipelines are the bedrock of rapid delivery and high-quality software. GitHub Actions, especially with self-hosted runners, offers immense flexibility. However, leveraging advanced features like &lt;code&gt;containerMode: dind&lt;/code&gt; (Docker-in-Docker) can sometimes introduce subtle complexities that trip up even experienced teams. Recent &lt;strong&gt;github reports&lt;/strong&gt; and community discussions frequently highlight a particular hurdle: the unexpected behavior of bind mounts when using DIND.&lt;/p&gt;

&lt;p&gt;For dev teams, product managers, and CTOs focused on optimizing tooling and delivery, understanding these nuances is critical. A seemingly minor misconfiguration can lead to frustrating build failures, wasted developer time, and ultimately, slower time-to-market. This post dives into a common DIND bind mount issue, its root cause, and the surprisingly simple solution that can restore your CI/CD pipeline's efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge: Bind Mounts and DIND Isolation
&lt;/h3&gt;

&lt;p&gt;The problem, as articulated by 'schrom' in a recent GitHub discussion, is a classic case of expectation versus reality. When using a self-hosted GitHub Actions runner with Helm (version 0.13.1 in this instance) and &lt;code&gt;containerMode: dind&lt;/code&gt;, the goal is often to run containerized tests against a newly built image. This process often requires injecting configuration files or secrets into the test containers via bind mounts.&lt;/p&gt;

&lt;p&gt;However, 'schrom' discovered that files created within the &lt;em&gt;job container&lt;/em&gt; were not accessible when attempting to mount them into containers launched by the DIND service. Instead of the file, Docker either mounted an empty directory or threw an error indicating the source path did not exist. Here is schrom's minimal example, which works perfectly locally but fails in the pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo hello &amp;gt; /tmp/secret.txt
$ docker run -it -v /tmp/secret.txt:/mnt/secret.txt alpine:3
/ # ls -al /mnt/
total 8
drwxr-xr-x 1 root root 4096 Mar 24 17:16 .
drwxr-xr-x 1 root root 4096 Mar 24 17:16 ..
drwxr-xr-x 2 root root 40 Mar 24 17:15 secret.txt
/ # ls -al /mnt/secret.txt/
total 4
drwxr-xr-x 2 root root 40 Mar 24 17:15 .
drwxr-xr-x 1 root root 4096 Mar 24 17:16 ..
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;A similar issue arose with Docker Compose, leading to an &lt;code&gt;Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /tmp/secret.txt&lt;/code&gt;. The core observation was that files were being mounted from the DIND container's filesystem, not the job container's. Creating the file inside the DIND sidecar itself made it accessible to the launched containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1bnmac0Rihwx-Hli2M-08NNHSEm8AdC7A%26sz%3Dw751" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1bnmac0Rihwx-Hli2M-08NNHSEm8AdC7A%26sz%3Dw751" alt="Diagram illustrating file isolation between GitHub Actions job container and DIND container" width="751" height="429"&gt;&lt;/a&gt;Diagram illustrating file isolation between GitHub Actions job container and DIND container### Understanding the "Why": DIND's Expected Behavior&lt;/p&gt;

&lt;p&gt;As 'andreas-agouridis' clarified in the discussion, what 'schrom' observed is not a bug but an expected behavior of the Docker-in-Docker setup. When you use &lt;code&gt;containerMode: dind&lt;/code&gt; in a self-hosted GitHub Actions runner, your main job container and the DIND sidecar container are distinct, isolated environments.&lt;/p&gt;

&lt;p&gt;Think of it this way: the Docker daemon running inside the DIND sidecar container only "sees" its own filesystem. Any bind mounts you specify in your workflow are relative to &lt;em&gt;that&lt;/em&gt; filesystem, not the filesystem of the parent job container where your workflow script is executing. Therefore, when you create &lt;code&gt;/tmp/secret.txt&lt;/code&gt; in your job container, the DIND container's Docker daemon has no knowledge of it. When it tries to fulfill a bind mount request for that path, it finds nothing, leading to either an empty directory mount or a "path does not exist" error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1XKazkToum8vLKb647p0-vE7LHdrWe_ZF%26sz%3Dw751" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1XKazkToum8vLKb647p0-vE7LHdrWe_ZF%26sz%3Dw751" alt="Shared volume enabling file access between job container and DIND container" width="751" height="429"&gt;&lt;/a&gt;Shared volume enabling file access between job container and DIND container### The Implications for Productivity and Delivery&lt;/p&gt;

&lt;p&gt;This isolation, while fundamental to Docker's security and portability, can become a significant roadblock for development teams. If your build pipeline generates dynamic configuration files, temporary secrets, or test data that needs to be mounted into containers for testing, this DIND limitation means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased Build Times:&lt;/strong&gt; Teams might resort to inefficient workarounds like copying files into the DIND container at runtime, adding overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fragile Pipelines:&lt;/strong&gt; Inconsistent behavior between local development and CI/CD environments leads to "works on my machine" syndrome and debugging headaches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Confidence:&lt;/strong&gt; If tests cannot reliably access necessary resources, the integrity of your automated testing is compromised, impacting delivery confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted Resources:&lt;/strong&gt; Failed builds consume compute resources and, more importantly, developer time that could be spent on feature development.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Elegant Solution: Leveraging Shared Volumes
&lt;/h3&gt;

&lt;p&gt;The good news, as discovered by 'schrom' with the help of the community, is that the solution is surprisingly straightforward and built right into the GitHub Actions self-hosted runner Helm chart. There &lt;em&gt;is&lt;/em&gt; an already shared volume between the runner's job container and the DIND sidecar container: it's mounted as &lt;code&gt;/home/runner/_work&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Anything placed within this directory (or its subdirectories) by the job container is automatically accessible to the DIND container. The key insight was that the default temporary directory, &lt;code&gt;$RUNNER_TEMP&lt;/code&gt;, conveniently points to &lt;code&gt;/home/runner/_work/_temp/&lt;/code&gt;. By simply directing generated files to &lt;code&gt;$RUNNER_TEMP&lt;/code&gt; instead of a hard-coded &lt;code&gt;/tmp/&lt;/code&gt;, the bind mount issue vanishes.&lt;/p&gt;
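The fix above can be sketched as a minimal job step (assumptions: this runs on a self-hosted runner where $RUNNER_TEMP resolves to /home/runner/_work/_temp/; the fallback assignment exists only so the snippet can also run outside a runner):

```shell
# Write generated files into the shared work area instead of /tmp.
# $RUNNER_TEMP lives under /home/runner/_work/, which the DIND sidecar
# also mounts, so its Docker daemon can resolve the bind-mount source.
: "${RUNNER_TEMP:=$(mktemp -d)}"   # fallback when run outside a runner

echo hello > "$RUNNER_TEMP/secret.txt"
cat "$RUNNER_TEMP/secret.txt"

# In a workflow step, the mount now behaves as it does locally:
#   docker run --rm -v "$RUNNER_TEMP/secret.txt:/mnt/secret.txt" alpine:3 \
#     cat /mnt/secret.txt
```

The only change from the failing example is the source path: anything under the shared work directory is visible to both the job container and the DIND daemon.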

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1nqGQfXejcUKmkH-pAQGy5hfqAFG3m8KG%26sz%3Dw751" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdrive.google.com%2Fthumbnail%3Fid%3D1nqGQfXejcUKmkH-pAQGy5hfqAFG3m8KG%26sz%3Dw751" alt="GitHub Actions runner showing the /home/runner/_work directory as a central shared workspace" width="751" height="429"&gt;&lt;/a&gt;GitHub Actions runner showing the /home/runner/_work directory as a central shared workspace### Best Practices for Robust DIND Integrations&lt;/p&gt;

&lt;p&gt;This experience underscores several critical lessons for technical leadership and engineering teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand Your Environment:&lt;/strong&gt; Don't assume CI/CD environments behave identically to local setups. Invest time in understanding the underlying architecture, especially for complex features like Docker-in-Docker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Documented Paths:&lt;/strong&gt; Always prefer environment variables like &lt;code&gt;$RUNNER_TEMP&lt;/code&gt; for temporary files over hard-coded paths. These are designed to ensure compatibility and leverage shared resources effectively. This directly contributes to better &lt;strong&gt;git statistics&lt;/strong&gt; by reducing build failures caused by environmental discrepancies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilize Shared Volumes:&lt;/strong&gt; For persistent data or files that need to be shared across containers within a DIND setup, explicitly use shared volumes. The &lt;code&gt;/home/runner/_work&lt;/code&gt; directory is your friend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consult the Docs (RTFM!):&lt;/strong&gt; As 'schrom' humorously concluded, "RTFM and do as told." The documentation for GitHub Actions runners and Helm charts often contains these crucial details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Engagement:&lt;/strong&gt; Don't hesitate to engage with the community. Discussions like the one highlighted in these &lt;strong&gt;github reports&lt;/strong&gt; are invaluable for collective problem-solving and knowledge sharing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion: Building Resilient CI/CD for Peak Performance
&lt;/h3&gt;

&lt;p&gt;While the DIND bind mount issue might seem like a minor technicality, its resolution has significant implications for CI/CD productivity and delivery. By understanding the isolation mechanisms of Docker-in-Docker and leveraging the built-in shared volumes, teams can build more robust, reliable, and efficient pipelines. This directly supports common &lt;strong&gt;okr examples for software engineers&lt;/strong&gt; focused on CI/CD efficiency, faster feedback loops, and reduced operational overhead.&lt;/p&gt;

&lt;p&gt;For dev teams, product managers, and CTOs, ensuring your tooling works seamlessly is paramount. This insight from recent &lt;strong&gt;github reports&lt;/strong&gt; helps demystify a common DIND challenge, allowing you to focus on what matters most: delivering exceptional software.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>dind</category>
      <category>selfhostedrunners</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Rust Async Secrets That Cut API Latency in Half</title>
      <dc:creator>speed engineer</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://stormkit.forem.com/speed_engineer/rust-async-secrets-that-cut-api-latency-in-half-2g3l</link>
      <guid>https://stormkit.forem.com/speed_engineer/rust-async-secrets-that-cut-api-latency-in-half-2g3l</guid>
      <description>&lt;p&gt;The hidden runtime configuration that transforms your APIs from sluggish to lightning-fast, backed by production data from high-throughput… &lt;/p&gt;




&lt;h3&gt;
  
  
  Rust Async Secrets That Cut API Latency in Half
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The hidden runtime configuration that transforms your APIs from sluggish to lightning-fast, backed by production data from high-throughput systems
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56uj8euuer8dbxhykjm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56uj8euuer8dbxhykjm1.png" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most developers treat async Rust like magic — spawn some tasks, add &lt;code&gt;.await&lt;/code&gt;, and hope for the best. But after profiling hundreds of production APIs, I discovered that &lt;strong&gt;90% of async Rust applications leave massive performance on the table&lt;/strong&gt; due to three critical misconceptions about how the runtime actually works.&lt;/p&gt;

&lt;p&gt;The data is shocking: properly configured async Rust applications consistently achieve &lt;strong&gt;50–70% lower P99 latencies&lt;/strong&gt; compared to their naive counterparts, often with zero code changes. Here’s how the best-performing systems do it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: When “Fast” Async Becomes Surprisingly Slow
&lt;/h3&gt;

&lt;p&gt;Picture this: You’ve built a beautiful REST API in Rust using Tokio. Your load tests show impressive throughput numbers. Everything looks great until you check your P95 and P99 latency metrics — and they’re absolutely terrible.&lt;/p&gt;

&lt;p&gt;This exact scenario played out at a fintech startup I worked with. Their Rust API was handling 50,000 requests per second with a median latency of just 2ms. Impressive, right? But their P99 latency was hitting &lt;strong&gt;850ms&lt;/strong&gt; — completely unacceptable for financial transactions.&lt;/p&gt;

&lt;p&gt;The smoking gun came from detailed profiling: &lt;strong&gt;their async tasks were starving each other&lt;/strong&gt;. Despite having 16 CPU cores, tasks were spending up to 800ms waiting in the scheduler queue because a few compute-heavy operations were monopolizing the runtime threads.&lt;/p&gt;

&lt;p&gt;This isn’t an edge case. Production data from multiple high-traffic Rust services reveals three patterns that consistently destroy latency:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Runtime thread starvation:&lt;/strong&gt; 73% of high-latency requests traced back to scheduler queue buildup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inefficient task yielding:&lt;/strong&gt; CPU-bound work blocking the async runtime for 100ms+ stretches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor connection pooling:&lt;/strong&gt; Database connections thrashing under concurrent load&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Data That Changed Everything
&lt;/h3&gt;

&lt;p&gt;After analyzing performance traces from 12 production Rust services, a clear pattern emerged. The highest-performing APIs all implemented the same three optimization strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark Results: API Latency Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Median Latency&lt;/th&gt;
&lt;th&gt;P95 Latency&lt;/th&gt;
&lt;th&gt;P99 Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default Tokio&lt;/td&gt;
&lt;td&gt;2.1ms&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;td&gt;850ms&lt;/td&gt;
&lt;td&gt;48K req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimized Runtime&lt;/td&gt;
&lt;td&gt;1.8ms&lt;/td&gt;
&lt;td&gt;12ms&lt;/td&gt;
&lt;td&gt;28ms&lt;/td&gt;
&lt;td&gt;52K req/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improvement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The optimized configuration achieved &lt;strong&gt;97% better P99 latency&lt;/strong&gt; while maintaining higher throughput. The secret wasn’t complex algorithms or exotic libraries — it was understanding how to configure the async runtime for real-world workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret #1: Strategic Task Yielding Prevents Runtime Starvation
&lt;/h3&gt;

&lt;p&gt;The biggest latency killer in async Rust is &lt;strong&gt;cooperative scheduling gone wrong&lt;/strong&gt;. Unlike preemptive systems, Tokio relies on tasks voluntarily yielding control. When they don’t, everything grinds to a halt.&lt;/p&gt;

&lt;p&gt;Here’s the optimization that cut our P99 latency by 80%:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use tokio::task;  

// Before: CPU-intensive work blocks the runtime  
async fn process_data(items: Vec&amp;lt;DataItem&amp;gt;) -&amp;gt; Result&amp;lt;Vec&amp;lt;Result&amp;gt;, Error&amp;gt; {  
    let mut results = Vec::new();  
    for item in items {  
        results.push(expensive_computation(item)); // Blocks for ~10ms each  
    }  
    Ok(results)  
}  
// After: Strategic yielding keeps the runtime responsive  
async fn process_data_optimized(items: Vec&amp;lt;DataItem&amp;gt;) -&amp;gt; Result&amp;lt;Vec&amp;lt;Result&amp;gt;, Error&amp;gt; {  
    let mut results = Vec::new();  
    for (i, item) in items.iter().enumerate() {  
        results.push(expensive_computation(item));  

        // Yield control every 10 iterations  
        if i % 10 == 0 {  
            task::yield_now().await;  
        }  
    }  
    Ok(results)  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; This simple change reduced P99 latency from 850ms to 180ms. The &lt;code&gt;yield_now()&lt;/code&gt; calls allow other tasks to execute, preventing scheduler queue buildup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Science:&lt;/strong&gt; Tokio’s automatic cooperative yielding already goes a long way toward reducing tail latencies, but manual yielding gives you precise control over when expensive operations release the runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret #2: Runtime Configuration That Most Developers Miss
&lt;/h3&gt;

&lt;p&gt;The default Tokio runtime configuration optimizes for general-purpose workloads, not low-latency APIs. Here’s the configuration that transformed our production performance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use tokio::runtime::{Builder, Runtime};  

// Default: Good for general use, terrible for latency  
let rt = tokio::runtime::Runtime::new().unwrap();  
// Optimized: Tuned for low-latency APIs  
let rt = Builder::new_multi_thread()  
    .worker_threads(num_cpus::get() * 2)        // More threads = less queuing  
    .max_blocking_threads(256)                  // Handle blocking calls efficiently  
    .thread_keep_alive(Duration::from_secs(60)) // Reduce thread spawn overhead  
    .thread_name("api-worker")  
    .enable_all()  
    .build()  
    .unwrap();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Critical Insight:&lt;/strong&gt; Most APIs spend significant time on I/O operations (database queries, HTTP calls). The default runtime assumes a balanced workload, but APIs are I/O-heavy with occasional CPU spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2x worker threads:&lt;/strong&gt; Reduces task queuing when some threads are blocked on I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased blocking threads:&lt;/strong&gt; Prevents &lt;code&gt;spawn_blocking&lt;/code&gt; operations from starving each other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread keep-alive:&lt;/strong&gt; Eliminates the 100μs overhead of spawning new threads under load&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Secret #3: Connection Pool Configuration That Scales
&lt;/h3&gt;

&lt;p&gt;Database connection pools are often the hidden bottleneck in async APIs. The default configurations are conservative and performance-killing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use sqlx::{PgPool, postgres::PgPoolOptions};  
use std::time::Duration;  

// Before: Conservative defaults that create bottlenecks  
let pool = PgPool::connect("postgresql://...").await?;  
// After: Aggressive configuration that eliminates pool contention  
let pool = PgPoolOptions::new()  
    .min_connections(20)                    // Keep connections warm  
    .max_connections(100)                   // Allow burst capacity  
    .acquire_timeout(Duration::from_secs(1)) // Fail fast on contention  
    .idle_timeout(Duration::from_secs(300))  // Reduce connection churn  
    .max_lifetime(Duration::from_secs(1800)) // Prevent stale connections  
    .connect("postgresql://...")  
    .await?;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Math:&lt;/strong&gt; With 50,000 req/s and an average query time of 5ms, you need &lt;strong&gt;250 concurrent database operations&lt;/strong&gt;. The default pool size of 10 connections creates a massive bottleneck.&lt;/p&gt;
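That sizing rule is just Little's Law: operations in flight = arrival rate × service time. A quick sketch of the arithmetic (the `required_connections` helper is an illustrative name, not part of sqlx or Tokio):

```rust
// Little's Law: concurrent operations = arrival rate x service time.
fn required_connections(req_per_sec: u64, avg_query_ms: u64) -> u64 {
    req_per_sec * avg_query_ms / 1000
}

fn main() {
    // 50,000 req/s x 5 ms per query = 250 queries in flight at once,
    // far beyond a default pool of ~10 connections.
    println!("{}", required_connections(50_000, 5));
}
```

Plugging in your own request rate and query time gives a starting point for `max_connections` before load testing.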

&lt;p&gt;&lt;strong&gt;Real-World Results:&lt;/strong&gt; Increasing the pool size from 10 to 100 connections reduced our database query P99 latency from 450ms to 8ms — a &lt;strong&gt;98% improvement&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret #4: Memory Allocation Patterns That Make or Break Performance
&lt;/h3&gt;

&lt;p&gt;Async Rust’s zero-cost abstractions aren’t actually zero-cost when you’re allocating heavily. The highest-performing APIs minimize allocations in hot paths:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use std::sync::Arc;  
use bytes::Bytes;  

// Before: Heavy allocation in request handlers  
async fn handle_request(data: String) -&amp;gt; Result&amp;lt;String, Error&amp;gt; {  
    let processed = data.to_uppercase(); // Allocation  
    let result = format!("Result: {}", processed); // Another allocation  
    Ok(result)  
}  
// After: Allocation-aware design  
async fn handle_request_optimized(data: Arc&amp;lt;str&amp;gt;) -&amp;gt; Result&amp;lt;Bytes, Error&amp;gt; {  
    // Reuse Arc to avoid cloning  
    let processed = data.to_uppercase(); // Still need this allocation  
    let result = Bytes::from(format!("Result: {}", processed));  
    Ok(result)  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use &lt;code&gt;cargo flamegraph&lt;/code&gt; to identify allocation hotspots. In our case, 40% of CPU time was spent in the allocator during high-load scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Framework: When to Apply These Optimizations
&lt;/h3&gt;

&lt;p&gt;Not every application needs extreme latency optimization. Here’s when to invest in these techniques:&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Aggressive Optimization When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P99 latency &amp;gt; 100ms:&lt;/strong&gt; Your tail latencies are unacceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High concurrency:&lt;/strong&gt; &amp;gt;1,000 concurrent requests regularly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-sensitive workloads:&lt;/strong&gt; Financial, real-time, or gaming applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource constraints:&lt;/strong&gt; Running on expensive cloud infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stick with Defaults When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal tools:&lt;/strong&gt; Latency isn’t business-critical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low traffic:&lt;/strong&gt; &amp;lt;100 req/s peak load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing:&lt;/strong&gt; Throughput matters more than individual request latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development phase:&lt;/strong&gt; Premature optimization wastes time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Strategy: The 48-Hour Performance Sprint
&lt;/h3&gt;

&lt;p&gt;Here’s how to implement these optimizations systematically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1: Measurement and Runtime Tuning&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Baseline metrics:&lt;/strong&gt; Capture current P50, P95, P99 latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime configuration:&lt;/strong&gt; Apply the multi-threaded runtime settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection pools:&lt;/strong&gt; Increase database connection limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick win verification:&lt;/strong&gt; Should see 30–50% latency improvement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Day 2: Code-Level Optimizations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Profile allocation patterns:&lt;/strong&gt; Use &lt;code&gt;cargo flamegraph&lt;/code&gt; under load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add strategic yields:&lt;/strong&gt; Focus on CPU-heavy loops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize hot paths:&lt;/strong&gt; Reduce allocations in request handlers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load test validation:&lt;/strong&gt; Confirm improvements hold under real traffic&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Measuring Success: Metrics That Matter
&lt;/h3&gt;

&lt;p&gt;Track these key performance indicators to validate your optimizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P99 latency:&lt;/strong&gt; Should drop by 50%+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rate:&lt;/strong&gt; Must remain stable (&amp;lt;0.1%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Should improve or stay constant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secondary Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU utilization:&lt;/strong&gt; Should become more consistent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory usage:&lt;/strong&gt; May increase slightly due to larger pools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database connection usage:&lt;/strong&gt; Should distribute more evenly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #1: Over-yielding.&lt;/strong&gt; Adding &lt;code&gt;yield_now()&lt;/code&gt; everywhere actually hurts performance by creating unnecessary context switches. Yield only in CPU-intensive loops processing &amp;gt;100 items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #2: Massive Connection Pools.&lt;/strong&gt; Setting &lt;code&gt;max_connections&lt;/code&gt; to 1000+ can overwhelm your database. Start with 2-3x your expected concurrent query count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall #3: Ignoring Blocking Operations.&lt;/strong&gt; File I/O, DNS resolution, and CPU-heavy crypto operations must use &lt;code&gt;spawn_blocking&lt;/code&gt;. Blocking the async runtime destroys all your optimizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Bigger Picture: Why This Matters Now
&lt;/h3&gt;

&lt;p&gt;As Rust adoption accelerates in high-performance systems, understanding async optimization becomes a crucial competitive advantage. Tokio’s scheduler improvements have delivered 10x speedups in some benchmarks, but only if you configure the runtime correctly.&lt;/p&gt;

&lt;p&gt;The techniques in this article represent battle-tested optimizations from production systems handling millions of requests daily. They’re not theoretical — they’re the difference between an API that scales gracefully and one that falls over under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Async Rust’s performance ceiling is incredibly high, but reaching it requires understanding how the runtime actually works under pressure. These optimizations consistently deliver 50%+ latency improvements because they eliminate the three most common performance bottlenecks in production systems.&lt;/p&gt;

&lt;p&gt;Start with runtime configuration and connection pool tuning — you’ll see immediate results that justify the deeper optimizations.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Enjoyed the read? Let’s stay connected!&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 Follow &lt;strong&gt;The Speed Engineer&lt;/strong&gt; for more Rust, Go and high-performance engineering stories.&lt;/li&gt;
&lt;li&gt;💡 Like this article? Follow for daily speed-engineering benchmarks and tactics.&lt;/li&gt;
&lt;li&gt;⚡ Stay ahead in Rust and Go — follow for a fresh article every morning &amp;amp; night.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your support means the world and helps me create more content you’ll love. ❤️&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What was your win this week??</title>
      <dc:creator>Jess Lee</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://stormkit.forem.com/devteam/what-was-your-win-this-week-3df3</link>
      <guid>https://stormkit.forem.com/devteam/what-was-your-win-this-week-3df3</guid>
      <description>&lt;p&gt;👋👋👋👋&lt;/p&gt;

&lt;p&gt;Looking back on your week -- what was something you're proud of?&lt;/p&gt;

&lt;p&gt;All wins count -- big or small 🎉&lt;/p&gt;

&lt;p&gt;Examples of 'wins' include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting a promotion!&lt;/li&gt;
&lt;li&gt;Starting a new project&lt;/li&gt;
&lt;li&gt;Fixing a tricky bug&lt;/li&gt;
&lt;li&gt;Finally getting your inbox to zero 📧 &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75kbo6thoknrvjt1hgsv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75kbo6thoknrvjt1hgsv.gif" alt="An email emoji with sunglasses: " width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Friday!&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>weeklyretro</category>
    </item>
    <item>
      <title>The Real Problem With AI for Developers Is Not Capability, It's Overload</title>
      <dc:creator>Max Mendes</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:59:33 +0000</pubDate>
      <link>https://stormkit.forem.com/maxmendes91/the-real-problem-with-ai-for-developers-is-not-capability-its-overload-587o</link>
      <guid>https://stormkit.forem.com/maxmendes91/the-real-problem-with-ai-for-developers-is-not-capability-its-overload-587o</guid>
      <description>&lt;p&gt;AI code overload is not a model-quality problem anymore. It is an ownership problem. The tools are already good enough to flood your repo faster than your team can understand, review, or maintain it.&lt;/p&gt;

&lt;p&gt;I see this in my own workflow every week. Tools like OpenClaw, Claude Code, and Copilot are great at getting past the blank page. They turn rough ideas into working code fast. The trap starts right after that. If I let them run too far ahead, I end up with more implementation than understanding. The code exists, tests might even pass, but I no longer have a clean mental model of the system. Margaret-Anne Storey called this &lt;a href="https://margaretstorey.com/blog/2026/02/09/cognitive-debt/" rel="noopener noreferrer"&gt;cognitive debt&lt;/a&gt;, building on &lt;a href="https://www.media.mit.edu/publications/your-brain-on-chatgpt/" rel="noopener noreferrer"&gt;MIT Media Lab research&lt;/a&gt; from 2025, and Simon Willison &lt;a href="https://simonwillison.net/2026/Feb/15/cognitive-debt/" rel="noopener noreferrer"&gt;amplified the concept&lt;/a&gt; by describing his own experience of losing mental models of his AI-assisted projects.&lt;/p&gt;

&lt;p&gt;That framing clicked for me more than any technical-debt discussion ever has.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Output Problem Nobody Warned You About
&lt;/h2&gt;

&lt;p&gt;Most posts about AI coding still focus on whether the model is smart enough. I think that debate is already stale. The real bottleneck moved downstream.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 DORA report&lt;/a&gt; says AI adoption among software professionals hit roughly 90%, with over 80% reporting productivity gains. Sounds great until you look at organizational delivery metrics, which stayed flat. AI boosted individual output (21% more tasks completed, 98% more pull requests merged) but &lt;a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025" rel="noopener noreferrer"&gt;PR review time increased 91%&lt;/a&gt; and PR size grew 154%. More code in, same review capacity out.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://survey.stackoverflow.co/2025/" rel="noopener noreferrer"&gt;Stack Overflow 2025 survey&lt;/a&gt; found 84% of developers now use or plan to use AI coding tools. But trust in AI output accuracy dropped to 29%, down from 40% the year before. And 66% of developers cited "almost right, but not quite" as their top frustration.&lt;/p&gt;

&lt;p&gt;Here is the number that should worry everyone: the &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR randomized controlled trial&lt;/a&gt; found that experienced open-source developers were actually 19% slower with AI tools, despite believing they were 20% faster. That is a 39-point perception gap. We feel productive while we are falling behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cognitive Debt Is Worse Than Technical Debt
&lt;/h2&gt;

&lt;p&gt;Technical debt is code that works but is messy. You know it is there and you can plan around it. Cognitive debt is different. It is code that works but nobody on the team actually understands it well enough to modify safely. The second is harder to detect and much harder to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/research/AI-assistance-coding-skills" rel="noopener noreferrer"&gt;Anthropic's own study&lt;/a&gt; of 52 engineers found that developers using AI assistance scored 17% lower on comprehension tests (50% vs 67%), with the biggest drops in debugging. The code shipped, but the understanding did not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hbr.org/2026/03/when-using-ai-leads-to-brain-fry" rel="noopener noreferrer"&gt;Harvard Business Review reported&lt;/a&gt; on what they call "AI brain fry." A BCG study of 1,488 workers found that people managing AI output experience 33% more decision fatigue and 39% more major errors. Productivity peaked at three simultaneous AI tools. Beyond that, performance actually dropped.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.scientificamerican.com/article/why-developers-using-ai-are-working-longer-hours/" rel="noopener noreferrer"&gt;Multitudes study&lt;/a&gt; of 500+ developers found a 19.6% rise in out-of-hour commits among AI tool users, with Saturday productive hours up 46%. As &lt;a href="https://leaddev.com/ai/addictive-agentic-coding-has-developers-losing-sleep" rel="noopener noreferrer"&gt;LeadDev reported&lt;/a&gt;, faster code generation does not automatically create calmer teams. It often just creates longer evenings. &lt;a href="https://www.axios.com/2026/04/04/ai-agents-burnout-addiction-claude-code-openclaw" rel="noopener noreferrer"&gt;Axios recently compared&lt;/a&gt; agentic coding tools to slot machines, noting that some developers now need sleep medication to break the late-night coding loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I See in My Own Workflow
&lt;/h2&gt;

&lt;p&gt;I use AI on almost every project. When I &lt;a href="https://maxmendes.dev/en/projects/flowmate" rel="noopener noreferrer"&gt;built FlowMate&lt;/a&gt;, a production SaaS handling email management with AI integrations, every line of AI-assisted code went through manual review. When I &lt;a href="https://maxmendes.dev/en/blog/ai-automation-finding-businesses-without-websites" rel="noopener noreferrer"&gt;built automation workflows&lt;/a&gt; to find businesses without websites, AI handled the repetitive parts while I designed the system architecture.&lt;/p&gt;

&lt;p&gt;The pattern that works for me: start with the agent, stop it early, read everything, then continue. The pattern that burns me: let the agent run ahead for 20 minutes, then try to catch up with what it built. The second approach feels more productive. It is not. I end up spending twice as long untangling code I should have reviewed incrementally.&lt;/p&gt;

&lt;p&gt;This is exactly why I wrote about &lt;a href="https://maxmendes.dev/en/blog/vibe-coding-eating-software-development" rel="noopener noreferrer"&gt;vibe coding culture&lt;/a&gt; a few weeks ago. The core risk is the same: the tools outrun the review. Vibe coding is the cultural norm. Cognitive debt is the technical consequence. They feed each other.&lt;/p&gt;

&lt;p&gt;That matters for &lt;a href="https://maxmendes.dev/en/services/ai-integration" rel="noopener noreferrer"&gt;AI integration work&lt;/a&gt; more than people realize. The value is not in generating code faster. The value is in keeping the human ahead of the machine at every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80% Trap
&lt;/h2&gt;

&lt;p&gt;Addy Osmani &lt;a href="https://addyo.substack.com/p/the-80-problem-in-agentic-coding" rel="noopener noreferrer"&gt;described this well&lt;/a&gt;: agents generate 80% of the code, but the remaining 20% requires deep architectural knowledge. The trap is that 80% feels like progress. You merge it. Then the 20% arrives and you realize you do not understand the 80% well enough to finish.&lt;/p&gt;

&lt;p&gt;The data backs this up. &lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear analyzed 211 million lines of code&lt;/a&gt; from 2020 to 2024 and found code duplication grew 8x since AI tools became widely adopted. Healthy refactoring ("moved" code) dropped 39.9%. For the first time in their dataset, developers were pasting code more often than restructuring it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report" rel="noopener noreferrer"&gt;CodeRabbit's research&lt;/a&gt; on 470 pull requests found AI-generated code produces 1.7x more issues overall. Security vulnerabilities were 2.74x higher. Readability problems were 3x more frequent.&lt;/p&gt;

&lt;p&gt;This is what borrowed speed looks like. You moved fast for a week and now you are stuck for a month debugging code you never properly understood.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counterargument (And Why It Is Partly Right)
&lt;/h2&gt;

&lt;p&gt;The obvious pushback: more code is still better than no code. I agree, up to a point. I would rather start from a rough AI-generated feature than from an empty file. I use AI every day for exactly that reason.&lt;/p&gt;

&lt;p&gt;But this only works when the human stays ahead of the abstraction. If the tool is writing code faster than you can explain it, then your throughput is synthetic. You borrowed speed from your future self, and your future self will not be happy about the interest rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;I think the winning developers will not be the ones who generate the most code. They will be the ones who keep the shortest path between generated code and human understanding. Here is what that looks like in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smaller batches.&lt;/strong&gt; Let the agent generate one function, review it, then continue. Not one feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggressive review.&lt;/strong&gt; Read every line before it leaves your machine. If you cannot explain it to a colleague, it is not ready to merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saying no.&lt;/strong&gt; When the agent is about to create a hundred lines you do not fully need, stop it. Removing code is easier than understanding code you never asked for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good notes.&lt;/strong&gt; Write down why the system works the way it does, not just what it does. Cognitive debt accumulates in the gaps between code and comprehension.&lt;/p&gt;

&lt;p&gt;In my case, AI works best when I use it to compress effort, not outsource comprehension. If you are building client systems, the boring parts still matter. From &lt;a href="https://maxmendes.dev/en/services/web-development" rel="noopener noreferrer"&gt;solid web architecture&lt;/a&gt; to keeping a clean path to future changes through &lt;a href="https://maxmendes.dev/en/projects" rel="noopener noreferrer"&gt;real project maintenance&lt;/a&gt;, the &lt;a href="https://maxmendes.dev/en/blog/dead-internet-human-made-websites" rel="noopener noreferrer"&gt;dead internet problem&lt;/a&gt; taught us that quality and authenticity still win, whether we are talking about content or code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Developers Who Will Win This
&lt;/h2&gt;

&lt;p&gt;Model capability keeps improving. That is not the bottleneck anymore. AI code overload is the bigger risk, because unread code, invisible decisions, and broken mental models are what actually slow you down six months from now.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://digitaleconomy.stanford.edu/wp-content/uploads/2025/11/CanariesintheCoalMine_Nov25.pdf" rel="noopener noreferrer"&gt;Stanford Digital Economy Lab found&lt;/a&gt; that employment among software developers aged 22 to 25 fell nearly 20% between 2022 and 2025, while developers over 26 saw stable or growing employment. The "write code from tutorials" job is disappearing. The "understand systems and make decisions" job is not.&lt;/p&gt;

&lt;p&gt;I would rather ship less code I still understand than more code I already mentally abandoned. That is not a productivity problem. That is an engineering discipline, and it is the one thing AI cannot do for you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://maxmendes.dev/en/blog/ai-code-overload-developers" rel="noopener noreferrer"&gt;maxmendes.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>webdev</category>
      <category>codequality</category>
    </item>
    <item>
      <title>Building a Multimodal Cross Cloud Live Agent with ADK, Amazon ECS Express, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:57:32 +0000</pubDate>
      <link>https://stormkit.forem.com/gde/building-a-multimodal-cross-cloud-live-agent-with-adk-amazon-ecs-express-and-gemini-cli-30a8</link>
      <guid>https://stormkit.forem.com/gde/building-a-multimodal-cross-cloud-live-agent-with-adk-amazon-ecs-express-and-gemini-cli-30a8</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build cross cloud apps with the Python programming language deployed to the ECS Express service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchg09muwt24i30zbhx7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchg09muwt24i30zbhx7d.png" width="758" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python Agent Demos?
&lt;/h4&gt;

&lt;p&gt;Yes there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal working agent that can be run locally and deployed cross cloud without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing and has deep libraries for working with ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To check your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;admin@ip-172-31-70-211:~/gemini-cli-aws/mcp-lightsail-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.13.12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+ECS+Express+Mode&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+ecs+express&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIJCAEQIRgKGKAB0gEIMzI0MWowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfAELWySw4fS4VoaovwdGE8MUNcOltEQ-lyCKwxY4t3OArbcxO8JX30JpX02tjJDKML-JgcQEQDIaZjDgUHMoJTycp046hy8F-_Y_zxJ9Bo0rZyERUQ6geXGT9MPUb02ZLA7LpFjGlcpRgGkURGERCNHTKdtI2kGtm-bh5XT5dS4hpo&amp;amp;csui=3&amp;amp;ved=2ahUKEwiu_YSzptWTAxVPF1kFHY8nLbwQgK4QegQIARAB" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt; (announced Nov 2025) is a simplified deployment feature for Amazon Elastic Container Service (ECS) designed to rapidly launch containerized applications, APIs, and web services on AWS Fargate. It automates infrastructure setup — including load balancing, networking, scaling, and HTTPS endpoints — allowing developers to deploy from container image to production in a single step.&lt;/p&gt;

&lt;p&gt;More details are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/express-service-overview.html" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of the Gemini CLI. You will need to authenticate with an API key or your Google account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;Gemini CLI needs a consistent, up-to-date version of Node. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools and Lightsail extensions need a current version of Docker. If your environment does not provide a recent Docker binary, the Docker Version Manager can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  This seems like a lot of Configuration!
&lt;/h4&gt;

&lt;p&gt;Getting the key tools in place is the first step to working across Cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting multimodal, real-time, cross-cloud agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;The agents in the demo are based on the original code lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/way-back-home-level-3/instructions#3" rel="noopener noreferrer"&gt;Way Back Home - Building an ADK Bi-Directional Streaming Agent | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK Agent is built with the visual builder. Next — the entire solution is deployed to Amazon ECS Express.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. All of the relevant code examples and documentation are available in GitHub. The repo has a wide variety of samples, but this lab will focus on the ‘level_3-ecsexpress’ setup.&lt;/p&gt;

&lt;p&gt;The next step is to clone the GitHub repository to your local environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;level_3-ecsexpress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;
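
&lt;p&gt;As a purely hypothetical sketch of the kind of variables such a script exports (the names beyond PROJECT_ID and all values here are illustrative, not taken from the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative only -- the real values come from init.sh / set_env.sh
export PROJECT_ID="my-demo-project"
export AWS_REGION="us-east-1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;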

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with Agent1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/level_3-ecsexpress/backend/app$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run biometric_agent
&lt;span class="go"&gt;Log setup complete: /tmp/agents_log/agent.20260405_093812.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent biometric_agent, type exit to exit.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Deploying to Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;The first step is to refresh the AWS credentials in the current build environment:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemini-cli-aws/level_3-ecsexpress&lt;span class="nv"&gt;$ &lt;/span&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then a utility script caches the credentials on the local system for building:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;xbill@penguin:~/gemini-cli-aws/level_3-ecsexpress&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh 
Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these &lt;span class="k"&gt;for &lt;/span&gt;deployments.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the deploy target on the local system:&lt;/p&gt;

&lt;p&gt;You can validate the final result by checking the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make deploy

✦ The application has been successfully deployed to AWS ECS Express Mode.

- Service Status: ACTIVE
   - Public Endpoint: https://bi-59e66ed2dcde45dcb1b347ce8d6ca7b8.ecs.us-east-1.on.aws
   - Deployment Cycle: IAM roles created/verified, Docker image built and pushed to ECR, and ECS service updated.

You can now access your biometric-scout-service at the above URL.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the container is deployed, you can check the service status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then get the endpoint URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  bi-59e66ed2dcde45dcb1b347ce8d6ca7b8.ecs.us-east-1.on.aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The service will be visible in the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0uasfk2xabqi35iin9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0uasfk2xabqi35iin9k.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the Web Interface
&lt;/h4&gt;

&lt;p&gt;Start a connection to the ECS Express Deployed app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://bi-59e66ed2dcde45dcb1b347ce8d6ca7b8.ecs.us-east-1.on.aws/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then connect to the app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsd0jys3igkz3kari9pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsd0jys3igkz3kari9pj.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then use the Live model to process audio and video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4muhndap65r4ou8fsa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4muhndap65r4ou8fsa5.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally — complete the sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8xd82i11fookffa2kgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8xd82i11fookffa2kgh.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to enable a multi-modal agent using the Gemini Live Model. This Agent was tested locally with the CLI and then deployed to Amazon ECS Express. This approach validates that cross cloud tools can be used — even with more complex agents.&lt;/p&gt;

</description>
      <category>geminilive</category>
      <category>python</category>
      <category>gemini</category>
      <category>googleadk</category>
    </item>
    <item>
      <title>Cross Cloud Multi Agent Comic Builder with ADK, Amazon EKS, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:56:03 +0000</pubDate>
      <link>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-eks-and-gemini-cli-4o10</link>
      <guid>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-eks-and-gemini-cli-4o10</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build low code apps with the Python programming language deployed to the EKS service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python MCP Demos?
&lt;/h4&gt;

&lt;p&gt;Yes there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal working MCP stdio server that can be run locally without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing and has deep libraries for working with ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To check your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;admin@ip-172-31-70-211:~/gemini-cli-aws/mcp-lightsail-python-aws$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.13.12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon EKS
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+Elastic+Kubernetes+Service&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+eks&amp;amp;gs_lcrp=EgZjaHJvbWUqBwgAEAAYgAQyBwgAEAAYgAQyBwgBEAAYgAQyBwgCEAAYgAQyCAgDEAAYFhgeMggIBBAAGBYYHjIICAUQABgWGB4yCAgGEAAYFhgeMggIBxAAGBYYHjIICAgQABgWGB4yCAgJEAAYFhge0gEINjg1N2owajSoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;ved=2ahUKEwjj6LrXrtWTAxV3LFkFHRstPUQQgK4QegYIAQgAEAQ" rel="noopener noreferrer"&gt;Amazon Elastic Kubernetes Service&lt;/a&gt; (EKS) is a fully managed service from Amazon Web Services (AWS) that makes it easy to run &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; on AWS without needing to install, operate, or maintain your own Kubernetes control plane. It automates cluster management, security, and scaling, supporting applications on both Amazon EC2 and AWS Fargate.&lt;/p&gt;

&lt;p&gt;More information is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html" rel="noopener noreferrer"&gt;What is Amazon EKS?&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of the Gemini CLI. You will need to authenticate with an API key or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;The Gemini CLI needs a consistent, up-to-date version of Node.js. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node.js environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools and Lightsail extensions need a current version of Docker. If your environment does not provide a recent Docker binary, the Docker Version Manager (dvm) can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  This seems like a lot of Configuration!
&lt;/h4&gt;

&lt;p&gt;Getting the key tools in place is the first step to working across cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting low-code agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;The agents in the demo are based on the original code lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/create-low-code-agent-with-ADK-visual-builder#0" rel="noopener noreferrer"&gt;Create and deploy low code ADK (Agent Deployment Kit) agents using ADK Visual Builder | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK agent is built with the visual builder. Next, the entire solution is deployed to Amazon EKS.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;adkui-eks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;
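
&lt;p&gt;As an illustration, the kind of variables involved can be sketched as a small profile script. The names and values other than PROJECT_ID are assumptions, not the actual contents of &lt;strong&gt;set_env.sh&lt;/strong&gt;:&lt;/p&gt;

```shell
# Hypothetical environment profile (PROJECT_ID is referenced by the build
# scripts; AWS_REGION and both values here are illustrative assumptions).
export PROJECT_ID="my-demo-project"
export AWS_REGION="us-east-1"
echo "PROJECT_ID=$PROJECT_ID AWS_REGION=$AWS_REGION"
```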

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with Agent1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-eks$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run Agent1
&lt;span class="go"&gt;Log setup complete: /tmp/agents_log/agent.20260404_211625.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/utils/agent_loader.py:277: UserWarning: [EXPERIMENTAL] _load_from_yaml_config: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  if root_agent := self._load_from_yaml_config(actual_agent_name, agents_dir):
/home/xbill/.local/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:81: UserWarning: [EXPERIMENTAL] feature FeatureName.AGENT_CONFIG is enabled.
  check_feature_enabled()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent Agent1, type exit to exit.
[user]: what is Amazon EKS
[Agent1]: Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service offered by Amazon Web Services (AWS). It simplifies the process of deploying, managing, and scaling containerized applications using Kubernetes on AWS.

Key aspects of Amazon EKS include:
* **Fully Managed Control Plane** AWS takes on the responsibility of managing the Kubernetes control plane, which includes tasks like scheduling applications, managing their lifecycle, and storing the cluster's state. This means AWS handles the infrastructure, scaling, patching, and updates for the control plane, freeing users to focus on their applications.
* **Kubernetes Conformance** Amazon EKS is a certified Kubernetes conformant service, allowing you to use standard Kubernetes tools and plugins. This also means you can easily migrate existing Kubernetes applications without modification.
* **High Availability and Scalability** The EKS control plane is automatically distributed across multiple Availability Zones to ensure high availability and resilience. EKS also integrates with AWS services for scaling worker nodes and pods.
* **Integration with AWS Services** EKS seamlessly integrates with other AWS services such as Amazon VPC for networking, AWS Identity and Access Management (IAM) for authentication, Amazon CloudWatch for monitoring, and Auto Scaling Groups for scaling.
* **Deployment Options** While primarily for running Kubernetes on the AWS cloud, Amazon EKS also offers deployment options for on-premises and edge environments through Amazon EKS Anywhere and Amazon EKS on AWS Outposts. These options allow for consistent Kubernetes management across various infrastructures.

Essentially, Amazon EKS reduces the operational complexity of running Kubernetes, allowing organizations to leverage the benefits of container orchestration without the overhead of managing the underlying infrastructure themselves.Amazon Elastic Kubernetes Service (Amazon EKS) is a fully managed Kubernetes service provided by Amazon Web Services (AWS). It is designed to simplify the deployment, management, and scaling of containerized applications using Kubernetes on the AWS cloud, and also offers options for on-premises and edge environments. 0.0s 0.0s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
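
&lt;p&gt;The &lt;code&gt;_load_from_yaml_config&lt;/code&gt; warnings in the log suggest that Agent1 is defined declaratively. A hypothetical sketch of such an agent config follows; the field names track the experimental ADK Agent Config format and may differ in your ADK version:&lt;/p&gt;

```yaml
# Hypothetical Agent1 config sketch (experimental ADK Agent Config format;
# the model name and instruction wording are assumptions).
name: Agent1
model: gemini-2.5-flash
description: Answers questions about AWS services.
instruction: |
  You are a helpful assistant. Explain AWS services such as Amazon EKS
  clearly and concisely.
```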



&lt;h4&gt;
  
  
  Deploying to Amazon EKS
&lt;/h4&gt;

&lt;p&gt;First authenticate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then cache the credentials locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-eks$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then start the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ Deployment to Amazon EKS was successful.

  Deployment Summary

   - EKS Cluster: adkui-eks-cluster (Status: ACTIVE)
   - Image: 106059658660.dkr.ecr.us-east-1.amazonaws.com/adk-comic-image:latest
   - Pod Status: Running (1/1 READY)
   - Service Endpoint: http://af62eb56d13b74cefb372550e726efaa-1528063823.us-east-1.elb.amazonaws.com

  The make deploy command completed the following steps:
   1. Updated kubeconfig for the EKS cluster.
   2. Built the Docker image based on the Dockerfile.
   3. Logged in to Amazon ECR and pushed the image.
   4. Generated k8s-deployment.yaml and applied it to the cluster.

  You can now access the ADK Web UI at the endpoint listed above.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
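
&lt;p&gt;Step 4 of the summary generates a &lt;code&gt;k8s-deployment.yaml&lt;/code&gt;. A hypothetical sketch of what such a manifest could contain; the container port, labels, and replica count are assumptions, while the image matches the summary above:&lt;/p&gt;

```yaml
# Hypothetical sketch of the generated k8s-deployment.yaml
# (port, labels, and replica count are assumptions).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: adk-comic-image
spec:
  replicas: 1
  selector:
    matchLabels:
      app: adk-comic-image
  template:
    metadata:
      labels:
        app: adk-comic-image
    spec:
      containers:
        - name: adk-comic-image
          image: 106059658660.dkr.ecr.us-east-1.amazonaws.com/adk-comic-image:latest
          ports:
            - containerPort: 8000   # ADK web server port (assumption)
---
apiVersion: v1
kind: Service
metadata:
  name: adk-comic-image
spec:
  type: LoadBalancer   # provisions the ELB endpoint shown in the summary
  selector:
    app: adk-comic-image
  ports:
    - port: 80
      targetPort: 8000
```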



&lt;p&gt;You can validate the final result by checking the messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The EKS LoadBalancer endpoint is:
  http://af62eb56d13b74cefb372550e726efaa-1528063823.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The service will be visible in the AWS console, which will look similar to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femr6g0fen6vd2zzx2j60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femr6g0fen6vd2zzx2j60.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the ADK Web Interface
&lt;/h4&gt;

&lt;p&gt;Open the endpoint of the EKS-deployed ADK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
  http://af62eb56d13b74cefb372550e726efaa-1528063823.us-east-1.elb.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will bring up the ADK UI. Select the sub-agent “Agent3”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will generate the comic using a multi-agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the multi-agent system is complete:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Visual Edit Agent Pipeline
&lt;/h4&gt;

&lt;p&gt;The deployed version of the ADK includes a visual builder:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Run the Online Viewer Agent
&lt;/h4&gt;

&lt;p&gt;Once Agent3 has completed, go to the ADK agent selector and select “Agent4”. This agent will let you browse your online comic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  View the Final Artifacts
&lt;/h4&gt;

&lt;p&gt;You can use Agent4 to visualize the results of the agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the final panels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to visually define a multi-agent pipeline that generates comic-book-style HTML. The agent was tested locally with the CLI and then with the ADK web tool. Then, several sample ADK agents were run directly from the EKS deployment in AWS. This approach validates that cross-cloud tools can be used, even with more complex agents.&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>googleadk</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>Cross Cloud Multi Agent Comic Builder with ADK, Amazon ECS Express, and Gemini CLI</title>
      <dc:creator>xbill</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:54:24 +0000</pubDate>
      <link>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-ecs-express-and-gemini-cli-41me</link>
      <guid>https://stormkit.forem.com/gde/cross-cloud-multi-agent-comic-builder-with-adk-amazon-ecs-express-and-gemini-cli-41me</guid>
      <description>&lt;p&gt;Leveraging the Google Agent Development Kit (ADK) and the underlying Gemini LLM to build low code apps with the Python programming language deployed to the ECS express service on AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F722gp42o7m9epkc213p4.jpeg" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Aren’t There a Billion Python MCP Demos?
&lt;/h4&gt;

&lt;p&gt;Yes, there are.&lt;/p&gt;

&lt;p&gt;Python has traditionally been the main coding language for ML and AI tools. The goal of this article is to provide a minimal, working multi-agent ADK example that can be deployed to AWS without any unneeded extra code or extensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  What Is Python?
&lt;/h4&gt;

&lt;p&gt;Python is an interpreted language that allows for rapid development and testing and has deep libraries for working with ML and AI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Welcome to Python.org&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Python Version Management
&lt;/h4&gt;

&lt;p&gt;One of the downsides of the wide deployment of Python has been managing the language versions across platforms and maintaining a supported version.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;pyenv&lt;/strong&gt; tool enables deploying consistent versions of Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;GitHub - pyenv/pyenv: Simple Python version management&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of this writing, the mainstream Python version is 3.13. To validate your current Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.13.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Amazon ECS Express Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.google.com/search?q=Amazon+ECS+Express+Mode&amp;amp;rlz=1CAIWTJ_enUS1110&amp;amp;oq=what+is+amazon+ecs+express&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIJCAEQIRgKGKAB0gEIMzI0MWowajeoAgCwAgA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfAELWySw4fS4VoaovwdGE8MUNcOltEQ-lyCKwxY4t3OArbcxO8JX30JpX02tjJDKML-JgcQEQDIaZjDgUHMoJTycp046hy8F-_Y_zxJ9Bo0rZyERUQ6geXGT9MPUb02ZLA7LpFjGlcpRgGkURGERCNHTKdtI2kGtm-bh5XT5dS4hpo&amp;amp;csui=3&amp;amp;ved=2ahUKEwiu_YSzptWTAxVPF1kFHY8nLbwQgK4QegQIARAB" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt; (announced Nov 2025) is a simplified deployment feature for Amazon Elastic Container Service (ECS) designed to rapidly launch containerized applications, APIs, and web services on AWS Fargate. It automates infrastructure setup — including load balancing, networking, scaling, and HTTPS endpoints — allowing developers to deploy from container image to production in a single step.&lt;/p&gt;

&lt;p&gt;More details are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/express-service-overview.html" rel="noopener noreferrer"&gt;Amazon ECS Express Mode&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Gemini CLI
&lt;/h4&gt;

&lt;p&gt;If it is not pre-installed, you can install the Gemini CLI to interact with the source files and provide real-time assistance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Testing the Gemini CLI Environment
&lt;/h4&gt;

&lt;p&gt;Once you have all the tools and the correct Node.js version in place, you can test the startup of the Gemini CLI. You will need to authenticate with an API key or your Google Account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini

▝▜▄ Gemini CLI v0.33.1
    ▝▜▄
   ▗▟▀ Logged in with Google /auth
  ▝▀ Gemini Code Assist Standard /upgrade no sandbox (see /docs) /model Auto (Gemini 3) | 239.8 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Node Version Management
&lt;/h4&gt;

&lt;p&gt;The Gemini CLI needs a consistent, up-to-date version of Node.js. The &lt;strong&gt;nvm&lt;/strong&gt; tool can be used to get a standard Node.js environment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;GitHub - nvm-sh/nvm: Node Version Manager - POSIX-compliant bash script to manage multiple active node.js versions&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Version Management
&lt;/h4&gt;

&lt;p&gt;The AWS CLI tools need a current version of Docker. If your environment does not provide a recent Docker binary, the Docker Version Manager (dvm) can be used to download the latest supported Docker:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://howtowhale.github.io/dvm/install.html" rel="noopener noreferrer"&gt;Install&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  AWS CLI
&lt;/h4&gt;

&lt;p&gt;The AWS CLI provides a command line tool to directly access AWS services from your current environment. Full details on the CLI are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lightsail/latest/userguide/amazon-lightsail-install-software.html" rel="noopener noreferrer"&gt;Install Docker, AWS CLI, and the Lightsail Control plugin for containers&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent Development Kit
&lt;/h4&gt;

&lt;p&gt;The &lt;a href="https://www.google.com/search?q=Google+Agent+Development+Kit&amp;amp;rlz=1CAIWTJ_enUS1114&amp;amp;oq=what+is+the+adk+google&amp;amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQABgWGB4yCAgCEAAYFhgeMggIAxAAGBYYHjIICAQQABgWGB4yCAgFEAAYFhgeMggIBhAAGBYYHjIKCAcQABgKGBYYHjINCAgQABiGAxiABBiKBTIKCAkQABiABBiiBNIBCDMxODlqMGo3qAIAsAIA&amp;amp;sourceid=chrome&amp;amp;ie=UTF-8&amp;amp;mstk=AUtExfB5Oo7ZHHcDEHu7aqZiPBA2l1c-QGh5dB7xkkDPIiYcn8O1Imt2IHNR7bzA6JnyDCSDCUGpGWTeBW14namlN_QqzJLLI5-px1BE9jfSxwli6njPDPERjm5pRqNP3uC6HhUKiRcTJ1T8x5LHQrCkVxylw7QWg0N8B4dQDIcWpnVX9Gc&amp;amp;csui=3&amp;amp;ved=2ahUKEwjYu-G8p-uSAxXrv4kEHUbpLo0QgK4QegQIARAB" rel="noopener noreferrer"&gt;Google Agent Development Kit&lt;/a&gt; (ADK) is an open-source, Python-based framework designed to streamline the creation, deployment, and orchestration of sophisticated, multi-agent AI systems. It treats agent development like software engineering, offering modularity, state management, and built-in tools (like Google Search) to build autonomous agents.&lt;/p&gt;

&lt;p&gt;The ADK can be installed from here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Agent Development Kit (ADK)&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  This seems like a lot of Configuration!
&lt;/h4&gt;

&lt;p&gt;Getting the key tools in place is the first step to working across cloud environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where do I start?
&lt;/h4&gt;

&lt;p&gt;The strategy for starting low-code agent development is an incremental, step-by-step approach.&lt;/p&gt;

&lt;p&gt;The agents in the demo are based on the original code lab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/codelabs/create-low-code-agent-with-ADK-visual-builder#0" rel="noopener noreferrer"&gt;Create and deploy low code ADK (Agent Deployment Kit) agents using ADK Visual Builder | Google Codelabs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the basic development environment is set up with the required system variables and a working Gemini CLI configuration.&lt;/p&gt;

&lt;p&gt;Then, a minimal ADK agent is built with the visual builder. Next, the entire solution is deployed to Amazon ECS Express.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup the Basic Environment
&lt;/h4&gt;

&lt;p&gt;At this point you should have a working Python environment and a working Gemini CLI installation. The next step is to clone the GitHub samples repository with support scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone https://github.com/xbill9/gemini-cli-aws
&lt;span class="nb"&gt;cd &lt;/span&gt;adkui-ecsexpress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;strong&gt;init.sh&lt;/strong&gt; from the cloned directory.&lt;/p&gt;

&lt;p&gt;The script will attempt to determine your shell environment and set the correct variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;init.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your session times out or you need to re-authenticate, you can run the &lt;strong&gt;set_env.sh&lt;/strong&gt; script to reset your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;set_env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like PROJECT_ID need to be set up for use in the various build scripts, so the &lt;strong&gt;set_env&lt;/strong&gt; script can be used to reset the environment if your session times out.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify The ADK Installation
&lt;/h4&gt;

&lt;p&gt;To verify the setup, run the ADK CLI locally with Agent1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-ecsexpress$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;adk run Agent1
&lt;span class="go"&gt;Log setup complete: /tmp/agents_log/agent.20260404_202121.log
To access latest log: tail -F /tmp/agents_log/agent.latest.log
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/utils/agent_loader.py:277: UserWarning: [EXPERIMENTAL] _load_from_yaml_config: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  if root_agent := self._load_from_yaml_config(actual_agent_name, agents_dir):
/home/xbill/.local/lib/python3.13/site-packages/google/adk/features/_feature_decorator.py:81: UserWarning: [EXPERIMENTAL] feature FeatureName.AGENT_CONFIG is enabled.
  check_feature_enabled()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/cli/cli.py:204: UserWarning: [EXPERIMENTAL] InMemoryCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  credential_service = InMemoryCredentialService()
/home/xbill/.local/lib/python3.13/site-packages/google/adk/auth/credential_service/in_memory_credential_service.py:33: UserWarning: [EXPERIMENTAL] BaseCredentialService: This feature is experimental and may change or be removed in future versions without notice. It may introduce breaking changes at any time.
  super(). __init__ ()
Running agent Agent1, type exit to exit.
[user]: what is amazon ecs express
[Agent1]: Amazon ECS Express Mode is a new feature for Amazon Elastic Container Service (ECS) that simplifies and accelerates the deployment and management of containerized applications, particularly web applications and APIs, on AWS. It aims to reduce the operational overhead for developers by automating much of the infrastructure setup that typically accompanies deploying containerized applications to production. 0.0s 0.0s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Deploying to Amazon ECS Express
&lt;/h4&gt;

&lt;p&gt;First authenticate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws login &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then cache the credentials locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;xbill@penguin:~/gemini-cli-aws/adkui-ecsexpress$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;save-aws-creds.sh
&lt;span class="go"&gt;Exporting AWS credentials...
Successfully saved credentials to .aws_creds
The Makefile will now automatically use these for deployments.

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then start the deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt; &lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;deploy&lt;/span&gt;
&lt;span class="err"&gt;✦&lt;/span&gt; &lt;span class="err"&gt;I&lt;/span&gt; &lt;span class="err"&gt;will&lt;/span&gt; &lt;span class="err"&gt;execute&lt;/span&gt; &lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;deploy&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;initiate&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;full&lt;/span&gt; &lt;span class="err"&gt;ECS&lt;/span&gt; &lt;span class="err"&gt;Express&lt;/span&gt; &lt;span class="err"&gt;Mode&lt;/span&gt; &lt;span class="err"&gt;deployment&lt;/span&gt; &lt;span class="err"&gt;cycle,&lt;/span&gt; &lt;span class="err"&gt;including&lt;/span&gt; &lt;span class="err"&gt;building&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;Docker&lt;/span&gt;
  &lt;span class="err"&gt;image,&lt;/span&gt; &lt;span class="err"&gt;pushing&lt;/span&gt; &lt;span class="err"&gt;it&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;ECR,&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="err"&gt;deploying&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;ECS.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can validate the final result by checking the status messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✦ The ECS service adkui-ecsexpress is currently ACTIVE.

   * Service Name: adkui-ecsexpress
   * Status: ACTIVE
   * Endpoint: http://ad-27f169e1d3994ae3a8fd357bc014bbd2.ecs.us-east-1.on.aws
     (http://ad-27f169e1d3994ae3a8fd357bc014bbd2.ecs.us-east-1.on.aws)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The service will be visible in the AWS console. The console will look similar to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dl0tnrpxr5voaoiehhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dl0tnrpxr5voaoiehhg.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Running the ADK Web Interface
&lt;/h4&gt;

&lt;p&gt;Open the deployed ADK endpoint in your browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://ad-27f169e1d3994ae3a8fd357bc014bbd2.ecs.us-east-1.on.aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will bring up the ADK UI. Select the sub-agent “Agent3”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgz3evwq8vbwr9iuvhn.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will generate the comic using a multi-agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoyph004li1vuxc99x6e.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the multi-agent run is complete:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1aafxi3vhxiteigdl7z.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Visual Edit Agent Pipeline
&lt;/h4&gt;

&lt;p&gt;The deployed version of the ADK includes a visual builder:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkn90tclf72xbrzpbih7x.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Run the Online Viewer Agent
&lt;/h4&gt;

&lt;p&gt;Once Agent3 has completed, go to the ADK agent selector and select “Agent4”. This agent will allow you to browse your online comic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo089cu74w6f3q9yx9us.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  View the Final Artifacts
&lt;/h4&gt;

&lt;p&gt;You can use Agent4 to visualize the results of the agent pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F328l95esisz8h8gz2scm.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the final panels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k79vjjgymrknxj7y148.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;The Agent Development Kit was used to visually define a multi-agent pipeline that generates comic-book-style HTML. The agent was tested locally with the CLI and then with the ADK web tool. Finally, several sample ADK agents were run directly from the ECS Express deployment in AWS. This validates that cross-cloud tools can be used even with more complex agents.&lt;/p&gt;

</description>
      <category>google</category>
      <category>gemini</category>
      <category>ecsexpress</category>
      <category>python</category>
    </item>
    <item>
      <title>The AI Development Stack: Fundamentals Every Developer Should Actually Understand</title>
      <dc:creator>Tomás Garcia</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:53:22 +0000</pubDate>
      <link>https://stormkit.forem.com/toms_garcia_6574fe315ddb/the-ai-development-stack-fundamentals-every-developer-should-actually-understand-5fei</link>
      <guid>https://stormkit.forem.com/toms_garcia_6574fe315ddb/the-ai-development-stack-fundamentals-every-developer-should-actually-understand-5fei</guid>
      <description>&lt;p&gt;Most developers are already using AI tools daily — Copilot, Claude, ChatGPT. But when it comes to &lt;em&gt;building&lt;/em&gt; with AI, there's a gap. Not in tutorials or API docs, but in the foundational mental model of how these systems actually work and fit together.&lt;/p&gt;

&lt;p&gt;This is the stuff I wish someone had laid out clearly when I started building AI-powered features. Not the hype, not the theory — the practical fundamentals that change how you architect, debug, and think about AI systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  Language Models: What's Actually Happening
&lt;/h3&gt;

&lt;p&gt;A Language Model (LM) is a neural network that encodes statistical information about language. Intuitively, it tells you how likely a word is to appear in a given context. Given "my favorite color is ___", a well-trained LM should predict "blue" more often than "car."&lt;/p&gt;

&lt;p&gt;The atomic unit here is the &lt;strong&gt;token&lt;/strong&gt; — which can be a character, a word, or a subword (like "tion") depending on the model's tokenizer.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Large Language Model (LLM)&lt;/strong&gt; is just an LM trained on massive amounts of data using self-supervised learning. The key distinction isn't just scale — it's that at scale, capabilities &lt;em&gt;emerge&lt;/em&gt; that were never explicitly programmed. An LM predicts the next token. An LLM does it at such scale that reasoning, coding, and creative abilities appear as emergent properties.&lt;/p&gt;
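The "predict the next token" idea can be made concrete with a toy bigram model: counting which token tends to follow which in a tiny corpus. This is only a sketch of the statistical intuition; real LMs are neural networks, not count tables.

```python
from collections import Counter, defaultdict

# Toy bigram language model: for each token, count what follows it.
corpus = ("my favorite color is blue . my favorite color is blue . "
          "my favorite car is fast .").split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(prev):
    """Probability distribution over the next token, given the previous one."""
    counts = follows[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(predict("is"))  # 'blue' comes out most likely after 'is'
```

Given "is", the model assigns "blue" a probability of 2/3, matching the "my favorite color is ___" intuition above.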

&lt;p&gt;&lt;strong&gt;Foundation Model (FM)&lt;/strong&gt; is the broadest term. It covers both LLMs (text-only) and Large Multimodal Models (LMMs), which can process text, images, video, audio, and 3D assets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvovffxclp8co618vpfsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvovffxclp8co618vpfsw.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  What Is an Agent?
&lt;/h3&gt;

&lt;p&gt;An agent is a system that uses an LLM to operate in a loop: it reasons about what to do, takes action (tool calls, code execution, API calls), observes the result, and repeats until the task is complete.&lt;/p&gt;

&lt;p&gt;The basic loop looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THINK&lt;/strong&gt; — the agent receives the current context and decides what to do: respond directly, or call a tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ACT&lt;/strong&gt; — if it decided to use a tool, it executes it (web search, DB query, API call).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OBSERVE&lt;/strong&gt; — the result gets added to the context, and the cycle starts again.&lt;/p&gt;

&lt;p&gt;The loop terminates when the model has enough information to give a final answer, or when an external limit is reached (max iterations, timeout).&lt;/p&gt;

&lt;p&gt;This is deceptively simple. But every meaningful AI product you've used — from Claude Code to Cursor to Devin — is some variation of this loop.&lt;/p&gt;
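In code, the think-act-observe loop can be sketched like this. The model (`fake_llm`) and the tool registry are toy stand-ins, not a real provider API; the shape of the loop is the point.

```python
# Minimal sketch of the THINK / ACT / OBSERVE loop described above.
# `fake_llm` and TOOLS are illustrative stand-ins for a real model and tools.

TOOLS = {
    "web_search": lambda query: f"result for {query!r}",
}

def fake_llm(context):
    """Toy 'model': asks for a tool once, then gives a final answer."""
    if not any(m["role"] == "tool" for m in context):
        return {"type": "tool_call", "tool": "web_search",
                "args": ["dollar price today"]}
    return {"type": "answer", "text": "Here is what I found."}

def run_agent(user_message, max_iterations=5):
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):                          # external limit
        decision = fake_llm(context)                         # THINK
        if decision["type"] == "answer":
            return decision["text"]
        result = TOOLS[decision["tool"]](*decision["args"])  # ACT
        context.append({"role": "tool", "content": result})  # OBSERVE
    return "Stopped: iteration limit reached."

print(run_agent("What is the dollar price today?"))  # prints: Here is what I found.
```

Note both termination paths from the text are present: the model deciding it has enough information, and the external iteration cap.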

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhznwgupxm1g05vu4jm1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhznwgupxm1g05vu4jm1u.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Tools: How LLMs Touch the Real World
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;tool&lt;/strong&gt; is an external function that the agent can invoke to interact with the world outside the LLM.&lt;/p&gt;

&lt;p&gt;Here's what's important to understand: the LLM by itself &lt;em&gt;only generates text&lt;/em&gt;. Tools are what let it do real things — fetch live information, read files, execute code, call APIs, write to a database.&lt;/p&gt;

&lt;p&gt;Concrete examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dollar price today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;query_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client@mail.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without tools, an LLM is a very sophisticated autocomplete. With tools, it becomes an agent that can actually operate in your environment.&lt;/p&gt;




&lt;h3&gt;
  
  
  Context: The Model's Working Memory
&lt;/h3&gt;

&lt;p&gt;Context is all the information the agent has "in memory" at a given moment to generate a coherent response. Think of it as a text box the model reads in its entirety on every call.&lt;/p&gt;

&lt;p&gt;It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt; — base instructions defining the model's behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documents&lt;/strong&gt; — reference material injected for the task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User message&lt;/strong&gt; — the actual request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Previous responses&lt;/strong&gt; — conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool results&lt;/strong&gt; — outputs from tool executions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The context has a physical limit called the &lt;strong&gt;context window&lt;/strong&gt;, measured in tokens. Anything that doesn't fit in that window, the model simply &lt;em&gt;doesn't see&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is why context design matters so much when building agents. The system prompt, the conversation history you preserve, what you include and what you drop — all of that directly impacts response quality, latency, and cost.&lt;/p&gt;
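A sketch of assembling the context for a single call, in the order listed above. The message shapes and the 4-characters-per-token heuristic are illustrative, not any specific provider's API.

```python
# Assemble the context for one LLM call, in the order listed above.
# Message dicts are illustrative; real APIs have their own schemas.

def build_context(system_prompt, documents, history, tool_results, user_message):
    messages = [{"role": "system", "content": system_prompt}]
    for doc in documents:                      # injected reference material
        messages.append({"role": "system", "content": f"Reference:\n{doc}"})
    messages.extend(history)                   # previous conversation turns
    for result in tool_results:                # outputs from tool executions
        messages.append({"role": "tool", "content": result})
    messages.append({"role": "user", "content": user_message})
    return messages

def fits_window(messages, token_limit):
    # Crude heuristic: ~4 characters per token. Real code uses a tokenizer.
    estimated_tokens = sum(len(m["content"]) for m in messages) // 4
    return estimated_tokens <= token_limit
```

Anything `build_context` drops, or anything `fits_window` would force you to cut, the model simply never sees, which is the trade-off context design is about.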

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fffvb46aylsx4fgsr62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fffvb46aylsx4fgsr62.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Memory: Beyond the Context Window
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; is the mechanism that allows an agent to access information beyond its context window.&lt;/p&gt;

&lt;p&gt;Two concrete examples you're probably already using:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude.ai&lt;/strong&gt; — at the start of every conversation, the context is empty. What it "remembers" from past chats comes from Anthropic injecting a summary of previous conversations into the context before you start typing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; — when you're working on a project, it reads files like &lt;code&gt;CLAUDE.md&lt;/code&gt;, the directory tree, and relevant codebase files. It doesn't "know" them from memory — it loads them into context when needed, via tools.&lt;/p&gt;

&lt;p&gt;The key insight: there is no magic persistence. Everything the model "remembers" was explicitly loaded into the context window for that specific call.&lt;/p&gt;




&lt;h3&gt;
  
  
  Prompting: The Developer's Primary Interface
&lt;/h3&gt;

&lt;p&gt;Prompting is the skill of giving instructions to an LLM to get the output you want. It's the primary interface between you and the model.&lt;/p&gt;

&lt;p&gt;What the LLM receives isn't just what the user types. A complete message typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt; — base instructions defining behavior, role, constraints, response format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User prompt&lt;/strong&gt; — the user's message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt; — conversation history, tool results, relevant documents, retrieved memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available tools&lt;/strong&gt; — the list of functions the agent can invoke, with their descriptions and parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of that together is what the LLM "reads" before generating its response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core techniques:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-shot&lt;/strong&gt; — you ask directly without examples.&lt;br&gt;
&lt;em&gt;"Translate this text to English"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot&lt;/strong&gt; — you provide examples of expected behavior before the question.&lt;br&gt;
&lt;em&gt;"Input: 'loved it' → Sentiment: positive. Input: 'disgusting' → Sentiment: negative. Input: 'it was okay' → Sentiment:"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chain of thought&lt;/strong&gt; — you ask the model to reason step by step before answering.&lt;br&gt;
&lt;em&gt;"Think step by step before responding"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A practical rule: the model doesn't guess your intent; it only predicts the next token. The clearer and more specific the prompt, the more predictable and useful the output.&lt;/p&gt;
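The three techniques above can be written down as plain prompt strings. The inputs are toy examples and the model call itself is omitted:

```python
# The three core prompting techniques as plain strings (no model call).

# Zero-shot: ask directly, no examples.
zero_shot = "Translate this text to English: 'buenos días'"

# Few-shot: show examples of the expected behavior before the question.
few_shot = (
    "Input: 'loved it' -> Sentiment: positive\n"
    "Input: 'disgusting' -> Sentiment: negative\n"
    "Input: 'it was okay' -> Sentiment:"
)

# Chain of thought: instruct the model to reason step by step first.
chain_of_thought = (
    "A train travels 120 km in 2 hours. What is its speed?\n"
    "Think step by step before responding."
)
```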


&lt;h3&gt;
  
  
  Evals: Testing in a Non-Deterministic World
&lt;/h3&gt;

&lt;p&gt;An LLM is not a deterministic function. The same input can produce different outputs on every run.&lt;/p&gt;

&lt;p&gt;This breaks something fundamental for developers: you can't write an &lt;code&gt;assert&lt;/code&gt; on an LLM's response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# this doesn't work with LLMs
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# it might respond "Paris.", "The capital is Paris", "París"...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conceptual distinction matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt; — verifies that a function produces an exact, predictable output given an input. Pass or fail. Works when the system is deterministic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval&lt;/strong&gt; — measures how &lt;em&gt;good&lt;/em&gt; a response is according to one or more criteria: relevance, coherence, correctness, tone. Produces a score, not a boolean.&lt;/p&gt;

&lt;p&gt;For most open-ended tasks, a perfect reference answer doesn't exist. This led to &lt;strong&gt;AI-as-a-Judge&lt;/strong&gt;, where one AI model evaluates the output of another. It's popular because it's fast, scalable, and can evaluate subjective criteria like creativity or coherence without needing reference text.&lt;/p&gt;

&lt;p&gt;But it has known limitations: AI judges have biases like &lt;strong&gt;position bias&lt;/strong&gt; (favoring the first response in a comparison) and &lt;strong&gt;verbosity bias&lt;/strong&gt; (preferring longer answers even when they contain errors).&lt;/p&gt;
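A toy eval makes the test-versus-eval distinction concrete: instead of asserting an exact string, you score the response against criteria. The criteria and weights here are illustrative; real evals often delegate subjective criteria to an AI judge.

```python
# Toy eval: produce a score, not a boolean. Criteria and weights are
# illustrative stand-ins for a real eval harness or AI judge.

def eval_response(response, must_contain, max_words=50):
    score = 0.0
    if must_contain.lower() in response.lower():
        score += 0.5   # correctness criterion
    if len(response.split()) <= max_words:
        score += 0.5   # conciseness criterion
    return score

# "Paris.", "The capital is Paris", etc. all pass the correctness check.
assert eval_response("The capital is Paris.", "Paris") == 1.0
assert eval_response("I am not sure.", "Paris") == 0.5  # concise but wrong
```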

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6n9dbwwo9oa3h6brqn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6n9dbwwo9oa3h6brqn9.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Guardrails: The Safety Net You Need
&lt;/h3&gt;

&lt;p&gt;Guardrails protect the system both from malicious inputs and problematic outputs.&lt;/p&gt;

&lt;p&gt;They operate in two layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input Guardrails&lt;/strong&gt; prevent prompt injection attacks and filter sensitive data (PII) before it reaches external APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output Guardrails&lt;/strong&gt; verify the model's responses for toxicity, factual inconsistencies, and format errors — typically using a fast classifier or an AI judge before showing the response to the user.&lt;/p&gt;

&lt;p&gt;The reasoning is straightforward: since the LLM is probabilistic, you can't guarantee it will always behave as expected. Guardrails implement checks at both ends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
    ↓
[Input Guardrail]  ← PII, prompt injection, malicious content
    ↓
   LLM
    ↓
[Output Guardrail] ← toxicity, hallucinations, bad formatting
    ↓
Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off: guardrails add latency to every response. It's a cost worth paying for production systems, but you need to be intentional about what you check and how.&lt;/p&gt;
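The two layers can be sketched as a wrapper around the model call. These checks are deliberately naive stand-ins; production guardrails use trained classifiers or an AI judge, and the patterns below are only illustrative.

```python
import re

# Naive guardrail sketches for both ends of the pipeline in the diagram.

def input_guardrail(text):
    """Reject an obvious injection phrase and redact email-like PII."""
    if "ignore previous instructions" in text.lower():
        raise ValueError("possible prompt injection")
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED]", text)

def output_guardrail(text, banned=("credit card",)):
    """Return True if the response passes the output checks."""
    return all(phrase not in text.lower() for phrase in banned)

def guarded_call(llm, user_input):
    safe_input = input_guardrail(user_input)   # input layer
    response = llm(safe_input)                 # model call
    if not output_guardrail(response):         # output layer
        return "Sorry, I can't share that."
    return response
```

Each check runs on every request, which is where the latency cost mentioned above comes from.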




&lt;h3&gt;
  
  
  MCP (Model Context Protocol): The USB-C of AI Tools
&lt;/h3&gt;

&lt;p&gt;Before MCP, if you wanted an agent to use an external tool — say, search Notion, query a database, or read a Google Drive file — you had to implement that integration yourself: authentication, request formatting, error handling, and then describe it to the LLM in the system prompt so it knew how to use it.&lt;/p&gt;

&lt;p&gt;The problem: every agent, every LLM, every app was reimplementing the same integrations from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; is a standard interface between agents and external tools — it defines how an LLM discovers, invokes, and receives results from tools, regardless of who implemented them.&lt;/p&gt;

&lt;p&gt;Two components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt; — exposes tools to the agent. Can be local (a process running on your machine) or remote (a cloud service). Implements the concrete tools: read files, query APIs, execute code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Client&lt;/strong&gt; — the agent or app that consumes the tools. Connects to the server, discovers available tools, and invokes them during the think-act-observe loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Agent / MCP Client]
    ↓  "what tools do you have?"
[MCP Server]
    ↓  "I have: read_file, search_notion, query_db"
[Agent]
    ↓  calls read_file("README.md")
[MCP Server]
    ↓  returns the content
[Agent]  ← adds result to context and continues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code, for example, acts as an MCP Client. You can add MCP Servers with a simple command — &lt;code&gt;claude mcp add server-name&lt;/code&gt; — and from that moment Claude Code has access to whatever tools that server exposes. A Postgres MCP Server gives Claude Code the ability to query your database directly during a development session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeg1kcc1nmig11r9ajaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbeg1kcc1nmig11r9ajaj.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  RAG (Retrieval-Augmented Generation): Grounding Responses in Your Data
&lt;/h3&gt;

&lt;p&gt;The problem: the LLM's knowledge is limited to its training data. It knows nothing about your codebase, your internal docs, real-time data, or anything after its knowledge cutoff date.&lt;/p&gt;

&lt;p&gt;RAG is the pragmatic alternative to retraining: instead of teaching the model your data, you pass it the relevant information in context right before it responds.&lt;/p&gt;

&lt;p&gt;The flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question
    ↓
[Search] ← finds the most relevant fragments
           in a vector database (fed with document chunks)
    ↓
[Augmented context] ← question + relevant fragments
    ↓
   LLM
    ↓
Response grounded in those documents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt; — documents are split into fragments (chunks) and converted into vectors (embeddings) that represent their semantic meaning. Stored in a vector database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt; — when a question arrives, it's also converted into a vector and the most semantically similar fragments are retrieved from the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt; — the retrieved fragments are injected into the LLM's context along with the question, and the model responds based on that information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots over internal documentation or knowledge bases&lt;/li&gt;
&lt;li&gt;Assistants that need real-time information (news, prices, live data)&lt;/li&gt;
&lt;li&gt;Q&amp;amp;A over code, contracts, reports — any data the model doesn't know&lt;/li&gt;
&lt;li&gt;Reducing hallucinations by anchoring responses to concrete sources&lt;/li&gt;
&lt;/ul&gt;
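The three components can be wired together end to end in a toy version. The "embedding" here is just a bag-of-words count vector compared by cosine similarity; real systems use a neural embedding model and a vector database, so this is only a sketch of the flow.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy 'embedding': bag-of-words counts (real systems use a model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: split documents into chunks and index their vectors.
chunks = [
    "Velero backs up Kubernetes clusters.",
    "Our refund policy allows returns within 30 days.",
    "The deploy pipeline pushes images to ECR.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval: find the chunks most similar to the question.
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Generation: inject the retrieved chunks into the LLM's prompt.
question = "What is the refund policy?"
prompt = (
    "Answer using only this context:\n"
    + "\n".join(retrieve(question))
    + f"\n\nQuestion: {question}"
)
```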

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6bxc74hendt3l7pr9wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6bxc74hendt3l7pr9wx.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Developer Interfaces: How You Actually Use LLMs
&lt;/h3&gt;

&lt;p&gt;An LLM can be consumed in different ways depending on the use case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web&lt;/strong&gt; — the most accessible form. Go to a URL, type, get a response. Ideal for exploring, iterating on prompts, or one-off tasks. No code required. Examples: Claude.ai, ChatGPT, Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API&lt;/strong&gt; — the programmatic form. You make an HTTP request and get the response in your code. It's the foundation of any product or agent you build. Gives you full control over the prompt, model, parameters, and integration with your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.anthropic.com/v1/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: &lt;/span&gt;&lt;span class="nv"&gt;$ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "claude-sonnet-4-20250514",
       "messages": [{"role": "user", "content": "Hello"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI (Terminal)&lt;/strong&gt; — command-line tools that wrap the API and let you interact with the LLM from your terminal, integrated into your development workflow. The most relevant example today is Claude Code: an agent that runs in your terminal, has access to your codebase, can read and write files, execute commands, and operates in the think-act-observe loop we already covered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IDE&lt;/strong&gt; — extensions that integrate the LLM directly into your editor. The model sees your code in context and can suggest, complete, refactor, or explain without leaving the environment. Examples: Cursor, GitHub Copilot, or the Claude extension for VS Code.&lt;/p&gt;




&lt;h3&gt;
  
  
  Putting It All Together
&lt;/h3&gt;

&lt;p&gt;None of these concepts exist in isolation. When you use Claude Code to refactor a function, here's what's actually happening: the &lt;strong&gt;LLM&lt;/strong&gt; is processing your request within a &lt;strong&gt;context window&lt;/strong&gt; loaded with your &lt;strong&gt;system prompt&lt;/strong&gt;, codebase files (loaded via &lt;strong&gt;tools&lt;/strong&gt;), and conversation &lt;strong&gt;memory&lt;/strong&gt;. It operates in an &lt;strong&gt;agent loop&lt;/strong&gt; — thinking, acting, observing. The tools it uses to read and write your files might come through &lt;strong&gt;MCP servers&lt;/strong&gt;. If it's pulling in documentation, that might be &lt;strong&gt;RAG&lt;/strong&gt; at work. And somewhere in the pipeline, &lt;strong&gt;guardrails&lt;/strong&gt; are ensuring the outputs are safe.&lt;/p&gt;

&lt;p&gt;Understanding these fundamentals doesn't just help you use AI tools better — it's the foundation for building them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>Velero Going CNCF Isn't About Backup. It's About Control.</title>
      <dc:creator>NTCTech</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:53:01 +0000</pubDate>
      <link>https://stormkit.forem.com/ntctech/velero-going-cncf-isnt-about-backup-its-about-control-3lp7</link>
      <guid>https://stormkit.forem.com/ntctech/velero-going-cncf-isnt-about-backup-its-about-control-3lp7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7htdxap9xlt28vj62nqi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7htdxap9xlt28vj62nqi.jpg" alt="Velero CNCF backup governance shift illustrated as dark server room with purple and cyan gradient lighting overlaid with architectural blueprint grid lines representing Kubernetes control plane authority" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Velero CNCF backup announcement at KubeCon EU 2026 was framed as an open source governance story. Broadcom contributed Velero — its Kubernetes-native backup, restore, and migration tool — to the CNCF Sandbox, where it was accepted by the CNCF Technical Oversight Committee.&lt;/p&gt;

&lt;p&gt;Most coverage treated this as a backup story. It isn't.&lt;/p&gt;

&lt;p&gt;Velero moving to CNCF governance is a control plane story disguised as an open source announcement. And if your team is running stateful workloads on Kubernetes, the distinction between vendor-neutral governance and vendor-independent operations is the architectural decision that sits beneath the headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Velero CNCF Backup Move Actually Means
&lt;/h2&gt;

&lt;p&gt;Velero originated at Heptio — founded by Kubernetes co-creators Joe Beda and Craig McLuckie — which VMware acquired in 2019. It's been under VMware, then Broadcom stewardship ever since. The project operates at the Kubernetes API layer, not the storage layer. All backup operations are defined via CRDs (&lt;code&gt;Backup&lt;/code&gt;, &lt;code&gt;Restore&lt;/code&gt;, &lt;code&gt;Schedule&lt;/code&gt;, &lt;code&gt;BackupStorageLocation&lt;/code&gt;, &lt;code&gt;VolumeSnapshotLocation&lt;/code&gt;) and managed through standard Kubernetes control loops.&lt;/p&gt;
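&lt;p&gt;As a rough sketch, a minimal &lt;code&gt;Backup&lt;/code&gt; resource looks like the following (the resource name and workload namespace are illustrative; &lt;code&gt;velero&lt;/code&gt; is the conventional install namespace):&lt;/p&gt;

```yaml
# Illustrative Velero Backup resource: declarative cluster state, not a disk image.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: nightly-app-backup    # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
    - app                     # hypothetical workload namespace
  storageLocation: default    # references a BackupStorageLocation (an external bucket)
  ttl: 720h0m0s               # retention period
```

Everything here is reconciled by standard Kubernetes control loops, which is exactly why the backup definitions themselves live and die with the cluster's etcd.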

&lt;p&gt;At KubeCon EU, Broadcom formalized the transition: Velero is now a CNCF Sandbox project, with maintainers from Broadcom, Red Hat, and Microsoft.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdijcggn7eijzgx47vh1u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdijcggn7eijzgx47vh1u.jpg" alt="Timeline diagram showing Velero's governance history from Heptio 2017 to VMware acquisition 2019 to Broadcom 2023 to CNCF Sandbox 2026 with purple accent markers" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Broadcom's own framing was telling: &lt;em&gt;"We really don't want people to mistrust the open source project and believe that it's somehow a VMware thing even though it hasn't been a VMware thing for quite some time."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This move is as much about trust repair as governance mechanics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vendor-Neutral ≠ Vendor-Independent
&lt;/h2&gt;

&lt;p&gt;This is the distinction most teams will miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-neutral governance&lt;/strong&gt; means no single vendor controls the roadmap. CNCF governance means Broadcom can no longer make breaking changes to Velero unilaterally. Community-steered, broader contributor base. That's real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-independent operations&lt;/strong&gt; means your recovery path survives without the vendor. That's a different question entirely — and CNCF governance doesn't answer it.&lt;/p&gt;

&lt;p&gt;Your backup storage location is still a cloud bucket outside your cluster. Your IAM credentials still have to reach that bucket. Your restore workflow still depends on a working target cluster. None of those operational dependencies changed on March 24th.&lt;/p&gt;
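&lt;p&gt;That dependency is visible in the configuration itself. A sketch of a &lt;code&gt;BackupStorageLocation&lt;/code&gt; (bucket and region values are placeholders) shows the external bucket and, implicitly, the IAM credentials needed to reach it:&lt;/p&gt;

```yaml
# Illustrative BackupStorageLocation: the backup target lives outside the cluster.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: aws               # object-store plugin in use
  objectStorage:
    bucket: my-velero-bucket  # placeholder; an external cloud bucket
  config:
    region: us-east-1         # placeholder
```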




&lt;h2&gt;
  
  
  The Real Architecture Question
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;When your cluster dies — what actually survives?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Velero operates at the Kubernetes API layer, which makes it a &lt;strong&gt;state reconstruction layer&lt;/strong&gt;, not a storage tool. A Velero backup is a portable snapshot of declarative cluster state — namespaces, CRDs, RBAC policies, PVC claims — not a disk image.&lt;/p&gt;

&lt;p&gt;That portability is the real capability. A backup taken on VKS can theoretically be restored on EKS, AKS, or bare-metal kubeadm — because it operates through the Kubernetes API, not hypervisor-specific snapshots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4w9xfd8qvfz02w51dsi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4w9xfd8qvfz02w51dsi.jpg" alt="Diagram showing Velero operating at Kubernetes API layer between cluster state and object storage, with arrows showing backup flow from CRDs and namespace resources through API to object storage and back on restore" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But state reconstruction has limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What Velero Controls&lt;/th&gt;
&lt;th&gt;What Velero Depends On&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backup Definitions&lt;/td&gt;
&lt;td&gt;CRDs inside cluster&lt;/td&gt;
&lt;td&gt;etcd — gone if cluster is gone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restore Logic&lt;/td&gt;
&lt;td&gt;Velero controller + API server&lt;/td&gt;
&lt;td&gt;Working target cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metadata&lt;/td&gt;
&lt;td&gt;Object metadata, resource specs&lt;/td&gt;
&lt;td&gt;External object storage bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;APIs&lt;/td&gt;
&lt;td&gt;Kubernetes API layer ops&lt;/td&gt;
&lt;td&gt;Cloud IAM for bucket access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Velero cannot bootstrap a cluster from nothing. It cannot authenticate to object storage without valid IAM credentials. It cannot run a restore without a target cluster already operational.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Production Failure Modes
&lt;/h2&gt;

&lt;p&gt;These won't appear in the press releases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;01 / Object Storage Dependency&lt;/strong&gt;&lt;br&gt;
Every backup lands outside your cluster in object storage. Full cluster failure + network partition = recovery blocked, regardless of whether the backup data is intact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;02 / IAM Credential Survivability&lt;/strong&gt;&lt;br&gt;
Velero authenticates via IAM roles, IRSA, or Workload Identity — all provisioned outside Velero itself. If your identity system is compromised or the cloud control plane is unavailable, the data exists but is unreachable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;03 / Restore-Time Complexity&lt;/strong&gt;&lt;br&gt;
Velero restores Kubernetes objects. It does not restore external databases, DNS records, ingress configurations, or certificate bindings. The gap between "backup succeeded" and "system restored" is proportional to how many external dependencies your workloads carry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;04 / Air Gap Theater&lt;/strong&gt;&lt;br&gt;
Velero deployed with on-premises MinIO, backups running, compliance checkbox ticked. The problem: restore still requires live access to that storage endpoint, live IAM credentials, and a functional API server. If those dependencies fail, the air gap was theater. The backup exists. The restore doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhr5vxb472rilwnhcxc5.jpg" alt="Dark moody illustration of a network diagram bisected by a physical wall representing an air gap, with Kubernetes cluster nodes on one side and isolated object storage on the other, but a faint glowing credential key visibly bridging the gap suggesting false isolation" width="800" height="437"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broadcom Signal Worth Reading
&lt;/h2&gt;

&lt;p&gt;Broadcom has been navigating a trust deficit since the VMware acquisition — the pricing restructuring, perpetual license elimination, and VCF bundling created a market perception that it would eventually lock down everything it touched.&lt;/p&gt;

&lt;p&gt;The Velero CNCF contribution is a counter-signal. By relinquishing governance of a project at the center of Kubernetes backup and migration, Broadcom is demonstrating that at least some of its stack is genuinely community-governed.&lt;/p&gt;

&lt;p&gt;It also creates a clean architectural separation: Velero as open, portable, community-governed backup — VKS/VCF as proprietary platform layer. That separation is useful for teams evaluating VMware Cloud Foundation. Your backup portability is no longer contingent on your platform choice.&lt;/p&gt;

&lt;p&gt;That's a genuine architectural benefit — independent of the marketing attached to it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;The CNCF move is real and it matters — but not for the reasons most teams will act on.&lt;/p&gt;

&lt;p&gt;If your concern is Broadcom controlling Velero's roadmap to disadvantage non-VMware users: that concern is now materially reduced. Multi-vendor maintainership and CNCF oversight create real structural separation.&lt;/p&gt;

&lt;p&gt;If your concern is operational — whether Velero works when your cluster is down: the CNCF transition changes nothing. Object storage dependency still exists. IAM credential chain still needs to survive the same incident your cluster didn't. Restore-time complexity is still proportional to your external dependencies.&lt;/p&gt;

&lt;p&gt;The teams that benefit most from this transition are those running multi-distribution environments who hesitated to standardize on Velero because of its VMware lineage. The governance change removes a legitimate organizational objection. The operational architecture still requires the same engineering discipline it always did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CNCF doesn't remove risk. It changes where the risk lives — from project governance to operational design. Most teams haven't engineered the latter. That's the work.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.rack2cloud.com/velero-cncf-backup-control/" rel="noopener noreferrer"&gt;rack2cloud.com&lt;/a&gt; — architecture-first analysis for enterprise infrastructure teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Redis connection monkey patching in Ruby Jungles</title>
      <dc:creator>Roman Tsypuk</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:51:31 +0000</pubDate>
      <link>https://stormkit.forem.com/aws-builders/redis-connection-monkey-patching-in-ruby-jungles-4k7o</link>
      <guid>https://stormkit.forem.com/aws-builders/redis-connection-monkey-patching-in-ruby-jungles-4k7o</guid>
      <description>&lt;p&gt;Some programming languages allow developers to “hack” or extend their internals by overriding existing methods in standard libraries, dynamically attaching new behavior to objects, or modifying classes at runtime.&lt;/p&gt;

&lt;p&gt;One of the languages that strongly embraces this flexibility is &lt;strong&gt;Ruby&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This ability is often referred to as &lt;strong&gt;monkey patching&lt;/strong&gt;, and while it should be used with caution, it can be extremely powerful in real-world scenarios—especially when dealing with legacy systems or unavailable source code.&lt;/p&gt;

&lt;h1&gt;
  
  
  Ruby and Runtime Flexibility
&lt;/h1&gt;

&lt;p&gt;Ruby is a highly dynamic, object-oriented language where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classes can be reopened and modified at any time&lt;/li&gt;
&lt;li&gt;Methods can be overridden or extended dynamically&lt;/li&gt;
&lt;li&gt;Behavior can be injected into existing objects or modules&lt;/li&gt;
&lt;li&gt;Even core classes (like String, Array, etc.) can be modified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes Ruby particularly well-suited for rapid prototyping, metaprogramming, runtime instrumentation, and patching legacy dependencies.&lt;/p&gt;

&lt;p&gt;However, this flexibility comes with responsibility: poorly designed patches can introduce hard-to-debug issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;A simple example of extending a built-in class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;patch&lt;/span&gt;
    &lt;span class="s2"&gt;"---"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upcase&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;"---"&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# rbi&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"aaa"&lt;/span&gt;.patch
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"---AAA---"&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"test"&lt;/span&gt;.patch
&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"---aaa---"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This demonstrates how easily Ruby allows you to modify even core classes like &lt;code&gt;String&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world Example: Patching Redis Connection Pool
&lt;/h2&gt;

&lt;p&gt;I encountered a set of legacy Ruby applications that depended on outdated libraries. These dependencies were no longer available in Git repositories, although prebuilt gems were still stored in an internal artifact repository.&lt;/p&gt;

&lt;p&gt;As part of a Redis migration, I needed to identify all polyglot services connecting to Redis instances. The goal was to introduce a &lt;code&gt;CLIENT_NAME&lt;/code&gt; for every Redis client, regardless of the programming language used.&lt;br&gt;
Most services followed a common project structure with a broadly similar &lt;code&gt;go-lang&lt;/code&gt; stack, but these legacy Ruby services fell outside that landscape.&lt;/p&gt;
&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;No access to source repositories of dependencies&lt;/li&gt;
&lt;li&gt;No explicit Redis connection URLs&lt;/li&gt;
&lt;li&gt;A proprietary “DIY Redis discovery” mechanism&lt;/li&gt;
&lt;li&gt;Redis connections abstracted behind internal libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made it difficult to instrument Redis clients in a standard way.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solution: Monkey Patching
&lt;/h2&gt;

&lt;p&gt;Fortunately, Ruby’s monkey patching capabilities provided a way forward.&lt;/p&gt;

&lt;p&gt;Even without modifying third-party libraries, I was able to intercept Redis connection creation and inject metadata at runtime.&lt;/p&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As soon as a Redis connection is established, annotate it with metadata such as service name, Ruby version, and Redis client version.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Original Connection Code (Simplified):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;RedisConfig&lt;/span&gt;
  &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Connection&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_instance!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;redis&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Patched Implementation
&lt;/h3&gt;

&lt;p&gt;I created a module that overrides the &lt;strong&gt;create_instance!&lt;/strong&gt; method and augments it with additional instrumentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;ServicePatch&lt;/span&gt;
  &lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;RedisMetadataPatch&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_instance!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="n"&gt;set_open_api_metadata!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;blk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;blk&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="kp"&gt;private&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_open_api_metadata!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:setname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'SERVICE_NAME'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:setinfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'LIB-NAME'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"ruby:&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;RUBY_VERSION&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:setinfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'LIB-VER'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;VERSION&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;BaseError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;StandardError&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
      &lt;span class="nb"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"[redis metadata] &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;r_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inspect&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;class&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="kp"&gt;nil&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;RedisConfig&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;singleton_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ServicePatch&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;RedisMetadataPatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;prepend&lt;/code&gt; ensures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The patched method runs before the original implementation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;super&lt;/code&gt; correctly delegates to the original method&lt;/li&gt;
&lt;li&gt;The patch is cleanly layered without modifying the original code&lt;/li&gt;
&lt;/ul&gt;
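&lt;p&gt;A standalone sketch of this lookup behavior (the &lt;code&gt;Greeter&lt;/code&gt; and &lt;code&gt;LoudGreeting&lt;/code&gt; names are illustrative):&lt;/p&gt;

```ruby
# Module#prepend places the module before the class in the ancestor chain,
# so the patched method runs first and `super` reaches the original.
class Greeter
  def greet(name)
    "hello #{name}"
  end
end

module LoudGreeting
  def greet(name)
    super(name).upcase + "!"
  end
end

Greeter.prepend(LoudGreeting)

Greeter.new.greet("ruby") # => "HELLO RUBY!"
```

Had the patch used &lt;code&gt;include&lt;/code&gt; instead, the class's own method would shadow the module's and the patch would never run.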

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After deploying this patch, all Redis clients automatically started reporting metadata.&lt;br&gt;
Here is server-side &lt;strong&gt;Redis&lt;/strong&gt; monitoring output showing how these Ruby services now report their connection names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;valkey.xxxx.xx.xxxx.xxx.cache.amazonaws.com:6379&amp;gt; monitor
OK
1774951026.839060 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="s2"&gt;"3"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api1"&lt;/span&gt;
1774951026.839435 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api1"&lt;/span&gt;
1774951026.840134 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-NAME"&lt;/span&gt; &lt;span class="s2"&gt;"ruby:4.0.1"&lt;/span&gt;
1774951026.840142 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-VER"&lt;/span&gt; &lt;span class="s2"&gt;"5.4.1"&lt;/span&gt;
1774951026.840614 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.95.236:48528] &lt;span class="s2"&gt;"ping"&lt;/span&gt;
1774951031.463576 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="s2"&gt;"3"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api2"&lt;/span&gt;
1774951031.464538 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setname"&lt;/span&gt; &lt;span class="s2"&gt;"service-api1"&lt;/span&gt;
1774951031.468056 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-NAME"&lt;/span&gt; &lt;span class="s2"&gt;"ruby:4.0.1"&lt;/span&gt;
1774951031.468066 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"client"&lt;/span&gt; &lt;span class="s2"&gt;"setinfo"&lt;/span&gt; &lt;span class="s2"&gt;"LIB-VER"&lt;/span&gt; &lt;span class="s2"&gt;"5.4.1"&lt;/span&gt;
1774951031.468728 &lt;span class="o"&gt;[&lt;/span&gt;0 xx.xx.70.215:58252] &lt;span class="s2"&gt;"ping"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Observability Gains
&lt;/h2&gt;

&lt;p&gt;Once the instrumentation was in place, I was able to use a custom Redis client scanner to analyze traffic and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify which services are connected to which Redis instances&lt;/li&gt;
&lt;li&gt;track command usage patterns&lt;/li&gt;
&lt;li&gt;detect idle or misbehaving clients&lt;/li&gt;
&lt;li&gt;correlate activity across polyglot systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────┬──────────────────────┬──────────────────────┬─────────┬───────┬───────┬────────┬────────┬────────┬────────┐
│ Client Addr         │ Name                 │ Lib                  │ Lib Ver │ Age   │ Idle  │    GET │   MGET │    SET │ ZRANGE │
├─────────────────────┼──────────────────────┼──────────────────────┼─────────┼───────┼───────┼────────┼────────┼────────┼────────┤
│ xx.xx.226.123:27613 │ service-api1         │ ruby:4.0.1           │ 5.4.1   │ 27740 │ 14    │      0 │      2 │     12 │      0 │
│ xx.xx.240.240:32031 │ service-api2         │ ruby:4.0.1           │ 5.4.1   │ 89306 │ 1838  │      0 │      8 │     48 │      0 │
│ xx.xx.240.240:41498 │ service-api3         │ ruby:4.0.1           │ 5.4.1   │ 89306 │ 189   │      0 │     13 │     87 │      0 │
│ xx.xx.254.221:58628 │ service-api4         │ ruby:4.0.1           │ 5.4.1   │ 10503 │ 64    │      0 │     11 │     72 │      0 │
│ xx.xx.254.221:9620  │ service-api5         │ ruby:4.0.1           │ 5.4.1   │ 10503 │ 1238  │      0 │      9 │     54 │      0 │
└─────────────────────┴──────────────────────┴──────────────────────┴─────────┴───────┴───────┴────────┴────────┴────────┴────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This approach allowed me to instrument legacy Ruby applications without modifying their dependencies or internal logic. By leveraging Ruby’s dynamic capabilities, I was able to introduce observability into a previously opaque system.&lt;/p&gt;

&lt;p&gt;In environments with legacy constraints, such techniques can turn blockers into manageable engineering problems.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;Ruby&lt;/code&gt; is a very straightforward language to write; some of its ideas have even migrated to &lt;code&gt;kotlin&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ruby-lang.org/en/" rel="noopener noreferrer"&gt;https://www.ruby-lang.org/en/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Azure Kubernetes Security: Checklist and Best Practices</title>
      <dc:creator>Mohamed Amine Hlali</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:47:52 +0000</pubDate>
      <link>https://stormkit.forem.com/mohamed_amine_hlali/azure-kubernetes-security-checklist-and-best-practices-3e89</link>
      <guid>https://stormkit.forem.com/mohamed_amine_hlali/azure-kubernetes-security-checklist-and-best-practices-3e89</guid>
      <description>&lt;p&gt;Kubernetes has become the dominant platform for container orchestration. As cloud-native architecture takes over enterprise IT, securing your Azure Kubernetes Service (AKS) environment is no longer optional it's critical.&lt;/p&gt;

&lt;p&gt;This guide covers everything you need: how AKS security works, the key challenges, best practices, and a production-ready checklist.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Azure Kubernetes Security?
&lt;/h2&gt;

&lt;p&gt;Azure Kubernetes Security is the set of practices, protocols, and tools that protect Kubernetes clusters running on Microsoft Azure. It covers three main areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity &amp;amp; access control&lt;/strong&gt;: who can do what inside the cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network security&lt;/strong&gt;: controlling traffic between pods, namespaces, and external services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous monitoring&lt;/strong&gt;: detecting threats and anomalies in real time&lt;/li&gt;
&lt;/ul&gt;
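&lt;p&gt;To make the network-security area concrete, here is a minimal sketch of a Kubernetes &lt;code&gt;NetworkPolicy&lt;/code&gt; that denies all ingress to a namespace by default (the namespace name is a placeholder):&lt;/p&gt;

```yaml
# Illustrative default-deny ingress policy for one namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app          # placeholder namespace
spec:
  podSelector: {}         # selects every pod in the namespace
  policyTypes:
    - Ingress             # no ingress rules listed, so all ingress is denied
```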




&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;Here are the top reasons AKS security deserves serious attention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Growing threat landscape&lt;/strong&gt;: Kubernetes-specific attacks are increasing as cloud adoption grows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance requirements&lt;/strong&gt;: GDPR, HIPAA, and other regulations mandate proper data protection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High cost of breaches&lt;/strong&gt;: beyond data loss, there are legal fees, fines, and reputational damage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared responsibility model&lt;/strong&gt;: Azure secures the control plane; &lt;em&gt;you&lt;/em&gt; secure the workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices complexity&lt;/strong&gt;: every service-to-service connection is a potential attack vector&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  How AKS Security Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Identity &amp;amp; Access (AAD + RBAC)
&lt;/h3&gt;

&lt;p&gt;Integrate AKS with Azure Active Directory and enforce Role-Based Access Control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-aad&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--aad-admin-group-object-ids&lt;/span&gt; &amp;lt;group-object-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply least-privilege RBAC roles for developers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;developer-readonly&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Network Security: Default Deny
&lt;/h3&gt;

&lt;p&gt;Block all traffic by default, then allow only what's needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deny-all-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Secrets Management with Azure Key Vault
&lt;/h3&gt;

&lt;p&gt;Never store secrets in YAML manifests. Use the Secrets Store CSI Driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secrets-store.csi.x-k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretProviderClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-kvname&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keyvaultName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myKeyVault"&lt;/span&gt;
    &lt;span class="na"&gt;objects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;array:&lt;/span&gt;
        &lt;span class="s"&gt;- |&lt;/span&gt;
          &lt;span class="s"&gt;objectName: mySecret&lt;/span&gt;
          &lt;span class="s"&gt;objectType: secret&lt;/span&gt;
    &lt;span class="na"&gt;tenantId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;tenant-id&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Pod Security Standards
&lt;/h3&gt;

&lt;p&gt;Enforce security at the namespace level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/enforce&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Resource Limits
&lt;/h3&gt;

&lt;p&gt;Prevent resource exhaustion attacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LimitRange&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-limits&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
    &lt;span class="na"&gt;defaultRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Top 5 Best Practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Private Clusters.&lt;/strong&gt; Remove public API server exposure entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myPrivateCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-private-cluster&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enable Defender for Containers.&lt;/strong&gt; Get runtime threat detection at cluster and node level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-defender&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use Managed Identities.&lt;/strong&gt; Eliminate service principal credential management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-managed-identity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Enable Auto-Upgrade.&lt;/strong&gt; Stay patched against known CVEs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks update &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myRG &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--auto-upgrade-channel&lt;/span&gt; stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scan Images in CI/CD.&lt;/strong&gt; Catch vulnerabilities before they reach production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Trivy vulnerability scanner&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myacr.azurecr.io/myapp:latest'&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRITICAL,HIGH'&lt;/span&gt;
    &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  AKS Security Checklist ✅
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Identity &amp;amp; Access
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] AAD integration enabled&lt;/li&gt;
&lt;li&gt;[ ] RBAC with least-privilege roles enforced&lt;/li&gt;
&lt;li&gt;[ ] Managed identities used (no service principal secrets)&lt;/li&gt;
&lt;li&gt;[ ] Workload Identity enabled for pods&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Network
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Private cluster (no public API server)&lt;/li&gt;
&lt;li&gt;[ ] Default-deny NetworkPolicies applied&lt;/li&gt;
&lt;li&gt;[ ] Azure Firewall / NSGs configured&lt;/li&gt;
&lt;li&gt;[ ] Authorized IP ranges set for API access&lt;/li&gt;
&lt;/ul&gt;
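
&lt;p&gt;For the authorized IP ranges item above, a sketch with the Azure CLI (the CIDR shown is a placeholder; substitute your office or VPN range):&lt;/p&gt;

```shell
# Restrict API server access to a known CIDR (placeholder range shown)
az aks update \
  --resource-group myRG \
  --name myAKSCluster \
  --api-server-authorized-ip-ranges 203.0.113.0/24
```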

&lt;h3&gt;
  
  
  Workloads
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Pod Security Standards enforced (restricted)&lt;/li&gt;
&lt;li&gt;[ ] All containers run as non-root&lt;/li&gt;
&lt;li&gt;[ ] Read-only root filesystem where possible&lt;/li&gt;
&lt;li&gt;[ ] CPU/memory limits defined for all containers&lt;/li&gt;
&lt;/ul&gt;
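
&lt;p&gt;The workload items above can be expressed directly in a pod spec. This is a minimal sketch (image and names are placeholders) combining non-root execution, a read-only root filesystem, and explicit limits:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                       # placeholder name
  namespace: production
spec:
  containers:
  - name: app
    image: myacr.azurecr.io/myapp:latest   # placeholder image
    securityContext:
      runAsNonRoot: true                   # checklist: containers run as non-root
      readOnlyRootFilesystem: true         # checklist: read-only root filesystem
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"                        # checklist: CPU/memory limits defined
        memory: "512Mi"
```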

&lt;h3&gt;
  
  
  Secrets &amp;amp; Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] No secrets in manifests or images&lt;/li&gt;
&lt;li&gt;[ ] Azure Key Vault integrated via CSI Driver&lt;/li&gt;
&lt;li&gt;[ ] etcd encryption at rest enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Microsoft Defender for Containers enabled&lt;/li&gt;
&lt;li&gt;[ ] Kubernetes audit logs → Log Analytics&lt;/li&gt;
&lt;li&gt;[ ] Azure Policy for Kubernetes applied&lt;/li&gt;
&lt;li&gt;[ ] Image scanning in CI/CD pipeline&lt;/li&gt;
&lt;li&gt;[ ] Auto-upgrade channel configured&lt;/li&gt;
&lt;/ul&gt;
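
&lt;p&gt;For the audit-log item, one way to route control-plane logs to Log Analytics is a diagnostic setting. This is a sketch: the workspace ID is a placeholder, and the log categories available can vary by cluster version:&lt;/p&gt;

```shell
# Send kube-audit logs to a Log Analytics workspace (placeholder workspace ID)
AKS_ID=$(az aks show --resource-group myRG --name myAKSCluster --query id -o tsv)
az monitor diagnostic-settings create \
  --name aks-audit \
  --resource "$AKS_ID" \
  --workspace "<log-analytics-workspace-id>" \
  --logs '[{"category":"kube-audit","enabled":true}]'
```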




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AKS security is a continuous practice, not a one-time configuration. The platform gives you a strong foundation with its managed control plane and native integrations, but workload security is your responsibility.&lt;/p&gt;

&lt;p&gt;Start with the basics: private clusters, AAD + RBAC, Key Vault for secrets, and Defender for monitoring. Then build on that foundation with network policies, pod security standards, and automated image scanning.&lt;/p&gt;

&lt;p&gt;The checklist above is a solid starting point for any production AKS deployment.&lt;/p&gt;




</description>
      <category>azure</category>
      <category>kubernetes</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>~21 tok/s Gemma 4 on a Ryzen mini PC: llama.cpp, Vulkan, and the messy truth about local chat</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:46:26 +0000</pubDate>
      <link>https://stormkit.forem.com/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</link>
      <guid>https://stormkit.forem.com/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</guid>
      <description>&lt;p&gt;Hands-on guide based on a real setup: &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, &lt;strong&gt;AMD Radeon 760M&lt;/strong&gt; (Ryzen iGPU), &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. 96 GiB), &lt;strong&gt;llama.cpp&lt;/strong&gt; built with &lt;strong&gt;GGML_VULKAN&lt;/strong&gt;, OpenAI-compatible API via &lt;strong&gt;llama-server&lt;/strong&gt;, &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker, and &lt;strong&gt;OpenCode&lt;/strong&gt; or &lt;strong&gt;VS Code&lt;/strong&gt; (§11) using the same API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; if you buy (or plan to buy) a &lt;strong&gt;mini PC&lt;/strong&gt; or small tower with &lt;strong&gt;plenty of RAM and disk&lt;/strong&gt;, this walkthrough gets you to &lt;strong&gt;local inference&lt;/strong&gt; — GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is &lt;strong&gt;AMD iGPU + Vulkan&lt;/strong&gt;; if your hardware differs, keep the &lt;strong&gt;Ubuntu → llama.cpp → weights → server&lt;/strong&gt; flow and adjust &lt;strong&gt;§5–§6&lt;/strong&gt; (deps and build) for your GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference hardware (validated while writing this guide):&lt;/strong&gt; &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; mini PC (&lt;em&gt;Device Type: MINI PC&lt;/em&gt; on the chassis label; vendor &lt;strong&gt;Minisforum&lt;/strong&gt; / &lt;strong&gt;Micro Computer (HK) Tech Limited&lt;/strong&gt;) with &lt;strong&gt;AMD Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M Graphics&lt;/strong&gt;, &lt;strong&gt;96 GiB&lt;/strong&gt; &lt;strong&gt;DDR5&lt;/strong&gt; RAM, &lt;strong&gt;~1 TiB&lt;/strong&gt; NVMe, &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;. This is not a minimum-requirements bar—it &lt;strong&gt;anchors&lt;/strong&gt; compile times, download comfort, and token throughput vs other CPUs, RAM, or disks. To &lt;strong&gt;verify memory type and size&lt;/strong&gt; on your box, see §3 (&lt;em&gt;Quick hardware inventory&lt;/em&gt;). A &lt;strong&gt;photo of the box&lt;/strong&gt; is at the &lt;strong&gt;end&lt;/strong&gt;, under Closing thoughts.&lt;/p&gt;

&lt;p&gt;Replace &lt;code&gt;YOUR_USER&lt;/code&gt;, model paths, and hostname as needed. If the machine is &lt;strong&gt;server-only&lt;/strong&gt; (no monitor), start with §4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" alt="local LLM stack on Ubuntu — reference illustration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Too long; didn’t read&lt;/em&gt; — a one-minute skim before the full guide. The full table of contents follows below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What you’re building:&lt;/strong&gt; &lt;strong&gt;local&lt;/strong&gt; inference on &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; + &lt;strong&gt;Vulkan&lt;/strong&gt;, a &lt;strong&gt;GGUF&lt;/strong&gt; weights file, OpenAI-style API via &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;:8080&lt;/code&gt;&lt;/strong&gt;); optional &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker (&lt;strong&gt;&lt;code&gt;:3000&lt;/code&gt;&lt;/strong&gt;); &lt;strong&gt;OpenCode&lt;/strong&gt; and &lt;strong&gt;Visual Studio Code&lt;/strong&gt; can talk to the same &lt;strong&gt;&lt;code&gt;http://…:8080/v1&lt;/code&gt;&lt;/strong&gt; base URL as an OpenAI-compatible provider (&lt;strong&gt;§11&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortest path:&lt;/strong&gt; &lt;strong&gt;BIOS/UMA&lt;/strong&gt; if relevant (§2) → deps + &lt;strong&gt;Vulkan&lt;/strong&gt; (§5) → build &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6) → download &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt; (§7: &lt;strong&gt;&lt;code&gt;wget --continue&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;; &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; for long SSH sessions) → smoke-test &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; → run &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; manually or under &lt;strong&gt;systemd&lt;/strong&gt; (§8–§9) → point &lt;strong&gt;Open WebUI&lt;/strong&gt; at the host (§10) → &lt;strong&gt;optional:&lt;/strong&gt; &lt;strong&gt;OpenCode&lt;/strong&gt; / &lt;strong&gt;VS Code&lt;/strong&gt; (§11).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight RAM / OOM:&lt;/strong&gt; same user as the service; match &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;; if it fails, drop &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt;) before chasing &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;. Don’t &lt;strong&gt;enable&lt;/strong&gt; the unit until the GGUF is &lt;strong&gt;fully&lt;/strong&gt; downloaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More models:&lt;/strong&gt; §7 covers &lt;strong&gt;Gemma 4&lt;/strong&gt;, &lt;strong&gt;Qwen Coder&lt;/strong&gt;, &lt;strong&gt;DeepSeek Lite&lt;/strong&gt;, &lt;strong&gt;Llama 3.1&lt;/strong&gt; (downloads, &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;, quick tests).&lt;/li&gt;
&lt;li&gt;Swap in &lt;strong&gt;&lt;code&gt;YOUR_USER&lt;/code&gt;&lt;/strong&gt;, paths, and hostname; &lt;strong&gt;server-only&lt;/strong&gt; box → start at §4.&lt;/li&gt;
&lt;/ul&gt;
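
&lt;p&gt;Once &lt;code&gt;llama-server&lt;/code&gt; is up (§8), a quick smoke test of the OpenAI-compatible endpoint looks like this; host, port, and the &lt;code&gt;model&lt;/code&gt; value are placeholders for your setup:&lt;/p&gt;

```shell
# Chat-completions smoke test against a local llama-server (placeholder host/model)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}]
      }'
```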

&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Links jump to headings on GitHub, Cursor, and most Markdown viewers. If a link does not match your renderer, search for the heading in the file.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TL;DR&lt;/li&gt;
&lt;li&gt;1. Context and choices&lt;/li&gt;
&lt;li&gt;2. BIOS (before or right after installing Ubuntu)&lt;/li&gt;
&lt;li&gt;
3. Installing Ubuntu

&lt;ul&gt;
&lt;li&gt;Quick hardware inventory (optional)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

4. Ubuntu Server without a desktop (headless)

&lt;ul&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)&lt;/li&gt;
&lt;li&gt;Rest of this guide&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;5. Base dependencies and Vulkan check&lt;/li&gt;

&lt;li&gt;

6. Building llama.cpp with Vulkan

&lt;ul&gt;
&lt;li&gt;Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

7. GGUF models and paths

&lt;ul&gt;
&lt;li&gt;What GGUF is (name, role, trade-offs)&lt;/li&gt;
&lt;li&gt;Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)&lt;/li&gt;
&lt;li&gt;Where models live and how to list them&lt;/li&gt;
&lt;li&gt;Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)&lt;/li&gt;
&lt;li&gt;Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)&lt;/li&gt;
&lt;li&gt;Example: local Llama 3.1 8B Instruct Q8_0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)&lt;/li&gt;
&lt;li&gt;Quick terminal test&lt;/li&gt;
&lt;li&gt;Adding or switching models&lt;/li&gt;
&lt;li&gt;Experimenting with more models: setup, testing, and limits&lt;/li&gt;
&lt;li&gt;One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)&lt;/li&gt;
&lt;li&gt;Common steps (every model swap)&lt;/li&gt;
&lt;li&gt;Reference table (repos + sample file)&lt;/li&gt;
&lt;li&gt;Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)&lt;/li&gt;
&lt;li&gt;Per-model quick test (right after download)&lt;/li&gt;
&lt;li&gt;Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)&lt;/li&gt;

&lt;li&gt;9. systemd service (start on boot)&lt;/li&gt;

&lt;li&gt;

10. Open WebUI with Docker (port 3000 → backend on 8080)

&lt;ul&gt;
&lt;li&gt;Connect Open WebUI to llama-server&lt;/li&gt;
&lt;li&gt;Chat up and running (example)&lt;/li&gt;
&lt;li&gt;No browsing or GitHub fetch: real limits (and confident wrong answers)&lt;/li&gt;
&lt;li&gt;Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed&lt;/li&gt;
&lt;li&gt;“Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)&lt;/li&gt;
&lt;li&gt;Updating Open WebUI (Docker)&lt;/li&gt;
&lt;li&gt;If you also run Ollama&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;OpenCode&lt;/li&gt;
&lt;li&gt;Visual Studio Code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04

&lt;ul&gt;
&lt;li&gt;12.1 Universe repository and packages&lt;/li&gt;
&lt;li&gt;12.2 LunarG repository (Vulkan SDK)&lt;/li&gt;
&lt;li&gt;12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc&lt;/li&gt;
&lt;li&gt;12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

13. Performance and models (rough guide)

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)&lt;/li&gt;
&lt;li&gt;AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

14. Remote desktop (Ubuntu 24.04 Desktop, LAN)

&lt;ul&gt;
&lt;li&gt;14.1 Enable on the mini PC&lt;/li&gt;
&lt;li&gt;14.2 Connect from another machine&lt;/li&gt;
&lt;li&gt;14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;14.4 If connection fails&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Final checklist&lt;/li&gt;

&lt;li&gt;Quick port reference&lt;/li&gt;

&lt;li&gt;Closing thoughts&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Context and choices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04 LTS (desktop or server; server without a GUI saves RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD iGPU&lt;/td&gt;
&lt;td&gt;Vulkan + Mesa is usually simpler than ROCm for llama.cpp inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GGUF&lt;/strong&gt; format; Q4_K_M quantization (balance) or Q8_0 (higher quality, larger).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; with &lt;code&gt;-DGGML_VULKAN=1&lt;/code&gt; uses the &lt;strong&gt;GPU&lt;/strong&gt; for layers (&lt;code&gt;-ngl&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lots of RAM&lt;/td&gt;
&lt;td&gt;You can load large models in system RAM even if the iGPU has little dedicated VRAM; the BIOS can give the GPU a larger framebuffer (see §2).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
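
&lt;p&gt;The build flag in the table maps to a short CMake invocation (the full walkthrough is in §6); this sketch assumes the toolchain and Vulkan dev packages from §5 are already in place:&lt;/p&gt;

```shell
# Build llama.cpp with the Vulkan backend (see §5 for dependencies, §6 for details)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j "$(nproc)"
```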

&lt;p&gt;Reference diagram (browser / container / host):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" alt="Reference diagram (browser / container / host)" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" alt="Illustration: browser and IDE → Open WebUI container → llama-server and GGUF on the host" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. BIOS (before or right after installing Ubuntu)
&lt;/h2&gt;

&lt;p&gt;On &lt;strong&gt;Minisforum&lt;/strong&gt; boxes (e.g. &lt;strong&gt;UM760 Slim&lt;/strong&gt;) with AMI BIOS and Ryzen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter BIOS (&lt;strong&gt;Del&lt;/strong&gt;, &lt;strong&gt;F2&lt;/strong&gt;, or &lt;strong&gt;F7&lt;/strong&gt; on many systems).&lt;/li&gt;
&lt;li&gt;Typical path: &lt;strong&gt;Advanced → AMD CBS → NBIO Common Options → GFX Configuration&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;UMA Frame Buffer Size&lt;/strong&gt; (or similar) from &lt;em&gt;Auto&lt;/em&gt; / 2 GiB to &lt;strong&gt;8 G&lt;/strong&gt; or &lt;strong&gt;16 G&lt;/strong&gt; if available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Goal: give the iGPU more unified memory for model layers; with plenty of system RAM the trade-off is usually worth it.&lt;/p&gt;
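
&lt;p&gt;After rebooting into Ubuntu you can sanity-check what the driver actually got; a rough sketch (the exact kernel log wording varies by &lt;code&gt;amdgpu&lt;/code&gt; version):&lt;/p&gt;

```shell
# amdgpu usually logs the detected VRAM/UMA size at boot
sudo dmesg | grep -i vram
# remaining system RAM after the UMA carve-out
free -h
```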




&lt;h2&gt;
  
  
  3. Installing Ubuntu
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Enable &lt;strong&gt;third-party software&lt;/strong&gt; for graphics and Wi‑Fi if you use the graphical installer.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;minimal&lt;/strong&gt; install skips extra packages, a good fit if the box is mainly an inference server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical order of this guide (§4 and §10 are optional depending on your setup):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" alt="Tipical installation steps" width="646" height="1250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick hardware inventory (optional)
&lt;/h3&gt;

&lt;p&gt;Before picking huge models and quantizations, check &lt;strong&gt;RAM&lt;/strong&gt;, &lt;strong&gt;disk on &lt;code&gt;/&lt;/code&gt;&lt;/strong&gt;, and whether the &lt;strong&gt;integrated GPU&lt;/strong&gt; shows up on the PCI bus (this does not replace a Vulkan test, but it sets expectations).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lspci | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'vga|3d|display'&lt;/span&gt;
free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for in &lt;code&gt;lspci&lt;/code&gt;:&lt;/strong&gt; on &lt;strong&gt;Ryzen Phoenix / Hawk Point&lt;/strong&gt; boards you often see something like &lt;strong&gt;&lt;code&gt;VGA compatible controller: … Phoenix1&lt;/code&gt;&lt;/strong&gt; plus an AMD &lt;strong&gt;HDMI audio&lt;/strong&gt; line. The marketing name “Radeon 760M” may not appear verbatim; the real check is that an &lt;strong&gt;AMD VGA/Display&lt;/strong&gt; controller exists and that &lt;strong&gt;&lt;code&gt;vulkaninfo&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; see &lt;strong&gt;RADV&lt;/strong&gt; (§4–§5).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;free&lt;/code&gt;:&lt;/strong&gt; total and &lt;strong&gt;available&lt;/strong&gt; RAM tell you how large a GGUF you can keep &lt;strong&gt;comfortably&lt;/strong&gt; in memory alongside the OS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;df&lt;/code&gt;:&lt;/strong&gt; each &lt;code&gt;.gguf&lt;/code&gt; costs whatever its model card lists (e.g. ~8 GiB for an 8B Q8_0); leave headroom for updates, Docker, and rebuilds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DDR4 vs DDR5 (re-check RAM type):&lt;/strong&gt; the data comes from the firmware &lt;strong&gt;SMBIOS&lt;/strong&gt; tables, read with &lt;code&gt;dmidecode&lt;/code&gt; (install it with &lt;strong&gt;&lt;code&gt;sudo apt install -y dmidecode&lt;/code&gt;&lt;/strong&gt; if needed). &lt;strong&gt;Note:&lt;/strong&gt; some &lt;code&gt;dmidecode&lt;/code&gt; builds indent fields with &lt;strong&gt;spaces&lt;/strong&gt;, not tabs—an overly strict &lt;code&gt;grep&lt;/code&gt; can print &lt;strong&gt;nothing&lt;/strong&gt; even when DMI works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One line per interesting field (tab- or space-indented)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s1"&gt;'Locator|Size:|Type:|Speed:|Configured Memory Speed:'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that is still empty, dump the start of the table—some boards expose only a subset of fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each populated slot, &lt;strong&gt;&lt;code&gt;Type:&lt;/code&gt;&lt;/strong&gt; should read &lt;strong&gt;&lt;code&gt;DDR5&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;DDR4&lt;/code&gt;&lt;/strong&gt;, etc. All-&lt;strong&gt;&lt;code&gt;Unknown&lt;/code&gt;&lt;/strong&gt; or an empty dump may mean a &lt;strong&gt;locked&lt;/strong&gt; BIOS, a &lt;strong&gt;hypervisor&lt;/strong&gt; restriction, or firmware that needs an update—cross-check the &lt;strong&gt;mini PC spec sheet&lt;/strong&gt; or &lt;strong&gt;DIMM/SODIMM silkscreen/label&lt;/strong&gt;. &lt;strong&gt;Ryzen 7040&lt;/strong&gt; mobile (e.g. 7640HS) is usually &lt;strong&gt;DDR5-only&lt;/strong&gt; on recent kits; still verify through one of these paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Ubuntu Server without a desktop (headless)
&lt;/h2&gt;

&lt;p&gt;When the mini PC only serves the model (SSH + browser on another machine), &lt;strong&gt;Ubuntu Server 24.04 LTS&lt;/strong&gt; saves RAM and attack surface by skipping GNOME and desktop services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download the &lt;strong&gt;Ubuntu Server&lt;/strong&gt; ISO from &lt;a href="https://ubuntu.com/download/server" rel="noopener noreferrer"&gt;ubuntu.com/download/server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;In the installer, enable &lt;strong&gt;OpenSSH&lt;/strong&gt; for remote administration.&lt;/li&gt;
&lt;li&gt;Create a normal user with &lt;code&gt;sudo&lt;/code&gt; (this guide assumes that user’s &lt;code&gt;$HOME&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;BIOS (§2) is configured the same as on a desktop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;After first boot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open only what you need in the firewall (e.g. SSH, and later 8080/3000 if not using VPN only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; ufw
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow OpenSSH
&lt;span class="c"&gt;# Optional: sudo ufw allow 8080/tcp &amp;amp;&amp;amp; sudo ufw allow 3000/tcp&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)
&lt;/h3&gt;

&lt;p&gt;Server images have no display server by default: &lt;strong&gt;you cannot run &lt;code&gt;vkcube&lt;/code&gt;&lt;/strong&gt; unless you add a minimal GUI just for that test. To validate Vulkan from the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-tools
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for:&lt;/strong&gt; besides the instance version (e.g. &lt;code&gt;Vulkan Instance Version: 1.4.x&lt;/code&gt;), the &lt;strong&gt;&lt;code&gt;Devices:&lt;/code&gt;&lt;/strong&gt; section should list &lt;strong&gt;your AMD GPU&lt;/strong&gt; (&lt;code&gt;deviceName&lt;/code&gt; like &lt;em&gt;Radeon …&lt;/em&gt;, &lt;code&gt;deviceType&lt;/code&gt; &lt;em&gt;INTEGRATED_GPU&lt;/em&gt; or &lt;em&gt;DISCRETE_GPU&lt;/em&gt;, &lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt; on AMD hardware).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world sample (trimmed):&lt;/strong&gt; you often see the instance and a long extension list first; &lt;code&gt;Devices:&lt;/code&gt; comes later. As a &lt;strong&gt;normal user&lt;/strong&gt; you may see &lt;strong&gt;only&lt;/strong&gt; a software device:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vulkan Instance Version: 1.4.313
...
Devices:
========
GPU0:
    apiVersion         = 1.4.318
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …, 256 bits)
    driverName         = llvmpipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Same machine, but &lt;code&gt;sudo&lt;/code&gt; shows the Radeon:&lt;/strong&gt; if your user only gets &lt;code&gt;llvmpipe&lt;/code&gt; but &lt;strong&gt;root&lt;/strong&gt; sees e.g. &lt;strong&gt;GPU0&lt;/strong&gt; &lt;code&gt;AMD Radeon 760M Graphics (RADV PHOENIX)&lt;/code&gt; (&lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt;, &lt;code&gt;INTEGRATED_GPU&lt;/code&gt;) &lt;strong&gt;and&lt;/strong&gt; &lt;strong&gt;GPU1&lt;/strong&gt; &lt;code&gt;llvmpipe&lt;/code&gt;, the kernel and Mesa are fine; your user lacks &lt;strong&gt;permission&lt;/strong&gt; on the DRM nodes (&lt;code&gt;/dev/dri/renderD*&lt;/code&gt;). You should &lt;strong&gt;not&lt;/strong&gt; run &lt;code&gt;llama-server&lt;/code&gt; as root long-term to “fix” Vulkan—fix group membership instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;groups&lt;/span&gt;                    &lt;span class="c"&gt;# should include render and video&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /dev/dri/
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; render,video &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out of the desktop session or reboot, then (tighter grep: a broad&lt;/span&gt;
&lt;span class="c"&gt;# GPU|deviceName|deviceType pattern may also match layer descriptions containing "GPU"):&lt;/span&gt;
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'^GPU[0-9]+:|^[[:space:]]+device(Name|Type)'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output without &lt;code&gt;sudo&lt;/code&gt;&lt;/strong&gt; (RADV as &lt;strong&gt;GPU0&lt;/strong&gt;, &lt;code&gt;llvmpipe&lt;/code&gt; as an extra device):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 20.1.2, 256 bits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical “before” example:&lt;/strong&gt; if &lt;code&gt;groups&lt;/code&gt; &lt;strong&gt;does not&lt;/strong&gt; list &lt;code&gt;render&lt;/code&gt; or &lt;code&gt;video&lt;/code&gt;, and you only see entries like &lt;code&gt;adm cdrom sudo dip plugdev users lpadmin docker&lt;/code&gt;, that matches “Vulkan as your user = llvmpipe only; as root = RADV + llvmpipe”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After &lt;code&gt;usermod&lt;/code&gt;:&lt;/strong&gt; the command may print nothing, but &lt;strong&gt;your already-running session keeps the old group set&lt;/strong&gt;—&lt;code&gt;groups&lt;/code&gt; in the same shell will not change until you &lt;strong&gt;log out and back in&lt;/strong&gt; (or &lt;strong&gt;reboot&lt;/strong&gt;). Start a fresh login session (a new SSH connection works) and check again; &lt;strong&gt;&lt;code&gt;id -nG&lt;/code&gt;&lt;/strong&gt; is a handy way to list all group names. For a quick test without logging out of the whole session: &lt;strong&gt;&lt;code&gt;newgrp render&lt;/code&gt;&lt;/strong&gt; (spawns a subshell with that group active; fine for testing only).&lt;/p&gt;
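&lt;p&gt;The membership check can be condensed into a one-liner that only reads state and prints a reminder when the groups are not active yet (a sketch; adjust the group names if your distribution uses different ones):&lt;br&gt;
&lt;/p&gt;

```shell
# Print render/video if active in the current session, otherwise a reminder:
id -nG | tr ' ' '\n' | grep -E '^(render|video)$' || echo "render/video not active in this session"
# Quick test without a full re-login (interactive subshell; testing only):
#   newgrp render
```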

&lt;p&gt;On Ubuntu 24.04 the groups are usually &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt;. Once the new session includes them, &lt;code&gt;vulkaninfo&lt;/code&gt; &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;sudo&lt;/code&gt; should list the AMD device as well as &lt;code&gt;llvmpipe&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A healthy summary often has the Radeon as &lt;strong&gt;GPU0&lt;/strong&gt; and llvmpipe as an extra entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    vendorID           = 0x1002
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
    driverName         = radv
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Only &lt;code&gt;llvmpipe&lt;/code&gt; even as root:&lt;/strong&gt; then &lt;code&gt;llvmpipe&lt;/code&gt; / &lt;code&gt;PHYSICAL_DEVICE_TYPE_CPU&lt;/code&gt; is &lt;strong&gt;CPU-only&lt;/strong&gt; Vulkan (Mesa) and the iGPU is not in the Vulkan device list. Check &lt;code&gt;lspci -nn | grep -i vga&lt;/code&gt;, the &lt;strong&gt;&lt;code&gt;amdgpu&lt;/code&gt;&lt;/strong&gt; module, &lt;code&gt;mesa-vulkan-drivers&lt;/code&gt;, and BIOS. On very minimal servers the render stack may still need setup before Vulkan enumerates the chip.&lt;/p&gt;
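&lt;p&gt;A minimal version of that checklist, written so each step degrades gracefully when a tool is missing (package name as on Ubuntu 24.04):&lt;br&gt;
&lt;/p&gt;

```shell
# iGPU visible on the PCI bus?
lspci -nn 2>/dev/null | grep -iE 'vga|display' || echo "no VGA/Display controller listed"
# Kernel driver loaded?
lsmod 2>/dev/null | grep -i amdgpu || echo "amdgpu module not loaded"
# RADV userspace package installed?
dpkg -l mesa-vulkan-drivers 2>/dev/null | grep '^ii' || echo "mesa-vulkan-drivers not installed"
```

If all three pass but &lt;code&gt;vulkaninfo&lt;/code&gt; still only shows &lt;code&gt;llvmpipe&lt;/code&gt; as root, revisit the BIOS settings from §2.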

&lt;h3&gt;
  
  
  Rest of this guide
&lt;/h3&gt;

&lt;p&gt;Install the same packages as §5, build llama.cpp in §6, and use &lt;strong&gt;Open WebUI from another PC&lt;/strong&gt; at &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;. Docker + &lt;code&gt;llama-server&lt;/code&gt; does not require a graphical session on the server.&lt;/p&gt;
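&lt;p&gt;Before pointing Open WebUI at the box, a quick reachability probe helps (assumes a &lt;code&gt;llama-server&lt;/code&gt; instance is already running on port 8080, e.g. via the §9 service, and that your build exposes the &lt;code&gt;/health&lt;/code&gt; endpoint; replace &lt;code&gt;127.0.0.1&lt;/code&gt; with the server IP when testing from another machine):&lt;br&gt;
&lt;/p&gt;

```shell
# Failed/empty output means the server is not up or the port is closed:
curl -s --max-time 2 http://127.0.0.1:8080/health || echo "llama-server not reachable on 8080"
```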




&lt;h2&gt;
  
  
  5. Base dependencies and Vulkan check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential cmake git libvulkan-dev vulkan-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm the GPU is visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vkcube
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A window with a spinning cube should open. Close it when done.&lt;/p&gt;

&lt;p&gt;If &lt;strong&gt;vkcube&lt;/strong&gt; works but &lt;code&gt;vulkaninfo --summary&lt;/code&gt; as your user still shows only &lt;code&gt;llvmpipe&lt;/code&gt;, add the same &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; groups as in §4 (and log out/in).&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Building llama.cpp with Vulkan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;cmake&lt;/strong&gt; fails with &lt;em&gt;Could NOT find Vulkan&lt;/em&gt; or &lt;em&gt;missing: glslc&lt;/em&gt;, go to §12 (common on Ubuntu 24.04).&lt;/p&gt;

&lt;h3&gt;
  
  
  Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Newer GGUF architectures&lt;/strong&gt; (Gemma 4, recent MoE builds, etc.) often need a &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Before blaming the weight file, update the tree and rebuild the &lt;strong&gt;same &lt;code&gt;build&lt;/code&gt;&lt;/strong&gt; folder (or wipe &lt;code&gt;build&lt;/code&gt; and rerun CMake if CMakeLists changed a lot):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
git pull
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; changes CMake heavily and linking fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rebuilding, if you use &lt;strong&gt;§9&lt;/strong&gt;, restart so the service picks up new binaries: &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;. Check &lt;code&gt;journalctl -u llama-web.service -n 30 --no-pager&lt;/code&gt; if a GGUF is rejected.&lt;/p&gt;

&lt;p&gt;Useful binaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-cli&lt;/code&gt; — terminal tests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-server&lt;/code&gt; — HTTP API compatible with OpenAI-style clients.&lt;/li&gt;
&lt;/ul&gt;
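&lt;p&gt;A quick smoke test for the freshly built binary (the &lt;code&gt;BIN&lt;/code&gt; and &lt;code&gt;MODEL&lt;/code&gt; values are examples, not fixed names; &lt;code&gt;-ngl 99&lt;/code&gt; asks llama.cpp to offload as many layers as possible to the GPU):&lt;br&gt;
&lt;/p&gt;

```shell
# Adjust both paths to your checkout and downloaded model:
BIN="$HOME/llama.cpp/build/bin/llama-cli"
MODEL="$HOME/models/model-name.gguf"
if [ -x "$BIN" ]; then
  "$BIN" -m "$MODEL" -ngl 99 -n 32 -p "Say hello in five words."
else
  echo "build llama.cpp first (§6), then point MODEL at a real GGUF (§7)"
fi
```

Watch the startup log: lines mentioning the Vulkan device confirm the GPU path is in use rather than CPU-only inference.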




&lt;h2&gt;
  
  
  7. GGUF models and paths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What GGUF is (name, role, trade-offs)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GGUF&lt;/strong&gt; (&lt;strong&gt;G&lt;/strong&gt;GML &lt;strong&gt;U&lt;/strong&gt;niversal &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;F&lt;/strong&gt;ormat) is a &lt;strong&gt;single-file&lt;/strong&gt; container aimed at &lt;strong&gt;inference&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; and friends: it packs &lt;strong&gt;weights&lt;/strong&gt; in a tensor layout tuned for efficient loading, &lt;strong&gt;metadata&lt;/strong&gt;, and—in practice—what you need to &lt;strong&gt;tokenize&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; the model without pulling in the full PyTorch/JAX training stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters here:&lt;/strong&gt; you download a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;, pass its path as &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; to &lt;code&gt;llama-cli&lt;/code&gt; / &lt;code&gt;llama-server&lt;/code&gt;, and the engine runs &lt;strong&gt;locally&lt;/strong&gt; (CPU, and in this guide &lt;strong&gt;Vulkan&lt;/strong&gt; on the GPU). You do not need the original framework runtime just to serve the converted file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typical upsides:&lt;/strong&gt; &lt;strong&gt;one portable blob&lt;/strong&gt;; &lt;strong&gt;quantized&lt;/strong&gt; variants (Q4_K_M, Q8_0, IQ*, …) trade a bit of quality for &lt;strong&gt;disk / RAM / VRAM&lt;/strong&gt;; &lt;strong&gt;huge Hugging Face catalog&lt;/strong&gt; (community repos such as &lt;em&gt;TheBloke&lt;/em&gt;, &lt;em&gt;bartowski&lt;/em&gt;, Unsloth, …); first-class support in &lt;strong&gt;llama.cpp&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; &lt;strong&gt;quality&lt;/strong&gt; depends on &lt;strong&gt;quant level&lt;/strong&gt; and conversion tooling; &lt;strong&gt;brand-new&lt;/strong&gt; architectures may need a &lt;strong&gt;fresh llama.cpp build&lt;/strong&gt; or lack mature GGUFs yet; &lt;strong&gt;training / fine-tuning&lt;/strong&gt; usually happens elsewhere, then you &lt;strong&gt;convert/export&lt;/strong&gt; to GGUF; it is not a full cloud SaaS substitute without extra plumbing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of this section assumes a &lt;strong&gt;ready-to-run GGUF&lt;/strong&gt;; paths and downloads always point at that file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)
&lt;/h3&gt;

&lt;p&gt;Repos list GGUFs with prefixes like &lt;strong&gt;Q2_&lt;/strong&gt;, &lt;strong&gt;Q3_&lt;/strong&gt;, &lt;strong&gt;Q4_&lt;/strong&gt;, &lt;strong&gt;Q5_&lt;/strong&gt;, &lt;strong&gt;Q6_&lt;/strong&gt;, &lt;strong&gt;Q8_&lt;/strong&gt; and cousins (&lt;strong&gt;IQ2_&lt;/strong&gt;, &lt;strong&gt;IQ3_&lt;/strong&gt;, …). Naming is not one single marketing standard, but &lt;strong&gt;in practice&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Q&lt;/strong&gt; and &lt;strong&gt;number&lt;/strong&gt; hint at &lt;strong&gt;quantization depth&lt;/strong&gt;—roughly how many &lt;strong&gt;bits&lt;/strong&gt; are used for weights (&lt;strong&gt;simplified&lt;/strong&gt;). &lt;strong&gt;Lower&lt;/strong&gt; → &lt;strong&gt;smaller&lt;/strong&gt; file, less &lt;strong&gt;RAM/VRAM&lt;/strong&gt;, sometimes &lt;strong&gt;more&lt;/strong&gt; quality loss; &lt;strong&gt;higher&lt;/strong&gt; (e.g. &lt;strong&gt;Q8&lt;/strong&gt;) → heavier and often closer to “full” model behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suffixes&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;_K_M&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_S&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_L&lt;/code&gt;&lt;/strong&gt;, … are &lt;strong&gt;llama.cpp k-quant&lt;/strong&gt; schemes: they &lt;strong&gt;mix&lt;/strong&gt; layers/blocks at different precisions to balance quality vs size—it is not “literally 4-bit everything.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IQ&lt;/strong&gt; (&lt;em&gt;imatrix&lt;/em&gt; / importance-weighted) lines aim for &lt;strong&gt;aggressive&lt;/strong&gt; compression while protecting weights that matter most for output quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For this guide:&lt;/strong&gt; &lt;strong&gt;Q4_K_M&lt;/strong&gt; is a common &lt;strong&gt;sweet spot&lt;/strong&gt; for &lt;strong&gt;disk&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and &lt;strong&gt;quality&lt;/strong&gt;; &lt;strong&gt;Q8_0&lt;/strong&gt;-class files if you favor quality and have RAM to spare. If names feel overwhelming, sort by &lt;strong&gt;MiB/GiB&lt;/strong&gt; under the repo’s &lt;em&gt;Files&lt;/em&gt; tab and pick the largest file that &lt;strong&gt;fits&lt;/strong&gt; your machine comfortably.&lt;/li&gt;
&lt;/ul&gt;
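&lt;p&gt;A back-of-the-envelope size estimate can be sketched from that rule (simplified: real GGUFs add metadata and mix block precisions, and ~4.8 bits per weight for Q4_K_M is only an approximation):&lt;br&gt;
&lt;/p&gt;

```shell
# size_GiB is roughly params_in_billions * bits_per_weight / 8
awk 'BEGIN { printf "8B at ~4.8 bpw (Q4_K_M-ish): ~%.1f GiB\n", 8 * 4.8 / 8 }'
```

The same arithmetic at ~8.5 bpw lands near the ~8 GiB figure quoted above for an 8B Q8_0.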

&lt;p&gt;&lt;strong&gt;Hugging Face CLI (&lt;code&gt;huggingface-cli&lt;/code&gt;):&lt;/strong&gt; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; ships &lt;em&gt;externally managed&lt;/em&gt; system Python (&lt;strong&gt;PEP 668&lt;/strong&gt;), so &lt;strong&gt;&lt;code&gt;python3 -m pip install …&lt;/code&gt; fails&lt;/strong&gt; with &lt;code&gt;externally-managed-environment&lt;/code&gt;. Prefer a small &lt;strong&gt;virtualenv&lt;/strong&gt; for this tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This guide uses &lt;strong&gt;&lt;code&gt;$HOME/.venv/huggingface&lt;/code&gt;&lt;/strong&gt;: install &lt;strong&gt;&lt;code&gt;python3-venv&lt;/code&gt;&lt;/strong&gt;, create the venv &lt;strong&gt;once&lt;/strong&gt;, then run &lt;strong&gt;&lt;code&gt;source …/bin/activate&lt;/code&gt;&lt;/strong&gt; before &lt;code&gt;pip&lt;/code&gt; / &lt;code&gt;huggingface-cli&lt;/code&gt;, or call &lt;strong&gt;&lt;code&gt;"$HOME/.venv/huggingface/bin/huggingface-cli"&lt;/code&gt;&lt;/strong&gt; directly.&lt;/li&gt;
&lt;li&gt;Avoid &lt;strong&gt;&lt;code&gt;--break-system-packages&lt;/code&gt;&lt;/strong&gt; unless you understand the risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;pipx install 'huggingface_hub[cli]'&lt;/code&gt;&lt;/strong&gt; (after &lt;strong&gt;&lt;code&gt;sudo apt install pipx&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;pipx ensurepath&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use one consistent directory (avoid mixing &lt;code&gt;~/models&lt;/code&gt; and &lt;code&gt;llama.cpp/models&lt;/code&gt; by mistake):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where models live and how to list them
&lt;/h3&gt;

&lt;p&gt;llama.cpp has &lt;strong&gt;no&lt;/strong&gt; built-in model catalog: a model is a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; file&lt;/strong&gt;. You always pass the path with &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (absolute paths are best in &lt;code&gt;systemd&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the usual folder:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.gguf 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that prints nothing, you may still have GGUFs elsewhere (Downloads, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search under your home (limited depth, faster):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sort by size:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\t%p\n'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Open WebUI does &lt;strong&gt;not&lt;/strong&gt; enumerate “every GGUF on disk”. What matters is whichever file &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; loads via &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;. To “use another model”, change that &lt;code&gt;-m&lt;/code&gt; (and restart the process or service §9), or run &lt;strong&gt;another&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; on &lt;strong&gt;another&lt;/strong&gt; port (advanced; not detailed here).&lt;/p&gt;

&lt;p&gt;Generic example (swap the URL for the file link under the repo’s &lt;em&gt;Files&lt;/em&gt; tab on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/model-name.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/ORG/REPO/resolve/main/file.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)
&lt;/h3&gt;

&lt;p&gt;Recent quantized model (&lt;strong&gt;Apache 2.0&lt;/strong&gt;), &lt;strong&gt;Gemma 4&lt;/strong&gt; / MoE architecture; a good fit for machines with &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. ~96 GiB). Full file list and sizes: &lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Reasonable disk/RAM use: &lt;strong&gt;Q4_K_M&lt;/strong&gt; (~17 GiB per the model card). Maximum quality in this repo: &lt;strong&gt;Q8_0&lt;/strong&gt; (~27 GiB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; you need a &lt;strong&gt;recent llama.cpp&lt;/strong&gt; with Gemma 4 support (before building: &lt;code&gt;cd llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt;). If loading the GGUF reports architecture or tokenizer errors, update and rebuild (§6).&lt;/p&gt;

&lt;p&gt;Recommended download (Q4_K_M):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher-quality option (Q8_0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q8_0.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Equivalent using &lt;a href="https://huggingface.co/docs/huggingface_hub/guides/cli" rel="noopener noreferrer"&gt;huggingface-cli&lt;/a&gt; (handy for resumable downloads):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Hugging Face the model is tagged &lt;strong&gt;Image-Text-to-Text&lt;/strong&gt;; for text-only chat, &lt;code&gt;llama-server&lt;/code&gt; / Open WebUI usually work with the GGUF and embedded template. If message formatting breaks, check the &lt;em&gt;Prompt format&lt;/em&gt; section on the model card.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resolve/main/...&lt;/code&gt; URLs can break if files are renamed; if so, open the repo and copy the &lt;em&gt;download&lt;/em&gt; link for the exact &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; when running &lt;code&gt;llama-cli&lt;/code&gt; or &lt;code&gt;llama-server&lt;/code&gt;, use the real path to the &lt;code&gt;.gguf&lt;/code&gt; (absolute or relative to your current working directory).&lt;/p&gt;
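&lt;p&gt;A path sanity check before wiring the file into a command or service (the filename is the Q4_K_M example from above; substitute your own):&lt;br&gt;
&lt;/p&gt;

```shell
MODEL="$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"
ls -lh "$MODEL" 2>/dev/null || echo "wrong path or incomplete download: $MODEL"
```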

&lt;h3&gt;
  
  
  Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;very large&lt;/strong&gt; MoE (~32 B activated params / 1 T total per the model card). Community GGUFs: &lt;a href="https://huggingface.co/unsloth/Kimi-K2-Instruct-0905-GGUF" rel="noopener noreferrer"&gt;unsloth/Kimi-K2-Instruct-0905-GGUF&lt;/a&gt;. Run guide and flags: &lt;a href="https://docs.unsloth.ai/basics/kimi-k2" rel="noopener noreferrer"&gt;Unsloth — Kimi K2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware warning:&lt;/strong&gt; Unsloth’s README recommends &lt;strong&gt;≥ 128 GB unified RAM&lt;/strong&gt; even for “small” quants. Boxes in the ~64–80 GiB range may &lt;strong&gt;fail to load&lt;/strong&gt;, run &lt;strong&gt;very slowly&lt;/strong&gt;, or thrash &lt;strong&gt;swap&lt;/strong&gt;—treat it as an experiment (see §7 &lt;em&gt;Experimenting with more models&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face:&lt;/strong&gt; access may be &lt;strong&gt;gated&lt;/strong&gt;; sign in, accept terms on the model page, and use &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt; if required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards:&lt;/strong&gt; each quantization lives in a folder (&lt;code&gt;UD-TQ1_0/&lt;/code&gt;, &lt;code&gt;UD-IQ1_S/&lt;/code&gt;, &lt;code&gt;IQ4_XS/&lt;/code&gt;, …) with files like &lt;code&gt;…-00001-of-00006.gguf&lt;/code&gt; and so on. Download &lt;strong&gt;every&lt;/strong&gt; &lt;code&gt;.gguf&lt;/code&gt; in &lt;strong&gt;that&lt;/strong&gt; folder. For &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; must point at the &lt;strong&gt;first&lt;/strong&gt; shard (&lt;code&gt;…-00001-of-….gguf&lt;/code&gt;); current &lt;code&gt;llama.cpp&lt;/code&gt; loaders pick up sibling shards in the same directory.&lt;/p&gt;

&lt;p&gt;Download &lt;strong&gt;one&lt;/strong&gt; folder (example &lt;strong&gt;UD-TQ1_0&lt;/strong&gt;, six parts; confirm names under &lt;em&gt;Files&lt;/em&gt; on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli login    &lt;span class="c"&gt;# if token or gated access is required&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
huggingface-cli download unsloth/Kimi-K2-Instruct-0905-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"UD-TQ1_0/*.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other folders in the same repo are other quants (more disk / more quality). Pick based on &lt;strong&gt;free disk&lt;/strong&gt; and &lt;strong&gt;RAM&lt;/strong&gt;.&lt;/p&gt;
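Since `-m` must point at the first shard, a small helper saves retyping the long filename. A sketch, assuming Unsloth-style `*-00001-of-*.gguf` names as above; `first_shard` is illustrative:

```shell
# Sketch: pick the first shard in a quant folder for -m.
# Assumes shard names matching *-00001-of-*.gguf (the convention shown above).
first_shard() {
  # expand the glob into positional params; if nothing matches, -f fails
  set -- "$1"/*-00001-of-*.gguf
  [ -f "$1" ] && printf '%s\n' "$1"
}

# demo with a throwaway layout:
d=$(mktemp -d)
touch "$d/model-00001-of-00006.gguf" "$d/model-00002-of-00006.gguf"
first_shard "$d"
```

Then something like `-m "$(first_shard "$HOME/models/kimi-k2-0905/UD-TQ1_0")"` works regardless of the exact shard count.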

&lt;p&gt;Before loading: &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; and rebuild &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6). Short smoke test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905/UD-TQ1_0/Kimi-K2-Instruct-0905-UD-TQ1_0-00001-of-00006.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say hi in one sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;; on architecture/tokenizer errors, update and rebuild. For &lt;strong&gt;§9&lt;/strong&gt; / Open WebUI, &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; uses the same path to the &lt;strong&gt;first&lt;/strong&gt; shard; read the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from &lt;code&gt;/v1/models&lt;/code&gt; via &lt;code&gt;curl&lt;/code&gt; once &lt;code&gt;llama-server&lt;/code&gt; is up for &lt;em&gt;Model IDs&lt;/em&gt;.&lt;/p&gt;
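Reading the `id` from `/v1/models` can be scripted. In this sketch the `curl` call is replaced by a canned response so the parsing is visible; the `id` value is made up:

```shell
# Sketch: extract the model id from llama-server's /v1/models JSON.
# Real use: json=$(curl -s http://127.0.0.1:8080/v1/models)
json='{"object":"list","data":[{"id":"kimi-k2-0905","object":"model"}]}'
model_id=$(printf '%s' "$json" |
  python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')
echo "$model_id"
```

Paste that `id` into Open WebUI's *Model IDs* field when the picker stays empty.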

&lt;h3&gt;
  
  
  Example: local Llama 3.1 8B Instruct Q8_0
&lt;/h3&gt;

&lt;p&gt;If you already have e.g. &lt;strong&gt;&lt;code&gt;$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf&lt;/code&gt;&lt;/strong&gt; (~8 GiB on disk), &lt;strong&gt;replace&lt;/strong&gt; every &lt;code&gt;-m&lt;/code&gt; path in this guide with yours. &lt;strong&gt;Q8_0&lt;/strong&gt; favors quality over speed; for higher &lt;strong&gt;tok/s&lt;/strong&gt; on an iGPU, try a &lt;strong&gt;Q4_K_M&lt;/strong&gt; in the same model family.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)
&lt;/h3&gt;

&lt;p&gt;Use this to compare runs on &lt;strong&gt;the same machine&lt;/strong&gt; with different &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; values, different GGUFs, or different builds (CPU vs Vulkan), without UI noise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify the binary&lt;/strong&gt; exists (size and date are useful hints; both should refresh after each rebuild):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; build/bin/llama-bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If it is &lt;strong&gt;missing&lt;/strong&gt;, rebuild the project (§6); most full builds already include &lt;code&gt;llama-bench&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flags&lt;/strong&gt; change across versions—always start from help:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="nt"&gt;--help&lt;/span&gt; | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimal example&lt;/strong&gt; (swap the path):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;: path to the &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;: GPU layers; many builds accept &lt;strong&gt;&lt;code&gt;999&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; as “as many as possible”. If rejected, try &lt;strong&gt;&lt;code&gt;35&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;45&lt;/code&gt;&lt;/strong&gt;, etc., and increase until loading fails or throughput drops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;: generated tokens per benchmark run (increase for longer, more stable runs).&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reading output:&lt;/strong&gt; you usually see &lt;em&gt;prompt processing&lt;/em&gt; vs &lt;em&gt;generation&lt;/em&gt; tok/s. If numbers are tiny and logs show &lt;strong&gt;no&lt;/strong&gt; Vulkan / &lt;code&gt;ggml_vulkan&lt;/code&gt;, the binary might lack &lt;code&gt;GGML_VULKAN&lt;/code&gt;, or &lt;code&gt;/dev/dri&lt;/code&gt; permissions were wrong at build/run time (§4).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fair comparisons:&lt;/strong&gt; same &lt;code&gt;llama-bench&lt;/code&gt; build, same model, same &lt;code&gt;-n&lt;/code&gt;, only change &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; or the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
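A fair comparison is easy to script: pin the model and `-n`, vary only `-ngl`. Shown here as a dry run (each command is only echoed) so nothing heavy starts by accident; the path is the example model from this section:

```shell
# Sketch: sweep -ngl with everything else pinned. Remove 'echo' to run for real.
MODEL="$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf"
NGL_VALUES="0 20 35 45 999"
for ngl in $NGL_VALUES; do
  echo ./build/bin/llama-bench -m "$MODEL" -ngl "$ngl" -n 128
done
```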

&lt;p&gt;&lt;strong&gt;Sample real output&lt;/strong&gt; (same command order as above; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt;, &lt;strong&gt;Radeon 760M RADV&lt;/strong&gt;, &lt;strong&gt;Llama 3.1 8B Instruct Q8_0&lt;/strong&gt;; numbers shift with BIOS, thermals, quantization, and llama.cpp revision):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           pp512 |        235.96 ± 0.19 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           tg128 |          9.80 ± 0.00 |

build: 4d688f9eb (8016)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;ggml_vulkan&lt;/code&gt;&lt;/strong&gt; lines show &lt;strong&gt;one&lt;/strong&gt; Vulkan device and that the bench is on &lt;strong&gt;RADV&lt;/strong&gt; (not &lt;code&gt;llvmpipe&lt;/code&gt; only). Errors or zero devices → revisit §4–§5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;: prompt processing — tok/s for a ~512-token prefill; usually &lt;strong&gt;higher&lt;/strong&gt; than generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;: token generation — tok/s while emitting &lt;strong&gt;128&lt;/strong&gt; output tokens; closest bench metric to “reply speed” in chat. Here ≈&lt;strong&gt;9.8 t/s&lt;/strong&gt; for Q8_0 on this iGPU.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; line is your llama.cpp &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; commit; it changes after &lt;code&gt;git pull&lt;/code&gt; + rebuild.&lt;/li&gt;
&lt;/ul&gt;
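If you run the bench often, the `tg128` figure can be pulled out of the table programmatically. In this sketch, `sample` mimics one row of the output above; pipe real `llama-bench` output instead:

```shell
# Sketch: extract the tok/s number from a llama-bench tg128 row with awk.
# Fields are split on '|'; column 7 is the test name, column 8 the t/s cell.
sample='| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           tg128 |          9.80 ± 0.00 |'
tg=$(printf '%s\n' "$sample" | awk -F'|' '$7 ~ /tg128/ {print $8}' | awk '{print $1}')
echo "$tg"   # → 9.80
```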

&lt;p&gt;&lt;strong&gt;Another sample&lt;/strong&gt; (&lt;strong&gt;same mini PC class&lt;/strong&gt;, &lt;strong&gt;Gemma 4 26B&lt;/strong&gt; Instruct &lt;strong&gt;Q4_K_M&lt;/strong&gt; — the model this guide uses in many examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           pp512 |        239.04 ± 1.97 |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           tg128 |         20.94 ± 0.02 |

build: d12cc3d1c (8720)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;gemma4 ?B&lt;/code&gt;&lt;/strong&gt; label is &lt;strong&gt;cosmetic&lt;/strong&gt; on some &lt;code&gt;llama-bench&lt;/code&gt; builds; trust &lt;strong&gt;size&lt;/strong&gt; (~&lt;strong&gt;15.85 GiB&lt;/strong&gt;), &lt;strong&gt;params&lt;/strong&gt; (~&lt;strong&gt;25.23 B&lt;/strong&gt;), and your &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What this run says:&lt;/strong&gt; with &lt;strong&gt;Vulkan&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt; 999&lt;/strong&gt;, expect on the order of &lt;strong&gt;~239 tok/s&lt;/strong&gt; for &lt;strong&gt;prefill&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;) and &lt;strong&gt;~21 tok/s&lt;/strong&gt; for &lt;strong&gt;generation&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;). That &lt;strong&gt;~21 t/s&lt;/strong&gt; is the most useful single number for “raw” reply speed (no Open WebUI overhead, no long reasoning block, no huge prompts); real chat often lands near this ballpark or a bit lower.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Other GGUFs&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt;&lt;/strong&gt;, or &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; revisions will move &lt;strong&gt;&lt;code&gt;tg*&lt;/code&gt;&lt;/strong&gt; a lot; record your own table after major changes.&lt;/p&gt;
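Recording your own table can be one command per run. A sketch; `LOG` and `log_run` are illustrative (here `LOG` is a temp file — point it at a persistent path for real use):

```shell
# Sketch: append bench results to a small CSV so each run is logged consistently.
LOG=$(mktemp -u)   # illustrative; use e.g. "$HOME/llama-bench-log.csv" for real
log_run() {  # usage: log_run <model-label> <ngl> <tg_tok_s>
  [ -f "$LOG" ] || echo "date,model,ngl,tg_tok_s" > "$LOG"
  echo "$(date +%F),$1,$2,$3" >> "$LOG"
}

log_run gemma4-26B-Q4_K_M 999 20.94   # figures from the sample run above
cat "$LOG"
```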

&lt;h3&gt;
  
  
  Quick terminal test
&lt;/h3&gt;

&lt;p&gt;From the &lt;code&gt;llama.cpp&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemma 4 and on-screen reasoning (&lt;code&gt;[Start thinking]&lt;/code&gt; … &lt;code&gt;[End thinking]&lt;/code&gt;):&lt;/strong&gt; many &lt;strong&gt;Instruct&lt;/strong&gt; GGUFs emit a “thinking” block before the final answer. On a &lt;strong&gt;recent &lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;code&gt;--help&lt;/code&gt; normally documents the following flags (verify with &lt;code&gt;./build/bin/llama-cli --help | grep -iE 'reason|think|template'&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-rea, --reasoning on|off|auto&lt;/code&gt;&lt;/strong&gt; — default &lt;strong&gt;&lt;code&gt;auto&lt;/code&gt;&lt;/strong&gt; (template decides). For &lt;strong&gt;clean screenshots&lt;/strong&gt;, use &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (short &lt;strong&gt;&lt;code&gt;-rea off&lt;/code&gt;&lt;/strong&gt; if your build prints it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-budget N&lt;/code&gt;&lt;/strong&gt; — &lt;strong&gt;&lt;code&gt;0&lt;/code&gt;&lt;/strong&gt; ends the thinking block immediately; &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; is unrestricted. Pair with &lt;strong&gt;&lt;code&gt;off&lt;/code&gt;&lt;/strong&gt; if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--chat-template-kwargs STRING&lt;/code&gt;&lt;/strong&gt; — JSON for the template parser (e.g. &lt;strong&gt;&lt;code&gt;'{"enable_thinking": false}'&lt;/code&gt;&lt;/strong&gt; in bash with outer single quotes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-format FORMAT&lt;/code&gt;&lt;/strong&gt; — tag handling / extraction (DeepSeek-style paths); &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; is usually enough for Gemma in interactive CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Screenshot-friendly example (same command as above + reasoning disabled):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reference run&lt;/strong&gt; (validated hardware in the intro; &lt;strong&gt;no&lt;/strong&gt; &lt;code&gt;[Start thinking]&lt;/code&gt; block; &lt;strong&gt;t/s&lt;/strong&gt; are indicative):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" alt="llama-cli: Gemma 4 26B Q4_K_M with  raw `--reasoning off` endraw , one-sentence answer and prompt/generation **t/s**." width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also export the env vars mentioned in &lt;code&gt;--help&lt;/code&gt; (&lt;strong&gt;&lt;code&gt;LLAMA_ARG_REASONING&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;LLAMA_ARG_THINK_BUDGET&lt;/code&gt;&lt;/strong&gt;, …) if you prefer not to repeat flags.&lt;/p&gt;
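For example (variable names taken from `--help` as quoted above; verify on your build, since they change across llama.cpp versions):

```shell
# Sketch: set the reasoning behavior once per shell instead of repeating flags.
# Names per the --help output referenced above; confirm on your own build.
export LLAMA_ARG_REASONING=off
export LLAMA_ARG_THINK_BUDGET=0
env | grep '^LLAMA_ARG_'
```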

&lt;p&gt;For &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (§8–§9), add the same switches to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--reasoning-budget 0&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--chat-template-kwargs …&lt;/code&gt;&lt;/strong&gt;) as your binary supports. If &lt;strong&gt;nothing&lt;/strong&gt; disables it, try another GGUF/variant, or another model for a one-off capture (e.g. Llama in this same §7).&lt;/p&gt;

&lt;p&gt;Example with a local &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; (single-turn demo; chat template depends on the GGUF). An overly vague &lt;strong&gt;&lt;code&gt;-p&lt;/code&gt;&lt;/strong&gt; (“summarize llama.cpp”) may yield “I don’t have that information”; give &lt;strong&gt;context&lt;/strong&gt; in the question (e.g. open-source inference, GGUF, local execution).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in exactly one sentence: What does the llama.cpp project do for running language models locally?"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual reference screenshot&lt;/strong&gt; (same &lt;strong&gt;validated&lt;/strong&gt; hardware in the intro: Ryzen 5 &lt;strong&gt;7640HS&lt;/strong&gt;, Radeon &lt;strong&gt;760M&lt;/strong&gt;, &lt;strong&gt;DDR5&lt;/strong&gt;; &lt;strong&gt;t/s&lt;/strong&gt; varies with thermals, BIOS, and &lt;code&gt;llama.cpp&lt;/code&gt; commit):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" alt="llama-cli: Llama 3.1 8B Instruct Q8_0 — answer about llama.cpp and prompt/generation t/s." width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt; / &lt;code&gt;999&lt;/code&gt;&lt;/strong&gt;: tries to offload many layers to the GPU; on large models or a small unified VRAM budget you may need to &lt;strong&gt;lower&lt;/strong&gt; &lt;code&gt;-ngl&lt;/code&gt; or increase the BIOS framebuffer (§2).&lt;/li&gt;
&lt;li&gt;On startup, look for lines like &lt;code&gt;ggml_vulkan:&lt;/code&gt; and your GPU name (e.g. Radeon 760M) to confirm Vulkan.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Adding or switching models
&lt;/h3&gt;

&lt;p&gt;Each &lt;strong&gt;additional model&lt;/strong&gt; you want to run—another family, quantization, or file from Hugging Face—is &lt;strong&gt;one&lt;/strong&gt; more &lt;code&gt;.gguf&lt;/code&gt; in your folder (e.g. &lt;code&gt;$HOME/models&lt;/code&gt;). ML slang often says &lt;strong&gt;“weights”&lt;/strong&gt; for the &lt;strong&gt;trained parameters&lt;/strong&gt; inside that file; here it is enough to think “another &lt;code&gt;.gguf&lt;/code&gt;.” The flow is always &lt;strong&gt;download → test → point the server&lt;/strong&gt; at that path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; using the same pattern as above (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, or the repo’s &lt;em&gt;download&lt;/em&gt; link on Hugging Face).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test in the terminal&lt;/strong&gt; with &lt;code&gt;llama-cli -m "$HOME/models/your-new-file.gguf"&lt;/code&gt; (like the quick test). If the architecture is brand new and load fails, update and rebuild llama.cpp (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual &lt;code&gt;llama-server&lt;/code&gt; (§8):&lt;/strong&gt; stop the process (&lt;strong&gt;Ctrl+C&lt;/strong&gt;) and start it again with &lt;code&gt;-m&lt;/code&gt; pointing at the new file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd service (§9):&lt;/strong&gt; edit &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, change only the &lt;code&gt;-m /full/path/new.gguf&lt;/code&gt; argument inside &lt;code&gt;ExecStart&lt;/code&gt;, save, then run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI (§10):&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loads &lt;strong&gt;one&lt;/strong&gt; model at a time (whichever you set at startup). After restarting the service, reload the UI; the model dropdown may show the filename or a generic label (&lt;code&gt;default&lt;/code&gt;), depending on the version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode / VS Code (§11):&lt;/strong&gt; same host and port (&lt;code&gt;…:8080/v1&lt;/code&gt;); in editors use the server IP or &lt;code&gt;127.0.0.1&lt;/code&gt; depending on where the IDE runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Serving &lt;strong&gt;several models at once&lt;/strong&gt; requires multiple &lt;code&gt;llama-server&lt;/code&gt; processes on &lt;strong&gt;different ports&lt;/strong&gt; (and matching entries in Open WebUI or more containers); that advanced layout is not spelled out here.&lt;/p&gt;
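For orientation only, the shape of that layout (a dry run; the model filenames are placeholders, and dropping the `echo` would actually launch the servers):

```shell
# Sketch: one llama-server process per model, each on its own port (placeholders).
PORT_A=8080
PORT_B=8081
echo ./build/bin/llama-server -m "$HOME/models/model-a.gguf" --port "$PORT_A"
echo ./build/bin/llama-server -m "$HOME/models/model-b.gguf" --port "$PORT_B"
```

Each port then gets its own connection entry in Open WebUI (or its own container).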

&lt;h3&gt;
  
  
  Experimenting with more models: setup, testing, and limits
&lt;/h3&gt;

&lt;p&gt;If you want to &lt;strong&gt;try multiple GGUFs&lt;/strong&gt;, follow a clear flow and know your hardware ceiling—this avoids pointless downloads and false “it’s broken” moments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended flow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check disk and RAM&lt;/strong&gt; (&lt;code&gt;free -h&lt;/code&gt;, &lt;code&gt;df -h /&lt;/code&gt;, §3). Each quantization costs what the model card says; keep headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; when the model is new (§6, &lt;em&gt;Update and rebuild&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; into &lt;code&gt;$HOME/models&lt;/code&gt; (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test&lt;/strong&gt; with &lt;code&gt;llama-cli&lt;/code&gt; and &lt;strong&gt;short&lt;/strong&gt; generations; confirm &lt;code&gt;ggml_vulkan&lt;/code&gt; if the GPU should participate (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional:&lt;/strong&gt; &lt;code&gt;llama-bench&lt;/code&gt; with the same &lt;code&gt;-ngl&lt;/code&gt; you plan for production to compare quantizations (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change &lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; in &lt;strong&gt;§9&lt;/strong&gt; (or manual §8), &lt;code&gt;daemon-reload&lt;/code&gt; + &lt;code&gt;restart&lt;/code&gt;, then &lt;strong&gt;&lt;code&gt;curl /v1/models&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;Open WebUI&lt;/strong&gt; (Admin → Connections; &lt;strong&gt;Model IDs&lt;/strong&gt; if needed).&lt;/li&gt;
&lt;/ol&gt;
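Step 1 above can be partly automated. A rough sketch (Linux-only, reads `MemAvailable`; `check_fit` is illustrative and ignores context/KV-cache overhead, so keep extra headroom):

```shell
# Sketch: compare a file's size against currently free RAM (MemAvailable).
# Rough: ignores KV cache and OS churn; treat "fits" as "worth trying".
check_fit() {
  need_kb=$(( $(stat -c %s "$1") / 1024 ))
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  if [ "$need_kb" -lt "$avail_kb" ]; then
    echo "fits (${need_kb} KiB needed)"
  else
    echo "too big (${need_kb} KiB needed)"
  fi
}

check_fit /bin/sh   # demo on a small file; point it at your .gguf instead
```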

&lt;p&gt;&lt;strong&gt;Typical limits on a mini PC with an iGPU&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GGUF size + OS + context cannot grow without limit; huge &lt;strong&gt;MoE&lt;/strong&gt; releases (e.g. &lt;strong&gt;Kimi K2&lt;/strong&gt;-class GGUFs) can &lt;strong&gt;exceed&lt;/strong&gt; usable RAM on 64–96 GiB class boxes or crawl at &lt;strong&gt;extremely&lt;/strong&gt; low tok/s.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iGPU Vulkan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Caps &lt;strong&gt;tok/s&lt;/strong&gt; on GPU; lots of RAM helps you &lt;strong&gt;load&lt;/strong&gt; weights, not mimic a big discrete GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;One active model per &lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Switching models means changing &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;restarting&lt;/strong&gt; (or a second server on another port).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Templates / chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weird chat in Open WebUI may be the GGUF &lt;strong&gt;chat template&lt;/strong&gt;; check the Hugging Face card or try another frontend.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network / disk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large downloads take time; use &lt;code&gt;wget --continue&lt;/code&gt; or resumable &lt;code&gt;huggingface-cli&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Set expectations:&lt;/strong&gt; an &lt;strong&gt;8B–13B&lt;/strong&gt; or a quantized &lt;strong&gt;26B&lt;/strong&gt; can be a great fit with ample RAM; &lt;strong&gt;datacenter-scale&lt;/strong&gt; GGUF may &lt;strong&gt;not fit&lt;/strong&gt; or run &lt;strong&gt;under ~1–2 tok/s&lt;/strong&gt; with aggressive paging—that is a &lt;strong&gt;memory bandwidth&lt;/strong&gt; issue, not an Ubuntu bug.&lt;/p&gt;
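The memory-bandwidth point can be sanity-checked with a back-of-envelope number: for dense models, each generated token reads roughly the whole model from memory, so generation tok/s is bounded by bandwidth divided by model size. The bandwidth figure below is an assumption (dual-channel DDR5-5600 is on the order of 90 GB/s):

```shell
# Rough upper bound on generation tok/s: memory bandwidth / bytes per token.
# 90 GB/s is an assumed DDR5 figure; 8 GiB matches the Q8_0 8B model above.
bw_gbs=90
model_gb=8
echo $(( bw_gbs / model_gb ))   # ~11 tok/s ceiling; the measured 9.8 t/s fits
```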

&lt;h3&gt;
  
  
  One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)
&lt;/h3&gt;

&lt;p&gt;For a &lt;strong&gt;mini PC–style&lt;/strong&gt; setup: Ubuntu 24.04, &lt;strong&gt;AMD iGPU Vulkan&lt;/strong&gt;, &lt;strong&gt;~64–96 GiB&lt;/strong&gt; RAM, &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;8080&lt;/strong&gt;, &lt;strong&gt;systemd&lt;/strong&gt; §9, &lt;strong&gt;Open WebUI&lt;/strong&gt; §10. Swap in your paths and username.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common steps (every model swap)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Refresh the engine&lt;/strong&gt; if the model is new or loading fails: &lt;code&gt;cd ~/llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt; and rebuild (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; (per-family commands below). &lt;strong&gt;Verify&lt;/strong&gt; the filename under Hugging Face → &lt;em&gt;Files&lt;/em&gt;; if it is renamed, fix the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test&lt;/strong&gt; (tune &lt;code&gt;-ngl&lt;/code&gt; and &lt;code&gt;-c&lt;/code&gt;); or use the &lt;strong&gt;copy-paste commands per model&lt;/strong&gt; under &lt;em&gt;Per-model quick test&lt;/em&gt; below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
./build/bin/llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"/absolute/path/to/file.gguf"&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tuning:&lt;/strong&gt; on &lt;strong&gt;OOM&lt;/strong&gt;, &lt;strong&gt;hangs&lt;/strong&gt;, or very slow output, &lt;strong&gt;lower &lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. 50, 35) and/or &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. 2048). &lt;strong&gt;Unified&lt;/strong&gt; iGPU memory is usually the limiter, not raw RAM alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (optional, §7) with the same path and &lt;code&gt;-ngl&lt;/code&gt; to compare quants or families.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd (§9):&lt;/strong&gt; in &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: use the same path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, and match &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to what worked in the smoke test.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="7"&gt;
&lt;li&gt;
&lt;strong&gt;API check:&lt;/strong&gt; &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI:&lt;/strong&gt; Admin → Connections → OpenAI (&lt;code&gt;host.docker.internal:8080/v1&lt;/code&gt;). If the picker stays empty, paste the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from that JSON into &lt;strong&gt;Model IDs&lt;/strong&gt;, save, and hard-refresh.&lt;/li&gt;
&lt;/ol&gt;
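
&lt;p&gt;To fill &lt;strong&gt;Model IDs&lt;/strong&gt; without reading the raw JSON, you can filter the response with &lt;code&gt;jq&lt;/code&gt; (a small sketch; assumes &lt;code&gt;jq&lt;/code&gt; is installed, e.g. &lt;code&gt;sudo apt install -y jq&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print every model id the server reports (OpenAI-style "data" array)
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;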

&lt;h4&gt;
  
  
  Reference table (repos + sample file)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Hugging Face repo&lt;/th&gt;
&lt;th&gt;Sample file (quant)&lt;/th&gt;
&lt;th&gt;Notes (on a machine with plenty of RAM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; 26B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~17 GiB on disk; usually needs &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Start &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; around &lt;strong&gt;4096&lt;/strong&gt;–&lt;strong&gt;8192&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen2.5 Coder&lt;/strong&gt; 7B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Qwen2.5-Coder-7B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Much lighter than Gemma 26B. For &lt;strong&gt;14B / 32B&lt;/strong&gt;, check &lt;em&gt;Files&lt;/em&gt; sizes; 32B Q4 is often &lt;strong&gt;~18–20 GiB+&lt;/strong&gt; and heavier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;DeepSeek Coder V2 Lite&lt;/strong&gt; Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;“Lite” ≈ &lt;strong&gt;~10 GiB&lt;/strong&gt; class in Q4_K_M; solid code/disk trade-off locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Llama 3.1&lt;/strong&gt; 8B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Meta-Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/code&gt; or &lt;code&gt;-Q8_0.gguf&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt; faster; &lt;strong&gt;Q8_0&lt;/strong&gt; heavier / often higher quality. If your file name differs, keep your real path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)
&lt;/h4&gt;

&lt;p&gt;If you use &lt;strong&gt;SSH&lt;/strong&gt; and the download runs a long time, run it inside &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; so a dropped connection does not kill the job. Example with &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; (install if needed: &lt;code&gt;sudo apt install -y screen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;screen &lt;span class="nt"&gt;-S&lt;/span&gt; hf-models
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;span class="c"&gt;# When this wget finishes, you can paste the next command from the block below without leaving screen.&lt;/span&gt;

&lt;span class="c"&gt;# Detach (leave download running): Ctrl+A, release, D&lt;/span&gt;
&lt;span class="c"&gt;# Reattach later: screen -r hf-models&lt;/span&gt;
&lt;span class="c"&gt;# List sessions: screen -ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern works for the other URLs in this section or for &lt;strong&gt;&lt;code&gt;huggingface-cli download&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

&lt;span class="c"&gt;# Gemma 4 26B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Qwen2.5 Coder 7B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# DeepSeek Coder V2 Lite Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF/resolve/main/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Llama 3.1 8B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Meta / Llama (gated):&lt;/strong&gt; if &lt;code&gt;wget&lt;/code&gt; returns &lt;strong&gt;403&lt;/strong&gt; or Hugging Face asks you to sign in, open the model page while logged in, &lt;strong&gt;accept the license&lt;/strong&gt;, create a &lt;strong&gt;read&lt;/strong&gt; token, and run &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt;. &lt;em&gt;Gated&lt;/em&gt; repos usually need &lt;strong&gt;&lt;code&gt;huggingface-cli download ...&lt;/code&gt;&lt;/strong&gt;, not anonymous &lt;code&gt;wget&lt;/code&gt; to &lt;code&gt;resolve/main/...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt; alternative&lt;/strong&gt; (resumable; each command pulls &lt;strong&gt;one&lt;/strong&gt; GGUF under &lt;code&gt;--local-dir&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
&lt;span class="c"&gt;# huggingface-cli login   # required for *gated* repos (e.g. Llama/Meta); optional otherwise&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the CLI version, the &lt;code&gt;.gguf&lt;/code&gt; may end up in a &lt;strong&gt;subfolder&lt;/strong&gt; under &lt;code&gt;--local-dir&lt;/code&gt;. Point &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; at the real absolute path (for example &lt;code&gt;find "$HOME/models" -name '*.gguf'&lt;/code&gt;).&lt;/p&gt;
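
&lt;p&gt;A quick way to see every downloaded file with its absolute path and size (a sketch; a size far below the one listed under Hugging Face → &lt;em&gt;Files&lt;/em&gt; usually means a truncated download):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Size and absolute path of every .gguf under ~/models, including subfolders
find "$HOME/models" -name '*.gguf' -exec du -h '{}' \;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;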

&lt;h4&gt;
  
  
  Per-model quick test (right after download)
&lt;/h4&gt;

&lt;p&gt;Run &lt;strong&gt;one&lt;/strong&gt; block (paths match the &lt;code&gt;wget&lt;/code&gt; names above). &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt; caps generated tokens so the run stays short; if your &lt;code&gt;llama-cli&lt;/code&gt; rejects &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;, check &lt;code&gt;./build/bin/llama-cli --help&lt;/code&gt; (sometimes &lt;code&gt;--predict&lt;/code&gt; or another alias). Earlier in §7, &lt;em&gt;Quick terminal test&lt;/em&gt; shows a &lt;strong&gt;&lt;code&gt;-cnv&lt;/code&gt;&lt;/strong&gt; example for Gemma and a Llama variant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 26B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence what a tensor is in machine learning."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5 Coder 7B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a one-line Python factorial(n) function; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DeepSeek Coder V2 Lite Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a JavaScript arrow function that adds two numbers; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say in one sentence what llama.cpp is for."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On startup you should see &lt;strong&gt;&lt;code&gt;ggml:&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;ggml_vulkan:&lt;/code&gt;&lt;/strong&gt; lines naming your GPU when Vulkan is in use (§4–§5).&lt;/p&gt;
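
&lt;p&gt;If you are unsure which backend is active, you can filter the startup log for those lines (a sketch; swap in your real &lt;code&gt;.gguf&lt;/code&gt; path, and note the device info goes to stderr, hence the redirect):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd "$HOME/llama.cpp"
# Tiny run; keep only backend/device lines from the startup log
./build/bin/llama-cli -m "$HOME/models/YOUR_FILE.gguf" -ngl 999 -n 1 -p "hi" 2&amp;gt;&amp;amp;1 | grep -iE 'ggml|vulkan'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No &lt;code&gt;ggml_vulkan:&lt;/code&gt; line usually points to a CPU-only build or missing &lt;code&gt;/dev/dri&lt;/code&gt; access (§4–§5).&lt;/p&gt;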

&lt;h4&gt;
  
  
  Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)
&lt;/h4&gt;

&lt;p&gt;Same shape as §9; only &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (and possibly &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;) change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…/llama-server \
    -m /home/YOUR_USER/models/THE_FILE_YOU_TESTED.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 8192 \
    -ngl 999 \
    --n-predict -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;Gemma 26B Q4&lt;/strong&gt; or another big model &lt;strong&gt;OOM&lt;/strong&gt;s on a box with only &lt;strong&gt;~16 GiB&lt;/strong&gt; RAM, lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less) &lt;strong&gt;before&lt;/strong&gt; pushing &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;. Always validate with &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; using the same &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; you plan in &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;, then automate with systemd (§9).&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Run manually, listening on all interfaces on port &lt;strong&gt;8080&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On another machine: &lt;code&gt;http://SERVER_IP:8080&lt;/code&gt; (llama.cpp’s built-in UI is very basic).&lt;/p&gt;
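
&lt;p&gt;Beyond the built-in UI, you can confirm the OpenAI-compatible API from any machine with &lt;code&gt;curl&lt;/code&gt; (a sketch; &lt;code&gt;SERVER_IP&lt;/code&gt; is a placeholder, and with a single loaded model &lt;code&gt;llama-server&lt;/code&gt; does not need a &lt;code&gt;model&lt;/code&gt; field):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Minimal OpenAI-style chat request against the running server
curl -s http://SERVER_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hi in one word."}],"max_tokens":16}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;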




&lt;h2&gt;
  
  
  9. systemd service (start on boot)
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt; (e.g. with &lt;code&gt;sudo nano&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Llama.cpp API server (Vulkan)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="py"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="c"&gt;# Vulkan on AMD: the service user must access /dev/dri (groups in §4).
# If the service loads the model on CPU only, check `groups` / `id` for that user.
&lt;/span&gt;&lt;span class="py"&gt;SupplementaryGroups&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;render video&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp/build/bin/llama-server &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-m /home/YOUR_USER/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--port 8080 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-c 8192 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-ngl 99 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--n-predict -1&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended order (tight RAM):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;.gguf&lt;/code&gt; must be &lt;strong&gt;fully downloaded&lt;/strong&gt;; a truncated file makes the unit &lt;strong&gt;fail&lt;/strong&gt; or &lt;strong&gt;restart in a loop&lt;/strong&gt; (&lt;code&gt;Restart=always&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test with &lt;code&gt;llama-cli&lt;/code&gt; first&lt;/strong&gt; as the &lt;strong&gt;same user&lt;/strong&gt; as the systemd unit, with the &lt;strong&gt;same&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; as in &lt;code&gt;ExecStart&lt;/code&gt; (§7 &lt;em&gt;Per-model quick test&lt;/em&gt; or step 3’s generic example). If that already OOMs or hangs, &lt;strong&gt;tune flags&lt;/strong&gt; before &lt;code&gt;enable --now&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If systemd shows &lt;strong&gt;OOM&lt;/strong&gt; in &lt;code&gt;journalctl&lt;/code&gt;, the process &lt;strong&gt;dies and respawns&lt;/strong&gt; every few seconds, or the kernel kills the worker, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less) instead of staying at &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;. After each edit, run &lt;code&gt;sudo systemctl daemon-reload&lt;/code&gt; and &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;, and repeat until &lt;code&gt;status&lt;/code&gt; shows a stable &lt;strong&gt;active (running)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If startup fails, check logs: &lt;code&gt;journalctl -u llama-web.service -n 80 --no-pager&lt;/code&gt; (GGUF path, &lt;code&gt;/dev/dri&lt;/code&gt; permissions, &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, Vulkan).&lt;/p&gt;
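
&lt;p&gt;Two more log views that help while tuning (a sketch; the &lt;code&gt;grep&lt;/code&gt; pattern is just a starting point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Follow the unit live while you adjust -c / -ngl and restart
journalctl -u llama-web.service -f
# Search past output for kills and errors
journalctl -u llama-web.service --no-pager | grep -iE 'oom|killed|error'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;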




&lt;h2&gt;
  
  
  10. Open WebUI with Docker (port 3000 → backend on 8080)
&lt;/h2&gt;

&lt;p&gt;Install Docker if needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker.io
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out again, or run: newgrp docker&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Container (UI on &lt;strong&gt;3000&lt;/strong&gt;; engine stays on host &lt;strong&gt;8080&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host.docker.internal:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; open-webui:/app/backend/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; open-webui &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the browser: &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect Open WebUI to llama-server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not the same as “External tools”.&lt;/strong&gt; In regular user settings you may see &lt;strong&gt;External tools&lt;/strong&gt; (&lt;em&gt;Manage tool servers&lt;/em&gt;, &lt;code&gt;openapi.json&lt;/code&gt;): that is for optional &lt;strong&gt;tool&lt;/strong&gt; servers, &lt;strong&gt;not&lt;/strong&gt; for the main LLM backend. Putting your URL only there leaves the model picker empty.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Admin Settings&lt;/strong&gt;, not the gear icon that only shows &lt;em&gt;General / Interface / External tools&lt;/em&gt; (&lt;a href="https://docs.openwebui.com/getting-started/quick-start/settings/" rel="noopener noreferrer"&gt;personal user settings&lt;/a&gt;). Typical path: &lt;strong&gt;profile avatar&lt;/strong&gt; → &lt;strong&gt;Admin Settings&lt;/strong&gt; / &lt;strong&gt;Administration&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Connections&lt;/strong&gt; → &lt;strong&gt;OpenAI&lt;/strong&gt; → &lt;strong&gt;Add connection&lt;/strong&gt;. If &lt;em&gt;Admin Settings&lt;/em&gt; is missing, your account is not an instance admin (the first registered user usually is). Docs: &lt;a href="https://docs.openwebui.com/getting-started/quick-start/connect-a-provider/starting-with-openai-compatible/" rel="noopener noreferrer"&gt;OpenAI-Compatible&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Admin panel → Settings → Connections&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; section (llama-server mimics the OpenAI API):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base URL:&lt;/strong&gt; &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key:&lt;/strong&gt; any string (e.g. &lt;code&gt;sk-no-key-required&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and use &lt;strong&gt;verify connection&lt;/strong&gt; if shown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turn off “Direct connections”&lt;/strong&gt; (or equivalent) if you enabled it: otherwise the browser will try to resolve &lt;code&gt;host.docker.internal&lt;/code&gt; outside Docker and fail. The UI should proxy to the backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Chat up and running (example)
&lt;/h3&gt;

&lt;p&gt;With the backend wired, pick a model in chat (often the same label as the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; filename&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loaded), send a prompt, and the reply is generated on the host. The screenshot shows &lt;strong&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/strong&gt;: the header dropdown reflects that file, and you get a &lt;strong&gt;“Thought for …”&lt;/strong&gt;-style block (internal reasoning before the visible answer). That &lt;strong&gt;adds latency&lt;/strong&gt; before you see the final text; for &lt;strong&gt;terminal&lt;/strong&gt; use and less explicit “thinking” output with Gemma, try &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (§7 &lt;em&gt;Quick terminal test&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" alt="Open WebUI: chat with Gemma 4 26B Q4_K_M, GGUF picker, and reasoning (“Thought for …”)." width="800" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  No browsing or GitHub fetch: real limits (and confident wrong answers)
&lt;/h3&gt;

&lt;p&gt;With &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;Open WebUI&lt;/strong&gt; as wired here, the model is &lt;strong&gt;text → text&lt;/strong&gt; only: it does &lt;strong&gt;not&lt;/strong&gt; browse the web, issue its own &lt;strong&gt;internet&lt;/strong&gt; requests, download a &lt;strong&gt;&lt;code&gt;https://github.com/...&lt;/code&gt;&lt;/strong&gt; tree, or run code in a sandbox. All it “sees” is what &lt;strong&gt;you&lt;/strong&gt; type (plus whatever context the UI forwards) and knowledge &lt;strong&gt;frozen&lt;/strong&gt; inside the &lt;strong&gt;GGUF&lt;/strong&gt; up to training cutoff.&lt;/p&gt;

&lt;p&gt;It may still answer &lt;strong&gt;very confidently&lt;/strong&gt; as if it had tools—for example claiming it &lt;strong&gt;“can analyze a public repo if you share the link”&lt;/strong&gt; or outlining how it will &lt;strong&gt;“read”&lt;/strong&gt; a remote &lt;code&gt;README&lt;/code&gt;. In this stack &lt;strong&gt;that is false&lt;/strong&gt; if you only paste a URL: the backend &lt;strong&gt;never fetches&lt;/strong&gt; HTML or the repo; Gemma (or any local GGUF) &lt;strong&gt;hallucinates&lt;/strong&gt; or repeats patterns from training. Real analysis needs &lt;strong&gt;you to paste files&lt;/strong&gt; / diffs, or &lt;strong&gt;separate&lt;/strong&gt; plumbing (RAG, &lt;strong&gt;Open WebUI&lt;/strong&gt; functions, agents, APIs) that this guide does &lt;strong&gt;not&lt;/strong&gt; set up.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;“Thought for …”&lt;/strong&gt; / reasoning block (§7, §10) does &lt;strong&gt;not&lt;/strong&gt; verify anything online—it only extends generation and can read like a &lt;strong&gt;super-capable assistant&lt;/strong&gt;; double-check claims about repos, “current” versions, or anything that depends on &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same stack, different tone:&lt;/strong&gt; ask bluntly &lt;em&gt;can you browse the Internet for new info?&lt;/em&gt; and Gemma may &lt;strong&gt;plainly refuse&lt;/strong&gt;—no live search, only training data plus whatever &lt;strong&gt;you&lt;/strong&gt; paste. That does &lt;strong&gt;not&lt;/strong&gt; undo the GitHub-URL problem above: the model &lt;strong&gt;shifts persona&lt;/strong&gt; with prompt framing (literal capability question vs. “please review this repo”). &lt;strong&gt;Ground truth&lt;/strong&gt; is unchanged: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; still issues no HTTP&lt;/strong&gt; on its own until you wire tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" alt="Open WebUI (English): *Can you browse the Internet…?* — honest “no live web” reply; same stack, still no automatic fetch." width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo (the joke writes itself):&lt;/strong&gt; the assistant just told you to &lt;em&gt;“send the link”&lt;/em&gt;, so you reply &lt;em&gt;analyze &lt;code&gt;https://github.com/…/pgwd&lt;/code&gt; and tell me what to improve&lt;/em&gt;. The &lt;strong&gt;same&lt;/strong&gt; request in &lt;strong&gt;Spanish&lt;/strong&gt; (or any other language you type in the UI) behaves identically: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; does not switch behavior by chat language&lt;/strong&gt;. Open WebUI shows &lt;strong&gt;Thinking…&lt;/strong&gt; and Gemma looks busy, but &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; never fetched that repo&lt;/strong&gt;: it only sees the &lt;strong&gt;message string&lt;/strong&gt;. The answer may sound technical yet be &lt;strong&gt;untethered from the real tree&lt;/strong&gt;; paste files, run &lt;strong&gt;git&lt;/strong&gt; yourself, or wire up tools if you want a grounded review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" alt="Open WebUI: after “analyze this GitHub repo…”, the model shows Thinking… — no URL fetch in this stack." width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same experiment, a minute later:&lt;/strong&gt; the model may return &lt;strong&gt;Thought for ~45–60s&lt;/strong&gt; and a long “review” that &lt;strong&gt;reads like a real audit&lt;/strong&gt;. The screenshot below is &lt;strong&gt;English&lt;/strong&gt; (&lt;em&gt;analyze in details…&lt;/em&gt;): it leans into &lt;strong&gt;Flask&lt;/strong&gt; and &lt;strong&gt;Blueprints&lt;/strong&gt;; in &lt;strong&gt;another&lt;/strong&gt; chat the same Gemma might rattle off &lt;strong&gt;Go&lt;/strong&gt; &lt;code&gt;cmd/&lt;/code&gt;/&lt;code&gt;internal/&lt;/code&gt;—still with &lt;strong&gt;no&lt;/strong&gt; tree read. That is template + guesswork, not repository access: some bullets may match the name (&lt;em&gt;pgwd&lt;/em&gt;, “dashboard”, …), some may be &lt;strong&gt;wrong&lt;/strong&gt;; &lt;strong&gt;length&lt;/strong&gt; and &lt;strong&gt;“thought”&lt;/strong&gt; time are not a substitute for &lt;strong&gt;cloning&lt;/strong&gt; and diffing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" alt="Open WebUI (English example): detailed reply after a bare GitHub URL with no fetch — “Thought for …” plus persuasive text; verify against real code." width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed
&lt;/h3&gt;

&lt;p&gt;This almost never means “the &lt;code&gt;.gguf&lt;/code&gt; is missing on disk”; it means &lt;strong&gt;Open WebUI is not getting &lt;code&gt;/v1/models&lt;/code&gt;&lt;/strong&gt; from the backend you configured. Walk through in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; must be running&lt;/strong&gt; on the same host as Docker (§8 manual or §9 &lt;code&gt;systemd&lt;/code&gt;). Nothing listening on &lt;strong&gt;8080&lt;/strong&gt; → empty list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On the host&lt;/strong&gt; (mini PC shell), hit the API:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://127.0.0.1:8080/v1/models | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see JSON (&lt;code&gt;data&lt;/code&gt;, at least one &lt;code&gt;id&lt;/code&gt;). &lt;strong&gt;Connection refused&lt;/strong&gt; → start or fix &lt;code&gt;llama-server&lt;/code&gt;. If it bound only to an unexpected interface, add &lt;strong&gt;&lt;code&gt;--host 0.0.0.0&lt;/code&gt;&lt;/strong&gt; to &lt;code&gt;ExecStart&lt;/code&gt;: binding &lt;code&gt;127.0.0.1&lt;/code&gt; alone is not enough when LAN clients or the Docker bridge need to reach 8080, so &lt;code&gt;0.0.0.0&lt;/code&gt; is the usual choice for Docker→host.&lt;/p&gt;
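&lt;p&gt;To know which label to expect in the picker, pull the &lt;code&gt;id&lt;/code&gt; fields out of that JSON. A small sketch (the sample JSON stands in for real &lt;code&gt;curl&lt;/code&gt; output; pipe the live response through the same one-liner):&lt;/p&gt;

```shell
# Sample of what /v1/models returns; in practice use: curl -sS http://127.0.0.1:8080/v1/models
json='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'

# Print every model id -- this is the label Open WebUI should list in its dropdown.
printf '%s' "$json" \
  | python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'
```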

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;From the Open WebUI container&lt;/strong&gt;, the host port must be reachable:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;open-webui sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'wget -qO- http://host.docker.internal:8080/v1/models 2&amp;gt;/dev/null || curl -sS http://host.docker.internal:8080/v1/models'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this fails but step 2 works, you are missing &lt;strong&gt;&lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;&lt;/strong&gt; in &lt;code&gt;docker run&lt;/code&gt; (§10), or a firewall blocks Docker bridge → host (&lt;code&gt;ufw&lt;/code&gt; may need a rule; many setups allow it by default).&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI wiring:&lt;/strong&gt; &lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or &lt;strong&gt;Admin&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt;, depending on version), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; required&lt;/strong&gt;). Save a dummy API key and &lt;strong&gt;verify&lt;/strong&gt; if offered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not mix with Ollama:&lt;/strong&gt; putting the &lt;code&gt;llama-server&lt;/code&gt; URL only under &lt;strong&gt;Ollama&lt;/strong&gt;, or using port 8080 &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;/v1&lt;/code&gt;, can leave the dropdown empty. See the table below.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After fixing, &lt;strong&gt;hard-refresh&lt;/strong&gt; the UI. The model label may match the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; name&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;default&lt;/code&gt;&lt;/strong&gt;, or whatever &lt;code&gt;id&lt;/code&gt; appears in the JSON from step 2.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
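&lt;p&gt;The &lt;code&gt;/v1&lt;/code&gt; suffix on the base URL is the single most common omission. A trivial sanity check before pasting it into the UI:&lt;/p&gt;

```shell
# The base URL you are about to save under Settings -> Connections -> OpenAI:
base_url="http://host.docker.internal:8080/v1"

# It must end in /v1, or the model list comes back empty.
case "$base_url" in
  */v1) echo "ok: OpenAI-style base URL" ;;
  *)    echo "missing /v1 suffix" ;;
esac
```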

&lt;h3&gt;
  
  
  “Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;Settings → Models → Manage Models&lt;/strong&gt; shows the &lt;strong&gt;Ollama&lt;/strong&gt; service with URL &lt;code&gt;http://host.docker.internal:8080&lt;/code&gt; (and nothing else), you often get &lt;strong&gt;Failed to fetch models&lt;/strong&gt;. That usually means &lt;strong&gt;two different backends are mixed up&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you run&lt;/th&gt;
&lt;th&gt;Typical port&lt;/th&gt;
&lt;th&gt;Where to configure it in Open WebUI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;llama-server&lt;/strong&gt; (this guide)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;8080&lt;/strong&gt;, OpenAI-style API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or equivalent), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (the &lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; suffix is required&lt;/strong&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (only if installed separately)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;11434&lt;/strong&gt;, Ollama API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; connection / model management, typically &lt;code&gt;http://host.docker.internal:11434&lt;/code&gt; (only if Ollama listens on the host and the container can reach it).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; is &lt;strong&gt;not&lt;/strong&gt; Ollama. If you put the llama-server URL in the &lt;strong&gt;Ollama&lt;/strong&gt; field, the UI uses the wrong protocol and fails even when port 8080 is open.&lt;/p&gt;
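&lt;p&gt;The protocol difference is easy to confirm from inside the container. A sketch, assuming the default ports from the table above (each helper only answers if its backend is actually running):&lt;/p&gt;

```shell
# Probe helpers for the two different protocols; run them from the open-webui container.
probe_llama_server() {  # OpenAI-style API served by llama-server on 8080
  curl -sS "http://host.docker.internal:8080/v1/models"
}
probe_ollama() {        # Ollama's own API on 11434 (only if Ollama is installed)
  curl -sS "http://host.docker.internal:11434/api/tags"
}
```

&lt;p&gt;If the first probe returns JSON and the second refuses the connection, you are running only &lt;code&gt;llama-server&lt;/code&gt;, and the &lt;strong&gt;OpenAI&lt;/strong&gt; connection is the one to configure.&lt;/p&gt;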

&lt;p&gt;&lt;strong&gt;If you only use llama-server:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep &lt;strong&gt;Connections → OpenAI&lt;/strong&gt; exactly as above (&lt;code&gt;…8080/v1&lt;/code&gt;, dummy key, verify).&lt;/li&gt;
&lt;li&gt;If you do not run Ollama, &lt;strong&gt;clear or disable&lt;/strong&gt; the Ollama URL (do not point it at 8080).&lt;/li&gt;
&lt;li&gt;Return to &lt;strong&gt;Models&lt;/strong&gt; or chat: available models follow whatever &lt;code&gt;llama-server&lt;/code&gt; loaded with &lt;code&gt;-m&lt;/code&gt; (§8–§9).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;host.docker.internal&lt;/code&gt; does not resolve&lt;/strong&gt; inside the container, confirm your &lt;code&gt;docker run&lt;/code&gt; includes &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt; (§10). On Linux that hostname is not defined by default without it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" alt="Illustration: conceptual flow for upgrading the UI (image pull, recreate container, persistent volume)" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating Open WebUI (Docker)
&lt;/h3&gt;

&lt;p&gt;The UI often shows a banner like &lt;em&gt;“A new version (v0.x.y) is now available…”&lt;/em&gt; when a newer image exists. Your &lt;strong&gt;chats and settings&lt;/strong&gt; live in the &lt;strong&gt;&lt;code&gt;open-webui&lt;/code&gt; named volume&lt;/strong&gt;; they are kept when you recreate the container as long as you mount the same &lt;code&gt;-v open-webui:/app/backend/data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" alt="Updating Open WebUI" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt; the updated image (same tag you used at install; this guide uses &lt;code&gt;main&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Stop and remove&lt;/strong&gt; only the container (the volume stays intact):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop open-webui
docker &lt;span class="nb"&gt;rm &lt;/span&gt;open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the &lt;strong&gt;same&lt;/strong&gt; &lt;code&gt;docker run&lt;/code&gt; block from §10 again (same &lt;code&gt;-p 3000:8080&lt;/code&gt;, &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;, &lt;code&gt;-v open-webui:…&lt;/code&gt;, container name &lt;code&gt;open-webui&lt;/code&gt;, etc.). The new container starts from the image you just pulled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you originally used a &lt;strong&gt;different tag&lt;/strong&gt; (e.g. &lt;code&gt;v0.8.12&lt;/code&gt; or a &lt;code&gt;cuda&lt;/code&gt; variant) instead of &lt;code&gt;main&lt;/code&gt;, substitute that tag in both &lt;code&gt;docker pull&lt;/code&gt; and &lt;code&gt;docker run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt; updating the UI does &lt;strong&gt;not&lt;/strong&gt; update &lt;code&gt;llama-server&lt;/code&gt; or your GGUF weights; the engine is still §6–§9. If you do not want to track &lt;code&gt;main&lt;/code&gt;, pin an explicit image tag in &lt;code&gt;docker run&lt;/code&gt; and repeat this flow when you choose to upgrade.&lt;/p&gt;
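&lt;p&gt;The pull/remove/recreate steps collapse naturally into one helper you can keep on the host. A sketch assuming the flags used in this guide (port 3000, the &lt;code&gt;open-webui&lt;/code&gt; named volume, the &lt;code&gt;host-gateway&lt;/code&gt; mapping); adjust it to match your actual §10 command and tag:&lt;/p&gt;

```shell
# Update helper: pull, recreate, keep the data volume. Adjust IMAGE if you pinned a tag.
IMAGE="ghcr.io/open-webui/open-webui:main"

update_open_webui() {
  docker pull "$IMAGE" || return 1                 # 1. fetch the newer image
  docker stop open-webui && docker rm open-webui   # 2. drop only the container
  docker run -d --name open-webui \
    -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data \
    "$IMAGE"                                       # 3. same flags, new image, same volume
}
```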

&lt;h3&gt;
  
  
  If you also run Ollama
&lt;/h3&gt;

&lt;p&gt;A default endpoint may appear on port &lt;strong&gt;11434&lt;/strong&gt;. To keep using &lt;strong&gt;your&lt;/strong&gt; Vulkan llama-server with the same &lt;code&gt;-ngl&lt;/code&gt;/RAM behavior, prioritize the OpenAI entry pointing at &lt;code&gt;:8080/v1&lt;/code&gt; and do not rely on Ollama for that backend.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Same API surface as Open WebUI: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible endpoint&lt;/strong&gt; at &lt;code&gt;http://HOST:8080/v1&lt;/code&gt; (keep §8 or §9 running). Use the mini PC’s IP instead of &lt;code&gt;127.0.0.1&lt;/code&gt; when you work from another machine on the LAN (and open port 8080 in the firewall if needed).&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenCode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opencode.ai/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; can use &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; providers through &lt;code&gt;@ai-sdk/openai-compatible&lt;/code&gt;. The official docs include a &lt;strong&gt;llama.cpp / llama-server&lt;/strong&gt; example: &lt;a href="https://opencode.ai/docs/providers/" rel="noopener noreferrer"&gt;Providers — llama.cpp&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm &lt;code&gt;llama-server&lt;/code&gt; answers (e.g. &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Create or edit &lt;strong&gt;&lt;code&gt;opencode.json&lt;/code&gt;&lt;/strong&gt; for your project or OpenCode’s config path (&lt;code&gt;$schema&lt;/code&gt;: &lt;code&gt;https://opencode.ai/config.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Add a provider with &lt;code&gt;"npm": "@ai-sdk/openai-compatible"&lt;/code&gt; and &lt;code&gt;"options.baseURL": "http://127.0.0.1:8080/v1"&lt;/code&gt; (or the remote IP).&lt;/li&gt;
&lt;li&gt;Under &lt;code&gt;provider.&amp;lt;id&amp;gt;.models&lt;/code&gt;, add keys that match what the API expects. If unsure, read the &lt;code&gt;id&lt;/code&gt; field from &lt;code&gt;/v1/models&lt;/code&gt;; it is often the &lt;code&gt;.gguf&lt;/code&gt; filename or &lt;code&gt;default&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In OpenCode, use &lt;code&gt;/models&lt;/code&gt; to pick &lt;code&gt;provider_id/model_id&lt;/code&gt;, or set &lt;code&gt;"model": "provider_id/model_id"&lt;/code&gt; in the JSON.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Minimal example (adjust IDs to your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-server (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8080/v1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local model (default)"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-local/default"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenCode cannot see the model, align &lt;code&gt;models&lt;/code&gt; keys with &lt;code&gt;/v1/models&lt;/code&gt;. Tools and heavy agentic flows &lt;strong&gt;depend on the GGUF&lt;/strong&gt;; a general chat model may underperform on coding-agent tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Studio Code
&lt;/h3&gt;

&lt;p&gt;VS Code does not talk to your server by itself; you need an &lt;strong&gt;extension&lt;/strong&gt; that supports a custom OpenAI-style endpoint.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common picks: &lt;strong&gt;&lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;Continue&lt;/a&gt;&lt;/strong&gt; and others advertising &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt; or “local LLM”. You typically set &lt;strong&gt;Base URL&lt;/strong&gt; to &lt;code&gt;http://127.0.0.1:8080/v1&lt;/code&gt; (or the server IP) and &lt;strong&gt;API key&lt;/strong&gt; to any placeholder (e.g. &lt;code&gt;sk-local&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot&lt;/strong&gt; (in VS Code or Visual Studio) does not route through your &lt;code&gt;llama-server&lt;/code&gt;; it is a separate cloud service.&lt;/li&gt;
&lt;li&gt;From another PC, use the host IP where &lt;code&gt;llama-server&lt;/code&gt; runs—not &lt;code&gt;host.docker.internal&lt;/code&gt; (that name is for containers such as Open WebUI).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local models behind these extensions usually trail cloud models on tool use and very large contexts. Start on the same machine you already validated with &lt;code&gt;llama-cli&lt;/code&gt; or Open WebUI.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04
&lt;/h2&gt;

&lt;p&gt;Typical CMake symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Could NOT find Vulkan (missing: ... glslc)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Vulkan found but &lt;code&gt;glslc&lt;/code&gt; still missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suggested order (simplest first):&lt;/p&gt;

&lt;h3&gt;
  
  
  12.1 Universe repository and packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository universe
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libvulkan-dev vulkan-tools shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; glslc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; glslc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and reconfigure the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  12.2 LunarG repository (Vulkan SDK)
&lt;/h3&gt;

&lt;p&gt;If your Ubuntu mirror does not offer &lt;code&gt;shaderc&lt;/code&gt; or &lt;code&gt;glslc&lt;/code&gt; is still missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; https://packages.lunarg.com/lunarg-signing-key-pub.asc &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/trusted.gpg.d/lunarg.asc
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget &lt;span class="nt"&gt;-qO&lt;/span&gt; /etc/apt/sources.list.d/lunarg-vulkan-noble.list &lt;span class="se"&gt;\&lt;/span&gt;
  https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;rm -rf build&lt;/code&gt; and run &lt;code&gt;cmake&lt;/code&gt; again.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;dpkg&lt;/code&gt; complains about overwriting files between packages, as a last resort you can force-remove the blocking package, then repair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;--remove&lt;/span&gt; &lt;span class="nt"&gt;--force-depends&lt;/span&gt; libshaderc-dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nt"&gt;--fix-broken&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only do this if you understand that mixing repositories can leave dependencies in a messy state; sticking to &lt;strong&gt;either&lt;/strong&gt; LunarG &lt;strong&gt;or&lt;/strong&gt; Ubuntu for the Shaderc dev packages is often enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;snap &lt;span class="nb"&gt;install &lt;/span&gt;google-shaderc
&lt;span class="nb"&gt;sudo ln&lt;/span&gt; &lt;span class="nt"&gt;-sf&lt;/span&gt; /snap/bin/glslc /usr/local/bin/glslc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check &lt;code&gt;glslc --version&lt;/code&gt; again and retry CMake.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Performance and models (rough guide)
&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;lots of RAM&lt;/strong&gt; but a &lt;strong&gt;modest iGPU&lt;/strong&gt;, tokens/s is capped by the unified VRAM and by how many layers &lt;code&gt;-ngl&lt;/code&gt; offloads; larger models can still run by spilling into system RAM, just more slowly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B A4B (e.g. Q4_K_M ~17 GiB)&lt;/td&gt;
&lt;td&gt;Good balance with high RAM; needs an up-to-date llama.cpp.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same family Q8_0 (~27 GiB)&lt;/td&gt;
&lt;td&gt;Better quality; more pressure on RAM/unified VRAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixtral 8×7B, 70B, others&lt;/td&gt;
&lt;td&gt;Feasible mainly thanks to RAM; slower.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use a lower quantization (e.g. Q4_K_M) if you prioritize &lt;strong&gt;speed&lt;/strong&gt; over &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For hard numbers &lt;strong&gt;on your&lt;/strong&gt; box, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7): it is the most direct way to compare &lt;code&gt;-ngl&lt;/code&gt; and quantizations without the web UI in the way.&lt;/p&gt;
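&lt;p&gt;A convenient way to run that comparison is a tiny wrapper. A sketch (the model path in the usage line is an example; &lt;code&gt;llama-bench&lt;/code&gt; accepts comma-separated value lists, so one run covers several &lt;code&gt;-ngl&lt;/code&gt; settings):&lt;/p&gt;

```shell
# Compare offload settings in one run; llama-bench prints a t/s row per -ngl value.
bench_ngl() {
  model="$1"
  ngl_values="${2:-0,16,32,99}"   # example sweep; tune to your iGPU
  ~/llama.cpp/build/bin/llama-bench -m "$model" -ngl "$ngl_values" -p 512 -n 128
}
# usage: bench_ngl ~/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf 0,32,99
```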

&lt;h3&gt;
  
  
  &lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt; shows &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;low CPU&lt;/strong&gt; across cores and only a &lt;strong&gt;few GiB&lt;/strong&gt; of &lt;strong&gt;RES&lt;/strong&gt;, that is often &lt;strong&gt;expected&lt;/strong&gt; when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; leaves much of the model on the &lt;strong&gt;iGPU&lt;/strong&gt; — heavy matmul runs on the graphics core; the &lt;strong&gt;CPU&lt;/strong&gt; orchestrates and shuffles data, so you may &lt;strong&gt;not&lt;/strong&gt; see all cores pegged at 100%.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;GGUF is small&lt;/strong&gt; (e.g. 7B/8B &lt;strong&gt;Q4&lt;/strong&gt;) — small &lt;strong&gt;resident&lt;/strong&gt; RAM footprint; a &lt;strong&gt;26B&lt;/strong&gt; run would show much more &lt;strong&gt;RES&lt;/strong&gt; if most weights live in system memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bursts&lt;/strong&gt; happen while scoring the prompt and &lt;strong&gt;generating&lt;/strong&gt; tokens; between turns or while you read output, usage &lt;strong&gt;drops&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;With &lt;strong&gt;unified memory (UMA)&lt;/strong&gt;, some model cost may &lt;strong&gt;not&lt;/strong&gt; show up as a huge process RSS: the &lt;strong&gt;GPU&lt;/strong&gt; also holds part of the working set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; assume nothing is working just because &lt;code&gt;htop&lt;/code&gt; stays calm: check &lt;strong&gt;t/s&lt;/strong&gt; in &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7), or a &lt;strong&gt;GPU&lt;/strong&gt; monitor if you want to see graphics load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference screenshot&lt;/strong&gt; (same class of mini PC as the validated hardware; &lt;strong&gt;SSH&lt;/strong&gt; + &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt;: &lt;code&gt;llama.cpp&lt;/code&gt; around &lt;strong&gt;~5 GiB RES&lt;/strong&gt; and &lt;strong&gt;moderate&lt;/strong&gt; CPU on one core, consistent with a &lt;strong&gt;moderately sized&lt;/strong&gt; model and &lt;strong&gt;GPU&lt;/strong&gt;-bound &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" alt="htop during inference: llama.cpp with moderate CPU and RAM (Vulkan / -ngl)." width="800" height="939"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Many snippets use &lt;strong&gt;&lt;code&gt;/sys/kernel/debug/dri/0/amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt;. On Ryzen mini PCs with &lt;strong&gt;amdgpu&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;dri/0&lt;/code&gt; often does not exist&lt;/strong&gt;: the kernel exposes the GPU under a &lt;strong&gt;PCI BDF&lt;/strong&gt; directory (&lt;code&gt;0000:c4:00.0&lt;/code&gt;, …) and provides &lt;strong&gt;symlinks&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;dri/1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;dri/128&lt;/code&gt;&lt;/strong&gt; into the same tree. If &lt;code&gt;cat&lt;/code&gt; returns &lt;em&gt;No such file or directory&lt;/em&gt;, inspect first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount | &lt;span class="nb"&gt;grep &lt;/span&gt;debugfs   &lt;span class="c"&gt;# expect debugfs on /sys/kernel/debug&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /sys/kernel/debug/dri/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
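
&lt;p&gt;If you are unsure which &lt;code&gt;dri/N&lt;/code&gt; belongs to &lt;strong&gt;amdgpu&lt;/strong&gt;, &lt;code&gt;/sys/class/drm&lt;/code&gt; exposes the card-to-driver mapping &lt;strong&gt;without root&lt;/strong&gt; (unlike debugfs). A minimal sketch; the demo builds a throwaway tree so it runs anywhere, and &lt;code&gt;card1&lt;/code&gt; is an assumption, not your guaranteed node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root=$(mktemp -d)                     # demo stand-in for /sys/class/drm
mkdir -p "$root/card1/device" "$root/drivers/amdgpu"
ln -s "$root/drivers/amdgpu" "$root/card1/device/driver"
for c in "$root"/card[0-9]*; do       # real use: for c in /sys/class/drm/card[0-9]*
  echo "$(basename "$c") -&gt; $(basename "$(readlink -f "$c/device/driver")")"
done                                  # prints: card1 -&gt; amdgpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On real hardware the loop over &lt;code&gt;/sys/class/drm&lt;/code&gt; prints one line per card; the one ending in &lt;code&gt;amdgpu&lt;/code&gt; is the node whose number to use under &lt;code&gt;debug/dri/&lt;/code&gt;.&lt;/p&gt;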



&lt;p&gt;Then read &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt; using the &lt;strong&gt;&lt;code&gt;N&lt;/code&gt;&lt;/strong&gt; or PCI path that belongs to your AMDGPU (&lt;strong&gt;&lt;code&gt;1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;0000:…:….0&lt;/code&gt;&lt;/strong&gt; usually works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /sys/kernel/debug/dri/1/amdgpu_pm_info
&lt;span class="c"&gt;# same content if 1 → 0000:c4:00.0:&lt;/span&gt;
&lt;span class="c"&gt;# sudo cat /sys/kernel/debug/dri/0000:c4:00.0/amdgpu_pm_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the directory exists but &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt; is missing&lt;/strong&gt;, your kernel may &lt;strong&gt;not export&lt;/strong&gt; that node; try &lt;code&gt;ls … | grep -i pm&lt;/code&gt;. That does &lt;strong&gt;not&lt;/strong&gt; mean Vulkan is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to read it (sample text, idle mini PC):&lt;/strong&gt; &lt;strong&gt;GPU Load: 0 %&lt;/strong&gt; with &lt;strong&gt;VCN powered down&lt;/strong&gt; matches &lt;strong&gt;idle&lt;/strong&gt;. While &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; runs a long &lt;strong&gt;&lt;code&gt;‑ngl&lt;/code&gt;&lt;/strong&gt; job, run &lt;code&gt;cat&lt;/code&gt; &lt;strong&gt;during&lt;/strong&gt; generation: you should usually see &lt;strong&gt;Load &amp;gt; 0 %&lt;/strong&gt; (the counter may not peg the iGPU). For a live view, &lt;strong&gt;&lt;code&gt;radeontop&lt;/code&gt;&lt;/strong&gt; is often easier (&lt;code&gt;sudo apt install -y radeontop&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GFX Clocks and Power:
    2800 MHz (MCLK)
    800 MHz (SCLK)
    ...
GPU Temperature: 36 C
GPU Load: 0 %
VCN Load: 0 %
VCN: Powered down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Illustrative excerpt; clocks, millivolts, and watts vary with BIOS, governor, and workload.)&lt;/p&gt;
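
&lt;p&gt;To grab just the load figure, e.g. while polling during generation, an &lt;code&gt;awk&lt;/code&gt; filter over that node is enough. The demo feeds it the sample text above so it runs anywhere; the &lt;code&gt;dri/1&lt;/code&gt; path in the comment is an assumption and must match whatever &lt;code&gt;ls /sys/kernel/debug/dri/&lt;/code&gt; showed on your box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;printf 'GPU Temperature: 36 C\nGPU Load: 0 %%\nVCN Load: 0 %%\n' \
  | awk -F': ' '/^GPU Load/ {print $2}'       # prints: 0 %
# real hardware, refreshed once per second:
#   watch -n1 "sudo awk -F': ' '/^GPU Load/ {print \$2}' /sys/kernel/debug/dri/1/amdgpu_pm_info"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;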




&lt;h2&gt;
  
  
  14. Remote desktop (Ubuntu 24.04 Desktop, LAN)
&lt;/h2&gt;

&lt;p&gt;When the mini PC runs &lt;strong&gt;GNOME&lt;/strong&gt; and you want the full desktop from &lt;strong&gt;another machine on the same network&lt;/strong&gt; (Windows, Mac, Linux), &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; usually ships &lt;strong&gt;RDP&lt;/strong&gt; built in; you often &lt;strong&gt;do not&lt;/strong&gt; need &lt;strong&gt;xrdp&lt;/strong&gt; unless you want different behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.1 Enable on the mini PC
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;System&lt;/strong&gt; → &lt;strong&gt;Remote Desktop&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn &lt;strong&gt;Remote Desktop&lt;/strong&gt; on.&lt;/li&gt;
&lt;li&gt;Finish the assistant (password / auth as GNOME shows).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Underlying package: &lt;strong&gt;&lt;code&gt;gnome-remote-desktop&lt;/code&gt;&lt;/strong&gt;. If the toggle is missing or fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--reinstall&lt;/span&gt; gnome-remote-desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log out or reboot and open Settings again.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.2 Connect from another machine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Native &lt;strong&gt;RDP&lt;/strong&gt; clients: &lt;strong&gt;Windows&lt;/strong&gt; (Remote Desktop Connection / &lt;code&gt;mstsc&lt;/code&gt;), &lt;strong&gt;macOS&lt;/strong&gt; (Microsoft Remote Desktop from the App Store), &lt;strong&gt;Linux&lt;/strong&gt; (e.g. Remmina with its RDP plugin).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; the Ubuntu box’s &lt;strong&gt;LAN IP&lt;/strong&gt; (&lt;code&gt;hostname -I | awk '{print $1}'&lt;/code&gt; on the mini PC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port:&lt;/strong&gt; &lt;strong&gt;3389/TCP&lt;/strong&gt; by default.&lt;/li&gt;
&lt;/ul&gt;
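
&lt;p&gt;Before reaching for a full RDP client, a quick port probe from a Linux or macOS client tells you whether &lt;strong&gt;3389&lt;/strong&gt; is even reachable. The IP below is a placeholder for your mini PC's LAN address; bash's &lt;code&gt;/dev/tcp&lt;/code&gt; redirection avoids installing netcat:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;host=192.168.1.50   # placeholder: your mini PC's LAN IP
port=3389
if timeout 3 bash -c ": &lt;/dev/tcp/$host/$port" 2&gt;/dev/null; then
  echo "port $port reachable"
else
  echo "port $port not reachable"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;"Not reachable" here means connect refused or timed out, which points at the firewall, AP isolation, or Remote Desktop being off, before you debug the RDP client itself.&lt;/p&gt;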

&lt;h3&gt;
  
  
  14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;ufw&lt;/code&gt; is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 3389/tcp comment &lt;span class="s1"&gt;'GNOME RDP'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14.4 If connection fails
&lt;/h3&gt;

&lt;p&gt;On the &lt;strong&gt;Ubuntu host&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ss &lt;span class="nt"&gt;-tlnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;3389 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Remote Desktop enabled, something should listen on &lt;strong&gt;3389&lt;/strong&gt;. Confirm the client is on the &lt;strong&gt;same LAN&lt;/strong&gt; and that no AP isolation blocks client-to-client Wi‑Fi.&lt;/p&gt;
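
&lt;p&gt;The &lt;code&gt;|| true&lt;/code&gt; above keeps the pipeline from failing under &lt;code&gt;set -e&lt;/code&gt;, but it prints no verdict either way. A variant with an explicit yes/no, demoed here against one captured &lt;code&gt;ss -tln&lt;/code&gt; line (swap the &lt;code&gt;printf&lt;/code&gt; for the real &lt;code&gt;ss -tln&lt;/code&gt; on the host):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# demo input is one `ss -tln` line; on the host: ss -tln | awk '...'
printf 'LISTEN 0 10 0.0.0.0:3389 0.0.0.0:*\n' \
  | awk '$4 ~ /:3389$/ {found=1} END {print (found ? "RDP listener up" : "no listener on 3389")}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;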

&lt;p&gt;If GNOME/RDP misbehaves on &lt;strong&gt;Wayland&lt;/strong&gt;, try the &lt;strong&gt;Ubuntu on Xorg&lt;/strong&gt; session on the login screen and enable Remote Desktop again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; exposing RDP to the &lt;strong&gt;public Internet&lt;/strong&gt; without VPN/tunnel is a bad idea; keep it on a &lt;strong&gt;trusted LAN&lt;/strong&gt; or behind VPN/WireGuard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] BIOS: UMA / VRAM for iGPU adjusted if applicable.&lt;/li&gt;
&lt;li&gt;[ ] Vulkan OK: on desktop &lt;code&gt;vkcube&lt;/code&gt;; on &lt;strong&gt;Ubuntu Server&lt;/strong&gt; &lt;code&gt;vulkaninfo --summary&lt;/code&gt; shows the GPU.&lt;/li&gt;
&lt;li&gt;[ ] User is in &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;id -nG&lt;/code&gt;); if you ran &lt;code&gt;usermod&lt;/code&gt;, you &lt;strong&gt;logged out or rebooted&lt;/strong&gt; (an old shell session does not pick up new groups).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;cmake -B build -DGGML_VULKAN=1&lt;/code&gt; succeeds; build reaches 100 %.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;git pull&lt;/code&gt;, rebuild §6) and follow &lt;strong&gt;try model → systemd → Open WebUI&lt;/strong&gt; when experimenting with new GGUFs (§7, &lt;em&gt;Experimenting…&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-cli&lt;/code&gt; shows the Vulkan device when loading the model.&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-server&lt;/code&gt; responds on &lt;code&gt;:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] Open WebUI on &lt;code&gt;:3000&lt;/code&gt; with &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt; and &lt;strong&gt;Direct connections&lt;/strong&gt; off.&lt;/li&gt;
&lt;li&gt;[ ] You know the model does &lt;strong&gt;not&lt;/strong&gt; browse or read GitHub from a URL alone; it may &lt;strong&gt;hallucinate&lt;/strong&gt; capabilities (§10 &lt;em&gt;No browsing or GitHub fetch&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to upgrade Open WebUI&lt;/strong&gt;: &lt;code&gt;docker pull&lt;/code&gt;, &lt;code&gt;stop&lt;/code&gt;/&lt;code&gt;rm&lt;/code&gt; the container, rerun the same &lt;code&gt;docker run&lt;/code&gt; with the &lt;code&gt;open-webui&lt;/code&gt; volume (§10).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;systemd&lt;/code&gt; service enabled if you want a persistent boot setup.&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to switch models&lt;/strong&gt;: after adding another &lt;code&gt;.gguf&lt;/code&gt;, you update &lt;code&gt;-m&lt;/code&gt; in &lt;code&gt;llama-web.service&lt;/code&gt; (or in the manual command), run &lt;code&gt;sudo systemctl daemon-reload &amp;amp;&amp;amp; sudo systemctl restart llama-web.service&lt;/code&gt;, and reload Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;list&lt;/strong&gt; your &lt;code&gt;.gguf&lt;/code&gt; files (&lt;code&gt;ls&lt;/code&gt; / &lt;code&gt;find&lt;/code&gt;, §7) and &lt;strong&gt;measure&lt;/strong&gt; throughput with &lt;code&gt;llama-bench&lt;/code&gt; (§7) when comparing quantizations or &lt;code&gt;-ngl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] You can follow the &lt;strong&gt;unified playbook&lt;/strong&gt; for Gemma 4 / Qwen Coder / DeepSeek Lite / Llama 3.1 (§7): download → &lt;code&gt;llama-cli&lt;/code&gt; → &lt;code&gt;systemd&lt;/code&gt; → &lt;code&gt;/v1/models&lt;/code&gt; → Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] (Optional) &lt;strong&gt;Remote desktop&lt;/strong&gt; §14: RDP enabled in Settings, &lt;strong&gt;3389&lt;/strong&gt; allowed in &lt;code&gt;ufw&lt;/code&gt; if needed, smoke tested from another PC on the LAN.&lt;/li&gt;
&lt;/ul&gt;
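
&lt;p&gt;The model-switch item can be sketched end to end. The unit name follows this guide's &lt;code&gt;llama-web.service&lt;/code&gt;; the &lt;code&gt;/models/*.gguf&lt;/code&gt; paths are placeholders, and the &lt;code&gt;sed&lt;/code&gt; runs on a temp copy so you can dry-run it safely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;unit=$(mktemp)    # dry run; the real file is /etc/systemd/system/llama-web.service
echo 'ExecStart=/opt/llama.cpp/build/bin/llama-server -m /models/old.gguf -ngl 99 --port 8080' &gt; "$unit"
sed -i 's|/models/old.gguf|/models/new.gguf|' "$unit"
grep -c '/models/new.gguf' "$unit"    # prints 1 when the swap took
# then, on the real unit:
#   sudo systemctl daemon-reload; sudo systemctl restart llama-web.service
#   curl -s http://localhost:8080/v1/models   # confirm the new model is served
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;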




&lt;h2&gt;
  
  
  Quick port reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama-server&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open WebUI&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote desktop (GNOME RDP)&lt;/td&gt;
&lt;td&gt;3389 TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (optional)&lt;/td&gt;
&lt;td&gt;11434&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Running local inference on Ubuntu with Vulkan and an AMD iGPU is not a one-click setup, but it is worth it: a model that answers &lt;strong&gt;on your LAN&lt;/strong&gt;, without routing every request through a third-party API, and with the freedom to swap GGUFs or quantizations when you need to.&lt;/p&gt;

&lt;p&gt;The stack moves fast: &lt;strong&gt;llama.cpp&lt;/strong&gt;, Ubuntu packages, and Hugging Face repos &lt;strong&gt;change&lt;/strong&gt; over time. If a command or package name no longer matches this guide, &lt;code&gt;cmake&lt;/code&gt; and &lt;code&gt;apt&lt;/code&gt; errors usually point you in the right direction; double-check the project’s current docs.&lt;/p&gt;

&lt;p&gt;Once the checklist is green, the natural next step is tuning &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, context size (&lt;code&gt;-c&lt;/code&gt;), and the model until you get the quality-vs-tokens-per-second balance you want &lt;strong&gt;on your&lt;/strong&gt; hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the mini PC&lt;/strong&gt; we used for the tests and validation in this guide: &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; (&lt;strong&gt;Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M&lt;/strong&gt;), &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, plenty of &lt;strong&gt;DDR5&lt;/strong&gt; RAM and NVMe — the same box behind the &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; runs, &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; screenshots, &lt;strong&gt;Open WebUI&lt;/strong&gt; examples, and the other reference captures. The photo is the &lt;strong&gt;actual&lt;/strong&gt; machine (powered on, front panel as shown), not a marketing render.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" alt="Minisforum UM760 Slim — the physical box used to validate this guide (Ryzen 5 7640HS, Radeon 760M, Ubuntu 24.04)." width="732" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now go tinker:&lt;/strong&gt; this walkthrough is rooted in &lt;strong&gt;Ryzen + iGPU&lt;/strong&gt;, but the playbook travels—&lt;strong&gt;mini PCs&lt;/strong&gt; (Minisforum, Beelink, &lt;strong&gt;ASUS ExpertCenter PN&lt;/strong&gt;, &lt;strong&gt;ZOTAC ZBOX&lt;/strong&gt;, modern &lt;strong&gt;Intel NUC-class&lt;/strong&gt; boxes…), &lt;strong&gt;Mac mini&lt;/strong&gt; / &lt;strong&gt;Mac Studio&lt;/strong&gt; on Apple Silicon if that is your stack, or compact power boxes like &lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; when budget and goals match. Build &lt;strong&gt;llama.cpp&lt;/strong&gt; (or your preferred runtime), stress &lt;strong&gt;GGUF&lt;/strong&gt; quantizations, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;your&lt;/strong&gt; iron, and tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; until the ceiling feels right. &lt;strong&gt;Share&lt;/strong&gt; what you learn—a &lt;strong&gt;dev.to&lt;/strong&gt; post, a blog, &lt;strong&gt;Mastodon&lt;/strong&gt;, article comments, or whatever community you use; real numbers beat brochure claims every time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One quiet takeaway:&lt;/em&gt; on &lt;strong&gt;your&lt;/strong&gt; codebases the model usually helps more as a &lt;strong&gt;copilot you feed&lt;/strong&gt;—a diff, a log slice, a trimmed README—than as an &lt;strong&gt;all-knowing reviewer&lt;/strong&gt; from a bare URL or a polished persona. When the answer feels &lt;em&gt;too&lt;/em&gt; slick without anything concrete in the prompt, the limit is rarely the mini PC: it is &lt;strong&gt;text-in, text-out&lt;/strong&gt; with nobody else reading disk for you. §10 walks the receipts; day-to-day, &lt;strong&gt;you&lt;/strong&gt; supply the ground truth.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
