Ollama Keep Alive

Short definition

Ollama Keep Alive is the duration that Ollama holds a model loaded in memory (RAM or VRAM) after the last inference request before automatically unloading it. It is controlled by the OLLAMA_KEEP_ALIVE environment variable or the keep_alive field in individual API requests. The default value is 5 minutes.

Extended definition

When Ollama serves a local language model, it loads model weights into RAM or VRAM at the time of the first request. Keeping those weights loaded between requests eliminates the reload cost on subsequent calls. The Ollama keep alive setting controls exactly how long Ollama waits after the last completed request before evicting the model from memory.

The setting accepts human-readable duration strings such as 30s, 5m, or 1h, and a bare number is interpreted as seconds. It also accepts -1, which keeps the model in memory indefinitely until the Ollama process stops or the model is explicitly unloaded. Setting it to 0 causes Ollama to unload the model immediately after each request completes, reclaiming memory right away.
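The three kinds of value described above can be told apart with a small helper. The following is a minimal sketch, not Ollama's own parser: the function names keep_alive_seconds and describe_keep_alive are hypothetical, and the code only handles a single number followed by s, m, or h, whereas Ollama may accept further duration forms.

```python
import re

# Units for the simple "<number><unit>" subset handled by this sketch.
# Assumption: only single-unit strings such as "30s", "5m", "1h" are covered.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"^(-?\d+(?:\.\d+)?)([smh])$")


def keep_alive_seconds(value):
    """Convert a keep alive value to seconds (hypothetical helper)."""
    if isinstance(value, bool):
        raise ValueError("keep alive must be a number or a duration string")
    if isinstance(value, (int, float)):
        return float(value)  # bare numbers are treated as seconds
    text = str(value).strip()
    match = _DURATION_RE.match(text)
    if match:
        return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    try:
        return float(text)  # numeric strings such as "-1" or "0"
    except ValueError:
        raise ValueError(f"unsupported keep alive value: {value!r}") from None


def describe_keep_alive(value):
    """Explain in words what a keep alive value asks for."""
    seconds = keep_alive_seconds(value)
    if seconds < 0:
        return "keep loaded indefinitely"
    if seconds == 0:
        return "unload immediately after the request"
    return f"unload after {seconds:g} idle seconds"
```

For example, describe_keep_alive("5m") reports an unload after 300 idle seconds, while describe_keep_alive(-1) reports that the model stays loaded indefinitely.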

This parameter matters most in production-style deployments where response latency is a product requirement. A cold model load for a 7B-parameter model can take roughly 10 to 30 seconds, depending on storage speed and model format. For interactive applications or pipelines that call the model frequently, that delay is usually unacceptable. For batch jobs or scheduled tasks that run infrequently, keeping a model loaded wastes memory that other processes could use.

The keep_alive field can also be set per API request, which gives application-level control independent of the server default. This allows a single Ollama instance to serve multiple workloads with different memory retention policies applied to each model or each calling client.
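A per-request override might look like the sketch below, which builds a request body for the /api/generate endpoint and attaches a keep_alive value chosen by caller context. The context names, the KEEP_ALIVE_BY_CONTEXT mapping, and the helper names are hypothetical; the URL assumes Ollama is listening on its default local port.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical caller contexts mapped to retention policies.
KEEP_ALIVE_BY_CONTEXT = {"interactive": "10m", "batch": 0}


def build_generate_payload(model, prompt, context):
    """Build a request body, adding keep_alive only for known contexts."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    keep_alive = KEEP_ALIVE_BY_CONTEXT.get(context)
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive  # overrides the server default
    return payload


def generate(model, prompt, context, timeout=120):
    """Send the request and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt, context)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Callers with an unknown context send no keep_alive field at all, so the server default applies to them.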

Deep technical explanation

Ollama manages model lifecycle through an internal runner that tracks each loaded model’s last active timestamp. After each successful inference, the runner resets a timer tied to the configured keep alive duration. When the timer expires, Ollama invokes an unload routine that releases the GPU or CPU memory allocated to that model.

How the timer works

The Ollama keep alive value is parsed at startup when set as an environment variable, or evaluated per request when passed in the API payload. Per-request values override the global default for the model that request targets. If a second request arrives before the timer expires, the timer resets rather than triggering an unload. This means a model under steady low-frequency load will stay resident as long as requests arrive within the keep alive window.
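The reset behavior can be illustrated with a toy model. This is a sketch of the idea only, not Ollama's internal implementation: the IdleTimer class and its method names are hypothetical, and time is passed in as plain numbers so the behavior is easy to follow.

```python
class IdleTimer:
    """Toy model of an idle-expiry timer; not Ollama's actual code."""

    def __init__(self, default_keep_alive):
        self.default_keep_alive = default_keep_alive  # seconds
        self.expires_at = None  # None means the model is not loaded

    def on_request_complete(self, now, keep_alive=None):
        """Restart the countdown; a per-request value overrides the default."""
        duration = self.default_keep_alive if keep_alive is None else keep_alive
        if duration < 0:
            self.expires_at = float("inf")  # negative means never expire
        else:
            self.expires_at = now + duration

    def is_loaded(self, now):
        return self.expires_at is not None and now < self.expires_at


timer = IdleTimer(default_keep_alive=300)
timer.on_request_complete(now=0)
timer.on_request_complete(now=200)  # arrives inside the window, timer resets
print(timer.is_loaded(now=450))     # still loaded: expiry moved to t=500
```

A request that completes at 200 seconds pushes the expiry from 300 to 500 seconds, which is why steady traffic keeps the model resident.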

Memory implications at each setting

Setting OLLAMA_KEEP_ALIVE=-1 is appropriate for dedicated inference hosts where a single model is always in use. The model stays resident across all requests with zero reload cost. The tradeoff is that the GPU or CPU memory allocated to that model stays occupied until the Ollama process stops or the model is explicitly unloaded, which can keep other models from loading.

Setting keep_alive=0 is appropriate for memory-constrained environments or batch pipelines where each job loads the model, completes its work, and immediately frees resources. Every request will incur the full cold-start latency, so this setting is only practical when requests are infrequent or when the calling application is built to tolerate that delay.

The default 5-minute window is a compromise for development environments where a developer interacts with the model intermittently. It covers the typical gap between iterative prompts while releasing memory if the developer steps away.
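Both extremes can also be triggered explicitly. The sketch below assumes the behavior described in Ollama's documentation, where a request to /api/generate that names a model but carries no prompt loads the model, and the same request with keep_alive set to 0 unloads it. The helper names are hypothetical and the model name in the usage note is a placeholder.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def preload_payload(model, keep_alive=-1):
    """Body that loads a model without generating text."""
    return {"model": model, "keep_alive": keep_alive}


def unload_payload(model):
    """Body that asks Ollama to release the model right away."""
    return {"model": model, "keep_alive": 0}


def post(payload, timeout=120):
    """Send a payload to the local Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

A dedicated host could call post(preload_payload("llama3.2")) at startup to warm the model, and a batch job could call post(unload_payload("llama3.2")) once its work is finished.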

Edge cases and failure modes

Running multiple models concurrently on a host with limited VRAM can cause requests for a second model to queue or fail when that model cannot be loaded alongside the first. If keep alive is set high or to -1 for one model, the outcome for requests to a second model depends on the Ollama version and its concurrency configuration, such as the OLLAMA_MAX_LOADED_MODELS limit. Operators must account for total VRAM across all potentially resident models when choosing keep alive durations.
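The VRAM accounting can be sketched as simple arithmetic over the list of loaded models. The example below assumes the /api/ps endpoint, which reports loaded models with a size_vram field in bytes. The helper names and the 10 percent headroom are illustrative assumptions, not values Ollama recommends.

```python
import json
import urllib.request

GIB = 1024 ** 3

# Assumption: Ollama is reachable on its default local address.
PS_URL = "http://localhost:11434/api/ps"


def fetch_loaded_models(timeout=10):
    """Return the parsed /api/ps response from a local Ollama server."""
    with urllib.request.urlopen(PS_URL, timeout=timeout) as response:
        return json.loads(response.read())


def resident_vram_bytes(ps_response):
    """Sum the VRAM reported for every currently loaded model."""
    return sum(m.get("size_vram", 0) for m in ps_response.get("models", []))


def can_load_alongside(ps_response, new_model_bytes, total_vram_bytes, headroom=0.10):
    """Check whether another model fits while leaving some VRAM unused."""
    budget = total_vram_bytes * (1 - headroom)
    return resident_vram_bytes(ps_response) + new_model_bytes <= budget
```

With 7 GiB already resident on a 12 GiB card, a 3 GiB model fits inside the 10 percent headroom while a 4 GiB model does not.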

Another edge case involves containerized deployments where the Ollama process restarts due to OOM kills or health check failures. A restart resets all in-memory state regardless of the keep alive setting, meaning the first post-restart request always incurs cold-start latency. Applications calling Ollama should implement retry logic with appropriate timeouts to handle this transparently.
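Retry logic of this kind can be sketched as a small wrapper with exponential backoff. The function name call_with_retry, the attempt count, and the delays are hypothetical choices; the wrapper retries on OSError, which covers connection refusals and the URL errors raised by Python's urllib.

```python
import time


def call_with_retry(call, attempts=4, base_delay=1.0, sleep=time.sleep, retry_on=(OSError,)):
    """Run call(), retrying with exponential backoff on connection-style errors."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The per-call timeout should be long enough to cover a cold model load, because the first request after a restart has to wait for the weights to be loaded again.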

Practical examples

An internal developer tool calling a code completion model expected sub-second response times. The team set OLLAMA_KEEP_ALIVE=-1 on a dedicated GPU host, eliminating reload latency entirely. The model stayed resident across all editor sessions throughout the workday.

A nightly document summarization pipeline processed thousands of files in a single batch job. The team set keep_alive=0 in each API call so memory was released immediately after each document. Total batch time increased because each call reloaded the model, but the host remained stable without hitting OOM limits.

A multi-model API gateway served three different models on the same GPU host. The team assigned keep_alive=10m to the high-traffic primary model and keep_alive=1m to two rarely used auxiliary models. This prevented all three from loading simultaneously while still keeping the primary model warm under normal load.

A chatbot demo running on a developer laptop used the default 5-minute keep alive. The model stayed loaded during active testing sessions and unloaded automatically when the developer switched to other tasks, freeing RAM for the IDE and browser.

Why it matters

Cold-start latency for a 7B-parameter model can range from roughly 10 to 30 seconds, which is unacceptable for interactive applications without proper keep alive configuration.
Setting keep alive incorrectly on memory-constrained hosts causes model load failures that surface as opaque API errors to the calling application.
Per-request keep_alive overrides allow a single Ollama instance to serve multiple workloads with different latency and memory tradeoffs simultaneously.
Indefinite keep alive on dedicated inference hosts completely removes the reload cycle from production latency budgets.
Zero keep alive on batch workloads prevents VRAM exhaustion during long-running jobs that process large volumes of items.
Understanding this setting is a prerequisite for capacity planning in any system running multiple local LLMs on shared GPU infrastructure.

How BlueGrid.io uses it

BlueGrid.io builds and manages engineering teams that integrate local LLM inference into client products using Ollama as a runtime layer. Configuring keep alive correctly is part of the production readiness checklist BlueGrid.io engineers apply before any model-serving component ships.

BlueGrid.io engineers instrument Ollama deployments with model load and unload events piped into Grafana dashboards, so clients can observe the direct relationship between keep alive settings and p95 response latency in real workloads.
For clients such as Recorded Future and SecurityTrails that run inference pipelines at high request volumes, BlueGrid.io configures OLLAMA_KEEP_ALIVE=-1 on dedicated GPU nodes and separates batch workloads onto isolated hosts with zero or short keep alive settings to avoid resource contention.
BlueGrid.io Python and Node.js backend teams build Ollama client wrappers that pass per-request keep_alive values based on caller context, allowing a single deployment to serve both interactive and batch clients without a separate Ollama instance for each.
BlueGrid.io DevOps engineers encode keep alive configuration into systemd service units and Docker Compose environment blocks, ensuring the setting survives host reboots and container restarts without manual intervention.
When onboarding new engineers into client teams, BlueGrid.io includes Ollama memory management in the technical onboarding materials so engineers understand the full cost model of each inference call before they write production code.

This expertise is part of BlueGrid.io’s software development service, where engineering teams are built and embedded to own the full stack from inference configuration to application delivery.