WWW.ADMIN-MAGAZINE.COM
ADMIN
Network & Security
IPv6-Mostly
ISSUE 92
IAM for machines, workloads, and agents
Non-Human IAM
Prometheus + Cortex
DVD INSIDE
Bloonix
Combine numerous 
services for continuous 
IT monitoring
Prowler
Check AWS infrastructure for 
vulnerabilities, compliance, 
and security gaps
Geofencing
Isolate web services from 
the public Internet
Zabbix
Monitor constrained 
environments over time
Prometheus plus Cortex
MAT for large volumes of historical data
Uptime Kuma
Self-hosted uptime monitoring
IPv6-Mostly Networks 
RT Industrial Ethernet 
Protocols
Datapizza-AI
Edge AI automation on 
constrained hardware
Non-Human 
Identity Management
 E-Ticket
Artificial Intelligence (AI) and its value, ethics, power, and future are in your daily news feed. Journalists raise 
important questions such as “Will AI replace your job?” and “Will the government really remove guardrail 
protection and train AI bots to surveil and potentially harm citizens?” The topic makes good headlines, but the 
concern is real. I’m sure some of you IT administrators out there have wondered if an AI bot will take your job. 
The answer isn’t obvious, but as company executives strive to make investors happy, the measures they take will 
surely affect employment in a negative manner.
The funny part is that executives are always hunting for ways to save money without causing issues with business 
continuity by targeting the people who are in the trenches doing the actual work. Those of us in the trenches 
aren’t making the big money that the executives enjoy. We all know about the disparity between 
worker pay and executive pay, so why don’t business owners look to replace management with 
AI bots rather than the people who flip the switches and push the buttons? Management 
would be far easier to replace than someone who performs hands-on tasks. If I were to 
write a simple script to replace almost every IT manager, it would go something like:
#!/bin/bash
# Read all input (but ignore its contents)
read -r input
# Randomly choose response
if (( RANDOM % 2 )); then
 echo "Yes."
else
 echo "I'll get back to you with an answer."
fi
Even if you can’t read a Bash script, I think you get the idea that it’s much 
simpler to replace someone who only supplies a “Yes” or an “I’ll get back 
to you with an answer” than it is to replace someone who needs to make 
decisions; fix what’s broken; troubleshoot complex situations; and interact with 
users, customers, and managers. That’s my hot take. I’m sure some very competent middle 
managers and executives do much more than placate their management, owners, and shareholders, but I have 
yet to encounter them in my career. Perhaps my scope and experience are limited.
This part of the AI roller coaster is that long, slow ride to the top before you’re released into freefall with your 
hands held high: waiting on so-called decision makers to contemplate your fate while you worry about your 
mortgage, children’s healthcare, and career options in a world motivated by finding the lowest successful bidder.
You’ll also observe on your trek around the loops that no matter how successful AI companies are, their stock 
prices still fall. It’s the exact opposite of what should happen. It’s the feeling of falling although you’re traveling 
up against gravity. I understand the uncertainty surrounding AI: its promises, its future, and, more personally, 
what it’s going to do to me and my family.
What is certain, though, is that soon AI will affect every part of your life – your car, your appliances, your home, 
your communications, your privacy, your healthcare, and even your food. People are blindly embracing AI and 
its flaws as if it were as safe as those foam balls introduced back in the 1970s that “won’t hurt babies or old 
people.” AI is the new foam ball. It seems safe and benign because we control it. What happens, though, when 
no human riders are on board or no person is pulling the lever to start and stop the roller coaster? Will we still 
feel the same?
Don’t get me wrong. I am a daily user of AI tools. I have multiple AI “badges” that prove my competence. 
However, as with any tool, there is good and bad. A hammer is a great tool, but if you drop it on your foot or 
hit your finger, it’s now a @#$%! menace. I expect to see a lot of AI hammers being dropped onto human hands, 
feet, and careers. Tools are good until you lose control of them. The roller coaster still requires a human hand 
at the switch. A roller coaster without human riders is no fun. Let’s go forward and vow to use this new tool 
ethically, safely, and with restraint.
No AI was used in the writing of this article.
Ken Hess • Senior ADMIN Editor
Let’s keep the AI technology roller coaster in human hands.
Lead Image © kgtoh, 123RF.com
Welcome to ADMIN
54 Forced Tunneling
The Microsoft security service 
tunnels all traffic from Azure 
resources downstream, so 
Internet-bound traffic can be 
inspected and monitored by a 
local firewall before it leaves the 
regional Azure gateway.
62 Geofencing
Use geofence technology to isolate 
your web services from the broader 
public Internet with custom security 
rules and worker routes.
32 Azure Storage Explorer
Manage, automate, and perform 
diagnostics while supporting Azurite 
storage integration, shared access 
signature management, and error 
analysis.
36 Datapizza-AI PHP
Orchestrate API-first agents and 
local vector stores on constrained 
hardware without GPUs.
42 IPv6-Mostly Networks
Offer the best user experience 
while reducing IPv4 resource 
consumption to a minimum.
48 Java Memory 
Management
Scale the steep Java memory 
management learning curve 
while keeping applications up and 
running and looking for trends 
that signal imminent crashes.
12 Non-Human Identity 
Management
Many non-human identities — 
workloads in the cloud, 
service accounts in IT systems, 
autonomous agents in AI 
applications — are poorly 
managed or not managed at all. 
We present a strategic, holistic 
approach to managing these 
identities.
18 Prometheus plus Cortex
This monitoring, alerting, and 
trending software is considered 
the standard, but it is slow when 
faced with a large volume of 
historical data. Cortex comes to 
the rescue, with cluster support, 
as well.
26 Uptime Kuma
A combination of easy installation, 
attractive interface, and extensive 
feature set makes Uptime Kuma 
a good choice for self-hosted 
uptime monitoring.
68 Prowler
Systematically check your AWS 
infrastructure for vulnerabilities, 
meet compliance requirements, 
and automatically plug security 
gaps.
76 MITRE Caldera
Emulate attacks and optimize 
monitoring with automated security 
testing that facilitates the work of 
red and blue teams.
Tools
Containers and Virtualization
Features
Security
ADMIN
Network & Security
@adminmag
ADMIN magazine
@adminmagazine@hachyderm.io
@admin-magazine.com
You’ll find code and listings for ADMIN articles here: 
https://linuxnewmedia.thegood.cloud/s/9nFQcFb2p8oRMEJ
Table of Contents
Nuts and Bolts
78 Bloonix
Combine the numerous 
monitoring services in complex 
environments into a single 
interface.
82 Data Collection with 
Zabbix
Available system utilities and 
tools can provide reliable, policy-
compliant monitoring coverage 
in restricted environments where 
traditional approaches fail.
 Rocky Linux 10.1
The first minor release since the 2025 
major release of the Rocky Linux (RL) 
enterprise operating system retains 
the improvements and upgrades of 
RL 10 [1], including the following tools:
12 | Non-Human 
Identity Management
IAM for machines, workloads, and agents
Address the increasing number of attack surfaces presented 
by NHIs by focusing on attribute and capability descriptions.
the community growing steadily 
and the project seeing a continu-
ous influx of work from contributors 
worldwide.
Behind the Bear
At its core, Uptime Kuma (Figure 1) 
is a monitoring tool that monitors 
the availability of network services. 
Unlike commercial SaaS solutions, it 
Formerly a hobbyist project, Uptime Kuma has developed into one of today’s most popular open source 
monitoring tools in just a few years. By Marius Quabeck
Uptime Kuma Open Source Monitoring Tool
 Ursa Major
Figure 1: The dashboard shows all monitors at a glance, color-coded by status.
Photo by Zdeněk Macháček on Unsplash
runs entirely on your infrastructure – 
whether a Raspberry Pi in your liv-
ing room, network-attached storage 
(NAS) in your basement, a virtual 
server at your hosting provider, or a 
full-fledged data center. The software 
does not need a cloud connection and 
stores all its data locally, making it 
the ideal choice for anyone who pre-
fers to keep their data on-premises.
The architecture is based on modern 
web technologies: Node.js and Vue.js provide the back end and reactive 
user interface, respectively. TypeScript 
ensures type safety in the code, and 
SCSS stylesheets enable an attractive 
design with a dark and light mode. 
SQLite is the default database, sup-
porting operation without the need 
to install a separate database. Version 
2.0 introduced MariaDB as an alterna-
tive – an important boost for larger 
installations.
The installation is typically Docker-
based; a single command is all it takes:
docker run -d \
  --restart=unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:2
After a few seconds, the dashboard 
becomes accessible on port 3001. 
When you access the dashboard for 
the first time, you need to create 
the admin user account and select 
the language and, optionally, the 
database type before monitoring can 
alternative to commercial monitoring 
solutions. The software is completely 
free, available under the non-restric-
tive MIT license, and does not incur 
any costs, apart from whatever you 
pay for the hosting infrastructure, 
with no subscription model, no limit 
on the number of monitors, and no 
hidden costs for additional notifica-
tion channels.
Uptime Kuma’s ease of use allows 
even employees without in-depth 
Linux knowledge to create new moni-
tors or configure notifications. The 
web interface is intuitively designed 
and does not require a command 
line. Status pages keep customers and 
stakeholders informed of service sta-
tus – professional external communi-
cation without additional tools.
Enterprise Environments
Larger organizations also benefit from 
Uptime Kuma, although it is often on 
top of more comprehensive monitor-
ing stacks. The software is relatively 
straightforward; a combination of 
Prometheus, Grafana, and Alertman-
ager is definitely more powerful, but 
ultimately far more complex. Instead 
of needing weeks of training, Uptime 
Kuma gives you quick results without 
complex configuration.
In version 2.0 and with MariaDB 
support, scalability has improved sig-
nificantly. Installations with several 
hundred monitors and longer data 
histories benefit from the more ro-
bust database infrastructure. Rootless 
begin. The entire setup takes just a 
few minutes.
Home Lab Enthusiasts
The largest target group is probably 
operators of private servers and home 
networks. Anyone who owns a media 
server such as Jellyfin or Plex, uses a 
Nextcloud instance for data synchro-
nization, uses Home Assistant for 
home automation, or hosts other ser-
vices at home wants to know whether 
everything is working. Uptime Kuma 
works seamlessly on a Raspberry Pi 
and reliably keeps an eye on your 
home infrastructure.
The resource requirements are frugal: 
A single CPU core and 512MB of RAM 
are sufficient for smaller installations 
with a few dozen monitors. Of course, 
as the number of monitors increases 
and the check intervals become 
shorter, the requirements also grow, 
but even faced with several hundred 
endpoints, Uptime Kuma still has a 
modest appetite.
What is particularly practical for 
home lab users is that Uptime Kuma 
can be integrated directly into Home 
Assistant, where it appears as an add-
on, which means you can integrate 
availability data into existing dash-
boards and automations.
Small and Medium-Sized 
Enterprises
For companies with limited IT bud-
gets, Uptime Kuma is a cost-effective 
Figure 2: Uptime Kuma supports numerous monitoring types, from HTTP through DNS to Docker containers.
Docker images also address security 
requirements relevant in enterprise 
environments.
Versatile Monitoring
Uptime Kuma is not limited to simple 
HTTP checks. The software supports 
a wide range of protocols and testing 
methods. HTTP/ HTTPS monitoring 
(Figure 2) is the classic example 
where the software calls a URL and 
checks the status code. Advanced 
options let admins validate certain 
keywords in the response body, 
which comes in handy when a faulty 
application is returning HTTP 200 but 
displaying an error page. Targeting 
API responses with JSON queries is 
useful for health check endpoints that 
provide structured status information.
SSL certificate monitoring is integrated 
into HTTP(S) monitors. Uptime Kuma 
not only checks whether an encrypted 
connection is possible but also warns 
of expiring certificates. The lead time 
for warnings can be set; timely notifi-
cations prevent unpleasant surprises 
from expired certificates.
TCP port monitoring checks whether a 
specific port on a server is accessible, 
which is useful for services such as 
databases, email servers, or proprietary 
applications that do not use HTTP. The 
check confirms ac-
cessibility at the net-
work level but does 
not provide informa-
tion on the applica-
tion status.
Ping/ ICMP monitor-
ing tests the basic 
accessibility of a 
host. If you cannot 
even ping the target, 
the problem is likely 
to be more serious 
than just a crashed 
web server. The en-
tire machine could 
be offline, or a net-
work route might 
be down. DNS 
monitoring checks 
for correct domain 
name resolution by 
helping to identify 
misconfigurations or propagation 
problems at an early stage before they 
affect end users.
Docker container monitoring provides 
a direct view of the container status, 
provided Uptime Kuma can access 
the Docker socket. This information is 
particularly useful when services are 
running inside the container but are 
difficult to test from the outside. Up-
time Kuma displays the container status 
(running, stopped, restarting) and can 
alert you to status changes.
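If Uptime Kuma itself runs in Docker, you can pass the socket into the container when you start it. The following sketch simply extends the run command shown earlier and assumes the default socket path on the host:
docker run -d \
  --restart=unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --name uptime-kuma \
  louislam/uptime-kuma:2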
Database checks for MySQL, MariaDB, 
PostgreSQL, Redis, and MongoDB en-
able genuine connectivity tests instead 
of simple port availability. The software 
opens a connection and runs a simple 
query. If the query is successful, the 
monitor is considered “up.” Game 
server monitoring checks the availabil-
ity of Steam game servers, such as for 
Counter-Strike, Team Fortress 2, Rust, 
or ARK. For community server opera-
tors, this helpful feature is rarely found 
in generic monitoring tools.
Monitoring of MQTT (a standards-
based publish-subscribe messaging pro-
tocol) checks message brokers, such as 
those found in Internet of things (IoT) 
environments. Version 2.0 introduced 
the ability to run JSON queries to eval-
uate specific message content. Push 
monitoring reverses the principle: 
Instead of Uptime Kuma querying a 
service, the service regularly reports 
to the application. An alarm is trig-
gered if an expected message is not 
received, which is useful for cron jobs, 
backup scripts, or other periodic tasks. 
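As an illustration, a backup script can report in with a single HTTP request at the end of its run; the host name and token in this sketch are placeholders for the push URL that Uptime Kuma generates for the monitor:
#!/bin/bash
# Run the backup, then report success to the push monitor
tar czf /backup/nightly.tar.gz /srv/data && \
  curl -fsS "https://uptime.example.com/api/push/abc123?status=up&msg=OK&ping="
If the job fails or never runs, no request arrives and the monitor goes down. 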
SMTP and SNMP monitoring have also 
been part of the feature set since ver-
sion 2.0. You can use these functions 
to monitor mail servers and network 
devices such as switches and routers 
without additional tools.
Browser monitoring relies on an em-
bedded Chromium browser (or Micro-
soft Edge as an alternative as of ver-
sion 2.0) to load web pages like a real 
user and identifies JavaScript errors 
that remain invisible in simple HTTP 
requests. Remote browser support al-
lows resource-intensive checks to be 
outsourced to dedicated machines.
Flexible Check Intervals
The shortest possible check frequency 
is 20 seconds – a value more com-
monly found in enterprise solutions. 
A SaaS service like UptimeRobot will 
only support such short intervals in 
commercial plans. Of course, longer 
intervals can also be set up, such as 
60 seconds, five minutes, or more.
Shorter intervals mean faster noti-
fication in the event of failures, but 
Figure 3: More than 90 notification channels are available, ranging from email and messengers to incident 
management systems.
pending or maintenance. The project 
improved the loading performance 
in version 2.0, meaning that even 
installations with a large number of 
monitors can be operated smoothly. 
Detailed views (Figure 5) provide 
information with interactive ping 
diagrams, uptime percentages over 
various periods (24 hours, 30 days, 
one year), and average response 
times over time. The diagrams are 
reactive and show details when 
moused over.
Badges are available for integration 
into other systems. These small 
graphics reveal the current status or 
uptime of a monitor. You can embed 
them in README files on GitHub, 
integrate them into internal wikis, 
or display them on dashboards. The 
badge URLs support different styles 
and time periods.
Security Features
Two-factor authentication (2FA) pro-
tects access to the dashboard. Uptime 
Kuma supports time-based one-time 
password (TOTP) apps such as 
Google Authenticator or Authy. Since 
version 2.0, the input field automati-
cally focuses on the token field when 
logging in with 2FA.
The application lets you manage 
multiple users, although they all have 
identical authorizations. Role-based 
access control (RBAC) or the option 
to give certain users read-only access 
does not exist, which is problematic 
for larger teams or organizations with 
different responsibilities.
API and Integration
With the help of a Prometheus 
metrics interface, you can integrate 
Uptime Kuma into existing monitor-
ing stacks, which means you can 
visualize information from the tool 
in Grafana dashboards or correlate 
it with other metrics. This option 
proves particularly useful when 
Uptime Kuma is used as part of a 
larger observability solution.
API keys secure access to the met-
rics endpoint. Once you have con-
figured an API key, simple HTTP 
Webhooks for individual integrations 
mean you can connect virtually any 
system. The webhook payload is docu-
mented and can be processed in your 
own scripts or automations. SMS mes-
sages by various providers (e.g., Twilio, 
Clickatell, or SMSEagle) reach recipi-
ents without an Internet connection.
In version 2.0 the list was expanded 
to include Nextcloud Talk, Brevo (for-
merly Sendinblue), Evolution API, 
and Home Assistant. A new environ-
ment variable supports operation 
behind proxies, which is important in 
corporate environments with restric-
tive network policies.
Status Pages
Uptime Kuma can generate public 
status pages (Figure 4) that display 
the current status of selected services. 
Multiple status pages with different 
monitors can be set up and broken 
down by, say, customer group, prod-
uct, or internal and external services.
Additionally, you can customize the 
design with your own logo, title, and 
description. The pages contain the cur-
rent status of each monitor displayed 
in groups, a timeline with the latest 
events, and optional maintenance 
notes. You can define maintenance 
windows in advance so that planned 
downtime does not trigger alarms. 
Public monitor URLs were introduced 
in version 2 that let you access indi-
vidual monitors directly without hav-
ing to share the entire status page.
The dashboard shows all monitors at 
a glance, color-coded by status: green 
for up, red for down, and yellow for 
also higher resource load on both the 
monitoring server and the monitored 
targets. For each monitor, you can in-
dividually specify the number of failed 
checks after which you want Uptime Kuma 
to alert you, which also prevents false 
positives in the event of short-term 
outages or slow network connections.
Notifications
Uptime Kuma’s strength lies in its in-
tegration with more than 90 different 
notification channels (Figure 3) with 
email (SMTP) as the legacy choice. 
Uptime Kuma supports any SMTP 
server, and allows TLS encryption. 
As of version 2.0, the templates use 
LiquidJS, which allows for flexible 
customization with variables such as 
name, msg, status, heartbeatJSON, monitorJSON, and hostnameOrUrl. HTML 
support in the templates ensures at-
tractively formatted notifications.
When it comes to messenger integra-
tion, Uptime Kuma impresses with 
choice: Telegram, Signal, Discord, 
Slack, Microsoft Teams, Mattermost, 
Rocket.Chat, and more are supported. 
Integration typically relies on web-
hooks or bot APIs. Discord and Slack 
also support rich embeds with color-
highlighted status indicators.
Push services such as Pushover, Go-
tify, ntfy, Pushbullet, and Apprise pave 
the way for notifications on mobile 
devices without email or messenger. 
PagerDuty, Opsgenie, Splunk On-Call, 
Grafana OnCall, and other enterprise 
tools can be connected as incident 
management systems, enabling escala-
tion chains and on-call rotations.
Figure 4: Public status pages inform customers and stakeholders about the status of services.
basic authentication is disabled. 
The keys can be managed in the 
dashboard.
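As a quick test, you can query the endpoint with the key; the sketch below assumes the key is passed as the basic-auth password and uses placeholder values for the host and key:
# Check the Prometheus metrics endpoint with an API key
curl -u ":uk1_examplekey" https://uptime.example.com/metrics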
Limitations
Despite the enthusiasm generated by 
Uptime Kuma, the application does 
have some limitations. Because all 
tests are run from a single location, 
if the server running Uptime Kuma 
is experiencing network problems, 
all monitors will appear to be down. 
Conversely, local problems are not 
detected if the checks come from out-
side. Commercial services such as Up-
timeRobot, Pingdom, or Better Stack 
test from multiple geographically dis-
tributed locations and can therefore 
more reliably distinguish between 
genuine outages and local problems. 
This Uptime Kuma disadvantage can 
be partially compensated for by run-
ning multiple instances at different 
locations, but you won’t get a central 
correlation of the results.
A full-fledged REST API for manag-
ing monitors is also still a work in 
progress. If you want to create or 
configure monitors automatically, 
you will encounter limitations. The 
WebSocket-based API is primarily 
intended for the web interface and is 
not documented. Infrastructure-as-
code approaches, in which monitors 
are versioned in Git, are therefore dif-
ficult to implement.
Uptime Kuma primarily measures 
availability and response time. More 
in-depth performance metrics such 
as throughput, error rates by cat-
egory, synthetic transactions across 
multiple steps, or real-user monitor-
ing (RUM) are outside its feature 
scope. If you need these features, 
you will have to turn to specialized 
application performance manage-
ment (APM) tools such as New Relic, 
Datadog, or Grafana Cloud.
You do have to operate the software 
yourself, with all the responsibilities 
that entails: installing updates, creat-
ing backups, and ensuring the avail-
ability of the monitoring server. The 
paradox is obvious – who is monitor-
ing the monitor? If you don’t want to 
or can’t do this, you might be better 
off with a SaaS service.
As a community project, Uptime 
Kuma does not guarantee support. 
GitHub issues and discussions are 
active, and the maintainer responds 
regularly, but you won’t be subject to 
any service-level agreements (SLAs), 
which is a risk for business-critical 
applications that require guaranteed 
response times. The application stores 
all the settings in the database, and 
you manage them in the web inter-
face. With no option for YAML file-
based configuration, in contrast to 
Gatus, for example, versioning, code 
reviews, and automated deployment 
are more complicated.
Version 2.0
On October 20, 2025, Lam released 
Uptime Kuma 2.0. After a year of de-
velopment with five beta versions, it 
was the biggest release to date. The 
list of changes is extensive, and some 
of them require attention during the 
update.
The most important new feature is 
the optional use of MariaDB as a da-
tabase. SQLite works well for smaller 
installations but reaches its perfor-
mance limits if you have several hun-
dred monitors and an extensive data 
history. In particular, SQLite’s locking 
behavior during simultaneous access 
can cause problems at times.
MariaDB offers better scalability, 
more robust locking, and the ability 
to run the database on a dedicated 
server. This is an important step for 
enterprise deployment. Please note 
that currently automatic migration 
Figure 5: The detailed view shows history charts, uptime statistics, and average response times.
 Small setup (up to 50 monitors): 
one CPU core, 512MB of RAM
 Medium setup (50 to 200 moni-
tors): two CPU cores, 1GB of RAM
 Large setup (at least 200 moni-
tors): four CPU cores, 2GB of RAM; 
MariaDB recommended
Of course, memory requirements grow 
with data history. SQLite databases of-
ten reach several gigabytes if they have 
been running for a long time.
Conclusion
Uptime Kuma has earned its place 
in the toolboxes of admins and de-
velopers. The combination of easy 
installation, attractive interface, 
and extensive feature set makes it a 
good choice for self-hosted uptime 
monitoring.
Version 2.0 eliminates some sig-
nificant weaknesses of the previous 
version – in particular, scalability 
thanks to MariaDB support and se-
curity with rootless containers. The 
project is showing no signs of slow-
ing down; on the contrary, the active 
community and dedicated maintainer 
ensure continuous improvements. 
For example, more than 100 pull re-
quests were incorporated into the 2.0 
beta versions alone.
If you are aware of, and can live 
with, the limitations and do not 
need distributed monitoring, a REST 
API, or enterprise features such as 
role-based access control, you will 
find Uptime Kuma to be a mature, 
perfectly functional tool. For more 
complex requirements, alternatives 
such as Gatus, Prometheus with 
Blackbox Exporter, or commercial 
solutions are worth a look. For the 
majority of use cases, though, from 
home labs to small and medium-
sized enterprises, Uptime Kuma is 
just the right size, and the little bear 
is a reliable watcher. 
Info
[1] Project wiki: [https://github.com/louislam/uptime-kuma/wiki]
[2] Project page: [https://github.com/louislam/uptime-kuma]
[3] GitHub Container Registry: [https://ghcr.io]
devices such as switches, routers, 
and uninterruptible power systems 
(UPSs). Uptime Kuma thus is making 
inroads into classic network monitor-
ing territory that previously required 
other tools.
The real browser monitor supports 
Microsoft Edge in addition to Chro-
mium. You can also connect remote 
browsers without a fully installed 
browser on every monitoring server, 
which saves resources and means the 
browser infrastructure can be distrib-
uted. JSON queries for MQTT moni-
tors enable targeted evaluation of IoT 
messages. Instead of simply checking 
whether a message has arrived, the 
content can now be validated.
Hands On
A minimal configuration for a main-
tainable installation can be created in 
a docker-compose.yml file:
services:
  uptime-kuma:
    image: louislam/uptime-kuma:2
    container_name: uptime-kuma
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
The data directory contains the 
SQLite database and all the settings. 
Creating regular backups, ideally by 
a cronjob with rsync or tar, is essen-
tial. If you want to use the embed-
ded MariaDB, select the matching 
option in the setup wizard when you 
first start the application. The data 
will then also be stored in the data 
directory.
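A sketch of the regular backup mentioned above, assuming the compose file lives in /opt/uptime-kuma and archives go to /backup, could look like this:
# /etc/cron.d/uptime-kuma-backup: nightly archive of the data directory
30 2 * * * root tar czf /backup/uptime-kuma-$(date +\%F).tar.gz -C /opt/uptime-kuma data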
For production use, you will want to 
run Uptime Kuma behind a reverse 
proxy with TLS termination. The ap-
plication supports NGINX, Caddy, 
Traefik, Apache, and HAProxy. Cloud-
flare Tunnel also works, allowing op-
eration without a public IP address.
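As a minimal sketch, Caddy can terminate TLS with a single command (the domain is a placeholder); it also forwards the WebSocket connections the web interface relies on without extra configuration:
caddy reverse-proxy --from uptime.example.com --to localhost:3001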
Resource Planning
A guideline for hardware resource 
planning comes in three variants:
from SQLite to MariaDB cannot be 
done. If you want to switch, you will 
need to export and import the data 
manually with tools such as sqlite3-to-mysql. The developer also explicitly 
points out that support for migra-
tion problems cannot be provided. 
For new installations, an embedded 
MariaDB option is available that 
does not require a separate database 
installation.
Rootless Docker Images
Security-conscious administrators 
can run Uptime Kuma in containers 
without root privileges as of version 
2. The new rootless images run as 
an unprivileged user node (UID 1000), 
thereby reducing the attack surface. If 
an attacker breaks out of the applica-
tion context, they do not have root 
privileges in the container.
Some restrictions are in place, how-
ever: Docker monitoring does not 
run without additional configuration, 
because access to the Docker socket 
requires root privileges. The file per-
missions in the data directory must 
also be correct (ownership 1000:1000), 
which can make migration from a 
non-rootless installation problem-
atic. Lam expressly recommends not 
switching directly to rootless images 
for upgrades from v1 to v2, but only 
after completing the initial migration.
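Before pointing a rootless container at an existing data directory, you can set the ownership mentioned above on the host; a minimal sketch for a bind-mounted ./data directory:
# Give the unprivileged container user (UID/GID 1000) access to the data
chown -R 1000:1000 ./data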
In addition to rootless images, there are now lean image variants that occupy 300 to 400MB less space than the full ver-
sion. However, some features such 
as Docker monitoring, the embedded 
Chromium browser for real browser 
testing, and the embedded MariaDB 
are not included. If you do not need 
all of that, you can save storage space 
and download time. Finally, the im-
ages are now available on GitHub 
Container Registry (ghcr.io) [3], not 
just on Docker Hub.
Advanced Monitoring
Direct SMTP monitoring of email 
servers is now possible. The check 
goes beyond a simple port test and 
validates SMTP communication. 
SNMP support is aimed at network 
Azure Storage Explorer (ASE) [1] 
offers a lean, locally installable 
environment for managing Azure 
storage resources. ASE supports di-
rect access to blob, file, queue, and 
table storage, with native support for 
Windows, Linux, and macOS, and 
it integrates with local development 
environments such as Visual Studio. 
ASE makes tasks such as copying 
blobs between storage accounts, 
generating Shared Access Signature 
(SAS) tokens, or working with the 
Azurite emulator far more efficient 
than when working in the Azure 
portal.
Connection Options and 
Authentication
ASE offers several options for con-
necting to Azure 
Storage in the Get 
Started tab with 
Attach to a resource 
links (Figure 1): 
with an Entra ID 
login, a connec-
tion key, a SAS, or 
direct URI access. 
When connecting 
to a storage ac-
count with a con-
nection key, you 
can view and copy 
the key directly in 
Access keys on the 
Azure portal. If 
you use Entra ID to 
log in, make sure 
the appropriate 
data roles, such as 
Storage Blob Data 
Contributor or Stor-
age Queue Data 
Reader, are as-
signed; otherwise, 
many operations 
Learn how to manage, automate, and perform diagnostics with Microsoft Azure Storage Explorer, which also 
supportsAzurite storage integration, shared access signature management, and error analysis. By Thomas Joos
Managing Azure Storage Resources
 Expedition
Photo by Christopher Ruel on Unsplash
Figure 1: ASE manages storage services in Azure without relying on the Azure portal.
will remain grayed out or throw errors.
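The data roles can be assigned with the Azure CLI, for example; the user, subscription, resource group, and account names below are placeholders:
# Grant blob data access at the storage account scope
az role assignment create \
  --assignee "user@example.com" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"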
In some scenarios, you will not need 
access to an entire Azure subscrip-
tion, just a single resource. ASE lets 
you integrate resources directly with 
the Connect to Azure Storage option. 
When you get there, you can select 
Blob container or directory, for exam-
ple, as the resource type (Figure 2).
Next, specify the account and ten-
ant, assign a display name, and 
enter the full URL for the resource. 
Alternatively, you can add a SAS as 
a URL to grant selective access to 
containers, files, queues, or tables. 
The resource then appears in Local 
& Attached | Storage Accounts | (At-
tached Containers) in the navigation 
pane. This method is particularly 
useful for automated processes or 
third-party systems with limited 
permissions.
Creating and Managing 
Containers for Blob Storage
To create a new blob container in 
ASE, right-click on the Blob Contain-
ers node of the desired storage ac-
count and select the Create Blob Con-
tainer entry. After 
entering a valid 
name, the new 
container appears 
in the tree struc-
ture. Blobs can 
be dragged and 
dropped into the 
main window or 
uploaded with Up-
load | Upload Files, 
or you can transfer 
entire folders with 
Upload | Upload 
Folder. If you 
want to adjust the 
container’s access 
level, select Set 
Public Access Level 
in the context 
menu and then 
No public access, 
Public read access 
for container and 
blobs, or Public 
read access for 
blobs only.
time-limited SAS tokens. In the 
Shared Access Signature dialog, you 
can specify the start and end times, 
permitted actions (read, write, de-
lete, list), and the time zone. Click-
ing Create generates a link including 
a token, which you can copy di-
rectly with the Copy button.
Individual blobs can be managed 
directly in the main window, where 
functions such as Upload, Download, 
Delete, Open, and Copy are available 
in the toolbar. ASE automatically 
recognizes virtual directories in a con-
tainer. By selecting a blob and click-
ing Open, you can download the file 
locally, and it will open in the default 
program. Blobs are moved by click-
ing Copy and then Paste in the target 
container.
Advanced Container Actions
Containers can be created, deleted, and 
copied in full. To copy, select the Copy 
option in the context menu and paste 
into another storage account – again, 
from the context menu. To delete, 
either select Delete or press the Del 
key. If a blob has snapshots, a dialog 
ASE supports two equivalent access 
methods for specifically checking 
the contents of an existing blob 
container. You can either double-
click on the container entry in the 
tree structure or select the Open 
Blob Container Editor option from 
the context menu. This function is 
particularly helpful if the container 
contains complex directory struc-
tures with many files. The main 
window lists all blobs, subfolders, 
and metadata in a table. You can fil-
ter the overview by simply reorder-
ing the columns or entering search 
terms. You can navigate deeply 
nested structures simply by clicking 
on the directory entries in the path 
bar, and you can edit the metadata 
of individual blobs directly: Use the 
context menu of the object in ques-
tion to call up Container properties 
and adjust the blob type or user-
defined keys, for example.
ASE offers two mechanisms for 
granular access control: policies and 
SAS tokens. The Manage Access Poli-
cies option lets you create perma-
nently valid rules, and Get Shared 
Access Signature lets you generate 
Figure 2: ASE offers numerous options for connecting to storage resources in Azure.
box opens when you delete it, giv-
ing you the option to Delete blobs 
with snapshots. Azure file shares act 
like network drives. Creating a share 
with Create File Share makes the 
share available on user devices. The 
Get Shared Access Signature option 
also lets you transfer temporary ac-
cess rights during this process.
The context menu of the menu item 
in question also lets you create 
queues. Messages are added with Add 
Message, displayed by View Message, 
or removed with Clear Queue. The 
message content is structured, and 
the JSON data is directly readable. 
You can also use the context menu 
to create tables, including manual 
insertion of rows with partition keys, 
row keys, and user-defined fields. The 
filter function lets you analyze indi-
vidual entries.
Local Testing with Azurite
Azurite [2] is a local emulator ideal 
for developing and testing storage 
platforms without active Azure ac-
cess. ASE recognizes running Azurite 
instances by default, provided they 
are listening on the expected ports: 
10000 for blobs, 10001 for queues, 
and 10002 for table storage. Azurite is 
located at Local & Attached | Storage 
Accounts | Emulator – Default Ports in 
the tree structure.
You can set up an alternative port or 
container name manually. To do so, 
open the Connect to Azure storage 
dialog, select Local storage emulator 
as the resource type, and specify the 
ports and a display name. ASE does 
not launch Azurite automatically – 
the container or local instance needs 
to be up and running beforehand. You 
can do this at the command line by 
typing:
azurite --silent --location c:\azurite --debug c:\azurite\debug.log
With Docker, it is a good idea to 
check that the container is running 
(docker container list --all). If the 
ports and network settings are incor-
rect, ASE cannot open a connection. 
If needed, you can reset the ports 
and network settings by typing 
docker restart or create new con-
tainers by executing docker run with 
the mcr.microsoft.com/azure-storage/ 
azurite image. The Docker context 
you use is also important. Linux us-
ers should note that ASE only works 
in the default context. Further ad-
justments can be made with:
docker context use default
Snap users need to set additional per-
missions – for example by typing:
snap connect storage-explorer:docker docker:docker-daemon
After using Azurite to configure a 
custom storage account, you can then 
use the command
docker exec <azurite-container> printenv AZURITE_ACCOUNTS
to check the name and key.
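To start a fresh Azurite container with the default ports that ASE expects, a docker run call along these lines will do (the container name is arbitrary):
# Azurite with blob, queue, and table endpoints on the default ports
docker run -d --name azurite \
  -p 10000:10000 -p 10001:10001 -p 10002:10002 \
  mcr.microsoft.com/azure-storage/azurite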
Troubleshooting and 
Diagnostics
If you come across access problems, 
check the role assignments. Without 
the Storage Blob Data Reader role, 
blobs cannot even be displayed. The 
administration level (subscriptions, 
storage accounts) requires reader or 
contributor roles. If SAS tokens or ac-
count keys are missing, you can add 
them by looking under the Manage 
section, choosing Connect Resource, 
and selecting Shared Access Signature 
(SAS). You can fix TLS/ SSL problems 
by going to Edit | SSL Certificates | 
Import Certificates. The --ignore-certificate-errors parameter at startup 
is not recommended for security 
reasons.
Problems with the authentication 
broker, the login window, or re-login 
can be resolved with Help | Reset or 
by deleting the .IdentityService di-
rectory in the user profile. On macOS, 
go to Login to lock the keychain or 
reauthorize it. If you are using Linux, 
tools such as Seahorse can help you 
manage the standard keychain.
Check the proxy configurations in Set-
tings | Application | Proxy. ASE only 
supports standard authentication; 
NTLM is not compatible. A network 
tool such as Fiddler can help with 
diagnostics if you set it up on local-
host:8888 and configure the proxy 
source in ASE to Use system proxy.
Logging
You can go to Help | Open Logs Direc-
tory to access the application logs. 
Thelog level can be increased to 
Verbose in Settings | Application | 
Log Level. AzCopy logs end up in C:\
Users.azcopy (Windows) or 
~/.azcopy (Linux/ macOS). Authenti-
cation logs in C:\Users\Ap-
pData\Local\Temp\servicehub\logs or 
~/.ServiceHub/logs also offer valuable 
information in the event of errors.
To save and manage your own con-
nections, navigate to Help | Switch de-
veloper tools in the local storage area. 
When you get there, you can clear the 
settings in case of issues by deleting 
the entry in question, leaving just the 
square brackets, [ ]. You can also de-
lete non-functioning SAS URIs in this 
menu by removing targeted entries in 
StorageExplorer_AddStorageServiceSAS_v1_blob.
Automating ASE
One key advantage of ASE is its com-
prehensive support for complex role 
models. Although you can manage 
the way admin-level subscriptions and 
accounts are displayed with reader or 
contributor roles, data operations at 
the resource level require specific as-
signments such as Storage Blob Data 
Contributor or Storage Queue Data 
Reader. The fact that these two levels 
are separated often leads to errors, 
but you can avoid issues by choosing 
clear-cut role assignments.
ASE offers alternatives for environ-
ments with restricted GUI access: The 
various connection options support 
common authentication types and 
connection scripts that use SAS URIs 
or connection keys. You can then 
integrate these into your deployment 
or automation processes. A simple 
resources in Azure. Its strengths lie 
in its broad format support, seam-
less integration of local develop-
ment environments, granular access 
control, and high level of automa-
tion. Practical features such as 
blob copies between accounts, SAS 
management, queue handling, table 
management, emulator support, 
detailed logging, and extensive trou-
bleshooting options make it a go-to 
tool for managing various storage 
platforms in Azure. 
Info
[1] Azure Storage Explorer: [https://azure.microsoft.com/en-us/products/storage/storage-explorer]
[2] Azurite Emulator: [https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite]
The Author
Thomas Joos is a freelance IT consultant and 
has been working in IT for more than 20 years. 
In addition, he writes hands-on books and 
papers on Windows and other Microsoft topics. 
Online you can meet him on [http://thomasjoos.spaces.live.com].
and continuous deployment (CI/ CD) 
pipelines. When managing access 
rights, permanent access policies com-
bined with ephemeral SAS tokens are 
recommended. You can use a program 
to create the tokens, assign an expiration date, and deploy them specifically for each service (blob, queue, 
table). The integration of storage re-
sources in containers with anonymous 
access or publicly accessible blobs is 
also possible, provided you configure 
Set Public Access Level appropriately.
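One way to script token creation is the Azure CLI; this sketch generates a read/list container SAS with a fixed expiry (the account name, key, and container name are placeholders):
# Create a time-limited, HTTPS-only SAS token for a blob container
az storage container generate-sas \
  --account-name mystorage \
  --account-key "<account-key>" \
  --name mycontainer \
  --permissions rl \
  --expiry 2025-12-31T23:59:59Z \
  --https-only \
  --output tsv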
Finally, integration with AzCopy plays 
a central role. Internally, ASE relies 
on the command-line tool for trans-
fers and lets you track every action 
in the GUI. Integration is so tight that 
you can open AzCopy logs directly 
without leaving the tool. If you also 
want to track data movements in the 
background or automate them with 
scripts, instead of just using the inter-
face, you can implement some actions 
in a far faster and easier way than in 
the Azure portal.
Conclusion
ASE offers a powerful interface 
for effectively managing storage 
sample script that automates the pro-
cess of uploading a file to an Azure 
Blob container uses azcopy:
azcopy copy "C:\Data\example.txt" U
 "https://mystorage.blob.core.windows.net/U
 mycontainer/example.txt?U
 sv=2022-11-02&U
 ss=b&srt=sco&sp= wac&U
 se=2025-12-31T23:59:59Z&U
 st= 2025-01-01T00:00:00Z&U
 spr=https& sig=" U
 --overwrite=true
The command copies the file; C:\Data\example.txt defines the local path to 
the file, and https://… is the target 
URL of the blob container, including 
the SAS token. The --overwrite=true 
parameter lets you overwrite existing 
files with the same name. To upload an 
entire folder instead of a single file, just 
extend the command:
azcopy copy "C:\Projects\UploadFolder" U
 "https://mystorage.blob.core.windows.net/U
 mycontainer?sv=" U
 --recursive=true
You can also integrate this script into 
batch files or continuous integration 
What if the key to understanding the 
future of artificial intelligence (AI) 
lies not in the latest GPU, but in a 
14-year-old piece of hardware? In this 
article, I demonstrate how to build 
and run a modern AI agent on a 2011 
Raspberry Pi Model B, a single-core 
computer with just 256MB of RAM. 
The goal is not just to prove it can be 
done, but to show why it matters for 
system administrators and developers: 
By embracing constraints, you can 
design AI systems that are more effi-
cient, transparent, and secure.
Datapizza-AI PHP [1] is an open 
source dependency-free framework 
written in pure PHP 7.4+. Here, I 
show how to build a Sysadmin Agent 
capable of monitoring server health, 
analyzing logs, and reasoning about 
its own actions. This exercise isn’t 
theoretical; it’s a hands-on journey 
into the core mechanics of AI orches-
tration, proving that sophisticated 
automation doesn’t require a cloud-
sized budget. You’ll learn how to 
decouple local logic from remote in-
ference, manage local data with a file-
based vector store, and create custom 
tools that give your agents real-world 
capabilities.
API-First Agent Architecture
At the heart of this project is a 
simple but powerful idea: decoupled 
orchestration. Instead of running 
a massive AI model locally, which 
is impossible on the hardware I’m 
using, I run only the “brain” of the 
agent – the reasoning loop. The 
Raspberry Pi acts as a conductor, 
managing the conversation between 
local tools and powerful remote 
large language models (LLMs) over 
API calls.
This architecture offers three key 
advantages:
 Efficiency: The local footprint is 
tiny. The agent’s logic consumes 
only a few megabytes of RAM, 
making it a negligible load on 
any server – from a vintage Pi 
to a production-grade enterprise 
machine.
 Data Sovereignty: Sensitive data, 
like internal documentation or 
system logs, can be processed and 
stored locally. Only the abstract 
queries or sanitized data snippets 
are sent to the external AI model, 
keeping confidential information 
within your network.
 Transparency: Without complex 
SDKs or black-box libraries, every 
step of the agent’s process – every 
API call, every tool execution, ev-
ery piece of context retrieved – is a 
simple, auditable HTTP request or a 
local file operation (Figure 1).
Why PHP for AI 
Orchestration?
Although Python dominates the 
AI landscape, PHP is surprisingly 
well-suited for the role of an agent 
orchestrator. At its core, an AI agent’s 
reasoning loop is a series of block-
ing, I/ O-bound operations: Make an 
API call, wait for the response, run 
a local tool, wait for the result. PHP 
was born for this role. Its simple, pro-
cedural nature and robust handling 
of HTTP requests make it a perfect fit 
for managing the call-and-response 
flow of an agent’s thought process. 
PHP is the language that built the 
modern web, and its core strengths 
– simplicity, ubiquity, and state man-
agement – are exactly what’s needed 
to build transparent and reliable AI 
agents on the edge.
The API-First Pattern
Most local AI tutorials focus on quan-
tization – shrinking a 70B parameter 
model until it barely fits in RAM. I 
Orchestrating API-first agents and local vector stores on constrained 
hardware without GPUs. By Paolo Mulas
Edge AI Automation on a 2011 Raspberry Pi
 Sublime Pie
Photo by Laura Seaman on Unsplash
take the opposite approach by accept-
ing that a 2011 Raspberry Pi cannot 
run inference. Instead, it is optimized 
for what it does best: I/ O orchestra-
tion (Figure 2).
The architecture comprises three 
decoupled layers:
1. The brain (remote): a high-
intelligence API (OpenAI GPT-4o, 
Anthropic Claude 3.5 Sonnet, or 
a local llama.cpp library instance 
on another machine) that handles 
reasoning and natural language 
generation.
2. The memory (local): a file-based 
vector store holding embeddings 
 Observe: The agent reads the con-
versation history and the user’s 
latest query.
 Reason: This context is sent to the 
LLM with a system prompt de-
scribing available tools.
 Decide: The LLM replies not with 
an answer, but with a structured 
JSON payload: {"tool": "disk_
space", "params": {"path": "/"}}.
 Act: The PHP script parses this 
JSON, instantiates the DiskSp-
aceTool class, executes it, and cap-
tures the output.
 Loop: The tool’s output is ap-
pended to the history, and the loop 
repeats until the LLM decides it 
has enough information to answer.
This transparency is a security feature. 
You can log every single step of this 
decision tree to a text file, creating a 
perfect audit trail of why the AI de-
cided to check a specific logfile.
Environment Setup
Getting started requires minimal 
setup by design. The entire frame-
work is self-contained and has no 
external dependencies – not even the 
Composer dependency manager for 
PHP. All you need is a Linux environ-
ment with PHP and Git.
To begin, clone the repository from 
GitHub:
git clone https://github.com/paolomulas/datapizza-ai-php.git
cd datapizza-ai-php
of your local data (logs, docs, 
notes) that resides entirely on the 
Pi’s SD card.
3. The hands (local): PHP classes that 
execute system commands, read 
files, or query internal APIs. These 
run on the Pi’s bare metal CPU.
The ReAct Loop in PHP
The core of the agent is the reasoning 
and acting (ReAct) loop. In Python 
frameworks like LangChain, this 
logic is often buried under layers of 
abstraction. In Datapizza-AI PHP, it 
is exposed as a single, readable while 
loop.
The incredibly 
easy-to-debug 
process is syn-
chronous and 
linear:
Figure 1: The API-first architecture comprises a Raspberry Pi that orchestrates local tools 
and remote API calls, keeping the logic local and the heavy lifting in the cloud.
Figure 2: The decoupled architecture: The Pi acts as the secure 
gateway, holding tools and memory, and the LLM provides pure 
reasoning power.
Figure 3: The output of the 00_sanity_check.php script 
confirms that the environment is ready.
Next, create a .env file in the root 
directory to store your API key (a tem-
plate is provided as .env.example), add 
your API key to the file, and finally, 
run the built-in sanity check script to 
ensure your environment is configured 
correctly by verifying the PHP version, 
permissions, and API connectivity:
cp .env.example .env
nano .env
 
OPENAI_API_KEY="sk-..."
 
php 00_sanity_check.php
If all checks pass, you’ll see a confir-
mation message (Figure 3), and your 
minimal AI lab is ready for its first 
experiment.
Implementing a Sysadmin 
Agent
Now that the environment is ready, 
you should build something practi-
cal: a Sysadmin Agent designed to 
monitor server health and analyze 
logs autonomously. This step shows 
where the true power of the frame-
work’s extensibility shines. The 
Datapizza-AI PHP architecture treats 
“tools” as modular PHP classes that 
the AI can invoke according to its 
reasoning.
In this way, you can expose any PHP logic (system calls, database queries, or API integrations) to the agent simply by extending the Base-
Tool class.
Designing Custom Tools
To create a tool that allows the 
agent to check disk usage, leverage 
PHP’s native disk_free_space and 
disk_total_space functions instead of 
relying on potentially unsafe shell_
exec calls or parsing df -h output. 
This approach is faster, safer, and 
platform-independent.
Listing 1 shows the implementation 
of DiskSpaceTool. Notice how it de-
fines a JSON schema in the crucial get_ 
parameters_schema() that is injected into 
the LLM’s system prompt, teaching the 
model exactly how to use this tool and 
what parameters to provide.
This pattern is universally applicable. 
You could easily write ServiceRestart-
Tool to wrap systemctl commands 
(with strict whitelisting for security) or 
DatabaseHealthTool to run a quick SQL 
diagnostic query. The agent doesn’t 
need to know the implementation de-
tails; it just needs the schema.
Log Analysis Tool
Giving an AI agent read access to 
system logs is powerful but risky. 
To mitigate this risk, implement 
LogGrepTool to enforce strict access 
controls at the application level. The 
tool allows the agent to search for 
specific string patterns (e.g., error 
or segfault) but restricts access to a 
pre-defined whitelist of logfiles (e.g., 
/var/log/syslog, /var/log/auth.log), 
which prevents the LLM from hal-
lucinating a request to read sensitive 
files like /etc/shadow or strictly pri-
vate user data.
The implementation uses PHP’s file 
handling to read lines safely, avoiding 
the overhead of spawning external 
grep processes – a critical optimiza-
tion when running on constrained 
hardware like the Pi 1.
When the Agent Lies
A common fear is that an AI agent 
will “go rogue.” In testing, I found 
that the ReAct loop is remarkably 
robust, but not infallible.
For example, if you ask, Check the 
health of the Postgres database, but 
haven’t written PostgresTool, the LLM 
might try to hallucinate a solution. It 
might incorrectly guess that it can 
use LogGrepTool to read /var/lib/
postgresql/data, which is blocked by 
your whitelist.
This scenario triggers a safety failure 
in the tool: Error: Path not allowed.
Crucially, the agent sees this error in 
its observation step. The LLM then 
“reasons” about the failure:
 Thought: I cannot read the data di-
rectory directly. I should check the 
standard logs instead.
 Action: log_grep on /var/log/syslog.
This self-correction capability is 
what differentiates an agent from a 
simple script. It adapts to permis-
sion-denied errors just like a human 
operator would – by trying a safer 
alternative path.
01 name = "disk_space";
08 $thi s->description = "Checks disk usage for a 
given path. Returns free/total space and 
percentage.";
09 }
10 
11 public function execute($params = []) {
12 $path = $params['path'] ?? '/';
13 
14 if (!file_exists($path)) {
15 return "Error: Path '$path' does not exist.";
16 }
17 
18 $free = @disk_free_space($path);
19 $total = @disk_total_space($path);
20 
21 if ($free === false || $total === false) {
22 return "Error: Unable to read disk stats.";
23 }
24 
25 // Calculate percentage and format output
26 $used_p = (1 - ($free / $total)) * 100;
27 
28 return sprintf(
29 "Dis k '%s': %.1f%% used (%.2f GB free / %.2f 
GB total)",
30 $path, $used_p, $free/1e9, $total/1e9
31 );
32 }
33 
34 public function get_parameters_schema() {
35 return [
36 'type' => 'object',
37 'properties' => [
38 'path' => [
39 'type' => 'string',
40 'des cription' => 'Path to check (default: 
"/")'
41 ]
42 ]
43 ];
44 }
45 }
46 ?>
Listing 1: DiskSpaceTool Implementation
Reasoning Loop in Action
With the tools defined, you now configure ReactAgent. The ReAct pattern is the engine that drives the agent's autonomy. In this framework, the loop is implemented as a straightforward while loop in datapizza/agents/react_agent.php.
1. Thought: The agent receives a user query (e.g., Is the disk full?). It analyzes the available tool schemas and decides it needs to call disk_space.
2. Action: The framework intercepts this decision, instantiates DiskSpaceTool, and executes it with the parameters generated by the model.
3. Observation: The tool returns the raw string output (e.g., Disk '/': 45.2% used), which is fed back into the conversation history.
4. Final Answer: The model sees the observation and formulates a natural language response for the user.
To instantiate your Sysadmin Agent with these capabilities, the configuration is minimal (Listing 2; Figure 4).
Figure 4: The agent's reasoning trace in the terminal: Note how it sequentially calls disk_space and then log_grep before synthesizing a final report.
Listing 2: Initializing the Sysadmin Agent
01 run(
18 " Check system health: verify disk space and look for errors in syslog."
19 );
20 echo $response;
File-Based Vector Store for Local Context
Although the Sysadmin Agent is powerful, it is stateless. To build a truly intelligent assistant – one that knows your specific server configurations, runbooks, or incident history – you need persistent memory. In the AI world, this means a vector store.
Standard vector databases (e.g., Pinecone or Weaviate) are overkill for a Raspberry Pi. They require Docker containers, significant RAM, and a complex setup. Datapizza-AI PHP takes a different approach: a serverless vector store that lives entirely in a local JSON file.
The Math Behind the Magic
How do you search text by meaning rather than keywords? You convert
text into embeddings – vectors of 
floating-point numbers (e.g., [0.123, 
-0.567, …]), where similar concepts 
are mathematically close (Figure 5). 
To find the most relevant document 
for a query, calculate the cosine simi-
larity between the query vector and 
your stored document vectors.
Most developers import a Python li-
brary for this task. In Datapizza-AI PHP, 
the math is implemented in pure PHP 
to demystify the process (Listing 3).
This function is the engine of your 
retrieval-augmented generation 
(RAG) system, and it runs surpris-
ingly fast on the Raspberry Pi for 
datasets of fewer than 10,000 docu-
ments, proving that big data tools 
aren’t always necessary for personal 
AI projects.
The Limits of Bare Metal
Why does this JSON approach work 
for fewer than 10,000 documents? You 
need to do the math, because on a 
256MB Raspberry Pi, every byte counts.
An OpenAI text-embedding-3-small 
vector consists of 1,536 floating-
point numbers. In PHP, an array 
of floats consumes significantly 
more memory than a packed C 
structure. A conservative estimate is 
roughly 16KB per vector in memory 
overhead:
 1,000 documents is about 16MB 
RAM
 10,000 documents is about 160MB 
RAM
On a Raspberry Pi Model B with 
256MB of total RAM (and the operat-
ing system taking ~80MB), loading 
10,000 vectors leaves practically zero 
headroom for the PHP runtime itself 
(Figure 6).
The Ingestion Pipeline
To populate this store, you need 
a pipeline that converts raw text 
files (Markdown, logs, configura-
tion files) into vectors (Listing 4). 
The ingestion_pipeline.php script 
handles the steps:
 Load: Read files from a directory.
 Split: Break text into chunks (e.g., 
500 tokens) to fit LLM context 
windows.
 Embed: Send each chunk to the 
OpenAI API to get its vector 
representation.
 Store: Save the text, vector, and 
metadata to data/vectorstore.json.
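To populate the store from a whole directory, a small driver loop around this pipeline is enough. The snippet below is a sketch only; $embedder and $vectorstore stand in for whatever objects your installation creates (see Listing 4 for the ingestion function itself):

// Hypothetical driver loop; adjust the path and chunk size to taste.
foreach (glob('/home/paolo/runbooks/*.md') as $file) {
    pipeline_ingest_single($file, $embedder, $vectorstore, 500);
}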
This simple pipeline allows you to 
teach your AI agent about your in-
frastructure. You can feed it a folder 
of post-mortem.md files, and suddenly 
your Sysadmin Agent can answer 
questions like: How did we fix the MySQL crash last November? by retrieving the most relevant post-mortem chunks from the local vector store.
Figure 5: A visual representation of vector search. The query is converted into numbers 
and compared against the database to find the closest match.
01 // datapizza/vectorstores/simple_vectorstore.php
02
03 private function cosine_similarity($vec1, $vec2) {
04   $dot_product = 0.0;
05   $norm1 = 0.0;
06   $norm2 = 0.0;
07
08   // O(d) complexity where d is vector dimension (e.g., 1536)
09   for ($i = 0; $i < count($vec1); $i++) {
10     // (loop body and return reconstructed: standard cosine similarity)
11     $dot_product += $vec1[$i] * $vec2[$i];
12     $norm1 += $vec1[$i] * $vec1[$i];
13     $norm2 += $vec2[$i] * $vec2[$i];
14   }
15
16   return $dot_product / (sqrt($norm1) * sqrt($norm2));
17 }
Headless Execution with Cron
Because Datapizza-AI PHP is just a PHP script, deploying it is as simple as adding a line to your crontab. This headless mode allows the agent to perform scheduled health checks without human intervention.
To run your Sysadmin Agent every 
morning at 8:00am, simply point 
cron to your PHP executable and 
your agent script:
# /etc/cron.d/sysadmin-agent
0 8 * * * paolo /usr/bin/php /home/paolo/datapizza-ai-php/examples/05_sysadmin/sysadmin_agent.php >> /var/log/sysadmin_agent.log 2>&1
Because the framework uses standard 
output for logging (the $this->log() 
method you saw in ReactAgent), 
all reasoning steps – thoughts, tool 
outputs, and final answers – are au-
tomatically captured in the logfile, 
which creates a comprehensive audit 
trail. You can review /var/log/sys-
admin_agent.log to see exactly why 
the agent decided to flag a disk space 
warning.
Integration by HTTP
For more interactive use cases, the 
framework also includes a simple 
api/chat.php endpoint that allows 
you to trigger your agent from any HTTP client.
Keywords: Datapizza-AI, PHP, artificial intelligence, Raspberry Pi, local vector store, API, agents, edge computing, automation, decoupled orchestration
Listing 4: The Ingestion Pipeline
01 // datapizza/pipeline/ingestion_pipeline.php
02
03 function pipeline_ingest_single($filepath, $embedder, $vectorstore, $chunk_size=1000) {
04   // 1. Parse text
05   $parsed = parser_parse_text($filepath);
06
07   // 2. Split into chunks
08   $chunks = splitter_split($parsed['text'], $chunk_size);
09
10   foreach ($chunks as $i => $chunk) {
11     // 3. Generate embedding (Remote API call)
12     $embedding = $embedder->embed($chunk);
13
14     // 4. Store locally
15     $vectorstore->add_document($chunk, $embedding, [
16       'source' => basename($filepath),
17       'chunk_index' => $i
18     ]);
19   }
20 }
The Author
Paolo Mulas is a developer special-
izing in edge AI and minimal com-
puting architectures.
Set Up an IPv6-Mostly Network
Twofer
IPv6-mostly networks primarily use IPv6 for communication but also support IPv4 as a fallback, simplifying address management, reducing the load on the IPv4 infrastructure, and allowing IPv6-only and IPv4-enabled endpoints to coexist on the same network. We describe transition mechanisms that facilitate the operation of an IPv6-mostly network and the few technical hurdles to overcome. By Mathias Hein
For years, networks have been slated 
to migrate to IPv6 to address the short-
age of IPv4 addresses. This problem 
can be an issue for users, because 
many older devices, applications, 
and services still in use do not work 
properly in an IPv6-only environment.
Therefore, dual-stack networks cur-
rently offer the best user experience, 
but at the expense of running out 
of IPv4 addresses. Although most 
network operators initially tend to 
introduce IPv6 in parallel with their 
existing IPv4 infrastructure, IPv6-only 
networks are still uncommon outside 
the mobile communications sector. 
Most admins agree that the dual-stack 
approach is an unavoidable transition 
phase that allows lessons to be learned 
with the IPv6 protocol while minimiz-
ing disruptions to network operations.
Admittedly, dual-stack networks do 
not solve the core problem: running 
out of IPv4 addresses. A network 
operator still needs the same IPv4 re-
sources as for an IPv4-only network. 
Worse still, a dual-stack infrastructure 
often has to remain in operation for 
many years. Many applications still 
rely on IPv4, as well, which leads to 
a chicken-and-egg problem: IPv6-only 
networks are impractical for incom-
patible applications, while applica-
tions continue to rely on IPv4 because 
IPv6-only networks are rare.
One possible solution is what are dubbed IPv6-mostly networks, which allow IPv6-capable devices to operate in IPv6-only mode while IPv4 connectivity is still delivered seamlessly to the endpoints that need this protocol version.
What Defines IPv6-Only 
Networks?
An IPv6-mostly network is very similar to a dual-stack network, with two ad-
ditional key elements. First, the net-
work provides NAT64 functionality in 
line with RFC 6146 [1], which enables 
IPv6-only clients to communicate 
with IPv4 destinations. Second, the 
DHCPv4 infrastructure processes the 
IPv6-Only Preferred DHCPv4
option (Option 108) in line with RFC 
8925 [2]. When connecting to an 
IPv6-enabled network segment, an 
endpoint configures its IP stack ac-
cording to its capabilities:
 An IPv4-only endpoint obtains an 
IPv4 address by DHCPv4.
 A dual-stack endpoint (not just 
IPv6-capable) configures IPv6 
addresses by stateless address 
autoconfiguration (SLAAC) and op-
tionally by DHCPv6. Additionally, 
this device obtains an IPv4 address 
by DHCPv4.
 An IPv6-only endpoint configures 
its IPv6 addresses and, when per-
forming DHCPv4, includes Option 
108 (in line with RFC 8925) in the 
parameter request list. The DHCP 
server returns the option, and the 
endpoint waives the request for an 
IPv4 address and remains in IPv6-
only mode.
A network segment primarily based 
on IPv6 can support a mix of IPv4-
only, dual-stack, and IPv6-only de-
vices. IPv6-only endpoints use the 
NAT64 provided by the network to 
reach IPv4-only destinations.
However, the term “IPv6-only enabled 
endpoint” is not a strict technical 
definition. Instead, it describes a 
device that can work without native 
IPv4 connectivity or IPv4 addresses 
while providing the same user experi-
ence. The most common method is to 
implement a customer-side translator 
(CLAT) as described in 464XLAT [3] 
(RFC 6877). Devices that support 
CLAT (e.g., mobile phones) are 
known to operate in IPv6-only mode 
without any problems. In some cases, 
however, a network administrator 
might consider a device to be IPv6-only-capable even without a CLAT implementation – for example, if all applications running on the device have been tested to work in a NAT64 environment without IPv4 dependencies.
Coexistence of IPv6- and IPv4-Capable Endpoints
One effective way to restrict IPv4 addresses only to those devices that need them is to use Option 108. Most CLAT-enabled systems also support this setting. When a network detects this option, it can configure these devices as IPv6-only devices so that they use CLAT to provide IPv4 addresses to the local endpoint's network stack.
Certain devices, such as resource-constrained embedded systems, can operate in IPv6-only mode without CLAT if their communication is limited to IPv6-enabled destinations. Because these systems often do not support Option 108, you might need to use alternative methods to prevent the assignment of IPv4 addresses. One approach is to block IPv4 traffic at the switch port level, which can be done either with a static access control list (ACL) and a filter with a deny ip any
any rule or with a dynamic ACL from the RADIUS server. If 802.1x authentication is used, RADIUS can provide an ACL that blocks all IPv4 traffic. However, the ACL-based approach has some implications for scalability and is detrimental in terms of operational complexity, which is why it is only recommended as a temporary solution.
Access to IPv4-Only Destinations
IPv6-only endpoints require NAT64 to access IPv4-only destinations. Admins often opt for a combination of NAT44 and NAT64 functions, but if not all internal services are IPv6-enabled, NAT64 might need to be implemented closer to the user. If internal IPv4-only destinations use the RFC 1918 address space, the known prefix 64:ff9b::/96 does not need to be used for NAT64 (Figure 1; see section 3.1 of RFC 6052).
Figure 1: Onboarding a new device without IPv6-mostly support (source: Ruhr University Bochum, Germany).
Enabling CLAT on endpoints is essential for running IPv4-only applications in IPv6-only environments. CLAT provides an RFC 1918-compatible address and a default IPv4 route, ensuring functionality even without a native IPv4 address from the network. Without CLAT, IPv4-only applications
would fail, negatively affecting the 
user experience and adding to the 
support overhead.
Recommendations for network ad-
mins who control the endpoints are 
(1) controlling endpoint configura-
tion and enabling CLAT on endpoints 
that send DHCPv4 Option 108 and 
(2) enabling Option 108 without 
CLAT if you are set to identify and 
fix IPv4-only systems and applica-
tions or if all applications will run 
reliably in IPv6-only mode.
Signaling the NAT64 Prefix 
to Hosts
Hosts running 464XLAT must deter-
mine the IPv6 prefix (PREF64) used 
by NAT64. The network administrator 
needs to configure the first-hop rout-
ers to include PREF64 information in 
router advertisements [4] (RA; RFC 
8781), even if the network provides 
DNS64 (so that hosts can use DNS64-
based prefix discovery, RFC 7050). 
This measure is important because 
hosts or individual applications could 
have a custom DNS configuration (or 
even run a local DNS server) and ig-
nore the DNS64 information provided 
by the network, preventing them from 
using the RFC 7050 method for de-
tecting PREF64 (Figure 2).
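On a Linux client you can check both discovery paths from the command line. The values shown are only examples, and the well-known prefix is an assumption – your NAT64 might use a network-specific prefix:

# RFC 7050: ask the network's DNS64 for the NAT64 prefix
$ dig +short AAAA ipv4only.arpa
64:ff9b::c000:aa
64:ff9b::c000:ab

# Reach an IPv4-only destination through NAT64 by embedding its address
# in the (assumed) well-known prefix 64:ff9b::/96
$ ping -c1 64:ff9b::192.0.2.10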
In the absence of PREF64 informa-
tion in RAs, these systems would be 
unable to perform CLAT, resulting 
in connectivity issues for all IPv4-
only applications running on the af-
fected device. Because such a device 
would be unable to use the DNS64 
provided by the network, access to 
IPv4-only destinations would also 
be disrupted.
All common operating systems cur-
rently support DHCPv4 Option 108 
and automatically enable CLAT 
according to RFC 8781. Therefore, 
providing PREF64 information in 
RAs can reliably reduce the effect 
of a user-defined DNS configuration 
on these systems. Receiving PREF64 
information in RAs also speeds up 
the CLAT startup time, making an 
IPv4 address and a default route 
available to applications in a far 
faster way.
DNS vs. DNS64
DNS64 with NAT64 enables end-
points that exclusively use IPv6 to 
access destinations that only use 
IPv4. However, this arrangement has 
some disadvantages. For example, 
Domain Name System Security Ex-
tension (DNSSEC) incompatibility 
causes DNS64 responses to fail DNS-
SEC validation. Moreover, endpoints 
or applications configured with 
custom resolvers are left out in the 
cold when it comes to DNS64. The 
application has additional require-
ments: To use DNS64, applications 
must be IPv6-capable and use DNS 
(i.e., not use IPv4 literals). Many 
programs do not meet this require-
ment and therefore fail if the end-
point does not have an IPv4 address 
or native IPv4 connectivity.
If the network provides PREF64 in 
RAs and all endpoints are guaranteed 
to enable CLAT, DNS64 is not needed, 
and you should not enable it. How-
ever, if some IPv6-only devices may 
not have CLAT support, the network 
must provide DNS64 unless these 
endpoints are guaranteed never to 
require IPv4-only destinations (e.g., 
in specialized network segments that 
exclusively communicate with IPv6-
enabled destinations).
Advantages of IPv6-Mostly
IPv6-mostly networks offer sig-
nificant advantages over traditional 
dual-stack models, where endpoints 
have both IPv4 and IPv6 addresses. 
The first advantage is a drastic re-
duction in IPv4 address consumption 
through IPv6-mostly. This reduction 
depends on the capabilities of the 
terminal devices (DHCPv4 Option 
108 and CLAT support). In real-
world scenarios (e.g., WiFi confer-
ences), 60 to 70 percent of endpoints 
can support IPv6-only operation, 
Figure 2: With IPv6-mostly support, a device can discover that it can use IPv6 (source: Ruhr University Bochum, Germany).
reducing the size of IPv4 subnets by up to 75 percent.
Managing dual-stack networks means operating two network layers simultaneously, which increases complexity, costs, and susceptibility to errors. IPv6-mostly enables the elimination of IPv4 at many endpoints, simplifying operations and improving the reliability of the entire network. It also reduces dependencies on DHCPv4. With increasing numbers of devices operating seamlessly in IPv6-only mode, the importance of the DHCPv4 service has dropped significantly, making it possible to downsize the DHCPv4 infrastructure or operate the infrastructure with less stringent service level objectives (SLOs) and a view to optimizing costs and resource allocation.
Traditional IPv6 deployment required separate networks plus dual-stack networks. IPv6-mostly offers significant improvements here, too, primarily by improving scalability. Separate IPv6-only networks double the number of service set identifiers (SSIDs) in wireless environments, leading to channel congestion and performance degradation. IPv6-mostly does not require additional SSIDs. Additionally, IPv4 and IPv6 devices can coexist on the same wired virtual LANs (VLANs), eliminating the need for additional VLANs.
Troubleshooting, in turn, provides improved visibility: User-selected fallback to dual-stack networks can obscure issues with IPv6-only operation and make it difficult to report and resolve problems. IPv6-mostly forces users to deal with all the issues, which improves identification and enables troubleshooting for a smoother long-term migration. Finally, IPv6-mostly allows for a gradual migration of devices on the basis of individual segments. Devices only become IPv6-only if they are fully compatible with this mode.
Gradual Transition
Migrating endpoints to IPv6 fundamentally changes the network dynamics by removing the IPv4 safety net; this applies to the masking effect
of Happy Eyeballs [5], as well. IPv6 
connectivity issues are now far more 
apparent, including those that were 
previously hidden in dual-stack en-
vironments. You should be prepared 
for problems with both the endpoints 
and the network infrastructure, even 
if the dual-stack network is running 
smoothly. Some considerations for 
the rollout follow.
With limited control over endpoint 
configuration, a rollout in each sub-
net is essential, where you gradu-
ally enable Option 108 processing in 
DHCP. If you have control over the 
endpoint, a rollout per device is pos-
sible (at least for operating systems 
with a configurable Option 108). Note 
that some operating systems enable 
Option 108 support unconditionally 
and only use IPv6 once it is run-
ning on the server side. I therefore 
recommend that you enable Option 
108 processing when enabling DHCP 
server-side.
Some operating systems automatically 
switch to IPv6-only. A rollback at this 
stage affects the entire subnet, so it 
makes more sense to enable Option 
108 on the endpoints and make sure 
each device can roll back if needed. 
For a quick rollback, you should start 
with a minimum Option 108 value 
(300 seconds) and increase it if the 
IPv6-centric network proves to be 
reliable.
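How you deliver the option depends on your DHCP server. As one hedged example, older ISC dhcpd releases have no built-in name for Option 108, so it is declared manually (the option name v6-only-preferred is arbitrary here; newer servers such as Kea ship a predefined definition):

# dhcpd.conf sketch: advertise IPv6-Only Preferred with a 300-second value
option v6-only-preferred code 108 = unsigned integer 32;

subnet 192.0.2.0 netmask 255.255.255.0 {
  range 192.0.2.100 192.0.2.200;
  option v6-only-preferred 300;
}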
Network Operation
CLAT requires either a dedicated IPv6 
prefix or a dedicated IPv6 address. 
Currently, all implementations use 
SLAAC to acquire CLAT addresses. 
To enable CLAT functionality in IPv6 
network segments, first-hop routers 
therefore need to advertise a prefix 
information option (PIO) containing a 
globally routable, SLAAC-compatible 
prefix with the autonomous address-
configuration flag set to 0.
Because this concept is specific to 
IPv6, the IPv6 extension headers are 
often neglected in dual-stack net-
works or even explicitly prohibited 
by security policies. The problems 
caused by blocking extension headers 
are obscured by Happy Eyeballs, but 
they become apparent if you have no IPv4 on which to fall back. The network should at least allow the Fragment and ESP extension headers (for IPSec traffic such as VPN).
Solving Typical Problems
Hidden problems usually occur on IPv6 networks because the IPv4 safety net is no longer present. Although implementation errors vary greatly, I focus here on the configuration, topology, or design decisions. It is important to note that these problems are already likely to exist on dual-stack networks, although they will go unnoticed because of IPv4 fallback.
In the past, disabling IPv6 was considered a quick workaround for problems, but it affected devices without IPv6. Similarly, the IT department might have disabled or filtered IPv6 on the assumption that it is not widely used. Devices that request Option 108 cannot connect on an IPv6-centric network because they do not receive IPv4 addresses and IPv6 is disabled. You must therefore ensure that IPv6 is enabled on your endpoints before migrating the network to IPv6-centric mode.
When you expand your network, NAT44 from IPv4 allows endpoints to extend connectivity to downstream systems without the upstream network being aware or without granting permission. However, this situation leads to problems with IPv6 if the endpoints do not have IPv4 addresses.
The following solutions are available for the problems mentioned:
- DHCPv6-PD for assigning prefixes to endpoints: Provides downstream systems with IPv6 addresses and native connectivity.
- Enabling the CLAT function on the endpoint: Functions similar to the wired network architecture described in section 4.1 of RFC 6877. The downstream systems receive IPv4 addresses, and their IPv4 traffic is translated into IPv6 by the endpoint. However, this approach means that the downstream
systems exclusively use IPv4 and 
do not benefit from end-to-end IPv6 
connectivity. To take advantage of 
IPv6 despite these circumstances, 
you can use a combination with 
IPv6 prefix delegation (PD).
 Bridging and ND proxy: Bridges 
IPv6 traffic and hides all down-
stream devices behind its MAC 
address. However, this arrange-
ment can lead to scalability is-
sues, because a single MAC ad-
dress is assigned to many IPv6 
addresses.
Multiple Addresses per 
Device
Unlike IPv4, where end devices typi-
cally have a single IPv4 address per 
interface, IPv6 end devices inherently 
use multiple addresses: the link-local 
address, a temporary address (com-
monly used on mobile devices for pri-
vacy protection), a stable address for 
long-term identification, and a CLAT 
address. Endpoints with containers, 
namespaces, or Neighbor Discovery 
(ND) Proxy functions can have even 
more addresses, posing a challenge for 
network infrastructure devices such as 
switches, wireless access points, and 
so on that map MAC addresses to IPv6 
addresses, often with limitations to 
prevent resource exhaustion or denial-
of-service (DoS) attacks.
If the number of IP addresses per 
MAC is exceeded, infrastructure de-
vices behave differently in different 
implementations, resulting in incon-
sistent connectivity losses. Although 
some systems reject new addresses, 
others delete older entries, causing 
previously functioning addresses to 
lose their connection. In all these 
cases, endpoints and applications are 
not explicitly told that the address has 
become unusable.
Assigning prefixes to endpoints by 
DHCP-PD can eliminate this problem 
and the associated scalability issues, 
but not all devices support this op-
tion. You will therefore need to en-
sure that the deployed infrastructure 
devices support a sufficient number 
of IPv6 addresses that can be as-
signed to a client’s MAC address, and 
you need to watch for events that in-
dicate that the limit has been reached 
(e.g., syslog messages).
Avoiding Fragmentation
Because the basic IPv6 header is 20 
bytes longer than the IPv4 header, the 
transition from IPv4 to IPv6 can cause 
packets to exceed the path maximum 
transmission unit (MTU) on the IPv6 
side. In this case, NAT64 generates 
IPv6 packets with fragment headers. 
In line with RFC 6145, the translator 
fragments IPv4 packets by default 
so that they will fit into 1280-byte 
IPv6 packets: All IPv4 packets larger 
than 1260 bytes are fragmented or 
discarded if the DF (don’t fragment) 
bit is set.
To minimize fragmentation, you need 
to maximize the path MTU on the 
IPv6 side (from the translator to the 
IPv6-only hosts). Configuring NAT64 
devices to use the actual path MTU 
on the IPv6 side when fragmenting 
IPv4 packets also makes sense.
Another common cause of IPv6 
fragmentation is the use of protocols 
such as DNS and RADIUS, where the 
server response must be sent as a 
single UDP datagram. Security poli-
cies must allow IPv6 fragments for 
permitted UDP traffic if responses 
in the form of single datagrams are 
required. You need to allow IPv6 frag-
ments for permitted TCP traffic unless 
the network infrastructure reliably 
performs TCP maximum segment size 
(MSS) provisioning.
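On a Linux router or NAT64 box, MSS clamping is one way to provide that provisioning. A hedged example with ip6tables (nftables offers an equivalent rule):

# Clamp the TCP MSS of forwarded IPv6 connections to the path MTU
ip6tables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu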
Custom DNS Configuration
On IPv6 networks without PREF64 
in RAs, hosts rely on DNS64 to de-
termine the NAT64 prefix for CLAT 
operation. Endpoints or applications 
configured with custom DNS resolv-
ers (e.g., public or corporate DNS) 
can bypass the network-provided 
DNS64, preventing the NAT64 prefix 
from being detected and obstructing 
CLAT functionality.
If possible, try to integrate PREF64 into 
RAs on IPv6-centric networks to mini-
mize reliance on DNS64. Be aware of 
the possibility of CLAT failures when 
endpoints use custom resolvers in en-
vironments without PREF64.
Conclusion
In practice, IPv6-mostly networks offer the best user experience while
reducing IPv4 resource consumption 
to a minimum. Conventional dual-stack networks, on the other hand, retain their full complexity without delivering any direct IPv4 resource savings, because IPv4 is still necessary everywhere to support older devices. In coming
years, the volume of native IPv4 traffic 
on these networks is likely to decline to 
such an extent that it will start to make 
sense to stop using IPv4 altogether. 
Info
[1] RFC 6146: Stateful NAT64: [https://www.rfc-editor.org/info/rfc6146]
[2] RFC 8925: IPv6-Only Preferred Option for DHCPv4: [https://www.rfc-editor.org/info/rfc8925]
[3] RFC 6877: 464XLAT: [https://www.rfc-editor.org/info/rfc6877]
[4] RFC 8781: Discovering PREF64 in Router Advertisements: [https://www.rfc-editor.org/info/rfc8781]
[5] Happy Eyeballs: [https://en.wikipedia.org/wiki/Happy_Eyeballs]
The Author
Mathias Hein is a freelance 
IT consultant and technical 
writer with more than 
40 years of professional 
experience in the field of 
networking. He also serves 
as an adjunct instructor at several universities. 
As a trainer and speaker at technical seminars, he 
shares his expertise in the areas of switching, 
TCP/IP, Voice over IP, Carrier Ethernet, and network 
management. As an author of technical books and 
articles in relevant trade journals, Hein regularly 
contributes to the dissemination of knowledge.
Managing JVM Applications in Production
Defensive Driving
Java's memory management has quite a steep learning curve when you are tasked with operating a Java Virtual Machine efficiently in production. We guide you through the waters of keeping applications up and running and what signals to look for to prevent crashes. By Henner Schmidt and Max Jonas Werner
Thirty years of unbroken compatibility promises have accumulated into quite a number of choices between
different concepts in the Java Virtual 
Machine (JVM) runtime environment 
(Figure 1). Looking at some of those 
ideas today makes them appear to 
be a statement of Zeitgeist more than 
anything else, but you still have to 
choose the parameters for running 
your applications. Understanding 
how the JVM manages memory and 
how to observe the many metrics it 
exposes is essential to operating JVM 
applications in production.
To observe a JVM application’s be-
havior and memory usage, operators 
need more than a few heap graphs:
They need a reliable way to observe 
JVM memory usage, interpret it cor-
rectly, and spot failure patterns early 
enough to intervene. In the following 
sections, we dive into the details
of JVM memory observation: which 
metrics exist, how to retrieve them, 
how to visualize them in Grafana, 
and what memory trends tend to pre-
cede JVM memory failures.
JVM Memory Areas 
Explained
The JVM exposes memory telemetry 
in several layers. At the conceptual 
level, the most important metric 
groups are heap usage, non-heap 
usage, garbage collection activ-
ity, and off-heap/ native memory 
consumption. These can be broken 
down further into “used,” “com-
mitted,” and “max” values, which 
appear across the JVM’s memory 
subsystems.
The heap is the best-known area 
because it holds Java objects cre-
ated by the application. Heap met-
rics typically include current heap 
usage (used), the amount currently 
requested from the operating system 
(committed, always greater than 
used), and the configured upper 
bound (max, usually derived from the 
-Xmx flag value). If you only monitor 
one thing, heap usage is the baseline.
Most garbage collectors organize the 
heap into generations. New objects 
are created in a young area, and ob-
jects that survive garbage collection 
cycles are eventually promoted into 
an old area. When administrators talk 
about “the memory leak curve,” they 
usually mean old-generation (Old 
Gen) usage creeping upward. Heap 
memory is managed by the garbage 
collector, so a healthy service often 
shows a sawtooth pattern: Allocations 
push heap up, garbage collection 
drops it down, then it repeats.
Although heap exhaustion is by far the 
most common cause of crashes, some 
real-world incidents happen outside 
the heap. The non-heap category con-
tains several memory pools that are es-
sential for runtime execution. Most no-
tably it includes Metaspace, where the 
JVM stores class metadata. Metaspace 
exhaustion can lead to failures that 
look like memory leaks while the heap 
remains stable, but they are extremely 
rare. Metaspace is not cleaned by nor-
mal object garbage collection in the 
same way heap memory is. If the ap-
plication repeatedly loads new classes 
(e.g., because of ClassLoader leaks, 
redeploy loops, or dynamic proxy 
generation without proper cleanup), 
Metaspace usage could climb steadily 
until the JVM fails.
Another relevant area is the JVM’s 
code cache, which stores just-in-time 
(JIT) compiled code. Less frequently, 
code cache pressure can also create 
instability.
Finally, off-heap areas such as direct 
buffers often explain cases in which 
operating system-level memory pres-
sure endangers the workload while 
heap graphs look normal. In contain-
erized environments, this distinction 
matters even more, because cgroup 
limits apply to the overall process 
memory, not just the heap.
As a general rule of thumb, non-heap 
memory should be rather constant 
across the runtime of an application, 
whereas heap memory will fluctuate 
a lot. Additionally, Java processes 
consume memory in places that are 
not always represented well in heap/ 
non-heap metrics. Every non-virtual 
thread has a native stack, defaulting 
to 1MB, which becomes significant 
when thread counts are high. Large 
non-virtual thread counts can cause
high memory usage, even when heap 
is stable, which is a common surprise 
in systems that use blocking I/ O or 
misconfigured thread pools. The JVM 
can also allocate memory off-heap 
through direct buffers (e.g., by Byte-
Buffer.allocateDirect() as part of 
non-blocking I/ O (NIO)), which is 
common in asynchronous network-
heavy stacks such as Netty.
Slight differences occur regarding 
metrics between garbage collector 
(GC) implementations, but these mat-
ter mainly in naming and structure. 
Most production distributions (Open-
JDK, Eclipse Temurin, Corretto) share 
the same HotSpot foundation, so the 
concepts are identical, and most pool 
names are similar. Alternative JVMs 
such as Eclipse OpenJ9 expose com-
parable metrics, but memory pool 
labels and some GC-related signals 
can differ. For this reason, building 
dashboards around stable top-level 
categories (heap/ non-heap/ resident 
set size (RSS)) and treating pool-spe-
cific graphs as JVM- and GC-specific 
are good practices.
For JVM applications, we highly ad-
vise opting for a white box or gray 
box approach, because it allows you 
to understand the application’s mem-
ory usage much better than looking at 
it from the perspective of the operat-
ing system. The Java runtime allows 
for different ways to obtain these 
metrics.
Obtaining Memory Usage Metrics
For production environments, we want to define three fundamentally different approaches to collecting Java application memory usage metrics before diving into the details:
- White box instrumentation: Application metrics are exposed through measures directly baked into the source code, optionally through a library, gaining insights into the different memory areas and garbage collector cycles.
- Gray box instrumentation: Metrics are consumed from the running Java Virtual Machine. With regard to memory, the same metrics can be retrieved, as with white box instrumentation.
- Black box instrumentation: Metrics are exposed by the operating system, which allows observation of the overall use of memory and CPU by an application, thread count or network latency, and saturation.
Figure 1: Java Virtual Machine tuning is an art in and of itself. Over the last 30 years, many knobs have been added, but only a subset is crucial for operators to know.
Source-Level Instrumentation
Source-level instrumentation means the application actively exposes metrics as part of its runtime behavior, typically with an HTTP endpoint (Prometheus/OpenMetrics format) or an OpenTelemetry exporter. This approach is common in modern Java services because it integrates well into the same observability pipeline as request latency, error rates, database timings, and other application signals.
In practice, this approach is often implemented with Micrometer (e.g., through Spring Boot Actuator) or
with the OpenTelemetry SDK. Both 
options provide memory-related 
metrics, such as heap and non-heap 
usage, memory pool usage (Old Gen, 
Metaspace, etc.), garbage collection 
pause behavior, thread counts, and 
class loading statistics. These metrics 
are exported in a monitoring-friendly 
format and are easy to scrape with 
Prometheus, without the need to ex-
pose JMX ports.
For administrators, the main advan-
tage is reliability and consistency: 
You get a stable metrics endpoint, 
consistent naming, and easier integra-
tion in containerized environments. 
The trade-off is that you either need 
application or, at minimum, runtime 
configuration changes. The benefit of 
source-level instrumentation over an 
approach that uses Java agents is that 
the application is clearly defined and 
does not change at runtime, making 
it easier to tick all the boxes in high-
security environments.
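For a Spring Boot service, the wiring is typically just a dependency plus one property; the following sketch assumes the micrometer-registry-prometheus module is on the classpath and that the defaults have not been changed:

# application.properties
management.endpoints.web.exposure.include=health,prometheus

# Prometheus then scrapes the Actuator endpoint
scrape_configs:
  - job_name: my-java-service
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['app-host:8080']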
Java Agents
The easiest way to expose metrics 
from a JVM without changing an ap-
plication’s source code is to run a 
Java agent such as the Prometheus 
JMX Exporter [1] or the OpenTelem-
etry Java Agent [2]. Both use the 
same mechanism to gather metrics 
from a running application, which is 
to run a Java agent alongsidethe ap-
plication in the JVM, making use of 
Java’s instrumentation API to extract 
metrics. An example command line 
for running an application with the 
Prometheus JMX Exporter might be
$ java -Xmx32M \
    -javaagent:./jmx_prometheus_javaagent-1.5.0.jar=9090:exporter.yaml \
    -jar MyApp.jar
The most basic exporter configuration 
(exporter.yaml) is
rules:
- pattern: ".*"
The exporter then listens on TCP port 
9090 and provides metrics through 
the /metrics endpoint. You would 
subsequently configure Prometheus 
to scrape http://HOSTNAME:9090/metrics and get all JVM memory met-
rics right out of the box.
Important to know, though, is that 
Java agents manipulate the applica-
tion’s bytecode for inspection, which, 
although not necessarily a concern in 
general if you trust the agent, might 
be an operational obstacle to deploy-
ing the agent files together with the 
application, especially when running 
the application in a container. In 
highly regulated environments, run-
ning agents manipulating bytecode 
might also not be possible because of 
regulatory concerns. After all, with an 
agent, you are not running the exact, 
certified, and audited binary but a dy-
namically modified one. Make sure to 
understand the implications of using 
Java agents and document their use 
in your operational handbooks.
Black Box Instrumentation
If you cannot or do not want to run 
a Java agent and can’t change the 
application’s source code, black box 
instrumentation is your only way to 
retrieve metrics from the JVM. Fortu-
nately, the Java ecosystem provides 
tools to provide detailed metrics even 
in these cases: jcmd and jstat. The 
jcmd utility is the modern, supported, 
all-purpose JVM control interface. 
It supersedes many jmap and jstat 
use cases. Given the JVM’s PID, you 
would run the following command to 
gather JVM heap metrics:
jcmd <PID> GC.heap_info
This single-shot command would 
need to be scheduled to gather met-
rics continuously for time-based ob-
servations; therefore, jstat might be 
preferable:
jstat -gc <PID> 1000
With this command, jstat would 
gather and print metrics from the pro-
cess each second.
To provide these metrics in a Pro-
metheus-compatible format, you 
would have to convert them so that 
Prometheus can scrape at a regular 
interval. Existing tools such as jstat-
2prom facilitate this process, but they 
are usually not well maintained or 
they are straight up abandoned, so 
you would likely have to write your 
own glue code. It becomes apparent 
that JVM black box instrumentation 
adds considerable operational over-
head that you should be well aware of 
when planning your instrumentation 
strategy. Tools like those mentioned 
here might not be very well suited for 
continuous monitoring, but they do 
come in handy for ad hoc profiling of 
applications.
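If you do go down this road, the glue can stay small. The following sketch (the metric names, output path, and use of node_exporter's textfile collector are all assumptions, not an established tool) turns one jstat sample into a Prometheus-readable file:

#!/bin/bash
# Hypothetical jstat-to-textfile glue; schedule it from cron.
PID=$1
OUT=/var/lib/node_exporter/textfile/jvm_${PID}.prom
jstat -gc "$PID" | awk -v pid="$PID" '
  NR==1 { for (i = 1; i <= NF; i++) col[$i] = i }   # map header names to columns
  NR==2 {
    printf "jvm_jstat_old_used_kbytes{pid=\"%s\"} %s\n", pid, $col["OU"]
    printf "jvm_jstat_metaspace_used_kbytes{pid=\"%s\"} %s\n", pid, $col["MU"]
    printf "jvm_jstat_full_gc_total{pid=\"%s\"} %s\n", pid, $col["FGC"]
  }' > "${OUT}.tmp" && mv "${OUT}.tmp" "$OUT"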
As a last resort, you can still rely 
on operating system metrics for the 
JVM process. With Prometheus, the 
easiest way is to use the node_ex-
porter [3] and process-exporter [4] 
applications, which gather all kinds 
of metrics about the node on which 
they are running or the processes 
that run on that node, respectively. 
Obviously, you will not be able to 
gather insights into the application’s 
memory internals (such as the dif-
ferent memory areas), but you will 
still be able to create coarse-grained 
alerts that are based on the overall 
process or node memory usage, still 
allowing you to prevent a memory-
related application crash.
Visualizing JVM Memory 
in Grafana
A useful Grafana dashboard should 
make it easy to answer one opera-
tional question: What type of memory 
pressure is killing my process?
The most important visualization is 
heap usage compared with its con-
figured maximum. This graph pro-
vides an immediate view of whether 
heap headroom exists. A panel that 
tracks old-generation usage is also 
extremely valuable, because it is the 
best early-warning signal for retained 
object growth.
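With Micrometer-style metric names (the JMX exporter uses slightly different ones, and the pool id label depends on the garbage collector), the two panels could be driven by queries such as:

# Heap utilization as a ratio of the configured maximum
sum by (instance) (jvm_memory_used_bytes{area="heap"})
  / sum by (instance) (jvm_memory_max_bytes{area="heap"})

# Old-generation usage only (pool name is GC specific)
jvm_memory_used_bytes{area="heap", id="G1 Old Gen"}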
Non-heap metrics should be dis-
played separately from heap. Meta-
space deserves its own graph, 
because it can fail independently. 
Garbage collection activity should 
be shown as both pause time and pause frequency, because rising full-GC rates are often the leading indicator that the JVM is trying (and failing) to recover memory. As another rule of thumb, major GC pause times are a sign that the application is struggling to stay operational with the given amount of heap space as old-generation heap space fills up.
Minor GC pause times should be rather constant across the runtime of the application; an increase in minor
pause times might indicate that the ap-
plication is experiencing memory pres-
sure (e.g., because of higher load). In 
such a case, a countermeasure would 
be to scale the application either hori-
zontally (by deploying more instances) 
or vertically (by allocating more mem-
ory, e.g., through -Xmx).
Predicting 
OutOfMemoryError
JVM memory incidents rarely happen 
instantly. They show recognizable 
shapes and may build up over the span of days or even weeks.
One of the most common patterns is old-generation growth that does not drop after major collections. If Old Gen usage trends upward over minutes or hours and never meaningfully returns downward, the service is retaining objects. This classic memory leak signature is independent of the GC algorithm (Figure 2).
Another warning sign is increasing full GC frequency with diminishing benefit (Figure 3). The heap reaches a point where every GC cycle frees too little memory. The JVM responds by collecting more often, causing latency spikes and throughput collapse. This phase often comes before the final OutOfMemoryError and is where intervention still helps. One intervention that is not infrequently deployed in real-world production environments is restarting the application under controlled conditions. Although this intervention should be seen as a last resort to avoid service interruptions, increases in full GC cycles are a good sign that you should capture a heap dump of the application for further inspection or capture a runtime profile with Java Flight Recorder. In the best case, the application's heap memory just needs to be adjusted. In the worst case, such observations point to a memory leak that can only be fixed by the application developer.
A separate but equally important pattern is monotonic Metaspace growth. When Metaspace continually increases, heap graphs can look healthy right up until the crash. Operators should treat this as a first-class signal, especially in environments with frequent redeployments.
Finally, watch for the mismatch between JVM-internal metrics and operating system-level memory. If RSS climbs while heap remains stable, suspect native allocations: direct buffers, thread stacks, or JNI libraries. In this situation, increasing heap will not solve the problem and may accelerate it by reducing headroom for native memory.
Now that you understand how to observe JVM memory usage, you can look at how to optimize it.
Figure 2: Memory leak curve. Used Heap reaches 100 percent at 13GB, and the JVM crashes. 
Heap looks fine afterward, but not for long.
Figure 3: The GC frequently fails its 200ms pause time target, indicating GC pressure 
because of too little available memory.
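The leak signature described above can also be turned into an alert before the crash. A hedged Prometheus rule sketch, again assuming Micrometer metric names and a G1 pool label:

groups:
  - name: jvm-memory
    rules:
      - alert: JvmOldGenExhaustionPredicted
        expr: |
          predict_linear(jvm_memory_used_bytes{area="heap",id="G1 Old Gen"}[6h], 4 * 3600)
            > jvm_memory_max_bytes{area="heap",id="G1 Old Gen"}
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Old Gen is projected to exhaust the heap within four hours"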
Choosing a Garbage 
Collector
The term “garbage collector” has 
always been kind of a misnomer. It 
is a bit like calling the mayor of a 
town a “trash guy” just because tak-
ing care of the town's trash manage-
ment is one of the mayor’s duties. 
The GC is basically doing everything 
about the system RAM used by the 
JVM. You might ask: Why have 
a choice? Why not just have one 
“optimal” garbage collector? The 
reason is the amount of sophistica-
tion in JVM memorymanagement. 
Over the evolution of the JVM, it 
quickly reached a degree at which 
it became impossible to create one 
GC that was optimal under all cir-
cumstances. The main challenge is 
explained quickly: The “better” a 
GC, the more resources – RAM and 
CPU – it needs by itself. Therefore, 
choosing the correct GC has con-
sequences for the overall memory 
consumption, behavior under load, 
and resource efficiency.
The Contenders
We stick to the garbage collectors 
available in HotSpot OpenJDK. 
Other JDK distributions, like the one 
by Azul, and even alternative JVM 
implementations like Eclipse OpenJ9 
introduce broader implications than 
just memory management; therefore, 
the contenders as of Java 25 are:
1. Serial/ Parallel
2. G1
3. Z (ZGC)/ Shenandoah
Technically the list has five garbage 
collectors from which to choose, but 
you can put them into three groups to 
make the first decision.
Nature of Your Workflow
The first group has by far the least 
cost regarding CPU and memory, but 
it is also the least favorable option 
because of its huge stop-the-world 
collection pauses. You might consider 
it for jobs or CLI applications but 
almost never for long-running server 
applications.
The second group consists of only G1. 
This GC is special in that it aims for 
the sweet spot between cost and per-
formance and will therefore be your 
most likely pick.
Z and Shenandoah are made for 
workloads that trade maximum 
throughput and efficiency for ultra-
short pause times. Z is marketed with 
pauses around 1ms, which is interest-
ing to contrast against the 200ms tar-
geted by G1. Shenandoah is the only 
garbage collector being developed 
outside of the Java team. It was con-
tributed by Red Hat and more or less 
has the same features as ZGC. Now 
all you’re left to do is load test your 
workload with both options if you 
need the qualities of this group.
To Not Choose
Not picking and pinning an option is 
not advisable because the JVM will 
then make the decision for you. JVM 
version-dependent magic numbers 
form a heuristic to decide whether to 
use Serial in case you did not bother 
to define the GC of your choice. One 
CPU and less than 1,792MB of RAM 
will let it choose Serial because it has 
a better efficiency than G1 under such 
circumstances.
The Java team is working to change 
this behavior, though. JEP 523 is 
proposing to make G1 the default 
under all circumstances. They laid the 
groundwork for this option with Java 
release 25, of which they claim to 
reach a good-enough efficiency of G1 
even for constrained environments.
When G1 is the default, when can 
you choose ZGC or Shenandoah? See 
it as an optimization for workloads 
that have CPU and memory to trade 
for shortest GC latency: probably 
most web back ends these days. Keep 
in mind, though, as with all optimiza-
tions, you will have to load test and 
prove their positive effect.
Tuning Garbage Collection
The JVM is versatile with its ways 
of optimizing for a range of envi-
ronments. We want to share some 
of the tuning knobs that should be 
considered for Linux server workloads.
The one thing you should always do 
is set the garbage collector explicitly. 
The JVM will otherwise attempt to 
be smart by heuristically picking one, 
as explained above, which could lead 
to a different GC being chosen across 
your development workstation, your 
Build Server, or your deployment 
stages, leading to hard-to-debug is-
sues that might only be visible with a 
specific GC.
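Setting the collector is a single flag; these are the stock HotSpot options as of current JDK releases:

java -XX:+UseSerialGC ...       # group 1: Serial
java -XX:+UseParallelGC ...     # group 1: Parallel
java -XX:+UseG1GC ...           # group 2: G1 (the usual default)
java -XX:+UseZGC ...            # group 3: Z
java -XX:+UseShenandoahGC ...   # group 3: Shenandoah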
Planning Heap Size
Memory capacity planning for JVM 
instances is complex because of the 
several very specific uses GCs have 
for memory.
The irony is that the heap size is not 
the only use; it is just the most promi-
nent. You also have direct buffers, 
the metaspace, class caches, memory 
maps, and so on, as stated earlier. 
Although you could attempt to budget 
all of them explicitly, we argue that 
it would do more harm than good. 
Getting these explicit sizes right is dif-
ficult, partly because they are tightly 
coupled to the features of the envi-
ronment the JVM runs in. Just two 
examples are:
Thread stack size is double the 
size on ARM 64 bit. Setting it to a 
fixed value would allow for half the 
amount of possible threads, just be-
cause you picked a different machine 
type. You would like to prevent these 
kinds of surprises.
Direct buffers are used by NIO. 
NIO is the current I/ O component 
of the JVM. The amount of memory 
it uses is dependent on load: Set it 
to a fixed size you figured out by 
load testing and be unpleasantly 
surprised that a small change in, 
for example, network latency might 
quickly lead to an out of memory 
(OOM) while the heap is not even 
close to being saturated.
A proven approach is to let the JVM 
have some breathing room for dy-
namically sizing everything outside 
the heap. Observing the total memory 
utilization of the JVM while load test-
ing will inform you about how to set 
the heap.
Heap Sizing Example
Say you want to optimize the memory allocation for a JVM running in a container: You go with G1 as the starting point and want to know the amount of JVM heap space you should allocate relative to the memory limit given to the container. You have prepared a load test and would love to see the application get along with a total of 4GB so it would fit the node sizing of the cluster in which the container is supposed to run. What you do not know is whether this size is enough under load. You can ask the JVM to set the divide between heap and off-heap as a percentage to avoid setting the size with an absolute value. A good starting point for this test is 70 percent heap and 30 percent off-heap. The JVM parameter for that is -XX:MaxRAMPercentage=70. You can now alter the memory limit for the container, test each, and adjust according to what the metrics show.
Overcommitting Memory
All applications request memory from the operating systems on which they run, not just the JVM. Digital evolution has led most operating systems to answer these requests lazily: Memory is overcommitted and only backed by physical pages once it is actually used. The -XX:+AlwaysPreTouch flag makes the JVM
touch every requested memory 
page right at launch. You probably 
want to use this approach if your 
workload requires maximum risk 
avoidance, because it is no longer 
possible for the operating system to 
give this memory to another process 
without killing the JVM.
Performance Tuning
The most effective tuning measure 
is to update the JVM version. The 
developers introduce optimiza-
tions to footprint, throughput, and 
latency with almost every release. 
The ZGC design goal is to not of-
fer tuning knobs, but if you are 
looking at G1-specific tuning with 
JVM parameters, the champion is 
-XX:MaxGCPauseMillis, which sets 
G1 a soft but usually very effective 
goal for pause times. It will trade 
this value for some CPU cycles, but 
most people will be happy to spend 
that for shortening the maximum 
response time of their applications 
endpoints.
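A sketch of such a G1 pause-time goal; the 100ms value and the JAR name are placeholders to be validated by load testing:

java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar app.jar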
Reducing JVM Memory 
Footprint
The -XX:+UseCompactObjectHeaders 
and -XX:+UseStringDeduplication 
parameters both target reducing the 
memory footprint. They are off by 
default, even in JVM 25, which is 
a testament to the JVM developers’ conservative nature when introducing rather radical changes to the ecosystem. The ecosystem is probably what you want to look out for when testing your workloads for problems that might come up, because the JVM itself will have no issues with these flags.
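A sketch of enabling both flags, assuming Java 25 (on somewhat older releases, compact object headers may additionally require -XX:+UnlockExperimentalVMOptions):

java -XX:+UseCompactObjectHeaders -XX:+UseStringDeduplication -jar app.jar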
Taking Advantage of 
Environment Features
A property of the server on which 
your workload runs might be having 
more than one memory controller. 
Performance-wise, knowing this pos-
sibility is relevant information for 
the GC. The -XX:+UseNUMA parameter
lets it work with multiple memory 
Heap Sizing Example
Say you want to optimize the 
memory allocation for a JVM run-
ning in a container: You go with 
G1 as the starting point and want 
to know the amount of JVM heap 
space you should allocate relative 
to the memory limit given to the 
container. You have prepared a load 
test and would love to see the appli-
cation get along with a total of 4GB 
so it would fit the node sizing of 
the cluster in which the container is 
supposed to run. What you do not 
know is whether this size is enough 
under load. You can ask the JVM to 
set the divide between heap and off-
heap as a percentage to avoid set-
ting the size with an absolute value. 
A good starting point for this test is 
70 percent heap and 30 percent off-
heap. The JVM parameter for that 
is -XX:MaxRAMPercentage=70. You can 
now alter the memory limit for the 
container, test each, and adjust ac-
cording to what the metrics show.
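A sketch of such a test run, assuming a hypothetical image name and a 4GB container limit (the JVM sizes the heap relative to the limit it detects):

docker run --rm -m 4g my-java-app \
  java -XX:+UseG1GC -XX:MaxRAMPercentage=70 -Xlog:gc -jar /app/app.jar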
Overcommitting Memory
All applications request memory 
from the operating systems on 
which they run, not just the JVM. 
Digital evolution has led most
operating systems to answer these 
requests with virtual memory, 
even if that means they overcom-
mit the physical memory available. 
The pieces of the virtual memory 
(pages) are backed with actual 
memory the first time the applica-
tion uses them. You can change that 
globally on some operating systems 
and in some environments, like a 
VM, but not everywhere you might 
run an application.
Therefore, letting the JVM claim 
the maximum amount of memory 
you would want it to have does no 
harm and has no effect on systems 
that allow for unbound overcommit-
ting – other than keeping your met-
rics informed, which the parameter 
-XX:InitialRAMPercentage does.
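A sketch that reserves the heap at its target size from the start while keeping the 70 percent split from the earlier example (the values are placeholders):

java -XX:InitialRAMPercentage=70 -XX:MaxRAMPercentage=70 -jar app.jar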
You also have a way to commit the 
physical memory. The parameter 
-XX:+AlwaysPreTouch makes the JVM 
Azure Firewall is a fully stateful 
firewall as a service with built-in 
high availability and unlimited cloud 
scalability that manages both east-
west and north-south traffic. In this 
article, I look at forced tunneling, 
which allows northbound traffic to 
be inspected by a local firewall before 
leaving the regional Azure gateway.
To assess the capabilities of the ser-
vice properly in relation to its price 
[1], you should understand how in-
frastructure-based workloads – think 
virtual machines (VMs) – typically 
communicate with the outside world. 
In Azure, VMs are always deployed 
on virtual networks (VNets), where 
each VNet uses a freely selectable 
RFC 1918-compliant address range.
The VNet must have at least one 
subnet on which each VM uses the 
private IP address of its virtual net-
work interface – that is, an address 
from the subnet’s address range. Of 
course, the VM can also access mul-
tiple private IP addresses, either in 
the form of multiple IP configurations 
(one of which is always primary) or 
in the form of multiple network in-
terfaces. The VM needs the VNet to communicate:
 with other VMs in the same 
network,
 with other Azure services that re-
side on the same virtual network 
with a service or private endpoint,
 with Azure VMs on other Azure 
VNets by VNet peering or IPsec 
virtual private network (VPN),
 with the local site by IPsec VPN 
or Microsoft Azure ExpressRoute,
 with other Azure resources by 
their public endpoint, or
 with the Internet.
Outgoing Internet communication 
worked automatically out of the box 
(up to September 30, 2025) without 
any further configuration – even with-
out an explicit public IP. Microsoft 
refers to this implicit network address 
translation (NAT)-like procedure as 
standard outbound access. However, 
it was discontinued on the date men-
tioned above. Ever since, customers 
have had to configure outbound In-
ternet communication explicitly (e.g., 
with the use of a public IP address, 
a NAT gateway, or a source NAT 
(SNAT) in conjunction with the Azure 
Firewall). For inbound Internet con-
nectivity, the VM always (directly or 
indirectly) needs a public IP address – 
at an additional cost, priced by 
standard stock keeping units (SKUs; 
basic SKUs were discontinued at the 
same time).
Every virtual network in Azure has 
a default gateway (as a service) that 
is not visible to the customer and a 
default route table. Although you can-
not see the gateway as an entity in 
Azure, typing ipconfig on the guest 
system of the VM or with a Power-
Shell (PS) script injected into the VM 
from outside by the VM agent will 
reveal its existence. The gateway runs 
on the first available IP address (after 
the network address) in the VNet’s 
address range (e.g., 10.0.0.1).
Routing in Azure
The invisible (system) routing table 
contains matching routes (e.g., to the 
default gateway for Internet commu-
nication). Azure automatically cre-
ates system routes and assigns them 
to all subnets of a virtual network, 
with the route definition consisting 
of an address prefix and a next hop 
type, which can be a kind of alias or 
service tag.
For example, the address prefix 
0.0.0.0/ 0 is assigned to Internet. 
If the destination is not within the 
network’s address range, the route 
passes through the default gateway. 
In contrast, the VirtualNetwork next 
hop type stands for the address space
The Azure Firewall network security service combines threat protection, packet filtering, 
and application firewalling for cloud workloads in a platform-based offering. By Thomas Drilling
Forced Tunneling in Azure Firewall
 Thoroughfare
Azure. Although every newly created 
VM in Azure (or its network inter-
face, if you prefer) is linked to a new 
packet filter, users can also skip cre-
ating a filter. In Azure, packet filters 
do pretty much what their name sug-
gests, that is, what Linux users but 
not Windows users understand when 
they hear the word “firewall.” Secu-
rity groups only work in OSI Layer 4 
and, therefore, only support the UDP 
and TCP protocols (plus ICMP).
Otherwise, the functionalities are 
similar to those of the Linux kernel 
(netfilter or iptables), which means 
input and output rules with stateful 
inspection (also known as connection 
tracking on Linux), whose process-
ing order is determined by priority. 
The rule with the highest priority number (i.e., the lowest precedence) is processed last and is usually a deny.
Each rule includes information about 
the port (e.g., 80), protocol (TCP or 
UDP), source, destination, and action 
(allow or deny). IP addresses, IP ad-
dress ranges, or service tags (aliases) 
define the source, destination, or 
both. Every VM can communicate 
with every other VM on the same net-
work with the three default inbound 
and outbound rules that are always 
created automatically. Additionally, 
why your firewall rules do not seem 
to be working. Even worse, you might 
not notice that something is awry.
You have an easy way of checking 
whether Azure Firewall is being used 
with Azure Network Watcher, by 
the use of either topology visualiza-
tion, next hop analysis, connection 
troubleshooting, VPN analysis, or a 
combination of methods. Because 
services such as Azure Firewall incur 
costs (with standard SKUs, the offer is 
approximately $1,000 per month for 
provisioning,plus data transfer), the 
service is usually operated as part of 
a hub-spoke architecture, where the 
spoke networks need to be connected 
to the hub by a peering connection, 
and each must have a user-defined 
routing table to reach Azure Firewall. 
The Azure Architecture Center [2] 
provides more information about this 
process.
Packet Filters Instead of 
Firewalls
Although routing tables and default 
gateways are mandatory for every 
virtual network in Azure, they are 
not used for packet filters, which are 
known as network security groups in 
of the virtual network itself, which 
means that Azure automatically cre-
ates a route with an address prefix 
that matches the address range de-
fined in the VNet’s address space.
If you need special routes that go 
beyond the system routes, as in the 
firewall scenario, you will need to 
create your own routing tables (Fig-
ure 1) and populate them with routes, 
because you cannot see or change 
the system routes. This user-defined 
routing (UDR) always takes higher 
priority in Azure, ranking higher even 
than learned routes (Border Gateway 
Protocol, BGP) and system routes.
The next hop type is also fundamen-
tal later on, because you need it to 
define Azure Firewall as the next hop 
in a user-defined route table if devices 
(e.g., VMs) use it as a gateway on the 
source network. The definition uses 
the VirtualAppliance next hop type, 
which is specified by Azure Firewall’s 
private IP address.
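To illustrate, such a user-defined route can also be created with the Azure CLI instead of the portal shown in Figure 1 – a minimal sketch in which the resource group, resource names, subnet, and the firewall's private IP address (10.0.1.4) are placeholders:

az network route-table create --resource-group rg-fw-demo \
  --name rt-spoke-to-fw --location germanywestcentral
az network route-table route create --resource-group rg-fw-demo \
  --route-table-name rt-spoke-to-fw --name default-via-azfw \
  --address-prefix 0.0.0.0/0 --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.4
# Assign the table to the source subnet so the route actually takes effect
az network vnet subnet update --resource-group rg-fw-demo \
  --vnet-name vnet-spoke --name snet-workload --route-table rt-spoke-to-fw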
A user-defined routing table must 
be actively assigned to the source 
network in Azure, as well; otherwise, 
Azure would continue to use the de-
fault gateway for all Internet traffic, 
and you might wonder why Azure 
Firewall is not being used or wonder 
Figure 1: User-defined routes override standard system routes in Azure.
Azure Load Balancer accepts all in-
ternal inbound traffic and blocks all 
other inbound traffic.
The default outbound rules work in 
a similar way, except that they allow 
outbound Internet traffic. If security 
groups are created automatically 
when a VM is created, Azure links 
them to the network interface of the 
specific VM. If you create network 
security groups up front, you can as-
sign them to a VM when you add the 
VM or link them to a subnet, which 
means all VMs on the subnet can use 
the packet filter.
At the end of the day, though, packet 
filters are just packet filters; they do 
not work at the application level, al-
though Windows users might imagine 
they do, which leaves you with Azure 
Firewall as a platform as a service 
(PaaS) for Layers 3 to 7. Alternatively, 
you could search the Azure Market-
place for available firewall offerings; 
doing so will bring up offerings from 
virtually all noteworthy vendors, 
from Barracuda through Fortigate to 
Sophos. Many are based on virtual 
machines (infrastructure as a service, 
IaaS) and require the same mainte-
nance and administration as your lo-
cal firewall. You are then responsible 
for high availability and patching 
yourself, but software as a service 
(SaaS) and PaaS are offered in the 
marketplace, as well, including Azure 
Firewall.
Deploying Azure Firewall
As mentioned earlier, Azure Firewall 
is available in three SKUs: Basic, 
Standard, and Premium [3]; you need 
at least Standard for tunnel enforce-
ment. Deploy-
ment itself is 
simple and 
largely self-ex-
planatory. The 
work lies more 
in the accompa-
nying planning 
in terms of vir-
tual network-
ing, hub-spoke 
networking 
(peering), and 
UDR. The first point is important 
because Azure Firewall, as PaaS, re-
quires a specific subnet for itself that 
is visible with its own IP addresses.
This subnet must be named AzureFirewallSubnet and have a dimension
of at least /26 (i.e., 64 IP addresses) 
in CIDR notation. However, you 
can easily ensure this if you use the 
Azure Firewall template to create the 
subnet. For the tunnel enforcement 
feature, the firewall VNet must also 
contain another subnet named AzureFirewallManagementSubnet. Another
template is available for this purpose 
when you create the Firewall Man-
agement (forced tunneling) network 
(Figure 2).
A few terms need to be clarified. 
On the Azure portal, you will find 
Firewalls, Firewall Manager, Firewall 
Policies (all three related to Azure 
Firewall), and WAF (web application 
firewall) policies. The latter do what 
the name suggests and are not relevant
here. Firewall Manager is a central 
management hub for use cases in 
which you operate multiple Azure 
firewalls and want to create, manage, 
and assign your firewall rules inde-
pendently of the firewall instances. 
However, Azure Firewall can also be 
operated in a kind of classic mode, 
where the firewall rules are created 
firewall-side.
Figure 2: Deploying and operating Azure Firewall requires specific subnets.
$resourceGroupName = "fw-demo-rg"
$vnetName = "fw-vnet"
$firewallName = "fw-firewall"
$firewallPipName = "fw-pip"
$firewallMgmtPipName = "fw-mgmt-pip"
$location = "germanywestcentral"

New-AzFirewall -ResourceGroupName $resourceGroupName -Name $firewallName `
  -Location $location -Sku AZFW_VNet -VirtualNetworkName $vnetName `
  -PublicIpName $firewallPipName -ManagementPublicIpName $firewallMgmtPipName `
  -EnableDnsProxy $true -EnableForcedTunnel $true
Listing 1: Deploying the Firewall with PS
that the address is only resolved by 
user-defined DNS servers that you 
stored previously as the DNS servers 
responsible for the workload network. 
The matching network rule on Azure 
Firewall would then be a UDP rule 
for port 53 with the destination of 
your user-defined DNS servers (e.g., 
8.8.8.8).
Forced Tunneling in Azure
You can also configure Azure Firewall 
to forward first all Internet traffic to 
the specified next hop, such as an 
edge firewall at the local site, instead 
of directly to the Internet. For com-
pliance reasons, some companies 
require that multiple network security 
devices (e.g., firewalls) inspect outgo-
ing network traffic before it goes to 
an Internet destination. Perhaps your 
security policy also requires you to 
send all Internet-bound traffic first to 
another network firewall (a network 
virtual appliance, NVA) in Azure or 
directly to a local firewall for inspec-
tion before it reaches the Internet.
Azure Firewall also supports split tun-
neling, which is the ability to forward 
traffic selectively. One such scenario 
is activating Windows licenses with 
a key management service (KMS) 
system, where Azure-based Windows 
VMs require a public source IP ad-
dress owned by Microsoft rather than 
their local Internet gateway IP ad-
dress. You could solve this situation 
with custom routing tables on AzureFirewallSubnet (see below). For the
tunneling enforcement scenarios here, 
the protocol, 3389 as the destination 
protocol, and the destination VM’s 
private IP address as the translated 
destination (i.e., also 3389).
The most important use case for 
Azure Firewall is application rules 
for outgoing HTTPS traffic. The key 
feature here is that you can use fully 
qualified domain names (FQDNs; 
instead of IP addresses) in the rules; 
moreover, Azure Firewall recognizes 
FQDN tags. Microsoft defines these as 
a group of FQDNs that are assigned 
to known Microsoft services – for ex-
ample, to allow the required outgoing 
network traffic (e.g., Windows update 
traffic) to pass through the firewall.
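For comparison, such an outbound HTTPS application rule can also be created from the command line – a sketch using the Azure CLI's azure-firewall extension in the classic (non-policy) mode mentioned later, reusing the names from Listing 1 and the source network and FQDN of the rule shown in Figure 3:

az extension add --name azure-firewall
az network firewall application-rule create \
  --resource-group fw-demo-rg --firewall-name fw-firewall \
  --collection-name app-allow-web --priority 200 --action Allow \
  --name allow-duckduckgo --protocols Https=443 \
  --source-addresses 10.2.0.0/24 --target-fqdns duckduckgo.com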
Additionally, when creating rules, 
Azure Firewall service tags can be 
used in the target field for network 
rules instead of specific IP addresses. 
A service tag represents a group of 
IP addresses. Service tags are primar-
ily used to reduce the complexity of 
security rules. Also, Azure Firewall 
includes a built-in rule collection 
for infrastructure FQDNs that are al-
lowed by default.These FQDNs are 
platform-specific and cannot be used 
for other purposes.
The rule configured in Figure 3 is 
a very simple example of a type of 
web browsing rule. It allows outgoing 
HTTPS traffic to https:// duckduckgo.
com from the workload source net-
work 10.2.0.0/ 24. Azure Firewall 
also recognizes network rules (Layer 
4). Application and network rules 
then take effect in combination. For 
example, you could restrict the reso-
lution of https:// duckduckgo.com so 
If you set up the required resources 
in advance – all that you are miss-
ing is two public IP addresses for the 
firewall itself and its management 
interface (only required for tunnel en-
forcement), which you can also create 
on the fly when creating the firewall – 
the deployment dialog for Azure 
Firewall on the portal is completed 
quickly. Optionally, you can deploy 
the firewall with Azure PowerShell 
(Listing 1).
Defining Firewall Rules
The main reason (apart from forced 
tunneling) for the use of Azure Fire-
wall instead of default routing with a 
default gateway, system routes, and 
network security groups (see above) 
is to control incoming and outgoing 
traffic from Azure to the Internet, the 
local site (over IPsec VPN), or both in 
a more precise and granular way than 
you can with packet filters in OSI 
Layer 4. Azure Firewall supports rules 
for the application layer (app rules), 
network layer (network rules), and 
network address translation rules for 
destination NAT (DNAT).
DNAT rules are useful for securely ac-
cessing virtual machines on an Azure 
VNet (without a public IP address) 
for maintenance or management 
tasks (e.g., by Remote Desktop Pro-
tocol (RDP)). Otherwise, you would 
need a self-managed jump host or the 
Azure Bastion service, for which you 
would be billed. For a DNAT rule, you 
would use the Azure firewall’s public 
IP address as the destination, TCP as 
Figure 3: A simple application rule on Azure Firewall allows traffic to DuckDuckGo.
you need to deploy Azure Firewall 
with the Management NIC enabled, 
but without a public IP address. En-
abling the Management NIC means 
that Azure will create a separate 
management network interface with a 
public IP address that Azure Firewall 
uses for its management operations.
Setting Up a Test 
Environment
To try out tunnel enforcement, you 
can turn to a test lab provided by 
Microsoft on GitHub [4] that covers 
both normal tunnel enforcement and 
split tunneling for the KMS scenario 
outlined above.
The lab uses a simulated local site 
in Azure for its site-to-site VPN. This 
site is connected to the hub VNet of 
an Azure firewall by IPsec site-to-site 
VPN with the Azure VPN gateway, 
which requires an additional subnet 
named “GatewaySubnet” for each site. 
The firewall at the local site is also 
represented by an Azure firewall, 
which means you need a total of 
three virtual networks: (1) a network 
representing the on-premises side 
and containing a gateway subnet for 
the VPN gateway, (2) the mandatory 
firewall subnet for the Azure firewall 
representing the local firewall, and 
(3) a workload subnet for a test work-
load in the form of an Azure VM.
On the Azure side, the lab uses a 
hub-spoke architecture in which the 
workload VNet is peered with the 
hub VNet, which also contains: a 
gateway subnet for the VPN gateway; 
AzureFirewallSubnet for the Azure 
firewall; and another subnet, AzureFirewallManagementSubnet, for the
Management NIC.
Additionally, the lab environment cre-
ates the required routing tables with 
custom routes. The ARM template 
on GitHub makes it extremely easy 
to deploy the required components. 
All you need to do is specify an ad-
min username and password and a 
pre-shared key (PSK) for the IPsec 
connection. Deployment takes about 
40 minutes and outputs the mapped 
environment distributed across two 
resource groups.
The first resource group, rg-fw-azure, 
contains all of the Azure environ-
ment components – that is, the hub 
network with the required subnets, 
the spoke network with the Workers 
subnet, and the VPN gateway (in-
cluding the matching local network 
gateway). It also includes the site-to-
site connection for the VPN gateway, 
the Azure firewall, the firewall policy, 
three custom routing tables (route-
spokes-snets, route-fw-snets, and 
platform-managed-rt), the required 
public IP addresses, a diagnostics set-
ting, and a log analytics workspace 
for monitoring.
The second resource group, rg-fw-
onprem, contains the simulated 
on-premises firewall in the form of 
an Azure firewall, the simulated on-
premises VPN device on the subnets 
of the on-premises VNet in the form 
of an Azure VPN gateway and local 
network gateway, a site-to-site VPN 
connection (always a separate entity 
in the case of the Azure VPN gate-
way) to Azure, and a workload VM 
on the associated subnet for the on-
premises Workers. Also created here 
are the required firewall policies and 
a diagnostics setting.
Incidentally, the master template 
references four linked templates. The 
first template creates all of the Azure-
side resources in one fell swoop, the 
second linked template creates the 
complete on-premises environment 
(simulated in Azure), the third cre-
ates the two Azure VPN connection 
objects (VPN Connection) for the IPsec 
site-to-site policy, and the fourth cre-
ates the diagnostics settings for the 
Figure 4: You can visualize an ARM template in VS Code.
the form of FQDN owaspdirect.azurewebsites.net (an app deployed in
Azure as an Azure Container Instance 
during deployment) is allowed by an 
application rule on Azure Firewall if 
the source is the IP group ipg-azure-
network; in this case, it is then routed 
over the local site because of tunnel 
enforcement and rejected by the fire-
wall there. To test this arrangement, 
call up owaspdirect.azurewebsites.net 
in the VM’s browser.
Not only will you see an error in the 
browser, you will also find an error 
message in the Azure Firewall log 
analytics – Action: Deny. Cause: No 
rule matches. Proceed with default ac-
tion – because the Azure firewall in 
the local environment is dropping the 
traffic, which you can see in the con-
figured Log Analytics workspace. You 
see two entries for this request: one 
for each Azure firewall. The second 
log entry shows that the local firewall 
rejected the request, which in turn 
confirms that the configuration forces 
all Internet traffic to use the local 
network because the Azure container 
instance has a public endpoint.
A quick look at the source IP address 
of the local firewall reveals that a 
source NAT rule sent it to the private 
following cmdlet should complete 
successfully:
Test-NetConnection -ComputerName 10.100.0.68 -Port 3389
and initiate a TCP connection to port 
3389, which is open by default on 
Windows computers. The IP address 
of the “local” VM is 10.100.0.68. To 
avoid having to connect to the VM 
console by RDP up front, you can 
use the option of executing PS scripts 
externally with the VM agent under 
the Operations section of the Azure 
portal.
Of course, you will probably want to 
know whether the outgoing Internet 
traffic from the workload network in 
Azure uses the local firewall as its 
Internet gateway. To find out, access 
a public IP address from the Azure 
VM. You will then see in log analytics 
how the request reaches the local fire-
wall because of the enforced tunnel 
configuration. You can also see that 
tunnel enforcement works throughout 
the environment for any traffic des-
tined for a public IP address, which 
confirms that the application rules 
configured on Azure Firewall are 
working cor-
rectly. You can 
view these in 
Firewall Man-
ager or directly 
in the associ-
ated policies 
(Figure 5).The allowed 
destination in 
on-premises firewall to stream the 
required monitoring logs to a log 
analytics workspace in Azure. If you 
installed the ARM extension and the 
ARM template viewer in your local VS 
Code, you can also visualize the tem-
plate components (Figure 4).
Another option is to deploy all the re-
quired objects manually, step by step, 
in the Azure portal. The procedure 
and required parameters for network 
ranges can be found on GitHub [5]; 
however, the easiest and fastest way 
to deploy the test environment is by 
entering the PowerShell commands 
shown in Listing 2 directly from the 
Cloud Shell terminal.
Trying Out Tunnel 
Enforcement
After successful deployment, it’s time 
to test the setup. First, check the con-
nectivity from the Azure VM to the 
local VM to determine whether the 
basic deployment, routing, and tunnel 
enforcement are working. The data 
traffic flows from the VM hosted in 
the snet-trust-workers subnet of the 
vnet-spoke-workers VNet on Azure 
through the Azure firewall in the hub. 
The hub, in turn, resides on the vnet-
hub-secured VNet thanks to the vgw-
vnet-hub-secured VPN gateway.
The reason for this setup is that only 
the default (system) route for IPsec, 
which the system learns from the 
BGP, is used here. The data traffic 
then reaches the local firewall and, 
from there, the local VM. To avoid 
asymmetric routing, the return path 
to the Azure VM is the same; the 
Figure 5: The application rule in Azure Firewall allows access to a container app with a public endpoint provided by the lab scenario.
Listing 2: Deploying the Test Environment
$securePassword = ConvertTo-SecureString "YourPassword" -AsPlainText -Force
$securePSK = ConvertTo-SecureString "YourPSK" -AsPlainText -Force

New-AzSubscriptionDeployment -Name demoSubDeployment -Location westeurope `
  -TemplateUri "https://raw.githubusercontent.com/Azure/Azure-Network-Security/master/Lab%20Templates/Lab%20Template%20-%20Azure%20Firewall%20Forced%20Tunnel%20Lab/Templates/azfwForceTunnelTemplate.json" `
  -AdminPassword $securePassword -SharedKey $securePSK
IP address of one of the Azure firewall 
instances (192.168.0.70). This behav-
ior is the result of the firewall classify-
ing all traffic with a destination IP ad-
dress outside the RFC 1918 ranges as 
NATed. Incidentally, you can change 
Azure Firewall’s SNAT behavior by 
switching to the Private IP ranges 
(SNAT) tab in the associated policy 
and selecting one of the available op-
tions. For example, the Learned SNAT 
IP Prefixes option was still a preview 
feature at the time of writing.
Another supported scenario is the 
shared tunnel mentioned above, 
which is used, for example, in the 
KMS scenario described earlier during 
Windows activation, because activa-
tions would initially fail with forced 
tunneling; after all, the configuration 
routes all traffic from the Azure VM 
to be activated to the local network. 
The Azure VM is then unable to con-
nect to KMS servers for Windows 
activation. A troubleshooting docu-
ment [6] describes this scenario in 
detail and suggests custom routing.
Now add a new send-to-kms route 
with destination 23.102.135.246/ 
32 and Internet as the next hop type 
to the route-fw-snet routing table at-
tached to AzureFirewallSubnet. The 
IP address, 23.102.135.246, is one 
of three KMS servers that process 
Windows activations for Azure VMs 
worldwide. You must allow the traffic 
to pass through Azure Firewall, which 
the send-to-kms rule does, allowing 
all connections from the 192.168.2.0/ 
24 subnet to KMS servers over the 
Internet. You can test this again in 
your Azure VM PowerShell session or 
by remote script execution. To do so, 
drag your Azure Firewall logs into log 
analytics again, which should confirm 
that the traffic passed through the 
firewall and that the TCP request to 
the Internet was allowed.
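The same route can also be added from the command line – a sketch with the Azure CLI, assuming the resource names from the lab deployment described above:

az network route-table route create --resource-group rg-fw-azure \
  --route-table-name route-fw-snet --name send-to-kms \
  --address-prefix 23.102.135.246/32 --next-hop-type Internet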
Conclusion
Forced tunneling has long been an 
important security requirement for 
many organizations. The need to 
inspect and monitor Internet-bound 
traffic from Azure resources is grow-
ing with the increasing prevalence 
of Azure-powered infrastructures. 
Configuring Azure Firewall to tunnel 
all traffic downstream for additional 
monitoring meets the strict require-
ments for maintaining compliance in 
many organizations’ environments. 
Additionally, the ability to split spe-
cific traffic to meet other dependen-
cies and requirements is key to main-
taining an operational and controlled 
infrastructure. 
Info
[1] Azure Firewall pricing: 
[https:// azure. microsoft. com/ en-us/ 
 pricing/ details/ azure-firewall/ # pricing]
[2] Azure Architecture Center: [https:// learn. 
 microsoft. com/ en-us/ azure/ architecture/ 
 networking/ architecture/ hub-spoke]
[3] Firewall SKUs: 
[https:// learn. microsoft. com/ en-us/ azure/ 
 firewall/ choose-firewall-sku]
[4] Test environment: [https:// github. com/ 
 Azure/ Azure-Network-Security/ tree/ 
 master/ Lab%20Templates]
[5] Manual deployment: [https:// github. com/ 
 Azure/ Azure-Network-Security/ tree/ 
 master/ Lab%20Templates/ Lab%20Tem-
plate%20-%20Azure%20Firewall%20
Forced%20Tunnel%20Lab# readme]
[6] Azure Windows VM troubleshoot-
ing documentation: [https:// learn. 
 microsoft. com/ en-us/ troubleshoot/ 
 azure/ virtual-machines/ windows/ 
 welcome-virtual-machines-windows]
The Author
Thomas Drilling has been a full-time free-
lance journalist and editor for science and IT 
magazines for more than 10 years. He and his 
team make contributions on the topics of open 
source, Linux, servers, IT administration, and 
Mac OS X. Drilling is also a book author and 
publisher, advises small and medium-sized en-
terprises as an IT consultant, and lectures on 
Linux, open source, and IT security.
Have you ever gone online, per-
haps while on holiday, and received 
a notification that your favorite BBC 
original comedy isn’t available in 
your area? How does the service 
know what country you’re in? You 
might assume it’s information sent 
by the client in an HTTP header, 
but in reality, it involves a combina-
tion of allocation records, inference, 
and engineering that allows web 
applications to assess where an IP 
address likely originates (Figure 1).
IP address blocks are distributed by 
regional Internet registries (RIRs): 
ARIN in North America, RIPE in 
Europe, and others worldwide. Each 
registry records who owns a block 
and for which country it is intended. 
If an IP address belongs to a block 
registered to a British Internet ser-
vice provider (ISP), for example, it 
is reasonable to infer that the traffic 
originates from the United Kingdom. 
Geolocation databases aggregate RIR 
allocation data, ISP documentation, 
and historical routing information to 
estimate a country ISO code (e.g., US) 
and often a subdivision code (e.g., WI 
for Wisconsin; Figure 2).
These databases can be queried di-
rectly by modules such as GeoIP2 
Python [1] or indirectly through cloud 
services. In this article, I describe how 
to set up geolocation through a cloud 
service. Here, I’ll use Cloudflare as an 
example. For other environments, see 
your cloud vendor documentation.
Cloudflare relies on proprietary infer-
ence systems to attach geographic 
metadata to incoming requests. In 
this tutorial, I parse the region code 
from Cloudflare’s edge request con-
text object and use it to build a lay-
ered geofence control. Although the 
example uses Cloudflare, the same 
principles apply to other geolocation 
platforms and can be adapted to im-
plement your own geofence policy.
Know Your Geofence
There are many reasons to deploy 
a geofenced application. Private 
Use geofence technology to isolate your web services from the broader public Internet with customsecurity 
rules and worker routes. By Sam Klein
Isolating Cloud Web Services
 Passport Check
Figure 1: Cloudflare blocks access to the origin server. Users see this when a geofence 
security rules policy blocks access to a website.
the client’s direct source IP. Although 
the original client IP can still be for-
warded by headers (which can be 
implemented with Apache modifica-
tion to accept reverse proxy headers), 
IP-based bans enforced at the origin 
are not a substitute for edge-level 
geographic controls or account-based 
enforcement. IP geolocation is also in-
herently approximate. Users might be 
restricted incorrectly if their network 
is registered in a different region from 
their physical location.
Another limitation is the widespread 
availability of virtual private network 
(VPN) services and proxy services. 
Users can deliberately route traffic 
through another geographic region 
(e.g., appearing to originate from 
Chicago while physically located in 
Germany). As a result, geofencing pri-
marily filters low-effort or incidental 
access. It should not be expected to 
prevent determined circumvention.
Taken together, these limitations 
reinforce that geofencing is a coarse-
grained control. It reduces routine 
access from specific regions but does 
not reliably constrain user behavior 
on its own. Where the risk of circum-
vention is unacceptable, geofencing 
should be combined with additional 
controls such as account verification, 
identity checks, or application-level 
monitoring. Geographic filtering 
might help determine when such ad-
ditional measures are appropriate, but 
it should not be relied on as the sole 
mechanism of enforcement.
Cloudflare Example
Consider a phpBB (PHP bulletin 
board package) forum hosted for a 
local club in Wisconsin. The admin-
istrator wants to reduce spam and 
unwanted traffic by limiting access 
primarily to users within the state. 
Open registration is closed, members 
are known personally, and SSH access 
is restricted to a specific administra-
tive IP address.
The forum is hosted on a Digital-
Ocean Droplet. Each Droplet is as-
signed a static public IPv4 and IPv6 
address for its lifetime [2], providing 
a stable origin endpoint. A domain 
This approach is particularly useful for 
operators who want to avoid collecting 
or verifying personal information but 
still want to run a limited-scope service, 
such as a hobby site or community 
forum intended for a specific local-
ity or private network. In these cases, 
geofencing functions as a proportional 
control: It filters routine access without 
introducing additional identity verifica-
tion or data collection obligations.
Policy decisions can be enforced at 
the network edge without retaining 
geographic information beyond what 
is necessary to evaluate a request. 
Persistently storing IP-based location 
data or building user profiles on the 
basis of geographic behavior intro-
duces privacy, security, and regula-
tory concerns that extend beyond 
simple access control. For this reason, 
geofencing mechanisms should be 
implemented as stateless controls. 
They should evaluate a request, en-
force policy, and discard geographic 
context immediately. Logging should 
be limited to aggregate operational 
metrics rather than per-user geo-
graphic reporting.
Geofencing is a technical control, not 
legal advice. Its effectiveness and ac-
ceptability vary by jurisdiction and 
evolve over time. Even administrators 
of small, non-commercial sites might 
eventually be required to implement 
more precise compliance mechanisms. 
Geofencing should therefore be under-
stood as one tool for reducing opera-
tional exposure, not as a comprehen-
sive or permanent compliance strategy.
Limitations to Geofencing
Depending on how it is implemented, 
geofencing can introduce unintended 
side effects. It is important to un-
derstand its practical limitations and 
failure modes.
One limitation involves shared IP 
infrastructure (e.g., banning users 
from registering with an IP address 
because IP addresses no longer iden-
tify user behavior). In a Cloudflare 
deployment, incoming requests termi-
nate at Cloudflare’s edge network. At 
the TCP layer, the origin server sees 
Cloudflare’s IP addresses rather than 
organizations might operate region-
specific web assets that are not in-
tended for access outside a defined 
locality. Commercial services use geo-
fencing to prevent fraudulent transac-
tions or to control the distribution 
of licensed content (e.g., streaming 
platforms that offer different catalogs 
in different countries). Geofencing 
can also be used defensively. Some 
national networks employ large-scale 
filtering regimes, and some US-based 
retailers temporarily blocked Euro-
pean IP addresses after the publica-
tion of the European Union General 
Data Protection Regulation (GDPR) to 
reduce regulatory uncertainty.
Blocking requests from specific juris-
dictions does not guarantee regulatory 
compliance. However, it can reduce 
the number of users and interactions 
originating from particular geographic 
locations, thereby shrinking a ser-
vice’s regulatory footprint while still 
requiring administrators to comply 
with applicable law.
Figure 2: Geolocation isn’t a fixed attribute 
but is inferred from data.
name is configured to proxy through 
Cloudflare, so all inbound HTTP traf-
fic first passes through Cloudflare’s 
edge before reaching the Droplet.
The following sections build such an example from scratch, focusing not on the application itself but on its network infrastructure. I look at why a domain name is essential to geofencing and how a reverse proxy and firewalls ensure that user traffic cannot bypass your geofence policy; finally, I show how to implement the controls specific to the geofence itself.
Creating a Domain Name
This section explains why a registered domain name is necessary for your geofence policy
with Cloudflare. I start by buying 
a domain name from a registrar, of 
which you have many to consider: 
GoDaddy, Namecheap, etc. You 
can use an existing account or any 
domain name that Cloudflare accepts. 
Subdomains can also be delegated 
through name server (NS) record redi-
rection to Cloudflare or through enter-
prise management [3]. For long-term 
accessibility, though, it will always be 
a better idea to own the root domain.
On creating an account, Cloudflare 
asks for your domain name. Cloudflare
needs an address for its reverse 
proxy (i.e., a server positioned be-
fore the application) so requests 
can be handled and forwarded 
for the reliability of the service. 
For example, Apache can act as a 
proxy for services like the Forgejo 
forge and repository. When a cli-
ent types http:// example.com (port 
80), Apache forwards the request 
to a localhost service on a specific 
port number (not on port 80 pub-
licly), which, if you were using the 
server as a developer, might appear 
in your browser locally as http:// 
localhost:3000/ example (which is 
private; the Internet doesn’t see 
port 3000). This approach hides the 
internal structure of the application 
from the client.
Cloudflare’s reverse proxy does the 
same for your IP address so that 
when the client looks for example.
com, the DNS address first goes to 
Cloudflare and not to the origin ad-
dress (your static IP address); then, 
it comes out from Cloudflare’s proxy, 
providing a security boundary be-
tween your server and the request. 
Because Cloudflare proxies at the 
public edge, the origin can still run 
its own reverse proxy such as Apache 
forwarding requests to a backend ser-
vice like Forgejo.
After you secure a domain name 
(preferablya root domain, because 
that is the assumed condition on-
ward), you will drop in the external 
nameservers Cloudflare provides into 
your domain's nameserver list (Figure 3) after enrolling your domain. If you use Cloudflare as a registrar, this
behavior will be the default. These 
nameservers 
are gener-
ated for your 
account [4].
Now go to your 
Cloudflare ac-
count. On the 
Domain Man-
agement page 
click Onboard 
a domain (Fig-
ure 4). In the 
field, submit 
your domain 
name with a 
quick scan. You 
will be given 
the name serv-
ers to put into 
your records, 
as shown in 
Figure 3. Once 
onboarded, the 
console will 
refresh to show 
that your link 
is active. (You 
can see that 
example.com 
was added in 
Figure 4).
Figure 3: Adding Cloudflare nameservers to a domain registrar.
Figure 4: Onboarding your domain from the Domain Management console.
block traffic at a lower network layer 
if possible.
Blocking packets at Layer 3 or 4 (with the DigitalOcean firewall) is a better use of resources than blocking further up at the application layer, saving resources for legitimate
requests (see Figure 5). At the top 
layer, the traffic through the Internet 
has yet to be processed. Before it 
reaches your application (which has 
finite resources), DigitalOcean firewall 
blocks anything you specify.
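Cloudflare publishes its current edge ranges as plain text (see also the IP ranges page in the info box [5]); a quick way to pull them for the inbound allowlist described below is:

curl -s https://www.cloudflare.com/ips-v4
curl -s https://www.cloudflare.com/ips-v6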
The last step in this section is to copy 
the IP address from the Droplet page 
and paste it into your Cloudflare DNS 
management console. Return to the 
Cloudflare Domain Management con-
sole (Figure 4) and select the desired 
domain name. In the left column 
navigation bar, select DNS | Records 
the Firewalls tab on the Networking 
page from the Droplet itself or from 
the Management | Firewalls screen. 
In the Create Firewall window, you 
can add a Name for the firewall (Fig-
ure 7) that is used to differentiate 
multiple policies. Adding inbound 
and outbound rules is the focus of 
this article.
An important aspect of effective 
Cloudflare implementation is to set 
inbound rules so that only Cloudflare 
IPs have direct access to HTTP 80 or 
443. You must block all other traffic. 
With inbound rules, anything not 
explicitly allowed will be considered 
blocked, so you will simply add all 
the IP addresses from their documen-
tation [5] (Figure 7). Although you 
could do this at the application layer 
with the operating system, it’s best to 
Once your domain has been regis-
tered with Cloudflare and it is using 
the correct name servers, you should 
assign your domain to a static IP 
address.
Creating a Proxy Connection
DigitalOcean is an infrastructure as a 
service (IaaS), implementing virtual 
servers and storage. Other services 
you might use include AWS Elastic 
Container Service (ECS), Google 
Kubernetes Engine (GKE), VMware 
Tanzu Platform, Azure Kubernetes 
Service (AKS), and IBM Red Hat 
OpenShift on IBM Cloud. Other 
smaller platforms can also meet the 
requirements, as long as they can 
host a virtual private server (VPS) 
or virtual dedicated server (VDS), 
maintain a static IP address for your 
service, and enforce firewall controls. 
(Although your application can ac-
complish this task, it’s better to do it 
on the network and transport layer, 
functioning with IP addresses, ports, 
and protocols.)
Now you create a DigitalOcean drop-
let (which contains your VPS) by 
clicking the Create button at the top 
right of the navigation bar adjacent to 
your project and team name. Select 
Droplets, then choose whatever speci-
fications you desire: region, datacen-
ter, image, size, CPU options, storage, 
backups, SSH key, and hostname. For 
the example in this article, you don’t 
need to specify any of these options 
because you’ll just generate a static IP 
address. Once you create the Droplet, 
the next window shows its progress. 
At any time, if you need the IP ad-
dress, you can find it by the time the 
animation finishes or under the Man-
age category in the left navigation 
column. Select Droplets to preview 
the name, IP address, and time cre-
ated in a table.
Before copying the IP address, create 
a firewall policy for the Droplet. At 
the Droplets table mentioned in the 
previous paragraph, click the name 
of your latest Droplet, go to Network-
ing (Figure 6), and select an exist-
ing firewall or create your own by 
clicking the Create Firewall button in 
Figure 5: Users accessing your web services go through the DigitalOcean network from 
the Internet.
Figure 6: Under the Manage category in the left column navigation bar is the Droplets 
option. From here you can manage your firewall and see networking information.
and add your Droplet static IP address 
to the IPv4 address field for Type A 
and Name @ (Figure 8).
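If you prefer to script this step, the same proxied A record can be created through the Cloudflare API – a sketch in which the zone ID, API token, and the address 203.0.113.10 are placeholders for your own values:

curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"@","content":"203.0.113.10","proxied":true}'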
Security Policy for 
Geofencing
At a high level, enforcement occurs 
in two layers (Figure 9). In this ex-
ample, (1) a Cloudflare security rule 
blocks all traffic originating outside 
the United States, and (2) a Cloud-
flare Worker enforces a second check, 
allowing only requests from Wiscon-
sin (region code WI).
Layer 1: Country-Level Block 
(Security Rule)
After onboarding the domain in 
Cloudflare and updating nameservers 
at the registrar, navigate to: Security | 
Security rules | Create rule. Here, cre-
ate a rule with the Action value Block 
and use the following expression:
(ip.geoip.country ne "US")
and not http.request.uri.path contains "/.well-known/acme-challenge/"
This expression blocks all requests 
that do not originate from the United 
States, while allowing Let’s Encrypt 
ACME validation traffic. If certificate 
management is handled entirely by 
Cloudflare, the ACME exception might 
not be required. The result should 
look like Figure 10.
The field ip.geoip.country (also avail-
able as ip.src.country) returns the 
two-letter ISO country code inferred 
by Cloudflare [6] (see Figure 2). Be-
cause the rules on 
the Security rules 
tab execute before 
Workers, requests 
blocked here 
never reach the 
next enforcement 
layer.
Layer 2: 
State-Level 
Enforcement 
(Worker)
To create a 
Worker, go to 
Workers | Manage 
Workers | Create 
Application | Start 
with Hello World, 
then replace the 
default code with 
that in Listing 1.
Cloudflare at-
taches geographic 
metadata to each 
request through 
the request.cf 
object. The coun-
try field contains 
the ISO country 
code, and region-
Code contains the 
state or provincial 
subdivision when 
Figure 7: Adding an inbound or outbound rule can be done in the web console GUI. Select the table element you 
want to edit or create a new row. Fill out the Type, which determines what client is expected, the Protocol type 
(either TCP or UDP), Port Range, and Sources.
Figure 8: The proxy configuration for a static IP.
export default {
  async fetch(request) {
    const url = new URL(request.url);

    // Allow Let's Encrypt validation
    if (url.pathname.startsWith("/.well-known/acme-challenge/")) {
      return fetch(request);
    }

    const regionCode = request.cf?.regionCode;
    const country = request.cf?.country;

    // Allow only US traffic from Wisconsin
    if (!(country === "US" && regionCode === "WI")) {
      return new Response("Access denied", { status: 403 });
    }

    return fetch(request);
  }
}
Listing 1: Worker Route Expression
[3] Cloudflare CNAME partial setup, no root: 
[https:// developers. cloudflare. com/ dns/ 
 zone-setups/ partial-setup/]
[4] Cloudflare nameservers: 
[https:// developers. cloudflare. com/ dns/ 
 nameservers/]
[5] Cloudflare IP ranges: 
[https:// www. cloudflare. com/ ips/]
[6] Cloudflare ip.src.country: 
[https:// developers. cloudflare. com/ 
 ruleset-engine/ rules-language/ fields/ 
 reference/ ip. src. country/]
The Author
Sam Klein is a cybersecurity engineer with 
more than six years of experience safeguarding 
enterprise infrastructure and shaping resilient 
systems at scale. His career spans embedded 
Linux platforms, open source development, 
and academic collaborations on privacy and 
application security. Klein is currently on sab-
batical for full-time parenting while continuing 
to contribute to the cybersecurity community.
thoughtfully, it 
can serve as a 
practical access 
control mecha-
nism for small 
organizations, 
private com-
munities, and 
region-specific 
services that 
want to limit 
exposure without 
expanding their 
data collection 
footprint. 
Info
[1] GeoIP2 Python: 
[https:// geoip2. readthedocs. io/ en/ latest/]
[2] DigitalOcean Droplet static IP address: 
[https:// docs. digitalocean. com/ support/ 
 are-my-droplets-ip-addresses-static/]
available. This Worker evaluates both 
values. If the request is not from the 
United States and Wisconsin, it re-
turns an HTTP 403 response, as seen 
in the test panel (Figure 11, right); 
otherwise, the request proceeds to the 
origin server.
By blocking non-
US traffic at the 
Security rules level, 
most unwanted re-
quests are dropped 
before Worker exe-
cution. The Worker 
then applies a 
more granular re-
gional check. The 
origin server only 
receives traffic that 
has passed both 
controls.
This design keeps 
enforcement at the 
edge, minimizes 
origin load, and 
avoids retaining 
geographic data be-
yond the evaluation 
of each request.
Conclusion
Geofencing is often 
associated with 
media licensing 
and streaming re-
strictions, but its 
utility extends well 
beyond entertain-
ment platforms. 
When implemented 
Figure 10: Create a custom rule in the Cloudflare Security console.
Figure 9: Each request goes through different edge layers. Security 
rules for the country are checked before the region check, reducing 
checks per request.
Figure 11: Custom rules are created in the Cloudflare Security console.
Many vulnerabilities in AWS are not 
caused by zero-day attacks but by 
configuration errors – from Amazon 
Simple Storage Service (S3) buckets 
with open write permissions, Elastic 
Compute Cloud (EC2) snapshots that 
accidentally publish access creden-
tials, or identity and access manage-
ment (IAM) roles without multifac-
tor authentication. The Prowler [1] 
open source tool [2] systematically 
checks for violations of security stan-
dards and visualizes risks, and it can 
be precisely tailored to individual 
requirements.
The software is not a black box analy-
sis tool, but a framework for traceable 
security audits at the command line 
level. The checks are based on best 
practices and benchmarks (e.g., from 
such organizations as the Center for 
Internet Security (CIS), the US Na-
tional Institute of Standards and Tech-
nology (NIST), and Payment Card 
Industry Data Security Standard (PCI-
DSS)) and deliver immediately action-
able results for AWS, Azure, Google 
Cloud Platform (GCP), Kubernetes, 
and Microsoft 365. One focus is on 
AWS, where the scope of testing is 
greatest and integration with cloud-
native services such as Security Hub 
and GuardDuty is most advanced.
Getting Started
If you want to use Prowler locally on 
Linux, you need to install it with the 
Python package manager pipx (e.g., on Ubuntu) or with Homebrew:
pipx install prowler
brew install prowler
Alternatively, you can use the Docker 
container:
docker run -it --rm ghcr.io/prowler-cloud/prowler prowler -v
The tool uses existing AWS CLI pro-
files for authentication. To use all 
of the checks, the profile requires at 
least the SecurityAudit and ViewOnly-
Access managed policies. Additionally, 
an inline policy is recommended 
to unlock specific read permissions 
for non-standard resources. This 
extension is found in permissions/
prowler-additions-policy.json in the 
official repository.
Initial Security Scans
The following commands carry out 
a basic audit of all regions of an ac-
count and then create a matching 
profile from the AWS CLI:
prowler aws --profile <profile_name>
aws configure --profile <profile_name>
The tool will prompt you for four 
things:
 AWS access key ID: the key ID 
belonging to the IAM user or a 
role with sufficient authorizations
 AWS secret access key: the match-
ing secret access token
 Default region name: the name of 
the region (e.g., eu-central-1)
 Default output format: optional 
information, such as json
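Putting the prompts together, a complete sequence with a dedicated profile might look like the following (the profile name prowler-audit is just a placeholder):

aws configure --profile prowler-audit    # enter key ID, secret, region, output format
prowler aws --profile prowler-audit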
The open source Prowler is ideal for systematically checking your AWS infrastructure for vulnerabilities, 
meeting compliance requirements, and automatically plugging security gaps. We show you how to use this 
tool in a production environment – from initial scan to integration into CI/ CD pipelines, dashboards, and 
organization-wide audits. By Thomas Joos
AWS Security Audits with Prowler
 Prowling the Depths
The profile is saved in the ~/.aws/
credentials or ~/.aws/config file. Next, 
call up Prowler as shown before. In ad-
dition to the SecurityAudit and ViewOn-
lyAccess managed policies, the IAM 
account requires the inline policy from 
the Prowler repository in permissions/
prowler-additions-policy.json to per-
form all the checks. The above call iter-
ates through all the configured checks 
and stores the results in the output/ di-
rectory. In addition to CSV and HTML, 
Prowler generates standards-compliant 
JSON files in OCSF or ASFF format:
prowler aws -M html json-ocsf json-asff
ASFF is used for direct transfer to 
AWS Security Hub (Figure 1) – more 
on that later. The HTML output pro-
vides a clear overview with filter 
options for compliance standards, 
prowler aws --list-checks
You can use the next command to 
carry out three specific checks that 
focus on key aspects of IAM and ACM 
security:
prowler aws --checks accessanalyzer_enabled \
  acm_certificates_expiration_check iam_root_mfa_enabled
Prowler checks whether IAM Access 
Analyzer is enabled (accessana-
lyzer_enabled), whether any ACM 
certificates are close to their expira-
tion dates (acm_certificates_expira-
tion_check), and whether multifactor 
authentication has been enabled for 
the root account (iam_root_mfa_enabled). 
These checks address typical vulner-
abilities in AWS accounts, can be 
severity, and affected resources. The 
checks can be restricted to individual 
services, regions, or test groups.
Selectively Controlling 
Checks
To carry out targeted security checks 
for the three specified AWS services 
(Amazon S3, EC2, and IAM), use the 
command:
prowler aws --services s3 ec2 iam
Prowler limits the scan to these ser-
vices and checks their configurations 
for security-related vulnerabilities, 
policy violations, and potential risks, 
which gives you a focused security re-
port without analyzing other services. 
You can display a list of all available 
checks (Figure 2), for example, with:
Figure 1: Prowler also works with AWS Security Hub.
validated separately, and provide a 
focused report on particularly critical 
configurations. The process of choos-
ing can be facilitated by a JSON check 
file. One file with three security-
related checks that address common 
vulnerabilities in AWS environments 
could look like:
{
  "checks": [
    "s3_bucket_public_access",
    "ec2_instance_port_ssh_exposed_to_internet",
    "cloudtrail_multi_region_enabled"
  ]
}
This file ensures that Prowler per-
forms three critical checks: s3_bucket_
public_access
and answers.
For example, within the Security and Observability section, you’ll find lessons on:
• Cloud-native security
• Prometheus monitoring
• Log management and analysis
• Tracing concepts
“The new DevOps Tools Engineer exam covers the most important methodologies and tools 
along the entire lifecycle of modern software applications,” says Fabian Thorns, Director of Product 
Development at LPI. 
The exam consists of 60 questions that must be answered in 90 minutes and is available in 
English, with a Japanese language version planned for 2026.
Visit LPI for more details: https://www.lpi.org/.
Power Demands and Complexity Limit AI Deployments, per DDN Report
AI deployments introduce power demands and other challenges that most infrastructure budgets 
and facilities were never designed for, says the 2026 State of AI Infrastructure Report from DDN 
(https://www.ddn.com/2026-state-of-ai-infrastructure-report/).
“Energy consumption, cooling capacity, and inefficient data movement have become real oper-
ational constraints — often limiting progress long before compute capacity or GPU availability,” 
the company says.
Specifically, the report says that:
• 65% of infrastructure sits idle while still consuming power.
• 93% of respondents are actively working to reduce AI’s energy footprint.
• 47% cite energy and cooling as their top inefficiency.
• Only 41% report efficiency gains from recent AI investments.
Complexity in AI infrastructure was cited as another top challenge, as: 
• 98% of respondents report a skills gap related to AI infrastructure.
• 65% say their AI environments are already too complex.
• 54% say they have postponed or cancelled AI initiatives.
Read more at DDN: https://www.ddn.com/.
Microsoft Announces Open Source Litebox OS
Microsoft has announced Litebox, an open source “security-focused library OS supporting kernel- 
and user-mode execution.”
According to the project page, “LiteBox is a sandboxing library OS that drastically cuts down the 
interface to the host, thereby reducing attack surface.” 
LiteBox, which is written in Rust and developed under the MIT license, is designed for use in 
both kernel and non-kernel scenarios, with example use cases including:
• Running unmodified Linux programs on Windows
• Sandboxing Linux applications on Linux
The LiteBox team is currently working toward a stable release and notes that some APIs and in-
terfaces may change as development continues. Learn more from the GitHub page (https://github.com/
microsoft/litebox).
OpenMP Adds Support for Python
The OpenMP Architecture Review Board (ARB) has created a Python Language Subcommittee to 
add Python support to version 7.0 of the OpenMP API specification for parallel programming. This 
move will make Python the fourth officially supported language in the specification, alongside C, 
C++, and Fortran.
“Adding Python support to the OpenMP standard will provide Python developers with a new way 
to express parallelism portably and accelerate Python applications running on CPUs, GPUs, and other 
accelerators,” the announcement states (https://www.openmp.org/press-release/python-new-member-anaconda/).
Additionally, the company notes that Anaconda has joined the OpenMP ARB and will play a key 
role in the Python integration.
The OpenMP 7.0 release is planned for 2029, while version 6.1 (https://www.openmp.org/wp-content/
uploads/openmp-TR14.pdf) is expected in November 2026. For more information, visit OpenMP: 
https://www.openmp.org/.
SUSE Offers Cloud Sovereignty Framework Self Assessment
SUSE has created a Cloud Sovereignty Framework Self Assessment tool aimed at helping organiza-
tions identify gaps in their digital strategy.
This web-based, self-service assessment tool lets organizations quickly see how their infrastructure 
measures up against the 2025 EU Cloud Sovereignty Framework, providing an analysis that includes:
• Overall sovereignty score (0-100%)
• Individual scores per area
• Critical violation warnings
• Prioritized gap analysis
• SUSE solution recommendations
“Most organizations struggle to bridge the gap between policy and production,” says Andreas 
Prins, head of Global Sovereign Solutions at SUSE, but the Cloud Sovereignty Framework Self 
Assessment “gives words to an abstract principle and gives them verified open source pathways to 
make it resilient.”
Features include: 
• The SEAL benchmark: Maps the organization to one of five Sovereignty Effective Assurance 
Levels (SEAL 0–4). This creates a common language for organizations to discuss risk (e.g., “We 
are currently SEAL-1, but our public sector contracts require SEAL-3”).
• Weighted risk analysis: The tool weighs eight sovereignty objectives (SOVs), prioritizing supply 
chain and operational autonomy.
• Trust-based engagement: Results are stored only in the user’s browser. 
Check out this video walkthrough (https://www.youtube.com/watch?v=c9y0YUHcObE) to see how the 
assessment works and learn more at SUSE: https://www.suse.com/.
Open Invention Network Releases OIN 2.0
The Open Invention Network (OIN) has released OIN 2.0 (https://www.openinventionnetwork.com/license-
agreement-2/) — a “significant evolution” of its open source software patent protection program.
With this update, OIN has introduced a shared funding model with a modified, fee-based ap-
proach. Under the new model, participation remains free to individuals and small businesses, 
while medium- and large-sized organizations will help support OIN through a tiered, annual fee 
based on revenue.
Additionally, OIN has released Linux System Table 13 (https://www.openinventionnetwork.com/linux-
system/), which details the patent protection coverage offered under the OIN 2.0 license agreement. 
This update “covers over 650 new open source software packages, including smart technologies, 
security, networking, data centers, and automotive. It increases coverage for cloud computing, 
including for Kubernetes and Eclipse, and expands coverage for modern languages by adding 
many new libraries for Go, Python, and Rust.” 
“OIN 2.0 is a continuation of OIN’s long-standing commitment to protect OSS from patent 
threats, modified to reflect today’s realities,” said Keith Bergelt, CEO of Open Invention Network. 
Learn more at Open Invention Network: https://www.openinventionnetwork.com/.
New Global Open Source Vulnerability Database Launched
The Global CVE (GCVE) initiative has launched a new open and freely accessible vulnerability 
advisory database. According to the announcement, “the platform aggregates and correlates 
vulnerability information from more than 25 public sources, including GCVE GNA (Numbering 
Authority) sources and other established vulnerability databases.”
The GCVE database (https://db.gcve.eu/), which is maintained by the Computer Incident Response 
Center Luxembourg (CIRCL), provides a public web interface, a public API (https://db.gcve.eu/api/), 
and open data dumps for offline analysis. It also provides compatibility with existing CVEs through 
a backward-compatible ID scheme.
The platform is powered by vulnerability-lookup (https://www.vulnerability-lookup.org/), an open source 
project also maintained by CIRCL, that implements the Best Current Practices (https://gcve.eu/bcp/) 
defined by the GCVE initiative. 
By bringing together data from public sources, the GCVE vulnerability database “helps reduce frag-
mentation and improves visibility across the global vulnerability landscape,” the announcement says.
Learn more at GCVE: https://gcve.eu/.
Non-human identities (NHIs) are not 
a new phenomenon, but they are rap-
idly becoming increasingly prevalent 
and complex. NHIs include identities 
for workloads, services, Internet of 
Thingsdetects S3 buckets 
that are publicly accessible without 
restrictions. This misconfiguration is 
one of the most common causes of 
data leaks in AWS environments. The 
ec2_instance_port_ssh_exposed_to_in-
ternet check looks at whether EC2 
instances allow SSH access across 
the entire IPv4 space (0.0.0.0/ 0). An 
open port 22 is an inviting gateway 
for brute force or exploit attempts. 
Finally, cloudtrail_multi_region_en-
abled ensures that AWS CloudTrail 
is active in all regions. Without this 
setting, security-related activities in 
regions that are not used by default 
but might still be vulnerable will fly 
under the radar. Targeted check pro-
files like this can be called up directly 
with the command,
prowler aws --checks-file ./my-checks.json
which gives you targeted results on 
particularly security-critical areas with-
out having to run through the entire 
check catalog. If you need regularly 
recurring checks, you can combine 
profiles with a scheduler, such as AWS 
Systems Manager or local cron jobs.
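A minimal cron entry for such a recurring run could look like the line below; the schedule, file paths, and profile name are purely illustrative, and cron's PATH must be able to find the prowler binary:
# run the custom check profile every night at 02:30 and keep a simple log
30 2 * * * prowler aws --profile audit-profile --checks-file /opt/prowler/my-checks.json -M csv json-ocsf >> /var/log/prowler-audit.log 2>&1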
Restrictions and Profiles
If needed, you can limit the analysis 
to individual regions,
prowler aws --profile audit-profile U
 -f eu-central-1 us-east-1
which reduces run time and costs, 
especially if you want to initiate fur-
ther processing with Security Hub. 
For multi-account setups with mul-
tiple CLI profiles, scans can also be 
scripted:
for profile in audit-prod U
 audit-dev U
 audit-test
do
 prowler aws U
 --profile "$profile" U
 --output-folder ./audit-$profile
done
This loop lets you carry out auto-
matic security checks for multiple 
AWS accounts by calling Prowler se-
quentially with different CLI profiles. 
The three profile names audit-prod, 
audit-dev, and audit-test stand for 
the different production, develop-
ment, and testing environments. 
A separate scan is started for each 
profile, and the results are stored 
in a dedicated folder named for the 
respective profile (e.g., ./audit-au-
dit-prod), which facilitates struc-
tured evaluation and archiving of the 
results, especially in multi-account 
environments with role-based access 
control and separate responsibilities. 
The prerequisite is that a correspond-
ing entry exists in the AWS CLI con-
figuration for each profile.
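To get a quick overview after such a loop, a few lines of shell are enough to summarize the per-profile reports. This is only a sketch: it assumes CSV reports with a semicolon-separated status column, which can differ between Prowler versions.
for dir in ./audit-*/; do
  # count findings whose status column reads FAIL (delimiter is an assumption)
  fails=$(cat "$dir"*.csv 2>/dev/null | grep -c ";FAIL;")
  echo "$dir: $fails failed findings"
done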
In environments with multiple AWS 
accounts, Prowler offers the op-
tion of centrally checking entire 
Figure 2: Use the appropriate command to display the available Prowler checks at the command line.
organizational units. If you have a management account, the scan can be extended to all subordinate accounts. This operation requires a central role with cross-account authorizations and is available as a CloudFormation template in the Prowler repository. The check can then be performed either sequentially or in parallel, with the results stored in separate subdirectories.
A typical scenario is an automated run with separate report storage per account and optional transmission to the Security Hub. For larger organizations with hundreds of accounts, this method provides a consolidated security overview without the need for manual evaluation of individual profiles. The command
prowler aws --org-role U
 arn:aws:iam::111111111111U
 :role/ProwlerAuditRole U
 --org-master-profile master-profile U
 --output-folder ./org-audit
starts an organization-wide security audit across all your AWS accounts. The management account is addressed by the profile called master-profile. The --org-role option lets you specify an IAM role that can be temporarily assumed in the subordinate accounts. This role must be present in all audited accounts and allow cross-account access. Prowler stores the results of each account check in the ./org-audit directory, structured by account ID. In this way, you get complete reports for each member account, and centralized evaluations or targeted security measures can be derived.
Dashboard and Web Interface
Besides the CLI, Prowler introduced a locally hostable dashboard in version 5, which you can install with Docker Compose:
curl U
 -LO https://raw.githubusercontent.com/U
 prowler-cloud/prowler/refs/heads/U
 master/docker-compose.yml
curl U
 -L https://raw.githubusercontent.com/U
 prowler-cloud/prowler/refs/heads/U
 master/.env U
 -o prowler.env
docker compose up -d
The GUI is then available from http://localhost:3000. After logging in, you can start scans, compare results, and export compliance reports. The interface is based on Next.js, and the back end is based on Django and PostgreSQL. For production environments, I recommend a separate deployment including role-based access control (RBAC) hardening and HTTPS encryption.
Prowler can be run locally or deployed as a fully managed solution. The Prowler Managed Service automates daily audits across multiple cloud providers, storing the results centrally and making them accessible in a consolidated web interface (Figure 3) that includes AWS, Azure, GCP, and Kubernetes – including compliance evaluations, risk ratings, and context-related recommendations. The service also supports RBAC, API access, and centralized visualization in dashboards. In hybrid scenarios, Managed Service can be synchronized with local Prowler instances.
Figure 3: A security scan can also be initiated from the Prowler web interface.
Customization and 
Compliance
Prowler supports extensive custom-
ization at the configuration level, 
such as extending the maximum log 
retention time:
log_group_retention_days=500
This setting stipulates that Cloud-
Watch log groups be retained for at 
least 500 days. The parameters are 
stored in a user-defined configuration 
file, usually in INI format, which you 
pass in at startup:
prowler aws --config-file .
of the hub with each scan, enabling flexible, audit-proof rule management tailored to your requirements.
Prowler in DevOps Workflows
In the continuous integration and 
continuous deployment (CI/ CD) 
context, Prowler can be integrated 
as a build step. In conjunction with 
the new Prowler Fixer module, auto-
matic remediation is even possible. 
A typical CI/ CD workflow performs 
two consecutive steps. In the first 
step, Prowler scans the AWS environ-
ment, executing only the security 
checks from the checks.json file. This 
file contains a list of check IDs that 
you have defined specifically. The 
--output-folder ./results parameter 
ensures that all scan results, including 
CSV, JSON, and HTML reports, are 
stored in the results directory.
In the second step, the
cat ./results/html-report.html
command outputs the HTML report 
directly to the console. It is particu-
larly useful for automated pipelines 
for which you want to save the 
results as an artifact or pass them 
on to downstream steps, which in 
conjunction with CI/ CD systems 
such as GitLab or Jenkins, helps 
you map out a continuous security 
check process.
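Written out as a plain shell job, the two steps could look like the following sketch; file names and options follow the example above and usually need adapting to the specific pipeline and Prowler version:
#!/bin/bash
# Step 1: run only the checks listed in checks.json and collect all report formats
prowler aws --checks-file ./checks.json --output-folder ./results -M csv json-ocsf html
# Prowler returns a non-zero exit code when findings fail, so decide in the
# pipeline whether that should break the build before evaluating the reports.
# Step 2: print the HTML report so the CI system can store it as an artifact or log
cat ./results/*.html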
A webhook can be used to forward 
the results to security information 
platforms, such as Splunk or Elastic, 
provided the JSON is in the OCSF 
format, which saves integration work 
and improves traceability.
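Such a forwarding step can be as small as a single curl call; the endpoint, token variable, and report file name in this sketch are placeholders for your SIEM setup:
# push the OCSF report to an HTTP collector (all values are examples)
curl -X POST https://siem.example.com/ingest \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $SIEM_TOKEN" \
  --data-binary @./results/prowler-output.ocsf.json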
In addition to the familiar command 
mode, Prowler introduced support for 
scanning GitHub repositories for se-
curity risks in version 5. Among other 
things, this capability helps you de-
tect publicly accessible secrets, miss-
ing branch protection rules, or unpro-
tected repository settings. Authentica-
tion is handled by personal access 
tokens, OAuth, or GitHub app access. 
Microsoft 365 environments can now 
also be checked, for example, for 
inadequate authentication policies or 
overly broad access authorizations in 
Exchange Online.
Prowler offers a Checkov-based dedi-
cated scan engine for infrastructure 
as code (IaC) that helps you analyze 
Terraform, CloudFormation, or Ku-
bernetes manifest files before they 
even reach the cloud. In this way, 
Prowler can integrate with local-only 
Kubernetes Checks and EKS Scans
The Prowler command
prowler kubernetes U
 --kubeconfig-file ~/.kube/config
analyzes EKS clusters. The focus is on CIS 
Benchmark 1.10, including checks for pod 
security, network traffic, and API server 
permissions and for securing worker nodes. 
Vulnerabilities such as runAsRoot, missing 
seccomp profiles, or overly open Cluster-
RoleBinding resources are listed along with 
recommendations for hardening. A specific 
namespace combination can also be scanned 
for targeted analyses:
prowler kubernetes U
 --namespaces kube-system production
The following job deployment is used for 
integration with existing clusters:
kubectl apply -f kubernetes/job.yaml
The dashboard-based visualization of the 
EKS results is similar to the AWS evaluation, 
optional filtering by namespace, category, 
and compliance framework.
development environments while safeguarding the shift-left approach in DevSecOps pipelines.
Automatically Fixing Typical Misconfigurations
Practical audits regularly reveal the same misconfigurations, such as:
• S3 buckets with public read permissions
• IAM users without multifactor authentication (MFA)
• EC2 instances with open ports in 0.0.0.0/0
• Missing CloudTrail configuration
• Lambda functions with sensitive environment variables
• Roles with permissions that are far too broad
• Outdated KMS policies
To address these vulnerabilities in a targeted manner, Prowler can use the Fixer module to fix selected findings directly and automatically.
Each supported check can be supplemented with predefined remediation logic. For example, a missing CloudTrail in the us-east-1 region can be detected and immediately resolved with the command:
prowler aws U
 --checks cloudtrail_enabled_multi_region U
 --region us-east-1 U
 --fixer
In this case, Prowler automatically creates a new CloudTrail configuration with the recommended settings. However, this action only works if the required IAM authorizations are in place. Another common tactic is activating GuardDuty, which can also be automated by the Fixer module:
prowler aws U
 --checks guardduty_enabled U
 --region eu-central-1 U
 --fixer
Prowler checks whether the service is active and enables it if needed. These automations can also be executed with CI/CD support or controlled by IaC processes. Instead of active changes, Fixer can generate pull requests or modify Terraform files. Organizations with a high degree of automation will benefit from a fast feedback cycle between analysis, findings, and validation.
Fixer works directly at the API level; in other words, it accesses AWS resources directly. By default, though, only low-risk changes are supported, such as the aforementioned activation of GuardDuty, enforcing secure password rules, or setting missing CloudTrail parameters. Individual adjustments are possible because each remediation is written in Python and can be modified as needed.
Without the use of external tools, then, you can integrate your own security policies directly into the testing and hardening process. A combination with IaC is also in the pipeline. Instead of making live changes, you can generate a pull request that provides the desired changes with GitOps.
Prowler Meets Security Hub
Prowler can transfer the results of its security checks directly to AWS Security Hub. Security Hub acts as a central consolidation and analysis tool for security-related events in an AWS organization. Once a scan is complete, the findings can be exported to AWS Security Finding Format (ASFF) and automatically transmitted with the command:
prowler aws --security-hub --status FAIL
Security Hub fields this information across regions and assigns it to the respective accounts. As an admin, you can see at a glance where a problem was detected, including the account, region, and resource information. For multi-account environments with an organizational structure, Security Hub provides a unified interface for prioritizing, categorizing, and tracking vulnerabilities. Moreover, alerts can be automated (e.g., with EventBridge rules or playbooks for AWS Systems Manager) to trigger coordinated responses to critical findings. The detected vulnerabilities appear there after a few minutes, including a reference to the affected resources. Actions can then be derived through manual annotation or playbooks in AWS Systems Manager.
Version 5 also saw Prowler introduce the ability to analyze threat patterns with AWS CloudTrail. The --categories threat-detection parameter lets you enable checks that detect typical attack indicators in the logs. Examples include unusual API calls, sudden privilege escalations, or activity in inactive regions. The tool evaluates standardized events from the last 24 hours but can be adjusted to other time periods if necessary. The prerequisite is that CloudTrail is active in all regions being checked. The results can be filtered by severity or resource type and help to identify compromised identities or misused services at an early stage.
Conclusion
Prowler covers a wide range of application scenarios as a CLI tool, a browser-based dashboard, or a managed service for enterprise-wide compliance. With customizable checks, reporting in standard formats, direct integration into CI/CD pipelines, and extensive support for multi-account environments, the tool offers a combination of flexibility, automation, and transparency. As such, Prowler provides a great basis for audit-proof, traceable, and continuously improvable audits, especially for admins who are responsible for the security of complex cloud structures.
Info
[1] Prowler homepage: [https://prowler.com]
[2] Prowler on GitHub: [https://github.com/prowler-cloud/prowler]
The Author
Thomas Joos is a freelance IT consultant and has been working in IT for more than 20 years. In addition, he writes hands-on books and papers on Windows and other Microsoft topics. Online you can meet him on [http://thomasjoos.spaces.live.com].
Keywords: Prowler, AWS, Amazon, vulnerability, compliance, automation, CI/CD, audit, CLI, reporting
Cybersecurity is not a one-time 
investment, but an ongoing budget 
item. Attackers are constantly im-
proving their tools, techniques, and 
methods, which means defenders 
also need to up their detection and 
response game and improve security 
checks. If you perform manual at-
tack analysis and emulation, you will 
realize how expensive, time-consum-
ing, and difficult to repeat this work 
can be.
Other articles have covered tools and 
knowledge databases from US-based 
research institution MITRE. With 
Caldera [1], the organization now 
promotes a tool that helps you auto-
matically replicate attacker behavior, 
allowing you to simulate complex at-
tack chains without the need for a red 
team on site. You execute the same 
playbook of an attack pattern repeat-
edly to adjust your defenses in real 
time and validate their effectiveness.
ATT&CK Framework Basis
Caldera is available as a free open 
source platform and enables attacker 
emulation exercises with the MITRE 
ATT&CK framework [2]. The platform 
is a plugin-based framework in which 
modular attack steps, known as “abil-
ities,” are grouped into sequences or 
“adversaries” that are then executed 
by agents on the target computers. 
The agents are cross-platform capable 
and can be used on Windows, Linux, 
and macOS.
Instead of targeting exploits or vul-
nerabilities like other tools, Caldera 
targets the behavior of an attacker by 
simulating techniques that attackers 
use after a compromise, such as privi-
lege escalation, lateral movement, or 
the exfiltration of company data. Its 
modularity and automation will help 
you hone your skills and adapt them 
to the existing IT infrastructure.
Setting Up Caldera
To get a feel for how you can use 
Caldera productively, I’ll first look at 
a straightforward scenario. Of course, 
you need a running Caldera instance, 
which you can easily set up in the 
usual way with Docker: To begin, 
clone the current Git repository,
git clone https://github.com/mitre/ U
 caldera.git --recursive
then change to the Caldera directory 
and run the command
docker build . -t caldera:server
to create the Docker image for later 
use. This step takes a while, because 
all the dependencies and supplied 
plugins for Caldera are either loaded 
or generated on the fly.
If the build was successful, which is 
the case if you see Successfully tagged 
caldera:server as the output, you can 
launch the platform with:
docker run U
 -p 7010:7010 U
 -p 7011:7011/udp U
 -p 7012:7012 U
 -p 8888:8888 caldera:server
As soon as the Caldera label appears 
in ASCII art in the console after start-
up, call http://localhost:8888 in your 
browser to access the login page. The 
different access credentials for the red 
and blue teams can be found on the 
Docker container console without any 
further configuration.
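If you prefer to run the container in the background, you can pull the generated credentials out of the startup log instead; the container name and the grep pattern in this sketch are assumptions about the log format:
docker run -d --name caldera -p 7010:7010 -p 7011:7011/udp -p 7012:7012 -p 8888:8888 caldera:server
# search the startup output for the generated red/blue credentials
docker logs caldera 2>&1 | grep -i -A1 password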
Because of the unusual width of 
the log output, you cannot simply 
Organizations often lack the human and financial resources for red and blue teaming, forcing many admins 
to become both the attacker and the defender. The MITRE Caldera cybersecurity platform supports attack 
emulation and automates security testing. By Matthias Wübbeling
Emulate Attacks with MITRE Caldera
 Volcanic
copy and paste the passwords; you 
will need to copy both parts of the 
password separately. Also note the 
instructions that follow the access 
credentials (Figure 1).
Emulating Attacks
As a blue team member, imagine you 
operate a security information and 
event management (SIEM) or an end-
point detection and response (EDR) 
system in your organization and want 
to test whether an attacker who is suc-
cessful in the first step will be detected 
by your monitoring systems as the as-
sailant continues their activities (e.g., 
lateral movement or data exfiltration).
To simulate this situation effectively, 
the above-mentioned Caldera agents 
now enter the play. For the scenario 
described, you need to create a 
Sandcat agent under agents, which 
simulates the attacker’s remote access 
tool (RAT) already installed on one 
of your servers. After clicking Deploy 
an agent, select sandcat and then the 
operating system of the machine that 
has already been compromised.
In this example, I simply set up a 
Linux virtual machine (VM) in my 
cluster. Clicking on the penguin icon 
opens the documentation relating 
to the various installation methods. The agent uses HTTP to communicate with the framework's open port 8888 and must not be filtered by the firewall. The easiest way to start the agent is with the first command from the documentation. To do this, you can execute the following commands (replace the IP address with that of the Caldera server in your setup):
server="http://127.0.0.1:8888"
curl -s -X POST -H "file:sandcat.go" U
 -H "platform:linux" U
 $server/file/download > splunkd
chmod +x splunkd
./splunkd -server $server -group red -v
Of course, this binary does not contain a real Splunk daemon; the agent merely disguises the way in which you instruct it. Once the test environment is ready, you can select predefined functions for the first test. These functions represent an attacker's individual actions and include commands from genuine attack behavior, such as searching for files, creating directories, or exfiltrating data. You will notice that each ability contains information about a relevant MITRE ATT&CK technique, which means you can immediately see the kind of behavior being emulated.
For this example, press the Create Operation button under operations at the top of the page. Assign a name (e.g., Worm) as the adversary and click Start. The agent you created previously is now the focus of the graphical SVG view, and various worm techniques are now being deployed against your network. Once the operation is complete, Caldera provides a detailed log of each run. You can view logs and collected files or export them in JSON format with the button at top right.
Testing SIEM
After the run, the all-important question now arises: Did your monitoring system detect the emulated attacks and warn you appropriately? If not, you might now have a good clue, from the MITRE findings, as to how you should adjust your detection logic. You can repeat the same operation as often as you like to check whether your new rules will work later as intended. This iterative process in Caldera helps you gradually optimize your SIEM.
Besides the simple scenario looked at here, Caldera offers many more possibilities for carrying out attacks or tests. Red teams, for example, can investigate and develop new attack chains, and blue teams can use the tool for postmortem analyses of simulated incidents.
Conclusion
MITRE Caldera is a proven and well-equipped open platform for attack simulation. In this article, I used a small example to show how to use Caldera to optimize monitoring. Caldera also offers many other possibilities to facilitate the work of red and blue teams.
Although Caldera is already quite mature, don't expect a miracle solution. It does not replace the entire spectrum of red teaming measures, especially those that focus on social engineering or zero-day vulnerabilities. On the upside, you will gain a basic understanding of attack tactics and the MITRE ATT&CK framework.
Info
[1] Caldera: [https://caldera.mitre.org]
[2] MITRE ATT&CK: [https://attack.mitre.org]
The Author
Dr. Matthias Wübbeling is an IT security enthusiast, scientist, author, consultant, and speaker. As a Lecturer at the University of Bonn in Germany and Researcher at Fraunhofer FKIE, he works on projects in network security, IT security awareness, and protection against account takeover and identity theft. He is the CEO of the university spin-off Identeco, which keeps a leaked identity database to protect employee and customer accounts against identity fraud. As a practitioner, he supports the German Informatics Society (GI), administrating computer systems and service back ends. He has published more than 100 articles on IT security and administration.
Keywords: MITRE, Caldera, ATT&CK, security, red, blue, team, attack, simulation, defend, automation
Figure 1: Until you create and apply your own 
configuration in the conf/local.yml file, 
the access credentials will be regenerated 
each time you start.
Bloonix [1] is a user-friendly envi-
ronment for monitoring tasks; it is 
capable of monitoring all of your IT 
assets, assuming they are accessible 
over a network connection. You 
do need to install special plugins 
to query certain network compo-
nents – but more on that later. The 
web-based, modular Bloonix envi-
ronment comes with modules for a 
massive crop of popular hardware 
and software components and is 
fundamentally based on the Simple 
Network Management Protocol 
(SNMP), although it can also use 
other protocols for monitoring.
The first lines of code for Bloonix, de-
veloped by the company of the same 
name based in Germany, date back to 
2006. The project is available under the 
GNU Affero General Public License ver-
sion 3 (AGPLv3), which allows users to 
run, modify, and share software while 
ensuring that any modified versions are 
also made available to the public. 
According to the developers, Bloonix’s 
server software is highly available 
and highly scalable and can be op-
erated on multiple servers for load 
balancing. For monitoring tasks, the 
tool primarily relies on agents that 
are available for popular operat-
ing systems. It has no packages for 
desktop systems – not even for ma-
cOS. Bloonix is available as a man-
aged server and as a self-hosted en-
vironment [2]. You can gain an ini-
tial impression of the environment 
in the online demo [3]. Free support 
is provided by the community [4], 
although the extent of this support 
is limited. Professional support is 
also available (see the “Commercial 
Services” box).
Bloonix Architecture
To meet the complex challenges in-
volved in monitoring heterogeneous 
environments, Bloonix uses a modular 
architecture comprising five compo-
nents: the Bloonix server, a WebGUI, 
field information from agents, plugins, 
and satellites. At the heart of the sys-
tem is the Bloonix server, which brings 
together the various modules. When 
this server boots up, it launches vari-
ous process pools, including listeners, 
database (DB) managers, Keepalived, 
and various checker and scheduler 
modules. The Bloonix server usually 
Continuous IT monitoring often requires multiple tools, depending on the scope and complexity of the 
environment. The Bloonix modular monitoring tool combines numerous services in a single interface. We 
show you how to set up and handle monitoring tasks with this free software. By Holger Reibold
Infrastructure monitoring with Bloonix
 Guardian
Commercial Services
Bloonix is commercially available as a man-
aged server or as a self-hosted environment. 
Customers who opt for the managed server 
option are assigned their own virtual ma-
chine (VM). Daily backups and gateways for 
SMS are also available. Prices start at around 
EUR60 per month, depending on the number 
of virtual CPUs (VCPUs) and the RAM and 
hard disk size. Companies that host Bloonix 
themselves but do not want to do without 
support can choose between different sup-
port options starting at EUR600 per year.
writes its data to PostgreSQL and Redis 
databases.
The monitoring environment primar-
ily relies on agents to collect relevant 
system metrics on the target hosts – 
in particular, CPU utilization, memory 
usage, and database services. In 
principle, an agent can also run on a 
server that Bloonix itself cannot query 
directly. From this vantage point, it 
can then monitor routers, switches, 
and other relevant network services. 
For organizations with distributed 
locations, Bloonix Satellite offers 
monitoring of globally distributed 
web services.
The WebGUI is used to control the 
environment, which includes manag-
ing hosts, groups, clients, and other 
services. The Bloonix environment 
also has a plugin mechanism that 
more or less includes complex scripts 
that query the status of one or more 
services. The extensions are usually 
installed with the agents, the server, 
or the satellites. On Linux, the pl-
ugins are stored in the /usr/lib/bloo-
nix/plugins directory by default.
The listener components field status 
information and metrics from the 
agents, validate the information, and 
store it in the database; additionally, 
the server checks whether or not it 
needs to notify the administrator. The 
DB Manager module is responsible 
for all database-specific tasks and 
writes the metrics to the PostgreSQL 
database. An NGINX web server pre-
pares the data for the web interface. 
The Bloonix server is also respon-
sible for checking registered routers, 
switches, and services, and it queries 
the satellite configuration.
Putting Monitoring into 
Operation
When it comes to monitoring, Bloonix 
distinguishes between monitoring 
hosts and monitoring services. Basi-
cally, the environment prefers to work 
with Linux servers. To simplify the 
configuration, an agent should already 
be installed on the monitored system. 
In this kind of scenario, the host con-
figuration can be completed host-side.
To add a first host to your monitoring 
executed by the server, the agent, and 
the satellite component (Figure 1).
The Plugin World
Bloonix has plugins for external 
tests, as well as for Linux, SNMP, 
web servers, caching, and database-
specific checks. According to the 
documentation, you can choose 
from more than 40 plugins. For ex-
ample, to monitor the CPU load of a 
Linux server, select check-linux-cpu. 
Other useful SNMP checks examine 
memory and hard disk usage, as 
well as the number of services and 
processes. The two most important 
external checks are used to monitor 
TCP/ IP or UDP/ IP. However, when 
monitoring databases, you are lim-
ited to PostgreSQL and MySQL.
In this context, it is interesting to 
see how the server and the agents 
interact. After setting up a service, it 
initially has an INFO status, because 
no monitoring has yet taken place. 
The service overview displays the 
status information in the column of 
the same name (Figure 2). When 
monitoring is initiated, the agent es-
tablishes a connection to the server, 
authenticates, and continuously trans-
mits the plugin-specific data.
Ideally, the blue INFO message will 
change to a green OK. Bloonix recog-
nizes seven different status messages: 
OK (exit code 0), INFO (code 0), 
NOTICE (purple, 1), WARNING (light 
orange, 2), ALERT (pink, 3), CRITI-
CAL (red, 4), and UNKNOWN (dark 
orange, 5). Color highlighting in the 
WebGUI makes it easy to classify the 
messages when skimming through. In 
the Services overview, you can also 
change the sort order by clicking on 
the header, simplifying your analysis 
of the output.
When checking services, you are 
not forced to rely on agents; you 
can also have the checks carried 
out by a Bloonixserver or Bloonix 
satellites. This service is available 
for all checks that can be performed 
locally, which occurs when the 
Remote check option is set to No. 
Nevertheless, remote checks have 
advantages; for example, you can 
setup, open the Hosts menu (stacked 
rectangles), click on the plus sign, 
and specify the typical server data 
in the corresponding dialog. You can 
customize the hostname in the sys-
tem settings (cog wheel) with Bloonix 
Server Hostnames. You have two ways 
to configure the agent: manually or 
with the bloonix-init-host script. For 
a manual configuration, you need to 
edit the /etc/bloonix/agent/main.conf 
and /etc/bloonix/agent/conf.d/host.
conf files and enter the address of the 
Bloonix server in the /etc/bloonix/
agent/main.conf file; then, configure 
the required settings in the server 
section and the corresponding host 
parameter:
server {
 host 127.0.0.1
 host bloonix.server.de
}
Now, save the host ID and password in 
/etc/bloonix/agent/conf.d/host.conf:
host {
 host_id 01
 password 
}
For the changes to take effect, you 
need to restart the Bloonix agent. 
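On a systemd-based distribution this usually comes down to restarting the agent unit; the unit name bloonix-agent is an assumption and may differ on your installation:
systemctl restart bloonix-agent
systemctl status bloonix-agent --no-pager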
You can use the aforementioned 
bloonix-init-host script to automate 
the agent configuration, assuming the 
agent configuration file has not been 
modified:
bloonix-init-host U
 --host-id 33 U
 --password U
 --server bloonix.server.de
Alternatively, you can simply enter 
a line reading host 127.0.0.1 in the 
server section.
After creating your first host, you can 
begin configuring the services to be 
monitored. The procedure is similar 
to adding hosts: In the Services menu 
(three stacked documents), click on 
the plus sign and specify the proper-
ties, which includes selecting the 
plugins (i.e., the scripts responsible 
for the monitoring). The plugins are 
assess the quality of a service by 
accessing it from the perspective of 
different locations. This method is 
particularly useful when monitoring 
HTTP, IMAP, POP3, and SMTP. If 
you implement a satellite configura-
tion, a special dashboard is avail-
able in the WebGUI that allows you 
to filter response times by different 
locations.
Configuration and 
Administration
The WebGUI not only lets you cre-
ate hosts and services, you can also 
Figure 2: The Services overview resulting from monitoring shows a variety of useful details.
Figure 1: After commissioning, Bloonix provides information in its web-based dashboard.
bundles the collected information 
and lists the various notices and 
warnings. Clicking on these visu-
alizations takes you to a detailed 
view, where you can initiate any 
required action.
Conclusion
Bloonix is designed for monitoring IT 
infrastructure components and does so 
with flying colors. The only downside 
is the lack of autodiscovery, which is 
offset by the extensive plugin ecosys-
tem. Professional support is available 
in the two commercial versions. You 
can also get help for the more complex 
configuration steps by reading the ex-
cellent documentation. 
Info
[1] Bloonix homepage: 
[https:// www. bloonix. org/ en/]
[2] Commercial Bloonix services: 
[https:// www. bloonix. com]
[3] Online demo: 
[https:// demo. bloonix. org/ login]
[4] Bloonix forum: 
[https:// community. bloonix. org]
The Author
Holger Reibold is a computer scientist, having 
worked as an IT journalist since 1995. Currently, 
he works as a key account manager for a Ger-
man ISP. His main interests are open source 
tools and security topics.
procedure: You need to save the 
script to the directory specified as 
the message_service_script_path op-
tion in the Bloonix server configura-
tion. As an example, I used test.py 
and saved it in /usr/local/lib/bloo-
nix/message-service (Listing 1).
Next, create a new message service 
of the Script type in the WebGUI and 
assign a value of %message% to Mes-
sage. For send_to, enter %send_to% 
and for foo, enter bar.
When Bloonix triggers an alarm, the 
parameters defined in the WebGUI are 
transferred to the script in JSON for-
mat by STDIN. The exit code tells the 
Bloonix server whether the message 
was sent successfully: 0 is a success-
ful transmission and 1 is unsuccess-
ful. You will find the entry for this 
option in the /tmp/test.log file.
The targets for notifications are con-
tacts or contact groups. You can as-
sign different numbers of messaging 
services to a contact and specify the 
notification periods, provided their 
content is not critical. The idea of 
contact groups is that you can link 
contacts to hosts and services, which 
gives you precise control over which 
contacts are notified in case of a fail-
ure of a specific host or service. In 
practice, it is useful to assign at least 
one group to each host.
Setting up host and service configu-
rations proves to be very time con-
suming in practice because Bloonix 
lacks an autodiscovery function, 
although it has an alternative in the 
form of a Service Templates func-
tion, thanks to which you can bun-
dle services and service parameters 
and apply them to any number of 
hosts. When you create new hosts, 
the templates are automatically ap-
plied there. To access the template 
function, go to Configuration | Tem-
plates. When you get there, you will 
find a selection of templates that 
deliver standard checks for Apache 
or MySQL servers along with ge-
neric Linux checks. You can also 
create your own templates in the 
WebGUI and assign checks to them. 
Variables let you define different 
thresholds for outputting warnings 
or critical messages. The dashboard 
manage host groups, users, and 
satellites. Bloonix uses the admin, 
operator, and user roles for user ad-
ministration, along with the associ-
ated permissions. If you want to delve 
deeper into the specifics, it is worth 
taking a look at the configuration of 
the various components.
The Bloonix server configuration is 
stored in the /etc/bloonix/server/
main.conf file. You can use the web-
gui_domain parameter to specify the 
domain for the web interface. The 
environment usually manages the 
plugins in /usr/lib/bloonix/plugins, 
which is also where you store your 
development projects. The database 
and storage are set up in two configu-
ration files: /etc/bloonix/database/
main.conf and /etc/bloonix/datas-
tore/main.conf.
Bloonix writes the WebGUI configu-
ration to the /etc/bloonix/webgui/
main.conf file, and the Bloonix agent 
configuration is located in /etc/bloo-
nix/agent/main.conf. Modifying either 
of these is only advisable if special 
circumstances dictate this action. 
Finally, you can edit the satellite con-
figuration in /etc/bloonix/satellite/
main.conf.
Setting Up Notifications
Monitoring the IT infrastructure 
is not really useful if you are not 
notified in the event of critical inci-
dents. To prevent this from happen-
ing, the software offers a notifica-
tion function that can communicate 
in three ways: Sendmail, HTTP, and 
script-based notification output. For 
Sendmail, you need a mail transfer 
agent (MTA; e.g., Postfix or Exim). 
Specifying a valid sender is impor-
tant. With the help of the MTA, you 
can decide whether email notifica-
tions are sent by a relay server or by 
SMTP. With HTTP-based transmis-
sion, you can forward URL-encoded 
or JSON-based data over a corre-
sponding HTTP interface.
If the HTTP and Sendmail variants 
do not meet your requirements, you 
can opt for script-based notification, 
which integrates into the WebGUI. 
A simple example illustrates the 
Listing 1: Script-Based Notifications
#> cat /usr/local/lib/bloonix/message-service/test.py
#!/usr/bin/python3
import json
import sys

lines = ""
while True:
    try:
        line = input()
    except EOFError:
        break
    lines += line

param = json.loads(lines)
f = open("/tmp/test.log", "w")
f.write(json.dumps(param))
f.close()
sys.exit(0)
At first glance, collecting metrics 
from IT infrastructure seems straight-
forward: Deploy an agent, configure 
some checks, and watch the num-
bers roll in. However, anyone who 
has spent time building production 
monitoring systems knows that effec-
tive data collection is far from trivial. 
The challenge isn’t simply gathering 
data – it’s collecting the right data, at 
the right intervals, with the right con-
text, all while minimizing the effect 
on the systems being monitored.
Technical, policy, and economic 
constraints that restrict many en-
vironments are also of concern. As 
infrastructure becomes increasingly 
complex and security requirements 
more stringent, the ability to adapt 
monitoring approaches to constrained 
environments becomes not just valu-
able, but essential. Zabbix’s flexible ar-
chitecture and support for diverse col-
lection methods make it well-suited for 
these challenging scenarios, enabling 
comprehensive monitoring even when 
circumstances are far from ideal.
Beyond Simple Numbers
When you instrument systems for 
monitoring, you’re not just collecting 
isolated data points: You’re also cap-
turing the relationships between them 
by attempting to capture the behavior 
of complex, dynamic systems that 
operate continuously across multiple 
dimensions. A CPU utilization metric 
at 14:23:47 tells you something, but 
that single number lacks the context 
that makes it actionable. Was this 
value typical for that time of day? Is 
it trending upward? Did it spike mo-
mentarily or sustain for minutes?
The true value of monitoring data 
emerges not from individual measure-
ments, but from the patterns and 
relationships that become visible 
when you collect data consistently 
over time. In this case, measurement 
transcends simple observation and 
becomes a tool for understanding sys-
tem behavior.
Patterns in Time Series Data
Modern IT infrastructure exhibits 
rhythmic behavior. Web applica-
tions see traffic patterns that mirror 
human activity – morning rushes, 
lunch lulls, evening peaks, and 
overnight quiet periods. Database 
systems show query patterns tied to 
business processes. Backup systems 
create predictable load cycles. These 
rhythms exist at multiple time scales: 
hourly patterns within days, weekly 
patterns across months, and seasonal 
patterns throughout years.
Effective monitoring systems must 
capture these patterns because they 
form the baseline against which you 
detect anomalies. A database con-
suming 80% CPU might be alarming 
at 3am, but perfectly normal during 
end-of-month reporting. Without 
historical context and pattern recog-
nition, you cannot distinguish be-
tween normal variation and genuine 
problems.
Trend analysis adds another dimen-
sion to pattern recognition (Figure 1). 
Whereas patterns reveal cyclical 
behavior, trends show directional 
change over time. Is disk usage grow-
ing linearly, or has growth acceler-
ated? Are response times gradually 
degrading? These trends often signal 
problems long before they become 
critical, enabling proactive interven-
tion rather than reactive firefighting.
Sampling Continuous 
Systems
One of the most significant challenges 
in monitoring is measuring discrete 
snapshots of systems that operate 
continuously. Monitoring systems col-
lect data at intervals – perhaps every 
Zabbix has emerged as a compelling choice for monitoring restricted 
environments over time. By Attila Bartek
Monitoring Constrained Environments
 For Good 
 Measure
Why Zabbix?
Given the complexities of monitor-
ing, choosing the right monitoring 
platform becomes a critical decision. 
The monitoring tool must balance ca-
pability against complexity, flexibility 
against maintainability, and power 
against ease of deployment. Zabbix 
has emerged as a compelling choice 
for organizations, from small startups 
to large enterprises.
Zabbix is Free and Open Source Soft-
ware (FOSS) licensed under the GNU 
General Public License v2. The open 
source nature of Zabbix is not merely 
a cost consideration – although the 
absence of per-host licensing fees 
certainly matters at scale. The license 
also provides several strategic ad-
vantages that proprietary monitoring 
solutions cannot match, including 
transparency, community, and inde-
pendence from corporate control.
Zabbix appears in the package reposi-
tories of virtually every major Linux 
distribution: Debian, Ubuntu, Red 
Hat Enterprise Linux, CentOS, Rocky 
Linux, AlmaLinux, SUSE, and count-
less others. This universal availability 
significantly reduces deployment 
friction. You don’t need to config-
ure third-party repositories, manage 
custom package signing keys, or 
explain to security teams why you’re 
installing software from non-standard 
sources.
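As a rough sketch, installation from the distribution repositories then looks like any other package operation; the exact package names vary between distributions, repositories, and Zabbix versions:
sudo apt install zabbix-server-pgsql zabbix-frontend-php zabbix-agent   # Debian/Ubuntu (names may vary)
sudo dnf install zabbix-server-pgsql zabbix-web-pgsql zabbix-agent      # RHEL-compatible (names may vary)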
For organizations with standard-
ized deployment procedures, mature 
change management processes, and 
strict security requirements, being 
able to install Zabbix through native 
package managers means infrastruc-
ture monitoring follows the same 
deployment, patching, and lifecycle 
management processes as all other 
metrics rarely tell complete stories. 
A memory utilization metric means 
something different on a database 
server than on a web server. High 
disk I/ O might indicate a problem 
on one system and normal opera-
tion on another. Network throughput 
numbers lack meaning without un-
derstanding the application’s require-
ments and typical behavior.
This context dependency means that 
effective monitoring requires not just 
collecting data, but collecting the 
right combination of data points and 
understanding their relationships. 
You need to know not just that CPU 
is high, but also whether it correlates 
with increased request rates, whether 
memory pressure exists simultane-
ously, and whether response times 
have degraded. Single metrics viewed 
in isolation can mislead as easily as 
they inform.
Building Toward 
Understanding
These challenges – pattern recogni-
tion, sampling limitations, observer 
effects, and context dependencies – 
shape how you should approach 
monitoring. Understanding these 
fundamental issues helps you make 
informed decisions about what to 
measure, how frequently to measure 
it, and how to interpret the data 
collected.
The goal of monitoring isn’t to elimi-
nate all uncertainty or capture every 
possible event. Rather, it’s to build 
a practical observability framework 
that provides sufficient visibility into 
system behavior to support opera-
tional decision-making, while remain-
ing sustainable in terms of cost and 
complexity.
30 seconds, every minute, or every 
five minutes. Between these col-
lection points, an infinite amount 
of activity occurs that is never 
observed.
Sampling strategies introduce several well-known problems. First, 
you face the risk of aliasing – miss-
ing important events that occur 
between collection intervals. A CPU 
spike that lasts 10 seconds will be 
invisible if you collect data every 60 
seconds and happen to sample dur-
ing the quiet periods before and af-
ter. Critical errors might be logged, 
processed, and resolved entirely 
within the gaps of the monitoring.
The sampling frequency creates a 
fundamental trade-off. More fre-
quent collection provides better 
visibility and reduces the chance of 
missing transient events. However, 
higher collection frequency means 
more agent overhead, more network 
traffic, more database writes, and 
more storage consumption. In large 
environments with thousands of 
monitored items across hundreds of 
hosts, these costs multiply rapidly.
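A rough back-of-the-envelope calculation makes this trade-off tangible; all numbers below are purely illustrative:
# 500 hosts x 80 items polled every 60 seconds, ~90 bytes stored per value
echo "$(( 500 * 80 / 60 )) new values per second"
echo "$(( 500 * 80 * 86400 / 60 * 90 / 1024 / 1024 )) MB of history per day"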
Moreover, the act of measure-
ment itself affects the system be-
ing measured: the observer effect. 
Monitoring agents consume CPUcycles, memory, and I/ O bandwidth. 
Checking networks generates traf-
fic. Database queries for monitoring 
compete with application queries. 
At extreme scales, the monitoring 
system can become a significant 
portion of the infrastructure load it’s 
meant to observe.
The Context Problem
Even if you collect data successfully 
at appropriate intervals, individual 
Figure 1: A long-term graph clearly illustrates the wavering behavior of the filling rate.
infrastructure components. This con-
sistency reduces operational overhead 
and simplifies compliance.
Simple Defaults, Complex 
Capabilities
One of Zabbix’s most valuable char-
acteristics is its graduated complexity 
curve. A default installation provides 
immediate utility – basic host moni-
toring, common service checks, and 
a functional web interface – without 
requiring extensive configuration 
or deep expertise. You can have a 
working monitoring system for a 
handful of servers within an hour of 
installation.
This simplicity at entry doesn’t come 
at the cost of capability. As require-
ments grow and expertise deepens, 
Zabbix scales both technically and 
functionally. The same platform that 
monitors 10 servers with default tem-
plates can evolve into a sophisticated 
monitoring infrastructure handling 
thousands of hosts, custom metrics, 
complex trigger logic, distributed 
collection through proxies, and high-
availability configurations.
This architectural approach solves a 
common problem in the selection of 
monitoring tools: the tension between 
immediate usability and long-term 
capability. Tools that are simple to 
deploy often lack depth for complex 
environments. Tools with enterprise 
capabilities often require significant 
investment in time before delivering 
any value. Zabbix occupies a middle 
ground – quick wins early, with a 
clear path to sophistication.
The Learning Curve 
Advantage
Zabbix’s learning curve is notably 
progressive. Initial deployment and 
basic monitoring require minimal 
expertise to get operational quickly, 
just by following documentation and 
using the provided templates. As you 
work with the system, additional 
capabilities become discoverable 
organically. Template customization 
leads to understanding items and 
triggers. Trigger customization leads 
to expression syntax. Expression 
work leads to calculated items and 
dependencies.
From Default to High 
Availability
Perhaps the most compelling aspect 
of the Zabbix architecture is continu-
ity from simple to complex deploy-
ments. A basic single-server installa-
tion can evolve incrementally toward 
enterprise-grade high availability 
without fundamental architectural 
changes or data migration:
• Do you need to distribute the col-
lection across network segments 
or geographical locations? Add 
Zabbix proxies.
• Is your growing data volume 
straining database performance? 
Implement database partitioning 
and optimize retention policies.
• Do you need to eliminate single 
points of failure? Configure active-
passive database clustering and 
load-balanced Zabbix servers.
• Do you have to meet strict uptime 
SLAs? Implement a full high-avail-
ability architecture with redundant 
components.
Each of these evolutions represents 
architectural enhancement rather 
than replacement. The templates, trig-
gers, and configurations developed 
on a simple deployment remain valid 
and functional in complex high-
availability setups. This continuity 
protects operational investment and 
reduces the risk of the monitoring in-
frastructure itself becoming a barrier 
to growth.
Practical Considerations
From the perspective of security en-
gineering, several practical aspects of 
Zabbix deserve mention. The system 
supports encrypted agent communica-
tions, privilege separation, and granu-
lar access controls. It integrates with 
enterprise authentication systems 
through LDAP and SAML. Audit log-
ging tracks configuration changes and 
user actions. These features aren’t 
afterthoughts – they’re fundamental 
design elements that enable Zabbix 
deployment in security-conscious 
environments.
The data model is well documented 
and accessible, enabling integration 
with other tools through the API or 
direct database access where ap-
propriate. This openness facilitates 
building monitoring into broader 
operational workflows rather than 
creating isolated silos of observabil-
ity data.
Zabbix Internal Structure
To configure and operate Zabbix ef-
fectively, you must understand its 
internal data model: how monitoring 
targets are represented, how collec-
tion is defined, and how these logi-
cal structures map to data collection 
activities. This conceptual framework 
shapes every aspect of Zabbix con-
figuration and operation.
In Zabbix terminology, a “host” repre-
sents any entity you want to monitor, 
which seems straightforward until 
you realize that “host” is a deliber-
ately abstract concept that doesn’t 
necessarily correspond to what is 
traditionally thought of as a host or 
server. Traditional hosts are exactly 
what you’d expect: physical servers, 
virtual machines, network devices – 
discrete systems with IP addresses 
that you monitor as unified entities. 
A web server is a host. A database 
server is a host. A network switch is 
a host.
However, hosts that aren’t traditional 
systems demonstrate where the ab-
straction becomes powerful. A cloud 
service with no single IP address can 
be represented as a host collecting 
metrics over API calls. A clustered 
database system might be monitored 
both as individual node hosts and 
as a logical cluster host tracking ag-
gregate metrics. Business processes, 
external APIs, and distributed 
workflows can all be represented as 
hosts – each serving as a container 
for related monitoring items.
This flexible host concept becomes 
powerful once you embrace the ab-
straction: A host is simply a container 
for related monitoring items – nothing 
more, nothing less.
awk processes structured command 
output, extracting specific columns 
and performing calculations; sed 
transforms and extracts text through 
pattern substitution, which is particu-
larly useful for parsing configuration 
files; and cut provides simple column 
extraction from delimited data. The 
real power emerges when combining 
these utilities through pipes. A single 
command chain can extract, filter, 
parse, and aggregate data without 
requiring any additional software 
installation.
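As a brief, hedged illustration of such a chain (the log path and timestamp format here are assumptions), a handful of piped commands can count recent error entries and read the used percentage of the root filesystem with nothing beyond the default utilities:
# Count ERROR entries logged in the current hour (log path and format are assumptions)
errors=$(grep "$(date '+%Y-%m-%d %H')" /var/log/app/app.log | grep -c "ERROR")
# Extract the used percentage of the root filesystem from df output
rootpct=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
echo "errors=$errors root_used_pct=$rootpct"
Values produced this way can be handed straight to zabbix_sender, as described in the next section.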
The Role of Zabbix Sender
The zabbix_sender utility provides the 
universal integration point, enabling 
any script or utility chain to push col-
lected metrics to the Zabbix server:
• Define a trapper item on a host 
in Zabbix to receive the collected 
data.
• Extract the necessary information 
from the system with available 
utilities.
• Deliver the information to Zabbix 
with zabbix_sender:
bash
zabbix_sender U
 -z [zabbix_server] U
 -s [hostname] U
 -k [item_key] U
 -o [value]
The process follows this consistent 
pattern for every metric you collect.
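If the trapper item does not exist yet, it can also be created programmatically rather than through the web front end. The following is only a sketch of the corresponding Zabbix JSON-RPC API call: The URL, API token, and host ID are placeholders, and newer Zabbix releases expect the token in an Authorization: Bearer header rather than the legacy auth property used here.
# Create a numeric (float) trapper item "disk.tps" (type 2 = Zabbix trapper, value_type 0 = float)
curl -s -X POST -H 'Content-Type: application/json-rpc' -d '{
  "jsonrpc": "2.0", "method": "item.create",
  "params": {"name": "Disk transactions per second", "key_": "disk.tps",
             "hostid": "10105", "type": 2, "value_type": 0},
  "auth": "<api_token>", "id": 1
}' https://zabbix.example.com/api_jsonrpc.php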
Real-World Production 
Examples
The following examples reflect real 
production challenges encountered 
in monitoring SIEM appliances and 
other restricted systems. All examples 
are custom-tailored because the ap-
pliances require the monitoring of 
specific parameters to determine op-
erational bottlenecks.
 
Monitoring High-Performance 
Storage
If you’re managing a server compo-
nent responsible for continuously 
storing large amounts of data on a 
preconfigured packages that enable 
a quick and efficient deployment. 
For environments requiring a more 
tailored setup, the official Zabbix 
website [1] offers a guided selection 
tool to help choose the most appro-
priate components for your existing 
infrastructure. A complete installa-
tion involves more than just deploy-
ing the zabbix-server package. The 
zabbix-frontend-php package must 
also be installed and integrated with 
a supported web server to provide 
the web-based management interface. 
Once the web front end is accessible 
and the initial login is completed, 
additional users can be created, and 
the system configuration can begin. 
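As a rough sketch of these package-level steps on a Debian-family system (the package names follow the official Zabbix repositories; the MySQL back end and the schema path are assumptions that vary by version and distribution):
# Server, web front end, Apache integration, schema scripts, and local agent
apt install zabbix-server-mysql zabbix-frontend-php zabbix-apache-conf \
  zabbix-sql-scripts zabbix-agent
# Import the initial schema into a prepared MySQL/MariaDB database
zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | \
  mysql --default-character-set=utf8mb4 -u zabbix -p zabbix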
Zabbix allows administrators to de-
fine alerting rules on the basis of con-
figurable thresholds. Numerous pre-
defined triggers are already available 
through built-in templates, allowing 
rapid implementation of monitoring 
policies. Additionally, customizable 
dashboards provide consolidated vi-
sual insight into monitored metrics, 
helping teams track critical param-
eters and maintain operational visibil-
ity across the infrastructure.
Default Utilities in 
Restricted Environments
Sometimes the environment is not 
friendly to IT security engineers. On 
restricted systems where installing 
additional components is impossible 
because of policy constraints, vendor 
limitations, or technical restrictions, 
you must work with the tools avail-
able by default. Fortunately, standard 
Unix utilities (e.g., awk, sed, grep, cat, 
sort, cut, wc) provide powerful data 
extraction and processing capabilities 
present on virtually every Unix and 
Linux system from initial installation.
These utilities aren’t workarounds; 
they’re legitimate monitoring tools 
that have existed for decades. When 
agents cannot be installed and custom 
software is prohibited, these standard 
utilities become the primary mecha-
nism for extracting monitoring data.
The grep tool filters and pattern-
matches text, making it ideal for log 
analysis and counting occurrences; 
If hosts represent what you’re moni-
toring, items represent specific metrics 
you’re collecting from those hosts. 
Items are the atomic units of data col-
lection in Zabbix: Each item collects 
one metric.
An item definition includes several 
critical attributes. The item key speci-
fies exactly what to collect – for an 
agent item, this might be system.
cpu.load[percpu,avg1]. The item type 
determines how collection happens: 
Zabbix agent, SNMP, IPMI, simple 
check, HTTP agent, or other methods. 
The update interval controls col-
lection frequency, directly affecting 
database load and monitoring respon-
siveness. The value type defines the 
data format: numeric, character, log, 
or text.
Items can also include preprocess-
ing steps – that is, transformations 
applied to collected data before stor-
age, such as unit conversion, regular 
expression extraction, or JSON path 
parsing.
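Before an agent item like this goes into the configuration, its key can be tested from the Zabbix server with the zabbix_get utility; the hostname below is a placeholder, and 10050 is the default agent port:
# Query the agent on a monitored host directly
zabbix_get -s web01.example.com -p 10050 -k 'system.cpu.load[percpu,avg1]'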
Host-Item Relationship
The relationship between hosts and 
items creates the Zabbix monitoring 
hierarchy. A single host might have 
hundreds of predefined items: CPU 
metrics, memory usage, disk I/ O, 
network statistics, application logs, 
and custom metrics. This relationship 
enables bulk operations (disabling a 
host stops all its items), templating 
(define items once, apply to many 
hosts), and logical organization.
Templates deserve special mention: 
They define collections of items, trig-
gers, and monitoring elements that 
can be linked to multiple hosts. A 
“Linux Server” template might define 
50 items for monitoring the operat-
ing system. Link that template to 100 
hosts and you’ve defined 5,000 items 
through a single template relation-
ship. Change the template, and all 
linked hosts inherit the change.
Zabbix to Production
Installing Zabbix Server is rela-
tively straightforward because most 
major Linux distributions provide 
high-performance local disk, monitor-
ing its performance is crucial, espe-
cially if the component is part of a log 
storage and analysis system, such as 
a SIEM. Given that this component 
often functions as a dedicated ap-
pliance, which limits the ability to 
install arbitrary software, the use of 
a standalone Zabbix sender to trans-
mit critical performance data to your 
monitoring system is a highly effec-
tive solution.
The iostat command provides disk performance metrics, and with the -Nd parameter, you can retrieve device names with utilization reports (Listing 1).
 
Monitoring Network Filesystem Performance
When the local disk reaches its size limitations, a common best practice is to offload data to a more cost-effective storage device, such as a network attached storage (NAS) device. Because NAS operates over the network, it's equally important to monitor its performance.
Monitoring NFS performance requires 
parsing the output of nfsiostat to 
extract read and write throughput 
data (Figure 2). Because the com-
mand output follows a different two-
line format compared with previous 
versions, a workaround is needed to 
identify the relevant numbers cor-
rectly, which can be achieved in mul-
tiple ways. The approach used here 
is to capture the required line along 
with the next one and then exclude 
the first line (Listing 2).
 
Monitoring Critical Network 
Connections
For business continuity and to en-
sure no network issues exist on 
your side, it’s a best practice to 
monitor the TCP connection status 
between sensitive components. Un-
fortunately, direct monitoring isn’t 
feasible, so you can enable a password-
less SSH connection to allow remote 
command execution. This approach 
Figure 2: As a result of bandwidth disruption, the previously utilized bandwidth was no longer available.
bash
# Collect iostat output for a specific 10TB device
meter=$(iostat -Nd | grep 10T)
# Extract specific metrics
tps=$(echo "$meter" | awk '{print $2}')  # transactions per second
blkr=$(echo "$meter" | awk '{print $3}')  # read performance (kB/s)
blkw=$(echo "$meter" | awk '{print $4}')  # write performance (kB/s)
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s storage-host -k disk.tps -o "$tps"
zabbix_sender -z zabbix.example.com -s storage-host -k disk.read -o "$blkr"
zabbix_sender -z zabbix.example.com -s storage-host -k disk.write -o "$blkw"
Listing 1: High-Performance Storage
bash
# Extract read throughput (kB/s)
rk=$(/usr/sbin/nfsiostat | grep -A 1 "read" | grep -v "read" | awk '{print $2}')
# Extract write throughput (kB/s)
wk=$(/usr/sbin/nfsiostat | grep -A 1 "write" | grep -v "write" | awk '{print $2}')
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s nfs-client -k nfs.read.throughput -o "$rk"
zabbix_sender -z zabbix.example.com -s nfs-client -k nfs.write.throughput -o "$wk"
Listing 2: Read and Write Throughput
bash
# Identify the Java process
jid=$(jps | grep [TaskIdentifier] | awk '{print $1}')
# Extract heap usage (removing formatting for Zabbix integer requirement)
jused=$(jcmd $jid GC.heap_info | awk 'NR==2 {print $6}' | sed "s/K//g" | sed "s/,//g")
jtotal=$(jcmd $jid GC.heap_info | awk 'NR==2 {print $4}' | sed "s/K//g" | sed "s/,//g")
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s java-host -k java.heap.used -o "$jused"
zabbix_sender -z zabbix.example.com -s java-host -k java.heap.total -o "$jtotal"
Listing 4: Java Heap Status
bash
# Execute netstat remotely via SSH
out=$(ssh user@remote-host -i /home/user/.ssh/id_rsa "netstat -tupn 2>/dev/null")
# Count established connections to important service
established=$(echo "$out" | grep [critical_ip] | grep ESTABLISHED | wc -l)
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s remote-host -k connections.established -o "$established"
Listing 3: TCP Connection Status
[2] Crontab guru: [https:// crontab. guru/]
[3] Zabbix community forum: 
[https:// www. zabbix. com/ forum/]
Author
Attila Bartek has more than 25 years of experi-
ence as a Cybersecurity Engineer and Advisor.
persistent monitoring agents cannot 
be deployed. Crontab guru [2] helps 
you understand crontab syntax and 
structure.
Although the minimum scheduling 
interval for crontab is one minute, you can achieve sub-minute collection 
with sleep commands in a crontab file 
(Listing 7).
Conclusion
Effective monitoring in restricted en-
vironments requires creativity, a deep 
understanding of available system 
utilities, and strategic use of tools like 
zabbix_sender. Although these ap-
proaches might lack the elegance of 
purpose-built monitoring agents, they 
provide reliable, policy-compliant 
monitoring coverage in environments 
where traditional approaches fail.
The examples presented here dem-
onstrate that monitoring isn’t about 
having perfect tools – it’s about 
working effectively within constraints 
while still achieving operational vis-
ibility. By combining standard Unix 
utilities, zabbix_sender, and sched-
uled execution through crontab files, 
security engineers can build robust 
monitoring solutions that respect or-
ganizational policies, vendor limita-
tions, and technical realities.
If you require additional informa-
tion or have outstanding questions, 
feel free to reach out to the Zabbix 
community forum [3], a trusted 
source of practical guidance and 
best practices from experienced 
professionals. 
Info
[1] Zabbix server instal-
lation: [https:// www. 
 zabbix. com/ download/]
monitors established connections to 
critical services (Listing 3).
Java Heap Monitoring Without 
Agents
An official Java monitoring agent is 
available for installation, but some-
times it’s not feasible because of 
compliance or technical constraints. 
However, monitoring the Java heap 
status (Listing 4) is always important 
because reaching the maximum heap 
size can cause the application to stop 
functioning (Figure 3).
 
Kafka Buffer Monitoring with 
Auto-Remediation
An application failing to handle Kafka 
buffers properly requires monitoring 
with automatic intervention when 
thresholds are exceeded (Listing 5).
 
DNS Resolution Monitoring
If an automated protection system 
inadvertently makes a critical domain 
unresolvable, leading to the failure 
of other systems, you can implement 
monitoring to detect similar issues 
(Listing 6).
Use with Crontab
Crontab provides reliable scheduling 
for zabbix_sender collection scripts on 
any Unix or Linux system. Universally 
available, crontab requires no installa-
tion, and administrators already under-
stand how to use and maintain cron 
jobs. A simple crontab entry schedules 
your collection script that gathers data 
with standard utilities and calls zab-
bix_sender at whatever interval makes 
sense for your metrics. This arrange-
ment creates lightweight, scheduled 
monitoring that works even in the 
most restricted environments where 
Figure 3: An issue was encountered once the used heap size (brown) reached the total heap size (green).
Listing 7: Sub-Minute Collection
bash
* * * * * /home/cronscript/myscript.sh             # Run script every minute
* * * * * sleep 30; /home/cronscript/myscript.sh   # Run script every minute, delayed by 30 seconds
Listing 6: DNS Resolution
bash
# Attempt DNS resolution
resolvedIP=$(nslookup api.example.com | awk -F':' '/^Address: / {matched=1} matched {print $2}' | xargs)
# Determine success (1) or failure (0)
[[ -z "$resolvedIP" ]] && result=0 || result=1
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s dns-monitor -k dns.resolution.status -o "$result"
Listing 5: Kafka Buffers
bash
# Monitor partition fill percentage
fill=$(df -h | grep "[kafka_partition]" | awk '{print $5}' | sed 's/%//')
# Define threshold
threshold=85
if [ $fill -ge $threshold ]; then
  echo "Threshold $threshold reached - executing cleanup."
  /usr/local/bin/kafka_cleanup.sh
  zabbix_sender -z zabbix.example.com -s kafka-host -k kafka.cleanup.triggered -o 1
else
  echo "Below threshold - no action needed."
fi
# Always send current fill level
zabbix_sender -z zabbix.example.com -s kafka-host -k kafka.buffer.fill -o "$fill"
The Certificate Enrollment Web 
Service was introduced in Windows 
Server 2008 R2 to modernize certifi-
cate requests and make them more 
flexible. Unlike traditional requests 
by Remote Procedure Call (RPC) and 
Distributed Component Object Model 
(DCOM) protocols, which require a 
direct connection to internal network 
ports and domain membership, both 
Certificate Enrollment Policy (CEP) 
web service and Certificate Enroll-
ment Web Service (CES) are imple-
mented on the Simple Object Access 
Protocol (SOAP) standard, which 
allows certificate requests to be made 
over an HTTPS interface, facilitating 
the integration of systems that are 
not part of the Active Directory (AD)
domain or even reside on remote 
networks.
Two Central Services
The CEP web service is based on 
X.509 CEP (MS-XCEP) [1] and is used 
to provide clients with information 
about available certificate templates 
and certification authorities. The ser-
vice provides this information over 
an HTTPS interface. Authentication 
is handled either by Kerberos or a 
username/ password combination, or 
it relies on a client certificate.
In contrast, the CES web service is 
based on the WS-Trust X.509v3 Token 
Enrollment Protocol (MS-WSTEP) [2] – 
a Microsoft-specific implementation of 
the OASIS WS-TRUST [3] standard. It 
is responsible for requesting the cer-
tificate, which it does by forwarding 
certificate signing requests (CSRs) to 
the certification authority (CA). As with 
CEP, communication takes place over 
HTTPS, and authentication is identical 
to the CEP protocol.
Managing Certificates with 
certmonger
The certmonger tool [4] helps with all 
the tasks related to managing X.509 
certificates on Linux systems, which 
means everything from generation 
of private keys, through certificate 
requests (CSRs), to automatic renewal 
of certificates before they expire.
The cepces [5] plugin lets you use 
CEP/ CES to procure a certificate from 
AD Certificate Services (CS) and place 
it under the control of certmonger. This 
function is used by Samba to provision 
certificates automatically for clients 
(Certificate Auto Enrollment) with a 
Group Policy Object (GPO) [6].
To ensure that communication with 
AD CS over CEP and CES protocols 
works, make sure the Certificate En-
rollment Web Service and Certificate 
Enrollment Policy Web Service roles 
are installed on an AD system, in ad-
dition to the Certificate Services. If 
these roles are not available, you can 
discover online how to add the roles 
to your existing AD CA [7][8].
Requesting a Certificate
The following example is based on 
a current Fedora system, but it also 
works on all other Linux systems on 
which the certmonger tool and the cep-
ces plugin are available. As usual, the 
two packages are installed on Fedora 
Microsoft’s Certificate Enrollment Web Service offers an easy way to 
obtain X.509 certificates from Active Directory Certificate Services. We 
introduce the protocols and investigate how to use the certmonger tool 
to issue certificates for Linux systems. By Thorsten Scherf
Linux Meets Windows CA
 Bridges
Photo by Sandip Roy on Unsplash
At the end of the day, though, the 
combination of CEP, CES, and cert-
monger offers a very useful approach 
for automated certificate requests in 
heterogeneous environments. 
Info
[1] MS-XCEP: 
[https:// learn. microsoft. com/ en-us/ 
 openspecs/ windows_protocols/ ms-xcep/ 
 08ec4475-32c2-457d-8c27-5a176660a210]
[2] MS-WSTEP: 
[https:// learn. microsoft. com/ en-us/ 
 openspecs/ windows_protocols/ ms-wstep/ 
 4766a85d-0d18-4fa1-a51f-e5cb98b752ea]
[3] WS-TRUST: [https:// docs. oasis-open. org/ 
 ws-sx/ ws-trust/ v1. 4/ ws-trust. html]
[4] certmonger: 
[https:// pagure. io/ certmonger]
[5] cepces on GitHub: 
[https:// github. com/ openSUSE/ cepces]
[6] Certificate Auto Enrollment:[https:// wiki. samba. org/ index. php/ 
 Certificate_Auto_Enrollment]
[7] Configuring the Certificate Enrollment 
Web Service: [https:// learn. microsoft. 
 com/ en-us/ windows-server/ identity/ 
 ad-cs/ configure-certificate-enrollment- 
 web-service]
[8] Configuring the Certificate Enrollment Pol-
icy Web Service: [https:// learn. microsoft. 
 com/ en-us/ windows-server/ identity/ 
 ad-cs/ configure-certificate-enrollment- 
 policy-web-service]
[9] realmd: [https:// www. freedesktop. org/ 
 software/ realmd/]
[10] realmd and AD: 
[https:// www. freedesktop. org/ software/ 
 realmd/ docs/ guide-active-directory. html]
realm discover win2022-1g7p.test
Make sure you use the server that 
has the AD DNS entries as the DNS 
resolver [10]. After making sure 
this worked, add the system to the 
domain:
realm join win2022-1g7p.test
A simple id command lets you 
verify that you can query users from 
the domain; then finally, test the 
authentication:
id Administrator@win2022-1g7p.test
kinit Administrator@win2022-1g7p.test
If everything worked, you can now 
manually request the certificate for 
your system:
getcert request U
 -c cepces U
 -k /etc/pki/tls/private/machine.key U
 -f /etc/pki/tls/certs/machine.crt
The -c option here tells getcert to use the 
previously installed cepces plugin. If 
everything worked, you will see from 
the output of getcert list that a certif-
icate was issued, and the system jour-
nal will also display information about 
a successful certificate issuance.
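A quick, hedged way to double-check both points is to filter the tracking list for the certificate path used above and to review the recent certmonger journal entries (the exact output varies by system):
getcert list | grep -B 2 -A 6 '/etc/pki/tls/certs/machine.crt'
journalctl -u certmonger -n 20 --no-pager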
You can also use openssl to query the 
certificate’s details (Listing 1).
Conclusion
The certmonger tool and cepces 
plugin make it very easy to obtain 
certificates from an AD CS if the 
CEP and CES CA features are avail-
able. Currently, the client must be a 
domain member, because certmonger 
only supports Kerberos for authen-
tication. However, this situation 
could change in future versions of 
the tool.
Alternatively, you could check with 
curl and openssl 
whether addi-
tional wrappers 
or manual re-
quests let you log 
in with a certifi-
cate or password. 
by the dnf package manager from the 
distribution’s standard repository:
dnf install certmonger cepces-certmonger
The package manager automatically 
adds the Cepces CA plug-in to the 
certmonger configuration. To verify 
that this install worked, use:
getcert list-cas
 
[...]
CA 'cepces':
 is-default: no
 ca-type: EXTERNAL
 helper-location: U
 /usr/libexec/certmonger/ cepces-submit
In addition to several other plugins, 
you should now see a CA named cepces 
in the command output. If this entry 
does not appear, simply add the plugin 
manually by typing the command:
getcert add-ca U
 -c cepces U
 -e '/usr/libexec/certmonger/U
 cepces-submit'
In the /etc/cepces/cepces.conf configuration file, the next step is to enter the name of the AD system on which you previously installed the CEP and CES roles; a quick grep confirms the setting:
grep '^server' /etc/cepces/cepces.conf
server=ad1-1g7p.win2022-1g7p.test
Into the Domain with realmd
Before you can request a certificate 
from AD CS, you first need to add the 
client system to the domain, which 
might seem a little surprising because 
CEP and CES support different authen-
tication methods. Unfortunately, the 
certmonger plugin currently only uses 
Kerberos to log in to an AD system.
The easiest way to add the client to 
the AD domain is to use the realmd 
tool [9]. The package is available for 
most Linux distributions. Once the 
package is installed on the system, 
the first step is to perform a domain 
discovery:
Listing 1: Certificate Details
# openssl x509 -in /etc/pki/tls/certs/machine.crt -noout -issuer -subject -dates
issuer=DC=test, DC=win2022-1g7p, CN=win2022-1g7p-AD1-1G7P-CA
subject=CN=client.win2022-yn6a.test
notBefore=May 30 10:18:31 2025 GMT
notAfter=May 30 10:18:31 2026 GMT
The Author
Thorsten Scherf is the 
global Product Lead for 
Identity Management and 
Platform Security in Red 
Hat’s Product Operations 
group. He is a regular 
speaker at various international conferences 
and writes a lot about open source software.
Demand for Ethernet as a real-
time control network is growing as 
manufacturers and other companies 
discover the advantages of a single 
network technology throughout the 
enterprise (from the office floor 
to the factory floor). This kind of 
vertical integration offers many 
benefits in terms of administration 
and support for IT. Lower product 
costs combined with the potential 
for overlap in training and mainte-
nance costs for information, field, 
control, and possibly device net-
works, are expected to reduce costs 
significantly.
Ethernet offers many advantages 
over existing approaches at the 
real-time control level. As a con-
trol network, it offers a bandwidth 
of 10Gbps (and higher), which is 
almost 1,000 times faster than com-
parable fieldbus networks. However, 
distributed applications in control 
environments require tight syn-
chronization to guarantee message 
delivery within defined cycle times. 
Conventional Ethernet and fieldbus 
systems are unable to meet the tim-
ing requirements of less than a few 
milliseconds, but real-time Industrial 
Ethernet enables cycle times of just a 
few microseconds.
Ethernet also promises less complex-
ity with all the features required for 
a field, control, or device network. 
Moreover, Ethernet devices support 
TCP/ IP stacks, allowing Ethernet 
to connect to the Internet without 
problems. This feature is attractive 
because it enables remote diagnostics, 
control, and monitoring of an indus-
trial network from any device con-
nected to the Internet.
Real-Time Systems
Various organizations like IEEE and 
ISO define standards and guidelines 
for real-time systems that can vary by 
context and application, but real time 
generally can be defined as the opera-
tion of a computing system in which 
programs for processing incoming 
data are constantly ready for immedi-
ate execution, enabling the system 
to process data and produce outputs 
within a strict, predefined time con-
straint. Depending on the applica-
tion, the data could occur at random 
intervals or at predetermined times. 
Appropriate hardware and software 
must be used to avoid the occurrence 
of delays capable of preventing com-
pliance with this condition.
Correct execution of real-time (RT) 
systems depends not only on the logi-
cal validity of the data, but also on 
its timeliness. Hard real-time (HRT) 
systems are those in which faulty 
operation can lead to catastrophic 
events. Errors can lead to accidents or 
even death. Such computers are typi-
cally found in flight or train control 
systems. In contrast, soft real-time 
(SRT) systems are not as vulnerable. 
Although errors are undesirable, they 
do not lead to the loss of property or 
human life.
The building blocks on which real-
time systems are based are referred 
to as “jobs.” Each real-time job is 
assigned specific timing parameters: 
release time, readiness time, execu-
tion time, response time, and dead-
line. The release time of a job is the 
point at which the job is available 
to the system. The execution time is 
the time required for a job to be fully 
processed. The response time is the 
period between the release time and 
the completion of execution. The readiness 
time is the earliest time at which the 
The replacement of first-generation fieldbuses with real-time Ethernet creates a single network that extends from the 
control level in the office to field devices. We describe the challenges and solutions of various protocols for Industrial 
Ethernet with real-time capabilities that currently is not governed by a single uniform standard. By Mathias Hein
Real-Time Industrial Ethernet Protocols
 Every Second Counts
Lead Image © Orlando Rosu, 123RF.com
(IoT) devices, machines, and, 
increasingly, autonomous artificial 
intelligence (AI) applications. Studies 
and observations in corporate envi-
ronments show that NHIs exceed the 
number of human identities many 
times over: Ratios of 40:1 to 80:1 
have been reported. Whether or not 
these numbers are accurate, clearly 
NHIs create an identity and access 
management (IAM) and cybersecurity 
problem of considerable magnitude, 
giving rise to a variety of security 
risks and prompting the need 
for automation.
The challenge lies not only in the 
sheer numbers. NHIs are often cre-
ated automatically, for example, as 
part of continuous integration and 
continuous delivery (CI/ CD) pipelines 
or through instances of Kubernetes 
pods. Their lifespans can range from 
a few seconds to several years, and 
their privileges range from simple 
read access to comprehensive admin-
istrative rights.
The majority of today’s NHIs are 
either unknown or work with static 
access credentials that do not change 
over long periods of time. This com-
bination of opacity and permanent 
authorizations creates a massive at-
tack surface that classic strategies 
in the area of IAM do not address. 
The strategies currently in place only 
consider human identities and a small 
subset of NHIs – the technical and 
functional user accounts managed by 
privileged access management (PAM) – 
that is, service and system accounts.
Management of Non-Human 
Identities
Different terms are sometimes used 
synonymously with the umbrella term 
“non-human identity management” 
for strategies, technologies, and pro-
cesses, and sometimes specific sub-
areas (Table 1).
NHI management encompasses iden-
tifying, creating, governing, and de-
leting these identities, including cre-
dential (authentication information) 
management; creating, managing, 
and assigning policies; and manag-
ing the resulting risks. The goal is to 
enforce basic principles, such as least 
Many non-human identities – workloads in the cloud, service accounts in IT systems, autonomous agents 
in AI applications – are poorly managed or not managed at all. We present a strategic, holistic approach to 
managing these identities. By Martin Kuppinger
Identity for Machines, Workloads, and Agents
 Digital Colleagues
Lead Image © ARMMY PICCA, 123RF.com
Table 1: Non-Human Identities
Term | Definition or Focus | Distinction
Non-human identity | Generic term for all digital identities not related to a human being | Also includes workloads, machines, agents
Machine identity | Identity of physical or virtual systems (e.g., servers, IoT devices) | Typically long-term; secured by certificates
Workload identity | Identities for temporary processes (e.g., containers, serverless functions) | Ephemeral, dynamic; token-based
Service account or API identity | Functional accounts for services, pipelines, APIs | Often static, with wide-ranging authorizations
Agentic AI identity | Identity for autonomous AI agents | Contextual, adaptive, goal-oriented
privilege and zero standing privileges, 
for NHIs, too.
The diversity of terms in the field of 
NHIs also reflects the complexity of 
the topic. Although workload iden-
tity often plays a role in the context 
of cloud-native architectures and 
DevOps, machine identity is more 
focused on classic system-to-system 
communication (e.g., in the context 
of TLS certificates or device certifi-
cates for IoT). Agentic AI identity, 
on the other hand, describes a new 
class of identities that is increasingly 
characterized by autonomous, adap-
tive systems. These identities come 
with additional requirements – for 
example, with regard to the decision-
making context and the ability to 
change over time.
A future-oriented model for NHI 
management therefore needs to avoid 
being based on fixed typologies and 
must instead focus on attribute and 
capability descriptions. These descrip-
tions cover, for example, the duration 
of an identity’s existence (short-lived 
or ephemeral vs. persistent), its origin 
(automatically generated vs. manually 
created), the degree of autonomy, and 
the type of interaction with systems. 
An attribute-based approach enables 
more flexible governance and pro-
motes a better understanding of how 
identities should be treated, regard-
less of how they are labeled.
CIEM
Cloud infrastructure entitlement man-
agement (CIEM) focuses on managing 
and analyzing access authorizations 
security technologies. For example, 
anomalies in access behavior can be 
identified by combining CIEM with 
identity threat detection and response 
(ITDR). Automatically remedying in-
correct or risk-fraught authorizations 
leads to an adaptive security model 
that can also respond to volatile and 
short-term workload identities, which 
makes CIEM an indispensable com-
ponent of any modern, risk-oriented 
cloud security architecture.
Interfaces to Other 
Security Segments
NHI management does not stand 
alone. Numerous other segments in 
the area of IAM and cybersecurity 
overlap or relate to NHI management 
(Table 2). These segments need to be 
integrated into a holistic identity and 
security model, such as an identity 
fabric that covers both human and 
non-human actors (Figure 1).
Historically, PAM did not focus exclu-
sively on privileged human users. Early 
on, PAM also addressed functional 
and technical accounts, in particular 
shared accounts or service accounts 
at the operating system and database 
level. These accounts form a subset of 
non-human identities because they are 
either used automatically or shared by 
multiple people with elevated privi-
leges. However, with the expansion of 
dynamic, short-lived workload identi-
ties, PAM needs to be rethought and 
more closely integrated with modern 
NHI management strategies.
Secrets management is a central com-
ponent in dealing with NHI, because 
almost every non-human identity 
requires authentication information 
in the form of secrets. The secure 
management, versioning, and rotation 
of these secrets is essential to avoid-
ing security risks and vulnerabilities. 
However, the isolated use of vault 
technologies is not up to this task. 
Secrets, including authentication cre-
dentials, are linked to identities, such 
as workloads, defined owners of ap-
plications and software components, 
and security policies.
Simply managing different types 
of secrets, from SSH keys and SSH 
in cloud infrastructures. In other 
words, CIEM occupies the space be-
tween NHI management and access 
governance. Another term for this 
could be non-human access manage-
ment (NHA); in fact, this term would 
describe CIEM’s function far much 
more accurately and make clear the 
very close relationships between 
NHI and CIEM, although usually 
they currently are not implemented 
in business practice.
CIEM tools analyze which NHIs have 
access to which resources, whether 
these authorizations are too exten-
sive, and whether principles such 
as least privilege are being violated. 
CIEM therefore provides the required 
counterbalance to identity manage-
ment: Where NHI management is 
responsible for managing identity and 
associated credentials, CIEM consid-
ers access to specific resources in the 
cloud.
Especially in dynamic cloud environ-
ments with infrastructure as code 
and automated provisioning, it is 
nearly impossible for companies to 
keep track manually of all access 
authorizations for NHIs. CIEM tools 
enable transparency here through 
continuous analysis and visualization 
of entitlement structures, identifying 
overprivileged roles, and recommend-
ing optimizations on the basis of 
usage patterns, which is an essential 
step toward implementing the least 
privilege principle in complex cloud 
landscapes.
Additionally, modern CIEM ap-
proaches increasingly offer integrated 
functions for correlation with other 
Table 2: Security Technologies
job can be executed (always greater 
than or equal to the release time). 
The deadline is the time by which the 
execution must be completed.
All real-time systems exhibit a certain 
degree of jitter (i.e., a deviation from 
the actual timing of the aforemen-
tioned times). In a real-time system, 
the jitter should be measurable within 
a defined interval so that system per-
formance can still be guaranteed.
Ethernet Without Collisions
Ethernet is a non-deterministic net-
work protocol and therefore inher-
ently unsuitable for hard real-time 
applications. The Carrier Sense Mul-
tiple Access with Collision Detection 
(CSMA/ CD) media access control 
protocol specified in the IEEE 802.3 
standard, with its binary, exponential 
backoff algorithm, does not enable 
the network to support hard real-time 
communication, because it includes 
random delays and allows for the pos-
sibility of transmission errors.
With the CSMA/ CD mechanism, each 
node detects whether another node is 
transmitting on the medium (carrier 
sense). If the carrier sense function 
is active on a node, it delays trans-
mission until it determines that the 
medium is free. Whenever two nodes 
transmit simultaneously (multiple 
access), a collision occurs in the net-
work, and all packets become invalid. 
The nodes can detect collisions by 
monitoring the collision signal pro-
vided by the bit transmission layer. If 
a collision occurs, the node sends a 
corresponding notification.
When a node begins transmission on 
the medium, a specific time interval, 
known as the collision window, takes 
place, during which a collision can 
occur. This window is large enough 
for the signal to propagate throughout 
the entire network segment. Once this 
time window has expired, all (func-
tioning) nodes should have their car-
rier detection enabled and therefore 
not attempt to begin transmission.
If a collision occurs, the backoff algo-
rithm is applied to each colliding node. 
One advantage of this algorithm is 
that it controls the use of the medium. 
switches. These devices can isolate 
collision domains by segmenting the 
network, as each device connection 
is configured as a single collision 
domain, which means full-duplex 
switches in combination with full-
duplex-capable nodes can eliminate 
collisions in all segments.
The IEEE 802.1Q standard provides 
for the required quality of service 
(QoS) at the media access control 
(MAC) level and defines how these 
switches can handle prioritization. 
An 802.1Q implementation has 
certain advantages for real-time 
Industrial Ethernet applications: It 
introduces standardized prioritiza-
tion on Ethernet and enables control 
engineers to implement up to eight 
different user-defined priority levels 
for their data traffic.
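As a generic Linux illustration of this kind of tagging (a sketch only, not the configuration of any particular Industrial Ethernet stack; the interface name, VLAN ID, and priority values are placeholders), a VLAN interface can map local packet priorities to the 802.1Q Priority Code Point carried in the tag:
# Create VLAN 100 on eth0 and map local skb priority 3 to PCP 5 in outgoing frames
ip link add link eth0 name eth0.100 type vlan id 100 egress-qos-map 3:5
ip link set dev eth0.100 up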
To classify real-time capability, the 
OSI model is divided into three 
classes: class 1, above the Transport 
Layer; class 2, above the Ethernet 
Layer; and class 3, through modifica-
tion of the Ethernet Layer (Figure 1). 
In class 1, the entire protocol stack is 
retained, preserving full compatibility 
with conventional Ethernet up to the 
Application Layer. Well-known imple-
mentations of this class are Modbus/ 
TCP, P-NET, JetSync, EtherNet/ IP 
with CIP Sync, and Foundation Field-
bus high-speed Ethernet (HSE).
EtherNet/ IP
EtherNet/ IP (EIP, where IP stands for 
Industrial Protocol) is an open Ap-
plication Layer protocol that is based 
on the existing IEEE 802.3 Physical/ 
Data Layers and TCP/ UDP/ IP, which 
ensures interoperability with most 
information layer networks. EIP offers 
real-time performance if strict guide-
lines are followed but is not determin-
istic. It uses the open, object-oriented 
Control and Information Protocol 
(CIP) as its Application Layer – the 
same Layers 5 through 7 as DeviceNet 
and ControlNet, providing full in-
teroperability with those networks.
CIP is a flexible and scalable au-
tomation protocol, well suited for 
distributed systems, and features 
object orientation, electronic data 
If the medium is heavily loaded, the 
probability of collisions increases, and 
the algorithm increases the interval 
from which the random delay time 
is selected. This step is intended to 
reduce the load and avoid further col-
lisions. However, Ethernet’s CSMA/ CD 
algorithm can result in complete trans-
mission failure and the possibility of a 
random transmission time, making the 
protocol non-deterministic, especially 
in heavily loaded networks.
That said, Ethernet is only non-deter-
ministic when collisions can occur. To 
implement a fully deterministic Eth-
ernet, all collisions must be avoided. 
A collision domain is a CSMA/ CD 
segment in which simultaneous trans-
missions can lead to a collision. The 
probability of collision increases with 
the number of nodes transmitting in a 
single collision domain.
Switched Ethernet 
in the IoT
When Ethernet was standardized, 
all communication was based on a 
half-duplex transmission mechanism, 
wherein a node can either send or 
receive, but not do both at the same 
time. Nodes that share a half-duplex 
connection operate in the same col-
lision domain, which means these 
nodes compete for bus access and 
their packets can collide with other 
packets on the network. With full-
duplex, a node can send and receive 
simultaneously, and a maximum of 
two nodes can be connected to it, 
which is usually a node-to-switch or 
switch-to-switch configuration where 
each network node has its own colli-
sion domain. This method completely 
avoids collisions. Because full-duplex 
connections can serve a maximum of 
two nodes per connection, this tech-
nology is not practical without the 
use of fast switches.
The most common method of colli-
sion avoidance is the introduction of 
individual collision domains for each 
node, because this guarantees the 
node sole use of the medium, elimi-
nating access conflicts. This system is 
achieved by implementing full-duplex 
connections and hardware such as 
sheets, and device profiles. EIP 
with CIP is not a real-time protocol. 
To achieve RT for EIP, CIP Sync (a 
high-speed CIP synchronization 
solution) is used. With 100Mbps 
switched Ethernet, it achieves a syn-
chronization accuracy of better than 
500ns between devices, although jit-
ter caused by the protocol stack still 
poses a problem.
EIP uses both TCP and UDP with IP 
for communication. If a connection-
oriented exchange is preferred (e.g., 
during initialization), it uses TCP 
(Explicit Messaging). Explicit Mes-
saging contains protocol and service 
information but has no strict timing 
requirements; it is therefore perfectly 
okay to use the slower but guaranteed 
TCP protocol. For RT traffic, EIP uses 
the unicast and multicast capabilities 
of UDP to implement the producer-
consumer model of communication, 
which is popular in control applica-
tions. Implicit messages do not con-
tain commands, only data. The mean-
ing of this data is configured during 
initialization, which reduces runtime 
processing in the nodes. Network 
collisions are avoided by switches, 
and EIP generally operates in a 
star topology. One variant uses virtual 
local area networks (VLANs) and 
places all devices that exchange time-
critical data on the same VLAN.
Foundation Fieldbus HSE
The starting point for Foundation 
Fieldbus HSE (Figure 2) is Founda-
tion Fieldbus H1, introduced in 1995, 
with a transmission rate of 31.25Kbps 
and identical bus physics to Profibus 
PA (process automation) in accor-
dance with IEC 61158-2. Because the 
transmission rate is very low, a faster 
Figure 1: Overview of real-time capability classes.
Figure 2: On a Foundation Fieldbus network, linking devices (e.g., by ABB) connect systems 
that communicate over Ethernet.
switch. The minions have an inte-
grated memory of between 2 bits and 
64KB. They look like a single device 
to the Ethernet, although in real-
ity they can comprise up to 65,535 
devices configured in an open ring 
topology with the Ethernet interface 
at the open end. The manager sends 
commands to the MAC address of the 
first device. When the signal reaches 
the Ethernet-minion interface, it is 
converted to eBus specifications (if 
eBus is used) and forwarded.
The fieldbus memory management unit 
(FMMU) of each configurable min-
ion converts a logical address into a 
physical address; this information is 
available to the manager during ini-
tialization, which is why each minion 
requires a special application-specific 
integrated circuit (ASIC). When a min-
ion receives a datagram, it determines 
whether it is being addressed and 
then forwards the data to or from the 
datagram, resulting in a delay of a few 
nanoseconds. EtherCAT is therefore 
a fast real-time Ethernet and is deter-
ministic when not used with UDP/ 
IP or between managers and minions 
connected by switches or routers.
Ethernet Powerlink
Ethernet Powerlink (EPL) is a hard RT 
protocol that is based on Fast Ether-
net. EPL devices use standard Ether-
net hardware without special ASICs. 
EPL can deliver a cycle time of 200ms 
with jitter of less than 1ms. EPL uses 
cyclic communication with time slot 
allocation and the manager-minion 
model. One manager is 
allowed per network. This manager 
schedules all transmissions and is the 
only active station; the minions trans-
mit on demand.
The EPL cycle comprises four sec-
tions. During the start period, the EPL 
manager sends the start-of-cyclic (SoC) 
frame, which synchronizes the min-
ions. The timing of this frame is the 
only time base for network synchro-
nization; all other frames are purely 
event-driven. The SoC is followed by 
the cyclic period, when the manager 
polls each station with a poll request 
frame. At this point, the minion 
can process 1,000 I/ O in 30µs, but 
requires a full-duplex transmission 
mechanism of copper or fiber optic 
cables. EtherCAT is based on the 
manager-minion principle and can 
interact with normal TCP/ IP and 
other Ethernet-based networks such 
as EIP or Profinet. It also supports 
any Ethernet topology, including bus.
The EtherCAT manager processes 
the RT data with dedicated hardware 
and software. The manager priori-
tizes EtherCAT frames over normal 
Ethernet traffic and controls traffic 
by initiating all transmissions. The 
datagrams (data packets between 
manager and minions) are standard 
Ethernet packets, where the data field 
encapsulates the EtherCAT frame (an 
EtherCAT header and one or more 
EtherCAT commands). Each com-
mand contains a header, data, and a 
working counter field. Each Ethernet 
datagram can contain many Ether-
CAT commands, resulting in higher 
bandwidth and more efficient use of 
the large Ethernet data field size and 
header. The standard Ethernet cyclic 
redundancy check (CRC) is used to 
verify the correctness of the message.
The EtherCAT manager completely 
controls its minions. Its commands 
only trigger responses; the minions 
do not initiate transmissions. The 
two EtherCAT communication meth-
ods used are EtherType or UDP/ 
IP encapsulation. The EtherType 
implementation does not use IP, 
which limits EtherCAT traffic to the 
original subnet. Encapsulating com-
mands with UDP/ IP lets EtherCAT 
frames traverse subnets but has dis-
advantages. The UDP/ IP header adds 
28 bytes to the Ethernet frame and 
undermines RT performance with its 
non-deterministic stack.
EtherCAT minions range from intel-
ligent nodes to 2-bit I/ O modules and 
are networked by 100BASE-TX, fiber 
optic cable, or eBus. eBus is a Physi-
cal Layer of EtherCAT for Ethernet 
that provides a low-voltage differen-
tial signal (LVDS) scheme. Minions 
are hot-pluggable in any topology 
of branches or stub lines. Multiple 
minion rings can exist on a single 
network if they are connected by a 
H2 bus was initially considered for 
communication at the control level. 
However, because of the widespread 
use of Ethernet in this area, develop-
ment was discontinued at an early 
stage, and the development of Foun-
dation Fieldbus HSE was initiated.
A mixture of tree and bus topology 
can be used with the H1 fieldbus. 
Communication takes place by man-
ager-minion access or a deterministic 
token passing procedure. In specifica-
tion 1.2, a maximum of 32 devices 
can be located on an H1 subnet, in 
which non-deterministic communica-
tion is not permitted. A linking device 
uses a bridge to connect several H1 
subnets and form an HSE network 
on which conventional 100Mbps 
switches operate. Because HSE is still 
based on standard Ethernet with a 
superimposed TCP/ UDP/ IP protocol 
stack, it is not real-time-capable itself. 
However, the development of another 
real-time-capable network was not 
the focus of the Fieldbus Foundation.
At the application level of the H1 
fieldbus, much like EtherNet/ IP, a 
function block model already existed 
for managing reusable hardware and 
software components of the auto-
mated facility. The function block is 
standardized according to IEC 61131 
and interacts with other function 
blocks by way of I/ O variables. Here, 
too, are gateways to other fieldbuses 
(third-party I/ O gateways). The goal 
of the Fieldbus Foundation was to 
transfer this function block model 
to the HSE level and use the same 
object model. In this way, the bridge 
between the two buses appears 
transparent. The user on the Ether-
net side has the impression of being 
able to access all H1 devices directly 
and equally. Multiple H1 buses can 
exchange non-time-critical manage-
ment, diagnostic, and configuration 
data with each other over the HSE 
bridges.
EtherCAT
The EtherCAT (Ethernet for Control 
Automation Technology) protocol is 
a real-time motion control concept 
defined in IEC standard 61158. It 
responds with a poll response frame 
containing data, avoiding collisions. 
The minion sends its response to all 
devices, enabling communication be-
tween the minions. After successfully 
polling all minions, the manager sends 
the end-of-cyclic frame, which informs 
each minion that cyclic traffic has 
been completed correctly.
The asynchronous section allows 
non-cyclic data transfers under the 
control of the manager. To transmit 
during this period, a minion must 
have informed the manager in its poll 
response during the cyclic period. 
The manager creates a list of wait-
ing minions and uses a scheduler to 
ensure that no transmission request is 
delayed indefinitely. Standard IP data-
grams can be transmitted during the 
asynchronous period.
EPL does not use switches to avoid 
collisions or ensure network syn-
chronization – this responsibility 
is controlled by the manager. EPL 
networks can be based on standard 
hubs, the recommendation being that 
each device contain a hub to facilitate 
bus implementation. Switches are not 
prohibited, but they add jitter and 
reduce determinism. Because the EPL 
network avoids collisions through 
time-controlled bus access, up to 10 
hubs can cascade.
Currently, EPL devices that require 
RT communication cannot coexist in 
the same segment as non-RT Ethernet 
devices. However, EPL devices can be 
operated like normal Ethernet hard-
ware. In protected mode, the real-time 
segment must be separated from 
normal traffic by a switch or router. 
In open mode, RT traffic shares the 
segment with normal traffic, but real-
time communication is impaired.
Profinet
Profinet is a fieldbus standard for dis-
tributed automation systems. It uses 
object orientation and existing IT 
standards (TCP/ IP, Ethernet, XML). 
Profinet is based on IEEE 802.3, is in-
teroperable with TCP/ IP and therefore 
with Ethernet, and is compatible with 
Profibus-DP (decentralized peripher-
als). Profinet v1has a response time 
of 10 to 100ms (Figure 3).
In contrast, Profinet-SRT (soft real-
time) with a cycle time of 5 to 10ms is 
designed to work in factory automa-
tion and to implement real-time ex-
clusively in the software. It uses TCP/ 
IP and its own software channel for 
RT communication. Profinet-IRT (iso-
chronous RT) introduces a hard RT el-
ement into the Profinet protocols. The 
three Profinet protocols enable differ-
ent degrees of real-time performance.
Profinet-IRT supports systems that 
require synchronization in the sub-
microsecond range, typically high-
performance motion control systems. 
The benchmark for such a system is 
a millisecond cycle time, microsecond 
jitter accuracy, and guaranteed deter-
minism; IRT meets all three criteria. 
However, because the software causes 
jitter of greater than 1ms, IRT (unlike 
SRT) is implemented in hardware with 
synchronized Ethernet nodes. With 
the use of full-duplex Fast Ethernet, 
the communication cycle is divided 
into an open standard TCP/ IP channel 
and a deterministic RT channel. Each 
Profinet-IRT device has a special ASIC 
for handling node synchronization and 
cycle division and includes an intel-
ligent two- or four-port switch.
The Profinet switch in each node con-
tains a bus access schedule and can 
process RT and non-RT traffic. This 
bus prioritizes real-time traffic and 
provides full-duplex connections for 
all ports. Classic switches add jitter, 
which affects determinism. Profinet 
switches minimize jitter so that it 
has a negligible effect. The Profinet 
communication model enables the 
coexistence of RT and non-RT traf-
fic in a network without additional 
precautions.
Conclusion
Currently no uniform standard for 
automation technology has been 
determined for Industrial Ethernet 
with real-time capabilities. The IEC 
61784-2 standard specifies at least 
10 different, and mostly incompat-
ible, technical solutions. In practice, 
though, no technical reason demands 
that so many different real-time Ether-
net implementations should be main-
tained. Pressure from users likely will 
lead to a reduction in these numbers 
in the medium term, with the market 
deciding which candidates best meet 
the requirements of the respective 
automation applications. 
Keywords: Ethernet, real time, RT, protocol, frame, layer, EIP, fieldbus, 
HSE, EtherCAT, Powerlink, EPL, Profinet, SRT, IRT
Figure 3: Profinet occupies its own area in the data packet of an Ethernet frame (FCS, frame check sequence).
The Author
Mathias Hein is a freelance 
IT consultant and technical 
writer with more than 
40 years of professional 
experience in the field of 
networking. He also serves 
as an adjunct instructor at several universities. 
As a trainer and speaker at technical seminars, he 
shares his expertise in the areas of switching, 
TCP/IP, Voice over IP, Carrier Ethernet, and network 
management. As an author of technical books and 
articles in relevant trade journals, Hein regularly 
contributes to the dissemination of knowledge.
Table 2: Security Segments and NHI Management

Security Segment | Relationship to NHI Management
Privileged access management (PAM) | Management of privileged NHI and privileged human-assigned user accounts, especially technical and shared service accounts
Secrets management | Securing and rotating access information (tokens, keys) for NHI
Identity governance and administration (IGA) | Governance, responsibilities, and lifecycle management, including for NHI
Cloud-native application protection platform (CNAPP) | Consideration of security aspects at the application level, including NHI context
Cloud workload protection platforms (CWPPs) | Protection of workloads, including their identities, runtime monitoring, and vulnerability analysis
Identity threat detection and response (ITDR) | Detection of anomalous behavior by NHIs
certificates to API tokens, is not 
enough. The goal must be not just to 
store secrets in a vault, but to manage 
them in a controlled lifecycle.
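What a controlled lifecycle can look like is easiest to demonstrate with dynamic secrets. The following sketch assumes an initialized HashiCorp Vault [1] with its database secrets engine; the readonly role is a hypothetical example, and the connection and role configuration are omitted. The point is that credentials are issued on demand and expire on their own instead of being stored permanently:

# Enable the database secrets engine (one-off; "database/" is the default mount)
vault secrets enable database

# Request short-lived credentials for a hypothetical "readonly" role: Vault
# generates the account on demand and revokes it when the lease expires
vault read database/creds/readonly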
CNAPPs (please refer to Table 2 for 
security technology acronyms) extend 
protection to the application level. 
Among other things, they combine 
CIEM, CWPPs, and vulnerability 
management. CNAPPs are particu-
larly relevant in the context of NHI 
because they provide contextual in-
formation about workloads and their 
interactions. This information makes 
it possible to assess the risk context 
of individual identities better – for 
example, when a workload identity 
requests access to particularly sensi-
tive resources or originates from a 
vulnerable application component.
CWPPs focus on protecting workloads 
in cloud and hybrid environments. 
They monitor runtime behavior, 
identify vulnerabilities, and isolate 
or block workloads with policies. 
In terms of NHI, CWPP solutions 
provide valuable signals that reveal 
which workload identities are active, 
and in which context, and whether 
they originate from potentially com-
promised instances or exhibit suspi-
cious behavior. They therefore com-
plement the purely access-based view 
of other security technologies with an 
operational perspective.
Another key link is IGA. For NHI, too, 
owners must be named, lifecycles 
defined, and access authorizations 
regularly reviewed. Classic IGA pro-
cesses such as recertification and the 
joiner-mover-leaver (JML) principle 
can be adapted to ensure control and 
accountability for non-human identi-
ties, too. However, these cases require 
customized workflows and an evalu-
ation logic that draws on technical 
metadata and usage patterns rather 
than personal attributes. Additionally, 
a high degree of automation is neces-
sary, if only because of the volatility 
of NHIs and their large numbers.
Last but not least, interaction with 
ITDR plays a special role. NHIs oper-
ate in a highly automated way, often 
in the background, which makes 
them particularly vulnerable to mis-
use and difficult to monitor. Only 
through behavior-based analysis are 
anomalies, such as the misuse of a 
secret or the expansion of an access 
pattern, detected in a timely fashion. 
ITDR therefore significantly boosts 
the ability to respond to threats in 
the context of NHIs and must be an 
integral part of any security strategy 
in this area.
What Is Delivered and What 
Is Missing
Many of the products currently mar-
keted as NHI management primarily 
address the management of secrets, 
but less so the entire identity and 
authorization model. Several dimen-
sions need to be taken into account:
 Secret vs. identity: A secret is not 
the same as an identity. Secrets are 
access credentials; identity defines 
the entity, its characteristics, and 
responsibilities.
 Static vs. dynamic: Long-lived se-
crets contradict security principles. 
Ephemeral identities with short-
lived tokens are the goal.
 Credentials vs. entitlements: 
Possession of a secret alone says 
nothing about entitlements. The 
mapping of identity to entitlement 
is crucial.
In practice, many products lack a 
consistent view of the relationship 
between identity, assigned secret, 
technical and organizational owner-
ship, and actual access rights. Indi-
vidual components such as token 
issuers, vaults, or identity providers 
(IDPs) typically operate in isolation 
and without consistent policies and 
enforcement, which creates gray areas 
Figure 1: Human and non-human identities play an equal role in the identity fabric.
and agility requirements of software 
development.
Also important is that the identity 
fabric has an integrative meta model 
that can map different identity types, 
credential types, usage contexts, and 
trust levels. This model serves as the 
basis for automated decisions, such 
as granting temporary access rights or 
escalating rule violations. It also helps 
to implement regulatory requirements 
such as traceability, data residency, 
and client separation for NHI.
Such a strategic concept must also be 
designed for heterogeneity from the 
outset. In reality, companies typically 
rely on multiple cloud platforms, a 
variety of vault technologies, and 
different approaches to software de-
velopment. A central identity fabric is 
required to orchestrate this diversity 
without artificially restricting it. The 
goal is comprehensive control, not 
the homogenization of tools, which is 
why modularity is a key success fac-
tor: Organizations need to be able to 
rely on interoperable building blocks 
that can be flexibly integrated into 
existing landscapes.
Finally, the integration of NHI into 
the identity fabric also has a cultural 
component. Cooperation between 
IT security, IAM, cloud governance, 
and software development must be 
institutionalized, which can only be 
achieved through clearly defined pro-
cesses, coordinated interfaces, and a 
common vision. The identity fabric is 
thus not only a technological archi-
tecture, but also the organizational 
framework for modern, scalable iden-
tity management. NHIs are therefore 
no longer an exception in this con-
struct – they are an integral part of it.
Organizational Challenges
Responsibility for NHIs typically 
lies between software development, 
and access model. The identity fabric 
forms the structural and conceptual 
backbone, enabling different types of 
identities with their specific require-
ments to be managed consistently 
and holistically.
An identity fabric typically includes 
functions for identity provisioning, 
authentication, authorization, gover-
nance, and access protection across 
platform boundaries. For NHIs, it 
means that the NHI must not be 
treated as a special case, but as an 
equivalent entity with the same re-
quirements for traceability, control, 
and automation, necessitating a clear 
extension of classic IAM models to 
include NHI-specific elements, such 
as those for managing ephemeral 
workloads, cross-platform secrets, or 
autonomous agents.
A strategic NHI approach within the 
identity fabric begins with a complete 
inventory of all NHIs. This discov-
ery process must be continuous and 
include both declarative sources (e.g.,
infrastructure definitions) and observable
sources (e.g., runtime data). On this
basis, a clear assignment of respon-
sibilities follows: Who is the owner 
of an identity? Who is allowed to use 
it? Who controls the assigned permis-
sions? Without this governance, NHI 
management remains fragmented and 
difficult to audit.
Additionally, the identity fabric must 
also provide the technical mecha-
nisms for security and enforcement. 
These mechanisms include automated 
provisioning and deletion processes, 
standardized interfaces for integrating 
vaults and policy engines, and central 
control mechanisms for access control 
and role management. A policy-as-
code approach, with code generated 
automatically on the basis of poli-
cies, can help enforce policies con-
sistently across system and platform
boundaries while meeting the speed 
in the security architecture, especially 
where identities are generated by 
automated processes outside of tradi-
tional IAM provisioning.
Another shortcoming is the lack of 
defined and managed lifecycle man-
agement for NHI. Whereas typical 
events for human identities, such as 
entry, role changes, or departure, are 
clearly defined and automated, NHI 
has no comparable triggers. Without 
explicit definitions of expiration dates, 
dependencies, or usage context, many 
identities remain active even though 
they are no longer needed. This type 
of shadow identity poses a significant 
risk, especially in combination with 
overprivileged secrets.
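Kubernetes at least hints at what explicit expiry can look like for a workload identity. In this sketch (the service account name and token duration are arbitrary examples), the issued token simply stops working after 10 minutes rather than lingering as a shadow credential:

# Create a service account for a hypothetical batch job
kubectl create serviceaccount report-runner

# Issue a bound token that expires automatically after 10 minutes
kubectl create token report-runner --duration=10m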
Moreover, the integration of analysis 
and response mechanisms shows 
weaknesses. Only a few products 
offer native support for continuous 
monitoring of secret usage or detect 
anomalies in the behavior of individ-
ual workloads. As already mentioned, 
a closer link to ITDR is essential to 
evaluate the behavior of non-human 
identities on a situational basis and 
initiate automatic countermeasures. 
These functions are still the exception 
rather than the rule today.
Table 3 shows which vaults are 
commonly used today and in which 
contexts. These vaults need to be 
incorporated into overarching NHI 
management to achieve a centralized 
view of identities, policies, secrets, 
and access.
NHI as Part of the Identity 
Fabric
An isolated view of NHI manage-
ment falls short. In modern IT 
landscapes, which are increasingly 
characterized by hybrid, dynamic, 
and distributed architectures, all 
identities – human and non-human – 
must be part of a common identity 
Table 3: Important NHI Management Vaults

Vault | Provider or Technology | Typical Area of Application
HashiCorp Vault [1] | Open source and enterprise | Multicloud, development security operations (DevSecOps), Kubernetes
AWS Secrets Manager [2] | Amazon Web Services (AWS) | AWS services, Lambda, Elastic Container Service (ECS), etc.
Azure Key Vault [3] | Microsoft Azure | Entra ID, functions, app services
Google Secret Manager [4] | Google Cloud Platform (GCP) | GCP-native workloads, IAM integration
DevOps, IAM, and IT security. This 
shared responsibility model often 
leads to gaps. Clear role assignments 
are necessary:
 IAM/ IT security defines gover-
nance and security mechanisms.
 Software development consumes 
services and vaults as part of agile 
processes.
The goal is cooperation by division 
of labor. Security must not slow 
down development but must support 
it through automated services and 
guidelines. In DevOps environments 
in particular, different vaults are used 
in parallel (Table 3). These vaults 
must be identified, managed, and in-
tegrated in a controlled way.
An additional challenge arises from 
the lack of standardization of re-
sponsibility models for non-human 
identities. Although roles and re-
sponsibilities for human users are 
often defined as part of onboarding 
processes and organizational struc-
tures, comparable mechanisms are 
often lacking for NHIs. Organizations 
therefore need defined procedures 
for assigning technical ownership 
that are clearly documented and 
regularly reviewed, which also in-
cludes processes for transferring 
responsibilities when projects 
change or technical owners leave the 
organization.
Equally important is the integration of 
security requirements through deploy-
ment pipelines in application develop-
ment. Security guidelines must be for-
mulated and implemented such that 
they can be seamlessly integrated into 
existing CI/ CD processes. Instead of 
checking security as a separate con-
trol instance downstream, audits and 
policy checks should be an integral 
part of automation. In this way, both 
security and development goals can 
be achieved efficiently without creat-
ing conflicting objectives.
Conclusion
Non-human identities are a central 
element of modern IT landscapes. 
Their secure management requires 
more than ad hoc solutions. Compa-
nies need a holistic strategy for NHI 
management that is embedded in an 
identity fabric and tailored to cloud 
and DevOps realities. Future-proof 
NHI management must be based on 
a modular architecture principle that 
takes into account the diversity of 
platforms, vaults, and development 
methods used, allowing the com-
bination of agility with central 
controllability.
For this reason, central governance re-
quirements must be combined on an 
organizational level with decentral-
ized implementation options within 
development teams. This tension can 
only be resolved through defined in-
terfaces, coordinated role models, and 
common goal definitions. Securing 
non-human identities is crucial to the 
resilience of digital infrastructures. 
The challenges of dealing with NHI 
affect not only IT departments, but 
the entire organization. 
Info
[1] HashiCorp Vault: [https://www.hashicorp.com/en/products/vault]
[2] AWS Secrets Manager: [https://aws.amazon.com/secrets-manager/]
[3] Azure Key Vault: [https://azure.microsoft.com/en-us/products/key-vault]
[4] Google Secret Manager: [https://cloud.google.com/security/products/secret-manager]
Author
Martin Kuppinger is the founder of and Principal 
Analyst at KuppingerCole Analysts AG.
Keywords: non-human, identity, NHI, management, attribute, 
CIEM, access, NHA, modular, role, security, automation
Prometheus suffers from a structural 
problem: It does not offer a true clus-
ter mode. A single instance stores its 
data locally and responds to queries 
from this local database. High avail-
ability therefore requires a separate 
design. Many teams solve this prob-
lem with the use of two Prometheus 
instances that query the same targets 
and with graphical or logical abstrac-
tions that merge the two data sources 
(e.g., in Grafana).
This solution increases availability, 
but it does not eliminate the fun-
damental problem of scaling. Each 
instance continues to back up locally, 
each instance compresses its own 
data, and each instance only stores its 
own data. Metrics volumes that grow 
and retention times that become lon-
ger mean more than a significant loss 
of convenience, because you have to 
deal with multiple points of adminis-
tration. If you have several locations 
with the same setup, the unpredict-
ability of slow connections between 
them adds to the problem.
In these scenarios, Cortex [1] [2] 
enters the scene. Cortex is directly 
related to Prometheus, because it oper-
ates in the same data model and pro-
tocol world and natively understands 
Prometheus data. It fields Prometheus 
metrics, stores them long term on scal-
able back ends, and makes them avail-
able again for queries. Instead of each 
Prometheus instance keeping its entire 
dataset locally, Prometheus transfers 
the data to Cortex at defined intervals, 
typically with its remote write module. 
Cortex then assumes responsibility for 
long-term storage and distributes the 
data across multiple instances of itself 
that scale horizontally, which means 
you can offload the pressure from the 
individual Prometheus instance to a 
system designed for scaling.
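In its simplest form, this handover is a few lines in the Prometheus configuration. The file path and the distributor URL below are assumptions for a Cortex instance reachable inside the cluster; /api/v1/push is the push endpoint Cortex exposes for remote write:

# Append the handover to Cortex to an existing Prometheus configuration
cat >> /etc/prometheus/prometheus.yml <<'EOF'
remote_write:
  - url: http://cortex-distributor.cortex.svc.cluster.local:8080/api/v1/push
EOF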
Monitoring
Containers have secured their place in 
everyday IT, and they are here to stay. 
As new as the deployment mechanism 
may be, Kubernetes (K8s) [3] and the 
like face exactly the same challenges 
in everyday operation as their conven-
tional counterparts. At the top of the 
scale of operational pain points for 
containers, much as in conventional 
environments, is monitoring, because 
if something goes wrong in Kuber-
netes, you want to know about it, just 
as when operating typical monoliths in 
legacy environments.
Monitoring in legacy environments 
has long followed a familiar pattern: 
You monitor hosts and services, check
system statuses, and respond to events 
flagged by the monitoring system. If a 
service fails, a check triggers an alert. 
If a value exceeds a threshold, the 
system reports an error. This model 
is still in place today in the majority 
of conventional setups. In contrast, 
container platforms fundamentally 
shift the requirements. In K8s environ-
ments, plain vanilla event monitoring 
is no longer fit for the purpose because 
stability depends not just on whether 
or not something is running.
Container workloads are rarely binary 
or simple. Instead, they gradually run 
into difficulties: increasing latencies, 
CPU or memory bottlenecks, satu-
rated networks, excessive requests, 
and full buffers on the network and in 
storage are just a few of the potential 
issues. These kinds of developments 
need to be identified by analyzing 
time series; otherwise, you only see 
the end state in the form of a failure, 
which is precisely why event logging 
is seeing a second aspect coming to 
the fore in container platforms: trend-
ing – that is, continuous monitoring 
of utilization and behavior over time.
The value that legacy event moni-
toring systems such as Nagios [4], 
Icinga [5], or Checkmk [6] add in 
this scenario is limited. These tools 
are an excellent choice for static hosts 
and static services where you can 
clearly define thresholds for each 
check. They record statuses, generate 
warnings, and provide an actionable 
list of problems.
Trending, on the other hand, is a very 
different function. Historical evalua-
tions are often incomplete, depending 
as they do on retention parameters, 
or the data could end up in graphs 
that look great but do not allow for 
real-time series analysis. What is 
more, the systems do not scale well in 
container environments because the 
number of targets to be monitored is 
constantly changing.
Prometheus is the standard application when it comes to 
monitoring, alerting, and trending, but the software is slow 
when faced with a large volume of historical data. Cortex comes 
to the rescue and offers cluster support, as well. By Martin Loschwitz
Long-Term Prometheus Data Storage with Cortex
 Trend Scout
Photo by FLOUFFY on Unsplash
Of course, this change is a key aspect 
of container environments: Applica-
tions and their instances come and 
go dynamically. A Kubernetes clus-
ter creates and destroys pods every 
second in high-traffic setups, mov-
ing workloads between nodes and 
automatically scaling deployments. 
A monitoring approach that needs to 
input every host and every service 
manually cannot hope to keep up 
with the pace.
Where Have All the Targets 
Gone?
The next key challenge is reliably 
capturing the targets. In conventional 
environments, you know your servers 
and enter them as static objects. In 
container environments, these objects 
might not even exist. Pods in Kuber-
netes are created dynamically, and if 
you use mesh tools such as Istio [7], 
services are sometimes even given 
completely new endpoints.
Monitoring that can be used effectively 
in K8s needs to detect these changes 
and respond to them. The system must 
determine for itself which endpoints 
provide metrics, and it has to refresh its 
internal list of these endpoints at short 
intervals. Even a well-maintained data 
center inventory management (DCIM) 
and dashboard logic for Kubernetes 
itself and for hosts and services (Fig-
ure 1), and Alertmanager distributes 
the alerts generated by Prometheus. 
Together, this trio forms the basis 
for observability in platforms where 
workloads change dynamically and 
traditional monitoring models fail.
Data Flood
Prometheus also offers a crucial feature 
that container environments absolutely 
need: automatic service discovery 
(SD). In Kubernetes, Prometheus uses 
its API to identify pods, services, and 
endpoints automatically. Admins also 
define scrape jobs and labels that 
regularly query the data and store the 
results in Prometheus in a structured 
way. Prometheus itself continuously 
updates the metrics data from its recog-
nized sources, eliminating the biggest 
hurdle that legacy systems face: manu-
ally maintaining a constantly chang-
ing inventory list. Prometheus works 
closely with the platform and follows 
its reality every step of the way.
However, Prometheus soon reaches 
its limits, a fact that quickly becomes 
apparent in larger environments. 
Prometheus’ local storage saves time 
series on the local server drive. As 
hours pass, the data volume grows, 
tool or a server and service database 
(configuration management database, 
CMDB) is not particularly helpful here. 
After all, the reality in clusters changes 
far faster than the documentation can 
ever hope to. Monitoring then becomes 
a question of integration into the or-
chestration logic that already exists in 
K8s: If you monitor Kubernetes, you 
have to understand it.
In this context, Prometheus [8] has 
established itself as the standard appli-
cation. It combines three features that 
are crucial in container and platform 
environments: It stores metrics as time 
series, actively pulls data from export-
ers, and has a powerful query lan-
guage in the form of PromQL, which 
makes the tool ideal for trending, 
capacity analysis, and understanding 
system behavior over time.
State monitoring is more-or-less a by-
product, because the number of httpd 
services running at any point in time 
is also a time series, but admittedly 
a very short one. Automated pro-
cesses such as alerts can be tailored 
to this scenario. In combination with 
Grafana [9] and Prometheus’ own 
Alertmanager [10], you have a combi-
nation currently considered the gold 
standard by many teams.
Prometheus provides the time se-
ries, Grafana has the visualization 
Figure 1: Grafana is the miracle tool for visualizing metrics data in Prometheus, as shown in this example from Kubernetes. © CNCF
and the bigger the history collection 
becomes, the greater the demands on 
I/ O, CPU, and memory become.
Although Prometheus itself works ef-
ficiently, it remains a system designed 
for a small number of time series. 
Queries over large periods of time tend 
to take quite a while to complete, and 
compressing and storing data long-term 
exposes the hardware to excessive load. 
Admins constantly have to keep an eye 
on the data repositories. To keep a long 
story short, the more historical data 
Prometheus needs to keep, the slower 
it becomes in everyday use – especially 
where large numbers of metrics and 
labels come together.
Help in Sight
Cortex not only addresses the issue of 
more history, it also addresses opera-
tion in larger, distributed structures. 
The tool consists of a distributed sys-
tem of several components, such as 
the distributor, the ingester, the store, 
and the query service, and follows the 
principle of microarchitecture: Each 
of the components listed here is re-
sponsible for precisely one task.
This architecture separates the tasks 
of processing incoming data, storing 
the data, and retrieval. Each layer 
scales independently horizontally, 
leading to a monitoring back end that 
grows with the platform, while Pro-
metheus is at the forefront, handling 
the monitoring and trending.
Cortex uses object storage such as 
Amazon S3-compatible targets – think 
a local instance of the Ceph Object 
Gateway or other scalable variants – for 
storage. Therefore, the local drives of 
the individual Prometheus servers lose 
their central importance. They only 
contain a core set of data that must re-
main accessible for quick access. Que-
ries are no longer run against a single 
local Prometheus instance, but against 
a distributed storage system that pro-
vides long-term data in a powerful way.
Administrators can look forward to a
monitoring and trending architecture 
that reliably captures all the relevant 
vital signs in extremely dynamic con-
tainer environments and then visual-
izes trends over months and years 
without a single service instance col-
lapsing under the weight of its own 
history.
On the basis of an existing Kuber-
netes cluster, I describe how to roll 
out Prometheus, Prometheus Alert-
manager, and Grafana and how to use 
the Prometheus Node Exporter to ac-
quire metrics for key vital signs of the 
hardware platform. Later, I also look 
into processing the metrics data from 
applications in detail.
Installing Prometheus
Helm [11] has established itself as 
the ideal solution for integrating 
Prometheus, its Alertmanager, and 
Grafana. It consistently versions the 
components, cleanly resolves depen-
dencies between them, and automati-
cally casts an immutable configuration 
into declarative statements for K8s.
To begin, you need to set up a 
dedicated namespace (e.g., monitor-
ing), and then use the kube-pro-
metheus-stack [12] Helm chart 
package to roll out Prometheus in 
the form of its own operator [13]; 
the Alertmanager, Grafana, and 
kube-state-metrics [14]; plus the 
Prometheus Node Exporter as a coor-
dinated set across the cluster. In this 
way, the entire cluster has a com-
plete monitoring basis – essentially 
launched in a single command line 
– without the need to put together de-
ployments and services individually.
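Condensed into commands, the rollout looks roughly like the following sketch. The release name monitoring is an arbitrary choice, and site-specific settings would normally go into a values file:

# Register the community chart repository and refresh the index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Roll out the operator, Prometheus, Alertmanager, Grafana, kube-state-metrics,
# and the Node Exporter in one pass
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace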
The most important aspect for the suc-
cess of this endeavor is a clean labeling 
strategy in Prometheus because the 
application recognizes targets in K8s 
by label and logically groups metrics 
on that basis. If you ensure that all 
monitoring components have consis-
tent labels (e.g., app.kubernetes.io/
part-of=monitoring and app.kubernetes.
io/managed-by=helm), you are implicitly 
enabling the automatic detection of 
services and their grouping within Pro-
metheus itself.
For workloads that provide their own 
metrics, you ideally also want to es-
tablish a uniform schema on the basis 
of parameters such as team, service, 
env, or component. These labels end up 
both on the objects created in K8s and 
later in label queries in Prometheus, 
meaning that queries, dashboards, 
and alarms are governed by a fixed 
structure.
Ideally, you will also separate your 
platform metrics from application 
metrics in line with best practices with 
the use of namespaces or additional 
labels, such as metrics=platform and 
metrics=app, which allows the data to 
be accessed separately in services such 
as Grafana, preventing Grafana dash-
boards from becoming too chaotic.
After the install, Node Exporter (Fig-
ure 2) provides the host metrics of 
the physical systems. The Helm chart 
used here installs the exporter as a 
daemon set, giving each K8s node a 
local metric source. Prometheus auto-
matically scrapes these targets because
the stack chart contains matching 
service monitor entries. Finally, you 
need to check in the Prometheus user 
interface whether the targets appear 
there as expected and whether en-
tries such as those for node-exporter, 
kubelet, kube-state-metrics, and 
Prometheus’s own components are 
present there. If these targets have an 
UP status, basic data acquisition is 
working reliably, which means that 
the cluster will field CPU, RAM, disk, 
and network metrics for each node, 
plus status metrics for deployments, 
pods, daemon sets, and many other 
Kubernetes objects.
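You can run the same check from the command line instead of the web interface. The prometheus-operated service is created by the Prometheus Operator; the jq filter is just one way of condensing the JSON response:

# Forward the Prometheus HTTP port to the local machine
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 2   # give the port forward a moment to establish

# List every active target with its job name and health state
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'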
Fine-Tuning
The Prometheus Operator is respon-
sible for automatic service discovery 
in the Kube Prometheus stack. It 
observes K8s objects and generates 
scrape configurations for Prometheus 
from them, which means you no lon-
ger need legacy scrape_config entries 
for applications but instead create 
ServiceMonitor or PodMonitor objects 
in Kubernetes. A service monitor de-
scribes a service with a label selector, 
plus the port and path for the /metrics 
endpoint.
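A minimal ServiceMonitor for a hypothetical application might look like the following sketch. The release=monitoring label matches the selector that the kube-prometheus-stack chart configures by default for the release name chosen earlier; all other names are placeholders:

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app
  namespace: monitoring
  labels:
    release: monitoring        # picked up by the operator's selector
spec:
  selector:
    matchLabels:
      app: demo-app            # must match the labels on the target service
  namespaceSelector:
    matchNames:
      - demo
  endpoints:
    - port: http-metrics       # named port on the service
      path: /metrics
      interval: 30s
EOF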
As soon as a team assigns a suitable 
label to a K8s service, Prometheus 
automatically adds the endpoint to the 
list of targets to be monitored. This 
pattern scales because you no longer 
with Grafana’s Prometheus source 
plugin. For admins, this means leav-
ing Prometheus in place as the data 
collector and for real-time queries in 
the cluster but shifting the long-term 
history to Cortex.
Deployment takes place on the same 
Kubernetes cluster as Prometheus, 
preferably also in the monitoring 
namespace or in a separate 
namespace such as cortex. This 
arrangement makes sense if 
separation of concerns is important 
in your organization.
Helm is again recommended for the 
installation, because most Cortex de-
ployments are highly specific to their 
environments and can be handled 
easily on a per cluster basis with Cor-
tex Helm charts and values.yaml files. 
Much like Prometheus, Cortex does 
not consist of a single pod, but of 
several components that perform dif-
ferent tasks. Deployment in K8s must 
take this setting into account.
cluster status, node utilization, and 
workload behavior. You need to con-
nect Grafana to Prometheus as a data 
source, enable the chart’s predefined 
dashboards, and then add your 
own views according to the previ-
ously defined labeling strategy. That 
completes the Prometheus installa-
tion. All that’s missing is the cluster 
mechanism.
Operating Cortex
After the basic installation of the 
monitoring system, the next step 
is to convert plain vanilla cluster 
monitoring into a scalable metrics 
back end. In other words, you need 
to roll out Cortex (Figure 3), which 
relies on the remote_write interface 
to field metrics as described, before 
putting them into long-term storage, 
while remaining alert to queries from 
PromQL-compatible query endpoints, 
which also makes it fully compatible 
need to maintain a centrally managed 
target list. The platform identifies new 
services by labels and automatically 
includes them in the scraping process.
Alertmanager extends this setup to 
include central routing for all alerts. 
To do this, you need to set up routing 
rules in the Alertmanager configura-
tion by severity, team assignment, 
and namespace. External tools for 
delivering alerts also need to be cre-
ated here (e.g., for mailing or Matrix 
messages).
Prometheus, in turn, relies on 
PromQL rules to generate alerts, and 
you manage the rules as Prometheus-
Rule objects directly in Kubernetes. 
Assigning consistent labels (e.g., team 
and severity) to ruleset objects is 
important so that Alertmanager can 
route them in a targeted way – and to 
support deduplication.
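A PrometheusRule object then carries exactly these labels so that Alertmanager can route and deduplicate the alert. The threshold and names below are examples only; the node_filesystem_* metrics come from the Node Exporter installed earlier:

kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-capacity
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeFilesystemAlmostFull
          expr: node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes < 0.10
          for: 15m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Less than 10% free space on {{ $labels.instance }}"
EOF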
Grafana closes the loop by querying 
dashboard data from Prometheus 
and providing visualizations for the 
Figure 2: Grafana evaluates data from Node Exporter and visualizes the results, as shown here with an example of a simple Raspberry Pi. 
© Grafana
Storage as a Backbone
A stable Cortex installation stands 
and falls with its storage back end. 
You need to choose an object storage 
system that is compatible with the S3 
API (e.g., MinIO [15] on the cluster or, 
as mentioned, the Ceph [16] Object 
Gateway [17]). Cortex writes its data 
directly to this object storage system, 
which eliminates the need to bind the 
local disks of individual systems. For 
fast processing, Cortex also uses a key-
value store for ring and status informa-
tion, often in the form of memberlist ob-
jects or, alternatively, with Consul [18] 
or Etcd [19]. In Kubernetes, memberlist 
is becoming the norm because it does 
not have external dependencies and is 
well-suited to dynamic environments, 
ensuring stable DNS names for the mem-
berlist entries, which in turn enables 
the individual Cortex services to find 
each other for communication.
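What this looks like in the Cortex configuration is sketched below; with the Cortex Helm chart, the same block typically ends up under the chart's config value. The MinIO endpoint, bucket name, and credentials are placeholders:

cat > cortex-storage.yaml <<'EOF'
blocks_storage:
  backend: s3
  s3:
    endpoint: minio.monitoring.svc.cluster.local:9000    # placeholder object store
    bucket_name: cortex-blocks
    access_key_id: cortex
    secret_access_key: changeme
    insecure: true             # plain HTTP inside the cluster
memberlist:
  join_members:
    - cortex-memberlist.cortex.svc.cluster.local          # stable DNS name for gossip
EOF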
For the connection between Cortex 
and Prometheus, you need to add a 
remote_write block to the Prometheus configuration that points to an in-
stance of the Cortex distributor. In an 
operator-based setup, as rolled out by 
the Helm chart, you will use the Pro-
metheus Custom Resource Definition 
(CRD) for this purpose and add the 
remote write endpoint there.
Prometheus is responsible for query-
ing all the active exporters – that is, 
the Node Exporter, kube-state-met-
rics, and all application-specific 
ServiceMonitor objects, but it ad-
ditionally forwards scraped samples 
to Cortex. To allow this to happen, 
the creation of additional labels in 
Prometheus to designate the envi-
ronment or cluster (external label-
ing) makes sense. A label such as 
cluster= or platform= 
prevents time series from different 
Prometheus instances being mixed up 
later. The label automatically ends up 
as an identifier for each time series 
on the Cortex back end and supports 
queries across multiple clusters with-
out collisions because of identical job 
or instance names.
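With the kube-prometheus-stack chart, both settings live under prometheus.prometheusSpec and can be applied as an upgrade. The release name, cluster label, and Cortex URL are carried over from the earlier examples and remain assumptions:

cat > cortex-remote-write.yaml <<'EOF'
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: prod-eu-1       # example identifier for this cluster
    remoteWrite:
      - url: http://cortex-distributor.cortex.svc.cluster.local:8080/api/v1/push
EOF

helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values -f cortex-remote-write.yaml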
Everyday Use
A pattern has established itself in op-
eration, with Prometheus continuing 
to be used for short-term queries and 
Grafana querying Cortex for long-
term views. Grafana accesses two 
data sources for this purpose: one for 
Prometheus and one for Cortex. On 
your dashboards, you need to define 
which panels require short-term de-
tailed data and which panels will dis-
play the long-term history (Figure 4).
Cortex provides this history with the 
query endpoint, whereas Prometheus 
communicates directly with Grafana. 
In this way, the query language re-
mains the same, but the back end 
changes. The operator stack fits in 
neatly because service discovery 
continues to take place in Pro-
metheus, and Cortex does not query 
any data itself.
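With the chart used here, the second data source can be added declaratively through the grafana.additionalDataSources value. The Cortex query URL, including the /api/prom prefix, depends on how the query front end is exposed and is only an assumption in this sketch:

cat > grafana-cortex.yaml <<'EOF'
grafana:
  additionalDataSources:
    - name: Cortex             # long-term history
      type: prometheus
      access: proxy
      url: http://cortex-query-frontend.cortex.svc.cluster.local:8080/api/prom
EOF

helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values -f grafana-cortex.yaml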
Multicluster Scenarios
Multiple Cortex instances within 
the same K8s cluster can be imple-
mented in different ways. Here, I 
look at two models. The first relies 
on multitenancy to separate tenants 
within a central Cortex installation. 
Cortex supports tenant IDs by entries 
in the HTTP header of incoming re-
quests. You need to define a tenant 
ID for each cluster or each team in 
Prometheus to keep all the data logi-
cally separate, even though the same 
Cortex cluster is used. This model 
reduces resource requirements, sim-
plifies operation, and creates a central 
query layer. However, it does pose a 
challenge in terms of separation of 
concerns: The data is stored in the 
same repositories, and auditors might 
raise an eyebrow at that.
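Technically, the tenant ID travels as an HTTP header on every remote write request. Prometheus supports additional headers per remote_write target, and Cortex evaluates the X-Scope-OrgID header when multitenancy is enabled; the tenant name, file path, and URL below are examples:

cat >> /etc/prometheus/prometheus.yml <<'EOF'
remote_write:
  - url: http://cortex-distributor.cortex.svc.cluster.local:8080/api/v1/push
    headers:
      X-Scope-OrgID: cluster-prod-eu-1   # tenant ID for this Prometheus instance
EOF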
The second model avoids this prob-
lem by rolling out multiple sepa-
rate Cortex stacks within the same 
platform (e.g., for different security 
zones or platform teams). In this 
way, you can ensure hard isolation 
at the Kubernetes level, but at the 
cost of significantly greater opera-
tional overhead. In this model, the 
instances are assigned separate 
buckets or separate prefixes in ob-
ject storage so that block collisions 
cannot occur. You also need to use 
different block storage – for ex-
ample, different storage classes (in 
Kubernetes).
Figure 3: Cortex follows the microarchitecture application principle and comprises several 
components that work together with Prometheus and Grafana. © Cortex
Conclusion: Easy to Do
Monitoring applications and plat-
forms with Kubernetes, Prometheus, 
Grafana, and Cortex promises an 
extremely flexible monitoring archi-
tecture. It keeps pace with the re-
quirements of modern, scalable apps 
and neatly integrates trending. If you 
have been used to Checkmk or Nagios, 
it could be a considerable change 
to familiarize yourself with all the 
components that make up the stack. 
However, the bottom line is a solution 
whose performance far exceeds that 
of legacy event monitoring. I can only 
encourage anyone who works with 
Kubernetes to take the plunge – if 
only for your own peace of mind. 
Info
[1] Cortex on GitHub: [https://github.com/cortexproject/cortex]
[2] Cortex: [https://cortexmetrics.io/]
[3] Kubernetes: [https://kubernetes.io/]
[4] Nagios: [https://www.nagios.org/]
[5] Icinga: [https://icinga.com/]
[6] Checkmk: [https://checkmk.com/]
[7] Istio: [https://istio.io/]
[8] Prometheus: [https://prometheus.io/]
[9] Grafana: [https://grafana.com/]
[10] Prometheus Alertmanager: [https://prometheus.io/docs/alerting/latest/alertmanager/]
[11] Helm: [https://helm.sh/]
[12] kube-prometheus-stack: [https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/README.md]
[13] Prometheus Operator: [https://github.com/prometheus-operator/prometheus-operator]
[14] kube-state-metrics: [https://github.com/kubernetes/kube-state-metrics]
[15] MinIO: [https://www.min.io/]
[16] Ceph: [https://ceph.io/en/]
[17] Ceph Object Gateway: [https://docs.ceph.com/]
[18] Consul: [https://www.consul.io/]
[19] etcd: [https://etcd.io/]
A hub-and-spoke approach is recom-
mended for platforms that spread 
across cluster boundaries. Each K8s 
cluster runs Prometheus locally for 
scraping and short-term queries but 
uses remote_write to send metrics to 
a central Cortex back end running 
either on a dedicated observability 
cluster or as a standalone platform 
instance. You can use (external) 
labels, typically with the values for 
cluster, region, and environment, to 
enforce a clean identity for the in-
coming data.
Grafana then accesses this central 
Cortex instance and creates global 
dashboards that map multiple plat-
forms simultaneously. The approach 
scales across locations as long as a 
network path exists between Pro-
metheus and the central Cortex. 
In practice, companies tend to rely 
on TLS-secured ingress endpoints 
in K8s, mutual (m)TLS between 
clusters, or dedicated private net-
work connections for this purpose. 
However, the latter are difficult to 
implement in public clusters and 
are usually reserved as a feature for 
private clouds.
Keywords: Kubernetes, Prometheus, Grafana, Cortex, data, storage, trending, monitoring, alerting
The Author
Martin Loschwitz is the 
founder and managing 
director of True West IT 
Services GmbH, which offers 
scalable IT infrastructure 
based on OpenStack and Kubernetes.
Figure 4: Visualization of long-term trending enables not only the rapid identification of specific problems, but also the detection of 
long-term trends, such as the emergence of alerts, before they become a problem. © Grafana
In Japanese, Kuma means bear, 
which the Ainu associate with protec-
tive qualities. Uptime Kuma [1], [2] is 
a little bear that keeps a watchful eye 
on your websites, servers, and ser-
vices 24x7, as described by developer 
Louis Lam. What began in July 2021 
as a personal solution to a specific 
problem has grown into one of the 
most successful self-hosted monitor-
ing tools.
The story behind Uptime Kuma is 
typical of open 
source projects: 
Lam was looking for 
a free, self-hosted 
monitoring tool 
with a state-of-the-
art interface. He 
was unimpressed 
by the alternatives 
available at the 
time: statping-ng 
was no longer ac-
tively maintained 
and seemed out-
dated. The free 
version of Uptime-
Robot, a software-
as-a-service (SaaS) 
solution, proved to 
be too limited in 
its scope; to make 
matters worse, it’s not open source. 
All of these problems prompted Lam 
to write his own tool.
The numbers speak for themselves: 
nearly 79,000 GitHub stars and 
more than 127 million Docker pulls 
make Uptime Kuma one of the most 
popular projects in its category. The 
growth curve is also impressive: In 
August 2021, just a few weeks after 
the initial release, the project reached 
its first 1,000 stars. Within a year, that 
number rose to more than 20,000, 
with
