Daily Health Checks
# Full dependency status
curl http://127.0.0.1:8080/dependencies/health
# Coordinator recent logs
docker logs otheru-coordinator --since=1h
# JetKVM bridge live check (frame_age should be < 100ms)
curl http://127.0.0.1:8005/stats
# GSD loop status (if running)
curl http://127.0.0.1:8090/gsd/report
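The frame-age threshold in the bridge check above can be tested automatically. This is a minimal sketch, not the bridge's documented schema: it assumes /stats returns a JSON object with a numeric frame_age field in milliseconds. The helper names are hypothetical, so adjust the key and unit to match the real payload.

```python
import json
import urllib.request

STATS_URL = "http://127.0.0.1:8005/stats"  # bridge stats endpoint from the checks above
FRAME_AGE_LIMIT_MS = 100.0                 # threshold stated in the runbook


def frame_age_ok(stats, limit_ms=FRAME_AGE_LIMIT_MS):
    """Return True when the reported frame age is below the limit.

    ASSUMPTION: the payload carries a numeric 'frame_age' field in milliseconds.
    """
    age = stats.get("frame_age")
    if isinstance(age, bool) or not isinstance(age, (int, float)):
        return False  # a missing or non-numeric value counts as unhealthy
    return age < limit_ms


def fetch_stats(url=STATS_URL, timeout=5.0):
    """Fetch and decode the bridge stats JSON (needs the bridge to be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling frame_age_ok(fetch_stats()) from a scheduled job gives a pass/fail signal for the daily check.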
Coordinator
# Restart
docker restart otheru-coordinator
# Follow logs
docker logs otheru-coordinator -f --since=5m
# Verify routing for a prompt
docker exec otheru-coordinator python3 -c "
import sys; sys.path.insert(0, '/app')
from routing import needs_tools
print(needs_tools('your prompt here'))
"
Hardware Bridge (JetKVM)
# Bridge status
curl http://127.0.0.1:8005/stats
# Capture a screenshot
curl http://127.0.0.1:8005/screenshot | python3 -c \
"import sys,json,base64; d=json.load(sys.stdin); \
open('/tmp/desktop.jpg','wb').write(base64.b64decode(d['image_base64']))"
# Start a screen recording
curl -X POST http://127.0.0.1:8005/record \
-H 'Content-Type: application/json' \
-d '{"output_path":"/tmp/recording.mp4","duration":60,"fps":15}'
# Patch bridge.py
docker cp /path/to/bridge.py hardware-bridge:/app/bridge.py
docker restart hardware-bridge
Agent Memory Management
# List all agent states + memory usage
curl http://127.0.0.1:8080/agents/status
# Load an agent on demand
curl -X POST http://127.0.0.1:8080/agents/fara/load
# Unload to free memory
curl -X POST http://127.0.0.1:8080/agents/fara/unload
GSD Loop
curl -X POST http://127.0.0.1:8090/gsd/start
curl -X POST http://127.0.0.1:8090/gsd/pause
curl http://127.0.0.1:8090/gsd/report
curl http://127.0.0.1:8090/gsd/status
curl -X POST http://127.0.0.1:8090/gsd/set-objective \
-H 'Content-Type: application/json' -d '{"objective":"..."}'
Incident Triage
- Identify scope: which tier or service is failing?
- Check logs: docker logs <container> --since=10m
- Validate dependencies: curl http://127.0.0.1:8080/dependencies/health
- Apply targeted fix: restart container, patch config, reload model
- Verify recovery: re-run health checks
- Document: update runbook with root cause and fix
Common Issues
KVM requests not routing to Fara
Check that KVM-related keywords are present in routing_policy.json under tool_intent_keywords, then test a sample prompt:
docker exec otheru-coordinator python3 -c "
import sys; sys.path.insert(0, '/app')
from routing import needs_tools
print(needs_tools('use the kvm to open notepad')) # should be True
"
If the result is False, add the missing keyword patterns to core/config/routing_policy.json and restart the coordinator.
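To see which configured keywords a prompt would hit before editing the policy, a quick offline check can help. The sketch below assumes tool_intent_keywords is a flat list of plain strings matched as case-insensitive substrings. The real policy may use regex patterns or a nested layout, and needs_tools in the coordinator remains the authoritative check. All function names here are hypothetical.

```python
import json
from pathlib import Path

POLICY_PATH = Path("core/config/routing_policy.json")  # path named in the runbook


def load_policy(path=POLICY_PATH):
    """Read the routing policy JSON from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_keywords(policy):
    """Return the tool-intent keywords as lowercase strings.

    ASSUMPTION: 'tool_intent_keywords' is a flat list of plain strings.
    """
    return [str(k).lower() for k in policy.get("tool_intent_keywords", [])]


def matching_keywords(prompt, policy):
    """List the configured keywords that occur as substrings of the prompt."""
    text = prompt.lower()
    return [k for k in load_keywords(policy) if k in text]
```

An empty result from matching_keywords('use the kvm to open notepad', load_policy()) suggests the KVM keywords are missing, under the substring assumption above.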
Display goes dark after EDID change
Custom low-resolution EDIDs may lack fallback timing modes. Reset to the native display EDID:
curl -X POST http://127.0.0.1:8005/edid \
-H 'Content-Type: application/json' \
-d '{"edid_name": "T749-fHD720"}'
The call may time out, but the change still applies. Restart the bridge (docker restart hardware-bridge) to confirm.
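Because a timeout here does not mean failure, a scripted version of the EDID reset should treat it as a soft outcome rather than an error. A minimal sketch using only the standard library follows. The 10-second timeout and the function names are assumptions, not values taken from the bridge.

```python
import json
import socket
import urllib.error
import urllib.request

EDID_URL = "http://127.0.0.1:8005/edid"  # endpoint from the curl command above


def build_edid_request(edid_name, url=EDID_URL):
    """Build the same POST request the curl command sends."""
    body = json.dumps({"edid_name": edid_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_edid(edid_name, timeout=10.0):
    """Send the request; report a timeout instead of raising on it."""
    req = build_edid_request(edid_name)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"bridge responded with HTTP {resp.status}"
    except (TimeoutError, socket.timeout):
        # Read timeout: per the runbook, the change may still have applied.
        return "timed out; restart the bridge to confirm"
    except urllib.error.URLError as exc:
        # Connect-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timed out; restart the bridge to confirm"
        raise  # connection refused, HTTP errors, etc. are real failures
```

reset_edid("T749-fHD720") mirrors the curl call. Any other failure, such as a refused connection, still raises.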
Memory pressure / model load failures
Unload agents that aren't currently needed:
curl -X POST http://127.0.0.1:8080/agents/fara/unload
curl -X POST http://127.0.0.1:8080/agents/reasoner/unload
Check overall memory with free -h and docker stats --no-stream.
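For a scripted version of the memory check, the figures behind free -h can be read from /proc/meminfo on Linux. The sketch below parses that file and flags low available memory. The 10% threshold is an illustrative assumption, not a tuned value for this stack, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory the kernel estimates is available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def under_pressure(meminfo, min_available=0.10):
    """True when available memory drops below the (assumed) threshold."""
    return available_fraction(meminfo) < min_available
```

Typical use: under_pressure(parse_meminfo(open('/proc/meminfo').read())), followed by the unload calls above when it returns True.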
Coordinator doesn't start after a code change
The usual cause is a syntax error in a bind-mounted file. Validate the changed file before restarting:
cd otheru-core/core/coordinator
python3 -c "import py_compile; py_compile.compile('changed_file.py', doraise=True)"
docker restart otheru-coordinator
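The validate-then-restart step can be wrapped so the restart only happens when every file compiles. This sketch uses the same py_compile call as above. The helper names are hypothetical, and checking every .py file under the directory (rather than only the changed one) is an assumption about what is convenient.

```python
import py_compile
import subprocess
from pathlib import Path


def syntax_errors(paths):
    """Return {path: error message} for every file that fails to compile."""
    errors = {}
    for path in paths:
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors[str(path)] = exc.msg
    return errors


def restart_if_valid(source_dir, container="otheru-coordinator"):
    """Restart the container only when all sources compile; return any errors."""
    errors = syntax_errors(sorted(Path(source_dir).rglob("*.py")))
    if errors:
        return errors  # leave the running container untouched
    subprocess.run(["docker", "restart", container], check=True)
    return {}
```

For example, restart_if_valid('otheru-core/core/coordinator') returns an empty dict after a successful restart, or the per-file errors if validation fails. Note that py_compile writes bytecode into __pycache__ next to the sources.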
Release Hygiene
- Keep compose and env changes version-controlled
- Stage risky changes with a controlled rollout
- Coordinator source is bind-mounted: edit on host, never inside the container
- Bridge changes require docker cp + restart
- routing_policy.json changes require docker restart otheru-coordinator
WMMA Ops Profiling
Standard workflow
- Run timing benchmarks for kernel variants
- Capture HIP traces for launch behavior and synchronization overhead
- Use rocprofv3 where available to compare instruction and memory patterns
gfx1151 limitation
Many hardware performance counters are unavailable on consumer gfx1151 (aqlprofile-backed counters fail). Prioritize:
- Wall-clock benchmark stability
- PyTorch profiler traces
- HIP API trace analysis
Rule: Promote a kernel variant only when it passes both correctness checks and repeatable timing benchmarks. Keep the adaptive fallback enabled.
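The promotion rule can be expressed as a small gate. The sketch below is one possible reading, not the project's actual criteria. It treats "repeatable" as a coefficient of variation under 5% across at least five runs, and additionally requires the candidate's median time to beat the baseline. Both thresholds, the baseline comparison, and the function names are assumptions.

```python
import statistics


def timing_is_repeatable(samples_ms, max_cv=0.05, min_runs=5):
    """True when run-to-run variation (stdev / mean) stays within max_cv.

    ASSUMPTION: the 5% bound and five-run minimum are illustrative defaults.
    """
    if len(samples_ms) < max(min_runs, 2):  # stdev needs at least two samples
        return False
    mean = statistics.fmean(samples_ms)
    if mean <= 0:
        return False
    return statistics.stdev(samples_ms) / mean <= max_cv


def should_promote(correct, candidate_ms, baseline_ms, max_cv=0.05):
    """Promote only a correct variant with stable timings that beats the baseline."""
    if not correct:
        return False
    if not timing_is_repeatable(candidate_ms, max_cv):
        return False
    if not timing_is_repeatable(baseline_ms, max_cv):
        return False
    # ASSUMPTION: "passes timing benchmarks" means a lower median than the baseline.
    return statistics.median(candidate_ms) < statistics.median(baseline_ms)
```

Using wall-clock samples fits the gfx1151 limitation above, since no hardware counters are involved. The adaptive fallback stays enabled regardless of the outcome.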