Common issues
Diagnosis and fixes for common issues with agents, jobs, alerts, notifications, and the UI.
This page covers the most frequent problems and how to fix them. Each issue lists the symptoms, likely cause, and resolution steps.
Agent issues
Agent cannot connect to the server
Symptoms: Agent logs show connection refused or timeout errors. The agent does not appear on the service health page.
Likely cause: Incorrect server URL, network connectivity issue, or missing protocol scheme.
Fix:
- Verify the `AGENT_SERVER_URL` environment variable includes the scheme (`http://` or `https://`)
- Confirm the agent can reach the server: test with `curl <server-url>/health` from the agent’s host
- Check firewall rules allow outbound traffic on the server’s port
- If using Docker networking, ensure the agent and server are on the same Docker network or the server port is exposed
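The first two checks can be run before testing connectivity. This is a quick sketch, not part of Planekeeper itself; the function name and messages are illustrative:

```python
from urllib.parse import urlparse

def check_server_url(url):
    """Flag common AGENT_SERVER_URL mistakes before testing connectivity."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        # without http:// or https://, urlparse will not report an
        # http/https scheme here, so a bare "host:port" is caught
        return "missing http:// or https:// scheme"
    if not parsed.hostname:
        return "missing hostname"
    return "ok"

# a bare host:port fails the scheme check; a full URL passes
check_server_url("server.internal:8443")
check_server_url("https://server.internal:8443")
```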
Agent is connected but not picking up tasks
Symptoms: The agent appears on the service health page and sends heartbeats, but jobs remain in “pending” or “queued” status.
Likely cause: The agent is not declaring the required capabilities or credentials, or the agent belongs to a different organization than the jobs.
Fix:
- Check the agent’s configured capabilities – it must support the job type (`gather`, `scrape`, or `helm_sync`)
- For scrape jobs with a credential requirement, verify the agent has that credential configured in its `config.yaml`
- Confirm the jobs have a `next_run_at` time that has passed (scheduled jobs wait for their cron trigger)
- Check that the jobs are in “pending” status, not “completed” waiting for their next scheduled run
- Verify the agent belongs to the same organization as the jobs – agents only claim jobs for their own organization. The dashboard and job pages show a warning if no active agents are detected for your organization
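Taken together, the checks above form a single claim predicate. The sketch below is a simplified model (field and key names are illustrative except `next_run_at`, which the doc names):

```python
from datetime import datetime, timezone

def agent_can_claim(agent, job, now):
    """Simplified model of the conditions an agent checks before claiming a job."""
    return (
        job["status"] == "pending"                # not completed or in_progress
        and job["next_run_at"] <= now             # scheduled time has passed
        and job["type"] in agent["capabilities"]  # gather, scrape, or helm_sync
        and job["org_id"] == agent["org_id"]      # same organization only
    )

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
agent = {"capabilities": {"gather", "scrape"}, "org_id": "org-1"}
job = {"status": "pending", "type": "helm_sync",
       "next_run_at": now, "org_id": "org-1"}
# helm_sync is not in this agent's capabilities, so the job stays queued
```

If every condition holds for some connected agent and the job still sits in “pending”, the organization mismatch in the last bullet is the usual culprit.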
Agent health check shows stale heartbeat
Symptoms: The service health page shows a last heartbeat time that is minutes or hours old.
Likely cause: The agent process has stopped or is unable to reach the server intermittently.
Fix:
- Check if the agent container/process is still running
- Review agent logs for errors or crashes
- Verify network connectivity has not changed since the agent started
- Restart the agent if it appears stuck
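A heartbeat is usually considered stale once its age exceeds a few missed intervals. The threshold below is an assumption for illustration, not Planekeeper’s actual value:

```python
from datetime import datetime, timedelta, timezone

# 2-minute cutoff is an assumed example; tune it to your heartbeat interval
STALE_THRESHOLD = timedelta(minutes=2)

def is_stale(last_heartbeat, now, threshold=STALE_THRESHOLD):
    """True if the last heartbeat is older than the staleness threshold."""
    return now - last_heartbeat > threshold

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
```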
Job issues
Job stuck in “in_progress” status
Symptoms: A job shows “in_progress” for an extended period (more than an hour).
Likely cause: The agent that claimed the job crashed or disconnected before reporting results.
Fix:
Planekeeper automatically recovers stuck jobs through two mechanisms:
- Stale job detection: Jobs in “in_progress” for more than 1 hour are reset to “pending” by the scheduler
- Orphan cleanup: Jobs claimed by agents that are no longer connected are reset every 2 minutes
Wait for the automatic recovery, or manually trigger the job again with Run Now.
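The two recovery mechanisms can be approximated in a few lines. This is a sketch of the documented behavior, not the actual scheduler code:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=1)   # documented stale-job threshold

def recover_stuck_jobs(jobs, connected_agents, now):
    """Reset in_progress jobs that are stale or whose agent disconnected."""
    reset = []
    for job in jobs:
        if job["status"] != "in_progress":
            continue
        stale = now - job["claimed_at"] > STALE_AFTER
        orphaned = job["claimed_by"] not in connected_agents
        if stale or orphaned:
            job["status"] = "pending"
            job["claimed_by"] = None
            reset.append(job["id"])
    return reset

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "status": "in_progress", "claimed_by": "agent-a",
     "claimed_at": now - timedelta(hours=2)},      # stale: over 1 hour
    {"id": 2, "status": "in_progress", "claimed_by": "agent-gone",
     "claimed_at": now - timedelta(minutes=5)},    # orphaned: agent disconnected
    {"id": 3, "status": "in_progress", "claimed_by": "agent-a",
     "claimed_at": now - timedelta(minutes=5)},    # healthy: left alone
]
```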
Job fails repeatedly
Symptoms: A job completes with “failed” status and the error count increases with each attempt.
Likely cause: The job configuration is incorrect or the target resource is unavailable.
Fix:
- Check the job’s error message for specific details
- For gather jobs:
  - Verify the artifact name is correct (for example, `owner/repo` for GitHub)
  - Check for GitHub rate limiting – add a `GITHUB_TOKEN` if you haven’t already
  - Confirm the repository is public or that credentials are configured
- For scrape jobs:
  - Verify the repository URL is accessible from the agent
  - Confirm the target file path exists in the repository at the specified ref
  - Test the parse expression against the actual file content
  - For private repos, ensure the agent has the required credential
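A parse expression can be tested locally before re-running the job. This sketch assumes the expression is a regex with one capture group; the file content and expression below are made-up examples:

```python
import re

# pretend this is the target file fetched at the specified ref
file_content = """\
image:
  repository: nginx
  tag: "1.25.3"
"""

# example parse expression: capture the quoted tag value
parse_expression = r'tag:\s*"([^"]+)"'

match = re.search(parse_expression, file_content)
version = match.group(1) if match else None
# if version is None here, the job would complete without extracting anything
```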
Max retry attempts
After multiple failures, the job enters a permanent “failed” status. Update the job configuration to fix the issue, then click Run Now to retry.
Job produces no results
Symptoms: A job completes successfully but no releases or version snapshots appear.
Likely cause: The tag filter, version regex, or parse expression does not match the actual content.
Fix:
- For gather jobs: check if the Tag Filter regex is too restrictive. Remove it temporarily to see all releases
- For gather jobs: verify the Version Regex capture group matches the tag format used by the project
- For scrape jobs: confirm the Parse Expression matches the file content. Test the expression against the actual file
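Both the Tag Filter and the Version Regex can be checked offline against real tags from the project. The regexes below are examples, not Planekeeper defaults:

```python
import re

tags = ["v1.2.3", "v1.3.0-rc.1", "release-2024-01", "v2.0.0"]

tag_filter = re.compile(r"^v\d")                   # example Tag Filter
version_regex = re.compile(r"^v(\d+\.\d+\.\d+)$")  # example Version Regex, one capture group

kept = [t for t in tags if tag_filter.search(t)]
versions = [m.group(1) for t in kept if (m := version_regex.match(t))]
# "v1.3.0-rc.1" passes the filter but fails the version regex,
# so it yields no snapshot -- exactly the "no results" symptom above
```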
Alert issues
Alerts are not being generated
Symptoms: Both gather and scrape jobs complete successfully, but no alerts appear on the dashboard.
Likely cause: Missing alert config, inactive rule, or the versions do not violate the rule thresholds.
Fix:
- Verify an alert config exists that links the scrape job, gather job, and rule
- Check that the alert config is active (not toggled off)
- Check that the linked rule is active
- Compare the deployed version against the latest version manually – it may not exceed the moderate threshold
- If using `stable_only`, check whether the latest upstream release is a prerelease (it would be skipped)
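For the last check: under SemVer, a prerelease carries a hyphenated suffix. Assuming SemVer-style tags, a quick way to see whether the latest release would be skipped:

```python
def is_prerelease(version):
    """True for SemVer prereleases like 1.2.0-rc.1 (build metadata after "+" is ignored)."""
    core = version.lstrip("v").split("+", 1)[0]
    return "-" in core
```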
Alert shows wrong severity
Symptoms: An alert displays a severity level that seems incorrect for the version gap.
Likely cause: Threshold configuration does not match expectations, or the behind-by calculation differs from what you expect.
Fix:
- Check the rule’s threshold values – thresholds are ordered moderate < high < critical
- For `minors_behind`: this counts actual releases, not version number gaps. Check the upstream release list
- For `days_behind`: this measures days since the deployed version’s release date, not the gap between release dates
- Review the alert’s `behind_by` value to understand the exact calculation result
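Severity selection from the ordered thresholds can be reproduced by hand when an alert looks wrong. The threshold values below are examples, not defaults:

```python
def severity_for(behind_by, thresholds):
    """Map a behind_by value to a severity using moderate < high < critical."""
    for level in ("critical", "high", "moderate"):  # check strictest first
        if behind_by >= thresholds[level]:
            return level
    return None                                     # no threshold crossed: no alert

thresholds = {"moderate": 1, "high": 3, "critical": 5}  # example values
```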
Alerts are not resolving automatically
Symptoms: You updated the deployed version, but the alert remains active.
Likely cause: The scrape job has not run since the version was updated.
Fix:
- Trigger the scrape job manually with Run Now
- Wait for the job to complete and re-evaluate the rule
- If the alert persists, verify the scrape job extracted the correct new version
- Confirm the new version no longer violates the rule thresholds
Notification issues
Notifications are not being delivered
Symptoms: Alerts are generated but no webhook messages arrive in your Slack/Discord channel.
Likely cause: Missing notification rule, inactive channel, or no default channel configured.
Fix:
- Verify a notification rule exists that matches the alert’s severity and event type
- Check that the notification rule is active
- Verify the notification channel assigned to the rule is active
- If the rule has no specific channel, check that a default notification channel is configured in Settings
- Use the channel Test button to confirm the webhook URL is reachable
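The checks above trace a single delivery path: find an active rule matching the alert’s severity and event type, then fall back to the default channel if the rule names none. A simplified sketch (field names are illustrative):

```python
def pick_channel(alert, rules, channels, default_channel_id):
    """Return the channel id a notification would use, or None if undeliverable."""
    for rule in rules:
        if not rule["active"]:
            continue
        if alert["severity"] not in rule["severities"]:
            continue
        if alert["event_type"] not in rule["event_types"]:
            continue
        channel_id = rule.get("channel_id") or default_channel_id
        if channel_id and channels.get(channel_id, {}).get("active"):
            return channel_id
    return None  # no matching rule or active channel: nothing is delivered

channels = {"slack-ops": {"active": True}}
rules = [{"active": True, "severities": {"high", "critical"},
          "event_types": {"created"}, "channel_id": None}]
alert = {"severity": "moderate", "event_type": "created"}
# "moderate" matches no rule's severities, so nothing arrives -- the symptom above
```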
Webhook returns wrong format errors
Symptoms: Notifications are sent but Discord or Slack rejects them with a 400 error.
Likely cause: The template produces invalid JSON for the target platform.
Fix:
- Check the Notification Deliveries page for failed deliveries and their error messages
- Verify the template produces valid JSON – watch for unescaped quotes or missing commas
- Use the channel Test button to send a sample notification and check the response
- Refer to the Discord or Slack recipe for tested template examples
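The second check can be automated: render the template with sample values, then parse the result as JSON before sending. The rendered strings below are illustrative:

```python
import json

def payload_error(rendered):
    """Return the JSON parse error for a rendered template, or None if valid."""
    try:
        json.loads(rendered)
        return None
    except json.JSONDecodeError as exc:
        return str(exc)

good = '{"text": "nginx is 3 minors behind"}'
bad = '{"text": "release "v2.0" is out"}'  # unescaped quotes break the JSON
```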
Notifications end up in dead letter
Symptoms: The dead letter list shows failed deliveries that were not retried.
Likely cause: The webhook endpoint returned a non-retryable error (4xx except 429), or retries were exhausted.
Fix:
- Navigate to Notification Deliveries > Dead Letters
- Check the error details for each dead letter
- Fix the underlying issue (URL, template, or endpoint configuration)
- Retry the delivery via the API endpoint: `POST /notification-deliveries/{id}/retry`. A UI retry button is planned for a future release.
Non-retryable errors
HTTP 4xx responses (except 429 Too Many Requests) are treated as permanent failures and move directly to dead letter without retrying. Fix the configuration before retrying.
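That policy reduces to a small status-code check; this is a sketch of the documented behavior:

```python
def is_retryable(status_code):
    """4xx (except 429) are permanent failures; 429 and 5xx are retried."""
    if status_code == 429:           # rate limited: retry later
        return True
    if 400 <= status_code < 500:     # other client errors: straight to dead letter
        return False
    return status_code >= 500        # server errors: retry
```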
UI issues
Pages show stale data
Symptoms: The UI shows outdated job statuses or alert counts that do not match recent activity.
Likely cause: Browser cache serving old page content.
Fix:
- Hard-refresh the page (Ctrl+Shift+R or Cmd+Shift+R)
- Clear the browser cache for the Planekeeper domain
- If the issue persists, check that the API server is running and responding
Error banner appears after an action
Symptoms: A red error banner appears at the top of the page after creating or updating a resource.
Likely cause: The API rejected the request due to validation errors or a server-side issue.
Fix:
- Read the error message – it contains specific details about what went wrong
- Common validation errors:
- Duplicate name or configuration
- Invalid cron expression
- Missing required fields
- Webhook URL blocked (private IP address when SSRF protection is enabled)
- Correct the issue and try again