Common issues
Diagnosis and fixes for common issues with agents, jobs, alerts, notifications, and the UI.
This page covers the most frequent problems and how to fix them. Each issue lists the symptoms, likely cause, and resolution steps.
Agent issues
Agent cannot connect to the server
Symptoms: Agent logs show connection refused or timeout errors. The agent does not appear on the service health page.
Likely cause: Incorrect server URL, network connectivity issue, or missing protocol scheme.
Fix:
- Verify the `AGENT_SERVER_URL` environment variable includes the scheme (`http://` or `https://`)
- Confirm the agent can reach the server: test with `curl <server-url>/health` from the agent’s host
- Check firewall rules allow outbound traffic on the server’s port
- If using Docker networking, ensure the agent and server are on the same Docker network or the server port is exposed
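The first two checks can be run before testing connectivity. This is a quick sketch, not part of Planekeeper itself; the function name and messages are illustrative:

```python
from urllib.parse import urlparse

def check_server_url(url):
    """Flag common AGENT_SERVER_URL mistakes before testing connectivity."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        # without http:// or https://, urlparse will not report an
        # http/https scheme here, so a bare "host:port" is caught
        return "missing http:// or https:// scheme"
    if not parsed.hostname:
        return "missing hostname"
    return "ok"

# a bare host:port fails the scheme check; a full URL passes
check_server_url("server.internal:8443")
check_server_url("https://server.internal:8443")
```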
Agent is connected but not picking up tasks
Symptoms: The agent appears on the service health page and sends heartbeats, but jobs remain in “pending” or “queued” status.
Likely cause: The agent is not declaring the required capabilities or credentials, or the agent belongs to a different organization than the jobs.
Fix:
- Check the agent’s configured capabilities – it must support the job type (`gather`, `scrape`, or `helm_sync`)
- For scrape jobs with a credential requirement, verify the agent has that credential configured in its `config.yaml`
- Confirm the jobs have a `next_run_at` time that has passed (scheduled jobs wait for their cron trigger)
- Check that the jobs are in “pending” status, not “completed” waiting for their next scheduled run
- Verify the agent belongs to the same organization as the jobs – agents only claim jobs for their own organization. The dashboard and job pages show a warning if no active agents are detected for your organization
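Taken together, the checks above form a single claim predicate. The sketch below is a simplified model (field and key names are illustrative except `next_run_at`, which the doc names):

```python
from datetime import datetime, timezone

def agent_can_claim(agent, job, now):
    """Simplified model of the conditions an agent checks before claiming a job."""
    return (
        job["status"] == "pending"                # not completed or in_progress
        and job["next_run_at"] <= now             # scheduled time has passed
        and job["type"] in agent["capabilities"]  # gather, scrape, or helm_sync
        and job["org_id"] == agent["org_id"]      # same organization only
    )

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
agent = {"capabilities": {"gather", "scrape"}, "org_id": "org-1"}
job = {"status": "pending", "type": "helm_sync",
       "next_run_at": now, "org_id": "org-1"}
# helm_sync is not in this agent's capabilities, so the job stays queued
```

If every condition holds for some connected agent and the job still sits in “pending”, the organization mismatch in the last bullet is the usual culprit.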
Agent health check shows stale heartbeat
Symptoms: The service health page shows a last heartbeat time that is minutes or hours old.
Likely cause: The agent process has stopped or is unable to reach the server intermittently.
Fix:
- Check if the agent container/process is still running
- Review agent logs for errors or crashes
- Verify network connectivity has not changed since the agent started
- Restart the agent if it appears stuck
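A heartbeat is usually considered stale once its age exceeds a few missed intervals. The threshold below is an assumption for illustration, not Planekeeper’s actual value:

```python
from datetime import datetime, timedelta, timezone

# 2-minute cutoff is an assumed example; tune it to your heartbeat interval
STALE_THRESHOLD = timedelta(minutes=2)

def is_stale(last_heartbeat, now, threshold=STALE_THRESHOLD):
    """True if the last heartbeat is older than the staleness threshold."""
    return now - last_heartbeat > threshold

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
```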
Job issues
Job stuck in “in_progress” status
Symptoms: A job shows “in_progress” for an extended period (more than an hour).
Likely cause: The agent that claimed the job crashed or disconnected before reporting results.
Fix:
Planekeeper automatically recovers stuck jobs through two mechanisms:
- Stale job detection: Jobs in “in_progress” for more than 1 hour are reset to “pending” by the scheduler
- Orphan cleanup: Jobs claimed by agents that are no longer connected are reset every 2 minutes
Wait for the automatic recovery, or manually trigger the job again with Run Now.
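The two recovery mechanisms can be approximated in a few lines. This is a sketch of the documented behavior, not the actual scheduler code:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=1)   # documented stale-job threshold

def recover_stuck_jobs(jobs, connected_agents, now):
    """Reset in_progress jobs that are stale or whose agent disconnected."""
    reset = []
    for job in jobs:
        if job["status"] != "in_progress":
            continue
        stale = now - job["claimed_at"] > STALE_AFTER
        orphaned = job["claimed_by"] not in connected_agents
        if stale or orphaned:
            job["status"] = "pending"
            job["claimed_by"] = None
            reset.append(job["id"])
    return reset

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": 1, "status": "in_progress", "claimed_by": "agent-a",
     "claimed_at": now - timedelta(hours=2)},      # stale: over 1 hour
    {"id": 2, "status": "in_progress", "claimed_by": "agent-gone",
     "claimed_at": now - timedelta(minutes=5)},    # orphaned: agent disconnected
    {"id": 3, "status": "in_progress", "claimed_by": "agent-a",
     "claimed_at": now - timedelta(minutes=5)},    # healthy: left alone
]
```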
Job fails repeatedly
Symptoms: A job completes with “failed” status and the error count increases with each attempt.
Likely cause: The job configuration is incorrect or the target resource is unavailable.
Fix:
- Check the job’s error message for specific details
- For gather jobs:
  - Verify the artifact name is correct (for example, `owner/repo` for GitHub)
  - Check for GitHub rate limiting – add a `GITHUB_TOKEN` if you haven’t already
  - Confirm the repository is public or that credentials are configured
- For scrape jobs:
  - Verify the repository URL is accessible from the agent
  - Confirm the target file path exists in the repository at the specified ref
  - Test the parse expression against the actual file content
  - For private repos, ensure the agent has the required credential
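A parse expression can be tested locally before re-running the job. This sketch assumes the expression is a regex with one capture group; the file content and expression below are made-up examples:

```python
import re

# pretend this is the target file fetched at the specified ref
file_content = """\
image:
  repository: nginx
  tag: "1.25.3"
"""

# example parse expression: capture the quoted tag value
parse_expression = r'tag:\s*"([^"]+)"'

match = re.search(parse_expression, file_content)
version = match.group(1) if match else None
# if version is None here, the job would complete without extracting anything
```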
Max retry attempts
After multiple failures, the job enters a permanent “failed” status. Update the job configuration to fix the issue, then click Run Now to retry.
Job produces no results
Symptoms: A job completes successfully but no releases or version snapshots appear.
Likely cause: The tag filter, version regex, or parse expression does not match the actual content.
Fix:
- For gather jobs: check if the Tag Filter regex is too restrictive. Remove it temporarily to see all releases
- For gather jobs: verify the Version Regex capture group matches the tag format used by the project
- For scrape jobs: confirm the Parse Expression matches the file content. Test the expression against the actual file
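Both the Tag Filter and the Version Regex can be checked offline against real tags from the project. The regexes below are examples, not Planekeeper defaults:

```python
import re

tags = ["v1.2.3", "v1.3.0-rc.1", "release-2024-01", "v2.0.0"]

tag_filter = re.compile(r"^v\d")                   # example Tag Filter
version_regex = re.compile(r"^v(\d+\.\d+\.\d+)$")  # example Version Regex, one capture group

kept = [t for t in tags if tag_filter.search(t)]
versions = [m.group(1) for t in kept if (m := version_regex.match(t))]
# "v1.3.0-rc.1" passes the filter but fails the version regex,
# so it yields no snapshot -- exactly the "no results" symptom above
```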
Alert issues
Alerts are not being generated
Symptoms: Both gather and scrape jobs complete successfully, but no alerts appear on the dashboard.
Likely cause: Missing alert config, inactive rule, or the versions do not violate the rule thresholds.
Fix:
- Verify an alert config exists that links the scrape job, gather job, and rule
- Check that the alert config is active (not toggled off)
- Check that the linked rule is active
- Compare the deployed version against the latest version manually – it may not exceed the moderate threshold
- If using `stable_only`, check whether the latest upstream release is a prerelease (it would be skipped)
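For the last check: under SemVer, a prerelease carries a hyphenated suffix. Assuming SemVer-style tags, a quick way to see whether the latest release would be skipped:

```python
def is_prerelease(version):
    """True for SemVer prereleases like 1.2.0-rc.1 (build metadata after "+" is ignored)."""
    core = version.lstrip("v").split("+", 1)[0]
    return "-" in core
```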
Alert shows wrong severity
Symptoms: An alert displays a severity level that seems incorrect for the version gap.
Likely cause: Threshold configuration does not match expectations, or the behind-by calculation differs from what you expect.
Fix:
- Check the rule’s threshold values – thresholds are ordered moderate < high < critical
- For `minors_behind`: this counts actual releases, not version number gaps. Check the upstream release list
- For `days_behind`: this measures days since the deployed version’s release date, not the gap between release dates
- Review the alert’s `behind_by` value to understand the exact calculation result
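Severity selection from the ordered thresholds can be reproduced by hand when an alert looks wrong. The threshold values below are examples, not defaults:

```python
def severity_for(behind_by, thresholds):
    """Map a behind_by value to a severity using moderate < high < critical."""
    for level in ("critical", "high", "moderate"):  # check strictest first
        if behind_by >= thresholds[level]:
            return level
    return None                                     # no threshold crossed: no alert

thresholds = {"moderate": 1, "high": 3, "critical": 5}  # example values
```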
Alerts are not resolving automatically
Symptoms: You updated the deployed version, but the alert remains active.
Likely cause: The scrape job has not run since the version was updated.
Fix:
- Trigger the scrape job manually with Run Now
- Wait for the job to complete and re-evaluate the rule
- If the alert persists, verify the scrape job extracted the correct new version
- Confirm the new version no longer violates the rule thresholds
Notification issues
Notifications are not being delivered
Symptoms: Alerts are generated but no webhook messages arrive in your Slack/Discord channel.
Likely cause: Missing notification rule, inactive channel, or no default channel configured.
Fix:
- Verify a notification rule exists that matches the alert’s severity and event type
- Check that the notification rule is active
- Verify the notification channel assigned to the rule is active
- If the rule has no specific channel, check that a default notification channel is configured in Settings
- Use the channel Test button to confirm the webhook URL is reachable
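The checks above trace a single delivery path: find an active rule matching the alert’s severity and event type, then fall back to the default channel if the rule names none. A simplified sketch (field names are illustrative):

```python
def pick_channel(alert, rules, channels, default_channel_id):
    """Return the channel id a notification would use, or None if undeliverable."""
    for rule in rules:
        if not rule["active"]:
            continue
        if alert["severity"] not in rule["severities"]:
            continue
        if alert["event_type"] not in rule["event_types"]:
            continue
        channel_id = rule.get("channel_id") or default_channel_id
        if channel_id and channels.get(channel_id, {}).get("active"):
            return channel_id
    return None  # no matching rule or active channel: nothing is delivered

channels = {"slack-ops": {"active": True}}
rules = [{"active": True, "severities": {"high", "critical"},
          "event_types": {"created"}, "channel_id": None}]
alert = {"severity": "moderate", "event_type": "created"}
# "moderate" matches no rule's severities, so nothing arrives -- the symptom above
```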
Webhook returns wrong format errors
Symptoms: Notifications are sent but Discord or Slack rejects them with a 400 error.
Likely cause: The template produces invalid JSON for the target platform.
Fix:
- Check the Notification Deliveries page for failed deliveries and their error messages
- Verify the template produces valid JSON – watch for unescaped quotes or missing commas
- Use the channel Test button to send a sample notification and check the response
- Refer to the Discord or Slack recipe for tested template examples
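The second check can be automated: render the template with sample values, then parse the result as JSON before sending. The rendered strings below are illustrative:

```python
import json

def payload_error(rendered):
    """Return the JSON parse error for a rendered template, or None if valid."""
    try:
        json.loads(rendered)
        return None
    except json.JSONDecodeError as exc:
        return str(exc)

good = '{"text": "nginx is 3 minors behind"}'
bad = '{"text": "release "v2.0" is out"}'  # unescaped quotes break the JSON
```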
Notifications end up in dead letter
Symptoms: The dead letter list shows failed deliveries that were not retried.
Likely cause: The webhook endpoint returned a non-retryable error (4xx except 429), or retries were exhausted.
Fix:
- Navigate to Notification Deliveries > Dead Letters
- Check the error details for each dead letter
- Fix the underlying issue (URL, template, or endpoint configuration)
- Retry the delivery via the API endpoint: `POST /notification-deliveries/{id}/retry`. A UI retry button is planned for a future release.
Non-retryable errors
HTTP 4xx responses (except 429 Too Many Requests) are treated as permanent failures and move directly to dead letter without retrying. Fix the configuration before retrying.
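That policy reduces to a small status-code check; this is a sketch of the documented behavior:

```python
def is_retryable(status_code):
    """4xx (except 429) are permanent failures; 429 and 5xx are retried."""
    if status_code == 429:           # rate limited: retry later
        return True
    if 400 <= status_code < 500:     # other client errors: straight to dead letter
        return False
    return status_code >= 500        # server errors: retry
```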
UI issues
Pages show stale data
Symptoms: The UI shows outdated job statuses or alert counts that do not match recent activity.
Likely cause: Browser cache serving old page content.
Fix:
- Hard-refresh the page (Ctrl+Shift+R or Cmd+Shift+R)
- Clear the browser cache for the Planekeeper domain
- If the issue persists, check that the API server is running and responding
Error banner appears after an action
Symptoms: A red error banner appears at the top of the page after creating or updating a resource.
Likely cause: The API rejected the request due to validation errors or a server-side issue.
Fix:
- Read the error message – it contains specific details about what went wrong
- Common validation errors:
- Duplicate name or configuration
- Invalid cron expression
- Missing required fields
- Webhook URL blocked (private IP address when SSRF protection is enabled)
- Correct the issue and try again