Planekeeper is currently in alpha development. Features and APIs may change. Feedback is welcome! Request early access to get started.

Common issues

Diagnosis and fixes for common issues with agents, jobs, alerts, notifications, and the UI.

This page covers the most frequent problems and how to fix them. Each issue lists the symptoms, likely cause, and resolution steps.


Agent issues

Agent cannot connect to the server

Symptoms: Agent logs show connection refused or timeout errors. The agent does not appear on the service health page.

Likely cause: Incorrect server URL, network connectivity issue, or missing protocol scheme.

Fix:

  1. Verify the AGENT_SERVER_URL environment variable includes the scheme (http:// or https://)
  2. Confirm the agent can reach the server: test with curl <server-url>/health from the agent’s host
  3. Check firewall rules allow outbound traffic on the server’s port
  4. If using Docker networking, ensure the agent and server are on the same Docker network or the server port is exposed
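The first two checks above can be scripted from the agent’s host. This is a minimal sketch (the /health endpoint is the one described in step 2; everything else is standard library):

```python
import urllib.request

def check_server(url: str, timeout: float = 5.0) -> str:
    """Validate the AGENT_SERVER_URL scheme, then probe the server's /health endpoint."""
    if not url.startswith(("http://", "https://")):
        return f"invalid AGENT_SERVER_URL {url!r}: missing http:// or https:// scheme"
    try:
        with urllib.request.urlopen(url.rstrip("/") + "/health", timeout=timeout) as resp:
            return f"reachable (HTTP {resp.status})"
    except OSError as exc:  # connection refused, DNS failure, timeout
        return f"unreachable: {exc}"

# A URL without a scheme is the most common misconfiguration:
print(check_server("planekeeper.internal:8080"))  # -> flags the missing scheme
```

If the scheme check passes but the probe reports “unreachable”, move on to the firewall and Docker-networking checks in steps 3 and 4.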

Agent is connected but not picking up tasks

Symptoms: The agent appears on the service health page and sends heartbeats, but jobs remain in “pending” or “queued” status.

Likely cause: The agent is not declaring the required capabilities or credentials, or the agent belongs to a different organization than the jobs.

Fix:

  1. Check the agent’s configured capabilities – it must support the job type (gather, scrape, or helm_sync)
  2. For scrape jobs with a credential requirement, verify the agent has that credential configured in its config.yaml
  3. Confirm the jobs have a next_run_at time that has passed (scheduled jobs wait for their cron trigger)
  4. Check that the jobs are in “pending” status – a job in “completed” status is simply waiting for its next scheduled run
  5. Verify the agent belongs to the same organization as the jobs – agents only claim jobs for their own organization. The dashboard and job pages show a warning if no active agents are detected for your organization
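As an illustration of steps 1 and 2, an agent that should claim scrape jobs requiring a Git credential might declare something like the following in its config.yaml. The key names here are hypothetical – check the agent configuration reference for your Planekeeper version:

```yaml
# Hypothetical config.yaml sketch - key names are illustrative only.
capabilities:
  - gather
  - scrape              # required before the agent will claim scrape jobs
credentials:
  - name: github-readonly   # must match the credential name the job requires
    token: ${GITHUB_TOKEN}
```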

Agent health check shows stale heartbeat

Symptoms: The service health page shows a last heartbeat time that is minutes or hours old.

Likely cause: The agent process has stopped or is unable to reach the server intermittently.

Fix:

  1. Check if the agent container/process is still running
  2. Review agent logs for errors or crashes
  3. Verify network connectivity has not changed since the agent started
  4. Restart the agent if it appears stuck

Job issues

Job stuck in “in_progress” status

Symptoms: A job shows “in_progress” for an extended period (more than an hour).

Likely cause: The agent that claimed the job crashed or disconnected before reporting results.

Fix:

Planekeeper automatically recovers stuck jobs through two mechanisms:

  • Stale job detection: Jobs in “in_progress” for more than 1 hour are reset to “pending” by the scheduler
  • Orphan cleanup: Jobs claimed by agents that are no longer connected are reset every 2 minutes

Wait for the automatic recovery, or manually trigger the job again with Run Now.
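The stale-job mechanism amounts to a periodic sweep over in-progress jobs. This simplified sketch illustrates the rule described above (it is not Planekeeper’s actual scheduler code; field names are assumed):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=1)  # jobs in_progress longer than this are reset

def sweep_stale(jobs: list[dict], now: datetime) -> int:
    """Reset jobs stuck in_progress past the stale threshold back to pending."""
    reset = 0
    for job in jobs:
        if job["status"] == "in_progress" and now - job["claimed_at"] > STALE_AFTER:
            job["status"] = "pending"
            job["claimed_by"] = None
            reset += 1
    return reset

now = datetime.now(timezone.utc)
jobs = [
    {"status": "in_progress", "claimed_at": now - timedelta(hours=2), "claimed_by": "agent-1"},
    {"status": "in_progress", "claimed_at": now - timedelta(minutes=10), "claimed_by": "agent-2"},
]
print(sweep_stale(jobs, now))  # -> 1 (only the two-hour-old job is reset)
```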


Job fails repeatedly

Symptoms: A job completes with “failed” status and the error count increases with each attempt.

Likely cause: The job configuration is incorrect or the target resource is unavailable.

Fix:

  1. Check the job’s error message for specific details
  2. For gather jobs:
    • Verify the artifact name is correct (for example, owner/repo for GitHub)
    • Check for GitHub rate limiting – add a GITHUB_TOKEN if you haven’t already
    • Confirm the repository is public or that credentials are configured
  3. For scrape jobs:
    • Verify the repository URL is accessible from the agent
    • Confirm the target file path exists in the repository at the specified ref
    • Test the parse expression against the actual file content
    • For private repos, ensure the agent has the required credential
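You can test a parse expression locally before re-running the job. This sketch assumes the parse expression is a regular expression whose first capture group extracts the version (the file content and expression below are hypothetical examples):

```python
import re

# Hypothetical target: the deployed version lives in a Helm values file.
file_content = """\
image:
  repository: nginx
  tag: "1.25.3"
"""

# Assumed parse-expression style: first capture group is the version.
parse_expression = r'tag:\s*"([^"]+)"'

match = re.search(parse_expression, file_content)
print(match.group(1) if match else "no match - fix the expression or the file path")
# -> 1.25.3
```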

Max retry attempts

After multiple failures, the job enters a permanent “failed” status. Update the job configuration to fix the issue, then click Run Now to retry.


Job produces no results

Symptoms: A job completes successfully but no releases or version snapshots appear.

Likely cause: The tag filter, version regex, or parse expression does not match the actual content.

Fix:

  1. For gather jobs: check if the Tag Filter regex is too restrictive. Remove it temporarily to see all releases
  2. For gather jobs: verify the Version Regex capture group matches the tag format used by the project
  3. For scrape jobs: confirm the Parse Expression matches the file content. Test the expression against the actual file
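For steps 1 and 2, a quick local test shows which tags a Version Regex actually matches. This sketch assumes the regex’s first capture group extracts the version number (the tag list is a made-up example):

```python
import re

tags = ["v1.4.2", "v1.5.0", "release-2024-01"]

# Assumed Version Regex style: first capture group extracts the version.
version_regex = r"^v(\d+\.\d+\.\d+)$"

versions = [m.group(1) for t in tags if (m := re.match(version_regex, t))]
print(versions)  # -> ['1.4.2', '1.5.0']; 'release-2024-01' does not match
```

If the list comes back empty, the regex does not fit the project’s tag format and the job will produce no releases.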

Alert issues

Alerts are not being generated

Symptoms: Both gather and scrape jobs complete successfully, but no alerts appear on the dashboard.

Likely cause: Missing alert config, inactive rule, or the versions do not violate the rule thresholds.

Fix:

  1. Verify an alert config exists that links the scrape job, gather job, and rule
  2. Check that the alert config is active (not toggled off)
  3. Check that the linked rule is active
  4. Compare the deployed version against the latest version manually – it may not exceed the moderate threshold
  5. If using stable_only, check whether the latest upstream release is a prerelease (it would be skipped)
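For step 5, prerelease tags are usually recognizable by a SemVer-style suffix. This heuristic check is illustrative only – Planekeeper’s own prerelease detection may differ:

```python
import re

def is_prerelease(tag: str) -> bool:
    """Heuristic SemVer-style prerelease check (alpha/beta/rc/pre suffixes)."""
    return bool(re.search(r"-(alpha|beta|rc|pre)", tag, re.IGNORECASE))

print(is_prerelease("v2.0.0-rc.1"))  # -> True: skipped when stable_only is set
print(is_prerelease("v2.0.0"))       # -> False: eligible for comparison
```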

Alert shows wrong severity

Symptoms: An alert displays a severity level that seems incorrect for the version gap.

Likely cause: Threshold configuration does not match expectations, or the behind-by calculation differs from what you expect.

Fix:

  1. Check the rule’s threshold values – thresholds are ordered moderate < high < critical
  2. For minors_behind: this counts actual releases, not version number gaps. Check the upstream release list
  3. For days_behind: this measures days since the deployed version’s release date, not the gap between release dates
  4. Review the alert’s behind_by value to understand the exact calculation result

Alerts are not resolving automatically

Symptoms: You updated the deployed version, but the alert remains active.

Likely cause: The scrape job has not run since the version was updated.

Fix:

  1. Trigger the scrape job manually with Run Now
  2. Wait for the job to complete and re-evaluate the rule
  3. If the alert persists, verify the scrape job extracted the correct new version
  4. Confirm the new version no longer violates the rule thresholds

Notification issues

Notifications are not being delivered

Symptoms: Alerts are generated but no webhook messages arrive in your Slack/Discord channel.

Likely cause: Missing notification rule, inactive channel, or no default channel configured.

Fix:

  1. Verify a notification rule exists that matches the alert’s severity and event type
  2. Check that the notification rule is active
  3. Verify the notification channel assigned to the rule is active
  4. If the rule has no specific channel, check that a default notification channel is configured in Settings
  5. Use the channel Test button to confirm the webhook URL is reachable

Webhook returns wrong format errors

Symptoms: Notifications are sent but Discord or Slack rejects them with a 400 error.

Likely cause: The template produces invalid JSON for the target platform.

Fix:

  1. Check the Notification Deliveries page for failed deliveries and their error messages
  2. Verify the template produces valid JSON – watch for unescaped quotes or missing commas
  3. Use the channel Test button to send a sample notification and check the response
  4. Refer to the Discord or Slack recipe for tested template examples
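For step 2, you can validate a rendered template locally before sending it. This sketch uses a Slack-style `{"text": ...}` payload as a hypothetical example; unescaped quotes inside values are the usual cause of a 400:

```python
import json

# A rendered payload; note the inner quotes are escaped.
rendered = '{"text": "Alert: nginx is 3 versions behind (severity: \\"high\\")"}'

try:
    payload = json.loads(rendered)
    print("valid JSON:", payload["text"])
except json.JSONDecodeError as exc:
    print(f"invalid JSON at position {exc.pos}: {exc.msg}")
```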

Notifications end up in dead letter

Symptoms: The dead letter list shows failed deliveries that were not retried.

Likely cause: The webhook endpoint returned a non-retryable error (4xx except 429), or retries were exhausted.

Fix:

  1. Navigate to Notification Deliveries > Dead Letters
  2. Check the error details for each dead letter
  3. Fix the underlying issue (URL, template, or endpoint configuration)
  4. Retry the delivery via the API endpoint: POST /notification-deliveries/{id}/retry. A UI retry button is planned for a future release.
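The retry call in step 4 can be issued from any HTTP client. This sketch builds the request with the standard library; the endpoint path is the one documented above, but the base URL, delivery ID, and Bearer-token auth header are assumptions – match them to your deployment:

```python
import urllib.request

def build_retry_request(base_url: str, delivery_id: str, token: str) -> urllib.request.Request:
    """Build a POST to /notification-deliveries/{id}/retry for a dead-lettered delivery.

    The Authorization header scheme is an assumption - adjust for your deployment.
    """
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/notification-deliveries/{delivery_id}/retry",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )

# Hypothetical server URL and delivery ID:
req = build_retry_request("https://planekeeper.example.com", "dl_123", "TOKEN")
print(req.method, req.full_url)
# Send with: urllib.request.urlopen(req)
```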

Non-retryable errors

HTTP 4xx responses (except 429 Too Many Requests) are treated as permanent failures and move directly to dead letter without retrying. Fix the configuration before retrying.


UI issues

Pages show stale data

Symptoms: The UI shows outdated job statuses or alert counts that do not match recent activity.

Likely cause: Browser cache serving old page content.

Fix:

  1. Hard-refresh the page (Ctrl+Shift+R or Cmd+Shift+R)
  2. Clear the browser cache for the Planekeeper domain
  3. If the issue persists, check that the API server is running and responding

Error banner appears after an action

Symptoms: A red error banner appears at the top of the page after creating or updating a resource.

Likely cause: The API rejected the request due to validation errors or a server-side issue.

Fix:

  1. Read the error message – it contains specific details about what went wrong
  2. Common validation errors:
    • Duplicate name or configuration
    • Invalid cron expression
    • Missing required fields
    • Webhook URL blocked (private IP address when SSRF protection is enabled)
  3. Correct the issue and try again
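The “webhook URL blocked” error means the URL resolved to a private or loopback address while SSRF protection was enabled. You can check a host locally with this simplified sketch of that screening (Planekeeper’s actual checks may cover more cases):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_private_webhook(url: str) -> bool:
    """Return True if the webhook host resolves to a private or loopback address."""
    host = urlparse(url).hostname
    addr = ipaddress.ip_address(socket.gethostbyname(host))
    return addr.is_private or addr.is_loopback

print(is_private_webhook("http://127.0.0.1:9000/hook"))  # -> True: would be blocked
```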