Site: Planekeeper Docs — Monitor software versions across your stack. Planekeeper tracks releases, gathers version data, and alerts on drift.
Section: Feature Guides > Scrape jobs
Source: https://docs.planekeeper.com/guides/scrape-jobs/
Title: Scrape jobs
Author: Planekeeper
Description: Set up scrape jobs to extract deployed versions from Git repositories using YQ, JQ, regex, or manual entry.

Contents:
- [What scrape jobs do](#what-scrape-jobs-do)
- [Create a scrape job](#create-a-scrape-job)
- [Manual version entry](#manual-version-entry)
  - [Create a manual scrape job](#create-a-manual-scrape-job)
  - [Set a version](#set-a-version)
  - [How it integrates with the pipeline](#how-it-integrates-with-the-pipeline)
  - [API usage](#api-usage)
- [Parser types](#parser-types)
  - [YQ (YAML and JSON files)](#yq-yaml-and-json-files)
  - [JQ (JSON files — simple key lookups)](#jq-json-files--simple-key-lookups)
  - [Regex (any text file)](#regex-any-text-file)
- [Version transforms](#version-transforms)
- [Private repository access](#private-repository-access)
  - [Security model](#security-model)
  - [Credential types](#credential-types)
  - [Agent configuration](#agent-configuration)
  - [How job assignment works](#how-job-assignment-works)
- [Schedule and manual runs](#schedule-and-manual-runs)
  - [Set a schedule](#set-a-schedule)
  - [Trigger a manual run](#trigger-a-manual-run)
- [Version snapshots and history](#version-snapshots-and-history)
- [Bulk actions](#bulk-actions)
- [Tips](#tips)

***

# Scrape jobs


Scrape jobs track your currently deployed software version. Most scrape jobs use an agent to extract a version from a file in a Git repository. Alternatively, you can use the **manual** parser type to enter the version directly, with no agent or repository needed. Either way, a scrape job answers the question "what version are we running?"

## What scrape jobs do

An agent-based scrape job clones a Git repository, reads a specific file, and applies a parser expression to extract a version string. The result is stored as a **version snapshot**: a point-in-time record of the version found.

Every scrape creates a new snapshot, even if the version has not changed. This gives you a complete history of your deployed version over time, including rollbacks.

When a scrape job completes, all alert configs that reference it are re-evaluated immediately.

## Create a scrape job

1. Open the **Scrape Jobs** page from the sidebar.
2. Click **Create Scrape Job**.
3. Fill in the form:

| Field | Required | Description |
|-------|----------|-------------|
| **Name** | No (required for manual) | A human-readable label |
| **Repository URL** | Yes (except manual) | Git clone URL (HTTPS or SSH) |
| **Ref (Branch/Tag)** | Yes (except manual) | Git ref to check out (default: `main`) |
| **Target file** | Yes (except manual) | Path to the file containing the version |
| **Parser type** | Yes | `yq`, `jq`, `regex`, or `manual` |
| **Parse expression** | Yes (except manual) | Parser-specific expression to extract the version |
| **Credential name** | No | Named credential for private repository access |
| **Schedule** | No | Cron expression for recurring runs |
| **Version transform** | No | Post-parse transformation for the version string |
| **History limit** | No | Maximum snapshots to retain (1-20) |

4. Click **Create**.

> **success:** **Manual entry shortcut**
>
> When you select **Manual Entry** as the parser type, the repository, file, expression, credential, and schedule fields are hidden. Only the job name is required. See [Manual version entry](#manual-version-entry) below for details.
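
The required-field rules in the table can be sketched as a small check. This is only an illustration: the snake_case keys are hypothetical names for the form fields, and the real form or API may validate differently.

```python
# Hypothetical field names; the guide only lists the form labels.
REPO_FIELDS = ("repository_url", "target_file", "parse_expression")
PARSER_TYPES = ("yq", "jq", "regex", "manual")


def missing_fields(job: dict) -> list[str]:
    """Return the required fields that are absent or empty for this job."""
    parser_type = job.get("parser_type")
    if parser_type not in PARSER_TYPES:
        return ["parser_type"]
    if parser_type == "manual":
        # Manual jobs hide the repository fields; only the name is required.
        return [] if job.get("name") else ["name"]
    # The ref is omitted here because the form defaults it to "main".
    return [field for field in REPO_FIELDS if not job.get(field)]
```

For example, a `yq` job with a repository URL and target file but no expression would report `parse_expression` as missing, while a manual job needs only a name.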


## Manual version entry

Manual scrape jobs let you enter a deployed version directly, without requiring agent infrastructure or Git repository access. This is useful for:

- **Demos and testing** -- quickly set up the full monitoring pipeline without deploying an agent
- **Environments without agents** -- track versions for systems where agent deployment is not practical
- **One-off checks** -- verify that rules, alerts, and notifications work correctly with a known version

### Create a manual scrape job

1. Open the **Scrape Jobs** page from the sidebar.
2. Click **Create Scrape Job**.
3. Select **Manual Entry** as the parser type. The repository, file, expression, credential, and schedule fields are hidden automatically.
4. Enter a **Name** for the job (this is the only required field).
5. Optionally set a **Version transform** if you need to normalize the version format.
6. Click **Create**.

### Set a version

1. On the **Scrape Jobs** page, expand the row for your manual scrape job.
2. Use the **Set Version** form to enter the deployed version string (for example, `1.2.3` or `v2.0.0-rc1`).
3. Submit the form.

The version is saved as a new version snapshot, exactly like an agent-discovered version. If the scrape job has a version transform configured, it is applied to the entered version before storage.

### How it integrates with the pipeline

Manual scrape jobs participate in the full monitoring pipeline:

- **Alert configs** -- link a manual scrape job to a gather job and a rule, just like any other scrape job.
- **Rule evaluation** -- when you set a version, all alert configs referencing the scrape job are re-evaluated immediately.
- **Alerts** -- if the version violates a rule threshold, an alert is created or updated automatically.
- **Notifications** -- alert events (created, escalated, resolved) trigger notifications to your configured channels.

The only difference from agent-based scrape jobs is how the version enters the system -- everything downstream is identical.

### API usage

You can also set the version programmatically:

```http
POST /api/v1/client/scrape-jobs/{id}/set-version
Content-Type: application/json

{
  "version": "1.2.3"
}
```

This is useful for CI/CD pipelines that want to report deployed versions directly to Planekeeper.
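
As a rough sketch, a deploy script could build that request with Python's standard library. The base URL and the `Authorization` header below are placeholders, not documented values; only the path, the content type, and the JSON body come from the request shown above.

```python
import json
import urllib.request

# Placeholder host; replace with your own Planekeeper address.
BASE_URL = "https://planekeeper.example.com"


def build_set_version_request(job_id: str, version: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that sets a manual version."""
    body = json.dumps({"version": version}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/client/scrape-jobs/{job_id}/set-version",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Hypothetical auth scheme; check how your instance authenticates.
            "Authorization": f"Bearer {token}",
        },
    )


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A pipeline step would then call `send(build_set_version_request(job_id, deployed_version, token))` after a successful deploy.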

## Parser types

Choose the parser type based on the file format you are reading.

> **info:** **Current parser implementation**
>
> The YQ and JQ parsers are **lightweight, built-in path navigators**; they do not shell out to the `yq` or `jq` CLI tools. This keeps Planekeeper dependency-free and fast, but it means the parsers support a subset of what the full CLI tools offer. Both use dot-notation path expressions (`.field.subfield`) rather than the full query languages.
>
> The **YQ parser** supports array indexing (`.dependencies[0].version`) and can parse both YAML and JSON files, making it the more capable of the two. The **JQ parser** only supports simple key-based navigation and does not handle arrays.
>
> We are actively exploring a more feature-rich parser implementation with broader query support. For now, if you need array access in JSON files, use the YQ parser. See the [Parser types reference](https://docs.planekeeper.com/reference/parser-types/) for detailed capabilities and limitations.


### YQ (YAML and JSON files)

Use YQ for YAML configuration files like `Chart.yaml`, `values.yaml`, or Kubernetes manifests. The YQ parser also handles JSON files and is the only parser that supports array indexing.

**Expression format:** Dot-notation with array indexing.

**Simple field**

```yaml
# Chart.yaml
version: 5.51.4
```

- **Expression:** `.version`
- **Result:** `5.51.4`

**Nested path**

```yaml
# values.yaml
metadata:
  version: 2.1.0
```

- **Expression:** `.metadata.version`
- **Result:** `2.1.0`

**Array access**

```yaml
# Chart.yaml
dependencies:
  - name: argo-cd
    version: 5.51.4
```

- **Expression:** `.dependencies[0].version`
- **Result:** `5.51.4`
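
To check an expression locally, a path navigator in the spirit of the one described above can be approximated in a few lines. This sketch is not Planekeeper's parser; it assumes the file has already been parsed into Python dicts and lists, and it accepts only keys and `[n]` indexes.

```python
import re

# One path segment: a key, optionally followed by [index] parts, e.g. "dependencies[0]".
_SEGMENT = re.compile(r"^([A-Za-z0-9_-]+)((?:\[\d+\])*)$")


def navigate(data, expression: str):
    """Walk parsed YAML/JSON data along a dot-notation path like .a.b[0].c."""
    if not expression.startswith("."):
        raise ValueError("expression must start with '.'")
    current = data
    for segment in expression[1:].split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unsupported segment: {segment!r}")
        key, indexes = match.groups()
        current = current[key]
        # Apply each [n] in order, so "deps[0][1]" also works.
        for index in re.findall(r"\[(\d+)\]", indexes):
            current = current[int(index)]
    return current
```

Calling `navigate(chart, ".dependencies[0].version")` on the parsed `Chart.yaml` from the array example walks the same path as the expression shown there.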

### JQ (JSON files — simple key lookups)

Use JQ for simple key-based lookups in JSON files like `package.json` or `composer.json`.

**Expression format:** Dot-notation only — no array indexing, filters, or pipes.

**Simple field**

```json
{
  "version": "3.2.1"
}
```

- **Expression:** `.version`
- **Result:** `3.2.1`

**Nested path**

```json
{
  "dependencies": {
    "react": "18.2.0"
  }
}
```

- **Expression:** `.dependencies.react`
- **Result:** `18.2.0`

> **warning:** **Limited functionality**
>
> The JQ parser only supports simple dot-notation key access. It does **not** support array indexing (e.g., `.items[0].version`), filters, or pipes. If your JSON file requires array access, use the **YQ parser** instead; it handles both YAML and JSON with full array support.
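
A key-only lookup like the one described here might look as follows. This is an illustrative stand-in, not the built-in JQ parser, and the error handling is an assumption.

```python
import json


def lookup_keys(document: str, expression: str):
    """Key-only dot-notation lookup over JSON text; rejects indexing and pipes."""
    if "[" in expression or "|" in expression:
        raise ValueError("array indexing and pipes are not supported here")
    current = json.loads(document)
    for key in expression.lstrip(".").split("."):
        current = current[key]
    return current
```

An expression such as `.items[0].version` fails fast in this sketch, which is the signal to switch the job to the YQ parser.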


### Regex (any text file)

Use regex for any text file where the version is not in a structured format, or when you need precise control over extraction.

**Expression format:** A regular expression in [Go RE2 syntax](https://github.com/google/re2/wiki/Syntax). If the pattern contains a capture group, the first captured group is returned. Otherwise, the full match is returned. Go's RE2-based engine does not support lookahead, lookbehind, or backreferences; see the [Parser Types Reference](https://docs.planekeeper.com/reference/parser-types/#regex-parser) for details.

**Capture group**

```
# Dockerfile
FROM nginx:1.25.3-alpine
```

- **Expression:** `FROM nginx:([\d.]+)`
- **Result:** `1.25.3`

**Full line match**

```
# VERSION file
v2.4.1
```

- **Expression:** `^v(\d+\.\d+\.\d+)$`
- **Result:** `2.4.1`

**Key-value pair**

```
# .env file
APP_VERSION=1.0.5
```

- **Expression:** `APP_VERSION=([\d.]+)`
- **Result:** `1.0.5`
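
The "first capture group, otherwise full match" rule can be tried locally with a short helper. Note the caveat: this sketch uses Python's `re` module, which is not RE2 and accepts constructs (such as lookarounds) that Go rejects, so treat it as a quick check for simple patterns only.

```python
import re
from typing import Optional


def extract_version(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group if the pattern has one, else the full match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # match.re.groups is the number of capture groups in the compiled pattern.
    return match.group(1) if match.re.groups >= 1 else match.group(0)
```

With the Dockerfile example above, `extract_version(r"FROM nginx:([\d.]+)", "FROM nginx:1.25.3-alpine")` yields the captured `1.25.3`, while a pattern without a group returns everything it matched.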

## Version transforms

After extracting the version, you can apply an optional transform to normalize the format:

| Transform | Input | Output | Use case |
|-----------|-------|--------|----------|
| `add_v_lower` | `1.2.3` | `v1.2.3` | Match tags that include a `v` prefix |
| `add_v_upper` | `1.2.3` | `V1.2.3` | Match tags with an uppercase `V` prefix |
| `strip_v_lower` | `v1.2.3` | `1.2.3` | Remove `v` prefix for clean comparison |
| `strip_v_upper` | `V1.2.3` | `1.2.3` | Remove `V` prefix for clean comparison |

> **success:** Use a version transform when the version format in your file does not match the format of the upstream releases. For example, if your `Chart.yaml` contains `1.2.3` but GitHub tags are `v1.2.3`, apply `add_v_lower` so the versions align for comparison.
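
The four transforms in the table amount to adding or removing a one-character prefix. A minimal sketch follows; how Planekeeper treats inputs not shown in the table (for example, a version that already has the prefix) is not documented here, so those branches are assumptions.

```python
def apply_transform(version: str, transform: str) -> str:
    """Apply one of the four documented transforms to a version string."""
    if transform in ("add_v_lower", "add_v_upper"):
        prefix = "v" if transform == "add_v_lower" else "V"
        # Assumption: an existing prefix is left alone rather than doubled.
        return version if version.startswith(prefix) else prefix + version
    if transform in ("strip_v_lower", "strip_v_upper"):
        prefix = "v" if transform == "strip_v_lower" else "V"
        # Assumption: a version without the prefix passes through unchanged.
        return version[1:] if version.startswith(prefix) else version
    raise ValueError(f"unknown transform: {transform}")
```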


## Private repository access

For private Git repositories or container registries, configure credentials on the agent and reference them by name in the job configuration.

### Security model

Planekeeper uses a **decentralized credential model** -- all secrets stay local to the agent.

- Credentials are defined in the agent's local `config.yaml` file and never leave the machine.
- During heartbeat, the agent advertises only credential **names** (e.g. `github_pat`) to the API server -- secret values are never transmitted.
- The API server uses these names for job routing: a job that requires a credential is only assigned to an agent that has advertised that name.
- All authenticated Git clones and registry pulls happen locally on the agent.

> **info:** Because secrets never reach the API server, compromising the server does not expose repository credentials. Each agent is responsible for securing its own `config.yaml` file.
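
The names-only idea can be illustrated with a small function that derives a heartbeat payload from the parsed agent config. The payload shape and the `credential_names` key are hypothetical; the guide only states that names, not values, are sent.

```python
def build_heartbeat(agent_config: dict) -> dict:
    """Build a heartbeat payload that carries credential names but no secrets."""
    credentials = agent_config.get("agent", {}).get("credentials", {})
    # Only the map keys are advertised; tokens, keys, and passwords stay local.
    return {"credential_names": sorted(credentials)}
```

Given a config with a `github_pat` entry, the payload contains the string `github_pat` and nothing from the entry's `token` field.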


### Credential types

Three credential types are supported. SSH keys and HTTPS PATs are used by scrape jobs (Git repositories). Registry credentials are used by gather jobs (OCI container registries) but are configured the same way.

#### `ssh_key` -- SSH key authentication

For repositories accessed via SSH URLs (`git@github.com:org/repo.git`).

| Field | Required | Description |
|-------|----------|-------------|
| `type` | Yes | Must be `ssh_key` |
| `private_key_file` | One of these | Path to an SSH private key file (mount into the container) |
| `private_key` | One of these | Inline PEM content (alternative to file) |
| `passphrase` | No | Passphrase for encrypted keys |

> **info:** `private_key_file` and `private_key` are mutually exclusive. Use `private_key_file` when mounting a key from a Docker volume or host path. Use `private_key` when embedding the key directly in the config.


#### `https_pat` -- HTTPS personal access token

For repositories accessed via HTTPS URLs (`https://github.com/org/repo.git`).

| Field | Required | Description |
|-------|----------|-------------|
| `type` | Yes | Must be `https_pat` |
| `token` | Yes | Personal access token (GitHub, GitLab, Bitbucket, etc.) |

#### `registry_basic` -- OCI container registry

For private container registries (Docker Hub, GHCR, quay.io, etc.). Used by gather jobs that fetch release metadata from registries.

| Field | Required | Description |
|-------|----------|-------------|
| `type` | Yes | Must be `registry_basic` |
| `username` | Yes | Registry username |
| `password` | Yes | Registry password or access token |

### Agent configuration

Credentials are defined under `agent.credentials` in the agent's `config.yaml` file. Each credential has a name (the map key) that you reference when creating jobs.

Mount the config file into the agent container at `/etc/planekeeper/config.yaml`.

**SSH key (file-based)**

```yaml
agent:
  credentials:
    my_ssh_key:
      type: ssh_key
      private_key_file: /ssh/id_ed25519
      passphrase: ""  # optional
```

**SSH key (inline)**

```yaml
agent:
  credentials:
    my_inline_key:
      type: ssh_key
      private_key: |
        -----BEGIN OPENSSH PRIVATE KEY-----
        ...key content...
        -----END OPENSSH PRIVATE KEY-----
      passphrase: ""  # optional
```

**HTTPS PAT**

```yaml
agent:
  credentials:
    github_pat:
      type: https_pat
      token: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

**Registry**

```yaml
agent:
  credentials:
    dockerhub:
      type: registry_basic
      username: myuser
      password: dckr_pat_xxxxxxxxxxxx
```

You can define multiple credentials of different types in the same config file:

```yaml
agent:
  credentials:
    github_ssh:
      type: ssh_key
      private_key_file: /ssh/id_ed25519
    gitlab_pat:
      type: https_pat
      token: glpat-xxxxxxxxxxxxxxxxxxxx
    ghcr:
      type: registry_basic
      username: github-username
      password: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
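
Before deploying a config like this, it can help to check each entry against the field tables above. The following is an illustrative linter, not part of the agent. It assumes the YAML has already been loaded into a dict, and the message wording is invented.

```python
def validate_credential(name: str, cred: dict) -> list[str]:
    """Return a list of problems found in one credential entry (empty if valid)."""
    kind = cred.get("type")
    problems = []
    if kind == "ssh_key":
        has_file = bool(cred.get("private_key_file"))
        has_inline = bool(cred.get("private_key"))
        # The two key fields are mutually exclusive, and one of them is needed.
        if has_file == has_inline:
            problems.append(f"{name}: set exactly one of private_key_file or private_key")
    elif kind == "https_pat":
        if not cred.get("token"):
            problems.append(f"{name}: token is required")
    elif kind == "registry_basic":
        for field in ("username", "password"):
            if not cred.get(field):
                problems.append(f"{name}: {field} is required")
    else:
        problems.append(f"{name}: unknown type {kind!r}")
    return problems
```

Running it over every entry under `agent.credentials` and printing the combined list gives a quick pre-flight check.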

### How job assignment works

1. The agent reads credentials from `config.yaml` on startup.
2. During each heartbeat, the agent sends the list of credential **names** to the API server.
3. When a job with a credential name is due to run, the task engine assigns it only to agents that advertised that name.
4. The agent performs the authenticated Git clone or registry pull locally using the full credential.

> **warning:** A job with a credential name will only be assigned to agents that have that credential. If no agent has the required credential, the job remains pending indefinitely.
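
The routing rule in the steps above reduces to a membership test. This sketch models it with plain data structures; the function name, the agent map, and the treatment of jobs without a credential are assumptions, not the task engine's actual code.

```python
from typing import Optional


def eligible_agents(credential_name: Optional[str], agents: dict[str, set[str]]) -> list[str]:
    """Return the agents that may run a job, given the names each agent advertised."""
    if credential_name is None:
        # Assumption: a job without a credential can run on any agent.
        return sorted(agents)
    return sorted(agent for agent, names in agents.items() if credential_name in names)
```

An empty result corresponds to the pending state described in the warning: no agent has advertised the required name yet.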


See [Agents](https://docs.planekeeper.com/guides/agents/) for more on agent deployment and configuration.

## Schedule and manual runs

### Set a schedule

Add a cron expression to run the scrape job on a recurring basis:

| Expression | Frequency |
|-----------|-----------|
| `0 */6 * * *` | Every 6 hours |
| `0 0 * * *` | Daily at midnight |
| `*/30 * * * *` | Every 30 minutes |
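
To sanity-check when an expression fires, a minimal matcher for the field forms used in this table (`*`, `*/n`, and a plain number) is sketched below. It is deliberately incomplete: ranges, lists, and names are not handled, `*/n` is computed from zero (accurate for the minute and hour fields only), and which time zone Planekeeper evaluates schedules in is not covered by this guide.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field; handles only '*', '*/n', and a plain number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if the five-field cron expression fires at the given minute."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    return (
        _field_matches(minute, moment.minute)
        and _field_matches(hour, moment.hour)
        and _field_matches(day, moment.day)
        and _field_matches(month, moment.month)
        and _field_matches(weekday, cron_weekday)
    )
```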

### Trigger a manual run

1. Open the scrape job detail page.
2. Click **Run Now**.

## Version snapshots and history

Every scrape run creates a new version snapshot. View the history on the scrape job detail page under **Version History**.

Each snapshot records:

- The version string extracted
- The Git commit SHA at the time of scraping
- A timestamp of when the version was discovered

**History limit:** Set a limit (1-20) to control how many snapshots are retained. Older snapshots beyond this limit are automatically deleted during periodic cleanup. This prevents unbounded growth while keeping enough history for tracking version changes.

**Rollback detection:** Because every scrape creates a new snapshot regardless of whether the version changed, Planekeeper correctly detects rollbacks. If you deploy version 2.0.0 and then roll back to 1.5.0, the snapshot history shows the full sequence.
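
The snapshot behavior can be modeled in a few lines to see why recording every run matters. This is a toy model under stated assumptions: the `Snapshot` fields mirror the list above but their names are invented, pruning stands in for the periodic cleanup, and the version comparison is a naive numeric one rather than whatever Planekeeper uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Snapshot:
    version: str
    commit_sha: str
    discovered_at: datetime


def record(history: list, version: str, commit_sha: str) -> None:
    """Append a snapshot on every run, even when the version is unchanged."""
    history.append(Snapshot(version, commit_sha, datetime.now(timezone.utc)))


def prune(history: list, limit: int) -> None:
    """Drop the oldest snapshots beyond the limit (1-20)."""
    if not 1 <= limit <= 20:
        raise ValueError("history limit must be between 1 and 20")
    del history[:-limit]


def version_key(version: str) -> tuple:
    """Naive numeric ordering for plain x.y.z strings; not full semver."""
    return tuple(int(part) for part in version.lstrip("vV").split("."))


def is_rollback(history: list) -> bool:
    """True if the newest snapshot has a lower version than the one before it."""
    if len(history) < 2:
        return False
    return version_key(history[-1].version) < version_key(history[-2].version)
```

Recording `2.0.0`, `2.0.0`, then `1.5.0` leaves three snapshots, and the last pair shows the drop from `2.0.0` to `1.5.0` that a deduplicated history could obscure.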

## Bulk actions

Select multiple scrape jobs using the checkboxes on the list page, then click **Delete Selected** to remove them in a single operation. Use the checkbox in the table header to select all visible items.

## Tips

- Test your parse expression on a local copy of the file before creating the job. Make sure the expression returns only the version string, not surrounding text.
- Use the **Test** button in the scrape job form to verify your regex compiles. Note that this only checks syntax validity — it does not test the pattern against file content. To verify extraction, test your pattern locally or use [regex101.com](https://regex101.com) with the **Golang** flavor. See [Parser Types Reference](https://docs.planekeeper.com/reference/parser-types/#regex-parser) for details on the Go RE2 engine.
- Start with a short history limit (5-10) and increase it if you need more historical data.
- If a scrape job consistently fails, check the job detail page for error messages. Common issues include incorrect file paths, unreachable repositories, or parse expressions that do not match the file content.


