How to continue org-details data fetching (eu_playwright)

This howto describes how to resume fetching org-details from the EU Funding & Tenders portal when you already have partial results (e.g. 3 projects) and want to fetch the remaining projects without re-fetching existing ones.

See also: docs/plans/2026-02-02_012400-org-details-run-analysis.md.

Prerequisites

  • From the repo root, run uv sync --group dev and uv run playwright install chromium once.
  • Existing file: tmp/downloaded_data/eu_playwright/org_details_<slug>_results.json with at least some projects (resume loads these and skips them by URL).
  • Chrome or Chromium with remote debugging when using --connect (recommended for long runs).

Resume with an existing browser (recommended)

Attaching to Chrome avoids launching a browser in the background and makes long runs (38 projects) more reliable.

1. Start Chrome with remote debugging

Use a non-default --user-data-dir (required for remote debugging).

# Linux
chromium --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
# or
google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug

# macOS
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
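
Before running the CLI, it can help to confirm that the debugging port is actually reachable. The following is a minimal sketch, not part of eu_playwright: the helper names devtools_version_url and check_devtools are hypothetical, and it relies only on the /json/version endpoint that Chrome exposes when started with --remote-debugging-port.

```python
import http.client
import json
import urllib.request


def devtools_version_url(base_url: str) -> str:
    """Return the DevTools version endpoint for a --connect style base URL."""
    return base_url.rstrip("/") + "/json/version"


def check_devtools(base_url: str = "http://localhost:9222", timeout: float = 3.0):
    """Return the browser's version info as a dict, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(devtools_version_url(base_url), timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts;
        # ValueError covers replies that are not valid JSON.
        return None


if __name__ == "__main__":
    info = check_devtools()
    if info is None:
        print("No browser reachable on http://localhost:9222")
    else:
        print("Browser reachable:", info.get("Browser"))
```

If the check reports no browser, start Chrome as shown above before passing --connect.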

2. Run org-details with resume and fetch-details

From the repo root:

uv run python -m eu_playwright.cli org-details --fetch-details --connect http://localhost:9222 \
  -o tmp/downloaded_data/eu_playwright

  • Resume (default): loads org_details_<slug>_results.json from the out-dir and only fetches project details for URLs not already in the file.
  • --fetch-details: for each project card, opens the project details page and parses acronym, dates, status, participants, EU contribution, programme, etc.
  • --connect: attaches to the Chrome instance above; the script will open the org-details page and click through project cards, then fetch each new project’s details.

Optional:

  • --debug – log URL, card counts, etc. after loading.
  • --save-pages DIR – save each project page HTML under DIR/<session_id>/ for offline parsing.
  • --max-projects N – cap the number of project cards to process. Default 0 = load all projects (no limit).
  • --log-file FILE – write log to FILE (default: <out-dir>/org_details_<session_id>.log).
  • --sleep-between-queries SEC – seconds to wait between EU tender portal requests (card clicks and project-detail fetches). Default 60. The delay does not reduce how many projects are downloaded; it only postpones the next request. Use 0 to disable.
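
To get a feel for how --sleep-between-queries affects run time, the waiting time can be estimated with simple arithmetic. This is a rough sketch under one assumption: the script pauses once per portal request (each card click and each project-detail fetch). The function name min_sleep_seconds is hypothetical and not part of the CLI.

```python
def min_sleep_seconds(card_clicks: int, detail_fetches: int, sleep_between: float = 60.0) -> float:
    """Lower bound on time spent waiting, assuming one pause per portal request."""
    if card_clicks < 0 or detail_fetches < 0 or sleep_between < 0:
        raise ValueError("counts and sleep interval must be non-negative")
    return (card_clicks + detail_fetches) * sleep_between


if __name__ == "__main__":
    # 5 card clicks plus 5 detail fetches at the default 60 seconds each.
    print(min_sleep_seconds(5, 5) / 60, "minutes of waiting")  # 10.0 minutes of waiting
```

Page loading and parsing time come on top of this figure, so treat it as a floor rather than a prediction.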

3. Let the run finish

The script will:

  1. Load existing projects from org_details_<slug>_results.json.
  2. Open the org-details URL and click through project cards to collect URLs (up to 38).
  3. For each URL not already in the file, open the project details page, parse it, and append to the result.
  4. Write the JSON file after each new project so an interrupt (Ctrl+C) does not lose progress.
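
The four steps above can be sketched as a small resume loop. This is an illustration only, not the eu_playwright implementation: the JSON layout (a top-level "projects" list whose items carry a "url" key), the function names, and the temp-file-then-rename write are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def load_projects(path: Path) -> list[dict]:
    """Load previously saved projects, or an empty list if the file is missing."""
    if not path.exists():
        return []
    # Assumed layout: {"projects": [{"url": ...}, ...]}
    return json.loads(path.read_text(encoding="utf-8"))["projects"]


def save_projects(path: Path, projects: list[dict]) -> None:
    """Write to a temp file, then rename, so an interrupt cannot leave a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"projects": projects}, fh, indent=2)
    os.replace(tmp, path)


def resume(path: Path, card_urls: list[str], fetch_details) -> list[dict]:
    """Fetch details only for URLs not already saved; persist after each new project."""
    projects = load_projects(path)
    seen = {p["url"] for p in projects}
    for url in card_urls:
        if url in seen:
            continue  # already in the results file, skip it
        projects.append(fetch_details(url))  # fetch_details must return a dict with "url"
        seen.add(url)
        save_projects(path, projects)
    return projects
```

Because the file is rewritten after every new project, stopping the loop with Ctrl+C loses at most the project being fetched at that moment.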

Allow 10–30 minutes for a full run, depending on network speed and page load times; larger --sleep-between-queries values lengthen the run.

Output

  • JSON: tmp/downloaded_data/eu_playwright/org_details_<slug>_results.json (updated after each new project).
  • Log: tmp/downloaded_data/eu_playwright/org_details_<session_id>.log.
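
After a run, the output files can be located by their name patterns. The sketch below assumes only the file names given above; find_outputs is a hypothetical helper, and picking the most recently modified log as "the latest" is a choice made for this example.

```python
from pathlib import Path


def find_outputs(out_dir):
    """Return (sorted result files, most recently modified log or None)."""
    out = Path(out_dir)
    results = sorted(out.glob("org_details_*_results.json"))
    logs = sorted(out.glob("org_details_*.log"), key=lambda p: p.stat().st_mtime)
    return results, (logs[-1] if logs else None)


if __name__ == "__main__":
    results, latest_log = find_outputs("tmp/downloaded_data/eu_playwright")
    print("result files:", [p.name for p in results])
    print("latest log:", latest_log.name if latest_log else None)
```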

Headless run (no browser attach)

If you prefer not to attach to a browser:

uv run python -m eu_playwright.cli org-details --fetch-details \
  -o tmp/downloaded_data/eu_playwright

Resume works the same way. In some environments (e.g. CI or machines without a display), headless runs may hit timeouts or EPIPE errors; use --connect for long runs when possible.

Troubleshooting

  • “Could not attach to browser”
    Start Chrome with --remote-debugging-port=9222 and --user-data-dir=... first, then run the CLI with --connect http://localhost:9222.

  • EPIPE or timeout
    Often caused by running a headless browser in a restricted environment. Run locally with --connect and an existing Chrome window.

  • Only 3 projects in file after run
    Either the run was interrupted before more projects were fetched, or the click loop did not see all cards (page not fully loaded). Re-run with --connect; resume is the default, so the existing 3 projects are skipped and only the remaining URLs are fetched.

See also

  • eu_playwright/README.md – CLI overview.
  • docs/plans/2026-02-02_012400-org-details-run-analysis.md – analysis of a previous run and recommendations.