EU Playwright data in a separate Git repository

Saved pages (HTML) and JSON result files from the EU portal (Funding & Tenders, CORDIS) are not stored in the Rediflow source code repository. They live in a separate Git repository so that:

  • The source repo stays small and free of large/volatile data.
  • Data can be versioned, shared, or backed up independently.
  • Different branches or clones can point to different data repos.

What goes in the data repo

Content                                     In source repo               In data repo
tmp/downloaded_data/eu_playwright/*.json    Ignored (tmp/)               Committed
saved-pages-dir/<session_id>/*.html         Ignored (saved-pages-dir/)   Committed

So the data repository should contain:

  • downloaded_data/eu_playwright/ – JSON files: org_details_<slug>_results.json, partner_search_<slug>_results.json, cordis_<slug>_results.json, eura2021_<slug>_results.json, and optionally pages/<session_id>/*.html if you also save pages under that path.
  • saved-pages-dir/ – Session subdirs (e.g. 20260202_175231/) each containing call_id_project_id.html files.
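
A quick way to confirm that a checkout follows this layout is to test for the two top-level directories. The sketch below is illustrative only: check_data_repo_layout is a hypothetical helper name, not part of Rediflow, and it checks directory presence only, not file names or contents.

```shell
# Hypothetical helper: returns 0 if both expected directories exist
# under the given data repo root, 1 otherwise.
check_data_repo_layout() {
    repo_dir=$1
    rc=0
    for sub in downloaded_data/eu_playwright saved-pages-dir; do
        if [ ! -d "$repo_dir/$sub" ]; then
            echo "missing: $repo_dir/$sub" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

For example, check_data_repo_layout ../eu-playwright-data && echo "layout OK" (the path is an example).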

Creating the data repository

Create a new Git repository (e.g. Rediflow-eu-playwright-data or eu-playwright-data) and use this layout:

eu-playwright-data/          # root of the data repo
├── README.md                # short description + how to use with Rediflow
├── downloaded_data/
│   └── eu_playwright/
│       ├── org_details_<slug>_results.json
│       ├── partner_search_<slug>_results.json
│       ├── cordis_<slug>_results.json
│       ├── eura2021_<slug>_results.json
│       └── (optional) pages/<session_id>/*.html
└── saved-pages-dir/
    ├── 20260202_160537/
    │   └── *.html
    └── 20260202_175231/
        └── *.html

  1. Create the repo (GitLab, GitHub, or local):

    mkdir eu-playwright-data && cd eu-playwright-data
    git init
    
  2. Copy or move the data from the source repo workspace:

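    # From Rediflow repo root. DATA_REPO is the data repo created in step 1;
    # ../eu-playwright-data is only an example path, adjust as needed.
    DATA_REPO=../eu-playwright-data
    mkdir -p "$DATA_REPO/downloaded_data/eu_playwright" "$DATA_REPO/saved-pages-dir"
    cp tmp/downloaded_data/eu_playwright/*.json "$DATA_REPO/downloaded_data/eu_playwright/"
    cp -r saved-pages-dir/* "$DATA_REPO/saved-pages-dir/"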
    
  3. Add a .gitignore in the data repo if you want to ignore something (e.g. *.log), then commit and push.

  4. In the source repo, tmp/ and saved-pages-dir/ remain ignored; the data is tracked only in the data repo.

Using the data repo with the source repo

Either clone the data repo into the source tree (recommended) or clone it alongside the source repo. In both cases the link script creates symlinks so the existing scripts still find the files.

Option A: Clone data repo into the source repo (recommended)

Clone the data repo and run the link script:

    # From Rediflow repo root
    git clone <data-repo-url> tmp/eu_playwright_data_repo
    ./scripts/eu_portal/link_eu_playwright_data_repo.sh tmp/eu_playwright_data_repo

  • tmp/downloaded_data/eu_playwright and saved-pages-dir are now symlinks into the data repo.
  • The source repo’s .gitignore already ignores tmp/ and saved-pages-dir/, so the contents of the data repo are not committed to the source repo.
  • To refresh data: cd tmp/eu_playwright_data_repo && git pull.
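
For orientation, the effect of the linking step can be sketched as below. This is not the contents of link_eu_playwright_data_repo.sh, which is not reproduced in this document; link_data_repo_sketch is a hypothetical function that assumes the data repo layout described earlier. As a design choice of the sketch, it refuses to replace real files or directories and only overwrites existing symlinks.

```shell
# Hypothetical sketch: link the two data directories of a data repo
# into a source tree. Usage: link_data_repo_sketch <source_root> <data_repo>
link_data_repo_sketch() {
    source_root=$1
    # Resolve to an absolute path so the links work from any directory.
    data_repo=$(cd "$2" && pwd) || return 1
    json_link=$source_root/tmp/downloaded_data/eu_playwright
    pages_link=$source_root/saved-pages-dir

    # Refuse to touch real files or directories; only replace old symlinks.
    for dest in "$json_link" "$pages_link"; do
        if [ -e "$dest" ] && [ ! -L "$dest" ]; then
            echo "refusing to replace non-symlink: $dest" >&2
            return 1
        fi
    done

    mkdir -p "$source_root/tmp/downloaded_data"
    ln -sfn "$data_repo/downloaded_data/eu_playwright" "$json_link"
    ln -sfn "$data_repo/saved-pages-dir" "$pages_link"
}
```

If the sketch reports a non-symlink, move or remove the existing directory by hand first, after checking that its contents are already in the data repo.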

Option B: Clone data repo alongside the source repo

If you clone the data repo next to the Rediflow root (e.g. ../eu-playwright-data):

    # From Rediflow repo root
    ./scripts/eu_portal/link_eu_playwright_data_repo.sh ../eu-playwright-data

Same result: scripts use tmp/downloaded_data/eu_playwright and saved-pages-dir as usual, but the real files live in the other repo.
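
With either option, it can help to confirm where the two paths actually point before running scripts. The following is a minimal sketch; show_data_links is a hypothetical helper name, and it only reports the immediate link target without following chains of links.

```shell
# Hypothetical helper: print the target of each expected symlink.
# Usage: show_data_links [source_root]   (defaults to the current directory)
show_data_links() {
    source_root=${1:-.}
    for rel in tmp/downloaded_data/eu_playwright saved-pages-dir; do
        if [ -L "$source_root/$rel" ]; then
            echo "$rel -> $(readlink "$source_root/$rel")"
        else
            echo "$rel is not a symlink"
        fi
    done
}
```

Run it from the Rediflow repo root as show_data_links; a line ending in "is not a symlink" means that path is either missing or still a real directory in the source tree.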

Scripts and paths

  • The eu_playwright CLI takes --out-dir (default tmp/downloaded_data/eu_playwright) and --save-pages (e.g. ./saved-pages-dir). Point these at the symlinked dirs so new runs write into the data repo; then commit and push from the data repo.
  • scripts/eu_portal/update_org_details_dates_from_saved_pages.py reads org_details_<slug>_results.json from tmp/downloaded_data/eu_playwright and HTML from saved-pages-dir. With the symlinks above, it reads from and writes to the data repo.
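
After a run has written new files through the symlinks, the changes still need to be committed from the data repo. A minimal sketch of that step follows; commit_data_changes and its default commit message are hypothetical, and pushing is left as a separate manual step because it depends on how the remote is set up.

```shell
# Hypothetical helper: stage and commit everything that changed in the
# data repo. Usage: commit_data_changes <data_repo> [message]
commit_data_changes() {
    data_repo=$1
    message=${2:-"Update EU Playwright data"}
    git -C "$data_repo" add -A || return 1
    # Nothing staged means nothing to commit; treat that as success.
    if git -C "$data_repo" diff --cached --quiet; then
        echo "no changes to commit"
        return 0
    fi
    git -C "$data_repo" commit -q -m "$message"
}
```

For example, with Option A: commit_data_changes tmp/eu_playwright_data_repo "Add partner search results", followed by git -C tmp/eu_playwright_data_repo push.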

Summary

Goal                                Action
Keep data out of the source repo    tmp/ and saved-pages-dir/ are in .gitignore.
Store data in Git                   Use a separate repository with downloaded_data/eu_playwright/ and saved-pages-dir/.
Use data from the source repo       Clone the data repo and symlink those two dirs into tmp/downloaded_data/eu_playwright and saved-pages-dir.