EU Playwright data in a separate Git repository
Saved pages (HTML) and JSON result files from the EU portal (Funding & Tenders, CORDIS) are not stored in the Rediflow source code repository. They live in a separate Git repository so that:
- The source repo stays small and free of large/volatile data.
- Data can be versioned, shared, or backed up independently.
- Different branches or clones can point to different data repos.
What goes in the data repo
| Content | In source repo | In data repo |
|---|---|---|
| tmp/downloaded_data/eu_playwright/*.json | Ignored (tmp/) | Committed |
| saved-pages-dir/<session_id>/*.html | Ignored (saved-pages-dir/) | Committed |
So the data repository should contain:
- downloaded_data/eu_playwright/ – JSON files: org_details_<slug>_results.json, partner_search_<slug>_results.json, cordis_<slug>_results.json, eura2021_<slug>_results.json, and optionally pages/<session_id>/*.html if you also save pages under that path.
- saved-pages-dir/ – Session subdirs (e.g. 20260202_175231/), each containing call_id_project_id.html files.
Creating the data repository
Create a new Git repository (e.g. Rediflow-eu-playwright-data or eu-playwright-data) and use this layout:
eu-playwright-data/                     # root of the data repo
├── README.md                           # short description + how to use with Rediflow
├── downloaded_data/
│   └── eu_playwright/
│       ├── org_details_<slug>_results.json
│       ├── partner_search_<slug>_results.json
│       ├── cordis_<slug>_results.json
│       ├── eura2021_<slug>_results.json
│       └── (optional) pages/<session_id>/*.html
└── saved-pages-dir/
    ├── 20260202_160537/
    │   └── *.html
    └── 20260202_175231/
        └── *.html
- Create the repo (GitLab, GitHub, or local):
    mkdir eu-playwright-data && cd eu-playwright-data
    git init
- Copy or move the data from the source repo workspace into the data repo:
    # From the Rediflow repo root; set DATA_REPO to wherever you created the data repo
    DATA_REPO=../eu-playwright-data
    mkdir -p "$DATA_REPO/downloaded_data/eu_playwright" "$DATA_REPO/saved-pages-dir"
    cp -r tmp/downloaded_data/eu_playwright/*.json "$DATA_REPO/downloaded_data/eu_playwright/"
    cp -r saved-pages-dir/* "$DATA_REPO/saved-pages-dir/"
- Add a .gitignore in the data repo if you want to ignore something (e.g. *.log), then commit and push.
- In the source repo, tmp/ and saved-pages-dir/ remain ignored; the data is only in the data repo.
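The ".gitignore, then commit" step can be sketched as a small shell function. This is only an illustration: the name commit_data and the default commit message are hypothetical, and the *.log pattern is just the example mentioned above.

```shell
# Hypothetical helper: add an example ignore rule, then stage and commit
# everything in the data repo. Not part of the Rediflow scripts.
commit_data() {
  local repo_dir="$1"
  local message="${2:-Update EU Playwright data}"

  # Append the example ignore pattern only if it is not already listed.
  grep -qxF '*.log' "$repo_dir/.gitignore" 2>/dev/null \
    || echo '*.log' >> "$repo_dir/.gitignore"

  git -C "$repo_dir" add -A

  # Commit only when something is staged, so re-running is harmless.
  if ! git -C "$repo_dir" diff --cached --quiet; then
    git -C "$repo_dir" commit -q -m "$message"
  fi
}
```

Called as commit_data ../eu-playwright-data, it leaves the push to you (git -C ../eu-playwright-data push) once a remote is configured.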
Using the data repo with the source repo
Clone the data repo either inside the source tree (recommended) or alongside it, then symlink its two data directories into place so the existing scripts still find the files.
Option A: Clone data repo into the source repo (recommended)
Clone the data repo and run the link script:
# From Rediflow repo root
git clone <data-repo-url> tmp/eu_playwright_data_repo
./scripts/eu_portal/link_eu_playwright_data_repo.sh tmp/eu_playwright_data_repo
- tmp/downloaded_data/eu_playwright and saved-pages-dir are now symlinks into the data repo.
- The source repo’s .gitignore already ignores tmp/ and saved-pages-dir/, so the contents of the data repo are not committed to the source repo.
- To refresh data, run: cd tmp/eu_playwright_data_repo && git pull
Option B: Clone data repo alongside the source repo
If you clone the data repo next to the Rediflow root (e.g. ../eu-playwright-data):
# From Rediflow repo root
./scripts/eu_portal/link_eu_playwright_data_repo.sh ../eu-playwright-data
Same result: scripts use tmp/downloaded_data/eu_playwright and saved-pages-dir as usual, but the real files live in the other repo.
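The link script itself is not reproduced in this document. As a rough sketch of the end state it is described as producing, and assuming the data repo layout shown earlier, the two symlinks could also be created by hand. The function name link_data_repo is hypothetical, and the real script may do more (or behave differently).

```shell
# Hypothetical manual equivalent of the link step; run from the Rediflow repo root.
# The actual link_eu_playwright_data_repo.sh may differ.
link_data_repo() {
  local data_repo
  # Resolve to an absolute path so the links work regardless of the current directory.
  data_repo="$(cd "$1" && pwd)" || return 1

  mkdir -p tmp/downloaded_data

  # -s: symbolic link, -f: replace an existing link,
  # -n: treat an existing link to a directory as a file instead of descending into it.
  ln -sfn "$data_repo/downloaded_data/eu_playwright" tmp/downloaded_data/eu_playwright
  ln -sfn "$data_repo/saved-pages-dir" saved-pages-dir
}
```

For example, link_data_repo tmp/eu_playwright_data_repo for Option A or link_data_repo ../eu-playwright-data for Option B. If either target path already exists as a real directory, move it out of the way first; otherwise ln would create the link inside that directory.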
Scripts and paths
- eu_playwright CLI uses --out-dir (default tmp/downloaded_data/eu_playwright) and --save-pages (e.g. ./saved-pages-dir). Point these at the symlinked dirs so new runs write into the data repo; then commit and push from the data repo.
- scripts/eu_portal/update_org_details_dates_from_saved_pages.py reads org_details_<slug>_results.json from tmp/downloaded_data/eu_playwright and HTML from saved-pages-dir. With the symlinks above, it reads/writes the data repo.
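Before a run, it can be worth confirming that both paths really are symlinks, so that new output lands in the data repo rather than in a plain local directory. A minimal check, assuming it is run from the Rediflow repo root (check_data_links is a hypothetical name, not an existing script):

```shell
# Hypothetical pre-run check: both data paths should be symlinks to existing directories.
check_data_links() {
  local status=0
  local path
  for path in tmp/downloaded_data/eu_playwright saved-pages-dir; do
    if [ -L "$path" ] && [ -d "$path" ]; then
      # readlink prints the stored link target.
      echo "ok: $path -> $(readlink "$path")"
    else
      echo "not a working symlink: $path" >&2
      status=1
    fi
  done
  return "$status"
}
```

The function returns non-zero if either path is missing, is a regular directory, or is a dangling link, so it can gate a run: check_data_links && <your usual command>.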
Summary
| Goal | Action |
|---|---|
| Keep data out of the source repo | tmp/ and saved-pages-dir/ are in .gitignore. |
| Store data in Git | Use a separate repository with downloaded_data/eu_playwright/ and saved-pages-dir/. |
| Use data from the source repo | Clone the data repo and symlink those two dirs into tmp/downloaded_data/eu_playwright and saved-pages-dir. |