I have an existing Python data-processing script that reads structured files (mostly CSVs pulled from SQL dumps) and pushes the results into Google BigQuery. It works, but it’s messy. I’m looking for someone to go through the codebase, iron out the bugs that occasionally stop a run, streamline the logic so it runs faster, and then extend it with a couple of small features I’ve been postponing.

Right now the script:
• ingests source files from local disk or Cloud Storage
• performs several Pandas transformations and a few custom calculations
• uploads the final table into a dedicated BigQuery dataset

What I need from you:
• Debug existing errors so every run completes without manual tweaks
• Refactor for clearer structure and better performance (vectorised Pandas where possible, fewer redundant reads/writes)
• Add two new options: automatic schema detection for new CSVs and an argument to select the destination table on the fly (see the rough sketch at the end of this post for the kind of interface I have in mind)
• Update any BigQuery-specific calls to the latest google-cloud-bigquery library patterns
• Hand back a single, well-commented script (or small module) plus a short README that shows install steps and example commands

If you’re comfortable with Python 3.x, Pandas, and the BigQuery SDK, this should be straightforward work. I’ll provide the current code, a sample dataset, and access to a test GCP project as soon as we kick off.
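To make the last two feature requests concrete, here is a rough sketch of the direction I have in mind. It is not code from the current script: the file name, flag names, and write disposition are placeholders I made up for illustration, and it only shows the CSV-to-BigQuery load step with schema autodetection plus a command-line flag for the destination table, using the current google-cloud-bigquery load-job pattern.

    # load_csv.py -- illustrative sketch only; names and defaults are placeholders
    import argparse

    from google.cloud import bigquery


    def load_csv(csv_path: str, table_id: str) -> None:
        """Load one CSV into the given BigQuery table, letting BigQuery infer the schema."""
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,  # header row
            autodetect=True,      # automatic schema detection for new CSVs
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        )
        with open(csv_path, "rb") as source_file:
            job = client.load_table_from_file(source_file, table_id, job_config=job_config)
        job.result()  # block until the load job finishes
        table = client.get_table(table_id)
        print(f"Loaded {table.num_rows} rows into {table_id}")


    def main() -> None:
        parser = argparse.ArgumentParser(description="Load a CSV into BigQuery")
        parser.add_argument("csv_path", help="path to the source CSV")
        parser.add_argument(
            "--table",
            required=True,
            help="destination table, e.g. my-project.my_dataset.my_table",
        )
        args = parser.parse_args()
        load_csv(args.csv_path, args.table)


    if __name__ == "__main__":
        main()

The real deliverable would keep the existing Pandas transformations in between the read and the load; this is just the shape of the CLI and the BigQuery calls I am hoping to end up with.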