I’m streamlining a recurring data-driven workflow that lives entirely on a Unix box. The goal is to replace a patchwork of manual steps with a single Python-based utility that can:

• Pick up incoming files from predefined directories, validate them, and move them through an archive/processing path.
• Trigger the correct Hive queries to load, transform, or refresh tables, then write completion status back to a control table.
• Interact with other database systems when needed for look-ups or logging.
• Produce clear, rotating log files and raise Unix-level exit codes so the whole thing can be scheduled in cron or any enterprise scheduler.

Core requirements
– Written in Python 3, callable from the Unix shell.
– Uses native Hive connections (beeline/PyHive or similar) for all Hive operations.
– All paths, connection strings, and query text must be externalised in a simple config file so I can retarget environments without code edits.
– Idempotent: if the same run fires twice, it should recognise completed steps and skip them or safely overwrite.
– Detailed inline comments plus a README that covers setup, dependency installation, and a sample cron entry.

Deliverables
1. Fully tested source code and shell wrapper(s).
2. Config/template files and a quickstart README.
3. A brief hand-off call or document that walks through deploying the utility on my server.

I have direct access to the Unix environment and can supply sample files, current Hive DDL, and test tables as soon as we start. Let me know which libraries you expect to add so I can confirm they’re allowed on the host. Looking forward to seeing how you’d approach this.
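To make the shape of the thing concrete, here is a minimal stdlib-only sketch of the runner the requirements describe: external config, rotating logs, and marker-file idempotency so a duplicate run skips completed steps and exits cleanly. All names here (`make_logger`, `run_step`, the `.done` markers) are illustrative assumptions, not part of the brief, and the Hive work is left as a stubbed callable — the real implementation would fire beeline/PyHive from inside `action`.

```python
# Illustrative skeleton only: section/key names and the marker-file
# idempotency scheme are assumptions, not taken from the brief.
import configparser
import logging
import logging.handlers
import sys
from pathlib import Path

EXIT_OK, EXIT_FAILED = 0, 1  # Unix-level exit codes for cron / the scheduler


def load_config(path: str) -> configparser.ConfigParser:
    """Read all paths, connection strings, and query text from an external .cfg file."""
    cfg = configparser.ConfigParser()
    if not cfg.read(path):
        raise FileNotFoundError(path)
    return cfg


def make_logger(log_file: str) -> logging.Logger:
    """Rotating log files: 5 MB per file, 3 backups kept."""
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        log_file, maxBytes=5_000_000, backupCount=3
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run_step(name: str, done_dir: Path, logger: logging.Logger, action) -> bool:
    """Idempotency via a per-step marker file: a re-run sees the marker and skips.

    Returns True if the step ran, False if it was already complete.
    """
    marker = done_dir / f"{name}.done"
    if marker.exists():
        logger.info("step %s already complete, skipping", name)
        return False
    action()        # e.g. move a validated file, or fire a Hive query via PyHive
    marker.touch()  # record completion so a duplicate run is a no-op
    logger.info("step %s complete", name)
    return True
```

A wrapper around this would catch exceptions per step and `sys.exit(EXIT_FAILED)` on error, so a scheduler entry along the lines of `0 2 * * * /opt/pipeline/run.sh` (path hypothetical) can alert on a non-zero exit.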