TJSP Case Scraping API with Cloudflare Bypass

Client: AI | Published: 11.12.2025
Budget: $250

### Project Title

Development of a Python/FastAPI REST API for TJSP Case Scraping with Cloudflare Turnstile Bypass

### General Description

Development of a robust REST API to automatically query court cases on the TJSP e-Proc portal, with token-based authentication, rate limiting, and bypass of Cloudflare Turnstile protection.

### Required Tech Stack

- Python (version 3.10+)
- FastAPI
- Pydoll (library for Turnstile bypass and web scraping)
- MySQL (native connection or SQLAlchemy)
- SmartProxy (prepare configuration for residential rotating proxy)
- Docker + Docker Compose

***

### 1. Authentication and Access Control

#### 1.1 `tokens` Table Structure in MySQL

The database must include a `tokens` table with at least the following fields:

- `id`: integer, primary key, auto-increment
- `token`: string, unique, not null
- `ativo` (active): boolean, default `true`
- `limite_req_minuto`: integer, default `0` (0 = unlimited)
- `limite_req_hora`: integer, default `0` (0 = unlimited)
- `limite_req_dia`: integer, default `0` (0 = unlimited)
- `limite_req_mes`: integer, default `0` (0 = unlimited)
- `data_criacao`: datetime, default current timestamp
- `data_expiracao`: datetime (nullable)

Note: the value `0` in any limit field means “unlimited”.

#### 1.2 Rate Limiting Table

Create a table to persist request counters per token, so limits survive restarts:

- `id`: integer, primary key, auto-increment
- `token_id`: foreign key referencing `tokens.id`
- `contador_minuto`, `contador_hora`, `contador_dia`, `contador_mes`: integer counters
- `ultimo_reset_minuto`, `ultimo_reset_hora`, `ultimo_reset_dia`, `ultimo_reset_mes`: datetime fields to control resets

#### 1.3 Token Validation

For each request the API must:

- Check whether the token exists and is active
- Check that `data_expiracao` has not passed (if not null)
- Check all configured limits (per minute, hour, day, month) before processing

If any limit would be exceeded, none of the cases in the request may be processed.

***

### 2. API Endpoints

#### 2.1 Main Endpoint: `POST /consultar`

**Request JSON:**

```json
{
  "token": "abc123xyz",
  "processos": [
    { "numero": "4001323-38.2025.8.26.0505", "tipo": "primeiroGrau" },
    { "numero": "1234567-89.2023.8.26.0100", "tipo": "tribunalDeJustica" },
    { "numero": "9876543-21.2024.8.26.0100", "tipo": "turmasRecursais" }
  ]
}
```

- `token`: access token to be validated against the database.
- `processos`: array of objects, each with:
  - `numero`: case number
  - `tipo`: `"primeiroGrau"`, `"tribunalDeJustica"` or `"turmasRecursais"`

**Validation Rules:**

1. The number of items in `processos` must respect the token’s limits (per minute/hour/day/month).
2. If the total number of requested cases exceeds the available limit:
   - The API must return an error.
   - It must **not** process any of the cases.
   - The response must include the configured limit and the remaining available quota.

**Example Response JSON (Success – structured data):**

```json
{
  "sucesso": true,
  "processos": [
    {
      "numero": "4001323-38.2025.8.26.0505",
      "tipo": "primeiroGrau",
      "dados_estruturados": {
        "eventos": [...],
        "partes": [...],
        "movimentacoes": [...],
        "documentos": [
          { "tipo": "link_processo", "url": "https://..." },
          { "tipo": "documento", "conteudo": "extracted document text" }
        ]
      }
    }
  ]
}
```

- The goal is to return structured data (events, parties, movements, and documents), not raw HTML.

***

### 3. Web Scraping Logic

#### 3.1 Target URL

Base search page:

- `https://eproc-consulta.tjsp.jus.br/consulta_1g/externo_controlador.php?acao=tjsp@consulta_unificada_publica/consultar`

#### 3.2 Per-Case Scraping Flow

For each case in the `processos` array:

1. Access the unified search page.
2. In the dropdown list, select the option corresponding to the `tipo` field:
   - `"primeiroGrau"` → “1º Grau”
   - `"tribunalDeJustica"` → “Tribunal de Justiça”
   - `"turmasRecursais"` → “Turmas Recursais”
3. Fill in the case number and submit the form.
4. On the result page, find and click the link “Clique aqui para listar todos os eventos”.
5. Scrape **all information** from the page that opens after clicking this link.
6. Link handling:
   - If a link points to another case:
     - Only return the link URL in the `documentos` section, with `tipo: "link_processo"`.
   - If a link points to a case document:
     - Open the new page and extract the text content.
     - Return the extracted text in the `documentos` section, with `tipo: "documento"` and a `conteudo` field.

If, for a given case, the “Clique aqui para listar todos os eventos” link is not present, scraping must be performed on the main result page instead.

#### 3.3 Cloudflare Turnstile Bypass

- Use Pydoll to handle and bypass Cloudflare Turnstile.
- Implement retry logic (for example, up to 3 attempts) in case of temporary blocks.
- Prepare full support for SmartProxy integration:
  - A configuration option must exist to enable/disable proxy usage.
  - When proxy is enabled, requests must be routed through SmartProxy (residential rotating proxies).
  - Proxy credentials (host, port, username, password) will be provided and configured later by the client in environment variables.

***

### 4. HTTP Response Codes

The API must use consistent HTTP status codes:

- **200 OK** – Successful scraping and response.
- **401 Unauthorized** – Invalid or expired token.
- **429 Too Many Requests** – Rate limit exceeded (include configured limits and remaining quota in the message).
- **404 Not Found** – Case not found on the TJSP website.
- **503 Service Unavailable** – Turnstile block or TJSP site temporarily unavailable.
- **500 Internal Server Error** – Generic internal error.

***

### 5. Infrastructure and Deployment

#### 5.1 Required Deliverables

The freelancer must deliver:

1. Fully working Python source code (FastAPI + Pydoll), clean and well-structured.
2. `Dockerfile` optimized for production.
3. `docker-compose.yml` including:
   - FastAPI service
   - MySQL 8.0+
   - (Optional) Redis, if used for caching
4. `requirements.txt` listing all dependencies.
5. `.env.example` file with all necessary environment variables, for example:

   ```text
   MYSQL_HOST=localhost
   MYSQL_PORT=3306
   MYSQL_USER=root
   MYSQL_PASSWORD=your_password
   MYSQL_DATABASE=api_processos
   USAR_PROXY=False
   SMARTPROXY_HOST=
   SMARTPROXY_PORT=
   SMARTPROXY_USER=
   SMARTPROXY_PASSWORD=
   ```

6. SQL scripts to create all required tables (tokens, rate limiting, etc.).
7. A clear `README.md` with:
   - Setup and installation instructions
   - How to run with Docker/Docker Compose
   - Environment configuration
   - Example requests/responses
8. Automatic Swagger/OpenAPI documentation exposed by FastAPI.

#### 5.2 Deployment Environment

- The API will be deployed on the client’s own VPS.
- The freelancer is responsible only for delivering the code and documentation, not for performing the actual deployment.
- The client will manage tokens manually through phpMyAdmin.

#### 5.3 Logging

Implement structured logging, including:

- Incoming requests (timestamp, token ID, number of cases requested).
- Rate limiting checks and results.
- Scraping errors and exceptions.
- Turnstile blocks and retries.
- Per-case processing time.

***

### 6. Non-Functional Requirements

- **Performance**: Support at least 20 requests per minute.
- **Security**: Strict token validation and persisted rate limiting.
- **Maintainability**: Clean, well-organized, and documented code, following Python best practices.
- **Reliability**: Resilient to temporary failures, with retry mechanisms where appropriate.

***

### 7. Out of Scope

The project explicitly does **not** include:

- Automated tests (unit or integration).
- Deployment to the VPS/server.
- Administrative endpoints (CRUD for tokens).
- Any graphical user interface or frontend.

***

### 8. Timeline and Budget

- To be agreed directly between client and freelancer.
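
As an illustration of the all-or-nothing limit check described in sections 1.3 and 2.1, the following is a minimal sketch, not a required design. The names `check_limits` and `LimitCheck`, and the use of plain dictionaries keyed by window name, are assumptions made for this example; the spec only fixes the column names and the rule that `0` means unlimited.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Window names mirror the suffixes of the limite_req_* / contador_* columns.
WINDOWS = ("minuto", "hora", "dia", "mes")


@dataclass
class LimitCheck:
    allowed: bool
    window: Optional[str] = None     # first window that would be exceeded
    limit: Optional[int] = None      # configured limit for that window
    remaining: Optional[int] = None  # quota still available in that window


def check_limits(limits: Dict[str, int], counters: Dict[str, int], requested: int) -> LimitCheck:
    """Check whether `requested` cases fit into every configured window.

    Nothing is consumed here: the caller should only increment counters
    (and start scraping) when the result is allowed.
    """
    for window in WINDOWS:
        limit = limits.get(window, 0)
        if limit == 0:  # 0 means unlimited
            continue
        remaining = max(limit - counters.get(window, 0), 0)
        if requested > remaining:
            return LimitCheck(False, window, limit, remaining)
    return LimitCheck(True)
```

A rejected check carries the configured limit and the remaining quota, which is the information the 429 response is expected to include.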
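
The mapping between the request field `tipo` and the dropdown option in step 2 of section 3.2 could be captured as below. This is a sketch only; `TipoProcesso`, `DROPDOWN_LABELS`, and `dropdown_label` are hypothetical names, and how the option is actually selected in the browser is left to the scraping layer.

```python
from enum import Enum


class TipoProcesso(str, Enum):
    PRIMEIRO_GRAU = "primeiroGrau"
    TRIBUNAL_DE_JUSTICA = "tribunalDeJustica"
    TURMAS_RECURSAIS = "turmasRecursais"


# Visible option text on the unified search page, as listed in section 3.2.
DROPDOWN_LABELS = {
    TipoProcesso.PRIMEIRO_GRAU: "1º Grau",
    TipoProcesso.TRIBUNAL_DE_JUSTICA: "Tribunal de Justiça",
    TipoProcesso.TURMAS_RECURSAIS: "Turmas Recursais",
}


def dropdown_label(tipo: str) -> str:
    """Translate a request `tipo` into the dropdown option text.

    Raises ValueError for any value outside the three accepted ones,
    so invalid input can be rejected before a browser is started.
    """
    return DROPDOWN_LABELS[TipoProcesso(tipo)]
```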
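
The retry requirement in section 3.3 (for example, up to 3 attempts) could look roughly like the sketch below. It deliberately contains no Pydoll calls: `TurnstileBlocked` is a hypothetical exception that the scraping code would raise when the challenge is not solved, and the exponential backoff is an assumption, since the spec does not prescribe a delay strategy.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TurnstileBlocked(Exception):
    """Hypothetical error: the Turnstile challenge was not passed."""


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run `operation`, retrying on TurnstileBlocked with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except TurnstileBlocked:
            if attempt == attempts:
                raise  # out of attempts: let the caller map this to 503
            # Waits base_delay, then 2x, then 4x, ... between attempts.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```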
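
One way to keep the status codes of section 4 consistent is to map internal error types to codes in a single place. The exception classes and the `status_for` helper below are hypothetical names for illustration; in a FastAPI application the same table could back the exception handlers.

```python
class InvalidToken(Exception):
    """Token missing, inactive, or expired."""


class RateLimitExceeded(Exception):
    """Request would exceed a configured limit."""


class CaseNotFound(Exception):
    """Case number not found on the TJSP website."""


class UpstreamUnavailable(Exception):
    """Turnstile block or TJSP site temporarily unavailable."""


STATUS_BY_ERROR = {
    InvalidToken: 401,
    RateLimitExceeded: 429,
    CaseNotFound: 404,
    UpstreamUnavailable: 503,
}


def status_for(error: Exception) -> int:
    """Return the HTTP status for an error, falling back to 500."""
    for exc_type, status in STATUS_BY_ERROR.items():
        if isinstance(error, exc_type):
            return status
    return 500
```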
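
The proxy switch from section 3.3 and the variables in `.env.example` could be read as sketched below. The helper name `proxy_url`, the accepted truthy spellings of `USAR_PROXY`, and the `http://` scheme are assumptions; the actual endpoint format depends on the credentials the client provides later.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def proxy_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Build a proxy URL from the environment, or return None when disabled."""
    enabled = env.get("USAR_PROXY", "False").strip().lower() in ("1", "true", "yes")
    if not enabled:
        return None
    host = env.get("SMARTPROXY_HOST", "")
    port = env.get("SMARTPROXY_PORT", "")
    if not host or not port:
        raise ValueError("USAR_PROXY is enabled but host/port are not set")
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    user = quote(env.get("SMARTPROXY_USER", ""), safe="")
    password = quote(env.get("SMARTPROXY_PASSWORD", ""), safe="")
    auth = f"{user}:{password}@" if user else ""
    return f"http://{auth}{host}:{port}"
```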
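
For the structured logging asked for in section 5.3, a standard-library-only approach is a JSON formatter. The sketch below assumes one JSON object per log line and a hypothetical `fields` entry passed through `extra`; the field names shown (`token_id`, `num_processos`) are examples, not part of the spec.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra structured fields, e.g. token ID or per-case processing time.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str = "api") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger().info("request_received", extra={"fields": {"token_id": 1, "num_processos": 3}})` would then emit one JSON line containing the timestamp, token ID, and number of cases requested.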