Scrapling's engine layer — 5 functions with the highest activity-weighted risk

Scrapling's top activity-risk functions cluster in scrapling/spiders/engine.py and scrapling/engines/_browsers/_stealth.py, where god-function patterns, CC scores above 40, and fan-out above 25 combine with live commit churn.

Stephen Collins
oss python refactoring code-health

Antipatterns Detected

exit_heavy (8), complex_branching (7), deeply_nested (7), god_function (7), long_function (6)

Key Points

What is a god function and why does it matter in Scrapling?

A god function is one that handles too many distinct responsibilities — it calls many other functions, contains many branching paths, and becomes the single point where a large portion of behavior is controlled. In Scrapling, crawl and find_all both show this pattern: crawl calls 26 distinct functions while find_all calls 29, meaning a change to either can ripple across a wide surface of the codebase in unpredictable ways. The practical consequence is that these functions are hard to test in isolation, hard to review confidently, and disproportionately risky to modify.

How do I reduce cyclomatic complexity in Python?

The most direct technique is extract-method refactoring: identify cohesive groups of conditional branches and move each into a named helper function, replacing the original branch with a single function call. In Python, replacing long if/elif chains with dispatch dictionaries or strategy objects can further collapse multiple paths into a single lookup, reducing both complexity and test burden in one step.
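As a minimal sketch of the dispatch-dictionary idea, the example below moves each branch of an if/elif chain into a named handler and replaces the chain with a single lookup. Every name here (handle_ok, HANDLERS, handle_response, and so on) is hypothetical and chosen for illustration; none of it is taken from Scrapling's code.

```python
from typing import Callable


# Each former branch becomes a small named function (extract-method).
def handle_ok(body: str) -> str:
    return body.strip()


def handle_redirect(body: str) -> str:
    return "follow redirect"


def handle_blocked(body: str) -> str:
    return "retry with another session"


def handle_unknown(body: str) -> str:
    return "log and skip"


# The if/elif chain becomes data: status code -> handler.
HANDLERS: dict[int, Callable[[str], str]] = {
    200: handle_ok,
    301: handle_redirect,
    403: handle_blocked,
}


def handle_response(status: int, body: str) -> str:
    """Single execution path: look the handler up, then call it."""
    handler = HANDLERS.get(status, handle_unknown)
    return handler(body)


print(handle_response(200, "  <html></html>  "))
print(handle_response(418, ""))
```

Adding a new case now means adding one function and one dictionary entry, and each handler can be tested on its own without driving the dispatcher through every branch.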

Is Scrapling actively maintained?

The commit data points to active development in the engine layer: crawl in scrapling/spiders/engine.py received 6 commits in the last 30 days, fetch in the stealth engine was last changed 17 days ago, and _rate_limiter in the spider engine has 3 touches in 30 days. With 32 functions in the fire quadrant — meaning both structurally complex and recently active — the project shows clear signs of ongoing, substantive development.

How do I reproduce this analysis?

Run the Hotspots CLI against the D4Vinci/Scrapling repository at commit a228543 to reproduce the scores and quadrant assignments shown here.

What does activity-weighted risk mean?

Activity-weighted risk is structural complexity multiplied by recent commit frequency: functions that are hard to understand and actively changing are the highest priority for refactoring.
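To make the quadrant idea concrete, here is a rough sketch of how a function might be sorted into the "fire" and "debt" quadrants mentioned in this article. The thresholds (COMPLEX_CC, ACTIVE_WITHIN_DAYS) and the FunctionStats type are assumptions made for illustration; the article does not state the cut-offs or the exact formula Hotspots uses, and the label "other" is a placeholder rather than a Hotspots quadrant name.

```python
from dataclasses import dataclass

# Assumed cut-offs for illustration only; not taken from Hotspots.
COMPLEX_CC = 15
ACTIVE_WITHIN_DAYS = 30


@dataclass
class FunctionStats:
    name: str
    cc: int                 # cyclomatic complexity
    days_since_change: int  # age of the most recent commit touching it


def quadrant(stats: FunctionStats) -> str:
    """Complex and recently changed -> fire; complex but dormant -> debt."""
    is_complex = stats.cc >= COMPLEX_CC
    is_active = stats.days_since_change <= ACTIVE_WITHIN_DAYS
    if is_complex and is_active:
        return "fire"
    if is_complex:
        return "debt"
    return "other"  # placeholder label, not a Hotspots term


# CC and recency figures as reported in this article.
for stats in (
    FunctionStats("fetch (stealth)", cc=42, days_since_change=17),
    FunctionStats("_cloudflare_solver", cc=34, days_since_change=62),
):
    print(stats.name, "->", quadrant(stats))
```

Under these assumed thresholds the stealth fetch lands in "fire" and _cloudflare_solver in "debt", which matches the quadrant assignments reported here.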

Scrapling’s highest-priority refactoring targets sit inside two files — scrapling/spiders/engine.py and scrapling/engines/_browsers/_stealth.py — where CC scores of 41 and 42, fan-out values above 25, and recent commit activity combine to create live regression risk right now. The top-ranked function, crawl, is both structurally complex and actively changing — this is not a backlog cleanup item. Across 391 total functions, Scrapling has 36 rated critical and 32 in the fire quadrant, signaling that the engine and browser abstraction layers deserve immediate attention.

The table below ranks functions by activity-weighted risk — a score that multiplies structural complexity by recent commit frequency. A function that is both hard to understand (high cyclomatic complexity) and actively changing is a higher priority than one that is complex but untouched. CC = cyclomatic complexity (independent execution paths); ND = max nesting depth; FO = fan-out (distinct callees).

Top 5 Hotspots

Function            File                                         Risk  CC  ND  FO
crawl               scrapling/spiders/engine.py                  18.6  41   7  26
fetch               scrapling/engines/_browsers/_stealth.py      16.3  42   5  25
fetch               scrapling/engines/_browsers/_controllers.py  16.2  40   5  24
parse               scrapling/core/shell.py                      15.9  62   3  30
_cloudflare_solver  scrapling/engines/_browsers/_stealth.py      15.8  34   5  17
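For readers who want a feel for what the CC and ND columns measure, the sketch below approximates both with Python's standard ast module. It is a simplified McCabe-style count and is not the Hotspots implementation; the node lists and the sample function are assumptions chosen for illustration, and a real tool will differ on details such as how elif and try blocks are scored.

```python
import ast

# Node types treated as decision points (a simplified approximation).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)
# Node types treated as opening a new nesting level.
NESTING_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With)


def cyclomatic_complexity(func: ast.AST) -> int:
    """1 + decision points; each extra and/or operand adds one path."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1
    return score


def max_nesting(node: ast.AST, depth: int = 0) -> int:
    """Deepest chain of nested control-flow blocks below `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


SAMPLE = '''
def pick(items, limit):
    for item in items:
        if item > limit and item % 2 == 0:
            return item
    return None
'''

func = ast.parse(SAMPLE).body[0]
print(cyclomatic_complexity(func), max_nesting(func))  # prints: 4 2
```

The sample scores 4 because the loop, the if, and the `and` each add one path to the base path, and its nesting depth is 2 because the if sits inside the for.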

Hotspot Analysis

crawl — scrapling/spiders/engine.py

As the central orchestration point in Scrapling’s spider engine, crawl almost certainly coordinates request dispatch, response handling, and spider lifecycle decisions — the kind of function that touches everything. Its cyclomatic complexity of 41 means 41 independent execution paths, each a required test case and a potential regression surface. With a fan-out of 26, a max nesting depth of 7, and 6 commits in the last 30 days, this is a fire-quadrant function: it is structurally complex and actively changing right now, making every commit a live regression risk across a broad call graph.

Recommendation: Add characterization tests covering the dominant execution paths before any further changes, then extract the deeply nested branching blocks (ND 7) into named sub-functions to bring CC below 15 and reduce the blast radius of future edits.
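A characterization test records what the code does today so that a refactor can be checked against it. The sketch below shows the shape of such a test; legacy_schedule is a hypothetical stand-in for a branch-heavy function and is not Scrapling's crawl, whose real signature and behavior are not shown in this article.

```python
# Hypothetical stand-in for a branch-heavy function under refactoring.
def legacy_schedule(url: str, depth: int, seen: set) -> str:
    if url in seen:
        return "skip:duplicate"
    if depth > 3:
        return "skip:too_deep"
    if url.startswith("https://"):
        return "fetch:secure"
    return "fetch:plain"


# Expected values are recorded by running the current code once.
# They pin existing behavior; they do not assert that it is correct.
GOLDEN_CASES = [
    (("https://a.example", 0, set()), "fetch:secure"),
    (("http://a.example", 1, set()), "fetch:plain"),
    (("https://a.example", 4, set()), "skip:too_deep"),
    (("https://a.example", 0, {"https://a.example"}), "skip:duplicate"),
]


def test_legacy_schedule_is_unchanged() -> None:
    for args, expected in GOLDEN_CASES:
        assert legacy_schedule(*args) == expected, args


test_legacy_schedule_is_unchanged()
```

With one golden case per dominant path in place, each extracted sub-function can be moved out in a separate commit, and any behavioral drift shows up as a failing case rather than a silent regression.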

fetch — scrapling/engines/_browsers/_stealth.py

fetch in the stealth browser engine likely handles the full HTTP fetch lifecycle for browser-based, anti-detection requests, a surface that must juggle headers, timing, proxy routing, and response parsing at once. Its CC of 42 is the second highest in the table, behind parse at 62 and just ahead of crawl at 41; its fan-out of 25 signals broad coupling across 25 distinct callees, and its multiple exit paths add test-coverage burden. It sits in the fire quadrant and was last changed 17 days ago, so structural complexity and active development are colliding in real time.

Recommendation: Map the 25 fan-out callees to identify which are shared with crawl — overlapping dependencies between these two god-functions compound blast-radius risk; decoupling shared concerns into dedicated service objects will reduce both CC and cross-function coupling.

fetch — scrapling/engines/_browsers/_controllers.py

The fetch function in the controllers browser engine almost certainly implements the HTTP fetch lifecycle for standard (non-stealth) browser-based requests. It is a parallel surface to its counterpart in the stealth engine, with its own set of routing, header management, and response-handling branches. Its CC of 40 places it just below the stealth fetch, its fan-out of 24 indicates broad coupling across the browser abstraction layer, and its max nesting depth is 5. The two fetch implementations share structural patterns (god_function, exit_heavy) and likely share callees, a coupling that amplifies the blast radius of changes to either.

Recommendation: Treat the controllers and stealth fetch functions as a paired refactoring target. Identify shared callees first, extract them into a common browser-agnostic service, and then reduce each fetch independently — this approach eliminates duplicated complexity in a single pass.
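One way to find shared callees is to collect the call names inside each function with Python's ast module and intersect the sets. The sketch below does this on an inline sample; the function names in SAMPLE (fetch_plain, fetch_stealth, open_page, and the rest) are invented for illustration and are not Scrapling's actual callees. For the two real fetch functions, which live in different files under the same name, each file would be parsed separately.

```python
import ast


def callees(func: ast.AST) -> set[str]:
    """Names of everything called inside `func` (a rough fan-out proxy)."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Name):         # plain call: foo()
                names.add(target.id)
            elif isinstance(target, ast.Attribute):  # method call: obj.foo()
                names.add(target.attr)
    return names


def callees_by_function(source: str) -> dict[str, set[str]]:
    """Map each function defined in `source` to its set of callee names."""
    tree = ast.parse(source)
    return {
        node.name: callees(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


# Invented sample code, not taken from Scrapling.
SAMPLE = '''
def fetch_plain(url):
    page = open_page(url)
    wait_for_load(page)
    return build_response(page)

def fetch_stealth(url):
    page = open_page(url)
    apply_stealth(page)
    wait_for_load(page)
    return build_response(page)
'''

table = callees_by_function(SAMPLE)
shared = table["fetch_plain"] & table["fetch_stealth"]
print(sorted(shared))  # ['build_response', 'open_page', 'wait_for_load']
```

The intersection is the candidate list for a common service; whatever remains unique to each function is what the per-function refactor has to deal with afterwards.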

parse — scrapling/core/shell.py

parse in the shell module carries the highest cyclomatic complexity in the dataset at 62 — 62 independent execution paths, each a required test case and a potential regression surface. Its fan-out of 30 is also the highest in the table, meaning changes here ripple across 30 distinct callees. Despite this structural load, its nesting depth of 3 suggests the complexity is driven by wide branching rather than deep nesting — likely many parsing strategies, format handlers, or fallback cases evaluated in sequence. The low nesting depth is the one structural positive; the CC and fan-out together make this the highest-priority refactoring target for pure structural debt.

Recommendation: Before touching parse, write a characterization test suite covering its dominant output types. Then apply extract-method refactoring to each distinct parsing strategy or format branch, targeting CC below 20 and bringing each sub-function into independent testability.
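If the complexity in parse really does come from many strategies tried in sequence, which is an inference from its metrics and not something confirmed here, one common refactoring shape is an ordered list of small strategy functions. The sketch below illustrates that shape with hypothetical value parsers; it does not reflect what scrapling/core/shell.py actually parses.

```python
from typing import Callable, Optional


# Each strategy returns a parsed value, or None when it does not apply.
def parse_bool(text: str) -> Optional[bool]:
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    return None


def parse_int(text: str) -> Optional[int]:
    try:
        return int(text)
    except ValueError:
        return None


def parse_float(text: str) -> Optional[float]:
    try:
        return float(text)
    except ValueError:
        return None


# Order matters: earlier strategies win when several would match.
STRATEGIES: list[Callable[[str], object]] = [parse_bool, parse_int, parse_float]


def parse_value(text: str) -> object:
    """Try each strategy in order and fall back to the raw text."""
    for strategy in STRATEGIES:
        result = strategy(text)
        if result is not None:
            return result
    return text


print([parse_value(t) for t in ("true", "42", "3.5", "hello")])
```

The coordinating function keeps a single loop and one fallback, while each strategy carries only its own branches and can be tested independently.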

_cloudflare_solver — scrapling/engines/_browsers/_stealth.py

From its name and location in the stealth browser engine, _cloudflare_solver almost certainly implements the challenge-solving logic for bypassing Cloudflare bot detection — a flow that by nature requires many conditional branches to handle different challenge types, timeouts, and fallback strategies. Its CC of 34 and fan-out of 17 confirm this complexity, and with 5 levels of nesting it is difficult to reason about in isolation. Critically, it sits in the debt quadrant: it has not been touched in 62 days, meaning this structural complexity is dormant for now but carries high blast radius when the next development push arrives.

Recommendation: Before the next feature work on Cloudflare handling, add a characterization test suite covering the branching outcomes so the existing behavior is locked down; then extract each challenge-type handler into its own function to reduce CC to a manageable level.

Patterns Found

Antipatterns detected across the top functions in this snapshot:

Pattern             Occurrences
exit_heavy          8
complex_branching   7
deeply_nested       7
god_function        7
long_function       6

These labels belong to two tiers — Tier 1 (structural): complex_branching, deeply_nested, exit_heavy, long_function, god_function. Tier 2 (relational/temporal): hub_function, cyclic_hub, middle_man, neighbor_risk, stale_complex, churn_magnet, shotgun_target, volatile_god.

Key Takeaways

  • crawl in scrapling/spiders/engine.py has been touched 6 times in 30 days with a CC of 41 and fan-out of 26 — add characterization tests before the next commit to prevent silent regressions across its 26 callees.
  • parse in scrapling/core/shell.py has a cyclomatic complexity of 62 and a fan-out of 30 — the highest values in the dataset for both metrics — making it the largest structural debt item in the codebase and the highest-priority target for extract-method refactoring.
  • _cloudflare_solver and fetch share the same file (scrapling/engines/_browsers/_stealth.py), carry CCs of 34 and 42 respectively, and show overlapping god-function and exit-heavy patterns. Treating them as a paired refactoring target will reduce the stealth engine’s overall risk profile more efficiently than addressing either alone.

Reproduce This Analysis

git clone https://github.com/D4Vinci/Scrapling
cd Scrapling
git checkout a2285436e15952b8cf1cdafa9892210de84d4ac8
hotspots analyze . --mode snapshot --explain-patterns --force

To run the same analysis on your own codebase, run hotspots analyze . --mode snapshot in any local git repo — no configuration required.

Hotspots highlights structural and activity risk, not “bad code.” Findings are a prioritization aid, not a bug predictor.

Run this on your own codebase

Hotspots runs locally in under a minute — no account, no data leaves your machine.

macOS
$ brew install Stephen-Collins-tech/tap/hotspots
Linux / cargo
$ cargo install hotspots-cli
Run in any repo
$ hotspots analyze .