# Scrapy

Scrapy connects to helodata through the `HTTP_PROXY` environment variable and its built-in `HttpProxyMiddleware`. Add `scrapy-rotating-proxies` when you need to rotate proxies per request.

## Single static proxy

In `settings.py`:

```python
HTTPPROXY_AUTH_ENCODING = "latin-1"

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}
```

In each request:

```python
def start_requests(self):
    proxy = "http://helo_s1a2b3c4d5e-type-res-region-us:PASSWORD@gate.helodata.io:7777"
    for url in self.start_urls:
        yield scrapy.Request(url, meta={"proxy": proxy})
```

Or set the environment variables before launching the crawl:

```bash
export HTTP_PROXY="http://helo_s1a2b3c4d5e-type-res-region-us:PASSWORD@gate.helodata.io:7777"
export HTTPS_PROXY="$HTTP_PROXY"
scrapy crawl myspider
```
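
To keep the password out of source control, the proxy URL can also be assembled at runtime. The following is a minimal sketch, assuming the credentials are stored in environment variables named `HELODATA_USER` and `HELODATA_PASSWORD`. Those names and the `proxy_from_env` helper are hypothetical and not defined by helodata or Scrapy; the gateway host and port are the ones used above.

```python
import os
from urllib.parse import quote


def proxy_from_env(host="gate.helodata.io", port=7777):
    """Build a proxy URL from environment variables (variable names are assumptions)."""
    user = os.environ["HELODATA_USER"]          # e.g. helo_s1a2b3c4d5e-type-res-region-us
    password = os.environ["HELODATA_PASSWORD"]
    # Percent-encode both parts so characters such as '@' or ':' cannot break the URL.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
```

The result can be passed as `meta={"proxy": proxy_from_env()}`. Scrapy's `HttpProxyMiddleware` decodes percent-encoded credentials before building the `Proxy-Authorization` header.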

## ISP rotation with `scrapy-rotating-proxies`

```bash
pip install scrapy-rotating-proxies
```

In `settings.py`:

```python
ROTATING_PROXY_LIST_PATH = "/path/to/isp.txt"      # one proxy per line, e.g. http://user:pass@ip:port
# or:
ROTATING_PROXY_LIST = [
    "http://helo_s1a2b3c4d5e:PASSWORD@198.51.100.42:8000",
    "http://helo_s1a2b3c4d5e:PASSWORD@198.51.100.43:8000",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}

# Optional tuning
ROTATING_PROXY_BAN_POLICY = "rotating_proxies.policy.BanDetectionPolicy"
ROTATING_PROXY_BACKOFF_BASE = 300                  # seconds
```

The middleware tracks failures per proxy and skips banned ones; pair it with `RETRY_HTTP_CODES = [429, 502, 522, 524]` to handle transient gateway errors.
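
If an ISP list is delivered as `ip:port:user:pass` lines, it can be converted into proxy URLs of the shape shown in `ROTATING_PROXY_LIST`. This is a minimal sketch, assuming IPv4 addresses and one entry per line; `to_proxy_url` and `load_proxy_list` are hypothetical helpers, not part of `scrapy-rotating-proxies`.

```python
from urllib.parse import quote


def to_proxy_url(line: str) -> str:
    """Convert 'ip:port:user:pass' into 'http://user:pass@ip:port'."""
    # Split at most three times so a password containing ':' stays intact.
    host, port, user, password = line.strip().split(":", 3)
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def load_proxy_list(path: str) -> list[str]:
    """Read a list file and return proxy URLs, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [to_proxy_url(line) for line in f if line.strip()]
```

In `settings.py` this could be used as `ROTATING_PROXY_LIST = load_proxy_list("/path/to/isp.txt")`.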

## Dynamic rotation with residential sessions

No external list is needed; generate sessions on demand:

```python
import random
import string

import scrapy


def session_proxy(country="us"):
    # Random 8-character session ID, so each call names a new session.
    sid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    user = f"helo_s1a2b3c4d5e-type-res-region-{country}-session-{sid}-sesstime-10"
    return f"http://{user}:PASSWORD@gate.helodata.io:7777"


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={"proxy": session_proxy()})
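
Instead of setting `meta["proxy"]` in every spider, the same idea could be moved into a small downloader middleware. This is a sketch under assumptions: `SessionProxyMiddleware` is a hypothetical class, not part of Scrapy or helodata, and the username scheme is copied from the example above.

```python
import random
import string

GATEWAY = "gate.helodata.io:7777"
ACCOUNT = "helo_s1a2b3c4d5e"  # placeholder account used throughout this page


def session_proxy(country="us", sesstime=10):
    """Return a proxy URL whose username carries a random session ID."""
    sid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    user = f"{ACCOUNT}-type-res-region-{country}-session-{sid}-sesstime-{sesstime}"
    return f"http://{user}:PASSWORD@{GATEWAY}"


class SessionProxyMiddleware:
    """Hypothetical downloader middleware: assign a session proxy per request."""

    def process_request(self, request, spider):
        # Respect a proxy the spider already chose; otherwise assign a new session.
        if "proxy" not in request.meta:
            request.meta["proxy"] = session_proxy()
        return None  # let Scrapy continue processing the request
```

To take effect, the class would need an entry in `DOWNLOADER_MIDDLEWARES` with a priority below 750, so that it runs before `HttpProxyMiddleware` reads `meta["proxy"]`.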

## Adding a helodata tag

```python
yield scrapy.Request(
    url,
    meta={"proxy": proxy},
    headers={"X-Helodata-Tag": "campaign-acme-q2"},
)
```

The tag appears in the traffic statistics; see [Usage statistics](/helodata-zh/chan-pin/overview/statistics.md).

## Verification

```python
def parse(self, response):
    self.logger.info("Exit IP: %s", response.headers.get(b"X-Helodata-Exit-IP", b"").decode())
```

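## Common pitfalls

* **`HttpProxyMiddleware` disabled**: Scrapy enables it by default, so keep it in place when you customize `DOWNLOADER_MIDDLEWARES`.
* **High `CONCURRENT_REQUESTS` with a small ISP batch**: a batch has only N IPs, so concurrency above N just queues. Match `CONCURRENT_REQUESTS_PER_DOMAIN` to the pool size.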
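
One way to keep concurrency in line with the pool is to derive it from the proxy list in `settings.py`. This is a rough sketch, assuming a list-based pool; `concurrency_for_pool` is a hypothetical helper, and the `per_ip` factor and `ceiling` are assumptions to tune for the target site.

```python
def concurrency_for_pool(proxies, per_ip=1, ceiling=16):
    """Return a concurrency value bounded by the number of distinct proxies."""
    pool_size = len(set(proxies))
    # Never drop below 1, never exceed the ceiling.
    return max(1, min(pool_size * per_ip, ceiling))


ROTATING_PROXY_LIST = [
    "http://helo_s1a2b3c4d5e:PASSWORD@198.51.100.42:8000",
    "http://helo_s1a2b3c4d5e:PASSWORD@198.51.100.43:8000",
]

# Two proxies in the list, so both settings resolve to 2.
CONCURRENT_REQUESTS = concurrency_for_pool(ROTATING_PROXY_LIST)
CONCURRENT_REQUESTS_PER_DOMAIN = CONCURRENT_REQUESTS
```

Scrapy only reads uppercase names from the settings module, so the lowercase helper does not become a setting.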

