# LangChain

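LangChain pipelines that fetch web content (loaders, retrievers, web search tools) can be routed through helodata to avoid IP blocks during LLM-driven collection or RAG ingestion.

## Document loaders

`WebBaseLoader`, `PlaywrightURLLoader`, and similar loaders use `requests` / `httpx` / Playwright under the hood, so setting the proxy on the corresponding client is enough.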

### `WebBaseLoader`

```python
from langchain_community.document_loaders import WebBaseLoader

USER = "helo_s1a2b3c4d5e-type-res-region-us"
PASS = "PASSWORD"
proxy = f"http://{USER}:{PASS}@gate.helodata.io:7777"

loader = WebBaseLoader("https://example.com")
loader.requests_kwargs = {
    "proxies": {"http": proxy, "https": proxy},
    "timeout": 30,
}
docs = loader.load()
```
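If the password contains characters such as `@`, `:` or `/`, embedding it raw in the proxy URL breaks URL parsing. The following is a minimal sketch that percent-encodes the credentials first. `build_proxy_url` is a hypothetical helper (not part of LangChain or helodata), the sample password is made up, and the host and port defaults are simply the values used in the example above.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str,
                    host: str = "gate.helodata.io", port: int = 7777) -> str:
    """Return an HTTP proxy URL with percent-encoded credentials."""
    # safe="" forces encoding of every reserved character, including "/" and "@".
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Illustrative password containing reserved characters.
proxy = build_proxy_url("helo_s1a2b3c4d5e-type-res-region-us", "p@ss:word/1")
```

The resulting `proxy` string can be passed to `requests_kwargs["proxies"]` exactly like the hand-built one.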

### `PlaywrightURLLoader`

```python
from langchain_community.document_loaders import PlaywrightURLLoader

loader = PlaywrightURLLoader(
    urls=["https://example.com"],
    remove_selectors=["nav", "footer"],
    continue_on_failure=True,
    headless=True,
    # Passed through to Playwright's browser launch as its proxy settings.
    proxy={
        "server": "http://gate.helodata.io:7777",
        "username": "helo_s1a2b3c4d5e-type-res-region-us",
        "password": "PASSWORD",
    },
)
docs = loader.load()
```
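Playwright takes the proxy as a `server` / `username` / `password` dict, while `requests` takes a single URL. When both loaders are used in one pipeline, it may be convenient to derive the dict from the URL so the credentials live in one place. This is a sketch only; `playwright_proxy` is a hypothetical helper name.

```python
from urllib.parse import unquote, urlsplit


def playwright_proxy(proxy_url: str) -> dict:
    """Split a proxy URL into the dict shape Playwright expects."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"
    settings = {"server": server}
    # Credentials may be percent-encoded in the URL; Playwright wants them raw.
    if parts.username:
        settings["username"] = unquote(parts.username)
    if parts.password:
        settings["password"] = unquote(parts.password)
    return settings


settings = playwright_proxy(
    "http://helo_s1a2b3c4d5e-type-res-region-us:PASSWORD@gate.helodata.io:7777"
)
```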

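## Web search tools

Tools such as `TavilySearch` and `DuckDuckGoSearchRun` call external APIs, and whether they need a proxy depends on the case. For tools that fetch pages over HTTP (such as `RequestsGetTool`), inject a client configured with the proxy: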

```python
from langchain_community.tools.requests.tool import RequestsGetTool
from langchain_community.utilities.requests import TextRequestsWrapper

wrapper = TextRequestsWrapper(
    headers={"User-Agent": "Mozilla/5.0"},
    proxies={"http": proxy, "https": proxy},
)
tool = RequestsGetTool(requests_wrapper=wrapper, allow_dangerous_requests=True)
```

## Sticky session per LLM call

A LangChain agent often issues several sub-requests within a single chain. Use a sticky session so the target sees consistent IP behavior:

```python
import random
import string

# Random 8-character session ID for this chain run.
sid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
user = f"helo_s1a2b3c4d5e-type-res-region-us-session-{sid}-sesstime-30"
proxy = f"http://{user}:PASSWORD@gate.helodata.io:7777"
```

Reuse the same `proxy` for every sub-request within the chain.
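One way to enforce that reuse is to build the proxy once per chain run and hand the same value to every loader and tool. The sketch below assumes the username format shown above; `sticky_proxy` is a hypothetical helper, and it uses `secrets` rather than `random` to generate the session ID.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def sticky_proxy(base_user: str, password: str, sesstime: int = 30,
                 host: str = "gate.helodata.io", port: int = 7777) -> str:
    """Build a proxy URL carrying a fresh 8-character session ID."""
    sid = "".join(secrets.choice(ALPHABET) for _ in range(8))
    user = f"{base_user}-session-{sid}-sesstime-{sesstime}"
    return f"http://{user}:{password}@{host}:{port}"


# Create one proxy at the start of a chain run and share it everywhere.
run_proxy = sticky_proxy("helo_s1a2b3c4d5e-type-res-region-us", "PASSWORD")
proxies = {"http": run_proxy, "https": run_proxy}
```

Calling `sticky_proxy` again starts a new session, so call it once per chain rather than once per request.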

## Tags for attribution

Use the `X-Helodata-Tag` header to track LLM-driven traffic separately from other crawling:

```python
wrapper = TextRequestsWrapper(
    headers={"X-Helodata-Tag": "llm-agent-prod"},
    proxies={"http": proxy, "https": proxy},
)
```

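## Common pitfalls

* **Some versions of `UnstructuredURLLoader` ignore the proxy**: switch to `WebBaseLoader`, or pre-download the pages with `requests` first.
* **Async loaders**: swap the client for `httpx.AsyncClient(proxy=proxy)`; with `aiohttp`, pass `proxy=` on each request.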
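The pre-download workaround from the first pitfall could look roughly like this. It is a sketch, not a LangChain API: `predownload` and its `http_get` parameter are hypothetical names, and the function only returns raw HTML keyed by URL, leaving parsing to whichever loader or parser you use afterwards.

```python
from typing import Callable, Dict, Iterable, Optional


def predownload(urls: Iterable[str], proxy: str, timeout: float = 30.0,
                http_get: Optional[Callable] = None) -> Dict[str, str]:
    """Fetch each URL through the proxy and return {url: html}."""
    if http_get is None:
        # Third-party dependency, imported lazily so a stub can replace it in tests.
        import requests
        http_get = requests.get
    proxies = {"http": proxy, "https": proxy}
    pages: Dict[str, str] = {}
    for url in urls:
        resp = http_get(url, proxies=proxies, timeout=timeout)
        resp.raise_for_status()  # surface blocks and gateway errors early
        pages[url] = resp.text
    return pages
```

A typical call would be `predownload(["https://example.com"], proxy)`, after which the HTML can be written to disk or parsed in memory without the loader touching the network.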

