# document-loader

<https://python.langchain.com/v0.2/docs/integrations/document_loaders/>

```python
%pip install --user -Uq langchain langchain_community pypdf pdf2image docx2txt pdfminer.six
```

## WebBaseLoader

<https://python.langchain.com/v0.2/docs/integrations/document_loaders/web_base/>

```python
%pip install --user -Uq beautifulsoup4

from langchain_community.document_loaders import WebBaseLoader

# Load a single web page into a list of Document objects
loader = WebBaseLoader("https://www.thisisgame.com/webzine/news/nboard/4/?n=189952")

data = loader.load()

print(data[0].page_content)
```
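Each Document also carries metadata scraped from the page (source URL, title, and so on, when present). A quick check, assuming the load above succeeded:

```python
# Inspect the metadata WebBaseLoader attached to the first Document
print(data[0].metadata)
```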

```python
# Pass a list of URLs to load several pages in one call
loader = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
docs = loader.load()
docs
```

## Load multiple URLs concurrently

```python
%pip install --user -Uq nest_asyncio

# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()
```

```python
loader = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
# Throttle concurrent requests to avoid hammering the servers
loader.requests_per_second = 1
# aload() fetches all URLs concurrently
docs = loader.aload()
docs
```
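Some sites reject the default Python user agent. A minimal sketch, assuming WebBaseLoader's `header_template` parameter (the user-agent string here is just an illustration):

```python
# Supply custom request headers; some sites block the default user agent
loader = WebBaseLoader(
    "https://www.espn.com/",
    header_template={"User-Agent": "Mozilla/5.0 (compatible; docs-example)"},
)
```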

## XML parser

```python
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
# Parse the response as XML instead of HTML
loader.default_parser = "xml"
docs = loader.load()
docs
```
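BeautifulSoup delegates XML parsing to lxml, so the `"xml"` parser may need it installed separately:

```python
%pip install --user -Uq lxml
```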

## SitemapLoader


```python
from langchain_community.document_loaders.sitemap import SitemapLoader
```

```python
# Crawl every URL listed in the sitemap
sitemap_loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml"
)
docs = sitemap_loader.load()
```

```python
docs[0]
```

```python
# Filtering sitemap URLs
loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()
```

```python
documents[0]
```

```python
sitemap_loader = SitemapLoader(
    web_path="https://beta.thisisgame.com/sitemap.xml",
    filter_urls=[
        "https://beta.thisisgame.com/articles/265823"
    ],
)

sitemap_loader.requests_per_second = 2

docs = sitemap_loader.load()
docs
```
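Note that entries in `filter_urls` are treated as regular-expression patterns matched against each sitemap URL, so an exact article URL like the one above works, and so do broader patterns.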

Now let's clean up the text.


## Add custom scraping rules

```python
%pip install --user -Uq beautifulsoup4
```

```python
from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return content.get_text()
```

```python
loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,
)
```

```python
docs = loader.load()
docs
```

```python
from bs4 import BeautifulSoup


def parse_page(soup: BeautifulSoup) -> str:
    # Strip the site chrome before extracting text
    header = soup.find("header")
    footer = soup.find("footer")
    if header:
        header.decompose()
    if footer:
        footer.decompose()
    # Collapse whitespace and drop the site's navigation-menu text,
    # which survives scraping as a run of Korean menu labels
    return (
        soup.get_text()
        .replace("\n", " ")
        .replace("\xa0", " ")
        .replace("진행게임계 화제인게임 이슈오피니언기획/특집연재/카툰갤러리커뮤니티 로그인로그인", "")
    )
```
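A sketch of wiring this into the loader, following the same pattern as above (the sitemap and article URL are the ones used earlier):

```python
# Use parse_page as the parsing_function for the thisisgame sitemap
loader = SitemapLoader(
    web_path="https://beta.thisisgame.com/sitemap.xml",
    filter_urls=["https://beta.thisisgame.com/articles/265823"],
    parsing_function=parse_page,
)
docs = loader.load()
```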

## PDF Document Loader

```python
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF page by page, splitting long pages into smaller chunks
loader = PyPDFLoader("data/sample.pdf")
pages = loader.load_and_split()
```

```python
print(pages[0].page_content)
```
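Each resulting Document records its source file and page in the metadata; a quick check, assuming the load above succeeded:

```python
# Page-level metadata: source path and zero-based page index
print(len(pages))
print(pages[0].metadata)  # e.g. {'source': 'data/sample.pdf', 'page': 0}
```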

## MS Word Document Loader

<https://python.langchain.com/v0.2/docs/integrations/document_loaders/microsoft_word/>

API Reference: Docx2txtLoader (<https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.word_document.Docx2txtLoader.html>)

```python
%pip install --user -Uq docx2txt

from langchain_community.document_loaders import Docx2txtLoader

# Load a .docx file; the whole file comes back as a single Document
loader = Docx2txtLoader("data/family.docx")
data = loader.load()
data
```
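The source file path is kept in the Document metadata; a quick sanity check, assuming the load above succeeded:

```python
# Each .docx file becomes one Document; metadata records the file path
print(data[0].metadata)  # e.g. {'source': 'data/family.docx'}
```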

