document-loader

https://python.langchain.com/v0.2/docs/integrations/document_loaders/


%pip install --user -Uq langchain langchain_community pypdf pdf2image docx2txt pdfminer.six

WebBaseLoader

https://python.langchain.com/v0.2/docs/integrations/document_loaders/web_base/

%pip install --user -Uq beautifulsoup4

from langchain_community.document_loaders import WebBaseLoader

# Load a single page; the HTML is fetched and parsed with BeautifulSoup
loader = WebBaseLoader("https://www.thisisgame.com/webzine/news/nboard/4/?n=189952")
data = loader.load()
print(data[0].page_content)

# Passing a list of URLs returns one Document per page
loader = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
docs = loader.load()
docs
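
WebBaseLoader also forwards bs_kwargs to BeautifulSoup, so the markup can be pre-filtered before it becomes page_content. A minimal sketch using bs4.SoupStrainer; the "article" tag is an assumption, so inspect the target page for the element that actually holds the body text.

import bs4
from langchain_community.document_loaders import WebBaseLoader

# parse_only keeps just the matching elements; everything else is dropped
# before the text is extracted ("article" is an assumed tag name)
loader = WebBaseLoader(
    "https://www.thisisgame.com/webzine/news/nboard/4/?n=189952",
    bs_kwargs=dict(parse_only=bs4.SoupStrainer("article")),
)
docs = loader.load()
print(docs[0].page_content)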

Load multiple URLs concurrently

%pip install --user -Uq nest_asyncio

# nest_asyncio lets the loader's event loop run inside Jupyter's own loop
import nest_asyncio

nest_asyncio.apply()

loader = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
loader.requests_per_second = 1  # throttle the concurrent requests
docs = loader.aload()  # fetch all URLs concurrently
docs

XML parser

loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"  # parse as XML instead of the default HTML parser
docs = loader.load()
docs

Sitemap loader

%pip install --user -Uq nest_asyncio

# nest_asyncio lets the loader's event loop run inside Jupyter's own loop
import nest_asyncio

nest_asyncio.apply()

from langchain_community.document_loaders.sitemap import SitemapLoader

# Crawl every URL listed in the sitemap
sitemap_loader = SitemapLoader(web_path="https://api.python.langchain.com/sitemap.xml")
docs = sitemap_loader.load()
docs[0]

# Filtering sitemap URLs: only crawl pages that match one of the patterns
loader = SitemapLoader(
    web_path="https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest"],
)
documents = loader.load()
documents[0]

# Load a single article from the thisisgame sitemap
sitemap_loader = SitemapLoader(
    web_path="https://beta.thisisgame.com/sitemap.xml",
    filter_urls=["https://beta.thisisgame.com/articles/265823"],
)
sitemap_loader.requests_per_second = 2  # throttle the crawl
docs = sitemap_loader.load()
docs

Now let's clean up the scraped text.


Add custom scraping rules

%pip install --user -Uq beautifulsoup4

from bs4 import BeautifulSoup


def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'nav' and 'header' elements in the BeautifulSoup object
    nav_elements = content.find_all("nav")
    header_elements = content.find_all("header")

    # Remove each 'nav' and 'header' element from the BeautifulSoup object
    for element in nav_elements + header_elements:
        element.decompose()

    return str(content.get_text())

loader = SitemapLoader(
    "https://api.python.langchain.com/sitemap.xml",
    filter_urls=["https://api.python.langchain.com/en/latest/"],
    parsing_function=remove_nav_and_header_elements,  # run on each page's soup
)
docs = loader.load()
docs
from bs4 import BeautifulSoup


def parse_page(soup):
    # Drop the site chrome, then strip whitespace artifacts and the
    # navigation text that thisisgame.com injects into every page
    header = soup.find("header")
    footer = soup.find("footer")
    if header:
        header.decompose()
    if footer:
        footer.decompose()
    return (
        str(soup.get_text())
        .replace("\n", " ")
        .replace("\xa0", " ")
        .replace("진행게임계 화제인게임 이슈오피니언기획/특집연재/카툰갤러리커뮤니티 로그인로그인", "")
    )
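
parse_page is defined above but never attached to a loader; a minimal sketch wiring it into the thisisgame sitemap from the earlier example:

# Run every crawled page through parse_page instead of the default extractor
sitemap_loader = SitemapLoader(
    web_path="https://beta.thisisgame.com/sitemap.xml",
    filter_urls=["https://beta.thisisgame.com/articles/265823"],
    parsing_function=parse_page,
)
docs = sitemap_loader.load()
print(docs[0].page_content)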

PDF Document Loader

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/sample.pdf")
# One Document per page, then split into chunks by the default text splitter
pages = loader.load_and_split()

print(pages[0].page_content)
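
Each chunk keeps the source file and page number in its metadata, which is useful for citing sources later; a quick check, assuming the same data/sample.pdf:

# PyPDFLoader records where each chunk came from
print(pages[0].metadata)  # e.g. {'source': 'data/sample.pdf', 'page': 0}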

MS Word Document Loader

https://python.langchain.com/v0.2/docs/integrations/document_loaders/microsoft_word/

API Reference: Docx2txtLoader (https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.word_document.Docx2txtLoader.html)

%pip install --user -Uq docx2txt

from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("data/family.docx")
data = loader.load()  # the whole .docx comes back as a single Document
data
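
The same integration page also covers UnstructuredWordDocumentLoader; a sketch, assuming the unstructured and python-docx packages are installed:

%pip install --user -Uq unstructured python-docx

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

# mode="elements" keeps each title/paragraph as its own Document
loader = UnstructuredWordDocumentLoader("data/family.docx", mode="elements")
data = loader.load()
data[0]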
Was this helpful?