ContextIQ

Turn messy files into agent-ready context for RAG, search, and AI workflows.

ContextIQ is a local-first ingestion pipeline for developers building RAG systems, agent memory layers, document search, and eval datasets.

Point it at a folder of mixed files and it produces clean, traceable JSONL and Markdown outputs that AI systems can actually use.

Why ContextIQ

Most AI tooling starts after your data is already clean. Real projects usually break much earlier:

PDFs are noisy
Word docs lose structure
JSON and CSV need normalization
repos and notes mix formats
chunks become inconsistent
source traceability gets lost

ContextIQ focuses on the missing middle: ingestion, normalization, chunking, and export.

Installation

Install from PyPI:

pip install contextiq

Run the CLI:

contextiq ingest ./docs --out ./build/context

Or run it as a module:

python -m contextiq ingest ./docs --out ./build/context

Quickstart

Use the built-in example content:

contextiq ingest ./examples --out ./build/context

PowerShell example:

contextiq ingest .\examples --out .\build\context

Generated output:

documents.jsonl - normalized source documents
chunks.jsonl - chunked outputs for RAG and agents
chunks.md - human-readable review output
manifest.json - run summary, warnings, and config
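
The JSONL outputs can be loaded with a few lines of standard-library Python. This is a minimal sketch, not part of ContextIQ itself: `read_jsonl` is a hypothetical helper, and it makes no assumption about which fields each record contains.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

After a run, `read_jsonl("build/context/chunks.jsonl")` should give you a list of dicts. Inspect the keys of the first record to see the actual schema before wiring it into an indexer.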

What It Supports

Built-in file types

.txt, .md, .rst
.json, .jsonl
.csv, .tsv
.html, .htm
optional .pdf via pypdf
optional .docx via python-docx

Output behavior

recursive directory ingestion
normalized plain-text extraction
document-aware chunking
source-preserving metadata
JSONL and Markdown export
manifest output for reproducibility

CLI

Basic usage

contextiq ingest <path> --out <directory>

Useful flags

--include-ext .md,.txt,.json
--exclude-glob "*.min.js,*.lock"
--chunk-size 1200
--chunk-overlap 150
--formats jsonl,markdown
--fail-on-warning

Example commands

contextiq ingest ./docs --out ./dist/context --chunk-size 900 --chunk-overlap 120

contextiq ingest ./knowledge-base --out ./build/export --include-ext .md,.txt,.json

How It Works

ContextIQ runs in four stages:

1. Discovery

Recursively finds supported files while skipping common noise such as virtualenvs, caches, and build directories.
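
To make the discovery stage concrete, here is a rough sketch of recursive discovery with directory pruning. It is an illustration only: `discover_files`, `SKIP_DIRS`, and `SUPPORTED_EXTS` are hypothetical names, and the skip list is an assumption, not the list ContextIQ actually uses.

```python
import os
from pathlib import Path

# Assumed values for illustration; the real skip list and extension set may differ.
SKIP_DIRS = {".git", ".venv", "venv", "__pycache__", "node_modules", "build", "dist"}
SUPPORTED_EXTS = {".txt", ".md", ".rst", ".json", ".jsonl", ".csv", ".tsv", ".html", ".htm"}


def discover_files(root, extensions=SUPPORTED_EXTS, skip_dirs=SKIP_DIRS):
    """Return supported files under root, sorted so repeated runs are stable."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix.lower() in extensions:
                found.append(path)
    return sorted(found)
```

Pruning `dirnames` in place avoids walking large trees such as virtualenvs at all, which is usually cheaper than filtering paths afterwards.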

2. Loading and normalization

Converts each file into normalized plain text:

Markdown and text are read directly
JSON and JSONL are pretty-printed into readable text
CSV and TSV become row-based text
HTML is stripped to visible text
PDF and DOCX are loaded through the optional pypdf and python-docx dependencies
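
The sketch below shows one way such per-format normalization could look using only the standard library. It is an assumption-laden illustration: `normalize` and `_VisibleText` are hypothetical names, and the exact row format for CSV ("column: value" pairs) is a guess at what "row-based text" might mean, not ContextIQ's documented output.

```python
import csv
import io
import json
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if not self._hidden and data.strip():
            self.parts.append(data.strip())


def normalize(raw, ext):
    """Convert raw file content to plain text based on its extension."""
    ext = ext.lower()
    if ext == ".json":
        return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)
    if ext == ".jsonl":
        records = [json.loads(line) for line in raw.splitlines() if line.strip()]
        return "\n".join(json.dumps(r, indent=2, ensure_ascii=False) for r in records)
    if ext in (".csv", ".tsv"):
        delimiter = "\t" if ext == ".tsv" else ","
        rows = list(csv.reader(io.StringIO(raw), delimiter=delimiter))
        if not rows:
            return ""
        header, body = rows[0], rows[1:]
        # One line per data row, pairing each value with its column name.
        return "\n".join(
            "; ".join(f"{key}: {value}" for key, value in zip(header, row))
            for row in body
        )
    if ext in (".html", ".htm"):
        parser = _VisibleText()
        parser.feed(raw)
        parser.close()
        return "\n".join(parser.parts)
    return raw  # .txt, .md, .rst are passed through unchanged
```

PDF and DOCX are left out here because they depend on the optional third-party libraries.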

3. Chunking

Splits documents into retrieval-friendly chunks with:

target chunk size
overlap between chunks
paragraph and sentence-aware boundaries
source path and character ranges preserved
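
As an illustration of these properties, here is a minimal boundary-aware chunker. It is a sketch under stated assumptions, not ContextIQ's implementation: `Chunk` and `chunk_text` are hypothetical names, sizes are measured in characters, and the boundary search (paragraph break, then ". ", then newline) is one simple heuristic among many.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str
    start: int  # inclusive character offset in the normalized text
    end: int    # exclusive character offset
    text: str


def chunk_text(text, source, chunk_size=1200, overlap=150):
    """Split text into overlapping chunks, preferring natural boundaries."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for a paragraph break, then a sentence end, then a newline.
            # Only the second half of the window is searched so chunks stay
            # close to the target size.
            floor = start + chunk_size // 2
            for sep in ("\n\n", ". ", "\n"):
                cut = text.rfind(sep, floor, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        chunks.append(Chunk(source, start, end, text[start:end]))
        if end >= len(text):
            break
        # Step back by the overlap, but always make forward progress.
        start = max(end - overlap, start + 1)
    return chunks
```

Because every chunk stores `start` and `end`, `text[chunk.start:chunk.end]` always reproduces `chunk.text`, which is what makes a chunk traceable back to its source.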

4. Export

Writes machine-friendly and human-readable outputs for downstream AI workflows.
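
A dual export of this kind might look roughly like the following. This is a hedged sketch: `export_chunks` is a hypothetical function, and the record fields (`source`, `start`, `end`, `text`) and the Markdown layout are assumptions chosen for illustration, not the real schema.

```python
import json
from pathlib import Path


def export_chunks(chunks, out_dir):
    """Write chunk dicts to chunks.jsonl and a reviewable chunks.md."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    jsonl_path = out / "chunks.jsonl"
    md_path = out / "chunks.md"

    # Machine-friendly: one JSON object per line.
    with jsonl_path.open("w", encoding="utf-8") as handle:
        for chunk in chunks:
            handle.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    # Human-readable: one heading per chunk with its source and range.
    with md_path.open("w", encoding="utf-8") as handle:
        for i, chunk in enumerate(chunks, 1):
            handle.write(
                f"## Chunk {i}: {chunk['source']} [{chunk['start']}:{chunk['end']}]\n\n"
            )
            handle.write(chunk["text"].strip() + "\n\n")
    return jsonl_path, md_path
```

Writing both files from the same in-memory records keeps the review output in step with what downstream tools ingest.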

Project Structure

src/contextiq/
|- cli.py
|- pipeline.py
|- loaders.py
|- chunking.py
|- exporters.py
|- discovery.py
|- models.py
`- utils.py

Use Cases

RAG ingestion

Prepare mixed files for vector indexing and retrieval pipelines.

Agent memory and context packing

Turn project docs into clean, bounded chunks for coding and research agents.

Search systems

Produce normalized text and chunk exports for semantic or hybrid retrieval.

Eval datasets

Create stable, traceable corpora for retrieval benchmarking and prompt evaluation.

Development

Install editable dependencies:

pip install -e ".[dev]"

Run tests:

pytest

Run the demo (PowerShell):

.\demo.ps1

Roadmap

embeddings plugin interface
vector database exporters
OCR pipeline
table extraction
citation-aware retrieval benchmarks

Contributing

Contributions are welcome. Useful areas include:

improve loaders
add exporters
extend chunking strategies
improve docs and examples

Open an issue or submit a PR if you want to help shape ContextIQ.

License

MIT License - see LICENSE
