Mikel Bahn - Personal Website

? Overview

This system transforms your Obsidian Vault into a searchable knowledge graph using Graph Neural Networks and Large Language Models. It combines modern RAG (Retrieval-Augmented Generation) techniques with G-Retriever to provide precise answers based on your personal notes.

? What does the system do?

Graph Conversion: Converts Markdown notes into a NetworkX graph
QA Generation: Automatically creates question-answer pairs using Ollama
Smart Retrieval: Finds relevant notes using embeddings and graph algorithms
Contextual Answers: Uses your local LLM for precise answers
Optional: GNN Training: Trains a specialized neural network on your data

System Architecture

Obsidian Vault

→

Graph Builder

→

Training Data

→

PyG Dataset

→

GNN Training

→

Chat Interface

?️ Technical Architecture

Two variants available:

G-Retriever Light (Untrained)

Ready to use immediately
No GPU required
Fast responses
Embedding-based retrieval
PCST subgraph construction
Ollama for answer generation

Recommendation: Start with this! It works very well without training.

G-Retriever Full (Trained)

Requires training (1-3h)
GPU recommended
Specialized for your data
GNN-based retrieval
Graph Attention Networks
5-10% better results

Note: Only necessary for enthusiasts or large vaults (>5000 notes).

Core components:

1. Graph Neural Network (GAT)

Uses Graph Attention Networks to learn relationships between notes. With 3 layers and 4 attention heads, the model can recognize complex connection patterns.

2. Sentence Transformers

< p>Creates semantic embeddings for all notes. The model all-MiniLM-L6-v2 is fast and efficient with 384-dimensional vectors.

3. PCST Algorithm

Prize-Collecting Steiner Tree finds the optimally connected subgraph from relevant nodes – essential for coherent answers.

4. Ollama LLM

Your local Llama3 model generates the final answers based on the retrieved context. Complete privacy, no cloud!

⚙️ Installation

Requirements:

Python 3.9 or higher
CUDA (optional, for GPU acceleration)
Ollama installed with llama3:8b model
Approx. 10 GB free storage space

Step 1: Virtual Environment

python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

Step 2: Install PyTorch

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Step 3: PyTorch Geometric

pip install torch-geometric
pip install pyg-lib torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.0+cu118.html

Step 4: Additional Dependencies

pip install sentence-transformers networkx pcst-fast requests tqdm numpy pandas

Step 5: Set up Ollama

# Check if Ollama is running
curl http://localhost:11434/api/version

# Pull Llama3 model
ollama pull llama3:8b

✓ Installation complete! You are ready to get started.

? Modules

1. obsidian_to_graph.py

Function: Converts Obsidian Vault into a NetworkX graph

Input: Path to the vault

Output: graph.gpickle, graph.json, stats.json

Features:

Parses Markdown files
Extracts Wiki-links [[link]] and Markdown links
Automatically removes images
Extracts #tags
Creates a directed graph with edges for links

2. generate_training_data.py

Function: Generates QA pairs using Ollama

Input: graph.gpickle

Output: train.json, val.json, qa_pairs.json

Question types:

Factual: Precise factual questions
Connection: Questions about relationships
Summary: Summary questions
Multi-Node: Questions spanning multiple connected notes

Performance: ~500 QA pairs in 1-2 hours

3. pyg_dataset.py

Function: Creates PyTorch Geometric datasets

Input: Graph + QA JSONs

Output: train_data.pt, val_data.pt

Features:

Node embeddings with Sentence Transformers
Question embeddings
Edge index for GNN
80/20 train/val split

4. gretriever_inference.py

Function: Chat interface (untrained)

Pipeline:

Retrieval: k-NN with cosine similarity
Subgraph Construction: PCST for optimal subgraph
Answer Generation: Ollama with context

Advantage: Ready to use immediately, no training needed!

5. train_gretriever.py

Function: Trains GNN on QA pairs

Model: GAT (Graph Attention Network)

Loss: Binary Cross Entropy (relevant vs. irrelevant nodes)

Optimizer: Adam with learning rate 0.001

Training: 20 epochs, ~1-3 hours

6. gretriever_inference_trained.py

Function: Chat interface with trained GNN

Difference: Uses trained model for retrieval instead of embeddings

Performance: 5-10% better relevance for large vaults

7. pipeline.py

Function: Runs the complete pipeline automatically

Options: Skip individual steps with --skip

Perfect for: Initial setup or restart

? Workflow

Quick Start (Untrained Variant):

Create graph

python obsidian_to_graph.py

Converts your notes into a graph. Takes: ~1-5 minutes for 1100 notes.

Generate training data

python generate_training_data.py

Creates 500 QA pairs using Ollama. Takes: 1-2 hours.

Tip: Start with 200 QA pairs for testing (num_samples=200), then expand to 500-1000.

Start chat

python gretriever_inference.py

Interactive chat interface opens. Ask questions about your notes!

Advanced (Trained Variant):

Create PyG dataset

python pyg_dataset.py

Converts data into PyTorch Geometric format. Takes: 5-10 minutes.

GNN Training

python train_gretriever.py

Trains the Graph Neural Network. Takes: 1-3 hours depending on hardware.

GPU Tip: With GPU 3-5x faster. CPU works too!

Chat with trained model

python gretriever_inference_trained.py

Uses the trained model for better retrieval.

? Training Details

How many training data do you need?

Vault Size	Recommended QA Pairs	Duration	Purpose
< 500 notes	200-300	30-60 min	Quick Test
500-1500 notes	500-800	1-2 h	Standard (recommended)
1500-3000 notes	1000-1500	3-4 h	Good coverage
> 3000 notes	2000+	6+ h	Very good

Training Hyperparameters:

Model Architecture

Node Embed Dim: 384 (from Sentence Transformer)
Hidden Dim: 256
Num Layers: 3
Attention Heads: 4
Total Parameters: ~2.5M

Training Setup

Optimizer: Adam
Learning Rate: 0.001
Loss Function: BCE with Logits
Epochs: 20 (default)
Batch Size: 1 (full graph per sample)

? Training Tips:

Start with fewer epochs (10) for testing
Monitor validation loss – stop early if overfitting
Best model is automatically saved
Training history is exported as JSON

? Usage

Example Chat Session:

$ python gretriever_inference.py

============================================================
G-Retriever Chat Interface for Obsidian Vault
Type 'quit' or 'exit' to end
============================================================

Your question: What are the most important concepts in my ML notes?
Query: What are the most important concepts in my ML notes?
Retrieving relevant nodes...
Constructing subgraph...
Generating answer...
Answer: Based on your notes, the most important Machine Learning
concepts are: Neural Networks with Backpropagation, Gradient Descent for
optimization, various Loss Functions (MSE, Cross-Entropy), and
regularization via L1/L2. You also have detailed notes on
Convolutional Neural Networks and their application in Computer Vision.
Used notes: Neural Networks, Backpropagation, Gradient Descent,
Loss Functions, Regularization

Example Queries:

? Factual Questions

"What is the difference between L1 and L2 regularization?"
"Which Python libraries do I use for Data Science?"
"What does my note about Transformers say?"

? Relationship Questions

"How are my notes on GraphQL and REST APIs connected?"
"Which projects use React?"
"What are the connections between my psychology notes?"

? Summaries

"Summarize my notes on Quantum Computing"
"What have I learned about productivity?"
"Overview of my travel notes to Japan"

Code Adjustments:

Adjust paths in the modules:

# In obsidian_to_graph.py
vault_path = "/path/to/your/vault"
output_path = "./graph_output"
In generate_training_data.py
graph_path = "./graph_output/graph.gpickle"
output_path = "./training_data"
num_samples = 500  # Number of QA pairs
In gretriever_inference.py
graph_path = "./graph_output/graph.gpickle"
ollama_model = "llama3:8b"

⚖️ Untrained vs. Trained

Performance Comparison:

Aspect	Untrained (Light)	Trained (Full)
Setup Time	1-2 hours	3-5 hours
GPU required?	❌ No	⚠️ Recommended
Retrieval Quality	85-90%	90-95%
Response Speed	2-5 seconds	3-6 seconds
Vault Size Recommendation	< 2000 notes	> 2000 notes
Maintenance	None	Re-training for major changes
Memory Requirement	~2 GB RAM	~4 GB RAM + 2 GB VRAM

✨ Recommendation:

Start with the untrained variant! It is quick to set up, works excellently, and you can start right away. Only train if:

You have more than 2000-3000 notes
You need the absolute best retrieval quality
You enjoy experimenting

The quality improvement from training is marginal (5-10%), but the effort is significantly higher.

? Troubleshooting

Problem: Ollama Connection Error

Solution:

# Check if Ollama is running
curl http://localhost:11434/api/version
Start Ollama
ollama serve

Problem: CUDA Out of Memory

Solution:

# In gretriever_inference.py or train_gretriever.py
device = "cpu"  # Instead of "cuda"

Problem: Too few QA pairs generated

Causes:

Many notes are too short (< 100 characters)
JSON parsing fails
Ollama timeouts

Solution: Increase num_samples by 20-30% more than desired.

Problem: Import Errors

Solution:

# Reinstall dependencies
pip install --force-reinstall torch-geometric
pip install pyg-lib torch-scatter torch-sparse

Problem: Training very slow

Optimizations:

Use GPU instead of CPU
Reduce Hidden Dim to 128
Reduce Num Layers to 2
Use fewer QA pairs for first test

? Advanced Configuration

Change Embedding Models:

# Better quality (slower)
embedding_model = "all-mpnet-base-v2"
Multilingual
embedding_model = "paraphrase-multilingual-MiniLM-L12-v2"
Specialized for code
embedding_model = "microsoft/codebert-base"

Tune GNN Architecture:

# More capacity
hidden_dim = 512
num_layers = 5
num_heads = 8
Faster, less capacity
hidden_dim = 128
num_layers = 2
num_heads = 2

Retrieval Parameters:

# In gretriever_inference.py
More context
k_retrieve = 30  # Instead of 20
Larger subgraph
max_subgraph_size = 20  # In construct_subgraph_pcst
More notes in LLM context
max_context_nodes = 15  # In generate_answer

Switch Ollama Model:

# Larger model (better quality)
ollama_model = "llama3:70b"
Faster model
ollama_model = "phi3:mini"
Specialized
ollama_model = "codellama:13b"  # For code-heavy vaults

⚠️ PCST Behavior: Selection, not Expansion

The Prize-Collecting Steiner Tree (PCST) step does not expand the retrieved node set. It performs a global optimization and selects a structurally optimal subset of nodes.

Key Point:
A retrieved node is never guaranteed to appear in the final subgraph. Retrieval provides candidates — PCST decides which ones are worth keeping.

In the current implementation, PCST is called as:

vertices, _ = pcst_fast(
    edges,
    prizes,
    costs,
    root,
    1,
    1,
    'strong'
)

How PCST makes decisions

1. Node Prizes

prizes[relevant_nodes] = similarities[relevant_nodes]

Only retrieved nodes receive a prize > 0
All other nodes start with prize = 0
A retrieved node is optional, not mandatory

2. Edge Costs

costs = np.ones(edges.shape[0])

Each edge has uniform cost = 1
Long or weakly connected paths are expensive

3. Optimization Criterion

keep node if:  prize(node) ≥ sum(edge costs to connect it)

High similarity + short distance → kept
Medium similarity + many hops → dropped
Low similarity + strong connectivity → often kept

Formal Property:
subgraph ⊆ retrieved_nodes ∪ connector_nodes
PCST never guarantees that all retrieved nodes survive.

Why the subgraph is usually smaller than retrieval

Retrieved nodes may be thematically scattered
Connection costs can outweigh semantic relevance
Highly connected hubs are often preferred

This explains why, for example, well-connected authors or concepts may remain in the subgraph while isolated but semantically relevant notes are removed.

How to influence PCST behavior

You can actively steer how selective PCST is:

# Option A: Increase prizes (keep more retrieved nodes)
prizes[relevant_nodes] = similarities[relevant_nodes] * 100

# Option B: Reduce edge costs (favor larger connected subgraphs)
costs = np.full(edges.shape[0], 0.01)

# Option C: Disable PCST entirely (pure Top-K retrieval)
subgraph_nodes = relevant_nodes

Summary:
PCST is a filtering mechanism that extracts the most structurally coherent core — not an expansion step. Differences between retrieval output and final context are expected and indicate correct behavior.

? Performance Optimization

For large vaults (>5000 notes):

1. Node Embedding Caching

Pre-compute and store embeddings separately:

import pickle
After first run
with open('node_embeddings.pkl', 'wb') as f:
    pickle.dump(self.node_embeddings, f)
In subsequent runs load
with open('node_embeddings.pkl', 'rb') as f:
    self.node_embeddings = pickle.load(f)

2. Batch Processing for QA Generation

Use larger batches:

# In generate_training_data.py
batch_size = 10  # Multiple prompts in parallel

3. Graph Pruning

Remove isolated nodes:

# After graph.build()
isolated = list(nx.isolates(self.graph))
self.graph.remove_nodes_from(isolated)

Benchmark (1100 nodes):

Operation	CPU (M2)	GPU (A100)
Graph Building	tbd	tbd
Node Embeddings	tbd	tbd
500 QA pairs	tbd	tbd
PyG Dataset	tbd	tbd
Training (tbd epochs)	tbd	tbd
Query Inference	tbd	tbd

❓ FAQ

Can I use other LLMs instead of Ollama?

Yes! You can modify generate_answer() to use OpenAI, Anthropic, or other APIs. Ollama is just the privacy-friendly default option.

Does it also work with other note-taking apps?

In principle, yes! You just need to adapt obsidian_to_graph.py to parse the specific format (e.g., Notion, Roam Research).

How do I keep the system up to date when I add new notes?

Simply run the pipeline again. For incremental updates, you could write a script that processes only new/changed notes.

Can I use multiple vaults at the same time?

Yes! Create a separate output folder for each vault. You can even combine multiple graphs in the same chat interface.

Are my data uploaded anywhere?

No! Everything runs locally. Ollama is local, embeddings are local, training is local. Complete privacy.

Does the system work in other languages?

Yes! Use multilingual embedding models and ensure your Ollama model supports the language. Llama3 works well with German, French, Spanish, etc.

? Resources & Links

Papers & Research

Tools

Community

PyTorch Geometric Discord
Obsidian Community Forum
r/LocalLLaMA on Reddit

? Conclusion

You now have a complete graph-based RAG system!

This system combines state-of-the-art technologies:

✅ Graph Neural Networks for structured knowledge
✅ Semantic Search with embeddings
✅ Intelligent subgraph construction (PCST)
✅ Local LLMs for privacy
✅ Modular, extensible code

Next Steps:

Start with the untrained variant
Test different questions
Generate more QA pairs if needed
Optional: Train for better results
Experiment with different models and parameters