# RFC-003: DataNode In-Memory Filesystem
**Status**: Draft
**Author**: [Snider](https://github.com/Snider/)
**Created**: 2026-01-13
**License**: EUPL-1.2
---
## Abstract
DataNode is an in-memory filesystem abstraction implementing Go's `fs.FS` interface. It provides the foundation for collecting, manipulating, and serializing file trees without touching disk.
## 1. Overview
DataNode serves as the core data structure for:
- Collecting files from various sources (GitHub, websites, PWAs)
- Building container filesystems (TIM rootfs)
- Serializing to/from tar archives
- Encrypting as TRIX format
## 2. Implementation
### 2.1 Core Type
```go
type DataNode struct {
	files map[string]*dataFile
}

type dataFile struct {
	name    string
	content []byte
	modTime time.Time
}
```
**Key insight**: DataNode uses a **flat key-value map**, not a nested tree structure. Paths are stored as keys directly, and directories are implicit (derived from path prefixes).
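To make the flat-map idea concrete, here is a minimal sketch. The `flatStore` type and its `add` and `isDir` methods are hypothetical stand-ins, not the actual DataNode code, and the sketch assumes a directory "exists" whenever some key lies beneath it.

```go
package main

import (
	"fmt"
	"strings"
)

// flatStore is a hypothetical stand-in for DataNode: one map, full paths as keys.
type flatStore struct {
	files map[string][]byte
}

// add stores content under its full path; no parent entries are created.
func (s *flatStore) add(name string, content []byte) {
	s.files[strings.Trim(name, "/")] = content
}

// isDir reports whether at least one stored key lies beneath name.
func (s *flatStore) isDir(name string) bool {
	prefix := strings.Trim(name, "/") + "/"
	for key := range s.files {
		if strings.HasPrefix(key, prefix) {
			return true
		}
	}
	return false
}

// demo adds a single nested file and reports what the store then contains.
func demo() (keys int, aIsDir, abIsDir, fileIsDir bool) {
	s := &flatStore{files: map[string][]byte{}}
	s.add("a/b/c.txt", []byte("hello"))
	return len(s.files), s.isDir("a"), s.isDir("a/b"), s.isDir("a/b/c.txt")
}

func main() {
	fmt.Println(demo()) // 1 true true false
}
```

One file yields one map key, yet both `a` and `a/b` behave as directories. The cost of this layout is that every directory query scans all keys.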
### 2.2 fs.FS Implementation
DataNode implements these interfaces:
| Interface | Method | Description |
|-----------|--------|-------------|
| `fs.FS` | `Open(name string)` | Returns fs.File for path |
| `fs.StatFS` | `Stat(name string)` | Returns fs.FileInfo |
| `fs.ReadDirFS` | `ReadDir(name string)` | Lists directory contents |
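Because these are standard `io/fs` interfaces, callers can stay decoupled from DataNode and go through the `fs.Stat` and `fs.ReadDir` helpers, which use the optional interfaces when present. The sketch below uses `fstest.MapFS` from the standard library as a stand-in, since it satisfies the same three interfaces; the helper names `fileSize`, `childNames`, and `standIn` are illustrative only.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// Compile-time checks: the stand-in satisfies the same three interfaces.
var (
	_ fs.FS        = fstest.MapFS(nil)
	_ fs.StatFS    = fstest.MapFS(nil)
	_ fs.ReadDirFS = fstest.MapFS(nil)
)

// fileSize works with any fs.FS; fs.Stat uses StatFS when it is available.
func fileSize(fsys fs.FS, name string) (int64, error) {
	info, err := fs.Stat(fsys, name)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

// childNames works with any fs.FS; fs.ReadDir uses ReadDirFS when available.
func childNames(fsys fs.FS, dir string) ([]string, error) {
	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

// standIn returns an in-memory filesystem holding a single nested file.
func standIn() fs.FS {
	return fstest.MapFS{
		"docs/readme.md": &fstest.MapFile{Data: []byte("hello")},
	}
}

func main() {
	fsys := standIn()
	size, err := fileSize(fsys, "docs/readme.md")
	fmt.Println(size, err)
	names, err := childNames(fsys, ".")
	fmt.Println(names, err)
}
```

Code written this way should accept a `*DataNode` unchanged, provided its methods follow the `io/fs` contracts.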
### 2.3 Internal Helper Types
```go
// File metadata
type dataFileInfo struct {
	name    string
	size    int64
	modTime time.Time
}

func (fi *dataFileInfo) Mode() fs.FileMode { return 0444 } // Read-only

// Directory metadata
type dirInfo struct {
	name string
}

func (di *dirInfo) Mode() fs.FileMode { return fs.ModeDir | 0555 }

// File reader (implements fs.File)
type dataFileReader struct {
	info   *dataFileInfo
	reader *bytes.Reader
}

// Directory reader (implements fs.File)
type dirFile struct {
	info    *dirInfo
	entries []fs.DirEntry
	offset  int
}
```
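The excerpt above shows only `Mode()`. For orientation, the full `fs.FileInfo` method set could look like the sketch below. `sketchFileInfo` and `newEntry` are hypothetical names that mirror the fields of `dataFileInfo`; the real methods may differ.

```go
package main

import (
	"fmt"
	"io/fs"
	"time"
)

// sketchFileInfo is a hypothetical type with the same fields as dataFileInfo.
type sketchFileInfo struct {
	name    string
	size    int64
	modTime time.Time
}

func (fi *sketchFileInfo) Name() string       { return fi.name }
func (fi *sketchFileInfo) Size() int64        { return fi.size }
func (fi *sketchFileInfo) Mode() fs.FileMode  { return 0444 } // read-only regular file
func (fi *sketchFileInfo) ModTime() time.Time { return fi.modTime }
func (fi *sketchFileInfo) IsDir() bool        { return false }
func (fi *sketchFileInfo) Sys() any           { return nil }

// Compile-time check that the method set is complete.
var _ fs.FileInfo = (*sketchFileInfo)(nil)

// newEntry wraps the metadata as an fs.DirEntry using the standard adapter.
func newEntry(name string, size int64) fs.DirEntry {
	info := &sketchFileInfo{name: name, size: size, modTime: time.Unix(0, 0)}
	return fs.FileInfoToDirEntry(info)
}

func main() {
	e := newEntry("file.txt", 7)
	fmt.Println(e.Name(), e.IsDir(), e.Type().IsRegular())
}
```

`fs.FileInfoToDirEntry` avoids a separate `fs.DirEntry` type when the metadata is already in memory.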
## 3. Operations
### 3.1 Construction
```go
// Create empty DataNode
node := datanode.New()
// Returns: &DataNode{files: make(map[string]*dataFile)}
```
### 3.2 Adding Files
```go
// Add file with content
node.AddData("path/to/file.txt", []byte("content"))
// Trailing slashes are ignored (treated as directory indicator)
node.AddData("path/to/dir/", []byte("")) // Stored as "path/to/dir"
```
**Note**: Parent directories are NOT explicitly created. They are implicit based on path prefixes.
### 3.3 File Access
```go
// Open file (fs.FS interface)
f, err := node.Open("path/to/file.txt")
if err != nil {
	// fs.ErrNotExist if not found
}
defer f.Close()
content, _ := io.ReadAll(f)

// Stat file
info, err := node.Stat("path/to/file.txt")
// info.Name(), info.Size(), info.ModTime(), info.Mode()

// Read directory
entries, err := node.ReadDir("path/to")
for _, entry := range entries {
	// entry.Name(), entry.IsDir(), entry.Type()
}
```
### 3.4 Walking
```go
err := fs.WalkDir(node, ".", func(path string, d fs.DirEntry, err error) error {
	if err != nil {
		return err
	}
	if !d.IsDir() {
		// Process file
	}
	return nil
})
```
## 4. Path Semantics
### 4.1 Path Handling
- **Leading slashes stripped**: `/path/file` → `path/file`
- **Trailing slashes ignored**: `path/dir/` → `path/dir`
- **Forward slashes only**: Uses `/` regardless of OS
- **Case-sensitive**: `File.txt` ≠ `file.txt`
- **Direct lookup**: Paths stored as flat keys
### 4.2 Valid Paths
```
file.txt → stored as "file.txt"
dir/file.txt → stored as "dir/file.txt"
/absolute/path → stored as "absolute/path" (leading / stripped)
path/to/dir/ → stored as "path/to/dir" (trailing / stripped)
```
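The normalization rules above can be sketched as one small function. `normalizeKey` is a hypothetical helper, not the function DataNode uses, and it assumes that repeated leading or trailing slashes are stripped as well.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeKey strips leading and trailing slashes and leaves everything
// else (case, inner segments) untouched.
func normalizeKey(name string) string {
	name = strings.TrimLeft(name, "/")
	return strings.TrimRight(name, "/")
}

func main() {
	inputs := []string{"file.txt", "dir/file.txt", "/absolute/path", "path/to/dir/", "File.txt"}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, normalizeKey(in))
	}
}
```

`File.txt` and `file.txt` remain distinct keys because no case folding is applied.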
### 4.3 Directory Detection
Directories are **implicit**. A directory exists if at least one stored file path lies beneath it, that is, has the directory's path as a prefix.
For example, adding `a/b/c.txt` implicitly creates the directories `a` and `a/b`.
```go
// ReadDir finds directories by scanning all paths
func (dn *DataNode) ReadDir(name string) ([]fs.DirEntry, error) {
// Scans all keys for matching prefix
// Returns unique immediate children
}
```
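The scan outlined above could be implemented roughly as follows. This is a sketch under assumptions: `listChildren` and the `child` type are hypothetical, keys are taken to be already normalized, and results are sorted by name as the `fs.ReadDirFS` contract expects.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// child describes one immediate entry of a directory.
type child struct {
	name  string
	isDir bool
}

// listChildren scans every key, keeps those beneath dir, and returns the
// unique first path segment below dir, sorted by name.
func listChildren(keys []string, dir string) []child {
	prefix := ""
	if dir != "." && dir != "" {
		prefix = strings.Trim(dir, "/") + "/"
	}
	seen := map[string]bool{} // segment name -> is directory
	for _, key := range keys {
		if !strings.HasPrefix(key, prefix) {
			continue
		}
		rest := strings.TrimPrefix(key, prefix)
		if rest == "" {
			continue
		}
		// A remaining "/" means the segment has children, so it is a directory.
		name, _, nested := strings.Cut(rest, "/")
		seen[name] = seen[name] || nested
	}
	out := make([]child, 0, len(seen))
	for name, isDir := range seen {
		out = append(out, child{name: name, isDir: isDir})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].name < out[j].name })
	return out
}

func main() {
	keys := []string{"a/b/c.txt", "a/d.txt", "e.txt"}
	fmt.Println(listChildren(keys, "a"))
	fmt.Println(listChildren(keys, "."))
}
```

Each call is linear in the total number of keys, which is the price of not maintaining a tree.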
## 5. Tar Serialization
### 5.1 ToTar
```go
tarBytes, err := node.ToTar()
```
**Format**:
- All files written as `tar.TypeReg` (regular files)
- Header Mode: **0600** (fixed, not original mode)
- No explicit directory entries
- ModTime preserved from dataFile
```go
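// Serialization logic. Go map iteration order is unspecified,
// so the order of entries in the archive can differ between runs.
for path, file := range dn.files {
	header := &tar.Header{
		Name:     path,
		Mode:     0600, // Fixed mode
		Size:     int64(len(file.content)),
		ModTime:  file.modTime,
		Typeflag: tar.TypeReg,
	}
	if err := tw.WriteHeader(header); err != nil {
		return nil, err
	}
	if _, err := tw.Write(file.content); err != nil {
		return nil, err
	}
}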
```
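Since the loop ranges over a map, two exports of the same DataNode may order entries differently. If byte-for-byte reproducible archives matter (for hashing or caching, say), one option is to sort the keys first. The sketch below is a hypothetical variant, not the current `ToTar`; `writeSortedTar` and `sameArchive` are illustrative names.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"sort"
	"time"
)

// writeSortedTar writes files in sorted key order so output is reproducible.
func writeSortedTar(files map[string][]byte, modTime time.Time) ([]byte, error) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, name := range names {
		hdr := &tar.Header{
			Name:     name,
			Mode:     0600,
			Size:     int64(len(files[name])),
			ModTime:  modTime,
			Typeflag: tar.TypeReg,
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(files[name]); err != nil {
			return nil, err
		}
	}
	// Close writes the tar trailer; without it the archive is truncated.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// sameArchive reports whether two exports of the same input are identical.
func sameArchive(files map[string][]byte) bool {
	t := time.Unix(0, 0)
	a, errA := writeSortedTar(files, t)
	b, errB := writeSortedTar(files, t)
	return errA == nil && errB == nil && len(a) > 0 && bytes.Equal(a, b)
}

func main() {
	files := map[string][]byte{"b.txt": []byte("2"), "a.txt": []byte("1")}
	fmt.Println(sameArchive(files))
}
```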
### 5.2 FromTar
```go
node, err := datanode.FromTar(tarBytes)
```
**Parsing**:
- Only reads `tar.TypeReg` entries
- Ignores directory entries (`tar.TypeDir`)
- Stores path and content in flat map
```go
// Deserialization logic
for {
	header, err := tr.Next()
	if err == io.EOF {
		break // End of archive
	}
	if err != nil {
		return nil, err
	}
	if header.Typeflag == tar.TypeReg {
		content, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		dn.files[header.Name] = &dataFile{
			name:    filepath.Base(header.Name),
			content: content,
			modTime: header.ModTime,
		}
	}
}
```
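To illustrate the filtering described above, the following self-contained sketch builds an archive holding a directory, a regular file, and a symlink, then reads it back keeping only `tar.TypeReg` entries. All function names here are hypothetical and the code works on a plain map instead of a DataNode.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
	"io"
)

// readRegularFiles collects only tar.TypeReg entries into a flat map.
func readRegularFiles(r io.Reader) (map[string][]byte, error) {
	files := map[string][]byte{}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return files, nil // end of archive
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // directories, symlinks, and other types are skipped
		}
		content, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		files[hdr.Name] = content
	}
}

// sampleArchive builds a small tar with a directory, a file, and a symlink.
func sampleArchive() ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	entries := []struct {
		hdr  *tar.Header
		body string
	}{
		{&tar.Header{Name: "dir/", Typeflag: tar.TypeDir, Mode: 0755}, ""},
		{&tar.Header{Name: "dir/file.txt", Typeflag: tar.TypeReg, Mode: 0600, Size: 5}, "hello"},
		{&tar.Header{Name: "dir/link", Typeflag: tar.TypeSymlink, Linkname: "file.txt", Mode: 0777}, ""},
	}
	for _, e := range entries {
		if err := tw.WriteHeader(e.hdr); err != nil {
			return nil, err
		}
		if e.body == "" {
			continue
		}
		if _, err := tw.Write([]byte(e.body)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readSample round-trips the sample archive through readRegularFiles.
func readSample() (map[string][]byte, error) {
	data, err := sampleArchive()
	if err != nil {
		return nil, err
	}
	return readRegularFiles(bytes.NewReader(data))
}

func main() {
	files, err := readSample()
	fmt.Println(len(files), string(files["dir/file.txt"]), err)
}
```

Only `dir/file.txt` survives the round trip; the directory becomes implicit again and the symlink is dropped.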
### 5.3 Compressed Variants
```go
// gzip compressed
tarGz, err := node.ToTarGz()
node, err := datanode.FromTarGz(tarGzBytes)
// xz compressed
tarXz, err := node.ToTarXz()
node, err := datanode.FromTarXz(tarXzBytes)
```
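Conceptually, the gzip variants wrap the tar bytes in a compression layer. A minimal sketch with the standard library's `compress/gzip` follows; `gzipBytes`, `gunzipBytes`, and `roundTrip` are hypothetical helpers, not the actual `ToTarGz`/`FromTarGz` code. The Go standard library has no xz package, so the xz variants presumably rely on a third-party dependency and are not sketched here.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes compresses raw (for example, tar output) with default settings.
func gzipBytes(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes remaining data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipBytes reverses gzipBytes.
func gunzipBytes(compressed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// roundTrip compresses and decompresses a string, returning the result.
func roundTrip(s string) (string, error) {
	compressed, err := gzipBytes([]byte(s))
	if err != nil {
		return "", err
	}
	raw, err := gunzipBytes(compressed)
	return string(raw), err
}

func main() {
	out, err := roundTrip("tar bytes would go here")
	fmt.Println(out, err)
}
```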
## 6. File Modes
| Context | Mode | Notes |
|---------|------|-------|
| File read (fs.FS) | 0444 | Read-only for all |
| Directory (fs.FS) | 0555 | Read+execute for all |
| Tar export | 0600 | Owner read/write only |
**Note**: Original file modes are NOT preserved. All files get fixed modes.
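For readers less used to octal notation, this short sketch prints the three modes from the table in the familiar `ls -l` style, using only `io/fs`. The `modeStrings` helper is illustrative.

```go
package main

import (
	"fmt"
	"io/fs"
)

// modeStrings renders the three fixed modes from the table above.
func modeStrings() (file, dir, tarExport string) {
	file = fs.FileMode(0444).String()
	dir = (fs.ModeDir | 0555).String()
	tarExport = fs.FileMode(0600).String()
	return file, dir, tarExport
}

func main() {
	file, dir, tarExport := modeStrings()
	fmt.Println(file)      // -r--r--r--
	fmt.Println(dir)       // dr-xr-xr-x
	fmt.Println(tarExport) // -rw-------
}
```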
## 7. Memory Model
- All content held in memory as `[]byte`
- No lazy loading
- No memory mapping
- Safe for concurrent reads once population is complete; the map carries no lock, so `AddData` must not run concurrently with reads or other writes
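The struct in section 2.1 has no lock, so callers that need to add files while other goroutines read must synchronize on their own. One possible approach is a wrapper guarded by `sync.RWMutex`. The sketch below uses a hypothetical `lockedStore` type over a plain map; it is not part of the DataNode API.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedStore is a hypothetical wrapper that guards a flat map with a lock.
type lockedStore struct {
	mu    sync.RWMutex
	files map[string][]byte
}

func newLockedStore() *lockedStore {
	return &lockedStore{files: map[string][]byte{}}
}

// add takes the write lock, so it excludes readers and other writers.
func (s *lockedStore) add(name string, content []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.files[name] = content
}

// size takes the read lock, so many readers can run at once.
func (s *lockedStore) size() int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var total int64
	for _, c := range s.files {
		total += int64(len(c))
	}
	return total
}

// fillConcurrently adds n one-byte files from n goroutines while also
// reading, then returns the final total size.
func fillConcurrently(n int) int64 {
	s := newLockedStore()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.add(fmt.Sprintf("file-%d.txt", i), []byte{'x'})
			_ = s.size() // read interleaved with writes
		}(i)
	}
	wg.Wait()
	return s.size()
}

func main() {
	fmt.Println(fillConcurrently(50))
}
```

If a DataNode is fully built before being shared, no wrapper is needed.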
### 7.1 Size Calculation
```go
func (dn *DataNode) Size() int64 {
	var total int64
	for _, f := range dn.files {
		total += int64(len(f.content))
	}
	return total
}
```
## 8. Integration Points
### 8.1 TIM RootFS
```go
// node is a *datanode.DataNode; the variable names avoid shadowing the packages
t := &tim.TIM{
	Config: configJSON,
	RootFS: node, // DataNode as container filesystem
}
```
### 8.2 TRIX Encryption
```go
// Encrypt DataNode to TRIX (ToTar returns two values, so call it first)
tarBytes, err := node.ToTar()
encrypted, err := trix.Encrypt(tarBytes, password)
// Decrypt TRIX to DataNode
decrypted, err := trix.Decrypt(encrypted, password)
restored, err := datanode.FromTar(decrypted)
```
### 8.3 Collectors
```go
// GitHub collector returns DataNode
node, err := github.CollectRepo(url)
// Website collector returns DataNode
node, err := website.Collect(url, depth)
```
## 9. Implementation Reference
- Source: `pkg/datanode/datanode.go`
- Tests: `pkg/datanode/datanode_test.go`
## 10. Security Considerations
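1. **Path traversal**: Leading slashes are stripped, and `..` needs no handling inside DataNode because keys are looked up verbatim in the flat map. Code that extracts entries to disk must still sanitize paths itself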
2. **Memory exhaustion**: No built-in limits; caller must validate input size
3. **Tar bombs**: FromTar reads all entries into memory
4. **Symlinks**: Not supported (intentional - tar.TypeReg only)
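Until built-in limits exist, a caller handling untrusted archives can cap how much it is willing to read. The sketch below shows one way to do this with `io.LimitReader`. `readTarLimited`, `errTooLarge`, and the test helpers are hypothetical, and the code reads into a plain map, not a DataNode.

```go
package main

import (
	"archive/tar"
	"bytes"
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("archive exceeds size limit")

// readTarLimited reads regular files but fails once the combined content
// would exceed maxTotal bytes.
func readTarLimited(r io.Reader, maxTotal int64) (map[string][]byte, error) {
	files := map[string][]byte{}
	tr := tar.NewReader(r)
	remaining := maxTotal
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return files, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		// Read at most remaining+1 bytes: one extra byte is enough to detect
		// an oversized entry without buffering all of it.
		content, err := io.ReadAll(io.LimitReader(tr, remaining+1))
		if err != nil {
			return nil, err
		}
		if int64(len(content)) > remaining {
			return nil, errTooLarge
		}
		remaining -= int64(len(content))
		files[hdr.Name] = content
	}
}

// singleFileTar builds an archive with one zero-filled file of the given size.
func singleFileTar(size int) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: "blob.bin", Typeflag: tar.TypeReg, Mode: 0600, Size: int64(size)}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(make([]byte, size)); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// checkLimit reads a generated archive of the given size under the limit.
func checkLimit(size int, limit int64) error {
	data, err := singleFileTar(size)
	if err != nil {
		return err
	}
	_, err = readTarLimited(bytes.NewReader(data), limit)
	return err
}

func main() {
	fmt.Println(checkLimit(10, 100))
	fmt.Println(checkLimit(200, 100))
}
```

For compressed input, the limit has to apply to the decompressed stream, since a small gzip or xz payload can expand to a very large tar.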
## 11. Limitations
- No symlink support
- No extended attributes
- No sparse files
- Fixed file modes (0600 on export)
- No streaming (full content in memory)
## 12. Future Work
- [ ] Streaming tar generation for large files
- [ ] Optional mode preservation
- [ ] Size limits for untrusted input
- [ ] Lazy loading for large datasets