Working with Branches & Merging
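Branches are the foundation of version control in lakeFS. They enable parallel development, safe experimentation, and collaborative data management. This guide walks through the branch operations available in the Python SDK: creating, listing, committing, comparing, merging, cherry-picking, reverting, resetting, and deleting.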
Understanding Repositories and Branches
Repositories
A repository is a versioned storage namespace that holds all your data and version history. Create a repository by specifying a storage location:
```python
import lakefs

repo = lakefs.repository("my-repo").create(
    storage_namespace="s3://my-bucket/data"
)
```
Branches
Branches enable parallel development and experimentation. Each branch maintains its own version of the data:
```python
branch = lakefs.repository("my-repo").branch("main")
experiment_branch = lakefs.repository("my-repo").branch("experiment1").create(
    source_reference="main"
)
```
Creating and Listing Branches
Creating a Branch
Create a new branch from any reference (branch, tag, or commit):
```python
import lakefs

# Initialize repository
repo = lakefs.repository("my-data-repo")

# Create a branch from main
experiment_branch = repo.branch("experiment-1").create(source_reference="main")
print(f"Created branch: {experiment_branch.id}")

# Check what commit this branch points to
commit = experiment_branch.get_commit()
print(f"Branch head: {commit.id}")
```
Using an explicit client:
```python
import lakefs
from lakefs.client import Client

# Create client
client = Client(
    host="http://localhost:8000",
    username="your-access-key",
    password="your-secret-key",
)

# Create branch with explicit client
branch = lakefs.repository("my-repo", client=client).branch("dev").create(
    source_reference="main"
)
```
Listing Branches
List all branches in a repository:
```python
import lakefs

repo = lakefs.repository("my-data-repo")

print("All branches:")
for branch in repo.branches():
    commit = branch.get_commit()
    print(f"  {branch.id:20} -> {commit.id[:8]}... ({commit.message})")
```
Checking Branch Head
Get the commit a branch is currently pointing to:
```python
import lakefs

branch = lakefs.repository("my-data-repo").branch("main")

# Get the head commit
head_commit = branch.get_commit()
print(f"Current head: {head_commit.id}")
print(f"Committed by: {head_commit.committer}")
print(f"Message: {head_commit.message}")

# Or use the head property
head_ref = branch.head
print(f"Head reference ID: {head_ref.id}")
```
Working with Branch Content
Viewing Uncommitted Changes
See what's changed on a branch since the last commit:
```python
import lakefs

branch = lakefs.repository("my-data-repo").branch("feature-branch")

# List uncommitted changes
print("Uncommitted changes:")
for change in branch.uncommitted():
    print(f"  {change.type:10} {change.path} ({change.size_bytes} bytes)")

# Count changes
change_count = len(list(branch.uncommitted()))
print(f"Total changes: {change_count}")

# Filter changes by prefix
data_changes = [c for c in branch.uncommitted() if c.path.startswith("data/")]
print(f"Changes in data/ folder: {len(data_changes)}")
```
Committing Changes
Create a commit with your changes:
```python
import lakefs

branch = lakefs.repository("my-data-repo").branch("feature-branch")

# Upload some data first
branch.object("data/dataset.csv").upload(data=b"id,value\n1,100\n2,200")

# Commit with message and metadata
ref = branch.commit(
    message="Add customer dataset",
    metadata={
        "author": "data-team",
        "version": "1.0",
        "dataset-type": "raw",
    },
)

print(f"Committed: {ref.id}")
print(f"Message: {ref.get_commit().message}")
print(f"Metadata: {ref.get_commit().metadata}")
```
Comparing Branches
Diff Between References
See what changed between two branches or commits:
```python
import lakefs

repo = lakefs.repository("my-data-repo")
main_branch = repo.branch("main")
feature_branch = repo.branch("feature-add-models")

# Compare branches
print("Changes in feature-add-models vs main:")
for change in main_branch.diff(other_ref=feature_branch):
    print(f"  {change.type:10} {change.path}")

# Count different types of changes
# (lakeFS reports a modified object with the type "changed")
changes = list(main_branch.diff(other_ref=feature_branch))
added = len([c for c in changes if c.type == "added"])
removed = len([c for c in changes if c.type == "removed"])
modified = len([c for c in changes if c.type == "changed"])

print(f"Added: {added}, Removed: {removed}, Modified: {modified}")
```
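One way to tally change types without hard-coding each value is `collections.Counter`. The sketch below is illustrative only: `FakeChange` is a hypothetical stand-in (not part of the SDK) so the snippet runs without a lakeFS server, and it assumes nothing more than a `type` attribute on each diff record, as used in the loop above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a diff record; only the `type` attribute is used.
FakeChange = namedtuple("FakeChange", ["type", "path"])


def tally_change_types(changes):
    """Count how many changes of each type appear in an iterable of diff records."""
    return Counter(change.type for change in changes)


sample = [
    FakeChange("added", "models/v2.pkl"),
    FakeChange("added", "models/v2.json"),
    FakeChange("removed", "models/v1.pkl"),
]
tally = tally_change_types(sample)
for change_type, count in sorted(tally.items()):
    print(f"{change_type}: {count}")
```

Against a live repository, the same helper could presumably be fed the iterator directly, for example `tally_change_types(main_branch.diff(other_ref=feature_branch))`.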
Merging Branches
Merging Into Another Branch
Merge changes from one branch into another:
```python
import lakefs

repo = lakefs.repository("my-data-repo")
feature_branch = repo.branch("feature-branch")
main_branch = repo.branch("main")

try:
    # Merge feature branch into main
    merge_result = feature_branch.merge_into(main_branch)
    print(f"Merge successful: {merge_result}")

    # Verify merge by checking that differences are gone
    remaining_diffs = list(main_branch.diff(other_ref=feature_branch))
    print(f"Remaining differences: {len(remaining_diffs)}")
except Exception as e:
    print(f"Merge failed: {e}")
    # Resolve conflicts or adjust data and try again
```
Merge with Conflict Detection
```python
import lakefs

repo = lakefs.repository("my-data-repo")

try:
    branch1 = repo.branch("feature-1")
    branch2 = repo.branch("feature-2")
    main = repo.branch("main")

    # Check for conflicts before merging
    conflicts = list(main.diff(other_ref=branch1))
    if conflicts:
        print(f"Potential conflicts: {len(conflicts)} changes")
        for change in conflicts[:5]:  # Show first 5
            print(f"  {change.type}: {change.path}")

    # Proceed with merge if acceptable
    merge_commit = branch1.merge_into(main)
    print(f"Merged into main: {merge_commit}")
except Exception as e:
    print(f"Merge error: {e}")
```
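A diff against `main` lists every change on a branch, not only the conflicting ones. When two feature branches are in flight, a rough pre-merge check is to look for paths that both of them touch. The following is a minimal sketch under that assumption: `FakeChange` and `overlapping_paths` are hypothetical names introduced for illustration, and an overlapping path is only a hint, not proof that lakeFS will report a conflict.

```python
from collections import namedtuple

# Hypothetical stand-in for a diff record; only the `path` attribute is used.
FakeChange = namedtuple("FakeChange", ["type", "path"])


def overlapping_paths(changes_a, changes_b):
    """Return the sorted paths that appear in both change lists."""
    paths_a = {change.path for change in changes_a}
    paths_b = {change.path for change in changes_b}
    return sorted(paths_a & paths_b)


# Sample data standing in for main.diff(other_ref=branch1) and main.diff(other_ref=branch2)
feature_1 = [
    FakeChange("changed", "data/users.csv"),
    FakeChange("added", "data/new.csv"),
]
feature_2 = [
    FakeChange("changed", "data/users.csv"),
    FakeChange("removed", "data/old.csv"),
]

overlap = overlapping_paths(feature_1, feature_2)
print(f"Paths touched by both branches: {overlap}")
```

An empty result suggests the two branches edit disjoint objects; a non-empty one marks the paths worth reviewing before merging either branch.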
Cherry-Picking Commits
Apply a specific commit from one branch to another:
```python
import lakefs

repo = lakefs.repository("my-data-repo")
main_branch = repo.branch("main")
release_branch = repo.branch("release-v1.0")

try:
    # Cherry-pick a specific commit onto release branch
    # First, get a commit ID from main
    main_commits = list(main_branch.log(max_amount=5))
    if main_commits:
        commit_to_cherry_pick = main_commits[0]

        # Cherry-pick it onto release branch
        new_commit = release_branch.cherry_pick(commit_to_cherry_pick.id)
        print(f"Cherry-picked commit: {new_commit.id}")
        print(f"Message: {new_commit.message}")
except Exception as e:
    print(f"Cherry-pick failed: {e}")
```
Reverting Changes
Revert a Commit
Undo the changes from a specific commit:
```python
import lakefs

repo = lakefs.repository("my-data-repo")
branch = repo.branch("develop")

try:
    # Get recent commits
    recent_commits = list(branch.log(max_amount=10))

    if len(recent_commits) > 1:
        # Revert the most recent commit
        commit_to_revert = recent_commits[0]
        revert_result = branch.revert(reference=commit_to_revert.id)
        print(f"Reverted commit: {commit_to_revert.id}")
        print(f"New commit created: {revert_result.id}")
except Exception as e:
    print(f"Revert failed: {e}")
```
Reverting Merge Commits
A merge commit has more than one parent, so when reverting one, pass `parent_number` (counted from 1) to choose which parent the revert is computed against:
```python
import lakefs

branch = lakefs.repository("my-data-repo").branch("main")

try:
    # Get commit history
    commits = list(branch.log(max_amount=5))

    for commit in commits:
        # If this is a merge commit (has multiple parents)
        if len(commit.parents) > 1:
            # Revert against parent 1 (the original main)
            result = branch.revert(reference=commit.id, parent_number=1)
            print(f"Reverted merge commit: {commit.id}")
            break
except Exception as e:
    print(f"Error: {e}")
```
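The search for a merge commit can also be factored into a small helper. This is a sketch only: `FakeCommit` and `first_merge_commit` are hypothetical names, and the helper assumes just a `parents` list on each commit, as in the loop above.

```python
from collections import namedtuple

# Hypothetical stand-in for a commit record; only `id` and `parents` are used.
FakeCommit = namedtuple("FakeCommit", ["id", "parents"])


def first_merge_commit(commits):
    """Return the first commit with more than one parent, or None if there is none."""
    return next((c for c in commits if len(c.parents) > 1), None)


# Newest-first history, mirroring the order in which a log is usually read
history = [
    FakeCommit("c3", ["c2"]),
    FakeCommit("c2", ["c1", "f1"]),  # merge commit: two parents
    FakeCommit("c1", []),
]

merge = first_merge_commit(history)
print(merge.id if merge else "no merge commit found")
```

Returning `None` rather than raising lets the caller decide whether a missing merge commit is an error.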
Resetting Branches
Reset Uncommitted Changes
Discard uncommitted changes on a branch:
```python
import lakefs

branch = lakefs.repository("my-data-repo").branch("feature-branch")

# Show uncommitted changes before reset
changes_before = list(branch.uncommitted())
print(f"Uncommitted changes before reset: {len(changes_before)}")

try:
    # Reset all changes
    branch.reset_changes(path_type="reset")

    # Verify changes are gone
    changes_after = list(branch.uncommitted())
    print(f"Uncommitted changes after reset: {len(changes_after)}")
except Exception as e:
    print(f"Reset failed: {e}")
```
Reset Changes for Specific Paths
Reset changes only for certain paths or prefixes:
```python
import lakefs

branch = lakefs.repository("my-data-repo").branch("develop")

try:
    # Reset changes for a specific object
    branch.reset_changes(path_type="object", path="data/temp.csv")
    print("Reset: data/temp.csv")

    # Reset all changes in a folder (common prefix)
    branch.reset_changes(path_type="common_prefix", path="logs/")
    print("Reset: logs/*")

    # Verify remaining changes
    remaining = list(branch.uncommitted())
    print(f"Remaining changes: {len(remaining)}")
except Exception as e:
    print(f"Reset error: {e}")
```
Deleting Branches
Delete a Branch
Remove a branch that's no longer needed:
```python
import lakefs

repo = lakefs.repository("my-data-repo")

try:
    # Delete a branch
    branch = repo.branch("old-experiment")
    branch.delete()
    print("Branch deleted: old-experiment")
except Exception as e:
    print(f"Delete failed: {e}")

# Verify it's gone
remaining_branches = [b.id for b in repo.branches()]
print(f"Remaining branches: {remaining_branches}")
```
Real-World Workflows
Isolated Dev/Test Environment
Feature Branch Workflow
Implement a typical feature development workflow:
```python
import lakefs

def feature_workflow(repo_name, feature_name, data_updates):
    """
    Complete feature branch workflow:
    1. Create feature branch
    2. Make changes
    3. Test/commit
    4. Merge to main
    """
    repo = lakefs.repository(repo_name)
    main = repo.branch("main")
    feature = repo.branch(feature_name)

    try:
        # 1. Create feature branch from main
        feature.create(source_reference="main")
        print(f"Created feature branch: {feature_name}")

        # 2. Make changes
        for path, data in data_updates.items():
            feature.object(path).upload(data=data)
        print(f"Uploaded {len(data_updates)} objects")

        # 3. Review changes
        changes = list(feature.uncommitted())
        print(f"Changes to commit: {len(changes)}")

        # 4. Commit
        commit_ref = feature.commit(
            message=f"Implement {feature_name}",
            metadata={"feature": feature_name}
        )
        print(f"Committed: {commit_ref.id}")

        # 5. Verify diff before merge
        diff = list(main.diff(other_ref=feature))
        print(f"Ready to merge {len(diff)} changes")

        # 6. Merge to main
        merge_result = feature.merge_into(main)
        print(f"Merged to main: {merge_result}")

        # 7. Cleanup
        feature.delete()
        print(f"Deleted feature branch: {feature_name}")

        return True
    except Exception as e:
        print(f"Feature workflow failed: {e}")
        # Cleanup on error; ignore failures (the branch may never have been created)
        try:
            feature.delete()
        except Exception:
            pass
        return False

# Usage:
feature_workflow(
    repo_name="analytics-repo",
    feature_name="add-customer-metrics",
    data_updates={
        "data/metrics/customer_v2.csv": b"id,value\n1,100\n2,200",
        "data/metrics/metadata.json": b'{"version": "2", "date": "2024-01-15"}'
    }
)
```
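The create-then-always-delete pattern in the workflow above could also be expressed as a context manager. The sketch below is one possible shape, not part of the SDK: `temporary_branch` is a hypothetical helper, and `FakeRepo` and `FakeBranch` are in-memory stand-ins that only mimic the `branch(...)`, `create(source_reference=...)`, and `delete()` calls used in this guide, so the snippet runs without a lakeFS server.

```python
from contextlib import contextmanager


@contextmanager
def temporary_branch(repo, name, source="main"):
    """Create a branch, yield it, and delete it on exit, even if the body raises."""
    branch = repo.branch(name).create(source_reference=source)
    try:
        yield branch
    finally:
        branch.delete()


class FakeBranch:
    """Hypothetical stand-in that records calls instead of talking to a server."""

    def __init__(self, name, log):
        self.id = name
        self._log = log

    def create(self, source_reference):
        self._log.append(f"create {self.id} from {source_reference}")
        return self

    def delete(self):
        self._log.append(f"delete {self.id}")


class FakeRepo:
    """Hypothetical stand-in for a repository object."""

    def __init__(self):
        self.log = []

    def branch(self, name):
        return FakeBranch(name, self.log)


repo = FakeRepo()
try:
    with temporary_branch(repo, "scratch") as branch:
        raise RuntimeError("simulated failure inside the workflow")
except RuntimeError:
    pass

print(repo.log)
```

The `finally` clause runs the delete on both the success and failure paths, which removes the need for a separate cleanup branch in the `except` handler.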
Experimentation Branch Workflow
Create isolated branches for experiments:
```python
import lakefs
import time

def create_experiment(repo_name, experiment_name, experiment_logic):
    """Run an experiment in isolation, keep results if successful"""
    repo = lakefs.repository(repo_name)
    exp_branch = repo.branch(experiment_name)

    try:
        # Create isolated experiment branch
        exp_branch.create(source_reference="main")
        print(f"Started experiment: {experiment_name}")

        # Run experiment logic
        experiment_logic(exp_branch)

        # Commit results
        changes = list(exp_branch.uncommitted())
        if changes:
            exp_branch.commit(
                message=f"Results from {experiment_name}",
                metadata={"experiment": experiment_name, "timestamp": str(time.time())}
            )
            print(f"Experiment successful, committed {len(changes)} changes")

        return exp_branch.get_commit().id
    except Exception as e:
        print(f"Experiment failed: {e}")
        # Clean up failed experiment; ignore failures during cleanup
        try:
            exp_branch.delete()
        except Exception:
            pass
        return None

# Example experiment
def model_training_experiment(branch):
    # Simulate model training
    branch.object("models/trained_v2.pkl").upload(data=b"model_binary_data")
    branch.object("logs/training.log").upload(data=b"Training complete")

# Usage:
result_commit = create_experiment(
    repo_name="ml-repo",
    experiment_name="neural-net-v3",
    experiment_logic=model_training_experiment
)

if result_commit:
    print(f"Keep results from commit: {result_commit}")
```
Release Branch Management
Manage release versions with branches and tags:
```python
import lakefs

def create_release(repo_name, version_tag):
    """Create a release: branch from main, prepare, merge, tag"""
    repo = lakefs.repository(repo_name)
    main = repo.branch("main")
    release_branch = repo.branch(f"release-{version_tag}")

    try:
        # 1. Create release branch
        release_branch.create(source_reference="main")
        print(f"Created release branch: release-{version_tag}")

        # 2. Any release-specific changes (version bumps, etc)
        release_branch.object("VERSION").upload(data=version_tag.encode())
        release_branch.object("RELEASE_NOTES.md").upload(
            data=f"# Release {version_tag}\nDate: 2024-01-15".encode()
        )

        # 3. Commit release changes
        release_ref = release_branch.commit(
            message=f"Release {version_tag}",
            metadata={"version": version_tag, "type": "release"}
        )

        # 4. Merge back to main
        release_branch.merge_into(main)
        print("Merged release to main")

        # 5. Create tag for this release
        tag = repo.tag(f"v{version_tag}").create(source_ref=release_ref)
        print(f"Tagged release: v{version_tag}")

        # 6. Clean up release branch
        release_branch.delete()
        print("Cleaned up release branch")

        return f"v{version_tag}"
    except Exception as e:
        print(f"Release creation failed: {e}")
        return None

# Usage:
release_tag = create_release("prod-repo", "1.5.0")
if release_tag:
    print(f"Release ready: {release_tag}")
```
Error Handling
Handling Common Branch Errors
```python
import lakefs
from lakefs.exceptions import NotFoundException, ConflictException, ForbiddenException

repo = lakefs.repository("my-data-repo")

# Branch already exists
try:
    branch = repo.branch("main").create(source_reference="main", exist_ok=False)
except ConflictException:
    print("Branch already exists")
    branch = repo.branch("main")

# Branch doesn't exist
try:
    branch = repo.branch("non-existent")
    commit = branch.get_commit()
except NotFoundException:
    print("Branch not found")

# Protected branch (cannot delete)
try:
    repo.branch("main").delete()
except ForbiddenException:
    print("Cannot delete protected branch")

# Source reference not found
try:
    branch = repo.branch("new-branch").create(
        source_reference="non-existent-ref"
    )
except NotFoundException:
    print("Source reference does not exist")
```