GIT-FILTER-REPO(1) | git-filter-repo Manual | GIT-FILTER-REPO(1) |
NAME¶
git-filter-repo - Rewrite repository history
SYNOPSIS¶
git filter-repo --analyze
git filter-repo [<path_filtering_options>] [<content_filtering_options>]
    [<ref_renaming_options>] [<commit_message_filtering_options>]
    [<name_or_email_filtering_options>] [<parent_rewriting_options>]
    [<generic_callback_options>] [<miscellaneous_options>]
DESCRIPTION¶
Rapidly rewrite entire repository history using user-specified filters. This is a destructive operation which should not be used lightly; it writes new commits, trees, tags, and blobs corresponding to (but filtered from) the original objects in the repository, then deletes the original history and leaves only the new. See the section called “DISCUSSION” for more details on the ramifications of using this tool. Several different types of history rewrites are possible; examples include (but are not limited to):
Additionally, several concerns are handled automatically (many of these can be overridden, but they are all on by default):
And additional facilities are available via a config option
Also, it’s worth noting that there is an important safety mechanism:
For those who know that there is large unwanted stuff in their history and want help finding it, this command also
See also the section called “VERSATILITY”, the section called “DISCUSSION”, the section called “EXAMPLES”, and the section called “INTERNALS”.
OPTIONS¶
Analysis Options¶
--analyze
Filtering based on paths (see also --filename-callback)¶
These options specify the paths to select. Note that much like git itself, renames are NOT followed so you may need to specify multiple paths, e.g. --path olddir/ --path newdir/
--invert-paths
--path-match <dir_or_file>, --path <dir_or_file>
--path-glob <glob>
--path-regex <regex>
--use-base-name
Renaming based on paths (see also --filename-callback)¶
Note: if you combine path filtering with path renaming, be aware that a rename directive does not select paths, it only says how to rename paths that are selected with the filters.
--path-rename <old_name:new_name>, --path-rename-match <old_name:new_name>
Path shortcuts¶
--paths-from-file <filename>
--subdirectory-filter <directory>
--to-subdirectory-filter <directory>
Content editing filters (see also --blob-callback)¶
--replace-text <expressions_file>
--strip-blobs-bigger-than <size>
--strip-blobs-with-ids <blob_id_filename>
Renaming of refs (see also --refname-callback)¶
--tag-rename <old:new>
Filtering of commit messages (see also --message-callback)¶
--replace-message <expressions_file>
--preserve-commit-hashes
--preserve-commit-encoding
Filtering of names & emails (see also --name-callback and --email-callback)¶
--mailmap <filename>
--use-mailmap
Parent rewriting¶
--replace-refs {delete-no-add, delete-and-add, update-no-add, update-or-add, update-and-add, old-default}
--prune-empty {always, auto, never}
--prune-degenerate {always, auto, never}
--no-ff
Generic callback code snippets¶
--filename-callback <function_body>
--message-callback <function_body>
--name-callback <function_body>
--email-callback <function_body>
--refname-callback <function_body>
--file-info-callback <function_body>
--blob-callback <function_body>
--commit-callback <function_body>
--tag-callback <function_body>
--reset-callback <function_body>
Sensitive Data Removal¶
--sensitive-data-removal, --sdr
Note that if you have any local-only changes (i.e. un-pushed changes) in your repository, on any branch or ref, this fetch step may discard them. Working in a fresh clone avoids this problem; see also the --no-fetch option if you don't want to work with a fresh clone and you have important local-only changes.
--no-fetch
Location to filter from/to¶
Note
Specifying alternate source or target locations implies --partial. However, unlike normal uses of --partial, this doesn’t risk mixing old and new history since the old and new histories are in different repositories.
--source <source>
--target <target>
Miscellaneous options¶
--help, -h
--force, -f
--partial
--refs <refs+>
--dry-run
--debug
--stdin
--quiet
OUTPUT¶
Every time filter-repo is run, files are created in the .git/filter-repo/ directory. These files are updated or overwritten on every run.
Commit map¶
The $GIT_DIR/filter-repo/commit-map file contains a mapping of how all commits were (or were not) changed.
Reference map¶
The $GIT_DIR/filter-repo/ref-map file contains a mapping of which local references were (or were not) changed.
Changed References¶
The $GIT_DIR/filter-repo/changed-refs file contains a list of refs that were changed.
First Changed Commits¶
The $GIT_DIR/filter-repo/first-changed-commits contains a list of the first commit(s) changed by the filtering operation. These are the commits that got rewritten and which had no parents that were also rewritten.
So, for example, if you had commits A1-B1-C1-D1-E1 before running git-filter-repo, and afterward you had commits A1-B2-C2-D2-E2, then the First Changed Commits file would contain just one line, which would be the hash of B1 (the original commit that was rewritten to B2).
In most cases, there will only be one commit listed, but if you had multiple root commits or a non-linear history where the commits on those diverging histories were the first ones modified, then there could be multiple first changed commits and they will each be listed on separate lines.
Already Ran¶
The $GIT_DIR/filter-repo/already_ran file records that git-filter-repo has been run. When this file is present, future runs will be treated as an extension of the previous filtering operation.
Concretely, this means:
* The "Fresh Clone" check is bypassed
This is done because past runs would cause the repository to no longer look like a fresh clone, and thus fail the fresh clone check, but doing filtering via multiple invocations of git-filter-repo is an intended and supported use case. You already passed or bypassed the "Fresh Clone" check on your initial run.
In other words, if the first filter-repo invocation rewrote commit A to commit B, and the second filter-repo invocation rewrote commit B to commit C, then the second run would have an "A C" entry rather than a "B C" entry for the changed commit.
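The chaining described here can be pictured as composing two old-to-new mappings. The following is a minimal sketch of that bookkeeping, not filter-repo's own code; the function name compose_commit_maps and the use of plain dicts are assumptions made for illustration.

```python
def compose_commit_maps(first_run, second_run):
    """Map each original commit to its final rewritten commit.

    first_run:  {original: intermediate} from the first invocation
    second_run: {intermediate: final} from the second invocation
    """
    combined = {}
    for original, intermediate in first_run.items():
        # Commits untouched by the second run keep their intermediate hash.
        combined[original] = second_run.get(intermediate, intermediate)
    return combined


print(compose_commit_maps({"A": "B"}, {"B": "C"}))  # {'A': 'C'}
```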
In more detail, if the repository originally had the following commits:
A1-B1-C1-D1-E1
and the first invocation of filter-repo changed this to
A1-B1-C2-D2-E2
then the first run would report "C1" as the first changed commit. If a second filter-repo run further changed this to
A1-B1-C2-D3-E3
then it would report "C1" as the first changed commit, not "D2", because it is comparing to the original commits rather than the intermediate ones.
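The rule "rewritten commits with no rewritten parents", evaluated against the original history, can be sketched as follows. This is only an illustration of the definition given above; the function first_changed_commits and the dict-based inputs are hypothetical and are not how filter-repo stores its data.

```python
def first_changed_commits(parents, commit_map):
    """Return original commits that were rewritten but whose parents were not.

    parents:    {commit: [parent, ...]} for the original history
    commit_map: {original: rewritten}; unchanged commits map to themselves
    """
    changed = {c for c, new in commit_map.items() if new != c}
    return sorted(
        c for c in changed
        if not any(p in changed for p in parents.get(c, []))
    )


# Original history A1-B1-C1-D1-E1 (each commit's parent is the one before it).
parents = {"A1": [], "B1": ["A1"], "C1": ["B1"], "D1": ["C1"], "E1": ["D1"]}

# State after the first run (A1-B1-C2-D2-E2) and after the second run
# (A1-B1-C2-D3-E3), both expressed relative to the original commits.
after_first = {"A1": "A1", "B1": "B1", "C1": "C2", "D1": "D2", "E1": "E2"}
after_second = {"A1": "A1", "B1": "B1", "C1": "C2", "D1": "D3", "E1": "E3"}

print(first_changed_commits(parents, after_first))   # ['C1']
print(first_changed_commits(parents, after_second))  # ['C1']
```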
However, if the already_ran file exists but is older than 1 day when they invoke git-filter-repo, the user will be prompted for whether the new run should be considered a continuation of the previous run. If they do not answer in the affirmative, then the above three bullets will not apply. This prompt exists because users might do a history rewrite in a repository, forget about it and leave the $GIT_DIR/filter-repo directory around, and then some months or years later need to do another rewrite. If commits have been made public and shared from the previous rewrite, then the next filter-repo run should not be considered a continuation of the previous filtering run.
Original LFS Objects¶
When running with the --sensitive-data-removal flag, and LFS is in use by the repository, the $GIT_DIR/filter-repo/original_lfs_objects contains a list of LFS objects referenced by the repository before the rewrite, in sorted order.
Orphaned LFS Objects¶
When running with the --sensitive-data-removal flag, and LFS is in use by the repository, the $GIT_DIR/filter-repo/orphaned_lfs_objects contains a list of LFS objects that used to be referenced by the repository but no longer are after git-filter-repo has run. Objects appear in sorted order.
FRESH CLONE SAFETY CHECK AND --FORCE¶
Since filter-repo does irreversible rewriting of history, it is important to avoid making changes to a repo for which the user doesn’t have a good backup. The primary defense mechanism is to simply educate users and rely on them to be good stewards of their data; thus there are several warnings in the documentation about how filter-repo rewrites history.
However, as a service to users, we would like to provide an additional safety check beyond the documentation. There isn’t a good way to check if the user has a good backup, but we can ask a related question that is an imperfect but quite reasonable proxy: "Is this repository a fresh clone?" Unfortunately, that is also a question we can’t get a perfect answer to; git provides no way to answer that question. However, there are approximately a dozen things that I found that seem to always be true of brand new clones (assuming they are either clones of remote repositories or are made with the --no-local flag), and I check for all of those.
These checks can have both false positives and false negatives. Someone might have a perfectly good backup of their repo without it actually being a fresh clone — but there’s no way for filter-repo to know that. Conversely, someone could look at all things that filter-repo checks for in its safety checks and then just tweak their non-backed-up repository to satisfy those conditions (though it would take a fair amount of effort, and it’s astronomically unlikely that a repo that isn’t a fresh clone randomly happens to match all the criteria). In practice, the safety checks filter-repo uses seem to be really good at avoiding people accidentally running filter-repo on a repository that they shouldn’t be running it on. It even caught me once when I did mean to run filter-repo but was in a different directory than I thought I was.
In short, it’s perfectly fine to use --force to override the safety checks as long as you’re okay with filter-repo irreversibly rewriting the contents of the current repository. It is a really bad idea to get in the habit of always specifying --force; if you do, one day you will run one of your commands in the wrong directory like I did, and you won’t have the safety check anymore to bail you out. Also, it is definitely NOT okay to recommend --force on forums, Q&A sites, or in emails to other users without first carefully explaining that --force means putting your repositories’ data at risk. I am especially bothered by people who suggest the flag when it clearly is NOT needed; they are needlessly putting other people’s data at risk.
VERSATILITY¶
filter-repo has a hierarchy of capabilities on the spectrum from easy to use convenience flags that perform pre-defined types of filtering, to choices that provide lots of flexibility in controlling how filtering occurs. This spectrum includes the following:
For more information about callbacks, see the section called “CALLBACKS”. For examples on writing python programs that import filter-repo as a module to create new history rewriting tools, look at the contrib/filter-repo-demos/ directory. That directory includes, among other examples, a reimplementation of git-filter-branch which is faster than git-filter-branch, and a reimplementation of BFG Repo Cleaner with several bug fixes and new features.
DISCUSSION¶
Using filter-repo is relatively simple, but rewriting history is part of a larger discussion in terms of collaboration. When you rewrite history, the old and new histories are no longer compatible; if you push this history somewhere for others to view, it will look as though you’ve done a rebase of all branches and tags. Make sure you are familiar with the "RECOVERING FROM UPSTREAM REBASE" section of git-rebase(1) (and in particular, "The hard case") before proceeding, in addition to this section.
Steps to use git-filter-repo as part of the bigger picture of doing a history rewrite are roughly as follows:
Why is my origin removed?¶
When you rewrite history, all commit IDs (starting with the first one where changes are made) are modified. Even if you think you didn’t change an intermediate commit, the fact that you changed any of its ancestors is also a change that counts and will cause a commit’s ID to change as well. It is unfortunately all-too-easy for yourself or someone else to accidentally merge the old ugly history you were trying to rewrite with the new history, resulting in not only the old ugly history returning but getting you "two copies" of each commit (both an original commit and a cleaned-up alternative), and thus doubling the number of commits in your repository. In short, you end up with an even bigger mess to clean up than you started with.
This happens frequently to people using git filter-branch or BFG repo cleaner, and can happen to folks using git filter-repo if they insist on pushing back to the original repo. Example ways you can get such an even uglier history include:
Removing the origin remote and suggesting people push to a new repo (and ensuring they tell others to clone the new repo) is usually a good forcing function to avoid these problems. But, if people really want to push to the original repository despite these warnings, it is trivial to do so; simply run:
and then you can push (e.g. git push --force --branches --tags --prune). Since removing the origin url is such a cheap way to potentially prevent big messes, and it’s so easy to work around for those that really do want to push back over the original history, removing the origin url is a great safety measure that I employ.
One final warning if you really want to push back to the original repo: see the next section on sensitive data removals. Those are the steps needed when pushing back to the original repo; they are so involved that I assume they are only worth it when sensitive data is involved, but you can choose to follow them for other kinds of rewrites too.
Sensitive Data Removals¶
Sensitive data removals are a specialized type of history rewrite. While it is always very problematic to mix the cleaned-up history with the non-cleaned-up history, for sensitive data removals it is also bad to allow others to continue to view/clone/fetch the non-cleaned-up history at all; users often need to try to expunge the old history as well.
Note that if the sensitive data under consideration is a token/password/credential/secret (as is often the case), then it is important that you revoke and rotate that credential first. Once the credential is revoked or rotated, it can no longer be used for access. Revoking/rotating may resolve your problem without resorting to the heavy-handed action of rewriting and purging history.
For sensitive data removal history rewrites, there are three high-level steps:
Each will be discussed in greater detail below.
One important thing to note, though, is that others working on the same repository should be instructed to stop while you do the cleanup; if they continue development during your cleanup, you’ll likely be forced to either discard their changes or start over on your cleanup.
Rewrite the repository locally, using git-filter-repo
The first step is to rewrite a copy of your repository locally using git-filter-repo. The exact commands to run will differ based on where in your repository the sensitive data is found, but some general tips:
After rewriting the history locally, make sure to inspect it to ensure the sensitive data has been removed. Some commands that might be handy for checking are:
git log --all --name-status -- ${PROBLEMATIC_FILE1} ${PROBLEMATIC_FILE2}
or
git log -S"${PROBLEMATIC_STRING}" --all -p --
If either of these commands turns up more sensitive data, then run additional git-filter-repo commands to clean up the necessary data before proceeding.
Make sure other copies are cleaned up: primary server
Cleaning up the repository you cloned from requires force pushing your rewritten history over the original. You need to force push all refs, not just your current branch. You can use the following command to do so (read the bulleted list right after this command before running it):
git push --force --mirror origin
Several comments on this command:
Also, if any LFS objects were orphaned by your rewrite, those objects likely contain sensitive data and need to be deleted/purged from the LFS server. You’ll have to ask the maintainer of the LFS server you are using for how to delete/purge those on the server.
Make sure other copies are cleaned up: clones of colleagues
After you have cleaned up the server, the easiest way to clean up other clones is to make everyone delete their existing clones and reclone.
If that isn’t an option, then you will need to proceed carefully because a simple git pull && git push from any other clone will recontaminate the main repository and make the mess even harder to clean up. To avoid this, before pushing from any other clone, you’ll need to have them clean up their copy, as detailed below.
First, though, let me note that you should not have other developers try to clean up their clone by running the same git-filter-repo commands that you ran. While that sometimes may happen to work, it is not reliable in general. Running the same git-filter-repo commands, even if identical, can result in them getting new commit hashes that differ from yours, and you’ll end up with a mess involving two or more copies of every commit.
Instead developers with other clones of the repository should run through the following steps to clean up their copy if they are unwilling to discard their copy and reclone:
Once these steps are complete, you also need to verify that the clone no longer contains any sensitive data (it is really easy to miss something, which puts you at risk of recontaminating other repositories with the sensitive data). You can do so by running:
git cat-file -t ${HASH_OF_FIRST_CHANGED_COMMIT}
Where ${HASH_OF_FIRST_CHANGED_COMMIT} was printed by git-filter-repo at the end of its run (if there was more than one "first changed commit", run this command multiple times, with each commit hash). If this command returns a fatal error, then the commit has correctly been removed from this repository. If it responds with "commit", then the object still exists and you need to re-delete tags, re-rebase all necessary branches/refs, and re-expire reflogs and redo the gc. If you are curious about which branches or refs were the problematic ones holding on to ${HASH_OF_FIRST_CHANGED_COMMIT}, then presuming you did the reflog expire and gc jobs above, the following command should help you find the problematic branches/refs:
git for-each-ref --contains ${HASH_OF_FIRST_CHANGED_COMMIT}
Also, remember, the cat-file command needs to come back with a fatal error for every ${HASH_OF_FIRST_CHANGED_COMMIT} involved if you have more than one.
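When there are several first changed commits, the check can be looped. The sketch below wraps the git cat-file -t command shown above; the helper names commit_still_present and remaining_commits, and the injectable run parameter (there only to make the helpers easy to test), are assumptions for illustration.

```python
import subprocess


def commit_still_present(commit_hash, run=subprocess.run):
    """Return True if `git cat-file -t` reports the object as a commit."""
    result = run(
        ["git", "cat-file", "-t", commit_hash],
        capture_output=True,
        text=True,
    )
    # A fatal error (non-zero exit) means the object is gone from this clone.
    return result.returncode == 0 and result.stdout.strip() == "commit"


def remaining_commits(hashes, run=subprocess.run):
    """Return the hashes that still exist and therefore need more cleanup."""
    return [h for h in hashes if commit_still_present(h, run=run)]
```

Run from inside the clone being checked, remaining_commits([...]) with the reported hashes should return an empty list once the cleanup is complete.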
After this is all done, then if any LFS objects were orphaned by the rewrite (which again, you will be told if you use the --sensitive-data-removal option when you run git-filter-repo), then you also need to remove those LFS objects. Look for them a couple directories under .git/lfs/objects/, and delete them.
Prevent repeats and avoid future sensitive data spills
There are several measures you can take to help avoid repeat problems. Not all may be applicable for your case, but the more that are, the more likely you can avoid problems.
For dealing with the existing sensitive data spill:
Steps to help avoid other future sensitive data spills:
EXAMPLES¶
Path based filtering¶
To only keep the README.md file plus the directories guides and tools/releases/:
git filter-repo --path README.md --path guides/ --path tools/releases
Directory names can be given with or without a trailing slash, and all filenames are relative to the toplevel of the repo. To keep all files except these paths, just add --invert-paths:
git filter-repo --path README.md --path guides/ --path tools/releases --invert-paths
If you want to have both an inclusion filter and an exclusion filter, just run filter-repo multiple times. For example, to keep the src/main subdirectory but exclude files under src/main named data, run:
git filter-repo --path src/main/
git filter-repo --path-glob 'src/*/data' --invert-paths
Note that the asterisk (*) will match across multiple directories, so the second command would remove e.g. src/main/org/whatever/data. Also, the second command by itself would also remove e.g. src/not-main/foo/data, but since src/not-main/ was removed by the first command, that’s not an issue. Also, the use of quotes around the asterisk is sometimes important to avoid glob expansion by the shell.
You can also select paths by regular expression (see https://docs.python.org/3/library/re.html#regular-expression-syntax). For example, to only include files from the repo whose name is in the format YYYY-MM-DD.txt and is found at least two subdirectories deep:
git filter-repo --path-regex '^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$'
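Since the syntax is Python's, a pattern can be tried against sample paths before any history is rewritten. The snippet below is a small sketch using the pattern above as a bytes regex; the candidate paths are made up for illustration.

```python
import re

# The pattern from the example above, written as a bytes regex.
pattern = re.compile(rb"^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$")

candidates = [
    b"2020-01-02.txt",           # toplevel: no directory components
    b"logs/2020-01-02.txt",      # only one directory deep
    b"logs/app/2020-01-02.txt",  # two directories deep
    b"logs/app/2020-01-02xtxt",  # the unescaped "." accepts any character
]

for path in candidates:
    print(path, bool(pattern.match(path)))
```

Only the last two candidates match. Writing \.txt instead of .txt would make the pattern reject the final one.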
If you want two directories to be renamed (and maybe merged if both are renamed to the same location), use --path-rename; for example, to rename both cmds/ and src/scripts/ to tools/:
git filter-repo --path-rename cmds:tools --path-rename src/scripts/:tools/
As with --path, directories can be specified with or without a trailing slash for --path-rename.
If you do a --path-rename to something that was already in use, it will be silently overwritten. However, if you try to rename multiple files to the same location (e.g. src/scripts/run_release.sh and cmds/run_release.sh both existed and had different content with the renames above), then you will be given an error. If you have such a case, you may want to add another rename command to move one of the paths somewhere else where it won’t collide:
git filter-repo --path-rename cmds/run_release.sh:tools/do_release.sh \
--path-rename cmds/:tools/ \
--path-rename src/scripts/:tools/
Also, --path-rename brings up ordering issues; all path arguments are applied in order. Thus, a command like
git filter-repo --path-rename sources/:src/main/ --path src/main/
would make sense but reversing the two arguments would not (src/main/ is created by the rename so reversing the two would give you an empty repo). Also, note that the rename of cmds/run_release.sh a couple examples ago was done before the other renames.
Note that path renaming does not do path filtering, thus the following command
git filter-repo --path src/main/ --path-rename tools/:scripts/
would not result in the tools or scripts directories being present, because the single filter selected only src/main/. It’s likely that you would instead want to run:
git filter-repo --path src/main/ --path tools/ --path-rename tools/:scripts/
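The interaction between ordering, selection, and renaming can be modelled in a few lines. This is a simplified model written for this explanation, not filter-repo's implementation: the names matches and new_name and the tuple encoding of arguments are hypothetical, and only plain --path and --path-rename are covered.

```python
def matches(prefix, path):
    """True if path equals prefix or lies under the directory prefix."""
    prefix = prefix.rstrip(b"/")
    return path == prefix or path.startswith(prefix + b"/")


def new_name(args, path):
    """Apply ordered ("path", p) and ("rename", (old, new)) args to one path.

    Returns the resulting path, or None if the path is filtered out.
    """
    has_filter = any(kind == "path" for kind, _ in args)
    wanted = False
    for kind, value in args:
        if kind == "path":
            # Filters are tested against the name as renamed so far.
            wanted = wanted or matches(value, path)
        elif kind == "rename":
            old, new = value
            if path.startswith(old):
                path = new + path[len(old):]
    return path if (wanted or not has_filter) else None


rename_then_filter = [("rename", (b"sources/", b"src/main/")), ("path", b"src/main/")]
filter_then_rename = list(reversed(rename_then_filter))
print(new_name(rename_then_filter, b"sources/a.c"))  # b'src/main/a.c'
print(new_name(filter_then_rename, b"sources/a.c"))  # None

only_main = [("path", b"src/main/"), ("rename", (b"tools/", b"scripts/"))]
with_tools = [("path", b"src/main/"), ("path", b"tools/"), ("rename", (b"tools/", b"scripts/"))]
print(new_name(only_main, b"tools/x.sh"))   # None
print(new_name(with_tools, b"tools/x.sh"))  # b'scripts/x.sh'
```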
If you prefer to filter based solely on basename, use the --use-base-name flag (though this is incompatible with --path-rename). For example, to only include README.md and Makefile files from any directory:
git filter-repo --use-base-name --path README.md --path Makefile
If you wanted to delete all .DS_Store files in any directory, you could either use:
git filter-repo --invert-paths --path '.DS_Store' --use-base-name
or
git filter-repo --invert-paths --path-glob '*/.DS_Store' --path '.DS_Store'
(the --path-glob isn’t sufficient by itself as it might miss a toplevel .DS_Store file; further while something like --path-glob '*.DS_Store' would workaround that problem it would also grab files named foo.DS_Store or bar/baz.DS_Store)
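These glob cases can be checked quickly, assuming --path-glob behaves like Python's fnmatch module, in which the asterisk also matches "/" (an assumption; the text above only states that the asterisk matches across directories). The sample paths are made up for illustration.

```python
from fnmatch import fnmatchcase

# (glob, path) pairs taken from the examples discussed above.
checks = [
    ("src/*/data", "src/main/org/whatever/data"),
    ("*/.DS_Store", "docs/.DS_Store"),
    ("*/.DS_Store", ".DS_Store"),
    ("*.DS_Store", ".DS_Store"),
    ("*.DS_Store", "foo.DS_Store"),
    ("*.DS_Store", "bar/baz.DS_Store"),
]

for glob, path in checks:
    print(f"{glob!r:16} {path!r:32} {fnmatchcase(path, glob)}")
```

Under that assumption, only the third pair fails to match: '*/.DS_Store' misses the toplevel file, while '*.DS_Store' matches it but also matches foo.DS_Store and bar/baz.DS_Store.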
Finally, see also the --filename-callback from the section called “CALLBACKS”.
Filtering based on many paths¶
If you have a long list of files, directories, globs, or regular expressions to filter on, you can stick them in a file and use --paths-from-file; for example, with a file named stuff-i-want.txt with contents of
# Blank lines and comment lines are ignored.
# Examples similar to --path:
README.md
guides/
tools/releases

# An example that is like --path-glob:
glob:*.py

# An example that is like --path-regex:
regex:^.*/.*/[0-9]{4}-[0-9]{2}-[0-9]{2}.txt$

# An example of renaming a path
tools/==>scripts/

# An example of using a regex to rename a path
regex:(.*)/([^/]*)/([^/]*)\.text$==>\2/\1/\3.txt
then you could run
git filter-repo --paths-from-file stuff-i-want.txt
to get a repo containing only the toplevel README.md file, the guides/ and tools/releases/ directories, all python files, files whose name was of the form YYYY-MM-DD.txt at least two subdirectories deep, and would rename tools/ to scripts/ and rename files like foo/bar/baz.text to bar/foo/baz.txt. Note the special line prefixes of glob: and regex: and the special string ==> denoting renames.
Sometimes you have a way of easily generating all the files you want. For example, if you know that none of the currently tracked files have any newlines or special characters in them (see core.quotePath from git config --help) so that git ls-files would print all files literally one per line, and you knew that you wanted to keep only the files that are currently tracked (thus deleting from all commits in history any files that only appear on other branches or that only appear in older commits), then you could use a pair of commands such as
git ls-files >../paths-i-want.txt
git filter-repo --paths-from-file ../paths-i-want.txt
Similarly, you could use --paths-from-file to delete many files. For example, you could run git filter-repo --analyze to get reports, look in one such as .git/filter-repo/analysis/path-deleted-sizes.txt and copy all the filenames into a file such as /tmp/files-i-dont-want-anymore.txt and then run
git filter-repo --invert-paths --paths-from-file /tmp/files-i-dont-want-anymore.txt
to delete them all.
Directory based shortcuts¶
Let’s say you had a directory structure like the following:
module/
   foo.c
   bar.c
otherDir/
   blah.config
   stuff.txt
zebra.jpg
If you wanted just the module/ directory and you wanted it to become the new root so that your new directory structure looked like
foo.c
bar.c
then you could run:
git filter-repo --subdirectory-filter module/
If you wanted all the files from the original repo, but wanted to move everything under a subdirectory named my-module/, so that your new directory structure looked like
my-module/
   module/
      foo.c
      bar.c
   otherDir/
      blah.config
      stuff.txt
   zebra.jpg
then you would instead run
git filter-repo --to-subdirectory-filter my-module/
Content based filtering¶
If you want to filter out all files bigger than a certain size, you can use --strip-blobs-bigger-than with some size (K, M, and G suffixes are recognized), e.g.:
git filter-repo --strip-blobs-bigger-than 10M
If you want to strip out all files with specified git object ids (hashes), list the hashes in a file and run
git filter-repo --strip-blobs-with-ids FILE_WITH_GIT_BLOB_IDS
If you want to modify file contents, you can do so based on a list of expressions in a file, one per line. For example, with a file named expressions.txt containing
p455w0rd
foo==>bar
glob:*666*==>
regex:\bdriver\b==>pilot
literal:MM/DD/YYYY==>YYYY-MM-DD
regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
then running
git filter-repo --replace-text expressions.txt
will go through and replace p455w0rd with ***REMOVED***, foo with bar, any line containing 666 with a blank line, the word driver with pilot (but not if it has letters before or after; e.g. drivers will be unmodified), replace the exact text MM/DD/YYYY with YYYY-MM-DD and replace date strings of the form MM/DD/YYYY with ones of the form YYYY-MM-DD. In the expressions file, there are a few things to note:
See also the --blob-callback from the section called “CALLBACKS”.
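The line format of the expressions file can be read as "optional prefix, pattern, optional ==> replacement". The following sketch splits one line into those parts; it is an illustration of the format as described in this section, not filter-repo's parser, and the function name parse_expression is hypothetical. Treating a line without a prefix as literal, and splitting on the first ==>, are assumptions.

```python
def parse_expression(line, default=b"***REMOVED***"):
    """Split one expressions-file line into (kind, pattern, replacement)."""
    kind = b"literal"  # assumed default when no prefix is given
    for prefix in (b"literal:", b"glob:", b"regex:"):
        if line.startswith(prefix):
            kind = prefix[:-1]
            line = line[len(prefix):]
            break
    if b"==>" in line:
        pattern, replacement = line.split(b"==>", 1)
    else:
        # No explicit replacement: fall back to the default marker text.
        pattern, replacement = line, default
    return kind, pattern, replacement


print(parse_expression(b"p455w0rd"))       # (b'literal', b'p455w0rd', b'***REMOVED***')
print(parse_expression(b"glob:*666*==>"))  # (b'glob', b'*666*', b'')
```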
Updating commit/tag messages¶
If you want to modify commit or tag messages, you can do so with the same syntax as --replace-text, explained above. For example, with a file named expressions.txt containing
foo==>bar
then running
git filter-repo --replace-message expressions.txt
will replace foo in commit or tag messages with bar.
See also the --message-callback from the section called “CALLBACKS”.
Refname based filtering¶
To rename tags, use --tag-rename, e.g.:
git filter-repo --tag-rename foo:bar
This will rename any tags starting with foo to now start with bar. Either side of the colon could be blank, e.g.
git filter-repo --tag-rename '':'my-module-'
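The prefix semantics can be sketched on fully qualified refnames. This is an illustrative model only; the function rename_tag is hypothetical and the sample tag names are made up.

```python
def rename_tag(refname, old, new):
    """Replace the leading `old` of a tag name with `new`; leave other refs alone."""
    prefix = b"refs/tags/" + old
    if refname.startswith(prefix):
        return b"refs/tags/" + new + refname[len(prefix):]
    return refname


print(rename_tag(b"refs/tags/foo-1.0", b"foo", b"bar"))     # b'refs/tags/bar-1.0'
print(rename_tag(b"refs/tags/v1.0", b"", b"my-module-"))    # b'refs/tags/my-module-v1.0'
print(rename_tag(b"refs/heads/foo", b"foo", b"bar"))        # b'refs/heads/foo'
```

With a blank left side, as in the second call, every tag gains the new prefix.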
For more general refname modification, see --refname-callback from the section called “CALLBACKS”.
User and email based filtering¶
To modify username and emails of commits, you can create a mailmap file in the format accepted by git-shortlog(1). For example, if you have a file named my-mailmap you can run
git filter-repo --mailmap my-mailmap
and if the current contents of that file are as follows (if the specified mailmap file is version controlled, historical versions of the file are ignored):
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
then we can update username and/or emails based on the specified mapping.
See also the --name-callback and --email-callback from the section called “CALLBACKS”.
Parent rewriting¶
To replace $commit_A with $commit_B (e.g. make all commits which had $commit_A as a parent instead have $commit_B for that parent), and rewrite history to make it permanent:
git replace $commit_A $commit_B
git filter-repo --proceed
To create a new commit with the same contents as $commit_A except with different parent(s) and then replace $commit_A with the new commit, and rewrite history to make it permanent:
git replace --graft $commit_A $new_parent_or_parents
git filter-repo --proceed
The --proceed option is needed to avoid failing the "no arguments specified" check. Note that older versions of git-filter-repo required --force to be passed after creating a graft to avoid triggering the not-a-fresh-clone check; that check has been modified to remove this overuse of --force.
Partial history rewrites¶
To rewrite the history on just one branch (which may cause it to no longer share any common history with other branches), use --refs. For example, to remove a file named extraneous.txt from the master branch:
git filter-repo --invert-paths --path extraneous.txt --refs master
To rewrite just some recent commits:
git filter-repo --invert-paths --path extraneous.txt --refs master~3..master
CALLBACKS¶
For flexibility, filter-repo allows you to specify functions on the command line to further filter all changes. Please note that there are some API compatibility caveats associated with these callbacks that you should be aware of before using them; see the "API BACKWARD COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source code.
Most callback functions are of the same general format (--file-info-callback is an exception which will be noted later). For a command line argument like
--foo-callback 'BODY'
the following code will be compiled and called:
def foo_callback(foo):
  BODY
Thus, you just need to make sure your BODY modifies and returns foo appropriately. One important thing to note for all callbacks is that filter-repo uses bytestrings (see https://docs.python.org/3/library/stdtypes.html#bytes) everywhere instead of strings.
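The wrapping of BODY into a function can be imitated in a few lines, which is handy for trying a body out before running a rewrite. This is a simplified model and not filter-repo's actual code; the helper make_callback is hypothetical, and bodies that use modules such as re or os would need those names added to the namespace.

```python
import textwrap


def make_callback(kind, body):
    """Wrap BODY in `def <kind>_callback(<kind>):` and return the function."""
    indented = textwrap.indent(textwrap.dedent(body), "    ")
    source = f"def {kind}_callback({kind}):\n{indented}\n"
    namespace = {}
    exec(source, namespace)  # compile the generated function definition
    return namespace[f"{kind}_callback"]


name_callback = make_callback("name", 'return name.replace(b"Wiliam", b"William")')
print(name_callback(b"Wiliam Smith"))  # b'William Smith'
```

Note that both the argument and the return value are bytestrings, as required of all callbacks.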
There are four callbacks that allow you to operate directly on raw objects that contain data that’s easy to write in git-fast-import(1) format:
--blob-callback
--commit-callback
--tag-callback
--reset-callback
We’ll come back to these later because it is often the case that the other callbacks are more convenient. The other callbacks operate on a small piece of the raw objects or operate on pieces across multiple types of raw object (e.g. author names and committer names and tagger names across commits and tags, or refnames across commits, tags, and resets, or messages across commits and tags). The convenience callbacks are:
--filename-callback
--message-callback
--name-callback
--email-callback
--refname-callback
--file-info-callback
in each you are expected to simply return a new value based on the one passed in. For example,
git-filter-repo --name-callback 'return name.replace(b"Wiliam", b"William")'
would result in the following function being called:
def name_callback(name):
  return name.replace(b"Wiliam", b"William")
The email callback is quite similar:
git-filter-repo --email-callback 'return email.replace(b".cm", b".com")'
The refname callback is also similar, but note that the refname passed in and returned are expected to be fully qualified (e.g. b"refs/heads/master" instead of just b"master" and b"refs/tags/v1.0.7" instead of b"1.0.7"):
git-filter-repo --refname-callback '
# Change e.g. refs/heads/master to refs/heads/prefix-master
rdir,rpath = os.path.split(refname)
return rdir + b"/prefix-" + rpath'
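A callback body can be tried out as an ordinary Python function before running it against a repository. The sketch below wraps the refname body from the example above in a plain function (the wrapper and the sample refnames are for illustration only) and shows that the prefix is attached to the last path component of the fully qualified refname.

```python
import os


def refname_callback(refname):
    # Same body as the --refname-callback example above.
    rdir, rpath = os.path.split(refname)
    return rdir + b"/prefix-" + rpath


renamed_branch = refname_callback(b"refs/heads/master")
renamed_tag = refname_callback(b"refs/tags/v1.0.7")
# For a branch name containing a slash, only the final component is prefixed.
renamed_nested = refname_callback(b"refs/heads/feature/login")
```

Here renamed_branch is b"refs/heads/prefix-master", renamed_tag is b"refs/tags/prefix-v1.0.7", and renamed_nested is b"refs/heads/feature/prefix-login".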
The message callback is quite similar to the previous three callbacks, though it operates on a bytestring that is likely more than one line:
git-filter-repo --message-callback '
    if b"Signed-off-by:" not in message:
      message += b"\nSigned-off-by: Me My <self@and.eye>"
    return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)'
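As a quick check of what this body does to a multi-line message, the same statements can be run as a plain function. This is only a sketch; the sample commit message is made up for illustration.

```python
import re


def message_callback(message):
    # Same body as the --message-callback example above.
    if b"Signed-off-by:" not in message:
        message += b"\nSigned-off-by: Me My <self@and.eye>"
    return re.sub(b"[Ee]-?[Mm][Aa][Ii][Ll]", b"email", message)


# Hypothetical commit message without a sign-off trailer.
rewritten = message_callback(b"Fix E-Mail parsing\n")

# A message that already has a trailer only gets the spelling normalized.
unchanged_trailer = message_callback(
    b"Tidy Email code\n\nSigned-off-by: A <a@b.c>")
```

The first call appends the trailer and normalizes "E-Mail" to "email"; the second leaves the existing trailer alone.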
The filename callback is slightly more interesting. Returning None means the file should be removed from all commits, returning the filename unmodified marks the file to be kept, and returning a different name means the file should be renamed. An example:
git-filter-repo --filename-callback '
    if b"/src/" in filename:
      # Remove all files with a directory named "src" in their path
      # (except when "src" appears at the toplevel).
      return None
    elif filename.startswith(b"tools/"):
      # Rename tools/ -> scripts/misc/
      return b"scripts/misc/" + filename[6:]
    else:
      # Keep the filename and do not rename it
      return filename
    '
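The three possible outcomes (remove, rename, keep) can be seen by running the same body as a plain function over a few sample paths. The paths below are hypothetical and chosen only to exercise each branch, including the toplevel "src" exception mentioned in the comment.

```python
def filename_callback(filename):
    # Same body as the --filename-callback example above.
    if b"/src/" in filename:
        return None
    elif filename.startswith(b"tools/"):
        return b"scripts/misc/" + filename[6:]
    else:
        return filename


# Hypothetical sample paths, one per interesting case.
samples = [b"lib/src/main.c", b"src/main.c", b"tools/build.sh", b"README.md"]
results = {path: filename_callback(path) for path in samples}
```

In results, b"lib/src/main.c" maps to None (removed), b"src/main.c" is kept because a toplevel "src" has no leading slash, b"tools/build.sh" becomes b"scripts/misc/build.sh", and b"README.md" is returned unchanged.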
The file-info callback is more involved. It is designed to be used in cases where filtering depends on both filename and contents (and maybe mode). It is called for file changes other than deletions (since deletions have no file contents to operate on). The file-info callback takes four parameters (filename, mode, blob_id, and value), and expects three to be returned (filename, mode, blob_id). The filename is handled similarly to the filename callback; it can be used to rename the file (or set to None to drop the change). The mode is a simple bytestring (b"100644" for regular non-executable files, b"100755" for executable files/scripts, b"120000" for symlinks, and b"160000" for submodules). The blob_id is most useful in conjunction with the value parameter. The value parameter is an instance of a class that has the following functions:
    value.get_contents_by_identifier(blob_id) → contents (bytestring)
    value.get_size_by_identifier(blob_id) → size_of_blob (int)
    value.insert_file_with_contents(contents) → blob_id
    value.is_binary(contents) → bool
    value.apply_replace_text(contents) → new_contents (bytestring)
and has the following member data you can write to:
    value.data (dict)
These functions allow you to get the contents of the file, or its size, create a new file in the stream whose blob_id you can return, check whether some given contents are binary (using the heuristic from the grep(1) command), and apply the replacement rules from --replace-text (note that when --file-info-callback is used, the changes from --replace-text are not applied automatically). You could use this, for example, to apply the changes from --replace-text only to certain file types and simultaneously rename the files it applies the changes to:
git-filter-repo --file-info-callback '
    if not filename.endswith(b".config"):
      # Make no changes to the file; return as-is
      return (filename, mode, blob_id)
    new_filename = filename[0:-7] + b".cfg"
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    return (new_filename, mode, new_blob_id)
    '
Note that if history has multiple revisions with the same file (e.g. it was cherry-picked to multiple branches or there were a number of reverts), then the --file-info-callback will be called multiple times. If you want to avoid processing the same file multiple times, then you can stash transformation results in the value.data dict. For example, we could modify the above example to make it only apply transformations on blob_ids we have not seen before:
git-filter-repo --file-info-callback '
    if not filename.endswith(b".config"):
      # Make no changes to the file; return as-is
      return (filename, mode, blob_id)
    new_filename = filename[0:-7] + b".cfg"
    if blob_id in value.data:
      return (new_filename, mode, value.data[blob_id])
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    value.data[blob_id] = new_blob_id
    return (new_filename, mode, new_blob_id)
    '
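The effect of caching in value.data can be checked without a repository by passing a stand-in object for value. The FakeValue class below is purely hypothetical: it imitates only the three members this body uses, invents its own blob ids, and hard-codes one replacement rule in place of whatever --replace-text would supply.

```python
class FakeValue:
    """Hypothetical in-memory stand-in for the `value` argument."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)   # blob_id -> contents
        self.data = {}             # scratch dict, as described above
        self.inserts = 0           # how many new blobs were created

    def get_contents_by_identifier(self, blob_id):
        return self.blobs[blob_id]

    def apply_replace_text(self, contents):
        # Stand-in for the --replace-text rules: one hard-coded rule.
        return contents.replace(b"hunter2", b"***REMOVED***")

    def insert_file_with_contents(self, contents):
        self.inserts += 1
        new_id = b"new-%d" % self.inserts   # made-up id format
        self.blobs[new_id] = contents
        return new_id


def file_info_callback(filename, mode, blob_id, value):
    # Same body as the caching --file-info-callback example above.
    if not filename.endswith(b".config"):
        return (filename, mode, blob_id)
    new_filename = filename[0:-7] + b".cfg"
    if blob_id in value.data:
        return (new_filename, mode, value.data[blob_id])
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = value.apply_replace_text(contents)
    new_blob_id = value.insert_file_with_contents(new_contents)
    value.data[blob_id] = new_blob_id
    return (new_filename, mode, new_blob_id)


value = FakeValue({b"id1": b"password=hunter2\n"})
first = file_info_callback(b"app.config", b"100644", b"id1", value)
second = file_info_callback(b"copy/app.config", b"100644", b"id1", value)
other = file_info_callback(b"README", b"100644", b"id1", value)
```

Both .config paths end up pointing at the same rewritten blob, and value.inserts stays at 1 because the second call is answered from value.data.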
An alternative example for the --file-info-callback is to make all .sh files executable and add an extra trailing newline to the .sh files:
git-filter-repo --file-info-callback '
    if not filename.endswith(b".sh"):
      # Make no changes to the file; return as-is
      return (filename, mode, blob_id)
    # There are only 4 valid modes in git:
    #   - 100644, for regular non-executable files
    #   - 100755, for executable files/scripts
    #   - 120000, for symlinks
    #   - 160000, for submodules
    new_mode = b"100755"
    contents = value.get_contents_by_identifier(blob_id)
    new_contents = contents + b"\n"
    new_blob_id = value.insert_file_with_contents(new_contents)
    return (filename, new_mode, new_blob_id)
    '
In contrast to the previous callback types, the blob, reset, tag, and commit callbacks are not expected to return a value, but are instead expected to modify the object passed in. Major fields for these objects are (subject to API backward compatibility caveats mentioned previously):
An example of each:
git filter-repo --blob-callback '
    if len(blob.data) > 25:
      # Mark this blob for removal from all commits
      blob.skip()
    else:
      blob.data = blob.data.replace(b"Hello", b"Goodbye")
    '
git filter-repo --reset-callback 'reset.ref = reset.ref.replace(b"master", b"dev")'
git filter-repo --tag-callback '
    if tag.tagger_name == b"Jim Williams":
      # Omit this tag
      tag.skip()
    else:
      tag.message = tag.message + b"\n\nTag of %s by %s on %s" % (tag.ref, tag.tagger_email, tag.tagger_date)'
git filter-repo --commit-callback '
    # Remove executable files with three 6s in their name (including
    # from leading directories).
    # Also, undo deletion of sources/foo/bar.txt (change types are
    # either b"D" (deletion) or b"M" (add or modify); renames are
    # handled by deleting the old file and adding a new one)
    commit.file_changes = [
      change for change in commit.file_changes
      if not (change.mode == b"100755" and
              change.filename.count(b"6") == 3) and
         not (change.type == b"D" and
              change.filename == b"sources/foo/bar.txt")]
    # Mark all .sh files as executable; modes in git are always one of
    # 100644 (normal file), 100755 (executable), 120000 (symlink), or
    # 160000 (submodule)
    for change in commit.file_changes:
      if change.filename.endswith(b".sh"):
        change.mode = b"100755"
    '
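Unlike the convenience callbacks, this body returns nothing and works by mutating the commit it is given. The sketch below exercises the same statements against stand-in objects built with types.SimpleNamespace; these stand-ins are an assumption for illustration and carry only the attributes the body touches (type, filename, mode, and file_changes), not the real classes.

```python
from types import SimpleNamespace


def commit_callback(commit):
    # Same body as the --commit-callback example above; modifies `commit`
    # in place and returns nothing.
    commit.file_changes = [
        change for change in commit.file_changes
        if not (change.mode == b"100755" and
                change.filename.count(b"6") == 3) and
           not (change.type == b"D" and
                change.filename == b"sources/foo/bar.txt")]
    for change in commit.file_changes:
        if change.filename.endswith(b".sh"):
            change.mode = b"100755"


def make_change(change_type, filename, mode=None):
    """Build a hypothetical stand-in for one file change."""
    return SimpleNamespace(type=change_type, filename=filename, mode=mode)


commit = SimpleNamespace(file_changes=[
    make_change(b"M", b"a6/b6/c6.bin", b"100755"),   # executable, three 6s
    make_change(b"D", b"sources/foo/bar.txt"),        # deletion to undo
    make_change(b"M", b"run.sh", b"100644"),          # becomes executable
    make_change(b"M", b"README", b"100644"),          # untouched
])
returned = commit_callback(commit)
kept = [(c.filename, c.mode) for c in commit.file_changes]
```

After the call, kept holds only run.sh (now with mode b"100755") and README, and returned is None.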
INTERNALS¶
You probably don’t need to read this section unless you are just very curious or you are trying to do a very complex history rewrite.
How filter-repo works¶
Roughly, filter-repo works by running
git fast-export <options> | filter | git fast-import <options>
where filter-repo not only launches the whole pipeline but also serves as the filter in the middle. However, filter-repo does a few additional things on top in order to make it into a well-rounded filtering tool. A sequence that more accurately reflects what filter-repo runs is:
Some notes or exceptions on each of the above:
Information about these steps is printed out when --debug is passed to filter-repo. When doing a --partial history rewrite, steps 2, 3, 7, and 8 are unconditionally skipped, step 5 is skipped if --replace-refs is update-no-add, and just the nuke-unused-refs portion of step 5 is skipped if --replace-refs is something else.
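The basic three-stage pipeline described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the export | filter | import shape, not filter-repo's actual code: the function names, the separate source_repo and target_repo arguments, and the choice of the --all and --quiet options are assumptions, and the filter shown simply copies the stream through unchanged.

```python
import subprocess


def identity_filter(stream_in, stream_out, chunk_size=65536):
    """Copy the fast-export stream unchanged.

    A real filter would parse the stream and rewrite it here.
    """
    while True:
        chunk = stream_in.read(chunk_size)
        if not chunk:
            break
        stream_out.write(chunk)


def run_pipeline(source_repo, target_repo, stream_filter=identity_filter):
    """Run `git fast-export | stream_filter | git fast-import`.

    source_repo and target_repo are hypothetical paths to existing
    repositories; the options used are one plausible choice.
    """
    exporter = subprocess.Popen(
        ["git", "-C", source_repo, "fast-export", "--all"],
        stdout=subprocess.PIPE)
    importer = subprocess.Popen(
        ["git", "-C", target_repo, "fast-import", "--quiet"],
        stdin=subprocess.PIPE)
    stream_filter(exporter.stdout, importer.stdin)
    # Closing stdin tells fast-import that the stream is complete.
    importer.stdin.close()
    exporter.stdout.close()
    export_status = exporter.wait()
    import_status = importer.wait()
    if export_status != 0 or import_status != 0:
        raise RuntimeError("git fast-export or git fast-import failed")
```

Writing into a separate, freshly initialized target repository keeps such an experiment from touching the original history.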
Limitations¶
Inherited limitations
Since git filter-repo calls fast-export and fast-import to do a lot of the heavy lifting, it inherits limitations from those systems:
There are also some limitations due to the design of these systems:
Intrinsic limitations
Some types of filtering have limitations that would affect any tool attempting to perform them; the most any tool can do is attempt to notify the user when it detects an issue:
Issues specific to filter-repo
Comments on reversibility
Some people are interested in reversibility of a rewrite; e.g. rewrite history, possibly add some commits, then unrewrite and get the original history back plus a few new "unrewritten" commits. Obviously this is impossible if your rewrite involves throwing away information (e.g. filtering out files or replacing several different strings with ***REMOVED***), but may be possible with some rewrites. filter-repo is likely to be a poor fit for this type of workflow for a few reasons:
SEE ALSO¶
GIT¶
Part of the git(1) suite
NOTES¶
 1. GitLab’s docs on reducing repository size
 2. the "Fully removing the data from GitHub" section of GitHub’s docs
12/27/2024 | Git 2.47.0-2 |