
Deduplicate an orderly archive. Deduplicating an orderly archive replaces all files that have the same content with "hard links". This requires hard link support in the underlying operating system, which is available on all Unix-like systems (e.g. macOS and Linux) and on Windows since Vista. However, on Windows systems this might require somewhat elevated privileges. If you use this feature, it is very important that you treat your orderly archive as read-only (though you should be doing so anyway), because changing one copy of a linked file changes all the other instances of it: the files are literally the same file.

Usage

orderly_deduplicate(root = NULL, locate = TRUE, dry_run = TRUE, quiet = FALSE)

Arguments

root

The path to an orderly root directory, or NULL (the default) to search for one from the current working directory if locate is TRUE.

locate

Logical, indicating if the configuration should be searched for. If TRUE and root is not given, then orderly looks in the working directory and up through its parents until it finds an orderly_config.yml file.

dry_run

Logical, indicating if the deduplication should be planned but not run.

quiet

Logical, indicating if the status should not be printed.

Value

Invisibly, information about the duplication status of the archive before deduplication was run.

Details

This function will alter your orderly archive. Ordinarily this is not something that should be done, so we try to be careful. For this to work safely, it is very important to treat your orderly archive as read-only. If your canonical orderly archive is behind OrderlyWeb this will almost certainly be the case already.

With "hard linking", two files with the same content can be updated so that both files point at the same physical bit of data. This is useful because, if the file is large, only one copy needs to be stored. However, it also means that a change made to one copy of the file is immediately reflected in the other, and nothing indicates that the files are linked!
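The behaviour described above can be seen with base R alone. The following is a minimal sketch that does not use orderly; the file names are made up for illustration, and it assumes the temporary directory sits on a filesystem that permits hard links.

```r
# Sketch only: demonstrate hard-link behaviour with base R.
# The paths below are throwaway files, not part of an orderly archive.
dir <- tempfile("hardlink-demo-")
dir.create(dir)
a <- file.path(dir, "a.txt")
b <- file.path(dir, "b.txt")

writeLines("original content", a)

# file.link() creates a hard link from `a` to the new name `b`.
# It returns FALSE (with a warning) if the filesystem or the
# current privileges do not allow hard links.
linked <- file.link(a, b)

if (linked) {
  # Overwrite the file through one name...
  writeLines("modified content", a)
  # ...and read it back through the other: both names refer to the
  # same data, so `b` reflects the change made via `a`.
  print(readLines(b))
}
```

This is why a deduplicated archive should be treated as read-only: an edit made through one path silently changes every other path that shares the same data.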

This approach is worth exploring if you have large files that are outputs of one report and inputs to another, or large inputs repeatedly used in different reports, or outputs that end up being the same in multiple reports. If you run the deduplication with dry_run = TRUE, an indication of the savings will be printed.
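To get a rough feel for how much duplicated content a directory holds, independently of orderly, one can group files by a content hash. The sketch below is an assumption-laden illustration, not orderly's own implementation: `estimate_duplicates` is a hypothetical helper, it uses MD5 from the base `tools` package, and it does not distinguish files that are already hard linked from genuine copies.

```r
# Hypothetical helper, not part of orderly: summarise duplicated
# content under a directory by grouping files on their MD5 hash.
estimate_duplicates <- function(path) {
  files <- list.files(path, recursive = TRUE, full.names = TRUE)
  if (length(files) == 0L) {
    return(list(n_files = 0L, n_duplicates = 0L, duplicated_bytes = 0))
  }
  hash <- unname(tools::md5sum(files))
  size <- file.size(files)
  # duplicated() flags every file after the first with a given hash;
  # those are the copies that could, in principle, become hard links.
  is_dup <- duplicated(hash)
  list(n_files = length(files),
       n_duplicates = sum(is_dup),
       duplicated_bytes = sum(size[is_dup]))
}

# Usage on a throwaway directory: two identical files and one unique file.
demo <- tempfile("dedup-demo-")
dir.create(file.path(demo, "r1"), recursive = TRUE)
dir.create(file.path(demo, "r2"))
writeLines("same", file.path(demo, "r1", "data.csv"))
writeLines("same", file.path(demo, "r2", "data.csv"))
writeLines("different", file.path(demo, "r2", "other.csv"))

res <- estimate_duplicates(demo)
str(res)
```

For a real archive, running orderly_deduplicate with dry_run = TRUE remains the authoritative way to see the expected savings, since it uses orderly's own record of tracked files.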

Examples


path <- orderly::orderly_example("demo")
id1 <- orderly::orderly_run("minimal", root = path)
#> [ name       ]  minimal
#> [ id         ]  20230621-105027-839a8e16
#> [ start      ]  2023-06-21 10:50:27
#> [ data       ]  source => dat: 20 x 2
#> 
#> > png("mygraph.png")
#> 
#> > par(mar = c(15, 4, 0.5, 0.5))
#> 
#> > barplot(setNames(dat$number, dat$name), las = 2)
#> 
#> > dev.off()
#> agg_png 
#>       2 
#> [ end        ]  2023-06-21 10:50:27
#> [ elapsed    ]  Ran report in 0.01687074 secs
#> [ artefact   ]  mygraph.png: 175369b2bcf4115f343c8ad746c0c072
id2 <- orderly::orderly_run("minimal", root = path)
#> [ name       ]  minimal
#> [ id         ]  20230621-105027-90a5fe1d
#> [ start      ]  2023-06-21 10:50:27
#> [ data       ]  source => dat: 20 x 2
#> 
#> > png("mygraph.png")
#> 
#> > par(mar = c(15, 4, 0.5, 0.5))
#> 
#> > barplot(setNames(dat$number, dat$name), las = 2)
#> 
#> > dev.off()
#> agg_png 
#>       2 
#> [ end        ]  2023-06-21 10:50:27
#> [ elapsed    ]  Ran report in 0.01501155 secs
#> [ artefact   ]  mygraph.png: 175369b2bcf4115f343c8ad746c0c072
orderly::orderly_commit(id1, root = path)
#> [ commit     ]  minimal/20230621-105027-839a8e16
#> [ copy       ]
#> [ import     ]  minimal:20230621-105027-839a8e16
#> [ success    ]  :)
#> [1] "/tmp/RtmpGRuIRx/file471472a937/archive/minimal/20230621-105027-839a8e16"
orderly::orderly_commit(id2, root = path)
#> [ commit     ]  minimal/20230621-105027-90a5fe1d
#> [ copy       ]
#> [ import     ]  minimal:20230621-105027-90a5fe1d
#> [ success    ]  :)
#> [1] "/tmp/RtmpGRuIRx/file471472a937/archive/minimal/20230621-105027-90a5fe1d"
tryCatch(
  orderly::orderly_deduplicate(path, dry_run = TRUE),
  error = function(e) NULL)
#> Deduplication information for
#>   /tmp/RtmpGRuIRx/file471472a937/archive
#>   - 6 tracked files
#>   - 20.43 kB total size
#>   - 3 duplicate files
#>   - 10.22 kB duplicated size
#>   - 0 deduplicated files
#>   - 0 B deduplicated size
#>   - 0 untracked files
#>   - 0 B untracked size