Package is work in progress! If you encounter errors / problems, please file an issue or make a PR.
This package parses a git repository history to collect comprehensive information about the activity in the repo. The parsed data is made available to the user in a tabular format. The package can also generate reports based on the parse data.
There are two main functions for parsing the history, both return tabular data:
parse_log_simple()
is a relatively fast parser and returns a tibble with one commit per row. There is no file-specific information.parse_log_detailed()
outputs a nested tibble and for each commit, the names of the amended files, number of lines changed ect. available. This function is slower.report_git()
creates a html, pdf, or word report with the parsed log data according to a template. Templates can be created by the user or a template from the gitsum
package can be used.
Let’s see the package in action.
library("gitsum")
library("tidyverse")
library("forcats")
We can obtain a parsed log like this:
tbl <- parse_log_detailed() %>%
arrange(date) %>%
select(short_hash, short_message, total_files_changed, nested)
tbl
#> # A tibble: 101 x 4
#> short_hash short_message total_files_changed nested
#> <chr> <chr> <dbl> <list>
#> 1 243f initial commit 7 <tibble [7 x 5]>
#> 2 f8ee add log example data 1 <tibble [1 x 5]>
#> 3 6328 add parents 3 <tibble [3 x 5]>
#> 4 dfab intermediate 1 <tibble [1 x 5]>
#> 5 7825 add licence 1 <tibble [1 x 5]>
#> 6 2ac3 add readme 2 <tibble [2 x 5]>
#> 7 7a2a document log data 1 <tibble [1 x 5]>
#> 8 943c add helpfiles 10 <tibble [10 x 5]>
#> 9 917e update infrastructur 3 <tibble [3 x 5]>
#> 10 4fc0 remove garbage 6 <tibble [6 x 5]>
#> # ... with 91 more rows
Since we used parse_log_detailed()
, there is detailed file-specific information available for every commit:
tbl$nested[[3]]
#> # A tibble: 3 x 5
#> changed_file edits insertions deletions is_exact
#> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 DESCRIPTION 6 5 1 TRUE
#> 2 NAMESPACE 3 2 1 TRUE
#> 3 R/get_log.R 19 11 8 TRUE
Since the data has such a high resolution, various graphs, tables etc. can be produced from it to provide insights into the git history.
Since the output of git_log_detailed()
is a nested tibble, you can work on it as you work on any other tibble. Let us first have a look at who comitted to this repository:
log <- parse_log_detailed()
log %>%
group_by(author_name) %>%
summarize(n = n())
#> # A tibble: 3 x 2
#> author_name n
#> <chr> <int>
#> 1 Jon Calder 2
#> 2 jonmcalder 6
#> 3 Lorenz Walthert 93
We can also investigate how the number of lines of each file in the R directory evolved.
lines <- log %>%
unnest() %>%
add_line_history()
r_files <- grep("^R/", lines$changed_file, value = TRUE)
to_plot <- lines %>%
filter(changed_file %in% r_files)
ggplot(to_plot, aes(x = date, y = current_lines)) +
geom_step() +
scale_y_continuous(name = "Number of Lines", limits = c(0, NA)) +
facet_wrap(~changed_file, scales = "free_y")
Next, we want to see which files were contained in most commits:
log %>%
unnest(nested) %>% # unnest the tibble
mutate(changed_file = fct_lump(fct_infreq(changed_file), n = 10)) %>%
filter(changed_file != "Other") %>%
ggplot(aes(x = changed_file)) + geom_bar() + coord_flip() +
theme_minimal()
We can also easily get a visual overview of the number of insertions & deletions in commits over time:
commit.dat <- data.frame(
edits = rep(c("Insertions", "Deletions"), each = nrow(log)),
commit = rep(1:nrow(log), 2),
count = c(log$total_insertions, -log$total_deletions))
ggplot(commit.dat, aes(x = commit, y = count, fill = edits)) +
geom_bar(stat = "identity", position = "identity") +
theme_minimal()
Or the number of commits broken down by day of the week:
log %>%
mutate(weekday = factor(weekday, c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))) %>%
ggplot(aes(x = weekday)) + geom_bar() +
theme_minimal()