# Building software together with Git

As developers, we tend to forget that we're using obscure jargon that makes us look like occult wizards. There may be some kind of elitism behind that, of course, but I think it also happens because software development is really a strange field, with its own problems and solutions. Additionally, the obscure terminology often describes some very specific and complex concept that would be difficult to express otherwise.

There are not many resources out there that introduce Git, GitHub, commits, branching, versioning, forks, pull requests and other development-related jargon to a non-technical audience. This post is an attempt to help with that.

Other resources you may want to check:

- [Git Novice](http://swcarpentry.github.io/git-novice/): a step-by-step, beginner-friendly guide explaining how to use Git

## How do we collaborate on software?

Like most projects, software projects work best when multiple people can work on different tasks in parallel. In a typical organization, you expect accountants, managers, secretaries, salespeople, and in fact everyone to work on their own tasks, seamlessly, at the same time.

You want, as much as possible, to avoid situations where someone needs to wait for someone else before proceeding. We call those situations bottlenecks. An example of a bottleneck would be having a single phone in a 100-person company: everyone would have to wait to make a call, which would be a waste of time.

Software development works the same way: developers, designers, translators and pretty much everyone else want to work without worrying about each other's tasks, especially as the project grows and attracts dozens or even hundreds of contributors.

In order to deal with that, people involved in software development usually rely on a few tools and processes that I will describe below.

## What is software?

What we call software, in its most common form, is just a set of text files, also known as a codebase.
Those text files contain instructions that can be executed by computers. Yes, the act of programming is just about writing stuff. Of course, as a programmer, you have to think about *what* you are writing, like a storywriter ;)

If you've ever worked on a thesis, or any long-form essay, you have faced a lot of the issues developers encounter when they need to collaborate on the same piece of software.

When you're in the process of writing an essay, quite often, you'll need:

- **Reviewing**: you want someone else to read your work, and possibly add suggestions or edit it
- **Collaboration**: you want someone else to work on a section of the document, while you're working on another one
- **Versioning**: the ability to go back to a previous version of the document (e.g. because you deleted something by mistake)

You can implement review and collaboration by emailing copies of your working version to the other people involved, then regularly integrating their changes into your own copy via copy-pasting. For versioning, the "undo/redo" features of your text editor can help, as can copying your document onto a separate medium from time to time.

However, if you've worked with more than one or two people on the same document, you know that this is absolutely awful to manage, and very error-prone. Did you send the latest version to your friends? Have you integrated all their suggestions? How do you go back to yesterday's version of your work when your last backup was made last week?

Software development is *exactly* the same. But usually with more people involved ;)

## Introducing Git

Git, and associated tools like GitHub, are an attempt to solve the issues I described in the previous section. Git allows people to contribute to a single codebase in a sane and efficient way. However, to do so, it completely rethinks the way we collaborate and introduces new concepts.
All of this sounds like jargon to newcomers and is frankly overwhelming, so I'll try to demystify it a bit.

### Commits and versioning

At its core, Git provides a mechanism to *version* a codebase. Each version is basically a snapshot of the codebase, associated with the date the snapshot was taken. This gives you versioning, because you can go back in time to any previous snapshot. Those snapshots are named *commits*.

However, keeping a full copy of the project for each commit would require a lot of space. Git is a bit smarter than that: a good mental model is that each commit only records the differences (called *diffs*) from the previous one.

Let's take a concrete example:

1. Alice starts her software project, creates a text file with 10 lines, and makes the first commit.
2. 3 days later, she makes a change on line 7 and makes another commit. The commit will store the fact that line 7 was edited.
3. 5 days later, she deletes line 3 and creates another commit. The commit will only store the fact that a line was deleted.

All those commits create a log, or history, of what happened in the project:

- Day 1: Alice added 10 lines
- Day 4: Alice edited line 7
- Day 9: Alice removed line 3

If we want to go back to day 1, we can tell Git to undo the changes from day 9 and day 4, in that order, and we'll obtain the codebase as it was on the first day. Then we can replay the next commit, the one from day 4, to move to the next version of the project, then replay the commit from day 9 to obtain our latest version.

You may remember I mentioned three desirable features: versioning, collaboration and reviewing. **Commits bring us versioning**, and also auditability (who did what, and when), which is a nice bonus.

## Branches and parallelization

On day 10, Alice decides she wants to experiment with something new, but she's not 100% sure that it will work. To start hacking on her idea, she creates what Git calls a *branch*.

You can think of a branch as a road splitting off from another one.
Eventually, two roads can rejoin, but that's not mandatory.

In Git, all commits happen on a branch, the default one usually being named `master` (or `main` in many newer projects). So if we represent the current state of the project with that in mind, this is what it could look like:

```
| master branch
|
* Commit from day 1: Alice added 10 lines
|
* Commit from day 4: Alice edited line 7
|
* Commit from day 9: Alice removed line 3
```

So, Alice starts this new branch, named `experiment`, from the `master` branch. The project now looks like this:

```
| master branch
|
* Commit from day 1: Alice added 10 lines
|
* Commit from day 4: Alice edited line 7
|
* Commit from day 9: Alice removed line 3
|
|\
| \
|  | experiment branch
```

The `master` branch still exists, on the left, but Alice is now working on the `experiment` branch, on the right. She's very productive, and makes a few commits on that branch:

```
| master branch
|
* Commit from day 1: Alice added 10 lines
|
* Commit from day 4: Alice edited line 7
|
* Commit from day 9: Alice removed line 3
|
|\
| \
|  | experiment branch
|  |
|  * Commit from day 11: Alice added 10 new lines
|  |
|  * Commit from day 13: Alice edited lines 5 to 9
```

Because she's satisfied with the changes, she decides to *merge* the `experiment` branch into the `master` branch. This is Git's way of applying changes from one branch to another.

Remember the road analogy I used earlier?
This is how the merge would look:

```
| Main road (master branch)
|
|\ Roads are splitting
| \
|  | Secondary road (experiment branch)
|  |
|  |
|  |
|  |
| /
|/ Roads are joining
|
| The main road remains
```

Once the merge is done, the `experiment` branch can be deleted, since all its commits are now present on the `master` branch:

```
| master branch
|
* Commit from day 1: Alice added 10 lines
|
* Commit from day 4: Alice edited line 7
|
* Commit from day 9: Alice removed line 3
|
* Commit from day 11: Alice added 10 new lines (from experiment branch)
|
* Commit from day 13: Alice edited lines 5 to 9 (from experiment branch)
```

If, for any reason, Alice hadn't been satisfied with her experiment, she could have deleted the branch without merging it, and the `master` branch would have remained unaffected.

Branches are a powerful but also hard-to-grasp concept in Git. They are useful for experimenting without risk, but also for enabling collaboration, as we'll see in the next section.

## Repositories and collaboration

In the previous scenario, Alice was alone. But on day 14, her friend Bob wants to help her with this new project. How can he achieve that with Git?

When Alice started to work on the project, she was using her local copy, which we call a *repository*. You can think of a repository as a workspace belonging to someone (Alice, in this case).

Since Bob wants to start contributing, he will need his own repository. One way to do that is for Alice to *push* her repository to a platform like GitHub or GitLab, have Bob create an account there, and use the *fork* button. Forking essentially means "creating a copy of someone else's repository".

Git and GitHub have similar names, but are different beasts. If I go back to my document analogy, Git is similar to a text editor, like Word or LibreOffice Writer. It's a tool you install on your computer to work on your documents.
GitHub (and GitLab, Bitbucket and related platforms) is more similar to Google Docs: it's a web service that hosts your documents, and makes them browsable and editable by others. You don't *have to* use GitHub if you're using Git, but the two tend to be used together and serve different purposes.

When Bob forks Alice's repository on GitHub, he ends up with an exact copy of her repository. It's Git's equivalent of "sending your thesis by email to a friend".

So, Bob has a working repository, and starts adding some commits on the `master` branch:

```
| Bob's workspace / master branch
|
| (previous commits omitted)
|
* Commit from day 11: Alice added 10 new lines (from experiment branch)
|
* Commit from day 13: Alice edited lines 5 to 9 (from experiment branch)
|
* Commit from day 14: Bob edited line 8
|
* Commit from day 15: Bob deleted line 12
|
```

Bob added two commits, on days 14 and 15. He'd like these to be included in Alice's repository. One way to do that using platforms like GitHub or GitLab is to create a *pull request* (named *merge request* in GitLab, but those are the same thing). Pull requests are often abbreviated PRs.

Do you remember when Alice merged her `experiment` branch into her `master` branch in the previous section? A pull request is essentially asking someone to merge a branch from your repository into a branch of their repository.

So, Bob creates the pull request:

> Hello Alice!
>
> I'd like to merge the branch `master` from my repository into the `master` branch of your repository.
>
> I've added one commit that fixes a typo, and one commit that improves the performance.
>
> Let me know if you have any questions,
>
> Bob

When Alice receives that pull request, she'll be able to review Bob's commits, and decide whether she is fine with them. That's what we call a *code review*. During the code review, Alice will read the changes introduced by Bob's commits, suggest some changes, and when she's satisfied with the result, accept the pull request.
Accepting the pull request will merge Bob's `master` branch into the `master` branch in her repository:

```
| Alice's workspace / master branch
|
| (previous commits omitted)
|
* Commit from day 11: Alice added 10 new lines (from experiment branch)
|
* Commit from day 13: Alice edited lines 5 to 9 (from experiment branch)
|
* Commit from day 14: Bob edited line 8 (from bob/master branch)
|
* Commit from day 15: Bob deleted line 12 (from bob/master branch)
|
```

Of course, she could also have refused the pull request, in which case her `master` branch would have been left untouched.

Also note that Bob's branch and pull request don't affect Alice's ability to work on her own branch in the meantime: she can continue to add commits to her branch and have those preserved when Bob's branch is merged into hers.

Using branches, repositories and pull requests, Alice and Bob managed to collaborate on the same piece of software. How exciting!

## Bonus: Conflicts, issues and releases

If you've read up to here, things should be less scary for you. However, there are a few additional things you may want to know about software development and the way we usually collaborate while working on software.

### Conflicts

In Git's terminology, a *conflict* is a situation where different changes to the same lines of the same file are made in different branches, and you try to merge those branches.

Let's examine a typical Git conflict situation:

- Bob forks Alice's repository, as described in the previous section
- Bob notices a typo somewhere and adds a commit to the `master` branch in his repository to fix the typo. He forgets to open a pull request right away.
- At the same time, Alice notices the same typo, and adds a commit to the `master` branch in her repository to fix it, with a slightly different wording.
- A few days later, Bob remembers to open a pull request to merge his `master` branch into Alice's `master` branch
- Unfortunately, because Alice's `master` branch has a commit that changes the same line, Git cannot perform the merge automatically and will complain loudly about it: "Both Alice and you edited this line, which change should I keep?"

To sum it up, just like in real life, conflicts usually happen when different people work on the exact same thing at the same time. Two people offering the same book at a birthday party is the equivalent, in the physical world, of what Git calls a conflict.

When they occur, Git conflicts require manual, human resolution. Someone has to read and understand both conflicting changes, then usually pick the one that makes more sense.

Common ways to avoid conflicts in Git include:

- Merging your branches often. The longer you wait, the higher the chance that someone else commits changes that interfere with your own work.
- Coordinating with other contributors to split the work into small, independent tasks, and assigning those tasks to specific people. Issues, as outlined below, can help with that!

Conflicts don't always indicate a lack of coordination though, and they will likely occur from time to time in any project. However, minimizing conflict-prone situations is required to work efficiently in parallel.

### Issues

Issues are an important part of software development. You may have heard sentences like these already: "Please file an issue" or "Please open a bug in our issue tracker".

But what is an issue?

Issues, also known as tickets, bug reports or feature requests, are messages posted in a project's issue tracker.
Developers, contributors and software users usually open issues to:

- Keep track of a new bug in the software
- Suggest an improvement or a new feature
- Ask a question about the software's behaviour

Other people can usually comment on issues, discuss possible solutions and pitfalls, provide workarounds, etc.

When development is needed to address the issue or the feature request, a developer will usually create a branch, work on a fix, then submit a pull request with the changes. Once this pull request is accepted, the related issue is usually closed.

To sum it up, this is the typical lifecycle of an issue:

1. Bob encounters a bug in the software
2. Bob opens an issue describing the bug
3. Maria, who is facing the same bug, adds a comment on the issue, and describes a possible solution
4. Alice decides to work on the issue
5. She assigns the issue to herself, creates a branch, commits the changes that fix the issue, and opens a pull request with that branch
6. The pull request is merged into the `master` branch
7. Bob's issue is closed

Issues are *extremely* useful, because they constitute the memory of a project, and also give a lot of insight into future development, popular requests and common problems faced by a community.

On a daily basis, contributors working on a project tend to fix specific issues, which ensures that they work on different problems and helps them work in parallel without conflicts.

Do you remember the Git conflict that occurred when both Alice and Bob attempted to fix the same typo in the codebase? Had Bob created an issue when he first spotted the typo, Alice would have noticed it and they could have decided together who should submit a fix.

### Releases

Releases, usually marked with *tags* in Git, are the last missing piece of a typical software development process. Most projects tend to follow similar cycles:

1. Project maintainers or communities choose a set of issues they deem high-priority
2. Contributors fix those issues
3. Once all selected issues are fixed, a release is published
4. End users update to the new release
5. Back to 1.

A release is a version of a piece of software that is distributed widely and intended to improve or replace previous releases. Usually, releases are named using a specific pattern, like `version 1.2.3`, `version 1.2.4` and `version 1.3`.

## Conclusion

I hope you enjoyed this essay, and that the explanation gave you a clearer view of what's going on in software development.

It takes some effort to untangle everything and demonstrate the usefulness of all of this in a non-technical way. If you think I failed somewhere, or there is a missing piece, please let me know!
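
## Postscript: Alice's commit log as code

For readers who are curious about what "undoing" and "replaying" commits could look like in practice, here is a minimal Python sketch of the three commits Alice made on days 1, 4 and 9. It is a toy model of the "a commit records differences" picture used in this post, not how Git is actually implemented: the names `Commit`, `apply` and `undo`, as well as the placeholder line contents, are made up for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One operation on a single line: (line number starting at 1, old text, new text).
# old=None means the line is added; new=None means the line is removed.
# This representation is an assumption made for this sketch.
Op = Tuple[int, Optional[str], Optional[str]]


@dataclass
class Commit:
    day: int
    author: str
    message: str
    ops: List[Op]


def apply(lines: List[str], commit: Commit) -> List[str]:
    """Return a new list of lines with the commit's changes replayed."""
    result = list(lines)
    for line_no, old, new in commit.ops:
        index = line_no - 1
        if old is None:            # line added
            result.insert(index, new)
        elif new is None:          # line removed
            del result[index]
        else:                      # line edited
            result[index] = new
    return result


def undo(lines: List[str], commit: Commit) -> List[str]:
    """Return a new list of lines with the commit's changes reverted."""
    result = list(lines)
    for line_no, old, new in reversed(commit.ops):
        index = line_no - 1
        if old is None:            # the line had been added: remove it
            del result[index]
        elif new is None:          # the line had been removed: put it back
            result.insert(index, old)
        else:                      # the line had been edited: restore the old text
            result[index] = old
    return result


# Alice's history, with placeholder text for each line.
history = [
    Commit(1, "Alice", "added 10 lines", [(n, None, f"line {n}") for n in range(1, 11)]),
    Commit(4, "Alice", "edited line 7", [(7, "line 7", "line 7 (edited)")]),
    Commit(9, "Alice", "removed line 3", [(3, "line 3", None)]),
]

# Replay every commit, oldest first, to obtain the latest version.
latest: List[str] = []
for commit in history:
    latest = apply(latest, commit)

# Go back to day 1 by undoing day 9, then day 4, in that order.
day_1 = undo(undo(latest, history[2]), history[1])

# Print the log of what happened in the project.
for commit in history:
    print(f"Day {commit.day}: {commit.author} {commit.message}")
```

The order matters: `undo` has to walk the history backwards (day 9, then day 4), while `apply` walks it forwards, exactly as in the Alice example.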