Why does Git use a cryptographic hash function?

Why does Git use SHA-1, a cryptographic hash function, instead of a faster non-cryptographic hash function?

Related question:

The Stack Overflow question "Why does Git use SHA-1 as version numbers?" asks why Git uses SHA-1 rather than sequential numbers to identify commits.

Best answer

TLDR;


You can check that from Linus Torvalds himself, when he presented Git to Google back in 2007:
(emphasis mine)

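We check checksums that are considered cryptographically secure. Nobody has been able to break SHA-1, but the point is, SHA-1, as far as git is concerned, isn't even a security feature. It's purely a consistency check. The security parts are elsewhere. A lot of people assume that since git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, it must be a huge security feature. It has nothing at all to do with security, it's just the best hash you can get.

Having a good hash is good for being able to trust your data. It happens to have some other good features, too: it means that when we hash objects, we know the hash is well distributed and we do not have to worry about certain distribution issues.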

Internally it means from the implementation standpoint, we can trust that the hash is so good that we can use hashing algorithms and know there are no bad cases.

So there are some reasons to like the cryptographic side too, but it's really about the ability to trust your data.
I guarantee you, if you put your data in git, you can trust the fact that five years later, after it is converted from your harddisc to DVD to whatever new technology and you copied it along, five years later you can verify the data you get back out is the exact same data you put in. And that is something you really should look for in a source code management system.
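To make the "consistency check" idea concrete, here is a minimal sketch of how a blob's object ID is derived in a SHA-1 repository: Git hashes a short header (`blob <size>` followed by a NUL byte) together with the file content. The helper name `git_blob_sha1` is made up for this illustration; it is not part of Git or of any library.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content.

    Hypothetical helper for illustration: the hashed data is the header
    b"blob <size>\\0" followed by the raw content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The ID recorded when the data was stored...
original = b"hello world\n"
stored_id = git_blob_sha1(original)

# ...can be recomputed years later: identical bytes give the identical ID.
assert git_blob_sha1(original) == stored_id

# A single changed byte yields a different ID, so corruption is detected.
corrupted = b"hello w0rld\n"
assert git_blob_sha1(corrupted) != stored_id

print(stored_id)
```

The printed value can be compared with the output of `git hash-object` run on a file with the same content.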


Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".


I mentioned in "How would git handle a SHA-1 collision on a blob?" that you could engineer a commit with a particular SHA1 prefix (still an extremely costly endeavor).
But the point remains, as Eric Sink mentions in "Git: Cryptographic Hashes" (from the book Version Control by Example, 2011):

It is rather important that the DVCS never encounter two different pieces of data which have the same digest. Fortunately, good cryptographic hash functions are designed to make such collisions extremely unlikely.
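As a rough illustration of "extremely unlikely", the sketch below uses the standard birthday-bound approximation to estimate the chance of an accidental collision. It assumes the digests behave like uniformly random values, so it says nothing about deliberately engineered collisions; the function name and the object counts are made up for the example.

```python
import math


def collision_probability(num_objects: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_objects digests collide.

    Birthday-bound approximation, assuming uniformly random digests:
    p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1)).
    """
    space = 2.0 ** hash_bits
    exponent = -num_objects * (num_objects - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A billion objects with a 160-bit digest (the size of SHA-1).
p_160 = collision_probability(10**9, 160)

# One hundred thousand objects with a 32-bit checksum, for contrast.
p_32 = collision_probability(10**5, 32)

print(f"160-bit digest, 1e9 objects: {p_160:.3e}")
print(f" 32-bit digest, 1e5 objects: {p_32:.3f}")
```

Under these assumptions the 160-bit case is vanishingly small, while a 32-bit checksum is more likely than not to collide at a far smaller object count. Digest width matters here as much as the cryptographic design.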

It is harder to find a good non-cryptographic hash with a low collision rate, unless you consider research like "Finding State-of-the-Art Non-cryptographic Hashes with Genetic Programming".

You can also read "Consider use of non-cryptographic hash algorithm for hashing speed-up", which mentions for instance "xxhash", an extremely fast non-cryptographic Hash algorithm, working at speeds close to RAM limits.


Discussions around changing the hash in Git are not new:

(Linus Torvalds)

There's not really anything remaining of the mozilla code, but hey, I started from it. In retrospect I probably should have started from the PPC asm code that already did the blocking sanely - but that's a "20/20 hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so convinced that I could improve on things. So that's a kind of source for it, even if it's more about the motivational side than any actual remaining code ;)

And you need to be careful about how to measure the actual optimization gain

(Linus Torvalds)

I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues.

(John Tapsell - johnflux)

The engineering cost for upgrading git from SHA-1 to a new algorithm is much higher. I'm not sure how it can be done well.

First of all we probably need to deploy a version of git (let's call it version 2 for this conversation) which allows there to be a slot for a new hash value even though it doesn't read or use that space -- it just uses the SHA-1 hash value which is in the other slot.

That way once we eventually deploy yet a newer version of git, let's call it version 3, which produces SHA-3 hashes in addition to SHA-1 hashes, people using git version 2 will be able to continue to inter-operate.
(Although, per this discussion, they may be vulnerable and people who rely on their SHA-1-only patches may be vulnerable.)

In short, switching to any hash is not easy.


Update February 2017: yes, it is in theory possible to compute a colliding SHA1: shattered.io

How is GIT affected?

GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits.
It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one.
An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision.

But:

This attack required over 9,223,372,036,854,775,808 SHA1 computations. This took the equivalent processing power as 6,500 years of single-CPU computations and 110 years of single-GPU computations.

So let's not panic just yet.
See more at "How would Git handle a SHA-1 collision on a blob?".
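For scale, the number of SHA-1 computations quoted above is exactly 2^63. The short check below compares it with the generic birthday bound of 2^80 for a 160-bit digest; that bound is the textbook figure for a brute-force collision search and is an assumption added here, not something stated in the quoted text.

```python
# Number of SHA-1 computations quoted for the attack.
SHATTERED_SHA1_COMPUTATIONS = 9_223_372_036_854_775_808
assert SHATTERED_SHA1_COMPUTATIONS == 2**63

# Generic birthday bound for a 160-bit digest (assumption for comparison).
BIRTHDAY_BRUTE_FORCE = 2**80

# How much cheaper the attack is than a brute-force collision search.
speedup = BIRTHDAY_BRUTE_FORCE // SHATTERED_SHA1_COMPUTATIONS
print(speedup)  # 131072, i.e. 2**17
```

So the attack is far cheaper than brute force, yet still a very large computation for a single collision.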
