***joevenzon*** · 10-01-2011, 04:35 PM,

BiGBeN87 Wrote:
joevenzon Wrote:Let's keep data on sourceforge. DVCSs aren't that useful for binary files.

Do you have any data to back up this statement? I do not see how DVCSs are conceptually less useful for binary files. To my knowledge, SVN in particular does not handle binary files any better than Git does.

DVCSs allow multiple users to work on the same things independently of a central authority, which is the big advantage that puts the D in DVCS. However, binary files cannot be merged. That means two users can't work on the same file at the same time without their changes conflicting, and one of them will need to redo their work. With centralized version control you can lock the files you are editing to prevent others from working on them at the same time; with a DVCS you cannot (to my knowledge).

Other stuff specific to git: I had heard that a git repository size grows much faster with binary file changes than an SVN repository. I had also heard that the way git works on a blob level (versioning the content of files, not the file itself) makes it really slow at scanning for updated files (since it needs to hash everything to determine if it changed) while subversion can cheat by looking at file properties. This info may be out of date. If you set up a test git data repo we can do some experiments.

BiGBeN87 · 10-02-2011, 12:16 PM,

Hello joevenzon,

First of all, thank you for your detailed reply. I had not considered locking, and it does seem that Git has no equivalent mechanism:

http://stackoverflow.com/questions/11944...rol-system

I agree that the distributed nature of Git puts a stronger emphasis on this problem, because people might not notice a lock when they work offline. GitHub adds to this, because it encourages spontaneous contributions more than the SVN on SourceForge does.

I have never used locking in Subversion myself. The comments on StackOverflow suggest that even though SVN has locking, one can run into the same problems as without it. Was it usual for VDrift to use Subversion's locking in the past? Were your experiences positive?

I am working on a project (8 developers) that until recently used a mailing list for locking and releasing an unmergeable database dump. That worked pretty well; we only had one situation where two people made changes concurrently and one had to redo them on the other's dump.

We moved away from this method because we no longer wanted the overhead of checking and writing emails. Instead we now export our changes as code and automate the import/export via scripts. The binary files of VDrift, however, cannot be replaced by more atomic files, as far as I understand.

Therefore I think handling them is only possible with proper communication and by sticking to an agreed editing protocol. I doubt that locking in SVN really avoids conflicts better than splitting the data into individual cars/tracks/etc. on GitHub would. With more atomic repositories and easier, and therefore more frequent, commits, the maximum possible damage could actually be lowered.

I would suggest having a lock file in the root of each repository that needs to be touched before and after working on the data repositories. A mandatory guide on how to fork, lock, edit, commit and release could be given in each repo's readme, too.
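
A minimal sketch of such an advisory lock, assuming a file named LOCK in the repository root. The file name and the helper names acquire_lock and release_lock are made up for illustration. It only handles the local file; committing and pushing the lock so that others see it would still be a manual step, and two people working offline could still both take the lock.

```shell
#!/bin/sh
# Advisory lock sketch: a LOCK file in the repository root (hypothetical name).

# acquire_lock REPO_DIR USER: create REPO_DIR/LOCK containing USER.
# Returns non-zero if the lock file already exists.
acquire_lock() {
    # set -C (noclobber) makes the redirection fail instead of
    # overwriting an existing lock file.
    ( set -C; printf '%s\n' "$2" > "$1/LOCK" ) 2>/dev/null
}

# release_lock REPO_DIR USER: remove the lock, but only if USER holds it.
release_lock() {
    [ -f "$1/LOCK" ] && [ "$(cat "$1/LOCK")" = "$2" ] && rm -f "$1/LOCK"
}
```

Typical use would be to call acquire_lock, commit and push the LOCK file, do the edit, then call release_lock and push again.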

joevenzon Wrote:Other stuff specific to git: I had heard that a git repository size grows much faster with binary file changes than an SVN repository. I had also heard that the way git works on a blob level (versioning the content of files, not the file itself) makes it really slow at scanning for updated files (since it needs to hash everything to determine if it changed) while subversion can cheat by looking at file properties. This info may be out of date. If you set up a test git data repo we can do some experiments.

I would be happy to help testing this. I will deploy something later today or tomorrow.

***joevenzon*** · 10-02-2011, 02:57 PM,

BiGBeN87 Wrote:Was it usual for VDrift to use Subversion's locking in the past? Were your experiences positive?

I've used it at work for other projects where it was vital, but to be honest, I don't think anyone's really used it for VDrift. There are a small number of people working on anything at a given moment, so not many collisions. I think the forum has been used sometimes to communicate before beginning work on something that someone else checked in (although this has failed in the past as well). The problem we run into on VDrift most often is just someone changing a file that someone else had previously changed, and then that person being like "wtf, why did you change that?", but they weren't working on it simultaneously, so that's a different problem with some different solution.

Quote:I would be happy to help testing this. I will deploy something later today or tomorrow.

Some tests:
* making a series of small changes to a .png image file, checking repository size change
* adding and then deleting the equivalent of a track's set of files, checking repository size change
* on a large repository like the entire vdrift data repo, make a single change to a .png file somewhere deep and check the time taken to make a git commit -a or svn ci
* on a large repo, test git workflow for branching, changing a .png file, and merging back in. test time and repository size change

Considerations:
* for the size and time tests, how does running git gc affect the results? would this need to be a regular manual maintenance for a github hosted repo?
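
One way to run the size tests above is to measure the disk usage of the repository metadata directory before and after each operation. The sketch below is only an illustration: the helper names dir_size_kib and growth_kib are made up, and it relies on du -sk, whose numbers depend on the filesystem.

```shell
#!/bin/sh
# dir_size_kib DIR: print the disk usage of DIR in KiB.
dir_size_kib() {
    du -sk "$1" | cut -f1
}

# growth_kib DIR CMD [ARGS...]: run CMD and print by how many KiB DIR grew.
growth_kib() {
    dir=$1
    shift
    before=$(dir_size_kib "$dir")
    "$@" >/dev/null 2>&1
    after=$(dir_size_kib "$dir")
    echo $((after - before))
}

# Example (run inside a clone, after editing a .png):
#   growth_kib .git git commit -a -m 'tweak texture'
#   growth_kib .git git gc
```

Running the same measurement once before and once after git gc would show how much of the growth is only loose, not yet packed, objects.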

BiGBeN87 · 10-05-2011, 05:43 PM,

I just finished cloning the svn and uploading it to github:

https://github.com/bigben87/VDrift-Data

So fork me on GitHub and do your tests! I will go to sleep now, but I have an early measurement already:
1.6 GiB of binary data with 900 revisions weighs in at a 1.7 GiB .git directory. So the size concerns seem unfounded.

joevenzon Wrote:The problem we run into on VDrift most often is just someone changing a file that someone else had previously changed, and then that person being like "wtf, why did you change that?", but they weren't working on it simultaneously, so that's a different problem with some different solution.

I have witnessed this behavior in some other projects too. I always felt that a forum is the wrong place to talk about code, however. GitHub's pull-request implementation encourages discussion on specific changes throughout the whole development process. Maybe that already gives enough structure in the right place to prevent wtf-commits in the future.

joevenzon Wrote:Some tests:
* making a series of small changes to a .png image file, checking repository size change

PNGs are already compressed, so they delta poorly. Therefore Git will add almost the full file size for each changed revision.
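
To illustrate why already-compressed files add nearly their full size, the sketch below runs gzip (DEFLATE, the same algorithm family that PNG and Git's zlib storage use) over random bytes and over zero bytes. The random data is only a stand-in for PNG image data, not a real PNG, and the helper name compressed_size is made up for this example.

```shell
#!/bin/sh
# compressed_size FILE: print the size in bytes of FILE after gzip compression.
compressed_size() {
    gzip -c "$1" | wc -c | tr -d ' '
}

tmp=$(mktemp -d)
# 100000 random bytes: a stand-in for already-compressed image data.
head -c 100000 /dev/urandom > "$tmp/random.bin"
# 100000 zero bytes: a stand-in for highly redundant data.
head -c 100000 /dev/zero > "$tmp/zeros.bin"

echo "random: $(compressed_size "$tmp/random.bin") bytes after compression"
echo "zeros:  $(compressed_size "$tmp/zeros.bin") bytes after compression"
rm -rf "$tmp"
```

The random file should come out at roughly its original size, while the zero file shrinks to a tiny fraction of it.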

joevenzon Wrote:Considerations:
* for the size and time tests, how does running git gc affect the results? would this need to be a regular manual maintenance for a github hosted repo?

Git automatically invoked gc after cloning the SVN repository (https://gist.github.com/1261544), so I cannot test this right now.

***joevenzon*** · 10-06-2011, 10:35 AM,

I hope to get some time to run tests this weekend.

BiGBeN87 · 10-07-2011, 01:55 PM,

I toyed around with time and git and tried to answer:

joevenzon Wrote:* on a large repository like the entire vdrift data repo, make a single change to a .png file somewhere deep and check the time taken to make a git commit -a or svn ci

Blind test on a fresh repository:

Code:
$ time git status

# On branch master

nothing to commit (working directory clean)

real   0m0.177s

user   0m0.060s

sys    0m0.100s

Git does not seem to have trouble finding a single changed bit (here the executable bit) deep in the tree:

Code:
$ chmod +x cars/FF/interior.png

$ time git commit -a -m 'testing commit -a time'

[master c91cd97] testing commit -a time

 1 files changed, 0 insertions(+), 0 deletions(-)

 mode change 100644 => 100755 cars/FF/interior.png

real   0m0.266s

user   0m0.100s

sys    0m0.090s

$ time git status

# On branch master

# Your branch is ahead of 'origin/master' by 1 commit.

#

nothing to commit (working directory clean)

real   0m0.139s

user   0m0.080s

sys    0m0.050s

BiGBeN87 · 10-08-2011, 08:18 PM,

A git svn clone should take around 5 h with my internet connection, but I haven't actually timed it. A plain git clone of the equivalent repository is faster by roughly a factor of 4:

Code:
$ time git clone https://github.com/bigben87/VDrift-Data.git VDrift-Data

Cloning into VDrift-Data...

remote: Counting objects: 25277, done.

remote: Compressing objects: 100% (16627/16627), done.

remote: Total 25277 (delta 8576), reused 25277 (delta 8576)

Receiving objects: 100% (25277/25277), 1.54 GiB | 306 KiB/s, done.

Resolving deltas: 100% (8576/8576), done.

real   70m30.272s

user   4m16.600s

sys    1m18.090s

A shallow clone compares as follows:

Code:
$ time git clone --depth 1 https://github.com/bigben87/VDrift-Data.git VDrift-Data

Cloning into VDrift-Data...

remote: Counting objects: 12128, done.

remote: Compressing objects: 100% (11709/11709), done.

remote: Total 12128 (delta 443), reused 11895 (delta 375)

Receiving objects: 100% (12128/12128), 1.25 GiB | 736 KiB/s, done.

Resolving deltas: 100% (443/443), done.

real    39m2.372s

user    3m19.890s

sys    1m4.370s

svn checkout for comparison:

Code:
$ time svn checkout -q https://vdrift.svn.sourceforge.net/svnroot/vdrift/vdrift-data VDrift-Data

real    55m15.538s

user    4m18.790s

sys    1m36.670s

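So Git/GitHub is faster at downloading for end users and patch-only developers, who can use the shallow clone (39 min versus 55 min for the svn checkout); the full clone, at 70 min, is slower than the svn checkout.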
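
The comparison can be made explicit by converting the three "real" times above to seconds and taking ratios. This is a small sketch with made-up helper names (to_seconds, ratio); the timings are single runs on one connection, so the ratios are only indicative.

```shell
#!/bin/sh
# to_seconds 'XmY.ZZZs': convert a `time` value such as 70m30.272s to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# ratio A B: print A / B with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

full=$(to_seconds 70m30.272s)     # git clone (full history)
shallow=$(to_seconds 39m2.372s)   # git clone --depth 1
svn=$(to_seconds 55m15.538s)      # svn checkout

echo "svn checkout / shallow clone: $(ratio "$svn" "$shallow")"
echo "full clone / svn checkout:    $(ratio "$full" "$svn")"
```

With the numbers from this thread, the svn checkout took about 1.4 times as long as the shallow clone, and the full clone about 1.3 times as long as the svn checkout.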

BiGBeN87 · 10-19-2011, 11:45 AM,

I imported the SVN-Tags manually into Git: https://github.com/bigben87/VDrift-Data/tags

I tested downloading them:

Code:
$ time wget https://github.com/bigben87/VDrift-Data/tarball/2011-09-01

--2011-10-19 16:02:56--  https://github.com/bigben87/VDrift-Data/tarball/2011-09-01

Resolving github.com... 207.97.227.239

Connecting to github.com|207.97.227.239|:443... connected.

HTTP request sent, awaiting response... 302 Found

Location: https://nodeload.github.com/bigben87/VDrift-Data/tarball/2011-09-01 [following]

--2011-10-19 16:02:57--  https://nodeload.github.com/bigben87/VDrift-Data/tarball/2011-09-01

Resolving nodeload.github.com... 207.97.227.252

Connecting to nodeload.github.com|207.97.227.252|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 1442262807 (1.3G) [application/octet-stream]

Saving to: '2011-09-01'

100%[====================================>] 1,442,262,807  382K/s   in 52m 51s

2011-10-19 16:55:50 (444 KB/s) - '2011-09-01' saved [1442262807/1442262807]

real   52m53.388s

user   1m16.469s

sys    1m56.087s

***joevenzon*** · 10-22-2011, 01:37 PM,

Only 4 tags...?

***joevenzon*** · 10-22-2011, 01:48 PM,

If we do switch to git for data, the auto-updater needs to be rewritten to use it instead of the SourceForge SVN. This may be easier because GitHub has an API, whereas the SourceForge SVN code is scraping the HTML, although having a concept of revision number is handy.

***joevenzon*** · 10-22-2011, 02:08 PM,

BiGBeN87 Wrote:So Git/Hub is faster at downloading for end users and patch-only developers, who can use the shallow clone.

1) What's the workflow for a patch-only developer using a shallow clone?
2) Is using a fork and pull request a valid workflow for the entire data tree?
3) What's the workflow for a developer working in master? They must do a full clone, correct?

BiGBeN87 · 10-23-2011, 05:02 PM,

joevenzon Wrote:Only 4 tags...?

These 4 were the only ones I found on SourceForge: http://vdrift.svn.sourceforge.net/viewvc/vdrift/tags/

joevenzon Wrote:If we do switch to git for data, the auto-updater needs to be rewritten to use it instead of the SourceForge SVN. This may be easier because GitHub has an API, whereas the SourceForge SVN code is scraping the HTML, although having a concept of revision number is handy.

Yes, GitHub has an API that can be accessed via HTTP:
http://developer.github.com/v3/

I would suggest that the updater consider only new tags. Packages are usually updated with new releases, so this would match the usual behaviour. There is a function to list the commits of a repository, so the last tag can be identified:
http://developer.github.com/v3/git/commits/

There is a function to get information about a specific tag:
http://developer.github.com/v3/git/tags/

A tag object points to a commit object, which in turn points to a tree object listing the SHA-1s of the contained blobs.

These can then be cross-referenced with the hashes returned by the VDriftDataHasher I am working on. I added object hashing that is compatible with Git in:

https://github.com/bigben87/VDriftDataHa...0b31e31b87
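
For reference, Git's blob hash is the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the file content, so it can be reproduced without Git. The sketch below is not the VDriftDataHasher code; git_blob_sha1 is a made-up helper name, and it assumes sha1sum from GNU coreutils is available.

```shell
#!/bin/sh
# git_blob_sha1 FILE: print the SHA-1 that Git assigns to FILE's content as a blob.
# Git hashes the header "blob <size in bytes>\0" followed by the raw content.
git_blob_sha1() {
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

The output should match git hash-object for the same file, which makes it easy to compare local files against the blob SHA-1s listed in a tree.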

joevenzon Wrote:
BiGBeN87 Wrote:So Git/Hub is faster at downloading for end users and patch-only developers, who can use the shallow clone.

1) What's the workflow for a patch-only developer using a shallow clone?

patch-only-dev:

- initially, git clone --depth=1
- checkout master
- modify files
- git add them
- git commit them
- git format-patch origin/master..master
- create a gist containing the patch(es)
- create an issue

Examples:
- https://github.com/mootools/mootools-cor...it-Patches
- https://github.com/rakudo/rakudo/wiki/st...te-a-patch

VDrift-maintainer:

- see the issue
- download the raw patch(es) from gist
- git apply/am the patch(es)
- git push origin master

joevenzon Wrote:2) Is using a fork and pull request a valid workflow for the entire data tree?

I think it is, because the strict hierarchy keeps the commits in different branches separate from each other, even if the fork was made some time ago and its master was not updated. I think splitting the data into individual repos for each car/track would help keep things clear and atomic, too.

joevenzon Wrote:3) What's the workflow for a developer working in master? They must do a full clone, correct?

- git pull
- modify
- git add
- git commit
- git push

I would think this is the recommended workflow for developers/maintainers and contributors alike, because branching, merging and pull requests are only possible with non-shallow clones, which are therefore needed for agile and clean development.

For more complex (read: multi-commit) projects, developers should work in branches, too, and merge their work into master when they are more or less done.
