Number of people on a file
First of all, I don't want to start a "mine's bigger than yours" type of argument, but I am wondering what range of sizes genealogy computer files might have in practice. This is relevant to some software I am developing, where I am currently looking at some optimisation issues.
My own experience is that there might be two or three hundred people on a file, but I am interested in what range of values other people can report.
It's the number of people that is relevant, not the file format.
Regards
Peter
http://www.gendatam.com
Re: Number of people on a file
There is a message on The Master Genealogist forum from someone trying to import 230,000 individuals from Generations. And if you look at The Next Generation of Genealogy Sitebuilding© ("TNG") web site you will see a reference to someone with 1,250,000 names, and indications that performance may degrade with a few hundred thousand names but that most people with fewer than 10,000 names will not experience problems.
I would suspect a significant percentage of private genealogists have between 1,000 and 10,000 individuals in their files, which probably implies that any application should be capable of handling at least 100,000 names without problems.
Re: Number of people on a file
"Peter J Seymour" schreef ...
Pro-Gen, the program that I use (also available in English), accepts 250,000 individuals.
See [ http://www.pro-gen.nl/nlhome.htm ]
--
K0BBE [ http://go.to/coilge ]
e-mail address: incorrect
Re: Number of people on a file
My genealogy database has 18,000 individuals in it, and I don't suppose
it's exceptionally large by the standards of many serious genealogists.
You could easily hit two or three hundred individuals with just three or
four generations. That barely takes you back to the start of the 20th
century.
My advice would be to design your software on the assumption that it
will need to handle half a million individuals efficiently. Then all
your users/customers will thank you, whether they have 200 people in
their database, or 20,000, or 200,000.
David Harper
Cambridge, England
Re: Number of people on a file
Well, I have about 1,000, and that's just going back to 1800 in most cases. But if you're developing software, why do you feel a need to limit it at all? Just let the available hardware limit it (i.e. running out of disk space).
MickG
Re: Number of people on a file
Hi Peter,
Have a look at http://treefic.com/admin/index.cgi
If you are lucky you may find similar tables at other sites. How
representative these are of your potential users is up to you to judge.
Ignore all the really small sites - they are people who had a play but
didn't take it any further. There is currently a limit of 10,000. I
have had enquiries from people with larger genealogies and that limit
will rise.
My experience suggests that you really need all of your main algorithms to be no worse than O(n log^2 n) in time and I/O, and no worse than O(n) in RAM. Treefic still has some code that is O(n^2) in both time and I/O, but I am slowly getting rid of it.
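To make the difference concrete, here is a generic Python sketch (the record layout is invented, and this is not Treefic's actual code): building a child index once keeps a whole-tree traversal close to O(n), whereas re-scanning the full person list for every individual is O(n^2).

from collections import defaultdict

def build_child_index(people):
    """One O(n) pass: map each parent id to the ids of that parent's children."""
    children = defaultdict(list)
    for person_id, record in people.items():
        for parent_id in record.get("parents", []):
            children[parent_id].append(person_id)
    return children

def count_descendants(root_id, children):
    """Iterative traversal over the index: roughly O(n) time and O(n) RAM."""
    seen = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:   # cousin marriages would otherwise be counted twice
                seen.add(child)
                stack.append(child)
    return len(seen)

# people = {person_id: {"name": ..., "parents": [parent_id, ...]}, ...}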
Regards,
Phil Endecott - Treefic.com
Re: Number of people on a file
On Sun, 08 May 2005 11:38:06 GMT, David Harper
<[email protected]> wrote:
My genealogy database has 18,000 individuals in it, and I don't suppose
it's exceptionally large by the standards of many serious genealogists.
More than one million since this week with GeneWeb:
<http://geneweb.inria.fr/roglo>
Re: Number of people on a file
I've got about 30,000 people in one of my files: descendants, spouses and in-laws of one couple, plus people of the right name who either provably are not ours or who can't yet be ruled out or in.
Other files I maintain have 500 to 4,000 names.
I know a man who has over 70,000 names in his main file.
"Unlimited" is currently the standard maximum. (g)
Cheryl
Re: Number of people on a file
On Sun, 8 May 2005, Peter J Seymour wrote:
First of all, I don't want to start a "mine's bigger than yours" type of
argument, but I am wondering what range of sizes genealogy computer files might
have in practice. This is relevant to some software I am developing where I
am currently looking at some optimisation issues.....
I have about 145,000 people in 53,700 families (+/- 100).
The first 50,000 are my direct relatives (ancestors and blood cousins, and
their spouses). The next 40,000 are the cousins' spouses' "other sides" that
don't share ancestry with me. There are about 5,000 fragmentary lineages and unconnected individuals added due to two "one-surname studies" on rare surnames that occur in my ancestry. The remaining 50,000 are "chaff" - data imported from other researchers' databases that came along for the ride. Most of this is "second-level cousins' other sides" - i.e. the lineages of spouses of cousins to those who are my cousins' spouses.
I've seen some ancestral families have more than 15 children (reaching
adulthood), and one ancestor had about 25 children (across 3 wives), so
branching and tracing their descendants can be plentiful. The only saving
grace in shrinking a tree is when cousins marry and thus some ancestors are
"duplicated" (multiply ancestral).
One thing that I have noted is that some genealogy programs have a sub-database
for PLACES while others don't. Since in the early days, families generally
didn't move much, it's more storage efficient to store a pointer to a place
name that appears many times than it is to duplicate the name in each record.
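(To make the pointer-to-place idea concrete, a minimal Python sketch with invented names - not any particular program's schema: each distinct place string is stored once and records carry only a small integer id.)

class PlaceTable:
    """Store each distinct place name once; records hold only integer ids."""

    def __init__(self):
        self._id_by_name = {}
        self._name_by_id = []

    def intern(self, name):
        """Return the id for a place name, adding it on first sight."""
        place_id = self._id_by_name.get(name)
        if place_id is None:
            place_id = len(self._name_by_id)
            self._id_by_name[name] = place_id
            self._name_by_id.append(name)
        return place_id

    def name_of(self, place_id):
        return self._name_by_id[place_id]

places = PlaceTable()
event = {"type": "BIRT", "date": "1852", "place": places.intern("Much Wenlock, Shropshire")}
# Thousands of events in the same parish now share a single stored string.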
Re: Number of people on a file
Legacy lists an Individuals and Families file limit of 1 billion characters (1 gigabyte). I think they mean that the number of individuals is unlimited until the file gets too big.
About 20 years ago I remember a friend whose data file got to 65,536 (2^16) records. He worked with the company to expand their database. That, I think, was "Family Reunion", which I see now goes under a different name and is made by Famware.
"Peter J Seymour" <[email protected]> wrote in message
news:[email protected]...
(1 gigabyte). I think they mean that the number of individuals is unlimited
until the file gets too big.
About 20 years ago I remember a friend whose data file got to 65536 (2^16).
He worked with the company to expand their data base. That I think was
"Family Reunion" which I see is now a different name and made by Famware.
"Peter J Seymour" <[email protected]> wrote in message
news:[email protected]...
First of all, I don't want to start a "mine's bigger than yours" type of
argument, but I am wondering what range of size genealogy computer files
might have in practice. This is relevant to some software I am
developing where I am currently looking at some optimisation issues.
My own experience is that there might be two or three hundred people on
a file, but I am interested in what range of values other people can show.
It's the number of people that is relevant, not the file format.
Regards
Peter
http://www.gendatam.com
Re: Number of people on a file
A 17th/18th/19th-century occupational study of American silversmiths and related craftsmen: 175,500+ individuals in a single tree. Legacy 5; runs without a blink.
Wm Voss
Re: Number of people on a file
On Mon, 09 May 2005 00:33:16 GMT, "Rich256" <[email protected]>
declaimed the following in soc.genealogy.computing:
as "Access" (which is, properly, just a GUI form designer and report
generator using JET as its native database).
I believe JET supports 2GB data files (since all data is stored
in a single file), maybe 4GB on an NT OS... However, one must take into
account all the primary keys, along with foreign keys, used to link the
data in many independent tables.
--
declaimed the following in soc.genealogy.computing:
Legacy lists an Individuals and Families file limit of 1 billion characters
(1 gigabyte). I think they mean that the number of individuals is unlimited
until the file gets too big.
Legacy uses (used?) the JET database engine, more commonly known
as "Access" (which is, properly, just a GUI form designer and report
generator using JET as its native database).
I believe JET supports 2GB data files (since all data is stored
in a single file), maybe 4GB on an NT OS... However, one must take into
account all the primary keys, along with foreign keys, used to link the
data in many independent tables.
--
==============================================================
[email protected] | Wulfraed Dennis Lee Bieber KD6MOG
[email protected] | Bestiaria Support Staff
==============================================================
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>
Re: Number of people on a file
D. Stussy wrote:
One thing that I have noted is that some genealogy programs have a sub-database
for PLACES while others don't. Since in the early days, families generally
didn't move much, it's more storage efficient to store a pointer to a place
name that appears many times than it is to duplicate the name in each record.
Just for the moment, picking up on this particular point.
The Gendatam data model provides the 'Global Address' record type, which allows a specific address (or place) to be recorded once and then referred to any number of times via the 'Private Address' record type. The global record provides for convenience (and a consistent level of certainty as to the details of the address), while the private record provides for qualification of the address in a particular case. This seems to fit with the often-quoted Gentech principles, but I haven't done a detailed comparison. Because of the two-level arrangement, it's not clear if this particular implementation actually saves any space overall - the focus is on an appropriate data structure.
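As a rough illustration of the two-level idea (the field names below are invented for the sketch; they are not the actual Gendatam record layouts):

from dataclasses import dataclass

@dataclass
class GlobalAddress:
    """One shared record per distinct place, referenced by id."""
    address_id: int
    place_name: str
    certainty: str = "as recorded"

@dataclass
class PrivateAddress:
    """Per-use qualification of a shared address in a particular record."""
    global_id: int
    qualifier: str = ""

# A birth event and a census entry can refer to the same GlobalAddress,
# each adding its own qualification without repeating the place details.
shared = GlobalAddress(address_id=1, place_name="Much Wenlock, Shropshire")
birth_ref = PrivateAddress(global_id=shared.address_id, qualifier="registered at")
census_ref = PrivateAddress(global_id=shared.address_id)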
Regards
Peter
Re: Number of people on a file
It has been years since I looked at the file structure of PAF 2.5. It
appears that it can handle 99,999,999,999 records.
"Rich256" <[email protected]> wrote in message
news:[email protected]...
appears that it can handle 99,999,999,999 records.
"Rich256" <[email protected]> wrote in message
news:[email protected]...
Legacy lists an Individuals and Families file limit of 1 billion
characters
(1 gigabyte). I think they mean that the number of individuals is
unlimited
until the file gets too big.
About 20 years ago I remember a friend whose data file got to 65536
(2^16).
He worked with the company to expand their data base. That I think was
"Family Reunion" which I see is now a different name and made by Famware.
"Peter J Seymour" <[email protected]> wrote in message
news:[email protected]...
First of all, I don't want to start a "mine's bigger than yours" type of
argument, but I am wondering what range of size genealogy computer files
might have in practice. This is relevant to some software I am
developing where I am currently looking at some optimisation issues.
My own experience is that there might be two or three hundred people on
a file, but I am interested in what range of values other people can
show.
It's the number of people that is relevant, not the file format.
Regards
Peter
http://www.gendatam.com
Re: Number of people on a file
store a pointer to a place name that appears many times
[or] duplicate the name in each record.
Disk space is almost certainly better optimised by using a low-level
compression library like zlib, rather than applying optimisations at
the application level.
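(In Python terms, the sort of thing I mean - wrap the file I/O in a compressing stream and leave the record layout alone; gzip is the stdlib wrapper around zlib, and the JSON-lines layout is only an example:)

import gzip
import json

def save_people(path, people):
    """Write records through zlib (via gzip) - no application-level tricks needed."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in people:
            out.write(json.dumps(record) + "\n")

def load_people(path):
    """Read the compressed records back; the compression is invisible to callers."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return [json.loads(line) for line in src]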
--Phil.
Re: Number of people on a file
Many thanks for the discussion.
The challenge from a programming point of view is to move successfully
from a test environment involving relatively small file sizes to the
real world where files can be 'large'.
The parameters I am working to are:
- given 'unlimited hardware', an 'unlimited' number of records. In this
context 'unlimited' is not infinity but some rather large number.
- given adequate hardware, reasonable performance with up to one million
records. With multiple record types this might equate to around 50,000
people.
- Disk space is assumed not to be a problem. Most computers can be
fitted with a second disk and disk sizes are gigantic these days.
- All records in a file are held in RAM while the file is being processed. This is likely to be the limiting factor initially.
- responding gracefully to insufficient-memory situations such as might arise when using 'limited hardware'. For instance, not allowing records to be generated that would exceed the available RAM limit. This does not extend to processing a large file on a small computer; for the time being I am deferring that problem. Perhaps later.
Now I must go and generate a really large test file!
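For anyone wanting to do the same, a throwaway Python sketch that writes a large synthetic file (GEDCOM-style records purely as an example - I have no Gendatam-format samples, and the names, tags and sizes are made up):

import random

GIVEN = ["Peter", "David", "Phil", "Cheryl", "Kerry", "Mick"]
SURNAMES = ["Seymour", "Harper", "Endecott", "Voss", "Stussy"]

def write_test_file(path, n_people):
    """Emit n_people GEDCOM-like individual records with crude parent links."""
    with open(path, "w") as out:
        out.write("0 HEAD\n1 CHAR ASCII\n")
        for i in range(1, n_people + 1):
            out.write(f"0 @I{i}@ INDI\n")
            out.write(f"1 NAME {random.choice(GIVEN)} /{random.choice(SURNAMES)}/\n")
            out.write(f"1 BIRT\n2 DATE {random.randint(1500, 1950)}\n")
            if i > 2:
                out.write(f"1 FAMC @F{random.randint(1, i // 2)}@\n")
        out.write("0 TRLR\n")

write_test_file("big_test.ged", 1_000_000)   # on the order of 100 MB of text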
Regards
Peter
Re: Number of people on a file
On Tue, 10 May 2005 10:27:00 +0100, Peter J Seymour <[email protected]>
declaimed the following in soc.genealogy.computing:
- Disk space is assumed not to be a problem. Most computers can be
fitted with a second disk and disk sizes are gigantic these days.
If there is a constraint, it will be on what the OS allows for
FILE SIZE, not disk... W9x is limited to either 2 or 4GB
(signed/unsigned pointers -- I think the OS takes 4GB but many
applications choke when you exceed 2GB). I don't know what the limit is
for WinNT, or the Linux variants.
If your data is maintained, for example, in M$ JET, then you
have a single file containing all data, and will reach such a limit much
faster than you would on a DBMS that uses multiple files (MySQL uses one
"file" per "table" [actually three files: definition, index, data], with
each table then being able to grow to the limit).
- All records in a file held in RAM when file is being processed. This
is likely to be the limiting factor initially.
Worst requirement ever... You have to load/save the entire file
every time. You need to determine how much memory will be needed for OS
and program; and assume nothing else will be running -- otherwise you'll
find that your data is being swapped out to disk anyways as the OS looks
for unused memory to assign to other uses. In a proper virtual memory
system, every process thinks it has whatever the processor addressing
limit is, regardless of physical memory (theory says you could run a 2GB
program in 256MB, but practically every operation is going to trigger a
page swap to disk).
About the only serious applications I know of that are memory
based are text editors/word processors.
You might side track some I/O by setting up the data file(s) as
memory-mapped files, letting the OS do the read/write as needed, instead
of manually loading data.
- responding gracefully to insufficient memory situations such as might
arise when using 'limited hardware'. For instance, not allowing records
You'd be better off to code on the assumption that only the
record(s) of interest will ever be in memory at any given time, and rely
on disk I/O...
--
Dennis Lee Bieber
Re: Number of people on a file
Dennis Lee Bieber wrote:
Worst requirement ever... You have to load/save the entire file
every time. ....
I understand what you are saying, but I have to get an initial version out sooner rather than later. Before the first production version is released I will have done some large-file trials and will have a good idea of what is feasible. If there is a significant limitation at that point, it will be swiftly dealt with in a subsequent version. There are so many issues to deal with in implementing this data model for the first time that other things, such as ease of use, have to take priority over really large file sizes in the initial version. We shall see how it goes.
Regards
Peter
http://www.gendatam.com
Re: Number of people on a file
The biggest problem is that, without a definition of the data
itself, and what is to be stored, estimating is very difficult. You are
focused on "# of people", but that is practically meaningless in the
world of "The Master Genealogist", for instance. TMG is an event-based
program. You could have one person, but that one person could have an
event record created for every day of their life (assuming someone
really wanted to document a "diary" using daily "biography" events). TMG
uses 29 distinct tables (Visual FoxPro, so each table is three files:
fixed-length data, variable length text data, index).
My data (not quite accessible at the moment -- I'm having
trouble getting TMG to work on my new XP machine; heck, I'm having lots
of trouble with M$ services not starting properly, etc.) takes ~15MB for
some 4000 people. But many of those people only have data for parent(s)
and name, no DoB, no marriage, etc. Since all such are "events" in TMG,
no information means no event means no record in the database.
For a genealogy program, the "data model" pretty much defines
the capabilities of the program -- everything else is just
user-interface and report generation. Change the data model, and you
practically have to redo the program. Is your data model in 3rd normal
form? If it isn't, can you justify the use of unnormalized data? (If it
is in 3rd normal form, it should map to /any/ RDBMS on the market, and
all you'd have to change in the code is details of the SQL syntax
relevant to the chosen RDBMS -- some use ? for parameter substitution,
some use other codes...).
And reports... Consider a descendant narrative report. Given the
end-of-line ancestor, you have to query for children... While holding
that information in a temporary structure, you have to recursively query
for each child's children, etc... Are you going to assume a limit for
memory needed by the report -- that's going to cut into the memory
available for other data.
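(A bare-bones sketch of that recursion against a normalized store - sqlite3 is used purely as an example back end, and the table and column names are invented:)

import sqlite3

def descendant_report(db, root_id, depth=0, out=None):
    """Recursively list descendants of root_id, one indented line per person.

    Each generation is fetched with a fresh parameterised query, so only the
    result sets along the current branch are alive at any one time.
    """
    out = out if out is not None else []
    rows = db.execute(
        "SELECT c.person_id, c.name FROM person c "
        "JOIN child_of l ON l.child_id = c.person_id "
        "WHERE l.parent_id = ?", (root_id,)).fetchall()
    for child_id, name in rows:
        out.append("  " * depth + name)
        descendant_report(db, child_id, depth + 1, out)
    return out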
From what I can tell, you're trying to manipulate one massive XML
file (maybe good for defining files that don't change much, but not
really a data format meant for dynamic update -- since you pretty much
have to convert the contents into some other structure to make changes,
and then write them back out again, regenerating the file entirely).
--
Dennis Lee Bieber
Re: Number of people on a file
Dennis Lee Bieber wrote:
The biggest problem is that, without a definition of the data
itself, and what is to be stored, estimating is very difficult. You are
focused on "# of people", but that is practically meaningless
....
I pretty well agree with what you are saying. I should be more specific about my strategy. The initial version of the program will use a flat file, and this will limit its ability regarding file size. In the Gendatam model, the amount of data per person depends on the profile of what data is stored. It could range from about 1,000 bytes per person upwards (or about 250 if you were being really minimalist). So we are looking at 1 MB+ per 1,000 people, which tells me that a few thousand people is the practical limit for this approach.
Moving on from there, the data is intended to be at least in 3rd normal form, and the model is not restricted to any particular form of storage. An eventual move to a database to facilitate the intended larger file sizes is highly probable, although as an intermediate step a form of multi-file organisation is a possibility.
How much can be held in RAM is a different question and obviously depends on the memory available, amongst other things. My view is that at least the flat-file approach should be fully accommodatable in RAM.
I intend to publish some guidance information on file size / number of
people when I have it.
Regards
Peter
http://www.gendatam.com
Re: Number of people on a file
On Wed, 11 May 2005 09:00:13 +0100, Peter J Seymour <[email protected]>
declaimed the following in soc.genealogy.computing:
An eventual move to a database to facilitate the intended larger file
sizes is highly probable, although as an intermediate step, a form of
mutli-file organisation is a possibility.
I doubt I can state anything to change that implementation plan,
though I'd strongly recommend designing for, and using, the RDBMS from
the start. That should bypass pretty much all concerns of "memory"
limitations, and reduce the costs of rework -- since ALL data handling
operations would have to be rewritten. Other than the short bit of time
needed to analyze the data for normalization (and that time is being
used just to dither about memory limits et al) you could concentrate on
the user interface and pretty much be done with the data storage...
How much can be held in RAM is a different question and obviously
depends on the memory available amongst other things. My view is that a
least the flat file approach should be fully accommodateable in RAM.
But, again, if you are on a VM OS (and I don't know of any
current system that /doesn't/ support VM), you may not /know/ that you
are out of physical RAM -- the OS may start swapping your large "in
memory" data space, which puts you back into the realms of disk I/O.
It's not memory based, but have you looked at http://www.sqlite.org/ ? (Note: I've not used this, but the Python bindings get mentioned a lot on comp.lang.python.) I've already got, available if not running on my desktop at home: JET ("Access"), MySQL (need to get that going so I can port data from my older machine), MaxDB (formerly SAP-DB), Firebird (formerly Interbase), and MSDE (the low-usage/developer version of SQL Server).
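(A toy illustration, using Python's sqlite3 module, of how little code a SQLite-backed store needs - the tables and columns are invented for the sketch:)

import sqlite3

def open_store(path):
    """Create (or open) a minimal normalised store: people plus their events."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS person (
            person_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS event (
            event_id  INTEGER PRIMARY KEY,
            person_id INTEGER NOT NULL REFERENCES person(person_id),
            type      TEXT NOT NULL,
            date      TEXT,
            place     TEXT
        );
    """)
    return db

db = open_store("family.db")
db.execute("INSERT INTO person (name) VALUES (?)", ("Peter /Seymour/",))
db.commit()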
If you really want a flat file, I'd suggest you look up
information on the mmap() function [as it is called on Linux... I'm not
sure what the equivalent Windows function is called -- the Python
library uses mmap() for both systems, with slightly different arguments
on the call]. A Google search should find some stuff (see below).
Rather than opening, reading, writing, closing a file, mmap()
"maps" the file into the process address space (you could think of it as
a giant array) and uses the existing virtual memory swapper to load and
unload "pages" as needed. You probably will need to preset the file to
the maximum length, since the mmap() call specifies the start and end
points of the mapped file.
Linux:
http://www.opengroup.org/onlinepubs/009 ... /mmap.html
Windows:
http://msdn.microsoft.com/library/defau ... namemo.asp
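(A tiny Python illustration of the same idea - the fixed record length is just an assumption for the sketch, and the module hides the Linux/Windows differences:)

import mmap

RECORD_SIZE = 256   # assumed fixed-length record, purely for the sketch

def read_record(path, index):
    """Map the data file and pull out one record without reading the whole file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:   # 0 = map the whole file
            start = index * RECORD_SIZE
            return bytes(mapped[start:start + RECORD_SIZE])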
--
Dennis Lee Bieber
Re: Number of people on a file
Dennis Lee Bieber wrote:
I doubt I can state anything to change that implementation plan,
though I'd strongly recommend designing for, and using, the RDBMS from
the start.
....
Thanks for your comments. Yes, you are correct and I am satisfied the
plan will work out ok. I'm not claiming to be following the 'best'
route, merely one I know I can make work.
Regards
Peter
http://www.gendatam.com
Re: Number of people on a file
On Mon, 9 May 2005 [email protected] wrote:
store a pointer to a place name that appears many times
[or] duplicate the name in each record.
Disk space is almost certainly better optimised by using a low-level
compression library like zlib, rather than applying optimisations at
the application level.
Compression requires repeating data sequences, and achieves better ratios as the frequency of repetition increases. Many techniques do this by back-referencing to the first location of a string (or a sub-string), or other repeating pattern. Since that's equivalent to storing the name once and subsequently using pointers (the only difference being the physical size of the pointers vs. the physical size of the back-references in the compression stream), it is UNLIKELY in the general case that such will have any relevant effect on overall size.
As to when compression is applied, your comment is meaningless. A perfect example is an application that itself uses zlib. In that case, the OS cannot do better (but it can do the same). Let's suppose instead that the application uses an even better compression algorithm for its data than the OS does: the application will BEAT the OS's ability. Your conclusion is wrong.
I don't think that you want to challenge me on this. My post-graduate work dealt with lossless/reversible compression algorithms.
Re: Number of people on a file
I claimed:
Disk space is almost certainly better optimised by using a low-level compression library like zlib, rather than applying optimisations at the application level.
D. Stussy replied:
[snip]
Your conclusion is wrong. I don't think that you want to challenge me on this.
I love a challenge - as long as we can keep it light-hearted. Maybe we will all learn something new.
I propose that a general-purpose compression library (zlib or libbzip2) is better, while you propose that an application-level technique can yield a better result. I have a test GEDCOM file [I don't have any examples of Peter's new format, but we could try that if you prefer], and bzip2 reduces it to 18% of its original size:
$ ls -l test.ged
-rw-r--r-- 1 phil phil 636842 2005-05-16 16:19 test.ged
$ bzip2 test.ged
$ ls -l test.ged.bz2
-rw-r--r-- 1 phil phil 117390 2005-05-16 16:19 test.ged.bz2
That took me about 5 seconds. So my challenge is: can you produce an application-level compression method that gives better results? I bet that if you spend 100 times longer working on the problem than I did (~8 mins), or even 1000 times longer (~1 1/4 hr), you won't come up with anything better.
Looking forward to your reply!
--Phil.
Re: Number of people on a file
Just over 16,000 in our main family tree file, but we also have some other files for a few other branches of the family, with probably another 3,000 or so people in them, which we occasionally contemplate merging back into a single file but never seem to get around to. So about 19,000-ish for ours.
Kerry
Re: Number of people on a file
The biggest problem is that, without a definition of the data
itself, and what is to be stored, estimating is very difficult. You are
focused on "# of people", but that is practically meaningless in the
world of "The Master Genealogist", for instance. TMG is an event-based
program. You could have one person, but that one person could have an
event record created for every day of their life
Yes, perhaps you might have an enormous amount of data on some people. But I
doubt if most people have enormous amounts of data on enormous numbers of
people. As the number of people in the database grows, my suspicion is that
the average amount of data per person actually reduces. If you have
thousands of people in your database, many of them will consist of a name
and maybe one other fact (e.g. birth details).
Kerry
Re: Number of people on a file
On Mon, 16 May 2005 [email protected] wrote:
I claimed:
Disk space is almost certainly better optimised by using a low-level
compression library like zlib, rather than applying optimisations at
the application level.
D. Stussy replied:
[snip]
Your conclusion is wrong.
I don't think that you want to challenge me on this.
I love a challenge - as long as we can keep it light-hearted. Maybe we will all learn something new.
I propose that a general-purpose compression library (zlib or libbzip2)
is better, while you propose that an application-level technique can
yield a better result....
Read what I said again. I did not say that it would be better. I said that it COULD be better - but it also could be the same. The outcome of being the same was sufficient to prove your conclusion wrong.
It doesn't matter at which level (in the OSI programming model) compression is applied - if it's the same algorithm applied at both levels for comparison AND on the same data (i.e. not including any overhead or control data), the same data should produce the same result.
... I have a test GEDCOM file [I don't have any examples of Peter's new
format, but we could try that if you prefer], and bzip2 reduces it to 18% of
its original size:
$ ls -l test.ged
-rw-r--r-- 1 phil phil 636842 2005-05-16 16:19 test.ged
$ bzip2 test.ged
$ ls -l test.ged.bz2
-rw-r--r-- 1 phil phil 117390 2005-05-16 16:19 test.ged.bz2
That took me about 5 seconds. So my challenge is, can you produce an
application-level compression method that gives better results? ...
However, what you did IS an application program. For you to do a comparable test according to your given information, you would have to run the low-level disk I/O buffer stream through the bzip2 compressor. What you did is use the bzip2 program. That's not a level-2 operation.
... I bet that if you spend 100 times longer working on the problem than I
did (~8 mins) or even 1000 times longer (~1 1/4 hr) you won't come up with
anything better.
Nor do I have to. All I have to do is MATCH what you came up with by moving
the compression algorithm insertion to a different level - and as the data
doesn't vary, nor will the result.