c++ - software - what is gedcom

Cycles in family tree software (12)

I am the developer of some family tree software (written in C++ and Qt). I had no problems until one of my customers mailed me a bug report. The problem is that the customer has two children with their own daughter, and, as a result, he can't use my software because of errors.

Those errors are the result of my various assertions and invariants about the family graph being processed (for example, after walking a cycle, the program states that X can't be both father and grandfather of Y).

How can I resolve those errors without removing all data assertions?

Assertions don't survive reality

Usually assertions don't survive the contact with real world data. It's a part of the process of software engineering to decide, with which data you want to deal and which are out of scope.

Cyclic family graphs

Regarding family "trees" (in fact it are full blown graphs, including cycles), there is a nice anecdote:

I married a widow who had a grown daughter. My father, who often visited us, fell in love with my step-daughter and married her. As a result, my father became my son, and my daughter became my mother. Some time later, I gave my wife a son, who was the brother of my father, and my uncle. My father's wife (who is also my daughter and my mother) got a son. As a result, I got a brother and a grandson in the same person. My wife is now my grandmother, because she is my mother's mother. So I am the husband of my wife, and at the same time the step-grandson of my wife. In other words, I'm my own grandpa.

Things get even more strange, when you take surrogates or "fuzzy fatherhood" into account.

How to deal with that

Define cycles as out-of-scope

You could decide that your software should not deal with such rare cases. If such a case occurs, the user should use a different product. This makes dealing with the more common cases much more robust, because you can keep more assertions and a simpler data model.

In this case, add some good import and export features to your software, so the user can easily migrate to a different product when necessary.

Allow manual relations

You could allow the user to add manual relations. These relations are not "first-class citizens", i.e. the software takes them as-is, doesn't check them and doesn't handle them in the main data model.

The user can then handle rare cases by hand. Your data model will still stay quite simple and your assertions will survive.

Be careful with manual relations. There is a temptation to make them completely configurable and hence create a fully configurable data model. This will not work: Your software will not scale, you will get strange bugs and finally the user interface will become unusable. This anti-pattern is called "soft coding", and "The daily WTF" is full of examples for that.

Make your data model more flexible, skip assertions, test invariants

The last resort would be making your data model more flexible. You would have to skip nearly all assertions and base your data model on a full blown graph. As the above example shows, it is easily possible to be your own grandfather, so you can even have cycles.

In this case, you should extensively test your software. You had to skip nearly all assertions, so there is a good chance for additional bugs.

Use a test data generator to check unusual test cases. There are quick check libraries for Haskell, Erlang or C. For Java / Scala there are ScalaCheck and Nyaya. One test idea would be to simulate a random population, let it interbreed at random, then let your software first import and then export the result. The expectation would be, that all connections in the output are also in the input and vice verse.

A case, where a property stays the same is called an invariant. In this case, the invariant is the set of "romantic relations" between the individuals in the simulated population. Try to find as much invariants as possible and test them with randomly generated data. Invariants can be functional, e.g.:

  • an uncle stays an uncle, even when you add more "romantic relations"
  • every child has a parent
  • a population with two generations has at least one grand-parent

Or they can be technical:

  • Your software will not crash on a graph up to 10 billion members (no matter how many interconnections)
  • Your software scales with O(number-of-nodes) and O(number-of-edges^2)
  • Your software can save and re-load every family graph up to 10 billion members

By running the simulated tests, you will find lots of strange corner cases. Fixing them will take a lot of time. Also you will lose a lot of optimizations, your software will run much slower. You have to decide, if it is worth it and if this is in the scope of your software.

A few answers have shown ways to keep the assertions/invariants, but this seems like a misuse of assertions/invariant. Assertions are to make sure something that should be true is true, and invariants are to make sure something that shouldn't change doesn't change.

What you're asserting here is that incestuous relationships don't exist. Clearly they do exist, so your assertion is invalid. You can work around this assertion, but the real bug is in the assertion itself. The assertion should be removed.

Duplicate the father (or use symlink/reference).

For example, if you are using hierarchical database:

$ #each person node has two nodes representing its parents.
$ mkdir Family
$ mkdir Family/Son
$ mkdir Family/Son/Daughter
$ mkdir Family/Son/Father
$ mkdir Family/Son/Daughter/Father
$ ln -s Family/Son/Daughter/Father Family/Son/Father
$ mkdir Family/Son/Daughter/Wife
$ tree Family
└── Son
    ├── Daughter
    │   ├── Father
    │   └── Wife
    └── Father -> Family/Son/Daughter/Father

4 directories, 1 file

Genealogical data is cyclic and does not fit into an acyclic graph, so if you have assertions against cycles you should remove them.

The way to handle this in a view without creating a custom view is to treat the cyclic parent as a "ghost" parent. In other words, when a person is both a father and a grandfather to the same person, then the grandfather node is shown normally, but the father node is rendered as a "ghost" node that has a simple label like ("see grandfather") and points to the grandfather.

In order to do calculations you may need to improve your logic to handle cyclic graphs so that a node is not visited more than once if there is a cycle.

I guess that you have some value that uniquely identifies a person on which you can base your checks.

This is a tricky one. Assuming you want to keep the structure a tree, I suggest this:

Assume this: A has kids with his own daughter.

A adds himself to the program as A and as B. Once in the role of father, let's call it boyfriend.

Add a is_same_for_out() function which tells the output generating part of your program that all links going to B internally should be going to A on presentation of data.

This will make some extra work for the user, but I guess IT would be relatively easy to implement and maintain.

Building from that, you could work on code synching A and B to avoid inconsistencies.

This solution is surely not perfect, but is a first approach.

I hate commenting on such a screwed up situation, but the easiest way to not rejigger all of your invariants is to create a phantom vertex in your graph that acts as a proxy back to the incestuous dad.

It seems you (and/or your company) have a fundamental misunderstanding of what a family tree is supposed to be.

Let me clarify, I also work for a company that has (as one of its products) a family tree in its portfolio, and we have been struggling with similar problems.

The problem, in our case, and I assume your case as well, comes from the GEDCOM format that is extremely opinionated about what a family should be. However this format contains some severe misconceptions about what a family tree really looks like.

GEDCOM has many issues, such as incompatibility with same sex relations, incest, etc... Which in real life happens more often than you'd imagine (especially when going back in time to the 1700-1800).

We have modeled our family tree to what happens in the real world: Events (for example, births, weddings, engagement, unions, deaths, adoptions, etc.). We do not put any restrictions on these, except for logically impossible ones (for example, one can't be one's own parent, relations need two individuals, etc...)

The lack of validations gives us a more "real world", simpler and more flexible solution.

As for this specific case, I would suggest removing the assertions as they do not hold universally.

For displaying issues (that will arise) I would suggest drawing the same node as many times as needed, hinting at the duplication by lighting up all the copies on selecting one of them.

Potential legal implications aside, it certainly seems that you need to treat a 'node' on a family tree as a predecessor-person rather than assuming that the node can be the-one-and-only person.

Have the tree node include a person as well as the successors - and then you can have another node deeper down the tree that includes the same person with different successors.

So, I've done some work on family tree software. I think the problem you're trying to solve is that you need to be able to walk the tree without getting in infinite loops - in other words, the tree needs to be acyclical.

However, it looks like you're asserting that there is only one path between a person and one of their ancestors. That will guarantee that there are no cycles, but is too strict. Biologically speaking, descendancy is a directed acyclic graph (DAG). The case you have is certainly a degenerate case, but that type of thing happens all the time on larger trees.

For example, if you look at the 2^n ancestors you have at generation n, if there was no overlap, then you'd have more ancestors in 1000 AD than there were people alive. So, there's got to be overlap.

However, you also do tend to get cycles that are invalid, just bad data. If you're traversing the tree, then cycles must be dealt with. You can do this in each individual algorithm, or on load. I did it on load.

Finding true cycles in a tree can be done in a few ways. The wrong way is to mark every ancestor from a given individual, and when traversing, if the person you're going to step to next is already marked, then cut the link. This will sever potentially accurate relationships. The correct way to do it is to start from each individual, and mark each ancestor with the path to that individual. If the new path contains the current path as a subpath, then it's a cycle, and should be broken. You can store paths as vector<bool> (MFMF, MFFFMF, etc.) which makes the comparison and storage very fast.

There are a few other ways to detect cycles, such as sending out two iterators and seeing if they ever collide with the subset test, but I ended up using the local storage method.

Also note that you don't need to actually sever the link, you can just change it from a normal link to a 'weak' link, which isn't followed by some of your algorithms. You will also want to take care when choosing which link to mark as weak; sometimes you can figure out where the cycle should be broken by looking at birthdate information, but often you can't figure out anything because so much data is missing.

The most important thing is to avoid creating a problem, so I believe that you should use a direct relation to avoid having a cycle.

As @markmywords said, #include "fritzl.h".

Finally I have to say recheck your data structure. Maybe something is going wrong over there (maybe a bidirectional linked list solves your problem).

You should focus on what really makes value for your software. Is the time spent on making it work for ONE consumer worth the price of the license ? Likely not.

I advise you to apologize to this customer, tell him that his situation is out of scope for your software and issue him a refund.

You should have set up the Atreides family (either modern, Dune, or ancient, Oedipus Rex) as a testing case. You don't find bugs by using sanitized data as a test case.