Clueless Finn » Blog Archive » Challenges with people’s names

Challenges with people’s names

This past week I spent a lot of time (well, at least more than I expected) consolidating roster information from a few tournament web sites (mainly EUGC 2008 and EUC 2007) into the National Teams of Ultimate web site/database. (If anyone comes up with a better name for the site, please feel free to send in the suggestion!)
I expected some amount of manual work. After all, many people play in more than one of these national team tournaments, and I knew I would need to make some judgement calls on when to create a new person and when to re-use an existing one whenever I bumped into a name already in the database. That turned out to be the easy part. The hard part was everything else.

1. Scraping the web sites turned out to be harder to automate than I expected. To overcome that I had to rely on a little more manual work. Read: adding each team took a little longer.
2. I realized that I could not really rely on the names listed on these tournament sites. In some cases the whole team roster was in the wrong order (the first and last names were in a different order than for other teams), and there were a few occasions when the order of the first and last names was inconsistent inside a single team roster. This meant that I had to glance through the rosters a little more carefully to spot and correct the problems. Again, more time needed to enter one team.
3. There were different capitalizations used for some teams. Luckily there are good tools to convert the names to Title Case. No biggie here.
4. The same names were spelled differently, and the use of accented and dotted characters was inconsistent across sites. I spotted a few problems, but I am sure some real-life people now have multiple entries in the database because of slight differences in spelling. And there are likely a few people in there whose names are just plain misspelled. (Please send a note if you spot any mistakes!)
5. Finally, the toughest part. Before starting the roster import I mistakenly believed that people have first names which are easily distinguishable from their last names. Only now do I realize how varied naming conventions are across different countries. People have 2-word names, 3-word names and even 4-word names. Splitting these words into a first name and a last name is not always easy. In some sense this is an academic problem, but entering an incorrect first name / last name combination creates problems when you list all the players in alphabetical order. I am sure I have made more than one mistake in the database in this regard. Again, all corrections are welcome!
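
For anyone facing the same cleanup, here is a minimal Python sketch of how points 2 to 4 could be partly automated. It is not what I used for the import. The function names and the sample names are made up for illustration. The idea is to build a comparison key that ignores capitalization, accents and word order, and then flag names that share a key for manual review. It assumes that stripping combining marks is an acceptable approximation. Letters such as "ø" or "ł" do not decompose this way and would still need a hand-written mapping, and it does nothing for the first name / last name split in point 5.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks, so that 'Mäkinen' becomes 'Makinen'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(full_name: str) -> tuple:
    """Comparison key that ignores case, accents, spacing and word order."""
    cleaned = strip_accents(full_name).casefold()
    return tuple(sorted(cleaned.split()))


def group_possible_duplicates(names):
    """Return lists of names that share a key and deserve a manual look."""
    groups = {}
    for name in names:
        groups.setdefault(match_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


# Invented sample roster entries, not real players.
roster = ["Mäkinen Jussi", "JUSSI MAKINEN", "Anna Virtanen"]
print(group_possible_duplicates(roster))
```

The groups this returns are only candidates. Two different people can share a name, so deciding whether to merge entries remains a judgement call.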
