Posted 1 October 2005
Interviewed by Tom Fagan
Lars Bertram and Rudy Tanzi, together with Matt McQueen, Kristina Mullin, and Deborah Blacker, are curators of AlzGene, our database that annotates and performs meta-analyses on all AD genetic association studies.
This week marks a milestone for AlzGene as the team has now included and systematically analyzed a total of 200 different AD genes, reported in almost 700 independent publications. To learn more about AlzGene's goals, its process, and the work behind the scenes, Alzforum interviewed Bertram and Tanzi.
ARF: Why don't we start at the beginning? How did AlzGene come about?
LB: I know exactly how it started. I'm the one who had to keep track of all these studies that were coming out every day and then try to update Rudy and the group. It was very difficult. So I wrote an e-mail to Rudy and Deborah Blacker suggesting that we make a database wherein we could enter all these genes, and even if nobody else used it, at least it would do a lot of good for us, because I was going crazy. Once that idea was seeded in our minds and we'd mulled it over, then we thought that the Alzheimer Research Forum would be the perfect place to host the database and we approached June Kinoshita with the idea.
RT: From my perspective, for about 3 to 5 years prior to AlzGene, each time I would give a presentation on genetics I would usually try to talk about which genes were being tested by various laboratories and what the results were. I wasn't presenting meta-analyses, just trying to give a scorecard: which genes were looking good, which weren't, and what they were. And it's funny because, you know, I think I was the only one in the AD field who was doing this type of thing and I'd often have other geneticists asking, "Why are you presenting all of these little case-control studies that say yes this gene is associated with AD, or no it isn't?" And my reply was that I think it's interesting to know if there are three studies that say a particular gene is associated with the disease and four that say it is not. I'd have to ask, does lightning strike in the same place three times, despite the fact that four times it didn't? Then maybe there's something going on there. So I realized that many studies were being ignored, that results were not being tallied and not being kept track of, which raises the danger that an interesting gene might be prematurely dismissed. So when Lars had the proposal to approach this in true German style and do this in the correct, systematic way, I thought it was a superb idea.
ARF: What kind of response have you been getting from other geneticists?
LB: I guess there hasn't been a single negative response so far. You have to realize it is a very specialized thing. It is about genetic association findings; it's not even early-onset Alzheimer disease, which a lot of people in the field of genetics work in, so the scope is slightly limited.
RT: But the responses so far are outstanding, both from within and without the AD research community. In general, people have been saying, "AlzGene is great; you've been doing a fantastic job, and we need to do this, too, for a different disease." That was one of the true measuring sticks of the success of it: People that are researching different diseases and who want to do the same thing [for their field]. And when I give a talk and present a few slides about AlzGene, when I look into the audience, I can see everybody writing down the Web site address. You can see that people are really attracted to the idea of one-stop shopping to just get a scorecard. And a lot of groups who aren't doing genetics in particular, but who are doing biology—they may want to look at this before they invest into looking at the function of a new gene associated with disease. For example, they might read one paper, one case-control study that says a certain gene is associated with the disease, and on that basis go on to study that gene biologically. But they might miss the papers that say no, this gene is not associated with AD. But now, before they dive in, they can check out the scorecard for this gene to see if the genetic data is compelling. So I think AlzGene serves a lot of functions. It serves a purpose for the functional biologist as much as the geneticist.
ARF: It can also save people a lot of time.
RT: I think it would be almost impossible for people to do it correctly and comprehensively on their own.
LB: Just take, for example, one of the better studied genes, alpha2 macroglobulin (A2M). I think we have now more than 40 or so studies included, so if you were interested in this and didn't know about this particular gene, you would have to do a PubMed search. Now, just for this one gene alone, which admittedly is one of the better studied genes so it's more complicated, it would probably take you a good 2 or 3 full days of work to locate all these studies, and it's not like they're all on PubMed; some are cross-referenced in certain papers that are not on PubMed, so you'd have to go through all the references only to find the papers. To extract the data would obviously take longer than 2 days, so people would have to do it themselves manually.
ARF: Have you gone back, retrospectively, and gone through all the literature? Is that in progress or is that part of the scope, or did you just, from day one, start with the new papers?
LB: No, we've gone back. The overarching goal is to provide a summary for all the association studies in this field, and basically what that means is that we…well, for the launch that we did a year ago, we started with the usual suspects that we thought people would be most interested in or for which we had many papers on file already, so that was just a start. Now we're basically searching for the words "Alzheimer's" and "association," or "associated" on PubMed and trying to find all the different papers that pop up.
ARF: So, systematically, are you picking one gene at a time, going back through all the literature and getting all the data?
RT: Especially the genes that have been historically worked on, those most in the public eye. It's funny because there was a whole time period of confusion, major confusion, after the discovery that ApoE is a risk factor…. And there's still confusion, and the confusion led to polarization. Some geneticists in the field would just say, "It's all meaningless. It's just garbage in, garbage out, and ApoE is established but none of these other genes mean anything," because they were using ApoE as a measuring stick. Well, it's looking more and more like there are no more ApoEs. ApoE's effect is probably, I would say at the end of the day, that it would be the strongest genetic effect because of its combined prevalence and effects on age of onset and risk. But there will be many other Alzheimer genes to be found, and estimates say there are probably half a dozen with a pretty good effect, not as strong as ApoE, but a moderate effect, and then perhaps dozens with modest effects. And so there's this whole literature that's been out there that has been providing clues as to what these other genes are, and a lot of that literature was being ignored because no single paper could consistently be replicated, and Lars and I wrote papers about the idea of replications and refutations; he called one of them, "Dancing in the Dark." Now the question is, do you turn your head and not even look at all those data because they're too confusing and it's a roller coaster ride of yeses and nos? Well, you systematically go through it and just keep track of it and see which ones look best. That's what AlzGene does. Lars does an incredible job in keeping it up. It's a much greater success than I would have ever imagined; it's a lot more work than I would have ever imagined.
LB: I think in the original application, we said there would be 200 or 300 papers that we would have to find and close to 100 genes. We just recently crossed the 200-gene mark after going through more than 700 papers, and I know several more, and several big genes that are so intensively researched that it takes a while to find all the papers and be ready to put it on the Web. I think we're going to easily hit 800+ papers and well over 250 genes.
ARF: Is the retrospective work all complete or are you still in the process of doing it?
LB: No, we're still in the process of doing it.
ARF: About what percentage do you think you've done already?
LB: Like I said, we're at 700 papers. I would expect there to be 800+, so about two-thirds may be done, I would guess. It's really hard to guess because you find a gene and then you push open a door to a whole batch of papers that study this particular pathway that you've never heard of. It's incredible, in a good sense. I mean it's obviously a very interesting question to ask: What are the genes contributing to Alzheimer's risk? So much work has been done; it's incredible.
ARF: So how could other researchers help?
LB: Use it. Obviously we're trying to be as complete, thorough, and precise as possible in the curation of this, but we're only human and we make mistakes. We try to really limit that, but there are mistakes and we've uncovered mistakes. I think what I can say is that we have this little line everywhere on the Web site, "If you're a study author and your study is missing, or your study is misrepresented, contact us." The whole nature of this is constant updating, usually by adding new studies for specific genes, which is one of the downsides of published meta-analyses. You publish one, and the day a new study comes out on this very same gene, the published meta-analysis becomes out-of-date.
ARF: So do you re-run the meta-analysis every time a new paper comes out?
LB: Yes, every time a new paper on a specific gene comes out, we add it to the pool of data that's already in AlzGene and rerun the meta-analyses to see where the overall signal goes.
RT: And what we've shown is that if you look, for example, at ApoE, and you look at the meta-analysis that was done with the raw genotypes from papers that were published previously, and you look at what the odds ratios were and relative risk with one copy or two copies of the gene, and then you compare it to our method, which is looking at published data and tables, not the actual raw genotype data, and you ask what do you get for risk ratios, it comes out pretty comparable. So that served as a good positive control for this method. It doesn't require the actual raw data from the laboratories but just what's published in tables in terms of genotype frequencies or allele frequencies and the like that go into this algorithm and go into the meta-analysis. But ultimately, a lot of people say to me, "When are you going to be able to actually provide simulations of gene-to-gene interactions or allow four or five genes to be looked at together for risk?" and I've explained that, surprisingly, the programs for doing really complex gene-gene interaction-type calculations for risk just are not around yet. They're still being developed. You can do ApoE plus another gene, but to really carefully and accurately look at the interaction of multiple genes together on risk, that's still a work in progress, so it will be some time before we can do that.
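[Editorial illustration, not part of the interview.] The approach Tanzi describes, pooling risk estimates from the allele counts printed in each paper's tables rather than from raw genotypes, can be sketched roughly as follows. This is a minimal fixed-effect (inverse-variance) example and is not the AlzGene algorithm itself, which the interview does not spell out. The function names and all allele counts are hypothetical.

```python
import math


def log_odds_ratio(case_risk, case_other, ctrl_risk, ctrl_other):
    """Return (log odds ratio, standard error) from one study's allele counts.

    Uses the standard 2x2-table estimate; all four counts must be positive.
    """
    log_or = math.log((case_risk * ctrl_other) / (case_other * ctrl_risk))
    se = math.sqrt(1 / case_risk + 1 / case_other + 1 / ctrl_risk + 1 / ctrl_other)
    return log_or, se


def pooled_odds_ratio(studies):
    """Inverse-variance (fixed-effect) pooled OR with an approximate 95% CI.

    `studies` is a list of (case_risk, case_other, ctrl_risk, ctrl_other) tuples.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for counts in studies:
        log_or, se = log_odds_ratio(*counts)
        weight = 1 / se ** 2  # more precise studies count for more
        weighted_sum += weight * log_or
        total_weight += weight
    pooled = weighted_sum / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * pooled_se),
        math.exp(pooled + 1.96 * pooled_se),
    )


# Hypothetical allele counts, invented for illustration only:
# (risk allele in cases, other allele in cases, risk in controls, other in controls)
example_studies = [(60, 140, 40, 160), (30, 70, 25, 75), (120, 280, 90, 310)]
summary_or, ci_low, ci_high = pooled_odds_ratio(example_studies)
print(f"pooled OR = {summary_or:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Re-running the pooled estimate when a new paper appears amounts to appending one more tuple of counts to the list and calling the function again.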
ARF: What's the technical difficulty there?
LB: Well, first of all, you don't know how many genes. You can say, okay, we have two genes that interact with each other, or three or four, and it gets exponentially more complicated to calculate, and that just wasn't possible until recently. Maybe now it is and there are algorithms that attempt to do this, but it's not really clear…I think it's fairly clear that genes interact, but it's not really clear how to model it correctly, to extract the right type of interaction because you are performing so many different tests and so many different comparisons that you're bound to run into something spurious. For the purpose of this, the question would be, so will you ever be able to do this? As long as we don't have the raw data, the actual genotypes that Rudy was talking about, that won't be possible.
RT: You don't have that, plus you don't have the programs that can do it well, anyway. Getting back to Lars's point about asking for investigators to contact AlzGene, it would also be nice if investigators who use AlzGene can now start working on a certain gene that they didn't work on before. Or maybe they start putting more priority or emphasis on a given gene because it's showed up in a meta-analysis, the PRNP or transferrin, for example. If people start carrying out more studies that were inspired by AlzGene, that would be nice feedback. It would be nice if people could say, "Well, I saw that this gene did well in the meta-analysis. We were working on this; we picked it up again and looked at these variants in it…" It would be nice to get some feedback from people doing functional studies, as well, as to whether AlzGene may have caused an effect, some beneficial effect.
ARF: Right. It's hard to cite a Web site, but…
LB: Hopefully very soon, we'll try to put this into a manuscript because obviously there's a lot more analysis you can do with the data. Things like this have been done in the past, but with a much smaller scope, and I think we can truly address a lot of questions by having this data and putting it all into one analysis, so to speak. For example, "What is the sample size that you need?" There are studies published with 50 cases and 50 controls. Now that is almost certainly not going to find anything except ApoE, maybe. But by looking at the positive findings of AlzGene, we will, very clearly, I think, be able to answer these kinds of questions. What's the lower end of the effect size or the higher end of the effect size, and what are the allele frequencies, or what sample size is needed to detect this typical type of effect, etc.? I think having all these data on all these different genes would be very useful, not only for coming up with a set of genes that are potentially relevant, but also for setting guidelines on how to do this better and more efficiently and more reliably in the future.
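[Editorial illustration, not part of the interview.] Bertram's point about sample size can be made concrete with a rough power calculation for a simple allele-frequency comparison between cases and controls. The sketch below uses a normal approximation for a two-sided test and ignores the negligible opposite tail. The function names, the control allele frequency, and the odds ratio are hypothetical values chosen for illustration; they are not AlzGene results.

```python
import math
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by a control frequency and an allelic OR."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def allelic_power(n_cases, n_controls, control_freq, odds_ratio, alpha=0.05):
    """Approximate power of a two-sided test comparing allele frequencies.

    Each person contributes two alleles, so the effective counts are 2 * n.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    se = math.sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


# Hypothetical modest effect: control allele frequency 0.20, allelic OR 1.25.
for n in (50, 500, 5000):
    print(n, round(allelic_power(n, n, 0.20, 1.25), 3))
```

Under these assumed inputs, power rises steeply with sample size, which is the kind of question the pooled AlzGene data could help answer with real effect sizes and allele frequencies.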
ARF: A technical question: You're saying some of these association studies are too small with maybe 40 or 50 people; is putting ten of those studies together as good as having a single study with 500 people in it?
RT: Not really. It depends on how they're composed and whether they're from different ethnic groups. Ten studies of 50 cases plus 50 controls don't have the same power as a properly designed case-control study with 500 where they've been matched for age and gender and ethnicity. That's not that close. But you can still look across those ten studies and compare the odds ratios and still do the meta-analysis and still get some better idea from the sum than you will from any one of those original studies, which is what AlzGene would do.
LB: It doesn't have the same power. I agree. I wouldn't say it isn't close; it's just another way of looking at data, but there is no other way. If you have those ten different studies, the way that Matt McQueen (who gets full credit for writing the AlzGene meta-analysis algorithm) does the meta-analyses tries to account for the heterogeneity across the different studies. In the end, if you have the same case number, the meta-analysis of the smaller studies should be nearly equivalent to the larger one. I agree that if you have a carefully matched case of samples, that would be superior, but if your case-control sample is just any old U.S. sample with people from all over the world coming here and being analyzed, it's probably not that much different. I honestly don't know, but my hunch would be it wouldn't be that much different.
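[Editorial illustration, not part of the interview.] One common way to account for heterogeneity across studies is a random-effects model. The sketch below uses the DerSimonian-Laird estimator as an assumed stand-in; the interview does not say which method the AlzGene algorithm uses. Inputs are per-study log odds ratios with their standard errors, and the function name and example numbers are hypothetical.

```python
import math


def random_effects_pool(effects):
    """DerSimonian-Laird random-effects pooling.

    `effects` is a list of (log_odds_ratio, standard_error) pairs.
    Returns (pooled log OR, its standard error, tau^2), where tau^2 is the
    estimated between-study variance.
    """
    estimates = [est for est, _ in effects]
    weights = [1 / se ** 2 for _, se in effects]
    weight_sum = sum(weights)

    # Fixed-effect estimate and Cochran's Q (observed heterogeneity).
    fixed = sum(w * y for w, y in zip(weights, estimates)) / weight_sum
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(effects) - 1

    # Between-study variance, truncated at zero.
    scale = weight_sum - sum(w ** 2 for w in weights) / weight_sum
    tau2 = max(0.0, (q - df) / scale) if scale > 0 else 0.0

    # Re-weight each study by its within-study plus between-study variance.
    re_weights = [1 / (se ** 2 + tau2) for _, se in effects]
    re_sum = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / re_sum
    return pooled, math.sqrt(1 / re_sum), tau2


# Hypothetical log odds ratios and standard errors from five small studies.
small_studies = [(0.45, 0.40), (0.10, 0.38), (0.62, 0.42), (-0.05, 0.41), (0.30, 0.39)]
pooled, se, tau2 = random_effects_pool(small_studies)
print(f"pooled OR = {math.exp(pooled):.2f}, tau^2 = {tau2:.3f}")
```

When the studies disagree more than chance would predict, tau^2 becomes positive and the pooled standard error widens, so the summary is less confident than a single large, well-matched study with the same total number of subjects would be.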
RT: Well, we should test it.
LB: We will. And Matt also echoes what Rudy just said, that now that we have a set of interesting genes it will be worthwhile to dig deeper and perhaps contact the study authors and try to get the raw data to enable us to stratify by gender or ApoE and maybe put it all into one set. One of the really important questions that Matt and I have been asking ourselves is how are we going to deal with the vast amount of data that will be pouring out in the near future. What I mean is that, obviously, the genetic technologies are getting more and more sophisticated, and very soon there will be studies appearing that are analyzing 100,000 or more SNPs [single nucleotide polymorphisms] in just one go. So what to do about that data? I guess we'll have to find a way. It's not entirely clear how we should do it, but we will have to, and that's going to be one of the big challenges after this era of genetic study is ending and we clearly move into a new era.
ARF: So how much work is involved in retrieving all this information?
LB: Way more than I budgeted for! It's mostly Kristina Mullin and myself. She's doing a fantastic job finding the papers, punching in those search terms and retrieving data and sorting through it. If you type in "Alzheimer" and "association," only about 10 to 20 percent actually are genetic association studies. The others just use these two terms in the abstract for some reason, so she's finding all the genetics papers. Of course, once you find a new gene, then you have to find all the genetics papers that are published on that gene, and then extract all that data. Some of that information is really difficult to get. Sometimes authors don't respond, or the papers are not here at the Harvard Library, although they pretty much have every journal you can imagine, so it sometimes takes weeks to get the specific paper before it can be entered into the database, and, well this is where I come in; I double-check everything before it goes online. I read every paper again that Kristina has found and has entered, and I double-check all the numbers and all the information that she has entered, and then once that's done, we send it out to Matt, who does the meta-analyses on the genotype data, and once that comes back, I'm the one who actually enters all the numbers onto the Web site, the database, which is then again double-checked by Kristina.
RT: I'll send them a paper once in a while. I'll just e-mail papers with new genetic association findings, and more often than not, I'll get an e-mail back that says, "Dude, we covered this one last month."
LB: I was very proud of this: Sometimes we have papers meta-analyzed and uploaded before they show up on PubMed, because we not only screen PubMed, but we also screen the specialty journals, and sometimes there's a lag of 1 or 2 days.
ARF: Were there any surprises when you started to get all this data together?
LB: Yes. There were genes that I didn't even know…I mean I knew that they existed, but I didn't know that they were studied in particular for the field of Alzheimer's. I just ran into them doing these literature searches and they actually came out positive. One of the genes I think that shows the strongest results so far is something we didn't have on our radar screen ever, and the last word isn't spoken yet because more studies will need to be published but there's a decent number of them, about ten, I think that's based on ten studies, and that was quite a surprise.
ARF: Which gene is that?
LB: Transferrin on chromosome 3. It's not like your typical suspect…although we tested transferrin about 10 years ago. There's no evidence, there's no real linkage to that region on chromosome 3, so it was a bit surprising.
RT: The prion protein (PRNP) gene was a big surprise. One possibility is that the prion protein actually contributes to risk for AD, but the surprise would be if the result is due to an unexpectedly high phenocopy rate of prion disease masked as AD.
ARF: Any other questions that you think AlzGene can answer?
RT: Another potential question you could look at later is, which genes require more isolated populations to see the effect versus not? Some of the association findings you see depend on who can get more geographically isolated populations or look at one particular ethnic group, and some of those same genes may not do very well when you have a mixed, heterogeneous group of just northern Europeans, or whoever. It's also another way to keep track of where a gene still shows an effect despite admixture and heterogeneity of the population you're looking at. That could also then lead later on to say, "This gene may be more prevalent a factor in China versus Europe," or vice versa, so as you start collecting these data you can get some information about populations and genes that play the greatest roles for risk in various populations over time.
LB: It's a well-known fact that ethnic groups or ethnicity is a confounding factor or a source of heterogeneity, especially in these findings, so we are making an attempt to account for that a little bit, and that is by just splitting up our results in our tables into the different major ethnic groups.
RT: And gender, as well. While we don't parse our data based on gender right now, we could eventually go back and ask, "Which genes look equally as…look like risk factors equally in male and female versus more male here and more female here," and just start writing papers based on the data in AlzGene, just analyzing the data of AlzGene, whether it comes to gender or ethnicity.
LB: It's clearly the first step with a project of this scope. We can't answer every question, but certainly, as you say, the genes that come out positive are probably worth digging a little deeper into the original studies, or maybe doing additional studies, and then splitting it up by gender, ApoE, or other factors, or maybe trying to integrate it all in one analysis, but out of those, say, 200 genes, you have to start with a set, and I think this set could be provided in this project.
RT: I'll be excited when we start seeing people use AlzGene to design studies, not just functionally but genetically, where people take the AlzGene data—they could use some threshold for the meta-analysis of what they think makes a gene look promising—and now start testing those particular candidates thoroughly, with full saturation of SNPs in their own samples, or start trying to look across the effects of gender or ApoE versus ethnicity. I look forward to the day when we'll see people writing papers or designing studies and then writing papers based on the foundation that AlzGene has provided. We want AlzGene to be generative and generate new designs, new studies, new papers. So AlzGene is the foundation, and eventually, it will grow into more custom studies of the data, especially like Matt says, when you can start getting into raw genotypes and more sophisticated analyses. It's amazing how far it's come already, but it's still just a beginning.
LB: One of the other things Matt mentioned (his real forte is the meta-analysis) is that while the association studies are one thing, and obviously that's what we're doing right now, there's a whole other set of papers (the linkage studies, full genome screens) that have been published and we can do the same exact thing with those and just try to scope out which regions of the chromosome, or genome, I should say, really look interesting by looking across all these studies. That was actually proposed originally when we applied for this, and we're going to do it, but it's more complicated because, unlike the association analysis where you can get away with mostly what is published in the paper, for the other analyses you really need cooperation from study authors to send you raw data in order to do it well.
ARF: Well, now that they have seen what AlzGene can do, they may be more inclined to collaborate.
LB: Well, communication is often difficult. There are some authors who are really great. If they publish a paper and for some reason they didn't publish their genotype numbers, because other aspects were more important, but just reported the results and I write to them and say, "Hey, great paper. We would like to include it," the next day I have a table with the stuff that I need for AlzGene. Whereas other authors need to be reminded again and again and sometimes they never respond. So that's kind of sad because you can only do so much, and if they don't report the data in their papers, and if they're not willing to supply that (I'm not asking for raw data, remember), then…. That—human interaction—is one of the hindering facts of this, though most of the people I talk to have been very, very nice. I think one of the really cool things about this is it's so open and so unbiased in any way and you have all the data online. There's nothing secretive about it. It's all published material. Sometimes there may be mistakes; then, like I said, if they're uncovered, we're very thankful for that, but you could basically just go ahead, take this whole set of information and do your own little analysis, and see whether what we say is right, whether we entered the data correctly or whether we came up with the right results. We have all the cards on the table and it's all very open—at least that's our overarching aim.
ARF: What about expansion of AlzGene? Can you see room to expand it in any way?
LB: Well, there's actually a new update coming up with new features added, and more refined analyses, a whole new section touching on the issue of publication bias, which is always the kind of the question that's in the background in this type of analysis. Matt's working on that, and that's going to come up very soon. But I guess one natural way to expand it would be to other diseases, which we're planning to do, and I guess another way which was already planned was to kind of integrate it—all of this is just knowledge management—with biological information for the candidate genes that we're looking at. So, if you had, let's say, transferrin, which looks interesting genetically, then we'd have a link to maybe another site that would look at the functional studies, or mouse studies, or anything that may be relevant….
RT: The problem with some of the functional bioinformatics programs is that you're using biological terms that tend to get a little loose. It's a bigger challenge, I think, to try to summarize functional data because you run the risk of losing rigor with regard to which studies are done really well and which have been done less well. So it's a little tougher. The other thing is that we need to have the best hits right up front, which genes are looking the best, the top ten based on meta-analyses, because when I give a presentation and I show a slide as part of it, I show the update of which genes are looking more interesting, I always get a question as to why isn't that on the Web site up front; why do I have to go searching for each one? So it would be good for people to know which ones are looking good right off the bat, and that's being programmed.
ARF: So what other disease might you want to expand into?
LB: Well, we are interested in Parkinson disease. We wrote an application for this to the Michael J. Fox Foundation, and they liked it and decided to fund it. It's going to start very soon. It will essentially follow the same format, except be on Parkinson disease. There have been other people from other diseases asking us whether we could imagine doing a similar thing together with them and build a database for other diseases, and I think it would be very useful and make a lot of sense.
ARF: Are you setting your sights on any other expansions? You mentioned the linkage studies.
LB: It will happen hopefully when we get data in from people, and that was actually one of the first things we planned to do because Matt came from that linkage/meta-analysis side of things. We can repeat it for other diseases and try to cross-link it with other sites, but I guess this is as far as we can go.
RT: It's amazing, though, when you look back before AlzGene, people would come to me at various meetings and say, "So what do you think of this gene or that gene?" or they might say, "So-and-so says that this gene's association with AD is not real, but I talked to another person who believes it is. What do you think?" So it almost seemed that whether the field would be favorably or unfavorably predisposed to believing a gene's role in AD was based on canvassing AD geneticists at meetings or by e-mail. Every geneticist had a different opinion, and I would have to say that before all the data were provided in one place with all the details required to form an opinion, those opinions were largely uninformed and based on one or two studies here or there, maybe one positive study and one negative study. So nobody, including us, could really go to any meeting as a geneticist and properly guide people doing biology to say, "Yeah, we think this gene matters." And when someone asked about a certain gene, they would get three different answers from three different geneticists. It wasn't until you could get all of this information in one place and, even more powerfully, carry out novel meta-analyses on it that you could see how the data look. Now you don't have to decide if you believe geneticist A, B, or C, or a talk you heard last week or the week before. You go here and just get a purely objective analysis of a compendium of all the results. So we can think about what the future will hold and how to make it even better, but when you just think about the past you realize we've made a huge step forward. For any disease that has a complex genetic inheritance profile, especially psychiatric disorders, which are a nightmare compared to Alzheimer's, you're going to need this type of database, too. This is setting a foundation for the future of how you can keep track of the genetics of complex disease.
LB: And you have to realize that there are right now about ten to 15 studies published per month, just in the AD field, that we have to find and incorporate, but just imagine yourself being the researcher. After 3 months, you'll lose track.
ARF: Have you thought of having people deposit their papers like people do with sequences at Genbank? Would that make things any easier?
RT: The guideline is to only include papers from original reports from refereed journals that are available in English. So, for example, if you have a paper that's a meeting report, or a paper for which the full text is not available in English from the publisher, it would not be included. At Genbank, on the other hand, you get lots of information coming in that's not published. It's different, because, you know, sequence in genomics will sort itself out. Genetics is much more complicated than genomics because it's not structure and sequence; you're looking at inheritance patterns. It requires good old-fashioned detective work. We can get families and clinical diagnoses and the like, so we have to be much more careful in terms of genetics. It won't work itself out as easily. You just have to draw a line and say, "Okay, we're just going to use refereed reports."
RT: When you use Genbank as a comparison—Genbank took a long time to get going, in fact, it's still evolving, and it's been evolving for 2 decades, now. AlzGene has just started, but while Genbank may have to cover more ground and cover more data, for genetics I think you have to be a little more clever in terms of how to compile the data, analyze it, and present it.
LB: It needs more curation…
ARF: …and more analysis.
RT: Yes, it needs more curation, more analysis, just like genetics itself, when you're looking at inheritance of DNA in families, you can't just stop at the DNA. You have to look at the people who are getting the disease and look at their clinical features and age of onset, and their phenotype. So genetics adds the phenotype to the genomic component, and so it becomes much more complicated. These are the early days, but I wonder whether this interview is similar to what might have happened 2 decades ago when Genbank was first being put together. Who knows what this is going to grow into? That might be a little bit grandiose, but on the other hand, there's nothing else like this in terms of genetics.