Wednesday, February 17, 2010

CRL update problems


As you probably know, CRLs are a vital part of the x509 security model; they are the only way CAs can invalidate compromised or otherwise revoked certificates. Every time a user credential is presented to Grid-aware software, the appropriate CRL is checked to make sure the credential has not been revoked.


CRLs have expiration dates, like everything else in the x509 world. A side effect of this is that if a CRL is not updated locally (by being downloaded from the CA site) before it expires, all credentials from that CA will be treated as invalid. So sites must download fresh updates frequently; the recommended policy in OSG is to do it once every 6 hours. The fetch-crl tool is used for this task.
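To make the expiration issue concrete, here is a minimal Python sketch of how a site could check how close a local CRL is to expiring. It assumes the expiry line was obtained with "openssl crl -in <file> -noout -nextupdate", which prints a line of the form "nextUpdate=Feb 17 12:00:00 2010 GMT". The function names, the status labels, and the 6-hour warning margin (chosen to match the recommended update interval) are illustrative assumptions and not part of fetch-crl.

```python
from datetime import datetime, timedelta, timezone


def parse_next_update(openssl_line):
    """Parse a line such as 'nextUpdate=Feb 17 12:00:00 2010 GMT'.

    Returns a timezone-aware UTC datetime. Month names are parsed with
    %b, so this assumes an English/C locale.
    """
    _, _, value = openssl_line.strip().partition("=")
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def crl_status(next_update, now, warn_margin=timedelta(hours=6)):
    """Classify a CRL as 'expired', 'expiring', or 'ok'.

    warn_margin is a hypothetical threshold: a CRL that expires within
    one update interval is flagged before credentials start failing.
    """
    if now >= next_update:
        return "expired"
    if next_update - now <= warn_margin:
        return "expiring"
    return "ok"
```

A site could run something like this over every CRL file after each fetch-crl run and alert on anything that is not "ok", which would catch a failed download before the old CRL actually expires.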


Several OSG site admins have notified us that fetch-crl occasionally fails to download a CRL for one or more CAs, and have asked us (the security team) for guidance on what to do in those events.


Ideally, we would like to debug each and every such event and make sure it is just a transient error, but given the distributed nature of the Grid this will not scale. OSG has hundreds of sites and there are hundreds of CAs, so the problem is combinatorial: any site/CA pair can fail. Unless such errors are very rare events, we need an alternative approach.


One possibility would be for OSG to centrally serve the CRLs from all the supported CAs; this would allow us to track problems more easily, since we could debug CA and site problems separately, thus reducing a quadratic problem to a linear one. Of course we still want to allow sites to go directly to the CAs for the CRLs if they want, for example when they are supporting a CA that is not part of the official OSG CA distribution; but we offer only limited support in such cases anyhow.
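The quadratic-versus-linear argument can be illustrated with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up, and the figures of 100 sites and 100 CAs are placeholders standing in for "hundreds", not actual OSG numbers.

```python
def links_to_debug(num_sites, num_cas, central_cache):
    """Count the download paths that can fail independently.

    Without a central cache every site fetches from every CA, so there
    is one path per site/CA pair. With a central cache, each CA is
    fetched once by the cache and each site fetches from the cache.
    """
    if central_cache:
        return num_sites + num_cas
    return num_sites * num_cas


# Placeholder figures, for illustration only.
direct = links_to_debug(100, 100, central_cache=False)
cached = links_to_debug(100, 100, central_cache=True)
print(direct, cached)
```

With these placeholder figures the direct model has 10000 site/CA paths to watch, while the centrally served model has 200.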


However, the above solution is likely to create its own set of problems (still to be determined), so before we start any design and implementation effort we would like to know how important it is for sites that we solve this.


Right now we don't even have clear statistics on how often sites have problems with their CRLs, or on whether the problems are due to specific CAs. So input from sites is highly desirable.