Hi I have a Django Project with the following models
class Organization(models.Model):
name = models.CharField("Name", max_length=128, unique=True)
description = models.CharField("Description", max_length=256)
class Scholar(models.Model):
organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
name = models.CharField("Name", max_length=128)
title = models.CharField("Title", max_length=256)
def get_latest_snapshot(self):
return self.snapshotscholar_set.latest('date_crawled')
class SnapshotScholar(models.Model):
scholar = models.ForeignKey(Scholar, on_delete=models.CASCADE)
date_crawled = models.DateTimeField("Date Crawled", auto_now_add=True, db_index=True)
title = models.CharField("Title", max_length=256)
class SnapshotScholarPublication(models.Model):
snapshot_scholar = models.ForeignKey(SnapshotScholar, on_delete=models.CASCADE)
title = models.CharField("Title", max_length=256)
citation_count = models.IntegerField()
Now when I want to get a count of the number of scholars per organization I can do Organization.objects.annotate(num_scholars=Count('scholar')). But how do I get the number of publication per organization when I only want to count publications of the latest snapshot. That is, I want it sorted by SnapshotScholar.date_crawled, and I want all the publications on the latest SnapshotScholar in the database.
Going by some of the questions on here I managed to create this SQL -
SELECT COUNT(pub.id) as publications, org.id
FROM (main_snapshotscholarpublication pub, main_snapshotscholar snap, main_scholar scholar, main_organization org)
INNER JOIN (
SELECT MAX(main_snapshotscholar.date_crawled) as latest_date, main_snapshotscholar.scholar_id as 'id'
FROM main_snapshotscholar
GROUP BY main_snapshotscholar.scholar_id
) as latest_snap ON (latest_snap.id = snap.id)
WHERE pub.snapshot_scholar_id = snap.id
AND snap.scholar_id = scholar.id
AND scholar.organization_id = org.id
GROUP BY org.id
The results I'm getting with this raw SQL is an error margin of 1-5% from the actual count numbers. Can someone please help me figure out how to get the right results?
Thanks
EDIT: As per request of one of the people answering the question here is some sample data - https://pastebin.com/4ZJkymeb . Just load the data with python manage.py loaddata data.json
You can use Subquery for getting the results. For example:
from django.db.models import OuterRef, Subquery, Q, F, Count
subquery = SnapshotScholar.objects.filter(scholar=OuterRef('pk')).order_by('-date_crawled')
queryset = Scholar.objects.annotate(max_scholar_snapshot=Subquery(subquery.values('pk')[:1])).annotate(publication_count=Count('snapshotscholar__snapshotscholarpublication', filter=Q(snapshotscholar=F('max_scholar_snapshot'))))
queryset.values('organization', 'pk', 'publication_count')
Here I am first annotating the SnapshotScholar information with the queryset by Subquery. I am ordering the SnapshotScholar using date_crawled, then attaching the latest one's id to the queryset. Then I am running the Count for SnapshotScholarPublication based on filtering the value of SnapshotScholar which was found in last step.
Building up on @ruddra's answer, have a try with this:
snapshot_subquery = SnapshotScholar.objects.filter(
scholar__organization=OuterRef('pk'),
).order_by('-date_crawled')
organizations = Organization.objects.annotate(
latest_snapshot=Subquery(snapshot_subquery.values('pk')[:1]),
).annotate(
latest_snapshot_publication_count=Count(
'scholar__snapshotscholar__snapshotscholarpublication',
filter=Q(scholar__snapshotscholar=F('latest_snapshot'))
),
)
It is important to first order the snapshot scholar by descending date_crawled, and then limiting the subquery rows to one row..[Django-doc] using slice notation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With