Django: Annotate to find duplicates and delete

The django-taggit library allows you to add a Tag for any resource in your Django site. The 1.X release of django-taggit includes a breaking change to add a unique constraint.

1.0.0 (2019-03-17)
Backwards incompatible: Remove support for Python 2.
Added has_changed() method to taggit.forms.TagField.
Added multi-column unique constraint to model TaggedItem on fields content_type, object_id, and tag. Databases that contain duplicates will need to add a data migration to resolve these duplicates.
Fixed TaggableManager.most_common() to always evaluate lazily. Allows placing a .most_common() query at the top level of a module.
Fixed setting the related_name on a tags manager that exists on a model named Name.

This means that any Custom Tag that was built under the 0.X version of django-taggit will need to have the duplicates removed from custom TaggedItem model class before the unique constraint can be applied. Here’s an example of how to use annotate to identify duplicate entries and save and exclude the initial row from the delete.

duplicate_tags = MYMODEL.objects.filter(content_type_id=FILTERED_CONTENT_ID)
    .values("object_id", "content_type_id", "tag_id")
    .annotate(count=Count("object_id"))
    .annotate(save_id=Min("id"))
    .filter(count__gt=1)
    .values_list("save_id", "object_id", "content_type_id", "tag_id")
for save_id, object_id, content_type_id, tag_id in duplicate_tags:
        MYMODEL.objects.filter(
            object_id=object_id, 
            content_type_id=content_type_id, 
            tag_id=tag_id
        ).exclude(id=save_id).delete()

Troubleshooting Django

To see what SQL the command is executing:

from django.db import connection
MODEL.objects.latest()
connection.queries

This shows what latest is doing in SQL

When you get a query back, then you can look at it with sql_with_params

MODEL.objects.filter(created_by__username="fred").query.sql_with_params()

The Django Debug Toolbar is useful to look at the SQL queries and how many times things are duplicated.

 

Django – New resource

Checklist of things to update in Django when making a new resource:

  1. add model to appropriate models.py file
  2. add migration to appropriate application – python manage.py makemigrations
  3. add serializer to app/serializers.py
  4. add ViewSet to app/apiviews.py or app/views.py
  5. add route to router – apirouter.py or urls.py
  6. add admin to app/admin.py
  7. add Filter (if using django_filters) to app/filters.py
  8. update templates in app/templates/
  9. add test in app/tests.py

 

3 – Serializer

class AppModelSerializer(serializers.HyperlinkedModelSerializer):
class Meta:
model = models.MODELNAME

4 – ViewSet

class AppModelViewSet(viewsets.ModelViewSet):
queryset = models.MODELNAME.objects.all()
serializer_class = serializers.MODELSERIALIZER
filter_fields = (‘name’, ‘slug’)

5 – Router

router.register(rROUTE, APP_VIEWS.AppModelViewSet)

6 – Admin

class AppModelAdmin(admin.ModelAdmin):
list_display = (‘name’, ‘slug’, ‘order’)

admin.site.register(models.AppModel, AppModelAdmin)

7 – Filter

class AppModelFilter(django_filters.FilterSet):
class Meta:
model = models.AppModel
fields = [‘name’, ‘slug’]

OR

class AppModelFilter(django_filters.FilterSet):
class Meta:
model = models.AppModel
fields = {

‘name’: [ ‘exact’, ‘startswith’, ‘icontains’],

‘other_reference’: [‘exact’]

}

Django views

RetrieveUpdateDestroyAPIView will do all the CRUD things

class CoinDetailView(RetrieveUpdateDestroyAPIView):
  permission_classes = (IsAuthenticated,)
  serializer_class = CoinSerializer
  queryset = Coin.objects.all()

That’s quick!

http://www.django-rest-framework.org/api-guide/generic-views/#retrieveupdatedestroyapiview

views.py

from .models import Coin
from .serializer import CoinSerializer

class CoinViewSet(mixins.createModelMixin, mixins.ListModelMixin, mixins.UpdateModeMixin, mixins.RetrieveModelMixin, mixins.DestroyModelMixin, viewsets.GenericViewSet):
  #stuff

 

router.py

from django.conf.urls import url, include
from django.contrib import admin

from home.views import CoinViewSet
from rest_framework.routers import DefaultRouter

router = DefaultRouter()
router.register('coins', CoinViewSet)

urlpatterns = [
  url(r'^api/', include(router.urls))
]

Just need to add the router.urls call

 

GraphQL in the Wild…

GraphQL is an interesting technology with a lot of good points, but some drawbacks.

Pros:

  • Self explorable
  • easy documentation
  • intuitive
  • simplifies client side logic

Cons:

  • Graphene is the only library and still young.
  • v1.0 released in Sept2016 and not very updated.
  • Docs are not very good
  • Known bugs with resolver – the order graphene does it by default is wrong
  • Source code is complicated with lots of meta-programming, so hard to update
  • Authorization
  • Denial of Service
  • Performance!

Performance issues with REST

Showing a lot of related data requires lots of requests

Serialization takes a lot of time, especially with REST resources that contain a lot of information that isn’t used on most of the pages

Dashboards Challenge

GraphQL vs REST! 5 s vs 10s result even after reducing the data sent by the REST.

GraphiQL app from Github

{
  viewer {
    login
    repositories (first:10) {
      edges {
        node {
          name
        }
      }
   }
}

Get only the data you want, strong typing, nested fields

With Django?

Graphene!

pip install graphene_django
INSTALLED_APPS += ['graphene_django']
urlpatterns += [url(r'^graphql', GraphQLView.as_view)]

Define nodes:

class TaskNode(DjangoObjectType):
  class Meta:
    model = Task

Whenever you have a field you can have a resolver for it that defines how to get the field.

Define queries:

class Query(ObjectType):
  class Meta:
    model = Task

Define the Relay:

{
  goals (first: 5, after: "cursor") {
    name
    progress
  }
}

Pagination

class GoalNode(DjangoObjectType):
  progress = graphene.Float(description='The average task progress')
  pk = graphene.Int()
  tasks = graphene.List(TaskNode)

  @graphene.resolve_only_args
  def resolve_pk(self):
    return self.pk

  @graphene.resolve_only_args
  def resolve_tasks(self):
    retrun self.key_results.filter(closed=False)

  class Meta:
    model = Objective
    filter_fields = ['name',]
    exclude_fields = ['key_results',]

class OwnerNode(DjangoObjectType):
    full_name = graphene.String()
 

 

GraphQL will display the description in the GraphiQL browser that’s useful!

Use cases that make it better than REST:

Complex views (dashboards, summaries, stats)

Complex Writes

Builders, autosaving, auto-updating.  Checks the types!  Define mutations!

Makes the change and then re-queries and returns the response in one call instead of many!

The backend knows what the mutation is and does it and the frontend just says what something should change to.

Make sure your mutations are @atomic!

Add it as a mutation on your schema.

Authorization

Performing authorization on each resolver is a pain in the ass, so you have to extend Graphene to authenticate on the connection.

Extend DjangoFilterConnectionField

connection_resolver: add user authentication

resolve_connection: get auth class from node and apply auth

Throttling/DOS

  1. Whitelist for allowed queries – all of the queries in the backend in python, and the frontend called the query by name.  (def blah, return graph query)
  2. Maximum limit
  3. Maximum query cost – first: 50 repositories, first 10 issues = 550 total nodes.
  4. Rate limiting based on query cost.  Based on the umber of database connections you’re getting.

Performance

Built to work, not to be performant.  Doesn’t use select_related or prefetch related. Needs to reduce the count() calls.

Data Loader (facebook’s answer to performance issues) – return a promise that analyses the data requested and reduces the requests made

goals (first: 0) {
   totalCount
}

Do the count in the graph instead of in the front end.

Need to write queries in a smart way!

Resources

  • graphql.org
  • Zero to GraphQL (video)
  • Intro to GraphQL (blog)

 

Django User Scenarios

#1 Proxy Model

Create an alias to an existing model.

Good: Custom model manager, custom model methods

Limitations: Can’t add additional attributes, can’t change the database

Author(User):
  class Meta:
    proxy = True

  def __str__(self):
    return self.first_name

Author is just like the user model, but has some methods that aren’t on the built in User class or that override the defaults.

#2 Using a one to one relationship

Create a new model that has a one to one relationship with the User Model.

Good: Custom fields

Limitations: Another model in the database (migration management), requires additional steps in the code to handle the model relations

class Profile(models.Model):
  user = models.OneToOneField(User)
  # add other custom fields desired
  location = ...
  bio = ...

Accessing:

user.first_name
user.profile.location

Need to add:

  • Handling User saves & updates & deletes
  • Handling Profile creation on User creation
  • Include in Django Admin

Most of these are included in the Django Documentation.

https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#extending-the-existing-user-model

Note: Create at the beginning of the project.

#3: Custom User Model

Good: Flexibility

Limitations: User Manager, Admin Form for the Custom User Model

class MyUser(AbstractBaseUser):
  email = ...
  fav_type = ...

  USERNAME_FIELD = 'email'

settings.py

AUTH_USER_MODEL = 'myapp.MyUser'

https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#substituting-a-custom-user-model

NOTE: Create custom user model at the beginning of your project

A full example of a custom User

https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#a-full-example

These are notes from #DjangoCon2017.  Here’s a link that goes through much more in depth for each of these options:

https://simpleisbetterthancomplex.com/tutorial/2016/07/22/how-to-extend-django-user-model.html

Django on Lambda

Use a project called zappa that makes it easy!

Gotcha #1 : Security!  ALLOWED_HOSTS

Automatically generated or subdomain added to ALLOWED_HOSTS

Gotcha #2: Static Files

Use django-storages and add ‘storages’ to the installed_apps, configure it, and then collect static, and zappa update

Gotcha #3: Database

  • Can you use a queue or something else instead?

nodb!

  • Can you use S3-database?

zappa-bittorrent-tracker using S3-database

  • No?

Use AWS RDS (expensive but easy) or EC2 (cheap but annoying).

Need a VPC to allow the lambdas but prevent random internet traffic. Two private subnets (for redundancy), Allow TCP on 5432 (postgres).

add vpc to django settings

set the host/port in the DATABASES setting in Django

Install zappa-django-utils to interact with the django database inside the VP

Gotcha #4: Encryption!

Use ACMI or LetsEncrypt  with zappa certify