Django: Annotate to find duplicates and delete

The django-taggit library allows you to add a Tag for any resource in your Django site. The 1.X release of django-taggit includes a breaking change to add a unique constraint.

1.0.0 (2019-03-17)
Backwards incompatible: Remove support for Python 2.
Added has_changed() method to taggit.forms.TagField.
Added multi-column unique constraint to model TaggedItem on fields content_type, object_id, and tag. Databases that contain duplicates will need to add a data migration to resolve these duplicates.
Fixed TaggableManager.most_common() to always evaluate lazily. Allows placing a .most_common() query at the top level of a module.
Fixed setting the related_name on a tags manager that exists on a model named Name.

This means that any custom TaggedItem model built under the 0.X version of django-taggit will need its duplicates removed before the unique constraint can be applied. Here’s an example of how to use annotate to identify duplicate entries, save the id of the first row in each group, and exclude that row from the delete.

from django.db.models import Count, Min

duplicate_tags = (
    MYMODEL.objects.filter(content_type_id=FILTERED_CONTENT_ID)
    .values("object_id", "content_type_id", "tag_id")
    .annotate(count=Count("object_id"))
    .annotate(save_id=Min("id"))
    .filter(count__gt=1)
    .values_list("save_id", "object_id", "content_type_id", "tag_id")
)

for save_id, object_id, content_type_id, tag_id in duplicate_tags:
    MYMODEL.objects.filter(
        object_id=object_id,
        content_type_id=content_type_id,
        tag_id=tag_id,
    ).exclude(id=save_id).delete()
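
The keep-the-minimum-id-per-group logic the query expresses can be sketched in plain Python, outside the ORM. The tuples below are hypothetical (id, object_id, content_type_id, tag_id) rows, not real data:

```python
from collections import defaultdict

# hypothetical rows: (id, object_id, content_type_id, tag_id)
rows = [
    (1, 10, 2, 5),
    (2, 10, 2, 5),   # duplicate of row 1
    (3, 11, 2, 5),
    (4, 10, 2, 5),   # another duplicate of row 1
]

# group row ids by the (object_id, content_type_id, tag_id) key
groups = defaultdict(list)
for row_id, object_id, content_type_id, tag_id in rows:
    groups[(object_id, content_type_id, tag_id)].append(row_id)

# for each group with duplicates, keep the minimum id (the "save_id")
# and mark the rest for deletion
to_delete = []
for key, ids in groups.items():
    if len(ids) > 1:
        save_id = min(ids)
        to_delete.extend(i for i in ids if i != save_id)

print(sorted(to_delete))  # [2, 4]
```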

Write an API with Django Rest Framework

The API built using DRF needs:

  • Router – connects requests to a ViewSet
  • ViewSet – what am I getting from the database and displaying?
  • Serializer – how am I turning the data into a JSON object?
from django.urls import reverse
from rest_framework.test import APITestCase

class TestMyModelViewset(APITestCase):
  def setUp(self):
    self.url = reverse('mymodel-list')
    self.instances = [MyModel() for i in range(3)]
    for item in self.instances:
      item.save()

  def test_list_view(self):
    response = self.client.get(self.url, format='json')
    self.assertEqual(response.status_code, 200)
    self.assertEqual(len(response.data), 3)

  def test_can_create(self):
    data = {'name': 'test name', 'category': 1}
    response = self.client.post(self.url, data, format='json')
    self.assertEqual(response.status_code, 201)

  def test_detail_view(self):
    sample_id = MyModel.objects.first().id
    detail_url = reverse('mymodel-detail', args=[sample_id])
    response = self.client.get(detail_url, format='json')
    self.assertEqual(response.status_code, 200)


Python string matching and regular expressions

Python is very unintuitive compared to other languages for regular expressions and string matching. You have to import a separate library to get access to the functionality:

import re

Then the python man page is really confusing.  It starts talking about compiling regular expressions.  That may be something that you want to do when you’re doing a thousand searches, but mostly you just want to see if a string matches another string once.  The JavaScript way of doing this is just STRING.match(STRING), but the python version is re.match(REGEX, STRING).  Not only that, but there’s a HIDDEN CAVEAT in the python version – re.match only matches from the START of the string.  I don’t know why this is the case, but the method you want to use is re.search(REGEX, STRING).  This returns a SRE_Match object… ??  Apparently you use .group() on that object to get the actual match back… but even if you use a REGEX that should match multiple things in the string, it’ll only give you the first one…  Why is this so difficult again??

If you want to avoid the re package AND you have a string that you want to find a match only at the BEGINNING or only at the END, then you can use STRING.startswith(STRING) or STRING.endswith(STRING) to return a True or False Boolean.  Note that this doesn’t actually give you the match… it just tells you that it exists or not…

If you need to find multiple matches and you’re already importing re then you can use the findall method – re.findall(REGEX, STRING) , which strangely returns a list (of course)… though that’s better than returning an SRE_Match object I guess.
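
A few concrete examples of the behaviors described above, using a throwaway string:

```python
import re

s = "one fish two fish"

# re.match only matches from the START of the string
print(re.match(r"fish", s))         # None -- "fish" isn't at the start
print(re.match(r"one", s).group())  # one

# re.search finds the first match anywhere in the string
print(re.search(r"fish", s).group())  # fish

# re.findall returns every match as a list
print(re.findall(r"fish", s))  # ['fish', 'fish']

# startswith/endswith only tell you True/False, not what matched
print(s.startswith("one"))  # True
print(s.endswith("fish"))   # True
```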

Get objects that match a specific attribute/property from a list

You have a list of objects where you want to find the object(s) in the list that have a particular value for a property or attribute.

fruits = [{'fruit': 'apple' }, {'fruit': 'pear'}, {'fruit': 'banana'}, {'fruit': 'grape'}]

Python:

list(filter(lambda item: item['fruit'] == 'pear', fruits))
[item for item in fruits if item['fruit'] == 'pear']

Javascript:

fruits.filter(item => item['fruit'] === 'pear')
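
If you only need the first matching object (or a default when nothing matches), a generator expression with next() avoids building the whole list:

```python
fruits = [{'fruit': 'apple'}, {'fruit': 'pear'}, {'fruit': 'banana'}, {'fruit': 'grape'}]

# first match, or None if nothing matches
first_pear = next((item for item in fruits if item['fruit'] == 'pear'), None)
print(first_pear)  # {'fruit': 'pear'}

no_match = next((item for item in fruits if item['fruit'] == 'kiwi'), None)
print(no_match)  # None
```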

Simulating data

Here’s an interesting post on creating data that have patterns for testing:

http://www.quantumforest.com/2011/10/simulating-data-following-a-given-covariance-structure/

@mjjohns1 on the Metis Slack posted the following python snippet to do a similar generation:


import numpy as np
import pandas as pd

# number of observations per variable/feature/column to simulate
nobs = 100
# correlation matrix (assuming that all variables have unit variance)
M = np.array([[1.0, 0.7, 0.7, 0.5],
              [0.7, 1.0, 0.95, 0.3],
              [0.7, 0.95, 1.0, 0.3],
              [0.5, 0.3, 0.3, 1.0]])
# perform a Cholesky decomposition of the correlation matrix
L = np.linalg.cholesky(M)
# set number of variables to the number of rows
nvars = M.shape[0]
# create a matrix of independent standard normal draws, nvars x nobs
dmat = np.random.normal(size=(nvars, nobs))
# multiply the data matrix by L and transpose, so each row is an
# observation following the M correlation structure
r = (L @ dmat).T
# convert the r data matrix to a dataframe
df = pd.DataFrame(r)
# Check the resulting correlation matrix
df.corr()
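
A quick sanity check on the Cholesky step: multiplying L by its own transpose reconstructs the original correlation matrix, which is why multiplying independent normals by L induces that correlation structure:

```python
import numpy as np

M = np.array([[1.0, 0.7, 0.7, 0.5],
              [0.7, 1.0, 0.95, 0.3],
              [0.7, 0.95, 1.0, 0.3],
              [0.5, 0.3, 0.3, 1.0]])

# Cholesky gives a lower-triangular L with L @ L.T == M
L = np.linalg.cholesky(M)
print(np.allclose(L @ L.T, M))  # True
```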

Django User Scenarios

#1 Proxy Model

Create an alias to an existing model.

Good: Custom model manager, custom model methods

Limitations: Can’t add additional attributes, can’t change the database

class Author(User):
  class Meta:
    proxy = True

  def __str__(self):
    return self.first_name

Author is just like the User model, but has some methods that aren’t on the built-in User class or that override the defaults.

#2 Using a one to one relationship

Create a new model that has a one to one relationship with the User Model.

Good: Custom fields

Limitations: Another model in the database (migration management), requires additional steps in the code to handle the model relations

class Profile(models.Model):
  user = models.OneToOneField(User, on_delete=models.CASCADE)
  # add other custom fields desired
  location = ...
  bio = ...

Accessing:

user.first_name
user.profile.location

Need to add:

  • Handling User saves & updates & deletes
  • Handling Profile creation on User creation
  • Include in Django Admin

Most of these are included in the Django Documentation.

https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#extending-the-existing-user-model

Note: Create at the beginning of the project.

#3: Custom User Model

Good: Flexibility

Limitations: User Manager, Admin Form for the Custom User Model

class MyUser(AbstractBaseUser):
  email = ...
  fav_type = ...

  USERNAME_FIELD = 'email'

settings.py

AUTH_USER_MODEL = 'myapp.MyUser'

https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#substituting-a-custom-user-model

NOTE: Create custom user model at the beginning of your project

A full example of a custom User

https://docs.djangoproject.com/en/1.11/topics/auth/customizing/#a-full-example

These are notes from #DjangoCon2017.  Here’s a link that goes through much more in depth for each of these options:

https://simpleisbetterthancomplex.com/tutorial/2016/07/22/how-to-extend-django-user-model.html

Event Driven Architecture with Lambda

The advantage of using Lambda is that there’s no queue system (like Celery) and nothing blocks: you use events to spawn a Lambda for each event that comes in. Configure and use this with the zappa project.

To resize a thumbnail on a user uploading a picture, you run a function based on a save to s3.

events:
  - function: users.util.process_avatar
    event_source:
     arn: arn:aws:s3:::your_bucket_name

Don’t get stuck in an infinite loop!  Use two buckets (processed/raw), or set path of new object so that it doesn’t trigger your event.

If your function takes a long time to run, you don’t want to handle it in the request/response cycle because it’ll time out. Use from zappa.async import task: the API endpoint calls the task-wrapped function, which spawns a new async Lambda that runs the longer code while the endpoint returns immediately.

Django on Lambda

Use a project called zappa that makes it easy!

Gotcha #1: Security! ALLOWED_HOSTS

The automatically generated API Gateway domain (or your added subdomain) must be in ALLOWED_HOSTS.

Gotcha #2: Static Files

Use django-storages: add ‘storages’ to INSTALLED_APPS, configure it, then run collectstatic and zappa update.

Gotcha #3: Database

  • Can you use a queue or something else instead?

nodb!

  • Can you use S3-database?

zappa-bittorrent-tracker using S3-database

  • No?

Use AWS RDS (expensive but easy) or EC2 (cheap but annoying).

You need a VPC that allows the Lambdas through but prevents random internet traffic: two private subnets (for redundancy), allowing TCP on port 5432 (Postgres).

Add the VPC configuration to your settings, set the host/port in the DATABASES setting in Django, and install zappa-django-utils to interact with the Django database inside the VPC.

Gotcha #4: Encryption!

Use ACM or Let’s Encrypt with zappa certify.

Bringing Functional Programming into an Imperative World

These are notes from the Bringing Functional Programming into an Imperative World talk at DjangoCon 2017 by Derik Pell (@gignosko on GitHub), with a demonstration repository at https://github.com/gignosko/DjangoCon_2017

Functional programming is expressive, efficient, plays well with concurrent/parallel programming (using more and more cores rather than faster processors), and increases safety in code if done well.

You can fake an immutable data structure using deepcopy… but it takes some extra time.

Recursion takes the place of loops in many functional languages.

You can mimic functional programming with list comprehensions and lambdas:

filter

[x for x in list_1 if x % 2 == 0]

map

[(lambda x: x*x)(x) for x in list_1]
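
Python also has the built-in map and filter functions, which are the direct functional equivalents of the comprehensions above. They return lazy iterators, so wrap them in list() to see the results:

```python
list_1 = [1, 2, 3, 4, 5]

# filter: keep only the even numbers
evens = list(filter(lambda x: x % 2 == 0, list_1))
print(evens)  # [2, 4]

# map: square every element
squares = list(map(lambda x: x * x, list_1))
print(squares)  # [1, 4, 9, 16, 25]
```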

FP can be less verbose, more efficient, and more intuitive

BUT! it can be slower, recursion can blow the stack, and functional code looks weird