Automated video editing to avoid seeing my relatives in a wedding video

Harsh Vardhan Solanki
8 min read · Feb 25, 2021

Are you lazy and self-centered like me, and hate going through hours of video footage just to find the few moments where you (or a person of interest) appear?

Perhaps you’re preparing a video for a girlfriend’s birthday and want only the happy memories of the two of you. Maybe you’re scouring security footage for just yourself, to check something you might have forgotten. Or maybe you want to produce a highlight reel of your favorite footballer’s game.

In this blog post, we will see how to combine the amazing power of Amazon Rekognition Video, S3, and Amazon Elastic Transcoder to automatically convert a long video into a highlight video showing all footage of a given person.

The process not only transcodes videos but can also generate thumbnails, add watermarks, and neatly organize everything in S3.

I used it for a specific purpose recently: I attended a wedding, and a relative later shared the wedding video. Obviously, I was not interested in watching all the people I don’t even know. So what did I do? Instead of skipping forward or speeding up the video by trial and error, I built a solution that took the entire wedding video as input and generated a highlight with just me in the frames. (Actually, it took me longer than the former method, but whatever!)

Let’s see how I did what I did for the wedding video:

It was quite a large file: 800+ MB and 6 minutes 36 seconds long, named “highlight.mp4”.

Watch these output videos to see the final outcomes:

Automatically edited video with just my sister in the frames (chopped from 6:36 down to 2:36)

The Process

This is the overall process used to create a highlight video:

  1. Create a “face collection” in Amazon Rekognition, teaching it the faces of the people that should be recognized.
  2. Use Amazon Rekognition Video to search for faces in a saved video file.
  3. Collect the individual timestamps where faces were recognized and convert them into clips of defined duration.
  4. Stitch together a new video using Amazon Elastic Transcoder.

Amazon Elastic Transcoder:

Elastic Transcoder serves as the backbone of our implementation. To get it up and running you need to do two things: define a pipeline and create a job. A pipeline essentially defines a queue for future jobs. To create a pipeline you need to specify the input bucket (where the source videos will be), the output bucket, and a bucket for thumbnails.
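If you prefer to script these two steps rather than click through the console, the same pipeline can be created with boto3. A minimal sketch, assuming placeholder output/thumbnail bucket names and an existing IAM role for Elastic Transcoder:

import boto3

transcoder = boto3.client('elastictranscoder')

# The output/thumbnail bucket names and the IAM role ARN below are placeholders -- use your own.
response = transcoder.create_pipeline(
    Name='wedding-highlights',
    InputBucket='transcoder_rekognition_test',                   # where the source videos live
    Role='arn:aws:iam::123456789012:role/Elastic_Transcoder_Default_Role',
    ContentConfig={'Bucket': 'my-transcoder-output', 'StorageClass': 'Standard'},
    ThumbnailConfig={'Bucket': 'my-transcoder-thumbnails', 'StorageClass': 'Standard'},
)
pipeline_id = response['Pipeline']['Id']
print(pipeline_id)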

Having created a pipeline, you can immediately create a job and kick off a transcoding task. The job involves the following:

  • The name of the file
  • Selecting one or more of the transcoding presets
  • Setting up a playlist, metadata, or overriding input parameters such as the frame rate or aspect ratio.

The above is a manual process: for each video, you need to create a job by hand.

Elastic Transcoder Dashboard
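The same job can also be created through the API, which is what we will automate later using the face-search results. A minimal boto3 sketch; the pipeline ID, output key, and preset ID below are placeholders:

import boto3

transcoder = boto3.client('elastictranscoder')

# Placeholder pipeline and preset IDs -- take yours from the Elastic Transcoder console.
job = transcoder.create_job(
    PipelineId='1111111111111-abcde1',
    Input={'Key': 'highlight.mp4'},                  # relative to the pipeline's input bucket
    Outputs=[{
        'Key': 'highlight-720p.mp4',
        'PresetId': '1351620000001-000010',          # a system preset (Generic 720p); verify in the console
        'ThumbnailPattern': 'thumbs/highlight-{count}',
    }],
)
print(job['Job']['Id'], job['Job']['Status'])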

Each step is explained as follows:

Step 1: Create a face collection

An Amazon Rekognition face collection contains information about the faces you want to recognize in pictures and videos. It can be created with the AWS SDKs or the AWS Command Line Interface (CLI) by calling CreateCollection. I used the create-collection command:

$ aws rekognition create-collection --collection-id family

Individual faces can then be loaded into the collection by using the index-faces command and either passing the image directly or by referencing an object saved in Amazon S3:

$ aws rekognition index-faces \
--image '{"S3Object":{"Bucket":"transcoder_rekognition_test","Name":"harsh.jpg"}}' \
--collection-id "family" \
--max-faces 1 \
--quality-filter "AUTO" \
--detection-attributes "ALL" \
--external-image-id "harsh"

The ExternalImageId can be used to attach a friendly name to the face. This ID is then returned whenever that face is detected in a picture or video.

For my video, I obtained individual pictures of us easily from my iCloud photos and uploaded them to an S3 bucket named “transcoder_rekognition_test”. I then ran the index-faces command to load a face for each person.
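Since there were a few of us, a short loop around the same IndexFaces call does the job. A hedged boto3 sketch; the names and photo keys (other than harsh.jpg) are illustrative:

import boto3

rekognition = boto3.client('rekognition')

# Map of friendly name (ExternalImageId) to the photo already uploaded to S3.
people = {'harsh': 'harsh.jpg', 'dolly': 'dolly.jpg'}   # second entry is illustrative

for name, key in people.items():
    rekognition.index_faces(
        CollectionId='family',
        Image={'S3Object': {'Bucket': 'transcoder_rekognition_test', 'Name': key}},
        ExternalImageId=name,
        MaxFaces=1,
        QualityFilter='AUTO',
    )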

Amazon Rekognition does not actually store the detected faces themselves. Instead, the underlying detection algorithm first detects the faces in the input image and extracts facial features into a feature vector, which is stored in the backend database. Amazon Rekognition uses these feature vectors when performing face match and search operations. The output of the above command looks like the following:

{"FaceRecords": [{"Face": {"FaceId": "77c41d65-d4d1-4673-95b4-363942c20864","BoundingBox": {"Width": 0.12738189101219177,"Height": 0.1708841621875763,"Left": 0.34276264905929565,"Top": 0.2593841850757599},"ImageId": "60c55706-8894-34e5-984f-b2412db5e938","ExternalImageId": "harsh","Confidence": 99.99707794189453},"FaceDetail": {"BoundingBox": {"Width": 0.12738189101219177,"Height": 0.1708841621875763,"Left": 0.34276264905929565,"Top": 0.2593841850757599},"AgeRange": {"Low": 22,"High": 34},"Smile": {"Value": true,"Confidence": 99.17341613769531},"Eyeglasses": {"Value": false,"Confidence": 99.31773376464844},"Sunglasses": {"Value": false,"Confidence": 99.76238250732422},"Gender": {"Value": "Male","Confidence": 99.0803451538086},"Beard": {"Value": true,"Confidence": 93.49483489990234},"Mustache": {"Value": false,"Confidence": 79.91587829589844},"EyesOpen": {"Value": true,"Confidence": 97.44949340820312},"MouthOpen": {"Value": true,"Confidence": 96.32176971435547},"Emotions": [{"Type": "HAPPY","Confidence": 99.8109359741211},{"Type": "SURPRISED","Confidence": 0.1319497972726822},{"Type": "CALM","Confidence": 0.026844089850783348},{"Type": "CONFUSED","Confidence": 0.009155777283012867},{"Type": "ANGRY","Confidence": 0.0073018004186451435},{"Type": "DISGUSTED","Confidence": 0.007178854197263718},{"Type": "FEAR","Confidence": 0.003963107708841562},{"Type": "SAD","Confidence": 0.0026737891603261232}],"Landmarks": [{"Type": "eyeLeft","X": 0.3713069260120392,"Y": 0.32194411754608154},{"Type": "eyeRight","X": 0.42908191680908203,"Y": 0.31527262926101685},{"Type": "mouthLeft","X": 0.3826234042644501,"Y": 0.38063469529151917},{"Type": "mouthRight","X": 0.43083426356315613,"Y": 0.37496763467788696},{"Type": "nose","X": 0.40639248490333557,"Y": 0.34600865840911865},{"Type": "leftEyeBrowLeft","X": 0.3474290370941162,"Y": 0.31207597255706787},{"Type": "leftEyeBrowRight","X": 0.364819198846817,"Y": 0.3008905053138733},{"Type": "leftEyeBrowUp","X": 0.3823558986186981,"Y": 0.30133673548698425},{"Type": "rightEyeBrowLeft","X": 0.4153691828250885,"Y": 0.29755181074142456},{"Type": "rightEyeBrowRight","X": 0.4312523603439331,"Y": 0.29327288269996643},{"Type": "rightEyeBrowUp","X": 0.4475471079349518,"Y": 0.3004980981349945},{"Type": "leftEyeLeft","X": 0.3606565594673157,"Y": 0.3233969509601593},{"Type": "leftEyeRight","X": 0.38274943828582764,"Y": 0.32117486000061035},{"Type": "leftEyeUp","X": 0.3709382116794586,"Y": 0.31873825192451477},{"Type": "leftEyeDown","X": 0.37186604738235474,"Y": 0.3245208263397217},{"Type": "rightEyeLeft","X": 0.4176204800605774,"Y": 0.3171572983264923},{"Type": "rightEyeRight","X": 0.43880751729011536,"Y": 0.31434985995292664},{"Type": "rightEyeUp","X": 0.42889314889907837,"Y": 0.3120414912700653},{"Type": "rightEyeDown","X": 0.4290042817592621,"Y": 0.3178940713405609},{"Type": "noseLeft","X": 0.39452722668647766,"Y": 0.35643094778060913},{"Type": "noseRight","X": 0.41579195857048035,"Y": 0.3539615273475647},{"Type": "mouthUp","X": 0.40696626901626587,"Y": 0.3688865900039673},{"Type": "mouthDown","X": 0.40853944420814514,"Y": 0.3874252140522003},{"Type": "leftPupil","X": 0.3713069260120392,"Y": 0.32194411754608154},{"Type": "rightPupil","X": 0.42908191680908203,"Y": 0.31527262926101685},{"Type": "upperJawlineLeft","X": 0.3331502079963684,"Y": 0.3331255316734314},{"Type": "midJawlineLeft","X": 0.3526504337787628,"Y": 0.39525070786476135},{"Type": "chinBottom","X": 0.41073131561279297,"Y": 0.4200596511363983},{"Type": "midJawlineRight","X": 0.4542304277420044,"Y": 
0.38354888558387756},{"Type": "upperJawlineRight","X": 0.45858287811279297,"Y": 0.3186911940574646}],"Pose": {"Roll": -5.072091102600098,"Yaw": 4.6909284591674805,"Pitch": 11.557291030883789},"Quality": {"Brightness": 90.4398193359375,"Sharpness": 95.51618957519531},"Confidence": 99.99707794189453}}],"FaceModelVersion": "5.0","UnindexedFaces": []}

Step 2: Perform a face search on the video

Amazon Rekognition can recognize faces in pictures, saved videos, and streaming videos. For this project, I wanted to find faces in a video stored in Amazon S3, so I called StartFaceSearch:

$ aws rekognition start-face-search \
--video "S3Object={Bucket=transcoder_rekognition_test,Name=highlight.mp4}" \
--collection-id family

The face search takes several minutes, depending on the length of the video, and returns a job ID.

You can check the status of the job either by setting up an Amazon SNS completion notification (typically consumed through an SQS queue) or simply by polling:

$ aws rekognition get-face-search  --job-id 0f301xxxx37f148df449axxxx30fffbc10b271ac6a6832cbfde249xxxxxxxxx

Output:

{"JobStatus": "IN_PROGRESS","Persons": []}

All faces detected in the video are listed, but whenever Amazon Rekognition Video finds a face that matches one in the face collection, it includes the friendly name of that face. Thus, I merely need to look for any records with an ExternalImageId of ‘harsh’.

A confidence rating (shown above as 99.99%) can also be used to reduce the chance of false matches.
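In code, the filter is only a few lines: keep the timestamps of persons whose face matches carry our ExternalImageId with a high enough Similarity score. A sketch over a single page of GetFaceSearch results; the 90% threshold is my own choice:

import boto3

rekognition = boto3.client('rekognition')

def matched_timestamps(job_id, name, threshold=90):
    # Return the timestamps (in milliseconds) where 'name' was matched above the threshold.
    response = rekognition.get_face_search(JobId=job_id)   # single page, for brevity
    return [person['Timestamp']
            for person in response['Persons']
            for match in person.get('FaceMatches', [])
            if match['Face'].get('ExternalImageId') == name
            and match['Similarity'] > threshold]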

The face search output can be quite large. For my 6-minute 36-second video, it returned 401 timestamps where faces were found, of which only 5 identified “harsh” as being in the frame. (Well, I am just a distant cousin.)

Out of those 401 detections, however, 198 identified ‘dolly’ as being in the frame for more than a second.

Now I feel better: she appears in only about half the frames of her own wedding video as well.

Step 3: Convert timestamps into scenes

The final step is to produce an output video showing just the scenes with me and my cousin. However, the face search output is simply a list of timestamps at which we appeared. How, then, can I produce a new video based on these timestamps?

The answer is to use Amazon Elastic Transcoder, which has the ability to combine multiple clips together to create a single output video. This is known as clip stitching. The clips can come from multiple source videos or, in my case, from multiple parts of the same input video.

To convert the timestamps from Amazon Rekognition Video into inputs for Elastic Transcoder, I wrote a Python script that does the following:

  1. Retrieves the Face Search results by calling GetFaceSearch.
  2. Extracts the timestamps whenever a specified person appears:

[640, 1160, 1640, 2160, 2640, 3160, 3640, 4160, 4160, 4160, 4160, 4160, 4160, 4640, 6160, 6640, 7160, 7640, 8160, 33160, 33640, 34160, 34640, 35160, 35640, 36160, 36640, 37160, 37640, 38160, 64240, 64760, 65240,…….

3. Converts the timestamps into scenes where the person appears, recording the start and end timestamps of each scene:

[(640, 640), (1640, 1640), (2640, 2640), (3640, 3640), (4160, 4160), (6160, 6160), (7160, 7160), (8160, 8160), (33640, 33640), (34640, 34640), (35640, 35640), (36640, 36640), (37640, 37640), (64240, 64240), ...]

4. Converts the scenes into the format required by Elastic Transcoder:

[{'Key': 'highlight.mp4', 'TimeSpan': {'StartTime': '1.28', 'Duration': '0.0'}},
 {'Key': 'highlight.mp4', 'TimeSpan': {'StartTime': '3.28', 'Duration': '0.0'}},
 ...]

The script finishes by calling the Elastic Transcoder create-job command to create the output video featuring only me. A few seconds later, the video appears in my Amazon S3 bucket. I then ran the script again with my cousin’s name to create another video specific to her.
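For completeness, here is a condensed sketch of that script. The pipeline ID, preset ID, and output key are placeholders, and the one-second scene gap and minimum clip length are assumptions of mine; the real script would also need the face-search job to have finished and some error handling:

import boto3

rekognition = boto3.client('rekognition')
transcoder = boto3.client('elastictranscoder')

def person_timestamps(job_id, external_image_id, min_similarity=90):
    # Steps 1-2: page through GetFaceSearch and keep timestamps for one person.
    timestamps, token = [], None
    while True:
        kwargs = {'JobId': job_id, 'MaxResults': 1000}
        if token:
            kwargs['NextToken'] = token
        page = rekognition.get_face_search(**kwargs)
        for person in page['Persons']:
            for match in person.get('FaceMatches', []):
                if (match['Face'].get('ExternalImageId') == external_image_id
                        and match['Similarity'] >= min_similarity):
                    timestamps.append(person['Timestamp'])
        token = page.get('NextToken')
        if not token:
            return sorted(set(timestamps))

def timestamps_to_scenes(timestamps, gap_ms=1000):
    # Step 3: merge timestamps that are close together into (start, end) scenes.
    scenes = []
    for ts in timestamps:
        if scenes and ts - scenes[-1][1] <= gap_ms:
            scenes[-1][1] = ts            # extend the current scene
        else:
            scenes.append([ts, ts])       # start a new scene
    return [tuple(s) for s in scenes]

def scenes_to_inputs(scenes, key='highlight.mp4'):
    # Step 4: convert scenes into the clip-stitching Inputs that Elastic Transcoder expects.
    return [{'Key': key,
             'TimeSpan': {'StartTime': '%.3f' % (start / 1000.0),
                          'Duration': '%.3f' % (max(end - start, 500) / 1000.0)}}
            for start, end in scenes]

timestamps = person_timestamps('0f301...', 'harsh')           # job ID is a placeholder
inputs = scenes_to_inputs(timestamps_to_scenes(timestamps))

transcoder.create_job(
    PipelineId='1111111111111-abcde1',                        # placeholder pipeline ID
    Inputs=inputs,
    Outputs=[{'Key': 'highlight-harsh.mp4',
              'PresetId': '1351620000001-000010'}],           # verify the preset in the console
)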

Have you used Elastic Transcoder for a project or interest of your own? I am keen to hear what you built with it.

