DS202 Data Science


🗄️ The Data

For this assignment, we will use data from Reddit, a social media platform that resembles a vast forum. Reddit contains various communities, called subreddits, where users engage in discussions and share content, often anonymously. Reddit uses upvotes and downvotes to rank content. This mechanism shapes the visibility of posts and comments, making it a crucial part of the platform’s culture.

Selected Rankings

In this task, we will focus on data from two Reddit rankings:

  • The top ranking features the most upvoted posts.
  • The controversial ranking highlights posts that attracted a substantial mix of both upvotes and downvotes.

The data provided contains the top 1000 posts from the top ranking and the top 1000 posts from the controversial ranking, both covering the past year (mid-March 2023 to mid-March 2024).

Data Overview

Below is a brief description of the data you will be working with. The data is separated into two files: one containing the posts and another containing the comments.

Note: We do not recommend storing the data in your GitHub repository, as it may be too large for version control. Consider using a .gitignore file to exclude the data from your repository.

Reddit Posts

This CSV file contains the top 1000 posts from the top ranking and the top 1000 posts from the controversial ranking over the past year (mid-March 2023 to mid-March 2024). The file contains columns such as:

  • ranking_type: The ranking the post comes from (top or controversial).
  • post_id: The unique identifier of the post.
  • title: The title of the post.
  • permalink: The URL of the post.
  • post_hint: The type of content the post contains (e.g., image, link, self).
  • url: The URL of the content the post contains.
  • created_utc: The post’s creation time (in Unix time).
  • selftext: The text of the post (if any).
  • ups: The number of upvotes the post received.
  • upvote_ratio: The proportion of upvotes among all votes (upvotes plus downvotes) the post received.
  • score: The post’s score (upvotes minus downvotes).
  • subreddit: The subreddit from which the post comes.
  • subreddit_subscribers: The number of subscribers to the subreddit from which the post comes.
  • over_18: Whether the post is marked as NSFW (Not Safe For Work).
  • num_comments: The number of comments the post received.
  • is_original_content: Whether the post is original content.
  • author: The username of the author of the post.
  • edited: Whether the post was edited after being created.
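
As a minimal sketch of how a few of these columns might be read and prepared in base R, the example below parses two made-up rows (not real data) from an inline string and converts created_utc from Unix time to a date-time. With the real file you would pass its path to read.csv() instead; the file name is not given in this brief, so none is assumed here. The column created_at is a name introduced only for this illustration.

```r
# Toy rows standing in for the posts file; all values are invented.
csv_text <- "ranking_type,post_id,title,created_utc,ups,upvote_ratio
top,p1,First post,1680000000,52000,0.95
controversial,p2,Second post,1700000000,300,0.51"

posts <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# created_utc is Unix time: seconds elapsed since 1970-01-01 00:00:00 UTC.
posts$created_at <- as.POSIXct(posts$created_utc,
                               origin = "1970-01-01", tz = "UTC")

# Store the ranking as a two-level factor, which most modelling
# functions expect for a categorical outcome.
posts$ranking_type <- factor(posts$ranking_type,
                             levels = c("controversial", "top"))

str(posts)
```

Keeping the time zone fixed at UTC avoids the converted timestamps shifting with the locale of the machine that runs the code.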

Reddit Comments

This CSV file contains the top-level comments (if any) found on the posts in the file above. The columns in this file are:

  • post_id: The unique identifier of the post.
  • id: The unique identifier of the comment.
  • permalink: The URL of the comment.
  • author: The username of the author of the comment.
  • created_utc: When the comment was created (in Unix time).
  • body: The text of the comment.
  • edited: Whether the comment was edited after being created.
  • gilded: Whether the comment was gilded (i.e., received a reward from another user).
  • ups: The number of upvotes the comment received.
  • num_reports: The number of times other users reported the comment to the moderators.
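
The two files share the post_id column, so comments can be summarised per post and attached to the posts table. The sketch below shows one way to do this in base R on invented toy data; the summary columns n_top_level and mean_comment_ups are hypothetical names chosen for illustration, not columns in the provided files.

```r
# Toy stand-ins for the two files; all values are invented.
posts <- data.frame(
  post_id      = c("p1", "p2", "p3"),
  ranking_type = c("top", "controversial", "top"),
  stringsAsFactors = FALSE
)
comments <- data.frame(
  post_id = c("p1", "p1", "p2"),
  id      = c("c1", "c2", "c3"),
  ups     = c(10, 4, 1),
  stringsAsFactors = FALSE
)

# Summarise the comments of each post: how many there are and their mean upvotes.
n_comments <- tapply(comments$ups, comments$post_id, length)
mean_ups   <- tapply(comments$ups, comments$post_id, mean)
per_post <- data.frame(
  post_id          = names(n_comments),
  n_top_level      = as.integer(n_comments),
  mean_comment_ups = as.numeric(mean_ups),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps posts that have no top-level comments.
posts_with_comments <- merge(posts, per_post, by = "post_id", all.x = TRUE)

# A post with no comments has a count of zero; its mean stays NA.
no_comments <- is.na(posts_with_comments$n_top_level)
posts_with_comments$n_top_level[no_comments] <- 0L

print(posts_with_comments)
```

A left join is used because the comments file only has rows for posts that received top-level comments, so an inner join would silently drop the other posts.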

📋 Your Tasks

What do we need from you?

Context

While we provide data, we will not specify the insights we seek in some questions. Instead, we will task you with proposing your approach to the data. This mirrors real-world scenarios in data science and academic research, where you are often given a dataset and asked to derive insights or address a problem.

💡 Remember: if you decide to write R code, please ensure your code confirms, reinforces, or complements your answers and that it aligns with the style of code we practiced throughout the course. Adding code just for the sake of it will not help you get a higher grade.

Part 1: Supervised Learning (30 marks)

Suppose we want to create a model that, given a post, can predict whether it belongs to the top or controversial ranking based on its content and the comments it received, irrespective of when it was posted.

  • How would you create the dataset for this task?
  • Which technique(s) from the course would you use to address this research question?
  • And how would you interpret the results?
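
To make the shape of such a workflow concrete, here is a minimal sketch on simulated data. It is not a model answer. It assumes logistic regression is among the techniques covered in the course, and every feature name (title_n_chars, n_top_level, mean_comment_ups) is hypothetical. The timestamp is deliberately left out of the features, since the prediction should not depend on when the post was made.

```r
set.seed(202)
n <- 200

# Simulated per-post feature table; real features would be derived from
# the post text and its comments.
toy <- data.frame(
  title_n_chars    = rpois(n, 60),
  n_top_level      = rpois(n, 20),
  mean_comment_ups = rexp(n, 1 / 50)
)

# Simulated label (1 = top, 0 = controversial), loosely tied to one
# feature so that the model has some signal to find.
toy$is_top <- rbinom(n, 1, plogis(-1 + 0.02 * toy$mean_comment_ups))

# Hold out 25% of the posts for evaluation.
test_idx <- sample(n, size = n / 4)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- glm(is_top ~ title_n_chars + n_top_level + mean_comment_ups,
           data = train, family = binomial())

# Predicted probabilities on held-out posts, thresholded at 0.5.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

conf_mat <- table(actual = test$is_top, predicted = pred)
accuracy <- mean(pred == test$is_top)
print(conf_mat)
print(accuracy)

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
print(exp(coef(fit)))
```

Evaluating on held-out posts rather than on the training set is what makes the reported accuracy a fair estimate, and the odds ratios are one possible starting point for interpretation.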

Part 2: Similarity (30 marks)

Suppose we want to calculate the similarity between the posts we have in our dataset based on the combination of the content of the posts and the comments they received.

How would you approach this task?
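
One common way to compare texts, offered here only as an illustrative sketch, is cosine similarity between term-count vectors. The example assumes each post has already been reduced to a single string combining its title, selftext and top-level comments; the three strings below are invented, and raw counts are used for simplicity where a weighting scheme such as TF-IDF might be preferred.

```r
# Invented texts: each element stands for one post's combined text.
docs <- c(
  p1 = "cats are great pets cats sleep a lot",
  p2 = "dogs are great pets dogs play a lot",
  p3 = "the election results were announced today"
)

# Tokenise: lower-case, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix of raw counts (rows = posts, columns = terms).
counts <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
)
dtm <- t(counts)
dimnames(dtm) <- list(names(docs), vocab)

# Cosine similarity: dot product of two rows divided by the product
# of their Euclidean norms.
norms   <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)

print(round(cos_sim, 2))
```

Cosine similarity ignores document length, which matters here because posts with many comments would otherwise look dissimilar to short posts on the same topic.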

Part 3: Unsupervised Learning (40 marks)

Now, propose one compelling research question that a social scientist could investigate with this dataset using the unsupervised learning methods covered in this course, and explain how you would answer it.
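
Purely to illustrate the mechanics of an unsupervised analysis, the sketch below clusters simulated subreddit-level features with k-means. It assumes k-means is among the methods covered in the course. The features (log_subscribers, share_controversial) and the implied question of whether subreddits group by size and by share of controversial posts are hypothetical examples, not a suggested answer.

```r
set.seed(202)

# Simulated per-subreddit features (invented): two loosely separated groups.
features <- rbind(
  cbind(log_subscribers     = rnorm(30, mean = 12, sd = 0.5),
        share_controversial = rnorm(30, mean = 0.2, sd = 0.05)),
  cbind(log_subscribers     = rnorm(30, mean = 15, sd = 0.5),
        share_controversial = rnorm(30, mean = 0.7, sd = 0.05))
)

# Standardise so that both features contribute on a comparable scale.
scaled <- scale(features)

# k-means with k = 2; nstart repeats the random initialisation and
# keeps the best solution.
km <- kmeans(scaled, centers = 2, nstart = 20)

# Cluster sizes, then feature means per cluster on the original scale.
print(table(km$cluster))
print(aggregate(as.data.frame(features),
                by = list(cluster = km$cluster), FUN = mean))
```

In a real analysis the number of clusters would need to be justified rather than fixed in advance, for example by comparing solutions across several values of k.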