将列表中的列表转换成字典

5

我有一个数据文件,长得像这样:

["Arts & Entertainment", "Arts & Entertainment / Animation & Comics", "Arts & Entertainment / Books & Literature", "Arts & Entertainment / Celebrity/Gossip", "Arts & Entertainment / Fine Art", "Arts & Entertainment / Humor", "Arts & Entertainment / Movies", "Arts & Entertainment / Movies / Action", "Arts & Entertainment / Movies / Comedy", "Arts & Entertainment / Movies / Documentary", "Arts & Entertainment / Movies / Drama", "Arts & Entertainment / Movies / Horror", "Arts & Entertainment / Music", "Arts & Entertainment / Music / Alternative Music", "Arts & Entertainment / Music / Blues", "Arts & Entertainment / Music / Christian Music", "Arts & Entertainment / Music / Classic Rock", "Arts & Entertainment / Music / Classical Music", "Arts & Entertainment / Music / Country Music", "Arts & Entertainment / Music / Electronic Dance Music", "Arts & Entertainment / Music / Heavy Metal", "Arts & Entertainment / Music / Pop Music", "Arts & Entertainment / Music / Rap", "Arts & Entertainment / Radio Stations", "Arts & Entertainment / Television", "Arts & Entertainment / Television / Game Show", "Arts & Entertainment / Television / Kids", "Arts & Entertainment / Television / News", "Arts & Entertainment / Television / Reality", "Arts & Entertainment / Television / Science", "Arts & Entertainment / Television / Sitcom", "Arts & Entertainment / Television / Soap Opera", "Arts & Entertainment / Television / Talk Show", "Autos", "Autos / 4-Wheel Drive/SUVs", "Autos / Buying/Selling Cars", "Autos / Certified Pre-Owned", "Autos / Convertible", "Autos / Coupe", "Autos / Crossover", "Autos / Diesel", "Autos / Electric Vehicles", "Autos / Hatchback", "Autos / Hybrid", "Autos / Luxury", "Autos / Maintenance", "Autos / Maintenance / Parts", "Autos / Maintenance / Repair", "Autos / MiniVan", "Autos / Motorcycles", "Autos / Off-Road Vehicles", "Autos / Road-Side Assistance", "Autos / Sedan", "Autos / Trucks", "Autos / Trucks / Pickup", "Autos / Vintage Cars", "Autos / Wagon", "Business & Industry", "Business & Industry / Advertising", "Business & Industry / Agriculture", "Business & Industry / Biotech/Biomedical", "Business & Industry / Business Software", "Business & Industry / Construction", "Business & Industry / Construction / Composites & Plastics", "Business & Industry / Forestry", "Business & Industry / Government", "Business & Industry / Green Solutions", "Business & Industry / Human Resources", "Business & Industry / Logistics", "Business & Industry / Marketing", "Business & Industry / Metals", "Business & Industry / Non-Profit Organizations", "Business & Industry / Power Industry", "Business & Industry / Public Services", "Business & Industry / Public Services / Emergency Services", "Business & Industry / Public Services / Waste Management", "Business & Industry / Purchasing", "Business & Industry / Retail Industry", "Business & Industry / Small Business", "Business & Industry / Telecom", "Career", "Career / Career Planning", "Career / Job Search", "Career / Job Search / Resume Writing/Advice", "Career / Telecommuting", "Career / U.S. Military", "Education", "Education / Business School", "Education / College Education", "Education / College Education / Admissions", "Education / College Education / College Life", "Education / Continuing Education", "Education / Distance Learning", "Education / Financial Aid", "Education / Financial Aid / Scholarships", "Education / Graduate School", "Education / Homeschooling", "Education / Language Learning", "Education / Language Learning / English as a 2nd Language", "Education / Primary Education", "Education / Secondary Education", "Education / Special Education", "Finance & Money", "Finance & Money / Credit/Debt & Loans", "Finance & Money / Day Trading", "Finance & Money / Exchange Traded Funds", "Finance & Money / Financial News", "Finance & Money / Financial Planning", "Finance & Money / Financial Planning / Retirement Planning", "Finance & Money / Financial Planning / Tax Planning", "Finance & Money / Foreign Exchange Trading", "Finance & Money / Hedge Fund", "Finance & Money / Insurance", "Finance & Money / Investing", "Finance & Money / Mutual Funds", "Finance & Money / Options", "Finance & Money / Stocks", "Food & Drink", "Food & Drink / Barbecues & Grilling", "Food & Drink / Beverages", "Food & Drink / Beverages / Cocktails/Beer", "Food & Drink / Beverages / Coffee/Tea", "Food & Drink / Beverages / Wine", "Food & Drink / Cuisine-Specific", "Food & Drink / Cuisine-Specific / American Cusine", "Food & Drink / Cuisine-Specific / Cajun/Creole", "Food & Drink / Cuisine-Specific / Chinese Cuisine", "Food & Drink / Cuisine-Specific / French Cuisine", "Food & Drink / Cuisine-Specific / Italian Food", "Food & Drink / Cuisine-Specific / Japanese Food", "Food & Drink / Cuisine-Specific / Mexican Cuisine", "Food & Drink / Desserts & Baking", "Food & Drink / Health/LowFat Cooking", "Food & Drink / Organic Food", "Food & Drink / Vegetarian", "Health & Fitness", "Health & Fitness / A.D.D.", "Health & Fitness / AIDS/HIV", "Health & Fitness / Allergies", "Health & Fitness / Alternative Medicine", "Health & Fitness / Alzheimer\\'s Disease", "Health & Fitness / Arthritis", "Health & Fitness / Asthma", "Health & Fitness / Autism/PDD", "Health & Fitness / Bipolar Disorder", "Health & Fitness / Brain Tumor", "Health & Fitness / Cancer", "Health & Fitness / Cancer / Breast Cancer", "Health & Fitness / Cancer / Lung Cancer", "Health & Fitness / Cancer / Prostate Cancer", "Health & Fitness / Cholesterol", "Health & Fitness / Chronic Fatigue Syndrome", "Health & Fitness / Chronic Obstructive Pulmonary Disease", "Health & Fitness / Chronic Pain", "Health & Fitness / Cold & Flu", "Health & Fitness / Deafness", "Health & Fitness / Dental Care", "Health & Fitness / Depression", "Health & Fitness / Dermatology", "Health & Fitness / Diabetes", "Health & Fitness / Epilepsy", "Health & Fitness / Exercise", "Health & Fitness / GERD/Acid Reflux", "Health & Fitness / Headaches/Migraines", "Health & Fitness / Heart Disease", "Health & Fitness / Heart Disease / Women\\'s Heart Disease", "Health & Fitness / Hepatitis", "Health & Fitness / Herbs for Health", "Health & Fitness / Holistic Healing", "Health & Fitness / Hypertension", "Health & Fitness / IBS/Crohn\\'s Disease", "Health & Fitness / Incest/Abuse Support", "Health & Fitness / Incontinence", "Health & Fitness / Infertility", "Health & Fitness / Men\\'s Health", "Health & Fitness / Nursing", "Health & Fitness / Nutrition", "Health & Fitness / Orthopedics", "Health & Fitness / Orthopedics / Sports Medicine", "Health & Fitness / Panic/Anxiety Disorders", "Health & Fitness / Pediatrics", "Health & Fitness / Pharmaceutical", "Health & Fitness / Physical Therapy", "Health & Fitness / Psychology/Psychiatry", "Health & Fitness / Senior Health", "Health & Fitness / Sexuality", "Health & Fitness / Sleep Disorders", "Health & Fitness / Smoking Cessation", "Health & Fitness / Substance Abuse", "Health & Fitness / Substance Abuse / Alcoholism", "Health & Fitness / Thyroid Disease", "Health & Fitness / Weight Loss", "Health & Fitness / Women\\'s Health", "Hobbies & Games", "Hobbies & Games / Arts & Crafts", "Hobbies & Games / Arts & Crafts / Beadwork", "Hobbies & Games / Arts & Crafts / Drawing/Sketching", "Hobbies & Games / Arts & Crafts / Needlework", "Hobbies & Games / Arts & Crafts / Painting", "Hobbies & Games / Arts & Crafts / Photography", "Hobbies & Games / Arts & Crafts / Woodworking", "Hobbies & Games / Astrology", "Hobbies & Games / Birdwatching", "Hobbies & Games / BoardGames/Puzzles", "Hobbies & Games / Candle & Soap Making", "Hobbies & Games / Card Games", "Hobbies & Games / Chess", "Hobbies & Games / Cigars", "Hobbies & Games / Collecting", "Hobbies & Games / Collecting / Antiques", "Hobbies & Games / Collecting / Book Collecting", "Hobbies & Games / Collecting / Miniatures", "Hobbies & Games / Collecting / Stamps & Coins", "Hobbies & Games / Creative Writing", "Hobbies & Games / Getting Published", "Hobbies & Games / Home Recording", "Hobbies & Games / Inventors & Patents", "Hobbies & Games / Learning a Musical Instrument", "Hobbies & Games / Learning a Musical Instrument / Guitar", "Hobbies & Games / Magic & Illusion", "Hobbies & Games / Paranormal Phenomena", "Hobbies & Games / Sci-Fi & Fantasy", "Hobbies & Games / Video Games", "Hobbies & Games / Video Games / Nintendo", "Hobbies & Games / Video Games / PSP", "Hobbies & Games / Video Games / Playstation", "Hobbies & Games / Video Games / RPG", "Hobbies & Games / Video Games / Racing", "Hobbies & Games / Video Games / X-Box", "Home & Garden", "Home & Garden / Appliances", "Home & Garden / Environmental Safety", "Home & Garden / Gardening/Landscaping", "Home & Garden / Home Repair", "Home & Garden / Interior Decorating", "News & Current Affairs", "News & Current Affairs / Law & Politics", "News & Current Affairs / Law & Politics / Immigration", "News & Current Affairs / Law & Politics / Legal Issues", "News & Current Affairs / Law & Politics / U.S. Government Resources", "Parenting & Family", "Parenting & Family / Adoption", "Parenting & Family / Babies & Toddlers", "Parenting & Family / Daycare/Pre-School", "Parenting & Family / Parenting Children", "Parenting & Family / Parenting Teens", "Parenting & Family / Pregnancy", "Parenting & Family / Special Needs Kids", "Pets", "Pets / Aquariums", "Pets / Cats", "Pets / Dogs", "Pets / Veterinary Medicine", "Real Estate", "Real Estate / Apartments", "Real Estate / Architecture", "Real Estate / Buying/Selling Homes", "Religion", "Religion / Alternative Religions", "Religion / Atheism/Agnosticism", "Religion / Buddhism", "Religion / Catholicism", "Religion / Christianity", "Religion / Hinduism", "Religion / Islam", "Religion / Judaism", "Religion / Latter-Day Saints", "Religion / Pagan/Wiccan", "Science", "Science / Astronomy", "Science / Biology", "Science / Chemistry", "Science / Geology", "Science / Physics", "Sensitive Content", "Sensitive Content / Gambling", "Sensitive Content / Gambling / Sports Gambling", "Society", "Society / Dating", "Society / Divorce", "Society / Gay Life", "Society / Marriage", "Society / Senior Living", "Society / Weddings", "Sports & Recreation", "Sports & Recreation / Auto Racing", "Sports & Recreation / Auto Racing / NASCAR Racing", "Sports & Recreation / Baseball", "Sports & Recreation / Basketball", "Sports & Recreation / Bicycling", "Sports & Recreation / Bicycling / Mountain Biking", "Sports & Recreation / Bodybuilding", "Sports & Recreation / Boxing", "Sports & Recreation / Canoeing/Kayaking", "Sports & Recreation / Cheerleading", "Sports & Recreation / Climbing", "Sports & Recreation / College Sports", "Sports & Recreation / Cricket", "Sports & Recreation / Figure Skating", "Sports & Recreation / Fishing", "Sports & Recreation / Fishing / Fly Fishing", "Sports & Recreation / Fishing / Freshwater Fishing", "Sports & Recreation / Fishing / Game & Fish", "Sports & Recreation / Fishing / Saltwater Fishing", "Sports & Recreation / Football", "Sports & Recreation / Golf", "Sports & Recreation / Horses", "Sports & Recreation / Horses / Horse Racing", "Sports & Recreation / Hunting/Shooting", "Sports & Recreation / Ice Hockey", "Sports & Recreation / Inline Skating", "Sports & Recreation / Martial Arts", "Sports & Recreation / Olympics", "Sports & Recreation / Paintball", "Sports & Recreation / Rodeo", "Sports & Recreation / Rugby", "Sports & Recreation / Running/Walking", "Sports & Recreation / Sailing", "Sports & Recreation / Scuba Diving", "Sports & Recreation / Skateboarding", "Sports & Recreation / Skiing", "Sports & Recreation / Snowboarding", "Sports & Recreation / Soccer", "Sports & Recreation / Surfing/Bodyboarding", "Sports & Recreation / Swimming", "Sports & Recreation / Table Tennis/Ping-Pong", "Sports & Recreation / Tennis", "Sports & Recreation / Volleyball", "Sports & Recreation / Waterski/Wakeboard", "Sports & Recreation / Yachting", "Style & Fashion", "Style & Fashion / Body Art", "Style & Fashion / Cosmetics", "Style & Fashion / Fashion", "Style & Fashion / Jewelry", "Technology & Computing", "Technology & Computing / Cameras & Camcorders", "Technology & Computing / Cell Phones", "Technology & Computing / Computer Certification", "Technology & Computing / Computer Networking", "Technology & Computing / Computer Peripherals", "Technology & Computing / Computer Security", "Technology & Computing / Computer Security / Antivirus Software", "Technology & Computing / Computer Security / Network Security", "Technology & Computing / Databases", "Technology & Computing / Graphics", "Technology & Computing / Graphics / 3-D Graphics", "Technology & Computing / Graphics / Animation", "Technology & Computing / Graphics / Desktop Publishing", "Technology & Computing / Graphics / Desktop Video", "Technology & Computing / Graphics / Web Design/HTML", "Technology & Computing / Home Theater Systems", "Technology & Computing / Operating Systems", "Technology & Computing / Operating Systems / Linux", "Technology & Computing / Operating Systems / Mac OS", "Technology & Computing / Operating Systems / Unix", "Technology & Computing / Operating Systems / Windows", "Technology & Computing / Portable Device", "Technology & Computing / Programming", "Technology & Computing / Programming / C/C++", "Technology & Computing / Programming / Java", "Technology & Computing / Programming / JavaScript", "Technology & Computing / Programming / Visual Basic", "Travel", "Travel / Adventure Travel", "Travel / Africa", "Travel / Air Travel", "Travel / Asia", "Travel / Asia / Japan", "Travel / Australia & New Zealand", "Travel / Bed & Breakfasts", "Travel / Budget Travel", "Travel / Business Travel", "Travel / Camping", "Travel / Canada", "Travel / Caribbean", "Travel / Cruises", "Travel / Europe", "Travel / Europe / Eastern Europe", "Travel / Europe / France", "Travel / Europe / Greece", "Travel / Europe / Italy", "Travel / Europe / United Kingdom", "Travel / Honeymoons/Getaways", "Travel / Hotels", "Travel / Mexico & Central America", "Travel / National Parks", "Travel / South America", "Travel / Spas", "Travel / Theme Parks", "Travel / United States", "Travel / United States / California", "Travel / United States / Florida", "Travel / United States / Hawaii", "Travel / United States / Las Vegas, Nevada", "Travel / United States / Manhattan, New York", "Travel / United States / New England", "Travel / United States / Texas", "Travel / Weather"]

我清理了数据文件并对其进行了拆分,使其看起来像这样,
['Arts & Entertainment']
['Arts & Entertainment', 'Animation & Comics']
['Arts & Entertainment', 'Books & Literature']
['Arts & Entertainment', 'Celebrity Gossip']
['Arts & Entertainment', 'Fine Art']
['Arts & Entertainment', 'Humor']
['Arts & Entertainment', 'Movies']
['Arts & Entertainment', 'Movies', 'Action']
['Arts & Entertainment', 'Movies', 'Comedy']
['Arts & Entertainment', 'Movies', 'Documentary']
['Arts & Entertainment', 'Movies', 'Drama']
['Arts & Entertainment', 'Movies', 'Horror']
['Arts & Entertainment', 'Music']
['Arts & Entertainment', 'Music', 'Alternative Music']
['Arts & Entertainment', 'Music', 'Blues']
['Arts & Entertainment', 'Music', 'Christian Music']
['Arts & Entertainment', 'Music', 'Classic Rock']
['Arts & Entertainment', 'Music', 'Classical Music']
['Arts & Entertainment', 'Music', 'Country Music']
['Arts & Entertainment', 'Music', 'Electronic Dance Music']
['Arts & Entertainment', 'Music', 'Heavy Metal']
['Arts & Entertainment', 'Music', 'Pop Music']
['Arts & Entertainment', 'Music', 'Rap']
['Arts & Entertainment', 'Radio Stations']
['Arts & Entertainment', 'Television']
['Arts & Entertainment', 'Television', 'Game Show']
['Arts & Entertainment', 'Television', 'Kids']
['Arts & Entertainment', 'Television', 'News']
['Arts & Entertainment', 'Television', 'Reality']
['Arts & Entertainment', 'Television', 'Science']
['Arts & Entertainment', 'Television', 'Sitcom']
['Arts & Entertainment', 'Television', 'Soap Opera']
['Arts & Entertainment', 'Television', 'Talk Show']...

现在,我正在尝试将列表对象转换为以下形式的字典,请参见:
{
    "Arts & Entertainment": {
        "Animation & Comics": {}, 
        "Books & Literature": {}, 
        "Celebrity Gossip": {}, 
        "Fine Art": {}, 
        "Humor": {}, 
        "Movies": {
            "Horror": {},
            "Action": {},
            "Comedy": {}, ...
        }, ...
}

问题在于我无法弄清如何不覆盖我的子类别,在上面的示例中,电影子键有三个与之相关的类别,然而当我运行我的代码时,它只有一个"Horror"键,并且这是因为恐怖片是该类别中最后一个元素的最后一个列表中的最后一个元素。下面是我得到的示例:
{
    "Arts & Entertainment": {
        "Animation & Comics": {}, 
        "Books & Literature": {}, 
        "Celebrity Gossip": {}, 
        "Fine Art": {}, 
        "Humor": {}, 
        "Movies": {
            "Horror": {} # notice there are no other categories in the movies section
        }, ...
}

我尝试过的代码:

def cleanup_contextweb():
  contextweb_file_path = directory_path + raw_file_names[1]
  tree = {}
  with open(contextweb_file_path, 'r') as contextweb_file:
    cats = contextweb_file.read().replace('Manhattan, New York', 'Manhattan New York').replace('Las Vegas, Nevada', 'Las Vegas Nevada').replace('Celebrity/Gossip', 'Celebrity Gossip').replace('Atheism/Agnosticism', 'Atheism Agnosticism').replace('Pagan/Wiccan', 'Pagan Wiccan').split(',')
    #cats = re.sub(r'"|\[|\]', '', cats)
    cats = [map(str.strip, re.sub(r'"|\[|\]', '', cat).split('/')) for cat in cats]
    cats = sorted(cats)
    for cat in cats:
      if len(cat) == 1:
        tree[cat[0]] = {}
      elif len(cat) == 2:
        tree[cat[0]][cat[1]] = {}
      elif len(cat) == 3:
        tree[cat[0]][cat[1]] = {}
        tree[cat[0]][cat[1]][cat[2]] = {}
      elif len(cat) == 4:
        tree[cat[0]][cat[1]] = {}
        tree[cat[0]][cat[1]][cat[2]] = {}
        tree[cat[0]][cat[1]][cat[2]][cat[3]] = {}
  with open(directory_path + 'cleaned_' + raw_file_names[1], 'w') as contextweb_file_out:
    json.dump(tree, contextweb_file_out, sort_keys=True, indent=4)

  return json.dumps(tree, sort_keys=True, indent=4)

你会看到,我正在尝试构建字典,根据传入列表的长度,我知道有多深(需要多少个键)。我尝试过其他方法,但都被删除了,包括对子列表(cats)按子列表长度排序并反转,以便首先迭代所有具有4个元素的列表。我认为我可以通过这种方式构建键,因为键将存在于较低级别。但那并没有真正帮助。


你尝试过递归吗?取出所有具有相同第一个元素的元素。去掉第一个元素,使用缩短后的列表调用自身。这将返回这些元素的字典。当只剩下一个元素时,将其作为空字典的键返回。这个实现清晰明了吗? - Prune
您IP地址为143.198.54.68,由于运营成本限制,当前对于免费用户的使用频率限制为每个IP每72小时10次对话,如需解除限制,请点击左下角设置图标按钮(手机用户先点击左上角菜单按钮)。 - reticentroot
@Prune 我还没有尝试过递归,主要是因为我在实现递归方法时遇到了问题,但我会尝试一下。 - reticentroot
还要看一下字典的.setdefault()方法。它可能会修复你的逻辑并简化你的代码,特别是与递归结合使用时。 - Roman Susi
这是因为在设置新键之前,您仍然将每个父字典替换为一个空字典。我想您这样做是为了确保防止调用尚未设置的字典。但是,如果您的数据文件正好如您所示,那么父字典应始终在其任何子级之前具有一行,因此您可以将其删除。 - R Nar
显示剩余3条评论
2个回答

6

实际上,for循环也可以产生非常好的解决方案:

>>> data
[['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 's', 'd'], ['a', 'b', 'c', 'd', 'e']]
>>> tree = {}
>>> for cats in data:
...      curtree = tree
...      for c in cats:
...          curtree = curtree.setdefault(c, {})
... 
>>> tree
{'a': {'s': {'d': {}}, 'b': {'c': {'d': {'e': {}}}}}}
.setdefault()方法确保仅在关键字(类别)之前不存在时才添加子字典。 curtree从基本字典tree开始遍历/构建树状结构,使用类别。

我也很喜欢你的解决方案,但我选择了另一个因为它让我练习递归,这是我的弱点。 - reticentroot
谢谢。我投了一票:当循环解决方案与递归解决方案一样简单明了时,最好使用循环。 - Prune

5

以下是使用递归的代码示例:

data = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Animation & Comics'],
    ...,      # full data list elided for readability
    ['Arts & Entertainment', 'Television', 'Talk Show']
]

def classify(in_list):
    sub_dict = {}

    label_set = set([category[0] for category in in_list])
    for label in label_set:
        # print label
        sub_category = [sub[1:] for sub in in_list if sub[0] == label and len(sub) > 1]
        # print sub_category
        sub_dict[label] = classify(sub_category)

    return sub_dict


print classify(data)

输出结果(未格式化以便阅读):

{'Arts & Entertainment': {'Celebrity Gossip': {}, 'Humor': {}, 'Television': {'Game Show': {}, 'Kids': {}, 'Science': {}, 'Talk Show': {}, 'Sitcom': {}, 'Reality': {}, 'Soap Opera': {}, 'News': {}}, 'Animation & Comics': {}, 'Movies': {'Action': {}, 'Drama': {}, 'Horror': {}, 'Comedy': {}, 'Documentary': {}}, 'Radio Stations': {}, 'Music': {'Alternative Music': {}, 'Christian Music': {}, 'Electronic Dance Music': {}, 'Pop Music': {}, 'Country Music': {}, 'Classical Music': {}, 'Rap': {}, 'Heavy Metal': {}, 'Blues': {}, 'Classic Rock': {}}, 'Fine Art': {}, 'Books & Literature': {}}}

我希望这对你来说足够简单易懂。我保留了我的跟踪打印语句,以防它们有助于你理解逻辑。 - Prune
非常感谢,这份资料很易读,因此也很容易学习,谢谢! - reticentroot

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接